Steve Burridge Posted April 7, 2020

Hi, my apologies if this type of issue is already covered by another thread. This is a new one on me and I am struggling to understand what kind of problem I am looking at, so it is best to begin with a screenshot, then the events.

In short, I previously had a 750GB drive as disk 10 in my array. As I have done many times, I replaced the drive, this time with a 3TB drive, which rebuilt successfully. I had carried out the rebuild without the shares mounted, so I stopped the array and restarted it to mount them... and my typically rock-solid 24x7 system crashed. Any subsequent attempt to start the array and shares resulted in a crash; starting just the array without the shares caused no crash. A look at disk 10 showed that while the drive had no data errors, it did have a large number of SMART errors.

So... I once again replaced disk 10 with a new drive, tested error-free, this time a 4TB model. The rebuild completed successfully with zero errors; I stopped the array, restarted, and once again a crash. Upon the reboot I am now faced with the oddity shown in the attached screenshot: disk 10 is in error with status "Wrong", displaying a split line in the identification, with one 3TB drive and one 4TB drive shown under the same drive ID.

Has anyone seen this, and does anyone know the cause or, more importantly, the resolution? I can only hope that resolving this issue with disk 10 resolves the crash issue; either way, this needs sorting before I look any broader.

Regards, Steve
JorgeB Posted April 7, 2020

Unassign disk10, start the array, and when it crashes see if you can download the diagnostics from the console by typing "diagnostics", then attach them here.
Steve Burridge Posted April 7, 2020

Unassigned disk10, started the array, and it crashed as predicted. Diags attached.

tower-diagnostics-20200407-1817.zip
JorgeB Posted April 7, 2020

The diags are from before the array start (or after a reboot).
Steve Burridge Posted April 7, 2020

After the crash (which reboots the box), before array start.
JorgeB Posted April 7, 2020

1 minute ago, Steve Burridge said: "(which reboots the box)"

OK, then the diags don't help. Try this: https://forums.unraid.net/topic/46802-faq-for-unraid-v6/?do=findComment&comment=781601
Steve Burridge Posted April 7, 2020

I should have thought of redirecting the logs from tmp. I have set up mirroring to the flash drive as the quickest option. I shall reboot, start the array (invoking the crash), then pull the mirrored log from the flash... trusting that works. Watch this space.
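For anyone following along later, the gist of the persistent-log trick is simply keeping a live copy of the syslog somewhere that survives a reset. A minimal sketch of the idea (the paths are assumptions for a typical Unraid system where /boot is the flash mount; the FAQ linked above is the authoritative version):

```shell
# Sketch only: mirror the in-RAM syslog to the flash drive so the tail of the
# log survives a hard crash/reboot. /boot is assumed to be the flash mount.
mkdir -p /boot/logs
tail -f /var/log/syslog >> /boot/logs/syslog-mirror.txt &
```

Continuous writes do wear the flash, so this is best left running only while reproducing the fault.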
Steve Burridge Posted April 7, 2020

Does anything personal need filtering from the syslog? I'm not seeing anything at a glance.
Steve Burridge Posted April 7, 2020

Here is the log. I didn't see anything personal.

syslog
JorgeB Posted April 7, 2020

Although there are no errors logged, it is crashing while mounting disk10:

Apr 7 18:46:13 Tower kernel: REISERFS (device md10): checking transaction log (md10)
Apr 7 18:48:51 Tower kernel: microcode: microcode updated early to revision 0x2f, date = 2019-02-17

Check the filesystem on disk10: https://wiki.unraid.net/Check_Disk_Filesystems#Checking_and_fixing_drives_in_the_webGui

P.S. ReiserFS has not been recommended for a long time.
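For reference, what the linked webGui procedure runs for a ReiserFS array disk is roughly the following. This is a sketch only, assuming the array has been started in maintenance mode so /dev/md10 exists but nothing is mounted; the webGui route described on the wiki page is the safer one:

```shell
# Read-only pass first; this reports problems without touching the disk.
reiserfsck --check /dev/md10

# Only if --check explicitly recommends it, follow up with:
#   reiserfsck --fix-fixable /dev/md10
# and only if the output instructs it, --rebuild-tree (risky; back up first).
```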
Steve Burridge Posted April 7, 2020

It is crashing with disk10 assigned or unassigned. Would that indicate the filesystem corruption is being emulated? And indeed, could that lead to a crash? I'm slowly working through the ReiserFS drives.
JorgeB Posted April 7, 2020

Yes, if there's filesystem corruption it will also be present on the emulated disk.
Steve Burridge Posted April 8, 2020

Just a brief update to assure you that I am still working through this. I completed the scan of disk10 overnight with no issues found, and have consequently begun the rebuild today. To avoid the crash, I am once again carrying out the rebuild with no mounted shares. There are a large number of drives in this box, and a few of them are not running on the fastest of controllers, so the rebuild will probably take 30-odd hours to complete. I appreciate your ongoing help and moral support... stay safe and look after yourself. I will check in with status after the rebuild!
JorgeB Posted April 8, 2020

If it crashes before the rebuild, it will almost certainly crash after the rebuild, though I was expecting reiserfsck to find some corruption.
Steve Burridge Posted April 9, 2020

Hi, checking back in. As we both somewhat expected, it is crashing post-rebuild. What is odd is that:

1. Try to start the array and mount the shares... an immediate crash.
2. Start the array without the shares mounted... no crash; able to scan, repair, rebuild, etc. All looks good in the world.
3. Try to stop the array, or indeed reboot (which should perform a graceful shutdown of the array), and there is an immediate crash.

I've been running this setup for a long time: drive failures, upgrades, etc., all routine... nothing like this. Any ideas?
JorgeB Posted April 9, 2020

And you're sure no filesystem corruption was found on that disk? You can post the reiserfsck output. You can also try mounting that disk manually to confirm it is the problem:

mkdir /temp
mount -o ro /dev/sdX1 /temp

Replace X with the correct letter, and note the 1 after the device, i.e., sdc1. If it works, unmount with:

umount /temp
Steve Burridge Posted April 9, 2020

I just re-ran the scan to be sure... here is the output. Disk10 is obviously md10. I have not tried mounting the individual disk; I will try that. I suspect it will be fine, or indeed may trigger the crash. If it does, I shall try another disk and see if the same behaviour persists. I fear I may be into a process of elimination.

reiserfsck --check started at Thu Apr 9 19:55:46 2020
###########
Replaying journal: Done.
Reiserfs journal '/dev/md10' in blocks [18..8211]: 0 transactions replayed
Checking internal tree.. finished
Comparing bitmaps..finished
Checking Semantic tree:
finished
No corruptions found
There are on the filesystem:
Leaves 214516
Internal nodes 1360
Directories 8381
Other files 22060
Data block pointers 215131773 (8 of them are zero)
Safe links 0
###########
reiserfsck finished at Thu Apr 9 20:44:27 2020
Steve Burridge Posted April 9, 2020

This definitely relates somehow to disk10... am I missing anything obvious?

Apr 7 18:46:13 Tower emhttpd: reiserfs: resizing /mnt/disk9
Apr 7 18:46:13 Tower emhttpd: shcmd (259): mkdir -p /mnt/disk10
Apr 7 18:46:13 Tower emhttpd: shcmd (260): mount -t reiserfs -o noatime,nodiratime /dev/md10 /mnt/disk10
Apr 7 18:46:13 Tower kernel: REISERFS (device md10): found reiserfs format "3.6" with standard journal
Apr 7 18:46:13 Tower kernel: REISERFS (device md10): using ordered data mode
Apr 7 18:46:13 Tower kernel: reiserfs: using flush barriers
Apr 7 18:46:13 Tower kernel: REISERFS (device md10): journal params: device md10, size 8192, journal first block 18, max trans len 1024, max batch 900, max commit age 30, max trans age 30
Apr 7 18:46:13 Tower kernel: REISERFS (device md10): checking transaction log (md10)

--- crash occurs here! ---

Apr 7 18:48:51 Tower kernel: microcode: microcode updated early to revision 0x2f, date = 2019-02-17
Apr 7 18:48:51 Tower kernel: Linux version 4.19.107-Unraid (root@Develop) (gcc version 9.2.0 (GCC)) #1 SMP Thu Mar 5 13:55:57 PST 2020
Apr 7 18:48:51 Tower kernel: Command line: BOOT_IMAGE=/bzimage initrd=/bzroot
Apr 7 18:48:51 Tower kernel: x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
Apr 7 18:48:51 Tower kernel: x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
Apr 7 18:48:51 Tower kernel: x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers'
Apr 7 18:48:51 Tower kernel: x86/fpu: xstate_offset[2]: 576, xstate_sizes[2]: 256
Apr 7 18:48:51 Tower kernel: x86/fpu: Enabled xstate features 0x7, context size is 832 bytes, using 'standard' format.
Apr 7 18:48:51 Tower kernel: BIOS-provided physical RAM map:
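A quick way to spot this crash boundary in a long mirrored log is to search for the kernel boot banner and look at the line just before it: everything earlier was logged pre-crash. A minimal sketch of the idea (the sample file and its two lines here are purely illustrative stand-ins for the real log):

```shell
# Write a tiny illustrative excerpt to a temp file (stand-in for the real log).
cat > /tmp/syslog-sample.txt <<'EOF'
Apr 7 18:46:13 Tower kernel: REISERFS (device md10): checking transaction log (md10)
Apr 7 18:48:51 Tower kernel: microcode: microcode updated early to revision 0x2f, date = 2019-02-17
EOF

# The early-microcode banner is among the first lines of a fresh boot, so the
# one line of context before the first match is the last thing logged pre-crash.
grep -m1 -B1 "microcode updated early" /tmp/syslog-sample.txt | head -n1
# -> prints the REISERFS "checking transaction log" line
```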
JorgeB Posted April 9, 2020

7 minutes ago, Steve Burridge said: "This definitely relates somehow to Disk10..."

Yep, I posted that above a couple of days ago. See if it crashes when mounting manually, or just make it unmountable and start the array; you can do that by changing the filesystem type on that disk to another one. Just don't format it after array start.
Steve Burridge Posted April 10, 2020

A long night of tearing down, building back up, switching cables, drive bay locations, etc. resulted in no variance associated with any of the physical pieces. I then started poring over settings and, for the purposes of eliminating power faults, potential wake-up delays, etc., I disabled spin-down by setting the spin-down delay to never. No crashing: a fully mounted array and shares.

This is not full resolution, and indeed it's odd, given that I run two hefty 1200W server power supplies; but it is the solid progress I have been looking for, and it somewhat locks me into the focused activity I need to do to identify the root cause. I suspect a drive, connector, lead, etc. may have some fault and may be showing high resistance and pulling down the power. Anyhow, I'm pretty confident I should be able to zero in on the issue with some testing over the weekend. Thanks again for your support... I shall feed back what I find.
JorgeB Posted April 10, 2020

56 minutes ago, Steve Burridge said: "thus I disabled spin-down by setting spin-down delay to never. No crashing, fully mounted array and shares."

Seems unlikely that would be the issue.

57 minutes ago, Steve Burridge said: "I suspect a drive, connector, lead etc may have some fault and may be showing high resistance and pulling down power."

This would make more sense, but please keep us updated.
Steve Burridge Posted April 12, 2020 Author Share Posted April 12, 2020 (edited) Yes, I fully agree with each of your statements. I am still working through things trying to fully identify the cause of the issue... Thus far the root cause is proving to be quite evasive... but I'm still plugging away and hope to get there eventually. I shall attempt to cover where I am and maybe where to next and in doing so try to only flag the pieces I have tested myself and can replicate as fact. Everything else is somewhat theory and speculation. Spin-Down Theory: The disabling of spin-down is likely masking a power issue; e.g. possibly associated with one or more rails from power supply not being able to handle the spike/peak load of running up all drives at the same time. If nothing else; the spin-down disable/enable does provide a means of both working around the fault for continued use and recreation of the fault to aid resolution. Fact: Start array with spin-down enabled - Immediate hardware crash/reset with no errors logged. Fact: Start array with spin-down disabled - Array starts with no issues. System remains 100% stable, cannot replicate crash. Fact: Start array with spin-down disabled then enable spin-down with aggressive spin-down values - System remains 100% stable, cannot replicate crash. Fact: Ensure all drives are spun up then start array with spin-down enabled - Immediate hardware crash/reset with no errors logged. Fact: Start array with spin-down disabled then enable spin-down with aggressive spin-down values, allow drives to idle and spin-down en-mass, attempt to quickly wake all of the drives, repeat... - System remains 100% stable, cannot replicate crash. Power Supply Issue, Cabling Issue, Etc... (Lots of Effort, Near to exhausted things to test related to power or cables) Theory: If power or cabling were a fault then a systematic process of elimination should (at some point) resolve the issue. 
Fact: The power draw from this box barely exceeds 500 Watt; even when put under artificial load/stress in my efforts to recreate the crash condition. This box is running with 2x 1200Watt (Highly regarded) Newton power supplies. Fact: Both replacement of power supply's with spare and switching of cables with known good cables from another box has made no difference to when and how the crash occurs. No degree of replacement of power or communications cables results in any changes of outcome to the scenarios listed under section 1. Other Hardware / Controller Etc... (No activity as of yet... However everything is error free and performs 100% once the array is started) Theory: If not power then Issue may relate to DMA or interrupt type overload of a controller... increased noise, cross-talk or suchlike?? To do: This will require a process of elimination. Easiest way (If I can) would be to swap the array and boot drive/configuration onto another box and controllers. Software / Configuration (Ad-hoc activity to date) Theory: Had this system up for many years, through many versions. Anything is possible... right? Fact: Disabling VM support, disabling Docker, removing all addons, etc... - Makes no difference To do: Make a backup of the boot drive and various configuration areas such as system and appdata. Rebuild boot drive from clean image and revert back to plain vanilla as far as possible. I am fortunate that I do have a solid workaround to this issue; regardless of not being able to explain exactly what the root cause is. Disabling spin-down prior to shut--down or start-up of the array, enabling once started, results in the array starting, operating and performing without errors. This may end up being one of those issues that may seemingly just go away at some point and I may never be the wiser as to why it occurred or what change resolved it. Edited April 12, 2020 by Steve Burridge Quote Link to comment
Steve Burridge Posted April 28, 2020 Author Share Posted April 28, 2020 (edited) I thought I would drop in a final post on this issue. The server has been up and running with no issues whatsoever since April 12. Spin-down has been enabled for all of that time (having re-enabled after starting the array) and there have been many periods whereby all drives have been spinning and indeed whereby most drives have been spun-down. I have placed reasonable duty on the server through hosting of a number of docker containers and regular put up and pull down of VM's. I decided to execute a controlled power cycle of the server to see if the issue persisted as before. I can verify that both the crash condition when starting the array with spin-down enabled and the workaround of starting the array with spin-down disabled (and enabling after start) still persist exactly as before. As I have previously stated; this may just be one of those odd issues that is somehow unique to my own system and that I shall never get to the bottom of. In the event of my identifying a root cause or future update removing the issue; I shall feedback on my findings... asides that I will thank you each for your support and advice. Stay safe Edited April 28, 2020 by Steve Burridge Quote Link to comment