Steve Burridge Posted April 7, 2020

Hi, my apologies if this type of issue is already covered by another thread. This is a new one on me and I am struggling to understand what kind of problem I am looking at, so it is best to begin with a screenshot, then the events.

In short, I previously had a 750GB drive as disk 10 in my array. As I have done many times, I replaced the drive, this time with a 3TB drive, which rebuilt successfully. I had carried out the rebuild without the shares mounted, so I stopped the array and restarted it to mount them... and my typically rock-solid 24x7 system crashed. Any subsequent attempt to start the array and shares resulted in a crash; starting just the array without the shares caused no crash. A look at disk 10 showed that while the drive had no data errors, it did have a large number of SMART errors.

So... I once again replaced disk 10 with a new drive, tested error-free, this time a 4TB model. The rebuild completed successfully with zero errors; I stopped the array, restarted, and once again a crash. Upon the reboot I am now faced with the oddity shown in the attached screenshot: disk 10 is in error with status "Wrong", displaying a split line in the identification, with one 3TB drive and one 4TB drive shown under the same drive ID.

Has anyone seen this, and does anyone know the cause or, more importantly, the resolution? I can only hope that resolving this issue with disk 10 resolves the crash issue; either way, this needs sorting before I look any broader.

Regards, Steve
JorgeB Posted April 7, 2020

Unassign disk10, start the array, and when it crashes see if you can download the diagnostics from the console by typing "diagnostics", then attach them here.
Steve Burridge Posted April 7, 2020

Unassigned disk10, started the array, and it crashed as predicted. Diags attached.

tower-diagnostics-20200407-1817.zip
JorgeB Posted April 7, 2020

The diags are from before the array start (or after a reboot).
Steve Burridge Posted April 7, 2020

After the crash (which reboots the box), before array start.
JorgeB Posted April 7, 2020

1 minute ago, Steve Burridge said: "(which reboots the box)"

OK, then the diags don't help. Try this: https://forums.unraid.net/topic/46802-faq-for-unraid-v6/?do=findComment&comment=781601
Steve Burridge Posted April 7, 2020

I should have thought of redirecting the logs from tmp. I have set up mirroring to the flash drive as the quickest option. I shall reboot, start the array (invoking the crash), then pull the mirrored log from the flash... trusting that works. Watch this space.
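For anyone following along later, the gist of the persistent-log trick is simply keeping a live copy of the syslog somewhere that survives a reset. A minimal sketch of the idea (the paths are assumptions for a typical Unraid system where /boot is the flash mount; the FAQ linked above is the authoritative version):

```shell
# Sketch only: mirror the in-RAM syslog to the flash drive so the tail of the
# log survives a hard crash/reboot. /boot is assumed to be the flash mount.
mkdir -p /boot/logs
tail -f /var/log/syslog >> /boot/logs/syslog-mirror.txt &
```

Continuous writes do wear the flash, so this is best left running only while reproducing the fault.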
Steve Burridge Posted April 7, 2020

Does anything personal need filtering from the syslog? I'm not seeing anything at a glance.
Steve Burridge Posted April 7, 2020

Here is the log. I didn't see anything personal.

syslog
JorgeB Posted April 7, 2020

Although there are no errors logged, it is crashing while mounting disk10:

Apr 7 18:46:13 Tower kernel: REISERFS (device md10): checking transaction log (md10)
Apr 7 18:48:51 Tower kernel: microcode: microcode updated early to revision 0x2f, date = 2019-02-17

Check the filesystem on disk10: https://wiki.unraid.net/Check_Disk_Filesystems#Checking_and_fixing_drives_in_the_webGui

P.S. ReiserFS has not been recommended for a long time.
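For reference, what the linked webGui procedure runs for a ReiserFS array disk is roughly the following. This is a sketch only, assuming the array has been started in maintenance mode so /dev/md10 exists but nothing is mounted; the webGui route described on the wiki page is the safer one:

```shell
# Read-only pass first; this reports problems without touching the disk.
reiserfsck --check /dev/md10

# Only if --check explicitly recommends it, follow up with:
#   reiserfsck --fix-fixable /dev/md10
# and only if the output instructs it, --rebuild-tree (risky; back up first).
```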
Steve Burridge Posted April 7, 2020

It is crashing with disk10 assigned or unassigned. Would that indicate the filesystem corruption is being emulated? And indeed, could that lead to a crash? I'm slowly working through the ReiserFS drives.
JorgeB Posted April 7, 2020

Yes, if there's filesystem corruption it will also be present on the emulated disk.
Steve Burridge Posted April 8, 2020

Just a brief update to assure you that I am still working through this. I completed the scan of disk10 overnight with no issues found, and have consequently begun the rebuild today. To avoid the crash, I am once again carrying out the rebuild with no mounted shares. There are a large number of drives in this box, and a few of them are not running on the fastest of controllers, so the rebuild will probably take 30-odd hours to complete. I appreciate your ongoing help and moral support... stay safe and look after yourself. I will check in with status after the rebuild!
JorgeB Posted April 8, 2020

If it crashes before the rebuild, it will almost certainly crash after the rebuild, though I was expecting reiserfsck to find some corruption.
Steve Burridge Posted April 9, 2020

Hi, checking back in. As we both somewhat expected, it is crashing post-rebuild. What is odd is that:

1. Try to start the array and mount the shares... an immediate crash.
2. Start the array without the shares mounted... no crash; able to scan, repair, rebuild, etc. All looks good in the world.
3. Try to stop the array, or indeed reboot (which should perform a graceful shutdown of the array), and there is an immediate crash.

I've been running this setup for a long time: drive failures, upgrades, etc., all routine... nothing like this. Any ideas?
JorgeB Posted April 9, 2020

And you're sure no filesystem corruption was found on that disk? You can post the reiserfsck output. You can also try mounting that disk manually to confirm it is the problem:

mkdir /temp
mount -o ro /dev/sdX1 /temp

Replace X with the correct letter, and note the 1 after the device, i.e., sdc1. If it works, unmount with:

umount /temp
Steve Burridge Posted April 9, 2020

I just re-ran the scan to be sure... here is the output. Disk10 is obviously md10. I have not tried mounting the individual disk; I will try that. I suspect it will be fine, or indeed may trigger the crash. If it does, I shall try another disk and see if the same behaviour persists. I fear I may be into a process of elimination.

reiserfsck --check started at Thu Apr 9 19:55:46 2020
###########
Replaying journal: Done.
Reiserfs journal '/dev/md10' in blocks [18..8211]: 0 transactions replayed
Checking internal tree.. finished
Comparing bitmaps..finished
Checking Semantic tree:
finished
No corruptions found
There are on the filesystem:
Leaves 214516
Internal nodes 1360
Directories 8381
Other files 22060
Data block pointers 215131773 (8 of them are zero)
Safe links 0
###########
reiserfsck finished at Thu Apr 9 20:44:27 2020
Steve Burridge Posted April 9, 2020

This definitely relates somehow to disk10... am I missing anything obvious?

Apr 7 18:46:13 Tower emhttpd: reiserfs: resizing /mnt/disk9
Apr 7 18:46:13 Tower emhttpd: shcmd (259): mkdir -p /mnt/disk10
Apr 7 18:46:13 Tower emhttpd: shcmd (260): mount -t reiserfs -o noatime,nodiratime /dev/md10 /mnt/disk10
Apr 7 18:46:13 Tower kernel: REISERFS (device md10): found reiserfs format "3.6" with standard journal
Apr 7 18:46:13 Tower kernel: REISERFS (device md10): using ordered data mode
Apr 7 18:46:13 Tower kernel: reiserfs: using flush barriers
Apr 7 18:46:13 Tower kernel: REISERFS (device md10): journal params: device md10, size 8192, journal first block 18, max trans len 1024, max batch 900, max commit age 30, max trans age 30
Apr 7 18:46:13 Tower kernel: REISERFS (device md10): checking transaction log (md10)

--- crash occurs here! ---

Apr 7 18:48:51 Tower kernel: microcode: microcode updated early to revision 0x2f, date = 2019-02-17
Apr 7 18:48:51 Tower kernel: Linux version 4.19.107-Unraid (root@Develop) (gcc version 9.2.0 (GCC)) #1 SMP Thu Mar 5 13:55:57 PST 2020
Apr 7 18:48:51 Tower kernel: Command line: BOOT_IMAGE=/bzimage initrd=/bzroot
Apr 7 18:48:51 Tower kernel: x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
Apr 7 18:48:51 Tower kernel: x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
Apr 7 18:48:51 Tower kernel: x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers'
Apr 7 18:48:51 Tower kernel: x86/fpu: xstate_offset[2]: 576, xstate_sizes[2]: 256
Apr 7 18:48:51 Tower kernel: x86/fpu: Enabled xstate features 0x7, context size is 832 bytes, using 'standard' format.
Apr 7 18:48:51 Tower kernel: BIOS-provided physical RAM map:
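A quick way to spot this crash boundary in a long mirrored log is to search for the kernel boot banner and look at the line just before it: everything earlier was logged pre-crash. A minimal sketch of the idea (the sample file and its two lines here are purely illustrative stand-ins for the real log):

```shell
# Write a tiny illustrative excerpt to a temp file (stand-in for the real log).
cat > /tmp/syslog-sample.txt <<'EOF'
Apr 7 18:46:13 Tower kernel: REISERFS (device md10): checking transaction log (md10)
Apr 7 18:48:51 Tower kernel: microcode: microcode updated early to revision 0x2f, date = 2019-02-17
EOF

# The early-microcode banner is among the first lines of a fresh boot, so the
# one line of context before the first match is the last thing logged pre-crash.
grep -m1 -B1 "microcode updated early" /tmp/syslog-sample.txt | head -n1
# -> prints the REISERFS "checking transaction log" line
```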
JorgeB Posted April 9, 2020

7 minutes ago, Steve Burridge said: "This definitely relates somehow to Disk10..."

Yep, I posted that above a couple of days ago. See if it crashes when mounting manually, or just make it unmountable and start the array; you can do that by changing the filesystem type on that disk to another one. Just don't format it after array start.
Steve Burridge Posted April 10, 2020

A long night of tearing down, building back up, switching cables, drive bay locations, etc. resulted in no variance associated with any of the physical pieces. I then started poring over settings and, for the purposes of eliminating power faults, potential wake-up delays, etc., I disabled spin-down by setting the spin-down delay to never. No crashing: a fully mounted array and shares.

This is not full resolution, and indeed it's odd, given that I run two hefty 1200W server power supplies; but it is the solid progress I have been looking for, and it somewhat locks me into the focused activity I need to do to identify the root cause. I suspect a drive, connector, lead, etc. may have some fault and may be showing high resistance and pulling down the power. Anyhow, I'm pretty confident I should be able to zero in on the issue with some testing over the weekend. Thanks again for your support... I shall feed back what I find.
JorgeB Posted April 10, 2020

56 minutes ago, Steve Burridge said: "thus I disabled spin-down by setting spin-down delay to never. No crashing, fully mounted array and shares."

Seems unlikely that would be the issue.

57 minutes ago, Steve Burridge said: "I suspect a drive, connector, lead etc may have some fault and may be showing high resistance and pulling down power."

This would make more sense, but please keep us updated.
Steve Burridge Posted April 12, 2020 Author Share Posted April 12, 2020 (edited) Yes, I fully agree with each of your statements. I am still working through things trying to fully identify the cause of the issue... Thus far the root cause is proving to be quite evasive... but I'm still plugging away and hope to get there eventually. I shall attempt to cover where I am and maybe where to next and in doing so try to only flag the pieces I have tested myself and can replicate as fact. Everything else is somewhat theory and speculation. Spin-Down Theory: The disabling of spin-down is likely masking a power issue; e.g. possibly associated with one or more rails from power supply not being able to handle the spike/peak load of running up all drives at the same time. If nothing else; the spin-down disable/enable does provide a means of both working around the fault for continued use and recreation of the fault to aid resolution. Fact: Start array with spin-down enabled - Immediate hardware crash/reset with no errors logged. Fact: Start array with spin-down disabled - Array starts with no issues. System remains 100% stable, cannot replicate crash. Fact: Start array with spin-down disabled then enable spin-down with aggressive spin-down values - System remains 100% stable, cannot replicate crash. Fact: Ensure all drives are spun up then start array with spin-down enabled - Immediate hardware crash/reset with no errors logged. Fact: Start array with spin-down disabled then enable spin-down with aggressive spin-down values, allow drives to idle and spin-down en-mass, attempt to quickly wake all of the drives, repeat... - System remains 100% stable, cannot replicate crash. Power Supply Issue, Cabling Issue, Etc... (Lots of Effort, Near to exhausted things to test related to power or cables) Theory: If power or cabling were a fault then a systematic process of elimination should (at some point) resolve the issue. 
Fact: The power draw from this box barely exceeds 500 Watt; even when put under artificial load/stress in my efforts to recreate the crash condition. This box is running with 2x 1200Watt (Highly regarded) Newton power supplies. Fact: Both replacement of power supply's with spare and switching of cables with known good cables from another box has made no difference to when and how the crash occurs. No degree of replacement of power or communications cables results in any changes of outcome to the scenarios listed under section 1. Other Hardware / Controller Etc... (No activity as of yet... However everything is error free and performs 100% once the array is started) Theory: If not power then Issue may relate to DMA or interrupt type overload of a controller... increased noise, cross-talk or suchlike?? To do: This will require a process of elimination. Easiest way (If I can) would be to swap the array and boot drive/configuration onto another box and controllers. Software / Configuration (Ad-hoc activity to date) Theory: Had this system up for many years, through many versions. Anything is possible... right? Fact: Disabling VM support, disabling Docker, removing all addons, etc... - Makes no difference To do: Make a backup of the boot drive and various configuration areas such as system and appdata. Rebuild boot drive from clean image and revert back to plain vanilla as far as possible. I am fortunate that I do have a solid workaround to this issue; regardless of not being able to explain exactly what the root cause is. Disabling spin-down prior to shut--down or start-up of the array, enabling once started, results in the array starting, operating and performing without errors. This may end up being one of those issues that may seemingly just go away at some point and I may never be the wiser as to why it occurred or what change resolved it. Edited April 12, 2020 by Steve Burridge Quote Link to comment
Steve Burridge Posted April 28, 2020 Author Share Posted April 28, 2020 (edited) I thought I would drop in a final post on this issue. The server has been up and running with no issues whatsoever since April 12. Spin-down has been enabled for all of that time (having re-enabled after starting the array) and there have been many periods whereby all drives have been spinning and indeed whereby most drives have been spun-down. I have placed reasonable duty on the server through hosting of a number of docker containers and regular put up and pull down of VM's. I decided to execute a controlled power cycle of the server to see if the issue persisted as before. I can verify that both the crash condition when starting the array with spin-down enabled and the workaround of starting the array with spin-down disabled (and enabling after start) still persist exactly as before. As I have previously stated; this may just be one of those odd issues that is somehow unique to my own system and that I shall never get to the bottom of. In the event of my identifying a root cause or future update removing the issue; I shall feedback on my findings... asides that I will thank you each for your support and advice. Stay safe Edited April 28, 2020 by Steve Burridge Quote Link to comment