TheIronAngel Posted December 27, 2022 (edited)

Hi guys, as in the title, this is my first multi-drive failure with UNRAID. Contrary to what I would have expected from waking up to multiple dead drives, I'm actually somewhat excited to learn the process of recovering from a situation like this.

I have an array of 26+2 and 3 drives have spat the dummy. One of the main reasons I picked UNRAID was for this exact situation: if more drives detonate than parity can restore, you only lose the data on the drives that turned themselves into paperweights. At this point in time I have 2 failed parity disks and 1 data disk that decided to grenade itself after parity went bye-bye.

Timeline:
- Midnight - Routine parity check starts
- ~2am - Parity disk 1 starts throwing errors and gets marked bad
- ~3am - Parity disk 2 starts throwing errors and gets marked bad
- ~3:30am - Disk 23 starts throwing 'Sector 1 unreadable'
- ~3:50am - I notice issues with server and start troubleshooting
- ~4am - Drive goes from 'Unmountable' to 'Missing' after restarting server
- 4:30 - Here we are at time of writing

Troubleshooting:
So, for troubleshooting of all disks I've done the basics:
- Restart the server
- Power cycle the disk shelf

For Parity 1 and 2 - both are SAS (PD1 and PD2):
- SMART DST - Pass
- I suspect they are still functional and contain valid parity data as everything that gets written to the array gets written to Cache first and transferred to disk at 6am
- Logs show IO errors on at least PD1, so I'm revising my opinion here: both have probably taken an arrow to the knee.

For Disk 23 - SAS (D23):
- Reconnect the drive in the same bay slot - Still not detected
- Swap the drive to another bay in the chassis (on a different backplane) - No detect

Conclusion:
Here is my conclusion, please do tell me if you think there is something else I can/should do before moving on: D23 going dodo would have been recoverable in addition to PD1 so long as PD2 hadn't also decided to fall flat on its face. Unfortunately, with all 3 drives marked bad by UNRAID, the array is 'unrecoverable' in a traditional sense. I have a hot spare disk that I could use to replace D23 literally installed and waiting, and can get a couple of disks for the parity drives later today to build a new array, but I want to keep the data on all other disks. Boiled down, I have to replace the failed drives and create a new array config with new drives, maintaining disk order to keep data on the drives that aren't dead, but...

Questions:
1: I have 23 other disks from this array, all with information on them, and I would like to keep the information on these disks. I understand that this should be possible when creating a new config with the drives, but I would like someone with experience to point me to the right resource or 'hold my hand' while doing this for the first time to make sure I don't do something dumb. Things like letting me know of any risks/things to be aware of when doing this would also be very much appreciated!
2: Why did I say 23 and not 25 'other disks'? I added 2 disks within the last 30 days and, as of ~6 hours prior to the failure, neither of them had anything stored on them yet. My question is: one of these drives happens to be the right size to become a parity disk. Would it be possible to remove this disk and assign it as parity instead of buying a replacement, or would doing this be a case of a 'don't know what could happen, just be safe and don't do it' recommendation?
3: Is it possible that the 2 parity drives are A-OK (they pass SMART etc.) and could be forced to resume operation in an attempt to restore the data on failed D23 to the hot spare before getting the parity drives replaced, just in case?
4: If option 3 were tried, what possible issues could occur if the parity drives aren't 'dead' but just not reading correctly? Would D23's content be mangled and unusable, or worse - like, the rest of the array data takes a grenade? AKA, is this something reasonably low risk and worth trying, or something high risk where I should just write off D23 and count my blessings that the other disks aren't FUBAR too?

Things to note:
- Not all data on this array is backed up but all critical files are. Any data stored on this array that isn't backed up (primarily, let's say for argument's sake, ISOs) is replaceable and only classed as 'nice to have.' If push comes to shove I'm more than capable of manually verifying which ones are missing and re-acquiring anything lost.
- I'm not in a huge rush to fix this problem, so if there are options that take time to try, that's fine. It's now 5:55am and I'm going back to bed and will check back in a little while. I can post any information that you would like when I get back up in a couple of hours.
- Earlier in the day I had updated UNRAID 6.11.0 --> 6.11.5 and the unit was awaiting a restart, which I was planning on doing after the parity check had completed today, but I had to restart the server during troubleshooting, which completed the update. Not sure if this is an issue for logs or something.

Edited December 28, 2022 by TheIronAngel
trurl Posted December 27, 2022

Attach diagnostics to your NEXT post in this thread.
trurl Posted December 27, 2022

25 minutes ago, TheIronAngel said:
I suspect they are still functional and contain valid parity data as everything that gets written to the array gets written to Cache first and transferred to disk at 6am

Parity is realtime. Any write to the array updates parity at that time. So whether data was moved from cache or not is irrelevant.

29 minutes ago, TheIronAngel said:
neither of them had anything stored on them yet. My question is: One of these drives happens to be the right size to become a parity disk

An empty disk is not a clear disk. An empty disk has an empty filesystem written to it by format. So, those empty filesystems are now part of parity. Only clear disks can be removed from the array without invalidating parity.

Won't comment further on your other scenarios or possible ways to recover. Often there are ways to recover some data even in situations worse than what you described. Despite your lengthy description, some things aren't entirely clear or are missing some details. So

4 minutes ago, trurl said:
attach diagnostics to your NEXT post in this thread.
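To illustrate the two points in the post above (parity is updated at write time, and a formatted "empty" disk is not the same as a cleared one), here is a minimal Python sketch. It is a toy model that assumes simple byte-wise XOR parity across tiny in-memory "disks"; the function names and byte values are made up for illustration and this is not Unraid's actual code.

```python
from functools import reduce
from operator import xor


def compute_parity(disks):
    """XOR the bytes at each offset across all data disks (toy model)."""
    return bytes(reduce(xor, column) for column in zip(*disks))


def write_byte(disks, parity, disk_index, offset, new_value):
    """Write one byte and patch parity in the same step (read-modify-write)."""
    old_value = disks[disk_index][offset]
    disks[disk_index][offset] = new_value
    # Flipping the same bits in parity keeps it in sync without
    # re-reading the other disks.
    parity[offset] ^= old_value ^ new_value


# Two hypothetical 4-byte data disks and their parity.
disks = [bytearray(b"\x11\x22\x33\x44"), bytearray(b"\xaa\xbb\xcc\xdd")]
parity = bytearray(compute_parity(disks))

# A write updates parity immediately, regardless of where the data came from.
write_byte(disks, parity, 0, 2, 0x7F)
parity_still_valid = bytes(parity) == compute_parity(disks)

# A cleared (all-zero) disk contributes nothing to XOR parity...
cleared = bytearray(4)
parity_with_cleared = compute_parity(disks + [cleared])

# ...but a formatted disk holds filesystem metadata, i.e. nonzero bytes
# (the values here are purely illustrative).
formatted = bytearray(b"\x58\x46\x53\x42")
parity_with_formatted = compute_parity(disks + [formatted])

print(parity_still_valid)
print(parity_with_cleared == bytes(parity))
print(parity_with_formatted == bytes(parity))
```

In this model, adding or removing the all-zero disk leaves parity unchanged, while pulling the formatted disk would leave parity describing bytes that are no longer in the array.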
TheIronAngel Posted December 27, 2022

diskstation2-diagnostics-20221228-1058.z

Hi Trurl, thanks for the reply, much appreciated - here is the diag zip.
trurl Posted December 28, 2022

Both parity disabled, not clear there is anything wrong with either. Disk23 isn't disabled since you already have 2 disabled disks, but Unraid thinks it is missing. However, there is an Unassigned Device sdm in the SMART reports that seems like it might be the missing disk23. Unfortunately, sdm is showing critical medium errors in syslog.

Can you see that disk in Unassigned Devices? Is that disk23?
trurl Posted December 28, 2022

2 minutes ago, trurl said:
Both parity disabled, not clear there is anything wrong with either

I must admit I'm not very experienced in looking at SMART reports for SAS disks though.
TheIronAngel Posted December 28, 2022

SDM is in the UD section, however it looks like it's spun down and can't be spun up. Also, when investigating the disk by clicking on sdm, none of the typical data is shown. It's also not able to be added to the D23 slot in the array. Seems like the disk is badly in-op.
TheIronAngel Posted December 28, 2022 (edited)

D26 is an identical drive to PD1 and PD2 - its SMART info is basically the same across all 3 of them. They are refurb disks so I have a suspicion that SMART data isn't reported correctly on them. I have 2x new 10TB IronWolf disks on order for collection tomorrow so will be able to replace them with name-brand disks shortly. I also still have warranty on all 10TB disks (purchased Nov this year - I even triple pre-cleared since they were refurbs).

Edited December 28, 2022 by TheIronAngel
trurl Posted December 28, 2022

2 minutes ago, trurl said:
I must admit I'm not very experienced in looking at SMART reports for SAS disks though.

Do any disks show SMART warnings on the Dashboard page?
TheIronAngel Posted December 28, 2022

1 minute ago, trurl said:
Do any disks show SMART warnings on the Dashboard page?

The only one with a SMART warning is the hot spare Dev1 (ST8000VX004-2M1101), which has had 1 Reported Uncorrect for a couple of months and hasn't degraded further since. It has been pre-cleared probably another 20 times since it got removed from the array, in an attempt to force a fail as it's still in warranty until Feb.
trurl Posted December 28, 2022

It is possible to get all drives back into the array, then make Unraid rebuild disk23 to a spare. Might be mostly OK even though parity disks are probably out-of-sync.

Do you know if anything was written to the server since parity was disabled?

I'd like to get a second opinion on this thread since I don't use SAS disks. @JorgeB probably won't be around for several hours since it's way past bedtime in his timezone.
TheIronAngel Posted December 28, 2022

1 minute ago, trurl said:
It is possible to get all drives back into the array, then make Unraid rebuild disk23 to a spare. Might be mostly OK even though parity disks are probably out-of-sync. Do you know if anything was written to the server since parity was disabled? I'd like to get a second opinion on this thread since I don't use SAS disks. @JorgeB probably won't be around for several hours since it's way past bedtime in his timezone.

If I add the hot spare into D23's place, the array still calls 'Invalid configuration.' and won't allow a start.

I'm assuming you want me to force the parity disks online using New Config --> Same as old + replaced D23 --> Parity is valid, and test?

Happy to wait for JorgeB to chime in before I do anything if you think that's best though.
JorgeB Posted December 28, 2022

Please reboot and post new diags, after starting the array if it can be started.
JorgeB Posted December 28, 2022

When you reboot, swap the cables for disk23, but it looks like there's a problem with that disk.
TheIronAngel Posted December 28, 2022

57 minutes ago, JorgeB said:
Please reboot and post new diags, after array if it can be started.

Hi JorgeB, I shut down the system and rebooted earlier in the day (about 9 hours ago) to work on another machine in the rack. It has rebooted with the exact same 3 disks marked faulty and the array cannot be started. Even if I replace the missing data disk with the 'spare' that is installed, UNRAID still reports an invalid config and won't start the array (Too many wrong and/or missing disks!).

I can do another reboot if you'd like, but I've attached a fresh diag that I've just generated as well as a screenshot of all disks in the system.

diskstation2-diagnostics-20221228-2226.zip
TheIronAngel Posted December 28, 2022

2 minutes ago, JorgeB said:
When you reboot swap the cables for disk23, but looks like there's a problem with that disk.

The drive is in a 20-bay expander. I've swapped it from position 3-4 to 4-2 (Row/Col); this was a swap with the disk ending ZAD70MFR (Disk22/sdr). MFR registered in the new bay but 5W9 did not. I've swapped them back to their original locations and the same has resulted.

This expander uses 5 separate backplanes (1 for each row). The impacted disks PD1/PD2/D23 are in positions 1-1, 1-2 and 3-4 respectively; all other disks on Row 1 and Row 3 are working A-OK.

Reasonably sure there isn't an issue with the Backplane, SAS cable, SAS Expander and Controller but happy to test anything - I work in IT myself so I've seen weirder things cause issues.
JorgeB Posted December 28, 2022

30 minutes ago, TheIronAngel said:
Reasonably sure there isn't an issue with the Backplane, SAS cable, SAS Expander and Controller but happy to test anything

Agree, especially since the disk failed a short SMART test. The cable/slot swap was just to confirm it really is a disk problem.

On the other hand, both parity disks look OK; you could run a long SMART test to confirm. Then, depending on the backup situation, you can try to re-enable parity to see if the missing disk can still be correctly emulated. If data on that disk is backed up or is not important and you just want to bring the array online, you can just do a new config without it and re-sync parity.
TheIronAngel Posted December 28, 2022

21 minutes ago, JorgeB said:
On the other hand both parity disks look OK, you could run a long SMART test to confirm

SMART extended test has started on both of the parity drives. By the time this has finished I imagine the replacement drives will have arrived or will be close to arriving (ETA ~16 hours for test).

21 minutes ago, JorgeB said:
you can try to re-enable parity to see if the missing disk can still be correctly emulated

Providing one or both of them pass an extended test, I'd like to attempt to re-enable parity. Admittedly I'm a little lazy and don't want to manually replace the data on the missing disk if I can help it, given we're in the holiday season and I have things I would rather do. Can you point me at the resource to follow for re-enabling parity? At this point my assumption would be, as mentioned before, the 'new config' and select the 'parity data is valid' option, but I really would prefer that to be confirmed before I do something I can't simply 'ctrl+z', if you know what I mean.

I hope you don't mind me asking, I love to know what most people would call 'irrelevant information' and tend to ask a lot of curiosity questions.

Do you know what would be the result of parity data that isn't completely correct? Would ALL reconstructed data be bad or only parts of the data?

Since I have 2 parity drives that failed with an hour's gap, would I be better off only re-enabling one of the parity drives (the last one to be marked failed) or both?
Solution

JorgeB Posted December 28, 2022

24 minutes ago, TheIronAngel said:
Do you know what would be the result of parity data that isn't completely correct? Would ALL reconstructed data be bad or only parts of the data?

If parity is not 100% in sync, only the bad part on the rebuilt disk would also be bad.

24 minutes ago, TheIronAngel said:
Since I have 2 parity drives that failed with an hours gap would I be better off only re-enabling one of the parity drives (the last one to be marked failed) or both?

Yes, it might be better to re-enable only parity2, since if both are re-enabled only parity1 will be used to emulate the missing disk, assuming no read errors on a different disk.

The procedure would be the following (you need a new disk to replace disk23):

-Tools -> New Config -> Retain current configuration: All -> Apply
-Unassign parity1 and then check all assignments. Parity2 must remain assigned as parity2 (they are not interchangeable). Assign any missing disk(s) if needed, including the new disk23; the replacement disk should be the same size or larger than the old one
-IMPORTANT - Check both "parity is already valid" and "maintenance mode" and start the array (note that the GUI will still show that data on parity disk(s) will be overwritten; this is normal as it doesn't account for the checkbox, but it won't be as long as it's checked)
-Stop array
-Unassign disk23
-Start array (in normal mode now). Ideally the emulated disk will now mount and contents look correct; if it doesn't, post new diags
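The claim above that only the out-of-sync part of parity produces bad data on the rebuilt disk can be sketched in Python. This is a toy model assuming a single byte-wise XOR parity over small in-memory "disks"; parity2 uses a different calculation that is not modelled here, and all names and data are hypothetical.

```python
from functools import reduce
from operator import xor


def xor_columns(blocks):
    """XOR the bytes at each offset across all given blocks."""
    return bytes(reduce(xor, column) for column in zip(*blocks))


# Three hypothetical data disks of equal size, and parity computed from them.
data = [b"ALPHA123", b"BRAVO456", b"DELTA789"]
parity = bytearray(xor_columns(data))

# Simulate parity being slightly out of sync: two offsets are wrong.
parity[2] ^= 0xFF
parity[5] ^= 0x0F

# Rebuild the "missing" disk from the surviving disks plus parity.
missing_index = 1
survivors = [d for i, d in enumerate(data) if i != missing_index]
rebuilt = xor_columns(survivors + [bytes(parity)])

# Compare against what the missing disk really contained.
bad_offsets = [
    i for i, (got, want) in enumerate(zip(rebuilt, data[missing_index]))
    if got != want
]
print(bad_offsets)
```

Each rebuilt byte depends only on the bytes at the same offset on the other disks and parity, so in this model the damage stays confined to the offsets where parity was wrong.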
TheIronAngel Posted December 28, 2022

41 minutes ago, JorgeB said:
If parity is not 100% in sync only the bad part on the rebuilt disk would also be bad.

Wow, magic right there! I feel very much vindicated in going UNRAID for this. I'm almost 100% certain that nothing has been written to that disk in a long time due to all of my shares being set to 'High water' mode - a disk much higher on the list (lower in number) has been getting the writes recently, and not much has been added to the array as well.

44 minutes ago, JorgeB said:
The procedure would be the following (you need a new disk to replace disk23):
-Tools -> New Config -> Retain current configuration: All -> Apply
-Unassign parity1 and then check all assignments, parity2 must remain assigned as parity2, they are not interchangeable, assign any missing disk(s) if needed, including the new disk23, replacement disk should be same size or larger than the old one
-IMPORTANT - Check both "parity is already valid" and "maintenance mode" and start the array (note that the GUI will still show that data on parity disk(s) will be overwritten, this is normal as it doesn't account for the checkbox, but it won't be as long as it's checked)
-Stop array
-Unassign disk23
-Start array (in normal mode now), ideally the emulated disk will now mount and contents look correct, if it doesn't post new diags

Excellent, thank you very much - I'll let you know what the result of the SMART extended test is for the drives and whether or not I'll be trying that process.

Let me know if I'm out of field here, but I have a couple of last questions before I let this run and take a break from asking you wonderful people questions:

1: Is it possible that UNRAID failed the parity disks due to garbage data coming from the drive that eventually failed during the parity check?
2: On that same note: hypothetically speaking, if a parity check starts and a drive in the array is spitting out bad data, would this corrupt parity data? Would UNRAID be able to detect that the drive is sending schmoo or would it assume that what the drive reads out is correct?

Thank you so much JorgeB and trurl for the time that you've taken out of your day, especially being that it's the festive season, to help read logs and answer my questions. It is all very, very much appreciated.
JorgeB Posted December 28, 2022

1 minute ago, TheIronAngel said:
1: Is it possible that UNRAID failed the parity disks due to garbage data coming from the drive that eventually failed during the parity check?

Unlikely, though it's strange that both parity drives got disabled an hour apart. Since the initial diags are from after a reboot, it's difficult to say more.

4 minutes ago, TheIronAngel said:
2: On that same note: Hypothetically speaking, if a parity check starts and a drive in the array is spitting out bad data would this corrupt parity data? Would UNRAID be able to detect that the drive is sending schmoo or would it assume that what the drive reads out is correct?

It's rare but it's been known to happen; that's why we always recommend scheduling non-correcting parity checks, since those won't cause any problems. Of course, if sync errors are expected or are detected by a non-correcting check (without disk errors), then you must run a correcting check.
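The difference between a non-correcting and a correcting parity check described above can be sketched as follows. This is a simplified Python model assuming single XOR parity and a disk that silently returns one wrong byte; the `parity_check` function and its `correct` flag are hypothetical names for illustration, not Unraid's implementation.

```python
from functools import reduce
from operator import xor


def parity_check(disks_as_read, parity, correct=False):
    """Count offsets where parity disagrees with the data as read.

    With correct=True, parity is overwritten to match whatever was read,
    even if the data read from a disk was itself wrong.
    """
    errors = 0
    for offset, column in enumerate(zip(*disks_as_read)):
        expected = reduce(xor, column)
        if parity[offset] != expected:
            errors += 1
            if correct:
                parity[offset] = expected
    return errors


# What is really stored on two hypothetical disks, and the matching parity.
stored = [bytes(b"\x01\x02\x03"), bytes(b"\x10\x20\x30")]
good_parity = bytes(reduce(xor, col) for col in zip(*stored))

# A failing disk returns garbage at offset 1 during the check.
as_read = [stored[0], b"\x10\xff\x30"]

# Non-correcting check: the mismatch is reported, parity is left alone.
parity_a = bytearray(good_parity)
errors_a = parity_check(as_read, parity_a, correct=False)

# Correcting check: good parity is overwritten using the garbage read.
parity_b = bytearray(good_parity)
errors_b = parity_check(as_read, parity_b, correct=True)

print(errors_a, bytes(parity_a) == good_parity)
print(errors_b, bytes(parity_b) == good_parity)
```

In this model both runs report the same error count, but only the correcting run damages parity that was valid to begin with, which is the reason given above for scheduling non-correcting checks.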
TheIronAngel Posted December 28, 2022

10 minutes ago, JorgeB said:
Unlikely, though it's strange both parity drives getting disabled an hour apart, but since the initial diags are after a reboot difficult to say more. It's rare but it's been known to happen, that why we always recommend scheduling non correct parity checks, those won't cause any problems, of course if sync errors are expected or detected by a non correcting check (without disk errors) then you must run a correcting check.

Interesting, thanks for the info. I just checked my scheduled parity check settings and it looks like I have mine set up the way you have described, unless the settings aren't accurate until the array is online.

If I'm not mistaken the last notification from UNRAID was:
Error code: aborted
Finding 3584 errors

Last month's parity check completed without issue and no hardware has been changed between parity checks, so those errors are new. I won't speculate further on the cause, or on whether the errors are actually errors or schmoo from a disk about to blow, as it's getting a bit too far into the unknowable to be worth any real consideration, but I wanted to at least offer the information.

This whole experience has been very enlightening and educational!
JorgeB Posted December 28, 2022

Just now, TheIronAngel said:
I have mine setup the way you have described

Yes, that's correct, it's also the default now.
TheIronAngel Posted December 29, 2022

15 hours ago, JorgeB said:
If parity is not 100% in sync only the bad part on the rebuilt disk would also be bad. Yes, it might be better to re-enable only parity2, since if both are re-enabled only parity1 will be used to emulated the missing disk, assuming no read errors on a different disk. The procedure would be the following (you need a new disk to replace disk23):
-Tools -> New Config -> Retain current configuration: All -> Apply
-Unassign parity1 and then check all assignments, parity2 must remain assigned as parity2, they are not interchangeable, assign any missing disk(s) if needed, including the new disk23, replacement disk should be same size or larger than the old one
-IMPORTANT - Check both "parity is already valid" and "maintenance mode" and start the array (note that the GUI will still show that data on parity disk(s) will be overwritten, this is normal as it doesn't account for the checkbox, but it won't be as long as it's checked)
-Stop array
-Unassign disk23
-Start array (in normal mode now), ideally the emulated disk will now mount and contents look correct, if it doesn't post new diags

Both parity disks completed the extended SMART test and I've followed your instructions. Unfortunately disk23 does not emulate and just comes up as 'Unmountable: Wrong or no filesystem'.

I also had a swathe of UDMA CRC errors on restarting the array. However, it's not confined to one backplane or expander: only some disks on Rows 2, 3 and 5 in the disk shelf got them, as well as a couple of disks that were on internal headers on the master controller (not in the disk shelf at all). I'm not really sure what to make of that other than 'meh, UDMA CRC error, not a major issue'.

Diags attached for checking and I've stopped the array.

diskstation2-diagnostics-20221229-1558.zip
trurl Posted December 29, 2022

29 minutes ago, TheIronAngel said:
unfortunately disk23 does not emulate and just comes up as 'Unmountable: Wrong or no filesystem'

It is emulated, but unmountable.

Check filesystem on emulated disk23