First time multi-drive failure


Go to solution Solved by JorgeB,

Hi Guys, as in the title, this is my first multi-drive failure with UNRAID. Contrary to what I would have expected from waking up to multiple dead drives I'm actually somewhat excited to learn the process of recovering from a situation like this.

 

I have an array of 26+2 and 3 drives have spat the dummy. One of the main reasons I picked UNRAID was for this exact situation: if more drives fail than parity can restore, you only lose the data on the drives that turned themselves into paperweights. At this point in time I have 2 failed Parity disks and 1 data disk that decided to grenade itself after parity went bye-bye.

 

Timeline:

Midnight - Routine Parity check starts

~2am - Parity disk 1 starts throwing errors and gets marked bad

~3am - Parity disk 2 starts throwing errors and gets marked bad

~3:30am - Disk 23 starts throwing 'Sector 1 unreadable'

~3:50am - I notice issues with server and start troubleshooting

~4am - Drive goes from 'Unmountable' to 'Missing' after restarting server

~4:30am - Here we are at time of writing

 

Troubleshooting:

So, for troubleshooting of all disks I've done the basics:

Restart the server

Power Cycle the disk shelf

 

For Parity 1 and 2 - Both are SAS (PD1 and PD2)

SMART DST - Pass - I initially suspected they were still functional and contained valid parity data, as everything that gets written to the array gets written to Cache first and transferred to disk at 6am. However, the logs show IO errors on at least PD1, so I'm revising my opinion here: both have probably taken an arrow to the knee.

 

For Disk 23 - SAS (D23)

Reconnect the drive in the same bay slot - Still not detected

Swap the drive to another bay in the chassis (on a different backplane) - No detect

 

Conclusion:

Here is my conclusion, Please do tell me if you think there is something else I can/should do before moving on:

D23 going Dodo would have been recoverable in addition to PD1 so long as PD2 hadn't also decided to fall flat on its face. Unfortunately with all 3 drives marked bad by UNRAID this means that the array is 'unrecoverable' in a traditional sense.

I have a hot spare disk that I could use to replace D23 literally installed and waiting, and can get a couple of disks for the Parity drives later today to build a new array but I want to keep the data on all other disks.

Boiled down, I have to replace the failed drives and create a new array config with new drives maintaining disk order to keep data on the drives that aren't dead but...

 

Questions:

1: I have 23 other disks from this array all with information on them and I would like to keep the information on these disks - I understand that this should be possible when creating a new config with the drives but I would like someone with experience to point me to the right resource or 'hold my hand' while doing this for the first time to make sure I don't do something dumb. Things like letting me know of any risks/things to be aware of when doing this would also be very much appreciated!

2: Why did I say 23 and not 25 'other disks'? I added 2 disks within the last 30 days and, as of ~6 hours prior to the failure, neither of them had anything stored on them yet. My question is: One of these drives happens to be the right size to become a parity disk, would it be possible to remove this disk and assign it as parity instead of buying a replacement or would doing this be a case of 'don't know what could happen, just be safe and don't do it' recommendation?

3: Is it possible that the 2 Parity drives are A-OK (they pass SMART etc.) and could be forced to resume operation in an attempt to restore the data on failed D23 to the hot spare before getting the parity drives replaced just in case?

4: If option 3 were tried what possible issues could occur if the parity drives aren't 'dead' but just not reading correctly? Would D23's content be mangled and unusable or worse - like, rest of the array data takes a grenade? - AKA, is this something reasonably low risk and worth trying or something high risk and I should just write off D23 and count my blessings that the other disks aren't FUBAR too?

 

Things to note:

Not all data on this array is backed up but all critical files are. Any data stored on this array that isn't backed up (primarily, let's say for argument's sake, ISOs) is replaceable and only classed as 'nice to have.' If push comes to shove I'm more than capable of manually verifying which ones are missing and re-acquiring anything lost.

I'm not in a huge rush to fix this problem so if there are options that take time to try, that's fine.

It's now 5:55am and I'm going back to bed and will check back in a little while. I can post any information that you would like when I get back up in a couple of hours.

Earlier in the day I had updated UNRAID 6.11.0 --> 6.11.5 and the unit was awaiting a restart which I was planning on doing after parity check had completed today but had to restart the server during troubleshooting which completed the update - not sure if this is an issue for logs or something.

 

Edited by TheIronAngel
Link to comment
25 minutes ago, TheIronAngel said:

I suspect they are still functional and contain valid parity data as everything that gets written to the array gets written to Cache first and transferred to disk at 6am

Parity is realtime. Any write to the array updates parity at that time. So whether data was moved from cache or not is irrelevant.
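
To illustrate the point about realtime parity, here is a minimal sketch that assumes a simplified single-parity model (byte-wise XOR across disks). The helper names `xor_parity` and `write_block` are made up for this example and are not Unraid internals.

```python
from functools import reduce


def xor_parity(blocks):
    """Byte-wise XOR of equally sized blocks (simplified single-parity model)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def write_block(disks, parity, disk_index, new_data):
    """Write new_data to one disk and return the parity updated in the same step."""
    old_data = disks[disk_index]
    disks[disk_index] = new_data
    # new parity = old parity XOR old data XOR new data
    return bytes(p ^ o ^ n for p, o, n in zip(parity, old_data, new_data))


# Three hypothetical two-byte "disks" and their parity.
disks = [b"\x01\x02", b"\x10\x20", b"\x00\x00"]
parity = xor_parity(disks)

# A write to disk index 2 updates parity at the moment of the write.
parity = write_block(disks, parity, 2, b"\xaa\xbb")
parity_matches = parity == xor_parity(disks)
```

In this model there is no window in which a completed write is missing from parity, which is why the cache mover schedule has no bearing on whether parity is current.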

 

29 minutes ago, TheIronAngel said:

neither of them had anything stored on them yet. My question is: One of these drives happens to be the right size to become a parity disk

An empty disk is not a clear disk. An empty disk has an empty filesystem written to it by format. So, those empty filesystems are now part of parity. Only clear disks can be removed from the array without invalidating parity.
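
A rough sketch of the difference, again assuming a simplified XOR single-parity model. The byte values standing in for "filesystem metadata" are invented for illustration only.

```python
from functools import reduce


def xor_parity(blocks):
    """Byte-wise XOR of equally sized blocks (simplified single-parity model)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


data_disk = b"\x12\x34\x56\x78"
cleared_disk = bytes(4)               # all zeros: a "clear" disk
formatted_disk = b"\x58\x46\x53\x00"  # hypothetical filesystem metadata, no user files

parity_without_extra = xor_parity([data_disk])
parity_with_cleared = xor_parity([data_disk, cleared_disk])
parity_with_formatted = xor_parity([data_disk, formatted_disk])

# Removing an all-zero disk leaves parity unchanged; removing a formatted one does not.
cleared_removal_safe = parity_with_cleared == parity_without_extra
formatted_removal_safe = parity_with_formatted == parity_without_extra
```

XOR with zero is a no-op, so only a disk that is zeroed end to end contributes nothing to parity. A freshly formatted disk holds no files but still has non-zero bytes on it.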

 

Won't comment further on your other scenarios or possible ways to recover. Often there are ways to recover some data even in situations worse than what you described.

 

Despite your lengthy description, some things aren't entirely clear or are missing details. So

4 minutes ago, trurl said:

attach diagnostics to your NEXT post in this thread.

 

 

Link to comment

Both parity disabled, not clear there is anything wrong with either.

 

Disk23 isn't disabled since you already have 2 disabled disks, but Unraid thinks it is missing.

 

However, there is an Unassigned Device sdm in the SMART reports that seems like it might be the missing disk23.

 

Unfortunately, sdm is showing critical medium errors in syslog.

 

Can you see that disk in Unassigned Devices? Is that disk23?

Link to comment

D26 is an identical drive to PD1 and PD2 - its SMART info is basically the same across all 3 of them. They are refurb disks so I have a suspicion that SMART data isn't reported correctly on them.

 

I have 2x new 10TB IronWolf disks on order for collection tomorrow so will be able to replace them with name brand disks shortly. I also still have warranty on all 10TB disks (purchased Nov this year - I even triple pre-cleared since they were refurbs)

Edited by TheIronAngel
Link to comment
1 minute ago, trurl said:

Do any disks show SMART warnings on the Dashboard page?

The only one with a SMART warning is the hotspare Dev1 (ST8000VX004-2M1101), which has had 1 Reported Uncorrect for a couple of months and hasn't degraded further since (it has been pre-cleared probably another 20 times since it got removed from the array in an attempt to force a fail, as it's still in warranty until Feb)

Link to comment

It is possible to get all drives back into the array, then make Unraid rebuild disk23 to a spare. Might be mostly OK even though parity disks are probably out-of-sync. Do you know if anything was written to the server since parity was disabled?

 

I'd like to get a second opinion on this thread since I don't use SAS disks.

 

@JorgeB  probably won't be around for several hours since it's way past bedtime in his timezone.

Link to comment
1 minute ago, trurl said:

It is possible to get all drives back into the array, then make Unraid rebuild disk23 to a spare. Might be mostly OK even though parity disks are probably out-of-sync. Do you know if anything was written to the server since parity was disabled?

 

I'd like to get a second opinion on this thread since I don't use SAS disks.

 

@JorgeB  probably won't be around for several hours since it's way past bedtime in his timezone.

If I add the hotspare into D23's place the array still calls 'Invalid configuration.' and won't allow a start.

I'm assuming you want me to force the parity disks online using New Config --> Same as old + replaced D23 --> Parity is valid and test?

 

Happy to wait for JorgeB to chime in before I do anything if you think that's best though.

Link to comment
57 minutes ago, JorgeB said:

Please reboot and post new diags, after starting the array if it can be started.

Hi JorgeB, I shut down the system and rebooted earlier in the day (about 9 hours ago) to work on another machine in the rack. It has rebooted with the exact same 3 disks marked faulty and the array cannot be started. Even if I replace the missing data disk with the 'spare' that is installed, UNRAID still reports an invalid config and won't start the array (Too many wrong and/or missing disks!). I can do another reboot if you'd like, but I've attached a fresh diag that I've just generated as well as a screenshot of all disks in the system.

2022-12-28_22-29-41.png

diskstation2-diagnostics-20221228-2226.zip

Link to comment
2 minutes ago, JorgeB said:

When you reboot swap the cables for disk23, but looks like there's a problem with that disk.

The drive is in a 20-bay expander, I've swapped it from position 3-4 to 4-2 (Row/Col) this was a swap with disk ending ZAD70MFR (Disk22/sdr) - MFR registered in the new bay but 5W9 did not - I've swapped them back to their original locations and the same has resulted. 

This expander uses 5 separate backplanes (1 for each row). The impacted disks PD1/PD2/D23 are in positions 1-1, 1-2 and 3-4 respectively; all other disks on Row 1 and Row 3 are working A-OK. Reasonably sure there isn't an issue with the Backplane, SAS cable, SAS Expander and Controller but happy to test anything - I work in IT myself so I've seen weirder things cause issues :)

Link to comment
30 minutes ago, TheIronAngel said:

Reasonably sure there isn't an issue with the Backplane, SAS cable, SAS Expander and Controller but happy to test anything

Agree, especially since the disk failed a short SMART test, cable/slot swap was just to confirm it really is a disk problem.

 

On the other hand, both parity disks look OK. You could run a long SMART test to confirm, then, depending on the backup situation, you can try to re-enable parity to see if the missing disk can still be correctly emulated. If data on that disk is backed up or is not important and you just want to bring the array online, you can just do a new config without it and re-sync parity.

Link to comment
21 minutes ago, JorgeB said:

On the other hand both parity disks look OK, you could run a long SMART test to confirm

SMART Extended test has started on both of the Parity drives. By the time this has finished I imagine the replacement drives will have arrived or will be close to arriving (ETA ~16 hours for test).

21 minutes ago, JorgeB said:

you can try to re-enable parity to see if the missing disk can still be correctly emulated

Providing one or both of them pass an extended test I'd like to attempt to re-enable parity, admittedly I'm a little lazy and don't want to manually replace the data on the missing disk if I can help it given we're in the holiday season I have things I would rather do.

Can you point me at the resource to follow for re-enabling parity? At this point my assumption would be, as mentioned before, 'New Config' and selecting the 'parity data is valid' option, but I really would prefer that to be confirmed before I do something I can't simply 'ctrl+z', if you know what I mean.

 

I hope you don't mind me asking, I love to know what most people would call 'irrelevant information' and tend to ask a lot of curiosity questions. Do you know what would be the result of parity data that isn't completely correct? Would ALL reconstructed data be bad or only parts of the data? 

Since I have 2 parity drives that failed with an hour's gap, would I be better off only re-enabling one of the parity drives (the last one to be marked failed) or both?

Link to comment
  • Solution
24 minutes ago, TheIronAngel said:

Do you know what would be the result of parity data that isn't completely correct? Would ALL reconstructed data be bad or only parts of the data? 

If parity is not 100% in sync only the bad part on the rebuilt disk would also be bad.
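
As a rough illustration of why the damage stays local, here is a sketch assuming a simplified single-parity XOR model. The procedure in this thread actually relies on parity2, whose maths differs, but reconstruction is still done position by position. All values below are made up.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


surviving = [b"ABCDEFGH", b"12345678"]  # disks that are still readable
missing = b"unraid!!"                   # original contents of the lost disk
parity = bytearray(xor_blocks(surviving + [missing]))

# Simulate stale parity: the bytes at offsets 2 and 5 are out of sync.
parity[2] ^= 0xFF
parity[5] ^= 0x0F

# Rebuild the missing disk from the survivors plus the (partly stale) parity.
rebuilt = xor_blocks(surviving + [bytes(parity)])
bad_offsets = [i for i, (a, b) in enumerate(zip(rebuilt, missing)) if a != b]
```

Only the offsets where parity was stale come back wrong on the rebuilt disk; every other byte is reconstructed correctly. Whether a damaged region matters in practice depends on what the filesystem stored there.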

 

24 minutes ago, TheIronAngel said:

Since I have 2 parity drives that failed with an hour's gap, would I be better off only re-enabling one of the parity drives (the last one to be marked failed) or both?

Yes, it might be better to re-enable only parity2, since if both are re-enabled only parity1 will be used to emulate the missing disk, assuming no read errors on a different disk.

 

The procedure would be the following (you need a new disk to replace disk23):

 

-Tools -> New Config -> Retain current configuration: All -> Apply
-Unassign parity1 and then check all assignments, parity2 must remain assigned as parity2, they are not interchangeable, assign any missing disk(s) if needed, including the new disk23, replacement disk should be same size or larger than the old one
-IMPORTANT - Check both "parity is already valid" and "maintenance mode" and start the array (note that the GUI will still show that data on parity disk(s) will be overwritten, this is normal as it doesn't account for the checkbox, but it won't be as long as it's checked)
-Stop array
-Unassign disk23
-Start array (in normal mode now), ideally the emulated disk will now mount and contents look correct, if it doesn't post new diags
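
The reason parity1 and parity2 are not interchangeable, and why data disks must keep their slots when parity2 is involved, can be sketched as follows. This assumes a RAID-6-style scheme in which the second parity (Q) weights each disk by its slot using GF(2^8) arithmetic with the polynomial 0x11d. Unraid's actual implementation is not described in this thread, so treat this as a model and not as its code.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) using the reduction polynomial 0x11d."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11D
        b >>= 1
    return result


def gf_pow2(exponent):
    """Return 2**exponent in GF(2^8)."""
    result = 1
    for _ in range(exponent):
        result = gf_mul(result, 2)
    return result


def p_parity(disk_bytes):
    """First parity (P): plain XOR, independent of disk order."""
    out = 0
    for value in disk_bytes:
        out ^= value
    return out


def q_parity(disk_bytes):
    """Second parity (Q): each disk is weighted by 2**slot, so order matters."""
    out = 0
    for slot, value in enumerate(disk_bytes):
        out ^= gf_mul(gf_pow2(slot), value)
    return out


original_order = [0x11, 0x22, 0x33]  # one byte per hypothetical data disk
swapped_order = [0x22, 0x11, 0x33]   # first two disks swap slots

p_same = p_parity(original_order) == p_parity(swapped_order)
q_same = q_parity(original_order) == q_parity(swapped_order)
```

In this model P survives a slot swap but Q does not, and P and Q hold different values for the same data, so a disk written as parity2 cannot stand in for parity1.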

 

 

 

Link to comment
41 minutes ago, JorgeB said:

If parity is not 100% in sync only the bad part on the rebuilt disk would also be bad.

Wow, magic right there! I feel very much vindicated in going UNRAID for this. I'm almost 100% certain that nothing has been written to that disk in a long time due to all of my shares being set to 'High water' mode - a disk much higher on the list (lower in number) has been getting the writes recently and not much has been added to the array either.

 

44 minutes ago, JorgeB said:

The procedure would be the following (you need a new disk to replace disk23):

 


Excellent, thank you very much - I'll let you know what the result of the SMART Extended test is for the drives and whether or not I'll be trying that process.

 

Let me know if I'm out of field here but I have a couple of last questions before I let this run and take a break from asking you wonderful people questions:

1: Is it possible that UNRAID failed the parity disks due to garbage data coming from the drive that eventually failed during the parity check?

2: On that same note: Hypothetically speaking, if a parity check starts and a drive in the array is spitting out bad data would this corrupt parity data? Would UNRAID be able to detect that the drive is sending schmoo or would it assume that what the drive reads out is correct?

 

Thank you so much JorgeB and trurl for the time that you've taken out of your day, especially being that it's the festive season, to help read logs and answer my questions. It is all very, very much appreciated.

Link to comment
1 minute ago, TheIronAngel said:

1: Is it possible that UNRAID failed the parity disks due to garbage data coming from the drive that eventually failed during the parity check?

Unlikely, though it's strange that both parity drives got disabled an hour apart, but since the initial diags are from after a reboot it's difficult to say more.

 

4 minutes ago, TheIronAngel said:

2: On that same note: Hypothetically speaking, if a parity check starts and a drive in the array is spitting out bad data would this corrupt parity data? Would UNRAID be able to detect that the drive is sending schmoo or would it assume that what the drive reads out is correct?

It's rare but it's been known to happen, that's why we always recommend scheduling non-correcting parity checks, since those won't cause any problems. Of course, if sync errors are expected or are detected by a non-correcting check (without disk errors) then you must run a correcting check.
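
A small sketch of the risk being described, assuming a simplified XOR single-parity model and a hypothetical drive that returns wrong data without reporting a read error. The function `parity_check` and its `correct` flag are illustrative names only.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def parity_check(read_blocks, stored_parity, correct):
    """Return (sync_errors, parity_after_check) for the data as read from the disks."""
    computed = xor_blocks(read_blocks)
    errors = sum(1 for c, s in zip(computed, stored_parity) if c != s)
    # A correcting check overwrites parity with whatever the disks returned.
    return errors, (computed if correct else stored_parity)


true_data = [b"\x01\x02\x03", b"\x10\x20\x30"]
good_parity = xor_blocks(true_data)

# Hypothetical failing drive: the last byte is read back wrong, silently.
garbage_read = [true_data[0], b"\x10\x20\xff"]

errors_nc, parity_nc = parity_check(garbage_read, good_parity, correct=False)
errors_c, parity_c = parity_check(garbage_read, good_parity, correct=True)
```

Both runs report the same sync error, but only the correcting run replaces good parity with a value derived from the bad read. The non-correcting run leaves parity intact, so the disk can still be rebuilt from it later.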

 

Link to comment
10 minutes ago, JorgeB said:

Unlikely, though it's strange that both parity drives got disabled an hour apart, but since the initial diags are from after a reboot it's difficult to say more.

 

It's rare but it's been known to happen, that's why we always recommend scheduling non-correcting parity checks, since those won't cause any problems. Of course, if sync errors are expected or are detected by a non-correcting check (without disk errors) then you must run a correcting check.

 

Interesting, thanks for the info. I just checked my scheduled parity check settings and it looks like I have mine set up the way you have described, unless the settings aren't accurate until the array is online. If I'm not mistaken the last notification from UNRAID was:

 Error code: aborted
 Finding 3584 errors

Last month's Parity check completed without issue and no hardware has been changed between parity checks, so those errors are new. I won't speculate further on the cause or whether the errors are actually errors or schmoo from a disk about to blow, as it's getting a bit too far into the unknowable to be worth any real consideration, but I wanted to at least offer the information.

 850a465c5c.png

 

This whole experience has been very enlightening and educational!

Link to comment
15 hours ago, JorgeB said:

If parity is not 100% in sync only the bad part on the rebuilt disk would also be bad.

 

Yes, it might be better to re-enable only parity2, since if both are re-enabled only parity1 will be used to emulate the missing disk, assuming no read errors on a different disk.

 

The procedure would be the following (you need a new disk to replace disk23):

 


 

 

 

Both Parity disks completed the extended SMART test and I've followed your instructions, unfortunately disk23 does not emulate and just comes up as 'Unmountable: Wrong or no filesystem'

 

I also had a swathe of UDMA CRC errors on restarting the array. However, they're not confined to one backplane or expander: only some disks on Rows 2, 3 and 5 in the disk shelf got them, as well as a couple of disks that were on internal headers on the master controller (not in the disk shelf at all). I'm not really sure what to make of that other than 'meh, UDMA CRC error, not a major issue'.

 

Diags attached for checking and I've stopped the array

diskstation2-diagnostics-20221229-1558.zip

Link to comment
