Possible multiple drive failures



My unRAID 6.5.2 system reported that disk6 (sdi) had been disabled. I had a same-sized, unassigned hot spare already in the tower, so I followed the instructions below to replace it with the hot spare (sdl). (I did not remove the disabled disk, as I am currently thousands of miles away from the tower and won't be near it for several more months.)

 

https://lime-technology.com/wiki/Replacing_a_Data_Drive

 

When I started the array, unRAID began to rebuild disk6. Shortly thereafter, however, a notification reported a problem with the hot spare disk as well. Shortly after that, another disk -- disk7 (sdj) -- showed up as unmountable.

 

I have another disk in the array (disk8) that is empty, the same size, and could be used to rebuild disk6 (my first priority, if possible). That disk, however, was set up with encryption a while back, but it never seemed to work properly, always displaying "Unmountable: Unsupported partition layout" for unknown reasons. I didn't need the space on that drive at the time, so I figured I'd get around to reformatting it (unencrypted) at some point in the future.

 

In short, it seems like three drives (two data drives, one hot spare) *might* have developed/exhibited problems all at once. I don't really understand what's going on. I should, in theory, have a working empty disk in the array (disk8) that could be used as a potential drive replacement.

 

If at all possible, I would like to remove the empty disk8 from the array and use it to rebuild disk6. I don't know whether that's possible, and if so, how to proceed.

 

More generally, I am looking for any and all recommendations regarding how best to move forward. I have attached a console screenshot and a diagnostics bundle.

 

Any thoughts? Any other information I can provide that would be helpful?

unraid-console.png

tower-diagnostics-20180818-0308.zip


In order to rebuild any disk, all other assigned disks must be present, so there is no way to take a disk in the array and use it to rebuild a different disk, even if it is "empty". All bits of a disk have some value even if "empty", and those bits are all part of the current parity. Only a clear disk (all bits zero) can be removed without affecting parity, and an "empty" disk is not clear.
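To make the parity point concrete, here is a minimal sketch in Python. It assumes simple single (XOR) parity and uses made-up four-byte "disks"; the helper names `xor_parity` and `rebuild` are hypothetical and are not part of unRAID. It only illustrates the principle: a rebuild needs every other disk, an "empty" disk still contributes its bits to parity, and only an all-zero disk contributes nothing.

```python
from functools import reduce


def xor_parity(disks):
    """Return the byte-wise XOR of equally sized disk images."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*disks))


def rebuild(missing_index, disks, parity):
    """Reconstruct one disk by XOR-ing parity with every *other* disk."""
    others = [d for i, d in enumerate(disks) if i != missing_index]
    return xor_parity(others + [parity])


# Toy data (illustrative values only). disk_empty holds no files, but its
# bytes are not all zero, so it is "empty" without being "clear".
disk_a = bytes([0x12, 0x34, 0x56, 0x78])
disk_b = bytes([0xAB, 0xCD, 0xEF, 0x01])
disk_empty = bytes([0x00, 0x00, 0x5A, 0xA5])
disk_clear = bytes(4)  # all bits zero

parity = xor_parity([disk_a, disk_b, disk_empty])

# With all other disks present, the missing disk is recovered exactly.
rebuilt_a = rebuild(0, [disk_a, disk_b, disk_empty], parity)

# If the "empty" disk is pulled out first, the rebuild produces wrong data,
# because its bits are still folded into the existing parity.
wrong_a = xor_parity([disk_b, parity])

# A clear disk changes nothing: parity is the same with or without it.
parity_with_clear = xor_parity([disk_a, disk_b, disk_empty, disk_clear])
```

Here `rebuilt_a` equals `disk_a`, `wrong_a` does not, and `parity_with_clear` equals `parity`, which is why only a cleared disk can leave the array without invalidating parity.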

 

Do you have good backups of any important and irreplaceable data? 

 

In addition to the write error which disabled disk6, you were getting read errors on the parity disk some time before that, and you are still getting read errors on disks 6, 7, and 8. Also, disks 6, 7, and 8 are not reporting SMART, so they are likely no longer connected.

 

I think you must have a serious hardware problem such as the disk controller which is affecting multiple disks. I don't think you are going to be able to fix this remotely.


Many thanks for the quick response. The most important data is indeed backed up off-site. While I would prefer to recover the non-critical data on disk6, I understand that may not be very likely.

 

Given that I won't be in the vicinity of this machine for several months, I agree that it's unlikely I'll be able to fix this situation soon. What I would like to do now is determine the following:

 

1. What can I do to increase the likelihood that the data on the remaining drives (1, 2, 3, 4, 5) will remain intact and uncorrupted? I suspect those drives are connected to the motherboard's disk controller, while the others are connected to a (possibly failing) Supermicro PCI card. For the non-critical data on those drives that isn't yet backed up off-site, I'll begin to get as much backed up as possible, but that will take time, and thus I'd like to do everything I can to protect this data in the interim. Anything in particular I should be doing or not doing?

 

2. What do I do when I'm back in physical proximity of the machine? Presumably the Supermicro PCI card needs to be replaced, and the entire array needs to be rebuilt from scratch. Or is that not the case? Any suggestions on what steps I should take when I arrive?


The SAS2LP driver crashed during the rebuild. All disks are likely fine, but since they are offline you'll need to reboot and get new diags. Rebooting should also bring all your data back.

 

SASLP/SAS2LP have not been recommended for a while now, and this is one of the recurring issues. You should replace it with an LSI, but for now you might get lucky and a second attempt could complete without issues, assuming the disks are in fact OK.

