tucansam Posted September 29, 2022 (edited) Disk 15 showed tons of unrecoverable errors, so I replaced it. It came up as "unmountable," so I checked the box to format it, and clicked format. Disk 16 immediately went into an error state (red X). I have two parity disks, so I began reconstruction of Disk 15, and just bought an overnight replacement for Disk 16 (I keep one spare on hand and used it to replace Disk 15). During the rebuild, Disk 19 has now thrown a red X! I now have one freshly formatted disk that was 0.000001% done being reconstructed, and two disks that have just thrown red X's. My only priority at this point is data preservation; all other considerations are secondary. This array has given me absolutely no end of trouble since Day 1, with months or even years of trouble-free running, followed by weeks of multiple cascading disk failures, errors, etc. They all come in spurts. I've swapped cables, power supplies, etc. over the years. And years. And years. I do have 20+ disks, which is far too many, and I'm trying to talk myself into dropping $$$$$$ on a brand new server with the same capacity but a third the number of disks. Too many variables with this many drives, cables, controllers, etc. In the meantime.... The disks that are all failing right now are on an HBA and a SAS expander in a separate chassis. Not sure if that's a coincidence or not, but I doubt it. The red X's started right after I pulled the tower to swap Disk 15 -- while I was in there I reseated all of the cables and data connectors on the disks in that chassis (6 disks total, three have now apparently died). The original Disk 15 was definitely dead as per SMART. These other two disks are probably fine (99.9% of my red X's over the decade+ that I've used unraid have been false alarms). Once again.... many disks down.... data preservation is key. How do I proceed? Thank you. Edited September 29, 2022 by tucansam
itimpi Posted September 29, 2022 You say that you issued a format command on disk15 before trying to rebuild it? You would have got a big warning NOT to do this unless you were prepared to lose the disk contents! The format would have created an empty file system on the emulated disk15 and updated parity to reflect this, so a rebuild would just end up with an empty disk, as all a rebuild does is make the physical disk match the emulated one. Do you still have the original disk15 untouched? If it has not completely failed, it could be the best chance of recovering the contents.
tucansam Posted September 29, 2022 Yes, I still have the original Disk 15. I am more concerned about the other two disks that are now showing up as red X's, although I don't believe them to be truly problematic, as I looked at all the SMART data for them. My old procedure for disk replacement was to pre-clear the disk on a separate unraid server (this was years ago). Then, at some point, the new unraid version started either doing it automatically (I think I remember that being a thing) or I just stopped doing it. I would pop in a new disk, start the array, unraid would rebuild it, and it's off to the races. The last three times I've replaced a disk, this has happened: the array starts, the disk shows up as "unmountable" and is offline, yet a parity sync starts. I end up confused and format the disk. In fact, the very last time this happened, I lost data as well, but I thought it was me mis-remembering something along the way and screwing it up. Right now I have the server powered down with two red X disks and one that needs to be rebuilt. Plus the old drive that I'm replacing.
itimpi Posted September 29, 2022 Share Posted September 29, 2022 The correct handling of unmountable disks is covered here in the online documentation accessible via the ‘Manual’ link at the bottom of the GUI or the DOCS link at the top of each forum page. This applies whether it is happening to an actual physical drive or to an emulated one (which is what you will have is the drive is shown as disabled with a red ‘x’). A format is never the correct answer if you want to keep any data. In the case of an emulated drive then we always recommending trying the check/repair before attempting a rebuild. The reason is that since a rebuild only makes a physical drive match the emulated one if for any reason the repair went badly you will at that point still have the physical disable drive available untouched to use as an alternative for data recovery purposes. Quote Link to comment
tucansam Posted September 29, 2022 Here is the current state of things. Disk 15 is the new disk, formerly "unmountable," which got 30-60 seconds into a format before I aborted and shut down the array. I have the original Disk 15, the one with SMART errors. Disk 16 is physically present, but not showing as assigned. Disk 19 is also red X'd. I have two new-in-box 8TB disks waiting, arrived a few hours ago. I have absolutely no idea how to proceed at this point (two parity disks, if it matters). At the very least I'd like to get 16 and 19 back in the game, and worry about 15 later. Unless there is a better order to this.....
JorgeB Posted September 30, 2022 Please post the diagnostics.
tucansam Posted September 30, 2022 It's also listing all three disks as "to be encrypted." ffs2-diagnostics-20220930-0409.zip
JorgeB Posted September 30, 2022 Share Posted September 30, 2022 Diags are after rebooting so we can't see what happened, but disks 16 and 19 look mostly OK, disable disk spin down and run an extended test on both, then post new diags. Quote Link to comment
tucansam Posted September 30, 2022 Running extended diags on 19 now. Disk 16 says "no device." If I select the disk that was (formerly?) it, will it try to reconstruct it if I start the array? I don't know why that disk dropped off. It's still good, as is the data on it.
tucansam Posted September 30, 2022 ETA: I selected the "old" disk 16 so I could run an extended test (doing it now) and it shows as "new device."
JorgeB Posted September 30, 2022 Share Posted September 30, 2022 That's fine, we will need to manually re-enable at least one of them anyway, you currently have 3 invalid disks with dual parity, so not possible to rebuild as is. Quote Link to comment
tucansam Posted September 30, 2022 After starting extended SMART tests on 16 and 19, 16 shows the unraid equivalent of the hourglass and never loads any data, and 16 and 19 both show "cannot read attributes." How do I begin rebuilding the array to maximize data preservation? Thank you again.
JorgeB Posted October 1, 2022 That suggests the disk dropped offline; check/replace the cables (the power cable also) and try again.
tucansam Posted October 1, 2022 Disk 19 reported no errors on the extended test. Disk 16 stopped showing up in the list of disks. I re-seated all power and data cables (the external disks are directly connected to a break-out cable coming from an HBA with external SAS connections) and rebooted everything. Disk 16 showed up again (still showing as "new disk") and I have started another extended test (it reported "host disconnect" or something to that effect). It is showing 97 "Reported Uncorrect" from previous tests. I'll advise when the extended test finishes -- thank you for your help.
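With 20+ disks, a quick way to spot-check counts like that "Reported Uncorrect" value across the whole array is to filter the raw-value column of `smartctl -A` output with awk. A minimal sketch using an embedded sample table (hypothetical values) instead of a live device:

```shell
# Two sample rows in `smartctl -A` layout:
# ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
smart_sample='187 Reported_Uncorrect 0x0032 001 001 000 Old_age Always - 97
5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0'

# Flag the usual trouble attributes (5, 187, 197, 198) when their raw value is nonzero.
warnings=$(echo "$smart_sample" | awk '$1==5 || $1==187 || $1==197 || $1==198 { if ($10+0 > 0) print "WARN: attr " $1 " (" $2 ") raw=" $10 }')
echo "$warnings"
```

On a real system you would pipe `smartctl -A /dev/sdX` into the same awk filter in a loop over the disks; a drive that only logs attribute 187 (like the 97 here) is often a cabling/link problem rather than failing media, which fits what the cable swap later confirmed.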
tucansam Posted October 1, 2022 The test didn't run for very long (not sure how long, as I wasn't babysitting it, but it wasn't more than 30 minutes) before the host interrupt error popped up again. See attached.
Kilrah Posted October 1, 2022 You need to disable disk spindown for extended tests, otherwise it'll get interrupted when unraid tells the drives to spin down.
tucansam Posted October 1, 2022 Unless I'm missing it somewhere else: under Settings -> Disk Settings, I set "default spindown delay" to "Never," and the extended SMART on #16 still isn't running for more than a few minutes without "Interrupted (host reset)" being the result. I see there is a place to set individual disk spindown as well -- I have set it to "never" for #16 specifically and run extended SMART yet again.
tucansam Posted October 1, 2022 No effect. Extended SMART won't even run for a few minutes before the host interrupt occurs.
tucansam Posted October 1, 2022 The short SMART test lasted 33 seconds and got to 50% before I got the host interrupt message on #16.
JorgeB Posted October 2, 2022 Do you have a spare you could use?
tucansam Posted October 2, 2022 I have two spare 8TB new-in-box disks, yes. BUT, I replaced the long break-out cable from the SAS controller and Disk 16 has completed an extended SMART test without error.
tucansam Posted October 2, 2022 Diags attached. ffs2-diagnostics-20221002-0752.zip
JorgeB Posted October 2, 2022 Share Posted October 2, 2022 OK, good, if I understood correctly disk15 was formatted, so we are assuming data there is lost? Quote Link to comment
tucansam Posted October 2, 2022 A format was begun, but aborted not even a minute into it. I do have the original Disk 15.
trurl Posted October 2, 2022 On 9/29/2022 at 7:38 AM, tucansam said: My old procedure for disk replacement was to pre-clear the disk on a separate unraid server (this was years ago). Then, at some point, the new unraid version started either doing it automatically (I think I remember that being a thing) or I just stopped doing it. I would pop in a new disk, start the array, unraid would rebuild it, and it's off to the races. FYI - A clear disk has never been required on any version to REPLACE a disk. Unraid only requires a clear disk when ADDING a disk to a NEW slot in an array that already has valid parity. This is so parity will remain valid: since a clear disk is all zeros, it has no effect on parity. When ADDING a disk, Unraid will clear it if it hasn't been precleared. For REPLACING a disk, it doesn't matter at all what was on the replacement disk, since it is going to be completely overwritten. Not formatting an unmountable disk has also been that way on all versions of Unraid, but old versions may not have had good warnings against it. Format is a write operation that updates parity (how could parity be valid otherwise?), so a rebuild can only result in a formatted disk. 1 hour ago, tucansam said: A format was begun, but aborted not even a minute into it. I do have the original Disk 15. It doesn't matter much how long the format ran. Format doesn't take very long anyway; it just writes a small amount of metadata to represent an empty filesystem. Hang on to that original disk 15, you will need it to copy its data back to the array after you get the other disks taken care of.
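trurl's point about clear disks follows directly from the parity arithmetic: Unraid's single parity is a byte-wise XOR across the data disks, and XOR with zero is a no-op, so an all-zero disk can be ADDED to a new slot without changing a single parity byte. A toy sketch with made-up byte values:

```shell
# Bytes at one position on three existing data disks (hypothetical values).
d1=0xA5; d2=0x3C; d3=0xF0

# Single parity for that position is the XOR of the data bytes.
parity=$(( d1 ^ d2 ^ d3 ))

# XOR-ing in a new all-zero disk leaves the parity byte unchanged,
# which is why a clear disk can be added without a parity sync.
parity_after_add=$(( d1 ^ d2 ^ d3 ^ 0x00 ))

printf 'parity=0x%02X  after adding zeroed disk=0x%02X\n' "$parity" "$parity_after_add"
```

The same reasoning explains why a format "counts" against the rebuild: writing an empty filesystem changes data bytes, so parity is updated to match, and the rebuild faithfully reproduces the freshly formatted (empty) disk.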