JimmyC Posted March 28, 2022

I was performing a routine upgrade from an old 1.5 TB drive to my previous 8 TB parity drive (I upgraded parity to 14 TB about a week back). Upon boot, I saw that I needed to assign disk 3, as expected. However, once I started the array and the data rebuild commenced, disk 1 immediately entered an error state. Unraid alerts post to a Slack channel I set up:

Files Davis 4:46 PM
Warning [TOWER] - Disk 3, drive not ready, content being reconstructed WDC_WD80EMAZ-00WJTA0_7JJYY38C (sdd)
4:47
Alert [TOWER] - Disk 1 in error state (disk dsbl) WDC_WD30EFRX-68AX9N0_WD-WMC1T3212961 (sdb)

I then lost remote access to the system completely (web/SSH/ping), although the server was still powered on. I run it headless and didn't have a monitor handy, so I power-cycled first to see if anything would change. The server booted, but I lost ping again shortly after and never got to the GUI on this boot.

I pulled the server, swapped the SATA cable for disk 1 with a spare while migrating disk 3 back to the original drive, and booted again. I'm still seeing errors on disk 1, and disk 3 now shows not installed; I'm guessing that's because I did commit the prior change before the disk 1 problem. I started the array in maintenance mode to run a file system check as recommended in the wiki. Results of reiserfsck on disk 1:

reiserfsck 3.6.27
Will read-only check consistency of the filesystem on /dev/md1
Will put log info to 'stdout'

The problem has occurred looks like a hardware problem. If you have bad blocks, we advise you to get a new hard drive, because once you get one bad block that the disk drive internals cannot hide from your sight, the chances of getting more are generally said to become much higher (precise statistics are unknown to us), and this disk drive is probably not expensive enough for you to risk your time and data on it. If you don't want to follow that advice, then if you have just a few bad blocks, try writing to the bad blocks and see if the drive remaps the bad blocks (that means it takes a block it has in reserve and allocates it for use in place of that block number). If it cannot remap the block, use the badblock option (-B) with the reiserfs utils to handle this block correctly.

bread: Cannot read the block (2): (Input/output error).

I'm at a point where I'm not sure what my next step should be to reduce the potential for data loss. Since I've only been running single parity, I currently have an unrecoverable array, but I do still have the 1.5 TB drive with whatever data it contained, and I believe it to be in a working state. Diagnostics attached, albeit from my most recent boot only. I have not shut down or made any further array changes.

tower-diagnostics-20220327-1856.zip
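The "bread: Cannot read the block" line above comes from the device layer, which points at hardware or cabling rather than filesystem corruption. A minimal sketch of triaging a saved reiserfsck log from the console; the log path is hypothetical, and the sample heredoc text stands in for output you would capture with something like `reiserfsck --check /dev/md1 2>&1 | tee <log>` while in maintenance mode:

```shell
# Sample log text standing in for a real captured reiserfsck log
# (hypothetical path; lines taken from the output quoted in the post).
cat > /tmp/reiserfsck-disk1.log <<'EOF'
reiserfsck 3.6.27
Will read-only check consistency of the filesystem on /dev/md1
bread: Cannot read the block (2): (Input/output error).
EOF

# Count low-level read failures; any hits suggest the drive (or its
# cable/controller) could not deliver the block at all, as opposed to
# reiserfsck finding inconsistent filesystem metadata.
errors=$(grep -c 'Input/output error' /tmp/reiserfsck-disk1.log)
echo "low-level read errors: $errors"
```

If the count is non-zero, checking cables and SMART data before touching the filesystem is usually the safer order of operations.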
JorgeB Posted March 28, 2022 (Solution)

Single parity can't emulate two disks, so there's no point in trying a filesystem check. You can force-enable disk1 to try again to rebuild disk3. This will only work if parity is still valid:

-Tools -> New Config -> Retain current configuration: All -> Apply
-Check all assignments and assign any missing disk(s) if needed, including disk3
-IMPORTANT - Check both "parity is already valid" and "maintenance mode" and start the array (note that the GUI will still show that data on parity disk(s) will be overwritten; this is normal, as it doesn't account for the checkbox, but nothing will be overwritten as long as it's checked)
-Stop the array
-Unassign disk3
-Start the array (in normal mode now); ideally the emulated disk3 will now mount and its contents will look correct. If it doesn't, you should run a filesystem check on the emulated disk or post new diags
-If the emulated disk mounts and contents look correct, stop the array
-Re-assign the disk to rebuild and start the array to begin
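Before re-assigning, the "emulated disk3 mounted" check in the steps above can also be confirmed from the console. A minimal sketch, assuming the standard Unraid layout where disk3 mounts at /mnt/disk3; the sample heredoc stands in for the live /proc/mounts:

```shell
# Sample mount table standing in for the live /proc/mounts on the server.
cat > /tmp/mounts.sample <<'EOF'
/dev/md1 /mnt/disk1 reiserfs rw 0 0
/dev/md3 /mnt/disk3 reiserfs rw 0 0
EOF

# A missing /mnt/disk3 entry after starting the array in normal mode
# would mean the emulated disk did not mount, in which case a filesystem
# check on the emulated device is the next step per the list above.
if grep -q ' /mnt/disk3 ' /tmp/mounts.sample; then
  status="emulated disk3 mounted"
else
  status="disk3 NOT mounted - consider a filesystem check on the emulated disk"
fi
echo "$status"
```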
JimmyC Posted March 28, 2022 (Author)

I believe parity is still valid at this point, and I will proceed with this suggestion and report back. To clarify, would you want me to create the new config with new or old Disk 3 connected? I was thinking that if I could restore the previous config with valid parity and emulate the possibly failing disk 1, I could then use my spare 8 TB to replace the failed 3 TB disk rather than the smaller 1.5 TB for now.
JorgeB Posted March 28, 2022

2 minutes ago, JimmyC said:
To clarify, would you want me to create the new config with new or old Disk 3 connected?

That is with the new disk. If that fails, or if you still have the old disk and it's healthy, you can do a new config with it, sync parity, then try upgrading again.
JimmyC Posted March 28, 2022 (Author)

I already had the old Disk 3 connected, so I tried the new config operation as outlined, and I am no longer seeing Disk 1 disabled in maintenance mode. I have a parity check (no corrections) running currently just to verify where everything stands.

I've reviewed Disk 1's SMART data and am not terribly concerned about the overall health of that drive. My guess is that I bumped the SATA cable when I was in the case and caused a temporary comms issue on it.

I feel like you have me on the right track here, and I will report back in a day or two when the check and rebuild have finished. I also feel now like this is some simple troubleshooting I should have known about, but (knock on wood) I've been running Unraid for about a decade and this is the largest problem I've encountered. Pretty solid code.
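A hedged sketch of the kind of SMART review mentioned here: a non-zero UDMA_CRC_Error_Count alongside zero reallocated and pending sectors is the classic signature of a cabling problem rather than failing media. The sample attribute lines below are illustrative stand-ins, not the actual values from this drive, which you would get with `smartctl -A /dev/sdb`:

```shell
# Sample SMART attribute lines standing in for real smartctl -A output.
cat > /tmp/smart.sample <<'EOF'
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       12
EOF

# Pull out the three attributes that separate "bad cable" from "bad disk":
# CRC errors count link-level transfer failures (cable/connector), while
# reallocated and pending sectors reflect the health of the media itself.
result=$(awk '$2 ~ /Reallocated_Sector_Ct|Current_Pending_Sector|UDMA_CRC_Error_Count/ {printf "%s=%s\n", $2, $NF}' /tmp/smart.sample)
echo "$result"
```

With values like these, reseating or replacing the SATA cable and watching whether the CRC count keeps rising is a reasonable next step; the raw CRC counter never resets, so only further increases matter.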
JimmyC Posted April 4, 2022 (Author)

Just to update: I was able to use this guide to revive my array with no data loss.