
Disk Disabled, Unable to Add Back to the Array



General System Info:
- 3 × 10TB data drives
- 1 × 10TB parity drive
- 2 × 2TB SSDs in cache pool

 

Issue:
I logged into my UnRAID panel to start a parity check, but on the Dashboard one of my drives showed as Disabled (dev2), with its contents being emulated. I kept seeing sector errors but couldn't pinpoint the cause. All of my 10TB drives were purchased around March of 2021 and the server was built in July, so they're nowhere near end of life. The only thing I can think of is a forced shutdown from when my cat sat on the power button too long and the array wasn't able to shut down gracefully.

Attempts to Resolve:
1. I found this thread in the UnRAID forum and followed Squid's advice. However, after stopping the array and attempting to re-add the drive, it throws an error, the slot reverts to unassigned, and the problem drive no longer appears in the dropdown.

2. I tried again, this time following the official UnRAID docs to remount the drive and rebuild it from parity.
3. Restarted the UnRAID system a few times as a sanity check.
4. Checked the SATA connections to the motherboard and drives, unplugging and re-plugging each one; nothing changed.


I've done everything short of pulling the drive and plugging it into my main PC to run further tests on it. Before I do that, I wanted to post my diagnostics and see if someone else has a better idea of what's going on and how to fix it.
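(Edit: for reference, the tests I had in mind can be run from any Linux box with smartmontools installed; /dev/sdX below is a placeholder for whatever the drive enumerates as. Just a sketch, not gospel.)

# Health summary and full SMART dump
smartctl -H /dev/sdX
smartctl -a /dev/sdX

# Start an extended self-test, then check the results once it finishes
smartctl -t long /dev/sdX
smartctl -l selftest /dev/sdX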

zawarudo-diagnostics-20211213-0403.zip

25 minutes ago, JorgeB said:

Was this disk2?

 

Model Family:     Seagate Enterprise Capacity 3.5 HDD
Device Model:     ST10000NM0086-2AA101
Serial Number:    ZA2E1Z0F

 

If yes, it appears to be failing and needs to be replaced with a new one.

Yes, that is the one that is currently failing. Hopefully I can still RMA it with Seagate since it was purchased earlier this year.

In the meantime, I have another drive that is still registered in the array as disk3. That drive isn't attached to either share, and I'm fine with dropping its contents. Could I shrink the array from 3 + 1 parity to 2 + 1 parity, moving disk3 to replace disk2, and re-add my RMA'd or replacement drive as disk3 later? Or is that not how shrinking the array works?

Also, for my own education, can you tell that's the failing drive because of the sector errors? Or are there other useful logs that tell you that?


Some SMART attributes do not look good, in particular #187, 197, 198.

 

ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     POSR--   054   049   044    -    66767508
  3 Spin_Up_Time            PO----   092   091   000    -    0
  4 Start_Stop_Count        -O--CK   100   100   020    -    43
  5 Reallocated_Sector_Ct   PO--CK   100   100   010    -    2576
  7 Seek_Error_Rate         POSR--   080   060   045    -    4395031003
  9 Power_On_Hours          -O--CK   097   097   000    -    2837
 10 Spin_Retry_Count        PO--C-   100   100   097    -    0
 12 Power_Cycle_Count       -O--CK   100   100   020    -    42
184 End-to-End_Error        -O--CK   100   100   099    -    0
187 Reported_Uncorrect      -O--CK   001   001   000    -    282
188 Command_Timeout         -O--CK   100   097   000    -    1 9 9
189 High_Fly_Writes         -O-RCK   071   071   000    -    29
190 Airflow_Temperature_Cel -O---K   062   047   040    -    38 (Min/Max 37/38)
191 G-Sense_Error_Rate      -O--CK   100   100   000    -    268
192 Power-Off_Retract_Count -O--CK   100   100   000    -    10
193 Load_Cycle_Count        -O--CK   098   098   000    -    5469
194 Temperature_Celsius     -O---K   038   053   000    -    38 (0 21 0 0 0)
195 Hardware_ECC_Recovered  -O-RC-   078   064   000    -    66767508
197 Current_Pending_Sector  -O--C-   099   094   000    -    880
198 Offline_Uncorrectable   ----C-   099   094   000    -    880
199 UDMA_CRC_Error_Count    -OSRCK   200   200   000    -    0
200 Pressure_Limit          PO---K   100   100   001    -    0
240 Head_Flying_Hours       ------   100   253   000    -    2809h+04m+29.660s
241 Total_LBAs_Written      ------   100   253   000    -    17757413192
242 Total_LBAs_Read         ------   100   253   000    -    80081644090

 

Possibly other things too.
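(If you want to watch just those attributes from the console, something like this works; /dev/sdX is a placeholder for the device:)

smartctl -A /dev/sdX | grep -E 'Reallocated_Sector|Reported_Uncorrect|Current_Pending_Sector|Offline_Uncorrectable'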

34 minutes ago, banangineer said:

Could I shrink the array from 3 + 1 parity to 2 + 1 parity, moving disk3 to replace disk2, and re-add my RMA'd or replaced drive back as disk3 later?

No, you could move any data from the emulated disk2 to other disk(s) and then do a new config and re-sync parity.
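(A minimal sketch of that copy from the console, assuming disk2 is the emulated disk, disk3 has room, and /mnt/diskN are the standard UnRAID per-disk mount points; the destination folder name is just an example:)

rsync -avX --progress /mnt/disk2/ /mnt/disk3/rescued-from-disk2/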

 

35 minutes ago, banangineer said:

Also, for my own education, can you tell that's the failing drive because of the sector errors?

Disk has a lot of pending sectors. Pending sectors (attribute 197) are sectors the drive failed to read and has flagged for reallocation; 880 of them on a drive with under 3,000 power-on hours points to failing media.

10 minutes ago, JorgeB said:
46 minutes ago, banangineer said:

Could I shrink the array from 3 + 1 parity to 2 + 1 parity, moving disk3 to replace disk2, and re-add my RMA'd or replaced drive back as disk3 later?

No, you could move any data from the emulated disk2 to other disk(s) and then do a new config and re-sync parity.

You make a good point. In the meantime I'm going to replace the drive with another similar-spec drive and hope for the best. I'll avoid shrinking the array and causing further issues.

 

13 minutes ago, JorgeB said:

Disk has a lot of pending sectors.

Thanks, I'll do some research to get a better idea, I appreciate the direction.

 

14 minutes ago, ChatNoir said:

Some SMART attributes do not look good, in particular #187, 197, 198.

I remember #187 showing up as a warning in UnRAID, so that checks out.

I'll go ahead and get a replacement drive in ASAP and ship off my dud for an RMA. I'll update this thread when the new drive is in and the bad drive has been replaced.


Just to update the situation.

Bought an IronWolf NAS drive and swapped it in. Everything seemed fine up until the 19% mark of the rebuild, when I started getting tons of errors on disk1 (one of the drives that had been working).

I had started the rebuild and gone out to get food; several hours later I came back to the attached screenshots and tons of chirping from disk1. Can I assume I have a second failed drive as well? My parity drive and disk3 are still fine, but disk1 has several errors and the count keeps climbing. I was able to grab most of what I needed from it, albeit slowly.

[screenshots attached]

5 hours ago, JorgeB said:

Disk1 is also failing, which means single parity can't help; you can try using ddrescue on both failing disks.
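(For anyone finding this later, a typical ddrescue run against a whole disk looks something like the sketch below; the device names are placeholders, and the mapfile lets the copy resume and retry bad areas later. Double-check source and destination before running.)

# First pass: copy everything readable, skip the slow scraping of bad areas
ddrescue -f -n /dev/sdX /dev/sdY /root/disk1.map

# Second pass: go back and retry the bad areas a few times
ddrescue -f -r3 /dev/sdX /dev/sdY /root/disk1.map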

I was able to gather everything I needed from the failing drive, albeit slowly. The share didn't die completely, so I was lucky the data could still be copied off.

 

At this point both drives are essentially dead to me. If I can keep my cache drives so I at least have my Docker and VM data, I'll be fine with nuking the array.

 

What are the next steps? Can I remove the second failed drive, create a new config with the three remaining drives, and add the other drives back when they return from RMA?

1 minute ago, JorgeB said:

Yes, drives added later will need to be cleared, but they will be empty, so that's not a problem.

That's fine with me.

 

If I go with a new config for my array, does that reset my cache drives? They're pooled, and even though I've backed up the most important parts, can I assume that since they're separate from the array they're unaffected if I rebuild the array from scratch?

 

10 minutes ago, banangineer said:

If I go with a new config for my array, does that reset my cache drives? They're pooled, and even though I've backed up the most important parts, can I assume that since they're separate from the array they're unaffected if I rebuild the array from scratch?

New Config only resets your disk assignments. It doesn't change anything on any disk, except rebuilding parity to any disks assigned to any parity slot.

2 minutes ago, trurl said:

New Config only resets your disk assignments. It doesn't change anything on any disk, except rebuilding parity to any disks assigned to any parity slot.

I think that may be what I have to do for the time being. I don't really want to drop more cash on drives until after I've RMA'd the two failing ones.

 

Assuming the new config only affects the array, I will probably go ahead and remove the second failed drive for RMA, then build a new config using my three (hopefully) working drives. Once the others are back from RMA, I'll add a second parity disk and just keep building out the array from there.

 

Is this the recommended approach?

