weirdcrap Posted March 7, 2021

I recently put a new 8TB drive in NODE and planned to use the 6TB drive it replaced in VOID. Stupidly, I did not preclear the drive first; I've done this swap many times before without issue. This time, barely 2% into the disk rebuild on VOID, the disk threw 1500+ read errors, then write errors, and UnRAID disabled it.

What is my best course of action here? I can't put the old 2TB disk that this one replaced back into the server since I already started the rebuild, right? My parity is in a paused "Read Check" state. Should I cancel it? Because UnRAID disabled the disk I can't run SMART tests from the GUI, nor can I seem to reach the disk from the command line to test it with smartctl. Of course, I don't have any spare drives on hand.

void-diagnostics-20210307-0621.zip

EDIT: I couldn't find it in the terminal because UnRAID dropped the disk and it came back under a new device name; the log shows it changed to sdw after it shit the bed.

EDIT2: Holy shit, that went south quickly. When I took it out of NODE there were no reallocated sectors. Now it's got 31 reallocated sectors in a matter of minutes:

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG   VALUE WORST THRESH TYPE     UPDATED WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f 200   200   051    Pre-fail Always  -           0
  3 Spin_Up_Time            0x0027 224   196   021    Pre-fail Always  -           7775
  4 Start_Stop_Count        0x0032 098   098   000    Old_age  Always  -           2967
  5 Reallocated_Sector_Ct   0x0033 199   199   140    Pre-fail Always  -           31
  7 Seek_Error_Rate         0x002e 100   253   000    Old_age  Always  -           0
  9 Power_On_Hours          0x0032 057   057   000    Old_age  Always  -           31759
 10 Spin_Retry_Count        0x0032 100   100   000    Old_age  Always  -           0
 11 Calibration_Retry_Count 0x0032 100   253   000    Old_age  Always  -           0
 12 Power_Cycle_Count       0x0032 100   100   000    Old_age  Always  -           40
192 Power-Off_Retract_Count 0x0032 200   200   000    Old_age  Always  -           19
193 Load_Cycle_Count        0x0032 190   190   000    Old_age  Always  -           30273
194 Temperature_Celsius     0x0022 124   094   000    Old_age  Always  -           28
196 Reallocated_Event_Count 0x0032 169   169   000    Old_age  Always  -           31
197 Current_Pending_Sector  0x0032 200   200   000    Old_age  Always  -           0
198 Offline_Uncorrectable   0x0030 100   253   000    Old_age  Offline -           0
199 UDMA_CRC_Error_Count    0x0032 200   200   000    Old_age  Always  -           0
200 Multi_Zone_Error_Rate   0x0008 100   253   000    Old_age  Offline -           0

So I guess this disk was just a ticking time bomb from the get-go?
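For anyone who hits the same thing: once a disabled disk re-enumerates under a new letter, it can usually still be reached by its serial number from the terminal. This is only a minimal sketch, assuming Python 3 is available on the box (it is not part of stock UnRAID) and using a made-up serial as a placeholder:

```python
#!/usr/bin/env python3
"""Find a disk by its serial number after it has re-enumerated (e.g. sdf -> sdw),
then dump its SMART attribute table. Must run as root; smartctl ships with UnRAID,
but Python itself may need to be installed separately."""

import json
import subprocess

TARGET_SERIAL = "WD-WX00000000"  # hypothetical serial -- use the one on the drive label


def find_device_by_serial(serial):
    """Return the /dev path whose serial matches, or None if the disk is gone."""
    out = subprocess.run(
        ["lsblk", "--nodeps", "--json", "-o", "NAME,SERIAL,SIZE"],
        capture_output=True, text=True, check=True,
    ).stdout
    for dev in json.loads(out)["blockdevices"]:
        if dev.get("serial") == serial:
            return "/dev/" + dev["name"]
    return None


dev = find_device_by_serial(TARGET_SERIAL)
if dev is None:
    print("Disk not found -- it may have dropped off the bus entirely")
else:
    print(f"Disk is currently {dev}")
    # -A prints only the vendor attribute table (reallocated/pending sectors, etc.)
    subprocess.run(["smartctl", "-A", dev], check=False)
```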
JorgeB Posted March 7, 2021

Disk dropped offline and reconnected again. This is usually a connection/power issue; check/replace the cables and try again.
weirdcrap Posted March 7, 2021

4 minutes ago, JorgeB said: Disk dropped offline and reconnected again. This is usually a connection/power issue; check/replace the cables and try again.

Did you see my edit about the 31 reallocated sectors? Those weren't there when I took the disk out of NODE. Do you still think it's a power/cable issue? These are all hot-swap bays, so I don't have to mess with cabling, and the old disk had no issues in this bay.

Is it safe to cancel the rebuild? It will just leave disk 6 emulated?
JorgeB Posted March 7, 2021

5 minutes ago, weirdcrap said: Those weren't there when I took the disk out of NODE.

I saw that, but by itself it's not a big deal. If you're sure they weren't there before the move, run an extended SMART test: if it passes, try again; if it fails, replace the disk.
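For reference, the extended test JorgeB suggests can be started and watched from the terminal once the disk is reachable again. A rough sketch only, using a placeholder device name (substitute the real one) and assuming Python 3 is installed:

```python
#!/usr/bin/env python3
"""Start an extended (long) SMART self-test and poll until it finishes.
Sketch only: /dev/sdX is a placeholder, the test can take many hours on a
6TB drive, and the drive must stay spun up for the duration."""

import subprocess
import time

DEV = "/dev/sdX"  # placeholder -- substitute the disk's current device name

# smartctl returns immediately; the drive runs the test internally.
subprocess.run(["smartctl", "-t", "long", DEV], check=True)

while True:
    # The self-test log shows either a test in progress or the final result.
    log = subprocess.run(["smartctl", "-l", "selftest", DEV],
                         capture_output=True, text=True).stdout
    if "in progress" not in log.lower():
        print(log)       # e.g. "Completed: read failure ... LBA_of_first_error"
        break
    time.sleep(600)      # check again every 10 minutes
```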
weirdcrap Posted March 7, 2021

2 hours ago, JorgeB said: I saw that, but by itself it's not a big deal. If you're sure they weren't there before the move, run an extended SMART test: if it passes, try again; if it fails, replace the disk.

@JorgeB It immediately fails both tests.

SMART Self-test log structure revision number 1
Num  Test_Description   Status                   Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline   Completed: read failure  90%        31759            45736176
# 2  Short offline      Completed: read failure  10%        31759            45739672

That's why I was surprised. NODE hadn't recorded any pending or reallocated sectors for this disk. Then I brought it home (carefully packed in a HDD box with padding) and it immediately fails right off the bat.

Thankfully it looks like B&H Photo has the 6TB Red Plus drives in stock.

Final question: since I put that 6TB disk in and started the rebuild, I have to replace it with a 6TB or larger drive, correct?
JorgeB Posted March 7, 2021

30 minutes ago, weirdcrap said: Since I put that 6TB disk in and started the rebuild, I have to replace it with a 6TB or larger drive, correct?

Yep.
weirdcrap Posted March 7, 2021

Alright, thanks. Lesson learned: don't trust a used disk just because it didn't report any problems. I also need to be better about regularly running SMART tests on my array disks so I can hopefully catch this stuff before it's a catastrophic failure and my array is no longer protected.
itimpi Posted March 7, 2021

A regularly scheduled parity check (e.g. monthly) is a good alternative to a SMART test. It has the advantage that you can use the Parity Check Tuning plugin so that it runs in increments outside prime time to minimise the impact on your daily use of the unRaid server.
weirdcrap Posted March 7, 2021

8 minutes ago, itimpi said: A regularly scheduled parity check (e.g. monthly) is a good alternative to a SMART test. It has the advantage that you can use the Parity Check Tuning plugin so that it runs in increments outside prime time to minimise the impact on your daily use of the unRaid server.

Yeah, I do run monthly parity checks: NODE gets it on the 1st and VOID on the 15th. NODE completed its parity check on 3/1 with no errors and this disk reported no issues. Then a week later I removed it from NODE, drove it 4 hours back to my house to put it in VOID, and it failed within 30 minutes. I'm not sure how else I could have caught this earlier except with monthly SMART testing of all disks.
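Something along these lines, run monthly from cron or the User Scripts plugin, would cover the regular SMART testing mentioned above. It is only a sketch: it assumes every /dev/sd? device should be tested and that Python 3 is available on the server.

```python
#!/usr/bin/env python3
"""Queue a short SMART self-test on every /dev/sd? device -- something that could
run monthly from cron or the User Scripts plugin. Sketch only: it assumes every
sd? device is a disk you want tested (it does not skip SSDs or the USB flash
drive), so adjust the filter for your own hardware."""

import glob
import subprocess

for dev in sorted(glob.glob("/dev/sd?")):
    # Quick overall health verdict first (the drive's own pass/fail assessment).
    health = subprocess.run(["smartctl", "-H", dev],
                            capture_output=True, text=True).stdout
    verdict = "PASSED" if "PASSED" in health else "check manually"
    print(f"{dev}: {verdict}")
    # Queue the short self-test; results appear later in `smartctl -l selftest`.
    subprocess.run(["smartctl", "-t", "short", dev], check=False)
```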
JorgeB Posted March 8, 2021

It's not that uncommon for a disk to "disintegrate" when moved after many hours of use in the same place, even if you're careful.