Disk in error state - looking for some much needed guidance

BlkNTan · February 23, 2022

Hi all - first off, here is the overview of my server:

Unraid v: 6.9.2 (trial)

Plugins: CA, FCP, My Servers, Nerd Tools, Tips and Tweaks, Unassigned Devices, Dynamix System Buttons

No dockers or VMs

Hardware: ASUS P8B-E/4L mobo, Xeon CPU E3-1220 V2 @ 3.10GHz, 16 GB ECC RAM, 6 total disks all connected through onboard Intel C204 chipset controller

Array:

Parity: (1) WD60EFZX - 6TB

Data disks: (4) WD30EFRX - 3TB

Cache: (1) Samsung 870 EVO 500GB

Background: I originally set this hardware up about 9 years ago with Win Server 2012 Essentials and it pretty much ran fine until recently. The OS drive (Samsung 840 Pro) seemed to have a disk error and became unbootable. I restored the backup to a new 870 EVO, but then started having a problem with one of the disks in the storage pool (a WD30EFRX). Since I knew I was overdue to move to a new OS (that wasn't 10 years old!), I decided to scrap the whole thing and try Unraid instead of another version of Win Server.

I set the Unraid server up last week with a new WD60EFZX (6TB) as the parity drive and the 870 EVO (basically brand new) as the cache drive. I used 4 WD30EFRX's for the data disks. During the first parity check, one of my WD30EFRX's had read errors. I checked all the cables and replaced it with another (I have 6 total) - not the one that had the errors previously in the Win Server setup. This time around the parity check was all good so I started up the array and began restoring all my data.

Problem: After restoring all my data, things seemed to be running well. Then I started working on setting up backups to iDrive using their Linux scripts. I got it pretty much figured out and ran two selective backups (as tests) successfully. On the third one, it hung and then I lost connectivity to the WebGUI. I logged into the console directly but I could not get it to shut down cleanly - I read a lot of pages with explanations on how to do this, but apparently I didn't get all the commands to run successfully before rebooting.

After it rebooted, a parity check immediately started. During that check, Disk 3 started showing a ton of read errors. When it got to something like 48k read errors, Unraid disabled the disk and the rest of the parity check finished with no other errors (all other disks show healthy). I stopped the array, downloaded diagnostics, then shut down the server and checked all cables. I restarted the server and tried to run an extended SMART test on disk 3. The test hardly started before coming back with errors and stopping.

Unfortunately I did not get diagnostics before the unclean restart, but I attached the diagnostics from after the parity check along with the SMART test report for Disk 3.

Questions:

1) Based on the attached reports, is disk 3 definitely toast? (I assume it is)

2) Assuming yes to 1, then is this the proper procedure to follow to replace it: https://wiki.unraid.net/Manual/Storage_Management#Replacing_failed.2Fdisabled_disk.28s.29

3) Should I assume the other old WD30EFRX's are likely to fail soon also? Essentially, I have lost 3 of 6 so far. I am running extended SMART tests on them currently.

4) My plan is to add a second parity disk (another WD60EFZX 6TB), a second cache disk (another 870 EVO 500GB) and replace Disk 3 with a new WD60EFZX 6TB. But I am wondering if the other 3 WD30EFRX's are ticking time bombs and I should just decommission and replace all of them?

That's all I can think of for now. I really appreciate any help and insights since I am new to Unraid and drinking the manual, guides, etc. like a fire hose currently, but still very green at this point!

bnt-unraid-smart-20220223-0909.zip bnt-unraid-diagnostics-20220223-0908.zip

JorgeB · February 23, 2022

It's logged as a disk issue and the SMART test failed, so yes, the disk needs to be replaced.

JorgeB · February 23, 2022

8 minutes ago, BlkNTan said:

Assuming yes to 1, then is this the proper procedure to follow to replace it: https://wiki.unraid.net/Manual/Storage_Management#Replacing_failed.2Fdisabled_disk.28s.29

Yes.

8 minutes ago, BlkNTan said:

Should I assume the other old WD30EFRX's are likely to fail soon also? Essentially, I have lost 3 of 6 so far. I am running extended SMART tests on them currently.

Wait for the extended test results, but if they just did a parity check they should be OK for now.

BlkNTan · February 23, 2022

Jorge - thank you so much for the quick response!!

Quick follow-up question - is there anything specific in the SMART test report that I should be focusing on, or is it just the fact that it couldn't finish and had errors? In other words, assuming the other 3 disks pass their extended tests, is there anything I should look for in those results that might tell me the drive is ok now, but likely to have trouble soon?

Thanks again!

JorgeB · February 23, 2022

With WD disks it's good practice to monitor these attributes:

  1 Raw_Read_Error_Rate     POSR-K   200   200   051    -    997
200 Multi_Zone_Error_Rate   ---R--   100   253   000    -    0

A non-zero raw value (the last column) is never a good sign, especially if it keeps climbing. The example above is from the failed disk; the other disks are still at 0, so they should be good for now.
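
For anyone who wants to script this check, here is a minimal Python sketch that parses attribute lines in the format quoted above and reports watched attributes with a non-zero raw value. The function names and `WATCHED_IDS` are hypothetical, and the column layout (ID, name, flags, value, worst, threshold, fail, raw) is assumed to match the two lines shown in this post; other report formats would need a different pattern.

```python
import re

# Attribute IDs suggested in this thread for WD disks.
# Whether they are the right ones for other vendors is an assumption.
WATCHED_IDS = {1, 200}

# Assumed column layout, matching the lines quoted above:
# ID  NAME  FLAGS  VALUE  WORST  THRESH  FAIL  RAW
LINE_RE = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+(\S+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\S+)\s+(\d+)"
)

def parse_attributes(report_text):
    """Return {id: (name, raw_value)} for every line that matches the layout."""
    attrs = {}
    for line in report_text.splitlines():
        m = LINE_RE.match(line)
        if m:
            attrs[int(m.group(1))] = (m.group(2), int(m.group(8)))
    return attrs

def nonzero_watched(report_text, watched=WATCHED_IDS):
    """Return the watched attributes whose raw value is not zero."""
    attrs = parse_attributes(report_text)
    return {i: attrs[i] for i in sorted(watched) if i in attrs and attrs[i][1] != 0}

# The two lines from the failed disk, as posted above.
sample = """\
  1 Raw_Read_Error_Rate     POSR-K   200   200   051    -    997
200 Multi_Zone_Error_Rate   ---R--   100   253   000    -    0
"""
print(nonzero_watched(sample))  # {1: ('Raw_Read_Error_Rate', 997)}
```

This only looks at a single report. Since a climbing value matters more than a single reading, it is worth saving reports over time and comparing them.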

itimpi · February 23, 2022

Passing the extended tests is a MUST for assuming a disk is healthy.

Other signs of potential problems, even if the extended tests complete successfully, are reallocated sectors starting to increase or pending sectors being non-zero.
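
The two signs above can be checked by comparing raw values from two SMART reports taken some time apart. The following is only a sketch: `sector_warnings` is a hypothetical helper, the raw values are assumed to have been copied into plain dictionaries by hand, and the IDs used (5 for reallocated sectors, 197 for pending sectors) are the conventional SMART attribute numbers rather than anything specific to Unraid.

```python
# Conventional SMART attribute IDs (check them against your own report).
REALLOCATED = 5   # Reallocated_Sector_Ct
PENDING = 197     # Current_Pending_Sector

def sector_warnings(previous, current):
    """Compare two {attribute_id: raw_value} snapshots from the same disk.

    Warns if reallocated sectors increased between snapshots, or if the
    current snapshot shows any pending sectors at all.
    """
    warnings = []
    before = previous.get(REALLOCATED, 0)
    now = current.get(REALLOCATED, 0)
    if now > before:
        warnings.append(f"reallocated sectors increased from {before} to {now}")
    pending = current.get(PENDING, 0)
    if pending != 0:
        warnings.append(f"{pending} pending sectors")
    return warnings

# Hypothetical readings, not taken from the attached reports.
print(sector_warnings({5: 0, 197: 0}, {5: 8, 197: 2}))
print(sector_warnings({5: 0, 197: 0}, {5: 0, 197: 0}))
```

An empty list only means neither of these two signs was found; it does not replace a passed extended test.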

BlkNTan · February 24, 2022

Thanks both for the replies! I did add attributes 1 and 200 to SMART monitoring based on another post - thanks for confirming.

As an update, disk 4 passed the extended test. Disk 1 is in process.

Reallocated and pending sectors are attribute numbers 5 and 197 - correct?

Thanks again for all the help!!

BlkNTan · February 25, 2022

Update - the other 3 disks passed their extended tests. I added a replacement disk and the data rebuild finished with zero errors - yay!

Next up is adding a second parity disk and a second cache disk.

In case anyone else reads this and is looking to learn more about SMART attributes, etc., I found these two articles helpful:

https://wiki.unraid.net/Understanding_SMART_Reports

https://www.backblaze.com/blog/what-smart-stats-indicate-hard-drive-failures/

Thanks again Jorge and itimpi!
