[Help] Is my disk DOA?

codearoni · November 16, 2020

Solution:

If you own WD Red Plus drives and are seeing read errors, perform the following steps:

1) Run an Extended SMART test on the drive that is reporting errors.

2) Download the SMART results when the test is complete and check Raw_Read_Error_Rate in the .txt file.

3) The raw value of Raw_Read_Error_Rate should be zero for WD Red Plus drives specifically (this is not true for all HDDs; ask for help if you're using a different type of drive)!

4) If the Raw_Read_Error_Rate is not zero, your Red Plus drive will need replacing. RMA it if under warranty.

5) For extra certainty, run an Extended SMART test on the replacement drive to ensure it's working as expected.

6) Add "1,200" (no quotes) to the Smart Attribute Notifications of your WD Red Plus drives (textbox next to "Custom Attributes")
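For anyone who would rather script steps 2 to 4 than read the .txt file by eye, here is a minimal Python sketch. It assumes the report contains a smartctl-style attribute table with a header row that includes a RAW_VALUE column, as in the excerpt quoted later in this thread. The function name raw_values and the sample text are only illustrative, and the zero check applies to WD Red Plus drives only.

```python
def raw_values(report_text, wanted_ids=(1, 200)):
    """Return {attribute_id: raw_value} for the requested SMART attribute IDs.

    Assumes a smartctl-style table whose header row starts with "ID#" and
    contains a "RAW_VALUE" column, and that the raw value is a plain integer
    (true for attributes 1 and 200 on the drive discussed in this thread).
    """
    raw_col = None
    found = {}
    for line in report_text.splitlines():
        fields = line.split()
        if not fields:
            continue
        # Locate the RAW_VALUE column from the header row.
        if fields[0] == "ID#" and "RAW_VALUE" in fields:
            raw_col = fields.index("RAW_VALUE")
            continue
        # Skip anything that is not an attribute row below the header.
        if raw_col is None or len(fields) <= raw_col or not fields[0].isdigit():
            continue
        attr_id = int(fields[0])
        if attr_id in wanted_ids:
            found[attr_id] = int(fields[raw_col])
    return found


# Sample rows copied from the SMART report quoted in this thread.
sample = """\
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     PO-R--   086   086   016    -    162
"""

result = raw_values(sample)
for attr_id, raw in result.items():
    status = "OK" if raw == 0 else "non-zero, investigate"
    print(f"attribute {attr_id}: raw={raw} ({status})")
```

To use it on a real report, read the downloaded .txt file into a string and pass it to raw_values instead of the sample.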

OP Below:
Hi all! New Unraid user here.

Everything has been working swimmingly up until my first mover job (cache dumping contents onto spinny plates).

My disk1 is receiving a crazy amount of errors, screenshot:

1216363019_ScreenShot2020-11-16at1_53_02PM.png.d41f33b311215c106b27851e2a59c396.png

This system, including all the drives, is brand new.

I downloaded my diagnostics, and found thousands of these in the syslog.txt:

Nov 16 10:08:58 Alexandria kernel: md: disk1 read error, sector=15032479712
Nov 16 10:08:58 Alexandria kernel: md: disk1 read error, sector=15032479720
Nov 16 10:08:58 Alexandria kernel: md: disk1 read error, sector=15032479728
Nov 16 10:08:58 Alexandria kernel: md: disk1 read error, sector=15032479736
Nov 16 10:08:58 Alexandria kernel: md: disk1 read error, sector=15032479744
Nov 16 10:08:58 Alexandria kernel: md: disk1 read error, sector=15032479752
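With thousands of these lines, a quick tally is easier to read than the raw log. The following is a rough Python sketch, assuming the lines keep exactly the "md: diskN read error, sector=..." shape shown above. The function name summarize_read_errors is made up for illustration.

```python
import re
from collections import Counter

# Matches lines such as: "kernel: md: disk1 read error, sector=15032479712"
MD_READ_ERROR = re.compile(r"md: (disk\d+) read error, sector=(\d+)")


def summarize_read_errors(syslog_text):
    """Count md read errors per disk and track the lowest/highest sector hit."""
    counts = Counter()
    sector_range = {}
    for match in MD_READ_ERROR.finditer(syslog_text):
        disk, sector = match.group(1), int(match.group(2))
        counts[disk] += 1
        low, high = sector_range.get(disk, (sector, sector))
        sector_range[disk] = (min(low, sector), max(high, sector))
    return counts, sector_range


# The six lines quoted above.
sample = """\
Nov 16 10:08:58 Alexandria kernel: md: disk1 read error, sector=15032479712
Nov 16 10:08:58 Alexandria kernel: md: disk1 read error, sector=15032479720
Nov 16 10:08:58 Alexandria kernel: md: disk1 read error, sector=15032479728
Nov 16 10:08:58 Alexandria kernel: md: disk1 read error, sector=15032479736
Nov 16 10:08:58 Alexandria kernel: md: disk1 read error, sector=15032479744
Nov 16 10:08:58 Alexandria kernel: md: disk1 read error, sector=15032479752
"""

counts, sector_range = summarize_read_errors(sample)
for disk, count in counts.items():
    low, high = sector_range[disk]
    print(f"{disk}: {count} read errors, sectors {low} to {high}")
```

If all the errors land on one disk and in one contiguous sector range, as in the snippet above, that is at least worth mentioning when posting diagnostics.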

I'm currently running the SMART extended self-test on disk1. Results TBD.

My question is: Is disk1 bunk? Given the fact that all the drives are fresh off the press, so to speak, I would expect zero errors.

Could there be a software reason for all these errors, outside of a bad disk? Looking for help here before moving forward with an RMA. Cheers!

Edited December 3, 2020 by codearoni

trurl · November 16, 2020

Syslog snippets are seldom sufficient. Without more information, best guess is bad connection, simply based on most frequent problem we see.

7 minutes ago, codearoni said:

downloaded my diagnostics

Give them to us and we will have more information to understand what is happening and make recommendations.

Attach complete Diagnostics ZIP file to your NEXT post in this thread.

codearoni · November 16, 2020

Attaching. Thank you trurl!

alexandria-diagnostics-20201116-1414.zip

trurl · November 16, 2020

This one looks like it may be a disk problem:

Nov 16 03:40:22 Alexandria kernel: ata2.00: status: { DRDY SENSE ERR }
Nov 16 03:40:22 Alexandria kernel: ata2.00: error: { UNC }
Nov 16 03:40:22 Alexandria kernel: ata2.00: configured for UDMA/133
Nov 16 03:40:22 Alexandria kernel: sd 2:0:0:0: [sdc] tag#4 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08
Nov 16 03:40:22 Alexandria kernel: sd 2:0:0:0: [sdc] tag#4 Sense Key : 0x3 [current] 
Nov 16 03:40:22 Alexandria kernel: sd 2:0:0:0: [sdc] tag#4 ASC=0x11 ASCQ=0x4 
Nov 16 03:40:22 Alexandria kernel: sd 2:0:0:0: [sdc] tag#4 CDB: opcode=0x88 88 00 00 00 00 03 80 00 4c 18 00 00 05 40 00 00
Nov 16 03:40:22 Alexandria kernel: print_req_error: I/O error, dev sdc, sector 15032405016

Let us know how the extended SMART turns out.
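As a side note on reading that log: the CDB line is the SCSI command that failed. Opcode 0x88 is READ(16), which carries an 8-byte big-endian starting block address in bytes 2 to 9 and a 4-byte block count in bytes 10 to 13. A small Python sketch (the helper name decode_read16 is just for illustration) shows that the address in the CDB is the same sector the kernel reports in the print_req_error line.

```python
def decode_read16(cdb_hex):
    """Decode the starting LBA and block count from a READ(16) CDB hex string."""
    cdb = bytes.fromhex(cdb_hex)  # whitespace between byte pairs is ignored
    if len(cdb) != 16 or cdb[0] != 0x88:
        raise ValueError("not a 16-byte READ(16) CDB")
    lba = int.from_bytes(cdb[2:10], "big")      # bytes 2..9: logical block address
    blocks = int.from_bytes(cdb[10:14], "big")  # bytes 10..13: transfer length
    return lba, blocks


# CDB bytes copied from the syslog excerpt above.
lba, blocks = decode_read16("88 00 00 00 00 03 80 00 4c 18 00 00 05 40 00 00")
print(f"read of {blocks} blocks starting at sector {lba}")
```

The decoded start address, 15032405016, matches the sector in the I/O error line, so the failing request and the reported sector refer to the same spot on the disk.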

codearoni · November 16, 2020

Just posting an update: extended SMART is still at 40%. Might not have results ready until tomorrow. Thanks again for working through this issue with me, trurl.

codearoni · November 17, 2020

Attached is my smart report for disk1. The text below the report download says "Completed without error"

alexandria-smart-20201117-0817.zip

trurl · November 17, 2020

That WD Red disk went from zero to this on SMART attribute 1:

ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     PO-R--   086   086   016    -    162

I would replace it.

codearoni · November 17, 2020

Roger. Thank you so much Trurl!

Just for my own notes and knowledge: can you briefly describe what you're seeing? Would a healthy disk have "000" for all of those fields?

trurl · November 17, 2020

1 minute ago, codearoni said:

Just for my own notes and knowledge: can you briefly describe what you're seeing? Would a healthy disk have "000" for all of those fields?

Different disk models interpret that attribute differently. For WD Red it should be zero. If you have any other disks of that model, you should click on each one to get to its page and set Unraid to monitor that attribute.

codearoni · November 19, 2020

Just an update: I spun down the array and removed the disk. It's currently in RMA.

After I get the replacement I'll start a rebuild. When it's all said and done, I'll update the OP with my steps used to triage this issue. Hoping it'll help future WD Red owners.

codearoni · November 28, 2020

Hi trurl! While I've been waiting on my RMA'd disk, I've been looking into setting up Unraid to monitor said attribute for my WD Red drives.

I've looked at the wiki plus these forums, but am unsure how to add monitoring as discussed above. I assume I go to the disk page, and enter a custom attribute (screenshot of what I'm talking about attached)? Is this correct? What would the syntax for this custom attribute look like?

trurl · November 28, 2020

Just as it says.

Custom attributes (use comma to separate numbers)

You want 1 and 200 so just put 1,200 in the blank and APPLY

codearoni · November 30, 2020

Thanks trurl. Looks good now. I was making it more complicated than it needed to be. (i.e. "Attribute = 0" trying to match the checkboxes below).

Final question: I'll be rebuilding the array soon. I am adding a 2nd parity drive and one more storage drive.

Should I: 1) spin up the array with the replacement disk ONLY, and rebuild FIRST - followed by spinning down the array, and adding the new drives.
or 2) spin up the array with the replacement disk, plus the new drives, and rebuild all together.

Couldn't find any documentation on this particular scenario in the wiki. I would prefer to do #2 as I imagine it'll be faster, but I'm obviously more interested in doing this correctly than quickly.

trurl · November 30, 2020

Assuming you mean the new data disk for a new slot you can do #2 with new config. If the new data disk isn't clear you will have to rebuild parity1 at the same time so no protection until done.

codearoni · November 30, 2020

Thanks trurl. Just to be clear: I'll be moving from 1x Parity and 3x Data drives to 2x Parity and 4x Data drives.
Sounds like adding a 2nd parity will require a rebuild on Parity #1...so I might be better off doing #1, just adding the replacement data disk and rebuilding the array. Then afterwards, spinning down the array, and adding Parity #2 and Data #4?

itimpi · November 30, 2020

8 minutes ago, codearoni said:

Then afterwards, spinning down the array, and adding Parity #2 and Data #4?

I have a feeling that Unraid will not allow these to be done in one step as adding the extra data drive starts a clear operation and adding a parity drive starts a parity sync operation - and you cannot run both of these at the same time.

codearoni · November 30, 2020

Right on, ty itimpi!

So as a general rule of thumb: adding multiple data drives at once = fine. Adding data + parity drives = not (do them as separate tasks). Makes sense. I'm just a new user and don't want to make any assumptions as to how Unraid operates.

trurl · November 30, 2020

I should have reviewed the thread since I overlooked the fact you were replacing a disk. Of course that has to be done separately and before any other changes.

You can add data and parity drives at the same time (through new config), but you must replace / rebuild a disk separately. If the disk actually needs replacing due to problems then that should be done before anything else.

codearoni · December 1, 2020

No worries trurl, I imagine you're managing thousands of threads on this board lol. I've begun a rebuild but it'll take 12 hours. Probably tomorrow I'll pop on and update the OP with a summary of steps taken for these particular drives (WD Red Plus).

codearoni · December 2, 2020

Everything has been updating swimmingly, just taking a while given the drives I got (14 hours each).

I had a question about extended SMART tests though: can I run them while the array is up and running?
Will things like mover jobs be interrupted by extended SMART tests if I run them at night?

JorgeB · December 2, 2020

17 minutes ago, codearoni said:

can I run them while the array is up and running?

Yes, they run in the background, but avoid heavy I/O or they will take much longer.

codearoni · December 3, 2020

Thanks to everyone for the help on this issue!
I've updated my OP with my triage steps. Hopefully it'll help future WD Red users.
I've got my array back online. The rebuild process was incredibly easy. Hardest part of this whole thing was waiting on the RMA drive. It's only strengthened the idea that Unraid was the right choice for my NAS.
