Sanity Check for Read Errors


Geth

I think I've been pretty lucky with my Unraid server: this is the first time I've received alerts for read errors, so I'm dealing with this for the first time. Last night I received an alert that one of the hard drives in my array had 32 errors. I shut down all the containers running on the server, so nothing has been writing to the array for the past few hours. I checked SMART and ran an extended SMART test (took about 10 hours), and it completed without error. In the disk log I see a single error, which corresponds to the time when I received the alert:

 

Apr 26 02:06:51 Tower kernel: I/O error, dev sdf, sector 9206914224 op 0x0:(READ) flags 0x0 phys_seg 32 prio class 0
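For anyone wanting to pull the device and sector out of a line like this, here is a minimal Python sketch. The regex and the `parse_io_error` helper are my own illustration, not anything Unraid ships, and it assumes the kernel reports the sector in 512-byte units.

```python
import re

# Matches kernel block-layer lines of the form quoted above.
# The pattern is an illustration written against that single sample line.
IO_ERROR_RE = re.compile(
    r"I/O error, dev (?P<dev>\w+), sector (?P<sector>\d+) "
    r"op 0x[0-9a-fA-F]+:\((?P<op>\w+)\)"
)


def parse_io_error(line, sector_size=512):
    """Return device, sector, operation and byte offset, or None if no match.

    sector_size=512 is an assumption about the unit used in the log line.
    """
    match = IO_ERROR_RE.search(line)
    if match is None:
        return None
    sector = int(match.group("sector"))
    return {
        "dev": match.group("dev"),
        "sector": sector,
        "op": match.group("op"),
        "byte_offset": sector * sector_size,
    }


line = (
    "Apr 26 02:06:51 Tower kernel: I/O error, dev sdf, sector 9206914224 "
    "op 0x0:(READ) flags 0x0 phys_seg 32 prio class 0"
)
info = parse_io_error(line)
print(info)
```

Running this over the whole syslog would show whether the errors cluster on one device or one region of the disk, which might help separate a media problem from a cabling one.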

 

I've also attached the SMART report for this drive. From what I can tell, it looks like the drive might be fine. I dumped all the diagnostic information too, but it's a little overwhelming and nothing really jumped out at me.

 

Based on what I've read, it seems like the drive may be ok, and I should reboot to clear the statistics and keep an eye on the drive. I actually just bought a new 8TB hard drive that was pre-clearing when the errors were reported. So I was contemplating swapping the drive with errors with the new one to be safe, but I was thinking maybe I should run another parity check to check the entire array (with corrections turned off)? Last check was completed on the first of the month with no errors.

 

When researching this, it seems like an error like this could be caused by data / power cables. I have my hard drives in a server chassis with hot swap bays and they connect to the motherboard using this card:

https://www.amazon.com/dp/B002RL8I7M/?coliid=I1JDPV4RBNXVPJ&colid=3RNCTRI7WGC20&psc=1&ref_=lv_ov_lig_dp_it

 

and these cables:

 

https://www.amazon.com/dp/B07CKXFKHT/?coliid=IJ7PN6L2328TJ&colid=3RNCTRI7WGC20&ref_=lv_ov_lig_dp_it&th=1

 

Thought I'd mention it in case someone spots an issue with the card or cables I'm using. When I shut down the server, I was going to check the cable connections to make sure things are seated right, but I don't see errors on other drives, so maybe that's not the likely cause. Here's what I was thinking of doing:

 

1. Shut down the server and check connections.

2. Reboot and start running a parity check with corrections turned off.

3. Depending on the parity check results, run tests on the RAM.

 

If all that completes without issue, I'm thinking about leaving the disk in the array, but part of me wants to swap it for the new drive. Any thoughts or observations? I can attach the entire diagnostic output if it would help. Any help is appreciated.

 

ST8000VN004-2M2101_WKD37GL8-20230428-0444.txt

Link to comment

@trurl and @JorgeB Alright, the plot has thickened. I was preparing to replace the two Seagate drives when the monthly parity check started this morning without my realizing it. It was about 40% through the parity check when 4 drives suddenly showed ~3 million read errors. One of them was the same Seagate drive, but 3 other non-Seagate drives had errors too. I cancelled the parity check (which maybe I shouldn't have) and rebooted, and the read errors are gone. I want to replace the Seagate drives, but now I'm worried parity isn't correct, and if I try to rebuild their contents on a new disk it might not rebuild right. Any thoughts on what I should do? Should I restart the parity check and write corrections, then try to replace the two disks? I downloaded the diagnostics again and attached them.

tower-diagnostics-20230501-1409.zip

Link to comment

You rebooted before getting diagnostics so can't tell what happened before reboot.

 

5 minutes ago, Geth said:

monthly parity check started this am

Scheduled parity checks should be configured to be non-correcting. Is it?

 

7 minutes ago, Geth said:

restart the parity check and write corrections

You should never correct parity until all other problems are eliminated.

Link to comment

Shit, my bad. So I actually see parity corrections are disabled for the monthly check. The server had been running fine until the parity check this morning. So I think I'm gonna run the short SMART test on all the drives first and see if any errors show up. Assuming no issues, do you think I should start the parity check again?

Link to comment

So the parity check just completed without any errors. I'm going to apply the fix to those Seagate drives mentioned in the thread when I get home from work. I have everything set up and ready to go, although I installed the Seagate utilities a bit differently. Maybe I'll write up how I ended up doing it. Seems like things have changed a bit since the original instructions were written. I'm going to shut the server down and inspect the backplane connections and the power supply as well, mostly to tick the boxes. It seems strange that the other Western Digital drives would have read errors if it was the Seagate issue, though. But maybe I've misunderstood exactly what the problem is.

Link to comment

@trurl So I just got read errors on 4 disks again, and I'm pretty sure these were the same disks that had issues last time. I actually applied the Seagate fix a few days ago and hadn't had any issues until this morning. I downloaded the diagnostics and haven't rebooted. Let me know if you see anything. I'm gonna check whether these 4 disks are on the same backplane, but I don't want to reboot yet.

tower-diagnostics-20230507-1228.zip

Link to comment

I dumped the diagnostics logs again and shut the server down. All 4 drives with issues are in the same row and use the same backplane. So I'm thinking that the backplane has gone bad and that might be the issue. Problem is I have a NORCO RPC-4224 case and the company has gone under. Seems like there are a lot of problems with backplanes dying and taking drives with them. Maybe I should move to a new case and be done with this one.
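The same-backplane check can be sketched in a few lines of Python. Everything in `bay_map` is a hypothetical layout I made up for illustration (only `sdf` actually appears in my logs), and `group_by_backplane` is just a helper name, not part of any Unraid tooling.

```python
from collections import defaultdict


def group_by_backplane(bay_map, erroring_disks):
    """Group erroring disks by the backplane row they are plugged into."""
    groups = defaultdict(list)
    for disk in erroring_disks:
        # Disks missing from the map are collected under "unknown".
        groups[bay_map.get(disk, "unknown")].append(disk)
    return dict(groups)


# Hypothetical device -> backplane row mapping; fill in from the real chassis.
bay_map = {
    "sdb": "row1", "sdc": "row1",
    "sdd": "row2", "sde": "row2", "sdf": "row2", "sdg": "row2",
    "sdh": "row3",
}
# Hypothetical list of disks that reported read errors.
erroring = ["sdd", "sde", "sdf", "sdg"]

groups = group_by_backplane(bay_map, erroring)
shared = len(groups) == 1 and "unknown" not in groups
print(groups, "-> one shared backplane" if shared else "-> spread across backplanes")
```

If every erroring disk lands in a single group, a shared component (backplane, its cable, or its power feed) looks more likely than four drives failing at once.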

Link to comment

When the issues first happened, I set the spin down delay to never on both of those drives. The morning of the latest issues, I set the spin down delay back to the default for both drives, so you might be right. The only thing I don't understand is why other non-Seagate drives show read errors too. I followed the instructions in that support thread and disabled EPC and Low Current Spinup, but I guess it hasn't helped. I'm thinking of just moving all the files off these two disks using the unbalance plugin and removing them from the array entirely. What do you think of that?

Link to comment
