Disk read errors but SMART test PASSED

cardNull · January 5

For the last 5 days I have been getting notified that the health check of my disks is failing. One disk in particular, "disk1", is having read errors. When I run a SMART test, both short and extended, it says PASSED. I will note that my extended test took about 8 hours to complete. Is there a next step for testing the drive or gathering additional info on the "read" errors? Could the file system be causing it? I have attached the SMART logs.

Thank you.

SMART-REPORT.txt

itimpi · January 5

Since the extended SMART test passed, it is likely that the drive is OK and the problem is something external to it, such as the cabling and/or power. I would need the diagnostics to have any chance of seeing whether that looks like the case.

trurl · January 5

Attach diagnostics to your NEXT post in this thread.

cardNull · January 5

nachoserver-diagnostics-20240105-0827.zip

Attached is the diagnostics export.

trurl · January 5

You should click on each of your WD disks and add attributes 1 and 200 for monitoring. Disk1 does have value 1 for each of those attributes, but I'm not sure how critical that is since it passed extended self-test. On the other hand, syslog says critical medium error for that disk and nothing that would indicate communication problems such as cabling.
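For anyone who prefers to check attributes 1 and 200 from the command line, here is a minimal Python sketch. It assumes the attribute table layout printed by `smartctl -A` (10 columns, raw value last). The sample text is illustrative only and is not taken from the attached report, and the helper name `parse_smart_attributes` is made up for this example.

```python
def parse_smart_attributes(text):
    """Return {attribute_id: (name, raw_value)} from smartctl -A style text."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(fields) >= 10 and fields[0].isdigit():
            raw = fields[9]
            # Skip raw values that are not plain integers.
            if raw.isdigit():
                attrs[int(fields[0])] = (fields[1], int(raw))
    return attrs


# Illustrative sample (assumption), shaped like `smartctl -A /dev/sdX` output.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       1
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       53
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       1
"""

WATCHED = (1, 200)  # the two attributes suggested for WD disks in this thread

attrs = parse_smart_attributes(SAMPLE)
nonzero = {i: attrs[i] for i in WATCHED if i in attrs and attrs[i][1] > 0}
for attr_id, (name, raw) in nonzero.items():
    print(f"attribute {attr_id} ({name}) raw value is {raw}")
```

To use it on a real disk you would feed it the text from `smartctl -A` for that device instead of `SAMPLE`.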

Looks like sdc (cache) also has some disk problems.

cardNull · January 5

I added 1 and 200 to all my WD drives. Disk 1 popped alerts right away. Another disk has those same values for 1 and 200, but it's not in error.

cardNull · January 7

The disk that was tossing "Raw read error rate" errors (raw value of 1) is now at zero: "Returned to normal". However, the Multi zone error rate is still 1. Now my Parity disk has a Raw read error rate of 1 and a UDMA CRC error count of 53. So something is up, but I am not sure how to keep chasing this one.

I do have one 8TB drive with less than 48 hours on it, so my plan was to put that in the parity slot and then build parity. Once that was correct, I was going to replace all my 4TB disks with new 8TB disks. I guess my question now is: can I just shut down, remove the parity disk, add the new parity disk, boot up, and let it build?

Looking at this (https://docs.unraid.net/legacy/FAQ/parity-swap-procedure/), I am right. But maybe I am doing it wrong lol. The page says:

This procedure is strictly for replacing data drives in an Unraid array. If all you want to do is replace your Parity drive with a larger one, then you don't need the Parity Swap procedure. Just remove the old parity drive and add the new one, and start the array. The process of building parity will immediately begin. (If something goes wrong, you still have the old parity drive that you can put back!)

trurl · January 7

CRC errors are just communication problems logged by the disk firmware. They are almost always connection or cable problems. You can acknowledge any SMART warning on the Dashboard page by clicking on it, and it will warn again if the count increases. I usually just acknowledge the occasional CRC and maybe reseat the connection next time I'm in the case. If it continues to increase, investigate further and replace cables or whatever. Power cables and splitters can also be a reason for problems. Connection problems are much more common than actual disk problems, so be careful with connections.
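The "acknowledge it, then only worry if it goes up" approach can be sketched in a few lines of Python. This is only an illustration of the idea, not how Unraid implements its warnings, and the function name `check_crc` is made up for this example.

```python
def check_crc(acknowledged, current):
    """Compare the current UDMA CRC error count with the acknowledged one."""
    increase = current - acknowledged
    if increase <= 0:
        return "ok: no new CRC errors since acknowledgement"
    return f"warn: {increase} new CRC errors, check cables and connections"


# Example: a count of 53 was acknowledged, and later readings are compared to it.
print(check_crc(53, 53))
print(check_crc(53, 60))
```

The point is that the absolute count matters less than whether it keeps climbing after you have reseated or replaced the cable.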

You are correct, you don't need parity swap procedure.

The documentation talks a lot about replacing a data disk, but replacing a parity disk is exactly the same.

You can't change disk assignments with the array started.

With the array stopped, assign the new disk to the slot to be replaced. Start the array to begin rebuild. Simple as that.

Shut down, install the new disk, possibly remove the old disk, whatever you need to do to get to the point of assigning that new disk and then starting the array.

cardNull · January 7

Thanks for your help with this. I have moved disks and parity is rebuilding now. Good to know on the CRC. I will monitor it for now. I have 10 drives of varying size. It would be ideal to pare them down so that I am not using SATA power splitters, which I am now. Could those be causing these issues? I am using a SAS HBA for the disks that have issues now.

trurl · January 7

Molex-SATA splitters are preferred over SATA-SATA splitters. Ideally only 4 disks per power supply cable.
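As a rough worked example of the 4-disks-per-cable guideline, here is a small Python sketch. The 10-drive figure comes from the post above; treating every drive as powered from the same power supply is an assumption, and `cables_needed` is a made-up helper name.

```python
import math


def cables_needed(num_disks, disks_per_cable=4):
    """Number of power supply cables needed at a given disks-per-cable limit."""
    if num_disks < 0 or disks_per_cable <= 0:
        raise ValueError("num_disks must be >= 0 and disks_per_cable must be > 0")
    return math.ceil(num_disks / disks_per_cable)


print(cables_needed(10))  # 10 drives at 4 per cable -> 3 cables
```

Moving to fewer, larger disks lowers `num_disks`, which is what reduces the need for splitters.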

New larger parity is the first step in getting to fewer larger data disks.
