Disk in Error State (Read Errors)

lilpete · March 10, 2020

Hi all,

I'm running Unraid 6.8.2 and just received a notification that one of my disks is in an error state. It's not a young drive (5 years of power on time), but when I check SMART everything looks to be fine (it's currently running an extended test to be sure). The log however shows Disk 1 having 672 errors, so clearly something's up.

I've had a look at the syslog in the attached diagnostics and it shows masses of read and write errors but I'm unable to divine anything else.

Any ideas? I'm assuming that the array will continue to function, but in emulated mode/without protection until I replace it?

I actually have enough disk space free on the other disks to pick up the slack, is it worth running Unbalance to move all the data "off" this disk, or is that just going to screw things up?

Thanks for any help,

Peter

argo-diagnostics-20200310-1515-anon.zip

trurl · March 10, 2020

It is always safer to rebuild to a replacement and keep the original in reserve in case there are problems with the rebuild. But, you can rebuild to that same disk if you want. I don't recommend shuffling things around to other disks when there is already a disabled disk in the array since all disks must be read to get the data for the disabled disk, so that is just a lot of additional unnecessary activity on an array that isn't protected. The only thing I usually recommend is if you don't already have backups (you should) you can copy anything important and irreplaceable from the emulated disk to another system or to an Unassigned Device.
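To illustrate why every other disk has to be read while a disk is emulated, here is a minimal sketch. It assumes a single parity disk computed as a bytewise XOR across the data disks; the function names and sample bytes are hypothetical and are not taken from Unraid itself.

```python
from functools import reduce


def xor_blocks(blocks):
    """Bytewise XOR of equal-length byte strings."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def emulated_read(surviving_blocks, parity_block):
    """Reconstruct the missing disk's block from ALL surviving disks plus parity."""
    return xor_blocks(list(surviving_blocks) + [parity_block])


# Hypothetical contents of the same block on three data disks.
disk1 = b"\x10\x20\x30\x40"  # the disk that later gets disabled
disk2 = b"\x0f\x0e\x0d\x0c"
disk3 = b"\xaa\xbb\xcc\xdd"

# Parity as it was written while all disks were healthy.
parity = xor_blocks([disk1, disk2, disk3])

# With disk1 disabled, a read of disk1 touches disk2, disk3 and parity.
recovered = emulated_read([disk2, disk3], parity)
assert recovered == disk1
```

Under that assumption, moving data around on an array with a disabled disk turns every read of the emulated disk into a read of every remaining disk, which is the extra activity described above.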

Before doing anything though you need to see if there is anything that needs to be fixed so you can successfully rebuild. SMART looks OK for that disk as you say, but you might run an extended SMART test on it anyway. And, of course, make sure all connections are good.

JorgeB · March 10, 2020

If rebuilding to the same disk, I recommend replacing/swapping cables, just to rule them out if it happens again to the same disk.

lilpete · March 10, 2020

Hi both,

Sorry for the confusion, I didn't mean that I'd rebuild onto the same drive; once a drive has developed errors I tend to get a bit twitchy about it. I've not done anything physically to the server recently, so I doubt it's a cable, unless it's spontaneously failed, which I guess is possible. It's the first proper error I've seen on a disk in Unraid, and given that the SMART data (including a short test) initially came back clean, I was (and still am) confused as to what caused it, as I thought any read/write errors would show up there.

My only other thought is that I do have a Dell H310 SAS card with a fan stuck to the heatsink. I'm wondering if the fan has failed and the card's overheated, will have to take the side off the case when I get home and see if it's running.

I'll wait and see what the extended SMART test says, which for a 6TB disk I guess will complete some time tomorrow. Might order a spare disk just in case as I'm likely to want to replace it anyway.

I do indeed have backups! All important data is in multiple places and the cloud, all, err, mass data, is backed up to a disk sat in my locked cupboard at work and refreshed monthlyish. I'm trying to figure out if there's a way of telling what file it was trying to write when it errored, but I can't see anything obvious.

Thanks for your help and advice!

JorgeB · March 10, 2020

12 minutes ago, lilpete said:

My only other thought is that I do have a Dell H310 SAS card with a fan stuck to the heatsink

It's unlikely to be the controller, or multiple disks would have been affected. The problem looks like a connection issue, though sometimes a healthy-looking disk still has issues, which is why I recommended swapping cables. If the same disk fails again after that, it's likely a disk problem.

trurl · March 10, 2020

1 hour ago, lilpete said:

I'm trying to figure out if there's a way of telling what file it was trying to write when it errored, but I can't see anything obvious.

The failed write which disabled the disk, and any subsequent writes to that (now emulated) disk, are still emulated by updating parity, so they should be recovered on rebuild.
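As a rough sketch of how a write to a disabled disk can still be recovered later, assume again a single parity disk holding the bytewise XOR of all data disks. The helper names and byte values below are hypothetical, chosen only for illustration.

```python
def xor_bytes(a, b):
    """Bytewise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def emulated_write(new_data, surviving_blocks):
    """Return the parity block that encodes new_data for the missing disk.

    Nothing is written to the disabled disk; only parity changes.
    """
    parity = new_data
    for block in surviving_blocks:
        parity = xor_bytes(parity, block)
    return parity


def rebuild(surviving_blocks, parity):
    """Recover the missing disk's block onto a replacement disk."""
    data = parity
    for block in surviving_blocks:
        data = xor_bytes(data, block)
    return data


# Hypothetical blocks on the two healthy data disks.
survivors = [b"\x01\x02", b"\x0f\x0f"]

# A write aimed at the disabled disk only updates parity...
parity = emulated_write(b"hi", survivors)

# ...and the rebuild onto the replacement reproduces that write.
assert rebuild(survivors, parity) == b"hi"
```

In this model the data from the failed write lives only in parity until the rebuild completes, which is why the array has no protection against a second failure in the meantime.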

lilpete · March 11, 2020

The full SMART scan showed 0 errors, so I suspect a cable. As it's a SAS breakout cable, I might just switch it for an unused one on the same set and mark it with red tape or something to remind me.

New disk has turned up now anyway so after pre-clearing the new one I can replace the "failed" one and test it more thoroughly.

Thanks to you both for your help!

lilpete · March 12, 2020

Concerningly, the other large disk has now also shown a bunch of errors, although unlike the first disk hasn't been disabled.

Again, SMART tests show all is fine. I'm currently pre-clearing the new disk, but I increasingly suspect the cabling has failed. I'm trying to remember whether they're both plugged into the same splitter from the HBA, as I do have a spare.

Might run Memtest or something when the system is down anyway to try and see if there's something else afoot.
