Continuous Drive Failures


Solved by JorgeB

For some reason I've been having an issue since May or so. I had a drive fail and got an RMA replacement from WD. I precleared it on another machine, shut down UnRAID, swapped the drive, and brought UnRAID back up to start the rebuild. However, as the rebuild started I began getting UDMA CRC error count errors on a drive that had previously reported no issues. I figured it was just a fluke that a second drive failed while rebuilding, so once the rebuild was done I RMA'd the new drive and repeated the process. Then another drive failed during that rebuild with the same thing, and this just keeps happening over and over. I am now rebuilding my 5th or 6th drive, and again I am getting a ton of UDMA errors (Disk 3 (disk dsbl) in the logs).

 

While all my previously failed drives were bought a couple of years ago, the new Disk3 drive with errors is a WD Gold I just bought back in May (just before this all started). I did preclear it with no issue originally.

 

I don't understand what's going on, but am hoping someone can take a look at my diagnostics and provide some insights. This is happening far too frequently for me to think it's actually drive failure after failure, but I could be wrong.

 

I know sometimes rebooting will clear the crc errors, but I'd like to try and understand root cause and see if I can do something more permanent to fix it.

cydstorage-diagnostics-20221015-1520.zip

Link to comment
  • Solution

Disk3 dropped offline, so there's no SMART report, but this looks more like a power/connection problem. Check/replace the cables and post new diags.

 

13 hours ago, bkastner said:

I started getting udma crc error count errors on a drive that previously reported no issues.

These are not a disk problem; usually it's a bad SATA cable. You should also update the LSI firmware, since it's quite old.

Link to comment

Okay, so I've flashed the firmware (what a pain in the ass that was), replaced all my SFF-8643 cables, and brought the system back up. Everything seems better.

 

The last drive that was screaming at me has 130050 UDMA CRC errors, but the count seems to not be moving. When I started rebuilding the last failed drive, the CRC errors on this drive were skyrocketing, so I am guessing the count remaining stable now is a good sign.

 

Is there an easy way to test? I've browsed the drive through the GUI as I figured that would cause a read operation which would maybe cause the CRC errors to climb, but it's still the same. I also ran a short SMART test and it came back without error.
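One way to check whether the counter is still moving is to read SMART attribute 199 (UDMA_CRC_Error_Count) before and after putting some load on the drive. The following is only a minimal sketch, assuming smartmontools is installed and the drive reports the usual `smartctl -A` attribute table. The function names, the `SAMPLE` text, and the device path are illustrative and not taken from the diagnostics in this thread.

```python
import re
import subprocess
from typing import Optional


def parse_crc_count(smartctl_output: str) -> Optional[int]:
    """Return the raw value of SMART attribute 199, or None if it is absent."""
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows have 10 columns:
        # ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
        if len(fields) >= 10 and fields[0] == "199":
            match = re.match(r"\d+", fields[9])
            return int(match.group()) if match else None
    return None


def read_crc_count(device: str) -> Optional[int]:
    """Run `smartctl -A` on a device (usually needs root) and parse the result."""
    result = subprocess.run(
        ["smartctl", "-A", device],
        capture_output=True, text=True, check=False,
    )
    return parse_crc_count(result.stdout)


# Illustrative output only; the layout follows the usual smartctl table.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  9 Power_On_Hours          0x0032   098   098   000    Old_age   Always       -       1234
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       130050
"""

if __name__ == "__main__":
    print(parse_crc_count(SAMPLE))  # prints 130050
```

On a live system, something like `read_crc_count("/dev/sdX")` (with the real device name substituted) could be called once before and once after a large read from the drive. The raw count typically only ever increases, so an unchanged number after heavy I/O is the thing to look for.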

 

Is it fairly safe to assume that it was a controller/cable issue, and I shouldn't have any more drive failures? Or is there another test I should run before considering this case closed?

Link to comment

I have 2 drives on that particular backplane... the one with 130000 CRC errors (disk3), and the second drive that has 1 random one (disk8). Disk3 isn't in the array anymore and is being simulated. I had to do a parity check over the weekend as I inadvertently caused an unclean shutdown, and I didn't get any CRC errors on Disk8 during the check. The 130000 CRC errors on Disk3 occurred while UnRAID was rebuilding the previously failed drive, but disk8 just had that one minor hiccup.
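The before/after comparison described here can be sketched as a small helper. It is a hypothetical illustration: it assumes the CRC counts per disk were noted down by hand (or collected some other way) before and after the parity check, and `crc_deltas` is a made-up name.

```python
from typing import Dict


def crc_deltas(before: Dict[str, int], after: Dict[str, int]) -> Dict[str, int]:
    """Return only the disks whose CRC counter grew, mapped to the increase."""
    return {
        disk: after[disk] - count
        for disk, count in before.items()
        if disk in after and after[disk] > count
    }


if __name__ == "__main__":
    # Counts as quoted in this thread, unchanged across the check.
    before = {"disk3": 130050, "disk8": 1}
    after = {"disk3": 130050, "disk8": 1}
    print(crc_deltas(before, after))  # prints {}
```

An empty result means no counter moved during the run, which is consistent with the cable/controller fix having worked. Any disk that shows up with a positive delta would point to a link that is still unreliable.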

 

Is a parity check still worthwhile in this scenario? I am guessing not, but want to confirm in case I am missing something.

Link to comment
