UDMA CRC Errors. Are my drives going bad?


Hello unRAID community. I could use your support.

 

My setup:

Intel Core i7 12700k

1x Seagate Ironwolf Pro 16TB (Parity 1)

4x WD Red Pro 16TB (Parity 2; Disks 1, 2, & 3)

LSI 9400-16i in IT mode (all hard drives connected to this HBA)

2x Samsung 980 Pro 2TB in RAID1 (Cache)

 

My issue:

My server has been working great for the last year, then all of a sudden, about 2 weeks ago, I started getting UDMA CRC errors on all of my data drives during my quarterly parity check. The error is easily reproducible by running a parity check, or during extended sessions with the mover. The odd thing is that every time I reproduce the issue, the errors come from 1 or 2 of the drives at random, but not all 3. Eventually, if I let the parity check run long enough, unRAID will detect errors on one of the 3 data drives. I find it hard to believe that 3 of my drives are failing: 2 of the 3 are ~1 year old, while the last one (which is having the most UDMA CRC issues) is 3 months old.

 

Steps I have attempted to resolve:

  • Updated BIOS on motherboard
  • Updated unRAID to 6.12.11
  • Verified LSI 9400-16i is running latest firmware
  • Dusted SAS ports on LSI, dusted PCIe slot, dusted SATA ports on HDDs
  • Swapped all mini-SAS to SATA breakout cables for new ones
  • Swapped power cables on HDDs for new ones (from PSU box)
  • Switched LSI card to different PCIe slot
  • Purchased new LSI 9400-16i HBA and installed with another set of brand new mini-SAS to SATA cables
  • Ran Extended SMART tests on all HDDs. All drives passed 2 Extended SMART tests
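
For anyone following along: UDMA CRC errors show up in SMART attribute 199 (UDMA_CRC_Error_Count), and the raw value is a cumulative counter that typically does not reset, so what matters is whether it keeps climbing. Below is a minimal sketch for pulling that raw value out of the text that `smartctl -A` prints, assuming the usual ten-column attribute table. The sample text and the function name `crc_error_count` are made up for illustration and are not taken from the attached diagnostics.

```python
import re
from typing import Optional

# Illustrative sample in the layout `smartctl -A /dev/sdX` usually prints.
# The numbers are invented, not taken from the diagnostics in this thread.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       2190
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       17
"""


def crc_error_count(smart_text: str) -> Optional[int]:
    """Return the raw value of attribute 199, or None if it is not listed."""
    for line in smart_text.splitlines():
        fields = line.split()
        # Attribute rows have at least ten columns; the first is the ID
        # and the tenth is the raw value.
        if len(fields) >= 10 and fields[0] == "199":
            match = re.match(r"\d+", fields[9])
            return int(match.group()) if match else None
    return None


print(crc_error_count(SAMPLE))
```

A passing Extended SMART test and a rising attribute 199 are not in conflict: the self-test exercises the drive itself, while CRC errors are counted on the link between the drive and the controller.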

 

I am running out of ideas to try. Is there anything else I can try, or should I accept that my data drives have gone bad?

diagnostics-20240725-1217.zip

Edited by loyalsnoopdoge
grammar corrections, attaching diagnostics

It's really sus that multiple young drives would fail so soon... Hard for me to accept that. Let's see, you eliminated the SAS controller, and you swapped out both data and power cables. There's only one long shot I can think of on the HW side, and that would be the PSU itself. Any chance you have an old one lying around?

If not that, maybe Jorge (or another pro) can see something in the logs. I'm pretty bad at reading through those, unfortunately.

  

5 hours ago, Veah said:

It's really sus that multiple young drives would fail so soon...

Agreed, the logs show what look like power/connection issues, and if it's the same with the onboard SATA, I still think power or the miniSAS cables are the problem.

8 hours ago, JorgeB said:

Agreed, the logs show what look like power/connection issues, and if it's the same with the onboard SATA still think power or miniSAS cables are the problem.

I have tried replacing the power cables as well. Do you think it could be the SATA connectors on the drives themselves?

  • 2 weeks later...

Since my last post I've been hard at work with this thing.

I just put new drives in with the new HBA and new SAS-to-SATA breakout cables. I made sure the temps aren't going above 40C on the drives. One of my brand-new drives just threw a UDMA CRC error during a parity check.
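
Because the UDMA CRC count is cumulative, one way to see exactly which drives logged new errors during a given parity check is to note each drive's raw count before and after the run and compare. A minimal sketch, assuming the counts have already been collected by hand or by script; the disk names, the numbers, and the helper `new_crc_errors` are all hypothetical.

```python
from typing import Dict


def new_crc_errors(before: Dict[str, int], after: Dict[str, int]) -> Dict[str, int]:
    """Return, per disk, how many CRC errors were added between two snapshots."""
    return {
        disk: after[disk] - before[disk]
        for disk in after
        # Only report disks present in both snapshots whose count went up.
        if disk in before and after[disk] > before[disk]
    }


# Made-up raw counts noted before and after one parity check.
before = {"parity2": 0, "disk1": 4, "disk2": 9, "disk3": 21}
after = {"parity2": 0, "disk1": 4, "disk2": 11, "disk3": 26}

print(new_crc_errors(before, after))  # {'disk2': 2, 'disk3': 5}
```

If the set of affected disks changes from run to run and includes brand-new drives, that would be consistent with a shared cause such as power or connections rather than with the individual disks.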

I've also reseated the motherboard and all the connections on it. I also ran memtest for 16 hours to see whether my RAM might be the problem, even though this looks like a connection issue.

I'm at a loss as to what to do. Do I replace the PSU and/or the motherboard?

Edited by loyalsnoopdoge