
Every disk has CRC errors


tucansam


Every single one of my data disks has CRC errors.  They are spread across the motherboard controller and two PCIe controllers.  Combination of several brands of SATA cables, including two breakout cables coming off an LSI.  

 

What could be causing this?

 

Secondarily, and perhaps related, I've noticed that some of my files are corrupt.  While looking through old pictures, many only load halfway or not at all, and some home videos don't play anymore.

 

crc.jpg

Link to comment

I changed SATA cables on that disk and rebooted; two disks (including that one) immediately had 50+ CRC errors.

 

So I changed SATA ports and rebooted again (both were, and still are, on motherboard ports, just moved to two unused ones).

 

Rebooted and now Disk 7 has 653 CRC errors (completely unrelated to the previous errors).

 

Bad motherboard?  RAM?  Power supply?

Link to comment

How is airflow over the controllers?  LSI controllers generally have small heat sinks because they're designed to be used in forced cooling servers.  In low air flow machines they can get well over 100C and freak out. 

 

Might be worth adding a fan blowing over the controllers, even a low RPM 80mm fan would do the job.

Link to comment

There are six 140mm fans and one 120mm fan in the case, all of which are doing evac duty, so there is a ton of air moving through the case.  Good call though.

 

I will replace the power supply.  This has been an ongoing issue for almost a year and the power supply is one thing I haven't yet swapped out to test.

Link to comment
  • 1 month later...
  • 3 weeks later...
17 minutes ago, Maticks said:

also thought sata cables couldn’t be the issue.

 

No.  Exactly the opposite.  See here:

         https://en.wikipedia.org/wiki/S.M.A.R.T.#Known_ATA_S.M.A.R.T._attributes

 

Look for the Error/Attribute Number and read what it means.  This is an error picked up during transmission over the SATA link.  Even if the disk is sending bad data out on the SATA cable, the bad data should still pass the ICRC (Interface Cyclic Redundancy Check) when it is decoded at the other end, because the checksum is calculated over whatever was sent.
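
To make that concrete, here is a minimal Python sketch of a link-level check. It is only an illustration under assumptions: zlib.crc32 stands in for the SATA interface CRC (the real check is done in hardware on the link), and send/receive are made-up helper names, not anything from the drive or controller firmware.

```python
import zlib

def send(payload):
    """Sender side: attach a CRC computed over whatever bytes are being sent."""
    return payload, zlib.crc32(payload)

def receive(payload, crc):
    """Receiver side: recompute the CRC and compare it with the one that arrived."""
    return zlib.crc32(payload) == crc

good = b"family photo bytes"

# Case 1: the data was already bad before it reached the link.
# The sender computes the CRC over the bad bytes, so the link check passes.
already_bad = b"family phoXo bytes"
frame, crc = send(already_bad)
bad_source_passes = receive(frame, crc)

# Case 2: the data is fine at the sender, but one bit flips on the cable.
# The receiver's CRC no longer matches, which is what the counter records.
frame, crc = send(good)
damaged = bytearray(frame)
damaged[3] ^= 0x01
cable_error_detected = not receive(bytes(damaged), crc)

print("bad data from the source passes the link check:", bad_source_passes)
print("bit flip on the cable is detected:", cable_error_detected)
```

So a rising CRC error count points at the path between drive and controller (cable, connector, port, power, interference), not at the contents of the platters.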

 

EDIT:  make sure that you haven't bundled all of the SATA cables together for appearance's sake.  This can cause cross-talk between the cables unless they are shielded, and very few shielded SATA cables are made these days.

Link to comment
21 minutes ago, Maticks said:

...

 

You mentioned data corruption. That is most concerning. CRC errors are being detected and (supposedly) corrected. Although we don't like to see them, they are more likely to cause drives to drop offline or slow down. They should never cause corruption. If you are seeing CRC and corruption together, that is very concerning.

 

I might suggest a non-correcting parity check. Would be very good to know if this condition is causing sync errors. If you start seeing lots of sync errors, you can stop it. But if you are not seeing them, I'd let it run. It will also count as your monthly check. I would mention that an absence of sync errors doesn't mean you're not getting corrupted data. Only that whatever data you are getting is being used to consistently update the array. If that data is bad, parity can be consistently updated with bad data.
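
A rough Python sketch of why that is, assuming a single XOR parity disk. The names xor_parity and sync_errors and the tiny four-byte "disks" are hypothetical, purely for illustration.

```python
from functools import reduce

def xor_parity(blocks):
    """Byte-wise XOR across equally sized blocks (single-parity scheme)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def sync_errors(data_blocks, parity_block):
    """Count byte positions where stored parity disagrees with recomputed parity."""
    recomputed = xor_parity(data_blocks)
    return sum(1 for p, q in zip(parity_block, recomputed) if p != q)

intended = b"\x10\x20\x30\x40"   # what the application meant to write to disk 1
disk2 = b"\x01\x02\x03\x04"      # contents of another data disk

# Scenario A: the data is damaged on its way in, and parity is computed
# from the damaged copy. Data and parity agree, so the check is clean.
written = b"\x10\x20\x31\x40"
parity_a = xor_parity([written, disk2])
errors_a = sync_errors([written, disk2], parity_a)

# Scenario B: parity was computed from good data, but the disk later
# returns a damaged block during the check. Now the check reports it.
parity_b = xor_parity([intended, disk2])
errors_b = sync_errors([written, disk2], parity_b)

print("scenario A sync errors:", errors_a, "| data is wrong:", written != intended)
print("scenario B sync errors:", errors_b)
```

Scenario A is the worrying one: zero sync errors, yet the file on disk is not what was written.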

 

Would suggest going back to a stock configuration with no Dockers and NO VMs, no ACS override. See if issues go away. If they don't, and you're seeing corruption, you'd probably want to stop that array and carefully consider options. Could be the MB?? EMF? Who knows. But I would not want to subject array to more abuse if you are reporting data corruption.

 

You could remove the disks (easier if you have hot-swap) and insert some older disks for your testing.

 

Then slowly make changes and monitor. 

 

If you have a backup MB / CPU, you might try switching over. 

 

My advice on these types of things is to try to get back to a working configuration, and then slowly make changes and monitor for the problems to come back, rather than slowly dismantle and wait for problems to go away. For whatever reason, the latter never works for me.

Link to comment
6 hours ago, SSD said:

 

You mentioned data corruption. That is most concerning. CRC errors are being detected and (supposedly) corrected. Although we don't like to see them, they are more likely to cause drives to drop offline or slow down. They should never cause corruption. If you are seeing CRC and corruption together, that is very concerning.

CRC is designed to catch most errors - not all. With too long a run of bit errors - or two bit errors too far apart from each other - the CRC can fail to notice anything wrong.
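
A small Python sketch of such a miss. It uses a toy 8-bit CRC (polynomial 0x07) so the search is instant; the link CRC is 32 bits wide, so this shows the principle only, and crc8 and find_undetected_error are hypothetical names.

```python
def crc8(data, poly=0x07):
    """Bitwise CRC-8, most significant bit first, zero initial value."""
    crc = 0
    for byte in data:
        crc ^= byte
        for _ in range(8):
            if crc & 0x80:
                crc = ((crc << 1) ^ poly) & 0xFF
            else:
                crc = (crc << 1) & 0xFF
    return crc

def find_undetected_error(message):
    """Try 16-bit error patterns on the first two bytes until the CRC is unchanged."""
    original_crc = crc8(message)
    for pattern in range(1, 1 << 16):
        damaged = bytearray(message)
        damaged[0] ^= pattern >> 8
        damaged[1] ^= pattern & 0xFF
        if crc8(bytes(damaged)) == original_crc:
            return pattern, bytes(damaged)
    return None

message = b"home video"
pattern, damaged = find_undetected_error(message)

print("error pattern (hex):", format(pattern, "04x"))
print("data changed:", damaged != message)
print("CRC still matches:", crc8(damaged) == crc8(message))
```

Every error confined to a single byte is caught by this toy CRC, but once the damage spans more bits than the checksum is wide, some patterns slip through with the same checksum as the original.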

 

The CRC here is intended to catch a bit error once in a blue moon - not to filter a continuous stream of more or less damaged data blocks. So in the end - if there is a danger of regularly getting multiple bit errors in the same block, then it's just a question of time for a bad transfer to not get caught.
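
As a back-of-the-envelope illustration of "a question of time", here is a short Python sketch. It rests on a loud assumption: that a heavily damaged transfer slips past a 32-bit CRC about once in 2^32 tries. That figure is a common rule of thumb for random damage, not a measured property of any particular drive or controller, and the function name is made up.

```python
import math

# Assumption: chance that one randomly damaged transfer still passes a 32-bit CRC.
MISS_PROBABILITY = 2.0 ** -32

def chance_of_silent_corruption(damaged_transfers, miss_probability=MISS_PROBABILITY):
    """Probability that at least one damaged transfer passes the CRC anyway."""
    # 1 - (1 - p)^n, written with log1p/expm1 so it stays accurate for tiny p.
    return -math.expm1(damaged_transfers * math.log1p(-miss_probability))

for n in (653, 10**6, 10**9, 2**32):
    print(n, "damaged transfers ->", chance_of_silent_corruption(n))
```

Under that assumption a few hundred damaged transfers are very unlikely to produce a miss, but the odds keep climbing for as long as the link keeps mangling data.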

Link to comment

Archived

This topic is now archived and is closed to further replies.
