tucansam Posted November 21, 2017

Every single one of my data disks has CRC errors. They are spread across the motherboard controller and two PCIe controllers, with a combination of several brands of SATA cables, including two breakout cables coming off an LSI. What could be causing this?

Secondarily, and perhaps related, I've noticed that some of my files are corrupt. While looking through old pictures, many only load halfway or not at all, and some home videos don't play anymore.
Frank1940 Posted November 21, 2017

I would suggest that you post up a diagnostics file in a new post (Tools >>> Diagnostics).
tucansam Posted November 21, 2017

Thanks!

ffs2-diagnostics-20171121-0952.zip
JorgeB Posted November 21, 2017

In the meantime you can also acknowledge the current values on all disks and see if they continue to rise.
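For anyone who wants to watch those counters between reboots, here is a minimal sketch, assuming smartmontools (smartctl) is installed and the script is run as root; the device names are placeholders, not the OP's actual layout:

```python
# Minimal sketch: poll SMART attribute 199 (UDMA_CRC_Error_Count) on a
# few disks and report any increase over the baseline. Requires
# smartmontools; run as root. Device names below are examples only.
import re
import subprocess
import time

DISKS = ["/dev/sdb", "/dev/sdc", "/dev/sdd"]  # adjust to your array

def crc_count(dev):
    """Return the raw value of attribute 199, or None if not reported."""
    out = subprocess.run(["smartctl", "-A", dev],
                         capture_output=True, text=True).stdout
    for line in out.splitlines():
        if re.match(r"\s*199\s", line):
            return int(line.split()[-1])  # raw value is the last column
    return None

baseline = {dev: crc_count(dev) for dev in DISKS}
while True:
    time.sleep(600)  # re-check every 10 minutes
    for dev, start in baseline.items():
        now = crc_count(dev)
        if start is not None and now is not None and now > start:
            print(f"{dev}: CRC error count rose from {start} to {now}")
```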
JorgeB Posted November 21, 2017

On the current diags they're happening to this disk only:

Nov 18 05:14:44 ffs2 kernel: ata5.00: ATA-9: WDC WD60EZRZ-00RWYB1, WD-WX81D6550SLK, 80.00A80, max UDMA/133

Replace the SATA cable.
tucansam Posted November 21, 2017

I changed SATA cables on that disk and rebooted, and two disks (including that one) immediately had 50+ CRC errors. So I changed SATA ports (both disks were, and still are, on motherboard ports; I just moved them to two unused ones), rebooted again, and now Disk 7 has 653 CRC errors (completely unrelated to the previous errors). Bad motherboard? RAM? Power supply?
JorgeB Posted November 21, 2017

CRC errors are usually caused by, in this order: the cable/backplane, the port/controller, or the disk itself. Having so many disks with issues, I believe, rules the disks themselves out. Is disk7 also on the board controller?
tucansam Posted November 21, 2017

No sir, #7 is on an LSI controller on a breakout cable. I have three backplanes, two of which were in use in my old server for nearly five years without issues.
HellDiverUK Posted November 21, 2017

How is the airflow over the controllers? LSI controllers generally have small heatsinks because they're designed to be used in servers with forced cooling. In low-airflow machines they can get well over 100°C and freak out. Might be worth adding a fan blowing over the controllers; even a low-RPM 80mm fan would do the job.
JorgeB Posted November 21, 2017

It's unusual, but it could be the power supply, since it's affecting so many disks.
tucansam Posted November 21, 2017

There are six 140mm fans and one 120mm fan in the case, all of which are doing evac duty, so there is a ton of air moving through the case. Good call though. I will replace the power supply. This has been an ongoing issue for almost a year, and the power supply is one thing I haven't yet swapped out to test.
tucansam Posted January 4, 2018

Disk 7 is at it again: 731 CRC errors in the last 48 hours. I think it's time to send it back to Seagate. Will a long SMART test prove valuable?
JorgeB Posted January 24, 2018

On 04/01/2018 at 5:04 AM, tucansam said:

Will a long SMART test prove valuable?

No, CRC errors are connection related.
tucansam Posted January 24, 2018

I have replaced the cable four or five times at this point, and have tried multiple ports on three different controllers. Disk problem?
JorgeB Posted January 24, 2018

If it's just one disk, possibly; if it's more than one, it could also be a bad PSU.
Maticks Posted January 24, 2018

I had this happen a day after turning on ACS override in my VM config, with 3 disks on my LSI controller. I haven't made any config changes recently, and everything looked fine before: two disks on the LSI, one on the onboard controller. I also thought SATA cables couldn't be the issue.
Frank1940 Posted January 24, 2018

17 minutes ago, Maticks said:

I also thought SATA cables couldn't be the issue.

No. Exactly the opposite. See here: https://en.wikipedia.org/wiki/S.M.A.R.T.#Known_ATA_S.M.A.R.T._attributes

Look for the attribute number (199, UDMA CRC Error Count) and read what it means. This is an error picked up during transmission using the SATA protocol. Even if the disk is sending bad data out on the SATA cable, that bad data should still pass the ICRC (Interface Cyclic Redundancy Check) when it is decoded at the other end.

EDIT: make sure that you haven't bundled all of the SATA cables together for appearance's sake. This can cause cross-talk between the cables unless they are shielded, and very few shielded SATA cables are made these days.
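To make the ICRC idea concrete, here is a toy sketch. The SATA link does not literally use zlib's CRC-32, but the detect-by-recompute principle is the same: the receiver recomputes the checksum over the frame and compares it with the one the sender computed, so a single flipped bit on the wire is always caught:

```python
# Toy illustration of an interface CRC: flip one bit "on the wire" and
# the recomputed checksum no longer matches the one that was sent.
import zlib

payload = b"example sector data"            # stand-in for a frame's payload
sent_crc = zlib.crc32(payload)              # checksum computed by the sender

corrupted = bytearray(payload)
corrupted[3] ^= 0x08                        # simulate a single bit error
received_crc = zlib.crc32(bytes(corrupted)) # checksum recomputed by receiver

print(received_crc == sent_crc)             # False: the error is detected
```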
SSD Posted January 24, 2018

21 minutes ago, Maticks said:

…

You mentioned data corruption. That is most concerning. CRC errors are being detected and (supposedly) corrected. Although we don't like to see them, they are more likely to cause drives to drop offline or slow down; they should never cause corruption. If you are seeing CRC errors and corruption together, that is very concerning.

I might suggest a non-correcting parity check. It would be very good to know if this condition is causing sync errors. If you start seeing lots of sync errors, you can stop it; but if you are not seeing them, I'd let it run. It will satisfy the monthly check.

I would mention that no sync errors doesn't mean you're not getting corrupted data, only that whatever data you are getting is being used to consistently update the array. If that data is bad, the array can be consistently updated with bad data.

I would suggest going back to a stock configuration with no Dockers, no VMs, and no ACS override, and seeing if the issues go away. If they don't, and you're seeing corruption, you'd probably want to stop the array and carefully consider options. Could be the MB?? EMF? Who knows. But I would not want to subject the array to more abuse if you are reporting data corruption. You could remove the disks (easier if you have hot-swap) and insert some older disks for your testing. Then slowly make changes and monitor. If you have a backup MB/CPU, you might try switching over.

My advice on these types of things is to try to get back to a working configuration, and then slowly make changes and monitor for the problems to come back, rather than slowly dismantle and wait for the problems to go away. For whatever reason, the latter never works for me.
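One way to turn "some of my files are corrupt" into hard evidence is a checksum manifest you can regenerate and diff after each hardware change. A minimal sketch, assuming Python 3.8+ on the server; the share path is only an example:

```python
# Sketch: print "hash  path" for every file under a root directory.
# Redirect the output to a manifest file, re-run after hardware changes,
# and diff the two manifests; any unmodified file whose hash changed has
# been silently corrupted. The default path below is an example.
import hashlib
import os
import sys

def sha256(path, bufsize=1 << 20):
    """Hash a file in 1 MiB chunks so large videos don't exhaust RAM."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(bufsize):
            h.update(chunk)
    return h.hexdigest()

root = sys.argv[1] if len(sys.argv) > 1 else "/mnt/user/Pictures"
for dirpath, _, files in os.walk(root):
    for name in sorted(files):
        p = os.path.join(dirpath, name)
        try:
            print(f"{sha256(p)}  {p}")
        except OSError as e:
            print(f"ERROR {p}: {e}", file=sys.stderr)
```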
pwm Posted January 24, 2018

6 hours ago, SSD said:

You mentioned data corruption. That is most concerning. CRC errors are being detected and (supposedly) corrected. Although we don't like to see them, they are more likely to cause drives to drop offline or slow down; they should never cause corruption. If you are seeing CRC errors and corruption together, that is very concerning.

CRC is designed to catch most errors, not all. With too long a run of bit errors, or two bit errors too far apart, the CRC can fail to notice anything wrong. The CRC here is intended to catch a bit error once in a blue moon, not to filter a continuous stream of more or less damaged data blocks. So in the end, if there is a danger of regularly getting multiple bit errors in the same block, it's just a question of time before a bad transfer fails to get caught.
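A quick way to see that a 32-bit CRC is not collision-proof is a birthday search over random payloads; with only 2^32 possible checksums, two different blocks with the same CRC typically turn up after roughly 2^16 random samples, i.e. a corruption pattern the check would wave through:

```python
# Demonstration that CRC-32 has collisions: randomly generate 6-byte
# "data blocks" until two different ones share the same checksum.
# Expected to finish after ~65k tries (birthday bound on a 32-bit space).
import os
import zlib

seen = {}
tries = 0
while True:
    msg = os.urandom(6)              # random 6-byte block
    tries += 1
    crc = zlib.crc32(msg)
    other = seen.setdefault(crc, msg)
    if other != msg:                 # same CRC, different data: a collision
        print(f"after {tries} tries: {other.hex()} and {msg.hex()} "
              f"share CRC {crc:#010x}")
        break
```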