tucansam Posted November 21, 2017

Every single one of my data disks has CRC errors. They are spread across the motherboard controller and two PCIe controllers, with a combination of several brands of SATA cables, including two breakout cables coming off an LSI. What could be causing this?

Secondarily, and perhaps related, I've noticed that some of my files are corrupt. While looking through old pictures, many only load halfway or not at all, and some home videos don't play anymore.
Frank1940 Posted November 21, 2017

I would suggest that you post up a diagnostics file in a new post (Tools >>> Diagnostics).
tucansam Posted November 21, 2017

Thanks!

ffs2-diagnostics-20171121-0952.zip
JorgeB Posted November 21, 2017

In the meantime you can also acknowledge the current values on all disks and see if they continue to rise.
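For anyone who wants to watch those counters between reboots, here is a minimal sketch, assuming smartmontools (smartctl) is installed and the script is run as root; the device names are placeholders, not the OP's actual layout:

```python
# Minimal sketch: poll SMART attribute 199 (UDMA_CRC_Error_Count) on a
# few disks and report any increase over the baseline. Requires
# smartmontools; run as root. Device names below are examples only.
import re
import subprocess
import time

DISKS = ["/dev/sdb", "/dev/sdc", "/dev/sdd"]  # adjust to your array

def crc_count(dev):
    """Return the raw value of attribute 199, or None if not reported."""
    out = subprocess.run(["smartctl", "-A", dev],
                         capture_output=True, text=True).stdout
    for line in out.splitlines():
        if re.match(r"\s*199\s", line):
            return int(line.split()[-1])  # raw value is the last column
    return None

baseline = {dev: crc_count(dev) for dev in DISKS}
while True:
    time.sleep(600)  # re-check every 10 minutes
    for dev, start in baseline.items():
        now = crc_count(dev)
        if start is not None and now is not None and now > start:
            print(f"{dev}: CRC error count rose from {start} to {now}")
```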
JorgeB Posted November 21, 2017

On the current diags they're happening to this disk only:

Nov 18 05:14:44 ffs2 kernel: ata5.00: ATA-9: WDC WD60EZRZ-00RWYB1, WD-WX81D6550SLK, 80.00A80, max UDMA/133

Replace the SATA cable.
tucansam Posted November 21, 2017

I changed SATA cables on that disk and rebooted, and two disks (including that one) immediately had 50+ CRC errors. So I changed SATA ports (both disks were, and still are, on motherboard ports; I just moved them to two unused ones), rebooted again, and now Disk 7 has 653 CRC errors (completely unrelated to the previous errors). Bad motherboard? RAM? Power supply?
JorgeB Posted November 21, 2017

CRC errors are usually caused by, in this order: the cable/backplane, the port/controller, or the disk itself. Having so many disks with issues, I believe, rules the disks themselves out. Is disk7 also on the board controller?
tucansam Posted November 21, 2017

No sir, #7 is on an LSI controller on a breakout cable. I have three backplanes, two of which were in use in my old server for nearly five years without issues.
HellDiverUK Posted November 21, 2017

How is the airflow over the controllers? LSI controllers generally have small heatsinks because they're designed to be used in servers with forced cooling. In low-airflow machines they can get well over 100°C and freak out. Might be worth adding a fan blowing over the controllers; even a low-RPM 80mm fan would do the job.
JorgeB Posted November 21, 2017

It's unusual, but it could be the power supply, since it's affecting so many disks.
tucansam Posted November 21, 2017

There are six 140mm fans and one 120mm fan in the case, all of which are doing evac duty, so there is a ton of air moving through the case. Good call though. I will replace the power supply. This has been an ongoing issue for almost a year, and the power supply is one thing I haven't yet swapped out to test.
tucansam Posted January 4, 2018

Disk 7 is at it again: 731 CRC errors in the last 48 hours. I think it's time to send it back to Seagate. Will a long SMART test prove valuable?
JorgeB Posted January 24, 2018

On 04/01/2018 at 5:04 AM, tucansam said:

Will a long SMART test prove valuable?

No, CRC errors are connection related.
tucansam Posted January 24, 2018

I have replaced the cable four or five times at this point, and have tried multiple ports on three different controllers. Disk problem?
JorgeB Posted January 24, 2018

If it's just one disk, possibly; if it's more than one, it could also be a bad PSU.
Maticks Posted January 24, 2018

I had this happen a day after turning on ACS override in my VM config, with 3 disks on my LSI controller. I haven't made any config changes recently, and everything looked fine before: two disks on the LSI, one on the onboard controller. I also thought SATA cables couldn't be the issue.
Frank1940 Posted January 24, 2018

17 minutes ago, Maticks said:

I also thought SATA cables couldn't be the issue.

No. Exactly the opposite. See here: https://en.wikipedia.org/wiki/S.M.A.R.T.#Known_ATA_S.M.A.R.T._attributes

Look for the attribute number (199, UDMA CRC Error Count) and read what it means. This is an error picked up during transmission using the SATA protocol. Even if the disk is sending bad data out on the SATA cable, that bad data should still pass the ICRC (Interface Cyclic Redundancy Check) when it is decoded at the other end.

EDIT: make sure that you haven't bundled all of the SATA cables together for appearance's sake. This can cause cross-talk between the cables unless they are shielded, and very few shielded SATA cables are made these days.
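To make the ICRC idea concrete, here is a toy sketch. The SATA link does not literally use zlib's CRC-32, but the detect-by-recompute principle is the same: the receiver recomputes the checksum over the frame and compares it with the one the sender computed, so a single flipped bit on the wire is always caught:

```python
# Toy illustration of an interface CRC: flip one bit "on the wire" and
# the recomputed checksum no longer matches the one that was sent.
import zlib

payload = b"example sector data"            # stand-in for a frame's payload
sent_crc = zlib.crc32(payload)              # checksum computed by the sender

corrupted = bytearray(payload)
corrupted[3] ^= 0x08                        # simulate a single bit error
received_crc = zlib.crc32(bytes(corrupted)) # checksum recomputed by receiver

print(received_crc == sent_crc)             # False: the error is detected
```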
SSD Posted January 24, 2018

21 minutes ago, Maticks said:

…

You mentioned data corruption. That is most concerning. CRC errors are being detected and (supposedly) corrected. Although we don't like to see them, they are more likely to cause drives to drop offline or slow down; they should never cause corruption. If you are seeing CRC errors and corruption together, that is very concerning.

I might suggest a non-correcting parity check. It would be very good to know if this condition is causing sync errors. If you start seeing lots of sync errors, you can stop it; but if you are not seeing them, I'd let it run. It will satisfy the monthly check.

I would mention that no sync errors doesn't mean you're not getting corrupted data, only that whatever data you are getting is being used to consistently update the array. If that data is bad, the array can be consistently updated with bad data.

I would suggest going back to a stock configuration with no Dockers, no VMs, and no ACS override, and seeing if the issues go away. If they don't, and you're seeing corruption, you'd probably want to stop the array and carefully consider options. Could be the MB?? EMF? Who knows. But I would not want to subject the array to more abuse if you are reporting data corruption. You could remove the disks (easier if you have hot-swap) and insert some older disks for your testing. Then slowly make changes and monitor. If you have a backup MB/CPU, you might try switching over.

My advice on these types of things is to try to get back to a working configuration, and then slowly make changes and monitor for the problems to come back, rather than slowly dismantle and wait for the problems to go away. For whatever reason, the latter never works for me.
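One way to turn "some of my files are corrupt" into hard evidence is a checksum manifest you can regenerate and diff after each hardware change. A minimal sketch, assuming Python 3.8+ on the server; the share path is only an example:

```python
# Sketch: print "hash  path" for every file under a root directory.
# Redirect the output to a manifest file, re-run after hardware changes,
# and diff the two manifests; any unmodified file whose hash changed has
# been silently corrupted. The default path below is an example.
import hashlib
import os
import sys

def sha256(path, bufsize=1 << 20):
    """Hash a file in 1 MiB chunks so large videos don't exhaust RAM."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(bufsize):
            h.update(chunk)
    return h.hexdigest()

root = sys.argv[1] if len(sys.argv) > 1 else "/mnt/user/Pictures"
for dirpath, _, files in os.walk(root):
    for name in sorted(files):
        p = os.path.join(dirpath, name)
        try:
            print(f"{sha256(p)}  {p}")
        except OSError as e:
            print(f"ERROR {p}: {e}", file=sys.stderr)
```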
pwm Posted January 24, 2018

6 hours ago, SSD said:

You mentioned data corruption. That is most concerning. CRC errors are being detected and (supposedly) corrected. Although we don't like to see them, they are more likely to cause drives to drop offline or slow down; they should never cause corruption. If you are seeing CRC errors and corruption together, that is very concerning.

CRC is designed to catch most errors, not all. With too long a run of bit errors, or two bit errors too far apart, the CRC can fail to notice anything wrong. The CRC here is intended to catch a bit error once in a blue moon, not to filter a continuous stream of more or less damaged data blocks. So in the end, if there is a danger of regularly getting multiple bit errors in the same block, it's just a question of time before a bad transfer fails to get caught.
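A quick way to see that a 32-bit CRC is not collision-proof is a birthday search over random payloads; with only 2^32 possible checksums, two different blocks with the same CRC typically turn up after roughly 2^16 random samples, i.e. a corruption pattern the check would wave through:

```python
# Demonstration that CRC-32 has collisions: randomly generate 6-byte
# "data blocks" until two different ones share the same checksum.
# Expected to finish after ~65k tries (birthday bound on a 32-bit space).
import os
import zlib

seen = {}
tries = 0
while True:
    msg = os.urandom(6)              # random 6-byte block
    tries += 1
    crc = zlib.crc32(msg)
    other = seen.setdefault(crc, msg)
    if other != msg:                 # same CRC, different data: a collision
        print(f"after {tries} tries: {other.hex()} and {msg.hex()} "
              f"share CRC {crc:#010x}")
        break
```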