[Solved] Weird CRC Errors [199] during simultaneous preclear of two new drives

iripmotoles · September 30, 2021

tl;dr: Replaced the mainboard, issue fixed. Thanks a lot for the help!

Hi guys,

I have recently built a server with the following config:

Gigabyte C246M-WU4

Xeon E-2176G

1x Kingston Server Premier DIMM 32GB ECC

2x WD Black 1TB NVMe Cache

2x WD Red 4TB WD40EFAX

be quiet! Pure Power 11 FM 550W

Everything was running very nicely and happily. Two days ago I additionally installed two Toshiba Enterprise 10TB MG06ACA10TE drives. All disks are connected directly to the MoBo, all on one of the modular power cables of the PSU.

Then started preclearing both new disks. Yesterday I woke up to find several error notifications.

Unraid dev1 SMART health [199]:

Warning - udma crc error count is...

I aborted the preclear, shut down the system and swapped the sata cables from the WD to the Toshiba drives and installed new cables on the WD disks. Then I started a new preclear on both Toshiba but the errors kept coming up. Shut down the system again and installed another modular power cable so only two disks per outlet on the PSU.

This time I resumed the previous preclear. The errors keep coming. At first I thought that maybe the parcel with the two disks had been mishandled during transport. But at some point I noticed something strange: the errors almost always happen on both disks in the exact same minute. I'm not sure when the one mismatched error occurred.

At some point I lost patience and put one of the new disks into the array as second parity but immediately realized my stupidity and took it out again.

System log is pretty colorful but I can't make much sense of it. I saved diagnostics before each shutdown except the very first one for the disk install.

This screenshot is from today. Can't find anything in the syslog for these times:

[Attached screenshot: Screenshot from 2021-09-30 14-05-45.png]

Preclear is still running, currently 90% zeroing. I will post results once finished.

Any help would be highly appreciated.

Edit: Just found a correlation between SMART notification and syslog:

Sep 30 14:45:37 3Server kernel: pcieport 0000:00:1d.0: AER: Corrected error received: 0000:05:00.0
Sep 30 14:45:37 3Server kernel: nvme 0000:05:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
Sep 30 14:45:37 3Server kernel: nvme 0000:05:00.0: device [15b7:5006] error status/mask=00000001/0000e000
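To line these kernel messages up against the times of the SMART notifications, the syslog can be bucketed by minute. The following is only a minimal sketch: it assumes the standard syslog prefix seen in the three lines above, and the function name `aer_minutes` is made up for illustration.

```python
import re
from collections import Counter

# Matches the syslog prefix seen above:
# "Sep 30 14:45:37 3Server kernel: <message>"
LINE_RE = re.compile(
    r"^(?P<month>\w{3})\s+(?P<day>\d{1,2})\s+"
    r"(?P<hour>\d{2}):(?P<minute>\d{2}):\d{2}\s+"
    r"(?P<host>\S+)\s+kernel:\s+(?P<msg>.*)$"
)

def aer_minutes(lines):
    """Count 'AER: Corrected error received' kernel messages per minute."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line)
        if m and "AER: Corrected error received" in m["msg"]:
            key = f"{m['month']} {int(m['day']):02d} {m['hour']}:{m['minute']}"
            counts[key] += 1
    return counts

# The three lines quoted in this post.
sample = [
    "Sep 30 14:45:37 3Server kernel: pcieport 0000:00:1d.0: AER: Corrected error received: 0000:05:00.0",
    "Sep 30 14:45:37 3Server kernel: nvme 0000:05:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)",
    "Sep 30 14:45:37 3Server kernel: nvme 0000:05:00.0: device [15b7:5006] error status/mask=00000001/0000e000",
]

for minute, count in sorted(aer_minutes(sample).items()):
    print(minute, count)
```

The per-minute keys can then be compared by hand against the timestamps of the CRC warnings. For a whole log, pass an open file object instead of `sample`.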

Edited October 9, 2021 by iripmotoles

JorgeB · September 30, 2021

CRC errors are a connection problem, usually the SATA cable. If you've already replaced those, it could be the controller or the disks, or some incompatibility issue. Try the disks in different SATA ports and/or another PC.

iripmotoles · September 30, 2021

After first noticing the issues, I hooked up the Toshiba drives to the cables (and the corresponding ports on the MoBo) that were working fine with the WD drives. So I can pretty much rule out SATA ports and cables.

I have a new HP microserver gen10+ sitting here and was planning on putting the 4TB drives in there for my parents' NAS needs as well as for cross offsite backup. I could put the Toshiba drives in and try to preclear them there. Is it worth waiting for the preclear results before pulling the drives from my server?

JorgeB · September 30, 2021

1 hour ago, iripmotoles said:

Is it worth waiting for the preclear results before pulling the drives from my server?

As you prefer. As mentioned, those errors are not a disk surface problem, so the preclear should always be successful unless there are other issues.

iripmotoles · September 30, 2021

So then what can be gained exactly by trying the disks in another system? If I understand you correctly I have already narrowed it down to an incompatibility between the Toshiba MG and the C246 chipset? Any thoughts on the nvme error at the time of the crc error?

JorgeB · September 30, 2021

To confirm whether the problem is with the board or the disks. It can still be a disk problem, just not surface-related, i.e., not bad sectors.

iripmotoles · September 30, 2021

Got it, thank you. Will try them in the microserver.

iripmotoles · October 1, 2021

They are going through without any errors so far, clearing completed and half of post-read. Afterwards I will try them again in my server with yet another set of SATA cables and on different ports of the board. Any BIOS settings I could check out, bearing in mind that everything works great with the WD drives?

JorgeB · October 1, 2021

30 minutes ago, iripmotoles said:

Any BIOS settings I could check out

Not that I can think of, you can look for a BIOS update though.

UhClem · October 1, 2021

Weird indeed!

BUT it is not a cable issue, nor a disk issue.

It is a flaky motherboard, specifically the Intel C246 chipset. (your syslog.txt files are gory with details)

[ not an Unraid user ; but enjoy weird problems ]

iripmotoles · October 1, 2021

Thanks for the feedback guys, very interesting. The board came with the latest BIOS version.

Do you mean flaky chipset as in a bad unit or as in a bad model / series? Return the board for direct replacement or change to another product? Can you please elaborate a bit on the details in the syslog for my warranty claim?

JorgeB · October 1, 2021

Could be some compatibility issue between those disks and that Intel chipset, or that board model in particular. You'd need to test the same disks in another identical board and/or a different board with the same chipset to compare.

iripmotoles · October 1, 2021

Yes, those would be the next logical steps for fault isolation. However, I don't have either lying around and I will certainly not buy another board to find out what's wrong with the one I've bought only a few weeks ago.

It sounded like @UhClem actually found something useful in my syslog and I'd much rather make a well-founded warranty claim on my existing purchase.

Edit: For what it's worth, as you guys predicted the preclear + post-read on the microserver came out with zero new errors on the new disks. Maybe this thread should be moved to General Support?

Edited October 1, 2021 by iripmotoles

UhClem · October 1, 2021

The (unique/specific) C246 chipset on your own [3Server] motherboard is flaky. (You should get a direct replacement.)

The best documentation (readily available) for the issue is the output of:

grep -e "ATA-10" -e "AER: Corr" -e "FPDMA QUE" -e "USB disc" syslog.txt

Use the syslog.txt from the 20210930-0723 .zip file. It has all 4 HDDs throwing errors. The "ATA-10" pattern just documents which HDD is ata[1357].00 . I'm pretty sure there are also relevant NIC errors in there, but I'm networking-ignorant. Note that all of these errors emanate from devices on the C246. Please examine a syslog.txt from your test-run on your Gen10 MS+; that box also uses the C246, but its syslog.txt will have none of these errors.

Attached is the output from the above command (filtered thru uniq -c, for brevity).

c246.txt
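For anyone without grep and uniq at hand, the filter above can be approximated in Python. This is a rough sketch, not the exact procedure used to produce the attachment: the four patterns are taken from the command, the helper names `grep_any` and `uniq_c` are made up, and the sample lines have their timestamps stripped by hand so that repeats collapse (that preprocessing is an assumption).

```python
from itertools import groupby

# The four -e patterns from the grep command; none contain regex
# metacharacters, so plain substring matching behaves the same.
PATTERNS = ("ATA-10", "AER: Corr", "FPDMA QUE", "USB disc")

def grep_any(lines, patterns=PATTERNS):
    """Keep lines containing at least one pattern (like grep -e ... -e ...)."""
    return [ln.rstrip("\n") for ln in lines if any(p in ln for p in patterns)]

def uniq_c(lines):
    """Collapse runs of identical adjacent lines into (count, line), like uniq -c."""
    return [(sum(1 for _ in grp), text) for text, grp in groupby(lines)]

# Hypothetical input: messages from earlier in this thread, timestamps removed.
sample = [
    "pcieport 0000:00:1d.0: AER: Corrected error received: 0000:05:00.0",
    "pcieport 0000:00:1d.0: AER: Corrected error received: 0000:05:00.0",
    "nvme 0000:05:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)",
    "pcieport 0000:00:1d.0: AER: Corrected error received: 0000:05:00.0",
]

for count, text in uniq_c(grep_any(sample)):
    print(f"{count:7d} {text}")

# To run it on a real log (path is whatever your diagnostics zip contains):
# with open("syslog.txt", errors="replace") as fh:
#     for count, text in uniq_c(grep_any(fh)):
#         print(f"{count:7d} {text}")
```

Like uniq, this only merges adjacent duplicates, so lines that still carry differing timestamps will not collapse.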

iripmotoles · October 2, 2021

Brilliant, thank you! Very helpful. I will contact the vendor.
