[Solved] Weird CRC Errors [199] during simultaneous preclear of two new drives



tl;dr: Replaced the mainboard, issue fixed. Thanks a lot for the help!

 

Hi guys,

 

I recently built a server with the following config:

Gigabyte C246M-WU4

Xeon E-2176G

1x Kingston Server Premier DIMM 32GB ECC

2x WD Black 1TB NVMe Cache

2x WD Red 4TB WD40EFAX

be quiet! Pure Power 11 FM 550W

 

Everything was running very nicely and happily. Two days ago I installed two additional drives, Toshiba Enterprise 10TB MG06ACA10TE. All disks are connected directly to the MoBo, all on one of the modular power cables of the PSU.

Then I started preclearing both new disks. Yesterday I woke up to find several error notifications.

Unraid dev1 SMART health [199]:

Warning - udma crc error count is...

 

I aborted the preclear, shut down the system, moved the SATA cables from the WD drives to the Toshiba drives, and installed new cables on the WD disks. Then I started a new preclear on both Toshiba drives, but the errors kept coming up. I shut down the system again and installed another modular power cable, so there are now only two disks per outlet on the PSU.

This time I resumed the previous preclear. The errors keep coming. At first I thought that maybe the parcel with the two disks was mishandled during transport. But at some point I noticed something strange: the errors almost always happen on both disks in the exact same minute. I'm not sure when I got the one mismatched error.
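The "same minute on both disks" observation can be checked mechanically. Below is a minimal Python sketch, assuming the error notifications have already been collected as (device, timestamp) pairs; the function name `shared_error_minutes` and the input shape are hypothetical and not part of Unraid or the preclear plugin.

```python
from collections import defaultdict
from datetime import datetime
from typing import Dict, Iterable, List, Tuple


def shared_error_minutes(
    events: Iterable[Tuple[str, datetime]],
) -> Dict[datetime, List[str]]:
    """Return {minute: sorted device names} for every minute in which
    two or more distinct devices reported an error.

    How the (device, timestamp) pairs are gathered (notifications,
    syslog, SMART polling) is deliberately left open.
    """
    by_minute = defaultdict(set)
    for device, when in events:
        # Truncate to the minute so near-simultaneous errors share a key.
        by_minute[when.replace(second=0, microsecond=0)].add(device)
    return {
        minute: sorted(devices)
        for minute, devices in sorted(by_minute.items())
        if len(devices) >= 2
    }
```

If most error minutes show up in the result with both new disks listed, that points away from two independently damaged drives and toward something they share, such as the controller, power, or board.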

At some point I lost patience and put one of the new disks into the array as second parity but immediately realized my stupidity and took it out again.

 

System log is pretty colorful but I can't make much sense of it. I saved diagnostics before each shutdown except the very first one for the disk install.

This screenshot is from today. Can't find anything in the syslog for these times:

 

[Screenshot from 2021-09-30 14:05:45]

 

Preclear is still running, currently 90% zeroing. I will post results once finished.

Any help would be highly appreciated.

 

Edit: Just found a correlation between SMART notification and syslog:

 

Sep 30 14:45:37 3Server kernel: pcieport 0000:00:1d.0: AER: Corrected error received: 0000:05:00.0
Sep 30 14:45:37 3Server kernel: nvme 0000:05:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
Sep 30 14:45:37 3Server kernel: nvme 0000:05:00.0: device [15b7:5006] error status/mask=00000001/0000e000
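To line up SMART notification times with the syslog, the AER lines can be extracted programmatically. This is a rough Python sketch that only assumes the line layout shown in the three log lines above; the names `AER_LINE` and `parse_aer_corrected` are made up for illustration.

```python
import re

# Matches kernel lines of the form seen above, e.g.
# "... kernel: pcieport 0000:00:1d.0: AER: Corrected error received: 0000:05:00.0"
AER_LINE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+\S+\s+kernel:\s+"
    r"pcieport\s+(?P<port>[0-9a-f:.]+):\s+AER:\s+Corrected error received:\s+"
    r"(?P<source>[0-9a-f:.]+)\s*$"
)


def parse_aer_corrected(lines):
    """Yield (timestamp_text, root_port, source_device) for each
    'AER: Corrected error received' line; all other lines are skipped."""
    for line in lines:
        match = AER_LINE.match(line)
        if match:
            yield match.group("stamp"), match.group("port"), match.group("source")
```

Feeding it the lines of syslog.txt gives a list of timestamps and PCI addresses that can be compared against the times of the SMART [199] warnings.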

[Screenshot from 2021-09-30 14:53:03]

 

Edited by iripmotoles

After first noticing the issues, I hooked up the Toshiba drives to the cables (and the respective ports on the MoBo) that were working fine with the WD drives. So I can pretty much rule out SATA ports and cables.

I have a new HP MicroServer Gen10+ sitting here and was planning on putting the 4TB drives in there for my parents' NAS needs as well as for cross offsite backup. I could put the Toshiba drives in and try to preclear them there. Is it worth waiting for the preclear results before pulling the drives from my server?


They are going through without any errors so far: clearing is complete and the post-read is half done. Afterwards I will try them again in my server with yet another set of SATA cables and on different ports of the board. Are there any BIOS settings I could check, bearing in mind that everything works great with the WD drives?


Thanks for the feedback guys, very interesting. The board came with the latest BIOS version.


Do you mean flaky chipset as in a bad unit or as in a bad model / series? Return the board for direct replacement or change to another product? Can you please elaborate a bit on the details in the syslog for my warranty claim?


Yes, those would be the next logical steps for fault isolation. However, I don't have either lying around, and I will certainly not buy another board to find out what's wrong with the one I bought only a few weeks ago.

It sounded like @UhClem actually found something useful in my syslog, and I'd much rather make a well-founded warranty claim on my existing purchase.

 

Edit: For what it's worth, as you guys predicted the preclear + post-read on the microserver came out with zero new errors on the new disks. Maybe this thread should be moved to General Support?

Edited by iripmotoles

The (unique/specific) C246 chipset on your own [3Server] motherboard is flaky. (You should get a direct replacement.)

 

The best documentation (readily available) for the issue is the output of:

grep -e "ATA-10" -e "AER: Corr" -e "FPDMA QUE" -e "USB disc" syslog.txt

Use the syslog.txt from the 20210930-0723 .zip file. It has all 4 HDDs throwing errors. The "ATA-10" pattern just documents which HDD is ata[1357].00. I'm pretty sure there are also relevant NIC errors in there, but I'm networking-ignorant. Note that all of these errors emanate from devices on the C246. Please examine a syslog.txt from your test-run on your Gen10 MS+; that box also uses the C246, but its syslog.txt will have none of these errors.

 

Attached is the output from the above command (filtered thru uniq -c, for brevity).

c246.txt
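For anyone who wants to reproduce that filtered view without a shell, here is a small Python sketch of the same idea: keep lines containing any of the four patterns from the grep command, then collapse consecutive identical lines with a count, as `uniq -c` does. The function name `grep_uniq_count` is hypothetical, and plain substring matching is assumed to be equivalent here because none of the patterns contain regex metacharacters.

```python
from itertools import groupby

# The four patterns from the grep command in the post.
PATTERNS = ("ATA-10", "AER: Corr", "FPDMA QUE", "USB disc")


def grep_uniq_count(lines, patterns=PATTERNS):
    """Return [(count, line), ...] for lines containing any pattern,
    with runs of identical consecutive lines collapsed into one entry."""
    matched = (
        line.rstrip("\n")
        for line in lines
        if any(pattern in line for pattern in patterns)
    )
    # groupby only merges adjacent equal items, which matches `uniq -c`.
    return [(sum(1 for _ in group), text) for text, group in groupby(matched)]
```

A typical call would open syslog.txt from the diagnostics zip, pass the file object as `lines`, and print each `(count, line)` pair. Lines that differ only in their timestamp are not merged, the same as with plain `uniq -c`.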

 

