iripmotoles Posted September 30, 2021 (edited)

tl;dr: Replaced the mainboard, issue fixed. Thanks a lot for the help!

Hi guys,

I recently built a server with the following config:

Gigabyte C246M-WU4
Xeon E-2176G
1x Kingston Server Premier DIMM 32GB ECC
2x WD Black 1TB NVMe Cache
2x WD Red 4TB WD40EFAX
be quiet! Pure Power 11 FM 550W

Everything was running very nicely and happily. Two days ago I additionally installed two Toshiba Enterprise 10TB MG06ACA10TE drives. All disks are connected directly to the MoBo, all on one of the modular power cables of the PSU. Then I started preclearing both new disks.

Yesterday I woke up to find several error notifications:

Unraid dev1 SMART health [199]: Warning - udma crc error count is...

I aborted the preclear, shut down the system, moved the SATA cables from the WD drives to the Toshiba drives and installed new cables on the WD disks. Then I started a new preclear on both Toshiba drives, but the errors kept coming up. I shut down the system again and installed another modular power cable so that there are only two disks per outlet on the PSU. This time I resumed the previous preclear. The errors keep coming.

At first I thought that maybe the parcel with the two disks had been mishandled during transport. But at some point I noticed something strange: the errors almost always happen on both disks in the exact same minute. I am not sure when the one mismatch in the error counts occurred.

At some point I lost patience and put one of the new disks into the array as second parity, but immediately realized my stupidity and took it out again.

The system log is pretty colorful, but I can't make much sense of it. I saved diagnostics before each shutdown except the very first one for the disk install. This screenshot is from today. I can't find anything in the syslog for these times: [screenshot]

Preclear is still running, currently 90% zeroing. I will post results once finished. Any help would be highly appreciated.
Edit: Just found a correlation between SMART notification and syslog:

    Sep 30 14:45:37 3Server kernel: pcieport 0000:00:1d.0: AER: Corrected error received: 0000:05:00.0
    Sep 30 14:45:37 3Server kernel: nvme 0000:05:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
    Sep 30 14:45:37 3Server kernel: nvme 0000:05:00.0: device [15b7:5006] error status/mask=00000001/0000e000

Edited October 9, 2021 by iripmotoles
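To check whether SMART notifications and kernel errors really fall in the same minute, the syslog can be grouped by minute. The following is only a rough sketch: the function name `aer_events_by_minute` and the regular expression are illustrative assumptions, and the sample lines are the three syslog lines quoted in the edit above.

```python
import re
from collections import defaultdict

# Sample lines copied from the syslog excerpt quoted in the post.
SAMPLE = """\
Sep 30 14:45:37 3Server kernel: pcieport 0000:00:1d.0: AER: Corrected error received: 0000:05:00.0
Sep 30 14:45:37 3Server kernel: nvme 0000:05:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
Sep 30 14:45:37 3Server kernel: nvme 0000:05:00.0: device [15b7:5006] error status/mask=00000001/0000e000
"""

# Hypothetical pattern: captures "<Mon> <day> <HH:MM>" as the minute key and
# the PCI address that follows "AER: Corrected error received:".
AER_RE = re.compile(
    r"^(?P<minute>\w{3}\s+\d+\s+\d{2}:\d{2}):\d{2}\s.*"
    r"AER: Corrected error received: (?P<device>[0-9a-f:.]+)"
)


def aer_events_by_minute(lines):
    """Group the PCI addresses of corrected AER errors by the minute they were logged."""
    events = defaultdict(list)
    for line in lines:
        match = AER_RE.search(line)
        if match:
            events[match.group("minute")].append(match.group("device"))
    return dict(events)


if __name__ == "__main__":
    for minute, devices in aer_events_by_minute(SAMPLE.splitlines()).items():
        print(minute, devices)
```

The minute keys can then be compared by hand with the timestamps of the SMART notifications. Syslog timestamps carry no year, so the sketch keeps them as plain strings instead of parsing them into dates.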
JorgeB Posted September 30, 2021

CRC errors are a connection problem, usually the SATA cable. If you've already replaced those, it could be the controller or the disks, or some incompatibility issue. Try the disks in different SATA ports and/or another PC.
iripmotoles Posted September 30, 2021

After first noticing the issues, I hooked up the Toshiba drives to the cables (and the respective ports on the MoBo) that were working fine with the WD drives. So I can pretty much rule out SATA ports and cables. I have a new HP microserver gen10+ sitting here and was planning on putting the 4TB drives in there for my parents' NAS needs as well as cross offsite backup. I could put the Toshiba drives in and try to preclear them there. Is it worth waiting for the preclear results before pulling the drives from my server?
JorgeB Posted September 30, 2021

1 hour ago, iripmotoles said: "Is it worth waiting for the preclear results before pulling the drives from my server?"

As you prefer. As mentioned, those errors are not a disk surface problem, so the preclear should always be successful unless there are other issues.
iripmotoles Posted September 30, 2021

So then what exactly can be gained by trying the disks in another system? If I understand you correctly, I have already narrowed it down to an incompatibility between the Toshiba MG and the C246 chipset? Any thoughts on the NVMe error at the time of the CRC error?
JorgeB Posted September 30, 2021

To confirm whether the problem is with the board or the disks. It can still be a disk problem, just not surface related, i.e., not bad sectors.
iripmotoles Posted September 30, 2021

Got it, thank you. Will try them in the microserver.
iripmotoles Posted October 1, 2021

They are going through without any errors so far: clearing is completed and the post-read is half done. Afterwards I will try them again in my server with yet another set of SATA cables and on different ports of the board. Are there any BIOS settings I could check out, bearing in mind that everything works great with the WD drives?
JorgeB Posted October 1, 2021

30 minutes ago, iripmotoles said: "Any BIOS settings I could check out"

Not that I can think of. You can look for a BIOS update, though.
UhClem Posted October 1, 2021

Weird indeed! BUT it is not a cable issue, nor a disk issue. It is a flaky motherboard, specifically the Intel C246 chipset. (Your syslog.txt files are gory with details.)

[ not an Unraid user; but enjoy weird problems ]
iripmotoles Posted October 1, 2021

Thanks for the feedback guys, very interesting. The board came with the latest BIOS version. Do you mean flaky chipset as in a bad unit or as in a bad model / series? Should I return the board for direct replacement or change to another product? Can you please elaborate a bit on the details in the syslog for my warranty claim?
JorgeB Posted October 1, 2021

It could be a compatibility issue between those disks and that Intel chipset, or that board model in particular. You'd need to test the same disks in another identical board and/or a different board with the same chipset to compare.
iripmotoles Posted October 1, 2021 (edited)

Yes, those would be the next logical steps for fault isolation. However, I don't have either lying around, and I will certainly not buy another board to find out what's wrong with the one I bought only a few weeks ago. It sounded like @UhClem actually found something useful in my syslog, and I'd much rather make a well-founded warranty claim on my existing purchase.

Edit: For what it's worth, as you guys predicted, the preclear + post-read on the microserver came out with zero new errors on the new disks. Maybe this thread should be moved to General Support?

Edited October 1, 2021 by iripmotoles
UhClem Posted October 1, 2021

The (unique/specific) C246 chipset on your own [3Server] motherboard is flaky. (You should get a direct replacement.) The best documentation (readily available) for the issue is the output of:

    grep -e "ATA-10" -e "AER: Corr" -e "FPDMA QUE" -e "USB disc" syslog.txt

Use the syslog.txt from the 20210930-0723 .zip file. It has all 4 HDDs throwing errors. The "ATA-10" pattern just documents which HDD is ata[1357].00. I'm pretty sure there are also relevant NIC errors in there, but I'm networking-ignorant. Note that all of these errors emanate from devices on the C246.

Please examine a syslog.txt from your test-run on your Gen10 MS+; that box also uses the C246, but its syslog.txt will have none of these errors.

Attached is the output from the above command (filtered through uniq -c, for brevity): c246.txt
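For anyone who wants a per-pattern tally rather than the raw grep output, the same four patterns can be counted with a short script. This is only a rough analogue of the command above, not the exact grep and uniq -c pipeline used for the attachment: the function names `count_pattern_hits` and `summarize_syslog` are made up for illustration, and the patterns are treated as plain substrings (which is equivalent here because none of them contains regex metacharacters).

```python
from collections import Counter

# The four patterns from the grep command in the post, used as plain substrings.
PATTERNS = ("ATA-10", "AER: Corr", "FPDMA QUE", "USB disc")


def count_pattern_hits(lines, patterns=PATTERNS):
    """Count how many lines contain each pattern; one line may match several."""
    hits = Counter()
    for line in lines:
        for pattern in patterns:
            if pattern in line:
                hits[pattern] += 1
    return hits


def summarize_syslog(path="syslog.txt"):
    """Tally pattern hits in a syslog file (the path is an assumption)."""
    with open(path, errors="replace") as handle:
        return count_pattern_hits(handle)


if __name__ == "__main__":
    for pattern, count in summarize_syslog().items():
        print(f"{count:6d}  {pattern}")
```

Running the script on both the server's syslog.txt and the one from the microserver gives two tallies that can be compared side by side, which is the comparison suggested in the post.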
iripmotoles Posted October 2, 2021

Brilliant, thank you! Very helpful. I will contact the vendor.