Uncorrectable ECC and HT Sync Errors



Hello,

I've been having some trouble with these random ECC errors. It's always reporting the ECC error on Node 2 with all the other nodes showing an HT Sync error.

 

Specs

Motherboard: SuperMicro H8DGU-F

CPU: Dual AMD Opteron 6278's

RAM: 8x Micron 2GB DDR3 1333MHz ECC RAM (MT18JSF25672PDZ)

Riser Card: SuperMicro RSC-R2UU-UA3E8+

RAID Controller: 2x Dell PERC H310's

Storage:

  • 12x WD RE4's (2 as parity drives)
  • Samsung SSD 850 EVO 500GB (Primary Cache)
  • SAMSUNG SSD SM841N 2.5 7mm 512GB (Secondary Cache)

 

A picture of the error is attached.

 

Memtest came back with no errors.

I've updated the BIOS, shuffled the RAM around, swapped the CPUs between sockets, and even swapped the motherboard. It still shows the exact same error: Node 2 with the uncorrectable ECC error and the other nodes with HT Sync errors.

 

I'm currently testing each of the RAM sticks in pairs. The error takes anywhere from a day to over a week to show up, so it's gonna take a while.

At this point, the only things left to test are the riser card, the RAID controllers, and all the storage. I don't see how any of those could be the cause of an ECC error, but I don't know what else to do.

 

JPEG_20200412_145424.jpg

Link to comment

This has nothing to do with Unraid. It sounds like a hardware failure, incompatible hardware, or outdated firmware/BIOS. Make sure those CPUs are compatible and check the same for the memory. See if there is a BIOS update that addresses this.

 

You said you changed the motherboard? It sounds to me like the board is damaged (like a bent pin on the CPU socket).

Link to comment

Naw, I changed the motherboard after this error was already happening, and I tested with different RAM and with the CPUs swapped. I contacted the seller and they had an extra board that they sent over. Both boards are having the exact same error. I've already checked compatibility. I had to scroll and click through endless docs pages for it, but everything is on the compatibility list on the SuperMicro site.

Edited by Reetik2000
Link to comment

By any chance, does this only come up after a reboot (frozen/hung system or otherwise)? If you log in to the IPMI and look at the logs, does it show anything with more detail? A lot of the references I find to HT link sync errors involve people overclocking their rig and pushing the limits. The memory ECC error doesn't look good, though.

Link to comment

No overclocks, and all the BIOS settings are at their defaults except for obvious things like the boot sequence.

I checked the logs and they show two errors since I installed this second board. The first one shows the ECC error in DIMM 2A (CPU2), while the second one shows it in DIMM 1A (CPU2), which is strange since every time I've checked, it's always been Node 2 like in the screenshot. I'll try to find a way to boot up my last board and pull the logs from there, since it's probably got at least 7 errors in its logs. Right now, I'm running it with only 2 sticks of RAM, one for each CPU, and just waiting for it to crash again. If it does, I'm gonna try it with a different 2 sticks. If it doesn't crash within the next few days, I'm planning to add 2 more sticks of RAM. Basically, I'm gonna repeat this process until I either have all the RAM back in there, which would rule out the RAM being the issue, or I'm left with a couple of sticks that seem to have caused the issue.
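The staged test described above can be written down as a schedule so it's clear how long the whole run takes. This is only a rough sketch: the stick labels, the start date, and the 7-day soak period are all assumptions (7 days was picked because the error has taken up to about a week to appear), and `staged_ram_schedule` is a made-up helper name.

```python
from datetime import date, timedelta


def staged_ram_schedule(sticks, start, step=2, soak_days=7):
    """Return a list of (install_date, installed_sticks) stages.

    Each stage adds `step` more sticks (one per CPU when step=2) and is
    left running for `soak_days` before the next stage begins.
    """
    schedule = []
    # Stage sizes: step, 2*step, ... until every stick is installed.
    for i, n in enumerate(range(step, len(sticks) + step, step)):
        install_date = start + timedelta(days=i * soak_days)
        schedule.append((install_date, sticks[:n]))
    return schedule


if __name__ == "__main__":
    # Hypothetical labels for the 8 DIMMs and a hypothetical start date.
    dimms = ["A", "B", "C", "D", "E", "F", "G", "H"]
    for when, installed in staged_ram_schedule(dimms, date(2020, 4, 13)):
        print(when.isoformat(), "->", ", ".join(installed))
```

If a stage crashes, the sticks added in that stage are the suspects, and the same schedule can be restarted with those two swapped out.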

 

I doubt it's the power supply but I haven't tested it yet.

I don't have another one I could use. Are there any other ways I could go about testing it?

 

I'll check the logs and post here next time it crashes.

Link to comment

Side note on ECC RAM: it's unlikely that memtest will show errors with ECC RAM (because ECC would auto-correct single-bit errors before the test sees them) unless the stick is really bad.

 

So in your case, your current approach is probably the best one, i.e. test the sticks in pairs until you find the pair that crashes.

Also try swapping the CPUs around to see if the errors move with the CPU.
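Since corrected errors won't show up in memtest, one way to watch for them while the box is running is the Linux EDAC counters in sysfs. Here is a minimal sketch, assuming the EDAC driver for the platform is loaded and exposes `/sys/devices/system/edac/mc/mc*/ce_count` and `ue_count`; whether that is the case on a given Unraid build is an assumption, and `read_edac_counts` is just an illustrative helper name. On AMD systems each controller usually corresponds to one node, but treat that mapping as an assumption too.

```python
from pathlib import Path


def read_edac_counts(root="/sys/devices/system/edac/mc"):
    """Return {controller: {"ce": int|None, "ue": int|None}} per memory controller.

    "ce" is the corrected-error count and "ue" the uncorrected-error count.
    A value of None means the counter file was missing or unreadable.
    """
    counts = {}
    for mc in sorted(Path(root).glob("mc[0-9]*")):
        entry = {}
        for key, fname in (("ce", "ce_count"), ("ue", "ue_count")):
            try:
                entry[key] = int((mc / fname).read_text().strip())
            except (OSError, ValueError):
                entry[key] = None
        counts[mc.name] = entry
    return counts


if __name__ == "__main__":
    counts = read_edac_counts()
    if not counts:
        print("No EDAC memory controllers found (is the EDAC driver loaded?)")
    for name, c in counts.items():
        print(f"{name}: corrected={c['ce']} uncorrected={c['ue']}")
```

A corrected count that keeps climbing on one controller while a given pair of sticks is installed points at that pair well before an uncorrectable error takes the system down.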

Link to comment

Thanks @testdasi,

I've already tried swapping the CPUs and it still showed up in node 2.

So far, it hasn't crashed yet with the two sticks of RAM. I was starting to lose hope that it was a RAM issue. Thanks for confirming. 

Link to comment
  • 3 weeks later...

So it's been a couple weeks. I ended up getting a new set of RAM. I found a really good deal and I couldn't stop myself. I installed it in the same way by adding one stick to each CPU every 3 days until I got the whole set installed. I now have 32GB instead of 16GB and it's finally stable.

 

Thanks for the help everyone!

Link to comment
