Gico, posted September 27, 2023 (edited September 27, 2023)

Hi. I'm testing the memory of new server hardware and got these ECC errors this morning, after about 60 hours of memtest in total, interrupted several times. The total error count is zero, so the hardware corrected these errors.

I didn't mean to run memtest this long, but I had three power failures at home and the server is not connected to a UPS, so memtest restarted each time and continued testing. One full test pass (~20 hours) completed without any errors or ECC errors. These errors began this morning, about 15 minutes after the third power failure. All logs (found on the memtest86 USB stick) besides the current one show no errors and no corrected ECC errors.

Any recommendations? Should I start looking for a malfunctioning DIMM? Maybe run several passes of only the failed tests, this time with a UPS? Each pass would take about 4 hours.

The hardware was bought used:
MB: Supermicro HL12SSL-I
CPU: EPYC 7302 + 4U fan
Mem: 8 x 64GB Samsung PC4-2666V Registered ECC DDR4
PSU: Corsair HX1200i
JorgeB, posted September 27, 2023

Gico said: Should I start looking for a malfunctioning DIMM?

Look at the system event log in the BIOS, or the IPMI log; it may show the affected DIMM.
Gico, posted September 27, 2023

I didn't find anything relevant. The "Health Event Log" in the IPMI has similar errors from 2021. The BIOS only had configuration options for the system event log, not the log entries themselves. I also found an "SMBIOS event log", which wasn't relevant.
JorgeB, posted September 27, 2023

In that case you'll probably need to run with one or a few sticks at a time.
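Testing a few sticks at a time amounts to a search by elimination. As a rough sketch only, assuming exactly one faulty stick and an error that reproduces reliably within each run (neither is certain here, since the errors are rare and intermittent), halving the candidate set each round might look like this. The names `bisect_faulty` and `shows_errors` and the `DIMM1`..`DIMM8` labels are hypothetical; `shows_errors` stands in for a manual memtest run on the given group.

```python
def bisect_faulty(sticks, shows_errors):
    """Narrow a list of sticks down to one suspect by testing halves.

    Assumes exactly one stick is faulty and that shows_errors(group)
    returns True whenever the faulty stick is in the tested group.
    """
    candidates = list(sticks)
    rounds = 0
    while len(candidates) > 1:
        half = candidates[: len(candidates) // 2]
        rounds += 1
        # Keep whichever half reproduces the errors.
        if shows_errors(half):
            candidates = half
        else:
            candidates = candidates[len(half):]
    return candidates[0], rounds


# Simulated example: pretend the (hypothetical) stick "DIMM6" is the bad one.
sticks = [f"DIMM{i}" for i in range(1, 9)]
suspect, rounds = bisect_faulty(sticks, lambda group: "DIMM6" in group)
print(suspect, rounds)
```

With 8 sticks this takes three rounds of testing. It does not separate a bad stick from a bad slot, so a suspect stick would still need a re-test in a different slot.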
Gico, posted October 6, 2023

I tested the memory sticks.
8 passes with 4 sticks completed without any errors.
8 passes with the other 4 sticks (using the same slots as the previous 4) completed without any errors.
8 passes with all 8 sticks together completed with 1 corrected ECC error.

This might indicate that one of the slots used by sticks 5-8 has an issue, but I doubt it.

I don't know whether this has anything to do with the ECC errors, but the CPU fan is adjacent to, and actually touching, one of the sticks. In my initial testing I installed the fan next to that stick, so the fan pushed the stick sideways a little. For the later tests I installed the stick under the fan, so it might have pushed the fan up a little, as seen in the screenshot.