Random reboot - machine check events - MC17 - DRAM ECC Error

Kyle W · February 7, 2023

I noticed last night that my Mac VM was not running, even though I had not touched my NAS all weekend (my BlueBubbles client on my Android was not connected). When I logged into the web GUI, there was a notification of an unclean shutdown and a parity check in progress. The power did not go out (I have a UPS which triggers a shutdown within 120 seconds), and I asked my family if anyone had touched the NAS, but they hadn't.

Looking at the logs, there seems to be a significant number of machine check events with the same IPID and syndromes 0x7e3a00100a800a02 and 0xc3f501000a800a02. I'm assuming there was a crash that resulted in a reboot of the system.

Specs:

ASRock B450M Pro4, BIOS P5.40

Ryzen 5700X

Nemix 2x16GB DDR4-3200 unbuffered ECC RAM

3x 8TB WD Red Plus

1x 1TB Samsung 980 Pro Cache Drive

Nvidia Quadro P4000 (only using for video out so I can modify BIOS settings, etc right now)

Here's an example of the MCE, diagnostics also attached:

Feb  6 15:06:14 Kyle-Server kernel: mce: [Hardware Error]: Machine check events logged
Feb  6 15:06:14 Kyle-Server kernel: [Hardware Error]: Corrected error, no action required.
Feb  6 15:06:14 Kyle-Server kernel: [Hardware Error]: CPU:0 (19:21:2) MC17_STATUS[-|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|Scrub]: 0x9c2041000000011b
Feb  6 15:06:14 Kyle-Server kernel: [Hardware Error]: Error Addr: 0x00000000b5a68b40
Feb  6 15:06:14 Kyle-Server kernel: [Hardware Error]: IPID: 0x0000009600050f00, Syndrome: 0x7e3a00100a800a02
Feb  6 15:06:14 Kyle-Server kernel: [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
Feb  6 15:06:14 Kyle-Server kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x3169a2 offset:0xc40 grain:64 syndrome:0x10)
Feb  6 15:06:14 Kyle-Server kernel: [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
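
For anyone reading the MC17_STATUS value above, here is a minimal Python sketch that turns the raw 64-bit status into the flag names the kernel prints. The bit positions in `STATUS_BITS` are my assumption based on the flags Linux reports for AMD SMCA banks, and `decode_mca_status` is just an illustrative helper, so check both against your kernel source or AMD's documentation before relying on them.

```python
# Assumed bit positions for AMD MCA status flags (verify for your platform).
STATUS_BITS = {
    "Val": 63,       # register contains a valid error
    "Over": 62,      # an earlier error was overwritten
    "UC": 61,        # uncorrected error
    "En": 60,        # error reporting enabled
    "MiscV": 59,     # MISC register valid
    "AddrV": 58,     # ADDR register valid
    "PCC": 57,       # processor context corrupt
    "TCC": 55,       # task context corrupt
    "SyndV": 53,     # syndrome register valid
    "CECC": 46,      # corrected by ECC
    "UECC": 45,      # uncorrectable ECC error
    "Deferred": 44,  # deferred error
    "Poison": 43,    # poisoned data consumed
    "Scrub": 40,     # found by a scrub operation
}


def decode_mca_status(status: int) -> list[str]:
    """Return the names of all flags set in a 64-bit MCA status value."""
    return [name for name, bit in STATUS_BITS.items() if (status >> bit) & 1]


# Status value copied from the log excerpt above.
print(decode_mca_status(0x9C2041000000011B))
```

Under those assumptions, the value from the log decodes to a valid, ECC-corrected error with UC and UECC both clear, which agrees with the "Corrected error, no action required" line.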

Any advice for troubleshooting this one? The RAM is running at its specified 3200MHz speed and I've not touched any other settings in the BIOS except for custom fan curves on my chassis fans. I did see a note about setting typical idle current in the BIOS, so I will look into that later today when the parity check is finished.

All the listed components are new except for the Quadro. I've been running the RAM since December 31 with no other issues, and the current CPU since January 19, also with no noticeable issues, though I have not been keeping an eye on the logs. (Before the 5700X I had a Ryzen 5700G, which I returned due to its lack of ECC support, and then a Ryzen 1600X for about a week while I was waiting for the 5700X to arrive.)

I did actually have a fan controller failure about a week ago, and the machine shut down from what I assumed to be a CPU over-temperature condition. I'm hoping the CPU wasn't damaged by this, though it is still within the return window. The RAM would require a manufacturer RMA at this point. Two of the WD Reds hit 47C and 50C, which has me a bit freaked out, but that's within their operating temperature range.

server-diagnostics-20230207-1026.zip

apandey · February 8, 2023

Download and run memtest86 to start with

Kyle W · February 8, 2023

Note: tested the typical idle current setting and that did not fix the issue. Running Memtest today.

Kyle W · February 9, 2023

Memtest passed 3 times after running for roughly 24 hours. I am going to swap in my Ryzen 1600X to see if the CPU is causing problems.

Kyle W · February 10, 2023

Well, unfortunately the CPU swap did not fully resolve the issue. The errors are different now, but not gone. I'm going to reach out to Nemix regarding an RMA on the RAM.

Feb  9 22:53:27 Kyle-Server kernel: mce: [Hardware Error]: Machine check events logged
Feb  9 22:53:27 Kyle-Server kernel: [Hardware Error]: Corrected error, no action required.
Feb  9 22:53:27 Kyle-Server kernel: [Hardware Error]: CPU:0 (17:1:1) MC15_STATUS[-|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0x9c2040000000011b
Feb  9 22:53:27 Kyle-Server kernel: [Hardware Error]: Error Addr: 0x00000000b3052b60
Feb  9 22:53:27 Kyle-Server kernel: [Hardware Error]: IPID: 0x0000009600050f00, Syndrome: 0x000004c60a400a02
Feb  9 22:53:27 Kyle-Server kernel: [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
Feb  9 22:53:27 Kyle-Server kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 (csrow:2 channel:0 page:0x1a60a5 offset:0x760 grain:64 syndrome:0x4c6)
Feb  9 22:53:27 Kyle-Server kernel: [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
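
To see whether the corrected errors keep landing in the same place, the EDAC lines in the syslog can be tallied by memory controller, csrow and channel. This is a rough Python sketch: the regular expression is only fitted to the two EDAC lines quoted in this thread, and `tally_corrected_errors` is a made-up helper name, so other kernels or EDAC drivers may need a different pattern.

```python
import re
from collections import Counter

# Pattern fitted to the "EDAC MCn: N CE on ... (csrow:X channel:Y ..." lines above.
EDAC_CE = re.compile(r"EDAC MC(\d+): (\d+) CE on \S+ \(csrow:(\d+) channel:(\d+)")


def tally_corrected_errors(lines):
    """Count corrected errors per (memory controller, csrow, channel)."""
    counts = Counter()
    for line in lines:
        match = EDAC_CE.search(line)
        if match:
            mc, n_errors, csrow, channel = map(int, match.groups())
            counts[(mc, csrow, channel)] += n_errors
    return counts


# The two EDAC lines from the excerpts in this thread (5700X, then 1600X).
sample = [
    "Feb  6 15:06:14 Kyle-Server kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 "
    "(csrow:2 channel:0 page:0x3169a2 offset:0xc40 grain:64 syndrome:0x10)",
    "Feb  9 22:53:27 Kyle-Server kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#0 "
    "(csrow:2 channel:0 page:0x1a60a5 offset:0x760 grain:64 syndrome:0x4c6)",
]

for location, count in tally_corrected_errors(sample).items():
    print(location, count)
```

Both excerpts report csrow 2, channel 0, before and after the CPU swap. That is consistent with a single DIMM or slot being at fault rather than the CPU, although how csrow and channel map to a physical slot depends on the board.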

kyle-server-diagnostics-20230209-2331.zip

apandey · February 10, 2023

Have you tried running with a single RAM stick at a time to eliminate any bad ones? Normally, if it's stable in memtest, it should not cause problems. I hope you ran the latest downloaded memtest, not the one bundled with Unraid.

Kyle W · February 10, 2023

7 hours ago, apandey said:

Have you tried running with a single RAM stick at a time to eliminate any bad ones? Normally, if it's stable in memtest, it should not cause problems. I hope you ran the latest downloaded memtest, not the one bundled with Unraid.

I'll try this next. I did download the latest memtest and created a bootable USB.

Kyle W · February 14, 2023

Well it's been running for over 24 hours with no errors after pulling the second stick. Hopefully I've isolated the bad stick!

Lolight · February 14, 2023

On 2/14/2023 at 12:20 AM, Kyle W said:

Well it's been running for over 24 hours with no errors after pulling the second stick. Hopefully I've isolated the bad stick!

A 24-hour test is not really required to isolate bad RAM.

10 passes without errors would be sufficient at first.

Run each stick separately for 10 passes.

If errors are encountered try a different slot.

If no errors then run the sticks together for another 10 passes.

If errors are encountered then the memory controller or memory slots could be at fault.

In that case try switching RAM slots and do it again.

Edited February 15, 2023 by Lolight

Kyle W · March 3, 2023

I ran the machine for about 20 days without errors after removing the single stick I previously isolated. Nemix issued an RMA and I mailed the stick in; a little over a week later I had a replacement in hand. It has been over 24 hours without any errors so far, so I'm hoping things are resolved.

duelistjp · March 12, 2023

On 2/13/2023 at 10:17 PM, Lolight said:

A 24-hour test is not really required to isolate bad RAM.

10 passes without errors would be sufficient at first.

Run each stick separately for 10 passes.

If errors are encountered try a different slot.

If no errors then run the sticks together for another 10 passes.

If errors are encountered then the memory controller or memory slots could be at fault.

In that case try switching RAM slots and do it again.

It depends on how many sticks you have. If you have more than 2, you are better off with a binary search, testing half at a time. That makes a pretty big difference if you are running server hardware with 16 sticks of RAM.
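
The half-at-a-time idea can be sketched in Python as below. It rests on two assumptions: exactly one stick is bad, and the fault follows the stick rather than the slot. `find_bad_stick` and the `group_has_errors` callback are hypothetical names; in practice the callback stands for installing that group of sticks and running memtest on it.

```python
from typing import Callable, Sequence


def find_bad_stick(
    sticks: Sequence[str],
    group_has_errors: Callable[[Sequence[str]], bool],
) -> str:
    """Bisect a set of sticks, assuming exactly one of them is faulty."""
    if not sticks:
        raise ValueError("need at least one stick to test")
    candidates = list(sticks)
    while len(candidates) > 1:
        half = candidates[: len(candidates) // 2]
        # Test only the first half; a clean run means the fault is in the rest.
        if group_has_errors(half):
            candidates = half
        else:
            candidates = candidates[len(half):]
    return candidates[0]


# Simulated run: 16 sticks, one of which is (hypothetically) bad.
sticks = [f"DIMM{i}" for i in range(16)]
bad = "DIMM11"
test_runs = []


def simulated_memtest(group):
    test_runs.append(list(group))
    return bad in group


print(find_bad_stick(sticks, simulated_memtest), "found after", len(test_runs), "test runs")
```

With 16 sticks this takes 4 test runs instead of up to 16 single-stick runs. If more than one stick is bad, or the slot itself is the problem, the result can be misleading, so a final single-stick test of the suspect is still worth doing.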
