[6.4.1] MCE Errors in Syslog, looking for help diagnosing the issue

blocker85 · March 2, 2018

I hope this is the appropriate forum for this. If not, feel free to move it or let me know where to repost.

I'm an unRAID newb, but I've spent countless hours tinkering and watching Spaceinvader's tutorials, etc. Most issues I have been able to work out on my own, EXCEPT for this one. Here is my relevant server hardware setup:

Intel S2600CP
2x Intel E5-2650
32GB (8x 4GB) Micron RAM

Software Setup:

unRAID 6.4.1
plugins:
- Community Apps
- Dynamix SSD TRIM
- Fix Common Problems
- Nerd Tools
- rclone
- Unassigned Devices
- User Scripts
Dockers
- Couchpotato
- lidarr
- sabnzbd
- sonarr
- deluge
- headphones
- krusader
- netdata
- ombi
- plexserver
- plexpy
- qbittorrent
- rutorrent
VMs
- macOS High Sierra

On boot, the 2600CP shows no error LEDs and the unRAID syslog is clean. However, after anywhere from 30 minutes to 3 hours (it varies from reboot to reboot), the 2600CP shows a blinking amber error LED and one or two RAM slots light their amber error LEDs. The unRAID syslog then starts reporting MCE errors.
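For anyone following along, here is a rough sketch of how the MCE lines could be pulled out of a saved syslog for review. The search patterns are an assumption about how the kernel words machine-check reports, and `find_mce_lines` is a made-up helper name, not part of unRAID.

```python
import re

# Assumed patterns: substrings commonly seen in Linux machine-check reports.
# Adjust them to match what your own syslog actually contains.
MCE_PATTERN = re.compile(r"\bmce\b|machine check|hardware error", re.IGNORECASE)


def find_mce_lines(lines):
    """Return the log lines that look like machine-check reports."""
    return [line.rstrip("\n") for line in lines if MCE_PATTERN.search(line)]
```

To use it, open the syslog from the diagnostics zip (for example `with open(path, errors="replace") as fh:`) and pass the file handle to `find_mce_lines`. Noting the time of the first hit after each boot makes it easier to compare against when the LEDs come on.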

I have already RMA'd a couple of the RAM sticks from the affected slots, but the problem persists, and the error LEDs do not seem to follow particular RAM modules. The modules and/or RAM slots reporting errors are not always consistent, which is confusing.

Is this a 2600CP issue? Software? Would love some insight from the experts. Server diagnostics attached.

Thanks in advance.

lockerserver-diagnostics-20180302-1410.zip

John_M · March 3, 2018

How much memory did you say you have, 32 GB? It's reporting 27 GB, which is a strange value. Your syslog shows memory errors. Does the BIOS keep a log of memory errors? I'd run MemTest86 for a good long time.

blocker85 · March 3, 2018

15 minutes ago, John_M said:

How much memory did you say you have, 32 GB? It's reporting 27 GB, which is a strange value. Your syslog shows memory errors. Does the BIOS keep a log of memory errors? I'd run MemTest86 for a good long time.

Good catch. I pulled one of the RAM modules out to see if I could stop the errors. Not sure why it's reporting 27 instead of 28.

How long is a good long time?

John_M · March 3, 2018

I'd give it at least 24 hours but I think it might well throw up some errors before then.

blocker85 · March 11, 2018

On 3/3/2018 at 4:58 AM, John_M said:

I'd give it at least 24 hours but I think it might well throw up some errors before then.

Well, this was genuinely surprising. I was finally able to shut down the server and run MemTest86+ for 24 hours. Not a single error (see screen shot attached).

Next steps?

JorgeB · March 11, 2018

Memtest won't detect ECC-corrected errors. Check the board's system event viewer in the BIOS/IPMI; it should have more info.

blocker85 · March 11, 2018

4 hours ago, johnnie.black said:

Memtest won't detect ECC-corrected errors. Check the board's system event viewer in the BIOS/IPMI; it should have more info.

Good call. Here it is.

From a cursory review, it looks like I also need a new power supply. I'll take care of that today.

I also see a mix of correctable and uncorrectable ECC memory errors. What should I do?

SYSTEMEVENTLOG.sel

Squid · March 11, 2018

6 hours ago, blocker85 said:

I also see a mix of correctable and uncorrectable ECC memory errors. What should I do?

Replace the memory

blocker85 · March 12, 2018

6 hours ago, Squid said:

Replace the memory

OK. This will be the second RMA with this eBay reseller. I think I'll just ask him to send me all new modules instead of playing the "find the bad RAM stick" game any longer :/.

pwm · March 13, 2018

On 3/11/2018 at 11:04 PM, Squid said:

Replace the memory

But are any of the errors from after pulling one memory module?

556  03/10/2018-15:40:20  Memory, Mmry ECC Sensor (#0x2)  Warning event: Mmry ECC Sensor reports correctable error. There has been a correctable ECC or other correctable memory error for the memory module RANK_0, CPU_1, Channel = A, DIMM_1.  BIOS SMI Handler - LUN#0 (Channel#0)

The above is likely to be before the 24 hour memory test.
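Since the LEDs don't seem to follow particular modules, it may help to tally the SEL entries per slot rather than reading them one by one. Below is a minimal sketch that parses entries shaped like the one quoted above. The regular expression is based only on that single line, so the wording for uncorrectable errors is an assumption, and `parse_sel_line` / `errors_per_slot` are made-up helper names.

```python
import re
from collections import Counter
from datetime import datetime

# Pattern derived from the one SEL entry quoted in this thread.
# Assumption: uncorrectable entries use the same layout and wording.
SEL_LINE = re.compile(
    r"^\s*(?P<record>\d+)\s+"
    r"(?P<stamp>\d{2}/\d{2}/\d{4}-\d{2}:\d{2}:\d{2})\s+"
    r".*?\b(?P<severity>uncorrectable|correctable)\s+error"
    r".*?RANK_(?P<rank>\d+),\s*CPU_(?P<cpu>\d+),"
    r"\s*Channel\s*=\s*(?P<channel>\w+),\s*DIMM_(?P<dimm>\d+)",
    re.IGNORECASE,
)


def parse_sel_line(line):
    """Return a dict describing one memory ECC entry, or None if no match."""
    m = SEL_LINE.match(line)
    if m is None:
        return None
    return {
        "record": int(m["record"]),
        "time": datetime.strptime(m["stamp"], "%m/%d/%Y-%H:%M:%S"),
        "severity": m["severity"].lower(),
        "cpu": int(m["cpu"]),
        "channel": m["channel"],
        "dimm": int(m["dimm"]),
    }


def errors_per_slot(lines):
    """Count parsed ECC entries per (cpu, channel, dimm) slot."""
    counts = Counter()
    for line in lines:
        entry = parse_sel_line(line)
        if entry is not None:
            counts[(entry["cpu"], entry["channel"], entry["dimm"])] += 1
    return counts
```

If the counts pile up on one slot regardless of which module sits in it, that points toward the slot or board; if they move with a module after swapping, that points toward the module.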

blocker85 · March 13, 2018

6 hours ago, pwm said:

But are any of the errors from after pulling one memory module?


556  03/10/2018-15:40:20  Memory, Mmry ECC Sensor (#0x2)  Warning event: Mmry ECC Sensor reports correctable error. There has been a correctable ECC or other correctable memory error for the memory module RANK_0, CPU_1, Channel = A, DIMM_1.  BIOS SMI Handler - LUN#0 (Channel#0)

The above is likely to be before the 24 hour memory test.

@pwm: memory errors continued after pulling the module. Also, unless I'm mistaken, from the time stamp it appears that the error you quoted occurred DURING the 24 hour memtest run. The test ran from about 7pm on 3/9 through 7 or 8pm on 3/10.
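The timestamp comparison can be checked quickly. This is only a sketch: the start and end times are the approximate ones given above (using the earlier 7pm end), and it assumes the SEL clock and the wall clock are in the same time zone.

```python
from datetime import datetime


def within_window(stamp, start, end):
    """True if stamp falls inside the closed interval [start, end]."""
    return start <= stamp <= end


sel_error = datetime(2018, 3, 10, 15, 40, 20)   # from SEL record 556
memtest_start = datetime(2018, 3, 9, 19, 0)     # "about 7pm on 3/9" (approximate)
memtest_end = datetime(2018, 3, 10, 19, 0)      # earlier of "7 or 8pm on 3/10"

print(within_window(sel_error, memtest_start, memtest_end))  # True
```

So even with the conservative end time, the logged error sits about 20 hours into the run.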

In any event, the ebay seller has agreed to swap out all 8 memory modules, so we'll see what happens. It would be awful if it turned out to be an issue with the ram terminals on the motherboard. Is that unlikely?

pwm · March 13, 2018

1 hour ago, blocker85 said:

@pwm: memory errors continued after pulling the module. Also, unless I'm mistaken, from the time stamp it appears that the error you quoted occurred DURING the 24 hour memtest run. The test ran from about 7pm on 3/9 through 7 or 8pm on 3/10.

In any event, the ebay seller has agreed to swap out all 8 memory modules, so we'll see what happens. It would be awful if it turned out to be an issue with the ram terminals on the motherboard. Is that unlikely?

In that case you have to continue to replace memory modules as long as you are sure the parity errors aren't caused by overclocking, overtemp or unstable supply voltages.

When run within specifications, you should not see these errors. One correctable ECC error per year might be acceptable, given that the same address needs two or more bit errors before incorrect data is actually read. Only an unhealthy system produces this many ECC errors.

blocker85 · March 17, 2018

So just to put a bookend on this thread, I had the Ebay reseller send me all new modules (different brand this time), and I'm now at 24 hours of uptime without a single error in the logs. I think I may be in good shape. Thanks for all the help.
