
[6.4.1] MCE Errors in Syslog, looking for help diagnosing the issue


blocker85


I hope this is the appropriate forum for this.  If not, feel free to move it or let me know where to repost.

 

I'm an unRAID newb, but I've spent countless hours tinkering and watching Spaceinvader's tutorials, etc.  Most issues I have been able to work out on my own, EXCEPT for this one.  Here is my relevant server hardware setup:

 

  • Intel S2600CP
  • 2x Intel E5-2650
  • 32GB (8x 4GB) Micron RAM

 

Software Setup:

  • unRAID 6.4.1
  • plugins:
    • Community Apps
    • Dynamix SSD TRIM
    • Fix Common Problems
    • Nerd Tools
    • rclone
    • Unassigned Devices
    • User Scripts
  • Dockers
    • Couchpotato
    • lidarr
    • sabnzbd
    • sonarr
    • deluge
    • headphones
    • krusader
    • netdata
    • ombi
    • plexserver
    • plexpy
    • qbittorrent
    • rutorrent
  • VMs
    • macOS High Sierra

 

On boot, the 2600CP shows no error LEDs and the unRAID syslog is clean.  However, after anywhere from 30 minutes to 3 hours (it varies from boot to boot), the 2600CP shows a blinking amber error LED and one or two RAM slots light their amber error LEDs.  The unRAID syslog then starts reporting MCE errors.
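
For anyone wanting to pull the MCE reports out of a long syslog, here is a minimal sketch. The pattern list is an assumption based on how the Linux kernel usually words machine-check messages, and the sample lines are made up for illustration; they are not taken from the attached diagnostics.

```python
import re

# Substrings the Linux kernel commonly uses for machine-check reports.
# This list is an assumption; adjust it to whatever your own syslog shows.
MCE_PATTERN = re.compile(r"\bmce:|machine check|\[Hardware Error\]", re.IGNORECASE)


def find_mce_lines(lines):
    """Return (line_number, text) pairs for lines that look like MCE reports."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if MCE_PATTERN.search(line):
            hits.append((number, line.rstrip("\n")))
    return hits


# Made-up sample lines, for illustration only.
SAMPLE = [
    "Mar  2 13:05:01 server kernel: mce: [Hardware Error]: Machine check events logged\n",
    "Mar  2 13:05:02 server kernel: usb 1-1: new high-speed USB device\n",
]

for number, text in find_mce_lines(SAMPLE):
    print(number, text)
```

To run it against a real log, pass an open file instead of `SAMPLE`, for example `find_mce_lines(open("/var/log/syslog"))`; that path is an assumption about where the syslog lives on your system.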

 

I have already RMA'd a couple of the RAM sticks from the affected slots, but the problem persists, and the error LEDs do not seem to follow particular RAM modules.  The modules and/or RAM slots reporting errors are not always consistent, which is confusing.

 

Is this a 2600CP issue?  Software?  Would love some insight from the experts.  Server diagnostics attached.

 

Thanks in advance.

lockerserver-diagnostics-20180302-1410.zip

15 minutes ago, John_M said:

How much memory did you say you have, 32 GB? It's reporting 27 GB, which is a strange value. Your syslog shows memory errors. Does the BIOS keep a log of memory errors? I'd run MemTest86 for a good long time.

 

Good catch.  I pulled one of the RAM modules out to see if I could stop the errors.  Not sure why it's reporting 27 instead of 28.
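
As a rough check on the 27 vs 28 figure, here is a sketch that reads `MemTotal` from `/proc/meminfo`-style text and compares it with the installed amount (7 modules of 4 GB after pulling one). The sample text is hypothetical. `MemTotal` counts only the RAM the kernel can use, so it normally comes in somewhat below the installed total; whether a full 1 GB gap is normal on this board is not something the sketch can answer.

```python
def mem_total_gib(meminfo_text):
    """Return the MemTotal value from /proc/meminfo-style text, in GiB."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            kib = int(line.split()[1])  # the value is reported in kB (KiB)
            return kib / (1024 ** 2)
    raise ValueError("MemTotal not found")


INSTALLED_GIB = 7 * 4  # 8 modules minus the one pulled, 4 GB each

# Hypothetical sample; on a live system read /proc/meminfo instead.
sample = "MemTotal:       28311552 kB\nMemFree:         1000000 kB\n"

usable = mem_total_gib(sample)
print(f"usable {usable:.1f} GiB, installed {INSTALLED_GIB} GiB, "
      f"gap {INSTALLED_GIB - usable:.1f} GiB")
```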

 

How long is a good long time?

On 3/3/2018 at 4:58 AM, John_M said:

I'd give it at least 24 hours but I think it might well throw up some errors before then.

 

Well, this was genuinely surprising.  I was finally able to shut down the server and run MemTest86+ for 24 hours.  Not a single error (see screen shot attached).

 

Next steps?

IMG_20180310_195429.jpg

4 hours ago, johnnie.black said:

Memtest won't detect ECC-corrected errors. Check the board's system event viewer in the BIOS/IPMI; it should have more info.

 

Good call.  Here it is.

 

From a cursory review, it looks like I also need a new power supply.  I'll take care of that today.

 

I also see a mix of correctable and uncorrectable ECC memory errors.  What should I do?

SYSTEMEVENTLOG.sel

On 3/11/2018 at 11:04 PM, Squid said:

Replace the memory

But are any of the errors from after pulling one memory module?

 

556  03/10/2018-15:40:20  Memory, Mmry ECC Sensor (#0x2)  Warning event: Mmry ECC Sensor reports correctable error. There has been a correctable ECC or other correctable memory error for the memory module RANK_0, CPU_1, Channel = A, DIMM_1.  BIOS SMI Handler - LUN#0 (Channel#0)

The above is likely to be before the 24 hour memory test.
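
Since the error LEDs do not seem to follow particular modules, tallying the event log by slot can help. This is a minimal sketch that counts ECC events per CPU/channel/DIMM, assuming every ECC entry names its location the way the entry quoted above does. Entries in a different format are simply skipped, so verify the counts against the raw log.

```python
import re
from collections import Counter

# Matches the location part of the quoted entry, e.g. "CPU_1, Channel = A, DIMM_1".
# Assumption: other ECC entries in the log use the same wording.
SLOT_RE = re.compile(r"(CPU_\d+), Channel = (\w+), (DIMM_\d+)")


def count_ecc_by_slot(lines):
    """Count ECC event lines per (cpu, channel, dimm) location."""
    counts = Counter()
    for line in lines:
        if "ECC" not in line:
            continue
        match = SLOT_RE.search(line)
        if match:
            counts[match.groups()] += 1
    return counts


# The entry quoted in this thread, with the column padding removed.
ENTRY = (
    "556\t03/10/2018-15:40:20\tMemory, Mmry ECC Sensor (#0x2)\t"
    "Warning event: Mmry ECC Sensor reports correctable error. There has been a "
    "correctable ECC or other correctable memory error for the memory module "
    "RANK_0, CPU_1, Channel = A, DIMM_1.\tBIOS SMI Handler - LUN#0 (Channel#0)"
)

for slot, count in count_ecc_by_slot([ENTRY]).items():
    print(slot, count)
```

If the counts spread across many slots rather than clustering on one or two, that points away from a single bad module.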

6 hours ago, pwm said:

But are any of the errors from after pulling one memory module?

 


556  03/10/2018-15:40:20  Memory, Mmry ECC Sensor (#0x2)  Warning event: Mmry ECC Sensor reports correctable error. There has been a correctable ECC or other correctable memory error for the memory module RANK_0, CPU_1, Channel = A, DIMM_1.  BIOS SMI Handler - LUN#0 (Channel#0)

The above is likely to be before the 24 hour memory test.

 

@pwm: memory errors continued after pulling the module.  Also, unless I'm mistaken, from the time stamp it appears that the error you quoted occurred DURING the 24 hour memtest run.  The test ran from about 7pm on 3/9 through 7 or 8pm on 3/10.
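
The timestamp comparison can be checked mechanically. This sketch parses the event time in the format shown in the quoted entry and tests it against the memtest window, using 7pm on 3/10 as the earliest possible end. It assumes the event log clock and local wall-clock time agree, which may not hold if the BMC clock is set to a different time zone or has drifted.

```python
from datetime import datetime


def parse_sel_time(stamp):
    """Parse a timestamp like '03/10/2018-15:40:20' from the event log."""
    return datetime.strptime(stamp, "%m/%d/%Y-%H:%M:%S")


event = parse_sel_time("03/10/2018-15:40:20")

# Memtest window as described in the post (approximate times).
test_start = datetime(2018, 3, 9, 19, 0)
test_end = datetime(2018, 3, 10, 19, 0)

during_test = test_start <= event <= test_end
print("event during memtest run:", during_test)
```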

 

In any event, the eBay seller has agreed to swap out all 8 memory modules, so we'll see what happens.  It would be awful if it turned out to be an issue with the RAM terminals on the motherboard.  Is that unlikely?

1 hour ago, blocker85 said:

 

@pwm: memory errors continued after pulling the module.  Also, unless I'm mistaken, from the time stamp it appears that the error you quoted occurred DURING the 24 hour memtest run.  The test ran from about 7pm on 3/9 through 7 or 8pm on 3/10.

 

In any event, the eBay seller has agreed to swap out all 8 memory modules, so we'll see what happens.  It would be awful if it turned out to be an issue with the RAM terminals on the motherboard.  Is that unlikely?

In that case, you have to keep replacing memory modules, provided you are sure the memory errors aren't caused by overclocking, overheating, or unstable supply voltages.

 

When run within specifications, you should not see these errors. You could perhaps accept one correctable ECC error per year, since the same address needs two or more bit errors before incorrect data is actually read. Only an unhealthy system produces this many ECC errors.
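
To put an observed count against that one-per-year rule of thumb, here is a small sketch. The event count and observation period are hypothetical placeholders; substitute the number of correctable ECC entries in your own event log and the number of days it covers.

```python
ACCEPTABLE_PER_YEAR = 1.0  # the rule of thumb above, not a vendor specification


def errors_per_year(count, observed_days):
    """Scale an observed error count to an annual rate."""
    if observed_days <= 0:
        raise ValueError("observed_days must be positive")
    return count * 365 / observed_days


# Hypothetical numbers for illustration: 3 correctable errors in 30 days.
rate = errors_per_year(3, 30)
print(f"{rate:.1f} errors/year, acceptable: {rate <= ACCEPTABLE_PER_YEAR}")
```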


Archived

This topic is now archived and is closed to further replies.
