blocker85 Posted March 2, 2018

I hope this is the appropriate forum for this. If not, feel free to move it or let me know where to repost.

I'm an unRAID newb, but I've spent countless hours tinkering and watching Spaceinvader's tutorials, etc. Most issues I have been able to work out on my own, EXCEPT for this one.

Here is my relevant server hardware setup:
Intel S2600CP
2x Intel E5-2650
32GB (8x 4GB) Micron RAM

Software setup:
unRAID 6.4.1
Plugins: Community Apps, Dynamix SSD TRIM, Fix Common Problems, Nerd Tools, rclone, Unassigned Devices, User Scripts
Dockers: Couchpotato, lidarr, sabnzbd, sonarr, deluge, headphones, krusader, netdata, ombi, plexserver, plexpy, qbittorrent, rutorrent
VMs: macOS High Sierra

On boot, the 2600CP shows no error LEDs and the unRAID syslog is clean. However, after about 30 minutes to 3 hours (it varies from reboot to reboot), the 2600CP shows a blinking amber error LED and one or two RAM slots show amber error LEDs. The unRAID syslog then starts reporting MCE errors.

I have already RMA'd a couple of the RAM sticks in those slots, but the problem persists, and the error LEDs do not seem to follow particular RAM modules. The modules and/or RAM slots reporting errors are not consistent, which is confusing.

Is this a 2600CP issue? Software? Would love some insight from the experts. Server diagnostics attached. Thanks in advance.

lockerserver-diagnostics-20180302-1410.zip
John_M Posted March 3, 2018

How much memory did you say you have, 32 GB? It's reporting 27 GB, which is a strange value. Your syslog shows memory errors. Does the BIOS keep a log of memory errors? I'd run MemTest86 for a good long time.
blocker85 (Author) Posted March 3, 2018

15 minutes ago, John_M said:
How much memory did you say you have, 32 GB? It's reporting 27 GB, which is a strange value. Your syslog shows memory errors. Does the BIOS keep a log of memory errors? I'd run MemTest86 for a good long time.

Good catch. I pulled one of the RAM modules out to see if I could stop the errors. Not sure why it's reporting 27 GB instead of 28 GB. How long is a good long time?
John_M Posted March 3, 2018

I'd give it at least 24 hours, but I think it might well throw up some errors before then.
blocker85 (Author) Posted March 11, 2018

On 3/3/2018 at 4:58 AM, John_M said:
I'd give it at least 24 hours but I think it might well throw up some errors before then.

Well, this was genuinely surprising. I was finally able to shut down the server and run MemTest86+ for 24 hours. Not a single error (see screenshot attached). Next steps?
JorgeB Posted March 11, 2018

Memtest won't detect errors that ECC has already corrected. Check the board's system event viewer in the BIOS/IPMI; it should have more info.
blocker85 (Author) Posted March 11, 2018

4 hours ago, johnnie.black said:
Memtest won't detect errors that ECC has already corrected. Check the board's system event viewer in the BIOS/IPMI; it should have more info.

Good call. Here it is. From a cursory review, it looks like I also need a new power supply. I'll take care of that today. I also see a mix of correctable and uncorrectable ECC memory errors. What should I do?

SYSTEMEVENTLOG.sel
Squid Posted March 11, 2018

6 hours ago, blocker85 said:
I also see a mix of correctable and uncorrectable ECC memory errors. What should I do?

Replace the memory.
blocker85 (Author) Posted March 12, 2018

6 hours ago, Squid said:
Replace the memory

OK. This will be the second RMA with this eBay reseller. I think I'll ask him to just send me all new modules instead of playing the "find the bad RAM stick" game any longer :/.
pwm Posted March 13, 2018

On 3/11/2018 at 11:04 PM, Squid said:
Replace the memory

But are any of the errors from after pulling one memory module?

556 03/10/2018-15:40:20 Memory, Mmry ECC Sensor (#0x2) Warning event: Mmry ECC Sensor reports correctable error. There has been a correctable ECC or other correctable memory error for the memory module RANK_0, CPU_1, Channel = A, DIMM_1. BIOS SMI Handler - LUN#0 (Channel#0)

The above is likely to be before the 24 hour memory test.
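For anyone wanting to see which slots the event log actually blames, here is a minimal Python sketch that tallies ECC events per DIMM location. The regular expression is modeled only on the single entry quoted above; the wording of uncorrectable entries ("reports uncorrectable error") is an assumption, and the helper name `tally_ecc_events` is made up for illustration. Lines that do not match are simply skipped.

```python
import re
from collections import Counter

# Pattern modeled on the one SEL entry quoted in the thread. The
# "uncorrectable" wording is an assumption; adjust it to the real export.
ENTRY = re.compile(
    r"(?P<id>\d+)\s+(?P<ts>\d{2}/\d{2}/\d{4}-\d{2}:\d{2}:\d{2})\s+Memory,.*?"
    r"reports (?P<kind>correctable|uncorrectable) error\..*?"
    r"memory module (?P<rank>RANK_\d+), (?P<cpu>CPU_\d+), "
    r"Channel = (?P<channel>\w+), (?P<dimm>DIMM_\d+)"
)

def tally_ecc_events(lines):
    """Count ECC events per (CPU, channel, DIMM) slot and error kind."""
    counts = Counter()
    for line in lines:
        m = ENTRY.search(line)
        if m:
            slot = (m["cpu"], m["channel"], m["dimm"])
            counts[(slot, m["kind"])] += 1
    return counts

# The entry quoted in the thread, used here as sample input.
sample = (
    "556 03/10/2018-15:40:20 Memory, Mmry ECC Sensor (#0x2) Warning event: "
    "Mmry ECC Sensor reports correctable error. There has been a correctable "
    "ECC or other correctable memory error for the memory module RANK_0, "
    "CPU_1, Channel = A, DIMM_1. BIOS SMI Handler - LUN#0 (Channel#0)"
)

counts = tally_ecc_events([sample])
for (slot, kind), n in counts.items():
    print(slot, kind, n)
```

If the counts cluster on one slot regardless of which module sits in it, that points toward the slot or channel; if they follow a module as it is moved, that points toward the module.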
blocker85 (Author) Posted March 13, 2018

6 hours ago, pwm said:
But are any of the errors from after pulling one memory module?

556 03/10/2018-15:40:20 Memory, Mmry ECC Sensor (#0x2) Warning event: Mmry ECC Sensor reports correctable error. There has been a correctable ECC or other correctable memory error for the memory module RANK_0, CPU_1, Channel = A, DIMM_1. BIOS SMI Handler - LUN#0 (Channel#0)

The above is likely to be before the 24 hour memory test.

@pwm: memory errors continued after pulling the module. Also, unless I'm mistaken, the timestamp suggests that the error you quoted occurred DURING the 24 hour memtest run. The test ran from about 7pm on 3/9 through 7 or 8pm on 3/10.

In any event, the eBay seller has agreed to swap out all 8 memory modules, so we'll see what happens. It would be awful if it turned out to be an issue with the RAM terminals on the motherboard. Is that unlikely?
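The timestamp comparison above can be checked with a few lines of Python. This is only a sketch: the start and end times are the approximate ones given in the post ("about 7pm" and "7 or 8pm", with the earlier 7pm taken as the conservative end), and it assumes the board's event log clock matches local wall-clock time, which the thread does not confirm.

```python
from datetime import datetime

# Approximate memtest window from the post; 7pm is used as the
# conservative end of the "7 or 8pm" range.
test_start = datetime(2018, 3, 9, 19, 0)
test_end = datetime(2018, 3, 10, 19, 0)

# Timestamp copied from the quoted event log entry (MM/DD/YYYY-HH:MM:SS).
event = datetime.strptime("03/10/2018-15:40:20", "%m/%d/%Y-%H:%M:%S")

during_test = test_start <= event <= test_end
print(during_test)          # True
print(event - test_start)   # 20:40:20
```

Under those assumptions the event falls roughly 20 hours 40 minutes into the run, so the memory test was in progress when the correctable error was logged.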
pwm Posted March 13, 2018

1 hour ago, blocker85 said:
@pwm: memory errors continued after pulling the module. Also, unless I'm mistaken, from the time stamp it appears that the error you quoted occurred DURING the 24 hour memtest run. The test ran from about 7pm on 3/9 through 7 or 8pm on 3/10. In any event, the ebay seller has agreed to swap out all 8 memory modules, so we'll see what happens. It would be awful if it turned out to be an issue with the ram terminals on the motherboard. Is that unlikely?

In that case you have to continue replacing memory modules, as long as you are sure the parity errors aren't caused by overclocking, overheating, or unstable supply voltages. When run within specifications, you should not see these errors.

It could be acceptable to see maybe one correctable ECC error per year, knowing that the same specific address needs two or more bit errors before incorrect data is actually read. Only an unhealthy system produces this many ECC errors.
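The point about single-bit versus multi-bit errors can be illustrated with a toy single-error-correcting, double-error-detecting (SECDED) code. The sketch below uses an extended Hamming code over 4 data bits purely for illustration; ECC DIMMs protect 64 data bits with 8 check bits, and the exact scheme used by this board's memory controller is not stated in the thread. The function names `encode` and `decode` are made up for the example.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit extended Hamming (SECDED) codeword."""
    d = [(nibble >> i) & 1 for i in range(4)]
    word = [0] * 8                      # index 0 holds the overall parity bit
    word[3], word[5], word[6], word[7] = d
    word[1] = word[3] ^ word[5] ^ word[7]   # parity over positions 1,3,5,7
    word[2] = word[3] ^ word[6] ^ word[7]   # parity over positions 2,3,6,7
    word[4] = word[5] ^ word[6] ^ word[7]   # parity over positions 4,5,6,7
    word[0] = sum(word[1:]) % 2             # makes total parity even
    return word

def decode(word):
    """Return (status, nibble); nibble is None if the data cannot be trusted."""
    word = list(word)
    syndrome = 0
    for pos in range(1, 8):
        if word[pos]:
            syndrome ^= pos             # XOR of set positions is 0 when valid
    overall = sum(word) % 2             # 0 when total parity still holds
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # Odd parity: assume one flipped bit and repair it. A syndrome of 0
        # here means the overall parity bit itself was the one that flipped.
        word[syndrome] ^= 1
        status = "corrected"
    else:
        # Even parity but non-zero syndrome: two bits flipped, not repairable.
        return "uncorrectable", None
    nibble = word[3] | word[5] << 1 | word[6] << 2 | word[7] << 3
    return status, nibble

word = encode(0b1011)
word[5] ^= 1                 # one flipped bit
print(decode(word))          # ('corrected', 11)
word[6] ^= 1                 # a second flipped bit in the same word
print(decode(word))          # ('uncorrectable', None)
```

A single flipped bit is repaired silently and only shows up as a correctable event in the log. A second flip in the same word before the first is repaired is what turns it into an uncorrectable error.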
blocker85 (Author) Posted March 17, 2018

So just to put a bookend on this thread, I had the eBay reseller send me all new modules (different brand this time), and I'm now at 24 hours of uptime without a single error in the logs. I think I may be in good shape. Thanks for all the help.