August 30, 2018
I am getting many memory read errors. The warnings are caused by a Raspberry Pi running NUT that is not always reliable. The system used to be fairly unstable but has now stayed up for 2 months; it only crashes under heavy load, such as virtualizing a gaming machine while a parity check is running. ECC memory is installed and the system is running dual E5-2667 v2 CPUs.
tower-syslog-20180830-1601.zip tower-diagnostics-20180830-1605.zip
Edited August 30, 2018 by spencerdf
August 31, 2018 Author
Aug 30 21:10:23 Tower kernel: EDAC sbridge MC0: TSC 0
Aug 30 21:10:23 Tower kernel: EDAC sbridge MC0: ADDR 37e3d2180
Aug 30 21:10:23 Tower kernel: EDAC sbridge MC0: MISC 14268e486
Aug 30 21:10:23 Tower kernel: EDAC sbridge MC0: PROCESSOR 0:306e4 TIME 1535677823 SOCKET 0 APIC 2
Aug 30 21:10:23 Tower kernel: EDAC MC0: 166 CE memory read error on CPU_SrcID#0_Ha#0_Chan#2_DIMM#0 (channel:2 slot:0 page:0x37e3d2 offset:0x180 grain:32 syndrome:0x0 - OVERFLOW area:DRAM err_code:0001:0093 socket:0 ha:0 channel_mask:4 rank:0)
Aug 30 21:10:40 Tower kernel: EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
Aug 30 21:10:40 Tower kernel: EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 7: cc00280000010093
Aug 30 21:10:40 Tower kernel: EDAC sbridge MC0: TSC 0
Aug 30 21:10:40 Tower kernel: EDAC sbridge MC0: ADDR 37e3d2380
Aug 30 21:10:40 Tower kernel: EDAC sbridge MC0: MISC 152305486
Aug 30 21:10:40 Tower kernel: EDAC sbridge MC0: PROCESSOR 0:306e4 TIME 1535677840 SOCKET 0 APIC 2
Aug 30 21:10:40 Tower kernel: EDAC MC0: 160 CE memory read error on CPU_SrcID#0_Ha#0_Chan#2_DIMM#0 (channel:2 slot:0 page:0x37e3d2 offset:0x380 grain:32 syndrome:0x0 - OVERFLOW area:DRAM err_code:0001:0093 socket:0 ha:0 channel_mask:4 rank:0)
Aug 30 21:10:57 Tower kernel: mce_notify_irq: 2 callbacks suppressed
Aug 30 21:10:57 Tower kernel: mce: [Hardware Error]: Machine check events logged
Aug 30 21:10:57 Tower kernel: EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
Aug 30 21:10:57 Tower kernel: EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 7: cc00268000010093
Aug 30 21:10:57 Tower kernel: EDAC sbridge MC0: TSC 0
Aug 30 21:10:57 Tower kernel: EDAC sbridge MC0: ADDR 37e3d2380
Aug 30 21:10:57 Tower kernel: EDAC sbridge MC0: MISC 1522c8286
Aug 30 21:10:57 Tower kernel: EDAC sbridge MC0: PROCESSOR 0:306e4 TIME 1535677857 SOCKET 0 APIC 2
Aug 30 21:10:57 Tower kernel: EDAC MC0: 154 CE memory read error on CPU_SrcID#0_Ha#0_Chan#2_DIMM#0 (channel:2 slot:0 page:0x37e3d2 offset:0x380 grain:32 syndrome:0x0 - OVERFLOW area:DRAM err_code:0001:0093 socket:0 ha:0 channel_mask:4 rank:0)
Aug 30 21:11:13 Tower kernel: mce: [Hardware Error]: Machine check events logged
Aug 30 21:11:13 Tower kernel: EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
Aug 30 21:11:13 Tower kernel: EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 7: cc0027c000010093
Aug 30 21:11:13 Tower kernel: EDAC sbridge MC0: TSC 0
Aug 30 21:11:13 Tower kernel: EDAC sbridge MC0: ADDR 37e3d2380
Aug 30 21:11:13 Tower kernel: EDAC sbridge MC0: MISC 1422cc086
Aug 30 21:11:13 Tower kernel: EDAC sbridge MC0: PROCESSOR 0:306e4 TIME 1535677873 SOCKET 0 APIC 2
Aug 30 21:11:13 Tower kernel: EDAC MC0: 159 CE memory read error on CPU_SrcID#0_Ha#0_Chan#2_DIMM#0 (channel:2 slot:0 page:0x37e3d2 offset:0x380 grain:32 syndrome:0x0 - OVERFLOW area:DRAM err_code:0001:0093 socket:0 ha:0 channel_mask:4 rank:0)
Aug 30 21:11:30 Tower kernel: EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
Aug 30 21:11:30 Tower kernel: EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 7: cc00280000010093
Aug 30 21:11:30 Tower kernel: EDAC sbridge MC0: TSC 0
Aug 30 21:11:30 Tower kernel: EDAC sbridge MC0: ADDR 37e3d27c0
Aug 30 21:11:30 Tower kernel: EDAC sbridge MC0: MISC 1425e3a86
Aug 30 21:11:30 Tower kernel: EDAC sbridge MC0: PROCESSOR 0:306e4 TIME 1535677890 SOCKET 0 APIC 2
Aug 30 21:11:30 Tower kernel: EDAC MC0: 160 CE memory read error on CPU_SrcID#0_Ha#0_Chan#3_DIMM#0 (channel:3 slot:0 page:0x37e3d2 offset:0x7c0 grain:32 syndrome:0x0 - OVERFLOW area:DRAM err_code:0001:0093 socket:0 ha:0 channel_mask:8 rank:0)
August 31, 2018 Author
The errors are on channels 2 and 3, slot 0. It just seems strange that this only started now and that 2 DIMMs would fail at the same time.
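For anyone wanting to check the channel/slot reading without scanning the log by eye, here is a rough Python sketch that tallies the corrected-error (CE) summary lines per memory controller, channel and slot. The function name `tally_corrected_errors` and the regular expression are my own, and treating the number before "CE" as a per-report error count is an assumption about the log format, not something confirmed in this thread. The sample lines are the five summary lines from the syslog excerpt posted above.

```python
import re
from collections import Counter

# Matches the "EDAC MCn: <count> CE ... on <label> (channel:X slot:Y" summary
# lines. The "EDAC sbridge MCn:" detail lines do not match this pattern.
CE_LINE = re.compile(
    r"EDAC MC(?P<mc>\d+): (?P<count>\d+) CE .*? on (?P<label>\S+) "
    r"\(channel:(?P<channel>\d+) slot:(?P<slot>\d+)"
)

def tally_corrected_errors(log_text):
    """Sum the reported CE counts per (controller, channel, slot).

    Assumption: the number before "CE" is the error count for that report.
    """
    totals = Counter()
    for m in CE_LINE.finditer(log_text):
        key = (int(m["mc"]), int(m["channel"]), int(m["slot"]))
        totals[key] += int(m["count"])
    return totals

# The five CE summary lines from the posted syslog excerpt.
SAMPLE = """\
Aug 30 21:10:23 Tower kernel: EDAC MC0: 166 CE memory read error on CPU_SrcID#0_Ha#0_Chan#2_DIMM#0 (channel:2 slot:0 page:0x37e3d2 offset:0x180 grain:32 syndrome:0x0 - OVERFLOW area:DRAM err_code:0001:0093 socket:0 ha:0 channel_mask:4 rank:0)
Aug 30 21:10:40 Tower kernel: EDAC MC0: 160 CE memory read error on CPU_SrcID#0_Ha#0_Chan#2_DIMM#0 (channel:2 slot:0 page:0x37e3d2 offset:0x380 grain:32 syndrome:0x0 - OVERFLOW area:DRAM err_code:0001:0093 socket:0 ha:0 channel_mask:4 rank:0)
Aug 30 21:10:57 Tower kernel: EDAC MC0: 154 CE memory read error on CPU_SrcID#0_Ha#0_Chan#2_DIMM#0 (channel:2 slot:0 page:0x37e3d2 offset:0x380 grain:32 syndrome:0x0 - OVERFLOW area:DRAM err_code:0001:0093 socket:0 ha:0 channel_mask:4 rank:0)
Aug 30 21:11:13 Tower kernel: EDAC MC0: 159 CE memory read error on CPU_SrcID#0_Ha#0_Chan#2_DIMM#0 (channel:2 slot:0 page:0x37e3d2 offset:0x380 grain:32 syndrome:0x0 - OVERFLOW area:DRAM err_code:0001:0093 socket:0 ha:0 channel_mask:4 rank:0)
Aug 30 21:11:30 Tower kernel: EDAC MC0: 160 CE memory read error on CPU_SrcID#0_Ha#0_Chan#3_DIMM#0 (channel:3 slot:0 page:0x37e3d2 offset:0x7c0 grain:32 syndrome:0x0 - OVERFLOW area:DRAM err_code:0001:0093 socket:0 ha:0 channel_mask:8 rank:0)
"""

if __name__ == "__main__":
    for (mc, channel, slot), total in sorted(tally_corrected_errors(SAMPLE).items()):
        print(f"MC{mc} channel {channel} slot {slot}: {total} corrected errors")
```

On this excerpt, four of the five reports land on channel 2 and one on channel 3, all in slot 0. Feeding the function the full syslog text instead of `SAMPLE` should give a fuller picture.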
August 31, 2018
Stress test your RAM by running memtest from the unRAID boot menu, then look again in the event log. Or remove DIMMs, test, shuffle, and test again until you find which ones are bad.
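When shuffling DIMMs, it can also help to compare the kernel's EDAC counters between runs once the server is booted back into unRAID. This is a minimal sketch, assuming the EDAC driver is loaded and exposes the usual `ce_count` and `ue_count` files under `/sys/devices/system/edac/mc`. The function name `read_edac_counts` is hypothetical. It is separate from memtest and from any firmware event log, and the counters only cover errors the driver has seen while loaded.

```python
from pathlib import Path

def read_edac_counts(root="/sys/devices/system/edac/mc"):
    """Return {"mc0": {"ce_count": n, "ue_count": n}, ...} from EDAC sysfs.

    Assumption: each memory controller directory (mc0, mc1, ...) contains
    plain-text ce_count (corrected) and ue_count (uncorrected) files.
    Files that are missing are simply left out of the result.
    """
    counts = {}
    for mc_dir in sorted(Path(root).glob("mc[0-9]*")):
        entry = {}
        for name in ("ce_count", "ue_count"):
            counter_file = mc_dir / name
            if counter_file.is_file():
                entry[name] = int(counter_file.read_text().strip())
        counts[mc_dir.name] = entry
    return counts
```

Calling `read_edac_counts()` with no argument reads the live counters. Note the values before a heavy-load run, and if they climb again afterwards, the DIMMs currently in the affected slots are the suspects.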