spencerdf Posted August 30, 2018 I am getting many memory read errors. The warnings are caused by a Raspberry Pi running NUT, which is not always reliable. The system used to be fairly unstable but has now stayed up for 2 months; it only crashes under heavy load, such as virtualizing a gaming machine and running a parity check at the same time. ECC memory is installed and the system runs dual E5-2667 v2 CPUs. Attachments: tower-syslog-20180830-1601.zip, tower-diagnostics-20180830-1605.zip
John_M Posted August 30, 2018 If your memory has read errors, you need to replace it.
spencerdf Posted August 30, 2018 Which DIMMs? I have indications of bank 7 and channels 2 and 3.
John_M Posted August 31, 2018 What does your BIOS event log say?
spencerdf Posted August 31, 2018
Aug 30 21:10:23 Tower kernel: EDAC sbridge MC0: TSC 0
Aug 30 21:10:23 Tower kernel: EDAC sbridge MC0: ADDR 37e3d2180
Aug 30 21:10:23 Tower kernel: EDAC sbridge MC0: MISC 14268e486
Aug 30 21:10:23 Tower kernel: EDAC sbridge MC0: PROCESSOR 0:306e4 TIME 1535677823 SOCKET 0 APIC 2
Aug 30 21:10:23 Tower kernel: EDAC MC0: 166 CE memory read error on CPU_SrcID#0_Ha#0_Chan#2_DIMM#0 (channel:2 slot:0 page:0x37e3d2 offset:0x180 grain:32 syndrome:0x0 - OVERFLOW area:DRAM err_code:0001:0093 socket:0 ha:0 channel_mask:4 rank:0)
Aug 30 21:10:40 Tower kernel: EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
Aug 30 21:10:40 Tower kernel: EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 7: cc00280000010093
Aug 30 21:10:40 Tower kernel: EDAC sbridge MC0: TSC 0
Aug 30 21:10:40 Tower kernel: EDAC sbridge MC0: ADDR 37e3d2380
Aug 30 21:10:40 Tower kernel: EDAC sbridge MC0: MISC 152305486
Aug 30 21:10:40 Tower kernel: EDAC sbridge MC0: PROCESSOR 0:306e4 TIME 1535677840 SOCKET 0 APIC 2
Aug 30 21:10:40 Tower kernel: EDAC MC0: 160 CE memory read error on CPU_SrcID#0_Ha#0_Chan#2_DIMM#0 (channel:2 slot:0 page:0x37e3d2 offset:0x380 grain:32 syndrome:0x0 - OVERFLOW area:DRAM err_code:0001:0093 socket:0 ha:0 channel_mask:4 rank:0)
Aug 30 21:10:57 Tower kernel: mce_notify_irq: 2 callbacks suppressed
Aug 30 21:10:57 Tower kernel: mce: [Hardware Error]: Machine check events logged
Aug 30 21:10:57 Tower kernel: EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
Aug 30 21:10:57 Tower kernel: EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 7: cc00268000010093
Aug 30 21:10:57 Tower kernel: EDAC sbridge MC0: TSC 0
Aug 30 21:10:57 Tower kernel: EDAC sbridge MC0: ADDR 37e3d2380
Aug 30 21:10:57 Tower kernel: EDAC sbridge MC0: MISC 1522c8286
Aug 30 21:10:57 Tower kernel: EDAC sbridge MC0: PROCESSOR 0:306e4 TIME 1535677857 SOCKET 0 APIC 2
Aug 30 21:10:57 Tower kernel: EDAC MC0: 154 CE memory read error on CPU_SrcID#0_Ha#0_Chan#2_DIMM#0 (channel:2 slot:0 page:0x37e3d2 offset:0x380 grain:32 syndrome:0x0 - OVERFLOW area:DRAM err_code:0001:0093 socket:0 ha:0 channel_mask:4 rank:0)
Aug 30 21:11:13 Tower kernel: mce: [Hardware Error]: Machine check events logged
Aug 30 21:11:13 Tower kernel: EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
Aug 30 21:11:13 Tower kernel: EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 7: cc0027c000010093
Aug 30 21:11:13 Tower kernel: EDAC sbridge MC0: TSC 0
Aug 30 21:11:13 Tower kernel: EDAC sbridge MC0: ADDR 37e3d2380
Aug 30 21:11:13 Tower kernel: EDAC sbridge MC0: MISC 1422cc086
Aug 30 21:11:13 Tower kernel: EDAC sbridge MC0: PROCESSOR 0:306e4 TIME 1535677873 SOCKET 0 APIC 2
Aug 30 21:11:13 Tower kernel: EDAC MC0: 159 CE memory read error on CPU_SrcID#0_Ha#0_Chan#2_DIMM#0 (channel:2 slot:0 page:0x37e3d2 offset:0x380 grain:32 syndrome:0x0 - OVERFLOW area:DRAM err_code:0001:0093 socket:0 ha:0 channel_mask:4 rank:0)
Aug 30 21:11:30 Tower kernel: EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
Aug 30 21:11:30 Tower kernel: EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 7: cc00280000010093
Aug 30 21:11:30 Tower kernel: EDAC sbridge MC0: TSC 0
Aug 30 21:11:30 Tower kernel: EDAC sbridge MC0: ADDR 37e3d27c0
Aug 30 21:11:30 Tower kernel: EDAC sbridge MC0: MISC 1425e3a86
Aug 30 21:11:30 Tower kernel: EDAC sbridge MC0: PROCESSOR 0:306e4 TIME 1535677890 SOCKET 0 APIC 2
Aug 30 21:11:30 Tower kernel: EDAC MC0: 160 CE memory read error on CPU_SrcID#0_Ha#0_Chan#3_DIMM#0 (channel:3 slot:0 page:0x37e3d2 offset:0x7c0 grain:32 syndrome:0x0 - OVERFLOW area:DRAM err_code:0001:0093 socket:0 ha:0 channel_mask:8 rank:0)
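For anyone reading the "Machine Check Event: 0 Bank 7" values in the log above, here is a minimal sketch of how the 64-bit status word can be unpacked. It assumes the standard Intel IA32_MCi_STATUS layout (valid, overflow and uncorrected flags in bits 63 to 61, a corrected-error count in bits 52 to 38, and the two error codes in the low 32 bits); the function name `decode_mci_status` is made up for illustration. Under that assumption the decoded count lines up with the "NNN CE memory read error" figure EDAC prints for the same event, and the two codes line up with `err_code:0001:0093`.

```python
def decode_mci_status(status_hex: str) -> dict:
    """Unpack a machine-check bank status word given as a hex string.

    Bit positions are an assumption based on the Intel IA32_MCi_STATUS
    register layout; this is an illustrative helper, not part of any tool.
    """
    status = int(status_hex, 16)
    return {
        "valid": bool((status >> 63) & 1),          # bit 63: register holds a valid error
        "overflow": bool((status >> 62) & 1),       # bit 62: an earlier error was overwritten
        "uncorrected": bool((status >> 61) & 1),    # bit 61: error was NOT corrected
        "corrected_count": (status >> 38) & 0x7FFF, # bits 52:38: corrected error count
        "model_specific_code": (status >> 16) & 0xFFFF,  # bits 31:16
        "mca_code": status & 0xFFFF,                # bits 15:0
    }


# Status value taken from the 21:10:57 entry in the log above.
info = decode_mci_status("cc00268000010093")
# info["corrected_count"] is 154 and info["uncorrected"] is False,
# which agrees with "154 CE memory read error" in the matching EDAC line.
```

The `uncorrected` flag being clear is consistent with these being corrected (CE) errors that ECC handled, while the set `overflow` flag matches the OVERFLOW note in the EDAC lines.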
spencerdf Posted August 31, 2018 Channels 2 and 3, slot 0. It just seems strange that this only just started and that two DIMMs would fail at the same time.
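To see how the corrected errors are spread across DIMM labels, the EDAC summary lines can be tallied with a short script. This is only a sketch: the regular expression is written against the line format shown in the log excerpt posted here, and the name `tally_corrected_errors` is hypothetical. It groups by DIMM label and memory page, because in the excerpt every report, on both channels, carries the same page value (0x37e3d2), which may be worth keeping in mind before concluding that two DIMMs failed independently.

```python
import re
from collections import Counter

# Matches EDAC summary lines such as:
#   EDAC MC0: 166 CE memory read error on <label> (channel:2 slot:0 page:0x37e3d2 ...
# The pattern is an assumption derived from the log excerpt in this thread.
CE_LINE = re.compile(
    r"EDAC MC\d+: (\d+) CE memory read error on (\S+) "
    r"\(channel:\d+ slot:\d+ page:(0x[0-9a-fA-F]+)"
)


def tally_corrected_errors(log_text: str) -> Counter:
    """Sum corrected-error counts per (DIMM label, page) found in log_text."""
    totals = Counter()
    for count, label, page in CE_LINE.findall(log_text):
        totals[(label, page)] += int(count)
    return totals


if __name__ == "__main__":
    import sys

    # Usage: python tally.py path/to/syslog.txt  (the file name is up to you)
    with open(sys.argv[1], encoding="utf-8", errors="replace") as handle:
        for (label, page), total in tally_corrected_errors(handle.read()).most_common():
            print(f"{label} page {page}: {total} corrected errors")
```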
John_M Posted August 31, 2018 12 minutes ago, John_M said: "What does your BIOS event log say?"
spencerdf Posted August 31, 2018 The event log in the IPMI tools doesn't show anything for the last few days.
John_M Posted August 31, 2018 Stress test your RAM by running memtest from the unRAID boot menu, then look in the event log again. Or remove DIMMs and test, then shuffle and test again, until you find which ones are bad.
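As a complement to the shuffle-and-test approach, the per-DIMM corrected-error counters that the Linux EDAC subsystem exposes under sysfs can be snapshotted before and after a load run, so it is easier to see whether the errors follow a DIMM to its new slot. This is a rough sketch under assumptions: it expects the usual `mc*/dimm*/dimm_ce_count` and `dimm_label` files under `/sys/devices/system/edac/mc`, whether a given unRAID build exposes them has not been checked here, and the function names are invented for the example. The counters reset on reboot, so both snapshots need to come from the same boot.

```python
from pathlib import Path


def read_dimm_ce_counts(edac_root: str = "/sys/devices/system/edac/mc") -> dict:
    """Return {dimm label or 'mcN/dimmM': corrected error count}.

    The sysfs layout is assumed, not verified for any particular system.
    """
    counts = {}
    for dimm_dir in sorted(Path(edac_root).glob("mc*/dimm*")):
        count_file = dimm_dir / "dimm_ce_count"
        if not count_file.is_file():
            continue  # skip entries without a corrected-error counter
        label_file = dimm_dir / "dimm_label"
        label = label_file.read_text().strip() if label_file.is_file() else ""
        # Fall back to the directory names when no label is set.
        key = label or f"{dimm_dir.parent.name}/{dimm_dir.name}"
        counts[key] = int(count_file.read_text().strip())
    return counts


def changed_counts(before: dict, after: dict) -> dict:
    """Return only the DIMMs whose counter differs between two snapshots."""
    return {
        key: after[key] - before.get(key, 0)
        for key in after
        if after[key] != before.get(key, 0)
    }
```

A typical use would be to call `read_dimm_ce_counts()` once, run the heavy workload (for example a parity check), call it again, and pass both results to `changed_counts`.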
Archived
This topic is now archived and is closed to further replies.