spencerdf Posted August 30, 2018 (edited) I am getting many memory read errors. The warnings in the log come from a Raspberry Pi running NUT that is not always reliable. The system used to be fairly unstable but has now stayed up for 2 months; it only crashes under heavy load, for example when virtualizing a gaming machine and running a parity check at the same time. ECC memory is installed and the system is running dual E5-2667 v2 CPUs. Attachments: tower-syslog-20180830-1601.zip tower-diagnostics-20180830-1605.zip Edited August 30, 2018 by spencerdf
John_M Posted August 30, 2018 If your memory has read errors you need to replace it.
spencerdf Posted August 30, 2018 Which DIMMs? I have indications of bank 7 and channels 2 and 3.
John_M Posted August 31, 2018 What does your BIOS event log say?
spencerdf Posted August 31, 2018
Aug 30 21:10:23 Tower kernel: EDAC sbridge MC0: TSC 0
Aug 30 21:10:23 Tower kernel: EDAC sbridge MC0: ADDR 37e3d2180
Aug 30 21:10:23 Tower kernel: EDAC sbridge MC0: MISC 14268e486
Aug 30 21:10:23 Tower kernel: EDAC sbridge MC0: PROCESSOR 0:306e4 TIME 1535677823 SOCKET 0 APIC 2
Aug 30 21:10:23 Tower kernel: EDAC MC0: 166 CE memory read error on CPU_SrcID#0_Ha#0_Chan#2_DIMM#0 (channel:2 slot:0 page:0x37e3d2 offset:0x180 grain:32 syndrome:0x0 - OVERFLOW area:DRAM err_code:0001:0093 socket:0 ha:0 channel_mask:4 rank:0)
Aug 30 21:10:40 Tower kernel: EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
Aug 30 21:10:40 Tower kernel: EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 7: cc00280000010093
Aug 30 21:10:40 Tower kernel: EDAC sbridge MC0: TSC 0
Aug 30 21:10:40 Tower kernel: EDAC sbridge MC0: ADDR 37e3d2380
Aug 30 21:10:40 Tower kernel: EDAC sbridge MC0: MISC 152305486
Aug 30 21:10:40 Tower kernel: EDAC sbridge MC0: PROCESSOR 0:306e4 TIME 1535677840 SOCKET 0 APIC 2
Aug 30 21:10:40 Tower kernel: EDAC MC0: 160 CE memory read error on CPU_SrcID#0_Ha#0_Chan#2_DIMM#0 (channel:2 slot:0 page:0x37e3d2 offset:0x380 grain:32 syndrome:0x0 - OVERFLOW area:DRAM err_code:0001:0093 socket:0 ha:0 channel_mask:4 rank:0)
Aug 30 21:10:57 Tower kernel: mce_notify_irq: 2 callbacks suppressed
Aug 30 21:10:57 Tower kernel: mce: [Hardware Error]: Machine check events logged
Aug 30 21:10:57 Tower kernel: EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
Aug 30 21:10:57 Tower kernel: EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 7: cc00268000010093
Aug 30 21:10:57 Tower kernel: EDAC sbridge MC0: TSC 0
Aug 30 21:10:57 Tower kernel: EDAC sbridge MC0: ADDR 37e3d2380
Aug 30 21:10:57 Tower kernel: EDAC sbridge MC0: MISC 1522c8286
Aug 30 21:10:57 Tower kernel: EDAC sbridge MC0: PROCESSOR 0:306e4 TIME 1535677857 SOCKET 0 APIC 2
Aug 30 21:10:57 Tower kernel: EDAC MC0: 154 CE memory read error on CPU_SrcID#0_Ha#0_Chan#2_DIMM#0 (channel:2 slot:0 page:0x37e3d2 offset:0x380 grain:32 syndrome:0x0 - OVERFLOW area:DRAM err_code:0001:0093 socket:0 ha:0 channel_mask:4 rank:0)
Aug 30 21:11:13 Tower kernel: mce: [Hardware Error]: Machine check events logged
Aug 30 21:11:13 Tower kernel: EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
Aug 30 21:11:13 Tower kernel: EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 7: cc0027c000010093
Aug 30 21:11:13 Tower kernel: EDAC sbridge MC0: TSC 0
Aug 30 21:11:13 Tower kernel: EDAC sbridge MC0: ADDR 37e3d2380
Aug 30 21:11:13 Tower kernel: EDAC sbridge MC0: MISC 1422cc086
Aug 30 21:11:13 Tower kernel: EDAC sbridge MC0: PROCESSOR 0:306e4 TIME 1535677873 SOCKET 0 APIC 2
Aug 30 21:11:13 Tower kernel: EDAC MC0: 159 CE memory read error on CPU_SrcID#0_Ha#0_Chan#2_DIMM#0 (channel:2 slot:0 page:0x37e3d2 offset:0x380 grain:32 syndrome:0x0 - OVERFLOW area:DRAM err_code:0001:0093 socket:0 ha:0 channel_mask:4 rank:0)
Aug 30 21:11:30 Tower kernel: EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
Aug 30 21:11:30 Tower kernel: EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 7: cc00280000010093
Aug 30 21:11:30 Tower kernel: EDAC sbridge MC0: TSC 0
Aug 30 21:11:30 Tower kernel: EDAC sbridge MC0: ADDR 37e3d27c0
Aug 30 21:11:30 Tower kernel: EDAC sbridge MC0: MISC 1425e3a86
Aug 30 21:11:30 Tower kernel: EDAC sbridge MC0: PROCESSOR 0:306e4 TIME 1535677890 SOCKET 0 APIC 2
Aug 30 21:11:30 Tower kernel: EDAC MC0: 160 CE memory read error on CPU_SrcID#0_Ha#0_Chan#3_DIMM#0 (channel:3 slot:0 page:0x37e3d2 offset:0x7c0 grain:32 syndrome:0x0 - OVERFLOW area:DRAM err_code:0001:0093 socket:0 ha:0 channel_mask:8 rank:0)
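For anyone reading the log, the "Bank 7" values can be unpacked by hand. The following is a minimal sketch, assuming the standard Intel machine-check status layout (valid bit 63, overflow bit 62, uncorrected bit 61, corrected-error count in bits 52:38, MCA error code in bits 15:0) and the memory-controller error code format `0000 0000 1MMM CCCC`. The function name `decode_mci_status` and the dictionary keys are made up for illustration; they are not part of any tool mentioned in this thread.

```python
# Operation types for the MMM field of a memory-controller MCA error code.
MEM_OPS = {
    0: "generic",
    1: "memory read",
    2: "memory write",
    3: "address/command",
    4: "memory scrubbing",
}

def decode_mci_status(status):
    """Split a 64-bit machine-check status value into its main fields."""
    code = status & 0xFFFF
    fields = {
        "valid": bool((status >> 63) & 1),
        "overflow": bool((status >> 62) & 1),
        "uncorrected": bool((status >> 61) & 1),
        "misc_valid": bool((status >> 59) & 1),
        "addr_valid": bool((status >> 58) & 1),
        "corrected_count": (status >> 38) & 0x7FFF,
        "model_specific_code": (status >> 16) & 0xFFFF,
        "mca_code": code,
    }
    # Memory-controller errors have bit 7 set and bits 15:8 clear.
    if (code & 0xFF80) == 0x0080:
        fields["operation"] = MEM_OPS.get((code >> 4) & 0x7, "reserved")
        fields["mca_channel"] = code & 0xF
    return fields

# Status words copied from the log above.
for word in (0xCC00280000010093, 0xCC00268000010093, 0xCC0027C000010093):
    info = decode_mci_status(word)
    print(hex(word), info["corrected_count"], info.get("operation"),
          "uncorrected" if info["uncorrected"] else "corrected")
```

The corrected-error counts this yields match the numbers in front of "CE memory read error" in the log, and the uncorrected bit is clear, so these are errors that ECC corrected. The channel nibble inside the MCA code is not necessarily the channel the EDAC driver prints; as far as I know the sbridge driver works out its channel from the failing address, so treat `mca_channel` as a hint only.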
spencerdf Posted August 31, 2018 Channels 2 and 3, slot 0. It just seems strange that this only just started and that 2 DIMMs would fail at the same time.
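To check the channel and slot reading against a longer syslog, the EDAC summary lines can be tallied. This is a rough sketch, assuming the line format shown in the log earlier in the thread; `tally_ce` and `SAMPLE` are illustrative names, and the leading number on each line is taken to be the corrected-error count for that event.

```python
import re
from collections import Counter

# Matches the summary line printed for a corrected error (CE).
CE_LINE = re.compile(
    r"EDAC MC(?P<mc>\d+): (?P<count>\d+) CE memory read error on \S+ "
    r"\(channel:(?P<channel>\d+) slot:(?P<slot>\d+) page:(?P<page>0x[0-9a-f]+)"
)

def tally_ce(log_text):
    """Return (events, errors, pages) keyed by (mc, channel, slot)."""
    events = Counter()   # number of log events per location
    errors = Counter()   # sum of the reported CE counts per location
    pages = set()        # distinct memory pages that were hit
    for m in CE_LINE.finditer(log_text):
        key = (int(m["mc"]), int(m["channel"]), int(m["slot"]))
        events[key] += 1
        errors[key] += int(m["count"])
        pages.add(m["page"])
    return events, errors, pages

# Two summary lines copied from the log in this thread.
SAMPLE = (
    "Aug 30 21:10:23 Tower kernel: EDAC MC0: 166 CE memory read error on "
    "CPU_SrcID#0_Ha#0_Chan#2_DIMM#0 (channel:2 slot:0 page:0x37e3d2 "
    "offset:0x180 grain:32 syndrome:0x0 - OVERFLOW area:DRAM "
    "err_code:0001:0093 socket:0 ha:0 channel_mask:4 rank:0)\n"
    "Aug 30 21:11:30 Tower kernel: EDAC MC0: 160 CE memory read error on "
    "CPU_SrcID#0_Ha#0_Chan#3_DIMM#0 (channel:3 slot:0 page:0x37e3d2 "
    "offset:0x7c0 grain:32 syndrome:0x0 - OVERFLOW area:DRAM "
    "err_code:0001:0093 socket:0 ha:0 channel_mask:8 rank:0)\n"
)

events, errors, pages = tally_ce(SAMPLE)
for key in sorted(events):
    print(key, events[key], errors[key])
print(sorted(pages))
```

Run against the whole syslog (for example the text of the attached file read with `open(...).read()`), this shows at a glance whether the errors stay on the same controller, channel and slot over time. In the excerpt posted here every event reports the same page, 0x37e3d2, even though two channels are named.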
John_M Posted August 31, 2018 12 minutes ago, John_M said: What does your BIOS event log say?
spencerdf Posted August 31, 2018 The event log in the IPMI tools doesn't show anything over the last few days.
John_M Posted August 31, 2018 Stress test your RAM by running memtest from the unRAID boot menu and then look again in the event log. Or remove DIMMs, test, shuffle, and test again until you find which ones are bad.