Brydezen Posted March 12, 2018
Hello guys, I know some of you leave your servers on for maybe months before restarting. I left my server on while I was away, for some downloads and for hosting VMs for a friend. But now the "Fix Common Problems" plugin says my log is almost at 100% usage. A reboot would fix this temporarily, but I want to fix it permanently. Here are some screenshots and my diagnostics. Hope someone can help me.
tower-diagnostics-20180312-1531.zip
Brydezen Posted March 12, 2018
I tried to open the log from the log button in the web GUI and left it loading for almost an hour, but nothing ever came up. :-(
JorgeB Posted March 12, 2018
Syslog is getting spammed with memory errors:

Mar 9 07:07:06 Tower kernel: EDAC MC1: 24441 CE memory read error on CPU_SrcID#1_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0x1052674 offset:0xc00 grain:32 syndrome:0x0 - OVERFLOW area:DRAM err_code:0001:0090 socket:1 ha:0 channel_mask:1 rank:0)
Mar 9 07:07:06 Tower kernel: EDAC sbridge MC1: HANDLING MCE MEMORY ERROR
Mar 9 07:07:06 Tower kernel: EDAC sbridge MC1: CPU 8: Machine Check Event: 0 Bank 5: cc16544000010090
Mar 9 07:07:06 Tower kernel: EDAC sbridge MC1: TSC 0
Mar 9 07:07:06 Tower kernel: EDAC sbridge MC1: ADDR 107ec70600
Mar 9 07:07:06 Tower kernel: EDAC sbridge MC1: MISC 204a00e086
Mar 9 07:07:06 Tower kernel: EDAC sbridge MC1: PROCESSOR 0:206d7 TIME 1520575626 SOCKET 1 APIC 20
Mar 9 07:07:06 Tower kernel: EDAC MC1: 22865 CE memory read error on CPU_SrcID#1_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0x107ec70 offset:0x600 grain:32 syndrome:0x0 - OVERFLOW area:DRAM err_code:0001:0090 socket:1 ha:0 channel_mask:1 rank:0)
Mar 9 07:07:06 Tower kernel: EDAC sbridge MC1: HANDLING MCE MEMORY ERROR
Mar 9 07:07:06 Tower kernel: EDAC sbridge MC1: CPU 8: Machine Check Event: 0 Bank 9: cc0001d0000800c1
Mar 9 07:07:06 Tower kernel: EDAC sbridge MC1: TSC 0
Mar 9 07:07:06 Tower kernel: EDAC sbridge MC1: ADDR 89b248000
Mar 9 07:07:06 Tower kernel: EDAC sbridge MC1: MISC 90840000000208c
Mar 9 07:07:06 Tower kernel: EDAC sbridge MC1: PROCESSOR 0:206d7 TIME 1520575626 SOCKET 1 APIC 20
Mar 9 07:07:06 Tower kernel: EDAC MC1: 7 CE memory scrubbing error on CPU_SrcID#1_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0x89b248 offset:0x0 grain:32 syndrome:0x0 - OVERFLOW area:DRAM err_code:0008:00c1 socket:1 ha:0 channel_mask:1 rank:1)
Mar 9 07:07:07 Tower kernel: EDAC sbridge MC1: HANDLING MCE MEMORY ERROR
Mar 9 07:07:07 Tower kernel: EDAC sbridge MC1: CPU 8: Machine Check Event: 0 Bank 5: cc11044000010090

These are hardware errors; the system event log should have more info on the affected slots.
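To check whether a single DIMM accounts for all of the errors without opening a huge syslog in the GUI, the EDAC summary lines can be tallied with a short script. This is only a sketch: the regular expression is written against the line format quoted in this thread, the function names `tally_edac_errors` and `tally_files` are made up for illustration, and treating the leading number as the error count reported for that event is an assumption about the message format.

```python
import re
from collections import Counter

# Matches EDAC summary lines shaped like the ones quoted in this thread, e.g.
#   "... kernel: EDAC MC1: 24441 CE memory read error on <label> (channel:0 ..."
# The "EDAC sbridge MC1: ..." detail lines are intentionally not matched.
EDAC_LINE = re.compile(
    r"EDAC MC(?P<mc>\d+): (?P<count>\d+) (?P<kind>CE|UE) .+? on (?P<label>\S+) \("
)


def tally_edac_errors(lines):
    """Sum the reported counts per (memory controller, DIMM label, CE/UE)."""
    totals = Counter()
    for line in lines:
        match = EDAC_LINE.search(line)
        if match:
            key = (int(match["mc"]), match["label"], match["kind"])
            totals[key] += int(match["count"])
    return totals


def tally_files(paths):
    """Tally several log files (for example syslog, syslog.1, syslog.2)."""
    totals = Counter()
    for path in paths:
        # errors="replace" keeps the scan going if the log has odd bytes in it.
        with open(path, "r", errors="replace") as handle:
            totals.update(tally_edac_errors(handle))
    return totals
```

Calling `tally_files(["/var/log/syslog", "/var/log/syslog.1"])` (paths assumed, adjust to wherever the logs live) and printing `totals.most_common()` would show at a glance whether any label other than `CPU_SrcID#1_Ha#0_Chan#0_DIMM#0` appears.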
Brydezen Posted March 12, 2018
Oh, how do I check the system event log or post it? I want to fix this problem if I can, or contact the seller who sold me the RAM, as it's "kinda" new, but still used. Is it only DIMM 0 throwing errors? I have skimmed through syslog.1 and syslog.2, and it looks like only Channel 0, DIMM 0 is throwing errors.
JorgeB Posted March 12, 2018
I only have Supermicro boards, but ASRock should be similar: the SEL should be visible in the BIOS or through IPMI.
Brydezen Posted March 12, 2018
Do I need to look at the SEL logs, or can it maybe be fixed just by rebooting? Kinda wish I still had the manual under my bed, but I put it in its box in the basement and now a lot of stuff is on top of it -.-
JorgeB Posted March 12, 2018
1 minute ago, Brydezen said: Do I need to look at the SEL logs, or can it maybe be fixed just by rebooting?
You need to identify the memory module or slot causing the errors to fix it. Rebooting will fix the log size problem but not the memory errors; the log will fill up again after a few days.
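For keeping an eye on how quickly the log fills up again after a reboot, something like the following could be run from time to time. It is a rough sketch using only the Python standard library; the default path `/var/log` and the helper names `log_usage_percent` and `largest_files` are assumptions for illustration, not anything the plugin itself uses.

```python
import shutil
from pathlib import Path


def log_usage_percent(path="/var/log"):
    """Return how full the filesystem holding `path` is, in percent."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0
    return 100.0 * usage.used / usage.total


def largest_files(path="/var/log", limit=5):
    """Return (size_in_bytes, Path) for the biggest files directly in `path`."""
    files = [(p.stat().st_size, p) for p in Path(path).iterdir() if p.is_file()]
    files.sort(key=lambda item: item[0], reverse=True)
    return files[:limit]
```

If the syslog files dominate the list from `largest_files()` and the percentage keeps climbing, the memory errors are still being logged.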
Brydezen Posted March 12, 2018
So the memory is bad, as in it needs to be replaced? Can't I just go by the log in Unraid, which says it's DIMM 0?
JorgeB Posted March 12, 2018
Try removing that DIMM and running without it for a while. If the errors stop, try using it in a different slot; if the errors follow the DIMM, the DIMM is the problem.
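While swapping the DIMM between slots, the kernel's EDAC counters can be compared before and after each change instead of waiting for the syslog to fill. The sketch below assumes the EDAC sysfs tree is at `/sys/devices/system/edac/mc` and that per-DIMM files named `dimm_label`, `dimm_ce_count` and `dimm_ue_count` are present; some kernels do not expose the per-DIMM counters, in which case those fields come back as `None`. The function name `read_dimm_counters` is made up for this example.

```python
from pathlib import Path


def read_dimm_counters(root="/sys/devices/system/edac/mc"):
    """Collect label and error counters for every mc*/dimm* directory under root."""
    results = []
    for dimm_dir in sorted(Path(root).glob("mc*/dimm*")):

        def read(name, directory=dimm_dir):
            # Missing attributes are reported as None rather than raising.
            attr = directory / name
            return attr.read_text().strip() if attr.is_file() else None

        ce = read("dimm_ce_count")
        ue = read("dimm_ue_count")
        results.append(
            {
                "mc": dimm_dir.parent.name,
                "dimm": dimm_dir.name,
                "label": read("dimm_label"),
                "ce_count": int(ce) if ce is not None else None,
                "ue_count": int(ue) if ue is not None else None,
            }
        )
    return results
```

A corrected-error count that keeps rising for the same label after the module has been moved would point at the DIMM rather than the slot.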
pwm Posted March 13, 2018
On 3/12/2018 at 6:48 PM, Brydezen said: Do I need to look at the SEL logs, or can it maybe be fixed just by rebooting? Kinda wish I still had the manual under my bed, but put it in the basement in the box and now a lot of stuff is on top of it -.-
I make it a rule to always keep downloaded manuals for motherboards etc. easily accessible from some other machine, without needing a working network.
This topic is now archived and is closed to further replies.