log hitting 100% after only 4 days

Brydezen · March 12, 2018

Hello guys,

I know some of you leave your servers on for months before restarting. I left my server on while I was away, for some downloads and to host VMs for a friend. But now the "Fix Common Problems" plugin says my log is almost at 100% usage. A reboot would fix this temporarily, but I want to fix it permanently. Here are some screenshots and my diagnostics. I hope someone can help me.

tower-diagnostics-20180312-1531.zip

Brydezen · March 12, 2018

I have tried to open the log from the log button in the web GUI and left it loading for almost an hour, but nothing ever came up. :-(

JorgeB · March 12, 2018

Syslog is getting spammed with memory errors:

Mar  9 07:07:06 Tower kernel: EDAC MC1: 24441 CE memory read error on CPU_SrcID#1_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0x1052674 offset:0xc00 grain:32 syndrome:0x0 -  OVERFLOW area:DRAM err_code:0001:0090 socket:1 ha:0 channel_mask:1 rank:0)
Mar  9 07:07:06 Tower kernel: EDAC sbridge MC1: HANDLING MCE MEMORY ERROR
Mar  9 07:07:06 Tower kernel: EDAC sbridge MC1: CPU 8: Machine Check Event: 0 Bank 5: cc16544000010090
Mar  9 07:07:06 Tower kernel: EDAC sbridge MC1: TSC 0
Mar  9 07:07:06 Tower kernel: EDAC sbridge MC1: ADDR 107ec70600
Mar  9 07:07:06 Tower kernel: EDAC sbridge MC1: MISC 204a00e086
Mar  9 07:07:06 Tower kernel: EDAC sbridge MC1: PROCESSOR 0:206d7 TIME 1520575626 SOCKET 1 APIC 20
Mar  9 07:07:06 Tower kernel: EDAC MC1: 22865 CE memory read error on CPU_SrcID#1_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0x107ec70 offset:0x600 grain:32 syndrome:0x0 -  OVERFLOW area:DRAM err_code:0001:0090 socket:1 ha:0 channel_mask:1 rank:0)
Mar  9 07:07:06 Tower kernel: EDAC sbridge MC1: HANDLING MCE MEMORY ERROR
Mar  9 07:07:06 Tower kernel: EDAC sbridge MC1: CPU 8: Machine Check Event: 0 Bank 9: cc0001d0000800c1
Mar  9 07:07:06 Tower kernel: EDAC sbridge MC1: TSC 0
Mar  9 07:07:06 Tower kernel: EDAC sbridge MC1: ADDR 89b248000
Mar  9 07:07:06 Tower kernel: EDAC sbridge MC1: MISC 90840000000208c
Mar  9 07:07:06 Tower kernel: EDAC sbridge MC1: PROCESSOR 0:206d7 TIME 1520575626 SOCKET 1 APIC 20
Mar  9 07:07:06 Tower kernel: EDAC MC1: 7 CE memory scrubbing error on CPU_SrcID#1_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0x89b248 offset:0x0 grain:32 syndrome:0x0 -  OVERFLOW area:DRAM err_code:0008:00c1 socket:1 ha:0 channel_mask:1 rank:1)
Mar  9 07:07:07 Tower kernel: EDAC sbridge MC1: HANDLING MCE MEMORY ERROR
Mar  9 07:07:07 Tower kernel: EDAC sbridge MC1: CPU 8: Machine Check Event: 0 Bank 5: cc11044000010090

These are hardware errors; the system event log should have more info on the affected slots.

Brydezen · March 12, 2018

Oh, how do I check the system event log or post it? I do want to fix this problem if I can, or contact the seller who sold me the RAM, as it's "kinda" new, but still used.

Is it only DIMM 0 throwing errors? I have tried skimming through syslog.1 and syslog.2, and it looks like only Channel 0, DIMM 0 is throwing errors.
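Rather than skimming the rotated logs by eye, the EDAC summary lines can be tallied per DIMM label. The following is only a minimal sketch: the function names are made up for illustration, the regular expression is based solely on the line format quoted earlier in this thread, and it assumes the number before "CE" is the error count the kernel reports for that event.

```python
import re
from collections import Counter

# Matches kernel EDAC summary lines such as:
#   EDAC MC1: 24441 CE memory read error on CPU_SrcID#1_Ha#0_Chan#0_DIMM#0 (...)
# The "EDAC sbridge MC1: ..." detail lines are deliberately not matched.
EDAC_LINE = re.compile(
    r"EDAC MC(?P<mc>\d+): (?P<count>\d+) (?P<kind>CE|UE) .*? on (?P<label>\S+)"
)

def tally_edac_errors(lines):
    """Sum the reported error counts per (controller, kind, DIMM label)."""
    totals = Counter()
    for line in lines:
        m = EDAC_LINE.search(line)
        if m:
            key = (f"MC{m['mc']}", m["kind"], m["label"])
            totals[key] += int(m["count"])
    return totals

def tally_files(paths):
    """Tally several syslog files (hypothetical helper, e.g. syslog, syslog.1)."""
    totals = Counter()
    for path in paths:
        # errors="replace" keeps going if the log contains odd bytes
        with open(path, errors="replace") as fh:
            totals.update(tally_edac_errors(fh))
    return totals
```

Calling `tally_files` with the paths of the syslog files from the diagnostics zip and printing the result would show whether any label other than `CPU_SrcID#1_Ha#0_Chan#0_DIMM#0` appears at all. A single key in the result would support the impression that only one DIMM is affected.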

JorgeB · March 12, 2018

I only have Supermicro boards, but ASRock should be similar: the SEL should be visible in the BIOS or through IPMI.

Brydezen · March 12, 2018

Do I need to look at the SEL logs, or can it maybe be fixed just by rebooting?

Kinda wish I still had the manual under my bed, but put it in the basement in the box and now a lot of stuff is on top of it -.-

JorgeB · March 12, 2018

1 minute ago, Brydezen said:

Do I need to look at the SEL logs, or can it maybe be fixed just by rebooting?

You need to identify the memory/slot causing the errors to fix it. Rebooting will fix the log size problem but not the memory errors; it will be the same after a few days.

Brydezen · March 12, 2018

So the memory is bad, as in it needs to be replaced? Can't I just go by the log in Unraid, which says it's DIMM 0?

JorgeB · March 12, 2018

Try removing that DIMM and running without it for a while. If the errors stop, try using it in a different slot; if the errors stay with the DIMM, that's the problem.
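To check whether errors keep accumulating after each swap without waiting for the log to fill up, the kernel's EDAC counters can be read directly. This is a rough sketch under assumptions: it relies on the per-DIMM sysfs files (`dimm_ce_count`, `dimm_label`) that the Linux EDAC documentation describes under `/sys/devices/system/edac/mc`, and nothing in this thread confirms that this particular board and kernel expose them. The function names are hypothetical.

```python
from pathlib import Path

# Assumed location of the kernel's EDAC counters; not verified on this system.
EDAC_ROOT = Path("/sys/devices/system/edac/mc")

def read_ce_counts(root=EDAC_ROOT):
    """Return {(controller, dimm_label): corrected_error_count} from sysfs."""
    counts = {}
    for dimm in sorted(Path(root).glob("mc*/dimm*")):
        count_file = dimm / "dimm_ce_count"
        if not count_file.is_file():
            continue  # skip entries that do not expose a counter
        label_file = dimm / "dimm_label"
        # Fall back to the directory name (e.g. "dimm0") if no label exists
        label = label_file.read_text().strip() if label_file.is_file() else dimm.name
        counts[(dimm.parent.name, label)] = int(count_file.read_text().strip())
    return counts

def new_errors(before, after):
    """DIMMs whose corrected-error count grew between two snapshots."""
    return {
        key: after[key] - before.get(key, 0)
        for key in after
        if after[key] > before.get(key, 0)
    }
```

Taking one snapshot with `read_ce_counts()` after booting, another a few hours later, and passing both to `new_errors` would list only the DIMMs that logged new corrected errors in between. An empty result after removing the suspect DIMM would point at that module or its slot.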

pwm · March 13, 2018

On 3/12/2018 at 6:48 PM, Brydezen said:

Do I need to look at the SEL logs, or can it maybe be fixed just by rebooting?

Kinda wish I still had the manual under my bed, but put it in the basement in the box and now a lot of stuff is on top of it -.-

I have as a rule to always keep downloaded manuals for motherboards etc. easily accessible from some other machine, without the need for a working network.
