Posted (edited)

Hi,

The server has frozen again.

I got this in my log:

Jul 10 17:58:28 THE-BEAST kernel: mce: [Hardware Error]: Machine check events logged
Jul 10 17:58:28 THE-BEAST kernel: EDAC sbridge MC1: HANDLING MCE MEMORY ERROR
Jul 10 17:58:28 THE-BEAST kernel: EDAC sbridge MC1: CPU 10: Machine Check Event: 0 Bank 7: cc00038000010092
Jul 10 17:58:28 THE-BEAST kernel: EDAC sbridge MC1: TSC 1d1bb543a0b50 
Jul 10 17:58:28 THE-BEAST kernel: EDAC sbridge MC1: ADDR f29c05d00 
Jul 10 17:58:28 THE-BEAST kernel: EDAC sbridge MC1: MISC 427ee286 
Jul 10 17:58:28 THE-BEAST kernel: EDAC sbridge MC1: PROCESSOR 0:306e4 TIME 1562795908 SOCKET 1 APIC 20
Jul 10 17:58:28 THE-BEAST kernel: EDAC MC1: 14 CE memory read error on CPU_SrcID#1_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0xf29c05 offset:0xd00 grain:32 syndrome:0x0 -  OVERFLOW area:DRAM err_code:0001:0092 socket:1 ha:0 channel_mask:1 rank:0)
Jul 11 18:48:14 THE-BEAST kernel: mdcmd (45): set md_write_method 1

 

Is this a memory DIMM failure?

 

++

the-beast-diagnostics-20190711-2302.zip

Edited by ibasaw

3 hours ago, ibasaw said:

Is this a memory DIMM failure?

 

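Yup. The EDAC line in your log reports corrected (CE) memory read errors on CPU_SrcID#1_Ha#0_Chan#0_DIMM#0.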
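
For anyone who wants to double-check that reading of the log, here is a minimal Python sketch that decodes the machine-check status word from the `Bank 7` line. The bit positions used (valid = bit 63, overflow = bit 62, uncorrected = bit 61, corrected-error count = bits 38 to 52, low 16 bits = MCA error code) are my understanding of the Intel machine-check status register layout; the function and field names are made up for illustration.

```python
def decode_mci_status(status: int) -> dict:
    """Split a 64-bit machine-check status word into a few well-known fields."""

    def bits(lo: int, hi: int) -> int:
        # Extract the inclusive bit range lo..hi from the status word.
        return (status >> lo) & ((1 << (hi - lo + 1)) - 1)

    return {
        "valid": bool(bits(63, 63)),
        "overflow": bool(bits(62, 62)),        # more errors occurred than were logged
        "uncorrected": bool(bits(61, 61)),     # False means ECC corrected the error
        "corrected_count": bits(38, 52),
        "model_specific_code": bits(16, 31),
        "mca_error_code": bits(0, 15),
    }


# Status value copied from the "Bank 7" line in the log above.
fields = decode_mci_status(0xCC00038000010092)
print(fields)
```

For the value in the log this gives a valid, corrected error with the overflow flag set and a corrected count of 14, which lines up with the `14 CE memory read error ... OVERFLOW ... err_code:0001:0092` line from EDAC.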


Why are the crashes random?

Why can the server boot if a DIMM has failed?

I need to run memtest for 24 hours, right?


Memtest from the boot menu may or may not catch it, because ECC will correct most of the errors before memtest ever sees them.

As for why it can boot: the memory in question may simply not be accessed at that time.
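
To illustrate why a memory test can read back correct data from a faulty cell, here is a toy Python sketch of single-bit error correction. It uses a Hamming(7,4) code purely as a stand-in: ECC DIMMs carry 8 check bits per 64 data bits and the memory controller does this in hardware, so treat this as an illustration of the idea, not of the actual code used on the board.

```python
def encode(data_bits):
    """Encode 4 data bits as a 7-bit Hamming codeword (positions 1..7)."""
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4   # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4   # covers positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4   # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def read(word):
    """Return (data_bits, corrected_position); position 0 means no error seen."""
    w = list(word)
    s1 = w[0] ^ w[2] ^ w[4] ^ w[6]
    s2 = w[1] ^ w[2] ^ w[5] ^ w[6]
    s4 = w[3] ^ w[4] ^ w[5] ^ w[6]
    position = s1 + 2 * s2 + 4 * s4   # syndrome points at the flipped bit
    if position:
        w[position - 1] ^= 1          # silently repair the single-bit error
    return [w[2], w[4], w[5], w[6]], position


data = [1, 0, 1, 1]
stored = encode(data)
stored[4] ^= 1                        # simulate a flaky cell flipping one bit
recovered, position = read(stored)
print(recovered, position)
```

The reader gets the original data back, and only the returned position records that a correction happened. That mirrors the situation here: the test program sees good data, while the correction shows up only as a CE entry in the kernel log.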


Posted (edited)

Hi!

Memtest passed: 0 errors after 58 hours.

The server has crashed again, so I came back once the memtest had finished.

Here is the rsyslog log from the day before the crash. There is nothing in the log this time...

 

Jul 19 13:51:07 THE-BEAST kernel: eth0: renamed from vethe9c74f1
Jul 19 13:51:07 THE-BEAST kernel: IPv6: ADDRCONF(NETDEV_CHANGE): veth167ddcd: link becomes ready
Jul 19 13:51:07 THE-BEAST kernel: docker0: port 6(veth167ddcd) entered blocking state
Jul 19 13:51:07 THE-BEAST kernel: docker0: port 6(veth167ddcd) entered forwarding state
Jul 20 00:20:02 THE-BEAST sSMTP[12660]: Creating SSL connection to host
Jul 20 00:20:02 THE-BEAST sSMTP[12660]: SSL connection using TLS_AES_256_GCM_SHA384
Jul 20 00:20:05 THE-BEAST sSMTP[12660]: Sent mail for xxx@gmail.com (221 2.0.0 closing connection m3sm12585949qki.87 - gsmtp) uid=0 username=root outbytes=1591
Jul 20 06:00:16 THE-BEAST crond[2661]: exit status 1 from user root /usr/local/sbin/mover &> /dev/null

 

What's next?

 

++

 

 

IMG_20190722_183920.jpg

the-beast-diagnostics-20190722-2243.zip

Edited by ibasaw


It froze again tonight, and there is nothing in the log...

This is becoming really annoying!

I have updated to 6.7.rc2. Wait and see.


You still have a bad DIMM. As stated, memtest may or may not catch it because of ECC.

15 minutes ago, ibasaw said:

How do I find the bad one?

Pull one DIMM at a time and run without it. If the errors continue, the bad DIMM is still installed, so swap the one you pulled with a different one and try again. When the bad DIMM is removed, the errors will stop.
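
One way to check whether errors "continue" without waiting for the next freeze is to look at the kernel's EDAC counters after each swap. A rough Python sketch, assuming the EDAC driver on this kernel exposes per-DIMM files named `dimm_label` and `dimm_ce_count` under `/sys/devices/system/edac/mc/mc*/dimm*/`; that layout depends on the kernel and driver version, and the function name is my own.

```python
from pathlib import Path


def read_dimm_ce_counts(root="/sys/devices/system/edac/mc"):
    """Return {(controller, label): corrected_error_count} for each DIMM found."""
    counts = {}
    for dimm in sorted(Path(root).glob("mc*/dimm*")):
        count_file = dimm / "dimm_ce_count"
        if not count_file.is_file():
            continue  # this kernel uses a different layout; skip the entry
        label_file = dimm / "dimm_label"
        # Fall back to the directory name if the driver provides no label.
        label = label_file.read_text().strip() if label_file.is_file() else dimm.name
        counts[(dimm.parent.name, label)] = int(count_file.read_text().strip())
    return counts


for (controller, label), ce in read_dimm_ce_counts().items():
    print(f"{controller} {label}: {ce} corrected errors")
```

A counter that keeps climbing after a module has been pulled suggests the bad DIMM is still in the machine. The counters start from zero on each boot, so compare values within the same uptime.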

On 7/11/2019 at 7:03 PM, ibasaw said:

Jul 10 17:58:28 THE-BEAST kernel: EDAC MC1: 14 CE memory read error on CPU_SrcID#1_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0xf29c05 offset:0xd00 grain:32 syndrome:0x0 - OVERFLOW area:DRAM err_code:0001:0092 socket:1 ha:0 channel_mask:1 rank:0)

 

3 hours ago, ibasaw said:

That will take a very long time to test ;)

I would do a bit of research on the Internet and read the motherboard manual to see if I could find some direction as to which module it might be.
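
The EDAC line quoted above already carries some location hints. Here is a small Python sketch that pulls them out with a regular expression. The pattern is written against the exact line format shown in this thread and is only an assumption for other kernel or driver versions; matching socket, channel and slot to a slot name printed on the board still needs the motherboard manual.

```python
import re

# Matches lines of the form seen in this thread:
# "EDAC MC1: 14 CE memory read error on <label> (channel:0 slot:0 ... socket:1 ..."
EDAC_LINE = re.compile(
    r"EDAC MC(?P<mc>\d+): (?P<count>\d+) (?P<kind>CE|UE) .*? on (?P<label>\S+) "
    r"\(channel:(?P<channel>\d+) slot:(?P<slot>\d+) .*?socket:(?P<socket>\d+)"
)


def parse_edac_line(line):
    """Return the location fields of an EDAC error line, or None if it does not match."""
    match = EDAC_LINE.search(line)
    if match is None:
        return None
    info = match.groupdict()
    for key in ("mc", "count", "channel", "slot", "socket"):
        info[key] = int(info[key])
    return info


sample = (
    "Jul 10 17:58:28 THE-BEAST kernel: EDAC MC1: 14 CE memory read error on "
    "CPU_SrcID#1_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0xf29c05 offset:0xd00 "
    "grain:32 syndrome:0x0 - OVERFLOW area:DRAM err_code:0001:0092 socket:1 ha:0 "
    "channel_mask:1 rank:0)"
)
print(parse_edac_line(sample))
```

For the line from this thread the result points at socket 1, channel 0, slot 0, which narrows the search to the DIMMs attached to the second CPU.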


News:

I tested the memory one module at a time: I removed one and waited for a crash. I have cycled through all my modules and the server still crashes every time.

No errors in the logs.

What do I do now?

This is really annoying...

On 9/11/2019 at 5:43 PM, ibasaw said:

News:

I tested the memory one module at a time: I removed one and waited for a crash. I have cycled through all my modules and the server still crashes every time.

No errors in the logs.

What do I do now?

This is really annoying...

Hi there,

 

You'll need to attach a monitor to your server and boot into console mode. Log in from there and then type the following command:

tail -f /var/log/syslog

This will begin printing all log events directly to the monitor. If/when the server crashes, it should print some more info to the monitor. Take a pic of that and post it here for our review. This definitely seems like a hardware-specific error, but only more info from the logs after a crash will reveal the cause.
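
If the full log scrolls too fast to photograph, a filtered view may help. The following Python sketch follows a log file in a similar way to `tail -f` and keeps only lines that look like hardware error reports. The keyword list is a guess based on the messages earlier in this thread, and both function names are hypothetical.

```python
import os
import re
import time

# Keywords taken from the hardware-related messages seen earlier in this thread.
HARDWARE_HINTS = re.compile(r"Hardware Error|EDAC|mce:|Machine Check", re.IGNORECASE)


def hardware_lines(lines):
    """Yield only the lines that look like hardware error reports."""
    for line in lines:
        if HARDWARE_HINTS.search(line):
            yield line.rstrip("\n")


def follow(path, poll_seconds=1.0):
    """Yield lines appended to the file at path, polling for new data."""
    with open(path, "r", errors="replace") as handle:
        handle.seek(0, os.SEEK_END)  # start at the end, like tail -f
        while True:
            line = handle.readline()
            if line:
                yield line
            else:
                time.sleep(poll_seconds)
```

Running `for line in hardware_lines(follow("/var/log/syslog")): print(line)` on the console would then show only the matching events until interrupted with Ctrl+C. Anything the kernel prints straight to the screen during a hard freeze will not pass through this filter, so a photo of the monitor is still worth taking.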

