sand_

System reboots while doing parity check


Not sure what's occurring. I'm trying out Unraid right now, and running a parity check seems to cause the system to reboot after about 4 hours (I'm not certain, since I've never been able to observe it when it happens). Fix Common Problems found that I had Machine Check Events.

tower-diagnostics-20200801-0200.zip


I would run Memtest (a boot menu option) for 24 hours...

Aug  1 01:24:29 Tower kernel: mce: [Hardware Error]: Machine check events logged
Aug  1 01:24:29 Tower kernel: mce: [Hardware Error]: CPU 2: Machine Check: 0 Bank 1: bf80000000000124
Aug  1 01:24:29 Tower kernel: mce: [Hardware Error]: TSC 0 ADDR 40fd03e00 MISC 86
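If you want to pull these entries out of a longer syslog rather than scanning it by eye, here is a minimal Python sketch. It assumes the line layout shown above ("CPU n: Machine Check... Bank b: status"); the function name `parse_mce_lines` and the regular expression are my own illustration, not part of Unraid or any tool, and other kernel versions may format these lines differently.

```python
import re

# Matches the "CPU n: Machine Check[ Exception][:] x Bank b: <16 hex digits>" line.
# The value printed just before "Bank" is matched but not interpreted here.
MCE_LINE = re.compile(
    r"CPU (?P<cpu>\d+): Machine Check(?: Exception)?:? [0-9a-fA-F]+ "
    r"Bank (?P<bank>\d+): (?P<status>[0-9a-fA-F]{16})"
)


def parse_mce_lines(text):
    """Return one dict per machine-check status line found in text."""
    events = []
    for line in text.splitlines():
        match = MCE_LINE.search(line)
        if match:
            events.append(
                {
                    "cpu": int(match.group("cpu")),
                    "bank": int(match.group("bank")),
                    # Keep the status as an integer so its bits can be tested later.
                    "status": int(match.group("status"), 16),
                }
            )
    return events


if __name__ == "__main__":
    sample = (
        "Aug  1 01:24:29 Tower kernel: mce: [Hardware Error]: "
        "CPU 2: Machine Check: 0 Bank 1: bf80000000000124"
    )
    for event in parse_mce_lines(sample):
        print(event["cpu"], event["bank"], hex(event["status"]))
```

Counting how many events turn up, and whether they always name the same CPU and bank, can help when comparing logs from several crashes.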

 


8 hours in with 4 passes, still no errors. Good, I'm guessing?

On 8/1/2020 at 7:53 AM, Frank1940 said:

I would run Memtest (a boot menu option) for 24 hours...


Aug  1 01:24:29 Tower kernel: mce: [Hardware Error]: Machine check events logged
Aug  1 01:24:29 Tower kernel: mce: [Hardware Error]: CPU 2: Machine Check: 0 Bank 1: bf80000000000124
Aug  1 01:24:29 Tower kernel: mce: [Hardware Error]: TSC 0 ADDR 40fd03e00 MISC 86

 

Zero errors during the memtest after 10 passes.


Next step: let's see whether there is a clue in the syslog at the time the failure occurs. Set up the Syslog Server per the following set of instructions:

I would use the Mirror Syslog to Flash method, since the failure occurs within a few hours.


It took about 6 hours for it to reboot this time. I was able to catch it, and I think it might have said there was a kernel panic.

syslog


The reboot occurs at line 22 of the syslog, at 22:06:21.

I am not an expert at reading syslogs, but I don't see anything in the first 21 lines that is not typical of normal operation.

I must ask: is it possible that you have a pet or child who might be pushing the reset button? Often during a parity check there is a nice flashing LED that tends to attract and demand attention from the curious.

Is this a new hardware build or a recycled computer? You might provide a few details about the background of this server.


While I do have pets, none were in my room when it rebooted, and there are no children in my house.

This is a recycled computer; it was my old gaming computer from about 5 years ago. It has an i5-4690K, 2x8 GB of RAM and a Corsair CX450 PSU.

Before I started using Unraid, it ran as a Windows/Ubuntu computer for a couple of weeks with no problems.

I'm going to try to snap a picture when it crashes, because the most recent reboot showed that text does appear on screen when it crashes.


Next thing to try: boot it in Safe Mode and see if it still reboots. Also reset any overclocking to the BIOS stock settings. (Overclocking is a no-no for servers!)

Also look at the inside of the case and make sure it is clean. Get the dust out of the heat sinks and fans. Make sure that the air flows over the drives. Basically, the fans at the back of the case should blow out. Double check that the power supply and motherboard power plugs are all securely plugged in. (By the way, power supplies have caused this problem in the past...) Most rebooting problems are hardware related.


A parity check is when the system is likely to be under maximum load. This suggests the problem might be either power supply or temperature related. Do you have access to another power supply to see if that might be the culprit?


I managed to capture the moment when it reboots, and it prints this:

mce: [Hardware Error]: CPU 2: Machine Check Exception 5 Bank 1: bf80000000000124
mce: [Hardware Error]: RIP !INEXACT! 10:<ffffffff81334b4e> {percpu_counter_add_batch+0x4e/0x52}
mce: [Hardware Error]: TSC 3963ac8a7429 ADDR 40b9a9340 MISC 86
mce: [Hardware Error]: PROCESSOR 0:306c3 TIME 1596582445 SOCKET 0 APIC 4 microcode 27
mce: [Hardware Error]: Run the above through 'mcelog --ascii'
mce: [Hardware Error]: Machine check: Processor context corrupt
Kernel panic - not syncing: Fatal machine check
Kernel Offset: disabled
Rebooting in 30 seconds..
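In case it helps anyone reading the output above, here is a rough Python sketch that unpacks the status word and the TIME field. It assumes the flag bit positions and the cache-hierarchy error code layout (000F 0001 RRRR TTLL) from the Intel SDM machine-check chapter as I understand them; the names `decode_status` and `mce_time` are made up for this example, and the model-specific bits are not decoded. Running the output through mcelog, as the kernel message suggests, remains the proper way to decode it.

```python
from datetime import datetime, timezone

# Architectural flag bits of IA32_MCi_STATUS (assumed from the Intel SDM).
FLAG_BITS = {
    "VAL": 63,    # register holds a valid error
    "OVER": 62,   # an earlier error was overwritten
    "UC": 61,     # error was not corrected
    "EN": 60,     # error reporting was enabled
    "MISCV": 59,  # MISC register is valid
    "ADDRV": 58,  # ADDR register is valid
    "PCC": 57,    # processor context corrupt
}

REQUEST = {0: "ERR", 1: "RD", 2: "WR", 3: "DRD", 4: "DWR",
           5: "IRD", 6: "PREFETCH", 7: "EVICT", 8: "SNOOP"}
TRANSACTION = {0: "instruction", 1: "data", 2: "generic"}
LEVEL = {0: "L0", 1: "L1", 2: "L2", 3: "generic"}


def decode_status(status):
    """Split a 64-bit status word into flags and the 16-bit MCA error code."""
    flags = {name: bool((status >> bit) & 1) for name, bit in FLAG_BITS.items()}
    mca_code = status & 0xFFFF
    result = {"flags": flags, "mca_code": mca_code}
    # Cache hierarchy errors: bits 15-13 are 0, bits 11-8 are 0001.
    if mca_code & 0xEF00 == 0x0100:
        result["cache_error"] = {
            "request": REQUEST.get((mca_code >> 4) & 0xF, "reserved"),
            "type": TRANSACTION.get((mca_code >> 2) & 0x3, "reserved"),
            "level": LEVEL[mca_code & 0x3],
        }
    return result


def mce_time(seconds):
    """Convert the TIME field (seconds since the Unix epoch) to UTC."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


if __name__ == "__main__":
    decoded = decode_status(0xBF80000000000124)
    print([name for name, is_set in decoded["flags"].items() if is_set])
    print(hex(decoded["mca_code"]), decoded.get("cache_error"))
    print(mce_time(1596582445).isoformat())
```

Under those assumptions, the status above has UC and PCC set, which matches the "Processor context corrupt" line, and code 0x0124 falls in the cache-hierarchy class as a data write at level 0. That describes where the CPU detected the error; it does not by itself say whether the CPU, the power supply, or heat is the root cause.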

I don't currently have a spare power supply on hand. I would use my main PC's PSU once I find a good sale on a replacement for it, however. It would suck if it is the PSU, as this one is only about 4 months old.
