Random(?) Crashes and Kernel Panics



The server has been up and running for several months. In the last week or so I started seeing issues I've never come across before.

I'm running Unraid 6.6.5

I struggled to even see the logs of the crashes until recently, when I read a post mentioning that you can turn on troubleshooting mode in the Fix Common Problems plugin, which then writes the logs to the flash drive. After enabling that feature last night, I let the server run all night without even starting the array. The only task I started was the preclear scripts on 4 new 6TB drives I was getting ready to install.

Here are some screenshots of the crashes prior to last night. I'd find the host unresponsive: no keyboard input worked and I couldn't access the web GUI, forcing me to do a hard reset.

image.png.ecc6f72d70b5abad23fdb23997d36553.png

image.png.1a364fcd42dfad3ebfe263ec3e33de6a.png

image.png.8bc53356b4c96de3d34ecec1b99aecf8.png

 

This morning I again found a similar error, but this time the system was responsive to both local keyboard input and the web GUI.

image.png.8a286bb8a74dbf877259fd9c66cd82fe.png

 

I was able to scp down the syslog file stored on the flash drive by the fix common problems plugin.
FCPsyslog_tail.txt
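
For anyone who wants to skim a saved syslog like this one for crash-related lines, here is a rough Python sketch. It assumes Python 3 on whatever machine the file was copied to, and the pattern list is only my guess at useful kernel message fragments, not a complete catalogue. The function names are made up for illustration.

```python
import re

# Fragments the Linux kernel commonly prints around crashes and hardware
# faults. This list is an illustrative assumption, not an exhaustive one.
PATTERNS = [
    r"Kernel panic",
    r"Call Trace",
    r"\[Hardware Error\]",
    r"BUG:",
    r"Oops",
    r"general protection fault",
]
SUSPICIOUS = re.compile("|".join(PATTERNS))


def find_suspicious_lines(lines):
    """Return (line_number, text) pairs for lines matching any pattern."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if SUSPICIOUS.search(line):
            hits.append((number, line.rstrip("\n")))
    return hits


def scan_file(path):
    """Scan a saved syslog file; path is wherever you copied the log to."""
    # errors="replace" keeps the scan going if the log contains odd bytes.
    with open(path, encoding="utf-8", errors="replace") as handle:
        return find_suspicious_lines(handle)
```

Something like `for n, text in scan_file("FCPsyslog_tail.txt"): print(n, text)` would then print only the lines worth a closer look.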

I tried to grab the full diagnostics file, but after 40-odd minutes the page still hasn't generated it.

When I scp'd the syslog file above, I saw a number of diagnostics .zip files in the same directory but nothing for today, so it appears it's failing to generate the diagnostics file, not just failing to let me download it via the GUI.

image.thumb.png.503547ec18995cc99701bebf9b16d44e.png
 

The other oddity I noticed is that Unassigned Devices fails to finish loading its data on the Main page.
image.thumb.png.737a22b7e0f862536cfc0340b8c70d0f.png

If anyone has some advice here it would be greatly appreciated. I'm going to try a clean reboot and grab the diagnostics file again shortly.

2 hours ago, snowmirage said:

Not sure what else to check yet so I'm going to try a memory test.

That's a useful thing to do. Let it run for 24 hours or more.

 

You have a complex syslinux config with ACS override and unsafe interrupts enabled, both of which are best avoided unless absolutely necessary. Have you tried booting into a simpler mode for a while? You could try the GUI mode if you haven't edited it to be the same, or Safe mode. The aim would be to achieve stability of the basic NAS functions before enabling the more complex features, such as VMs.

 

Your OP suggests that all was well when you ran an earlier version of Unraid. Have you checked for a newer BIOS as that might provide better compatibility with the newer kernel?


That's a great idea (avoiding the ACS override and unsafe interrupts configs), thank you! I'll give that a shot after the memory test.

I'm attempting to pass through GPUs and a USB card to VMs, and that's why those changes were added.

Regarding BIOS updates, I'm a bit out of luck unfortunately: it's an old (though amazing) motherboard, an EVGA SR-2. These issues are happening at stock speeds and settings in the BIOS.

It's been years since I've had to run memtest. I noticed that when I started memtest via the boot menu going into Unraid (where you can select GUI mode, safe mode, etc.), there was a brief flash of an option to force-enable SMP mode?

I imagine that if I have memory issues, the test should still report errors even if I did not enable SMP mode, correct? I tried searching around for an answer, and most of what I could find just discussed the differences between "memtest" and "memtest86+" and said that SMP mode for the latter had some bugs years ago.

 


I was able to run memtest for over 24 hours; it completed 4 passes with no errors.

After rebooting, under Settings > VM Manager I changed "VFIO allow unsafe interrupts" from "Yes" to "No" and "PCIe ACS override" from "Downstream" to "Disabled".
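
To double-check after a reboot that both options are really gone, one could look at the kernel command line. A minimal Python sketch follows. It assumes these two settings show up as the boot parameters `pcie_acs_override=...` and `vfio_iommu_type1.allow_unsafe_interrupts=1` (that is my understanding of how they appear in the syslinux append line, so treat it as an assumption), and that Python 3 is available wherever you run it. The helper names are hypothetical.

```python
# Boot-parameter prefixes assumed to correspond to the two VM Manager
# settings discussed above. This mapping is an assumption, not confirmed here.
RISKY_PREFIXES = (
    "pcie_acs_override=",
    "vfio_iommu_type1.allow_unsafe_interrupts=1",
)


def risky_params(cmdline):
    """Return the tokens of a kernel command line that match a risky prefix."""
    return [token for token in cmdline.split() if token.startswith(RISKY_PREFIXES)]


def read_cmdline(path="/proc/cmdline"):
    """Read the running kernel's command line (standard location on Linux)."""
    with open(path, encoding="utf-8") as handle:
        return handle.read()
```

Calling `risky_params(read_cmdline())` on the server should return an empty list once both settings are disabled. The same function can be pointed at a copied append line from the syslinux config instead.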

I rebooted again, then let the server run with the array started since this afternoon. Coming home at ~11pm, I found the server had rebooted at some point about 4 hours earlier.

In the course of all those reboots, I think the troubleshooting mode in Fix Common Problems may have been turned off, as the last logs I see in the previous file I pulled from the thumb drive are from the time frame of my first post (1-2 days ago).

I've attached the latest diagnostic file.

I'm going to re-enable troubleshooting mode in Fix Common Problems and hopefully catch something interesting in those logs.

Any other ideas / advice on what might be going on here?

phoenix-diagnostics-20181120-2313.zip

11 hours ago, snowmirage said:

Any other ideas / advice on what might be going on here?

phoenix-diagnostics-20181120-2313.zip

Hardware errors during CPU start-up?

Nov 20 18:37:11 phoenix kernel: smp: Bringing up secondary CPUs ...
Nov 20 18:37:11 phoenix kernel: x86: Booting SMP configuration:
Nov 20 18:37:11 phoenix kernel: .... node  #0, CPUs:        #1  #2
Nov 20 18:37:11 phoenix kernel: mce: [Hardware Error]: Machine check events logged
Nov 20 18:37:11 phoenix kernel:  #3
Nov 20 18:37:11 phoenix kernel: mce: [Hardware Error]: CPU 2: Machine Check: 0 Bank 5: be00000000800400
Nov 20 18:37:11 phoenix kernel: mce: [Hardware Error]: TSC 0 ADDR 3fff8162a61c MISC 7fff 
Nov 20 18:37:11 phoenix kernel: mce: [Hardware Error]: PROCESSOR 0:206c2 TIME 1542756960 SOCKET 0 APIC 4 microcode 1f
Nov 20 18:37:11 phoenix kernel:  #4  #5
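
The hex value after "Bank 5" is the machine-check status register, and its top bits can be decoded. Here is a minimal Python sketch, assuming the standard Intel IA32_MCi_STATUS layout (flag bits 63 down to 57, MCA error code in bits 15:0, model-specific code in bits 31:16). The function names are hypothetical, and the regex only targets the line format shown above.

```python
import re

# Flag bits of the machine-check status register (Intel IA32_MCi_STATUS layout).
STATUS_FLAGS = {
    63: "VAL",    # register holds a valid error
    62: "OVER",   # an earlier error was overwritten
    61: "UC",     # uncorrected error
    60: "EN",     # error reporting was enabled
    59: "MISCV",  # MISC register contents are valid
    58: "ADDRV",  # ADDR register contents are valid
    57: "PCC",    # processor context corrupt
}

# Matches lines like "... CPU 2: Machine Check: 0 Bank 5: be00000000800400".
MCE_LINE = re.compile(r"CPU (\d+): Machine Check: \d+ Bank (\d+): ([0-9a-fA-F]{16})")


def decode_status(status):
    """Split a 64-bit status value into flag names and error-code fields."""
    flags = [
        name
        for bit, name in sorted(STATUS_FLAGS.items(), reverse=True)
        if (status >> bit) & 1
    ]
    return {
        "flags": flags,
        "mca_error_code": status & 0xFFFF,
        "model_specific_code": (status >> 16) & 0xFFFF,
    }


def parse_mce_line(line):
    """Return decoded fields for an mce log line, or None if it doesn't match."""
    match = MCE_LINE.search(line)
    if match is None:
        return None
    cpu, bank, raw = match.groups()
    return {"cpu": int(cpu), "bank": int(bank), **decode_status(int(raw, 16))}
```

Fed the "CPU 2 ... Bank 5" line above, this gives the flags VAL, UC, EN, MISCV, ADDRV and PCC with an MCA error code of 0x0400, i.e. an uncorrected error that left the processor context corrupt. I believe the Intel manual lists 0x0400 as an internal timer error, but that is worth confirming against the manual. Also note that "CPU 2" here is the kernel's logical CPU number, and the following line reports SOCKET 0.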

 


I tried using the jumpers on my EVGA SR-2 motherboard to disable each CPU, but when either one is disabled Unraid won't boot.

Before going through the headache of troubleshooting further, I ordered another X5680 for ~$47. When that gets here I'll try swapping the 2nd CPU out and see if it still crashes.

Then I'll swap the other CPU.

Hard to know if this CPU is good or not as the ones I have seemed to be good for months.

To replace the motherboard is going to run between 500 and 1500 which is crazy to spend on such old hardware.

Hopefully a CPU swap fixes whatever my issue is.

  • 3 weeks later...

Reporting back and thankfully the problem appears to be solved so far with a CPU swap.  One of the errors I saw (posted above) seemed to indicate an issue with CPU #2.

It's possible all it needed was to be reseated, but with this being a fully custom water-cooled system I didn't want to have to open everything up twice, so for $47 with express shipping a new X5680 went in.

That stayed up and stable for 5 days. After that I reconnected two GTX 980s I had unplugged while troubleshooting, and it's now been up for another 5 days.

No indications at all of the previous problems so far!

My next step is to re-enable VFIO allow unsafe interrupts and PCIe ACS override, then make sure it's still stable.

But so far things are looking great. Thanks for the assistance!
