Random? Crashes Kernel Panics

snowmirage · November 19, 2018

Server's been up and running for several months now, but in the last week or so I started seeing issues I've never come across before.

I'm running Unraid 6.6.5

I struggled to even see the logs of the crashes until recently, when I read a post mentioning you can turn on troubleshooting mode in the Fix Common Problems plugin, which then writes the logs to the flash drive. After enabling that feature last night, I let the server run all night without even starting the array. The only task I started was the preclear scripts on 4 new 6TB drives I was getting ready to install.

Here are some screenshots of the crashes prior to last night. I'd find the host unresponsive: no keyboard input worked and I couldn't access the web GUI, forcing me to do a hard reset.

image.png.ecc6f72d70b5abad23fdb23997d36553.png

image.png.1a364fcd42dfad3ebfe263ec3e33de6a.png

image.png.8bc53356b4c96de3d34ecec1b99aecf8.png

This morning I again found a similar error, but this time the system was responsive to both local keyboard input and the web GUI.

image.png.8a286bb8a74dbf877259fd9c66cd82fe.png

I was able to scp down the syslog file stored on the flash drive by the Fix Common Problems plugin:
FCPsyslog_tail.txt

I tried to grab the full diagnostic file, but after 40-odd minutes the page still hasn't generated it.

When I scp'd the syslog file above, I saw a number of diag .zip files in the same directory but nothing for today, so it appears it's failing to generate the diag file, not just failing to let me download it via the GUI.

The other oddity I noticed is that Unassigned Devices fails to finish loading its data on the Main page.

If anyone has some advice here it would be greatly appreciated. I'm going to try a clean reboot and grab the diag file again shortly.

snowmirage · November 19, 2018

After a reboot the full diagnostic file generated just as it should.

phoenix-diagnostics-20181119-1130.zip

Not sure what else to check yet so I'm going to try a memory test.

John_M · November 19, 2018

2 hours ago, snowmirage said:

Not sure what else to check yet so I'm going to try a memory test.

That's a useful thing to do. Let it run for 24 hours or more.

You have a complex syslinux config with ACS override and unsafe interrupts enabled, both of which are best avoided unless absolutely necessary. Have you tried booting into a simpler mode for a while? You could try the GUI mode if you haven't edited it to be the same, or Safe mode. The aim would be to achieve stability of the basic NAS functions before enabling the more complex features, such as VMs.

Your OP suggests that all was well when you ran an earlier version of Unraid. Have you checked for a newer BIOS as that might provide better compatibility with the newer kernel?

snowmirage · November 19, 2018

That's a great idea, thank you! (Avoiding the ACS override and unsafe interrupts configs.) I'll give that a shot after the memory test.

I'm attempting to pass through GPUs and a USB card to VMs, and that's why those changes were added.

Regarding BIOS updates, I'm a bit out of luck unfortunately. It's an old (though amazing) motherboard, an EVGA SR-2. These issues are happening at stock speeds and settings in the BIOS.

It's been years since I've had to run memtest. I noticed when I started memtest via the boot menu going into Unraid (where you can select GUI mode, safe mode, etc.) there was a brief flash with an option to force enable SMP mode?

I imagine if I have memory issues the test should still report errors even if I did not enable SMP mode, correct? I tried searching around for an answer to that, and most of what I could find just discussed the differences between "memtest" and "memtest86+" and said that SMP mode for the latter had some bugs years ago.

John_M · November 19, 2018

You might find the downloadable (but still free!) version 7.5 of MemTest86 more useful than the built-in one, assuming your motherboard supports UEFI booting. Use it to make a separate bootable USB stick.

https://www.memtest86.com/download.htm

snowmirage · November 21, 2018

I was able to run the memtest for over 24 hrs; it completed 4 passes with no errors.

After rebooting under settings > VM manager

I changed VFIO allow unsafe interrupts from "yes" to "no"
and
changed PCIe ACS override from "Downstream" to "Disabled"
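To confirm after a reboot that both settings are really gone from the running kernel, something like the following could be used. This is only a sketch: it assumes the two VM Manager options end up on the kernel command line as `pcie_acs_override=...` and `vfio_iommu_type1.allow_unsafe_interrupts=...`, and the example command line is made up for illustration, not taken from the diagnostics.

```python
# Parameter names are assumptions about what the two VM Manager settings
# add to the kernel command line (the syslinux "append" line).
RISKY_PARAMS = ("pcie_acs_override", "vfio_iommu_type1.allow_unsafe_interrupts")


def find_risky_params(cmdline):
    """Return {name: value} for each risky parameter present in cmdline."""
    found = {}
    for token in cmdline.split():
        name, _, value = token.partition("=")
        if name in RISKY_PARAMS:
            found[name] = value
    return found


def read_cmdline(path="/proc/cmdline"):
    """Read the command line the running kernel was booted with."""
    with open(path) as f:
        return f.read()


# Hypothetical command line for illustration only.
example = ("BOOT_IMAGE=/bzimage pcie_acs_override=downstream "
           "vfio_iommu_type1.allow_unsafe_interrupts=1 initrd=/bzroot")
print(find_risky_params(example))
```

On the server itself, `find_risky_params(read_cmdline())` returning an empty dict would suggest the simpler boot configuration is the one actually in effect.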

Rebooted again, then let the server run with the array started since this afternoon. Coming home at ~11pm, I found the server had rebooted at some point about 4 hrs earlier.

In the course of all those reboots I think the troubleshooting mode in Fix Common Problems may have been turned off, as the last logs I see in the previous file I pulled from the thumb drive are from the time frame of my first post (1-2 days ago).

I've attached the latest diagnostic file.

I'm going to re-enable troubleshooting mode in Fix Common Problems and hopefully catch something interesting in those logs.

Any other ideas / advice on what might be going on here?

phoenix-diagnostics-20181120-2313.zip

snowmirage · November 21, 2018

Crashed again last night around 2am-ish.

In the syslog in this diagnostic file I see the system booting back up at ~2:11am on the 21st:
phoenix-diagnostics-20181121-1034.zip

and here I seem to see about the same thing.
FCPsyslog_tail.txt
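Rather than reading a persisted syslog by eye, the last few lines written before each reboot can be pulled out with a short script. This is a rough sketch under two assumptions: that the saved log spans more than one boot, and that each boot begins with a kernel line containing "Linux version". The sample lines are placeholders, not real log output from this server.

```python
def last_lines_before_boots(lines, context=3, marker="kernel: Linux version"):
    """For each boot marker (except on the first line), return the
    `context` lines that immediately precede it."""
    results = []
    for i, line in enumerate(lines):
        if marker in line and i > 0:
            results.append(lines[max(0, i - context):i])
    return results


# Placeholder lines for illustration only.
sample = [
    "Nov 21 01:58:02 phoenix kernel: placeholder message A",
    "Nov 21 01:59:40 phoenix kernel: placeholder message B",
    "Nov 21 02:11:05 phoenix kernel: Linux version (placeholder)",
    "Nov 21 02:11:05 phoenix kernel: placeholder message C",
]
tails = last_lines_before_boots(sample, context=2)
for tail in tails:
    print("\n".join(tail))
```

If the lines just before a boot marker show nothing unusual, as seems to be the case here, that points towards a hard hang or reset that the kernel never got the chance to log.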

I'm not sure what else to try here other than pointing a camera at the screen for 8 hrs and hoping I catch it crashing.

John_M · November 21, 2018

11 hours ago, snowmirage said:

Any other ideas / advice on what might be going on here?

phoenix-diagnostics-20181120-2313.zip

Hardware errors during CPU start-up?

Nov 20 18:37:11 phoenix kernel: smp: Bringing up secondary CPUs ...
Nov 20 18:37:11 phoenix kernel: x86: Booting SMP configuration:
Nov 20 18:37:11 phoenix kernel: .... node  #0, CPUs:        #1  #2
Nov 20 18:37:11 phoenix kernel: mce: [Hardware Error]: Machine check events logged
Nov 20 18:37:11 phoenix kernel:  #3
Nov 20 18:37:11 phoenix kernel: mce: [Hardware Error]: CPU 2: Machine Check: 0 Bank 5: be00000000800400
Nov 20 18:37:11 phoenix kernel: mce: [Hardware Error]: TSC 0 ADDR 3fff8162a61c MISC 7fff 
Nov 20 18:37:11 phoenix kernel: mce: [Hardware Error]: PROCESSOR 0:206c2 TIME 1542756960 SOCKET 0 APIC 4 microcode 1f
Nov 20 18:37:11 phoenix kernel:  #4  #5
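For anyone wanting to read more out of a line like that, the 16-digit hex value is the machine-check bank's status register. Below is a minimal sketch that extracts it from a syslog line and decodes the architectural flag bits. The bit positions follow the IA32_MCi_STATUS layout in the Intel SDM; the function and variable names are just illustrative, and it makes no attempt to interpret the model-specific fields.

```python
import re

# Architectural flag bits of IA32_MCi_STATUS (Intel SDM layout).
STATUS_FLAGS = {
    63: "VAL",    # register holds a valid error
    62: "OVER",   # an earlier error was overwritten
    61: "UC",     # uncorrected error
    60: "EN",     # error reporting was enabled
    59: "MISCV",  # MISC register is valid
    58: "ADDRV",  # ADDR register is valid
    57: "PCC",    # processor context may be corrupt
}

MCE_LINE = re.compile(
    r"mce: \[Hardware Error\]: CPU (?P<cpu>\d+): Machine Check: \d+ "
    r"Bank (?P<bank>\d+): (?P<status>[0-9a-fA-F]{16})"
)


def decode_mce(line):
    """Return a dict describing one 'Machine Check' syslog line, or None."""
    m = MCE_LINE.search(line)
    if m is None:
        return None
    status = int(m.group("status"), 16)
    return {
        "cpu": int(m.group("cpu")),
        "bank": int(m.group("bank")),
        "flags": [name for bit, name in STATUS_FLAGS.items() if (status >> bit) & 1],
        "mca_error_code": status & 0xFFFF,  # low 16 bits
    }


sample = ("Nov 20 18:37:11 phoenix kernel: mce: [Hardware Error]: "
          "CPU 2: Machine Check: 0 Bank 5: be00000000800400")
result = decode_mce(sample)
print(result)
```

For the line above this reports the UC and PCC flags set, i.e. an uncorrected error, with an MCA error code of 0x0400. Note that "CPU 2" here is the kernel's logical CPU number rather than a physical socket; the PROCESSOR line of the same event reports SOCKET 0.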

snowmirage · November 21, 2018

Oh... wow... I did not see that.

Maybe it's really a bad CPU. They are quite old. I'll remove CPU 2 and see if it's stable on just CPU 1.

Thank you

snowmirage · November 27, 2018

Tried using the jumpers on my EVGA SR-2 motherboard to disable each CPU, but when either is disabled Unraid won't boot.

Before going through the headache of troubleshooting further, I ordered another X5680 for ~$47. When that gets here I'll try swapping the 2nd CPU out and see if it still crashes.

Then swap the other CPU.

Hard to know if this CPU is good or not, as the ones I have seemed to be good for months.

Replacing the motherboard is going to run between 500 and 1500, which is crazy to spend on such old hardware.

Hopefully a CPU swap fixes whatever my issue is.

John_M · November 27, 2018

Reseating the CPUs in their sockets, replacing the thermal compound and cleaning the dust out of the heatsinks - all part of the swap - are good things. I'd reseat the CPU power cables too.

snowmirage · December 17, 2018

Reporting back: thankfully the problem appears to be solved so far with a CPU swap. One of the errors I saw (posted above) seemed to indicate an issue with CPU #2.

It's possible all it needed was to be reseated, but with this being a fully custom water-cooled system I didn't want to have to open everything up twice, so for $47 with express shipping a new X5680 went in.

That stayed up and stable for 5 days. After that I reconnected two GTX 980s I had unplugged while troubleshooting, and it's now been up for another 5 days.

No indications at all of the previous problems so far!

My next step is to re-enable VFIO allow unsafe interrupts and PCIe ACS override, then make sure it's still stable.

But so far things are looking great. Thanks for the assistance!
