Unraid

Random? Crashes Kernel Panics

The server's been up and running for several months now, but in the last week or so I've started seeing issues I've never come across before.

I'm running Unraid 6.6.5

I struggled to even see logs of the crashes until recently, when I read a post mentioning that you can turn on troubleshooting mode in the Fix Common Problems plugin, which then writes the logs to the flash drive. After enabling that feature last night, I let the server run all night without even starting the array. The only task I started was the preclear script on 4 new 6TB drives I was getting ready to install.

Here are some screenshots of the crashes prior to last night. I'd find the host unresponsive: no keyboard input worked and I couldn't access the web GUI, forcing me to do a hard reset.

image.png.ecc6f72d70b5abad23fdb23997d36553.png

image.png.1a364fcd42dfad3ebfe263ec3e33de6a.png

image.png.8bc53356b4c96de3d34ecec1b99aecf8.png

 

This morning I again found a similar error, but this time the system was responsive to both local keyboard input and the web GUI.

image.png.8a286bb8a74dbf877259fd9c66cd82fe.png

 

I was able to scp down the syslog file that the Fix Common Problems plugin stored on the flash drive.
FCPsyslog_tail.txt

I tried to grab the full diagnostic file, but after 40-odd minutes the page still hasn't generated it.

When I scp'd the syslog file above, I saw a number of diagnostic .zip files in the same directory but nothing for today, so it appears it's failing to generate the diagnostic file, not just failing to let me download it via the GUI.

image.thumb.png.503547ec18995cc99701bebf9b16d44e.png
 

The other oddity I noticed is that the Unassigned Devices plugin fails to finish loading its data on the Main page.
image.thumb.png.737a22b7e0f862536cfc0340b8c70d0f.png

If anyone has some advice here it would be greatly appreciated. I'm going to try a clean reboot and grab the diagnostic file again shortly.

  • Author

After a reboot the full diagnostic file generated just as it should.

phoenix-diagnostics-20181119-1130.zip

Not sure what else to check yet so I'm going to try a memory test.

2 hours ago, snowmirage said:

Not sure what else to check yet so I'm going to try a memory test.

That's a useful thing to do. Let it run for 24 hours or more.

 

You have a complex syslinux config with ACS override and unsafe interrupts enabled, both of which are best avoided unless absolutely necessary. Have you tried booting into a simpler mode for a while? You could try the GUI mode if you haven't edited it to be the same, or Safe mode. The aim would be to achieve stability of the basic NAS functions before enabling the more complex features, such as VMs.
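To see which boot entries carry those options, the syslinux config can be checked with a short script. This is only a sketch: it assumes the two VM Manager settings show up as the kernel parameters `pcie_acs_override` and `vfio_iommu_type1.allow_unsafe_interrupts` on the `append` lines, and the sample config text is made up for illustration rather than taken from the posted diagnostics.

```python
# Kernel parameters assumed to correspond to the two VM Manager settings.
# Check these names against your own syslinux.cfg before relying on them.
RISKY_PARAMS = ("pcie_acs_override", "vfio_iommu_type1.allow_unsafe_interrupts")


def find_risky_params(cfg_text):
    """Return (label, token) pairs for risky parameters found on append lines."""
    hits = []
    label = None
    for line in cfg_text.splitlines():
        stripped = line.strip()
        if stripped.lower().startswith("label "):
            # Remember which boot entry the following lines belong to.
            label = stripped[6:].strip()
        elif stripped.lower().startswith("append "):
            for token in stripped.split()[1:]:
                name = token.split("=", 1)[0]
                if name in RISKY_PARAMS:
                    hits.append((label, token))
    return hits


if __name__ == "__main__":
    # Hypothetical config text, for illustration only.
    sample = (
        "label Unraid OS\n"
        "  kernel /bzimage\n"
        "  append pcie_acs_override=downstream "
        "vfio_iommu_type1.allow_unsafe_interrupts=1 initrd=/bzroot\n"
        "label Unraid OS Safe Mode\n"
        "  kernel /bzimage\n"
        "  append initrd=/bzroot unraidsafemode\n"
    )
    for label, token in find_risky_params(sample):
        print(f"{label}: {token}")
```

A boot entry with no hits is a reasonable candidate for the "simpler mode" test run.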

 

Your OP suggests that all was well when you ran an earlier version of Unraid. Have you checked for a newer BIOS as that might provide better compatibility with the newer kernel?

  • Author

That's a great idea, thank you! (Avoiding the ACS override and unsafe interrupts configs.) I'll give that a shot after the memory test.

I'm attempting to pass through GPUs and a USB card to VMs, and that's why those changes were added.

Regarding BIOS updates, I'm a bit out of luck unfortunately. It's an old (though amazing) motherboard, an EVGA SR-2. These issues are happening at stock speeds and settings in the BIOS.

It's been years since I've had to run memtest. I noticed that when I started memtest from the Unraid boot menu (where you can select GUI mode, safe mode, etc.) an option to force-enable SMP mode flashed up briefly.

I imagine that if I have memory issues the test should still report errors even if I did not enable SMP mode, correct? I tried searching around for an answer to that, and most of what I could find just discussed the differences between "memtest" and "memtest86+" and that SMP mode for the latter had some bugs years ago.

 

You might find the downloadable (but still free!) version 7.5 of MemTest86 more useful than the built-in one, assuming your motherboard supports UEFI booting. Use it to make a separate bootable USB stick.

 https://www.memtest86.com/download.htm

  • Author

I was able to run memtest for over 24 hours; it completed 4 passes with no errors.

After rebooting, under Settings > VM Manager

I changed VFIO allow unsafe interrupts from "Yes" to "No"
and
changed PCIe ACS override from "Downstream" to "Disabled"

I rebooted again, then let the server run with the array started since this afternoon. Coming home at ~11pm, I found the server had rebooted at some point about 4 hours earlier.

In the course of all those reboots I think the troubleshooting mode in Fix Common Problems may have been turned off, as the last logs I see in the previous file I pulled from the thumb drive are from the time frame of my first post (1-2 days ago).

I've attached the latest diagnostic file.

I'm going to re-enable troubleshooting mode in Fix Common Problems and hopefully catch something interesting in those logs.

Any other ideas / advice on what might be going on here?

phoenix-diagnostics-20181120-2313.zip

  • Author

Crashed again last night around 2am.

In the syslog in this diagnostic file I see the system booting back up at ~2:11am on the 21st
phoenix-diagnostics-20181121-1034.zip

and here I seem to see about the same thing.
FCPsyslog_tail.txt

I'm not sure what else to try here other than pointing a camera at the screen for 8 hours and hoping I catch it crashing.

11 hours ago, snowmirage said:

Any other ideas / advice on what might be going on here?

phoenix-diagnostics-20181120-2313.zip

Hardware errors during CPU start-up?

Nov 20 18:37:11 phoenix kernel: smp: Bringing up secondary CPUs ...
Nov 20 18:37:11 phoenix kernel: x86: Booting SMP configuration:
Nov 20 18:37:11 phoenix kernel: .... node  #0, CPUs:        #1  #2
Nov 20 18:37:11 phoenix kernel: mce: [Hardware Error]: Machine check events logged
Nov 20 18:37:11 phoenix kernel:  #3
Nov 20 18:37:11 phoenix kernel: mce: [Hardware Error]: CPU 2: Machine Check: 0 Bank 5: be00000000800400
Nov 20 18:37:11 phoenix kernel: mce: [Hardware Error]: TSC 0 ADDR 3fff8162a61c MISC 7fff 
Nov 20 18:37:11 phoenix kernel: mce: [Hardware Error]: PROCESSOR 0:206c2 TIME 1542756960 SOCKET 0 APIC 4 microcode 1f
Nov 20 18:37:11 phoenix kernel:  #4  #5
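The status value on the "Bank 5" line can be unpacked into its flag bits. The sketch below assumes the standard Intel layout of the IA32_MCi_STATUS register (flags in bits 63 to 57, the MCA error code in bits 15:0, a model-specific code in bits 31:16); the function name is made up for illustration, and the result is a starting point for reading the error, not a diagnosis.

```python
# Flag bits of IA32_MCi_STATUS, assuming the standard Intel layout.
FLAG_BITS = {
    63: "VAL",    # register holds a valid error
    62: "OVER",   # an earlier error was overwritten
    61: "UC",     # uncorrected error
    60: "EN",     # error reporting enabled
    59: "MISCV",  # MISC register contents valid
    58: "ADDRV",  # ADDR register contents valid
    57: "PCC",    # processor context corrupt
}


def decode_mci_status(status):
    """Split a 64-bit machine check status value into flags and error codes."""
    flags = {name: bool((status >> bit) & 1) for bit, name in FLAG_BITS.items()}
    return {
        "flags": flags,
        "mca_error_code": status & 0xFFFF,
        "model_specific_code": (status >> 16) & 0xFFFF,
    }


if __name__ == "__main__":
    # Value copied from the "Bank 5" log line above.
    decoded = decode_mci_status(0xBE00000000800400)
    set_flags = [name for name, on in decoded["flags"].items() if on]
    print("flags set:", ", ".join(set_flags))
    print("MCA error code: 0x%04x" % decoded["mca_error_code"])
    print("model-specific code: 0x%04x" % decoded["model_specific_code"])
```

For this value the UC and PCC flags are set, which points to an uncorrected error rather than a corrected one, and the MCA error code field is 0x0400. Intel's simple error code table appears to list that value as an internal timer error, but treat that reading as an assumption to verify against the manual for this CPU family. Note also that the same log reports SOCKET 0 and APIC 4, so "CPU 2" here is most likely the kernel's logical CPU number rather than the second physical socket.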

 

  • Author

O... wow... I did not see that.

Maybe it's really a bad CPU. They are quite old. I'll remove CPU 2 and see if it's stable on just CPU 1.

Thank you 
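Lines like these are easy to miss in a long syslog. As a rough sketch, a few lines of Python can pull out the boot markers and any hardware error reports from a saved log. The two match strings are taken from the log excerpt quoted in this thread; whether other kernels or other faults use the same wording is an assumption.

```python
# Substrings taken from the kernel log lines quoted in this thread.
BOOT_MARKER = "smp: Bringing up secondary CPUs"
HW_ERROR = "[Hardware Error]"


def scan_syslog(lines):
    """Split syslog lines into boot markers and hardware error reports."""
    boots, errors = [], []
    for line in lines:
        line = line.rstrip("\n")
        if BOOT_MARKER in line:
            boots.append(line)
        elif HW_ERROR in line:
            errors.append(line)
    return boots, errors


if __name__ == "__main__":
    # Sample lines copied from the excerpt posted in this thread.
    sample = [
        "Nov 20 18:37:11 phoenix kernel: smp: Bringing up secondary CPUs ...\n",
        "Nov 20 18:37:11 phoenix kernel: x86: Booting SMP configuration:\n",
        "Nov 20 18:37:11 phoenix kernel: mce: [Hardware Error]: Machine check events logged\n",
        "Nov 20 18:37:11 phoenix kernel: mce: [Hardware Error]: CPU 2: Machine Check: 0 Bank 5: be00000000800400\n",
    ]
    boots, errors = scan_syslog(sample)
    print(f"{len(boots)} boot(s), {len(errors)} hardware error line(s)")
    for line in errors:
        print(line)
```

To run it on a real file, pass the open file object instead of the sample list, for example `scan_syslog(open("syslog.txt", errors="replace"))`, where the file name is whatever the extracted syslog is called.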

  • Author

I tried using the jumpers on my EVGA SR-2 motherboard to disable each CPU, but when either is disabled Unraid won't boot.

Before going through the headache of troubleshooting further, I ordered another X5680 for ~$47. When that gets here I'll try swapping the 2nd CPU out and see if it still crashes.

Then swap the other CPU.

Hard to know if this CPU is good or not as the ones I have seemed to be good for months.

Replacing the motherboard is going to run between $500 and $1500, which is crazy to spend on such old hardware.

Hopefully a CPU swap fixes whatever my issue is.

Reseating the CPUs in their sockets, replacing the thermal compound and cleaning the dust out of the heatsinks - all part of the swap - are good things. I'd reseat the CPU power cables too.

  • 3 weeks later...
  • Author

Reporting back: thankfully the problem appears to be solved so far with a CPU swap. One of the errors I saw (posted above) seemed to indicate an issue with CPU #2.

It's possible all it needed was to be reseated, but this being a fully custom water-cooled system I didn't want to have to open everything up twice, so for $47 with express shipping a new X5680 went in.

That stayed up and stable for 5 days. After that I reconnected the two GTX 980s I had unplugged while troubleshooting, and it's now been up for another 5 days.

No indications at all of the previous problems so far!

My next step is to re-enable VFIO allow unsafe interrupts and PCIe ACS override, then make sure it's still stable.

But so far things are looking great. Thanks for the assistance!

Archived

This topic is now archived and is closed to further replies.
