(SOLVED) unRaid 6.4 Kernel Panic

pinion · January 15, 2018

Has anyone else experienced a kernel panic after upgrading? I upgraded through the web GUI and it booted once, but then it hung and I couldn't do anything. On every subsequent boot I got a kernel panic. I tried wiping my USB drive and installing fresh with the Mac tool that formats the USB drive, but I got kernel panics there as well.

I know this isn't enough info to troubleshoot, but I thought I'd post to at least say I'm having problems. For anyone else who wants to roll back, all I did was delete the EFI directory and copy the contents of the "previous" directory on top of the files in the top-level directory (/). I think when you upgrade, the EFI directory is disabled anyway by adding a dash to the end ("EFI-"), but mine wasn't.
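The copy step of that rollback can be sketched in Python. This is only an illustration under assumptions: `roll_back` is a made-up helper name, the flash drive is assumed to be mounted at whatever path you pass in, and only regular files directly inside "previous" are copied. Back up the flash drive before trying anything like this.

```python
import shutil
from pathlib import Path


def roll_back(flash_root):
    """Copy every file in <flash_root>/previous over the top-level files.

    Hypothetical helper: it only mirrors the manual copy described above
    and does not touch any other directory on the flash drive.
    """
    root = Path(flash_root)
    previous = root / "previous"
    if not previous.is_dir():
        raise FileNotFoundError(f"no 'previous' directory under {root}")

    restored = []
    for item in sorted(previous.iterdir()):
        if item.is_file():
            # copy2 overwrites the newer file and keeps timestamps
            shutil.copy2(item, root / item.name)
            restored.append(item.name)
    return restored
```

Calling `roll_back("/path/to/flash")` (the path is a placeholder) returns the names of the files it overwrote, so you can check the list before rebooting.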

I'm able to boot now with 6.3.5. I'll try to come back and update this with any other info although the only thing I can capture is a picture of the screen when it panics. Anyone have an idea why 6.3.5 works fine but I kernel panic on 6.4? For now I'll be spending the whole day doing a parity check.

pinion · January 15, 2018

I decided to try upgrading again. I made sure all my plugins and dockers were up to date, then stopped the array, upgraded, and rebooted once the system told me I needed to reboot manually. It came back up and everything seemed fine for a bit, then I noticed "[Hardware Error]: Corrected error" a few times. I ran diagnostics right after I saw that and I've attached them. I came back upstairs and looked around in the GUI for a bit. I clicked on check for updates for Dockers and waited a minute while it was working. Then I tried to ping the server and it was a no go. Couldn't ssh either, obviously. I went downstairs and was greeted by a flashing cursor and nothing else. I waited 10-15 minutes to see if it would work itself out before forcing a restart.

When I rebooted I made sure to select safe mode but after loading bzroot I immediately got a kernel panic. I've attached "screenshots" as well as the diagnostics I took before the system crashed. Anyone see anything?

unraid-diagnostics-20180115-1421.zip

JorgeB · January 15, 2018

The log shows the server trying to suspend and crashing. There appears to be an issue with the S3 plugin and v6.4, so you should uninstall it for now.

The hardware errors look like a corrected RAM error. You'll need to identify the DIMM with the problem and replace it, but since the error was corrected it wouldn't cause a crash. An uncorrected error would halt the system and would not be visible in the log.

You're also using a somewhat old board, it just might be incompatible with the new kernel used on v6.4.

pwm · January 15, 2018

The question here is if that corrected error really is for RAM or if it is for the processor.

Some processors have ECC for cache and internal data buses that can also result in ECC warning messages.

I have had a machine where the CPU itself produced a similar message maybe once/month.

If the machine runs stable with 6.3.5 and emits lots of correction printouts with 6.4, then I wonder if it possibly is intentionally or unintentionally overclocked, and if the newer kernel happens to tweak some setting a bit harder and so remove the tiny bit of safety margin the 6.3.5 kernel did survive on. The newer kernel has lots of changed code for power modes etc.

pinion · January 16, 2018

Thanks! I uninstalled the s3 plugin and I'm no longer getting kernel panics. I wasn't using it anyway and I have a feeling my setup doesn't completely work with sleep anyway because I couldn't ever get it to work reliably. I'll mark this as solved since the kernel panics are gone but I'm still getting those ECC errors about every 5 minutes and 27.5 seconds. I've attached a new diagnostic if you want to look. Thanks again!

unraid-diagnostics-20180115-1903.zip

Edit: messed up a word

JorgeB · January 16, 2018

7 hours ago, pinion said:

I'm still getting those ECC errors about every 5 minutes and 27.5

Strange pattern if they are RAM errors, still if you have more than one DIMM you can test with just half at a time and see if you keep getting them, if yes it's probably not the RAM.

pwm · January 16, 2018

11 hours ago, johnnie.black said:

Strange pattern if they are RAM errors, still if you have more than one DIMM you can test with just half at a time and see if you keep getting them, if yes it's probably not the RAM.

But are you sure they are RAM errors?

I have only suffered processor ECC errors on a Windows machine, so I'm not sure if I can identify the difference between ECC errors reported for the external ECC RAM or errors from cache or internal data busses.

Any errors in all the changed power-save code in the kernel could potentially result in some part of the processor running at too low a voltage.

5 minutes 27.5 seconds is almost exactly 32768 ticks for a timer running at 100 Hz (5 minutes 27.68 seconds), so it very much sounds like something affected by a timer in the processor or the chipset.
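A quick way to check that arithmetic, as a minimal Python sketch. The 100 Hz rate and the 32768-tick count are the guesses from the paragraph above, not values read from this machine, and the helper names are made up for the example.

```python
# Assumption: a counter that wraps after 2**15 = 32768 ticks, driven at 100 Hz.
TICKS = 2 ** 15
HZ = 100


def period_seconds(ticks, hz):
    """Time in seconds for `ticks` ticks at `hz` ticks per second."""
    return ticks / hz


def as_min_sec(seconds):
    """Split a duration into whole minutes and remaining seconds."""
    minutes, rest = divmod(seconds, 60)
    return int(minutes), round(rest, 2)


period = period_seconds(TICKS, HZ)
observed = 5 * 60 + 27.5  # interval reported in the thread, in seconds

# prints: 327.68 (5, 27.68) 0.18
print(period, as_min_sec(period), round(period - observed, 2))
```

The 0.18 second gap is well within what you would expect from reading the interval off log timestamps by eye.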

JorgeB · January 16, 2018

56 minutes ago, pwm said:

But are you sure they are RAM errors?

No. In fact, as I mentioned, because of the repeating pattern they are probably not RAM errors, but RAM is the easiest to rule out.

pinion · January 16, 2018

I have dual procs installed and I have a spare proc lying around too, so I'll do some testing and update this in the coming weeks.
