UnRAID 6.12.2 server crashed and restarted

July 10, 2023

I was working on configuring a new VM with the SPICE VM Console Protocol, and my VM stopped responding. Unfortunately, it was because my server had crashed (dirty) and restarted. I'm seeing Fix Common Problems telling me my server has detected hardware errors, and it suggests I install mcelog via the NerdPack plugin, but I don't think that plugin is available anymore.

Not sure what hardware error occurred, but would like to learn how to find out.

Edited July 14, 2023 by Alyred
Remove irrelevant diagnostics.

July 11, 2023

Community Expert

Start here:

https://forums.unraid.net/topic/46802-faq-for-unraid-v6/?do=findComment&comment=819173

July 11, 2023

Author

Thanks. Did you see something in particular in my diagnostics that made you believe it was something specific in my settings? The system ran TrueNAS for months before UnRAID without issue, though I did update the BIOS once for the new AGESA version when I installed UnRAID. I have most of the overclocking turned off, though the memory is still running at an overclocked profile, as it was before, and I did tune the Curve Optimizer down slightly.

I'll check for the c-state settings.

July 11, 2023

Author

Got the BIOS option for Power Supply Idle Control set to "typical" and the server just crashed again while idle. Since it went for weeks prior to the change and this time it crashed within an hour, it appears to have made things less stable, or I just got unlucky.

@JorgeB, you replied to someone else's post a couple of years ago that said the go file line to disable C6 should no longer be needed with the Power Supply Idle Control set to Typical. Is that still the case or is there possibly a regression bug in 6.12.x?

I've now set up the local syslog server, but the only option it gives for "local syslog folder" is "<custom>" and it won't accept any input. I've mirrored it to flash for the time being with a 10 MB maximum file size.

July 12, 2023

Community Expert

10 hours ago, Alyred said:

is there possibly a regression bug in 6.12.x?

Not AFAIK.

July 14, 2023

Author

So it just restarted again this morning. I do have the syslog from the syslog server being mirrored to the flash drive. Not sure if that's safe to upload in its raw form like that? Happy to do so if needed.
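On the question of whether a raw syslog is safe to post: one cautious option is to mask obvious identifiers first. The following is only a rough Python sketch, not an Unraid feature. The function names are made up for illustration, it only handles IPv4 addresses, colon-separated MAC addresses and an optional hostname, and it will miss anything else (IPv6 addresses, share names, usernames), so the result should still be read through before uploading.

```python
import re

# Patterns for the identifiers this sketch knows how to mask.
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
MAC_RE = re.compile(r"\b(?:[0-9A-Fa-f]{2}:){5}[0-9A-Fa-f]{2}\b")


def redact_line(line, hostname=None):
    """Return the line with MAC addresses, IPv4 addresses and the hostname masked."""
    line = MAC_RE.sub("<mac>", line)
    line = IPV4_RE.sub("<ip>", line)
    if hostname:
        line = line.replace(hostname, "<host>")
    return line


def redact_file(src, dst, hostname=None):
    """Copy src to dst line by line, masking identifiers on the way."""
    with open(src, encoding="utf-8", errors="replace") as fin, \
            open(dst, "w", encoding="utf-8") as fout:
        for line in fin:
            fout.write(redact_line(line, hostname))


# Example call (paths are placeholders):
# redact_file("syslog.txt", "syslog-redacted.txt", hostname="Moissanite")
```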

This is with the Power Supply Idle Control set to Typical still in the BIOS.

Glancing through I see the following "hardware errors" logged during the bootup I believe:

Jul 14 08:17:26 Moissanite kernel: microcode: CPU0: patch_level=0x0a20120a
Jul 14 08:17:26 Moissanite kernel: microcode: CPU1: patch_level=0x0a20120a
Jul 14 08:17:26 Moissanite kernel: microcode: CPU2: patch_level=0x0a20120a
Jul 14 08:17:26 Moissanite kernel: microcode: CPU3: patch_level=0x0a20120a
Jul 14 08:17:26 Moissanite kernel: microcode: CPU4: patch_level=0x0a20120a
Jul 14 08:17:26 Moissanite kernel: microcode: CPU5: patch_level=0x0a20120a
Jul 14 08:17:26 Moissanite kernel: microcode: CPU6: patch_level=0x0a20120a
Jul 14 08:17:26 Moissanite kernel: microcode: CPU7: patch_level=0x0a20120a
Jul 14 08:17:26 Moissanite kernel: microcode: CPU8: patch_level=0x0a20120a
Jul 14 08:17:26 Moissanite kernel: microcode: CPU9: patch_level=0x0a20120a
Jul 14 08:17:26 Moissanite kernel: microcode: CPU10: patch_level=0x0a20120a
Jul 14 08:17:26 Moissanite kernel: microcode: CPU11: patch_level=0x0a20120a
Jul 14 08:17:26 Moissanite kernel: microcode: CPU12: patch_level=0x0a20120a
Jul 14 08:17:26 Moissanite kernel: microcode: CPU13: patch_level=0x0a20120a
Jul 14 08:17:26 Moissanite kernel: microcode: CPU14: patch_level=0x0a20120a
Jul 14 08:17:26 Moissanite kernel: microcode: CPU15: patch_level=0x0a20120a
Jul 14 08:17:26 Moissanite kernel: microcode: CPU16: patch_level=0x0a20120a
Jul 14 08:17:26 Moissanite kernel: microcode: CPU17: patch_level=0x0a20120a
Jul 14 08:17:26 Moissanite kernel: microcode: CPU18: patch_level=0x0a20120a
Jul 14 08:17:26 Moissanite kernel: microcode: CPU19: patch_level=0x0a20120a
Jul 14 08:17:26 Moissanite kernel: microcode: CPU20: patch_level=0x0a20120a
Jul 14 08:17:26 Moissanite kernel: microcode: CPU21: patch_level=0x0a20120a
Jul 14 08:17:26 Moissanite kernel: microcode: CPU22: patch_level=0x0a20120a
Jul 14 08:17:26 Moissanite kernel: microcode: CPU23: patch_level=0x0a20120a
Jul 14 08:17:26 Moissanite kernel: microcode: CPU24: patch_level=0x0a20120a
Jul 14 08:17:26 Moissanite kernel: microcode: CPU25: patch_level=0x0a20120a
Jul 14 08:17:26 Moissanite kernel: microcode: CPU26: patch_level=0x0a20120a
Jul 14 08:17:26 Moissanite kernel: microcode: CPU27: patch_level=0x0a20120a
Jul 14 08:17:26 Moissanite kernel: microcode: CPU28: patch_level=0x0a20120a
Jul 14 08:17:26 Moissanite kernel: microcode: CPU29: patch_level=0x0a20120a
Jul 14 08:17:26 Moissanite kernel: microcode: CPU30: patch_level=0x0a20120a
Jul 14 08:17:26 Moissanite kernel: microcode: CPU31: patch_level=0x0a20120a
Jul 14 08:17:26 Moissanite kernel: microcode: Microcode Update Driver: v2.2.
Jul 14 08:17:26 Moissanite kernel: IPI shorthand broadcast: enabled
Jul 14 08:17:26 Moissanite kernel: sched_clock: Marking stable (1536275102, 317374787)->(1886336138, -32686249)
Jul 14 08:17:26 Moissanite kernel: mce: [Hardware Error]: Machine check events logged
Jul 14 08:17:26 Moissanite kernel: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 5: baa0000000030150
Jul 14 08:17:26 Moissanite kernel: mce: [Hardware Error]: TSC 0 MISC d012000200000000 SYND 4d000002 IPID 500b000000000 
Jul 14 08:17:26 Moissanite kernel: mce: [Hardware Error]: PROCESSOR 2:a20f12 TIME 1689347821 SOCKET 0 APIC 0 microcode a20120a
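
For readers wondering how to read the mce lines above without mcelog: the 16-digit value after "Bank 5:" is the machine-check status register, and its top bits are standard x86 flags. A minimal Python sketch follows. The function names are made up for illustration, only the seven top flag bits are decoded, and vendor-specific fields such as the error code, MISC and SYND are ignored.

```python
import re
from datetime import datetime, timezone

# Architectural machine-check status flags (bit position, short name).
STATUS_FLAGS = [
    (63, "VAL"),    # register holds a valid error record
    (62, "OVER"),   # an earlier error was overwritten
    (61, "UC"),     # error was not corrected by hardware
    (60, "EN"),     # reporting for this error was enabled
    (59, "MISCV"),  # the MISC register holds valid data
    (58, "ADDRV"),  # the ADDR register holds valid data
    (57, "PCC"),    # processor context may be corrupt
]

MCE_RE = re.compile(r"CPU (\d+): Machine Check: \d+ Bank (\d+): ([0-9a-f]{16})")
TIME_RE = re.compile(r"\bTIME (\d+)\b")


def decode_status(status):
    """Return the names of the top-level flags set in a status word."""
    return [name for bit, name in STATUS_FLAGS if (status >> bit) & 1]


def summarize_mce(lines):
    """Collect one dict per machine-check record found in syslog lines."""
    events = []
    for line in lines:
        if "[Hardware Error]" not in line:
            continue
        m = MCE_RE.search(line)
        if m:
            status = int(m.group(3), 16)
            events.append({
                "cpu": int(m.group(1)),
                "bank": int(m.group(2)),
                "status": status,
                "flags": decode_status(status),
            })
            continue
        t = TIME_RE.search(line)
        if t and events:
            # TIME is seconds since the Unix epoch; shown here in UTC.
            stamp = datetime.fromtimestamp(int(t.group(1)), tz=timezone.utc)
            events[-1]["time_utc"] = stamp.isoformat()
    return events
```

Fed the lines from the log above, this reports CPU 0, bank 5, with VAL, UC, EN, MISCV and PCC set, i.e. a valid, uncorrected error with the processor context marked as possibly corrupt, and converts TIME 1689347821 to 2023-07-14T15:17:01+00:00. It does not say which component failed; that still needs the CPU vendor's bank documentation or a tool such as mcelog or rasdaemon.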

moissanite-diagnostics-20230714-0944.zip

July 14, 2023

Community Expert

You are still overclocking the RAM, please see the link above again.

July 14, 2023

Author

Right, thought I turned off XMP when I set the Power Supply Idle Control, but apparently didn't save that config.

Question for future visitors: Are you seeing that as part of the syslog error I posted (with the Hardware error), or something else you saw in the diagnostics?

I'll get XMP disabled (though seems a waste, especially since it was stable for months in TrueNAS) and let it run to see if I get further lockups.

July 14, 2023

Community Expert

2 minutes ago, Alyred said:

Are you seeing that as part of the syslog error I posted

In the diags, was just double checking since I assumed it had already been done.

July 14, 2023

Keep in mind that Unraid runs completely from RAM, which is a different situation than TrueNAS.

July 14, 2023

Author

1 hour ago, JorgeB said:

In the diags, was just double checking since I assumed it had already been done.

Thanks. Have it fixed to 2666 MHz now and XMP turned off, upped the voltage a tad to 1.250 V for a bit more stability since I'm not having heat issues. Edit: Decided to just run it at "auto", which is 1.200 V.

1 hour ago, bonienl said:

Keep in mind that Unraid runs completely from RAM, which is a different situation than TrueNAS.

This is a good point, though it was running fine through all of my weeks of trying to get the Preclears to work properly (I believe these were related to other issues, certainly no crashes) and didn't start crashing until I began configuring a VM hosted in the server.

Edited July 14, 2023 by Alyred

July 15, 2023

I just think XMP is flaky. I have had tons of issues (blue screens, random reboots, hard locks, just to name a few) on my own and client machines. I've had them run great for weeks to over a year, then bam, erratic PC behavior, on both Windows and Linux PCs. Turning off XMP always fixes the problems. You're not trying to get the most FPS out of a server anyway, so it's really not needed.

Edited July 15, 2023 by skaterpunk0187

July 16, 2023

Author

New server rebooted again just a few minutes ago. RAM was set to 2666 MHz (not overclocked), and the BIOS setting for Power Supply Idle Control was set to Typical. What should I try next?

Diagnostics attached. I still have my syslog mirrored to flash and can upload that too if necessary.

moissanite-diagnostics-20230716-0043.zip

July 16, 2023

Community Expert

1 minute ago, Alyred said:

I still have my syslog mirrored to flash and can upload that too if necessary

That is the only file that can show what was happening leading up to the reboot.

Do you have your server set to automatically boot if power is applied? Do you have a UPS? A reboot (as opposed to a crash) is normally triggered by something external to Unraid or is hardware related.
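
Since the mirrored syslog survives reboots, the lines just before each boot are the interesting part. A small Python sketch for pulling them out is shown here. It assumes each boot starts with the kernel's usual "Linux version" banner in the syslog; the function name and the default of 20 context lines are arbitrary choices for illustration.

```python
from collections import deque

# Assumption: the first kernel message of every boot contains this banner.
BOOT_MARKER = "kernel: Linux version"


def lines_before_boots(lines, context=20, marker=BOOT_MARKER):
    """Return, for each boot after the first logged line, the last
    `context` lines that were written before that boot started."""
    recent = deque(maxlen=context)
    result = []
    for line in lines:
        if marker in line:
            if recent:
                result.append(list(recent))
            recent.clear()
        recent.append(line.rstrip("\n"))
    return result


# Example call (the path is a placeholder for the mirrored syslog):
# with open("syslog.txt", encoding="utf-8", errors="replace") as f:
#     for chunk in lines_before_boots(f):
#         print("\n".join(chunk))
#         print("-" * 40)
```

If the last lines before a boot show nothing unusual, that in itself hints at an abrupt reset rather than a software fault that had time to log anything.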

July 16, 2023

Author

50 minutes ago, itimpi said:

That is the only file that can show what was happening leading up to the reboot.

Do you have your server set to automatically boot if power is applied? Do you have a UPS? A reboot (as opposed to a crash) is normally triggered by something external to Unraid or is hardware related.

The server is running on a UPS that doesn't seem to be causing issues otherwise; there's another computer plugged into it but sleeping for the time being as I'm using the monitor and keyboard while I'm building the server. It hasn't restarted or been interrupted.

I believe it's set to stay off if the power's been interrupted. That's how I usually set it because I don't want them powering on when the UPS comes online after a power outage or anything like that. I don't know of anything else that might be causing a restart; after a day or two of running without issue I just come back to find the array offline and the uptime recently reset.

Here's the syslog. It covers both the recent restarts.

I'm currently running a memtest on it because why not.

syslog.txt

July 16, 2023

Author

Where would I look specifically for the VM system logs and errors? Not for the individual VMs but for the virtual system?

The problems/crashes only started happening when I created and began running a VM, and then I noticed something when looking around - somehow, my System share was still stuck on my array. I thought I had moved it off long ago but the entire thing was still there. So I just stopped the VM and Docker services, forced mover to finish, then removed the secondary storage so it was exclusive to my SATA cache pool. Restarted my VM and container and let's see how it goes.
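
A quick way to double-check where a share physically lives is to look for its folder on each device mount. This is a hedged Python sketch rather than an official tool: it assumes the usual Unraid layout where array disks and pools are mounted as folders directly under /mnt, and that /mnt/user and /mnt/user0 are the merged views to skip. The function name is made up.

```python
from pathlib import Path


def share_locations(share="system", mnt_root="/mnt", skip=("user", "user0")):
    """Return the names of the mounts under mnt_root that hold a folder for the share."""
    found = []
    for dev in sorted(Path(mnt_root).iterdir()):
        # Skip the merged user-share views and anything that is not a directory.
        if dev.name in skip or not dev.is_dir():
            continue
        if (dev / share).is_dir():
            found.append(dev.name)
    return found


# Example call on the server itself:
# print(share_locations("system"))
```

If the list contains any array disk in addition to the pool, part of the share is still on the array.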

I'm wondering if something trying to access the system share while the array disk was spun down, or something with timeouts relating to that, might have caused the system reboot/crash?

Memtest ran fine with zero errors over 14 hours. I also double-checked my BIOS settings. At one time I had tried to use the curve optimizer to slightly reduce the voltage to the CPU (-10) but had turned off PBO altogether, which made the curve optimizer settings go away. I had assumed those would have been reset but maybe not; I re-enabled it long enough to set that back to 0 and then re-disabled PBO.

Currently Global C-States is still enabled, and left Power Supply Idle Control set to Typical.

Going to let it run for a bit and see if moving the system share to the SATA pool helps, without mover bothering it or it being on a possibly spun-down array drive, etc.
