UnRAID 6.12.2 server crashed and restarted


I was working on configuring a new VM with the SPICE VM Console Protocol, and my VM stopped responding.  Unfortunately, it was because my server had crashed (dirty) and restarted.  Fix Common Problems is telling me my server has detected hardware errors and suggests I install mcelog via the NerdPack plugin, but I don't think that plugin is available anymore.

 

Not sure what hardware error occurred, but would like to learn how to find out.
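
One way to start, sketched here in Python: scan a copy of the syslog for the lines the kernel tags with "[Hardware Error]". The function name `find_hardware_errors` and the sample lines are only illustrative, and this is not a replacement for mcelog, just a filter to see what was logged.

```python
import re

# Kernel machine-check messages carry a "[Hardware Error]" tag in the syslog.
HW_ERROR = re.compile(r"kernel: .*\[Hardware Error\]")


def find_hardware_errors(lines):
    """Return the log lines that the kernel tagged as hardware errors."""
    return [line.rstrip("\n") for line in lines if HW_ERROR.search(line)]


# Sample input in the usual syslog layout (hostname is a placeholder).
sample = [
    "Jul 14 08:17:26 tower kernel: IPI shorthand broadcast: enabled\n",
    "Jul 14 08:17:26 tower kernel: mce: [Hardware Error]: Machine check events logged\n",
    "Jul 14 08:17:26 tower kernel: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 5: baa0000000030150\n",
]

hits = find_hardware_errors(sample)
for hit in hits:
    print(hit)
```

To run it against a real log, pass an open file instead of `sample`, for example `find_hardware_errors(open(path, errors="replace"))`, where `path` is wherever your syslog copy lives.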

 

 

 

 

Edited by Alyred
Remove irrelevant diagnostics.
Link to comment

Thanks.  Did you see something in particular in my diagnostics that made you believe it was something specific in my settings? The system ran TrueNAS for months before UnRAID without issue, though I did update the BIOS once for the new AGESA version when I installed UnRAID.  I have most of the overclocking turned off, though the memory is running at an overclocked profile as it was before, and I did tune the curve optimizer down slightly.

 

I'll check for the c-state settings.

Link to comment

Got the BIOS option for Power Supply Idle Control set to "typical" and the server just crashed again while idle. Since it went for weeks prior to the change and this time it crashed within an hour, it appears to have made things less stable, or I just got unlucky.

 

@JorgeB, you replied to someone else's post a couple of years ago that said the go file line to disable C6 should no longer be needed with the Power Supply Idle Control set to Typical. Is that still the case, or is there possibly a regression bug in 6.12.x?

 

I've now set up the local syslog server, but the only option it gives for "local syslog folder" is "<custom>" and it won't accept any input.  I've mirrored it to flash for the time being with a 10 MB maximum file size.

Link to comment

So it just restarted again this morning.  I do have the syslog from the syslog server being mirrored to the flash drive. Not sure if that's safe to upload in its raw form like that? Happy to do so if needed.
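
For anyone with the same worry about posting a raw syslog, here is a minimal Python sketch that masks IPv4 and MAC addresses before sharing. This is not an Unraid feature; the `redact` function and its patterns are my own assumptions, and they will not catch everything (hostnames, share names, and usernames are left alone, and a four-part version number would be masked as if it were an IP).

```python
import re

# Simple patterns; an assumption about what is worth hiding, not a complete list.
IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
MAC = re.compile(r"\b(?:[0-9A-Fa-f]{2}:){5}[0-9A-Fa-f]{2}\b")


def redact(line):
    """Mask MAC addresses first, then IPv4 addresses, in one log line."""
    line = MAC.sub("XX:XX:XX:XX:XX:XX", line)
    return IPV4.sub("x.x.x.x", line)


# Made-up example line; timestamps such as 08:17:26 are too short to match MAC.
example = "Jul 14 08:17:26 tower kernel: eth0: MAC 00:1a:2b:3c:4d:5e addr 192.168.1.10"
print(redact(example))
```

Running every line of the log through `redact` and writing the result to a new file leaves the original untouched.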

 

This is with the Power Supply Idle Control set to Typical still in the BIOS.

 

Glancing through, I see the following "hardware errors" logged during bootup, I believe:
 

Jul 14 08:17:26 Moissanite kernel: microcode: CPU0: patch_level=0x0a20120a
Jul 14 08:17:26 Moissanite kernel: microcode: CPU1: patch_level=0x0a20120a
Jul 14 08:17:26 Moissanite kernel: microcode: CPU2: patch_level=0x0a20120a
Jul 14 08:17:26 Moissanite kernel: microcode: CPU3: patch_level=0x0a20120a
Jul 14 08:17:26 Moissanite kernel: microcode: CPU4: patch_level=0x0a20120a
Jul 14 08:17:26 Moissanite kernel: microcode: CPU5: patch_level=0x0a20120a
Jul 14 08:17:26 Moissanite kernel: microcode: CPU6: patch_level=0x0a20120a
Jul 14 08:17:26 Moissanite kernel: microcode: CPU7: patch_level=0x0a20120a
Jul 14 08:17:26 Moissanite kernel: microcode: CPU8: patch_level=0x0a20120a
Jul 14 08:17:26 Moissanite kernel: microcode: CPU9: patch_level=0x0a20120a
Jul 14 08:17:26 Moissanite kernel: microcode: CPU10: patch_level=0x0a20120a
Jul 14 08:17:26 Moissanite kernel: microcode: CPU11: patch_level=0x0a20120a
Jul 14 08:17:26 Moissanite kernel: microcode: CPU12: patch_level=0x0a20120a
Jul 14 08:17:26 Moissanite kernel: microcode: CPU13: patch_level=0x0a20120a
Jul 14 08:17:26 Moissanite kernel: microcode: CPU14: patch_level=0x0a20120a
Jul 14 08:17:26 Moissanite kernel: microcode: CPU15: patch_level=0x0a20120a
Jul 14 08:17:26 Moissanite kernel: microcode: CPU16: patch_level=0x0a20120a
Jul 14 08:17:26 Moissanite kernel: microcode: CPU17: patch_level=0x0a20120a
Jul 14 08:17:26 Moissanite kernel: microcode: CPU18: patch_level=0x0a20120a
Jul 14 08:17:26 Moissanite kernel: microcode: CPU19: patch_level=0x0a20120a
Jul 14 08:17:26 Moissanite kernel: microcode: CPU20: patch_level=0x0a20120a
Jul 14 08:17:26 Moissanite kernel: microcode: CPU21: patch_level=0x0a20120a
Jul 14 08:17:26 Moissanite kernel: microcode: CPU22: patch_level=0x0a20120a
Jul 14 08:17:26 Moissanite kernel: microcode: CPU23: patch_level=0x0a20120a
Jul 14 08:17:26 Moissanite kernel: microcode: CPU24: patch_level=0x0a20120a
Jul 14 08:17:26 Moissanite kernel: microcode: CPU25: patch_level=0x0a20120a
Jul 14 08:17:26 Moissanite kernel: microcode: CPU26: patch_level=0x0a20120a
Jul 14 08:17:26 Moissanite kernel: microcode: CPU27: patch_level=0x0a20120a
Jul 14 08:17:26 Moissanite kernel: microcode: CPU28: patch_level=0x0a20120a
Jul 14 08:17:26 Moissanite kernel: microcode: CPU29: patch_level=0x0a20120a
Jul 14 08:17:26 Moissanite kernel: microcode: CPU30: patch_level=0x0a20120a
Jul 14 08:17:26 Moissanite kernel: microcode: CPU31: patch_level=0x0a20120a
Jul 14 08:17:26 Moissanite kernel: microcode: Microcode Update Driver: v2.2.
Jul 14 08:17:26 Moissanite kernel: IPI shorthand broadcast: enabled
Jul 14 08:17:26 Moissanite kernel: sched_clock: Marking stable (1536275102, 317374787)->(1886336138, -32686249)
Jul 14 08:17:26 Moissanite kernel: mce: [Hardware Error]: Machine check events logged
Jul 14 08:17:26 Moissanite kernel: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 5: baa0000000030150
Jul 14 08:17:26 Moissanite kernel: mce: [Hardware Error]: TSC 0 MISC d012000200000000 SYND 4d000002 IPID 500b000000000 
Jul 14 08:17:26 Moissanite kernel: mce: [Hardware Error]: PROCESSOR 2:a20f12 TIME 1689347821 SOCKET 0 APIC 0 microcode a20120a
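
As a rough aid for reading the "Bank 5" status value above, here is a Python sketch that decodes only the architectural flag bits (63 down to 57) and the low 16-bit error code of an x86 machine-check status register. The function name `decode_mci_status` is made up for illustration, and the vendor-specific fields (including what bank 5 refers to on this CPU) are deliberately not interpreted.

```python
# Architectural flag bits in the upper part of an x86 MCi_STATUS value.
FLAGS = {
    "VAL": 63,    # register holds a valid error record
    "OVER": 62,   # an earlier error record was overwritten
    "UC": 61,     # error was not corrected by hardware
    "EN": 60,     # error reporting was enabled for this bank
    "MISCV": 59,  # the MISC register holds extra information
    "ADDRV": 58,  # the ADDR register holds a valid address
    "PCC": 57,    # processor context may be corrupt
}


def decode_mci_status(status_hex):
    """Decode the flag bits and the low 16-bit error code from a hex string."""
    status = int(status_hex, 16)
    decoded = {name: bool((status >> bit) & 1) for name, bit in FLAGS.items()}
    decoded["error_code"] = status & 0xFFFF  # bits 15:0
    return decoded


decoded = decode_mci_status("baa0000000030150")
for name, value in decoded.items():
    print(name, hex(value) if name == "error_code" else value)
```

For the value in this log, VAL, UC, EN, MISCV and PCC come out set and OVER and ADDRV clear, so it is a valid, uncorrected error with no address recorded. That alone does not say which component is at fault.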

 

moissanite-diagnostics-20230714-0944.zip

Link to comment

Right, I thought I had turned off XMP when I set the Power Supply Idle Control, but apparently I didn't save that config.

 

Question for future visitors: Are you seeing that as part of the syslog error I posted (with the Hardware error), or something else you saw in the diagnostics?

 

I'll get XMP disabled (though seems a waste, especially since it was stable for months in TrueNAS) and let it run to see if I get further lockups.

Link to comment

 

1 hour ago, JorgeB said:

In the diags, was just double checking since I assumed it had already been done.

Thanks.  I have it fixed to 2666 MHz now with XMP turned off, and upped the voltage a tad to 1.250 V for a bit more stability since I'm not having heat issues. Edit: Decided to just run it at "auto", which is 1.200 V.

 

1 hour ago, bonienl said:

Keep in mind that Unraid runs completely from RAM, which is a different situation from TrueNAS.

 

This is a good point, though it was running fine through all of my weeks of trying to get the Preclears to work properly (I believe these were related to other issues, certainly no crashes) and didn't start crashing until I began configuring a VM hosted in the server.

Edited by Alyred
Link to comment

I just think XMP is flaky. I have had tons of issues on my own and client machines: blue screens, random reboots, and hard locks, just to name a few. I've had them run great for weeks to over a year, then bam, erratic PC behavior, on both Windows and Linux PCs. Turning off XMP always fixes the problems. You're not trying to get the most FPS out of a server anyway, so it's really not needed.

Edited by skaterpunk0187
Link to comment
1 minute ago, Alyred said:

I still have my syslog mirrored to flash and can upload that too if necessary

 

That is the only file that can show what was happening leading up to the reboot.

 

Do you have your server set to automatically boot if power is applied?  Do you have a UPS?  A reboot (as opposed to a crash) is normally triggered by something external to Unraid or is hardware related.

Link to comment
50 minutes ago, itimpi said:

 

That is the only file that can show what was happening leading up to the reboot.

 

Do you have your server set to automatically boot if power is applied?  Do you have a UPS?  A reboot (as opposed to a crash) is normally triggered by something external to Unraid or is hardware related.

The server is running on a UPS that doesn't seem to be causing issues otherwise; there's another computer plugged into it but sleeping for the time being as I'm using the monitor and keyboard while I'm building the server. It hasn't restarted or been interrupted.

 

I believe it's set to stay off if the power's been interrupted. That's how I usually set it because I don't want them powering on when the UPS comes online after a power outage or anything like that. I don't know of anything else that might be causing a restart; after a day or two of running without issue, I just come back to find the array offline and the uptime reset.

 

Here's the syslog. It covers both the recent restarts.

 

I'm currently running a memtest on it because why not.

 

syslog.txt

Link to comment

Where would I look specifically for the VM system logs and errors? Not for the individual VMs but for the virtual system?

 

The problems/crashes only started happening when I created and began running a VM, and then I noticed something when looking around - somehow, my System share was still stuck on my array. I thought I had moved it off long ago but the entire thing was still there. So I just stopped the VM and Docker services, forced mover to finish, then removed the secondary storage so it was exclusive to my SATA cache pool. Restarted my VM and container and let's see how it goes.

I'm wondering if something trying to access the system share while the array disk was spun down, or something with timeouts relating to that, might have caused the system reboot/crash?

 

Memtest ran fine with zero errors over 14 hours. I also double-checked my BIOS settings.  At one time I had tried to use the curve optimizer to slightly reduce the voltage to the CPU (-10) but had turned off PBO altogether, which made the curve optimizer settings go away. I had assumed those would have been reset but maybe not; I re-enabled it long enough to set that back to 0 and then re-disabled PBO.

Currently, Global C-States is still enabled, and I left Power Supply Idle Control set to Typical.

Going to let it run for a bit and see if moving the system share to the SATA pool helps, without mover bothering it or it being on a possibly spun-down array drive, etc.

Link to comment
