mce panics with v6.5.3 + specific VM uses


67south

I'm in the process of setting up a new (my first) Unraid machine, primarily for use as a VM workstation with attached storage that I can share between the VMs.

 

The machine itself is a Core i9-7940X with 128 GB of RAM and an ASUS Prime X299 Deluxe motherboard.  I haven't done any overclocking or similar trickery -- everything is completely stock.

 

Generally the machine runs completely fine -- I have Windows 10 and Arch Linux VMs up and running, and can run extended general work / game / load testing in them with no issues.  I've also done a full memtest, which passed, and I've booted natively into Windows from another disk and run an Intel CPU stress test in their Extreme Tuning Utility for several hours without issue.

 

However, if I create a new VM and install Ubuntu 16.04, the whole machine semi-predictably panics with an MCE error.  I say semi-predictably because it will happen at some point during the installation, but exactly when changes from run to run.

 

I know an MCE error "should" always be a hardware fault, but this is the second motherboard I've seen this issue on, and I've swapped out all the other components except the CPU (which passes its tests -- see above) and the PSU (a 1000 watt model from Be Quiet).  So far I've seen a few hints of MCE errors under specific workloads.

 

Has anyone else encountered similar issues with Unraid / other KVM-based VM systems?   I built this machine specifically to allow me to run lots of VMs and heavy compilation / ML workloads, and I'm rapidly running out of ideas about what to do next to get it up and running.

 

Thanks!

Link to comment

Presumably, you've got FCP telling you about MCEs and have hopefully installed mcelog.  But unless you post diagnostics after an MCE happens and before a reboot, everything else is conjecture.
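As a rough illustration of the kind of evidence being asked for here, the sketch below pulls machine-check lines out of a saved syslog. It assumes the kernel tags such messages with `mce: [Hardware Error]` or the phrase `Machine Check Exception`; the function name and the sample line in the usage note are made up for the example and are not taken from this system.

```python
import re

# Assumption: machine-check reports in the kernel log carry one of these markers.
MCE_PATTERN = re.compile(
    r"mce: \[Hardware Error\]|Machine Check Exception", re.IGNORECASE
)

def find_mce_lines(log_text):
    """Return the lines of log_text that look like machine-check reports."""
    return [line for line in log_text.splitlines() if MCE_PATTERN.search(line)]
```

To try it, read a copy of the syslog into a string and pass it in, e.g. `find_mce_lines(open("syslog.txt").read())`, where `syslog.txt` is a hypothetical file name. Corrected (non-fatal) events logged before a crash are the "hints" worth posting.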

 

13 minutes ago, 67south said:

I've seen a few hints of MCE errors under specific workloads

 

14 minutes ago, 67south said:

but this is the second motherboard

Without seeing those hints, the first thing that pops into my mind is cooling issues.

Link to comment

The MCE is always followed immediately by a fatal kernel panic, so I'm afraid I'm not sure how to get any better diagnostics.  All I have so far is the attached photo.  I'm happy to set up any other diagnostics that stand a chance of getting more information.

 

I definitely wouldn't rule out cooling issues.  The system was built to be quiet; it's running a Dark Rock Pro 4 cooler which is rated for way higher TDP than the CPU, but it's possible some other part of the motherboard is reaching unhappy temperatures in a way that I'm not monitoring.  However, it seems odd that I can load the system and CPU heavily in certain ways, but a VM installation (which shouldn't be hard work) is what causes the panic.

IMG_1021.jpg
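If a panic screen like the one in the photo includes a bank number and a 64-bit status value, that value can be decoded by hand. The sketch below is a minimal decoder for the architectural fields of an `IA32_MCi_STATUS` register as described in Intel's machine-check architecture documentation; the function name is invented for this example, and the status value in the usage note is a hypothetical one, not read from the photo.

```python
# Architectural flag bits of IA32_MCi_STATUS (bit position -> meaning).
FLAGS = {
    63: "VAL (register contents valid)",
    62: "OVER (an earlier error was overwritten)",
    61: "UC (uncorrected error)",
    60: "EN (error reporting enabled)",
    59: "MISCV (MISC register valid)",
    58: "ADDRV (ADDR register valid)",
    57: "PCC (processor context corrupt)",
}

def decode_mci_status(status):
    """Split a 64-bit machine-check status value into flags and error codes."""
    flags = [name for bit, name in FLAGS.items() if (status >> bit) & 1]
    return {
        "flags": flags,
        "mca_error_code": status & 0xFFFF,              # bits 15:0
        "model_specific_code": (status >> 16) & 0xFFFF,  # bits 31:16
    }
```

For example, `decode_mci_status(0xB200000000070F0F)` (a hypothetical value) reports the VAL, UC, EN and PCC flags with an MCA error code of `0x0F0F`. The meaning of the error code itself is CPU-specific, which is where mcelog or the vendor documentation comes in.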

Link to comment

Do you have any fan blowing air over the motherboard? During the installation, the motherboard chipset will run warmer. It's normally quite easy to keep the motherboard cool, but there needs to be airflow over the components. It's especially problematic if there are pockets where air movement is blocked.

Link to comment

I've just swapped out the original cooler (which was a bulky lump of metal, covering all of the CPU and restricting airflow to the RAM and surrounding areas) for a Corsair H150i PRO AIO loop (at the cost of some additional pump noise :'( ).  I was hoping this would fix the matter, but sadly it's made no difference at all -- despite a noticeable drop in CPU temperature and 3 more fans pulling air over areas like the VRM and RAM, the Ubuntu VM installation still causes the same error at the same point, and other activities seem mostly to be fine.

Link to comment

OK, this is weird, but repeatable.

 

An installation of Ubuntu 16.04 directly after a clean boot of the Unraid system causes the MCE panic every time, at the same point in the process.  Since I have also installed Arch and Windows VMs with no problem, I thought I'd try Ubuntu 18.04 and see what happened.  18.04 installs fine.

 

Not only that, trying to install 16.04 immediately after completing an 18.04 installation (in a different VM but without rebooting Unraid) also worked fine.

 

I began to wonder whether 16.04 was making some kind of strange request directly to the CPU as part of its installation process, and whether (the kernel used by) 18.04 did something during its boot or installation process that mitigated the problem.  I tried installing 16.04 after booting (not installing) 18.04, and it was also fine.

 

I then tried changing the CPU mode of the 16.04 VM from "Host Passthrough (Intel Core i9-7940X)" to "Emulated QEMU64" to see whether stopping 16.04 talking directly to the CPU might help matters, and it seems to have fixed the issue.

 

16.04 installation with passthrough CPU on clean boot == MCE panic every time

16.04 installation with emulated CPU or after 18.04 boot in a different VM == OK.
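For anyone wanting to try the same workaround outside the Unraid VM form, the CPU mode lives in the VM's libvirt domain XML. The sketch below rewrites a `<cpu mode='host-passthrough'>` element to a custom `qemu64` model; it assumes the Unraid "Emulated QEMU64" option corresponds to something like this, which I haven't confirmed, and the function name is invented for the example.

```python
import xml.etree.ElementTree as ET

def use_emulated_cpu(domain_xml, model="qemu64"):
    """Return domain_xml with its <cpu> element switched to an emulated model."""
    root = ET.fromstring(domain_xml)
    cpu = root.find("cpu")
    if cpu is None:
        # No explicit CPU definition yet: add one.
        cpu = ET.SubElement(root, "cpu")
    cpu.set("mode", "custom")
    cpu.set("match", "exact")
    # Drop any previous <model> and put the emulated one first.
    for old in cpu.findall("model"):
        cpu.remove(old)
    model_el = ET.Element("model", {"fallback": "allow"})
    model_el.text = model
    cpu.insert(0, model_el)
    return ET.tostring(root, encoding="unicode")
```

Other children of `<cpu>` (such as `<topology>`) are left untouched, so any passthrough-only settings would still need reviewing by hand before the XML is used.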

 

As far as I'm aware, I'm running the latest versions of everything: BIOS, CPU microcode, and ME firmware.

Link to comment
