[Solved] XMP Causing Random Crashes



Hello, I've been dealing with something very strange. I have a machine I've been trying to get unRAID running on in any stable fashion.

 

It crashes seemingly randomly during boot. Sometimes it makes it far enough to start the array; other times it crashes as soon as the boot selection screen times out. I haven't been able to narrow it down. It logs nothing when crashing, so it must be something hardware-based.

 

I'm running 4 x 8 GB of RAM. I've removed the sticks and inserted each of them alone, then built back up to all four; no single stick was tied to the crashes, and the crashes continued. I can also run memtest64 for days without any logged errors.

 

I used to be on an ASUS WS Z390 Pro motherboard with a 9th-generation Intel Core i9-9900K and was logging "machine check events", so I swapped to an MSI Z490 motherboard with a 10th-gen Intel Core i5-10600K. I haven't seen any machine check events in the log on this motherboard/CPU combination.

 

It has a single PCI card for a TV Tuner, no GPU or anything else. It's running 6 SATA drives, 1 being an SSD.

 

I just swapped USB drives, hoping it was corrupted boot files, but the crashes happened on the new drive exactly as they always do. I barely had time to "replace key" my license onto the new one.

 

I gifted the Z390 and 9900k combo to my brother running Windows 11 and it has had zero issues. My next step is to try running Windows live off a USB and see if somehow it's unRAID/Linux only. Not sure why that would make sense, but here I am.

 

P.S. I've attached diagnostics that were auto-dumped onto the previous drive two days ago; they should contain anything useful.

olympus-diagnostics-20220722-0519.zip

Edited by FEENX
Updating after solving
Link to comment
2 hours ago, JorgeB said:

Strange that docker would cause crashing during boot, but won't say it's impossible, assuming the crashing happens after array start.

 

Yeah, that's what I thought, because it would definitely crash during boot, sometimes immediately after it auto-selected the normal boot option. And yet here I am without a single crash. I'm almost afraid to turn Docker on, haha.

Link to comment

Looking for needles in a haystack here, but...

 

Fire up Docker and see what happens and then turn on each docker container one at a time and see if one of them is causing an issue.

You could either rebuild your docker file if it's corrupt, or maybe one of your dockers is causing it to crash.
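A minimal sketch of that one-at-a-time approach in Python, assuming the containers already exist and the `docker` CLI is on the path. The names `first_unstable_container`, `is_stable`, and `soak_seconds` are made up for illustration; how you decide the box is "stable" is up to you.

```python
import subprocess
import time
from typing import Callable, Iterable, Optional


def docker_start(name: str) -> None:
    """Start one existing container through the Docker CLI."""
    subprocess.run(["docker", "start", name], check=True)


def first_unstable_container(
    names: Iterable[str],
    start: Callable[[str], None] = docker_start,
    is_stable: Callable[[], bool] = lambda: True,
    soak_seconds: float = 0.0,
) -> Optional[str]:
    """Start containers in order; return the first one after which
    the (hypothetical) is_stable() check fails, or None if all pass."""
    for name in names:
        # Print before starting so the last name on screen is the suspect
        # if the whole machine goes down and this script never returns.
        print(f"starting {name}", flush=True)
        start(name)
        time.sleep(soak_seconds)  # let it run for a while before judging
        if not is_stable():
            return name
    return None
```

If the server hard-crashes instead of just misbehaving, the script dies with it, so the last `starting ...` line printed is the one to look at.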

 

You did say Docker, so I wonder if it could be related to your SSD, unless you're not running it off your SSD.

 

 

Link to comment
1 hour ago, kizer said:

Looking for needles in a haystack here, but...

 

Fire up Docker and see what happens and then turn on each docker container one at a time and see if one of them is causing an issue.

You could either rebuild your docker file if it's corrupt, or maybe one of your dockers is causing it to crash.

 

You did say Docker, so I wonder if it could be related to your SSD, unless you're not running it off your SSD.

 

 

 

I do have an SSD cache, and Docker does run off that. And will do. I deleted the Docker image file, so I'm assuming it'll create a new one when I start it up again.

 

All in all, it's nice to have a stable NAS and VMs, though. I run my Plex off a VM so I can use TV tuners that don't have Linux drivers, so this ended up not affecting my media server, ha.

Edited by FEENX
Link to comment

Now it's back to crashing during boot over and over. It crashes right as it says it's checking "bzimage" or "bzfirmware"; I can't tell which, it's too fast. It doesn't even make it to Linux booting.

 

Only thing left to change out is the power supply, can't imagine it's the PSU tho.

Link to comment

Yeah, I'm just not sure which it could be, because since this started happening I've replaced its motherboard, CPU, RAM, and USB drive.

 

Changed from an ASUS Z390 with a 9900K to an MSI Z490 with a 10600K. As for RAM, I've run memtest for hours without errors, and even tried running with each of the DIMMs separately.

 

Assuming it could be the SATA drives, is there anything else I could do to check them, other than self-tests or benchmarking via DiskSpeed? It doesn't seem to be reproducible by doing filesystem stuff.

Link to comment
4 minutes ago, FEENX said:

Assuming it could be the SATA drives, is there anything else I could do to check them, other than self-tests or benchmarking via DiskSpeed?

Don't think so. If it crashes during bz* files loading, try v6.11.0-rc2 just to make sure it's not some kernel compatibility issue; v6.11 includes a much newer kernel.

Link to comment
9 minutes ago, JorgeB said:

Don't think so. If it crashes during bz* files loading, try v6.11.0-rc2 just to make sure it's not some kernel compatibility issue; v6.11 includes a much newer kernel.

 

Okay. I'm going to try turning it on again soon. It's very odd, because it seems to "get worse" by crashing sooner and sooner in the sequence, but then if I turn it off for a while it'll start to work again. It's as if it's overheating, but from checking the BIOS I don't believe the CPU/MB are overheating.

 

If it continues, I'll look at using that version. I see the creator tool can create a drive using it; then I copy over the "/config" folder, correct?

 

Appreciate the help of course!

Link to comment
2 hours ago, JorgeB said:

Yes, you can also update using the GUI, if you can get it to boot at least once.

 

image.png

 

As it usually does, after being off for a while it booted up perfectly fine and is now running. I managed to upgrade to the RC, as you can see from that screenshot, and it rebooted just fine. It's now running with only 1 VM and the NAS features; Docker is off.

Link to comment

Well, it's been running since Thursday morning and so far has only crashed once, and it rebooted fine.

 

Not sure what that was. I walked into my office and it was starting up, and it has stayed running for "1 day 3 hours" now. Can't tell if it rebooted multiple times or just once and stuck.
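One way to tell "rebooted once" from "rebooted several times" would be to log the boot time somewhere persistent on every startup. A rough Python sketch, assuming a Linux `/proc/uptime` (first field is seconds since boot); the function names and the log location are hypothetical, not anything unRAID provides.

```python
from datetime import datetime, timedelta


def parse_uptime(text: str) -> float:
    """Return seconds since boot from the contents of /proc/uptime."""
    return float(text.split()[0])


def boot_time(uptime_seconds: float, now: datetime) -> datetime:
    """Work out when the machine booted from its current uptime."""
    return now - timedelta(seconds=uptime_seconds)


def record_boot(log_path: str, booted_at: datetime) -> int:
    """Append one boot timestamp to a log file and return how many
    boots the file now holds. Meant to be run once per startup."""
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(booted_at.isoformat(timespec="seconds") + "\n")
    with open(log_path, encoding="utf-8") as f:
        return sum(1 for line in f if line.strip())
```

Run at startup with the text of `/proc/uptime` and `datetime.now()`, each reboot adds a line, so three new lines overnight would mean three reboots rather than one.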

 

Still been running with docker entirely off but VM engine now has Windows and Ubuntu running all the time. So far so good with the new kernel I guess. *knock on wood*

Link to comment

And there it goes again:

  1. Crashes, gets back to the GUI
  2. ~1 min later crashes, gets as far as the boot-up sequence text
  3. Crashes over and over after "loading bzroot..."

Just turned it off, can almost guarantee that in ~10 minutes I'll be able to boot it up just fine and it'll run seemingly fine for a few hours. I just can't figure out what would cause this.

Edited by FEENX
Link to comment

Based on the way it seems to "simmer" over and then come back after cooling off, plus the fact that the only part I haven't replaced is the power supply, I'm now suspecting the PSU.

 

I guess, in theory, the power supply could be failing: perhaps something in it overheats, and then the next time there's a load it crashes. Then it crashes again during boot at heavy load times. But after cooling off it works again, seemingly normally.

 

Not sure if I have another PSU to put it in, if not, I'll have to grab a cheaper one and see. Can buy a proper one if that's the issue for sure.

Link to comment
8 minutes ago, JonathanM said:

Since you mention swapping PSU's, I'm dropping a reminder to never use modular cables from a different PSU unless you use a meter to verify correct connections. Incompatible cables that appear to connect the same can permanently fry expensive parts.

 

For sure, the one it's using now that's been in there isn't modular and I don't think a cheap new one I get will be either but I appreciate it.

Link to comment
7 hours ago, ChatNoir said:

While you are in the server, you should check that your CPU cooling is still working properly (paste, fans, dust, etc)

 

For sure, I thought similarly early on. I've been monitoring the CPU temps randomly since the crashing started happening. It'll hover around 21-25 °C, and then when I do benchmarking it spikes into the mid 80s. So very normal imo. Plus the CPU/cooler are different than when I started :/

Edited by FEENX
Link to comment

Well I decided to disconnect all the SATA drives to see if it could be the random drives I have in there. None are particularly new or known good. 

 

So far it's been running without crashing for almost 18 hours. I'm thinking about plugging in the SATA SSD and trying to get the array started with just the VMs and Docker running off the SSD, with no hard drives or parity going.

 

Well, I decided to run Geekbench for Linux to make sure the hardware that was still connected was good, and it crashed during the run. That makes me less sure it was the drives and more sure I need to swap the PSU.

 

Yep, crashes at the Multi-core "Navigation" test every time. Not sure if that means anything.

Edited by FEENX
Link to comment

So I'm about a week away from being able to afford a new PSU, so I started trying to see if I could replicate the crashes and then, once again, see if any one component was related.

 

Started with one stick of RAM, and ran both Geekbench 5 and Prime95 stress tests before moving on. Got up to 4 sticks, and it was running super stable, completing each test just fine. Full clean shutdowns and restarts.

 

And then I went into the BIOS... and did what I've done for years. Enabled XMP.

 

Instantly it can't complete a single test to save its life. As I write this, it's running just fine after finishing Geekbench 5 with all 32 GB of RAM but XMP disabled.

 

Will keep working to verify stability but... yeah. I know XMP is technically "overclocking" but I never thought it was an issue, and I've probably had it on for every other crash because it's something I turn on as if it was default.
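For anyone wanting to confirm from inside Linux what speed the RAM is actually running at after flipping XMP, here's a rough Python sketch that parses the text output of `dmidecode -t memory`. It assumes the output has "Memory Device" blocks with "Speed" and "Configured Memory Speed" fields (older dmidecode versions call the latter "Configured Clock Speed"); the function names are made up for illustration.

```python
import re
from typing import Dict, List, Optional

# Only the fields of interest; "Bank Locator" is deliberately not matched.
_FIELD = re.compile(
    r"^\s*(Locator|Speed|Configured Memory Speed|Configured Clock Speed):\s*(.+?)\s*$"
)


def parse_memory_speeds(dmidecode_text: str) -> List[Dict[str, str]]:
    """Collect slot name, rated speed and configured speed per DIMM."""
    devices: List[Dict[str, str]] = []
    current: Optional[Dict[str, str]] = None
    for line in dmidecode_text.splitlines():
        if line.startswith("Handle "):
            current = None  # a new DMI record begins
        elif line.strip() == "Memory Device":
            current = {}
            devices.append(current)
        elif current is not None:
            m = _FIELD.match(line)
            if m:
                key = m.group(1)
                if key == "Configured Clock Speed":  # older field name
                    key = "Configured Memory Speed"
                current[key] = m.group(2)
    return devices


def configured_mts(device: Dict[str, str]) -> Optional[int]:
    """Return the configured speed as an integer, or None if absent."""
    m = re.match(r"(\d+)", device.get("Configured Memory Speed", ""))
    return int(m.group(1)) if m else None
```

Feeding it the captured output (for example via `subprocess.run(["dmidecode", "-t", "memory"], capture_output=True, text=True)` as root) should show a lower configured speed with XMP off than with it on, though exactly what each BIOS reports in these fields can vary.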

Link to comment

Solved. It was XMP.

 

Not sure why, but it was. Zero crashes since turning it off and I've done everything, sometimes all at once, with my unRAID server now. Docker, VM, stress tests, days running, etc.

 

Thanks for the help. Hopefully someone else can save the time and try disabling XMP first.

Link to comment
  • FEENX changed the title to [Solved] XMP Causing Random Crashes
  • 2 months later...

Maybe this was mentioned and I missed it, but anyone who comes across this should make note of the below:

 

tl;dr: 4 sticks do not work at overclock (XMP) speeds on Alder Lake. Run at JEDEC speeds for stability.

 

You cannot use 4 sticks of RAM (in my case DDR5) and run them overclocked on Alder Lake (Z690 in my case, pun intended). So, no XMP. Just run at JEDEC speeds if you have 4 sticks. I've only used this hardware with Linux, and Windows 11 may do some voodoo magic that somehow allows this, but I think this is just a major fcuk up by Intel and the board manufacturers for not being crystal clear with their customers.

Edited by jakeshake
Link to comment
  • 1 year later...

I currently have the same thing happening to my system when I enable XMP, and I only have two memory sticks, not 4. I've got an Asus B760-I GAMING WIFI motherboard with the latest BIOS update of December 15th.

Edited by amuzhaqi
Link to comment
