[Resolved] UnRaid Server randomly power cycling


Solved by nexusjosh


3 hours ago, hamish_18 said:

Thanks for the troubleshooting @nexusjosh. Definitely would be interested if you found the conflicting plugin.

Sigh.  My machine still crashed after about 18 hours of running.  I'm going to boot it into safe mode and verify that it runs fine there.  Right now, I'm back to having no idea what is causing the hard reboots.  It seems more stable now though?  I guess?


So, instead of launching it into safe mode, I reset the server again and started the array.  Perhaps it crashed because I didn't restart the server after clearing the plugins?  Well, it's been running for 29 hours now, and counting, sooo... Perhaps it IS fixed.  I'll update again when either 1: the server crashes again, or 2: it has been on and stable for 3 days.  Then I'll begin installing plugins, re-add my VM, and re-add the docker I use.


Update.  It appears to be reverting somewhat?  But I may have drilled down to what the issue is.  After I got my VM back up (a dev VM for my small business that does a reasonable amount of HDD IO, file transfers, etc.), the server is back to crashing after a couple of hours, it appears.  Since it completed a Parity check without issue... I'm back to square 1.  Could it be one of the drives?  If so, I don't understand why a Parity check would finish with no issues.  Maybe something is going on with the HyperV?  But I just generated a new one, and loaded up an older VM...  The only other things I did were add Unassigned Devices and re-add an old VM image.

 

At this moment, I'm going to run an extended self-test on each drive and see if any errors pop up.

 

Any thoughts?  Ideas?


Update:  I am so confused.  This issue is so... RANDOM!!!

 

It ran just fine for 2 days once more, so I created a new VM to replace the old one, and it has been running for 9 hours without issue.  Granted, the VM just exists and is running Win 10.  I haven't put it to work yet.  Will update tomorrow, when I put the VM to work.

 

Could it have been an old VM image that is roughly 3 years old?  😰

 

Current up time is 2 days, 13 hours, and counting.

 

Edit: Oh, and I am having it grind through extended SMART tests.  It takes about a day for each drive.

Parity: Completed without error
Parity 2: Completed without error
Disk 1: Completed without error
Disk 2: Completed without error
Disk 3: In Progress
Disk 4:
Disk 5:
Disk 6:
Cache:
VM Drive:
Hot Spare:
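
For anyone who would rather script the per-drive extended tests than click through them, here is a minimal sketch. It assumes `smartctl` (from smartmontools) is available on the server, and the device paths are hypothetical placeholders, not the drives from this thread. It only builds and prints the commands, it does not run them.

```python
import shlex

# Hypothetical device paths; the real ones depend on how the drives enumerate.
DEVICES = ["/dev/sdb", "/dev/sdc", "/dev/sdd"]


def start_command(device):
    """Command that asks the drive to begin an extended (long) self-test."""
    return ["smartctl", "-t", "long", device]


def log_command(device):
    """Command that prints the drive's self-test log once the test has finished."""
    return ["smartctl", "-l", "selftest", device]


def dry_run(devices):
    """Return the shell lines that would be run, without executing anything."""
    lines = []
    for device in devices:
        lines.append(shlex.join(start_command(device)))
        lines.append(shlex.join(log_command(device)))
    return lines


if __name__ == "__main__":
    for line in dry_run(DEVICES):
        print(line)
```

The extended test runs inside the drive itself, so the start command returns right away and the log command is only useful after the test has had time to finish (about a day per drive here).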


Update:  So, I've installed my usual dockers once more, and my server has been back in normal production for the last couple of hours.  Uptime: 3 Days, 2 Hours, 26 mins and counting.  Could it have been a VM all along?  But then, I wasn't running the VM when it was crashing.  We'll see what happens after a couple of days.  I'm going to update the above post for giggles as I complete the SMART tests on each drive.


Update:  Everything was going so well; I was moving the server back to full normal production, and it just crashed once more while under heavy load.  Temps look fine.  Nothing in the IPMI event log.  For giggles, I even replaced the BIOS battery! (I tested the one that was in the system, though, and it was just fine.)

 

Thoughts, ideas, anyone?

 

Edit: Pain.  [Attachment: 22.JPG]

10 minutes ago, klepel said:

"...while under heavy load." - possibly it's the PSU failing under load?

And sorry, I cannot remember if you mentioned it being replaced or not in the thread.

It's possible.  Today, I'm going to try re-seating the two CPUs to see if that makes any sort of difference.  If not, I'll see if I can find a CPU I can plug in and see if there is any difference.  The PSU in it is only about 6 months old, an EVGA SuperNOVA.  I don't have a spare one on hand, and at peak, according to the UPS, the server only pulls 300 Watts, while the SuperNOVA is rated for 1,000.

 

Perhaps, given my server has so many Exos drives, the PSU is having a hard time with so much pulling from the 12V rail?  Le sigh.  I shall have to investigate.

 

Edit: But then, before I removed all of the plugins, the server wasn't doing anything when it was power cycling.  The array was just on.  The VMs and Docker containers were off.  This problem is so inconsistent it's ridiculous.

16 hours ago, klepel said:

Can you see voltages in your IPMI? I can on my X10 board, but I'm not sure about the X9. If so, log in to the IPMI (or use the plugin) and monitor the 3.3V, 5V, and 12V rails to see if any dip (or drop) under load.

 

Yes.  Though I always figured that the IPMI would error out if the voltages went out of spec?  Screenshot attached, though you really can't see what's going on all that well.

Volt.JPG
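
Since a screenshot makes brief dips hard to spot, logging the numbers may work better. Below is a rough sketch that parses the pipe-separated text printed by `ipmitool sensor` and tracks the lowest value seen per rail. The sensor names and sample values are made up for illustration; the names on an X9 board may differ.

```python
# Hypothetical sample in the pipe-separated layout printed by `ipmitool sensor`:
# name | value | units | status | six threshold columns
SAMPLE = """\
CPU1 Temp        | 38.000     | degrees C  | ok    | na        | na        | na        | 84.000    | 87.000    | 89.000
12V              | 11.750     | Volts      | ok    | 10.144    | 10.272    | 10.784    | 12.960    | 13.280    | 13.408
5VCC             | 5.020      | Volts      | ok    | 4.246     | 4.298     | 4.480     | 5.390     | 5.546     | 5.598
VBAT             | na         | Volts      | na    | 2.400     | 2.490     | 2.595     | 3.495     | 3.600     | 3.690
"""


def parse_voltages(sensor_output):
    """Return {sensor name: volts} for every voltage row that has a numeric reading."""
    readings = {}
    for line in sensor_output.splitlines():
        fields = [field.strip() for field in line.split("|")]
        if len(fields) < 3 or fields[2] != "Volts":
            continue  # not a voltage sensor
        try:
            readings[fields[0]] = float(fields[1])
        except ValueError:
            continue  # reading reported as "na"
    return readings


def lowest_seen(polls):
    """Given a list of parse_voltages() results, keep the minimum per sensor."""
    lows = {}
    for readings in polls:
        for name, value in readings.items():
            lows[name] = min(value, lows.get(name, value))
    return lows


if __name__ == "__main__":
    print(parse_voltages(SAMPLE))
```

In practice, the text could come from something like `subprocess.run(["ipmitool", "sensor"], capture_output=True, text=True).stdout`, polled every few seconds while the server is under load. Note that a dip shorter than the polling interval would still be missed.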


Were those voltages captured while the system was working, or after a crash?

 

They all look ok, and seem to be in range. The 5v, 5VSB, and 12v show the most variability in the images (distance between the 2 red lines).

 

My 5v and 5VSB are 5 and 5.02 respectively. And for reference my 12v is 11.75.

 

Each board is different, but your voltages for those three should be OK.
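
As a quick sanity check on "in range": a small sketch, assuming the usual ATX tolerance of ±5% on the positive rails (the board's own IPMI thresholds may be set differently). The readings are the ones quoted above from my own board; the rail names are just labels.

```python
# Nominal rail voltages; the 5% tolerance is an assumption based on the ATX spec.
NOMINAL = {"5V": 5.0, "5VSB": 5.0, "12V": 12.0}


def in_tolerance(nominal, measured, tolerance=0.05):
    """True if the measured voltage is within +/- tolerance of nominal."""
    return abs(measured - nominal) <= nominal * tolerance


# Readings quoted in this post.
readings = {"5V": 5.0, "5VSB": 5.02, "12V": 11.75}

if __name__ == "__main__":
    for rail, measured in readings.items():
        status = "ok" if in_tolerance(NOMINAL[rail], measured) else "OUT OF RANGE"
        print(f"{rail}: {measured} V {status}")
```

With ±5%, the 12V rail may sit anywhere from 11.4 to 12.6, so 11.75 is comfortably inside.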

 

If I happen to think of anything else, I'll offer it. Sorry that what I've suggested so far hasn't been fruitful.

2 hours ago, klepel said:

Were those voltages captured while the system was working, or after a crash?

 

They all look ok, and seem to be in range. The 5v, 5VSB, and 12v show the most variability in the images (distance between the 2 red lines).

 

My 5v and 5VSB are 5 and 5.02 respectively. And for reference my 12v is 11.75.

 

Each board is different, but your voltages for those three should be OK.

 

If I happen to think of anything else, I'll offer it. Sorry that what I've suggested so far hasn't been fruitful.

Hey, it's all good.  It's something.  The voltages were captured while the system was running; when the system crashes, the IPMI does not.  I've been busy the last few days, so I haven't been able to sub the PSU.  If it IS the PSU, then this will have been the 2nd PSU in 5 years, which is a bit crazy, but we'll see.  And if it is, which PSU should I purchase as the permanent replacement?


Update:  I just swapped the PSUs between my server and my gaming rig, so we'll see if that makes any sort of difference.  We'll see if the server continues to crash... Or my gaming PC starts crashing.

 

Edit:  Aaaaaaand crashed!  The server crashed, that is.  The swapped-in PSU made no difference.  Sigh.


Update:  Resetting the BIOS settings made no difference.  The SuperO support tech called me, and we discussed what was going on for a bit.  He said the only thing it could be at this point is thermals.  I re-seated both CPUs, thoroughly cleaned the old paste off the CPUs and heatsinks, and applied new thermal paste.  I then put the server under full standard load, and after about 20 minutes it reset.  I was monitoring the temperatures at the time, and neither CPU went above 40C.

  • Solution

Update: I sent the SuperO tech the same message I sent here, and he called me quickly.  He reviewed some pics I sent and took it to a couple of different departments.  It was pointed out that it could be the two CPUs I have paired with my MB, since officially the motherboard only supports the Intel® Xeon® processor E5-2600 and E5-2600 v2 families, and I have two E5-4610 v2s.
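
To make the mismatch concrete, here is a small sketch of the check I should have done before buying. It treats "supported" as a simple pattern on the model string (E5-26xx, with an optional letter suffix and an optional "v2"). That pattern is my own simplification of the support statement, not an official Supermicro compatibility list.

```python
import re

# Simplified reading of "E5-2600 and E5-2600 v2 family": model numbers E5-26xx,
# an optional letter suffix (e.g. L or W), and either no version or "v2".
SUPPORTED = re.compile(r"^E5-26\d{2}[A-Z]?(?: v2)?$")


def is_supported(model):
    """True if the model string falls in the E5-2600 or E5-2600 v2 family."""
    return bool(SUPPORTED.match(model.strip()))


if __name__ == "__main__":
    for model in ["E5-4610 v2", "E5-2697 v2"]:
        print(model, "supported" if is_supported(model) else "NOT supported")
```

The point is that "v2" alone is not enough: the E5-4610 v2 is the right generation but the wrong family.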

 

So he sent me one of the original BIOS builds, and I've rolled back the BIOS.  I have it under full load now, so we'll see what happens!

 

It's insane that a "non-supported CPU" would just be unstable, and that the BIOS wouldn't simply reject the CPU at boot like every other motherboard I've ever seen does.  This has all been a roller coaster.

 

On 3/22/2022 at 4:30 PM, klepel said:

Maybe one cpu is bad. Try with one cpu at a time and see if it goes beyond the 20 minutes?

If this BIOS Rollback doesn't fix my issue, I'll try that.

 

Information Edit:  The BIOS I was provided was x9dr3p3.  I'm attaching it here, in case anyone in the future needs it.  Fair warning though: on my system at least, the BIOS interface is strangely buggy.  You can't left-arrow to the save menu; instead you have to press F keys to save BIOS changes, restore defaults, etc.  If I note any other strange bugs, I'll update this post.

x9dr3p3.c04.zip


Years ago, when using two CPUs, the two had to be "matched" (same family, stepping, etc.) so as not to mess up clocks or add other interference to the system. Not the same thing, but it made me think of that when you mentioned the 4600s were not supported on that MB.

Might be a stretch, but maybe the MB can operate with the 46xx but only as a single in an unofficial capacity. Would agree that if those 4600 are not approved for dual configuration the bios should refuse to allow an OS to boot.

Here's hoping you get an answer to the issue soon.

7 hours ago, klepel said:

Years ago, when using two CPUs, the two had to be "matched" (same family, stepping, etc.) so as not to mess up clocks or add other interference to the system. Not the same thing, but it made me think of that when you mentioned the 4600s were not supported on that MB.

Might be a stretch, but maybe the MB can operate with the 46xx but only as a single in an unofficial capacity. Would agree that if those 4600 are not approved for dual configuration the bios should refuse to allow an OS to boot.

Here's hoping you get an answer to the issue soon.

 

6 hours ago, m1a8x2 said:

I'm curious about your next update. I'm having similar issues after upgrading my CPUs to "unsupported" models as well. The crashing seems so random, but that's the one thing I'm seeing in common between my issue and yours.

 

At this moment, I'm going to call it a tentative resolution.  From what I understand, various BIOS updates, including the most recent release back in 2019, contained microarchitecture updates.  When the MB asked the CPUs to do something specific (I am no engineer) that they couldn't do, it would crash.  It almost always crashed under load, but not always.  I've now had it at 80-95% CPU load, which used to crash the system, and it is stable.

 

I remember that when I originally purchased the MB, I saw "E5-2600 v2 family" on the site and googled that, which brought me to this page listing all the CPUs in that generation.  I didn't realize at the time that the CPU also had to be a 26xx.  So it worked for YEARS, and it wasn't until I was upgrading the system and decided to do a BIOS upgrade that the trouble started.  The system STILL BOOTS; it isn't until the BIOS asks for something the CPU can't do that it falls over.  It's quite sad that SuperO didn't just support the newer CPUs.  They OBVIOUSLY work.  But it is what it is.

 

@m1a8x2 If you want to try the same thing I did, look at the version of your BIOS and see if there is an earlier version online.  Or contact your motherboard manufacturer directly and request the earlier BIOS revisions they have for "troubleshooting purposes".

 

All in all, I feel like there is probably a custom BIOS somewhere on the interwebs.  I might look for it at some point, because since I rolled back the BIOS, I've found the actual menu to be strangely buggy.

 

It's also funny: a few months ago, I was going to upgrade the processors.  But now, I'm certainly NOT going to push my luck.  Though perhaps I could try to get my hands on a pair of E5-2697 v2s... Better, and in spec.  Hmm...

  • nexusjosh changed the title to [Resolved] UnRaid Server randomly power cycling
