
Server Repeatedly Hard Crashes



Hello

 

I'm experiencing an issue where my server randomly crashes completely: no web UI, no Samba, nothing responds, and the box needs a hard restart. It looks like a kernel panic, but I can't confirm this.

 

At first the server would last 12 to 24 hours before crashing, but recently this window has shrunk to around 2 to 6 hours.

 

I was originally under the impression that I had a failing flash drive.  Sometimes before it would crash, I would see the "License file not found" error and that the boot USB device had been moved to Unassigned Devices.  However, I've already swapped to a brand new drive and have been moving it around to different USB ports and controllers, and this hasn't fixed anything.

ca-server-diagnostics-20220915-0935.zip


UPDATE:  4 days 1 hour up and counting.  I ended up needing to change both "Power Supply Idle Control" and global C-states, but it appears to be stable now!  Thanks for pointing that out!

 

Will do.  I was a little thrown since I've run unRAID without issue on this board before when I was running a 1900X, but it makes sense that something may have changed when I upgraded.

Edited by SpyisSandvich
Status update
  • 2 months later...
On 12/12/2022 at 2:13 AM, JorgeB said:

Also worth completely disabling C-states if just using the power supply idle control setting since it has been reported that it can change with a kernel change.

I checked again this morning.  Global C-State Control is "Disabled", and Power Supply Idle Control is "Typical Current Idle".
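For anyone who wants to cross-check the BIOS setting from the running system, here is a minimal Python sketch. It assumes the standard Linux cpuidle sysfs layout under /sys/devices/system/cpu/cpu0/cpuidle; the function name `list_idle_states` is made up for illustration. If the kernel exposes no cpuidle directory at all, it returns an empty list. What a given board reports with Global C-State Control disabled may vary, so treat the output as a hint rather than proof.

```python
from pathlib import Path


def list_idle_states(cpuidle_dir="/sys/devices/system/cpu/cpu0/cpuidle"):
    """Return (state_dir, name, usage) for each idle state exposed for one CPU.

    The default path is the usual Linux cpuidle sysfs location; it is an
    assumption that this interface is present on the system being checked.
    """
    base = Path(cpuidle_dir)
    if not base.is_dir():
        return []  # no cpuidle interface exposed for this CPU

    states = []
    # Sort numerically so state10 comes after state9.
    for state in sorted(base.glob("state[0-9]*"), key=lambda p: int(p.name[5:])):
        name = (state / "name").read_text().strip()
        usage_file = state / "usage"
        # "usage" counts how many times the CPU entered this state.
        usage = int(usage_file.read_text()) if usage_file.exists() else None
        states.append((state.name, name, usage))
    return states


if __name__ == "__main__":
    for state_dir, name, usage in list_idle_states():
        print(f"{state_dir}: {name} (entered {usage} times)")
```

If deeper states still show up here with growing usage counts, that would suggest the C-state setting is not taking effect the way the BIOS screen claims.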

 

On 12/11/2022 at 9:09 AM, ChatNoir said:

Then set up a syslog server and post the file after the next crash.

I tried a local syslog server, but it didn't capture anything around the time of the last crash (it crashed around 22:30, and the last log message was two hours earlier). I also briefly tried remote syslog to a Debian virtual machine, but switched away from it because, for most of the 11th, the only line I ever saw was the one indicating that syslogging had started. Should I be mirroring to the flash drive, or should I try the remote solution again?

syslog-192.168.2.251.log
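To put a number on the silence before a crash, a small Python sketch like the one below can list the longest gaps between consecutive entries in a captured syslog file. It assumes classic syslog lines that start with a "Mon DD HH:MM:SS" timestamp (the format of the kernel line quoted later in this thread); the year is not in the log, so it has to be supplied by hand. The function names are illustrative only.

```python
from datetime import datetime


def parse_stamp(line, year):
    """Parse the leading 'Mon DD HH:MM:SS' of a syslog line, or return None."""
    try:
        # Classic syslog timestamps are 15 characters wide and carry no year.
        return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
    except ValueError:
        return None  # not a timestamped line (blank, wrapped, or garbled)


def largest_gaps(lines, year, top=3):
    """Return the `top` longest (gap, before, after) tuples between entries."""
    stamps = [s for s in (parse_stamp(l, year) for l in lines) if s is not None]
    gaps = [(later - earlier, earlier, later)
            for earlier, later in zip(stamps, stamps[1:])]
    return sorted(gaps, reverse=True)[:top]


if __name__ == "__main__":
    import sys

    # Usage: python gaps.py <syslog file> <year>
    with open(sys.argv[1], errors="replace") as fh:
        for gap, before, after in largest_gaps(fh, int(sys.argv[2])):
            print(f"{gap} of silence between {before} and {after}")
```

A long gap that ends at the first post-reboot message marks how long the box was silent before it went down, which helps tell "logging stopped early" apart from "nothing was logged because nothing was wrong yet".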


There really isn't much here to go on; the server frequently crashes several hours after the last log entry.

 

The part that gets me is that this was stable until I updated unRAID a month or so ago.  I realize some of the BIOS settings could have been changed, but I've been able to confirm that this hasn't happened.  I would rather not downgrade my unRAID version, and it would be difficult to switch to hardware comparable to what I'm currently running.

 

How does one troubleshoot this without any information?


Is there a better way to do this than the built-in Update OS screen?  That screen only allows me to go back one version.

 

EDIT:  I suppose I can follow this nugget.  I'll take a backup of my flash drive first though.

 

EDIT 2:  Currently testing unRAID 6.10.3, seemed to take the downgrade just fine.

 

EDIT 3:  6.10.3 crashed last night.  Testing 6.9.2 now.  If this fails too, that should rule out the software as the culprit, because that would take me back well past the point where everything was working.

Edited by SpyisSandvich

Okay, I'm cutting my 6.9.2 test short because I'm immediately seeing a flood of this message in my syslog:

Quote

Dec 21 08:23:42 CA-Server kernel: vfio-pci 0000:42:00.0: BAR 1: can't reserve [mem 0x80000000-0x87ffffff 64bit pref]

 

I'm going to halt the software regression test because it seems like my system can't handle going back this far.
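To see how large the region in that message is and which devices are flooding the log, here is a rough Python sketch. The regular expression is written against the single line quoted above, so it is an assumption that other occurrences follow the same layout; the function names are illustrative.

```python
import re
from collections import Counter

# Pattern modeled on the one quoted kernel line; other kernels may format it differently.
BAR_RE = re.compile(
    r"vfio-pci (?P<dev>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): "
    r"BAR (?P<bar>\d+): can't reserve "
    r"\[mem (?P<start>0x[0-9a-f]+)-(?P<end>0x[0-9a-f]+)"
)


def parse_bar_failure(line):
    """Extract device, BAR index, and region size from one log line, or None."""
    m = BAR_RE.search(line)
    if m is None:
        return None
    start, end = int(m["start"], 16), int(m["end"], 16)
    return {
        "device": m["dev"],
        "bar": int(m["bar"]),
        "size_mib": (end - start + 1) / 2**20,  # the range is inclusive
    }


def count_bar_failures(lines):
    """Count failures per (device, BAR) pair across many log lines."""
    hits = (parse_bar_failure(l) for l in lines)
    return Counter((h["device"], h["bar"]) for h in hits if h is not None)


if __name__ == "__main__":
    sample = ("Dec 21 08:23:42 CA-Server kernel: vfio-pci 0000:42:00.0: "
              "BAR 1: can't reserve [mem 0x80000000-0x87ffffff 64bit pref]")
    print(parse_bar_failure(sample))
```

For the quoted line this reports device 0000:42:00.0, BAR 1, and a 128 MiB region (0x80000000 through 0x87ffffff inclusive).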

 

EDIT 2:  This is still happening even now that I've gone back to 6.11.5.  Should I restore from my flash backup, or is this something deep in the system configuration that the flash wouldn't touch?

 

I want to revisit the RAM speed, since I was told previously that might be an issue with servers.  This system runs DDR4-2400 with unbuffered ECC.  UEFI was already set to use that speed, and from what I gathered for this configuration, this should be okay.  I could try bringing the speed down and seeing if that improves stability?

 

EDIT:  Downclocking my RAM only seems to have made things worse.  My computer gets into a weird loop where it spins the fans for a few seconds, then powers back down.  It does this three times and then stabilizes, at which point I'm assuming it falls back to the last known-good configuration.  It even does this after setting the speed back to normal.  It will eventually boot, though.

 

Attached logs from the 6.9.2 test in case they're helpful.

ca-server-diagnostics-20221221-0832.zip

Edited by SpyisSandvich
