Hard crash every 3-6 weeks


Adeon

Hi,

 

What I mean by "hard crash" is that the server just turns off completely.

IPMI is still reachable, but I cannot power the server back on via IPMI.

I can only turn it back on if I unplug the power and wait a couple of seconds (discharging the capacitors).

 

I have set up the syslog server, but there is nothing useful in the logs at all.

The crash always happens while the server is idling at 45 W.

 

The server is connected to a PDU that is connected to a UPS.

No other device connected to the UPS turns off (modem, router, etc.).

 

Hardware:

  • Motherboard - Supermicro X11SCH-F
  • CPU - Xeon E-2278G
  • RAM - 64 GB DDR4 ECC @ 2666 MHz
  • PSU - Seasonic Focus PX 750W (80 Plus Platinum)
  • 3 x Toshiba 16TB HDDs
  • 2 x Silicon Power 1TB M.2 NVMe drives

 

The server ran smoothly for 1.5 years.

The crashes started about 3 months ago (no hardware changes at that time).

 

 

What I have done so far:

  • MemTest86+ 1 pass
  • Changed custom network to "ipvlan"
  • 1 hour Prime95 small FFTs
    • 150W max power draw
    • 68°C max CPU temp
  • Checked and reseated all power connections on the motherboard
  • Reseated the 2 RAM modules
  • After the last crash, I connected the server directly to the UPS (without the PDU)

 

The relevant part of the syslog is attached.

I have an appdata backup script running at 4:30 that stops and starts my Docker containers.

I don't know the exact time, but I think the server crashed between 11:00 and 13:00.
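Since the syslog does not show when the power-off happened, one way to narrow down the crash time is a small heartbeat file. The following is only a sketch under assumptions: the function names and the log path are made up for illustration, and it is not part of Unraid or any plugin. It appends a timestamp at a fixed interval and forces it to disk, so the last entry that survives gives a lower bound for the crash time.

```python
import os
import time
from datetime import datetime
from pathlib import Path
from typing import Optional


def write_heartbeat(path: Path, now: Optional[datetime] = None) -> None:
    """Append one ISO timestamp line and force it onto the disk."""
    stamp = (now or datetime.now()).isoformat(timespec="seconds")
    with path.open("a") as fh:
        fh.write(stamp + "\n")
        fh.flush()
        os.fsync(fh.fileno())  # do not leave the line in the page cache


def last_heartbeat(path: Path) -> Optional[datetime]:
    """Return the newest parseable timestamp, or None if there is none."""
    if not path.exists():
        return None
    # Walk backwards so a line truncated by the power loss is skipped.
    for line in reversed(path.read_text().split()):
        try:
            return datetime.fromisoformat(line)
        except ValueError:
            continue
    return None


def run(path: Path, interval_s: int = 60) -> None:
    """Write a heartbeat every interval_s seconds until the process dies."""
    while True:
        write_heartbeat(path)
        time.sleep(interval_s)
```

To use it, call `run()` with a file on storage that survives a power loss (the path is up to you), and after the next crash call `last_heartbeat()` on the same file. With a 60 second interval, the crash happened within about a minute after the returned timestamp.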

 

The Unraid version is 6.11.1; the crashes also happened with 6.10.3.

 

Syslog.txt

Edited by Adeon

I just remembered that I disabled "Restart after AC loss" in the BIOS, so that is why I have to restart the server manually.

There is also nothing useful in the IPMI logs (the only entries are "ACPowerOn(OEM)").

After the parity check is done, I'll update the BIOS and BMC and run MemTest86+ again, this time with a downloaded version, because I learned that the version bundled with Unraid cannot report ECC errors.

 

 

If the server crashes again, I'll try another PSU and reseat the CPU.

 

What is weird to me is that the server has never crashed under load, only when idling, and the server is idling 90% of the time.

Because of that, I changed the "Normal CPU Scaling Governor:" back to "Performance" (was "Power Save") in the Tips & Tricks plugin, just in case there is an issue with C-states or P-states.
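To confirm that the governor change actually applied to every core, the current setting can be read from sysfs. This is a minimal sketch, assuming the standard Linux cpufreq layout under `/sys/devices/system/cpu`; the function name is made up for illustration, and the exact governor strings depend on the cpufreq driver in use.

```python
from pathlib import Path
from typing import Dict


def read_governors(
    sysfs_root: Path = Path("/sys/devices/system/cpu"),
) -> Dict[str, str]:
    """Map each CPU directory name (cpu0, cpu1, ...) to its active governor."""
    governors = {}
    pattern = "cpu[0-9]*/cpufreq/scaling_governor"
    for gov_file in sorted(sysfs_root.glob(pattern)):
        cpu_name = gov_file.parent.parent.name
        governors[cpu_name] = gov_file.read_text().strip()
    return governors
```

Printing `read_governors()` on the server should show the same governor for every core; a mix of values would suggest the setting did not stick after a reboot.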

 

I'll keep updating this thread and hopefully find a solution.

 

Edit:

Ran Memtest for ~14 hours (4 passes) with 0 errors.

 

Edited by Adeon
  • 4 weeks later...

A little update:

 

The server crashed again today after running smoothly for ~4 weeks.

I checked the logs every day and there was nothing suspicious at all.

The server was pretty much idling all the time.

Again, there is absolutely nothing in the syslogs and nothing in the IPMI logs.

 

I just can't imagine that this is caused by a broken PSU.

 

I'm going to disable the last couple of Docker containers that I installed 3-4 months ago.

If the server crashes again, I'll change the PSU.

Edited by Adeon
Are you using the iGPU in the 2278G for transcoding via the i915 drivers?  If so, how are you loading the drivers?

 

A thread about issues others have had with the i915 drivers can be found here.  I am not saying that is your problem.  This is my reply in that thread about what I did to solve the server crashes that seemed to be related to i915.

 

I have a 2288G in my server and it started crashing with the 6.10.x releases of unRAID.  Nothing at all useful in the syslog.  I had the GPU Statistics and Intel-GPU-Top plugins installed.  As a test, I removed them and the crashes (about once a week) stopped.  As a further test, I re-installed these plugins with the 6.11.1 release and my unRAID server locked up five days later.  I uninstalled the plugins again and it has been smooth sailing ever since.

 

I let the i915 drivers load via the touch method.
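For anyone unfamiliar with the "touch method": as far as I understand it (this is an assumption, the post does not spell it out), it means creating an empty `i915.conf` file in `/boot/config/modprobe.d` on the flash drive so that the i915 driver is loaded at boot. A minimal sketch of that step, with a made-up function name:

```python
from pathlib import Path


def enable_i915(conf_dir: Path = Path("/boot/config/modprobe.d")) -> Path:
    """Create an empty i915.conf in conf_dir (assumed Unraid location)."""
    conf_dir.mkdir(parents=True, exist_ok=True)
    conf = conf_dir / "i915.conf"
    conf.touch(exist_ok=True)  # an existing file is left untouched
    return conf
```

A reboot would be needed before the change takes effect.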

 

I also uninstalled the CoreFreq plugin.

Edited by Hoopster
3 hours ago, Adeon said:

Yes, the iGPU is for transcoding, but I pretty much never use it.

It does not have to be in use to cause server lockups.  In fact, for most people, including me, it always happened when the server was fairly idle.

 

Let's see if removing those plugins works for you as it did for me. 

Edited by Hoopster