Server Crashes -- Caused by drives or power?



Over the past ~6 months, I've had a handful of server crashes. Based on the logs, I suspect it could be related to drives or the power supply, but I wanted to get some other thoughts.

 

First off, some basic server information (built in 2021):

  • Unraid OS Plus 6.12.5 (though the crashes date back to 6.11.X)
  • Intel i7-4790K CPU (sourced from a working desktop)
  • Gigabyte GA-H97N motherboard (eBay)
  • G.Skill Ripjaws X 2x8 GB DDR3-1600 RAM
  • 4x WD Red Plus (CMR) 6 TB (from desktop, purchased 2018)
  • 2x WD Red (SMR) 6 TB (new 2021)
  • WD Blue SN550 1 TB NVMe SSD (cache, new 2021)
  • EVGA SuperNOVA G3 550W 80+ Gold power supply (new 2021)

 

PCPartPicker estimates the power consumption at 246W, so there should be plenty of margin.

 

Here are the diagnostics:

diagnostics-20240319-1902.zip

 

I have syslogs from the crashes that occurred after I enabled the syslog server (there were a handful of crashes before that). I've trimmed each log to what I think is a reasonable window before and after the crash. I can provide more of the logs if needed.

 

26 Nov 2023: Server crashes around 03:26:06, potentially while spinning down /dev/sdb.

syslog-crash-231126.txt

 

28 Nov 2023: Server crashes around 04:23:10, potentially while reading SMART from /dev/sdc.

syslog-crash-231128.txt

 

11 Dec 2023: Server crashes around 19:58:16, and I don't really have a good indication of what might have been happening when it crashed. When the UPS I was using at this time was at 100% charge, its status message was "OL+DISCHRG" even though it was using mains power, not battery; the log is a bit cluttered with messages about this. Additionally, the USB connection of the UPS was unreliable, so I had a script check for the connection once a minute and reset USB if it was disconnected. There may be messages about this in the log too.

syslog-crash-231211.txt

 

14 Dec 2023: Server crashes around 07:08:16, and the log is pretty similar to the Dec 11 log, with many messages about the UPS.

syslog-crash-231214.txt

 

31 Dec 2023: Server crashes around 16:48:35, and like the 28 Nov log, the last message was reading SMART from /dev/sdc.

syslog-crash-231231.txt

 

16 Mar 2024: After a stable period of a few months, the server crashed around 00:15:16; this time the last message was reading SMART from /dev/sdb. Following start-up, during the resulting parity check, there were many disk0 read errors. This was the only crash that was followed by such errors.

syslog-crash-240316.txt

 

19 Mar 2024: Back to more frequent crashes, this one happened around 14:31:04, while spinning down /dev/sdf.

syslog-crash-240319.txt
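To make the spacing between crashes easier to see, here is a small sketch that computes the number of days between consecutive crashes. The dates are the ones listed above; the variable names are just for illustration.

```python
from datetime import date

# Crash dates taken from the list above.
crash_dates = [
    date(2023, 11, 26),
    date(2023, 11, 28),
    date(2023, 12, 11),
    date(2023, 12, 14),
    date(2023, 12, 31),
    date(2024, 3, 16),
    date(2024, 3, 19),
]

# Days between each crash and the one before it.
gaps_days = [
    (later - earlier).days
    for earlier, later in zip(crash_dates, crash_dates[1:])
]

for crash, gap in zip(crash_dates[1:], gaps_days):
    print(f"{crash:%d %b %Y}: {gap} days since previous crash")
```

The gaps come out as 2, 13, 3, 17, 76 and 3 days, so the crashes are irregular rather than on any obvious schedule.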

 

I will acknowledge some of the drives are old and likely need replacing, but I was hoping to narrow down the cause before I replace all of my drives.


Is there any good way to start narrowing down the source?

 

On one hand, the old drives seem like the obvious place to start, but it's not consistent which one shows up in the logs for each crash.
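As a rough way to check that impression, the sketch below tallies the last logged device and activity for each of the seven crashes listed earlier. The entries are transcribed from my own summaries of the logs, not parsed from the syslog files, and `None` marks the two crashes where the log gave no clear indication.

```python
from collections import Counter

# (date, device, last logged activity), transcribed from the crash list.
# None means the log gave no clear indication of what was happening.
last_entries = [
    ("2023-11-26", "sdb", "spin down"),
    ("2023-11-28", "sdc", "SMART read"),
    ("2023-12-11", None, None),
    ("2023-12-14", None, None),
    ("2023-12-31", "sdc", "SMART read"),
    ("2024-03-16", "sdb", "SMART read"),
    ("2024-03-19", "sdf", "spin down"),
]

by_device = Counter(dev or "unknown" for _, dev, _ in last_entries)
by_activity = Counter(act or "unknown" for _, _, act in last_entries)

print("By device:  ", dict(by_device))
print("By activity:", dict(by_activity))
```

No single drive accounts for more than two of the seven crashes, which fits the observation that the logs do not point at one disk.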

 

Regarding the power supply, I find it hard to believe it's getting overloaded as it's rated over 2x the PCPartPicker estimate. I also checked the history of my UPS's measured power usage (recorded every 10s in Home Assistant) and the peak was less than 200W. I suppose the supply could be failing, but it's not that old.
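To put numbers on that margin, here is a minimal sketch of the arithmetic. The 550 W rating and 246 W estimate are from the parts list above; 200 W is used as an upper bound because the UPS history peaked below it. The function name is just for illustration.

```python
PSU_RATED_W = 550      # EVGA SuperNOVA G3 550W rating
ESTIMATE_W = 246       # PCPartPicker estimate
MEASURED_PEAK_W = 200  # upper bound: UPS history peaked below this


def load_fraction(load_w: float, rated_w: float = PSU_RATED_W) -> float:
    """Return the load as a fraction of the supply's rated output."""
    return load_w / rated_w


print(f"Estimate:      {load_fraction(ESTIMATE_W):.0%} of rating")
print(f"Measured peak: under {load_fraction(MEASURED_PEAK_W):.0%} of rating")
print(f"Rating is {PSU_RATED_W / ESTIMATE_W:.2f}x the estimate")
```

One caveat: the UPS figure is sampled every 10 s, so a brief spike (for example, several drives spinning up at once) may not show up in that history.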

 

One last note: When the server crashes, the display stays on but frozen, and the server is unresponsive to both the web GUI and a directly connected keyboard.

  • 3 weeks later...

I haven't had the chance to dig into hardware troubleshooting yet, but I realized that Home Assistant's data recording should give me a better indication of the actual failure time. I only had detailed data for the past 3 crashes, but they revealed the last entries in the Unraid log were separated from the actual crash time by 5 to 30 minutes. In my mind, this means there's probably no correlation between those entries and the root cause.
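The comparison itself is just a timestamp subtraction. The sketch below assumes both clocks are in the same time zone and treats the last sample Home Assistant recorded from the server as the approximate crash time. The syslog time is the 19 Mar 2024 entry from above; the Home Assistant time is a made-up placeholder, not a real reading.

```python
from datetime import datetime, timedelta


def log_to_crash_gap(last_syslog_entry: datetime,
                     last_ha_sample: datetime) -> timedelta:
    """Time between the last syslog line and the last Home Assistant sample."""
    return last_ha_sample - last_syslog_entry


# Last syslog entry before the 19 Mar 2024 crash (from the list above).
last_syslog = datetime(2024, 3, 19, 14, 31, 4)

# PLACEHOLDER: hypothetical time of the last Home Assistant sample.
last_ha = datetime(2024, 3, 19, 14, 45, 0)

gap = log_to_crash_gap(last_syslog, last_ha)
print(f"Gap between last log entry and crash: {gap}")
```

A gap of more than a minute or two suggests the final log line describes routine activity rather than whatever triggered the hang.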

 

I think the RAM is at the top of my suspect list right now, so I'll start with a memtest and go from there.
