Server Crashes -- Caused by drives or power?

March 20, 2024

Over the past ~6 months, I've had a handful of server crashes. Based on the logs, I suspect it could be related to drives or the power supply, but I wanted to get some other thoughts.

First off, some basic server information (built in 2021):

Unraid OS Plus 6.12.5 (though the crashes date back to 6.11.X)
Intel i7-4790K CPU (sourced from a working desktop)
Gigabyte GA-H97N motherboard (eBay)
G.Skill Ripjaws X 2x8 GB DDR3-1600 RAM
4x WD Red Plus (CMR) 6 TB (from desktop, purchased 2018)
2x WD Red (SMR) 6 TB (new 2021)
WD Blue SN550 1 TB NVMe SSD (cache, new 2021)
EVGA SuperNOVA G3 550W 80+ Gold power supply (new 2021)

PCPartPicker estimates the power consumption at 246W, so there should be plenty of margin.
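For reference, the margin works out as in the short sketch below. It only compares the PSU's rated output with the PCPartPicker estimate quoted above; the function name `headroom` is just illustrative, and the figures are steady-state numbers that say nothing about whether the supply itself is healthy.

```python
# Figures taken from the build list and the PCPartPicker estimate above.
PSU_RATED_W = 550       # EVGA SuperNOVA G3 550W
ESTIMATED_LOAD_W = 246  # PCPartPicker estimate


def headroom(rated_w: float, load_w: float) -> tuple[float, float]:
    """Return (load as a fraction of the rating, spare watts)."""
    return load_w / rated_w, rated_w - load_w


frac, spare = headroom(PSU_RATED_W, ESTIMATED_LOAD_W)
print(f"Load: {frac:.0%} of rating, {spare} W spare")
# prints "Load: 45% of rating, 304 W spare"
```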

Here are the diagnostics:

diagnostics-20240319-1902.zip

I have syslogs from several crashes that occurred after I enabled the syslog server (there were a handful of crashes before that). I've trimmed each log to what I think is a reasonable window before and after the crash. I can provide more of the logs if needed.

26 Nov 2023: Server crashes around 03:26:06, potentially while spinning down /dev/sdb.

syslog-crash-231126.txt

28 Nov 2023: Server crashes around 04:23:10, potentially while reading SMART from /dev/sdc.

syslog-crash-231128.txt

11 Dec 2023: Server crashes around 19:58:16, and I don't really have a good indication of what might have been happening when it crashed. When the UPS I was using at this time was at 100% charge, its status message was "OL+DISCHRG" even though it was using mains power, not battery; the log is a bit cluttered with messages about this. Additionally, the USB connection of the UPS was unreliable, so I had a script check for the connection once a minute and reset USB if it was disconnected. There may be messages about this in the log too.

syslog-crash-231211.txt

14 Dec 2023: Server crashes around 07:08:16, and the log is pretty similar to the Dec 11 log, with many messages about the UPS.

syslog-crash-231214.txt

31 Dec 2023: Server crashes around 16:48:35, and like the 28 Nov log, the last message was reading SMART from /dev/sdc.

syslog-crash-231231.txt

16 Mar 2024: After a stable period of a few months, the server crashed around 00:15:16; this time the last message was reading SMART from /dev/sdb. Following start-up, during the resulting parity check, there were many disk0 read errors. This was the only time such errors occurred after a crash.

syslog-crash-240316.txt

19 Mar 2024: Back to more frequent crashes, this one happened around 14:31:04, while spinning down /dev/sdf.

syslog-crash-240319.txt

I will acknowledge some of the drives are old and likely need replacing, but I was hoping to narrow down the cause before I replace all of my drives.


March 20, 2024

Community Expert

There's nothing relevant logged that I can see, suggesting a power/hardware issue.


March 22, 2024

Author

Is there any good way to start narrowing down the source?

On one hand, the old drives seem like the obvious place to start, but it's not consistent which one shows up in the logs for each crash.

Regarding the power supply, I find it hard to believe it's getting overloaded as it's rated over 2x the PCPartPicker estimate. I also checked the history of my UPS's measured power usage (recorded every 10s in Home Assistant) and the peak was less than 200W. I suppose the supply could be failing, but it's not that old.

One last note: When the server crashes, the display stays on but frozen, and the server is unresponsive (via both the web GUI and a hardwired keyboard).


March 22, 2024

Community Expert

I would start by trying a different PSU if available, not because it's being overloaded, but because it may be failing. After that I would test with a different board, CPU, and RAM.


April 6, 2024

Author

I haven't had the chance to dig into hardware troubleshooting yet, but I realized that Home Assistant's data recording should give me a better indication of the actual failure time. I only had detailed data for the past 3 crashes, but they revealed the last entries in the Unraid log were separated from the actual crash time by 5 to 30 minutes. In my mind, this means there's probably no correlation between those entries and the root cause.
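For anyone wanting to repeat the comparison, here is a rough sketch of how the gap can be measured. It assumes the syslog lines start with the traditional "Mon DD HH:MM:SS" stamp (which carries no year, so the year is passed in). The sample log line, the hostname "Tower", and the Home Assistant timestamp are hypothetical placeholders, not values from my logs.

```python
from datetime import datetime


def parse_syslog_time(line: str, year: int) -> datetime:
    """Parse the leading 'Mon DD HH:MM:SS' stamp of a traditional syslog line."""
    stamp = " ".join(line.split()[:3])  # split() also handles 'Mar  9' padding
    return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")


def minutes_between(last_log: datetime, last_sample: datetime) -> float:
    """Minutes from the last syslog entry to the last recorded sample."""
    return (last_sample - last_log).total_seconds() / 60


# Hypothetical last line of a trimmed syslog (placeholder text).
last_log = parse_syslog_time(
    "Mar 19 14:31:04 Tower emhttpd: spinning down /dev/sdf", 2024
)
# Hypothetical time of the last Home Assistant reading before the data stopped.
last_sample = datetime(2024, 3, 19, 14, 45, 0)

gap = minutes_between(last_log, last_sample)
print(f"Gap between last log entry and last sample: {gap:.1f} min")
```

A gap of several minutes means the server kept running well after the final log entry, so that entry probably isn't what triggered the crash.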

I think the RAM is at the top of my suspect list right now, so I'll start with a memtest and go from there.
