Batman Posted March 20

Over the past ~6 months, I've had a handful of server crashes. Based on the logs, I suspect they could be related to the drives or the power supply, but I wanted to get some other thoughts.

First off, some basic server information (built in 2021):

Unraid OS Plus 6.12.5 (though the crashes date back to 6.11.X)
Intel i7-4790K CPU (sourced from a working desktop)
Gigabyte GA-H97N motherboard (eBay)
G.Skill Ripjaws X 2x8 GB DDR3-1600 RAM
4x WD Red Plus (CMR) 6 TB (from desktop, purchased 2018)
2x WD Red (SMR) 6 TB (new 2021)
WD Blue SN550 1 TB NVMe SSD (cache, new 2021)
EVGA SuperNOVA G3 550W 80+ Gold power supply (new 2021)

PCPartPicker estimates the power consumption at 246W, so there should be plenty of margin.

Here are the diagnostics: diagnostics-20240319-1902.zip

I have syslogs from several crashes that happened after I enabled the syslog server (there were a handful of crashes before this). I've trimmed the logs to what I think is a reasonable amount of time before and after each crash. I can provide more of the logs if needed.

26 Nov 2023: Server crashes around 03:26:06, potentially while spinning down /dev/sdb. syslog-crash-231126.txt

28 Nov 2023: Server crashes around 04:23:10, potentially while reading SMART from /dev/sdc. syslog-crash-231128.txt

11 Dec 2023: Server crashes around 19:58:16, and I don't have a good indication of what might have been happening when it crashed. When the UPS I was using at this time was at 100% charge, its status message was "OL+DISCHRG" even though it was using mains power, not battery; the log is a bit cluttered with messages about this. Additionally, the USB connection of the UPS was unreliable, so I had a script check for the connection once a minute and reset USB if it was disconnected. There may be messages about this in the log too. syslog-crash-231211.txt

14 Dec 2023: Server crashes around 07:08:16, and the log is pretty similar to the Dec 11 log, with many messages about the UPS. syslog-crash-231214.txt

31 Dec 2023: Server crashes around 16:48:35, and like the 28 Nov log, the last message was reading SMART from /dev/sdc. syslog-crash-231231.txt

16 Mar 2024: After a stable period of a few months, the server crashed around 00:15:16; this time the last message was reading SMART from /dev/sdb. Following start-up, during the resulting parity check, there were many disk0 read errors. This was the only time such errors occurred after a crash. syslog-crash-240316.txt

19 Mar 2024: Back to more frequent crashes; this one happened around 14:31:04, while spinning down /dev/sdf. syslog-crash-240319.txt

I will acknowledge that some of the drives are old and likely need replacing, but I was hoping to narrow down the cause before I replace all of my drives.
JorgeB Posted March 20

There's nothing relevant logged that I can see, suggesting a power/hardware issue.
Batman Posted March 22

Is there any good way to start narrowing down the source? On one hand, the old drives seem like the obvious place to start, but the drive that shows up in the logs isn't consistent from crash to crash.

Regarding the power supply, I find it hard to believe it's getting overloaded, as it's rated at over 2x the PCPartPicker estimate. I also checked the history of my UPS's measured power usage (recorded every 10s in Home Assistant) and the peak was less than 200W. I suppose the supply could be failing, but it's not that old.

One last note: When the server crashes, the display output stays on screen but is frozen, and the server is unresponsive (via both the web GUI and a hardwired keyboard).
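The margin argument in the post above can be checked with a few lines of arithmetic. This is only a minimal sketch using the figures quoted in the thread (550W rating, 246W PCPartPicker estimate, peak UPS reading below 200W); the variable and function names are made up for illustration. Note that a UPS reading is taken on the AC side of the power supply, so it includes conversion losses, and the DC load on the supply would be somewhat lower than the number used here.

```python
# Figures quoted in the thread; names are illustrative only.
PSU_RATED_W = 550        # EVGA SuperNOVA G3 rating
ESTIMATED_W = 246        # PCPartPicker estimate
PEAK_MEASURED_W = 200    # upper bound on the UPS reading ("less than 200W")


def headroom(rated_w: float, load_w: float) -> float:
    """Return how many times the rated capacity exceeds the load."""
    return rated_w / load_w


def load_fraction(rated_w: float, load_w: float) -> float:
    """Return the load as a fraction of the rated capacity."""
    return load_w / rated_w


estimate_headroom = headroom(PSU_RATED_W, ESTIMATED_W)
peak_fraction = load_fraction(PSU_RATED_W, PEAK_MEASURED_W)

print(f"Rating vs. estimate: {estimate_headroom:.2f}x")
print(f"Measured peak is at most {peak_fraction:.0%} of the rating")
```

Both numbers support the point that overload is unlikely. They say nothing about whether the supply is degrading, which is the failure mode raised in the reply below.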
JorgeB Posted March 22

I would start by using a different PSU if available, not because it's being overloaded, but because it may be failing. After that I would test with a different board, CPU, and RAM.
Batman Posted April 6

I haven't had the chance to dig into hardware troubleshooting yet, but I realized that Home Assistant's data recording should give me a better indication of the actual failure time. I only had detailed data for the past 3 crashes, but it showed that the last entries in the Unraid log came 5 to 30 minutes before the actual crash. In my mind, this means there's probably no correlation between those entries and the root cause.

I think the RAM is at the top of my suspect list right now, so I'll start with a memtest and go from there.
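The comparison described above, between the last syslog entry and the last telemetry sample, can be sketched as follows. This is a rough illustration, not the poster's actual method: the sample log line is a made-up example in classic syslog format (only the 14:31:04 time and the /dev/sdf spin-down come from the thread), the 14:50:00 telemetry timestamp is invented, and the function names are hypothetical. Classic syslog stamps carry no year, so the year is borrowed from the telemetry sample.

```python
from datetime import datetime, timedelta


def parse_syslog_time(line: str, year: int) -> datetime:
    """Parse the leading 'Mon DD HH:MM:SS' stamp of a classic syslog line."""
    # split() also copes with the double space used for single-digit days.
    stamp = " ".join(line.split()[:3])
    return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")


def gap_before_crash(last_log_line: str, last_sample: datetime) -> timedelta:
    """Time between the last syslog entry and the last telemetry sample."""
    return last_sample - parse_syslog_time(last_log_line, last_sample.year)


# Hypothetical inputs: an illustrative log line and an invented timestamp
# for the last power reading recorded before the server stopped responding.
last_line = "Mar 19 14:31:04 Tower emhttpd: spinning down /dev/sdf"
last_sample = datetime(2024, 3, 19, 14, 50, 0)

gap = gap_before_crash(last_line, last_sample)
print(f"Last log entry preceded the last telemetry sample by {gap}")
```

A gap of several minutes, as in this made-up case, would suggest the final log entry is simply the last routine event before the hang rather than its trigger.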