Rock solid until today, 3 random reboots and a corrupt cache drive


bling


Logged in to my server this morning and noticed that a parity check was running, which I thought was odd.  I was like hmmmm...I didn't start that.  Maybe I had it scheduled for the end of the month?  Checked that, nope -- it's off.  Then I realized that the uptime was an hour.  My machine runs 24/7 and has been rock solid for 2-3 weeks.

 

From there, I was experimenting with a new docker app, and then bam!!  Random reboot.  After that, I turned off docker/VMs and let the parity check run to completion.  Thankfully, no errors.

 

Just now, while playing around with a docker app, it rebooted again.  This time, I'm greeted with an unmountable btrfs cache disk (can't read the superblock).

 

I was able to mount the cache disk in read-only mode with nologreplay and copy everything to the array.  I've heard horror stories from others with corrupted btrfs cache disks, so once I copy everything over I'm reformatting my cache disk to XFS.  Ironically, this kind of corruption is usually due to sudden power loss, and even though I do have a UPS hooked up, it didn't protect against the computer rebooting itself.
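In case it helps anyone else, the recovery went roughly like this (the device name and paths are examples -- substitute your own):

```
# Mount the damaged btrfs cache read-only, skipping log replay
# (nologreplay requires a read-only mount)
mkdir -p /mnt/recovery
mount -o ro,nologreplay /dev/sdc1 /mnt/recovery

# Copy everything off to a share on the array
rsync -avh --progress /mnt/recovery/ /mnt/user/cache_backup/
```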

 

Could a bad hard drive cause random reboots?  I highly suspect it's either the drive or btrfs, given that's the only thing common to all 3 reboots.  All docker containers are using the cache disk.  Thanks in advance.


Sigh....while I was rsyncing from the array back to a freshly formatted XFS cache disk, the server hard rebooted again.  So I guess that rules out the file system.  I had PuTTY tailing the syslog at the time and nothing was logged during the reboot.  I'm also tailing dmesg now...
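For reference, this is roughly what I was running when it died (paths are examples):

```
# Copy data from the array back to the freshly formatted XFS cache
rsync -avh --progress /mnt/user/cache_backup/ /mnt/cache/

# In a separate PuTTY session, watch the syslog live
tail -f /var/log/syslog
```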


It's my old rig that I repurposed as a NAS server.  4790K, ASRock mobo, 16GB RAM.  It's been rock solid since day 1, back when it was running Windows.  When I recently rebuilt it for unraid, all the hardware remained the same except for new hard drives, a PSU recently replaced under warranty, and a new UPS.

 

I'm running memtest right now, with the machine plugged directly into the wall.


Memtest passed overnight.  Rebooted unraid in safe mode, and within moments of a medium workload in a docker container, another reboot!

 

I checked /proc/sys/kernel/panic, and it's set to 0, which is the default, meaning the kernel will not auto-reboot after a panic.
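For anyone wanting to check the same thing on their box:

```
# 0 (the default) means the kernel halts on a panic instead of rebooting
cat /proc/sys/kernel/panic

# To make the kernel auto-reboot N seconds after a panic, you could do e.g.:
# sysctl -w kernel.panic=10
```

So these reboots can't be the kernel quietly panicking and restarting itself.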

 

Just swapped out the PSU with a spare...wish me luck!

4 minutes ago, bling said:

It's the PSU.  Since I swapped in the spare it hasn't crashed, regardless of what workloads I threw at it.  Did a full parity check and some bits needed correcting.

Every time I have seen the behavior you described (random reboots, especially under more than an idle load), the problem has been the PSU.  That is especially true if it reboots while the system is booting up.  The boot-up process puts brief moderate-to-heavy loads on the PSU, and if it is failing, you get another reboot.

 

The PSU and/or RAM are the usual suspects, but I always check the PSU before the RAM (although RAM tests are relatively easy to do).  That's why I keep a spare PSU around.

 

Glad you appear to have it sorted.

