Rock solid until today, 3 random reboots and a corrupt cache drive


bling


Logged in to my server this morning and noticed that a parity check was running, which I thought was odd.  I was like hmmmm...I didn't start that.  Maybe I had it scheduled for the end of the month?  Checked that, nope -- it's off.  Then I realized that the uptime was an hour.  My machine runs 24/7 and has been rock solid for 2-3 weeks.

 

From there, I was experimenting with a new docker app, and then bam!!  Random reboot.  After that, I turned off docker/VMs and let the parity check run to completion.  Thankfully, no errors.

 

Just now, while playing around with a docker app, it rebooted again.  This time, I'm greeted with an unmountable btrfs cache disk (can't read the superblock).

 

I was able to mount the cache disk in read-only mode with nologreplay and copy everything to the array.  I've heard horror stories from others with corrupted btrfs cache disks, so once I copy everything over I'm reformatting my cache disk to XFS.  Ironically, this kind of corruption is usually due to sudden power loss, and even though I do have a UPS hooked up, it didn't protect against the computer rebooting itself.
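In case it helps anyone else, the recovery went roughly like this (the device name and paths are examples -- substitute your own):

```
# Mount the damaged btrfs cache read-only, skipping log replay
# (nologreplay requires a read-only mount)
mkdir -p /mnt/recovery
mount -o ro,nologreplay /dev/sdc1 /mnt/recovery

# Copy everything off to a share on the array
rsync -avh --progress /mnt/recovery/ /mnt/user/cache_backup/
```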

 

Could a bad hard drive cause random reboots?  I highly suspect it's either the drive or btrfs, given that's the only thing common to all 3 reboots.  All docker containers are using the cache disk.  Thanks in advance.


Sigh....while I was rsyncing from the array back to a freshly formatted XFS cache disk, the server hard rebooted again.  So I guess that rules out the file system.  I had PuTTY tailing the syslog at the time and nothing was logged during the reboot.  I'm also tailing dmesg now...
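For reference, this is roughly what I was running when it died (paths are examples):

```
# Copy data from the array back to the freshly formatted XFS cache
rsync -avh --progress /mnt/user/cache_backup/ /mnt/cache/

# In a separate PuTTY session, watch the syslog live
tail -f /var/log/syslog
```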


It's my old rig that I repurposed as a NAS server.  4790K, ASRock mobo, 16GB RAM.  It's been rock solid since day 1, back when it was running Windows.  When I recently rebuilt it for unraid, all the hardware remained the same except for new hard drives, a PSU recently replaced under warranty, and a new UPS.

 

I'm running memtest right now, with the machine plugged directly into the wall.


Memtest passed overnight.  Rebooted unraid in safe mode, and within moments of a medium workload in a docker container, another reboot!

 

I checked /proc/sys/kernel/panic, and it's set to 0, which is the default, meaning the kernel will not auto-reboot after a panic.
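For anyone wanting to check the same thing on their box:

```
# 0 (the default) means the kernel halts on a panic instead of rebooting
cat /proc/sys/kernel/panic

# To make the kernel auto-reboot N seconds after a panic, you could do e.g.:
# sysctl -w kernel.panic=10
```

So these reboots can't be the kernel quietly panicking and restarting itself.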

 

Just swapped out the PSU with a spare...wish me luck!

4 minutes ago, bling said:

It's the PSU.  Since I swapped in the spare it hasn't crashed, regardless of what workloads I threw at it.  Did a full parity check and some bits needed correcting.

Every time I have seen the behavior you described (random reboots, especially under more than an idle load), the problem has been the PSU.  That is especially true if it reboots while the system is booting up.  The boot-up process puts brief moderate-to-heavy loads on the PSU, and if it is failing, you get another reboot.

 

The PSU and/or RAM are the usual suspects, but I always check the PSU before the RAM (although RAM tests are relatively easy to do).  That's why I keep a spare PSU around.

 

Glad you appear to have it sorted.

