Server crashes nearly every day. Random times, random loads.


Solved by JorgeB


Server seems to be crashing nearly every day after running mostly solid for a couple of years. Where can I start to look? Syslog is mirrored to flash and available if desired. The only event logged leading up to the crash is the flash backup plugin running every 30 minutes, which seems excessive to me. Sometimes the crash reboots the server, sometimes I have to reboot it manually. Connecting a monitor displays a black screen.

The only recent hardware change was a slightly different HBA card (external connectors vs. internal). It ran for a couple of weeks after that before the crashing started, though, so I doubt that is the cause. How can I start to look for clues? I would like to rule out the HBA quickly because it is still returnable.

I let the parity checks run after the crashes (4 data + 1 parity + 1 cache on the primary array, plus a 6-disk ZFS array), so I think this rules out read issues. Write issues might be ruled out by the nightly backups: the main array and cache disk back up to the ZFS array.

The only real new addition: I added a second server that is pulling a backup from this server over an NFS share on the ZFS filesystem. I switched from a btrfs pool to the ZFS pool a couple of months ago. The new backup is putting a heavy read load on that ZFS share, but it still completed last night with no error. Then, 30 minutes later, the primary server locked up and rebooted at 4:30am, then again at 6:30am. The only scheduled task during that time is a remote server outside my local network backing up to this server through an rsync docker that has a static IP. I recently found a forum post about switching from macvlan to ipvlan when running dockers with custom IPs, and made that change this morning.

cvg02-diagnostics-20221213-0834.zip


If there's nothing relevant logged in the persistent syslog, that usually points to a hardware problem. One thing you can try is to boot the server in safe mode with all docker/VMs disabled and let it run as a basic NAS for a few days. If it still crashes, it's likely a hardware problem; if it doesn't, start turning the other services back on one by one.
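For checking what was logged right before each crash, a minimal Python sketch along these lines might help. It assumes every boot writes a kernel banner containing "Linux version" to the mirrored syslog; the function name and the `context` parameter are made up for illustration and are not part of Unraid.

```python
from typing import Iterable, List

# Assumption: each boot writes a kernel banner containing this text to the syslog.
BOOT_MARKER = "Linux version"


def lines_before_each_boot(lines: Iterable[str], context: int = 20) -> List[List[str]]:
    """For every boot banner that follows earlier log lines, return the last
    `context` lines that were logged before it."""
    history: List[str] = []
    windows: List[List[str]] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if BOOT_MARKER in line:
            if history:
                # These are the final lines of the previous uptime period.
                windows.append(history[-context:])
            history = []  # start collecting the next uptime period
            continue
        history.append(line)
    return windows
```

To use it, open the mirrored syslog file (its location depends on your syslog settings) and pass the file object to `lines_before_each_boot`. If the windows before unexpected reboots show nothing but routine entries, that fits the "nothing relevant logged" case described above.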


I was able to reliably get the server to crash when writing to the cache drive SSD: in 4 out of 4 tries dd'ing 100-200GB to the cache drive, it locked up and rebooted every time. This was done while running parity checks on the main array and the ZFS array. The cache drive (and three of the hard drives) isn't connected to the HBA. I rebooted and checked the RAM with 4 passes of MemTest86 v10: passed (64GB ECC DDR4). I then rebooted into Unraid safe mode (selected from the thumb drive) and have written 500GB to the cache SSD with no hiccups. I was simultaneously scrubbing the cache drive to hammer that disk as hard as I could. No lockups. SMART attributes are clean on that SSD, BTRFS device stats are all 0, and the scrub is clean.

So then I started recreating the same load in safe mode: started a scrub of the Unraid array, imported my ZFS pool, started a scrub on it, and continued to hammer everything. No lockups whatsoever. All dockers that normally run are running fine (didn't test the others, irrelevant).
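For anyone who wants to reproduce a sustained write load without dd, here is a rough Python sketch of the same idea. It is not the exact command used above; `write_load`, its parameters, and any target path you pass in are hypothetical.

```python
import os


def write_load(path: str, total_bytes: int, chunk_bytes: int = 1 << 20) -> int:
    """Write `total_bytes` of random data to `path` in chunks, fsync once at
    the end, and return the number of bytes written."""
    written = 0
    # One random chunk is generated up front and reused for every write.
    chunk = os.urandom(chunk_bytes)
    with open(path, "wb") as f:
        while written < total_bytes:
            n = min(chunk_bytes, total_bytes - written)
            f.write(chunk[:n])
            written += n
        f.flush()
        os.fsync(f.fileno())  # push the data out of the page cache to the device
    return written
```

Calling it with a file on the cache drive and a size in the 100-200GB range would roughly match the test described above. Delete the file afterwards.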

 

So, are plugins the primary difference between safe mode and regular mode? If so, I may have a rogue plugin.
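If it does come down to a plugin, re-enabling them in halves rather than one at a time could shorten the hunt. A minimal sketch, assuming exactly one plugin is at fault and that each trial gives a reliable crash / no-crash answer (both are assumptions, and the plugin names in the usage note are placeholders):

```python
from typing import Callable, List, Sequence


def find_culprit(plugins: Sequence[str],
                 crashes_with: Callable[[List[str]], bool]) -> str:
    """Narrow down a single misbehaving plugin by halving the candidate set.

    Assumes exactly one plugin is responsible and that `crashes_with(enabled)`
    reliably reports whether the server crashed with only `enabled` turned on.
    """
    if not plugins:
        raise ValueError("need at least one plugin to test")
    candidates = list(plugins)
    while len(candidates) > 1:
        half = candidates[: len(candidates) // 2]
        if crashes_with(half):
            candidates = half                     # culprit is in the enabled half
        else:
            candidates = candidates[len(half):]   # culprit is in the other half
    return candidates[0]
```

In practice `crashes_with` is you: enable that half, rerun the load that used to crash the server, and report back. With five placeholder plugins, `find_culprit(["plugin-a", "plugin-b", "plugin-c", "plugin-d", "plugin-e"], ...)` needs three trials instead of up to five.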

Edited by wildfire305
bad grammar
3 hours ago, JorgeB said:

Not necessarily, it's been known to cause unexpected shutdowns/crashes.

Why then would it not be pulled from the app store, or at least carry an incompatibility warning? It wasted a lot of my time if it ends up being the cause. So far the server has been stable as a rock today, and I've been running it at about 400 watts' worth of processes.

Edited by wildfire305
added app store

I'm going to mark this as solved. I never would have suspected a wake-on-LAN plugin to have that much influence on system stability. I believe the plugin should carry a caution label. It didn't cause problems immediately, but removing it has resolved the issues I was having. I could imagine folks who like to fire parts cannons at problems being extremely upset at replacing hardware over a silly plugin. I'm not using "server grade" hardware, but I think it's close enough to it when you look at the base chips. And it is standardized enough that everything so far has "just worked".
