wildfire305 Posted December 13, 2022

The server seems to be crashing nearly every day after running mostly solid for a couple of years. Where can I start to look? Syslog is mirrored to flash and available if desired. The only event leading up to each crash is the flash backup plugin running every 30 minutes, which seems excessive to me. Sometimes the crash reboots the server; sometimes I have to reboot it manually. Connecting a monitor displays a black screen.

The only recent hardware change was a slightly different HBA card (external connectors instead of internal). It ran for a couple of weeks after that before this crashing started, so I doubt it's the cause, but I would like to rule out the HBA quickly because it is still returnable. I let the parity checks run after the crashes (4 data + 1 parity + 1 cache on the primary array, plus a 6-disk ZFS array), so I think that rules out read issues. Write issues might be ruled out by the nightly backups: the main array and cache disk back up to the ZFS array.

The only real new addition: I added a second server that pulls a backup from this server over an NFS share on the ZFS filesystem. I switched from a btrfs pool to the ZFS pool a couple of months ago. The new backup puts a heavy read load on that ZFS share, but it still completed last night with no errors; then 30 minutes later the primary server locked up and rebooted at 4:30 am, and again at 6:30 am. The only scheduled task during that window is a remote server outside my local network backing up to this server through an rsync Docker container that has a static IP. I recently found a forum post about switching from macvlan to ipvlan when running custom-IP Docker containers and made that change this morning.

cvg02-diagnostics-20221213-0834.zip
wildfire305 Posted December 13, 2022 (Author)

Started this command on the ZFS array to try to rule out write issues with the HBA, while also running a ZFS scrub; that ought to tax it:

"dd if=/dev/random of=test.img bs=1G count=500 status=progress"
JorgeB Posted December 13, 2022

If there's nothing relevant logged in the persistent syslog, that usually points to a hardware problem. One thing you can try is to boot the server in safe mode with all Docker containers/VMs disabled and let it run as a basic NAS for a few days. If it still crashes, it's likely a hardware problem; if it doesn't, start turning the other services back on one by one.
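Since the syslog is already mirrored to flash, one quick pass before the safe-mode test is to grep it for the signatures that usually accompany hardware faults. A minimal sketch; the sample log below is a stand-in, as on a real system you would point `scan_log` at the flash-mirrored syslog file:

```shell
# Count syslog lines matching common hardware-crash signatures:
# machine-check exceptions, kernel panics, call traces, hung tasks.
scan_log() {
  grep -icE 'mce|machine check|call trace|kernel panic|hung task' "$1"
}

# Demo against a tiny sample log (stand-in for the mirrored syslog):
printf 'kernel: mce: [Hardware Error] CPU0 machine check\nkernel: eth0 link up\n' > /tmp/sample.log
scan_log /tmp/sample.log   # prints 1 (one matching line)
```

A count of zero doesn't clear the hardware (a hard lockup often dies before anything is flushed to the log), but a nonzero count gives you a concrete line to chase.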
wildfire305 Posted December 13, 2022 (Author)

Well... that dd command crashed it. Looks like maybe I've got a clue.
wildfire305 Posted December 13, 2022 (Author)

Maybe that was my fault. I changed the command to "dd if=/dev/random of=test.img bs=1M count=1000000 status=progress" and it has completed almost a terabyte of writing so far, while also performing a full ZFS scrub. I think my previous command ran me out of RAM.
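For reference, `bs` sets dd's in-memory transfer buffer, so `bs=1G` makes GNU dd allocate a full 1 GiB buffer per transfer, while `bs=1M` keeps it tiny; the total written is bs × count either way (1 GiB × 500 = 500 GB vs. 1 MiB × 1,000,000 ≈ 1 TB). A small-scale sketch of the same test, with sizes and paths purely illustrative:

```shell
# Write test at a safe scale: 10 MiB of random data.
# bs x count gives the total written (1 MiB x 10 = 10,485,760 bytes).
# For a real disk test, adding oflag=direct (GNU dd) bypasses the
# page cache so RAM usage stays flat; note it won't work on tmpfs.
dd if=/dev/urandom of=/tmp/test.img bs=1M count=10 status=none

# Verify the size on disk:
stat -c %s /tmp/test.img   # prints 10485760
```

Using `/dev/urandom` also avoids the historical blocking behavior of `/dev/random` on entropy-starved systems; on recent kernels the two behave the same.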
wildfire305 Posted December 14, 2022 (Author)

I was able to reliably get the server to crash when writing to the cache SSD: 4 out of 4 tries, dd'ing 100-200 GB to the cache drive, it locked up and rebooted every time. This was performed while running parity checks on the main array and the ZFS array. The cache drive (and three of the hard drives) isn't connected to the HBA.

I rebooted and checked the RAM with 4 passes of MemTest86 v10: 64 GB ECC DDR4, passed. I then rebooted into Unraid safe mode (selected from the thumb drive) and have written 500 GB to the cache SSD with no hiccups, while simultaneously scrubbing the cache drive to hammer that disk as hard as I could. No lockups. SMART attributes are clean on that SSD, btrfs device stats are all 0, and the scrub is clean.

So then I recreated the same load in safe mode: started a scrub of the Unraid array, imported my ZFS pool and started a scrub on it, and continued to hammer everything. No lockups whatsoever. All the Docker containers that normally run are running fine (didn't test the others). So, are plugins the primary difference between safe mode and regular mode? If so, I may have a rogue plugin.
JorgeB Posted December 14, 2022

Yes, plugins are the only difference. Try adding them back one at a time to see if you can find the culprit.
wildfire305 Posted December 14, 2022 (Author)

The last one I installed, about a week ago, was the WOL plugin, which appeared to be partially broken. I removed it and performed the same tests, and the server did not crash. I have a hard time trusting that as the "fix", though. I would assume that plugin does nothing until you ask it to wake a computer.
JorgeB Posted December 14, 2022 (marked as Solution)

28 minutes ago, wildfire305 said:
"I would assume that plugin does nothing until you ask it to wake a computer."

Not necessarily; it's been known to cause unexpected shutdowns/crashes.
wildfire305 Posted December 14, 2022 (Author)

3 hours ago, JorgeB said:
"Not necessarily; it's been known to cause unexpected shutdowns/crashes."

Why then would it not be pulled from the app store, or at least carry an incompatibility warning? It will have wasted a lot of my time if it ends up being the cause. So far the server has been stable as a rock today, and I've been running it at about 400 watts' worth of processes.
JorgeB Posted December 15, 2022

11 hours ago, wildfire305 said:
"Why then would it not be pulled from the app store, or at least carry an incompatibility warning?"

Because it works for most.
wildfire305 Posted December 15, 2022 (Author)

I'm going to mark this as solved. I never would have suspected a wake-on-LAN plugin of having that much influence on system stability; I believe it should carry a caution label. It didn't cause problems immediately, but removing it has resolved the issues I was having. I can imagine folks who like to fire parts cannons at problems being extremely upset at replacing hardware over a silly plugin. I'm not using "server grade" hardware, but I think it's close enough when you look at the base chips, and it's standardized enough that everything so far has "just worked".
JorgeB Posted December 15, 2022

One of the first troubleshooting steps we usually recommend is booting in safe mode to rule out plugin issues.