
Diagnosis of crashing server?



I have an unraid server that spontaneously reboots every 8 to 24 hours.

 

This is a server in my home - it didn't get a lot of use over the summer.  About 3 weeks ago I needed to run a couple of virtual machines and noticed that the unraid server was down.  I have no idea how long it had been down, because I hadn't used it since early summer.

 

I restarted it, it came up, and things were fine - for about 6 hours.  Sometime after I'd gone to bed, the server crashed/hung.  There was nothing in syslog (it's all in RAM) and nothing on the console.  I'd had an open ssh login...  nothing there that I could see.

 

Over the last 3 weeks I've been trying to figure out what is going on.  The box will NOT stay up more than about a day, sometimes less.  At first the box always hung - it did not come back up cleanly.  Of the many things I tried, I moved my flash drive to a windows box and chkdsk'd it.  After that the box no longer hung, but it still crashed periodically and rebooted.

 

 

What I've tried so far:

 

  1. Remote syslog to another box on the network - there are no log entries anywhere near the time of the crash (as evidenced by the uptime on the box when I come back to it).  I also left open ssh connections running tail -f on multiple log files (e.g. tail -f /var/log/* /var/log/*/*).  I find nothing in the logs for the times the server crashes/reboots.  (A quick way to sanity-check the remote syslog path is sketched just after this list.)
  2. I left memtest86 running for about 16 hours - 5 full passes of my 32GB of RAM with no issues.
  3. I chkdsk'd the flash drive on a windows box.  This changed 'crash and hang' into 'crash and reboot'.  Not sure whether it was just physically removing the USB stick and reinserting it (maybe into a different USB port?) or the chkdsk itself that did it - windows found no issues with the filesystem.
  4. My mirrored cache SSDs, which I got from a hardware liquidator, weren't erroring but were past the end of their life according to SMART, so I replaced them.  I moved everything off the cache, removed the cache entirely, swapped in two new 1TB WD Blue SSDs for the old ones, reconfigured the cache, moved my VMs back to cache, etc.  This had no effect on stability.
  5. I swapped out the power supply of the box.  No change.
  6. On one of the days with very long uptime, I changed my KVM config to allow use of most of the RAM and all the CPU cores and ran stress-ng inside a virtual machine for about 18 hours (roughly the invocation sketched after this list) - the server crashed/rebooted at about 26 hours of uptime.
  7. Early on I disabled all my docker containers, and I've gone through several crash/reboot cycles with KVM disabled as well.  The box is doing virtually nothing.
  8. I've never been at the computer, paying attention, when it crashes - it always happens while I'm away at work, asleep in bed, etc.
  9. I upgraded unraid from 6.8.3 to 6.9.2.  No change.
  10. As part of that unraid version upgrade, one of the odd things I noticed in syslog was that my Intel X520-T2 10GBase-T network card started producing strange log entries.  It seems that some very old firmware versions of this card produce spurious log entries on recent linux kernels, which even more recent kernels fix.  Since the card runs very hot and I'd wanted to replace it anyway, I swapped out the network card for a different Intel card - cooler, newer firmware, etc.  This didn't change my crash/reboot behavior.
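
For anyone setting up the same thing: a quick way to sanity-check that the remote syslog path actually works is something like the following - the 'crashtest' tag is just a made-up example.

logger -t crashtest "remote syslog sanity check"
# then watch for the tagged line on the remote syslog server; if test
# messages like this arrive but nothing shows up at crash time, the box
# is presumably dying faster than syslog can flush anything out.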
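
Roughly the kind of stress-ng invocation I mean for item 6 - the worker counts and memory percentage here are placeholders, not my exact command:

stress-ng --cpu 0 --vm 4 --vm-bytes 75% --timeout 18h --metrics-brief
# --cpu 0 starts one CPU worker per online core; the --vm workers keep
# most of the guest's RAM busy for the length of the run.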

 

I have an old low-end Radeon HD 7750 GPU I could remove from the system - I haven't done that yet.  At this point I could swap out the motherboard, or the RAM...  I can't seem to get any "this is why I crashed" kind of info out of it.  Attached, you'll find my 'diagnostics' output, anonymized.  I have a mildly unusual network configuration - all my VMs are on a bridge on a separate network that sits in a DMZ in my house; that NIC is plugged into a different port on my firewall.

 

Ideas on what to try?  How can I further diagnose this?  I can come up with about a week's worth of syslogs from my remote syslog server, but there really appears to be nothing in them.

 

 

Thanks in advance for any ideas.

 

 

netdrive2-diagnostics-20211003-0915.zip


And since I last posted, it crashed after about 3 hours...  Here is the syslog for the last day or so, showing two crashes (and showing me logging in/out, fiddling, etc.).  This is from a remote syslog server I set up (Windows 10 running SolarWinds Kiwi Syslogd).

 

In this last case the server crashed about 1 hour after the parity check finished and the two array drives spun down.

 

In an open ssh connection I had 'dmesg -w' running.  It showed lines from the last boot, then the md sync finishing, then the connection closed with no more data:

[   64.241626] br1: port 3(vnet1) entered blocking state
[   64.241628] br1: port 3(vnet1) entered forwarding state
[ 1351.012659] mdcmd (37): check
[35035.688938] md: sync done. time=34986sec
[35035.692589] md: recovery thread: exit status: 0

Shared connection to 192.168.7.251 closed.

 

 

and in another ssh connection, a tail -f of syslog showed:

 

Oct  3 10:37:56 netdrive2 emhttpd: spinning down /dev/sdg
Oct  3 10:37:56 netdrive2 emhttpd: spinning down /dev/sdf
Shared connection to 192.168.7.251 closed.

 

 

 

I'm stumped.  

 

Hardware I've changed: NIC, cache SSDs, power supply, a fan that was kinda loud

Hardware I can easily change: RAM; I could also change the flash drive or remove the video card

Hardware it would be a giant PITA to change: motherboard, processor, array drives (unless you have a pair of 6TB drives you can loan me :)

SyslogCatchAll-2021-10-03.txt


I've made that bios change (the bios was already at the latest version...).  We'll see if it helps.  What makes me skeptical of this as a fix is that "I haven't changed anything" (famous last words) since last spring, when I was using the server a lot.  I don't know what would have caused it to develop this behavior - I didn't upgrade the version of unraid; I basically didn't even touch the server before the issue started happening.

 

While I'm skeptical, I'm 100% willing to try what you suggest and I'll wait to see if it helps, with hope and fingers crossed.


Ok - 1 day and 6 hours of uptime so far...  looks like changing that BIOS setting for my 3rd gen Ryzen proc was the fix.

 

...."Power Supply Idle Control" (or similar) and set it to "typical current idle"....

 

The odd thing is that I ran at least 8 months without that setting and was very stable - I have no idea what changed to suddenly make this issue happen.
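
For anyone who hits this later: if you want to see which idle states the kernel is actually exposing and using (assuming your kernel has the cpuidle sysfs interface - paths may vary), something like this shows them:

cat /sys/devices/system/cpu/cpu0/cpuidle/state*/name
grep . /sys/devices/system/cpu/cpu0/cpuidle/state*/usage
# the first lists the idle states the kernel knows about, the second
# shows how often cpu0 has entered each one - handy for confirming the
# box really is sitting in deep idle around the time it dies.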

 

At any rate, thanks much for the link.  
