I put my server together about 6 years ago and it's been really solid until recently. In the last month or so, though, I've found it completely frozen 4 times: not responding to network requests (file shares, docker containers, the admin web UI) and not responding to keyboard or mouse input at the physical machine. When this happens, the num lock light on the keyboard turns off and the caps lock and scroll lock lights start blinking. It seems to be getting more frequent, with the two most recent freezes happening after only a day or two of uptime.
The screen shows whatever was on it at the moment of the freeze, so I left `htop` open once and `dmesg --follow` the next time, but neither shows anything obvious. `htop` shows shfs using about 10% CPU and transmission using another 15% across two processes; `dmesg` shows only two recent messages:
md: sync done. time=60938sec
md: recovery thread: exit status 0
I'm not sure what those mean, but "exit status 0" sounds like "not a crash".
I'm also attaching an anonymized diagnostics bundle.
The CPU, motherboard, and RAM (i7-2600K, Asus P8P67 Pro, and 2x8GB Kingston HyperX Fury DDR3-1866) are all recycled from the desktop PC I built about 15 years ago, so my first thought is that one of them may be failing. But I'd still like to better understand what's happening.
Also, all 4 freezes have happened in the middle of the night, which makes me think some scheduled task might be triggering them.
[Edit] One other thing that comes to mind is that I switched the cache drive from a SATA SSD to an NVMe SSD a couple of months ago. I initially messed up the file ownership when copying everything over to the new SSD, which broke some of my docker containers, but I think I have it straightened out now.
Does anyone here have any ideas what the root cause might be?
unraid-diagnostics-20240226-1418.zip