Unraid 6.11.5
Server had been running (mostly) flawlessly since 2016.
About a month ago, I decided to update the firmware/BIOS on my motherboard (the update included lots of Spectre/Meltdown fixes and such).
Things were going okay for about a day, then I started getting full system freezes (dead UI, no SSH access, no Docker/VM access, SMB shares inaccessible). Recovering requires going into IPMI to force a shutdown/reboot.
Initially I thought it was tied to a VM (Nvidia passthrough), as the system would die the moment I spun a particular VM up, but I've since had the issue with all VMs and the Docker service disabled (the freeze can be triggered by an SMB transfer).
Troubleshooting:
Physically removed the GPU and a cache pool drive that had some SMART errors
Memtest on RAM
SMART drive tests
btrfs scrubs, xfs_repair runs, parity checks
Disabled everything except SMB shares
Toggled P-states and C-states in BIOS, looked for other relevant settings that might have changed
Syslog server enabled, no entries during crash
CPU/Disk temps are fine
Changed network cables
Changed switch ports
Rolled back BIOS/firmware (rolled it forward again after no change)
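Since the syslog shows nothing at the moment of a crash, one more thing that could narrow down the timing is a heartbeat logger that forces each line to disk, so the last entry before a freeze survives the forced reboot. This is only a minimal sketch, assuming Python 3 is installed on the server; the function names, the 5-second interval, and the log path in the usage note are all made up for illustration.

```python
import os
import time
from datetime import datetime, timezone


def write_heartbeat(path, note=""):
    """Append one timestamped line and force it to disk before returning."""
    stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
    line = f"{stamp} {note}".rstrip() + "\n"
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(line)
        fh.flush()
        # fsync asks the OS to push the data to the device, so the line
        # should still be there after a hard freeze and power cycle.
        os.fsync(fh.fileno())
    return line


def read_load():
    """Return the 1-minute load average as text, or '' if unavailable."""
    try:
        return f"load1={os.getloadavg()[0]:.2f}"
    except (AttributeError, OSError):
        return ""


def run(path, interval_s=5.0, max_beats=None):
    """Write a heartbeat every interval_s seconds; loop forever if max_beats is None."""
    beats = 0
    while max_beats is None or beats < max_beats:
        write_heartbeat(path, read_load())
        beats += 1
        time.sleep(interval_s)
    return beats
```

Calling something like `run("/boot/logs/heartbeat.log")` (a hypothetical path) before starting an SMB transfer would leave the timestamp of the last successful write in the file, which can then be compared against the syslog snippet. Frequent fsync writes add wear on a flash drive, so this is meant for a short test run rather than permanent use.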
Currently, the system can be stable for a couple of days of light usage (Home Assistant, Plex, casual VMs), but eventually, if I try to transfer files over SMB, it might freeze. Once I reboot the system I can transfer the same file(s) (Linux ISOs) to the same share(s) just fine (the shares are set to not use the cache). The freeze has also been triggered by adding a few files to torrent clients (I've used Deluge and Transmission; both have caused a freeze).
I still have a few things I am going to try (removing RAM sticks, substituting hardware), but it is really frustrating to have so little visibility into what is causing the freezes. Is there anything I'm missing?
I've attached diags and a small snippet of syslog that contained a freeze.
quiet-server-diagnostics-20230215-2338.zip
syslog_snippet.txt