I recently picked up a Supermicro X8DTH-6 system to upgrade my unRAID server, and I'm running into a problem where it locks up within a few days. No events get logged by IPMI, nothing gets sent to my syslog server, and nothing shows up on the console. When this happens, it's unresponsive both over the network and at the console. I've tested maxing out the CPU cores for a while, and the sensors show no problems with heat or the power supply. The last time it happened, I tried the Alt-SysRq trick to see if I could get more diagnostic info out of it, but that didn't seem to do anything.
I was hopeful after reading the thread about kernel panics caused by a bug in macvlan, but it looks like that only (or mostly?) affected people running docker containers with static IPs on br0. This machine isn't running docker or VMs yet, because there's no license on this flash drive (my licensed copy is running on a temporary server while I try to get this sorted out). So the machine is essentially idle, but it's still hanging every couple of days.
The machine itself is an X8DTH-6 with dual Xeon E5645 CPUs, 96GB of ECC DDR3 RAM, a built-in LSI 2008-based SAS HBA, and an additional LSI 2008-based HBA in a PCIe slot. Both HBAs are flashed to IT mode with the most recent firmware, and each is connected to a SAS expander (one LSI, one HP) with some random drives attached for testing. There's an Nvidia Quadro P400 and a GeForce GTX 750 Ti in it, plus a generic USB3 card. It has two built-in NICs running in active-backup mode (besides the dedicated IPMI NIC). It has an IPMI BMC with remote KVM, and the sensors all look okay.
I'm not sure what the next step should be here. I can try rolling back to 6.8, or turning off bridging on the NIC(s), or installing a separate copy of Linux onto a drive and running various burn-in tests. Am I missing something?
tower2-diagnostics-20210405-0845.zip