Hi,
Over the last few weeks I've been getting random crashes that have become steadily more frequent. What started as perhaps one crash every two weeks became one a week, and now it's virtually every day.
I've run a memtest, which sadly found no errors. I also temporarily removed the NVMe cache drives, because during some of the crashes I could only write to the HDDs and not to the NVMe drives until I did a hard reset, but that made no difference. I've upgraded to 6.10rc1 because of the macvlan crash; I originally had host access to containers enabled (it's now disabled and I'm using ipvlan). The voltages shown on the IPMI seem fine, so I doubt it's a developing power supply fault. At this point I'm stumped. The hardware (other than the hard drives) is pretty new too, and it was reliable for quite some time.
I've now enabled mirroring of the syslog to flash, and I've attached a snippet of what I was able to retrieve before I needed to hard reset again. Next time it crashes I should be able to get a full syslog.
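For anyone in a similar situation, a mirrored syslog can be skimmed for common kernel trouble markers with a short script. This is only a rough sketch: the pattern list is an assumption based on strings that commonly show up in Linux kernel logs, and the names `scan_syslog` and `scan_file` are made up for illustration, not part of Unraid or any tool.

```python
import re
from collections import Counter

# Assumed markers: strings that commonly appear in Linux kernel logs
# around crashes. Adjust the list to whatever your own log shows.
PATTERNS = {
    "call_trace": re.compile(r"Call Trace", re.IGNORECASE),
    "machine_check": re.compile(r"Machine check|\bmce:", re.IGNORECASE),
    "kernel_bug": re.compile(r"\bBUG:|\bOops\b|kernel panic", re.IGNORECASE),
    "nvme": re.compile(r"\bnvme", re.IGNORECASE),
    "macvlan": re.compile(r"macvlan", re.IGNORECASE),
}


def scan_syslog(lines):
    """Return (counts per label, list of (line number, label, text)) for matching lines."""
    counts = Counter()
    hits = []
    for number, line in enumerate(lines, start=1):
        for label, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[label] += 1
                hits.append((number, label, line.rstrip()))
    return counts, hits


def scan_file(path):
    """Hypothetical helper: scan a syslog file saved from the flash drive."""
    # errors="replace" keeps the scan going if the log was cut off mid-write.
    with open(path, errors="replace") as handle:
        return scan_syslog(handle)
```

Calling `scan_file` on the saved syslog returns the counts and the matching lines, so the entries just before the last hard reset can be checked first. A match only points at a place to look; it doesn't identify the faulty component.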
I'm hoping someone has an idea of what might be causing this, beyond "it could be the motherboard, CPU, memory, hard drives or power supply", which sadly doesn't narrow things down much.
Let me know if you have any questions. Thanks in advance.
Basic summary of specs:
AMD Threadripper Pro 3995WX 64-Core CPU
512GB DDR4 ECC RDIMM (64GB x 8 at 3200MHz), Kingston Server Premier
ASUS WRX80-E SAGE Wifi Motherboard
Corsair AXi 1200 PSU
ASUS ROG 1080Ti OC GPU
Samsung 970 Pro 512GB NVMe x 4
Western Digital Red NAS drives for general storage and parity, 2 x 10TB and 3 x 4TB
EDIT: It looks like one or both of two changes has solved the instability: lowering the memory clock speed below the value officially listed as compatible with my motherboard on Kingston's website, and disabling global C-states control. I've not had any issues since changing those settings. Thanks for the help! Fingers crossed it stays this way.
unraid_syslog_snippet.txt tower-diagnostics-20210901-1659.zip