To begin, I am an IT technician with 10 years of experience under my belt. Asking for help (in general) is very hard for me, so admittedly I've been holding off on posting to these forums. After about a year of troubleshooting, I'm truly out of ideas and can't think of where to go next.
I have a Dell R710 running the last supported BIOS, Lifecycle Controller, and iDRAC package.
It is equipped with a PERC H200 flashed to IT/JBOD mode.
Periodically the server crashes and I lose everything. There is no rhyme or reason as to why, and the only indicator I get (when I'm able to catch it) is that my VMs and Docker containers begin to crash, slow down, or act weird. For instance, I am running Home Assistant in a VM converted from the image on their official website, and my Z-Wave plugin begins to lose connectivity with my Z-Wave network.
I have read several posts regarding similar issues and found that some people had luck changing their Docker network interface from macvlan to ipvlan (which I have done) and running Memtest86 to verify RAM functionality. I ran Memtest86 for about 13 hours and was unable to get it to fail.
I have BTRFS, TRIM, and mover schedules set, so those are all being taken care of. There are no SMART errors either.
At this point I am at a loss and cannot for the life of me figure out how to mitigate these crashes. Any input or guidance would be appreciated. All I want is for this thing to work consistently, and between the struggles of life and taking on a mortgage, I'm running out of time and energy to fix this.
Attached are my diagnostics. If someone could please go through them and give me some pointers, I would truly appreciate it. I feel like I'm just an idiot and missed a setting somewhere.
Thanks again in advance, I can provide more information if required!
zserve-diagnostics-20230207-2144.zip