I have run memtests and done single-stick / single-slot tests. Memory appears OK.
I have confirmed externally using CrystalDiskInfo that all my array drives report healthy (a smartctl cross-check is sketched after this list).
I have swapped my SATA PCIe controller for a new one.
I can reproduce it when I plug my array drives directly into the SATA ports on my motherboard.
I can reproduce it without my VM pool active, so it is not related to Docker or VMs.
I cannot reproduce it from a live CD while running stress tests.
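For anyone who wants to double-check the SMART data from the Unraid console instead of pulling disks over to CrystalDiskInfo, smartctl ships with Unraid and should show the same thing. A quick sketch (device letters are examples; substitute your own):

```
# quick pass/fail health verdict for each disk (device letters are examples)
for d in /dev/sd[a-h]; do
  echo "== $d =="
  smartctl -H "$d"
done

# full attribute dump for one disk, to eyeball reallocated/pending sector counts
smartctl -a /dev/sdb

# extended self-test: kick it off, then read the result hours later
smartctl -t long /dev/sdb
smartctl -l selftest /dev/sdb
```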
My server is in a horrible state at the moment, and I'm at the point where I'm about to go purchase a new mobo, CPU, and power supply, as they are the only things I have left to test. I have been diagnosing this for almost two weeks now.
My server will simply go unresponsive or reboot itself. Sometimes it takes 12 hours; lately it takes 30 minutes to 2 hours.
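To pin down exactly when it dies (and whether it reboots or just drops off the network), a dumb watchdog loop on a second machine works; this is just a sketch, and 192.168.1.50 is a placeholder for the server's IP:

```
# log a timestamped up/DOWN line every 10 seconds from a second machine
# (192.168.1.50 is a placeholder for the server's IP)
while true; do
  if ping -c 1 -W 2 192.168.1.50 >/dev/null 2>&1; then
    state=up
  else
    state=DOWN
  fi
  echo "$(date '+%F %T') $state"
  sleep 10
done | tee -a server-uptime.log
```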
Where this all began
Originally, I got a SMART report in Unraid claiming a disk showed signs of pre-failure, so I replaced that disk. Then, during the rebuild, I was seeing other reports of SMART failures, so I was like, "OK, my PCIe SATA controller is dying on me."
So I purchased an LSI card in IT mode, which led to this post:
https://www.reddit.com/r/unRAID/comments/1c2a8yw/parity_drives_always_going_to_error_disabled/
So basically, I tried two different LSI cards, and both would not mount the parity disks and went straight to error/disabled.
Ignoring that
I bought a new SATA PCIe controller from Amazon instead, figuring I would save myself the hassle. It works fine: it mounts all my disks, and parity rebuilt. However, now I have this issue where my Unraid server dies all the time.
Syslogs just halt; there is nothing logged. Originally I was getting some ATA errors, so I removed that disk from the equation.
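Since the local log dies with the box, getting the syslog off the machine is probably the only way to catch the tail. Unraid has a Syslog Server page under Settings (I believe since 6.7) that can mirror the log to the flash drive or forward it to a remote host; under the hood it is the classic rsyslog forwarding rule, roughly like this sketch (the IP is a placeholder):

```
# forward everything to a remote listener so the last lines survive a hard lock
# (the GUI's Syslog Server page sets this up for you; the IP is a placeholder)
*.* @192.168.1.100:514    # single @ = UDP; use @@ for TCP
```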
It does not appear to be a kernel panic: my display goes from the login screen to just being black (not asleep).
The system stays powered on. Sometimes it reboots; sometimes it just hard locks.
Pings stop working as well.
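To actually confirm or rule out a panic when the screen is already black, netconsole can stream the kernel's last messages over UDP to another machine even when nothing makes it to disk, assuming the kernel ships the module. A sketch with placeholder addresses:

```
# on the dying server: stream kernel messages to 192.168.1.100:6666
# (IPs, interface name, and MAC address are placeholders for your network)
modprobe netconsole netconsole=6665@192.168.1.50/eth0,6666@192.168.1.100/aa:bb:cc:dd:ee:ff

# on the receiving machine: listen for them
nc -u -l 6666    # or: nc -u -l -p 6666, depending on your netcat
```

If a panic trace shows up there right before the hang, it's a kernel/driver problem rather than hardware silently dying.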
What I have tried
Memtests
Stress tests on CPU/memory (see the stress-ng sketch after this list)
Shutting off Docker containers
Shutting off VMs
Downgraded the OS to 6.12.8 (I believe that was the version)
Removing the VM/Docker container drives entirely
Updating my BIOS
Resetting my BIOS to defaults
Created a new USB drive and moved my license to it, to rule out flash failures
Tested each disk individually on my Windows PC as an external disk
Plugged my entire array into my motherboard / the PCIe controller and saw the same behavior no matter the configuration
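For reference, the kind of CPU/memory stress test I mean above is along these lines (a stress-ng sketch; worker counts and sizes are examples, scale them to your hardware):

```
# hammer CPU and memory simultaneously for an hour
# (worker counts and VM size are examples; scale to your hardware)
stress-ng --cpu 8 --vm 4 --vm-bytes 2G --timeout 1h --metrics-brief
```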
My next steps
I removed the autotweaker plugin, which I installed 20 days ago, although this issue started more recently than that.
I'm about to give up and build a brand-new PC, but I fear bringing my array with me will produce the same outcome if this is an OS bug.
My conclusions
It definitely seems related to my array specifically. Running Unraid without the array started "appears" stable (I only tested this for a couple of hours).
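One more test that might separate the array from the hardware: load all the array disks at once from outside Unraid (e.g., the same live CD), since reading every drive in parallel approximates the array's worst-case I/O. A sketch with example device names:

```
# sequential-read every array disk in parallel (device letters are examples);
# this loads the controller, the cabling, and the PSU all at once
for d in /dev/sd[b-f]; do
  dd if="$d" of=/dev/null bs=1M status=progress &
done
wait
```

If this survives for hours outside Unraid but the array still kills the box, that points back at software; if it dies here too, it's hardware.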
If it is not my motherboard, maybe it is my power supply?
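One cheap data point on the PSU theory before buying one: if the board's sensor chip is supported by lm-sensors, the rail voltages are visible from the shell, and watching them sag while all the disks are spun up and busy would be telling. A sketch (label names vary per board):

```
# refresh voltage readings every 2 seconds; labels like +12V/Vcore vary by board
watch -n 2 "sensors | grep -iE 'in[0-9]+|12v|5v|vcore'"
```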
I need someone to check my diagnostics file, but unfortunately I'm not sure how much help it will be, given all the hardware swapping I've done while troubleshooting.
unraid-diagnostics-20240425-1003.zip