February 1, 2022 I have installed Unraid 6.9.2 on a new server. I'm using 4 disks, of which 1 is parity. I added a Samsung 980 PRO 2TB disk as cache. I've installed a couple of Docker containers and I'm not using any VMs. Everything ran smoothly up to that point. Since I don't want to lose my data if the cache disk breaks down, I added a second cache disk (same model and size as the first). Since then my server crashes constantly, sometimes right after starting the array and the Docker containers, but usually within a few minutes. I've set up a syslog server so I could save the logs before the crash, but nothing is written to the logs. The server becomes totally unresponsive and I can't connect to it anymore. When I removed the second disk from the cache pool, everything worked again, and when I put it back in, it crashed again. The disk I added is brand new, and neither the SMART test nor the badblocks test shows any errors. What I sometimes notice is that the CPU locks up right before the crash. When I have top open, I see a lot of wait states from the CPU. Does anyone have any idea what could be wrong or how I can debug the issue? brain-diagnostics-20220201-1623.zip
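One way to put a number on the wait states seen in top is to sample the aggregate `cpu` line of `/proc/stat` twice and compare the iowait counter with the total. The following is only a minimal sketch under that assumption; the function names are made up for illustration, and it says nothing about what causes the waits.

```python
import time


def parse_cpu_line(line):
    """Return (iowait, total) jiffies from the aggregate 'cpu' line of /proc/stat."""
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    # Field order: user nice system idle iowait irq softirq steal
    values = [int(v) for v in fields[1:9]]
    return values[4], sum(values)


def iowait_percent(before, after):
    """Share of CPU time spent in iowait between two samples of the 'cpu' line."""
    io0, total0 = parse_cpu_line(before)
    io1, total1 = parse_cpu_line(after)
    delta_total = total1 - total0
    if delta_total <= 0:
        return 0.0
    return 100.0 * (io1 - io0) / delta_total


def read_cpu_line(path="/proc/stat"):
    """Read the first line of /proc/stat, which is the aggregate 'cpu' line."""
    with open(path) as f:
        return f.readline()


def sample_iowait(interval=5.0):
    """Take two samples 'interval' seconds apart and return the iowait percentage."""
    first = read_cpu_line()
    time.sleep(interval)
    return iowait_percent(first, read_cpu_line())
```

Calling `sample_iowait()` in a loop and sending each value to the remote syslog server would at least record how quickly iowait climbs before the box stops responding.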
February 5, 2022 Author I've been debugging for the last couple of days. What I have tried without success:
- upgrading to 6.10-rc2
- switching the network from macvlan to ipvlan
Then I managed to get logs just before the crash, and there seem to be NVMe errors. The strange thing is that they also occur on the other drive. It seems one of the NVMe drives produces a read error and after that the server hangs in an I/O wait state. I googled but couldn't find any solution; some suggest it has to do with power management settings. I set the disks to never spin down (they are SSDs, so there's no need anyway), without any success. Any help would be appreciated, I'm a bit lost now.
February 5, 2022 Hi, I'm new to Unraid but have browsed hardware quite a bit. Some motherboards make one onboard SATA port unavailable when you use specific M.2 slots. I don't know whether that could be the case here?
February 15, 2022 Author Hi Felixen, Thank you for your suggestion. I checked the mainboard and this is not the issue: SATA 5 and 6 are shared with M.2, but I am only using SATA 1 to 4. I did some more research and tried a lot of different things, all without success:
- Prevent the M.2 drive from going to sleep with the kernel parameter: nvme_core.default_ps_max_latency_us=0
- Some more M.2 tweaks: pcie_aspm.policy=performance pcie_aspm=off pcie_port_pm=off nvme_core.default_ps_max_latency_us=0 nvme_core.io_timeout=255 nvme_core.max_retries=10 nvme_core.shutdown_timeout=10
- Disable IOMMU: iommu=off
It seems to happen more often under heavy I/O. When the system is mostly idle it stays online for hours, but under heavy I/O it crashes within minutes. I still don't have any clue what is causing this.
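When testing boot parameters like the ones listed above, it can help to confirm that the running kernel actually received them, since a typo or editing the wrong boot entry silently leaves the old settings in place. The sketch below compares `/proc/cmdline` with a set of expected values. The `EXPECTED` dictionary and the function names are illustrative assumptions, and the simple whitespace split does not handle quoted values that contain spaces.

```python
def parse_cmdline(cmdline):
    """Split a kernel command line into {name: value}; bare flags map to None."""
    params = {}
    for token in cmdline.split():
        name, sep, value = token.partition("=")
        params[name] = value if sep else None
    return params


def missing_params(cmdline, expected):
    """Return expected parameters that are absent or set to a different value.

    The result maps each offending name to the value found (None if absent).
    """
    active = parse_cmdline(cmdline)
    return {
        name: active.get(name)
        for name, wanted in expected.items()
        if active.get(name) != wanted
    }


# Illustrative subset of the parameters mentioned in the post.
EXPECTED = {
    "pcie_aspm": "off",
    "pcie_port_pm": "off",
    "nvme_core.default_ps_max_latency_us": "0",
    "iommu": "off",
}


def check_running_kernel(path="/proc/cmdline"):
    """Compare the running kernel's command line against EXPECTED."""
    with open(path) as f:
        return missing_params(f.read(), EXPECTED)
```

An empty dictionary from `check_running_kernel()` means every expected parameter is on the command line; it does not show that the driver honoured it.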
February 13, 2023 Author I've upgraded to a newer version and the issue magically went away, but I still don't know why.