Apex_Budi, posted December 1, 2022

Hi guys,

I've recently been tinkering with my Unraid server. I added a PCIe card with a 4-port NIC to use with a pfSense VM, and I was also using powertop auto-tune. Everything was going fine, but I started getting stability issues, specifically with my cache drive. I had a single Kingston KC2500 NVMe drive serving as my cache drive and hosting my Docker containers and VMs. At random, the containers and the VM would crash and the cache drive would become inaccessible. After a reboot, the cache drive would also be missing.

I originally thought it could be the PCIe card, so I removed it. I eventually also added a SATA SSD to the cache pool, so it is now RAID 1. I also added "append initrd=/bzroot nvme_core.default_ps_max_latency_us=0" to my flash drive.

With no VM running and powertop auto-tune on, the cache pool still drops at random, on both the NVMe and the SATA SSD. It seems to be okay with powertop auto-tune off, but I'm not sure that was the problem. The log seems to suggest read errors on my NVMe. Is this a hardware issue? Could the PCIe NIC be affecting my NVMe? Any advice is appreciated, thanks!

Attachments: tower-syslog-20221201-0435.zip, tower-diagnostics-20221129-1406.zip
JorgeB, posted December 1, 2022

The diags were taken after rebooting, but if the device is dropping, also add pcie_aspm=off to syslinux to see if it helps.
Apex_Budi (author), posted December 2, 2022

21 hours ago, JorgeB said: "Diags are after rebooting but if the device is dropping also add pcie_aspm=off to syslinux to see if helps."

Is that for after the initrd=/bzroot as well?

Sorry, here are the files. Another drop just occurred. Rebooting from the webUI didn't bring the NVMe or the SATA SSD back; both showed up as "missing disks". When I shut down cleanly from the webUI and, without flipping off the PSU switch, turned the server back on with the power button, everything worked fine again.

Attachments: tower-syslog-20221201-2245.zip, tower-diagnostics-20221202-1146.zip
JorgeB, posted December 2, 2022 (marked as solution)

1 hour ago, Apex_Budi said: "Is that for after the initrd=/bzroot as well?"

Yep.

    Dec 1 22:53:03 Tower kernel: nvme nvme0: Abort status: 0x371
    Dec 1 22:54:04 Tower kernel: nvme nvme0: Device not ready; aborting reset, CSTS=0x1
    Dec 1 22:54:04 Tower kernel: nvme nvme0: Removing after probe failure status: -19
    Dec 1 22:55:04 Tower kernel: nvme nvme0: Device not ready; aborting reset, CSTS=0x1
    Dec 1 22:55:04 Tower kernel: nvme0n1: detected capacity change from 488397168 to 0

The device is dropping offline. See if the above helps; if it doesn't, look for a BIOS update or try a different NVMe device if available.
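For anyone unsure what "after the initrd=/bzroot" amounts to, here is a minimal Python sketch that builds the resulting append line. The helper name `add_kernel_params` is made up for illustration, and the sketch only manipulates a string; it does not touch the syslinux config on the flash drive, and it drops any leading indentation the real line may have.

```python
from typing import List


def add_kernel_params(append_line: str, params: List[str]) -> str:
    """Return the append line with each missing kernel parameter added at the end.

    Hypothetical helper for illustration only. Whitespace is normalized to
    single spaces, so re-indent the result if your config indents the line.
    """
    tokens = append_line.split()
    for param in params:
        # Skip parameters that are already present so the line is not duplicated.
        if param not in tokens:
            tokens.append(param)
    return " ".join(tokens)


# The append line quoted in the first post of this thread.
line = "append initrd=/bzroot nvme_core.default_ps_max_latency_us=0"
print(add_kernel_params(line, ["pcie_aspm=off"]))
# prints: append initrd=/bzroot nvme_core.default_ps_max_latency_us=0 pcie_aspm=off
```

The new parameter goes on the same line, separated by a space, after the existing ones. A reboot is needed before the kernel picks it up.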
Apex_Budi (author), posted December 2, 2022

7 minutes ago, JorgeB said:

    Yep.

    Dec 1 22:53:03 Tower kernel: nvme nvme0: Abort status: 0x371
    Dec 1 22:54:04 Tower kernel: nvme nvme0: Device not ready; aborting reset, CSTS=0x1
    Dec 1 22:54:04 Tower kernel: nvme nvme0: Removing after probe failure status: -19
    Dec 1 22:55:04 Tower kernel: nvme nvme0: Device not ready; aborting reset, CSTS=0x1
    Dec 1 22:55:04 Tower kernel: nvme0n1: detected capacity change from 488397168 to 0

    Device is dropping offline, see if the above helps, if it doesn't look for a BIOS update or try a different NVMe device if available.

Thanks! I'll give that a go! I know the BIOS is old because the parts are second-hand. Is it weird that the SATA SSD in the same cache pool also drops offline/goes missing when the NVMe is the one that has problems?
Apex_Budi (author), posted December 2, 2022

Would powertop affect these at all? Especially in power-saving modes?
JorgeB, posted December 2, 2022

39 minutes ago, Apex_Budi said: "Is it weird that the SATA SSD in the same cache pool also drops offline/goes missing when the NVME is the one that has problems?"

It looks like the SATA SSD is also dropping offline, but I can't see that in the diags, probably due to all the other errors.

39 minutes ago, Apex_Budi said: "Would powertop affect these at all? Especially on powersaving modes?"

It can; you should try disabling it for now.
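When a syslog is flooded with errors, filtering it down to the drop events can make them easier to spot. The Python sketch below is one rough way to do that, using only the NVMe kernel messages quoted earlier in this thread as patterns. The names `DROP_PATTERNS` and `find_drop_events` are made up for illustration, and no SATA messages appear in the thread, so patterns for the SATA SSD would have to be added from your own log.

```python
import re

# Phrases taken from the kernel messages quoted in this thread.
# These cover the NVMe drop only; extend the list for other devices.
DROP_PATTERNS = [
    re.compile(r"Device not ready; aborting reset"),
    re.compile(r"Removing after probe failure"),
    re.compile(r"detected capacity change from \d+ to 0\b"),
]


def find_drop_events(lines):
    """Return the log lines that match at least one drop pattern."""
    return [
        line.rstrip("\n")
        for line in lines
        if any(pattern.search(line) for pattern in DROP_PATTERNS)
    ]


# Sample input: log lines copied from the posts above.
sample = [
    "Dec 1 22:53:03 Tower kernel: nvme nvme0: Abort status: 0x371",
    "Dec 1 22:54:04 Tower kernel: nvme nvme0: Device not ready; aborting reset, CSTS=0x1",
    "Dec 1 22:54:04 Tower kernel: nvme nvme0: Removing after probe failure status: -19",
    "Dec 1 22:55:04 Tower kernel: nvme0n1: detected capacity change from 488397168 to 0",
]

for event in find_drop_events(sample):
    print(event)
```

To run it against a real log, pass an open file instead of `sample`, for example `find_drop_events(open(path))` with `path` pointing at the extracted syslog.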
Apex_Budi (author), posted December 3, 2022

Had another missing disk even with pcie_aspm=off, but it has been stable for almost 24h now with powertop auto-tune off. I suspect it could also be the CEC2019 option that I enabled in the BIOS (my mobo is a Gigabyte H310M S2H) around the same time. It automatically turned on ASPM for everything.
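Since the BIOS option and the kernel parameter both influence ASPM, it can be worth confirming what the running system actually sees. The Python sketch below checks whether a parameter is on the booted kernel command line (`/proc/cmdline`) and reads the active ASPM policy. The sysfs path `/sys/module/pcie_aspm/parameters/policy` and its bracketed format are assumptions about a typical Linux kernel with ASPM support and may not be present on every system; the function names are made up for illustration.

```python
from pathlib import Path


def has_param(cmdline: str, param: str) -> bool:
    """Return True if the exact parameter appears on the kernel command line."""
    return param in cmdline.split()


def active_aspm_policy(policy_text: str):
    """Return the bracketed entry, e.g. 'powersave' from
    'default performance [powersave] powersupersave', or None if absent."""
    for word in policy_text.split():
        if word.startswith("[") and word.endswith("]"):
            return word[1:-1]
    return None


def read_text_or_none(path: str):
    """Read a small text file, returning None if it is missing or unreadable."""
    try:
        return Path(path).read_text()
    except OSError:
        return None


if __name__ == "__main__":
    cmdline = read_text_or_none("/proc/cmdline")
    if cmdline is not None:
        for param in ("pcie_aspm=off", "nvme_core.default_ps_max_latency_us=0"):
            print(param, "present" if has_param(cmdline, param) else "missing")

    # Assumed sysfs location; skipped silently if it does not exist.
    policy = read_text_or_none("/sys/module/pcie_aspm/parameters/policy")
    if policy is not None:
        print("active ASPM policy:", active_aspm_policy(policy))
```

This only reports what the kernel was booted with and which policy it has selected; it does not show what powertop auto-tune changes afterwards.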