chaosclarity Posted June 24, 2022

Hi,

This morning I received a notification that a cache pool device is missing: Samsung 980 1TB.

The strange part is that the dashboard main section still shows all three 1TB devices (1x Samsung 980 1TB, 2x Teamgroup 1TB). The one that looks off is one of the Teamgroup 1TB SSDs, which is no longer reporting temperature readings. So why did Unraid tell me the Samsung 1TB is missing? I'm really confused.

I had an unplanned power outage 2 days ago, and I'm guessing this could be a result of that, but I'm not sure. Any idea what I should do here?
ChatNoir Posted June 24, 2022

chaosclarity said: "Any idea what I should do here?"

You should probably start by attaching your diagnostics to your next post.
chaosclarity (Author) Posted June 24, 2022

Attached: tower-diagnostics-20220624-0831.zip
JorgeB Posted June 24, 2022

chaosclarity said: "but the one that looks strange is one of the Teamgroup 1TB ssd's as it's not reporting temperature readings any more."

Yep, that's the one, it dropped offline:

Jun 24 04:50:09 Tower kernel: nvme nvme0: I/O 170 QID 2 timeout, aborting
Jun 24 04:50:09 Tower kernel: nvme nvme0: I/O 171 QID 2 timeout, aborting
Jun 24 04:50:09 Tower kernel: nvme nvme0: I/O 172 QID 2 timeout, aborting
Jun 24 04:50:09 Tower kernel: nvme nvme0: I/O 173 QID 2 timeout, aborting
Jun 24 04:50:09 Tower kernel: nvme nvme0: I/O 174 QID 2 timeout, aborting
Jun 24 04:50:37 Tower kernel: nvme nvme0: I/O 12 QID 0 timeout, reset controller
Jun 24 04:50:39 Tower kernel: nvme nvme0: I/O 170 QID 2 timeout, reset controller
Jun 24 04:53:40 Tower kernel: nvme nvme0: Device not ready; aborting reset, CSTS=0x1

The below can sometimes help; if not, try a different PCIe/M.2 slot if available, or a different model device.

Some NVMe devices have issues with power states on Linux. Try this: on the main GUI page click on flash, scroll down to "Syslinux Configuration", make sure it's set to "menu view" (on the top right), and add this to your default boot option, after "append initrd=/bzroot":

nvme_core.default_ps_max_latency_us=0

e.g.:

append initrd=/bzroot nvme_core.default_ps_max_latency_us=0

Reboot and see if it makes a difference.
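For anyone reading a syslog of their own, the pattern JorgeB points at (a burst of I/O timeouts followed by a failed controller reset) can be picked out mechanically. The following is only a minimal sketch in Python: the function name `summarize_nvme_drops` and the regular expressions are illustrative assumptions built around the exact wording of the log lines quoted in this thread, and other kernel versions may word these messages differently.

```python
import re
from collections import Counter

# Matches kernel lines such as:
#   Jun 24 04:50:09 Tower kernel: nvme nvme0: I/O 170 QID 2 timeout, aborting
TIMEOUT_RE = re.compile(
    r"kernel: nvme (nvme\d+): I/O \d+ QID \d+ timeout, (?:aborting|reset controller)"
)
# Matches the line that reports the controller reset being abandoned.
RESET_FAILED_RE = re.compile(
    r"kernel: nvme (nvme\d+): Device not ready; aborting reset"
)

# The log excerpt quoted in the thread, used here as sample input.
SAMPLE_LOG = """\
Jun 24 04:50:09 Tower kernel: nvme nvme0: I/O 170 QID 2 timeout, aborting
Jun 24 04:50:09 Tower kernel: nvme nvme0: I/O 171 QID 2 timeout, aborting
Jun 24 04:50:09 Tower kernel: nvme nvme0: I/O 172 QID 2 timeout, aborting
Jun 24 04:50:09 Tower kernel: nvme nvme0: I/O 173 QID 2 timeout, aborting
Jun 24 04:50:09 Tower kernel: nvme nvme0: I/O 174 QID 2 timeout, aborting
Jun 24 04:50:37 Tower kernel: nvme nvme0: I/O 12 QID 0 timeout, reset controller
Jun 24 04:50:39 Tower kernel: nvme nvme0: I/O 170 QID 2 timeout, reset controller
Jun 24 04:53:40 Tower kernel: nvme nvme0: Device not ready; aborting reset, CSTS=0x1
"""


def summarize_nvme_drops(log_text):
    """Return (timeout count per controller, controllers whose reset failed)."""
    timeouts = Counter()
    dropped = set()
    for line in log_text.splitlines():
        match = TIMEOUT_RE.search(line)
        if match:
            timeouts[match.group(1)] += 1
            continue
        match = RESET_FAILED_RE.search(line)
        if match:
            dropped.add(match.group(1))
    return timeouts, dropped


if __name__ == "__main__":
    timeouts, dropped = summarize_nvme_drops(SAMPLE_LOG)
    for controller, count in sorted(timeouts.items()):
        status = "reset failed" if controller in dropped else "recovered or unknown"
        print(f"{controller}: {count} timeout lines, {status}")
```

Note that the controller name (`nvme0`) identifies a kernel device, not a brand, so it still has to be matched to the physical SSD by other means, such as the missing temperature reading mentioned above.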
chaosclarity (Author) Posted June 24, 2022

Thank you. One question: in its current state, as it's running now, is the cache pool operating OK?
JorgeB Posted June 24, 2022

Yes, because there's redundancy, but you should bring the dropped device online and run a scrub to sync the pool. It's also good to monitor the pool to catch any future issues, more info here.
chaosclarity (Author) Posted June 24, 2022

Do you think it's a good idea to set up a scrub schedule for the cache pool, then? Or should I just run one when I have issues like this?
JorgeB Posted June 24, 2022

A monthly scrub is a good idea, but it's much more important to monitor the pool for any errors, since the GUI currently doesn't show them.
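One way to do that kind of monitoring is to check the per-device error counters on a schedule. This is a rough Python sketch that rests on assumptions not stated in the thread: that the pool is btrfs, that it is mounted at `/mnt/cache`, and that the `btrfs` command-line tool is available. The helper names are hypothetical, and `SAMPLE_STATS` is invented to illustrate the output format of `btrfs device stats`; it is not taken from the poster's diagnostics.

```python
import re
import subprocess

# `btrfs device stats <mount>` prints one counter per line, for example:
#   [/dev/nvme0n1p1].write_io_errs    0
STAT_RE = re.compile(r"^\[(?P<device>.+?)\]\.(?P<counter>\w+)\s+(?P<value>\d+)$")

# Illustrative sample only (made-up values), NOT from the thread's diagnostics.
SAMPLE_STATS = """\
[/dev/nvme0n1p1].write_io_errs    12
[/dev/nvme0n1p1].read_io_errs     3
[/dev/nvme0n1p1].flush_io_errs    0
[/dev/nvme0n1p1].corruption_errs  0
[/dev/nvme0n1p1].generation_errs  0
[/dev/nvme1n1p1].write_io_errs    0
[/dev/nvme1n1p1].read_io_errs     0
"""


def nonzero_counters(stats_text):
    """Return {device: {counter: value}} for every counter above zero."""
    problems = {}
    for line in stats_text.splitlines():
        match = STAT_RE.match(line.strip())
        if match and int(match.group("value")) > 0:
            device = problems.setdefault(match.group("device"), {})
            device[match.group("counter")] = int(match.group("value"))
    return problems


def read_pool_stats(mount_point="/mnt/cache"):
    """Run `btrfs device stats` for the pool (mount point is an assumption)."""
    result = subprocess.run(
        ["btrfs", "device", "stats", mount_point],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


if __name__ == "__main__":
    # Uses the sample text so the sketch runs anywhere; on a real server,
    # pass read_pool_stats() instead.
    for device, counters in nonzero_counters(SAMPLE_STATS).items():
        print(f"{device}: {counters}")
```

Any non-zero counter would be a reason to look at the syslog and run a scrub; how the check is scheduled and how alerts are delivered is left open here.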
chaosclarity (Author) Posted June 24, 2022

I added that config line as you described and then rebooted. Unfortunately the Teamgroup 1TB SSD never came back, so the array was started without it and Unraid removed it from the cache pool. Is it safe to say this SSD is dead?
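Before drawing conclusions from a reboot like this, it may be worth confirming that the parameter actually reached the kernel. On Linux the active boot command line is readable from `/proc/cmdline`. The Python sketch below checks it for the setting; the function name `param_from_cmdline` is hypothetical, and the sketch simply keeps the last occurrence if the parameter appears more than once.

```python
from pathlib import Path

PARAM = "nvme_core.default_ps_max_latency_us"


def param_from_cmdline(cmdline, name=PARAM):
    """Return the value given for `name` on a kernel command line, or None."""
    value = None
    for token in cmdline.split():
        key, sep, val = token.partition("=")
        if key == name and sep:
            value = val  # simplification: the last occurrence is kept
    return value


if __name__ == "__main__":
    cmdline_path = Path("/proc/cmdline")
    if cmdline_path.exists():
        found = param_from_cmdline(cmdline_path.read_text())
        if found is None:
            print(f"{PARAM} is not set on the running kernel's command line")
        else:
            print(f"{PARAM}={found}")
    else:
        print("/proc/cmdline not found; this check only applies on Linux")
```

If the parameter is missing after a reboot, the likely explanation is that the line was added to a boot entry other than the default one.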
JorgeB Posted June 24, 2022

Power cycle the server; rebooting might not be enough. If it still doesn't come back, switch slots with another device. If it still doesn't show up, it's likely dead.
chaosclarity (Author) Posted June 28, 2022

I finally got around to power cycling the server. The M.2 came back this time around and was added to the cache pool again. Hopefully no more issues...
chaosclarity (Author) Posted July 12, 2022

Well, yesterday it dropped again. Diagnostics attached, but I'm starting to think the drive is faulty.

Attached: tower-diagnostics-20220712-0727.zip
chaosclarity (Author) Posted July 12, 2022

Doh. I powered off the server completely and unplugged it. When I went to add the dropped drive back to the cache pool, it was gone for good.
JorgeB Posted July 12, 2022

Could be a device problem.