Cache Pool Device Missing - Which one?

chaosclarity · June 24, 2022

Hi,

This morning I received a notification that a cache pool device is missing:

Samsung 980 1TB

Now, the strange part is that in the main section of the dashboard I can still see all three 1TB devices (1x Samsung 980 1TB, 2x Teamgroup 1TB). The one that looks off is one of the Teamgroup 1TB SSDs, as it's not reporting temperature readings any more. So why did Unraid tell me the Samsung 1TB is missing? I'm really confused here.

I had an unplanned power outage 2 days ago, and I'm guessing this could be the result of that, not sure.

Any idea what I should do here?

ChatNoir · June 24, 2022

13 minutes ago, chaosclarity said:

Any idea what I should do here?

You should probably start by attaching your diagnostics to your next post.

chaosclarity · June 24, 2022

Just now, ChatNoir said:

You should probably start by attaching your diagnostics to your next post.

Attached

tower-diagnostics-20220624-0831.zip

JorgeB · June 24, 2022

29 minutes ago, chaosclarity said:

but the one that looks strange is one of the Teamgroup 1TB SSDs, as it's not reporting temperature readings any more.

Yep, that's the one, it dropped offline:

Jun 24 04:50:09 Tower kernel: nvme nvme0: I/O 170 QID 2 timeout, aborting
Jun 24 04:50:09 Tower kernel: nvme nvme0: I/O 171 QID 2 timeout, aborting
Jun 24 04:50:09 Tower kernel: nvme nvme0: I/O 172 QID 2 timeout, aborting
Jun 24 04:50:09 Tower kernel: nvme nvme0: I/O 173 QID 2 timeout, aborting
Jun 24 04:50:09 Tower kernel: nvme nvme0: I/O 174 QID 2 timeout, aborting
Jun 24 04:50:37 Tower kernel: nvme nvme0: I/O 12 QID 0 timeout, reset controller
Jun 24 04:50:39 Tower kernel: nvme nvme0: I/O 170 QID 2 timeout, reset controller
Jun 24 04:53:40 Tower kernel: nvme nvme0: Device not ready; aborting reset, CSTS=0x1
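
If you want to check a syslog for this pattern yourself, here is a minimal sketch that counts NVMe I/O timeouts per controller. The regular expression is only based on the lines quoted above, and the function name `count_nvme_timeouts` is made up for illustration, so treat it as a starting point rather than a complete parser.

```python
import re
from collections import Counter

# Matches kernel lines such as:
#   "... kernel: nvme nvme0: I/O 170 QID 2 timeout, aborting"
# The pattern is an assumption derived from the log excerpt in this thread.
TIMEOUT_RE = re.compile(r"kernel: nvme (nvme\d+): I/O \d+ QID \d+ timeout")


def count_nvme_timeouts(lines):
    """Return a Counter mapping controller name (e.g. 'nvme0') to timeout count."""
    counts = Counter()
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# A few lines copied from the syslog excerpt above.
sample = [
    "Jun 24 04:50:09 Tower kernel: nvme nvme0: I/O 170 QID 2 timeout, aborting",
    "Jun 24 04:50:37 Tower kernel: nvme nvme0: I/O 12 QID 0 timeout, reset controller",
    "Jun 24 04:53:40 Tower kernel: nvme nvme0: Device not ready; aborting reset, CSTS=0x1",
]

for controller, hits in count_nvme_timeouts(sample).items():
    print(f"{controller}: {hits} I/O timeouts")
```

The controller name (nvme0 here) is a kernel name, not a model name. On a running Linux system the model can usually be read from /sys/class/nvme/nvme0/model, which is how you tie the log lines back to a specific drive.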

The below can sometimes help. If it doesn't, try a different PCIe/M.2 slot if available, or a different model device.

Some NVMe devices have issues with power states on Linux. Try this: on the main GUI page click on flash, scroll down to "Syslinux Configuration", make sure it's set to "menu view" (on the top right), and add this to your default boot option, after "append initrd=/bzroot":

nvme_core.default_ps_max_latency_us=0

e.g.:

append initrd=/bzroot nvme_core.default_ps_max_latency_us=0

Reboot and see if it makes a difference.
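
After the reboot you can confirm the option was actually picked up by looking at the kernel command line. A minimal sketch, assuming Python is available on the machine where you run it (it is not part of a stock Unraid install); the helper names `get_kernel_param` and `read_cmdline` are hypothetical:

```python
from pathlib import Path

PARAM = "nvme_core.default_ps_max_latency_us"


def get_kernel_param(cmdline, name=PARAM):
    """Return the value of `name` from a kernel command line string, or None."""
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if key == name and sep:
            return value
    return None


def read_cmdline(path="/proc/cmdline"):
    """Read the running kernel's command line; empty string if unavailable."""
    p = Path(path)
    return p.read_text() if p.exists() else ""


if __name__ == "__main__":
    value = get_kernel_param(read_cmdline())
    if value is None:
        print(f"{PARAM} is not set on the kernel command line")
    else:
        print(f"{PARAM}={value}")
```

If the parameter is missing, the edit most likely went into a boot entry other than the default one.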

chaosclarity · June 24, 2022

5 minutes ago, JorgeB said:

Reboot and see if it makes a difference.

Thank you. One question, in this current state as it's running now, is the cache pool operating OK?

JorgeB · June 24, 2022

Yes, because there's redundancy, but you should bring the dropped device back online and run a scrub to sync the pool. It's also good to monitor the pool to catch any future issues, more info here.

chaosclarity · June 24, 2022

51 minutes ago, JorgeB said:

Yes, because there's redundancy, but you should bring the dropped device back online and run a scrub to sync the pool. It's also good to monitor the pool to catch any future issues, more info here.

Do you think it's a good idea, then, to set up a scrub schedule for the cache pool? Or should I just run it when I have issues like this?

JorgeB · June 24, 2022

Monthly scrub is a good idea, but much more important is to monitor the pool for any errors since the GUI currently doesn't show that.
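
As a rough idea of what that monitoring could look like, here is a minimal sketch that parses the output of `btrfs device stats` and reports any non-zero error counters. It assumes the pool is btrfs and mounted at /mnt/cache (neither is stated in this thread), and the function names are hypothetical.

```python
import re
import subprocess

# Lines from `btrfs device stats` look like:
#   [/dev/nvme0n1p1].write_io_errs    0
STAT_RE = re.compile(r"^\[(?P<dev>[^\]]+)\]\.(?P<counter>\w+)\s+(?P<value>\d+)$")


def parse_device_stats(text):
    """Return {device: {counter: value}} for every non-zero error counter."""
    errors = {}
    for line in text.splitlines():
        match = STAT_RE.match(line.strip())
        if match and int(match.group("value")) > 0:
            dev = match.group("dev")
            errors.setdefault(dev, {})[match.group("counter")] = int(match.group("value"))
    return errors


def pool_errors(mount_point="/mnt/cache"):
    """Run `btrfs device stats` on the pool; the mount point is an assumption."""
    result = subprocess.run(
        ["btrfs", "device", "stats", mount_point],
        capture_output=True, text=True, check=True,
    )
    return parse_device_stats(result.stdout)


# Illustrative output only, not taken from the diagnostics in this thread.
example_output = """\
[/dev/nvme0n1p1].write_io_errs    12
[/dev/nvme0n1p1].read_io_errs     0
[/dev/nvme1n1p1].write_io_errs    0
[/dev/nvme1n1p1].corruption_errs  0
"""
print(parse_device_stats(example_output))
```

Running something like `pool_errors()` on a schedule and sending a notification when the result is not empty would catch a dropped device well before the next monthly scrub.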

chaosclarity · June 24, 2022

1 hour ago, JorgeB said:

Monthly scrub is a good idea, but much more important is to monitor the pool for any errors since the GUI currently doesn't show that.

I added that config line as you stated and then rebooted. Unfortunately the Teamgroup 1TB SSD never came back, so the array was started without it and Unraid removed it from the cache pool. Is it safe to say this SSD is dead?

JorgeB · June 24, 2022

Power cycle the server, since rebooting might not be enough. If it still doesn't come back, switch slots with another device. If it still doesn't show up, it's likely dead.

chaosclarity · June 28, 2022

On 6/24/2022 at 1:24 PM, JorgeB said:

Power cycle the server, since rebooting might not be enough. If it still doesn't come back, switch slots with another device. If it still doesn't show up, it's likely dead.

I finally got around to power cycling the server. The M.2 came back this time and was added to the cache pool again. Hopefully no more issues...

chaosclarity · July 12, 2022

On 6/24/2022 at 1:24 PM, JorgeB said:

Power cycle the server, since rebooting might not be enough. If it still doesn't come back, switch slots with another device. If it still doesn't show up, it's likely dead.

Well, yesterday it dropped again. Diagnostics attached. I'm starting to think the drive is faulty.

tower-diagnostics-20220712-0727.zip

chaosclarity · July 12, 2022

4 minutes ago, chaosclarity said:

Well, yesterday it dropped again. Diagnostics attached. I'm starting to think the drive is faulty.

tower-diagnostics-20220712-0727.zip

Doh, I powered off the server completely and unplugged it. When I went to add the dropped drive back to the cache pool, it was gone for good.

JorgeB · July 12, 2022

Could be a device problem.
