Cache Pool Device Missing - Which one?


Hi,

This morning I received a notification that a cache pool device is missing:

Samsung 980 1TB

 

Now, the strange part is that in the main section of the dashboard I can see all three 1TB devices (1x Samsung 980 1TB, 2x Teamgroup 1TB). The one that looks strange is one of the Teamgroup 1TB SSDs, as it's no longer reporting temperature readings. So why did Unraid tell me the Samsung 1TB is missing? I'm really confused here.

 

I had an unplanned power outage 2 days ago, and I'm guessing this could be the result of that, but I'm not sure.

 

Any idea what I should do here?

 

Untitled.png

Link to comment
29 minutes ago, chaosclarity said:

but the one that looks strange is one of the Teamgroup 1TB ssd's as it's not reporting temperature readings any more.

Yep, that's the one, it dropped offline:

 

Jun 24 04:50:09 Tower kernel: nvme nvme0: I/O 170 QID 2 timeout, aborting
Jun 24 04:50:09 Tower kernel: nvme nvme0: I/O 171 QID 2 timeout, aborting
Jun 24 04:50:09 Tower kernel: nvme nvme0: I/O 172 QID 2 timeout, aborting
Jun 24 04:50:09 Tower kernel: nvme nvme0: I/O 173 QID 2 timeout, aborting
Jun 24 04:50:09 Tower kernel: nvme nvme0: I/O 174 QID 2 timeout, aborting
Jun 24 04:50:37 Tower kernel: nvme nvme0: I/O 12 QID 0 timeout, reset controller
Jun 24 04:50:39 Tower kernel: nvme nvme0: I/O 170 QID 2 timeout, reset controller
Jun 24 04:53:40 Tower kernel: nvme nvme0: Device not ready; aborting reset, CSTS=0x1
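To work out which physical drive sits behind a controller name like `nvme0`, something along these lines may help. It is only a rough Python sketch, not part of the original reply: the sample lines are copied from the log above, the function names are made up for illustration, and it assumes the kernel exposes each controller's model under `/sys/class/nvme/<controller>/model` (a controller that has already dropped off the bus may not be listed there any more).

```python
import re
from collections import Counter
from pathlib import Path

# Matches kernel lines such as:
#   "Jun 24 04:50:09 Tower kernel: nvme nvme0: I/O 170 QID 2 timeout, aborting"
NVME_LINE = re.compile(r"kernel: nvme (nvme\d+): (.*)$")
TROUBLE_MARKERS = ("timeout", "Device not ready")


def count_nvme_errors(log_text):
    """Count timeout / not-ready lines per NVMe controller name."""
    counts = Counter()
    for line in log_text.splitlines():
        match = NVME_LINE.search(line)
        if match and any(marker in match.group(2) for marker in TROUBLE_MARKERS):
            counts[match.group(1)] += 1
    return counts


def controller_models(sys_root="/sys/class/nvme"):
    """Map controller name -> model string from sysfs (empty if unavailable)."""
    models = {}
    for model_file in Path(sys_root).glob("nvme*/model"):
        try:
            models[model_file.parent.name] = model_file.read_text().strip()
        except OSError:
            pass  # a dropped controller may no longer be readable
    return models


# Three lines taken from the syslog excerpt in this thread.
sample = """\
Jun 24 04:50:09 Tower kernel: nvme nvme0: I/O 170 QID 2 timeout, aborting
Jun 24 04:50:37 Tower kernel: nvme nvme0: I/O 12 QID 0 timeout, reset controller
Jun 24 04:53:40 Tower kernel: nvme nvme0: Device not ready; aborting reset, CSTS=0x1
"""

models = controller_models()
for controller, hits in count_nvme_errors(sample).items():
    print(controller, hits, models.get(controller, "model not visible in sysfs"))
```

On a live server you would feed it the contents of the syslog instead of `sample`; the location of that file depends on your setup.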

 

The below can sometimes help. If not, try a different PCIe/M.2 slot if available, or a different model of device.

 

Some NVMe devices have issues with power states on Linux. Try this: on the main GUI page click on flash, scroll down to "Syslinux Configuration", make sure it's set to "menu view" (on the top right), and add this to your default boot option, after "append initrd=/bzroot":

nvme_core.default_ps_max_latency_us=0

e.g.:

append initrd=/bzroot nvme_core.default_ps_max_latency_us=0


Reboot and see if it makes a difference.
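After the reboot it may be worth confirming that the option was actually picked up. A minimal Python sketch, under the assumption that `/proc/cmdline` is readable and that the `nvme_core` module exposes its parameters under `/sys/module/nvme_core/parameters/`; the helper names are hypothetical and either file may be missing on some systems:

```python
from pathlib import Path

PARAM = "nvme_core.default_ps_max_latency_us"


def cmdline_value(cmdline, name=PARAM):
    """Return the value of `name=value` from a kernel command line, or None."""
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if key == name and sep:
            return value
    return None


def read_text_or_none(path):
    """Read a small text file, returning None if it cannot be read."""
    try:
        return Path(path).read_text().strip()
    except OSError:
        return None


# What the kernel was booted with.
cmdline = read_text_or_none("/proc/cmdline")
if cmdline is not None:
    print("boot parameter:", cmdline_value(cmdline))

# What the driver is currently using (assumed sysfs location).
live = read_text_or_none("/sys/module/nvme_core/parameters/default_ps_max_latency_us")
print("live value:", live)
```

If both report `0`, the setting is in effect; if the boot parameter comes back as `None`, the line was probably added to a different boot entry than the default one.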

 

Link to comment
Thank you. One question: in its current state, as it's running now, is the cache pool operating OK?

Link to comment
51 minutes ago, JorgeB said:

Yes, because there's redundancy, but you should bring the dropped device back online and run a scrub to sync the pool. It's also good to monitor the pool to catch any future issues; more info here.

Do you think it's a good idea, then, to set up a scrub schedule for the cache pool? Or should I just run it when I have issues like this?

Link to comment
1 hour ago, JorgeB said:

A monthly scrub is a good idea, but it's much more important to monitor the pool for errors, since the GUI currently doesn't show them.
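One way to do that kind of monitoring from a script is to read the per-device error counters. This is only a sketch under stated assumptions: that the pool is btrfs, that it is mounted at `/mnt/cache`, and that `btrfs device stats` prints lines of the form `[/dev/...].counter_name  value`. The function names and the sample counter values are made up for illustration and do not come from this server.

```python
import re
import subprocess

# One line of `btrfs device stats` output, e.g.
#   "[/dev/nvme0n1p1].write_io_errs    0"
STAT_LINE = re.compile(r"^\[(?P<dev>[^\]]+)\]\.(?P<counter>\w+)\s+(?P<value>\d+)$")


def parse_device_stats(output):
    """Return {device: {counter: value}} for every non-zero error counter."""
    nonzero = {}
    for line in output.splitlines():
        match = STAT_LINE.match(line.strip())
        if match and int(match["value"]) > 0:
            nonzero.setdefault(match["dev"], {})[match["counter"]] = int(match["value"])
    return nonzero


def pool_errors(mount_point="/mnt/cache"):
    """Run `btrfs device stats` on the pool (mount point is an assumption)."""
    result = subprocess.run(
        ["btrfs", "device", "stats", mount_point],
        capture_output=True, text=True, check=True,
    )
    return parse_device_stats(result.stdout)


# Illustrative output only; the numbers are invented.
example_output = """\
[/dev/nvme0n1p1].write_io_errs    12
[/dev/nvme0n1p1].read_io_errs     0
[/dev/nvme0n1p1].corruption_errs  3
[/dev/nvme1n1p1].write_io_errs    0
"""
print(parse_device_stats(example_output))
```

Calling `pool_errors()` from a scheduled job and sending a notification whenever it returns a non-empty dict would cover the gap in the GUI; how you schedule it and how you get notified is up to you.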

I added that config line as you suggested and then rebooted. Unfortunately the Teamgroup 1TB SSD never came back, so the array was started without it and Unraid removed it from the cache pool. Is it safe to say this SSD is dead?

Link to comment
On 6/24/2022 at 1:24 PM, JorgeB said:

Power cycle the server; rebooting might not be enough. If it still doesn't come back, switch slots with another device. If it still doesn't show up after that, it's likely dead.

I finally got around to power cycling the server. The M.2 came back this time around and was added to the cache pool again. Hopefully no more issues...
