Cache pool BTRFS missing device Samsung 980



Hello, I started to get errors with my cache pool saying that the device is missing (Samsung 980 NVMe). The device is still there and can be read from but not written to. If you restart the system, the device disappears from the cache pool. If you shut down the system and then turn it back on, it comes back and works normally (can read/write).

 

This is the second time I have had this issue in the past 4 days. Can someone take a look at the logs and see if something is up? Thanks.

z490-diagnostics-20211206-1216.zip

Edited by macieksoft
Typo
Link to comment

NVMe device dropped offline:

 

Dec  6 11:54:00 Z490 kernel: nvme nvme1: Device not ready; aborting reset, CSTS=0x1
Dec  6 11:54:00 Z490 kernel: nvme nvme1: Removing after probe failure status: -19
Dec  6 11:54:30 Z490 kernel: nvme nvme1: Device not ready; aborting reset, CSTS=0x1

 

This sometimes helps:

 

Some NVMe devices have issues with power states on Linux. Try this: on the main GUI page click on Flash, scroll down to "Syslinux Configuration", make sure it's set to "menu view" (on the top right), and add this to your default boot option, after "append initrd=/bzroot":

nvme_core.default_ps_max_latency_us=0

e.g.:

append initrd=/bzroot nvme_core.default_ps_max_latency_us=0


Reboot and see if it makes a difference.
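
After the reboot, it can be worth confirming that the option actually reached the kernel. The following is only a minimal Python sketch: it assumes the usual Linux locations /proc/cmdline and /sys/module/nvme_core/parameters/default_ps_max_latency_us, and the function names are made up for illustration.

```python
from pathlib import Path

PARAM = "nvme_core.default_ps_max_latency_us"
# Assumed standard Linux locations; adjust if your system differs.
CMDLINE_PATH = Path("/proc/cmdline")
SYSFS_PATH = Path("/sys/module/nvme_core/parameters/default_ps_max_latency_us")


def cmdline_value(cmdline, param=PARAM):
    """Return the value given for `param` on a kernel command line, or None."""
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if key == param and sep:
            return value
    return None


def report():
    """Print what the kernel was booted with and what the module reports."""
    if CMDLINE_PATH.exists():
        print("boot option:", cmdline_value(CMDLINE_PATH.read_text()))
    if SYSFS_PATH.exists():
        print("module value:", SYSFS_PATH.read_text().strip())
```

Calling report() on the server prints whatever each file contains; if the boot option shows None, the edit to the Syslinux configuration most likely did not land in the boot entry that was actually used.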

Link to comment
5 hours ago, JorgeB said:

append initrd=/bzroot nvme_core.default_ps_max_latency_us=0


Reboot and see if it makes a difference.

 

Thanks, I added that and rebooted the server. I'll keep an eye on it. It's odd that it just started to happen though; I wonder what's causing it.

 

I had mitigations off so I think this is still ok: 

append initrd=/bzroot nvme_core.default_ps_max_latency_us=0 mitigations=off
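
If you edit the append line by hand more than once, it is easy to end up with the same option twice. As a rough illustration only (the helper name add_kernel_option is hypothetical and not part of Unraid), this Python sketch adds an option to an append line only when its key is not already there:

```python
def add_kernel_option(append_line, option):
    """Return `append_line` with `option` added, unless its key is already present.

    An existing entry with the same key is left untouched, even if its value differs.
    """
    tokens = append_line.split()
    key = option.partition("=")[0]
    if any(tok.partition("=")[0] == key for tok in tokens):
        return " ".join(tokens)
    return " ".join(tokens + [option])


line = "append initrd=/bzroot mitigations=off"
line = add_kernel_option(line, "nvme_core.default_ps_max_latency_us=0")
print(line)
```

The new option lands at the end of the line here rather than before mitigations=off; as far as I know the order of these options does not matter to the kernel.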

Link to comment

I'm having the same issue with the same drive and it just started yesterday for me.

 

I tried the adjustment to the flash boot command as suggested and it still happens.

 

I thought it might have been an overheating issue, so I put all my fans on high for the time being. My SSD temp doesn't go above 45 degrees now, but it made no difference.

 

My BIOS is up to date (ASRock X570 Pro4), and trying to do an SSD firmware update didn't work; the Samsung utility failed to find the drive.

 

Any other suggestions to try? Thank you.

 

tower-diagnostics-20211209-1201.zip

Link to comment
On 12/10/2021 at 2:57 AM, JorgeB said:

If the BIOS is up to date not much more you can do other than trying a different brand/model device (or a different board).

Thanks for the info, bummed to hear that.

 

Is there anything I can do to save the VM image I have on there?

 

I removed the drive from my array and mounted it via the Unassigned Devices plugin, but the second I try to start copying the file, the drive goes missing again.

 

Dec 11 12:59:09 Tower kernel: nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0x10
Dec 11 12:59:09 Tower kernel: blk_update_request: I/O error, dev nvme0n1, sector 4640256 op 0x0:(READ) flags 0x84700 phys_seg 1 prio class 0

 

Dec 11 12:59:09 Tower kernel: nvme 0000:09:00.0: enabling device (0000 -> 0002)
Dec 11 12:59:09 Tower kernel: nvme nvme0: Removing after probe failure status: -19
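
For anyone trying to read these lines: CSTS is the NVMe Controller Status register. The Python sketch below decodes it under my understanding of the NVMe register layout (RDY in bit 0, CFS in bit 1, SHST in bits 3:2); the function name and the "unreachable" field are my own labels, not kernel terminology.

```python
def decode_csts(value):
    """Decode an NVMe CSTS (Controller Status) register value into named fields."""
    if value == 0xFFFFFFFF:
        # All ones typically means the register read itself failed,
        # i.e. the device is no longer answering on the PCIe bus.
        return {"unreachable": True}
    return {
        "unreachable": False,
        "ready": bool(value & 0x1),             # RDY, bit 0
        "fatal": bool(value & 0x2),             # CFS, bit 1
        "shutdown_status": (value >> 2) & 0x3,  # SHST, bits 3:2
    }


print(decode_csts(0xFFFFFFFF))
print(decode_csts(0x1))
```

Read that way, CSTS=0xffffffff here suggests the controller had dropped off the bus entirely, while the CSTS=0x1 in the first report would be a controller that still answered but did not go through its reset as the driver expected.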

                                      

Edited by djxstream
logs added
Link to comment
On 12/10/2021 at 2:57 AM, JorgeB said:

If the BIOS is up to date not much more you can do other than trying a different brand/model device (or a different board).

 

I definitely think this is an issue caused by 6.10 rc2. I haven't had any more issues after changing the boot line to append initrd=/bzroot nvme_core.default_ps_max_latency_us=0, but my Samsung 980 was fine for months before the update. Others have confirmed that the temperature sensor reading is broken with 6.10 rc2, and the drive disconnecting from the pool could be a side effect of rc2 as well.

Link to comment
41 minutes ago, macieksoft said:

 

I definitely think this is an issue caused by 6.10 rc2. I haven't had any more issues after changing the boot line to append initrd=/bzroot nvme_core.default_ps_max_latency_us=0, but my Samsung 980 was fine for months before the update. Others have confirmed that the temperature sensor reading is broken with 6.10 rc2, and the drive disconnecting from the pool could be a side effect of rc2 as well.

I was on 6.9 when this started. I only upgraded to rc2 today to try to troubleshoot.

 

Yeah, my machine was also up and running fine for months before this problem started this week.

 

Link to comment
  • 2 weeks later...

Just to give an update.

 

  • Was able to pull the drive from Unraid, add it to a Windows machine, and mount and copy my 2 VM images off it using WinBtrfs
  • Was also able to update the firmware via the Samsung software
  • Put the drive back in Unraid and used Unassigned Devices to format it (btrfs) and mount it
  • Started copying the VM disks back to it via mc, and it dropped out again with the following logs
Dec 21 19:29:26 Tower kernel: nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0x10
Dec 21 19:29:26 Tower kernel: nvme 0000:03:00.0: enabling device (0000 -> 0002)
Dec 21 19:29:26 Tower kernel: nvme nvme0: Removing after probe failure status: -19
Dec 21 19:29:26 Tower kernel: nvme0n1: detected capacity change from 1953525168 to 0
Dec 21 19:29:26 Tower kernel: blk_update_request: I/O error, dev nvme0n1, sector 72600576 op 0x1:(WRITE) flags 0x104000 phys_seg 4 prio class 0
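
To see how often this happens without scrolling through the whole syslog, something like the following Python sketch could pull out just the dropout messages. The patterns are copied from the kernel lines quoted in this thread; the function name and the idea of keying on the first 15 characters as the timestamp are my own assumptions about the usual syslog format.

```python
import re

# Patterns taken from the kernel messages quoted in this thread.
DROP_PATTERNS = [
    re.compile(r"nvme (nvme\d+): controller is down; will reset"),
    re.compile(r"nvme (nvme\d+): Removing after probe failure status: -?\d+"),
    re.compile(r"nvme (nvme\d+): Device not ready; aborting reset"),
]


def find_dropouts(lines):
    """Return (timestamp, controller, message) tuples for NVMe dropout messages."""
    events = []
    for line in lines:
        for pattern in DROP_PATTERNS:
            match = pattern.search(line)
            if match:
                timestamp = line[:15]  # syslog prefix such as "Dec 21 19:29:26"
                events.append((timestamp, match.group(1), match.group(0)))
                break
    return events
```

It takes any iterable of lines, so an open log file works as input, for example find_dropouts(open("/var/log/syslog")) if that is where your syslog lives.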

 

 

I did the default_ps_max_latency_us=0 modification, seems to not have made a difference

 

I'm at a loss on what to do next. Is Unraid just not ready for NVMe drives yet? I don't want to buy another drive only to deal with the same issues, and again, this drive had been in this system for months before any issue. Are there any tests I should do in the Windows machine, as the drive seems to work flawlessly there? Thanks again.

 

 

 

Link to comment
1 minute ago, djxstream said:

I did the default_ps_max_latency_us=0 modification, seems to not have made a difference

 

I'm at a loss on what to do next. Is Unraid just not ready for NVMe drives yet? I don't want to buy another drive only to deal with the same issues, and again, this drive had been in this system for months before any issue. Are there any tests I should do in the Windows machine, as the drive seems to work flawlessly there? Thanks again.

 

 

I have not had an issue since adding default_ps_max_latency_us=0, but Unraid still thinks the drive hits 84°C randomly for exactly 30 or 60 minutes. I originally thought it was the drive and talked to Samsung. They were willing to pay for a return label and send a new drive under warranty. You might be able to do that, but you will be without the drive for a while.

 

I would definitely go with another drive next time. 

Link to comment

So just a follow-up here: what I managed to do was create a VM, pass the NVMe drive through to it, and install the OS directly on the drive (Manjaro, for the record). So far no problems.

 

I have my Docker and libvirt images on a different SATA SSD that is in a cache pool; both of them were previously on this NVMe drive.

 

fingers crossed this continues to work without any new hardware needed.

 

Thanks for all your help and responses.

 

Link to comment
  • 1 year later...

In February 2023 I upgraded my NVMe cache drive from a Sabrent Rocket 1TB to a Samsung 980 Pro 2TB. I never had a single issue with the Sabrent drive - I just needed more capacity to deal with large one-off writes, and I went with the ‘best’ drive I could afford, which was the 980 Pro based off many reviews and benchmarks around continuous write/cache performance. 
 

But, I started having the same exact issue described in this thread with my Unraid server, I would say around 4 months ago now. Probably July 2023 onwards. It happened maybe once a month, but has now started happening more frequently in the last month or so, having gone ‘down’ 3 times in the last 30 days. The only way I get the cache drive to appear again is by shutting down unraid, removing the SSD from the M.2 slot, re-seating the SSD, and booting back up. This is causing unnecessary additional parity checks to run on my system too, as I guess unraid thinks a new device has been added or something. 
 

My 980 Pro is on the 5B2QGXA7 firmware (which it shipped with, this hasn’t been changed by me). This is the firmware known to fix some other issues with 980 Pros bricking into a read only state. 
 

Unraid is on 6.11.0 but I am not sure from memory if there were any changes to the Unraid version before/after installing the 980 Pro in my system, so I don’t know if there is any correlation there. 
 

Regardless of all of this, I have added the line of code shared in this thread into my flash syslinux config, and will report back if this resolves the issue. 
 

It’s infuriating coming home late at night for none of my motion sensors/automations to kick in, and immediately I know that the SSD has gone down again, killing half of my smart home stuff in the process 🙃
 

 

Edited by ju_media
Fixed emoji
Link to comment
On 12/22/2021 at 7:52 AM, JorgeB said:

Some hardware combinations could have kernel related issues, most users have NVMe devices without issues, including myself.

@JorgeB would you be so kind as to suggest your recommended NVMe drives for use as the cache drive(s) that you know from experience are stable on Unraid? Like I said, I never had issues with the standard Sabrent Rocket drive, but it didn't have the best sustained write performance once I hit around the 500GB mark. The Samsung seems to just keep going, not slowing down even when I write 1TB+ to it, but this issue with the frequent drop-outs is a complete deal breaker.
 

If you have a personal recommendation that has similar sustained write/internal cache performance, but is 100% stable, I’d love to hear it as I’ll just swap this one out I think. Not worth the hassle anymore. 

Edited by ju_media
Added last paragraph to clarify the point of the post
Link to comment
8 hours ago, ju_media said:

The only way I get the cache drive to appear again is by shutting down unraid, removing the SSD from the M.2 slot, re-seating the SSD, and booting back up.

It's unlikely that you need to re-seat the device; power cycling the server (not just rebooting) should be enough.

 

8 hours ago, ju_media said:

would you be so kind as to suggest your recommended NVMe drives for use as the cache drive(s) that you know from experience are stable on Unraid?

I think this type of issue is more related to the board/device combo involved, and possibly also the kernel version. Samsung devices are usually fine, and that model is one I would recommend; I have a 970 Pro and multiple 980 devices working fine.

Link to comment
