Cache Pool - Sabrent M.2 failing



hi,

i have a cache pool with 2x Sabrent 1TB Rocket NVMe PCIe 4.0 M.2 drives (ASRock X570M, Ryzen 3400G, 32GB RAM).

a few days after completing my unraid build, one of the M.2s failed.  i ordered a replacement (this time with a heatsink) and replaced the failed drive.  after just 2 days the replacement drive failed as well (same slot).

temps were/are ok (around 30C on average, never observed higher than 35C).  i tried to scrub/repair, but it did not work ("uncorrectable").  unraid shows the cache 1 drive as missing.  i'm thinking of getting drives from a different brand, but i'm not sure if it is really a drive issue or the mobo.

appreciate any advice.

 

Link to comment

Unfortunately, due to a large rsync transfer being logged, the syslog is missing the part where the device dropped. If you reboot/power cycle the server, does the device come back online? If yes, this sometimes helps with dropping NVMe devices:

 

Some NVMe devices have issues with power states on Linux. Try this: on the main GUI page click on flash, scroll down to "Syslinux Configuration", make sure it's set to "menu view" (on the top right), and add this to your default boot option, after "append" and before "initrd=/bzroot":

 

nvme_core.default_ps_max_latency_us=0

Reboot and see if it makes a difference. Also make sure the BIOS is up to date.
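
To make the placement concrete, here is a minimal sketch in Python of where the parameter ends up on the boot line. The sample line `append initrd=/bzroot` is an assumption about what a stock default boot entry looks like, and `add_kernel_param` is a hypothetical helper for illustration only, not part of Unraid; in practice the edit is made by hand in the GUI as described above.

```python
PARAM = "nvme_core.default_ps_max_latency_us=0"


def add_kernel_param(append_line: str, param: str = PARAM) -> str:
    """Return the append line with `param` placed right after 'append'."""
    # Preserve whatever indentation the config line already has.
    indent = append_line[: len(append_line) - len(append_line.lstrip())]
    tokens = append_line.split()
    if not tokens or tokens[0] != "append":
        raise ValueError("not an append line")
    if param in tokens:
        return append_line  # already present, leave the line untouched
    return indent + " ".join([tokens[0], param] + tokens[1:])


# Assumed stock boot line; yours may carry extra options after 'append'.
print(add_kernel_param("  append initrd=/bzroot"))
```

After the reboot, `cat /proc/cmdline` shows the kernel command line that was actually used, so it is a quick way to confirm the parameter was picked up.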

 

 

Link to comment
2 hours ago, JorgeB said:

nvme_core.default_ps_max_latency_us=0


hey,

added the line "nvme_core.default_ps_max_latency_us=0" as instructed.  i can now see the missing NVMe drive.  added it back to the cache pool, hope it will stay.  thanks for your help.

another question, i have an option to use Sabrent Rocket Gen4 (non-PLUS) or Samsung 980 Pro drives.  which one is preferred? (i do not intend to have any VMs, just storage and media streaming).

 

Edited by onufry
Link to comment
1 hour ago, onufry said:

i have an option to use Sabrent Rocket Gen4 (non-PLUS) or Samsung 980 Pro drives.  which one is preferred?

Don't really know much about Sabrent. The 980 Pro is probably one of the best NVMe devices around, but you likely won't notice much performance difference. Still worth trying the 980 if those continue to drop.

Link to comment

hi again,

it happened again.  1 of the 2 cache drives failed (same slot).

as before, after it failed i can see it listed in the pool, but cannot see the temp (only * is shown).  the SMART report shows "Smartctl open device: /dev/nvme0 failed: No such device".

i should say i have not updated the BIOS yet.  i will do it tomorrow after i swap both Sabrent cache drives for Samsung ones.  in any case i'm attaching the diagnostics, hope there is something in it that can point to the cause of the issue.

jj-nas-diagnostics-20210102-1530.zip

cheers for your help.

 

Link to comment
11 hours ago, onufry said:

hope there is something in it that can point to the cause of the issue.

Jan  2 15:01:59 JJ-NAS kernel: nvme nvme0: controller is down; will reset: CSTS=0x3, PCI_STATUS=0x10
Jan  2 15:02:30 JJ-NAS kernel: nvme nvme0: Device not ready; aborting reset
Jan  2 15:02:30 JJ-NAS kernel: nvme nvme0: Removing after probe failure status: -19

 

It's unlikely to be related to power states, since that's usually mentioned in the log when it is. The device appears to be dropping abruptly, but I can't say if it's a device or board issue.
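
For anyone reading log lines like the ones above, the `CSTS` value can be decoded bit by bit. The sketch below is in Python and the bit layout is my reading of the Controller Status register in the NVMe base specification, so treat it as an assumption; `decode_csts` is a hypothetical helper, not an existing tool.

```python
import re

# Assumed CSTS bit layout (per my reading of the NVMe base specification).
CSTS_FLAGS = {
    0: "RDY (controller ready)",
    1: "CFS (controller fatal status)",
    4: "NSSRO (NVM subsystem reset occurred)",
    5: "PP (processing paused)",
}


def decode_csts(value: int) -> list[str]:
    """List the status flags that are set in a CSTS register value."""
    flags = [name for bit, name in CSTS_FLAGS.items() if value & (1 << bit)]
    shst = (value >> 2) & 0b11  # bits 3:2 hold the shutdown status field
    if shst:
        flags.append(f"SHST={shst} (shutdown status)")
    return flags


LINE = ("Jan  2 15:01:59 JJ-NAS kernel: nvme nvme0: controller is down; "
        "will reset: CSTS=0x3, PCI_STATUS=0x10")

match = re.search(r"CSTS=(0x[0-9a-fA-F]+)", LINE)
if match:
    print(decode_csts(int(match.group(1), 16)))
```

Under that reading, `CSTS=0x3` has both the ready bit and the fatal status bit set, which fits a controller that dropped abruptly rather than one that was powered down cleanly. It does not by itself say whether the drive or the board is at fault. The `-19` in the last log line corresponds to `ENODEV` on Linux, the same "No such device" that smartctl reports.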

Link to comment
5 hours ago, JorgeB said:

Jan  2 15:01:59 JJ-NAS kernel: nvme nvme0: controller is down; will reset: CSTS=0x3, PCI_STATUS=0x10

hi,

yes, i've noticed the "controller is down" message.  anyway, i have updated the BIOS and replaced the NVMe drives with Samsung ones.  so far over 12 hours without failures...

fingers crossed.

cheers

Link to comment

posting an update, maybe someone finds this info useful.

my system seems to be stable after replacing both Sabrent 1TB Rocket NVMe PCIe 4.0 M.2 drives with Samsung 980 PRO SSD 1TB - M.2 NVMe.  i have not had any cache drive failures for 3 days now.

the only difference i have noticed is that samsung drives' temperatures are slightly higher by 3C-4C, hovering around 33C-35C.  

Link to comment
  • 2 weeks later...
On 1/5/2021 at 8:36 AM, onufry said:

posting an update, maybe someone finds this info useful.

yet another update: after 2 weeks without any issues, one of the samsung cache drives failed as well.  i have not captured diagnostics.  awaiting a replacement drive.

i did suspect the motherboard, but it seems my issue has happened to others using different NVMe brands (see link to another thread below).

overheating seems to be the best guess as to the source of the problem.  i will try using a heatsink with the NVMe drive again this time.

 

Link to comment
