onufry Posted January 1, 2021
hi, i have a cache pool with 2x Sabrent 1TB Rocket NVMe PCIe 4.0 M.2 drives (ASRock X570M, Ryzen 3400G, 32GB RAM). a few days after completing my unraid build, one of the M.2 drives failed. i ordered a replacement (this time with a heatsink) and swapped out the failed drive. after just 2 days the replacement failed as well (same slot). temps were/are ok (around 30C on average, never observed higher than 35C). i tried to scrub/repair, but it did not work: "uncorrectable". unraid shows the cache 1 drive as missing. i'm thinking of getting drives from a different brand, but i'm not sure if it is really a drive issue or the mobo. appreciate any advice.
JorgeB Posted January 2, 2021
Please post the diagnostics, ideally before rebooting.
onufry Posted January 2, 2021
thanks for your reply. diagnostics attached. i'm pretty sure i ran them before rebooting and before removing the failed drive from the pool.
jj-nas-diagnostics-20210101-1710.zip
JorgeB Posted January 2, 2021
Unfortunately, because a large rsync transfer was logged, the syslog no longer covers the moment the device dropped. If you reboot/power cycle the server, does the device come back online? If yes, this sometimes helps with dropping NVMe devices:
Some NVMe devices have issues with power states on Linux. Try this: on the main GUI page click on flash, scroll down to "Syslinux Configuration", make sure it's set to "menu view" (on the top right) and add this to your default boot option, after "append" and before "initrd=/bzroot":
nvme_core.default_ps_max_latency_us=0
Reboot and see if it makes a difference. Also make sure the BIOS is up to date.
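A minimal Python sketch of the edit described above, for anyone who wants to double-check the result. It assumes the stock boot entry contains a line of the form `append initrd=/bzroot`; the helper names (`add_boot_param`, `param_is_active`, `running_cmdline`) are made up for illustration and are not part of Unraid.

```python
# Parameter quoted in the post above.
PARAM = "nvme_core.default_ps_max_latency_us=0"


def add_boot_param(append_line: str, param: str = PARAM) -> str:
    """Return the 'append' line with param placed right after 'append'.

    Placing it first keeps it ahead of initrd=/bzroot, as the post asks.
    Leading indentation is not preserved (tokens are re-joined with spaces).
    """
    tokens = append_line.split()
    if not tokens or tokens[0] != "append":
        raise ValueError("expected a line starting with 'append'")
    if param in tokens:
        # Already present: return the normalized line unchanged.
        return " ".join(tokens)
    return " ".join([tokens[0], param, *tokens[1:]])


def param_is_active(cmdline: str, param: str = PARAM) -> bool:
    """True if the kernel command line string contains the parameter."""
    return param in cmdline.split()


def running_cmdline(path: str = "/proc/cmdline") -> str:
    """Read the command line the running Linux kernel was booted with."""
    with open(path) as handle:
        return handle.read()
```

For example, `add_boot_param("append initrd=/bzroot")` yields `append nvme_core.default_ps_max_latency_us=0 initrd=/bzroot`, and after the reboot `param_is_active(running_cmdline())` should be true if the kernel actually picked the option up.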
onufry Posted January 2, 2021
hey, i added the line "nvme_core.default_ps_max_latency_us=0" as instructed. i can now see the missing NVMe drive. i added it back to the cache pool, hope it will stay. thanks for your help.
another question: i have the option to use Sabrent Rocket Gen4 (non-PLUS) or Samsung 980 Pro drives. which one is preferred? (i do not intend to run any VMs, just storage and media streaming.)
Edited January 2, 2021 by onufry
JorgeB Posted January 2, 2021
1 hour ago, onufry said: i have an option to use Sabrent Rocket Gen4 (non-PLUS) or Samsung 980 Pro drives. which one is preferred?
I don't really know much about Sabrent. The 980 Pro is probably one of the best NVMe devices around, but you likely won't notice much of a performance difference. It's still worth trying the 980 if those continue to drop.
onufry Posted January 2, 2021
hi again, it happened again. 1 of the 2 cache drives failed (same slot). as before, after the failure i can still see it listed in the pool, but i cannot see its temp (only * is shown). the SMART report shows "Smartctl open device: /dev/nvme0 failed: No such device". i should say i have not updated the BIOS yet. i will do it tomorrow after i swap both Sabrent cache drives for Samsung ones. in any case i'm attaching the diagnostics, hope there is something in them that can point to the cause of the issue.
jj-nas-diagnostics-20210102-1530.zip
cheers for your help.
JorgeB Posted January 3, 2021
11 hours ago, onufry said: hope there is something in it that can point what is the cause of the issue.
Jan 2 15:01:59 JJ-NAS kernel: nvme nvme0: controller is down; will reset: CSTS=0x3, PCI_STATUS=0x10
Jan 2 15:02:30 JJ-NAS kernel: nvme nvme0: Device not ready; aborting reset
Jan 2 15:02:30 JJ-NAS kernel: nvme nvme0: Removing after probe failure status: -19
It's unlikely to be related to power states, as that's usually mentioned in the log when it is. The device appears to be dropping abruptly, but I can't say if it's a device or board issue.
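A small Python sketch for pulling lines like the three above out of a syslog and decoding the numbers in them. The sample lines are copied from the post; the function names are made up for illustration. It assumes the NVMe specification's CSTS register layout (bit 0 = ready, bit 1 = controller fatal status) and that the "probe failure status" is a negated Linux errno. It only summarizes what the kernel logged and does not settle whether the drive or the board is at fault.

```python
import errno
import re

# Log lines quoted in the post above.
SAMPLE_LINES = [
    "Jan 2 15:01:59 JJ-NAS kernel: nvme nvme0: controller is down; will reset: CSTS=0x3, PCI_STATUS=0x10",
    "Jan 2 15:02:30 JJ-NAS kernel: nvme nvme0: Device not ready; aborting reset",
    "Jan 2 15:02:30 JJ-NAS kernel: nvme nvme0: Removing after probe failure status: -19",
]

NVME_EVENT = re.compile(r"kernel: nvme (?P<dev>nvme\d+): (?P<msg>.*)$")
CSTS_VALUE = re.compile(r"CSTS=(0x[0-9a-fA-F]+)")
PROBE_STATUS = re.compile(r"status: (-?\d+)")


def nvme_events(lines):
    """Return (device, message) pairs for kernel NVMe driver messages."""
    events = []
    for line in lines:
        match = NVME_EVENT.search(line)
        if match:
            events.append((match.group("dev"), match.group("msg").strip()))
    return events


def csts_flags(message):
    """Decode the two low CSTS bits if the message carries a CSTS value."""
    match = CSTS_VALUE.search(message)
    if match is None:
        return None
    value = int(match.group(1), 16)
    return {"ready": bool(value & 0x1), "fatal": bool(value & 0x2)}


def probe_errno(message):
    """Map a 'status: -N' value to its errno name, if there is one."""
    match = PROBE_STATUS.search(message)
    if match is None:
        return None
    return errno.errorcode.get(abs(int(match.group(1))))
```

On the sample lines, `csts_flags` reports both the ready and the fatal bit set for `CSTS=0x3`, and `probe_errno` maps `-19` to `ENODEV`, the same "No such device" condition smartctl reported for /dev/nvme0.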
onufry Posted January 3, 2021
hi, yes, i've noticed the "controller is down" message. anyway, i have updated the BIOS and replaced the NVMe drives with Samsung ones. so far over 12 hours without failures... fingers crossed. cheers
onufry Posted January 5, 2021
posting an update, maybe someone will find this info useful. my system seems to be stable after replacing both SABRENT 1TB Rocket NVMe PCIe 4.0 M.2 drives with Samsung 980 PRO SSD 1TB M.2 NVMe drives. i have not had any cache drive failures for 3 days now. the only difference i have noticed is that the samsung drives run slightly warmer, by 3C-4C, hovering around 33C-35C.
onufry Posted January 18, 2021
yet another update: after 2 weeks without any issues, one of the samsung cache drives failed as well. i did not capture diagnostics. awaiting a replacement drive. i did suspect the motherboard, but it seems my issue has happened to others using different NVMe brands (see link to another thread below). overheating seems to be the best guess as to the source of the problem, so i will try using a heatsink on the NVMe drive again this time.
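One way to test the overheating guess is to log the drive temperature over time and see what it was doing before a drop. The sketch below is only an illustration: it assumes smartmontools 7.0 or newer (which added JSON output via `-j`) and that the JSON report carries a top-level `temperature.current` field in degrees Celsius. The function names and the default device path are hypothetical.

```python
import json
import subprocess
from typing import Optional


def temperature_c(smartctl_json: str) -> Optional[int]:
    """Extract the current temperature from smartctl JSON output, if present."""
    try:
        data = json.loads(smartctl_json)
    except json.JSONDecodeError:
        return None
    return data.get("temperature", {}).get("current")


def read_temperature(device: str = "/dev/nvme0") -> Optional[int]:
    """Run smartctl on the device and return its temperature, or None.

    None is also what you get once the device has dropped off the bus,
    since smartctl can no longer open it.
    """
    result = subprocess.run(
        ["smartctl", "-j", "-a", device],
        capture_output=True,
        text=True,
    )
    return temperature_c(result.stdout)
```

Calling `read_temperature()` from a scheduled job and appending the value to a file on the array (not on the cache pool) would leave a record that survives the drive dropping.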