
Cache disk (pool member) going bad?



I have a cache pool consisting of two SSDs:

 

ADATA SU760 (SATA 256GB)
ADATA SX6000PNP (NVME 256GB)

 

and I'm wondering if my NVME disk ADATA SX6000PNP is about to go bad.

 

Below is a summary of my recent events.

 

I recently replaced my GPU (about 3 days ago) because of loud idle fan noise from an AMD FirePro v3900 (I switched to an AMD/ATI Radeon HD 4550), booted my system, and continued with my day. The day after, I noticed my docker containers were all stopped, so without thinking I started them back up. Later I looked through my logs to figure out why they had stopped; the exited times listed on the containers put them all at about 4 AM. It turned out this was my CA Backup / Restore plugin doing a backup of my appdata.

Typically my appdata backups last about 30 minutes, but this one ran for 15.5 hours, ending at 7:27 PM. When the backup job finally ended I checked my system logs, and about 1 hour and 15 minutes before the backup finished (6:12 PM) I could see the following entries:

 

Oct  3 18:12:52 Tower kernel: nvme nvme0: I/O 22 QID 4 timeout, aborting
Oct  3 18:12:52 Tower kernel: nvme nvme0: Abort status: 0x0
Oct  3 18:13:51 Tower kernel: nvme nvme0: I/O 244 QID 6 timeout, aborting
Oct  3 18:13:51 Tower kernel: nvme nvme0: Abort status: 0x0
Oct  3 18:13:53 Tower kernel: nvme nvme0: I/O 269 QID 2 timeout, aborting
Oct  3 18:13:53 Tower kernel: nvme nvme0: Abort status: 0x0
Oct  3 18:20:51 Tower webGUI: Successful login user root from 192.168.4.121
Oct  3 18:34:18 Tower kernel: nvme nvme0: I/O 295 QID 2 timeout, aborting
Oct  3 18:34:18 Tower kernel: nvme nvme0: Abort status: 0x0
Oct  3 18:35:09 Tower kernel: nvme nvme0: I/O 804 QID 1 timeout, aborting
Oct  3 18:35:09 Tower kernel: nvme nvme0: Abort status: 0x0
Oct  3 18:43:36 Tower  ool www[23554]: Successful logout user root from 192.168.4.121
Oct  3 19:25:11 Tower kernel: nvme nvme0: I/O 167 QID 7 timeout, aborting
Oct  3 19:25:11 Tower kernel: nvme nvme0: Abort status: 0x0
Oct  3 19:25:20 Tower kernel: nvme nvme0: I/O 198 QID 1 timeout, aborting
Oct  3 19:25:20 Tower kernel: nvme nvme0: Abort status: 0x0
Oct  3 19:25:42 Tower kernel: nvme nvme0: I/O 170 QID 7 timeout, aborting
Oct  3 19:25:42 Tower kernel: nvme nvme0: Abort status: 0x0
Oct  3 19:25:50 Tower kernel: nvme nvme0: I/O 198 QID 1 timeout, reset controller
Oct  3 19:26:11 Tower kernel: nvme nvme0: Device not ready; aborting reset, CSTS=0x1
Oct  3 19:26:14 Tower kernel: nvme nvme0: 7/0/0 default/read/poll queues
Oct  3 19:26:44 Tower kernel: nvme nvme0: I/O 198 QID 1 timeout, disable controller
Oct  3 19:26:44 Tower kernel: I/O error, dev nvme0n1, sector 0 op 0x1:(WRITE) flags 0x800 phys_seg 0 prio class 0
Oct  3 19:26:44 Tower kernel: I/O error, dev nvme0n1, sector 305988224 op 0x1:(WRITE) flags 0x1800 phys_seg 8 prio class 0
Oct  3 19:26:44 Tower kernel: BTRFS error (device sdc1): bdev /dev/nvme0n1p1 errs: wr 1, rd 0, flush 0, corrupt 0, gen 0
Oct  3 19:26:44 Tower kernel: BTRFS error (device sdc1): bdev /dev/nvme0n1p1 errs: wr 2, rd 0, flush 0, corrupt 0, gen 0
Oct  3 19:26:44 Tower kernel: BTRFS error (device sdc1): bdev /dev/nvme0n1p1 errs: wr 2, rd 0, flush 1, corrupt 0, gen 0
Oct  3 19:26:44 Tower kernel: nvme nvme0: failed to mark controller live state
Oct  3 19:26:44 Tower kernel: nvme nvme0: Removing after probe failure status: -19
Oct  3 19:26:44 Tower kernel: nvme0n1: detected capacity change from 500118192 to 0
Oct  3 19:26:44 Tower kernel: BTRFS warning (device sdc1): lost page write due to IO error on /dev/nvme0n1p1 (-5)
Oct  3 19:26:44 Tower kernel: BTRFS error (device sdc1): bdev /dev/nvme0n1p1 errs: wr 3, rd 0, flush 1, corrupt 0, gen 0
Oct  3 19:26:44 Tower kernel: BTRFS error (device sdc1): bdev /dev/nvme0n1p1 errs: wr 4, rd 0, flush 1, corrupt 0, gen 0
Oct  3 19:26:44 Tower kernel: BTRFS error (device sdc1): bdev /dev/nvme0n1p1 errs: wr 5, rd 0, flush 1, corrupt 0, gen 0
Oct  3 19:26:44 Tower kernel: BTRFS warning (device sdc1): lost page write due to IO error on /dev/nvme0n1p1 (-5)
Oct  3 19:26:44 Tower kernel: BTRFS error (device sdc1): bdev /dev/nvme0n1p1 errs: wr 6, rd 0, flush 1, corrupt 0, gen 0
Oct  3 19:26:44 Tower kernel: BTRFS error (device sdc1): error writing primary super block to device 2
Oct  3 19:26:44 Tower kernel: BTRFS error (device sdc1): bdev /dev/nvme0n1p1 errs: wr 6, rd 0, flush 2, corrupt 0, gen 0
Oct  3 19:26:44 Tower kernel: BTRFS warning (device sdc1): lost page write due to IO error on /dev/nvme0n1p1 (-5)
Oct  3 19:26:44 Tower kernel: BTRFS error (device sdc1): bdev /dev/nvme0n1p1 errs: wr 7, rd 0, flush 2, corrupt 0, gen 0
Oct  3 19:26:44 Tower kernel: BTRFS error (device sdc1): error writing primary super block to device 2
Oct  3 19:26:44 Tower kernel: BTRFS error (device sdc1): bdev /dev/nvme0n1p1 errs: wr 8, rd 0, flush 2, corrupt 0, gen 0
Oct  3 19:26:44 Tower kernel: BTRFS warning (device sdc1): lost page write due to IO error on /dev/nvme0n1p1 (-5)
Oct  3 19:26:44 Tower kernel: BTRFS error (device sdc1): error writing primary super block to device 2
Oct  3 19:26:44 Tower kernel: BTRFS warning (device sdc1): lost page write due to IO error on /dev/nvme0n1p1 (-5)
Oct  3 19:26:44 Tower kernel: BTRFS error (device sdc1): error writing primary super block to device 2
Oct  3 19:26:44 Tower kernel: BTRFS warning (device sdc1): lost page write due to IO error on /dev/nvme0n1p1 (-5)
Oct  3 19:26:44 Tower kernel: BTRFS error (device sdc1): error writing primary super block to device 2
Oct  3 19:26:44 Tower kernel: BTRFS warning (device sdc1): lost page write due to IO error on /dev/nvme0n1p1 (-5)
Oct  3 19:26:44 Tower kernel: BTRFS error (device sdc1): error writing primary super block to device 2
Oct  3 19:26:44 Tower kernel: BTRFS warning (device sdc1): lost page write due to IO error on /dev/nvme0n1p1 (-5)
Oct  3 19:26:44 Tower kernel: BTRFS error (device sdc1): error writing primary super block to device 2
Oct  3 19:26:44 Tower kernel: BTRFS warning (device sdc1): lost page write due to IO error on /dev/nvme0n1p1 (-5)
Oct  3 19:26:44 Tower kernel: BTRFS error (device sdc1): error writing primary super block to device 2

 

The BTRFS errors kept coming, and eventually I was seeing lots of entries like the following (repeating over and over):

 

Oct  3 19:27:04 Tower kernel: BTRFS warning (device sdc1): direct IO failed ino 260 rw 0,0 sector 0x60ee1d0 len 4096 err no 10
Oct  3 19:27:04 Tower kernel: BTRFS warning (device sdc1): direct IO failed ino 260 rw 0,0 sector 0x60ee1d8 len 4096 err no 10
Oct  3 19:27:04 Tower kernel: BTRFS warning (device sdc1): direct IO failed ino 260 rw 0,0 sector 0x60ee1e0 len 4096 err no 10
Oct  3 19:27:04 Tower kernel: BTRFS warning (device sdc1): direct IO failed ino 260 rw 0,0 sector 0x60ee1e8 len 4096 err no 10
Oct  3 19:27:04 Tower kernel: BTRFS warning (device sdc1): direct IO failed ino 260 rw 0,0 sector 0x60ee1f0 len 4096 err no 10
Oct  3 19:27:04 Tower kernel: BTRFS warning (device sdc1): direct IO failed ino 260 rw 0,0 sector 0x60ee1f8 len 4096 err no 10
Oct  3 19:27:04 Tower kernel: BTRFS warning (device sdc1): direct IO failed ino 260 rw 0,0 sector 0x60ee200 len 4096 err no 10
Oct  3 19:27:04 Tower kernel: BTRFS warning (device sdc1): direct IO failed ino 260 rw 0,0 sector 0x60ee208 len 4096 err no 10
Oct  3 19:27:04 Tower kernel: BTRFS warning (device sdc1): direct IO failed ino 260 rw 0,0 sector 0x60ee210 len 4096 err no 10
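For reference, the "errs: wr ..., rd ..., flush ..." numbers in the BTRFS errors above are cumulative per-device error counters, and they can be read back once the pool is mounted again. A minimal sketch, assuming the pool mounts at /mnt/cache:

# show persistent read/write/flush/corruption error counters for each pool member
btrfs device stats /mnt/cache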

 

At this point I decided to reboot the system. When it came back up, I noticed the ADATA SX6000PNP NVMe had a red status (I think it was unavailable), and at the bottom the option to start the array was greyed out, but I could check a box to bring it back up without the NVMe disk in the pool. I went ahead and chose the option to start it without the NVMe disk, and then I could see the following in my logs (repeating over and over):

 

Oct  3 20:59:35 Tower kernel: BTRFS info (device sdc1): found 7418 extents, stage: move data extents
Oct  3 20:59:36 Tower kernel: BTRFS info (device sdc1): found 7418 extents, stage: update data pointers
Oct  3 20:59:37 Tower kernel: BTRFS info (device sdc1): relocating block group 224467615744 flags data|raid1
Oct  3 21:00:49 Tower kernel: BTRFS info (device sdc1): found 7196 extents, stage: move data extents
Oct  3 21:00:51 Tower kernel: BTRFS info (device sdc1): found 7196 extents, stage: update data pointers
Oct  3 21:00:52 Tower kernel: BTRFS info (device sdc1): relocating block group 223393873920 flags data|raid1
Oct  3 21:01:58 Tower kernel: BTRFS info (device sdc1): found 7243 extents, stage: move data extents
Oct  3 21:02:00 Tower kernel: BTRFS info (device sdc1): found 7243 extents, stage: update data pointers

 

I let it run all night (in the logs I can see the BTRFS process ran until about 10:42, and at one point I got a cache disk overheated alert for the ADATA SU760 SATA disk), and this morning the logs seemed stable - though of course my NVMe disk did not appear anywhere in Unraid. At that point I uploaded my hardware profile in case there is anything to see there, and then I shut down my server.

 

I then started my server back up and went into the BIOS, where I could see the ADATA SX6000PNP NVMe disk was detected. I exited the BIOS and booted Unraid, and this time the NVMe disk was present, but it showed a blue status (in the pool section) with the message "All existing data on this device will be OVERWRITTEN when array is Started". I went ahead and started the array, and it once again shows the stop option greyed out and says "Disabled -- BTRFS operation is running". My assumption is that it is rebuilding the cache pool.
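My understanding is that the running operation is a BTRFS balance that re-mirrors the data onto the re-added device, so something like this from a terminal should show its progress - just a sketch, assuming the pool is mounted at /mnt/cache:

# show whether a balance is still running on the cache pool and how far along it is
btrfs balance status /mnt/cache

# list the pool members and how much data each currently holds
btrfs filesystem show /mnt/cache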

 

Does all this indicate my ADATA SX6000PNP NVME disk is about to go bad and maybe I should replace it?

 

...or is this just a case where the file system or partition table on the disk became corrupted for one reason or another (I think I read in another post that the BTRFS file system isn't the most stable)? Also, at this point I went ahead and uploaded another hardware profile.
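For what it's worth, once the NVMe is visible again I can also pull the drive's own health data to help answer that; a rough sketch of what I'd run, assuming the device comes back as /dev/nvme0:

# NVMe SMART/health info: media errors, percentage used, error log entries, temperature
smartctl -a /dev/nvme0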

 

Below are some more of my system specs:

 

CPU: AMD Ryzen 7 1700X
Mobo: B450 MSI Gaming Plus Max
RAM: 16GB of DDR4 (G.SKILL Ripjaws V Series 16GB, 2 x 8GB DDR4 3200)
GPU: AMD/ATI Radeon HD 4550 (was recently an AMD FirePro v3900, but I switched due to fan noise)

 

Thanks for any help anyone can provide. I can upload syslogs or other data if need be; just let me know if I need to scrub anything first. Again, thank you.

58 minutes ago, impact-trombone said:

Does all this indicate my ADATA SX6000PNP NVME disk is about to go bad and maybe I should replace it?

Not necessarily - it dropped offline, and this sometimes helps with that:

 

Some NVMe devices have issues with power states on Linux. Try this: on the main GUI page click on the flash drive, scroll down to "Syslinux Configuration", make sure it's set to "menu view" (on the top right), and add this to your default boot option, after "append initrd=/bzroot":

nvme_core.default_ps_max_latency_us=0

e.g.:

append initrd=/bzroot nvme_core.default_ps_max_latency_us=0
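For context, a minimal sketch of what the default boot label in /boot/syslinux/syslinux.cfg would look like after the edit (label names can vary per install):

label Unraid OS
  menu default
  kernel /bzimage
  append initrd=/bzroot nvme_core.default_ps_max_latency_us=0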


Reboot and see if it makes a difference.

7 hours ago, JorgeB said:

Not necessarily - it dropped offline, and this sometimes helps with that:

 

Some NVMe devices have issues with power states on Linux. Try this: on the main GUI page click on the flash drive, scroll down to "Syslinux Configuration", make sure it's set to "menu view" (on the top right), and add this to your default boot option, after "append initrd=/bzroot":

nvme_core.default_ps_max_latency_us=0

e.g.:

append initrd=/bzroot nvme_core.default_ps_max_latency_us=0


Reboot and see if it makes a difference.

Thank you for this suggestion. I will try this to see if it helps.

 

I've had this disk in my server for about 3 months and have never seen this issue. Any reason why it would start rearing its head now?

 

The only things I've done differently with this machine recently (in the past week) are upgrading Unraid from 6.9 to 6.11 and playing around with VMs with GPU passthrough. I wouldn't think those would cause a drive to have power state issues, but I know things that seem unrelated can sometimes affect one another.

 

Thank you for your help.
