[SOLVED (I think)] NVME drive failing in mirrored cache pool. BTRFS errors.


Solved by hoppers99

Hi Team,

 

I'm sorry if this is answered elsewhere. I've read a lot of different things that seem to be in the right territory, but I'm still not sure I understand enough to know the best approach.

This morning I had trouble playing something off my Plex server. I tried restarting the Docker container, and that failed. I soon realised my other Docker containers were broken in some way too. This led me to the logs, where I found a lot of BTRFS errors, and then to the subsequent googling. Best I can tell, one of the 2 NVMe drives in my cache pool is "unhappy" and had unmounted/disconnected/died...

I have remounted the pool in ro mode to try and copy data off, which has thrown a lot of IO errors, and I'm unsure how successful it could be called. I disabled Docker as part of trying to reduce any surplus disk access to the cache while backing up.

After rebooting the server, nvme1 shows again and I see some `read error corrected` and other messages about errors on the cache:

 

...
May  7 11:36:28 Artemis kernel: loop3: detected capacity change from 0 to 209715200
May  7 11:36:28 Artemis kernel: BTRFS warning (device nvme0n1p1): csum failed root 5 ino 390506 off 65536 csum 0xcc09b20c expected csum 0x90ce6228 mirror 1
May  7 11:36:28 Artemis kernel: BTRFS error (device nvme0n1p1): bdev /dev/nvme1n1p1 errs: wr 97755, rd 13117, flush 2864, corrupt 2, gen 0
May  7 11:36:28 Artemis kernel: BTRFS info (device nvme0n1p1): read error corrected: ino 390506 off 65536 (dev /dev/nvme1n1p1 sector 1472333904)
May  7 11:36:28 Artemis kernel: BTRFS error (device nvme0n1p1): parent transid verify failed on 1147752185856 wanted 13961728 found 13961392
May  7 11:36:28 Artemis kernel: BTRFS info (device nvme0n1p1): read error corrected: ino 0 off 1147752185856 (dev /dev/nvme1n1p1 sector 1945964096)
May  7 11:36:28 Artemis kernel: BTRFS info (device nvme0n1p1): read error corrected: ino 0 off 1147752189952 (dev /dev/nvme1n1p1 sector 1945964104)
May  7 11:36:28 Artemis kernel: BTRFS info (device nvme0n1p1): read error corrected: ino 0 off 1147752194048 (dev /dev/nvme1n1p1 sector 1945964112)
May  7 11:36:28 Artemis kernel: BTRFS info (device nvme0n1p1): read error corrected: ino 0 off 1147752198144 (dev /dev/nvme1n1p1 sector 1945964120)
May  7 11:36:28 Artemis kernel: BTRFS error (device nvme0n1p1): parent transid verify failed on 1148293398528 wanted 13961753 found 13961464
May  7 11:36:28 Artemis kernel: BTRFS info (device nvme0n1p1): read error corrected: ino 0 off 1148293398528 (dev /dev/nvme1n1p1 sector 879570784)
May  7 11:36:28 Artemis kernel: BTRFS info (device nvme0n1p1): read error corrected: ino 0 off 1148293402624 (dev /dev/nvme1n1p1 sector 879570792)
May  7 11:36:28 Artemis kernel: BTRFS info (device nvme0n1p1): read error corrected: ino 0 off 1148293406720 (dev /dev/nvme1n1p1 sector 879570800)
May  7 11:36:28 Artemis kernel: BTRFS info (device nvme0n1p1): read error corrected: ino 0 off 1148293410816 (dev /dev/nvme1n1p1 sector 879570808)
May  7 11:36:28 Artemis kernel: BTRFS error (device nvme0n1p1): parent transid verify failed on 1148299493376 wanted 13961753 found 13961464
May  7 11:36:28 Artemis kernel: BTRFS info (device nvme0n1p1): read error corrected: ino 0 off 1148299493376 (dev /dev/nvme1n1p1 sector 879582688)
May  7 11:36:28 Artemis kernel: BTRFS error (device nvme0n1p1): parent transid verify failed on 1147449704448 wanted 13961716 found 13961423
May  7 11:36:28 Artemis kernel: BTRFS: device fsid bbb9c5ed-2bf5-4e6b-8af1-280c4be09739 devid 1 transid 1847029 /dev/loop3 scanned by mount (10535)
May  7 11:36:28 Artemis kernel: BTRFS info (device loop3): using free space tree
May  7 11:36:28 Artemis kernel: BTRFS info (device loop3): has skinny extents
May  7 11:36:28 Artemis kernel: BTRFS error (device nvme0n1p1): parent transid verify failed on 1148351053824 wanted 13961755 found 13961466
May  7 11:36:28 Artemis kernel: BTRFS warning (device nvme0n1p1): csum failed root 5 ino 390506 off 26857144320 csum 0x1754e162 expected csum 0x08eeb7e3 mirror 1
May  7 11:36:28 Artemis kernel: BTRFS error (device nvme0n1p1): bdev /dev/nvme1n1p1 errs: wr 97755, rd 13117, flush 2864, corrupt 3, gen 0
May  7 11:36:28 Artemis kernel: BTRFS warning (device nvme0n1p1): csum failed root 5 ino 390506 off 26857148416 csum 0xcdb54873 expected csum 0x8941f998 mirror 1
May  7 11:36:28 Artemis kernel: BTRFS error (device nvme0n1p1): bdev /dev/nvme1n1p1 errs: wr 97755, rd 13117, flush 2864, corrupt 4, gen 0
May  7 11:36:28 Artemis kernel: BTRFS warning (device nvme0n1p1): csum failed root 5 ino 390506 off 26857152512 csum 0x30afaf5e expected csum 0x8941f998 mirror 1
May  7 11:36:28 Artemis kernel: BTRFS error (device nvme0n1p1): bdev /dev/nvme1n1p1 errs: wr 97755, rd 13117, flush 2864, corrupt 5, gen 0
May  7 11:36:28 Artemis kernel: BTRFS warning (device nvme0n1p1): csum failed root 5 ino 390506 off 26857156608 csum 0xfd2a2fd6 expected csum 0x8941f998 mirror 1
May  7 11:36:28 Artemis kernel: BTRFS error (device nvme0n1p1): bdev /dev/nvme1n1p1 errs: wr 97755, rd 13117, flush 2864, corrupt 6, gen 0
May  7 11:36:28 Artemis kernel: BTRFS warning (device nvme0n1p1): csum failed root 5 ino 390506 off 26857160704 csum 0xecf6f5fe expected csum 0x10a79df3 mirror 1
May  7 11:36:28 Artemis kernel: BTRFS error (device nvme0n1p1): bdev /dev/nvme1n1p1 errs: wr 97755, rd 13117, flush 2864, corrupt 7, gen 0
May  7 11:36:28 Artemis kernel: BTRFS warning (device nvme0n1p1): csum failed root 5 ino 390506 off 26857226240 csum 0x47125d6d expected csum 0x7675cf82 mirror 1
May  7 11:36:28 Artemis kernel: BTRFS error (device nvme0n1p1): bdev /dev/nvme1n1p1 errs: wr 97755, rd 13117, flush 2864, corrupt 8, gen 0
May  7 11:36:28 Artemis kernel: BTRFS warning (device nvme0n1p1): csum failed root 5 ino 390506 off 26857164800 csum 0xbbe64e09 expected csum 0x8941f998 mirror 1
May  7 11:36:28 Artemis kernel: BTRFS error (device nvme0n1p1): bdev /dev/nvme1n1p1 errs: wr 97755, rd 13117, flush 2864, corrupt 9, gen 0
May  7 11:36:28 Artemis kernel: BTRFS warning (device nvme0n1p1): csum failed root 5 ino 390506 off 26857168896 csum 0x5a4ecf38 expected csum 0x8941f998 mirror 1
May  7 11:36:28 Artemis kernel: BTRFS error (device nvme0n1p1): bdev /dev/nvme1n1p1 errs: wr 97755, rd 13117, flush 2864, corrupt 10, gen 0
May  7 11:36:28 Artemis kernel: BTRFS warning (device nvme0n1p1): csum failed root 5 ino 390506 off 26857172992 csum 0x8941f998 expected csum 0xd5b8699a mirror 1
May  7 11:36:28 Artemis kernel: BTRFS error (device nvme0n1p1): bdev /dev/nvme1n1p1 errs: wr 97755, rd 13117, flush 2864, corrupt 11, gen 0
May  7 11:36:28 Artemis kernel: BTRFS error (device nvme0n1p1): parent transid verify failed on 1148353626112 wanted 13961755 found 13961466
May  7 11:36:28 Artemis kernel: BTRFS error (device nvme0n1p1): parent transid verify failed on 1148353478656 wanted 13961755 found 13961466
May  7 11:36:28 Artemis kernel: BTRFS error (device nvme0n1p1): parent transid verify failed on 1148353183744 wanted 13961755 found 13961466
May  7 11:36:28 Artemis kernel: BTRFS error (device nvme0n1p1): parent transid verify failed on 1148353200128 wanted 13961755 found 13961466
May  7 11:36:28 Artemis kernel: BTRFS error (device nvme0n1p1): parent transid verify failed on 1148352905216 wanted 13961755 found 13961466
May  7 11:36:28 Artemis kernel: BTRFS info (device loop3): enabling ssd optimizations
...
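For anyone skimming a log like this: the `bdev ... errs:` lines carry running per-device counters (wr, rd, flush, corrupt, gen), and in the excerpt above every one of them points at /dev/nvme1n1p1. Below is a minimal Python sketch, assuming the syslog line format shown above, that pulls out the most recent counters per device. The function name `latest_btrfs_counters` is made up for illustration and is not part of Unraid or btrfs-progs.

```python
import re

# Matches lines such as:
#   BTRFS error (device nvme0n1p1): bdev /dev/nvme1n1p1 errs: wr 97755, rd 13117, flush 2864, corrupt 2, gen 0
# The format is assumed from the excerpt in this thread.
ERRS_RE = re.compile(
    r"bdev (?P<dev>\S+) errs: wr (?P<wr>\d+), rd (?P<rd>\d+), "
    r"flush (?P<flush>\d+), corrupt (?P<corrupt>\d+), gen (?P<gen>\d+)"
)


def latest_btrfs_counters(lines):
    """Return {device: counters}, keeping the last counters seen per device."""
    latest = {}
    for line in lines:
        match = ERRS_RE.search(line)
        if match:
            fields = match.groupdict()
            dev = fields.pop("dev")
            latest[dev] = {name: int(value) for name, value in fields.items()}
    return latest


# Sample lines copied from the log excerpt above.
sample = [
    "May  7 11:36:28 Artemis kernel: BTRFS error (device nvme0n1p1): bdev /dev/nvme1n1p1 errs: wr 97755, rd 13117, flush 2864, corrupt 2, gen 0",
    "May  7 11:36:28 Artemis kernel: BTRFS info (device nvme0n1p1): read error corrected: ino 390506 off 65536 (dev /dev/nvme1n1p1 sector 1472333904)",
    "May  7 11:36:28 Artemis kernel: BTRFS error (device nvme0n1p1): bdev /dev/nvme1n1p1 errs: wr 97755, rd 13117, flush 2864, corrupt 11, gen 0",
]

for dev, counters in latest_btrfs_counters(sample).items():
    print(dev, counters)
```

A device whose counters keep climbing while its mirror partner never appears in these lines is the likely suspect, which matches what happened here with nvme1.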

 

 

I've attached diagnostics output from both before and after the reboot.

Once the server was rebooted and nvme1 was showing again, I tried to start Docker again. The Docker start failed, and at this point I'm rather lost.

My cache was set up in mirrored mode, so I went looking for options to just remove the failed drive from the pool, but reading about balancing etc., and about a number of people who have managed to break things, left me confused. I "think" I should be able to remove the broken drive from the pool, but thought I'd seek the wisdom of the forum before I lunch anything more than may already be lost.

Suggestions, or even pointers to docs I should be following for my situation, would be appreciated.

 

artemis-diagnostics-20230507-1157.zip artemis-diagnostics-20230507-1005.zip

Edited by hoppers99
Adding solved and tweaking title
  • Solution

Okay, I've got myself out of immediate failure I think (fingers crossed). 

 

I could not stop the array (/mnt/cache would not unmount despite lsof showing nothing using it), so I disabled auto start and rebooted the server. I then managed to remove the faulty NVMe drive from the cache pool, and then (ticking the small checkbox confirming I really wanted to do it) started the array again.

I'm now getting a number of 'reallocating block group' log messages in syslog and the cache files are accessible (I "think" this is btrfs doing a balance itself...). I was also able to start Docker again. I'm now going to leave the cache to reallocate whatever else it may want while I go check the dates and warranties on the failed NVMe (assuming it's the drive and not an issue with the motherboard slot).

 

I'd still be interested if anyone can help explain if I did the right thing or just took a risky option that fluked ending well.

 

I pieced this together largely by guesswork from a few forum posts I had read (and I say guesswork because I still have no real understanding of balance or scrub, and only a rough idea of the actual error I was having).

  • hoppers99 changed the title to [SOLVED (I think)] NVME drive failing in mirrored cache pool. BTRFS errors.
May  7 04:16:26 Artemis kernel: nvme nvme1: I/O 322 QID 4 timeout, aborting
May  7 04:16:56 Artemis kernel: nvme nvme1: I/O 322 QID 4 timeout, reset controller
May  7 04:16:58 Artemis kernel: nvme nvme1: I/O 29 QID 0 timeout, reset controller
May  7 04:20:05 Artemis kernel: nvme nvme1: Device not ready; aborting reset, CSTS=0x1
May  7 04:20:05 Artemis kernel: nvme nvme1: Abort status: 0x371
May  7 04:20:05 Artemis kernel: nvme nvme1: Removing after probe failure status: -19

 

The NVMe device dropped offline. The following can sometimes help: on the main GUI page click on the flash drive, scroll down to "Syslinux Configuration", make sure it's set to "menu view" (top right), and add this to your default boot option, after "append initrd=/bzroot":

nvme_core.default_ps_max_latency_us=0 pcie_aspm=off

e.g.:

append initrd=/bzroot nvme_core.default_ps_max_latency_us=0 pcie_aspm=off


Reboot, run a scrub on the pool and see if it makes a difference.
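
As far as I know, `nvme_core.default_ps_max_latency_us=0` turns off NVMe autonomous power state transitions and `pcie_aspm=off` turns off PCIe Active State Power Management, which is why the pair is suggested for drives that drop offline. If you would rather script the boot-line change than edit it by hand, here is a minimal Python sketch of the string handling only. The helper name `add_kernel_params` is hypothetical, and it only compares whole tokens, so an existing `pcie_aspm=force` would not be replaced.

```python
def add_kernel_params(append_line, params):
    """Return the append line with any missing params added at the end.

    Existing tokens are left alone, so running this twice changes nothing.
    Leading indentation (as used in syslinux.cfg) is preserved.
    """
    stripped = append_line.lstrip()
    indent = append_line[: len(append_line) - len(stripped)]
    tokens = stripped.split()
    for param in params:
        if param not in tokens:
            tokens.append(param)
    return indent + " ".join(tokens)


# The two parameters suggested in this thread.
PARAMS = ["nvme_core.default_ps_max_latency_us=0", "pcie_aspm=off"]

print(add_kernel_params("append initrd=/bzroot", PARAMS))
```

This only builds the new line; writing it back to the flash drive's Syslinux configuration is still best done through the GUI as described above.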

 

P.S.

May  2 06:26:03 Artemis kernel: XFS (md4): Unmount and run xfs_repair

Check filesystem on disk4.

39 minutes ago, JorgeB said:
nvme_core.default_ps_max_latency_us=0 pcie_aspm=off

 

Thanks @JorgeB, I had actually tried the latency flag but not the PCIe one. I've added both now, still running on the single NVMe. I think I'll look to just replace both cache drives anyway, just to be safe. The failed one should be under warranty if I can find the paperwork! :D

And thanks for the spot on disk4! Besides the logs, is there anywhere I should be watching where I might see that sort of warning?

Edited by hoppers99
Tagging
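
On the question of watching for this sort of warning: one low-tech option is to scan the syslog periodically for the kinds of messages quoted in this thread. Below is a minimal Python sketch under that assumption. The patterns are taken only from the lines posted above, the function name `find_storage_warnings` is made up for illustration, and the list is certainly not exhaustive.

```python
import re

# Patterns based only on messages quoted in this thread; extend as needed.
PATTERNS = {
    "xfs": re.compile(r"XFS \((?P<dev>[^)]+)\): .*xfs_repair"),
    "btrfs": re.compile(r"BTRFS (?:error|warning) \(device (?P<dev>[^)]+)\)"),
    "nvme": re.compile(r"nvme (?P<dev>nvme\d+): .*timeout"),
}


def find_storage_warnings(lines):
    """Return (kind, device, line) for every line matching a known pattern."""
    hits = []
    for line in lines:
        for kind, pattern in PATTERNS.items():
            match = pattern.search(line)
            if match:
                hits.append((kind, match.group("dev"), line.rstrip()))
                break  # one category per line is enough
    return hits


# Sample lines copied from earlier posts in this thread.
sample = [
    "May  2 06:26:03 Artemis kernel: XFS (md4): Unmount and run xfs_repair",
    "May  7 04:16:26 Artemis kernel: nvme nvme1: I/O 322 QID 4 timeout, aborting",
    "May  7 11:36:28 Artemis kernel: BTRFS warning (device nvme0n1p1): csum failed root 5 ino 390506 off 65536 csum 0xcc09b20c expected csum 0x90ce6228 mirror 1",
    "May  7 11:36:28 Artemis kernel: BTRFS info (device loop3): using free space tree",
]

for kind, dev, line in find_storage_warnings(sample):
    print(f"[{kind}] {dev}: {line}")
```

To run it against a real log, pass an open file instead of `sample`, for example `find_storage_warnings(open("/var/log/syslog"))`; that path is an assumption and may differ on your system.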