BTRFS I/O errors on RAID 1 NVME


golli53


These NVMe drives were mounted as unassigned devices in RAID1 and were running my Win10 VMs. I'm not sure whether this was caused by a hardware or a filesystem issue. The directory on the drive that contained my VMs (which were running at the time and crashed after the errors) now shows up as empty (all VM images are missing). I'm not sure whether it is safe to disconnect and reconnect the drives, or what the best way to proceed is. I would appreciate any advice.

[edit] I force-stopped my VM and disabled virtualization in the GUI. I don't know whether this caused my VM images to go missing on disk. My highest priority is recovering these images.

Sep  5 14:33:22 Tower kernel: nvme nvme0: I/O 408 QID 9 timeout, aborting
Sep  5 14:33:23 Tower kernel: nvme nvme0: I/O 409 QID 9 timeout, aborting
Sep  5 14:33:23 Tower kernel: nvme nvme0: I/O 410 QID 9 timeout, aborting
Sep  5 14:33:23 Tower kernel: nvme nvme0: I/O 411 QID 9 timeout, aborting
Sep  5 14:33:23 Tower kernel: nvme nvme0: I/O 737 QID 7 timeout, aborting
Sep  5 14:33:23 Tower kernel: nvme nvme0: I/O 738 QID 7 timeout, aborting
Sep  5 14:33:23 Tower kernel: nvme nvme0: I/O 739 QID 7 timeout, aborting
Sep  5 14:33:23 Tower kernel: nvme nvme0: I/O 198 QID 8 timeout, aborting
Sep  5 14:33:52 Tower kernel: nvme nvme0: I/O 408 QID 9 timeout, reset controller
Sep  5 14:34:23 Tower kernel: nvme nvme0: I/O 11 QID 0 timeout, reset controller
Sep  5 14:35:23 Tower kernel: nvme nvme0: Device not ready; aborting reset
Sep  5 14:35:23 Tower kernel: nvme nvme0: Abort status: 0x7
Sep  5 14:35:23 Tower kernel: nvme nvme0: Abort status: 0x7
Sep  5 14:35:23 Tower kernel: nvme nvme0: Abort status: 0x7
Sep  5 14:35:23 Tower kernel: nvme nvme0: Abort status: 0x7
Sep  5 14:35:23 Tower kernel: nvme nvme0: Abort status: 0x7
Sep  5 14:35:23 Tower kernel: nvme nvme0: Abort status: 0x7
Sep  5 14:35:23 Tower kernel: nvme nvme0: Abort status: 0x7
Sep  5 14:35:23 Tower kernel: nvme nvme0: Abort status: 0x7
Sep  5 14:35:42 Tower dhcpcd[2016]: br0: failed to renew DHCP, rebinding
Sep  5 14:35:42 Tower dhcpcd[2016]: br0: truncated packet (180) from 192.168.10.10
Sep  5 14:35:42 Tower dhcpcd[2016]: br0: truncated packet (180) from 192.168.10.10
Sep  5 14:35:42 Tower dhcpcd[2016]: br0: truncated packet (152) from 192.168.10.10
Sep  5 14:35:42 Tower dhcpcd[2016]: br0: truncated packet (152) from 192.168.10.10
Sep  5 14:35:42 Tower dhcpcd[2016]: br0: truncated packet (132) from 192.168.10.12
Sep  5 14:35:42 Tower dhcpcd[2016]: br0: truncated packet (132) from 192.168.10.12
Sep  5 14:35:42 Tower dhcpcd[2016]: br0: truncated packet (24) from 192.168.10.12
Sep  5 14:35:42 Tower dhcpcd[2016]: br0: truncated packet (24) from 192.168.10.12
Sep  5 14:35:54 Tower kernel: nvme nvme0: Device not ready; aborting reset
Sep  5 14:35:54 Tower kernel: nvme nvme0: Removing after probe failure status: -19
Sep  5 14:36:24 Tower kernel: nvme nvme0: Device not ready; aborting reset
Sep  5 14:36:24 Tower kernel: BTRFS error (device dm-14): bdev /dev/mapper/ssd errs: wr 1, rd 0, flush 0, corrupt 0, gen 0
Sep  5 14:36:24 Tower kernel: print_req_error: I/O error, dev nvme0n1, sector 1419705920
Sep  5 14:36:24 Tower kernel: print_req_error: I/O error, dev nvme0n1, sector 1426329664
Sep  5 14:36:24 Tower kernel: print_req_error: I/O error, dev nvme0n1, sector 1313322472
Sep  5 14:36:24 Tower kernel: BTRFS error (device dm-14): bdev /dev/mapper/ssd errs: wr 3, rd 0, flush 0, corrupt 0, gen 0
Sep  5 14:36:24 Tower kernel: print_req_error: I/O error, dev nvme0n1, sector 726592960
Sep  5 14:36:24 Tower kernel: BTRFS error (device dm-14): bdev /dev/mapper/ssd errs: wr 4, rd 0, flush 0, corrupt 0, gen 0
Sep  5 14:36:24 Tower kernel: BTRFS error (device dm-14): bdev /dev/mapper/ssd errs: wr 5, rd 0, flush 0, corrupt 0, gen 0
Sep  5 14:36:24 Tower kernel: BTRFS error (device dm-14): bdev /dev/mapper/ssd errs: wr 6, rd 0, flush 0, corrupt 0, gen 0
Sep  5 14:36:24 Tower kernel: print_req_error: I/O error, dev nvme0n1, sector 1416326976
Sep  5 14:36:24 Tower kernel: BTRFS error (device dm-14): bdev /dev/mapper/ssd errs: wr 7, rd 0, flush 0, corrupt 0, gen 0
Sep  5 14:36:24 Tower kernel: print_req_error: I/O error, dev nvme0n1, sector 1426329600
Sep  5 14:36:24 Tower kernel: print_req_error: I/O error, dev nvme0n1, sector 1385254080
Sep  5 14:36:24 Tower kernel: BTRFS error (device dm-14): bdev /dev/mapper/ssd errs: wr 9, rd 0, flush 0, corrupt 0, gen 0
Sep  5 14:36:24 Tower kernel: BTRFS error (device dm-14): bdev /dev/mapper/ssd errs: wr 9, rd 0, flush 0, corrupt 0, gen 0
Sep  5 14:36:24 Tower kernel: BTRFS error (device dm-14): bdev /dev/mapper/ssd errs: wr 10, rd 0, flush 0, corrupt 0, gen 0
Sep  5 14:36:24 Tower kernel: print_req_error: I/O error, dev nvme0n1, sector 1416326896
Sep  5 14:36:24 Tower kernel: print_req_error: I/O error, dev nvme0n1, sector 1419705792
Sep  5 14:36:24 Tower kernel: print_req_error: I/O error, dev nvme0n1, sector 1355724792
Sep  5 14:36:24 Tower kernel: nvme nvme0: failed to set APST feature (-19)
Sep  5 14:36:24 Tower kernel: BTRFS: error (device dm-14) in btrfs_run_delayed_refs:2935: errno=-5 IO failure
Sep  5 14:36:24 Tower kernel: BTRFS info (device dm-14): forced readonly
Sep  5 14:36:24 Tower kernel: BTRFS: error (device dm-14) in __btrfs_free_extent:6803: errno=-5 IO failure
Sep  5 14:36:24 Tower kernel: BTRFS: error (device dm-14) in btrfs_run_delayed_refs:2935: errno=-5 IO failure
Sep  5 14:36:24 Tower kernel: BTRFS warning (device dm-14): Skipping commit of aborted transaction.
Sep  5 14:36:24 Tower kernel: BTRFS: error (device dm-14) in cleanup_transaction:1846: errno=-5 IO failure
Sep  5 14:36:24 Tower kernel: BTRFS info (device dm-14): delayed_refs has NO entry
Sep  5 14:36:32 Tower kernel: btrfs_dev_stat_print_on_error: 641 callbacks suppressed
Sep  5 14:36:32 Tower kernel: BTRFS error (device dm-14): bdev /dev/mapper/ssd errs: wr 385, rd 268, flush 0, corrupt 0, gen 0
Sep  5 14:36:32 Tower kernel: BTRFS error (device dm-14): bdev /dev/mapper/ssd errs: wr 385, rd 269, flush 0, corrupt 0, gen 0
Sep  5 14:36:32 Tower kernel: BTRFS error (device dm-14): bdev /dev/mapper/ssd errs: wr 385, rd 270, flush 0, corrupt 0, gen 0
Sep  5 14:36:32 Tower kernel: BTRFS error (device dm-14): bdev /dev/mapper/ssd errs: wr 385, rd 271, flush 0, corrupt 0, gen 0
Sep  5 14:36:32 Tower kernel: BTRFS error (device dm-14): bdev /dev/mapper/ssd errs: wr 385, rd 272, flush 0, corrupt 0, gen 0
Sep  5 14:36:32 Tower kernel: BTRFS error (device dm-14): bdev /dev/mapper/ssd errs: wr 385, rd 273, flush 0, corrupt 0, gen 0
Sep  5 14:36:32 Tower kernel: BTRFS error (device dm-14): bdev /dev/mapper/ssd errs: wr 385, rd 274, flush 0, corrupt 0, gen 0
Sep  5 14:36:32 Tower kernel: BTRFS error (device dm-14): bdev /dev/mapper/ssd errs: wr 385, rd 275, flush 0, corrupt 0, gen 0
Sep  5 14:36:32 Tower kernel: BTRFS error (device dm-14): bdev /dev/mapper/ssd errs: wr 385, rd 276, flush 0, corrupt 0, gen 0
Sep  5 14:36:32 Tower kernel: BTRFS error (device dm-14): bdev /dev/mapper/ssd errs: wr 385, rd 277, flush 0, corrupt 0, gen 0

 

 

Edited by golli53
Link to comment
4 hours ago, golli53 said:

Sep 5 14:33:22 Tower kernel: nvme nvme0: I/O 408 QID 9 timeout, aborting
Sep 5 14:33:23 Tower kernel: nvme nvme0: I/O 409 QID 9 timeout, aborting
Sep 5 14:33:23 Tower kernel: nvme nvme0: I/O 410 QID 9 timeout, aborting
Sep 5 14:33:23 Tower kernel: nvme nvme0: I/O 411 QID 9 timeout, aborting
Sep 5 14:33:23 Tower kernel: nvme nvme0: I/O 737 QID 7 timeout, aborting
Sep 5 14:33:23 Tower kernel: nvme nvme0: I/O 738 QID 7 timeout, aborting
Sep 5 14:33:23 Tower kernel: nvme nvme0: I/O 739 QID 7 timeout, aborting
Sep 5 14:33:23 Tower kernel: nvme nvme0: I/O 198 QID 8 timeout, aborting
Sep 5 14:33:52 Tower kernel: nvme nvme0: I/O 408 QID 9 timeout, reset controller
Sep 5 14:34:23 Tower kernel: nvme nvme0: I/O 11 QID 0 timeout, reset controller
Sep 5 14:35:23 Tower kernel: nvme nvme0: Device not ready; aborting reset

This is a hardware problem, rebooting/power cycling should bring the device back online.
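
One way to check that reading of the log is to look at the order of events: the nvme driver reports timeouts and failed controller resets before BTRFS logs its first error. Below is a minimal sketch, assuming the syslog has been saved to a file with the same line layout as the excerpt above. `summarize_syslog` is a hypothetical helper name, not an Unraid or btrfs tool.

```shell
#!/bin/sh
# Hypothetical helper: summarize NVMe timeouts and BTRFS device errors
# found in a saved syslog file (path passed as the first argument).
summarize_syslog() {
    log="$1"
    # Count I/O timeouts reported by the nvme driver ("|| true" keeps the
    # printed 0 when grep finds nothing and exits non-zero).
    timeouts=$(grep -c 'nvme nvme[0-9]*: I/O [0-9]* QID [0-9]* timeout' "$log" || true)
    # Time (third whitespace-separated field) of the first NVMe timeout
    # and of the first BTRFS error line.
    first_nvme=$(awk '/nvme nvme[0-9]+: I\/O .* timeout/ {print $3; exit}' "$log")
    first_btrfs=$(awk '/BTRFS error/ {print $3; exit}' "$log")
    # Last cumulative per-device counters printed by BTRFS.
    last_errs=$(grep 'BTRFS error' "$log" | sed -n 's/.*errs: //p' | tail -n 1)
    echo "nvme timeouts: $timeouts (first at ${first_nvme:-n/a})"
    echo "first BTRFS error: ${first_btrfs:-n/a}"
    echo "last BTRFS counters: ${last_errs:-n/a}"
}
```

It could be called as `summarize_syslog /var/log/syslog` (the path is an assumption; use wherever the log was saved). If the first timeout precedes the first BTRFS error, the filesystem errors are more likely a consequence of the device dropping out than an independent filesystem fault.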

Edited by johnnie.black
Link to comment
6 hours ago, johnnie.black said:

This is a hardware problem, rebooting/power cycling should bring the device back online.

Thanks - I rebooted and the drive no longer shows up under Unassigned Devices. I guess the drive (or the motherboard controller) failed. What would you recommend as the safest way to recover the data from the other (hopefully still good) drive in the RAID1 btrfs array?

Link to comment
29 minutes ago, johnnie.black said:

There are some recovery options in the FAQ, though the pool should mount with just the single device unless it was created on v6.7+, because of a bug; if that's the case, the recovery options in the FAQ won't help much either.

Oh shoot - I didn't see this bug. I do think it was created on 6.7+. On the second reboot, the bad drive (nvme0n1, ssd) re-appeared in unRAID, but I can't mount the pool (it hangs). I then tried mounting degraded using the good drive (nvme1n1, ssd2) - see below. Am I toast?

login as: root
Linux 4.19.56-Unraid.
root@Tower:~# /usr/sbin/cryptsetup luksOpen /dev/nvme1n1p1 ssd2 --allow-discards --key-file /root/keyfile
root@Tower:~# mkdir /mnt/disks/ssd
root@Tower:~# /sbin/mount -o usebackuproot,ro '/dev/mapper/ssd' '/mnt/disks/ssd'
^C
root@Tower:~# /sbin/mount -o degraded,usebackuproot,ro '/dev/mapper/ssd2' '/mnt/disks/ssd'
mount: /mnt/disks/ssd: wrong fs type, bad option, bad superblock on /dev/mapper/ssd2, missing codepage or helper program, or other error.

 

Edited by golli53
Link to comment
1 minute ago, johnnie.black said:

If both drives are accessible you can try the recovery options above. You just need to mount any one device and the other will mount with it (if the corruption isn't very serious). Failing that, try btrfs restore. If neither option works, I'm afraid there's not much more help I can give.
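
The order described here (read-only mount attempts first, btrfs restore as the fallback) might be scripted roughly as follows. This is a sketch only: `try_ro_mounts` and the `MOUNT` override variable are hypothetical names, and the order in which the options are tried is an assumption. A mount that hangs inside the kernel, as reported earlier in this thread, will not be rescued by a loop like this.

```shell
#!/bin/sh
# Hypothetical helper: try progressively more forgiving read-only mounts
# of a btrfs device and report the first option set that works.
# MOUNT can be set to override the mount command (used here for testing).
try_ro_mounts() {
    dev="$1"
    mnt="$2"
    for opts in ro ro,usebackuproot ro,degraded ro,degraded,usebackuproot; do
        if ${MOUNT:-mount} -t btrfs -o "$opts" "$dev" "$mnt"; then
            echo "mounted with -o $opts"
            return 0
        fi
    done
    # Nothing mounted: point at the offline fallback. The destination is a
    # placeholder and must be a directory on a different, healthy filesystem.
    echo "no read-only mount worked; next step: btrfs restore -v $dev <destination>" >&2
    return 1
}
```

For example, `try_ro_mounts /dev/mapper/ssd2 /mnt/disks/ssd` would use the device and mount point from the session above. Every attempt is read-only, so the loop itself should not write to the damaged pool.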

Got it - is the error above when trying degraded mode ("wrong fs type, bad option, bad superblock on /dev/mapper/ssd2, missing codepage or helper program, or other error.") due to the bug that you referenced? Mounting as a pool unfortunately hangs forever.

Link to comment
1 minute ago, johnnie.black said:

This means no superblock is detected, but you might be using the wrong device. Where did you get ssd2 from?

Device should be /dev/mapper/nvme1n1p1

The ssd2 is from running: /usr/sbin/cryptsetup luksOpen /dev/nvme1n1p1 ssd2 --allow-discards --key-file /root/keyfile

Link to comment
