BTRFS ERROR in cache pool. One device failed to restart


Solved by JorgeB


Hi,

I noticed today that I am getting BTRFS errors. Please see the logs.

/dev/nvme2n1p1 seems to have failed. I ran a balance and a scrub, and eventually stopped the array. When I brought the array back up, it warned me that the cache pool would start without /dev/nvme2n1p1.

 

/dev/nvme1n1p1 seems fine. I've scrubbed it again and run dev stats, with 0 errors.

I've taken the array offline, as it's after 1 am where I am.

 

I'm not sure what the next steps are. It's not a cable issue, as it's NVMe on the motherboard (hard to reach as well).

What's the best way of backing up the existing drive? Is it best to use Mover to move the files onto the array and back up the rest?

Do I try a restart and rebuild if the second drive shows up? The drive is less than a year old.

 

How likely is it that my data on the drive is lost?

 

 

 

nexus-diagnostics-20230411-0008.zip

Link to comment
  • Solution

NVMe device dropped offline:

 

Apr 10 21:20:23 Nexus kernel: nvme nvme2: I/O 203 QID 7 timeout, aborting
Apr 10 21:20:23 Nexus kernel: nvme nvme2: I/O 204 QID 7 timeout, aborting
Apr 10 21:20:23 Nexus kernel: nvme nvme2: I/O 205 QID 7 timeout, aborting
Apr 10 21:20:23 Nexus kernel: nvme nvme2: I/O 206 QID 7 timeout, aborting
Apr 10 21:20:54 Nexus kernel: nvme nvme2: I/O 203 QID 7 timeout, reset controller
Apr 10 21:21:24 Nexus kernel: nvme nvme2: I/O 24 QID 0 timeout, reset controller
Apr 10 21:22:28 Nexus kernel: nvme nvme2: Device not ready; aborting reset, CSTS=0x1
Apr 10 21:22:28 Nexus kernel: nvme nvme2: Abort status: 0x371
### [PREVIOUS LINE REPEATED 3 TIMES] ###
Apr 10 21:22:59 Nexus kernel: nvme nvme2: Device not ready; aborting reset, CSTS=0x1
Apr 10 21:22:59 Nexus kernel: nvme nvme2: Removing after probe failure status: -19
Apr 10 21:23:29 Nexus kernel: nvme nvme2: Device not ready; aborting reset, CSTS=0x1
Apr 10 21:23:29 Nexus kernel: nvme2n1: detected capacity change from 1953525168 to 0

 

Power cycle the server, and if the device comes back online you can re-add it to the pool, assuming the pool was raid1; if it wasn't, post new diags after the reboot. Also see here for more info.
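
If you want to double check from the console after the reboot, a couple of standard btrfs commands will show whether both devices are back and what profile the pool is using (a sketch, using the pool mount point from your diags):

btrfs filesystem show /mnt/cache_nvme   # both NVMe devices should be listed, with no "missing" warning
btrfs filesystem df /mnt/cache_nvme     # data/metadata profiles; RAID1 here means the returned device can be resynced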

Link to comment

Hi,

I've found a new error today with the same cache drives.

 

Quote

Apr 22 22:57:18 Nexus  emhttpd: shcmd (1984034): mount -t btrfs -o noatime,space_cache=v2,discard=async -U 653150f2-f281-4987-bb45-f3e553c14b9c /mnt/cache_nvme
Apr 22 22:57:18 Nexus kernel: BTRFS info (device nvme1n1p1): turning on async discard
Apr 22 22:57:18 Nexus kernel: BTRFS info (device nvme1n1p1): using free space tree
Apr 22 22:57:18 Nexus kernel: BTRFS info (device nvme1n1p1): has skinny extents
Apr 22 22:57:18 Nexus kernel: BTRFS info (device nvme1n1p1): bdev /dev/nvme2n1p1 errs: wr 48107036, rd 92189327, flush 49036, corrupt 6, gen 0

 

This is repeated in the syslog. I would appreciate any advice/steps to resolve this. I've kept it in this thread as this is the drive that dropped offline before. I'm planning on ordering two new cache drives to use with my LSI controller card instead, but I need to know what to do with this in the meantime.

Link to comment

It appears that it's only the one drive. Would the following steps be reasonable:

 

1. Use Mover to move the contents of the cache to the array.

2. Take the cache offline, reformat it, and then use Mover to move appdata, etc. back to the cache.

 

I've run a balance and a scrub since originally posting this new problem.
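
For anyone following along, a scrub on this pool can also be started and checked from the console; a minimal sketch, using the same mount point as in the logs above:

btrfs scrub start /mnt/cache_nvme    # runs in the background
btrfs scrub status /mnt/cache_nvme   # progress and error summary; uncorrectable errors are the ones that matter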

 

Link to comment

It occurred to me that the btrfs stats would still be carrying the counts from the first issue, when the second drive dropped out, since they don't reset on their own.

 

I ran this:

Quote

root@Nexus:/mnt/disks/New_Volume/20230423_cache_nvmebackup/Handbrake/output# btrfs dev stats -z /mnt/cache_nvme/
[/dev/nvme1n1p1].write_io_errs    0
[/dev/nvme1n1p1].read_io_errs     0
[/dev/nvme1n1p1].flush_io_errs    0
[/dev/nvme1n1p1].corruption_errs  0
[/dev/nvme1n1p1].generation_errs  0
[/dev/nvme2n1p1].write_io_errs    48107036
[/dev/nvme2n1p1].read_io_errs     92189327
[/dev/nvme2n1p1].flush_io_errs    49036
[/dev/nvme2n1p1].corruption_errs  6
[/dev/nvme2n1p1].generation_errs  0

 

...and again:

 

Quote

root@Nexus:/mnt/disks/New_Volume/20230423_cache_nvmebackup/Handbrake/output# btrfs dev stats -z /mnt/cache_nvme/
[/dev/nvme1n1p1].write_io_errs    0
[/dev/nvme1n1p1].read_io_errs     0
[/dev/nvme1n1p1].flush_io_errs    0
[/dev/nvme1n1p1].corruption_errs  0
[/dev/nvme1n1p1].generation_errs  0
[/dev/nvme2n1p1].write_io_errs    0
[/dev/nvme2n1p1].read_io_errs     0
[/dev/nvme2n1p1].flush_io_errs    0
[/dev/nvme2n1p1].corruption_errs  0
[/dev/nvme2n1p1].generation_errs  0
root@Nexus:/mnt/disks/New_Volume/20230423_cache_nvmebackup/Handbrake/output#

 

I ran an rsync to two separate drives as a backup before doing this, so I'm hoping this may be resolved.
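
For anyone doing something similar, that kind of cache backup is just an rsync of the pool onto each backup disk, roughly along these lines (the destination path here is the one from the prompts above; adjust for your own disks):

# copy the whole pool onto an unassigned/backup disk; repeat with a second destination for a second copy
rsync -avh --progress /mnt/cache_nvme/ /mnt/disks/New_Volume/20230423_cache_nvmebackup/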

 

Is it possible that, after re-adding the second drive to the pool and letting it rebuild, the btrfs error messages were still reporting the original issue?

Is there a way to see what files were affected by the 6 corruptions?
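
(My guess is that the checksum warnings btrfs writes to the syslog during a scrub would show this, since those usually include the path of the affected file; something like the following, if anyone can confirm:)

grep -i 'checksum error' /var/log/syslog       # scrub-time corruption messages usually end with "(path: ...)"
dmesg | grep -iE 'checksum error|csum failed'  # same idea, from the kernel ring buffer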

I've also added my latest diagnostics: nexus-diagnostics-20230423-2024.zip

Edited by Geck0
Link to comment

I ran btrfs dev stats -z /mnt/cache_nvme/, as per this excellent post by JorgeB linked earlier in this thread,

Quote

If you get notified, you can then check with the dev stats command which device is having issues and take the appropriate steps to fix them. Most times when there are read/write errors, especially with SSDs, it's a cable issue, so start by replacing the cables. Then, since the stats are for the lifetime of the filesystem, i.e., they don't reset with a reboot, force a reset of the stats with:

btrfs dev stats -z /mnt/cache

Finally, run a scrub, make sure there are no uncorrectable errors, and keep working normally; if there are any more issues you'll get a new notification.

 

P.S. You can also monitor a single btrfs device or a non-redundant pool, but for those any dropped device is usually quickly apparent.

and a scrub afterwards. I already had the script running to detect pool errors; the issue appeared after the original dropout that sparked off this thread, and I didn't realise the stats needed resetting. For those who follow, I suggest reading this post: BTRFS Issues.
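
If it helps anyone setting up the same kind of check, a minimal version of a pool-error monitor could look like the sketch below. This is not the exact script I'm using; the mount point and the notification hook are placeholders to adapt.

#!/bin/bash
# Alert if any btrfs device-stats counter on the pool is non-zero.
POOL=/mnt/cache_nvme
errors=$(btrfs dev stats "$POOL" | awk '$2 != 0')
if [ -n "$errors" ]; then
    echo "btrfs errors detected on $POOL:"
    echo "$errors"
    # hook in your own notification here (e-mail, Unraid notification, etc.)
fi

Run it on a schedule (e.g. via the User Scripts plugin), and reset the counters with btrfs dev stats -z once the underlying cause is fixed, as described in the quote above.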

 

Edited by Geck0
Link to comment
