BTRFS ERROR in cache pool. One device failed to restart


Solved by JorgeB


Hi,

I noticed today that I am getting BTRFS errors. Please see the logs.

/dev/nvme2n1p1 seems to have failed. I ran a balance and a scrub, and eventually stopped the array. When I brought the array back up, it warned me that the cache pool would start without /dev/nvme2n1p1.

 

/dev/nvme1n1p1 seems fine. I've scrubbed it again and run dev stats, with 0 errors.

I've taken the array offline, as it's after 1 am where I am.

 

I'm not sure what the next steps are. It's not a cable issue, as it's NVMe on the motherboard (hard to reach as well).

What's the best way of backing up the existing drive? Is it best to use Mover to move the files onto the array and back up the rest?

Do I try a restart and rebuild if the second drive shows up? The drive is less than a year old.

 

How likely is it that my data on the drive is lost?

 

 

 

nexus-diagnostics-20230411-0008.zip

Link to comment
  • Solution

NVMe device dropped offline:

 

Apr 10 21:20:23 Nexus kernel: nvme nvme2: I/O 203 QID 7 timeout, aborting
Apr 10 21:20:23 Nexus kernel: nvme nvme2: I/O 204 QID 7 timeout, aborting
Apr 10 21:20:23 Nexus kernel: nvme nvme2: I/O 205 QID 7 timeout, aborting
Apr 10 21:20:23 Nexus kernel: nvme nvme2: I/O 206 QID 7 timeout, aborting
Apr 10 21:20:54 Nexus kernel: nvme nvme2: I/O 203 QID 7 timeout, reset controller
Apr 10 21:21:24 Nexus kernel: nvme nvme2: I/O 24 QID 0 timeout, reset controller
Apr 10 21:22:28 Nexus kernel: nvme nvme2: Device not ready; aborting reset, CSTS=0x1
Apr 10 21:22:28 Nexus kernel: nvme nvme2: Abort status: 0x371
### [PREVIOUS LINE REPEATED 3 TIMES] ###
Apr 10 21:22:59 Nexus kernel: nvme nvme2: Device not ready; aborting reset, CSTS=0x1
Apr 10 21:22:59 Nexus kernel: nvme nvme2: Removing after probe failure status: -19
Apr 10 21:23:29 Nexus kernel: nvme nvme2: Device not ready; aborting reset, CSTS=0x1
Apr 10 21:23:29 Nexus kernel: nvme2n1: detected capacity change from 1953525168 to 0

 

Power cycle the server, and if the device comes back online you can re-add it to the pool, assuming the pool was raid1; if it wasn't, post new diags after the reboot. Also see here for more info.
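
If you want to double check from the console after the reboot, a couple of standard btrfs commands will show whether both devices are back and what profile the pool is using (a sketch, using the pool mount point from your diags):

btrfs filesystem show /mnt/cache_nvme   # both NVMe devices should be listed, with no "missing" warning
btrfs filesystem df /mnt/cache_nvme     # data/metadata profiles; RAID1 here means the returned device can be resynced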

Link to comment

Hi,

I've found a new error today with the same cache drives.

 

Quote

Apr 22 22:57:18 Nexus  emhttpd: shcmd (1984034): mount -t btrfs -o noatime,space_cache=v2,discard=async -U 653150f2-f281-4987-bb45-f3e553c14b9c /mnt/cache_nvme
Apr 22 22:57:18 Nexus kernel: BTRFS info (device nvme1n1p1): turning on async discard
Apr 22 22:57:18 Nexus kernel: BTRFS info (device nvme1n1p1): using free space tree
Apr 22 22:57:18 Nexus kernel: BTRFS info (device nvme1n1p1): has skinny extents
Apr 22 22:57:18 Nexus kernel: BTRFS info (device nvme1n1p1): bdev /dev/nvme2n1p1 errs: wr 48107036, rd 92189327, flush 49036, corrupt 6, gen 0

 

This is repeated in the syslog. I would appreciate any advice/steps to resolve this. I've kept it in this thread as this is the drive that dropped offline before. I'm planning on ordering two new cache drives to use with my LSI controller card instead, but I need to know what to do with this in the meantime.

Link to comment

It appears that it's only the one drive. Would the following steps be reasonable:

 

1. Use Mover to move the contents of the cache to the array.

2. Take the cache offline, reformat it, and then use Mover to move appdata, etc. back to the cache.

 

I've run a balance and a scrub since originally posting this new problem.
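
For anyone following along, a scrub on this pool can also be started and checked from the console; a minimal sketch, using the same mount point as in the logs above:

btrfs scrub start /mnt/cache_nvme    # runs in the background
btrfs scrub status /mnt/cache_nvme   # progress and error summary; uncorrectable errors are the ones that matter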

 

Link to comment

It occurred to me that the btrfs stats would still be carrying the counts from the first issue, when the second drive dropped out, since they don't reset on their own.

 

I ran this:

Quote

root@Nexus:/mnt/disks/New_Volume/20230423_cache_nvmebackup/Handbrake/output# btrfs dev stats -z /mnt/cache_nvme/
[/dev/nvme1n1p1].write_io_errs    0
[/dev/nvme1n1p1].read_io_errs     0
[/dev/nvme1n1p1].flush_io_errs    0
[/dev/nvme1n1p1].corruption_errs  0
[/dev/nvme1n1p1].generation_errs  0
[/dev/nvme2n1p1].write_io_errs    48107036
[/dev/nvme2n1p1].read_io_errs     92189327
[/dev/nvme2n1p1].flush_io_errs    49036
[/dev/nvme2n1p1].corruption_errs  6
[/dev/nvme2n1p1].generation_errs  0

 

...and again:

 

Quote

root@Nexus:/mnt/disks/New_Volume/20230423_cache_nvmebackup/Handbrake/output# btrfs dev stats -z /mnt/cache_nvme/
[/dev/nvme1n1p1].write_io_errs    0
[/dev/nvme1n1p1].read_io_errs     0
[/dev/nvme1n1p1].flush_io_errs    0
[/dev/nvme1n1p1].corruption_errs  0
[/dev/nvme1n1p1].generation_errs  0
[/dev/nvme2n1p1].write_io_errs    0
[/dev/nvme2n1p1].read_io_errs     0
[/dev/nvme2n1p1].flush_io_errs    0
[/dev/nvme2n1p1].corruption_errs  0
[/dev/nvme2n1p1].generation_errs  0
root@Nexus:/mnt/disks/New_Volume/20230423_cache_nvmebackup/Handbrake/output#

 

I ran an rsync to two separate drives as a backup before doing this, so I'm hoping this may be resolved.
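
For anyone doing something similar, that kind of cache backup is just an rsync of the pool onto each backup disk, roughly along these lines (the destination path here is the one from the prompts above; adjust for your own disks):

# copy the whole pool onto an unassigned/backup disk; repeat with a second destination for a second copy
rsync -avh --progress /mnt/cache_nvme/ /mnt/disks/New_Volume/20230423_cache_nvmebackup/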

 

Is it possible that, after re-adding the second drive to the pool and letting it rebuild, the btrfs error messages were still reporting the original issue?

Is there a way to see what files were affected by the 6 corruptions?
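
(My guess is that the checksum warnings btrfs writes to the syslog during a scrub would show this, since those usually include the path of the affected file; something like the following, if anyone can confirm:)

grep -i 'checksum error' /var/log/syslog       # scrub-time corruption messages usually end with "(path: ...)"
dmesg | grep -iE 'checksum error|csum failed'  # same idea, from the kernel ring buffer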

I've also added my latest diagnostics: nexus-diagnostics-20230423-2024.zip

Edited by Geck0
Link to comment

I ran btrfs dev stats -z /mnt/cache_nvme/, as per this excellent post by JorgeB linked earlier in this thread,

Quote

If you get notified, you can then check with the dev stats command which device is having issues and take the appropriate steps to fix them. Most times when there are read/write errors, especially with SSDs, it's a cable issue, so start by replacing the cables. Then, since the stats are for the lifetime of the filesystem, i.e., they don't reset with a reboot, force a reset of the stats with:

btrfs dev stats -z /mnt/cache

Finally, run a scrub, make sure there are no uncorrectable errors, and keep working normally; if there are any more issues you'll get a new notification.

 

P.S. You can also monitor a single btrfs device or a non-redundant pool, but for those any dropped device is usually quickly apparent.

and a scrub afterwards. I already had the script running to detect pool errors; the issue appeared after the original dropout that sparked off this thread, and I didn't realise the stats needed resetting. For those who follow, I suggest reading this post: BTRFS Issues.
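
If it helps anyone setting up the same kind of check, a minimal version of a pool-error monitor could look like the sketch below. This is not the exact script I'm using; the mount point and the notification hook are placeholders to adapt.

#!/bin/bash
# Alert if any btrfs device-stats counter on the pool is non-zero.
POOL=/mnt/cache_nvme
errors=$(btrfs dev stats "$POOL" | awk '$2 != 0')
if [ -n "$errors" ]; then
    echo "btrfs errors detected on $POOL:"
    echo "$errors"
    # hook in your own notification here (e-mail, Unraid notification, etc.)
fi

Run it on a schedule (e.g. via the User Scripts plugin), and reset the counters with btrfs dev stats -z once the underlying cause is fixed, as described in the quote above.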

 

Edited by Geck0
Link to comment
