BTRFS Errors after moving to new server



There are problems writing to one of the cache devices, cache2. This is a hardware issue:

Feb 23 22:48:58 Tower kernel: BTRFS error (device nvme0n1p1): bdev /dev/nvme1n1p1 errs: wr 1, rd 0, flush 0, corrupt 0, gen 0
Feb 23 22:48:58 Tower kernel: BTRFS error (device nvme0n1p1): bdev /dev/nvme1n1p1 errs: wr 2, rd 0, flush 0, corrupt 0, gen 0
Feb 23 22:48:58 Tower kernel: BTRFS error (device nvme0n1p1): bdev /dev/nvme1n1p1 errs: wr 3, rd 0, flush 0, corrupt 0, gen 0
Feb 23 22:48:58 Tower kernel: BTRFS error (device nvme0n1p1): bdev /dev/nvme1n1p1 errs: wr 4, rd 0, flush 0, corrupt 0, gen 0
Feb 23 22:48:58 Tower kernel: BTRFS error (device nvme0n1p1): bdev /dev/nvme1n1p1 errs: wr 5, rd 0, flush 0, corrupt 0, gen 0
Feb 23 22:48:58 Tower kernel: BTRFS error (device nvme0n1p1): bdev /dev/nvme1n1p1 errs: wr 6, rd 0, flush 0, corrupt 0, gen 0
Feb 23 22:48:58 Tower kernel: BTRFS error (device nvme0n1p1): bdev /dev/nvme1n1p1 errs: wr 7, rd 0, flush 0, corrupt 0, gen 0
Feb 23 22:48:58 Tower kernel: BTRFS error (device nvme0n1p1): bdev /dev/nvme1n1p1 errs: wr 8, rd 0, flush 0, corrupt 0, gen 0
Feb 23 22:48:58 Tower kernel: BTRFS error (device nvme0n1p1): bdev /dev/nvme1n1p1 errs: wr 9, rd 0, flush 0, corrupt 0, gen 0
Feb 23 22:48:58 Tower kernel: BTRFS error (device nvme0n1p1): bdev /dev/nvme1n1p1 errs: wr 10, rd 0, flush 0, corrupt 0, gen 0
Feb 23 22:48:58 Tower kernel: BTRFS warning (device nvme0n1p1): lost page write due to IO error on /dev/nvme1n1p1
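For anyone who wants to pull these counters out of a long syslog rather than scrolling through it, here is a minimal sketch. The function name `btrfs_log_summary` is made up for illustration, and it assumes the log lines have the same shape as the ones quoted above (a `bdev <device>` token followed by `errs: ...`). It prints the last set of counters seen for each device.

```shell
# Sketch only: summarize BTRFS per-device error counters from kernel log text.
# Reads log lines on stdin; prints the most recent counters seen per bdev.
# Assumes the "bdev <path> errs: <counters>" layout shown in the log above.
btrfs_log_summary() {
  awk '
    /BTRFS error/ && / bdev / && / errs: / {
      # The device path is the field right after the "bdev" token.
      for (i = 1; i <= NF; i++) if ($i == "bdev") dev = $(i + 1)
      # Keep only the counter list that follows "errs: ".
      counters = $0
      sub(/^.* errs: /, "", counters)
      last[dev] = counters
    }
    END { for (d in last) print d ": " last[d] }
  '
}
```

Possible usage, assuming the syslog lives at `/var/log/syslog` on your system: `btrfs_log_summary < /var/log/syslog`.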

 


OK, I booted with the `624` drive installed by itself and immediately got an IO error, and the drive was put into read-only mode. I then installed `552` by itself, and it passed the scrub check with no errors. The btrfs command still reports this, though:

root@Tower:~# btrfs dev stats /mnt/cache
[/dev/nvme0n1p1].write_io_errs    6455
[/dev/nvme0n1p1].read_io_errs     5533
[/dev/nvme0n1p1].flush_io_errs    22
[/dev/nvme0n1p1].corruption_errs  0
[/dev/nvme0n1p1].generation_errs  0

It was like this right after boot, and the values haven't changed in 15 minutes. VMs and Docker are running off this drive, and they all appear to be functioning.
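As far as I know, the `btrfs dev stats` counters are stored with the filesystem and persist across reboots until they are reset, so non-zero values that never move may just be leftovers from the earlier errors. A rough sketch for checking that: the helper name `nonzero_btrfs_stats` is made up, and it only filters the text output shown above, printing any counter that is not zero.

```shell
# Sketch only: list non-zero counters from `btrfs dev stats` output on stdin.
# Exit status is 1 if any counter is non-zero, 0 if all are zero.
nonzero_btrfs_stats() {
  awk 'BEGIN { bad = 0 }
       $2 + 0 > 0 { print; bad = 1 }   # second field is the counter value
       END { exit bad }'
}

# Possible usage (needs root and a mounted pool):
#   btrfs dev stats /mnt/cache | nonzero_btrfs_stats
# To zero the counters and then watch for new errors only:
#   btrfs dev stats -z /mnt/cache
```

If the counters stay at zero after a reset while the pool is in use, the old numbers were most likely history rather than ongoing errors.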

 

It looks like it is rebalancing now since the other drive was removed. Should I try to format the `624` drive and put it back into the array? Is something like an "IO Error" indicative of a hardware problem?

Edited by mlapaglia

So this might have something to do with the board and RAID1.

 

I formatted the suspect drive as btrfs, separately from the cache, and used it as an unassigned device. I copied the entire appdata folder over to it for testing. It copied without any IO errors.

 

Since there were no issues there, I put the drive back into the cache array. It formatted and set up the RAID1 without any errors. After a restart, though, it started throwing IO errors again until I removed the second drive.
