BTRFS Errors after moving to new server

mlapaglia · February 24, 2020

After moving to my new server (Aorus X570 Pro with a 3900X), my cache is throwing lots of BTRFS errors. I reformatted the cache drive and restored my appdata, but I am still getting this issue.

The cache drives are two nvme drives attached to the motherboard. I've tried reseating them.

tower-diagnostics-20200224-0156.zip

JorgeB · February 24, 2020

There are problems writing to one of the cache devices, cache2. This is a hardware issue:

Feb 23 22:48:58 Tower kernel: BTRFS error (device nvme0n1p1): bdev /dev/nvme1n1p1 errs: wr 1, rd 0, flush 0, corrupt 0, gen 0
Feb 23 22:48:58 Tower kernel: BTRFS error (device nvme0n1p1): bdev /dev/nvme1n1p1 errs: wr 2, rd 0, flush 0, corrupt 0, gen 0
Feb 23 22:48:58 Tower kernel: BTRFS error (device nvme0n1p1): bdev /dev/nvme1n1p1 errs: wr 3, rd 0, flush 0, corrupt 0, gen 0
Feb 23 22:48:58 Tower kernel: BTRFS error (device nvme0n1p1): bdev /dev/nvme1n1p1 errs: wr 4, rd 0, flush 0, corrupt 0, gen 0
Feb 23 22:48:58 Tower kernel: BTRFS error (device nvme0n1p1): bdev /dev/nvme1n1p1 errs: wr 5, rd 0, flush 0, corrupt 0, gen 0
Feb 23 22:48:58 Tower kernel: BTRFS error (device nvme0n1p1): bdev /dev/nvme1n1p1 errs: wr 6, rd 0, flush 0, corrupt 0, gen 0
Feb 23 22:48:58 Tower kernel: BTRFS error (device nvme0n1p1): bdev /dev/nvme1n1p1 errs: wr 7, rd 0, flush 0, corrupt 0, gen 0
Feb 23 22:48:58 Tower kernel: BTRFS error (device nvme0n1p1): bdev /dev/nvme1n1p1 errs: wr 8, rd 0, flush 0, corrupt 0, gen 0
Feb 23 22:48:58 Tower kernel: BTRFS error (device nvme0n1p1): bdev /dev/nvme1n1p1 errs: wr 9, rd 0, flush 0, corrupt 0, gen 0
Feb 23 22:48:58 Tower kernel: BTRFS error (device nvme0n1p1): bdev /dev/nvme1n1p1 errs: wr 10, rd 0, flush 0, corrupt 0, gen 0
Feb 23 22:48:58 Tower kernel: BTRFS warning (device nvme0n1p1): lost page write due to IO error on /dev/nvme1n1p1
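
When the syslog is long, it can help to pull the per-device counters out of lines like these instead of reading them by eye. The following is only a minimal Python sketch: the function name `parse_btrfs_errs` and the regular expression are assumptions based on the line format quoted in this post, not part of any Unraid or btrfs tool.

```python
import re

# Matches the "bdev ... errs:" lines quoted above. The pattern is an
# assumption derived from this thread's log excerpt only.
ERR_LINE = re.compile(
    r"BTRFS error \(device (?P<fs>[^)]+)\): bdev (?P<bdev>\S+) errs: "
    r"wr (?P<wr>\d+), rd (?P<rd>\d+), flush (?P<flush>\d+), "
    r"corrupt (?P<corrupt>\d+), gen (?P<gen>\d+)"
)

COUNTERS = ("wr", "rd", "flush", "corrupt", "gen")


def parse_btrfs_errs(lines):
    """Return {bdev: counters} holding the last counters seen per device."""
    latest = {}
    for line in lines:
        match = ERR_LINE.search(line)
        if match is None:
            continue  # not an "errs:" line, e.g. the "lost page write" warning
        latest[match.group("bdev")] = {
            name: int(match.group(name)) for name in COUNTERS
        }
    return latest
```

Fed the excerpt above, this would report only `/dev/nvme1n1p1`, with write errors and nothing else, which is what points at a problem writing to that one device.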

mlapaglia · February 24, 2020

Thanks @johnnie.black, I swapped the drives on the motherboard. It looks like the problem is with the NVMe drive itself, since I am now seeing errors about nvme1n1p1?

tower-diagnostics-20200224-0816.zip

JorgeB · February 24, 2020

The errors you're seeing now are because both devices are online and one of them has old data that is being corrected as it's being read. Run a scrub and check that there are no uncorrectable errors; also see here for better pool monitoring.
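
For anyone scripting the scrub step, here is a minimal Python sketch. It assumes the pool is mounted at `/mnt/cache` (the path used later in this thread) and that the `btrfs` binary is on the PATH. The helper names `scrub_command` and `run_scrub` are made up for illustration.

```python
import subprocess


def scrub_command(mount_point):
    """Build the scrub command; -B keeps it in the foreground until done."""
    return ["btrfs", "scrub", "start", "-B", mount_point]


def run_scrub(mount_point, runner=subprocess.run):
    """Run a foreground scrub and return (exit_code, output_text).

    `runner` is injectable so the function can be exercised without a real
    btrfs filesystem; by default it is subprocess.run.
    """
    result = runner(scrub_command(mount_point), capture_output=True, text=True)
    return result.returncode, result.stdout


if __name__ == "__main__":
    code, output = run_scrub("/mnt/cache")  # mount point is an assumption
    print(output)
    if code != 0:
        print("scrub did not finish cleanly, check the output above")
```

As far as I read the btrfs-scrub man page, a non-zero exit status covers both a scrub that could not run and one that found uncorrectable errors, so anything other than 0 is a reason to read the printed summary.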

mlapaglia · February 24, 2020

Ok I booted up with the `624` drive inserted by itself and immediately got an IO error and the drive being put into read only mode. I put `552` in by itself and it passes the scrub check with no errors. The btrfs command gives this though:

root@Tower:~# btrfs dev stats /mnt/cache
[/dev/nvme0n1p1].write_io_errs    6455
[/dev/nvme0n1p1].read_io_errs     5533
[/dev/nvme0n1p1].flush_io_errs    22
[/dev/nvme0n1p1].corruption_errs  0
[/dev/nvme0n1p1].generation_errs  0

It was like this right after boot up, and the value hasn't changed in 15 minutes. VMs and docker are running off of this drive, they all appear to be functioning.
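
To check whether these values really stay put, the `btrfs dev stats` output can be turned into a dictionary and inspected. This is a minimal Python sketch that assumes the output format shown above; `parse_dev_stats` and `nonzero_counters` are hypothetical helper names.

```python
def parse_dev_stats(text):
    """Parse `btrfs dev stats` output into {device: {counter: value}}."""
    stats = {}
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("["):
            continue  # skip the shell prompt and blank lines
        key, value = line.rsplit(None, 1)           # "[dev].counter", "123"
        device, counter = key[1:].split("].", 1)    # "dev", "counter"
        stats.setdefault(device, {})[counter] = int(value)
    return stats


def nonzero_counters(stats):
    """Return {(device, counter): value} for every counter above zero."""
    return {
        (device, counter): value
        for device, counters in stats.items()
        for counter, value in counters.items()
        if value > 0
    }


SAMPLE = """\
[/dev/nvme0n1p1].write_io_errs    6455
[/dev/nvme0n1p1].read_io_errs     5533
[/dev/nvme0n1p1].flush_io_errs    22
[/dev/nvme0n1p1].corruption_errs  0
[/dev/nvme0n1p1].generation_errs  0
"""

if __name__ == "__main__":
    for (device, counter), value in nonzero_counters(parse_dev_stats(SAMPLE)).items():
        print(device, counter, value)
```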

It looks like it is re-balancing now since the other drive was removed. Should I try to format the `624` drive and put it back into the array? Is anything like an "IO Error" indicative of a hardware error?

Edited February 24, 2020 by mlapaglia

JorgeB · February 24, 2020

The errors are for the life of the filesystem; see the link above for how to reset them.
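
Because the counters are cumulative, what matters is whether they grow, not their absolute value. Below is a minimal Python sketch of that comparison, plus the reset itself. `btrfs device stats -z` prints the counters and then zeroes them. The helper names are hypothetical and the mount point `/mnt/cache` is taken from the earlier post. This is not necessarily the exact procedure the linked post describes.

```python
import subprocess


def new_errors(before, after):
    """Return {counter: increase} for counters that grew between snapshots.

    Both arguments are {counter_name: value} dictionaries for one device.
    """
    return {
        name: after[name] - before.get(name, 0)
        for name in after
        if after[name] > before.get(name, 0)
    }


def reset_stats_command(mount_point):
    """Build the command that prints the counters and resets them to zero."""
    return ["btrfs", "device", "stats", "-z", mount_point]


def reset_stats(mount_point):
    """Run the reset (needs root and a mounted btrfs filesystem)."""
    done = subprocess.run(
        reset_stats_command(mount_point),
        check=True, capture_output=True, text=True,
    )
    return done.stdout


if __name__ == "__main__":
    at_boot = {"write_io_errs": 6455, "read_io_errs": 5533, "flush_io_errs": 22}
    later = dict(at_boot)  # unchanged after 15 minutes, as reported above
    print(new_errors(at_boot, later) or "no new errors since the first snapshot")
```

After a reset, any counter that is no longer zero is new, which makes a returning fault much easier to spot.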

mlapaglia · February 25, 2020

So this might have something to do with the board and RAID1.

I formatted the suspect drive as btrfs separately from the cache and used it as an unassigned device. I copied the entire appdata folder over to it for testing. It copied without any IO errors.

Since there were no issues there, I put the drive back into the cache array. It formatted and set up the RAID1 without any errors. After a restart, though, it started throwing IO errors again until I removed the second drive.

JorgeB · February 25, 2020

Look for a BIOS update; this looks like a hardware problem or compatibility issue.
