Luc1fer Posted February 24, 2020

Hi, I have upgraded from 6.6.7 to 6.8.2, and I noticed that my cache pool was giving these errors:

kernel: BTRFS info (device sdr1): relocating block group 1055678005248 flags data|raid1

I tried to balance the pool, but that didn't accomplish anything and instead started giving these errors:

kernel: BTRFS critical (device sdr1): corrupt leaf: root=5 block=286758944768 slot=2, bad key order, prev (11529215046071388418 254 281472965546548) current (2918659 1 0)

So I thought I would move everything off the cache pool and then recreate it. I set all of the shares that were on "cache only" and "cache preferred" to YES and invoked the mover. The mover has finished, but there is still about 100 GB of data on the cache pool that won't move to the array:

Data, RAID1: total=66.00GiB, used=34.74GiB
Data, single: total=53.00GiB, used=24.62GiB
System, RAID1: total=32.00MiB, used=48.00KiB
Metadata, RAID1: total=2.00GiB, used=68.08MiB
Metadata, single: total=1.00GiB, used=704.00KiB
GlobalReserve, single: total=36.80MiB, used=0.00B
No balance found on '/mnt/cache'

This mainly seems to be the docker and libvirt folders and files. It also appears that there are two copies of docker.img: one on the cache drive, and what appears to be an older-dated file on disk11. My domains and isos folders have also not been moved to the array (I have copied all the data to a local drive on my PC as a backup).

I'm not sure how to proceed from here. I have attached my diagnostics zip. Thanks for your help.

chenbro-svr-diagnostics-20200224-2023.zip
JorgeB Posted February 24, 2020

One of the cache devices (cache2) has issues; it's likely dropping offline:

Feb 24 08:52:03 chenbro-svr kernel: BTRFS info (device sdr1): bdev /dev/sds1 errs: wr 3714141, rd 2077762, flush 152140, corrupt 0, gen 0

See here for more info.

For the mover to move those files you need to disable the docker and VM services. Note that if any of the files already exist on the array the mover won't overwrite them, but there's no need to worry about the docker image: delete it and re-create it.
Luc1fer Posted February 24, 2020 (Author)

Thanks for the reply. I forgot to say that I stopped both the docker and VM services before invoking the mover, but those files still seem not to have moved, or for some of them they have been copied but not deleted from the cache.

Running btrfs dev stats /mnt/cache gives:

Linux 4.19.98-Unraid.
root@chenbro-svr:~# btrfs dev stats /mnt/cache
[/dev/sdr1].write_io_errs 0
[/dev/sdr1].read_io_errs 0
[/dev/sdr1].flush_io_errs 0
[/dev/sdr1].corruption_errs 0
[/dev/sdr1].generation_errs 0
[/dev/sds1].write_io_errs 3714141
[/dev/sds1].read_io_errs 2077762
[/dev/sds1].flush_io_errs 152140
[/dev/sds1].corruption_errs 0
[/dev/sds1].generation_errs 0

sds1 is a replacement SSD, fitted when a previous one in that slot was dropping out and eventually failed. I changed the cable, put it back in, and let Unraid restore the cache pool. Presumably something about either that bay or that SATA connector is causing issues for the SSD? Should I power down and remove cache2 (sds1)?
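(Not part of the original exchange — a small sketch of how to pick out the failing device from that output. Note the counters persist across reboots until cleared with `btrfs dev stats -z /mnt/cache`, so old errors linger after a cable swap. The `stats` variable below just replays the output pasted above; on a live system you would pipe `btrfs dev stats /mnt/cache` straight into the awk filter instead.)

```shell
# Sample input: the `btrfs dev stats` output from the post above.
stats='[/dev/sdr1].write_io_errs 0
[/dev/sdr1].read_io_errs 0
[/dev/sdr1].flush_io_errs 0
[/dev/sdr1].corruption_errs 0
[/dev/sdr1].generation_errs 0
[/dev/sds1].write_io_errs 3714141
[/dev/sds1].read_io_errs 2077762
[/dev/sds1].flush_io_errs 152140
[/dev/sds1].corruption_errs 0
[/dev/sds1].generation_errs 0'

# Each line is "<counter name> <count>"; print only the nonzero ones.
printf '%s\n' "$stats" | awk '$2 > 0 { print "WARN:", $1, "=", $2 }'
```

With this input only the three sds1 error counters are flagged, matching the device JorgeB pointed at.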
Luc1fer Posted February 24, 2020 (Author)

If I'm going to recreate the docker.img, when I select delete in the docker settings will it delete both copies, or should I delete one copy using MC?
JorgeB Posted February 24, 2020

16 minutes ago, Luc1fer said:
but those files still seem to have not moved, or for some of them they have been copied, but not deleted from the cache.

Enable mover logging and post a syslog after it runs.
JorgeB Posted February 24, 2020

Just now, Luc1fer said:
If i'm going to recreate the docker.img when I select delete in the docker settings will it delete both copies or should I delete one copy using MC?

It will only delete one.
Luc1fer Posted February 24, 2020 (Author)

It's still running, but I am seeing a lot of these:

Feb 24 21:30:01 chenbro-svr move: move: file /mnt/cache/system/docker/docker.img
Feb 24 21:30:01 chenbro-svr move: move: create_parent: /mnt/cache/system/docker error: Read-only file system
Feb 24 21:30:01 chenbro-svr move: move: create_parent: /mnt/cache/system error: Read-only file system
Feb 24 21:30:01 chenbro-svr move: move: file /mnt/cache/system/libvirt/libvirt.img
Feb 24 21:30:01 chenbro-svr move: move: create_parent: /mnt/cache/system/libvirt error: Read-only file system
Feb 24 21:30:01 chenbro-svr move: move: create_parent: /mnt/cache/system error: Read-only file system
Feb 24 21:30:01 chenbro-svr move: move_object: /mnt/cache/system: Read-only file system
Feb 24 21:30:01 chenbro-svr root: mover: finished

I'll attach a new syslog once it has stopped.
Luc1fer Posted February 24, 2020 (Author)

Thanks for that. Here is a new set of diagnostics after running the mover again.

chenbro-svr-diagnostics-20200224-2135.zip
JorgeB Posted February 24, 2020

Since the filesystem has gone read-only, those errors are expected. Just make sure everything important was copied to the array; the pool will need to be re-formatted.
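(A side note for later readers: before reformatting, it is worth double-checking that the array copy really matches the cache. A minimal sketch using `diff` — the `verify_copy` helper name and the share paths in the comment are illustrative, not from the thread:)

```shell
# verify_copy SRC DST
# Exits 0 if the two trees are identical; otherwise diff lists every
# file that is missing or differs, and the function returns nonzero.
verify_copy() {
    diff -rq "$1" "$2"
}

# Illustrative Unraid usage (paths are assumptions; adjust to your shares).
# /mnt/user0/<share> shows the array-only side of a share, cache excluded:
#   verify_copy /mnt/cache/appdata /mnt/user0/appdata
```

One caveat: `diff -rq` also flags files that exist only on the destination, so review its output rather than trusting the exit code blindly.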
Luc1fer Posted February 24, 2020 (Author)

Thanks for your advice. Is it sufficient to just manually copy all the data to a temporary location and then copy it back once the cache pool has been reformatted?
JorgeB Posted February 24, 2020

Yep, but if there are vdisks they should be copied with the --sparse=always flag to keep them sparse, especially when copying back to the cache.
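(For later readers, a sketch of what that sparse-aware copy looks like; the file names are made up for the demo, and on a real system you would point the commands at your vdisk images instead:)

```shell
workdir=$(mktemp -d)    # scratch area so the demo leaves nothing behind
cd "$workdir"

# Make a 1 GiB sparse file: a huge apparent size but near-zero allocated
# blocks, which is how vdisk images typically look on disk.
truncate -s 1G vdisk-test.img

# --sparse=always makes cp scan for runs of zeroes and re-create them as
# holes in the destination, even when the source filesystem does not
# report hole information (plain cp defaults to --sparse=auto, which can
# miss that case and balloon the copy to its full apparent size).
cp --sparse=always vdisk-test.img vdisk-copy.img

ls -lh vdisk-test.img vdisk-copy.img   # both show 1.0G apparent size
du -h  vdisk-test.img vdisk-copy.img   # both allocate next to nothing
```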
Luc1fer Posted February 24, 2020 (Author)

7 hours ago, johnnie.black said:
Yep, but if there are vdisks they should be copied with the --sparse=always flag to remain sparse, especially when copying back to cache.

Thanks for your help. I've now got everything back up and running again. It took over six hours to copy everything back to the cache drive, mainly because of Plex; I never realised the Plex appdata was so large!