Cache pool damaged/corrupt?


Luc1fer


Hi,

 

I have upgraded to 6.8.2 from 6.6.7

 

I noticed that my cache pool was logging these messages:

kernel: BTRFS info (device sdr1): relocating block group 1055678005248 flags data|raid1

 

I tried to balance my pool, but that didn't help and instead started logging these errors:

 

kernel: BTRFS critical (device sdr1): corrupt leaf: root=5 block=286758944768 slot=2, bad key order, prev (11529215046071388418 254 281472965546548) current (2918659 1 0)

 

So I thought I would move everything off the cache pool and then recreate it.  I set all of the shares that were set to cache-only or cache-preferred to Yes and invoked the mover.

 

The mover has finished, but there is still about 100GB of data on the cache pool that won't move to the array.

 

Data, RAID1: total=66.00GiB, used=34.74GiB

Data, single: total=53.00GiB, used=24.62GiB

System, RAID1: total=32.00MiB, used=48.00KiB

Metadata, RAID1: total=2.00GiB, used=68.08MiB

Metadata, single: total=1.00GiB, used=704.00KiB

GlobalReserve, single: total=36.80MiB, used=0.00B

 

No balance found on '/mnt/cache'
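As an aside, the mix of RAID1 and single chunks in that output is itself a sign the pool is no longer fully redundant (single chunks are not mirrored). A minimal sketch of how to spot this and, once the pool is healthy again, convert everything back to RAID1 (the convert command is standard btrfs-progs, but run it only after the failing device/cable is dealt with):

```shell
# Count chunk lines still using the "single" profile; anything > 0
# means some data or metadata on the pool is not mirrored.
btrfs fi df /mnt/cache | grep -c 'single'

# Once the pool is healthy, convert data and metadata back to RAID1
# (assumption: pool mounts read-write and the bad device is fixed):
# btrfs balance start -dconvert=raid1 -mconvert=raid1 /mnt/cache
```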

 

Mainly this seems to be the docker and libvirt folders and files.  It also appears that there are two copies of docker.img: one on the cache drive and what appears to be an older-dated file on disk11.  My domains and isos folders have also not been moved to the array (I have copied all the data to a local drive on my PC as a backup).

 

I'm not sure how to proceed from here.

I have attached my diagnostics zip.

 

Thanks for your help.

chenbro-svr-diagnostics-20200224-2023.zip


One of the cache devices (cache2) has issues; it's likely dropping offline:

 

Feb 24 08:52:03 chenbro-svr kernel: BTRFS info (device sdr1): bdev /dev/sds1 errs: wr 3714141, rd 2077762, flush 152140, corrupt 0, gen 0

 

See here for more info.

 

For the mover to move those files you need to disable the Docker and VM services.  Note that if any of the files already exist on the array the mover won't overwrite them.  There's no need to worry about the docker image, though: just delete it and re-create it.
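A rough sketch of that delete-and-recreate step (the path below is the usual Unraid default and is an assumption; check Settings -> Docker for the actual location before deleting anything):

```shell
# Typical default location of the docker image on Unraid
# (assumption -- verify under Settings -> Docker first):
DOCKER_IMG=/mnt/user/system/docker/docker.img

# Disable the Docker service in the GUI, then remove the image:
rm -f "$DOCKER_IMG"

# Re-enabling the Docker service creates a fresh image automatically;
# containers can then be reinstalled from Apps -> Previous Apps, since
# their templates live on the flash drive, not inside the image.
```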


Thanks for the reply.  I forgot to say that I stopped both the Docker and VM services before invoking the mover, but those files still haven't moved; some of them have been copied but not deleted from the cache.

 

Running: btrfs dev stats /mnt/cache gives:

 

Linux 4.19.98-Unraid.
root@chenbro-svr:~# btrfs dev stats /mnt/cache
[/dev/sdr1].write_io_errs    0
[/dev/sdr1].read_io_errs     0
[/dev/sdr1].flush_io_errs    0
[/dev/sdr1].corruption_errs  0
[/dev/sdr1].generation_errs  0
[/dev/sds1].write_io_errs    3714141
[/dev/sds1].read_io_errs     2077762
[/dev/sds1].flush_io_errs    152140
[/dev/sds1].corruption_errs  0
[/dev/sds1].generation_errs  0

 

sds1 is a replacement SSD, fitted after a previous one in that slot kept dropping out and eventually failed.  I changed the cable, put it back in, and let Unraid restore the cache pool.  Presumably something about that bay or its SATA connector is causing issues for the SSD?  Should I power down and remove cache2 (sds1)?
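For what it's worth, those btrfs error counters are persistent (they survive reboots), so after reseating or replacing the cable it helps to zero them so any new errors stand out. A small sketch (the `-z` reset flag is standard btrfs-progs; the grep/awk check just filters the stats output shown above):

```shell
# Reset the persistent per-device error counters after fixing the cable
# (run against the mounted pool):
# btrfs dev stats -z /mnt/cache

# Flag any counter that is nonzero in the stats output:
btrfs dev stats /mnt/cache | awk '$2 != 0 {print "ERRORS:", $0}'
```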

 

 


It's still running but I am seeing a lot of these:

 

Feb 24 21:30:01 chenbro-svr move: move: file /mnt/cache/system/docker/docker.img
Feb 24 21:30:01 chenbro-svr move: move: create_parent: /mnt/cache/system/docker error: Read-only file system
Feb 24 21:30:01 chenbro-svr move: move: create_parent: /mnt/cache/system error: Read-only file system
Feb 24 21:30:01 chenbro-svr move: move: file /mnt/cache/system/libvirt/libvirt.img
Feb 24 21:30:01 chenbro-svr move: move: create_parent: /mnt/cache/system/libvirt error: Read-only file system
Feb 24 21:30:01 chenbro-svr move: move: create_parent: /mnt/cache/system error: Read-only file system
Feb 24 21:30:01 chenbro-svr move: move_object: /mnt/cache/system: Read-only file system
Feb 24 21:30:01 chenbro-svr root: mover: finished
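Those "Read-only file system" errors are expected behaviour once btrfs hits a critical error: it forces the filesystem read-only to prevent further damage, which is why the mover can copy files off but can no longer create or delete anything on the pool. A quick way to confirm (sketch, using this thread's mount point):

```shell
# List the cache pool's mount options one per line and look for a bare
# "ro" entry, meaning btrfs has gone (or been forced) read-only:
findmnt -no OPTIONS /mnt/cache | tr ',' '\n' | grep -x 'ro'
```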

 

I'll attach a new syslog once it has stopped.

7 hours ago, johnnie.black said:

Yep, but if there are vdisks they should be copied with the --sparse=always flag to remain sparse, especially when copying back to cache.
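A minimal demonstration of why `--sparse=always` matters (file names here are purely illustrative):

```shell
# Create a 100 MiB sparse file standing in for a vdisk: it has a large
# apparent size but allocates almost no real blocks on disk.
truncate -s 100M vdisk1.img

# Copy it while preserving holes; without --sparse=always the copy can
# end up allocating the full 100 MiB on the destination.
cp --sparse=always vdisk1.img vdisk1-copy.img

# Compare apparent size vs. actual allocation:
du -h --apparent-size vdisk1-copy.img
du -h vdisk1-copy.img
```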

 

 

Thanks for your help.  I've now got everything back up and running again.  It took over six hours to copy everything back to the cache drive, mainly because of Plex; I never realised that the Plex appdata was so large!

