Mover is stopping consistently short of the full data move

buccadebeppo · March 8, 2022

Hello!

I've been beating my head against my desk trying to figure this one out.

Mover is stopping consistently short of the full data move I am expecting. My cache pool is 4x1TB SATA SSDs set up with the default RAID 1 btrfs, leaving 2TB of usable space. Of this 2TB, only about 550GB gets moved by the mover and the other roughly 1.5TB stays behind.

Things I have tried so far:

Repeatedly manually triggering Mover
Clean rebooting the server
Redownloading CA Mover Tuning (I had used it before but removed it when I didn't need it, long before this issue started)
Defined "Minimum free space" for each share (using 10MB as the value for each) (found idea from the link below)

Notable recent occurrences:

Had an unclean shutdown
CA Backup/Restore Appdata ran a full backup (which is still on the cache)
- This backup is quite large (1.13TB) as it contains the Appdata for my plex server. This was something I had been intending to remove from the CA Backup/Restore Appdata backup but haven't got around to yet.

I have attached the diagnostics file from my server (pulled just now) to hopefully provide better details than I can.

Any ideas on how to get Mover functioning properly again?

Thanks!

anton-diagnostics-20220307-1900.zip

ChatNoir · March 8, 2022

There is a reason why Mover is not doing what you expect.

You should check that:

buccadebeppo · March 9, 2022

Thank you for pointing me to that link. I activated Mover logging (duh! don't know why I didn't do that already). Here are the errors I'm seeing:

Mar  9 02:50:21 Anton kernel: BTRFS warning (device sdc1): csum failed root 5 ino 292877677 off 10958700544 csum 0xdc3c8e5e expected csum 0x58e40fb6 mirror 1
Mar  9 02:50:21 Anton kernel: btrfs_dev_stat_print_on_error: 2 callbacks suppressed
Mar  9 02:50:21 Anton kernel: BTRFS error (device sdc1): bdev /dev/sdc1 errs: wr 0, rd 0, flush 0, corrupt 305, gen 0
Mar  9 02:50:21 Anton kernel: BTRFS warning (device sdc1): csum failed root 5 ino 292877677 off 10958700544 csum 0xdc3c8e5e expected csum 0x58e40fb6 mirror 2
Mar  9 02:50:21 Anton kernel: BTRFS error (device sdc1): bdev /dev/sdaf1 errs: wr 0, rd 0, flush 0, corrupt 849, gen 0
Mar  9 02:50:21 Anton kernel: BTRFS warning (device sdc1): csum failed root 5 ino 292877677 off 10958700544 csum 0xdc3c8e5e expected csum 0x58e40fb6 mirror 1
Mar  9 02:50:21 Anton kernel: BTRFS error (device sdc1): bdev /dev/sdc1 errs: wr 0, rd 0, flush 0, corrupt 306, gen 0
Mar  9 02:50:21 Anton kernel: BTRFS warning (device sdc1): csum failed root 5 ino 292877677 off 10958700544 csum 0xdc3c8e5e expected csum 0x58e40fb6 mirror 2
Mar  9 02:50:21 Anton kernel: BTRFS error (device sdc1): bdev /dev/sdaf1 errs: wr 0, rd 0, flush 0, corrupt 850, gen 0
Mar  9 02:50:21 Anton kernel: BTRFS warning (device sdc1): csum failed root 5 ino 292877677 off 10958700544 csum 0xdc3c8e5e expected csum 0x58e40fb6 mirror 1
Mar  9 02:50:21 Anton kernel: BTRFS error (device sdc1): bdev /dev/sdc1 errs: wr 0, rd 0, flush 0, corrupt 307, gen 0
Mar  9 02:50:21 Anton shfs: copy_file: /mnt/cache/Media/4K Movies/Full Metal Jacket (1987)/Full Metal Jacket (1987) Remux-2160p h265.mkv /mnt/disk20/Media/4K Movies/Full Metal Jacket (1987)/Full Metal Jacket (1987) Remux-2160p h265.mkv (5) Input/output error

I've been trying to track down what this means exactly. Is this an indication of a failed SSD in the cache BTRFS pool?

JorgeB · March 9, 2022

Checksum errors mean btrfs is detecting data corruption:

Mar  7 16:08:38 Anton kernel: BTRFS info (device sdc1): bdev /dev/sdc1 errs: wr 0, rd 0, flush 0, corrupt 288, gen 0
Mar  7 16:08:38 Anton kernel: BTRFS info (device sdc1): bdev /dev/sdb1 errs: wr 0, rd 0, flush 0, corrupt 818, gen 0
Mar  7 16:08:38 Anton kernel: BTRFS info (device sdc1): bdev /dev/sde1 errs: wr 0, rd 0, flush 0, corrupt 910, gen 0
Mar  7 16:08:38 Anton kernel: BTRFS info (device sdc1): bdev /dev/sdaf1 errs: wr 0, rd 0, flush 0, corrupt 796, gen 0

on all four devices. Since you're using ECC RAM, it's unlikely that RAM is the problem, and the other pool is also fine, which suggests the issue might be device related or something else. Run a scrub, delete/restore any corrupt files listed in the syslog, then reset the stats and keep monitoring the pool.
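
To keep an eye on the counters while monitoring, something like the following might help. It is only a sketch: the function name latest_corrupt_counts is made up for illustration, and it assumes the kernel lines keep the "bdev <device> errs: ... corrupt <n>," layout shown in the excerpts above. It reads syslog text on stdin and prints the most recent "corrupt" value logged for each device.

```shell
#!/bin/bash
# latest_corrupt_counts (hypothetical helper): read syslog text on stdin and
# print the last "corrupt" counter the kernel logged for each btrfs device.
# Assumes the "bdev <device> errs: ... corrupt <n>," layout seen in this thread.
latest_corrupt_counts() {
  awk '
    /BTRFS/ && / bdev / && /corrupt/ {
      dev = ""; count = ""
      for (i = 1; i <= NF; i++) {
        if ($i == "bdev")    dev = $(i + 1)
        if ($i == "corrupt") { count = $(i + 1); sub(/,$/, "", count) }
      }
      # Later lines overwrite earlier ones, so the newest value wins.
      if (dev != "" && count != "") last[dev] = count
    }
    END { for (d in last) print d, last[d] }
  ' | LC_ALL=C sort
}
```

It could be fed the live log with "latest_corrupt_counts < /var/log/syslog" (that path is an assumption about where the syslog lives on the server). A counter that keeps climbing after the corrupt files are gone would suggest the problem is still active.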

buccadebeppo · March 9, 2022

Thanks for taking a look! I'm a bit of a noob (and hope this can help those who may need it in the future), so here's what I have looked up for this.

For the Scrub I found this link: https://wiki.unraid.net/Check_Disk_Filesystems#btrfs_scrub

Backlink from:

https://wiki.unraid.net/index.php/Check_Disk_Filesystems#Drives_formatted_with_BTRFS

https://www.reddit.com/r/unRAID/comments/qbwoa8/how_do_i_get_rid_of_the_btrfs_error_device_sde1/

Looks like I should run this on the terminal, correct?:

btrfs scrub start -B /dev/cache

Then how would I delete/restore any corrupt files? Would krusader work for deletion or is this a command line process?

Ultimately this is just a temporary cache before the data is written to the array and everything currently on it is very replaceable. My main goal with this is to restore full operation, even if that includes data loss.

ChatNoir · March 9, 2022

38 minutes ago, buccadebeppo said:
Looks like I should run this on the terminal, correct?:
btrfs scrub start -B /dev/cache

No, simply click on the first drive of your cache pool, scroll down until you find the scrub section, tick the repair box and hit the button (not sure if the box is useful in your case).

41 minutes ago, buccadebeppo said:

Then how would I delete/restore any corrupt files? Would krusader work for deletion or is this a command line process?

Your choice, through the network on your main machine (windows/mac/etc), or directly on Unraid (Krusader, Midnight Commander, etc.)
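
For anyone who prefers the terminal over the GUI, a rough command-line equivalent of the scrub and the stats reset is sketched below. The assumptions: the pool is mounted at /mnt/cache (taken from the paths in the Mover log above), and the helper names run, scrub_and_check and reset_stats are made up for this example. btrfs scrub expects the mount point of the pool or one of its member devices, so a path like /dev/cache would most likely be rejected. The helpers only print the commands unless RUN=1 is set, so they can be reviewed first.

```shell
#!/bin/bash
# POOL is an assumption: the mount point of the cache pool, based on the
# /mnt/cache paths that appear in the Mover log in this thread.
POOL="${POOL:-/mnt/cache}"

# run (hypothetical helper): print the command, execute it only when RUN=1.
run() {
  echo "+ $*"
  if [ "${RUN:-0}" = "1" ]; then "$@"; fi
}

scrub_and_check() {
  # -B keeps the scrub in the foreground until it finishes; without -r it
  # also tries to repair blocks for which a good copy exists.
  run btrfs scrub start -B "$POOL"
  # Summary of the last scrub, including the error counts.
  run btrfs scrub status "$POOL"
  # Per-device wr/rd/flush/corrupt/gen counters.
  run btrfs device stats "$POOL"
}

reset_stats() {
  # -z prints the counters and then resets them to zero.
  run btrfs device stats -z "$POOL"
}
```

Calling scrub_and_check prints the three commands; RUN=1 scrub_and_check would actually execute them. reset_stats is meant for after the corrupt files have been deleted or restored, so that any new errors stand out.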

buccadebeppo · March 16, 2022

Thank you all for the help! My issue seems to be resolved now. Just wanted to provide what I did for anyone who may find this in the future.

I ran the BTRFS scrub as indicated above. It returned the following results:

Scrub started:    Thu Mar 10 00:41:11 2022
Status:           finished
Duration:         1:01:12
Total to scrub:   601.48GiB
Rate:             1.01GiB/s
Error summary:    csum=8
  Corrected:      0
  Uncorrectable:  8
  Unverified:     0

While this may have helped some, I did not perceive any improvement.

After this I went back to the Sys Log and sorted by Red Errors. I then deleted all files that were showing up as errors and re-downloaded them. This included my Appdata backup, which I changed the settings for to exclude Plex before rerunning. With Plex excluded the Appdata Backup runtime dropped from 12+ hours to ~1 hour to complete.

Since removing and re-adding all of these files everything appears to be working as intended.
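
The syslog step described above could also be done from the terminal. The sketch below is not what was actually run here: failed_mover_files is a made-up name, and the pattern assumes the "shfs: copy_file: <source> <destination> (5) Input/output error" layout from the Mover log earlier in the thread, with the destination under /mnt/disk<N>/. It prints each source file that Mover failed to copy, once.

```shell
#!/bin/bash
# failed_mover_files (hypothetical helper): read syslog text on stdin and list
# the source paths of files Mover could not copy because of I/O errors.
# Assumes the destination path starts with /mnt/disk<N>/, as in this thread.
failed_mover_files() {
  sed -n 's|^.*shfs: copy_file: \(/mnt/.*\) /mnt/disk[0-9]*/.* ([0-9]*) Input/output error$|\1|p' \
    | LC_ALL=C sort -u
}
```

Running "failed_mover_files < /var/log/syslog" (the path is an assumption) gives the list of files to delete and re-download or restore. Paths with spaces are handled because the pattern splits on the " /mnt/disk" boundary rather than on whitespace.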
