ldrax Posted March 29, 2020

I have 4 SSD drives in the cache pool. I trusted my itchy hands to do some cabling work, and it turned out the power cables to 2 of the drives were not seated firmly, so those drives dropped out of and back into the pool over a short period (less than 5 minutes). I quickly stopped the array and powered down the system.

Now it's all back up. I started the array in maintenance mode and ran a read-only filesystem check on the cache:

[1/7] checking root items
[2/7] checking extents
[3/7] checking free space cache
btrfs: csum mismatch on free space cache
failed to load free space cache for block group 4040680275968
btrfs: space cache generation (126118) does not match inode (126182)
failed to load free space cache for block group 4044975243264
btrfs: csum mismatch on free space cache
failed to load free space cache for block group 4561445060608
btrfs: space cache generation (126113) does not match inode (126155)
<------------------ truncated, there are about 200 lines of this error --------->
[4/7] checking fs roots
[5/7] checking only csums items (without verifying data)
[6/7] checking root refs
[7/7] checking quota groups skipped (not enabled on this FS)
Opening filesystem to check...
Checking filesystem on /dev/sdh1
UUID: 3c12a05c-3bba-493e-98e5-d2d3a2c7e107
found 1430313537536 bytes used, no error found
total csum bytes: 993257632
total tree bytes: 2197733376
total fs tree bytes: 572276736
total extent tree bytes: 238682112
btree space waste bytes: 415841311
file data blocks allocated: 67854097842176
 referenced 1405220114432

I saw on another post that @johnnie.black mentioned the 'csum mismatch' is just a warning, nothing to worry about. Can you advise on what to do from here? While I'm glad to see the line 'found 1430313537536 bytes used, no error found', I hope nothing serious happened. Do I restart the array in normal mode and then run a [repairing] scrub? (I have disabled Docker and VMs for the time being.) Thanks!
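For anyone finding this thread later: the check above is the equivalent of running this from the console with the array in maintenance mode (device name is from my system, substitute any member of your own pool):

```shell
# Read-only btrfs check: reports problems, never writes to the device.
# Run against any one device of the pool while it is unmounted
# (array in maintenance mode).
btrfs check --readonly /dev/sdh1
```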
JorgeB Posted March 29, 2020

That will usually clear by itself, but if you want, the clear-space-cache option is considered safe to use with btrfs check. Check the man page for more info: https://btrfs.wiki.kernel.org/index.php/Manpage/btrfs-check
ldrax Posted March 29, 2020 (Author)

Thank you @johnnie.black as always! Do I run it with v1 or v2? The descriptions are there, but to be honest I don't really grasp the concept of the free space cache here.

--clear-space-cache v1|v2
    completely wipe all free space cache of given type

    For free space cache v1, the clear_cache kernel mount option only rebuilds the free space cache for block groups that are modified while the filesystem is mounted with that option. Thus, using this option with v1 makes it possible to actually clear the entire free space cache.

    For free space cache v2, the clear_cache kernel mount option destroys the entire free space cache. This option, with v2, provides an alternative method of clearing the free space cache that doesn't require mounting the filesystem.
JorgeB Posted March 29, 2020

Default is v1, and like mentioned it's considered safe to clear, but make sure backups are up to date before doing it.
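For the default v1 cache, the command would be something like this (a sketch, using the same /dev/sdh1 device from the check output above; the filesystem must be unmounted, i.e. array in maintenance mode):

```shell
# Wipe the v1 free space cache; it is rebuilt automatically on the
# next mount, so no data is touched. Run while the pool is unmounted.
btrfs check --clear-space-cache v1 /dev/sdh1
```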
ldrax Posted March 29, 2020 (Author)

Thank you @johnnie.black, I'll do that shortly and update again here. You've been very helpful each time.
ldrax Posted March 30, 2020 (Author)

So before running the --clear-space-cache command, I started the array in normal mode to back up some selected files from the cache pool. While doing this, there were a lot of error messages in the syslog, along with messages about the space cache being rebuilt:

Mar 30 15:18:56 gpt760t kernel: BTRFS warning (device sdh1): failed to load free space cache for block group 4907223482368, rebuilding it now
Mar 30 15:18:56 gpt760t kernel: BTRFS warning (device sdh1): failed to load free space cache for block group 4158791876608, rebuilding it now
Mar 30 15:18:56 gpt760t kernel: BTRFS warning (device sdh1): failed to load free space cache for block group 4512052936704, rebuilding it now
Mar 30 15:18:56 gpt760t kernel: BTRFS error (device sdh1): csum mismatch on free space cache
Mar 30 15:18:56 gpt760t kernel: BTRFS warning (device sdh1): failed to load free space cache for block group 4999565279232, rebuilding it now
Mar 30 15:19:19 gpt760t kernel: io_ctl_check_generation: 21 callbacks suppressed
Mar 30 15:19:19 gpt760t kernel: BTRFS error (device sdh1): space cache generation (126117) does not match inode (126155)
Mar 30 15:19:19 gpt760t kernel: BTRFS warning (device sdh1): failed to load free space cache for block group 4959836831744, rebuilding it now
Mar 30 15:19:19 gpt760t kernel: BTRFS error (device sdh1): space cache generation (126115) does not match inode (126182)
Mar 30 15:19:19 gpt760t kernel: BTRFS warning (device sdh1): failed to load free space cache for block group 4998491537408, rebuilding it now
--- truncated, hundreds of these same messages ----

Once the backup was completed, I restarted the array in maintenance mode and ran a check --readonly again, just to verify.
All previous errors are now gone:

[1/7] checking root items
[2/7] checking extents
[3/7] checking free space cache
[4/7] checking fs roots
[5/7] checking only csums items (without verifying data)
[6/7] checking root refs
[7/7] checking quota groups skipped (not enabled on this FS)
Opening filesystem to check...
Checking filesystem on /dev/sdh1
UUID: 3c12a05c-3bba-493e-98e5-d2d3a2c7e107
found 1030725935104 bytes used, no error found
total csum bytes: 603511636
total tree bytes: 1761673216
total fs tree bytes: 533708800
total extent tree bytes: 218890240
btree space waste bytes: 481730316
file data blocks allocated: 67454937907200
 referenced 1007469522944

I guess I don't have to run btrfs check --clear-space-cache then? Thanks @johnnie.black!
ldrax Posted March 30, 2020 (Author)

A btrfs scrub (read-only, not repairing yet), however, shows a lot of errors found:

scrub status for 3c12a05c-3bba-493e-98e5-d2d3a2c7e107
	scrub started at Mon Mar 30 15:52:07 2020, running for 00:01:21
	total bytes scrubbed: 121.53GiB with 14191 errors
	error details: csum=14191
	corrected errors: 0, uncorrectable errors: 0, unverified errors: 0
(in progress)
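For reference, the read-only scrub above can be started like this (the mount point is my cache pool, adjust to your own):

```shell
# -r = read-only scrub: verify all data/metadata checksums but
# do not write any corrections yet.
btrfs scrub start -r /mnt/cache

# Check progress and error counts while it runs (or after it finishes).
btrfs scrub status /mnt/cache
```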
JorgeB Posted March 30, 2020

23 minutes ago, ldrax said:
    I guess I don't have to run btrfs check --clear-space-cache then?

Yes, like mentioned, that issue tends to get fixed on its own.

18 minutes ago, ldrax said:
    however, shows a lot of errors found:

That's a different issue, and unlike the previous one this one is important: those checksum errors are in the data/metadata. Run a correcting scrub and check that all errors are corrected.
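A correcting scrub is simply a scrub without the read-only flag (sketch, assuming the pool is mounted at /mnt/cache and is a redundant profile such as raid1, so a good copy exists to repair from):

```shell
# Without -r, scrub rewrites any block whose checksum fails using
# the good copy from another device in the redundant pool.
btrfs scrub start /mnt/cache

# Afterwards, confirm "corrected errors" equals the total error count
# and "uncorrectable errors" is 0.
btrfs scrub status /mnt/cache
```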
ldrax Posted March 30, 2020 (Author)

Done, looks like all errors were corrected. Thanks!

scrub status for 3c12a05c-3bba-493e-98e5-d2d3a2c7e107
	scrub started at Mon Mar 30 16:21:50 2020 and finished after 00:26:03
	total bytes scrubbed: 1.87TiB with 300965 errors
	error details: csum=300965
	corrected errors: 300965, uncorrectable errors: 0, unverified errors: 0
JorgeB Posted March 30, 2020

Make sure to check this for better pool monitoring; one of the most common reasons for those errors in a pool is one of the devices dropping offline and then coming back online.
ldrax Posted March 30, 2020 (Author)

Thanks, I'll check it out!