GlennCottam Posted August 19, 2021

Yesterday, I got a notification that the cache pool was full, and took action to rectify it (emptied my recycle bin and invoked the mover, which cleared plenty of space off the pool). This morning, all of my dockers were down, and every docker was reporting that the file system was read-only. This meant my cache was mounted read-only.

Thinking that one of the cache SSDs might have just been full, I attempted to balance the pool. The balance exited with: "BTRFS info (device sdd1): balance: ended with status -30". I rebooted the server and attempted the balance again, with the same error. The error only shows itself in the disk log for one of the SSDs. I have also tried a scrub, which failed the same way. The other 3 SSDs have no errors in their logs, but the SSD having issues has the following errors on boot:

Aug 19 13:00:33 Unraid kernel: BTRFS error (device sdd1): unable to find ref byte nr 11987553361920 parent 0 root 5 owner 36039698 offset 1277067264
Aug 19 13:00:33 Unraid kernel: BTRFS: error (device sdd1) in __btrfs_free_extent:3092: errno=-2 No such entry
Aug 19 13:00:33 Unraid kernel: BTRFS info (device sdd1): forced readonly
Aug 19 13:00:33 Unraid kernel: BTRFS: error (device sdd1) in btrfs_run_delayed_refs:2144: errno=-2 No such entry

I then rebooted into safe mode and entered maintenance mode in order to see if more information would show itself.
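For anyone else hunting for these messages, a small sketch of filtering the kernel log down to just the BTRFS lines and the device they mention. It is shown here against a captured sample of the log lines from this post, so it is safe to run anywhere; on the live server you would pipe the real log instead (e.g. `dmesg | grep -E 'BTRFS (error|warning|info)'`):

```shell
# Two of the kernel log lines from this thread, captured as a sample; on the
# live server you would pipe the real log instead:
#   dmesg | grep -E 'BTRFS (error|warning|info)'
sample='Aug 19 13:00:33 Unraid kernel: BTRFS error (device sdd1): unable to find ref byte nr 11987553361920
Aug 19 13:00:33 Unraid kernel: BTRFS info (device sdd1): forced readonly'

# Pull out which device(s) the BTRFS messages refer to
echo "$sample" | grep -oE '\(device [a-z0-9]+\)' | sort -u
```

Here all the errors point at a single pool member (sdd1), which matches the per-disk logs described above.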
I was able to run a btrfs check on the SSD, getting the following results:

[1/7] checking root items
[2/7] checking extents
parent transid verify failed on 10084585357312 wanted 2949214 found 2949025
parent transid verify failed on 10957839237120 wanted 18446612689103786680 found 2993596
parent transid verify failed on 10957839237120 wanted 18446612689103786680 found 2993596
parent transid verify failed on 10957839237120 wanted 18446612689103786680 found 2993596
Ignoring transid failure
data backref 11987553361920 root 5 owner 36039698 offset 1277067264 num_refs 0 not found in extent tree
incorrect local backref count on 11987553361920 root 5 owner 36039698 offset 1277067264 found 1 wanted 0 back 0x11deca50
incorrect local backref count on 11987553361920 root 11763000595709952005 owner 4294936705 offset 1277067264 found 0 wanted 1 back 0x11df3a50
backref disk bytenr does not match extent record, bytenr=11987553361920, ref bytenr=0
backpointer mismatch on [11987553361920 86016]
ERROR: errors found in extent allocation tree or chunk allocation
[3/7] checking free space tree
[4/7] checking fs roots
parent transid verify failed on 10957839237120 wanted 18446612689103786680 found 2993596
Ignoring transid failure
[5/7] checking only csums items (without verifying data)
parent transid verify failed on 10957839237120 wanted 18446612689103786680 found 2993596
Ignoring transid failure
[6/7] checking root refs
[7/7] checking quota groups skipped (not enabled on this FS)
ERROR: transid errors in file system
Opening filesystem to check...
Checking filesystem on /dev/sdd1
UUID: b81ae343-60cb-4303-8fac-192942135255
cache and super generation don't match, space cache will be invalidated
found 1205836599296 bytes used, error(s) found
total csum bytes: 1174204308
total tree bytes: 2446475264
total fs tree bytes: 917962752
total extent tree bytes: 130154496
btree space waste bytes: 430889587
file data blocks allocated: 1675174596608
 referenced 1200826040320

I am completely lost on what to do at this point and need assistance. The only thing I can think to do now is to copy the contents of the cache onto the array and reformat the cache disks. I appreciate any help I can get!

unraid-diagnostics-20210819-1322.zip
JorgeB Posted August 19, 2021

5 minutes ago, GlennCottam said:
The only thing I can think to do now is to copy the contents of the cache onto the array, and reformat the cache disks.

Yes, that's what you should do; there are some recovery options here if needed.
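Not from the linked post, just a generic sketch of the copy step: with the pool mounted read-only you can still copy everything off with plain `cp` (or `rsync`). The Unraid mount points shown in the comments are assumptions to adjust for your system; the runnable part below uses throwaway directories so it is safe to try anywhere:

```shell
# On the actual server the copy would look something like:
#   cp -a /mnt/cache/. /mnt/disk1/cache-backup/
# (/mnt/cache and /mnt/disk1 are Unraid's usual mount points - an assumption,
# adjust to your system.) Demonstrated on throwaway directories below so the
# commands are safe to run anywhere.
src=$(mktemp -d)    # stands in for the read-only cache pool
dst=$(mktemp -d)    # stands in for an array disk
mkdir -p "$src/appdata"
echo "container config" > "$src/appdata/settings.conf"

cp -a "$src/." "$dst/"    # -a preserves ownership, permissions, timestamps
ls "$dst/appdata"
```

The `-a` (archive) flag matters here so docker appdata keeps its ownership and permissions when it is eventually copied back to a fresh pool.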
GlennCottam Posted August 19, 2021

Thanks for the quick reply! I did see that post in my googling, but wanted to see if there was another option before I proceeded. Started the copy to the array. I'll reply if anything comes up.
ChatNoir Posted August 19, 2021

1 hour ago, GlennCottam said:
Yesterday, I got a notification that the cache pool was full, and took action to rectify it

If not already set, you should probably configure your pool notification thresholds. That would allow you to act before you have the issue. And setting a Minimum free space would allow Unraid to write to the array instead of filling your pool, for all shares set to Prefer, if I remember correctly (even though it is not indicated as such in the help tooltip).
itimpi Posted August 19, 2021

13 minutes ago, ChatNoir said:
even though it is not indicated as such in the help tooltip

I think the help text is actually incorrect, as it mentions new files resulting in an 'out of space' error. That is not true if the user share's Use Cache setting is Yes or Prefer, since then new files can instead be written to the array.
GlennCottam Posted August 20, 2021

After moving the files all day yesterday, I am ready to try to repair the disk itself using:

btrfs check --repair /dev/sdd

I am wondering: will using btrfs check --repair erase the disk?

On a side note, when I checked this morning, the server was unresponsive. I couldn't access the physical terminal (keyboard and monitor on the server) as the screen was blank, the web UI would not respond, and my SSH connection failed. I had to force-reboot the server to get control back. Unfortunately, the syslog does not show anything useful. I have had these types of crashes before, after I installed a new 10GbE networking card; I thought I had it fixed, as it hadn't done this in a few weeks. I am not sure if this problem is related to the cache disk, or is another problem I have to find a solution for.

Regardless, the single cache disk is still acting strange, still giving the same error, and still mounted read-only.
JorgeB Posted August 20, 2021

17 minutes ago, GlennCottam said:
I am wondering if using btrfs check --repair will erase the disk?

It shouldn't, but it might make things worse, or only fix it temporarily; I recommend re-formatting it once it's backed up.
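A note on the safety distinction for anyone following along: `btrfs check` without `--repair` only reads the device, so it is safe to re-run for diagnosis; `--repair` is the variant that writes (and is the risky one). A guarded sketch, where the device path from this thread is an assumption to adjust for your system:

```shell
# Wrapper that does nothing unless btrfs-progs is installed and the target
# is a real block device. `btrfs check` WITHOUT --repair is read-only, so it
# is the safe first step; only run --repair after the data is backed up.
# /dev/sdd1 (the failing pool member in this thread) is an assumption.
run_check() {
    if command -v btrfs >/dev/null 2>&1 && [ -b "$1" ]; then
        btrfs check "$1"    # read-only diagnosis, makes no changes
    else
        echo "skipping: btrfs-progs or $1 not present here"
    fi
}
run_check /dev/sdd1
```

The device must also be unmounted (maintenance mode takes care of that on Unraid) before running any form of check.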
GlennCottam Posted August 20, 2021

10 minutes ago, JorgeB said:
It shouldn't, but it might make things worse, or just fix it temporarily, recommend re-formatting it once it's backed up.

Thank you! I formatted the drive and then reformatted the cache pool. However, I now have a new problem: the pool is saying I only have 1.1TB of cache, when I have 2.1TB of SSDs installed. Everything I can look at seems to be in working order, and all the SSDs show the right allocation in their menus, so I am not sure where the issue lies.
JorgeB Posted August 20, 2021

Default mode is raid1; you can convert to other profiles.
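That matches the arithmetic: btrfs raid1 stores two copies of every block spread across the pool members, so usable space is roughly half the raw total regardless of device count. A small sketch, where the ~2.1 TB raw figure comes from the post above:

```shell
# btrfs raid1 writes every block twice, so usable capacity is about half
# of the raw total, regardless of how many devices are in the pool.
raw_gb=2100                   # ~2.1 TB of raw SSD capacity (from the thread)
usable_gb=$((raw_gb / 2))     # raid1: each block stored on two devices
echo "raid1 usable: ~${usable_gb} GB"
```

That lines up with the ~1.1 TB Unraid reports. On the server itself, `btrfs filesystem usage /mnt/cache` shows the exact raw vs. usable split per profile.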
GlennCottam Posted August 20, 2021 (edited)

5 minutes ago, JorgeB said:
Default mode is raid1, you can convert to others.

That would make sense! Thank you!

Edited August 20, 2021 by GlennCottam