
Cache pool mounted read-only with RAID showing full but disk showing empty


Solved by apandey


Hi guys,

 

This morning my Unraid system suddenly started throwing a bunch of errors along the lines of "BTRFS error: error writing primary super block to device". It completely filled up my syslog and alerted me to a problem.

 

After running btrfs dev stats /mnt/cache it looked like the pool had a bunch of errors, so I ended up remotely shutting down my Unraid system. When I got home I restarted the system and it mounted my /mnt/cache drive in read-only mode; it is currently using 275GB with 1.72TB free according to the Unraid dashboard.

[Screenshot: Unraid dashboard showing the cache pool with 275GB used and 1.72TB free]
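For reference, the per-device error counters come from a command like the one below; the output format is roughly as shown (device name and values are illustrative, in my case several counters were non-zero):

btrfs device stats /mnt/cache
# [/dev/nvme0n1p1].write_io_errs    0
# [/dev/nvme0n1p1].read_io_errs     0
# [/dev/nvme0n1p1].flush_io_errs    0
# [/dev/nvme0n1p1].corruption_errs  0
# [/dev/nvme0n1p1].generation_errs  0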

After a lot of searching on the forums, I think I finally figured out the issue. I migrated from an old Unraid build with the following steps for the cache pool:

  • Nvme_1_256GB and Nvme_2_256GB SSDs were on the system
  • Removed Nvme_2_256GB and inserted Nvme_1_2TB
  • Rebuilt cache pool (now Nvme_1_256GB and Nvme_1_2TB)
  • Removed Nvme_1_256GB and inserted Nvme_2_2TB
  • Rebuilt cache pool (now Nvme_1_2TB and Nvme_2_2TB)

I think what happened is that while the pool had mixed NVMe drives it carried over the old partition/RAID layout, and now that I've passed the size of the original 256GB NVMe it is failing to write. If I open up my cache pool I can see from btrfs filesystem df that the total size is not the full 2TB but rather ~262GB, and my usage ratio is 96.8%.

[Screenshot: btrfs filesystem df output showing ~262GB total with a 96.8% usage ratio]
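For reference, that check is along these lines; note that "total" in this output is allocated chunk space, not the partition size:

btrfs filesystem df /mnt/cache    # per-type (Data/Metadata/System) allocation, "total" = allocated chunks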

 

Now that I have my 2TB NVMes, how do I fix it so that the pool uses the full 2TB? The drive is mounted read-only, so I can't run the mover to move files off this drive and do a normal swap.
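For the record, if the partition itself already spans the disk, a btrfs resize along these lines would normally grow the filesystem to use it (a sketch, assuming the pool is mounted read-write at /mnt/cache and the members have device IDs 1 and 2):

btrfs filesystem resize 1:max /mnt/cache   # grow pool member devid 1 to its full partition
btrfs filesystem resize 2:max /mnt/cache   # same for pool member devid 2

This won't work while the pool is stuck read-only, though.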

 

I am planning to copy the files from /mnt/cache onto a disk in my array and then reformat the two cache drives. Is this the correct way to ensure that all 2TB are available for the cache pool, and to fix the issue?

 

Is there another way to fix this issue and bring my cache back online? Either way, I'm copying everything off the read-only /mnt/cache to /mnt/disk7/restore using Midnight Commander so that I have a backup of everything.

 

I also attached my diagnostics!

 

EDIT: I was able to successfully unmount and remount my cache pool and deleted a 2GB syslog file, which brought it back within its usable size. I then ran a btrfs scrub, which fixed a bunch of errors. Everything is back to working, but I feel like I'm playing with fire being so close to the max size. So the new main question is: how do I alter the size of the RAID1 partition on my cache pool, since it is still set from the old 256GB drive?
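For reference, the scrub itself was something like:

btrfs scrub start /mnt/cache    # check (and repair, where possible) checksummed data in the background
btrfs scrub status /mnt/cache   # watch progress and corrected/uncorrectable error counts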

 

Thanks for the help!

aincrad-diagnostics-20230308-1910.zip

16 hours ago, JorgeB said:

That total is the currently allocated btrfs size, not the partition size. Unraid always uses the full partition; you can check with

fdisk -l

 

Thank you! I've attached my fdisk -l result, and I do have the full partition size. How do I increase the currently allocated btrfs size in UnraidOS so that it can fully use the whole drive?

 

[Screenshot: fdisk -l output showing the cache partitions at their full 2TB size]
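A quick way to compare the two sizes side by side (the device name here is an assumption for illustration):

fdisk -l /dev/nvme0n1             # partition table view: the full 2TB partition
btrfs filesystem show /mnt/cache  # btrfs view: per-device size and bytes used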

6 hours ago, 97WaterPolo said:

how do I increase the currently allocated btrfs size in UnraidOS so that it can fully use the whole drive?

Normally you don't have to; btrfs does that automatically. Also, if you are spooked by the usage ratio being close to 100%, that is normal. It simply means btrfs is using its allocated space efficiently, without much wastage. Think of it like fragmentation loss: when that number is much lower than 100%, you can run a balance to rewrite data blocks and improve the situation. That is why it says "No balance required" next to it.
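As an illustration of that kind of rebalance (not needed here, per the above), a filtered balance that only rewrites mostly-empty data chunks might look like:

btrfs balance start -dusage=50 /mnt/cache   # rewrite only data chunks that are <50% full
btrfs balance status /mnt/cache             # check progress of a running balance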

 

I think your problem has nothing to do with space, but rather is a case of filesystem corruption; when you tried to repair the errors, I don't think that worked well. Your syslog has "parent transid verify failed", which is a sign of a broken filesystem. That is probably what led to it going read-only.

 

I will leave it to someone more experienced with btrfs to suggest next steps, but I think availability of space is not something you need to worry about.

12 hours ago, apandey said:

Normally you don't have to; btrfs does that automatically. Also, if you are spooked by the usage ratio being close to 100%, that is normal. It simply means btrfs is using its allocated space efficiently, without much wastage. Think of it like fragmentation loss: when that number is much lower than 100%, you can run a balance to rewrite data blocks and improve the situation. That is why it says "No balance required" next to it.

 

I think your problem has nothing to do with space, but rather is a case of filesystem corruption; when you tried to repair the errors, I don't think that worked well. Your syslog has "parent transid verify failed", which is a sign of a broken filesystem. That is probably what led to it going read-only.

 

I will leave it to someone more experienced with btrfs to suggest next steps, but I think availability of space is not something you need to worry about.

 

12 hours ago, apandey said:

run the following to check your filesystem as a start

 

btrfs device stats /mnt/cache

 

also try running a scrub to see if you have any checksum errors

 

If I had a backup, I would redo the cache rather than fight this, unless the value-add is purely for learning.

 

 

Got it, thank you for the input! After I restarted my system and it mounted read-only, I stopped the array and mounted just the cache. I was then able to delete the syslog file, and I then ran the btrfs scrub, which fixed a bunch of errors (thankfully no uncorrectable ones). After that I ran "btrfs device stats -z /mnt/cache" to zero out the counters. Since then it has been running smoothly with no issues (last few days). Redo the cache as in: change all my shares to the array, and then reformat both of my cache drives in the pool?

[Screenshot: btrfs device stats output with all error counters back at zero]
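Roughly, the recovery sequence was as follows (the syslog path below is hypothetical):

rm /mnt/cache/system/syslog         # free space on the full pool (hypothetical path)
btrfs scrub start -B /mnt/cache     # -B blocks until the scrub finishes
btrfs device stats -z /mnt/cache    # print the error counters, then reset them to zero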

 

9 hours ago, JorgeB said:

As mentioned, there is nothing you need to do. btrfs first allocates 1GiB chunks, then any data goes into them; the same applies to metadata. Chunks are allocated as needed (and also removed when fully empty).
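To see that chunk accounting directly, compare "Device size" with "Device allocated" in:

btrfs filesystem usage /mnt/cache   # overall device size vs allocated chunks vs unallocated space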

Thank you for clarifying how it allocates chunks; I didn't realize that it dynamically adds and removes them as needed. Since the total was rather close to the size of my old drives (around 256GB), I thought it was something related to that rather than filesystem corruption. I followed the linked guide and set up alerts so that I will know if there are ever errors and can run a scrub. Do you think it is still worth it to move the cache to the array and then reformat my pool?

 

EDIT: Upon checking my pool I see that balance and scrub are disabled; should I enable them on some sort of schedule?
 

[Screenshot: pool settings showing the balance and scrub status]

  • Solution
3 hours ago, 97WaterPolo said:

Redo the cache as in: change all my shares to the array, and then reformat both of my cache drives in the pool?

If you have managed to repair the fs and scrub is no longer giving you errors, there's no need to do anything. You should probably run one more scrub to be sure.

 

3 hours ago, 97WaterPolo said:

Upon checking my pool I see that balance and scrub are disabled; should I enable them on some sort of schedule?

Good idea to do a scheduled scrub. I do one once a month, at a separate time from my parity check.

Balance you may not need; it depends on your data write pattern. If your utilization stays good, there's no need to rewrite everything (which is what balance does).
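Outside the pool settings GUI, a scheduled scrub could also be a simple cron entry, something like this sketch (the time and path are arbitrary, and cron may need the full path to the btrfs binary on your system):

# run a scrub on the cache pool at 02:00 on the 1st of each month
0 2 1 * * btrfs scrub start /mnt/cache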

1 hour ago, apandey said:

If you have managed to repair the fs and scrub is no longer giving you errors, there's no need to do anything. You should probably run one more scrub to be sure.

 

Good idea to do a scheduled scrub. I do one once a month, at a separate time from my parity check.

Balance you may not need; it depends on your data write pattern. If your utilization stays good, there's no need to rewrite everything (which is what balance does).

I implemented a monthly scrub after running one more scrub with no errors. Thank you for the input on the schedule and utilization! Much appreciated :)

