97WaterPolo Posted March 9, 2023 (edited)

Hi guys,

This morning my Unraid system suddenly started throwing a bunch of errors stating "BTRFS error: writing primary super block to device". It completely filled up my syslog and alerted me to a problem. After running btrfs dev stats /mnt/cache it looked like it had a bunch of errors, so I ended up remotely shutting down my Unraid system. When I got home I restarted the system and it mounted my /mnt/cache drive in read-only mode; it is currently using 275GB with 1.72TB free according to the Unraid dashboard.

After a lot of searching on the forums I think I finally figured out the issue. I did a migration from an old Unraid build with the following steps for the cache pool:

1. Nvme_1_256GB and Nvme_2_256GB SSDs were on the system
2. Removed Nvme_2_256GB and inserted Nvme_1_2TB
3. Rebuilt cache pool (now Nvme_1_256GB and Nvme_1_2TB)
4. Removed Nvme_1_256GB and inserted Nvme_2_2TB
5. Rebuilt cache pool (now Nvme_1_2TB and Nvme_2_2TB)

I think what happened is that while it had mismatched NVMe drives it carried over the old partition or RAID layout, and now that I've passed the size of the original 256GB NVMe, it is failing to write. If I open up my cache pool I can see the btrfs filesystem df output, which indicates that the total size is not the full 2TB but rather ~262GB, and my usage ratio is 96.8%.

Now that I have my 2TB NVMes, how do I fix it so that it uses the full 2TB? The drive has been mounted read-only, so I can't run the mover to move files off this drive and do a normal swap. I am planning to copy the files from /mnt/cache onto a disk on my array and then reformat the two cache drives. Is this the correct way to ensure that all 2TB are available for the cache pool and to fix the issue? Is there another way to fix this and bring my cache back online? No matter what, I'm copying everything off the read-only /mnt/cache to /dev/disk7/restore using Midnight Commander so that I have a backup of everything.
I also attached my diagnostics!

EDIT: I was able to successfully unmount and remount my cache pool and deleted a 2GB syslog file, which brought it back within operating size. I then ran a btrfs scrub, which fixed a bunch of errors. Everything is back to working, but I feel like I'm playing with fire being so close to the max size. So the new main question is: how do I alter the size of the RAID1 partition on my cache pool, since it is still set from the old 256GB drive? Thanks for the help!

aincrad-diagnostics-20230308-1910.zip
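For anyone following along, these are the inspection commands referenced in the post above, as a sketch (they need root and a mounted btrfs pool; /mnt/cache is the pool path used in this thread):

```shell
# Per-device error counters (write/read/flush/corruption/generation errors)
btrfs device stats /mnt/cache

# Space allocated by btrfs per block-group type (Data/Metadata/System) --
# this "total" is allocation, not partition size
btrfs filesystem df /mnt/cache

# More detailed view: device size vs. allocated vs. actually used
btrfs filesystem usage /mnt/cache
```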
JorgeB Posted March 9, 2023

That total is the currently allocated btrfs size, not the partition size. Unraid should always use the full partition; you can check with fdisk -l.
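A minimal sketch of that check (the device names are examples; NVMe pool members typically show up as /dev/nvme0n1 and /dev/nvme1n1, but yours may differ):

```shell
# Show the partition tables; compare each partition's size to the drive size
fdisk -l /dev/nvme0n1 /dev/nvme1n1

# btrfs' own view of how much of each member device the filesystem spans
btrfs filesystem show /mnt/cache
```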
97WaterPolo Posted March 10, 2023 Author

16 hours ago, JorgeB said: That total is the currently allocated btrfs size, not the partition size. Unraid should always use the full partition; you can check with fdisk -l.

Thank you! I've attached my fdisk -l result. I do have it as the full partition size, so how do I increase the currently allocated btrfs size in Unraid OS so that it can fully use the whole drive?
apandey Posted March 10, 2023

6 hours ago, 97WaterPolo said: how do I increase the currently allocated btrfs size in Unraid OS so that it can fully use the whole drive?

Normally you don't have to; btrfs does that automatically. Also, if you are spooked by the usage ratio being close to 100%, that is normal. It simply means btrfs is using allocated space efficiently without much wastage. Think of it like fragmentation loss: when that number is much less than 100%, you would run a balance to rewrite data blocks and improve the situation. That is why it says "No balance required" next to it.

I think your problem has nothing to do with space, but rather is a case of filesystem corruption from when you tried to repair the errors. I don't think that worked well. Your syslog has "parent transid verify failed", which is a sign of a broken filesystem. That is probably what led to it going read-only.

I will leave it to someone more experienced with btrfs to suggest next steps, but I think availability of space is not something you need to worry about.
apandey Posted March 10, 2023

Run the following to check your filesystem as a start:

btrfs device stats /mnt/cache

Also try running a scrub to see if you have any checksum errors.

If I had a backup, I would re-do the cache rather than fighting this, unless the value-add is purely for learning.
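A sketch of that scrub workflow (run as root; -B keeps the scrub in the foreground until it finishes):

```shell
# Start a scrub and wait for it to complete
btrfs scrub start -B /mnt/cache

# Or start it in the background and poll progress
btrfs scrub start /mnt/cache
btrfs scrub status /mnt/cache

# Check the error counters afterwards
btrfs device stats /mnt/cache
```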
JorgeB Posted March 10, 2023

Like mentioned, nothing you need to do: btrfs first allocates 1GiB chunks, then any data goes there (same for metadata). Chunks are allocated as needed (and also removed when fully empty).
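The allocation behaviour described above can be illustrated with a small Python sketch (the numbers are hypothetical, not taken from the diagnostics): the "total" shown by btrfs filesystem df is allocated chunks, which grow on demand, so a used/allocated ratio near 100% is expected and healthy.

```python
import math

CHUNK = 1 << 30  # btrfs allocates data chunks in ~1 GiB units

def allocated_for(used_bytes, chunk=CHUNK):
    """Chunks are allocated on demand: just enough whole chunks to hold the data."""
    if used_bytes == 0:
        return 0
    return math.ceil(used_bytes / chunk) * chunk

# ~254.5 GiB of data on a 2 TB device: btrfs reports ~255 GiB "total",
# even though the partition itself is far larger.
used = 254 * CHUNK + CHUNK // 2
alloc = allocated_for(used)
print(alloc // CHUNK)          # chunks allocated: 255
print(round(used / alloc, 3))  # usage ratio 0.998 -- close to 100% is normal
```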
97WaterPolo Posted March 10, 2023 Author (edited)

12 hours ago, apandey said: I think your problem has nothing to do with space, but rather is a case of filesystem corruption from when you tried to repair the errors. Your syslog has "parent transid verify failed", which is a sign of a broken filesystem.

12 hours ago, apandey said: Run the following to check your filesystem as a start: btrfs device stats /mnt/cache. Also try running a scrub to see if you have any checksum errors. If I had a backup, I would re-do the cache rather than fighting this.

Got it, thank you for the input! After I restarted my system and it got mounted as read-only, I stopped the array and mounted just the cache. I was then able to delete the syslog file, and then I ran the btrfs scrub, which fixed a bunch of errors (thankfully no uncorrectable ones); afterwards I ran "btrfs device stats -z /mnt/cache" to zero out the counters. Since then it has been running smoothly with no issues (last few days).

Re-do the cache as in: change all my shares to the array, and then re-format both of my cache drives in the pool?
9 hours ago, JorgeB said: Like mentioned, nothing you need to do: btrfs first allocates 1GiB chunks, then any data goes there (same for metadata). Chunks are allocated as needed (and also removed when fully empty).

Thank you for clarifying how it allocates chunks. I didn't realize that it dynamically adds and removes them as needed. Since the reported total was rather close to the size of my old drives (around 256GB), I thought it was related to that rather than filesystem corruption. I set up alerts so that I will know if there are ever errors and can run a scrub. Do you think it is still worth moving the cache to the array and then reformatting my pool?

EDIT: Upon checking my pool I see that balance and scrub are disabled; should I enable them on some sort of schedule?
Solution apandey Posted March 10, 2023 (edited)

3 hours ago, 97WaterPolo said: Re-do the cache as in: change all my shares to the array, and then re-format both of my cache drives in the pool?

If you have managed to repair the fs and scrub is no longer giving you errors, there's no need to do anything. You should probably run one more scrub to be sure.

3 hours ago, 97WaterPolo said: Upon checking my pool I see that balance and scrub are disabled; should I enable them on some sort of schedule?

A scheduled scrub is a good idea. I do one once a month, at a separate time from my parity check.

Balance you may not need; it depends on your data write pattern. If your utilization stays good, there's no need to rewrite everything (which is what balance does).
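A monthly scrub schedule like this could be expressed as a cron entry (a sketch only; on Unraid you would more likely use the pool's built-in scrub schedule or the User Scripts plugin, and the btrfs binary path may differ on your system):

```
# Foreground scrub of the cache pool on the 1st of every month at 03:00
0 3 1 * * /sbin/btrfs scrub start -B /mnt/cache
```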
97WaterPolo Posted March 11, 2023 Author

1 hour ago, apandey said: If you have managed to repair the fs and scrub is no longer giving you errors, there's no need to do anything. You should probably run one more scrub to be sure.

Implemented a monthly scrub after running a scrub once more with no errors. Thank you for the input on the schedule and utilization! Much appreciated.