Better btrfs array monitoring

January 20, 2024

With btrfs, if you have a live running pool and a disk disappears from the system (i.e. you pull it or a cable flakes out), or if the disk just fails outright while the array is running, most of the btrfs monitoring commands give no indication that the disk is missing. For example, if you run `btrfs filesystem show` after a disk has dropped from a pool, it will still list the disk even though it's missing. Even if it's just a flaky cable and the disk reappears to the system, it will remain unused until you fully remount the filesystem (and then a scrub, not a balance, is all that's necessary to resync, but I digress).

If you unmount the pool and remount it with the disk still missing, you will need the degraded mount option, which unraid handles, but it's only after that degraded remount that `btrfs filesystem show` will indicate any missing devices. Likewise, unraid only indicates that a pool has missing devices after the array has been stopped.

This means unraid users are in the dark if a disk flakes out or completely fails while the array is running. A user who doesn't stop the array often could go months without knowing their pool is degraded.

Btrfs does, however, provide a means to detect device failures and issues via the `btrfs device stats` command. If any device stat shows a non-zero value, there is an issue with the array and it may be degraded. When a disk flakes out or fails, for example, the device stats will show write errors.
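As a rough sketch of what such a check could look like, the function below keeps only the non-zero counters from `btrfs device stats` output. The function name `nonzero_stats` and the `/mnt/cache` mount point are illustrative assumptions, not anything unraid ships.

```shell
# Print only the non-zero counters from `btrfs device stats` output on stdin.
# Each input line looks like: [/dev/sdb1].write_io_errs    0
nonzero_stats() {
    # The counter value is the last field; keep the line when it is a number other than 0.
    awk '$NF ~ /^[0-9]+$/ && $NF != 0 { print }'
}

# Usage on a live system (example mount point):
#   btrfs device stats /mnt/cache | nonzero_stats
```

No output means every counter on that pool is still zero; any printed line names the device and the kind of error that was recorded.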

It is absolutely critical to monitor btrfs device stats to detect a degraded array event on a running array. Thus, this feature request is to have device stats shown in the unraid GUI when you're viewing a pool, and to notify the admin of any non-zero device stats so they can act on them. Since the stats must be reset for later errors to be detectable, we'd also need a GUI option for resetting device stats after any issues are addressed.

I have some other ideas to make btrfs pools more resilient and efficient (particularly around the fact that I feel unraid runs balances much more often than necessary), but those are left for a separate feature request. Device stat monitoring is the most critical feature request, and I believe it is a requirement for proper pool monitoring.

Edited January 20, 2024 by JSE

January 21, 2024

20 hours ago, JSE said:

Device stat monitoring is the most critical feature request, and I believe it is a requirement for proper pool monitoring.

Agreed, I've had a feature request open for this since 2016.

January 21, 2024

Author

7 minutes ago, JorgeB said:

Agreed, I've had a feature request open for this since 2016.

High time it gets included, then. Do you have a link? Without this, a lot of people can lose data, and their pools could still be recoverable right now without them even knowing it.

I'm on a raid (ha, no pun intended) of recommending changes to improve the reliability of btrfs pools. I do a lot of this stuff manually in the shell, but it really should be included in a more user-friendly way, since most people aren't familiar with or experienced in managing a btrfs pool from the shell, and honestly, they shouldn't need to be.

January 21, 2024

Solution

The initial request was this one. It's been a long time since I've read it, and since I was just starting to use btrfs there may be some inaccuracies in it. My last request to LT was to implement something like the following. It also mentions zfs, since those pools aren't monitored either, and while zfs handles a dropped device better than btrfs, the user still needs to be notified:

Basically, monitor the output of "btrfs device stats /mountpoint" (or "zpool status poolname" for zfs) and use the existing GUI errors column for pools, which currently is just for show since it's always 0, to display the sum of all the possible device errors (there are 5 counters for btrfs and 3 for zfs). If the sum is non-zero for any pool device, generate a system notification. Optionally, hovering the mouse over the errors would show the type and number of errors, like we have now when hovering over the SMART thumbs-down icon in the dash.
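To illustrate the "sum of all the possible device errors" idea for btrfs, here is a minimal sketch that totals the five counters per device. The function name `sum_device_errors` is hypothetical, and how the GUI would actually compute or display the value is not specified in this thread.

```shell
# Sum the btrfs error counters per device from `btrfs device stats` output
# on stdin and print "<device> <total>" for each device, in input order.
sum_device_errors() {
    awk '
        NF < 2 { next }                      # skip blank or malformed lines
        {
            # Turn "[/dev/sdb1].write_io_errs" into "/dev/sdb1".
            dev = $1
            sub(/^\[/, "", dev)
            sub(/\]\.[a-z_]+$/, "", dev)
            if (!(dev in seen)) { seen[dev] = 1; order[++n] = dev }
            total[dev] += $NF                # counter value is the last field
        }
        END { for (i = 1; i <= n; i++) print order[i], total[order[i]] }
    '
}

# Usage on a live system (example mount point):
#   btrfs device stats /mnt/cache | sum_device_errors
```

Any device with a total above zero would be the trigger for a notification; the per-counter breakdown is still available from the raw command output for a hover tooltip.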

Additionally, you'd also need a button or check mark to reset the errors, which would run "btrfs dev stats -z /mountpoint" or "zpool clear poolname".
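A check-then-reset flow for btrfs could be sketched as below. It relies on the `--check` option of `btrfs device stats`, which per the btrfs-device man page sets bit 6 (value 64) of the exit status when any counter is non-zero; older btrfs-progs releases may not have it. The function name `check_pool` and the `BTRFS_CMD` override are hypothetical names introduced only for this example.

```shell
# Check one btrfs mount point using the exit status of `btrfs device stats --check`.
# BTRFS_CMD is a hypothetical override (handy for dry runs); it defaults to "btrfs".
check_pool() {
    local mnt=$1 rc=0
    "${BTRFS_CMD:-btrfs}" device stats --check "$mnt" >/dev/null 2>&1 || rc=$?
    if [ "$rc" -eq 0 ]; then
        echo "$mnt: all counters are zero"
    elif [ $((rc & 64)) -ne 0 ]; then
        # Bit 6 set: at least one counter is non-zero.
        echo "$mnt: non-zero counters; fix the cause, then reset with: btrfs device stats -z $mnt"
        return 1
    else
        # Any other failure, e.g. the path is not a btrfs mount.
        echo "$mnt: could not read device stats (exit status $rc)"
        return 2
    fi
}

# Usage on a live system (example mount point):
#   check_pool /mnt/cache
```

The reset is deliberately left as a printed hint rather than run automatically, so the counters are only cleared once the underlying problem has been dealt with.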

Any suggestions to make this better are welcome. I recently had the chance to ask LT about this again for 6.13, but as of now I don't know if this or something similar is going to be done.

I've also created a FAQ entry on monitoring btrfs and zfs pools with a script. While not perfect, it's better than nothing, but of course most users won't see it. I try to send as many as I can there, but usually it's only after there's already been a problem.
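This is not the FAQ script itself, but a minimal sketch of the zfs side of such monitoring: it picks out rows of `zpool status` output whose READ, WRITE or CKSUM counters are not 0. The function name `zfs_error_rows` and the pool name in the usage line are assumptions made for illustration.

```shell
# From `zpool status` output on stdin, print "<name> <read> <write> <cksum>"
# for every row in the config table that has a non-zero error counter.
zfs_error_rows() {
    awk '
        # The table starts after the "NAME STATE READ WRITE CKSUM" header.
        $1 == "NAME" && $3 == "READ" { in_cfg = 1; next }
        # The "errors:" summary line marks the end of the table.
        /^errors:/                   { in_cfg = 0 }
        # Compare as strings so abbreviated counts such as 1.2K also match.
        in_cfg && NF >= 5 && ($3 != "0" || $4 != "0" || $5 != "0") {
            print $1, $3, $4, $5
        }
    '
}

# Usage on a live system (example pool name):
#   zpool status tank | zfs_error_rows
```

Run from a scheduled script, any output from this function would be the cue to send a notification and, once the cause is fixed, to clear the counters.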

January 21, 2024

Author

This sounds perfect, exactly what we need. My thought was to have the stats appear on the pool page as well, possibly around the balance/scrub options, with a button to clear the stats. But 100% we're on the same page here: we definitely need this type of monitoring for pools. I haven't tested ZFS much since it was added, but if it's also missing monitoring, we need that too.

June 15, 2025

Posting this since it's surely overlooked by many, especially considering how central this is in Unraid and that nothing at all notifies you about it:

Run this and tell me you have all zeros:

for m in $(findmnt -t btrfs -n -o TARGET); do echo "--- $m ---"; btrfs device stats "$m"; done

(it will loop through all your btrfs mounts and output error statistics)

You can see the same info in Main > <each> Disk Settings > Pool Device Status

but unless you're looking for it, and understand what it means and what you should do...

And as yet another recurring reminder:

Unraid Array protects you from disk loss, not data corruption.
Btrfs detects data corruption but can't repair it without a redundant copy.
ZFS does both, but only makes full use of equal-size disks.

(don't drift to debate over the specifics, the general gist eli5 is that data corruption happens under our noses)

The only reason I noticed this was that netdata (https://github.com/netdata/netdata) keeps alerting about it until the stats are reset.

Edited June 15, 2025 by Samsonight

Solved by JorgeB
