JorgeB Posted December 17, 2016

As it is now, users get no warning of a problem with the cache pool; IMO this contributes to BTRFS's bad reputation. For example, if a pool disk is dropped, the only clue when looking at the main page is the missing temperature display:

http://s30.postimg.org/nneloztoh/devices_missing.png

This can go unnoticed for some time.

http://s24.postimg.org/x5v2oyol1/devices_missing_2.png

Suggestion: monitor btrfs filesystem show; if it shows "*** Some devices missing", change the cache icon to yellow and send a system notification.

Another issue: if there are sporadic problems with one of the devices, it may not be dropped, but it can leave the pool with dual data profiles, raid1 and single, e.g., from a recent forum post:

    Data, RAID1: total=393.49GiB, used=240.37GiB
    Data, single: total=534.01GiB, used=412.19GiB
    System, single: total=4.00MiB, used=160.00KiB
    Metadata, RAID1: total=2.00GiB, used=1.33GiB
    Metadata, single: total=2.01GiB, used=448.48MiB
    GlobalReserve, single: total=512.00MiB, used=0.00B

Suggestion: monitor btrfs filesystem df; if there is more than one data profile, change the cache icon to yellow and send a system notification.
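For illustration, a minimal sketch of how both checks could be scripted outside the webGui (everything here is hypothetical: the /mnt/cache mount point, the notify helper path, and the message texts are assumptions, not the actual unRAID implementation):

    #!/bin/bash
    # Hypothetical cache pool health check, run e.g. from cron.
    MOUNT=/mnt/cache
    NOTIFY=/usr/local/emhttp/webGui/scripts/notify   # adjust to the local notification command

    # 1) A dropped pool member shows up as "missing" in btrfs filesystem show
    if btrfs filesystem show "$MOUNT" | grep -iq missing; then
        "$NOTIFY" -i warning -s "Cache pool" -d "Cache pool has missing device(s)"
    fi

    # 2) A pool left with mixed profiles shows more than one Data line in btrfs filesystem df
    if [ "$(btrfs filesystem df "$MOUNT" | grep -c '^Data,')" -gt 1 ]; then
        "$NOTIFY" -i warning -s "Cache pool" -d "Cache pool has multiple data profiles"
    fi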
bonienl Posted December 17, 2016

Good suggestions. I have added notifications for both situations you have described.
JorgeB Posted December 17, 2016

I have added notifications for both situations you have described.

Cool. Another thing I forgot to ask: when there's a pool, could you add to the diagnostics the output of:

    btrfs fi show /mnt/cache

and

    btrfs fi df /mnt/cache
bonienl Posted December 17, 2016

Cool, another thing I forgot to ask, when there's a pool could you add to the diagnostics the output of: btrfs fi show /mnt/cache and btrfs fi df /mnt/cache

Done...
JorgeB Posted January 2, 2017

See reply below.

Done...

I have one more request on the same subject. I have no clue if this is easy to add, but it doesn't hurt to ask.

All errors on a btrfs device appear in the syslog, and if it's a single device, be it a single cache or an array device, they should be noticed by the user as an IO error. For a cache pool, though, they can go unnoticed for some time, until it's too late or more difficult to recover. AFAIK they are always logged like this:

    Jan 2 00:03:22 Tower kernel: BTRFS error (device sde1): bdev /dev/sdd1 errs: wr 0, rd 4, flush 0, corrupt 80, gen 0

The only thing that changes is the type of error, wr=write, rd=read, corrupt=checksum, etc. So would it be possible to monitor the log and generate a notification if any btrfs error is detected?

Another possible way would be to monitor btrfs device stats, mainly for cache pools, e.g.:

    btrfs device stats /mnt/cache
    [/dev/sdj1].write_io_errs 0
    [/dev/sdj1].read_io_errs 5
    [/dev/sdj1].flush_io_errs 0
    [/dev/sdj1].corruption_errs 85
    [/dev/sdj1].generation_errs 0
    [/dev/sdk1].write_io_errs 0
    [/dev/sdk1].read_io_errs 0
    [/dev/sdk1].flush_io_errs 0
    [/dev/sdk1].corruption_errs 0
    [/dev/sdk1].generation_errs 0

Note, however, that these statistics, unless reset by the user, are persistent and survive a reboot, so they would have to be monitored for changes.
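As an illustration of the second approach, a minimal sketch that watches btrfs device stats for changes against a saved baseline (the baseline file location and the plain echo in place of a real notification are assumptions):

    #!/bin/bash
    # Hypothetical check: compare current btrfs device stats against the previous snapshot,
    # since the counters are persistent and only a change indicates new errors.
    MOUNT=/mnt/cache
    BASELINE=/boot/config/btrfs-stats.last   # assumed location for the previous snapshot

    current=$(btrfs device stats "$MOUNT")
    if [ -f "$BASELINE" ] && ! diff -q "$BASELINE" <(echo "$current") >/dev/null; then
        echo "btrfs device stats changed on $MOUNT - check the syslog for new errors"
    fi
    echo "$current" > "$BASELINE"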
JorgeB Posted January 4, 2017

Scratch my previous request, I think this would be a better way of doing it. As you know, the current error column on the cache device/pool is for show only, as no errors are ever reported there (maybe it's tied to the MD driver?), so why not use the btrfs errors reported in the syslog? This would work for every cache pool and single-device btrfs cache. Pools seem the most vulnerable, because errors can go unnoticed for some time when redundancy is used; by the time errors are detected by the user, i.e., when the built-in redundancy can't fix them anymore, it's usually too late or much more difficult to fix them.

Option 1: monitor the syslog for btrfs errors, any type of error (read, write, checksum, etc.), and use the error counter to show them.

Option 2: if option 1 is difficult to implement, this should be much easier: use the error counter to show the total errors stored for each device using btrfs device stats /mnt/cache. In this case, since that data is persistent, you'd need to add a single button to clear the errors using the -z option.

For either of these options it should be easy for bonienl to add a notification when any errors show up, similar to the current array notification.
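As a rough sketch of what option 2 amounts to (illustrative commands only; how the webGui would wire them into the error column is a separate matter):

    # Flag any device in the pool reporting a non-zero error counter
    btrfs device stats /mnt/cache | awk '$2 != 0 {print "errors on " $1}'

    # What a "clear errors" button would run: reset the persistent counters to zero
    btrfs device stats -z /mnt/cache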
JorgeB Posted February 13, 2017

On 17/12/2016 at 7:35 PM, bonienl said:
Done...

I hope that my suggestion above will be used in the future, but for now I'd like to ask bonienl, when possible, to please also add to the diagnostics the output of:

    btrfs device stats /mnt/cache

It will look similar to this, and is very useful for detecting btrfs device issues, including on a single-device cache:

    root@Tower2:~# btrfs dev stats /mnt/cache
    [/dev/sdh1].write_io_errs 181614787
    [/dev/sdh1].read_io_errs 147213104
    [/dev/sdh1].flush_io_errs 3842528
    [/dev/sdh1].corruption_errs 349010
    [/dev/sdh1].generation_errs 1493
    [/dev/sdg1].write_io_errs 0
    [/dev/sdg1].read_io_errs 0
    [/dev/sdg1].flush_io_errs 0
    [/dev/sdg1].corruption_errs 0
    [/dev/sdg1].generation_errs 0
t3 Posted February 13, 2017

OH YES, I recently had exactly the same problem. I only discovered that one of the SSD cache mirror disks had been offline for almost two months when, after a power outage (where the UPS power-down didn't work, btw), the system did some sort of repair on the mirror... which then corrupted all VM disk images on the cache disk, and the docker image as well.

Read on here; other user, ~same story: https://lime-technology.com/forum/index.php?topic=52601.msg505506#msg505506

Speaking of BTRFS's bad reputation: it seems that a btrfs mirror is no safe place for docker/VM disk images! As far as I can tell, the mirror only kept intact files that were written before or after one of the disks dropped out. Any file that was changed in place - as disk images usually are - was corrupt after the mirror was reactivated.

Even with 1+ backups of everything, having this situation go unnoticed for such a long time is a bit on the edge. Btw, this means unRAID is already so stable and mature that I don't feel any need to check the system every other day... so, yeah, a notification with a number of big red exclamation marks would be very helpful!
jonp Posted February 13, 2017

We do have plans to thoroughly address this as, yes, it is indeed a problem. We want to add cache pool devices to the same management structure as the unRAID array, so that error messaging / reporting can be uniform between both.
t3 Posted February 13, 2017

We do have plans to thoroughly address this as yes, it is indeed a problem. We want to add cache pool devices to the same management structure as the unRAID array, so that error messaging / reporting can be uniform between both.

Great! Btw, what do you think of the btrfs issue? Is this something to expect in such a case? I must admit I didn't expect it...
LordShad0w Posted March 4, 2017

I had a cache drive from my pool (of 2) fail with no warning twice now as well. Now I just use a single one, since for some reason, even using a pair of identical drives (brand, capacity, same lot), one would fail after a short time and wreck it all, forcing me to redo everything from scratch. Glad to hear this will be addressed.
JorgeB Posted March 25, 2017

@bonienl could you please remove from the diagnostics the output of both

    btrfs fi show /mnt/cache

and

    btrfs fi df /mnt/cache

and include instead:

    btrfs fi usage /mnt/cache

This newer command is intended to replace both. Also, in case you didn't see it above, since there has been a release since, please also add to the diags the output of:

    btrfs device stats /mnt/cache

Thanks.
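In script terms, the requested diagnostics change amounts to something like this (a sketch only; the output directory and file names are assumptions, and the real unRAID diagnostics collector is more involved):

    # Capture btrfs pool information for the diagnostics zip
    OUT=/tmp/diagnostics/system
    mkdir -p "$OUT"
    if mountpoint -q /mnt/cache; then
        btrfs fi usage /mnt/cache     > "$OUT/btrfs-usage.txt" 2>&1
        btrfs device stats /mnt/cache > "$OUT/btrfs-stats.txt" 2>&1
    fi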
bonienl Posted March 27, 2017

On 3/25/2017 at 3:02 PM, johnnie.black said:
@bonienl could you please remove from the diagnostics the output of both btrfs fi show /mnt/cache and btrfs fi df /mnt/cache and include instead: btrfs fi show /mnt/cache This newer command is intended to replace both, also in case you didn't see above, since there was a release since, please also add to the diags the output of: btrfs device stats /mnt/cache Thanks.

I have placed it on my todo list for an upcoming version.
RobJ Posted March 27, 2017

1 hour ago, bonienl said:
I have placed it on my todo list for an upcoming version.

Ooooo... can I ask for things too? None are vital.

The lsmod list is randomly ordered, so when I compare 2 diagnostics they look very different, yet are usually identical. Would it be possible to change lsmod to lsmod | sort?

Another non-vital but useful snapshot: add 2 top listings, one at the beginning of diagnostic data collection, call it top1.txt, and the other at the end of the collection, call it top2.txt. It would give us 2 quick snapshots of what's running and how much memory each process is using. Not needed much of the time, but really useful in certain situations.
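In the same spirit, the two asks would boil down to something like this inside the collection script (file names and the output directory are illustrative):

    OUT=/tmp/diagnostics/system
    top -b -n 1 > "$OUT/top1.txt"     # process/memory snapshot at the start of collection
    lsmod | sort > "$OUT/lsmod.txt"   # sorted so two diagnostics can be compared directly
    # ... rest of the data collection ...
    top -b -n 1 > "$OUT/top2.txt"     # second snapshot at the end of collection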
RobJ Posted March 27, 2017

On 3/25/2017 at 10:02 AM, johnnie.black said:
@bonienl could you please remove from the diagnostics the output of both btrfs fi show /mnt/cache and btrfs fi df /mnt/cache and include instead: btrfs fi show /mnt/cache This newer command is intended to replace both

I'm probably missing something obvious, so feel free to snicker, but it looks like you're asking to remove command 1 and command 2 and replace them with command 3, *but* command 1 looks identical to command 3?
JorgeB Posted March 27, 2017

20 minutes ago, RobJ said:
I'm probably missing something obvious, so feel free to snicker, but it looks like you're asking to remove command 1 and command 2, and replace them with command 3, *but* command 1 looks identical to command 3?

D'oh! Corrected, thanks for pointing it out.
bonienl Posted March 28, 2017

7 hours ago, RobJ said:
Ooooo... can I ask for things too? None are vital. The lsmod list is randomly ordered, so that when I compare 2 diagnostics, they look very different, yet are usually identical. Would it be possible to change lsmod to lsmod | sort? Another non-vital but useful snapshot, add 2 top listings, one at the beginning of diagnostic data collection, call it top1.txt, and the other at the end of the collection, call it top2.txt.

Sure, no problem to ask. And as usual, put on my todo list.