JonathanM Posted August 24, 2017

6 hours ago, Maticks said:
"should this be raised as a bug to be looked at?"

Good luck. In the past, BTRFS problems have been dismissed as hardware faults; according to the devs it's never the file system that's the issue.
Maticks Posted August 24, 2017

It sounds very much like the Apple iPhone antenna issue: "you are holding it wrong".
JorgeB Posted August 24, 2017

I'm not saying btrfs doesn't have something to do with this issue, but I very much doubt it's only that. I've never had this problem, and I use btrfs on all my servers, for array disks and cache, both single devices and, at one point, an 8-device raid10 pool. I just now did a test with:

- mover running (25GB from cache to the array)
- a manual copy on the console using cp of another 25GB from cache to the array
- another 25GB copied over LAN from my desktop to the cache disk

Load average peaked at about 4 on a dual-core Pentium G620, and the webGUI was normally usable during all the operations.
aptalca Posted August 24, 2017 (thread author)

Try copying a 25GB file from a btrfs drive (or pool) to the same drive. That's when I had issues. Also during unrar and repair, where there are simultaneous read and write operations on the same disk.
JorgeB Posted August 24, 2017

22 minutes ago, aptalca said:
"Try copying a 25GB file from a btrfs drive (or pool) to the same drive. That's when I had issues. Also during unrar and repair, where there are simultaneous read and write operations on the same disk."

Will try that tomorrow.
JorgeB Posted August 25, 2017

Did another test; the pool is 3 x 128GB SSDs in raid0:

- cp of 25GB in ISOs from one folder on the pool to another
- a second simultaneous cp of 25GB in ISOs from one folder on the pool to another
- transfer of 25GB in ISOs from my desktop to the pool over LAN

Again the webGUI was always responsive, and load average topped out at about 3.
thomast_88 Posted August 25, 2017

I have a BTRFS raid1 pool (2 x 250GB Samsung EVO), and I've been having the same issues as @aptalca for months. The raid is unusable when copying/moving stuff. If anybody has an idea how to trace this down, let me know. I'm willing to invest my time to get this fixed.
JorgeB Posted August 25, 2017

5 minutes ago, thomast_88 said:
"I have a BTRFS raid1 pool (2 x 250GB Samsung EVO), and I've been having the same issues as @aptalca for months. The raid is unusable when copying/moving stuff."

Two things come to mind. First, make sure you're regularly trimming your pool. Second, it once happened to me that a btrfs filesystem became very slow at writing for no apparent reason; it was on a single NVMe device, and re-formatting it fixed the problem. You can see if that helps by following the replace-cache procedure, but formatting instead of replacing.
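For the regular trim part, a minimal sketch of a scheduled job, assuming the pool is mounted at /mnt/cache and the SSDs/controller actually support TRIM (run it once by hand first and check it reports space trimmed); it could run weekly via cron or a user script:

    #!/bin/bash
    # weekly TRIM of the btrfs cache pool; adjust the mount point if yours differs
    fstrim -v /mnt/cache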
Maticks Posted September 6, 2017

I'm thinking it might be this copy-on-write function in the shares section; it's on by default. Maybe that is what's breaking things; it does mention btrfs should be set to nocow. I'd have to change my pool back to btrfs to test again though...
JorgeB Posted September 6, 2017

43 minutes ago, Maticks said:
"btrfs should be set to nocow."

That's mostly for spinners storing highly fragmentable data, like VM images. SSDs can usually tolerate the high fragmentation, though they can still slow down with heavily modified VM images or large databases. Keep in mind that with nodatacow you also lose checksumming.
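If anyone wants to try nocow on just the directories holding VM images or databases, rather than changing the share setting, one way is the NOCOW file attribute; a rough sketch, assuming the images live under /mnt/cache/domains, and keeping in mind that the attribute only applies to files created after it is set (existing images need to be copied back in) and that those files lose btrfs checksumming:

    mkdir -p /mnt/cache/domains
    chattr +C /mnt/cache/domains    # new files created in this directory will be NOCOW
    lsattr -d /mnt/cache/domains    # the 'C' flag should now show for the directory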
JorgeB Posted September 13, 2017

I have a theory on why some users may be having this issue; if anyone wants to try it, please post whether there was an improvement.

Currently fstrim on btrfs only trims the unallocated space. This is apparently a bug, but it's been like this for some time. For users with a lot of slack on the filesystem, the unallocated area can be a very small part of the SSD, leaving all the unused-but-allocated space untrimmed, which can lead to very poor performance.

So first check for slack on the filesystem, i.e., the difference between the allocated and used space. On the main page click on the cache device and look at the "btrfs filesystem show" section, e.g.:

    Label: none  uuid: cea535d2-33f9-4cf2-9ff0-0b51826d48a1
        Total devices 1 FS bytes used 265.61GiB
        devid    1 size 476.94GiB used 427.03GiB path /dev/nvme0n1p1

In this case there's about 161GiB of slack: 476.94GiB is the total device size, 427.03GiB is allocated, but only 265.61GiB is in use. Since only unallocated space is trimmed, fstrim will only trim 49.9GiB (476.94 - 427.03), so most of the free space will remain untrimmed.

To fix this, run a full balance to reclaim all allocated but unused space. On the console type:

    btrfs balance start --full-balance /mnt/cache

This will take some time; in the end it should look like this:

    Label: none  uuid: cea535d2-33f9-4cf2-9ff0-0b51826d48a1
        Total devices 1 FS bytes used 265.68GiB
        devid    1 size 476.94GiB used 266.03GiB path /dev/nvme0n1p1

Now the slack is less than 1GiB, so fstrim will work on practically all unused space. Trim your pool:

    fstrim -v /mnt/cache

And check if performance improves.
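For those who prefer checking from the console instead of the GUI, something like this should show the same allocation numbers; a sketch assuming the pool is mounted at /mnt/cache:

    btrfs filesystem show /mnt/cache     # per device: size vs. allocated ("used")
    btrfs filesystem usage /mnt/cache    # overall: the gap between "Device allocated" and "Used" is the slack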
Maticks Posted September 13, 2017

I think you are onto something there. My second system still runs btrfs on cache; it's where downloads go temporarily for decompression before moving into the array. Unraid shows about 14GB used on the filesystem, but btrfs filesystem show reports:

    Label: none  uuid: 8e82d82d-f6f6-45d7-9e5a-d389fb0e0bb3
        Total devices 1 FS bytes used 17.36GiB
        devid    1 size 238.47GiB used 161.03GiB path /dev/sdl1

Running a balance now. This might explain why the wheels fall off when I fill 160GB of my 256GB SSD. I tend to find docker crashes happen when the cache is around the 160-165GB area and the system slows to a crawl; if I thrash the IO when the cache is at 160GB, that's when things go wrong.
Maticks Posted September 13, 2017

Just finished:

    Label: none  uuid: 8e82d82d-f6f6-45d7-9e5a-d389fb0e0bb3
        Total devices 1 FS bytes used 17.31GiB
        devid    1 size 238.47GiB used 18.03GiB path /dev/sdl1

    root@Vault:~# fstrim -v /mnt/cache
    /mnt/cache: 220.5 GiB (236698525696 bytes) trimmed

That is faster than my cache drive has ever run. I'll load it up with data and see if it slows down. So should we also cron a btrfs balance daily?
thomast_88 Posted September 13, 2017

@johnnie.black I will test this straight away when I get home. Thanks for putting your findings up!
JorgeB Posted September 13, 2017

8 minutes ago, Maticks said:
"So should we also cron a btrfs balance daily?"

Typical cache usage, i.e., constantly filling up and emptying the cache, exacerbates the large-slack issue. This is supposed to improve once we get to kernel 4.14, as there are some changes to deal with it, but until then it's a good idea to monitor the slack and/or do a periodic balance. Not only because of the trim issue, but also because in extreme cases you can run into another problem: btrfs reporting the device full when it isn't, because the space is fully allocated and no new chunks can be created.

If doing a periodic balance, a partial balance should be enough; it will recover most of the free allocated space while being much faster and causing much less wear on the SSD, e.g.:

    btrfs balance start -dusage=75 /mnt/cache

This will only re-allocate data chunks that are at most 75% used.
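To tie the balance and the trim together on a schedule, here is a rough sketch of a maintenance script (run weekly or monthly via cron or a user script); the mount point and the 75% threshold are assumptions to adjust for your setup:

    #!/bin/bash
    # periodic btrfs cache pool maintenance: reclaim mostly-empty chunks, then trim
    MOUNT=/mnt/cache

    # re-allocate data chunks that are at most 75% used, returning the freed space
    # to the unallocated pool where fstrim can reach it
    btrfs balance start -dusage=75 "$MOUNT"

    # trim the now-unallocated space on the SSD(s)
    fstrim -v "$MOUNT"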
Maticks Posted September 13, 2017

I might do this once a month. I have had those disk-full messages before, even when I had 80GB free on the cache.
thomast_88 Posted September 13, 2017

Before:

    Total devices 2 FS bytes used 186.99GiB
        devid    1 size 232.89GiB used 232.88GiB path /dev/sdc1
        devid    2 size 232.89GiB used 232.88GiB path /dev/sde1

After:

    Total devices 2 FS bytes used 189.18GiB
        devid    1 size 232.89GiB used 192.03GiB path /dev/sdc1
        devid    2 size 232.89GiB used 192.03GiB path /dev/sde1

Trim:

    fstrim -v /mnt/cache
    /mnt/cache: 81.7 GiB (87732568064 bytes) trimmed

Not sure what those numbers mean exactly, but so far it feels like a performance improvement; that is promising! I will try with some large files tomorrow :-)
JorgeB Posted September 13, 2017

7 minutes ago, thomast_88 said:
"Not sure what those numbers mean exactly, but so far it feels like a performance improvement."

You should; your filesystem was practically 100% allocated, 232.88GiB out of 232.89GiB, so only 0.01GiB was being trimmed.
Tuftuf Posted September 13, 2017

This has been an issue for me too, useful info!
thomast_88 Posted October 5, 2017

I'm still having issues. I copied an 11GB file from my primary array to the cache pool (raid1), and the server load went to 25-ish before it finished, making my dockers/VMs crash.

@johnnie.black I saw your tests earlier and noticed you are using raid0. Have you had any issues with raid1, or any idea why this is happening?

@aptalca did you get all the issues fixed, and are you running raid0 or raid1?
jonp Posted October 5, 2017

Just had to chime in and thank @johnnie.black for all his work on this topic. I am marking this thread for future review so we can see if there are further ways to use the knowledge in here to make things better for everyone.
aptalca Posted October 5, 2017 (thread author)

7 hours ago, thomast_88 said:
"@aptalca did you get all the issues fixed, and are you running raid0 or raid1?"

I switched to a single-disk xfs cache. No more issues.
binhex Posted October 6, 2017

20 hours ago, jonp said:
"I am marking this thread for future review so we can see if there are further ways to use the knowledge in here to make things better for everyone."

If this greatly improves performance in general for people using SSDs (whether as a single cache or a cache pool), then I'm assuming the above commands could be enabled/disabled as options in the webUI? Or, possibly better, detect whether the cache drive is an SSD with trim capability and, if so, enable partial balance and trim by default on a configurable schedule. Is that the sort of thing you're considering, @jonp?
jonp Posted October 6, 2017

binhex said:
"...detect whether the cache drive is an SSD with trim capability and, if so, enable partial balance and trim by default on a configurable schedule. Is that the sort of thing you're considering, @jonp?"

Possibly. Tom and I had a long conversation about proper trim support at one point. The real trick is when you have an SSD assigned to the array. Depending on the method of discard/trim the SSD supports, it could potentially violate the integrity of parity (changing values on the device without updating the corresponding blocks on the parity disk). I realize the context of this thread is the cache, but if we are implementing proper support for trim, we will want to address this at the same time.

Sent from my SM-G930P using Tapatalk
binhex Posted October 6, 2017

1 minute ago, jonp said:
"The real trick is when you have an SSD assigned to the array. Depending on the method of discard/trim the SSD supports, it could potentially violate the integrity of parity."

Ah, OK, I wasn't thinking about this in relation to array disks; I see your point. Maybe, though, a first step addressing the issue with btrfs and cache drives would be welcome; go for the trendy "agile" approach rather than waterfall.