FatherChon Posted August 22, 2018

So I've got a fresh Unraid 6.5.3 build on the trial. I've got two Intel DC S3500 300GB SSDs for cache in btrfs RAID1 that fill up to 100% with no explanation. All of my shares are set to cache=no, and the drives are LUKS-encrypted with btrfs. I've got 2 Ubuntu VMs running and about 12 Docker containers.

When I run a du to show what is actually on the system, only 53G is used:

root@unraid:/mnt/cache# du -sh /mnt/cache
53G     /mnt/cache
root@unraid:/mnt/cache# du -sh /mnt/cache/*
2.8G    /mnt/cache/appdata
19G     /mnt/cache/domains
31G     /mnt/cache/system

When I run a df, it shows 100%:

root@unraid:/mnt/cache# df -h
Filesystem        Size  Used Avail Use% Mounted on
rootfs             63G  758M   63G   2% /
tmpfs              32M  596K   32M   2% /run
devtmpfs           63G     0   63G   0% /dev
tmpfs              63G     0   63G   0% /dev/shm
cgroup_root       8.0M     0  8.0M   0% /sys/fs/cgroup
tmpfs             128M  1.5M  127M   2% /var/log
/dev/sda1          15G  215M   15G   2% /boot
/dev/loop0        7.5M  7.5M     0 100% /lib/modules
/dev/loop1        4.5M  4.5M     0 100% /lib/firmware
/dev/mapper/md1    11T  9.6T  1.5T  88% /mnt/disk1
/dev/mapper/md2    11T  9.6T  1.4T  88% /mnt/disk2
/dev/mapper/md3   7.3T  4.6T  2.8T  63% /mnt/disk3
/dev/mapper/md4   7.3T  4.6T  2.8T  63% /mnt/disk4
/dev/mapper/sdc1  280G  278G  128K 100% /mnt/cache
shfs               37T   29T  8.2T  78% /mnt/user0
shfs               37T   29T  8.2T  78% /mnt/user

I tried running a btrfs scrub and looked in each directory for any massive files, but the amount of space there doesn't match up. Uptime is ~2 days. I've manually run the mover and it doesn't make a difference. Any ideas?

Build:
Intel S2600GZ mobo
128GB DDR3 ECC RAM
Intel E5-2670
LSI 2208 in IT mode to connect all drives
Data drives: 4x 12TB Seagate Ironwolf Pro, 2x 8TB WD Reds, 2x 5TB Toshiba SAS Enterprise, 1x Crucial MX500 1TB for VMs
Cache: 2x Intel DC S3500 300GB SSDs
Stan464 Posted August 22, 2018 (edited)

Hey! I'm using btrfs like yourself, but running the drive as Unassigned, and I'm having the same exact issue as you: the drive is losing space gradually. My use case is slightly different, but it results in a full drive regardless. I only have one 80GB file, yet the drive fills to the brim; no other files are present and nothing else has changed. Maybe we have both fallen upon the same or a similar issue.

Edited August 22, 2018 by Stan464
John_M Posted August 22, 2018

Does running a btrfs balance on your cache pool make any difference?
FatherChon (Author) Posted August 22, 2018

16 minutes ago, John_M said:
    Does running a btrfs balance on your cache pool make any difference?

I had tried that with no luck. Rebooting the host dropped the cache back down to the expected size. I'm gonna start graphing this on a separate monitoring system to see if it fills gradually or just suddenly jumps to 100%.
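Since the plan above is to graph df-reported vs du-reported usage over time, here is a minimal sketch of a sampler that could be run from cron. The mount point and log path are illustrative placeholders, not from this thread, and `df --output` assumes GNU coreutils (which Unraid/Slackware ships):

```shell
#!/bin/sh
# Append a timestamped sample of df-reported vs du-reported bytes for a
# mount point, so the two numbers can be graphed and compared later.
sample_usage() {
    mp=$1
    log=$2
    df_used=$(df --output=used -B1 "$mp" | tail -n 1 | tr -d ' ')
    du_used=$(du -s -B1 "$mp" | cut -f1)
    printf '%s df=%s du=%s\n' "$(date +%FT%T)" "$df_used" "$du_used" >> "$log"
}

# Example: take one sample of a directory (on a real box: /mnt/cache,
# scheduled every minute from cron). Note du can be slow on large trees.
sample_usage /tmp /tmp/cache-usage.log
```

Whenever the `df=` figure runs away from the `du=` figure in the log, something is holding space that the directory tree no longer accounts for.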
John_M Posted August 22, 2018

Does the free space get used up as the VMs write to their images? I suggested in Stan's thread that it might be something to do with CoW. That's the limit of my knowledge in that area, I'm afraid.
FatherChon (Author) Posted August 22, 2018

4 hours ago, John_M said:
    Does the free space get used up as the VMs write to their images? I suggested in Stan's thread that it might be something to do with CoW. That's the limit of my knowledge in that area, I'm afraid.

My VMs' disks are running on the 1TB SSD that is in the pool, not on the cache drive. I'm not sure if KVM creates additional files on the cache drive; the VMs are using raw images as well. All of my Docker containers are running on the cache, though. I do have CoW enabled for the shares. Right now there is a 2G difference between what df and du show; I'm keeping an eye on it, so we'll see if this changes.

I've got 9 days left on my trial and, other than this issue, I'm impressed. But when it filled up the cache drive it caused most of my Docker containers to crash. I'll see if I can get an extension, but if I can't solve this then I'll unfortunately be moving back to FreeNAS. So far loving Unraid, though, other than this.
BillClinton Posted August 23, 2018

Are you running TRIM on your SSDs? I noticed you said all drives are connected to your LSI controller. There are a lot of posts and comments saying LSI controllers don't support TRIM. Might be something to check. Try moving the SSDs to your mobo and schedule the TRIM in your webUI.
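One way to check whether a device actually advertises TRIM/discard through its controller is lsblk's DISC-GRAN column: a granularity of 0B means no discard support. The filter below is plain awk over that two-column output, so it can be shown against canned sample text; on a live box you would feed it `lsblk -dn -o NAME,DISC-GRAN` (the device names here are made up for illustration):

```shell
# Print the names of devices whose discard granularity is non-zero,
# i.e. devices that advertise TRIM support to the kernel.
trim_capable() {
    awk '$2 != "0B" { print $1 }'
}

# Sample lsblk-style output resembling a box where only sdb supports
# discard (e.g. an SSD behind a controller that drops TRIM vs one that
# passes it through):
printf 'sda    0B\nsdb  512B\n' | trim_capable   # → sdb
```

If the cache SSDs show 0B while on one controller and a non-zero value on the onboard SATA ports, that would support the "controller eats TRIM" theory.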
FatherChon (Author) Posted August 23, 2018

46 minutes ago, BillClinton said:
    Are you running TRIM on your SSDs? I noticed you said all drives are connected to your LSI controller. There are a lot of posts and comments saying LSI controllers don't support TRIM. Might be something to check. Try moving the SSDs to your mobo and schedule the TRIM in your webUI.

The TRIM plugin is enabled, and the cache SSDs are actually connected to the onboard SATA, which is an Intel controller; only the data drives are on the LSI. I ran out of slots on the 2208, so the cache drives are in the chassis. Now that I've found some of the btrfs options, I may give a defrag and then a sync a shot if I see the same behaviour. So far I have not seen it in the data pools; VM storage has been fine and all the other drives match up nicely.

root@unraid:~# btrfs filesystem df /mnt/cache
Data, RAID1: total=69.00GiB, used=63.81GiB
System, RAID1: total=32.00MiB, used=16.00KiB
Metadata, RAID1: total=2.00GiB, used=65.20MiB
GlobalReserve, single: total=22.61MiB, used=0.00B

root@unraid:~# df -h /mnt/cache
Filesystem        Size  Used Avail Use% Mounted on
/dev/mapper/sdc1  280G   64G  214G  24% /mnt/cache

root@unraid:~# du -sh /mnt/cache
62G     /mnt/cache

root@unraid:~# btrfs filesystem du -s /mnt/cache
     Total   Exclusive  Set shared  Filename
  61.76GiB    61.76GiB    16.00KiB  /mnt/cache

root@unraid:~# btrfs filesystem usage /mnt/cache
Overall:
    Device size:                 558.92GiB
    Device allocated:            142.06GiB
    Device unallocated:          416.86GiB
    Device missing:                  0.00B
    Used:                        127.75GiB
    Free (estimated):            213.62GiB  (min: 213.62GiB)
    Data ratio:                       2.00
    Metadata ratio:                   2.00
    Global reserve:               22.62MiB  (used: 0.00B)

Data,RAID1: Size:69.00GiB, Used:63.81GiB
   /dev/mapper/sdb1   69.00GiB
   /dev/mapper/sdc1   69.00GiB

Metadata,RAID1: Size:2.00GiB, Used:65.23MiB
   /dev/mapper/sdb1    2.00GiB
   /dev/mapper/sdc1    2.00GiB

System,RAID1: Size:32.00MiB, Used:16.00KiB
   /dev/mapper/sdb1   32.00MiB
   /dev/mapper/sdc1   32.00MiB

Unallocated:
   /dev/mapper/sdb1  208.43GiB
   /dev/mapper/sdc1  208.43GiB
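As a sanity check, the `Free (estimated)` line in that usage output can be reproduced by hand. With Data ratio 2.00 (RAID1, every extent mirrored), free space is roughly the per-device unallocated space plus the slack inside the already-allocated data chunks:

```shell
# Reproduce btrfs's "Free (estimated): 213.62GiB" for this RAID1 pool:
# per-device unallocated (208.43GiB) plus unused room in the data chunks
# (69.00GiB allocated minus 63.81GiB used). RAID1 stores two copies, which
# is why the per-device (not total) unallocated figure is the right one.
awk 'BEGIN { printf "%.2f GiB\n", 208.43 + (69.00 - 63.81) }'   # → 213.62 GiB
```

So at this point the pool itself is consistent; the numbers only stop adding up when the mystery growth happens.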
FatherChon (Author) Posted August 23, 2018

So it happened again last night while I was sleeping. It was a gradual increase, around 500MB/minute. The root cause was a Kubernetes job that was writing to disk; du still showed 62G while df showed 100% full.

I did an lsof and looked through all of the files, and found that the vdisk1.img I created for a Kubernetes VM (which I set to use the 12TB drive, so it's supposed to live on disk1) was also on cache. This was a disk for some machine learning datasets that I have. It looks like this vdisk was also living on /mnt/cache behind the scenes. Yet if I do an ls /mnt/cache/domains/kubernetes01/vdisk1.img, no file is found.

shfs 17140       root 150r REG 0,33 2199023255552 1346821 /mnt/disk1/domains/kubernetes01/vdisk1.img
shfs 17140       root 152w REG 0,45  131730452480  829434 /mnt/cache/domains/kubernetes01/vdisk1.img
shfs 17140 17141 shfs root 150r REG 0,33 2199023255552 1346821 /mnt/disk1/domains/kubernetes01/vdisk1.img
shfs 17140 17141 shfs root 152w REG 0,45  131730452480  829434 /mnt/cache/domains/kubernetes01/vdisk1.img

I'm assuming that shfs was caching writes (reads too?) for this img, but this seems to be the only file that was getting cached. The rest of the .img files for my VMs are not on cache and don't show up like this. This was also the first VM I created, and I had to manually change the domains share to cache=no and set the included disks to disk1/disk2 afterwards.

Is there a caching mechanism for VMs that are running on other disks? I'm not seeing any options in the VM settings to enable or disable this.
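The "lsof sees it, ls doesn't, and a reboot frees the space" combination is the classic signature of a file that was unlinked while a process still held it open: du can't see it, but the blocks stay allocated until the last descriptor closes. This is generic POSIX behaviour, not specific to btrfs or shfs, but it matches the symptoms here. A small self-contained demo of the effect:

```shell
# Demonstrate space held by a deleted-but-open file.
dir=$(mktemp -d)
dd if=/dev/zero of="$dir/big" bs=1M count=10 status=none

exec 3<"$dir/big"   # keep a descriptor open, like a running VM would
rm "$dir/big"       # unlink: du/ls no longer see the file...

ls "$dir"           # prints nothing
wc -c <&3           # ...but the open fd can still read all 10 MiB

exec 3<&-           # closing the last fd is what actually frees the blocks
rm -rf "$dir"
```

On a live box, `lsof +L1` lists exactly these open-but-unlinked files, which is a quick way to confirm this theory the next time df and du diverge.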
John_M Posted August 23, 2018

The default setting for the domains user share is cache-prefer, which means that if you have a cache disk or pool and there's sufficient space on it, the .img file will be created there; otherwise it will be created on the array. The logic behind this is that most users who have a cache will want to take advantage of its speed, while those who don't can still successfully create the virtual disk image without having to change the configuration. Since you created your VM with this default, your .img was created on the cache. You then changed the domains share to cache-no and left the .img file stranded on the cache.

What you should have done is this: change the domains share to cache-yes and then run the mover; then change the domains share to cache-no. The above needs explanation because it's possibly counter-intuitive. The four cache options (yes, no, prefer and only) are explained better than I could do it in this FAQ entry. Files stay on the disks where they are first written unless they are moved, either by a human or by the mover. The mover is scheduled by cron and can also be run manually from the Main page of the GUI. The mover obeys the rules set out by the cache options for each user share.

Anyway, you currently have two versions of the .img file: one on array disk1 and one on the cache. You need to decide which is the active one and which is stale, and delete the stale one. To be absolutely sure which is which, I would check the location of the file in the XML of the VM and compare the timestamps. If the version you want to keep is on disk1, you're good. If the version you want to keep is on the cache, you'll want to move it to the array, so use the procedure in paragraph 2. The mover ignores files that are open, so you'll need to stop the VM first.
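The "compare the timestamps" step can be scripted if you don't want to eyeball `ls -l` output. A tiny helper, shown against temp files so it's self-contained (on the real box you'd pass the two vdisk1.img paths from the lsof output above):

```shell
# Print the most recently modified of the given paths. On the server:
#   newest /mnt/disk1/domains/kubernetes01/vdisk1.img \
#          /mnt/cache/domains/kubernetes01/vdisk1.img
newest() {
    ls -t -- "$@" | head -n 1
}

# Self-contained example with two temp files and forced mtimes:
d=$(mktemp -d)
touch -d '2018-08-22 00:00' "$d/old"
touch -d '2018-08-23 00:00' "$d/new"
newest "$d/old" "$d/new"   # prints the path ending in /new
rm -rf "$d"
```

Note that mtime only tells you which copy was written to last; checking the path in the VM's XML, as suggested above, is still the authoritative answer.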
JonathanM Posted August 23, 2018

2 hours ago, FatherChon said:
    shfs 17140 root 150r REG 0,33 2199023255552 1346821 /mnt/disk1/domains/kubernetes01/vdisk1.img
    shfs 17140 root 152w REG 0,45 131730452480 829434 /mnt/cache/domains/kubernetes01/vdisk1.img

Just to possibly help clear things up (or not): this situation results in /mnt/user/domains/kubernetes01/vdisk1.img pointing to 2 possible files. If you use the disk path (disk1 or cache) then you are directly addressing one of the two files. If you address the /mnt/user/domains user share path, then the system makes a choice between the two files and ignores the second file. Based on the behaviour you are seeing, I'd guess you have /mnt/user specified in the XML, and it's ignoring the file on disk1 and actively using the one on the cache disk.

/mnt/disk1/domains/kubernetes01/vdisk1.img
/mnt/cache/domains/kubernetes01/vdisk1.img

Both are valid locations for /mnt/user/domains/kubernetes01/vdisk1.img, so Unraid chose to use the one on cache.
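Since /mnt/user is a union of /mnt/cache and the /mnt/disk* mounts, this kind of shadowing can be checked for the whole tree, not just one file. A generic sketch (the function just diffs relative paths; on an Unraid box you'd pass /mnt/cache /mnt/disk1 /mnt/disk2 and so on as the roots):

```shell
# List relative paths that exist on more than one member of a merged pool.
# Any path printed here has a shadowed duplicate somewhere under /mnt/user.
find_shadowed() {
    for root in "$@"; do
        (cd "$root" && find . -type f)
    done | sort | uniq -d
}

# Self-contained example: the same relative file on two fake "members"
p=$(mktemp -d)
mkdir -p "$p/cache/domains" "$p/disk1/domains"
: > "$p/cache/domains/vdisk1.img"
: > "$p/disk1/domains/vdisk1.img"
: > "$p/disk1/domains/only-here.img"
find_shadowed "$p/cache" "$p/disk1"   # → ./domains/vdisk1.img
rm -rf "$p"
```

Running something like this after a mover mishap would have surfaced the duplicate vdisk1.img immediately, without needing lsof.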
FatherChon (Author) Posted August 23, 2018 (edited)

Thanks guys, that makes sense. I'm still a bit confused as to why it showed up in lsof but didn't exist when I actually ls'd the file, but I'll chalk that up to weirdness, since that was the first VM I created and I changed the cache type after it was created. I've nuked the VM anyway because its data is pulled from another server; I should've checked it for corruption, but didn't think of that until now.

I had the mover scheduled too, so maybe that helped cause the issue? The file was moved at the filesystem level, but the system was still writing to where that file was pointing, or something.

So, what happened, in case anyone else runs into this issue:

- fresh install of Unraid
- added all disks and built the array; set up btrfs, encryption, mover schedule
- created the first VM with the default "domains" share at cache-prefer
- noticed it was on cache and would exceed the cache drive's size
- changed domains to cache-no and changed the disks that domains lives on
- changed the vdisk to a different /mnt/disk# in the VM config
- started up the VM; vdisk1 wrote to cache and filled up the filesystem
- (I'm not 100% sure how it happened from here; the events after this point are assumptions)
- the mover probably ran and started moving vdisk1.img
- the file pointer moved off cache to disk1, but all the space was still utilized on cache?
- the VM was still running, writing to the cache disk without a file
- a reboot cleared up the space?
- the VM started back up after the reboot, writes happened during the evening, and the same thing happened

Still gonna give it a bit more time before I purchase, to make sure all of my VMs work properly, because I need Kubernetes in my lab. I'd like to make sure that this is fully solved before I plunk down some cash.

Edited August 23, 2018 by FatherChon