Hi! I'm gonna briefly describe my setup and then my findings.
I have Unraid installed on my desktop PC. It has 64GB of RAM and two 3070s, mostly for a "2 gamers, 1 PC" setup (I share it with my wife).
We have 2 Windows VMs and 2 Ubuntu VMs, one of each per person. Each person's pair of VMs (1 Windows, 1 Ubuntu) shares the same passed-through GPU, so the two can't be on at the same time. We switch between VMs depending on whether we want to game or work.
For this, I created a script that takes two parameters, the "from" and "to" VMs. It checks whether the "from" VM is currently on; if so, it shuts it down, moves the USB passthrough config over to the "to" VM, and then starts it.
For example, when I'm done working (Ubuntu VM on) and want to game (Windows VM), I call the script: it turns off the Ubuntu VM, reassigns the USB devices to the Windows VM, and then starts it.
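For anyone curious, here's a minimal sketch of what such a script can look like using virsh. To be clear, the VM names and the hostdev XML paths are placeholders for this example, not my actual config:

```bash
#!/bin/bash
# switch_vm.sh <from-VM> <to-VM> -- illustrative sketch only; the VM
# names and XML paths below are made up for the example.
FROM="$1"   # e.g. Ubuntu-Work
TO="$2"     # e.g. Windows-Gaming

# If the "from" VM is running, ask it to shut down and wait until it's off.
if virsh domstate "$FROM" | grep -q running; then
  virsh shutdown "$FROM"
  until virsh domstate "$FROM" | grep -q "shut off"; do
    sleep 2
  done
fi

# Hand the shared USB devices (keyboard, mouse) over to the "to" VM.
# Each XML file holds a <hostdev> block with the device's vendor/product IDs.
virsh attach-device "$TO" /boot/config/usb-keyboard.xml --config
virsh attach-device "$TO" /boot/config/usb-mouse.xml --config

virsh start "$TO"
```

(In reality you'd also want to detach the devices from the "from" VM first with virsh detach-device, and handle shutdown timeouts, but the above is the gist of it.)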
Said script is actually executed by my Home Assistant instance (running on another server), which connects through SSH and runs it. This way I can ask Alexa to turn on the Windows VM and all the magic happens behind the scenes without me typing any commands.
Home Assistant uses another set of commands, also over SSH, to poll each VM's state just to show it in the UI. That command runs every 10 seconds for each VM.
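Roughly, the two kinds of commands HA fires over SSH look like this (the hostname, script path and VM names are made up for the example):

```bash
# Switching: what an Alexa request ends up triggering.
ssh root@tower '/boot/scripts/switch_vm.sh Ubuntu-Work Windows-Gaming'

# Polling: run every 10 s for each of the 4 VMs.
# virsh prints "running", "shut off", etc.
ssh root@tower 'virsh domstate Windows-Gaming'
```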
When I was on 6.11.5 everything was working just fine. But when I upgraded to 6.12.4, after a couple of hours I couldn't turn on the VMs anymore. I received the following error:
Failed to create v2 cgroup '/sys/fs/cgroup/machine/qemu-5-<VM-NAME>.libvirt-qemu/': No space left on device
The only way to fix this is to reboot the server. I ignored the problem for a couple of weeks, and after I got tired of it I updated to 6.12.10, hoping the problem would go away. It didn't.
After digging a little more, I found out that this "no space left on device" was actually a cgroup limit problem: I was hitting the cap on the number of cgroups (around 65k). The "no space left on device" error shows up because the stale cgroups are never deleted, so once you reach the limit, creating new ones starts to fail.
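If you want to check how close your own box is to that cap, counting the directories under the unified hierarchy is a quick way to do it (assuming cgroup v2 mounted at /sys/fs/cgroup, which is the case on 6.12):

```bash
# On cgroup v2 every cgroup is a directory under /sys/fs/cgroup;
# compare this number against the ~65k ceiling.
find /sys/fs/cgroup -type d | wc -l
```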
I embarked on a quest to find out why I'm apparently the only person on the planet with this problem, and my first suspects were my script and the Home Assistant interaction. So I disabled all my HA scripts and commands and checked how many cgroups were being created with them off: none. I re-enabled the scripts, and each time HA connected through SSH the cgroup count increased by 4 (go figure). Turns out each SSH session that is opened creates a cgroup, and when the session is closed, the cgroup is not deleted.
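An easy way to watch this live (just a sketch, same counting trick as above) is to leave this running while HA connects in the background:

```bash
# Refresh the cgroup count every second; on an affected host it only
# ever goes up as SSH sessions come and go.
watch -n 1 'find /sys/fs/cgroup -type d | wc -l'
```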
After a couple of minutes of googling, I found an old thread on Stack Overflow that described the exact same problem I was having.
To confirm this was my problem, while repeatedly running the following command:
```
root@Unraid:~# cat /proc/cgroups
#subsys_name hierarchy num_cgroups enabled
cpuset 0 26 1
cpu 0 26 1
cpuacct 0 26 1
blkio 0 26 1
memory 0 26 1
devices 0 26 1
freezer 0 26 1
net_cls 0 26 1
perf_event 0 26 1
net_prio 0 26 1
hugetlb 0 26 1
pids 0 26 1
```
I connected and disconnected via SSH several times, confirming that the num_cgroups count increased with every connection and never decreased on disconnect (note the hierarchy column stays at 0 on v2; the number that grows is num_cgroups). The combination of this issue and my HA polling, which opens 4 SSH connections every 10 seconds, drives this count to the limit in a matter of hours.
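The whole check can be scripted too. Here's a sketch of what I did by hand (run on the Unraid box itself; SSHing to localhost with keys already set up is an assumption for the example):

```bash
# Count cgroups, open and close a handful of SSH sessions, count again.
before=$(find /sys/fs/cgroup -type d | wc -l)
for i in $(seq 1 5); do
  ssh root@localhost true   # log in, run nothing, log out
done
after=$(find /sys/fs/cgroup -type d | wc -l)
echo "before=$before after=$after"  # on an affected host, after > before
```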
Then I found out that the Unraid 6.12 update switched to cgroup v2, and I think this is the culprit: the new version is the one having this issue. The Stack Overflow post mentioned that the affected host had Docker installed, which is also true for Unraid, even though I'm not using it and it's disabled. I've also seen other posts on this forum about Docker containers hitting similar errors.
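By the way, you can confirm which cgroup version a host is running with stat; on 6.12 the unified (v2) hierarchy reports cgroup2fs:

```bash
# Prints "cgroup2fs" on cgroup v2, "tmpfs" on the old v1 layout.
stat -fc %T /sys/fs/cgroup
```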
I think this is something important that needs to be looked into.