November 30, 2024

Recent logs attached. Essentially, Ollama and Invoke shut down at random intervals and without notice. Between boot and these logs I simply logged in, started the array, and let the autostart run. I then left the Tdarr node unused so the GPU would remain idle, let everything sit until the problematic containers stopped, and pulled the log.

Longer details...

I put together a Beta box with 2 NVMe devices and an Nvidia GPU for running Ollama, Open-WebUI, Invoke, Tdarr, Netdata, and BigAgi, with ClamAV installed for periodic scans.

The shutdowns happen even when the containers sit idle. They can run for up to 2 hours before I find out they stopped, but it is usually a shorter period of 20-40 min, and I might not notice. It also occurs if only one or neither is started with the array; if both are running, they shut down simultaneously. I've also tried with and without `--runtime=nvidia` in the extra parameters. A cause doesn't appear obvious to me, but I'd rather be wrong and corrected than live in ignorance.

Tdarr does not shut down and continues to operate normally. I've also run all three of those containers at the same time, and aside from slower performance they function normally. Again, combinations were tested, and eventually Ollama and/or Invoke would stop. They can be started up again and then perform normally.

There are no obvious errors in the logs aside from mentions of `nvidia-persistenced` and the docker interface going up or down when events occur.

Attachment: aiserver-diagnostics-20241130-1648.zip

Edited November 30, 2024 by TheFullTimer: rearranged some info
December 1, 2024 Author

@bmartino1 Thank you for taking a look. I appreciate it. I ran docker inspect for Invoke, Ollama, and Tdarr. Also attached is a log that continues on from yesterday to today.

Attachments: docker inspect invoke-default.txt, docker inspect ollama.txt, docker inspect tdarr_node_GPU.txt, aiserver-diagnostics-20241201-1054.zip
December 1, 2024 Community Expert

Thanks. Based on the logs and Docker inspection data you've shared, here are some observations and potential areas to investigate:

1. Ollama Container (Exit Code 137): The Ollama container is exiting with ExitCode: 137, which means its main process received SIGKILL (128 + 9). A common cause is the kernel's out-of-memory killer, although anything that sends SIGKILL produces the same code. It could be a sign of resource constraints or excessive resource usage, especially if other containers (such as Tdarr) are running simultaneously.
Action: Monitor the memory usage of the container and check whether a memory limit is set. You can try increasing the container's memory limit or adjusting other resource constraints if necessary.

2. Invoke Container (Exit Code 137): Like Ollama, the Invoke container is also exiting with ExitCode: 137. This further supports the theory that resource constraints or conflicts are leading to the containers being killed.
Action: As with Ollama, monitor the memory usage. Consider adjusting the resource limits (such as CPU and memory) in the Docker configuration.

3. Tdarr Container (Running Fine): The Tdarr container is running normally, which could suggest that it is less resource-intensive than Ollama and Invoke, or that it handles resources more efficiently.
Action: Compare the resource usage patterns of Tdarr with those of Ollama and Invoke to understand why those containers stop while Tdarr continues to run.

4. General Observations: Both Ollama and Invoke use the NVIDIA GPU, and NVIDIA_VISIBLE_DEVICES is set in their environment variables. Ensure that the GPU drivers are properly configured and that the `--runtime=nvidia` flag is correctly used in the Docker configuration. Check for any potential conflicts over GPU resources when these containers run simultaneously.

5. Logs for Further Debugging: It would be helpful to check the Docker logs and Unraid system logs for any memory-related errors or GPU-specific issues around the time the containers exit.
Action: Investigate the system logs for any kernel or GPU driver messages, particularly around the times the containers are stopped.

6. Docker Network Settings: The containers use the bridge network mode. If there are network-related issues, consider switching to host network mode or creating a dedicated Docker network for these containers to improve performance and reduce the chance of conflicts.

Next Steps:

Monitor Resource Usage: Use Docker's resource monitoring tools or Unraid's built-in resource monitoring to keep an eye on memory and CPU usage. You might want to set up alerts for high memory usage to catch issues proactively.

Examine GPU Usage: Make sure that the GPU is being allocated correctly and that multiple containers using the GPU aren't causing conflicts.
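To make the exit-code reasoning above concrete, here is a minimal Python sketch of how a `State` object from `docker inspect` could be read. `ExitCode` and `OOMKilled` are fields Docker reports; the function name `describe_exit` and the sample JSON are made up for illustration and are not taken from the attached inspect files.

```python
import json

# Only the two signals relevant to this thread; other numbers fall back to "signal N".
SIGNAL_NAMES = {9: "SIGKILL", 15: "SIGTERM"}


def describe_exit(state: dict) -> str:
    """Summarise the State object of `docker inspect` (hypothetical helper)."""
    code = int(state.get("ExitCode", 0))
    if code <= 128:
        return f"exit {code}: process exited without being signalled"
    # By shell convention, exit codes above 128 mean "128 + signal number".
    signum = code - 128
    name = SIGNAL_NAMES.get(signum, f"signal {signum}")
    if state.get("OOMKilled", False):
        cause = "Docker flagged an OOM kill"
    else:
        cause = "no OOM flag, so look for whatever sent the signal"
    return f"exit {code}: killed by {name}; {cause}"


# Illustrative input only, not data from this thread.
sample_state = json.loads('{"ExitCode": 137, "OOMKilled": false}')
print(describe_exit(sample_state))
```

The state itself can be dumped with `docker inspect --format '{{json .State}}' <container>`. The `OOMKilled` flag is a hint rather than proof, so the kernel log is still worth checking for OOM messages around the time of the exit.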
December 2, 2024 Author

Thank you for the feedback, @bmartino1. I've attached screenshots from NetData covering the range of the first syslog I attached, when all services sat idle until failure. There doesn't appear to be a memory leak, and the moment the containers stopped is shown at ~16:47:02, with no sign of a radical change at that time. You can see that the system has 128GB of RDIMMs and everything is consistent.

I'll go back to testing a bit with `--runtime=nvidia` & `--gpus=all` in the extra parameters. It is intriguing that Tdarr has run happily with `--gpus=all` this whole time. Additionally, if it were runaway GPU memory use I'd expect, from prior experience, the GPU to lock up or all containers using the GPU to fail, whereas Tdarr hums along happily.
December 3, 2024 Author

This issue has persisted for these two containers, whereas Tdarr runs without issue. I put this in General as it is a Beta system, but it's very curious.

I tested combinations of `--runtime=nvidia` & `--gpus=all` in the extra parameters, even with only one test container running and the system restarted so that only the test container and NetData run at array start.

Review and suggestions are welcome. If there's a particular piece of data you'd like to see for a given container start/failure, please ask.

---

Edit: 7.0.0-rc.1 (2024-12-02) is out. Switching docker to the overlay2 storage driver and testing: https://docs.unraid.net/unraid-os/release-notes/7.0.0/#add-support-for-overlay2-storage-driver

Edited December 3, 2024 by TheFullTimer
December 3, 2024 Author

😑 Happened on 7.0.0-rc.1 after the previously mentioned changes (i.e. the switch from the btrfs to the overlay2 storage driver).

Containers were set back to the extra parameters recommended by the developers:

Ollama: `--gpus=all`
Invoke: `--runtime=nvidia --gpus=all`
Tdarr: `--gpus=all`
December 6, 2024 Author

@bmartino1 I don't believe that's the cause, but I'll share the information below. In the past 48 hours I've made changes that kept the containers running stably. I'm working back through my changes to determine what configuration could work best for others. 🫡

From docker info:

    Storage Driver: overlay2
    Backing Filesystem: zfs

From ollama container: df -T

    Filesystem  Type       1K-blocks   Used       Available  Use%  Mounted on
    overlay     overlay    483475328   156656000  326819328  33%   /
    tmpfs       tmpfs      65536       0          65536      0%    /dev
    shm         tmpfs      65536       0          65536      0%    /dev/shm
    shfs        fuse.shfs  483655680   156836352  326819328  33%   /root/.ollama
    cache       zfs        483475328   156656000  326819328  33%   /etc/hosts
    tmpfs       tmpfs      65828356    12         65828344   1%    /proc/driver/nvidia
    tmpfs       tmpfs      65828356    4          65828352   1%    /etc/nvidia/nvidia-application-profiles-rc.d
    overlay     overlay    65810396    1403344    64407052   3%    /usr/bin/nvidia-smi
    overlay     overlay    65810396    1403344    64407052   3%    /usr/lib/firmware/nvidia/565.57.01/gsp_ga10x.bin
    tmpfs       tmpfs      65828356    0          65828356   0%    /proc/acpi
    tmpfs       tmpfs      65828356    0          65828356   0%    /sys/firmware

From invoke container: df -T

    Filesystem  Type       1K-blocks   Used        Available  Use%  Mounted on
    overlay     overlay    483475200   156656768   326818432  33%   /
    tmpfs       tmpfs      65536       0           65536      0%    /dev
    shfs        fuse.shfs  1885863936  1141609344  744254592  61%   /install
    shm         tmpfs      65536       0           65536      0%    /dev/shm
    cache       zfs        483475200   156656768   326818432  33%   /etc/hosts
    overlay     overlay    65810396    1419344     64391052   3%    /usr/bin/nvidia-smi
    rootfs      rootfs     65810396    1419344     64391052   3%    /etc/vulkan/icd.d/nvidia_icd.json
    tmpfs       tmpfs      65828356    12          65828344   1%    /proc/driver/nvidia
    tmpfs       tmpfs      65828356    4           65828352   1%    /etc/nvidia/nvidia-application-profiles-rc.d
    overlay     overlay    65810396    1419344     64391052   3%    /usr/lib/firmware/nvidia/565.57.01/gsp_ga10x.bin
    tmpfs       tmpfs      65828356    0           65828356   0%    /proc/acpi
    tmpfs       tmpfs      65828356    0           65828356   0%    /sys/firmware

Edited December 6, 2024 by TheFullTimer: switch from md to code blocks
December 9, 2024 Author Solution

Following up on this issue: the trouble has been resolved.

If you didn't read the thread, this catches you up:

Working through my thought process, I didn't freeze updates or other work on the system. I don't believe that a reasonable end-user would ignore container updates that might fix their issue, and being in the Beta I did update from 7.0.0 Beta-4 to 7.0.0-rc.1. I had deleted the images, switched to the overlay2 storage driver, and rebuilt the containers. When the issue persisted I looked around at other possible causes.

Resolution and checking work(?):

The most impactful difference to stability occurred after I checked through my settings in the Tips and Tweaks plugin. Here I had kept the defaults of "vm.dirty_background_ratio to 3% and vm.dirty_ratio to 5%". For reference, my testing system has 128 GB of RAM, so I adjusted these down to 1% and 2%, respectively. A full Shutdown and Power On was performed after the change, and the problematic containers ran idle for 24 hours.

As they now seemed to be stable, I began testing them with intermittent workloads to see if they might eventually shut down as they had previously. They continued to operate normally. An update for both Invoke and Ollama had been released, so the containers were updated and continued to function without unexpectedly quitting. I also took the opportunity to update the Checkpoints and Loras of the Invoke container, and took the easiest path of deleting the container, removing its appdata, and cleanly installing with updated Paths.

After ~60 hours I went back to test the cache ratios. They were returned to 3% & 5%, respectively, and the system was cleanly power cycled and left to idle with the problematic containers running but otherwise inactive. Nothing changed, and the containers remained available for 24hr+. I then returned the settings to 1% & 2%, where they now remain.

TL;DR

System has 128 GB of RAM. Adjusting "vm.dirty_background_ratio to 3% and vm.dirty_ratio to 5%" to "vm.dirty_background_ratio to 1% and vm.dirty_ratio to 2%" was most impactful.

Additional Thoughts:

This does raise some questions about whether those applications can crash when a large amount of dirty (not yet written back) data builds up in memory. To resolve that concern and 100% confirm the cause, a clean test environment would've been necessary. I didn't really expect the cache ratios to have a large impact in the first place, but the change effectively corrected the problem on an idle system. It is also possible that work I performed while the cache ratios were lowered fixed some underlying issue with one or more containers, which would explain why the problem did not return at 3% & 5%.

Edited December 9, 2024 by TheFullTimer: improve the Additional Thoughts section
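For a sense of scale, the two ratio pairs can be turned into approximate byte thresholds. This is a rough sketch that assumes the 128 GB of RAM is 128 GiB and that all of it counts as available memory. The kernel actually applies these percentages to available (free plus reclaimable) memory, so real thresholds are somewhat lower. The helper name `dirty_thresholds` is made up for illustration.

```python
GIB = 1024 ** 3


def dirty_thresholds(available_bytes: int, background_pct: int, hard_pct: int) -> tuple[int, int]:
    """Return (background writeback threshold, writer-blocking threshold) in bytes."""
    background = available_bytes * background_pct // 100
    hard = available_bytes * hard_pct // 100
    return background, hard


# Assumption: treat all 128 GiB as available memory, which gives an upper bound.
ram_bytes = 128 * GIB

for label, bg_pct, hard_pct in (("3% / 5%", 3, 5), ("1% / 2%", 1, 2)):
    bg, hard = dirty_thresholds(ram_bytes, bg_pct, hard_pct)
    print(f"{label}: background writeback from ~{bg / GIB:.2f} GiB, "
          f"writers throttled at ~{hard / GIB:.2f} GiB")
```

Under these assumptions, the original settings allow a few GiB of dirty data to accumulate before writers are throttled, and the lowered settings cut that to roughly 40% of the original limit.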
March 30, 2025

@TheFullTimer I had a similar (if not the same) problem with Ollama, and with Docker containers using the GPU in general, and found your thread. Only with Ollama, though, did I get those exit 137 occurrences, which were a pain to analyze.

In my case the solution was so simple, I could slap myself (hard) ... My Unraid runs several user scripts, including of course those recommended during the initial setup for using Nvidia GPUs, specifically one by SpaceInvaderOne (https://github.com/SpaceinvaderOne/nvidia_powersave).

Fun fact: it was a chat with gemma3:12b that gave me the right idea where to look. Oh crap, of course! 😄

tl;dr: Check your nvidia powersave user script, if one is installed. Otherwise just skip the following.

Quote:

Okay, this is a very likely culprit. The script you're running, specifically the `terminate_non_docker_gpu_processes` function, has a critical flaw that's almost certainly triggering the container stops with exit code 137. Let's break down why:

Understanding Exit Code 137

Exit code 137 means the process was terminated by SIGKILL (128 + 9). The kernel's out-of-memory (OOM) killer is one common source of that signal, but any SIGKILL produces the same code. It's not a direct error within the application itself but a consequence of something outside it ending the process. Processes that are killed this abruptly often don't leave behind useful debug information in the application logs, which makes them frustrating to diagnose.

How Your Script Causes the Problem

The core problem lies in how `terminate_non_docker_gpu_processes` interacts with the GPU, particularly when a Docker container is actually using the GPU. Let's examine the critical sections:

1. `all_gpus_free()`: This function is designed to determine whether any GPU is in use by a Docker container. However, the way it checks (`is_gpu_in_use_by_docker`) is imperfect. It relies on scanning `/proc/${pid}/cgroup` to identify Docker processes. This method isn't foolproof. A Docker container might be using a GPU, but the `cgroup` information might be missing or incorrect, or the check might not properly identify the container due to various configuration nuances.

2. `terminate_non_docker_gpu_processes()`: This function relies entirely on the output of `all_gpus_free()`. If `all_gpus_free()` returns true (meaning it thinks no Docker containers are using the GPU), it proceeds to execute `fuser -kv /dev/nvidia`.

`fuser -kv /dev/nvidia`: This command forcefully kills any process using the `/dev/nvidia` devices (the GPU device files). This is the critical point of failure. When `all_gpus_free()` falsely reports that no containers are using the GPU, this command aggressively terminates processes, including those within running Docker containers.

Why This Leads to Exit Code 137

When `fuser -kv /dev/nvidia` kills a process within a running Docker container, it's akin to abruptly pulling the rug out from under that container. The container is terminated unexpectedly. Because `fuser -k` sends SIGKILL unless told otherwise, the container's process exits with code 137. The container doesn't get a graceful shutdown opportunity to clean up, release resources, or save its state. It's essentially killed "hard."

Why the Imperfect Check Matters Even When Containers Are Using the GPU

Even when containers are legitimately using GPUs, the script will still run periodically. While the `all_gpus_free` logic should prevent `fuser` from running, the imperfect check means it might sometimes conclude that no container is using the GPU. Even a momentary lapse in the check during the script's execution could lead to a disruptive kill.

In Summary: The Script's Logic Creates a Trap

The logic creates a situation where a perfectly legitimate Docker container, running normally, can be forcibly terminated because the script misinterprets its resource usage. This abrupt termination results in the exit code 137.

Recommendations - What You Should Do

1. Disable or Remove the Script: This is the most immediate solution. The script is causing more harm than good.

2. Improve `is_gpu_in_use_by_docker()`: If you absolutely need to manage GPU usage, drastically improve this check. Consider using more reliable methods to identify Docker containers, such as checking the container ID within the process's environment variables, or using the Docker API to query active containers and their resource usage (the most reliable approach).

3. Graceful Termination (Highly Recommended): Instead of `fuser -kv`, implement a more graceful approach. If you need to free up GPUs, try sending a `SIGTERM` signal to the processes and allowing them time to shut down cleanly before resorting to a more forceful `SIGKILL` signal.

4. Logging: Add extensive logging to your script. Log every check, the result of `all_gpus_free()`, and what action is taken. This will provide invaluable insight into why the script makes the decisions it does.

By addressing these points, you should be able to resolve the container stopping issue and gain more control over your GPU resource management. Start by disabling the script immediately, as the problem is likely ongoing right now.

Edited March 30, 2025 by FilewalkerDotNet
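As a rough illustration of recommendations 2 and 3, here is a Python sketch that refuses to touch anything that looks containerized and otherwise sends SIGTERM before SIGKILL. It is not code from the nvidia_powersave script. The function names, the marker list, and the sample cgroup lines are all assumptions made for illustration, and the substring check is still only a heuristic (the Docker API would be more reliable).

```python
import os
import signal
import time

# Assumption: cgroup paths of containerized processes contain one of these substrings.
CONTAINER_MARKERS = ("docker", "containerd", "libpod", "kubepods")


def is_container_cgroup(cgroup_text: str) -> bool:
    """Heuristic: does the content of /proc/<pid>/cgroup look like a container?"""
    return any(marker in cgroup_text for marker in CONTAINER_MARKERS)


def pid_in_container(pid: int) -> bool:
    try:
        with open(f"/proc/{pid}/cgroup", encoding="utf-8") as handle:
            return is_container_cgroup(handle.read())
    except OSError:
        # Unreadable or vanished process: fail safe and treat it as containerized.
        return True


def terminate_gracefully(pid: int, grace_seconds: float = 10.0, poll: float = 0.2) -> str:
    """Send SIGTERM, wait up to grace_seconds, then fall back to SIGKILL."""
    if pid_in_container(pid):
        return "skipped"
    try:
        os.kill(pid, signal.SIGTERM)
    except ProcessLookupError:
        return "already gone"
    deadline = time.monotonic() + grace_seconds
    while time.monotonic() < deadline:
        try:
            os.kill(pid, 0)  # signal 0 only checks whether the pid still exists
        except ProcessLookupError:
            return "terminated"
        time.sleep(poll)
    try:
        os.kill(pid, signal.SIGKILL)
    except ProcessLookupError:
        return "terminated"
    return "killed"


# Illustrative cgroup lines only, not captured from a real system.
for sample in ("0::/docker/0123abcd\n", "0::/\n"):
    print(repr(sample), "->", "container" if is_container_cgroup(sample) else "host")
```

Treating an unreadable cgroup file as "containerized" is a deliberate choice in this sketch: when the check is uncertain, doing nothing is cheaper than killing a running container.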