January 23, 2025

I have encountered a strange issue about once per week for the last two weeks. Seemingly at random, all the services on my instance crash: Docker containers, VMs, WebGUI, ssh, etc. My only interface to the server when this occurs is monitor/mouse/keyboard.

I am able to run some bash commands but not others. For example, I can tail the syslog. The syslog does not seem to show anything interesting; the last entries refer to an SSD trim command running. I suspect that processes are largely failing to write to the syslog while the server is in this error state. Other commands give me a bash input/output error. For example:

root@WarrentonUnraid:/mnt/user# /etc/rc.d/rc.nginx status
-bash: /etc/rc.d/rc.nginx: Input/output error

I was also unable to run powerdown -r because of the same I/O error.

Doing a hard reboot of the server causes everything to come back fine. Of course it initiates a parity check as expected, but all the VMs, Docker containers, etc. run fine. I do not know how to induce this error state again.

Does anyone have any advice on things to check or look into to prevent this issue from coming back?

warrentonunraid-diagnostics-20250123-1109.zip
January 23, 2025, Community Expert

There may be an issue with one of the containers. I recommend retesting with just half of them enabled; if the result is the same, try the other half, then keep drilling down.
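The halving approach suggested above can be tracked with a small helper. This is only a sketch under stated assumptions: it assumes a single container is responsible, that a crash while one half is enabled implicates that half, and that a stable run implicates the disabled half. The container names are hypothetical placeholders, and the functions `split_suspects` and `narrow` are made up for illustration.

```python
def split_suspects(suspects):
    """Split the current suspect list into (enabled_half, disabled_half)."""
    mid = (len(suspects) + 1) // 2
    return suspects[:mid], suspects[mid:]


def narrow(suspects, crashed):
    """Return the suspects left after one test round.

    Assumption: exactly one container is at fault. If the server crashed
    with the first half enabled, the culprit is in that half; otherwise it
    is in the half that was disabled. An empty result means the
    single-culprit assumption does not hold.
    """
    enabled, disabled = split_suspects(suspects)
    return enabled if crashed else disabled


# Hypothetical container names, for illustration only.
suspects = ["container-a", "container-b", "container-c", "container-d"]
enabled, disabled = split_suspects(suspects)
print("enable for this round:", enabled)
print("if it crashes, next suspects:", narrow(suspects, crashed=True))
print("if it stays stable, next suspects:", narrow(suspects, crashed=False))
```

Each round halves the list, so a handful of rounds is enough even with many containers, although each round still needs enough uptime to be meaningful.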
January 23, 2025, Author

Thanks for looking into it for me. This has happened twice so far, and there was roughly one week of stability after the first instance before the second occurred. So in all likelihood it will take some time to narrow down a misbehaving container.

After the most recent reboot, my snapcast container failed to function correctly. The logs for that container are pretty messy, but it complains of a bad file descriptor:

2025-01-23 08-56-21.267 [Error] (AirplayStream) Error opening metadata pipe, retrying in 500ms. Error: assign: Bad file descriptor [system:9 at /usr/include/boost/asio/detail/impl/reactive_descriptor_service.ipp:120 in function 'assign']

I'll keep snapcast turned off and see if that resolves the issue. I am a little confused about how a misbehaving container could crash the whole system.
January 23, 2025, Community Expert

16 minutes ago, alkenerly said: I am a little confused about how a misbehaving container could crash the whole system

This has happened many times before with Docker fork bombs, and while there has been some protection against that since 6.12.14, it may not prevent everything. I also saw this in the log:

Jan 23 06:21:32 WarrentonUnraid kernel: cgroup: fork rejected by pids controller in /docker/f5fa35b71e37beb3cfa3e189254b280b4d04bf9317a6a88f453171dbbd0dcbb4
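To check how often that kernel message shows up and for which container, the syslog can be scanned for it. The following is a minimal sketch, not an Unraid tool: the function name `rejected_container_ids` is made up for illustration, and the regular expression assumes the message format is exactly as in the line quoted above.

```python
import re

# Matches the kernel message quoted above. The hex string after
# "/docker/" is the full container ID.
PIDS_REJECT = re.compile(
    r"cgroup: fork rejected by pids controller in /docker/(?P<cid>[0-9a-f]+)"
)


def rejected_container_ids(lines):
    """Count pids-controller fork rejections per container ID."""
    counts = {}
    for line in lines:
        match = PIDS_REJECT.search(line)
        if match:
            cid = match.group("cid")
            counts[cid] = counts.get(cid, 0) + 1
    return counts


# Demo using the log line from this thread plus an unrelated line.
sample = [
    "Jan 23 06:21:32 WarrentonUnraid kernel: cgroup: fork rejected by pids "
    "controller in /docker/f5fa35b71e37beb3cfa3e189254b280b4d04bf9317a6a88f453171dbbd0dcbb4",
    "Jan 23 06:21:33 WarrentonUnraid kernel: some unrelated message",
]
for cid, count in rejected_container_ids(sample).items():
    print(cid[:12], count)
```

To run it against a real log, pass an open file instead of `sample` (the syslog path, assumed here to be /var/log/syslog, may differ on your system). The container ID can then be matched to a name with `docker ps -a --no-trunc`.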
April 2, 2025, Author

Ever since I made this forum post, I have been trying to figure out what has caused the occasional crash. It does not happen more than once per week, so it is not the kind of thing that can be quickly tested.

I think I may have found a correlation with the crashes. I use vscode for remote development and will frequently attach my vscode window to my unraid server host. Doing that installs the vscode server on unraid itself. It works great! From there, I will sometimes remotely attach to a container: either a development container that I have defined with a Dockerfile, or another one of my running containers like nextcloud to edit configs, etc.

@JorgeB, do you see a way that running vscode on the host and/or inside a container could cause this issue? Any way I could mitigate it?
April 2, 2025, Community Expert

4 minutes ago, alkenerly said: do you see a way that running vscode on the host and/or inside a container could cause this issue? Any way I could mitigate it?

Sorry, no idea if that would be a problem or not.
May 16, 2025, Author

Hey all, I just wanted to post an update on this topic. I have since upgraded to 7.0.1 but still experience the same issue. Each time, the server runs correctly for 1-2 weeks and then all VMs, Docker containers, the webui, and ssh suddenly crash at once. I can log in via the attached monitor and keyboard, but no command functions properly. Even executing `shutdown` produces a lot of bus errors and seg faults and never actually shuts down. The only option is a hard reset.

Upon reboot, the server does a parity check and finds/fixes 5 errors (it's always exactly 5) and runs as expected for another week or two.

I disabled XMP in the BIOS, ran memtests, updated the BIOS, etc. but still had another crash today. I believe I have exhausted all of my troubleshooting ability. It definitely stinks that my unraid server is so unstable. It used to be rock solid!

Does this seem like a hardware failure to anyone? Memory issue?
May 16, 2025, Community Expert

It could be hardware, but it is difficult to say. One thing you can try is to boot the server in safe mode with all Docker containers/VMs disabled and let it run as a basic NAS for a few days. If it still crashes, it's likely a hardware problem; if it doesn't, start turning on the other services one by one, including the Docker containers.
May 16, 2025, Author

@JorgeB I do agree that is a very sensible troubleshooting step. The main reason I have not attempted it fully is that I would need to let the server run for at least 3-4 weeks before I could confidently say it's "stable" in this configuration. Re-enabling containers and VMs could take several months at that rate, and I would struggle without a fully functioning NAS for that long.

My main hang-up is that I was under the impression that VMs and containers both serve to isolate the running processes from each other and from the greater system, e.g. a Docker container or VM crashing shouldn't cause unraid itself to crash.

I suppose that may be my only route. Or I could transfer to totally new server hardware to rule out a hardware fault.
May 17, 2025, Community Expert

10 hours ago, alkenerly said: e.g. a docker or VM crashing shouldn't cause unraid itself to crash.

In theory yes; in practice both have been found to be able to crash a server.

On the hardware side, there's one easy thing to rule out. If you have multiple sticks of RAM, try using the server with just one; if it still crashes, try a different one. That will basically rule out bad RAM.