Unraid Becomes Unavailable / Crashes After Some Random Time


Solved by freddy0

My Unraid system seems to be freezing up (between 24 hours and 10 days after boot). My connected 3D printer just stops printing and the web UI stops responding. My log file before/during the crash is filled with the following information:

Apr 26 19:43:29 Tower kernel: overlayfs: upper fs does not support tmpfile.
Apr 26 19:43:29 Tower kernel: overlayfs: upper fs does not support RENAME_WHITEOUT.

Followed by these messages:

Apr 26 20:36:03 Tower root: Total Spundown: 0
Apr 26 20:37:16 Tower kernel: br-b3999385247f: port 4(vethdacc4ba) entered disabled state
Apr 26 20:37:16 Tower kernel: veth3b1a1a4: renamed from eth0
Apr 26 20:37:16 Tower kernel: br-b3999385247f: port 4(vethdacc4ba) entered disabled state
Apr 26 20:37:16 Tower kernel: device vethdacc4ba left promiscuous mode
Apr 26 20:37:16 Tower kernel: br-b3999385247f: port 4(vethdacc4ba) entered disabled state
Apr 26 20:37:16 Tower kernel: br-b3999385247f: port 4(vethad3ff92) entered blocking state
Apr 26 20:37:16 Tower kernel: br-b3999385247f: port 4(vethad3ff92) entered disabled state
Apr 26 20:37:16 Tower kernel: device vethad3ff92 entered promiscuous mode
Apr 26 20:37:16 Tower kernel: br-b3999385247f: port 4(vethad3ff92) entered blocking state
Apr 26 20:37:16 Tower kernel: br-b3999385247f: port 4(vethad3ff92) entered forwarding state
Apr 26 20:37:17 Tower kernel: br-b3999385247f: port 4(vethad3ff92) entered disabled state
Apr 26 20:37:20 Tower kernel: vethfcf7e76: renamed from eth0
Apr 26 20:37:20 Tower kernel: docker0: port 39(vethdf98a50) entered disabled state
Apr 26 20:37:22 Tower kernel: docker0: port 39(vethdf98a50) entered disabled state
Apr 26 20:37:22 Tower kernel: device vethdf98a50 left promiscuous mode
Apr 26 20:37:22 Tower kernel: docker0: port 39(vethdf98a50) entered disabled state
Apr 26 20:37:23 Tower kernel: eth0: renamed from vethed59a92
Apr 26 20:37:23 Tower kernel: br-b3999385247f: port 4(vethad3ff92) entered blocking state
Apr 26 20:37:23 Tower kernel: br-b3999385247f: port 4(vethad3ff92) entered forwarding state
Apr 26 20:41:03 Tower root: Total Spundown: 0
Apr 26 20:46:06 Tower root: Total Spundown: 0
Apr 26 20:51:06 Tower root: Total Spundown: 0
Apr 26 20:55:30 Tower kernel: usb 1-2: USB disconnect, device number 15
Apr 26 20:55:30 Tower kernel: usb 1-2: failed to send control message: -19
Apr 26 20:55:30 Tower kernel: ch341-uart ttyUSB0: ch341-uart converter now disconnected from ttyUSB0
Apr 26 20:55:30 Tower kernel: ch341 1-2:1.0: device disconnected
Apr 26 20:55:57 Tower  shutdown[17538]: shutting down for system halt
Apr 26 20:55:57 Tower  init: Switching to runlevel: 0
Apr 26 20:55:57 Tower  init: Trying to re-exec init
Apr 26 20:56:04 Tower kernel: mdcmd (38): nocheck cancel
Apr 26 20:56:05 Tower  emhttpd: Spinning up all drives...
Apr 26 20:56:05 Tower  emhttpd: read SMART /dev/sdh
Apr 26 20:56:05 Tower  emhttpd: read SMART /dev/sdg
Apr 26 20:56:05 Tower  emhttpd: read SMART /dev/sdd
Apr 26 20:56:05 Tower  emhttpd: read SMART /dev/sde
Apr 26 20:56:05 Tower  emhttpd: read SMART /dev/sdb
Apr 26 20:56:05 Tower  emhttpd: read SMART /dev/sdf
Apr 26 20:56:05 Tower  emhttpd: read SMART /dev/sdc
Apr 26 20:56:05 Tower  emhttpd: read SMART /dev/nvme0n1
Apr 26 20:56:05 Tower  emhttpd: read SMART /dev/sda
Apr 26 20:56:06 Tower  emhttpd: Stopping services...
Apr 26 20:56:06 Tower root: Total Spundown: 0

I have attached all of the other diagnostics, but this is where I would guess the problem originates, or at least where it relates to the underlying problem.

The extra logs file is what I extracted before/during the reboot.

Unraid-Logs.txt tower-diagnostics-20230426-2122.zip

Link to comment

Try setting typical idle current (or similar wording) in the BIOS to "typical". If there is no such option, then ensure that C-states are disabled.

 

You should also try running your memory at the SPD speed (probably 2133) instead of its XMP (overclocked) profile of 3200.

 

 

Link to comment
On 4/26/2023 at 10:40 PM, Squid said:

Try setting typical idle current (or similar wording) in the BIOS to "typical". If there is no such option, then ensure that C-states are disabled.

You should also try running your memory at the SPD speed (probably 2133) instead of its XMP (overclocked) profile of 3200.

Typical idle current - done
C-state control disabled

Memory at 2133 MT/s

After some more troubleshooting and investigation I came to the conclusion that my problem is most definitely memory related. I reserved two CPU cores because it was recommended somewhere. After that I was able to successfully SSH into the server while it had soft-locked (the web UI from a different machine was not loading correctly after a login attempt, and the local Unraid web UI view was frozen). CPU utilization was stuck at around 98 percent in the server's web UI view. A quick htop from the terminal showed the CPU about 85% used by kswapd0, and memory filled up almost completely (99.9% or so). I ran a diagnostics scan from my SSH session; its output is the smaller file. Unraid somehow unstuck itself after I ran diagnostics / lsof / free, and once everything started working again I ran diagnostics a second time. What might have caused this?
After some further research I have now changed the vm.dirty ratio settings to 1% and 2% respectively with the "Tips and Tweaks" plugin. I am not sure if this has already solved the problem; I will definitely report back. In the meantime it would be great if someone could make sense of the attached diagnostics and/or recommend anything else I could try.
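For reference, the same writeback tuning can be checked and applied directly from a terminal. A minimal sketch, assuming the plugin maps to the standard vm.dirty_background_ratio / vm.dirty_ratio sysctls:

# show the current values (kernel defaults are usually 10 and 20)
sysctl vm.dirty_background_ratio vm.dirty_ratio

# apply the 1% / 2% values mentioned above (not persistent across reboots)
sysctl -w vm.dirty_background_ratio=1
sysctl -w vm.dirty_ratio=2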


I think it has to do with one of the Docker containers, based on what is mentioned in the system logs after it all unstuck itself:

May  2 18:47:20 Tower kernel: oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/docker/8af47f41a4d2d92815f706ab6d61493c4859eb517e7a7abe51eab02584fd755a,task=node,pid=8791,uid=0
May  2 18:47:20 Tower kernel: Out of memory: Killed process 8791 (node) total-vm:57625864kB, anon-rss:53986004kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:117928kB oom_score_adj:0
May  2 18:47:24 Tower kernel: oom_reaper: reaped process 8791 (node), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

tower-diagnostics-20230502-1848.zip tower-diagnostics-20230502-1831.zip

How can I find out which container is causing this behavior?
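One way is the task_memcg path in the oom-kill line above, which already contains the full container ID. A minimal sketch using standard Docker CLI commands:

# resolve the ID from task_memcg=/docker/<id> to a container name
docker inspect --format '{{.Name}}' 8af47f41a4d2d92815f706ab6d61493c4859eb517e7a7abe51eab02584fd755a

# or list all containers with untruncated IDs and grep for the prefix
docker ps --all --no-trunc | grep 8af47f41a4d2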

Edited by freddy0
Link to comment
  • Solution

I want to further document what I have found and add some keywords so that anyone searching for a related topic might find this.
The issue I described is of the type "server out of memory", or OOM for short. It seems that some service on my host machine is allowed to suck up all of my RAM (64 GB filled). The CPU usage is also high as a consequence of the extreme RAM usage: the kernel's memory-reclaim process (kswapd0) is constantly hammering the CPU cores.
With that I had finally isolated the problem:
CPU usage extremely high - RAM filling up because of heavy I/O - and the system becomes unresponsive.

SSH is the only thing that still works to some extent.
At this point I don't know what the source is, though. Sorting top by memory usage

> top    (press M inside top to sort by %MEM)

shows that a process called /usr/local/bin/node, or kswapd0, is using all of this memory. Unfortunately that does not help much further, because I suspect another program is causing extremely heavy I/O load, so that the server cannot keep up and has to keep everything in RAM.
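For a quick, non-interactive snapshot of the biggest memory consumers, useful when the web UI is down and only SSH responds, a sketch using standard procps tools:

# top 10 processes by resident memory, single snapshot
ps aux --sort=-rss | head -n 11

# one batch-mode iteration of top, sorted by memory (procps-ng top)
top -b -o %MEM -n 1 | head -n 20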
The problem is likely caused by either a Docker container that is misconfigured in some way or a plugin that does something unusual in the background.

To find out which Docker container might be causing this problem, I run

docker stats --no-stream

to get only one final snapshot instead of a live stream (it can take a very long time, around 4 minutes, to get a response because the server is under heavy load).
Then I search for a container that is misbehaving in such a way that it utilizes abnormal amounts of RAM.
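To surface the worst offenders without scanning the whole table, the same data can be reduced to memory share and name and sorted. A small sketch built on docker stats' documented format fields:

# memory percentage and container name, highest first
docker stats --no-stream --format "{{.MemPerc}} {{.Name}}" | sort -rn | head -n 5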
My output of

docker stats --no-stream
CONTAINER ID   NAME                   CPU %     MEM USAGE / LIMIT     MEM %     NET I/O           BLOCK I/O   PIDS
d4e7fd9e1f9f   dashy                  7.24%     49.56GiB / 62.73GiB   79.00%    44.4GB / 73.3MB   0B / 0B     22
9d7aa6b28199   xxxxxxx                1.47%     52.42MiB / 62.73GiB   0.08%     82.1MB / 8.88MB   0B / 0B     36
17426e04df93   xxxxxxxxxxx            0.10%     9.895MiB / 62.73GiB   0.02%     129MB / 97.7MB    0B / 0B     6
5849332c035d   xxxxxxxxxx             0.00%     40.43MiB / 62.73GiB   0.06%     2.59MB / 76.4kB   0B / 0B     12
3168dd810eb9   xxxxxxxxxxxxxxxxxxxx   0.02%     21.53MiB / 62.73GiB   0.03%     539MB / 1.78MB    0B / 0B     9
0b15f326c902   xxxxxxxxxxxxx          45.01%    230.1MiB / 62.73GiB   0.36%     260MB / 103MB     0B / 0B     22
4c04ff55ae43   xxxxxxxxxxxxxxx        0.12%     8.609MiB / 62.73GiB   0.01%     99.3MB / 90.2MB   0B / 0B     6
8e8b98f025c9   xxxxxxxxxxxxxxx        0.08%     9.469MiB / 62.73GiB   0.01%     135MB / 66.4MB    0B / 0B     6
ed9d251c5f8a   xxxxxxxxxxxxxxx        0.35%     10.41MiB / 62.73GiB   0.02%     219MB / 138MB     0B / 0B     7
124b0e9b6659   xxxxxxxxxxxxxxxxx      2.03%     178.9MiB / 62.73GiB   0.28%     472MB / 139GB     0B / 0B     42
7af51b6011e3   xxxxxxxxxxxxxx         0.00%     16.75MiB / 62.73GiB   0.03%     2.59MB / 21.8MB   0B / 0B     25
f503e0a68d75   xxxxxxxxxxx            0.89%     66.32MiB / 62.73GiB   0.10%     3.09MB / 338kB    0B / 0B     7
0162686fcb7f   xxxxxxxxxxxxxxx        0.09%     597.2MiB / 62.73GiB   0.93%     2.56MB / 39.8kB   0B / 0B     69
ffb7f7b840b4   xxxxxxx                0.04%     313.8MiB / 62.73GiB   0.49%     2.57MB / 861kB    0B / 0B     56
274f9b5071e0   xxxxxxxxxxxxxxxxx      3.87%     127.4MiB / 62.73GiB   0.20%     24.6MB / 8.31MB   0B / 0B     41


As you can see, the Dashy container is hammering my I/O and memory quite intensively, with about 50 GB of RAM and 44 GB of network I/O, even though this is a simple dashboard application.
I suspect this is related to a toggle option in Dashy that lets the server handle all the uptime pings for a given service instead of the client (which I had toggled on).

I could either remove the container or simply toggle this back so the client handles all the uptime pings again. This is a hypothesis, though.

The problem ultimately is Dashy, and I will try to narrow it down even further.
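Independent of what Dashy is doing internally, a memory limit on the container keeps one runaway process from taking down the whole host. A sketch (the 4g value is only an example, not a recommendation):

# cap the running container right away
docker update --memory=4g --memory-swap=4g dashy

# on Unraid this can usually be made permanent by adding the limit to the
# container template's "Extra Parameters" field:
--memory=4g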

Some other Reddit / forum posts I found to be somewhat related:
Unraid 6.8.3 OOM error on large file transfer: https://www.reddit.com/r/unRAID/comments/ig5xyy/unraid_683_oom_error/
Heavy disk I/O - some application generating/checking data all the time (extremely short interval):
https://www.reddit.com/r/unRAID/comments/gip2tv/high_cpu_usage_all_of_a_sudden/

 

Documentation references
Docker stats: https://docs.docker.com/engine/reference/commandline/stats/

 

Link to comment
