freddy0 Posted April 26, 2023

My Unraid system seems to be freezing up somewhere between 24 hours and 10 days after boot. My connected 3D printer just stops printing and the web UI stops responding. My log file before/during the crash is filled with the following entries:

Apr 26 19:43:29 Tower kernel: overlayfs: upper fs does not support tmpfile.
Apr 26 19:43:29 Tower kernel: overlayfs: upper fs does not support RENAME_WHITEOUT.

Followed by these messages:

Apr 26 20:36:03 Tower root: Total Spundown: 0
Apr 26 20:37:16 Tower kernel: br-b3999385247f: port 4(vethdacc4ba) entered disabled state
Apr 26 20:37:16 Tower kernel: veth3b1a1a4: renamed from eth0
Apr 26 20:37:16 Tower kernel: br-b3999385247f: port 4(vethdacc4ba) entered disabled state
Apr 26 20:37:16 Tower kernel: device vethdacc4ba left promiscuous mode
Apr 26 20:37:16 Tower kernel: br-b3999385247f: port 4(vethdacc4ba) entered disabled state
Apr 26 20:37:16 Tower kernel: br-b3999385247f: port 4(vethad3ff92) entered blocking state
Apr 26 20:37:16 Tower kernel: br-b3999385247f: port 4(vethad3ff92) entered disabled state
Apr 26 20:37:16 Tower kernel: device vethad3ff92 entered promiscuous mode
Apr 26 20:37:16 Tower kernel: br-b3999385247f: port 4(vethad3ff92) entered blocking state
Apr 26 20:37:16 Tower kernel: br-b3999385247f: port 4(vethad3ff92) entered forwarding state
Apr 26 20:37:17 Tower kernel: br-b3999385247f: port 4(vethad3ff92) entered disabled state
Apr 26 20:37:20 Tower kernel: vethfcf7e76: renamed from eth0
Apr 26 20:37:20 Tower kernel: docker0: port 39(vethdf98a50) entered disabled state
Apr 26 20:37:22 Tower kernel: docker0: port 39(vethdf98a50) entered disabled state
Apr 26 20:37:22 Tower kernel: device vethdf98a50 left promiscuous mode
Apr 26 20:37:22 Tower kernel: docker0: port 39(vethdf98a50) entered disabled state
Apr 26 20:37:23 Tower kernel: eth0: renamed from vethed59a92
Apr 26 20:37:23 Tower kernel: br-b3999385247f: port 4(vethad3ff92) entered blocking state
Apr 26 20:37:23 Tower kernel: br-b3999385247f: port 4(vethad3ff92) entered forwarding state
Apr 26 20:41:03 Tower root: Total Spundown: 0
Apr 26 20:46:06 Tower root: Total Spundown: 0
Apr 26 20:51:06 Tower root: Total Spundown: 0
Apr 26 20:55:30 Tower kernel: usb 1-2: USB disconnect, device number 15
Apr 26 20:55:30 Tower kernel: usb 1-2: failed to send control message: -19
Apr 26 20:55:30 Tower kernel: ch341-uart ttyUSB0: ch341-uart converter now disconnected from ttyUSB0
Apr 26 20:55:30 Tower kernel: ch341 1-2:1.0: device disconnected
Apr 26 20:55:57 Tower shutdown[17538]: shutting down for system halt
Apr 26 20:55:57 Tower init: Switching to runlevel: 0
Apr 26 20:55:57 Tower init: Trying to re-exec init
Apr 26 20:56:04 Tower kernel: mdcmd (38): nocheck cancel
Apr 26 20:56:05 Tower emhttpd: Spinning up all drives...
Apr 26 20:56:05 Tower emhttpd: read SMART /dev/sdh
Apr 26 20:56:05 Tower emhttpd: read SMART /dev/sdg
Apr 26 20:56:05 Tower emhttpd: read SMART /dev/sdd
Apr 26 20:56:05 Tower emhttpd: read SMART /dev/sde
Apr 26 20:56:05 Tower emhttpd: read SMART /dev/sdb
Apr 26 20:56:05 Tower emhttpd: read SMART /dev/sdf
Apr 26 20:56:05 Tower emhttpd: read SMART /dev/sdc
Apr 26 20:56:05 Tower emhttpd: read SMART /dev/nvme0n1
Apr 26 20:56:05 Tower emhttpd: read SMART /dev/sda
Apr 26 20:56:06 Tower emhttpd: Stopping services...
Apr 26 20:56:06 Tower root: Total Spundown: 0

I have attached all of the other diagnostics, but this is where I would guess the problem originates, or at least it might be related to the underlying issue.
The extra logs file is what I extracted before/during the reboot.
Unraid-Logs.txt
tower-diagnostics-20230426-2122.zip
Squid Posted April 26, 2023

Try setting typical idle current (or similar wording) in the BIOS to "typical". If there is no such option, then ensure that C-states are disabled. You should also try running your memory at the SPD speed (probably 2133) instead of its XMP (overclocked) profile of 3200.
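If you want to double-check what speed the modules actually end up running at after that change, and assuming dmidecode is available on your box (it normally is on Unraid, but treat that as an assumption), something like this from a terminal will show it:

# Rated speed vs. what the memory is actually configured to run at, per DIMM
dmidecode --type memory | grep -i speed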
freddy0 Posted May 2, 2023 (edited)

On 4/26/2023 at 10:40 PM, Squid said:
Try setting typical idle current (or similar wording) in the BIOS to "typical". If there is no such option, then ensure that C-states are disabled. You should also try running your memory at the SPD speed (probably 2133) instead of its XMP (overclocked) profile of 3200.

Typical idle current - done
C-state control disabled
Memory at 2133 MT/s

After some more troubleshooting and investigation I came to the conclusion that my problem is most definitely memory related. I reserved two CPU cores, because it was recommended somewhere. After that I was able to successfully SSH into the server while it was soft-locked (the web UI from a different machine was not loading correctly after a login attempt, and the local Unraid web UI view was frozen). CPU utilization was stuck at around 98% in the server's web UI view. A quick htop from the terminal showed about 85% of the CPU being used by kswapd0, and memory filled up almost completely (99.9% or so).

I ran a diagnostics scan from my SSH session; its output is the smaller file. Unraid somehow unstuck itself after I ran diagnostics / lsof / free, and once everything started working again I ran diagnostics a second time. What might have caused this?

After some further research I have now changed the dirty-cache ratios (vm.dirty_background_ratio and vm.dirty_ratio) to 1% and 2% respectively with "Tips and Tweaks" (see the sysctl sketch at the end of this post). I am not sure whether this has already solved the problem; I will definitely report back. In the meantime it would be great if someone could make any sense of the attached diagnostics and/or recommend anything else I could try.

I think it has to do with one of the Docker containers, judging from what shows up in the system logs after it all unstuck itself:

May 2 18:47:20 Tower kernel: oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/docker/8af47f41a4d2d92815f706ab6d61493c4859eb517e7a7abe51eab02584fd755a,task=node,pid=8791,uid=0
May 2 18:47:20 Tower kernel: Out of memory: Killed process 8791 (node) total-vm:57625864kB, anon-rss:53986004kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:117928kB oom_score_adj:0
May 2 18:47:24 Tower kernel: oom_reaper: reaped process 8791 (node), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

tower-diagnostics-20230502-1848.zip
tower-diagnostics-20230502-1831.zip

How can I possibly find out which container is causing this behavior?
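For reference, a minimal sketch of how the two kernel knobs mentioned above can be inspected and set by hand from an SSH session; I assume the Tips and Tweaks plugin applies the equivalent under the hood:

sysctl vm.dirty_background_ratio vm.dirty_ratio   # show the current values
sysctl -w vm.dirty_background_ratio=1             # start background writeback at 1% of RAM
sysctl -w vm.dirty_ratio=2                        # block writers once 2% of RAM is dirty
# Values set with sysctl -w do not survive a reboot, which is presumably
# why the plugin has to reapply them at boot.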
freddy0 Posted May 4, 2023 (Solution)

I want to further document what I have found and add some keywords so that anyone searching for a related topic might find this.

The issue I described is of the type "server out of memory", or OOM for short. It seems that some service on my host machine is allowed to suck up all of my RAM (all 64 GB filled). The CPU usage is also high because of the extreme RAM pressure: the kernel's memory-reclaim process (kswapd0) is constantly hammering the CPU cores. With that I had finally isolated the symptoms: CPU usage extremely high, RAM getting filled up because of heavy IO, and the system unresponsive. SSH is the only thing that still works to some extent. At this point I did not know what the source was, though.

Sorting top by memory (run top, then press Shift+M) shows that a process called /usr/local/bin/node, together with kswapd0, is eating all of this memory. This unfortunately does not help much on its own, because I suspect another program is causing such heavy IO load that the server cannot keep up and has to hold everything in RAM. The problem is likely caused either by a Docker container that is misconfigured in some way or by a plugin that does something special in the background.

To find out which container might be causing this problem, I run

docker stats --no-stream

so that it only prints the one final snapshot I need (it can take a very long time, around 4 minutes, to get a response because the server is under heavy load). Then I look for a container that is misbehaving by using abnormal amounts of RAM. My output of

docker stats --format "table {{.Container}}\t{{.CPUPerc}}\t{{.MemUsage}}" --no-stream

(the listing below shows the full default columns):

CONTAINER ID   NAME                   CPU %    MEM USAGE / LIMIT     MEM %    NET I/O           BLOCK I/O   PIDS
d4e7fd9e1f9f   dashy                  7.24%    49.56GiB / 62.73GiB   79.00%   44.4GB / 73.3MB   0B / 0B     22
9d7aa6b28199   xxxxxxx                1.47%    52.42MiB / 62.73GiB   0.08%    82.1MB / 8.88MB   0B / 0B     36
17426e04df93   xxxxxxxxxxx            0.10%    9.895MiB / 62.73GiB   0.02%    129MB / 97.7MB    0B / 0B     6
5849332c035d   xxxxxxxxxx             0.00%    40.43MiB / 62.73GiB   0.06%    2.59MB / 76.4kB   0B / 0B     12
3168dd810eb9   xxxxxxxxxxxxxxxxxxxx   0.02%    21.53MiB / 62.73GiB   0.03%    539MB / 1.78MB    0B / 0B     9
0b15f326c902   xxxxxxxxxxxxx          45.01%   230.1MiB / 62.73GiB   0.36%    260MB / 103MB     0B / 0B     22
4c04ff55ae43   xxxxxxxxxxxxxxx        0.12%    8.609MiB / 62.73GiB   0.01%    99.3MB / 90.2MB   0B / 0B     6
8e8b98f025c9   xxxxxxxxxxxxxxx        0.08%    9.469MiB / 62.73GiB   0.01%    135MB / 66.4MB    0B / 0B     6
ed9d251c5f8a   xxxxxxxxxxxxxxx        0.35%    10.41MiB / 62.73GiB   0.02%    219MB / 138MB     0B / 0B     7
124b0e9b6659   xxxxxxxxxxxxxxxxx      2.03%    178.9MiB / 62.73GiB   0.28%    472MB / 139GB     0B / 0B     42
7af51b6011e3   xxxxxxxxxxxxxx         0.00%    16.75MiB / 62.73GiB   0.03%    2.59MB / 21.8MB   0B / 0B     25
f503e0a68d75   xxxxxxxxxxx            0.89%    66.32MiB / 62.73GiB   0.10%    3.09MB / 338kB    0B / 0B     7
0162686fcb7f   xxxxxxxxxxxxxxx        0.09%    597.2MiB / 62.73GiB   0.93%    2.56MB / 39.8kB   0B / 0B     69
ffb7f7b840b4   xxxxxxx                0.04%    313.8MiB / 62.73GiB   0.49%    2.57MB / 861kB    0B / 0B     56
274f9b5071e0   xxxxxxxxxxxxxxxxx      3.87%    127.4MiB / 62.73GiB   0.20%    24.6MB / 8.31MB   0B / 0B     41

As you can see, the dashy container is hammering my memory and IO quite intensively, with about 50 GB of RAM used and 44 GB of network IO, even though it is a simple dashboard application. I suspect this is related to a toggle option in Dashy that lets the server handle all the uptime pings for a given service instead of the client (which I had toggled on). I could either remove the container or simply toggle this back so the client handles the uptime pings again. This is a hypothesis, though.
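As an interim safety net, a hard memory limit keeps one runaway container from dragging the whole host into reclaim. A sketch under the assumption that 4 GB is plenty for a dashboard container (adjust the number to your setup); on Unraid the same flags can be added to the container's Extra Parameters field in the template:

# Cap the running container at 4 GB of RAM (and no additional swap) in place:
docker update --memory=4g --memory-swap=4g dashy

# Or pass the equivalent at creation time / in the Unraid "Extra Parameters" field:
#   --memory=4g --memory-swap=4g

Once such a cap is hit, the process inside that container should get OOM-killed on its own instead of kswapd0 taking the whole server down with it.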
The problem ultimately is Dashy, and I will try to narrow it down even further.

Some other Reddit / forum posts I found to be somewhat related:
Unraid 6.8.3 OOM error on large file transfer: https://www.reddit.com/r/unRAID/comments/ig5xyy/unraid_683_oom_error/
Heavy disk IO, some application generating/checking data all the time (extremely short interval): https://www.reddit.com/r/unRAID/comments/gip2tv/high_cpu_usage_all_of_a_sudden/

Documentation reference for docker stats: https://docs.docker.com/engine/reference/commandline/stats/
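Since the freeze only shows up after hours or days, one more thing I am considering (a rough sketch of my own, not something from the Unraid or Dashy docs, and the log path is just an example) is logging periodic docker stats snapshots to a file so the culprit is visible after the fact, even if the web UI is already frozen:

#!/bin/bash
# Append a timestamped docker stats snapshot every 5 minutes.
# Could be run from the User Scripts plugin or inside a screen/tmux session.
LOG=/mnt/user/appdata/container-mem.log   # example path, adjust to your shares
while true; do
    date >> "$LOG"
    docker stats --no-stream --format "table {{.Name}}\t{{.MemUsage}}\t{{.NetIO}}" >> "$LOG"
    sleep 300
done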
freddy0 Posted May 10, 2023 (edited)

The root cause of why Dashy even created a problem of this type still needs to be found. This post is reserved for when I find out more about the problem.