blaine07 Posted March 5, 2023

Good evening,

At about the same time every day, ~2200 CST, my Unraid CPU "runs away". I grabbed diagnostics as the event occurred, before it locked up and I couldn't. I also grabbed a few htop screenshots. I don't really understand them, but I would be very appreciative if anyone could help me pinpoint what exactly is causing everything to hard crash at about the same time every day. It happened last night, and today I had MANY of my containers off, thinking it was one of the containers causing the crash. Often it locks up and never recovers without a hard reset.

Attached are my diagnostics taken as the event was happening, plus various screenshots that hopefully someone will find useful. In the system log I see nothing that "sticks out", but I am far from an expert. SOS

tower-diagnostics-20230304-2222.zip
blaine07 Posted March 5, 2023 (Author)

~0305-0405 this morning it did it again. Attaching another diagnostics. Please help; I have no idea what else to look at or do 😞

tower-diagnostics-20230305-0812.zip
blaine07 Posted March 5, 2023 Author Share Posted March 5, 2023 ar 5 08:06:17 Tower kernel: oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=8acaeec5cfd427a5dc7efe8f924e23706eefe68bf4115f6bfd00aa4b8354dcb6,mems_allowed=0-1,oom_memcg=/docker/8acaeec5cfd427a5dc7efe8f924e23706eefe68bf4115f6bfd00aa4b8354dcb6,task_memcg=/docker/8acaeec5cfd427a5dc7efe8f924e23706eefe68bf4115f6bfd00aa4b8354dcb6,task=nginx,pid=5455,uid=0 Mar 5 08:06:17 Tower kernel: Memory cgroup out of memory: Killed process 5455 (nginx) total-vm:188444kB, anon-rss:90488kB, file-rss:0kB, shmem-rss:344kB, UID:0 pgtables:240kB oom_score_adj:0 Mar 5 08:11:48 Tower webGUI: Successful login user root from 192.168.1.3 Mar 5 08:12:50 Tower kernel: program smartctl is using a deprecated SCSI ioctl, please convert it to SG_IO Mar 5 08:12:50 Tower kernel: program smartctl is using a deprecated SCSI ioctl, please convert it to SG_IO Mar 5 08:12:50 Tower kernel: program smartctl is using a deprecated SCSI ioctl, please convert it to SG_IO Odd, shortly after my comment above it "ran away again." I have never saw the SCSI ioctl thing above before; could it be related? Another diagnostics as it was running away is attached. tower-diagnostics-20230305-0823.zip Quote Link to comment
blaine07 Posted March 5, 2023 (Author)

Anyone think recreating docker.img might be beneficial?
apandey Posted March 6, 2023

I am on mobile, so I haven't checked your diagnostics. Can you check the Docker tab, advanced view, to see if a specific container is eating up all the CPU? Or type docker stats in a terminal. The memory being full is also not good. The first step is to separate the cause and effect.
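For example, a point-in-time snapshot is usually enough to spot the biggest consumers. This is just an illustrative command using standard Docker CLI options:

docker stats --no-stream --format "table {{.Name}}\t{{.CPUPerc}}\t{{.MemUsage}}"

Without --no-stream the same command keeps refreshing live, which is handy to leave open while the CPU is climbing.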
blaine07 Posted March 6, 2023 (Author)

3 hours ago, apandey said: I am on mobile, so I haven't checked your diagnostics. Can you check the Docker tab, advanced view, to see if a specific container is eating up all the CPU? Or type docker stats in a terminal. The memory being full is also not good. The first step is to separate the cause and effect.

Last night I re-created the docker image (and converted from file to folder) and haven't had any issue yet. But we will see. If I see more, and I'd say there's a good chance, I will run docker stats. If you do get time to look through the diagnostics and see anything helpful, please let me know.
apandey Posted March 6, 2023

The main thing I see is constant app crashes due to lack of memory. Are you running some sort of webserver that is exposed to other users? Any chance you are getting unexpectedly high traffic?

If it happens again, try to look at docker resource utilization. I have a Grafana dashboard set up, which helps a lot to look back at any trending data.
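Even without a Grafana stack, a rough trend log can be kept by appending periodic snapshots to a file. A minimal sketch; the log path and interval are placeholders (it could be run from the User Scripts plugin, for example):

while true; do
  # timestamped snapshot of per-container CPU and memory, appended once a minute
  echo "=== $(date) ===" >> /mnt/user/appdata/docker-stats.log
  docker stats --no-stream --format "{{.Name}}\t{{.CPUPerc}}\t{{.MemUsage}}" >> /mnt/user/appdata/docker-stats.log
  sleep 60
done

That way, after a crash, there is something to look back at to see which container's memory was climbing in the minutes before it.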
blaine07 Posted March 6, 2023 (Author)

26 minutes ago, apandey said: The main thing I see is constant app crashes due to lack of memory. Are you running some sort of webserver that is exposed to other users? Any chance you are getting unexpectedly high traffic? If it happens again, try to look at docker resource utilization. I have a Grafana dashboard set up, which helps a lot to look back at any trending data.

I had a Grav server, but during the troubleshooting it should have been "off". Yeah, once the CPU would run away, memory utilization would max out as well.
blaine07 Posted March 6, 2023 (Author)

4 hours ago, apandey said: The main thing I see is constant app crashes due to lack of memory. Are you running some sort of webserver that is exposed to other users? Any chance you are getting unexpectedly high traffic? If it happens again, try to look at docker resource utilization. I have a Grafana dashboard set up, which helps a lot to look back at any trending data.

This happened again this afternoon. Unfortunately it got "too locked up" before I caught it to get any logs. Any other ideas?
apandey Posted March 6, 2023

I have an InfluxDB + Grafana setup to capture metrics from my Unraid server (using Unraid Ultimate Dashboards as the base). That way I can see trending data on CPU / memory / docker resources. Useful to see what is using resources.
blaine07 Posted March 7, 2023 (Author)

Looks like last night it did it a few times and recovered each time.

tower-diagnostics-20230307-0519.zip
JorgeB Posted March 7, 2023

First try to identify what is invoking the OOM killer, possibly a container. Disable all containers and then enable them one by one to see if you can find the culprit; if you find it, limit its resources.
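On Unraid, a per-container limit usually goes in the container template's Extra Parameters field. The values below are only examples, not recommendations for any specific container, and the container name is a placeholder:

# in the template's Extra Parameters field
--memory=512m --memory-swap=512m
# or, for a container that is already running
docker update --memory=512m --memory-swap=512m <container-name>

Setting --memory-swap equal to --memory keeps the container from spilling into swap once it hits the cap, so a runaway process gets killed inside its own cgroup instead of dragging the whole server down.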
blaine07 Posted March 7, 2023 (Author)

1 minute ago, JorgeB said: First try to identify what is invoking the OOM killer, possibly a container. Disable all containers and then enable them one by one to see if you can find the culprit; if you find it, limit its resources.

I basically only have a core set of containers running; I've been playing with most not running at all. When I enable containers one by one, what exactly am I looking for to determine it's the culprit? Are we positive it's one single container? (Sorry, I genuinely want to understand.)
blaine07 Posted March 7, 2023 (Author)

5 minutes ago, JorgeB said: First try to identify what is invoking the OOM killer, possibly a container. Disable all containers and then enable them one by one to see if you can find the culprit; if you find it, limit its resources.

I see this container name referenced with the OOM. How can I turn this string into exactly which container it is?
JorgeB Posted March 7, 2023

3 minutes ago, blaine07 said: what exactly am I looking for to determine it's the culprit?

See if the OOM killer is still invoked; alternatively, limit the amount of RAM for all containers.
blaine07 Posted March 7, 2023 (Author)

1 minute ago, JorgeB said: See if the OOM killer is still invoked; alternatively, limit the amount of RAM for all containers.

Pardon my ignorance: how can I see if the OOM killer is invoked? When CPU usage "runs away", memory climbs to 100% too.
blaine07 Posted March 7, 2023 (Author)

It appears that string above goes to NginxProxyManager. How much/what should I limit CPU usage to? It's the only place I see OOM, though, and it really wasn't at the time the system "crashed"? And it already has "--memory=1G --no-healthcheck" in Extra Parameters?
JorgeB Posted March 7, 2023

14 minutes ago, blaine07 said: Pardon my ignorance: how can I see if the OOM killer is invoked?

Look for lines like this in the syslog:

Mar 6 21:36:36 Tower kernel: nginx invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=0
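A quick way to search for them from a terminal (Unraid keeps the current log at /var/log/syslog) is something along these lines:

grep "invoked oom-killer" /var/log/syslog
# or, to also catch the "Memory cgroup out of memory" and oom_reaper lines
grep -i oom /var/log/syslog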
blaine07 Posted March 7, 2023 (Author)

3 minutes ago, JorgeB said: Look for lines like this in the syslog: Mar 6 21:36:36 Tower kernel: nginx invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=0

I found that, and the container listed with it (right above it, before the OOM) is NginxProxyManager. I restricted its CPU cores, but it already has "--memory=1G --no-healthcheck" in Extra Parameters?
JorgeB Posted March 7, 2023

10 minutes ago, blaine07 said: the container listed with it (right above it, before the OOM) is NginxProxyManager

Where are you seeing that? Usually there's no reference to what's causing the OOM, just to the app that invoked it because it didn't have enough memory.
blaine07 Posted March 7, 2023 (Author)

4 minutes ago, JorgeB said: Where are you seeing that? Usually there's no reference to what's causing the OOM, just to the app that invoked it because it didn't have enough memory.

Well, the long string right ABOVE your excerpt: I took that, went to Shares, then "appdata", then the "system" share, clicked "docker", then "docker" again, then "containers", and searched for that long string. Once I found it, I went into the corresponding folder, downloaded "hostconfig", and was able to determine that the long string was referencing NginxProxyManager. I don't know if that's right or if it's the culprit, but that's how I arrived there. I did limit CPU for NPM too, though.
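For what it's worth, the long hash in the oom-kill line is the Docker container ID, so the same lookup can be done from a terminal instead of digging through the share. Illustrative commands, with the hash shortened in the first one:

docker ps -a --no-trunc --format "{{.ID}}  {{.Names}}" | grep b7bf47074734
# or resolve the full ID directly to a container name
docker inspect --format '{{.Name}}' b7bf47074734f898f67616851b6c9c6128f182ef264006024be566416b2d07e1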
JorgeB Posted March 7, 2023

I meant, where are you seeing that in the syslog?
blaine07 Posted March 7, 2023 (Author)

2 minutes ago, JorgeB said: I meant, where are you seeing that in the syslog?

Well, it was:

Mar 6 21:36:36 Tower kernel: oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=b7bf47074734f898f67616851b6c9c6128f182ef264006024be566416b2d07e1,mems_allowed=0-1,oom_memcg=/docker/b7bf47074734f898f67616851b6c9c6128f182ef264006024be566416b2d07e1,task_memcg=/docker/b7bf47074734f898f67616851b6c9c6128f182ef264006024be566416b2d07e1,task=nginx,pid=25095,uid=0
Mar 6 21:36:36 Tower kernel: Memory cgroup out of memory: Killed process 25095 (nginx) total-vm:274036kB, anon-rss:176240kB, file-rss:4kB, shmem-rss:516kB, UID:0 pgtables:412kB oom_score_adj:0
Mar 6 21:36:38 Tower kernel: oom_reaper: reaped process 25095 (nginx), now anon-rss:0kB, file-rss:0kB, shmem-rss:516kB
Mar 6 21:44:06 Tower root: Fix Common Problems Version 2023.03.04
Mar 6 21:44:08 Tower root: Fix Common Problems: Warning: unRaids built in FTP server is running ** Ignored
Mar 6 21:44:16 Tower root: Fix Common Problems: Error: Out Of Memory errors detected on your server
Mar 6 21:44:29 Tower root: Fix Common Problems: Warning: Wrong DNS entry for host ** Ignored
blaine07 Posted March 7, 2023 (Author)

Just a few minutes ago it tried to lock up:

Mar 7 06:19:31 Tower kernel: oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=0c85bb041886edc37981442c550f8522b2687c54eddc7769fe20d345c7a32c92,mems_allowed=0-1,oom_memcg=/docker/0c85bb041886edc37981442c550f8522b2687c54eddc7769fe20d345c7a32c92,task_memcg=/docker/0c85bb041886edc37981442c550f8522b2687c54eddc7769fe20d345c7a32c92,task=nginx,pid=16299,uid=0
Mar 7 06:19:31 Tower kernel: Memory cgroup out of memory: Killed process 16299 (nginx) total-vm:271632kB, anon-rss:173940kB, file-rss:0kB, shmem-rss:112kB, UID:0 pgtables:424kB oom_score_adj:0
JorgeB Posted March 7, 2023

That is after the "invoked oom-killer" line, and if I understand correctly it's the app that was killed, not the app that caused the OOM issue.
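One thing that may help separate the two: those lines show CONSTRAINT_MEMCG, i.e. a per-container memory cgroup hitting its limit rather than the whole server running out of RAM, and Docker reports that kind of container-level OOM on its event stream. An illustrative command to watch for them:

docker events --filter event=oom

Each time a container exceeds its --memory cap and the kernel kills a process inside it, an oom event with that container's name should show up here, which helps tell "this container blew through its own 1G limit" apart from "the server itself ran out of memory".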