
CPU Runaway



Good Evening,

 

At about the same time every day (~2200 CST) my Unraid CPU "runs away". I grabbed diagnostics as the event occurred, before it locked up and I no longer could, and also grabbed a few htop screenshots. I don't really understand them, but I would be very appreciative if anyone could help me pinpoint what exactly is causing everything to hard crash at about the same time every day. It happened last night, and today I had MANY of my containers off, thinking it was one of the containers causing the crash. It often locks up and never recovers without a hard reset.

 

Attached are my diagnostics, captured as the event was happening, plus various screenshots that will hopefully be useful. In the system log I see nothing that "sticks out", but I am far from an expert.

 

SOS

[Screen Shot 2023-03-04 at 10:23:52 PM.png]

[Screen Shot 2023-03-04 at 10:13:17 PM.png]

[Screen Shot 2023-03-04 at 10:14:12 PM.png]

[Screen Shot 2023-03-04 at 10:15:03 PM.png]

 

tower-diagnostics-20230304-2222.zip

Link to comment
Mar  5 08:06:17 Tower kernel: oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=8acaeec5cfd427a5dc7efe8f924e23706eefe68bf4115f6bfd00aa4b8354dcb6,mems_allowed=0-1,oom_memcg=/docker/8acaeec5cfd427a5dc7efe8f924e23706eefe68bf4115f6bfd00aa4b8354dcb6,task_memcg=/docker/8acaeec5cfd427a5dc7efe8f924e23706eefe68bf4115f6bfd00aa4b8354dcb6,task=nginx,pid=5455,uid=0

Mar  5 08:06:17 Tower kernel: Memory cgroup out of memory: Killed process 5455 (nginx) total-vm:188444kB, anon-rss:90488kB, file-rss:0kB, shmem-rss:344kB, UID:0 pgtables:240kB oom_score_adj:0

Mar  5 08:11:48 Tower webGUI: Successful login user root from 192.168.1.3

Mar  5 08:12:50 Tower kernel: program smartctl is using a deprecated SCSI ioctl, please convert it to SG_IO
Mar  5 08:12:50 Tower kernel: program smartctl is using a deprecated SCSI ioctl, please convert it to SG_IO
Mar  5 08:12:50 Tower kernel: program smartctl is using a deprecated SCSI ioctl, please convert it to SG_IO

Odd, shortly after my comment above it "ran away" again. I have never seen the SCSI ioctl messages above before; could they be related?

 

Another diagnostics capture, taken as it was running away, is attached.

tower-diagnostics-20230305-0823.zip

Link to comment
3 hours ago, apandey said:

I am on mobile, so I haven't checked your diagnostics.

 

Can you check the Docker tab in advanced view to see if a specific container is eating up all the CPU? Or type docker stats in a terminal.

 

The memory being full is also not good. The first step is to separate the cause and effect. 
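
For reference, this is roughly what that terminal check can look like (assuming the stock Docker CLI on Unraid; the format string just trims the output to the useful columns):

# One-shot snapshot of per-container CPU and memory usage
docker stats --no-stream --format "table {{.Name}}\t{{.CPUPerc}}\t{{.MemUsage}}\t{{.MemPerc}}"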

Last night I re-created the docker image (and converted it from a file to a folder) and haven't had any issues yet. But we will see. If I see more, and I'd say there's a good chance, I will run docker stats.
 

If you do still get time to look through the diagnostics and see anything helpful, please let me know.

Edited by blaine07
Link to comment

The main thing I see is constant app crashes due to lack of memory. Are you running some sort of webserver that is exposed to other users? Any chance you are getting unexpectedly high traffic?

 

If it happens again, try to look at docker resource utilization. I have a Grafana dashboard set up, which helps a lot for looking back at trending data.
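
For anyone without a dashboard, a rough sketch of collecting similar trending data from the Unraid terminal; the log path is just an example and can be anywhere that survives a lockup:

# Append a timestamped docker stats snapshot every 60 seconds
while true; do
    echo "=== $(date) ===" >> /mnt/user/appdata/docker-stats.log
    docker stats --no-stream --format "{{.Name}} {{.CPUPerc}} {{.MemPerc}}" >> /mnt/user/appdata/docker-stats.log
    sleep 60
done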

Link to comment
26 minutes ago, apandey said:

The main thing I see is constant app crashes due to lack of memory. Are you running some sort of webserver that is exposed to other users? Any chance you are getting unexpectedly high traffic?

 

If it happens again, try to look at docker resource utilization. I have a Grafana dashboard set up, which helps a lot for looking back at trending data.

I did have a GRAV server, but while troubleshooting it should have been "off".
 


Yeah, once the CPU would run away, memory utilization would max out as well.

Link to comment
4 hours ago, apandey said:

The main thing I see is constant app crashes due to lack of memory. Are you running some sort of webserver that is exposed to other users? Any chance you are getting unexpectedly high traffic?

 

If it happens again, try to look at docker resource utilization. I have a Grafana dashboard set up, which helps a lot for looking back at trending data.

This happened again this afternoon. Unfortunately it got too locked up before I caught it to grab any logs.
 

Any other ideas?

Link to comment
1 minute ago, JorgeB said:

First, try to identify what is invoking the OOM killer, possibly a container. Disable all containers and then enable them one by one to see if you can find the culprit; if you find it, limit its resources.

I basically only have a core set of containers running - I've been experimenting with most of them not running at all. When I enable containers one by one, what exactly am I looking for to determine which one is the culprit? Are we positive it's a single container? (Sorry, I genuinely want to understand.)

Link to comment
5 minutes ago, JorgeB said:

First, try to identify what is invoking the OOM killer, possibly a container. Disable all containers and then enable them one by one to see if you can find the culprit; if you find it, limit its resources.

I see this container ID referenced in the OOM message. How can I map this string to the exact container?

 

 

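For what it's worth, a quicker way to map a cgroup ID like that to a container name, assuming the stock Docker CLI (the grep pattern below is the start of the ID from the earlier syslog excerpt; substitute the one from your own oom-kill line):

# List full (untruncated) container IDs next to their names and filter for the hash
docker ps --no-trunc --format "{{.ID}}  {{.Names}}" | grep 8acaeec5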

Link to comment

It appears that the string above maps to Nginx Proxy Manager. How much/what should I limit CPU usage to? It's the only place I see OOM, though, and it really wasn't at the time the system "crashed".

 

And it already has "--memory=1G --no-healthcheck" in extra parameters?
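
For context, a CONSTRAINT_MEMCG kill in those log lines means the process hit the container's own memory limit (the 1G set by --memory) rather than the host running out of RAM. A quick way to double-check the configured limit and the live usage against it, assuming the container really is named NginxProxyManager:

# Configured memory limit for the container, in bytes (0 = unlimited)
docker inspect --format '{{.HostConfig.Memory}}' NginxProxyManager
# Live memory usage versus that limit
docker stats --no-stream NginxProxyManager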

Edited by blaine07
Link to comment
3 minutes ago, JorgeB said:

 

 

Look for lines like this in the syslog:

 

Mar  6 21:36:36 Tower kernel: nginx invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=0

 

I found that, and the container listed with it (right above it, before the OOM) is NginxProxyManager - I restricted its CPU cores, but it already has "--memory=1G --no-healthcheck" in extra parameters?
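
A quick way to pull all of those OOM events out at once, assuming Unraid's default /var/log/syslog location:

# List every OOM-related event the kernel has logged since the last boot
grep -iE "oom-kill|out of memory|oom_reaper" /var/log/syslog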

Link to comment
4 minutes ago, JorgeB said:

Where are you seeing that? Usually there's no reference to what's causing the OOM, just to the app that invoked it because it didn't have enough memory.

Well the long string right ABOVE your excerpt: 

[screenshot of the oom-kill line from the syslog]

 

I took that and went to "shares", then "appdata", then the "system" share, then clicked "docker", then "docker" again, then "containers", and searched for the long string above. Once I did that, I went into the corresponding folder and downloaded "hostconfig", which let me determine that the long string was referencing NginxProxyManager. I don't know if that's right or if it's the culprit, but that's how I arrived at it. I did limit CPU for NPM too, though.

 

Link to comment
2 minutes ago, JorgeB said:

I meant where are you seeing that in the syslog?

Well, it was:

 

Mar  6 21:36:36 Tower kernel: oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=b7bf47074734f898f67616851b6c9c6128f182ef264006024be566416b2d07e1,mems_allowed=0-1,oom_memcg=/docker/b7bf47074734f898f67616851b6c9c6128f182ef264006024be566416b2d07e1,task_memcg=/docker/b7bf47074734f898f67616851b6c9c6128f182ef264006024be566416b2d07e1,task=nginx,pid=25095,uid=0
Mar  6 21:36:36 Tower kernel: Memory cgroup out of memory: Killed process 25095 (nginx) total-vm:274036kB, anon-rss:176240kB, file-rss:4kB, shmem-rss:516kB, UID:0 pgtables:412kB oom_score_adj:0
Mar  6 21:36:38 Tower kernel: oom_reaper: reaped process 25095 (nginx), now anon-rss:0kB, file-rss:0kB, shmem-rss:516kB
Mar  6 21:44:06 Tower root: Fix Common Problems Version 2023.03.04
Mar  6 21:44:08 Tower root: Fix Common Problems: Warning: unRaids built in FTP server is running ** Ignored
Mar  6 21:44:16 Tower root: Fix Common Problems: Error: Out Of Memory errors detected on your server
Mar  6 21:44:29 Tower root: Fix Common Problems: Warning: Wrong DNS entry for host ** Ignored

 


Link to comment

Just a little while ago it tried to lock up again:

Mar  7 06:19:31 Tower kernel: oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=0c85bb041886edc37981442c550f8522b2687c54eddc7769fe20d345c7a32c92,mems_allowed=0-1,oom_memcg=/docker/0c85bb041886edc37981442c550f8522b2687c54eddc7769fe20d345c7a32c92,task_memcg=/docker/0c85bb041886edc37981442c550f8522b2687c54eddc7769fe20d345c7a32c92,task=nginx,pid=16299,uid=0
Mar  7 06:19:31 Tower kernel: Memory cgroup out of memory: Killed process 16299 (nginx) total-vm:271632kB, anon-rss:173940kB, file-rss:0kB, shmem-rss:112kB, UID:0 pgtables:424kB oom_score_adj:0

Link to comment
