
CPU Runaway



Good Evening,

 

At about the same time every day (~2200 CST) my Unraid CPU "runs away". I grabbed diagnostics as the event occurred, before it locked up and I no longer could, and also grabbed a few htop screenshots. I don't really understand them, but I would be very appreciative if anyone could help me pinpoint what exactly is causing everything to hard crash at about the same time every day. It happened last night, and today I had MANY of my containers off, thinking it was one of the containers causing the crash. It often locks up and never recovers without a hard reset.

 

Attached are my diagnostics, captured as the event was happening, plus various screenshots that will hopefully be useful. In the system log I see nothing that "sticks out", but I am far from an expert.

 

SOS

[Screen Shot 2023-03-04 at 10:23:52 PM.png]

[Screen Shot 2023-03-04 at 10:13:17 PM.png]

[Screen Shot 2023-03-04 at 10:14:12 PM.png]

[Screen Shot 2023-03-04 at 10:15:03 PM.png]

 

tower-diagnostics-20230304-2222.zip

Link to comment
Mar  5 08:06:17 Tower kernel: oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=8acaeec5cfd427a5dc7efe8f924e23706eefe68bf4115f6bfd00aa4b8354dcb6,mems_allowed=0-1,oom_memcg=/docker/8acaeec5cfd427a5dc7efe8f924e23706eefe68bf4115f6bfd00aa4b8354dcb6,task_memcg=/docker/8acaeec5cfd427a5dc7efe8f924e23706eefe68bf4115f6bfd00aa4b8354dcb6,task=nginx,pid=5455,uid=0

Mar  5 08:06:17 Tower kernel: Memory cgroup out of memory: Killed process 5455 (nginx) total-vm:188444kB, anon-rss:90488kB, file-rss:0kB, shmem-rss:344kB, UID:0 pgtables:240kB oom_score_adj:0

Mar  5 08:11:48 Tower webGUI: Successful login user root from 192.168.1.3

Mar  5 08:12:50 Tower kernel: program smartctl is using a deprecated SCSI ioctl, please convert it to SG_IO
Mar  5 08:12:50 Tower kernel: program smartctl is using a deprecated SCSI ioctl, please convert it to SG_IO
Mar  5 08:12:50 Tower kernel: program smartctl is using a deprecated SCSI ioctl, please convert it to SG_IO

Odd, shortly after my comment above it "ran away" again. I have never seen the SCSI ioctl messages above before; could they be related?

 

Another diagnostics capture, taken as it was running away, is attached.

tower-diagnostics-20230305-0823.zip

Link to comment
3 hours ago, apandey said:

I am on mobile, so I haven't checked your diagnostics.

 

Can you check the Docker tab in advanced view to see if a specific container is eating up all the CPU? Or type docker stats in a terminal.

 

The memory being full is also not good. The first step is to separate the cause and effect. 
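
For reference, this is roughly what that terminal check can look like (assuming the stock Docker CLI on Unraid; the format string just trims the output to the useful columns):

# One-shot snapshot of per-container CPU and memory usage
docker stats --no-stream --format "table {{.Name}}\t{{.CPUPerc}}\t{{.MemUsage}}\t{{.MemPerc}}"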

Last night I re-created the docker image (and converted it from a file to a folder) and haven't had any issues yet. But we will see. If I see more, and I'd say there's a good chance, I will run docker stats.
 

If you do still get time to look through the diagnostics and see anything helpful, please let me know.

Edited by blaine07
Link to comment

The main thing I see is constant app crashes due to lack of memory. Are you running some sort of webserver that is exposed to other users? Any chance you are getting unexpectedly high traffic?

 

If it happens again, try to look at docker resource utilization. I have a Grafana dashboard set up, which helps a lot for looking back at trending data.
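
For anyone without a dashboard, a rough sketch of collecting similar trending data from the Unraid terminal; the log path is just an example and can be anywhere that survives a lockup:

# Append a timestamped docker stats snapshot every 60 seconds
while true; do
    echo "=== $(date) ===" >> /mnt/user/appdata/docker-stats.log
    docker stats --no-stream --format "{{.Name}} {{.CPUPerc}} {{.MemPerc}}" >> /mnt/user/appdata/docker-stats.log
    sleep 60
done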

Link to comment
26 minutes ago, apandey said:

The main thing I see is constant app crashes due to lack of memory. Are you running some sort of webserver that is exposed to other users? Any chance you are getting unexpectedly high traffic?

 

If it happens again, try to look at docker resource utilization. I have a Grafana dashboard set up, which helps a lot for looking back at trending data.

I did have a GRAV server, but while troubleshooting it should have been "off".
 


Yeah, once the CPU would run away, memory utilization would max out as well.

Link to comment
4 hours ago, apandey said:

The main thing I see is constant app crashes due to lack of memory. Are you running some sort of webserver that is exposed to other users? Any chance you are getting unexpectedly high traffic?

 

If it happens again, try to look at docker resource utilization. I have a Grafana dashboard set up, which helps a lot for looking back at trending data.

This happened again this afternoon. Unfortunately it got too locked up before I caught it to grab any logs.
 

Any other ideas?

Link to comment
1 minute ago, JorgeB said:

First, try to identify what is invoking the OOM killer, possibly a container. Disable all containers and then enable them one by one to see if you can find the culprit; if you find it, limit its resources.

I basically only have a core set of containers running - I've been experimenting with most of them not running at all. When I enable containers one by one, what exactly am I looking for to determine which one is the culprit? Are we positive it's a single container? (Sorry, I genuinely want to understand.)

Link to comment
5 minutes ago, JorgeB said:

First, try to identify what is invoking the OOM killer, possibly a container. Disable all containers and then enable them one by one to see if you can find the culprit; if you find it, limit its resources.

I see this container ID referenced in the OOM message. How can I map this string to the exact container?

 

 

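For what it's worth, a quicker way to map a cgroup ID like that to a container name, assuming the stock Docker CLI (the grep pattern below is the start of the ID from the earlier syslog excerpt; substitute the one from your own oom-kill line):

# List full (untruncated) container IDs next to their names and filter for the hash
docker ps --no-trunc --format "{{.ID}}  {{.Names}}" | grep 8acaeec5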

Link to comment

It appears that the string above maps to Nginx Proxy Manager. How much/what should I limit CPU usage to? It's the only place I see OOM, though, and it really wasn't at the time the system "crashed".

 

And it already has "--memory=1G --no-healthcheck" in extra parameters?
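
For context, a CONSTRAINT_MEMCG kill in those log lines means the process hit the container's own memory limit (the 1G set by --memory) rather than the host running out of RAM. A quick way to double-check the configured limit and the live usage against it, assuming the container really is named NginxProxyManager:

# Configured memory limit for the container, in bytes (0 = unlimited)
docker inspect --format '{{.HostConfig.Memory}}' NginxProxyManager
# Live memory usage versus that limit
docker stats --no-stream NginxProxyManager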

Edited by blaine07
Link to comment
3 minutes ago, JorgeB said:

 

 

Look for lines like this in the syslog:

 

Mar  6 21:36:36 Tower kernel: nginx invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=0

 

I found that, and the container listed with it (right above it, before the OOM) is NginxProxyManager - I restricted its CPU cores, but it already has "--memory=1G --no-healthcheck" in extra parameters?
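
A quick way to pull all of those OOM events out at once, assuming Unraid's default /var/log/syslog location:

# List every OOM-related event the kernel has logged since the last boot
grep -iE "oom-kill|out of memory|oom_reaper" /var/log/syslog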

Link to comment
4 minutes ago, JorgeB said:

Where are you seeing that? Usually there's no reference to what's causing the OOM, just to the app that invoked it because it didn't have enough memory.

Well the long string right ABOVE your excerpt: 

[screenshot of the oom-kill line from the syslog]

 

I took that and went to "shares", then "appdata", then the "system" share, then clicked "docker", then "docker" again, then "containers", and searched for the long string above. Once I did that, I went into the corresponding folder and downloaded "hostconfig", which let me determine that the long string was referencing NginxProxyManager. I don't know if that's right or if it's the culprit, but that's how I arrived at it. I did limit CPU for NPM too, though.

 

Link to comment
2 minutes ago, JorgeB said:

I meant where are you seeing that in the syslog?

Well, it was:

 

Mar  6 21:36:36 Tower kernel: oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=b7bf47074734f898f67616851b6c9c6128f182ef264006024be566416b2d07e1,mems_allowed=0-1,oom_memcg=/docker/b7bf47074734f898f67616851b6c9c6128f182ef264006024be566416b2d07e1,task_memcg=/docker/b7bf47074734f898f67616851b6c9c6128f182ef264006024be566416b2d07e1,task=nginx,pid=25095,uid=0
Mar  6 21:36:36 Tower kernel: Memory cgroup out of memory: Killed process 25095 (nginx) total-vm:274036kB, anon-rss:176240kB, file-rss:4kB, shmem-rss:516kB, UID:0 pgtables:412kB oom_score_adj:0
Mar  6 21:36:38 Tower kernel: oom_reaper: reaped process 25095 (nginx), now anon-rss:0kB, file-rss:0kB, shmem-rss:516kB
Mar  6 21:44:06 Tower root: Fix Common Problems Version 2023.03.04
Mar  6 21:44:08 Tower root: Fix Common Problems: Warning: unRaids built in FTP server is running ** Ignored
Mar  6 21:44:16 Tower root: Fix Common Problems: Error: Out Of Memory errors detected on your server
Mar  6 21:44:29 Tower root: Fix Common Problems: Warning: Wrong DNS entry for host ** Ignored

 


Link to comment

Just a little while ago it tried to lock up again:

Mar  7 06:19:31 Tower kernel: oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=0c85bb041886edc37981442c550f8522b2687c54eddc7769fe20d345c7a32c92,mems_allowed=0-1,oom_memcg=/docker/0c85bb041886edc37981442c550f8522b2687c54eddc7769fe20d345c7a32c92,task_memcg=/docker/0c85bb041886edc37981442c550f8522b2687c54eddc7769fe20d345c7a32c92,task=nginx,pid=16299,uid=0
Mar  7 06:19:31 Tower kernel: Memory cgroup out of memory: Killed process 16299 (nginx) total-vm:271632kB, anon-rss:173940kB, file-rss:0kB, shmem-rss:112kB, UID:0 pgtables:424kB oom_score_adj:0

Link to comment
