kharntiitar Posted September 24, 2020

Hi one and all, for the past few weeks I've been having some weird CPU issues. I would see my Docker containers go to 80-160% CPU, and a whole heap of "lsof" tasks would start being created:

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
3035 201 39 19 186904 154500 0 S 104.6 0.5 276:22.82 netdata
9886 nobody 20 0 1745432 424548 19832 S 100.3 1.3 176:57.72 mono
8705 nobody 20 0 4245124 506380 19328 S 97.0 1.5 567:37.83 Radarr
29112 root 20 0 1142100 92976 1136 S 78.3 0.3 218:40.05 shfs
23106 root 20 0 118948 12160 4768 R 59.2 0.0 0:11.66 runc
23159 root 20 0 188844 9564 4768 R 58.6 0.0 0:10.05 runc
23700 root 20 0 188780 11592 4704 R 51.3 0.0 0:02.11 runc
6192 root 20 0 2572 92 0 R 42.4 0.0 4:09.84 lsof
11290 root 20 0 3036 2032 1788 R 42.1 0.0 2:52.91 lsof
13951 root 20 0 3036 2008 1764 R 41.8 0.0 2:10.79 lsof
1186 root 20 0 3036 1992 1748 R 41.1 0.0 5:40.82 lsof
5954 root 20 0 3036 2032 1788 R 41.1 0.0 4:29.93 lsof
17990 root 20 0 3036 1992 1748 R 41.1 0.0 1:17.00 lsof
22466 root 20 0 3036 2020 1772 R 40.8 0.0 0:14.81 lsof
14587 root 20 0 3036 2104 1860 R 40.5 0.0 2:06.38 lsof
4547 root 20 0 3036 1920 1664 R 39.8 0.0 4:48.70 lsof
8947 root 20 0 3036 2044 1804 R 39.8 0.0 3:36.63 lsof
16682 root 20 0 3036 1944 1692 R 39.8 0.0 1:32.10 lsof
17283 root 20 0 3036 2088 1844 R 39.5 0.0 1:24.76 lsof
21801 root 20 0 3036 2008 1764 R 39.5 0.0 0:22.42 lsof
18524 root 20 0 3036 2008 1764 R 39.1 0.0 1:10.05 lsof
23589 root 20 0 12488 10748 2068 R 39.1 0.0 0:02.45 find
19585 root 20 0 3036 1936 1684 R 38.8 0.0 0:46.96 lsof
8086 root 20 0 3036 1932 1684 R 38.5 0.0 3:45.83 lsof
12631 root 20 0 3036 2028 1788 R 38.5 0.0 2:34.27 lsof
13322 root 20 0 3036 1988 1748 R 38.2 0.0 2:23.64 lsof
3817 root 20 0 3036 2008 1764 R 37.2 0.0 5:01.57 lsof
18737 root 20 0 3036 2020 1772 R 37.2 0.0 1:02.63 lsof
21037 root 20 0 3036 2048 1804 R 35.9 0.0 0:28.03 lsof
11991 root 20 0 3036 1988 1748 R 35.5 0.0 2:44.38 lsof
2731 root 20 0 3036 2100 1860 R 34.5 0.0 5:15.58 lsof
23379 root 20 0 3036 2096 1860 R 28.9 0.0 0:03.30 lsof
530 root 20 0 1712768 1.1g 22432 S 8.6 3.6 179:43.71 qemu-system-x86
1968 root 20 0 21.0g 815380 20404 R 6.9 2.5 35:21.06 influxd
1027 root 20 0 1691316 1.2g 22204 S 6.6 3.7 58:46.69 qemu-system-x86
23179 nobody 20 0 4312 3276 1768 R 4.9 0.0 0:00.83 unrar
10366 root 20 0 842064 92836 6196 S 3.6 0.3 672:58.29 telegraf
23750 root 20 0 16656 4540 3652 R 2.0 0.0 0:00.06 snmpget

At first I thought the Docker containers/disk were the issue, but I learnt today that these lsof tasks are far more likely the cause, and that the Docker CPU utilisation is a symptom. All of the lsof processes look the same:

root 11289 0.0 0.0 3840 2916 ? S 14:15 0:00 sh -c LANG='en_US.UTF8' lsof -Owl /mnt/disk[0-9]* 2>/dev/null|awk '/^shfs/ && $0!~/\.AppleD(B|ouble)/ && $5=="REG"'|awk -F/ '{print $4}'
root 11290 48.7 0.0 3036 2032 ? R 14:15 3:08 lsof -Owl /mnt/disk1 /mnt/disk10 /mnt/disk11 /mnt/disk12 /mnt/disk13 /mnt/disk14 /mnt/disk15 /mnt/disk2 /mnt/disk3 /mnt/disk4 /mnt/disk5 /mnt/disk6 /mnt/disk7 /mnt/disk8 /mnt/disk9

I've grepped through various folders that I thought might contain something, and the only close match I could find was in the plugin file for Dynamix's Stop Shell:

# find /boot/config/plugins -type f -exec grep -H lsof {} \;
/boot/config/plugins/dynamix.stop.shell.plg:for PID in $(lsof /mnt/disk[0-9]* $cache /mnt/user /mnt/user0 2>/dev/null|awk '/^(bash|sh|mc) /{print $2}'); do

I have nothing in crontab that's even slightly close, and no other scripts set to run in User Scripts etc. that are at all similar. Performing a `killall lsof` will make things good for a short while, but there is still something performing an lsof on unassigned drives. I also had one Google search result which looked promising and led me here, but then I couldn't find the text within the actual page.
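For anyone decoding that pipeline: it lists open regular files on the array disks held by shfs, drops AppleDouble files, and prints the fourth path component of each file, i.e. the share name. A minimal demonstration against a canned lsof-style line (the sample line and path are invented for illustration):

```shell
# One fabricated line of `lsof` output, fields laid out as:
# COMMAND PID USER FD TYPE DEVICE SIZE NODE NAME
sample='shfs 29112 root 42r REG 8,1 1024 99 /mnt/disk1/Movies/file.mkv'

# Same filters as the mystery command: keep shfs lines, skip
# AppleDouble files, keep regular files ($5 == "REG"), then split
# the line on "/" so $4 is the first path component under /mnt/diskN.
printf '%s\n' "$sample" \
  | awk '/^shfs/ && $0!~/\.AppleD(B|ouble)/ && $5=="REG"' \
  | awk -F/ '{print $4}'
# prints: Movies
```

So whatever is running this wants a list of shares with open files on the array disks, which smells like a monitoring tool or plugin rather than anything core to Unraid.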
I did have a CPU issue a while back due to the Ryzen CPU, but that has (presumably) been resolved for a while now. I can post diagnostics if needed, but if anybody has suggestions, I'd highly appreciate it!
kharntiitar (Author) Posted September 24, 2020

So I did the `killall lsof` about an hour ago, and it's just started a dozen or so instances of the previously mentioned lsof command, and all CPU cores are at 100%. I failed to mention before that the whole system lags when this happens; even ICMP latency to it goes up by 5-20 seconds, due to CPU wait time.
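One way to find what keeps spawning these is to walk up the parent chain of a running lsof instance. A sketch (PID 11290 is taken from the ps output above; substitute a live lsof PID from your own system):

```shell
# Walk from a runaway lsof process up to init, printing each
# ancestor. The top of the chain is whatever keeps launching them
# (a plugin's sh -c wrapper, a monitoring agent, cron, etc.).
pid=11290                                 # substitute a live lsof PID
while [ -n "$pid" ] && [ "$pid" -gt 1 ]; do
    ps -o pid=,ppid=,comm= -p "$pid" || break   # process may have exited
    pid=$(ps -o ppid= -p "$pid" | tr -d ' ')
done
```

Since these lsof runs last minutes each, there should be plenty of time to catch one and trace it back.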
Osiris Posted February 15, 2022

Did you ever find a solution here? I have exactly the same issue. Are you running a checkmk container or agent?
kharntiitar (Author) Posted February 16, 2022

Oh yeah, this was an expensive issue to work out too. It turned out I had 3 drives with bad sectors, and whenever data was being read from or written to those sectors the server would crap out. Unfortunately the drives weren't reporting the issue, or at least not via the alerting methods I'd configured. I went through and looked at the drive status, and there were just these 3 with CRC errors. I removed them from the array, ran for a while, all good, so I replaced those drives and all became sweet. It may be that only 1 of the 3 was the culprit, but I replaced them all anyway; not worth the risk.
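For anyone wanting to check for the same thing, here's a sketch that pulls the raw CRC error counter out of each drive's SMART data with smartctl (from the smartmontools package). The /dev/sd? glob is an assumption; adjust it to match your controllers, and note USB/SAS drives may need extra smartctl options:

```shell
# Print the raw UDMA_CRC_Error_Count for each SATA disk. smartctl -A
# prints one attribute per line; field 2 is the attribute name and
# field 10 is the raw value. A non-zero, growing count usually means
# a bad cable or failing drive.
for dev in /dev/sd?; do
    smartctl -A "$dev" 2>/dev/null \
      | awk -v d="$dev" '$2 == "UDMA_CRC_Error_Count" {print d, "raw CRC errors:", $10}'
done
```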
dada051 Posted May 30, 2022

Same problem here. Not sure I understand why there is a correlation between lsof and a bad HDD. @Osiris did you find anything else?
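One plausible explanation for the correlation (an assumption, not something confirmed above): each lsof scan of /mnt/disk* has to stat every open file, so when a disk is struggling the scans slow to a crawl and overlap, and whatever launches them keeps piling new ones on. Processes blocked on disk I/O sit in uninterruptible sleep (state D), so a quick check during a storm is:

```shell
# List processes in uninterruptible sleep (state begins with D).
# A pile of these alongside the lsof storms points at a stalling
# disk rather than at lsof itself. NR > 1 skips the header row.
ps -eo pid,stat,comm | awk 'NR > 1 && $2 ~ /^D/'
```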