Crazy high CPU randomly



Hi one and all. For the past few weeks I've been having some weird CPU issues: I would see my Docker containers go to 80-160% CPU, and a whole heap of "lsof" tasks would start being created.

 

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
 3035 201       39  19  186904 154500      0 S 104.6   0.5 276:22.82 netdata
 9886 nobody    20   0 1745432 424548  19832 S 100.3   1.3 176:57.72 mono
 8705 nobody    20   0 4245124 506380  19328 S  97.0   1.5 567:37.83 Radarr
29112 root      20   0 1142100  92976   1136 S  78.3   0.3 218:40.05 shfs
23106 root      20   0  118948  12160   4768 R  59.2   0.0   0:11.66 runc
23159 root      20   0  188844   9564   4768 R  58.6   0.0   0:10.05 runc
23700 root      20   0  188780  11592   4704 R  51.3   0.0   0:02.11 runc
 6192 root      20   0    2572     92      0 R  42.4   0.0   4:09.84 lsof
11290 root      20   0    3036   2032   1788 R  42.1   0.0   2:52.91 lsof
13951 root      20   0    3036   2008   1764 R  41.8   0.0   2:10.79 lsof
 1186 root      20   0    3036   1992   1748 R  41.1   0.0   5:40.82 lsof
 5954 root      20   0    3036   2032   1788 R  41.1   0.0   4:29.93 lsof
17990 root      20   0    3036   1992   1748 R  41.1   0.0   1:17.00 lsof
22466 root      20   0    3036   2020   1772 R  40.8   0.0   0:14.81 lsof
14587 root      20   0    3036   2104   1860 R  40.5   0.0   2:06.38 lsof
 4547 root      20   0    3036   1920   1664 R  39.8   0.0   4:48.70 lsof
 8947 root      20   0    3036   2044   1804 R  39.8   0.0   3:36.63 lsof
16682 root      20   0    3036   1944   1692 R  39.8   0.0   1:32.10 lsof
17283 root      20   0    3036   2088   1844 R  39.5   0.0   1:24.76 lsof
21801 root      20   0    3036   2008   1764 R  39.5   0.0   0:22.42 lsof
18524 root      20   0    3036   2008   1764 R  39.1   0.0   1:10.05 lsof
23589 root      20   0   12488  10748   2068 R  39.1   0.0   0:02.45 find
19585 root      20   0    3036   1936   1684 R  38.8   0.0   0:46.96 lsof
 8086 root      20   0    3036   1932   1684 R  38.5   0.0   3:45.83 lsof
12631 root      20   0    3036   2028   1788 R  38.5   0.0   2:34.27 lsof
13322 root      20   0    3036   1988   1748 R  38.2   0.0   2:23.64 lsof
 3817 root      20   0    3036   2008   1764 R  37.2   0.0   5:01.57 lsof
18737 root      20   0    3036   2020   1772 R  37.2   0.0   1:02.63 lsof
21037 root      20   0    3036   2048   1804 R  35.9   0.0   0:28.03 lsof
11991 root      20   0    3036   1988   1748 R  35.5   0.0   2:44.38 lsof
 2731 root      20   0    3036   2100   1860 R  34.5   0.0   5:15.58 lsof
23379 root      20   0    3036   2096   1860 R  28.9   0.0   0:03.30 lsof
  530 root      20   0 1712768   1.1g  22432 S   8.6   3.6 179:43.71 qemu-system-x86
 1968 root      20   0   21.0g 815380  20404 R   6.9   2.5  35:21.06 influxd
 1027 root      20   0 1691316   1.2g  22204 S   6.6   3.7  58:46.69 qemu-system-x86
23179 nobody    20   0    4312   3276   1768 R   4.9   0.0   0:00.83 unrar
10366 root      20   0  842064  92836   6196 S   3.6   0.3 672:58.29 telegraf
23750 root      20   0   16656   4540   3652 R   2.0   0.0   0:00.06 snmpget

 

At first I thought the Docker containers or the disk were the issue, but I learnt today that these lsof tasks are far more likely to be the cause, and that the Docker CPU utilisation is a symptom. All of the lsof processes look the same:

 

root     11289  0.0  0.0   3840  2916 ?        S    14:15   0:00 sh -c LANG='en_US.UTF8' lsof -Owl /mnt/disk[0-9]* 2>/dev/null|awk '/^shfs/ && $0!~/\.AppleD(B|ouble)/ && $5=="REG"'|awk -F/ '{print $4}'
root     11290 48.7  0.0   3036  2032 ?        R    14:15   3:08 lsof -Owl /mnt/disk1 /mnt/disk10 /mnt/disk11 /mnt/disk12 /mnt/disk13 /mnt/disk14 /mnt/disk15 /mnt/disk2 /mnt/disk3 /mnt/disk4 /mnt/disk5 /mnt/disk6 /mnt/disk7 /mnt/disk8 /mnt/disk9
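One way to narrow down what keeps launching these is to walk up the parent chain of each lsof process. This is only a rough sketch: it assumes a procps-style `ps` and `pgrep` are available, and `ancestry` is a made-up helper name, not part of any tool.

```shell
#!/bin/bash
# Print the chain of parent processes for a given PID, one per line,
# so you can see which script or daemon ultimately launched it.
# "ancestry" is a hypothetical helper name used for this sketch.
ancestry() {
    local pid=$1
    while [ -n "$pid" ] && [ "$pid" -gt 0 ] 2>/dev/null; do
        # PID, parent PID and full command line of the current process
        ps -o pid=,ppid=,args= -p "$pid" || break
        # Move one level up; ps pads the value, so strip the spaces
        pid=$(ps -o ppid= -p "$pid" | tr -d ' ')
    done
}

# Walk the ancestry of every running lsof process (prints nothing if none).
for p in $(pgrep -x lsof); do
    echo "--- lsof PID $p"
    ancestry "$p"
done
```

The last few lines printed for each lsof should name whatever sits above the `sh -c` wrapper, which may show whether a plugin, a web UI page or a scheduled job is behind it.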

 

I've grepped through various folders that I thought might contain something, and the only thing I found that came close was the plugin file for Dynamix's stop shell:

# find /boot/config/plugins -type f -exec grep -H lsof {} \; 

/boot/config/plugins/dynamix.stop.shell.plg:for PID in $(lsof /mnt/disk[0-9]* $cache /mnt/user /mnt/user0 2>/dev/null|awk '/^(bash|sh|mc) /{print $2}'); do

I have nothing in crontab that's even slightly close, and no other scripts set to run (in user.scripts etc.) that are at all similar.

 

 

Performing a `killall lsof` will make it good for a short while, but there is still something performing an lsof on unassigned drives.

 

I also had one Google search result which looked promising and led me here, but then I couldn't find the text within the actual page.

 

I did have a CPU issue a while back due to the Ryzen CPU, but that has (presumably) been resolved for a while now.

 

I can post diagnostics if needed, but if anybody has suggestions, I'd highly appreciate it!!!

 

 


So I did the `killall lsof` about an hour ago, and it has just started a dozen or so instances of the previously mentioned `lsof` command, and all CPU cores are at 100%.

 

I failed to mention before that the whole system lags when this happens. Even ICMP (ping) latency to it goes up by 5-20 seconds, due to CPU wait time.

 

 

  • 1 year later...

Oh yeah, this was an expensive issue to work out too..

 

Turned out I had 3 drives with bad sectors, and whenever data was being read from or written to those sectors the server would crap out. Unfortunately they weren't reporting the issue, or at least not via the alerting methods I'd configured.

 

I went through and looked at the drive status, and there were just these 3 with CRC errors. I removed them from the array and it ran for a while all good, so I replaced those drives and all became sweet.
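For anyone wanting to check their own disks for the same kind of thing, here is a rough sketch. It assumes smartmontools' `smartctl` is installed and that the drives show up as `/dev/sdX`; `flag_smart` is a made-up helper name. It prints a handful of SMART attributes (5, 197, 198 and 199) only when their raw value is non-zero. Raw-value formats vary between vendors, so treat the output as a first pass rather than a verdict.

```shell
#!/bin/bash
# Read `smartctl -A` output on stdin and print NAME=RAW for a few
# attributes commonly associated with failing disks or bad links,
# but only when the raw value (10th column) is above zero.
#   5   Reallocated_Sector_Ct
#   197 Current_Pending_Sector
#   198 Offline_Uncorrectable
#   199 UDMA_CRC_Error_Count
# "flag_smart" is a hypothetical helper name used for this sketch.
flag_smart() {
    awk '$1 ~ /^(5|197|198|199)$/ && $10 + 0 > 0 { print $2 "=" $10 }'
}

# Check every /dev/sdX block device; the naming is an assumption and
# may differ on your system (NVMe devices, for example, are not covered).
for dev in /dev/sd?; do
    [ -b "$dev" ] || continue
    hits=$(smartctl -A "$dev" 2>/dev/null | flag_smart)
    if [ -n "$hits" ]; then
        # Join the flagged attributes onto one line per device
        printf '%s: %s\n' "$dev" "$(echo $hits)"
    fi
done
```

A non-zero attribute 199 often points at cabling or the controller link rather than the disk surface, so it can be worth reseating or swapping cables before condemning a drive.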

 

It may be that only 1 of the 3 was actually the problem, but I replaced them all anyway; not worth the risk.
