kharntiitar Posted September 24, 2020

Hi one and all, for the past few weeks I've been having some weird CPU issues. I would see my Docker containers go to 80-160% CPU, and a whole heap of "lsof" tasks would start being created:

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
3035 201 39 19 186904 154500 0 S 104.6 0.5 276:22.82 netdata
9886 nobody 20 0 1745432 424548 19832 S 100.3 1.3 176:57.72 mono
8705 nobody 20 0 4245124 506380 19328 S 97.0 1.5 567:37.83 Radarr
29112 root 20 0 1142100 92976 1136 S 78.3 0.3 218:40.05 shfs
23106 root 20 0 118948 12160 4768 R 59.2 0.0 0:11.66 runc
23159 root 20 0 188844 9564 4768 R 58.6 0.0 0:10.05 runc
23700 root 20 0 188780 11592 4704 R 51.3 0.0 0:02.11 runc
6192 root 20 0 2572 92 0 R 42.4 0.0 4:09.84 lsof
11290 root 20 0 3036 2032 1788 R 42.1 0.0 2:52.91 lsof
13951 root 20 0 3036 2008 1764 R 41.8 0.0 2:10.79 lsof
1186 root 20 0 3036 1992 1748 R 41.1 0.0 5:40.82 lsof
5954 root 20 0 3036 2032 1788 R 41.1 0.0 4:29.93 lsof
17990 root 20 0 3036 1992 1748 R 41.1 0.0 1:17.00 lsof
22466 root 20 0 3036 2020 1772 R 40.8 0.0 0:14.81 lsof
14587 root 20 0 3036 2104 1860 R 40.5 0.0 2:06.38 lsof
4547 root 20 0 3036 1920 1664 R 39.8 0.0 4:48.70 lsof
8947 root 20 0 3036 2044 1804 R 39.8 0.0 3:36.63 lsof
16682 root 20 0 3036 1944 1692 R 39.8 0.0 1:32.10 lsof
17283 root 20 0 3036 2088 1844 R 39.5 0.0 1:24.76 lsof
21801 root 20 0 3036 2008 1764 R 39.5 0.0 0:22.42 lsof
18524 root 20 0 3036 2008 1764 R 39.1 0.0 1:10.05 lsof
23589 root 20 0 12488 10748 2068 R 39.1 0.0 0:02.45 find
19585 root 20 0 3036 1936 1684 R 38.8 0.0 0:46.96 lsof
8086 root 20 0 3036 1932 1684 R 38.5 0.0 3:45.83 lsof
12631 root 20 0 3036 2028 1788 R 38.5 0.0 2:34.27 lsof
13322 root 20 0 3036 1988 1748 R 38.2 0.0 2:23.64 lsof
3817 root 20 0 3036 2008 1764 R 37.2 0.0 5:01.57 lsof
18737 root 20 0 3036 2020 1772 R 37.2 0.0 1:02.63 lsof
21037 root 20 0 3036 2048 1804 R 35.9 0.0 0:28.03 lsof
11991 root 20 0 3036 1988 1748 R 35.5 0.0 2:44.38 lsof
2731 root 20 0 3036 2100 1860 R 34.5 0.0 5:15.58 lsof
23379 root 20 0 3036 2096 1860 R 28.9 0.0 0:03.30 lsof
530 root 20 0 1712768 1.1g 22432 S 8.6 3.6 179:43.71 qemu-system-x86
1968 root 20 0 21.0g 815380 20404 R 6.9 2.5 35:21.06 influxd
1027 root 20 0 1691316 1.2g 22204 S 6.6 3.7 58:46.69 qemu-system-x86
23179 nobody 20 0 4312 3276 1768 R 4.9 0.0 0:00.83 unrar
10366 root 20 0 842064 92836 6196 S 3.6 0.3 672:58.29 telegraf
23750 root 20 0 16656 4540 3652 R 2.0 0.0 0:00.06 snmpget

At first I thought the Docker containers/disk were the issue, but I learnt today that these lsof tasks are far more likely the cause, and that the Docker CPU utilisation is a symptom. All of the lsof processes look the same:

root 11289 0.0 0.0 3840 2916 ? S 14:15 0:00 sh -c LANG='en_US.UTF8' lsof -Owl /mnt/disk[0-9]* 2>/dev/null|awk '/^shfs/ && $0!~/\.AppleD(B|ouble)/ && $5=="REG"'|awk -F/ '{print $4}'
root 11290 48.7 0.0 3036 2032 ? R 14:15 3:08 lsof -Owl /mnt/disk1 /mnt/disk10 /mnt/disk11 /mnt/disk12 /mnt/disk13 /mnt/disk14 /mnt/disk15 /mnt/disk2 /mnt/disk3 /mnt/disk4 /mnt/disk5 /mnt/disk6 /mnt/disk7 /mnt/disk8 /mnt/disk9

I've grepped through various folders that I thought might contain something, and the only close match I could find was in the plugin file for Dynamix's Stop Shell:

# find /boot/config/plugins -type f -exec grep -H lsof {} \;
/boot/config/plugins/dynamix.stop.shell.plg:for PID in $(lsof /mnt/disk[0-9]* $cache /mnt/user /mnt/user0 2>/dev/null|awk '/^(bash|sh|mc) /{print $2}'); do

I have nothing in crontab that's even slightly close, and no other scripts set to run in User Scripts etc. that are at all similar. Performing a `killall lsof` will make things good for a short while, but there is still something performing an lsof on unassigned drives. I also had one Google search result which looked promising and led me here, but then I couldn't find the text within the actual page.
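For anyone decoding that pipeline: it lists open regular files on the array disks held by shfs, drops AppleDouble files, and prints the fourth path component of each file, i.e. the share name. A minimal demonstration against a canned lsof-style line (the sample line and path are invented for illustration):

```shell
# One fabricated line of `lsof` output, fields laid out as:
# COMMAND PID USER FD TYPE DEVICE SIZE NODE NAME
sample='shfs 29112 root 42r REG 8,1 1024 99 /mnt/disk1/Movies/file.mkv'

# Same filters as the mystery command: keep shfs lines, skip
# AppleDouble files, keep regular files ($5 == "REG"), then split
# the line on "/" so $4 is the first path component under /mnt/diskN.
printf '%s\n' "$sample" \
  | awk '/^shfs/ && $0!~/\.AppleD(B|ouble)/ && $5=="REG"' \
  | awk -F/ '{print $4}'
# prints: Movies
```

So whatever is running this wants a list of shares with open files on the array disks, which smells like a monitoring tool or plugin rather than anything core to Unraid.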
I did have a CPU issue a while back due to the Ryzen CPU, but that has (presumably) been resolved for a while now. I can post diagnostics if needed, but if anybody has suggestions, I'd highly appreciate it!
kharntiitar (Author) Posted September 24, 2020

So I did the `killall lsof` about an hour ago, and it's just started a dozen or so instances of the previously mentioned lsof command, and all CPU cores are at 100%. I failed to mention before that the whole system lags when this happens; even ICMP latency to it goes up by 5-20 seconds, due to CPU wait time.
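One way to find what keeps spawning these is to walk up the parent chain of a running lsof instance. A sketch (PID 11290 is taken from the ps output above; substitute a live lsof PID from your own system):

```shell
# Walk from a runaway lsof process up to init, printing each
# ancestor. The top of the chain is whatever keeps launching them
# (a plugin's sh -c wrapper, a monitoring agent, cron, etc.).
pid=11290                                 # substitute a live lsof PID
while [ -n "$pid" ] && [ "$pid" -gt 1 ]; do
    ps -o pid=,ppid=,comm= -p "$pid" || break   # process may have exited
    pid=$(ps -o ppid= -p "$pid" | tr -d ' ')
done
```

Since these lsof runs last minutes each, there should be plenty of time to catch one and trace it back.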
Osiris Posted February 15, 2022

Did you ever find a solution here? I have exactly the same issue. Are you running a checkmk container or agent?
kharntiitar (Author) Posted February 16, 2022

Oh yeah, this was an expensive issue to work out too. It turned out I had 3 drives with bad sectors, and whenever data was being read from or written to those sectors the server would crap out. Unfortunately the drives weren't reporting the issue, or at least not via the alerting methods I'd configured. I went through and looked at the drive status, and there were just these 3 with CRC errors. I removed them from the array, ran for a while, all good, so I replaced those drives and all became sweet. It may be that only 1 of the 3 was the culprit, but I replaced them all anyway; not worth the risk.
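For anyone wanting to check for the same thing, here's a sketch that pulls the raw CRC error counter out of each drive's SMART data with smartctl (from the smartmontools package). The /dev/sd? glob is an assumption; adjust it to match your controllers, and note USB/SAS drives may need extra smartctl options:

```shell
# Print the raw UDMA_CRC_Error_Count for each SATA disk. smartctl -A
# prints one attribute per line; field 2 is the attribute name and
# field 10 is the raw value. A non-zero, growing count usually means
# a bad cable or failing drive.
for dev in /dev/sd?; do
    smartctl -A "$dev" 2>/dev/null \
      | awk -v d="$dev" '$2 == "UDMA_CRC_Error_Count" {print d, "raw CRC errors:", $10}'
done
```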
dada051 Posted May 30, 2022

Same problem here. Not sure I understand why there is a correlation between lsof and a bad HDD. @Osiris did you find anything else?
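One plausible explanation for the correlation (an assumption, not something confirmed above): each lsof scan of /mnt/disk* has to stat every open file, so when a disk is struggling the scans slow to a crawl and overlap, and whatever launches them keeps piling new ones on. Processes blocked on disk I/O sit in uninterruptible sleep (state D), so a quick check during a storm is:

```shell
# List processes in uninterruptible sleep (state begins with D).
# A pile of these alongside the lsof storms points at a stalling
# disk rather than at lsof itself. NR > 1 skips the header row.
ps -eo pid,stat,comm | awk 'NR > 1 && $2 ~ /^D/'
```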