September 24, 2020

Hi one and all, for the past few weeks I've been having some weird CPU issues. I would see my docker images go to 80-160%, and a whole heap of "lsof" tasks would start being created.

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
3035 201 39 19 186904 154500 0 S 104.6 0.5 276:22.82 netdata
9886 nobody 20 0 1745432 424548 19832 S 100.3 1.3 176:57.72 mono
8705 nobody 20 0 4245124 506380 19328 S 97.0 1.5 567:37.83 Radarr
29112 root 20 0 1142100 92976 1136 S 78.3 0.3 218:40.05 shfs
23106 root 20 0 118948 12160 4768 R 59.2 0.0 0:11.66 runc
23159 root 20 0 188844 9564 4768 R 58.6 0.0 0:10.05 runc
23700 root 20 0 188780 11592 4704 R 51.3 0.0 0:02.11 runc
6192 root 20 0 2572 92 0 R 42.4 0.0 4:09.84 lsof
11290 root 20 0 3036 2032 1788 R 42.1 0.0 2:52.91 lsof
13951 root 20 0 3036 2008 1764 R 41.8 0.0 2:10.79 lsof
1186 root 20 0 3036 1992 1748 R 41.1 0.0 5:40.82 lsof
5954 root 20 0 3036 2032 1788 R 41.1 0.0 4:29.93 lsof
17990 root 20 0 3036 1992 1748 R 41.1 0.0 1:17.00 lsof
22466 root 20 0 3036 2020 1772 R 40.8 0.0 0:14.81 lsof
14587 root 20 0 3036 2104 1860 R 40.5 0.0 2:06.38 lsof
4547 root 20 0 3036 1920 1664 R 39.8 0.0 4:48.70 lsof
8947 root 20 0 3036 2044 1804 R 39.8 0.0 3:36.63 lsof
16682 root 20 0 3036 1944 1692 R 39.8 0.0 1:32.10 lsof
17283 root 20 0 3036 2088 1844 R 39.5 0.0 1:24.76 lsof
21801 root 20 0 3036 2008 1764 R 39.5 0.0 0:22.42 lsof
18524 root 20 0 3036 2008 1764 R 39.1 0.0 1:10.05 lsof
23589 root 20 0 12488 10748 2068 R 39.1 0.0 0:02.45 find
19585 root 20 0 3036 1936 1684 R 38.8 0.0 0:46.96 lsof
8086 root 20 0 3036 1932 1684 R 38.5 0.0 3:45.83 lsof
12631 root 20 0 3036 2028 1788 R 38.5 0.0 2:34.27 lsof
13322 root 20 0 3036 1988 1748 R 38.2 0.0 2:23.64 lsof
3817 root 20 0 3036 2008 1764 R 37.2 0.0 5:01.57 lsof
18737 root 20 0 3036 2020 1772 R 37.2 0.0 1:02.63 lsof
21037 root 20 0 3036 2048 1804 R 35.9 0.0 0:28.03 lsof
11991 root 20 0 3036 1988 1748 R 35.5 0.0 2:44.38 lsof
2731 root 20 0 3036 2100 1860 R 34.5 0.0 5:15.58 lsof
23379 root 20 0 3036 2096 1860 R 28.9 0.0 0:03.30 lsof
530 root 20 0 1712768 1.1g 22432 S 8.6 3.6 179:43.71 qemu-system-x86
1968 root 20 0 21.0g 815380 20404 R 6.9 2.5 35:21.06 influxd
1027 root 20 0 1691316 1.2g 22204 S 6.6 3.7 58:46.69 qemu-system-x86
23179 nobody 20 0 4312 3276 1768 R 4.9 0.0 0:00.83 unrar
10366 root 20 0 842064 92836 6196 S 3.6 0.3 672:58.29 telegraf
23750 root 20 0 16656 4540 3652 R 2.0 0.0 0:00.06 snmpget

At first I thought the docker images/disk were the issue, but I learnt today that these lsof tasks are far more likely to be the cause, and that the docker CPU utilisation is a symptom.

All of the lsof processes look the same:

root 11289 0.0 0.0 3840 2916 ? S 14:15 0:00 sh -c LANG='en_US.UTF8' lsof -Owl /mnt/disk[0-9]* 2>/dev/null|awk '/^shfs/ && $0!~/\.AppleD(B|ouble)/ && $5=="REG"'|awk -F/ '{print $4}'
root 11290 48.7 0.0 3036 2032 ? R 14:15 3:08 lsof -Owl /mnt/disk1 /mnt/disk10 /mnt/disk11 /mnt/disk12 /mnt/disk13 /mnt/disk14 /mnt/disk15 /mnt/disk2 /mnt/disk3 /mnt/disk4 /mnt/disk5 /mnt/disk6 /mnt/disk7 /mnt/disk8 /mnt/disk9

I've grepped through various folders that I thought might have something, and the only thing I could find that came close was the plugin file for dynamix' stop shell:

# find /boot/config/plugins -type f -exec grep -H lsof {} \;
/boot/config/plugins/dynamix.stop.shell.plg:for PID in $(lsof /mnt/disk[0-9]* $cache /mnt/user /mnt/user0 2>/dev/null|awk '/^(bash|sh|mc) /{print $2}'); do

I have nothing in crontab that's even slightly close, and no other scripts set to run that are at all similar in user.scripts etc.

Performing a `killall lsof` will make it good for a short while, but there is still something performing an lsof on unassigned drives.

I also had one Google search which looked promising and led me to here, but then I couldn't find the text within the actual page.

I did have a CPU issue a while back due to the Ryzen CPU, but that has (presumably) been resolved for a while now.
I can post diagnostics if needed, but if anybody has suggestions, I'd highly appreciate it!!!
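Since each lsof above is started by an `sh -c` wrapper, one way to find out what keeps launching them is to walk up the parent chain of every running lsof. The following is only a minimal sketch, assuming a Linux `/proc` filesystem and that `pgrep` is installed; the function name `ancestry` is made up for this example.

```shell
#!/bin/bash
# Print the chain of parent processes for a PID, child first,
# one "PID<TAB>command line" entry per line (reads /proc, Linux only).
ancestry() {
    local pid=$1
    while [ -r "/proc/$pid/status" ]; do
        # cmdline is NUL-separated; turn the NULs into spaces for display
        printf '%s\t%s\n' "$pid" "$(tr '\0' ' ' < "/proc/$pid/cmdline")"
        # Move up to the parent; the loop ends once PPid is 0 or the process is gone
        pid=$(awk '/^PPid:/ {print $2}' "/proc/$pid/status")
    done
}

# Walk up from every process whose name is exactly "lsof"
for p in $(pgrep -x lsof); do
    echo "== lsof PID $p =="
    ancestry "$p"
done
```

The entry just above the `sh -c ... lsof -Owl` wrapper in each chain should be whatever is spawning the scan (a plugin, the web UI, a monitoring agent, and so on).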
September 24, 2020 Author

So I did the `killall lsof` about an hour ago, and it has just started a dozen or so instances of the previously mentioned 'lsof' command, and all CPU cores are at 100%. I failed to mention before that the whole system lags when this happens; even ICMP latency to it goes up by 5-20 seconds, due to CPU wait time.
February 15, 2022

Did you ever find a solution here? I have exactly the same issue. Are you running a checkmk container or agent?
February 16, 2022 Author

Oh yeah, this was an expensive issue to work out too. It turned out I had 3 drives with bad sectors, and whenever data was being read or written on those sectors the server would crap out. Unfortunately they weren't reporting the issue, or at least not via the alerting methods I'd configured. I went through and looked at the status, and there were just these 3 with CRC errors. I removed them from the array, it ran for a while all good, so I replaced those drives and all became sweet. It may be that only one of the three was stuck, but I replaced them all anyway; not worth the risk.
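For anyone wanting to check their own drives for the same thing, the status check can be roughly scripted with `smartctl -A`. This is a sketch under assumptions: it needs smartmontools and root, it only looks at `/dev/sd*` devices, and it assumes the common ATA attribute IDs 5, 197, 198 and 199 (reallocated, pending, offline-uncorrectable and UDMA CRC error counts), which not every drive reports. The function name `flag_smart` is made up for this example.

```shell
#!/bin/bash
# Read `smartctl -A` output on stdin and print NAME=RAW for the watched
# attribute IDs whose raw value (10th column) is greater than zero.
flag_smart() {
    awk '$1 ~ /^(5|197|198|199)$/ && $10 + 0 > 0 { print $2 "=" $10 }'
}

# Check every whole-disk /dev/sd* device (partitions end in a digit and are skipped)
for dev in /dev/sd*[a-z]; do
    [ -b "$dev" ] || continue
    bad=$(smartctl -A "$dev" 2>/dev/null | flag_smart)
    if [ -n "$bad" ]; then
        printf '%s: %s\n' "$dev" "$(echo $bad)"
    fi
done
```

Any device printed here has at least one non-zero counter and is worth a closer look; an empty result does not prove a drive is healthy.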
May 30, 2022

Same problem here. I'm not sure I understand why there is a correlation between lsof and a bad HDD. @Osiris did you find anything else?