Slow server with 100% usage and massive lsof process usage

DrSpaldo · June 20, 2023

Hi, in the last couple of weeks I've noticed massive CPU usage spikes, which is bringing my server to a stand still. I pretty much have to reboot the server as even after 1-2 hours it won't recover from the CPU usage spikes.

This appears to have happened after updating from Unraid 6.9.2 (after staying on it for ages) to 6.11.5.

I noticed that 6.12.0 is now out, so I am happy to update and see if that will help. But, I was wondering if anyone has any ideas of what may be causing the issue or what I can try to fix it? See attached diagnostics and a few screenshots. Thanks

spaldounraid-diagnostics-20230620-1108.zip

DrSpaldo · June 20, 2023

I've found a few similar topics, so I am trying ideas in those as well. See attached updated process list using htop.

DrSpaldo · July 1, 2023

It looks like there are quite a few users with similar situations and similar massive CPU usage isses, RAM issues and the server essentially stopping:

https://forums.unraid.net/topic/86114-nginx-running-out-of-shared-memory/

https://forums.unraid.net/bug-reports/prereleases/612-rc5-server-at-100-cpu-usage-r2394/

The list keeps going on if you search for specific errors or messages. Unfortunately there appears to be limited assistance being offered by the Unraid team, I am not sure if there is a better way to get support?

JonathanM · July 4, 2023

Does it still act up with 6.12.2 in safe mode?

DrSpaldo · July 4, 2023

I've made so many changes that I have no idea what helped and what didn't. From memory, I've changed settings within Tips & Tweaks (vm.dirty_background_ratio to 3, vm.dirty_ratio to 6, changed the CPU scaling governor to performance), replaced my MB & CPU, replaced the SATA card, removed one of the HDD's, pinned all dockers to various CPU cores as well as isolating a few CPU cores.

So since this post though, I have updated to 6.12.2 and changed some of the ram settings in VM's and dockers. In particular reduced my tmpfs size for the Plex & Frigate dockers. Touch wood, I have a 3 day uptime, which is the most for a long time.... I won't celebrate yet but let's hope these last changes are the ones that have done it.

Of note & interest (maybe to the devs/other users); when the CPU maxed out to 100% and I completed the "/etc/rc.d/rc.nginx restart" command, the CPU dropped back to normal amounts, however, there was a catch, my RAM then maxed out and the system became totally unresponsive. The maxed out CPU at least let me issue some commands, all be it very very slowly.

From my point of view, it is not something I am willing to risk happening, so it is likely I will create a second server and split some of my services to Proxmox and leave others on Unraid so the load is balanced.

DrSpaldo · July 5, 2023

Spoke too soon. Crashed today. Almost made 4 days

JorgeB · July 5, 2023

Post new diags from v6.12.2 after you see the issues.

DrSpaldo · July 5, 2023

Here they are, but, it is post reboot so I am not sure how much it helps?

spaldounraid-diagnostics-20230705-2010.zip

JorgeB · July 5, 2023

Are you sing the issue when they were saved?

DrSpaldo · July 5, 2023

No, when the issue happens I can’t do anything really on the server. So near impossible to create the diagnostics as I will have to hard reset

JorgeB · July 5, 2023

Enable the mirror to flash drive option in the syslog server then post that after the problem.

DrSpaldo · July 5, 2023

3 hours ago, JorgeB said:

Enable the mirror to flash drive option in the syslog server then post that after the problem.

Will do

DrSpaldo · July 9, 2023

So this morning I had the issue again. CPU at max and I couldn't do anything. However, I decided to just leave the server running as I didn't need it today. Between 6.30am and 2.45pm I could not use the server at all, couldn't load the webui or even connect via SSH. After that though, all back to 'normal'. I am just not sure how I can work out exactly what was happening at that time and what I can do to stop it from happening?

spaldounraid-diagnostics-20230709-1609.zip

JorgeB · July 9, 2023

OOEM killer is being invoked, try limiting more the RAM for VMs and/or docker containers, the problem is usually not just about not enough RAM but more about fragmented RAM, alternatively a small swap file on disk might help, you can use the swapfile plugin:

https://forums.unraid.net/topic/109342-plugin-swapfile-for-691/

DrSpaldo · July 12, 2023

Thanks for the post @JorgeB. I've installed the swap file plugin. What would you recommend for monitoring the ram usage as well as the fragmented RAM?

JorgeB · July 12, 2023

5 hours ago, DrSpaldo said:

What would you recommend for monitoring the ram usage as well as the fragmented RAM?

Basically just confirm no more OOM errors, FCP will warn about those.

DrSpaldo · July 30, 2023

@JorgeBI've noticed that it has not been happening as often, almost not at random like before. Except it has happened a couple times when I am doing my weekly docker/appdata backup. This week I have switched from the old CA version to the new one, unfortunately it happened again this week. Looking at the logs it seems to be another OOM error with my Frigate docker. See attached logs.

Any ideas what I can do to prevent this from happening? Out of interest though, I usually run a camera stream from Frigate to my desktop, the entire time the stream did not stop while it should have been doing the backup, which means the container didn't stop? I couldn't get into the webui or SSH in during this time, only ping.

spaldounraid-diagnostics-20230730-1453.zip

JorgeB · July 31, 2023

On 7/30/2023 at 8:29 AM, DrSpaldo said:

Except it has happened a couple times when I am doing my weekly docker/appdata backup.

Is it just the OOM or do you also see high CPU usage?

DrSpaldo · July 31, 2023

In this instance, I am not sure. I just couldn't connect to the server in any way for 4 hours or so. I could not connect to see what the server was doing

JorgeB · July 31, 2023

If possible try leaving that container off.

DrSpaldo · July 31, 2023

Its my CCTV NVR, so it has to be on

DrSpaldo · August 3, 2023

I've switched that (Frigate container) over to my Proxmox server. So I will see how it goes...

Abdel · October 30, 2023

Curious to how things are going? I'm running into the exact same issue. Wonder what your solution was.

DrSpaldo · October 31, 2023

2 hours ago, Abdel said:

Curious to how things are going? I'm running into the exact same issue. Wonder what your solution was.

I would love to say that I figured out exactly what was causing the issue, but, I didn't. I made so many changes and in the end I am not sure if it was one of them or the combination of them. I am not getting the same CPU max out or freeze/crashes that I was before. It is back to normal.

Here are some of the things that I did:

- Daily script to delete docker locks
- increase tmpfs run size script on boot
- create 4G tmpfs plex ram scratch script on boot
- use the swapfile plugin
- change the VM dirty background ratio
- Replace the MB & CPU
- Replace the SATA card
- Pinned CPU cores / docker isolation

- Move some of the things running on Unraid to my Proxmox server; this included the Frigate container (my CCTV) & the Windows VM

I would say that I noticed some differences from the top down, but, the main one was moving Frigate & the VM to the Proxmox server.

I am sorry this isn't a good answer, but I just don't know. When you start searching for similar issues, many many people are having them as well, just in slightly different ways. The Unraid staff are not really helping (only sometimes, then they just give up). Fortunately I had JorgeB trying to help in this instance.

bucketphobia · October 31, 2023

Hey @DrSpaldo, I was having a very similar issue, but slightly different - the cause of mine was a dockerd log process.

The solution to mine was to ensure that "Permit exclusive shares" is on, and that "Exclusive Access" is indeed active on my cache pool.

Hope this helps.

Slow server with 100% usage and massive lsof process usage

Recommended Posts

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Join the conversation