Slow server with 100% CPU usage and massive lsof process usage



Hi, in the last couple of weeks I've noticed massive CPU usage spikes, which are bringing my server to a standstill. I pretty much have to reboot the server, as even after 1-2 hours it won't recover from the spikes.

 

This appears to have started after updating from Unraid 6.9.2 (after staying on it for ages) to 6.11.5.

 

I noticed that 6.12.0 is now out, so I am happy to update and see if that helps. But I was wondering if anyone has any ideas about what may be causing the issue or what I can try in order to fix it? See attached diagnostics and a few screenshots. Thanks :)

2023-06-20.screenshot (3).jpg

2023-06-20.screenshot (2).jpg

2023-06-20.screenshot (1).jpg

spaldounraid-diagnostics-20230620-1108.zip

  • 2 weeks later...

It looks like there are quite a few users in similar situations, with the same massive CPU usage issues, RAM issues and the server essentially stopping:

 

https://forums.unraid.net/topic/86114-nginx-running-out-of-shared-memory/

 

https://forums.unraid.net/bug-reports/prereleases/612-rc5-server-at-100-cpu-usage-r2394/

 

The list goes on if you search for the specific errors or messages. Unfortunately there appears to be limited assistance from the Unraid team; I am not sure if there is a better way to get support?


I've made so many changes that I have no idea what helped and what didn't. From memory, I've changed settings within Tips & Tweaks (vm.dirty_background_ratio to 3, vm.dirty_ratio to 6, and the CPU scaling governor to performance), replaced my MB & CPU, replaced the SATA card, removed one of the HDDs, pinned all dockers to various CPU cores, and isolated a few CPU cores.
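
For reference, this is roughly how those Tips & Tweaks settings map to commands you could run from a shell (a sketch only; the values are the ones I used, and Tips & Tweaks applies the equivalent for you):

# Sketch: the vm dirty ratios and scaling governor I changed, applied by hand
sysctl -w vm.dirty_background_ratio=3
sysctl -w vm.dirty_ratio=6

# Set the CPU frequency scaling governor to "performance" on every core
for gov in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
    echo performance > "$gov"
done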

 

Since this post, though, I have updated to 6.12.2 and changed some of the RAM settings in the VMs and dockers, in particular reducing the tmpfs size for the Plex & Frigate dockers. Touch wood, I have three days of uptime, which is the most in a long time... I won't celebrate yet, but let's hope these last changes are the ones that have done it.
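
To clarify what I mean by the tmpfs size: it is the tmpfs mount on the container itself. A rough sketch of the Plex one (the image name, /transcode path and 256m cap are examples rather than my exact template; on Unraid the mount flag goes in the template's Extra Parameters field):

# Hypothetical example: cap the container-side tmpfs so transcode scratch can't eat host RAM
docker run -d --name plex \
  --mount type=tmpfs,destination=/transcode,tmpfs-size=256m \
  plexinc/pms-docker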

 

Of note & interest (maybe to the devs/other users): when the CPU maxed out at 100% and I ran the "/etc/rc.d/rc.nginx restart" command, the CPU dropped back to normal levels. However, there was a catch: my RAM then maxed out and the system became totally unresponsive. The maxed-out CPU at least let me issue some commands, albeit very, very slowly.
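
If it helps anyone reproduce this, the sequence is simply: check the syslog for nginx complaints, then restart nginx (the grep beforehand is just a suggestion based on the shared-memory thread linked earlier; the restart command is the one I actually ran):

# Grab recent nginx messages before restarting, in case they point at the shared-memory errors
grep -i nginx /var/log/syslog | tail -n 20

# The restart I ran; it cleared the CPU, but then RAM filled up
/etc/rc.d/rc.nginx restart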

 

From my point of view, it is not something I am willing to risk happening again, so I will likely build a second server and split my services, moving some to Proxmox and leaving others on Unraid so the load is balanced.


So this morning I had the issue again: CPU at max and I couldn't do anything. However, I decided to just leave the server running, as I didn't need it today. Between 6.30am and 2.45pm I could not use the server at all; I couldn't load the webui or even connect via SSH. After that, though, everything was back to 'normal'. I am just not sure how I can work out exactly what was happening during that time, and what I can do to stop it from happening again?
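
One thing I am going to try, so there is something to look at after the next lockup, is a small snapshot script run every few minutes from cron or the User Scripts plugin (a sketch only; the paths and the interval are just my assumptions):

#!/bin/bash
# Sketch: append a CPU/memory snapshot to the flash drive so it survives a hard reboot
# Example schedule: */5 * * * * /boot/scripts/snapshot.sh
LOG=/boot/logs/cpu-snapshots.log
mkdir -p "$(dirname "$LOG")"
{
  date
  top -b -n 1 | head -n 25   # top CPU consumers right now
  free -m                    # memory and swap usage in MB
  echo "-----"
} >> "$LOG"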

spaldounraid-diagnostics-20230709-1609.zip

  • 3 weeks later...

@JorgeB I've noticed that it has not been happening as often, and it no longer seems as random as before. It has, however, happened a couple of times while my weekly docker/appdata backup was running. This week I switched from the old CA version to the new one, but unfortunately it happened again. Looking at the logs, it seems to be another OOM error with my Frigate docker. See attached logs.

 

Any ideas what I can do to prevent this from happening? Out of interest, I usually run a camera stream from Frigate to my desktop, and the stream never stopped during the time the backup should have been running, which suggests the container was never stopped? I couldn't get into the webui or SSH in during that time, only ping.
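
One idea I am looking at to stop an OOM in one container taking the whole host down is a hard memory limit on Frigate (a sketch only; the 4g figure and image tag are assumptions, the other options Frigate needs are omitted, and on Unraid the flags go in the template's Extra Parameters field):

# Hypothetical sketch: hard memory cap so the limit hits the container, not the host
docker run -d --name frigate \
  --memory=4g --memory-swap=4g \
  ghcr.io/blakeblackshear/frigate:stable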

spaldounraid-diagnostics-20230730-1453.zip

  • 2 months later...
2 hours ago, Abdel said:

Curious as to how things are going? I'm running into the exact same issue. Wonder what your solution was.

 

I would love to say that I figured out exactly what was causing the issue, but I didn't. I made so many changes that, in the end, I am not sure whether it was one of them or the combination of them. I am not getting the same CPU max-outs or freezes/crashes that I was before. It is back to normal.

 

Here are some of the things that I did:

 

- Daily script to delete docker locks
- Boot script to increase the tmpfs /run size (rough sketch after this list)
- Boot script to create a 4G tmpfs RAM scratch space for Plex (also sketched below)
- Use the swapfile plugin
- Change vm.dirty_background_ratio
- Replace the MB & CPU
- Replace the SATA card
- Pin dockers to CPU cores / isolate some cores
- Move some of the things running on Unraid to my Proxmox server; this included the Frigate container (my CCTV) & the Windows VM
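
For the two tmpfs boot scripts above, this is roughly what mine do from the go file (a sketch only; the 256m and 4g sizes and the /tmp/PlexRamScratch path are examples written from memory, not exact values):

# Sketch of boot-time additions (e.g. appended to /boot/config/go)
# Grow the /run tmpfs (size is an example)
mount -o remount,size=256m /run

# Create a RAM scratch space; the Plex container template then maps its transcode path to it
mkdir -p /tmp/PlexRamScratch
mount -t tmpfs -o size=4g tmpfs /tmp/PlexRamScratch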

 

I would say that I noticed some difference from each item going down the list, but the main one was moving Frigate & the VM to the Proxmox server.

 

I am sorry this isn't a good answer, but I just don't know. When you start searching for similar issues, you find many, many people having them as well, just in slightly different ways. The Unraid staff are not really helping (only sometimes, and then they just give up). Fortunately I had JorgeB trying to help in this instance.

