DrSpaldo Posted June 20, 2023

Hi, in the last couple of weeks I've noticed massive CPU usage spikes, which are bringing my server to a standstill. I pretty much have to reboot the server, as even after 1-2 hours it won't recover from the spikes. This appears to have started after updating from Unraid 6.9.2 (after staying on it for ages) to 6.11.5. I noticed that 6.12.0 is now out, so I am happy to update and see if that helps. But I was wondering if anyone has any ideas about what may be causing the issue, or what I can try to fix it? See attached diagnostics and a few screenshots. Thanks

spaldounraid-diagnostics-20230620-1108.zip
DrSpaldo Posted June 20, 2023

I've found a few similar topics, so I am trying the ideas from those as well. See attached updated process list from htop.
DrSpaldo Posted July 1, 2023

It looks like there are quite a few users in similar situations, with similar massive CPU usage issues, RAM issues, and the server essentially stopping:

https://forums.unraid.net/topic/86114-nginx-running-out-of-shared-memory/
https://forums.unraid.net/bug-reports/prereleases/612-rc5-server-at-100-cpu-usage-r2394/

The list goes on if you search for the specific errors or messages. Unfortunately there appears to be limited assistance being offered by the Unraid team; I am not sure if there is a better way to get support?
JonathanM Posted July 4, 2023

Does it still act up with 6.12.2 in safe mode?
DrSpaldo Posted July 4, 2023

I've made so many changes that I have no idea what helped and what didn't. From memory, I've changed settings within Tips & Tweaks (vm.dirty_background_ratio to 3, vm.dirty_ratio to 6, changed the CPU scaling governor to performance), replaced my MB & CPU, replaced the SATA card, removed one of the HDDs, pinned all dockers to various CPU cores, and isolated a few CPU cores.

Since this post, though, I have updated to 6.12.2 and changed some of the RAM settings for VMs and dockers; in particular, I reduced the tmpfs size for the Plex & Frigate dockers. Touch wood, I have 3 days of uptime, which is the most for a long time. I won't celebrate yet, but let's hope these last changes are the ones that have done it.

Of note and interest (maybe to the devs/other users): when the CPU maxed out at 100% and I ran the "/etc/rc.d/rc.nginx restart" command, the CPU dropped back to normal levels. However, there was a catch: my RAM then maxed out and the system became totally unresponsive. The maxed-out CPU at least let me issue some commands, albeit very, very slowly. From my point of view, it is not something I am willing to risk happening again, so it is likely I will create a second server and split some of my services to Proxmox, leaving others on Unraid so the load is balanced.
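For reference, the Tips & Tweaks changes mentioned above can also be inspected and applied from a shell. This is a hedged sketch of the equivalent manual commands (the values 3 and 6 come from the post itself; the governor path depends on your CPU frequency driver):

```shell
# Inspect the current writeback thresholds (readable without root):
cat /proc/sys/vm/dirty_background_ratio
cat /proc/sys/vm/dirty_ratio

# To apply the values from this post (requires root):
#   sysctl -w vm.dirty_background_ratio=3
#   sysctl -w vm.dirty_ratio=6

# And the "performance" scaling governor on every core (path may vary
# by CPU driver; requires root):
#   for g in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
#     echo performance > "$g"
#   done
```

Lower dirty ratios make the kernel start flushing writes to disk sooner, so less dirty data accumulates in RAM before a large flush.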
DrSpaldo Posted July 5, 2023

Spoke too soon. Crashed today. Almost made 4 days.
JorgeB Posted July 5, 2023

Post new diags from v6.12.2 after you see the issues.
DrSpaldo Posted July 5, 2023

Here they are, but it is post-reboot, so I am not sure how much it helps?

spaldounraid-diagnostics-20230705-2010.zip
JorgeB Posted July 5, 2023

Were you seeing the issue when they were saved?
DrSpaldo Posted July 5, 2023

No, when the issue happens I can't really do anything on the server, so it's near impossible to create the diagnostics, as I have to hard reset.
JorgeB Posted July 5, 2023

Enable the mirror to flash drive option in the syslog server then post that after the problem.
DrSpaldo Posted July 5, 2023

3 hours ago, JorgeB said:
"Enable the mirror to flash drive option in the syslog server then post that after the problem."

Will do
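One advantage of the mirror-to-flash option is that the log survives a hard reset. A hedged sketch of how to retrieve it afterwards, assuming the mirrored syslog lands under /boot/logs/ on the flash drive (the exact filename varies by Unraid version):

```shell
# Newest files first; the mirrored syslog should be near the top:
ls -lt /boot/logs/ 2>/dev/null | head

# The last lines written before the hang are the interesting ones:
tail -n 50 /boot/logs/syslog-* 2>/dev/null || echo "no mirrored syslog found"
```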
DrSpaldo Posted July 9, 2023

So this morning I had the issue again: CPU at max and I couldn't do anything. However, I decided to just leave the server running, as I didn't need it today. Between 6.30am and 2.45pm I could not use the server at all; I couldn't load the webui or even connect via SSH. After that, though, everything was back to 'normal'. I am just not sure how to work out exactly what was happening during that time, or what I can do to stop it from happening?

spaldounraid-diagnostics-20230709-1609.zip
JorgeB Posted July 9, 2023

The OOM killer is being invoked. Try further limiting the RAM for VMs and/or docker containers; the problem is usually not just about not having enough RAM, but more about fragmented RAM. Alternatively, a small swap file on disk might help; you can use the swapfile plugin:

https://forums.unraid.net/topic/109342-plugin-swapfile-for-691/
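For anyone curious what the plugin does conceptually, this is a hedged sketch of setting up a small swap file by hand; run as root, and note that the path and the 2G size are examples, not from the post. On a btrfs pool the file must be created NOCOW (`touch file; chattr +C file` on the empty file) before writing to it, or swap on it will not work reliably:

```shell
SWAPFILE=/mnt/cache/swapfile        # hypothetical location on the cache pool

# Allocate a 2 GiB file of zeros (swap files must not be sparse):
dd if=/dev/zero of="$SWAPFILE" bs=1M count=2048 status=none
chmod 600 "$SWAPFILE"               # swap files should not be world-readable

mkswap "$SWAPFILE"                  # write the swap signature
swapon "$SWAPFILE"                  # activate it
swapon --show                       # confirm the swap is active
```

Even a small swap area gives the kernel somewhere to evict rarely-used pages, which can relieve pressure when free RAM is fragmented.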
DrSpaldo Posted July 12, 2023

Thanks for the post @JorgeB. I've installed the swap file plugin. What would you recommend for monitoring RAM usage as well as RAM fragmentation?
JorgeB Posted July 12, 2023

5 hours ago, DrSpaldo said:
"What would you recommend for monitoring the ram usage as well as the fragmented RAM?"

Basically just confirm there are no more OOM errors; FCP (Fix Common Problems) will warn about those.
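Beyond waiting for FCP warnings, a few quick read-only checks can be run over SSH; a sketch:

```shell
# OOM-killer events show up in the kernel log:
dmesg 2>/dev/null | grep -iE "out of memory|oom-killer" || echo "no OOM events logged"

# /proc/buddyinfo lists free memory blocks per allocation order, smallest on
# the left; plenty of small blocks but zeros in the right-hand columns
# suggests fragmented RAM (no large contiguous blocks left):
cat /proc/buddyinfo

# Overall RAM and swap usage (falls back to /proc/meminfo if free is missing):
free -h 2>/dev/null || head -n 3 /proc/meminfo
```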
DrSpaldo Posted July 30, 2023

@JorgeB I've noticed that it has not been happening as often, and no longer at random like before. Except it has happened a couple of times while doing my weekly docker/appdata backup. This week I switched from the old CA version to the new one; unfortunately, it happened again this week. Looking at the logs, it seems to be another OOM error involving my Frigate docker. See attached logs. Any ideas what I can do to prevent this from happening?

Out of interest: I usually run a camera stream from Frigate to my desktop, and the stream never stopped during the whole time the backup should have been running, which means the container didn't stop? I couldn't get into the webui or SSH in during this time, only ping.

spaldounraid-diagnostics-20230730-1453.zip
JorgeB Posted July 31, 2023

On 7/30/2023 at 8:29 AM, DrSpaldo said:
"Except it has happened a couple times when I am doing my weekly docker/appdata backup."

Is it just the OOM, or do you also see high CPU usage?
DrSpaldo Posted July 31, 2023

In this instance, I am not sure. I just couldn't connect to the server in any way for 4 hours or so, so I could not see what it was doing.
JorgeB Posted July 31, 2023

If possible, try leaving that container off.
DrSpaldo Posted July 31, 2023

It's my CCTV NVR, so it has to be on.
DrSpaldo Posted August 3, 2023

I've switched that (the Frigate container) over to my Proxmox server, so I will see how it goes...
Abdel Posted October 30, 2023

Curious as to how things are going? I'm running into the exact same issue. I wonder what your solution was.
DrSpaldo Posted October 31, 2023

2 hours ago, Abdel said:
"Curious to how things are going? I'm running into the exact same issue. Wonder what your solution was."

I would love to say that I figured out exactly what was causing the issue, but I didn't. I made so many changes that in the end I am not sure if it was one of them or the combination of them. I am no longer getting the CPU max-outs or the freezes/crashes that I was before; it is back to normal. Here are some of the things that I did:

- Daily script to delete docker locks
- Script on boot to increase the tmpfs /run size
- Script on boot to create a 4G tmpfs Plex RAM scratch disk
- Use the swapfile plugin
- Change vm.dirty_background_ratio and vm.dirty_ratio
- Replace the MB & CPU
- Replace the SATA card
- Pin CPU cores to dockers / isolate CPU cores
- Move some of the things running on Unraid to my Proxmox server, including the Frigate container (my CCTV) & the Windows VM

I noticed some difference from each of these, top down, but the main one was moving Frigate & the VM to the Proxmox server. I am sorry this isn't a good answer, but I just don't know. When you start searching for similar issues, many, many people are having them as well, just in slightly different ways. The Unraid staff are not really helping (only sometimes, then they just give up). Fortunately I had JorgeB trying to help in this instance.
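A hedged sketch of what the two tmpfs boot scripts in the list above might look like (e.g. run via the User Scripts plugin at array start; requires root). The paths and sizes are examples, and the Plex scratch mount only helps if the container maps its transcode directory to it:

```shell
# Enlarge the /run tmpfs (the "increase tmpfs run size" script; example size):
mount -o remount,size=1G /run

# Create a 4G RAM scratch disk for Plex transcoding
# (map this path as the container's transcode directory):
mkdir -p /mnt/plexram
mount -t tmpfs -o size=4G tmpfs /mnt/plexram
```

Transcoding to tmpfs keeps heavy scratch I/O off the disks, at the cost of RAM; sizing it too large is one way a container can contribute to memory pressure, which is why reducing these tmpfs sizes helped earlier in the thread.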
bucketphobia Posted October 31, 2023

Hey @DrSpaldo, I was having a very similar issue, but slightly different: the cause of mine was a dockerd log process. The solution was to ensure that "Permit exclusive shares" is on, and that "Exclusive Access" is indeed active on my cache pool. Hope this helps.