
CPU load grinds server to halt



Hi, I have had a problem with my Unraid server for a while now.

Upfront, I think my problem might be related to this one: 

Essentially, my Ryzen 7 CPU spikes to 100%. As a result, the currently running parity sync (I had to replace a disk, so my data is not even protected right now) progresses at ~5 KB/s, my Docker containers are barely accessible, and the Unraid GUI itself is so slow that it no longer loads at all. Rebooting only solves the problem for a couple of minutes, and the problem only seems to occur when the array is started. I can still access the server via SSH, which is how I exported the attached diagnostics.
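When the GUI is unresponsive, the SSH session can still show which processes are actually consuming CPU. A minimal sketch using standard `ps` (nothing Unraid-specific):

```shell
# List the 15 heaviest CPU consumers. Processes in STAT 'D'
# (uninterruptible I/O wait) are a common culprit when the GUI
# hangs even though overall CPU numbers look odd.
ps -eo pid,ppid,stat,%cpu,%mem,comm --sort=-%cpu | head -n 15
```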

 

I am not sure where to continue looking, would appreciate any help.

tower-diagnostics-2023.zip

1 minute ago, DrSpaldo said:

@Mbeco, I feel like you are going on the same crazy ride as I have been

Yes, I am following your thread. I have been using Unraid for years, and this is the first time I feel they messed up and not me. It is a bit unsatisfying not to have a proper solution to this problem.

3 minutes ago, Mbeco said:

Yes, I am following your thread. I have been using Unraid for years, and this is the first time I feel they messed up and not me. It is a bit unsatisfying not to have a proper solution to this problem.

Absolutely. I have had no issues with Unraid for some time; it started happening around the time I updated from one of the older versions, but I can't 100% confirm that. I wish I remembered everything I changed, because it was so much. I'm much more stable now, just maxing out once a week on my appdata/CA backup day.
 

Tomorrow, if I get some time, I will go through all the things I changed to see if that helps you.

 

How much RAM do you have in total, and how many Dockers/VMs? How does their system utilisation look compared to your resources?

7 minutes ago, DrSpaldo said:

Absolutely. I have had no issues with Unraid for some time; it started happening around the time I updated from one of the older versions, but I can't 100% confirm that. I wish I remembered everything I changed, because it was so much. I'm much more stable now, just maxing out once a week on my appdata/CA backup day.
 

Tomorrow, if I get some time, I will go through all the things I changed to see if that helps you.

 

How much RAM do you have in total, and how many Dockers/VMs? How does their system utilisation look compared to your resources?

18 Dockers (most of them low- to no-effort)

2 VMs (one on 2 shared cores, one on 4 dedicated cores totalling 20GB of RAM)

 

I am running a Ryzen 7 with 64GB of RAM, which is at an expected 42% utilisation. The CPU, however, is locked at 100% except for the dedicated cores, sometimes dropping for a bit only to climb again. Nothing that is running would justify that level of CPU utilisation.

 

Edit: Adding to that, my docker CPU utilisation in the advanced view shows 0-0.02% utilisation per docker.
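For a per-container view from the CLI, to cross-check what the advanced view reports, `docker stats` gives a one-shot snapshot; this is the standard Docker CLI, nothing Unraid-specific:

```shell
# One-shot snapshot of CPU and memory per container, no live refresh.
docker stats --no-stream --format 'table {{.Name}}\t{{.CPUPerc}}\t{{.MemUsage}}'
```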

 

Full list of dockers I am running:

Cloudflared
wishthis
MariaDB-Monica
MariaDB-WishThis
pairdrop
duplicati
PaperlessRedis
paperless-ngx
PasswordPusherEphemeral
send
SendRedis
storagenode-v3
PhotoprismMariaDB-Official
PhotoPrism
AdGuard-Home
Cloudflare-DDNS
WireGuard-Easy
scrutiny

 

The parity sync is PAUSED right now; utilisation is still close to 100%.

Edited by Mbeco
49 minutes ago, grumpy said:

Why not stop everything and let Unraid run by itself with the array loaded? Then turn parity back on and let it finish, then start the Docker containers one by one to see where the issue is happening, and finally the VMs.

TBH, I use the services I host so frequently that I cannot afford that much downtime. What's interesting is that this started happening around the time I upgraded from 6.11.X to 6.12.X.

I cannot rule out the VMs with 100% certainty, but they are on dedicated cores, so they should not have an influence. As for the Dockers, I could turn off all but 5 of them and see how the system behaves, but the remaining 5 are quite crucial for me.
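One way to run that experiment without clicking through the GUI is to stop everything except the critical containers in one go. A sketch over SSH; the names in `KEEP` are assumptions picked from the container list above and would need adjusting to the actual must-stay-up five:

```shell
# Stop every running container whose name is NOT in the KEEP list.
# KEEP is a placeholder set of "crucial" containers -- adjust to taste.
KEEP='AdGuard-Home|WireGuard-Easy|Cloudflared|paperless-ngx|PaperlessRedis'
docker ps --format '{{.Names}}' | grep -Ev "^(${KEEP})$" | xargs -r docker stop
```

Re-enabling them later one by one (`docker start <name>`) while watching the dashboard is the same rinse-and-repeat idea suggested above, just less downtime per step.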


It's been a while since I have done tech support (my mind is leaving me), but until you can figure out which layer is causing your issues, you will not be able to fix it.

Unraid has many layers, and you need to identify the problem layer. That takes time; otherwise you may introduce other issues by trying to guess what the problem is.

Let Unraid run with just the array started and see what it is doing. No issues? Then move on to the VMs. No issues? Then start a Docker container... rinse and repeat until you find the issue. Then you can ask for more direct help, or a light bulb turns on and you figure it out yourself.

 

I do understand not wanting to stop everything. I have been working on my own Unraid issues, and time has found me a solution. Somebody smarter than me, too.

 

Personally I would let the parity job finish, unless all your data is backed up or of no consequence if you lose it.


@Mbeco although it hasn't 100% fixed the issue (and I changed so many things in the process that any of them could also have played a part), one of the top 3 potential culprits I suspect was the VMs I was running. I am running a Windows 10 VM and a macOS VM. I cannot tell you which was the most likely culprit, but I lean towards Windows. What I did was move both of the VMs over to another Proxmox server I had. Obviously this is not a solution, but, like you, I didn't have the time for weeks of downtime like the people above have suggested, because to truly work out what is causing the issue, you would need massive amounts of server downtime as well as your own time sitting there watching.

 

If you can, I suggest turning off one of the VMs, or better still, if you have spare hardware, moving them over to a Proxmox instance to see if that helps your situation.


If it doesn't, then I can make suggestions on what I did for my docker troubleshooting & changes along the way...

 

PS: I see you are using PhotoPrism. You should give Immich, a very nice project, a go. Although it is in early development, I have been very impressed so far. There is an Unraid Community App for it as well.

Edited by DrSpaldo
9 hours ago, DrSpaldo said:

@Mbeco although it hasn't 100% fixed the issue (and I changed so many things in the process that any of them could also have played a part), one of the top 3 potential culprits I suspect was the VMs I was running. I am running a Windows 10 VM and a macOS VM. I cannot tell you which was the most likely culprit, but I lean towards Windows. What I did was move both of the VMs over to another Proxmox server I had. Obviously this is not a solution, but, like you, I didn't have the time for weeks of downtime like the people above have suggested, because to truly work out what is causing the issue, you would need massive amounts of server downtime as well as your own time sitting there watching.

 

If you can, I suggest turning off one of the VMs, or better still, if you have spare hardware, moving them over to a Proxmox instance to see if that helps your situation.


If it doesn't, then I can make suggestions on what I did for my docker troubleshooting & changes along the way...

 

PS: I see you are using PhotoPrism. You should give Immich, a very nice project, a go. Although it is in early development, I have been very impressed so far. There is an Unraid Community App for it as well.

Thank you @DrSpaldo. I have two Ubuntu VMs running, but they are on dedicated cores. I might still try to move them and see if it solves the problem, although to be honest I have little hope.

Re Immich: I have tried it, but had a very weird problem where the database changed its permissions every time I started the container. I might try it again in the future.

2 hours ago, Mbeco said:

Thank you @DrSpaldo. I have two Ubuntu VMs running, but they are on dedicated cores. I might still try to move them and see if it solves the problem, although to be honest I have little hope.

Re Immich: I have tried it, but had a very weird problem where the database changed its permissions every time I started the container. I might try it again in the future.

Ah ok, interesting. My Ubuntu VM is actually still running without issue. So maybe not that!


Just to update this again: the problem appeared again this morning, and I cannot access the UI (after half an hour of loading or so, I get a 500 Internal Server Error). I can still access the server via SSH, CPU load in htop is at normal levels, and /var/log/syslog shows

 

Aug  6 14:12:15 Tower nginx: 2023/08/06 14:12:15 [error] 958#958: *2832760 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 192.168.178.53, server: , request: "GET /Docker HTTP/1.1", subrequest: "/auth-request.php", upstream: "fastcgi://unix:/var/run/php5-fpm.sock", host: "192.168.178.55", referrer: "http://192.168.178.55/Dashboard"
Aug  6 14:12:15 Tower nginx: 2023/08/06 14:12:15 [error] 958#958: *2832760 auth request unexpected status: 504 while sending to client, client: 192.168.178.53, server: , request: "GET /Docker HTTP/1.1", host: "192.168.178.55", referrer: "http://192.168.178.55/Dashboard"
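Those errors say nginx timed out waiting on its php-fpm backend, so the web server is up but the PHP side is stuck. Two quick checks from SSH; the socket path is taken from the log lines above, and the rest is plain `ls`/`ps`:

```shell
# Does the socket nginx proxies to actually exist?
ls -l /var/run/php5-fpm.sock
# Are php-fpm workers alive, and is any stuck at high CPU or in 'D' state?
# The [p] bracket trick keeps grep from matching its own process line.
ps aux | grep '[p]hp-fpm'
```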

 

Restarting nginx didn't help for now.

Starting to write this post did, however; while writing, my dashboard suddenly loaded again. The dashboard shows CPU load pinned to 100% while htop shows low CPU load...
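One hedged explanation for that mismatch: system-wide CPU gauges can include iowait (time CPUs spend blocked on disk), while per-process CPU figures in htop do not, so a stalled disk or parity operation can pin the gauge while every process looks idle. The aggregate counters in /proc/stat make this visible; this is generic Linux, not Unraid-specific:

```shell
# The 5th numeric field of the aggregate 'cpu' line is iowait jiffies:
# cpu  user nice system idle iowait irq softirq ...
# Sample twice and compare: if iowait grows much faster than user/system,
# the box is disk-bound, not compute-bound.
grep -w 'cpu' /proc/stat
sleep 2
grep -w 'cpu' /proc/stat
```

If `vmstat` is available, `vmstat 1 5` shows the same information live in its `wa` column.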

 

Edit: Curiously, my syslog only goes back to 4 AM, at which point it seems some Docker containers restarted... Might be a lead:


Aug  6 04:40:04 Tower kernel: docker0: port 5(veth34bdb38) entered blocking state
Aug  6 04:40:04 Tower kernel: docker0: port 5(veth34bdb38) entered disabled state
Aug  6 04:40:04 Tower kernel: device veth34bdb38 entered promiscuous mode
Aug  6 04:40:12 Tower kernel: eth0: renamed from veth6ae1c22
Aug  6 04:40:12 Tower kernel: IPv6: ADDRCONF(NETDEV_CHANGE): veth34bdb38: link becomes ready
Aug  6 04:40:12 Tower kernel: docker0: port 5(veth34bdb38) entered blocking state
Aug  6 04:40:12 Tower kernel: docker0: port 5(veth34bdb38) entered forwarding state
Aug  6 04:44:49 Tower kernel: veth6ae1c22: renamed from eth0
Aug  6 04:44:49 Tower kernel: docker0: port 5(veth34bdb38) entered disabled state
Aug  6 04:44:50 Tower kernel: docker0: port 5(veth34bdb38) entered disabled state
Aug  6 04:44:50 Tower kernel: device veth34bdb38 left promiscuous mode
Aug  6 04:44:50 Tower kernel: docker0: port 5(veth34bdb38) entered disabled state
Aug  6 04:50:01 Tower rsyslogd: [origin software="rsyslogd" swVersion="8.2102.0" x-pid="1095" x-info="https://www.rsyslog.com"] rsyslogd was HUPed
Aug  6 06:06:19 Tower kernel: vetha224743: renamed from eth0
Aug  6 06:06:59 Tower kernel: eth0: renamed from veth713ced8
Aug  6 09:30:18 Tower kernel: veth713ced8: renamed from eth0
Aug  6 10:02:09 Tower kernel: eth0: renamed from vethed83cb5
Aug  6 12:16:49 Tower kernel: vethed83cb5: renamed from eth0
Aug  6 12:49:40 Tower kernel: eth0: renamed from veth3842221
Aug  6 13:55:26 Tower webGUI: Successful login user root from 192.168.178.53
Aug  6 13:56:55 Tower sshd[10509]: Connection from 192.168.178.53 port 50425 on 192.168.178.55 port 22 rdomain ""
Aug  6 13:56:55 Tower sshd[10509]: Postponed keyboard-interactive for root from 192.168.178.53 port 50425 ssh2 [preauth]
Aug  6 13:56:58 Tower sshd[10509]: Postponed keyboard-interactive/pam for root from 192.168.178.53 port 50425 ssh2 [preauth]
Aug  6 13:56:58 Tower sshd[10509]: Accepted keyboard-interactive/pam for root from 192.168.178.53 port 50425 ssh2
Aug  6 13:56:58 Tower sshd[10509]: pam_unix(sshd:session): session opened for user root(uid=0) by (uid=0)
Aug  6 13:56:58 Tower elogind-daemon[1306]: New session c3 of user root.
Aug  6 13:56:58 Tower sshd[10509]: Starting session: shell on pts/2 for root from 192.168.178.53 port 50425 id 0
Aug  6 14:00:26 Tower root: geth-ethereum: Could not download icon https://geth.ethereum.org/static/images/favicon.png
Aug  6 14:12:15 Tower nginx: 2023/08/06 14:12:15 [error] 958#958: *2832760 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 192.168.178.53, server: , request: "GET /Docker HTTP/1.1", subrequest: "/auth-request.php", upstream: "fastcgi://unix:/var/run/php5-fpm.sock", host: "192.168.178.55", referrer: "http://192.168.178.55/Dashboard"
Aug  6 14:12:15 Tower nginx: 2023/08/06 14:12:15 [error] 958#958: *2832760 auth request unexpected status: 504 while sending to client, client: 192.168.178.53, server: , request: "GET /Docker HTTP/1.1", host: "192.168.178.55", referrer: "http://192.168.178.55/Dashboard"
Aug  6 14:28:42 Tower root: geth-ethereum: Could not download icon https://geth.ethereum.org/static/images/favicon.png
Aug  6 14:28:43 Tower root: geth-ethereum: Could not download icon https://geth.ethereum.org/static/images/favicon.png
Aug  6 14:28:45 Tower emhttpd: read SMART /dev/sdc
Aug  6 14:30:35 Tower root: geth-ethereum: Could not download icon https://geth.ethereum.org/static/images/favicon.png
Aug  6 14:30:35 Tower root: geth-ethereum: Could not download icon https://geth.ethereum.org/static/images/favicon.png

 

Edited by Mbeco
