Mbeco Posted July 31, 2023

Hi, I have had a problem with my Unraid server for a while now. Upfront, I think my problem might be related to this one: [linked thread]

Essentially, my Ryzen 7 CPU spikes to 100%. As a result, the currently running parity sync (I had to change a disk, so my data is not even secure right now) progresses at ~5 KB/s, my dockers are barely accessible, and the Unraid GUI itself becomes so slow that it eventually does not load at all. Rebooting only solves the problem for a couple of minutes. The problem only seems to exist when the array is on.

I can still access the server via SSH, which is how I exported the attached diagnostics. I am not sure where to continue looking and would appreciate any help.

Attachment: tower-diagnostics-2023.zip
itimpi Posted July 31, 2023

Had the problem occurred when these diagnostics were captured?
Mbeco (Author) Posted July 31, 2023

Yes, it had, and it is still occurring right now.
Mbeco (Author) Posted July 31, 2023

After killing nginx (not stopping it) and starting it again, I can at least access the GUI again, although CPU usage is still high. I expect the problem to occur again soon.
DrSpaldo Posted July 31, 2023

@Mbeco I feel like you are going on the same crazy ride as I have been.
Mbeco (Author) Posted July 31, 2023

Yes, I am following your thread. I have been using Unraid for years, and this is the first time I feel they messed up and not me. It is a bit unsatisfactory not to have any proper solution to this problem.
DrSpaldo Posted July 31, 2023

Absolutely. I have had no issues with Unraid for some time. It started happening around the time I updated from one of the older versions, but I can't 100% confirm that. I wish I remembered everything I changed, because it was so much. I'm much more stable now, just maxing out once a week during my appdata/CA backup day. Tomorrow, if I get some time, I will go through all the things I changed to see if that helps you.

How much RAM do you have in total, and how many dockers / VMs? How does their system utilisation look compared to your resources?
Mbeco (Author) Posted July 31, 2023

18 dockers (most of them low- to no-effort)
2 VMs (one on 2 shared cores, one on 4 dedicated cores, totalling 20GB of RAM)

I am running a Ryzen 7 with 64GB of RAM, which is at an expected 42% utilisation. The CPU, however, is pinned at 100% except for the dedicated cores, sometimes going down for a bit only to go up again. Nothing that is running would justify that level of CPU utilisation.

Edit: Adding to that, my docker CPU utilisation in the advanced view shows 0-0.02% utilisation per docker. Full list of dockers I am running:

Cloudflared
wishthis
MariaDB-Monica
MariaDB-WishThis
pairdrop
duplicati
PaperlessRedis
paperless-ngx
PasswordPusherEphemeral
send
SendRedis
storagenode-v3
PhotoprismMariaDB-Official
PhotoPrism
AdGuard-Home
Cloudflare-DDNS
WireGuard-Easy
scrutiny

The parity sync is PAUSED right now; utilisation is still close to 100%.

Edited July 31, 2023 by Mbeco
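Since per-container CPU usage is near zero while the overall figure sits at 100%, it may help to see which kind of CPU time is actually accumulating. The following is only a minimal Python sketch, assuming a Linux host where `/proc/stat` is readable and a Python interpreter is available (which may require a plugin or a container on Unraid). The helper names `parse_cpu_line`, `cpu_breakdown` and `sample_and_print` are made up for illustration. It samples the aggregate `cpu` line twice and reports each category's share of the elapsed time; a large `iowait` share would hint at storage rather than compute.

```python
import time

# Order of the first eight counters on the aggregate "cpu" line of /proc/stat.
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_cpu_line(line):
    """Parse the aggregate 'cpu' line of /proc/stat into a dict of tick counts."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    values = [int(v) for v in parts[1:1 + len(FIELDS)]]
    # Very old kernels expose fewer counters; pad so every field exists.
    values += [0] * (len(FIELDS) - len(values))
    return dict(zip(FIELDS, values))


def cpu_breakdown(before, after):
    """Return each field's share (in percent) of the time between two samples."""
    deltas = {name: after[name] - before[name] for name in FIELDS}
    total = sum(deltas.values())
    if total <= 0:
        return {name: 0.0 for name in FIELDS}
    return {name: 100.0 * delta / total for name, delta in deltas.items()}


def read_cpu_sample(path="/proc/stat"):
    """Read one sample from /proc/stat (the first line is the aggregate)."""
    with open(path) as handle:
        return parse_cpu_line(handle.readline())


def sample_and_print(interval=5.0):
    """Take two samples 'interval' seconds apart and print the breakdown."""
    first = read_cpu_sample()
    time.sleep(interval)
    second = read_cpu_sample()
    for name, share in cpu_breakdown(first, second).items():
        print(f"{name:8s} {share:5.1f}%")
```

Calling `sample_and_print()` over SSH while the GUI reports 100% would show whether the load is mostly `user`/`system` time or mostly `iowait`.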
grumpy Posted July 31, 2023

Why not stop everything and let Unraid run by itself with the array started? After that, start the parity sync and let it finish, then start the docker containers one by one to see where the issue occurs, and finally the VMs.
Mbeco (Author) Posted July 31, 2023

TBH, I use the services I host so frequently that I cannot afford that much downtime. What's interesting is that this started somewhere around the time I upgraded from 6.11.X to 6.12.X. I cannot rule out the VMs with 100% certainty, but I am fairly confident they are not the cause: they run on their dedicated cores, so they should not have an influence. As for the dockers, I could turn off all but 5 of them and see how the system behaves, but the remaining 5 are quite crucial for me.
grumpy Posted July 31, 2023

It's been a while since I have done tech support (my mind is leaving me), but until you can figure out which layer is causing your issues, you will not be able to fix it. Unraid is made up of many layers and you need to identify the problem layer. That takes time; otherwise you may introduce other issues while trying to guess what the problem is. Let Unraid run with just the array started and see what it is doing. If there are no issues, move on to the VMs. If there are still no issues, start a docker container... rinse and repeat until you find the issue. Then you can ask for more direct help, or a light bulb turns on and you figure it out yourself.

I do understand not wanting to stop everything. I have been working on my own Unraid issues, and time has found me a solution. Somebody smarter than me did too.

Personally, I would let the parity job finish, unless all your data is backed up or of no consequence if you lose it.
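The layer-by-layer approach described above can be written down as a small loop. This is only an illustrative Python sketch, not an Unraid tool: `find_first_bad_layer`, `start_layer` and `is_healthy` are hypothetical names. On a real system `start_layer` might wrap something like `docker start <name>` and `is_healthy` a load check after a waiting period; here both are simulated, and the layer names are placeholders.

```python
def find_first_bad_layer(layers, start_layer, is_healthy):
    """Start layers one at a time, in order.

    Returns the first layer after whose start the system is no longer
    healthy, or None if every layer starts without trouble.
    """
    for layer in layers:
        start_layer(layer)
        if not is_healthy():
            return layer
    return None


# Simulated run: no real containers or VMs are touched here.
started = []


def fake_start(layer):
    started.append(layer)


def fake_is_healthy():
    # Pretend the system degrades once the hypothetical "container-b" runs.
    return "container-b" not in started


culprit = find_first_bad_layer(
    ["array", "vm-1", "container-a", "container-b", "container-c"],
    fake_start,
    fake_is_healthy,
)
print(culprit)
```

The loop stops at the first layer that makes the health check fail, so later layers are never started. In practice each step needs enough soak time for an intermittent problem to show up, which is where the downtime cost comes from.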
DrSpaldo Posted July 31, 2023

@Mbeco although it hasn't 100% fixed the issue (and I changed so many things in the process that they could also have played a part), one of the top 3 potential causes, I feel, was the VMs I was running. I am running a Windows 10 VM and a macOS VM. I cannot tell you which was the more likely culprit, but I lean towards Windows. What I did was move both of the VMs over to another Proxmox server I had. Obviously this is not a solution, but, like you, I didn't have the time for weeks of downtime like people above have suggested. To truly work out what is causing the issue, you would need massive amounts of server downtime as well as your own time sitting there watching.

If you can, I suggest turning off one of the VMs, or better still, if you have spare hardware, moving them over to a Proxmox instance to see if that helps your situation. If it doesn't, then I can make suggestions based on what I did for my docker troubleshooting and the changes I made along the way...

PS. I see you are using PhotoPrism. You should give a very nice project, Immich, a go. Although it is in early development, I have been very impressed so far. There is an Unraid community app as well.

Edited July 31, 2023 by DrSpaldo
Mbeco (Author) Posted August 1, 2023

Thank you @DrSpaldo. I have two Ubuntu VMs running, but they are on dedicated cores. I might still try to move them and see if it solves the problem, although tbh I have little hope.

Re Immich: I have tried that, but had a very weird problem with the database, which changed its permissions every time I started the docker. Might try it again in the future.
DrSpaldo Posted August 1, 2023

Ah ok, interesting. My Ubuntu VM is actually still running without issue. So maybe not that!
Mbeco (Author) Posted August 6, 2023

Just to update this again: the problem appeared again this morning and I cannot access the UI (after half an hour of loading or so I get a 500 Internal Server Error). I can still access the server via SSH, CPU load in htop is at normal levels, and /var/log/syslog shows:

Aug 6 14:12:15 Tower nginx: 2023/08/06 14:12:15 [error] 958#958: *2832760 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 192.168.178.53, server: , request: "GET /Docker HTTP/1.1", subrequest: "/auth-request.php", upstream: "fastcgi://unix:/var/run/php5-fpm.sock", host: "192.168.178.55", referrer: "http://192.168.178.55/Dashboard"
Aug 6 14:12:15 Tower nginx: 2023/08/06 14:12:15 [error] 958#958: *2832760 auth request unexpected status: 504 while sending to client, client: 192.168.178.53, server: , request: "GET /Docker HTTP/1.1", host: "192.168.178.55", referrer: "http://192.168.178.55/Dashboard"

Restarting nginx didn't help for now. Starting to write this post did, however: while I was writing, my dashboard suddenly loaded again. It shows CPU load pinned to 100%, while htop shows low CPU load...

Edit: Curiously, my syslog only goes back to 4am, at which point it seems some dockers restarted... Might be a trace:

Aug 6 04:40:04 Tower kernel: docker0: port 5(veth34bdb38) entered blocking state
Aug 6 04:40:04 Tower kernel: docker0: port 5(veth34bdb38) entered disabled state
Aug 6 04:40:04 Tower kernel: device veth34bdb38 entered promiscuous mode
Aug 6 04:40:12 Tower kernel: eth0: renamed from veth6ae1c22
Aug 6 04:40:12 Tower kernel: IPv6: ADDRCONF(NETDEV_CHANGE): veth34bdb38: link becomes ready
Aug 6 04:40:12 Tower kernel: docker0: port 5(veth34bdb38) entered blocking state
Aug 6 04:40:12 Tower kernel: docker0: port 5(veth34bdb38) entered forwarding state
Aug 6 04:44:49 Tower kernel: veth6ae1c22: renamed from eth0
Aug 6 04:44:49 Tower kernel: docker0: port 5(veth34bdb38) entered disabled state
Aug 6 04:44:50 Tower kernel: docker0: port 5(veth34bdb38) entered disabled state
Aug 6 04:44:50 Tower kernel: device veth34bdb38 left promiscuous mode
Aug 6 04:44:50 Tower kernel: docker0: port 5(veth34bdb38) entered disabled state
Aug 6 04:50:01 Tower rsyslogd: [origin software="rsyslogd" swVersion="8.2102.0" x-pid="1095" x-info="https://www.rsyslog.com"] rsyslogd was HUPed
Aug 6 06:06:19 Tower kernel: vetha224743: renamed from eth0
Aug 6 06:06:59 Tower kernel: eth0: renamed from veth713ced8
Aug 6 09:30:18 Tower kernel: veth713ced8: renamed from eth0
Aug 6 10:02:09 Tower kernel: eth0: renamed from vethed83cb5
Aug 6 12:16:49 Tower kernel: vethed83cb5: renamed from eth0
Aug 6 12:49:40 Tower kernel: eth0: renamed from veth3842221
Aug 6 13:55:26 Tower webGUI: Successful login user root from 192.168.178.53
Aug 6 13:56:55 Tower sshd[10509]: Connection from 192.168.178.53 port 50425 on 192.168.178.55 port 22 rdomain ""
Aug 6 13:56:55 Tower sshd[10509]: Postponed keyboard-interactive for root from 192.168.178.53 port 50425 ssh2 [preauth]
Aug 6 13:56:58 Tower sshd[10509]: Postponed keyboard-interactive/pam for root from 192.168.178.53 port 50425 ssh2 [preauth]
Aug 6 13:56:58 Tower sshd[10509]: Accepted keyboard-interactive/pam for root from 192.168.178.53 port 50425 ssh2
Aug 6 13:56:58 Tower sshd[10509]: pam_unix(sshd:session): session opened for user root(uid=0) by (uid=0)
Aug 6 13:56:58 Tower elogind-daemon[1306]: New session c3 of user root.
Aug 6 13:56:58 Tower sshd[10509]: Starting session: shell on pts/2 for root from 192.168.178.53 port 50425 id 0
Aug 6 14:00:26 Tower root: geth-ethereum: Could not download icon https://geth.ethereum.org/static/images/favicon.png
Aug 6 14:12:15 Tower nginx: 2023/08/06 14:12:15 [error] 958#958: *2832760 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 192.168.178.53, server: , request: "GET /Docker HTTP/1.1", subrequest: "/auth-request.php", upstream: "fastcgi://unix:/var/run/php5-fpm.sock", host: "192.168.178.55", referrer: "http://192.168.178.55/Dashboard"
Aug 6 14:12:15 Tower nginx: 2023/08/06 14:12:15 [error] 958#958: *2832760 auth request unexpected status: 504 while sending to client, client: 192.168.178.53, server: , request: "GET /Docker HTTP/1.1", host: "192.168.178.55", referrer: "http://192.168.178.55/Dashboard"
Aug 6 14:28:42 Tower root: geth-ethereum: Could not download icon https://geth.ethereum.org/static/images/favicon.png
Aug 6 14:28:43 Tower root: geth-ethereum: Could not download icon https://geth.ethereum.org/static/images/favicon.png
Aug 6 14:28:45 Tower emhttpd: read SMART /dev/sdc
Aug 6 14:30:35 Tower root: geth-ethereum: Could not download icon https://geth.ethereum.org/static/images/favicon.png
Aug 6 14:30:35 Tower root: geth-ethereum: Could not download icon https://geth.ethereum.org/static/images/favicon.png

Edited August 6, 2023 by Mbeco
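To see whether the GUI timeouts cluster around particular pages or times, the nginx entries in the syslog excerpt above could be pulled out with a short script. This is a rough Python sketch, assuming the log line layout shown in this post; the regular expression and the helper names `upstream_timeouts` and `count_by_request` are illustrative and may need adjusting for other nginx versions or log formats.

```python
import re
from collections import Counter

# Matches nginx "upstream timed out" errors as they appear in the excerpt above
# and captures nginx's own timestamp plus the first request: "..." field.
TIMEOUT_RE = re.compile(
    r'nginx: (?P<stamp>\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) \[error\] .*?'
    r'upstream timed out .*?request: "(?P<request>[^"]+)"'
)


def upstream_timeouts(lines):
    """Yield (timestamp, request) for each 'upstream timed out' entry."""
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            yield match.group("stamp"), match.group("request")


def count_by_request(lines):
    """Count upstream timeouts per request line."""
    return Counter(request for _, request in upstream_timeouts(lines))


# The two nginx lines quoted in the post; only the first is an upstream timeout.
SAMPLE = [
    'Aug 6 14:12:15 Tower nginx: 2023/08/06 14:12:15 [error] 958#958: *2832760 '
    'upstream timed out (110: Connection timed out) while reading response '
    'header from upstream, client: 192.168.178.53, server: , request: '
    '"GET /Docker HTTP/1.1", subrequest: "/auth-request.php", upstream: '
    '"fastcgi://unix:/var/run/php5-fpm.sock", host: "192.168.178.55", '
    'referrer: "http://192.168.178.55/Dashboard"',
    'Aug 6 14:12:15 Tower nginx: 2023/08/06 14:12:15 [error] 958#958: *2832760 '
    'auth request unexpected status: 504 while sending to client, client: '
    '192.168.178.53, server: , request: "GET /Docker HTTP/1.1", host: '
    '"192.168.178.55", referrer: "http://192.168.178.55/Dashboard"',
]

for stamp, request in upstream_timeouts(SAMPLE):
    print(stamp, request)
```

Against the live log, the same functions can be fed an open file, for example `count_by_request(open("/var/log/syslog"))`. In the excerpt the timeout occurs on the `/auth-request.php` subrequest to the PHP-FPM socket, so many hits here would suggest the PHP backend is stalling rather than nginx itself.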