September 11, 2024 I have had an unRAID server running for a while with two ZFS pools (3x1tb SSD and 3x4tb HDD) and one 5x2tb array serving only as a backup system for the ZFS pools. It runs on a 7700T (an engineering sample, so it's reported as a "0000" CPU), which hadn't given me any trouble until lately. For the past few months I have been facing spontaneous unresponsiveness on the server: it looks like all CPU cores go to 100%, and I can't reach the web UI or access my files via SMB. The only way to recover is to restart the server manually. I have some containers (nextcloud, immich, photoprism and a couple of personal sites) and a VM running Ubuntu with very low load (just a small script running every 15 minutes). Most of this setup was already in place before the server started giving me trouble. The server was upgraded to 7.00 beta but later downgraded to 6.12.13 (I thought the beta might be the culprit, but apparently not). I have attached the diagnostics file in case more info is needed. Any idea what might be going on? Thanks! capsulecorp-diagnostics-20240911-1024.zip
September 11, 2024 I was having similar issues. I read somewhere that you should log out of the web GUI when you are not going to interact with it for extended periods. I used to leave the session open on my desktop, even for days without using it. Since I started logging out whenever I don't plan on using the web interface, I haven't had any issues. Might be worth a try.
September 11, 2024 Author
55 minutes ago, bbrodka said: I was having similar issues. I read somewhere that you should log out of the web GUI when you are not going to interact with it for extended periods. I used to leave the session open on my desktop, even for days without using it. Since I started logging out whenever I don't plan on using the web interface, I haven't had any issues. Might be worth a try.
I don't think that's the issue, as the problem only started recently and I'm not doing anything significantly different from a year ago... One thing is true, though: when I'm not at home the server runs smoothly, and it tends to fail more when I use it from another device (e.g. over an SMB share). I haven't found a common trigger yet, unfortunately.
September 12, 2024 Author So I haven't found the culprit yet, but I think I have things under control after isolating the 4 CPUs/threads that nextcloud/photoprism/immich were pinned to... so it's likely one of those docker images...
October 7, 2024 Author I am still facing this issue, even after swapping the CPU for a production 7700T. Before the system freezes completely it still lets me connect via ssh, and some containers keep working, so I got some output from btop. I also started netdata and Glances docker instances, and here's the output... I have absolutely no idea where else to look. At a certain point Glances stops listing docker containers. I managed to run docker ps via ssh and there were a couple of "unhealthy" instances, but after a while it becomes completely impossible to do anything... Even after sending the power-off signal I still have to disconnect the whole server manually. I am completely lost with this. Edited October 7, 2024 by shick
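For anyone wanting to capture the "unhealthy" containers before the box locks up, here is a minimal sketch in Python. It assumes `docker ps` is asked for one name/status pair per line via `--format`, and that the status text contains "(unhealthy)" for failing health checks. The function names and the 10-second timeout are illustrative choices, not anything taken from the diagnostics.

```python
import subprocess


def unhealthy_containers(ps_output):
    """Return names of containers whose status text contains '(unhealthy)'.

    Expects one 'name<TAB>status' pair per line.
    """
    names = []
    for line in ps_output.splitlines():
        name, sep, status = line.partition("\t")
        # Skip lines without a tab separator (blank or malformed lines).
        if sep and "(unhealthy)" in status:
            names.append(name)
    return names


def list_unhealthy(timeout=10):
    """Run docker ps with a timeout so a hung daemon cannot block the caller.

    Raises subprocess.TimeoutExpired if dockerd does not answer in time.
    """
    result = subprocess.run(
        ["docker", "ps", "--format", "{{.Names}}\t{{.Status}}"],
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,
    )
    return unhealthy_containers(result.stdout)
```

Calling `list_unhealthy()` from a small script run over ssh (or on a schedule) would record which containers were unhealthy, and a timeout error would itself be a hint that dockerd had stopped responding.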
October 7, 2024 Author I had it enabled. This is from after the last shutdown (I was able to reboot from the command line): capsulecorp-diagnostics-20241007-0948.zip
October 7, 2024 There's something trying to access these shares before the array is started:
Oct 6 16:46:53 CapsuleCorp smbd[2646]: [2024/10/06 16:46:53.376705, 0] ../../source3/smbd/smb2_service.c:772(make_connection_snum)
Oct 6 16:46:53 CapsuleCorp smbd[2646]: make_connection_snum: canonicalize_connect_path failed for service TVShows, path /mnt/media/TVShows
Oct 6 16:46:53 CapsuleCorp smbd[2646]: [2024/10/06 16:46:53.378205, 0] ../../source3/smbd/smb2_service.c:772(make_connection_snum)
Oct 6 16:46:53 CapsuleCorp smbd[2646]: make_connection_snum: canonicalize_connect_path failed for service Photography, path /mnt/media/Photography
Oct 6 16:46:53 CapsuleCorp smbd[2646]: [2024/10/06 16:46:53.379664, 0] ../../source3/smbd/smb2_service.c:772(make_connection_snum)
Oct 6 16:46:53 CapsuleCorp smbd[2646]: make_connection_snum: canonicalize_connect_path failed for service Movies, path /mnt/media/Movies
Oct 6 16:46:53 CapsuleCorp smbd[2646]: [2024/10/06 16:46:53.381152, 0] ../../source3/smbd/smb2_service.c:772(make_connection_snum)
Oct 6 16:46:53 CapsuleCorp smbd[2646]: make_connection_snum: canonicalize_connect_path failed for service Mario, path /mnt/ssd/Mario
Oct 6 16:46:53 CapsuleCorp smbd[2646]: [2024/10/06 16:46:53.382525, 0] ../../source3/smbd/smb2_service.c:772(make_connection_snum)
Oct 6 16:46:53 CapsuleCorp smbd[2646]: make_connection_snum: canonicalize_connect_path failed for service Downloads, path /mnt/media/Downloads
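To see how much of the log these messages account for, a rough Python sketch like the one below could count the failures per share. It assumes the syslog has one message per line in the format quoted above. The regular expression and the function name `failed_shares` are illustrative and not part of Samba or unRAID.

```python
import re
from collections import Counter

# Matches the smbd message format quoted in the log excerpt.
# Assumes one log message per line.
FAILED = re.compile(
    r"canonicalize_connect_path failed for service (?P<service>.+?), path (?P<path>.+)"
)


def failed_shares(lines):
    """Count failed share connections per (service, path) pair."""
    counts = Counter()
    for line in lines:
        match = FAILED.search(line)
        if match:
            key = (match.group("service"), match.group("path").strip())
            counts[key] += 1
    return counts
```

Passing an open log file (for example the syslog inside the diagnostics zip) to `failed_shares` and printing `most_common()` would show which shares are being requested while their paths are unavailable, and how often.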
October 7, 2024 Author Hm, how bad is that? I'm not sure what it might be. Those shares live on the ZFS storage and are shared via the SMB extra config. The way I use the system is:
- 5x2tb HDDs in the array, with +1 HDD parity. It works as a backup of the following ZFS systems (running rsync weekly):
- 3x4tb HDDs in ZFS for media.
- 3x1tb SSD in ZFS for personal docs and docker apps storage.
So I guess other systems are likely trying to access the ZFS shares even while the array is not mounted...
October 7, 2024 Author Okay, thanks! I just removed all the SMB extra config and set the shares up with the native unRAID share system (it took a while, but it looks much cleaner). My config still dated from before unRAID supported ZFS natively. Let's see if that fixes it. BTW, I see this error from netdata: "BTRFS device corruption errors = 7 errors"
October 7, 2024
8 minutes ago, shick said: Let's see if that fixes it.
Unlikely that was the reason for the server becoming unresponsive, but it was spamming the log.
12 minutes ago, shick said: "BTRFS device corruption errors = 7 errors"
Post new diags.
October 7, 2024 Author
5 minutes ago, JorgeB said: Post new diags.
Here you are: capsulecorp-diagnostics-20241007-1748.zip
October 7, 2024 It's the docker image, you should recreate it:
https://docs.unraid.net/unraid-os/manual/docker-management/#re-create-the-docker-image-file
Then:
https://docs.unraid.net/unraid-os/manual/docker-management/#re-installing-docker-applications
Also see below if you have any custom docker networks:
https://docs.unraid.net/unraid-os/manual/docker-management/#docker-custom-networks
October 7, 2024 Author Oh, thanks a lot! Mind if I ask how you guessed it was corrupted? I have already applied the fix; I hope it works fine from now on.
October 8, 2024
10 hours ago, shick said: if I ask how you guessed it was corrupted?
Difficult to say, but if it happens again soon there may be an underlying issue.
October 8, 2024 Author So here we go again: only 12 hours after I recreated the docker image, the server became unresponsive again and I had to shut it down manually... I attach the diags: capsulecorp-diagnostics-20241008-0928.zip CPU consumption for dockerd goes through the roof.
October 8, 2024 Unfortunately there's nothing relevant logged. This can also be a hardware issue. One thing you can try is to boot the server in safe mode with all docker containers/VMs disabled and let it run as a basic NAS for a few days. If it still crashes, it's likely a hardware problem; if it doesn't, start turning the other services back on one by one, including the individual docker containers.
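The one-by-one approach above can be written down as a simple plan, sketched here in Python. The stage length (`soak_days`) and both function names are hypothetical, and the logic assumes a crash shows up reliably within each stage, which an intermittent fault may not respect.

```python
def staged_plan(services, soak_days=3):
    """Yield (enabled_services, soak_days) stages, adding one service per stage.

    Stage 0 is safe mode with nothing enabled (plain NAS).
    """
    enabled = []
    yield (tuple(enabled), soak_days)
    for service in services:
        enabled.append(service)
        yield (tuple(enabled), soak_days)


def suspect(services, first_crash_stage):
    """Name the likely culprit given the first stage at which a crash occurred.

    A crash at stage 0 (nothing enabled) points to hardware; otherwise
    the service that was added at that stage is the first suspect.
    """
    if first_crash_stage == 0:
        return "hardware"
    return services[first_crash_stage - 1]
```

For example, with `["nextcloud", "immich", "photoprism"]` the plan has four stages, and a first crash at stage 2 would make immich the first suspect, since it was the service added at that stage.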
October 8, 2024 Author So, sad news ^_^' Thank you anyway. I have run Memtest and the memory seems to be in good health. I suspect the NVMe disk, which has been running for 3 years and has no SMART capabilities... As it is quite a hassle not being able to use the server for anything beyond its NAS capabilities, for the time being I have done the following:
- Unmounted the NVMe device.
- Placed the docker image on one of the pools.
- Stopped all VMs and most containers.
If it keeps breaking I'll dig into the plugins and the remaining docker images... I will report back if I find the solution.