shick Posted September 11, 2024
I have had an unRAID server running for a while with two ZFS pools (3x1TB SSD and 3x4TB HDD) and one 5x2TB array serving only as a backup for the ZFS pools. It has a 7700T on it (an engineering sample, so it shows up as a "0000" CPU) which hadn't given me any trouble until recently: for the past few months I have been facing spontaneous unresponsiveness on the server. It looks like all CPU cores go to 100% and I can't access the web UI or reach my files via SMB. The only way to recover is to manually restart the server. I have some containers (Nextcloud, Immich, PhotoPrism and a couple of personal sites) and a VM running Ubuntu with very low load (just a small script running every 15 minutes). Most of this setup was already in place before the server started giving me trouble. The server was upgraded to the 7.0.0 beta but later downgraded to 6.12.13 (I thought that might be the culprit, but apparently not). I have attached the diagnostics file in case more info is needed. Any idea what might be going on? Thanks!
capsulecorp-diagnostics-20240911-1024.zip
bbrodka Posted September 11, 2024
I was having similar issues. I read somewhere to log out of the web GUI when not interacting with it for extended periods of time; I used to just leave the session open on my desktop, even for days without using it. Since I started logging out when I don't plan on using the web interface I haven't had any issues. Might be worth a try.
shick Posted September 11, 2024
55 minutes ago, bbrodka said:
I was having similar issues. I read somewhere to log out of the web GUI when not interacting with it for extended periods of time; I used to just leave the session open on my desktop, even for days without using it. Since I started logging out when I don't plan on using the web interface I haven't had any issues. Might be worth a try.
I don't think that's the issue, as I've only started facing this problem recently and I'm not doing anything significantly different from a year ago... One thing is true, though: when I'm not at home it runs smoothly, and it tends to fail more when I use it from another device (e.g. via SMB sharing). Unfortunately, I haven't found a common trigger yet.
shick Posted September 12, 2024
I haven't found the culprit yet, but I think I've got everything under control by isolating the four CPUs/threads that Nextcloud/PhotoPrism/Immich were pinned to... so it's likely one of those Docker images.
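(For anyone reading along, pinning containers to specific cores can also be done straight from the Docker CLI instead of the GUI; a minimal sketch, where the container name, image and core numbers are just examples for an 8-thread CPU:)

# Restrict an already-running container to cores 4-7 (example core range)
docker update --cpuset-cpus="4-7" nextcloud
# Or set the pinning when the container is created
docker run -d --name photoprism --cpuset-cpus="4-7" photoprism/photoprism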
shick Posted October 7, 2024 (edited)
I am still facing this issue, even after changing the CPU to a production 7700T. Before the system goes into a total freeze it still lets me connect via SSH, and some containers keep working, so I got some output from btop. I also spun up a Netdata and a Glances Docker instance; here's the output... I have absolutely no idea where else to look. In Glances, at a certain point it stops listing Docker containers. I managed to execute docker ps via SSH and there were a couple of "unhealthy" instances, but after a while it became completely impossible to do anything... Even when sending the power-off signal I still need to manually disconnect the whole server. I am completely lost with this.
Edited October 7, 2024 by shick
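(For reference, these are the kinds of commands I run over SSH to narrow it down while it is still responsive; the filter and flags are standard Docker CLI, the container list will obviously differ per setup:)

# List only containers whose health check is currently failing
docker ps --filter "health=unhealthy"
# One-shot snapshot of per-container CPU and memory usage
docker stats --no-stream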
JorgeB Posted October 7, 2024
Enable the syslog server and post that after a crash.
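(If it is set to mirror to the flash drive, the log from before the crash should survive the reboot; a sketch of reading it back, assuming the default "mirror syslog to flash" target on the boot flash — adjust the path if your syslog server writes to a share instead:)

# Read the mirrored syslog after the reboot (path is an assumption based on the mirror-to-flash setting)
cat /boot/logs/syslog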
shick Posted October 7, 2024
I had it enabled; this was after the last shutdown (I was able to reboot from the command line).
capsulecorp-diagnostics-20241007-0948.zip
JorgeB Posted October 7, 2024
There's something trying to access these shares before the array is started:
Oct 6 16:46:53 CapsuleCorp smbd[2646]: [2024/10/06 16:46:53.376705, 0] ../../source3/smbd/smb2_service.c:772(make_connection_snum)
Oct 6 16:46:53 CapsuleCorp smbd[2646]: make_connection_snum: canonicalize_connect_path failed for service TVShows, path /mnt/media/TVShows
Oct 6 16:46:53 CapsuleCorp smbd[2646]: [2024/10/06 16:46:53.378205, 0] ../../source3/smbd/smb2_service.c:772(make_connection_snum)
Oct 6 16:46:53 CapsuleCorp smbd[2646]: make_connection_snum: canonicalize_connect_path failed for service Photography, path /mnt/media/Photography
Oct 6 16:46:53 CapsuleCorp smbd[2646]: [2024/10/06 16:46:53.379664, 0] ../../source3/smbd/smb2_service.c:772(make_connection_snum)
Oct 6 16:46:53 CapsuleCorp smbd[2646]: make_connection_snum: canonicalize_connect_path failed for service Movies, path /mnt/media/Movies
Oct 6 16:46:53 CapsuleCorp smbd[2646]: [2024/10/06 16:46:53.381152, 0] ../../source3/smbd/smb2_service.c:772(make_connection_snum)
Oct 6 16:46:53 CapsuleCorp smbd[2646]: make_connection_snum: canonicalize_connect_path failed for service Mario, path /mnt/ssd/Mario
Oct 6 16:46:53 CapsuleCorp smbd[2646]: [2024/10/06 16:46:53.382525, 0] ../../source3/smbd/smb2_service.c:772(make_connection_snum)
Oct 6 16:46:53 CapsuleCorp smbd[2646]: make_connection_snum: canonicalize_connect_path failed for service Downloads, path /mnt/media/Downloads
shick Posted October 7, 2024 Author Posted October 7, 2024 hm how bad is that? not sure what it might be, those shares are under the ZFS storage, shared via SMB extra config. The way I use the system is: - 5x2tb HDDs in array, with +1 HDD parity. It works as a backup of the following ZFS systems (executing rsync weekly): - 3x4tb HDDs in ZFS for media. - 3x1tb SSD in ZFS for personal docs and docker apps storage. So I guess the ZFS system is likely trying to be accessed even with the array unmounted by other systems... Quote
JorgeB Posted October 7, 2024
You should remove that; you can export the shares using the GUI.
shick Posted October 7, 2024 Author Posted October 7, 2024 okay thanks! just removed all the SMB extra config and set up with the unRAID native share system (took a while but looks much cleaner) I had all my config still running from unRAID before natively supporting ZFS. Let's see if that fixes it. BTW, I see this error from netdata: "BTRFS device corruption errors = 7 errors" Quote
JorgeB Posted October 7, 2024
8 minutes ago, shick said:
Let's see if that fixes it.
Unlikely that was the reason for the server getting unresponsive, but it was spamming the log.
12 minutes ago, shick said:
"BTRFS device corruption errors = 7 errors"
Post new diags.
shick Posted October 7, 2024
5 minutes ago, JorgeB said:
Post new diags.
Here you are:
capsulecorp-diagnostics-20241007-1748.zip
JorgeB Posted October 7, 2024
It's the docker image; you should recreate it:
https://docs.unraid.net/unraid-os/manual/docker-management/#re-create-the-docker-image-file
Then:
https://docs.unraid.net/unraid-os/manual/docker-management/#re-installing-docker-applications
Also see below if you have any custom docker networks:
https://docs.unraid.net/unraid-os/manual/docker-management/#docker-custom-networks
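(If you want to see the corruption counters yourself before recreating it, something like this should print them; this assumes the docker image is the default btrfs-formatted one, loop-mounted at /var/lib/docker:)

# Per-device error counters for the filesystem backing /var/lib/docker
btrfs device stats /var/lib/docker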
shick Posted October 7, 2024
Oh, thanks a lot! Mind if I ask how you figured out it was corrupted? I already did the fix; hope it works fine from now on.
JorgeB Posted October 8, 2024
10 hours ago, shick said:
Mind if I ask how you figured out it was corrupted?
Difficult to say, but if it happens again soon there may be an underlying issue.
shick Posted October 8, 2024
So here we go again: only 12 hours after recreating the docker image the server became unresponsive again and I had to manually shut it down... I attach the diags:
capsulecorp-diagnostics-20241008-0928.zip
CPU consumption for dockerd goes through the roof:
JorgeB Posted October 8, 2024
Unfortunately there's nothing relevant logged. This can also be a hardware issue. One thing you can try is to boot the server in safe mode with all Docker containers/VMs disabled and let it run as a basic NAS for a few days. If it still crashes, it's likely a hardware problem; if it doesn't, start turning the other services back on one by one, including the individual Docker containers.
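(When you get to the re-enabling stage, you can do it per container from the GUI; the CLI equivalent is roughly the following, with the container name being just an example:)

# Bring one container back and watch its resource usage before enabling the next
docker start nextcloud
docker stats --no-stream nextcloud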
shick Posted October 8, 2024
Sad news then ^_^' Thank you anyway. I have run Memtest and the RAM seems to be in good health. I suspect the NVMe disk; it has been running for 3 years and has no SMART capabilities... As it's too much of a hassle not being able to use the server beyond its NAS capabilities, for the time being I have done the following:
- Unmounted the NVMe device.
- Placed the docker image on one of the pools.
- Stopped all VMs and most containers.
If it keeps breaking I'll dig into the plugins and the remaining Docker images... Will report back if I find the solution.
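(For anyone else chasing a suspect NVMe drive, these are the usual ways to read health data, assuming smartctl and the nvme-cli package are installed and the device node is /dev/nvme0 on your system; a drive like mine may not report anything useful:)

# SMART/health summary via smartmontools
smartctl -a /dev/nvme0
# Native NVMe health log via nvme-cli
nvme smart-log /dev/nvme0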