shick Posted September 11, 2024
I have had an unRAID server running for a while with two ZFS pools (3x1TB SSD and 3x4TB HDD) and one 5x2TB array serving only as a backup for the ZFS pools. It has a 7700T on it (an engineering sample, so it shows up as a "0000" CPU) which hadn't given me any trouble until recently: for the past few months I have been facing spontaneous unresponsiveness on the server. It looks like all CPU cores go to 100% and I can't access the web UI or reach my files via SMB. The only way to recover is to manually restart the server. I have some containers (Nextcloud, Immich, PhotoPrism and a couple of personal sites) and a VM running Ubuntu with very low load (just a small script running every 15 minutes). Most of this setup was already in place before the server started giving me trouble. The server was upgraded to the 7.0.0 beta but later downgraded to 6.12.13 (I thought that might be the culprit, but apparently not). I have attached the diagnostics file in case more info is needed. Any idea what might be going on? Thanks!
capsulecorp-diagnostics-20240911-1024.zip
bbrodka Posted September 11, 2024
I was having similar issues. I read somewhere to log out of the web GUI when not interacting with it for extended periods of time; I used to just leave the session open on my desktop, even for days without using it. Since I started logging out when I don't plan on using the web interface I haven't had any issues. Might be worth a try.
shick Posted September 11, 2024
55 minutes ago, bbrodka said:
I was having similar issues. I read somewhere to log out of the web GUI when not interacting with it for extended periods of time; I used to just leave the session open on my desktop, even for days without using it. Since I started logging out when I don't plan on using the web interface I haven't had any issues. Might be worth a try.
I don't think that's the issue, as I've only started facing this problem recently and I'm not doing anything significantly different from a year ago... One thing is true, though: when I'm not at home it runs smoothly, and it tends to fail more when I use it from another device (e.g. via SMB sharing). Unfortunately, I haven't found a common trigger yet.
shick Posted September 12, 2024
I haven't found the culprit yet, but I think I've got everything under control by isolating the four CPUs/threads that Nextcloud/PhotoPrism/Immich were pinned to... so it's likely one of those Docker images.
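(For anyone reading along, pinning containers to specific cores can also be done straight from the Docker CLI instead of the GUI; a minimal sketch, where the container name, image and core numbers are just examples for an 8-thread CPU:)

# Restrict an already-running container to cores 4-7 (example core range)
docker update --cpuset-cpus="4-7" nextcloud
# Or set the pinning when the container is created
docker run -d --name photoprism --cpuset-cpus="4-7" photoprism/photoprism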
shick Posted October 7, 2024 (edited)
I am still facing this issue, even after changing the CPU to a production 7700T. Before the system goes into a total freeze it still lets me connect via SSH, and some containers keep working, so I got some output from btop. I also spun up a Netdata and a Glances Docker instance; here's the output... I have absolutely no idea where else to look. In Glances, at a certain point it stops listing Docker containers. I managed to execute docker ps via SSH and there were a couple of "unhealthy" instances, but after a while it became completely impossible to do anything... Even when sending the power-off signal I still need to manually disconnect the whole server. I am completely lost with this.
Edited October 7, 2024 by shick
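(For reference, these are the kinds of commands I run over SSH to narrow it down while it is still responsive; the filter and flags are standard Docker CLI, the container list will obviously differ per setup:)

# List only containers whose health check is currently failing
docker ps --filter "health=unhealthy"
# One-shot snapshot of per-container CPU and memory usage
docker stats --no-stream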
JorgeB Posted October 7, 2024
Enable the syslog server and post that after a crash.
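(If it is set to mirror to the flash drive, the log from before the crash should survive the reboot; a sketch of reading it back, assuming the default "mirror syslog to flash" target on the boot flash — adjust the path if your syslog server writes to a share instead:)

# Read the mirrored syslog after the reboot (path is an assumption based on the mirror-to-flash setting)
cat /boot/logs/syslog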
shick Posted October 7, 2024
I had it enabled; this was after the last shutdown (I was able to reboot from the command line).
capsulecorp-diagnostics-20241007-0948.zip
JorgeB Posted October 7, 2024
There's something trying to access these shares before the array is started:
Oct 6 16:46:53 CapsuleCorp smbd[2646]: [2024/10/06 16:46:53.376705, 0] ../../source3/smbd/smb2_service.c:772(make_connection_snum)
Oct 6 16:46:53 CapsuleCorp smbd[2646]: make_connection_snum: canonicalize_connect_path failed for service TVShows, path /mnt/media/TVShows
Oct 6 16:46:53 CapsuleCorp smbd[2646]: [2024/10/06 16:46:53.378205, 0] ../../source3/smbd/smb2_service.c:772(make_connection_snum)
Oct 6 16:46:53 CapsuleCorp smbd[2646]: make_connection_snum: canonicalize_connect_path failed for service Photography, path /mnt/media/Photography
Oct 6 16:46:53 CapsuleCorp smbd[2646]: [2024/10/06 16:46:53.379664, 0] ../../source3/smbd/smb2_service.c:772(make_connection_snum)
Oct 6 16:46:53 CapsuleCorp smbd[2646]: make_connection_snum: canonicalize_connect_path failed for service Movies, path /mnt/media/Movies
Oct 6 16:46:53 CapsuleCorp smbd[2646]: [2024/10/06 16:46:53.381152, 0] ../../source3/smbd/smb2_service.c:772(make_connection_snum)
Oct 6 16:46:53 CapsuleCorp smbd[2646]: make_connection_snum: canonicalize_connect_path failed for service Mario, path /mnt/ssd/Mario
Oct 6 16:46:53 CapsuleCorp smbd[2646]: [2024/10/06 16:46:53.382525, 0] ../../source3/smbd/smb2_service.c:772(make_connection_snum)
Oct 6 16:46:53 CapsuleCorp smbd[2646]: make_connection_snum: canonicalize_connect_path failed for service Downloads, path /mnt/media/Downloads
shick Posted October 7, 2024 Author Posted October 7, 2024 hm how bad is that? not sure what it might be, those shares are under the ZFS storage, shared via SMB extra config. The way I use the system is: - 5x2tb HDDs in array, with +1 HDD parity. It works as a backup of the following ZFS systems (executing rsync weekly): - 3x4tb HDDs in ZFS for media. - 3x1tb SSD in ZFS for personal docs and docker apps storage. So I guess the ZFS system is likely trying to be accessed even with the array unmounted by other systems... Quote
JorgeB Posted October 7, 2024
You should remove that; you can export the shares using the GUI.
shick Posted October 7, 2024 Author Posted October 7, 2024 okay thanks! just removed all the SMB extra config and set up with the unRAID native share system (took a while but looks much cleaner) I had all my config still running from unRAID before natively supporting ZFS. Let's see if that fixes it. BTW, I see this error from netdata: "BTRFS device corruption errors = 7 errors" Quote
JorgeB Posted October 7, 2024
8 minutes ago, shick said:
Let's see if that fixes it.
Unlikely that was the reason for the server getting unresponsive, but it was spamming the log.
12 minutes ago, shick said:
"BTRFS device corruption errors = 7 errors"
Post new diags.
shick Posted October 7, 2024
5 minutes ago, JorgeB said:
Post new diags.
Here you are:
capsulecorp-diagnostics-20241007-1748.zip
JorgeB Posted October 7, 2024
It's the docker image; you should recreate it:
https://docs.unraid.net/unraid-os/manual/docker-management/#re-create-the-docker-image-file
Then:
https://docs.unraid.net/unraid-os/manual/docker-management/#re-installing-docker-applications
Also see below if you have any custom docker networks:
https://docs.unraid.net/unraid-os/manual/docker-management/#docker-custom-networks
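(If you want to see the corruption counters yourself before recreating it, something like this should print them; this assumes the docker image is the default btrfs-formatted one, loop-mounted at /var/lib/docker:)

# Per-device error counters for the filesystem backing /var/lib/docker
btrfs device stats /var/lib/docker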
shick Posted October 7, 2024
Oh, thanks a lot! Mind if I ask how you figured out it was corrupted? I already did the fix; hope it works fine from now on.
JorgeB Posted October 8, 2024
10 hours ago, shick said:
Mind if I ask how you figured out it was corrupted?
Difficult to say, but if it happens again soon there may be an underlying issue.
shick Posted October 8, 2024
So here we go again: only 12 hours after recreating the docker image the server became unresponsive again and I had to manually shut it down... I attach the diags:
capsulecorp-diagnostics-20241008-0928.zip
CPU consumption for dockerd goes through the roof:
JorgeB Posted October 8, 2024
Unfortunately there's nothing relevant logged. This can also be a hardware issue. One thing you can try is to boot the server in safe mode with all Docker containers/VMs disabled and let it run as a basic NAS for a few days. If it still crashes, it's likely a hardware problem; if it doesn't, start turning the other services back on one by one, including the individual Docker containers.
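(When you get to the re-enabling stage, you can do it per container from the GUI; the CLI equivalent is roughly the following, with the container name being just an example:)

# Bring one container back and watch its resource usage before enabling the next
docker start nextcloud
docker stats --no-stream nextcloud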
shick Posted October 8, 2024
Sad news then ^_^' Thank you anyway. I have run Memtest and the RAM seems to be in good health. I suspect the NVMe disk; it has been running for 3 years and has no SMART capabilities... As it's too much of a hassle not being able to use the server beyond its NAS capabilities, for the time being I have done the following:
- Unmounted the NVMe device.
- Placed the docker image on one of the pools.
- Stopped all VMs and most containers.
If it keeps breaking I'll dig into the plugins and the remaining Docker images... Will report back if I find the solution.
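(For anyone else chasing a suspect NVMe drive, these are the usual ways to read health data, assuming smartctl and the nvme-cli package are installed and the device node is /dev/nvme0 on your system; a drive like mine may not report anything useful:)

# SMART/health summary via smartmontools
smartctl -a /dev/nvme0
# Native NVMe health log via nvme-cli
nvme smart-log /dev/nvme0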