
unRAID server unresponsive


shick

Recommended Posts

Posted

I have an unRAID server that has been running for a while with two ZFS pools (3x1TB SSD and 3x4TB HDD) and one 5x2TB array serving only as a backup for the ZFS pools. It has a 7700T in it (an engineering sample, so it shows up as a "0000" CPU) which hadn't given me any trouble until recently: for the past few months I've been facing spontaneous unresponsiveness on the server. It looks like all CPU cores go to 100% and I can't access the web UI or reach my files via SMB. The only way to recover is to manually restart the server.

 

I have some containers (Nextcloud, Immich, PhotoPrism and a couple of personal sites) and a VM running Ubuntu with very low load (just a small script running every 15 minutes). Most of this setup was already in place before the server started giving me trouble.

 

The server was upgraded to the 7.0.0 beta but downgraded later to 6.12.13 (I thought the beta might be the culprit, but apparently not). I have attached the diagnostics file here in case more info is needed.

 

Any idea what might be going on? Thanks!

capsulecorp-diagnostics-20240911-1024.zip

Posted

I was having similar issues. I read somewhere to log out of the web GUI when not interacting with it for extended periods; I used to just leave the session open on my desktop, even for days without using it. Since I started logging out when I don't plan on using the web interface I haven't had any issues. Might be worth a try.

Posted
55 minutes ago, bbrodka said:

I was having similar issues. I read somewhere to log out of the web GUI when not interacting with it for extended periods; I used to just leave the session open on my desktop, even for days without using it. Since I started logging out when I don't plan on using the web interface I haven't had any issues. Might be worth a try.

I don't think that's the issue, as I've only started facing this problem recently and I'm not doing anything significantly different from a year ago... One thing is true though: when I'm not at home it runs smoothly, and it tends to fail more when I use it from another device (e.g. via SMB sharing). Unfortunately, I haven't found a common trigger yet.

Posted

So I haven't found the culprit yet, but I think I've got everything under control by isolating the 4 CPUs/threads where Nextcloud/PhotoPrism/Immich were pinned... so it's likely one of those docker images...

  • 4 weeks later...
Posted (edited)

I am still facing this issue, even after changing the CPU to a production 7700T.

 

Before the system goes into a total freeze it still lets me connect via SSH, and some containers keep working, so I got some output from btop. I also spun up netdata and Glances docker instances; here's the output... I have absolutely no idea where else to look.

 

In Glances, at a certain point it stops listing docker containers. I managed to run docker ps via SSH and there are a couple of "unhealthy" instances, but after a while it becomes completely impossible to do anything... Even after sending the power-off signal I still need to manually cut power to the whole server. I am completely lost with this :(
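A minimal sketch for capturing the unhealthy containers over SSH before the box fully locks up (assumes Docker is reachable; the output path is just for this demo, not a real location from my setup):

```shell
# Save the list of containers with a failing HEALTHCHECK to a file,
# so the result survives even if the session dies shortly after.
OUT=/tmp/unhealthy_containers_demo.txt
if command -v docker >/dev/null 2>&1; then
    # --filter health=unhealthy matches only containers whose health check is failing
    docker ps --filter "health=unhealthy" --format '{{.Names}} {{.Status}}' > "$OUT"
else
    echo "docker not available on this host" > "$OUT"
fi
cat "$OUT"
```

From there, `docker inspect --format '{{json .State.Health}}' <container-name>` shows the recent health-check log entries for a given container, which can hint at what started failing first.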

 

[btop, netdata and Glances screenshots attached]

 

Edited by shick
Posted

There's something trying to access these shares before the array is started:

 

Oct  6 16:46:53 CapsuleCorp smbd[2646]: [2024/10/06 16:46:53.376705,  0] ../../source3/smbd/smb2_service.c:772(make_connection_snum)
Oct  6 16:46:53 CapsuleCorp smbd[2646]:   make_connection_snum: canonicalize_connect_path failed for service TVShows, path /mnt/media/TVShows
Oct  6 16:46:53 CapsuleCorp smbd[2646]: [2024/10/06 16:46:53.378205,  0] ../../source3/smbd/smb2_service.c:772(make_connection_snum)
Oct  6 16:46:53 CapsuleCorp smbd[2646]:   make_connection_snum: canonicalize_connect_path failed for service Photography, path /mnt/media/Photography
Oct  6 16:46:53 CapsuleCorp smbd[2646]: [2024/10/06 16:46:53.379664,  0] ../../source3/smbd/smb2_service.c:772(make_connection_snum)
Oct  6 16:46:53 CapsuleCorp smbd[2646]:   make_connection_snum: canonicalize_connect_path failed for service Movies, path /mnt/media/Movies
Oct  6 16:46:53 CapsuleCorp smbd[2646]: [2024/10/06 16:46:53.381152,  0] ../../source3/smbd/smb2_service.c:772(make_connection_snum)
Oct  6 16:46:53 CapsuleCorp smbd[2646]:   make_connection_snum: canonicalize_connect_path failed for service Mario, path /mnt/ssd/Mario
Oct  6 16:46:53 CapsuleCorp smbd[2646]: [2024/10/06 16:46:53.382525,  0] ../../source3/smbd/smb2_service.c:772(make_connection_snum)
Oct  6 16:46:53 CapsuleCorp smbd[2646]:   make_connection_snum: canonicalize_connect_path failed for service Downloads, path /mnt/media/Downloads
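A quick way to see how much this is flooding the log is to count the entries; a sketch against a demo copy of the syslog (not the live one):

```shell
# Demo: count canonicalize_connect_path failures in a saved syslog copy.
LOG=/tmp/syslog_demo.txt
printf '%s\n' \
  'Oct  6 16:46:53 CapsuleCorp smbd[2646]:   make_connection_snum: canonicalize_connect_path failed for service Movies, path /mnt/media/Movies' \
  > "$LOG"
# -c prints the number of matching lines
grep -c 'canonicalize_connect_path failed' "$LOG"   # prints 1
```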

 

Posted

Hm, how bad is that? I'm not sure what it might be; those shares are under the ZFS storage, shared via the SMB extra config.

 

The way I use the system is:

- 5x2TB HDDs in the array, with 1 parity HDD. It works as a backup of the following ZFS pools (rsync runs weekly):

- 3x4TB HDDs in ZFS for media.

- 3x1TB SSDs in ZFS for personal docs and docker app storage.

 

So I guess other systems are likely trying to access the ZFS pools even while the array is unmounted...

Posted

Okay, thanks! I just removed all the SMB extra config and set things up with the unRAID native share system (took a while but looks much cleaner).

 

I had all my config still running from before unRAID natively supported ZFS. Let's see if that fixes it.

 

BTW, I see this error from netdata: "BTRFS device corruption errors = 7 errors"

 

[netdata screenshot attached]

Posted
8 minutes ago, shick said:

Let's see if that fixes it.

Unlikely that was the reason for the server becoming unresponsive, but it was spamming the log.

 

12 minutes ago, shick said:

"BTRFS device corruption errors = 7 errors"

Post new diags.

Posted

Unfortunately there's nothing relevant logged; this can also be a hardware issue. One thing you can try is to boot the server in safe mode with all docker containers/VMs disabled and let it run as a basic NAS for a few days. If it still crashes it's likely a hardware problem; if it doesn't, start turning the other services back on one by one, including the individual docker containers.

Posted

So, sad news ^_^' Thank you anyway :) I have run Memtest and the RAM seems in good health. I suspect the NVMe disk; it has been running for 3 years and reports no SMART capabilities...

 

As it is too much of a hassle not being able to use the server beyond its NAS capabilities, for the time being I have done the following:

- Unmounted the NVMe device.

- Placed the docker image under one of the pools.

- Stopped all VMs and most containers.

 

If it keeps breaking I'll dig into the plugins and remaining docker images... I'll report back if I find the solution.
