
Server becomes unresponsive after 6.12 upgrade



Hey, I upgraded to 6.12 over the weekend (Saturday morning, I believe), and initially everything looked OK. Saturday night, however, I found the server had become nearly unresponsive: shares were largely inaccessible, the web GUI was unreachable, and, most concerning, even trying to log in locally via the console resulted in a login timeout after 60 seconds. I forced a reboot, and on restart noticed that I had Docker set to macvlan, so I stopped the service, changed it to ipvlan, and restarted Docker (but, perhaps notably, didn't restart the server). This morning I found the server in a similar state.
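
(For anyone wanting to double-check the same thing from the console, this is roughly how I verified the driver after the switch; the network names will be whatever your setup created:)

docker network ls --format '{{.Name}}: {{.Driver}}'   # after the change, custom networks should report ipvlan rather than macvlan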

 

This time, as I was forcing the power down, I managed to let it capture diagnostics (which took well over an hour, I'd guess).
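
(In case it's useful to anyone else: diagnostics can also be triggered from the console, which is what eventually worked here; as I understand it, the zip is written to the flash drive:)

diagnostics   # writes unraid-diagnostics-<date>.zip under /boot/logs on the flash drive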

 

The only suspicious thing I notice in the logs is:

Jun 19 03:04:14 unRAID winbindd[30865]: [2023/06/19 03:04:14.631987,  0] ../../source3/winbindd/winbindd_samr.c:71(open_internal_samr_conn)
Jun 19 03:04:14 unRAID winbindd[30865]:   open_internal_samr_conn: Could not connect to samr pipe: NT_STATUS_CONNECTION_DISCONNECTED

 

Fix Common Problems shows no errors, just some warnings about Docker containers I haven't updated yet (it seems updates were published since yesterday and I haven't had a chance), and a note that I don't have UD+ installed, but I don't use that. I have Docker host access to custom networks disabled.

 

Is there something else I need to look at or fix? Dare I hope that not restarting the server after the macvlan->ipvlan change is what caused the problem the second time?

 

Thanks

 

unraid-diagnostics-20230619-0722.zip

  • 2 weeks later...
  • 3 weeks later...

Unfortunately, 6.12.3 doesn't appear to have resolved my issues, though I've managed to capture better data this time. First, the attached diagnostics are from restarting the server last night; it's still running today, but I'm starting to see errors in the logs.

 

Perhaps more interesting: since I'd failed to capture diagnostics in the past due to the machine being completely unresponsive to external user input (web GUI or local shell), I set up an external syslog server and configured my Unraid box to log to it, so syslog.zip contains that (a rough sketch of the receiving-side config follows the timeline below). To enumerate a bit of what's in there:
6.12.3 was installed and the server rebooted at Jul 16 09:04:52.

 

Sometime around Jul 16 17:32:48 is when I returned to find the system unresponsive again and had to forcefully power off by turning off the power supply (after maybe an hour waiting for a "graceful" shutdown from the power button).

 

At that reboot, I (due to a misunderstanding) disabled fixed IP addresses on all my Docker containers (I don't use custom MACs), but then ran into trouble, and after finding out that fixed IPs shouldn't be an issue, reset lms to its prior fixed IP. That was sometime later in the evening (before Jul 16 22:00).
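
For anyone wanting to do the same, the receiving side was just a stock rsyslog setup, roughly like the sketch below (the source IP and log path are examples from my network, so adjust to yours; on the Unraid side it's all configured in Settings -> Syslog Server):

# /etc/rsyslog.d/10-unraid.conf on the receiving machine
module(load="imudp")                  # accept syslog over UDP
input(type="imudp" port="514")
# everything arriving from the Unraid box (example address) goes to its own file
if $fromhost-ip == '192.168.1.10' then /var/log/unraid.log
& stop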

 

What stands out to me, other than the ata kernel errors (which I have not seen before), is a large number of warnings like the following, appearing from when the system was unresponsive and resuming again overnight:

php-fpm[7216]: [WARNING] [pool www] child 4069 exited on signal 9 (SIGKILL) after 17.999238 seconds from start
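
(A quick way to see how often those fire, run against the exported syslog; the filename is just whatever your syslog server writes:)

grep -c 'exited on signal 9 (SIGKILL)' /var/log/unraid.log                       # total count of the php-fpm SIGKILL warnings
grep 'exited on signal 9 (SIGKILL)' /var/log/unraid.log | cut -c1-9 | uniq -c    # per-hour tally, to see when the storm started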

 

At the moment the server is still responsive, so let me know if there is anything I can try to either resolve the issue or debug it further.

 

Thanks

unraid-diagnostics-20230717-0627.zip syslog.zip


A couple of weeks ago, while on 6.12.1, I added 4 new drives to complete my array; one was for parity 2 and the others were to be added to my media share. Instead I decided to make a new share out of the existing media-containing drives plus the new ones, then copy the media from one share to the other (there's probably a way better way to do this, but it's mostly done now). About 12+ hours into transferring files, disk 1 showed as disabled and I could then see it in Unassigned Devices. I checked that the data was all still there, then rebooted and let parity sync, always without errors. This happened about 3 times in total, with disk 1 and disk 2 and at varying lengths of time into the transfers, sometimes requiring hard restarts because I suspect mc still had an open session somewhere that had been closed in my browser and that I could no longer access.
After it happened a fourth time, I found this thread and decided to roll back to 6.11, and I've since run extended SMART tests on all 16 disks, which all look fine. I had to re-add my existing Docker containers via their templates to get them working again, as most of them wouldn't start after the rollback.
Adding diagnostics here for the Unraid team and/or anyone who can confirm the link between 6.12.x and drive issues.

citadel-diagnostics-20230718-0703.zip


For disks to suddenly appear under Unassigned Devices when they should be part of the array means they temporarily dropped offline and then reconnected with a different device ID. I would carefully check the SATA and power cabling to the drives in case there is an intermittent connection somewhere.

  • 1 month later...

I saw 6.12.4 came out in the last day or so with a lot of networking fixes for people with "similar" issues (server unresponsive to some degree), so I gave it a shot today. Unfortunately, within about an hour I had the same issue: the server became basically totally unresponsive, the web UI wouldn't respond, and I couldn't log in to the shell either via ssh or locally (I'd get a timeout trying to log in). I was unable to get diagnostics with the system in that state. I tried logging in to the console before the issue occurred and staying logged in, which did allow me to run the diagnostics command, but after it hadn't returned for several hours I had to reboot the system and revert to 6.11.5. I have a couple of diagnostics zips, but unfortunately they are both from just shortly after a reboot following the system going unresponsive. I do have a syslog from a syslog server, attached below.

 

Unfortunately, as noted, I reverted to 6.11.5 because I won't have time to babysit/debug this weekend, but any help and suggestions for me to try next week, when I have more time, would be great.

 

 

unraid-diagnostics-20230901-1525.zip unraid-diagnostics-20230901-1053.zip

syslog_2023-09-01.txt

Edited by stanger89

Pretty sure those are all from being connected via IPMI/AMT remote desktop; for some reason that connection seems to time out and reconnect periodically, which causes all its redirected devices to reconnect. I'm not normally connected that way when things are running smoothly, so I can't say for sure whether it also happens on 6.11, but I can fire it up for a while and see. I don't recall it being normal, except when the server is in its "unresponsive" state.

 

--edit

 

Yeah, I see the same thing with 6.11.5 if I leave IPMI/AMT remote desktop connected.

Edited by stanger89

Alright, hopefully good news for debugging. I updated to 6.12.4 again this morning, and unfortunately, as expected, it went unresponsive after less than an hour. This time I hit the power button hoping it would eventually shut down and capture diagnostics. It turns out that after about 9 hours the server finally shut itself down, and it appears to have captured diagnostics, which I've attached. Right now I'm running in safe mode; I started the array and disabled Docker, and we'll see what happens.


Unfortunately, nothing jumps out at me in the syslog from before I pressed the power button (while the server was unresponsive).

unraid-diagnostics-20230903-1158.zip


In the syslog, I can see where you pressed the power button:

Sep  3 09:41:36 unRAID elogind-daemon[1326]: Power key pressed.

But before that, there are no indications of any problems.


I'm wondering if the system is experiencing network problems. Maybe it isn't actually hanging, but is losing access to the network for some reason? This would result in it appearing to be unresponsive from the network.

 

Can you plug a keyboard/monitor into the system and try logging in to the console while it is unresponsive? If you can get in, try to ping another system on your network.
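
Something along these lines from the console would tell us a lot (the gateway address below is just an example; substitute any host that should be reachable on your LAN):

ip -br addr show eth0     # does the NIC still hold its IP address?
ping -c 4 192.168.1.1     # can you reach the gateway?
ss -s                     # socket summary; a huge pile of stuck sockets suggests load rather than link loss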


Yeah, it's a little complicated. You're right, it's not totally frozen and does respond to a keyboard/mouse; however, if you're not already logged in locally on the console, you can't log in: it times out 60 seconds after entering the username. That's what I tried above with the remote syslog (hence all the USB cruft in the logs from the IPMI periodically reconnecting): I logged in to the console before the apparent network issues, and indeed it wasn't completely frozen. This is also evidenced by the fact that the shutdown/power-down sequence did eventually complete after hitting the power button, but on the order of 9 hours later.

 

I do agree that it seems network related, and currently the leading suspect is something Docker related. Right now my system is still up in safe mode with Home Assistant and zwave-js-ui running, both with host network config. I've also started Plex, which also uses host networking, and LMS, which is one of my containers with a custom IP on br0. I have several other containers that also use custom IPs on br0. I'm wondering if there's something about how I've got that set up that's not compatible with 6.12.
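
For reference, this is roughly how the br0 assignments look from the CLI (a generic docker inspect incantation, nothing Unraid-specific, so treat it as a sketch):

docker network inspect br0 --format '{{range .Containers}}{{.Name}} {{.IPv4Address}}{{"\n"}}{{end}}'   # one "container IP/prefix" line per attached container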

 

Here's my docker config:
[screenshot: Docker settings page showing my network configuration]


OK, so that sounds less like networking and more like the system was under such heavy load that it couldn't respond to anything. 

 

To confirm that, type "htop" at the console and keep an eye on the load as you start more containers. If it gets to where it is unresponsive again, I'm thinking all the CPUs will be pegged and the load will be sky-high. If possible, take a photo of the screen with your phone so we can see which commands are the culprits (I'm guessing Docker, but better to have proof).
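
If htop itself is hard to capture, a crude loop like this would record the load and the top CPU consumers to the flash drive, where the data survives a hard reset (the path assumes the usual /boot flash mount):

# append a timestamped load snapshot every 30 seconds
while true; do
  { date; uptime; ps -eo pid,comm,%cpu,%mem --sort=-%cpu | head -15; echo; } >> /boot/load-trace.txt
  sleep 30
done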

 

To recover from this... first I would let it run for a while. See if it can finish doing whatever it is doing and recover on its own.

After a while (at least an hour) you could try exiting htop and typing `/etc/rc.d/rc.docker stop`. It may be able to stop things to the point that the load recovers and you can stop the array.  If that fails, you'll probably have to hit the power button again.
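
(If that command returns, a quick sanity check with ordinary tools that the engine really went down:)

pgrep -a 'dockerd|containerd'   # should print nothing once the Docker service has stopped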


If it looks like Docker is the issue, I'm thinking it might be worth deleting and recreating the Docker image:
https://forums.unraid.net/topic/36647-official-guide-restoring-your-docker-applications-in-a-new-image-file/ 

 

  • 2 weeks later...

So far I've managed to verify it's "stable" with the Community Applications and Fix Common Problems plugins enabled. I had an issue once after installing Disk Location, but was not able to capture any useful data and haven't taken the time to try again.

 

Now I say "stable" in quotes because after a number of days of solid operation things started going unresponsive again.  I did capture diagnostics before it was completely unresponsive, but once it was unresponsive I capture the following from the console:
[screenshot: console output showing kernel call traces]

 

It looks like this might be different from what I was previously running into, since I wasn't seeing the call traces before. Unfortunately I probably won't be able to get back to debugging for a while, but I wanted to collect the data before I lost it.
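
(Note to self for next time: if the console still accepts input when the traces appear, dumping the kernel ring buffer to the flash drive should preserve them across the forced reset:)

dmesg -T | tail -n 100 > /boot/calltrace-$(date +%Y%m%d-%H%M).txt   # human-readable timestamps, last 100 lines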

unraid-diagnostics-20230925-1315.zip

