HELP! Zombie Mode Again



Hopefully someone will see this and give me a suggestion ...

 

I recently upgraded to version 6 after running version 5 for several years, and I find myself experiencing a serious problem.

 

After being up and running for a little over four (4) days, my unRAID server has suddenly gone into a sort of zombie mode.  The web GUI won't respond at all and shares are not available ... though oddly Sabnzbd, running in a Docker container on the same machine, is still responding, and Sonarr seems to be accessible from some machines on my LAN but not others.

 

The machine is responding to pings and the console is taking basic commands for the time being.  I was going to restart the server from the console, but I recall that when I experienced the same problem during an earlier aborted upgrade attempt, back when 6 was still in beta, all the logs got cleared ...

 

Any suggestions on what to do at this point?

 

UPDATE: Like I said, I can still run commands from the console, and after reading some other posts I ran top and saw this one process:

 

PID   USER  PR  NI  VIRT  RES   SHR  S  %CPU  %MEM  TIME    COMMAND
1857  root  20   0  546m  4308  786  S   100   0.1  776.32  shfs

 

The TIME value is constantly increasing ...

 

Not sure what I'm looking at, but I'm guessing this process has spiked my CPU?
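For reference, a rough way to keep a record of that shfs line across a hang (just a sketch, assuming the stock procps top and that the flash drive is mounted at /boot as usual, so the file survives a hard reset; the log file name is just something I made up):

top -b -n 1 | grep shfs                                                          # one non-interactive snapshot of the shfs line
while true; do date >> /boot/shfs-cpu.log; top -b -n 1 | grep shfs >> /boot/shfs-cpu.log; sleep 60; done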

 

 

UPDATE II:  Oh well, just as with my last upgrade attempt, the entire server eventually became unresponsive while I was reading the documentation and trying to gather more information.  The console stopped responding, as did Sabnzbd.

 

Hard Reset

  • 3 weeks later...

Yes, same here.

 

Happened a few times now.

A bit annoying, always ends in a hard reset.

 

Can anybody suggest where I can start looking to figure this out?

This time I copied the syslog. It seems to be the mover...

 

Jan 21 00:02:46 Tower kernel: mdcmd (280): spindown 0
Jan 21 00:09:44 Tower kernel: mdcmd (281): spindown 1
Jan 21 01:02:14 Tower kernel: mdcmd (282): spindown 1
Jan 21 02:18:29 Tower kernel: mdcmd (283): spindown 1
Jan 21 03:18:05 Tower kernel: mdcmd (284): spindown 1
Jan 21 03:40:01 Tower logger: mover started
Jan 21 03:40:01 Tower logger: skipping "Apps"
Jan 21 03:40:01 Tower logger: skipping "IoT"
Jan 21 03:40:01 Tower logger: moving "UserShares"
Jan 21 03:40:01 Tower logger: ./UserShares/Roland/rolan/OBELIX/Data/C/Users/rolan/Tracing/WPPMedia/Skype_MediaStackETW-6.0.8948.320-lcsmedia_vnext_release4(rtbldlab)-x86fre-U (2016_01_20 10_26_30 UTC).etl
Jan 21 03:55:02 Tower kernel: mdcmd (285): spindown 1
Jan 21 20:17:59 Tower php: /usr/local/emhttp/plugins/dynamix.docker.manager/scripts/docker 'restart' 'PlexMediaServer'
Jan 21 20:23:05 Tower in.telnetd[17844]: connect from 192.168.2.8 (192.168.2.
Jan 21 20:23:07 Tower login[17845]: ROOT LOGIN  on '/dev/pts/0' from 'Obelix'
Jan 21 20:23:51 Tower in.telnetd[17990]: connect from 192.168.2.8 (192.168.2.
Jan 21 20:23:58 Tower login[17991]: ROOT LOGIN  on '/dev/pts/1' from 'Obelix'

 

When I restarted the Plex Docker container (because it did not stream anymore), the web GUI became unresponsive.

I could still telnet into the server and it would respond as long as I did not need the array.

ls /mnt/user

would not work anymore

but

ls /mnt/cache/Apps

was fine

 

The mover was still running at 8 pm on 21/01:

root     32467  1478  0 03:40 ?        00:00:00 /bin/sh -c /usr/local/sbin/mover |& logger
root     32468 32467  0 03:40 ?        00:00:00 /bin/bash /usr/local/sbin/mover
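For anyone else digging into this, a rough way to tell whether those processes are doing work or just stuck waiting on the array (a sketch, assuming the usual procps ps; a STAT of D means uninterruptible I/O wait, and the WCHAN column shows the kernel function they are sleeping in):

ps -eo pid,stat,wchan:32,args | grep -E '[s]hfs|[m]over'   # the [brackets] keep the grep command itself out of the output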

 

Any ideas?

 

 

 


A bit disappointed at the lack of response and support for this, as several people seem to have posted elsewhere about similar freeze-ups of v6 ... but I may have stumbled into a workaround.

 

After posting the original message, right on cue just shy of three full days of uptime, my server froze up again and required another hard reset, but unlike the previous event it was too far gone to get logged in.  I started watching very closely after that, and again at just around three days the server became unresponsive, but this time, like the first, I was able to log in at the console before it went full zombie and noted that the same process had my CPU pegged at 100% again.  Since I still had access, I shut it down more gently from there.

 

On reboot this time I thought, rather than chance not being able to get in again when/if it hung, I'd just leave top running on the console so I could check in on it periodically to see what was happening ... that was about 14 days ago and it's been smooth sailing since.

 

Not sure what's happening here, but it seems as though so long as the server is running some task it doesn't nod off.  top is pretty lightweight and doesn't use a lot of resources, so running it constantly doesn't impact anything else; I'll just keep it going and see what happens.
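For anyone who wants to try the same thing, it's literally just stock top left running on the attached console.  The only tweak I'd suggest (my own preference, the workaround doesn't seem to depend on it) is slowing the refresh so it stays as light as possible:

top -d 10        # refresh every 10 seconds instead of the default 3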


(Quoting the workaround post above.)

 

Thanks for posting ... what is the 'top process'?


(Quoting the workaround post above.)

 

The reason there has been no response from us on this is that we have yet to recreate this issue in our own labs, on any equipment we own.  Without the ability to recreate it, we really have no way of diagnosing the problem.  One thing we're hopeful about is that the next release of unRAID will feature a newer version of Docker, which may or may not alleviate this issue.

 

We are definitely paying attention to these threads though and every now and then we do make more efforts to see if we can recreate this problem, but so far, it just hasn't happened yet.

 

What would be helpful is if someone could leave a monitor connected to the unRAID console, log into the console, and run the command:  tail /var/log/syslog -f

 

This will cause the system log to continue outputting directly to that console.  When the system goes into an unresponsive state, check the monitor for any messages and if possible, take a picture with a smartphone or something (make sure it's legible) and upload it here (in fact, post it here AND PM me with it).
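If the console itself tends to lock up before a photo can be taken, one option (assuming the flash drive is mounted at /boot as usual, so the copy survives a reset; the destination path is just an example) is to mirror the log to the flash at the same time:

mkdir -p /boot/logs
tail -f /var/log/syslog | tee /boot/logs/syslog-live.txt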


I'm in this state again.  Wait, today there's a difference: I have no shares or devices visible in the UI, CPU shows 0%, but the start/stop buttons are still present under Main.  Starting and stopping the array brought the devices back but not the shares.  I've updated the attached syslog; it should now include that stop/start event.

 

After a reboot I'll run "tail /var/log/syslog -f" on the console downstairs.  Once it fails again [in four days, if the pattern holds], I'll get you a screenshot, then delete my dockers and try it again.

 

FWIW, right now:

 

Clicking the log button:

Jan 25 07:17:37 Tower dhcpcd[1324]: eth0: renew in 3600 seconds, rebind in 6300 seconds

Jan 25 07:17:37 Tower dhcpcd[1324]: eth0: writing lease `/var/lib/dhcpcd/dhcpcd-eth0.lease'

Jan 25 07:17:37 Tower dhcpcd[1324]: eth0: IP address 192.168.0.100/24 already exists

Jan 25 07:17:37 Tower dhcpcd[1324]: eth0: executing `/lib/dhcpcd/dhcpcd-run-hooks' RENEW

Jan 25 07:17:37 Tower dhcpcd[1324]: eth0: ARP announcing 192.168.0.100 (1 of 2), next in 2.0 seconds

Jan 25 07:17:39 Tower dhcpcd[1324]: eth0: ARP announcing 192.168.0.100 (2 of 2)

Jan 25 07:33:07 Tower dhcpcd[1324]: eth0: xid 0x7cd12051 is for hwaddr d0:23:db:a8:6b:ad:00:00:00:00:00:00:00:00:00:00

Jan 25 08:17:37 Tower dhcpcd[1324]: eth0: renewing lease of 192.168.0.100

Jan 25 08:17:37 Tower dhcpcd[1324]: eth0: rebind in 2700 seconds, expire in 3600 seconds

Jan 25 08:17:37 Tower dhcpcd[1324]: eth0: sending REQUEST (xid 0xca5984da), next in 3.3 seconds

Jan 25 08:17:37 Tower dhcpcd[1324]: eth0: acknowledged 192.168.0.100 from 192.168.0.1

Jan 25 08:17:37 Tower dhcpcd[1324]: eth0: leased 192.168.0.100 for 7200 seconds

Jan 25 08:17:37 Tower dhcpcd[1324]: eth0: renew in 3600 seconds, rebind in 6300 seconds

Jan 25 08:17:37 Tower dhcpcd[1324]: eth0: writing lease `/var/lib/dhcpcd/dhcpcd-eth0.lease'

Jan 25 08:17:37 Tower dhcpcd[1324]: eth0: IP address 192.168.0.100/24 already exists

Jan 25 08:17:37 Tower dhcpcd[1324]: eth0: executing `/lib/dhcpcd/dhcpcd-run-hooks' RENEW

Jan 25 08:17:38 Tower dhcpcd[1324]: eth0: ARP announcing 192.168.0.100 (1 of 2), next in 2.0 seconds

Jan 25 08:17:40 Tower dhcpcd[1324]: eth0: ARP announcing 192.168.0.100 (2 of 2)

Jan 25 09:17:37 Tower dhcpcd[1324]: eth0: renewing lease of 192.168.0.100

Jan 25 09:17:37 Tower dhcpcd[1324]: eth0: rebind in 2700 seconds, expire in 3600 seconds

Jan 25 09:17:37 Tower dhcpcd[1324]: eth0: sending REQUEST (xid 0xa46cfb70), next in 3.5 seconds

Jan 25 09:17:37 Tower dhcpcd[1324]: eth0: acknowledged 192.168.0.100 from 192.168.0.1

Jan 25 09:17:37 Tower dhcpcd[1324]: eth0: leased 192.168.0.100 for 7200 seconds

Jan 25 09:17:37 Tower dhcpcd[1324]: eth0: renew in 3600 seconds, rebind in 6300 seconds

Jan 25 09:17:37 Tower dhcpcd[1324]: eth0: writing lease `/var/lib/dhcpcd/dhcpcd-eth0.lease'

Jan 25 09:17:37 Tower dhcpcd[1324]: eth0: IP address 192.168.0.100/24 already exists

Jan 25 09:17:37 Tower dhcpcd[1324]: eth0: executing `/lib/dhcpcd/dhcpcd-run-hooks' RENEW

Jan 25 09:17:38 Tower dhcpcd[1324]: eth0: ARP announcing 192.168.0.100 (1 of 2), next in 2.0 seconds

Jan 25 09:17:40 Tower dhcpcd[1324]: eth0: ARP announcing 192.168.0.100 (2 of 2)

Jan 25 10:17:37 Tower dhcpcd[1324]: eth0: renewing lease of 192.168.0.100

Jan 25 10:17:37 Tower dhcpcd[1324]: eth0: rebind in 2700 seconds, expire in 3600 seconds

Jan 25 10:17:37 Tower dhcpcd[1324]: eth0: sending REQUEST (xid 0x59e7eba), next in 4.4 seconds

Jan 25 10:17:37 Tower dhcpcd[1324]: eth0: acknowledged 192.168.0.100 from 192.168.0.1

Jan 25 10:17:37 Tower dhcpcd[1324]: eth0: leased 192.168.0.100 for 7200 seconds

Jan 25 10:17:37 Tower dhcpcd[1324]: eth0: renew in 3600 seconds, rebind in 6300 seconds

Jan 25 10:17:37 Tower dhcpcd[1324]: eth0: writing lease `/var/lib/dhcpcd/dhcpcd-eth0.lease'

Jan 25 10:17:37 Tower dhcpcd[1324]: eth0: IP address 192.168.0.100/24 already exists

Jan 25 10:17:37 Tower dhcpcd[1324]: eth0: executing `/lib/dhcpcd/dhcpcd-run-hooks' RENEW

Jan 25 10:17:38 Tower dhcpcd[1324]: eth0: ARP announcing 192.168.0.100 (1 of 2), next in 2.0 seconds

Jan 25 10:17:40 Tower dhcpcd[1324]: eth0: ARP announcing 192.168.0.100 (2 of 2)

Jan 25 11:06:29 Tower atd[17756]: File a000050171b4a2 is in wrong format - aborting

Jan 25 11:07:20 Tower emhttp: cmd: /usr/local/emhttp/plugins/dynamix/scripts/tail_log syslog

 

Attempting to generate Diagnostics.zip gives a 404.

 

I can view four days of log under Tools, but attempting to download gives a 404.  The copy-pasted contents are attached.

syslog.txt.zip
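For what it's worth, when the download link 404s, the raw log can still be copied by hand from the console to the flash drive (assuming the flash is mounted at /boot; the destination name is just an example):

mkdir -p /boot/logs
cp /var/log/syslog /boot/logs/syslog-$(date +%Y%m%d-%H%M).txt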


After this had happened to me on the 21st (see above), I came back today from a long weekend away and the server was in the same state.

 

Please find attached both syslogs (21st and 26th)

 

The shfs process takes 100% CPU.

I can telnet to the server and even browse around the flash and cache disk, but not the array. If I hit the array the telnet session freezes up as well.

The GUI is not available.
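One way to poke at the user shares without losing the whole telnet session is to background the test command, so the shell stays usable even if it hangs (a sketch, nothing unRAID-specific):

ls /mnt/user > /tmp/usertest 2>&1 &     # run the listing in the background
sleep 10; jobs -l                        # still shown as Running after 10 seconds means shfs is not answering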

 

Hardware is in the Sig.

unRaid 6.1.6

 

Dockers:

binhex-delugevpn

crashplan

NodeRed

PlexMediaServer (needo)

Sonarr (linuxserver)

syslogs.zip


By any chance, are your drives formatted with ReiserFS?  I can tell you that I had issues similar to yours back in October and November of last year.  I provided htop screenshots showing shfs, logs and diagnostics, etc., and no one was able to recreate my situation.  I did see several other users' posts indicating that conversion to XFS solved their issue, so I went through the process of converting my disks to XFS in November of 2015 and haven't had an issue since.

 

For converting to XFS I followed this posting:  http://lime-technology.com/forum/index.php?topic=37490.msg346739#msg346739
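If you are not sure which filesystem your data disks are on, it shows on the Main page, or from the console something like this should tell you (assuming the array is started and the disks are mounted at /mnt/diskN as usual):

df -T /mnt/disk*        # the Type column shows reiserfs vs. xfs or btrfs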

 

Dan


Just some info on whether it's Docker causing the problem ...

 

The first time I tried to upgrade from v5 to v6, this past summer, and began having the problem, I was running as vanilla a setup as you could imagine ... no Docker containers running, no plugins, nothing but an array with some shares.

 

Also, while this began happening again after my second attempt to upgrade, and I did eventually start using some Docker containers, the hanging problem predated their installation.  And like the other user stated ... even after the unRAID GUI becomes unresponsive my Docker applications are still accessible ... though oddly only from certain machines on my LAN.

 

And another note ... since I posted about running top and the expected freeze-up not occurring at the three-day mark, the server is still up, with top running continuously.

 

  • 2 weeks later...

When I had similar symptoms, I found through the console that my root folder / was 100% full.

I rebooted into safe mode and had no problems.
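For anyone who wants to check for the same thing before it gets that far (a sketch; on unRAID the root filesystem lives in RAM, so a runaway log or a container/plugin writing to the wrong path can quietly fill it):

df -h /                                       # how full the root filesystem is
du -xh / 2>/dev/null | sort -h | tail -20     # largest directories on the root filesystem only (-x stays off /mnt and other mounts)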

 

Then I removed all plugins and started to reinstall them one at a time.

I am still running rock stable, with more plugins than before.

 

So I didn't find the cause, but I found a solution with the help of the forum!

http://lime-technology.com/forum/index.php?topic=44969.0

