HELP! Zombie Mode Again



Hopefully someone will see this and give me a suggestion ...

 

I recently upgraded to version 6 after running version 5 for several years, and I find myself experiencing a serious problem.

 

After being up and running for a little over four (4) days, my unRAID server has suddenly gone into a sort of zombie mode.  The web GUI won't respond at all and shares are not available ... though oddly Sabnzbd, running in a Docker container on the same machine, is still responding, and Sonarr seems to be accessible from some machines on my LAN but not others.

 

The machine is responding to pings and the console is taking basic commands for the time being.  I was going to restart the server from the console, but I recall that when I experienced the same problem during an earlier aborted upgrade attempt, back when 6 was still in beta, all the logs got cleared ...

 

Any suggestions on what to do at this point?

 

UPDATE: Like I said, I can still run commands from the console, and after reading some other posts I ran top and saw this one process:

 

PID   USER  PR  NI  VIRT  RES   SHR  S  %CPU  %MEM  TIME    COMMAND
1857  root  20   0  546m  4308  786  S   100   0.1  776.32  shfs

 

The TIME value is constantly increasing ...

 

Not sure what I'm looking at, but I'm guessing this process has spiked my CPU?
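For reference, a rough way to keep a record of that shfs line across a hang (just a sketch, assuming the stock procps top and that the flash drive is mounted at /boot as usual, so the file survives a hard reset; the log file name is just something I made up):

top -b -n 1 | grep shfs                                                          # one non-interactive snapshot of the shfs line
while true; do date >> /boot/shfs-cpu.log; top -b -n 1 | grep shfs >> /boot/shfs-cpu.log; sleep 60; done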

 

 

UPDATE II:  Oh well, just as with my last upgrade attempt, the entire server eventually became unresponsive while I was reading the documentation and trying to gather more information.  The console stopped responding, as did Sabnzbd.

 

Hard Reset

  • 3 weeks later...

Yes, same here.

 

Happened a few times now.

A bit annoying, always ends in a hard reset.

 

Can anybody suggest where I can start looking to figure this out?

This time I copied the syslog. It seems to be the mover...

 

Jan 21 00:02:46 Tower kernel: mdcmd (280): spindown 0
Jan 21 00:09:44 Tower kernel: mdcmd (281): spindown 1
Jan 21 01:02:14 Tower kernel: mdcmd (282): spindown 1
Jan 21 02:18:29 Tower kernel: mdcmd (283): spindown 1
Jan 21 03:18:05 Tower kernel: mdcmd (284): spindown 1
Jan 21 03:40:01 Tower logger: mover started
Jan 21 03:40:01 Tower logger: skipping "Apps"
Jan 21 03:40:01 Tower logger: skipping "IoT"
Jan 21 03:40:01 Tower logger: moving "UserShares"
Jan 21 03:40:01 Tower logger: ./UserShares/Roland/rolan/OBELIX/Data/C/Users/rolan/Tracing/WPPMedia/Skype_MediaStackETW-6.0.8948.320-lcsmedia_vnext_release4(rtbldlab)-x86fre-U (2016_01_20 10_26_30 UTC).etl
Jan 21 03:55:02 Tower kernel: mdcmd (285): spindown 1
Jan 21 20:17:59 Tower php: /usr/local/emhttp/plugins/dynamix.docker.manager/scripts/docker 'restart' 'PlexMediaServer'
Jan 21 20:23:05 Tower in.telnetd[17844]: connect from 192.168.2.8 (192.168.2.
Jan 21 20:23:07 Tower login[17845]: ROOT LOGIN  on '/dev/pts/0' from 'Obelix'
Jan 21 20:23:51 Tower in.telnetd[17990]: connect from 192.168.2.8 (192.168.2.
Jan 21 20:23:58 Tower login[17991]: ROOT LOGIN  on '/dev/pts/1' from 'Obelix'

 

When I restarted the Plex Docker container (because it did not stream anymore), the web GUI became unresponsive.

I could still telnet into the server and it would respond as long as I did not need the array.

ls /mnt/user

would not work anymore

but

ls /mnt/cache/Apps

was fine

 

The mover was still running at 8 pm on 21/01:

root     32467  1478  0 03:40 ?        00:00:00 /bin/sh -c /usr/local/sbin/mover |& logger
root     32468 32467  0 03:40 ?        00:00:00 /bin/bash /usr/local/sbin/mover
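For anyone else digging into this, a rough way to tell whether those processes are doing work or just stuck waiting on the array (a sketch, assuming the usual procps ps; a STAT of D means uninterruptible I/O wait, and the WCHAN column shows the kernel function they are sleeping in):

ps -eo pid,stat,wchan:32,args | grep -E '[s]hfs|[m]over'   # the [brackets] keep the grep command itself out of the output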

 

Any ideas?

 

 

 


A bit disappointed at the lack of response and support for this, as several people seem to have posted elsewhere about similar freeze-ups of v6 ... but I may have stumbled into a workaround.

 

After posting the original message, right on cue just shy of three full days of uptime, my server froze up again and required another hard reset, but unlike the previous event it was too far gone to get logged in.  I started watching very closely after that, and again at just around three days the server became unresponsive, but this time, like the first, I was able to log in at the console before it went full zombie and noted that the same process had my CPU pegged at 100% again.  Since I still had access, I shut it down more gently from there.

 

On reboot this time I thought, rather than chance not being able to get in again when/if it hung, I'd just leave top running on the console so I could check in on it periodically to see what was happening ... that was about 14 days ago and it's been smooth sailing since.

 

Not sure what's happening here, but it seems as though so long as the server is running some task it doesn't nod off.  top is pretty lightweight and doesn't use a lot of resources, so running it constantly doesn't impact anything else; I'll just keep it going and see what happens.
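For anyone who wants to try the same thing, it's literally just stock top left running on the attached console.  The only tweak I'd suggest (my own preference, the workaround doesn't seem to depend on it) is slowing the refresh so it stays as light as possible:

top -d 10        # refresh every 10 seconds instead of the default 3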


(Quoting the workaround post above.)

 

Thanks for posting ... what is the 'top process'?


(Quoting the workaround post above.)

 

The reason there has been no response from us on this is that we have yet to recreate this issue in our own labs, on any equipment we own.  Without the ability to recreate it, we really have no way of diagnosing the problem.  One thing we're hopeful about is that the next release of unRAID will feature a newer version of Docker, which may or may not alleviate this issue.

 

We are definitely paying attention to these threads though and every now and then we do make more efforts to see if we can recreate this problem, but so far, it just hasn't happened yet.

 

What would be helpful is if someone could leave a monitor connected to the unRAID console, log into the console, and run the command:  tail /var/log/syslog -f

 

This will cause the system log to continue outputting directly to that console.  When the system goes into an unresponsive state, check the monitor for any messages and if possible, take a picture with a smartphone or something (make sure it's legible) and upload it here (in fact, post it here AND PM me with it).
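If the console itself tends to lock up before a photo can be taken, one option (assuming the flash drive is mounted at /boot as usual, so the copy survives a reset; the destination path is just an example) is to mirror the log to the flash at the same time:

mkdir -p /boot/logs
tail -f /var/log/syslog | tee /boot/logs/syslog-live.txt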


I'm in this state again.  Wait, today there's a difference: I have no shares or devices visible in the UI, CPU shows 0%, but the start/stop buttons are still present under Main.  Starting and stopping the array brought the devices back but not the shares.  I've updated the attached syslog; it should now include that stop/start event.

 

After a reboot I'll run "tail /var/log/syslog -f" on the console downstairs.  Once it fails again [in four days, if the pattern holds], I'll get you a screenshot, then delete my dockers and try it again.

 

FWIW, right now:

 

Clicking the log button:

Jan 25 07:17:37 Tower dhcpcd[1324]: eth0: renew in 3600 seconds, rebind in 6300 seconds

Jan 25 07:17:37 Tower dhcpcd[1324]: eth0: writing lease `/var/lib/dhcpcd/dhcpcd-eth0.lease'

Jan 25 07:17:37 Tower dhcpcd[1324]: eth0: IP address 192.168.0.100/24 already exists

Jan 25 07:17:37 Tower dhcpcd[1324]: eth0: executing `/lib/dhcpcd/dhcpcd-run-hooks' RENEW

Jan 25 07:17:37 Tower dhcpcd[1324]: eth0: ARP announcing 192.168.0.100 (1 of 2), next in 2.0 seconds

Jan 25 07:17:39 Tower dhcpcd[1324]: eth0: ARP announcing 192.168.0.100 (2 of 2)

Jan 25 07:33:07 Tower dhcpcd[1324]: eth0: xid 0x7cd12051 is for hwaddr d0:23:db:a8:6b:ad:00:00:00:00:00:00:00:00:00:00

Jan 25 08:17:37 Tower dhcpcd[1324]: eth0: renewing lease of 192.168.0.100

Jan 25 08:17:37 Tower dhcpcd[1324]: eth0: rebind in 2700 seconds, expire in 3600 seconds

Jan 25 08:17:37 Tower dhcpcd[1324]: eth0: sending REQUEST (xid 0xca5984da), next in 3.3 seconds

Jan 25 08:17:37 Tower dhcpcd[1324]: eth0: acknowledged 192.168.0.100 from 192.168.0.1

Jan 25 08:17:37 Tower dhcpcd[1324]: eth0: leased 192.168.0.100 for 7200 seconds

Jan 25 08:17:37 Tower dhcpcd[1324]: eth0: renew in 3600 seconds, rebind in 6300 seconds

Jan 25 08:17:37 Tower dhcpcd[1324]: eth0: writing lease `/var/lib/dhcpcd/dhcpcd-eth0.lease'

Jan 25 08:17:37 Tower dhcpcd[1324]: eth0: IP address 192.168.0.100/24 already exists

Jan 25 08:17:37 Tower dhcpcd[1324]: eth0: executing `/lib/dhcpcd/dhcpcd-run-hooks' RENEW

Jan 25 08:17:38 Tower dhcpcd[1324]: eth0: ARP announcing 192.168.0.100 (1 of 2), next in 2.0 seconds

Jan 25 08:17:40 Tower dhcpcd[1324]: eth0: ARP announcing 192.168.0.100 (2 of 2)

Jan 25 09:17:37 Tower dhcpcd[1324]: eth0: renewing lease of 192.168.0.100

Jan 25 09:17:37 Tower dhcpcd[1324]: eth0: rebind in 2700 seconds, expire in 3600 seconds

Jan 25 09:17:37 Tower dhcpcd[1324]: eth0: sending REQUEST (xid 0xa46cfb70), next in 3.5 seconds

Jan 25 09:17:37 Tower dhcpcd[1324]: eth0: acknowledged 192.168.0.100 from 192.168.0.1

Jan 25 09:17:37 Tower dhcpcd[1324]: eth0: leased 192.168.0.100 for 7200 seconds

Jan 25 09:17:37 Tower dhcpcd[1324]: eth0: renew in 3600 seconds, rebind in 6300 seconds

Jan 25 09:17:37 Tower dhcpcd[1324]: eth0: writing lease `/var/lib/dhcpcd/dhcpcd-eth0.lease'

Jan 25 09:17:37 Tower dhcpcd[1324]: eth0: IP address 192.168.0.100/24 already exists

Jan 25 09:17:37 Tower dhcpcd[1324]: eth0: executing `/lib/dhcpcd/dhcpcd-run-hooks' RENEW

Jan 25 09:17:38 Tower dhcpcd[1324]: eth0: ARP announcing 192.168.0.100 (1 of 2), next in 2.0 seconds

Jan 25 09:17:40 Tower dhcpcd[1324]: eth0: ARP announcing 192.168.0.100 (2 of 2)

Jan 25 10:17:37 Tower dhcpcd[1324]: eth0: renewing lease of 192.168.0.100

Jan 25 10:17:37 Tower dhcpcd[1324]: eth0: rebind in 2700 seconds, expire in 3600 seconds

Jan 25 10:17:37 Tower dhcpcd[1324]: eth0: sending REQUEST (xid 0x59e7eba), next in 4.4 seconds

Jan 25 10:17:37 Tower dhcpcd[1324]: eth0: acknowledged 192.168.0.100 from 192.168.0.1

Jan 25 10:17:37 Tower dhcpcd[1324]: eth0: leased 192.168.0.100 for 7200 seconds

Jan 25 10:17:37 Tower dhcpcd[1324]: eth0: renew in 3600 seconds, rebind in 6300 seconds

Jan 25 10:17:37 Tower dhcpcd[1324]: eth0: writing lease `/var/lib/dhcpcd/dhcpcd-eth0.lease'

Jan 25 10:17:37 Tower dhcpcd[1324]: eth0: IP address 192.168.0.100/24 already exists

Jan 25 10:17:37 Tower dhcpcd[1324]: eth0: executing `/lib/dhcpcd/dhcpcd-run-hooks' RENEW

Jan 25 10:17:38 Tower dhcpcd[1324]: eth0: ARP announcing 192.168.0.100 (1 of 2), next in 2.0 seconds

Jan 25 10:17:40 Tower dhcpcd[1324]: eth0: ARP announcing 192.168.0.100 (2 of 2)

Jan 25 11:06:29 Tower atd[17756]: File a000050171b4a2 is in wrong format - aborting

Jan 25 11:07:20 Tower emhttp: cmd: /usr/local/emhttp/plugins/dynamix/scripts/tail_log syslog

 

Attempting to generate Diagnostics.zip gives a 404.

 

I can view four days of log under Tools, but attempting to download gives a 404.  The copy-pasted contents are attached.

syslog.txt.zip
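For what it's worth, when the download link 404s, the raw log can still be copied by hand from the console to the flash drive (assuming the flash is mounted at /boot; the destination name is just an example):

mkdir -p /boot/logs
cp /var/log/syslog /boot/logs/syslog-$(date +%Y%m%d-%H%M).txt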


After this had happened to me on the 21st (see above), I came back today from a long weekend away and the server was in the same state.

 

Please find attached both syslogs (21st and 26th)

 

The shfs process takes 100% CPU.

I can telnet to the server and even browse around the flash and cache disk, but not the array. If I hit the array the telnet session freezes up as well.

The GUI is not available.
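One way to poke at the user shares without losing the whole telnet session is to background the test command, so the shell stays usable even if it hangs (a sketch, nothing unRAID-specific):

ls /mnt/user > /tmp/usertest 2>&1 &     # run the listing in the background
sleep 10; jobs -l                        # still shown as Running after 10 seconds means shfs is not answering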

 

Hardware is in the Sig.

unRaid 6.1.6

 

Dockers:

binhex-delugevpn

crashplan

NodeRed

PlexMediaServer (needo)

Sonarr (linuxserver)

syslogs.zip


By any chance, are your drives formatted with ReiserFS?  I can tell you that I had issues similar to yours back in October and November of last year.  I provided htop screenshots showing shfs, logs and diagnostics, etc., and no one was able to recreate my situation.  I did see several other users' posts indicating that conversion to XFS solved their issue, so I went through the process of converting my disks to XFS in November of 2015 and haven't had an issue since.

 

For converting to XFS I followed this posting:  http://lime-technology.com/forum/index.php?topic=37490.msg346739#msg346739
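If you are not sure which filesystem your data disks are on, it shows on the Main page, or from the console something like this should tell you (assuming the array is started and the disks are mounted at /mnt/diskN as usual):

df -T /mnt/disk*        # the Type column shows reiserfs vs. xfs or btrfs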

 

Dan


Just some info on whether it's Docker causing the problem ...

 

The first time I tried to upgrade from v5 to v6, this past summer, and began having the problem, I was running as vanilla a setup as you could imagine ... no Docker containers running, no plugins, nothing but an array with some shares.

 

Also, while this began happening again after my second attempt to upgrade, and I did eventually start using some Docker containers, the hanging problem predated their installation.  And like the other user stated ... even after the unRAID GUI becomes unresponsive my Docker applications are still accessible ... though oddly only from certain machines on my LAN.

 

And another note ... since I posted about running top and the expected freeze-up not occurring at the three-day mark, the server is still up, with top running continuously.

 

  • 2 weeks later...

When I had similar symptoms, I found through the console that my root folder / was 100% full.

I rebooted into safe mode and had no problems.
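For anyone who wants to check for the same thing before it gets that far (a sketch; on unRAID the root filesystem lives in RAM, so a runaway log or a container/plugin writing to the wrong path can quietly fill it):

df -h /                                       # how full the root filesystem is
du -xh / 2>/dev/null | sort -h | tail -20     # largest directories on the root filesystem only (-x stays off /mnt and other mounts)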

 

Then I removed all plugins and started to reinstall them one at a time.

I am still running rock stable, with more plugins than before.

 

So I didn't find the cause, but I found a solution with the help of the forum!

http://lime-technology.com/forum/index.php?topic=44969.0

