unRAID unresponsive - shfs at 100% CPU


P_K


This morning I got a notification (from PlexPy) that my Plex server is down. I tried to open the unRAID web page to check, but it doesn't open. I can SSH into the server and can see in top that shfs is using 100% CPU.

 

Any ideas to bring my server back to normal?

 

unRAID 6.2.4 with several Docker containers.

[Attached screenshot: Screen_Shot_2017-01-21_at_09_29_13.png]
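For anyone else hitting this with the GUI down, here's a quick non-interactive way to confirm over SSH which process is pegging the CPU. This is a generic procps sketch, not anything unRAID-specific:

```shell
# Batch-style process snapshot sorted by CPU usage; the header line plus
# the top few entries are enough to spot shfs pegging a core.
ps -eo pid,pcpu,etime,comm --sort=-pcpu | head -n 10
```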

  • 4 weeks later...

I have this issue also, never seen it before.

 

top - 13:52:17 up 8 days, 14:25,  2 users,  load average: 12.60, 11.40, 7.22
Tasks: 303 total,   1 running, 302 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.7 us, 25.7 sy,  0.0 ni, 73.6 id,  0.0 wa,  0.0 hi,  0.1 si,  0.0 st
KiB Mem : 12065688 total,   265184 free,  4316796 used,  7483708 buff/cache
KiB Swap:        0 total,        0 free,        0 used.  6720088 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
2723 root      20   0 1371464  22564    816 S 101.0  0.2 103:54.83 shfs

 

 

GUI unresponsive, shares failing to respond.

 

Nothing in the logs:

Feb 15 12:56:35 Tower kernel: mdcmd (178): spindown 3
Feb 15 12:57:20 Tower kernel: mdcmd (179): spindown 5
Feb 15 13:11:12 Tower kernel: mdcmd (180): spindown 0
Feb 15 13:37:39 Tower emhttp: shcmd (113218): mkdir '/mnt/user/Home Videos' |& logger
Feb 15 13:37:39 Tower emhttp: shcmd (113219): chmod 0777 '/mnt/user/Home Videos'
Feb 15 13:37:39 Tower emhttp: shcmd (113220): chown 'nobody':'users' '/mnt/user/Home Videos'
Feb 15 13:37:52 Tower emhttp: shcmd (113248): smbcontrol smbd close-share 'Home Videos'
Feb 15 13:38:05 Tower emhttp: shcmd (113276): smbcontrol smbd close-share 'Home Videos'
Feb 15 13:42:17 Tower sshd[18890]: Accepted password for root from 192.168.1.31 port 49922 ssh2
Feb 15 13:47:44 Tower sshd[21497]: Accepted password for root from 192.168.1.31 port 49998 ssh2

 

 

As you can see, I had just created a new share and was moving data from an existing share to the new one.


Seeing this in syslog.

 

Feb 15 20:49:00 uNAS shfs/user: err: get_key_info: get_message: /boot/config/._Plus.key (-3)

 

That's a red herring. I'd lay money on the fact that you prepared your boot device on a Mac and copied the licence key using the Finder. That created the ._Plus.key file alongside the original Plus.key file. unRAID just sees the file as a spurious alternative key file and rejects it. You might want to delete it:

 

rm /boot/config/._Plus.key
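If the flash drive was ever written to from a Mac, there may be more AppleDouble files than just the key copy. A cautious sketch (list first, then delete), assuming the stock /boot flash mount:

```shell
# Guarded so this is a no-op on systems without the unRAID flash mount.
if [ -d /boot/config ]; then
  # List any macOS AppleDouble files before touching anything:
  find /boot/config -name '._*' -print
  # Once you have confirmed nothing legitimate matches, remove them:
  find /boot/config -name '._*' -delete
fi
```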

 

Is there an easy way to roll back to 6.3 or 6.2.x from the command line?

 

Yes:

 

cp /boot/previous/bz* /boot

 

and then reboot.
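A quick sanity check after the copy, before rebooting. This assumes the stock bzimage/bzroot file names and simply compares the restored images against the previous-release copies:

```shell
# Each image in /boot should now be byte-identical to its counterpart in
# /boot/previous; cmp -s is silent and reports only via its exit status.
for f in bzimage bzroot; do
  if cmp -s "/boot/$f" "/boot/previous/$f"; then
    echo "$f OK"
  else
    echo "$f DIFFERS (copy may have failed)"
  fi
done
```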

 


I'm seeing the same thing, no cache_dirs plugin here, and yes there are ReiserFS shares.  Seemed to happen when I tried to delete a bunch of files, but that may be coincidental.

 

top - 16:15:19 up 4 days,  2:11,  1 user,  load average: 7.02, 7.16, 7.17
Tasks: 420 total,   2 running, 417 sleeping,   0 stopped,   1 zombie
%Cpu(s):  0.5 us, 13.3 sy,  0.0 ni, 86.2 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem : 16410152 total,   921220 free,  1721092 used, 13767840 buff/cache
KiB Swap:        0 total,        0 free,        0 used. 13508808 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
6464 root      20   0 1031988   7600    744 S  99.7  0.0 150:20.69 shfs

 

Nothing interesting in syslog, GUI unresponsive:

Feb 16 01:20:32 Tower kernel: mdcmd (151): spindown 6
Feb 16 01:20:33 Tower kernel: mdcmd (152): spindown 7
Feb 16 01:20:34 Tower kernel: mdcmd (153): spindown 8
Feb 16 01:20:34 Tower kernel: mdcmd (154): spindown 9
Feb 16 01:20:35 Tower kernel: mdcmd (155): spindown 10
Feb 16 01:20:41 Tower kernel: mdcmd (156): spindown 11
Feb 16 03:30:28 Tower kernel: mdcmd (157): spindown 3
Feb 16 03:30:57 Tower kernel: mdcmd (158): spindown 4
Feb 16 03:31:05 Tower kernel: mdcmd (159): spindown 11
Feb 16 03:31:08 Tower kernel: mdcmd (160): spindown 0
Feb 16 03:31:08 Tower kernel: mdcmd (161): spindown 2
Feb 16 03:31:08 Tower kernel: mdcmd (162): spindown 6
Feb 16 03:31:09 Tower kernel: mdcmd (163): spindown 7
Feb 16 03:31:10 Tower kernel: mdcmd (164): spindown 8
Feb 16 03:31:10 Tower kernel: mdcmd (165): spindown 9
Feb 16 03:31:11 Tower kernel: mdcmd (166): spindown 10
Feb 16 03:31:13 Tower kernel: mdcmd (167): spindown 1
Feb 16 03:31:14 Tower kernel: mdcmd (168): spindown 5
Feb 16 10:42:14 Tower in.telnetd[7649]: connect from 192.168.1.10 (192.168.1.10)
Feb 16 10:42:21 Tower login[7650]: ROOT LOGIN  on '/dev/pts/0' from '192.168.1.10'
Feb 16 11:49:40 Tower kernel: mdcmd (169): spindown 1
Feb 16 11:49:41 Tower kernel: mdcmd (170): spindown 5
Feb 16 12:16:15 Tower kernel: mdcmd (171): spindown 4
Feb 16 12:16:15 Tower kernel: mdcmd (172): spindown 6
Feb 16 12:16:16 Tower kernel: mdcmd (173): spindown 7
Feb 16 12:16:17 Tower kernel: mdcmd (174): spindown 8
Feb 16 12:16:17 Tower kernel: mdcmd (175): spindown 9
Feb 16 12:16:18 Tower kernel: mdcmd (176): spindown 10
Feb 16 12:16:18 Tower kernel: mdcmd (177): spindown 11
Feb 16 13:18:43 Tower kernel: mdcmd (178): spindown 1
Feb 16 13:18:44 Tower kernel: mdcmd (179): spindown 5
Feb 16 14:11:27 Tower in.telnetd[27488]: connect from 192.168.1.10 (192.168.1.10)
Feb 16 14:11:31 Tower login[27489]: ROOT LOGIN  on '/dev/pts/3' from '192.168.1.10'


Also - any suggestions on how to cleanly reboot at this point? Nothing seems to do anything: poweroff, powerdown, shutdown and reboot just produce syslog messages:

Feb 16 16:21:24 Tower shutdown[31150]: shutting down for system halt
Feb 16 16:22:04 Tower in.telnetd[31777]: connect from 192.168.1.10 (192.168.1.10)
Feb 16 16:22:10 Tower login[31779]: ROOT LOGIN  on '/dev/pts/0' from '192.168.1.10'
Feb 16 16:22:15 Tower shutdown[31948]: shutting down for system halt
Feb 16 16:22:31 Tower in.telnetd[32156]: connect from 192.168.1.10 (192.168.1.10)
Feb 16 16:22:35 Tower login[32158]: ROOT LOGIN  on '/dev/pts/1' from '192.168.1.10'
Feb 16 16:22:38 Tower root: /usr/local/sbin/powerdown has been deprecated
Feb 16 16:22:38 Tower shutdown[32240]: shutting down for system halt
Feb 16 16:23:36 Tower in.telnetd[659]: connect from 192.168.1.10 (192.168.1.10)
Feb 16 16:23:39 Tower login[660]: ROOT LOGIN  on '/dev/pts/4' from '192.168.1.10'
Feb 16 16:23:49 Tower root: /usr/local/sbin/powerdown has been deprecated
Feb 16 16:23:49 Tower shutdown[946]: shutting down for system reboot
Feb 16 16:24:42 Tower in.telnetd[1878]: connect from 192.168.1.10 (192.168.1.10)
Feb 16 16:24:46 Tower login[1879]: ROOT LOGIN  on '/dev/pts/5' from '192.168.1.10'
Feb 16 16:24:47 Tower shutdown[1947]: shutting down for system reboot

 

Unless someone has an idea on how to deal with this, I'll have to go in via IPMI and reset the server, which I'm sure will mean a nice day-long parity check.  Starting to regret this upgrade...



There is a 90-second timeout before it kills everything if it has to. Not sure if that timeout gets reset whenever you keep banging on it as you have done here.


 

Appreciate the quick response.  I issued a 'poweroff' that has been sitting in a window untouched (unbanged?) for 5 minutes now, and it isn't doing anything.  The server doesn't want to die, apparently.  Any other ideas?

 

Just another (minor) data point: even though the primary GUI is unresponsive, I have a half dozen containers that are all responding like nothing is happening.


No reiser messages in syslog after the initial boot stuff some days ago.  But your question adds to my suspicion that I need to go down the path of migrating to XFS rather soon.  I did a reiserfsck on all the disks after a recent kernel oops (another thread) and it came back clean.

 

Looks like the system decided for me.  While troubleshooting I did a 'lsof' that hung, and I wasn't able to log back in from either telnet or console, so I've reset and it's now checking that dirty, dirty filesystem.  So much for that I guess.  Still would like to understand why shfs decided to go crazy.


I don't think anyone knows at the moment, but Lime Tech are well aware of the problem. Since the user file systems are presented as an aggregation of several different filesystems (/mnt/user/share = /mnt/cache/share + /mnt/disk1/share + /mnt/disk2/share + ... ) an issue affecting one of the components is going to have a bad effect on that user share. There are known problems with ReiserFS in recent kernels, in particular its handling of extended attributes seems to be broken. I don't have any Reiser-formatted disks but if I did I would be looking to migrate the data to XFS.
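The aggregation is easy to see from the shell. A sketch using a hypothetical share name 'Movies': each component device contributes its own top-level folder, and the merged view shfs presents under /mnt/user is roughly the de-duplicated union of the component listings (shfs also applies allocation rules, not shown here):

```shell
# Union of the per-device listings for one share; compare this against
# what 'ls /mnt/user/Movies' shows in the merged view.
for d in /mnt/cache /mnt/disk1 /mnt/disk2; do
  ls "$d/Movies" 2>/dev/null
done | sort -u
```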


This is happening to me also, but all my filesystems are XFS. My server will run for about 2 days, then the load goes through the roof and I can no longer run any commands or shut it down. I have to hold the power button to restart.

 

You may well have a different problem with similar symptoms so start a new thread and attach your diagnostics zip (Tools -> Diagnostics).

 


I upgraded to 6.3.1 and had the same thing happen last night. Definitely a new problem, since 6.2.4 never did this.

 

unRAID pushed ReiserFS as the only filesystem for years so it's BS to now say that existing disks won't work and need to be changed to XFS.

Not much different from Windows XP effectively forcing you to convert hard drives from FAT32 to NTFS.

Didn't Microsoft provide a tool that converted FAT32 file systems to NTFS in situ without losing very many of the files (meaning that quite a sizeable amount of free working space had to be present on the disk for the process to succeed)? OTOH it's hardly fair to blame LT for the fact that ReiserFS is no longer maintained.

