unRAID unresponsive - shfs 100% cpu

P_K · January 21, 2017

This morning I got a notification (from PlexPy) that my Plex server was down. I tried to open the unRAID web GUI to check, but it won't load. I can SSH into the server, and top shows shfs using 100% CPU.

Any ideas on how to bring my server back to normal?

unRAID 6.2.4 with several Docker containers.

testdasi · January 21, 2017

Do you have the cache_dirs plugin installed with its user share option turned on? I vaguely remember shfs using CPU every time it runs, but nothing near 100%. What's your server spec?

P_K · January 21, 2017

cache_dirs is installed, but I turned it off a long time ago as I had some issues with it.

This is an i5 processor with 16 GB of memory.

trurl · January 21, 2017

Go to Tools -> Diagnostics and post the complete zip.

P_K · January 21, 2017

I restarted the server because the load average in top went up to 50. The diagnostics were taken after the reboot (before it I couldn't access the GUI to collect them). Hopefully they are still useful.

tower-diagnostics-20170121-2053.zip

JonathanM · January 21, 2017

Do you have any ReiserFS formatted drives?
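If the GUI is down but SSH still works, one way to check is to read the mount table. This is only a sketch: the function name list_reiserfs_mounts is made up for illustration, and it assumes the array disks show up in /proc/mounts with "reiserfs" as the filesystem type.

```shell
# List mount points whose filesystem type is reiserfs.
# Argument: a mount table in /proc/mounts format (defaults to /proc/mounts).
# The function name is hypothetical, not part of unRAID.
list_reiserfs_mounts() {
    mounts_file="${1:-/proc/mounts}"
    # In /proc/mounts, field 2 is the mount point and field 3 the filesystem type.
    awk '$3 == "reiserfs" { print $2 }' "$mounts_file"
}
```

Calling list_reiserfs_mounts with no argument reads the live mount table; empty output would mean no ReiserFS filesystem is currently mounted.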

P_K · January 22, 2017

Yes, I do indeed. See attachment.

Marvel · February 13, 2017

I have this issue as well. How can I kill the process and restart? Running kill on the PID doesn't work.

fireplex · February 15, 2017

I have this issue also, never seen it before.

top - 13:52:17 up 8 days, 14:25,  2 users,  load average: 12.60, 11.40, 7.22
Tasks: 303 total,   1 running, 302 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.7 us, 25.7 sy,  0.0 ni, 73.6 id,  0.0 wa,  0.0 hi,  0.1 si,  0.0 st
KiB Mem : 12065688 total,   265184 free,  4316796 used,  7483708 buff/cache
KiB Swap:        0 total,        0 free,        0 used.  6720088 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
2723 root      20   0 1371464  22564    816 S 101.0  0.2 103:54.83 shfs

GUI unresponsive, shares failing to respond.

Nothing in the logs:

Feb 15 12:56:35 Tower kernel: mdcmd (178): spindown 3
Feb 15 12:57:20 Tower kernel: mdcmd (179): spindown 5
Feb 15 13:11:12 Tower kernel: mdcmd (180): spindown 0
Feb 15 13:37:39 Tower emhttp: shcmd (113218): mkdir '/mnt/user/Home Videos' |& logger
Feb 15 13:37:39 Tower emhttp: shcmd (113219): chmod 0777 '/mnt/user/Home Videos'
Feb 15 13:37:39 Tower emhttp: shcmd (113220): chown 'nobody':'users' '/mnt/user/Home Videos'
Feb 15 13:37:52 Tower emhttp: shcmd (113248): smbcontrol smbd close-share 'Home Videos'
Feb 15 13:38:05 Tower emhttp: shcmd (113276): smbcontrol smbd close-share 'Home Videos'
Feb 15 13:42:17 Tower sshd[18890]: Accepted password for root from 192.168.1.31 port 49922 ssh2
Feb 15 13:47:44 Tower sshd[21497]: Accepted password for root from 192.168.1.31 port 49998 ssh2

As you can see, I had just created a new share and was moving data from an existing share to the new one.

deadsoulz · February 16, 2017

This is happening to me as well. 2 days after latest update.

deadsoulz · February 16, 2017

Seeing this in syslog.

Feb 15 20:49:00 uNAS shfs/user: err: get_key_info: get_message: /boot/config/._Plus.key (-3)

Is there an easy way to roll back to 6.3 or 6.2.x from the command line?

John_M · February 16, 2017

Seeing this in syslog.

Feb 15 20:49:00 uNAS shfs/user: err: get_key_info: get_message: /boot/config/._Plus.key (-3)

That's a red herring. I'd lay money on the fact that you prepared your boot device on a Mac and copied the licence key using the Finder. That created the ._Plus.key file alongside the original Plus.key file. unRAID just sees the file as a spurious alternative key file and rejects it. You might want to delete it:

rm /boot/config/._Plus.key
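Before deleting anything, it may be worth listing every such "._" companion file first, since the Finder tends to create one per copied file on a FAT-formatted drive. A minimal sketch, assuming a find that supports -maxdepth; the function name list_appledouble_files is hypothetical:

```shell
# Print macOS AppleDouble companion files ("._name") directly inside a directory.
# Argument: directory to scan (defaults to /boot/config, as in the path above).
# The function name is hypothetical; it only lists files, it deletes nothing.
list_appledouble_files() {
    dir="${1:-/boot/config}"
    find "$dir" -maxdepth 1 -type f -name '._*'
}
```

Anything it prints can then be reviewed and removed by hand with rm.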

Is there an easy way to roll back to 6.3 or 6.2.x from the command line?

Yes:

cp /boot/previous/bz* /boot

and then reboot.
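If you want a guard against an empty or missing /boot/previous, the same copy can be wrapped in a small check. This is just a sketch of the command above, not an official procedure; the function name restore_previous_bz is made up, and it assumes a POSIX-style shell such as bash where an unmatched glob is left as literal text.

```shell
# Copy the previous release's bz* files over the current ones, but only
# if the source directory actually contains some.
# Arguments: source dir and destination dir (defaults mirror the cp command above).
# The function name is hypothetical.
restore_previous_bz() {
    src="${1:-/boot/previous}"
    dst="${2:-/boot}"
    # Expand the glob into the positional parameters; if nothing matches,
    # "$1" is the literal pattern and does not exist as a file.
    set -- "$src"/bz*
    if [ ! -e "$1" ]; then
        echo "no bz* files found in $src" >&2
        return 1
    fi
    cp "$@" "$dst"/
}
```

It returns non-zero without copying anything when no bz* files are found, so a reboot can be made conditional on it succeeding.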

DavejaVu · February 16, 2017

I'm seeing the same thing, no cache_dirs plugin here, and yes there are ReiserFS shares. Seemed to happen when I tried to delete a bunch of files, but that may be coincidental.

top - 16:15:19 up 4 days,  2:11,  1 user,  load average: 7.02, 7.16, 7.17
Tasks: 420 total,   2 running, 417 sleeping,   0 stopped,   1 zombie
%Cpu(s):  0.5 us, 13.3 sy,  0.0 ni, 86.2 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem : 16410152 total,   921220 free,  1721092 used, 13767840 buff/cache
KiB Swap:        0 total,        0 free,        0 used. 13508808 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
6464 root      20   0 1031988   7600    744 S  99.7  0.0 150:20.69 shfs

Nothing interesting in syslog, GUI unresponsive:

Feb 16 01:20:32 Tower kernel: mdcmd (151): spindown 6
Feb 16 01:20:33 Tower kernel: mdcmd (152): spindown 7
Feb 16 01:20:34 Tower kernel: mdcmd (153): spindown 8
Feb 16 01:20:34 Tower kernel: mdcmd (154): spindown 9
Feb 16 01:20:35 Tower kernel: mdcmd (155): spindown 10
Feb 16 01:20:41 Tower kernel: mdcmd (156): spindown 11
Feb 16 03:30:28 Tower kernel: mdcmd (157): spindown 3
Feb 16 03:30:57 Tower kernel: mdcmd (158): spindown 4
Feb 16 03:31:05 Tower kernel: mdcmd (159): spindown 11
Feb 16 03:31:08 Tower kernel: mdcmd (160): spindown 0
Feb 16 03:31:08 Tower kernel: mdcmd (161): spindown 2
Feb 16 03:31:08 Tower kernel: mdcmd (162): spindown 6
Feb 16 03:31:09 Tower kernel: mdcmd (163): spindown 7
Feb 16 03:31:10 Tower kernel: mdcmd (164): spindown 8
Feb 16 03:31:10 Tower kernel: mdcmd (165): spindown 9
Feb 16 03:31:11 Tower kernel: mdcmd (166): spindown 10
Feb 16 03:31:13 Tower kernel: mdcmd (167): spindown 1
Feb 16 03:31:14 Tower kernel: mdcmd (168): spindown 5
Feb 16 10:42:14 Tower in.telnetd[7649]: connect from 192.168.1.10 (192.168.1.10)
Feb 16 10:42:21 Tower login[7650]: ROOT LOGIN  on '/dev/pts/0' from '192.168.1.10'
Feb 16 11:49:40 Tower kernel: mdcmd (169): spindown 1
Feb 16 11:49:41 Tower kernel: mdcmd (170): spindown 5
Feb 16 12:16:15 Tower kernel: mdcmd (171): spindown 4
Feb 16 12:16:15 Tower kernel: mdcmd (172): spindown 6
Feb 16 12:16:16 Tower kernel: mdcmd (173): spindown 7
Feb 16 12:16:17 Tower kernel: mdcmd (174): spindown 8
Feb 16 12:16:17 Tower kernel: mdcmd (175): spindown 9
Feb 16 12:16:18 Tower kernel: mdcmd (176): spindown 10
Feb 16 12:16:18 Tower kernel: mdcmd (177): spindown 11
Feb 16 13:18:43 Tower kernel: mdcmd (178): spindown 1
Feb 16 13:18:44 Tower kernel: mdcmd (179): spindown 5
Feb 16 14:11:27 Tower in.telnetd[27488]: connect from 192.168.1.10 (192.168.1.10)
Feb 16 14:11:31 Tower login[27489]: ROOT LOGIN  on '/dev/pts/3' from '192.168.1.10'

DavejaVu · February 16, 2017

Also - any suggestions on how to cleanly reboot at this point? Nothing seems to do anything: poweroff, powerdown, shutdown, reboot just give syslog messages:

Feb 16 16:21:24 Tower shutdown[31150]: shutting down for system halt
Feb 16 16:22:04 Tower in.telnetd[31777]: connect from 192.168.1.10 (192.168.1.10)
Feb 16 16:22:10 Tower login[31779]: ROOT LOGIN  on '/dev/pts/0' from '192.168.1.10'
Feb 16 16:22:15 Tower shutdown[31948]: shutting down for system halt
Feb 16 16:22:31 Tower in.telnetd[32156]: connect from 192.168.1.10 (192.168.1.10)
Feb 16 16:22:35 Tower login[32158]: ROOT LOGIN  on '/dev/pts/1' from '192.168.1.10'
Feb 16 16:22:38 Tower root: /usr/local/sbin/powerdown has been deprecated
Feb 16 16:22:38 Tower shutdown[32240]: shutting down for system halt
Feb 16 16:23:36 Tower in.telnetd[659]: connect from 192.168.1.10 (192.168.1.10)
Feb 16 16:23:39 Tower login[660]: ROOT LOGIN  on '/dev/pts/4' from '192.168.1.10'
Feb 16 16:23:49 Tower root: /usr/local/sbin/powerdown has been deprecated
Feb 16 16:23:49 Tower shutdown[946]: shutting down for system reboot
Feb 16 16:24:42 Tower in.telnetd[1878]: connect from 192.168.1.10 (192.168.1.10)
Feb 16 16:24:46 Tower login[1879]: ROOT LOGIN  on '/dev/pts/5' from '192.168.1.10'
Feb 16 16:24:47 Tower shutdown[1947]: shutting down for system reboot

Unless someone has an idea on how to deal with this, I'll have to go in via IPMI and reset the server, which I'm sure will mean a nice day-long parity check. Starting to regret this upgrade...

trurl · February 16, 2017

There is a 90-second timeout before it kills everything if it has to. Not sure if that timeout gets reset whenever you keep banging on it as you have done here.

DavejaVu · February 16, 2017

Appreciate the quick response. I issued a 'poweroff' that has been sitting in a window untouched (unbanged?) for 5 minutes now that isn't doing anything. Server doesn't want to die apparently. Any other ideas?

Just another (minor) data point: even though the primary GUI is unresponsive, I have a half dozen containers that are all responding like nothing is happening.

John_M · February 16, 2017

Are there any ReiserFS errors mentioned in the syslog?
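A quick way to look, sketched under the assumption that the live log is at /var/log/syslog (the function name reiserfs_log_lines is hypothetical):

```shell
# Print every syslog line that mentions "reiser", case-insensitively.
# Argument: log file to search (assumed default: /var/log/syslog).
# Returns non-zero when there are no matching lines, as grep does.
reiserfs_log_lines() {
    log_file="${1:-/var/log/syslog}"
    grep -i 'reiser' "$log_file"
}
```

Note that this also matches the harmless mount messages logged at boot, so the output still needs reading rather than just counting.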

DavejaVu · February 16, 2017

No reiser messages in syslog after the initial boot stuff some days ago. But your question adds to my suspicion that I need to go down the path of migrating to XFS rather soon. I did a reiserfsck on all the disks after a recent kernel oops (another thread), and it came back clean.

Looks like the system decided for me. While troubleshooting I did a 'lsof' that hung, and I wasn't able to log back in from either telnet or console, so I've reset and it's now checking that dirty, dirty filesystem. So much for that I guess. Still would like to understand why shfs decided to go crazy.

John_M · February 16, 2017

I don't think anyone knows at the moment, but Lime Tech are well aware of the problem. Since the user file systems are presented as an aggregation of several different filesystems (/mnt/user/share = /mnt/cache/share + /mnt/disk1/share + /mnt/disk2/share + ... ) an issue affecting one of the components is going to have a bad effect on that user share. There are known problems with ReiserFS in recent kernels, in particular its handling of extended attributes seems to be broken. I don't have any Reiser-formatted disks but if I did I would be looking to migrate the data to XFS.
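To see which underlying disks actually contribute to a given user share, something along these lines should work from the command line. It is a sketch only: the function name list_share_components is made up, and it assumes the cache and array disks are mounted as /mnt/cache and /mnt/disk1, /mnt/disk2, ... as in the layout described above.

```shell
# Print each per-disk directory that contributes to a user share.
# Arguments: share name, and the mount root (assumed to be /mnt).
# The function name is hypothetical.
list_share_components() {
    share="$1"
    root="${2:-/mnt}"
    for d in "$root"/cache "$root"/disk[0-9]*; do
        # An unmatched glob stays literal and simply fails the -d test.
        if [ -d "$d/$share" ]; then
            echo "$d/$share"
        fi
    done
}
```

For example, list_share_components 'Home Videos' would print one line per disk holding part of that share, which narrows down which filesystems could be dragging the user share down.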

the_larizzo · February 17, 2017

This is happening to me also, but all my filesystems are XFS. My server will run for about 2 days, then the load goes through the roof and I can no longer run any commands or shut down. I have to hold the power button to restart.

John_M · February 17, 2017

You may well have a different problem with similar symptoms so start a new thread and attach your diagnostics zip (Tools -> Diagnostics).

zeroryu · February 17, 2017

I'm having a similar issue:

shfs at 100% and all my shares/drives are ReiserFS.

poweroff and reboot are unresponsive. The only option I have is to manually turn off my server, which triggers a 10+ hour parity check. On top of that, I'm also getting some errors when the parity check finishes.

tower-diagnostics-20170216-2125.zip

lionelhutz · February 18, 2017

I upgraded to 6.3.1 and had the same thing happen last night. Definitely a new problem introduced since 6.2.4 never did this.

unRAID pushed ReiserFS as the only filesystem for years so it's BS to now say that existing disks won't work and need to be changed to XFS.

Squid · February 18, 2017

Not too different from Windows XP effectively forcing you to change hard drives from FAT32 to NTFS.

John_M · February 18, 2017

Didn't Microsoft provide a tool that allowed FAT32 file systems to be converted to NTFS with files in situ without losing very many of them (meaning that quite a sizeable amount of free working space had to be present on the disk for the process to be successful)? OTOH it's hardly fair to blame LT for the fact that ReiserFS is no longer maintained.
