unRAID unresponsive - shfs 100% cpu

P_K:

Well, there is really no mention of ANYTHING to do with this from LT, and apparently an all-XFS system does this too, so I'm not convinced it's an RFS problem.

 

It's always been best to install Windows on a clean partition, so that hardly qualifies as a comparison to having to wipe out many drives' worth of data...


I seem to have a similar problem. Came home yesterday to a completely unresponsive server; had to reset, as nothing else worked. Parity check came back clean. Just a while ago it was unresponsive again: load of 875/875/875, shfs process at 100%. No plugins, no Docker. Don't have logs, as the wife was complaining because Kodi couldn't play files, so I reset again. :(

 

I did the latest update on Thursday, I believe.


Have you considered, either as an aid to troubleshooting or as a temporary workaround, disabling user shares?


I've had this problem since 6.3.0-rc5. It's easy for me to replicate: if I start writing new files to the cache array (which is 2x 250 GB EVOs, BTRFS RAID 1), the server load keeps growing until it hits 50-60, then the VMs start dying, then the Docker apps. If I stop writing to the cache array, the load goes back down to normal (< 1).

I noticed the problem when using the FileBot Docker container to rename (move) files from cache array -> cache array. The first 3-4 files are blazing fast, and then the load just keeps growing. Should I provide some logs while I'm doing the renaming to investigate this further?

Edited by thomast_88

11 hours ago, thomast_88 said:

I've had this problem since 6.3.0-rc5. It's easy for me to replicate: if I start writing new files to the cache array (which is 2x 250 GB EVOs, BTRFS RAID 1), the server load keeps growing until it hits 50-60, then the VMs start dying, then the Docker apps. If I stop writing to the cache array, the load goes back down to normal (< 1).

I noticed the problem when using the FileBot Docker container to rename (move) files from cache array -> cache array. The first 3-4 files are blazing fast, and then the load just keeps growing. Should I provide some logs while I'm doing the renaming to investigate this further?

 

It's not the same problem, but it could be related. This problem causes a CPU core to be pegged at 100% continuously, and the web GUI along with everything else stops responding. It also never recovers.


```
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
14102 root      20   0 1259740  48048    876 S 200.0  0.3 605:19.65 shfs
```

 

- I do have cache_dirs, but not enabled for user shares

- Mostly reiser drives, but a few XFS

- Too frozen to get diagnostics or anything :/
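Since the box is usually too far gone to grab diagnostics by the time anyone notices, one option is a small watchdog that logs shfs's CPU usage somewhere that survives a hard reset. A minimal sketch only: the `/boot/logs` path, the 60-second interval, and the `proc_cpu` helper are my own assumptions, not an unRAID feature:

```shell
#!/bin/bash
# Sketch: periodically record shfs CPU usage to the flash drive, since
# /boot survives a hard reset while /var/log (in RAM) does not.

# Extract the %CPU column for a named process from a `top -bn1` output line.
# top's batch columns: PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
proc_cpu() {
    local name=$1 line=$2
    echo "$line" | awk -v n="$name" '$12 == n {print $9}'
}

# Demo with the line captured earlier in this thread:
sample='14102 root      20   0 1259740  48048    876 S 200.0  0.3 605:19.65 shfs'
proc_cpu shfs "$sample"    # prints 200.0

# Real loop (commented out -- it would run forever on the server):
# while sleep 60; do
#     top -bn1 | awk '$12 == "shfs"' >> /boot/logs/shfs-cpu.log
# done
```

Even one log line per minute would show whether shfs ramps up gradually or pegs instantly before a freeze.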

 


Has anyone found a fix for machines in this state?

 

One of my unRAID machines is having the same issue.

 

Quote:

Has anyone found a fix for machines in this state?

One of my unRAID machines is having the same issue.


Any reiserfs disks?
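For anyone unsure how to answer that, a quick way to check is to read each array disk's filesystem type from the mount table. A sketch assuming unRAID's usual /mnt/diskN mount points (`list_disk_fs` is a hypothetical helper name; the /proc/mounts parsing itself is generic Linux):

```shell
#!/bin/bash
# Sketch: print "mountpoint fstype" for every array disk mount.

# Filter mount-table lines (device mountpoint fstype ...) down to /mnt/diskN.
list_disk_fs() {
    awk '$2 ~ /^\/mnt\/disk[0-9]+$/ {print $2, $3}'
}

# Demo with sample mount entries; real usage: list_disk_fs < /proc/mounts
list_disk_fs <<'EOF'
/dev/md1 /mnt/disk1 reiserfs rw 0 0
/dev/md2 /mnt/disk2 xfs rw 0 0
/dev/sdb1 /mnt/cache btrfs rw 0 0
EOF
# prints:
# /mnt/disk1 reiserfs
# /mnt/disk2 xfs
```

The cache line is deliberately excluded by the regex; on a real system you'd also want to check it, since a Reiser cache disk would see the most writes.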


Was just about to post a separate topic when I found this. Very similar sounding: server locks up every two days or so, Dockers down, can't access the web GUI. Can log in with PuTTY, but can't seem to issue any commands. Have had to pull the power; obviously not good, but nothing else works.

Can I roll back to an older version of unRAID? Any issues doing that?

EDIT: Out of interest, I have two machines; one seems unaffected, but the one I'm referring to has had this issue for a couple of weeks.

Edited by grither


I had this EXACT issue and failed to find the root cause after 30 hours of troubleshooting.

My problem pointed to a software issue, since it happened to me on completely disparate hardware using the same disks.

I built a new machine and mounted the RFS disks under SnapRAID for a stable mount. It wasn't a big deal, since I needed to replace my ancient backup server anyway.

The problem vanished after I migrated to new disks (including cache) and formatted them as XFS.

I emailed Tom with a suggestion that RFS be removed from support completely, but he's convinced that's too extreme.

 

On 3/23/2017 at 11:11 AM, ixnu said:

 

I emailed Tom with a suggestion that RFS be removed from support completely, but he's convinced that's too extreme.

 

 

It doesn't seem to be only RFS systems that have this issue, which points to something else causing it...

 

On 2/16/2017 at 9:29 PM, the_larizzo said:

This is happening to me also, but all my filesystems are XFS. My server will run for about 2 days, then the load goes through the roof and I can no longer run any commands or shut it down. I have to hold the power button to restart.

 


Edited by lionelhutz


I don't lurk much, but the first course of action always seems to be to move off RFS. It's obviously a difficult problem, but it seems to be far more common on RFS, n'est-ce pas?

8 minutes ago, ixnu said:

I don't lurk much, but the first course of action always seems to be to move off RFS. It's obviously a difficult problem, but it seems to be far more common on RFS, n'est-ce pas?

 

Agreed, that seems to be the most common cause. Also, if the user has at least one XFS disk or a non-Reiser cache disk, it's easy to confirm whether that's the cause of the problem by limiting all writes to non-Reiser disks and testing for a few days or weeks.

 

56 minutes ago, the_larizzo said:

I ended up having to fall back to 6.2.4. The server would lock up every 2 days, XFS only.

Did this help? Please let us know. Also, did you have to rebuild all your Dockers, or did they survive the rollback?

 

On 3/22/2017 at 3:15 AM, johnnie.black said:

 


Any reiserfs disks?

 

Yes, of course. I've been running this machine for years now. Maybe 5.

 

It just happened again, and after a day of it not "coming back" I had to kill the machine.

2 hours ago, berizzle said:

Yes, of course. I've been running this machine for years now. Maybe 5.

 

It just happened again, and after a day of it not "coming back" I had to kill the machine.

 

ReiserFS disks seem to be the #1 cause of this issue. Convert one of your disks to XFS, limit all writes to that disk for a few days/weeks by changing your share(s)' included disks, and see if the crashing stops; if it does, convert the remaining disks.

 

PS: IMO you should convert even if this isn't the source of the problem; there have been multiple issues with Reiser lately, and it has terrible performance in certain situations.

 

1 minute ago, johnnie.black said:

 

ReiserFS disks seem to be the #1 cause of this issue. Convert one of your disks to XFS, limit all writes to that disk for a few days/weeks by changing your share(s)' included disks, and see if the crashing stops; if it does, convert the remaining disks.

 

PS: IMO you should convert even if this isn't the source of the problem; there have been multiple issues with Reiser lately, and it has terrible performance in certain situations.

 

I have 23 drives: 21 are ReiserFS (42 TB) and 2 are XFS (6 TB).

There's 9 TB free across all the drives.

Is there a process that makes sense for converting these disks?

7 minutes ago, berizzle said:

I have 23 drives: 21 are ReiserFS (42 TB) and 2 are XFS (6 TB).

There's 9 TB free across all the drives.

Is there a process that makes sense for converting these disks?

 

Since you already have 2 XFS disks, you can test before doing the conversion and confirm whether it will really help: limit all your writes to those disks by going to your share(s) and setting the included disks to only those 2 (all share data on the other disks will still be accessible, but all new writes will go to the XFS disks), then test for a few days/weeks.
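One way to sanity-check that new writes really are landing on the XFS disks is to snapshot used space per disk before and after copying a few files. A rough sketch; the `snapshot`/`grew` helper names and /tmp file paths are illustrative, not part of unRAID:

```shell
#!/bin/bash
# Sketch: snapshot used space for all array disks, then diff two snapshots
# to see which disks received the new writes.

# Real usage: snapshot > /tmp/before.txt (requires GNU df for --output).
snapshot() {
    df --output=target,used /mnt/disk* 2>/dev/null | tail -n +2
}

# Print disks whose used space grew between two "target used" snapshots.
grew() {
    awk 'NR==FNR {before[$1]=$2; next}
         $2 > before[$1] {print $1, $2 - before[$1]}' "$1" "$2"
}

# Demo with two hypothetical snapshots:
printf '/mnt/disk1 100\n/mnt/disk22 500\n' > /tmp/before.txt
printf '/mnt/disk1 100\n/mnt/disk22 900\n' > /tmp/after.txt
grew /tmp/before.txt /tmp/after.txt    # prints: /mnt/disk22 400
```

If a ReiserFS disk still grows after the share change, some path (a disk share, a Docker mapping) is bypassing the included-disks setting.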

 

To convert, see this thread:

 

 


I've been experiencing this since 6.3.2 as well, all XFS disks. Seeing more and more of these threads pop up. I tried downgrading to 6.3.0 and made it past the 2-day mark that 6.3.2 would die at, but it died during the night at 4 days. When I return from work, I'll be downgrading to 6.2.4.

 

While I agree that XFS > reiser, I feel there is something more at play here.


I'm curious how many people have converted all their drives to XFS and eliminated the problem. To me, there are far more systems with RFS drives out there, which could explain why more systems with RFS drives have the issue.


I had this problem on a mixed ReiserFS/XFS system; I converted all drives to XFS and have had no shfs lock-ups since then (a couple of weeks now).

Of course, it could have been some weird file/directory structure inconsistency that went away when the files/directories were newly created during the copy.


I finished converting 14 disks from Reiser to XFS about 5 days ago, and things have seemed stable since. I realize this isn't that interesting a data point because it's not a long time, but it's the longest uptime I've had since upgrading to 6.3.x. And, agreed, I did everything via rsync, so it's entirely possible that the act of recreating everything in a new directory structure cleaned something up. Still, ReiserFS seems to be a quasi-supported relic at this point, so moving off it seems advisable. While the rsync method works well, it's a bit tedious; it would be nice to have a more automated method for those of us who have been upgrading our systems over the years.

