ixnu Posted December 5, 2016

Unraid 6.3-rc5. No dockers or unsupported plugins - just a vanilla file server. Fix Common Problems completed.

Long story, but for about two weeks I have not been able to properly shut down my box after it's been up for more than 24 hours. I have attempted to use the powerdown and native shutdown scripts to no avail on 6.2.4; neither script completes. Usually, the box stays up fine after a hard reboot, but eventually samba and http no longer respond and shfs (zombie) pegs the CPU, with dozens of smbd processes that can't be stopped. I can continue to log in via ssh. I upgraded to the newest rc in hopes that the improved shutdown would help, but shutdown never completes. During this two-week period, I had a red ball and rebuilt successfully.

I had read on the forums that Reiser could cause this issue if disks were nearing capacity, so I reduced them all below 90%. This did not work, so I started the process of migrating to XFS. I have also read that dockers could cause this, but I have never enabled it. If this should be combined with another topic, please forgive my search skillz.

Unfortunately, I have filled my SATA slots and needed to consolidate in order to move to XFS. I shrunk the array and rebuilt parity successfully last night, but the issue has returned after about 24 hours. I can't create the hardware diag from the tools, so I have just included the syslog. I'm currently on travel and can't hard-restart the box. I am in the process of building a new server, but I really would like this one to stay up as a backup and eventually replace my existing backup. Anybody have any ideas?

log_05170043.log.zip
ixnu Posted December 8, 2016

More info: I finally got home to take a look. The shfs process was a zombie and there were dozens of smbd processes opened by a user account. I could not kill or HUP these, and the box would never shut down / restart with either shutdown or the powerdown script. Any ideas?
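A note for anyone hitting the same wall: processes that ignore kill -9 are usually not zombies at all but tasks stuck in uninterruptible sleep (D state) inside the kernel, typically blocked on the hung filesystem. A quick generic way to list both kinds (a sketch, not Unraid-specific; run as root to see everything):

```shell
# List zombies (Z: already exited, waiting for the parent to reap them -
# signals do nothing) and D-state tasks (uninterruptible sleep: these
# ignore SIGKILL until the kernel I/O they are blocked on completes).
# wchan shows the kernel function each task is sleeping in.
ps -eo pid,ppid,stat,wchan:32,comm | awk 'NR==1 || $3 ~ /^[ZD]/'
```

If shfs shows Z, killing it is pointless; the parent has to reap it. If the smbd processes show D, they will stay unkillable for as long as the filesystem operation they are waiting on never returns.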
ixnu Posted December 8, 2016

Currently running reiserfsck on all the disks. Not a fan of reiser...
trurl Posted December 8, 2016

If you could get your diagnostics we could see the SMART for all your drives. Do the SMART attributes for all of your drives look OK? Have you done a memtest recently? Tell us about your hardware.
ixnu Posted December 8, 2016

SMARTs look OK to me. To answer your question, this is a box that has been running for about 5 years and has gone from 4.7 to 6.0 to 6.2.x to 6.3-rc5. This is the first non-stable release that has been on it. Here is a full diag that I got after a hard reboot when I got back home this morning. I certainly appreciate you taking a look at it. The only option I could think of would be fsck fun. I have not run a memtest recently, and it's not ECC.

tower-diagnostics-20161208-0822.zip
trurl Posted December 8, 2016

SMARTs look OK. Can't hurt to try a memtest. Tell us about your hardware.
ixnu Posted December 8, 2016

True dat. Will run memtest when completed with fsck.
ixnu Posted December 8, 2016

reiserfsck on all disks = 0 corruptions
xfs_repair = 0 corruptions

Surprising to me, since I've had to hard reboot about a dozen times in the last two weeks.
JonathanM Posted December 8, 2016

ReiserFS is very forgiving about hard shutdowns. The problem is, it seems to me that any version of unraid after 5 doesn't like ReiserFS, and seems to take a notion to lock down under intense use. Since ReiserFS is dying, nobody really wants to troubleshoot the issue. Just be glad you don't have a BTRFS cache pool as well as ReiserFS array disks. That combo caused me no end of grief: the ReiserFS issues caused runaway CPU usage, necessitating a hard reboot, which in turn caused BTRFS breakage. I initially thought it was BTRFS causing the issue, but found out it was the Reiser disks. Converted all ReiserFS to XFS, no more issues.

IMHO, of the choices given in unraid:
Reiser = extremely robust, very repairable, has issues with large disks running close to full and with newer kernels.
BTRFS = extremely fragile, recovery dicey, works great as long as you unmount it cleanly every time.
XFS = fairly robust, metadata log seems fragile, data recovery OK. Fewest overall issues.
ixnu Posted December 8, 2016

Anybody got the "reiser_to_xfs" slackware package? Can't seem to find it.
JonathanM Posted December 8, 2016

I believe you are being funny, but for others that may find this and assume such a thing exists: https://lime-technology.com/forum/index.php?topic=37490.0 is a long thread discussing options for moving data from ReiserFS disks to XFS. tl;dr: back up and move the data to the newly formatted disks; no in-place conversion is possible.
ixnu Posted December 8, 2016

Ha! That's how nasty rumors get started. Any theories on how Reiser causes this (zombie shfs and runaway smbd processes)?
JonathanM Posted December 8, 2016

Only uneducated guesses. I suspect the file system takes too long to complete an operation and causes a timeout that isn't caught. My reasoning is that smaller disks and fresh formats of ReiserFS don't seem to cause the issue.
ixnu Posted December 8, 2016

I have removed the cache drive to test this speed theory. My cache is an SSD. I should also mention that fuser -mv /dev/md* would not complete when I was having issues.
ixnu Posted December 9, 2016

Memtest: 4 passes and no errors. Updated to rc6. No errors on memtest, file systems, or SMART. Nothing in the syslog. What next? Willing to try anything. Voodoo dolls? Blood sacrifice?
trurl Posted December 9, 2016

You never did tell us about your hardware.
ixnu Posted December 9, 2016

My apologies, I thought the diag might be enough. Again, thanks for taking a look! If this is not enough info, just let me know what you need.

This box is about 5 years old and everything was bought new. I have had 2 red balls with Seagate drives that were rebuilt properly. All of the controller slots are occupied and one on the MB is empty. I shrunk the array so that I could use this slot to move everything to XFS.

MB: ASRock 880GM-LE - AMD SB710
CPU: AMD Sempron 140
RAM: G.Skill 2 x 4GB DDR3/1066
Controller: Supermicro AOC-SASLP-MV8
PSU: Corsair CX series CX750 750W
UPS: CyberPower CP1350AVRLCD (1350 VA)

The drives below sit in 3 x Supermicro 5-in-3 drive containers. Most are ~80% full; the highest is 88% full.

Parity: HGST_HMS5C4040ALE640_PL2331LAHE3V0J (sdc) - 4 TB
Disk 1: Hitachi_HDS5C3020ALA632_ML0221F30420HD - reiserfs 2 TB
Disk 2: SAMSUNG_HD204UI_S2H7J1SZ911253 - reiserfs 2 TB
Disk 3: SAMSUNG_HD204UI_S2H7J1CZ918725 - reiserfs 2 TB
Disk 4: WDC_WD40EZRZ-00WN9B0_WD-WCC4E0ST3K77 - reiserfs 4 TB
Disk 5: (empty slot)
Disk 6: WDC_WD20EZRX-00DC0B0_WD-WCC300793356 - reiserfs 2 TB
Disk 7: WDC_WD20EZRX-22D8PB0_WD-WCC4M5ET1KUD - xfs 2 TB
Disk 8: WDC_WD20EVDS-63T3B0_WD-WCAVY5276605 - reiserfs 2 TB
Disk 9: WDC_WD20EVDS-63T3B0_WD-WCAVY5328007 - reiserfs 2 TB
Disk 10: ST2000DM001-1CH164_Z2F0TGTB - reiserfs 2 TB
Disk 11: WDC_WD20EARX-008FB0_WD-WCAZAD975980 - reiserfs 2 TB
Disk 12: WDC_WD2002FAEX-007BA0_WD-WMAY05158670 - reiserfs 2 TB
Cache: SAMSUNG_SSD_830_Series_S0WENYAC301281 - xfs 128 GB
Boot Flash: JD_FireFly (sda) - 2 GB
ixnu Posted December 10, 2016

And we are back to shfs at 100% CPU. However, this time it's not a zombie. Also no smbd processes. Can I try something during this state to help troubleshooting? Syslog is clean.
ixnu Posted December 10, 2016

Still no joy. Web and SMB not responsive. CPU is pegged with shfs (htop screenshot included). /mnt/user0 and /mnt/user will not complete a simple "ls". Neither "shutdown -r now" nor "powerdown -r now" will complete.
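One thing worth capturing while the box is wedged: ls and fuser hang because they go through the dead FUSE mount, but /proc lets you ask where a task is blocked without touching the filesystem at all. A generic Linux sketch (assumes root and a kernel that exposes /proc/PID/stack; the demo uses $$, this shell's own PID, so it runs anywhere - substitute the shfs PID from htop on the stuck box):

```shell
# On the stuck server, set PID to the shfs PID from htop instead of $$.
PID=$$

# Kernel function the task is sleeping in ("0" means it is runnable):
cat /proc/$PID/wchan; echo

# Full kernel stack of the blocked task (root only; may be restricted):
cat /proc/$PID/stack 2>/dev/null || echo "(stack not readable)"

# Last resort when shutdown/powerdown themselves hang: magic SysRq forces
# an emergency sync, read-only remount, then reboot. DESTRUCTIVE - left
# commented out here; uncomment only on the stuck box.
#   echo s > /proc/sysrq-trigger
#   echo u > /proc/sysrq-trigger
#   echo b > /proc/sysrq-trigger
```

The SysRq route is still a forced reboot, but the sync and read-only remount first give the filesystems a better chance than holding the power button.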
ixnu Posted December 11, 2016

Still at a loss on how to troubleshoot. Would greatly appreciate any Unraid™ Inc. input. I have nothing to go on; it seems to happen at idle. I have created a simple cron job to write a file to the array every 15 mins, and set the mover to every hour. I'm hopeful this will give me something.
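In case it helps anyone replicate: the heartbeat can be a one-line crontab entry (a sketch; the path and interval are my choices, and the demo writes the entry to a temp file rather than installing it into the live crontab):

```shell
# Append a timestamp to an array disk every 15 minutes, so the last line
# in heartbeat.txt brackets when the hang started. /mnt/disk1 is an
# example path - any array disk works.
ENTRY='*/15 * * * * date >> /mnt/disk1/heartbeat.txt'

# Demo: stage the entry in a temp file instead of the live crontab.
CRONFILE=$(mktemp)
echo "$ENTRY" > "$CRONFILE"
cat "$CRONFILE"
```

On the real box you would add the line with `crontab -e` (or drop a file in the cron directory your Unraid version uses) rather than staging it in /tmp.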
Squid Posted December 11, 2016

The general solution from the forum users is to ditch reiserfs and switch all drives to xfs. But, if you want LT to have a look at this, then best to email [email protected] as they don't generally randomly surf the forum.
ixnu Posted December 11, 2016

Thanks Squid! Has a root cause been identified, or is it just cargo cult troubleshooting?
Squid Posted December 11, 2016

(image: http://www.karenkuehn.com/media/original/CARGO%20CULT%20Cover-copy2.jpg)

But it does seem to be the solution. Reiser is, in the grand scheme of things, ultimately an unsupported FS on Linux. No development happens on it anymore, the creator is in jail, and the last time someone attempted a fix on it, it caused file corruption.
ixnu Posted December 11, 2016

Yeah, maybe this is a more appropriate analogy for running ReiserFS on UnRaid 6.x.