nicr4wks Posted June 17, 2020

Hi all. I've been trying to track down this intermittent issue for the last couple of weeks.

** EDIT - as of an hour ago restarting Unraid no longer fixes the issue; Samba is constantly slow/inaccessible **
** Double edit - after running for 5 hours the server has recovered and is operational again **

The symptom is that Samba shares from Unraid become incredibly slow: browsing folders takes ~10 seconds to load each directory, and transfers to/from the server crawl to <5 KB/s from every client. While the transfer issues are present, I have noticed this process consuming 100% of one core, and it sits like that for hours:

/usr/local/sbin/shfs /mnt/user -disks 4094 -o noatime,allow_other -o remember=0

To get the server back to normal operation it needs to be restarted, which works fine via the GUI. Other parts of Unraid continue to work without issue: VMs run at full speed, Docker is responsive, and the web GUI loads fine. I can even access the disks via SCP at full speed; it's just Samba that dies.

In previous threads, shfs problems seem to stem from ReiserFS disks, of which I have none. I also have no cache drive.

Any ideas?

Unraid: 6.8.3
CPU: Intel® Xeon® CPU E5-1620 @ 3.60GHz
Memory: 56 GiB DDR3 Multi-bit ECC
Network: eth0: 1000 Mbps, full duplex, MTU 1500

tower-diagnostics-20200617-1109.zip
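For anyone else wanting to check for the same thing, this is roughly how I've been spotting it over SSH -- plain top/ps, nothing Unraid-specific, and it assumes a single shfs instance (no cache drive here) so pidof -s grabs the right PID:

# one-shot snapshot of the busiest processes; shfs sits right at the top at ~100% CPU
top -b -n 1 | head -n 20

# how long that shfs instance has been running vs. how much CPU time it has burned
ps -o pid,etime,time,pcpu,args -p $(pidof -s shfs)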
jonp Posted June 19, 2020

Hi there,

Sorry to hear you're having issues. I'm curious whether any of your VMs are passing through a PCI device. If so, I'd like to ask that you disable IOMMU on your motherboard, reboot, and try again with your VMs off. From researching some items in the log, it appears that in some cases certain hardware has been known to cause issues when IOMMU is enabled.

It would also help to get your full hardware specs (motherboard, etc.) so we know what kind of gear we are working with.

It's also odd to hear that things were slow, then reboots stopped being effective at resolving the issue, and then after a while everything was normal again. That is more indicative of something else on the network than of the server itself; otherwise you'd expect the behavior to remain consistent.

Let us know if the behavior happens again, and if so, please try the test I'm proposing and tell us whether it has any impact.
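If you want to double-check that the BIOS change actually took effect after the reboot, something along these lines from the terminal should tell you -- just a quick sketch, not an official procedure:

# any DMAR/IOMMU lines here mean the kernel found and enabled the IOMMU
dmesg | grep -i -e dmar -e iommu | head

# an empty directory here generally means no IOMMU is currently active
ls /sys/class/iommu/

# shows whether any intel_iommu= / iommu= flags are being passed at boot (e.g. from syslinux.cfg)
cat /proc/cmdline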
EthanBB Posted June 19, 2020

I have exactly the same problem. IOMMU is off in my case and I have no VMs running. I rebooted twice yesterday, but today it's slow again.

alexandria-diagnostics-20200619-2224.zip
nicr4wks Posted June 19, 2020

5 hours ago, jonp said:
... I'd like to ask that you disable IOMMU on your motherboard, reboot, and try again with your VMs off. ... It would also help to get the full hardware specs from you (motherboard, etc.) ...

Thanks for taking a look. It's running on an HP Z600 workstation; the motherboard is identified as "Hewlett-Packard 158A Version 0.00".

I have one device passed through to a VM, a "Realtek Corp. RTL2838 DVB-T (0bda:2838)" connected via USB. I should also mention that when the issue occurs I have tried stopping ALL VMs and Dockers, and this does not change any of the symptoms at all. I'll physically remove the Realtek USB device and keep the VM off over the next week for testing. I'll also note that this hardware configuration has not changed and has run flawlessly for the last 4 years.

I do not believe network performance plays any part in this. As I mentioned in the original post, when SMB is not responding or running slowly I can use SCP over the same network, from the same client computer, to access the disks directly at full speed. It seems like whatever the shfs process is stuck doing is the cause of the performance issue.
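In case it helps anyone else narrow it down, this is roughly how I've been separating the shfs layer from the disks and the network (the path below is just an example -- substitute any large file that actually lives on one of your array disks):

# read a large file straight from the array disk -- this bypasses shfs entirely
dd if=/mnt/disk1/Media/example.mkv of=/dev/null bs=1M count=1024

# read the same file through the user share, i.e. through the shfs FUSE mount
dd if=/mnt/user/Media/example.mkv of=/dev/null bs=1M count=1024

If shfs itself is the bottleneck, the /mnt/user read should come out noticeably slower than the /mnt/disk1 read. Keep in mind the second read can be partly served from the page cache by the first, so it flatters shfs if anything; use two different files (or drop caches in between) for a stricter comparison.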
EthanBB Posted June 22, 2020

Any updates? My server is basically unusable because of this bug. It flip-flops: bugged for 5-8 hours, then fine for another 5-8 hours, then bad again.
nicr4wks Posted June 23, 2020

14 hours ago, EthanBB said:
Any updates? My server is basically unusable because of this bug. ...

I've heard nothing. Can you SSH in to your server and run top? It would be interesting to see whether yours has the same stuck shfs process.

I also tried downgrading to 6.8.2; the problem is still present there.
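If it helps, this is roughly what I'd run -- same assumption as before that there's only one shfs instance, so pidof -s grabs the right PID. On my box it looks like a single thread spinning, which would match it pegging exactly one core:

# per-thread view of just the shfs process -- shows whether a single thread is pegged
top -H -p $(pidof -s shfs)

# or a one-shot snapshot that's easy to paste into the thread
top -b -H -n 1 -p $(pidof -s shfs) | head -n 20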
EthanBB Posted June 23, 2020

4 hours ago, nicr4wks said:
... Can you SSH in to your server and run top? It would be interesting to see whether yours has the same stuck shfs process.

Screenshot from yesterday morning: https://imgur.com/IgsKqLy - yep, same as you. But I'm currently at 1331 hours of CPU time.