Unraid Slow Samba on cache and/or spinning disk (with & without FUSE)

thenonsense · November 6, 2022

HI all,

I usually try to get these resolved myself but I'm at the end of my rope here. I'm curious if anyone has seen this and has some helpful tips. Essentially my rig has a pool of 6 SATA3 SSDs in RAID10 on which sit gaming libraries (shares) for two players, and the vdisks for two gaming VMs. The VM-vdisk I/O seems fine, but regularly the gaming libraries will go from acceptable transfer speeds to completely stalling out, crashing games or worse. I experimented with disk shares to take FUSE out of the equation, but transfers would still stall out. I checked from both VMs and from other devices on the network, but the behavior is the same. I've also tried to validate the VMs do not have their vdisk IO interrupted, and this seemed the case until most recently one of the VMs completely crashed on reboot, leading me to believe the disk failed to be written to on shutdown.

The most recent version of this problem I've noticed is that this cache pool isn't the only storage to stall, or else I'd suspect a disk failure. My spinning disks will also stall out completely. Thus I'm not sure if the issue is Samba, one disk's IO that is tanking all of Unraid, or something else. This issue has persisted since 6.10, so it's not limited exclusively to 6.11.x for me.

I've got two diagnostics here. The earlier one has Samba logs enabled, but that was spitting out the smbd synthetic pathref error over and over again, which according to other posts is more a nonissue. The second diagnostics had samba logging set to 0 to better show the rest of the syslog.

Please let me know if you have any ideas. I appreciate your help!

undoot-diagnostics-20221104-1202.zip undoot-diagnostics-20221106-1247.zip

thenonsense · November 6, 2022

UPDATE: Rebooting a VM during the time when IO seems to be crippled (incurring hefty read/write on the disks) appears to occur with no issue. Therefore, IO to/from the disks themselves isn't the problem. The problem does seem to be Samba, but I don't know Samba enough to figure out where...

Edited November 6, 2022 by thenonsense

JorgeB · November 7, 2022

Nov  6 12:46:58 unDOOT kernel: BTRFS info (device sdc1): bdev /dev/sdc1 errs: wr 0, rd 0, flush 0, corrupt 10, gen 0
Nov  6 12:46:58 unDOOT kernel: BTRFS info (device sdc1): bdev /dev/sdh1 errs: wr 0, rd 0, flush 0, corrupt 50, gen 0
Nov  6 12:46:58 unDOOT kernel: BTRFS info (device sdc1): bdev /dev/sdi1 errs: wr 0, rd 0, flush 0, corrupt 55, gen 0
Nov  6 12:46:58 unDOOT kernel: BTRFS info (device sdc1): bdev /dev/sdd1 errs: wr 0, rd 0, flush 0, corrupt 46, gen 0
Nov  6 12:46:58 unDOOT kernel: BTRFS info (device sdc1): bdev /dev/sde1 errs: wr 0, rd 0, flush 0, corrupt 14, gen 0
Nov  6 12:46:58 unDOOT kernel: BTRFS info (device sdc1): bdev /dev/sdg1 errs: wr 0, rd 0, flush 0, corrupt 20, gen 0

Data corruption detected on all members, Ryzen with overclocked RAM like you have is known to is some cases corrupt data, after correcting that run a scrub and delete/restore from backups any corrupt files, they will be listed in the syslog.

thenonsense · November 7, 2022

Wow, that's good info. I'm dropping the array now and running some BTRFS recovery commands, then I'll post back.

thenonsense · November 8, 2022

Pulled files off the drives, reset the pool, BOOM file corruption on new files. Pool was verified clean before the new files were written, and immediately we have the same issue with lapses in IO. I'm not sure if the IO crippling is caused by the corruption, but neither are really acceptable. This specific pool is about 2 or so months old, and the version before it was single-device, so ironically the best thing for stability might just be a lack of raid lol. Rather than that I've decided to throw the baby with the bathwater, sell the farm, and transplant a new spine into the patient.

7950x, DDR5, and Asrock Taichi are in the mail. We'll see if we can resolve these errors the David Martinez way, speed.

thenonsense · November 9, 2022

@JorgeB thanks for your help earlier, sorry to call you out, do you think it'd be a good idea to keep th cache on btrfs or should I bump to zfs or otherwise? This is the first time I've run a protected cache on btrfs since the rig's conception in 2017.

JorgeB · November 9, 2022

Btrfs detecting corruption after resetting the pool suggests a hardware problem, did you revert RAM speeds to max supported like detailed in the link above?

thenonsense · November 10, 2022

After the first wipe, I honestly bumped the voltages and tried to keep the clocks for now, since it's a gaming PC. After the second one (still found corruption) I tried reverting to 2133MHz and upon writing back to the restored pool we had a crash on the mover. No corruption though on this second try, but I'll need a day or two to verify. Looks like RAM or first gen infinity fabric was indeed the issue.

I've been seeing some weird activity (isolated cores @ 100% but no VM running to use them) and I'm wondering if there's something with the bus only being connected to one CCX/die that's also at play here. Sadly I feel like I know a lot about the first and second gen threadripper quirks, but I get shown something new at least once every few weeks. Part of my goal in jumping to the 7k series is to hop on that unified bus and see if that (plus a more mature process) helps address these issues while maintaining the sweet spot of 6k MHz on the DDR5.

I'm still new to what zen4 architectural changes there are, and I haven't yet come across a thorough enough article to get ahold of the CCX organization, infinity fabric, UMA, and everything else.

thenonsense · November 20, 2022

I wanted to provide an update after testing, including an upgrade to Ryzen 7000.

The CPU is a 7950x, sitting on an AsRock Taichi, cores split exactly as before but since these aren't 4-cores/CCX but 8-cores/CCX, infinity fabric doesn't factor in nearly as much for VM performance and on a separate note the gaming is fast. I'm sitting on 64GB DDR5, having tested it at stock settings and and at EXPO 6000. It's one of fefw boards with enough SATA slots, but they aren't all sitting nicely on adjacent controllers, as an LSTOPO has revealed.

Now fortunately the corruption has appeared to have vanished. I'm glad to have that behind me. However the speed issue is not addressed. I'm not sure if minute read/writes from VMs are eating up bus traffic, or if 16GB (after each VM takes 24GB) is enough for both Unraid to operate as a hypervisor and for SMB to serve files for gaming. Or maybe 2 HT cores are not enough. I'm not sure what else to test here.

If anyone has ideas, I'd love to hear them. I'm going to try Dynamix Active Streams to see if that gives me any more insight into what SMB is doing and what's taking so long.

thenonsense · November 21, 2022

For anyone else that has a similar issue, I think I found it. It was brought forward by my install of the "active streams" plugin. I noticed as soon as I opened it up a TON of activity despite the client being supposedly asleep, the client being one of my windows VMs. Turns out, Firefox was writing session data constantly back home, and File History was apparently writing it to my backup share, basically clogging my SMB pipe with a ton of garbage. Now I'm not sure how File History could be so busy since it's only set to run about every 15 minutes (still an aggressive setting), but I do have File History set to back up my entire user folder just in case I lose a wayward game data save.

Two solutions could be employed here. One is to decrease the frequency by which Firefox writes session data. Not a bad solution, but not a perfect fit. The second is to prevent File History from writing Firefox entirely. I employed both, since I just don't like Firefox pounding any disk, including my OS. I've yet to have another problem yet in about half a day of testing. Given that my samba config is vanilla, as was my hardware, I believe this ultimately was the problem. Someone out there likely also backs up their appdata folder or even their whole user folder, and might run into the same annoyance. I hope you find my words here.

Shoutout to JorgeB for noting the corruption, and motivating me to upgrade my rig. I cannot believe I never noticed those messages in the syslog, and of the two problems chatted about in this thread, corruption is 100% worse for this usecase than speed.

thenonsense · November 22, 2022

No issues with speed after more thorough testing, closing this with the above as the solution.

Unraid Slow Samba on cache and/or spinning disk (with & without FUSE)

Recommended Posts

thenonsense

Link to comment

thenonsense

Link to comment

JorgeB

Link to comment

thenonsense

Link to comment

thenonsense

Link to comment

thenonsense

Link to comment

JorgeB

Link to comment

thenonsense

Link to comment

thenonsense

Link to comment

thenonsense

Link to comment

thenonsense

Link to comment

Join the conversation