Poor file transfer performance over WireGuard



On 5/5/2021 at 4:41 PM, Torben said:

I'm kind of in the same boat. 2 unRAID servers connected through WireGuard, accessing SMB shares mounted through Unassigned Devices (but it also happens when mounted manually). unRAID and all plugins are up to date.

 

Here's what happens: the download starts and most of the time - not always, though - it drops from 250 MBit/s to 10 MBit/s at some point. Sometimes after a minute, sometimes after hours; sometimes while copying a file and sometimes while starting to copy a new one, with rsync or Midnight Commander. I can mostly fix it by unmounting and remounting the share, and then the game starts again. I've spent hours troubleshooting, using Google and the unRAID forum (and just trying stuff out), and more than once thought I'd made it...but it always comes back.

 

What I think I should mention is the CPU load/multi-core behavior. When WireGuard and SMB are working, multiple cores are used and the load jumps all over them, sometimes spiking up to 98% on a core for a split second before spreading across multiple cores again. But when the problem occurs, only one core is used. It shoots up to about 90%, some milliseconds later to 100%, and stays there for maybe two seconds. Then it jumps to another core: 90%, 100%, two seconds, next core. But it doesn't seem to make a difference in "top"; the CPU load per process stays the same as if it were still using multiple cores.
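(If you want to watch the same thing, this is roughly what I'm looking at; mpstat comes from the sysstat package, so it may need installing first:)

mpstat -P ALL 1     # per-core CPU utilization, refreshed every second
top                 # then press 1 to toggle the per-core view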

 

Unfortunately I have no idea what info could be useful to continue troubleshooting, so please let me know if you have any suggestions.

I have not seen any correlation between CPU usage spikes and my wireguard performance woes. My processor can be entirely idle with nothing running and I'll still have a speed drop.

 

Anecdotally, I have had some marginal improvement in how long I can maintain a transfer at speed through the tunnel by adjusting the tunnel MTU down to 1420. However, this is not foolproof and it will still lose speed at some point and require me to restart the transfer.

 

At 1420 I was able to move about 300GB of data at speed over the course of 12-14 hours. In total this month I was able to move about 700GB and only had to restart the transfer 4-5 times.
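For reference, this is how I check and set the tunnel MTU by hand (assuming unRAID's default WireGuard interface name of wg0):

ip link show wg0                    # show the current tunnel MTU
ip link set dev wg0 mtu 1420        # set it manually; reverts when the tunnel restarts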

 

 

Link to comment

Hm, it's weird that you don't have this CPU correlation.

 

Since the default MTU of WireGuard is 1420, I'd be surprised if this really made a difference - unless you hadn't set it to "auto" before.

In my experience it doesn't matter how much data or how many files you transfer; it seems to happen randomly, and that's what makes troubleshooting so fricking annoying. And when you lose speed, the MTU is normally too high, so each packet has to be split into two fragments instead of going out as one.

I didn't lose speed with MTU 1420, but I tried to find out the max MTU of my server's internet connection instead of WireGuard's, and besides getting weird results on unRAID (1600 working with a max of 1500...what?!), I tried it on my Windows machine connecting with the Windows WG client. The result was that the internet connection maxed out at MTU 1492 (and later on this was what unRAID also said), which means a WireGuard MTU of 1412 (using 80 as buffer by default). I'm now trying MTU 1412, have max speed, and have so far been copying for 45 minutes without a problem...but as we know that doesn't mean a thing.

Edited by Torben
Link to comment
16 hours ago, Torben said:

Hm, it's weird that you don't have this CPU correlation.

 

Since the default MTU of WireGuard is 1420, I'd be surprised if this really made a difference - unless you hadn't set it to "auto" before.

In my experience it doesn't matter how much data or how many files you transfer; it seems to happen randomly, and that's what makes troubleshooting so fricking annoying. And when you lose speed, the MTU is normally too high, so each packet has to be split into two fragments instead of going out as one.

I didn't lose speed with MTU 1420, but I tried to find out the max MTU of my server's internet connection instead of WireGuard's, and besides getting weird results on unRAID (1600 working with a max of 1500...what?!), I tried it on my Windows machine connecting with the Windows WG client. The result was that the internet connection maxed out at MTU 1492 (and later on this was what unRAID also said), which means a WireGuard MTU of 1412 (using 80 as buffer by default). I'm now trying MTU 1412, have max speed, and have so far been copying for 45 minutes without a problem...but as we know that doesn't mean a thing.

Yeah, I've read that the default is 1420, but I figured it couldn't hurt to set it manually in case it wasn't auto-detecting correctly. I had also tried going all the way down to 1380 and it didn't seem to make much of any improvement.

I'm not sure if taking my MTU too far down could be making things worse, so maybe 1380 was too far the other way and 1412 is the sweet spot?

Let me know if you can get consistent results with several large transfers. 

 

It would be great to get some sort of debug output from wireguard like I mentioned a few posts back with the kernel debug options. I don't know if we'll ever nail this down without some logs or WG dev help.
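For reference, the debug toggle I mean is WireGuard's dynamic debug switch, something like this (assuming the kernel was built with CONFIG_DYNAMIC_DEBUG):

echo module wireguard +p > /sys/kernel/debug/dynamic_debug/control    # enable verbose WG kernel messages
dmesg -wT | grep -i wireguard                                         # follow them with readable timestamps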

Edited by weirdcrap
Link to comment

Don't know if you already knew, but you can get the current MTU for all interfaces with: ifconfig | grep -i MTU

 

I'm by far no expert, but the MTU size depends on your server's internet connection. For LAN it's 1500; on the internet it mostly isn't. You can check it by using the "ping" command.


 

Working:

root@tower:/mnt/user# ping -c 1 -M do -s 1464 9.9.9.9
PING 9.9.9.9 (9.9.9.9) 1464(1492) bytes of data.
1472 bytes from 9.9.9.9: icmp_seq=1 ttl=60 time=10.6 ms

--- 9.9.9.9 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 10.645/10.645/10.645/0.000 ms



Not working:

root@tower:/mnt/user# ping -c 1 -M do -s 1465 9.9.9.9
PING 9.9.9.9 (9.9.9.9) 1465(1493) bytes of data.
ping: local error: message too long, mtu=1492

--- 9.9.9.9 ping statistics ---
1 packets transmitted, 0 received, +1 errors, 100% packet loss, time 0ms

 

There's overhead on the connection: 28 bytes for an internet connection (20-byte IP header plus 8-byte ICMP header) and 80 bytes for WireGuard. That's why I'm pinging with a payload of 1464 instead of 1492 on the internet connection, and for WireGuard it would mean 1412. For you it could mean something different, since every internet connection can have a different MTU size.
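To automate that search, a quick loop like this should do it (assuming the target - 9.9.9.9 here - answers pings):

best=0
for size in $(seq 1400 1472); do
    ping -c 1 -W 2 -M do -s "$size" 9.9.9.9 >/dev/null 2>&1 && best=$size || break
done
echo "max payload: $best, path MTU: $((best + 28))"    # 28 = 20-byte IP header + 8-byte ICMP header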

 

The networking guy at work said the MTU of WG is independent of the internet connection, since it's a different protocol, but my logic tells me that WireGuard still runs over the internet connection, and if the WG packet is too large, the internet connection can't handle it correctly. OK, I didn't have any problems before unRAID 6.9 - or I just didn't notice - but I've tried so much other stuff that I'm grasping at any straw I can to get consistently high transfer speeds again. 🙂

Link to comment
22 minutes ago, Torben said:

Don't know if you already knew, but you can get the current MTU for all interfaces with: ifconfig | grep -i MTU

 

I'm by far no expert, but the MTU size depends on your server's internet connection. For LAN it's 1500; on the internet it mostly isn't. You can check it by using the "ping" command.


 


Working:

root@tower:/mnt/user# ping -c 1 -M do -s 1464 9.9.9.9
PING 9.9.9.9 (9.9.9.9) 1464(1492) bytes of data.
1472 bytes from 9.9.9.9: icmp_seq=1 ttl=60 time=10.6 ms

--- 9.9.9.9 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 10.645/10.645/10.645/0.000 ms



Not working:

root@tower:/mnt/user# ping -c 1 -M do -s 1465 9.9.9.9
PING 9.9.9.9 (9.9.9.9) 1465(1493) bytes of data.
ping: local error: message too long, mtu=1492

--- 9.9.9.9 ping statistics ---
1 packets transmitted, 0 received, +1 errors, 100% packet loss, time 0ms

 

There's overhead on the connection: 28 bytes for an internet connection (20-byte IP header plus 8-byte ICMP header) and 80 bytes for WireGuard. That's why I'm pinging with a payload of 1464 instead of 1492 on the internet connection, and for WireGuard it would mean 1412. For you it could mean something different, since every internet connection can have a different MTU size.

 

The networking guy at work said the MTU of WG is independent of the internet connection, since it's a different protocol, but my logic tells me that WireGuard still runs over the internet connection, and if the WG packet is too large, the internet connection can't handle it correctly. OK, I didn't have any problems before unRAID 6.9 - or I just didn't notice - but I've tried so much other stuff that I'm grasping at any straw I can to get consistently high transfer speeds again. 🙂

My LAN MTUs are all 1500, and running traceroute --mtu myendpoint shows my MTU is not stepped down at all before it reaches its destination out on the internet, so the default WireGuard setting of 1420 should be correct for me and my network.

I was simply saying I'd be willing to try tuning it lower just to see if it makes any difference, if you had promising results. In my experience it doesn't seem to have any effect.

Link to comment

I almost wanted to say "works!", since I did several large transfers, but on the last test I ran into the same problem...the CPU spikes up to 100% on one core and switches around all the others like that without ever going back to multi-core. There are two differences: the speed first drops from 90 MBit/s to 80 MBit/s after some time, then shortly after to 70-80 MBit/s, and stays there until everything goes wrong and the speed instantly drops. But now it drops to 13 MBit/s; before, it was instantly going to 10 MBit/s without the steps in between. And it was the same with a 250 MBit connection - instant drop to 10 MBit/s.

 

I don't know; I'll try MTU 1350 next - I found some stuff about it on some forums. And if that doesn't make a difference I'll probably search for a different solution, since this problem annoys the crap out of me.

Edited by Torben
Link to comment
59 minutes ago, Torben said:

I almost wanted to say "works!", since I did several large transfers, but on the last test I ran into the same problem...the CPU spikes up to 100% on one core and switches around all the others like that without ever going back to multi-core. There are two differences: the speed first drops from 90 MBit/s to 80 MBit/s after some time, then shortly after to 70-80 MBit/s, and stays there until everything goes wrong and the speed instantly drops. But now it drops to 13 MBit/s; before, it was instantly going to 10 MBit/s without the steps in between. And it was the same with a 250 MBit connection - instant drop to 10 MBit/s.

 

I don't know; I'll try MTU 1350 next - I found some stuff about it on some forums. And if that doesn't make a difference I'll probably search for a different solution, since this problem annoys the crap out of me.

I feel your pain. Once a month when I sync my servers, I get frustrated and do a day of furious research while I babysit the transfer. So far I haven't found anything that resolves the issue, but please do report back if you figure it out first.

Link to comment

I just wanted to let you know about my testing today: with MTU 1350 I reached speeds a bit higher and more stable than before, and it worked very well until it stopped working. Same CPU sh*t as before.

 

Then I had the idea to see what happens if I don't max out the connection, using "rsync --bwlimit". I set it a bit below max speed, about 8 MBit/s (8%, better a bit less I guess) and holy shit...the CPU load used by WireGuard/the copy process dropped drastically overall. Every couple of seconds it uses one or two cores up to 45-70% total; the rest of the time the CPU is almost as idle as it is without a copy process running.
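Spelled out, the kind of invocation I mean (paths and the limit value are just placeholders; the size suffix needs rsync 3.1 or newer):

rsync -av --progress --bwlimit=1m /mnt/user/data/ /mnt/remotes/backup/    # cap the transfer at about 1 MB/s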

Link to comment
14 hours ago, Torben said:

I just wanted to let you know about my testing today: with MTU 1350 I reached speeds a bit higher and more stable than before, and it worked very well until it stopped working. Same CPU sh*t as before.

 

Then I had the idea to see what happens if I don't max out the connection, using "rsync --bwlimit". I set it a bit below max speed, about 8 MBit/s (8%, better a bit less I guess) and holy shit...the CPU load used by WireGuard/the copy process dropped drastically overall. Every couple of seconds it uses one or two cores up to 45-70% total; the rest of the time the CPU is almost as idle as it is without a copy process running.

Ah, I didn't realize you weren't already using bwlimit. That might explain why I haven't seen the CPU spikes you have. My remote server shares bandwidth with a friend's business, so I always make sure to only use roughly half of the available bandwidth.

 

So with the bandwidth limit, can your WireGuard transfer consistently maintain its speed? I'm still seeing my speeds drop below 1MB/s at random points during my tests with my bwlimit of 5MB/s.

Edited by weirdcrap
Link to comment

The first time I tried, it failed again after probably some hours, but I had to run the mover at the same time, since my cache drive filled up because of all the testing. ;-)

 

I had to do a large file transfer this night and made two changes at once...which you shouldn't do when troubleshooting, but...yeah. I uninstalled the Unassigned Devices plugin (just to make sure it doesn't do...something), disabled settings in Dockers that react to new files, and the transfer ran for 10h at max speed, just dropping from 8.5 MB/s to 7.5 MB/s for one file. I'll continue testing and see what happens.

 

I mounted with a simple "mount -t cifs -o username=USERNAME,iocharset=utf8 SOURCE TARGET"  and "rsync --bwlimit..."
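With hypothetical values filled in (the IP, share name, and limit here are just examples), that pair looks like:

mkdir -p /mnt/remotes/backup
mount -t cifs -o username=USERNAME,iocharset=utf8 //10.253.0.2/backup /mnt/remotes/backup
rsync -av --bwlimit=1m /mnt/user/data/ /mnt/remotes/backup/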

Link to comment

And here's the news:

 

Some hours after the successful 10h transfer, I was browsing the remote share a bit (without reconnecting it) to check if everything was still alright, and wanted to start a new transfer. This transfer instantly started at 12 MBit/s, so the SMB connection was broken and I had to reconnect.

 

Now I got one of my colleagues to join the "WG-SMB-Test-Club", connecting to my server. She's got a way slower internet connection (--bwlimit set to half the internet connection; maybe I have to go lower for consistency) and especially a way less powerful CPU (J5040 vs 10900T).

We started with MTU 1350, and after 2 minutes the CPU stopped using multiple cores like it did before and just used one core, jumping in 2 steps to 100% and switching to the next at 100%...the old story. Additional info: we tried to run "clear" and "top" in bash while this happened, but it took several seconds (5+) for the commands to be executed.

 

I then realized that we hadn't made a change I did on my server as one of the first steps of troubleshooting: disable Settings -> Global Share Settings -> Tunable (support Hard Links).

 

So we killed all rsync processes, disconnected SMB, stopped the array, changed the setting, and we've now been transferring for over 1.5h without a problem so far. Let's see when or if it fails.

 

I have no idea how all this correlates - or if it really does, or everything is just a fluke - but I'll try to get everything as close to "the same setup" as possible and see if I can find more things in common.

Edited by Torben
Link to comment
  • weirdcrap changed the title to Poor file transfer performance over WireGuard

In my testing, hardware doesn't seem to make a difference. I have two i7 Haswells, each with 32GB of RAM, and they struggle just as badly as my old AMD FX-6300 did.

 

I would be surprised if hard links being on or off makes a difference. I have that setting on and never considered turning it off. If you think you see improvements with it off, I'll give it a shot.

 

Like right now: I just started a transfer no more than a few minutes ago and the speed has already tanked to barely 1MB/s. My CPU usage on both servers is under 5%.

Link to comment

It's hard to say if it makes a difference or not, since we then had the problem again of the transfer sitting at 10 MBit/s almost immediately after a fresh SMB mount. But my colleague was surfing the web, so the overall speed drops could have caused it. As far as I can tell, I had fewer problems with hard links turned off, and since I don't experience any other problems because of that, I'll leave it off for now.

 

What I wanted to say with the difference in hardware was that it doesn't make a difference how much power your cores have; it's always one core, 100%. Did you check overall CPU usage or single cores? The overall usage stays the same; it's just not multi-core with e.g. one core at x% and one core at y% (with bwlimit it just ramps up every couple of seconds), it's one core at 100% and the rest handling the background noise. We now have 2 unRAID servers showing this behavior. Also, both servers show the used/free space of the remote share in Unassigned Devices as 0/0, not the real numbers as they do when everything's fine. But that's an already known thing, as far as I've googled through the last three weeks of trying to work around this issue.
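(You can check the same thing from the shell; assuming the remote share is mounted under /mnt/remotes, a broken session shows 0 size and 0 available here:)

df -h /mnt/remotes/*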

 

I bet we're both talking about the same speeds, but you're talking about 1 MB/s and I'm talking about 10 MBit/s (I'm using the network part of the unRAID dashboard to check the speed). To me it looks like a fallback speed, since it's 10 MBit/s no matter what...1 GBit or 50 MBit download connection, 100 or 250 or 500 MBit upload connection - always 10 MBit/s in this "fail state" scenario.

Link to comment
14 hours ago, Torben said:

It's hard to say if it makes a difference or not, since we then had the problem again of the transfer sitting at 10 MBit/s almost immediately after a fresh SMB mount. But my colleague was surfing the web, so the overall speed drops could have caused it. As far as I can tell, I had fewer problems with hard links turned off, and since I don't experience any other problems because of that, I'll leave it off for now.

 

What I wanted to say with the difference in hardware was that it doesn't make a difference how much power your cores have; it's always one core, 100%. Did you check overall CPU usage or single cores? The overall usage stays the same; it's just not multi-core with e.g. one core at x% and one core at y% (with bwlimit it just ramps up every couple of seconds), it's one core at 100% and the rest handling the background noise. We now have 2 unRAID servers showing this behavior. Also, both servers show the used/free space of the remote share in Unassigned Devices as 0/0, not the real numbers as they do when everything's fine. But that's an already known thing, as far as I've googled through the last three weeks of trying to work around this issue.

 

I bet we're both talking about the same speeds, but you're talking about 1 MB/s and I'm talking about 10 MBit/s (I'm using the network part of the unRAID dashboard to check the speed). To me it looks like a fallback speed, since it's 10 MBit/s no matter what...1 GBit or 50 MBit download connection, 100 or 250 or 500 MBit upload connection - always 10 MBit/s in this "fail state" scenario.

I checked overall CPU usage on the stats page as well as individual core utilization on the WebUI dashboard. With nothing else running except a file transfer I don't see utilization over 5% on any single core. There may be an occasional spike but it always quickly drops back down to almost nothing.


Yes, we are referring to roughly the same speeds; I use the big B for bytes. I agree, I have also noticed in the last few months that it seems to linger around 1MB/s more than it used to. When this issue first started for me, the speed would drop down into the kilobytes and stay there until I cancelled it. 1MB/s is an improvement over the dial-up-like speeds I was getting before.

Link to comment
  • 1 month later...

This is still very much broken for me. Right now, no matter how many times I restart the transfer or the servers, I can't even top 100KB/s.

 

My current file is running at 1.5KB/s, it's disgusting how badly this runs.

 

If I find time this weekend I'm going to go bother the wireguard devs on IRC and see if they have any ideas. I really really want this to work but in its current state it is unusable.

Edited by weirdcrap
Link to comment

Funny that you posted today...after some frustrating time, I tried some stuff over the weekend. I reset everything I changed before back to defaults and gave it a try. It sucks more than with the changed MTU and whatnot, but...yeah.

 

1. When running with the "Userscripts" plugin in the background (or "manually" in bash) at high speeds, it fails within 1-2 minutes, dropping to the speed we already identified as some kind of fallback speed.

2. When running with the "Userscripts" plugin in the foreground at high speeds, it runs between 15 and 60 minutes, then drops.

3. I limited the rsync speed to 10 MB/s (80-85 MBit/s) and it behaves like 2.

4. We tried using NFS to see what happens (mount command below). Running without limiting rsync, it goes full speed for the first file (250 MBit/s) and drops to 85-100 MBit/s afterwards. I limited rsync to get less CPU usage and transferred almost 2TB at 85-100 MBit/s without a problem. So it's not working correctly, but it sucks way less than SMB. Unfortunately NFS can't be secured the way I'd like, so I'm still hoping someone finds a solution.
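For point 4, the NFS test was just a plain mount plus the limited rsync (server IP, export, and paths here are placeholders):

mount -t nfs 10.253.0.2:/mnt/user/backup /mnt/remotes/backup
rsync -av --bwlimit=10m /mnt/user/data/ /mnt/remotes/backup/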

 

The only thing I haven't tried so far is switching to SMBv1 for testing purposes (found that on Reddit or somewhere). And I still see the same behavior of one CPU core going up to 100% and jumping to the next when it fails (with SMB). I'm half thinking it's a driver issue, a kernel issue with newer CPUs, a C-State issue...I have no fricking idea.
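(If I do try it, forcing the protocol version should just be a mount option, something like the line below - SMBv1 is insecure, so for testing only:)

mount -t cifs -o username=USERNAME,vers=1.0 //10.253.0.2/backup /mnt/remotes/backup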

Link to comment
  • 1 year later...

Me too. And even NFS speed over WireGuard got worse, running at about 60% of the speed it was before. So I tried to find a workaround and installed an Ubuntu VM with the folders mounted via mount tags on the receiving unRAID server, which I found out is the one causing the problem (even a freshly installed test server on different hardware works on 6.8.3; the same server on 6.9+ shows the problem). So far it's been working great for a couple of weeks now. In the beginning I wanted to do some more investigating into why it works in the VM - or whether I can replicate the problem in the VM - since I think having a VM just to do what the host was/should be capable of is "a bit" unnecessary, but well, a lot of time spent and no real idea left where to continue.
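In case anyone wants to replicate the setup: inside the VM, the unRAID share is mounted via the 9p mount tag assigned in the VM settings (the tag name and mount point here are placeholders):

mount -t 9p -o trans=virtio,version=9p2000.L backup /mnt/backup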

 

I'm curious what happens when you have new hardware.

Link to comment
  • 2 weeks later...
On 7/20/2022 at 4:36 PM, Torben said:

Me too. And even NFS speed over WireGuard got worse, running at about 60% of the speed it was before. So I tried to find a workaround and installed an Ubuntu VM with the folders mounted via mount tags on the receiving unRAID server, which I found out is the one causing the problem (even a freshly installed test server on different hardware works on 6.8.3; the same server on 6.9+ shows the problem). So far it's been working great for a couple of weeks now. In the beginning I wanted to do some more investigating into why it works in the VM - or whether I can replicate the problem in the VM - since I think having a VM just to do what the host was/should be capable of is "a bit" unnecessary, but well, a lot of time spent and no real idea left where to continue.

 

I'm curious what happens when you have new hardware.

I may have to go the VM route as well. The new hardware made no difference in the file transfer speeds. Not that I honestly expected it to.😥

Link to comment
  • 3 months later...
  • Solution

A final update to this.

 

Unfortunately I had to abandon using UnRAID as the WireGuard server as I couldn't resolve this. This issue coupled with my other issue drove me to invest in putting my own router in front of the remote server.

 

I've now built a site-to-site WireGuard tunnel between my two routers and everything is working exactly as I would expect it to over the WireGuard tunnel. I'm getting my full speed up to the bandwidth limit I set in the rsync command.
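For anyone following along, a site-to-site tunnel is just a standard WireGuard config on each router; roughly this shape, with the keys, addresses, endpoint, and remote LAN subnet all placeholders:

[Interface]
Address = 10.10.10.1/24
ListenPort = 51820
PrivateKey = <router-A-private-key>

[Peer]
PublicKey = <router-B-public-key>
Endpoint = remote.example.com:51820
AllowedIPs = 10.10.10.2/32, 192.168.2.0/24    # the far router's tunnel IP plus its LAN
PersistentKeepalive = 25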

 

So the TL;DR is: don't expect UnRAID's implementation of WireGuard to be able to move large amounts of data without choking. At least not as of this post; hopefully it can be improved in the future.

Link to comment

That's also my plan, the Pi4 with OpenWRT is already prepared.

 

The sad thing about this is that it was working flawlessly with 6.8.3 and something somewhere somehow broke the functionality starting with 6.9. But well, because of that I started looking at OpenWRT and got something new to tinker with. 

Link to comment
6 hours ago, Torben said:

That's also my plan, the Pi4 with OpenWRT is already prepared.

 

The sad thing about this is that it was working flawlessly with 6.8.3 and something somewhere somehow broke the functionality starting with 6.9. But well, because of that I started looking at OpenWRT and got something new to tinker with. 

Oh yeah it never worked right for me. From the day I set it up it ran like crap.

 

pfSense's implementation works great. I've got a Protectli FW4B at home and a J4125 at the remote site, both running pfSense 2.6.0.

Link to comment
  • 1 month later...
On 1/24/2023 at 7:42 AM, PlanetDyna said:

Unfortunately, no one is working to fix the problem?

The problem is no one seems to be able to definitively pinpoint the cause of the issue.

 

LimeTech is apparently unable to reproduce this in their testing, and it seems to be limited to only a small subset of users, so it just isn't garnering much attention, I think. The more people who find this thread and share their experiences, the more likely someone is to take a serious look at the problem.

 

The only solution at this time is to just not use UnRAID as a WireGuard server if you want to be able to move large amounts of data quickly.

Edited by weirdcrap
Link to comment
