
Help? Strange 10Gbit network behaviour, trying to migrate TBs of data, ultra-slow transfer speed between 2 Unraid servers.



Hi, so I'm trying to move/migrate about 8TB of data between two unraid servers.

Both systems are equipped with 10Gbit networking and have been working fine otherwise.

I have mounted the UNRAID2 SMB shares on UNRAID1 as remote mount points.

I have been trying a "push" migration, running commands on UNRAID1 to "push" files over to UNRAID2.

I have checked through all the networking settings and everything seems fine; nothing has been changed.

Everything else is working fine on the network.

 

I started the migration using `mv`, but that can't be resumed, so I switched to rsync using the following command:

`rsync -avzh --remove-source-files --progress /mnt/user/isos/ /mnt/remotes/UNRAID2_isos/`
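(A side note in case anyone reuses this: since both paths look local to rsync once the SMB share is mounted, `-z` compression can't speed up the actual network hop and mostly just costs CPU, so a variant without it may be worth trying. Same paths as above, just that flag dropped:)

`rsync -avh --remove-source-files --progress /mnt/user/isos/ /mnt/remotes/UNRAID2_isos/`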

 

After this rsync migration completes, I intend to run this to find and remove the empty dirs left behind (I never got this far):

`find /mnt/user/isos/ -type d -empty -delete`
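(If anyone copies that: swapping `-delete` for `-print` first shows what would be removed, and adding `-mindepth 1` keeps the top-level `isos` directory itself from being deleted if it ends up empty. A dry-run version would be:)

`find /mnt/user/isos/ -mindepth 1 -type d -empty -print`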

 

I noticed that the transfer would fill up the 2TB cache drive, everything would slow right down once the mover on UNRAID2 started emptying the cache, and the whole process would stall while the mover frantically emptied the cache as more data was pushed in.

 

So, I figured I would simply wait for the mover to move everything from the cache drive to the array, change the `isos` share options on UNRAID2 to write directly to the array without the cache involved, and resume the move/migration again...
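(For anyone wanting to double-check that change from the shell: as far as I know Unraid keeps per-share settings in `/boot/config/shares/<sharename>.cfg`, so something like the line below should show the current cache mode for the share. The exact file name and key may differ between releases, so treat it as a guess rather than gospel.)

`grep -i usecache /boot/config/shares/isos.cfg`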

 

Before:

[screenshot]

 

After:

[screenshot]

 

I figured it would be slower writing to the array disks at ~250MB/s rather than NVMe speeds far above that, but I could leave it running until it's done and the mover would not be required. Since the source is the array disks on UNRAID1 anyway, in the end it's not much different in overall speed.

 

Now, when I started the migration without the cache/mover involved, I noticed that the transfer would start, about 3 or 4 files would be moved over, and then it would just stall.

 

System Stats from UNRAID2:

[screenshot]

You can see here that the transfer would begin, copy some files with some breaks, then stall and wait.

 

After trying this a few times, I rebooted both systems multiple times, cold-booted them multiple times, and rechecked all settings. Nothing seemed odd.

 

After this, I tried reconfiguring the `isos` share on UNRAID2 to use cache again, and that seemed to start working, but slower.

But then the same problem would eventually occur: the cache would fill and speeds would slow down while the mover did its job again.

 

So I stopped the transfer, ran the mover, and waited for all files to be moved.
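(For reference, the mover can be kicked off from the Main page, and as far as I know recent Unraid releases also accept `mover start` / `mover stop` from the shell, while older ones just take `mover`. I'm not certain of the exact version cutoff, so check before scripting it.)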

After this I reconfigured the `isos` share back to not use cache, and the same problem appeared (UNRAID2):

[screenshot]

 

I'm really quite at a loss as to what is happening here.

 

Can anyone help?


What you see is totally normal.

10G LAN is much too fast for the uncached array disks.

So the transfer starts, the receiving Linux side buffers it in RAM, RAM fills up, the transfer is halted by flow control, the RAM buffer is slowly written to disk, RAM is freed, and flow control resumes the transfer. This loops until all files are done.

Even with a 2.5G LAN it would happen, just a lot later. Modern disks should be able to keep up with a 1G LAN.
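If you want to actually watch this happen on the receiving server, something like the command below (plain Linux, nothing Unraid-specific) shows the dirty/writeback page counters filling up while the transfer runs and draining while it stalls:

`watch -n1 'grep -E "^(Dirty|Writeback):" /proc/meminfo'`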

 

And even with a cache drive it would also happen, just a bit later again (but then the pauses are much longer because of the mover).

 

So: BE PATIENT! That's the only solution.

 

Otherwise, lower the LAN speed down to 1 or 2Gb/s; this will slow down the transfer and likely allow the disks to keep up. But beware, an array with parity can hardly keep over 70-80MB/s constant write speed, so even 1G may stall now and then, but much, much less.
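(One way to try that without recabling, assuming the NIC driver supports it and assuming the interface is `eth0`, is to restrict autonegotiation: `ethtool -s eth0 speed 1000 duplex full autoneg on`. Whether the link actually comes up at 1G depends on the NIC and switch, so plugging into a 1G switch port is the foolproof option.)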

 

In the end the transfer will take the same amount of time; the slow LAN will just keep your nerves cooler because there is constant progress. That's just a psychological effect.

Edited by MAM59

Thanks for the responses guys, I've looked into things and did another full review... but I still think SOMETHING was up.

 

7 hours ago, MAM59 said:

So: BE PATIENT! That's the only solution.

 

Otherwise, lower the LAN speed down to 1 or 2Gb/s; this will slow down the transfer and likely allow the disks to keep up. But beware, an array with parity can hardly keep over 70-80MB/s constant write speed, so even 1G may stall now and then, but much, much less.

The reason I say this is that I know my array is capable of writes above 70MB/s, and the very low write speeds are what alarmed me.

It's visible on the graphs that there is a lot of reading at 200MB/s and very little writing at a low ~10-15MB/s.

 

I also came to the same educated guess that it's caching, but the numbers still didn't sit right.

I left it running all night, and I think the same thing happened; only a relatively small amount was transferred.

 

Today, I saw my ISP router had an issue overnight (coincidence, no idea), and decided to do a full restart of the entire network and both UNRAID servers (I have not done that in years at this point).

So I cold-started both UNRAID servers, my ISP router, the pfSense router, all WiFi APs, and all switches.

After that I waited for everything to settle and started the transfer again.

I also turned off Docker and VMs on both servers to make sure I'm seeing realistic numbers.

 

This is now what I'm seeing on UNRAID2:

[screenshot]

 

I think this is more in line with what I was expecting to see; the write speed of the disks is much higher at 200MB/s, in fairly consistent bursts.

Maybe the spinning disks' caches will fill up and the writes will drop down again, but it seems pretty consistent so far.

 

7 hours ago, MAM59 said:

In the end the transfer will take the same amount of time; the slow LAN will just keep your nerves cooler because there is constant progress. That's just a psychological effect.

I would still expect something above 15MB/s, but maybe it's a psychological thing like you said.

 

Am I wrong in this estimation?

 

I also did a test transferring 55GB with the cache turned on, to compare again, and it's again what I would expect when writing to the cache at 2GB/s, bursting to the NVMe drive:

[screenshot]

 

And the 55GB finished looking like what I expected; it's consistent even though it's in bursts (where before it seemed to stall for a VERY long time):

[screenshot]

 

And with that done, I then ran the mover and waited for the cache to empty to the array:

[screenshot]

 

The mover is apparently great at consistent writes above 300MB/s, though.

If the mover writes to disk at 300MB/s, and the remote transfer comes in from spinning disks anyway, is it better to just leave the cache on?

In the end, would you advise migrating large amounts of data with the NVMe cache on or off, overall? I still have another ~6.5TB to migrate.

 

3 hours ago, JorgeB said:

If you are transferring directly to the array enable turbo write.

I made sure that turbo write/reconstruct write was turned on before this migration; it's actually been on for a week now in preparation for this.
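(For anyone looking for it later: the setting is "Tunable (md_write_method)" under Settings → Disk Settings. If I remember right it can also be read or flipped from the shell with Unraid's `mdcmd` tool, e.g. `mdcmd set md_write_method 1` for reconstruct write, but treat that as an assumption and use the GUI unless you've verified it on your release.)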

Thanks for the reminder. I've been wondering: is it advisable to have this turned on long-term, as I have read that it is harder on the array disks?

Edited by KptnKMan
11 minutes ago, KptnKMan said:

is it advisable to have this turned on long-term, as I have read that it is harder on the array disks?

 

The difference is that all of the array disks will be spun up during write operations rather than just the data disk and the parity disk(s). All of the disks will be spun down after the "Default spin down delay:" parameter has timed out using either mode. (Mine is set for 30 minutes...) When you are migrating your 8TB of data this will result in (perhaps) sixty-four hours of extra hard disk 'spin-up' time for some of the drives. I seriously doubt that will significantly reduce the life expectancy of those disks which would not otherwise be spun up in "read/modify/write" mode. (Remember that server farms figure they are going to get at least two to four years of service, and those drives are always spinning and read/write operations may be continuous!)

