Big file transfers crash Unraid server



TL;DR:

If file transfers of big files are too fast, the receiving Unraid server caches far too much, crashes smbd and stalls both the sending and the receiving machine. Memory is eaten up completely and the CPU sits at 100%.

 

I can reproduce this on two bare-metal servers connected via Gbit Ethernet with really large files (beyond 100GB). On a bare-metal server hosting an Unraid VM it happens much earlier, at around 15GB. The threshold depends on the amount of RAM the receiving Unraid server has.

 

Release 6.6.6

Diagnostics from two servers attached.

Screenshot attached.

Environment attached.

Mount options attached.

 

 

Long:

For some weeks now I've been on a mission to find out why big transfers crash my Unraid servers.

 

My first suspect was Unassigned Devices, because whenever the receiving server crashed, Unassigned Devices on the sending server stalled (mount point not reachable, Main page hangs). When this happens, nginx crashes as well (don't ask me why) and there's no way to recover. Even the console does not work because the mount points hang, so the only way out is IPMI.

 

I removed Unassigned Devices and did the mounts myself with some User Scripts. What can I say? Same result.

 

Then I went through the documentation of Samba, mount and mount.cifs to get some insight. I have to say I'm a Linux noob, so this is hard stuff for me. What I found is:

 

Something caches far too much. The only caching-related settings I found are the SMB options "aio write size" and "write cache size". "aio write size" is set to 4096 in smb.conf; as far as I understand, Samba will start asynchronous writes if a request is bigger than that. "write cache size" is not used; its default is 0, i.e. no cache.
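For reference, the effective values (including Samba's built-in defaults) can be dumped with testparm; the exact output wording may differ between Samba versions:

# Show what Samba is actually using for the two caching-related options.
testparm -sv 2>/dev/null | grep -E 'aio write size|write cache size'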

 

So if Samba does not cache, who does? The only cache I can think of is the filesystem cache, and that's where my journey ends. I don't have a clue.

 

So my question is: who the hell is caching and eating up all the memory on a receiving Unraid server?
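If I understand correctly, the usual suspect for this symptom is the kernel's page cache: writes land in RAM as "dirty" pages first and are flushed to disk later. A way to watch this during a transfer (standard Linux commands, nothing Unraid-specific assumed):

# "buff/cache" is the page cache; watch it grow during the transfer.
free -h
# Data that has been written into RAM but not yet flushed to disk.
grep -E '^(Dirty|Writeback):' /proc/meminfo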

 

Environment
===========
3x Supermicro SC846E16 chassis (1x bare metal, 2x JBOD expansion)

Bare metal:
1x Supermicro BPN-SAS2-EL1 backplane
1x Supermicro X9DRi-F Mainboard
2x Intel E5-2680 v2 CPU
2x Supermicro SNK-P0050AP4 Cooler
8x Samsung M393B2K70CMB-YF8 16GB RAM --> 128 GB
1x LSI 9300-8i HBA connected to internal backplane (two cables)
2x LSI 9300-8e HBA connected to JBOD expansions (two cables each)
2x Lycom DT-120 PCIe x4 M.2 Adapter
2x Samsung 970 EVO 250GB M.2 (Cache Pool)

For each JBOD expansion:
1x Supermicro BPN-SAS2-EL1 backplane
1x Supermicro CSE-PTJBOD-CB2 Powerboard
1x SFF-8088/SFF-8087 slot bracket
2x SFF-8644/SFF-8088 cables (to HBA in bare metal)

For each expansion box I've set up an Unraid VM. One 9300-8e is passed through to each VM. Every VM has 16GB RAM and 4x2 CPUs. All disks in the expansion boxes are mounted on the bare-metal server via mount -t cifs.
Example mounts:
===============
mkdir -p /mnt/hawi
mkdir -p /mnt/hawi/192.168.178.101_disk1
...
mkdir -p /mnt/hawi/192.168.178.102_disk1
...

mount -t cifs -o rw,nounix,iocharset=utf8,_netdev,file_mode=0777,dir_mode=0777,vers=3.0,username=hawi,password=******** '//192.168.178.101/disk1' '/mnt/hawi/192.168.178.101_disk1'
...

mount -t cifs -o rw,nounix,iocharset=utf8,_netdev,file_mode=0777,dir_mode=0777,vers=3.0,username=hawi,password=******** '//192.168.178.102/disk1' '/mnt/hawi/192.168.178.102_disk1'
...
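One thing that could be tried on these mounts is disabling the CIFS client's local caching via the cache= option (cache=none is documented in mount.cifs); whether it changes anything here is untested:

# Same mount as above, but with client-side caching for this share disabled (cache=none).
mount -t cifs -o rw,nounix,iocharset=utf8,_netdev,file_mode=0777,dir_mode=0777,vers=3.0,cache=none,username=hawi,password=******** '//192.168.178.101/disk1' '/mnt/hawi/192.168.178.101_disk1'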

 

tower-diagnostics-20190213-1549.zip

towervm01-diagnostics-20190213-1549.zip

Clipboard01.jpg

Edited by hawihoney
15 minutes ago, hawihoney said:

I can reproduce this on two bare-metal servers connected via Gbit Ethernet with really large files (beyond 100GB).

I don't usually transfer files that large. Files of about 50GB I use quite frequently and have never had issues, and I definitely have no problem transferring several TBs over SMB with btrfs send/receive, which is a continuous stream. The last one took almost 24 hours without any issues, and memory usage doesn't go above 20% with 16GB installed. If no one else has any ideas, I'll do some tests with a 100GB file over the weekend.
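For reference, the send/receive pattern looks roughly like this; the paths, snapshot name and the ssh transport below are placeholders, not my actual setup:

# btrfs send needs a read-only snapshot; the stream is then piped to the receiver.
btrfs subvolume snapshot -r /mnt/data /mnt/data_ro
btrfs send /mnt/data_ro | ssh root@backup-server 'btrfs receive /mnt/backup'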

 

Edit: I forgot I have a screenshot from the last one, taken because of an unrelated issue, and it was 17 hours, not 24.

 

sr.png.0963122e4515ece3392ff9feafe57179.png

 

 

Edited by johnnie.black

Please consider a very fast connection: reading from a PCIe x4 M.2 in the first server and writing through a PCIe x4 LSI 9300-8e directly into the second server's backplane, all over SMB. Here in the forums I've seen 1-3 similar threads; if I remember correctly, 10Gb LAN was involved in one of them.

 

What I can see is: the copy on the writing server is done within a minute or so, but the OK is not given; it simply waits for an answer. The receiving server keeps writing for 10-20 minutes, depending on file size. Memory is eaten up on the receiver side, and at some point smbd crashes. So I reduced the file size: at around 15GB memory is eaten up as well, but the copy completes, and memory is freed immediately afterwards.

 

As my servers crash during these tests, I'm a little nervous about the next thing I want to try: increasing the available RAM of the receiving server. I'm pretty sure this will help.

 

Current workarounds here: use USB sticks, use rsync via SSH, increase the memory of the receiving server (tested), or reduce the file size on the sending server. All of these work.
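For reference, the rsync-via-SSH workaround looks roughly like this (host and paths are just placeholders):

# Copy the large file over SSH instead of SMB; --partial allows resuming, --progress shows throughput.
rsync -a --partial --progress /mnt/user/movies/bigfile.mkv root@192.168.178.102:/mnt/disk1/movies/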

 

 

 

Edited by hawihoney
26 minutes ago, hawihoney said:

What I can see is: the copy on the writing server is done within a minute or so, but the OK is not given; it simply waits for an answer. The receiving server keeps writing for 10-20 minutes, depending on file size.

I'm not sure how this can happen. By default Unraid uses 20% of free RAM for write cache; once that is filled, the transfer slows down to the speed the server can actually write to the devices. It will (or at least it should) never use more than 20% of RAM for write cache, at least not without the user changing the defaults. I can't see how the server could still be writing 10 or 20 minutes after the transfer finishes, but maybe someone else will have some ideas.
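The 20% figure corresponds to the kernel's vm.dirty_ratio default, which throttles writers once too much unwritten data sits in RAM. If someone wants to experiment, the thresholds can be inspected and, for a test only, capped at absolute sizes (the numbers below are arbitrary examples, not a recommendation):

# Current write-back thresholds as a percentage of RAM.
sysctl vm.dirty_background_ratio vm.dirty_ratio
# For testing only: limit dirty data to absolute byte counts instead of percentages.
sysctl -w vm.dirty_background_bytes=$((256*1024*1024))
sysctl -w vm.dirty_bytes=$((1024*1024*1024))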


Pretty sure this is what I was seeing too, Johnnie. There is something weird with SMB transfers of large files. I eventually gave up on transferring via SMB and cloned the entire drive to get the data moved. I wish I could do more testing, but my rig is in pretty constant use. The behavior I saw was that after the transfer started, it would fill up the host memory until there wasn't any left.

 

2 minutes ago, jordanmw said:

Pretty sure this is what I was seeing too, Johnnie. There is something weird with SMB transfers of large files.

Maybe. It's odd that I don't have issues with send/receive pushing TBs of data over SMB, but since it works at the block level it might behave differently. At what file size did you notice the issue?


@jordanmw: Can you tell us a little bit about your environment?

 

Do you transfer between servers? Bare metal or VM? How is everything connected? 10G LAN? ...

 

I really want to get this fixed and I'm willing to help. I know my environment is no longer average, but with really fast SSDs and 4K or 8K movies, this might hurt more people.

 

 

 

 


Stupid question: is it possible that the tunable parameters have any impact on smbd behaviour on a receiving Unraid server?

 

I remembered that I had changed the three values "md_num_stripes", "md_sync_window" and "md_sync_thresh" from custom values back to their defaults. I did this while trying to track down the 6.7.0-RCx problems mentioned here:

 

 

Yesterday I changed these values back to the custom values I had found here in the forums. After a day without problems, it seems to help.

 

Is this possible?
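For reference, the values currently in effect can be checked like this; the disk.cfg path and the mdcmd command are Unraid-specific assumptions, and the exact output format may differ:

# Values persisted in the flash config.
grep -E 'md_num_stripes|md_sync_window|md_sync_thresh' /boot/config/disk.cfg
# Values the md driver is currently running with.
mdcmd status | grep -iE 'stripes|sync'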

 

Clipboard01.jpg

Edited by hawihoney
2 years later...

I know this is an old thread, but I was having this issue transferring from a VM (on a different Unraid server) to the array. After trying a lot of things, I found that changing the machine type from i440fx to Q35 fixed it for me.

Edit: it's a lot better, but not a 100% fix.
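For anyone wanting to check which machine type a VM currently uses before switching, libvirt's standard tooling can dump the definition ("Win10" is a placeholder VM name):

# Show the machine type (i440fx vs. q35) in the VM's libvirt definition.
virsh dumpxml Win10 | grep -o "machine='[^']*'"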

Edited by Tosh6072
