Problem with growing files on the disk array



I have a problem similar to this old one. I am trying to back up my Windows computer to the UnRAID array with a program called `Macrium`.

The program works fine, no error logs or crashes. Backups are working when sent to Synology shared folders or Qnap shared folders.

 

But on UnRAID:

The backup starts, creates a small backup file on the server (around 500MB to 1.3GB) and then does not continue. The Windows process does not freeze, and the backup process can be paused, stopped and closed. It just tells me `20 minutes remaining` and stays stuck at around 0%-2% in the progress bar, but nothing happens from there on.

So there are no log errors I can share.

 

I am using one cache drive, but not for the share the backup will be put on. The share for the backup is split across multiple drives (all 8TB) with the option `Allocation method` set to `High-water`. All drives are around 45-53% full, so High-water will kick in for the next files.

One backup file could reach a size of 500GB to a maximum of 1TB.

The `Recycle Bin` plugin is installed on the UnRAID system along with the `Fix Common Problems` plugin.

 

Could it be the `High-water` setting? What exactly does UnRAID do if a file grows over time? The Macrium `.tmp` image backup file starts at 0KB and then grows up to the maximum size (500GB or at most 1TB). What if UnRAID decides, after some percentage, to move the physical file to another disk?

Does a "normal copy" of a large, already existing file tell the server how much space to reserve on the drive or not?

What can I do with files that grow over time when the final size is not known at the start?

 

EDIT

Setting `Allocation method` to `Fill up` helps a little bit. Now the process gets to 50%, but then the same problem occurs.

At some point, UnRAID might decide to put a growing file on another physical disk when the old one does not have enough space left. So there is no Allocation method that protects me from this behavior. But how is this solved in the real world?

E.g. two disks, each 1GB in size. Disk one has 500MB used, disk two 0MB. I start a new file and it is put on disk one. The file grows and grows and at some point reaches 501MB. Now the file must be moved from disk one to disk two ... and while that is done the client crashes? That cannot be the intended behavior. Am I missing a setting?

And what is the best Allocation method for such a real-life scenario?

And which Allocation method do most people use and why?

 

Or is this setting not the problem? Does a file get an inode on one drive and stay there, without being moved by UnRAID, when it grows?
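To make the allocation question a bit more concrete, here is a minimal Python sketch of how a High-water style choice could work. It rests on two assumptions that are not confirmed anywhere in this thread: the disk is picked once, when the file is created, and a growing file stays on that disk until the disk is full. The function names and the free-space figures are hypothetical.

```python
def pick_disk_high_water(free_gb, largest_disk_gb):
    """Pick a disk for a NEW file, high-water style (simplified model).

    Assumption: the mark starts at half the largest disk and is halved
    whenever no disk has more free space than the current mark.
    """
    mark = largest_disk_gb / 2
    while mark >= 1:
        for index, free in enumerate(free_gb):
            if free > mark:
                return index  # first disk above the mark wins
        mark /= 2
    return None  # no disk has usable space left


def fits_on_disk(free_gb, index, final_size_gb):
    """Assumption: a file never moves, so it must fit on the disk it started on."""
    return final_size_gb <= free_gb[index]


# Hypothetical free space (GB) on four 8TB disks that are 45-53% full.
free = [4400, 3760, 4100, 3900]
disk = pick_disk_high_water(free, largest_disk_gb=8000)
print(disk, fits_on_disk(free, disk, final_size_gb=1000))
```

Under this model a 1TB backup would fit on any of the four disks, so the allocation method alone would not explain a backup stalling at 0%-2%.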

 

Info about my setup:

Version 6.8.3

SMB share configured through the UnRAID interface (no direct Samba config)

Windows 10

Edited by christopher2007
Link to comment

@christopher2007 Just a small question: does Macrium have an option to split the created backup file? I ask because all the backup software I know has this option, and for most of it splitting is the default setting. The reason is that in most cases users upload their backups to cloud storage, or over the network to a NAS or some other server. Network traffic can be interrupted, and packets get dropped and have to be resent. Smaller files have a huge advantage in this case. Let's say your software splits the backup into 50MB chunks locally and starts to upload the first file. When it has finished, it checks whether the hash of the remote file matches the local one. If it doesn't, the software only re-uploads that 50MB file. This is much quicker than creating a huge 500GB file, waiting for the upload and then having to upload the whole file again.
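The chunk-and-verify idea can be sketched in a few lines of Python. This is only an illustration, not how Macrium or any particular backup tool works: the SHA-256 hash, the function names and the tiny 100-byte chunks are all assumptions chosen to keep the example short.

```python
import hashlib


def chunk_hashes(data: bytes, chunk_size: int) -> list[str]:
    """Return one SHA-256 hex digest per fixed-size chunk of data."""
    return [
        hashlib.sha256(data[offset:offset + chunk_size]).hexdigest()
        for offset in range(0, len(data), chunk_size)
    ]


def chunks_to_resend(local: bytes, remote: bytes, chunk_size: int) -> list[int]:
    """Return the indexes of chunks that are missing or differ on the remote side."""
    local_hashes = chunk_hashes(local, chunk_size)
    remote_hashes = chunk_hashes(remote, chunk_size)
    return [
        index
        for index, digest in enumerate(local_hashes)
        if index >= len(remote_hashes) or remote_hashes[index] != digest
    ]


local = b"A" * 100 + b"B" * 100 + b"C" * 100
remote = b"A" * 100 + b"X" * 100 + b"C" * 100  # second chunk damaged in transit
print(chunks_to_resend(local, remote, chunk_size=100))
```

Only the damaged chunk has to be sent again, which is the advantage over re-uploading one huge file.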

 

Unraid doesn't know how large your file will be in the end, and depending on how you set up the share it will store it on the first drive that has space. I guess reaching the limit of the drive and starting to move the file to another disk could cause an interruption that breaks the data stream from Macrium to the share.

Link to comment

@bastl thanks for your fast reply.

Yes, there is an option for splitting the backup into smaller files. But with this setting turned on, Macrium is no longer able to create incremental backups. So sadly that is not an option for me, because the daily backups are important.

Currently there is one full backup every two months, a differential backup every two weeks and an incremental backup each day.

And after migrating from Synology to UnRAID the first new full backup to UnRAID fails with the above explained problem.

Link to comment

@bastl Of course, sorry for not having thought of this myself.

The screenshots are attached to this post.

 

And yes, I have a `minimum free space` set to `100GB` for testing. Until yesterday it was on `0KB` (default).

I also restarted the system one hour ago to see if that helps (try turning it off and on again?) ... sadly no change. The problem remains ... But the docker containers now show the `WebUI` menu entry again.

That had been a problem over the last few days: this menu entry was missing and the right-click context menu showed `Console` instead.

Currently only one docker container is installed and running: `binhex-krusader`, with additional hooks to the unassigned devices and the root share. But turning off the docker image does not help with the problem discussed in this thread.

 

Disk Overview.png

Share.png

Edited by christopher2007
Link to comment

@christopher2007 From your screenshots everything looks OK to me. Enough space to store 500GB, the default minimum free space of 0KB that you said you tested, and not using the cache. Did you look into your Unraid logs around the time the data transfer stopped? Maybe have a look at the SMART reports for the disks as well, to see if any errors are logged. The next time the backup "freezes", download your diagnostics from Unraid before restarting the server and post them. Also have a look at your disk temps while the backup job is running.

Link to comment

@bastl thanks for your help so far.

I started a new backup. Directly stuck at 0%.

Disk temps are all normal, under 50°C. No errors in the system log. The only output during that time:

`Jul 18 13:25:04 TS-Alt emhttpd: cmd: /usr/local/emhttp/plugins/dynamix/scripts/disk_log sdd`

I already had a look at the diagnostics zip file and found nothing there, but I am attaching it to this post (the dump was created one minute after I realized that the backup had stopped at 0% again).

 

 

My next idea is that it may be the network card.

I have two 1G RJ45 network ports on the motherboard and bonded them with 802.3ad link aggregation. My Netgear switch can handle that just fine, so I had 2G connectivity. After I noticed the backup problem I installed a 10G network card (one from ASUS, a ripoff of atlantic, twisted pair, Cat6).

But the problems are even worse now:

The same backup problem. But now there is a new one: sometimes the connection breaks so that Windows says `host unreachable`. After waiting a few seconds I can use the share again.

 

I am not sure whether this problem also occurred with the network bond. It is hard to test because these `connectivity losses` happen about once an hour without any errors being logged on the server.

ts-alt-diagnostics-20200718-1326.zip

Edited by christopher2007
Link to comment

From the way you describe it, to me it looks like a networking issue, either on the Windows client or on Unraid itself. I currently have Macrium 7.2 Free installed and am testing a backup with default settings of a 1TB SSD, 780GB filled, to a default Unraid share that does not use the cache. One small difference: my client is a VM on the same Unraid build, but with its own IP. The disks I am backing up to have 1.5TB of free space. So far everything looks OK to me. 100 gigs have already been transferred and I'm able to browse other shares without any hiccups or slowdowns. I guess either your switch, your cables or Windows itself has some issues.

 

Any extra virus scanners installed?

Do you have another Windows client where you can test how stable the connection during the backup is?

Do you have any IDS system on your network analysing your LAN traffic?

Link to comment

Unfortunately no IDS on the network. I will search for one maybe as a docker image to install on the server as well.

Also not tested with other computers yet. Currently I am installing a Win10 machine in order to test Macrium backups with this in a few minutes.

Avast is installed on the main system as the virus scanner. There were no problems when backing up to the old Synology, so I had never tried turning it off. Now I did, and it did not change the result: the backup still gets stuck.

 

I will try it with the other windows machine and then come back with the results.

 

Until then one side question:

Better use SMB shares in Windows over IP Address or over computer name?

Link to comment
6 minutes ago, christopher2007 said:

Better use SMB shares in Windows over IP Address or over computer name?

Depends on how you manage your local DNS settings. If you have DNS lookup issues on your network, direct connections via IP should work without DNS. In my current test I'm using Unraid's DNS entry as the path for Macrium to back up to. It is still running and 300 gigs are already written.

 

I found a couple things in the smart logs for your disks you might have to look into

 

For sdb, for example, there are lots of read and seek errors:

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     POSR--   100   065   044    -    188393
  3 Spin_Up_Time            PO----   091   089   000    -    0
  4 Start_Stop_Count        -O--CK   100   100   020    -    50
  5 Reallocated_Sector_Ct   PO--CK   100   100   010    -    0
  7 Seek_Error_Rate         POSR--   077   060   045    -    48327554
  9 Power_On_Hours          -O--CK   100   100   000    -    404

These errors are shown for all your array disks. Of my 3 spinners, one shows only a single seek error, and my disks are almost 6 years old now. It could be an issue with your controller as well. Maybe @johnnie.black can also have a look at your logs if he has time and might have some hints for you.

Link to comment
40 minutes ago, christopher2007 said:

Until then one side question:

Better use SMB shares in Windows over IP Address or over computer name?

Not sure if it makes a difference as for some things I use a mapped drive, for Macrium I use \\UNRAID\<folder> for my backups, and I'm testing right now \\<IP>\<folder> to see if it looks any slower.  So far it doesn't.  Ended just fine.

 

I'm imaging a 128GB SSD drive just for testing purposes, but I've never had any issue imaging my main Windows drive to Unraid with Macrium.  It's not as large as yours though.  I only do fresh install images around 30GB.

 

I couldn't test out the incremental backup as at the moment I only have the free version installed.  But are you sure that you cannot do incremental with file size splitting? My first test image was 1GB splits, and the differential used the same profile with no problem (was only a single file though).

 

I would say that it's probably something with your network, but can't offer any suggestions.  My only suggestion might be to rethink your backup needs and configuration.. there's probably not a real good reason to ever be creating a 1TB backup file.. that's pretty extreme.   It's only my opinion but it seems like you should work out a better way.  If you have "personal files" to backup, do those separately perhaps.  If your entire image includes game installs of 100GB+ each, exclude them.  If you had to restore an image you can reinstall the game.  Things like that can make the backup easier.

 

Only suggestions as I don't know what your actual use is or why your backups are so large or you feel you need so many.  ;)

 

I'll try another Macrium test on a larger drive with no file splitting and report my results.

 

Edit -- A single backup file, no file splitting, 35GB, completed in 12 minutes.  That was my largest drive with data on it. :P

 

Edited by Energen
Link to comment

Thanks so much @bastl for your time.

I am using 4x `Seagate Exos E 7E8 8TB, 512e, SATA 6Gb/s (ST8000NM0055)`, brand new. I checked the SMART information before installation and there were no errors and no power-on time either (so from what I could tell, the disks had never been used before).

So how do errors like these come about? Should I be worried? Can this be fixed?

 

I built the new server myself with these parts: hardware list

After that I configured everything and copied all the old files with `rsync -tr --info=progress2 <source> <destination>` from the old Synology NAS to the new UnRAID server, directly into the freshly created SMB share folder.

This command failed twice, so I had to press Ctrl+C to cancel it and then remove everything with `rm -rf` in order to start from the beginning.

Could the errors be the result of cancelling rsync the hard way?

 

Now I am very worried. The system is new and there are read/write errors? 🙈

 

@Energen thanks for your contribution as well.

Macrium warns me (in red) if I set up the file split for the backup: `Note: Incremental retention rules will not be run if backup files are split. This can be caused by setting a fixed file size or if the destination file system is FAT32`

My English is not the best, but I guess only full and differential backups will work with file size splitting activated.

Unfortunately there are no games I am trying to backup. It is the Windows OS with my work and the renders I am working with are sometimes large.

 

But I will investigate my network in order to search for an error there.

And I am now starting a Macrium backup from another computer (also Windows 10) in order to get more information about the problem.

Link to comment
4 minutes ago, johnnie.black said:

Those values are normal for Seagate drives, more info here.

WTF Seagate... So all his reported seek or read errors have 8 digits and that is just fine and normal? 😂

Why is it even possible for a drive to report errors when there are actually no errors? Is this something that Limetech could patch, or is it purely a "Seagate drive thing"?
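One common reading of these Seagate raw values, which I am assuming applies to the drives in this thread, is that the 48-bit raw number packs two counters: the upper 16 bits hold the actual error count and the lower 32 bits hold the total number of operations. A minimal Python sketch under that assumption (the function name is made up for illustration):

```python
def split_seagate_raw(raw_value: int) -> tuple[int, int]:
    """Split a 48-bit raw value into (errors, operations).

    Assumption: upper 16 bits = error count, lower 32 bits = operation count.
    """
    errors = (raw_value >> 32) & 0xFFFF
    operations = raw_value & 0xFFFFFFFF
    return errors, operations


# Raw values quoted for sdb earlier in the thread.
for name, raw in [("Raw_Read_Error_Rate", 188393), ("Seek_Error_Rate", 48327554)]:
    errors, operations = split_seagate_raw(raw)
    print(f"{name}: errors={errors} operations={operations}")
```

Read this way, both values quoted for sdb decode to zero errors, which would fit the reply that these numbers are normal.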

Link to comment

Is there a way to get the driver information from the installed 10G network card?

Currently I am testing with another Windows 10 PC and so far Macrium runs just fine (currently at 43%, with 105GB finished).

 

My changes for this test:

The secondary Windows Computer has an Intel LAN OnBoard network card.

The UnRAID server now uses only one OnBoard network plug and not the 10G PCI card from ASUS.

The main Windows Computer normally has the exact same 10G PCI network card.

 

UnRAID and main Windows use the `ASUS XG-C100C, RJ-45, PCIe 3.0 x4 (90IG0440-MO0R00)`.

 

So my current guess: This card causes the trouble. Not 100% sure at this point, but the best guess in the last three days.

 

For the main Windows computer there are the old ASUS drivers and newer drivers for the `Aquantia AQC107` chip that is built into the ASUS ripoff.

But for UnRAID I am not sure. What version is UnRAID using? Is there a way to install a newer version?

 

And @bastl creepy, but I guess the geizhals link has betrayed me, hasn't it? 😜

Link to comment

A quick search on the forum for the ASUS card shows that it should work since Unraid version 6.7.

 

 

Not sure if you have to increase the MTU size on the server and client side from, I think, the default 1500 to something around 9000 to gain better performance, but without fast storage like an SSD you won't see much of an increase compared to a 1gig NIC. Remember that writing directly to the array only saturates one disk, and the drive is the limiting factor. On my last Unraid build I had an Aquantia 10G NIC on my board, but never used it because I never had a second client providing that speed, nor did I have a 10gig switch.

 

20 minutes ago, christopher2007 said:

Is there a way to install a newer version?

Usually the drivers come with the kernel itself or are added by Limetech in a newer Unraid build. As people have reported Aquantia NICs working, I guess the current 6.8.3 should already include the driver, and according to your device list the atlantic driver is loaded.

01:00.0 Ethernet controller [0200]: Aquantia Corp. AQC107 NBase-T/IEEE 802.3bz Ethernet Controller [AQtion] [1d6a:07b1] (rev 02)
	Subsystem: ASUSTeK Computer Inc. Device [1043:8741]
	Kernel driver in use: atlantic
	Kernel modules: atlantic

Some parts of your logs for the nic

Jul 18 13:09:54 TS-Alt kernel: atlantic: link change old 1000 new 0
Jul 18 13:09:54 TS-Alt kernel: br0: port 1(eth0) entered disabled state
Jul 18 13:09:55 TS-Alt dhcpcd[1751]: br0: carrier lost
Jul 18 13:09:55 TS-Alt dhcpcd[1751]: br0: deleting route to 192.168.0.0/24
Jul 18 13:09:55 TS-Alt dhcpcd[1751]: br0: deleting default route via 192.168.0.1

Not sure if this is related to you changing some network configs earlier or not, but I guess it's not an error dropping the connection. If you test with the ASUS NIC later, watch your logs to see whether it reports that carrier lost while your backup is transferring.
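If watching the log by hand gets tedious, a small Python helper can pull out the link drops. This is just a sketch: it assumes the syslog lines keep the exact `atlantic: link change old ... new ...` wording quoted above, and the function name is hypothetical.

```python
import re

LINK_CHANGE = re.compile(r"atlantic: link change old (\d+) new (\d+)")


def find_link_drops(log_lines):
    """Return (timestamp, old_speed) for every line where the link speed falls to 0."""
    drops = []
    for line in log_lines:
        match = LINK_CHANGE.search(line)
        if match and int(match.group(2)) == 0:
            timestamp = line[:15]  # e.g. "Jul 18 13:09:54"
            drops.append((timestamp, int(match.group(1))))
    return drops


# Sample lines copied from the log excerpt above.
sample = [
    "Jul 18 13:09:54 TS-Alt kernel: atlantic: link change old 1000 new 0",
    "Jul 18 13:09:54 TS-Alt kernel: br0: port 1(eth0) entered disabled state",
    "Jul 18 13:09:55 TS-Alt dhcpcd[1751]: br0: carrier lost",
]
print(find_link_drops(sample))
```

Comparing the returned timestamps with the moment a backup stalls would show whether the stall and the link drop coincide.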

31 minutes ago, christopher2007 said:

And @bastl creepy, but I guess the geizhals link has betrayed me, hasn't it?

I first saw that your timezone is the same as mine, and then your geizhals link. Just putting 1+1 together 😁

Link to comment

@christopher2007 I forgot the following: you can check with "ethtool eth0" whether the speeds for the card are reported correctly, and you can check the driver version in use with "ethtool -i eth0".

 

Btw my Macrium backup job finished without an issue. A 736GB file is created on one of my array disks and verification of the backup looks like it runs through.

Link to comment

Thanks @bastl for all your time.

 

So currently everything points to the 10G network card.

Test setup:

  • The second Windows 10 computer (test bench) could back up with Macrium. 100% and around 400GB saved. OnBoard Intel network card in Windows to OnBoard network card in UnRAID. -> pass
  • Then a second backup from the second Windows 10 computer (test bench) with Macrium. 1% and then stuck as described in my first post. OnBoard Intel network card in Windows to 10G ASUS network card in UnRAID. -> no pass

Next I will test OnBoard intel from main Windows 10 computer to UnRAID with both intel OnBoard and 10G ASUS.

Unfortunately this all is only practical testing ... no theoretical approach.

 

But I am a little bit confused because others seem to use the ASUS network card without any problems. And a 10G network is important to me (2G minimum, but the main machine does not support link aggregation, so I have to rely on a PCI network expansion card).

 

`ethtool -i <10G network card>` on the UnRAID server:

driver: atlantic
version: 2.0.3.0-kern
firmware-version: 3.1.58
expansion-rom-version: 
bus-info: 0000:01:00.0
supports-statistics: yes
supports-test: no
supports-eeprom-access: no
supports-register-dump: yes
supports-priv-flags: no

 

`ethtool <10G network card>` on the UnRAID server:

Settings for eth0:
        Supported ports: [ TP ]
        Supported link modes:   100baseT/Full 
                                1000baseT/Full 
                                10000baseT/Full 
                                2500baseT/Full 
                                5000baseT/Full 
        Supported pause frame use: Symmetric
        Supports auto-negotiation: Yes
        Supported FEC modes: Not reported
        Advertised link modes:  100baseT/Full 
                                1000baseT/Full 
                                10000baseT/Full 
                                2500baseT/Full 
                                5000baseT/Full 
        Advertised pause frame use: Symmetric Receive-only
        Advertised auto-negotiation: Yes
        Advertised FEC modes: Not reported
        Speed: 1000Mb/s
        Duplex: Full
        Port: Twisted Pair
        PHYAD: 0
        Transceiver: internal
        Auto-negotiation: on
        MDI-X: Unknown
        Link detected: yes

 

I changed the switch in order to hopefully find the problem in the network like suggested. So the speed is only 1G.
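For repeated checks, the negotiated speed can be compared with the fastest link mode the card lists. A minimal Python sketch, assuming the `ethtool` text keeps the layout shown above (the function name is hypothetical, and it would raise an error if the `Speed:` line is missing or not numeric):

```python
import re


def negotiated_vs_best(ethtool_output: str) -> tuple[int, int]:
    """Return (negotiated_mbps, fastest_listed_mbps) from `ethtool <iface>` text."""
    speed = int(re.search(r"Speed:\s*(\d+)Mb/s", ethtool_output).group(1))
    listed = [int(value) for value in re.findall(r"(\d+)baseT/Full", ethtool_output)]
    return speed, max(listed)


# Shortened excerpt of the output posted above.
sample = """\
        Supported link modes:   100baseT/Full
                                1000baseT/Full
                                10000baseT/Full
        Speed: 1000Mb/s
        Duplex: Full
"""
speed, best = negotiated_vs_best(sample)
if speed < best:
    print(f"Link negotiated {speed}Mb/s although {best}Mb/s is listed")
```

Here the lower speed is expected because of the 1G switch, but the same check would flag an unexpected downshift on a 10G switch.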

 

@bastl nice one, I had overlooked the timezones ... too frustrated over the last three days because of breaking SMB network drives and no backups. Nice one :)

Link to comment

@christopher2007 Also try setting the MTU size (jumbo frames) to 9000 on all devices (Unraid, client, switch) and see if that helps. On Unraid you can find it on the settings page for the specific interface, and on Windows you should be able to find the setting in the Device Manager entry for the card. There is either a "Jumbo Frames" entry you have to enable or an MTUSize value you can define.

Link to comment

Thanks so much @bastl.

Currently, I have so many things to test on my stack that it will cost some time until I have new results and information.

But one thing seems to be quite sure: It must be the 10G card. The question now is, whether it can be fixed or has to be swapped out.

Jumbo frames are the next thing I will try and after that mounting additional fans only to the card for better cooling.

Next post will be, hopefully, results and clarity.

Link to comment

Ok, I have new test results.

 

Like @bastl suggested, I tried to max out the supported MTU.

For that I set an MTU of 9216 on my Netgear switch (the maximum possible) and started pinging the switch with

  • the windows machine: ping <SwitchIP> -f -l <MTU>
  • the UnRAID server: ping -M do -s <MTU> <SwitchIP>

The size given to ping is the payload without headers, so the actual MTU is 28 bytes larger.

The results:

  • OnBoard 1G network card in UnRAID: maximum possible MTU in ping command was 1472, so 1472+28=1500
  • ASUS 10G network card in UnRAID: maximum possible MTU in ping command was 8972, so 8972+28=9000
  • OnBoard 1G network card in Windows: maximum possible MTU in ping command was 1472, so 1472+28=1500
  • ASUS 10G network card in Windows: maximum possible MTU in ping command was 8972, so 8972+28=9000

So I set up the MTU accordingly.
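The 28 bytes in the results above are the IPv4 header (20 bytes, without options) plus the ICMP header (8 bytes). A tiny Python sketch of the conversion in both directions, with hypothetical function names:

```python
IPV4_HEADER_BYTES = 20  # IPv4 header without options
ICMP_HEADER_BYTES = 8   # ICMP echo request header


def ping_payload_for_mtu(mtu: int) -> int:
    """Largest ping payload that fits into one unfragmented packet of this MTU."""
    return mtu - IPV4_HEADER_BYTES - ICMP_HEADER_BYTES


def mtu_for_ping_payload(payload: int) -> int:
    """MTU implied by the largest ping payload that got through unfragmented."""
    return payload + IPV4_HEADER_BYTES + ICMP_HEADER_BYTES


for payload in (1472, 8972):
    print(payload, "->", mtu_for_ping_payload(payload))
```

So to probe for an MTU of 9000, the ping size to try is `ping_payload_for_mtu(9000)`.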

 

Also, like @bastl suggested, I tried to rule out a heat problem. Because yes, the card gets really, really hot, but I am not able to get temperature data out of this PCI card (only the motherboard, CPU and GPU seem to have sensors).

So I 3D printed a holder for a 40mm fan and installed one on the UnRAID server and one on the Windows client.

The cards now run much cooler.

 

But now it gets really weird:

UnRAID loses the network connection during a Windows backup. That is a behavior I already described earlier in this thread, but I was not able to describe it more precisely because of so many overlapping problems.

Easy scenario: I want to create a Backup from either the main Windows machine or a second Windows machine over the network. The backup file starts with 0KB and will grow until the maximum backup size is reached (around 300GB on the second machine and 700GB on the main machine).

It fails on both after a random period of time, and the UnRAID server is then no longer reachable on the network. I have to pull the RJ45 Cat6 cable out of the network card, wait a second and put it back in. Then the server can be reached on the network again.

 

 

I attached a diagnostics download taken after such a network loss and re-plugging the cable.

The cable was re-plugged at 18:25. For five to ten minutes before that, the server was unreachable on the whole network (I also tested it with my phone).

 

This problem is independent of the network card (it occurs in all combinations of UnRAID with ASUS, UnRAID with OnBoard, main Windows with ASUS, main Windows with OnBoard, second Windows with OnBoard).

 

The same also happens when just copying a very, very large single file to the server. I tried it with a 250GB file (a single file generated only for that test) and the same weird thing happened: the server could no longer be reached on the network.
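For anyone who wants to reproduce this without Macrium, a growing file can be simulated with a short Python script. It is only a sketch: the function name, the chunk size and the idea of timing each chunk are my own assumptions, and the demo below writes to a temporary local folder rather than to a real share.

```python
import os
import tempfile
import time


def write_growing_file(path, total_bytes, chunk_bytes=1024 * 1024):
    """Append zero-filled chunks to path and return the seconds each chunk took."""
    durations = []
    chunk = bytes(chunk_bytes)
    written = 0
    with open(path, "wb") as handle:
        while written < total_bytes:
            block = chunk[: total_bytes - written]
            start = time.monotonic()
            handle.write(block)
            handle.flush()
            os.fsync(handle.fileno())  # push the data out instead of caching it
            durations.append(time.monotonic() - start)
            written += len(block)
    return durations


if __name__ == "__main__":
    # Demo against a local temporary folder; point the path at a mapped share to test it.
    with tempfile.TemporaryDirectory() as folder:
        timings = write_growing_file(os.path.join(folder, "grow.tmp"), 5 * 1024 * 1024)
        print(len(timings), "chunks written, slowest took", max(timings), "seconds")
```

A chunk that suddenly takes far longer than the others, or a write that never returns, would mark the point where the connection to the server stalls.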

 

At this point I am speechless. I have no more ideas other than to blame it on UnRAID. And that is sad, because I really fell in love with that OS. It is so easy and clean to use and yet so powerful.

Does anyone have another idea? What am I doing wrong?

 

(Other tests I ran: replacing the switch, creating a VLAN, replacing the Cat6 copper cables and even turning the server upside down ... mainly to find that one screw that rolled under the mainboard, but hey, at least it could have been a valid test.)

ts-alt-diagnostics-20200720-1844.zip

Link to comment
