
Compression of folders with many small files


tillkrueger


After almost a decade of using unRAID to keep my most important production files in one place, I have found that copying/moving parts or all of the content of one drive to another can take anywhere from a few hours to a few days, depending on its content. The slowest are my drives that contain whole computer backups (in .img or .sparsebundle formats) or those of my email accounts (with 100K+ messages).

While I know that compressing a 1.5TB .sparsebundle might also take a significant amount of time, I do suspect that over the long run, copying/moving a compressed archive from one location to another will be significantly faster, or am I wrong in that assumption (or does it depend on the *type* of compressed archive)?

If my assumption is correct, what would be the fastest compression algorithm to start turning folders with hundreds of thousands of files into single archives?
Or is there another method by which the same could be accomplished, or maybe even an App or a Docker that can deal with this more efficiently/quickly?

Link to comment

Just so I understand this better, how exactly are you moving/copying these files? I'm thinking you ssh into the box, run screen, and then cp/rsync/mv the files?

In regards to compressing you can definitely use: tar -zcf mynewarchivename.tar.gz <directory>

This will tar and gzip all the files in <directory> into a file called mynewarchivename.tar.gz in your current working directory (pwd).

If there are a lot of files, you'll probably want to use screen for that as well.
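For example, something along these lines (the session name is just a placeholder):

screen -S archivejob                          # start a named screen session
tar -zcf mynewarchivename.tar.gz <directory>  # run the archive job inside it
# detach with Ctrl-a d, log out if you like, then reattach later to check on it:
screen -r archivejob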

I can certainly explain/assist more but first we need to determine how you're attempting this.

D

 

Link to comment

I have been copying/moving files in three ways, so far:

> initially I ssh'd into the box, then used cp/rsync/mv, and later figured out the screen command (but I always feel uncomfortable using the terminal alone)
> ssh'ing into the box and using mc (Midnight Commander), later running screen first
> for months now I have been using Krusader, which I have settled on...I really like Krusader

In regards to compressing certain folders, a few days ago I googled what was needed and used zip -r march.zip mydir/* (I thought zip would fit my Mac/Windows workflow better than tar)...I found that zipping a folder with 3 mail inboxes, containing at least 200k messages, took less than 10 minutes and created an 11GB file, which will now copy/move *vastly* faster than doing this with the uncompressed folders.
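One small caveat with that command: the shell expands mydir/* without top-level dot-files, so hidden files at the top of the folder would be skipped. Zipping the directory itself avoids that:

zip -r march.zip mydir    # recurses into mydir, including top-level hidden files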

Zipping a folder containing 11GB in small files is a very different thing than attempting to zip a 1.5TB .img/.sparseimage, though, which is why I asked about a proper method to deal with folders that contain hundreds of thousands of files...can they be archived first, and if so, how, without taking many, many hours to do so (even though many hours *once* would probably be preferable to a day or two *every* time I have to copy/move my computer backups or other such files/folders).

Link to comment

As a perfect example: I am now moving the data of a disabled 4TB drive that shows 9498 errors and had about 800GB on it...the first 740GB I moved overnight...the last 60GB have taken me the past 2 days, with 35GB remaining and moving at a snail's pace...looks like it'll be another day or two, just for those.

Btw, I do *not* have a cache drive atm...maybe that would have helped significantly?

Link to comment

First off - when using tar or other programs that can read/write from stdin/stdout, it's possible to stream-copy and stream-compress the data.

So one machine can make a tar archive and send it to stdout, where it gets compressed and sent over some transfer tunnel (might be ssh) to another machine that decompresses and untars the data on the fly.
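A rough sketch of such a pipeline (hostname and paths are just placeholders):

# archive to stdout, compress, send over ssh, and decompress/untar on the target on the fly
tar -cf - /mnt/disk1/Backups | gzip | ssh user@otherhost "gunzip | tar -xf - -C /mnt/disk2"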

 

Or if you don't want the target machine to decompress/untar, then you can have that machine store the datastream to a tar.gz file.
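That variant would look something like this (again with placeholder host and paths):

# same pipeline, but the target just stores the compressed stream as a tar.gz
tar -cf - /mnt/disk1/Backups | gzip | ssh user@otherhost "cat > /mnt/disk2/Backups.tar.gz"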

 

One advantage of this is that tar can make sure ownership and file modes are also moved and/or archived.
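For example, if the archive from above is later extracted as root, GNU tar can restore permissions and keep the original numeric owners (archive name and target path are placeholders):

# -p restores permissions; --same-owner/--numeric-owner keep the original owners and UIDs/GIDs
tar -xpf Backups.tar.gz --same-owner --numeric-owner -C /mnt/disk2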

 

I'm not sure your transfer times for the 800GB of data are related to small files - it might also be the location of all your drive errors. Each sector that can't be read will be very, very much slower, because a normal (not RAID-optimized) drive will spend some time trying to recover the sector. And depending on the transfer program, the program itself might then also retry each broken sector multiple times. So one broken 512-byte or 4kB sector can sometimes take way more time than the transfer needs for several GB of good data.

This is especially true since a HDD is spinning media: after every read failure, the drive needs to wait one revolution of the platter before the next attempt. And some drives will sometimes throw in some advanced head-moving acrobatics, on the assumption that the head is misaligned or that some dirt might be cleared by sweeping the head over its full actuation range. The disk can then miss even more revolutions before it's ready to try the next read.

Link to comment
