Incomplete file transfers, and how to detect and fix them



I'm in the process of copying files to the array for the first time. Comparing the originals against what's on the array, I noticed a worrying discrepancy: the file counts were the same on both sides, but in some directories the total sizes weren't.

 

I checked one such conflict, and indeed all the files were there. But one file was 732MB on the array compared to 1.0GB in the original. I assume this is because the copy is incomplete. I have occasionally stopped a copy command "mid-stream" for various reasons, but thought nothing of it. I assumed it would stop gracefully and delete any incomplete files. This apparently is not the case.

 

Right now I can't trust any of the data on the array. How do I ensure that all the files I have copied are complete? For example, is there an easy way to cross-check checksums? I'm running unRAID 5.0.4 and Linux Mint as my originating desktop.

Link to comment

I found an older thread here with a link to an unRAID-ready file, but it's marked "beta" and is over two years old, so not sure I can trust it.

 

But it gave me an idea to play around with md5deep, a Linux utility that does hash matching out of the box, i.e. it can be used to determine whether two sets of files are identical in content. I did a test run with two 200MB files: hashing was very quick on my desktop HDD, but matching the hashes against the files on the array took 10+ seconds. It might therefore not be a feasible solution, since I have over a terabyte of critical data to check. I might be able to do directory-to-directory or file-to-file size matching first to limit the number of files to be hashed.

 

Need to do more research.

 

edit: I tested on a directory with seven files totaling 2.2GB. Hashing on the desktop HDD took three minutes, and matching those hashes against the files on the array took three minutes and fifteen seconds. So that's roughly 24 hours per terabyte at both ends. Not as bad as I expected, and perfectly doable.
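For the curious, the arithmetic behind that estimate (195 seconds for 2.2GB, scaled up to a terabyte):

```shell
# 3 min 15 s = 195 s for 2.2 GB; scale to 1000 GB, convert to hours.
awk 'BEGIN { printf "%.1f hours per terabyte\n", 195 / 2.2 * 1000 / 3600 }'
# prints "24.6 hours per terabyte"
```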

 

I wish there were a built-in way to do a checksum match, like secure copy. Waiting for someone to show me it's built in and I've just missed it.

 

Haven't used it but this looks interesting.

 

It does, just can't get it to install since I'm a newb with Linux.

Link to comment

Haven't used it but this looks interesting.

 

It does, just can't get it to install since I'm a newb with Linux.

?? There is no install, just download and execute. It will need to be marked as executable, then you can run it.

 

Yeah, I tried the chmod thingie and enabled it in the GUI under properties, but nothing happens when I double-click it. Yes, I'm a newb :)
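Console programs like md5deep don't react to double-clicks; they have to be run from a terminal. A sketch of the full dance, using a dummy script as a stand-in for the downloaded md5deep binary:

```shell
cd /tmp
# Dummy script standing in for the downloaded md5deep file:
printf '#!/bin/sh\necho "md5deep stand-in"\n' > md5deep-demo
chmod +x md5deep-demo   # the "chmod thingie": set the executable bit
./md5deep-demo          # the ./ prefix runs a program from the current dir
```

For the real thing, cd into the folder you downloaded it to and run ./md5deep from there.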

Link to comment

File size discrepancies between different file systems are normal and due to differences in allocation methods. ReiserFS is more efficient than NTFS, although a single file should not be that much smaller. Use TeraCopy or Corz to verify.

 

Yeah, I'm pretty sure the size mismatch is because Linux Mint and/or unRAID doesn't delete partially copied files, which is truly incredible and inexcusable behavior from a modern OS, if that's the case. Good thing I was paranoid and did some checking. I'll do some further testing over the weekend to pinpoint the culprit.

 

Both utilities you suggest seem to be Windows-only. Ultracopier is multi-platform, and claims to do error and "collision" (filename?) detection, but the documentation is rudimentary and written in poor English, so it's hard to determine what those actually mean and how they are implemented. I'll do some further research.

 

In the meantime, I've been running a test pass of MD5 hashes with md5deep, and it has found several files that don't match. 24 hours and counting on 800 gigs. Quite cumbersome, since I have to run the hashes and checks separately, and apart from copying the files. But it should be damn secure and trustworthy.

Link to comment

I'm pretty sure it's the responsibility of the originator of the copy to handle cleanup correctly. Perhaps you are using a file manager that can cleanly continue a partial transfer if you resume it, so it assumes you may want to continue the copy later? Try using a different file manager, or use the command line and see if the behaviour changes.

Link to comment

Testing it out, stopping a file copy mid-stream from a Linux Mint desktop to the unRAID box will leave an incomplete file on the array, with no warning or error message. Incredible.

 

Any idea if this is a bug or a "feature" of either Mint or unRAID?

Exactly the same will happen if the target is a Windows system!

 

As has been mentioned it is the responsibility of the client system doing the copy to tidy up if the copy is interrupted.  The target system has no reliable way to distinguish between a copy finishing and it being interrupted.

Link to comment

After doing some testing and research, I've decided on a two-tier approach as a compromise between time and accuracy:

 

TIER 1

Compare folders using UltraCompare. It has a nice GUI and shows which files are present, and which have different sizes. This is used on all data, and is a good way to pinpoint files which have not been completely copied over for re-copying. It does have some kind of byte-to-byte comparison, but I haven't tried it.

 

UltraCompare is not the only option, there is Beyond Compare, and you could run some commands to compare outputs of directory structures, but that's beyond this post.
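For reference, one such command sketch: dump each tree's file listing and diff them (stand-in /tmp trees below; substitute the real source folder and array mount):

```shell
mkdir -p /tmp/demo-a/sub /tmp/demo-b/sub
touch /tmp/demo-a/sub/p.jpg /tmp/demo-b/sub/p.jpg
# List relative file paths on each side, sorted, then compare the lists.
(cd /tmp/demo-a && find . -type f | sort) > /tmp/a.list
(cd /tmp/demo-b && find . -type f | sort) > /tmp/b.list
diff /tmp/a.list /tmp/b.list && echo "listings match"
```

Note this only catches missing or extra files, not truncated ones, so it is no substitute for the size or hash checks.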

 

TIER 2

Run md5deep on all critical data to ensure the copies are identical byte-for-byte. md5deep calculates a hash for each file: a random-looking string of characters that is, for all practical purposes, unique to the file's content. There is a theoretical chance that two files may generate the same hash, but it's such an infinitesimally tiny chance it is of no concern.
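To see what a hash looks like in practice, md5sum (which ships with every distro) will do:

```shell
# md5sum prints a 32-character hex digest; changing even one byte of
# the input produces a completely different digest.
printf 'hello'  | md5sum
printf 'hello!' | md5sum
```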

 

 

Here's my newbie-friendly guide for running md5deep to compare files between the source and the unRAID array on Linux Mint. It probably works in Ubuntu and other distros too.

 

Step 1 - Mount the array

You need to mount your unRAID array to be able to compare files on it. Replace items in brackets with those appropriate to your system.

 

You will need to create the folder under /home/[your desktop username] before mounting. This is the location in your home folder which will point to a location on your array. You can name [folder to compare] anything you want, but I recommend making it clear so you don't mix things up. For example, I name mine "m-photographs" to show it is on the array, and to distinguish it from "photographs" on my local drive. The path shows that as well, but I'm paranoid.

mkdir /home/[your desktop username]/[folder to compare]

 

Mount the array:

sudo mount -t cifs //[array name].local/[folder to compare]/ /home/[your desktop username]/[folder to compare] -o rw,username=[your desktop username]

 

Here's an example for my array, for reference. My array is named TheMonolith.

sudo mount -t cifs //themonolith.local/photographs/ /home/ulvan/m-photographs -o rw,username=ulvan
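Before hashing anything, it's worth a sanity check that the share actually mounted. Two commands for that (here /tmp stands in for the real mount point, e.g. /home/ulvan/m-photographs):

```shell
# Lists every mounted cifs share; the array share should appear here.
mount -t cifs
# Shows which filesystem backs a path; for a mounted share it should be
# the //[array name].local/... entry rather than a local disk.
df -h /tmp
```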

 

Step 2 - Run the hashes

Run hashes on the original files with recursive directories, i.e. all files in folders underneath will also be hashed. I recommend a filename for the hashes which describes the location, for example "photographs". You can create a hash file for an entire drive, one for each main folder, or one for each file. I do it by main folder.

 

Create a folder for all the hashes to keep things organized:

mkdir /home/[your desktop username]/Hashes

 

Run the hashes; this will take several hours if you run it on a full drive. The part after > is the filename of the output, which you will use in Step 3 for comparison.

md5deep -rl /[full path to original files]/ >/home/[your desktop username]/Hashes/[filename].md5

 

Step 3 - Compare hashes

Compare hashes of original files from previous step to those of files on the array.

 

This shows all files that DO NOT MATCH the originals; it does NOT show files which match. This is used to check which files on the array differ from the source. No output means no mismatches. Multiple source hash files can be used by giving the -x flag multiple times. [folder to compare] should point to the folder on the array, the same one you mounted in Step 1. This will take a lot longer than Step 2, as it reads from the array and compares the hashes generated on the fly to those in the .md5 file.

md5deep -rwx /home/[your desktop username]/Hashes/[filename].md5  /home/[your desktop username]/[folder to compare]

 

This shows all files that MATCH the originals - does NOT show files which do not match.

md5deep -rwm /home/[your desktop username]/Hashes/[filename].md5 /home/[your desktop username]/[folder to compare]
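If md5deep ever misbehaves, the same check can be done with plain md5sum, which ships with every distro: write a manifest on the source side, then verify it against the copy with md5sum -c. A sketch with /tmp stand-in trees in place of the real source and array mount:

```shell
mkdir -p /tmp/demo-orig /tmp/demo-copy
printf 'good data' > /tmp/demo-orig/a.txt
printf 'good data' > /tmp/demo-copy/a.txt
# Manifest of hashes, with paths relative to the source root:
(cd /tmp/demo-orig && md5sum a.txt) > /tmp/manifest.md5
# Verify the copy against the manifest; prints "a.txt: OK" for each
# good file and "FAILED" for any mismatch.
(cd /tmp/demo-copy && md5sum -c /tmp/manifest.md5)
```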

Link to comment
