2 identical disks mirrored by rsync have different free space - HOW can this be??



I have 2 different v5 servers where the 2TB ReiserFS disks are mirrored by rsync on a nightly basis (about 225,000 files).

 

Problem: the source disk is fuller than the destination disk by about 2 GB.  This shouldn't happen.

 

unMenu reports 103.10 GB free on the source 2TB drive but reports 104.95 GB free on the destination disk.

 

Because of the way rsync works, I could see the destination disk showing less free space.  However, I can't see a case where the destination disk should show more free space than the source disk.  How can this be?  Could two ReiserFS disks show different free space when they have the same contents?

 

(The destination could be fuller than the source if a file were moved from one location to another on the source: the destination disk would then show that file in two places, since rsync is not replicating deletes.)

 

Rsync is run on the source server as follows, sending all new files to the destination server tower3:

 

- rsync  -av --stats --progress /mnt/disk1/  /mnt/t3disk1

 

and t3disk1 is being mounted from tower3 via

 

- mount -t nfs tower3:/mnt/disk1  /mnt/t3disk1

Link to comment

Install the demo version of FolderMatch on a PC  [ http://www.foldermatch.com/ ]

 

... and then do a comparison of the two disks via your network.

 

Be sure to set the Comparison Method (on the Options menu) to Size and Date/Time, or even just File Name Only ... otherwise it will read the entire contents of every file and take a LONG time for that much data.

 

That will show you what's different, and will likely give a lot better idea of just what's happening.

 

 

Link to comment

The source disk has likely seen far more directory and file manipulations, which required more filesystem overhead than on the destination disk. Remember, directory entries and file entries can still take up room even after deletion, BUT the destination disk will not have the overhead of these intermediate directory or file entries, since it is used more as a snapshot than for live manipulation.

 

Link to comment

The source disk has likely seen far more directory and file manipulations, which required more filesystem overhead than on the destination disk. Remember, directory entries and file entries can still take up room even after deletion, BUT the destination disk will not have the overhead of these intermediate directory or file entries, since it is used more as a snapshot than for live manipulation.

 

So... fragmentation?

Link to comment

As previously posted, directories grow and shrink as files are added and removed. A directory that has grown to hold a huge number of files may leave some slack space behind.

 

However, I have not seen that to always be the case with reiserfs. It depends on how large the directory has grown.

 

Another issue could be that a file size is reported differently due to some kind of interrupted superblock operation.

Without a full fsck, there could be files with extra space allocated.

 

If you want to be sure all files are copied over intact, you can do the rsync again with the -c option.

This compares by checksum instead of mtime and size. It will take a very long time, but it will compare all the files and copy over any differences. Or add the -n option for a dry run, so differences are only reported rather than copied.

 

I believe you have to use  -rcc instead of -av.
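As a rough sketch, combining -c with -n and reusing the paths from the original post (nothing is copied; differences are only reported):

# -c = compare by checksum, -n = dry run (report only)
rsync -avcn --stats /mnt/disk1/ /mnt/t3disk1/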

Another point worth considering is to create an md5sums file for the source disk, then copy it over to the destination disk and run it there in check mode.
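A minimal sketch of that, with placeholder file names and the paths used earlier in this thread:

# on the source server: hash every file on the disk
cd /mnt/disk1 && find . -type f -exec md5sum {} + > /boot/disk1.md5

# then verify, either locally on tower3 or against the NFS mount; show only misses
cd /mnt/t3disk1 && md5sum -c /boot/disk1.md5 | grep -v ': OK$'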

 

I helped another user with some command examples for migrating to XFS.

The post is here: http://lime-technology.com/forum/index.php?topic=38507.msg360594#msg360594

Maybe that will give you some hints for double checking.

 

There are also the bitrot and bunker shell scripts for attaching hash values to extended attributes.

If you do this on the source, you'll need to rsync the files again with -av and the -X option to rsync over the attributes.
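Something along these lines (same paths as before; -X carries the extended attributes, which the destination filesystem and mount must support):

# re-run the mirror, this time copying extended attributes as well
rsync -avX /mnt/disk1/ /mnt/t3disk1/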

Link to comment

OK so either directory diff or bitrot. I'll post back once I know.

 

In the future I plan on having bunker or bitrot verification too. But a disk with constantly changing files will be a challenge with bitrot, unless we can figure out a way to update the extended-attribute hash on file save. Constantly changing files will all fail the bitrot verification, and you won't know why: was the file changed and the hash just needs to be recalculated, or is the hash right and the file corrupted?

Link to comment

OK so either directory diff or bitrot. I'll post back once I know.

 

In the future I plan on having bunker or bitrot verification too. But a disk with constantly changing files will be a challenge with bitrot, unless we can figure out a way to update the extended-attribute hash on file save. Constantly changing files will all fail the bitrot verification, and you won't know why: was the file changed and the hash just needs to be recalculated, or is the hash right and the file corrupted?

 

I had thought there was a method in either tool to re-calculate the hash if the file changed.

It shouldn't be that hard to add if it's not there. You may want to ask questions in the related threads.

 

If the mtime is greater than the hash-calculation time, then you can probably be sure that the file has changed.

At least that was the logic I used in my own .c version of the tool.
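A minimal shell sketch of that check (user.hashtime is a placeholder attribute name, not the attribute bunker or bitrot actually use):

# flag a file whose mtime is newer than its stored hash timestamp
hashtime=$(getfattr --only-values -n user.hashtime "$f" 2>/dev/null)
mtime=$(stat -c %Y "$f")
if [ -z "$hashtime" ] || [ "$mtime" -gt "$hashtime" ]; then
    echo "needs re-hash: $f"
fi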

Link to comment

Would there be any way to automatically update the hash after a file save?  That would be the cleanest way to eliminate false positives from the failure list.

 

But your idea should work too. Especially if the script could figure it out and just fix the hash.

Link to comment

Would there be any way to automatically update the hash after a file save?  That would be the cleanest way to eliminate false positives from the failure list.

 

But your idea should work too. Especially if the script could figure it out and just fix the hash.

 

It's possible using inotify and watch lists.

With an inotify watch list you can run the tools per file as files change.
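A rough sketch of that approach, assuming the inotify-tools package (inotifywait) is installed and using a placeholder user.md5 attribute:

# re-hash each file as it is written and store the result in an xattr
inotifywait -m -r -e close_write --format '%w%f' /mnt/disk1 |
while read -r f; do
    setfattr -n user.md5 -v "$(md5sum "$f" | cut -d' ' -f1)" "$f"
done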

 

I worked on a different tool that did it another way. It was sort of a dir_cache and hashdb using gdbm files.

 

I used ftw64() to scan the tree (like dir_cache).

Then I stored the filename as the key, with the stat block and space for a 132-byte hash.

 

When a file changed, it would update the stat block, truncate the 132-byte hash, and print the name of the file that had changed.

If that output were stored in a file, it could be used by another tool to recalculate the hash.

It could also be done inside the tool itself, updating the gdbm.

The issue becomes concurrency.

 

If the tool did the hashing, then when a large number of files changed, the tool would be busy hashing each file.

Thus, it would lose the benefit of directory caching, as dentries get flushed by reading large files.

 

If the tool forked a co-process, then the same would occur at a diminished level; however, now you have gdbm concurrency and file-locking issues.  Only one process can be a writer to a gdbm file.  I solved that by doing my own locking, but I never went further with it.

 

The benefit of using .gdbm files in this method is speed. Once the whole filesystem is scanned, iterating through the file is very fast.

 

However, it takes a fair amount of space to store all that information; I don't recall exactly how much, but I think it was about 90 MB for 300,000 files.

If that file is on disk, that slows things down, and there are also spin-up issues. If it's on a RAM disk, the file takes up valuable RAM.

If it's on an SSD, then the individual blocks are rewritten every time a file gets updated, so now we're talking about wear.

 

I've also explored this with SQLite. With SQLite, concurrency between multiple processes is handled for you; however, the size is greatly increased.  Lookup speed is not as fast as gdbm files, but you have much better access from the shell.

 

I'm really leaning towards that for cataloging, and for its similarity to the locate tool.

The issue here is that it takes up even more room in RAM. However, having locate-style lookups is a big plus for me.

 

When I really thought about it, using a daily cron job to scan the filesystem for files that are missing hashes, or whose mtime is newer than the hash time, was the most efficient approach.

 

So it would be a scan and update; another job could then re-verify any hashes older than some hash-verification age.
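As a rough illustration (the script names and schedule are placeholders, not actual bunker or bitrot commands):

# crontab entries: nightly hash update, weekly re-verification of older hashes
0 3 * * * /boot/custom/bin/hash_update.sh /mnt/disk1
0 4 * * 0 /boot/custom/bin/hash_verify.sh /mnt/disk1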

 

What's missing in each of the shell-based tools is an export to a standard hash file and an import from one.

Otherwise they both use a good mechanism for detection and concurrency.

Link to comment

@ysss. UnRaid works fast enough for what I'm doing. I don't even use a cache drive.  My documents drive does see some regular updates.

 

The drive in question holds old JPG image files from the days when digital photography was just starting and file sizes were small. So it's a media drive.

 

Link to comment

 

 

 

When I really thought about it, using a daily cron job to scan the filesystem for files that are missing hashes, or whose mtime is newer than the hash time, was the most efficient approach.

 

Yes, and creating the hash in real time could bring the server to its knees during massive copy operations.  Updating the hash can happen later.

 

Are bitrot and bunker efficient at scanning the filesystem and doing the mtime > hash-time compare?  Does a drive with 300,000 files take minutes, or hours?

 

Link to comment

 

When I really thought about it, using a daily cron job to scan the filesystem for files that are missing hashes, or whose mtime is newer than the hash time, was the most efficient approach.

 

Yes, and creating the hash in real time could bring the server to its knees during massive copy operations.  Updating the hash can happen later.

 

Are bitrot and bunker efficient at scanning the filesystem and doing the mtime > hash-time compare?  Does a drive with 300,000 files take minutes, or hours?

 

To walk through 300,000 files takes me approximately 30-40 minutes doing only a find from a freshly booted system.

If nothing changed, I would not expect the scan and XATTR check to take longer than 1 hour on a freshly booted system.

 

Once this scan was done, a subsequent one shouldn't take longer than 10-20 minutes. 'Shouldn't.'

 

It's all relative to the number of files, the speed of the disk subsystem, and the processor.

In addition, any cache misses or flushing of the dentry/inode cache come into play.

 

Now if those files did not have hashes, the initial hash creation can take over 24 hours.

 

This is where a fast processor matters. unRAID doesn't need it unless you are virtualizing or doing high volume hashing and want speed.

 

If the processor is fast, the disk subsystem can read fast enough without interruption, and the disk is very full,

I would guesstimate that a full scan would take 2-3x as long as a full SMART long self-test.

 

To estimate that, run smartctl -a and look at the Extended self-test routine recommended polling time, e.g.:

recommended polling time:        ( 543) minutes.
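For example (the device name here is just a placeholder):

smartctl -a /dev/sdb | grep -A1 "Extended self-test"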

 

In my case, this disk took a little over 24 hours for the initial scan and hash using md5deep.

 

 

It's all relative to the speed of the disk, SATA subsystem, how fast the processor is and how full the disk is.

Link to comment

Did you try FolderMatch to see what the differences are?

 

One other thought re: your comment:

 

The drive in question holds old JPG image files from the days when digital photography was just starting and file sizes were small. So it's a media drive.

 

You may have a few hidden Thumbs.db files on the source drive that aren't being copied.    With 225,000 files I wouldn't expect these to add up to as much as the difference you're seeing, but if there are a lot of them, that could explain it.  But I'd run FolderMatch so you know for sure what the actual differences are.

 

 

 

 

Link to comment

Yes, it should work okay => did you change the comparison method (on the Options menu) to either "File Name Only" or "Size and Date/Time" ??    If not, it's actually reading all of the data and comparing the files -- this will take a VERY long time for that many files with that much data.

 

 

 

Link to comment

Weebotech, can I do a compare as follows?  I'm not sure FolderMatch will succeed.  I could do something via the terminal, as follows, to do an exhaustive compare of the file contents.  I want to output a list of files that don't match.

 

mount -t nfs tower3:/mnt/disk5/ /mnt/t3disk5

rsync -avrc --dry-run --stats --progress /mnt/disk1 /mnt/t3disk1/  >> /boot/logs/cronlogs/t3disk1chksum.log

 

Or is there some way to use diff between 2 servers?

Link to comment

Yes, it should work okay => did you change the comparison method (on the Options menu) to either "File Name Only" or "Size and Date/Time" ??    If not, it's actually reading all of the data and comparing the files -- this will take a VERY long time for that many files with that much data.

 

I tried several different ways, but ultimately asked it to compare based on file contents.  I know it will take a very long time.  The Windows laptop doing the compare has a gigabit wired connection, an i7 chip, and 16 GB of RAM.  It isn't working very hard at all; W7 Task Manager shows 1-5% CPU load.  I expected it to be working harder...

 

(time to break out the whip.... faster, faster, faster)

Link to comment

Weebotech, can I do a compare as follows?  I'm not sure FolderMatch will succeed.  I could do something via the terminal, as follows, to do an exhaustive compare of the file contents.  I want to output a list of files that don't match.

 

mount -t nfs tower3:/mnt/disk5/ /mnt/t3disk5

rsync -avrc --dry-run --stats --progress /mnt/disk1 /mnt/t3disk1/  >> /boot/logs/cronlogs/t3disk1chksum.log

 

Or is there some way to use diff between 2 servers?

 

You can do it that way.

While I do it that way, I also keep an md5sums file for the source.

Doing the -rc is sort of like doing an md5sum on each file.

 

If you create an md5sums file on the source you will gain in a few ways.

 

1.  You will now have hash values for all your files.

2.  You can bring this file over to the destination and check each file for integrity.  You will see misses if there are any.

3.  Now you have hash values and history.

 

I bet there's a way to import these hash values into the extended attributes for bunker and/or bitrot.

 

 

Going forward you can use one of those tools.

 

The other choice is to use bunker or bitrot to put the extended attributes on the source.

Then rsync the tree again and use the same tool on the destination.

The downside of this is that if a file is missed, you will not know it.

 

With an md5sums file, when you check the integrity, you'll know if a file was missed.

 

A quick method is to do a find down each of the trees.

On one system:

find /mnt/disk? -print | sort > filelist

Do the same on the other system,

then do a wc -l on each system's filelist to check the counts.

If there are misses, you can do a diff -u on the two filelists to see where they are.

Link to comment

One thing I definitely do NOT like about FolderMatch is its almost non-existent status info.  It can definitely appear "hung" when in fact it's still working away.  If you don't see the little window that shows the "current" files being compared (in quotes, as it doesn't update very often), it's likely hidden behind the main FolderMatch window.    As long as Windows doesn't show a "not responding" note in the title bar, it's probably working just fine -- just much slower than you'd like.

 

I still use it as it's a very reliable way to identify file differences.

 

Link to comment

Yes, it should work okay => did you change the comparison method (on the Options menu) to either "File Name Only" or "Size and Date/Time" ??    If not, it's actually reading all of the data and comparing the files -- this will take a VERY long time for that many files with that much data.

 

I tried several different ways, but ultimately asked it to compare based on file contents.  I know it will take a very long time.  The Windows laptop doing the compare has a gigabit wired connection, an i7 chip, and 16 GB of RAM.  It isn't working very hard at all; W7 Task Manager shows 1-5% CPU load.  I expected it to be working harder...

 

(time to break out the whip.... faster, faster, faster)

 

If you do the md5sums on the source, then do the md5sum -c on the destination, it's almost the same thing.

You gain by having hash values for checking at another time (or for importing them into the extended attributes).

Link to comment

If you do the md5sums on the source, then do the md5sum -c on the destination, it's almost the same thing.

You gain by having hash values for checking at another time (or for importing them into the extended attributes).

 

Agree.  I've done this quite often ... computed the MD5s on one system, then checked them on another [or on one drive, and then checked them after a copy].

 

Link to comment

----  As long as Windows doesn't show a "not responding" note in the title bar, it's probably working just fine -- just much slower than you'd like.

 

I still use it as it's a very reliable way to identify file differences.

 

---- Windows is showing a "Not responding" note in the title bar....

Link to comment
