2 identical disks mirrored by rsync have different free space - HOW can this be??



I have 2 different v5 servers where the 2TB ReiserFS disks are mirrored by rsync on a nightly basis (about 225,000 files).

 

Problem: the source disk is fuller than the destination disk by about 2 GB.  This shouldn't happen.

 

unMenu reports 103.10 GB free on the source 2TB drive but reports 104.95 GB free on the destination disk.

 

Because of the way rsync works, I could see the destination disk showing less free space.  However, I can't see a case where the destination disk should show more free space than the source disk.  How can this be?  Could two ReiserFS disks show different free space when they have the same contents?

 

(The destination could be fuller than the source if a file were moved from one location to another on the source: the destination disk would then show that file in two places, since rsync is not replicating deletes.)

 

Rsync is run on the source server as follows, sending all new files to the destination server tower3:

 

- rsync  -av --stats --progress /mnt/disk1/  /mnt/t3disk1

 

and t3disk1 is being mounted from tower3 via

 

- mount -t nfs tower3:/mnt/disk1  /mnt/t3disk1

Link to comment

Install the demo version of FolderMatch on a PC  [ http://www.foldermatch.com/ ]

 

... and then do a comparison of the two disks via your network.

 

Be sure to set the Comparison Method (on the Options menu) to Size and Date/Time, or even just File Name Only ... otherwise it will read the entire contents of every file and take a LONG time for that much data.

 

That will show you what's different, and will likely give a lot better idea of just what's happening.

 

 

Link to comment

The source disk has likely seen far more directory and file manipulations, which required more filesystem overhead than on the destination disk. Remember, directory entries and file entries can still take up room even after deletion, BUT the destination disk will not have the overhead of these intermediate directory or file entries, since it is used more as a snapshot than for live manipulation.

 

Link to comment

The source disk has likely seen far more directory and file manipulations, which required more filesystem overhead than on the destination disk. Remember, directory entries and file entries can still take up room even after deletion, BUT the destination disk will not have the overhead of these intermediate directory or file entries, since it is used more as a snapshot than for live manipulation.

 

So... fragmentation?

Link to comment

As previously posted, directories grow and shrink as files are added and removed. A directory that has grown to hold a huge number of files may leave some slack space behind.

 

However, I have not seen that to always be the case with reiserfs. It depends on how large the directory has grown.

 

Another issue could be that a file size is reported differently due to some kind of interrupted superblock operation.

Without a full fsck, there could be files with extra space allocated.

 

If you want to be sure all files are copied over intact, you can do the rsync again with the -c option.

This compares by checksum instead of mtime and size. It will take a very long time, but it will compare all the files and copy over any differences. Or add the -n option for a dry run, so differences are only reported rather than copied.

 

I believe you have to use  -rcc instead of -av.
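As a rough sketch, combining -c with -n and reusing the paths from the original post (nothing is copied; differences are only reported):

# -c = compare by checksum, -n = dry run (report only)
rsync -avcn --stats /mnt/disk1/ /mnt/t3disk1/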

Another point worth considering is to create an md5sums file for the source disk, then copy it over to the destination disk and run it there in check mode.
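A minimal sketch of that, with placeholder file names and the paths used earlier in this thread:

# on the source server: hash every file on the disk
cd /mnt/disk1 && find . -type f -exec md5sum {} + > /boot/disk1.md5

# then verify, either locally on tower3 or against the NFS mount; show only misses
cd /mnt/t3disk1 && md5sum -c /boot/disk1.md5 | grep -v ': OK$'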

 

I helped another user with some command examples for migrating to XFS.

The post is here: http://lime-technology.com/forum/index.php?topic=38507.msg360594#msg360594

Maybe that will give you some hints for double checking.

 

There are also the bitrot and bunker shell scripts for attaching hash values to extended attributes.

If you do this on the source, you'll need to rsync the files again with -av and the -X option to rsync over the attributes.
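Something along these lines (same paths as before; -X carries the extended attributes, which the destination filesystem and mount must support):

# re-run the mirror, this time copying extended attributes as well
rsync -avX /mnt/disk1/ /mnt/t3disk1/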

Link to comment

OK so either directory diff or bitrot. I'll post back once I know.

 

In the future I plan on having bunker or bitrot verification too. But a disk with constantly changing files will be a challenge with bitrot, unless we can figure out a way to update the extended-attribute hash on file save. Constantly changing files will all fail the bitrot verification, and you won't know why: was the file changed and the hash just needs to be recalculated, or is the hash right and the file corrupted?

Link to comment

OK so either directory diff or bitrot. I'll post back once I know.

 

In the future I plan on having bunker or bitrot verification too. But a disk with constantly changing files will be a challenge with bitrot, unless we can figure out a way to update the extended-attribute hash on file save. Constantly changing files will all fail the bitrot verification, and you won't know why: was the file changed and the hash just needs to be recalculated, or is the hash right and the file corrupted?

 

I had thought there was a method in either tool to re-calculate the hash if the file changed.

It shouldn't be that hard to add if it's not there. You may want to ask questions in the related threads.

 

If the mtime is greater than the hash-calculation time, then you can probably be sure that the file has changed.

At least that was the logic I used in my own .c version of the tool.
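A minimal shell sketch of that check (user.hashtime is a placeholder attribute name, not the attribute bunker or bitrot actually use):

# flag a file whose mtime is newer than its stored hash timestamp
hashtime=$(getfattr --only-values -n user.hashtime "$f" 2>/dev/null)
mtime=$(stat -c %Y "$f")
if [ -z "$hashtime" ] || [ "$mtime" -gt "$hashtime" ]; then
    echo "needs re-hash: $f"
fi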

Link to comment

Would there be any way to automatically update the hash after a file save?  That would be the cleanest way to eliminate false positives from the failure list.

 

But your idea should work too. Especially if the script could figure it out and just fix the hash.

Link to comment

Would there be any way to automatically update the hash after a file save?  That would be the cleanest way to eliminate false positives from the failure list.

 

But your idea should work too. Especially if the script could figure it out and just fix the hash.

 

It's possible using inotify and watch lists.

With an inotify watch list you can run the tools per file as files change.
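A rough sketch of that approach, assuming the inotify-tools package (inotifywait) is installed and using a placeholder user.md5 attribute:

# re-hash each file as it is written and store the result in an xattr
inotifywait -m -r -e close_write --format '%w%f' /mnt/disk1 |
while read -r f; do
    setfattr -n user.md5 -v "$(md5sum "$f" | cut -d' ' -f1)" "$f"
done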

 

I worked on a different tool that did it another way. It was sort of a dir_cache and hashdb using gdbm files.

 

I used ftw64() to scan the tree (like dir_cache).

Then I stored the filename as the key, with the stat block and space for a 132-byte hash.

 

When a file changed, it would update the stat block, truncate the 132-byte hash, and print the name of the file that had changed.

If that output were stored in a file, it could be used by another tool to recalculate the hash.

It could also be done inside the tool itself, updating the gdbm.

The issue becomes concurrency.

 

If the tool did the hashing, then when a large number of files changed, the tool would be busy hashing each file.

Thus, it would lose the benefit of directory caching, as dentries get flushed by reading large files.

 

If the tool forked a co-process, then the same would occur at a diminished level; however, now you have gdbm concurrency and file-locking issues.  Only one process can be a writer to a gdbm file.  I solved that by doing my own locking, but I never went further with it.

 

The benefit of using .gdbm files in this method is speed. Once the whole filesystem is scanned, iterating through the file is very fast.

 

However, it takes a fair amount of space to store all that information; I don't recall exactly how much, but I think it was about 90 MB for 300,000 files.

If that file is on disk, that slows things down, and there are also spin-up issues. If it's on a RAM disk, the file takes up valuable RAM.

If it's on an SSD, then the individual blocks are rewritten every time a file gets updated, so now we're talking about wear.

 

I've also explored this with SQLite. With SQLite, concurrency between multiple processes is handled for you; however, the size is greatly increased.  Lookup speed is not as fast as gdbm files, but you have much better access from the shell.

 

I'm really leaning towards that for cataloging, and for its similarity to the locate tool.

The issue here is that it takes up even more room in RAM. However, having locate-style lookups is a big plus for me.

 

When I really thought about it, using a daily cron job to scan the filesystem for files that are missing hashes, or whose mtime is newer than the hash time, was the most efficient approach.

 

So it would be a scan and update; another job could then re-verify any hashes older than some hash-verification age.
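As a rough illustration (the script names and schedule are placeholders, not actual bunker or bitrot commands):

# crontab entries: nightly hash update, weekly re-verification of older hashes
0 3 * * * /boot/custom/bin/hash_update.sh /mnt/disk1
0 4 * * 0 /boot/custom/bin/hash_verify.sh /mnt/disk1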

 

What's missing in each of the shell-based tools is an export to a standard hash file and an import from one.

Otherwise they both use a good mechanism for detection and concurrency.

Link to comment

@ysss. UnRaid works fast enough for what I'm doing. I don't even use a cache drive.  My documents drive does see some regular updates.

 

The drive in question holds old JPG image files from the days when digital photography was just starting and file sizes were small. So it's a media drive.

 

Link to comment

 

 

 

When I really thought about it, using a daily cron job to scan the filesystem for files that are missing hashes, or whose mtime is newer than the hash time, was the most efficient approach.

 

Yes, and creating the hash in real time could bring the server to its knees during massive copy operations.  Updating the hash can happen later.

 

Are bitrot and bunker efficient at scanning the filesystem and doing the mtime > hash-time compare?  Does a drive with 300,000 files take minutes, or hours?

 

Link to comment

 

When I really thought about it, using a daily cron job to scan the filesystem for files that are missing hashes, or whose mtime is newer than the hash time, was the most efficient approach.

 

Yes, and creating the hash in real time could bring the server to its knees during massive copy operations.  Updating the hash can happen later.

 

Are bitrot and bunker efficient at scanning the filesystem and doing the mtime > hash-time compare?  Does a drive with 300,000 files take minutes, or hours?

 

To walk through 300,000 files takes me approximately 30-40 minutes doing only a find from a freshly booted system.

If nothing changed, I would not expect the scan and XATTR check to take longer than 1 hour on a freshly booted system.

 

Once this scan was done, a subsequent one shouldn't take longer than 10-20 minutes. 'Shouldn't.'

 

It's all relative to the number of files, the speed of the disk subsystem, and the processor.

In addition, any cache misses or flushing of the dentry/inode cache come into play.

 

Now if those files did not have hashes, the initial hash creation can take over 24 hours.

 

This is where a fast processor matters. unRAID doesn't need it unless you are virtualizing or doing high volume hashing and want speed.

 

If the processor is fast, the disk subsystem can read fast enough without interruption, and the disk is very full,

I would guesstimate that a full scan would take 2-3x as long as a full SMART long self-test.

 

To estimate that, run smartctl -a and look at the Extended self-test routine recommended polling time, e.g.:

recommended polling time:        ( 543) minutes.
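For example (the device name here is just a placeholder):

smartctl -a /dev/sdb | grep -A1 "Extended self-test"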

 

In my case, this disk took a little over 24 hours for the initial scan and hash using md5deep.

 

 

It's all relative to the speed of the disk, SATA subsystem, how fast the processor is and how full the disk is.

Link to comment

Did you try FolderMatch to see what the differences are?

 

One other thought re: your comment:

 

The drive in question holds old JPG image files from the days when digital photography was just starting and file sizes were small. So it's a media drive.

 

You may have a few hidden Thumbs.db files on the source drive that aren't being copied.    With 225,000 files I wouldn't expect these to add up to as much as the difference you're seeing, but if there are a lot of them, that could explain it.  But I'd run FolderMatch so you know for sure what the actual differences are.

 

 

 

 

Link to comment

Yes, it should work okay => did you change the comparison method (on the Options menu) to either "File Name Only" or "Size and Date/Time" ??    If not, it's actually reading all of the data and comparing the files -- this will take a VERY long time for that many files with that much data.

 

 

 

Link to comment

Weebotech, can I do a compare as follows?  I'm not sure FolderMatch will succeed.  I could do something via the terminal, as follows, to do an exhaustive compare of the file contents.  I want to output a list of files that don't match.

 

mount -t nfs tower3:/mnt/disk5/ /mnt/t3disk5

rsync -avrc --dry-run --stats --progress /mnt/disk1 /mnt/t3disk1/  >> /boot/logs/cronlogs/t3disk1chksum.log

 

Or is there some way to use diff between 2 servers?

Link to comment

Yes, it should work okay => did you change the comparison method (on the Options menu) to either "File Name Only" or "Size and Date/Time" ??    If not, it's actually reading all of the data and comparing the files -- this will take a VERY long time for that many files with that much data.

 

I tried several different ways, but ultimately asked it to compare based on file contents.  I know it will take a very long time.  The Windows laptop doing the compare has a gigabit wired connection, an i7 chip, and 16 GB of RAM.  It isn't working very hard at all; W7 Task Manager shows 1-5% CPU load.  I expected it to be working harder...

 

(time to break out the whip.... faster, faster, faster)

Link to comment

Weebotech, can I do a compare as follows?  I'm not sure FolderMatch will succeed.  I could do something via the terminal, as follows, to do an exhaustive compare of the file contents.  I want to output a list of files that don't match.

 

mount -t nfs tower3:/mnt/disk5/ /mnt/t3disk5

rsync -avrc --dry-run --stats --progress /mnt/disk1 /mnt/t3disk1/  >> /boot/logs/cronlogs/t3disk1chksum.log

 

Or is there some way to use diff between 2 servers?

 

You can do it that way.

While I do it that way, I also keep an md5sums file for the source.

Doing the -rc is sort of like doing an md5sum on each file.

 

If you create an md5sums file on the source you will gain in a few ways.

 

1.  You will now have hash values for all your files.

2.  You can bring this file over to the destination and check each file for integrity.  You will see misses if there are any.

3.  Now you have hash values and history.

 

I bet there's a way to import these hash values into the extended attributes for bunker and/or bitrot.

 

 

Going forward you can use one of those tools.

 

The other choice is to use bunker or bitrot to put the extended attributes on the source.

Then rsync the tree again and use the same tool on the destination.

The downside of this is that if a file is missed, you will not know it.

 

With an md5sums file, when you check the integrity, you'll know if a file was missed.

 

A quick method is to do a find down each of the trees.

On one system:

find /mnt/disk? -print | sort > filelist

Do the same on the other system,

then do a wc -l on each system's filelist to check the counts.

If there are misses, you can do a diff -u on the two filelists to see where they are.

Link to comment

One thing I definitely do NOT like about FolderMatch is its almost non-existent status info.  It can definitely appear "hung" when in fact it's still working away.  If you don't see the little window that shows the "current" files being compared (in quotes, as it doesn't update very often), it's likely hidden behind the main FolderMatch window.    As long as Windows doesn't show a "not responding" note in the title bar, it's probably working just fine -- just much slower than you'd like.

 

I still use it as it's a very reliable way to identify file differences.

 

Link to comment

Yes, it should work okay => did you change the comparison method (on the Options menu) to either "File Name Only" or "Size and Date/Time" ??    If not, it's actually reading all of the data and comparing the files -- this will take a VERY long time for that many files with that much data.

 

I tried several different ways, but ultimately asked it to compare based on file contents.  I know it will take a very long time.  The Windows laptop doing the compare has a gigabit wired connection, an i7 chip, and 16 GB of RAM.  It isn't working very hard at all; W7 Task Manager shows 1-5% CPU load.  I expected it to be working harder...

 

(time to break out the whip.... faster, faster, faster)

 

If you do the md5sums on the source, then do the md5sum -c on the destination, it's almost the same thing.

You gain by having hash values for checking at another time (or for importing them into the extended attributes).

Link to comment

If you do the md5sums on the source, then do the md5sum -c on the destination, it's almost the same thing.

You gain by having hash values for checking at another time (or for importing them into the extended attributes).

 

Agree.  I've done this quite often ... computed the MD5s on one system, then checked them on another [or on one drive, and then checked them after a copy].

 

Link to comment

----  As long as Windows doesn't show a "not responding" note in the title bar, it's probably working just fine -- just much slower than you'd like.

 

I still use it as it's a very reliable way to identify file differences.

 

---- Windows is showing a "Not responding" note in the title bar....

Link to comment
