2 identical disks mirrored by rsync have different free space - HOW can this be??



Weebotech, I connected to that location via VPN and ran that rsync script.  It is claiming that most of the files are different.  Am I reading this right?

 

Pix2015/2015-05 May/[2015-05-21] outdoor backdrop ideas/IMG_5462.JPG
Pix2015/2015-05 May/[2015-05-21] outdoor backdrop ideas/IMG_5463.CR2
Pix2015/2015-05 May/[2015-05-21] outdoor backdrop ideas/IMG_5463.JPG
Pix2015/2015-05 May/[2015-05-21] outdoor backdrop ideas/M2100.CTG

Number of files: 6636
Number of files transferred: 6424
Total file size: 149609156912 bytes
Total transferred file size: 149609156912 bytes
Literal data: 0 bytes
Matched data: 0 bytes
File list size: 205176
File list generation time: 0.001 seconds
File list transfer time: 0.000 seconds
Total bytes sent: 225309
Total bytes received: 20132

sent 225309 bytes  received 20132 bytes  147.46 bytes/sec
total size is 149609156912  speedup is 609552.43 (DRY RUN)

 

Gary, I'll not be near that computer until Tuesday.  We'll know more then.


Use another -v and it will say whether each file matched or not.
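For example, something like this (just a sketch; the paths and host are placeholders for the ones in your script):

# dry run with extra verbosity: unchanged files are reported as "... is uptodate"
rsync -n -avv /mnt/user/Pix/ backupserver:/mnt/user/Pix/

# or itemize the changes to see why a file would be resent
# (e.g. ">f..t......" means only the timestamp differs)
rsync -n -avi /mnt/user/Pix/ backupserver:/mnt/user/Pix/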

 

I still think capturing md5sums on the source is worth the time and cycles.

 

As mentioned, doing the two find | sort runs and a wc and/or diff on the filenames/directories can be done in less than an hour.

That will tell you if files are missing, not whether they match and are intact.

The md5sums will tell you if the files are intact.
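For example, a rough sketch (/mnt/disk1 and /mnt/backup1 are placeholders; with two separate servers you'd run each half on its own box and copy the list/md5 files across):

# filename/directory comparison - fast, but only tells you what is missing
find /mnt/disk1 -printf '%P\n' | sort > /tmp/source.list
find /mnt/backup1 -printf '%P\n' | sort > /tmp/backup.list
diff /tmp/source.list /tmp/backup.list

# content check - md5sums captured on the source, verified against the backup
( cd /mnt/disk1 && find . -type f -exec md5sum {} + > /tmp/disk1.md5 )
( cd /mnt/backup1 && md5sum -c /tmp/disk1.md5 | grep -v ': OK$' )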

 

I do expect them to be, unless the prior rsyncs were interrupted.

However, rsync uses hidden tmp files when syncing files to another system, so you shouldn't have any partial files unless --partial was used.
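For reference, the difference looks like this (paths/host are placeholders):

# default: each file is built as a hidden ".name.XXXXXX" temp file and only renamed into
# place when complete, so an interrupted transfer leaves no partial file behind
rsync -av /mnt/user/Pix/ backupserver:/mnt/user/Pix/

# with --partial an interrupted file is kept so it can be resumed later - the only way
# a half-written file should end up on the destination
rsync -av --partial /mnt/user/Pix/ backupserver:/mnt/user/Pix/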


I VPN'd into the server and started it generating md5s based on your scripts.  Thanks weebo

 

Might as well keep the server busy, and have that done before next week.  Who knows, it might come in handy.  I really want to know if something went wrong.


I have completed generating the md5s and run a verify on them, and it looks like the files all compare correctly.  So the reason the disks show different free space must just be the space taken by directories.

 

I haven't been able to get back to the computer running FolderMatch to see if it has crashed, but the batch scripts from weebo were awesome.  Thanks weebo


Using @weebotech's tools, I compared over 2 million files between all the servers and found 3 instances of bit rot.  3 files failed and were definitely corrupted.  But out of 2 million, I guess we can live with that.  This is just using

 

rsync -av

 

to do the backup between unRAID servers.  There was no verification taking place, and no md5s generated until now.


No, the files had not changed.  Sizes and times matched exactly.  These files were backed up daily, and it was always one of the backup servers that had the corrupted file.  So we can blame the rsync process for creating the 3 corrupt files.

 

I'm afraid it's not that simple.  The rsync process is only doing what we tell it, and it does it very reliably.  If rsync was the problem, then we would all be having this problem, but we aren't.  If there is corruption, then there is almost certainly a hardware malfunction somewhere, that you really need to identify, because there should absolutely be ZERO corruption, no matter how many files you transfer.

 

You have taken the first step by identifying the one server with corrupted files.  Now you need to identify what hardware component cannot be trusted, and the sooner the better, because that server cannot be trusted to protect your files!  The possible components are the memory, the motherboard chips, the NIC, the network cable, a disk drive, possibly others.  The one other possibility is some sort of memory corrupting bug on that server, a bad plugin or driver, but you should most likely have seen problems - crashes, panics, call traces in the syslog...

 

The easiest one to test is the memory, so as soon as possible, run a LONG Memtest from the boot menu, for at least 24 hours (stop if any failure occurs).  While it's running, you can begin to analyze what components and what paths are common to ALL of the affected files.  For example, if corrupted files are now stored on different drives, then no drive is common to all of the files, and you can eliminate the drives from the suspect list.

 

For completeness, you should probably run a Memtest on the machine the files came from too, just in case.  And verify that the 3 files on it and elsewhere are fine, and that only the copies on that one server are corrupted.  You want to be absolutely sure that each file was clean before the copy but not after, so you know which machines and which network paths you can trust.

 

I've been through this twice before myself, and there's so much more that could be said, but this is enough to raise the alarm and get you started toward isolating the problem.


Yes Robj, I am sure you are right.

 

This process started in 2010, when the backup servers were Sempron 145 based and were hand-carried off site.  Then they were replaced with 24-bay Tam's data center pulls once the disk count went beyond 6.  I guess what I'm saying is that much water has gone under the bridge in the last 5 years, and it will be hard to identify the guilty party (bad network switch, a multi-rail power supply that was used for 6 months before it caused problems, no UPS for the first 2 years...).  Many things have changed, and I've committed all the sins possible.  To have only 3 files bad is a welcome confirmation.

 

I want to be able to spot bit rot more quickly in the future, so am glad for the work done by weebo, jbartlett and bonienl with bunker and bitrot. I just wish they were all working together as a tight little suite.

 

Weebo uses a file in the root of each drive to hold the hashes, while the other tools use extended attributes.  Still, we can just be glad we have anything at all.


Wow!  Sorry.  You're right, too much is past.

 

What may still be helpful, though, is to compare the 3 files - the bad copies against the good originals - just so you can identify what is characteristic about the corruption.  There is almost always a pattern.  For example, it might always be the nth byte in a certain buffer size or memory unit, and/or always the nth bit of that byte, and a bit that always turns off, not on, or the reverse.
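Even on big files, a simple byte-level compare will usually expose the pattern (filenames here are just placeholders):

# list each differing byte: offset, value in the good copy, value in the bad copy
# (cmp prints the byte values in octal)
cmp -l good_original.psd corrupted_copy.psd | head -50

# count how many bytes differ in total
cmp -l good_original.psd corrupted_copy.psd | wc -l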

 

If the 3 files aren't too big, and aren't confidential, and you could zip both the good and the bad, I'd be happy to analyze them.

 

The reason it might be helpful is that if it happens again, you want to be able to say it's the same problem, and therefore conclude that the problem component is still there, and active.


The 3 files are gigabyte sized and non-compressible Photoshop PSD files.  But thanks anyway.

 

This has got me interested in the Windows side of this issue.  My files are created on Windows first, then cwrsync'ed to unRAID, then rsync'ed to the backup unRAID server.  I would like to calculate the hash on Windows, put it into the extended attributes on the NTFS disk (as per bunker and bitrot), and have it follow the files as they are copied.  Might as well start at the source....

 

Guess it's time to PM the authors of bunker and bitrot...



I compute checksums for all my files on Windows using the Corz Checksum utility and copy the checksums along with the files when I copy them to UnRAID.    It's then very simple to test from Windows with a simple right-click, "Verify checksums" selection.    HOWEVER, there's no convenient way to check them from Linux, so if your goal is to do this all within UnRAID, the utilities WeeboTech uses are a better choice -- I'm just not sure how you can initiate the process from within Windows.

 

Did you ever get a chance to try FolderMatch again?    I'm very surprised it died on you -- I've used it for years and have sets of files with far more than 225,000 files in them.    I'd at least try it with the "Size only" comparison method, so you know for sure where the actual size differences are.

 

  • 2 weeks later...

I bet there's a way to import these hash values into the extended attributes for bunker and/or bitrot.

 

Going forward you can use one of those tools.

 

I have now finished getting all files hashed and, using bunker, added to the extended attributes, but I find one fatal flaw in creating an ongoing working system.  I can't find an efficient way to update the hashes once a file changes.  I need to compare the hash date with the file date, and this should be easy with bunker since the hash date is stored with the file, but bonienl hasn't implemented it yet.  With bunker I can add hashes for a dozen new files to a 3TB drive in a few minutes, but it takes 24 hours to find a dozen changed files and update their hashes.  How do you update the hashes in close to real time?

 

See here:

 

http://lime-technology.com/forum/index.php?topic=37290.msg381357#msg381357



Can you drop the cache (or reboot), then do a find down the whole tree just to see how long it takes to scan the whole filesystem?

 

something like

http://linux-mm.org/Drop_Caches

 

echo 3 > /proc/sys/vm/drop_caches

time find /mnt/disk? -type f > filelist

 

That will show you how long it takes to scan the filesystem.

It will be the minimum amount of time to do the check.

 

At that point, it's fairly easy to find what files do not match the scandate in the XATTR.

 

For me it takes 18-32 minutes to scan 350,000 files on a freshly booted or dropped cache system.

 

This does not take into account the matching of hash sums, just the scanning of the filesystem itself.

There will also be time involved in reading the XATTR to acquire the scandate.


Yes, I think my times are similar.  I could generate the file list from your tools in about 10-20 minutes, and that was without dropping the caches first.  I think that anything less than an hour is good.

 

I also know that an rsync could traverse quite fast. I'll add that here.

 

Log for "rsync -av" 2 servers over gbit no files sent with 2 & 3 tb disks about 100K files per disk
====== moving to Disk2 ===========
Mon Jun 8 01:03:32 CDT 2015
====== moving to Disk8 ===========
Mon Jun 8 01:15:47 CDT 2015
====== moving to Disk7 ===========
Mon Jun 8 01:28:34 CDT 2015
====== moving to Disk6 ===========
Mon Jun 8 01:42:31 CDT 2015
====== moving to Disk5 ===========
Mon Jun 8 02:02:21 CDT 2015
====== moving to Disk4 ===========
Mon Jun 8 02:16:53 CDT 2015
====== moving to Disk3 ===========
Mon Jun 8 02:33:15 CDT 2015
====== moving to Disk1 ===========
Mon Jun 8 02:43:08 CDT 2015
====== moving to end ===========
Mon Jun 8 03:03:56 CDT 2015


The rsync traversal on a cached system is going to be much faster than if the caches were dropped.

 

I don't know if bunker or bitrot can accept a list of files to operate on.

If so, a .c program can do what find does and compare the XATTR scandate vs mtime as fast as an uncached (dropped-cache) system.

That's your benchmark of minimum speed.

 

The trick is where to put the list of files, in what format, and how to tell bunker or bitrot to operate on only those files.

 

In this case, it would be something as fast as cache_dirs, and pretty much doing the same thing cache_dirs does: scanning the filesystem, but now doing something during the scan, i.e. checking xattrs.

 

FWIW, I had a cache_dirs alternative that I've worked on in the past. It would put the stat blocks in a .gdbm file.

It operated pretty fast. It was a method to detect file system/mtime changes.

 

However, I didn't think anyone needed to know when a file changed in that short of a time span so I stopped working on it.

 

If limetech hadn't dropped perl from the distro, it could be a quick script to do this.

 

Other than that, it's a find, reading each file, piping out to getfattr and comparing mtimes.

It's feasible in bash with the stat and getfattr commands, but the continual pipes & forks for each file will slow it down.
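Roughly like this, as a sketch only - the attribute name "user.hashdate" and the epoch-seconds format are placeholders for whatever bunker/bitrot actually store:

# list files whose mtime is newer than the scan date stored in an xattr
find /mnt/disk1 -type f | while IFS= read -r f
do
    scandate=$(getfattr --only-values -n user.hashdate "$f" 2>/dev/null)
    mtime=$(stat -c %Y "$f")
    # no attribute, or modified since the last hash -> needs re-hashing
    if [ -z "$scandate" ] || [ "$mtime" -gt "$scandate" ]; then
        echo "$f"
    fi
done > /tmp/needs_rehash.list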

 

It still boils down to having bunker or bitrot do this natively and/or allowing import of a filelist.


How do you update the hashes in batch, but close to real time?  Is my use case unique?

 

I have one 3TB drive that is a documents drive.  All the other drives hardly ever change, other than newly added files, but that one is constantly getting written to.  Once a file is changed and saved back to disk, the hash in the extended attributes is wrong.

 

I need to update the hashes in the extended attributes, or else I mistakenly blame bitrot, when really all that happened is the file was edited and written back to disk. 

 

I don't see a problem adding perl as a prerequisite to this tool.  I have it already added to make rsnapshot work.  Likely bitrot and bunker could use it too.

 

 


I've considered updating them in real time.  I changed my mind on it, feeling it just didn't need to be real time with normal usage patterns.

 

My thoughts were a daily scan at some interval should catch changed files and recalculate the hash sums.

 

Using my .gdbm method, I could do it in near real time, but it comes at a price.

The filesystem has to be constantly scanned like cache_dirs, data about each file is stored somewhere, and as soon as a file changes it has to be read and hashed.  Depending on the size of the file, this could cause the kernel caches to get flushed.

 

At that point you could lose all the stored directory and inode information, causing a delay on the next scan.

It could also be something that prevents the drives from sleeping.

 

There's the possibility of having a tool like this only operate on a specific directory tree, but then you can use the inotify tools and call bunker/bitrot directly.
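Something along these lines with the inotify-tools package (sketch only; /mnt/user/documents is a placeholder, and the re-hash step depends on bunker/bitrot accepting a single file or a file list):

# watch the documents tree and queue each file that finishes being written
inotifywait -m -r -e close_write --format '%w%f' /mnt/user/documents |
while IFS= read -r f
do
    # hand "$f" to bunker/bitrot here if/when they take a single file,
    # or just collect the names and process the list on a schedule
    echo "$f" >> /tmp/changed_files.list
done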

 

That's all dependent on these tools being able to operate on an individual file or external list of files known to have changed.

I don't want to reinvent their wheels.  That would be the way to integrate one tool with another without having to do a whole filesystem scan and hash check.

 

I thought the bitrot tool had a mode to check for changed files.


My thoughts were a daily scan at some interval should catch changed files and recalculate the hash sums.

I agree daily would be considered real time for me.

I thought the bitrot tool had a mode to check for changed files.

I thought so too.  But bunker -u takes 23 hours to process a 3TB drive with only 10 files changed since the last run...

 

http://lime-technology.com/forum/index.php?topic=37290.msg381351#msg381351

 

I don't want to reinvent the awesome work you guys have already done.  I don't have the time or the knowledge.  I would like to try and use what has already been created, but I have this snag with a documents drive.

 

Proposed workflow, assuming there were an efficient update process for hashes: each day, update hashes on all newly added files on all drives, and update hashes on all changed files on the documents drive.

 

Each drive would be verified once per month, per your plan as follows.

 

Day of month

1 Parity Check

2 Disk2

3 Disk3

4 Disk4

5 Disk5

6 Disk6

7 Disk7

8 Disk8

9 Disk9

.... and so on ....

(disks above 20 would not be included in this as they are used for other purposes)

 


I would suggest reconsidering the scheduling of the parity check vs the per-disk checks, for simplicity.

 

If you schedule the parity check for day 27, it will still occur correctly in Feb and could run for 2 days.

Then, by scheduling the disk or hash verification per disk according to the day of the month, you can use a simple date command to figure out which disk to operate on.

 

i.e. here are some examples of doing it in bash and with pipes via date.

We use %e so that we can strip off the space padding.

 

BASH

printf -v DOM '%(%e)T' -1

printf -v DOM '%d' ${DOM}

 

Verify:

 

echo "'${DOM}'"

'8'

 

and in piped format with backticks.

 

DOM=`date +%e`

echo "'${DOM}'"

' 8'

 

Here we strip off space padding with sed.

 

DOM=`date +%e|sed 's# ##g'`

echo "'${DOM}'"

'8'

 


Now you can set a cron job to run every day and do a disk mount test with

if [ ! -d /mnt/disk${DOM} ]

  then exit

fi

 

# otherwise operate on /mnt/disk${DOM}

 

I do it this way to simplify things.

One job that runs every day of the month.

If the disk is mounted and today is the matching day of the month, do some work.
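As a sketch, the whole thing is one cron entry plus a small wrapper (the script name and the actual verify command are placeholders for whatever tool you use - bunker, bitrot, md5sum -c, etc.):

# crontab entry: run the wrapper at 01:00 every day
# 0 1 * * * /boot/custom/daily_disk_check.sh

#!/bin/bash
# daily_disk_check.sh (placeholder name)
DOM=`date +%e|sed 's# ##g'`
if [ ! -d /mnt/disk${DOM} ]
  then exit
fi
echo "`date`: verifying /mnt/disk${DOM}"
# your-verify-command-here /mnt/disk${DOM}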

 

FWIW, I have other code that takes the output of /proc/mdcmd status and puts it into a bash array so I can operate on the matching /dev/sd? device per /dev/md${DOM} device.

 

This lets me operate on each disk once a month.

Parity has to be handled differently, but for the most part I can trigger SMART tests once a month on the appropriate array disk.

