Dynamix File Integrity plugin


bonienl


I just recently found this tool, so thank you bonienl for sharing it with the community.  So far it is working pretty awesome.

 

I do have a question:  I have been playing with the `Find` feature and it isn't behaving as I expect.  From just a quick manual glance at my exported hash files, I am finding duplicates that the tool isn't detecting.  Here are some quick examples:

 

```
659ea86a44709dee1386da8fb4291cc9c123310b67cc62da3a2debf2f3e9593ad8f2bfb20ab1eedc7b357c20dd6bdcdd4709450b4e7cbfb46cde19773a78612e */mnt/disk1/Documents/Finances/Beneficiary Forms/SF1152.pdf
873588b4cc6056b6aba1b435124f59b25f68e7403d5c361efe1c24d22d623f12fa0c097f97a86f534878caa4c9e5b5d51e0ac21b5d6d628abbcf8f8c6d371c55 */mnt/disk1/Documents/Finances/Beneficiary Forms/SF2823.pdf
62bebcf21626d8062d0881c8f7a3587d0070af77704935d7ad09558f758475a7fd32ee6e923b7cc52c822fe8066b1fecc23f76f140463edd5119318be74ece60 */mnt/disk1/Documents/Finances/Beneficiary Forms/SF3102.pdf
762a1e0ed8d97e46a9a4db1fb0560bd3f5487ffff5e4b925a3cbf14ebd36e510382487625a825dd3805e2ca510105c11770d154e6e3e97655ea509055cfde19f */mnt/disk1/Documents/Finances/Beneficiary Forms/TSP3.pdf
659ea86a44709dee1386da8fb4291cc9c123310b67cc62da3a2debf2f3e9593ad8f2bfb20ab1eedc7b357c20dd6bdcdd4709450b4e7cbfb46cde19773a78612e */mnt/disk1/Documents/Finances/Beneficiary/SF1152.pdf
873588b4cc6056b6aba1b435124f59b25f68e7403d5c361efe1c24d22d623f12fa0c097f97a86f534878caa4c9e5b5d51e0ac21b5d6d628abbcf8f8c6d371c55 */mnt/disk1/Documents/Finances/Beneficiary/SF2823.pdf
62bebcf21626d8062d0881c8f7a3587d0070af77704935d7ad09558f758475a7fd32ee6e923b7cc52c822fe8066b1fecc23f76f140463edd5119318be74ece60 */mnt/disk1/Documents/Finances/Beneficiary/SF3102.pdf
762a1e0ed8d97e46a9a4db1fb0560bd3f5487ffff5e4b925a3cbf14ebd36e510382487625a825dd3805e2ca510105c11770d154e6e3e97655ea509055cfde19f */mnt/disk1/Documents/Finances/Beneficiary/TSP3.pdf
```

 

`find` reports no duplicates:

```
Reading and sorting hash files
Including... disk1.export.hash
Including... disk2.export.hash
Finding duplicate file names
No duplicate file names found
```

 

Maybe there is an explanation I am overlooking?



 

A file is considered a duplicate when:

 

1. The same path+name appears on different disks, e.g.

   /mnt/disk1/myfolder/filename.txt
   /mnt/disk2/myfolder/filename.txt

 

2. Hash results of files are the same, regardless of the path+name of the file
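
For illustration, the two checks amount to grouping the export-file lines on different keys. A minimal sketch in Python (hypothetical, not the plugin's actual code), parsing lines like the ones posted above:

```python
import re
from collections import defaultdict

def find_duplicates(hash_lines):
    """Case 1: same path+name on more than one disk.
       Case 2: same hash value, regardless of path+name."""
    by_relpath = defaultdict(list)
    by_digest = defaultdict(list)
    for line in hash_lines:
        digest, path = line.rstrip("\n").split(" *", 1)
        # Case 1 keys on the path with the /mnt/diskN prefix removed.
        by_relpath[re.sub(r"^/mnt/disk\d+", "", path)].append(path)
        # Case 2 keys on the digest alone.
        by_digest[digest].append(path)
    name_dupes = {k: v for k, v in by_relpath.items() if len(v) > 1}
    hash_dupes = {k: v for k, v in by_digest.items() if len(v) > 1}
    return name_dupes, hash_dupes
```

Keyed that way, the example files under "Beneficiary Forms" and "Beneficiary" are different names (so case 1 finds nothing), but their identical digests would surface under case 2.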

 


Since his example shows 4 files with matching hashes to 4 other files, why are there no duplicates found?

Because they're not in the same folder on multiple disks.  Therefore it's not a duplicate file that's going to mess up SMB/unRAID when deciding which copy to use.

Yes, but case 2 specifically states they will be listed as dupes if the hash matches, regardless of path.

 

I am very opposed to calling user file system name collisions dupes, as they very well could be different content. Consider that since only the file on the lowest-numbered disk will be exposed for editing, files with the same name and path on other disks will likely be outdated.

 

Naming collision is a much more accurate description of case 1, while duplicate files accurately describes case 2.

Well, I disagree with case 2.  Simply because 2 files have the exact same hash does not mean that they are dupes.  Will you ever see that? Who knows, but it is entirely possible.
True. If the hash matches, then they should be further processed with a binary compare  before being presented as dupes. That would remove all doubt.
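
A minimal sketch of that belt-and-braces approach (hypothetical Python using the standard hashlib and filecmp modules, not the plugin's code):

```python
import filecmp
import hashlib
from collections import defaultdict

def blake2b_of(path, chunk=1 << 20):
    """Stream a file through BLAKE2b without loading it all into memory."""
    h = hashlib.blake2b()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            h.update(block)
    return h.hexdigest()

def confirmed_dupes(paths):
    by_hash = defaultdict(list)
    for p in paths:
        by_hash[blake2b_of(p)].append(p)
    for first, *rest in by_hash.values():
        # A matching hash only makes files candidates; the byte-by-byte
        # compare is what removes all doubt.
        same = [p for p in rest if filecmp.cmp(first, p, shallow=False)]
        if same:
            yield [first, *same]
```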

The second case is only executed when the option `Include duplicate file hashes in Find command` is checked.

 

In that case the utility reports any files which have the same hash value; the result is displayed on a separate page which needs to be opened, see the attached picture.

[attached screenshot: dupes.png, the duplicate file hashes results page]


Well, I disagree with case 2.  Simply because 2 files have the exact same hash does not mean that they are dupes.  Will you ever see that? Who knows but it is entirely possible

 


 

That's pretty paranoid. The likelihood of two of your files having a hash collision with blake2 is astronomically small.


Why? Regardless of whether you see it or not, it is entirely possible (likely, even, depending upon the number of files) for 2 completely different files to have identical hashes.  Using a 32-bit hash (admittedly Blake2 isn't), the odds of a collision reach 1 in 2 at around 77,000 files.
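
For anyone who wants to check that arithmetic, it is the standard birthday-bound approximation (nothing plugin-specific):

```python
import math

def collision_probability(n, bits):
    """Birthday bound: chance of at least one collision among n files
    hashed into a space of 2**bits values."""
    return 1 - math.exp(-n * (n - 1) / (2.0 * 2 ** bits))

print(collision_probability(77_000, 32))  # ~0.50, the "1 in 2" figure
print(collision_probability(10**9, 512))  # underflows to 0.0 for 512-bit blake2b
```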

 

Hash values are by definition not unique -> if they were, your Blu-ray rip would take up a whopping 64 bytes on your hard drive.

 

When 2 files' hash values (regardless of length) differ, it's guaranteed that the files are different.  When 2 files' hash values are identical, the only way to determine whether the files are different is to perform a byte-by-byte comparison.  Hashing is used to verify whether a transfer of a file (or the integrity of a file) was performed successfully, because small changes in the file will produce large changes in the hash.  Hashes are not intended to compare the contents of two different files.

 

The odds of a collision with Blake2 on a single file that has suffered silent corruption (i.e. the original file and the corrupted file both having the same hash), however, are indeed astronomically small.  And that's what makes Blake a great hashing algorithm.


This looks like it'll be a great plugin and I've installed it, but I haven't set it up yet (as my server is currently doing some fairly intensive stuff and I don't want to complicate matters). As this plugin has been out a little while now, I was wondering if anyone has experience with which hashing algorithm has the least performance impact on an Atom processor?

 

I have an Intel C2750D4I - it's not the most powerful processor ever, but it's perfect for my usage scenario. The last thing I want, however, is for something to start writing to the array while I'm watching or transcoding something in Plex and interfere with that, so it's pretty important to me that I use the hashing algorithm with the least performance impact.

 

I see that blake2 is supposed to be the fastest of the bunch, but I also saw somewhere that you have to make sure your processor is compatible (plus sometimes things are fast on some processors but not on others).

 

If anyone has any input on this, especially if they've used it with an Intel Atom C2750, I'd really appreciate it as obviously hashing all of my storage 3 times to find the fastest one would take a long time when someone's probably already got some information available!


To anyone in the future interested in this: I did some very basic testing by creating a user share with a few files in it and excluding every other share, then doing a build to find out which algorithm was fastest on my processor.

 

Interestingly it doesn't seem like any of them are multi-core optimised (and I guess the build only does one file at a time, at least if they're all on one disk). I got 100% CPU load on one core of my processor whichever algorithm I used. At the end, the build gives you an average speed. I ran all the tests a couple of times with all different files and this is what I got:

 

SHA1: Was around the 90 MB/s mark (unfortunately can't remember the exact results)

BLAKE2: 93 MB/s

MD5: 323 MB/s

 

So if anyone wants to install this plugin and their primary concern is speed, at least on the 8-core C2750 Atom, MD5 is by far the fastest to use. I find it crazy that BLAKE2, which is supposed to be the fastest, is less than a third of the speed of MD5, but this may well just be a quirk of the C2750 processor.
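
For anyone wanting to repeat this on their own CPU without hashing the whole array, here is a rough single-file benchmark along the same lines (a sketch using Python's hashlib, which is not necessarily the implementation the plugin uses):

```python
import hashlib
import time

def throughput(path, algo, chunk=1 << 20):
    """Hash one file and return MB/s. Run it twice so the file is in the
    page cache and you time the hash, not the disk."""
    h = hashlib.new(algo)
    size = 0
    start = time.perf_counter()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            h.update(block)
            size += len(block)
    return size / (time.perf_counter() - start) / 1e6

for algo in ("md5", "sha1", "sha256", "blake2b", "blake2s"):
    # "testfile.bin" is a placeholder for any large file on the array.
    print(f"{algo:8s} {throughput('testfile.bin', algo):7.1f} MB/s")
```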


I did a measurement on my Xeon-based system (see my sig). These are the results of hashing an 8.5GB file:

```
sha256 = 36.7s = 232 MB/s
md5    = 12.8s = 664 MB/s
blake2 = 10.2s = 833 MB/s
```

 

Interesting, probably just a quirk of my processor then. I did a quick Google search for "C2750" and "BLAKE2" and came up with this

 

https://github.com/minio/blake2b-simd/issues/11

 

Not sure if that's the blake2 algorithm this is using (or if the problems posted there are related to my slow speeds) but I would guess that it's something to do with the "Seems to indicate that there's some kind of performance penalty on Atom when executing SSE with 64-bit operands" comment.

 

Thanks for the benchmarks, always useful to have. So I guess for most people, BLAKE2 is probably the best option but for us Atom users, probably best stick with MD5.


Right from the webpage https://blake2.net/

 

Q: BLAKE2bp gives different results from BLAKE2b. Is that normal?

 

A: Yes. BLAKE2bp is a different algorithm from BLAKE2b and BLAKE2sp is a different algorithm from BLAKE2s. Each algorithm produces a different hash value.

 

BLAKE2b and BLAKE2s are designed to be efficient on a single CPU core (BLAKE2b is more efficient on 64-bit CPUs and BLAKE2s is more efficient on 8-bit, 16-bit, or 32-bit CPUs). BLAKE2bp and BLAKE2sp are designed to be efficient on multicore or SIMD chips, by processing the input in parallel. This parallel approach results in different secure hash values from the single-core variants.

 

More generally, two instances of BLAKE2b or BLAKE2s with two distinct sets of parameters will produce different results. For example, BLAKE2b in some tree mode (say, with fanout 2) will produce different results than BLAKE2b in a modified tree mode (say, with fanout 3).
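
You can see the "distinct parameters, distinct results" rule directly in Python's hashlib, which exposes the BLAKE2 parameter block (this only sets the tree parameters on a sequential hash, but it is enough to show the effect):

```python
import hashlib

data = b"the same input"
# Same input, three parameter sets -> three different digests.
print(hashlib.blake2b(data).hexdigest())            # default sequential mode (fanout=1)
print(hashlib.blake2b(data, fanout=2).hexdigest())  # tree parameter, fanout 2
print(hashlib.blake2b(data, fanout=3).hexdigest())  # tree parameter, fanout 3
```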

 

There is also this document https://131002.net/data/papers/NA12a.pdf which describes the benefits of the AVX2 instructions, which Atoms don't appear to have.

 

So it might be worth it to test the other versions and use the fastest ... maybe chosen dynamically at install.


Some quick testing on my Windows machine (because I am lazy) showed the following timed results for a 1.32GB file on an SSD with an AMD Phenom II X4 975:

 

-h blake2b:  4.39s (I can't tell what this is optimized for other than 64-bit)

-h blake2s:  6.79s (optimized for 8- to 32-bit)

-h blake2bp: 2.04s (optimized for 4-way parallel)

-h blake2sp: 2.83s (optimized for 8-way parallel)

 

I don't know which parameter is being used by this plugin but clearly even on a "modern" CPU it can make a large difference. If I had to guess, my 8-core unRaid cpu probably would do better on the 8-way optimized and would probably get even more benefit from the fact that it implements a more modern instruction set to include AVX, FMA4, and XOP which my desktop does not.

 

Makes me wonder if your C2750, being an 8-core, would benefit from using either the blake2bp or even the blake2sp hash function?


That's a good idea. I looked around and found the command (b2sum); you can specify the hash algorithm (running b2sum with no args even gives you something to copy-paste into a shell script for loop). So I gave it a go with an 11GB file and these are the results I got:

 

blake2b  - 1 core:  93MB/s

blake2s  - 1 core:  167MB/s

blake2bp - 4 cores: 315MB/s

blake2sp - 8 cores: 620MB/s

 

Judging by those stats the plugin is using standard blake2b. All of them maxed out whichever cores they were using (except for blake2sp which seemed to use around 85% of each).

 

It would be great to have the option to use a different blake2 algorithm as it clearly makes quite a large difference, at least on my system (which I believe is fairly popular). Of course I'd rather not make writing files to the array quite that intensive so I'd probably use the blake2bp just so there's a little CPU wiggle-room left for other tasks, but it's always nice to have options!


I have to be honest and say this really seems like a significant defect with blake2, when each variant produces a different hash.  (Put another way, I think they made a 'hash' of it!)  This makes portability very problematic.  The only time you are sure it's producing the same hashes is if you use the same tool configured the same way and only on the same machine.  But a common usage is to transfer large sets of files elsewhere, and check for corruption at the destination.  You won't be assured of the same hashes until you test the other tool, and attempt to configure it the same.  Since they seem to be so flippant with the hash produced on the *same* machine, how can you know that a different machine and OS and different CPU maker (Intel vs AMD) will produce the same hash, even if the same blake2 variant is used?  It would have to be tested, thoroughly.

 

A little more speed means nothing if guaranteed portability is needed and your chosen blake2 variant can't be trusted.


I have two reasons why Dynamix File Integrity uses a fixed algorithm for BLAKE2:

 

1. Different algorithms produce different results. Changing the algorithm would invalidate all hashes already calculated and effectively you need to start all over again. Big PITA

 

2. The verification schedule allows multiple disks to be checked simultaneously. These are different processes running concurrently, and the OS spreads them across the available cores (one process per disk is sketched below). While it sounds interesting to use the multi-core version of BLAKE, its advantage diminishes when several processes are running at the same time.

 

The hashing process is a background activity and reliability/consistency is of higher importance than absolute speed.
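
To make point 2 concrete, here is a rough sketch of that model (hypothetical paths and structure, not the plugin's implementation): one single-threaded blake2b stream per disk, which the OS spreads across cores on its own:

```python
import hashlib
import os
from concurrent.futures import ProcessPoolExecutor

def hash_file(path, chunk=1 << 20):
    h = hashlib.blake2b()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            h.update(block)
    return path, h.hexdigest()

def hash_disk(root):
    # One single-threaded hashing stream per disk keeps the disk reads
    # sequential while each worker process lands on its own core.
    return [hash_file(os.path.join(d, name))
            for d, _, names in os.walk(root) for name in names]

if __name__ == "__main__":
    disks = ["/mnt/disk1", "/mnt/disk2"]  # adjust to your array
    with ProcessPoolExecutor(max_workers=len(disks)) as pool:
        for results in pool.map(hash_disk, disks):
            for path, digest in results:
                print(digest, path)
```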

 


meh, I think if different machines produced different results we'd have heard about it by now. Blake and Blake2 are hardly untested: https://tools.ietf.org/html/rfc7693 You might as well judge SHA harshly because SHA-256 makes different hashes than SHA-512. I admit it would have been nicer if single vs parallel made the same hash, but from what I've read of the technical articles, it is a different way of constructing the hash and a different result is impossible to avoid, or it wouldn't be as much of a speedup. Blake2s vs Blake2b not making the same hash also makes sense since, well, they aren't even the same length.

 

So, I'm not sure what would lead you to make the leap from different algorithms creating different hashes to the same algorithm creating different hashes on different machines?

 

Still, I don't see portability as a concern. The user can choose which Blake2 algorithm to use just like they choose to use Blake2 over SHA or MD5. It is no more or less portable than any of the other hash functions that produce different hashes. Also, should a verification check generate a metric sh!t ton of failures, that might indicate the wrong algorithm is being used, so maybe double-check against the other options. It should be fairly obvious and just needs to be a user option when verifying. If nothing else, the hash check code can cut the options in half just by looking at the length of the hash: b digests are longer than s digests.
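
That halving really is trivial, given the default digest sizes (a hypothetical helper; the lengths match the b2sum output further down the thread):

```python
def blake2_family(hex_digest):
    """Guess the BLAKE2 family from a default-length hex digest."""
    if len(hex_digest) == 128:  # 64-byte digest: blake2b or blake2bp
        return "blake2b / blake2bp"
    if len(hex_digest) == 64:   # 32-byte digest: blake2s or blake2sp
        return "blake2s / blake2sp"
    return "unknown (non-default digest length)"
```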

 

So really what should probably happen is one algorithm has to be chosen on install (and that might be overly restrictive), by the user, and then a warning issued if the user changes it. Practically speaking that is only going to happen for anyone moving from something without AVX to something with it. After that they'd literally have no reason to change ever again until another faster better algorithm is created. Just like when everyone moved from MD5 to SHA-256 to SHA-512.

 

And they don't even HAVE to change (ie start over) because all that will happen is they will not be running the fastest most secure hash for their system. A fact already true right now.

 

As for saturating the system, I sure hope the OS is smart enough to load balance, but even if it isn't, that is why you run it at night when nothing else is going on.

 

Other than the implied ask of you doing more work I don't see the problem with offering user choice.


As to point number 2, that depends on how many disks and how many cores are available. I have far more cores available than disks; it's a factor of 8 to 1. A choice of a multithreaded algorithm could drastically help in my situation. My system has 32 cores and 4 data disks.


Have a look at this (no option = blake2b)

```
# b2sum test
fe62b863ad3acac1f1caf4fa305630b4ca2e99b6e70da5a47ea297d4083dbc2b103db9074334ab2a34412444c0bbfbb02736cb8a15c7aa589efa138360184722 test

# b2sum -a blake2b test
fe62b863ad3acac1f1caf4fa305630b4ca2e99b6e70da5a47ea297d4083dbc2b103db9074334ab2a34412444c0bbfbb02736cb8a15c7aa589efa138360184722 test

# b2sum -a blake2s test
249537724d4257ef35cda010761bbd87f5151d54ed82f761f190a73c409d6858 test

# b2sum -a blake2bp test
9c1ae086deb4de564d0152138abcd2381737da3a96622a1b9af13a4b0f8ec683565c3a8d2aefbfb2bafbda825bf2f42b8159d9e6df64b74a2f91077e18328757 test

# b2sum -a blake2sp test
212d5849b4abc1b91cc776d669d9cbab8dfcf991f1eb095c1eb546bfff2d2f91 test
```

 

Say you want to swap from single-core (2b) to multi-core (2bp): it will produce different results.

 
