
How does one check huge files' identity if hashing is CPU-bound?



For small files, hashing is just fine, but with huge ones you can easily find that md5sum is CPU-bound. Is there any hashing algorithm able to scale out across multiple cores? Any workarounds? Ideas? Anything? :)










  • Huge ones is plural, and hashing can be scaled out to multiple cores by hashing more than one file at a time. One way to do that in the shell is with GNU Parallel.
    – Brian, Jun 26 '16 at 16:27

  • In my experience, hashing is disk-I/O bound, at least on a desktop. Besides, a big task usually involves many files, so some of them can be hashed in parallel.
    – mmv-ru, Jun 29 '16 at 0:08

  • Did you try fciv.exe on Windows, to check whether it is better suited to a multicore CPU? (support.microsoft.com/en-ca/kb/841290)
    – yagmoth555, Jun 29 '16 at 12:54

  • @yagmoth555, nope. I rarely use Windows; mostly I don't use it, I'd say. From the description it looks unlikely that it scales on SMP.
    – poige, Jun 29 '16 at 13:37



multi-core hash big-data







asked Jun 26 '16 at 10:59 by poige


7 Answers


















My own best solution at the moment is:



parallel --block=512M --pipepart -a …HUGEFILE… --progress --recend '' \
    -k -j …NUMofProcessesSay4… md5sum | md5sum



— It should be noted that:

  1. The resulting md5 hash is not of the file itself, but of the md5s of its parts; still, it allows you to check whether a replica is identical to the origin

  2. It also doesn't perform very well, especially when you use a pipe rather than a file as input

  3. parallel's --pipepart, as I found out, doesn't support disk partitions

So I'd love to hear about other ways around this as well.
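For readers without GNU parallel at hand, the same "md5 of per-chunk md5s" scheme can be sketched with plain coreutils. This is only an illustration: the file names, sample size and 1 MiB chunk size are invented, and both sides of a comparison must use the same chunk size or the composite digests won't match.

```shell
set -e
head -c 3000000 /dev/urandom > origin.bin    # throwaway ~3 MB sample
cp origin.bin replica.bin

chunked_md5() {
    bs=$((1024 * 1024))                      # 1 MiB chunks
    size=$(wc -c < "$1")
    chunks=$(( (size + bs - 1) / bs ))
    for i in $(seq 0 $((chunks - 1))); do
        # hash chunk i in isolation; skip= is in bs-sized units
        dd if="$1" bs="$bs" skip="$i" count=1 2>/dev/null | md5sum
    done | md5sum | awk '{print $1}'         # md5 of the chunk md5s
}

[ "$(chunked_md5 origin.bin)" = "$(chunked_md5 replica.bin)" ] \
    && echo identical || echo different
```

Like the parallel pipeline, this compares a composite digest, not the files' plain md5sum.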






  • Git version of GNU Parallel now supports block devices (on GNU/Linux at least). Thanks for the idea.
    – Ole Tange, Jul 4 '16 at 2:40

  • Thumbs up! Thanks for maintaining the utility. ;)
    – poige, Jul 4 '16 at 16:45

  • This should have been put as part of the question rather than as an answer.
    – mc0e, Jul 5 '16 at 21:19



















Unfortunately, MD5 is a linear process whose state depends on all the previous input. In other words, you can't truly parallelize it. Moreover, I'm not aware of any real hash algorithm that does not operate in this manner.

What you can do (and, based on your answer, you are doing) is to split the source file and concurrently calculate each chunk's md5sum.

If you can't/won't do that, you have to use a faster hash function such as xxHash, CityHash or SpookyHash.

Another idea (maybe applicable to your intended usage): if you need something faster than MD5 (albeit single-threaded), you can use CRC32 (which is hardware-accelerated by recent CPUs) for a first fast pass, resorting to MD5/SHA1 for a second pass on seemingly identical files.
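A minimal sketch of that two-pass flow: a cheap checksum first, the expensive hash only when the cheap pass can't tell the files apart. Note that POSIX cksum's CRC is not the hardware-accelerated CRC32C instruction, and the file names here are invented, so treat this purely as an illustration of the control flow.

```shell
set -e
printf 'some payload' > a.bin
cp a.bin b.bin

crc_a=$(cksum a.bin | awk '{print $1}')
crc_b=$(cksum b.bin | awk '{print $1}')

if [ "$crc_a" != "$crc_b" ]; then
    echo different            # cheap pass already settled it
elif [ "$(md5sum < a.bin)" = "$(md5sum < b.bin)" ]; then
    echo identical            # expensive pass confirms the match
else
    echo crc-collision        # rare: CRC agreed but MD5 did not
fi
```

On mostly-distinct files the expensive second pass almost never runs, which is where the savings come from.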






  • The most valuable part of your answer (the rest is just boring repetition) is the list of possibly faster hashes. Thanks anyway.
    – poige, Jun 29 '16 at 13:39

  • BTW: stackoverflow.com/questions/10726325/…
    – poige, Jun 29 '16 at 14:07

  • @poige Did you read the other comments in the thread you linked? In the context of single-input hashing, MD5 cannot be parallelized, simply because it is a linear process (i.e., the current state depends on previous inputs).
    – shodanshok, Jun 29 '16 at 14:55

  • I won't consider this as the correct answer. ;-P
    – poige, Jul 1 '16 at 7:02

  • @shodanshok The issue with the second approach is that, for a valid match of a large directory with many large files, if all files are identical you are adding a lot of overhead and still running md5sum on every file.
    – Jedi, Jul 2 '16 at 16:54


















There's pretty much no getting around processing the entire file. MD4 or CRC32 are probably your best bets for a widely deployed and fast algorithm (though CRC32 is going to be far less effective than MD4).

Testing various implementations of your algorithm of choice will help. If you can find a well-tested asm implementation, it will likely improve upon the performance of its C/C++ cousins.

If you don't really care about interoperability, hashing across multiple cores is easily doable by splitting the file into chunks (this doesn't need to be done on disk; you'd just start reading from specific offsets) and processing each chunk separately (this will result in serious disk thrashing though, degrading performance, especially for mechanical disks). You'll end up with separate hashes for each chunk (which has other advantages, like pointing you toward the broken chunk), but you could always hash them together for one final value.

This Gist could be a good start for something in Python.
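A rough shell-only sketch of the offset approach: each background worker hashes its own byte range of the same file via dd's skip=, so no chunk files are split out to disk first. The file name and fixed chunk count are invented; real code would derive the chunk count from the file size.

```shell
set -e
head -c 4194304 /dev/urandom > huge.bin      # 4 MiB stand-in file
bs=$((1024 * 1024))                          # 1 MiB per chunk

for i in 0 1 2 3; do
    # each subshell reads only its own chunk, starting at offset i*bs
    ( dd if=huge.bin bs="$bs" skip="$i" count=1 2>/dev/null \
        | md5sum | awk -v n="$i" '{print n, $1}' > "chunk.$i.md5" ) &
done
wait                                         # all four chunks hashed concurrently

# combine the ordered per-chunk hashes into one final value
sort -n chunk.*.md5 | md5sum | awk '{print $1}'
```

The per-chunk files keep the hashes ordered by chunk index, so the combined value is deterministic regardless of which worker finishes first.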






  • split does indeed come to one's mind but, alas, when we talk about huge files (as we do) it's not an option.
    – poige, Jun 26 '16 at 13:24

  • @poige As I said, you wouldn't do it on disk, just start hashing the file from specific offsets and stop at the start of the next chunk.
    – Gary, Jun 26 '16 at 13:27

  • Yeah, but that's the theory, which is quite obvious; anything practical?
    – poige, Jun 26 '16 at 14:29

  • @poige From your question I can't guess at why that approach would be impractical. Perhaps there's a constraint you forgot to include?
    – Gary, Jun 26 '16 at 14:33

  • I didn't say it's impractical; but it's not practical because you don't have anything in your answer that can be used right away. E.g., this is a practical answer: serverfault.com/questions/488486/…
    – poige, Jun 26 '16 at 14:46



















Most of the answers here have addressed the linear nature of most hashing algorithms. Although I'm sure there exist some truly scalable hashing algorithms, an easier solution is to simply split the data up into smaller pieces and hash each individually.

Consider the BitTorrent approach: when a torrent is created, all of the files are split into 'blocks', each block is individually hashed, and each of those hashes is recorded in the .torrent file. This is what allows a peer to incrementally verify incoming data, without having to wait for the entire file to finish downloading first. Errors can also be corrected on a per-block basis, instead of requiring re-transmission of the entire file. Aside from the logistical benefits, this approach also allows hashing to scale across multiple cores: if 8 cores are available, 8 blocks can be simultaneously hashed.

If you engineer your verification process to work on some subset of the data, e.g. blocks of some fixed size, you can hash each block on a separate core, thus eliminating a large amount of delay in the pipeline. Obviously, this approach has a small time/memory trade-off: each additional instance of hashing has some overhead associated with it, mostly in the form of memory, although this is minimal unless you're running hundreds of instances.
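The per-block layout can be sketched like this: fixed-size blocks, one recorded hash per block, so any single block can later be re-verified on its own without touching the rest of the file. The file and block names are invented, and real torrents use SHA-1 pieces; md5sum stands in here only because it ships everywhere.

```shell
set -e
head -c 1500000 /dev/urandom > payload.bin
bs=$((512 * 1024))                           # 512 KiB blocks

# "torrent creation": split into blocks and record every block's hash
split -b "$bs" payload.bin blk.
md5sum blk.* > blocklist.md5

# "incremental verification": check one arrived block against the list
grep 'blk\.aa' blocklist.md5 | md5sum -c - >/dev/null && echo 'block ok'
```

Because every block is verified independently, the per-block hashing in the creation step is exactly the part that can be farmed out across cores.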






I'm working on a tree hashing project, which is designed for exactly this problem: off-the-shelf parallel hashing of large files. It works now, though it hasn't been reviewed, and there's a good chance that changes from review will result in changes to the final digest. That said, it's very fast: https://github.com/oconnor663/bao






You can use md5deep for this, and hashdeep for other hashes. It supports multi-threading with the -j flag. By default it will create a hashing thread for each core. It also has a flag to break files into pieces before hashing, but it will not use multiple threads on a single file. I've used this for getting the sha256 of half a million files and it worked great. It also has a recursive flag, which makes handling large directory trees easier.

Here is the manpage for it: http://md5deep.sourceforge.net/md5deep.html and the git repo: https://github.com/jessek/hashdeep

The package name in Ubuntu and Debian is md5deep, and it includes hashdeep.






  • From the man page I'd expect -j to support a thread per each file given, not its parts.
    – poige, Jul 3 '16 at 21:28


















It's easy to design a hashing algorithm that is scalable over multiple cores; it's just that the best-known hashing algorithms tend to be designed specifically to prevent that, so that tasks like finding hash collisions are made as slow as possible.

Hashing functions that don't force serial processing may suit you, but that depends on what properties you expect of your hashing function. As such, I don't think you've given enough information for a good recommendation to be made.

As others have suggested, you can construct a hashing function as the hash of the concatenated hashes of each of the blocks of some given size in the original. So long as the block size is large enough to make it difficult to reverse the hashes of individual blocks, this is likely to work well enough for most purposes. How big that should be depends on how predictable the content of those blocks is. If you are able to estimate the entropy, and choose a block size such that you get 128+ bits of entropy per block, that should be sufficient for most purposes (and overkill for many where security is not the primary concern).

From a security point of view, you are concerned about the degree of entropy at the block level, because otherwise finding a collision for a single block is enough to enable a malicious actor to substitute part of the content and get the same final hash.

It's perhaps worth noting that having a fixed block size means that the main weakness of MD5 is irrelevant: the attacker cannot append extra data to the block.

If your needs are about preventing naturally occurring hash collisions as opposed to malicious ones, then you can no doubt afford to use a much faster checksum function. Cryptographically secure hashes are typically designed to be slow to calculate.

A function from the Skein function group using the optional hash tree mode might suit you. Then again, CRC32 might be all you need.






  • I could throw in some stuff about the controversy around SHA-3 (i.e. Keccak) if you like, but perhaps you should just read about it on Wikipedia.
    – mc0e, Jul 4 '16 at 7:59











      Your Answer








      StackExchange.ready(function()
      var channelOptions =
      tags: "".split(" "),
      id: "2"
      ;
      initTagRenderer("".split(" "), "".split(" "), channelOptions);

      StackExchange.using("externalEditor", function()
      // Have to fire editor after snippets, if snippets enabled
      if (StackExchange.settings.snippets.snippetsEnabled)
      StackExchange.using("snippets", function()
      createEditor();
      );

      else
      createEditor();

      );

      function createEditor()
      StackExchange.prepareEditor(
      heartbeatType: 'answer',
      autoActivateHeartbeat: false,
      convertImagesToLinks: true,
      noModals: true,
      showLowRepImageUploadWarning: true,
      reputationToPostImages: 10,
      bindNavPrevention: true,
      postfix: "",
      imageUploader:
      brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
      contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
      allowUrls: true
      ,
      onDemand: true,
      discardSelector: ".discard-answer"
      ,immediatelyShowMarkdownHelp:true
      );



      );













      draft saved

      draft discarded


















      StackExchange.ready(
      function ()
      StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fserverfault.com%2fquestions%2f786338%2fhow-does-one-check-huge-files-identity-if-hashing-is-cpu-bound%23new-answer', 'question_page');

      );

      Post as a guest















      Required, but never shown

























      7 Answers
      7






      active

      oldest

      votes








      7 Answers
      7






      active

      oldest

      votes









      active

      oldest

      votes






      active

      oldest

      votes









      11














      Mine own best at the moment solution is:



      parallel --block=512M --pipepart -a …HUGEFILE… --progress --recend ''
      -k -j …NUMofProcessesSay4… md5sum | md5sum



      — It should be noted that:



      1. Resulting md5 hash isn't of the file, but rather of md5s of its parts but still it allows you to compare if replica is identical to origin

      2. It also doesn't perform very good, specially when you use pipe and not file as input


      3. parallel's --pipepart as I found out doesn't support disk partitions

      So I'd love to hear other ways around as well.






      share|improve this answer




















      • 3





        Git version of GNU Parallel now supports block devices (on GNU/Linux at least). Thanks for the idea.

        – Ole Tange
        Jul 4 '16 at 2:40






      • 1





        THUMb up! Thanks for maintaining the utility. ;)

        – poige
        Jul 4 '16 at 16:45











      • This should have been put as part of the question rather than as an answer.

        – mc0e
        Jul 5 '16 at 21:19















      11














      Mine own best at the moment solution is:



      parallel --block=512M --pipepart -a …HUGEFILE… --progress --recend ''
      -k -j …NUMofProcessesSay4… md5sum | md5sum



      — It should be noted that:



      1. Resulting md5 hash isn't of the file, but rather of md5s of its parts but still it allows you to compare if replica is identical to origin

      2. It also doesn't perform very good, specially when you use pipe and not file as input


      3. parallel's --pipepart as I found out doesn't support disk partitions

      So I'd love to hear other ways around as well.






      share|improve this answer




















      • 3





        Git version of GNU Parallel now supports block devices (on GNU/Linux at least). Thanks for the idea.

        – Ole Tange
        Jul 4 '16 at 2:40






      • 1





        THUMb up! Thanks for maintaining the utility. ;)

        – poige
        Jul 4 '16 at 16:45











      • This should have been put as part of the question rather than as an answer.

        – mc0e
        Jul 5 '16 at 21:19













      11












      11








      11







      Mine own best at the moment solution is:



      parallel --block=512M --pipepart -a …HUGEFILE… --progress --recend ''
      -k -j …NUMofProcessesSay4… md5sum | md5sum



      — It should be noted that:



      1. Resulting md5 hash isn't of the file, but rather of md5s of its parts but still it allows you to compare if replica is identical to origin

      2. It also doesn't perform very good, specially when you use pipe and not file as input


      3. parallel's --pipepart as I found out doesn't support disk partitions

      So I'd love to hear other ways around as well.






      share|improve this answer















      Mine own best at the moment solution is:



      parallel --block=512M --pipepart -a …HUGEFILE… --progress --recend ''
      -k -j …NUMofProcessesSay4… md5sum | md5sum



      — It should be noted that:



      1. Resulting md5 hash isn't of the file, but rather of md5s of its parts but still it allows you to compare if replica is identical to origin

      2. It also doesn't perform very good, specially when you use pipe and not file as input


      3. parallel's --pipepart as I found out doesn't support disk partitions

      So I'd love to hear other ways around as well.







      share|improve this answer














      share|improve this answer



      share|improve this answer








      edited Jun 26 '16 at 15:50

























      answered Jun 26 '16 at 14:49









      poigepoige

      7,09211437




      7,09211437







      • 3





        Git version of GNU Parallel now supports block devices (on GNU/Linux at least). Thanks for the idea.

        – Ole Tange
        Jul 4 '16 at 2:40






      • 1





        THUMb up! Thanks for maintaining the utility. ;)

        – poige
        Jul 4 '16 at 16:45











      • This should have been put as part of the question rather than as an answer.

        – mc0e
        Jul 5 '16 at 21:19












      • 3





        Git version of GNU Parallel now supports block devices (on GNU/Linux at least). Thanks for the idea.

        – Ole Tange
        Jul 4 '16 at 2:40






      • 1





        THUMb up! Thanks for maintaining the utility. ;)

        – poige
        Jul 4 '16 at 16:45











      • This should have been put as part of the question rather than as an answer.

        – mc0e
        Jul 5 '16 at 21:19







      3




      3





      Git version of GNU Parallel now supports block devices (on GNU/Linux at least). Thanks for the idea.

      – Ole Tange
      Jul 4 '16 at 2:40





      Git version of GNU Parallel now supports block devices (on GNU/Linux at least). Thanks for the idea.

      – Ole Tange
      Jul 4 '16 at 2:40




      1




      1





      THUMb up! Thanks for maintaining the utility. ;)

      – poige
      Jul 4 '16 at 16:45





      THUMb up! Thanks for maintaining the utility. ;)

      – poige
      Jul 4 '16 at 16:45













      This should have been put as part of the question rather than as an answer.

      – mc0e
      Jul 5 '16 at 21:19





      This should have been put as part of the question rather than as an answer.

      – mc0e
      Jul 5 '16 at 21:19













      3














      Unfortunately, MD5 is a linear process where its state depend on all the previous input. In other words, you can't truly parallelize it. Moreover, I'm not aware of any real hash alg that does not operate in this manner.



      What you can do (and, based on your answer, you are doing) is to split the source files and concurrently calculate each chunk's md5sum.



      If you can't/wan't do that, you had to use a faster hash function as xxHash, CityHash or SpookyHash



      Other idea (maybe it is applicable to your intented usage): if you need something faster than MD5 (albeit single-threaded), you can use CRC32 (which is hardware-accelerated by recent CPUs) for a first fast pass, resorting to MD5/SHA1 for a second pass on seemly identical files.






      share|improve this answer

























      • The most valuable part of your answer (others is just boring repeating) is the list of possibly faster hashes. Thanks anyways.

        – poige
        Jun 29 '16 at 13:39











      • BTW: stackoverflow.com/questions/10726325/…

        – poige
        Jun 29 '16 at 14:07











      • @poige Did you read the other comments in the thread you linked? In the context of a single-input hashing, MD5 can not be parallelized, simply because it is a linear process (ie: current state depends on previous inputs).

        – shodanshok
        Jun 29 '16 at 14:55











      • I won't consider this as correct answer. ;-P

        – poige
        Jul 1 '16 at 7:02











      • @shodanshok the issue with the second approach is that for a valid matching of a large directory with many large files, is that if all files are identical, you are adding a lot of overhead and still running md5sum on every file.

        – Jedi
        Jul 2 '16 at 16:54















      3














      Unfortunately, MD5 is a linear process where its state depend on all the previous input. In other words, you can't truly parallelize it. Moreover, I'm not aware of any real hash alg that does not operate in this manner.



      What you can do (and, based on your answer, you are doing) is to split the source files and concurrently calculate each chunk's md5sum.



      If you can't/wan't do that, you had to use a faster hash function as xxHash, CityHash or SpookyHash



      Other idea (maybe it is applicable to your intented usage): if you need something faster than MD5 (albeit single-threaded), you can use CRC32 (which is hardware-accelerated by recent CPUs) for a first fast pass, resorting to MD5/SHA1 for a second pass on seemly identical files.






      share|improve this answer

























      • The most valuable part of your answer (others is just boring repeating) is the list of possibly faster hashes. Thanks anyways.

        – poige
        Jun 29 '16 at 13:39











      • BTW: stackoverflow.com/questions/10726325/…

        – poige
        Jun 29 '16 at 14:07











      • @poige Did you read the other comments in the thread you linked? In the context of a single-input hashing, MD5 can not be parallelized, simply because it is a linear process (ie: current state depends on previous inputs).

        – shodanshok
        Jun 29 '16 at 14:55











      • I won't consider this as correct answer. ;-P

        – poige
        Jul 1 '16 at 7:02











      • @shodanshok the issue with the second approach is that for a valid matching of a large directory with many large files, is that if all files are identical, you are adding a lot of overhead and still running md5sum on every file.

        – Jedi
        Jul 2 '16 at 16:54













      3












      3








      3







      Unfortunately, MD5 is a linear process where its state depend on all the previous input. In other words, you can't truly parallelize it. Moreover, I'm not aware of any real hash alg that does not operate in this manner.



      What you can do (and, based on your answer, you are doing) is to split the source files and concurrently calculate each chunk's md5sum.



      If you can't/wan't do that, you had to use a faster hash function as xxHash, CityHash or SpookyHash



      Other idea (maybe it is applicable to your intented usage): if you need something faster than MD5 (albeit single-threaded), you can use CRC32 (which is hardware-accelerated by recent CPUs) for a first fast pass, resorting to MD5/SHA1 for a second pass on seemly identical files.






      share|improve this answer















      Unfortunately, MD5 is a linear process where its state depend on all the previous input. In other words, you can't truly parallelize it. Moreover, I'm not aware of any real hash alg that does not operate in this manner.



      What you can do (and, based on your answer, you are doing) is to split the source files and concurrently calculate each chunk's md5sum.



      If you can't/wan't do that, you had to use a faster hash function as xxHash, CityHash or SpookyHash



      Other idea (maybe it is applicable to your intented usage): if you need something faster than MD5 (albeit single-threaded), you can use CRC32 (which is hardware-accelerated by recent CPUs) for a first fast pass, resorting to MD5/SHA1 for a second pass on seemly identical files.







      share|improve this answer














      share|improve this answer



      share|improve this answer








      edited Jun 29 '16 at 12:40

























      answered Jun 29 '16 at 12:02









      shodanshokshodanshok

      26.6k34787




      26.6k34787












      • The most valuable part of your answer (others is just boring repeating) is the list of possibly faster hashes. Thanks anyways.

        – poige
        Jun 29 '16 at 13:39











      • BTW: stackoverflow.com/questions/10726325/…

        – poige
        Jun 29 '16 at 14:07











      • @poige Did you read the other comments in the thread you linked? In the context of single-input hashing, MD5 cannot be parallelized, simply because it is a linear process (i.e. the current state depends on previous inputs).

        – shodanshok
        Jun 29 '16 at 14:55











      • I won't consider this as correct answer. ;-P

        – poige
        Jul 1 '16 at 7:02











      • @shodanshok the issue with the second approach is that, for a valid match of a large directory with many large files, if all the files are identical you are adding a lot of overhead and still running md5sum on every file.

        – Jedi
        Jul 2 '16 at 16:54


















      2














      There's pretty much no getting around processing the entire file. MD4 or CRC32 are probably your best bets for a widely deployed and fast algorithm (though CRC32 is going to be far less effective than MD4).



      Testing various implementations of your algorithm of choice will help. If you can find a well-tested asm implementation, it will likely improve upon the performance of its C/C++ cousins.



      If you don't really care about interoperability, hashing across multiple cores is easily doable by splitting the file into chunks (this doesn't need to be done on-disk; you'd just start reading from specific offsets) and processing each chunk separately (this will result in serious disk thrashing though, degrading performance, especially for mechanical disks). You'll end up with a separate hash for each chunk (though this has other advantages, like pointing you toward the broken chunk), but you could always hash them together for one final value.



      This Gist could be a good start for something in Python.
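      A rough sketch of the offset-based approach (names are mine, not from the Gist): each worker seeks to its chunk's offset and hashes only that range, then the per-chunk digests are hashed together for one final value. Threads suffice here because CPython's hashlib releases the GIL while digesting large buffers.

```python
import hashlib
import os
from concurrent.futures import ThreadPoolExecutor

READ_UNIT = 1 << 20  # 1 MiB reads within a chunk

def hash_range(path, offset, length):
    """MD5 one region of the file by seeking to its offset -- no on-disk split."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        f.seek(offset)
        remaining = length
        while remaining > 0:
            block = f.read(min(READ_UNIT, remaining))
            if not block:
                break
            h.update(block)
            remaining -= len(block)
    return h.hexdigest()

def parallel_hash(path, workers=4):
    """Hash `workers` regions concurrently, then hash the digests together."""
    size = os.path.getsize(path)
    per = -(-size // workers)  # ceiling division
    ranges = [(i * per, min(per, size - i * per))
              for i in range(workers) if i * per < size]
    with ThreadPoolExecutor(max_workers=workers) as ex:
        digests = list(ex.map(lambda r: hash_range(path, *r), ranges))
    # One final value over the per-chunk hashes (NOT equal to a plain md5 of the file).
    return hashlib.md5("".join(digests).encode()).hexdigest()
```

      Note the interoperability caveat from the answer: the final value depends on the chunk layout, so `md5sum` of the same file will not produce it.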







      • split does indeed come to mind but, alas, when we talk about huge files (as we do) it's not an option.

        – poige
        Jun 26 '16 at 13:24











      • @poige As I said, you wouldn't do it on disk, just start hashing the file from specific offsets and stopping at the start of the next chunk.

        – Gary
        Jun 26 '16 at 13:27











      • Yeah, but that's the theory which is quite obvious; anything practical?

        – poige
        Jun 26 '16 at 14:29











      • @poige From your question I can't guess at why that approach would be impractical. Perhaps there's a constraint you forgot to include?

        – Gary
        Jun 26 '16 at 14:33






      • 2





        I didn't say it's impractical; but it's not practical because you don't have anything in your answer that can be used right away. E.g., this is a practical answer: serverfault.com/questions/488486/…

        – poige
        Jun 26 '16 at 14:46

















      answered Jun 26 '16 at 13:22









      Gary

      1044




      0














      Most of the answers here have addressed the linear nature of most hashing algorithms. Although I'm sure there exist some truly scalable hashing algorithms, an easier solution is to simply split the data up into smaller pieces and hash each individually.



      Consider the BitTorrent approach: when a torrent is created, all of the files are split into 'blocks', each block is individually hashed, and each of those hashes is recorded in the .torrent file. This is what allows a peer to incrementally verify incoming data without having to wait for the entire file to finish downloading first. Errors can also be corrected on a per-block basis, instead of requiring re-transmission of the entire file. Aside from the logistical benefits, this approach also allows hashing to scale across multiple cores - if 8 cores are available, 8 blocks can be simultaneously hashed.



      If you engineer your verification process to work on some subset of the data, e.g. blocks of some fixed size, you can hash each block on a separate core, thus eliminating a large amount of delay in the pipeline. Obviously, this approach has a small time/memory trade-off: each additional instance of hashing has some overhead associated with it, mostly in the form of memory, although this is minimal unless you're running hundreds of instances.
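      The block-list idea above can be sketched in a few lines (the 256 KiB piece length and SHA-1 mirror common BitTorrent choices; the function names are mine). A mismatch on one block pinpoints exactly what to re-read or re-transfer, without touching the rest of the file.

```python
import hashlib

BLOCK = 256 * 1024  # a common BitTorrent piece length; any fixed size works

def block_hashes(path):
    """SHA-1 of each fixed-size block, like the 'pieces' list in a .torrent."""
    hashes = []
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(BLOCK), b""):
            hashes.append(hashlib.sha1(block).hexdigest())
    return hashes

def verify_block(path, index, expected):
    """Re-check one block only -- a mismatch pinpoints what to re-transfer."""
    with open(path, "rb") as f:
        f.seek(index * BLOCK)
        return hashlib.sha1(f.read(BLOCK)).hexdigest() == expected
```

      Feeding the blocks to a worker pool (one block per core) is then a straightforward extension.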








          answered Jul 3 '16 at 19:22









          tfrederick74656

          1,27211027





              0














              I'm working on a tree hashing project, which is designed for exactly this problem: off-the-shelf parallel hashing of large files. It works now, though it hasn't been reviewed, and there's a good chance that changes from review will result in changes to the final digest. That said, it's very fast: https://github.com/oconnor663/bao






                  answered Oct 1 '18 at 17:14









                  Jack O'Connor

                  173128





                      -1














                      You can use md5deep for this, and hashdeep for other hashes. It supports multi-threading with the -j flag. By default it will create one hashing thread per core. It also has a flag to break files into pieces before hashing, but it will not use multiple threads on a single file. I've used this for getting the sha256 of half a million files and it worked great. It also has a recursive flag which makes handling large directory trees easier.



                      Here is the manpage for it http://md5deep.sourceforge.net/md5deep.html and the git repo https://github.com/jessek/hashdeep



                      The package name in Ubuntu and Debian is md5deep, and it includes hashdeep.
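                      md5deep itself is the tool to reach for here; as a rough Python sketch of its thread-per-file model (recursion like -r, concurrency like -j; the function names are mine), something like this captures the idea:

```python
import hashlib
import os
from concurrent.futures import ThreadPoolExecutor

def sha256_file(path):
    """Hash one file in streaming fashion, returning (path, digest)."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return path, h.hexdigest()

def hash_tree(root, workers=None):
    """Walk the tree recursively and hash files concurrently, one per worker."""
    paths = [os.path.join(d, name)
             for d, _dirs, files in os.walk(root) for name in files]
    with ThreadPoolExecutor(max_workers=workers or os.cpu_count()) as ex:
        return dict(ex.map(sha256_file, paths))
```

                      Like md5deep's -j, this parallelizes across files, not within a single file, so it helps with many files but not with one huge one.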







                      • From the man page I'd expect -j to use a thread per file given, not per part of a file.

                        – poige
                        Jul 3 '16 at 21:28















                      answered Jul 3 '16 at 18:47









                      Jason

                      214




                      -1














                      It's easy to design a hashing algorithm that is scalable over multiple cores; it's just that the best-known hashing algorithms tend to be designed specifically to prevent that, so that tasks like finding hash collisions are made as slow as possible.



                      Hashing functions that don't force serial processing may suit you, but that depends what properties you expect of your hashing function. As such, I don't think you've given enough information for a good recommendation to be made.



                      As others have suggested, you can construct a hashing function as the hash of the concatenated hashes of each of the blocks of some given size in the original. So long as the block size is large enough to make it difficult to reverse the hashes of individual blocks, this is likely to work well enough for most purposes. How big that should be depends on how predictable the content of those blocks is. If you are able to estimate the entropy, and choose a block size such that you get 128+ bits of entropy per block, that should be sufficient for most purposes (and overkill for many where security is not the primary concern).



                      From a security point of view, you are concerned about the degree of entropy at the block level, because otherwise finding a collision for a single block is enough to enable a malicious actor to substitute part of the content and get the same final hash.



                      It's perhaps worth noting that having a fixed block size means that the main weakness of MD5 is irrelevant - the attacker cannot append extra data to a block.



                      If your needs are about preventing naturally occurring hash collisions as opposed to malicious ones, then you can no doubt afford to use a much faster checksum function. Cryptographically secure hashes are typically designed to be slower to calculate.



                      A function from the Skein family using the optional hash-tree mode might suit you. Then again, CRC32 might be all you need.
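                      The hash-of-concatenated-block-hashes construction described above is compact in Python (a sketch; SHA-256 and the function name are my choices). Note that the result depends on the block size and is not the same as a plain hash of the whole file.

```python
import hashlib

def hash_of_block_hashes(path, block_size=1 << 20):
    """Final digest = hash over the concatenated per-block digests."""
    outer = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(block_size), b""):
            # Feed each block's digest (not the block itself) to the outer hash.
            outer.update(hashlib.sha256(block).digest())
    return outer.hexdigest()
```

                      The inner per-block hashes are the part that parallelizes; the outer hash only sees a few digests per gigabyte, so it is negligible work.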






                      • I could throw in some stuff about the controversy around SHA-3 (i.e. Keccak) if you like, but perhaps you should just read it on Wikipedia.

                        – mc0e
                        Jul 4 '16 at 7:59















                      -1














                      It's easy to design a hashing algorithm that is scalable over multiple cores, it's just that the best known hashing algorithms tend to be designed specifically to prevent that, in order that tasks like finding hash collisions are made as slow as possible.



                      Hashing functions that don't force serial processing may suit you, but that depends what properties you expect of your hashing function. As such, I don't think that you've given enough information for a good recommendation to be made.



                      As others have suggested, you can construct a hashing function as the hash of the concatenated hashes of each of the blocks of some given size in the original. So long as the block size is large enough to make it difficult to reverse the hashes of individual blocks, this is likely to work well enough for most purposes. How big that should be depends on how predictable the content of those blocks is. If you are able to estimate the entropy, and choose a block size such that you get 128+ bits of entropy per block, that should be sufficient for most purposes (and overkill for many where security is not the primary concern).



                      From a security point of view, you are concerned about the degree of entropy at the block level, because otherwise finding a collision for a single block is enough to enable a malicious actor to substitute part of the content and get the same final hash.



                      It's perhaps worth noting that having a fixed block size means that the main weakness of MD5s is irrelevant - the hacker cannot append extra data to the block.



                      If your needs are about preventing naturally occurring hash collisions as opposed to malicious ones, then you can no doubt afford to use a much faster checksum function. Cryptographically secure hashes are typically designed to be slow to calculate.



                      A function from the skein function group using the optional hash tree mode might suit you. Then again, CRC32 might be all you need.






                      share|improve this answer

























                      • I could throw in some stuff about the controversy around SHA-3 (i.e.Keccak) if you like, but perhaps you should just read it on wikipedia.

                        – mc0e
                        Jul 4 '16 at 7:59













                      -1












                      -1








                      -1







                      It's easy to design a hashing algorithm that is scalable over multiple cores, it's just that the best known hashing algorithms tend to be designed specifically to prevent that, in order that tasks like finding hash collisions are made as slow as possible.



                      Hashing functions that don't force serial processing may suit you, but that depends on what properties you expect of your hashing function. As it stands, I don't think you've given enough information for a good recommendation to be made.



                      As others have suggested, you can construct a hashing function as the hash of the concatenated hashes of each of the blocks of some given size in the original. So long as the block size is large enough to make it difficult to reverse the hashes of individual blocks, this is likely to work well enough for most purposes. How big that should be depends on how predictable the content of those blocks is. If you are able to estimate the entropy, and choose a block size such that you get 128+ bits of entropy per block, that should be sufficient for most purposes (and overkill for many where security is not the primary concern).
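The block-wise construction above can be sketched in a few lines. This is a minimal illustration, not a standard scheme: the function names, the 8 MiB block size, and the choice of SHA-256 are all assumptions for the example. Threads are enough for real parallelism here because CPython's `hashlib` releases the GIL while digesting large buffers.

```python
# Sketch of the scheme described above: hash fixed-size blocks in
# parallel, then take a hash of the concatenated per-block digests.
import hashlib
from concurrent.futures import ThreadPoolExecutor

BLOCK_SIZE = 8 * 1024 * 1024  # 8 MiB per block; tune to your data's entropy

def _hash_block(block: bytes) -> bytes:
    # hashlib releases the GIL on large inputs, so threads run these
    # digests concurrently on separate cores.
    return hashlib.sha256(block).digest()

def parallel_file_hash(path: str, workers: int = 4) -> str:
    """Final hash over the concatenation of the per-block hashes."""
    def blocks():
        with open(path, "rb") as f:
            while chunk := f.read(BLOCK_SIZE):
                yield chunk
    with ThreadPoolExecutor(max_workers=workers) as pool:
        digests = pool.map(_hash_block, blocks())
        return hashlib.sha256(b"".join(digests)).hexdigest()
```

Note that `ThreadPoolExecutor.map` submits all blocks eagerly, so for truly huge files you would want to bound how many blocks are in flight at once; that bookkeeping is omitted to keep the sketch short.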



                      From a security point of view, you are concerned about the degree of entropy at the block level, because otherwise finding a collision for a single block is enough to enable a malicious actor to substitute part of the content and get the same final hash.



                      It's perhaps worth noting that a fixed block size makes the main structural weakness of MD5 (the length-extension attack) irrelevant here: an attacker cannot append extra data to a block.



                      If your needs are about preventing naturally occurring hash collisions as opposed to malicious ones, then you can no doubt afford to use a much faster checksum function. Cryptographically secure hashes are typically designed to be slow to calculate.



                      A function from the Skein hash family, using its optional hash-tree mode, might suit you. Then again, CRC32 might be all you need.
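For the non-adversarial case, a streaming CRC32 costs a small fraction of what a cryptographic hash does. A minimal sketch (the function name and the 1 MiB read size are illustrative choices):

```python
# Fast non-cryptographic checksum of a file, suitable only when
# collisions need to be accidental-proof, not attacker-proof.
import zlib

def crc32_file(path: str, chunk_size: int = 1 << 20) -> int:
    crc = 0
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            crc = zlib.crc32(chunk, crc)  # fold each chunk into the running CRC
    return crc & 0xFFFFFFFF  # normalize to an unsigned 32-bit value
```

Because `zlib.crc32` accepts a running value, this processes the file in constant memory regardless of its size.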






                      share|improve this answer















                      It's easy to design a hashing algorithm that is scalable over multiple cores, it's just that the best known hashing algorithms tend to be designed specifically to prevent that, in order that tasks like finding hash collisions are made as slow as possible.



                      Hashing functions that don't force serial processing may suit you, but that depends what properties you expect of your hashing function. As such, I don't think that you've given enough information for a good recommendation to be made.



                      As others have suggested, you can construct a hashing function as the hash of the concatenated hashes of each of the blocks of some given size in the original. So long as the block size is large enough to make it difficult to reverse the hashes of individual blocks, this is likely to work well enough for most purposes. How big that should be depends on how predictable the content of those blocks is. If you are able to estimate the entropy, and choose a block size such that you get 128+ bits of entropy per block, that should be sufficient for most purposes (and overkill for many where security is not the primary concern).



                      From a security point of view, you are concerned about the degree of entropy at the block level, because otherwise finding a collision for a single block is enough to enable a malicious actor to substitute part of the content and get the same final hash.



                      It's perhaps worth noting that having a fixed block size means that the main weakness of MD5s is irrelevant - the hacker cannot append extra data to the block.



                      If your needs are about preventing naturally occurring hash collisions as opposed to malicious ones, then you can no doubt afford to use a much faster checksum function. Cryptographically secure hashes are typically designed to be slow to calculate.



                      A function from the skein function group using the optional hash tree mode might suit you. Then again, CRC32 might be all you need.







                      share|improve this answer














                      share|improve this answer



                      share|improve this answer








                      edited Jul 4 '16 at 7:31

                      answered Jul 4 '16 at 7:01

                      – mc0e

                      • I could throw in some stuff about the controversy around SHA-3 (i.e. Keccak) if you like, but perhaps you should just read about it on Wikipedia.

                        – mc0e
                        Jul 4 '16 at 7:59
















