How does one check huge files identity if hashing is CPU bound? The Next CEO of Stack OverflowDebian - Port 80 is blocked, but I don't know by whatSHA Hash of large files in WindowsVirtualized CPU cores vs. threadsSQL server availability issue: large query stops other connections from connectingrsync with multi-core server should go faster than it goes. Am I wrong?Performance implications of CPU/cores rationumber of cores available on a serverDell 2970 Server dual Opteron 2435's_Ubuntu OS incorrectly reporting CPU'sHow does CPU affinity interact with cgroups in Linux?RHEL: 4 Cores available but only 1 usedMPI job hangs on one specific node onlyScaling Apache/PHP is as simple as maximizing the number of CPU cores?

Opposite of a diet

What is the point of a new vote on May's deal when the indicative votes suggest she will not win?

Why were Madagascar and New Zealand discovered so late?

Example of a Mathematician/Physicist whose Other Publications during their PhD eclipsed their PhD Thesis

How to use tikz in fbox?

Was a professor correct to chastise me for writing "Prof. X" rather than "Professor X"?

Can a single photon have an energy density?

Why is there a PLL in CPU?

When Does an Atlas Uniquely Define a Manifold?

Only print output after finding pattern

If the heap is initialized for security, then why is the stack uninitialized?

Text adventure game code

Need some help with wall behind rangetop

MAZDA 3 2006 (UK) - poor acceleration then takes off at 3250 revs

How do scammers retract money, while you can’t?

What does this shorthand mean?

Trouble understanding the speech of overseas colleagues

What makes a siege story/plot interesting?

Science fiction (dystopian) short story set after WWIII

WOW air has ceased operation, can I get my tickets refunded?

Whats the best way to handle refactoring a big file?

I believe this to be a fraud - hired, then asked to cash check and send cash as Bitcoin

How to get regions to plot as graphics

Is it safe to use c_str() on a temporary string?

How does one check huge files identity if hashing is CPU bound?

The Next CEO of Stack OverflowDebian - Port 80 is blocked, but I don't know by whatSHA Hash of large files in WindowsVirtualized CPU cores vs. threadsSQL server availability issue: large query stops other connections from connectingrsync with multi-core server should go faster than it goes. Am I wrong?Performance implications of CPU/cores rationumber of cores available on a serverDell 2970 Server dual Opteron 2435's_Ubuntu OS incorrectly reporting CPU'sHow does CPU affinity interact with cgroups in Linux?RHEL: 4 Cores available but only 1 usedMPI job hangs on one specific node onlyScaling Apache/PHP is as simple as maximizing the number of CPU cores?

For small files hashing is just ok, but with huge ones you can easily find md5sum is CPU bound. Is there any hashing algorithm able to scale out on multiple cores? Any workarounds? Ideas? Anything? :)

edited 19 hours ago

asked Jun 26 '16 at 10:59

poige

7,09211437

1

Huges ones is plural and can be scaled out to multiple cores by hashing more than one file at a time. One way to do that in the shell is using GNU Parallel

– Brian
Jun 26 '16 at 16:27

In my experience on hashing, bound is disk IO. At least for desktop. Besides of it, in big task usually many files can be hashed. So some files can be hashed in parallel.

– mmv-ru
Jun 29 '16 at 0:08

did you tried fciv.exe in windows to compare if better suited to fit on multicore cpu ? (support.microsoft.com/en-ca/kb/841290)

– yagmoth555♦
Jun 29 '16 at 12:54

@yagmoth555, nope. I rarely use Windows, mostly I don't use it, I'd say. From the description it looks unlikely it scales on SMP.

– poige
Jun 29 '16 at 13:37

add a comment |

edited 19 hours ago

asked Jun 26 '16 at 10:59

poige

7,09211437

1

Huges ones is plural and can be scaled out to multiple cores by hashing more than one file at a time. One way to do that in the shell is using GNU Parallel

– Brian
Jun 26 '16 at 16:27

In my experience on hashing, bound is disk IO. At least for desktop. Besides of it, in big task usually many files can be hashed. So some files can be hashed in parallel.

– mmv-ru
Jun 29 '16 at 0:08

did you tried fciv.exe in windows to compare if better suited to fit on multicore cpu ? (support.microsoft.com/en-ca/kb/841290)

– yagmoth555♦
Jun 29 '16 at 12:54

@yagmoth555, nope. I rarely use Windows, mostly I don't use it, I'd say. From the description it looks unlikely it scales on SMP.

– poige
Jun 29 '16 at 13:37

add a comment |

edited 19 hours ago

asked Jun 26 '16 at 10:59

poige

7,09211437

multi-core hash big-data

edited 19 hours ago

asked Jun 26 '16 at 10:59

poige

7,09211437

edited 19 hours ago

asked Jun 26 '16 at 10:59

poige

7,09211437

edited 19 hours ago

asked Jun 26 '16 at 10:59

poige

7,09211437

asked Jun 26 '16 at 10:59

poige

7,09211437

asked Jun 26 '16 at 10:59

poige

7,09211437

1

Huges ones is plural and can be scaled out to multiple cores by hashing more than one file at a time. One way to do that in the shell is using GNU Parallel

– Brian
Jun 26 '16 at 16:27

In my experience on hashing, bound is disk IO. At least for desktop. Besides of it, in big task usually many files can be hashed. So some files can be hashed in parallel.

– mmv-ru
Jun 29 '16 at 0:08

did you tried fciv.exe in windows to compare if better suited to fit on multicore cpu ? (support.microsoft.com/en-ca/kb/841290)

– yagmoth555♦
Jun 29 '16 at 12:54

@yagmoth555, nope. I rarely use Windows, mostly I don't use it, I'd say. From the description it looks unlikely it scales on SMP.

– poige
Jun 29 '16 at 13:37

add a comment |

1

Huges ones is plural and can be scaled out to multiple cores by hashing more than one file at a time. One way to do that in the shell is using GNU Parallel

– Brian
Jun 26 '16 at 16:27

In my experience on hashing, bound is disk IO. At least for desktop. Besides of it, in big task usually many files can be hashed. So some files can be hashed in parallel.

– mmv-ru
Jun 29 '16 at 0:08

did you tried fciv.exe in windows to compare if better suited to fit on multicore cpu ? (support.microsoft.com/en-ca/kb/841290)

– yagmoth555♦
Jun 29 '16 at 12:54

@yagmoth555, nope. I rarely use Windows, mostly I don't use it, I'd say. From the description it looks unlikely it scales on SMP.

– poige
Jun 29 '16 at 13:37

Huges ones is plural and can be scaled out to multiple cores by hashing more than one file at a time. One way to do that in the shell is using GNU Parallel

– Brian
Jun 26 '16 at 16:27

In my experience on hashing, bound is disk IO. At least for desktop. Besides of it, in big task usually many files can be hashed. So some files can be hashed in parallel.

– mmv-ru
Jun 29 '16 at 0:08

did you tried fciv.exe in windows to compare if better suited to fit on multicore cpu ? (support.microsoft.com/en-ca/kb/841290)

– yagmoth555♦
Jun 29 '16 at 12:54

@yagmoth555, nope. I rarely use Windows, mostly I don't use it, I'd say. From the description it looks unlikely it scales on SMP.

– poige
Jun 29 '16 at 13:37

add a comment |

7 Answers
7

active

oldest

votes

Mine own best at the moment solution is:

parallel --block=512M --pipepart -a …HUGEFILE… --progress --recend '' -k -j …NUMofProcessesSay4… md5sum | md5sum

— It should be noted that:

Resulting md5 hash isn't of the file, but rather of md5s of its parts but still it allows you to compare if replica is identical to origin

It also doesn't perform very good, specially when you use pipe and not file as input

parallel's --pipepart as I found out doesn't support disk partitions

So I'd love to hear other ways around as well.

edited Jun 26 '16 at 15:50

answered Jun 26 '16 at 14:49

poige

7,09211437

3

Git version of GNU Parallel now supports block devices (on GNU/Linux at least). Thanks for the idea.

– Ole Tange
Jul 4 '16 at 2:40

1

THUMb up! Thanks for maintaining the utility. ;)

– poige
Jul 4 '16 at 16:45

This should have been put as part of the question rather than as an answer.

– mc0e
Jul 5 '16 at 21:19

add a comment |

Unfortunately, MD5 is a linear process where its state depend on all the previous input. In other words, you can't truly parallelize it. Moreover, I'm not aware of any real hash alg that does not operate in this manner.

What you can do (and, based on your answer, you are doing) is to split the source files and concurrently calculate each chunk's md5sum.

If you can't/wan't do that, you had to use a faster hash function as xxHash, CityHash or SpookyHash

Other idea (maybe it is applicable to your intented usage): if you need something faster than MD5 (albeit single-threaded), you can use CRC32 (which is hardware-accelerated by recent CPUs) for a first fast pass, resorting to MD5/SHA1 for a second pass on seemly identical files.

edited Jun 29 '16 at 12:40

answered Jun 29 '16 at 12:02

shodanshok

26.6k34787

The most valuable part of your answer (others is just boring repeating) is the list of possibly faster hashes. Thanks anyways.

– poige
Jun 29 '16 at 13:39

BTW: stackoverflow.com/questions/10726325/…

– poige
Jun 29 '16 at 14:07

@poige Did you read the other comments in the thread you linked? In the context of a single-input hashing, MD5 can not be parallelized, simply because it is a linear process (ie: current state depends on previous inputs).

– shodanshok
Jun 29 '16 at 14:55

I won't consider this as correct answer. ;-P

– poige
Jul 1 '16 at 7:02

@shodanshok the issue with the second approach is that for a valid matching of a large directory with many large files, is that if all files are identical, you are adding a lot of overhead and still running md5sum on every file.

– Jedi
Jul 2 '16 at 16:54

|
show 1 more comment

There's pretty much no getting around processing the entire file. MD4 or CRC32 are probably your best bets for a widely deployed and fast algorithm (though CRC32 is going to be far less effective than MD4).

Testing various implementation of your algorithm of choice will help. If you can find a well-tested asm implementation, it will likely improve upon the performance of its C/C++ cousins.

If you don't really care about interoperability, hashing across multiple cores is easily doable by splitting the file into chunks (doesn't need to be done on-disk, you'd just start reading from specific offsets) and processing each chunk separately (this will result in serious disk thrashing though, degrading the performance, especially for mechanical disks). You'll end up with separate hashes for each chunk (though this has other advantages, like pointing you toward the broken chunk) but you could always hash them together for one final value.

This Gist could be a good start for something in Python.

answered Jun 26 '16 at 13:22

Gary

1044

split does indeed comes to one's mind, but, alas when we talk about huge files (as we do) it's not an option.

– poige
Jun 26 '16 at 13:24

@poige As I said, you wouldn't do it on disk, just start hashing the file from specific offsets and stopping at the start of the next chunk.

– Gary
Jun 26 '16 at 13:27

Yeah, but that's the theory which is quite obvious; anything practical?

– poige
Jun 26 '16 at 14:29

@poige From your question I can't guess at why that approach would be impractical. Perhaps there's a constraint you forgot to include?

– Gary
Jun 26 '16 at 14:33

2

I didn't say it's impractical; but it's not practical cause you don't have anything in your answer that can be used right away. For e. g.: this is practical answer: serverfault.com/questions/488486/…

– poige
Jun 26 '16 at 14:46

|
show 1 more comment

Most of the answers here have addressed the linear nature of most hashing algorithms. Although I'm sure there exists some true scalable hashing algorithms, an easier solution is to simply split the data up into smaller pieces, and hash each individually.

Consider the BitTorrent approach: When a Torrent is created, all of the files are split into 'blocks', each block individually hashed, and each of those hashes recorded in the .torrent file. This is what allows a peer to incrementally verify incoming data, without having to wait for the entire file to finish downloading first. Errors can also be corrected on a per-block basis, instead of requiring re-transmission of the entire file. Aside from the logistical benefits, this approach also allows hashing to scale across multiple cores - if 8 cores are available, 8 blocks can be simultaneously hashed.

If you engineer your verification process to work on some subset of the data, e.g. blocks of some fixed size, you can hash each block on a separate core, thus eliminating a large amount of delay in the pipeline. Obviously, this approach has a small time/memory trade-off: each additional instance of hashing has some overhead associated with it, mostly in the form of memory, although this is minimal unless you're running hundreds of instances.

answered Jul 3 '16 at 19:22

tfrederick74656

1,27211027

add a comment |

I'm working on a tree hashing project, which is designed for exactly this problem: off-the-shelf parallel hashing of large files. It works now, though it hasn't been reviewed, and there's a good chance that changes from review will result in changes to the final digest. That said, it's very fast: https://github.com/oconnor663/bao

answered Oct 1 '18 at 17:14

Jack O'Connor

173128

add a comment |

-1

You can use md5deep for this and hashdeep for other hashes. It supports multi threading with the -j flag. By default it will create a hashing thread for each core. It also has a flag to break files into pieces before hashing but will not use multiple threads on a singe file. I've used this for getting sha256 of half a million files and it worked great. It also has a recursive flash which makes handling large directory trees easier.

Here is the manpage for it http://md5deep.sourceforge.net/md5deep.html and git repo https://github.com/jessek/hashdeep

The package name in ubuntu and debian is md5deep and includes hashdeep.

answered Jul 3 '16 at 18:47

Jason

214

From the man page I'd expect -j to support thread per each file given, not its parts.

– poige
Jul 3 '16 at 21:28

add a comment |

-1

It's easy to design a hashing algorithm that is scalable over multiple cores, it's just that the best known hashing algorithms tend to be designed specifically to prevent that, in order that tasks like finding hash collisions are made as slow as possible.

Hashing functions that don't force serial processing may suit you, but that depends what properties you expect of your hashing function. As such, I don't think that you've given enough information for a good recommendation to be made.

As others have suggested, you can construct a hashing function as the hash of the concatenated hashes of each of the blocks of some given size in the original. So long as the block size is large enough to make it difficult to reverse the hashes of individual blocks, this is likely to work well enough for most purposes. How big that should be depends on how predictable the content of those blocks is. If you are able to estimate the entropy, and choose a block size such that you get 128+ bits of entropy per block, that should be sufficient for most purposes (and overkill for many where security is not the primary concern).

From a security point of view, you are concerned about the degree of entropy at the block level, because otherwise finding a collision for a single block is enough to enable a malicious actor to substitute part of the content and get the same final hash.

It's perhaps worth noting that having a fixed block size means that the main weakness of MD5s is irrelevant - the hacker cannot append extra data to the block.

If your needs are about preventing naturally occurring hash collisions as opposed to malicious ones, then you can no doubt afford to use a much faster checksum function. Cryptographically secure hashes are typically designed to be slow to calculate.

A function from the skein function group using the optional hash tree mode might suit you. Then again, CRC32 might be all you need.

edited Jul 4 '16 at 7:31

answered Jul 4 '16 at 7:01

mc0e

5,3541127

I could throw in some stuff about the controversy around SHA-3 (i.e.Keccak) if you like, but perhaps you should just read it on wikipedia.

– mc0e
Jul 4 '16 at 7:59

add a comment |

Your Answer

StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "2"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);

else
createEditor();

);

function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader:
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
,
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);

);

draft saved

draft discarded

StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fserverfault.com%2fquestions%2f786338%2fhow-does-one-check-huge-files-identity-if-hashing-is-cpu-bound%23new-answer', 'question_page');

);

Post as a guest

Name

Required, but never shown

7 Answers
7

active

oldest

votes

7 Answers
7

active

oldest

votes

Mine own best at the moment solution is:

parallel --block=512M --pipepart -a …HUGEFILE… --progress --recend '' -k -j …NUMofProcessesSay4… md5sum | md5sum

— It should be noted that:

Resulting md5 hash isn't of the file, but rather of md5s of its parts but still it allows you to compare if replica is identical to origin

It also doesn't perform very good, specially when you use pipe and not file as input

parallel's --pipepart as I found out doesn't support disk partitions

So I'd love to hear other ways around as well.

edited Jun 26 '16 at 15:50

answered Jun 26 '16 at 14:49

poige

7,09211437

3

Git version of GNU Parallel now supports block devices (on GNU/Linux at least). Thanks for the idea.

– Ole Tange
Jul 4 '16 at 2:40

1

THUMb up! Thanks for maintaining the utility. ;)

– poige
Jul 4 '16 at 16:45

This should have been put as part of the question rather than as an answer.

– mc0e
Jul 5 '16 at 21:19

add a comment |

Mine own best at the moment solution is:

parallel --block=512M --pipepart -a …HUGEFILE… --progress --recend '' -k -j …NUMofProcessesSay4… md5sum | md5sum

— It should be noted that:

Resulting md5 hash isn't of the file, but rather of md5s of its parts but still it allows you to compare if replica is identical to origin

It also doesn't perform very good, specially when you use pipe and not file as input

parallel's --pipepart as I found out doesn't support disk partitions

So I'd love to hear other ways around as well.

edited Jun 26 '16 at 15:50

answered Jun 26 '16 at 14:49

poige

7,09211437

3

Git version of GNU Parallel now supports block devices (on GNU/Linux at least). Thanks for the idea.

– Ole Tange
Jul 4 '16 at 2:40

1

THUMb up! Thanks for maintaining the utility. ;)

– poige
Jul 4 '16 at 16:45

This should have been put as part of the question rather than as an answer.

– mc0e
Jul 5 '16 at 21:19

add a comment |

Mine own best at the moment solution is:

parallel --block=512M --pipepart -a …HUGEFILE… --progress --recend '' -k -j …NUMofProcessesSay4… md5sum | md5sum

— It should be noted that:

Resulting md5 hash isn't of the file, but rather of md5s of its parts but still it allows you to compare if replica is identical to origin

It also doesn't perform very good, specially when you use pipe and not file as input

parallel's --pipepart as I found out doesn't support disk partitions

So I'd love to hear other ways around as well.

edited Jun 26 '16 at 15:50

answered Jun 26 '16 at 14:49

poige

7,09211437

Mine own best at the moment solution is:

parallel --block=512M --pipepart -a …HUGEFILE… --progress --recend '' -k -j …NUMofProcessesSay4… md5sum | md5sum

— It should be noted that:

Resulting md5 hash isn't of the file, but rather of md5s of its parts but still it allows you to compare if replica is identical to origin

It also doesn't perform very good, specially when you use pipe and not file as input

parallel's --pipepart as I found out doesn't support disk partitions

So I'd love to hear other ways around as well.

edited Jun 26 '16 at 15:50

answered Jun 26 '16 at 14:49

poige

7,09211437

edited Jun 26 '16 at 15:50

answered Jun 26 '16 at 14:49

poige

7,09211437

answered Jun 26 '16 at 14:49

poige

7,09211437

answered Jun 26 '16 at 14:49

poige

7,09211437

3

Git version of GNU Parallel now supports block devices (on GNU/Linux at least). Thanks for the idea.

– Ole Tange
Jul 4 '16 at 2:40

1

THUMb up! Thanks for maintaining the utility. ;)

– poige
Jul 4 '16 at 16:45

This should have been put as part of the question rather than as an answer.

– mc0e
Jul 5 '16 at 21:19

add a comment |

3

Git version of GNU Parallel now supports block devices (on GNU/Linux at least). Thanks for the idea.

– Ole Tange
Jul 4 '16 at 2:40

1

THUMb up! Thanks for maintaining the utility. ;)

– poige
Jul 4 '16 at 16:45

This should have been put as part of the question rather than as an answer.

– mc0e
Jul 5 '16 at 21:19

Git version of GNU Parallel now supports block devices (on GNU/Linux at least). Thanks for the idea.

– Ole Tange
Jul 4 '16 at 2:40

THUMb up! Thanks for maintaining the utility. ;)

– poige
Jul 4 '16 at 16:45

This should have been put as part of the question rather than as an answer.

– mc0e
Jul 5 '16 at 21:19

add a comment |

What you can do (and, based on your answer, you are doing) is to split the source files and concurrently calculate each chunk's md5sum.

If you can't/wan't do that, you had to use a faster hash function as xxHash, CityHash or SpookyHash

edited Jun 29 '16 at 12:40

answered Jun 29 '16 at 12:02

shodanshok

26.6k34787

The most valuable part of your answer (others is just boring repeating) is the list of possibly faster hashes. Thanks anyways.

– poige
Jun 29 '16 at 13:39

BTW: stackoverflow.com/questions/10726325/…

– poige
Jun 29 '16 at 14:07

@poige Did you read the other comments in the thread you linked? In the context of a single-input hashing, MD5 can not be parallelized, simply because it is a linear process (ie: current state depends on previous inputs).

– shodanshok
Jun 29 '16 at 14:55

I won't consider this as correct answer. ;-P

– poige
Jul 1 '16 at 7:02

@shodanshok the issue with the second approach is that for a valid matching of a large directory with many large files, is that if all files are identical, you are adding a lot of overhead and still running md5sum on every file.

– Jedi
Jul 2 '16 at 16:54

|
show 1 more comment

What you can do (and, based on your answer, you are doing) is to split the source files and concurrently calculate each chunk's md5sum.

If you can't/wan't do that, you had to use a faster hash function as xxHash, CityHash or SpookyHash

edited Jun 29 '16 at 12:40

answered Jun 29 '16 at 12:02

shodanshok

26.6k34787

The most valuable part of your answer (others is just boring repeating) is the list of possibly faster hashes. Thanks anyways.

– poige
Jun 29 '16 at 13:39

BTW: stackoverflow.com/questions/10726325/…

– poige
Jun 29 '16 at 14:07

@poige Did you read the other comments in the thread you linked? In the context of a single-input hashing, MD5 can not be parallelized, simply because it is a linear process (ie: current state depends on previous inputs).

– shodanshok
Jun 29 '16 at 14:55

I won't consider this as correct answer. ;-P

– poige
Jul 1 '16 at 7:02

@shodanshok the issue with the second approach is that for a valid matching of a large directory with many large files, is that if all files are identical, you are adding a lot of overhead and still running md5sum on every file.

– Jedi
Jul 2 '16 at 16:54

|
show 1 more comment

What you can do (and, based on your answer, you are doing) is to split the source files and concurrently calculate each chunk's md5sum.

If you can't/wan't do that, you had to use a faster hash function as xxHash, CityHash or SpookyHash

edited Jun 29 '16 at 12:40

answered Jun 29 '16 at 12:02

shodanshok

26.6k34787

What you can do (and, based on your answer, you are doing) is to split the source files and concurrently calculate each chunk's md5sum.

If you can't/wan't do that, you had to use a faster hash function as xxHash, CityHash or SpookyHash

edited Jun 29 '16 at 12:40

answered Jun 29 '16 at 12:02

shodanshok

26.6k34787

edited Jun 29 '16 at 12:40

answered Jun 29 '16 at 12:02

shodanshok

26.6k34787

answered Jun 29 '16 at 12:02

shodanshok

26.6k34787

answered Jun 29 '16 at 12:02

shodanshok

26.6k34787

The most valuable part of your answer (others is just boring repeating) is the list of possibly faster hashes. Thanks anyways.

– poige
Jun 29 '16 at 13:39

BTW: stackoverflow.com/questions/10726325/…

– poige
Jun 29 '16 at 14:07

@poige Did you read the other comments in the thread you linked? In the context of a single-input hashing, MD5 can not be parallelized, simply because it is a linear process (ie: current state depends on previous inputs).

– shodanshok
Jun 29 '16 at 14:55

I won't consider this as correct answer. ;-P

– poige
Jul 1 '16 at 7:02

@shodanshok the issue with the second approach is that for a valid matching of a large directory with many large files, is that if all files are identical, you are adding a lot of overhead and still running md5sum on every file.

– Jedi
Jul 2 '16 at 16:54

|
show 1 more comment

The most valuable part of your answer (others is just boring repeating) is the list of possibly faster hashes. Thanks anyways.

– poige
Jun 29 '16 at 13:39

BTW: stackoverflow.com/questions/10726325/…

– poige
Jun 29 '16 at 14:07

@poige Did you read the other comments in the thread you linked? In the context of a single-input hashing, MD5 can not be parallelized, simply because it is a linear process (ie: current state depends on previous inputs).

– shodanshok
Jun 29 '16 at 14:55

I won't consider this as correct answer. ;-P

– poige
Jul 1 '16 at 7:02

@shodanshok the issue with the second approach is that for a valid matching of a large directory with many large files, is that if all files are identical, you are adding a lot of overhead and still running md5sum on every file.

– Jedi
Jul 2 '16 at 16:54

The most valuable part of your answer (others is just boring repeating) is the list of possibly faster hashes. Thanks anyways.

– poige
Jun 29 '16 at 13:39

BTW: stackoverflow.com/questions/10726325/…

– poige
Jun 29 '16 at 14:07

@poige Did you read the other comments in the thread you linked? In the context of a single-input hashing, MD5 can not be parallelized, simply because it is a linear process (ie: current state depends on previous inputs).

– shodanshok
Jun 29 '16 at 14:55

I won't consider this as correct answer. ;-P

– poige
Jul 1 '16 at 7:02

@shodanshok the issue with the second approach is that for a valid matching of a large directory with many large files, is that if all files are identical, you are adding a lot of overhead and still running md5sum on every file.

– Jedi
Jul 2 '16 at 16:54

|
show 1 more comment

Testing various implementation of your algorithm of choice will help. If you can find a well-tested asm implementation, it will likely improve upon the performance of its C/C++ cousins.

This Gist could be a good start for something in Python.

answered Jun 26 '16 at 13:22

Gary

1044

split does indeed comes to one's mind, but, alas when we talk about huge files (as we do) it's not an option.

– poige
Jun 26 '16 at 13:24

@poige As I said, you wouldn't do it on disk, just start hashing the file from specific offsets and stopping at the start of the next chunk.

– Gary
Jun 26 '16 at 13:27

Yeah, but that's the theory which is quite obvious; anything practical?

– poige
Jun 26 '16 at 14:29

@poige From your question I can't guess at why that approach would be impractical. Perhaps there's a constraint you forgot to include?

– Gary
Jun 26 '16 at 14:33

2

I didn't say it's impractical; but it's not practical cause you don't have anything in your answer that can be used right away. For e. g.: this is practical answer: serverfault.com/questions/488486/…

– poige
Jun 26 '16 at 14:46

|
show 1 more comment

Testing various implementation of your algorithm of choice will help. If you can find a well-tested asm implementation, it will likely improve upon the performance of its C/C++ cousins.

This Gist could be a good start for something in Python.

answered Jun 26 '16 at 13:22

Gary

1044

split does indeed comes to one's mind, but, alas when we talk about huge files (as we do) it's not an option.

– poige
Jun 26 '16 at 13:24

@poige As I said, you wouldn't do it on disk, just start hashing the file from specific offsets and stopping at the start of the next chunk.

– Gary
Jun 26 '16 at 13:27

Yeah, but that's the theory which is quite obvious; anything practical?

– poige
Jun 26 '16 at 14:29

@poige From your question I can't guess at why that approach would be impractical. Perhaps there's a constraint you forgot to include?

– Gary
Jun 26 '16 at 14:33

2

I didn't say it's impractical; but it's not practical cause you don't have anything in your answer that can be used right away. For e. g.: this is practical answer: serverfault.com/questions/488486/…

– poige
Jun 26 '16 at 14:46

|
show 1 more comment

Testing various implementation of your algorithm of choice will help. If you can find a well-tested asm implementation, it will likely improve upon the performance of its C/C++ cousins.

This Gist could be a good start for something in Python.

answered Jun 26 '16 at 13:22

Gary

1044

Testing various implementation of your algorithm of choice will help. If you can find a well-tested asm implementation, it will likely improve upon the performance of its C/C++ cousins.

This Gist could be a good start for something in Python.

answered Jun 26 '16 at 13:22

Gary

1044

answered Jun 26 '16 at 13:22

Gary

1044

answered Jun 26 '16 at 13:22

Gary

1044

answered Jun 26 '16 at 13:22

Gary

1044

split does indeed comes to one's mind, but, alas when we talk about huge files (as we do) it's not an option.

– poige
Jun 26 '16 at 13:24

@poige As I said, you wouldn't do it on disk, just start hashing the file from specific offsets and stopping at the start of the next chunk.

– Gary
Jun 26 '16 at 13:27

Yeah, but that's the theory which is quite obvious; anything practical?

– poige
Jun 26 '16 at 14:29

@poige From your question I can't guess at why that approach would be impractical. Perhaps there's a constraint you forgot to include?

– Gary
Jun 26 '16 at 14:33

2

I didn't say it's impractical; but it's not practical cause you don't have anything in your answer that can be used right away. For e. g.: this is practical answer: serverfault.com/questions/488486/…

– poige
Jun 26 '16 at 14:46

|
show 1 more comment

split does indeed comes to one's mind, but, alas when we talk about huge files (as we do) it's not an option.

– poige
Jun 26 '16 at 13:24

@poige As I said, you wouldn't do it on disk, just start hashing the file from specific offsets and stopping at the start of the next chunk.

– Gary
Jun 26 '16 at 13:27

Yeah, but that's the theory which is quite obvious; anything practical?

– poige
Jun 26 '16 at 14:29

@poige From your question I can't guess at why that approach would be impractical. Perhaps there's a constraint you forgot to include?

– Gary
Jun 26 '16 at 14:33

2

I didn't say it's impractical; but it's not practical cause you don't have anything in your answer that can be used right away. For e. g.: this is practical answer: serverfault.com/questions/488486/…

– poige
Jun 26 '16 at 14:46

split does indeed comes to one's mind, but, alas when we talk about huge files (as we do) it's not an option.

– poige
Jun 26 '16 at 13:24

@poige As I said, you wouldn't do it on disk, just start hashing the file from specific offsets and stopping at the start of the next chunk.

– Gary
Jun 26 '16 at 13:27

Yeah, but that's the theory which is quite obvious; anything practical?

– poige
Jun 26 '16 at 14:29

@poige From your question I can't guess at why that approach would be impractical. Perhaps there's a constraint you forgot to include?

– Gary
Jun 26 '16 at 14:33

I didn't say it's impractical; but it's not practical cause you don't have anything in your answer that can be used right away. For e. g.: this is practical answer: serverfault.com/questions/488486/…

– poige
Jun 26 '16 at 14:46

|
show 1 more comment

answered Jul 3 '16 at 19:22

tfrederick74656

1,27211027

add a comment |

answered Jul 3 '16 at 19:22

tfrederick74656

1,27211027

add a comment |

answered Jul 3 '16 at 19:22

tfrederick74656

1,27211027

answered Jul 3 '16 at 19:22

tfrederick74656

1,27211027

answered Jul 3 '16 at 19:22

tfrederick74656

1,27211027

answered Jul 3 '16 at 19:22

tfrederick74656

1,27211027

answered Jul 3 '16 at 19:22

tfrederick74656

1,27211027

add a comment |

answered Oct 1 '18 at 17:14

Jack O'Connor

173128

add a comment |

answered Oct 1 '18 at 17:14

Jack O'Connor

173128

add a comment |

answered Oct 1 '18 at 17:14

Jack O'Connor

173128

answered Oct 1 '18 at 17:14

Jack O'Connor

173128

answered Oct 1 '18 at 17:14

Jack O'Connor

173128

answered Oct 1 '18 at 17:14

Jack O'Connor

173128

answered Oct 1 '18 at 17:14

Jack O'Connor

173128

add a comment |

-1

Here is the manpage for it http://md5deep.sourceforge.net/md5deep.html and git repo https://github.com/jessek/hashdeep

The package name in ubuntu and debian is md5deep and includes hashdeep.

answered Jul 3 '16 at 18:47

Jason

214

From the man page I'd expect -j to support thread per each file given, not its parts.

– poige
Jul 3 '16 at 21:28

add a comment |

-1

Here is the manpage for it http://md5deep.sourceforge.net/md5deep.html and git repo https://github.com/jessek/hashdeep

The package name in ubuntu and debian is md5deep and includes hashdeep.

answered Jul 3 '16 at 18:47

Jason

214

From the man page I'd expect -j to support thread per each file given, not its parts.

– poige
Jul 3 '16 at 21:28

add a comment |

-1

Here is the manpage for it http://md5deep.sourceforge.net/md5deep.html and git repo https://github.com/jessek/hashdeep

The package name in ubuntu and debian is md5deep and includes hashdeep.

answered Jul 3 '16 at 18:47

Jason

214

Here is the manpage for it http://md5deep.sourceforge.net/md5deep.html and git repo https://github.com/jessek/hashdeep

The package name in ubuntu and debian is md5deep and includes hashdeep.

answered Jul 3 '16 at 18:47

Jason

214

answered Jul 3 '16 at 18:47

Jason

214

answered Jul 3 '16 at 18:47

Jason

214

answered Jul 3 '16 at 18:47

Jason

214

From the man page I'd expect -j to support thread per each file given, not its parts.

– poige
Jul 3 '16 at 21:28

add a comment |

From the man page I'd expect -j to support thread per each file given, not its parts.

– poige
Jul 3 '16 at 21:28

From the man page I'd expect -j to support thread per each file given, not its parts.

– poige
Jul 3 '16 at 21:28

add a comment |

-1

It's perhaps worth noting that having a fixed block size means that the main weakness of MD5s is irrelevant - the hacker cannot append extra data to the block.

A function from the skein function group using the optional hash tree mode might suit you. Then again, CRC32 might be all you need.

edited Jul 4 '16 at 7:31

answered Jul 4 '16 at 7:01

mc0e

5,3541127

I could throw in some stuff about the controversy around SHA-3 (i.e.Keccak) if you like, but perhaps you should just read it on wikipedia.

– mc0e
Jul 4 '16 at 7:59

add a comment |

-1

It's perhaps worth noting that having a fixed block size means that the main weakness of MD5s is irrelevant - the hacker cannot append extra data to the block.

A function from the skein function group using the optional hash tree mode might suit you. Then again, CRC32 might be all you need.

edited Jul 4 '16 at 7:31

answered Jul 4 '16 at 7:01

mc0e

5,3541127

I could throw in some stuff about the controversy around SHA-3 (i.e.Keccak) if you like, but perhaps you should just read it on wikipedia.

– mc0e
Jul 4 '16 at 7:59

add a comment |

-1

It's perhaps worth noting that having a fixed block size means that the main weakness of MD5s is irrelevant - the hacker cannot append extra data to the block.

A function from the skein function group using the optional hash tree mode might suit you. Then again, CRC32 might be all you need.

edited Jul 4 '16 at 7:31

answered Jul 4 '16 at 7:01

mc0e

5,3541127

It's perhaps worth noting that having a fixed block size means that the main weakness of MD5s is irrelevant - the hacker cannot append extra data to the block.

A function from the skein function group using the optional hash tree mode might suit you. Then again, CRC32 might be all you need.

edited Jul 4 '16 at 7:31

answered Jul 4 '16 at 7:01

mc0e

5,3541127

edited Jul 4 '16 at 7:31

answered Jul 4 '16 at 7:01

mc0e

5,3541127

answered Jul 4 '16 at 7:01

mc0e

5,3541127

answered Jul 4 '16 at 7:01

mc0e

5,3541127

I could throw in some stuff about the controversy around SHA-3 (i.e.Keccak) if you like, but perhaps you should just read it on wikipedia.

– mc0e
Jul 4 '16 at 7:59

add a comment |

I could throw in some stuff about the controversy around SHA-3 (i.e.Keccak) if you like, but perhaps you should just read it on wikipedia.

– mc0e
Jul 4 '16 at 7:59

I could throw in some stuff about the controversy around SHA-3 (i.e.Keccak) if you like, but perhaps you should just read it on wikipedia.

– mc0e
Jul 4 '16 at 7:59

add a comment |

draft saved

draft discarded

Thanks for contributing an answer to Server Fault!

Please be sure to answer the question. Provide details and share your research!

But avoid …

Asking for help, clarification, or responding to other answers.

Making statements based on opinion; back them up with references or personal experience.

To learn more, see our tips on writing great answers.

draft saved

draft discarded

Post as a guest

Name

Required, but never shown

Name

Required, but never shown

Name

Required, but never shown

This page is only for reference, If you need detailed information, please check here

搜尋此網誌

Otdfbt

7 Answers
7

Your Answer

Post as a guest

7 Answers
7

7 Answers
7

Post as a guest

Popular posts from this blog

Vilaño, A Laracha Índice Patrimonio | Lugares e parroquias | Véxase tamén | Menú de navegación43°14′52″N 8°36′03″O / 43.24775, -8.60070

7 Answers 7

Your Answer

Sign up or log in

Post as a guest

Post as a guest

7 Answers 7

7 Answers 7

Sign up or log in

Post as a guest

Post as a guest

Sign up or log in

Post as a guest

Sign up or log in

Post as a guest

Sign up or log in

Post as a guest

Popular posts from this blog

Vilaño, A Laracha Índice Patrimonio | Lugares e parroquias | Véxase tamén | Menú de navegación43°14′52″N 8°36′03″O﻿ / ﻿43.24775, -8.60070

7 Answers
7

7 Answers
7

7 Answers
7

Vilaño, A Laracha Índice Patrimonio | Lugares e parroquias | Véxase tamén | Menú de navegación43°14′52″N 8°36′03″O / 43.24775, -8.60070