How does one check huge files identity if hashing is CPU bound?
For small files hashing is just fine, but with huge ones you quickly find that md5sum is CPU bound. Is there a hashing algorithm able to scale out over multiple cores? Any workarounds? Ideas? Anything? :)
Tags: multi-core, hash, big-data
"Huge ones" is plural, and the work can be scaled out to multiple cores by hashing more than one file at a time. One way to do that in the shell is with GNU Parallel.
– Brian
Jun 26 '16 at 16:27
In my experience with hashing, the bottleneck is disk I/O, at least on a desktop. Besides, a big task usually involves many files, so some of them can be hashed in parallel.
– mmv-ru
Jun 29 '16 at 0:08
Did you try fciv.exe on Windows, to compare whether it's better suited to a multi-core CPU? (support.microsoft.com/en-ca/kb/841290)
– yagmoth555♦
Jun 29 '16 at 12:54
@yagmoth555, nope. I rarely use Windows; mostly I don't use it at all, I'd say. From the description it looks unlikely that it scales on SMP.
– poige
Jun 29 '16 at 13:37
asked Jun 26 '16 at 10:59 by poige, edited 19 hours ago
7 Answers
My own best solution at the moment is:
parallel --block=512M --pipepart -a …HUGEFILE… --progress --recend ''
-k -j …NUMofProcessesSay4… md5sum | md5sum
It should be noted that:
- The resulting MD5 hash is not of the file itself but of the MD5s of its parts; still, it lets you compare whether a replica is identical to the origin.
- It also doesn't perform very well, especially when you use a pipe rather than a file as input.
- parallel's --pipepart, as I found out, doesn't support disk partitions.
So I'd love to hear about other ways around this as well.
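For readers who prefer a self-contained sketch, the same md5-of-md5s idea can be written in a few lines of Python. Names like `md5_of_md5s` are mine, not part of any tool, and this is a rough illustration rather than a drop-in replacement for the pipeline above; note that CPython's hashlib releases the GIL while hashing large buffers, so even a thread pool spreads the work across cores.

```python
import hashlib
from multiprocessing.pool import ThreadPool

def md5_hex(data):
    # Hash one chunk. CPython's hashlib releases the GIL while hashing
    # buffers larger than ~2 KB, so threads scale across cores here.
    return hashlib.md5(data).hexdigest()

def md5_of_md5s(path, workers=4, chunk_size=512 * 1024 * 1024):
    """Read the file sequentially, hash chunks in a worker pool, then
    hash the concatenation of the per-chunk hex digests (order is
    preserved, like parallel's -k)."""
    with open(path, "rb") as f, ThreadPool(workers) as pool:
        chunks = iter(lambda: f.read(chunk_size), b"")
        digests = pool.map(md5_hex, chunks)
    return hashlib.md5("".join(digests).encode()).hexdigest()
```

As with the pipeline above, the result is a hash of per-chunk hashes, so it only matches another digest computed with the same chunk size.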
Git version of GNU Parallel now supports block devices (on GNU/Linux at least). Thanks for the idea.
– Ole Tange
Jul 4 '16 at 2:40
Thumbs up! Thanks for maintaining the utility. ;)
– poige
Jul 4 '16 at 16:45
This should have been put as part of the question rather than as an answer.
– mc0e
Jul 5 '16 at 21:19
Unfortunately, MD5 is a linear process whose state depends on all previous input, so in other words you can't truly parallelize it. Moreover, I'm not aware of any real hash algorithm that does not operate in this manner.
What you can do (and, based on your answer, you are doing) is split the source file and concurrently calculate each chunk's md5sum.
If you can't/won't do that, you'll have to use a faster hash function such as xxHash, CityHash, or SpookyHash.
Another idea (maybe applicable to your intended usage): if you need something faster than MD5 (albeit single-threaded), you can use CRC32 (which is hardware-accelerated on recent CPUs) for a fast first pass, resorting to MD5/SHA1 for a second pass on seemingly identical files.
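The two-pass idea could be sketched roughly like this in Python; the helper names and the 1 MiB read size are illustrative assumptions, not a fixed recipe.

```python
import hashlib
import zlib
from collections import defaultdict

def crc32_of(path):
    # Cheap first-pass checksum; zlib.crc32 is fast and streams well.
    crc = 0
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(1 << 20), b""):
            crc = zlib.crc32(block, crc)
    return crc

def md5_of(path):
    # Stronger second-pass hash, run only on CRC32-colliding candidates.
    h = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(1 << 20), b""):
            h.update(block)
    return h.hexdigest()

def find_duplicates(paths):
    """Group files by CRC32 first, then confirm with MD5 only within
    groups that actually have more than one member."""
    by_crc = defaultdict(list)
    for p in paths:
        by_crc[crc32_of(p)].append(p)
    confirmed = defaultdict(list)
    for group in by_crc.values():
        if len(group) > 1:
            for p in group:
                confirmed[md5_of(p)].append(p)
    return [g for g in confirmed.values() if len(g) > 1]
```

As Jedi notes below, when most files really are identical this still ends up running MD5 on nearly everything, so the prefilter only pays off when mismatches are common.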
The most valuable part of your answer (the rest is just repetition) is the list of possibly faster hashes. Thanks anyway.
– poige
Jun 29 '16 at 13:39
BTW: stackoverflow.com/questions/10726325/…
– poige
Jun 29 '16 at 14:07
@poige Did you read the other comments in the thread you linked? In the context of single-input hashing, MD5 cannot be parallelized, simply because it is a linear process (i.e., the current state depends on previous inputs).
– shodanshok
Jun 29 '16 at 14:55
I won't consider this the correct answer. ;-P
– poige
Jul 1 '16 at 7:02
@shodanshok the issue with the second approach is that, when matching a large directory with many large files that turn out to be all identical, you add a lot of overhead and still run md5sum on every file.
– Jedi
Jul 2 '16 at 16:54
There's pretty much no getting around processing the entire file. MD4 or CRC32 are probably your best bets for a widely deployed and fast algorithm (though CRC32 is going to be far less effective than MD4).
Testing various implementations of your algorithm of choice will help. If you can find a well-tested asm implementation, it will likely outperform its C/C++ cousins.
If you don't really care about interoperability, hashing across multiple cores is easily doable by splitting the file into chunks (doesn't need to be done on-disk, you'd just start reading from specific offsets) and processing each chunk separately (this will result in serious disk thrashing though, degrading the performance, especially for mechanical disks). You'll end up with separate hashes for each chunk (though this has other advantages, like pointing you toward the broken chunk) but you could always hash them together for one final value.
This Gist could be a good start for something in Python.
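A minimal Python sketch of that offset-based approach (my own function names, not the Gist mentioned above): each worker opens the file independently and hashes only its byte range, so nothing is split on disk.

```python
import hashlib
import os
from multiprocessing.pool import ThreadPool

def hash_range(args):
    # Each worker opens the file itself and hashes only the slice
    # [offset, offset + length) -- no on-disk split required.
    path, offset, length = args
    h = hashlib.md5()
    with open(path, "rb") as f:
        f.seek(offset)
        remaining = length
        while remaining > 0:
            block = f.read(min(1 << 20, remaining))
            if not block:
                break
            h.update(block)
            remaining -= len(block)
    return h.hexdigest()

def chunk_hashes(path, chunk_size, workers=4):
    """One MD5 per fixed-size chunk. A mismatch against a replica's
    list also tells you which chunk is broken."""
    size = os.path.getsize(path)
    tasks = [(path, off, min(chunk_size, size - off))
             for off in range(0, size, chunk_size)]
    with ThreadPool(workers) as pool:
        return pool.map(hash_range, tasks)
```

On a mechanical disk the concurrent seeks will thrash, as the answer warns; on SSDs or when reading from cache the chunks genuinely proceed in parallel.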
split does indeed come to mind, but alas, when we talk about huge files (as we do), it's not an option.
– poige
Jun 26 '16 at 13:24
@poige As I said, you wouldn't do it on disk, just start hashing the file from specific offsets and stopping at the start of the next chunk.
– Gary
Jun 26 '16 at 13:27
Yeah, but that's just the theory, which is quite obvious; anything practical?
– poige
Jun 26 '16 at 14:29
@poige From your question I can't guess at why that approach would be impractical. Perhaps there's a constraint you forgot to include?
– Gary
Jun 26 '16 at 14:33
I didn't say it's impractical; but it's not practical, because there's nothing in your answer that can be used right away. For example, this is a practical answer: serverfault.com/questions/488486/…
– poige
Jun 26 '16 at 14:46
Most of the answers here have addressed the linear nature of most hashing algorithms. Although I'm sure there exist some truly scalable hashing algorithms, an easier solution is to simply split the data up into smaller pieces and hash each piece individually.
Consider the BitTorrent approach: When a Torrent is created, all of the files are split into 'blocks', each block individually hashed, and each of those hashes recorded in the .torrent file. This is what allows a peer to incrementally verify incoming data, without having to wait for the entire file to finish downloading first. Errors can also be corrected on a per-block basis, instead of requiring re-transmission of the entire file. Aside from the logistical benefits, this approach also allows hashing to scale across multiple cores - if 8 cores are available, 8 blocks can be simultaneously hashed.
If you engineer your verification process to work on some subset of the data, e.g. blocks of some fixed size, you can hash each block on a separate core, thus eliminating a large amount of delay in the pipeline. Obviously, this approach has a small time/memory trade-off: each additional instance of hashing has some overhead associated with it, mostly in the form of memory, although this is minimal unless you're running hundreds of instances.
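A toy sketch of that per-block scheme in Python; SHA-1 and the 256 KiB block size are my assumptions here, mirroring common torrent piece sizes, and the function names are hypothetical.

```python
import hashlib

BLOCK = 256 * 1024  # assumed block size; torrents use power-of-two piece sizes

def build_hash_list(data, block_size=BLOCK):
    """Split data into fixed-size blocks and record one SHA-1 per block,
    like the piece list stored in a .torrent file."""
    return [hashlib.sha1(data[i:i + block_size]).hexdigest()
            for i in range(0, len(data), block_size)]

def verify_block(block, index, hash_list):
    """Check one incoming block against the recorded list -- no need to
    wait for the whole file, and a failure pinpoints the bad block."""
    return hashlib.sha1(block).hexdigest() == hash_list[index]
```

Because each block is independent, `build_hash_list` is trivially parallelizable across cores, which is exactly the property the answer describes.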
I'm working on a tree hashing project, which is designed for exactly this problem: off-the-shelf parallel hashing of large files. It works now, though it hasn't been reviewed, and there's a good chance that changes from review will result in changes to the final digest. That said, it's very fast: https://github.com/oconnor663/bao
You can use md5deep for this, and hashdeep for other hashes. It supports multi-threading with the -j flag. By default it will create one hashing thread per core. It also has a flag to break files into pieces before hashing, but it will not use multiple threads on a single file. I've used it to get the sha256 of half a million files and it worked great. It also has a recursive flag, which makes handling large directory trees easier.
Here is the manpage (http://md5deep.sourceforge.net/md5deep.html) and the git repo (https://github.com/jessek/hashdeep).
The package name in Ubuntu and Debian is md5deep, and it includes hashdeep.
From the man page, I'd expect -j to run one thread per file given, not per part of a file.
– poige
Jul 3 '16 at 21:28
It's easy to design a hashing algorithm that is scalable over multiple cores; it's just that the best-known hashing algorithms tend to be designed specifically to prevent that, so that tasks like finding hash collisions are as slow as possible.
Hashing functions that don't force serial processing may suit you, but that depends what properties you expect of your hashing function. As such, I don't think that you've given enough information for a good recommendation to be made.
As others have suggested, you can construct a hashing function as the hash of the concatenated hashes of each of the blocks of some given size in the original. So long as the block size is large enough to make it difficult to reverse the hashes of individual blocks, this is likely to work well enough for most purposes. How big that should be depends on how predictable the content of those blocks is. If you are able to estimate the entropy, and choose a block size such that you get 128+ bits of entropy per block, that should be sufficient for most purposes (and overkill for many where security is not the primary concern).
From a security point of view, you are concerned about the degree of entropy at the block level, because otherwise finding a collision for a single block is enough to enable a malicious actor to substitute part of the content and get the same final hash.
It's perhaps worth noting that having a fixed block size means the main weakness of MD5 is irrelevant: the attacker cannot append extra data to a block.
If your needs are about preventing naturally occurring hash collisions as opposed to malicious ones, then you can no doubt afford to use a much faster checksum function. Cryptographically secure hashes are typically designed to be slow to calculate.
A function from the Skein family using the optional hash-tree mode might suit you. Then again, CRC32 might be all you need.
I could throw in some stuff about the controversy around SHA-3 (i.e. Keccak) if you like, but perhaps you should just read about it on Wikipedia.
– mc0e
Jul 4 '16 at 7:59
Mine own best at the moment solution is:
parallel --block=512M --pipepart -a …HUGEFILE… --progress --recend ''
-k -j …NUMofProcessesSay4… md5sum | md5sum
— It should be noted that:
- Resulting md5 hash isn't of the file, but rather of md5s of its parts but still it allows you to compare if replica is identical to origin
- It also doesn't perform very good, specially when you use
pipe
and not file as input parallel
's--pipepart
as I found out doesn't support disk partitions
So I'd love to hear other ways around as well.
3
Git version of GNU Parallel now supports block devices (on GNU/Linux at least). Thanks for the idea.
– Ole Tange
Jul 4 '16 at 2:40
1
THUMb up! Thanks for maintaining the utility. ;)
– poige
Jul 4 '16 at 16:45
This should have been put as part of the question rather than as an answer.
– mc0e
Jul 5 '16 at 21:19
add a comment |
Mine own best at the moment solution is:
parallel --block=512M --pipepart -a …HUGEFILE… --progress --recend ''
-k -j …NUMofProcessesSay4… md5sum | md5sum
— It should be noted that:
- Resulting md5 hash isn't of the file, but rather of md5s of its parts but still it allows you to compare if replica is identical to origin
- It also doesn't perform very good, specially when you use
pipe
and not file as input parallel
's--pipepart
as I found out doesn't support disk partitions
So I'd love to hear other ways around as well.
3
Git version of GNU Parallel now supports block devices (on GNU/Linux at least). Thanks for the idea.
– Ole Tange
Jul 4 '16 at 2:40
1
THUMb up! Thanks for maintaining the utility. ;)
– poige
Jul 4 '16 at 16:45
This should have been put as part of the question rather than as an answer.
– mc0e
Jul 5 '16 at 21:19
add a comment |
Mine own best at the moment solution is:
parallel --block=512M --pipepart -a …HUGEFILE… --progress --recend ''
-k -j …NUMofProcessesSay4… md5sum | md5sum
— It should be noted that:
- Resulting md5 hash isn't of the file, but rather of md5s of its parts but still it allows you to compare if replica is identical to origin
- It also doesn't perform very good, specially when you use
pipe
and not file as input parallel
's--pipepart
as I found out doesn't support disk partitions
So I'd love to hear other ways around as well.
Mine own best at the moment solution is:
parallel --block=512M --pipepart -a …HUGEFILE… --progress --recend ''
-k -j …NUMofProcessesSay4… md5sum | md5sum
— It should be noted that:
- Resulting md5 hash isn't of the file, but rather of md5s of its parts but still it allows you to compare if replica is identical to origin
- It also doesn't perform very good, specially when you use
pipe
and not file as input parallel
's--pipepart
as I found out doesn't support disk partitions
So I'd love to hear other ways around as well.
edited Jun 26 '16 at 15:50
answered Jun 26 '16 at 14:49
poigepoige
7,09211437
7,09211437
3
Git version of GNU Parallel now supports block devices (on GNU/Linux at least). Thanks for the idea.
– Ole Tange
Jul 4 '16 at 2:40
1
THUMb up! Thanks for maintaining the utility. ;)
– poige
Jul 4 '16 at 16:45
This should have been put as part of the question rather than as an answer.
– mc0e
Jul 5 '16 at 21:19
add a comment |
3
Git version of GNU Parallel now supports block devices (on GNU/Linux at least). Thanks for the idea.
– Ole Tange
Jul 4 '16 at 2:40
1
THUMb up! Thanks for maintaining the utility. ;)
– poige
Jul 4 '16 at 16:45
This should have been put as part of the question rather than as an answer.
– mc0e
Jul 5 '16 at 21:19
3
3
Git version of GNU Parallel now supports block devices (on GNU/Linux at least). Thanks for the idea.
– Ole Tange
Jul 4 '16 at 2:40
Git version of GNU Parallel now supports block devices (on GNU/Linux at least). Thanks for the idea.
– Ole Tange
Jul 4 '16 at 2:40
1
1
THUMb up! Thanks for maintaining the utility. ;)
– poige
Jul 4 '16 at 16:45
THUMb up! Thanks for maintaining the utility. ;)
– poige
Jul 4 '16 at 16:45
This should have been put as part of the question rather than as an answer.
– mc0e
Jul 5 '16 at 21:19
This should have been put as part of the question rather than as an answer.
– mc0e
Jul 5 '16 at 21:19
add a comment |
Unfortunately, MD5 is a linear process where its state depend on all the previous input. In other words, you can't truly parallelize it. Moreover, I'm not aware of any real hash alg that does not operate in this manner.
What you can do (and, based on your answer, you are doing) is to split the source files and concurrently calculate each chunk's md5sum.
If you can't/wan't do that, you had to use a faster hash function as xxHash, CityHash or SpookyHash
Other idea (maybe it is applicable to your intented usage): if you need something faster than MD5 (albeit single-threaded), you can use CRC32 (which is hardware-accelerated by recent CPUs) for a first fast pass, resorting to MD5/SHA1 for a second pass on seemly identical files.
The most valuable part of your answer (others is just boring repeating) is the list of possibly faster hashes. Thanks anyways.
– poige
Jun 29 '16 at 13:39
BTW: stackoverflow.com/questions/10726325/…
– poige
Jun 29 '16 at 14:07
@poige Did you read the other comments in the thread you linked? In the context of a single-input hashing, MD5 can not be parallelized, simply because it is a linear process (ie: current state depends on previous inputs).
– shodanshok
Jun 29 '16 at 14:55
I won't consider this as correct answer. ;-P
– poige
Jul 1 '16 at 7:02
@shodanshok the issue with the second approach is that for a valid matching of a large directory with many large files, is that if all files are identical, you are adding a lot of overhead and still running md5sum on every file.
– Jedi
Jul 2 '16 at 16:54
|
show 1 more comment
Unfortunately, MD5 is a linear process where its state depend on all the previous input. In other words, you can't truly parallelize it. Moreover, I'm not aware of any real hash alg that does not operate in this manner.
What you can do (and, based on your answer, you are doing) is to split the source files and concurrently calculate each chunk's md5sum.
If you can't/wan't do that, you had to use a faster hash function as xxHash, CityHash or SpookyHash
Other idea (maybe it is applicable to your intented usage): if you need something faster than MD5 (albeit single-threaded), you can use CRC32 (which is hardware-accelerated by recent CPUs) for a first fast pass, resorting to MD5/SHA1 for a second pass on seemly identical files.
The most valuable part of your answer (others is just boring repeating) is the list of possibly faster hashes. Thanks anyways.
– poige
Jun 29 '16 at 13:39
BTW: stackoverflow.com/questions/10726325/…
– poige
Jun 29 '16 at 14:07
@poige Did you read the other comments in the thread you linked? In the context of a single-input hashing, MD5 can not be parallelized, simply because it is a linear process (ie: current state depends on previous inputs).
– shodanshok
Jun 29 '16 at 14:55
I won't consider this as correct answer. ;-P
– poige
Jul 1 '16 at 7:02
@shodanshok the issue with the second approach is that for a valid matching of a large directory with many large files, is that if all files are identical, you are adding a lot of overhead and still running md5sum on every file.
– Jedi
Jul 2 '16 at 16:54
|
show 1 more comment
Unfortunately, MD5 is a linear process where its state depend on all the previous input. In other words, you can't truly parallelize it. Moreover, I'm not aware of any real hash alg that does not operate in this manner.
What you can do (and, based on your answer, you are doing) is to split the source files and concurrently calculate each chunk's md5sum.
If you can't/wan't do that, you had to use a faster hash function as xxHash, CityHash or SpookyHash
Other idea (maybe it is applicable to your intented usage): if you need something faster than MD5 (albeit single-threaded), you can use CRC32 (which is hardware-accelerated by recent CPUs) for a first fast pass, resorting to MD5/SHA1 for a second pass on seemly identical files.
Unfortunately, MD5 is a linear process where its state depend on all the previous input. In other words, you can't truly parallelize it. Moreover, I'm not aware of any real hash alg that does not operate in this manner.
What you can do (and, based on your answer, you are doing) is to split the source files and concurrently calculate each chunk's md5sum.
If you can't/wan't do that, you had to use a faster hash function as xxHash, CityHash or SpookyHash
Other idea (maybe it is applicable to your intented usage): if you need something faster than MD5 (albeit single-threaded), you can use CRC32 (which is hardware-accelerated by recent CPUs) for a first fast pass, resorting to MD5/SHA1 for a second pass on seemly identical files.
edited Jun 29 '16 at 12:40
answered Jun 29 '16 at 12:02
shodanshokshodanshok
26.6k34787
26.6k34787
The most valuable part of your answer (others is just boring repeating) is the list of possibly faster hashes. Thanks anyways.
– poige
Jun 29 '16 at 13:39
BTW: stackoverflow.com/questions/10726325/…
– poige
Jun 29 '16 at 14:07
@poige Did you read the other comments in the thread you linked? In the context of a single-input hashing, MD5 can not be parallelized, simply because it is a linear process (ie: current state depends on previous inputs).
– shodanshok
Jun 29 '16 at 14:55
I won't consider this as correct answer. ;-P
– poige
Jul 1 '16 at 7:02
@shodanshok the issue with the second approach is that for a valid matching of a large directory with many large files, is that if all files are identical, you are adding a lot of overhead and still running md5sum on every file.
– Jedi
Jul 2 '16 at 16:54
|
show 1 more comment
The most valuable part of your answer (others is just boring repeating) is the list of possibly faster hashes. Thanks anyways.
– poige
Jun 29 '16 at 13:39
BTW: stackoverflow.com/questions/10726325/…
– poige
Jun 29 '16 at 14:07
@poige Did you read the other comments in the thread you linked? In the context of a single-input hashing, MD5 can not be parallelized, simply because it is a linear process (ie: current state depends on previous inputs).
– shodanshok
Jun 29 '16 at 14:55
I won't consider this as correct answer. ;-P
– poige
Jul 1 '16 at 7:02
@shodanshok the issue with the second approach is that for a valid matching of a large directory with many large files, is that if all files are identical, you are adding a lot of overhead and still running md5sum on every file.
– Jedi
Jul 2 '16 at 16:54
The most valuable part of your answer (others is just boring repeating) is the list of possibly faster hashes. Thanks anyways.
– poige
Jun 29 '16 at 13:39
The most valuable part of your answer (others is just boring repeating) is the list of possibly faster hashes. Thanks anyways.
– poige
Jun 29 '16 at 13:39
BTW: stackoverflow.com/questions/10726325/…
– poige
Jun 29 '16 at 14:07
BTW: stackoverflow.com/questions/10726325/…
– poige
Jun 29 '16 at 14:07
@poige Did you read the other comments in the thread you linked? In the context of a single-input hashing, MD5 can not be parallelized, simply because it is a linear process (ie: current state depends on previous inputs).
– shodanshok
Jun 29 '16 at 14:55
@poige Did you read the other comments in the thread you linked? In the context of a single-input hashing, MD5 can not be parallelized, simply because it is a linear process (ie: current state depends on previous inputs).
– shodanshok
Jun 29 '16 at 14:55
I won't consider this as correct answer. ;-P
– poige
Jul 1 '16 at 7:02
I won't consider this as correct answer. ;-P
– poige
Jul 1 '16 at 7:02
@shodanshok the issue with the second approach is that for a valid matching of a large directory with many large files, is that if all files are identical, you are adding a lot of overhead and still running md5sum on every file.
– Jedi
Jul 2 '16 at 16:54
@shodanshok the issue with the second approach is that for a valid matching of a large directory with many large files, is that if all files are identical, you are adding a lot of overhead and still running md5sum on every file.
– Jedi
Jul 2 '16 at 16:54
|
show 1 more comment
There's pretty much no getting around processing the entire file. MD4 or CRC32 are probably your best bets for a widely deployed and fast algorithm (though CRC32 is going to be far less effective than MD4).
Testing various implementation of your algorithm of choice will help. If you can find a well-tested asm implementation, it will likely improve upon the performance of its C/C++ cousins.
If you don't really care about interoperability, hashing across multiple cores is easily doable by splitting the file into chunks (doesn't need to be done on-disk, you'd just start reading from specific offsets) and processing each chunk separately (this will result in serious disk thrashing though, degrading the performance, especially for mechanical disks). You'll end up with separate hashes for each chunk (though this has other advantages, like pointing you toward the broken chunk) but you could always hash them together for one final value.
This Gist could be a good start for something in Python.
split
does indeed comes to one's mind, but, alas when we talk about huge files (as we do) it's not an option.
– poige
Jun 26 '16 at 13:24
@poige As I said, you wouldn't do it on disk, just start hashing the file from specific offsets and stopping at the start of the next chunk.
– Gary
Jun 26 '16 at 13:27
Yeah, but that's the theory which is quite obvious; anything practical?
– poige
Jun 26 '16 at 14:29
@poige From your question I can't guess at why that approach would be impractical. Perhaps there's a constraint you forgot to include?
– Gary
Jun 26 '16 at 14:33
2
I didn't say it's impractical; but it's not practical cause you don't have anything in your answer that can be used right away. For e. g.: this is practical answer: serverfault.com/questions/488486/…
– poige
Jun 26 '16 at 14:46
|
show 1 more comment
answered Jun 26 '16 at 13:22
Gary
Most of the answers here have addressed the linear nature of most hashing algorithms. Although I'm sure there exist some truly scalable hashing algorithms, an easier solution is simply to split the data into smaller pieces and hash each individually.
Consider the BitTorrent approach: when a torrent is created, all of the files are split into 'blocks', each block is individually hashed, and each of those hashes is recorded in the .torrent file. This is what allows a peer to incrementally verify incoming data without having to wait for the entire file to finish downloading first. Errors can also be corrected on a per-block basis, instead of requiring retransmission of the entire file. Aside from the logistical benefits, this approach also allows hashing to scale across multiple cores: if 8 cores are available, 8 blocks can be hashed simultaneously.
If you engineer your verification process to work on some subset of the data, e.g. blocks of some fixed size, you can hash each block on a separate core, eliminating a large amount of delay in the pipeline. Obviously, this approach has a small time/memory trade-off: each additional instance of hashing has some overhead associated with it, mostly memory, though this is minimal unless you're running hundreds of instances.
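A hedged sketch of the per-block idea in Python. BitTorrent itself uses SHA-1 over fixed-size pieces; the 1 MiB piece size here is an arbitrary assumption, and real torrent files store raw digests rather than hex strings:

```python
import hashlib

PIECE = 1 << 20  # 1 MiB pieces -- an illustrative choice

def piece_hashes(data, piece=PIECE):
    """Return the SHA-1 digest of every fixed-size piece, BitTorrent-style."""
    return [hashlib.sha1(data[i:i + piece]).hexdigest()
            for i in range(0, len(data), piece)]

def verify_piece(data, index, expected, piece=PIECE):
    """Verify one incoming piece against its recorded hash."""
    actual = hashlib.sha1(data[index * piece:(index + 1) * piece]).hexdigest()
    return actual == expected
```

Since each piece is independent, the list comprehension in `piece_hashes` can be handed to a worker pool piece-by-piece with no coordination beyond collecting results in order.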
answered Jul 3 '16 at 19:22
tfrederick74656
I'm working on a tree hashing project, which is designed for exactly this problem: off-the-shelf parallel hashing of large files. It works now, though it hasn't been reviewed, and there's a good chance that changes from review will result in changes to the final digest. That said, it's very fast: https://github.com/oconnor663/bao
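To illustrate the general idea of tree hashing — this is a toy Merkle construction, not bao's actual format; the one-byte leaf/node domain-separation prefixes are an arbitrary assumption:

```python
import hashlib

def merkle_root(leaves):
    """Compute a toy Merkle root over a list of byte strings.

    Leaves can be hashed independently (and in parallel); only the
    short combining passes over digests are sequential per level."""
    level = [hashlib.sha256(b'\x00' + leaf).digest() for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2:                 # duplicate the last node on odd levels
            level.append(level[-1])
        level = [hashlib.sha256(b'\x01' + level[i] + level[i + 1]).digest()
                 for i in range(0, len(level), 2)]
    return level[0].hex()
```

The appeal for huge files is that the expensive leaf hashing parallelizes perfectly, while the tree of digests that remains is tiny by comparison.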
answered Oct 1 '18 at 17:14
Jack O'Connor
You can use md5deep for this, and hashdeep for other hashes. It supports multithreading with the -j flag; by default it creates one hashing thread per core. It also has a flag to break files into pieces before hashing, but it will not use multiple threads on a single file. I've used it to get SHA-256 hashes of half a million files and it worked great. It also has a recursive flag, which makes handling large directory trees easier.
Here is the man page: http://md5deep.sourceforge.net/md5deep.html and the git repo: https://github.com/jessek/hashdeep
The package name in Ubuntu and Debian is md5deep, and it includes hashdeep.
From the man page, I'd expect -j to use one thread per file given, not per part of a file.
– poige
Jul 3 '16 at 21:28
answered Jul 3 '16 at 18:47
Jason
It's easy to design a hashing algorithm that scales over multiple cores; it's just that the best-known hashing algorithms tend to be designed specifically to prevent that, so that tasks like finding hash collisions are as slow as possible.
Hashing functions that don't force serial processing may suit you, but that depends on what properties you expect of your hashing function. As such, I don't think you've given enough information for a good recommendation to be made.
As others have suggested, you can construct a hashing function as the hash of the concatenated hashes of the blocks of some given size in the original. So long as the block size is large enough to make it difficult to reverse the hashes of individual blocks, this is likely to work well enough for most purposes. How big that should be depends on how predictable the content of those blocks is. If you are able to estimate the entropy, and choose a block size such that you get 128+ bits of entropy per block, that should be sufficient for most purposes (and overkill for many where security is not the primary concern).
From a security point of view, you care about the degree of entropy at the block level, because otherwise finding a collision for a single block is enough to let a malicious actor substitute part of the content and get the same final hash.
It's perhaps worth noting that having a fixed block size renders the main weakness of MD5 irrelevant: an attacker cannot append extra data to a block.
If your needs are about preventing naturally occurring hash collisions rather than malicious ones, then you can no doubt afford to use a much faster checksum function. Cryptographically secure hashes are typically designed to be slow to calculate.
A function from the Skein family using the optional hash-tree mode might suit you. Then again, CRC32 might be all you need.
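If only accidental corruption matters, Python's standard library already ships CRC32; a minimal streaming sketch (the 1 MiB read size is an arbitrary choice):

```python
import zlib

def crc32_file(path, block=1 << 20):
    """Streaming CRC32 of a file.

    Fast and constant-memory, but offers no protection against
    deliberately engineered collisions -- accidental corruption only."""
    crc = 0
    with open(path, 'rb') as f:
        while True:
            data = f.read(block)
            if not data:
                break
            crc = zlib.crc32(data, crc)   # fold each block into the running CRC
    return crc & 0xFFFFFFFF
```

Because the CRC is folded block by block, the result is independent of the read size, so the block parameter is purely a throughput knob.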
I could throw in some stuff about the controversy around SHA-3 (i.e. Keccak) if you like, but perhaps you should just read about it on Wikipedia.
– mc0e
Jul 4 '16 at 7:59
edited Jul 4 '16 at 7:31
answered Jul 4 '16 at 7:01
mc0e
Thanks for contributing an answer to Server Fault!
"Huge files" is plural, and hashing can be scaled out to multiple cores by hashing more than one file at a time. One way to do that in the shell is with GNU Parallel.
– Brian
Jun 26 '16 at 16:27
In my experience, hashing is disk-I/O bound, at least on a desktop. Besides, a big task usually involves many files, so several files can be hashed in parallel.
– mmv-ru
Jun 29 '16 at 0:08
Did you try fciv.exe on Windows to see whether it's better suited to a multi-core CPU? (support.microsoft.com/en-ca/kb/841290)
– yagmoth555♦
Jun 29 '16 at 12:54
@yagmoth555, nope. I rarely use Windows; mostly I don't use it at all, I'd say. From the description it looks unlikely that it scales on SMP.
– poige
Jun 29 '16 at 13:37