15 January 2024
I have needs. Very specific needs. Namely, I want:
- To hash a corpus of English words, individually
- …using a cryptographic hash
- …in Python.
Putting aside that this is a strange thing to want to do, I’ve compiled a few benchmarks to help me choose the best function for my purposes. I’ve found plenty of benchmarks for all these hash functions, but they’re almost always for very large inputs.
These benchmarks might be useful to you if you are in the same (unlikely) situation as I am, but also if you’re interested in knowing how hash functions perform on data smaller than a kilobyte✪On my particular PC..
These are the cryptographic hash functions we’ll be benchmarking:
|32 bytes, extensible
|HAIFA, Merkle tree
|arbitrary (here, 32 bytes)
|arbitrary (here, 32 bytes)
|Series of Unique Block Iterations using the Threefish cipher.
I also had a passing curiosity on how non-cryptographic hashes might compare. These are the ones I compared:
|8 or 16 bytes
|Algorithmically generated. Vaguely explained.
|8 or 16 bytes (here, 16)
|8, 16, or 32 bytes (here, 16 and 32)
As much as possible, I tried to use the built-in hashlib library. The blake3 library is a thin wrapper over the official Rust implementation of Blake3, and besides that, it’s the only Python implementation I could find. Much of PySkein is implemented in C and is optimized for performance.
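As a sketch of how these libraries get invoked (only hashlib ships with Python, so the third-party blake3 and pyskein calls are shown as comments):

```python
import hashlib

word = "odyssey".encode("utf-8")

# Built-in hashlib constructors used in this benchmark.
digests = {
    "sha1": hashlib.sha1(word).hexdigest(),
    "sha256": hashlib.sha256(word).hexdigest(),
    "sha3_256": hashlib.sha3_256(word).hexdigest(),
    "blake2b": hashlib.blake2b(word).hexdigest(),
    "blake2s": hashlib.blake2s(word).hexdigest(),
}

# The third-party libraries follow the same hashlib-style interface:
# import blake3; blake3.blake3(word).hexdigest()
# import skein;  skein.skein512(word).hexdigest()
```

All of these share the `constructor(data).hexdigest()` pattern, which is what makes swapping functions in and out of the benchmark harness trivial.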
The Word Benchmark
Our first benchmark is the “Word” benchmark, wherein individual words from three different corpora are hashed using individual hash functions.
The different corpora are:
|Homer’s Odyssey as found on Project Gutenberg.
|A dataset of 365 English WikiNews articles from 2004 to 2014. The dataset is truncated to 1 MB.
|All messages from the sci.crypt newsgroup in the 20 Newsgroups dataset. The dataset is truncated to 1 MB.
I chose the different datasets because they were created very differently, at different times, for different purposes. And because they were easy to find. You can see below that they differ quite a bit with respect to the length and number of words in each file. The Odyssey is a shorter text with smaller words.
sci.crypt has an unusual amount of 1- and 2-letter words, which probably comes from emoticons and symbols in the headers of the messages. In all cases, ostensibly all words are under 20 characters long. That means that 80 bytes is about as large as the input to the hash function will be✪All input is UTF-8 encoded..
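A minimal sketch of the per-word timing loop (this harness and the toy corpus are my reconstruction, not the actual benchmark code):

```python
import hashlib
import time

def bench_words(words, hasher):
    """Return total seconds to hash each UTF-8-encoded word individually."""
    encoded = [w.encode("utf-8") for w in words]
    start = time.perf_counter()
    for w in encoded:
        hasher(w).digest()  # one hash object per word, as in the benchmark
    return time.perf_counter() - start

# Toy corpus standing in for the real ones (Odyssey, WikiNews, sci.crypt).
corpus = "sing in me muse and through me tell the story".split()
elapsed = bench_words(corpus, hashlib.blake2s)
```

The key detail is that each word gets its own fresh hash object, so per-call setup cost is part of what’s being measured — which matters a lot at these input sizes.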
What we see in the results below is that, as expected, the Odyssey corpus has consistently smaller hash times than the other two, followed by WikiNews and then sci.crypt.
Among the cryptographic hash functions, the Blake2 family takes the cake, with Blake2s out-performing all but one of the non-cryptographic functions.
Among the non-cryptographic hash functions, ho-boy is MurMurHash3 fast. That said, all the hash functions here are actually very fast in absolute terms (as far as I’m concerned).
Blake3 and SHA-3’s dead-last performance is a bit surprising to me. Blake3 especially has a reputation for being the fastest gun in the west, and by quite some margin in the benchmarks I’ve seen. What’s more, its little brothers do so well, which is a bit weird considering they share so much DNA. But I guess I shouldn’t be too surprised; it’s well known that hashes perform differently on short inputs (hence this benchmark).
The Random Benchmark
I wanted to see how hash performance changed with input size, so I began by hashing random input generated by Python’s built-in random.randbytes function✪This uses the Mersenne Twister, a non-cryptographic pseudo-random number generator. for input sizes ranging from zero bytes to one gigabyte:
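Sketched out, with the sizes scaled way down (the real run went up to a gigabyte, and this harness is my approximation, not the original code):

```python
import hashlib
import random
import time

def bench_random(hasher, size):
    """Time one hash over `size` bytes of Mersenne Twister output."""
    data = random.randbytes(size)  # requires Python 3.9+
    start = time.perf_counter()
    hasher(data).digest()
    return time.perf_counter() - start

timings = {size: bench_random(hashlib.sha256, size)
           for size in (0, 1024, 1024 * 1024)}
```

Note that generating the random bytes happens outside the timed region, so the Mersenne Twister’s own (considerable) cost at gigabyte scale doesn’t pollute the hash timings.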
Well… uh… we can’t really make out any input-size-dependent differences in relative hash performance. It’s still useful, however. Blake3 lives up to its reputation, being nearly five-fold faster. I was a bit surprised to see SHA-1 and SHA-256 performing so closely, because I feel like I’ve often heard people belly-ache about SHA-256 performance when having to migrate from SHA-1 for security reasons. SHA-1 is also a lot faster here than in these benchmarks from the Blake2 team. Then I realized that my CPU supports the SHA instruction set, which accelerates both SHA-1 and SHA-256 in hardware.
SHA-3 has caught flack for being very slow in software, and we definitely see that here. I can’t wait for the apparently “blazing fast” hardware implementations, because we see here what wonders they did for SHA-256. For my purposes, however, it’s not useful.
So anywho, can we see those input-size-dependent performance changes between 0 and 1 kilobytes?
Aha! Another story emerges at this resolution. We can see that Blake2s loses its lead when inputs start getting larger than 100 bytes. SHA-1 and Blake2b are pretty similar up until around 500 bytes when Blake2b begins getting slower.
Both accelerated SHA-1 and SHA-256 are pretty consistent, independent of input size.
SHA-3 has this weird step-function thing going on, where the steps are about… hmm… that looks suspiciously like it could be 136 bytes, which is the block size of the 256-bit variant of SHA-3 that we’re using.
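hashlib exposes that block size directly, so the hunch is easy to check:

```python
import hashlib

# SHA3-256 absorbs input in "rate"-sized blocks: (1600 - 2*256) / 8 = 136 bytes,
# which matches the step width in the plot.
print(hashlib.sha3_256().block_size)  # 136
```

Each time the input crosses a multiple of the rate, the sponge has to run another full Keccak permutation, which is exactly the staircase shape in the chart.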
Blake3 is still slow even at 1 kB, which is enough to encode at least one UTF-8-encoded tweet.
All the benchmarks were run on my PC with the following specs:
|AMD Ryzen Threadripper 1950X
For my weird “hashing English words” use case, Blake2s is the hash function to use based on these numbers. If I relax my needs to include non-cryptographic functions, then MurMurHash3 is probably what I’d go for✪Un-keyed hashes like MurMurHash3 are susceptible to DoS attacks, so you’d want to use something like SipHash or a fast cryptographic hash depending on your use case..
If I had more data (megabytes or gigabytes) to hash, I’d likely use Blake3. But on inputs up to 1 kB, Blake3 is beaten by half of the cryptographic hash functions tested.
- These functions are all very fast in absolute terms.
- These benchmarks might look very different if you run them.
The code I used to run the benchmarks is here, but be forewarned, it was written in a fugue state.
There are a lot of fine hash functions, particularly non-cryptographic hash functions, that were not included in this analysis for various reasons:
KangarooTwelve is basically Keccak (which is basically SHA-3), but with many fewer rounds. It’s meant to be fast, but when I tested it using the pycryptodome library, it was orders of magnitude slower than SHA-3. I’m going to chalk this up to an implementation error on my part or pycryptodome’s, since the results really made no sense.
AquaHash is a non-cryptographic hash by the author of MetroHash. It is specifically designed to deliver “state-of-the-art performance across all key sizes” by employing two algorithms: one that kicks in for smaller inputs, and another for larger ones. It would be great to see in this analysis, but I couldn’t find any Python bindings for the reference C code.
Haraka v2 is something that comes up a lot when I search for “short-input hash”. It’s cryptographic and uses AES instructions to accelerate its computation. There looks to be a reference C and Python implementation online, but I didn’t realize that until after I ran the analysis. It would be very neat to check out.
Meowhash is the Hacker News darling by the Handmade Hero himself and his company Molly Rocket. It’s a non-cryptographic hash function accelerated by AES CPU instructions. It’s meant to be really fast, but I couldn’t find a Python implementation or binding, so it’s not included.