QUESTION: A four-fold improvement? Yes, I would be very happy with that :) A four-fold improvement over the RFC implementation or OpenSSL or what? I am already using optimized scalar routines, so I was only expecting a 2x increase, at most, but I would certainly accept a 4x increase :)
Alright. I got this code to compile alright under Microsoft Visual Studio 2008 Express Edition, but I am at a loss to implement it. First off all allow me to clarify: this code is SSE2-accelerated MD5, capable of computing 4(?) hashes simultaneously? Is this understanding correct?
Could you show me how to implement this? I would like the flexibility to encrypt 4 or more arrays or binary buffers simultaneously. This routine does not seem to have the padding/append length functions necessary for preparing input for the actual MD5 compress function? Correct? If I could pad each arbitrary-length array individually, I could store the results in a buffer like this 'unsigned char' since after padding, all values would be 32 bytes, anyway (correct?). I could then encrypt that buffer and store the results (4 MD5 hashes) back in the same buffer so that I can compare each value to the hash I am trying to recover to see if they match.
Does this sound feasible? Could you show me how I could pad arbitrary-length inputs one at a time, store them in a 2D buffer, encrypt all 4 values simultaneously using this SSE2-accelerated MD5 compress function, and store the results back in the same buffer, where I could check them one-by-one, by array element?
If I am totally off-base or am misunderstanding the function of this code or something regarding SSE2 or MD5, please correct me.
again, thank you for your time.
ANSWER: Let me clarify.
MD4 and its successors have been designed with 32bit efficiency in mind. The MD5 reference digest algorithm processes the message in successive 512-bit chunks - divided into sixteen 32-bit elements.
On modern processors, raw cryptographic algorithms like DES and MD5 can be made faster by implementing their bit-sliced, vectorized, or parallelized versions.
The bit-sliced version modifies the algorithm to use 128-bit SSE2 registers. It treats a 512-bit chunk as four 128-bit elements (instead of sixteen 32-bit elements), thereby cutting down the number of instructions by a factor of four. Hence, the four-fold improvement in speed over the non sse version. The algorithm used is the RSA security implementation of MD5; the four fold increase is over the same algorithm (same source code), first without and then with sse2 instructions enabled. Padding was not incorporated as this was just a 'proof of concept' implementation.
The vectorized version of the algorithm can be used on CPUs that support 32-bit vector elements (and a reasonable set of parallel operations on such elements, including 32-bit addition) as is the case with SSE2. This approach is the one that computes four digests in parallel - each 128-bit register is used as a vector of four different 32-bit elements, one from each message. On such processors, this certainly is more efficient than the bit-sliced implementation - not surprising as MD5 had originally been designed with 32bit efficiency in mind.
The parallelized version of the algorithm uses multiple threads to compute the digest; this attemts to exploit multiple cores and/or processors available on modern machines. For example, on a quad processor machine, four different MD5 hashes could be computed simultaneously in four different threads.
What you seem to be looking for is the vectorized version of the algorithm (and not the bit-sliced version).
---------- FOLLOW-UP ----------
QUESTION: I see. So both the bitslice version and the vectorized version of MD5 exploit SSE2 to accelerate the code. But the vectorized is faster. It is also (probably) significantly more complicated to implement, if bitslice computes one hash at a time while vector computes 4+ hashes.
You said you were able to benchmark this? What were the benchmarks on your machine, and what type/speed of processor do you have? I know that different processors will perform in different ways, but I thought if you had a core2 to run it on, I might be able to estimate a benchmark on my machine and see if the bitslice is worth pursuing.
I originally was researching implementations which used SSE2 to compute what fraction of MD5 could be computed in parallel. They computed one hash at a time, and as such would probably be easier to implement in place of my current routines. Their performance in comparison to the vectorized versions made the vectorized versions worth pursuing, however.
But if you are getting a four-fold increase over scaler using bitslice, it may be worth the tradeoff of speed for simplicity. So could you send me some benchmarks?
ANSWER: > What were the benchmarks on your machine? So could you send me some benchmarks?
I had given the source code that was benchmarked, the compiler switches used for both the non-sse and the sse versions and the benchmark results in my earlier answer:
> what type/speed of processor do you have?
I had bechmarked it on my Thinkpad T61 laptop; processor, compiler and os details are
vijayan@~:>dmesg | grep CPU:
CPU: Intel(R) Core(TM)2 Duo CPU T7100 @ 1.80GHz (1795.51-MHz 686-class CPU)
vijayan@~:>c++ --version | grep GCC
g++44 (GCC) 4.4.1 20090714 (prerelease)
FreeBSD 7.1-PRERELEASE i386
note that the os (and compiler) are for i386 ie. 32-bit versions
---------- FOLLOW-UP ----------
QUESTION: I was looking for MD5 computations per second, just just benchmark in comparison to scaler version. I'm interpreting your benchmark as it took 1.67 seconds to compute whatever it is computing in scaler, and .41 seconds to compute it in vector? True? The only problem is I don't know how many total keys it is computing.
The program computes the MD5 for 64 million bytes of message data for both the non-sse scalar (32-bit elements) and the sse bit-sliced (128-bit element) versions.
For the non sse version, it computes the MD5 31250 times, each time for a message consisting of 2048 bytes of data - this took 1.661 seconds cpu time and 1.67 seconds elapsed time.
For the sse version, it computes the MD5 7812 times, each time for for a message consisting of 8192 bytes of data - this took 0.407 seconds cpu time and 0.41 seconds elapsed time.