C++/SSE2-accelerated MD4
Expert: vijayan - 8/29/2009
QuestionQUESTION: I need help implementing an SSE2-accelerated MD4 implementation. I found one that I think may work at
http://www.freerainbowtables.com/phpBB3/5/viewtopic3.php.htm#p8248
but need help implementing it. It is a little over my head. All I want to do is conduct 4 MD4 encryptions simultaneously using SSE2.
I have this, for example:
char candidate0[]="password0";
char candidate1[]="password1";
char candidate2[]="password2";
char candidate3[]="password3";
I believe I need to pad them and append the length, or whatever the MD4 algorithm requires, before feeding them into the compressSse() function which will do the rest of the work to turn them into 4 MD4 hashes.
Maybe you know of a better SSE2-accelerated MD4 routine? Easier/ faster? I also plan to use this for NTLM and MSCACHE, so I would like to retain the flexibility to use or not to use UNICODE conversion. I think the implementation above might include UNICODE conversion. I would like to keep that separate, so I can use it for NTLM and MD4.
here is pseudocode for my planned future implementation in a password brute forcer (sorry if it's a mess :$):
int cached_hashes=0;//counts how many plaintexts are ready, when 4, compute
char candidate[];//the candidate password
char plaintexts[4][];//stores all 4 plaintexts
unsigned char ready_candidate[16];//plaintexts prepared for the compressSse() function
unsigned char encrypted[16];//the encrypted password we are trying to recover
while(bruteforcing)
{
candidate++;
memcpy(plaintexts[cached_hashes],candidate, strlen(candidate));
//possibly UNICODE conversion here, depending on hash type
prepare candidate- padding, whatever- store in ready_candidate
memcpy(input[cached_hashes],ready_candidate,16);
cached_hashes++;
if(cached_hashes==4)//if 4 candidates ready and stored...
{
//MD4 encrypt them all simultaneously
compressSse();
cached_hashes=0;
//I'm assuming that compressSse() writes the MD4 hashes
//back into input[] elements, respectively
for(int i=0;i<4;i++)
if(!memcmp(input[i],encrypted,16)
cout<<"plaintext is "<<plaintext[i]<<"!\n";
}
}
I generate one candidate at a time, and then wait until I have collected 4 candidates and then encrypt them all simultaneously. Then I check the results one-by-one, again.
I think once I can get an SSE2-accelerated MD4 implementation up and running, I should be able to modify it to turn it into an SSE2-accelerated MD5 implementation.
ANSWER: The link that you posted gives a 404 NOT FOUND
Here are two C++ crypto libraries that I've used. They use SSE2, but perhaps not for obsolescent algorithms like MD4 or MD5. Check them out and see if they meet your requirements:
Crypto++ -
http://www.cryptopp.com/
Botan -
http://botan.randombit.net/about.html
and one which is very easy to use, but definitely does not use SSE2
hashlib++ -
http://hashlib2plus.sourceforge.net/
---------- FOLLOW-UP ----------
QUESTION: I checked out the implementations you provided. I don't think either of them use SSE2 for MD4/MD5 computation. MD4 and MD5 can only be partially conducted in SIMD-parallel for a single hash, and at very little speed advantage. These implementations only take a single hash in the function call, so I doubt they compute multiple hashes in parallel. Correct me if any of this information is wrong.
Here is the link to the implementation I was looking at:
http://www.freerainbowtables.com/phpBB3/viewtopic.php?f=6&t=904&sid=4e3e3efc6f55
hopefully that link works. I just am having a little trouble understanding how to use this implementation. I get the idea that I need to do some padding, but loading the input[][][] array is still a little over my head.
ANSWER: I had a look at the link.
It is not very clear to me from what I saw there how the array input[][][] is used. I could guess, but that would be risky.
---------- FOLLOW-UP ----------
QUESTION: Corni gave me an implementation of it in an mscache cracker. I still couldn't quite make sense of it. Code is a little convoluted. Perhaps it may help?
here is the interchange:
"Hi,
first, i've not really looked at this code for a long time, but let's see what i get together...
first, for your compiling errors, you need to #include the right header. google is your friend here ;)
second, for gcc as compiler (which I assume Dev-C++ uses) you've to add the -msse2 command line switch to compile.
(That was the easy part :D)
For the rest, i've written a brute-forcer for mscache. It isn't particularly nice written or feature rich, but it provides examples for padding and loading the input, and even has a dumb plaintext-generator for a specified string length. Beware, the code is ugly (and sometimes just with german comments)...
I've just removed the char target[], but you can use the char targetTest[] as target...
md4.h just contains the sse macros you already have."
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <sys/time.h>
#include <xmmintrin.h>
#include "md4.h"
//#define test
#pragma pack(1)
//all messages are in windows unicode!
unsigned short mixedalphanumeric[63];
unsigned long long count=0;
timeval start;
__m128i a,b,c,d;
unsigned characters[4][STRING_MAX];
unsigned short input[16][4][2] __attribute__ ((aligned (16)));
int part=0;
unsigned short string[STRING_MAX]={0};//for looking up the used hash
unsigned char target[]=
{//857B55E2B52DB477EC3820A26BD6B1CE - administrator of schule
0x85,0x7B,0x55,0xE2,0xB5,0x2D,0xB4,0x77,0xEC,0x38,0x20,0xA2,0x6B,0xD6,0xB1,0xCE
};
unsigned char targetTest[]=//aaaaaf
{//1419ebe422ff17029c8e82914fdf08e4
0x14,0x19,0xeb,0xe4,0x22,0xff,0x17,0x02,0x9c,0x8e,0x82,0x91,0x4f,0xdf,0x08,0xe4
};
void printInput()
{
for(int j=0; j<4; j++)
{
for(int i=0; i<STRING_MAX; i++)
{
//print_hex((unsigned char*)&input[i][j][0],4);
printUnicode(&input[i][j][0],2,0 );
}
printf("\n");
}
}
void initSse()
{
memset(input,0,sizeof(input));
characters[0][STRING_MAX-1]=0;
characters[1][STRING_MAX-1]=1;
characters[2][STRING_MAX-1]=2;
characters[3][STRING_MAX-1]=3;
}
inline void compressSse()
{
//set the initial values
a=_mm_set_epi32(0x67452301,0x67452301,0x67452301,0x67452301);
b=_mm_set_epi32(0xefcdab89,0xefcdab89,0xefcdab89,0xefcdab89);
c=_mm_set_epi32(0x98badcfe,0x98badcfe,0x98badcfe,0x98badcfe);
d=_mm_set_epi32(0x10325476,0x10325476,0x10325476,0x10325476);
FFSSE(a, b, c, d, SSE_LOAD(0), S11); /* 1 */
FFSSE(d, a, b, c, SSE_LOAD(1), S12); /* 2 */
FFSSE(c, d, a, b, SSE_LOAD(2), S13); /* 3 */
FFSSE(b, c, d, a, SSE_LOAD(3), S14); /* 4 */
FFSSE(a, b, c, d, SSE_LOAD(4), S11); /* 5 */
FFSSE(d, a, b, c, SSE_LOAD(5), S12); /* 6 */
FFSSE(c, d, a, b, SSE_LOAD(6), S13); /* 7 */
FFSSE(b, c, d, a, SSE_LOAD(7), S14); /* 8 */
FFSSE(a, b, c, d, SSE_LOAD(8), S11); /* 9 */
FFSSE(d, a, b, c, SSE_LOAD(9), S12); /* 10 */
FFSSE(c, d, a, b, SSE_LOAD(10), S13); /* 11 */
FFSSE(b, c, d, a, SSE_LOAD(11), S14); /* 12 */
FFSSE(a, b, c, d, SSE_LOAD(12), S11); /* 13 */
FFSSE(d, a, b, c, SSE_LOAD(13), S12); /* 14 */
FFSSE(c, d, a, b, SSE_LOAD(14), S13); /* 15 */
FFSSE(b, c, d, a, SSE_LOAD(15), S14); /* 16 */
/* Round 2 */
GGSSE(a, b, c, d, SSE_LOAD(0), S21); /* 17 */
GGSSE(d, a, b, c, SSE_LOAD(4), S22); /* 18 */
GGSSE(c, d, a, b, SSE_LOAD(8), S23); /* 19 */
GGSSE(b, c, d, a, SSE_LOAD(12), S24); /* 20 */
GGSSE(a, b, c, d, SSE_LOAD(1), S21); /* 21 */
GGSSE(d, a, b, c, SSE_LOAD(5), S22); /* 22 */
GGSSE(c, d, a, b, SSE_LOAD(9), S23); /* 23 */
GGSSE(b, c, d, a, SSE_LOAD(13), S24); /* 24 */
GGSSE(a, b, c, d, SSE_LOAD(2), S21); /* 25 */
GGSSE(d, a, b, c, SSE_LOAD(6), S22); /* 26 */
GGSSE(c, d, a, b, SSE_LOAD(10), S23); /* 27 */
GGSSE(b, c, d, a, SSE_LOAD(14), S24); /* 28 */
GGSSE(a, b, c, d, SSE_LOAD(3), S21); /* 29 */
GGSSE(d, a, b, c, SSE_LOAD(7), S22); /* 30 */
GGSSE(c, d, a, b, SSE_LOAD(11), S23); /* 31 */
GGSSE(b, c, d, a, SSE_LOAD(15), S24); /* 32 */
/* Round 3 */
HHSSE(a, b, c, d, SSE_LOAD(0), S31); /* 33 */
HHSSE(d, a, b, c, SSE_LOAD(8), S32); /* 34 */
HHSSE(c, d, a, b, SSE_LOAD(4), S33); /* 35 */
HHSSE(b, c, d, a, SSE_LOAD(12), S34); /* 36 */
HHSSE(a, b, c, d, SSE_LOAD(2), S31); /* 37 */
HHSSE(d, a, b, c, SSE_LOAD(10), S32); /* 38 */
HHSSE(c, d, a, b, SSE_LOAD(6), S33); /* 39 */
HHSSE(b, c, d, a, SSE_LOAD(14), S34); /* 40 */
HHSSE(a, b, c, d, SSE_LOAD(1), S31); /* 41 */
HHSSE(d, a, b, c, SSE_LOAD(9), S32); /* 42 */
HHSSE(c, d, a, b, SSE_LOAD(5), S33); /* 43 */
HHSSE(b, c, d, a, SSE_LOAD(13), S34); /* 44 */
HHSSE(a, b, c, d, SSE_LOAD(3), S31); /* 45 */
HHSSE(d, a, b, c, SSE_LOAD(11), S32); /* 46 */
HHSSE(c, d, a, b, SSE_LOAD(7), S33); /* 47 */
HHSSE(b, c, d, a, SSE_LOAD(15), S34); /* 48 */
// Finally, add initial values, as this is the only pass we make.
a=_mm_add_epi32(a,_mm_set_epi32(0x67452301,0x67452301,0x67452301,0x67452301));
b=_mm_add_epi32(b,_mm_set_epi32(0xefcdab89,0xefcdab89,0xefcdab89,0xefcdab89));
c=_mm_add_epi32(c,_mm_set_epi32(0x98badcfe,0x98badcfe,0x98badcfe,0x98badcfe));
d=_mm_add_epi32(d,_mm_set_epi32(0x10325476,0x10325476,0x10325476,0x10325476));
}
volatile bool stop=false;
inline void setInput2(unsigned short* charset, int no, int recursion=0);
inline void setInput(int recursion=0)
{
static bool restart=false;
//static bool stop=false;
int i=0;
if(recursion==STRING_MAX)
{
//printf("max\n");
// stop=true;
return;
}
//check the for characters arrays for too high values
while(i<4)
{
if(characters[i][recursion]>=62)
{
characters[i][recursion]-=62;
restart=true;
if(recursion>1)
characters[i][recursion-1]++;
else if(recursion==0||recursion==1)
{
printf("reached end of part %d, stopping\n",part);
exit(0);
}
//("wrap for %d\n",i);
}
i++;
}
if(restart)
{
if(recursion)
return;
restart=false;
setInput();
return;
}
/*while(charset[characters[0][recursion]]!=0)
{
string[recursion]=charset[characters[0][recursion]];
input[recursion/2][no][recursion%2]=mixedalphanumeric[characters[0][recursion]];
setInput2(charset, no, recursion+1);
#if STRING_MAX>6
if(recursion==0)
{ compressSse();
printf("finished part: %d\n",part);
exit(0);
}
#endif
characters[0][recursion]++;
}*/
//while((mixedalphanumeric[characters[recursion]]!=0)&&(!stop))
{
//i ist die hash-nummer 0...3
//[x] ist jeweils die string-nummer, recursion der character-index
string[recursion]=mixedalphanumeric[characters[0][recursion]];
input[recursion/2][0][recursion%2]=mixedalphanumeric[characters[0][recursion]];
input[recursion/2][1][recursion%2]=mixedalphanumeric[characters[1][recursion]];
input[recursion/2][2][recursion%2]=mixedalphanumeric[characters[2][recursion]];
input[recursion/2][3][recursion%2]=mixedalphanumeric[characters[3][recursion]];
setInput(recursion+1);
if(recursion==STRING_MAX-1)
{
characters[0][recursion]+=4;
characters[1][recursion]+=4;
characters[2][recursion]+=4;
characters[3][recursion]+=4;
}
#if 0
for(i=0;(i<4)&&(mixedalphanumeric[characters[recursion]]!=0); i++)
{
//printf("i: %d\n",i);
input[recursion/2][i][recursion%2]=mixedalphanumeric[characters[recursion]];
if(STRING_MAX-1==recursion)
characters[recursion]++;
}
if(i<3&&recursion)
{
characters[recursion-1]++;
characters[recursion]=0;
}
//printf("step deeper\n");
if(!stop)
setInput(recursion+1);
characters[recursion]=0;//reset this
#endif
}
//characters[recursion]=0;//reset this
}
void saveCrackerState()
{
FILE* fpRest=fopen("restore","wb");
if(fwrite(&characters,1,sizeof(characters),fpRest)!=sizeof(characters))
{
printf("failed to save state of the cracker, exiting with failure!\n");
exit(-1);
}
fclose(fpRest);
}
unsigned char salt[]="administrator";
/*
{
//'a',0,'d',0,'m',0,'i',0,'n',0,'i',0,'s',0,'t',0,'r',0,'a',0,'t',0,'o',0,'r',0
};*/
inline void brute()
{
unsigned int __attribute__ ((aligned (16))) data[4][4];
memset(input,0,sizeof(input));
stop=false;
setInput();
//printInput();
//NTLM
//padding byte - strings are all in words, MS UNICODE
#if STRING_MAX%2
{
input[STRING_MAX/2][0][1]=0x80;
input[STRING_MAX/2][1][1]=0x80;
input[STRING_MAX/2][2][1]=0x80;
input[STRING_MAX/2][3][1]=0x80;
}
#else
{
input[STRING_MAX/2][0][0]=0x80;
input[STRING_MAX/2][1][0]=0x80;
input[STRING_MAX/2][2][0]=0x80;
input[STRING_MAX/2][3][0]=0x80;
}
#endif
//length
input[14][0][0]=STRING_MAX*2*8;
input[14][1][0]=STRING_MAX*2*8;
input[14][2][0]=STRING_MAX*2*8;
input[14][3][0]=STRING_MAX*2*8;
#ifndef test
compressSse();
#endif
//memset(input,0,sizeof(input));
_mm_store_si128((__m128i*)&input[0][0][0],a);
_mm_store_si128((__m128i*)&input[1][0][0],b);
_mm_store_si128((__m128i*)&input[2][0][0],c);
_mm_store_si128((__m128i*)&input[3][0][0],d);
for(unsigned int i=0; i<sizeof(salt)-1; i++)
{
input[4+i/2][0][i%2]=salt[i];
input[4+i/2][1][i%2]=salt[i];
input[4+i/2][2][i%2]=salt[i];
input[4+i/2][3][i%2]=salt[i];
}
//padding byte
input[10][0][1]=0x80;
input[10][1][1]=0x80;
input[10][2][1]=0x80;
input[10][3][1]=0x80;
//length
input[14][0][0]=128+(sizeof(salt)-1)*8*2;
input[14][1][0]=128+(sizeof(salt)-1)*8*2;
input[14][2][0]=128+(sizeof(salt)-1)*8*2;
input[14][3][0]=128+(sizeof(salt)-1)*8*2;
#ifndef test
compressSse();
#endif
//check
_mm_store_si128((__m128i*)data[0],a);
if((data[0][0]==*(unsigned int*)&target[0])||(data[0][1]==*(unsigned int*)&target[0])||(data[0][2]==*(unsigned int*)&target[0])||(data[0][3]==*(unsigned int*)&target[0]))
{
printf("potential match at input0=");
printUnicode(string,STRING_MAX,0);
printf("\n");
_mm_store_si128((__m128i *)data[1],b);
_mm_store_si128((__m128i *)data[2],c);
_mm_store_si128((__m128i *)data[3],d);
for(int i=0; i<4; i++)
{
if(*(unsigned int*)&target[0]==data[0][i]&&*(unsigned int*)&target[4]==data[1][i]&&*(unsigned int*)&target[8]==data[2][i]&&*(unsigned int*)&target[12]==data[3][i])
{
printf("match with input %d in part %d with input0=",i,part);
printUnicode(string,STRING_MAX,1);
printf("\n");
exit(0);
}
}
}
count++;
if(count>100000000)
{
//save state of characters array
int old=start.tv_sec;
gettimeofday(&start,NULL);
printf("100000000*4 hashes done in %d seconds, save our state, atm string: ", (int)(start.tv_sec-old));
printUnicode(string,STRING_MAX, 0);
printf("\n");
saveCrackerState();
count=0;
}
}
void doSse()
{
initSse();
while(1)
brute();
}
#pragma pack()
void print_hex(void* bs, unsigned int n)
{
unsigned int i;
for (i=0;i<n;i++)
printf("%02x",((unsigned char*)bs)[i]);
}
void printUnicode(unsigned short* ptS, int length, int ready)
{
FILE* fp=fopen("result", "ab");
if(ready)
fprintf(fp,"result!!!!: ");
for(int i=0; i<length; i++)
{
printf("%c",(char)*ptS);
fprintf(fp,"%c",(char)*ptS);
ptS++;
}
fprintf(fp,"\n");
fclose(fp);
}
void initArray()
{
//memset(characters,0,sizeof(characters));
memset(characters,0,sizeof(characters));
memset(mixedalphanumeric,0,sizeof(mixedalphanumeric));
memset(&start,0,sizeof(start));
gettimeofday(&start,NULL);
//characters[0]=part;
char* ar=(char*)mixedalphanumeric;
int array=0;
for(char i='a'; i<='z'; i++)
{
ar[array]=i;
array++;
ar[array]=0;
array++;
}
for(char i='A'; i<='Z'; i++)
{
ar[array]=i;
array++;
ar[array]=0;
array++;
}
for(char i='0'; i<='9'; i++)
{
ar[array]=i;
array++;
ar[array]=0;
array++;
}
}
int main(int argc, char* argv[])
{
FILE* fpRest=NULL;
initArray();
if(argc<=1)
{
printf("usage: either specify a part number from 0...61 or --restore!\n");
exit(0);
}
if(strncasecmp(argv[1],"--restore",strlen("--restore"))==0)
{
fpRest=fopen("restore", "rb");
if(fread(&characters, 1, sizeof(characters),fpRest)!=sizeof(characters))
{
printf("failed to read the restore file, exiting!\n");
exit(-1);
}
fclose(fpRest);
part=characters[0][0];
printf("continue the cracking\n");
}
else
{
part=atoi(argv[1]);
characters[0][0]=part;
characters[1][0]=part;
characters[2][0]=part;
characters[3][0]=part;
saveCrackerState();
printf("Please start again with ./hash --restore to process part %d\n",part);
exit(0);
}
printf("processing part: %d\n",part);
//make a backup in case I forgot a --restore
system("cp restore restorebackup");
system("cp restorebackup restorebackup2");
doSse();
return 0;
}
Answer> Code is a little convoluted
Yes. And I got differences in the MD5 hash (from the results from the reference implementation), so i suspect it is not completely bug-free.
The way I tend to use SSE3 is to let the compiler generate SSE3 instructions as part of optimization. The algorithm implementation has to be a bit-sliced one for this to work, I picked up a known, public domain, bit-slice implementation of the MD5 compression function by by Solar Designer <solar at openwall.com> (as part of his 'john the riper' implementation from here:
http://www.masm32.com/board/index.php?topic=8154.0
I got a four-fold improvement in performance with SSE2. Is this kind of improvement what you are looking for?
>alias c++
g++44 -Wall -std=c++98 -pedantic -Werror -I /usr/local/include -L /usr/local/lib
>c++ -s -O3 -fomit-frame-pointer -funroll-loops bitslicemd5.c
>time ./a.out
vector size = 32 bits
c09c4c1f 21876746 18aed2 70b452f0
1.661u 0.000s 0:01.67 99.4% 15+1134k 0+0io 0pf+0w
>c++ -s -O3 -fomit-frame-pointer -funroll-loops -msse3 -march=core2 -DVECTOR bitslicemd5.c
>time ./a.out
vector size = 128 bits
c09c4c1f 21876746 18aed2 70b452f0
0.407u 0.000s 0:00.41 97.5% 21+1144k 0+0io 0pf+0w
The code, completely unchanged, just sanitized for C++:
/*
* Proof-of-concept bitslice implementation of the MD5 compression function.
*
* Written by Solar Designer <solar at openwall.com> in 1997-1998, revised in
* 2007, and placed in the public domain. There's absolutely no warranty.
* Although this is not required formally, due credit will be appreciated if
* you re-use this code or concepts.
*
* Possible compiler invocations (tested on Linux/x86-64):
*
* # No vectors (e.g., native 64-bit registers):
* gcc md5slice.c -o md5slice -Wall -s -O3 -fomit-frame-pointer -funroll-loops
*
* # 128-bit vectors with SSE2:
* PATH=~/gcc-4.1.0/bin:$PATH gcc md5slice.c -o md5slice -Wall -s -O3 -fomit-frame-pointer -funroll-loops -msse2 -DVECTOR
*
* The -msse2 option is not required on x86-64 (it is implied), and it needs
* to be replaced with -maltivec for AltiVec. You may also add -march=... to
* match your CPU model.
*
* Systems with smaller L1 instruction caches may need -O3 replaced with -O2,
* or -fno-inline-functions may be passed explicitly.
*/
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
typedef unsigned int scalar;
typedef signed int element;
#ifdef VECTOR
typedef element vector __attribute__ ((vector_size (16)));
#else
typedef unsigned long vector;
#endif
#define S11 7
#define S12 12
#define S13 17
#define S14 22
#define S21 5
#define S22 9
#define S23 14
#define S24 20
#define S31 4
#define S32 11
#define S33 16
#define S34 23
#define S41 6
#define S42 10
#define S43 15
#define S44 21
static void add32( vector* x, vector* y, vector* z)
{
int i;
vector a, b, c, p;
i = 32;
c ^= c; // c = 0;
do
{
a = *x++;
b = *y++;
*z++ = (p = a ^ b) ^ c;
c = (p & c) | (a & b);
} while (--i);
}
static void add32c( scalar c, vector* x, vector* y, vector* z )
{
int i;
vector a, b, c1, c2, p, q;
i = 32;
c1 ^= c1; c2 ^= c2; // c1 = c2 = 0;
do {
a = *x++;
b = *y++;
if (c & 1)
{
*z++ = ~(a ^ b) ^ c1 ^ c2;
c2 = (a & b & (p = c1 | c2)) | (c1 & c2 & (q = a | b));
c1 = p | q;
}
else
{
*z++ = (q = (p = a ^ b) ^ c1) ^ c2;
c1 = (p & c1) | (a & b);
c2 &= q;
}
c >>= 1;
} while (--i);
}
static void F( vector* x, vector* y, vector* z, vector* f )
{
int i;
vector a;
i = 32;
do
{
a = *z++;
*f++ = a ^ (*x++ & (*y++ ^ a));
} while (--i);
}
static void G( vector*x, vector*y, vector*z, vector*f )
{
F(z, x, y, f);
}
static void H( vector* x, vector* y, vector* z, vector* f )
{
int i;
i = 32;
do
{
*f++ = *x++ ^ *y++ ^ *z++;
} while (--i);
}
static void I( vector* x, vector* y, vector* z, vector* f )
{
int i;
i = 32;
do
{
*f++ = *y++ ^ (*x++ | ~*z++);
} while (--i);
}
static void rol32( int n, vector* x, vector* y )
{
int i, j;
j = 32 - (i = n);
do
{
*y++ = x[j++];
} while (--i);
i = 32 - n;
do
{
*y++ = *x++;
} while (--i);
}
static void const32( scalar x, vector* y )
{
int i;
vector zero, ones;
zero ^= zero;
ones = ~zero;
i = 32;
do
{
if (x & 1)
*y++ = ones;
else
*y++ = zero;
x >>= 1;
} while (--i);
}
static void XX( void (*f) ( vector* x, vector* y, vector* z, vector* f ),
vector* a, vector* b, vector* c, vector* d, vector* x,
int s, scalar ac )
{
vector tmp[32];
f(b, c, d, tmp);
add32c(ac, x, tmp, tmp);
add32(a, tmp, tmp);
rol32(s, tmp, a);
add32(b, a, a);
}
#define FF(a, b, c, d, x, s, ac) XX(F, a, b, c, d, x, s, ac)
#define GG(a, b, c, d, x, s, ac) XX(G, a, b, c, d, x, s, ac)
#define HH(a, b, c, d, x, s, ac) XX(H, a, b, c, d, x, s, ac)
#define II(a, b, c, d, x, s, ac) XX(I, a, b, c, d, x, s, ac)
static union
{
vector v[4][32];
element e[4][32][sizeof(vector) / sizeof(element)];
} state;
static vector init[4][32];
static void md5_init(void)
{
const32(0x67452301, init[0]);
const32(0xefcdab89, init[1]);
const32(0x98badcfe, init[2]);
const32(0x10325476, init[3]);
}
static void md5_xform( vector x[16][32] )
{
vector a[32], b[32], c[32], d[32];
memcpy(a, init[0], sizeof(a));
memcpy(b, init[1], sizeof(b));
memcpy(c, init[2], sizeof(c));
memcpy(d, init[3], sizeof(d));
/* Round 1 */
FF(a, b, c, d, x[0], S11, 0xd76aa478); /* 1 */
FF(d, a, b, c, x[1], S12, 0xe8c7b756); /* 2 */
FF(c, d, a, b, x[2], S13, 0x242070db); /* 3 */
FF(b, c, d, a, x[3], S14, 0xc1bdceee); /* 4 */
FF(a, b, c, d, x[4], S11, 0xf57c0faf); /* 5 */
FF(d, a, b, c, x[5], S12, 0x4787c62a); /* 6 */
FF(c, d, a, b, x[6], S13, 0xa8304613); /* 7 */
FF(b, c, d, a, x[7], S14, 0xfd469501); /* 8 */
FF(a, b, c, d, x[8], S11, 0x698098d8); /* 9 */
FF(d, a, b, c, x[9], S12, 0x8b44f7af); /* 10 */
FF(c, d, a, b, x[10], S13, 0xffff5bb1); /* 11 */
FF(b, c, d, a, x[11], S14, 0x895cd7be); /* 12 */
FF(a, b, c, d, x[12], S11, 0x6b901122); /* 13 */
FF(d, a, b, c, x[13], S12, 0xfd987193); /* 14 */
FF(c, d, a, b, x[14], S13, 0xa679438e); /* 15 */
FF(b, c, d, a, x[15], S14, 0x49b40821); /* 16 */
/* Round 2 */
GG(a, b, c, d, x[1], S21, 0xf61e2562); /* 17 */
GG(d, a, b, c, x[6], S22, 0xc040b340); /* 18 */
GG(c, d, a, b, x[11], S23, 0x265e5a51); /* 19 */
GG(b, c, d, a, x[0], S24, 0xe9b6c7aa); /* 20 */
GG(a, b, c, d, x[5], S21, 0xd62f105d); /* 21 */
GG(d, a, b, c, x[10], S22, 0x2441453); /* 22 */
GG(c, d, a, b, x[15], S23, 0xd8a1e681); /* 23 */
GG(b, c, d, a, x[4], S24, 0xe7d3fbc8); /* 24 */
GG(a, b, c, d, x[9], S21, 0x21e1cde6); /* 25 */
GG(d, a, b, c, x[14], S22, 0xc33707d6); /* 26 */
GG(c, d, a, b, x[3], S23, 0xf4d50d87); /* 27 */
GG(b, c, d, a, x[8], S24, 0x455a14ed); /* 28 */
GG(a, b, c, d, x[13], S21, 0xa9e3e905); /* 29 */
GG(d, a, b, c, x[2], S22, 0xfcefa3f8); /* 30 */
GG(c, d, a, b, x[7], S23, 0x676f02d9); /* 31 */
GG(b, c, d, a, x[12], S24, 0x8d2a4c8a); /* 32 */
/* Round 3 */
HH(a, b, c, d, x[5], S31, 0xfffa3942); /* 33 */
HH(d, a, b, c, x[8], S32, 0x8771f681); /* 34 */
HH(c, d, a, b, x[11], S33, 0x6d9d6122); /* 35 */
HH(b, c, d, a, x[14], S34, 0xfde5380c); /* 36 */
HH(a, b, c, d, x[1], S31, 0xa4beea44); /* 37 */
HH(d, a, b, c, x[4], S32, 0x4bdecfa9); /* 38 */
HH(c, d, a, b, x[7], S33, 0xf6bb4b60); /* 39 */
HH(b, c, d, a, x[10], S34, 0xbebfbc70); /* 40 */
HH(a, b, c, d, x[13], S31, 0x289b7ec6); /* 41 */
HH(d, a, b, c, x[0], S32, 0xeaa127fa); /* 42 */
HH(c, d, a, b, x[3], S33, 0xd4ef3085); /* 43 */
HH(b, c, d, a, x[6], S34, 0x4881d05); /* 44 */
HH(a, b, c, d, x[9], S31, 0xd9d4d039); /* 45 */
HH(d, a, b, c, x[12], S32, 0xe6db99e5); /* 46 */
HH(c, d, a, b, x[15], S33, 0x1fa27cf8); /* 47 */
HH(b, c, d, a, x[2], S34, 0xc4ac5665); /* 48 */
/* Round 4 */
II(a, b, c, d, x[0], S41, 0xf4292244); /* 49 */
II(d, a, b, c, x[7], S42, 0x432aff97); /* 50 */
II(c, d, a, b, x[14], S43, 0xab9423a7); /* 51 */
II(b, c, d, a, x[5], S44, 0xfc93a039); /* 52 */
II(a, b, c, d, x[12], S41, 0x655b59c3); /* 53 */
II(d, a, b, c, x[3], S42, 0x8f0ccc92); /* 54 */
II(c, d, a, b, x[10], S43, 0xffeff47d); /* 55 */
II(b, c, d, a, x[1], S44, 0x85845dd1); /* 56 */
II(a, b, c, d, x[8], S41, 0x6fa87e4f); /* 57 */
II(d, a, b, c, x[15], S42, 0xfe2ce6e0); /* 58 */
II(c, d, a, b, x[6], S43, 0xa3014314); /* 59 */
II(b, c, d, a, x[13], S44, 0x4e0811a1); /* 60 */
II(a, b, c, d, x[4], S41, 0xf7537e82); /* 61 */
II(d, a, b, c, x[11], S42, 0xbd3af235); /* 62 */
II(c, d, a, b, x[2], S43, 0x2ad7d2bb); /* 63 */
II(b, c, d, a, x[9], S44, 0xeb86d391); /* 64 */
add32(a, init[0], state.v[0]);
add32(b, init[1], state.v[1]);
add32(c, init[2], state.v[2]);
add32(d, init[3], state.v[3]);
}
int main(void)
{
vector x[16][32];
scalar y[4];
int i, j;
printf("vector size = %u bits\n", (unsigned int)sizeof(vector) * 8);
md5_init();
for (i = 0; i < 16; i++)
const32(0xf4f4f4f4, x[i]);
md5_xform(x);
memset(y, 0, sizeof(y));
for (i = 0; i < 4; i++)
for (j = 0; j < 32; j++)
y[i] |= (state.e[i][j][0] & 1) << j;
printf("%x %x %x %x\n", y[0], y[1], y[2], y[3]);
i = 1000000 / (sizeof(vector) * 8);
do
md5_xform(x);
while (--i);
}
I checked out the assembly code generated with g++, it does generate parallelized SSE2 instructions. The assembly listing:
>alias c++2asm
g++44 -c -g -Wa,-a,-ad -Wall -std=c++98 -pedantic -Werror
>c++2asm -s -O3 -fomit-frame-pointer -funroll-loops -msse3 -march=core2 -DVECTOR bitslicemd5.c > bitslicemd5.asm
I can't include the full asembly listing (exceeds All Experts 65000 character limit), so here is an extract
99:bitslicemd5.c **** static void F( vector* x, vector* y, vector* z, vector* f )
100:bitslicemd5.c **** {
15 .loc 1 100 0
16 .LVL0:
17 0000 56 pushl %esi
18 .LCFI0:
19 0001 31C0 xorl %eax, %eax
20 0003 53 pushl %ebx
21 .LCFI1:
22 .loc 1 100 0
GAS LISTING /var/tmp//ccQZ7caK.s page 3
23 0004 8B74240C movl 12(%esp), %esi
24 0008 8B5C2410 movl 16(%esp), %ebx
25 000c 8B4C2414 movl 20(%esp), %ecx
26 0010 8B542418 movl 24(%esp), %edx
27 .LVL1:
28 .L2:
29 .LBB57:
101:bitslicemd5.c **** int i;
102:bitslicemd5.c **** vector a;
103:bitslicemd5.c ****
104:bitslicemd5.c **** i = 32;
105:bitslicemd5.c ****
106:bitslicemd5.c **** do
107:bitslicemd5.c **** {
108:bitslicemd5.c **** a = *z++;
30 .loc 1 108 0
31 0014 660F6F3C movdqa (%ecx,%eax), %xmm7
31 01
32 .LVL2:
109:bitslicemd5.c **** *f++ = a ^ (*x++ & (*y++ ^ a));
33 .loc 1 109 0c
34 0019 660F6F34 movdqa (%ebx,%eax), %xmm6
34 03
35 001e 660FEFF7 pxor %xmm7, %xmm6
36 0022 660FDB34 pand (%esi,%eax), %xmm6
36 06
37 0027 660FEFF7 pxor %xmm7, %xmm6
38 002b 660F7F34 movdqa %xmm6, (%edx,%eax)
38 02
39 .loc 1 108 0
40 0030 660F6F6C movdqa 16(%ecx,%eax), %xmm5
40 0110
41 .LVL3:
42 .loc 1 109 0
43 0036 660F6F64 movdqa 16(%ebx,%eax), %xmm4
43 0310
44 003c 660FEFE5 pxor %xmm5, %xmm4
45 0040 660FDB64 pand 16(%esi,%eax), %xmm4
45 0610
46 0046 660FEFE5 pxor %xmm5, %xmm4
47 004a 660F7F64 movdqa %xmm4, 16(%edx,%eax)
47 0210
48 .loc 1 108 0
49 0050 660F6F5C movdqa 32(%ecx,%eax), %xmm3
49 0120
50 .LVL4:
51 .loc 1 109 0
52 0056 660F6F54 movdqa 32(%ebx,%eax), %xmm2
52 0320
53 005c 660FEFD3 pxor %xmm3, %xmm2
54 0060 660FDB54 pand 32(%esi,%eax), %xmm2
54 0620
55 0066 660FEFD3 pxor %xmm3, %xmm2
56 006a 660F7F54 movdqa %xmm2, 32(%edx,%eax)
56 0220
57 .loc 1 108 0
58 0070 660F6F4C movdqa 48(%ecx,%eax), %xmm1
GAS LISTING /var/tmp//ccQZ7caK.s page 4
58 0130
59 .LVL5:
60 .loc 1 109 0
61 0076 660F6F44 movdqa 48(%ebx,%eax), %xmm0
61 0330
62 007c 660FEFC1 pxor %xmm1, %xmm0
63 0080 660FDB44 pand 48(%esi,%eax), %xmm0
63 0630
64 0086 660FEFC1 pxor %xmm1, %xmm0
65 008a 660F7F44 movdqa %xmm0, 48(%edx,%eax)
65 0230
66 .loc 1 108 0
67 0090 660F6F7C movdqa 64(%ecx,%eax), %xmm7
67 0140
68 .LVL6:
69 .loc 1 109 0
70 0096 660F6F74 movdqa 64(%ebx,%eax), %xmm6
70 0340
71 009c 660FEFF7 pxor %xmm7, %xmm6
72 00a0 660FDB74 pand 64(%esi,%eax), %xmm6
72 0640
73 00a6 660FEFF7 pxor %xmm7, %xmm6
74 00aa 660F7F74 movdqa %xmm6, 64(%edx,%eax)
74 0240
75 .loc 1 108 0
76 00b0 660F6F6C movdqa 80(%ecx,%eax), %xmm5
76 0150
77 .LVL7:
78 .loc 1 109 0
79 00b6 660F6F64 movdqa 80(%ebx,%eax), %xmm4
79 0350
80 00bc 660FEFE5 pxor %xmm5, %xmm4
81 00c0 660FDB64 pand 80(%esi,%eax), %xmm4
81 0650
82 00c6 660FEFE5 pxor %xmm5, %xmm4
83 00ca 660F7F64 movdqa %xmm4, 80(%edx,%eax)
83 0250
84 .loc 1 108 0
85 00d0 660F6F5C movdqa 96(%ecx,%eax), %xmm3
85 0160
86 .LVL8:
87 .loc 1 109 0
88 00d6 660F6F54 movdqa 96(%ebx,%eax), %xmm2
88 0360
89 00dc 660FEFD3 pxor %xmm3, %xmm2
90 00e0 660FDB54 pand 96(%esi,%eax), %xmm2
90 0660
91 00e6 660FEFD3 pxor %xmm3, %xmm2
92 00ea 660F7F54 movdqa %xmm2, 96(%edx,%eax)
92 0260
93 .loc 1 108 0
94 00f0 660F6F4C movdqa 112(%ecx,%eax), %xmm1
94 0170
95 .LVL9:
96 .loc 1 109 0
97 00f6 660F6F44 movdqa 112(%ebx,%eax), %xmm0
97 0370
GAS LISTING /var/tmp//ccQZ7caK.s page 5
98 00fc 660FEFC1 pxor %xmm1, %xmm0
99 0100 660FDB44 pand 112(%esi,%eax), %xmm0
99 0670
100 0106 660FEFC1 pxor %xmm1, %xmm0
101 010a 660F7F44 movdqa %xmm0, 112(%edx,%eax)
101 0270
102 0110 83E880 subl $-128, %eax
103 .loc 1 106 0
104 0113 3D000200 cmpl $512, %eax
104 00
105 0118 0F85F6FE jne .L2
105 FFFF
106 .LBE57:
110:bitslicemd5.c **** } while (--i);
111:bitslicemd5.c **** }
107 .loc 1 111 0
108 011e 5B popl %ebx
109 .LVL10:
110 011f 5E popl %esi
111 .LVL11:
112 0120 C3 ret
113 .LFE3:
115 0121 EB0D9090 .p2align 4,,15
115 90909090
115 90909090
115 909090
117 _ZL1GPU8__vectoriS0_S0_S0_:
118 .LFB4:
112:bitslicemd5.c ****
**** note **** I'm on vacation (at a place without internet) till 17th of September. Will be back after that.