You are here:

C++/SSE2-accelerated MD4

Advertisement


Question
QUESTION: I need help implementing an SSE2-accelerated MD4 implementation. I found one that I think may work at http://www.freerainbowtables.com/phpBB3/5/viewtopic3.php.htm#p8248

but need help implementing it. It is a little over my head. All I want to do is conduct 4 MD4 encryptions simultaneously using SSE2.

I have this, for example:

char candidate0[]="password0";
char candidate1[]="password1";
char candidate2[]="password2";
char candidate3[]="password3";

I believe I need to pad them and append the length, or whatever the MD4 algorithm requires, before feeding them into the compressSse() function which will do the rest of the work to turn them into 4 MD4 hashes.

Maybe you know of a better SSE2-accelerated MD4 routine? Easier/ faster? I also plan to use this for NTLM and MSCACHE, so I would like to retain the flexibility to use or not to use UNICODE conversion. I think the implementation above might include UNICODE conversion. I would like to keep that separate, so I can use it for NTLM and MD4.

here is pseudocode for my planned future implementation in a password brute forcer (sorry if it's a mess :$):

int cached_hashes=0;//counts how many plaintexts are ready, when 4, compute

char candidate[];//the candidate password

char plaintexts[4][];//stores all 4 plaintexts

unsigned char ready_candidate[16];//plaintexts prepared for the compressSse() function

unsigned char encrypted[16];//the encrypted password we are trying to recover



while(bruteforcing)

{

  candidate++;

  memcpy(plaintexts[cached_hashes],candidate, strlen(candidate));

  //possibly UNICODE conversion here, depending on hash type

  prepare candidate- padding, whatever- store in ready_candidate

  memcpy(input[cached_hashes],ready_candidate,16);

  cached_hashes++;

  if(cached_hashes==4)//if 4 candidates ready and stored...

  {

     //MD4 encrypt them all simultaneously

     compressSse();

     cached_hashes=0;

     //I'm assuming that compressSse() writes the MD4 hashes

     //back into input[] elements, respectively

     for(int i=0;i<4;i++)

        if(!memcmp(input[i],encrypted,16)

         cout<<"plaintext is "<<plaintext[i]<<"!\n";

  }

}



I generate one candidate at a time, and then wait until I have collected 4 candidates and then encrypt them all simultaneously. Then I check the results one-by-one, again.

I think once I can get an SSE2-accelerated MD4 implementation up and running, I should be able to modify it to turn it into an SSE2-accelerated MD5 implementation.

ANSWER: The link that you posted gives a 404 NOT FOUND

Here are two C++ crypto libraries that I've used. They use SSE2, but perhaps not for obsolescent algorithms like MD4 or MD5. Check them out and see if they meet your requirements:

Crypto++ - http://www.cryptopp.com/

Botan - http://botan.randombit.net/about.html

and one which is very easy to use, but definitely does not use SSE2
hashlib++ - http://hashlib2plus.sourceforge.net/




---------- FOLLOW-UP ----------

QUESTION: I checked out the implementations you provided. I don't think either of them use SSE2 for MD4/MD5 computation. MD4 and MD5 can only be partially conducted in SIMD-parallel for a single hash, and at very little speed advantage. These implementations only take a single hash in the function call, so I doubt they compute multiple hashes in parallel. Correct me if any of this information is wrong.

Here is the link to the implementation I was looking at:
http://www.freerainbowtables.com/phpBB3/viewtopic.php?f=6&t=904&sid=4e3e3efc6f55
hopefully that link works. I just am having a little trouble understanding how to use this implementation. I get the idea that I need to do some padding, but loading the input[][][] array is still a little over my head.


ANSWER: I had a look at the link.

It is not very clear to me from what I saw there how the array input[][][] is used. I could guess, but that would be risky.

---------- FOLLOW-UP ----------

QUESTION: Corni gave me an implementation of it in an mscache cracker. I still couldn't quite make sense of it. Code is a little convoluted. Perhaps it may help?

here is the interchange:

"Hi,
first, i've not really looked at this code for a long time, but let's see what i get together...
first, for your compiling errors, you need to #include the right header. google is your friend here ;)
second, for gcc as compiler (which I assume Dev-C++ uses) you've to add the -msse2 command line switch to compile.
(That was the easy part :D)

For the rest, i've written a brute-forcer for mscache. It isn't particularly nice written or feature rich, but it provides examples for padding and loading the input, and even has a dumb plaintext-generator for a specified string length. Beware, the code is ugly (and sometimes just with german comments)...
I've just removed the char target[], but you can use the char targetTest[] as target...
md4.h just contains the sse macros you already have."

#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <sys/time.h>
#include <xmmintrin.h>
#include "md4.h"

//#define test
#pragma pack(1)
//all messages are in windows unicode!
unsigned short mixedalphanumeric[63];
unsigned long long count=0;
timeval start;
__m128i a,b,c,d;
unsigned characters[4][STRING_MAX];
unsigned short input[16][4][2] __attribute__ ((aligned (16)));
int part=0;
unsigned short string[STRING_MAX]={0};//for looking up the used hash
unsigned char target[]=
{//857B55E2B52DB477EC3820A26BD6B1CE - administrator of schule
     0x85,0x7B,0x55,0xE2,0xB5,0x2D,0xB4,0x77,0xEC,0x38,0x20,0xA2,0x6B,0xD6,0xB1,0xCE
};

unsigned char targetTest[]=//aaaaaf
{//1419ebe422ff17029c8e82914fdf08e4
  0x14,0x19,0xeb,0xe4,0x22,0xff,0x17,0x02,0x9c,0x8e,0x82,0x91,0x4f,0xdf,0x08,0xe4
};

void printInput()
{
  for(int j=0; j<4; j++)
  {
     for(int i=0; i<STRING_MAX; i++)
     {
        //print_hex((unsigned char*)&input[i][j][0],4);
        printUnicode(&input[i][j][0],2,0 );
     }
     printf("\n");
  }
}

void initSse()
{
  memset(input,0,sizeof(input));
  characters[0][STRING_MAX-1]=0;
  characters[1][STRING_MAX-1]=1;
  characters[2][STRING_MAX-1]=2;
  characters[3][STRING_MAX-1]=3;
}
inline void compressSse()
{
  //set the initial values
  a=_mm_set_epi32(0x67452301,0x67452301,0x67452301,0x67452301);
  b=_mm_set_epi32(0xefcdab89,0xefcdab89,0xefcdab89,0xefcdab89);
  c=_mm_set_epi32(0x98badcfe,0x98badcfe,0x98badcfe,0x98badcfe);
  d=_mm_set_epi32(0x10325476,0x10325476,0x10325476,0x10325476);
  FFSSE(a, b, c, d, SSE_LOAD(0), S11); /* 1 */
  FFSSE(d, a, b, c, SSE_LOAD(1), S12); /* 2 */
  FFSSE(c, d, a, b, SSE_LOAD(2), S13); /* 3 */
  FFSSE(b, c, d, a, SSE_LOAD(3), S14); /* 4 */
  FFSSE(a, b, c, d, SSE_LOAD(4), S11); /* 5 */
  FFSSE(d, a, b, c, SSE_LOAD(5), S12); /* 6 */
  FFSSE(c, d, a, b, SSE_LOAD(6), S13); /* 7 */
  FFSSE(b, c, d, a, SSE_LOAD(7), S14); /* 8 */
  FFSSE(a, b, c, d, SSE_LOAD(8), S11); /* 9 */
  FFSSE(d, a, b, c, SSE_LOAD(9), S12); /* 10 */
  FFSSE(c, d, a, b, SSE_LOAD(10), S13); /* 11 */
  FFSSE(b, c, d, a, SSE_LOAD(11), S14); /* 12 */
  FFSSE(a, b, c, d, SSE_LOAD(12), S11); /* 13 */
  FFSSE(d, a, b, c, SSE_LOAD(13), S12); /* 14 */
  FFSSE(c, d, a, b, SSE_LOAD(14), S13); /* 15 */
  FFSSE(b, c, d, a, SSE_LOAD(15), S14); /* 16 */
  /* Round 2 */
  GGSSE(a, b, c, d, SSE_LOAD(0), S21); /* 17 */
  GGSSE(d, a, b, c, SSE_LOAD(4), S22); /* 18 */
  GGSSE(c, d, a, b, SSE_LOAD(8), S23); /* 19 */
  GGSSE(b, c, d, a, SSE_LOAD(12), S24); /* 20 */
  GGSSE(a, b, c, d, SSE_LOAD(1), S21); /* 21 */
  GGSSE(d, a, b, c, SSE_LOAD(5), S22); /* 22 */
  GGSSE(c, d, a, b, SSE_LOAD(9), S23); /* 23 */
  GGSSE(b, c, d, a, SSE_LOAD(13), S24); /* 24 */
  GGSSE(a, b, c, d, SSE_LOAD(2), S21); /* 25 */
  GGSSE(d, a, b, c, SSE_LOAD(6), S22); /* 26 */
  GGSSE(c, d, a, b, SSE_LOAD(10), S23); /* 27 */
  GGSSE(b, c, d, a, SSE_LOAD(14), S24); /* 28 */
  GGSSE(a, b, c, d, SSE_LOAD(3), S21); /* 29 */
  GGSSE(d, a, b, c, SSE_LOAD(7), S22); /* 30 */
  GGSSE(c, d, a, b, SSE_LOAD(11), S23); /* 31 */
  GGSSE(b, c, d, a, SSE_LOAD(15), S24); /* 32 */
  /* Round 3 */
  HHSSE(a, b, c, d, SSE_LOAD(0), S31); /* 33 */
  HHSSE(d, a, b, c, SSE_LOAD(8), S32); /* 34 */
  HHSSE(c, d, a, b, SSE_LOAD(4), S33); /* 35 */
  HHSSE(b, c, d, a, SSE_LOAD(12), S34); /* 36 */
  HHSSE(a, b, c, d, SSE_LOAD(2), S31); /* 37 */
  HHSSE(d, a, b, c, SSE_LOAD(10), S32); /* 38 */
  HHSSE(c, d, a, b, SSE_LOAD(6), S33); /* 39 */
  HHSSE(b, c, d, a, SSE_LOAD(14), S34); /* 40 */
  HHSSE(a, b, c, d, SSE_LOAD(1), S31); /* 41 */
  HHSSE(d, a, b, c, SSE_LOAD(9), S32); /* 42 */
  HHSSE(c, d, a, b, SSE_LOAD(5), S33); /* 43 */
  HHSSE(b, c, d, a, SSE_LOAD(13), S34); /* 44 */
  HHSSE(a, b, c, d, SSE_LOAD(3), S31); /* 45 */
  HHSSE(d, a, b, c, SSE_LOAD(11), S32); /* 46 */
  HHSSE(c, d, a, b, SSE_LOAD(7), S33); /* 47 */
  HHSSE(b, c, d, a, SSE_LOAD(15), S34); /* 48 */
  // Finally, add initial values, as this is the only pass we make.
  a=_mm_add_epi32(a,_mm_set_epi32(0x67452301,0x67452301,0x67452301,0x67452301));
  b=_mm_add_epi32(b,_mm_set_epi32(0xefcdab89,0xefcdab89,0xefcdab89,0xefcdab89));
  c=_mm_add_epi32(c,_mm_set_epi32(0x98badcfe,0x98badcfe,0x98badcfe,0x98badcfe));
  d=_mm_add_epi32(d,_mm_set_epi32(0x10325476,0x10325476,0x10325476,0x10325476));
}
volatile bool stop=false;
inline void setInput2(unsigned short* charset, int no, int recursion=0);
inline void setInput(int recursion=0)
{
  static bool restart=false;
  //static bool stop=false;
  int i=0;
  if(recursion==STRING_MAX)
  {
     //printf("max\n");
  //   stop=true;
     return;
  }

  //check the for characters arrays for too high values
  while(i<4)
  {
     if(characters[i][recursion]>=62)
     {
        characters[i][recursion]-=62;
        restart=true;
        if(recursion>1)
         characters[i][recursion-1]++;
        else if(recursion==0||recursion==1)
        {
         printf("reached end of part %d, stopping\n",part);
         exit(0);
        }
        //("wrap for %d\n",i);
     }
     i++;
  }
  if(restart)
  {
     if(recursion)
        return;
     restart=false;
     setInput();
     return;
  }

  /*while(charset[characters[0][recursion]]!=0)
  {
     string[recursion]=charset[characters[0][recursion]];
     input[recursion/2][no][recursion%2]=mixedalphanumeric[characters[0][recursion]];
     setInput2(charset, no, recursion+1);
#if STRING_MAX>6
     if(recursion==0)
     {   compressSse();
        printf("finished part: %d\n",part);
        exit(0);
     }
#endif
     characters[0][recursion]++;
  }*/

  //while((mixedalphanumeric[characters[recursion]]!=0)&&(!stop))
  {
     //i ist die hash-nummer 0...3
     //[x] ist jeweils die string-nummer, recursion der character-index
     string[recursion]=mixedalphanumeric[characters[0][recursion]];
     input[recursion/2][0][recursion%2]=mixedalphanumeric[characters[0][recursion]];
     input[recursion/2][1][recursion%2]=mixedalphanumeric[characters[1][recursion]];
     input[recursion/2][2][recursion%2]=mixedalphanumeric[characters[2][recursion]];
     input[recursion/2][3][recursion%2]=mixedalphanumeric[characters[3][recursion]];
     setInput(recursion+1);
     if(recursion==STRING_MAX-1)
     {
        characters[0][recursion]+=4;
        characters[1][recursion]+=4;
        characters[2][recursion]+=4;
        characters[3][recursion]+=4;
     }
#if 0
     for(i=0;(i<4)&&(mixedalphanumeric[characters[recursion]]!=0); i++)
     {
        //printf("i: %d\n",i);
        input[recursion/2][i][recursion%2]=mixedalphanumeric[characters[recursion]];
        if(STRING_MAX-1==recursion)
         characters[recursion]++;
     }
     if(i<3&&recursion)
     {
        characters[recursion-1]++;
        characters[recursion]=0;
     }
     //printf("step deeper\n");
     if(!stop)
        setInput(recursion+1);
     characters[recursion]=0;//reset this
#endif
  }

  //characters[recursion]=0;//reset this
}


void saveCrackerState()
{
  FILE* fpRest=fopen("restore","wb");
  if(fwrite(&characters,1,sizeof(characters),fpRest)!=sizeof(characters))
  {
     printf("failed to save state of the cracker, exiting with failure!\n");
     exit(-1);
  }
  fclose(fpRest);
}
unsigned char salt[]="administrator";
  /*
{
  //'a',0,'d',0,'m',0,'i',0,'n',0,'i',0,'s',0,'t',0,'r',0,'a',0,'t',0,'o',0,'r',0
};*/

inline void brute()
{
  unsigned int  __attribute__ ((aligned (16))) data[4][4];
  memset(input,0,sizeof(input));
  stop=false;
  setInput();
  //printInput();
  //NTLM
  //padding byte - strings are all in words, MS UNICODE
#if STRING_MAX%2
  {
     input[STRING_MAX/2][0][1]=0x80;
     input[STRING_MAX/2][1][1]=0x80;
     input[STRING_MAX/2][2][1]=0x80;
     input[STRING_MAX/2][3][1]=0x80;
  }
#else
  {
     input[STRING_MAX/2][0][0]=0x80;
     input[STRING_MAX/2][1][0]=0x80;
     input[STRING_MAX/2][2][0]=0x80;
     input[STRING_MAX/2][3][0]=0x80;
  }
#endif
  //length
  input[14][0][0]=STRING_MAX*2*8;
  input[14][1][0]=STRING_MAX*2*8;
  input[14][2][0]=STRING_MAX*2*8;
  input[14][3][0]=STRING_MAX*2*8;
#ifndef test
  compressSse();
#endif
  //memset(input,0,sizeof(input));
  _mm_store_si128((__m128i*)&input[0][0][0],a);
  _mm_store_si128((__m128i*)&input[1][0][0],b);
  _mm_store_si128((__m128i*)&input[2][0][0],c);
  _mm_store_si128((__m128i*)&input[3][0][0],d);
  for(unsigned int i=0; i<sizeof(salt)-1; i++)
  {
     input[4+i/2][0][i%2]=salt[i];
     input[4+i/2][1][i%2]=salt[i];
     input[4+i/2][2][i%2]=salt[i];
     input[4+i/2][3][i%2]=salt[i];
  }
  //padding byte
  input[10][0][1]=0x80;
  input[10][1][1]=0x80;
  input[10][2][1]=0x80;
  input[10][3][1]=0x80;
  //length
  input[14][0][0]=128+(sizeof(salt)-1)*8*2;
  input[14][1][0]=128+(sizeof(salt)-1)*8*2;
  input[14][2][0]=128+(sizeof(salt)-1)*8*2;
  input[14][3][0]=128+(sizeof(salt)-1)*8*2;
#ifndef test
  compressSse();
#endif
  //check
  _mm_store_si128((__m128i*)data[0],a);
  if((data[0][0]==*(unsigned int*)&target[0])||(data[0][1]==*(unsigned int*)&target[0])||(data[0][2]==*(unsigned int*)&target[0])||(data[0][3]==*(unsigned int*)&target[0]))
  {
     printf("potential match at input0=");
     printUnicode(string,STRING_MAX,0);
     printf("\n");
     _mm_store_si128((__m128i *)data[1],b);
     _mm_store_si128((__m128i *)data[2],c);
     _mm_store_si128((__m128i *)data[3],d);
     for(int i=0; i<4; i++)
     {
        if(*(unsigned int*)&target[0]==data[0][i]&&*(unsigned int*)&target[4]==data[1][i]&&*(unsigned int*)&target[8]==data[2][i]&&*(unsigned int*)&target[12]==data[3][i])
        {
         printf("match with input %d in part %d with input0=",i,part);
         printUnicode(string,STRING_MAX,1);
         printf("\n");
         exit(0);
        }
     }
  }
  count++;
  if(count>100000000)
  {
     //save state of characters array
     int old=start.tv_sec;
     gettimeofday(&start,NULL);
     printf("100000000*4 hashes done in %d seconds, save our state, atm string: ", (int)(start.tv_sec-old));
     printUnicode(string,STRING_MAX, 0);
     printf("\n");
     saveCrackerState();
     count=0;
  }
}

void doSse()
{
  initSse();
  while(1)
     brute();
}
#pragma pack()
void print_hex(void* bs, unsigned int n)
{
  unsigned int i;
  for (i=0;i<n;i++)
     printf("%02x",((unsigned char*)bs)[i]);
}

void printUnicode(unsigned short* ptS, int length, int ready)
{
  FILE* fp=fopen("result", "ab");
     if(ready)
  fprintf(fp,"result!!!!: ");
  for(int i=0; i<length; i++)
  {
     printf("%c",(char)*ptS);
     fprintf(fp,"%c",(char)*ptS);
     ptS++;
  }
  fprintf(fp,"\n");
  fclose(fp);
}

void initArray()
{
  //memset(characters,0,sizeof(characters));
  memset(characters,0,sizeof(characters));
  memset(mixedalphanumeric,0,sizeof(mixedalphanumeric));
  memset(&start,0,sizeof(start));
  gettimeofday(&start,NULL);
  //characters[0]=part;
  char* ar=(char*)mixedalphanumeric;
  int array=0;
  for(char i='a'; i<='z'; i++)
  {
     ar[array]=i;
     array++;
     ar[array]=0;
     array++;
  }
  for(char i='A'; i<='Z'; i++)
  {
     ar[array]=i;
     array++;
     ar[array]=0;
     array++;
  }
  for(char i='0'; i<='9'; i++)
  {
     ar[array]=i;
     array++;
     ar[array]=0;
     array++;
  }
}

int main(int argc, char* argv[])
{
  FILE* fpRest=NULL;
  initArray();
  if(argc<=1)
  {
     printf("usage: either specify a part number from 0...61 or --restore!\n");
     exit(0);
  }
  if(strncasecmp(argv[1],"--restore",strlen("--restore"))==0)
  {
     fpRest=fopen("restore", "rb");
     if(fread(&characters, 1, sizeof(characters),fpRest)!=sizeof(characters))
     {
        printf("failed to read the restore file, exiting!\n");
        exit(-1);
     }
     fclose(fpRest);
     part=characters[0][0];
     printf("continue the cracking\n");
  }
  else
  {
     part=atoi(argv[1]);
     characters[0][0]=part;
     characters[1][0]=part;
     characters[2][0]=part;
     characters[3][0]=part;
     saveCrackerState();
     printf("Please start again with ./hash --restore to process part %d\n",part);
     exit(0);
  }
  printf("processing part: %d\n",part);
  //make a backup in case I forgot a --restore
  system("cp restore restorebackup");
  system("cp restorebackup restorebackup2");
  doSse();
  return 0;
}


Answer
> Code is a little convoluted

Yes. And I got differences in the MD5 hash (from the results from the reference implementation), so i suspect it is not completely bug-free.

The way I tend to use SSE3 is to let the compiler generate SSE3 instructions as part of optimization. The algorithm implementation has to be a bit-sliced one for this to work, I picked up a known, public domain,  bit-slice implementation of the MD5 compression function by by Solar Designer <solar at openwall.com> (as part of his 'john the riper' implementation from here: http://www.masm32.com/board/index.php?topic=8154.0

I got a four-fold improvement in performance with SSE2. Is this kind of improvement what you are looking for?

>alias c++
g++44 -Wall -std=c++98 -pedantic -Werror -I /usr/local/include -L /usr/local/lib

>c++ -s -O3 -fomit-frame-pointer -funroll-loops bitslicemd5.c
>time ./a.out
vector size = 32 bits
c09c4c1f 21876746 18aed2 70b452f0
1.661u 0.000s 0:01.67 99.4%     15+1134k 0+0io 0pf+0w

>c++ -s -O3 -fomit-frame-pointer -funroll-loops -msse3 -march=core2 -DVECTOR bitslicemd5.c
>time ./a.out
vector size = 128 bits
c09c4c1f 21876746 18aed2 70b452f0
0.407u 0.000s 0:00.41 97.5%     21+1144k 0+0io 0pf+0w


The code, completely unchanged, just sanitized for C++:

/*
* Proof-of-concept bitslice implementation of the MD5 compression function.
*
* Written by Solar Designer <solar at openwall.com> in 1997-1998, revised in
* 2007, and placed in the public domain.  There's absolutely no warranty.
* Although this is not required formally, due credit will be appreciated if
* you re-use this code or concepts.
*
* Possible compiler invocations (tested on Linux/x86-64):
*
* # No vectors (e.g., native 64-bit registers):
* gcc md5slice.c -o md5slice -Wall -s -O3 -fomit-frame-pointer -funroll-loops
*
* # 128-bit vectors with SSE2:
* PATH=~/gcc-4.1.0/bin:$PATH gcc md5slice.c -o md5slice -Wall -s -O3 -fomit-frame-pointer -funroll-loops -msse2 -DVECTOR
*
* The -msse2 option is not required on x86-64 (it is implied), and it needs
* to be replaced with -maltivec for AltiVec.  You may also add -march=... to
* match your CPU model.
*
* Systems with smaller L1 instruction caches may need -O3 replaced with -O2,
* or -fno-inline-functions may be passed explicitly.
*/

#include <stdio.h>
#include <string.h>
#include <stdlib.h>

typedef unsigned int scalar;
typedef signed int element;
#ifdef VECTOR
typedef element vector __attribute__ ((vector_size (16)));
#else
typedef unsigned long vector;
#endif

#define S11 7
#define S12 12
#define S13 17
#define S14 22
#define S21 5
#define S22 9
#define S23 14
#define S24 20
#define S31 4
#define S32 11
#define S33 16
#define S34 23
#define S41 6
#define S42 10
#define S43 15
#define S44 21

static void add32( vector* x, vector* y, vector* z)
{
   int i;
   vector a, b, c, p;

   i = 32;
   c ^= c; // c = 0;

   do
   {
       a = *x++;
       b = *y++;
       *z++ = (p = a ^ b) ^ c;
       c = (p & c) | (a & b);
   } while (--i);
}

static void add32c( scalar c, vector* x, vector* y, vector* z )
{
   int i;
   vector a, b, c1, c2, p, q;

   i = 32;
   c1 ^= c1; c2 ^= c2; // c1 = c2 = 0;

   do {
       a = *x++;
       b = *y++;

       if (c & 1)
       {
         *z++ = ~(a ^ b) ^ c1 ^ c2;
         c2 = (a & b & (p = c1 | c2)) | (c1 & c2 & (q = a | b));
         c1 = p | q;
       }
       else
       {
         *z++ = (q = (p = a ^ b) ^ c1) ^ c2;
         c1 = (p & c1) | (a & b);
         c2 &= q;
       }
       c >>= 1;
   } while (--i);
}

static void F( vector* x, vector* y, vector* z, vector* f )
{
   int i;
   vector a;

   i = 32;

   do
   {
       a = *z++;
       *f++ = a ^ (*x++ & (*y++ ^ a));
   } while (--i);
}

static void G( vector*x, vector*y, vector*z, vector*f )
{
   F(z, x, y, f);
}

static void H( vector* x, vector* y, vector* z, vector* f )
{
   int i;

   i = 32;
   do
   {
       *f++ = *x++ ^ *y++ ^ *z++;
   } while (--i);
}

static void I( vector* x, vector* y, vector* z, vector* f )
{
   int i;

   i = 32;
   do
   {
       *f++ = *y++ ^ (*x++ | ~*z++);
   } while (--i);
}

static void rol32( int n, vector* x, vector* y )
{
   int i, j;

   j = 32 - (i = n);

   do
   {
       *y++ = x[j++];
   } while (--i);

   i = 32 - n;

   do
   {
       *y++ = *x++;
   } while (--i);
}

static void const32( scalar x, vector* y )
{
   int i;
   vector zero, ones;

   zero ^= zero;
   ones = ~zero;

   i = 32;

   do
   {
       if (x & 1)
       *y++ = ones;
       else
       *y++ = zero;
       x >>= 1;

   } while (--i);
}

static void XX( void (*f) ( vector* x, vector* y, vector* z, vector* f ),
         vector* a, vector* b, vector* c, vector* d, vector* x,
         int s, scalar ac )
{
   vector tmp[32];

   f(b, c, d, tmp);
   add32c(ac, x, tmp, tmp);
   add32(a, tmp, tmp);
   rol32(s, tmp, a);
   add32(b, a, a);
}

#define FF(a, b, c, d, x, s, ac) XX(F, a, b, c, d, x, s, ac)
#define GG(a, b, c, d, x, s, ac) XX(G, a, b, c, d, x, s, ac)
#define HH(a, b, c, d, x, s, ac) XX(H, a, b, c, d, x, s, ac)
#define II(a, b, c, d, x, s, ac) XX(I, a, b, c, d, x, s, ac)

static union
{
   vector v[4][32];
   element e[4][32][sizeof(vector) / sizeof(element)];
} state;
static vector init[4][32];

static void md5_init(void)
{
   const32(0x67452301, init[0]);
   const32(0xefcdab89, init[1]);
   const32(0x98badcfe, init[2]);
   const32(0x10325476, init[3]);
}

static void md5_xform( vector x[16][32] )
{
   vector a[32], b[32], c[32], d[32];

   memcpy(a, init[0], sizeof(a));
   memcpy(b, init[1], sizeof(b));
   memcpy(c, init[2], sizeof(c));
   memcpy(d, init[3], sizeof(d));

   /* Round 1 */
   FF(a, b, c, d, x[0], S11, 0xd76aa478); /* 1 */
   FF(d, a, b, c, x[1], S12, 0xe8c7b756); /* 2 */
   FF(c, d, a, b, x[2], S13, 0x242070db); /* 3 */
   FF(b, c, d, a, x[3], S14, 0xc1bdceee); /* 4 */
   FF(a, b, c, d, x[4], S11, 0xf57c0faf); /* 5 */
   FF(d, a, b, c, x[5], S12, 0x4787c62a); /* 6 */
   FF(c, d, a, b, x[6], S13, 0xa8304613); /* 7 */
   FF(b, c, d, a, x[7], S14, 0xfd469501); /* 8 */
   FF(a, b, c, d, x[8], S11, 0x698098d8); /* 9 */
   FF(d, a, b, c, x[9], S12, 0x8b44f7af); /* 10 */
   FF(c, d, a, b, x[10], S13, 0xffff5bb1); /* 11 */
   FF(b, c, d, a, x[11], S14, 0x895cd7be); /* 12 */
   FF(a, b, c, d, x[12], S11, 0x6b901122); /* 13 */
   FF(d, a, b, c, x[13], S12, 0xfd987193); /* 14 */
   FF(c, d, a, b, x[14], S13, 0xa679438e); /* 15 */
   FF(b, c, d, a, x[15], S14, 0x49b40821); /* 16 */

   /* Round 2 */
   GG(a, b, c, d, x[1], S21, 0xf61e2562); /* 17 */
   GG(d, a, b, c, x[6], S22, 0xc040b340); /* 18 */
   GG(c, d, a, b, x[11], S23, 0x265e5a51); /* 19 */
   GG(b, c, d, a, x[0], S24, 0xe9b6c7aa); /* 20 */
   GG(a, b, c, d, x[5], S21, 0xd62f105d); /* 21 */
   GG(d, a, b, c, x[10], S22, 0x2441453); /* 22 */
   GG(c, d, a, b, x[15], S23, 0xd8a1e681); /* 23 */
   GG(b, c, d, a, x[4], S24, 0xe7d3fbc8); /* 24 */
   GG(a, b, c, d, x[9], S21, 0x21e1cde6); /* 25 */
   GG(d, a, b, c, x[14], S22, 0xc33707d6); /* 26 */
   GG(c, d, a, b, x[3], S23, 0xf4d50d87); /* 27 */
   GG(b, c, d, a, x[8], S24, 0x455a14ed); /* 28 */
   GG(a, b, c, d, x[13], S21, 0xa9e3e905); /* 29 */
   GG(d, a, b, c, x[2], S22, 0xfcefa3f8); /* 30 */
   GG(c, d, a, b, x[7], S23, 0x676f02d9); /* 31 */
   GG(b, c, d, a, x[12], S24, 0x8d2a4c8a); /* 32 */

   /* Round 3 */
   HH(a, b, c, d, x[5], S31, 0xfffa3942); /* 33 */
   HH(d, a, b, c, x[8], S32, 0x8771f681); /* 34 */
   HH(c, d, a, b, x[11], S33, 0x6d9d6122); /* 35 */
   HH(b, c, d, a, x[14], S34, 0xfde5380c); /* 36 */
   HH(a, b, c, d, x[1], S31, 0xa4beea44); /* 37 */
   HH(d, a, b, c, x[4], S32, 0x4bdecfa9); /* 38 */
   HH(c, d, a, b, x[7], S33, 0xf6bb4b60); /* 39 */
   HH(b, c, d, a, x[10], S34, 0xbebfbc70); /* 40 */
   HH(a, b, c, d, x[13], S31, 0x289b7ec6); /* 41 */
   HH(d, a, b, c, x[0], S32, 0xeaa127fa); /* 42 */
   HH(c, d, a, b, x[3], S33, 0xd4ef3085); /* 43 */
   HH(b, c, d, a, x[6], S34, 0x4881d05); /* 44 */
   HH(a, b, c, d, x[9], S31, 0xd9d4d039); /* 45 */
   HH(d, a, b, c, x[12], S32, 0xe6db99e5); /* 46 */
   HH(c, d, a, b, x[15], S33, 0x1fa27cf8); /* 47 */
   HH(b, c, d, a, x[2], S34, 0xc4ac5665); /* 48 */

   /* Round 4 */
   II(a, b, c, d, x[0], S41, 0xf4292244); /* 49 */
   II(d, a, b, c, x[7], S42, 0x432aff97); /* 50 */
   II(c, d, a, b, x[14], S43, 0xab9423a7); /* 51 */
   II(b, c, d, a, x[5], S44, 0xfc93a039); /* 52 */
   II(a, b, c, d, x[12], S41, 0x655b59c3); /* 53 */
   II(d, a, b, c, x[3], S42, 0x8f0ccc92); /* 54 */
   II(c, d, a, b, x[10], S43, 0xffeff47d); /* 55 */
   II(b, c, d, a, x[1], S44, 0x85845dd1); /* 56 */
   II(a, b, c, d, x[8], S41, 0x6fa87e4f); /* 57 */
   II(d, a, b, c, x[15], S42, 0xfe2ce6e0); /* 58 */
   II(c, d, a, b, x[6], S43, 0xa3014314); /* 59 */
   II(b, c, d, a, x[13], S44, 0x4e0811a1); /* 60 */
   II(a, b, c, d, x[4], S41, 0xf7537e82); /* 61 */
   II(d, a, b, c, x[11], S42, 0xbd3af235); /* 62 */
   II(c, d, a, b, x[2], S43, 0x2ad7d2bb); /* 63 */
   II(b, c, d, a, x[9], S44, 0xeb86d391); /* 64 */

   add32(a, init[0], state.v[0]);
   add32(b, init[1], state.v[1]);
   add32(c, init[2], state.v[2]);
   add32(d, init[3], state.v[3]);
}

int main(void)
{
   vector x[16][32];
   scalar y[4];
   int i, j;

   printf("vector size = %u bits\n", (unsigned int)sizeof(vector) * 8);

   md5_init();

   for (i = 0; i < 16; i++)
       const32(0xf4f4f4f4, x[i]);

   md5_xform(x);

   memset(y, 0, sizeof(y));

   for (i = 0; i < 4; i++)
       for (j = 0; j < 32; j++)
         y[i] |= (state.e[i][j][0] & 1) << j;

   printf("%x %x %x %x\n", y[0], y[1], y[2], y[3]);

   i = 1000000 / (sizeof(vector) * 8);

   do
       md5_xform(x);
   while (--i);

}

I checked out the assembly code generated with g++, it does generate parallelized SSE2 instructions. The assembly listing:

>alias c++2asm
g++44 -c -g -Wa,-a,-ad -Wall -std=c++98 -pedantic -Werror

>c++2asm -s -O3 -fomit-frame-pointer -funroll-loops -msse3 -march=core2 -DVECTOR bitslicemd5.c > bitslicemd5.asm

I can't include the full asembly listing (exceeds All Experts 65000 character limit), so here is an extract

 99:bitslicemd5.c **** static void F( vector* x, vector* y, vector* z, vector* f )
100:bitslicemd5.c **** {
 15          .loc 1 100 0
 16          .LVL0:
 17 0000 56          pushl %esi
 18          .LCFI0:
 19 0001 31C0        xorl  %eax, %eax
 20 0003 53          pushl %ebx
 21          .LCFI1:
 22          .loc 1 100 0
GAS LISTING /var/tmp//ccQZ7caK.s       page 3


 23 0004 8B74240C    movl  12(%esp), %esi
 24 0008 8B5C2410    movl  16(%esp), %ebx
 25 000c 8B4C2414    movl  20(%esp), %ecx
 26 0010 8B542418    movl  24(%esp), %edx
 27          .LVL1:
 28          .L2:
 29          .LBB57:
101:bitslicemd5.c ****     int i;
102:bitslicemd5.c ****     vector a;
103:bitslicemd5.c ****
104:bitslicemd5.c ****     i = 32;
105:bitslicemd5.c ****
106:bitslicemd5.c ****     do
107:bitslicemd5.c ****     {
108:bitslicemd5.c ****         a = *z++;
 30          .loc 1 108 0
 31 0014 660F6F3C    movdqa  (%ecx,%eax), %xmm7
 31      01
 32          .LVL2:
109:bitslicemd5.c ****         *f++ = a ^ (*x++ & (*y++ ^ a));
 33          .loc 1 109 0c
 34 0019 660F6F34    movdqa  (%ebx,%eax), %xmm6
 34      03
 35 001e 660FEFF7    pxor  %xmm7, %xmm6
 36 0022 660FDB34    pand  (%esi,%eax), %xmm6
 36      06
 37 0027 660FEFF7    pxor  %xmm7, %xmm6
 38 002b 660F7F34    movdqa  %xmm6, (%edx,%eax)
 38      02
 39          .loc 1 108 0
 40 0030 660F6F6C    movdqa  16(%ecx,%eax), %xmm5
 40      0110
 41          .LVL3:
 42          .loc 1 109 0
 43 0036 660F6F64    movdqa  16(%ebx,%eax), %xmm4
 43      0310
 44 003c 660FEFE5    pxor  %xmm5, %xmm4
 45 0040 660FDB64    pand  16(%esi,%eax), %xmm4
 45      0610
 46 0046 660FEFE5    pxor  %xmm5, %xmm4
 47 004a 660F7F64    movdqa  %xmm4, 16(%edx,%eax)
 47      0210
 48          .loc 1 108 0
 49 0050 660F6F5C    movdqa  32(%ecx,%eax), %xmm3
 49      0120
 50          .LVL4:
 51          .loc 1 109 0
 52 0056 660F6F54    movdqa  32(%ebx,%eax), %xmm2
 52      0320
 53 005c 660FEFD3    pxor  %xmm3, %xmm2
 54 0060 660FDB54    pand  32(%esi,%eax), %xmm2
 54      0620
 55 0066 660FEFD3    pxor  %xmm3, %xmm2
 56 006a 660F7F54    movdqa  %xmm2, 32(%edx,%eax)
 56      0220
 57          .loc 1 108 0
 58 0070 660F6F4C    movdqa  48(%ecx,%eax), %xmm1
GAS LISTING /var/tmp//ccQZ7caK.s       page 4


 58      0130
 59          .LVL5:
 60          .loc 1 109 0
 61 0076 660F6F44    movdqa  48(%ebx,%eax), %xmm0
 61      0330
 62 007c 660FEFC1    pxor  %xmm1, %xmm0
 63 0080 660FDB44    pand  48(%esi,%eax), %xmm0
 63      0630
 64 0086 660FEFC1    pxor  %xmm1, %xmm0
 65 008a 660F7F44    movdqa  %xmm0, 48(%edx,%eax)
 65      0230
 66          .loc 1 108 0
 67 0090 660F6F7C    movdqa  64(%ecx,%eax), %xmm7
 67      0140
 68          .LVL6:
 69          .loc 1 109 0
 70 0096 660F6F74    movdqa  64(%ebx,%eax), %xmm6
 70      0340
 71 009c 660FEFF7    pxor  %xmm7, %xmm6
 72 00a0 660FDB74    pand  64(%esi,%eax), %xmm6
 72      0640
 73 00a6 660FEFF7    pxor  %xmm7, %xmm6
 74 00aa 660F7F74    movdqa  %xmm6, 64(%edx,%eax)
 74      0240
 75          .loc 1 108 0
 76 00b0 660F6F6C    movdqa  80(%ecx,%eax), %xmm5
 76      0150
 77          .LVL7:
 78          .loc 1 109 0
 79 00b6 660F6F64    movdqa  80(%ebx,%eax), %xmm4
 79      0350
 80 00bc 660FEFE5    pxor  %xmm5, %xmm4
 81 00c0 660FDB64    pand  80(%esi,%eax), %xmm4
 81      0650
 82 00c6 660FEFE5    pxor  %xmm5, %xmm4
 83 00ca 660F7F64    movdqa  %xmm4, 80(%edx,%eax)
 83      0250
 84          .loc 1 108 0
 85 00d0 660F6F5C    movdqa  96(%ecx,%eax), %xmm3
 85      0160
 86          .LVL8:
 87          .loc 1 109 0
 88 00d6 660F6F54    movdqa  96(%ebx,%eax), %xmm2
 88      0360
 89 00dc 660FEFD3    pxor  %xmm3, %xmm2
 90 00e0 660FDB54    pand  96(%esi,%eax), %xmm2
 90      0660
 91 00e6 660FEFD3    pxor  %xmm3, %xmm2
 92 00ea 660F7F54    movdqa  %xmm2, 96(%edx,%eax)
 92      0260
 93          .loc 1 108 0
 94 00f0 660F6F4C    movdqa  112(%ecx,%eax), %xmm1
 94      0170
 95          .LVL9:
 96          .loc 1 109 0
 97 00f6 660F6F44    movdqa  112(%ebx,%eax), %xmm0
 97      0370
GAS LISTING /var/tmp//ccQZ7caK.s       page 5


 98 00fc 660FEFC1    pxor  %xmm1, %xmm0
 99 0100 660FDB44    pand  112(%esi,%eax), %xmm0
 99      0670
100 0106 660FEFC1    pxor  %xmm1, %xmm0
101 010a 660F7F44    movdqa  %xmm0, 112(%edx,%eax)
101      0270
102 0110 83E880      subl  $-128, %eax
103          .loc 1 106 0
104 0113 3D000200    cmpl  $512, %eax
104      00
105 0118 0F85F6FE    jne .L2
105      FFFF
106          .LBE57:
110:bitslicemd5.c ****     } while (--i);
111:bitslicemd5.c **** }
107          .loc 1 111 0
108 011e 5B          popl  %ebx
109          .LVL10:
110 011f 5E          popl  %esi
111          .LVL11:
112 0120 C3          ret
113          .LFE3:
115 0121 EB0D9090    .p2align 4,,15
115      90909090
115      90909090
115      909090
117          _ZL1GPU8__vectoriS0_S0_S0_:
118          .LFB4:
112:bitslicemd5.c ****


**** note **** I'm on vacation (at a place without internet) till 17th of September. Will be back after that.  

C++

All Answers


Answers by Expert:


Ask Experts

Volunteer


vijayan

Expertise

my primary areas of interest are generic and template metaprogramming, STL, algorithms, design patterns and c++11. i would not answer questions about gui and web programming.

Experience

about 15 years or so

Education/Credentials
post graduate engineer

©2016 About.com. All rights reserved.