C++/reading wordlist words into memory

Question
QUESTION: I am writing a program which will access a wordlist file and check every word in that file to see if a given array matches that word. Here is an example:

char candidate[] = "testword"; //word to test to see if in wordlist
int truthcounter;
truthcounter = 0;
char word[1000]; //array to store words from wordlist in
ifstream wordlist(argv[1]); //open the wordlist file supplied as argument to main
while(!wordlist.eof()) //for every word in the wordlist...
{
  wordlist.getline(word,999); //get the word...
  for(int x = 0;x<strlen(candidate);x++) //and go through this complicated
  {                                      //process of checking every
     if(candidate[x]==word[x])           //character of the word to see
     {                                   //if the current candidate matches
        truthcounter++;                  //the current word from the wordlist
     }
  }
  if(truthcounter==strlen(candidate)) //if every character in the candidate
  {          //is the same as the current wordlist word, output confirmation
     std::cout << "word is in wordlist!\n";
  }
}

I hope this doesn't seem too cryptic. The problem with this code is that it is in a loop, and the candidate changes each time through the loop. In other words, I have to close the wordlist file and re-open it every time through the loop. This makes the program incredibly slow (~180 candidates tried per second, when I need several million per second).

The question is: is there any way that I can read every word in a wordlist file from the disk into memory, without making a unique array for each word? The wordlist file has millions of words, and I cannot be sure of the length of them (thus char word[1000], to be on the safe side).

I was thinking something like reading through the wordlist and saving each word to an element of an array. Perhaps a two-dimensional array? Example:

char wordlistarray; //an array of arrays
ifstream wordlist(argv[1]);
for(int i = 0; !wordlist.eof();i++)
{
  wordlist.getline(wordlistarray[i],999);
}
std::cout << wordlistarray[0];

So, for example, if the first word in the wordlist was "Hindenburg", the code would store that array in the first element of the wordlistarray array. So wordlistarray[0] would be "Hindenburg". The program above would output "Hindenburg" to the screen. And I could test the wordlist against the candidate like this:

char candidate[] = "Hindenburg";
for(int c = 0;c<strlen(wordlistarray);c++)
{
  if(candidate==wordlistarray[c]) //this is not possible. The idea is if the entire
  {                               //candidate array matches the entire array stored
     std::cout << "word is in wordlist!\n"; //in element zero of the wordlistarray array
  }                               //then print confirmation
}

I have tried and failed to do this with multi-dimensional arrays. These are char arrays, which may be part of the problem. I have read about linked lists, but they seem too complicated for what I am trying to do.

Any suggestions? Thank you for your time.


ANSWER: > The wordlist file has millions of words, and I cannot be sure
> of the length of them (thus char word[1000], to be on the safe side).

use a std::string instead of a char array.
http://www.cprogramming.com/tutorial/string.html


> read every word in a wordlist file from the disk into memory

also, to keep the words that you have read, prefer using a std::vector<> in place of a C-style array.
http://www.cprogramming.com/tutorial/stl/vector.html

> go through this complicated
> process of checking every
> character of the word to see
> current candidate matches

to compare two C-style strings, use std::strcmp(): http://www.cppreference.com/wiki/c/string/strcmp

here, since we would be using the C++ std::string instead, we can compare them using the overloaded == operator.

thus a simple version of your program would be:

#include <string>
#include <vector>
#include <iostream>
#include <fstream>

int main( int argc, char** argv )
{
   std::ifstream file( argv[1] ) ;
   std::vector< std::string > word_list ;

   // read words (lines) from file into the vector
   std::string word ;
   while( std::getline( file, word ) ) word_list.push_back(word) ;
   std::cout << word_list.size() << " words\n" ;

   std::string candidate ;
   while( ( std::cout << "candidate word? " ) && ( std::cin >> candidate ) )
   {
       // iterate through the vector looking for the word
       for( std::size_t i = 0 ; i < word_list.size() ; ++i )
       {
          if( word_list[i] == candidate )
          {
             std::cout << "word is in wordlist!\n" ;
             break ;
          }
       }
   }
}





---------- FOLLOW-UP ----------

QUESTION: Thank you for your response. I have implemented your code and it works. There is still one problem, however. Using this code, I can only test about 416,666 candidates per second when testing a wordlist file containing only 6,134 words! The code slowed my program down from almost 5,000,000 trials per second to 769 trials per second. This is a major problem for me, as speed is essential to my program. Am I correct in understanding that the entire wordlist has been loaded into memory? If so, why would it take so long to check the vector to see if the candidate is in the wordlist? Surely the processor can access the RAM faster than that! I have eight gigs of RAM in my computer, so I know that is not a problem of running out of room.



I modified the code you gave me slightly to do a benchmark:

#include <string>
#include <vector>
#include <iostream>
#include <fstream>
#include <ctime>

int main( int argc, char** argv )
{
   unsigned int startTime = time(0);
   unsigned int endTime;
   unsigned long int processtime;
   unsigned int keyrate;

   std::ifstream file( argv[1] ) ;
   std::vector< std::string > word_list ;

   // read words (lines) from file into the vector
   std::string word ;
   while( std::getline( file, word ) ) word_list.push_back(word) ;
   std::cout << word_list.size() << " words\n" ;

   char candidate[] = "$uper";

   for(int c = 0; c<50000000 ; c++) //does 50 000 000 loops
   {
      // iterate through the vector looking for the word
      for( std::size_t i = 0 ; i < word_list.size() ; ++i )
      {
         if( word_list[i] == candidate )
         {
            //std::cout << "word is in wordlist!\n" ;
            break ;
         }
      }
   }

   //benchmark code
   endTime = time(0);
   processtime = endTime - startTime;
   keyrate = 50000000/processtime;
   std::cout << "keyrate = " << keyrate << "\n";
}



this program yields about 10,000,000 trials per second for a wordlist with a single word in it, 416,666 trials per second for a wordlist with 6,134 words in it. Am I doing something wrong?



this is the actual test function used in my code:

for( std::size_t i = 0 ; i < word_list.size() ; ++i ) //run through...
{
  if( word_list[i] == candidate ) //if candidate is in wordlist...
  {
     recoveredwords++; //count the word as recovered
     if(recoveredwords==word_list.size()) //if every word recovered...
     {
        goto end; //quit program
     }
  }
} //end of test function



the idea is that when every word in the wordlist has matched one of the candidates, the program exits. So when I have generated a candidate for each word in the wordlist file, the program is done.



Perhaps if I were able to remove the word from the vector after it had tested out against the candidate, I would have to test that many fewer words each time the candidate matches. And when there are no words left in the vector, the program exits. That should speed things up a little. Here is an example:



for( std::size_t i = 0 ; i < word_list.size() ; ++i ) //run through...
{
  if( word_list[i] == candidate ) //if candidate is in wordlist...
  {
     delete word_list[i]; //delete that word from the vector
     //that way I don't have to test it again the next time around
     //and if there are multiple words in the vector that match
     //the candidate, delete them all
     if(number of words left in vector = 0)
     {
        std::cout << "every word in wordlist recovered!\n";
        goto end; //quit program
     }
  }
} //end of test function



Any idea how I would go about doing this? Or some other optimizations I might use? Am I doing something wrong that is slowing down my code? Any help would be greatly appreciated. Thank you for your time.

Answer
> Am I correct in understanding that the entire wordlist has been loaded into memory?

Yes.

> The code slowed my program... why would it take so long...

Perhaps you have compiled your code with optimizations (inlining of function calls) disabled. Member functions of std::vector<>, std::string etc. need to be inlined to get performance equivalent to hand-written code using C-style arrays.

> Perhaps if I were able to remove the word from the vector after it had tested out against the candidate.
> Then I would have to test that many less words each time the candidate matches...
>  delete word_list[i]; //delete that word from the vector

You cannot erase an element from the vector that way. Instead, you have to use std::vector<>::erase. http://www.cppreference.com/wiki/stl/vector/erase
This is not a viable option anyway; erasing an element from the middle of a vector is an expensive operation - it takes O(N) time.

> Any idea how I would go about doing this? Or some other optimizations I might use?

Right now, the search in the word list is a *linear* search, this takes time proportional to the number of words in the list or O(N) time. Since the list needs to be initialized only once and then repeatedly searched, we can speed up the search to take only logarithmic time by first sorting the list and then doing a *binary* search.

here is how you could do that:

#include <vector>
#include <iostream>
#include <fstream>
#include <algorithm>
#include <string>
#include <ctime>

typedef std::vector< std::string > word_list_type ;


void initialize_data_structures( const char* file_name,
         word_list_type& word_list )
{
  std::ifstream file( file_name ) ;

  // read words (lines) from file into the vector
  std::string word ;
  while( std::getline( file, word ) ) word_list.push_back(word) ;
  std::cout << word_list.size() << " words\n" ;

  // sort word_list
  std::sort( word_list.begin(), word_list.end() ) ;
}

inline bool look_up_word( const std::string& word,
         const word_list_type& word_list )
{
   return std::binary_search( word_list.begin(),
         word_list.end(),
         word ) ;
}

int main()
{
 word_list_type word_list ;
 const char* word_file = "/usr/share/dict/words" ;

 initialize_data_structures( word_file, word_list ) ;

 std::size_t n = 0 ;
 std::clock_t begin = std::clock() ;

 for( std::size_t i = 0 ; i<word_list.size() ; ++i )
     if( look_up_word( word_list[i], word_list ) ) ++n ;

 std::clock_t end = std::clock() ;

 std::cout << n << " words. in " << double(end-begin) / CLOCKS_PER_SEC
         << " seconds\n" ;
}

on FreeBSD 6.3, GCC 4.3, Pentium III 700 Mhz, this is what I get:

vijayan@~/code:>uname -spr
FreeBSD 6.3-PRERELEASE i386

vijayan@~/code:>dmesg | grep CPU:
CPU: Intel(R) Pentium(R) III Mobile CPU 700MHz (695.96-MHz 686-class CPU)

vijayan@~/code:>c++ --version | grep GCC
g++43 (GCC) 4.3.4 20090312 (prerelease)

vijayan@~/code:>c++ word_list_slow.cc && ./a.out
235882 words
235882 words. in 1.0625 seconds

with inlining enabled:
vijayan@~/code:>c++ -O3 word_list_slow.cc && ./a.out
235882 words
235882 words. in 0.4375 seconds

Your machine is several times faster than this one; this may well be all the performance that you need.

If you need to make it faster, you could exploit this: we only need to search those words in the word list whose length matches the length of the word we are looking for. So if we order the word list by
a. first, length of the word
b. for words of the same length, lexicographically (alphabetically)
we need to look only into the right portion of the word list.

here is how you could do that.

#include <vector>
#include <iostream>
#include <fstream>
#include <algorithm>
#include <string>
#include <map>
#include <ctime>

typedef std::vector< std::string > word_list_type ;
typedef std::pair< std::size_t, std::size_t > word_list_pos ;
typedef std::map< std::size_t, word_list_pos > map_type ;

struct sort_on_length
{
 bool operator() ( const std::string& a, const std::string& b ) const
 {
   std::size_t len_a = a.size(), len_b = b.size() ;
   if( len_a < len_b ) return true ;
   else if( len_a > len_b ) return false ;
   else return a < b ; // lexicographic order if lengths are same
 }
};

void initialize_data_structures( const char* file_name,
         word_list_type& word_list,
         map_type& look_up )
{
  std::ifstream file( file_name ) ;

  // read words (lines) from file into the vector
  std::string word ;
  while( std::getline( file, word ) ) word_list.push_back(word) ;
  std::cout << word_list.size() << " words\n" ;

  // sort word_list on length as primary key,
  // lexicographic order as secondary key
  std::sort( word_list.begin(), word_list.end(), sort_on_length()  ) ;

  // initialize lookup for fast lookup on word size
  std::size_t curr_length = word_list[0].size() ;
  std::size_t curr_pos = 0U ;
  const std::size_t NUM_WORDS = word_list.size() ;

  for( std::size_t pos = 1 ; pos < NUM_WORDS ; ++pos )
  {
      std::size_t len = word_list[pos].size() ;
      if( len > curr_length )
      {
        look_up[ curr_length ] = std::make_pair( curr_pos, pos ) ;
        curr_length = len ;
        curr_pos = pos ;
      }
  }
  look_up[ curr_length ] = std::make_pair( curr_pos, NUM_WORDS ) ;
}

inline bool look_up_word( const std::string& word,
         const word_list_type& word_list,
         const map_type& look_up )
{
 std::size_t len = word.size() ;
 map_type::const_iterator iter = look_up.find(len) ;
 if( iter == look_up.end() )
   return false ;
 else
 {
   std::size_t start_pos = iter->second.first ;
   std::size_t end_pos = iter->second.second ;

   return std::binary_search( word_list.begin() + start_pos,
         word_list.begin() + end_pos,
         word ) ;
 }
}

int main()
{
 enum { EXPECTED_MAX_WORDS = 300000U } ; // estimate
 word_list_type word_list ;
 word_list.reserve( EXPECTED_MAX_WORDS ) ;
 map_type look_up ;
 const char* word_file = "/usr/share/dict/words" ;

 initialize_data_structures( word_file, word_list, look_up ) ;

 std::size_t n = 0 ;
 std::clock_t begin = std::clock() ;

 for( std::size_t i = 0 ; i<word_list.size() ; ++i )
     if( look_up_word( word_list[i], word_list, look_up ) ) ++n ;

 std::clock_t end = std::clock() ;

 std::cout << n << " words. in " << double(end-begin) / CLOCKS_PER_SEC
         << " seconds\n" ;
}

vijayan@~/code:>c++ word_list.cc && ./a.out
235882 words
235882 words. in 1.01562 seconds

vijayan@~/code:>c++ -O3 word_list.cc && ./a.out
235882 words
235882 words. in 0.375 seconds


this uses a lot of C++ features (iterators, function objects, maps and so on) that you may be unfamiliar with right now. But the larger the number of words in the list, the greater the improvement in performance.

We could make it even faster by using a hash map and a slightly fancier algorithm, but this would require libraries which are not in the current C++ standard (C++98) and would be available only in C++0x or a TR1 implementation. I would suggest that you look at that option only if these fail to meet your performance requirements.

vijayan

Expertise

my primary areas of interest are generic and template metaprogramming, STL, algorithms, design patterns and c++11. i would not answer questions about gui and web programming.

Experience

about 15 years or so

Education/Credentials
post graduate engineer
