You are here:

C++/count unique words in a text file

Advertisement


Question
QUESTION: I am trying to obtain a list and count of all the unique words in a text file with a structure and linked list.  Unfortunately I am having a rough start and would appreciate any suggestions.  So far my program counts the number of words in the file and I have the structure defined.  I am lost as to where to go next.  I have included my code below. Thank you for your time.

//----------------------------------------------------------
#include <vcl\condefs.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>
#include <conio.h>

#pragma hdrstop
//----------------------------------------------------------
USERES("unique_words.res");
//----------------------------------------------------------
int main()
{
   FILE * stream;
   char ch;
   char prevChar;
   char word[];
   int length=0;

   //------------------------------------------------------
   // structure
   struct word{
      char * word;   // pointer
       int count;      // counter
   };

   //-----------------------------------------------------
   // hold count information
   int numWords = 0;

   //------------------------------------------------------
   // open file
   stream = fopen("c:\\Paragraph.txt", "r");
   if (stream == NULL)
   {
      printf("Error opening file...\n");
   }
   else {printf("File opened.\n");}

   //------------------------------------------------------
   // obtain file size
   fseek (stream, 0, SEEK_END);
   length = ftell (stream);
   fseek (stream, 0L, SEEK_SET);
   printf ("Length: %d\n\n", length);

   //------------------------------------------------------
   // determine number of characters and words
   for (int i = 0; i < length; i++)
   {
  ch = fgetc(stream);
      printf("%c",ch);

       // keep count of words
     if ((isalnum(prevChar)) && (isspace(ch) ||  ispunct(ch) && ch != '\''))
     {
        numWords++;
       }
       prevChar = ch;
   }

   //----------------------------------------------------
   // display values on screen
   printf("\n\nTotal Number of Words: " "%d\n", numWords);
   getch();
    
   //-----------------------------------------------------
   // close file
   fclose(stream);
  return 0;
}

ANSWER: First, be sure to initialize prevChar.  The first time(s) through the loop, it's not initialized.

Where you count the words, you need to skip all non-word delimiters (unless you know more about the data than you've told me).  At each delimiter, where you have a word, you need to look up the word in the linked list (not a bad choice for this exercise) and count it if it's there, and if not, add to the list.

Your struct word needs a 'struct word *next' member to make a linked list.  You probably want to use malloc to set next and set the count to 1 and the word to the pointer to the word.  And that brings up another issue - what is *word going to point to?  I would also malloc the pointer for *word using the strlen of the word (+1).  Like

void AddWord( const char *word )
{
struct word *ptrWord;

 ptrWord = malloc( sizof( struct word ) );
 ptrWord->next = NULL;
 ptrWord->count = 1;
 ptrWord->word = malloc( strlen( word ) + 1 );
 strcpy( ptrWord->word, word );
 if( listHead == NULL )
   wordList =
   listHead = ptrWord;
 else
   {
   wordList->next = ptrWord;
   wordList = ptrWord;
   }
}

The code above should have error checks on malloc.

You want also a pointer to the head of the list which is set above when it's NULL.  wordList points to the last item and is used for adding.  You would loop through the head pointer to the end doing a strcmp (or stricmp) on the word to see if it's in the list:

int  FindWord( const char *word )
{
struct word *wPtr;

 for( wPtr = listHead; wPtr != NULL; wPtr = wPtr->next )
   if( ! strcmp( word, wPtr->word ) )
     {
     ++wPtr->count;
     return 1;   // found
     }
 AddWord( word );
 return 0;
}

This isn't complete but shows how to store and compare the words.

I think you need to flesh out the testing for words more.  Do a lot of testing - you don't know what you'll be sent (multiple spaces, punctuation, numbers, etc).  Again, unless you know the input.

Bill

---------- FOLLOW-UP ----------

QUESTION: Unfortunately I am not allowed to use classes yet and have to keep it pretty basic.  I have the list working and all words are being read into it.  My problem now is that I am getting an Access Violation when trying to search the list and increase the count.  The way I have it set up, I am obtaining the word and searching the list for the word.  If the word is in the list, a count variable in the structure is supposed to increment; however, that is not happening.  Then if the word is not in the list, it is added.  I have stepped through the program and the access violation appears after the second words is detected.  I don't know why it is having a problem and am a beginner.  Therefore I am having trouble finding the problem.  Any help would be appreciated.  I have included my code.  Thank you.

//---------------------------------------------------------
#include <vcl\condefs.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>
#include <conio.h>

#pragma hdrstop
//---------------------------------------------------------
USERES("unique_words.res");
//----------------------------------------------------------
// structure
typedef struct TWord{
  char * m_word;
   int count;
   TWord * next;
}TWord;

//---------------------------------------------------------
int main()
{
   FILE * stream;
   char ch = 0;
   char prevChar = 0;
   int length=0;
   char newWord[1024];
   int newWord_len = 0;

   //-----------------------------------------------------
   // initialize beginning pointer to NULL
  TWord * start_ptr = NULL;
   TWord * end_ptr;
   TWord * temp;

   //-----------------------------------------------------
   // hold count information
   int numWords = 0;

   //------------------------------------------------------
   // open file
   stream = fopen("c:\\Paragraph.txt", "r");
   if (stream == NULL)
   {
      printf("Error opening file...\n");
   }
   else {printf("File opened.\n");}

   //-----------------------------------------------------
   // obtain file size
   fseek (stream, 0, SEEK_END);
   length = ftell (stream);
   fseek (stream, 0L, SEEK_SET);
   printf ("Length: %d\n\n", length);

   //------------------------------------------------------
   // determine number of characters and words
   temp->count = 0;

   for (int i = 0; i < length; i++)
   {
     ch = fgetc(stream);
      printf("%c",ch);

       // append characters to string
       newWord[newWord_len++] = ch;

       // keep count of words
       if ((isalnum(prevChar)) && (isspace(ch) || ispunct(ch) && ch != '\'' && ch != '-'))
       {
         printf("\t");
         printf(temp->m_word);
         printf("\n");
         for (temp = start_ptr; temp != NULL; temp = temp->next)
         {
         if(strcmp(newWord, temp->m_word)==0)
         {
         temp->count++;
         break;
         }
         }
         // reserve block of memory and copy the word to the structure instance
         temp = new TWord;
         temp->m_word = new char[newWord_len + 1];
         strncpy(temp->m_word, newWord, newWord_len);


         // head has no value - assign word to head
         if (start_ptr == NULL)
         {
         start_ptr = temp;
         end_ptr = temp;
         }
         // head has value - assign word to next node
         else
         {
         end_ptr->next = temp;
         temp->next = NULL;
         end_ptr = temp;
         }    

         // increase word count
         numWords++;

         // reset word length to zero
         newWord_len = 0;
       }
       prevChar = ch;
   }

   //------------------------------------------------------
   // display values on screen
   printf("\nTotal Number of Words: " "%d\n", numWords);
   printf("\nList: \n");
   for(temp = start_ptr; temp != NULL; temp = temp->next)
   {
      printf(temp->m_word);
       printf("\t");
   }
   getch();

   // cleanup
   delete temp;

   //-----------------------------------------------------
   // close file
   fclose(stream);
  return 0;
}


Answer
Corrections marked with **> (add) or **< (delete):


//---------------------------------------------------------
#include <vcl\condefs.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>
#include <conio.h>

#pragma hdrstop
//---------------------------------------------------------
USERES("unique_words.res");
//----------------------------------------------------------
// structure
typedef struct TWord{
char * m_word;
  int count;
  TWord * next;
}TWord;

//---------------------------------------------------------
int main()
{
  FILE * stream;
  char ch = 0;
  char prevChar = 0;
  int length=0;
  char newWord[1024];
  int newWord_len = 0;

  //-----------------------------------------------------
  // initialize beginning pointer to NULL
TWord * start_ptr = NULL;
  TWord * end_ptr;
  TWord * temp;

  //-----------------------------------------------------
  // hold count information
  int numWords = 0;

  //------------------------------------------------------
  // open file
  stream = fopen("c:\\Paragraph.txt", "r");
  if (stream == NULL)
  {
   printf("Error opening file...\n");
  }
  else {printf("File opened.\n");}

  //-----------------------------------------------------
  // obtain file size
  fseek (stream, 0, SEEK_END);
  length = ftell (stream);
  fseek (stream, 0L, SEEK_SET);
  printf ("Length: %d\n\n", length);

  //------------------------------------------------------
  // determine number of characters and words
**<   temp->count = 0;

  for (int i = 0; i < length; i++)
  {
ch = fgetc(stream);
   printf("%c",ch);

      // append characters to string
      newWord[newWord_len++] = ch;

      // keep count of words
      if ((isalnum(prevChar)) && (isspace(ch) || ispunct(ch) && ch != '\'' && ch != '-'))
      {
         printf("\t");
**<        printf(temp->m_word);
**>        printf(newWord);
         printf("\n");
       for (temp = start_ptr; temp != NULL; temp = temp->next)
         {
         if(strcmp(newWord, temp->m_word)==0)
         {
         temp->count++;
         break;
         }
         }
         // reserve block of memory and copy the word to the structure instance
**>      if(temp == NULL)
**>        {
         temp = new TWord;
**>        temp->next = NULL;
         temp->m_word = new char[newWord_len + 1];
         strncpy(temp->m_word, newWord, newWord_len);


         // head has no value - assign word to head
         if (start_ptr == NULL)
         {
         start_ptr = temp;
         end_ptr = temp;
         }
         // head has value - assign word to next node
         else
         {
         end_ptr->next = temp;
**<          temp->next = NULL;
         end_ptr = temp;
         }    
**>      }
         // increase word count
       numWords++;

         // reset word length to zero
         newWord_len = 0;
      }
      prevChar = ch;
  }

  //------------------------------------------------------
  // display values on screen
  printf("\nTotal Number of Words: " "%d\n", numWords);
  printf("\nList: \n");
  for(temp = start_ptr; temp != NULL; temp = temp->next)
  {
   printf(temp->m_word);
      printf("\t");
  }
  getch();

  // cleanup
  delete temp;

  //-----------------------------------------------------
  // close file
  fclose(stream);
return 0;
}

Comments: If you find the word, you don't want to add it to the list.  Temp wasn't initialized, so you can't reference it in the printf.  Always set temp->next to NULL or you're referencing an undefined next in the first element.

Bill

C++

All Answers


Answers by Expert:


Ask Experts

Volunteer


Bill A

Expertise

I can answer questions about C++, programming algorithms, Windows programming in MFC (which is C++). I cannot answer questions about STL (templates) and I have no experience with Linux. I do enjoy reviewing code and critiquing it or finding problems in it. I will also gladly show better algorithms or methods if you want to take advantage of that.

Experience

I've developed a commercial embedded C compiler/assembler and IDE with debugger toolset, of which the IDE and debugger are written in C++. I work in the industry writing high tech embedded programs and Windows programs to communicate with the embedded devices.

Publications
Book: Embedded Systems Design using the Rabbit 3000 Microprocessor Authored Chapter 10 in its entirety.

Education/Credentials
BS Computer Engineering

©2016 About.com. All rights reserved.