C++/How to differentiate integers and alphabets when reading .txt file using C

Question
Dear Sir,
I have only learnt C and have not been introduced to C++ at all. Can you advise how to differentiate integers from alphabetic characters using C? My .txt file contains a few thousand rows of integers in 5 columns, but I only wish to read and process a few of those columns. There are also alphabetic characters mixed in between.

I have opened the file using FILE *infile and used "fscanf" to scan for "char". I was able to differentiate digits from letters using "isdigit". However, it only recognizes individual digits, i.e. -20001 is read and processed as 2, 0, 0, 0, 1 rather than as minus twenty thousand and one. Even the negative sign is ignored. Please advise what I can improve in my code to recognize words and numbers successfully. Thanks!

Answer
First, because you are asking a C++ expert, I will make it clear that I have not written pure C code for many years. Hence I may introduce C++-only idioms, having forgotten some of the things you can do in C++ but not in C. If this is not acceptable then please ask a C expert, not a C++ expert!

Second, your description of your file does not give enough information to formulate a complete answer. For example is your data structured thus:

12345 asd 3456 hhh 6666 jgjgj 789 lkjh 5678

Or

12345,asd,3456,hhh,6666,jgjgj,789,lkjh,5678

Or

12345asd3456hhh6666jgjgj789lkjh5678

Or some other way.

Are the fields of a fixed width or are they of variable size? That is do you have something like:

12345 abcde
1234  abcd
 123   abc
  12    ab
   1     a

Or like this:

12345 abcde
1234 abcd
123 abc
12 ab
1 a

All these things affect possible ways to process your data.

The fscanf family of functions can do a lot more than read single characters, if you wish to spend the time concocting the correct format specification string. Whether you can use fscanf effectively and how depends on the way the data is structured:

- are fields separated by white-space (space, tab or newline)?
- are fields fixed width?
- are the alphabetic characters always the same (e.g. indicating a field name)?

If all you wish to do is read single characters then use fgetc.

fscanf (and its variations) is a very powerful function, but not always a safe one to use. For example, when storing C-strings, if the input sequence is too long for the provided buffer then invalid memory is overwritten; likewise the behaviour is undefined if the arguments provided to store input data do not match the types required by the format specifications. Additionally, it is not always used as safely as possible because people tend not to check for errors returned from fscanf.

The signature of fscanf is as follows:

   int fscanf( FILE * stream, char const * format, ... );

The returned value from fscanf is the number of input items assigned to the provided arguments (or the value of the EOF macro if an error occurs before any conversion). You should always check that this value matches the number of arguments you passed to fscanf after the format argument (i.e. third and subsequent arguments).

The format and ... arguments are where things get interesting.

In short, the format string describes the sort of data expected and what to do with it. The simplest and commonest usage is to just specify the data type expected for each field and then supply pointers to variables of those types in the remaining (variable) argument list (the ... ellipsis in the parameter list). In general each field is assumed to be separated by white-space. The most usual usage is therefore a set of these data field format specifications, like so:

   int n = 0;
   int field0;
   char field1[64];
   n = fscanf( inputFile, "%d%s", &field0, field1 );

   if ( n != 2 )
   {
   /* Oops! Something went wrong ... handle it here */
   }

We pass the address of the start of the memory in which to store each converted value. In the case of an int value (and other scalar field types) this is a pointer to the variable (e.g. &field0). In the case of a character array (for character string data) it will be the character array itself (e.g. field1), as this is converted to a pointer to the first (zeroth) element.

The addresses of the memory to store each converted field are in the same order as they appear in the format specification string. In the example two fields are specified: a signed decimal integer field (the %d in the format string) and a white-space terminated sequence of characters i.e. a string of characters (the %s in the format string).

The example above when applied to the following:

   12345 abcdefgh

Will read and store:

12345 in field0 as an int
abcdefgh in field1, as a zero terminated character array string.

Note that the maximum safe length for the character string field is 63 characters, as we specified 64 as the size of the field1 character array and one additional space is required for the terminating zero character. The program behaviour is undefined if a string field longer than 63 characters is read, as fscanf will then start overwriting memory that does not belong to field1 (ouch!): you may get a crash, it may not affect anything, it may corrupt some other data item(s), or it may go unnoticed for some time, making the cause of the problem difficult to track down. To help get around this problem we can specify a field width thus:

   n = fscanf( inputFile, "%d%63s", &field0, field1 );

Now field1 will be read up to the first white-space character (the usual field terminator), until 63 characters have been read, or until the end of file is reached. Of course if there were more fields following field1, as would probably be the case for your requirement, then it is likely that the next read and convert operation would fail, because the rest of the (overly long) string field would still be waiting to be read. For example, if the next field were another integer field, the conversion would meet the leftover string characters where it expected digits. The value returned in n would then not match the expected number of fields read and stored (now at least three, not two as in the original example), which we can check for and handle as an error condition.
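As a rough sketch of how the width-limited format and the return value check fit together, the following reads "integer word" pairs until the input runs out or stops matching. The names read_pair, sum_numbers and WORD_SIZE are made up for this illustration, and the layout (one integer followed by one word) is an assumption about your file. Note that %d accepts a leading minus sign, so -20001 is stored as a single negative number.

```c
#include <stdio.h>

#define WORD_SIZE 64   /* hypothetical buffer size, matching field1 above */

/* Reads one "integer word" pair from in.
   Returns 1 on success, 0 on end of file or malformed input. */
int read_pair(FILE *in, int *number, char word[WORD_SIZE])
{
    /* %d skips leading white-space and accepts an optional sign;
       %63s stops at white-space or after 63 characters. */
    int n = fscanf(in, "%d%63s", number, word);
    return n == 2;
}

/* Sums the integer column of every well-formed pair in the stream,
   stopping at the first pair that cannot be read. */
long sum_numbers(FILE *in)
{
    int number;
    char word[WORD_SIZE];
    long total = 0;

    while (read_pair(in, &number, word))
        total += number;
    return total;
}
```

You would call these with the FILE pointer returned by fopen, for example sum_numbers(infile), after checking that fopen did not return NULL.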

If we wish to read but not store some data in the input stream we can place a * in the format specification like so:

   n = fscanf( inputFile, "%d%*s", &field0 );

In the above we read but do not store the string field, and n should have a value of 1 if the fscanf call was successful. I am not sure but you suggest in your question that this might be useful, as you seem more concerned with the integer columns and not the alpha characters between them.
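If your rows look like the first layout I showed (integers and words alternating, separated by white-space), the suppression character can be repeated to keep only the integer columns. This is only a sketch under that assumption; read_three_ints is a made-up name and the number of columns is picked arbitrarily. Remember that suppressed fields are not counted in the returned value.

```c
#include <stdio.h>

/* Reads input laid out as: int word int word int
   and keeps only the three integers.
   Returns 1 on success, 0 on end of file or malformed input. */
int read_three_ints(FILE *in, int values[3])
{
    /* Each %*s reads a word and throws it away; only the three %d
       conversions are assigned, so a fully successful call returns 3. */
    int n = fscanf(in, "%d %*s %d %*s %d",
                   &values[0], &values[1], &values[2]);
    return n == 3;
}
```

Applied to the text 12345 asd -3456 hhh 6666 this stores 12345, -3456 and 6666 and ignores asd and hhh.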

If there are more characters on the line following the formatted input operations of the fscanf call that are of no interest (which you also suggest is the case) then just read them and discard them. In this case fgets is useful (file get string):

   #define MYPROGRAM_BUFFERSIZE 256

   /* ... */

   char buffer[MYPROGRAM_BUFFERSIZE];
   char * returned_value;

   /* ... */

   returned_value = fgets( buffer, MYPROGRAM_BUFFERSIZE, inputFile );

The above will read up to and including the first new line, to the end of file, or until one less than the value given for the second parameter has been read (MYPROGRAM_BUFFERSIZE-1, or 255, in this case), whichever comes first. The limit is one less than the second argument so that there is always space for the terminating zero character.

The new line character (if read) is retained so check that the last character is a new line ('\n') to ensure you have read up to the end of the line, and if not read some more (i.e. use a loop!).

Also check that you have not reached the end of file or that an error did not occur. This can be done by checking the returned value, which will be either a pointer to the read string (i.e. buffer in this case) or NULL if end of file or an error occurred:

   /* strlen requires string.h to be included */
   do
   {
       returned_value
         = fgets(buffer, MYPROGRAM_BUFFERSIZE, inputFile);
   }
   while (  returned_value != NULL
         && buffer[ strlen(buffer)-1 ] != '\n'
         );

The above keeps reading while no error occurred and end of file is not reached (i.e. returned_value != NULL) and while there is no new line in the last character position of the read string (i.e. buffer[ strlen(buffer)-1 ] != '\n'; note that the last character in the string is one less than the length of the string, as the first character is at position zero).
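Putting the two ideas together, a sketch of "read the first integer on each line, then throw the rest of the line away" might look like this. The function names and the deliberately small buffer size are invented for the example, and I have assumed the column you want comes first on the line. It uses strchr to look for the new line rather than indexing the last character, which sidesteps the case of an empty string.

```c
#include <stdio.h>
#include <string.h>

#define SKIP_BUFFER_SIZE 16  /* deliberately small; hypothetical value */

/* Discards input up to and including the next new line.
   Returns 1 if a new line was consumed, 0 if end of file or an
   error was hit first. */
int skip_rest_of_line(FILE *in)
{
    char buffer[SKIP_BUFFER_SIZE];

    while (fgets(buffer, sizeof buffer, in) != NULL)
    {
        /* A new line in the buffer means the whole line has been read. */
        if (strchr(buffer, '\n') != NULL)
            return 1;
    }
    return 0;
}

/* Reads the first integer on each line and ignores everything after it.
   Stores at most max values in out and returns how many were stored. */
int read_first_column(FILE *in, int *out, int max)
{
    int count = 0;

    while (count < max && fscanf(in, "%d", &out[count]) == 1)
    {
        ++count;
        if (!skip_rest_of_line(in))
            break;   /* no more lines */
    }
    return count;
}
```

Because the buffer is reused on every pass of the loop, the length of the discarded part of the line does not matter.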

I have only given some of the basic usages of fscanf here. You should read the full description of fscanf (and fgetc and fgets) in your development environment documentation. For UNIX, Linux (and similar) systems this can generally be found in the system man pages, e.g. at a terminal / console enter the command:

   man fscanf

to access the manual pages on fscanf. This information is also available online at various places, for example try:
http://www.opengroup.org/onlinepubs/007908775/xsh/stdio.h.html

If you are using the Microsoft compiler then you should have the MSDN library documentation available locally and can find these functions described in the

Development Tools and Languages > Visual Studio > Visual C++ > Reference > Library Reference > Run-Time Library

documentation (note: the exact location varies from time to time, this topic path is good for the version I have installed locally). You can use the search function to help locate what you want.

The MSDN library can also be found online at http://msdn2.microsoft.com/en-gb/library/default.aspx.

If after reading all this information you cannot see a way to make use of fscanf to do most of your formatted input and have to fall back on reading one character at a time, then use fgetc to read individual characters in a loop, as it seems you are doing with fscanf at the moment. However, you can build up individual field strings by appending characters to a field string buffer. When you detect a change of field type you can convert the buffer to the required type.

For example you could store all the digits for an integer field and when your code detects the end of the field (column) you could use atoi to convert the string to an integer:

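   /* Note: no trailing semicolons on #define lines, or the array
      declarations below will not compile */
   #define MYPROGRAM_MAX_FIELDWIDTH 255
   #define MYPROGRAM_MAX_INT_FIELDS 10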

   /* ... */

   char field_buffer[MYPROGRAM_MAX_FIELDWIDTH+1];
   int int_fields[MYPROGRAM_MAX_INT_FIELDS];
   int int_field_number = 0;

   /*
      ...

    Code to read in characters and build field string in field_buffer
    and detect end of integer field, so string in field_buffer is
    ready for conversion to an integer

      ...
   */

   int_fields[int_field_number] = atoi(field_buffer);

You should include stdlib.h to use atoi.
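The character-at-a-time approach could be sketched roughly as follows, here for the layout where digits and letters run together (e.g. 12345asd-3456hhh6666). The function name extract_ints and the MAX_DIGITS limit are invented for this example, and I have assumed that a '-' counts as a sign only when it comes immediately before a digit. Digits beyond MAX_DIGITS are silently dropped, and atoi gives no error report for values too large for an int, so treat this as a starting point only.

```c
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>

#define MAX_DIGITS 31   /* hypothetical limit on the length of one number */

/* Scans the stream one character at a time, collecting runs of digits
   (with an optional leading '-') and converting each run with atoi.
   Stores at most max values in out; returns how many were stored. */
int extract_ints(FILE *in, int *out, int max)
{
    char field[MAX_DIGITS + 1];
    int length = 0;      /* characters currently held in field */
    int count = 0;
    int c;               /* int, not char, so EOF can be detected */

    while (count < max)
    {
        c = fgetc(in);

        if (c != EOF && isdigit(c))
        {
            if (length < MAX_DIGITS)
                field[length++] = (char)c;
            continue;
        }

        /* Not a digit: any number being built ends here. A field
           holding only '-' is not a number and is discarded. */
        if (length > 0 && field[length - 1] != '-')
        {
            field[length] = '\0';
            out[count++] = atoi(field);
        }
        length = 0;

        if (c == EOF)
            break;
        if (c == '-')
            field[length++] = '-';   /* possible sign of the next number */
    }
    return count;
}
```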

Another possibility would be to read each line into a string buffer (char array) and then split it into tokens using strtok, which I do not have time to go into here, but you can look up the documentation for strtok (and atoi for that matter) as for other C library functions.
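For what it is worth, a minimal sketch of the strtok approach is shown below. It assumes the line has already been read into a modifiable char array (e.g. with fgets) and that fields are separated by spaces, tabs or commas. The name collect_ints is made up, and I have used strtol rather than atoi because strtol reports where the conversion stopped, which lets the code tell a whole-number token from a word.

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>

/* Splits line into tokens and stores every token that is entirely a
   decimal integer (optional sign included) in out.
   Note: strtok modifies line. Returns how many values were stored. */
int collect_ints(char *line, int *out, int max)
{
    const char *separators = " ,\t\r\n";   /* assumed field separators */
    int count = 0;
    char *token = strtok(line, separators);

    while (token != NULL && count < max)
    {
        char *end;
        long value;

        errno = 0;
        value = strtol(token, &end, 10);

        /* Accept the token only if something was converted, the whole
           token was used, and the value fits in an int. */
        if (end != token && *end == '\0' && errno == 0
            && value >= INT_MIN && value <= INT_MAX)
        {
            out[count++] = (int)value;
        }

        token = strtok(NULL, separators);
    }
    return count;
}
```

Given the line 12345,asd,-3456 hhh 66x6 this keeps 12345 and -3456 and skips asd, hhh and 66x6.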

Hope this gives you some pointers. If you have further questions then please ask (one at a time though, please!).

Ralph McArdell

Expertise

I am a software developer with more than 15 years C++ experience and over 25 years experience developing a wide variety of applications for Windows NT/2000/XP, UNIX, Linux and other platforms. I can help with basic to advanced C++, C (although I do not write just-C much if at all these days so maybe ask in the C section about purely C matters), software development and many platform specific and system development problems.

Experience

My career started in the mid 1980s working as a batch process operator for the now defunct Inner London Education Authority, working on Prime mini computers. I then moved into the role of Programmer / Analyst, also on the Primes, then into technical support and finally into the micro computing section, using a variety of 16 and 8 bit machines. Following the demise of the ILEA I worked for a small company, now gone, called Hodos. I worked on a part task train simulator using C and the Intel DVI (Digital Video Interactive) - the hardware based predecessor to Indeo. Other projects included a CGI based train simulator (different goals to the first), and various other projects in C and Visual Basic (er, version 1 that is). When Hodos went into receivership I went freelance and finally managed to start working in C++. I initially had contracts working on train simulators (surprise) and multimedia - I worked on many of the Dorling Kindersley CD-ROM titles and wrote the screensaver games for the Wallace and Gromit Cracking Animator CD. My more recent contracts have been more traditionally IT based, working predominately in C++ on MS Windows NT, 2000, XP, Linux and UN*X. These projects have had wide ranging additional skill sets including system analysis and design, databases and SQL in various guises, C#, client server and remoting, cross porting applications between platforms and various client development processes. I have an interest in the development of the C++ core language and libraries and try to keep up with at least some of the papers on the ISO C++ Standard Committee site at http://www.open-std.org/jtc1/sc22/wg21/.

©2016 About.com. All rights reserved.