C++/C++ vectors (incrementing an element)
Expert: Ralph McArdell - 3/13/2008
Questionafter i read each character from an input stream (a text file) one by one how do I count the frequency of each character? I assume it is by incrementing the frequency for the appropriate element in a vector, like if there is the letter 'a' in the file then Vector[97] = incremented by one?
Vector[97] would be the ascii value for a...
Vector[98] would be the ascii value for b.. for example
i am stumped on this one..thanks
AnswerI am not sure why you say you are stumped as you have the basic idea totally correct!
The only thing you seem not to have not realised is that you can use the character value read as a number.
When you say:
Vector[97] would be the ascii value for a...
You can then take it one stage further and say:
Vector['a']
Character literals such as 'a' translate to the number representing that character in the character set in use for the compiler/platform in question. Hence we can use them in places an integer literal or constant can be used, as in the index usage above. Other obvious examples are:
For case values in switch statements:
switch (command_char)
{
case 'h': // Help requested
case 'H':
case '?':
DoHelp();
break;
case 'q': // Quit requested
case 'Q':
return;
// ...
}
And as conditions in if and loops:
if ( 'q' == the_char || 'Q' == the_char )
{
// ...
}
do
{
// ...
}
while ( 'q' != the_char && 'Q' != the_char )
The char type is nothing more than a small integer type, typically of 8-bit byte size. Unlike other C and C++ integer types it has three variations: char, signed char and unsigned char. This is because, unlike say int which is shorthand for signed int, char may be signed or unsigned, depending on the compiler and quite possibly the compiler switches in effect.
Thus as char is a small integer type variables of char type can be used where integer variables are allowed, for example:
Vector[read_character]
Where read_character is a variable of char type that contains the character read from the input stream.
[Note: The code fragments I show are for exposition only. The code has not been tested or even compiled so may contains typos or other errors. If this is the case then I apologise!].
Oh, strictly speaking assuming that the character set in use is ASCII or based on ASCII may be a dangerous assumption. Not all systems use such a character set, IBM systems traditionally used EBCDIC for example (see
http://en.wikipedia.org/wiki/EBCDIC - and note the Criticism and Humour section of the article).
I assume that now you have the method to access the frequency element for each character you are able to increment that value using the C/C++ increment operator ++.
Remember that for C++ pre-increment (++value) is preferred over post-increment (value++) in situations where either could be used. Which form is used is not a concern for built in types as used in this case. However C++ allows increment operators to be defined for user defined classes. In such cases post-increment operations are typically defined in terms of pre-increment operations and are therefore less efficient and so should only be used where really necessary. Hence it is a good idea to get into the habit of using pre-increment where either pre- or post-increment could be used. The same goes for decrement operators by the way.
Finally, although not really part of your question here are some things to consider:
While using a vector is OK for counting the frequency of letters in English/ASCII or similar western scripts, where a single-byte character code is used and thus there are only 256 possibilities (128 if we stick to ASCII, which is a 7-bit encoding, and even less if we exclude control codes) such a method may not be so feasible for larger character sets. The UNICODE character encoding for example (see
http://unicode.org/) currently allows the encoding of characters in around 21 bits (so on typical operating system platforms only 32-bit or wider data types can represent any UNICODE character in a single value – the UNICODE UTF-32 encoding). However not all this possible space is used (apparently only around 99100 characters are currently assigned codes). To count the frequency of text that could contain any current UNICODE character would thus require a larger vector, much of which would probably not be used in many cases. If we wished to cater for any future possible UNICODE character we would require an even larger vector.
Alternative UNICODE encodings using 16-bit (UTF-16) or 8-bit (UTF-8) values require multiple values for a single character for some characters. Such character encoding schemes using 8-bit bytes (UNICODE’s UTF-8 byte encoding is by no way the only such encoding) are called multibyte character sets. Handling such multibyte (or multi word) character data is obviously more complex than those that always encode one character in one value, and thus would require a modified scheme to the nice simple one we have here!
C++ has support for wide characters (i.e. more bits than a char) in the wchar_t type. Unlike other integer types there is only one wchar_t type and it is commonly a 16 or 32 bit integer type and may be signed or unsigned. For example Microsoft Windows uses 16-bit unsigned values, thus the Microsoft Visual C++ compiler has an unsigned 16-bit wchar_t type. On the other hand the GNU g++ compiler on my 64-bit Ubuntu Linux system has a signed 32-bit wchar_t type. In addition to the wchar_t type C++ also has wide strings (std::wstring) and wide streams (std::wistream, std::wostream etc.) that read from / write to narrow (char) character data externally and convert the values to/from wide (wchar_t) character data internally. The standard C++ library locales have facet types to convert between narrow and wide character encodings. C++ inherits wide versions of various C-string and character functions from C such as wcslen and wcscpy (include <cwchar>). Note that in C wchar_t is a type alias (typedef) to one of the other integer types e.g.
typedef usigned short wchar_t;
However in C++ wchar_t is a proper built in type distinct from other integer types.
Hope this helps and gives you some food for thought.