C++/A question on strcmp's implementation
Below is the code snippet of Microsoft's C runtime strcmp implementation:
int __cdecl strcmp (
        const char * src,
        const char * dst
        )
{
        int ret = 0 ;

        /* advance while the characters match and dst has not ended */
        while( ! (ret = *(unsigned char *)src - *(unsigned char *)dst) && *dst)
                ++src, ++dst;

        if ( ret < 0 )
                ret = -1 ;
        else if ( ret > 0 )
                ret = 1 ;

        return( ret );
}
My question is, why cast both src and dst from char* to unsigned char*?
One thing I should caution you about is posting verbatim copies of _copyrighted_ code and other materials. Effectively you are copying it for all the world to see on the AllExperts site (it is published at http://en.allexperts.com/q/C-1040/2008/1/question-strcmp-implementation.htm) and the copyright owners - Microsoft - may have something to say about that.
Does this organisation have a lot of money to protect what is theirs?
Do they have an awesome legal department?
Have they been known to go after people and organisations that infringe on their rights?
My take is yes to all three questions.
Remember that Microsoft code is not usually open source. Check the license terms (the EULA, or end user license agreement) for your copy of MS Visual C++ and see whether you are allowed to re-distribute or publish the source code of the CRT library provided to you in the product (in short, as far as I can tell, no).
Well I cannot comment that authoritatively on the whys and wherefores and thought processes behind code I did not write (<g>).
However I do note that character sets and code pages do not (usually) use signed values. However a plain char in C and C++ may be signed or unsigned, depending on the compiler and often the compiler flags used. In fact C and C++ have three char types: char, signed char and unsigned char. For the Microsoft compiler a plain char is signed by default but this can be changed by specifying /J on the command line (for Visual C++ 2005).
Thus in order to compare (unsigned) characters pointed to in a string of (probably signed) char and calculate a result consistent with the specification of strcmp, the character values pointed to by the src and dst pointers need to be compared as _unsigned_ character values. This is achieved in the code by casting each of the pointers to an unsigned char * and then de-referencing what is pointed at, which will be interpreted as an unsigned char value.
Thus the value of ret each time is the result of the difference between the src and dst characters, as unsigned character values.
If you do not see that using signed characters can be a problem, consider comparing strings that use non-ASCII characters above 127, such as those in the 850 (Latin 1) code page (see http://en.wikipedia.org/wiki/Code_page_850 for more details). For example, suppose one string contains a lowercase a with an acute accent, which appears to be character number 160 in code page 850, and it is compared with a string that has a normal a (value 97) in the same position. The difference between these two values is plus or minus 63 (160 - 97 or vice versa).
However, if we use a (signed) char type to store these characters then the range of an 8-bit byte, using two's complement representation, is -128 to +127, so the unsigned value 160 becomes negative when viewed as a signed value: it becomes -96. The plain a character is unaffected as it falls within the range of positive 8-bit two's complement values. Thus when using signed values the difference between the two characters is minus or plus 193 (-96 - 97 or vice versa). Notice that not only has the magnitude changed but so has the sign when the calculation is performed the same way round as in the unsigned case (I swapped the order of plus and minus in the text to suit). The effect, of course, is that if signed character values are used then every character valued above 127 incorrectly compares lower than every character valued below 128. The relative order of two characters that are both above 127 is not disturbed, for example 255 (-1) still compares higher than 254 (-2), but the whole upper half of the character set ends up sorted before the ASCII range rather than after it.
You might like to try the following on your Visual C++ compiler:
#include <iostream>

int main()
{
    unsigned char aAccute(160);      // a with acute accent in code page 850
    char aAccuteSigned( aAccute );   // same bits viewed as a plain char
    unsigned char a('a');            // value 97
    char aSigned( a );               // plain char copy of 'a'
    std::cout << "a-aAccute=" << int(a-aAccute) << '\n';
    std::cout << "aAccute-a=" << int(aAccute-a) << '\n';
    std::cout << "aSigned-aAccuteSigned="
              << int(aSigned-aAccuteSigned) << '\n';
    std::cout << "aAccuteSigned-aSigned="
              << int(aAccuteSigned-aSigned) << '\n';
}
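I have used the same types as used in the strcmp example. With plain char signed (the Visual C++ default) the results should be:
a-aAccute=-63
aAccute-a=63
aSigned-aAccuteSigned=193
aAccuteSigned-aSigned=-193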
You might like to try the effect of compiling using /J (Project property C/C++->Language->Default Char Unsigned in Visual Studio 2005) on this code and see what happens to the results.
Whether keeping the character sequence strictly ordered like this is useful for such strings may be debatable: should characters with accents be compared as the same, lower or higher valued than the same character without an accent? What about symbols such as the copyright symbol and pound sign? Where should they come in the collation sequence?
However other code pages do have symbol sets that can benefit from correct handling of value comparisons for character values above 127. Code page 737 (Greek) for example (see http://en.wikipedia.org/wiki/Code_page_737) has the letters of the Greek alphabet ordered in sequence for both lowercase and uppercase in a similar way to the English alphabet symbol values in the base ASCII character set.
Hope this helps.