C++/A question on strcmp's implementation

Question
Below is the code snippet of Microsoft's C runtime strcmp implementation:

int __cdecl strcmp (
       const char * src,
       const char * dst
       )
{
       int ret = 0 ;

       while( ! (ret = *(unsigned char *)src - *(unsigned char *)dst) && *dst)
         ++src, ++dst;

       if ( ret < 0 )
         ret = -1 ;
       else if ( ret > 0 )
         ret = 1 ;

       return( ret );
}

My question is: why cast both src and dst from char* to unsigned char*?

Answer
--------------------------
FOLLOWUP
--------------------------

One thing I should caution you about is posting verbatim copies of _copyrighted_ code and other materials. Effectively you are copying it for all the world to see on the AllExperts site (it is published at http://en.allexperts.com/q/C-1040/2008/1/question-strcmp-implementation.htm) and the copyright owners - Microsoft - may have something to say about that.

Ask yourself:

Does this organisation have a lot of money to protect what is theirs?

Do they have an awesome legal department?

Have they been known to go after people and organisations that infringe on their rights?

My take is ‘yes’ to all three questions.

Remember Microsoft code is not usually open source – check the license (EULA – end user license agreement) terms for your copy of MS Visual C++ and see if you are allowed to re-distribute or publish the source code of the CRT library provided to you in the product (in short, as far as I can tell, no).


--------------------------
ORIGINAL
--------------------------

Well I cannot comment that authoritatively on the whys and wherefores and thought processes behind code I did not write (<g>).

However, I do note that character sets and code pages do not (usually) use signed values, whereas a plain char in C and C++ may be signed or unsigned, depending on the compiler and often on the compiler flags used. In fact C and C++ have three distinct char types: char, signed char and unsigned char. For the Microsoft compiler a plain char is signed by default, but this can be changed by specifying /J on the command line (for Visual C++ 2005).

Thus in order to compare (unsigned) characters pointed to in a string of (probably signed) char, and to calculate a result consistent with the specification of strcmp, the character values pointed to by the src and dst pointers need to be compared as _unsigned_ character values. The C standard in fact requires this: the sign of a non-zero strcmp result is determined by the first pair of differing characters, both interpreted as unsigned char. This is achieved in the code by casting each of the pointers to an unsigned char * and then de-referencing it, so that what is pointed at is read as an unsigned char value.

Thus the value of ret each time is the result of the difference between the src and dst characters, as unsigned character values.
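To make the effect of the casts concrete, here is a minimal sketch of my own (it is not Microsoft's code) that runs the same loop twice: once ranking the bytes as unsigned values, and once ranking them as signed 8-bit values. The names as_signed_8bit, compare_unsigned, compare_signed and check_comparisons are hypothetical helpers, and 8-bit bytes are assumed.

```cpp
#include <cassert>

// Hypothetical helper: value of an 8-bit byte read as two's complement.
int as_signed_8bit(unsigned char byte)
{
    return byte > 127 ? byte - 256 : byte;
}

// Hypothetical helper: compares byte by byte as unsigned values, which is
// what the casts in the quoted code achieve. Returns -1, 0 or 1.
int compare_unsigned(const char* src, const char* dst)
{
    const unsigned char* s = reinterpret_cast<const unsigned char*>(src);
    const unsigned char* d = reinterpret_cast<const unsigned char*>(dst);
    while (*s == *d && *d != 0)
        ++s, ++d;
    return (*s > *d) - (*s < *d);
}

// Hypothetical helper: the same loop, but the first differing bytes are
// ranked by their signed 8-bit values, as would happen without the casts
// on a compiler where plain char is signed.
int compare_signed(const char* src, const char* dst)
{
    const unsigned char* s = reinterpret_cast<const unsigned char*>(src);
    const unsigned char* d = reinterpret_cast<const unsigned char*>(dst);
    while (*s == *d && *d != 0)
        ++s, ++d;
    const int lhs = as_signed_8bit(*s);
    const int rhs = as_signed_8bit(*d);
    return (lhs > rhs) - (lhs < rhs);
}

// Byte 0xA0 (160) against 'a' (97): the two rankings disagree on the sign.
void check_comparisons()
{
    assert(compare_unsigned("\xA0", "a") == 1);   // 160 > 97
    assert(compare_signed("\xA0", "a") == -1);    // -96 < 97
    assert(compare_unsigned("abc", "abc") == 0);
    assert(compare_signed("abc", "abd") == -1);   // plain ASCII is unaffected
}
```

For strings that contain only ASCII characters the two functions agree; they differ only when a byte above 127 is the first point of difference.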

If you do not see why using signed characters can be a problem, consider comparing strings which use non-ASCII characters above 127, such as those in the 850 (Latin 1) code page (see http://en.wikipedia.org/wiki/Code_page_850 for more details). For example, suppose one string contains a lowercase a with an acute accent, which appears to be character number 160 in code page 850, and it is compared with a string that has a normal a (value 97) in the same position. The difference between these two values is plus or minus 63 (160 - 97 or vice versa).

However, if we use a (signed) char type to store these characters, then the range of an 8-bit byte using two's complement representation is -128 to +127, so the unsigned value 160 becomes negative when viewed as a signed value: 160 - 256 = -96. The plain a character is unaffected, as 97 falls within the range of positive 8-bit two's complement values. Thus when using signed values the difference between the two characters is minus or plus 193 (-96 - 97 or vice versa). You will notice that not only has the size of the difference changed but so has its sign when the calculation is performed the same way round as for the unsigned case (notice I swapped the order of plus and minus in the text to suit).

The effect of course is that if signed character values are used then every character above 127 compares incorrectly as lower valued than every character below 128. The order among the characters above 127 themselves is preserved (255 becomes -1, which is still greater than 254 as -2), but the whole upper half of the character set sorts before the lower half rather than after it.
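As a rough illustration of how the signed view reorders the character set, the sketch below sorts the same handful of byte values twice. The function names are hypothetical and 8-bit two's complement bytes are assumed.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical helper: value of an 8-bit byte read as two's complement.
int as_signed_8bit(unsigned char byte)
{
    return byte > 127 ? byte - 256 : byte;
}

// Sorts the bytes by their unsigned values (0..255).
std::vector<unsigned char> sort_as_unsigned(std::vector<unsigned char> bytes)
{
    std::sort(bytes.begin(), bytes.end());
    return bytes;
}

// Sorts the bytes by the values they would have as signed chars (-128..127).
std::vector<unsigned char> sort_as_signed(std::vector<unsigned char> bytes)
{
    std::sort(bytes.begin(), bytes.end(),
              [](unsigned char lhs, unsigned char rhs)
              { return as_signed_8bit(lhs) < as_signed_8bit(rhs); });
    return bytes;
}

void check_orderings()
{
    const std::vector<unsigned char> input{97, 160, 65, 255};

    // Unsigned view: 'A' (65), 'a' (97), then the two high characters.
    assert((sort_as_unsigned(input) == std::vector<unsigned char>{65, 97, 160, 255}));

    // Signed view: 160 (-96) and 255 (-1) move to the front, but keep
    // their order relative to each other.
    assert((sort_as_signed(input) == std::vector<unsigned char>{160, 255, 65, 97}));
}
```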

You might like to try the following on your Visual C++ compiler:

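   #include <iostream>

   int main()
   {
       // 160 is a-acute in code page 850; 'a' is 97.
       unsigned char aAccute(160);
       char aAccuteSigned( aAccute );  // -96 if plain char is signed
       unsigned char a('a');
       char aSigned('a');

       // Differences between the unsigned values.
       std::cout << "a-aAccute=" << int(a-aAccute) << '\n';
       std::cout << "aAccute-a=" << int(aAccute-a) << '\n';

       // The same differences using plain char values.
       std::cout << "aSigned-aAccuteSigned="
         << int(aSigned-aAccuteSigned) << '\n';
       std::cout << "aAccuteSigned-aSigned="
         << int(aAccuteSigned-aSigned) << '\n';
   }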

I have used the same types as used in the strcmp example. With plain char signed (the Visual C++ default), the results should be:

   a-aAccute=-63
   aAccute-a=63
   aSigned-aAccuteSigned=193
   aAccuteSigned-aSigned=-193

You might like to try the effect of compiling using /J (Project property C/C++->Language->Default Char Unsigned in Visual Studio 2005) on this code and see what happens to the results.
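If you would rather have the program itself report which way plain char was compiled, something along these lines should do. It is only a sketch: plain_char_is_signed, plain_char_difference and check_plain_char are hypothetical names, and 8-bit chars are assumed.

```cpp
#include <cassert>
#include <climits>
#include <limits>

// True if plain char is a signed type for this compiler and these flags.
constexpr bool plain_char_is_signed()
{
    return std::numeric_limits<char>::is_signed;
}

// Difference between two plain char values after the usual promotion to int.
int plain_char_difference(char lhs, char rhs)
{
    return lhs - rhs;
}

void check_plain_char()
{
    // CHAR_MIN is negative exactly when plain char is signed.
    assert(plain_char_is_signed() == (CHAR_MIN < 0));

    // 'a' (97) minus byte 160: 193 with signed char, -63 with unsigned char.
    const char aAcute = '\xA0';
    const int expected = plain_char_is_signed() ? 193 : -63;
    assert(plain_char_difference('a', aAcute) == expected);
}
```

With the default settings I would expect plain_char_is_signed() to be true on Visual C++, and false when compiling with /J.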

Whether keeping the character sequence strictly ordered like this is useful for such strings may be debatable: should characters with accents be compared as the same, lower or higher valued than the same character without an accent? What about symbols such as the copyright symbol and pound sign? Where should they come in the collation sequence?

However other code pages do have symbol sets that can benefit from correct handling of value comparisons for character values above 127. Code page 737 (Greek) for example (see http://en.wikipedia.org/wiki/Code_page_737) has the letters of the Greek alphabet ordered in sequence for both lowercase and uppercase in a similar way to the English alphabet symbol values in the base ASCII character set.
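As a small sketch of this, the following uses what I believe are the code page 737 values for uppercase alpha, beta and omega (0x80, 0x81 and 0x97). Treat those byte values as an assumption and check them against the code page table. The names kAlpha, kBeta, kOmega, sorts_before and check_greek_order are hypothetical.

```cpp
#include <cassert>
#include <cstring>

// Assumed code page 737 byte values (check against the code page table).
const char kAlpha[] = "\x80";   // uppercase alpha
const char kBeta[]  = "\x81";   // uppercase beta
const char kOmega[] = "\x97";   // uppercase omega

// True if lhs orders before rhs according to strcmp.
bool sorts_before(const char* lhs, const char* rhs)
{
    return std::strcmp(lhs, rhs) < 0;
}

void check_greek_order()
{
    // Because strcmp compares as unsigned char, the Greek letters keep
    // their alphabetical order...
    assert(sorts_before(kAlpha, kBeta));
    assert(sorts_before(kBeta, kOmega));

    // ...and they all come after the ASCII letters rather than before them.
    assert(sorts_before("Z", kAlpha));
    assert(sorts_before("z", kAlpha));
}
```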

Hope this helps.  

Ralph McArdell
