C++/Splitting a char

Question
Hi.
I am a noob in C and can't get a simple thing working.
Background:
I am writing a WinSock client. The client connects to a server and receives a special string. (something like "ABC|123456|654321"). Then my program needs to split the string by "|" character (creating three variables with values: "ABC", "123456","654321").

Thanks!  

Answer
Please note that I am primarily a C++ expert and you are asking a C question in the C++ section. I mention this because I work mainly in C++ these days. Many years ago I did work in C, and most of C (as defined by the original 1989/1990 standard) is compatible with C++. The reverse is not true: C++ allows things that C does not.

I warn you in advance that my pure C is a little rusty. I apologise for any mistakes.

OK, warning over. The second point is that you are _not_ splitting a char. A char is a scalar integer type whose size is 1 by definition: sizeof(char) is always 1, and on most popular desktop and server platforms that one byte is 8 bits wide. Unlike the other integer types (short int, int and long int), plain char is not necessarily signed; the choice is left to the implementation, and some compilers have a command line switch to change the default. So a char can represent small integer values, typically 0 to 255 or -128 to 127. These values are also used to represent characters in some character set (7-bit ASCII or 8-bit EBCDIC, for example).
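The properties of char described above can be checked on your own compiler. The following is a minimal sketch using the standard <limits.h> macros; the function names plain_char_is_signed and print_char_properties are my own, chosen for illustration only.

```c
#include <limits.h>
#include <stdio.h>

/* Returns 1 if plain char is a signed type on this implementation, else 0. */
int plain_char_is_signed(void)
{
    return CHAR_MIN < 0;
}

/* Prints the size, width and value range of plain char. */
void print_char_properties(void)
{
    printf("sizeof(char) = %lu\n", (unsigned long)sizeof(char));
    printf("CHAR_BIT     = %d\n", CHAR_BIT);
    printf("char range   = %d to %d\n", CHAR_MIN, CHAR_MAX);
    printf("plain char is %s\n",
           plain_char_is_signed() ? "signed" : "unsigned");
}
```

Calling print_char_properties() from main shows what your compiler uses; the range and signedness may differ between compilers and compiler options.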

What you wish to split are C style character strings (at least I assume this is the case as you do not say otherwise). These by convention are arrays of char with the last element a zero, which indicates the end of the string. For this reason they are sometimes called zero terminated strings or similar.
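To make the zero terminator concrete, here is a small sketch that finds the length of a C string by walking the array until it reaches the zero element. This is roughly what the library function strlen does for you; count_until_terminator is a name invented for this example.

```c
#include <stddef.h>

/* Counts the characters before the terminating zero.
   s must point to a zero-terminated string. */
size_t count_until_terminator(const char *s)
{
    size_t n = 0;

    while (s[n] != '\0')
    {
        ++n;
    }
    return n;
}
```

For the literal "ABC" this returns 3, although the array holding the string has four elements: 'A', 'B', 'C' and the terminating zero.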

The C standard library contains several functions for working with zero terminated strings. To use them include the header <string.h>. The function of interest in this case is strtok. You call strtok repeatedly to split a string made up of many tokens into those tokens; it has the following declaration:

       char * strtok( char *strToken, const char *strDelimit );

You pass strtok pointers to two zero-terminated strings. Note that in most expressions in C and C++ the name of an array is converted to a pointer to its first (zeroth) element, so an array of char can be passed directly. The second parameter is specified as const, meaning the function does not modify it and the pointer can safely refer to a string literal. The first parameter is not const because strtok modifies the string it splits, so it must refer to a writable array and never to a string literal.

The first parameter, strToken, is the string you wish to split and the second, strDelimit, is a string containing the characters that delimit the tokens. The returned value is a pointer to the next token found, or a null pointer if there are no more.

For example if your example string from the server is in the string variable msg, and you wish to split it into tokens delimited by | then you would make an initial call to strtok like so:

         char * token = strtok( msg, "|" );

To extract following tokens you pass a null pointer value to strtok in place of msg:

       token = strtok( NULL,  "|" );

Each time you call strtok, the delimiting character it finds is replaced with a zero character, which terminates the token string. The pointer returned points to the first character of the token. The final token returned is the one terminated by the zero character at the end of the original string. Note this point well: it means the string passed as strToken is modified. In our example the contents of msg are changed. You can display each token like so:

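#include <string.h>
#include <stdio.h>

int main(void)
{
   /* A writable array, because strtok modifies the string it splits */
   char msg[256] =  "ABC|123456|654321";

   /* The first call names the string to split */
   char * token = strtok( msg, "|" );

   while ( token != NULL )
   {
       printf( "%s\n", token );
       /* Later calls pass NULL to continue with the same string */
       token = strtok( NULL,  "|" );
   }

   return 0;
}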
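To see the modification for yourself, the sketch below splits a buffer and returns the number of tokens; afterwards the caller can inspect the buffer and find zero characters where the delimiters were. The helper name count_tokens_in_place is hypothetical, used only for this illustration.

```c
#include <string.h>

/* Splits buf in place using strtok and returns the number of tokens found.
   buf must be a writable array, not a string literal. */
int count_tokens_in_place(char *buf, const char *delims)
{
    int count = 0;
    char *token = strtok(buf, delims);

    while (token != NULL)
    {
        ++count;
        token = strtok(NULL, delims);
    }
    return count;
}
```

After calling this on an array holding "ABC|123456|654321" with "|" as the delimiter string, the elements at indexes 3 and 10 hold zero instead of '|', so printing the array as a string shows only "ABC". Note also that strtok treats a run of adjacent delimiters as a single delimiter, so "A||B" yields two tokens and no empty one. If your server can send empty fields, strtok alone will not report them.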

Warning: strtok keeps the state of the string it is currently splitting between calls, including a reference to the string itself. This means that only one string can be split by strtok at any one time, and that string MUST continue to exist until all of the token splitting operations are complete. Not heeding this warning will likely cause data corruption and inaccurate results. If you are writing multithreaded code, also read the small print of your strtok implementation's documentation to see whether calling it simultaneously from different threads is supported.
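If you do need to split two strings at the same time, one option is a tokenizer that keeps its state in a variable owned by the caller rather than inside the library. POSIX provides strtok_r and Microsoft's C runtime provides strtok_s for this, each taking an extra context pointer, but neither is part of the original C standard. The sketch below does a similar job using only the standard functions strspn and strcspn; next_token is a hypothetical helper written for this answer, not a library function.

```c
#include <string.h>

/* Returns the next token from *cursor, or NULL when there are none left.
   The caller initialises *cursor to the start of a writable string; all
   state lives in *cursor, so several strings can be split side by side. */
char *next_token(char **cursor, const char *delims)
{
    char *start;
    char *end;

    if (*cursor == NULL)
    {
        return NULL;
    }

    /* Skip any leading delimiters, as strtok does */
    start = *cursor + strspn(*cursor, delims);
    if (*start == '\0')
    {
        *cursor = NULL;
        return NULL;
    }

    /* Find the end of the token and terminate it */
    end = start + strcspn(start, delims);
    if (*end != '\0')
    {
        *end = '\0';
        *cursor = end + 1;
    }
    else
    {
        *cursor = NULL;   /* that was the last token */
    }
    return start;
}
```

Each string being split gets its own cursor variable, so calls for different strings can be interleaved freely. Like strtok, this sketch modifies the string and skips empty fields.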

I have used string literals for the delimiter string parameter in the strtok calls. It is better practice not to embed magic values such as "|" all over the place. First, changing the message format to use different delimiters means every use of "|" has to be found and changed. Second, "|" gives readers of your code (maybe yourself in a couple of years' time) no indication of whether a given occurrence refers to this type of message's delimiters or to something else. Together these make such changes tedious and likely to introduce bugs. So let's name our delimiters:

       const char * ServerMessageDelimiters = "|";

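Remember that in C, unlike C++, objects defined at file scope have external linkage by default even when they are const, so they can only be defined once in a program. If they are used from multiple code modules, define them in one module and declare them extern in the others (usually via a header).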
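As a rough sketch of how that looks in practice, the block below shows the extern declaration, the single definition and a use from another module, all in one listing so that it compiles as it stands. In a real project each part would live in its own file; the file names in the comments and the function is_server_message_delimiter are made up for this example.

```c
#include <string.h>

/* messages.h (hypothetical): the declaration shared by every module */
extern const char * ServerMessageDelimiters;

/* messages.c (hypothetical): the one and only definition */
const char * ServerMessageDelimiters = "|";

/* client.c (hypothetical): a module that only sees the declaration.
   Returns non-zero if c is one of the server message delimiters. */
int is_server_message_delimiter(char c)
{
    /* strchr would also match the terminating zero, so exclude it */
    return c != '\0' && strchr(ServerMessageDelimiters, c) != NULL;
}
```

Changing the delimiter set now means editing one line in one file.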

Now you specified that each value needed to be placed into a variable of its own and token is only a pointer into the (modified) message string for the currently extracted token. You could of course have multiple token pointers, each one pointing into the message string and this may be enough for what you wish. If not then a copy will have to be made of each token string. For this we can use the strcpy C library function.

In the following example I place the tokens into an array of strings (effectively a two-dimensional array of char) called msgParts. I have also defined limits on the number and length of tokens and the length of messages using the #define pre-processor facility. I could not recall whether C allows const integer variables to be used where a compile time constant is required, such as an array bound (I am testing the code using a C++ compiler!). It does not: in C a const int is not an integer constant expression, so #define (or an enum) is the usual choice.
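An alternative to #define that some C programmers prefer for integer limits is an unnamed enum, since enumeration constants are true compile time constants in C and obey normal scoping rules. This is only a sketch of that alternative, not what the examples in this answer use; msg_parts_storage_size is a hypothetical function added so there is something to call.

```c
#include <stddef.h>

/* Enumeration constants can be used as array bounds in C */
enum { MaxTokens = 10, MaxTokenLength = 255 };

/* Storage for up to MaxTokens strings of up to MaxTokenLength characters,
   plus one element per string for the zero terminator */
static char msgParts[MaxTokens][MaxTokenLength + 1];

/* Returns the total number of bytes reserved for token storage. */
size_t msg_parts_storage_size(void)
{
    return sizeof msgParts;
}
```

With the limits shown, the array occupies 10 * 256 = 2560 bytes.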

I have taken a simplistic approach to error handling. The while loop picking off the tokens terminates when there are no more tokens, when there is no space to store more tokens, or when a token is too long to fit. The length check uses strlen, which returns the length of a C string excluding the zero terminator.

Finally, after all tokens have been extracted and copied to the msgParts array, I iterate through this array printing out each token.

Here is the code:

#include <string.h>
#include <stdio.h>

const char * ServerMessageDelimiters = "|";
#define MaxTokens   10
#define MaxTokenLength 255
#define MaxMessageLength (MaxTokens*(MaxTokenLength+1)-1)

int main(void)
{
   char msg[MaxMessageLength+1] =  "ABC|123456|654321";

   /* One row per token; each row has room for the zero terminator */
   char msgParts[MaxTokens][MaxTokenLength+1];

   int partNumber = 0;
   int i;

   char * token = strtok( msg, ServerMessageDelimiters );

   /* Stop when tokens run out, storage is full or a token will not fit */
   while (  token != NULL
         && partNumber<MaxTokens
         && strlen(token) <= MaxTokenLength
         )
   {
       strcpy( msgParts[partNumber], token );
       token = strtok( NULL,  ServerMessageDelimiters );
       ++partNumber;
   }

   for ( i = 0; i < partNumber; ++i )
   {
       printf( "%s\n", msgParts[i] );
   }

   return 0;
}

An alternative is that we make a copy of the msg string and pass the copy to strtok (thus preserving the original), and then keep each of the pointers into this copy returned by strtok. The modified example main now looks like:

int main(void)
{
   char msg[MaxMessageLength+1] =  "ABC|123456|654321";

   char msgCopy[MaxMessageLength+1];

   /* Each entry points into msgCopy; no token text is copied */
   char * msgParts[MaxTokens];

   int partNumber = 0;
   int i;
   char * token;

   /* Split the copy so that msg itself is left unmodified */
   strcpy( msgCopy, msg );

   token = strtok( msgCopy, ServerMessageDelimiters );

   while ( token != NULL && partNumber<MaxTokens )
   {
       msgParts[partNumber] = token;
       token = strtok( NULL,  ServerMessageDelimiters );
       ++partNumber;
   }

   for ( i = 0; i < partNumber; ++i )
   {
       printf( "%s\n", msgParts[i] );
   }

   return 0;
}

A caveat is that the string in msgCopy needs to exist for at least as long as the pointers in msgParts refer into it.
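One way to keep that lifetime requirement visible is to have the caller own both the buffer and the pointer array and pass them to a helper. The sketch below assumes a function I have called split_into; it is not a library function, and its parameter names are illustrative only.

```c
#include <string.h>

/* Splits buf in place and stores a pointer to each token in parts.
   At most maxParts pointers are stored; the return value is the number
   stored. The pointers refer into buf, so buf must outlive their use. */
int split_into(char *buf, const char *delims, char *parts[], int maxParts)
{
    int count = 0;
    char *token = strtok(buf, delims);

    while (token != NULL && count < maxParts)
    {
        parts[count] = token;
        ++count;
        token = strtok(NULL, delims);
    }
    return count;
}
```

Because the helper never allocates or copies anything, it is clear from the call site that the token pointers are only valid while the caller's buffer is. Returning such pointers from a function whose local array held the buffer would leave them dangling.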

Remember that all the code here is example code only, so it will need additional error checking and handling. You may have different requirements. If you really need all message tokens to be copied and stored somewhere and you do not know how many there are, then you may have to look into dynamically allocated structures using malloc, free and the like, which is beyond the scope of your original question. You might like to absorb more of the basics of C before playing with dynamic memory and data structures too much.

Ralph McArdell

Expertise

I am a software developer with more than 15 years C++ experience and over 25 years experience developing a wide variety of applications for Windows NT/2000/XP, UNIX, Linux and other platforms. I can help with basic to advanced C++, C (although I do not write just-C much if at all these days so maybe ask in the C section about purely C matters), software development and many platform specific and system development problems.

Experience

My career started in the mid 1980s working as a batch process operator for the now defunct Inner London Education Authority, working on Prime mini computers. I then moved into the role of Programmer / Analyst, also on the Primes, then into technical support and finally into the micro computing section, using a variety of 16 and 8 bit machines. Following the demise of the ILEA I worked for a small company, now gone, called Hodos. I worked on a part task train simulator using C and the Intel DVI (Digital Video Interactive) - the hardware based predecessor to Indeo. Other projects included a CGI based train simulator (different goals to the first), and various other projects in C and Visual Basic (er, version 1 that is). When Hodos went into receivership I went freelance and finally managed to start working in C++. I initially had contracts working on train simulators (surprise) and multimedia - I worked on many of the Dorling Kindersley CD-ROM titles and wrote the screensaver games for the Wallace and Gromit Cracking Animator CD. My more recent contracts have been more traditionally IT based, working predominately in C++ on MS Windows NT, 2000, XP, Linux and UN*X. These projects have had wide ranging additional skill sets including system analysis and design, databases and SQL in various guises, C#, client server and remoting, cross porting applications between platforms and various client development processes. I have an interest in the development of the C++ core language and libraries and try to keep up with at least some of the papers on the ISO C++ Standard Committee site at http://www.open-std.org/jtc1/sc22/wg21/.

©2016 About.com. All rights reserved.