You are here:

C++/Sorting and searching data

Advertisement


Question
How can I sort data in alphebatical order after writng them(some records containing information about customers,say)in to a file? How can I create a program to read and print customer records arranged with their names in alphabatical order?

Aiso I want to how to search for a record by the customer name.I'd be grateful if you can provide with the techniques used for these problems.

Answer
This is a very large topic so I'll only give you a few pointers here.

First, the techniques used depend to some extent on the (expected) size of the data. Sorting a few hundred or even a few thousand small records is not going to tax most programs even if they are written very simply, particularly on a modern computer.

However, the whole thing changes when you have millions, maybe hundreds of millions or more, of records and/or the records are (very) large. In these cases your first problem is that you have to be clever about memory management and input/ output. You probably cannot just load all the data into memory, sort it and write it back out. Even if it fitted in memory reading it and writing it would be very expensive. Here using the correct sorting algorithm(s) can be crucial to the performance of the system.

Exactly where you draw the lines depends on the various performance characteristics of your computer.

If you are interested in the mechanics of sorting and searching algorithms then I suggest you obtain some good reference material. An obvious choice would be Donald Knuth's third volume of his (uncompleted) "The Art of Computer Programming" series "Sorting and Searching". However he uses his own MIX assembler like language for the examples, and the style might be a little heavy going, depending on your mindset (I found the first volume quite heavy going at times, and in fact never finished it!). A more relevant book might be one in which C or C++ is used for the examples such as "Algorithms in C++" by Robert Sedgewick. Make sure any such book you obtain has sorting and searching algorithms covered to the level of detail you require and the text is written in a style you can get along with.

The next problem with what you require is to define "alphabetic order". Such orderings differ depending on where you are in the world.

I shall now assume a simple example and run through putting together such an application as you describe. A lot of the example will be described at a high level as to go into too much code would be too much to cover in such an answer, sorry. Please not that any code shown is for example purposes only and not up to production quality.

First I shall state some assumptions:

The record storage format is as a single flat file of text.

Each record is the same size.

The whole record set will fit easily into memory.

The character set used naturally has an English collation order.

Now there are two ways to sort the data: physically sort it and logically sort it.

Physically sorting requires that the original data file be re-written with the records in the sorted order, and can obviously only apply to one way of sorting at a time - if its written in customer name order then it cannot also be written in date of birth order at the same time.

Logically sorting the data leaves the records in the same state but creates a separate index into the data for each record in the ordering. There can be many indexes into the records data at any time. The simplest implementation is to create a separate index file for each ordering.

The data in an index file is a reference to each record. This can be the absolute position of the records in the record data file. If each record is the same size (as stated in my example assumptions) then the record number is sufficient to determine the record's position and is especially easy to do if the records are indexed from 0. In this case the n-th record is n times the byte size of the records in the file. This is pretty much the same calculation that the compiler does when determining the address of an element in a built in C or C++ single dimension array in memory.

So, to sort the data you read in each record into a collection in memory (remember this is a simple example), sort that data if the collection is not in order by default, and write out to an index file the reference to each record. As I stipulated that each record is the same size then this can be the index of the (sorted) records. Assuming the record data in the file is backed by a C++ user-defined class type in the program and it supports operator<< and operator>> for writing to and reading from C++ IOStreams then we might have something like the following:

// Use vector as data type for record collection
typedef std::vector<CustomerRecord> DataSetType;

// Function to compare two customer names
bool ByName( CustomerRecord const & cr1, CustomerRecord const & cr2 )
{
   return cr1.GetName() < cr2.GetName();
}

// ...

// Read in data into a std::vector of customer records
 CustomerRecord inRec;
 std::ifstream dataIn("Customers.dat");
 DataSetType dataSet;
 unsigned int idx(0);
 while ( dataIn.good() )
 {
     dataIn >> inRec;
     if ( dataIn.good() )
     {
         inRec.SetRecordIndex( idx );
         dataSet.push_back( inRec );
         ++idx;
     }
 }

// Sort the records in memory using the C++ syandard library std::sort
// algorithm together with the ByName comparing function
 std::sort( dataSet.begin(), dataSet.end(), ByName );

At this point we can either re-write the Customers.dat data file so the records are in order by writing out the records from the sorted in-memory vector, or we could write an index file, which is just the original record index we set when reading in the records, for example:

 std::ofstream idxOut("Customers_ByName.idx");

 for ( unsigned int i=0; i < dataSet.size(); ++i )
 {
     idxOut << dataSet[i].GetRecordIndex() << '\n';
 }

We can read the index like so:

typedef std::vector<unsigned int>     CustomerRecordIndexType;

// ...

 std::ifstream idxIn("Customers_ByName.idx");

 CustomerRecordIndexType byNameIndex;

 while ( idxIn.good() )
 {
     unsigned int recIdx;
     idxIn >> recIdx;
     if ( idxIn.good() )
     {
         byNameIndex.push_back( recIdx );
     }
 }

Basically, we open the index file for reading, keep reading record index values until an error is encountered - hopefully this should be the end of file. We store each read index value at the end of a vector of index values. Iterating from beginning to end of this vector will access the record index numbers in sorted name order. We can use this index vector like so:

 CustomerRecord inRec;
 std::ifstream dataIn("Customers.dat");

 for ( unsigned int i=0; i < byNameIndex.size(); ++i )
 {
 // calculate next record's byte offset in the data file:
   unsigned int recordOffset
         ( byNameIndex[i]*CustomerRecord::RecordStreamSize() );

 // position dataIn stream's get position to that position:
   dataIn.seekg(recordOffset);

 // Read in record:
   dataIn >> inRec;

 // display name:
   std::cout << inRec.GetName() << '\n';
 }

The idea here is to use the record index values in the byNameIndex vector to calculate the position in the data file (which we opened for reading) of the record in question. This, as I mentioned above, is done by multiplying the record index (i.e. byNameIndex[i]) by the byte size in the data file of each record ( i.e. CustomerRecord::RecordStreamSize() ). Having obtained this offset we use the seekg file random access stream operation to seek to that offset within the data file (the g in seekg stands for get - i.e. reading - as opposed to the seek position for writing - or put which has the name seekp). After calling seekg the next read operation will be from that point which should be at the position of the start of a record, so we read the record in using the >> operator for the customer record type and then print the name out.

To search for the customer's name and read that record we modify what we did above. It is an algorithm called a binary chop or binary search and only works on sorted data. The idea here is to start with the record in the middle of the index. So in the example code if there are 101 records and therefore the index has 101 record index values then we start by looking at the record having the index given by byNameIndex[50], or more generically byNameIndex[byNameIndex.size()/2]. If the name in this record is not the one we wish it will compare either less or greater than the one we are looking for (note: we will need a different comparator function to the ByName function shown above as that function returns a bool result only). If it is less then it will be in the lower half of the sorted range, if it is greater it will be in the upper half. We halve the size of the starting offset and add or subtract it to its value to give a new index record index. In the example it would be the record given by byNameIndex[25] or byNameIndex[75]. This process is repeated until the size of change in position is 0. At this point if the name is not found then it is not in the data set.

Alternatively you could load the whole of the customer data as we did when indexing and use std::find to locate the required name. If we make the data sorted either by using std::sort as we did when generating the index or by loading the records in order using the index we can then use std::binary_search.

If I thought about it some more I might be able to find a way to use std::binary_search together with the index file and raw data file.

Now I am running out of both time and space (AllExperts have a 10000 character limit on answers) so that will have to do for now. Hopefully it is enough for you to be going on with. Please ask additional questions if you need more specific information or other advise.

Finally, I would also recommend a good book on the standard C++ library - I use "The C++ Standard Library A Tutorial and Reference" by Nicolai M. Josuttis. There is also some information here: http://www.sgi.com/tech/stl/ and here: http://www.digilife.be/quickreferences/PT.htm (this page may try to pop up something).  

C++

All Answers


Answers by Expert:


Ask Experts

Volunteer


Ralph McArdell

Expertise

I am a software developer with more than 15 years C++ experience and over 25 years experience developing a wide variety of applications for Windows NT/2000/XP, UNIX, Linux and other platforms. I can help with basic to advanced C++, C (although I do not write just-C much if at all these days so maybe ask in the C section about purely C matters), software development and many platform specific and system development problems.

Experience

My career started in the mid 1980s working as a batch process operator for the now defunct Inner London Education Authority, working on Prime mini computers. I then moved into the role of Programmer / Analyst, also on the Primes, then into technical support and finally into the micro computing section, using a variety of 16 and 8 bit machines. Following the demise of the ILEA I worked for a small company, now gone, called Hodos. I worked on a part task train simulator using C and the Intel DVI (Digital Video Interactive) - the hardware based predecessor to Indeo. Other projects included a CGI based train simulator (different goals to the first), and various other projects in C and Visual Basic (er, version 1 that is). When Hodos went into receivership I went freelance and finally managed to start working in C++. I initially had contracts working on train simulators (surprise) and multimedia - I worked on many of the Dorling Kindersley CD-ROM titles and wrote the screensaver games for the Wallace and Gromit Cracking Animator CD. My more recent contracts have been more traditionally IT based, working predominately in C++ on MS Windows NT, 2000. XP, Linux and UN*X. These projects have had wide ranging additional skill sets including system analysis and design, databases and SQL in various guises, C#, client server and remoting, cross porting applications between platforms and various client development processes. I have an interest in the development of the C++ core language and libraries and try to keep up with at least some of the papers on the ISO C++ Standard Committee site at http://www.open-std.org/jtc1/sc22/wg21/.

Education/Credentials

©2016 About.com. All rights reserved.