C++/Sorting and searching data
How can I sort data in alphebatical order after writng them(some records containing information about customers,say)in to a file? How can I create a program to read and print customer records arranged with their names in alphabatical order?
Aiso I want to how to search for a record by the customer name.I'd be grateful if you can provide with the techniques used for these problems.
This is a very large topic so I'll only give you a few pointers here.
First, the techniques used depend to some extent on the (expected) size of the data. Sorting a few hundred or even a few thousand small records is not going to tax most programs even if they are written very simply, particularly on a modern computer.
However, the whole thing changes when you have millions, maybe hundreds of millions or more, of records and/or the records are (very) large. In these cases your first problem is that you have to be clever about memory management and input/ output. You probably cannot just load all the data into memory, sort it and write it back out. Even if it fitted in memory reading it and writing it would be very expensive. Here using the correct sorting algorithm(s) can be crucial to the performance of the system.
Exactly where you draw the lines depends on the various performance characteristics of your computer.
If you are interested in the mechanics of sorting and searching algorithms then I suggest you obtain some good reference material. An obvious choice would be Donald Knuth's third volume of his (uncompleted) "The Art of Computer Programming" series "Sorting and Searching". However he uses his own MIX assembler like language for the examples, and the style might be a little heavy going, depending on your mindset (I found the first volume quite heavy going at times, and in fact never finished it!). A more relevant book might be one in which C or C++ is used for the examples such as "Algorithms in C++" by Robert Sedgewick. Make sure any such book you obtain has sorting and searching algorithms covered to the level of detail you require and the text is written in a style you can get along with.
The next problem with what you require is to define "alphabetic order". Such orderings differ depending on where you are in the world.
I shall now assume a simple example and run through putting together such an application as you describe. A lot of the example will be described at a high level as to go into too much code would be too much to cover in such an answer, sorry. Please not that any code shown is for example purposes only and not up to production quality.
First I shall state some assumptions:
The record storage format is as a single flat file of text.
Each record is the same size.
The whole record set will fit easily into memory.
The character set used naturally has an English collation order.
Now there are two ways to sort the data: physically sort it and logically sort it.
Physically sorting requires that the original data file be re-written with the records in the sorted order, and can obviously only apply to one way of sorting at a time - if its written in customer name order then it cannot also be written in date of birth order at the same time.
Logically sorting the data leaves the records in the same state but creates a separate index into the data for each record in the ordering. There can be many indexes into the records data at any time. The simplest implementation is to create a separate index file for each ordering.
The data in an index file is a reference to each record. This can be the absolute position of the records in the record data file. If each record is the same size (as stated in my example assumptions) then the record number is sufficient to determine the record's position and is especially easy to do if the records are indexed from 0. In this case the n-th record is n times the byte size of the records in the file. This is pretty much the same calculation that the compiler does when determining the address of an element in a built in C or C++ single dimension array in memory.
So, to sort the data you read in each record into a collection in memory (remember this is a simple example), sort that data if the collection is not in order by default, and write out to an index file the reference to each record. As I stipulated that each record is the same size then this can be the index of the (sorted) records. Assuming the record data in the file is backed by a C++ user-defined class type in the program and it supports operator<< and operator>> for writing to and reading from C++ IOStreams then we might have something like the following:
// Use vector as data type for record collection
typedef std::vector<CustomerRecord> DataSetType;
// Function to compare two customer names
bool ByName( CustomerRecord const & cr1, CustomerRecord const & cr2 )
return cr1.GetName() < cr2.GetName();
// Read in data into a std::vector of customer records
unsigned int idx(0);
while ( dataIn.good() )
dataIn >> inRec;
if ( dataIn.good() )
inRec.SetRecordIndex( idx );
dataSet.push_back( inRec );
// Sort the records in memory using the C++ syandard library std::sort
// algorithm together with the ByName comparing function
std::sort( dataSet.begin(), dataSet.end(), ByName );
At this point we can either re-write the Customers.dat data file so the records are in order by writing out the records from the sorted in-memory vector, or we could write an index file, which is just the original record index we set when reading in the records, for example:
for ( unsigned int i=0; i < dataSet.size(); ++i )
idxOut << dataSet[i].GetRecordIndex() << '\n';
We can read the index like so:
typedef std::vector<unsigned int> CustomerRecordIndexType;
while ( idxIn.good() )
unsigned int recIdx;
idxIn >> recIdx;
if ( idxIn.good() )
byNameIndex.push_back( recIdx );
Basically, we open the index file for reading, keep reading record index values until an error is encountered - hopefully this should be the end of file. We store each read index value at the end of a vector of index values. Iterating from beginning to end of this vector will access the record index numbers in sorted name order. We can use this index vector like so:
for ( unsigned int i=0; i < byNameIndex.size(); ++i )
// calculate next record's byte offset in the data file:
unsigned int recordOffset
( byNameIndex[i]*CustomerRecord::RecordStreamSize() );
// position dataIn stream's get position to that position:
// Read in record:
dataIn >> inRec;
// display name:
std::cout << inRec.GetName() << '\n';
The idea here is to use the record index values in the byNameIndex vector to calculate the position in the data file (which we opened for reading) of the record in question. This, as I mentioned above, is done by multiplying the record index (i.e. byNameIndex[i]) by the byte size in the data file of each record ( i.e. CustomerRecord::RecordStreamSize() ). Having obtained this offset we use the seekg file random access stream operation to seek to that offset within the data file (the g in seekg stands for get - i.e. reading - as opposed to the seek position for writing - or put which has the name seekp). After calling seekg the next read operation will be from that point which should be at the position of the start of a record, so we read the record in using the >> operator for the customer record type and then print the name out.
To search for the customer's name and read that record we modify what we did above. It is an algorithm called a binary chop or binary search and only works on sorted data. The idea here is to start with the record in the middle of the index. So in the example code if there are 101 records and therefore the index has 101 record index values then we start by looking at the record having the index given by byNameIndex, or more generically byNameIndex[byNameIndex.size()/2]. If the name in this record is not the one we wish it will compare either less or greater than the one we are looking for (note: we will need a different comparator function to the ByName function shown above as that function returns a bool result only). If it is less then it will be in the lower half of the sorted range, if it is greater it will be in the upper half. We halve the size of the starting offset and add or subtract it to its value to give a new index record index. In the example it would be the record given by byNameIndex or byNameIndex. This process is repeated until the size of change in position is 0. At this point if the name is not found then it is not in the data set.
Alternatively you could load the whole of the customer data as we did when indexing and use std::find to locate the required name. If we make the data sorted either by using std::sort as we did when generating the index or by loading the records in order using the index we can then use std::binary_search.
If I thought about it some more I might be able to find a way to use std::binary_search together with the index file and raw data file.
Now I am running out of both time and space (AllExperts have a 10000 character limit on answers) so that will have to do for now. Hopefully it is enough for you to be going on with. Please ask additional questions if you need more specific information or other advise.
Finally, I would also recommend a good book on the standard C++ library - I use "The C++ Standard Library A Tutorial and Reference" by Nicolai M. Josuttis. There is also some information here: http://www.sgi.com/tech/stl/
and here: http://www.digilife.be/quickreferences/PT.htm
(this page may try to pop up something).