You are here:

C++/sorting a binary file

Advertisement


Question
I'm trying to get information from a user and then write it to the file in the correct place (for this, alphabetical order).  I don't know how to put it where I want it without overwriting data and without copying everything to RAM.  Ideally, it could also be indexed to make it faster to search, but since I don't know how to write it correctly in the first place, I haven't looked much at this yet.

Answer
You have correctly spotted the two options open to you: either read all existing data into memory, sort the in-memory collection, write the whole lot back out to file, or overwrite what you have already - moving the existing data at and after this position down before you write the data for the new record thus:

   - Move to end of file less byte-size-of-record bytes.
   - Read last byte-size-of-record bytes of data to memory and append (write) them to the end of the file
   - Moving backwards up the file until insert position reached move existing data down the file by
       reading chunks into memory and writing them byte-size-of-record bytes further down the file.
   - The insert position can now have the new record's bytes written to it

This of course would invalidate any indexes you already have for record positions following the inserted one.

The problem is very like inserting into the middle of an array or vector (single dimension array) - it is expensive.

Note that if your records are all the same size then calculating the position of the nth record is given by n * byte-size-of-record.

There are other strategies you can use, here are few possibilities for starters:

The first is not to maintain the data in order but to, as you were thinking of doing, maintain an index to each record's start position - then just maintain the index in order and append new data to the end of the file. This of course just moves the insertion problem to the index. However as index entries are probably going to be a reasonably small fixed size they can be easier to manage if records are large or variable size.

Another option is to also append new data to the end of the file but add a link field to link records in order. This effectively builds a linked list into the file.

Another option might be to pre-allocate a number of 'buckets' - areas in the file to place record data. Each bucket stores records whose keys hash to a certain range of values - say 0 .. 12, each representing a bucket - in this case you are creating something like a hash map in the file. Within each bucket you can sort the records - e.g. by loading into memory, sorting and writing back to the bucket. If a bucket is full then you will  have to box a little clever. You could dump such overflow records into an overflow bucket area, or allocate a new chunk of file and link it to the previous bucket for that hash value.

You can of course mix and match.

Note that you may have to create utilities that run from time to time to physically re-order files to optimise their performance and to check their consistency - a bit like file system de-fragmentation and checking/fixing tools.

Remember: disk storage is allocated in sectors or in some cases clusters of sectors - so if you have to increase your file's size it may be worth doing so in multiples of these block sizes, and have any additional space available for immediate use later (the std::vector class template does something similar when re-allocating memory - it allocates more than it requires so as not to have to do it so frequently).

So unless you have a lot of data to manipulate then loading the data into memory, manipulating it in memory, then writing it back out to file is going to be a lot simpler.

Here are a couple of previous questions and answers that shows simple examples of reading into memory, sorting and writing back to file (although as text rather than binary):

   http://en.allexperts.com/q/C-1040/Sorting-searching-data.htm
   http://en.allexperts.com/q/C-1040/file-handling-C-2.htm

As you are beginning to appreciate I hope this is a non-trivial problem and there is not necessarily one best solution, however I hope I have provided some food for thought at least.

C++

All Answers


Answers by Expert:


Ask Experts

Volunteer


Ralph McArdell

Expertise

I am a software developer with more than 15 years C++ experience and over 25 years experience developing a wide variety of applications for Windows NT/2000/XP, UNIX, Linux and other platforms. I can help with basic to advanced C++, C (although I do not write just-C much if at all these days so maybe ask in the C section about purely C matters), software development and many platform specific and system development problems.

Experience

My career started in the mid 1980s working as a batch process operator for the now defunct Inner London Education Authority, working on Prime mini computers. I then moved into the role of Programmer / Analyst, also on the Primes, then into technical support and finally into the micro computing section, using a variety of 16 and 8 bit machines. Following the demise of the ILEA I worked for a small company, now gone, called Hodos. I worked on a part task train simulator using C and the Intel DVI (Digital Video Interactive) - the hardware based predecessor to Indeo. Other projects included a CGI based train simulator (different goals to the first), and various other projects in C and Visual Basic (er, version 1 that is). When Hodos went into receivership I went freelance and finally managed to start working in C++. I initially had contracts working on train simulators (surprise) and multimedia - I worked on many of the Dorling Kindersley CD-ROM titles and wrote the screensaver games for the Wallace and Gromit Cracking Animator CD. My more recent contracts have been more traditionally IT based, working predominately in C++ on MS Windows NT, 2000. XP, Linux and UN*X. These projects have had wide ranging additional skill sets including system analysis and design, databases and SQL in various guises, C#, client server and remoting, cross porting applications between platforms and various client development processes. I have an interest in the development of the C++ core language and libraries and try to keep up with at least some of the papers on the ISO C++ Standard Committee site at http://www.open-std.org/jtc1/sc22/wg21/.

Education/Credentials

©2012 About.com, a part of The New York Times Company. All rights reserved.