C++/binary file

Question
QUESTION: Dear Ralph,

How to open and view a binary file using C++?
I'm assuming it is a string of 0's and 1's. I'd like to use this string. For example I'd like to count all 1's.

Thanks,
Andres

ANSWER: First, all digital data can be said to be representable as a string of 0s and 1s. Files usually group these into sets of 8 binary digits, or bits, called bytes, so we usually show raw values from files as byte values - 0 to 255 or 0 to FF hexadecimal (base 16) or 0 to 377 octal (base 8).

Secondly, a binary file is binary only in relation to text files.

A text file is one whose bytes represent symbols and control codes in some character set in which the concept of a line is important (control code values in the character set represent end of line - which might be more than one value depending on which conventions the text file uses).

Values in character sets may take a single byte, a fixed number of multiple bytes or a variable number of bytes. Some character sets have more than one encoding - Unicode characters can be encoded using variable multibyte (UTF-8), 1 or 2 16-bit (2-byte) values (UTF-16) or one 32-bit (4-byte) value (UTF-32).

A common single byte character set is the ASCII character set (more properly it takes 7-bits, which fits nicely in 8-bit bytes). In ASCII the letter 'A' is represented by the value 65 or 41 hexadecimal, the digit 0 by the value 48 or 30 hexadecimal, and the digit 1 by the value 49 or 31 hexadecimal.

I mention the fact that the representation of a symbol, such as 0 or 1, in readable text is _not_ a bit, byte or larger word grouping of bits having the value 0 or 1 but some other, character set encoding specific value in case you had not realised this.
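To make the distinction concrete, here is a minimal sketch. The function names digit_value and count_one_characters are made up for illustration and are not part of any library. It shows that the character '1' and the number 1 are different values, and that counting '1' characters in text is not the same thing as counting 1 bits:

```cpp
#include <string>

// Converts a digit character such as '7' to the number 7.
// Assumes c is in the range '0' to '9'. C++ guarantees that the digit
// characters are contiguous, so the subtraction works for any conforming
// character set, not only ASCII.
int digit_value(char c)
{
    return c - '0';
}

// Counts the '1' characters in a piece of text.
// Note that this counts characters, not bits.
unsigned int count_one_characters(std::string const & text)
{
    unsigned int count(0);
    for (std::string::size_type i = 0; i < text.size(); ++i)
    {
        if (text[i] == '1')
        {
            ++count;
        }
    }
    return count;
}
```

With ASCII, the text "0110" holds two '1' characters, yet each of those characters is stored as the byte value 49, which has three 1 bits of its own.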

In C and C++ a file can be opened in text mode (the default) or binary mode. This _only_ affects the end of line handling and maybe some other platform specific text-file only oddities (e.g. Microsoft operating systems treat character 26 - ctrl-Z in ASCII - as an end of file marker in text files for historic reasons from the MS-DOS days).

When a file is open in text mode, the C and C++ end of line sequence ('\n' - newline) is converted to the operating system end of line sequence (e.g. '\r' '\n' - carriage return, linefeed - on Microsoft OSes) when writing data, and operating system end of line sequences are converted to the C/C++ end of line when reading data. This is not good if you are trying to read or write non-text data such as the raw bits of a compressed image: the data in the file would not match what the program wrote, and the data the program reads would not match what is in the file. The only exception is when the C/C++ end of line sequence matches that of the operating system (which is the case for Linux and Unix-like operating systems), as then there is nothing to translate. For such non-text data we term the file binary (as in we want to treat this file as just a bunch of binary bits in bytes and want no sort of text mode translations and related oddness thank you very much).
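As a small sketch of what binary mode buys you, the two helpers below (write_raw and read_raw are hypothetical names chosen for this example) write and read bytes with std::ios::binary set. Bytes that text mode might alter on some platforms, such as 13, 10 and 26, should come back exactly as written:

```cpp
#include <fstream>
#include <string>

// Writes the bytes in data to the named file with no text mode translation.
// Returns false if the file could not be opened or written.
bool write_raw(std::string const & path, std::string const & data)
{
    std::ofstream out(path.c_str(),
                      std::ios::out | std::ios::binary | std::ios::trunc);
    if (!out.is_open())
    {
        return false;
    }
    out.write(data.data(), static_cast<std::streamsize>(data.size()));
    out.flush();
    return out.good();
}

// Reads the whole of the named file back as raw bytes.
// Returns an empty string if the file could not be opened.
std::string read_raw(std::string const & path)
{
    std::ifstream in(path.c_str(), std::ios::in | std::ios::binary);
    std::string data;
    char c;
    while (in.get(c)) // get(c) fails, ending the loop, at end of file
    {
        data += c;
    }
    return data;
}
```

If std::ios::binary were dropped from both calls, the round trip would still work on Linux, but on Microsoft systems the file on disk could contain extra carriage return bytes and reading could stop early at a byte of value 26.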

So your first action on the file is to open it for reading as a binary file, e.g.:

   #include <fstream>

   ...

   std::fstream binfile("path/to/the/file", std::ios::in|std::ios::binary);

   if ( binfile.is_open() )
   {
   // process file...
   }
   else
   {
   // handle error
   }

You can then read the file and output the values of the bytes as sequences of digits rather than as the raw characters they represent. For example, we can use the input stream get function to read each character; it returns an int so that end of file can be represented. This int value can then be output using formatted output:

   unsigned int const ValuesPerLine(20);  
   unsigned int byte_count(0);
   int value(EOF);
   while ( (value=binfile.get())!=EOF )
   {
       ++byte_count;
       std::cout << value << (byte_count%ValuesPerLine?' ':'\n');
   }

The parts with byte_count and ValuesPerLine are only there to insert newlines every ValuesPerLine values output. Note: the use of std::cout assumes iostream is included appropriately, and strictly speaking the EOF macro is defined by cstdio, so include that as well.

The reason the above prints out integer values and not characters is because the get member returns an int and not a char. The small integer char types are handled differently to other integer types by C++ IOStreams formatted I/O in that they are output as is so the character appears on the console and not its character set encoding value, so for ASCII encoded characters a char of value 65 results in A being displayed and not 65. However for other integer types such as int the value is converted to a sequence of characters representing the number, in base 10 by default, but hexadecimal and octal are options, e.g. to specify hexadecimal number values we could do the following:

       std::cout << std::hex << value << (byte_count%ValuesPerLine?' ':'\n');
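The char versus int behaviour described above can be checked with a short sketch. The helper names show_as_char and show_as_number are invented for this example, and a std::ostringstream stands in for std::cout so the result can be inspected as a string:

```cpp
#include <sstream>
#include <string>

// Formats value the way a char is output: as the character itself.
std::string show_as_char(int value)
{
    std::ostringstream out;
    out << static_cast<char>(value);
    return out.str();
}

// Formats a char the way an int is output: as decimal digits.
// Going via unsigned char first keeps bytes in the range 128 to 255
// from appearing as negative numbers where plain char is signed.
std::string show_as_number(char c)
{
    std::ostringstream out;
    out << static_cast<int>(static_cast<unsigned char>(c));
    return out.str();
}
```

On an ASCII based system show_as_char(65) gives "A" and show_as_number('A') gives "65".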

Unfortunately, IOStreams offers no standard manipulator for output as strings of binary digits (only decimal, hexadecimal and octal are available), so we will need to do some work to output values in binary.
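As an aside, one shortcut that is in the standard library is std::bitset, which does write itself to a stream as 0s and 1s. The sketch below (byte_to_binary is a name made up for this example) uses it for a single byte. The rest of this answer does the work by hand, since that shows how individual bits are actually tested:

```cpp
#include <bitset>
#include <sstream>
#include <string>

// Returns the low 8 bits of value as a string of eight '0' and '1'
// characters, most significant bit first.
std::string byte_to_binary(int value)
{
    std::ostringstream out;
    out << std::bitset<8>(static_cast<unsigned long>(value));
    return out.str();
}
```

The fixed width of 8 is part of the std::bitset type, so this is less flexible than a function taking the number of bits as a parameter.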

We need to be able to determine whether each bit in our value is a 0 or a 1. In addition it might be nice to state how many of the bits we are interested in - especially as we currently have the file's 0 to 255 8-bit byte values dumped in a larger, probably 32-bit, int value. Oh, and which stream to write to would be useful as well. This would give us a function like so:

   void  write_binary( std::ostream & out, T value, unsigned int number_of_bits );

The type T is the type of the integer we wish to use to represent our value. We could make the function a function template to allow various scenarios, but for simplicity here I will just make T an int - the type returned by get:

   void write_binary( std::ostream & out, int value, unsigned int number_of_bits );

The implementation of write_binary should probably take the minimum of number_of_bits and the bit length of an int but I shall ignore this again for brevity and just assume number_of_bits is valid - not a good assumption in real life code by the way.
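For reference, the check being skipped could look something like the sketch below. The helper name clamp_bit_count is hypothetical, and treating a request for 0 bits as 1 bit is an arbitrary choice made here so that the later shift by number_of_bits - 1 is always valid:

```cpp
#include <limits>

// Limits a requested bit count to the range 1 to the number of value
// bits in an int (the sign bit is excluded, so typically 31).
unsigned int clamp_bit_count(unsigned int number_of_bits)
{
    unsigned int const max_bits(
        static_cast<unsigned int>(std::numeric_limits<int>::digits));
    if (number_of_bits < 1)
    {
        return 1;
    }
    if (number_of_bits > max_bits)
    {
        return max_bits;
    }
    return number_of_bits;
}
```

Excluding the sign bit matters because shifting a 1 into the sign bit of an int, and then shifting a negative mask back down, does not behave portably.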

So for each bit of interest if that bit is 1 write '1' else write '0'. We could use a recursive method to do this and repeatedly shift the value down into the 0th bit position and always test that bit. Such recursive approaches are common for outputting digits of a value as it allows us to get the digits displayed in the (Western culturally) correct order - least significant on the right rather than on the left which naturally results otherwise.

However here we can start with a value to test the most significant digit and shift it down testing each significant digit as we go, thus extracting bit value information in the most significant to least significant order required for display.

First we require a mask value initialised to have only the most significant bit of interest set to 1. This is in fact 1 shifted number_of_bits - 1 bits left (consider that 1 shifted 0 bits would allow us to output 1 digit):

   int mask_value(1<<(number_of_bits-1));

Next we use this value to test each bit, starting with the most significant and shifting back down towards the least significant 0th bit. The operation to perform each test is a bitwise AND operation. This performs a Boolean AND operation on each bit of its two operands setting the equivalent bit of the result, result_n, to be operand1_n AND operand2_n. As the AND operation only produces a 1 result if both inputs are also 1 the only bits in the result that will be 1 will be those in which there was a 1 in both of the operands. As our mask operand only has a single bit set to 1 only this bit in the result might be 1, and only if the equivalent bit in the tested value is also 1 - just what we want!

   while (mask_value!=0) // done when the mask 1 bit is shifted off the end!
   {
       if ( value & mask_value )
       {
         out << '1';
       }
       else
       {
         out << '0';
       }
       mask_value >>= 1; // shift mask down one bit
   }

We can of course shorten the above loop by using the conditional ?: operator similar to how I did earlier when deciding whether to output a space or newline:

   while (mask_value!=0) // done when the mask 1 bit is shifted off the end!
   {
      out << (value & mask_value ? '1' : '0');
      mask_value >>= 1; // shift mask down one bit
   }
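To see the masking step in isolation, here is a worked sketch for a single bit. The function name bit_is_set is hypothetical, and bit_index is assumed to be smaller than the number of value bits in an int:

```cpp
// Returns true if the bit at position bit_index (0 is least significant)
// is set in value.
//
// Worked example with value 65, which is 01000001 in binary:
//
//    bit_index 6:  01000001 & 01000000 = 01000000  (non-zero, bit is 1)
//    bit_index 5:  01000001 & 00100000 = 00000000  (zero, bit is 0)
bool bit_is_set(int value, unsigned int bit_index)
{
    int const mask_value(1 << bit_index);
    return (value & mask_value) != 0;
}
```

The loops above do exactly this test once per bit, moving the single 1 in the mask down one position each time round.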

Putting this together we get:

   #include <cstdio>    // for EOF
   #include <fstream>
   #include <iostream>

   // Writes the low number_of_bits bits of value to out as '0' and '1'
   // characters, most significant bit first.
   void write_binary( std::ostream & out, int value, unsigned int number_of_bits )
   {
       int mask_value(1<<(number_of_bits-1));
       while (mask_value!=0) // done when the mask 1 bit is shifted off the end!
       {
         out << (value & mask_value ? '1' : '0');
         mask_value >>= 1; // shift mask down one bit
       }
   }

   int main()
   {
       std::fstream binfile("path/to/the/file", std::ios::in|std::ios::binary);

       if ( binfile.is_open() )
       {
         unsigned int const ValuesPerLine(8);
         unsigned int byte_count(0);
         int value(EOF);
         while ( (value=binfile.get())!=EOF )
         {
           ++byte_count;
           write_binary( std::cout, value, 8);
           std::cout << (byte_count%ValuesPerLine?' ':'\n');
         }
       }
       else
       {
         std::cerr << "Oops! Unable to open file.\n";
       }
   }

You can of course modify this basic scheme to, for example, count the ones in each byte, as you mentioned in your question, by writing a different but similar function to write_binary, e.g.:

   unsigned int count_1_bits( int value, unsigned int number_of_bits);

This keeps a count of, and returns, the number of 1 bits rather than outputting 1s and 0s:

   unsigned int count_1_bits( int value, unsigned int number_of_bits)
   {
       int mask_value(1<<(number_of_bits-1));
       unsigned int ones_count(0);
       while (mask_value!=0) // done when the mask 1 bit is shifted off the end!
       {
         if ( value & mask_value )
         {
           ++ones_count;
         }
         mask_value >>= 1; // shift mask down one bit
       }
       return ones_count;
   }

You then add all the individual file byte 1 counts together (showing just the modified loop in main):

   unsigned long long one_bits_in_file_count(0);
   int value(EOF);
   while ( (value=binfile.get())!=EOF )
   {
       one_bits_in_file_count += count_1_bits(value, 8);
   }

If your compiler does not support unsigned long long then use unsigned long instead.

Note that the above is example code only and not meant to be the only or best way to achieve this sort of thing; as noted, I elided some error checking for brevity. There are clever bit manipulation ways to count the number of 1 bits in a word, see for example this discussion on stackoverflow:

   http://stackoverflow.com/questions/109023/best-algorithm-to-count-the-number-of-

There is also a book called "Hacker's Delight" that covers many such useful techniques (I really must get around to getting hold of a copy and reading it !!!):

   http://www.hackersdelight.org/
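As a taste of such techniques, here is a sketch of one well known trick, often attributed to Brian Kernighan: subtracting 1 from a value and ANDing the result with the original clears the lowest 1 bit, so the loop runs once per 1 bit rather than once per bit position. The function name count_1_bits_fast is made up for this example:

```cpp
// Counts the 1 bits in value by repeatedly clearing the lowest 1 bit.
unsigned int count_1_bits_fast( unsigned int value )
{
    unsigned int ones_count(0);
    while (value != 0)
    {
        value &= value - 1; // e.g. 01101000 & 01100111 = 01100000
        ++ones_count;
    }
    return ones_count;
}
```

The standard library can also do the counting for you: std::bitset has a count member function, and C++20 added std::popcount in the bit header.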

Hope this helps and have fun.


---------- FOLLOW-UP ----------

QUESTION: Hi Ralph,

First of all thank you so much for your most useful answer. You helped me very much, indeed.

I was wondering if I could ask a quick question as a follow-up. What is the best (most elegant way) to automatically tell if a file is a text file or a binary file using C++?

Thank you!
Andres

Answer
Short answer: none.

Longer answer:

As most file systems do not differentiate between files containing just text and those containing non-textual 'binary' data there is no sure way to know if a file could be termed a text file or not, and even if a filesystem had such a facility it would not be accessible as _standard_ from C/C++ - you would presumably have to use vendor-specific API (application programming interface) or API extensions to make use of such a facility. Even then how robust against accidental or malicious misclassification of files would such a system be?

A text file is only so by convention - remember such files still contain binary data, it is just data that can be interpreted as meaningful text in a simple way such as just displaying the contents on a console screen. Consider that many word processing document formats, especially older formats, are binary formats, even though the documents they represent are obviously text. And even textual formats tend to contain a lot of additional markup text around the core text of the document (XML and HTML based document formats for example).

In fact if you try to display a text file containing text encoded in a way other than that expected by the way you display it then you are likely to see rubbish - at least for some characters.

Then there is nothing to stop you opening a supposedly text file as a binary file - i.e. without any end of line translation etc. - in C and C++ and treating the file as raw bytes.

If a text file is only supposed to contain basic 7-bit ASCII encoded characters and no characters with higher values - that is no character in the range 128 to 255 - then you can check to see if the file contains any such characters I suppose and if it does deem it to be a binary file.

However such a technique is not general - many character encodings use the full 8-bits - even if they still only use one byte per character. One such technique is to use a base of 7-bit ASCII plus a specific 'code page' of characters encoded using the higher value bytes (128 - 255). See:

   http://en.wikipedia.org/wiki/Code_page

for more information on code pages.

Note that trying to display text having characters for one code page in an application or device using another may well end up with some rubbish characters being displayed.

Sorry not to be able to give you a nice neat answer - text, characters and character encodings are not a nice neat area and what might look like text to one system may well look like raw binary data to another, although hopefully the World will tend towards Unicode (which matches ASCII's encoding for its characters in the 0 to 127 range) and its various encodings, which may reduce the number of possibilities somewhat. Note however the MS Windows consoles (DOS boxes) do not understand Unicode (UTF-8) encoding and seem stuck on an ASCII plus code page setup :(

I have to go now to an evening appointment so sorry if this reply is a bit rushed...  

Ralph McArdell

Expertise

I am a software developer with more than 15 years C++ experience and over 25 years experience developing a wide variety of applications for Windows NT/2000/XP, UNIX, Linux and other platforms. I can help with basic to advanced C++, C (although I do not write just-C much if at all these days so maybe ask in the C section about purely C matters), software development and many platform specific and system development problems.

Experience

My career started in the mid 1980s working as a batch process operator for the now defunct Inner London Education Authority, working on Prime mini computers. I then moved into the role of Programmer / Analyst, also on the Primes, then into technical support and finally into the micro computing section, using a variety of 16 and 8 bit machines. Following the demise of the ILEA I worked for a small company, now gone, called Hodos. I worked on a part task train simulator using C and the Intel DVI (Digital Video Interactive) - the hardware based predecessor to Indeo. Other projects included a CGI based train simulator (different goals to the first), and various other projects in C and Visual Basic (er, version 1 that is). When Hodos went into receivership I went freelance and finally managed to start working in C++. I initially had contracts working on train simulators (surprise) and multimedia - I worked on many of the Dorling Kindersley CD-ROM titles and wrote the screensaver games for the Wallace and Gromit Cracking Animator CD. My more recent contracts have been more traditionally IT based, working predominately in C++ on MS Windows NT, 2000, XP, Linux and UN*X. These projects have had wide ranging additional skill sets including system analysis and design, databases and SQL in various guises, C#, client server and remoting, cross porting applications between platforms and various client development processes. I have an interest in the development of the C++ core language and libraries and try to keep up with at least some of the papers on the ISO C++ Standard Committee site at http://www.open-std.org/jtc1/sc22/wg21/.

©2016 About.com. All rights reserved.