C++/reading buffer

Question
Dear Ralph,

Thank you for taking questions!

I need to read a binary file and convert it to text (letters and numbers). I know this file uses 4-byte blocks but they alternate with 2-byte blocks later on. Let's assume that the blocks are 4+4+2+4.

How would I translate this into text? I need to read the file in large chunks (not byte by byte for speed considerations). So far I got it working fine with char:

struct rec
  {
       char x;
  };

long buffersize=1000;
char *buffer; buffer = (char*) malloc (sizeof(char)*buffersize); if (buffer == NULL) {fputs ("Memory error1",stderr); exit (1);}

FILE *gF; gF = fopen("testfile.txt","rb"); if (gF==NULL) {fputs("File error",stderr); exit(1);}
long result;
int i;

while(!feof(gF))
{

result=fread(buffer,sizeof(struct rec),buffersize,gF);

i=0; char value;

while(i < result)
{
 value = 0;
 memcpy(&value, buffer, sizeof(struct rec));
 cout << value;
 buffer = buffer + sizeof(struct rec);
 i = i + sizeof(struct rec);
}
}

Thanks so much!

Regards,
Andres

Answer
Sorry Andres but I am unclear what it is you are trying to do.

What do you mean you need to convert the values in the file to 'text'?
What do you mean by 4 byte and 2 byte blocks? Do you mean they represent 4 byte and 2 byte values such as std::int32_t and std::int16_t[1] (or that you wish to interpret them as such)?

For example, do you mean you wish to interpret the 4 and 2 byte blocks as 32-bit and 16-bit integer values respectively and convert the raw values to textual decimal, octal, or hexadecimal representations of those values?

Or something else - for example the 4 and 2 byte blocks are short byte (character?) sequences.
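If the integer interpretation is the one intended, converting a value to decimal, octal or hexadecimal text could be sketched as below. The function name to_text and its base parameter are made up for illustration and are not from the question; this is only one of several ways to do it.

```cpp
#include <cstdint>
#include <sstream>
#include <string>

// Hypothetical helper: formats an unsigned 32-bit value as text in
// base 8, 10 or 16 (any other base value falls back to decimal).
std::string to_text(std::uint32_t value, int base)
{
    std::ostringstream out;
    if (base == 8)
        out << std::oct;
    else if (base == 16)
        out << std::hex;
    else
        out << std::dec;
    out << value;
    return out.str();
}
```

For example, to_text(255, 16) gives "ff" and to_text(255, 8) gives "377".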

If the 4 and 2 byte blocks are integer values then the next question is how are you handling endianness[2]? Any multibyte value can be interpreted as different values depending on the order in which its bytes appear in the associated word, thus the 2 bytes 0xab 0xcd could represent the 16-bit value 0xabcd (big-endian) or the 16-bit value 0xcdab (little-endian) - assuming the bits within the bytes in the file and those in memory agree on their ordering as well - which I think is the case to a high degree of certainty.
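To make the byte-order point concrete, here is a minimal sketch that assembles 16-bit and 32-bit values from raw bytes with an explicitly chosen byte order, so the result does not depend on the machine running the code. The helper names are hypothetical, and which order your file actually uses is something only its format description can tell you.

```cpp
#include <cstdint>

// Hypothetical helpers: build integers from raw bytes in a stated byte order.

// Big-endian: the first byte in the file is the most significant.
std::uint16_t read_u16_be(const unsigned char* p)
{
    return static_cast<std::uint16_t>((p[0] << 8) | p[1]);
}

// Little-endian: the first byte in the file is the least significant.
std::uint16_t read_u16_le(const unsigned char* p)
{
    return static_cast<std::uint16_t>(p[0] | (p[1] << 8));
}

std::uint32_t read_u32_be(const unsigned char* p)
{
    return (static_cast<std::uint32_t>(p[0]) << 24)
         | (static_cast<std::uint32_t>(p[1]) << 16)
         | (static_cast<std::uint32_t>(p[2]) << 8)
         |  static_cast<std::uint32_t>(p[3]);
}

std::uint32_t read_u32_le(const unsigned char* p)
{
    return  static_cast<std::uint32_t>(p[0])
         | (static_cast<std::uint32_t>(p[1]) << 8)
         | (static_cast<std::uint32_t>(p[2]) << 16)
         | (static_cast<std::uint32_t>(p[3]) << 24);
}
```

Given the two bytes 0xab 0xcd, read_u16_be yields 0xabcd and read_u16_le yields 0xcdab.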

Next - buffering. The C file functions using FILE* values - such as fopen, fclose, fread etc. - are already buffered, as are the C++ IOStream file streams. Note also that disks are block devices so they read/write a block (or sector) at a time - values of 512, 1024, 2048 and 4096 bytes are some likely sizes. So you are probably not saving as much time as you think by using your own buffering. In any case do not optimise unless you have to - meaning you have evidence, i.e. profile measurements, that a particular piece of code is a bottleneck. The limiting factor in this case is likely to be disk IO, which - even from a solid state drive - is much, much slower than main memory access, which is in turn a lot slower than cache memory access, which is still slower than the rate at which the CPU cores can consume data!
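If you do still want to read in large chunks, a minimal sketch might look like the following. It drives the loop from the count returned by fread rather than from feof, and it never moves the buffer pointer, so every read starts at the beginning of the buffer. The function name and its return convention are assumptions made for illustration; it only counts bytes, where real code would process them.

```cpp
#include <cstddef>
#include <cstdio>
#include <vector>

// Hypothetical helper: reads a whole file in pieces of chunk_size bytes and
// returns the total number of bytes read, or -1 if the file cannot be opened.
long long count_bytes_in_chunks(const char* file_name, std::size_t chunk_size)
{
    std::FILE* fin = std::fopen(file_name, "rb");
    if (fin == nullptr)
        return -1;

    std::vector<unsigned char> chunk(chunk_size);
    long long total = 0;
    std::size_t got = 0;
    // fread returns the number of items actually read;
    // 0 means end of file or a read error.
    while ((got = std::fread(chunk.data(), 1, chunk.size(), fin)) > 0)
    {
        // chunk[0] .. chunk[got - 1] hold valid data at this point
        total += static_cast<long long>(got);
    }
    std::fclose(fin);
    return total;
}
```

Only the first got bytes of the chunk are valid after each read, which matters for the final, usually shorter, chunk.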

I also note that the code you present is nearly 100% C code - with the exception of a cout << value; (possibly badly formed unless there is a using namespace std or using std::cout somewhere not shown, along with the headers and function declaration etc.). I do not generally answer questions on C code.

However in this case the raw IO is very similar between C and C++ once you get down to it.

Let's start by creating a suitable looking struct that contains 4 byte, 4 byte, 2 byte and 4 byte members in that order. Note: as I am a C++ expert answering C++ questions in the C++ AllExperts area I am assuming C++ - in fact C++11 or later - and will elide some error handling to try to keep things brief! All code is for exposition only and certainly not intended to be of production quality.

   struct rec
   {
       std::uint32_t block_1;
       std::uint32_t block_2;
       std::uint16_t block_3;
       std::uint32_t block_4;
   };

The problem here is that the compiler will likely add some padding - most likely between members block_3 and block_4 - so that the size is 16 bytes (4+4+4+4) rather than the expected 14 bytes (4+4+2+4). You can get around this using compiler-specific #pragmas such as #pragma pack[3,4]. Note that since C++11 there is standard support for alignment specifications using alignas[5], which might have allowed a standard way to do the same thing if it weren't for the fact that the alignment is set as requested "except if it would weaken the alignment the type would have had without this alignas" (from [5]).

There are other problems with messing with the padding and alignment of data. On many systems data not properly aligned (unaligned data) takes more effort to manipulate and thus incurs a time penalty. On some systems unaligned data causes a hardware fault (trap, hardware exception etc.). This is probably the reason for the restriction on the behaviour of the standard C++11 and later alignas specifier.

The following program can be used to see the effects of no alignment or packing specification and various specifications:

#include <iostream>
#include <cstdint>

struct rec_padded
{
   std::uint32_t block_1;
   std::uint32_t block_2;
   std::uint16_t block_3;
   std::uint32_t block_4;
};

struct alignas(2) rec_alignas_2
{
   std::uint32_t block_1;
   std::uint32_t block_2;
   std::uint16_t block_3;
   std::uint32_t block_4;
};

struct rec_block_4_alignas_2
{
   std::uint32_t block_1;
   std::uint32_t block_2;
   std::uint16_t block_3;
   alignas(2) std::uint32_t block_4;
};

#pragma pack(2)
struct rec_pack_2
{
   std::uint32_t block_1;
   std::uint32_t block_2;
   std::uint16_t block_3;
   std::uint32_t block_4;
};

int main()
{
   std::cout << "Size of          rec_padded: "
         << sizeof(rec_padded) <<'\n';
   std::cout << "Size of         rec_alignas_2: "
         << sizeof(rec_alignas_2) <<'\n';
   std::cout << "Size of rec_block_4_alignas_2: "
         << sizeof(rec_block_4_alignas_2) <<'\n';
   std::cout << "Size of          rec_pack_2: "
         << sizeof(rec_pack_2) <<'\n';
}

I built the above code with GNU g++ 4.9.2 using:

g++ -Wall -Wextra -pedantic --std=c++11 -o scratch scratch.cpp

Where the name of the source file is scratch.cpp and the resultant executable is scratch. It produces output like so:

Size of          rec_padded: 16
Size of         rec_alignas_2: 16
Size of rec_block_4_alignas_2: 16
Size of          rec_pack_2: 14

(apologies for crappy indentation and column alignment - AllExperts insists on using proportionally spaced fonts - I have never persuaded it to allow fixed-spaced fonts for code etc. :( )
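To see where the padding actually sits, rather than just the total size, offsetof can be used. The sketch below assumes a compiler that supports #pragma pack with push and pop (GCC, Clang and MSVC do) and a platform where std::uint32_t is normally 4-byte aligned. The two function names are made up for illustration.

```cpp
#include <cstddef>
#include <cstdint>

struct rec_padded
{
    std::uint32_t block_1;
    std::uint32_t block_2;
    std::uint16_t block_3;
    std::uint32_t block_4;
};

// push/pop restores the previous packing afterwards, so declarations
// that follow are not affected by the pack(2) setting.
#pragma pack(push, 2)
struct rec_pack_2
{
    std::uint32_t block_1;
    std::uint32_t block_2;
    std::uint16_t block_3;
    std::uint32_t block_4;
};
#pragma pack(pop)

// Fails to compile if the packing request was not honoured.
static_assert(sizeof(rec_pack_2) == 14, "rec_pack_2 is not packed to 14 bytes");

// Hypothetical helpers reporting where block_4 starts in each layout.
std::size_t padded_block_4_offset() { return offsetof(rec_padded, block_4); }
std::size_t packed_block_4_offset() { return offsetof(rec_pack_2, block_4); }
```

On a platform like the one used for the output above I would expect block_4 to start at offset 12 in rec_padded (two bytes of padding after block_3) and at offset 10 in rec_pack_2.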

OK so if you can live with using a non-standard feature such as #pragma pack(2) and your platform does not keel over at the first whiff of unaligned data then you can just pass the address of a struct similar to rec_pack_2 to fread (note: I am re-using some of the structs I defined in the previously shown program code):

// requires <cstdio>, <cstddef> and <iostream>
void display_first_record_from_file(char const * fName)
{
   std::FILE * fin{std::fopen(fName, "rb")};
   if (fin==nullptr)
   {
       std::cerr << "Unable to open file '" << fName << "' for reading.\n";
       return;
   }
   rec_pack_2 record{0,0,0,0};
   const std::size_t NumberOfRecordsToRead{1};
   // fread returns the number of complete records read
   std::size_t
       records_read{std::fread(&record,sizeof record,NumberOfRecordsToRead,fin)};
   if (records_read==NumberOfRecordsToRead)
   {
       std::cout << "{\n  " << std::hex
         << record.block_1 << "\n  "
         << record.block_2 << "\n  "
         << record.block_3 << "\n  "
         << record.block_4 << "\n}\n";
   }
   std::fclose(fin);
}

Or using C++ IOStreams:

void display_first_record_from_file(char const * fName)
{
   std::ifstream fin{fName, std::ios::binary}; // binary mode: no newline translation
   if (!fin)
   {
       std::cerr << "Unable to open file '" << fName << "' for reading.\n";
       return;
   }
   rec_pack_2 record{0,0,0,0};
   if (fin.read(reinterpret_cast<char*>(&record),sizeof record))
   {
       std::cout << "{\n  " << std::hex
         << record.block_1 << "\n  "
         << record.block_2 << "\n  "
         << record.block_3 << "\n  "
         << record.block_4 << "\n}\n";
   }
// the fin object closes the file when destroyed on going out of scope
}

If you do not wish to use #pragma pack(2) or similar then you will have to read each block member separately, which might look like so if using C++ IOStreams:

void display_first_record_from_file(char const * fName)
{
   std::ifstream fin{fName, std::ios::binary}; // binary mode: no newline translation
   if (!fin)
   {
       std::cerr << "Unable to open file '" << fName << "' for reading.\n";
       return;
   }

   rec_padded record{0,0,0,0}; // ### note: record WITH PADDING!
   fin.read(reinterpret_cast<char*>(&record.block_1), sizeof record.block_1);
   fin.read(reinterpret_cast<char*>(&record.block_2), sizeof record.block_2);
   fin.read(reinterpret_cast<char*>(&record.block_3), sizeof record.block_3);
   fin.read(reinterpret_cast<char*>(&record.block_4), sizeof record.block_4);

   if (fin)
   {
       std::cout << "{\n  " << std::hex
         << record.block_1 << "\n  "
         << record.block_2 << "\n  "
         << record.block_3 << "\n  "
         << record.block_4 << "\n}\n";
   }
}
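Since you want to read in large chunks, another option that avoids both #pragma pack and unaligned accesses is to read raw bytes into a buffer and copy each field out with memcpy. The following is only a sketch: the helper names and the one-line-per-record hex format are assumptions of mine, and the fields are copied in the machine's native byte order, so any endianness conversion is still left out.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <sstream>
#include <string>

struct rec_padded
{
    std::uint32_t block_1;
    std::uint32_t block_2;
    std::uint16_t block_3;
    std::uint32_t block_4;
};

constexpr std::size_t record_size_in_file = 14; // 4 + 4 + 2 + 4, no padding

// Hypothetical helper: copies one 14-byte file record into a padded struct,
// field by field, using the machine's native byte order.
rec_padded decode_record(const unsigned char* p)
{
    rec_padded r{0,0,0,0};
    std::memcpy(&r.block_1, p,      sizeof r.block_1);
    std::memcpy(&r.block_2, p + 4,  sizeof r.block_2);
    std::memcpy(&r.block_3, p + 8,  sizeof r.block_3);
    std::memcpy(&r.block_4, p + 10, sizeof r.block_4);
    return r;
}

// Hypothetical helper: formats every complete record in the buffer as one
// line of hexadecimal text; trailing bytes of a partial record are ignored.
std::string records_to_text(const unsigned char* buffer, std::size_t byte_count)
{
    std::ostringstream out;
    out << std::hex;
    for (std::size_t pos = 0; pos + record_size_in_file <= byte_count;
         pos += record_size_in_file)
    {
        const rec_padded r = decode_record(buffer + pos);
        out << r.block_1 << ' ' << r.block_2 << ' '
            << r.block_3 << ' ' << r.block_4 << '\n';
    }
    return out.str();
}
```

If the buffer is filled from the file a chunk at a time, choosing a chunk size that is a multiple of 14 bytes keeps records from straddling two chunks.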

I shall leave it as an exercise to modify the example functions to your requirements. I have not bothered exploring endianness problems in this answer as it is quite long enough already.

In any case I hope this gives you some pointers.

References:
-----------
[1] See header cstdint, for example, http://en.cppreference.com/w/cpp/header/cstdint
[2] See for example https://en.wikipedia.org/wiki/Endianness
[3] https://msdn.microsoft.com/en-us/library/2e70t5y1.aspx
[4] https://gcc.gnu.org/onlinedocs/gcc/Structure-Packing-Pragmas.html
[5] http://en.cppreference.com/w/cpp/language/alignas

Ralph McArdell

Expertise

I am a software developer with more than 15 years C++ experience and over 25 years experience developing a wide variety of applications for Windows NT/2000/XP, UNIX, Linux and other platforms. I can help with basic to advanced C++, C (although I do not write just-C much if at all these days so maybe ask in the C section about purely C matters), software development and many platform specific and system development problems.

Experience

My career started in the mid 1980s working as a batch process operator for the now defunct Inner London Education Authority, working on Prime mini computers. I then moved into the role of Programmer / Analyst, also on the Primes, then into technical support and finally into the micro computing section, using a variety of 16 and 8 bit machines. Following the demise of the ILEA I worked for a small company, now gone, called Hodos. I worked on a part task train simulator using C and the Intel DVI (Digital Video Interactive) - the hardware based predecessor to Indeo. Other projects included a CGI based train simulator (different goals to the first), and various other projects in C and Visual Basic (er, version 1 that is). When Hodos went into receivership I went freelance and finally managed to start working in C++. I initially had contracts working on train simulators (surprise) and multimedia - I worked on many of the Dorling Kindersley CD-ROM titles and wrote the screensaver games for the Wallace and Gromit Cracking Animator CD. My more recent contracts have been more traditionally IT based, working predominately in C++ on MS Windows NT, 2000, XP, Linux and UN*X. These projects have had wide ranging additional skill sets including system analysis and design, databases and SQL in various guises, C#, client server and remoting, cross porting applications between platforms and various client development processes. I have an interest in the development of the C++ core language and libraries and try to keep up with at least some of the papers on the ISO C++ Standard Committee site at http://www.open-std.org/jtc1/sc22/wg21/.

©2016 About.com. All rights reserved.