C++: forking for loops
Question
Dear Ralph,

Thank you for taking questions.

I have a simple for loop:
for (i=0; i<10000; i++)
{
for (j=0; j<10000; j++){doSomething(i,j);}
}

I would like to divide the process into 10 parts and run them in parallel on separate processors (cores).

What would be the simplest way to do that? I am completely new to this. I'm interested in an embarrassingly parallel solution. There needs to be no timing at all; the jobs can finish whenever they want, and nothing needs to be coordinated. It would be great to know, however, when the last of the jobs finishes.

I would be thankful for any kind of guidance you can provide.

Sincerely yours,
Andres

Answer
I shall overlook the fact that there would be no point doing anything with the shown code before fixing it so that it builds. Even assuming doSomething is suitably declared, and defined somewhere the compiled code is linked with, it is still bad C++, as i and j do not seem to have been declared/defined. I suggest prefixing their initial use in the for initialisation statements with some integer type or alias name - unsigned or std::size_t for example.
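
That is, something along these lines - a minimal corrected sketch, assuming doSomething is declared before this point:

   for (unsigned i = 0; i < 10000; ++i)
   {
       for (unsigned j = 0; j < 10000; ++j)
       {
           doSomething(i, j);
       }
   }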

Next, I will apologise if the code indentation is messy: the code in my answer started out nicely indented, but posting through AllExperts seems to trash it - leading spaces seem to be ignored beyond a certain amount, even in tags - and no, I am not happy about it at all :(

The simplest way would probably be to use a parallel-for construct.

However, Standard C++ has no such construct - either as part of the core language or as part of the supporting C++ standard library. On the other hand, 3rd party libraries do - such as Intel's Threading Building Blocks (TBB) or Microsoft's Parallel Patterns Library (PPL) - as do language extensions (if that is the right term) like OpenMP.
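
To give a flavour of TBB, which I do not otherwise show below: here is a minimal sketch of what a TBB version of your loops might look like, assuming TBB is installed and linked - tbb::parallel_for chunks the iteration range across threads for you:

   #include <tbb/parallel_for.h>
   #include <cstddef>

   tbb::parallel_for(std::size_t(0), std::size_t(10000), [](std::size_t i)
   {
       // TBB decides how to split the [0, 10000) range into per-thread chunks
       for (std::size_t j = 0; j != 10000; ++j)
       {
           doSomething(i, j);
       }
   });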

The C++11 standard library has support for asynchronous operations which might also be pressed into use, but it does not map quite so directly onto the nested loop structure you show.

Your first task would be to split the current looping construct into the chunks you require. Do not assume that a 1-to-1 mapping of tasks to cores is necessarily the best fit - maybe make this easy to change.

   unsigned const OuterLoopCount(10000);
   unsigned const InnerLoopCount(10000);
   unsigned const NumberOfTasks(10);
   unsigned const OuterTaskCount(OuterLoopCount / NumberOfTasks);

   for (unsigned t = 0; t != NumberOfTasks; ++t)
   {
       unsigned const Begin(t * OuterTaskCount);
       unsigned const End(Begin + OuterTaskCount);
       for (unsigned i = Begin; i != End; ++i)
       {
           for (unsigned j = 0; j < InnerLoopCount; ++j)
           {
               doSomething(i, j);
           }
       }
   }
   
Note the above makes the assumption that OuterLoopCount % NumberOfTasks is 0 - that is, OuterLoopCount is exactly divisible by NumberOfTasks with no remainder. When run, the results should be the same as for your original code, with doSomething called with the same combinations of i and j in the same order, and all still run sequentially.
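
If the counts did not divide evenly, one option would be to round the chunk size up and clamp each chunk to the full range - a minimal sketch (std::min needs the algorithm header; clamping Begin as well guards against empty trailing chunks):

   #include <algorithm> // for std::min

   unsigned const OuterTaskCount((OuterLoopCount + NumberOfTasks - 1) / NumberOfTasks); // round up

   for (unsigned t = 0; t != NumberOfTasks; ++t)
   {
       unsigned const Begin(std::min(t * OuterTaskCount, OuterLoopCount));
       unsigned const End(std::min(Begin + OuterTaskCount, OuterLoopCount));
       // ... i and j loops as before, running i from Begin to End ...
   }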

The next thing to appreciate is that you may not have ultimate control over how many actual threads are used when you iterate over a parallel for loop. Often the implementation will make a decision for you based on the capabilities of the system you are using - and I can see a case for leaving one or two cores off the thread count for other processes to use on standard PC-style systems! In some cases you may be able to specify how many threads should be used.
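
As a data point when choosing a task or thread count, C++11 lets you ask how many hardware threads the implementation thinks are available - a minimal sketch (the value is determined at runtime and may be reported as 0 if the implementation cannot tell, hence the fallback; TasksToUse is my name, not anything standard):

   #include <thread>

   unsigned const HardwareThreads(std::thread::hardware_concurrency());
   unsigned const TasksToUse(HardwareThreads != 0 ? HardwareThreads : 4); // 4 is an arbitrary fallback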

Here are some example transforms of your loops to using various parallel for, or parallel-for-like, constructs. First up, the Microsoft Visual C++ specific PPL library, which uses their Concurrency Runtime. Your code parallelised would look something like so:

   using concurrency::parallel_for;
   parallel_for(std::size_t(0), NumberOfTasks, [](std::size_t t)
   {
       std::size_t const Begin(t * OuterTaskCount);
       std::size_t const End(Begin + OuterTaskCount);
       for (std::size_t i = Begin; i != End; ++i)
       {
           for (std::size_t j = 0; j < InnerLoopCount; ++j)
           {
               doSomething(i, j);
           }
       }
   });

Note: here and elsewhere I assume OuterLoopCount, InnerLoopCount, NumberOfTasks and OuterTaskCount are constants similar to those defined above, except that they are std::size_t constants rather than unsigned - mainly because PPL parallel_for likes std::size_t values!

To use PPL parallel_for you have to be using a version of Microsoft Visual C++ that includes it and include the ppl.h header file. I found that the default linked-with libraries worked fine with Visual C++ 2012 Professional.

You will notice that PPL parallel_for is a function. It takes a start value, an end value (std::size_t(0) and NumberOfTasks in the example) and a function - in this case a C++11 lambda function - that takes the iteration count value as a parameter (to fit in with the task-chunked version of your sequential code I have called this t). The body of the lambda looks exactly like the body of the task t for loop in the task-chunked sequential code.

By default, Visual C++'s PPL and Concurrency Runtime libraries use a stock scheduler policy, but you can replace it with one of your own that specifies various parameters, including how much concurrency (as it were) you want, by setting the MaxConcurrency and MinConcurrency policy values. See the MSDN documentation for further details; here are some (hopefully) relevant places, with a sketch of setting such a policy after the links:

   http://msdn.microsoft.com/en-us/library/hh875062.aspx
   http://msdn.microsoft.com/en-us/library/dd504870.aspx
   http://msdn.microsoft.com/en-us/library/dd492418.aspx
   http://msdn.microsoft.com/en-us/library/dd728073.aspx
   http://msdn.microsoft.com/en-us/library/ff829269.aspx
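
For example, a minimal sketch of replacing the default scheduler policy to bound the concurrency - written from memory, so please check the MSDN pages above for the exact details:

   #include <concrt.h> // Concurrency Runtime scheduler and policy types

   using namespace concurrency;

   // Two key/value pairs: allow between 1 and 4 concurrent threads
   SchedulerPolicy policy(2, MinConcurrency, 1, MaxConcurrency, 4);
   Scheduler::SetDefaultSchedulerPolicy(policy);

   // ... run the parallel_for shown above ...

   Scheduler::ResetDefaultSchedulerPolicy();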

Next up there is OpenMP. This is supported by a wider range of compilers (and languages - Fortran has long been a supported language), but you may have to turn on support for it. For Visual C++ it is a project C/C++ language property (the /openmp command line option); for gcc, g++ and GNU Fortran the -fopenmp option is used, which also enables automatic linking of the OpenMP runtime library. You should include the omp.h header file. The OpenMP parallel-for version of your code might look like so:

   omp_set_num_threads(int(NumberOfTasks));
   #pragma omp parallel
   {
       #pragma omp for
       for (int t = 0; t < int(NumberOfTasks); ++t)
       {
           std::size_t const Begin(t * OuterTaskCount);
           std::size_t const End(Begin + OuterTaskCount);
           for (std::size_t i = Begin; i != End; ++i)
           {
               for (std::size_t j = 0; j < InnerLoopCount; ++j)
               {
                   doSomething(i, j);
               }
           }
       }
   }

You will notice that OpenMP uses a mix of #pragma pre-processor directives, such as #pragma omp parallel, and supporting library functions such as omp_set_num_threads (which, by the way, allows setting exactly the number of threads you wanted!). The section to be parallelised is within a scoped block (that is, between { and }) introduced using the omp parallel pragma, and a standard for loop is marked as a parallel for using the omp for pragma. However, OpenMP insists on signed integral loop variables, and only a subset of loop termination conditions are accepted (!=, common in modern C++ code, is not acceptable for example - hence the int loop variable and < above). You can find more on OpenMP in general at the OpenMP org web site:

   http://openmp.org/

Notice the sections down the left-hand side. Of particular interest is the Compilers section, which lists compilers supporting OpenMP and links to further information on each compiler's specifics.
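
As a quick sanity check, you can ask the OpenMP runtime how many threads a parallel region actually received - a minimal sketch using standard OpenMP library calls:

   #include <omp.h>
   #include <cstdio>

   #pragma omp parallel
   {
       // Only one thread (the master) reports the team size
       #pragma omp master
       std::printf("OpenMP is using %d threads\n", omp_get_num_threads());
   }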

The final example I would like to show uses standard C++11 library functionality in the form of std::async. std::async executes a function - including a lambda function - asynchronously. Firing off a function via std::async returns something called a future, which represents the result of the call - a result that will quite possibly not be ready until some time in the future (hence the name). You can check whether the result is ready, or just wait for it to become available. This applies even if there is no result (i.e. for functions returning void). To emulate a parallel for, the task function is passed to std::async the requisite number of times in a loop, passing in the loop variable each time and storing the returned futures in an array. Finally, each future is waited on to be ready, again using a loop. The resulting code looks like so:

   std::array<std::future<void>, NumberOfTasks> task_results;
   for (std::size_t u = 0; u != NumberOfTasks; ++u)
   {
       task_results[u] = std::async(std::launch::async, [](std::size_t t)
       {
           std::size_t const Begin(t * OuterTaskCount);
           std::size_t const End(Begin + OuterTaskCount);
           for (std::size_t i = Begin; i != End; ++i)
           {
               for (std::size_t j = 0; j < InnerLoopCount; ++j)
               {
                   doSomething(i, j);
               }
           }
       }, u);
   }
   for (auto & result : task_results)
   {
       result.get();
   }

The future and array headers will need to be included.

You will have to look carefully at the { } and ( ) around the call to std::async. As in the PPL example, [](std::size_t t) introduces a lambda function, which is defined in the following lines between { }. The final parameter to std::async follows the closing } of the lambda definition and is the task-spawning loop's variable u. Note also the use of other C++11 features: a std::array - a friendlier wrapper around a built-in array - is used to hold the result futures, and a range-based for loop is used to iterate over the array of futures, waiting for each task to finish via the future's get function, which returns the future's result (or nothing in the case of void returns, as here), waiting if necessary. Here, of course, we are only concerned with the wait-until-ready aspect of the get operation.

One thing to note about the C++11 concurrency support is the lack of any obvious control over things like the number of threads and other such scheduling policy parameters. If I remember correctly, the Microsoft Visual C++ 2012 implementation sits on top of their Concurrency Runtime, so setting the Concurrency Runtime scheduling policy should also affect the C++11 library concurrency features - note: this is Microsoft Visual C++ specific, and I have not tracked down the documentation to support this claim, so please look up the exact details if it is of interest.

You can find out more about std::async and other C++11 features from any good C++11 reference (I am using a post-C++11 draft of the standard document, n3337: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2012/n3337.pdf), but you might prefer C++11 texts such as:

   http://www.stroustrup.com/4th.html
   http://www.cppstdlib.com/
   http://www.cplusplusconcurrencyinaction.com/

There are probably a whole host of C++11 Q&As, examples, articles etc. out there - fire up your search engine of choice and have a look around - which of course applies to the other technologies mentioned and, for that matter, others that did not get a mention.

Hope this helps.
