Slicing and Dicing Your Code

Article Index

Facing Reality

Message passing whether implicit or explicit always consumes time. When messages are frequent or huge they can easily consume more time than can be saved by parallel execution. For this reason, it is important to know how much time is saved by parallel execution and how much time is lost to perform message passing. In the following FORTRAN program

      READ *, X;
      S = 0. 
      DO I = 1, N
        S=S + X(I)
      ENDDO
      PRINT *, S

it is possible to make the concurrent summation execute in parallel for this program. Let's consider the efficiency when executed on a 2-node cluster.

Node 1

      READ *, X
      send X(N/2 + 1: N) to Node 2 
      S1 = 0. 
      DO I = 1, N/2
        S1=S1 + X(I)
      ENDDO	
      receive S2 from Node 2
      S = S1 + S2
      PRINT *, S

Node 2

      receive X(N/2 + 1: N) from Node 1
      S2 = 0.
      DO I = N/2+1, N
      S2=S2 + X(I)
      ENDDO 
      send S2 to Node 1

Note that send and receive are not elements of the language. They just mean that data is sent to (received from) another node. In practice, for most systems the transfer time of one word through the network is larger than the time of an ordinary operation and the transfer of first word will be much longer! With a 1 GHz processor and Gigabit Ethernet (Gigabit Ethernet) connection we can have approximately the following times.

S1=S1 + X(I)                       -   1 ns (nanosecond)
transfer of one element of X(I)  -   5 ns (nanoseconds)

When N = 1,000,000 the time of sequential execution of the DO loop is about 1 ms (millisecond). The time of parallel execution is half of the execution time of the sequential loop or .5 ms. The time to transfer 500 words will be no less than 2.5 ms creating a total execution time of about 3 ms. So, the execution time of the parallel loop is 3 times longer than the sequential one! If we have smaller DO loop, like N=1000 the difference would be even much more dramatic.

All is not lost however, let's consider the following C program example:

s = homer(a) + homer(b);
mp = 4;
for(int m=1; m<=n-1; m++)
{
  s = s + mp*homer(a + h*m);
  if(mp == 4)
    mp = 2;
  else
    mp = 4;
}
simpson = s*h/3;

In this case a 2-node parallel program looks like the following (assume an even number of half iterations).

{mosgoogle right} Node 1

s = homer(a) + homer(b);
mp = 4;
for(int m=1; m<=(n-2)/2; m++)
{
  s1 = s1 + mp*homer(a + h*m);
  if(mp == 4) 
    mp = 2;
  else
    mp = 4;
}
receive s2 from Node 2
s = s1 + s2;
simpson = s*h/3;

Node 2

s2 = 0;
mp = 4;   
for(int m=(n-2)/2+1; m<=n-2; m++)
{
  s2 = s2 + mp*homer(a+h*m);
  if(mp == 4) 
    mp = 2;
   else
    mp = 4;
}
send s2 to Node 1

Here we have only one transfer of one word. Using Gigabit Ethernet this will require 20 to 50 μs (microseconds) depending on the implementation. Let's assume it is 30 us. Furthermore, we will assume the function call homer can be executed in 100 ns, so we can neglect the time for the rest of the loop. In this case, with only 1000 iterations we have a sequential time for the loop equal to 100 us. Using the same process as be did above, the parallel time will be 80 us. This difference is a small improvement, but an improvement none the less. With 1,000,000 iterations, performance of the parallel program is practically doubled, while as we saw in the previous example it will be inefficient with any number of iterations.

These examples show that it is not just sufficient to blindly parallelize programs. It is important to place code on nodes only if it increases performance i.e. concurrent portions of the program need to be scheduled based on the underlying hardware. Consider the above examples if the communication layer were faster. Therefore, to understand how to improve efficiency, it is desirable to know execution times of important program parts, i.e. profiling. In future columns we will discuss ways to determine these numbers, use execution models, develop basic algorithms, learn ways to express parallelism in different languages, examine tools for parallel conversion, and much more.

This article was originally published in ClusterWorld Magazine. It has been updated and formatted for the web. If you want to read more about HPC clusters and Linux, you may wish to visit Linux Magazine.

Pavel Telegin is currently Department Head of Programming Technology and Methods in the Joint Supercomputer Center, Moscow, Russia.

    Search

    Login And Newsletter

    Create an account to access exclusive content, comment on articles, and receive our newsletters.

    Feedburner

    Share The Bananas


    Creative Commons License
    ©2005-2012 Copyright Seagrove LLC, Some rights reserved. Except where otherwise noted, this site is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 2.5 License. The Cluster Monkey Logo and Monkey Character are Trademarks of Seagrove LLC.