Want to parallelize your code? Before you dive in, you might want to test the waters.
It can be said be said that writing parallel code is easy. It can also be said that getting your code to run fast and produce correct answers is a bit more difficult. With parallel computing, we don't have the luxury of a compiler optimizing for us, so we have to do the work. In this column we are going to look at some of the fundamentals and hopefully get you thinking about some of the issues that are critical for your success.
Clusters are organized in such a way that there is a set of independent computers connected by a communication network. The difference between clusters and multiprocessor computers (SMP systems) is the memory structure. Multiprocessors contain several CPUs which are connected in someway to memory. They can be connected by a bus or crossbar, but the most important thing is that all processors are connected to all the memory. This configuration is called shared memory. With a cluster, each node has its own memory that is local to cluster node. Other nodes can only access this memory through a communication network. This configuration is called distributed memory and is what makes a programming a cluster different from shared memory computer. Accessing the memory of other nodes results in substantial time delay. And because of this delay, a program that runs well on a multiprocessor will not necessarily run well on a cluster. Of course, each node can be and often is a multiprocessor, but the key to cluster programming is programming distributed memory systems because the limiting factor is the communication speed.
Regardless of processor or interconnect, all clusters have the same common programming model based on distributed memory. The goal of the column is to discuss how one writes efficient programs for clusters. We will not cover issues concerning shared memory systems. This information is available from other sources.
After a cluster is assembled and software installed, it is time to run application programs on it. Of course, the cluster was created to make application programs run fast. When you run your existing sequential programs on your cluster you will find that it usually runs as fast as it will run on one cluster node. You can load all cluster nodes with a copy of the program, so you can exploit many nodes solving many similar problems at one time. In some cases your program may take a long time to run, and you would like to use the full power of your cluster to work on this program. In this case, you need your program to be parallel.
Looking at the Code
In an ordinary sequential program, all statements are executed one after another. In a parallel program, there are many concurrent statements executed at the same time. The programmer should be sure of two things with such a program. First, concurrent statements must be independent each of each other. And second, they also should be have the correct data to process. If you already have a parallel version of your program, and it runs much faster when you run it on cluster, you are lucky. Often programs with concurrency do not always run faster on a cluster (see sidebar). Fortunately, this is not the End of the World (or career), but this means that you have to take a closer look at your program.Sidebar: Concurrent and Parallel |
The terms concurrent and parallel are often used interchangeably. In the context of this column, they are not the same thing. Concurrency is a property of a program or algorithm. If parts of the program can run independently, then these parts are concurrent. If the independent parts are run on separate processors then the program is often called parallel. The distinction is subtle, but very important when real hardware is used. Since the goal is to make your program run faster, we need to ask the question, Does making all the concurrent parts of the program parallel increase execution speed? Unfortunately the answer is maybe, because in some cases running concurrent parts of your program in parallel may actually slow down your program!. From the above discussion we can make the following general statements about cluster programming:
If you remember the above three rules, you will have an easier time navigating in the cluster world. |
There are therefore two situations where programmers often find themselves.
Situation 1:You have an existing sequential program.
Situation 2: You have a parallel program, but it does not run fast enough.
In the first case, you need to make your program run on several nodes, i.e. parallelize your program. After you parallelize your program, you may move to the second case because you expectation of "parallel means faster" may not be true. Or, perhaps before you parallelize your program you would like to investigate the efficiency of running in parallel.
Before You Start Slicing
The first step to understanding efficiency is determining which operations can be done concurrently. For example, a program may have several different independent actions, so they can be executed in parallel. This is called block parallelism. Block parallelism appears when there are sections (blocks) of independent code. For clusters, such parallelism requires large blocks as small blocks or single operations will not be efficient. For example, there can be subroutines, statements containing function calls, or loops. Consider the following C code snippet.s1(x); for(i=0, i<n, ++i) y[i] = y[i+1]; z = s3(z);
In the case where all blocks are independent (concurrent) this can be executed in parallel as follows. Node 1
s1(x);
Node 2
for(i=0, i<n, ++i) y[i] = y[i+1];
Node 3
z = s3(z);
Block parallelism usually arises due to the nature of the problem and can be the most evident type of parallelism. However, unless it is recursive, block parallelism typically is rarely used in parallel programs because of the reasons we will discuss later. The most common kind of parallelism is loop parallelism. Consider the following C program loop.
for(i=0, i<1000, ++i) x[i] = x[i] + 1.;
The loop can be split and executed concurrently as follows.
Processor1
for(i=0, i< 250,++i) x[i] = x[i] + 1.;
Processor 2
for(i=251, i<500, ++i) x[i] = x[i] + 1.;
Processor 3
for(i=501, i<751, ++i) x[i] = x[i] + 1.;
Processor4
for(i=751, i<1000, ++i) x[i] = x[i] + 1.;
Loop parallelism is very popular in parallel programming because it distributes the workload across cluster nodes evenly. In the above example, four processors have the same amount of work. Of course, you cannot just split your program loops on different nodes and run your code. First, nodes will probably require data they do not have. Data produced on a node will probably be used by the program to produce results or data produced on some node will be needed on another nodes. Were this not the case, we would not have a single program, but a collection of independent programs! When data is needed it is necessary to send data from one node to another or from one node to many nodes. In the program, this communication can be implicitly or explicitly described and depends on what tools you use to make your program parallel. Very often data transfer is done as explicit message passing, which is performed by special libraries like PVM (Parallel Virtual Machine) or MPI (Message Passing Interface). You can learn more about MPI in the MPI Monkey column and more about PVM from the Getting Started With Clusters column. Implicit data transfer can be done by using automatic tools which we will discuss at a later time.