Articles

Understanding Amdahl's Law

That Free Lunch You Wanted...

Clustering seems almost too good to be true. If you have work that needs to be done in a hurry, buy ten systems and get done in a tenth of the time. If only it worked with kids and the dishes. Alas, kids and dishes or cluster nodes and tasks, linear speedup on a divvied up task is too good to be true, according to Amdahl's Law, which strictly limits the speedup your cluster can hope to achieve.

In the first two columns we explored parallel speedup on a simple NOW(network of workstations) style cluster using the provided task and taskmaster program. In the last column, we observed that there were some fairly compelling relations between the amount of work that we were doing in parallel on the nodes, the amount of communications overhead associated with starting the jobs up and receiving their results, and the speedup of the job as we split it up on more and more nodes. {mosgoogle right}

In this column we're going to do some exploration of just how all this works. Be warned -- we will figure out Genuine Algebraic Formulas that are simple enough, but not for the faint of heart. Nevertheless, the basic ideas underlying these formulas are critical for any would be cluster engineer to understand if they hope for success. As you should see by now, building a compute cluster can be as simple as looking at a pile of networked workstations running Linux, waving your hands about a bit (chanting a bit and wearing strange robes helps to impress the feeble-minded), and going "Poof! You're a Cluster!" Or it can be a lot more complicated, and these formulas are the reason why.

Before you begin, I personally recommend putting your right brain and left brain into a state of ideal balance by whatever means you find most suitable, be it a few minutes of a soothing mantra or a few swigs of an equally soothing beverage made from malted grains and hops. Feel free to don strange robes if it helps.

Amdahl's Law

Parallel computing is actually not a terribly new idea. Some of the very earliest "computers" to be applied to a serious problem(cryptography in World War II) were heavily parallel, with lots of units trying to break a message at the same time. Big corporate players, such as IBM, have been studying parallel computation for decades.

During the 1960's, a man named Gene Amdahl was director of IBM's Advanced Computing Systems Laboratory (he later left to form his own company). In a 1967 paper entitled "Validity of the Single Processor Approach to Achieving Large-Scale Computing Capabilities", he formulated what has come to be known as Amdahl's Law, one of the fundamental laws of parallel computer/program design.

This law is very simple to understand. The "speed" (S) of a program that does a certain amount of work (W) can be viewed as the work it does divided by the time it takes to complete;

(1) S1  =    W
T1

where the subscript "1"refers to the number of systems running the program. Amdahl observed that many programs could be at least partially parallelized, that is, part of the work that they did could be split up and one on more than one system at a time. The rest of the work, on the other hand, had to be done serially, one instruction after another.

In our taskmaster/task examples of the last two months, the "setup"instructions in taskmaster and the starting and stopping of the timer have to be done serially whether the task is run on one machine or one million. Even if we precalculated all the answers returned from the nodes (reducing the time required to compute them to zero the program will not be reduced to zero time. Let's turn this observation into algebra.

If we let the time that must be spent doing serial work on a single machine be Ts, and the single-machine time spent doing work that could be parallelized be Tp, so that;

(2) T1  =  Ts + Tp
then splitting the parallel part of the task up on N processors can at best reduce the time required to complete the parallelized work by a factor of N. Thus;

(3) TN  =  Ts +   Tp
N

and we obtain the speedup;

(4) SN  =    W
TN
   =    W
(Ts + Tp/N)
  < N W
T1
 )

This last result is Amdahl's Law. It says that linear speedup, where performing a block of work in parallel on N processors speeds it up by a factor of N is an ideal limit that is, in practice, never achieved!

Some features of our taskmaster experiments are now understandable. As we increased the total amount of work being done (the number of random numbers generated and the delay that simulated the amount of "work" being done to generate a random number) without similarly increasing the serial work that had to be done to manage the computation, we made Ts relatively small compared to Tp, getting a very good (nearly linear) speedup as we split Tp up. Eventually, however, Tp/N was smaller than Ts and we stopped speeding up.

Amdahl's Law is very important as it makes it clear that when engineering a cluster the best you can hope for is getting your work done N times as fast on an N CPU (Central Processing Unit) cluster. However, it is too optimistic. On a real cluster, a job can actually run more slowly in parallel than it did on a single CPU!

This situation needs to be thoroughly understood by any would-be cluster engineer. Otherwise one could end up spending a whole lot of money buying and building a cluster only to discover at the end that your work actually takes longer to complete when you use it! And, you would hate that.

Improved Estimates of the Speedup

This section will be a bit sketchy. That's because in a real world, parallel speedup is not necessarily simple. It can vary with algorithm, with network topology, with program organization. Computers themselves have certain intrinsic nonlinearities even in their single CPU operation that can also interact with the speedup process to confound even Amdahl's Law (as we'll explore in the next section). So bear in mind that this is a simplified picture that is good enough to get you kickstarted on serious cluster engineering, but that if you have a very complex problem you're likely to need to learn a lot more.

When we made up times to describe parallelizing our idealized task above, we did not account for the fact that in the parallelized task the program organization might well be different than the in idealized serial task. For example, if we really wrote a Perl script to execute a task and collect the results on a single machine, we wouldn't bother with threads. We'd just run the task serially, which would be faster. Parallelizing it also added the time required to ssh-connect to each node and initiate the the task and the time required to get the results back across the network (instead of their being written directly into machine memory).

Some of the changes we have to make add to the serial time, making a new serial time Ts' > Ts. Other things add a serial time Tsp to each component of the parallelized task (e.g. the time required to make the ssh connection and get back the data over the network), so that the time spent doing the parallel work turns into:

(5) Tp
N
  + Tsp *N

The total is;

(6) T'N  =  (T's + Tp/N + Tsp * N)

Our speedup then becomes;

(7) S'N  =    W
(T's + Tp/N + Tsp * N)

This result is actually a pretty good estimate for many parallelized tasks. Note that this function has a peak as N is increased and that the speedup approaches zero in the limit that N goes to infinity. This trend exhibits the "poisoning" effect of a relatively big Tsp, but is still (remember) a simplified description. For example, if we generated enough random numbers we'd soon discover that Tsp actually contains a piece that scales like the amount of work done (and not just the number of hosts), as those numbers have to get transmitted back to the master over the network and once more than about 1 kilobyte's worth of numbers are sent back (roughly the size of a network packet) the transmission time starts being proportional to the work.

This last bit is why nobody would actually design a cluster to generate random deviates master/slave fashion for use in a computation that required a lot of them. The time required to transmit them back to the master is larger than the time the master would require to generate them locally! This can easily be demonstrated with taskmaster and task, by the way. Simply increase the number of rands to be generated to, say, several million (you'll likely want to run with the quiet flagset) and the delay to 0. Then ramp up the number of hosts on your toy cluster one at a time. Depending on the speed of your network you should start to see the task actually run SLOWER in parallel.

At this point you are pretty close to knowing enough of the fundamentals of parallel computing to hold your own in a discussion over a beer at a geek supercomputer meeting such as Supercomputing. Just nod knowingly when somebody mentions Amdahl's Law, mumble something about it usually not being pessimistic enough to describe real task scaling, and take another swig. And wipe that foam off of your chin. But what if the person you're talking to is a real supercomputing geek, and suddenly puts you on the spot? Suppose they start to ask you about tree distribution schemes, scaling relations, parallel algorithms? Do you have to just blush and confess to being a clustering/parallel computing newbie? Not at all. Hit them with our final topic of the month, which is guaranteed to firmly establish your credentials as being a bit of a real geek yourself.

    Search

    Feedburner

    Login Form

    Share The Bananas


    Creative Commons License
    ©2005-2012 Copyright Seagrove LLC, Some rights reserved. Except where otherwise noted, this site is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 2.5 License. The Cluster Monkey Logo and Monkey Character are Trademarks of Seagrove LLC.