Cluster Programming: You Can't Always Get What You Want

Published on Tuesday, 19 September 2006 04:27
Written by Douglas Eadline

But it does not stop me from asking

Fifteen years ago I wrote a short article in a now defunct parallel computing magazine (Parallelogram) entitled "How Will You Program 1000 Processors?" Back then it was a good question with no easy answer. Today, it is still a good question, and it still has no easy answer. Except now it seems a bit more urgent as we step into the "multi-core" era. Indeed, when I originally wrote the article, using 1000 processors was a far off, but real, possibility. Today, 1000 processors are a reality for many practitioners of HPC. As dual cores hit the server rooms, effectively doubling processor counts, many more people will be joining the 1000P club very soon.

So let's get adventurous and ask, "How will you program 10,000 processors?" As I realized fifteen years ago, such a question may never really have a complete answer. In the history of computers, no one has ever answered the question to my liking -- even when considering ten processors. Of course there are plenty of methods and ideas like threads, messages, barrier synchronization, etc., but when I have to think more about the computer than about my problem, something is wrong.

Having spent many a night trying to program parallel computers (the most recent incarnation being clusters), I have come up with a list of qualities that I want in a parallel programming language. Since I am wishing for the moon, I may be asking for the impossible, but I believe some of the features I describe below will be necessary before using large numbers of processors becomes feasible for the unwashed masses of potential HPC users.

Failure Is an Option

It is said that the Buddha's last words were "decay is inherent in all complex/component things." And the Buddha was not even a system administrator. Clusters are complex/component things. The bigger the cluster, the more things that can decay. A program that routinely uses over 1000 processors will experience component failures at some point. As a hypothetical example, if you have 1000 cluster nodes, each with an MTBF (Mean Time Between Failures) of 10,000 hours (about 1.1 years), you can expect one node to fail every ten hours. Given that the MTBF is fixed for most computer hardware, using more and more processors for your program ultimately becomes a losing proposition.
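The arithmetic behind that estimate can be sketched in a few lines. This is a back-of-the-envelope model, assuming node failures are independent and the cluster's aggregate failure rate is simply the sum of the per-node rates:

```python
# Back-of-the-envelope cluster failure arithmetic (assumes independent
# node failures, so the cluster fails N times as often as one node).

def expected_time_between_failures(nodes, node_mtbf_hours):
    """Aggregate MTBF for the whole cluster in hours."""
    return node_mtbf_hours / nodes

# 1000 nodes, each with a 10,000-hour (~1.1 year) MTBF:
print(expected_time_between_failures(1000, 10_000))  # -> 10.0 hours
```

Note how quickly this turns against you: doubling the node count halves the time between failures, no matter how reliable each individual node is.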

In the future, I expect clusters to have constant (and expected) failures. Furthermore, the cost to increase the MTBF will probably be prohibitive and adapting to failure will be an easier solution.

I then have to ask, "How the heck do you write a program for hardware you know is going to fail at some point?" The answer is quite simple: the program will have to tolerate hardware failures. In other words, software must become fault tolerant. And, here is the important part, I the programmer should not have to write this into my program.

Dynamic Scalability

One way to make a program fault tolerant is to make it dynamically scalable. That is, it can change the number of processors it is using on the fly. Adding fault tolerance means redoing work, so some mechanism will be needed to dynamically assign processors. Dynamic scalability is, therefore, the next thing I want in my program. The idea is quite simple: I want one program that can run on 10,000 processors as well as on one processor. Of course, large problem sizes may not be feasible on one processor. After all, if a large problem requires 10,000 processors for an hour, it would take one processor 10,000 hours (assuming there was enough memory). I should, however, be able to run a small data set on one processor and then scale the same binary up to the maximal number of processors for that given problem size (and everything in between).

For example, I should be able to develop a program on my laptop, move the same binary over to a sixteen-processor cluster, and run it without any modification. If the cluster is running other programs at the same time and there are only four idle processors, then my program should start using those four. As other processors become available, it should grow, but only to the point where adding more processors does not help performance. At a later point in time, if I want to run my program with a larger problem size on 1000 processors, I should be able to run the same program.
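The "grow only while it helps" point has a classic quantitative illustration: Amdahl's law, which the article does not name but which captures the diminishing returns exactly. Here is a toy sketch; the 5% serial fraction is a hypothetical number, not anything measured:

```python
# Toy Amdahl's-law model of why a dynamically scalable program
# should stop requesting processors past some point.

def speedup(p, serial_fraction):
    """Amdahl's law: speedup on p processors when a fixed fraction
    of the total work is inherently serial."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / p)

# With a (hypothetical) 5% serial fraction, returns diminish fast:
for p in (1, 4, 16, 100, 1000):
    print(p, round(speedup(p, 0.05), 1))
```

With a 5% serial fraction the speedup can never exceed 20, so a well-behaved program would release, or never claim, the processors beyond that knee in the curve.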

No More Standing in Line

Because my program is now dynamically scalable, I assume yours is as well. In this case our programs should be able to co-operate with one another. If we both have a program to run at the same time we should share resources optimally. In many cases, the need to schedule or queue jobs will not be necessary because the programs will manage themselves. My program will constantly negotiate with the other running programs to get the best set of cluster resources. For instance, my program might negotiate to wait while others run, if it can get exclusive access to 100 processors for one hour. I don't care how the programs do it, I just want them to behave this way and I don't want to have to write such behavior into my program.

Additionally, as part of this dynamic scheme there should be no central point of control. Programs should behave independently and not have to rely on a single resource. Indeed, within the programs themselves, subparts should be as autonomous as possible. Centrally managing sixteen processors seems reasonable; managing sixteen hundred, while still having some time left over to do computation, is a real challenge.

And then Some

Finally, I want an expressive language that is free of any artifacts of the underlying hardware. I want to be as close to the application I am coding as possible. Thinking in "my problem space" is where I want to live. Concerning myself with memory management, numbers of processors, and other such details takes me away from my problem domain. In short, that is my wish list: fault tolerant, dynamically scalable, co-operative, and expressive. A simple wish, but a tall order. Realizing that I seldom get what I want, I have set my expectations high so maybe, just maybe, I'll get what I need. How are we going to get to this software nirvana? I thought you would never ask. I have some ideas, but first, let's address a few issues that always seem to come up when I talk about this topic.


What about MPI?

Let me be clear. MPI (Message Passing Interface), and PVM (Parallel Virtual Machine) for that matter, are wonderful ideas. They have allowed me and countless others to use collections of processors to achieve great things. Rest assured, message passing will not be eclipsed by a new "programming" technology any time soon. Indeed, it will in all likelihood be at the core of most parallel applications in the future, because you cannot have parallel computing without communication. As important as MPI is to the HPC world, however, it does represent a barrier to the domain expert. That is, programming in MPI is too much of an investment for the Joe Sixpack programmer. It requires not only code changes; testing and debugging become harder, and major re-writes may be necessary. For those cheering on the sidelines, threads and OpenMP are in the same boat. Sure, the results can be impressive, but complexity is the cost one pays for working at this level.

Even if we manage to produce an army of MPI programmers, there is another more subtle issue that must be addressed. As written, most parallel programs cannot provide a guarantee of efficient execution on every computer. There is no assurance that when I rebuild my MPI/Pthreads/OpenMP program on a different computer it will run optimally. A full discussion of this topic is beyond the scope of this column, but let me just say that each cluster or SMP machine has a unique ratio of computation to communication. This ratio determines efficiency and should be considered when making decisions about parallelization. For some applications, like rendering, this ratio makes little difference; in others it can make a huge difference in performance and determine the way you slice and dice your code. Unfortunately, your slicing and dicing may work well on one system, but there is no guarantee it will work well on all systems.
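To make the ratio argument concrete, here is a minimal sketch of per-step parallel efficiency when computation scales perfectly with processor count but each step pays a fixed communication cost. All of the numbers are hypothetical; the point is only how the same code fares on a fast versus a slow interconnect:

```python
# Sketch: why the computation-to-communication ratio matters.
# Assumes compute time divides evenly across p processors while a
# fixed per-step communication cost does not shrink at all.

def efficiency(compute_time, comm_time, p):
    """Fraction of ideal speedup achieved in one parallel step."""
    return (compute_time / p) / (compute_time / p + comm_time)

# The same decomposition on two (hypothetical) 64-processor machines:
print(round(efficiency(100.0, 0.1, 64), 2))  # fast interconnect
print(round(efficiency(100.0, 2.0, 64), 2))  # slow interconnect
```

The decomposition that runs at better than 90% efficiency on the first machine limps along below 50% on the second, which is exactly why a slicing that works well on one system carries no guarantee for another.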

MPI has often been called the machine code for parallel computers. I would have to agree. It is portable, powerful, and unfortunately, in my opinion, too close to the wires for everyday programming. In my parallel computing utopia, MPI and other such methods are as hidden as register loads are in a bash script.

Abstract Art

Climbing above the MPI layer will not come without a cost. Just as there is a loss of possible performance when going from assembly language to C, there will be a loss of efficiency when programming without explicit messages. The term often used is a "higher abstraction level." The reason high-level languages are so popular is that they provide a high level of abstraction above the hardware. Programmers move closer to their application and farther away from the computer.

In my long forgotten article, I made the case that in the early days of computing there was a huge debate concerning the use of a new language, called FORTRAN, instead of assembly language (machine code). Yes, in those dark early days, there was no Perl or Python and the new FORmula TRANslation language was a breakthrough idea because it abstracted away some of the machine and let non-programmers like scientists easily program formulas. The argument went something like this:

Assembly Code Wonk: "If I use FORTRAN instead of assembly language, I lose quite a bit of performance, so I will stick with loading my registers, thank you."

FORTRAN Wonk: "Yes, but when the new computer comes next year, I will not have to rewrite my program in a new machine code. And, besides, the new FORTRAN II compiler will optimize my code."

Assembly Code Wonk: "Only time will tell us the best solution. By the way, is that new pencil thin neck tie you are wearing with a new white short sleeve shirt?"

Time did tell us what happened. FORTRAN (now written as Fortran) allowed many more people to write code. It also allowed code to spread more quickly as new machines came on line. Suddenly there were, and still are by the way, vast numbers of Fortran programs doing all kinds of useful things.

If we are going to open up parallel computing to the domain experts we need to introduce a new abstraction level in which to express their problem. My wish is that once the problem is described (or declared) in a new language, compilers and run-time agents can deliver the features I described above.

Cliff Hanging

Now that I have you on the edge of your seat, I need to stop for now. Not to worry though, in the second installment of this theme I will begin to clear the path for some real alternatives to MPI and even suggest some wild ideas. And maybe if we start thinking and talking about these issues and ideas, we may find an answer to my question. I am optimistic, after all Buddha also said, "What we think, we become."

Note: This article is the first of a three-part series. The second and third parts are:

This article was originally published in Linux Magazine. It has been updated and formatted for the web. If you want to read more about HPC clusters and Linux you may wish to visit Linux Magazine.

Douglas Eadline is editor of ClusterMonkey and does wear a pencil thin necktie, although there is one in his closet somewhere.
