Dynamic Parallel Execution: Losing control at the high end
In a past column, I talked about programming large numbers of cluster nodes. By large, I mean somewhere around 10,000. If you have been following along, at the end of the article I had promised to mention some real alternatives to MPI and even suggest some wild ideas. I plan to keep my promise, however, I wanted to take a slight detour this month and develop the solution a bit further. One point of note before we begin. To keep things simple, I will refer to cluster nodes as if they were a single processing unit.
As you may recall, my conclusion was based on the notion that dependence on large numbers of things increase the chance that one of them will fail. I then proposed that it would be cheaper to develop software that can live with failure than try to engineer hardware redundancy. Finally, I concluded that adapting to failure will require dynamic software. As opposed to statically scheduled programs, dynamic software will adapt at run-time. I should also mention that my motivation is to make programming clusters easier by moving closer to my problem and further away from the minutia of message passing. {mosgoogle right}
As stated above, we also have no central monitoring hardware. Let's further assume that we can start a program on any node, and the program can ask for assistance from any other node. The response it can expect is either busy of free. Starting from our first node, our dynamic program may employ hundreds of sub-nodes, each of which could, in theory, employ other nodes (sub-sub-nodes) and so on. Our program at run time is beginning to look like an inverted tree. It may help to visualize each branch point as and loop statement. The outer loop will get spawned first as a series of sub-processes, then the inner loop would get spawned as a sub-sub-process.
This type of approach will also work in heterogeneous environments where "slow" can be defined as everything from a dead node, to a slow node, to a flaky interconnect. A master node, living in isolation, does not know or care about other nodes. All it knows is that a branch (or branches) it sent to another node is not completing and that there are other resources that may be able to help. Now assuming that at some point, the cluster is saturated, then the "master nodes" will have to become workers and do the work that was given to them by their masters. There are many variations that can be worked into this type of "ignorance is bliss" approach, but the general idea is for each node to get their work done as fast as possible.
Wait a minute, you say, there are sure to be many cases that would just load up the cluster with redundant work. A good question, which means we need to address another more thorny issue -- choosing what nodes to use.
We are now facing the question, I need to distribute work on a cluster with no central control or scheduler, how do I pick nodes to help me? This question is not trivial. In our scenario, nodes have to ask other nodes if they are idle. We could easily spend much of our time looking for help instead of performing useful work. In my experience, the best way to archive this goal may surprise you, but first, let's look at some other ways we might handle this situation.
If each node has a list of other nodes in the cluster, then it can simply go down the list and ask for help. Assuming there will be several layers of branches or sub-tasks, we might want to consider sending the list of nodes we know are already used to each sub task, so that the sub tasks can skip checking these nodes (to see if they can be used as workers). A handy optimization. We could go further. What if we added a heartbeat between the master and worker nodes and in this heartbeat we communicated the nodes that we were using to our master and then to the other masters. The programmer mind begins to concoct all kinds of interesting ways to track and distribute state information within our program.
Hold on there cowboy. We are heading in the wrong direction. We don't want to centralize anything. As we start developing clever control algorithms, we also start adding runtime overhead to the program. That is, we want to spend a minimum amount of cycles figuring out how to distribute programs and a maximum amount of cycles doing useful work. Now about the heartbeat idea. Heartbeat is a good idea, but recall that a heartbeat only tells me that my worker node is alive, it tells me nothing about the progress of the program (halting problem, notwithstanding). We will come back to this in a moment.
In the past, I spent considerable time trying to work out an optimal way to answer the find an idle node problem. Each time I thought I had found "the way" my approach got more and more complicated. Finally, at one point, I thought, all the fancy ideas seem to no better than just picking any node. Then it dawned on me, the simplest way to find a free node is to pick a node at random. Pages of code became victims of the delete key due to this simple idea. In my later years, I have learned that randomness is a powerful tool when facing problems that have no easy solution. A dart thrown at the stock listings always seems to better than the best stock broker. (Hint: Place a dart board behind the stock listings, don't have the stock broker hold then up.)
About that heartbeat idea. It may be better to call it a handshake instead of a heartbeat because both the master and worker can benefit. Master nodes can send their workers a periodic handshake. If the node does not reply, the master knows for sure that the node is dead and the nodes work must be redone (on the master or on another worker). On the worker node, if a worker stop getting a reply from the master, then it can assume the master is dead or has finished without the worker, and stop working. The node then become free to do other work.
We can now come back to issue of redundant work. A very valid point. A possible refinement is to assign a limit to the number of worker nodes a master may use. lets call this a Branch Number. If this number is zero for all branches (loops), then the program runs on one node. If it is equal to the number of nodes in the cluster, then the program could flood the cluster with redundant work. There is probably an optimum number for each branch point in the program. How is this number determined? There could be several ways, a compiler could guess at this, but it is hard to know how the program will behave for a given cluster at runtime. Another way is to let the program optimize these for a given cluster. It may take several runs, but the program could optimize itself for a specific cluster. All that would be needed is set of list of Branch Numbers for that particular cluster. Dynamic behavior would also help load balance the cluster. If you think of two or more dynamic programs running on a cluster, then they could easily be competing for resources. The end result is a single binary that adapts at run-time to various cluster hardware, cluster loads, and to node failure. Just what I wanted.
As to issues of interconnect type or processor speed. A self tuning dynamic program will run as best it can within a given hardware environment. This means that the number of nodes a program uses will depend on the cluster internals. The idea is that instead of writing codes to use a fixed number of nodes, the program will find the best number that will work on the cluster at hand.
Actually, I must confess, some of the ideas are not that new. A similar type of problem occurs every day in the networking world. As you know, a network is a shared device. How do arbitrary computers connected to an Ethernet network handle the traffic on the network. Instead of some complicated algorithm or central control device that helps manage traffic (that would by itself generate more traffic), Ethernet uses the power of randomness. If an Ethernet device finds the network busy, it waits for a random amount of time, and then retries. Simple, robust and low overhead.
In the future, I believe cluster computing can work the same way, but there is one more thing that has to happen before we can become the ultimate master of our nodes. Next month, we'll talk about how to express your problem so that it is more amenable to dynamic execution. Warning: Some of you are going to get upset.
Note: This article is the second of three part series. The first and third parts are:
This article was originally published in Linux Magazine. It has been updated and formatted for the web. If you want to read more about HPC clusters and Linux you may wish to visit Linux Magazine.
Douglas Eadline can be found swinging randomly through the trees at here at ClusterMonkey