Engineering Clusters, Revisited

Many months ago we started learning how to design serious clusters, starting with prototyping and benchmarking hardware. The CPU part appeared to be pretty easy -- just get the fastest processor you can afford, with enough memory to run your application. Actually it isn't QUITE that simple, as there are some real differences between processors and computer architectures to choose from (and hence real choices to make) but the idea was simple enough.

After the CPU itself, the second most important determinant of cluster performance is (as we learned in our early experiments with parallel scaling) the network, so we spent several happy months learning All About Networking. We learned about the ISO/OSI model for a network, the more economical TCP/IP model, latency and bandwidth, Ethernet, IP and datagrams, TCP, and a whole slew of networking tools and true facts. Finally, and possibly most important, we learned about three core networking benchmark tools: netpipe, lmbench, and netperf.

With this wealth of knowledge in hand, we are ready to go back and face the issue of engineering a cluster that is adequate to accomplish a given task. Our knowledge will still be a little thin in places. We know enough about Ethernet to be able to predict fairly accurately what we can expect in the way of latency and bandwidth for the various kinds of Ethernet adapters we are likely to have to choose between when designing our cluster, but we don't know (yet) as much about newer, faster, more cluster-specific alternatives such as Infiniband, Myrinet, Serial Channel Interconnect, and Fibre Channel (to name a few).

Fortunately, we now know what the terms that describe them mean, so we can have a serious talk with a salesperson (while convincing them to loan us units to test) and learn the rest of the way, and once we get our hands on the units we know how to benchmark them with objective tools and compare their performance to the networks with which we are more familiar. All that remains is to think about how that information will benefit us in trying to get the most work done for the least money with the cluster we're designing.

The answer to this in general terms should now be easy to understand. Begin by looking over your application(s). Determine (by measuring if possible, computing and estimating from the code if not) how much data each node has to read in via a network connection (from other nodes, servers, remote data stores) or write out via a network connection (to other nodes, servers, or remote data stores) as the computation advances. This task may not be easy. There are many possible structures for a parallel computation, although (as we've discussed in previous columns) they fit into three general categories: unbounded by communications, bounded (bottlenecked) by communications bandwidth, and bounded by communications latency.

Task Organization Examples

Imagine the simplest sort of node task, running on an imaginary 100 node cluster. It requires a remotely issued command to initiate it (on the order of 1K of data from a head-node server). It then runs a long time (say four weeks) and writes a single line of text data into a file on a mounted NFS directory (on the order of 1K of data to a network store). It may do this on every node, independently of all other nodes. This is the archetypal "embarrassingly parallel" task, and is the kind of task that Grid computer clusters are designed to accomplish efficiently.

Observe that the network is "busy" two times in a month, each time for a tiny fraction of a second, compared to literally millions of seconds of computing. You will need more network bandwidth monitoring the progress of your application from time to time than your application itself consumes. Clearly any speed of network is adequate for the task, so choosing the cheapest one (likely 100 BT Ethernet with a stack of inexpensive switches) is cost-beneficial, as it saves money for more, better, faster CPUs (whichever produces the biggest reduction in time per dollar spent) or for other project requirements such as disk, visualization, infrastructure, and management.
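
To make "tiny fraction of a second" concrete, here is a quick back-of-the-envelope calculation in Python (a sketch using only the illustrative numbers from the example above: 1K in, 1K out, four weeks of running):

bytes_moved = 2 * 1024           # roughly 1K in to start the job, 1K out at the end
run_seconds = 4 * 7 * 24 * 3600  # four weeks of computing
wire_Bps = 100e6 / 8             # 100 BT Ethernet peak: 12.5 million bytes per second

comm_seconds = bytes_moved / wire_Bps
print(f"time spent communicating: {comm_seconds * 1e6:.0f} microseconds")
print(f"fraction of the run: {comm_seconds / run_seconds:.1e}")

Even if the real transfers take a thousand times longer than this wire-speed estimate, the network is still idle for all practical purposes.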

Next, consider a job (still running on 100 nodes) that obviously still requires a command initiating it (the same 1K of start-up data). Once it is begun, its workflow is very data-intensive -- in a "run" it must read in a total of 10 GB of data from a central (NFS) store, in blocks of 100 MB at a time. Each of these blocks of data is transformed by the node and the resulting 100 MB of new data is sent on to the next node downstream in a chain of nodes for further processing. Since each node is itself in the chain, it also has to receive 100 MB from its upstream node when it finishes its block, before continuing with the next read from disk.

We can imagine that the processing of each block of 100 MB of data takes about one minute. For every minute of actual computing, each node has to read 100 MB of data from a central store, read 100 MB of data from an upstream node, and write 100 MB of data to a downstream node before it can continue to do the next minute of computing. Now the network not only matters, it can be critical. Let's do the math.

100 BT Ethernet can carry at most 100 million bits per second (Mbps), although nearly all Ethernet adapters and switches today are full duplex and can carry this much traffic each way (in and out) at the same time. Forget the full duplex part for the moment and let's figure out just how long it will take to read or write a 100 MB block from or to another node. If we divide by 8 (turning bits into bytes) we see that the theoretical peak is 12.5 million bytes per second (MBps), and that it will take 8 seconds (minimum) to send 100 MB.
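
The same arithmetic as a short Python sketch (nothing here but the numbers from the paragraph above):

wire_bps = 100e6          # 100 BT wire speed in bits per second
block_bytes = 100e6       # one 100 MB block of data
peak_Bps = wire_bps / 8   # bits to bytes: the theoretical peak throughput

print(f"theoretical peak: {peak_Bps / 1e6:.1f} MBps")
print(f"minimum time to send one block: {block_bytes / peak_Bps:.0f} seconds")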

However, that neglects the headers -- headers count as part of the theoretical peak speed of the physical medium, but all that matters to us is the data itself (the content of the packets, not the wrappers and headers intended to get it there safely, reliably, and possibly even securely). Subtracting away the headers, we find that data is at best 95% or so of a packet.

In fact, we can almost never count on getting even that 95% of the theoretical maximum. The processing of the TCP and IP parts of those headers takes time, as does the physical process of moving the data onto and off of the network adapters from main memory. For many adapters, this data has to pass through the CPU (although more expensive ones may use direct memory access (DMA) to put the data directly into memory without involving the CPU), and further delays will be introduced every time the CPU does ANYTHING else in a multitasking environment (a minimum of over a hundred times a second, even if the system is otherwise idle). Then, the task of writing the data to or reading the data from the network socket may introduce delays of its own, perhaps processing or formatting the data before it goes out or similarly acting on it as it comes in. The two ends of the communication channel may well have a handshake that has to occur as sub-blocks of data are successfully sent.

All of this will often reduce the average throughput of data from 95% of wire speed to 90%, 80% or even worse (although lower numbers than 90% often signal relatively poor networking hardware or code design, or a loaded system that is "busy" when a lot of networking transactions are trying to take place). So instead of taking 8 seconds to move 100 MB of data, we wouldn't be surprised if it took 10 seconds or even longer, on average.

Time-wise, this means that our application computes for 60 seconds and then has a minimum of 10 seconds of communication between nodes PLUS a 100 MB read from the data store to accomplish before it can proceed. Lots of things will make the reality worse than this estimate -- two nodes trying to talk to the same node at once, a node trying to send data to another node at the same time that node is already reading from the data store, difficulties managing full duplex communications. Some of these things might be avoidable by carefully designing the application, but clearly any delay experienced by a node is passed on to all nodes downstream that await the data produced by the delayed node.
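
In the best case, then, each node's time budget per block looks roughly like the following sketch (which assumes the realistic 10 second transfer time also applies to the read from the central store -- an optimistic assumption, as we'll see shortly):

compute_s = 60.0       # one minute of computing per 100 MB block
neighbor_s = 10.0      # realistic time to exchange 100 MB with neighboring nodes
server_read_s = 10.0   # assumed best case for the 100 MB read from the store

total = compute_s + neighbor_s + server_read_s
print(f"fraction of time actually computing: {compute_s / total:.0%}")

Even this optimistic 75% is a significant loss, and the central store is about to make it much worse.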

We still aren't done, though, because while we can imagine trying to arrange the data communication between nodes so that it is done in parallel, all the nodes have to get their 100 MB from the server to proceed, and the server (with a single 100 BT connection shared between all nodes) can only provide that data to one node at a time.

This is the killer task. First of all, reading data from a file server typically reduces bandwidth relative to straight host-to-host communications, because the server is doing many things and because server-to-disk latency and bandwidth come into play. Our 10-second read can become 12 or even 20 seconds, even when things are orderly and quiet.

But things aren't quiet. Every minute we do our computation, write to a node, read from a node, and then wait on the server. The nodes literally fight to get data from the server. Some 100 nodes are all requesting 100 MB of data more or less at the same time. The server is constantly seeking on the disk, a process that adds a major latency hit. It is constantly changing contexts to spit out another string of packets to the hosts waiting on data in its task queue. After the first minute of computation, nearly all of the nodes will be waiting on data from the server. The whole cluster will sit idle, except for the nodes that have just finished their server read or the nodes just downstream of them in the data flow.


For this pattern, 100 Mbps Ethernet just isn't the answer, and we have to look carefully at the pattern and design of our data distribution mechanism as well. The computation would spend far longer communicating than computing, and most nodes would actually sit idle most of the time. It is well worth our while to have fewer nodes, but a faster network and better organized data store.

For example, a higher bandwidth network like 1 Gbps Ethernet would reduce the time required to move 100 MB between nodes to around a second. The bottleneck associated with many nodes accessing the master server can be at least partly alleviated by copying the entire data store to scratch space on the nodes (using a maximally efficient pattern and quite possibly from multiple servers) before beginning the computation. The time for this task cannot be avoided, but the time wasted while the central server thrashes, blocking node computations, can be. These changes might reduce the communications time to something closer to 1-2% of the actual computation time, permitting us to obtain a speedup with whatever number of nodes we can still afford.
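
A quick sketch of the per-block improvement (wire-speed estimates only, using the same 100 MB blocks and one-minute compute time as above; real throughput will be somewhat lower):

block_bytes = 100e6    # 100 MB block sent on to the downstream node
compute_s = 60.0       # one minute of computing per block

for name, wire_bps in [("100 BT", 100e6), ("gigabit Ethernet", 1e9)]:
    transfer_s = block_bytes / (wire_bps / 8)
    print(f"{name}: {transfer_s:.1f} s per block, "
          f"{100 * transfer_s / compute_s:.0f}% of each compute minute")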

As a final example, the computation may involve each node communicating with many other nodes (possibly all the other nodes) in order to advance the computation a step. For each 0.01 second (say) of computation in a step, it may have to exchange a few tens of bytes of data with 100 systems (the other compute nodes plus the master node).

Now the other performance dimension we learned about, network latency, is the rate-limiting factor. We observed last time that TCP latency on a fairly typical 100 BT Ethernet was measured at around 130 microseconds per packet. To send and receive 100 small packets costs at least 200 x 130 = 26,000 microseconds, or around 0.026 seconds total. As always, this is the best possible result -- real performance would almost certainly be worse, unless one could synchronize the computation's communication pattern so precisely that nodes never "collided" (two nodes trying to talk to the same node at the same time).

This situation is very embarrassing. It takes nearly three times longer to communicate for each step than the computational work associated with the step requires on each node. If we use more nodes, things get worse, not better. The computation might well take longer to run on 100 nodes than it would on ten. Here the solution is to use a low-latency network, one with communications latencies smaller than 5 microseconds. Many of these high-speed networks are designed for cluster computing and bypass the TCP stack (a major source of latency delay) altogether. A network that supports DMA and unattended completion of message transmission and reception further reduces the burden on the node CPUs and makes things run faster still. Now 200 x 5 = 1,000 microseconds (or even less), and communications costs only 0.001 seconds total for every 0.01 seconds of computation, suggesting that we will get decent parallel scaling and task speedup on our cluster of 100 nodes.
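
The same comparison as a small Python sketch (the 130 microsecond figure is the TCP latency we measured last column; the 5 microsecond figure is simply an assumed round number for a low-latency interconnect):

compute_s = 0.01      # computation per step
packets = 2 * 100     # 100 sends plus 100 receives of small packets per step

for name, latency_us in [("100 BT TCP", 130), ("low-latency interconnect", 5)]:
    comm_s = packets * latency_us * 1e-6
    print(f"{name}: {comm_s * 1000:.0f} ms of communication "
          f"per {compute_s * 1000:.0f} ms of computation")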

These three cases by no means exhaust the possibilities for the communications needs of a numerical task being parallelized. A task can have many steps, some of which speed up nicely on any network, some of which are bottlenecked by network bandwidth, and some of which are bottlenecked by latency. The task organization can be as important as the hardware in determining where bottlenecks occur, and changing algorithms can sometimes make a bigger difference than changing networks. Untangling all the possibilities is not guaranteed to be easy. Still, these examples are generic enough to allow us to return, at long last, to the question of cost-benefit optimization.

Cost-Benefit is Everything

Recall that back in July we noted that "Cost-Benefit is Everything." Designing a cluster is nearly always about getting the most work done for a more or less fixed budget. At this very moment, I'm going through the process of trying to design an optimal cluster given a (very) fixed budget in a competitive grant proposal. We have to project our computer hardware budgets out for at least three years, and show how what we get will permit the cluster to get the most work done for that fixed budget.

As usual, we'd like to have as much raw compute power as possible, since aggregate speed (in Hz, FLOPS, MIPS, SPECint 2000s) stands out as a relatively simple measure that even non-cluster experts can understand when reviewing a proposal. However, the actual data flow associated with the tasks we expect the cluster to run involves accessing massive amounts of storage and relatively little inter-node communication. It is a "Grid" design, but with large, fast storage requirements and hence a need for high-bandwidth connections between the compute nodes and the head nodes and central store.

To accomplish this, we find ourselves trading off nodes to get more storage and higher-bandwidth networks (so that the path to storage isn't the bottleneck), and investigating smart storage designs so that nodes don't have to contend for a single data channel in order to read from or write to the central store. Our design goal is perfect balance -- the nodes working flat out should generate data that will just fill the storage we have by the time we have the next year's grant money to get more of everything, while in operation the network and disk access speeds are negligible (a 1-2% total effect) bottlenecks on the computation.

Year one is relatively easy to outline in this way; years two and three are much more difficult, as new technologies are likely to emerge over this much time. We will use Moore's Law, which predicts that compute capacity at constant cost will roughly double every twelve to eighteen months, to guesstimate them. This law unfortunately doesn't work so well for networks -- they speed up much more slowly than CPU speed, memory capacity, or disk capacity, for example -- but we can safely assume that they will get cheaper at any given speed over the next three years.
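
The guesstimate itself is simple arithmetic; here is a minimal sketch (the 18 month doubling time is just the pessimistic end of the range above, not a figure from the proposal):

doubling_months = 18.0    # pessimistic end of the 12-18 month range
for year in (1, 2, 3):
    factor = 2 ** (12 * (year - 1) / doubling_months)
    print(f"year {year}: roughly {factor:.1f}x the compute per dollar of year one")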

The above discussion outlines the general process by example. In your cluster design, you may want to get as many nodes as possible. However, if your computation requires a low-latency network and a big, fast disk store, you may find yourself spending anywhere from a minimum of $3,000/TB to a maximum approaching $30,000/TB for disk, and spending as much on the network(s) connecting your nodes together and to the disk store as you do on the nodes themselves.

Suddenly, instead of getting 100 nodes at roughly $3,000 each for $300,000, you are spending $4,500 per node (including the network) and have to fund a $100,000 set of head nodes and a connected disk store. Your 100 nodes have now shrunk to 44. This design, at first glance, isn't as "sexy" to potential grant reviewers. However, the bottom line is that your 44 nodes will get more work done than the 56 nodes you "sold" from your 100-node initial design in order to get the memory, the network, and the storage your task requires to run optimally fast. Now your task will exhibit nearly linear speedup on all 44 nodes (where the 100 default Ethernet-based nodes would be hopelessly bottlenecked), and they have someplace to put all the terabytes of lovely results that they will shortly begin to generate.
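
The node count in that example, worked as a sketch (all of the prices are the illustrative figures from the paragraph above):

budget = 300_000
plain_node = 3_000    # node with commodity Ethernet only
fast_node = 4_500     # node plus its share of the low-latency network
storage = 100_000     # head nodes and the connected disk store

print(f"all-Ethernet design: {budget // plain_node} nodes")
print(f"balanced design: {(budget - storage) // fast_node} nodes plus network and storage")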

Conclusion

This installment concludes the appearance of this column as "Cluster Kickstart". ClusterWorld is now one year old (hooray), and in the new year my column will take on a new name and a new focus, one that permits me to address pretty much whatever I think would be of interest or benefit to people interested in cluster computing. I expect that it will still spend a lot of time teaching basic principles of cluster computing and cluster design, since this is what I "do" on the Beowulf list and as part of my job here at Duke University. However, I can now branch out into new territory, relate anecdotes, and tell funny stories with less need to remain strictly focused on beginning cluster design.

It has been a fabulous year. We've moved from building our first cluster, through an understanding of Amdahl's Law and its generalizations (fundamental to cluster design) and the process of writing some actual parallel applications, to the basic principles that must be mastered to design really advanced clusters -- finding the right infrastructure, prototyping and testing, understanding the network itself (at least in its simplest form), and now putting it all together into a coherent and deliberate process of task analysis and matching of design to data flow.

There are still plenty of things we haven't touched on or have only touched on briefly -- monitoring a cluster, cluster installation (software) tricks and tools, benchmarking, analyzing your task with profiling tools and debuggers, and writing parallel programs with many kinds of basic algorithms. Fortunately, this column does not stand alone, and most of those topics have been examined in some detail elsewhere on ClusterMonkey.

I look forward to covering many of them myself in months and years to come. I hope to see you there!

This article was originally published in ClusterWorld Magazine. It has been updated and formatted for the web. If you want to read more about HPC clusters and Linux, you may wish to visit Linux Magazine.

Robert Brown, Ph.D., has written extensively about Linux clusters. You can find his work and much more on his home page.