The Network Is The Glue

Article Index

It is still useful to know something about both models, including the missing parts, when engineering a cluster. In future columns, we'll get really serious about learning about the network, and look inside some actual packets, because a bunch of computers without a network (even a "network" consisting of a grad student that walks from system to system carrying a floppy disk cannot do work together and is therefore not a cluster by any definition. For now, though, let's focus on a few very important design elements of networking in the context of cluster computing.

Latency, Bandwidth, and Interface.

The cruel truth of cluster design is that for many problems the speed of the computer itself is less important than the speed of the network. If the problem requires a lot of communication per step of computation, communication speed may well be the rate determining component of the overall computation.

If the computation requires sending lots of small packets to many different hosts per computational step, it is likely to be latency bound, that is, bound by the speed with which the computer can set up and send a small packet of information, possibly containing only a single byte of data. If the computation requires sending large packets between many hosts or a few, it is more likely to be bound by the bandwidth of the interface – the amount of data per unit time it can send when the data is encapsulated for maximum transmission rate. A given network can have high bandwidth and poor latency. It is in principle possible for it to have low bandwidth and good latency. The best, and most expensive, networks have high bandwidth and good (low) latency.

What determines the latency and bandwidth of a network (recalling that this is everything between the running executable on one node and that on another node)? In rough order of importance, we start with the link layer itself, including its bus interface to the computer's hardware, puts hard upper bounds on bandwidth and hard lower bounds on latency. A given network card, on a given bus, interconnected with a given medium, simply cannot exceed the physical limitations imposed by the physics and engineering of the overall design. In many cases the link layer is named for its upper-bound bandwidth: 100BT Ethernet can send as much as 100 million bits per second of information, including all link-layer headers and overhead, down UTP wires. Gigabit Ethernet can send (you guessed it) at most 1 billion bits per second.

Sidebar: Sneakernet
There is indeed a network known to networking gurus as "sneakernet". As an exercise, think about how and why this can be considered a network. Not necessarily a particular slow network, by the way – imagine the grad student carrying a 200 GB external hard disk from one computer to another instead of a floppy! [There is also bongo-net – Ed.]

The link layer also comes with a latency minimum. This limit might be determined by the amount of time required to send a minimum-length packet in the link protocol, plus a protocol-mandated pause in between packets, for example. Real world link layer latency will almost certainly be larger, because in nearly all situations the packets will pass through a network switch, and the switch itself adds 5-20 microseconds of latency as it routes the packets here and there. Latency can be much higher on a loaded network, as Ethernet is particularly expensive (inserting mandated dead times) when there is contention for a particular wire or device.

The latency seen by an application will be larger still! This results because both the network and transport layers (needed to route the packets properly and to arrange for reliable delivery of messages) add overhead. Much of this overhead is inside the kernel in the TCP stack, for TCP/IP communications, and is therefore beyond your ability to control or minimize. To give you an idea of the penalty here, pure link layer Ethernet latency on a high end card with direct memory access (DMA) might well be < 20 microseconds including the intermediary switch and the network driver component. By the time you add in the TCP layer, this can increase to 50-60 microseconds even on a quiet line.

Using Netpipe (See our tutorial on Netpipe) one can directly measure network performance (both latency and bandwidth) for TCP stream packets of different size passing between two hosts. For example, on my home network on a relatively quiet and very cheap 100BT Ethernet switch (but with relatively expensive 3Com 3c905 Ethernet cards) I measure 55 microseconds per packet sending packets with a payload of one byte. This is a "data bandwidth" of only 0.0175 megabytes per second!

Netpipe also tells me what the bandwidth is for large messages, ones that are very much limited by wire capacity. On this particular link, I can get just under 90 megabits per second out of 100 theoretical capacity. 90% of wire speed is fairly typical for a well designed Ethernet, although one can easily do much worse if one uses cheap network components! Beware especially the network adapters that come "free" on many motherboards or that are sold for $10 apiece by many vendors. These adapters are often missing buffers and DMA capabilities that make them far slower than a better card. You might not notice it on a PC running Windows and doing nothing but browsing the web and reading email, but in a high performance computing application such a card can seriously cripple a cluster.

In most parallel applications using the network, network performance will be still a bit worse than this. Most people do not write raw socket code to pass messages between application components running on different nodes – they use a message passing library that handles a lot of the bookkeeping and details of reliable message transmission for them such as PVM or MPI. This task adds one more layer of overhead, increasing latency and decreasing bandwidth to the application a tiny bit. In most cases, though, this is well worth it as PVM or MPI will likely do a better job of implementing this essential bookkeeping than you would writing it all yourself inside your application. Netpipe can be linked directly to PVM and MPI to test performance using the library calls for communications instead of raw sockets. It is therefore a very versatile tool. {mosgoogle right}

Conclusions

We aren't finished with the network by any means. This month's column has done little more than scratch the surface of networking as a crucial design component of a serious Beowulf or compute cluster. At this point, though, you should at least understand the network in very general terms (and should understand the meaning and importance of terms like latency and bandwidth and application network interface).

There is so much more to learn, though! We haven't discussed the details of TCP/IP (the contents and layout of the Ethernet, the IP, and the TCP layer headers). Although I've mentioned Netpipe as a useful tool, I haven't told you where to get it or how to use it. And we have yet to discuss the more expensive "super-networks" that are what you need to consider if you are designing a cluster for a fine-grained, synchronous parallel application! These are the networks with application layer latencies of perhaps 1-5 microseconds, and with peak bandwidths higher than a mere gigabit per second. In many cases, they accomplish these excellent numbers by providing you with libraries for e.g. PVM (sometimes) or MPI (nearly always) that don't use TCP/IP at all! They provide their own custom network and transport layer (which are still required if you want reliable delivery of packets) and these libraries directly use hardware features to ensure minimum latency and maximum bandwidth in packet delivery.

We'll come back to networking many times in future columns and see if we cannot make you real experts. A good cluster engineer should know how to use Netpipe to measure real world performance of a network and a tool such as tcpdump to be able to look at raw packets and understand at least approximately what they are seeing. So should a good cluster software engineer – writing an efficient network application requires a deep understanding of the delays introduced by network IPCs, as we saw in our very first few columns in this space.

This article was originally published in ClusterWorld Magazine. It has been updated and formatted for the web. If you want to read more about HPC clusters and Linux, you may wish to visit Linux Magazine.

Robert Brown, Ph.D, is has written extensively about Linux clusters. You can find his work and much more on his home page

    Search

    Login And Newsletter

    Create an account to access exclusive content, comment on articles, and receive our newsletters.

    Feedburner

    Share The Bananas


    Creative Commons License
    ©2005-2012 Copyright Seagrove LLC, Some rights reserved. Except where otherwise noted, this site is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 2.5 License. The Cluster Monkey Logo and Monkey Character are Trademarks of Seagrove LLC.