
Tools of the cluster trade, don't leave home without them

In past articles, we looked at basic Linux networking. At this point you should have a pretty good idea of how the basic network, common to nearly every modern computer system (TCP/IP over Ethernet), is structured at the packet level. We have learned about Ethernet packets and their header, encapsulating IP packets with their header, encapsulating TCP packets with their header. This process of encapsulation isn't quite finished -- there is nothing to prevent anyone from adding YAH (yet another header) inside the TCP payload to further encapsulate the actual data, but since a lot of network communications carry the data inside the TCP layer without further headers or encapsulation, we'll quit at this point and move on to the next burning question.

Given this marvelous understanding of how the network is supposed to be working, the burning question du jour is: How well is it working?

Like all things associated with the network, this question is not terribly easy to answer. The network is complicated (as we have seen), with lots of moving parts and many layers (each with its own header) where things can be misconfigured or go wrong. Also, even when it is working "correctly", it can be working poorly, taking a relatively long time to send and receive messages.

In this column we'll examine and review some of the many tools available to verify function and to measure performance of a Linux (or other Unix) based network. The latter is especially important for would-be cluster engineers, as network performance is often a critical design parameter for clusters intended to do parallel computations involving some degree of interprocessor communications (IPCs).

Here is a list of useful tools. We won't examine all of these in detail as we continue to assume a working knowledge of Linux/Unix administration, but since many of these commands are "fronted" these days by graphical interfaces that hide what is actually being done, one can set up a Linux system successfully without ever using many of them. If you fall into this category, I'd suggest reading the man pages associated with the commands and some of the HOWTO documents indicated below.

#ping -i 0 uriel
PING uriel.rgb.private.net (192.168.1.2) 56(84) bytes of data.
64 bytes from uriel.rgb.private.net (192.168.1.2): icmp_seq=1 ttl=64 time=0.139 ms
64 bytes from uriel.rgb.private.net (192.168.1.2): icmp_seq=2 ttl=64 time=0.123 ms
64 bytes from uriel.rgb.private.net (192.168.1.2): icmp_seq=3 ttl=64 time=0.121 ms
Figure One: Ping times example

lmbench

To get lmbench you need to visit the BitMover web site and work your way down through their software projects links. This method gives BitMover an opportunity to convince you that their source control software is the best in the world and worth buying. They accomplish this goal, in part, by requiring you to use their bitkeeper product to download (clone) their open source projects, including lmbench (which is GPL, with a couple of very reasonable restrictions concerning modification).

The download/build procedure is complicated enough, and the tool powerful enough, to warrant a column all its own in the future. For the moment, then, we'll give only the executive summary. After downloading and unpacking bitkeeper and using it to clone lmbench's LMbench3 repository, one uses bitkeeper to "get" the sources out of the repository SCCS directories. On my system a straightforward "make" was all that was then required, although I'm sure YMMV (Your Mileage May Vary).

I then ran the "bw_tcp" (TCP bandwidth) benchmark on two hosts (ganesh and g01). On g01 (remote node) I executed:

$bw_tcp -s 

(which starts a "server" -- a daemon -- that listens for a connection and then services it) and on ganesh (host node) I executed:

 
$bw_tcp g01 
0.065536 11.67 MB/sec 

Very simple. Note that the bandwidth between these two hosts is 93.4 Mbps out of the 100 available, or about 93% efficiency. This is a fairly typical data throughput for a good TCP connection on 100BT Ethernet with an MTU of 1500, where the theoretical maximum is 1436 bytes of data (the Ethernet frame less the TCP/IP headers) out of every 1518 bytes on the wire, or 94.6 Mbps on a 100 Mbps link. Given that there are also mandatory pauses between frames, this result is very close to the theoretical maximum indeed.
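
As a quick sanity check on that arithmetic, here is a back-of-the-envelope sketch in Python, using the 1436/1518 byte figures quoted above rather than anything measured:

payload, frame, wire = 1436, 1518, 100e6    # data bytes, wire bytes, bits/sec
print(payload / frame * wire / 1e6)         # ~94.6 Mbps theoretical maximum
print(11.67 * 8)                            # ~93.4 Mbps, the bw_tcp result in Mbps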

Similarly, lat_tcp can be used to measure the latency. After running lat_tcp -s on g01, the latency can be measured by running the following on ganesh:

$lat_tcp g01
TCP latency using g01: 145.8439 microseconds

This number is a bit higher than it might be, probably because there are two switch hops between these two hosts. We can test this hypothesis by running the same measurement between two hosts that share a switch:

$lat_tcp g01   
TCP latency using g01: 90.9072 microseconds

which is around 55 microseconds faster, as it should be.

Both benchmarks permit variation of message size and repetition in order to obtain a statistical picture of network performance at various scales. Graphing performance as a function of message size is often very revealing.
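
To make the idea concrete, here is a minimal sketch in Python of what such a message-size sweep looks like: a trivial receiver that drains a TCP connection, and a sender that times a fixed volume of data at several message sizes. This is emphatically not lmbench or netpipe code, and the port number, data volume, and list of sizes are arbitrary assumptions; substitute your own. Run it with -r on the remote host and with the remote hostname on the local host, then graph the printed rates against message size.

import socket, sys, time

PORT = 5678                # assumed free port; pick your own
TOTAL = 2 * 1024 * 1024    # bytes to move at each size; adjust to taste

def receiver():
    # Run on the remote host: accept connections and drain them.
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind(("", PORT))
    srv.listen(1)
    while True:
        conn, _ = srv.accept()
        while conn.recv(65536):        # read until the sender shuts down
            pass
        conn.close()

def sender(host):
    # Run on the local host: time TOTAL bytes at each message size.
    for size in (1, 64, 256, 1024, 1436, 4096, 16384, 65536):
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        s.connect((host, PORT))
        buf = b"x" * size
        sent = 0
        t0 = time.time()
        while sent < TOTAL:
            s.sendall(buf)
            sent += size
        s.shutdown(socket.SHUT_WR)     # signal end of data ...
        s.recv(1)                      # ... and wait for the receiver to close
        elapsed = time.time() - t0
        print("%6d byte messages: %8.2f Mbps" % (size, sent * 8 / elapsed / 1e6))
        s.close()

if __name__ == "__main__":
    receiver() if sys.argv[1] == "-r" else sender(sys.argv[1])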

netpipe

As noted, netpipe has been covered in a recent issue so we will only briefly review it here for purposes of comparison. After downloading its source (see Resource Sidebar) and building it, one similarly runs a "receiver" (daemon) on the remote host:

$NPtcp -r

Then the benchmark itself is run on the host to measure times and rates connecting to the remote host. An example of netpipe is shown in Figure Two.

$NPtcp -t -h g01 -P
Latency: 0.000079
Now starting main loop
  0:         1 bytes 3179 times -->    0.11 Mbps in 0.000072 sec
  1:         2 bytes 3461 times -->    0.21 Mbps in 0.000073 sec
...
Figure Two: Example netpipe results

The most interesting thing to note is that netpipe and lmbench get very different answers for the single-packet latency: 145 microseconds versus 79 microseconds. They differ elsewhere as well. This result leads to a number of very natural questions (such as: which one is correct?) that we will defer to a future column. For the moment, let us accept each as valid (sort of) for comparing systems, while recognizing that neither is necessarily an accurate measurement of anything but the particular kind of code used for the test in each case.
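
For the curious, here is a minimal sketch in Python of a one-byte TCP request/response latency test in the spirit of lat_tcp and netperf's TCP_RR. It is not the code either benchmark actually uses -- which is precisely the point: details such as the number of warm-up round trips, whether TCP_NODELAY is set, and how the timing loop is written are exactly the sort of thing that makes different tools report different "latencies". The port and repetition counts below are assumptions.

import socket, sys, time

PORT = 5679      # assumed free port; pick your own
REPS = 10000     # timed round trips

def server():
    # Run on the remote host: echo single bytes back to one client, then exit.
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind(("", PORT))
    srv.listen(1)
    conn, _ = srv.accept()
    conn.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    b = conn.recv(1)
    while b:
        conn.sendall(b)            # bounce the byte straight back
        b = conn.recv(1)

def client(host):
    # Run on the local host: time REPS one-byte request/response round trips.
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.connect((host, PORT))
    s.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    for _ in range(100):           # a few untimed warm-up round trips
        s.sendall(b"x")
        s.recv(1)
    t0 = time.time()
    for _ in range(REPS):
        s.sendall(b"x")
        s.recv(1)
    elapsed = time.time() - t0
    print("round-trip latency: %.1f microseconds" % (elapsed / REPS * 1e6))

if __name__ == "__main__":
    server() if sys.argv[1] == "-s" else client(sys.argv[1])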

netperf

netperf is the third of the network benchmarking tools reviewed in this column. It once was my favorite, and now that it is being loved by its owner again it may become my favorite once more. To use netperf one proceeds in much the same way as for the previous two tools: download, build, start a server/daemon, point a client tool at it with suitable arguments, and wait for the numbers to roll on in. It remains only to fill in the details.

Retrieve netperf 2.3 from www.netperf.org, following links to the download page and reading as you go. Unpack the tarball and change to the netperf-2.3 (e.g.) directory. Edit the makefile -- minimally you'll want to change the default netperf directory and you must remove the -DNEED_MAKEFILE_EDIT definition from CFLAGS or it will complain. On my own Linux systems, it then just plain builds.

Once built, you'll note two binaries (netserver and netperf) and a bunch of scripts. The scripts are great for generating a whole suite of tests at once, and they also serve as good examples of some of the many command line arguments netperf takes.

Visit your remote system and start up a netserver (daemon) on its default port or some other port of your choosing:

$netserver
Starting netserver at port 12865

This daemon will typically run until killed (like the lmbench daemon, unlike the netpipe daemon). Multiple tests can be run to the target host once the daemon is running.

Then return to your source host and run netperf as shown in Figure Three.

$netperf -l 60 -H g01 -t TCP_STREAM  -- -m 1436 -s 57344 -S 57344
TCP STREAM TEST to g01
Recv   Send    Send                          
Socket Socket  Message  Elapsed              
Size   Size    Size     Time     Throughput  
bytes  bytes   bytes    secs.    10^6bits/sec  

114688 114688   1436    60.01      93.60   
Figure Three: Example netperf results

This particular invocation says to test for 60 seconds continuously, sending a TCP stream of data to the target g01 with a message size of 1436 (recall, the largest message that will fit in an MTU) using a large send and receive buffer on both ends. Many other options are possible -- netperf is a powerful command with many distinct ways of running.
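
Incidentally, the 114688-byte socket sizes reported in Figure Three are not a typo: on Linux the kernel doubles the buffer size requested via SO_SNDBUF/SO_RCVBUF to leave room for its own bookkeeping, and reports the doubled value. A couple of lines of Python (a sketch, not netperf code) will confirm this on a Linux box:

import socket
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, 57344)
print(s.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF))   # prints 114688 on Linux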

Note the excellent agreement between netperf and lmbench for maximum bandwidth. Observe, however, what happens when we use a message size of 1 byte (the minimum). In this limit, the "bandwidth" is entirely dominated by the minimum packet latency as shown in Figure Four.


$netperf -l 60 -H g01 -t TCP_STREAM -- -m 1 -s 57344 -S 57344
TCP STREAM TEST to g01
Recv   Send    Send                          
Socket Socket  Message  Elapsed              
Size   Size    Size     Time     Throughput  
bytes  bytes   bytes    secs.    10^6bits/sec  

114688 114688      1    60.00       1.30   
Figure Four: Single byte netperf results

1.3 Mbps is obviously much, much lower than the 93 or so Mbps that the interface can manage for large packets. Still, it is more optimistic than one might have expected from the latency measurements above because TCP aggregates packets where possible, reducing apparent latency. netperf also permits one to more or less "directly" measure latency via a request-response (ping-like) operation shown in Figure Five.
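A rough calculation (a sketch using the numbers from Figure Four and the lat_tcp result above) shows just how much aggregation is going on:

msgs_per_sec = 1.30e6 / 8    # one-byte messages delivered per second at 1.30 Mbps
print(1e6 / msgs_per_sec)    # ~6.2 microseconds of apparent cost per message
# Compare with roughly 145-150 microseconds for a single request/response round
# trip: the stack coalesces many one-byte writes into each frame, so a one-way
# stream never pays the full per-packet latency for every message.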

$netperf -l 60 -H g01 -t TCP_RR 
TCP REQUEST/RESPONSE TEST to g01
Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.
Send   Recv   Size     Size    Time     Rate         
bytes  Bytes  bytes    bytes   secs.    per sec   

16384  87380  1        1       60.00    6747.70   
16384  87380 
Figure Five: Latency test with netperf

Converting the transaction rate into about 148 microseconds per request/response, one obtains good agreement with lmbench.
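
That conversion is just the reciprocal of the transaction rate (numbers from Figure Five):

print(1.0 / 6747.70 * 1e6)   # ~148.2 microseconds per one-byte request/response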

Conclusion

To fully understand the results obtained with any of these three programs, it is essential to have the actual source code being timed. Network connections have many options and ways of being created and used, and different code can have very different timing. Fortunately, they are all fully open source so the code they use is open for your inspection.

All three programs produce "reasonable" results and have options that permit one to explore network performance in a number of contexts. lmbench appears to be the simplest (as is expected of a microbenchmark tool). netpipe allows one to explore rates in a number of distinct settings, such as in boilerplate PVM or MPI code in addition to plain old TCP. netperf has quite good built-in statistics and updated interfaces.

We started this whole discussion by noting that clusters are quite often designed and built around the network, not the processors or memory, because the network frequently determines (as we have seen in our studies of Amdahl's law and scaling) whether a parallelized task exhibits good speedup over a large number of nodes. By using any or all of these three tools, you can finally measure certain fundamental rates that give you a decent chance of estimating network-bound parallel performance for a given application on different kinds of networking hardware. I hope you find them as useful as I have.

Sidebar: Networking and Testing Resources

This article was originally published in ClusterWorld Magazine. It has been updated and formatted for the web. If you want to read more about HPC clusters and Linux, you may wish to visit Linux Magazine.

Robert Brown, Ph.D., has written extensively about Linux clusters. You can find his work and much more on his home page.