Tools of the cluster trade, don't leave home without them
In past articles, we looked at basic Linux networking. At this point you should have a pretty good idea of how the basic network, common to nearly every modern computer system (TCP/IP over Ethernet), is structured at the packet level. We have learned about Ethernet packets and their header, encapsulating IP packets with their header, encapsulating TCP packets with their header. This process of encapsulation isn't quite finished -- there is nothing to prevent anyone from adding YAH (yet another header) inside the TCP payload to further encapsulate the actual data, but since a lot of network communications carry the data inside the TCP layer without further headers or encapsulation, we'll quit at this point and move on to the next burning question.
Given this marvelous understanding of how the network is supposed to be working, the burning question du jour is: How well is it working?
Like all things associated with the network, this question is not terribly easy to answer. The network is complicated (as we have seen), with lots of moving parts and many layers (each with its own header) where things can be misconfigured or go wrong. Also, even when it is working "correctly", it can be working poorly, taking a relatively long time to send and receive messages.
In this column we'll examine and review some of the many tools available to verify function and to measure performance of a Linux (or other Unix) based network. The latter is especially important for would-be cluster engineers, as network performance is often a critical design parameter for clusters intended to do parallel computations involving some degree of interprocessor communications (IPCs).
Here is a list of useful tools. We won't examine all of these in detail, as we continue to assume a working knowledge of Linux/Unix administration, but since many of these commands are "fronted" these days by graphical interfaces that hide what is actually being done, one can actually set up a Linux system successfully without using many of them. If you are somebody who falls in this category, I'd suggest reading the man pages associated with the commands and some of the HOWTO documents indicated below.
- ifconfig is the basic network configuration tool, but it also can be used to examine current network settings.
- route is the tool that configures the default IP routes for each interface. It also can be used to examine the current routing table.
- ping is a simple tool that "bounces" a packet off of a targeted host using ICMP (Internet Control Message Protocol). We have not yet covered ICMP in this column; the only things you need to know about it right now are that it is a protocol restricted to root (so all ICMP-based applications are either suid root or must be run by root) and that it is the basis of ping. ping is extremely useful for verifying that a network is "up and running": if you can ping a host, you can probably connect to any ports it offers for connection.
- traceroute is a more involved ICMP-based tool that actually traces out a route between hosts. This tool isn't so useful on a cluster (where nodes are typically "on the same network") but is very useful overall when a network is failing, especially one with one or more router hops in between hosts.
- tcpdump is a "network microscope" used to watch actual network traffic at the packet level, covered in previous columns. It requires an interface in "promiscuous mode" (where an application can read packets intended for other applications and users) and can generally only be run by the superuser. "Generally" being one of the many reasons that unencrypted network traffic cannot be assumed to be secure, of course.
- netstat is a tool that gives you a broad picture of the instantaneous state of the network, including all open sockets. It can be filtered or restricted to only certain kinds of things, and can be run with a delay so that it updates its output every few seconds.
- /proc/net/dev contains information about all running interfaces. This information is read and digested and turned into usable traffic rates by a number of performance monitors and related tools. We'll likely examine some in a future column.
- nmap is a tool generally used to test network security, but it is also a valuable diagnostic tool for verifying network function and discovering open ports associated with tools and processes.

The tools above are primarily used to control the network and verify network functionality; the following tools test network performance.
- ping can be used as a very crude measure of network (ICMP/ping) latency in flood mode (usable only as root, as it is very nearly a denial of service attack if run for more than a second or so). Figure One is an example where lucifer pings uriel, both on the same 100BT switch. Note the relatively poor latency at 0.06 milliseconds or so each way. While not indicative of the interface's real potential, it does verify that the connection between the two hosts works, that both hosts are up, and that the target host (uriel) is responsive.
- lmbench This package is Larry McVoy's famous benchmark suite, available from www.bitmover.com. It has been around a long time and is one of the many tools Linus Torvalds and Friends use to test kernel and library performance while they work. It is a suite of "microbenchmark" tools, each of which typically tests and times just one thing. Some of the tools test and time network bandwidth and latency for various kinds of connections. These numbers are quite comparable across systems and can help you form a reliable picture of raw network performance.
- netpipe This program is a very powerful tool for benchmarking network performance. It has the advantage of being able to be built to directly test MPI and PVM performance, as well as the performance of specialized cluster interfaces such as Myrinet that don't run on TCP/IP. It is being actively developed and maintained by Guy Helmer at Ames Laboratory at Iowa State University. You can find a detailed discussion of netpipe in Probing Gigabit Ethernet. We will only discuss it briefly in this column.
- netperf This package is due to Rick Jones of Hewlett-Packard (which hosts www.netperf.org) and is one of the original network benchmark tools -- the first one I ever used to any major extent. It is very easy to use and is reasonably powerful and informative. It languished unloved for a number of years to the point where it would no longer compile on many of the systems I owned without a bit of hacking, but in the last couple of years it has revived, moved to version 2.3, and seems to still be very useful. We will spend some time with netperf below to see what it can tell us about network performance.
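Before moving on to the benchmarks, it is worth seeing how simple the raw material in /proc/net/dev (listed above) really is. The following is a minimal Python sketch, not any particular monitoring tool's implementation, of the standard trick performance monitors use: sample the per-interface byte counters twice and divide the difference by the interval. It assumes the usual modern Linux layout of /proc/net/dev (two header lines, then one colon-separated line per interface, with received bytes in field 0 and transmitted bytes in field 8).

```python
import time

def read_counters(path="/proc/net/dev"):
    """Parse /proc/net/dev into {iface: (rx_bytes, tx_bytes)}."""
    counters = {}
    with open(path) as f:
        for line in f.readlines()[2:]:          # skip the two header lines
            iface, data = line.split(":", 1)
            fields = data.split()
            # field 0 is received bytes, field 8 is transmitted bytes
            counters[iface.strip()] = (int(fields[0]), int(fields[8]))
    return counters

def traffic_rates(interval=1.0):
    """Sample twice; return {iface: (rx_bytes_per_sec, tx_bytes_per_sec)}."""
    before = read_counters()
    time.sleep(interval)
    after = read_counters()
    return {
        iface: ((after[iface][0] - rx) / interval,
                (after[iface][1] - tx) / interval)
        for iface, (rx, tx) in before.items() if iface in after
    }

if __name__ == "__main__":
    for iface, (rx, tx) in traffic_rates().items():
        print(f"{iface}: {rx:10.0f} B/s in, {tx:10.0f} B/s out")
```

Real monitors add smoothing, counter-wraparound handling, and pretty output, but this is the entire idea.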
# ping -i 0 uriel
PING uriel.rgb.private.net (192.168.1.2) 56(84) bytes of data.
64 bytes from uriel.rgb.private.net (192.168.1.2): icmp_seq=1 ttl=64 time=0.139 ms
64 bytes from uriel.rgb.private.net (192.168.1.2): icmp_seq=2 ttl=64 time=0.123 ms
64 bytes from uriel.rgb.private.net (192.168.1.2): icmp_seq=3 ttl=64 time=0.121 ms

Figure One: Ping times example
lmbench
To get lmbench you need to visit www.bitmover.com. The download/build procedure is complicated enough, and the tool powerful enough, to warrant a column all its own in the future. For the moment, then, we'll give only the executive summary. After downloading and unpacking bitkeeper and using it to clone lmbench's LMbench3 repository, one uses bitkeeper to "get" the sources out of the repository SCCS directories. On my system a straightforward "make" was all that was then required, although I'm sure YMMV (Your Mileage May Vary).
I then ran the "bw_tcp" (TCP bandwidth) benchmark on two hosts (ganesh and g01). On g01 (remote node) I executed:
$bw_tcp -s
(which starts a "server" -- a daemon -- that listens for a connection and then services it) and on ganesh (host node) I executed:
$bw_tcp g01
0.065536 11.67 MB/sec
Very simple. Note that the bandwidth between these two hosts is 93.4 Mbps out of 100 available, or about 93% efficiency. This number is a fairly typical data throughput number for a good TCP connection on 100BT Ethernet with an MTU of 1500 (where the theoretical maximum would be 1436 -- Ethernet frame less TCP/IP header -- over 1518 or 94.6 Mbps). Given that there are mandatory pauses between frames, this result is very close to theoretical maximum indeed.
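The arithmetic behind those percentages is worth spelling out once. This is just a check of the numbers quoted above, done in Python:

```python
# Measured: bw_tcp reported 11.67 MB/sec between ganesh and g01.
measured_mbps = 11.67 * 8            # megabytes/sec -> megabits/sec
print(f"measured:  {measured_mbps:.1f} Mbps")

# Theoretical ceiling for TCP data on 100BT with MTU 1500:
# 1436 data bytes per frame (Ethernet frame less TCP/IP header, as
# quoted above) out of 1518 bytes per frame on the wire.
ceiling_mbps = 100.0 * 1436 / 1518
print(f"ceiling:   {ceiling_mbps:.1f} Mbps")

# How close the measured throughput comes to that ceiling:
print(f"fraction of ceiling achieved: {measured_mbps / ceiling_mbps:.1%}")
```

Running this reproduces the 93.4 Mbps and 94.6 Mbps figures, and shows the measured link running at roughly 99% of the TCP-over-Ethernet ceiling.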
Similarly, lat_tcp can be used to measure the latency. After running lat_tcp -s on g01, the latency can be measured by running the following on ganesh:
$lat_tcp g01
TCP latency using g01: 145.8439 microseconds
This number is a bit higher than it might be, probably because there are two switch hops between these two hosts. We can test this result by running between two hosts on the same switch:
$lat_tcp g01
TCP latency using g01: 90.9072 microseconds
which is around 55 microseconds faster as it should be.
Both benchmarks permit variation of message size and repetition in order to obtain a statistical picture of network performance at various scales. Graphing performance as a function of message size is often very revealing.
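For readers who want a feel for what a latency microbenchmark like lat_tcp actually times, here is a minimal Python sketch of the same idea, run over the loopback interface: a server thread echoes one-byte messages, and the client times round trips. This is an illustration of the structure of the test, not lat_tcp's actual code, and loopback numbers will be far better than anything a real wire can do. The port number is chosen by the kernel, and TCP_NODELAY is set so that each tiny message goes out immediately rather than being coalesced.

```python
import socket
import threading
import time

MSGS = 1000  # round trips to average over

def echo_server(sock):
    """Accept one connection and echo every byte back (lat_tcp-style)."""
    conn, _ = sock.accept()
    with conn:
        while True:
            data = conn.recv(1)
            if not data:          # client closed the connection
                break
            conn.sendall(data)

def measure_latency():
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))        # kernel picks a free port
    server.listen(1)
    port = server.getsockname()[1]
    threading.Thread(target=echo_server, args=(server,), daemon=True).start()

    client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    client.connect(("127.0.0.1", port))
    client.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)

    start = time.perf_counter()
    for _ in range(MSGS):
        client.sendall(b"x")
        client.recv(1)                   # wait for the echo: one round trip
    elapsed = time.perf_counter() - start
    client.close()
    return elapsed / MSGS * 1e6          # microseconds per round trip

if __name__ == "__main__":
    print(f"loopback TCP round trip: {measure_latency():.1f} microseconds")
```

The real tools add warmup, repetition, and careful statistics on top of exactly this send/receive/timer loop.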
netpipe
As noted, netpipe has been covered in a recent issue, so we will only briefly review it here for purposes of comparison. After downloading its source (see Resource Sidebar) and building it, one similarly runs a "receiver" (daemon) on the remote host:

$NPtcp -r
Then the benchmark itself is run on the host to measure times and rates connecting to the remote host. An example of netpipe is shown in Figure Two.
$NPtcp -t -h g01 -P
Latency: 0.000079
Now starting main loop
  0:   1 bytes   3179 times -->  0.11 Mbps in 0.000072 sec
  1:   2 bytes   3461 times -->  0.21 Mbps in 0.000073 sec
...

Figure Two: Example netpipe results
The most interesting thing to note is that netpipe and lmbench get very different answers for the single packet latency: 145 microseconds versus 79 microseconds. They differ elsewhere as well. This result leads us to a number of very natural questions (such as: which one is correct?) which we will defer to a future column. For the moment, let us accept each as valid (sort of) in the context of comparing systems but not necessarily accurate measurements of anything but the particular kind of code used for the test in the two cases.
netperf
netperf is the third of the network benchmarking tools reviewed in this column. It once was my favorite, and now that it is being loved by its owner it may become my favorite once again. To use netperf one proceeds more or less the same way as for the previous two tools: download, build, start a server/daemon, point a client tool at it with suitable arguments, and wait for the numbers to roll on in. It but remains to fill in the details.

Retrieve netperf 2.3 from www.netperf.org, following links to the download page and reading as you go. Unpack the tarball and change to the netperf-2.3 (e.g.) directory. Edit the makefile -- minimally you'll want to change the default netperf directory, and you must remove the -DNEED_MAKEFILE_EDIT definition from CFLAGS or it will complain. On my own Linux systems, it then just plain builds.
Once built, you'll note two binaries (netserver and netperf) and a bunch of scripts. The scripts are great for generating a whole suite of tests at once, and they also serve as good examples of some of the many command line arguments netperf takes.
Visit your remote system and start up a netserver (daemon) on its default port or some other:
$netserver
Starting netserver at port 12865
This daemon will typically run until killed (like the lmbench daemon, unlike the netpipe daemon). Multiple tests can be run to the target host once the daemon is running.
Then return to your source host and run netperf as shown in Figure Three.
$netperf -l 60 -H g01 -t TCP_STREAM -- -m 1436 -s 57344 -S 57344
TCP STREAM TEST to g01
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

114688 114688   1436    60.01      93.60

Figure Three: Example netperf results
This particular invocation says to test for 60 seconds continuously, sending a TCP stream of data to the target g01 with a message size of 1436 (recall, the largest message that will fit in an MTU) using a large send and receive buffer on both ends. Many other options are possible -- netperf is a powerful command with many distinct ways of running.
Note the excellent agreement between netperf and lmbench for maximum bandwidth. Observe, however, what happens when we use a message size of 1 byte (the minimum). In this limit, the "bandwidth" is entirely dominated by the minimum packet latency as shown in Figure Four.
$netperf -l 60 -H g01 -t TCP_STREAM -- -m 1 -s 57344 -S 57344
TCP STREAM TEST to g01
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

114688 114688      1    60.00       1.30

Figure Four: Single byte netperf results
1.3 Mbps is obviously much, much lower than the 93 or so Mbps that the interface can manage for large packets. Still, it is more optimistic than one might have expected from the latency measurements above because TCP aggregates packets where possible, reducing apparent latency. netperf also permits one to more or less "directly" measure latency via a request-response (ping-like) operation shown in Figure Five.
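The aggregation effect is easy to quantify from the numbers above. A quick sketch of the arithmetic, using the figures already quoted (nothing here is newly measured):

```python
# Figure Four: with 1-byte messages the stream test still moved 1.30 Mbps.
messages_per_sec = 1.30e6 / 8          # one byte of payload per message
per_message_us = 1e6 / messages_per_sec
print(f"{per_message_us:.2f} us per 1-byte message")

# The measured lat_tcp latency was on the order of 145 us.  If every
# 1-byte message required its own round trip, we could move at most:
naive_limit = 1e6 / 145                # messages/sec, one per round trip
print(f"naive limit: {naive_limit:.0f} msgs/sec "
      f"vs {messages_per_sec:.0f} observed")
```

The observed message rate exceeds the naive one-message-per-round-trip limit by a factor of twenty or so, which is exactly the signature of TCP coalescing many small writes into each packet on the wire.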
$netperf -l 60 -H g01 -t TCP_RR
TCP REQUEST/RESPONSE TEST to g01
Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.
Send   Recv   Size     Size    Time     Rate
bytes  Bytes  bytes    bytes   secs.    per sec

16384  87380  1        1       60.00    6747.70
16384  87380

Figure Five: Latency test with netperf
Converting the transmission rate into 148 microseconds per packet, one obtains good agreement with lmbench.
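The conversion from Figure Five's transaction rate to a per-transaction time is just the reciprocal:

```python
# From the TCP_RR test in Figure Five:
trans_per_sec = 6747.70
microseconds = 1e6 / trans_per_sec
print(f"{microseconds:.1f} us per request/response transaction")
```

This works out to about 148 microseconds per transaction, which is the number compared with lmbench's lat_tcp result above.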
Conclusion
To fully understand the results obtained with any of these three programs, it is essential to have the actual source code being timed. Network connections have many options and ways of being created and used, and different code can have very different timing. Fortunately, all three are fully open source, so the code they use is open for your inspection.

All three programs produce "reasonable" results and have options that permit one to explore network performance in a number of contexts. lmbench appears to be the simplest (as is expected of a microbenchmark tool). netpipe allows one to explore rates in a number of distinct settings, such as boilerplate PVM or MPI code in addition to plain old TCP. netperf has quite good built-in statistics and updated interfaces.
We started this whole discussion by noting that clusters are quite often designed and built around the network, not the processors or memory, because the network frequently determines (as we have seen in our studies of Amdahl's law and scaling) whether a parallelized task exhibits good speedup over a large number of nodes. By using any or all of these three tools, one can finally measure certain fundamental rates and stand a decent chance of estimating network-bound parallel performance for a given application on different kinds of networking hardware. I hope you find them as useful as I have.
Sidebar: Networking and Testing Resources
This article was originally published in ClusterWorld Magazine. It has been updated and formatted for the web. If you want to read more about HPC clusters and Linux, you may wish to visit Linux Magazine.
Robert Brown, Ph.D., has written extensively about Linux clusters. You can find his work and much more on his home page.