Cluster Interconnects: The Whole Shebang

Article Index

It is what makes a cluster "a cluster"

An often asked question from both "clusters newbies" and experienced cluster users is, "what kind of interconnects are available?" The question is important for two reasons. First, the price of interconnects can range from as little as $32 per node to as much as $3,500 per node, yet the choice of an interconnect can have a huge impact on the performance of the codes and the scalability of the codes. And second, many users are not aware of all the possibilities. People new to clusters may not know of the interconnection options and, sometimes, experienced people choose an interconnect and become fixated on it, ignoring all of the alternatives. The interconnect is an important choice and ultimately the choice depends upon on your code, requirements, and budget.

In this review we provide an overview of the available technologies and conclude with two tables summarizing key features and pricing for various size clusters.

Introduction

This article will focus on interconnects that aren't tied to vendor specific node hardware, but can work in a variety of cluster nodes. While determining which interconnect to use is beyond the scope of this article, I can present what is available and make some basic comparisons. I'll present information that I have obtained from the vendors websites, from information people have posted to the beowulf mailing list, the vendors, and various other places. I won't make any judgments or conclusions about the various options because, simply, I can't. The choice of an interconnect depends on your situation and there is no universal solution. I also intend to stay "vendor neutral" but will make observations where appropriate. Finally, I have created a table that presents various performance aspects of the interconnects. There is also a table with list prices for 8 nodes, 24 nodes, and 128 nodes to give you an idea of costs.

I'm going to examine seven interconnects. These interconnects are:

Sidebar One: Some Background and History

This article is an update to an article that Doug and I wrote for ClusterWorld Magazine. The article appeared in April of 2005, but was written in Jan. 2005. Huge changes in interconnect technology, such as the rise of 10 GigE Ethernet and Infinipath appearing in production machines, warranted that I update the original article. Moreover, the original article had a table with some performance data and pricing data for various sized clusters (only interconnect pricing). As you all know, prices can and do change very rapidly, so I wanted to provide updated comparisons. All information was taken from published public information and discussion with vendors.

We have tried to be as fair as possible when presenting these technologies. All the vendors who make high performance interconnect hardware were given a chance to review the article and provide feedback. We expect there maybe a need for further discussion as well. We invite those who wish to make comments to register and add a comment at the end of the article. If anyone including vendors has the need to write an longer article about their experiences with a particular interconnect technology, we welcome your input. Please contact the Head Monkey.

Finally, we urge you to contact the vendors we mention and discuss your specific needs. This review is best used as a road map to help get you started in right direction.

In addition, I'll talk about some of the technologies that some of the interconnects use to improve performance. Namely, TCP bypass (OS bypass), TCP Off-load Engine (TOE), RDMA (Remote Direct Memory Access), zero-copy networking, and interrupt mitigation. I'll also discuss some alternatives to plain Gigabit Ethernet such as

Gigabit Ethernet

Ethernet has an enduring history. It was developed principally by Bob Metcalfe and David Boggs at Xerox Palo Alto Research Center in 1973 as a new bus topology LAN (Local Area Network). It was designed to be a set of protocols at the datalink and physical layers. In 1976, carrier sensing was added. A 10 Mbps (Mega-bits per second) standard was promoted by DEC, IBM, and Xerox in the 1980's. Novell adopted it and used in their widely successful Netware. It finally became an IEEE standard in 1983 (802.3, 10Base5).

The initial speed of Ethernet was only 10 Mbps. As more systems were added to networks, a faster network connection was needed. Development of Fast Ethernet (802.3u), which runs at 100 Mbps or 10 times faster than 10 Mbps, was started in 1992 and was adopted as an IEEE standard in 1995. It is really just the same 10 Mbps standard but run at a higher speed.

With the explosion of the Internet and with the need to move larger and larger amounts of data, even Fast Ethernet was not going to be fast enough. The development of Gigabit Ethernet (GigE), which runs as 1 Gbps (Giga-bits per second) or 10 times faster than Fast Ethernet, was started and became an IEEE standard (802.3z) that was adopted in 1998. GigE uses the same protocol as 10 Mbps Ethernet and Fast Ethernet. This makes it easy to upgrade networks without having to upgrade every piece of software. Plus GigE can use the same category 5 copper cables that are used in Fast Ethernet (If you are buying new cables for GigE choose cat 5e or better).

Currently, there are a large number of manufacturers of GigE Network Interface Cards (NIC) on the market. Most of the GigE NICs are supported at some level in the Linux Kernel. Many of NIC manufacturers use the same basic GigE chipsets (e.g. Intel, Broadcom, Realtek, National Semiconductor), sometimes slightly modifying how they are implemented. This makes it difficult to write "open" drivers without vendor support. There are some good articles available on the web that describe the lack of vendor support for GigE NICs, especially Nvidia (although Nvidia has some closed drivers that allow for the use of the Nvidia audio and network devices). In recent years, most major vendors have released source code drivers for their GigE chipsets or at least provided documentation on the chipsets. The Linux kernel has driver support for most common types of GigE chipsets. There are also Fast Ethernet drivers, original 10 Mbps Ethernet drivers, and fairly recently, 10 Gigabit Ethernet (10 GigE) drivers.

Many of these Ethernet drivers have parameters that can be adjusted. Later in this article I will mention some of the things that can be adjusted. You can adjust the various parameters for any number of reasons and you find that cluster users often experiment with drivers to determine which is the best combination for their cluster and work load.

One of the problems with GigE is that it is based on the standard Ethernet MTU (Maximum Transmission Unit) size of 1500 Bytes. As each Ethernet packet must be serviced by the processor (as an interrupt) moving data at Gigabit speeds can tax even the fastest systems (80,000 interrupts a second). One of the ways to solve this problem is the use of "Jumbo Frames" or basically a larger MTU. Most GigE NICs support MTU sizes up to 9000 Bytes (reducing the number of interrupts by up to a factor of 6). If you want to use Jumbo Frames, make sure your switch supports these sizes (see the subsection on GigE switches entitled "Switching the Ether").

GigE NICs come in various sizes and flavors over a range of prices. Sixty-four-bit PCI "server" GigE NICs are still slightly expensive ("desktop" NICS, which are essentially built from the same components, are very inexpensive, but are limited by the 32-bit PCI bus). For server class motherboards as well as normal desktop motherboards, you will find GigE built into the motherboard at a lower cost. If your motherboard doesn't have built-in GigE, then regular 32-bit PCI NICs start at about $3-$4 with decent ones, such as the DLINK DFE_530TX starting at about $13 or the popular Intel PCI GigE NIC starting at about $30. Sixty-four-bit GigE NICS start at about $33 (when this article was written), but PCI-X NICS are more expensive. PCI-X GigE NICs with copper connectors start at about $100 and PCI-X GigE NICS with with Fiber Optic connectors start at about $260. (Note: These prices were determined at the time the article was posted. They will vary.) Also, be aware that the really low cost NICs may not perform as expected. Some basic testing will help qualify NICs.

GigE is a switched network. A typical GigE network consists of a number of nodes connected to a switch or set of switches. You can also connect the switches to each other to create a tiered network. Other network topologies such as rings, 2D and 3D torus, meshes, etc. can also be constructed without switches. However, some of these networks may require multiple NICs per node and can be complicated to configure, but they do eliminate the need for switches.

While it is beyond the scope of this article, you can also use new network topologies such as Flat Neighborhood Networks (FNNs) or Sparse Flat Neighborhood Networks (SFNNS) that offer potential benefits for particular applications. There are on-line tools to help you design FNNs. As shown in some of the literature, these types of networks can offer very good price/performance ratios for certain codes.

Pushing GigE

Despite the rise in Ethernet speeds, it is possible for a single 32-bit PCI based NIC to saturate the entire PCI bus. Consequently, it's easy to see that the CPU can spend a big part of it's time handling interrupt requests and processing network traffic.

The situation has led vendors to develop TOE (TCP Off-Load Engines) and RDMA (Remote Direct Memory Access) NICs for GigE. The goal of these cards is to reduce latency and to reduce the load on the CPU when handling network data. Several companies started marketing these type of NICs, but really only one remains, Level 5 Networks and they don't use TOE or RDMA.

{mosgoogle right}

Level 5 Networks makes a 32/64 bit PCI GigE NIC that is not a RDMA NIC but uses a new technology called Etherfabric. In a conventional network, a user application converts the sockets or MPI calls into system calls that use the kernel TCP/IP stack. With Etherfabric, each application or thread gets its own user space TCP/IP stack. The processed data is sent directly to the virtual hardware interface that was assigned to the stack and then sent out over the NIC. This process eliminates context switching (mentioned below) to improve latency.

The interesting thing about the Level 5 NIC and Etherfabric, is that you can use existing TCP/IP networks and any routers because it doesn't use a different packet protocol. In addition, the NIC appears as a conventional GigE NIC so it can send and receive conventional traffic as well as Etherfabric traffic. It also means that you don't have to use an Etherfabric specific MPI implementation. Consequently, you don't have to recompile and can use your binaries immediately. This overcomes an Achilles heal of past high-speed GigE NICs because you can use all of your ISV applications.

GAMMA or Genoa Active Message MAchine is a project maintained by Giuseppe Ciaccio of Dipartimento di Informatica s Scienze dell'Informazione in Italy. It is a project to develop an Ethernet kernel bypass capability for existing Ethernet NICs as well as it's own packet protocol (doesn't use TCP packets). Since it bypasses the Linux kernel and doesn't use TCP packets it has a much lower latency than standard Ethernet.

Right now, GAMMA only supports the Intel GigE NIC and the Linux 2.6.12 kernel (Broadcom chipsets (tg3) are said to be under development). Also, since it uses something other than IP the packets are not routable over a WAN (not likely to be a problem for clusters, but could be a problem for Grids). GAMMA can share TCP traffic over the interfaces, but it takes exclusive control of the NIC when transmitting or sending data. That is, all TCP/IP traffic is stopped. Normally this is not a problem as most clusters have multiple networks available to them.

There is also a port of MPICH1 to run using GAMMA. MPI/GAMMA is based on MPICH 1.1.2 and supports SPMD/MIMD parallel processing and is compatible with standard network protocols and services.

Figure One is a plot of the throughput of MPI/GAMMA compared to LAM/MPI and MPICH1. Figure Two is a comparison of the latencies for the same three MPI's (Intel Pro/1000 MT, 32 bit PCI).

Figure One: Throughput  Comparison for GAMMA, LAM, and MPICH
Figure One: Throughput Comparison for GAMMA, LAM, and MPICH

Figure Two: Latency Comparison for GAMMA, LAM, and MPICH
Figure Two: Latency Comparison for GAMMA, LAM, and MPICH

Notice that the latency is only 9.54 microseconds and this includes the latency of the switch! This is probably the lowest latency I've ever seen for GigE. Also, the throughput for MPI/GAMMA is very good as well, peaking at about 72 MByte/s or 560 Mbits/s Not too shabby for an inexpensive Intel 32 bit PCI GigE card.

In addition to GAMMA, there are other network solutions that don't use TCP. For example Scali MPI Connect has a module called "Direct Ethernet" or DET that runs over any communication interface that has a MAC address. In other words, it uses a different protocol than TCP and by-passes the kernel (sometimes called TCP-bypass). Moreover, Scali says that using DET allows you to aggregate multiple GigE networks together. Scali says that you can use independent inexpensive unmanaged layer 2 switches for this aggregation.

Another example of an MPI implementation that doesn't use TCP as the packet protocol is Parastation from Partec. It currently is up to Version 4 and claims an MPI latency of about 11 microseconds. It is based on MPICH 1.2.5.

I Feel the Need for Speed

I have been using various terms such as TCP-bypass, without adequately explaining them. I want to take some time to explain some of the techniques that have been used to improve the performance of interconnects. In particular, I'm going to cover:

  • TCP-Bypass
  • TCP Offload Engine
  • Kernel Bypass (OS Bypass)
  • RDMA
  • Zero-copy Networking
  • Interrupt Mitigation

TCP Bypass
I previously mentioned TCP-Bypass when I discussed Scali MPI Connect. TCP-bypass really means that you are using a different packet protocol than TCP or UDP. The application uses it's own packet definition, builds the packets on the sending side and sends them directly over the link layer where they are decoded and the data processed. There are a number of products that do this, such as, Scali MPI Connect, Parastation, HyperSCSI (iSCSI like storage), and Coraid ATA-over-Ethernet. The term "TCP-bypass" is used because you are "bypassing" the TCP packet protocol and using a different protocol right on top of the link layer.

The idea behind using a different protocol is that you reduce the overhead of having to process TCP packets. Theoretically this allows you to process data packets faster (less overhead) and fit more data into a given packet size (better data bandwidth). So you can increase the bandwidth and reduce the latency. However, there are some gotchas with TCP-bypass. The code has to perform all of checks that the TCP protocol performs, particularly the retransmission of dropped packets. In addition, because the packets are not TCP, they are not routable across Ethernet. So you can't route these packets to other networks. This is not likely to be a problem for clusters that reside on a single network, but Grid applications will very likely have this problem because they need to route packets to remote systems.

TCP Offload Engine

TCP Offload Engines (TOE) are a somewhat controversial concept in the Linux world. Normally, the CPU processes the TCP packets, which can require extreme processing power. For example, it has been reported that a single GigE connection to a node can saturate a single 2.4 GHz Pentium IV processor. The problem of processing TCP packets is worse for small packets. To process the packets, the CPU has to be interrupted from what it's doing to process the packet. Consequently, if there are a number of small packets, the CPU can end up processing just the packets and not doing any computing.

The TOE was developed to remove the TCP packet processing from the CPU and put it on a dedicated processor. In most cases, the TOE is put on a NIC. It handles the TCP packet processing and then passes the data to the kernel, most likely over the PCI bus. For small data packets, the PCI bus is not very efficient. Consequently, the TOE can collect from a series of small packets and then send a larger combined packet across the PCI bus. This design increases latency, but may reduce the impact on the node processing. This feature is more likely to be appropriate for enterprise computing than HPC.

Kernel Bypass

Kernel Bypass, also called OS bypass, is a concept to improve the network performance, by going "around" the kernel or OS. Hence the term, "bypass." In a typical system, the kernel decodes the network packet, most likely TCP, and passes the data from the kernel space to user space by copying it. This process means the user space process context data must be saved and the kernel context data must be loaded. This step of saving the user process information and then loading the kernel process information is known as a context switch. According to this article, application context switching constitutes about 40% of the network overhead. So, it would seem that to improve bandwidth and latency of an interconnect, it would be good to eliminate the context switching.

In Kernel bypass, the user space applications communicate with the I/O library that has been modified to communicate directly with the user space application. This process takes the kernel out of the path of communication between the user space process and the I/O subsystem that handles the network communication. This change eliminates the context switching and potentially the copy from the kernel space to user space (it depends upon how the I/O library is designed). However, people are arguing that the overhead in the kernel associated with a context switch has shrunk. Combined with faster processors the impact of a context switch has lessened.

RDMA

Remote Direct Memory Access (RDMA) is a concept that allows NICs to place data directly into the memory of another system. The NICs have to be RMDA enabled on both the send and receive ends of the communication. RDMA is useful for clusters because it allows the CPU to continue to compute while the RDMA enabled NICs are passing data. This can help improve compute/communication overlap, which helps improve code scalability.

The process begins with the sending RDMA NIC establishes a connection with the receiving RDMA NIC. Then the data is transferred from the sending NIC to the receiving NIC. The receiving NIC then copies the data directly to the application memory bypassing the data buffers in the OS. RDMA is most commonly used in Infiniband implementations, but other high-speed interconnects use it as well. Recently 10 GigE NICs started using RDMA for TCP traffic to improve the performance. There has been some discussions lately that RDMA may have outlived it's usefulness for MPI codes. The argument is that most messages in HPC codes are small to medium in size and that using memory copies to move the data from kernel space to user space is faster than having having a RDMA NIC to it. Reducing the amount of time the kernel takes to do the copy and improving processors speeds are two of the reasons that a memory copy could be faster than RDMA.

There is a RDMA Consortium that helps organize and promote RMDA efforts. They develop specifications and standards for RDMA implementations so the various NICs can communicate with each other. Their recent efforts have resulted in the development of an RDMA set of specifications for TCP/IP over Ethernet.

Zero-Copy Networking

Zero-Copy networking is a technique where the CPU does not perform the data copy from the kernel space to user space (the application memory). This trick can be done for both send operations and receive operations. This can be accomplished in a number of ways including using DMA (Direct Memory Access) copying or memory mapping using a MMU (Memory Management Unit) equipped system. Zero-copy networking has been in the Linux kernel for some time, since the 2.4 series. Here is an article that discusses how the developers went about accomplishing it. The article gives some details at a high level about how one accomplishes this. It also points out that zero-copy networking requires extra memory and a fast system to perform the operations. If you would like more information, this article can give you even more detail.

With every new idea there are always seems to differing opinions. Here is an argument that zero-copy may not be worth the trouble. Rather the authors argue that a Network Processor (kind of a programmable, intelligent NIC) would be a better idea.

Interrupt Mitigation

Interrupt Mitigation also called Interrupt Coalescence, is a another trick to reduce the load on the CPU resulting from interrupts to process packets. As mentioned earlier, every time a packet gets sent to the NIC, the kernel must be interrupted to at least look at the packet header to determine if the data is destined for that NIC, and if it is, process the data. Consequently, it is very simple to create a Denial-of-Service (DOS) attack by flooding the network with a huge number of packets forcing the CPU to process everyone of them. Interrupt Mitigation is a driver level implementation that collects packets for a certain amount of time or a certain total size, and then interrupts the CPU for processing. The idea is that this reduces the overall load on the CPU and allows it to at least do some computational work rather than just decode network packets. However, this can increase latency by holding data before allowing the CPU to process it.

Interrupt Coalescence has been implemented in Linux through NAPI (New API) rewrites of the network drivers. The rewrites include interrupt limiting capabilities on both the receive side (Rx) as well as the transmit side (Tx). Fortunately, the people who wrote the drivers allow the various parameters to be adjusted. For example, in this article, Doug Eadline (Head Monkey) experimented with various interrupt throttling options (interrupt mitigation). Using the stock settings with the driver the latency was 64 microseconds. After turning off interrupt throttling, the latency was reduced to 29 microseconds. Of course, we assume the CPU load was higher, but we didn't measure that.

There are two good articles that discuss "tweaking" GigE, TCP NIC drivers. The first describes some parameters and what they do for the drivers. The second describes the same thing but with more of a cluster focus. Both are useful for helping you understand what to tweak and why, and what the impacts are.

    Search

    Login And Newsletter

    Create an account to access exclusive content, comment on articles, and receive our newsletters.

    Feedburner

    Share The Bananas


    Creative Commons License
    ©2005-2012 Copyright Seagrove LLC, Some rights reserved. Except where otherwise noted, this site is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 2.5 License. The Cluster Monkey Logo and Monkey Character are Trademarks of Seagrove LLC.