Cluster Interconnects: The Whole Shebang

Switching the Ether

GigE switches have now become less expensive. Switches come in two varieties: managed and unmanaged. Small switches are almost always unmanaged; that is, they are plug-and-play with no option for any type of configuration. A small 5-port switch starts at about $32, an 8-port switch at about $52, and a 16-port switch at about $233. These switches typically do not support jumbo frames (an MTU of up to 9000 bytes).

SMC has been selling a very inexpensive line of unmanaged small switches for several years. While these switches are very good in their own right, one of their best features is that they are capable of jumbo frames. (Note: The prices mentioned below were determined at the time the article was posted.) The SMC 8505T 5-port switch costs about $70, the SMC 8508T 8-port switch costs about $100, and a 16-port version (8516T) starts at about $300. There is also a 24-port version, the SMC 8524T, that can be found for about $400. The 8516T and 8524T switches are becoming more difficult to find, but there is a new line of 16- and 24-port unmanaged switches that are capable of jumbo frames. The SMC GS16-Smart 16-port switch costs about $240, and a 24-port version, the SMC GS24-Smart, can be found for about $300.
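To see why jumbo frame support matters, consider how much of the wire is spent on per-frame overhead. The following is a rough sketch; the overhead figures are standard Ethernet and TCP/IP framing numbers, not taken from any particular product:

```python
# Rough per-frame efficiency of TCP payload delivery over Ethernet.
# Assumed overheads: 38 bytes per frame on the wire (preamble 8 +
# Ethernet header 14 + FCS 4 + interframe gap 12) plus 20-byte IPv4
# and 20-byte TCP headers (no options).

def tcp_goodput_fraction(mtu):
    frame_overhead = 38   # preamble + Ethernet header + FCS + IFG
    ip_tcp_headers = 40   # IPv4 + TCP headers
    payload = mtu - ip_tcp_headers
    on_wire = mtu + frame_overhead
    return payload / on_wire

print(f"MTU 1500: {tcp_goodput_fraction(1500):.1%} of line rate")
print(f"MTU 9000: {tcp_goodput_fraction(9000):.1%} of line rate")
```

With standard 1500-byte frames roughly 5% of the wire is overhead; jumbo frames shrink that to under 1%, and, just as important, cut the per-packet interrupt and processing load on the host by a factor of six.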

There are a number of switch manufacturers that target the cluster market by providing large high density/performance managed switches. Managed switches allow users to monitor performance or change settings. Companies such as Foundry, Force10, Extreme Networks, HP, SMC, and Cisco offer large GigE switching solutions.

For clusters with Ethernet, having a single high performance backplane (i.e. one large switch) can be important for good performance. A single backplane can provide good "cross-sectional" bandwidth where all nodes can achieve maximum throughput (if the switch is capable of line-rate data transmission on all ports simultaneously). An alternative to a single backplane is to cascade smaller and usually less expensive switches to provide the requisite number of ports for your cluster. This is usually done in a common topology such as a fat-tree topology. Depending upon the topology, cascading switches can reduce the effective cross-sectional bandwidth and almost always increases the latency. Several companies have examples of large single backplane switches. Foundry has a switch (BigIron RX-16) that can have up to 768 GigE ports in a single switch. Extreme Networks has a 480-port GigE switch (Black Diamond 10808) and Force10 has a monster 1260-port GigE switch (Terascale E-series).
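The effect of cascading on cross-sectional bandwidth can be sketched with a little arithmetic. The port and trunk counts below are illustrative, not any particular product:

```python
# Cross-sectional (bisection) bandwidth sketch: one non-blocking
# backplane versus two cascaded switches joined by a trunk of uplinks.

def bisection_gbps(nodes, link_gbps, trunk_links=None):
    """Worst-case bandwidth across the network's midpoint.

    trunk_links=None models a single line-rate backplane; otherwise
    traffic between the two halves is limited by the trunk."""
    half = nodes // 2
    if trunk_links is None:
        return half * link_gbps
    return min(half, trunk_links) * link_gbps

# 32 GigE nodes on one line-rate switch, versus two 24-port switches
# cascaded with 4 uplinks between them:
print(bisection_gbps(32, 1))                 # 16 Gbps
print(bisection_gbps(32, 1, trunk_links=4))  # 4 Gbps
```

In the cascaded case, any all-to-all communication pattern that crosses the trunk is squeezed through 4 links instead of 16, a 4x reduction in effective cross-sectional bandwidth, on top of the extra switch hop of latency.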

There are some interesting topologies you can use with Ethernet to get higher performance and lower cost networks. The two most prominent are FNNs (Flat Neighborhood Networks) and SFNNs (Sparse Flat Neighborhood Networks).

An FNN is a network topology that guarantees single-switch latency and full bandwidth between pairs of processing elements for a variety of communication patterns. This is achieved by giving each node more than one NIC and plugging the NICs into smaller switches in such a way that every pair of nodes that needs to communicate shares at least one switch. If you know how your code communicates, you can use this knowledge to design a low-cost, high-performance FNN that uses inexpensive small switches to achieve the same performance you would get from larger, much more expensive switches. The FNN project website even has an on-line tool for designing FNNs.

Sparse FNNs are a variation of FNNs that allow you to select which communication patterns you want to guarantee single-switch latency and full bandwidth. They let you design a network for very specific codes to get the best performance at the lowest cost.
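The defining FNN property, that every pair of nodes shares at least one switch, is easy to check programmatically. Here is a small sketch with a hypothetical design (all node and switch names are made up): six nodes, two NICs each, spread across three 4-port switches:

```python
# Verify the Flat Neighborhood Network property: every pair of nodes
# must have NICs on at least one common switch (one switch hop).
from itertools import combinations

def is_fnn(node_switches):
    """node_switches: dict mapping node -> set of switches its NICs use."""
    return all(node_switches[a] & node_switches[b]
               for a, b in combinations(node_switches, 2))

# Hypothetical design: 6 nodes, 2 NICs each, three 4-port switches A, B, C.
fnn = {
    "n1": {"A", "B"}, "n2": {"A", "B"},
    "n3": {"B", "C"}, "n4": {"B", "C"},
    "n5": {"A", "C"}, "n6": {"A", "C"},
}
print(is_fnn(fnn))  # True: all 15 node pairs share a switch
```

Note that each switch hosts exactly four NICs, so three cheap 4-port switches give single-switch latency between all six nodes, where a naive cascade of the same switches would put two hops between some pairs.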

GigE Message-Passing

Communication over GigE is usually done using the kernel's TCP services. As a result, a large number of MPI implementations are available for GigE. Virtually every MPI implementation, either open-source or commercial, supports TCP. There are a number of open-source implementations such as MPICH, MPICH2, LAM-MPI, Open MPI, FT-MPI, LA-MPI, PACX-MPI, MVICH, GAMMA, OOMPI (C++ only), MP-MPICH, and MP_Lite (a useful subset of MPI).


There are also some major commercial MPI vendors. Verari Systems Software (previously MPI Software Technologies) has MPI/Pro, Scali has Scali MPI Connect, Critical Software has WMPI, HP has HP-MPI, and Intel has Intel MPI.

Also, since this is just plain TCP, PVM and other message-passing schemes can be used over GigE as well. In addition, many storage protocols support TCP and, in some cases, native Ethernet. For example, iSCSI, HyperSCSI, Lustre, ATA over Ethernet, PVFS, and GPFS can all be run over GigE.

Beyond GigE: 10 Gigabit Ethernet

It was quickly realized that even GigE would not provide enough throughput for our data-hungry world. Consequently, development of the next level, 10 gigabits per second, or 10 GigE, was started. One of the key tenets of the development was to retain backwards compatibility with previous Ethernet standards. The new IEEE standard is 802.3ae.

It was also realized that there would have to be some changes to the way Ethernet functioned to get the required performance. For example, the standard altered the way the MAC layer interprets signaling: the signals are now interpreted in parallel to speed up processing. However, nothing in the standard breaks backward compatibility.

For most 10 GigE installations, fiber optic cables are used to maintain the 10 Gbps speed. However, copper is becoming increasingly popular. There are NICs and switches that support 10GBASE-CX4, which uses copper cables with Infiniband 4x connectors and CX4 cabling. These cables are currently limited to 15 m in length.

However, new 10 GigE cabling options are being developed. In approximately August 2006, 10GBASE-T products should be available. They use unshielded twisted-pair wiring, the same as cat-5e or cat-6 cables. This is the proverbial "Holy Grail" of 10 GigE. Having just the cables will be somewhat pointless, though, if the NICs and switches are not available as well, so we should see a flurry of new product announcements during 2006.

10 GigE NIC Vendors

There are several vendors currently developing and selling 10 GigE NICs for clusters. Currently, Chelsio, Neterion (formerly S2IO), Myricom, and Intel provide 10 GigE NICs.

Chelsio markets two PCI-X intelligent 10 GigE NICs with RDMA support that are reasonable for the HPC market. The T210 uses fiber optic cables and comes in both an SR (short-reach, multi-mode fiber) and an LR (long-reach, single-mode fiber) version for 64-bit PCI-X. This NIC includes both a TOE (TCP Offload Engine) and an RDMA (Remote Direct Memory Access) capability to improve the performance of the NIC as much as possible. Since the NIC uses TCP as the network protocol, any MPI implementation that uses TCP, which is virtually all of them, will work with these NICs without change. Moreover, the Chelsio driver was the first 10 GigE driver to be included in the Linux kernel.

Chelsio also has a copper version, the T210-CX. It is a 64-bit PCI-X NIC that uses CX4 copper cables and has the same features as the T210, including the TOE and RDMA capabilities.

Chelsio also sells a "dumb" 10 GigE NIC, the N210, that includes neither RDMA nor TOE capabilities. It uses fiber optic cables, both SR and LR, for connecting to switches or to other NICs. It is cheaper than the T210 series, but likely has the same level of performance for HPC codes.

Neterion has a 10 GigE NIC called the Xframe. It is a 64-bit PCI-X NIC with RDMA and some TOE capability, and it uses fiber optic cabling. Neterion has also announced a new 10 GigE NIC called the Xframe II. This NIC has a 64-bit PCI-X 2.0 interface that should allow a bus speed of 266 MHz instead of the usual 133 MHz. According to the company, this NIC should be capable of hitting 10 Gbps wire speed (PCI-X currently limits 10 GigE NICs to about 6.5-7.5 Gbps) and achieving 14 Gbps of bi-directional bandwidth.
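The PCI-X ceiling mentioned above follows directly from the bus arithmetic. This sketch assumes the nominal 64-bit bus width and clock rates; the gap between the raw figure and the quoted 6.5-7.5 Gbps is bus protocol overhead:

```python
# Why PCI-X limits a 10 GigE NIC: the bus itself tops out near 10 Gbps.
# Raw bus rate = width (bits) * clock (MHz); real transfers lose some
# of that to bus arbitration and protocol overhead.

def bus_gbps(width_bits, clock_mhz):
    return width_bits * clock_mhz / 1000.0

pcix_133 = bus_gbps(64, 133)   # PCI-X at 133 MHz
pcix_266 = bus_gbps(64, 266)   # PCI-X 2.0 at 266 MHz
print(f"PCI-X 133 MHz raw: {pcix_133:.1f} Gbps")
print(f"PCI-X 2.0 266 MHz raw: {pcix_266:.1f} Gbps")
```

At 133 MHz the raw bus rate is only about 8.5 Gbps, so after overhead a 10 GigE NIC cannot reach wire speed; doubling the clock with PCI-X 2.0 lifts the raw ceiling to about 17 Gbps, which is why the Xframe II can plausibly claim full wire speed.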

Currently, both the Xframe and Xframe II NICs use optical fiber connectors, presumably both SR and LR. However, with the demand for copper connectors, it is entirely possible they have a CX4 version of the NIC. Neither NIC is sold directly to the public; they are sold to OEMs. Recently, IBM announced that it will use Neterion NICs in its Intel-based xSeries servers.

Myricom has a new NIC that has some interesting features. It can function as a normal 10 GigE NIC if it is plugged into an Ethernet switch, or it can function as a Myrinet NIC using the MX protocol when plugged into a Myrinet switch (Holy Sybil, Batman!). This new NIC, called the Myri-10G, can accommodate a number of connectors including 10GBase-CX4, 10GBase-{S|L}R, and XAUI/CX4 over ribbon fiber. The cards are PCI-Express x8 and start at $795 per NIC (list price). Myricom has Linux, Windows, and Solaris drivers that come bundled with the card, and FreeBSD has a driver for it in its source tree. Myricom reports that it has been shipping the NICs since Dec. 2005.

Intel has developed and is selling three 10 GigE NICs: the Intel Pro/10GbE CX4, the Intel Pro/10GbE SR, and the Intel Pro/10GbE LR adapters. The CX4 version is a PCI-X NIC that uses copper cabling; when this article was written, the best pricing for it was about $872. The SR version is also a PCI-X NIC and uses multi-mode fiber cables for connectivity. It is intended primarily for connecting enterprise systems, not for HPC, and currently costs about $3,000 per NIC. Finally, the LR version, also a PCI-X NIC, is for long-range connectivity (up to 10 km) using single-mode fiber cables. As with the SR NIC, it is not really designed for HPC, and its price is about $5,000.

It is likely that other vendors will be developing 10 GigE products in the near future. Level 5 and other RDMA GigE manufacturers are rumored to be developing a 10 GigE product.

10 GigE MPI

Since 10 GigE is still Ethernet and TCP, you can use just about any MPI implementation, commercial or open-source, as long as it supports TCP or Ethernet. This means you can run existing binaries without any source changes. This is one reason why people are seriously considering 10 GigE as the upcoming interconnect for HPC.

10 GigE Switches

There are several 10 GigE switch manufacturers. The typical HPC switch vendors such as Foundry, Force10, and Extreme all make 10 GigE line cards for their existing switch chassis. They have been developing these line cards primarily for the enterprise market, but they now realize that as the costs of the line cards and NICs come down, they may have a product line suitable for the HPC market.

Foundry has a new large-chassis (14U) switch called the RX-16, which can accommodate up to 16 line cards. Foundry currently has a 4-port 10 GigE line card in a fiber optic version and, presumably, a copper version using CX4 cables. All ports on the switch run at full line rate, at an approximate per-port cost of $4,000.

A company that has focused on 10 GigE for some time is Extreme Networks. Their BlackDiamond series of switches focuses on high performance, including HPC. Their largest switch, the BlackDiamond 10808, can accommodate up to 48 ports of 10 GigE, presumably both fiber and copper.

Force10 has probably been the leader in the 10 GigE market. They have a large single-chassis switch, called the E-Series, that can accommodate up to 224 10 GigE ports. To reach this port count, they have a new line card that can accommodate up to 16 ports of 10 GigE (with a total of 14 line cards). However, these new line cards are not full line-rate cards. To get full line rate on each port, they have a 4-port 10 GigE line card, resulting in 56 total 10 GigE ports. An interesting difference between the cards, besides performance, is the price: the 16-port line cards come to about $2,700 per port, while the 4-port line cards come to about $7,500 per port. Both line cards have pluggable XFP optics, allowing SR, LR, ER, and ZR optics to be used.

The switches from Extreme, Force10, and Foundry focus on the high end of 10 GigE with large port counts. However, other companies are focusing on lower-port-count 10 GigE switches that deliver good performance. Fujitsu has a single-chip 10 GigE switch called the XG700 that has 12 ports in a compact form factor. The switch can be configured for SR, LR, and CX4 connections. It has a very low latency of 450 ns and a reasonably low cost of about $1,200 per port.

SMC has long been a favorite of builders of small to medium clusters. Their switches have very good performance, and they offer a wide range of unmanaged and managed switches. Recently they brought out an inexpensive 8-port switch, the SMC 8708L2. It is an 8-port single-backplane switch that uses XFP connectors supporting SR, LR, and ER optics. It is a managed switch, and at press time it cost about $6,300, which comes out to less than $800 per port. This is the price/performance leader for small 10 GigE fiber switches.

Quadrics has introduced a new 10 GigE switch that uses the Fujitsu single-chip 10 GigE solution. At the Supercomputing 2005 (SC05) show, they were showing a new 10 GigE switch that fits into an 8U chassis. It has 12 slots for 10 GigE line cards. Each line card has 8 ports for 10 GigE connections using CX4 connectors; the remaining four ports on each line card connect the line cards internally in a fat-tree configuration. This means the network is 2:1 oversubscribed, but it looks to have very good performance. If all line card slots are populated, the switch has 96 ports. It has been in testing since Q1 of 2006, and full production is slated for mid-2006. No prices have been announced, but rumor has it the price should be below $2,000 per port. Quadrics also stated in their press release that follow-on products will increase the port count to 160 and then to 1,600. Further announcements on this product will be made at Interop.
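The 2:1 oversubscription figure follows from the line card's port split, eight external ports competing for four internal uplinks. This sketch just restates that arithmetic for the chassis described above:

```python
# Oversubscription of a chassis built from line cards with more external
# ports than internal fat-tree uplinks (8 external vs. 4 internal here).

def oversubscription(external_ports, uplink_ports):
    return external_ports / uplink_ports

cards = 12
print(cards * 8, "external ports")        # 96 ports, fully populated
print(f"{oversubscription(8, 4):.0f}:1")  # 2:1 oversubscribed
```

In the worst case, when all eight nodes on one card talk to nodes on other cards, each link sees only half its nominal bandwidth; traffic that stays within a single line card is unaffected.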

Even more exciting is a new company, Fulcrum Micro, that is developing a new 10 GigE switch ASIC. It has great performance, with a latency of about 200 ns, and uses cut-through rather than store-and-forward switching for improved latency and throughput. It can accommodate up to 24 ports and should be available in Jan. 2006 for about $20 per port. Fulcrum has a paper that discusses how to take 24-port 10 GigE switches built on their ASIC and construct a 288-port fat-tree topology with full bandwidth to each port and a latency of only 400 ns. According to Fulcrum, a number of companies are looking at using their ASICs to build HPC-centric 10 GigE switches.
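A 288-port full-bandwidth fat tree from 24-port switches works out as follows under a standard two-tier leaf/spine construction. This is one common way to build it, not necessarily Fulcrum's exact design:

```python
# Two-tier full-bandwidth fat tree from identical k-port switches:
# each leaf splits its ports evenly between hosts and uplinks, and
# each of the k/2 spine switches has one link to every leaf.

def two_tier_fat_tree(k):
    hosts_per_leaf = k // 2   # half the leaf ports face the hosts
    leaves = k                # each spine port reaches a distinct leaf
    spines = k // 2           # one uplink per leaf to each spine
    return {"hosts": leaves * hosts_per_leaf,
            "switches": leaves + spines}

print(two_tier_fat_tree(24))  # {'hosts': 288, 'switches': 36}
```

With k = 24, that is 24 leaf switches and 12 spines, 36 chips in total, giving 288 host ports with full bisection bandwidth and two switch traversals (hence roughly double the single-chip latency, matching the 400 ns figure).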


Infiniband

Infiniband was created as an open standard to support a high-performance I/O architecture that is scalable, reliable, and efficient. It was created in 1999 by the merging of two projects: Future I/O, supported by Compaq, IBM, and HP, and Next Generation I/O, supported by Dell, Intel, and Sun.

The drive for a new high-performance I/O system came about because the existing PCI bus had become a bottleneck in the computing process. It was hoped that updating PCI to something new would remove the bottleneck.

Much like other standards, IB can be implemented by anyone. This freedom has the potential to allow greater competition. Today there are four major IB companies: Mellanox, Topspin (acquired by Cisco), Silver Storm (formerly Infinicon), and Voltaire. However, Mellanox is the main manufacturer of Infiniband ASICs.

The Infiniband specification that was finally ratified provides for a number of features that improve interconnect latency and bandwidth. One of these is that IB is a bidirectional serial bus, which reduces cost and can improve latency. The specification also provides for the NICs (usually called HCAs, or Host Channel Adapters) to use RDMA, which greatly improves latency. Equally important, the specification provides an upgrade path to faster interconnect speeds.

As with other high-speed interconnects, IB does not use IP packets. Rather, it has its own packet definition. However, some of the IB companies have developed an 'IP-over-IB' software stack, allowing anything written to use IP to run over IB, albeit with a performance penalty compared to native IB.

The specification starts IB at a 1X speed, which allows an IB link to carry 2.5 Gbps (gigabits per second) in each direction. The next speed is called 4X; it specifies that data can travel at 10 Gbps (although PCI-X limits this speed to about 6.25 Gbps). The next level up is 12X, which provides a data transmission rate of 30 Gbps. There are also standards that allow for Double Data Rate (DDR) transmission, which transfers twice the amount of data per clock cycle, and Quad Data Rate (QDR) transmission, which transfers four times the amount of data per clock cycle. For example, a 4X DDR NIC will transfer 20 Gbps and a 4X QDR NIC will transmit 40 Gbps.
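All of the IB speed grades above follow one rule: base lane rate times link width times the data-rate multiplier. A quick sketch:

```python
# Infiniband link rates: 2.5 Gbps per lane (SDR), scaled by link
# width (1X/4X/12X) and data rate (SDR = 1, DDR = 2, QDR = 4).

def ib_gbps(width, rate_multiplier):
    base_lane = 2.5  # Gbps per lane at SDR
    return width * base_lane * rate_multiplier

print(ib_gbps(4, 1))   # 4X SDR: 10.0 Gbps
print(ib_gbps(4, 2))   # 4X DDR: 20.0 Gbps
print(ib_gbps(4, 4))   # 4X QDR: 40.0 Gbps
print(ib_gbps(12, 1))  # 12X SDR: 30.0 Gbps
```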

Like many other networks, IB is a switched network. That is, the HCAs connect to switches that transmit the data to the other HCAs. A single-chassis switch can be used, or switches can be connected in some topology. Today there is a wide variety of switches from the major IB companies.

Voltaire is a privately held company focusing on IB for HPC, in addition to other areas in need of high-speed networking. They were the first of the IB companies to market a large 288-port IB switch. The switch uses 14U of rack space and provides full 4X SDR (Single Data Rate) bandwidth to each switch port. Alternatively, this switch can accommodate up to 96 ports of 12X Infiniband. Voltaire also ships a small (1U) IB switch, the 9024, that provides up to 24 ports of 4X Infiniband (DDR capable). They also have a cool product, the ISR 6000, that allows Infiniband-based networks, like those in clusters, to be connected to Fibre Channel or TCP networks.

Silver Storm (previously called Infinicon), founded in 2000 and privately held, sells HCAs in both PCI-X and PCI-Express form factors, IB switches, and all of the supporting infrastructure for IB. They currently sell IB switches as large as 288 ports in a single chassis. Silver Storm also uses its own IB software stack, called Quick Silver, which has a reputation for very good performance, reliability, and ease of use.

Silver Storm specializes in multi-protocol switches that allow different types of connections, such as 4X IB, GigE, and Fibre Channel, to all use the same switch. They are also focusing on Virtual I/O, which lets you aggregate traffic from SAN, LAN, and server interconnects into a single pipe. This allows you to take three different networks and combine them into a single network connection.

Topspin, which is now part of Cisco, is also pursuing the high-performance computing market, as well as other markets that can use the high-performance IB interconnect, including grid computing and database servers. Topspin produces IB products that combine CPU communication with I/O communication. They are also shrinking the size of IB switches: they have a 1U switch, called the Topspin 90, that has up to 12 ports of 4X Infiniband, plus up to two 2 Gbps Fibre Channel ports and six GigE ports that allow the network to be connected to a range of other networks, such as storage networks. They also sell PCI-X and PCI-Express HCAs that have two IB ports.

Mellanox was founded in 1999 and develops and markets IB ASICs (chips), IB HCAs, switches, and all of the software for controlling an IB fabric. Their first product was delivered in 2001, shortly after the IB specification came out. Mellanox does not sell directly to the public; rather, they sell to other IB companies and to vendors such as cluster vendors.

Currently, Mellanox has a wide range of HCA cards. The Infinihost III Ex cards fit into a PCI-Express x8 slot and come in SDR (10 Gbps) and DDR (20 Gbps) versions, and in two variants: one with memory on the HCA, and one without (called MemFree) that uses host memory instead. There is also an HCA, the Infinihost III Lx, that comes only in the MemFree variant. It, too, comes in 4X SDR (10 Gbps) and DDR (20 Gbps) versions and uses a PCI-Express interface, either x4 (SDR) or x8 (DDR). Mellanox has also been an active participant in the OpenIB Alliance software project; the OpenIB Alliance software stack has been included in the 2.6.11 kernel. They are also participating in the development of what is called "OpenIB Gen 2," the next-generation IB stack for the Linux kernel.

There are a number of commercial MPI implementations that support IB. Infinicon ships an MPI library with their hardware. Scali MPI Connect, Critical Software's MPI, Intel MPI, and Verari Systems Software's (previously MPI Software Technologies) MPI/Pro all work with IB hardware.

There are also a number of open-source MPI implementations that now support IB. For example, LAM-MPI, Open MPI, MPICH2-CH3, MVAPICH, and LA-MPI all support various IB stacks.




    Creative Commons License
    ©2005-2012 Copyright Seagrove LLC, Some rights reserved. Except where otherwise noted, this site is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 2.5 License. The Cluster Monkey Logo and Monkey Character are Trademarks of Seagrove LLC.