From the "I'll take eight" department
The Linux cluster world is moving towards InfiniBand for many reasons: bandwidth, latency, message rate, N/2, price/performance, and other factors that affect performance and price. But the focus is usually on larger systems, often well beyond 64 nodes and up to several thousand nodes. At the same time, the reasons for moving to InfiniBand are still valid for smaller clusters, particularly performance, but the economics are not. Basically, InfiniBand is just too expensive for smaller systems and usually does not make sense from a price/performance perspective. But that has just changed...
The Rise of InfiniBand
InfiniBand has made a remarkable rise in performance since its inception. Just a few years ago, Single Data Rate (SDR) InfiniBand was the standard. SDR has a 10Gbit/s signaling rate and about an 8Gbit/s data rate (recall that GigE is 1Gbit/s for both signaling and data). Coupled with this high bandwidth were much lower latency and CPU overhead. That performance drew cluster people to InfiniBand like moths to a flame.
The very first InfiniBand products were pricey. Shortly thereafter, the price started to drop to the point where you could get SDR InfiniBand for less than $1,500 a node (including the HCA or IB card, cable, and switch port costs). Sometimes you could get it for less than $1,000 a node. In short order it became a popular interconnect for clusters.
Not long after SDR was out, Double Data Rate (DDR) InfiniBand arrived. DDR InfiniBand has a 20Gbit/s signaling rate and about a 16Gbit/s data rate, so basically twice the bandwidth of SDR. Along with the bandwidth increase came a drop in latency. Initially DDR was priced just a bit above SDR, but it was quickly priced the same as SDR. So now you could get twice the bandwidth and lower latency compared to SDR for less than $1,200 a node. Consequently, SDR all but disappeared.
Recently Mellanox announced that Quad Data Rate (QDR) InfiniBand silicon for HCAs is available and that silicon for QDR switches will be available soon. QDR InfiniBand has a signaling rate of 40Gbit/s and a data rate of about 32Gbit/s. You should start to see QDR HCAs and switches for purchase in late Q3 or Q4 of this year.
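If you're wondering where the "about 8Gbit/s out of 10Gbit/s" figures come from, current InfiniBand links use 8b/10b encoding, so the usable data rate is 8/10 of the signaling rate. Here's the arithmetic as a quick Python sketch:

```python
# InfiniBand SDR/DDR/QDR links use 8b/10b encoding, so the usable data
# rate works out to 8/10 of the raw signaling rate.
for name, signaling_gbps in [("SDR", 10), ("DDR", 20), ("QDR", 40)]:
    data_gbps = signaling_gbps * 8.0 / 10.0
    print("%s: %2d Gbit/s signaling -> %4.1f Gbit/s data" % (name, signaling_gbps, data_gbps))
```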
Overall, InfiniBand provides performance benefits to many applications, including those that use MPI as well as those found in traditional data centers such as Oracle, VMware, financial applications, and so on. The ever-growing demand for compute capability from these applications drives the growth of InfiniBand.
A Quick Network Comparison
As you are probably aware, the network can have a big impact on code performance, particularly if you are running parallel codes that use MPI (or, God help you, PVM). Table One below lists some commonly cited, publicly reported interconnect characteristics for GigE, low-latency GigE, 10GigE, SDR InfiniBand (two flavors), and DDR InfiniBand.
Table One - Common Network Characteristics

| Interconnect | Latency (microseconds) | Bandwidth (MBps) | N/2 (bytes) |
| GigE | ~29-120 | ~125 | ~8,000 |
| Low Latency GigE: GAMMA | ~9.5 (MPI) | ~125 | ~7,600 |
| 10GigE: Chelsio (Copper) | 9.6 | ~862 | ~100,000+ |
| InfiniBand: Mellanox InfiniHost SDR (PCI-X) | 4.1 | 760 | 512 |
| InfiniBand: Mellanox InfiniHost III EX SDR | 2.6 | 938 | 480 |
| InfiniBand: Mellanox InfiniHost III EX DDR | 2.25 | 1502 | 480 |
| InfiniBand: Mellanox ConnectX DDR PCIe Gen2 | 1 | 1880 | 256 |
I don't want to cover the details of these characteristics in this article (here's an article that might help, despite its age). You can see from the table that SDR InfiniBand is still much better than GigE, low-latency GigE, or even 10GigE.
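If you want to produce numbers like these for your own cluster, a simple ping-pong test is the usual approach. Below is a minimal sketch using mpi4py; it assumes mpi4py and a working MPI stack are installed, and the filename and host names in the run command are just placeholders. It times round trips for several message sizes and reports one-way latency and bandwidth, which is essentially how the latency, bandwidth, and N/2 columns above are derived.

```python
# Minimal MPI ping-pong sketch: measures one-way latency and bandwidth
# between two ranks for a range of message sizes.
# Run with something like:  mpirun -np 2 -host node1,node2 python pingpong.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
iters = 1000

for size in [1, 1024, 16 * 1024, 256 * 1024, 1024 * 1024]:
    buf = bytearray(size)                 # 'size' bytes per message
    comm.Barrier()
    t0 = MPI.Wtime()
    for _ in range(iters):
        if rank == 0:
            comm.Send([buf, MPI.BYTE], dest=1)
            comm.Recv([buf, MPI.BYTE], source=1)
        else:
            comm.Recv([buf, MPI.BYTE], source=0)
            comm.Send([buf, MPI.BYTE], dest=0)
    elapsed = MPI.Wtime() - t0
    if rank == 0:
        one_way = elapsed / (2.0 * iters)   # seconds per one-way message
        mbps = size / one_way / 1.0e6       # MB/s
        print("%8d bytes: %8.2f us latency, %8.1f MB/s" % (size, one_way * 1e6, mbps))
```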
The Return of SDR InfiniBand
InfiniBand has been expensive for smaller clusters because the HCAs are fairly expensive and, most of the time, the smallest switch you could buy had 24 ports. So if you only had, let's say, 4 to 8 nodes, then the per-node cost of the switch was just too high (a factor of 3-4 compared to 24 nodes). But on the application performance side, smaller clusters can benefit from InfiniBand, particularly as the number of cores per node increases. Smaller clusters don't necessarily need the huge bandwidth that DDR InfiniBand offers, and many times they don't need its extremely low latency either. The bandwidth and latency of SDR InfiniBand will greatly help their applications. But InfiniBand has always been considered too expensive. Until now.
Mellanox and Colfax International have teamed up to bring back SDR, but at a price point that makes it extremely attractive for small clusters. At this point you're saying, "Shut up and tell me the prices!" As I tell my children, "Just relax," though I usually end up with something thrown in my general direction. Since I don't want anyone to throw things at me, let's go over the prices. BTW - the website with all of the prices is here.
Note: The HCA listed in Table Two does not seem to have recent public benchmark data available. Therefore, actual performance may differ from that shown in Table One.
Table Two - SDR InfiniBand Pricing from Colfax (without shipping)

| Item | Price |
| Low Profile HCA Card, Memory Free, RoHS (R5) Compliant, (Tiger) | $125 |
| 24-port 4X SDR w/ Media Adapter Support, one power supply | $2,400 |
So let's do a little math. Table Three below has the InfiniBand prices for 8 nodes.
Table Three - 8 nodes with SDR InfiniBand

| Item | Price |
| HCAs (8 of them) | $1,000 |
| 8-port SDR switch | $750 |
| 1 meter CX-4 cables (8 of them) | $280 |
| Total | $2,030 |
| Price Per Node | $253.75 |
So if you buy SDR InfiniBand for 8 nodes, you will pay less than $255 a node (without shipping, of course)!
Let's do the same thing for a 24-node SDR cluster.
Table Four - 24 nodes with SDR InfiniBand

| Item | Price |
| HCAs (24 of them) | $3,000 |
| 24-port SDR switch | $2,400 |
| 1 meter CX-4 cables (24 of them) | $840 |
| Total | $6,240 |
| Price Per Node | $260.00 |
The per-node price is slightly higher than for 8 nodes because of the switch cost. I'm not sure about you, but I think this is a fantastic price, and it's moving in the general direction of GigE! (Well, not quite, but it's getting there!)
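If you want to play with the numbers yourself, here's a small Python sketch of the arithmetic behind Tables Three and Four. The per-unit prices ($125 per HCA, $35 per cable, $750 for the 8-port switch, and $2,400 for the 24-port switch) are simply the table totals divided back out, so swap in whatever Colfax actually quotes you.

```python
# Per-node cost arithmetic behind Tables Three and Four.
def per_node_cost(nodes, hca_price, cable_price, switch_price):
    """Return (total, per-node) cost for an SDR InfiniBand fabric."""
    total = nodes * (hca_price + cable_price) + switch_price
    return total, total / float(nodes)

# 8 nodes with an 8-port switch, 24 nodes with a 24-port switch.
for nodes, switch_price in [(8, 750.00), (24, 2400.00)]:
    total, per_node = per_node_cost(nodes, 125.00, 35.00, switch_price)
    print("%2d nodes: total $%.2f, $%.2f per node" % (nodes, total, per_node))
```

Running it reproduces the $253.75 and $260.00 per-node figures above.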
How do I Get Me Some of That?
Ordering SDR InfiniBand at these prices is easy. Colfax International has set up a webpage that allows you to order on-line! Just go to the page and place your order. If you need large quantities or special arrangements, please send an email to sales (you know what to put here) colfaxdirect.com.
Please Note: Neither ClusterMonkey nor any of its authors has a financial interest in Colfax International. We just like cheap hardware.
To Infinity and Beyond!
I hate to end with a Buzz Lightyear quote, but it seems somewhat appropriate. For smaller clusters you usually had to rely on GigE as the interconnect. Now you can afford to add SDR InfiniBand to these systems without it being too expensive. This means we get a big boost in performance on these smaller systems (including the one in my basement! Woo! Hoo!). Now we can truly begin to think outside the box, or rather, outside the server room.
We can start thinking about adding a parallel file system to these smaller clusters, or even about exporting NFS over native IB protocols from the master node. Also, don't forget that you can run TCP over IB (see The OpenFabrics Alliance for the complete software stack). Even with SDR InfiniBand you will get much faster TCP performance over IB than over GigE. So you can start thinking about applications or places where GigE limits performance (anyone want to play multi-player games using IPoIB?).
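If you want to see the IPoIB difference for yourself, a throwaway TCP throughput test like the sketch below is enough. It assumes the IPoIB interface (typically ib0) is up and has an IP address; the port, commands, and filename here are placeholders, not anything specific to the OpenFabrics stack. Run the server on one node and point the client at the server's IPoIB address, then repeat over the GigE address and compare.

```python
# Crude TCP throughput test: run "python tcp_test.py server" on one node and
# "python tcp_test.py client <server_ip>" on another. Point <server_ip> at the
# IPoIB address to measure TCP-over-IB, or at the GigE address to compare.
import socket
import sys
import time

PORT = 5001
CHUNK = 1024 * 1024          # 1 MB per send
TOTAL = 1024 * 1024 * 1024   # move 1 GB in total

def server():
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    s.bind(("", PORT))
    s.listen(1)
    conn, addr = s.accept()
    received = 0
    while received < TOTAL:
        data = conn.recv(CHUNK)
        if not data:
            break
        received += len(data)
    conn.sendall(b"k")       # acknowledge so the client can stop its clock
    conn.close()

def client(host):
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.connect((host, PORT))
    payload = b"x" * CHUNK
    start = time.time()
    sent = 0
    while sent < TOTAL:
        s.sendall(payload)
        sent += CHUNK
    s.recv(1)                # wait for the server's acknowledgement
    elapsed = time.time() - start
    s.close()
    print("%.1f MB/s" % (sent / elapsed / 1.0e6))

if __name__ == "__main__":
    if sys.argv[1] == "server":
        server()
    else:
        client(sys.argv[2])
```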
Jeff Layton is having way too much fun writing this article, proving that it's hard to keep a good geek down. When he's not creating havoc in his household, he can be found hanging out at the Fry's coffee shop (never during working hours), admiring the shiny new CPUs that come in, and cringing when someone buys Microsoft Vista.