Using Bare Motherboards and Choosing a High-speed Interconnect

The cluster expert Joe Landman posted some very good questions that everyone should consider before upgrading to a high-speed interconnect. In essence, Joe suggested profiling your application(s) to locate the bottlenecks. Once this is done, you need to decide which aspect of "fast" in a fast interconnect is needed, i.e. low latency and/or high bandwidth. Joe then said that all of the high-speed interconnects have HBAs (Host Bus Adapters, i.e. a "NIC") that range in price from $500 to $2,000, with switch prices of about $1,000 a port (so anywhere from $1,500 to $3,000 a port in total).
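To make the per-port arithmetic concrete, here is a minimal sketch that scales Joe's quoted price ranges to a whole cluster. It assumes one HBA and one switch port per node and ignores cables and multi-tier switching; the function and constant names are illustrative and not from the original thread.

```python
# Price figures quoted in the thread (2004 US dollars).
HBA_PRICE_RANGE = (500, 2000)   # low and high price of one HBA
SWITCH_PORT_PRICE = 1000        # approximate price of one switch port


def interconnect_cost_range(nodes):
    """Return (low, high) interconnect cost for a cluster of `nodes` nodes.

    Assumption: each node needs exactly one HBA and one switch port.
    """
    low = nodes * (HBA_PRICE_RANGE[0] + SWITCH_PORT_PRICE)
    high = nodes * (HBA_PRICE_RANGE[1] + SWITCH_PORT_PRICE)
    return low, high


if __name__ == "__main__":
    for n in (1, 16, 128):
        low, high = interconnect_cost_range(n)
        print(f"{n:4d} nodes: ${low:,} to ${high:,}")
```

Comparing this range with the price of the compute nodes themselves gives a quick sense of how much of the budget the interconnect would consume.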

Michael Prinkey suggested looking at the GAMMA project, which is a similar idea to the M-VIA (Virtual Interface Architecture) implementation over GigE. The well respected Mikhail Kuzminsky said that they have had trouble with GAMMA on SMP kernels and with the Intel Pro/1000 NICs. Moreover, they found that Intel frequently modified the chipset, causing compatibility problems. [Note: Some of these problems have been fixed in newer versions of GAMMA. In addition, the M-VIA project seems to have been discontinued.]

List regular Robert Brown then posted some further suggestions about determining if the application(s) needed a high-speed interconnect. One thing he suggested is that if the application passes large messages then it is likely to be bandwidth limited. If it passes a bunch of small messages, then it is likely to be latency limited. Robert also suggested talking to the various high-speed interconnect vendors to see if they had a 'loaner' cluster that could be used for testing.
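Robert's rule of thumb can be illustrated with the common first-order model in which sending a message costs a fixed latency plus the message size divided by the bandwidth. The sketch below is not from the original thread, and the latency and bandwidth values in it are hypothetical placeholders rather than measurements of any particular interconnect.

```python
def transfer_time(size_bytes, latency_s, bandwidth_bytes_per_s):
    """First-order cost model: fixed latency plus time on the wire."""
    return latency_s + size_bytes / bandwidth_bytes_per_s


def latency_fraction(size_bytes, latency_s, bandwidth_bytes_per_s):
    """Fraction of the total transfer time that is due to latency."""
    total = transfer_time(size_bytes, latency_s, bandwidth_bytes_per_s)
    return latency_s / total


def classify(size_bytes, latency_s, bandwidth_bytes_per_s):
    """Label a message size by whichever term dominates the model."""
    frac = latency_fraction(size_bytes, latency_s, bandwidth_bytes_per_s)
    return "latency limited" if frac > 0.5 else "bandwidth limited"


if __name__ == "__main__":
    # Hypothetical link parameters, chosen only for illustration.
    latency_s = 50e-6      # 50 microseconds per message
    bandwidth = 100e6      # 100 MB/s sustained

    for size in (64, 1024, 65536, 1048576):
        frac = latency_fraction(size, latency_s, bandwidth)
        label = classify(size, latency_s, bandwidth)
        print(f"{size:8d} bytes: {frac:6.1%} of time is latency -> {label}")
```

In this model the crossover message size is simply latency times bandwidth, so a profile of your application's message sizes tells you which side of the line it sits on.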

Chris Samuel posted that Fluent is very latency sensitive and that Fluent is likely to support Myrinet on Opteron CPUs.

The experienced Mark Hahn posted that the last clusters he bought with Myrinet and IB had about the same latency and cost. He hasn't seen any users who are bandwidth limited, which means IB's superior bandwidth has not mattered for them. He did point out that Myrinet could use dual-rail cards, which give Myrinet the same bandwidth as IB but raise its cost above that of IB. Mark also thought that Quadrics, while more expensive than Myrinet and IB, had lower latency and about the same bandwidth as IB. He admitted that he was a bit skeptical about IB claims of better performance and lower cost, but also admitted he didn't have any IB experience. He also echoed the comments of others that you need to be sure that you need a high-speed interconnect by testing your application(s).

Matt Leininger took issue with some of Mark's comments about IB not being field tested. He stated that where he works they have been running IB in production on clusters of 128 nodes and up. He also mentioned that a large cluster in Japan, RIKEN, has been running IB stably for over 6 months. Matt also mentioned that IB has several vendors to choose from (he mentioned four of them). He lastly pointed out that IB has much more field time than the latest Myrinet offering [Note: This was 2004. Things are much different today with a large number of IB clusters in production and Myricom has a new interconnect - Myri-10G].

Joe Landman responded that AMD has a Developers Center with at least one cluster with IB. It might be worthwhile to get an account and test on this machine. Joe also observed that he thought IB was drawing widespread support and is not single sourced.

Mark Hahn responded to Matt's posting and thanked him for the information on IB clusters in production. Mark also asked whether the various IB vendors simply used Mellanox chips with some minor mods, which would effectively limit IB to a single vendor. Mark remained a bit skeptical about IB.

Daniel Kidger from Quadrics then posted with some more detailed information about their offerings. He said that their QsNetII interconnect sells for about $1,700 a node ($999 a card with the rest being for switches and cables). He also wrote that IB has about the same bandwidth, but twice the latency (3.0 μsec for IB vs. 1.5 μsec for QsNetII). He also said that Myrinet was lagging behind but that Myricom did have a new product coming out. Daniel then went on to echo the comments of others that you need to profile your application(s) and test on the various high-speed interconnects.

Bill Broadley also echoed the comments of everyone about testing one's code (have you noticed all of the luminaries in the Beowulf community are saying the same thing - test your application first?). Bill also had a good suggestion about forcing the GigE NICs to only go at Fast Ethernet speeds to see the effects on the code performance (a quick and dirty way to test the effects of network performance).
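One hedged way to read the result of Bill's test: if run time is assumed to split into a compute part that does not depend on the network and a communication part that scales inversely with link bandwidth, two timings are enough to estimate the communication share. This model is an assumption added here, not something from the thread. It ignores latency, overlap of communication with computation, and congestion, so treat the result as a rough indicator only.

```python
def comm_fraction(t_gige_s, t_fast_s, slowdown=10.0):
    """Estimate the fraction of GigE run time spent in communication.

    Assumed model (not from the original thread):
        t_gige = compute + comm
        t_fast = compute + slowdown * comm
    where `slowdown` is the bandwidth ratio between the two link speeds
    (10 for Gigabit Ethernet versus Fast Ethernet).
    """
    if t_gige_s <= 0:
        raise ValueError("GigE run time must be positive")
    if t_fast_s < t_gige_s:
        raise ValueError("run at the slower link speed should not be faster")
    comm = (t_fast_s - t_gige_s) / (slowdown - 1.0)
    # Clamp, since the simple model can overshoot on noisy timings.
    return min(comm / t_gige_s, 1.0)


if __name__ == "__main__":
    # Hypothetical wall-clock timings in seconds for the same job.
    t_gige = 100.0
    t_fast = 190.0
    print(f"estimated communication share: {comm_fraction(t_gige, t_fast):.0%}")
```

A share near zero suggests a faster interconnect will buy little, while a large share is a hint that further profiling of the communication pattern is worth the effort.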


Finally, Joe Landman suggested getting "atop" to help profile your application. He also mentioned that if you see a lot of time being spent in a process called 'do_writ', then the code could also be I/O bound, which opens up a whole new can of worms.

Sidebar One: Links Mentioned in Column

GAMMA

atop

Myricom

Quadrics

Mellanox

This article was originally published in ClusterWorld Magazine. It has been updated and formatted for the web. If you want to read more about HPC clusters and Linux you may wish to visit Linux Magazine.

Jeff Layton has been a cluster enthusiast since 1997 and spends far too much time reading mailing lists. He can be found hanging around the Monkey Tree at ClusterMonkey.net (don't stick your arms through the bars though).


    Creative Commons License
    ©2005-2012 Copyright Seagrove LLC, Some rights reserved. Except where otherwise noted, this site is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 2.5 License. The Cluster Monkey Logo and Monkey Character are Trademarks of Seagrove LLC.