Buying the Best Linux Performance? | Cluster Newbie

From the "How tall is a building?" department

The Beowulf Mailing List is a source of valuable information about high performance computing (HPC) Linux clusters. Conversations on the list apply not only to HPC, but to Linux performance on any system. Recently (March 8, 2007) the following was posted to the Beowulf Mailing List:

I would like to know what server has the best performance for HPC systems between The Dell Poweredge 1950 (Xeon) And 1435SC (Opteron). Please send me suggestions...

Here are the complete specifications for both servers:
- Poweredge 1435SC, Dual Core AMD Opteron 2216 2.4GHz 3GB RAM 667MHz, 2x512MB and 2x1GB Single Ranked DIMMs
- Poweredge 1950, Dual Core Intel Xeon 5130 2.0GHz 2GB 533MHz (4x512MB), Single Ranked DIMMs

From your specifications, almost certainly the Opteron. For a variety of reasons, but higher clock certainly helps -- it would probably have been faster at equivalent clock anyway. Now that I've "answered", let me tell you why you shouldn't believe me and what you should actually do to answer your own question.

There is a standard litany we like to chant on the Beowulf list:

- Your Mileage May Vary
- A benchmark in hand is worth any number of anecdotal reports
- The best benchmark is your own application
- What do you plan to do with it?
- It depends...

In particular, it depends on your application (mix), its memory and disk and network requirements, the topology and type of your network, the communication and memory access pattern used by your application (mix), the compiler and library used, and a few dozen other variables, major and minor. That is why nobody is going to tell you one is always better than the other, even if they think it is true.

And then there is the cost -- the REAL question is which one has the better cost-benefit, not which one is the cheapest or the fastest taken on its own. Ask yourself the question: with a fixed budget to spend, which architecture lets me get the most done in the least time?
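The fixed-budget question can be written down as a small calculation. The following is only a sketch: the `Candidate` class, the node names, the prices, and the jobs-per-hour figures are all made up for illustration, and it assumes perfect scaling across nodes, which a network-bound application will not give you.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    price_per_node: float  # hypothetical purchase price, same currency as the budget
    jobs_per_hour: float   # measured with YOUR application on one node


def cluster_throughput(c, budget):
    """Return (nodes affordable, total jobs/hour), assuming perfect scaling."""
    nodes = int(budget // c.price_per_node)
    return nodes, nodes * c.jobs_per_hour


def best_for_budget(candidates, budget):
    """Pick the candidate that gets the most work done for a fixed budget."""
    return max(candidates, key=lambda c: cluster_throughput(c, budget)[1])


if __name__ == "__main__":
    # Made-up numbers for illustration only; not measurements of any real server.
    candidates = [
        Candidate("node_a", price_per_node=2500.0, jobs_per_hour=10.0),
        Candidate("node_b", price_per_node=2000.0, jobs_per_hour=8.5),
    ]
    budget = 100_000.0
    for c in candidates:
        nodes, total = cluster_throughput(c, budget)
        print(f"{c.name}: {nodes} nodes, {total:.1f} jobs/hour")
    print("best:", best_for_budget(candidates, budget).name)
```

With these invented numbers the slower but cheaper node wins (50 nodes at 8.5 jobs/hour beat 40 nodes at 10), which is the point: neither "fastest" nor "cheapest" alone answers the question.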

So I wouldn't be doing you a favor by telling you "definitely the Opteron", and only partly because it might not be true. If you believed me (because I sound so glib and because you don't know that AMD once sent me a cool tee-shirt and Intel hasn't, although I do have a pair of these cool little contamination-suited Intel dude key chains that come close) then you might be tempted to skip the correct cluster engineering step(s) of:

1) Study your application (mix) -- figure out in at least general terms its (their) communication patterns, its (their) memory requirements (size and access pattern), its (their) CPU requirements. Some applications are "I/O bound" -- they run at a speed determined by the access speed of disk, for example. Some applications are "memory bound" -- they spend all of their time fetching data from memory and relatively little actually doing something to it. Some applications (especially parallel cluster applications) are "network bound" and run at a rate that is determined by the latency or bandwidth of a network connection, further complicated in the case of real parallel code by the communication PATTERNS, which can cause bottlenecks outside of the system altogether. Some applications (the happiest ones, I tend to think:-) run at a rate that is limited by CPU clock (clock, clock and nothing but clock), although different CPU architectures (e.g. Xeon and Opteron, 32 or 64 bit) have a different BASE performance at any given clock.
2) If at all possible, and it nearly always is possible, beg, borrow, steal, buy, or rent a system or two in your competing architectures and run Your Code compiled with Your Target Compiler on those systems and just measure its performance. This is actually a whole lot easier than the stuff in step 1) and a lot more likely to be accurate, but I still don't advise skipping 1). If you are planning on buying more than a handful of systems, it is often worth your while to buy one of each (or even three or four candidate systems), test them, and then buy the other 127 or however many nodes you plan to put in your winning cluster. The other option is possibly buying 128 of the wrong kind. You can always recycle the losers as servers, really powerful desktops, whatever. A really good vendor will often loan you systems (or network access to systems) to do this testing. A really good compiler vendor (e.g. Pathscale) will often lend you a compiler for a trial period to do the testing. Other compiler vendors such as Intel and Portland Group offer free trial licenses you can use as well.
3) Don't lock yourself in to a single distributor while looking over systems. I personally don't use some tier 1 vendors (e.g. IBM, Dell, HP), although I know people that do. I have found that some (not all) tier 1 hardware is not the most reliable, but their service plans tend to be very good, their cost is reasonable, and they aren't Linux-averse. I do think, however, that they're still working on becoming actively Linux-friendly. There are a number of tier 2 vendors that cater to the HPC crowd. Their hardware is often as good or better, with equally attractive prices and service deals. These include companies like Penguin Computing, Appro, Microway and others. Penguin Computing is my own personal favorite, largely because with the exception of one DOA system out of a good size stack of Altus servers we've purchased (no doubt the one that "fell off the truck" and likely not Penguin's fault) I have yet to see an Altus fail in harness. Seriously, that is pretty extraordinary, given that they have run at full load pretty much 24x7 for over a year. I've heard that their service is really good -- maybe one day I'll have a chance to find out. Penguin will almost certainly let you prototype on their systems.
4) When you've done all the research above, do the cost-benefit analysis. If your application is network bound, don't worry so much about system clock and speed; worry about getting a really high speed cluster network to match (which is expensive, so you may want to get cheaper, slower nodes if the application isn't CPU bound). If your application is memory bound, you may want to skip the dual cores and get systems with two or four single-core processors -- otherwise you might just be using two cores at a time while the other cores are waiting in line to get at memory, wasting all the money you spent on the dual cores in the first place (although dual-core may be the default in the near future). If your application is disk bound, then look more closely at disk and less at CPU -- what kind of bus, what kind of disk subsystem, what are the bottlenecks (per system) and the costs of minimizing them.
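The "run Your Code and just measure its performance" step above can be sketched as a small timing harness. This is a minimal sketch, not a real benchmarking tool: `time_command` and `summarize` are hypothetical helper names, the default command is a placeholder for your own application, and the child CPU times from `os.times()` are only meaningful on Unix-like systems such as Linux.

```python
import os
import statistics
import subprocess
import sys
import time


def time_command(cmd, runs=3):
    """Run cmd `runs` times; return a list of (wall_seconds, cpu_seconds) pairs."""
    samples = []
    for _ in range(runs):
        before = os.times()
        start = time.perf_counter()
        subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL)
        wall = time.perf_counter() - start
        after = os.times()
        # CPU time consumed by the child process (user + system).
        cpu = (after.children_user - before.children_user) + (
            after.children_system - before.children_system)
        samples.append((wall, cpu))
    return samples


def summarize(samples):
    """Median wall time plus the fraction of wall time spent on the CPU."""
    wall = statistics.median(w for w, _ in samples)
    cpu = statistics.median(c for _, c in samples)
    return {"median_wall_s": wall, "cpu_fraction": cpu / wall if wall > 0 else 0.0}


if __name__ == "__main__":
    # Placeholder workload; pass your own application as command-line arguments.
    cmd = sys.argv[1:] or [sys.executable, "-c", "sum(range(10**6))"]
    print(summarize(time_command(cmd)))
```

Run the same command on each candidate system and compare the median wall times. A CPU fraction well below 1 hints that the job spends much of its time waiting (on disk or network, say) rather than computing. It will not reveal a memory-bound job, since time stalled on memory is counted as CPU time, and it can exceed 1 for multithreaded code.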

As you can hopefully now see, the right question to have asked isn't which of two particular systems out of twenty on the market is "best" in some amorphous way; it is which of the twenty systems, in the two thousand possible ways of configuring them with network, disk, memory, CPU, and compiler, will get the most work done for your investment of a fixed amount of money. Answer that, and then make your purchase with confidence.

I'm sure that other list-humans have experiences or suggestions to share here. If you are very unsure of your ability to carry out the list of chores above, there are at least 2 or 3 professional cluster consultants on the list who would probably help you for a moderate fee -- ask them to contact you offline if you are interested, as they generally won't spam the list beyond maybe letting you know that they exist while helping to answer your original question. They can do anything from helping you with the prototyping and analysis to providing you with a cost-competitive turnkey cluster, depending on your needs and cluster management skills.

There is also the multi-core issue, which we will address in an upcoming article. Stay tuned, because the multi-core snowball has started to roll across the HPC world.

I myself provide the kind of dear-Abby advice above on-list and charge only beer (should we ever meet). Mind you, at this point if I ever actually received the beer due me according to this rule, I would die in a gutter somewhere inside six months with my liver in complete failure, so it is probably just as well that I generally don't go to cluster meetings and so forth.

This article was originally published as a response on the Beowulf Mailing List. It has been edited and formatted for the web. You can view the original post and other interesting cluster discussions by visiting the list archives.

Robert Brown, Ph.D., has written extensively about Linux clusters. You can find his work and much more on his home page.