
Can Doug break the record he and Jeff set previously? Can he overcome the chains of reality? Inquiring minds want to know!

Cluster optimization is often considered an art form. In some cases it is less art and more like flipping a coin. A previous article described the Kronos value cluster and its record-breaking price-to-performance results. Can an investment of time and effort break this record, or have we hit the cluster wall?

Introduction

If you followed along with our previous value cluster articles, you will remember that Jeff Layton and I built and optimized a cluster for under $2500. Our results for the cluster set a record of sorts for achieving 14.53 GFLOPS (52% of peak) at a cost of $171/GFLOP (double precision). We used the famous Top500 benchmark called HPL (High Performance Linpack). Not one to rest on my laurels, I decided to push on and see if I could improve on our HPL performance.

The next step in improving performance is to figure out what to optimize. There are plenty of things to change, so where to start is a good question. When fixing cars, my father always had a rule: start with the easiest and cheapest fix, then move to the next easiest/cheapest part until you (hopefully) fix the problem. It is easier and cheaper to replace a gas filter than it is to replace a carburetor. Our efforts thus far have taken a similar approach: we tweaked the program parameters, chose the best BLAS (Basic Linear Algebra Subprograms) library, and tuned the Ethernet driver. In this article, I would like to find the "good enough wall" for the cluster, which is basically the point of diminishing returns.

Also remember that the source code cannot be changed, so we are limited to program parameters, the cluster plumbing (interconnects, drivers), middle-ware (MPI), and compiler tools. Those who have experience in this area will obviously raise their hands and say, "What about other compilers and MPI versions?" My response is, "Well of course, it seems easy enough -- which is a sure indication that it probably will not be."

If we are to try other compilers and MPI versions, we need to keep in mind that there is quite a list of alternatives (see Sidebar Two). In order to finish testing within this decade, we will limit our choices to some of the more popular packages. These choices should not be taken as a slight toward the other alternatives. There are plenty of both open and commercial tools available, and your choice should be based on your needs.

Sidebar One - The Current Kronos Record

As reported in the November 2005 issue, the current record for our value cluster (called Kronos) is 14.53 GFLOPS. This result was achieved using a cluster composed of eight AMD Sempron 2500 processors with 256 MBytes of PC2700 RAM (512 MBytes on the head node) and a Gigabit Ethernet network. We tuned the program parameters, tried different BLAS libraries, and tuned the Gigabit Ethernet links (6000 Byte MTU, turned off interrupt mitigation strategies). We are also using the Warewulf cluster toolkit (see Resources). For this article, we will use the same hardware and only change software.
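
For those who want to replicate the link tuning, the settings can be applied with standard tools. A rough sketch follows; the eth0 device name and whether your driver supports these options are assumptions:

  # set a 6000 Byte MTU on the cluster interface (device name is an assumption)
  ifconfig eth0 mtu 6000
  # turn off interrupt mitigation (coalescing), if the driver supports it
  ethtool -C eth0 rx-usecs 0 rx-frames 1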

Another important note is worth mentioning. As I try to push the GFLOPS higher, some things may work for this application and some things may not. It would be a mistake to assume this is the case for all applications - particularly yours. Please keep this proviso in mind when interpreting the results.

We are going to look at tuning the TCP values, changing the MPI library, changing the compiler, and finally using a kernel-bypass MPI library. As we cannot try every possible combination (see Sidebar Two), we will at least go through the exercise of changing these parameters and, at minimum, learn a few things about how to turn your hair gray (in my case, more gray).

A Nudge, Not a Bump

The easiest thing to try is tuning the TCP parameters. I am borrowing from a Berkeley Lab TCP Tuning Guide I found on-line (see Resources Sidebar). Listing One shows the settings that were added to the /etc/sysctl.conf file. You can effect the changes by simply running sysctl -p. For those interested, all TCP implementations use buffers for the transfers between nodes. In newer Linux kernels, the kernel will auto-tune the buffer size based on the communication pattern it encounters. Our changes increase the maximum tunable size of the buffers.

Re-running our best case from before with the new settings shows the smallest of increases, to 14.57 GFLOPS (multiple runs confirmed that this increase is statistically significant). The change was easy; not much improvement, but no huge amount of time lost.

Listing One - New TCP Parameters

  # increase TCP max buffer size
  net.core.rmem_max = 16777216
  net.core.wmem_max = 16777216
  # increase Linux autotuning TCP buffer limits
  # min, default, and max number of bytes to use
  net.ipv4.tcp_rmem = 4096 87380 16777216 
  net.ipv4.tcp_wmem = 4096 65536 16777216
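
After editing /etc/sysctl.conf, loading and verifying the values takes two commands (the output you see will vary with your kernel):

  # load the new settings from /etc/sysctl.conf
  sysctl -p
  # confirm the new maximum receive buffer size took effect
  sysctl net.core.rmem_max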


One thing to remember is that there is no best compiler/MPI combination for every program! Of course you are hoping for the best, but you may never know if you are getting the best performance -- until some uber geek on a mailing list lets you in on the secret combination that got his application 10% more performance than yours.

Finally, there are those pesky optimization flags that vary from compiler to compiler and from processor to processor. Indeed, the trick is knowing when "good enough" is really good enough. If you know your application, in most cases you will have a "feel" for this point in the optimization process.

Pick an MPI

Up until this point we have been using LAM/MPI, which is the default MPI in the Warewulf distribution. The version we used was 7.0.6. The most logical MPI to try next is MPICH from Argonne National Lab. The latest version of MPICH is MPICH2. With much anticipation, I compiled MPICH2 and set about running the HPL code.
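
For reference, the build and start-up steps looked something like the following sketch; the install prefix and hosts file name are assumptions, and as described below, the mpd daemons turned out to be the sticking point:

  # typical MPICH2 build (install prefix is an assumption)
  ./configure --prefix=/opt/mpich2
  make && make install
  # MPICH2 launches jobs through its mpd daemons
  mpdboot -n 8 -f mpd.hosts
  mpiexec -np 8 ./xhpl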

Sidebar Two - Too Much of a Good Thing

If you are looking for the best performance, then you know that compilers and libraries are a great way to "easily" make changes. Unfortunately, "the devil, they say, is in the linker". Furthermore, the array of choices is rather daunting. Let's first consider compilers. A short list of Linux x86 compilers includes GNU, Portland Group, Intel, Pathscale, Lahey, Absoft, and NAG. Similarly, a short list of MPI implementations would include LAM/MPI, MPICH1, MPICH2, Intel-MPI, MPI/Pro, WMPI, Open-MPI, and Scali MPI Connect. Not to mention all the variations for different interconnects within each MPI. So, let's do the math for the short list. Seven compilers times eight MPIs is 56 possible combinations. Considering that building and running your application with each combination may range from easy to aggravating and may take a non-trivial amount of time, it is no wonder that perfect is often considered the worst enemy of good enough.

The first problem I encountered was the need for some shared libraries to be on the nodes. Recall that a Warewulf cluster uses a minimal RAM disk on each node. I added the libraries to the cluster VNFS (Virtual Node File System), built a new node RAM disk image, and rebooted the nodes in all of 5 minutes. Continuing, I also recalled that MPICH2, similar to LAM, now uses daemons (instead of rsh or ssh) to start remote jobs. To my surprise, I found that the daemons needed a version of Python on each node to run. While Python and other such languages are great tools, I prefer the "less is better" approach with clusters, which is pretty much the Warewulf approach as well. Requiring Python on each node seems to me to be a move in the wrong direction. In any case, since getting the daemons working under Warewulf will take some time and testing, I decided to take a step back and use the old reliable MPICH1.
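
Tracking down which shared libraries the nodes are missing is simple enough with ldd; a sketch (the library name and VNFS path below are assumptions for illustration, and they vary with the Warewulf setup):

  # list the shared libraries the MPI-linked binary needs
  ldd ./xhpl
  # copy any missing libraries into the node image, then rebuild the RAM disk
  # (the VNFS path is an assumption)
  cp /usr/lib/libmpich.so.1.0 /vnfs/default/usr/lib/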

After a quick configure; make; make install, I had MPICH1 running on the cluster. The test programs worked, so it was time to compile HPL and see if we could increase our score. Of course, some fiddling with environment variables and the HPL Makefile was needed to ensure the right libraries were used. Running the benchmark resulted in 13.9 GFLOPS. This result was good, but of course not our best. For the MPI jihadists out there, this result does not necessarily mean LAM is always better than MPICH; for this code it might be, but I have seen other codes where MPICH beats LAM as well. After looking at the MPICH1 results, it seemed that trying OpenMPI might be worthwhile. OpenMPI is a new, highly modular MPI being written by the LAM/MPI, FT-MPI, LA-MPI, and PACX-MPI teams. The final release is imminent, so it seemed like it might be helpful to have another data point. After downloading and building, the test programs worked, so running HPL was next. The program started, but basically stalled. After talking with one of the authors, I learned that these are known issues. Sometimes I think my goal in life is to validate known issues. Time to move on. The compiler is next.

Pick A Compiler and an MPI

The compiler is one of those really cool tools that at times can give your application a really nice kick in the asymptote. Again, it all depends on your application, but some of the commercial compilers have a few more hardware-specific options than the GNU compilers, so they are worth a try. Presently, most of the commercial compilers are focused on x86-64 level processors and have no great interest in optimizing for Sempron processors. For this project, the Portland Group (PG) compiler was chosen because it has been reliable and robust in the past. The 15-day free trial helped as well. In any case, it is just a recompile, right?

There are three basic components in the HPL program; the program itself, the BLAS library, and the MPI library. The cleanest way to build an application with a new compiler is to build all the supporting libraries as well. Otherwise, you may end up scratching your head as a multitude of linking problems pass in front of your eyes. Building MPIs with alternate compilers has been well documented, so the task now looked to be as follows:

  1. Build a new version of MPI with PG
  2. Build a new version of Atlas with PG
  3. Build a new version of HPL with PG, linking the components in 1 and 2

I ran make. No joy. There is an error message from the Atlas make procedure about an undefined case. Fair enough. Time to check the HPL documentation. It seems they don't recommend the PG compiler to build Atlas. That would be two known issues I have successfully validated thus far.

Moving on, the PG compiler is very good at linking in GNU-compiled code, so I'll just use the GNU Atlas libraries and build HPL with PG. After some makefile magic, I have a new HPL binary compiled with some cool optimization flags (-fastsse -pc=64 -O2 -tp athlonxp). The code is run and the GFLOPS hit 13.92. Some further fiddling with compiler options does not really change things. At this point, the MPICH1-PG version is slightly better than the MPICH1-GNU version, but worse than the LAM/MPI-GNU version. Undaunted, the next thing to try is a LAM/MPI-PG combination. After some more work, the code is running and, lo and behold, a new record of 14.90 GFLOPS, but no fanfare. The amount of time spent on the MPI/compiler rebuilds was easily two days. The amount of improvement is 0.33 GFLOPS. A new tack is needed.
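
For the curious, the "makefile magic" amounts to pointing HPL's architecture Makefile at the right compiler and libraries. A hedged excerpt of the relevant lines is below; the Make.Linux_PG name and the install paths are assumptions for illustration:

  # excerpt from a hypothetical hpl/Make.Linux_PG (paths are assumptions)
  CC       = /opt/pgi/linux86/bin/pgcc
  CCFLAGS  = -fastsse -pc=64 -O2 -tp athlonxp
  LINKER   = $(CC)
  # GNU-built Atlas libraries, linked in by the PG compiler
  LAdir    = /opt/atlas/lib
  LAlib    = $(LAdir)/libcblas.a $(LAdir)/libatlas.a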
