A Tool for Cluster Performance Tuning and Optimization | Benchmarking Methods

Do you know about the Beowulf Performance Suite?

The Beowulf Performance Suite (BPS) was designed to provide a comprehensive and comparative way of measuring cluster performance. Although BPS contains many benchmarking programs, BPS is not designed to directly benchmark clusters. BPS is designed as an analysis tool to measure differences due to hardware or software changes on the same cluster. In addition, successfully running all the tests provides some assurances that the cluster is configured properly.

The suite can run any or all programs and produce HTML output files. The use of HTML makes it trivial to share your results with others on the web. The following tests are available:

bonnie++ - hard drive performance
stream - memory performance
netperf - general network performance
netpipe - detailed network performance
nas - NAS parallel tests
unixbench - general Unix benchmarks
lmbench - micro Linux benchmarks

As part of the design, all the tests are "open source" and the tar files for all programs are included in the binary RPM. Some of the tests are already compiled, while others are built when the test runs. It should be emphasized that the intent of BPS is not to obtain optimum numbers for your cluster, but rather to generate a baseline against which to measure the effects of changes on your cluster.

Some users have questioned why the High Performance LINPACK (HPL) benchmark is not included in the suite. While HPL is a good measure of performance, the main concern was that, because HPL is used to rate the top supercomputers, the test suite would be used as a "my cluster is better than your cluster" set of tests. Our goal was to provide a set of tools to measure performance increases (or decreases) when things are changed.

Obtaining/Installing BPS

You may download the BPS binary rpm (i386 built on FC4), a binary rpm (x86_64 built on FC6), a source rpm, or a tar file. BPS requires Perl, gnuplot, and expect to work properly. These packages are found in most distributions or are otherwise easily added. It also requires password-less rsh between nodes (ssh is not supported in this version).
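The requirements above can be checked before installing. The following is only a minimal sketch: the function name check_bps_deps is made up for illustration and is not part of BPS, and it only confirms that the commands are on your path, not that password-less rsh actually works between nodes.

```shell
# Report which of the named tools cannot be found on PATH.
# check_bps_deps is a hypothetical helper, not something BPS ships.
check_bps_deps() {
    missing=""
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            missing="$missing $tool"
        fi
    done
    if [ -n "$missing" ]; then
        echo "missing:$missing"
        return 1
    fi
    echo "all found"
}

# The tool list mirrors the requirements stated in the text.
check_bps_deps perl gnuplot expect rsh || true
```

If anything is reported as missing, install it from your distribution before running the suite.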

Running The Suite

Before running BPS, it is best to examine the options. This step can be accomplished by entering "bps" at the command prompt. (Note: If you installed the tar file, you will need to add ~/bps/bin to your path.) The options are as follows.

Usage: /opt/bps/bin/bps 

Options:
  -b                            bonnie++
  -s                            stream
  -f <send node>,<receive node> netperf to remote node
  -p <send node>,<receive node> netpipe to remote node
  -n <compiler>,<#processors>,  NAS parallel benchmarks
     <test size>,<MPI>,         compiler={gnu,pgi,intel}
     <machine1,machine2,...>    test size={A,B,C,dummy}
                                MPI={mpich,lam,mpipro}
  -k                            keep NAS directory when finished
  -u                            unixbench
  -m                            lmbench
  -l <log_dir>                  benchmark log directory
  -w                            preserve existing log directory
  -i <mboard manufacturer>,     machine information
     <mboard model>,<memory>,
     <interconnect>,<linux ver>
  -v                            show version
  -h                            show this help

All output files from bps are placed in the log directory. If you do not specify a log directory, the bps script will use the default name "bps-logs". If a previous bps-logs directory exists, you may preserve its results by using the "-w" option: the old directory will be renamed to "bps-logs.some-unique-number" and a new bps-logs directory will be created.
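The effect of "-w" can be pictured with a few lines of shell. This is a sketch of the behavior described above, not the code bps uses: the function name preserve_logs is hypothetical, and the unique number here is assumed to start from the shell's process ID because the text does not say how bps chooses it.

```shell
# Move an existing log directory aside under a unique numeric suffix,
# then create a fresh, empty one. Prints the name of the preserved copy.
# Assumption: the suffix starts at the shell PID; bps may pick it differently.
preserve_logs() {
    log_dir=${1:-bps-logs}
    if [ -d "$log_dir" ]; then
        n=$$
        # Step past any suffix that is already taken.
        while [ -e "$log_dir.$n" ]; do
            n=$((n + 1))
        done
        mv "$log_dir" "$log_dir.$n"
        echo "$log_dir.$n"
    fi
    mkdir -p "$log_dir"
}
```

Because the old directory is renamed rather than deleted, earlier results remain available for comparison after a hardware or software change.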

The tests can be broken into three categories: single system (i.e. host machine), network (any two machines), and parallel (any number of machines). The host machine tests are stream, bonnie++, lmbench, and unixbench. The network tests are netperf and netpipe. The parallel tests are the eight NAS parallel benchmarks (see below).

Some Examples

The following are some example command lines run on a cluster called "kronos". For example, to run stream simply enter:

bps -s

The network tests can be run by specifying two nodes. Note that the head node does not have to be one of these hosts. In this example, kronos01 is the send node and kronos03 is the receive node:

bps -f kronos01,kronos03

You may also run multiple tests (they are run one at a time) and create a new log directory (preserving previous results) from the same command line:

bps -w -b -m -u -p kronos04,kronos02

Running the NAS suite is the most cumbersome as all the nodes need to be listed. The command below keeps the NAS directory (npb) in the log directory. It also uses the gnu compilers and LAM/MPI.

bps -k -n gnu,8,A,lam,kronos,kronos01,kronos02,kronos03,kronos04,kronos05,kronos06,kronos00
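Typing the node list by hand is error-prone, so it can help to generate it from a file. The sketch below is an illustration only: nas_command and the hosts file are hypothetical and not part of BPS, and the processor count is assumed to equal the number of hosts, as it does in the example above. The function prints the command rather than running it so that you can inspect it first.

```shell
# Build a "bps -n" command line from a file with one host name per line.
# nas_command is a hypothetical helper; it only prints the command.
# Assumption: one processor per listed host.
nas_command() {
    compiler=$1 size=$2 mpi=$3 hosts_file=$4
    # Drop blank lines, then join the remaining names with commas.
    nodes=$(grep -v '^[[:space:]]*$' "$hosts_file" | paste -sd, -)
    # Count the non-blank lines to get the processor count.
    count=$(grep -vc '^[[:space:]]*$' "$hosts_file")
    echo "bps -k -n $compiler,$count,$size,$mpi,$nodes"
}
```

For instance, if a file named machines lists the eight kronos hosts in the order used above, then "nas_command gnu A lam machines" prints the same command line.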

As mentioned, if you are running the NAS tests, you may tell bps to keep ("-k") the NAS directory (npb) in your log directory. This can be useful in tracking down compile errors (see the npb/make.log). HTML output files can be produced from the results files in the log directory by executing:

bps-html bps-logs

where bps-logs is the bps log directory. The results can be viewed by pointing a browser at the index.html file in the log directory. An example of the results can be found here.

Important Notes

All test source files are archived in the /opt/bps/src/ directory.
The BPS suite is best run as a normal user, not root. Some of the tests (i.e. NAS parallel) will not run as root.
When using Netpipe and Netperf benchmarks, rsh with no password must be permitted between the nodes.
Also, the tests have been designed so that the BPS rpm only needs to be installed on the master node. For this to work, the BPS log directory must be mounted on all nodes (e.g. under /home).
NAS Parallel Benchmarks have been tested with specific versions of LAM and MPICH. Please see the NAS documentation for more information. Rather than limit potential BPS users, these are not made a part of the required packages list. The benchmark scripts have been written to rely on three environment variables (LAM_HOME for LAM/MPI, MPICH_HOME for MPICH, and MPIPRO_HOME for MPI-PRO). If you are having problems with the NAS benchmarks, use the "-k" option to run the NAS tests and then go to the bps-logs/npb directory and consult the README.bps file. Also, if you wish to use the Portland Group or Intel compilers, make sure they are properly configured.
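A missing environment variable is an easy thing to rule out before a NAS run. The following sketch maps the MPI names accepted by "bps -n" to the variables named in the last note and reports whether the relevant one is set. The function names are hypothetical, and the check does not verify that the directory holds a working MPI installation.

```shell
# Map an MPI name (lam, mpich, mpipro) to the environment variable
# that the benchmark scripts rely on, as listed in the notes.
mpi_home_var() {
    case $1 in
        lam)    echo LAM_HOME ;;
        mpich)  echo MPICH_HOME ;;
        mpipro) echo MPIPRO_HOME ;;
        *)      return 1 ;;
    esac
}

# Report whether the variable for the chosen MPI is set and non-empty.
# check_mpi_home is a hypothetical helper, not part of BPS.
check_mpi_home() {
    var=$(mpi_home_var "$1") || { echo "unknown MPI: $1"; return 1; }
    # Indirect lookup of the variable whose name is stored in $var.
    eval "value=\${$var:-}"
    if [ -z "$value" ]; then
        echo "$var is not set"
        return 1
    fi
    echo "$var=$value"
}
```

Running "check_mpi_home lam" before a LAM/MPI run, for example, shows at a glance whether LAM_HOME is available in the current shell.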

In Case of Problems

You should have minimal or no problems with the single machine tests. As more machines become involved with the tests, there is room for more configuration errors to arise. If a test does not run, check the "test_name.log" file in the log directory (the default is bps-logs). In the case of the NAS tests, the results are in the form npb.COMPILER.MPI.CLASS.PROCESSORS.
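When several NAS runs share a log directory, the naming scheme above makes it easy to tell them apart. As a small sketch (the function name parse_npb_name is hypothetical), a result name can be split on the dots into its fields:

```shell
# Split a name of the form npb.COMPILER.MPI.CLASS.PROCESSORS into its
# parts. Returns non-zero for names that do not fit that pattern.
parse_npb_name() {
    old_ifs=$IFS
    IFS=.
    # Word-split the name on dots into the positional parameters.
    set -- $1
    IFS=$old_ifs
    [ "$#" -eq 5 ] && [ "$1" = npb ] || return 1
    echo "compiler=$2 mpi=$3 class=$4 processors=$5"
}
```

A name such as npb.gnu.lam.A.8 would be reported as compiler gnu, MPI lam, class A, and 8 processors.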

In general, if you have problems with a test, it is best to examine the results in the log file. In the case of the NAS suite, the "-k" option will keep the npb tests directory in the log directory so you can run the tests directly if you have problems. There is a script called run_suite in the npb directory that will run the tests (run_suite -h). Also, the README.bps file in the npb directory should provide more information on how the tests are run and how to resolve possible problems.

Working with the NAS Benchmark Suite

The NAS suite will probably produce the most problems for end users. The main script, run_suite, is designed to "wrap" around and hide the various issues with the different MPIs and compilers. While run_suite does an adequate job, it certainly cannot predict every potential software environment on a cluster. The NAS suite is run from the command line. The following options are required. If you are using run_suite directly, you need to list the machine names in the npb/cluster/machines file (one per line). This file is used by MPI to start your programs. The options are:

Usage: ./run_suite 

Options:
  -v                  verbose output from make stage (default=make.log)
  -c <compiler>       compiler (gnu/pgi/intel)
  -n <processors>     number of processors
  -t <test>           test size (A,B,C,S)
  -m <mpi>            mpi version(lam,mpich,mpipro,dummy)
  -o                  only build programs
  -h                  show this help
To run on a single CPU use: '-c gnu -n 1 -t S -m dummy'

If you have problems producing the binary files, consult the make.log file for a complete listing of the make process. Often a single error will cause all the builds to fail, so fix one problem at a time and re-run the build.
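When running run_suite directly, the machines file has to be created by hand. The following sketch writes one host name per line to a file of your choosing. The function name write_machines is hypothetical, and the path in the usage note assumes the npb directory was kept under the default bps-logs directory with "-k".

```shell
# Write the given host names, one per line, to a machines file,
# creating the parent directory if needed.
# write_machines is a hypothetical helper, not part of BPS.
write_machines() {
    machines_file=$1
    shift
    # Refuse to write an empty list.
    [ "$#" -gt 0 ] || return 1
    mkdir -p "$(dirname "$machines_file")"
    printf '%s\n' "$@" > "$machines_file"
}
```

For example, "write_machines bps-logs/npb/cluster/machines kronos kronos01" lists two hosts, after which run_suite can be started from the npb directory with options such as "-c gnu -n 2 -t A -m lam".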

Future Plans

The BPS suite is showing its age. Virtually all the functionality of the BPS suite is being converted into the CMBP (Cluster Monkey Benchmarking Project).

Acknowledgments

I wish to thank and acknowledge all the authors of the test suites used in this package. See Sidebar Two (below) for more information on each test.

Sidebar One: Description of the NAS tests

BT is a simulated CFD application that uses an implicit algorithm to solve 3-dimensional (3D) compressible Navier-Stokes equations. The finite differences solution to the problem is based on an Alternating Direction Implicit (ADI) approximate factorization that decouples the x, y, and z dimensions. The resulting systems are Block-Tridiagonal of 5x5 blocks and are solved sequentially along each dimension.

SP is a simulated CFD application that has a similar structure to BT. The finite differences solution to the problem is based on a Beam-Warming approximate factorization that decouples the x, y, and z dimensions. The resulting system has scalar Pentadiagonal bands of linear equations that are solved sequentially along each dimension.

LU is a simulated CFD application that uses the symmetric successive over-relaxation (SSOR) method to solve a seven-block-diagonal system resulting from finite difference discretization of the Navier-Stokes equations in 3D by splitting it into block Lower and Upper triangular systems.

FT contains the computational kernel of a 3D fast Fourier Transform (FFT)-based spectral method. FT performs three one-dimensional (1D) FFTs, one for each dimension.

MG uses a V-cycle MultiGrid method to compute the solution of the 3D scalar Poisson equation. The algorithm works continuously on a set of grids that are made between coarse and fine. It tests both short and long distance data movement.

CG uses a Conjugate Gradient method to compute an approximation to the smallest eigenvalue of a large, sparse, unstructured matrix. This kernel tests unstructured grid computations and communications by using a matrix with randomly generated locations of entries.

EP is an Embarrassingly Parallel benchmark. It generates pairs of Gaussian random deviates according to a specific scheme. The goal is to establish the reference point for peak performance of a given platform. EP is almost independent of the interconnect as communication is minimal.

IS is a parallel integer sort algorithm that is very sensitive to the latency of the interconnect.

Sidebar Two: Additional Benchmark Information

bonnie++ - hard drive performance

stream - memory performance

netperf - general network performance

netpipe - detailed network performance

unixbench - general Unix benchmarks

LMbench - low level system benchmarks

NAS Parallel Suite - parallel applications benchmarks