Benchmarking A Multi-Core Processor For HPC

Published on Tuesday, 12 July 2011 10:59
Written by Douglas Eadline
Hits: 5756

It is 10pm, do you know where your cores are?

Ever since multi-core processors hit the market I have always been interested in testing how well they actually work. Of course, there are application benchmarks, which are always a good measure, but I like to look at some basic properties of a new processor (I look at networking and storage in other tests). The compute testing I do is rather straightforward, yet I have seen very few of these types of results. In particular, most results are for Windows or involve "web benchmarks." Since I work in Linux High Performance Computing (Linux Clusters) sector, I thought some examples of how I evaluate multi-core may be of interest to others.

For these type of tests, the absolute best performance is not so important as the relative performance on a specific processor. Most of the tests are Fortran codes (one C code) that stress the number crunching capability of a processor. If you are interested in a good comparison of Fortran compilers take a look at Polyhedron Software where they have recent comparisons of many popular Fortran compilers for a Intel Core i5 2500k (3.30GHz) and AMD Phenom II X4 955 processor (3.2 GHz).

For this series of tests, I used gfortran, which by the way can usually hold its own against some of the other commercial compilers. I used the following processors fro my tests.

{mosgoogle right} Note that the 910e and i5-2400S are 65W processors. Each is in a single socket desktop motherboard. The Q6600 was used as a reference against the new Intel Sandybridge and the aging AMD Deneb.

My motivation for this type of processor is the Limulus Project. Even though these processors are low-power desktop parts they work similarly to their big brother HPC versions. When Intel or AMD make a processor line "the guts" are basically the same. The marketing department slices and dices some of the features configuring products to reflect the market in which they will be sold. There are features servers need that consumer desktops do not (and vis-versa). Some of these options make sense from a technical standpoint, while others are just marketing games.

Generational advances (like memory architecture) travel through the whole product line from desktop to server. The high end server processor may work 20% faster than the lower end desktop processors, but the cost is usually much more (almost always more than 20%). Thus, by using lower cost desktop parts, a large percentage of the "high end" performance can be had at a low end price. The Limulus Project is designed to take advantage of this marketing trend. And, to be clear, the Limulus Project is not a replacement for a large server based cluster.

The testing is designed to measure how well a single multi-core processor performs with multiple programs running on it. Using single core performance as a measure of multi-core performance is difficult if not impossible. Thus, looking at parallel performance and scalability is the real measure of a modern multi-core processor.

For much of my testing I use the NAS Parallel Benchmark Suite. The NAS suite is a set of eight kernels that represent different aerodynamic application types. Each kernel is self checking, reports the run time, and offers a different memory access pattern. The "B" level test sizes were used. I also use the Beowulf Performance Suiteas framework for building and running the tests. The following software environment was used:

Effective Core Test

The first test is based on a simple idea. If a single program/thread runs on a single core in X number of seconds, then Y copies should run in the same amount of time, provided Y is equal to the number of cores and there is perfect memory sharing (i.e. there is no memory contention). If it takes the collection of copies longer to run (than the single copy), then the number of “effective cores” is reduced. I created a series of scripts that works on on 2,4,8,12, and 16 cores (The four core example is given at the end of the column). The results are given below in both tabular and chart form.

Test AMD 910e Intel i5-2400S Intel Q6600
bt 2.1 2.01.6
cg 3.4 3.12.0
ep 4.0 4.04.0
ft 3.7 3.62.7
is 3.9 4.03.3
lu 2.3 2.52.1
mg 2.3 2.41.8
sp 2.4 2.11.5

Table One: Effective cores for NAS Parallel Kernels


Figure One is a bar chart of Table One showing effective cores for NAS parallel kernels.

The results show a clear improvement over the Q6600 by both newer processors (due to better memory architecture). Interestingly, the 910e and the i5-2400S show about the same number of effective processors suggesting that the memory designs (each has two memory channels) are similar. Of course, the server version of these processors have more memory channels, better performance, and a higher cost. The important thing to note is no processor achieves perfect scaling for every test. Furthermore, one could argue that this is a "worst case" test and not likely to be how multi-core processors are used. As we shall see below, these results approximate the behavior of parallel programs.

Single Core Performance

I mentioned previously that single core performance was not as important as using all the cores. In order to measure the scalability, however, single core numbers are needed. The table below are the Million Operations Per Second (MOPS) for the NAS benchmark suite. Each test was run using just one core. Note: Except for IS (Integer Sort) MOPS is effectively Million Floating Point Operations per Second (MFLOPS) Also, note the number of MFLOPS varies by test.

Test AMD 910e Intel i5-2400S Intel Q6600
bt 1327 1950981
cg 364 585 531
ep 11 15 11
ft 1020 1575 980
is 57 80 62
lu 1434 2054 1102
mg 1640 19491387
sp 1112 1427800

Table Two: Single Core results for NAS Parallel Kernels


Figure Two is a bar chart of Table Two normalized to the Q6600 results.

{mospagebreak}

Quad-core Performance

The NAS tests were then run using all four cores as an MPI program. The total MOPS is given in Table Three below.

Test AMD 910e Intel i5-2400S Intel Q6600
bt 2721 38411332
cg 1430 2418 911
ep 42 62 46
ft 2900 4306 1391
is 237 415 143
lu 4679 5730 2971
mg 3775 46941693
sp 2203 30131026

Table Three: Quad-Core results for NAS Parallel Kernels


Figure Three is a bar chart of Table Three normalized to the Q6600 results.

The final figure below is the scalability of each program. Like effective cores, perfect scalability would produce a 4X speed-up. Note that there is also some hyper speed-up, which is probably due to cache effects (i.e. each core is now running a smaller program that can better utilize the cache structure of the processor).


Figure Four is the scalability of each NAS kernel from the data in Table Three.

Discussion

The results shown above offer some insight into modern X86 multi-core processors. First, the effective cores test seems to be a good indicator as to how well a parallel program will perform on a multi-core processor. It seems to be a loose indicator of scalability. Consider tests bt,lu,mg, and sp. None of these tests were able to achieve more then 2.5 effective cores. When run in parallel, only lu was able to push a significantly above a scalability of two.

{mosgoogle right} Second, there is also an interesting lesson about single core vs multi-core results. Consider, the single core results for cg. The AMD 910e is the slowest, however, when run in parallel it over takes the older Intel Q6600 and shows almost perfect scalability. Thus, using single core results as predictor parallel performance is not always the best method.

Benchmarks are always revealing. Clearly, the better memory architecture of the latest AMD and Intel processors leads to better performance. The Sandybridge processor is the current leader in this category. As more cores are added to new processors, these types of tests become more important because good scalability (effective cores) gets harder. Of course, memory contention is the real issue with multi-core and mixing different programs with different access patterns may lesson the contention on a single processor. Finally, The parallel results were run using MPI. These results may vary using OpenMP and should make for another interesting set of tests on this processor group.

Listing Of the BASH Test Script

#!/bin/bash

# Requires NAS parallel codes (or your own) in PROGS line.
# Works Best with Beowulf Performance Suite
# See http://www.clustermonkey.net//content/view/38/34/1/0/
# Install BPS, use "-k" option for NAS test, copy this
# file into ~/bps-logs/npb

PROGS="cg.A.1 bt.A.1 ep.A.1 ft.A.1 lu.A.1 is.A.1 sp.A.1 mg.A.1"
echo "4 Way SMP Memory Test" |tee "smp-mem-test-4.out"
echo "`date`" |tee -a "smp-mem-test-4.out"
# generate single cpu codes change -c for different compiler
# comment if you do not need to generate the programs
./run_suite -n 1 -t A -m dummy -c gnu4 -o

for TEST in $PROGS
do
        bin/$TEST>& temp.mem0
        bin/$TEST>& temp.mem1 &
        bin/$TEST>& temp.mem2 &
        bin/$TEST>& temp.mem3 &
        bin/$TEST>& temp.mem4
        wait

        S=`grep Time temp.mem0 |gawk '{print $5}'`
        C1=`grep Time temp.mem1 |gawk '{print $5}'`
        C2=`grep Time temp.mem2 |gawk '{print $5}'`
        C3=`grep Time temp.mem3 |gawk '{print $5}'`
        C4=`grep Time temp.mem4 |gawk '{print $5}'`
        SPEEDUP=`echo "3 k $S $C1 / $S $C2 / $S $C3 / $S $C4 / \
        + + + p" | dc`

        echo "4 Way SMP Program Speed-up for $TEST is $SPEEDUP" \
        |tee -a "smp-mem-test-4.out"

done
/bin/rm temp.mem*
echo "`date`" |tee -a "smp-mem-test-4.out"