Counting Your Effective HPC Cores

Published on Thursday, 24 May 2012 13:14
Written by Douglas Eadline

Can more cores per node address HPC needs or are thin nodes on the horizon?

As multi-core continues to dominate the x86 marketplace, a nagging question remains for some HPC users: which is better, many single-socket multi-core nodes or fewer fat nodes with as many sockets as possible? Of course, the amortization of hardware costs (power supplies, cases, hard disks, etc.) is a win for the fat-node approach, but what about performance? As nodes offer more and more cores, does HPC performance follow?

When two processor sockets showed up on x86 motherboards, this question became interesting to those who look for the best price-to-performance approach to HPC. To help answer it, I created a simple test script that determines the number of "effective cores" offered by a multi-core processor.

One of the curious aspects of multi-core processors is how much emphasis is often placed on single-core performance. While there is a case for dynamic frequency scaling (i.e. SpeedStep or PowerNow) on machines with varied usage patterns, most HPC applications use as many cores as possible for extended periods of time. Thus, for HPC, there is a need to look at how well a processor performs using a range of cores, and in particular all of the cores.

Another point about testing multi-core is that the golden rule of HPC applies: "Your mileage may vary; use your codes and hardware to make performance assessments." Indeed, as we will see, the type of application (e.g. cache-friendly or memory-intensive) can make a big difference. The tests described in this article should be used as a guide and not a final statement as to the best performance on a given type of hardware.

An Obvious Test

In the past I created and discussed a short test script that is described in an article that appeared in Linux Magazine. The idea behind the script is rather simple: if a single program/thread runs on a single core in X seconds, then Y copies should run in the same amount of time, provided Y is equal to the number of cores and there is perfect memory sharing (i.e. there is no memory contention). If the collection of copies takes longer to run than a single copy, then the number of "effective cores" is reduced. You can find a series of scripts that will work on 2, 4, 8, 12, and 16 cores. (Note: If I invested more time, I could make a single script that would take a command-line argument, but I don't run these scripts all that often.)

To make the test interesting, I use the NAS Parallel Benchmark Suite compiled for a single processor or core (i.e. it is not parallel). The NAS suite is a set of eight kernels that represent different aerodynamic application types. Each kernel is self checking, reports the run time, and offers a different memory access pattern. The script can be easily modified for other programs. If you want to use the NAS suite, you may find it helpful to download the Beowulf Performance Suite which has the run_suite script that automates running the NAS suite. An example of the four core script is given below.

#!/bin/bash

PROGS="cg.A.1 bt.A.1 ep.A.1 ft.A.1 lu.A.1 is.A.1 sp.A.1 mg.A.1"
NPBPATH="../npb/"
echo "4 Way SMP Memory Test" | tee "smp-mem-test-4.out"
echo "`date`" | tee -a "smp-mem-test-4.out"
# if needed, generate the single-cpu codes (change -c for a different compiler);
# just check for the last program
if [ ! -e "$NPBPATH/bin/mg.A.1" ]
then
  pushd $NPBPATH
  ./run_suite -n 1 -t A -m dummy -c gnu4 -o
  popd
fi

for TEST in $PROGS
do
        # baseline: run a single copy alone
        $NPBPATH/bin/$TEST >& temp.mem0
        # run four copies concurrently: three in the background,
        # one in the foreground, then wait for all to finish
        $NPBPATH/bin/$TEST >& temp.mem1 &
        $NPBPATH/bin/$TEST >& temp.mem2 &
        $NPBPATH/bin/$TEST >& temp.mem3 &
        $NPBPATH/bin/$TEST >& temp.mem4
        wait
        # pull the reported run time out of each result file
        S=`grep Time temp.mem0 | gawk '{print $5}'`
        C1=`grep Time temp.mem1 | gawk '{print $5}'`
        C2=`grep Time temp.mem2 | gawk '{print $5}'`
        C3=`grep Time temp.mem3 | gawk '{print $5}'`
        C4=`grep Time temp.mem4 | gawk '{print $5}'`
        # speed-up = sum of (single-copy time / concurrent-copy time)
        SPEEDUP=`echo "3 k $S $C1 / $S $C2 / $S $C3 / $S $C4 / + + + p" | dc`
        echo "4 Way SMP Program Speed-up for $TEST is $SPEEDUP" |\
        tee -a "smp-mem-test-4.out"
done
/bin/rm temp.mem*
echo "`date`" | tee -a "smp-mem-test-4.out"

Testing 12 Cores

One aspect of testing effective cores is that individual program performance is not as important as multiple-program performance. That is, getting the best performance on any one test is secondary to running multiple copies. The series of tests measures how well a server scales as more programs are run. This test could be considered a "worst case" scenario; however, in the case of an MPI program run on a multi-core node, this is exactly what happens. Example results are reported in Table One below for a dual 6-core Intel Xeon (Gulftown) server. Note that even though there are 12 total cores, I ran up to 16 copies of each program.

Test   2 copies   4 copies   8 copies   12 copies   16 copies
cg        2.0        3.4        5.7         6.6         7.7
bt        2.0        3.2        4.6         4.8         4.9
ep        2.0        3.9        7.8        11.8        12.7
ft        2.0        3.8        7.1         8.9        11.0
is        2.0        4.0        7.8        11.2        14.8
lu        2.0        3.9        6.5         6.1         6.7
sp        2.0        3.7        5.1         5.4         5.7
mg        2.0        3.8        6.4         6.6         9.1
Ave       2.0        3.7        6.4         7.7         9.1
Table One: Effective Cores for a 12-way Intel Xeon (Gulftown) SMP
server running the NAS suite

One way to interpret the results is as "effective cores" -- the number of cores that actually get utilized on the processor. The under-utilization is due to memory contention. In the table above, results are pretty good across the board up to 8 copies. Running 12 copies, we start to see performance level off for some programs while others keep improving. At 16 copies, most programs see some improvement, but this is probably the limit of effective cores for this system.

The tests were run a second time to see if there was any benefit to using Simultaneous Multithreading (SMT). Once enabled in the BIOS, SMT doubles the number of cores seen by the OS. While many people may recall Hyper-Threading (HT), SMT is supposed to be better and help hide memory latency. That is, while a core is waiting for a memory access, it can in theory be running another "thread." This technique may be very helpful with I/O issues as well; however, most HPC applications hit the memory and floating-point units hard -- of which there are only twelve.
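A quick way to confirm what the OS sees before and after toggling SMT in the BIOS is to compare the logical CPU count with the threads-per-core figure (both commands are standard Linux tools, though output formats vary by distribution):

```shell
# Number of logical CPUs the OS sees; this doubles when SMT is enabled
nproc
# Threads per core: a value greater than 1 means SMT/HT is active
lscpu | grep 'Thread(s) per core'
```

On the Gulftown server described here, nproc would report 12 with SMT off and 24 with it on.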

SMT was turned on and the 12-way and 16-way effective core tests were re-run. Only the 12- and 16-way tests were deemed important because SMT should have little effect when the number of processes is less than the number of real cores. The results are in Table Two below. The 16-way results should be the most telling because the real cores are oversubscribed by 4 processes. With the exception of the ep benchmark, there does not seem to be any advantage to using SMT. Indeed, some benchmarks saw a decrease in effective cores. ep is more processor-bound and thus shows a nice performance boost. As expected, there was no improvement when running the 12-copy test with SMT.

Test   12 copies   12 copies (SMT)   16 copies   16 copies (SMT)
cg         6.6           6.6             7.7           7.8
bt         4.8           4.8             4.9           5.4
ep        11.8          11.7            12.7          14.0
ft         8.9           8.9            11.0          10.4
lu         6.1           6.0             6.7           6.7
is        11.2          11.0            14.8          12.6
sp         5.4           5.5             5.7           6.4
mg         6.6           6.5             9.1           7.9
Ave        7.7           7.6             9.1           8.9
Table Two: Effective Cores for a 12-way Intel Xeon (Gulftown) SMP server running
the NAS suite with SMT enabled

In general, SMT does not appear to hurt anything as long as you don't oversubscribe the actual number of cores. It may allow daemons and other such background processes to work better on compute nodes, but I don't see it making a huge difference (this assumption should be tested with your code, however).

The above results range from 41% to 98% efficiency, with an average utilization across all tests of 64%. Thus, on average, you can expect to effectively use 7.7 of the 12 cores present in the server for applications similar to the NAS kernels.

A Single Socket Redux?

In contrast to the results above, consider similar tests run on a number of 4-core single-socket processors, where the best-case performance ranged from 50% to 100% and the average utilization was 74%. On average, one can expect to effectively use 3 out of 4 cores.

The variation is due to memory bandwidth of each system. In general, more cores means more sharing of memory and more possible contention. Cache friendly programs usually scale well on multi-core, while those that rely on heavy access to main memory have the most difficulty with large multi-core systems.

As mentioned, a valid argument for high-density multi-core nodes is the amortization of power supplies, hard drives, interconnects, and case/rack hardware across the large number of cores in a single node. This makes sense, but unless the amortization is based on effective cores, the assumed savings may not accurately reflect reality. Using single-socket nodes also reduces the MPI messaging and I/O load on the interconnect, but it does increase the number of switch ports and network cards needed. In some cases, lower-cost Gigabit Ethernet may be adequate for single-socket nodes, thus offsetting the increase in interconnect costs. Furthermore, it is possible to build nodes that contain multiple single-socket motherboards sharing power supply and packaging costs, gaining back some of the lost amortization.

Single-socket nodes may also provide a more MPI-friendly environment than that of large SMP nodes. That is, an application that requires 72 cores may run better on 18 four-core single-socket skinny nodes than on 6 twelve-core dual-socket fat nodes. There are fewer data-locality issues in the case of the thin nodes. On the other hand, purposely under-subscribing fat nodes may mitigate some of the issues. For example, any one parallel job could be given at most 8 cores per 12-core node. The remaining cores could be used by different jobs, which may have different memory usage patterns and allow more effective core usage.
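Under-subscribing a fat node like this can be enforced with CPU affinity, for example with taskset from util-linux (the core list and application name below are illustrative, not from the tests in this article):

```shell
# Pin a job to cores 0-7 of a 12-core node, leaving cores 8-11
# free for other jobs (./my_app is a placeholder application)
taskset -c 0-7 ./my_app
```

Batch schedulers and MPI launchers typically offer equivalent binding options, so in practice the affinity would be set per-job rather than by hand.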

The range of cores per processor and sockets per motherboard makes designing an HPC cluster an interesting challenge. In the tests described above, thin nodes offered on average 10% better core utilization than fat nodes, and in some cases the difference was far larger. If the number of "effective cores" shrinks as more cores are added to nodes, will thin node designs start to dominate, or will the economics of fat nodes keep them as the best choice for HPC?
