Can more cores per node address HPC needs or are thin nodes on the horizon?
As multi-core continues to dominate the x86 marketplace, a nagging question remains for some HPC users: which is better, many single-socket multi-core nodes or fewer fat nodes with as many sockets as possible? The amortization of hardware costs (power supplies, cases, hard disks, etc.) clearly favors the fat-node approach, but what about performance? As nodes offer more and more cores, does HPC performance follow?
When two processor sockets showed up on x86 motherboards, this question became interesting to those who look for the best price-to-performance approach to HPC. To help answer it, I created a simple test script that determines the number of "effective cores" offered by a multi-core processor.
One of the curious aspects of multi-core processors is how much emphasis is often placed on single-core performance. While there is a case for dynamic frequency scaling (i.e. SpeedStep or PowerNow) on machines with varied usage patterns, most HPC applications use as many cores as possible for extended periods of time. Thus, for HPC, we need to look at how well a processor performs across a range of cores, and in particular when all the cores are busy.
Another point about testing multi-core: the golden rule of HPC applies, "Your mileage may vary; use your codes and hardware to make performance assessments." Indeed, as we will see, the type of application (e.g. cache friendly or memory intensive) can make a big difference. The tests described in this article should be used as a guide and not a final statement of best performance on a given type of hardware.
An Obvious Test
In the past I created and discussed a short test script, described in an article that appeared in Linux Magazine. The idea behind the script is rather simple. If a single program/thread runs on a single core in X seconds, then Y copies should run in the same amount of time, provided Y is equal to the number of cores and there is perfect memory sharing (i.e. there is no memory contention). If the collection of copies takes longer to run than a single copy, then the number of "effective cores" is reduced. For example, if one copy runs in 100 seconds and each of four simultaneous copies takes 125 seconds, the result is 4 × (100/125) = 3.2 effective cores. You can find a series of scripts that will work on 2, 4, 8, 12, and 16 cores. (Note: If I invested more time, I could make a single script that takes the core count as a command line argument, but I don't run these scripts all that often; a rough sketch of such a script appears after the four-core example below.)
To make the test interesting, I use the NAS Parallel Benchmark Suite compiled for a single processor or core (i.e. it is not parallel). The NAS suite is a set of eight kernels that represent different aerodynamic application types. Each kernel is self checking, reports the run time, and offers a different memory access pattern. The script can be easily modified for other programs. If you want to use the NAS suite, you may find it helpful to download the Beowulf Performance Suite which has the run_suite script that automates running the NAS suite. An example of the four core script is given below.
#!/bin/bash
PROGS="cg.A.1 bt.A.1 ep.A.1 ft.A.1 lu.A.1 is.A.1 sp.A.1 mg.A.1"
NPBPATH="../npb/"
echo "4 Way SMP Memory Test" |tee "smp-mem-test-4.out"
echo "`date`" |tee -a "smp-mem-test-4.out"
# if needed, generate single cpu codes, change -c for different compiler
# just check for last program
if [ ! -e "$NPBPATH/bin/mg.A.1" ]; then
  pushd $NPBPATH
  ./run_suite -n 1 -t A -m dummy -c gnu4 -o
  popd
fi
for TEST in $PROGS
do
  # single copy by itself gives the serial baseline time
  $NPBPATH/bin/$TEST >& temp.mem0
  # four copies at once: three in the background, one in the foreground
  $NPBPATH/bin/$TEST >& temp.mem1 &
  $NPBPATH/bin/$TEST >& temp.mem2 &
  $NPBPATH/bin/$TEST >& temp.mem3 &
  $NPBPATH/bin/$TEST >& temp.mem4
  wait
  S=`grep Time temp.mem0 |gawk '{print $5}'`
  C1=`grep Time temp.mem1 |gawk '{print $5}'`
  C2=`grep Time temp.mem2 |gawk '{print $5}'`
  C3=`grep Time temp.mem3 |gawk '{print $5}'`
  C4=`grep Time temp.mem4 |gawk '{print $5}'`
  # effective cores = S/C1 + S/C2 + S/C3 + S/C4, computed with dc to 3 places
  SPEEDUP=`echo "3 k $S $C1 / $S $C2 / $S $C3 / $S $C4 / + + + p" | dc`
  echo "4 Way SMP Program Speed-up for $TEST is $SPEEDUP" |\
    tee -a "smp-mem-test-4.out"
done
/bin/rm temp.mem*
echo "`date`" |tee -a "smp-mem-test-4.out"
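As mentioned above, the test could be generalized to take the number of copies as a command line argument. The following is only a rough, untested sketch of that idea; it assumes the same NAS binaries, directory layout, and "Time" output format as the four-core script above, and the hard-coded scripts are what actually produced the results reported below.

#!/bin/bash
# Sketch: run N simultaneous copies of each NAS kernel, N given on the command line
NCOPIES=${1:?usage: $0 number-of-copies}
PROGS="cg.A.1 bt.A.1 ep.A.1 ft.A.1 lu.A.1 is.A.1 sp.A.1 mg.A.1"
NPBPATH="../npb/"
OUT="smp-mem-test-$NCOPIES.out"
echo "$NCOPIES Way SMP Memory Test" | tee "$OUT"
echo "`date`" | tee -a "$OUT"
for TEST in $PROGS
do
  # serial baseline
  $NPBPATH/bin/$TEST >& temp.mem0
  S=`grep Time temp.mem0 | gawk '{print $5}'`
  # N concurrent copies
  for ((i=1; i<=NCOPIES; i++)); do
    $NPBPATH/bin/$TEST >& temp.mem$i &
  done
  wait
  # accumulate S/Ci over all copies
  SPEEDUP=0
  for ((i=1; i<=NCOPIES; i++)); do
    C=`grep Time temp.mem$i | gawk '{print $5}'`
    SPEEDUP=`echo "3 k $SPEEDUP $S $C / + p" | dc`
  done
  echo "$NCOPIES Way SMP Program Speed-up for $TEST is $SPEEDUP" | tee -a "$OUT"
done
/bin/rm temp.mem*
echo "`date`" | tee -a "$OUT"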
Testing 12 Cores
One aspect of testing effective cores is that individual program performance is not as important as multiple program performance. That is, getting the best single-copy time is secondary to seeing how well the server scales as more copies are run. Of course, this test could be considered a "worst case" scenario because all the copies start at the same time and compete for memory access; however, when an MPI program is run on a multi-core node, this is exactly what happens. Example results are reported in Table One below for a dual 6-core Intel Xeon (Gulftown) server. Note that even though there are 12 total cores, I ran up to 16 copies of each program.
Test | 2 copies | 4 copies | 8 copies | 12 copies | 16 copies |
---|---|---|---|---|---|
cg | 2.0 | 3.4 | 5.7 | 6.6 | 7.7 |
bt | 2.0 | 3.2 | 4.6 | 4.8 | 4.9 |
ep | 2.0 | 3.9 | 7.8 | 11.8 | 12.7 |
ft | 2.0 | 3.8 | 7.1 | 8.9 | 11.0 |
is | 2.0 | 4.0 | 7.8 | 11.2 | 14.8 |
lu | 2.0 | 3.9 | 6.5 | 6.1 | 6.7 |
sp | 2.0 | 3.7 | 5.1 | 5.4 | 5.7 |
mg | 2.0 | 3.8 | 6.4 | 6.6 | 9.1 |
Ave | 2.0 | 3.7 | 6.4 | 7.7 | 9.1 |
Table One: Effective cores (program speed-up) for a dual 6-core Intel Xeon (Gulftown) server running the NAS suite
One way to interpret the results is as "effective cores," that is, how many cores are actually utilized on the processor; the under-utilization is due to memory contention. In the table above, results are quite good across the board up to 8 copies. At 12 copies, performance levels off for some programs while others keep improving. At 16 copies, most programs still see some improvement, but this is probably the limit of effective cores for this system.
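For a quick summary like the "Ave" row in Table One, the per-test speed-ups can be averaged directly from the script output. A minimal one-liner, assuming the 12-copy script follows the same output file naming and "Speed-up ... is" line format as the four-core example above:

# average the per-test speed-ups (last field of each Speed-up line)
grep "Speed-up" smp-mem-test-12.out | \
  gawk '{sum += $NF; n++} END {printf "Average effective cores: %.1f\n", sum/n}'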