Performance: Now We're Talking
Once Microwulf was built and functioning, we naturally wanted to find out how 'fast' it was. 'Fast' can mean many things, depending on your definition. But since the HPL benchmark is the standard used for the Top500 list, we decided to use it as our first measure of performance. Yes, you can argue and disagree with us, but we needed to start somewhere.
We installed the development tools for Ubuntu (gcc-4.1.2) and then built both Open MPI and MPICH. Initially we used Open MPI as our MPI library of choice, with both GigE NICs configured (the on-board adaptor and the Intel PCI-e NIC in the x16 PCIe slot).
Then we built the Goto BLAS library and HPL, the High Performance Linpack benchmark. The Goto BLAS library built fine, but when we tried to build HPL (which uses BLAS), we got a linking error indicating that someone had left a function named main() in a module named main.f in /usr/lib/libgfortranbegin.a. This conflicted with main() in HPL. Since a library should not need a main() function, we used ar to remove the offending module from /usr/lib/libgfortranbegin.a, after which everything built as expected.
Next, we started to experiment with the various parameters for running HPL - primarily problem size and process layout. We varied PxQ between {1x8, 2x4}, varied NB between {100, 120, 140, 160, 180, 200}, and used increasing values of N (problem size) until we ran out of memory. As an example of the tests we did, Figure Six below is a plot of the HPL performance in GFLOPS versus the problem size N.
For Figure Six we chose PxQ=2x4, NB=160, and varied N from a very small number up to 30,000. Notice that above N=10,000, Microwulf achieves 20 GFLOPS, and with N greater than 25,000, it exceeds 25 GFLOPS. Anything above N=30,000 produced "out of memory" errors.
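As a rough cross-check on the out-of-memory limit, the sketch below estimates the size of HPL's N x N double-precision matrix. It assumes 8 bytes per element and 8 GB of total RAM (4 nodes x 2 GB, inferred from the parts list); the function name is ours, and HPL's workspace, the operating system, and MPI buffers all need memory on top of the matrix itself.

```python
# Rough estimate of the memory HPL needs for its N x N matrix of doubles.
BYTES_PER_DOUBLE = 8

# Assumption: 4 nodes x 2 GB each, inferred from the parts list.
TOTAL_RAM_GIB = 8.0


def hpl_matrix_gib(n: int) -> float:
    """Return the size of an n x n double-precision matrix in GiB."""
    return n * n * BYTES_PER_DOUBLE / 2**30


for n in (10_000, 20_000, 25_000, 30_000, 32_000):
    size = hpl_matrix_gib(n)
    share = size / TOTAL_RAM_GIB
    print(f"N={n:>6}: {size:5.2f} GiB ({share:5.1%} of total RAM)")
```

Under these assumptions the matrix alone at N=30,000 takes well over three quarters of the cluster's memory, which is consistent with larger problems failing.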
We achieved a peak performance of 26.25 GFLOPS. The theoretical peak performance for Microwulf is 32 GFLOPS (eight cores x 2 GHz x 2 double-precision units per core). This means we hit about 82% efficiency, which we find remarkable. We assume one reason for such high efficiency is Open MPI, which will use both GigE interfaces: it round-robins data transfers over the available interfaces unless you explicitly tell it to use only certain ones.
It's important to note that this performance occurred using the default system and Ethernet settings. In particular, we did not tweak any of the Ethernet parameters mentioned in Doug Eadline and Jeff Layton's article on cluster optimization. We were basically using "out of the box" settings for these runs.
To assess how well our NICs were performing, Tim did some follow-up HPL runs, and used netpipe to gauge our NICs' latency. Netpipe reported 16-20 usecs (microseconds) latency on the onboard NICs, and 20-25 usecs latency on the PCI-e NICs, which was lower (better) than we were expecting.
As a check on performance we also tried another experiment. We channel bonded the two GigE interfaces to produce, effectively, a single interface. We then used MPICH2 with the channel bonded interface and the same HPL parameters we found to be good for Open MPI. The best performance we achieved was 24.89 GFLOPS (77.8% efficiency). So it looks like Open MPI with multiple interfaces beats MPICH2 with a bonded interface.
Another experiment we tried was to use Open MPI and just the PCI-e GigE NIC. Using the same set of HPL parameters as before, we achieved a performance of 26.03 GFLOPS (81.3% efficiency). This is fairly close to the performance we obtained when using both interfaces, which suggests that the on-board NIC isn't doing as much work as we thought. We plan to investigate this more in the days ahead.
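The peak and efficiency arithmetic can be sketched as follows. The function names are ours; the inputs (8 cores, 2 GHz, 2 double-precision operations per cycle per core, 26.25 GFLOPS measured) are the figures quoted above.

```python
def theoretical_peak_gflops(cores: int, clock_ghz: float, flops_per_cycle: int) -> float:
    """Peak GFLOPS = cores x clock (GHz) x floating-point ops per cycle per core."""
    return cores * clock_ghz * flops_per_cycle


def efficiency(measured_gflops: float, peak_gflops: float) -> float:
    """Fraction of the theoretical peak that a benchmark run actually reached."""
    return measured_gflops / peak_gflops


peak = theoretical_peak_gflops(cores=8, clock_ghz=2.0, flops_per_cycle=2)
print(f"Theoretical peak: {peak:.0f} GFLOPS")
print(f"HPL efficiency:   {efficiency(26.25, peak):.1%}")
```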
Now, let's look at the performance of Microwulf in relation to the Top500 list. Going through the archived lists, here is where Microwulf would have placed:
- Nov. 1993: #6
- Nov. 1994: #12
- Nov. 1995: #31
- Nov. 1996: #60
- Nov. 1997: #122
- Nov. 1998: #275
- June 1999: #439
- Nov. 1999: Off the list
Looking back at the lists is really a lot of fun. In November 1993 Microwulf would have been the 6th fastest machine on the planet! That's not bad considering that was only 14 years ago. As recently as 8 years ago, Microwulf would have ranked 439th on the list. That's not bad for a little 4 node, 8 core cluster measuring 11" x 12" x 17".
If we dig a little further, we learn some more interesting things about Microwulf and its performance compared to historic supercomputers. In Nov. 1993, when Microwulf would have been in 6th place, the 5th place machine was a Thinking Machines CM-5/512 with 512 CPUs, which produced about 30 GFLOPS. So the performance of Microwulf's 8 cores is pretty close to that of the CM-5's 512 CPUs -- with 64 times fewer processors!
In Nov. 1996, when Microwulf would have been in 60th place, the next machine on the list is a Cray T3D MC256-8 with 256 CPUs. It's kind of interesting to realize that Microwulf with 8 cores is faster than an 11-year-old Cray with 256 CPUs. Not to mention the price difference -- a T3D cost millions of dollars! It puts the cluster and commodity revolution into perspective.
Efficiencies - World Record Time
Now, let's look at some metrics to determine how good Microwulf is or isn't. Believe it or not, these numbers can tell you a lot about your system. If we take the January 2007 price of Microwulf, $2,470, and divide by its 26.25 GFLOPS of performance, we can see that Microwulf achieves $94.10/GFLOP! So Microwulf has broken the proverbial $100/GFLOP barrier (for double precision results)! Break out the champagne!
With energy prices rising, another useful metric to examine is the power/performance ratio. This ratio is increasingly important to big cluster farms (e.g., Google), and could become a new metric that people watch very closely. For Microwulf we have measured the total power consumption at 250W at idle (a bit over 30W per core) and 450W when under load. This works out to 17.14 W/GFLOP under load. To make this number more meaningful, let's compare it with the power/performance numbers of some other systems.
One system that was designed for low power consumption is Green Destiny, which used very low power Transmeta CPUs and required very little cooling. The 240 CPUs in Green Destiny consumed about 3.2 kW, and the machine achieved about 101 GFLOPS. This gives Green Destiny a power/performance ratio of about 31 W/GFLOP, which is almost 2 times worse than Microwulf.
Another interesting comparison is to the Orion Multisystems clusters. Orion is no longer around, but a few years ago they sold two commercial clusters: a 12-node desktop cluster (the DS-12) and a 96-node deskside cluster (the DS-96). Both machines used Transmeta CPUs. The DS-12 used 170W under load, and its performance was about 13.8 GFLOPS. This gives it a power/performance ratio of 12.31 W/GFLOP (much better than Microwulf). The DS-96 consumed 1580W under load, with a performance of 109.4 GFLOPS. This gives it a power/performance ratio of 14.44 W/GFLOP, which again beats Microwulf.
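The Watts-per-GFLOP comparison can be reproduced with a few lines. This is a minimal sketch using the load power and HPL figures quoted in the text; the Green Destiny inputs are the article's rounded values ("about 3.2 kW", "about 101 GFLOPS"), so its result is only approximate.

```python
def watts_per_gflop(load_watts: float, gflops: float) -> float:
    """Power drawn under load per GFLOPS delivered (lower is better)."""
    return load_watts / gflops


# (load power in Watts, performance in GFLOPS), as quoted in the text.
systems = {
    "Microwulf": (450.0, 26.25),
    "Green Destiny": (3200.0, 101.0),
    "Orion DS-12": (170.0, 13.8),
    "Orion DS-96": (1580.0, 109.4),
}

baseline = watts_per_gflop(*systems["Microwulf"])
for name, (watts, gflops) in systems.items():
    ratio = watts_per_gflop(watts, gflops)
    print(f"{name:<14} {ratio:6.2f} W/GFLOP ({ratio / baseline:.2f}x Microwulf)")
```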
Another way to look at power consumption and performance is to use the metric from Green 500. Their metric is MFLOPS/Watt (the bigger the number the better). Microwulf comes in at 58.33, the DS-12 is 81.18, and the deskside unit is 69.24. So using the Green 500 metric we can see that the Orion systems are more power efficient than Microwulf. But let's look a little deeper at the Orion systems.
The Orion systems look great at Watts/GFLOP, and considering the age of the Transmeta chips, that is no small feat. But let's look at the price/performance metric. The DS-12 desktop model had a list price of about $10,000, giving it a price/performance ratio of $724/GFLOP. The DS-96 deskside unit had a list price of about $100,000, so its price/performance is about $914/GFLOP. That is, while the Orion systems were much more power efficient, their price per GFLOP is much higher than that of Microwulf, making them much less cost efficient.
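To see the two rankings side by side, the sketch below computes MFLOPS/Watt and $/GFLOP for the three systems from the figures quoted above. The `System` class and its field names are ours, introduced only for illustration, and the Orion prices are the approximate list prices given in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class System:
    name: str
    gflops: float       # measured HPL performance
    load_watts: float   # power draw under load
    price_usd: float    # purchase or list price

    @property
    def mflops_per_watt(self) -> float:
        """Green 500 style metric (higher is better)."""
        return self.gflops * 1000.0 / self.load_watts

    @property
    def usd_per_gflop(self) -> float:
        """Price/performance (lower is better)."""
        return self.price_usd / self.gflops


SYSTEMS = [
    System("Microwulf", 26.25, 450.0, 2470.0),
    System("Orion DS-12", 13.8, 170.0, 10_000.0),
    System("Orion DS-96", 109.4, 1580.0, 100_000.0),
]

for s in SYSTEMS:
    print(f"{s.name:<12} {s.mflops_per_watt:6.2f} MFLOPS/W {s.usd_per_gflop:8.2f} $/GFLOP")
```

Sorting `SYSTEMS` by either property shows the two metrics ranking the machines in opposite orders, which is what motivates looking for a combined metric.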
Since Microwulf is better than the Orion systems in price/performance, and the Orion systems are better than Microwulf in power/performance, let's experiment to see if we can find a useful way to combine the metrics. Ideally we'd like a single metric that encompasses a system's price, performance, and power usage. As an experiment, let's compute MFLOP/Watt/$. It may not be perfect, but at least it combines all 3 numbers into a single metric, by extending the Green 500 metric to include price. We want MFLOP/Watt to be large, to get as much processing as possible per unit of power. We also want price to be as small as possible, which means we want the inverse of price to be as large as possible. Together, this means that we want MFLOP/Watt/$ to be as large as possible. With this in mind, let's see how Microwulf and Orion did.
- Microwulf: 0.02362
- Orion DS-12: 0.00812
- Orion DS-96: 0.00069
From these numbers (even though they are quite small), Microwulf is almost 3 times better than the DS-12 and almost 35 times better than the DS-96 using this metric. We have no idea if this metric is truly meaningful, but it gives us something to ponder. It's basically the performance per unit power per unit cost. (OK, that's a little strange, but we think it could be a useful way to compare the overall efficiency of different systems.)
We might also compute the inverse of the MFLOP/Watt/$ metric, $/Watt/MFLOP, where you want this number to be as small as possible. (You want price to be small and you want Watt/MFLOP to be small.) So using this metric we can see the following:
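The combined metric is simple enough to recompute directly. This is a minimal sketch using the performance, load power, and price figures quoted earlier; the function name is ours.

```python
def mflops_per_watt_per_dollar(gflops: float, load_watts: float, price_usd: float) -> float:
    """Green 500 metric (MFLOPS/Watt) divided by price (higher is better)."""
    return gflops * 1000.0 / load_watts / price_usd


microwulf = mflops_per_watt_per_dollar(26.25, 450.0, 2470.0)
ds12 = mflops_per_watt_per_dollar(13.8, 170.0, 10_000.0)
ds96 = mflops_per_watt_per_dollar(109.4, 1580.0, 100_000.0)

print(f"Microwulf:   {microwulf:.5f}")
print(f"Orion DS-12: {ds12:.5f} (Microwulf is {microwulf / ds12:.1f}x better)")
print(f"Orion DS-96: {ds96:.5f} (Microwulf is {microwulf / ds96:.1f}x better)")
```

With these inputs Microwulf comes out at roughly 0.0236, which matches the "almost 3 times" and "almost 35 times" ratios.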
- Microwulf: 144,083
- Orion DS-12: 811,764
- Orion DS-96: 6,924,050
This metric measures the price per unit power per unit performance. Comparing Microwulf to the Orion systems, we find that Microwulf is about 5.63 times better than the DS-12, and 48 times better than the DS-96. It's probably a good idea to stop here, before we drive ourselves nuts with metrics.
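The listed figures appear to be reproduced by reading $/Watt/MFLOP as price divided by (Watts per MFLOP), which is the same as price multiplied by MFLOPS/Watt. That reading is our assumption, not something the text states; the sketch below uses it with the figures quoted earlier, and the function name is ours.

```python
def price_over_watts_per_mflop(price_usd: float, load_watts: float, gflops: float) -> float:
    """Price divided by Watts-per-MFLOP (assumed reading of $/Watt/MFLOP)."""
    watts_per_mflop = load_watts / (gflops * 1000.0)
    return price_usd / watts_per_mflop


microwulf = price_over_watts_per_mflop(2470.0, 450.0, 26.25)
ds12 = price_over_watts_per_mflop(10_000.0, 170.0, 13.8)
ds96 = price_over_watts_per_mflop(100_000.0, 1580.0, 109.4)

for name, value in (("Microwulf", microwulf), ("Orion DS-12", ds12), ("Orion DS-96", ds96)):
    print(f"{name:<12} {value:12,.0f} ({value / microwulf:.2f}x Microwulf)")
```

One caveat with this reading: a system with better power efficiency (smaller Watt/MFLOP) gets a larger number, not a smaller one, so the result is not strictly the reciprocal of MFLOP/Watt/$ and is best treated as a rough comparison.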
Summary
We've seen that Microwulf is a 4-node, 8-core Beowulf cluster, measuring just 11" x 12" x 17", making it small enough to fit on a person's desktop, or fit into a checked-luggage suitcase for transport on an airplane.
Despite its small size, Microwulf provides measured performance of 26.25 GFLOPS on the HPL benchmark, and it cost just $2,470 to build in January 2007. This gives it a price/performance ratio of $94.10/GFLOP!
Microwulf's excellent performance is made possible by:
- Multi-core CPUs becoming the standard. This makes it possible to cram a lot of processing elements in a small volume.
- Decreasing memory prices. While still the single most expensive part of Microwulf, memory prices continue to decline (see below), making it possible to avoid starving a computation of memory.
- Gigabit Ethernet becoming the standard. On-board GigE adaptors, inexpensive GigE NICs, and inexpensive GigE switches allow Microwulf to offer enough network bandwidth to avoid starving a parallel computation with respect to communication.
We hope we have whetted your interest in building a Microwulf of your own! We encourage you to take our basic Microwulf design and vary it in whatever way suits your fancy.
For example, aside from the fans, the one moving (i.e., likely to fail) component in Microwulf is the 250GB hard drive. With 4GB flash drives dropping in price, it would be an interesting experiment to replace the hard drive with 1-4 flash drives, to see if/how that affects performance.
Since memory prices have dropped since January, another variation would be to build a Microwulf that is balanced with respect to memory (16 GB of RAM, 2 GB per core). Recalling that HPL kept running out of memory when we increased N above 30,000, it would be interesting to see how many more FLOPS one could eke out with more RAM. The curve in Figure Six suggests that performance is beginning to plateau, but there still looks to be room for improvement there.
Since our MSI motherboards use AM2 sockets, another variation would be to build the same basic system, but replace our dual-core Athlon64s with AMD's new quad-core Athlon64 CPUs. (Hopefully available by the time you read this article.) That will give you a 16-core cluster in the same volume as our 8-core cluster. It would be interesting to see how much of a performance boost the new quad cores would provide; whether the price/performance is better or worse than that of Microwulf; the effect of additional GigE NICs on performance; and so on.
These are just a few of the many possibilities out there. If you build a Microwulf of your own, let us know -- we'd love to hear the details!
For more information (especially pictures) of Microwulf, please see our project site. Thanks for reading!
Joel Adams is professor of computer science at Calvin College. He earned his PhD in computer science from the University of Pittsburgh in 1988. He does research on Beowulf cluster interconnects, is the author of several books on computer programming, and has twice been named as a Fulbright Scholar (Mauritius 1998, Iceland 2005).
Tim Brom is a graduate student in computer science at the University of Kentucky. He received his Bachelor of Computer Science degree in May 2007 from Calvin College.
Epilogue: Prices for Microwulf (August 2007)
System component prices are dropping rapidly. CPUs, memory, network, hard drives - all are dropping at a pretty good clip. So how much would we pay if we built Microwulf in August, 2007? Let's find out using Newegg for our prices.
Component | Product | Unit Price | Quantity | Total |
Motherboard | MSI K9N6PGM-F MicroATX | $50.32 | 4 | $201.28 |
CPU | AMD Athlon 64 X2 3800+ AM2 CPU | $65.00 | 4 | $260.00 |
Main Memory | Corsair DDR2-667 2 x 1GByte RAM | $75.99 | 4 | $303.96 |
Power Supply | LOGISYS Computer PS350MA MicroATX 350W Power Supply | $24.53 | 4 | $98.12 |
Network Adaptor | Intel PRO/1000 PT PCI-Express NIC (node-to-switch) | $34.99 | 4 | $139.96 |
Network Adaptor | Intel PRO/100 S PCI NIC (master-to-world) | $15.30 | 1 | $15.30 |
Switch | SMC SMCGS8 10/100/1000Mbps 8-port Unmanaged Gigabit Switch | $47.52 | 1 | $47.52 |
Hard drive | Seagate 7200 250GB SATA hard drive | $64.99 | 1 | $64.99 |
DVD/CD drive | Liteon SHD-16S1S 16X | $23.83 | 1 | $23.83 |
Cooling | Zalman ZM-F3 120mm Case Fans | $14.98 | 4 | $59.92 |
Fan protective grills | Generic NET12 Fan Grill (120mm) | $6.48 | 4 | $25.92 |
Support Hardware | 36" x 0.25" threaded rods | $1.68 | 3 | $5.00 |
Fastener Hardware | Lots of 0.25" nuts and washers | | | $10.00 |
Case/shell | 12" x 11" polycarbonate pieces (scraps from our Physical Plant) | $0.00 | 4 | $0.00 |
Total | | | | $1,255.80 |
(Check the price links above -- by the time you read this, the prices have probably dropped even lower!)
Assuming that we'll get about the same performance and the same power consumed under load, this means Microwulf's August 2007 price/performance ratio would be $47.84/GFLOP, breaking the $50/GFLOP barrier! What a price drop!
The power/performance numbers stay the same (Watt/GFLOP and MFLOP/Watt).
The metrics that combine price, performance, and power are interesting. The MFLOP/W/$ is 0.04645, which is almost twice as good as the original Microwulf. The $/Watt/MFLOP is 73,255, which is again about twice as good as the original Microwulf.
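The August 2007 figures can be rechecked with the same arithmetic. This sketch keeps the article's assumption that performance (26.25 GFLOPS) and load power (450W) are unchanged and only the price moves from $2,470 to $1,255.80; the variable names are ours, and the last metric is computed as price multiplied by MFLOPS/Watt, which is the reading that reproduces the quoted numbers.

```python
GFLOPS = 26.25        # assumed unchanged from the January build
LOAD_WATTS = 450.0    # assumed unchanged from the January build
PRICE_JAN = 2470.00
PRICE_AUG = 1255.80

mflops_per_watt = GFLOPS * 1000.0 / LOAD_WATTS

for label, price in (("January 2007", PRICE_JAN), ("August 2007", PRICE_AUG)):
    usd_per_gflop = price / GFLOPS
    combined = mflops_per_watt / price        # MFLOP/W/$
    price_times_eff = price * mflops_per_watt  # assumed reading of $/Watt/MFLOP
    print(f"{label:<13} ${usd_per_gflop:6.2f}/GFLOP "
          f"{combined:.5f} MFLOP/W/$ {price_times_eff:10,.0f} $/Watt/MFLOP")

print(f"Price ratio, January to August: {PRICE_JAN / PRICE_AUG:.2f}x")
```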
Our special thanks go out to Jeff Layton, who did the lion's share in putting this article together, working out the metrics, and so on. Thank you, Jeff!