Why Is Cluster HPC So Hard?

Published on Tuesday, 03 January 2006 14:00
Written by Douglas Eadline

If it were easy, everybody would be doing it, right? And what about those multi-cores?

Update: Microsoft has released their Windows Compute Cluster Server 2003 and the Beowulf list is again actively discussing this topic. In addition, previous postings on the Beowulf Mailing List (look for "Why I want a Microsoft cluster...") discussed the entry of Microsoft into the HPC cluster space. I found the discussions interesting and well informed. I did, however, take a step back to look at some fundamentals that define the HPC (High Performance Computing) market and came to the conclusion that before anyone "takes over anything", there are some issues that need to be addressed.

The fundamental issue is that doing HPC is hard. There is no easy way around it and there are no shortcuts. Practitioners need to roll up their sleeves and work to get the performance and results they desire. Unless Microsoft has some magic, all the corporate Windows goodness will not help them in this arena. Just like everybody else, they will have to roll up their sleeves. And, by the way, money cannot necessarily buy magic. For the record, that is all I'm going to say about Microsoft because, in my opinion, the things holding back HPC have little to do with the plumbing and a lot to do with the fundamentals. If you are looking for an anti-Microsoft rant, this is the wrong article. Please stop reading.

Update: Joe Landman has some great follow-up comments at scalability.org.

Thanks for continuing. Now let's talk about the hard stuff. I previously covered some of the important issues Linux addresses in the HPC world. But clearly, these advantages are not enough. Piling up processors to achieve heroic TFLOPS numbers sounds like a dream come true, but one has to wonder: where are all the killer cluster applications that take advantage of the unlimited computational power? The Blue Collar Computing effort at OSC has a very good take on this issue.

I wonder about this as well, and I believe I know part of the answer. Let's look at this from a product standpoint because if there is an HPC market, there need to be products that solve problems and earn money.

How to enter the HPC market

As a savvy business person, you know that hardware is cheap, software is freely available, the need is real, so there must be a way to clean up in this market. How are you going to "productize" this trend and take over the world?

Build an Appliance

The appliance concept is a good approach. We buy everything from televisions to toasters as appliances. No need to build it from parts because the market for these items is so large it is economical for a manufacturer to make millions of these items. And, they are easy to use.

An appliance is usually simple to use because it is built to perform a very specific set of tasks. These constraints also make it easy to service. With a game console, the games may be different, but the reason they all work (mostly) is a very tightly controlled hardware and software environment. If it breaks, the recipe for fixing it is known (a diagnosis flow chart). Or, more commonly, you just throw it out because it is cheaper to build a new one than to pay someone to fix the old one.

So where are the HPC appliances? I will argue that they exist, but not quite at the desktop level. As an example, modern medical imaging machines are HPC appliances. They do a lot of very specific calculations at the push of a button. The new generation of game consoles (to a degree) are HPC appliances. And what about the desktop? Why don't I have a simple sixteen processor cluster cube that I can plug into my desktop PC and run HPC codes to my heart's content? (Those old enough to remember the Inmos Transputer know that in 1987 it was possible to stuff quite a few very fast processors in desktop PCs.)

Even today, it is not hard to envision a cluster cube using commodity hardware and open software. There would probably be some constraints: gigabit Ethernet, a preset number of processors, and limited memory and storage options. Such a system would be an appealing target for an ISV (Independent Software Vendor). Like a game console, a cluster cube represents a predictable and reproducible environment. So where is it? Why isn't everyone running out and building cheap parallel systems? The multi-core processor strategies will put plenty of processors on your desk in the coming years. Is the market ready? In a word, no.

In my opinion, there are several reasons why dedicated HPC appliances might fail. First, there does not seem to be a "one size fits all" cluster design. Fine, you say, but then there must be a subset of applications for which a general cluster can be designed. Perhaps, but even within a "problem space", the design of the cluster may need to reflect how the application is to be used. Some users may need a small number of "fat nodes" (a large amount of memory) while others may need a larger number of "skinny nodes" (a small amount of memory) to run a specific type of problem using the same application.

A second reason is the "boat anchor" problem. Those old game consoles worked well, but want to run a next generation game on them or try something new? Sorry, that is not possible. With the exception of the people who seem to be able to run Linux on anything, the convenience of a console also means sacrificing some control. Nintendo Gamecubes may run Linux, but you can't add memory or upgrade the CPU. Appliances also help vendors "lock in" customers. Customers usually don't like this approach. At some point, a vendor may also stop supporting the appliance. Now you have a boat anchor.

Finally, and perhaps most importantly, there is a lack of application software to drive sales of such a device (including desktop multi-core machines). Thus, we have finally arrived at what I consider one of the hard parts. I believe the administrative software issues (the plumbing, as I call it) are largely solved for cluster appliances and multi-core systems. The application software is not so easy. We'll take a look at this issue shortly. For now, let's look at the more traditional approach to HPC.

Build a Cluster

The HPC cluster has a thousand faces. It can be a small eight box cluster or it can be a large 1024+ node beast. The choices are plentiful, the prices are reasonable, and the software is freely available. You can offer customers unbelievable amounts of computing power housed in your custom designed enclosures. Microsoft now offers a "turn-key/point and click" Windows cluster system (complete with support for ISV applications). Why isn't the world running toward this solution?

For me, the cluster/HPC proposition is kind of like offering to put wings and jet engines on cars. You can give your customers the freedom to travel faster and farther than before, but your customers don't know how to drive cars all that well let alone become sky pilots. Plus, the infrastructure is not there to support the new breed of flying cars.

For those practiced in the art of HPC (i.e. those that know how to fly), clusters provide a large amount of "bang for the buck." End users need only buy what they need and no more. Recently at SC05, IDC reported that over the last five years the use of HPC clusters has exceeded IDC's optimistic projections. In the last two years alone, clusters have grown from a one third market share to encompass almost half the market. Large capability systems (heroic supercomputers) have seen a decrease in market share. HPC clusters are disruptive. And by the way, IDC only counts those units shipped as "clusters"; they don't count what they call "dark clusters" built by end users.

So where is the desk side/top cluster? There have been some efforts in this area including Orion Multisystems (now defunct) and the recent introduction of the Personal Cluster by Penguin Computing. (Update: Tyan has announced their personal super computer (PSC) and Ciara has announced a desk side cluster as well.) That is 200 GFLOPS next to your desk! Just for you! What are you going to do with it?

The Hard Stuff

So now we come to the hard stuff and attempt to answer the question: why can't I sell clusters by the boat load? As any good marketer will tell you, segmenting the customer base is important. So, let's narrow the focus a bit. Clearly, a word processor is not going to need 200 GFLOPS. On the other hand, the engineers, biologists, chemists, physicists, and assorted domain experts can use HPC. Indeed, the 2004 HPC Users Survey found that HPC was considered essential for business survival and competitiveness. So the recurring question is: the hardware is cheap, much of the software is freely available, industry needs HPC, so what is the problem? I'm glad you asked. Here is my take on this issue.

Clusters are a paradigm shift.

The way I like to explain the shift is that instead of building (modifying) the problem to fit the supercomputer (vectorizing your code), we now can modify the supercomputer to fit the problem. While codes must be parallelized to run on clusters, the optimization does not stop there. The classic cluster issue is often the nodes vs. network problem. Given a fixed budget, where do I put my money? Well, as with all things cluster, it depends. If your problem set runs well on gigabit Ethernet you can buy more processors. If you need a very fast interconnect (i.e. expensive) you must cut down on the number of processors to stay within your budget. There are other issues as well including I/O, memory size, dual core, compilers, filesystems -- the list gets rather large.
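To put some rough numbers on the nodes vs. network trade-off, here is a minimal Python sketch. Every price in it is a made-up assumption chosen only for illustration (not a vendor quote), and the function name `affordable_nodes` is hypothetical. It simply shows how the per-node cost of the interconnect eats into the number of nodes a fixed budget can buy.

```python
def affordable_nodes(budget, node_cost, interconnect_cost_per_node):
    """Return how many whole nodes fit in the budget when each node also
    pays for its share of the interconnect (adapter, cable, switch port)."""
    per_node_total = node_cost + interconnect_cost_per_node
    return budget // per_node_total


# Hypothetical prices, for illustration only.
BUDGET = 100_000
NODE_COST = 2_000
GIGE_PER_NODE = 100      # assumed cost of commodity gigabit Ethernet
FAST_PER_NODE = 1_500    # assumed cost of a faster, pricier interconnect

gige_nodes = affordable_nodes(BUDGET, NODE_COST, GIGE_PER_NODE)
fast_nodes = affordable_nodes(BUDGET, NODE_COST, FAST_PER_NODE)

print(f"gigabit Ethernet:  {gige_nodes} nodes")   # 47 nodes
print(f"fast interconnect: {fast_nodes} nodes")   # 28 nodes
```

Under these assumed prices the fast network costs you 19 nodes. Whether that is a good trade depends entirely on how your application behaves, which is exactly the kind of question the end user now has to answer.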

Instead of our domain expert running a code on a supercomputer with a highly defined software and hardware environment, he or she has to understand a whole load of details previously handled by the nice people at your local supercomputing company. Indeed, clusters are cheaper because the cost of engineering an integrated solution has shifted from the provider to the user. Read that again. And, if you are starting to think that the traditional supercomputer was an expensive HPC appliance, you are quite right.

For the more scientifically inclined, there is a kind of conservation of cost when it comes to HPC. Cost in this sense is both time and money because the time to solve an implementation problem often cannot be reduced with money. The low price of clusters did eliminate some costs, but shifted many of the non-reducible costs to the end user.

Clusters and multi-cores are hard to program.

Parallel programming is a tough nut to crack. The parallel programming problem has been around for quite a while and there are no signs it will be resolved any time soon. It was an issue before clusters came on the scene and will be a huge issue for multi-core systems.

Why is parallel programming so hard? Ever try and get a group of people to work together at the same time? Now, think about what it would be like if you removed their brains and had to tell each person exactly what to do. That is explicit parallel programming. It basically sucks. It is like programming in machine code, but is actually worse. There is no guarantee that a parallel code you write will be both portable and efficient on all architectures.

There are software tools. For clusters we have things like MPI (Message Passing Interface). Multi-core systems have threads, OpenMP and even MPI. All of these provide a framework to manage your workers, but before you code up your real problem, you have to manage the workers. The programming cost always falls on the user or ISV. However, in the past, programmers got a free lunch from the hardware side. The cost (time and money) for continued performance gains just got much bigger for everyone.
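To make "managing the workers" concrete, here is a small sketch using threads from the Python standard library. It is not MPI or OpenMP, and the helper names (`split_range`, `partial_sum`, `parallel_sum_of_squares`) are hypothetical. The real problem is a one-line sum of squares; everything else is bookkeeping: dividing the work, handing it out, and combining the answers. In standard CPython, threads will not speed up arithmetic like this, so read it as an illustration of the overhead in code, not as a performance recipe.

```python
from concurrent.futures import ThreadPoolExecutor


def split_range(n, workers):
    """Divide range(n) into `workers` contiguous (start, stop) chunks."""
    base, extra = divmod(n, workers)
    chunks, start = [], 0
    for i in range(workers):
        # The first `extra` workers take one additional item each.
        stop = start + base + (1 if i < extra else 0)
        chunks.append((start, stop))
        start = stop
    return chunks


def partial_sum(bounds):
    """The work one worker does: sum the squares in its own chunk."""
    start, stop = bounds
    return sum(i * i for i in range(start, stop))


def parallel_sum_of_squares(n, workers=4):
    """Tell each worker exactly what to do, then combine the results."""
    chunks = split_range(n, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(partial_sum, chunks))
    return sum(partials)


# The "real problem" is this single line when written serially.
serial = sum(i * i for i in range(1000))
parallel = parallel_sum_of_squares(1000, workers=4)
print(serial == parallel)  # True
```

Note that nothing here deals with communication between workers, load imbalance, or failures. Those are the parts that make real codes hard.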

There is a lack of people to help with the hard stuff.

The situation is quite understandable. Clusters came on very quickly, disrupted everything, and now a new paradigm has emerged. There will be a lag while the rest of the market catches up. There are two areas that have been affected. The first is on the administrative side of things. The second is with the domain experts (end users). Not only is there a pile of new information to be learned, but the role each plays in the cluster environment has changed. Administrators need a better understanding of the domain issues and the end users need a better understanding of the plumbing issues.

Making Things Less Hard

Until these hard issues get addressed, uptake of HPC clusters will be slow. And, as multi-core processors spread, things are going to get hard for many more people. Parallel computing is coming to the masses and no one knows how to make it easy. Think about that for a moment.

Getting back to our marketing plan, we see that success in this market is not going to be about how many processors you can shove in a rack, how easy it is to administer 1024 nodes, or what operating system runs on your cluster. These issues (and others) are important and solvable. They also present a great way for vendors to differentiate themselves in a commodity market. The hard issues will, however, stifle the growth we all know is possible. Sorry for the bad news.

All is not lost, however. To prove that I can do more than throw cold water on the HPC cluster market, I'll have some suggestions in a future installment and, more importantly, some ideas on how you can help.

Douglas Eadline is the swinging Head Monkey at ClusterMonkey.net.
