Why Linux on Clusters?

Published on Saturday, 06 August 2005 20:00
Written by Douglas Eadline

Linux on HPC clusters seems like an obvious choice. It was not, however, a foregone conclusion that Linux would end up leading the supercomputing parade when Tom Sterling and Don Becker used it to build the first Beowulf cluster. Inquiring minds want to know: "Why Linux on Clusters?"

The Nagging Question: Why Linux on Clusters?

Note: This article originally appeared in the November 2004 issue of Linux Magazine. Due to space constraints, some of the original article had to be cut from the magazine. The following is the original, longer version (director's cut?) and while it contains extra arguments for Linux on clusters, it does lack some of Martin Streicher's fine editing.

Mention HPC clusters and one of the first things that comes to mind is Linux. One might assume the reason is that Linux is very popular and free. While this is true, HPC (High Performance Computing) clustering has grown on its own merits and not necessarily on the popularity or cost of Linux in general. Of course, there are server farms based on Linux, but I'm talking about the big honking clusters crunching away on some fundamental problem in science. What did these users see in Linux that has made it so successful? And, more importantly, what can it tell us about how we might approach HPC in the future?

In the Beginning

The history of the Beowulf project is a fascinating read. The article Beowulf Breakthroughs by Tom Sterling in the June 2003 issue of Linux Magazine is an excellent description of the motivation and people who pioneered the effort. The article is also a great "myth buster" of sorts, as it sets the record straight on a few historical details that tend to get confused from time to time. For instance, I once overheard a discussion at a trade show where someone was quite proudly pointing out that the name "Beowulf" was chosen (in 1993) as a response to the Microsoft "Wolfpack" NT cluster project (announced in 1995). For more information, see the What is a Beowulf? sidebar.

In any case, why did the Beowulf project use Linux? The answer was rather pragmatic. They needed a UNIX-like OS for commodity hardware, and at the time (circa 1993) BSD was embroiled in a legal dispute that made it effectively inaccessible. When Linux was introduced, it provided an alternative. As Tom Sterling has stated:

To be sure, Linux wasn't the first Unix-like operating system to run on a PC, and in the beginning, it wasn't even the best. But unlike anything that had come before it, Linux was the focus and consequence of a revolutionary phenomenon made possible through the Internet: hundreds of coders from around the world, most of which had never met each other, working together, sharing expertise on a new operating system.

It should also be mentioned that at this time Linux needed Ethernet drivers, and if it was to be used in a high performance fashion, it needed good Ethernet drivers. Working as part of the Beowulf Project, Don Becker added this missing piece to Linux. It was a huge benefit to clustering. As new Fast Ethernet adapters became available, Don was right there to provide the drivers.

Considering the great growth and acceptance of Linux, the choice to use and augment Linux could be considered a brilliant move. In the following discussion, however, we explore not only why Linux has worked so well from the beginning, but why it also flourishes in the current HPC ecosystem.

The Plug and No-Pay/Play Proposition

In the beginning, the users of HPC systems were almost exclusively a UNIX crowd. Dealing with multiple users, large files, remote access, and the like was not something to be found in commodity software offerings from Microsoft or Apple. UNIX was the way it was done. Of course, there were many different dialects, but it was still UNIX.

Using Linux for clusters was very attractive because it was literally plug-and-play for many users. Well, to be accurate, it was at least plug-recompile-and-play. As an example, consider communication middleware like PVM or MPI. These packages were ported very quickly to a Linux environment because Linux looked very much like the other environments the packages already supported (e.g. Sun workstations). Many of these packages had already been built using GNU tools as well, so quite often porting would be nothing more than juggling include files.

Easy porting was really only half the story. Because Linux distributions were freely available, the cost to play was very low. Cobbling together a few Pentium Pro boxes with MPICH and Linux was not an expensive proposition. Indeed, it provided a very low barrier to entry for cluster computing.

The Open Thing

Of course, the openness of Linux (and Linux distributions) has ignited the growth of Linux for HPC clusters. As mentioned, an almost zero cost of entry provided easy access to test the technology. Once the system was shown to work, the advantages were at times staggering. A ten-fold improvement in price-to-performance over traditional HPC machines was not uncommon. People notice these types of things.

One of the first things HPC users realized they could do was build their own kernel. While new features are always welcome, compute nodes did not really need sound support, laptop power management, or those very interesting Ham radio extensions. A user could build a kernel with exactly what they needed and no more. A small, compact kernel means more memory to crunch numbers. Indeed, users have the option of using small monolithic (modules compiled in) kernels for their specific needs. Compute nodes could be further pared down with smaller distribution sets, smaller kernels, and even run without hard drives. The compute node can be maximized (or minimized, depending on how you look at it) in any number of ways for the problem at hand. With clusters, "less" is often better.

Another advantage of openness is the ease with which hardware can access the kernel internals. This capability is particularly important with clusters. A common method to optimize communications is called "kernel bypass". In this method, communication takes place outside the kernel networking stack, and memory is copied (messages passed) between two independent process spaces on different nodes. Of course, to implement this you need to involve the kernel. A particularly good example of this has been the drivers provided by Myricom for their Myrinet network hardware. Many other examples also exist. The Gamma project has optimized Ethernet drivers to provide lower latency communications for clusters as well. All of these optimizations are HPC specific and only possible because the kernel is open.

Another advantage of an open kernel is that version problems can often be fixed by the user. These types of problems come up all the time in clusters. A new driver or library needs to be compiled for the kernel you are running. If the vendor released the driver as open source, you can solve the problem with a quick re-compile. In a closed source scenario, the vendor may need to release countless binary versions of their driver/library (none of which seem to work with your kernel, by the way).


The other main advantage is that the entire "cluster plumbing" is open. It allows optimizations and fixes that may not be needed by the mainstream and thus are deemed unimportant by the kernel maintainers. A good example of this is the TCP acknowledgment fix implemented by Josip Loncaric. (See Sidebar: Josip's Fix)

Another shining example of cluster customization has been the process migration facilities introduced by bproc and openMosix. They were able to do things with an open kernel that would be almost impossible in a closed source environment.

If we move above the kernel to the distribution level, we see a large amount of "customization" being done for specific HPC distributions of Linux. The BioBrew distribution is an example of a full Linux version tailored to bioinformatics users. Open software seems to have no bounds when it comes to HPC. If there is a need, the infrastructure is available for customization.

Finally, another factor that is often taken for granted is the Internet. Open collaboration and sharing would be quite difficult without it. News, packages, distributions, fixes, updates, patches, How To's, mailing lists, and even grids, all circulate freely throughout an international community.

The Marketing Department is Closed

In a closed source model, the features that end users see are, of course, determined by the owner of the source code. Deciding what features a new product should have often falls into the hands of the marketing department. A good marketer checks to see what the competition has and what the current users want, and makes a decision based on the cost to implement and release new features. If you and a small cadre of users require some special feature, you are at the mercy of the "marketing optimization" equations. For closed source, there is no other way. If you don't make the features list, you are, as they say, SOL (bad-word-your-mom-told-you-not-to-say Out of Luck).

In the case of HPC, many features are at the bottom of the list because the HPC market is not that big compared to other market segments. A vendor will see a better return on its money by appeasing the bigger markets.

Let's look at process migration as an example. Both bproc and openMosix required access to the intimate details of the kernel. These packages are extraordinarily useful to the HPC market. The funny thing is, they only show up in open software. There is no marketing department attempting to optimize ROI (Return On Investment). There are users who need something, there are implementors who will build things and get paid to keep them working, and no one else (no costs) in the middle. Marketing, in a sense, has been optimized out of the equation.

Sidebar One: What is a Beowulf?
Perhaps more than any other question, this one always comes up in conversations about HPC clusters. To answer it, you need to understand that Beowulf was the name of a project, not of a particular cluster. Tom Sterling says it best:

Then one afternoon, Lisa, Jim Fischer's accounts manager, called me and said, "I've got to file paperwork in 15 minutes and I need the name of your project fast!" or some words to that effect. I was desperate. I looked around my office for inspiration, which had eluded me the entire previous month, and my eyes happened on my old, hardbound copy of Beowulf, which was lying on top of a pile of boxes in the corner. Honestly, I haven't a clue why it was there. As I said, I was desperate. With the phone still in my hand and Lisa waiting not all that patiently on the other end, I said, "What the hell, call it 'Beowulf.' No one will ever hear of it anyway." End of story.

Beowulf is more a concept or methodology than a thing. A very good operational definition can be found in one of the first books on the subject, How to Build a Beowulf: "A Beowulf is a collection of personal computers (PCs) interconnected by widely available networking technology running any one of several open-source Unix-like operating systems."

So if you want to call your cluster a Beowulf, you need to play by the rules. The key words are personal computers, widely available networking, and open-source. Because there is no such thing as a shrink-wrapped "Beowulf", these key words are very important. The use of personal computers could probably be replaced with commodity servers/computers. The widely available networking part provides a bit of wiggle room. One of the major design goals of early Beowulf systems was to keep them as commodity as possible. The benefits of using only commodity components are low cost, no unique vendors, a guaranteed upgrade path, and rapid introduction. The problem with using only commodity networking components is that it may severely restrict certain networking technologies that are crucial for performance. With this in mind, Beowulfs can be built from less mainstream technologies (such as Myrinet, QsNet, SCI, or InfiniBand). The jury is still out on whether InfiniBand is a commodity technology.

The last requirement, open-source, is probably what differentiates a Beowulf from all other "clustered" systems. The use of an open-source operating system allows customization and rapid changes that are virtually impossible with commercial operating systems. Another side effect of open-source is the absence of licensing fees for each computer in the cluster. Open-source does not preclude operating systems like BSD from being used, but the GNU license under which Linux is distributed has pushed Linux to the forefront of the market. Finally, it should be noted that HPC clusters composed of Windows NT or Solaris machines are not strictly Beowulf systems. They certainly can be clusters, but probably should not be called Beowulfs.

Lawyer Free Zone

In 1997, I found an article in EE Times describing the creation of a "Lawyer Free Zone" in Scotland to help foster collaboration in the semiconductor market. Interesting idea.

When I think about Linux and clusters, I think about how the GPL has created a "Lawyer Free Zone" for software development. (SCO, of course, believes otherwise.) Think about the fact that there are people from many large companies (like IBM, SGI, Sun, and HP) who would, outside of the GPL, never put their development people in the same room -- let alone co-develop software. The large array of Linux file systems is only one example of how clusters have benefited from this safe haven. In a sense, the GPL has lowered the "lawyer latency" (measured in months/years) for collaborative projects to near zero.

In addition, discussion on mailing lists and at technical meetings is also unencumbered. Everyone benefits by co-operating, which, by the way, is the goal of any successful legal agreement.

Because We Play Computer Hardball

From my experience with the HPC community, I can say with complete confidence that if Linux were unstable or did not work as expected, it would have been given the boot long ago. Losing a week's worth of results because of a node crash can be a serious setback. Although a lower level of stability is often an accepted part of the mainstream, it will find no quarter in the HPC world. The HPC market, by definition, pushes the limits of everything it touches. In this respect, Linux is a major league player.

Vendor Lock-in

The classic business strategy of selling a customer something that requires them to continue buying products and services has fueled the growth of many companies. It has also been the best source of boat anchors in the HPC market.

Let's consider a common scenario. Your organization buys a nice new supercomputer called the Whopper Z1 from FBN (Fly By Night) Systems. The Whopper Z1 runs WOS 1.0, a version of UNIX ported for their system. The computer works well for the first year. Everyone is happy. Then, in the second year, you want to add more memory. Well, in order to keep the service contract intact, you need to buy the memory from FBN Systems. Funny, it looks like the memory you bought for your home computer, but it costs ten times more than you paid. So you upgrade the memory, and while you are at it you upgrade to the next version of WOS (version 2.0). Everything is fine, until year three. It turns out that the Whopper Z1 is now going "off contract" because a new replacement system your organization is buying, called the Whopper Z2, has been installed. The Whopper Z2 also has a new version of WOS (version 3.0) which does not run on the old Z1 system. Now the old Whopper Z1 is pretty much useless and will be kept on-line for another year to allow everyone to move their codes over to the new machine. After this time, you cannot really sell it or use it, because hardware and software support is expensive and the system is considered obsolete. Ah, but if you tied a rope to it, it could indeed be used as a boat anchor.

Now consider the scenario where Linux was used for the operating system. Since the source code is available, you can, if you choose, keep the old Whopper Z1 running without a support contract. You can find people who can help you fix things. You may even have some "Linux hackers" on staff because they have been running Linux at home for five years. And, as you find out, this is a good thing, because FBN Systems goes out of business and you are now stuck with two large pieces of hardware and a tape with a binary version of WOS on it.

In the end, "vendor lock-in" is always bad for the customer. No one likes to hear "you can't do that." The words "can't" and "Linux" are not often used in the same sentence.


Ownership and Community

Somewhat similar to the vendor lock-in issue is the concept of ownership. Why is it that some people use Linux to solve big problems and at the same time run it on old cash registers? I suspect the answer lies in the level of control provided by the software environment. In the absence of anyone saying "no you can't", there are many people saying "what if". Indeed, Linux, in a sense, has become the paint with which an artist can express whatever they want. Along with this expression of ideas comes ownership. It becomes "your" masterpiece and you control your destiny.

Another part of ownership is community. Since many hands have helped create the "practice and art of cluster computing", you can become a co-owner by simply helping a new person with a question on the Beowulf mailing list. The "community knowledge base" is quite immense and growing each day. If you experience a problem or have a question, rest assured, there is almost always someone else who is an email away from helping -- who, incidentally, has suffered a similar fate. And, in an open environment, the quality of the answers you receive is often higher.

Sidebar Two: Josip's Fix
In March of 1999, Josip Loncaric found something peculiar in the Linux TCP stack. It seemed that the kernel was introducing delays for small packets. The behavior had to do with how the kernel implemented delayed acknowledgments and TCP timeouts. While this behavior had minimal impact on pretty much every other corner of the market, the cluster community saw very poor performance for small packets.

The availability of source code allowed Josip to address the issue for two different kernels. The first patch resolved the issue for the 2.0.36 kernel, and a second patch for the 2.2.12/13 kernels allowed the user to tune the short-packet behavior.

Those cluster users who relied on the kernel TCP implementation were able to patch and rebuild kernels that worked better for small packets. There were no marketing decisions to be made, no release schedule, no NDA to sign, and no drawn-out decision process. Josip just fixed it.

You can read a report about the fix here.

Challenges

In fairness, all is not rosy in Linux "clusterland." There are issues unique to this environment that will need to be addressed. Perhaps the most important issue is how ISVs (independent software vendors) are going to target a fast-moving and diverse software environment. It is not an easy problem. And, it is a problem that needs to be solved.

In addition, an ISV needs a hard line between their product and the cluster infrastructure. A misconfigured MPICH library should not be the problem of the ISV (although it often is).

Finding good professional support is also an issue. Clusters are diverse and thus make support from a single source difficult. New support models that leverage the openness of the cluster infrastructure may be the best way to proceed.

Beyond ISVs and support, another lurking challenge may be an outright kernel fork for HPC systems. A highly optimized HPC kernel may diverge so much from the original kernel that a new version is warranted.

It is not really a Linux thing

In a way, the success of Linux in the HPC world is as much about the openness as it is about the UNIX heritage. Clusters have been and will continue to be built with closed source systems. Simple-minded "Windows" vs. "Linux" religious arguments don't really solve anything. The real issue is that an open approach seems to be a much better way to address a small market with specific needs than a one-size-fits-all approach designed for mass markets. Both are valid models.

Indeed, HPC users now expect the "open plumbing" provided by Linux. The plumbing analogy is quite accurate. Many people live in a house where the plumbing just works. It comes and goes into the walls of the house. Most people have no intention of ever modifying the plumbing, but surely would like the option if they had the chance. They might get a bit perturbed, however, if they wanted to hang a picture. Since the plumbing is secret, putting a nail in the wall presents the risk of damaging the pipes. So, no pictures for you; the plumbing will work, but your home is just not as interesting as it could be.

Fundamentally, we all like options, the more the better. Linux on clusters, like Linux on most other things, is about maximizing choice. In the HPC world, the decision to use Linux may seem like the natural choice, but remember, "... you didn't come here to make the choice, you've already made it. You're here to try to understand why you made it. I thought you'd have figured that out by now." -- famous cluster architect.


Sidebar Three: Resources
Tom Sterling's Beowulf Breakthroughs

Myricom

Gamma Project

bproc

openMosix

BioBrew

Beowulf mailing list and Web Page

Book: How to Build a Beowulf, by Sterling, Salmon, Becker, Savarese, MIT Press, ISBN 0-262-69218-X