A brief look at some current options
In our last column, we took a general look at the problem of resource management. This time around, we're going to take a quick stroll through the options for resource management on your cluster. There are a variety of possibilities available, which vary by capability, price, license, and platforms on which they work. Many of the resource management packages available have forked into multiple versions or changed names, leading to a somewhat confusing marketplace for the cluster administrator trying to decide what to do. In this column, we'll trace the origins and history of the popular players, to try and give you some insight into how we got to the state of things today, as the philosophical differences that led to the splits may have a big impact on the features you'll get.
While the total number of choices is large, a few factors will greatly simplify your decision. A couple of packages dominate the world of production solutions. Some of the other options may be just as good, but only work with certain packages and distributions that you may or may not be using. By the end of this column, you'll have a rough picture of what's out there and what might work on your cluster. You would be well advised to check the project and vendor web sites for further information, as this topic is quite vast and individual needs vary greatly.
PBS itself has split into several versions. After its initial deployment at NASA, PBS, like all generally useful and freely available software packages, began to spread around the community. The team at Veridian concluded there was a market for this type of thing, so they decided to continue to develop PBS into a commercial product, which became PBS Professional. In order not to immediately orphan the existing PBS, they created the OpenPBS project, which would keep the source open and maintained for the version of PBS the community was using, though it wouldn't contain the new features being developed for PBS Pro. Both these versions live on today. In March of 2003, Veridian sold PBS Professional to Altair Engineering.
While releasing a version of OpenPBS was initially a reasonable step for Veridian to take, the open source cluster community soon grew tired of the limitations. In addition, Altair Engineering still controls the OpenPBS source, and is understandably not eager to spend time and effort maintaining community-contributed features that are available in the PBS Professional product. From the Altair perspective, OpenPBS is merely the gateway to PBS Professional. So, while OpenPBS is a solid, albeit limited, product which works well on clusters, it lacks increasingly important features for modern large clusters, such as scalable performance past a few tens of nodes and sophisticated scheduling algorithms (though it still retains some desirable qualities, like a lack of Windows support).
So, the open source community began developing capabilities for OpenPBS that went well beyond its initial capacity, but Veridian and then Altair had no incentive to incorporate them into the main OpenPBS code. Meanwhile, PBS had become widely adopted, so going in an entirely different direction didn't seem feasible. So, a new project was born, which was initially known as Scalable PBS (due to trademark issues, it became known as Storm, and now Torque, the Tera-scale Open-source Resource and QUEue manager). Torque was to be the all open source descendant of PBS, and, as the original name indicated, was to address issues of scalability, as well as performance, fault tolerance, more sophisticated scheduling and scheduling interfaces, and the incorporation of the many patches the community has developed for OpenPBS. While still (and probably perpetually) under active development, Torque is ready for use, reasonably stable, and is developing a fairly wide following.
All three major forks of PBS are still active and in use: PBS Professional as a commercial product making advances in fault tolerance and scalability, OpenPBS as the solid standby in wide use and the default in packages like OSCAR and ROCKS, and Torque as the open source community development platform of choice, used by a large group of do-it-yourselfers.
PBS (in all incarnations) consists of several components: a server, a scheduler, and the process that runs on all the compute nodes, known as a MOM. The server runs only on the head node, and is the process that actually accepts submission of jobs, maintains the queue of running jobs, and reports when jobs are completed. The MOM, or Machine-Oriented-Miniserver, process is fairly lightweight, which is a good thing, as a copy of the MOM process must run on every node in your cluster. The MOM interacts with the server to actually run each of your queued tasks on the compute nodes. The scheduler makes decisions about the order of jobs in the queue; most significantly, which job will run next. One of the common features of all versions of PBS is that the scheduler can be replaced with external schedulers containing different scheduling algorithms. The default scheduler in OpenPBS simply employs a first-come, first-served scheduling algorithm. PBS Pro, of course, uses a substantially more sophisticated policy.
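To make the division of labor between server and scheduler concrete, here is a minimal sketch in Python. It is not PBS code: the `Server` class, the `Job` record, and both policy functions are hypothetical names invented for illustration. It only shows the idea that the server owns the queue while a replaceable policy decides which job runs next, with first-come, first-served as the default.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Job:
    job_id: str
    nodes: int  # number of compute nodes the job requests


# A policy looks at the queue and the free node count and picks the next job.
Policy = Callable[[List[Job], int], Optional[Job]]


def fifo_policy(queue: List[Job], free_nodes: int) -> Optional[Job]:
    """First-come, first-served: only the job at the head of the queue may start."""
    if queue and queue[0].nodes <= free_nodes:
        return queue[0]
    return None


def smallest_first_policy(queue: List[Job], free_nodes: int) -> Optional[Job]:
    """A hypothetical replacement policy: start the smallest job that fits."""
    fitting = [job for job in queue if job.nodes <= free_nodes]
    return min(fitting, key=lambda job: job.nodes) if fitting else None


class Server:
    """Toy stand-in for the server: holds the queue, delegates ordering to a policy."""

    def __init__(self, total_nodes: int, policy: Policy = fifo_policy):
        self.free_nodes = total_nodes
        self.queue: List[Job] = []
        self.policy = policy

    def submit(self, job: Job) -> None:
        self.queue.append(job)

    def dispatch(self) -> List[str]:
        """Start jobs until the policy declines; return job ids in start order."""
        started = []
        while True:
            job = self.policy(self.queue, self.free_nodes)
            if job is None:
                break
            self.queue.remove(job)
            self.free_nodes -= job.nodes
            started.append(job.job_id)
        return started
```

Swapping `fifo_policy` for another function changes the order in which jobs start without touching the queue handling, which is roughly the role an external scheduler plays.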
Because of the plug-in scheduler feature, the most common way to run PBS is to replace the built-in scheduler with the Maui scheduler. This arrangement is the standard setup in the OSCAR system described last month, for instance. Maui is worthy of a column of its own, but basically Maui is a high-powered, open source, standalone scheduler. Maui focuses on scheduling functionality, and leaves the problems of launching jobs and dealing with users to resource managers like PBS. Maui achieves many of the scheduling goals described in a previous column through the use of a planning scheduling algorithm that supports reservations for particular jobs, and a backfill mechanism which looks for available space in the planned schedule to squeeze in more jobs. While PBS alone, particularly OpenPBS, is not much of a scheduler, it is a very solid resource manager, and the addition of Maui makes for a truly powerful combination.
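The reservation-plus-backfill idea can be sketched in a few lines of Python. This is a simplified illustration and not Maui's actual algorithm: it reserves a start time only for the first job that cannot run, and it assumes every job states its walltime accurately. The `Job` record and the `backfill` function are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Job:
    job_id: str
    nodes: int     # compute nodes requested
    walltime: int  # requested run time, in arbitrary time units


def backfill(
    queue: List[Job],
    free_nodes: int,
    running: List[Tuple[int, int]],
    now: int = 0,
) -> Tuple[List[str], Optional[Tuple[str, int]]]:
    """running holds (end_time, nodes) for jobs already executing.

    Returns the ids of jobs started now, plus the reservation made for the
    first blocked job as (job_id, start_time), or None if nothing is blocked.
    """
    queue = list(queue)
    running = list(running)
    started = []

    # Start jobs in queue order for as long as the head of the queue fits.
    while queue and queue[0].nodes <= free_nodes:
        job = queue.pop(0)
        free_nodes -= job.nodes
        running.append((now + job.walltime, job.nodes))
        started.append(job.job_id)
    if not queue:
        return started, None

    # Reserve the earliest time at which the blocked head job will fit.
    head = queue.pop(0)
    available = free_nodes
    reserve_at = None
    for end_time, nodes in sorted(running):
        available += nodes
        if available >= head.nodes:
            reserve_at = end_time
            break
    if reserve_at is None:
        return started, None  # head job is larger than the whole cluster
    spare = available - head.nodes  # nodes the reservation will not need

    # Backfill: a later job may start now if it cannot delay the reservation.
    for job in queue:
        if job.nodes > free_nodes:
            continue
        done_in_time = now + job.walltime <= reserve_at
        if done_in_time or job.nodes <= spare:
            free_nodes -= job.nodes
            if not done_in_time:
                spare -= job.nodes
            started.append(job.job_id)
    return started, (head.job_id, reserve_at)
```

With two of four nodes busy until time 10 and a four-node job at the head of the queue, a first-come, first-served scheduler would leave the two free nodes idle. The sketch instead starts any later job that fits in those nodes and finishes by time 10, while the large job keeps its reserved start.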
SGE is the descendant of a long line of queuing systems, dating back to an academic project known as DQS (Distributed Queuing System). DQS became a commercial product known as CODINE, which became popular particularly in Europe, and still has some active users there. DQS itself lived on for quite a while as well, and a Google search suggests it is also still in use. SGE, and particularly its core (non-grid) resource management functionality, is the direct child of CODINE.
SGE now exists in both open source and commercial versions. Both versions 5.3 and 6.0 are available for download, and the SGE "Enterprise Edition" is fully integrated into SGE 6.0. Advocates of SGE point to its ease of installation, documentation, graphical user interface, active support list, and its failover server process capability (a feature that has been in SGE for a while).
However, once LSF entered the Beowulf cluster world, it entered in a big way, with a rich and mature feature set, a solid interface that is familiar to PBS users, and powerful scheduling algorithms rivaled only by Maui in the open source world. Though LSF has a history as a closed-source commercial product, Platform has begun a push into the open source world with the release of the Community Scheduler Framework. Platform's commercial versions offer significant add-ons, like support for multiple clusters and standard grid services. Like PBS and SGE, Platform also supports integration with the Maui scheduler. Look for a more detailed review of Platform's LSF in a future column.
If you use a bproc-based cluster distribution, you have still more options. Like MOSIX, Scyld has its own resource management package (the Beowulf Batch Scheduler, bbq). Both of these are fairly basic but functional schedulers, useful out-of-the-box in only very simple environments. Both bjs and bbq are easy to extend, but this requires a burning desire to code your own scheduling algorithms. Fortunately, the bproc scheduling API has gained some momentum, and PBS Pro, LSF, and SGE will all work with it. I'd strongly consider one of these options if your cluster has more than a couple of users.
Clubmask is a full, bproc-based cluster distribution, so although it belongs in the bproc section, it's not something you can install on top of your existing bproc cluster. It aims to be the first "physicist-proof" cluster distribution, although development seems to have slowed. Clubmask does use some standard pieces of cluster software (such as Ganglia, Supermon, and of course bproc), but most importantly for this column, it includes its own resource management system. It has a fairly simple scripting interface and support for batch and interactive jobs, along with a scheduling interface sophisticated enough to support the Maui scheduler.
The next step in selecting a resource management package is to take a look at the interface. Almost all of the packages described here require jobs to be submitted in the form of scripts, though some provide GUI tools to create the scripts for the user. The script languages are rich, powerful, and of course vary from package to package. Most jobs will use only a small subset of the capabilities available, so a good practice as an administrator is to provide some template scripts for the common cases. In our next installment, we'll compare interfaces by delving into sample scripts for the leading resource managers.
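As one way to provide such templates, an administrator could generate job scripts from a few parameters. The Python sketch below is only an illustration: it assumes the `#PBS` directive style commonly used with OpenPBS and Torque (`-N` for the job name, `-l` for resource requests) and the `PBS_O_WORKDIR` environment variable, and the function name and defaults are hypothetical. Other packages use different directives, so check your resource manager's documentation before adapting it.

```python
from string import Template

# Template for a simple batch job. "$$" produces a literal "$" in the output,
# so the shell (not Python) expands PBS_O_WORKDIR when the job runs.
PBS_TEMPLATE = Template("""#!/bin/sh
#PBS -N $name
#PBS -l nodes=$nodes:ppn=$ppn
#PBS -l walltime=$walltime
cd $$PBS_O_WORKDIR
$command
""")


def render_job_script(name, command, nodes=1, ppn=1, walltime="01:00:00"):
    """Fill in the template for one common case and return the script text."""
    return PBS_TEMPLATE.substitute(
        name=name,
        command=command,
        nodes=nodes,
        ppn=ppn,
        walltime=walltime,
    )


if __name__ == "__main__":
    # Hypothetical job: one executable run on two nodes, two processors each.
    print(render_job_script("demo", "./a.out", nodes=2, ppn=2))
```

Users then only supply the job name, the command, and the resources they need, and the rest of the script stays consistent across the cluster.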
A quick summary of each package is given below. Please check with the project/vendor sites, as information may have changed.
| Package | Open Source | Professional Support | Maui Compatible | Commercial Package | Bproc Compatible | Active Development |
|---|---|---|---|---|---|---|
| PBS Professional | No | Yes | Yes | Yes | Yes | Yes |
| OpenPBS | Yes | No | Yes | No | No | Not really |
| Torque | Yes | No | Yes | No | Yes | Yes |
| SGE | Yes | Yes | Yes | Yes | Yes | Yes |
| LSF | Some | Yes | Yes | Yes | Yes | Yes |
| Generic NQS | Yes | No | No | No | No | No |
This article was originally published in ClusterWorld Magazine. It has been updated and formatted for the web. If you want to read more about HPC clusters and Linux you may wish to visit Linux Magazine.
Dan Stanzione is currently the Director of High Performance Computing for the Ira A. Fulton School of Engineering at Arizona State University. He previously held appointments in the Parallel Architecture Research Lab at Clemson University and at the National Science Foundation.