Resource Management 101

A Primer of Queuing and Scheduling

After a few months of reading about clusters on Cluster Monkey, you've gotten your cluster hardware, assembled it, selected a distribution, installed it, checked to make sure your nodes are up and running, learned some parallel programming, and written the highest performance "Hello, World" the universe has ever seen. So, you figure you're ready to turn your machine loose to your users, and amazing jumps in productivity are about to be seen.

In the immortal words of ESPN's Lee Corso, Not so fast, my friend!

Unless you're fortunate enough to be your cluster's only user (the ever growing Beowulf-in-the-basement crowd), selecting and configuring the appropriate resource management software is an absolutely critical step in administering your cluster. Outside of your choice of hardware, few things you do will have as big an impact on the user experience and performance your cluster provides.

In this column, we're going to start examining resource management with a look at the basic concepts of queuing and scheduling and why they are so important. Next month, we'll take a quick look at the resource management options available, get into the details of configuring and using this software, and give some examples of how some large clusters around the U.S. are run.

A Resource Management Primer

Resource management is not a problem that is unique to clustering (in fact, it's not unique to computing). In your workstation, keeping track of which processes are waiting to run, and selecting which one to run next, is one of the core functions of the operating system. The OS still serves this function in your cluster, however, the OS schedules each compute node independently of all others. In the cluster world, resource management is about deciding which jobs run on which of you compute nodes, and when.

There are three basic components to resource management: queuing, scheduling and allocation. In actual implementation, the line between these components is often blurred. Allocation is simply the process of actually granting access to processors to a particular job. For instance, if you run a Scyld cluster, the bproc software takes care of the allocation. The queuing software keeps track of the jobs that are waiting to run. However, its job is more complicated than it sounds. The queuing software must dispatch jobs by interacting the allocator (or function as it), it must interact with the scheduler (or function as that too), and it must provide an interface to the users for submitting jobs and getting results back to them. In that sense, the queuing software can be the most critical part of your user environment.

The scheduler may be the most complicated piece of your resource management system. In the simplest since, the scheduler reorders the queue (or at least decides which job is at the head of it). This can be as simple as choosing to run the jobs in the order they are submitted. In reality, scheduling algorithms can be much more complicated. The popular and feature rich Maui scheduler has tens of thousands of lines of code dedicated to the scheduling algorithm itself.

In practice, setting up a scheduling system is not as simple as selecting one from each of these three categories. Your choice of cluster distribution may constrain you, and the available software may implement one, two, or all three of these functions. For instance, the popular PBS (Portable Batch System see PBS-Pro, OpenPBS, and Torque) family of resource managers is primarily a queue, but also has a built-in simple scheduler. This scheduler can be deactivated and replaced with Maui. Platform Computing's LSF (Load Sharing Facility) is designed as a complete resource management system, but the allocation piece can be handed off to bproc, and, although it contains a sophisticated scheduler of it's own, it can also be convinced to interact with Maui.

Why Bother with a Scheduler?

At this point, you might be asking if all this is really worth the trouble. Why can't we simply let people run their jobs? The answer, like in so many questions about clusters, is performance. In many (though certainly not all) cluster applications, you'd like jobs to have exclusive access to processors. Part of this is simply for bragging rights; you'd like jobs running on your cluster, especially parallel jobs, to go faster than they would on a single processor. But the answer runs much deeper than that. One complication is that processors are not the only limited resource in a computer system. If you are running parallel jobs, there's a pretty good chance it's because you need to solve fairly large problems. And if you are solving large problems, there's a a fair chance you are using a lot of memory. While modern operating systems are pretty good at switching between processors, memory is still a finite resource. The performance price for running out of physical memory can be large. If you issue two equal size jobs on the same compute node, and the combination of the two causes you to run out of physical memory, the total running time for the two jobs may be much larger than if you issued them one at a time.

{mosgoogle right} Ensuring adequate memory capacity (or network bandwidth, or I/O capability) for your job makes a strong argument for restricting your cluster to one job on given compute node processor at a time, but it doesn't argue for a complicated scheduler. You could achieve exclusive access by simply running a job, waiting for it to complete, then starting the next job in the order they were received. But scheduling can still do far more for your performance. To fully explain this, let's define some basic performance metrics. Utilization is the amount of available processor cycles that are used doing useful work. So, whether you use all of your processors half the time, or half your processors all the time, your utilization is 50%. Throughput is the measure of how many jobs you complete in a given time period. If you are running a large cluster and selling access by the job, this is the metric you would care about. But your users may disagree. To them, the metric that matters most may be average turnaround time, which is how long from when they submit their job until they get a result back (another popular metric I won't use here is waiting time, the time a job spends waiting in the queue; waiting time plus execution time gives you turnaround time).



    Login Form

    Share The Bananas

    Creative Commons License
    ©2005-2012 Copyright Seagrove LLC, Some rights reserved. Except where otherwise noted, this site is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 2.5 License. The Cluster Monkey Logo and Monkey Character are Trademarks of Seagrove LLC.