How to Write a Technical Cluster RFP

Need a cluster? Read this first!

Writing
a technical RFP (Request for Proposal) for a cluster is not something
to be taken lightly. There are many options available for clusters
that must be researched and addressed before writing the technical
RFP. Moreover, there are things one can put into a technical RFP that can
help discriminate between vendors and help make your decision process
easier (or at least make things clearer). To help with your procurement, there is an RFP outline/template provided at the end of the article.

This article was originally a synopsis of
a tutorial given at the 2003 ClusterWorld Conference and Expo on how to
write a technical cluster RFP. However, it has been expanded and edited
to discuss some topics more in depth and to add topics that weren't
discussed before. 

Introduction

Before
beginning with a discussion of how to go about writing a technical
cluster RFP, we should define some terms and go over some ground
rules. First, let's define a technical RFP as a set of requirements
for the cluster itself, excluding procurement and company policies
such as costs, warranties, delivery schedules, etc. However,
there are cases where you might want to include warranties, support,
delivery schedules, etc., in your technical RFP. Whether you include
these or not is up to you and your company policies. In my opinion,
putting delivery schedules in the technical RFP is definitely warranted
since it can impact the technical requirements.

In many
cases, the RFP makes up the vast majority of the document that is
sent to the vendors from
the procurement group. However, the procurement group adds all of the required
procurement procedures and legal language and policies. This article
is not concerned with the procurement aspects of the RFP and/or
purchasing process since these details vary from company to company.
This article is interested in only the technical details of
the RFP. But, as I mentioned in the previous paragraph, some
details that usually come from a procurement office can and should
be put into the technical RFP because they affect the technical aspects of
the cluster and they affect the technical evaluation.

In this article, I'll walk through the steps you should go through
in writing the RFP. You may be surprised that the first step in
writing an RFP is understanding your code(s).

Step One - Understanding Your Code(s) and Users - Profiling

We all want speed,
glorious speed. And we want it for nothing. While we're at it, it
would be nice if it were reliable, easy to maintain, use only a small
amount of power, have a small footprint, and so on. However, we
can't have all of this (sigh). So what do we do? Well, we have to find
the best system to meet our requirements. These requirements form the
basis of our technical cluster RFP. This means you have to do your
homework (seems like we never escape homework, doesn't it?). The first
step in writing our technical RFP is to understand the code(s) and
understand the user base for our cluster. The generic term I'm going
to use for this is "profiling," but I'm going
to go beyond the usual meaning of the term. I use it
to mean profiling both your codes and your user base.

Usage Profiling

Since clusters are almost always for technical computing, many
people ask why they have to understand their user base.
Profiling your user
base is a very important source of information that many people
overlook when preparing for a technical cluster RFP. Determining what
applications the users are running, how often they run them, the
problem sizes, typical number of processors used, what time of day
they run, etc., can be a great source of information.

For example, by
profiling your users, you can determine the largest problem size and
the number of processors for a single job. This will tell you the
minimum "size" of the cluster (where size is the combination of number
of processors and amount of memory per processor). You can also find
the largest number of processors anyone is using in their runs or the
largest amount of memory anyone is using in their runs.
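
As a rough sketch of the sizing idea above, the few lines of Python below scan a list of job records and report the largest processor count, the largest job memory, and the largest memory per processor. The `JobRecord` fields and the sample jobs are made up for illustration; a real queuing system will export different fields, so treat this as a starting point rather than a finished tool.

```python
from dataclasses import dataclass


@dataclass
class JobRecord:
    """One finished job. The field names are hypothetical, not from any real scheduler."""
    user: str
    application: str
    processors: int
    memory_gb: float  # total memory used by the whole job


def minimum_cluster_size(jobs):
    """Return the largest processor count, job memory, and memory per processor seen."""
    return {
        "max_processors": max(j.processors for j in jobs),
        "max_job_memory_gb": max(j.memory_gb for j in jobs),
        "max_memory_per_processor_gb": max(j.memory_gb / j.processors for j in jobs),
    }


# Made-up sample records, for illustration only.
jobs = [
    JobRecord("alice", "cfd_solver", 32, 64.0),
    JobRecord("bob", "md_sim", 16, 48.0),
    JobRecord("carol", "cfd_solver", 64, 96.0),
]

summary = minimum_cluster_size(jobs)
for name, value in summary.items():
    print(f"{name}: {value}")
```

Note that the job with the most processors is not necessarily the job with the most memory per processor, which is why the sketch tracks them separately.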

It would
also be good to have this same data (problem size and number of
processors) for the past couple of years. You can use this historical
data to make projections about the number of nodes you are likely to
need, how much memory you might have to add, etc. More precisely, if
the cluster is to last 3 years, which is the maximum you should keep
a cluster, you need to know what the cluster should look like in
3 years (more on this later).
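
One simple way to make such a projection is to fit a straight line through the historical peaks and extend it to the end of the cluster's life. The sketch below does this with a plain least-squares fit. The yearly numbers are invented, and assuming linear growth is itself a guess; if your usage has been doubling every year, a straight line will underestimate what you need.

```python
def linear_fit(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def project(xs, ys, target_x):
    """Extrapolate the fitted line to target_x."""
    slope, intercept = linear_fit(xs, ys)
    return slope * target_x + intercept


# Invented history: largest processor count used by any single job, per year.
years = [2001, 2002, 2003]
peak_processors = [16, 32, 48]

# Project three years past the last data point (the assumed cluster lifetime).
projected = project(years, peak_processors, 2006)
print(f"Projected peak processors in 2006: {projected:.0f}")
```

The same function can be reused for memory per processor or total job memory by swapping in a different history list.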

How do you get this
user base information? Good question. There are many ways to get it.
You can survey your user base on a periodic basis (but be careful,
users don't like to fill out too many surveys). You can also watch
your systems and capture job information from the queuing system and
the nodes. Usually monitoring systems can help capture this
information as well.

Application Profiling

The other side of the profiling coin is to profile the
user applications themselves. I'm going to focus on MPI (Message
Passing Interface) applications since they are probably the dominant
class of cluster applications. The first step is to gather a list of
parallel user applications. If you have access to the application's
source code then your life can be a bit easier. Regardless, you need
to profile your applications in the traditional sense of the word.
That is, determine which parts of the code consume the most time
including MPI functions.

How you profile depends
upon the MPI implementation being used. Many of the main MPI
implementations, such as
MPICH,
MPICH2,
LAM,
MPI-Pro,
Scali MPI Connect,
OpenMPI,
all have ways to
gather profiling information. The systems are flexible enough that
you can gather a great deal of information or a little bit of
data. From this information you want to know the message sizes
and the number of messages passed for the various MPI functions used
in the code. You also should look for how much total time is used in
each MPI function call. 

Now that you have all of this MPI data, what do you do with it?
All of this data will help determine the typical message size and what
MPI function(s) are used the most frequently. You can then pass this
information on to prospective vendors to help them select an
appropriate interconnect to bid. Or you can use the data to help you
specify an interconnect for the RFP.

Since we're talking
about parallel codes, we need to do some profiling on parallel
systems and vary the number of processors. The point of this profiling
is to gather scaling information (how your code performs as you increase
the number of processors) as well as how much the memory usage per node
changes
as you increase the number of processors. To gather this information,
if possible, start with one processor or one node and then increase the
number of processors, but don't
use too many processors. Profiling codes can produce a great deal of
information, so be careful how many processors you use. Also, don't
worry about the cluster interconnect you are using. Get some
profiling data first and then worry about the interconnect details
later.

Now that you have this
large database of profiling information of user applications, what do
you do with it? That's a very good question and I'll try not to
weasel out of answering it, but my general answer is that it depends
upon your specific case (how's that for weaseling out of answering
the question!). However, there are a few general things we can do
with the data. First look at the message sizes as a function of the
problem size and number of processes for a particular code. If the
code is using a large number of relatively small messages, then the
code is likely to be more sensitive to interconnect latency rather
than bandwidth. The code is also likely to require a small N/2
for best performance (see Sidebar One for an explanation of N/2).
If the code is using a large number of large
messages, then it is likely that the code is going to be more
sensitive to interconnect bandwidth than latency. If the message
sizes are distributed fairly evenly, with not too many small or large
messages, then the code is probably reasonably well designed and
could scale fairly well, possibly even on Fast Ethernet (be sure to
check the scalability results).
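
The rule of thumb above can be written down as a small classifier. In the sketch below, the cutoffs for "small" and "large" messages (1 KB and 64 KB) and the 60% "dominant" fraction are assumptions picked only for illustration; choose values that make sense for your interconnect candidates and your own profiling data.

```python
def classify_sensitivity(message_sizes,
                         small_threshold=1024,     # bytes; assumed cutoff for "small"
                         large_threshold=65536,    # bytes; assumed cutoff for "large"
                         dominant_fraction=0.6):   # assumed share that counts as "dominant"
    """Return 'latency', 'bandwidth', or 'mixed' for a list of message sizes in bytes."""
    if not message_sizes:
        raise ValueError("no messages to classify")
    total = len(message_sizes)
    small = sum(1 for size in message_sizes if size <= small_threshold)
    large = sum(1 for size in message_sizes if size >= large_threshold)
    if small / total >= dominant_fraction:
        return "latency"
    if large / total >= dominant_fraction:
        return "bandwidth"
    return "mixed"


# Invented message-size samples for three imaginary codes.
mostly_small = [64] * 8 + [1_000_000] * 2
mostly_large = [1_000_000] * 9 + [64]
spread_out = [64, 4096, 8192, 100_000]

print(classify_sensitivity(mostly_small))
print(classify_sensitivity(mostly_large))
print(classify_sensitivity(spread_out))
```

A classification like this is only a first hint. The scalability tests are still what tell you whether a given interconnect is good enough.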

You should also look
for scalability trends. How does the code perform on the same data
set as you increase the number of processors? You can make the
classic scalability plots of speedup on the y-axis and number of
processors on the x-axis to get the scalability information. You can
also increase the problem size as you increase the number of nodes.
What you are looking for is the increase in MPI traffic as the number
of nodes increases. Does this traffic increase much faster than the
number of processors? If so, how much? This information gives some
insight into how the code scales.
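
The numbers behind a classic scalability plot are easy to compute from wall-clock times. The sketch below takes run times for the same data set at several processor counts and returns speedup and parallel efficiency relative to the smallest run. The timings are invented for illustration.

```python
def scaling_table(run_times):
    """run_times maps processor count -> wall-clock seconds for the same data set.

    Returns a list of (processors, speedup, efficiency) tuples, using the
    smallest processor count as the baseline.
    """
    base_procs = min(run_times)
    base_time = run_times[base_procs]
    rows = []
    for procs in sorted(run_times):
        speedup = base_time / run_times[procs]
        efficiency = speedup / (procs / base_procs)
        rows.append((procs, speedup, efficiency))
    return rows


# Invented timings in seconds.
timings = {1: 1000.0, 2: 520.0, 4: 280.0, 8: 160.0}

rows = scaling_table(timings)
for procs, speedup, efficiency in rows:
    print(f"{procs:3d} processors  speedup {speedup:5.2f}  efficiency {efficiency:6.1%}")
```

Plot the speedup column against the processor count to get the usual scalability curve. A steadily falling efficiency column is the sign that communication is starting to dominate.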

Sidebar One: Definition of N/2

The usual way to characterize the bandwidth and latency of an
interconnect is to send messages of varying size between two
nodes through the interconnect. You measure the time it takes for
the messages to travel between the two nodes. From the time you
can compute bandwidth (amount of data per second). Then you can
create a plot of bandwidth on the y-axis and message size on the
x-axis. The peak bandwidth is the largest value of bandwidth on
the plot. Latency is the amount of time it takes to send a
0 byte packet.

During this testing, both nodes are sending and receiving data.
N/2 is the size of the message where you get half of the peak
bandwidth. Why half of the peak bandwidth? N/2 is the message size
where you get the full bandwidth in one direction (i.e., sending
data from one node to another). If you profile MPI codes you
are likely to see lots of small messages. Being able to transmit
the messages as fast as possible helps improve performance of the
code and improve scalability. This is why N/2 is important.
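
Given a table of measured bandwidth versus message size, N/2 can be estimated by finding where the curve first crosses half of the peak. The sketch below does this with linear interpolation between the two neighboring measurements. The measurements are invented, and the interpolation is just one reasonable choice; with closely spaced message sizes it matters very little.

```python
def half_bandwidth_message_size(sizes, bandwidths):
    """Estimate peak bandwidth and N/2 from measured points.

    sizes: message sizes in bytes, in ascending order.
    bandwidths: measured bandwidth at each size (any consistent unit).
    Returns (peak_bandwidth, n_half), where n_half is the message size at
    which the curve first reaches half of the peak, linearly interpolated.
    """
    peak = max(bandwidths)
    half = peak / 2.0
    prev_size, prev_bw = sizes[0], bandwidths[0]
    if prev_bw >= half:
        return peak, float(prev_size)
    for size, bw in zip(sizes[1:], bandwidths[1:]):
        if bw >= half:
            fraction = (half - prev_bw) / (bw - prev_bw)
            return peak, prev_size + fraction * (size - prev_size)
        prev_size, prev_bw = size, bw
    raise ValueError("bandwidth never reaches half of its peak")


# Invented measurements: message size in bytes, bandwidth in MB/s.
sizes = [64, 256, 1024, 4096, 16384, 65536]
bandwidths = [10.0, 40.0, 80.0, 160.0, 190.0, 200.0]

peak, n_half = half_bandwidth_message_size(sizes, bandwidths)
print(f"Peak bandwidth: {peak} MB/s, N/2 is about {n_half:.0f} bytes")
```

When comparing interconnects, a smaller N/2 means the interconnect reaches useful bandwidth at smaller messages.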

Operation Counting

When you profile
the user codes for MPI information, you can also take the opportunity
to count certain operations of the codes at the same time. Most
modern processors have hardware counters that allow you to count
certain CPU operations while a code is running. You can get to this
information by applying a small patch to the Linux kernel and then
using some user-space tools to extract the information desired.

There are several
counter projects for Linux. The most widespread project, the
Performance Application Programming Interface (PAPI),
is hosted at
the University of Tennessee. It is an application programming
interface (API) that allows you to instrument your code to count
events. There are also tools built on top of PAPI that allow you to
watch your codes without having to modify the source code, such as
PapiEx,
TAU,
Kojak,
PerfSuite, and
HPCToolkit.
For Pentium or AMD
processors, you have to use the
perfctr
kernel patch to get system
level access to the counters. You can control what information you
extract, but you should at least extract the following:

Number of floating-point operations performed
Total number of cycles
Total number of instructions
L1 cache misses
L2 cache misses

Coupling the time
it took to run the code with the floating-point operation count
allows you to compute FLOPS (Floating-Point Operations per Second).
You can use the FLOPS information when you increase the number of
processors and the problem size to watch the efficiency of the code.
You can also see the effects of various size caches on the
performance of the code by watching the cache misses.
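
Once you have the raw counter values and the run time, the derived numbers take only a few divisions. The sketch below assumes you have already collected the five counts listed above by whatever tool you prefer; it does not call PAPI or perfctr itself, and the counter values shown are invented.

```python
def counter_summary(flop_count, cycles, instructions,
                    l1_misses, l2_misses, wall_seconds):
    """Turn raw hardware counter totals into a few derived rates."""
    return {
        "mflops": flop_count / wall_seconds / 1e6,
        "instructions_per_cycle": instructions / cycles,
        "l1_misses_per_1k_instructions": 1000.0 * l1_misses / instructions,
        "l2_misses_per_1k_instructions": 1000.0 * l2_misses / instructions,
    }


# Invented counter totals for a single run.
summary = counter_summary(
    flop_count=4.0e11,
    cycles=6.0e11,
    instructions=3.0e11,
    l1_misses=3.0e9,
    l2_misses=3.0e8,
    wall_seconds=200.0,
)

for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

Running the same calculation at each processor count and problem size gives you a table you can scan for drops in MFLOPS or jumps in cache misses.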

Step 2 - Testing Candidate Hardware

Now that you know what
applications will be run on the cluster, what kind of datasets will
be used, the size of the datasets, and how the applications behave
and scale, you need to test candidate hardware. While this may seem
like a logical thing to do, many people do not take the opportunity
to test their codes prior to writing the technical RFP. 

Testing before you issue the technical RFP will only give you more
information to help write the specifications or to help judge the
proposals from the vendors (still more homework!). Many vendors or
labs or universities have clusters you can test on and they are very
good about helping people test their codes. The
beowulf mailing list
is a good place to start. I will warn you that some people may give
you free time on their machines for testing, but do not abuse that
privilege by using too much time. Instead determine how much time
you will need and talk to the cluster owner about the best time
to run your tests. You might also discuss the possibility of sharing
your results with the cluster owner.

Try to test on various
processors at various speeds and various cache sizes (if possible).
Also try to test on clusters with various cluster interconnects.
Ideally you would like to test your codes in a cluster with multiple
interconnects so you can sort out the effects of the interconnect on
performance.

Also, try various MPI
implementations. There are various implementations of the MPI
standard and each has their own goals and assumptions. Testing
your code(s) with the various MPI implementations will tell you
which ones are appropriate for the
particular code. Many of the MPI implementations also have tuning
options that you can use to improve performance. Be sure to try these
on the codes since several simple changes can greatly improve code
performance.

As I mentioned before, there are
times when you can't test your codes. Perhaps they are commercial
codes, or they are covered by security rules, or they are
proprietary. All I can suggest is to do your best to test. I have
even heard of vendors shipping small systems so you can do on-site
testing (just be sure to wipe the disks thoroughly or better yet,
replace the disks with new ones). If all of this won't work for you,
then you may have to write pseudo-applications that approximate the
behavior of your codes. This can be a great deal of work, but it
means the codes are easy to ship for testing. In fact, writing
application simulators is exactly what the Department of
Defense Major Shared Resource Centers (MSRC) do. They write codes
that approximate the behavior of user codes.

What happens if you don't test? I have seen many RFPs where it is
obvious the writers have never done any testing. In that case, they
rely on standard benchmarks to measure performance. I've seen people
use HPL (Top500 benchmark), NAS Parallel Benchmarks (NPB), and some
other codes to measure performance of the system. The problem is
that in almost every case, they have no idea how the results
of these benchmarks correlate to the performance of their codes.
Moreover, I have seen people purchase clusters based solely on the
HPL result (Top500 benchmark). Unfortunately, the system they
chose based on this performance measure was actually slower
on their user codes than another system that didn't have as high
a Top500 result. This is the danger in not testing - you might end
up with a system that doesn't run your codes as well as you thought.

Step 3 - Selecting Prospective Vendors

At this point you
should have a pretty good idea of how your code(s) perform and you
should have started testing. The next step I recommend is to start
selecting vendors. A good way to get started is to do what every
person does today - use Google. After your Google search, you
might spend some time on the beowulf mailing list asking for advice
and recommendations for vendor candidates. You can
also read the archives of various mailing lists to get an idea of
companies you want to consider. You can also post to the mailing
lists yourself and ask what vendors people recommend and which ones
they stay away from. Be careful though, you will get conflicting
recommendations. But, sometimes you can get names of companies you
didn't even think of considering. However you choose to create a list
of possible vendors, you will end up with a list of various vendors,
some small and some not so small. What do we do with the list? I
recommend classifying the prospective vendors into various categories.

I like to classify
vendors in three ways. The first way to classify vendors is based on
their overall size, sales, and experience with computer systems.
The second way to classify them is
based on their size, sales, and experience with clusters.
And the third way is their understanding of clusters,
particularly how to architect them, how to tune the distribution for
performance, and how they work with you (the customer) to get the
best price/performance possible.

First Classification - Computer System Vendors

The first classification is based on the company's overall business in
all computer systems and support.
You can divide companies into the usual Tier 1, Tier
2, and Tier 3 vendors that you read about in the literature. These
breakdowns are usually based on revenue from computer systems and
their surrounding services (e.g. support, etc.). However,
remember that we're talking about the overall size of the company and
sales on all computers, not just clusters. I call these the
"conventional" Tiers.

Second Classification - Cluster System Vendors

You can do the same basic classification as we did before, but base
the tiers on cluster sales.
That is, you can break vendors into Tier 1, Tier 2, Tier 3, etc. vendors
based only on cluster sales. The ranking based on
clusters will definitely be different than the first ranking. I call
these the "cluster" Tiers.

Third Classification - True Cluster Vendors

The third
classification is a bit different than the previous two classifications,
and in many ways becomes a subjective
ranking. But, since I'm on my soapbox here, let me give you my
interpretation of true cluster vendors.

There are many vendors who are quite capable of
assembling computers and connecting them with networks. They
may even put a standard cluster distribution on them. I call these
vendors, the "rack-n-stack" vendors. They really don't know much
about clusters but just ship hardware. Unfortunately, you can expect
little support from these kinds of vendors. If you get into trouble
that involves more than, "how do I power off nodes?" then you will
be out of luck. In some cases, they won't even be able to answer
questions about the cluster management system they installed!

The second kind of true cluster vendor
can assemble clusters correctly, put on a cluster distribution with
a cluster management system, and even provide some basic support.
They can help you if you have a node go down or if you have to put a
node back into production, or if you have some basic error. I call
these vendors "basic cluster" vendors. They may have a decent
idea of how to create a cluster, but if you get into trouble with
your applications, they will never be able to help you. Also, they
have little or no idea about how to architect a cluster for
your application(s) and how to tune performance. 

The third group of true
cluster vendors really understand how to assemble a cluster based on
your requirements and your code. They can tune your cluster for
performance for your code(s) and can help you if you have a problem
with your codes. Let me give you an example.

I know of one case where a major
"conventional" Tier 1 vendor and a "conventional" Tier 4 vendor,
who was also a Tier 1 "true cluster vendor," used
identical hardware and installed the same distribution of Linux, but
the conventional Tier 4 vendor tuned their system and the Tier 1
did not. The conventional Tier 4 vendor's cluster ran 20% faster using
the same compilers, compiler options, and MPI! Remember this is using
the same hardware and software. The Tier 4 vendor understood how to
integrate the cluster and tune for performance. I call these vendors
"cluster architects" and unfortunately, there are very few of them. 

New Approach - Team Cluster!!!!

People usually want a single vendor to provide everything - hardware,
software, services, support, warranty, etc. This is often referred
to as the "One Throat to Choke" approach. Since I'm in a
Xen-like mood let me
ask, why? Why do you have to have one company do everything for you?

Looking at the general commodity computer market you will see that
there are companies who are very, very good at hardware. These companies
are very good at making hardware that works well and is very
low cost. Many of them also manage to make money in the process. However,
they are not the best when it comes to software and they are
definitely not the best when it comes to understanding how to architect,
tune, and integrate clusters.

Then we have small companies that I call "cluster architects" as
I mentioned before. They are usually behind the curve on hardware
because of costs and volume, but they know clusters very well,
including cluster software.

Why can't you buy hardware from the hardware vendor and cluster
software and services from a true cluster company? Basically you
build your
own "team" - a hardware vendor, a cluster software and integrator,
and yourself. While this isn't the "one throat to choke" approach
that virtually all IT departments have become fixated upon, it
does have some compelling arguments.

First, this team approach gives you the best performance from the
various aspects of a cluster. You get the best hardware from a
company that truly understands hardware. You get the best
cluster software and integration from a vendor who knows clusters.
The hardware vendor doesn't have to worry about software, which
many of them find to be the bane of their existence. The cluster
vendor doesn't have to worry about hardware which requires a great
deal of time, effort, capital, and usually has low margins.

Also, this approach allows you to select hardware from vendors
who are not thought of as cluster companies. If you are bound by
the "one throat to choke" mentality you have to pick a single
vendor for cluster hardware, software, and services. This limits
your selection. However, what if you could choose a hardware company
that is not a cluster vendor? This gives you a wider range of
companies to choose from - this gives you more flexibility and
possibly better price/performance. What if you could choose a
cluster software/integrator that truly understands clusters?
This company gives you a wide range of cluster software to choose
from (as opposed to the "one throat to choke" who only have one set
of cluster software). This gives you more flexibility and perhaps
better price/performance as well. 

The approach of "decoupling" hardware and software gives you
opportunities that the "one throat to choke" concept does not
give you. For example, you can standardize on a single set of
cluster software and then select the hardware vendor that gives
the best price/performance, the latest hardware, or some specialized
hardware adapted for your application.

Just as in real life, there are downsides to this approach. Because
you don't have "one throat to choke" you have to bear more of the
responsibility for the cluster. You can't just call a single phone
number and expect "cluster batman" to fix whatever problem you
have. You now need to make sure the hardware vendor and the
software vendor work together (I would make sure they can do this
prior to buying anything) and that the habit of blaming each other
does not become a common occurrence. Also, this idea depends upon
the cluster software and the hardware vendor adhering to standards.
This allows the hardware piece and the software piece to be
interchangeable.

I can promise you that any IT manager who is reading this is silently
saying that they would never consider this idea and that I'm nuts.
Why would they trade ease of support for more headache on their
part? Well, the answer is simple - you can get much better
price/performance with this approach. As I mentioned before you
can get the best price/performance in hardware and the best
price/performance in software. If the integration is done well,
then you should have the best price/performance system.

This concept is a developing one that customers are starting to
embrace. It bears further thought, but I'm personally betting this
is the wave of the future.

Recommendations (but don't sue me)

You want to have a reasonable number of candidate vendors without
having too many. How many is enough? That's really up to you.
But, there are some easy recommendations I can give you (at least
"easy" in my opinion). Then I will give you my idea of how many
vendors I would pick and which kind of vendors I would pick.

The easiest recommendation I can make is to stay
away from the "rack-n-stack" vendors unless you have some good
in-house cluster staff. If you are trying the "cluster team"
approach, then the "rack-n-stack" companies could make sense. However,
I would recommend looking for some "basic cluster" vendors
since they are plentiful in numbers. Most important, however, is to
spend as much time as you can looking for the "cluster architects."
These are the companies you will want to buy from.

In general, I would select around 4-5 vendors in total. I might
select at
least one conventional Tier 1 vendor, one or two "basic cluster vendors"
which are not conventional Tier 1 vendors, and at least one, but hopefully
two or more "cluster architect" companies. Don't select too many companies
since this will
make your life difficult when doing an evaluation, but also don't
select too few, since you will have a difficult time comparing the
vendors and their offerings.

I have one more
comment about selecting vendors. Don't select a vendor based on just
their size and sales. Companies, or rather IT managers, 
have a tendency to select
companies based on their size and sales (the bigger the better).
People seem to have some comfort in large companies because
they believe they will always be around. While they have a valid
point, remember that clusters, and Beowulfs in particular, are made
from commodity components that can be easily found from other
companies. Also remember that Enron and Worldcom were huge
corporations with large sales and supposedly large cash reserves just
before they went bankrupt.

I have also seen a disturbing trend in cluster sales recently. There
are many people who are shopping for cluster vendors, and making
decisions, based purely on one number. This number is $/GFlop
(price per billion floating-point operations per second). The GFlop
number is either the peak performance or the Top500 performance
(I've seen requests for both numbers). What these people are looking
for is the cheapest hardware possible. By focusing on this single
number, they have immediately eliminated all discussion about the
ability of a vendor to support their hardware, fix problems, interconnect
options, software tuning, cluster management, etc. Also, they have
now eliminated any discussion of the performance of real codes.
It's really sad to hear people ask for these numbers knowing that
they are likely to end up with a cluster that doesn't work as
advertised, and doesn't deliver the best performance on user codes.
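
To see how a single $/GFlop number can point you at the wrong system, the sketch below ranks two imaginary bids two ways: by price per HPL GFlop, and by price per user job completed per day. Every number here is invented purely to illustrate the arithmetic, and "price per daily job" is just one possible application-based measure, not a standard metric.

```python
def price_per_gflop(price_dollars, gflops):
    """Dollars per GFlop of benchmark (or peak) performance."""
    return price_dollars / gflops


def price_per_daily_job(price_dollars, job_hours):
    """Dollars per user job the system can finish in a day, run back to back."""
    jobs_per_day = 24.0 / job_hours
    return price_dollars / jobs_per_day


# Two imaginary bids with invented prices, HPL results, and user-code run times.
bids = {
    "vendor_a": {"price": 100_000, "hpl_gflops": 500.0, "job_hours": 6.0},
    "vendor_b": {"price": 110_000, "hpl_gflops": 440.0, "job_hours": 4.0},
}

best_by_gflop = min(
    bids, key=lambda v: price_per_gflop(bids[v]["price"], bids[v]["hpl_gflops"]))
best_by_application = min(
    bids, key=lambda v: price_per_daily_job(bids[v]["price"], bids[v]["job_hours"]))

print("Cheapest per HPL GFlop:", best_by_gflop)
print("Cheapest per user job per day:", best_by_application)
```

In this made-up case the bid that wins on $/GFlop loses once the run time of the actual user code is taken into account, which is exactly the trap described above.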
