Features

The True Cost of HPC Cluster Ownership

Validation/Optimization Costs

One cost that is particularly hidden from the customer is the cost to validate and optimize the hardware and software. Since there is no single point of contact for the entire cluster, the user must make sure everything works as expected. In some cases, integrators will run system-wide tests, but the wide array of hardware and software choices pushes the ultimate responsibility for correct operation onto the customer. This process takes time and should be performed each time a significant change is made to the system (e.g., a new compiler, kernel, or MPI version). In the worst case, a change or upgrade may actually produce wrong answers or fail entirely because the solution was not validated before implementation. An often forgotten integration issue is I/O. Most clusters must be integrated into an existing storage hierarchy. In addition, assessing the actual storage needs for a cluster, before storage decisions are made, is critically important. A poor storage design can result in poor application performance. In many cases, the first hardware added to a new cluster is there to resolve a storage-based issue.

Maintenance Costs

Keeping a cluster running can be a time sink. Some false comfort can be gained by purchasing a hardware maintenance agreement from a vendor. While the vendor will repair obvious problems, they often must defer to another vendor or software project for any non-obvious failures (disk drive and power supply failures are obvious; poor interconnect performance may be due to several sources). The user is then required to invest the time to identify and assign responsibility to a specific vendor and, in some cases, play negotiator between two vendors.

Another issue with the classic cluster design (an OS image installed on each node) is that of version skew or "node personalities." Initially, keeping nodes in sync seems trivial -- just install the same thing on all the nodes. This approach breaks down as the cluster ages, because replacement nodes must be re-imaged to match all the other nodes. To accomplish this, changes must be tracked and a current "snapshot" created. Changes also include OS tuning parameters and tweaks that must be performed on nodes so that certain software applications will run correctly. This "change/snapshot/re-image" cycle is expensive and can incur significant downtime for the simplest of maintenance issues.
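The drift problem described above can be made concrete with a small sketch. The node names and package versions below are hypothetical; the point is simply that each node's installed set must be compared against a tracked reference snapshot to find "personalities":

```python
# Illustrative sketch: detect "node personality" drift by comparing each
# node's installed-package set against a reference snapshot.
# Node names and package versions here are hypothetical examples.

def drift_report(reference, nodes):
    """Return, per node, the packages that differ from the snapshot."""
    report = {}
    for name, packages in nodes.items():
        missing = reference - packages   # in snapshot, absent on node
        extra = packages - reference     # on node, absent from snapshot
        if missing or extra:
            report[name] = {"missing": sorted(missing), "extra": sorted(extra)}
    return report

reference = {"kernel-2.6.18", "openmpi-1.2", "gcc-4.1"}
nodes = {
    "node001": {"kernel-2.6.18", "openmpi-1.2", "gcc-4.1"},  # in sync
    "node002": {"kernel-2.6.20", "openmpi-1.2", "gcc-4.1"},  # re-imaged later
}
print(drift_report(nodes=nodes, reference=reference))
```

In practice the package sets would come from each node's package manager, but the comparison logic is the same.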

There are several more advanced cluster methodologies, such as NFS-root or RAM-disk, that help solve some of these issues. These approaches must be evaluated carefully, as changing your provisioning scheme after the cluster is operational can be difficult and disruptive.

Upgrade Costs

Of course, software changes and upgrades provide better security, more features, and, hopefully, better performance. In many clusters the "software stack" comes from many sources, and there is often an unknown dependency tree living in the installed software. For instance, upgrading to a new distribution of Linux (Red Hat, SUSE, etc.) may require rebuilding MPI libraries and other middleware. User applications may also need to be rebuilt with a third-party optimizing compiler that does not yet support the new distribution. Administrators and users are then required to find workarounds or fixes that allow the users to run the new software. Other packages may suffer a similar fate, resulting in extra time and frustration.

Infrastructure Costs

In addition to the hidden support costs, commodity clusters also place a burden on infrastructure costs. The power and cooling costs for a typical (x86-based) cluster are often not factored into the price-to-performance numbers. An average dual-socket cluster node currently requires around 300 watts of power. Cooling and power delivery inefficiencies can double this node power requirement to 600 watts. Therefore, on an annual basis a single cluster node can require 5,256 kilowatt-hours. At a nominal cost of $0.10 per kilowatt-hour, the annual power and cooling cost for a single cluster node is approximately $526.

These numbers are more striking when the cost of the entire cluster is taken into account. Consider a typical cluster purchase in today's market, where a node can cost $3,500 (including racks, switches, etc.). Using standard dual-core technology, a node provides two processors and four cores. A typical 128-node cluster will then provide 256 processors and 512 cores and cost $448,000. Based on the above assumptions, the annual power and cooling budget is then $67,300. Over a three-year period this amounts to $202,000, or 45% of the system cost.

While costs may vary due to market conditions and location, the above analysis illustrates that for a typical commodity cluster the three year power cost can easily reach 40-50% of the hardware purchase price.
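The power and cooling estimate above can be reproduced from its stated assumptions (300-watt nodes, a 2x cooling/delivery overhead, $0.10 per kilowatt-hour, a $3,500 node price, 128 nodes):

```python
# Worked version of the power and cooling estimate from the text.
watts_per_node = 300      # typical dual-socket node draw
cooling_factor = 2.0      # cooling/delivery inefficiency doubles the draw
cost_per_kwh = 0.10       # dollars, nominal
hours_per_year = 24 * 365

kwh_per_node_year = watts_per_node * cooling_factor * hours_per_year / 1000
annual_cost_per_node = kwh_per_node_year * cost_per_kwh   # ~$526

nodes = 128
node_price = 3500
cluster_price = nodes * node_price                # $448,000
annual_power_cost = nodes * annual_cost_per_node  # ~$67,300
three_year_share = 3 * annual_power_cost / cluster_price  # ~45%

print(kwh_per_node_year, annual_cost_per_node,
      cluster_price, round(three_year_share * 100))
```

Adjusting `cost_per_kwh` or `cooling_factor` for local conditions shows how easily the three-year share moves through the 40-50% range quoted above.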

Other infrastructure issues can affect cost as well. A typical industrial rack-mount chassis can hold 42 cluster nodes. An average cluster node weighs around 45 pounds. Thus, each rack requires a floor capable of supporting 2,000 pounds in the space of a single rack-mount enclosure. In a typical data center, rack-mount hardware is a mix of storage and servers with many underpopulated racks. HPC clusters, on the other hand, represent the densest and therefore heaviest load in a data center. In our 128-node example, the cluster will require support for 6,000 pounds in a 4x8-foot area.
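The floor-load figures follow directly from the per-node weight (the text rounds 1,890 and 5,760 pounds up to 2,000 and 6,000 for planning headroom):

```python
# Floor-load arithmetic behind the rack-weight figures above.
nodes_per_rack = 42
node_weight_lb = 45

rack_load = nodes_per_rack * node_weight_lb    # 1,890 lb, plan for ~2,000
cluster_nodes = 128
cluster_load = cluster_nodes * node_weight_lb  # 5,760 lb, plan for ~6,000
print(rack_load, cluster_load)
```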

Time Is Money

The above issues need to be resolved before any real production computing can begin. Instead of a domain expert running a code on a supercomputer with a highly defined software and hardware environment, he or she has to understand details previously handled by the people at a traditional supercomputing company. The initial cost of clusters is lower because the cost of engineering and integration has shifted from the vendor to the user.

For the more scientifically inclined, there is a kind of conservation of cost when it comes to HPC. Cost in this sense is both time and money, because the time to solve an implementation problem often cannot be reduced with money. The low price of clusters did eliminate some costs, but shifted many of the non-reducible costs to the end user, which ultimately impacts how much computing per dollar the cluster user can achieve. These costs, coupled with infrastructure costs, often push the total cost of ownership much higher than originally anticipated.

Success Metrics

The lack of software procurement costs (i.e., the use of freely available software) has invited many organizations to focus solely on the nodes-per-dollar cost of an HPC solution. A more correct and measurable number should be based on sustainable solutions per day per dollar, where the dollar estimate includes all the above hidden costs (software integration, maintenance, and infrastructure). This measure is the real cost of HPC and will provide a sound basis on which to determine the ultimate cost effectiveness of an HPC investment.

Factoring the hidden costs into such a number can be very difficult. The amount of time and money required depends on your level of in-house expertise. Attempting to build and maintain a production HPC cluster requires a skill set that is currently in short supply and thus expensive. If your organization does not have the technical depth, then purchasing hardware is, in a very real sense, putting the cart before the horse. Infrastructure costs, on the other hand, are more easily estimated and therefore should be an integral part of all success metrics.
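As a rough sketch of the metric proposed above, the hidden costs can be folded into the dollar figure before dividing into daily throughput. All numbers below are hypothetical placeholders, not measurements from the article:

```python
# Hedged sketch of "solutions per day per dollar" with hidden costs included.
# The throughput and hidden-cost figures are hypothetical illustrations.

def solutions_per_day_per_dollar(solutions_per_day, hardware_cost,
                                 annual_hidden_cost, years=3):
    """Fold integration, maintenance, and power/cooling into total cost."""
    total_cost = hardware_cost + years * annual_hidden_cost
    return solutions_per_day / total_cost

# Hypothetical: 40 production runs per day on a $448,000 cluster,
# with $90,000/year in hidden costs over a three-year life.
print(solutions_per_day_per_dollar(40, 448_000, 90_000))
```

Comparing two candidate systems with this function, rather than by nodes per dollar, is exactly the shift in perspective the text argues for.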

A Strategy For Success

Commodity clusters have shown a tremendous price-to-performance advantage over the traditional approach to supercomputing. Those considering an HPC solution should be aware that the cost to fully implement a solution goes well beyond the hardware price and includes both software integration/maintenance and infrastructure costs.

If you are planning to purchase an HPC cluster, consider the additional work required to achieve a functioning system. Failure to account for the hidden time and money will result in lost uptime, higher costs, and poor performance. As part of your cluster plan, determine whether you have the in-house expertise to accomplish these tasks in a cost-effective manner. If you need help, look for a vendor that has an intimate understanding of your needs and experience with HPC systems. In reality, most large vendors will stop slightly beyond a "standard install," perhaps offering a professional services organization (internal or external), at which point you are on your own. There are a number of smaller vendors that can help minimize the hidden costs and provide real long-term support for your HPC needs. Finally, there are a small number of consultants that specialize in cluster integration, testing, and support.

Cluster HPC is a powerful and effective computing platform. Understanding the real cost structure will help set expectations and assist in planning and implementing your HPC resource.


Author's Note: Portions of this article are taken from a white paper I wrote for SiCortex in January of 2007. I also want to point readers to an excellent article by fellow cluster monkey Jeff Layton, entitled How to Write a Technical Cluster RFP.

    ©2005-2012 Copyright Seagrove LLC, Some rights reserved. Except where otherwise noted, this site is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 2.5 License. The Cluster Monkey Logo and Monkey Character are Trademarks of Seagrove LLC.