Cluster Cooling, Noise and Benchmark Timing

The Beowulf mailing list provides detailed discussions about issues concerning Linux HPC clusters. In this column I report on some popular infrastructure issues as well as some benchmarks for beowulfs. You can consult the Beowulf archives for the actual conversations.

Quiet and Powerful

If you have stood next to a rack of 1U dual-processor computers, you know how loud they can be. On December 12, 2003, Joshua Baker-LePain asked about systems that provide the best bang for the noise (performance per dB, anyone?). He had six 1U nodes with dual 2.4 GHz Xeon processors. Each node had eleven small fans, which made the setup very loud for the people working in the same room as the machines. He was looking for systems that were much quieter but without a penalty in speed. Nicholas Henke was the first to respond, saying that his desktop Dell 650N box was very quiet even though it was a dual-processor machine.

Bill Broadley mentioned that he has a Dell 360N that he felt was quieter than the Dell 650N. He also said that he has found the single P4 processor in the 360N to be faster than the 650N, even running two processes, because the memory subsystem in the 360N is substantially better than that of the 650N. He added that he has never heard a quiet 1U or 2U system and that he has found even Apple's XServe systems to be loud. Finally, he recommended that if you are building a system yourself, you use rubber grommets in the case and large, slow-spinning 120 mm fans.

Joe Landman took the same idea of Bill's and put it in general terms. He recommended using larger size fans since they move more air for a given RPM (revolutions per minute). This decision usually dictates a larger case which Joe also mentioned allows for sound absorption to be installed fairly easily. He offered that a 2U or 3U case of good quality might be able to be cooled effectively while remaining quiet and allowing for sound absorption to be installed without adversely affecting airflow.

Greg Lindahl also offered an explanation for the noise. The typical 1U case uses very small fans that are inefficient, so a large number of fans have to be used to achieve the required cooling and airflow. Dean Johnson added to Greg's comment by stating that the smaller fans in the 1U cases run at a higher RPM to improve the airflow, which raises the pitch of the noise. In Greg's words, this "... pegs the annoy-o-meter." He gave an example where he had an SGI 1100 (a 1U, dual PIII node) and an SGI Origin 200 (a regular case size system running MIPS processors). He said that both probably had about the same level of loudness, but he turned off the 1100 when he wasn't using it because of the pitch of the noise.


Daniel Pfenniger offered a link to a company with a desktop box that is designed for low noise. Daniel said that he had one in his office that could barely be heard. He thought that putting even 12 in a room would not disturb a conversation. (26 dBA)
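As a rough check on the "12 in a room" estimate, sound levels from independent sources combine by adding their powers, not their decibel values. The sketch below assumes that the 26 dBA figure refers to a single box and that all twelve boxes are heard from the same distance; neither assumption is stated in the thread, and the function name is only illustrative.

```python
import math

def combined_level_db(levels_db):
    """Combine independent (incoherent) sound sources given as dB levels."""
    # Convert each level to a relative power, add the powers, convert back to dB.
    total_power = sum(10 ** (level / 10.0) for level in levels_db)
    return 10.0 * math.log10(total_power)

# Assumption: 26 dBA is the level of ONE box, and there are twelve identical boxes.
twelve_boxes = combined_level_db([26.0] * 12)
print(f"12 boxes at 26 dBA each: about {twelve_boxes:.1f} dBA")
```

Under those assumptions, twelve boxes add roughly 11 dB to the level of one box, which is far less than twelve times the decibel figure.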

Andrew Latham also mentioned that Koolance has water cooling kits for standard rack mount cases that could potentially reduce the noise level. However, Joel Jaeggli added that in fact the case was a 4U case. He said that in 4U of space he could fit eight Opteron 242s (four dual-processor setups), each with Panaflo crossflow blowers, which he felt was quieter than the small 40 mm, 8,000 RPM fans in typical 1U cases.

In the drive for speed, system vendors have increased the packaging density of systems from the old 4U cases, to 2U, to 1U, and finally to blade servers. At the same time, the power requirements, and thus the heat dissipation requirements, for the systems have gone up. Trying to cool them effectively while keeping them quiet is a monumental task. This brief exchange on the Beowulf mailing list points out that there are times when quiet systems are really needed.

Benchmarks and Results - Timing

Benchmarks are one of those areas that, one could say, take on a religious quality. The only real generalization I can offer about benchmarks is that you should benchmark your own code(s) and use those as your only benchmark. But this should never stop us from talking about benchmarks and what makes a good benchmark.

On January 7, 2004, Rene Storm asked about testing the Top500 benchmark (called hpl) on his Opteron system and why he was getting some variation in the results when it ran on different processors in the dual CPU box. This question started a very long and interesting discussion about how you time codes. Exact timing is key to getting accurate results. Let's examine some of the highlights, but be warned that the discussion will get into some great detail.

The first responder was the well-respected Joe Landman, who posted some very good comments about how to do benchmarking properly. First, he suggested that you reboot the machine between runs, and that you unmount and remount file systems between runs if you are testing codes with heavy I/O (Input/Output) usage. He also suggested running in single user mode and turning off unneeded services. He then explained that Opterons are really NUMA (Non-Uniform Memory Access) systems, so you want to assign a process to a processor to avoid memory 'hops' from one CPU to another. Other UNIX systems have a command called runon that facilitates this, and Joe had heard a rumor that such a command has been written for Linux. [Note: Check out "Take charge of processor affinity" and "Portable Linux Processor Affinity (PLPA)" for current processor affinity developments.] Joe also started the discussion about timing codes with some very good comments that he entitled 'Physics 101.'
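To make the idea of assigning a process to a processor concrete, here is a minimal sketch using Python's os.sched_setaffinity, which wraps the Linux scheduler affinity call. It is not the runon command Joe mentions, and the helper name pin_to_cpu is made up for this illustration. The sketch only restricts where the process runs; where its memory is allocated on a NUMA machine is a separate matter that it does not address.

```python
import os

def pin_to_cpu(cpu):
    """Restrict the calling process to one CPU and return the resulting affinity set.

    Returns None on platforms where os.sched_setaffinity is not available
    (it is Linux-specific).
    """
    if not hasattr(os, "sched_setaffinity"):
        return None
    os.sched_setaffinity(0, {cpu})  # pid 0 means "the calling process"
    return os.sched_getaffinity(0)

if hasattr(os, "sched_getaffinity"):
    allowed = sorted(os.sched_getaffinity(0))
    print("CPUs allowed before pinning:", allowed)
    # Pin to the first CPU this process is already permitted to use.
    print("CPUs allowed after pinning: ", sorted(pin_to_cpu(allowed[0])))
```

A benchmark started from a process pinned this way inherits the affinity, so each run stays on the same processor. The taskset utility offers the same control from the shell.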

Robert Brown (aka rgb) posted a very long, detailed, and comprehensive discussion about timing codes. Robert examined code timing quite a bit when he wrote cpu_rate (a benchmark code) and while he was using lmbench, another benchmarking code from Larry McVoy. He says that there are two reasonably accurate timing mechanisms in Linux, one of which is less accurate than the other. He first mentions the function gettimeofday, which returns wall clock time (the time as if you were looking at a clock on the wall) and is accurate to within about 2 microseconds. He points out that this makes it nearly useless for timing short loops that execute in less than 2 microseconds. For example, a few hundred multiplies will take less than a couple of microseconds, so you would have to do thousands of multiplies before you get above the resolution of the timer. He gives the example of a few billion multiplies that take on the order of a couple of seconds. The timing result using gettimeofday would be something like 2 seconds +/- 2 microseconds, which is a good timing resolution. Robert also points out that timing a bit of code that takes a few hundred microseconds is not a bad thing, because it is less likely to be interrupted by kernel functions. You then get timings of the code itself without OS-dependent factors getting involved. He also discusses a bit more about microbenchmarking with respect to using gettimeofday to measure the timing.
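The resolution argument can be tried out with a short sketch. It uses Python's time.time() as a stand-in for gettimeofday, which is an assumption of this example and not something from the thread, and the function names are illustrative. The numbers printed on a modern machine will differ from the roughly 2 microsecond figure Robert quotes, and interpreter overhead dominates the loop, so treat the output as a demonstration of the reasoning only.

```python
import time

def smallest_tick(samples=10_000):
    """Estimate the smallest positive step seen between successive clock readings."""
    best = float("inf")
    for _ in range(samples):
        t0 = time.time()
        t1 = time.time()
        while t1 <= t0:          # spin until the clock visibly advances
            t1 = time.time()
        best = min(best, t1 - t0)
    return best

def relative_uncertainty(resolution, elapsed):
    """Timer resolution as a fraction of the measured interval."""
    return resolution / elapsed if elapsed > 0 else float("inf")

def time_multiplies(n):
    """Wall-clock seconds for n floating-point multiplies (plus loop overhead)."""
    x = 1.0
    start = time.time()
    for _ in range(n):
        x *= 1.0000001
    return time.time() - start

# Robert's example: a 2 second run timed to +/- 2 microseconds.
print("2 s run, 2 us resolution:", relative_uncertainty(2e-6, 2.0))

tick = smallest_tick()
print(f"smallest observed clock step: {tick:.2e} s")
for n in (100, 10_000, 1_000_000):
    elapsed = time_multiplies(n)
    print(f"{n:>9} multiplies: {elapsed:.6f} s, "
          f"relative uncertainty {relative_uncertainty(tick, elapsed):.2e}")
```

The longer the timed interval is relative to the clock step, the smaller the relative uncertainty, which is why the short loop is the hard case and the multi-second run is the easy one.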

    Creative Commons License
    ©2005-2012 Copyright Seagrove LLC, Some rights reserved. Except where otherwise noted, this site is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 2.5 License. The Cluster Monkey Logo and Monkey Character are Trademarks of Seagrove LLC.