In this installment of the Best of the Beowulf Mailing List we review some postings about noise issues (always a problem) and clock resolution for code profiling. You can consult the archives for the actual conversations.

Beowulf: Quiet and Powerful

If you have stood next to a rack of 1U dual-processor computers, you know how loud they can be. On December 12, 2003, Joshua Baker-LePain asked about systems that provided the best bang for the noise (performance per dB, anyone?). He had six 1U nodes, each with dual 2.4 GHz Xeon processors and eleven small fans, and the resulting setup was very loud for the people around it (and there were people in the same room as the machines). He was looking for systems that were much quieter but without a penalty in speed. Nicholas Henke was the first to respond, noting that his desktop Dell 650N was a dual-processor machine and was very quiet.

Bill Broadley chimed in that he had a Dell 360N that he felt was quieter than the Dell 650N. He also found the single P4 processor in the 360N to be faster than the 650N, even when running two processes, because the memory subsystem in the 360N was substantially better than the 650N's. He mentioned that he had never heard a quiet 1U or 2U system; he found even Apple's Xserve systems to be loud. He finally recommended that, if you are building a system yourself, you use rubber grommets in the case and large, slow-spinning 120 mm fans.

Joe Landman generalized Bill's advice. He recommended using larger fans, since they move more air at a given RPM (revolutions per minute). This choice usually dictates a larger case, which, Joe noted, also makes it fairly easy to install sound-absorbing material. He suggested that a good-quality 2U or 3U case could be cooled effectively while remaining quiet, with sound absorption added without adversely affecting airflow.

Greg Lindahl offered an explanation for the noise: the typical 1U case uses very small fans that are inefficient, so a large number of them are needed to achieve the required cooling and airflow. Dean Johnson added that these smaller fans run at a higher RPM to improve airflow, which raises the pitch of the noise and, in Greg's words, "... pegs the annoy-o-meter." He gave an example: he had an SGI 1100 (a 1U, dual PIII node) and an SGI Origin 200 (a regular-sized case running MIPS processors). Both were probably about equally loud, but he turned off the 1100 when he wasn't using it because of the pitch of its noise.

Daniel Pfenniger offered a link to a company with a desktop box designed for low noise (26 dBA). Daniel said that he had one in his office and could barely hear it, and he thought that even 12 of them in a room would not disturb a conversation.

Andrew Latham mentioned that Koolance has water-cooling kits for standard rack-mount cases that could potentially reduce the noise level. However, Joel Jaeggli pointed out that the case in question was actually a 4U case. He said that in 4U of space he could fit eight Opteron 242s (four dual-processor setups) with Panaflo crossflow blowers in each, which he felt was quieter than the small 40 mm, 8,000 RPM fans in typical 1U cases.

In the drive for speed, system vendors have increased packaging density from the old 4U cases to 2U, to 1U, and finally to blade servers. At the same time, the power requirements, and thus the heat dissipation requirements, of these systems have gone up. Cooling them effectively while keeping them quiet is a monumental task. This brief exchange on the Beowulf mailing list points out that there are times when quiet systems are really needed.


Beowulf: Benchmarks and Results - Timing

Benchmarks are one of those areas that, one could say, take on a religious quality. The only real generalization I can offer is that you should benchmark your own code(s) and treat those results as the only benchmark that matters. But that should never stop us from talking about benchmarks and what makes a good one.

On January 7, 2004, Rene Storm asked about testing the Top500 benchmark (HPL) on his Opteron system and why he was getting some variation in the results when the benchmark was run on different processors in the dual-CPU box. This question started a very long and interesting discussion about how you time codes, since exact timing is the key to accurate results. Let's examine some of the highlights, but be warned that the discussion gets into quite a bit of detail.

The first responder was the well-respected Joe Landman, who posted some very good comments about how to benchmark properly. First, he suggested rebooting the machine between runs and, if you are testing codes with heavy I/O (Input/Output) usage, unmounting and remounting file systems between runs. He also suggested running in single-user mode and turning off unneeded services. He explained that Opterons are really NUMA (Non-Uniform Memory Access) systems, so you really want to pin a process to a processor to avoid memory 'hops' from one CPU to another. Other UNIX systems have a command called runon that does this, and Joe had heard a rumor that such a command had been written for Linux. Joe also started the discussion about timing codes with some very good comments that he titled 'Physics 101.'
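
As an aside, current Linux kernels do offer the equivalent of runon through the sched_setaffinity() system call (and the taskset command). Here is a minimal sketch of pinning the calling process to one CPU before running a benchmark; the CPU number and the benchmark kernel itself are placeholders, not anything from the original discussion.

    #define _GNU_SOURCE              /* needed for sched_setaffinity() */
    #include <sched.h>
    #include <stdio.h>

    int main(void)
    {
        cpu_set_t mask;
        CPU_ZERO(&mask);
        CPU_SET(0, &mask);           /* CPU 0, chosen arbitrarily */

        /* pid 0 means "the calling process" */
        if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {
            perror("sched_setaffinity");
            return 1;
        }

        /* ... run the benchmark kernel here ... */
        return 0;
    }

The command-line equivalent would be something like taskset -c 0 ./benchmark.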

Robert Brown (aka rgb) posted a very long, detailed, and comprehensive discussion about timing codes. Robert examined code timing quite a bit when he wrote cpu_rate (a benchmark code) and while he was using lmbench, another benchmarking code, from Larry McVoy. He says that there are two reasonably accurate timing mechanisms in Linux, one less accurate than the other. He first mentions the function gettimeofday, which returns wall-clock time (the time as if you were looking at a clock on the wall) and is accurate to within about 2 usec (microseconds). He points out that this makes it nearly useless for timing short loops that execute in less than 2 usec; timing a few hundred multiplies, for example, takes less than a couple of microseconds, so you would have to do thousands of multiplies before you got above the resolution of the timer. He gives the example of a few billion multiplies, which take on the order of a couple of seconds; the timing result using gettimeofday would then be something like 2 seconds +/- 2 usec, which is good timing resolution. Robert also notes that timing a piece of code that runs for a few hundred microseconds is not a bad thing, because it is less likely to be interrupted by kernel functions, so you get timings of the code itself without OS-dependent factors getting involved. He also discusses microbenchmarking in more detail with regard to using gettimeofday for the measurement.
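
To make the arithmetic concrete, here is a minimal sketch of the kind of wall-clock timing Robert describes: a gettimeofday() call on either side of a long loop of multiplies. The loop count and multiplier are arbitrary choices of mine; the point is simply that the run must last seconds in order to swamp the roughly 2 usec resolution of the timer.

    #include <stdio.h>
    #include <sys/time.h>

    #define N 1000000000UL           /* one billion multiplies: roughly a second or two */

    int main(void)
    {
        struct timeval t0, t1;
        volatile double x = 1.0000001;   /* volatile keeps the loop from being optimized away */
        double prod = 1.0;

        gettimeofday(&t0, NULL);
        for (unsigned long i = 0; i < N; i++)
            prod *= x;               /* the work being timed */
        gettimeofday(&t1, NULL);

        double elapsed = (t1.tv_sec - t0.tv_sec)
                       + (t1.tv_usec - t0.tv_usec) * 1.0e-6;
        printf("prod = %g, %lu multiplies in %.6f s (%.2f ns each)\n",
               prod, (unsigned long)N, elapsed, 1.0e9 * elapsed / N);
        return 0;
    }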

Robert goes on to say that gettimeofday is a lousy timer for microbenchmarks that run for very short times, but a good one for 'macroscopic' benchmarks that run much longer (well over several seconds). The reason is that if you run such a benchmark several times, any OS-dependent timing interruptions average out, and you get a nice Gaussian distribution of timings with a small standard deviation.
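
A small sketch of that averaging idea, assuming a stand-in benchmark_kernel() in place of real work: time several runs with gettimeofday() and report the mean and standard deviation of the timings.

    #include <math.h>
    #include <stdio.h>
    #include <sys/time.h>

    /* Stand-in for the real code being benchmarked (hypothetical). */
    static void benchmark_kernel(void)
    {
        volatile double x = 1.0000001, prod = 1.0;
        for (long i = 0; i < 100000000L; i++)
            prod *= x;
    }

    int main(void)
    {
        enum { RUNS = 10 };
        double sum = 0.0, sumsq = 0.0;

        for (int i = 0; i < RUNS; i++) {
            struct timeval t0, t1;
            gettimeofday(&t0, NULL);
            benchmark_kernel();
            gettimeofday(&t1, NULL);
            double t = (t1.tv_sec - t0.tv_sec)
                     + (t1.tv_usec - t0.tv_usec) * 1.0e-6;
            sum   += t;
            sumsq += t * t;
        }

        double mean = sum / RUNS;
        double var  = sumsq / RUNS - mean * mean;
        double sdev = sqrt(var > 0.0 ? var : 0.0);   /* guard against roundoff */
        printf("%d runs: mean %.6f s, std dev %.6f s\n", RUNS, mean, sdev);
        return 0;
    }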

Bob then presents something he calls a nanotimer, available on nearly all Intel/AMD CPUs. A little bit of assembler code can read the CPU's on-board cycle counter; combine this with the CPU's clock rate and a bit of division, and you get a timer with an accuracy of 10-60 nanoseconds. He points out that this nanotimer is far more suitable for microbenchmarks that take very little time, and he discusses how such microbenchmarks can yield very useful and interesting information about CPUs. Finally, Robert does the world a huge favor and posts the code to his nanotimer.
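
Robert's posted code is the reference; what follows is only a rough sketch of the same idea on x86, reading the time-stamp counter with the rdtsc instruction and dividing by an assumed clock rate (2.4 GHz here is a placeholder; a real version would read it from /proc/cpuinfo or calibrate against gettimeofday).

    #include <stdint.h>
    #include <stdio.h>

    /* Read the x86 time-stamp counter -- the on-board cycle counter.
     * A real nanotimer would also worry about out-of-order execution
     * and CPU frequency scaling. */
    static inline uint64_t rdtsc(void)
    {
        uint32_t lo, hi;
        __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
        return ((uint64_t)hi << 32) | lo;
    }

    int main(void)
    {
        const double cpu_hz = 2.4e9;     /* assumed clock rate (placeholder) */

        uint64_t c0 = rdtsc();
        /* ... short code fragment to be timed goes here ... */
        uint64_t c1 = rdtsc();

        printf("elapsed: %llu cycles = %.1f ns\n",
               (unsigned long long)(c1 - c0),
               (c1 - c0) / cpu_hz * 1.0e9);
        return 0;
    }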

Josip Loncaric, a consistent contributor to the Beowulf mailing list, greatly appreciated Bob's post and added an observation: on some systems gettimeofday has a resolution of a few microseconds, but on others he has seen a resolution of about 1000 microseconds, which appears to depend on the kernel scheduler. He points out that this destroys the portability of any benchmark that relies on gettimeofday. He ended his post with a challenge to the Beowulf and MPI communities: achieve true microsecond precision, globally synchronized across all the system clocks within the cluster.
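
If you want to check which camp your own system falls into, here is a minimal sketch that probes the effective gettimeofday() tick by calling it back-to-back and reporting the smallest nonzero step it sees. It is a rough probe, not a rigorous measurement.

    #include <stdio.h>
    #include <sys/time.h>

    int main(void)
    {
        struct timeval a, b;
        long min_step = 1000000;     /* start at one full second */

        for (int i = 0; i < 1000; i++) {
            gettimeofday(&a, NULL);
            do {                     /* spin until the reported time changes */
                gettimeofday(&b, NULL);
            } while (a.tv_sec == b.tv_sec && a.tv_usec == b.tv_usec);

            long step = (b.tv_sec - a.tv_sec) * 1000000L
                      + (b.tv_usec - a.tv_usec);
            if (step > 0 && step < min_step)
                min_step = step;
        }
        printf("smallest observed gettimeofday() step: %ld usec\n", min_step);
        return 0;
    }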

Jim Lux suggested that a GPS receiver on each node, providing a pulse signal, might allow the nodes in a cluster to be synchronized. Josip Loncaric asked about using NTP (Network Time Protocol) to synchronize the nodes, especially over a fast, very low-latency network such as Quadrics. Jim Lux responded that a lot of work had gone into NTP, but that using it over a fairly slow network such as Fast Ethernet would pose problems. Robert Brown added that he thought NTP could be made to have a resolution of about 1 usec, but that it would probably take a complete rewrite of NTP's internals to use the nanotimer, coupled perhaps with a kernel module (so that the timer could be guaranteed to run effectively in single-user mode). He also thought that, using the principles of the nanotimer, even Fast Ethernet could support an overall time synchronization of about 1 microsecond.

At this point, the discussion delved into the inner workings of clock oscillators and how they could be used for NTP. The topic then came back to synchronizing time across a cluster using GPS, with quite a bit of detail on the various aspects of implementing it (Jim Lux has a great deal of experience in this area).

The overall discussion was very interesting because it explained some of the inner workings of benchmarks and, as a result, how you can better time your own codes. It was precise and detailed, and it provides a lot of useful information. It also illustrated how companies can use benchmarks to their own benefit (lies, damn lies, and benchmarks); companies have done this in the past and some are doing it now (particularly when their CPUs aren't competitive in the HPC world).

The discussion also produced a very interesting side topic: synchronizing the clocks of the nodes in a cluster. As good administrators know, it is a subject that must be mastered.

Sidebar One: Links Mentioned in Column

Beowulf Mailing List

Koolance

cpu_rate

lmbench


This article was originally published in ClusterWorld Magazine. It has been updated and formatted for the web. If you want to read more about HPC clusters and Linux, you may wish to visit Linux Magazine.

Jeff Layton has been a cluster enthusiast since 1997 and spends far too much time reading mailing lists. He has been to 38 countries and hopes to see all 192 some day.