SATA drives, benchmarks, booting from USB

Published on Monday, 27 February 2006 14:00
Written by Jeff Layton

Wisdom of the ages ...

The Beowulf mailing list provides detailed discussions about issues concerning Linux HPC clusters. In this column I report on Serial ATA (SATA) drives, I/O benchmarks, cluster benchmarks, and booting from solid-state USB drives. You can consult the Beowulf archives for the actual conversations.

Beowulf: SATA or SCSI drives - Multiple Read/Write Speeds

There was an interesting discussion on the Beowulf mailing list that started on December 8, 2003 with a posting from Robin Laing, who asked about SATA (Serial ATA) drives versus IDE drives (also called Parallel ATA or PATA drives) versus SCSI drives. In particular, he wanted to know which one was better for multiple drive read and write operations. While the resulting discussion wasn't about clusters per se, disk I/O (Input/Output) performance can have a great impact on many cluster applications. Bill Broadley responded that there was some bad information and bias floating around about drive performance (e.g., IDE versus SCSI) and strongly suggested benchmarking your own code or a disk benchmark, such as Bonnie++ or Postmark, that is close to your application. He pointed out that there are many factors that can be adjusted to affect I/O performance. The discussion then broke into two parts. The first part discussed opinions and test results for PATA and SCSI drives and controllers.
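Bill's advice to "benchmark it yourself" can be followed even without a full suite. The sketch below is a minimal sequential read/write timing in Python; the file name and size are illustrative assumptions, and it is no substitute for Bonnie++ or Postmark, which exercise far more access patterns (in real tests, use a file larger than RAM so the page cache doesn't mask the drive):

```python
import os
import time

FILE_NAME = "testfile.bin"   # assumed scratch file on the disk under test
SIZE_MB = 64                 # kept small here; use a file larger than RAM for real tests
CHUNK = 1024 * 1024          # 1 MB per write

def write_test():
    """Sequentially write SIZE_MB megabytes and return throughput in MB/s."""
    buf = b"\0" * CHUNK
    start = time.time()
    with open(FILE_NAME, "wb") as f:
        for _ in range(SIZE_MB):
            f.write(buf)
        f.flush()
        os.fsync(f.fileno())  # force data to the drive so the timing is honest
    return SIZE_MB / (time.time() - start)

def read_test():
    """Sequentially read the file back and return throughput in MB/s."""
    start = time.time()
    with open(FILE_NAME, "rb") as f:
        while f.read(CHUNK):
            pass
    return SIZE_MB / (time.time() - start)

if __name__ == "__main__":
    print("write: %.1f MB/s" % write_test())
    print("read:  %.1f MB/s" % read_test())
    os.remove(FILE_NAME)
```

Note the `fsync` in the write test: without it, the kernel may report the write as complete while the data is still sitting in the page cache, which is exactly the kind of factor Bill warned can skew naive comparisons.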

The ever-present Mark Hahn followed up with some general rules of thumb for different types of drives and pointed out that Jeff Garzik was writing all-new SATA drivers for the 2.6 kernel that should greatly improve the performance of SATA drives under Linux. Robert Brown asked Mark his general opinion about SATA. Mark answered Robert's questions and pointed out that his next server will have SATA controllers in it (hint, hint). David Lombard followed up that he personally liked SCSI drives because in his experience he could get much higher I/O rates regardless of the CPU load (PATA drives involve more CPU usage than SCSI drives). David mentioned that he had seen an I/O rate of 290 MB/second on x86 systems using SCSI drives. Bill Broadley followed up with a large number of questions and offered that he has seen PATA drive arrays reach speeds of 300-400 MB/second even under fairly high CPU load. David and Bill discussed some technical details, including the fact that these tests were done using multiple controllers and running RAID-0 (striping). Bill finished with the comment that he saw greater I/O rates using XFS as opposed to ext3.

To reinforce Bill's comments, there was a recent posting on the Linux-IDE-Arrays mailing list from Dan Yocum with some test results for a SATA drive array that uses three 3ware 8506-8 SATA RAID cards. The cards were configured with hardware RAID-5, then striped with software RAID-0 across the three cards. Using Bonnie++ on a 125 GB file with 64 KB chunks, Dan was able to achieve about 230 MB/sec for block writes and 520 MB/sec for block reads.


The second branch of the discussion dealt with some observations about the Linux kernel in relation to I/O performance. Robin Laing responded to the initial discussion by stating that his application used one or two files that were much larger than memory while his code was running. He noticed that his machine 'stutters' for a few seconds every time there is a disk access. Mark Hahn responded that he thought the 'stutter' was not a drive problem but rather a memory-management problem within the kernel. He offered the observation that Linux seemed to over-cache and can reach a point where it is running scavenging scans (looking for cached data it is unlikely to need so that it can be dropped, only to re-cache the data later). Robert Brown followed this with some comments, which Mark had sent him off-list, that too much memory seems to confuse the caching system of some kernels. Robert also mentioned that when he used a 1.8 GHz P4 system as a server he also saw some 'mini-delays.' When he took the exact same drives and put them in a 400 MHz Celeron system he got better performance.

This discussion was very useful in pointing out the need to test the entire system, from the exact hard drives, to the RAID configuration to be used (if one is to be used), to the exact kernel and kernel configuration, to determine the I/O performance. Using multiple controllers (whether SCSI or PATA or SATA), RAID-0, and XFS seems to provide the best I/O performance. However, you need to pay attention to the kernel and kernel configuration to extract the best performance possible.

Beowulf: New Cluster Benchmark

Bill Broadley posted to the Beowulf mailing list on November 17, 2003 about a better benchmark for clusters than the Top500 benchmark. This post grew out of a discussion about Virginia Tech's fantastic Top500 result on their new Apple G5 cluster. Bill was interested in the performance of larger clusters since they are starting to dominate the Top500 benchmark. In particular, he thought the big difficulty for larger clusters was scaling, which is usually an interconnect issue. Jakob Oestergaard responded that he thought the Top500 was a fine benchmark for what it was. But it's definitely not a benchmark that measures the true power of a cluster for one's particular application. He thought that developing a series of benchmarks to quantify a cluster's performance would render the benchmarks useless (he was also the first person in this conversation to use the famous paraphrase, "There are lies, damn lies, and statistics..."). Robert Brown joined in, stating that the one true benchmark was one's application. Robert brought up the point that he thought microbenchmarks (a microbenchmark tests only a single, small aspect of a system or a cluster) were more appropriate for benchmarking machines. He suggested something like Larry McVoy's lmbench benchmark suite. Moreover, he thought Larry's insistence that lmbench results could only be published if all of the results are published was a very good idea (chip companies are notorious for only publishing certain benchmark results that make their chips look good). He then stated that in his opinion he would like to see a full suite of microbenchmarks to test core functions "that are building blocks of real programs." These would include some microbenchmarks to test clusters. He finally finished with a typical Brownian comment that the Top500 benchmark was really intended to measure the size of one's, umm, cluster and nothing else.
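To make the microbenchmark idea concrete, here is a toy example in the spirit of lmbench (but not part of it): time one tiny operation in isolation so its cost isn't buried inside a larger workload. The operations chosen here (a trivial system call and a small memory copy) and the iteration count are illustrative assumptions:

```python
import os
import time

def time_per_call(func, iterations=100_000):
    """Return the average wall-clock time of one call to func, in seconds."""
    start = time.perf_counter()
    for _ in range(iterations):
        func()
    return (time.perf_counter() - start) / iterations

# Latency of a near-trivial system call, lmbench's "null syscall" idea.
syscall_latency = time_per_call(os.getpid)

# Cost of copying a small (4 KB) buffer in memory.
src = bytearray(4096)
copy_latency = time_per_call(lambda: bytes(src))

print("getpid:    %.0f ns/call" % (syscall_latency * 1e9))
print("4 KB copy: %.0f ns/call" % (copy_latency * 1e9))
```

Each number by itself says little about application performance, which is exactly Robert's point: microbenchmarks characterize the building blocks, and it is up to you to know which blocks your application actually stresses.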

John Hearns pointed out that the old Paralogic website has a link (since moved to here) to a set of tools called the Beowulf Performance Suite (BPS). Robert Brown followed up that Doug Eadline (previous editor of ClusterWorld Magazine and now head monkey at Clustermonkey.net) had done a good job putting together BPS and perhaps in the future a gathering of cluster experts could extend it and define a good and useful series of cluster benchmarks. Doug replied that BPS is called a Performance Suite, not a Benchmark Suite because it should be used to generate a baseline to measure changes (good or bad) to the cluster. Felix Rauch also chimed in with some very good comments about measuring network performance in clusters. Robert Brown really liked Felix's comments and went on to talk about a network microbenchmark that would watch the performance of the system and switch algorithms at the appropriate time to improve performance.
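In the spirit of Felix's comments on measuring network performance, here is a minimal point-to-point TCP bandwidth test. It runs over loopback purely for illustration; on a cluster you would run the sink on one node and the sender on another. The transfer size and chunk size are arbitrary choices, and a real tool would also sweep message sizes and measure latency separately:

```python
import socket
import threading
import time

TOTAL_MB = 32          # total data to push through the socket
CHUNK = 64 * 1024      # 64 KB per send

def sink(server_sock):
    """Accept one connection and drain it until the sender closes."""
    conn, _ = server_sock.accept()
    while conn.recv(CHUNK):
        pass
    conn.close()

def measure():
    """Return achieved bandwidth in MB/s over a local TCP connection."""
    server = socket.socket()
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind(("127.0.0.1", 0))          # let the OS pick a free port
    port = server.getsockname()[1]
    server.listen(1)
    t = threading.Thread(target=sink, args=(server,))
    t.start()

    client = socket.socket()
    client.connect(("127.0.0.1", port))
    buf = b"\0" * CHUNK
    start = time.time()
    for _ in range(TOTAL_MB * 1024 * 1024 // CHUNK):
        client.sendall(buf)
    client.close()
    t.join()
    server.close()
    return TOTAL_MB / (time.time() - start)

if __name__ == "__main__":
    print("bandwidth: %.0f MB/s" % measure())
```

A single number like this is a baseline in Doug's sense: rerun it after a driver, switch, or kernel change and you learn whether the change helped or hurt, without claiming anything about application performance.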

The Top500 benchmark is a simple benchmark with a long history. It has provided useful information about the general trend in high performance computing, that is, the increasing dominance of clusters. However, using it to say my cluster is faster than yours is a bit like using the heights of basketball players to indicate how good they are. The heights are not an accurate indication of how good a team is, and the Top500 is not a measure of how useful a cluster is (although it is fun to play with). The discussion was very useful in providing good suggestions about how benchmarking for clusters should proceed.


Beowulf: Booting from USB Pen Drive

While a bit old, there was an interesting discussion on the Beowulf mailing list that started with a posting by p.pennaz on 21 November, 2003, asking about booting a Linux system via a USB cartridge (USB solid-state storage device). USB storage, or any solid-state storage for that matter, is very interesting because there are no moving parts, and if the power goes out you don't lose your data. There was an immediate response that you should be able to boot from a USB storage device if your motherboard has a BIOS option to support it. Mark Hahn provided some simple ideas about what it would take to boot from a USB storage device. Donald Becker responded that just because a motherboard can boot from a USB storage device doesn't mean it's that easy. Many USB storage devices cannot be used for booting.

There are several Linux distributions that can fit onto a USB storage device and allow systems to boot directly from them. In fact, John Hearns pointed out that he has routinely booted systems from a USB memory stick that had StressLinux loaded in it.

Jim Lux also pointed out that there are simple IDE-to-CF (CF = Compact Flash) adapters that allow you to use CF cards as though they are disks. In a later post, Jim also pointed out how nice it could be to boot a diskless cluster node from a CF card using the adapter. This capability would help improve reliability (no moving parts) and reduce heat generation inside a node. Jim's intent is to use these kinds of devices on nodes that only have a wireless network (he doesn't want to ship a kernel and associated parts over a wireless network because of the low bandwidth). Andy Cater reminded everyone that Compact Flash supports only a limited number of rewrites, so he suggested using the CF card only for the read-only portions of the operating system and a small ramdisk for the portions that change frequently (e.g., /var and /tmp).
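Andy's suggestion translates into a straightforward mount layout. The fragment below is a sketch of what the relevant /etc/fstab lines might look like; the device name, file system type, and tmpfs sizes are illustrative assumptions, not a tested configuration:

```
# Root file system on the CF card, mounted read-only to limit flash wear.
/dev/hda1   /      ext2    ro,noatime   0 0

# Frequently written directories live in RAM-backed tmpfs instead of flash.
tmpfs       /tmp   tmpfs   size=64m     0 0
tmpfs       /var   tmpfs   size=64m     0 0
```

Anything under /var that must survive a reboot (logs, for example) would then need to be shipped off the node, e.g., to a central log server, since tmpfs contents vanish at power-off.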

Solid-state storage is fast becoming very inexpensive thanks to commodity uses (cameras, MP3 players, cell phones, etc.). These devices offer increased reliability and lower power usage and heat generation compared to hard drives. However, they are more expensive and slower (perhaps not an issue for read-only file systems) than hard drives. Overall, solid-state storage has much to offer and may be very useful for clusters.


Sidebar One: Links Mentioned in Column

Beowulf archives

linux-ide-arrays mailing list

linux-ide-arrays archive

Bonnie++

Postmark

lmbench

Stress Linux

Top500

BPS


This article was originally published in ClusterWorld Magazine. It has been updated and formatted for the web. If you want to read more about HPC clusters and Linux you may wish to visit Linux Magazine.

Jeff Layton has been a cluster enthusiast since 1997 and spends far too much time reading mailing lists. He has been to 38 countries and hopes to see all 192 some day.