Wisdom of the ages ...

The Beowulf mailing list provides detailed discussions about issues concerning Linux HPC clusters. In this column I report on Serial ATA (SATA) drives, I/O benchmarks, cluster benchmarks, and booting from solid-state USB drives. You can consult the Beowulf archives for the actual conversations.

Beowulf: SATA or SCSI Drives - Multiple Read/Write Speeds

There was an interesting discussion on the Beowulf mailing list that started on December 8, 2003 with a posting from Robin Laing, who asked about SATA (Serial ATA) drives versus IDE drives (also called Parallel ATA or PATA drives) versus SCSI drives. In particular, he wanted to know which was better for multiple-drive read and write operations. While the resulting discussion wasn't about clusters per se, disk I/O (Input/Output) performance can have a great impact on many cluster applications. Bill Broadley responded that there was some bad information and bias floating around about drive performance (e.g., IDE versus SCSI) and strongly suggested benchmarking your own code, or a disk benchmark such as Bonnie++ or Postmark that is close to your application. He pointed out that there are many factors that can be adjusted to affect I/O performance. The discussion then broke into two parts. The first part covered opinions and test results for PATA and SCSI drives and controllers.
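In that spirit, here is a minimal sketch of the kind of quick sanity check Bill was advocating before trusting anyone else's numbers: time a large sequential write and read on the actual drive you care about. The path and sizes below are hypothetical, and a real test (or Bonnie++ itself) should use a file well beyond physical RAM so the page cache can't answer the reads.

```python
import os
import time

# A sketch of a sequential-throughput check, not a replacement for
# Bonnie++ or Postmark. PATH and SIZE_MB are hypothetical; the file
# should be larger than RAM or the read numbers will reflect the cache.
PATH = "/mnt/test/bench.dat"
SIZE_MB = 2048
CHUNK = 1024 * 1024            # 1 MB per write/read call

def write_test():
    buf = b"\0" * CHUNK
    start = time.time()
    with open(PATH, "wb") as f:
        for _ in range(SIZE_MB):
            f.write(buf)
        f.flush()
        os.fsync(f.fileno())   # force the data to disk, not just the cache
    return SIZE_MB / (time.time() - start)

def read_test():
    start = time.time()
    with open(PATH, "rb") as f:
        while f.read(CHUNK):
            pass
    return SIZE_MB / (time.time() - start)

print("write: %.1f MB/s" % write_test())
print("read:  %.1f MB/s" % read_test())
```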

The ever-present Mark Hahn followed up with some general rules of thumb for different types of drives and pointed out that Jeff Garzik was writing all-new SATA drivers for the 2.6 kernel that should greatly improve the performance of SATA drives under Linux. Robert Brown asked Mark his general opinion about SATA. Mark answered Robert's questions and pointed out that his next server will have SATA controllers in it (hint, hint). David Lombard followed up that he personally liked SCSI drives because, in his experience, he could get much higher I/O rates regardless of the CPU load (PATA drives involve more CPU usage than SCSI drives). David mentioned that he had seen an I/O rate of 290 MB/second on x86 systems using SCSI drives. Bill Broadley followed up with a large number of questions and offered that he has seen PATA drive arrays reach speeds of 300-400 MB/second even when the CPU load was fairly high. David and Bill discussed some technical details, including the fact that these tests were done using multiple controllers and running RAID-0 (striping). Bill finished with the comment that he saw greater I/O rates using XFS as opposed to ext3.

To reinforce Bill's comments, there was a recent posting on the Linux-IDE-Arrays mailing list from Dan Yocum with some test results for a SATA drive array that uses three 3ware 8506-8 SATA RAID cards. The cards were configured with hardware RAID-5, then striped with software RAID-0 across the three cards. Using Bonnie++ on a 125 GB file with 64 KB chunks, Dan was able to achieve about 230 MB/sec for block writes and 520 MB/sec for block reads.
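For readers who haven't worked with striping, a small sketch of the arithmetic behind RAID-0 may help explain why layering it over the three cards pays off: consecutive fixed-size chunks rotate across the devices, so a large sequential transfer keeps all three controllers busy at once. The chunk size and device count below match Dan's configuration; everything else is illustrative.

```python
# Sketch of RAID-0 chunk placement: with N devices and a fixed chunk
# size, chunk i lands on device i % N, so sequential I/O is spread
# across all devices in parallel.
CHUNK_SIZE = 64 * 1024   # 64 KB, matching the configuration above
N_DEVICES = 3            # the three 3ware cards

def locate(byte_offset):
    chunk = byte_offset // CHUNK_SIZE
    device = chunk % N_DEVICES                 # which array holds this chunk
    offset = (chunk // N_DEVICES) * CHUNK_SIZE + byte_offset % CHUNK_SIZE
    return device, offset

# The first three 64 KB chunks fall on devices 0, 1, and 2:
for off in (0, 64 * 1024, 128 * 1024):
    print(off, "->", locate(off))
```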

The second branch of the discussion dealt with some observations about the Linux kernel in relation to I/O performance. Robin Laing responded to the initial discussion by stating that his application used one or two files that were much larger than memory while his code was running. He noticed that his machine 'stutters' for a few seconds every time there is a disk access. Mark Hahn responded that he thought the 'stutter' was not a drive problem but rather a memory-management problem within the kernel. He offered the observation that Linux seemed to over-cache and can reach a point where it is performing scavenging scans (looking for cached data it is unlikely to need so that data can be dumped, then re-caching other data). Robert Brown followed this with some comments, which Mark had sent him off-line, that too much memory seems to confuse the caching system of some kernels. Robert also mentioned that when he used a 1.8 GHz P4 system as a server he also saw some 'mini-delays.' When he took the exact same drives and put them in a 400 MHz Celeron system, he got better performance.
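One mitigation worth knowing about for exactly this symptom (my addition; the thread itself didn't go there) is to tell the kernel to drop streamed data from the page cache as you go, via posix_fadvise() with POSIX_FADV_DONTNEED, which the 2.6 kernels support. Below is a minimal sketch, assuming a Unix system whose Python exposes the call; the input path is hypothetical.

```python
import os

# Stream a file larger than RAM while advising the kernel to drop each
# region from the page cache after use, so the cache doesn't balloon
# until the VM starts scavenging.
PATH = "/data/huge_input.dat"   # hypothetical input file
CHUNK = 8 * 1024 * 1024         # 8 MB per read

fd = os.open(PATH, os.O_RDONLY)
try:
    pos = 0
    while True:
        buf = os.read(fd, CHUNK)
        if not buf:
            break
        # Tell the kernel the bytes just read won't be needed again.
        os.posix_fadvise(fd, pos, len(buf), os.POSIX_FADV_DONTNEED)
        pos += len(buf)
finally:
    os.close(fd)
```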

This discussion was very useful in pointing out the need to test the entire system, from the exact hard drives, to the RAID configuration (if one is to be used), to the exact kernel and kernel configuration, to determine the I/O performance. Using multiple controllers (whether SCSI, PATA, or SATA), RAID-0, and XFS seems to provide the best I/O performance. However, you need to pay attention to the kernel and its configuration to extract the best performance possible.

Beowulf: New Cluster Benchmark

Bill Broadley posted to the Beowulf mailing list on November 17, 2003 about a better benchmark for clusters than the Top500 benchmark. This post grew out of a discussion about Virginia Tech's fantastic Top500 result on their new Apple G5 cluster. Bill was interested in the performance of larger clusters since they are starting to dominate the Top500 list. In particular, he thought the big difficulty for larger clusters was scaling, which is usually an interconnect issue. Jakob Oestergaard responded that he thought the Top500 was a fine benchmark for what it was, but it's definitely not a benchmark that measures the true power of a cluster for one's particular application. He thought that developing a series of benchmarks to quantify a cluster's performance would render the benchmarks useless (he was also the first person in this conversation to use the famous paraphrase, "There are lies, damn lies, and statistics..."). Robert Brown joined in, stating that the one true benchmark was one's application. Robert brought up the point that he thought microbenchmarks (a microbenchmark tests only a single, small aspect of a system or a cluster) were more appropriate for benchmarking machines. He suggested something like Larry McVoy's lmbench benchmark suite. Moreover, he thought Larry's insistence that lmbench results could only be published if all of the results are published was a very good idea (chip companies are notorious for publishing only the benchmark results that make their chips look good). He then stated that in his opinion he would like to see a full suite of microbenchmarks to test core functions "that are building blocks of real programs." These would include some microbenchmarks to test clusters. He finally finished with a typical Brownian comment that the Top500 benchmark was really intended to measure the size of one's, umm, cluster and nothing else.
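To make the idea concrete, here is a toy microbenchmark in the spirit of lmbench's memory-bandwidth test: it times a plain in-memory copy and reports MB/s. Like any microbenchmark, it measures one narrow building block rather than application performance (and the Python interpreter adds overhead that a C harness like lmbench avoids), so treat it as a sketch of the approach, not a calibrated tool. The buffer size and repetition count are arbitrary choices.

```python
import time

# Toy memory-bandwidth microbenchmark: time repeated full copies of a
# large buffer. Measures one building block (memory copy), nothing more.
SIZE = 64 * 1024 * 1024          # 64 MB working set
REPS = 10

src = bytearray(SIZE)
start = time.time()
for _ in range(REPS):
    dst = bytes(src)             # one complete copy of the buffer
elapsed = time.time() - start

mb_copied = SIZE * REPS / (1024 * 1024)
print("copy bandwidth: %.0f MB/s" % (mb_copied / elapsed))
```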

John Hearns pointed out that the old Paralogic website has a link (since moved to here) to a set of tools called the Beowulf Performance Suite (BPS). Robert Brown followed up that Doug Eadline (previous editor of ClusterWorld Magazine and now head monkey at Clustermonkey.net) had done a good job putting together BPS, and perhaps in the future a gathering of cluster experts could extend it and define a good and useful series of cluster benchmarks. Doug replied that BPS is called a Performance Suite, not a Benchmark Suite, because it should be used to generate a baseline for measuring changes (good or bad) to the cluster. Felix Rauch also chimed in with some very good comments about measuring network performance in clusters. Robert Brown really liked Felix's comments and went on to talk about a network microbenchmark that would watch the performance of the system and switch algorithms at the appropriate time to improve performance.
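The network leg is straightforward to sketch as well. Below is a hypothetical TCP ping-pong test between two nodes: small messages bounce back and forth, and the average round-trip time approximates point-to-point latency. Real tools (NetPIPE, netperf, or MPI-level tests) do this far more carefully; the port, message size, and repetition count here are arbitrary.

```python
import socket
import sys
import time

# TCP ping-pong latency sketch. Run "python pingpong.py server" on one
# node and "python pingpong.py client <server-host>" on another.
REPS = 1000
MSG = b"x" * 64        # small message: this measures latency, not bandwidth
PORT = 5999            # hypothetical; any free port will do

def recv_exact(sock, n):
    # recv() may return fewer bytes than asked for, so loop until done.
    data = b""
    while len(data) < n:
        chunk = sock.recv(n - len(data))
        if not chunk:
            raise ConnectionError("peer closed the connection")
        data += chunk
    return data

if sys.argv[1] == "server":
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind(("", PORT))
    srv.listen(1)
    conn, _ = srv.accept()
    conn.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    for _ in range(REPS):
        conn.sendall(recv_exact(conn, len(MSG)))   # echo it straight back
    conn.close()
else:
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.connect((sys.argv[2], PORT))
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    start = time.time()
    for _ in range(REPS):
        sock.sendall(MSG)
        recv_exact(sock, len(MSG))
    avg = (time.time() - start) / REPS
    print("average round trip: %.1f microseconds" % (avg * 1e6))
```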

The Top500 benchmark is a simple benchmark with a long history. It has provided useful information about the general trend in high-performance computing, that is, the increasing dominance of clusters. However, using it to say my cluster is faster than yours is a bit like using the heights of basketball players to judge how good they are. Height is not an accurate indication of how good a team is, and the Top500 is not a measure of how useful a cluster is (although it is fun to play with). The discussion was very useful in providing good suggestions about how benchmarking for clusters should proceed.
