PVFS Performance Enhancement

Number of I/O Servers

A common architectural question is how many I/O servers to use and whether to use the spare disk space on the compute nodes for PVFS (i.e., make the compute nodes double as I/O servers). Answering this question is difficult because of the myriad of configuration options available. However, a few studies have tried to provide some insight into this issue. Three of the major studies were by Kent Milfeld et al. at the Texas Advanced Computing Center (TACC), Jens Mache et al. at Lewis & Clark College, and Monica Kashyap et al. at Dell Computers.

In 2001, Dr. Mache and his associates used PVFS1 to get high-performance disk access from a PC cluster built with commodity IDE disks. Their goal was to break the 1 GB/sec I/O barrier that ASCI Red had broken at a cost of about $1 million. Dr. Mache used 32 AMD Athlon 1.2 GHz nodes connected with Gigabit Ethernet (GigE). Each node had two IDE disks configured with software RAID-0 (striping). On a small 8-node subset, they experimented with varying the number of client nodes and the number of I/O servers while running a ray-tracing program that computed a number of frames of a simple scene. The best split configuration consisted of 2 I/O servers and 6 clients. However, when all 8 nodes were made both clients and I/O servers, the overall completion time was 1.264 times better than the 2-server/6-client configuration, even though the nodes were computing as well as functioning as I/O servers.

Next, Dr. Mache and his team set up all 32 nodes as both clients and I/O servers. They then ran a variation of a read/write test program that comes with PVFS1. The code writes and then reads blocks of integer data to and from a PVFS1 file. Each node adds 96 MB (megabytes) to the global file, which therefore totals (96*n) MB, where n is the number of nodes used. They found that with 25 or more overlapping nodes they achieved at least 1 GB/sec in read performance, and with 29 or more nodes at least 1 GB/sec in write performance.
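The access pattern described above can be sketched in a few lines of Python. This is an illustration of the layout, not the actual PVFS1 test code: each "node" writes a fixed-size block at its own offset in one shared file, so the file grows to (block size * n) bytes. The block size is scaled down from 96 MB to 96 KB so it runs quickly on a local file.

```python
import os
import tempfile

BLOCK_BYTES = 96 * 1024   # stand-in for the 96 MB per node in the study
NUM_NODES = 8             # stand-in for n

def write_block(path, rank, block=BLOCK_BYTES):
    """Write this rank's block at its fixed offset in the shared file."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT)
    try:
        # pwrite places the data at rank * block, so blocks never overlap
        os.pwrite(fd, bytes([rank % 256]) * block, rank * block)
    finally:
        os.close(fd)

path = os.path.join(tempfile.mkdtemp(), "global_file")
for rank in range(NUM_NODES):   # in the real test these ran concurrently
    write_block(path, rank)

total = os.path.getsize(path)
print(total == BLOCK_BYTES * NUM_NODES)  # the file holds (96*n) KB here
```

In the real benchmark each rank ran on a separate node and the offsets fell on different I/O servers, which is why aggregate bandwidth scaled with the node count.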

The cost comparison is even more interesting. ASCI Red cost about $1 million at the time to achieve 1 GB/sec I/O throughput. Dr. Mache spent about $7,200. They beat the I/O price/performance by over a factor of 100!

Kent Milfeld and his associates at TACC examined PVFS1 performance in a cluster with a simple read/write code and a simulated workload code. The first study used 16 single-CPU Intel PIII/1 GHz nodes connected with Fast Ethernet. They varied the number of nodes assigned as I/O servers, with the remaining nodes assigned as clients, so that the sum of the two was always 16. They found that 8 I/O servers and 8 clients gave the best performance for the simple read/write test code. They also found that the Fast Ethernet network handicapped the throughput of PVFS1.

A second system, 32 dual PIII/1 GHz nodes connected with Myrinet 2000, was also tested. In these tests, they allowed one of the two CPUs on each node to be used as a client and the other as an I/O server. They found that splitting the functions on a dual-CPU system produced higher throughput than using dedicated nodes, most likely because a portion of the I/O was local to the nodes. They also found that an equal number of clients and I/O servers produced the best performance. This result is essentially the same overlapped-node configuration that Dr. Mache found.

Monica Kashyap et al. at Dell Computers performed a similar study. They used 40 Dell 2650 nodes with dual 2.4 GHz Intel Xeon processors connected with Myrinet. Up to 24 nodes were used as compute nodes and up to 16 nodes as I/O servers. Each I/O server had five 33.6 GB SCSI drives. They used a test code from ROMIO, called perf, that performs concurrent read/write operations to the same file. They examined two types of write access, with and without file synchronization (MPI_File_sync), and two types of read access, without file synchronization and reads after file synchronization.

In general, they found that without synchronization they could achieve very high write throughput. Interestingly, for a small number of I/O servers, the number of clients could be increased rapidly from 4 to 24 without much impact on overall throughput. Including synchronization ensures that the data is on the disk before the function call returns and, as expected, reduced throughput. However, the general observation that a small number of I/O servers is somewhat insensitive to the number of clients, up to the number tested, still held. The file read tests exhibited the same trends as the write tests.
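The reason synchronized writes cost more can be shown with plain POSIX file I/O. This is a local-disk analogy, not the MPI-IO code from the Dell study: a write() normally returns once the data reaches the OS page cache, while fsync() forces it to the disk before returning, the same on-disk guarantee that MPI_File_sync provides. The buffer size and file names here are arbitrary.

```python
import os
import tempfile
import time

def timed_write(path, data, sync):
    """Write data to path, optionally forcing it to disk; return elapsed seconds."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC)
    t0 = time.perf_counter()
    os.write(fd, data)
    if sync:
        os.fsync(fd)   # data is on disk before we return, like MPI_File_sync
    elapsed = time.perf_counter() - t0
    os.close(fd)
    return elapsed

data = os.urandom(4 * 1024 * 1024)        # 4 MB test buffer
d = tempfile.mkdtemp()
t_nosync = timed_write(os.path.join(d, "a"), data, sync=False)
t_sync = timed_write(os.path.join(d, "b"), data, sync=True)
print(f"without fsync: {t_nosync:.4f}s  with fsync: {t_sync:.4f}s")
```

On a real spinning disk the synchronized time is noticeably longer; on a memory-backed filesystem the two can be close, which is itself a reminder that the unsynchronized numbers only measure how fast you can fill the cache.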

The differing results are probably due to the complex nature of optimizing the number of I/O servers and the configuration options. There are two interesting things you can take away from these studies. First, you can "dial in" your desired performance by adding I/O servers for a given number of clients until you reach the desired throughput. This option is cost effective because you only add the number of I/O servers needed for a given level of performance. Second, using compute nodes as both clients and I/O servers has been shown to be cost effective, but it could also lead to some network congestion if multiple jobs are running at the same time.
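The "dial-in" idea can be made concrete with a back-of-the-envelope model. This is my own simplification, not a formula from any of the studies: aggregate throughput is roughly limited by the smaller of a server's disk bandwidth and its network link, so you keep adding servers until the target is met. The sample bandwidth figures are assumptions, not measurements.

```python
import math

def servers_needed(target_mb_s, disk_mb_s_per_server, net_mb_s_per_server):
    """Estimate how many I/O servers reach a target aggregate bandwidth.

    Each server contributes the lesser of its disk and network bandwidth;
    clients, metadata traffic, and contention are ignored in this sketch.
    """
    per_server = min(disk_mb_s_per_server, net_mb_s_per_server)
    return math.ceil(target_mb_s / per_server)

# e.g. a 1 GB/sec target with IDE RAID-0 pairs at ~60 MB/sec behind GigE
# links at ~100 MB/sec (both numbers assumed for illustration)
print(servers_needed(1000, 60, 100))  # -> 17 servers in this simple model
```

The model also shows why Fast Ethernet handicapped the TACC system: with the network term at roughly 12 MB/sec, no amount of disk bandwidth per server helps.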

Increased Disk Performance

In some cases, people have found that the underlying disk speed is the primary bottleneck. You can tune the disks for improved performance; the command hdparm has been used for several years to improve the performance of disk drives. See the Resources Sidebar for more information.

Additionally, there is another easy way to improve disk performance: RAID (Redundant Array of Inexpensive Disks). There are several RAID levels you can use to improve performance. At the simplest level you can put multiple disks in a RAID-0 (striping) configuration. As part of their study, Dell looked at 1 to 4 disks in a hardware RAID-0 configuration. They found that while the number of disks had only a small impact on read performance, increasing the number of disks in the RAID-0 set had a large impact on write performance, particularly for the file synchronization case (synchronizing your data is always a good idea).
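Why striping helps is easy to see from the address mapping. Here is a minimal sketch of the RAID-0 layout (illustrative only, not the Linux md driver): logical bytes are laid out round-robin in fixed-size chunks across the member disks, so a large sequential transfer keeps every disk busy at once. The 64 KB chunk size is an assumed, typical value.

```python
CHUNK = 64 * 1024   # assumed stripe chunk size in bytes

def raid0_map(offset, ndisks, chunk=CHUNK):
    """Map a logical byte offset to (disk index, offset within that disk)."""
    stripe = offset // chunk            # which chunk of the logical volume
    disk = stripe % ndisks              # chunks rotate across the disks
    disk_off = (stripe // ndisks) * chunk + offset % chunk
    return disk, disk_off

# Four consecutive 64 KB chunks land on four different disks, so a 256 KB
# sequential write is serviced by all four spindles in parallel:
print([raid0_map(i * CHUNK, ndisks=4)[0] for i in range(4)])  # -> [0, 1, 2, 3]
```

This is also why writes gain more than reads in the Dell results: writes must wait for the slowest spindle, so spreading the data across more spindles shortens the wait, while reads were already being helped by caching.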

You could also use RAID-5, which would also give you some fault tolerance (we'll discuss this in the next column). For increased reliability you could combine RAID-0 with RAID-1 (mirroring). Whatever RAID level you select, you can use a dedicated hardware RAID controller or the software RAID that is built into Linux.

    Creative Commons License
    ©2005-2012 Copyright Seagrove LLC, Some rights reserved. Except where otherwise noted, this site is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 2.5 License. The Cluster Monkey Logo and Monkey Character are Trademarks of Seagrove LLC.