
Our own benchmark, we are special you know

In a previous article I started to explore benchmarks for parallel file systems. In that article, we learned that benchmarks for serial file systems are not the best tools for measuring the performance of parallel file systems (big surprise). Five of the most common parallel file system benchmarks were also mentioned, but their usefulness was limited because each applies only to certain workloads and/or certain access patterns -- either memory access or storage access.

In this article we will take a look at a relatively new parallel file system benchmark suite that was designed to capture the behavior of several classes of parallel access signatures.

Introduction

A very good starting point for a new synthetic parallel file system benchmark suite is the Master's thesis by Frank Shorter at the Parallel Architecture Research Laboratory (PARL) at Clemson University, written under the direction of Dr. Walt Ligon. [Just to be clear, there is no connection (that we know of) between the name of our mascot (Walt) and the esteemed Walt Ligon. -- Ed.]

This article will discuss synthetic benchmarks, that is, benchmarks that are not real applications but are designed to reflect real applications. Given the wide range of I/O requirements across various codes, developing synthetic benchmarks is the only realistic approach to benchmarking parallel file systems.

Before jumping into a discussion of the new benchmark suite that Mr. Shorter suggests, we should ask what a parallel file system benchmark should produce, or what it should do.

Let's start with the premise that parallel file systems are designed for high performance with large amounts of data. So we're interested in measuring how long it takes to read and write data to and from the file system. One can think of this as the bandwidth of the file system (given the time and the amount of data, you can compute the bandwidth).

However, there are some caveats here. First, what is a "large amount of data"? Second, how do we effectively measure time for distributed I/O when the nodes are independent of one another? Hopefully I'll answer these questions as we discuss some new benchmarks.

MPI-IO Is Here

Before the advent of the MPI 2.0 standard, writing data to and reading data from parallel file systems was done on an ad hoc basis that often included non-standard methods. MPI 2.0 brought MPI-based functions for reading and writing data to the file system, in our case, a parallel file system. This change provides portability to MPI applications needing parallel I/O. Some MPI implementations have added the MPI-IO functions to their 1.x versions as well.

To simplify coding and help improve performance, MPI-IO was designed to abstract the I/O operations. It allows the code to set what is called a file view, which defines which part of a file is visible to a process. It also defines collective I/O functions that handle non-contiguous file access in a few function calls. By defining a file view and using collective I/O functions, the code can perform its I/O in just a few calls.

MPI-IO also allows complex derived data types to be constructed for data of any type and size. The combination of a file view and derived data types allows virtually any access pattern to be expressed. In a generic sense, the data, whether in memory or in a file, can be laid out either contiguously or non-contiguously: non-contiguously in memory and contiguously in the file, contiguously in memory and non-contiguously in the file, or non-contiguously in both.
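To make this concrete, here is a minimal sketch of contiguous-in-memory, non-contiguous-in-file access (my own illustration with made-up block sizes, not code from pio-bench). Each process describes its interleaved blocks of a shared file with a derived datatype, installs that type as its file view, and then issues a single collective write.

#include <mpi.h>

#define BLOCK  1024               /* ints per contiguous block          */
#define NBLOCK 16                 /* blocks written by each process     */

int main(int argc, char **argv)
{
    int rank, nprocs;
    int buf[BLOCK * NBLOCK];
    MPI_File fh;
    MPI_Datatype filetype;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    for (int i = 0; i < BLOCK * NBLOCK; i++)
        buf[i] = rank;            /* contiguous data in memory           */

    /* NBLOCK blocks of BLOCK ints, strided so the blocks written by the
       nprocs processes interleave in the file.                          */
    MPI_Type_vector(NBLOCK, BLOCK, BLOCK * nprocs, MPI_INT, &filetype);
    MPI_Type_commit(&filetype);

    MPI_File_open(MPI_COMM_WORLD, "testfile",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Each rank's view starts at its own block-sized displacement.      */
    MPI_File_set_view(fh, (MPI_Offset)rank * BLOCK * sizeof(int),
                      MPI_INT, filetype, "native", MPI_INFO_NULL);

    /* One collective call writes all of this rank's blocks.             */
    MPI_File_write_all(fh, buf, BLOCK * NBLOCK, MPI_INT, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Type_free(&filetype);
    MPI_Finalize();
    return 0;
}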

For the rest of the discussion, I'll focus on benchmarks that use MPI-IO: first, because it is a standard that allows codes to be portable, and second, because it allows different access patterns to be easily coded into the benchmark.

Work Loads

Before synthetic benchmarks can actually be written, we must choose the access patterns we want to simulate. These patterns come from examining the typical work loads that people see in their codes.

As discussed in Mr. Shorter's thesis, past efforts at defining work loads provide a very good summary of the dominant ones. One of the dominant patterns found by examining a large number of codes is what is called strided access. A stride is the distance within the file from the beginning of one part of a data block to the beginning of the next part.

Mr. Shorter describes past work from the CHARISMA project that has shown there are two types of strided access. The first, termed simple strided, was found in many codes with and without serial access patterns. In other words, the data may be accessed in a serial manner, but there is some distance between the data accesses. The CHARISMA researchers went on to say that serial access can be mathematically decomposed into a series of simple strides.

The second popular stride pattern is called "nested stride." In this case, the code is accessing a stripe of data. Within the stripe, the data can be accessed in a set of simple strides. Hence, the term, nested stride.

In addition to how the data is actually accessed (the spatial access pattern), past efforts at examining the I/O patterns of programs have shown that there is also a temporal access pattern. As you would guess, researchers found that codes sometimes have access patterns that reflect various procedures in the code. Such patterns vary with time, and are called temporal access patterns.

Introducing pio-bench

The result of Mr. Shorter's thesis was a new benchmark suite, named pio-bench. In this section, I'll go over some of the design aspects that he considered. While it may seem a bit dry, understanding the critical issues and how they were implemented in the benchmark suite will help you understand the benchmark results.

Finding The Time

In his work, Mr. Shorter has taken great pains to develop a framework that provides accurate timing. The first step was to divide the actual benchmark into three pieces, a setup phase, an operations phase (reads or writes), and a cleanup phase. While this choice may sound simplistic, it allows the timings to focus on the various pieces of the whole I/O process that may be of interest to certain people. Also, it allows for standardized timing reports for all of the various benchmarks.

As I mentioned in the previous column, measuring the time it takes to perform I/O on a distributed system is a difficult proposition. The clocks on the nodes are skewed relative to one another, and the nodes will finish their various portions of the I/O at different times. So, how does one measure time for a parallel file system benchmark?

To resolve this issue, the pio-bench code uses aggregate time. That is, the time from when the earliest process starts its I/O, to the time that the last process finishes its I/O.

The general flow of the benchmarks begins with the setup phase. After the setup phase, an MPI_Barrier is called. This step ensures that the various MPI processes are synchronized. Then the operations take place, where the I/O is actually performed, followed by another MPI_Barrier. This aggregate time is written as:

Aggregate Time = S + T + B
where S is the time between the end of the first MPI_Barrier and the beginning of the first file access by any process, T is the total time the accesses take, from the first access starting to the last access finishing (note that some processes may be done before T is reached), and B is the time the second MPI_Barrier, called at the end of the operations phase, takes after the last access has completed. This second barrier ensures that all processes have finished executing their operations.

If the amount of data processed is small, including B in the timing can skew the results. So the I/O must run long enough that B is insignificant relative to the aggregate time. This requirement helps define the minimum amount of data the benchmark needs for reproducible results.
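To make the timing scheme concrete, here is a minimal sketch of the barrier-and-timer idea (my own illustration, not pio-bench's actual code). It assumes MPI has already been initialized, as in the earlier listing, and do_io_phase is a placeholder for whatever operations phase is being measured.

double time_io_phase(void (*do_io_phase)(void))
{
    double t0, t1, local, aggregate = 0.0;

    MPI_Barrier(MPI_COMM_WORLD);     /* synchronize all processes         */
    t0 = MPI_Wtime();
    do_io_phase();                   /* the reads or writes being timed   */
    MPI_Barrier(MPI_COMM_WORLD);     /* wait for the slowest process      */
    t1 = MPI_Wtime();

    /* Each process measures roughly S + T + B; taking the maximum gives
       a time spanning the earliest start to the latest finish.           */
    local = t1 - t0;
    MPI_Reduce(&local, &aggregate, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    return aggregate;                /* meaningful on rank 0              */
}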


Access Patterns

For the purposes of the benchmark, an Access Pattern refers to the specific mapping between an MPI process and some set of bytes within a file; in essence, how the data gets to and/or from the file system via an MPI process. This pattern is described with a spatial distribution (i.e., how the data is distributed on the file system or in memory) and a temporal distribution (i.e., how the data is accessed as the code runs).

The benchmark as currently written has a number of common spatial access patterns. These patterns are simple strided, nested strided, random strided, sequential access, segmented access, tiled, flash I/O kernel, and unstructured mesh.

A simple strided access pattern divides sections of a file into logical strips. Each participating process accesses the data at some displacement relative to a stripe. Most of the time, the displacement is related to the process number (rank) in an MPI communicator group. This access pattern has been found to be very common in many applications.
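The same pattern can also be expressed without a file view, by computing each block's file offset explicitly. The sketch below is my own illustration (assuming MPI is already initialized, the file is already open, and the default byte-offset file view), not pio-bench code; it just shows the offset arithmetic for rank-interleaved blocks.

#define BLOCK_BYTES 65536         /* bytes per block                    */
#define NBLOCKS     64            /* blocks accessed by each process    */

/* Simple strided writes: block i of process 'rank' lands at position
   (i * nprocs + rank) in the file, measured in units of BLOCK_BYTES.   */
void simple_strided_write(MPI_File fh, int rank, int nprocs,
                          const char *buf /* NBLOCKS*BLOCK_BYTES bytes */)
{
    for (int i = 0; i < NBLOCKS; i++) {
        MPI_Offset offset = (MPI_Offset)(i * nprocs + rank) * BLOCK_BYTES;
        MPI_File_write_at(fh, offset, buf + i * BLOCK_BYTES,
                          BLOCK_BYTES, MPI_BYTE, MPI_STATUS_IGNORE);
    }
}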

A nested strided access pattern combines several simple strided accesses within each stripe. The I/O may then consist of simple stride accesses of one stripe inside a simple stride access of another stripe, and so on. A nested stride access can be used for accessing data in a two-dimensional array, and it extends easily to three-dimensional arrays. This type of access is the second most common and can be found in many codes, such as linear algebra codes.
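One way to express a nested stride with MPI derived datatypes is to build a simple strided type and then repeat it at a larger stride. The sketch below is my own illustration with arbitrary sizes (assuming MPI is already initialized), not pio-bench code.

/* Nested strided read: an inner simple stride (blocks within a stripe)
   repeated at an outer stride (stripe to stripe).                      */
void nested_strided_read(const char *fname, int rank,
                         int *buf /* at least 32 ints */)
{
    enum { BLOCK = 2,         /* ints per innermost block               */
           INNER_COUNT = 4,   /* blocks per stripe for this process     */
           INNER_STRIDE = 8,  /* ints between block starts              */
           OUTER_COUNT = 4,   /* stripes accessed                       */
           STRIPE = 64 };     /* ints between stripe starts             */

    MPI_Datatype inner, nested;
    MPI_File fh;

    MPI_Type_vector(INNER_COUNT, BLOCK, INNER_STRIDE, MPI_INT, &inner);
    MPI_Type_create_hvector(OUTER_COUNT, 1, STRIPE * sizeof(int),
                            inner, &nested);
    MPI_Type_commit(&nested);

    MPI_File_open(MPI_COMM_WORLD, fname, MPI_MODE_RDONLY,
                  MPI_INFO_NULL, &fh);

    /* Offset each rank by one block so that up to INNER_STRIDE/BLOCK
       ranks interleave within each inner stride.                       */
    MPI_File_set_view(fh, (MPI_Offset)rank * BLOCK * sizeof(int),
                      MPI_INT, nested, "native", MPI_INFO_NULL);
    MPI_File_read_all(fh, buf, OUTER_COUNT * INNER_COUNT * BLOCK,
                      MPI_INT, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Type_free(&nested);
    MPI_Type_free(&inner);
}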

A random strided access pattern has each process read or write a variable amount of data in a round-robin fashion. The size of each stride can vary depending upon how much data is accessed by each process in turn. A good example of this kind of access pattern is media encoding, where the size of each frame is variable.

In the sequential access pattern, each process opens the file and issues requests (read or write) that access a fixed-size part of the file, with linearly increasing offsets for successive requests. This access pattern is common for file systems that allow regions of a file to be partitioned across separate I/O servers. In particular, it maximizes read performance (all processes can simply read the same file). Write I/O is a different matter, however, because you don't want multiple writes to the same part of the file; the way around this is to have each process write to a different file. This type of access is common in older codes.
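For the write side of the sequential pattern, the file-per-process workaround might look like the following sketch (my own illustration, assuming MPI is already initialized).

#include <stdio.h>    /* for snprintf */

/* Sequential writes with one file per process, avoiding overlapping
   writes to a shared file.                                             */
void sequential_write_own_file(int rank, const char *buf,
                               int nrequests, int reqsize)
{
    char fname[64];
    MPI_File fh;

    snprintf(fname, sizeof(fname), "seqfile.%d", rank);

    /* MPI_COMM_SELF: each process opens, and owns, its own file.       */
    MPI_File_open(MPI_COMM_SELF, fname,
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Fixed-size requests at linearly increasing offsets.              */
    for (int i = 0; i < nrequests; i++)
        MPI_File_write_at(fh, (MPI_Offset)i * reqsize,
                          buf + (MPI_Offset)i * reqsize,
                          reqsize, MPI_BYTE, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
}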

A segmented access pattern is a fairly simple one. It divides the file into as many logical segments as there are processes performing I/O. For example, if a file is 4096 bytes long and there are four processes, a segmented access pattern assigns the first 4096/4 bytes to the first process, the next 4096/4 bytes to the next process, and so on. This type of access pattern is used by many different codes.
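The offset arithmetic for the segmented pattern is as simple as the description suggests. A sketch (mine, not pio-bench's, assuming the default byte-offset file view and that each segment fits in an int):

/* Segmented read: process 'rank' gets the rank-th contiguous
   1/nprocs slice of the file.                                          */
void segmented_read(MPI_File fh, MPI_Offset filesize,
                    int rank, int nprocs, char *buf)
{
    MPI_Offset segment = filesize / nprocs;   /* bytes per process      */
    MPI_Offset offset  = segment * rank;      /* start of my slice      */

    /* Collective read; every process participates with its own offset. */
    MPI_File_read_at_all(fh, offset, buf, (int)segment,
                         MPI_BYTE, MPI_STATUS_IGNORE);
}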

A tiled access pattern takes a two-dimensional set of data and breaks it into small rectangular regions. Depending upon how the tiles are laid out, each tile access can be decomposed into a segmented access pattern or a simple strided access pattern. One example that uses this type of I/O is a large virtual screen spread across multiple video devices. Another is storing two-dimensional data that is larger than memory, as happens in out-of-core solvers.
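MPI's subarray datatype maps naturally onto tiles. The sketch below is my own illustration (not pio-bench code); it assumes the number of processes is a perfect square and hands each process one tile of a square, row-major array of doubles.

#include <math.h>     /* for sqrt */

/* Tiled read of a global x global array of doubles stored row-major in
   a file, with one square tile per process.                            */
void tiled_read(const char *fname, int rank, int nprocs,
                int global, double *tile /* (global/pgrid)^2 doubles */)
{
    int pgrid = (int)(sqrt((double)nprocs) + 0.5);  /* process grid side */
    int tsize = global / pgrid;                     /* tile side length  */

    int sizes[2]    = { global, global };
    int subsizes[2] = { tsize, tsize };
    int starts[2]   = { (rank / pgrid) * tsize,     /* tile row offset    */
                        (rank % pgrid) * tsize };   /* tile column offset */

    MPI_Datatype filetype;
    MPI_File fh;

    MPI_Type_create_subarray(2, sizes, subsizes, starts,
                             MPI_ORDER_C, MPI_DOUBLE, &filetype);
    MPI_Type_commit(&filetype);

    MPI_File_open(MPI_COMM_WORLD, fname, MPI_MODE_RDONLY,
                  MPI_INFO_NULL, &fh);
    MPI_File_set_view(fh, 0, MPI_DOUBLE, filetype, "native", MPI_INFO_NULL);
    MPI_File_read_all(fh, tile, tsize * tsize, MPI_DOUBLE,
                      MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Type_free(&filetype);
}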

The next access pattern is called the Flash I/O kernel access pattern. FLASH is a code used to study thermonuclear explosions on the surfaces of various types of stars; in particular, simulations related to X-ray bursts and Type Ia supernovae depend heavily on I/O performance. This access pattern is non-contiguous in memory and non-contiguous within the file. In essence, the access pattern is a sequence of 'memory blocks', where each block contains a three-dimensional cube of data. Inside this cube there are areas that need to be accessed from the file and areas that are skipped over.

The last spatial access pattern is the unstructured mesh access pattern. The data written for this access pattern typically consist of a set of points with x, y, z coordinates. At each point there are a number of properties, ranging from just a couple to hundreds. Also included is the connectivity, that is, how the points relate to one another to form polygons. Typically, this type of data can be decomposed into one of the previously mentioned access patterns (simple strided or segmented) because each process has a subset of the total data. The unstructured mesh access pattern is very common in engineering applications such as Computational Fluid Dynamics, Finite Elements, terrain mapping, and so on.

The benchmark suite can also accommodate temporal access patterns. It covers five patterns: read once, write once, read-modify-write, re-read, and re-write.

For the read once access pattern, data from a file is accessed and copied to some location in memory. The file access and the memory access may be contiguous or non-contiguous. The benchmark assumes that all read access is blocking. That is, all data must be in memory before the read function returns.

The write once access pattern is similar to read once, but more care is taken over whether or not the data is flushed to the storage devices.

The read-modify-write access is a very intensive temporal access pattern. It involves reading data, operating on it, and then writing the data back to the original location in the file. One common application of this type of access is an out-of-core simulation where there is more data than fits into memory.
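A read-modify-write step over a process's slice of the file might look like the sketch below (my own illustration with an arbitrary "modify" step; it assumes the default file view, so the offset is in bytes).

/* Read a slice of doubles, modify it in place, and write it back to
   the same offset in the file.                                         */
void read_modify_write(MPI_File fh, MPI_Offset offset,
                       double *buf, int count)
{
    MPI_File_read_at_all(fh, offset, buf, count, MPI_DOUBLE,
                         MPI_STATUS_IGNORE);

    for (int i = 0; i < count; i++)   /* the "modify" step              */
        buf[i] *= 2.0;

    MPI_File_write_at_all(fh, offset, buf, count, MPI_DOUBLE,
                          MPI_STATUS_IGNORE);
}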

A re-read access pattern is very simple. All or part of a data file is retrieved and then later, it is retrieved again. Only the second read is timed in the benchmark since file systems can take advantage of caching to speed up this operation.

A re-write is similar to the re-read pattern except it involves a write operation, not a read. This access pattern allows the write-behind caching of a file system to be tested.

Summary

As part of his research, Mr. Shorter has written a useful benchmark suite for testing parallel file system performance. This benchmark is perhaps the most robust of all the parallel file system benchmark codes I have encountered. Here is your homework: how many different combinations of spatial and temporal access patterns can the benchmark test?


Sidebar One: Links Mentioned in Column

PARL

pio-bench

Frank Shorter's Thesis, "Design and Analysis of a Performance Evaluation Standard for Parallel File Systems"


This article was originally published in ClusterWorld Magazine. It has been updated and formatted for the web. If you want to read more about HPC clusters and Linux, you may wish to visit Linux Magazine.

Dr. Jeff Layton hopes to someday have a 20 TB file system in his home computer. He can sometimes be found lounging at a nearby Fry's, dreaming of hardware and drinking coffee (but never during working hours).