Benchmarking Parallel File Systems

Article Index

mpi-tile-io

This benchmark was written by Rob Ross at Argonne National Laboratory. Its goal is to test an MPI-IO implementation and the underlying file system, usually a parallel one, under a noncontiguous access workload.

The benchmark simulates the workload of a large visualization code where a single large scene is broken up across multiple monitors, such as in a video wall. The scene is broken into a grid, and each tile of the grid is assigned to a process. The data for each process is accessed contiguously in a sequential manner, and there can be some overlap in the data between tiles. This access pattern also matches some other classes of codes.
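To make the tiled access pattern concrete, here is a small Python sketch (not part of the benchmark itself, and the helper name is hypothetical) that computes where one process's tile lands in a row-major scene file, ignoring overlap. Each row of a tile is contiguous, but the rows of a single tile are scattered through the file, which is exactly the kind of noncontiguous access the benchmark exercises:

```python
def tile_row_offsets(rank, tiles_x, tile_w, tile_h, elem_size):
    """Byte offset of each row of the tile owned by `rank`, assuming
    the scene is stored row-major and tiles are assigned row-by-row."""
    tx, ty = rank % tiles_x, rank // tiles_x      # this rank's tile coordinates
    scene_row = tiles_x * tile_w * elem_size      # bytes per full scene row
    return [(ty * tile_h + r) * scene_row + tx * tile_w * elem_size
            for r in range(tile_h)]

# rank 3 in a 2x2 grid of 4x2 tiles with 8-byte elements:
print(tile_row_offsets(3, 2, 4, 2, 8))   # → [160, 224]
```

Note that the two rows of this one tile are 64 bytes apart in the file; in the real benchmark, MPI datatypes describe this scatter so the MPI-IO layer can service it in one call.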

The benchmark reports the min, mean, max, and variance of the times to open, close, and read the data. The application can also write the data out to a file for use in subsequent testing. As the name suggests, the benchmark makes heavy use of MPI-IO; to get good performance it also uses MPI datatypes for file access.
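The statistics it reports are the standard ones over the per-process times; a minimal sketch of that reduction (using the population variance) looks like this:

```python
def summarize(times):
    """min, mean, max, and population variance of per-process times."""
    n = len(times)
    mean = sum(times) / n
    var = sum((t - mean) ** 2 for t in times) / n
    return min(times), mean, max(times), var

print(summarize([1.0, 2.0, 3.0]))   # min=1.0, mean=2.0, max=3.0, var≈0.667
```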

The benchmark is also smart enough to check the number of available processes against the number of tiles in the grid. It will throw out extra processes, removing them from the collective operations, and it will return an error message if there are fewer processes than tiles requested.
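The logic of that check can be sketched as follows; this is a hypothetical helper mirroring the behavior described above, not the benchmark's actual code:

```python
def select_active_ranks(nprocs, nr_tiles):
    """Return the ranks that participate in the collective I/O.
    Extra ranks (>= nr_tiles) are dropped; too few ranks is an error."""
    if nprocs < nr_tiles:
        raise ValueError(f"need {nr_tiles} processes, only {nprocs} available")
    return list(range(nr_tiles))

print(select_active_ranks(6, 4))   # → [0, 1, 2, 3]; ranks 4 and 5 sit out
```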

NPB IO

The NAS Parallel Benchmarks (NPB) are probably the best-known public parallel benchmark codes. As the codes were tested on new systems and developed for new architectures, the problem sizes were increased, and it was realized that the I/O portion of the codes was becoming increasingly important. Some attention was therefore focused on developing an MPI benchmark to test the I/O capability of file systems, particularly parallel file systems.

The initial code for testing I/O performance was provided by the Block-Tridiagonal (BT) NPB problem. The MPI implementation employs a fairly complex domain decomposition called diagonal multi-partitioning. The first implementation was called BTIO and used MPI-IO for file access.

The BTIO code was then formalized into a benchmark specification and released under the name NPB-MPI 2.4 I/O. The code is basically a CFD-like solver (a time-stepping code that solves a set of nonlinear equations). After every five time steps, the entire solution field is written to one or more files. After all time steps are finished, all the data belonging to a single time step must be stored in the same file and may be sorted. Any rearrangement of the data due to the file system or file system access is taken into account in the final timings. The benchmark has also been designed to minimize the possibility that output data resides in system buffers by postponing a re-read of the data until the end of the program.
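A toy version of that pattern, in Python rather than the benchmark's Fortran, illustrates the two key features: a full dump of the solution field every five steps, and a verification read that is deliberately postponed until after the run (the function and trivial "solver" step are illustrative assumptions, not BTIO's actual code):

```python
import os
import struct
import tempfile

def run_btio_like(nsteps, field_len, path):
    """Dump the whole field every 5 steps; re-read only after the run."""
    field = [0.0] * field_len
    with open(path, "wb") as f:
        for step in range(1, nsteps + 1):
            field = [x + 1.0 for x in field]      # stand-in for a solver step
            if step % 5 == 0:
                f.write(struct.pack(f"{field_len}d", *field))
    with open(path, "rb") as f:                   # postponed verification read
        data = f.read()
    return len(data) // (8 * field_len)           # number of dumped time steps

path = os.path.join(tempfile.mkdtemp(), "btio.dat")
print(run_btio_like(20, 16, path))   # → 4 dumps (steps 5, 10, 15, 20)
```

Postponing the re-read this long makes it likely the data has left the OS page cache, so the verification actually touches the file system rather than memory.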

The benchmark includes a number of options for writing the data. These options allow the benchmark to write the data in various ways so you can simulate various I/O patterns for your code. However, the memory access and file access patterns are in general fixed; consequently, the benchmark cannot be adjusted for a specific memory or file access pattern.

Summary of Current Benchmarks

These parallel file system benchmarks provide some measure of I/O performance. However, they all have made decisions about the workload: whether it is small or large, how many times the workload is repeated, what API is used to perform the I/O, how the data is laid out in memory or on the file system (contiguously or noncontiguously), and how the I/O is timed. While timing may seem somewhat trivial, it is very important on clusters, since there is a propagation delay between nodes and each node has an independent clock.
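One common way around clock skew, used by many MPI benchmarks, is to have every rank time the operation with its own clock and then reduce the per-rank elapsed times with a max (what an MPI code would do with `MPI_Wtime` and `MPI_Allreduce` with `MPI_MAX`). A pure-Python sketch of that convention:

```python
def job_elapsed(per_rank_start, per_rank_end):
    """Each rank times with its OWN clock, so cross-node clock offsets
    cancel; the slowest rank then defines the job's elapsed time."""
    return max(e - s for s, e in zip(per_rank_start, per_rank_end))

# two ranks whose clocks disagree by ~100 s; the offset does not matter:
print(job_elapsed([0.0, 100.0], [2.0, 103.5]))   # → 3.5
```

Note this only works because start and end come from the same clock on each rank; subtracting timestamps taken on different nodes would fold the clock offset directly into the result.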

In future columns I'll examine some new parallel file system benchmarks that can possibly mitigate some of the limitations of these benchmarks.


Sidebar One: Links Mentioned in Column

Bonnie++

Postmark

IOZone

b_eff_io

IOR

mpi-tile-io

NPB IO


Dr. Jeff Layton hopes to someday have a 20 TB file system in his home computer. He lives in the Atlanta area and can sometimes be found lounging at the nearby Fry's, dreaming of hardware and drinking coffee (but never during working hours).


    Creative Commons License
    ©2005-2012 Copyright Seagrove LLC, Some rights reserved. Except where otherwise noted, this site is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 2.5 License. The Cluster Monkey Logo and Monkey Character are Trademarks of Seagrove LLC.