Using Strace to Examine Your IO

MPI Example

What do you do when you strace an MPI code? Ideally you want a file for each MPI process. How do you use strace in conjunction with mpirun or mpiexec to produce one file per MPI process? One technique I use is to write a main script for the mpirun or mpiexec command. This main script then calls a second script that actually runs the code; it is in this second script that I put the strace command and the binary. There are a few gotchas that I'll point out along the way. Readers may also be interested in two debugging MPI articles by fellow Cluster Monkey Jeff Squyres: MPI: Debugging -- Can You Hear Me Now? and MPI: Debugging in Parallel (in Parallel).

Let's start with a simple example from the MPI-2 book by Bill Gropp, et al. In Chapter 2 the authors present a simple example of an MPI code in which each of N processes writes data to an individual file (this is usually referred to as N-N IO). I modified the code to write more data than originally presented.

/* example of parallel Unix write into separate files */
#include "mpi.h"
#include <stdio.h>
#define BUFSIZE 100000
int main(int argc, char *argv[])
{
    int i, myrank, buf[BUFSIZE];
    char filename[128];
    FILE *myfile;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    for (i=0; i < BUFSIZE; i++)
        buf[i] = myrank * BUFSIZE + i;
    sprintf(filename, "testfile.%d", myrank);
    myfile = fopen(filename, "w");
    fwrite(buf, sizeof(int), BUFSIZE, myfile);
    fclose(myfile);
    MPI_Finalize();
    return 0;
}
Being the versatile cluster geek that I am, I re-wrote the code in Fortran for us older folks.
   INCLUDE 'mpif.h'
   INTEGER :: I, MYRANK, IERROR, BUFSIZE
   REAL :: BUF(100000)
   CHARACTER :: RANK1*1, RANK2*2, FILENAME*10
! -------------------------------------------------------
   BUFSIZE = 100000
   IERROR = 0
   CALL MPI_INIT(IERROR)
   CALL MPI_COMM_RANK(MPI_COMM_WORLD, MYRANK, IERROR)
   DO I = 1, BUFSIZE
      BUF(I) = 2.5 * BUFSIZE + I
   ENDDO
   IF (MYRANK < 10) THEN
      WRITE(RANK1,10) MYRANK
10    FORMAT(I1)
      WRITE(6,11) MYRANK
11    FORMAT("MYRANK:",I1)
      FILENAME = "testfile"//RANK1
   ELSE
      WRITE(RANK2,20) MYRANK
20    FORMAT(I2)
      WRITE(6,21) MYRANK
21    FORMAT("MYRANK:",I2)
      FILENAME = "testfile"//RANK2
   ENDIF
   OPEN(UNIT=8, FILE=FILENAME, FORM='UNFORMATTED')
   WRITE(8) BUF
   CLOSE(8)
   CALL MPI_FINALIZE(IERROR)
   END
Let's spend a little bit of time writing the scripts we need to run the code and get the strace output. Don't worry if you don't know bash scripting. I'm not an expert by any stretch, and I usually have to ask friends for help. But the scripts are simple, and I will show you the couple of bits of specialized knowledge you need.

I start with a main script that contains all of the setup for the code as well as the command to run the MPI code. For this example, I used MPICH2 for the MPI layer, g95 for the Fortran90 compiler, and gcc for the C compiler. I won't cover all of the details of how to use MPICH2 since the MPICH2 website covers everything much better than I could. Below is the main script I use.


#!/bin/bash

mpiexec -machinefile ./MACHINEFILE -np 4 /home/laytonj/TESTING/code1.sh
The first line just says to use the bash shell to run the script. The last line is the command to run the MPI code; in this case, it's mpiexec. Notice that what mpiexec actually runs is not the binary itself but another script, code1.sh.

Before I talk about the second script, I want to mention that this setup is fairly easy to adapt to a job scheduler such as SGE, Torque, PBS-Pro, or LSF. I don't have the space to talk about how to write job scripts for these schedulers, but it's fairly straightforward and there is documentation on the web. If you get stuck, you can always ask on the Beowulf mailing list.
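As one small illustration of such an adaptation (a sketch only -- the exact variable names depend on your scheduler, so treat them as assumptions), the hard-coded process count in the mpiexec line can be taken from the scheduler's environment instead:

```shell
#!/bin/bash
# SGE exports NSLOTS with the number of slots granted to the
# job; under Torque/PBS you could use $(wc -l < $PBS_NODEFILE)
# instead (check your scheduler's documentation).
NP=${NSLOTS:-4}   # fall back to 4 when run outside a scheduler
echo "would run: mpiexec -np $NP ..."
```

This way the same script works for any job size without editing.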

Let's take a look at the meat of the process, the code1.sh script:


#!/bin/bash

/usr/bin/strace -tt -o /tmp/strace.out.$$ /home/laytonj/TESTING/code1 $@
Similar to the main script, this script starts by using the bash shell. The third line is the meat of the script. The first part of the line,
/usr/bin/strace -tt -o /tmp/strace.out.$$ 
starts the code in the same way that we started the serial example, by using the command strace. As before, I used the option -tt to get microsecond timing and -o to point to an output file. Here's where we need to think about how to write the script so that each MPI process writes to a separate output file.

This is the first bit of bash knowledge that we'll use in our scripts. In the script I have specified the strace output file as,

/tmp/strace.out.$$

So the output files will be located in the /tmp directory on each node used in the run. To keep the files separate, I have appended $$ to the file name. In bash scripts, this is a special variable that contains the PID (process ID) of the script. Since each MPI process launches its own copy of the script, each copy gets a different PID, and so we have a separate file name for each MPI process.
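As a quick aside (a toy sketch, not part of the run scripts), you can see $$ change from one shell invocation to the next, which is exactly what makes /tmp/strace.out.$$ unique per process:

```shell
# Each invocation of a script runs in its own shell, and $$
# expands to that shell's PID, so two copies of code1.sh
# started by mpiexec get two different strace file names.
pid_a=$(bash -c 'echo $$')
pid_b=$(bash -c 'echo $$')
echo "first run: $pid_a, second run: $pid_b"
```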

The last bit of bash knowledge we need is how to pass command line arguments to our code (if we need them). In this case, we use another predefined bash variable, $@. This expands to all of the arguments that were passed to the code1.sh script (arg1, arg2, ...), so they can be handed on to the code itself. To see how this works, let's walk through a simple example of passing command line arguments to the code through the main script.
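Here is a tiny demonstration of what $@ does (the /tmp/argdemo.sh name is made up for illustration; echo stands in for the strace-plus-binary line):

```shell
# Stand-in for code1.sh: a script that forwards all of its
# command line arguments ("$@") to the command it runs.
cat > /tmp/argdemo.sh <<'EOF'
#!/bin/bash
echo "args: $@"
EOF
chmod +x /tmp/argdemo.sh
/tmp/argdemo.sh -r -w -b 25m   # prints: args: -r -w -b 25m
```

Whatever arguments the script receives appear, unchanged and in order, on the command it launches.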

There is an IO benchmark called IOR that has a number of arguments you can pass to the code that describe the details of how to run the benchmark. Here's an example,

IOR -r -w -a MPIIO -b 25m -N 4 -s 25 -t 10m -v -o 
Don't worry about what all of the options mean, but let me point out one because it can be important for a job scheduler script. The option -N 4 tells the code to use 4 MPI processes; you can change the value of 4 to correspond to what the scheduler defines. Now how do we pass these arguments to the script that actually runs the code?

Sticking with the IOR example, the main script looks like,


#!/bin/bash

mpiexec -machinefile ./MACHINEFILE -np 4 /home/laytonj/TESTING/code1.sh \
  -r -w -a MPIIO -b 25m -N 4 -s 25 -t 10m -v -o 
Notice how I've taken the command line arguments and put them in the main script. With the $@ predefined bash variable in the code script, the options are passed through to the code. The code script doesn't change at all (except for the name of the binary).

/usr/bin/strace -tt -o /tmp/strace.out.$$ /home/laytonj/TESTING/IOR $@
The only thing that changed was the name of the binary, from code1 to IOR. So if you want to change the arguments to a code, you only have to modify the main script. If your code doesn't take any command line arguments, I recommend leaving $@ in the code script anyway for future use.

If you look a little closer at the example scripts for running IOR, it is interesting to note the differences compared to running it without strace. Normally, we would have a single script to run IOR where the command consists of three parts. In order they are,

  1. mpirun command (command to start MPI code)
  2. binary of code to run
  3. arguments to code
If we want to run strace against the binary, then we make a single change to the main script. We change the name of the binary to a script. Then in this new script we put the strace command and the name of the actual binary that is to be run. So it's fairly simple to take a current script or scheduler job script and modify it to allow for strace.

When the job is finished, you have to go to each node used in the run and copy the files from /tmp back to a file system that is more permanent than /tmp. You could instead write all of the strace output files to a central file system, but then you run the risk that two processes on different nodes get the same PID, so one output file would overwrite another. The chances of this are fairly small, but I don't like to take that chance.
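Gathering those files can itself be scripted. Here is a minimal sketch, assuming password-less scp and a machine file with one hostname per line (the /tmp/machines file and the node names are stand-ins, not from the run above); the copy is printed as a dry run, so remove the echo to actually transfer the files:

```shell
#!/bin/bash
# Pull the strace output files from every node used in the run
# back into one results directory. sort -u collapses duplicate
# entries when a node appears more than once in the machine file.
printf 'node1\nnode2\nnode1\n' > /tmp/machines   # stand-in machine file
mkdir -p /tmp/strace_results
for node in $(sort -u /tmp/machines); do
    echo scp "$node:/tmp/strace.out.*" /tmp/strace_results/
done
```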

Now that we know how to run our MPI jobs using strace, let's look through a simple example. I'm running the code that I presented earlier. I'm going to run with 4 MPI processes for this article. After I run the code I get four strace.out files.

The PIDs are numbered sequentially because I ran all 4 MPI processes on the same machine. Let's look at one of the strace output files.

If you look at the strace file, you will notice that it is much longer than for the serial case we ran. The reason is that now we're running an MPI code, so many of the extra function calls come from MPI doing its thing in the background (i.e., behind our code). The first strace output file is listed in Sidebar One at the end of this article. I've extracted a few of the important lines from the output and put them below.

15:12:54.920557 access("testfile1", F_OK) = -1 ENOENT (No such file or directory)
15:12:54.920631 access(".", R_OK)       = 0
15:12:54.920687 access(".", W_OK)       = 0
15:12:54.920748 stat64("testfile1", 0xbfa56800) = -1 ENOENT (No such file or directory)
15:12:54.920816 open("testfile1", O_RDWR|O_CREAT|O_LARGEFILE, 0666) = 7
15:12:54.943471 write(7, "\200\32\6\0@$tH\200$tH\300$tH\0%tH@%tH\200%tH\300%tH"..., 400008) = 400008
15:12:54.945790 ftruncate64(7, 400008)  = 0
15:12:54.945888 _llseek(7, 0, [400008], SEEK_END) = 0
15:12:54.945954 ftruncate64(7, 400008)  = 0
15:12:54.946010 _llseek(7, 0, [400008], SEEK_END) = 0

If you compare these lines to the ones from the serial code, you can see that they are very similar. Setting aside the extra "junk" in the output, let's look at the IO performance.

The write function call writes the same amount of data, 400,008 bytes. The amount of time to write the data is,

54.945790 - 54.943471 = 0.002319 seconds (2319 microseconds).
So the IO rate of the write function is,
400,008 bytes / 0.002319 secs. = 1.7249x10^8 bytes/second
This works out to be 172.49 MB/s, a bit faster than the serial code, but again, I suspect there are some caching effects.
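The same arithmetic can be scripted, which becomes handy once there are many write calls to process. A minimal sketch with awk, using the timestamps and byte count from the strace lines above:

```shell
# Compute write bandwidth from two strace -tt timestamps
# (seconds within the minute) and the byte count that the
# write() call reported.
awk 'BEGIN {
    t1 = 54.943471              # timestamp of the write() call
    t2 = 54.945790              # timestamp of the following call
    bytes = 400008              # bytes reported by write()
    printf "%.2f MB/s\n", bytes / (t2 - t1) / 1.0e6
}'
# prints: 172.49 MB/s
```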

I won't examine the other three strace.out.* files since it's fairly straightforward to compute the write performance for each of them. But note that we've only computed the IO performance for a single write call. Imagine a code with many write and read calls; then you have to perform this computation for every one of them.



    ©2005-2012 Copyright Seagrove LLC, Some rights reserved. Except where otherwise noted, this site is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 2.5 License. The Cluster Monkey Logo and Monkey Character are Trademarks of Seagrove LLC.