MPI: Debugging -- Can You Hear Me Now?

In the unlikely event that your code has a bug or two

It's been said that if debugging is the process of removing bugs from a program, then programming must be the process of putting them in. Although I don't personally write bugs (ahem), I understand from others that they can be quite difficult to track down and fix. My personal recommendation is to avoid writing them in the first place; steer clear of popular assembler instructions such as "BFM" (branch on full moon), "WTR" (write to random), and "MM" (more magic).

The Story So Far

Over the course of past columns, we have talked about all kinds of aspects of MPI - ranging from what MPI is all the way through advanced communication patterns and techniques. Presumably, loyal readers who have been following this column are now among the World's Greatest MPI Hackers (and should feel free to add "Member, WGMH" to resumes and vitae).

But now that you're a hotshot MPI programmer, the reality of writing code sets in: debugging. Debugging any application can be a difficult and arduous task; multiple orthogonal factors combine to create unique - and sometimes elusive - unforeseen circumstances. Assumptions in code can turn out to be false, leading to system crashes, corrupted output, or, even worse, subtly incorrect answers. Parallel applications add several more dimensions to the mix - bugs can be the result of complex interactions between the individual processes in a parallel job. Typical parallel bugs manifest themselves as race conditions, unexpected or un-handled messages, and the ever-popular deadlock and livelock scenarios.

While many of these errors can also occur in serial applications, the fact that these scenarios can simultaneously occur in one or more processes of a parallel job dramatically increases the difficulty of isolating exactly what the bug is. Even after the bug is identified, understanding the events that cause it to occur can be equally as difficult since it may be the direct (or indirect) result of interactions between multiple [semi-]independent processes. Hence, examining the state of the single process where the bug occurred (perhaps through a core dump file) may be neither sufficient to understand why it happened, nor how to fix it.

In short, parallel bugs span multiple processes. This condition is, unfortunately, simply the nature of parallel computing - it is not specific to MPI applications. Multi-process bug dependency can create a multiplicative effect both in terms of system complexity and difficulty in tracking down even simple problems.

Sidebar: Why Bother With Parallelism?

Given the potentially significant increase in difficulty of debugging parallel applications (as compared to debugging serial applications), why bother writing in parallel? What would justify the added time, resources, and expense required to obtain a correctly-functioning parallel application?

The rationale behind parallel computing is that it enables two main things:

  • Applications that are too large to fit on a single machine
  • Decreased execution time as compared to serial applications

Oil companies, for example, gather vast datasets from surveying equipment to determine where to drill. Such datasets can reach into the terabyte realm - far too large to fit in the RAM of a single node at a time. A serial application would have to iteratively request subsets of the data, store partial results, and later combine them into a final result. By definition, this process is (at least) linear. Extending this scenario to allow multiple copies of the processing application to simultaneously request and process subsets of the data can result in a significant decrease in execution time. Hence, what used to take weeks to run can now be accomplished in hours.

Such gains can also be translated into increasing the resolution or accuracy of the results. For example, consider a serial process that takes 1,000 hours (41 days). Say that running this process in parallel on a large cluster takes 10 hours to get the same results. With so much speedup, it seems natural to increase the resolution of the computation. Perhaps it will take 20, 40, or 80 hours to obtain the finer-grained results, but it is still significantly less than 41 days.

Bottom line: although some up-front investment is required, parallel solutions can result in better results in less time - a competitive advantage that directly impacts the overall cost of a project.

All that being said - fear not! Parallel applications, just like serial applications, only do exactly what they are told to do. They are discrete creatures that, even though they seem to be devious and malicious (particularly to the programmer who is debugging them), are bound by finite rules and operating procedures. For every bug, there is a reason. For every reason, there is a bug fix.

This month we'll examine some of the more popular and some of the more effective techniques of parallel debugging (note that "popular" is not always the same as "effective"!).

printf Debugging

Using printf is perhaps one of the most common forms of debugging. Want to know what this variable is at that point in the program? Put in a printf! Want to check and see if that conditional was taken? Put in another printf!

Variations on this theme include sending printf-like output to files for postmortem analysis, selectively enabling and disabling specific classes of output, and disabling all debugging output at compile time. The end result is inevitably the same - an ever-growing set of output messages that must be sorted through in the hopes that one or more of the messages will reveal the exact location and/or conditions where bugs occur.

printf-style debugging can be even less effective in parallel because of the multiplicative effect: every process will print messages, resulting in potentially N times as much output to sort through. Worse, since standard output is not always guaranteed to be in order by process (or may even be delayed until MPI_FINALIZE), output may be interleaved from multiple sources.

All of these problems can be addressed with workarounds (e.g., labeling each output message with the process' rank in MPI_COMM_WORLD, sending the output of each process to a different file, etc.), but the same fundamental limitation remains: printf-style solutions can only display a limited subset of the process' state. If the programmer wants to display something else, the application must be edited, re-compiled, and re-run.
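The rank-labeling and compile-time-disabling workarounds can be sketched together in plain C. The DBG macro and dbg_format helper below are my own invention, not part of MPI or this column's code; in a real application the rank argument would come from MPI_Comm_rank, but the sketch avoids MPI calls so the idea is visible on its own:

```c
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: format a message tagged with the caller's
   MPI_COMM_WORLD rank into buf.  Returns the number of characters
   written (as snprintf does). */
static int dbg_format(char *buf, size_t len, int rank,
                      const char *fmt, ...)
{
    va_list ap;
    int n = snprintf(buf, len, "[rank %d] ", rank);

    va_start(ap, fmt);
    n += vsnprintf(buf + n, len - n, fmt, ap);
    va_end(ap);
    return n;
}

/* Hypothetical DBG macro: prints a rank-tagged message, and
   compiles away entirely unless the application is built with
   -DDEBUG -- so debugging output can be disabled at compile time
   without editing every call site. */
#ifdef DEBUG
#define DBG(rank, ...)                                            \
    do {                                                          \
        char dbg_buf_[256];                                       \
        dbg_format(dbg_buf_, sizeof dbg_buf_, (rank), __VA_ARGS__); \
        fputs(dbg_buf_, stderr);                                  \
        fflush(stderr); /* flush so crashes don't eat messages */ \
    } while (0)
#else
#define DBG(rank, ...) do { } while (0)
#endif
```

A call such as DBG(rank, "iteration %d complete\n", iter) then produces output like "[rank 3] iteration 42 complete" when compiled with -DDEBUG, and costs nothing otherwise.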

Even worse, inserting printf (and friends) can change an application such that the bug no longer occurs, or its symptoms change. This situation can be quite common in MPI applications, especially with bugs that are due to race conditions. For example, a printf can slow one process down just enough that the bug mysteriously disappears (remember that sending to the standard output is a relatively time-consuming task).

Using Serial Debuggers in Parallel

Remember what your Introduction to Programming instructor told you: use a debugger! Debuggers allow you to single-step through your running application, examining just about anything in the process. Even a serial debugger can be useful in parallel. Most vendor-provided debuggers, as well as the well-known GNU debugger (gdb) and its GUI counterpart the Data Display Debugger (ddd), can be used with parallel applications in one of two ways.

Sidebar: 1-800-DBUG-HPC

Debugging is a frustrating, maddening task. Run the application. Watch the crash. Examine the core dump. Make a fix. Compile the application. Run, watch, examine, fix, compile. Repeat. Repeat. Repeat...

Here are a few of my favorite mantras that I like to repeat while working out particularly nasty bugs:

  • A computer is a labor-saving device
  • Asynchronous behavior is my friend
  • Computers do not exhibit malice
  • MPI means Misbehaving Program Interface
  • I love my job
  • I will bend the computer to my will

The first method is to initially launch the MPI processes under the serial debugger (if the MPI implementation supports it). For example, the following works in LAM/MPI and Open MPI:

$ mpirun -np 2 xterm -e gdb my_mpi_application

This command will launch two xterms, each of which will launch a copy of gdb and load my_mpi_application (note that this example assumes that you have proper X authentication between the nodes that you are running on; see the LAM/MPI FAQ for more information). You can then individually control each MPI process. This method is quite helpful, but only for small to mid-sized parallel runs; attempting to use 32 gdbs in 32 xterms can be quite difficult to manage.

This concept can be extended (again, if the MPI implementation supports it) by mpirun'ing a script instead of an xterm:

$ mpirun -np 32 my_debug.sh my_mpi_application

where my_debug.sh launches a debugger for only a subset of the processes, not all of them. Hence, although the MPI job consists of 32 processes, my_debug.sh can be written to launch a debugger on only MPI_COMM_WORLD ranks 0, 1, and 2.
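A minimal sketch of such a script might look like the following. Note that everything here is an assumption to be checked against your MPI implementation: the name my_debug.sh comes from the command line above, and the environment variable holding the process's rank is implementation-specific (Open MPI exports OMPI_COMM_WORLD_RANK; LAM/MPI used LAMRANK):

```shell
#!/bin/sh
# Hypothetical my_debug.sh: run only a few ranks under a debugger.
# Rank env variable names are implementation-specific -- verify
# against your MPI implementation's documentation.  Falls back to
# -1 if neither variable is set.
rank="${OMPI_COMM_WORLD_RANK:-${LAMRANK:--1}}"

case "$rank" in
    0|1|2)
        # Debug the first three ranks, one xterm + gdb each
        exec xterm -e gdb "$@"
        ;;
esac

# All other ranks run the application unmodified
exec "$@"
```

The same pattern extends naturally to other policies, e.g., debugging only ranks listed in an environment variable of your own choosing.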

Note, however, that this method will only work for a single run - once the processes complete, you will need to exit gdb and re-mpirun the MPI job. You cannot re-run the application from within gdb.

The second method of using a serial debugger is to attach to an already-running MPI process. This method is more portable than the first because debuggers can always attach to processes; it does not depend on the capabilities of the MPI implementation. A common scenario where this is helpful is in a deadlock scenario - if your MPI application "hangs" for no apparent reason, you can attach a serial debugger to one (or more) of the MPI processes and simply see what it is doing.

Another typical example is where one MPI process repeatedly crashes (e.g., rank 3 in MPI_COMM_WORLD). Use code similar to the following:

Listing 1: Stalling MPI_COMM_WORLD rank 3 so a debugger can attach

#include <stdio.h>     /* printf, fflush */
#include <unistd.h>    /* getpid, sleep */
#include <mpi.h>

int rank;
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
if (rank == 3) {
    /* volatile: otherwise the compiler may assume i never
       changes, and the debugger cannot break out of the loop */
    volatile int i = 0;
    printf("PID %d waiting\n", (int) getpid());
    fflush(stdout);
    while (i == 0)
        sleep(10);
}

This code causes MPI_COMM_WORLD rank 3 to loop forever. However, it prints out its PID, enabling you to attach a debugger to it, change the value of the variable i, and continue single-stepping through the process to examine the events leading up to the crash.
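The attach session itself looks something like the following (the PID is whatever the waiting process printed; 12345 here is purely illustrative). The gdb commands shown - attaching with -p, set variable, next, and continue - are standard, though the exact output varies by platform:

```
$ gdb my_mpi_application -p 12345
(gdb) set variable i = 1
(gdb) next                 # step out of the sleep loop
(gdb) continue             # or keep single-stepping toward the crash
```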

Admittedly, this is not an elegant approach, but it is quite practical.

    ©2005-2012 Copyright Seagrove LLC, Some rights reserved. Except where otherwise noted, this site is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 2.5 License. The Cluster Monkey Logo and Monkey Character are Trademarks of Seagrove LLC.