MPI: Zen and the Art of MPI Collectives

Article Index

When the grass grows in parallel, do they know about each other? Do they coordinate? Perhaps they have some kind of collective intelligence? Do they use MPI?

In previous editions of this column, we've talked about the 6 basic functions of MPI, how MPI_INIT and MPI_FINALIZE actually work, and discussed in agonizing detail the differences between MPI ranks, MPI processes, and CPU processors. Armed with this knowledge, you can write large, sophisticated parallel programs. So what's next?

Collective communication is a next logical step - MPI's native ability to involve a group of MPI processes together in a single communication, possibly involving some intermediate computation.

MPI Collective Basic Concepts

Many parallel algorithms include the concept of a collective operation - an operation in which multiple processes participate in order to compute a result. A global sum is an easy example to discuss - each process contributes an integer that is summed in an atomic fashion and the final result is made available (perhaps just to a single "root" process, or perhaps made available to all participating processes). {mosgoogle right}

A brief recap: MPI defines all point-to-point communications in terms of "communicators." Communicators are fixed sets of ordered processes with a unique context. Communication that occurs on a communicator is guaranteed to not collide with communications occurring on other communicators.

MPI also defines collective communication in terms of communicators. All collective operations explicitly involve every process in a communicator. Specifically: a collective will not be complete until all processes in the communicator have participated. Due to the nature of some of MPI's pre-defined collective operations (see the sidebar "Will That Collective Block?"), this may or may not imply blocking behavior. There is one exception to this rule: MPI_BARRIER is guaranteed not to return until all processes in the communicator have entered the barrier.

There are two main kinds of collectives defined in MPI: rooted and non-rooted. "Rooted" operations have a single process acting as the explicit originator or receiver of data. For example, MPI_BCAST broadcasts a buffer from a root process to all other processes in the communicator; MPI_GATHER gathers buffers from each process in the communicator to a single, combined buffer in the root process. "Non-rooted" operations are those where there is either no explicit originator/receiver or all processes are sending/receiving data. MPI_BARRIER, for example, has no explicit senders/receivers, but MPI_ALLGATHER both performs a gather operation from all processes in the processor and makes the result available to all processes.

Barrier Synchronization

Sidebar: Why Not Use MPI_SEND and MPI_RECV?
An obvious question that arises is: why bother? Why not simple use a linear loop over MPI_SEND and MPI_RECV to effect the same kind of operations? In short: it's all about optimization. The MPI built-in collectives usually offer the following advantages:

  • Avoid MPI_SEND and MPI_RECV: An MPI implementation is able to optimize each collective operation in many ways that are not available to user applications. For example, on SMP nodes, collective operations may occur directly in shared memory and avoid the entire MPI_SEND / MPI_RECV protocol stack.
  • Multiple algorithms: There are typically many algorithms that can be used for implementing a collective operation (even when using MPI_SEND / MPI_RECV), each yielding different performance characteristics in a given run-time environment. Factors such as the size and configuration of the communicator as well as the size and shape of the data to be communicated may influence the specific algorithm that is used. Much research has been conducted in this area over the past 20 years; let the MPI implementors worry about it - not you.
  • Performance portability: Collective algorithms that work well on a cluster may or may not work well on "big iron" parallel machines. Using the collective operations in the native MPI implementation usually means that you'll get algorithms that are tuned for the platform that your application is running on - one of the main goals of MPI.

The moral of the story: it is generally safer to trust your MPI implementation's collective algorithms than to implement your own. While no MPI implementation is perfect, most modern versions do a reasonable job of optimizing collective operations.

One of the simplest collective operations to describe is the barrier synchronization. MPI's function for this is MPI_BARRIER. It takes a single significant argument: a communicator.

One should note that, while the argument lists of the MPI C, C++, and Fortran bindings for a given function are typically similar in terms of "significant" arguments, there are some minor differences. One notable difference is that all MPI Fortran calls take a final "ierr" argument that the C and C++ bindings do not. The "ierr" argument is used for passing errors back to the caller (errors are handled differently in the C and C++ bindings).

As described above, MPI_BARRIER does not return until all processes in the communicator have entered the barrier. The seemingly-simple barrier operation is a good example illustrating that a variety of different algorithms that can be used:

  • linear: the root receives from all processes followed by the root sending to all processes
  • logrithmic: a binomial tree gather to the root followed by a binomial tree scatter from the root
  • 2-level latency split algorithm: a local gather, global gather, global scatter, and finally a local scatter
  • N-level latency split algorithm: similar to the above, but for N levels, not 2
  • shared memory: each process increments a shared counter; when the counter equals the number of processes, exit the barrier

The barrier operation has been researched for years (particularly in the area of shared memory algorithms; the shared memory algorithm listed above will typically provide dismal performance); many other algorithms are possible; the above list just a few possibilities.

There's no good reason for a user application to include implementations for all of these algorithms; the MPI implementation should provide some form of an optimized barrier (which may be one or more of the above algorithms) so that the user application does not have to worry about such issues.

Be wary of over using MPI_BARRIER. It is frequently tempting to insert barriers for ease of control and simplicity of code. However, barriers are usually unnecessary when writing MPI programs - MPI's tag/communicator matching rules for point-to-point communicator and "fence" operation for one-sided operations (to be described in a later column) typically obviate the need for barriers. Indeed, a barrier that is executed in every iteration of a repetitive code can introduce artificial performance limitations.

    Search

    Login And Newsletter

    Create an account to access exclusive content, comment on articles, and receive our newsletters.

    Feedburner

    Share The Bananas


    Creative Commons License
    ©2005-2012 Copyright Seagrove LLC, Some rights reserved. Except where otherwise noted, this site is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 2.5 License. The Cluster Monkey Logo and Monkey Character are Trademarks of Seagrove LLC.