MPI: Is Your Application Spawnworthy?


I like March - a month of celebration. It's the time when everyone around the world unites in joy and praise. Songs, dancing, pot-luck dinners, t-shirts, etc. I'm speaking of Pi Day, of course - on 3/14. In case you missed it this year, don't forget Pi Approximation Day is fast approaching. In honor of this number, I suggest all HPC users should contribute to the Pi revelry by computing and reciting as much of Pi as possible. What better way to do this than to optimize approximate computations of Pi in parallel? Hint: the first three digits are in this paragraph.

The Story So Far

Last column we outlined the three models of dynamic processes in MPI:

  1. spawning new processes with MPI_COMM_SPAWN and MPI_COMM_SPAWN_MULTIPLE,
  2. client/server connections between existing MPI processes with MPI_COMM_ACCEPT and MPI_COMM_CONNECT (and the supporting functions MPI_OPEN_PORT, MPI_PUBLISH_NAME, MPI_LOOKUP_NAME, and MPI_CLOSE_PORT), and
  3. independently-established sockets between existing MPI processes with MPI_COMM_JOIN.

All of these models are synchronous, meaning that they block until the action is completed. With some strong caveats about scheduled environments (discussed last column), the SPAWN functions will likely complete more-or-less immediately (i.e., they will probably take about as much time as the MPI implementation's job-startup mechanism, such as mpirun). Hence, they will usually block for a short while, but complete in finite time. JOIN, while fundamentally asynchronous in nature, is likely to be used mainly in synchronous situations. Specifically, since a TCP socket must be established before invoking JOIN, the asynchronous aspects of connecting two previously-existing processes are handled elsewhere, and JOIN will likely be invoked right after the socket has been established. So JOIN is also likely to be used in finite, time-bounded situations.

ACCEPT and CONNECT, however, are different. They are fundamentally asynchronous both in nature and use. The "server" process blocks in ACCEPT until a "client" process calls a corresponding CONNECT. Since the client process is likely to be independent of the server, it is effectively random as to when the client will invoke CONNECT. This situation can leave the server blocking indefinitely, and is unsuitable for most single-threaded applications / MPI implementations.

Threads To The Rescue

ACCEPT works best when it can be left blocking in an independent thread. This thread can simply loop over MPI_COMM_ACCEPT, accepting client connections and then dispatching them to other parts of the server upon demand. This method is actually quite similar to how many client/server applications are implemented. The server process can continue other meaningful work and be interrupted with client requests only as necessary.

A side effect of this approach (and the MPI design) is that the ACCEPT cannot be interrupted or killed cleanly. In order to shut down the server process, a dummy connection must be made to the server's pending ACCEPT (probably originating from within the server process itself) that issues a command telling the accepting thread to break out of its ACCEPT loop and die. This trick is necessary because it is illegal for an ACCEPT to be pending when another thread in the server invokes MPI_FINALIZE.

Note that not all MPI implementations support ACCEPT / CONNECT (or MPI-2 dynamic processes in general) and multi-threaded MPI applications. The MPI implementation that I work on, Open MPI, does, and is the basis for the examples provided in this column.

Sidebar: MPI Connected

MPI formally defines the communication status between two processes - they are either "connected" or "disconnected" (MPI-2 section 5.5.4):

Two processes are connected if there is a communication path (direct or indirect) between them. More precisely:

  1. Two processes are connected if:
    1. they belong to the same communicator (inter- or intra-, including MPI_COMM_WORLD), or
    2. they have previously belonged to a communicator that was freed with MPI_COMM_FREE instead of MPI_COMM_DISCONNECT, or
    3. they both belong to the group of the same window or file handle.
  2. If A is connected to B and B to C, then A is connected to C.

Two processes are disconnected (also independent) if they are not connected.

As such, the state of being "connected" is transitive. This has implications for MPI_ABORT (used to abort MPI processes), run-time MPI exception handling, and MPI_FINALIZE (used to shut down an MPI process). MPI_ABORT and the MPI_ERRORS_ARE_FATAL error handler are allowed (but not required) to abort all connected processes. MPI_FINALIZE is collective across all connected processes. Hence, to ensure that processes do not unintentionally block in MPI_FINALIZE, it is a good idea for dynamic processes to DISCONNECT when communication between them is no longer required.


Once communication between dynamic processes is no longer required, the function MPI_COMM_DISCONNECT can be invoked to formally break the communication channels between the processes (see the "MPI Connected" sidebar). Connected processes impact each other in several ways; independent processes are unaffected by each other's run-time behavior (in terms of MPI semantics).

Hence, processes that are spawned are connected to their parents. Processes that establish communication via CONNECT and ACCEPT or JOIN are also connected.

To disconnect from another job, all groups referring to processes in that job must be freed. Groups spanning the two jobs may exist in communicators, file handles, or one-sided window handles (the latter two are not discussed in this column). Hence, it may be necessary to free multiple handles (communicators, files, windows) before processes become independent of each other.

Note that communicators must be released via MPI_COMM_DISCONNECT instead of MPI_COMM_FREE. There is a subtle but important difference: MPI says that MPI_COMM_FREE only marks the communicator for deallocation and is guaranteed to return immediately; any pending communication is allowed to continue (and potentially complete) in the background. MPI_COMM_DISCONNECT will not return until all pending communication on the communicator has completed. Hence, when DISCONNECT returns, the communicator has truly been destroyed.

