MPI: "Progress" Is the Opposite of "Congress"

Non-Blocking Communication

Non-blocking communication is where a lot of the bookkeeping complexity of an MPI implementation originates. The MPI "immediate" send functions are supposed to return immediately (e.g., MPI_ISEND). Most MPI implementations try to send the message immediately, but may not be able to do so. That is, the socket will be set to non-blocking mode and the implementation will invoke write() (or writev()). The operating system will write as many bytes as it can before it would block and then return.

It is up to the MPI implementation to track the fact that a given message has "claimed" the socket (i.e., no other traffic should be written down the socket until this message has been completely written), how many bytes have been written, how many are left to write, etc. Complexity arises when multiple outgoing messages are destined for the same peer - since they all have to be multiplexed across the same socket, the MPI must queue them up and progress them in order (potentially in a non-blocking manner - sending as many bytes of a envelope/message as possible each time through the progression engine).

These issues are multiplied when you consider that there are multiple sockets and messages that must be polled for progress simultaneously in order to provide some degree of fairness of message delivery to all peers. The MPI implementation typically invokes poll(2) or select(2) to see which sockets can be written. If any are available, it tries to progress the pending send queue for each ready file descriptor as far as possible until it would block.

Taking this into account, re-consider the above blocking example. What if other non-blocking sends are still waiting to be sent? Unless the MPI can handle out-of-order sending, the blocking send must progress them all until it can send its own message. So the simplistic pseudocode from above is likely only relevant once the blocking send is allowed to be sent (i.e., can "claim" the socket and all prior messages have been sent).

Receiver Logic

The receiver has non-trivial logic to implement as well. TCP allows for partial receives - a read() call may receive some bytes, but not all. Analogous to the non-blocking sending logic, the MPI implementation must maintain bookkeeping for which receiving envelopes / messages have "claimed" the socket, how many bytes have been received so far, how many are left to receive, etc. The same poll() or select() used for checking write progress can also be used to check for pending incoming data.

Once an envelope has been fully received, it must either be matched with a previously posted receive or placed in a special queue for "unexpected" messages. For eager sends, the body of the message must be received as well. If the message is expected, the body can be received directly into the target buffer. But if the message is unexpected, a temporary buffer must be allocated to receive the message. This temporary buffer is then associated with the envelope in the unexpected queue.

If the message body was not sent eagerly and was expected, the receiver must mark the matching receive as "in progress" and send an ACK back to the sender. Consider the following valid MPI code: {mosgoogle right}

Listing 2: Overtaking non-blocking MPI traffic
1 if (i_am_sender) {
2     MPI_Isend(long_msg, ..., peer, tag, comm, &req);
3     MPI_Send(short_msg, ..., peer, tag, comm);
4     MPI_Wait(&req, &status);
5 } else {
6     MPI_Recv(msg1, ..., peer, tag, comm);
7     MPI_Recv(msg2, ..., peer, tag, comm);
8 }

Now consider how the traffic will actually flow across the socket:

  1. Sender sends envelope for the long message
  2. Sender sends envelope + message for the short message
  3. Receiver receives envelope for the long message
  4. Receiver sends back ACK for long message
  5. Receiver receives envelope + message for the short message
  6. Sender receives ACK
  7. Sender sends envelope + message for the long message
  8. Receiver receives envelope + message for the long message

Sidebar: Rendezvous Protocols: Why Bother?

I am frequently asked why we bother using rendezvous protocols for sending messages - they're more complicated than eager sends (where the entire message is sent immediately) and, at a minimum, add one round-trip latency to the overall message transference time.

The answer is: because of limited resources. Particularly in operating system bypass networks (such as Myrinet and Infiniband), only "special" memory can be used to send or receive messages across the network. But only so much "special" memory is available - sending a single, enormous message may easily consume all of it and starve other message passing efforts.

Put even more simply - even for TCP networks - it is unattractive to send enormous messages without the receiver's permission. Consider sending a 32MB message eagerly. If the peer process has not posted a matching MPI receive, the MPI implementation will have to allocate a temporary 32MB buffer to receive it. When a matching receive is posted, the MPI application then has to copy the entire 32MB buffer and then de-allocate the temporary buffer.

If this situation is repeated multiple times (e.g., multiple senders send enormous messages to a single receiver), the receiver can easily run out of memory. For MPI implementations that cannot guarantee that applications will not perform this way (and most cannot), rendezvous protocols prevent resource exhaustion at the receiver.

This is typically called the "overtaking" problem - although the long message was sent first, because of the use of the rendezvous protocol, the short message payload actually arrives first. MPI's message ordering guarantee states that the receives will match the order in which the sends were issued. Put simply, msg1 must receive the long message and msg2 must receive the short message (regardless of what order they were actually sent across the underlying network).

The receiver must do appropriate bookkeeping to ensure that this condition is met. As a side effect of the sample code shown above, by the time the first MPI_RECV completes, the short message will have been received into a temporary buffer in the unexpected queue. So when the second MPI_RECV executes, all that is required is copying the message from the temporary buffer to the user's buffer.

Where to Go From Here?

A disclaimer on this column: the careful reader will notice that there were a lot of assumptions in the explanations given in this column. For each assumption listed above, there are real-world MPI implementations with different assumptions.

The number of assumptions provide some insight into why there are so many different MPI implementations: every system performs differently. Tweaking to receive the best possible performance is a compromise between the network behavior, operating system characteristics, management of finite resources, and behavior of the MPI application.

Next column, we'll explore the same MPI progression kinds of issues, but for non-sockets based kinds of networks (e.g., Myrinet, Inifiband). Such networks typically have lower latency and more tangible finite resources than TCP networks, so issues such as determining the short / eager message size become quite important.

Got any MPI questions you want answered? Wondering why one MPI does this and another does that? Send them to the MPI Monkey.

Sidebar: MPI Quiz

MPI has message ordering guarantees. But do you know what they are, exactly? Consider the following pseudocode, with an MPI_COMM_WORLD containing size processes:

Listing 3: MPI Quiz
1  if (rank == 0) {
2      for (i = 0; i < size - 1; ++i)
3          MPI_Recv(buffer[i], ..., MPI_ANY_SOURCE, tag, 
                    MPI_COMM_WORLD, MPI_STATUS_IGNORE);
4  } else {
5      MPI_Send(buffer, ..., 0, tag, MPI_COMM_WORLD);
6  }

In what order will the messages be received at MPI_COMM_WORLD rank 0? See the next edition of this column for the answer.


Resources
MPI Forum (MPI-1 and MPI-2 specifications documents) http://www.mpi-forum.org/
MPI - The Complete Reference: Volume 1, The MPI Core (2nd ed) (The MIT Press) By Marc Snir, Steve Otto, Steven Huss-Lederman, David Walker, and Jack Dongarra. ISBN 0-262-69215-5
MPI - The Complete Reference: Volume 2, The MPI Extensions (The MIT Press) By William Gropp, Steven Huss-Lederman, Andrew Lumsdaine, Ewing Lusk, Bill Nitzberg, William Saphir, and Marc Snir. ISBN 0-262-57123-4.
NCSA MPI tutorial http://webct.ncsa.uiuc.edu:8900/public/MPI/

This article was originally published in ClusterWorld Magazine. It has been updated and formatted for the web. If you want to read more about HPC clusters and Linux, you may wish to visit Linux Magazine.

Jeff Squyres is leading up Cisco's Open MPI efforts as part of the Server Virtualization Business Unit.

    Search

    Feedburner

    Login Form

    Share The Bananas


    Creative Commons License
    ©2005-2012 Copyright Seagrove LLC, Some rights reserved. Except where otherwise noted, this site is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 2.5 License. The Cluster Monkey Logo and Monkey Character are Trademarks of Seagrove LLC.