MPI: More Joys of Asynchronous Communication

To block or not to block, that is the question. Read the following while I work on my answer.

Last month we started discussing non-blocking communication (get it?). We covered the basic non-blocking (or immediate) send and receive functions - all of which start a communication - and touched on their various flavors. We also discussed the TEST and WAIT functions, and how they are used to complete communications.

Recall that previous articles have only covered standard communication (sometimes called "blocking" communication, even though the functions may not always block!): functions that will not return until MPI guarantees that the buffer can be [re-]used. Using non-blocking communications effectively allows the separation of communication initiation and completion, and allows for the possibility of communication and computation overlap.

This month, we'll talk more about non-blocking methods and benefits, and fuel the fire with some more examples about how and why they can be useful to your MPI application. And remember, latency is like a good speech; the shorter, the better.

Persistent Sends and Receives

Another form of non-blocking communication is MPI's persistent messages. Persistent communication offers a slight optimization to applications that repeatedly send or receive a buffer with the same message signature. In such cases, the use of persistent communication can reduce overall latency.

The rationale is to pass all the arguments (buffer, count, datatype, tag, source/destination, and communicator) and perform the setup required for the communication only once. Then, in each iteration of the application, simply say "go" on the previously set up operation and let the communication commence. For example:

Listing 1: Simple persistence
MPI_Status status;
MPI_Request req;
/* Set up the send once; no data is transferred yet */
MPI_Send_init(buf, count, dtype, dest, tag, comm, &req);
while (looping) {
    MPI_Start(&req);            /* start this iteration's send */
    do_work();
    MPI_Wait(&req, &status);    /* complete it; req becomes inactive */
}
MPI_Request_free(&req);         /* release the persistent request */

The MPI_SEND_INIT function creates a request and sets up the communication. Its signature is identical to MPI_ISEND (all the normal sending parameters and the address of an MPI_Request to fill). The MPI_START function actually starts the communication operation. The send is a non-blocking operation and therefore must be finished with a TEST or WAIT operation. Completing a persistent request does not destroy it; the request merely becomes inactive and can be started again. During the next iteration, there is therefore no need to invoke MPI_SEND_INIT again - we simply START and WAIT the request. After the loop has completed, it is proper to release the persistent request with MPI_REQUEST_FREE (MPI_CANCEL, by contrast, only attempts to cancel a pending operation and does not free the request). This call tells MPI that the application will not use that request again - it is safe to destroy and free all associated resources.

MPI_SEND_INIT is a standard mode persistent send; MPI_SSEND_INIT, MPI_BSEND_INIT, and MPI_RSEND_INIT are the synchronous, buffered, and ready mode persistent functions, respectively. MPI_RECV_INIT is the persistent receive. They all function similarly to MPI_SEND_INIT: use the INIT function to create the request, use the START function to initiate the communication, and finally use some flavor of TEST or WAIT to complete it. Also note that just like the TEST and WAIT functions, START has a variant that can operate on an array of requests: MPI_STARTALL.

Why Bother With Non-Blocking?

Invoking special functions and creating additional logic for splitting the initiation and completion of communications can be quite a hassle. Why bother?

As with parallel computing in general, the answer is rooted in optimization. For example, some networks are powered by communication co-processors - processors that are separate from the main CPU and can progress message passing events independently of the operating system and user's application. This design allows even single-threaded MPI implementations to effect at least some degree of asynchronous communication progress while the application is not executing inside the MPI library; the network itself can be given responsibility for some portion of MPI semantics.

Additionally, blocking functions only allow one communication to occur at a time: the process cannot start a second operation until the first call has returned. Non-blocking functions allow the application to initiate multiple communication operations, enabling the MPI implementation to progress them simultaneously. Consider the following code example:

Listing 2: Cascading linearity
while (looping) {
  if (i_have_a_left_neighbor)
    MPI_Recv(inbuf, count, dtype, left, tag, comm, &status);
  if (i_have_a_right_neighbor)
    MPI_Send(outbuf, count, dtype, right, tag, comm);
  do_other_work();
}

Assume that at least one process does not have a left neighbor, and consider how this code will run in parallel: every process will receive from its left and then send to its right. But notice that the above code uses blocking receives and sends, and that each process posts its receive first. As a direct result, this algorithm is actually serialized - it will execute in a domino-like fashion, because no process can send to its right until the message from its left neighbor has arrived.

Using non-blocking communication allows MPI to progress both communications simultaneously:

Listing 3: Non-blocking can avoid cascading linearity
1 while (looping) {
2   nreqs = 0;  /* requests posted this iteration; count is the message length */
3   if (i_have_a_left_neighbor)
4     MPI_Irecv(inbuf, count, dtype, left, tag, comm, &req[nreqs++]);
5   if (i_have_a_right_neighbor)
6     MPI_Isend(outbuf, count, dtype, right, tag, comm, &req[nreqs++]);
7   MPI_Waitall(nreqs, req, statuses);  /* statuses: array of 2 MPI_Status */
8   do_other_work();
9 }

Because both operations are started before the MPI_WAITALL on line 7, MPI can progress both communications simultaneously. Specifically, the send can proceed before the receive completes. This code will therefore operate in a truly parallel fashion and will avoid the domino effect. Note, however, that this particular code example has a subtle implication: the WAITALL will block until both communications are complete. Indeed, the astute reader will recognize that a clever use of MPI_SENDRECV could achieve the same result. Blocking on line 7 also means that there still may be some "dead" time while waiting for network communication to complete - time that could have been used for other work. This situation may be unavoidable in some applications, but others may have some work that can be performed while waiting for the communications to complete. For example:

Listing 4: Delayed MPI_WAITALL
1  while (looping) {
2    nreqs = 0;  /* requests posted this iteration; count is the message length */
3    if (i_have_a_left_neighbor)
4      MPI_Irecv(inbuf, count, dtype, left, tag, comm, &req[nreqs++]);
5    if (i_have_a_right_neighbor)
6      MPI_Isend(outbuf, count, dtype, right, tag, comm, &req[nreqs++]);
7    do_some_work();  /* must not touch inbuf or outbuf */
8    MPI_Waitall(nreqs, req, statuses);  /* statuses: array of 2 MPI_Status */
9    do_rest_of_work();
10 }

Note the addition of do_some_work() and do_rest_of_work() on lines 7 and 9, respectively. do_some_work() represents work that can be done before the communication completes. Hence, the application can even utilize the "dead" time while message passing is occurring in the background - an overlap of communication and computation. This method works best on networks and/or MPI implementations that allow for at least some degree of asynchronous progress, but can even benefit single-threaded, synchronous MPI implementations. Once the communication completes, do_rest_of_work() executes, and one assumes it is performing work that was dependent upon the received messages.

Note that since the same buffers and communication parameters are used every iteration, a further optimization could use the persistent mode. This improvement allows MPI to set up the communications once, and simply say "go" every iteration:

Listing 5: Adding persistent requests into the mix
int i, nreqs = 0;  /* number of persistent requests created */
if (i_have_a_left_neighbor)
  MPI_Recv_init(inbuf, count, dtype, left, tag, comm, &req[nreqs++]);
if (i_have_a_right_neighbor)
  MPI_Send_init(outbuf, count, dtype, right, tag, comm, &req[nreqs++]);
while (looping) {
  MPI_Startall(nreqs, req);
  do_some_work();
  MPI_Waitall(nreqs, req, statuses);  /* statuses: array of 2 MPI_Status */
  do_rest_of_work();
}
/* Release the persistent requests once the loop is done */
for (i = 0; i < nreqs; ++i)
  MPI_Request_free(&req[i]);

    Creative Commons License
    ©2005-2012 Copyright Seagrove LLC, Some rights reserved. Except where otherwise noted, this site is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 2.5 License. The Cluster Monkey Logo and Monkey Character are Trademarks of Seagrove LLC.