Skip to main content

High Performance Computing

Introduction to th...

Collective Co... [mpi]

High Performance Computing

Introduction to th...

Derived Data ... [mpi]

Non-blocking Communication
"Introduction to the Message Passing Interface" course by the Southampton RSG

"Introduction to the Message Passing Interface" course by the Southampton RSG

Creative Commons License

Non-blocking Communication

In the previous episodes, we learnt how to send messages between two ranks or collectively to multiple ranks. In both cases, we used blocking communication functions which meant our program wouldn’t progress until the communication had completed. It takes time and computing power to transfer data into buffers, to send that data around (over the network) and to receive the data into another rank. But for the most part, the CPU isn’t actually doing much at all during communication, when it could still be number crunching.

Why bother with non-blocking communication?

Non-blocking communication is communication which happens in the background. So we don’t have to let any CPU cycles go to waste! If MPI is dealing with the data transfer in the background, we can continue to use the CPU in the foreground and keep doing tasks whilst the communication completes. By overlapping computation with communication, we hide the latency/overhead of communication. This is critical for lots of HPC applications, especially when using lots of CPUs, because, as the number of CPUs increases, the overhead of communicating with them all also increases. If you use blocking synchronous sends, the time spent communicating data may become longer than the time spent creating data to send! All non-blocking communications are asynchronous, even when using synchronous sends, because the communication happens in the background, even though the communication cannot complete until the data is received.

So, how do I use non-blocking communication?

Just as with buffered, synchronous, ready and standard sends, MPI has to be programmed to use either blocking or non-blocking communication. For almost every blocking function, there is a non-blocking equivalent. They have the same name as their blocking counterpart, but prefixed with "I". The "I" stands for "immediate", indicating that the function returns immediately and does not block the program. The table below shows some examples of blocking functions and their non-blocking counterparts.
BlockingNon-blocking
MPI_Bsend()MPI_Ibsend()
MPI_Barrier()MPI_Ibarrier()
MPI_Reduce()MPI_Ireduce()
But, this isn't the complete picture. As we'll see later, we need to do some additional bookkeeping to be able to use non-blocking communications.
By effectively utilising non-blocking communication, we can develop applications that scale significantly better during intensive communication. However, this comes with the trade-off of both increased conceptual and code complexity. Since non-blocking communication doesn't keep control until the communication finishes, we don't actually know if a communication has finished unless we check; this is usually referred to as synchronisation, as we have to keep ranks in sync to ensure they have the correct data. So whilst our program continues to do other work, it also has to keep pinging to see if the communication has finished, to ensure ranks are synchronised. If we check too often, or don't have enough tasks to "fill in the gaps", then there is no advantage to using non-blocking communication, and we may replace communication overheads with time spent keeping ranks in sync! It is not always clear-cut or predictable if non-blocking communication will improve performance. For example, if one ranks depends on the data of another, and there are no tasks for it to do whilst it waits, that rank will wait around until the data is ready, as illustrated in the diagram below. This essentially makes that non-blocking communication a blocking communication. Therefore, unless our code is structured to take advantage of being able to overlap communication with computation, non-blocking communication adds complexity to our code for no gain.
Non-blocking communication with data dependency

Advantages and Disadvantages

What are the main advantages of using non-blocking communication, compared to blocking? What about any disadvantages?

Point-to-point communication

For each blocking communication function we've seen, a non-blocking variant exists. For example, if we take MPI_Send(), the non-blocking variant is MPI_Isend() which has the arguments:
int MPI_Isend( void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm, MPI_Request *request );
*buf:The data to be sent
count:The number of elements of data
datatype:The data types of the data
dest:The rank to send data to
tag:The communication tag
comm:The communicator
*request:The request handle, used to track the communication
The arguments are identical to MPI_Send(), other than the addition of the *request argument. This argument is known as a handle (because it "handles" a communication request) which is used to track the progress of a (non-blocking) communication.
When we use non-blocking communication, we have to follow it up with MPI_Wait() to synchronise the program and make sure *buf is ready to be re-used. This is incredibly important to do. Suppose we are sending an array of integers,
MPI_Isend(some_ints, 5, MPI_INT, 1, 0, MPI_COMM_WORLD, &request); some_ints[1] = 5; /* !!! don't do this !!! */
Modifying some_ints before the send has completed is undefined behaviour, and can result in breaking results! For example, if MPI_Isend() decides to use its buffered mode then modifying some_ints before it's finished being copied to the send buffer means the wrong data is sent. Every non-blocking communication has to have a corresponding MPI_Wait(), to wait and synchronise the program to ensure that the data being sent is ready to be modified again. MPI_Wait() is a blocking function which will only return when a communication has finished.
int MPI_Wait( MPI_Request *request, MPI_Status *status );
*request:The request handle for the communication
*status:The status handle for the communication
Once we have used MPI_Wait() and the communication has finished, we can safely modify some_ints again. To receive the data send using a non-blocking send, we can use either the blocking MPI_Recv() or it's non-blocking variant.
int MPI_Irecv( void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Request *request, );
*buf:A buffer to receive data into
count:The number of elements of data to receive
datatype:The data type of the data
source:The rank to receive data from
tag:The communication tag
comm:The communicator
*request:The request handle for the receive

True or False?

Is the following statement true or false? Non-blocking communication guarantees immediate completion of data transfer.
In the example below, an array of integers (some_ints) is sent from rank 0 to rank 1 using non-blocking communication.
MPI_Status status; MPI_Request request; int recv_ints[5]; int some_ints[5] = { 1, 2, 3, 4, 5 }; if (my_rank == 0) { MPI_Isend(some_ints, 5, MPI_INT, 1, 0, MPI_COMM_WORLD, &request); MPI_Wait(&request, &status); some_ints[1] = 42; // After MPI_Wait(), some_ints has been sent and can be modified again } else { MPI_Irecv(recv_ints, 5, MPI_INT, 0, 0, MPI_COMM_WORLD, &request); MPI_Wait(&request, &status); int data_i_wanted = recv_ints[2]; // recv_ints isn't guaranteed to have the correct data until after MPI_Wait() }
The code above is functionally identical to blocking communication, because of MPI_Wait() is blocking. The program will not continue until MPI_Wait() returns. Since there is no additional work between the communication call and blocking wait, this is a poor example of how non-blocking communication should be used. It doesn't take advantage of the asynchronous nature of non-blocking communication at all. To really make use of non-blocking communication, we need to interleave computation (or any busy work we need to do) with communication, such as in the next example.
MPI_Status status; MPI_Request request; if (my_rank == 0) { /* This send important_data without being blocked and move into the next work */ MPI_Isend(important_data, 16, MPI_INT, 1, 0, MPI_COMM_WORLD, &request); } else { /* Start listening for the message from the other rank, but isn't blocked */ MPI_Irecv(important_data, 16, MPI_INT, 0, 0, MPI_COMM_WORLD, &request); } /* Whilst the message is still sending or received, we should do some other work to keep using the CPU (which isn't required for most of the communication. IMPORTANT: the work here cannot modify or rely on important_data */ clear_model_parameters(); initialise_model(); /* Once we've done the work which doesn't require important_data, we need to wait until the data is sent/received if it hasn't already */ MPI_Wait(&request, &status); /* Now the model is ready and important_data has been sent/received, the simulation carries on */ simulate_something(important_data);

What About Deadlocks?

Deadlocks are easily created when using blocking communication. The code snippet below shows an example of deadlock from one of the previous episodes.
if (my_rank == 0) { MPI_Send(&numbers, 8, MPI_INT, 1, 0, MPI_COMM_WORLD); MPI_Recv(&numbers, 8, MPI_INT, 1, 0, MPI_COMM_WORLD, &status); } else { MPI_Send(&numbers, 8, MPI_INT, 0, 0, MPI_COMM_WORLD); MPI_Recv(&numbers, 8, MPI_INT, 0, 0, MPI_COMM_WORLD, &status); }
If we changed to non-blocking communication, do you think there would still be a deadlock? Try writing your own non-blocking version.

To wait, or not to wait

In some sense, by using MPI_Wait() we aren't fully non-blocking because we still block execution whilst we wait for communications to complete. To be "truly" asynchronous we can use another function called MPI_Test() which, at face value, is the non-blocking counterpart of MPI_Wait(). When we use MPI_Test(), it checks if a communication is finished and sets the value of a flag to true if it is and returns. If a communication hasn't finished, MPI_Test() still returns but the value of the flag is false instead. MPI_Test() has the following arguments:
int MPI_Test( MPI_Request *request, int *flag, MPI_Status *status, );
*request:The request handle for the communication
*flag:A flag to indicate if the communication has completed
*status:The status handle for the communication
*request and *status are the same you'd use for MPI_Wait(). *flag is the variable which is modified to indicate if the communication has finished or not. Since it's an integer, if the communication hasn't finished then flag == 0.
We use MPI_Test() is much the same way as we'd use MPI_Wait(). We start a non-blocking communication, and keep doing other, independent, tasks whilst the communication finishes. The key difference is that since MPI_Test() returns immediately, we may need to make multiple calls before the communication is finished. In the code example below, MPI_Test() is used within a while loop which keeps going until either the communication has finished or until there is no other work left to do.
MPI_Status status; MPI_Request request; MPI_Irecv(recv_buffer, 16, MPI_INT, 0, 0, MPI_COMM_WORLD, &request); // We need to define a flag, to track when the communication has completed int comm_completed = 0; // One example use case is keep checking if the communication has finished, and continuing // to do CPU work until it has while (!comm_completed && work_still_to_do()) { do_some_other_work(); // MPI_Test will return flag == true when the communication has finished MPI_Test(&request, &comm_completed, &status); } // If there is no more work and the communication hasn't finished yet, then we should wait // for it to finish if (!comm_completed) { MPI_Wait(&request, &status); }

Dynamic task scheduling

Dynamic task schedulers are a class of algorithms designed to optimise the work load across ranks. The most efficient, and, really, only practical, implementations use non-blocking communication to periodically check the work balance and asynchronously assign and send additional work to a rank, in the background, as it continues to work on its current queue of work.

An interesting aside: communication timeouts

Non-blocking communication gives us a lot of flexibility, letting us write complex communication algorithms to experiment and find the right solution. One example of that flexibility is using MPI_Test() to create a communication timeout algorithm.
#define COMM_TIMEOUT 60 // seconds clock_t start_time = clock(); double elapsed_time = 0.0; int comm_completed = 0 while (!comm_completed && elapsed_time < COMM_TIMEOUT) { // Check if communication completed MPI_Test(&request, &comm_completed, &status); // Update elapsed time elapsed_time = (double)(clock() - start_time) / CLOCKS_PER_SEC; } if (elapsed_time >= COMM_TIMEOUT) { MPI_Cancel(&request); // Cancel the request to stop the, e.g. receive operation handle_communication_errors(); // Put the program into a predictable state }
Something like this would effectively remove deadlocks in our program, and allows us to take appropriate actions to recover the program back into a predictable state. In reality, however, it would be hard to find a useful and appropriate use case for this in scientific computing. In any case, though, it demonstrates the power and flexibility offered by non-blocking communication.

Try it yourself

In the MPI program below, a chain of ranks has been set up so one rank will receive a message from the rank to its left and send a message to the one on its right, as shown in the diagram below:
A chain of ranks
For following skeleton below, use non-blocking communication to send send_message to the right and receive a message from the left rank. Create two programs, one using MPI_Wait() and the other using MPI_Test().
#include <mpi.h> #include <stdio.h> #define MESSAGE_SIZE 32 int main(int argc, char **argv) { int my_rank; int num_ranks; MPI_Init(&argc, &argv); MPI_Comm_rank(MPI_COMM_WORLD, &my_rank); MPI_Comm_size(MPI_COMM_WORLD, &num_ranks); if (num_ranks < 2) { printf("This example requires at least two ranks\n"); return MPI_Finalize(); } char send_message[MESSAGE_SIZE]; char recv_message[MESSAGE_SIZE]; int right_rank = (my_rank + 1) % num_ranks; int left_rank = my_rank < 1 ? num_ranks - 1 : my_rank - 1; sprintf(send_message, "Hello from rank %d!", my_rank); // Your code goes here return MPI_Finalize(); }
The output from your program should look something like this:
Rank 0: message received -- Hello from rank 3! Rank 1: message received -- Hello from rank 0! Rank 2: message received -- Hello from rank 1! Rank 3: message received -- Hello from rank 2!

Collective communication

Since the release of MPI 3.0, the collective operations have non-blocking versions. Using these non-blocking collective operations is as easy as we've seen for point-to-point communication in the last section. If we want to do a non-blocking reduction, we'd use MPI_Ireduce():
int MPI_Ireduce( const void *sendbuf, void *recvbuf, int count, MPI_Datatype datatype, MPI_Op op, int root, MPI_Comm comm, MPI_Request *request, );
*sendbuf:The data to be reduced by the root rank
*recvbuf:A buffer to contain the reduction output
count:The number of elements of data to be reduced
datatype:The data type of the data
op:The reduction operation to perform
root:The root rank, which will perform the reduction
comm:The communicator
*request:The request handle for the communicator
As with MPI_Send() vs. MPI_Isend() the only change in using the non-blocking variant of MPI_Reduce() is the addition of the *request argument, which returns a request handle. This is the request handle we'll use with either MPI_Wait() or MPI_Test() to ensure that the communication has finished, and been successful. The below code examples shows a non-blocking reduction:
MPI_Status status; MPI_Request request; int recv_data; int send_data = my_rank + 1; /* MPI_Iallreduce is the non-blocking version of MPI_Allreduce. This reduction operation will sum send_data for all ranks and distribute the result back to all ranks into recv_data */ MPI_Iallreduce(&send_data, &recv_data, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD, &request); /* Remember, using MPI_Wait() straight after the communication is equivalent to using a blocking communication */ MPI_Wait(&request, &status);

What's `MPI_Ibarrier()` all about?

In the previous episode, we learnt that MPI_Barrier() is a collective operation we can use to bring all the ranks back into synchronisation with one another. How do you think the non-blocking variant, MPI_Ibarrier(), is used and how might you use this in your program? You may want to read the relevant documentation first.

Reduce and Broadcast

Using non-blocking collective communication, calculate the sum of my_num = my_rank + 1 from each MPI rank and broadcast the value of the sum to every rank. To calculate the sum, use either MPI_Igather() or MPI_Ireduce(). You should use the code below as your starting point.
#include <mpi.h> #include <stdio.h> int main(int argc, char **argv) { int my_rank; int num_ranks; MPI_Init(&argc, &argv); MPI_Comm_rank(MPI_COMM_WORLD, &my_rank); MPI_Comm_size(MPI_COMM_WORLD, &num_ranks); if (num_ranks < 2) { printf("This example needs at least 2 ranks\n"); MPI_Finalize(); return 1; } int sum = 0; int my_num = my_rank + 1; printf("Start : Rank %d: my_num = %d sum = %d\n", my_rank, my_num, sum); // Your code goes here printf("End : Rank %d: my_num = %d sum = %d\n", my_rank, my_num, sum); return MPI_Finalize(); }
For two ranks, the output should be:
Start : Rank 0: my_num = 1 sum = 0 Start : Rank 1: my_num = 2 sum = 0 End : Rank 0: my_num = 1 sum = 3 End : Rank 1: my_num = 2 sum = 3