The most fundamental concept in parallel computing is that processor A has to send a message that contains some data (e.g. an array of values) to processor B, and processor B has to receive that message. That's all!
This can be done by using what is called peer-to-peer or p2p communication. I'm pretty sure this sounds familiar! Another way of sending data is called collective communication. In this case, one processor will send data to all remaining processors.
As you will notice, almost 85% of the parallel code that you will write will be comprised of p2p and collective communication calls. So, it is logical to discuss these two modes of communication first.
The remaining 15% (and these numbers are speculations based on the parallel codes I have been involved in designing) go to datatype creation, topology, and miscellaneous functions. These are slightly advanced concepts that will not be discussed in this article. However, MPI datatype creation will be part of the parallel heat equation. So don't worry, you'll get a chance to see how that works!
ROOT & NODE
In a distributed memory parallel computer, a ROOT processor refers to the processor from which the job (or code) is executed. It acts as the job manager for all other processors. This is usually the most powerful machine in your system. All pre and post processing is usually done ont he ROOT.
Conversly, a NODE is any other processor that is not a ROOT.
In MPI, each processor has a "rank" or simply a number assigned to it. By definition, the ROOT processor has rank 0 while the other processors are numbered sequentially.
Peer to Peer (P2P) Communication
In P2P communication, processor A initiates a Send operation to processor B, while processor B initiates a Receive operation from processor A. These have to be matched to avoid what is called a deadlock. We'll discuss that shortly and learn how to overcome it.
There are two ways of initiating send and receive operations:
- Immediate or non-blocking
In a blocking send (receive), the processor "waits" for the message to be sent (received), i.e. the processor stops at that line of code and waits for the message to be sent (received). Imagine the processor is executing the code line by line. When it arrives at a blocking send (receive) it just waits there for the message to sent (received). The nuance here is that the message cannot be sent unless it is ready to be received. So the sending processor will wait until the receiving processor reaches the receive function call and starts receiving the message.
If not properly implemented, this can cause a deadlock, a situation in which a processor cannot send a message because it cannot be received by another processor. Here's a famous scenario.
Consider a parallel code running on four processors, numbered 1, 2, 3 and 4. Also, let each processor send a message to the next one, i.e. 1 -> 2, 2 -> 3, 3 -> 4, and 4 -> 1. For the sake of illustration, let us start by analyzing what processor 1 is doing. When the code reaches the send function, processor 1 starts to prepare the data to be sent to processor 2, but, processor 2 cannot receive this message because it sending a message to processor 3. However, processor 3 cannot receive this message because it is sending a message to processor 4, and so on. So all processors will wait (forever) on that send call because none of them has initiated any receive yet. They are all busy waiting for the receiving processor to start receiving the message.
But, it can cause premature ending of the code because of data mismatch or loss over the network. This can be simply overcome by issuing a "wait" function that tells all processors to wait until all data has been sent and received. This will be shown in the parallel implementation of the heat equation.
In collective communication, all processors are involved in the communication process. There are several instances where you need collective communication. For example, if a processor wants to tell all other processors that it finished computing something, this can be done by using collective communication (instead of writing a loop of send and receive operations).
The most important functionalities of collective communication as defined by MPI are:
A broadcast operation consists of sending data from one processor to the others.
A gather operation consists of gathering values from a group processors and doing something with them. For example, the ROOT processor might want to gather the solution from each processor to put them in one final array.
A reduce operation consists of collecting values from all processors and applying an algebraic (sum, product etc...) or Boolean (maximum, minimum) operation on those values. The resulting reduced data is then stored in one processor.