
Friday, July 16, 2010

Compiling Hypre on Mac OSX

Hypre is a library for solving large, sparse linear systems of equations on massively parallel computers. You can download it from here. Here's my configuration on Mac OSX:
./configure --prefix=/Users/USERNAME/pkg/hypre-2.6.0b-install\
--with-MPI-include=/Users/USERNAME/pkg/openmpi-1.4.2-install/include\
--with-MPI-lib-dirs=/Users/USERNAME/pkg/openmpi-1.4.2-install/lib\
CC='gcc -arch x86_64' CXX='g++ -arch x86_64' F77='gfortran -arch x86_64'
then
make
make install
Make sure you download and install this version of gfortran. See also Compiling OpenMPI on MacOSX.
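To use the library, you compile and link your program against the installed headers and libs through the MPI compiler wrapper. Here's only a sketch of what the compile line might look like with the paths above; myprog.c is just a placeholder, and the library name and extra flags (e.g. -lm) may differ depending on your build options:

/Users/USERNAME/pkg/openmpi-1.4.2-install/bin/mpicc myprog.c \
-I/Users/USERNAME/pkg/hypre-2.6.0b-install/include \
-L/Users/USERNAME/pkg/hypre-2.6.0b-install/lib -lHYPRE -lm -o myprog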

Cite as:
Saad, T. "Compiling Hypre on Mac OSX". Weblog entry from Please Make A Note. http://pleasemakeanote.blogspot.com/2010/07/compiling-hypre-on-mac-osx.html

Thursday, July 15, 2010

Parallel Heat Equation

I apologize for this delay. I've attached the parallel heat equation solver to this post. You can download the code from here
http://www.tonysaad.net/wp/wp-content/uploads/2016/02/parallel-heat-equation-2d.zip
or, separately
http://www.tonysaad.net/wp/wp-content/uploads/2016/02/heat.c
http://www.tonysaad.net/wp/wp-content/uploads/2016/02/heat.h
Please note that, due to time constraints, I have not thoroughly debugged this code, so please make sure you understand it very well before adapting it to your problem. I've included as many comments as possible, so the code should be self-explanatory.

I will try to put in some explanation of what's going on in there, but that may take a while.

Cheers!

[Table of Contents: Parallel Computing with MPI]

Cite as:
Saad, T. "Parallel Heat Equation". Weblog entry from Please Make A Note. http://pleasemakeanote.blogspot.com/2010/07/parallel-heat-equation.html

Wednesday, July 14, 2010

Compiling OpenMPI on MacOSX

First, you will need a proper gfortran installed on your system. This version (gfortran 4.2.3) worked for me. After installing it, download and extract OpenMPI. Here's the config line that I used:
./configure --prefix=/Users/USERNAME/pkg/openmpi-1.4.2-install\
F77=gfortran CFLAGS=-m64 CXXFLAGS=-m64 FFLAGS=-m64
then
make
make install
Voila!

If you wish to install to the default location (/usr/local, etc.), remove the prefix option.
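To check the build, compile and run any small MPI program (the "Bonjour Y'all!" code from the MPI Warmup post will do) using the newly built wrappers; hello.c below is just a placeholder name:

/Users/USERNAME/pkg/openmpi-1.4.2-install/bin/mpicc hello.c -o hello
/Users/USERNAME/pkg/openmpi-1.4.2-install/bin/mpirun -np 2 ./hello

If you installed to the default location, mpicc and mpirun should already be on your path.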

Cite as:
Saad, T. "Compiling OpenMPI on MacOSX". Weblog entry from Please Make A Note. http://pleasemakeanote.blogspot.com/2010/07/compiling-openmpi-on-macosx.html

Thursday, July 17, 2008

Parallel Computing with MPI - Roundup

Here's a roundup of the topics discussed on parallel computing with MPI. This was illustrated using the 2D heat equation, MPICH2, and Microsoft Visual Studio.
  1. Introducing the Heat Equation
  2. Discretization of the Heat Equation
  3. Stability Analysis and the Courant-Friedrichs-Lewy Criterion
  4. Programming the Heat Equation in C using Visual Studio
  5. MPICH2 and Visual Studio
  6. Parallel Computing in a Nutshell
  7. Measuring Parallel Performance
  8. Essential Ingredients of Message Passing
  9. MPI Warmup
  10. Parallelizing the Heat Equation

Cite as:
Saad, T. "Parallel Computing with MPI - Roundup". Weblog entry from Please Make A Note. http://pleasemakeanote.blogspot.com/2008/06/parallel-computing-with-mpi-roundup.html

Parallel Computing with MPI - Part IX: MPI Warmup

[PLEASE NOTE THAT, DUE TO TIME RESTRICTIONS, I WILL NOT BE ABLE TO PUBLISH THE FULL MPI WARMUP TUTORIAL IMMEDIATELY. I WILL THEREFORE PUT UP THE PARTS ONE AT A TIME, SO KEEP CHECKING THIS POST. THANK YOU FOR READING THIS BLOG.]

In this article, I will go over several examples of how to use some of the MPI functions discussed in the previous post. I will start by discussing the essential parts of every MPI program and then go over more specific examples of how to deal with synchronous and asynchronous communication.

I will illustrate this using Visual Studio and MPICH2. Please bear in mind that the codes are universal; if you are a UNIX user, you will have no problem compiling the codes presented hereafter. I chose Visual Studio because I am currently satisfied with the approach Microsoft is taking regarding programming, open source code, and the .Net technology. That's my biased opinion, but whether you like it or not, Visual Studio is the most versatile and easiest to use IDE and compiler on the planet! And best of all, you can get it for free!

First, let's get Visual Studio ready. If you have not done so, please refer to the "MPICH2 and Visual Studio" article.

Now start Visual Studio and create a new, EMPTY C++ project (Win32 console application). Call it whatever you want, say MPIWarmup. Then, in the Solution Explorer, right click on "Source Files" and create a new C++ source file. Call it "main.cpp". Now we are ready to get started with MPI programming.

First Principles: Your First Parallel Program!


In science, a strong theory is always derived from first principles: the conservation laws of mass, momentum, energy, and so on. In MPI, there are also a few first principles needed to make a code run in parallel.
First of all, before you do anything, you have to include the mpi header file, i.e.

#include <mpi.h>


Every MPI program has to start and end with two very specific functions, respectively. These are:

int MPI_Init(int *argc, char ***argv)
int MPI_Finalize()


All the programming will take place between these two functions. Here's how it looks in an actual program:
#include <mpi.h>
#include <stdio.h>
int main(int argc, char* argv[]) {
  // initialize the MPI world
  MPI_Init(&argc,&argv);
  // each processor will print "Bonjour Y'all!"
  printf("Bonjour Y'all! \n");
  // shutdown the MPI world before exiting
  MPI_Finalize();
  return 0;
}
To run this program, you need to start a command prompt and browse to the directory where your code was compiled. Usually, this is the Debug folder of your project directory. Here's a video that summarizes the above.


Here, I ran two instances of the code; that's why you see the message "Bonjour Y'all!" printed twice. This is the simplest MPI program ever, and of course it does not send any messages back and forth.

Now let us improve our program a little bit. How about adding the processor "rank", or ID, to the printed message? To do this, each processor has to call the MPI rank function. Here's how it is usually done:
#include <mpi.h>
#include <stdio.h>
int processorID;
int numberOfProcessors;

int main (int argc, char* argv[]) {
  MPI_Init(&argc, &argv);
  // get the size of the cluster and store in numberOfProcessors
  MPI_Comm_size(MPI_COMM_WORLD,&numberOfProcessors);
  // get the rank or the number of the current process and store it in processorID
  MPI_Comm_rank(MPI_COMM_WORLD,&processorID);
  printf("I am processor %d - Bonjour Y'all! \n", processorID);
  MPI_Finalize();
  return 0;
}

This program will output the processor number as well as the Bonjour Y'all message. When debugging parallel codes, including the processor number in your print statements is invaluable because it helps you pinpoint where the code is breaking. You will find it very useful to place the above piece of code (without the printf statement, of course) in all your parallel programs.

Notice that we have used something called MPI_COMM_WORLD in two of the function calls. This is simply the default communicator that is created when you initialize the MPI world. Communicators are like a web that connects different computers together. Sometimes, you might want to define a new communicator that links only a subset of your parallel computer, although I have never needed that.

Also, notice the use of the ampersand (&) in front of the numberOfProcessors and processorID variables. Variables in MPI calls are passed by address because the data is copied from one memory location to another. I will not get into the details of this, but remember: you can never send a pointer itself in MPI.

Let us assume now that you want the MASTER node to print something different from the compute nodes. This is easily done now that we have the processorID at hand. Of course, the root node has processorID 0 by definition, so you could include an if statement in your code that checks whether the current processorID is zero. However, there's a more elegant way of doing it using the #define directives allowed in the C language. Since we're at it, let us also define a few convenient macros alongside our global variables:

#include <mpi.h>
#include <stdio.h>
#define MASTER (processorID == 0)
#define NODE (processorID != 0)
#define WORLD MPI_COMM_WORLD

int processorID;
int numberOfProcessors;

int main (int argc, char* argv[]) {
  MPI_Init(&argc, &argv);
  MPI_Comm_size(WORLD,&numberOfProcessors);
  MPI_Comm_rank(WORLD,&processorID);
  if (MASTER)
    printf("Master node says: Bonjour Y'all! \n");
  else
    printf("Compute node says: Chillax! \n");
  MPI_Finalize();
  return 0;
}
I think the above code is clear enough as to what it does.

Peer to Peer Communication

It is time now to start experimenting with send and receive operations. As mentioned in the previous post, there are two kinds of peer-to-peer communication: blocking and immediate. We'll start with the blocking send and receive calls. The anatomy of the send function is
int MPI_Send( void *buf, int count, MPI_Datatype datatype, int dest,
int tag, MPI_Comm comm )

where:
- buf: memory address of the data to be sent
- count: number of elements in send buffer
- datatype: datatype of each send buffer element
- dest: rank of the destination processor
- tag: message tag - useful to identify what is coming from where
- comm: communicator - usually MPI_COMM_WORLD

The receive function call has a similar anatomy as follows:
int MPI_Recv( void *buf, int count, MPI_Datatype datatype, int source,
int tag, MPI_Comm comm, MPI_Status *status )

where:
- buf: memory address of receive buffer
- status: status object - detects the status of the receive operation
- count: maximum number of elements in receive buffer
- datatype: datatype of each receive buffer element
- source: rank of source processor
- tag: message tag - this should match the tag of the send function
- comm: communicator - usually MPI_COMM_WORLD

Here's an example: assume that the master node has a variable that it wants to send to processor 2 (i.e. rank 1). This could be done by using two if statements as follows:
#include <mpi.h>
#include <stdio.h>
#define MASTER (processorID == 0)
#define NODE (processorID != 0)
#define WORLD MPI_COMM_WORLD

int processorID;
int numberOfProcessors;

int main (int argc, char* argv[]) {
  int someVariable = 0;
  MPI_Init(&argc, &argv);
  MPI_Comm_size(WORLD,&numberOfProcessors);
  MPI_Comm_rank(WORLD,&processorID);
  // set the value of someVariable on the Master node only
  if (MASTER) someVariable = 10;
  // print the value of someVariable on all processors - this should be zero for all
  //nodes except the master
  printf("Before Send - Processor %i - someVariable = %i  \n",processorID,someVari  able);
  if (MASTER) {
    // send someVariable to processor 1, with the tag being
    //the rank of the source processor, i.e. 0
    MPI_Send(&someVariable,1,MPI_INT,1,0,WORLD);
  }
  if (processorID == 1) {
    // allocate a status variable
    MPI_Status theStatus;
    // initiate the receive operation from processor 0.
    MPI_Recv(&someVariable,1,MPI_INT,0,0,WORLD,&theStatus);
  }
  printf("After Send - Processor %i - someVariable = %i \n",processorID,someVariab  le);
  MPI_Finalize();
  return 0;
}


Voila! To send an array, all you need to do is use the array variable name instead of someVariable. Of course, you have to remove the ampersand if your array was declared as a pointer; if it was declared as a fixed-size array, you can keep the ampersand. I never use fixed-size arrays unless it is really necessary for internal code management. A code that does not interact with the user, i.e. a code that is not dynamic by nature, is of limited use; that defeats the whole purpose of programming! So here's an example with an array declared using pointers.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#define MASTER (processorID == 0)
#define NODE (processorID != 0)
#define WORLD MPI_COMM_WORLD

int processorID;
int numberOfProcessors;

int main (int argc, char* argv[]) {
  int arraySize = 0; /* the arraySize variable will be entered at the command line */
  MPI_Init(&argc, &argv);
  MPI_Comm_size(WORLD,&numberOfProcessors);
  MPI_Comm_rank(WORLD,&processorID);
  /* only one node (usually the master node) should read from the command line */
  if (MASTER) {
    printf("Please enter the array size: \n");
    fflush(stdout);
    scanf("%i",&arraySize);
  }
  /* after the arraySize is read on the master node, the other nodes need to
  have that value locally. at this point, the arraySize = 0 on all nodes
  except the master node where it is equal to the value entered at the cmd.
  a simple broadcast call will send the arraySize from the root to all
  other nodes */
  MPI_Bcast(&arraySize,1,MPI_INT,0,WORLD);
  /* declare a new array using malloc */
  float* someArray = (float*) malloc(arraySize*sizeof(float));
  /* initialize the newly declared array*/
  for (int i = 0; i < arraySize; i++) someArray[i] = 0.0;

  /* print the newly declared array */
  if(processorID == 1) {
    for (int i = 0; i < arraySize; i++)
    printf("someArray[%i] = %f \n", i, someArray[i]);
    printf("------------------------------------------\n");
  }

  /* change the values on the master node */
  if (MASTER)
    for (int i = 0; i < arraySize; i++) someArray[i] = 1.0*(i+1);

  if (MASTER) {
    /* send someArray to processor 1, with the tag being the
    rank of the source processor, i.e. 0 */
    MPI_Send(someArray,arraySize,MPI_FLOAT,1,0,WORLD);
  }
  if (processorID == 1) {
    // allocate a status variable
    MPI_Status theStatus;
    // initiate the receive operation from processor 0.
    MPI_Recv(someArray,arraySize,MPI_FLOAT,0,0,WORLD,&theStatus);
  }

  /* print the values to make sure the array was correctly received */
  if(processorID == 1) for (int i = 0; i < arraySize; i++)
  printf("someArray[%i] = %f \n", i, someArray[i]);
  MPI_Finalize();
  return 0;
} 



Note that I have used a broadcast call after the root node reads the array size from the command line. I will explain collective communication very shortly.

So far, we've sent data from the root node to the first node. Of course, this can be done between any two processors. One processor may also send different arrays to different processors.

One subtle scenario occurs when all processors are sending and receiving data from each other. If not handled properly, this may lead to a deadlock with blocking communication. The safest bet in this case is to use asynchronous, or immediate, communication. I use blocking communication when a single processor is sending different data sets to all the other processors. For example, in CFD, a typical scenario is when the problem is read into the master node and partitioning is subsequently applied on the master node. The master then has to send the partition information and data structures to each processor; this can be done using blocking sends and receives. Now, when the processors start doing the computation, data has to be communicated at the partition interfaces. When the partition topology is simple (unidirectional, for example), blocking communication may be used. However, when the topology is complicated (see Fig. 1), the easiest and safest way to pass the data at the interfaces is to use immediate sends and receives.
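To make this concrete, here is a minimal sketch (not part of the heat equation code) of how immediate sends and receives might be used: every processor posts a receive from its left neighbor and a send to its right neighbor with MPI_Irecv and MPI_Isend, then waits for both transfers to complete with MPI_Waitall. The variable names are mine, and the "interface data" is just a single double.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char* argv[]) {
  int rank, size;
  MPI_Init(&argc, &argv);
  MPI_Comm_size(MPI_COMM_WORLD, &size);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  int right = (rank + 1) % size;          /* neighbor we send to */
  int left  = (rank - 1 + size) % size;   /* neighbor we receive from */
  double sendValue = 1.0*rank;            /* pretend this is interface data */
  double recvValue = 0.0;
  MPI_Request requests[2];
  MPI_Status statuses[2];
  /* post the receive and the send - neither call blocks */
  MPI_Irecv(&recvValue, 1, MPI_DOUBLE, left, 0, MPI_COMM_WORLD, &requests[0]);
  MPI_Isend(&sendValue, 1, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &requests[1]);
  /* other work could be done here while the messages are in flight */
  /* wait for both transfers before using recvValue or reusing sendValue */
  MPI_Waitall(2, requests, statuses);
  printf("Processor %i received %f from processor %i \n", rank, recvValue, left);
  MPI_Finalize();
  return 0;
}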




Cite as:
Saad, T. "Parallel Computing with MPI - Part IX: MPI Warmup". Weblog entry from Please Make A Note. http://pleasemakeanote.blogspot.com/2008/07/parallel-computing-with-mpi-part-ix-mpi.html

Friday, July 4, 2008

Parallel Computing with MPI - Part VIII: Essential Ingredients of Message Passing

In this article, I will show you, in the simplest of terms, how easy it is to write parallel code using the message passing interface. There will be no coding at this stage, only the basic concepts.

The most fundamental concept in parallel computing is that processor A has to send a message that contains some data (e.g. an array of values) to processor B, and processor B has to receive that message. That's all!

This can be done by using what is called peer-to-peer or p2p communication. I'm pretty sure this sounds familiar! Another way of sending data is called collective communication. In this case, one processor will send data to all remaining processors.

As you will notice, almost 85% of the parallel code that you will write will consist of p2p and collective communication calls. So it is logical to discuss these two modes of communication first.

The remaining 15% (these numbers are rough estimates based on the parallel codes I have been involved in designing) go to datatype creation, topology, and miscellaneous functions. These are slightly more advanced concepts that will not be discussed in this article. However, MPI datatype creation will be part of the parallel heat equation, so don't worry, you'll get a chance to see how that works!

ROOT & NODE
In a distributed memory parallel computer, the ROOT processor refers to the processor from which the job (or code) is launched. It acts as the job manager for all other processors and is usually the most powerful machine in your system. All pre- and post-processing is usually done on the ROOT.

Conversely, a NODE is any processor that is not the ROOT.

In MPI, each processor has a "rank" or simply a number assigned to it. By definition, the ROOT processor has rank 0 while the other processors are numbered sequentially.

Peer to Peer (P2P) Communication
In P2P communication, processor A initiates a Send operation to processor B, while processor B initiates a Receive operation from processor A. These have to be matched to avoid what is called a deadlock. We'll discuss that shortly and learn how to overcome it.

There are two ways of initiating send and receive operations:
  1. Blocking
  2. Immediate or non-blocking
Blocking Communication

In a blocking send (receive), the processor "waits" for the message to be sent (received), i.e. the processor stops at that line of code and waits for the message to be sent (received). Imagine the processor executing the code line by line: when it arrives at a blocking send (receive), it simply waits there for the message to be sent (received). The nuance here is that the message cannot be sent unless it is ready to be received, so the sending processor will wait until the receiving processor reaches the receive function call and starts receiving the message.


If not properly implemented, this can cause a deadlock, a situation in which a processor cannot send a message because it cannot be received by another processor. Here's a famous scenario.

Consider a parallel code running on four processors, numbered 1, 2, 3, and 4. Also, let each processor send a message to the next one, i.e. 1 -> 2, 2 -> 3, 3 -> 4, and 4 -> 1. For the sake of illustration, let us start by analyzing what processor 1 is doing. When the code reaches the send function, processor 1 starts to prepare the data to be sent to processor 2; but processor 2 cannot receive this message because it is busy sending a message to processor 3. Processor 3, in turn, cannot receive that message because it is sending a message to processor 4, and so on. All processors will therefore wait (forever) at their send calls because none of them has initiated a receive yet; they are all waiting for their receiving processor to start receiving the message.

Immediate Communication

Immediate communication is the opposite of blocking communication. In this case, the processor does not wait for the message to be received; it hands the message to the network and continues to the next line of code. This mode of communication naturally avoids the deadlock problem described above.


However, it can cause trouble if a buffer is read or overwritten before the transfer has actually completed. This is easily overcome by issuing a "wait" function that tells the processors to wait until all the data has been sent and received. This will be shown in the parallel implementation of the heat equation.

Collective Communication
In collective communication, all processors are involved in the communication process. There are several instances where you need collective communication. For example, if a processor wants to tell all the other processors that it has finished computing something, this can be done with a single collective call instead of writing a loop of send and receive operations.

The most important functionalities of collective communication as defined by MPI are:
  1. Broadcast
  2. Gather
  3. Reduce
Broadcast

A broadcast operation consists of sending data from one processor to the others.


Gather

A gather operation consists of collecting values from a group of processors and doing something with them. For example, the ROOT processor might want to gather the solution from each processor to assemble the final array.

Reduce

A reduce operation consists of collecting values from all processors and applying an algebraic (sum, product, etc.) or comparison (maximum, minimum) operation to those values. The resulting reduced value is then stored on one processor.
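We won't write real code until the MPI warmup, but for the curious, here is a minimal sketch of what these three operations look like in practice (the variable names and values are purely illustrative):

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char* argv[]) {
  int rank, size;
  MPI_Init(&argc, &argv);
  MPI_Comm_size(MPI_COMM_WORLD, &size);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  /* Broadcast: the root sends one integer to everybody */
  int iterations = 0;
  if (rank == 0) iterations = 100;
  MPI_Bcast(&iterations, 1, MPI_INT, 0, MPI_COMM_WORLD);

  /* Reduce: sum one value from each processor onto the root */
  double localValue = 1.0*(rank + 1);
  double globalSum = 0.0;
  MPI_Reduce(&localValue, &globalSum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

  /* Gather: collect one value from each processor into an array on the root */
  double* allValues = NULL;
  if (rank == 0) allValues = (double*) malloc(size*sizeof(double));
  MPI_Gather(&localValue, 1, MPI_DOUBLE, allValues, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);

  if (rank == 0) {
    printf("iterations = %i, global sum = %f \n", iterations, globalSum);
    free(allValues);
  }
  MPI_Finalize();
  return 0;
}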


In the next post, we'll start a (long) MPI warmup to get you familiar with the functions presented in this article as well as some programming intricacies of MPI.

Cite as:
Saad, T. "Parallel Computing with MPI - Part VIII: Essential Ingredients of Message Passing". Weblog entry from Please Make A Note. http://pleasemakeanote.blogspot.com/2008/07/parallel-computing-with-mpi-part-viii.html

Wednesday, July 2, 2008

Parallel Computing with MPI - Part VII: Measuring Parallel Performance

The most obvious benefit of using parallel computing is the reduction in execution time of your code. For example, if your code takes two hours to run on a single processor, it would (theoretically) take one hour to run on two processors, 30 minutes to run on four processors, and 2/N hours to run on N processors. However, there are two delicate issues to address here.

First, you must distinguish between CPU time and wall clock time.

CPU vs Clock Time

CPU time is the time required by the central processing unit to process instructions. It does not include input/output waiting times (for example, when the code is waiting for the user to input some data and the user is taking a break).

On the other hand, wall clock time is the time taken to run a code as measured by a regular clock. This will include all kinds of input/output operations and any processor idle time. Also, if you are running 10 applications alongside your code, the wall clock time for your code will be obviously larger than if it was run only by itself.

CPU time is usually used to benchmark your code, but in reality it is the wall clock time that really counts: if you start your code in the morning and it finishes by early evening, then that is when it really finished executing, not earlier as would most likely be reported by the CPU time.

I personally use both methods to time my parallel codes. Ideally, if your code is running on a dedicated parallel machine, the wall clock time will be reasonably close to the CPU time.
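Here is a minimal sketch of how both can be measured inside an MPI code: clock() from the C standard library approximates the CPU time, while MPI_Wtime() returns the wall clock time. The dummy loop simply stands in for the actual computation.

#include <mpi.h>
#include <stdio.h>
#include <time.h>

int main(int argc, char* argv[]) {
  MPI_Init(&argc, &argv);
  clock_t cpuStart = clock();        /* CPU time */
  double wallStart = MPI_Wtime();    /* wall clock time */

  /* the actual computation would go here */
  double sum = 0.0;
  for (long i = 0; i < 10000000; i++) sum += 1.0/(i + 1.0);

  double wallEnd = MPI_Wtime();
  clock_t cpuEnd = clock();
  printf("CPU time  = %f seconds \n", (double)(cpuEnd - cpuStart)/CLOCKS_PER_SEC);
  printf("Wall time = %f seconds (sum = %f) \n", wallEnd - wallStart, sum);
  MPI_Finalize();
  return 0;
}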

Speedup

The other issue I would like to address is the actual reduction in running time, usually measured by the speedup. The speedup is defined as

S(N) = T(1) / T(N)   (Eq. 1)

where T(1) is the execution time on a single processor and T(N) is the execution time on N processors.
Theoretically, your speedup should be equal to N: if your code runs in T seconds on one processor, it should take T/N seconds on N processors. An ideal speedup is therefore a straight line at a 45 degree angle (Fig. 1).

The speedup curve may also give information on the scalability of your code. A code is scalable, in the parallel sense, when the speedup does not drift away from the ideal curve.

In practice, the speedup is not always ideal or even linear. The deterioration in scalability has several causes that are beyond the scope of this article. A rule of thumb is to keep the load on each processor balanced: the more processors you want to use, the larger your problem data structures should be. You will learn this by practice.

Sometimes, you will even see super-linear speedup, i.e. S(N) > N. This usually happens because the parallel code can be more efficient than its sequential counterpart, for example when each processor's smaller share of the data fits in cache.

Efficiency

The efficiency of a parallel code is a measure of how much of the time the processors are actually being used to execute the program. This may be written as

E(N) = S(N) / N   (Eq. 2)

When inserting Eq. 1 into Eq. 2, one arrives at

E(N) = T(1) / (N T(N))
For instance, if the efficiency is 80%, then the processors spend 80% of the time doing actual computation and are idle for the remaining 20%.

In the next post, I will discuss the essential ingredients needed in the message passing paradigm. You will see that only a few functions are required to write a parallel code using message passing.

Cite as:
Saad, T. "Parallel Computing with MPI - Part VII: Measuring Parallel Performance". Weblog entry from Please Make A Note. http://pleasemakeanote.blogspot.com/2008/07/parallel-computing-with-mpi-part-vii.html

Saturday, June 28, 2008

Parallel Computing with MPI - Part VI: Parallel Computing in a Nutshell

Finally, we reach the most interesting part of this series. In this post, I will present, in a nutshell, what you need to know in order to get started with parallel computing. I will base this discussion on an article that I wrote in 2005 on the CFD-Wiki; you can find it here. Of course, a lot has changed since then, but the basics and semantics are exactly the same. Any improvement in parallel computing will happen at the level of implementation, hardware, and additional functionality; the foundation has been laid out nicely and will remain so for a while.

This article will be a very long one, although I hate to write long articles on a blog. So if you would like me to split it, just drop me a comment or an email.

What is Parallel Computing?

Parallel computing may be defined as the simultaneous use of several processors to execute a computer program. In the process, inter-processor communication is required at some point in the code; therefore, the code has to be explicitly instructed to run in parallel.

For example, a code that computes the sum of the first N integers is very easy to program in parallel: each processor is responsible for computing a fraction of the global sum, and at the end of the execution all processors communicate their local sums to a single processor, which adds them up to obtain the global sum. A sketch of this is shown below.
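Here is a minimal sketch of what that might look like with MPI: each processor sums its own chunk of the range 1..N, and a single reduce call adds up the local sums on the root processor. The chunk splitting used here is the simplest possible one and is only for illustration.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char* argv[]) {
  int rank, size;
  long N = 1000000;                 /* sum the integers 1..N */
  MPI_Init(&argc, &argv);
  MPI_Comm_size(MPI_COMM_WORLD, &size);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  /* each processor sums its own contiguous chunk of the range */
  long chunk = N/size;
  long start = rank*chunk + 1;
  long end   = (rank == size - 1) ? N : start + chunk - 1;  /* last processor takes the remainder */
  long localSum = 0, globalSum = 0;
  for (long i = start; i <= end; i++) localSum += i;
  /* combine the local sums into a global sum on the root processor */
  MPI_Reduce(&localSum, &globalSum, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);
  if (rank == 0) printf("Sum of the first %ld integers = %ld \n", N, globalSum);
  MPI_Finalize();
  return 0;
}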

Types of Parallel Computers

There are two types of parallel architectures:
  1. Shared memory multiprocessor
  2. Distributed memory multicomputer
In a shared memory multiprocessor, all processors share the same memory address space. Imagine that these processors are all on the same motherboard. Multi-core CPUs (dual cores, quad cores) and multi-CPU-socket motherboards, such as server motherboards (or the computers we have in our offices at UTSI), are examples of shared memory multiprocessor machines. In this setup, all processors have direct access to any of the variables declared in a code. This configuration is very attractive from a programming point of view: there is no need to send data from one computer to another. However, the main disadvantage lies in possible conflicts in accessing or altering data that resides in memory (imagine processor one is computing the temperature at a point while processor two tries to access that same value; there is then a conflict and the memory cannot be accessed safely). This is usually remedied by inserting function calls that make sure memory addresses are accessed without conflict. Of course, this overhead offsets the benefits of sharing memory addresses, and this configuration is no longer as popular since conflict management becomes a tedious and ambiguous process.


In a distributed memory configuration, each processor has its own memory address space, and data has to be sent explicitly from one processor to another. The computers are usually connected via a network switch, and data flows over the network from one processor to another. Note, however, that multi-core CPUs as well as multi-socket motherboards CAN use this configuration: in that case, although the memory modules are physically the same for all processors, each one reserves a portion of the memory just as if it existed independently of the others. Also, when multi-core systems are used this way, data is exchanged much faster since it does not have to travel over a network. The main advantage of this configuration is its flexibility: new machines can always be added to the parallel computer, and office computers can be used as part of it when they are idle, i.e. if the GRAs do not work overnight!



So How Does it Work?

The history of parallel computing extends over a long period of time and may be traced back to the early 60's (check this link for a concise historical perspective). However, the most important milestone took place in 1992, when a group of scientists (most likely computer guys) was assembled to create a standard for message passing; their job was to formulate a universal framework for exchanging data in a parallel computer. So they retreated for a while and came up with what is called today the Message Passing Interface, or MPI. It defines a set of functions and rules for how data is to flow from one processor to another within a parallel computer. Of course, they did not have to do the actual implementation, as this is left to third party vendors and the open source community.

So how does parallel computing work? The notion is very simple. A computer code runs on different processors. At one point or another, data has to be exchanged between processors; therefore, a function call has to be included at that point to instruct the code to send or receive data. Once the executable reaches that line of code, it allocates some buffer memory, copies the data that is to be sent, and ships it over to the receiving computer. On the receiving end, the code is waiting for data to arrive; it knows what it will be receiving, so it allocates enough buffer space and waits. Once the data arrives, it copies it to the designated address in memory, and that's all!

Of course, there are many other ways that this happens, and different implementations do things differently, but the scenario presented above is one viable method. I still have to include some references though.

Cite as:
Saad, T. "Parallel Computing with MPI - Part VI: Parallel Computing in a Nutshell". Weblog entry from Please Make A Note. http://pleasemakeanote.blogspot.com/2008/06/parallel-computing-with-mpi-part-vi.html

Wednesday, June 25, 2008

Parallel Computing with MPI - Part V: MPICH2 and Visual Studio

Now that we have covered the sequential code for the finite difference solution of the 2D heat equation, it is time to start thinking in parallel. But first, we have to set the grounds for this adventure. This will be the topic of this post.

MPI, aka the Message Passing Interface, is a standard that defines a set of functions that allow a programmer to instruct their code to execute tasks in parallel. Of course, for this set of functions to be portable, they have to be compiled as a library. The functions are then visible inside the code and used as necessary. I will defer my discussion of parallel computing to the next post, as the sole purpose of this post is to have MPI ready so that we can experiment with it.

One thing that you have to understand is that MPI is only a standard; the implementations vary. This means that you will find several "MPI implementations" out there, some free while others require the purchase of a license, but they all offer the same functionality, function names, and datatypes as dictated by the MPI standard.

Fortunately, God created some people who believe that science is not personal and are willing to take the time and put in the effort to implement the MPI standard for free. The most popular implementations are MPICH and LamMPI. I use MPICH on Windows and LamMPI on MacOSX.

For users of Unix based systems, I am going to assume that you already have enough knowledge of how to compile, install, and link a library to your codes. Just do the same for MPI. You can download the MPICH2 implementation here and the LamMPI implementation here.

Now, for Windows users, I will outline the installation process of MPICH2 so that it is streamlined with Microsoft Visual Studio 2008. Both the Express and Professional editions can be used. Remember that the Express edition can be downloaded for free!

I assume that you already have Visual Studio set up. There are two versions of MPICH2, namely the 32 bit and the 64 bit. For the time being, let us focus on the 32 bit version. Let us first set up the environment for MPI:
  1. Download and install MPICH2 32 bit edition
  2. Add the following to your path
    C:\Program Files\MPICH2\bin
    (in Vista: right click My Computer/Properties/Advanced System Settings/Advanced/Environment Variables and edit the path variable for either the system or user variables. Do the same in XP, except that you don't have to go through Advanced System Settings because it doesn't exist there)
  3. Add the following exceptions to your firewall:
    C:\Program Files\MPICH2\bin\mpiexec.exe
    C:\Program Files\MPICH2\bin\smpd.exe
    Usually, you will receive a message to unblock these programs. But in case you don't, you have to add them manually
  4. Register a user account with MPI. Go to
    C:\Program Files\MPICH2\bin\wmpiregister
    and type in the user account and password that will be used to start MPI jobs. You should have the same user account and password on all your cluster machines
  5. Open port 8676. Go to
    C:\Program Files\MPICH2\bin\wmpiconfig
    and select "port" at the top of the list
Note that if you install the 32 bit MPICH2 on a 64 bit windows, the MPICH installation directory will be under "Program Files (x86)".

Now, to set up the environment in Visual Studio, we have to tell the compiler where to find the MPI header files and libraries. Do the following:
  1. Start Visual Studio
  2. Go to: Tools/Options/Projects and Solutions/Visual C++ Directories
  3. In the platform combo box, select win32
  4. In the Directories combo box, select "Include files" and add the following entry
    C:\Program Files\MPICH2\include (or Program Files (x86) for 64 bit windows)
  5. Select "Library files" and add the following entry
    C:\Program Files\MPICH2\lib\ (or Program Files (x86) for 64 bit windows)
Voila!

Now your Visual Studio should be working like a charm! We'll test these settings in the next post.

Cite as:
Saad, T. "Parallel Computing with MPI - Part V: MPICH2 and Visual Studio". Weblog entry from Please Make A Note. http://pleasemakeanote.blogspot.com/2008/06/parallel-computing-with-mpi-part-v.html

Sunday, June 22, 2008

Parallel Computing with MPI - Part IV: Programming the 2D Heat Equation in C using Visual Studio



In this post, you'll find the code that I have written for the numerical solution of the 2D heat equation. This is the sequential code, of course. By the way, sequential is the opposite of parallel, just in case you didn't know. There should be no confusion here: all code executes one line at a time; sequential simply means the whole job runs on a single processor.

To make things easy for newcomers, I have not used any extra header files to hold the structure definitions and function prototypes. Good programming practice, however, requires that you place such definitions in header files.

You will find two files over here. The first one contains only the C code. Although the file extension is cpp, the code is in C; a C++ compiler will handle it perfectly well.

The second is a rar archive that includes the Visual Studio solution files for those who wish to use Visual C++ Express. You can download it here. Simply extract the rar archive and open the ".sln" file.

The minimum expected from you is to read and understand the whole code before pressing the build or run buttons. I will most likely upload another version that has detailed comments on how things are done.
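In the meantime, here is a minimal, self-contained sketch (not the downloadable code; the grid size, boundary condition, and timestep are made up) of the double-buffered explicit update loop that sits at the heart of any such solver, i.e. the update of Eq. (6) from the discretization post:

#include <stdio.h>
#include <stdlib.h>

#define NX 64
#define NY 64

int main() {
  double alpha = 1.0, delta = 1.0/(NX - 1);
  double dt = 0.2*delta*delta/alpha;     /* safely below the CFL limit of 0.25 */
  double r = alpha*dt/(delta*delta);
  double* Told = (double*) calloc(NX*NY, sizeof(double));
  double* Tnew = (double*) calloc(NX*NY, sizeof(double));
  /* made-up boundary condition: hold the left wall at T = 1 */
  for (int j = 0; j < NY; j++) { Told[j*NX] = 1.0; Tnew[j*NX] = 1.0; }
  for (int n = 0; n < 1000; n++) {
    /* update interior points using the explicit stencil of Eq. (6) */
    for (int j = 1; j < NY - 1; j++)
      for (int i = 1; i < NX - 1; i++)
        Tnew[j*NX + i] = Told[j*NX + i] + r*( Told[j*NX + i + 1] + Told[j*NX + i - 1]
                                            + Told[(j + 1)*NX + i] + Told[(j - 1)*NX + i]
                                            - 4.0*Told[j*NX + i] );
    /* swap the old and new arrays for the next timestep */
    double* tmp = Told; Told = Tnew; Tnew = tmp;
  }
  printf("T at the center = %f \n", Told[(NY/2)*NX + NX/2]);
  free(Told); free(Tnew);
  return 0;
}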

Table of Contents:

  1. Introducing the Heat Equation
  2. Discretization of the Heat Equation
  3. Stability Analysis and the Courant-Friedrichs-Lewy Criterion
  4. Programming the Heat Equation in C using Visual Studio
  5. MPICH2 and Visual Studio
  6. Parallel Computing in a Nutshell
  7. Measuring Parallel Performance
  8. Essential Ingredients of Message Passing
  9. MPI Warmup
  10. Parallelizing the Heat Equation

Cite as:
Saad, T. "Parallel Computing with MPI - Part IV: Programming the 2D Heat Equation in C using Visual Studio". Weblog entry from Please Make A Note. http://pleasemakeanote.blogspot.com/2008/06/parallel-computing-with-mpi-part-iv.html

Wednesday, June 18, 2008

Parallel Computing with MPI - Part III: Stability Analysis of the Discretized Heat Equation

In today's part of this series, we'll go over a basic stability analysis for the discretized heat equation that was derived in the previous post.

Here, I will follow John Anderson's neat exposition of the stability analysis for difference equations. We start by defining a few terms. Let


D = exact solution of the difference equation (infinite accuracy computer)
N = numerical solution of the difference equation (finite accuracy computer)
ε = Round off error

then, by definition,

ε = N - D

which is the error between the exact solution and the finite accuracy numerical solution of the difference equation. Our difference equation can therefore be represented by N, since we are most likely to solve it on a finite accuracy computer, right? To make headway, we replace T in our difference equation (Eq. (6) of the previous post) by N. At the outset, we get

N_{i,j}^{n+1} = N_{i,j}^{n} + (αΔt/δ²)(N_{i+1,j}^{n} + N_{i-1,j}^{n} + N_{i,j+1}^{n} + N_{i,j-1}^{n} - 4N_{i,j}^{n})

but the numerical solution can be expressed in terms of the round off error and the exact difference equation solution as

N = D + ε

If we substitute this into the difference equation, we note that the exact solution D already satisfies the discretized equation. D therefore simply cancels out, and the round off error itself satisfies the difference equation, i.e.

ε_{i,j}^{n+1} = ε_{i,j}^{n} + (αΔt/δ²)(ε_{i+1,j}^{n} + ε_{i-1,j}^{n} + ε_{i,j+1}^{n} + ε_{i,j-1}^{n} - 4ε_{i,j}^{n})
Since we are solving on a real computer, round off errors are inevitable. Therefore, we want to study how these errors evolve from one timestep to another. If ε gets smaller or, at worst, stays the same, the solution is stable; if it grows, the solution is unstable. Mathematically, for a numerical solution to be stable, this can be expressed as

|ε_{i,j}^{n+1} / ε_{i,j}^{n}| ≤ 1
Now, we wish to seek a general form of the round off error. Thankfully, once upon a time, a mathematician called Joseph Fourier showed that any function can be expressed as an infinite series of sines and cosines, or more conveniently, exponentials. Therefore, it is reasonable to assume the following general form for the round off error


where i is the imaginary unit, kn is a wave number, and An(t) reflects the time dependence of the round off error. One may also assume a general form for An(t). Since the error will either grow or decay with time, then one can set


so that the general form of the round off error is


Now, we can substitute this form into the discretized equation accordingly. For example,


Upon substitution and simplification, i.e. dividing by ε(x,y,t), we get


nice! Notice that the infinite sum is no longer needed. Also, the exponential terms inside the parentheses can be collapsed into cosines


This form, albeit quite simple, does not help in the subsequent steps. We need to invoke one extra trigonometric identity


to obtain


Now we take the absolute value of this equation and obtain the stability requirement


we get two cases

Case I:


or

which is always true. So this case does not pose any limitations.

Case II:


or


but, since the square of the sine is always less than or equal to 1, then


going back to the original inequality, for stability, we must satisfy

2 αΔt/δ² ≤ 1/2

or

αΔt/δ² ≤ 1/4
This is known as the Courant-Friedrichs-Lewy, or CFL, condition. Although not in the context in which it was originally derived, a CFL condition may be obtained for almost any numerical discretization of a differential equation.

In general, if the grid spacing were not uniform, we would have

αΔt (1/Δx² + 1/Δy²) ≤ 1/2
Note that in Horak and Gruber's paper (Parallel Numerical Solution of 2-D Heat Equation), there is a typo in the stability condition on page 49 (the first equation on that page): the "2" in the denominator should actually be a "4", as was shown in this post. The CFL condition that the authors refer to actually corresponds to the one dimensional heat equation.
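As a practical aside, here is a sketch of how one might pick a stable timestep in code from the general condition above; the diffusivity, grid spacing, and safety factor are arbitrary:

#include <stdio.h>

int main() {
  double alpha = 1.0e-5;          /* thermal diffusivity */
  double dx = 0.01, dy = 0.01;    /* grid spacing */
  double safety = 0.9;            /* stay safely below the limit */
  /* CFL condition: alpha*dt*(1/dx^2 + 1/dy^2) <= 1/2 */
  double dt = safety*0.5/(alpha*(1.0/(dx*dx) + 1.0/(dy*dy)));
  printf("stable timestep = %e seconds \n", dt);
  return 0;
}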

In my next post, I'll present a sequential program for solving the heat equation in C using Visual Studio.

REFERENCES

1. ANDERSON, J. D. 1995 Computational fluid dynamics : the basics with applications. McGraw-Hill.

2. TANNEHILL, J. C., ANDERSON, D. A. & PLETCHER, R. H. 1997 Computational fluid mechanics and heat transfer. 2nd edn. Taylor & Francis.

Table of Contents:

  1. Introducing the Heat Equation
  2. Discretization of the Heat Equation
  3. Stability Analysis and the Courant-Friedrichs-Lewy Criterion
  4. Programming the Heat Equation in C using Visual Studio
  5. MPICH2 and Visual Studio
  6. Parallel Computing in a Nutshell
  7. Measuring Parallel Performance
  8. Essential Ingredients of Message Passing
  9. MPI Warmup
  10. Parallelizing the Heat Equation

Cite as:
Saad, T. "Parallel Computing with MPI - Part III: Stability Analysis of the Discretized Heat Equation". Weblog entry from Please Make A Note. http://pleasemakeanote.blogspot.com/2008/06/parallel-computing-with-mpi-part-iii.html

Monday, June 16, 2008

Parallel Computing with MPI - Part II: Discretization of the Heat Equation

In this article, we'll discuss how to discretize the heat equation using an explicit Euler differencing scheme. The discretization is first order forward in time and second order central in space. I am assuming that you are somewhat familiar with finite difference methods. If not, then discuss it with your therapist.

Again, we are considering the heat equation given by

∂T/∂t = α (∂²T/∂x² + ∂²T/∂y²)   (1)
The underlying idea in discretization is to first divide the domain of interest (i.e. the region over which we are solving the heat equation) into a grid of discrete points (in space and time). When this is done, the derivatives can be approximated using the values of the dependent variable T at the discrete points. For convenience, the grid points are indexed using i in the horizontal x direction, and j in the vertical y direction.

Similarly, the time dimension is divided into time levels n, n+1, n+2... separated by a timestep Δt.

Next, we have to transform Eq. (1) into an algebraic relation at (any) point (i, j) and time level n. To do this, we have to convert each derivative into an equivalent algebraic expression. For the unsteady term, we use a forward Euler scheme as follows

∂T/∂t ≈ (T_{i,j}^{n+1} - T_{i,j}^{n}) / Δt   (2)
where n denotes the time level and Δt denotes the timestep size. The form given in Eq. (2) is called forward in time because it uses the value of T at the next time level (n+1), which is unknown. In essence, we are solving for T at (n+1).

For the spatial derivatives, we will use explicit central differencing. This means that the discretized form of the spatial derivatives will be evaluated at the current time level n, where the values are known. This is written as

∂²T/∂x² ≈ (T_{i+1,j}^{n} - 2T_{i,j}^{n} + T_{i-1,j}^{n}) / Δx²   (3)

and

∂²T/∂y² ≈ (T_{i,j+1}^{n} - 2T_{i,j}^{n} + T_{i,j-1}^{n}) / Δy²   (4)

If we further set Δx = Δy = δ, then upon substitution of Eqs. (2)-(4) into Eq. (1) we obtain the fully discretized form of the heat equation

(T_{i,j}^{n+1} - T_{i,j}^{n}) / Δt = (α/δ²)(T_{i+1,j}^{n} + T_{i-1,j}^{n} + T_{i,j+1}^{n} + T_{i,j-1}^{n} - 4T_{i,j}^{n})   (5)
Finally, the simplicity of the explicit scheme is clearly visible in Eq. (5), as one can easily solve for the new time level (n+1)

T_{i,j}^{n+1} = T_{i,j}^{n} + (αΔt/δ²)(T_{i+1,j}^{n} + T_{i-1,j}^{n} + T_{i,j+1}^{n} + T_{i,j-1}^{n} - 4T_{i,j}^{n})   (6)

This means that there is no need to solve a linear system of equations. If we had used an implicit method, the RHS of Eq. (6) would be evaluated at time level (n+1) and one would therefore have to solve a linear system of equations at each timestep.

In the next post, we'll discuss the stability of the discretized equation and derive the CFL criterion to ensure stability of the numerical solution given by Eq. (6).


Table of Contents:

  1. Introducing the Heat Equation
  2. Discretization of the Heat Equation
  3. Stability Analysis and the Courant-Friedrichs-Lewy Criterion
  4. Programming the Heat Equation in C using Visual Studio
  5. MPICH2 and Visual Studio
  6. Parallel Computing in a Nutshell
  7. Measuring Parallel Performance
  8. Essential Ingredients of Message Passing
  9. MPI Warmup
  10. Parallelizing the Heat Equation

Cite as:
Saad, T. "Parallel Computing with MPI - Part II: Discretization of the Heat Equation". Weblog entry from Please Make A Note. http://pleasemakeanote.blogspot.com/2008/06/parallel-computing-with-mpi-part-ii.html

Sunday, June 15, 2008

Parallel Computing with MPI - Part I: The Heat Equation

In this series, I will introduce parallel programming concepts with MPI. This series is prompted by the "Numerical Methods for PDEs" class given by Professor Christian Parigger of UTSI.

Of course, since things are done best with concrete examples; our objective will be to parallelize the 2D heat equation (i.e. unsteady diffusion) with constant properties. The parallelization of the 2D heat equation is actually one of the problems on the final exam of Prof. Parigger's class and is based on the paper by Horak and Gruber, Parallel Numerical Solution of 2-D Heat Equation.

Here's an outline of what I am planning to talk about:
  1. Introducing the Heat Equation
  2. Discretization of the Heat Equation
  3. Stability Analysis and the Courant-Friedrichs-Lewy Criterion
  4. Programming the Heat Equation in C using Visual Studio
  5. MPICH2 and Visual Studio
  6. The Essential Ingredients of an MPI Program
  7. MPI Warmup
  8. Parallelizing the Heat Equation (2 or 3 posts - most likely a video tutorial)
The heat equation is a great example to introduce parallel computing because we will use an explicit scheme for the discretization. The reason for the simplicity of the explicit scheme and its suitability for parallel computing will become clear as I progress in this series. (For those who absolutely want to know why it is simpler to parallelize an explicit finite difference scheme, I'll briefly point out that it is because you are not solving a linear system of equations. With explicit methods, the values at the new time step are explicitly expressed in terms of the values at the previous one, and therefore the exchange of values at the partition interfaces is straightforward.)

So let us first start by introducing the heat equation.

The heat equation (with no heat sources) expresses a balance between the accumulation and the flow of heat within a certain region of space, usually referred to as a control volume. Obviously, the heat equation is not restricted to heat: many other physical phenomena share the same form, such as mass transfer, momentum transfer, and Schroedinger's equation. In its simplest form, the heat equation describing heat diffusion in solids (aka conduction heat transfer) is given by

ρc ∂T/∂t = k (∂²T/∂x² + ∂²T/∂y²)   (1)

Here, ρ is the density of the material and c is the heat capacity (for a fluid, c would be the heat capacity at constant pressure, cp); the heat capacity describes how much heat a solid (or fluid element) can store, raising its temperature, as opposed to letting it pass through. k is the thermal conductivity of the solid and describes how easily heat flows through it: a higher k means more heat can flow through the material. For example, insulators have a very low thermal conductivity, while heat sinks (such as CPU heat sinks) have a very large k.

The unsteady term at the LHS of Eq. (1) describes how much heat is accumulated in a control volume as time goes by. On the other hand, the RHS describes how much heat is flowing in the spatial directions.

Fundamentally, the heat equation describes a process called diffusion, which denotes the transfer of "information" (e.g. heat) at the molecular level. This is in contrast to convection, which occurs at the macroscopic level, in the presence of a velocity field associated with fluid flow (I'll explain this more in a future post).

Mathematically, diffusion is almost always described by second derivatives in space (while convection is associated with first derivatives in space). This is clearly visible in Eq. (1) above. A more convenient way of writing the heat equation is obtained by dividing both sides of Eq. (1) by ρc. The outcome is

∂T/∂t = α (∂²T/∂x² + ∂²T/∂y²),   α = k/(ρc)   (2)

The new parameter, α, shown in Eq. (2) is called the thermal diffusivity. It is a measure of whether a material tends to pass heat along or store it: large values of α mean that heat flows through the material very quickly, while small values indicate a tendency for the material to store heat.

In the next post, we will discretize Eq. (2) using explicit forward differencing in time, and central differencing in space.

Table of Contents:

  1. Introducing the Heat Equation
  2. Discretization of the Heat Equation
  3. Stability Analysis and the Courant-Friedrichs-Lewy Criterion
  4. Programming the Heat Equation in C using Visual Studio
  5. MPICH2 and Visual Studio
  6. Parallel Computing in a Nutshell
  7. Measuring Parallel Performance
  8. Essential Ingredients of Message Passing
  9. MPI Warmup
  10. Parallelizing the Heat Equation

Cite as:
Saad, T. "Parallel Computing with MPI - Part I: The Heat Equation". Weblog entry from Please Make A Note. http://pleasemakeanote.blogspot.com/2008/06/parallel-computing-with-mpi-part-i-heat.html