Algorithmes Parallèles et Distribués -- Partie MPI

(last update: Wed Dec 14 16:55:03 CET 2016)

1   This course

Instructors:

2   Installing MPI, rpcgen, and OpenMP

In Debian or Ubuntu, we will use the OpenMPI implementation of the MPI standard. OpenMPI is installed on all computers at the room G207.

If you wish to install it at home or in your laptop, you will need to following packages:

libopenmpi-dev
openmpi-bin
openmpi-doc
openmpi-checkpoint

We will also need the rpcgen(1) and rpcbind(8) tools, as well as OpenMP support for gcc, available in the following packages:

libc-dev-bin
rpcbind
libgomp1

3   Getting Help

4   Hello World

The following program says hello from each process (files found here: examples/hello-world):

#include <stdio.h>
#include <mpi.h>

int main (int argc, char ** argv)
{
   int rank;

   MPI_Init (&argc, & argv);

   MPI_Comm_rank (MPI_COMM_WORLD, &rank);
   printf ("p%d: hello\n", rank);

   MPI_Finalize ();
   return 0;
}

To compile it, execute the following command:

mpicc -Wall hello.c -o hello

To run it, using 4 processes, all of them running on the local host:

mpirun -n 4 ./hello

It will output something similar to this:

p0: hello
p1: hello
p3: hello
p2: hello

The following Makefile could be useful for your projects (obseve that the target all not only compiles the program hello.c but it also executes it):

CFLAGS=-Wall
CC=mpicc

all : hello
       mpirun -n 4 ./hello

clean :
       rm -f hello *.o

5   Running on Multiple Hosts

Running 4 processes, all of them on the local host:

mpirun -n 4 ./hello

Equivalenly:

mpirun -n 4 -H localhost ./hello

Running 3 processes, two of them in machine G207-1 and one in G207-10 (the assignment of processes (ranks) to hosts is done round-robin):

mpirun -n 3 -H G207-1,G207-10 ./hello

Running 5 processes, 3 on G207-1 and 2 on G207-2 (note that there is no -n option):

$ cat hosts
G207-1  slots=3
G207-2  slots=2
$ mpirun -hostfile hosts ./hello

Running 3 processes, all of them in G207-1:

mpirun -n 3 -hostfile hosts ./hello

Running 9 processes, (3 + 3) in G207-1 and (2 + 1) in G207-2 (two rounds of assignment, as first one "completed" the slots):

mpirun -n 9 -hostfile hosts ./hello

6   Conway's Game of Life

We will code and experiment with a sequential and a distributed version of Conway's game of life. Click on the link and read the description if you are unfamiliar with the game.

The game board is given by a matrix storing m rows and n columns. Each element of the matrix is a bit, but we will represent it with a value of type short. The matrix is stored in a 1-dimension C array as a consecutive sequence of rows (1st row, 2nd row, ..., last row). The C type of the array is:

short tab[n * m];

You are provided with the following helper functions (code available here):

// prints out the game board
void show (short* tab, int m, int n);

// allocates and randomly initializes a game board; density (from 0 to 1)
// indicates the density of cells initially alive
short *init (int m, int n, float density);

// plays one iteration of the game, updating the board 'tab'
void game (short* tab, int m, int n);

// translates a timeval structure to milliseconds
uint64_t timeval_to_ms (struct timeval *tm);
  1. Use the functions above to write a program seq that will run the game for a given number of iterations. The program will have the following command-line syntax:

    ./seq ROWS COLS IT
    

    where ROWS and COLS are the dimensions of the board, and IT the number of iterations. The program should display, before it exits, the amount of time used to compute the game. Use gettimeofday(2).

  2. Using a number of columns equal to 10, experimentally find out the number of rows and iterations that make your program run in around 1 second. (Comment out any (slow) piece of code that displays the board or artificially introduces delays to improve readability of the output.)

  3. Using the number of iterations and rows computed in the previous step, plot (using your favourite spreadsheet program, e.g., Libreoffice Calc) the runtime as a function of the number columns, providing data for at least the following numbers of columns:

    2, 4, 6, 8, 10, 15, 20, 25, 30, 40, 50, 60
    

    Should the runtime follow a linear correlation to the number of columns? Can you confirm or disprove it? Do you observe any unexpected behaviour for small numbers of columns? Can you explain it?

  4. We will now design and code a distributed version of the program. The resulting program (call it dist) will receive the same three command-line parameters. It will behave as follows.

    1. The process with rank 0 will parse the commandline options and find out the rows, columns, and number of iterations. It will then broadcast these to the remaining processes. Use only one collective call to distribute all the three parameters.
    2. The decomposition and distribution of the board will be done by rows (horizontal bands). Initially, each process will allocate and randomly initialize the rows it handles. No process will ever store the full game board during the simulation, each process will only store a fragment of its rows.
    3. Before every iteration of the game, each process will have to first obtain the ghost region from the neighbor processes. Think well (and confirm with your instructors if necessary) what is the ghost region for each process.
    4. After the entire simulation finishes, assemble the complete board in process 0 and print it.
    5. Your program should also print the time spent in steps (a) to (c).
  5. Using a board of dimensions 1000x1000 and around 50 iterations, make an educated guess about the speedup of your program when using 4 processes. Only after it, experimentally find out the actual speedup.

  6. Using a number of processes equal to the number of CPUs in your machine, experimentally find how the speedup of your program depends on the following factors:

    • number of rows
    • number of columns
    • number of iterations

    For each one of these three factors, plot the speedup as a function of the variable. What are the values for the numbers of rows, columns, and iterations that give rise to problem instances for which your algorithm achieves the best speedup?

  7. Repeat the previous question using 5 to 10 processes distributed across multiple machines (one per machine). Justify your results.