(last update: Wed Dec 14 16:55:03 CET 2016)
In Debian or Ubuntu, we will use the OpenMPI implementation of the MPI standard. OpenMPI is installed on all computers in room G207.
If you wish to install it at home or on your laptop, you will need the following packages:
libopenmpi-dev openmpi-bin openmpi-doc openmpi-checkpoint
We will also need the rpcgen(1) and rpcbind(8) tools, as well as OpenMP support for gcc, available in the following packages:
libc-dev-bin rpcbind libgomp1
The following program says hello from each process (files found here: examples/hello-world):
#include <stdio.h>
#include <mpi.h>

int main (int argc, char **argv)
{
    int rank;
    MPI_Init (&argc, &argv);
    MPI_Comm_rank (MPI_COMM_WORLD, &rank);
    printf ("p%d: hello\n", rank);
    MPI_Finalize ();
    return 0;
}
To compile it, execute the following command:
mpicc -Wall hello.c -o hello
To run it, using 4 processes, all of them running on the local host:
mpirun -n 4 ./hello
It will output something similar to this:
p0: hello
p1: hello
p3: hello
p2: hello
The following Makefile could be useful for your projects (observe that the target all not only compiles the program hello.c but also executes it):
CFLAGS=-Wall
CC=mpicc

all : hello
	mpirun -n 4 ./hello

clean :
	rm -f hello *.o
The tool mpirun can be instructed to launch processes on multiple machines.
It will use ssh to connect to the other machines. It will be convenient to add your public key to the ~/.ssh/authorized_keys file. In the lab, run:
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
(If you don't have a key pair, you will need to create it, run the command ssh-keygen.)
Note that mpirun does not copy the executable to the other machines (in the lab this is done via NFS, which makes your home directory visible on every host).
Important options:
-n N: run N processes
-H HOST1,HOST2,...: run one process in HOST1, one in HOST2, and so on (round-robin assignment)
-hostfile FILE: read from FILE the list of hosts to run the processes. Example:
$ cat FILE
G207-1 slots=5
G207-6 slots=2
Up to 5 processes will execute in G207-1, and up to 2 processes in G207-6. If more than 7 processes are requested via -n, the remaining ones will be assigned round-robin.
For additional details, refer to the mpirun(1) man page.
Running 4 processes, all of them on the local host:
mpirun -n 4 ./hello
Equivalently:
mpirun -n 4 -H localhost ./hello
Running 3 processes, two of them in machine G207-1 and one in G207-10 (the assignment of processes (ranks) to hosts is done round-robin):
mpirun -n 3 -H G207-1,G207-10 ./hello
Running 5 processes, 3 on G207-1 and 2 on G207-2 (note that there is no -n option):
$ cat hosts
G207-1 slots=3
G207-2 slots=2
$ mpirun -hostfile hosts ./hello
Running 3 processes, all of them in G207-1:
mpirun -n 3 -hostfile hosts ./hello
Running 9 processes, (3 + 3) in G207-1 and (2 + 1) in G207-2 (two rounds of assignment, as first one "completed" the slots):
mpirun -n 9 -hostfile hosts ./hello
We will code and experiment with a sequential and a distributed version of Conway's game of life. Click on the link and read the description if you are unfamiliar with the game.
The game board is given by a matrix storing m rows and n columns. Each element of the matrix is a bit, but we will represent it with a value of type short. The matrix is stored in a one-dimensional C array as a consecutive sequence of rows (1st row, 2nd row, ..., last row). The C type of the array is:
short tab[n * m];
You are provided with the following helper functions (code available here):
// prints out the game board
void show (short *tab, int m, int n);

// allocates and randomly initializes a game board; density (from 0 to 1)
// indicates the density of cells initially alive
short *init (int m, int n, float density);

// plays one iteration of the game, updating the board 'tab'
void game (short *tab, int m, int n);

// translates a timeval structure to milliseconds
uint64_t timeval_to_ms (struct timeval *tm);
Use the functions above to write a program seq that will run the game for a given number of iterations. The program will have the following command-line syntax:
./seq ROWS COLS IT
where ROWS and COLS are the dimensions of the board, and IT the number of iterations. The program should display, before it exits, the amount of time used to compute the game. Use gettimeofday(2).
Using a number of columns equal to 10, experimentally find the number of rows and iterations that make your program run for around 1 second. (When measuring, comment out any slow piece of code that displays the board or that artificially introduces delays to improve the readability of the output.)
Using the number of iterations and rows computed in the previous step, plot (using your favourite spreadsheet program, e.g., LibreOffice Calc) the runtime as a function of the number of columns, providing data for at least the following numbers of columns:
2, 4, 6, 8, 10, 15, 20, 25, 30, 40, 50, 60
Should the runtime follow a linear correlation to the number of columns? Can you confirm or disprove it? Do you observe any unexpected behaviour for small numbers of columns? Can you explain it?
We will now design and code a distributed version of the program. The resulting program (call it dist) will receive the same three command-line parameters. It will behave as follows.
Using a board of dimensions 1000x1000 and around 50 iterations, make an educated guess about the speedup of your program when using 4 processes. Only then, experimentally find out the actual speedup.
Using a number of processes equal to the number of CPUs in your machine, experimentally find how the speedup of your program depends on the following factors:
For each one of these three factors, plot the speedup as a function of the variable. What are the values for the numbers of rows, columns, and iterations that give rise to problem instances for which your algorithm achieves the best speedup?
Repeat the previous question using 5 to 10 processes distributed across multiple machines (one per machine). Justify your results.