Introduction and goals
The goal of this tutorial is to explain how to write a basic
topology-aware MPI application. The application is a multi-level,
hierarchical, topology-aware token circulation: it is
topology-aware in the sense that communications are organized
with respect to the underlying physical topology, and
hierarchical in the sense that processes are organized into a
hierarchy of communicators.
This application is intended to be run with the grid-enabled
MPI library QCG-OMPI. It relies on QCG-OMPI-specific attribute
keys to obtain non-standard information, namely the topology
the application is being executed on.
Usual initializations
The process initialization is done as in any MPI application:
MPI_Init( &argc, &argv );
err = MPI_Comm_rank( MPI_COMM_WORLD, &rank );
err |= MPI_Comm_size( MPI_COMM_WORLD, &size );
if( err != MPI_SUCCESS ){
fprintf( stderr, "Initialization error\n" );
MPI_Abort( MPI_COMM_WORLD, 1 );
return EXIT_FAILURE;
}
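The code excerpts below use a number of variables and constants without showing their declarations. Here is a minimal sketch of what the full code presumably declares; the exact values of TAG, TRUE and FALSE are assumptions (any consistent choice works, as long as TRUE and FALSE differ from MPI_UNDEFINED), while ROOT has to be 0, the rank of the main boss:
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

#define ROOT  0   /* the main boss is rank 0 on MPI_COMM_WORLD */
#define TAG   0   /* assumed value; any tag works if used consistently */
#define TRUE  1   /* assumed values; must not collide with MPI_UNDEFINED */
#define FALSE 0

int rank, size, err, flag, i;
int *depths;            /* depths[p]: number of topology levels seen by process p */
int **colors;           /* colors[p][l]: color of process p at level l */
int token = 0;          /* assumed to start at 0 before the circulation */
char hostname[256];     /* host name used in the printouts, filled elsewhere in the full code */
MPI_Status status;
The various communicators (cluster_comm, comm_boss, machine_comm, local_boss_comm) and the ranks and sizes introduced later are declared in the same way.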
Topology discovery
Topology discovery uses the MPI_Attr_get routine with two
QCG-OMPI-specific attribute keys: QCG_TOPOLOGY_DEPTHS and
QCG_TOPOLOGY_COLORS.
The topology is represented as a table of colors. The number
of levels visible to each process is given in an array of
depths, which is also used to allocate the memory that stores
the table of colors.
depths = (int *)malloc( size * sizeof( int ) );
if( MPI_Attr_get( MPI_COMM_WORLD,
QCG_TOPOLOGY_DEPTHS,
&depths,
&flag ) != MPI_SUCCESS ) {
fprintf( stderr, "MPI_Attr_get(depths) failed, aborting\n" );
MPI_Abort( MPI_COMM_WORLD, 1 );
return EXIT_FAILURE;
}
colors = (int **)malloc( size * sizeof( int* ) );
for( i = 0 ; i < size ; i++ ) {
colors[i] = (int *)malloc( depths[i] * sizeof( int ) );
}
if ( MPI_Attr_get( MPI_COMM_WORLD,
QCG_TOPOLOGY_COLORS,
&colors,
&flag ) != MPI_SUCCESS ) {
fprintf( stderr, "MPI_Attr_get(colors) failed, aborting\n" );
MPI_Abort( MPI_COMM_WORLD, 1 );
return EXIT_FAILURE;
}
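MPI_Attr_get also sets flag to indicate whether the requested attribute was actually found on the communicator. A cautious version, sketched here (not part of the original excerpt), could check it after each call:
if( !flag ) {
    fprintf( stderr, "QCG topology attributes not available, aborting\n" );
    MPI_Abort( MPI_COMM_WORLD, 1 );
    return EXIT_FAILURE;
}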
Display the colors
If you want to display the table of colors, you can add a few
printf calls, executed by the ROOT process only.
if( ROOT == rank ) {
for( i = 0 ; i < size ; i++ ){
fprintf( stdout, "%d\t", i );
}
fprintf( stdout, "\n" );
for( i = 0 ; i < size ; i++ ){
fprintf( stdout, "%d\t", colors[i][0] );
}
fprintf( stdout, "\n" );
for( i = 0 ; i < size ; i++ ){
fprintf( stdout, "%d\t", colors[i][1] );
}
fprintf( stdout, "\n" );
for( i = 0 ; i < size ; i++ ){
fprintf( stdout, "%d\t", colors[i][2] );
}
fprintf( stdout, "\n\n" );
}
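As an illustration, on a purely hypothetical deployment of 8 processes over 2 clusters of 2 machines each (the values below are invented to show the convention; the real ones depend on where the job is actually mapped), the output could look like this:
0	1	2	3	4	5	6	7
0	0	0	0	0	0	0	0
0	0	0	0	1	1	1	1
0	0	1	1	2	2	3	3
Each column is a process: the first row of colors is the same for everybody (all processes belong to MPI_COMM_WORLD), the second row identifies the cluster, and the third row identifies the machine.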
Construction of the cluster-level communicators
Split the upper-level communicator (MPI_COMM_WORLD) into one
communicator per cluster. For this, you will have to use the
table of colors obtained with MPI_Attr_get. The first row, by
convention, represents the global communicator
(MPI_COMM_WORLD), so you will use the second row.
MPI_Comm_split( MPI_COMM_WORLD, colors[rank][1], 0, &cluster_comm );
Here we introduce the notion of a boss. A boss is simply the
"leader" of a communicator. There are two possibilities:
- Take the process with the lowest global rank in each
cluster. To know whether a process is a boss before the
corresponding communicator is actually built, compare its
color with the previous process's color: if they differ, the
process is the first one of its cluster and therefore the boss.
if( ROOT == rank )
boss = TRUE;
else {
if ( colors[rank][1] == colors[rank-1][1] )
boss = MPI_UNDEFINED;
else
boss = TRUE;
}
- Build the communicator first and get your rank in it with
MPI_Comm_rank(). If that rank is 0, you are the boss.
MPI_Comm_rank( cluster_comm, &cluster_rank );
if( ROOT == cluster_rank )
boss = TRUE;
else {
boss = MPI_UNDEFINED;
}
In both cases, a boss sets boss to TRUE and every other
process sets it to MPI_UNDEFINED. This value will be used as
the color of a new MPI_Comm_split(): only the processes whose
boss is set to TRUE end up in a valid communicator.
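A minimal sketch of what this looks like from the caller's side (new_comm is a hypothetical name, not a variable of the tutorial): processes that passed MPI_UNDEFINED receive MPI_COMM_NULL instead of a valid handle, and can test for it before using the communicator.
MPI_Comm new_comm;
MPI_Comm_split( MPI_COMM_WORLD, boss, 0, &new_comm );
if( MPI_COMM_NULL == new_comm ) {
    /* boss was MPI_UNDEFINED: this process is not part of new_comm */
}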
When you want two clusters to communicate with each other, you
will want to use a specific communicator containing all the
same-level bosses. So you will build a communicator with all
the cluster bosses.
MPI_Comm_split( MPI_COMM_WORLD, boss, 0, &comm_boss );
MPI_Comm_size( cluster_comm, &cluster_size );
MPI_Comm_rank( cluster_comm, &cluster_rank );
cluster_boss = ( ROOT == cluster_rank ) ? TRUE : FALSE;
if( TRUE == cluster_boss ) {
MPI_Comm_size( comm_boss, &nb_boss );
MPI_Comm_rank( comm_boss, &top_boss_rank );
}
Construction of the machine-level communicators
This is pretty much the same as for the cluster-level, except
that we are now splitting the cluster communicators into
machine-level communicators.
MPI_Comm_split( cluster_comm, colors[rank][2], 0, &machine_comm );
MPI_Comm_size( machine_comm, &nb_cores );
MPI_Comm_rank( machine_comm, &local_rank );
You can also build a communicator between machine-level
bosses of a given cluster.
machine_boss = ( ROOT == local_rank ) ? TRUE : FALSE;
MPI_Comm_split( cluster_comm, machine_boss, 0, &local_boss_comm );
if( TRUE == machine_boss ){
MPI_Comm_size( local_boss_comm, &nb_machines );
MPI_Comm_rank( local_boss_comm, &boss_rank );
}
Local storage of the communicator handles
For convenience, you can store the communicators you just
built in a single table, which makes subsequent accesses
easier.
MPI_Comm* comms = (MPI_Comm*) malloc( 4 * sizeof( MPI_Comm ) );
comms[0] = comm_boss;       // cluster bosses (top level)
comms[1] = cluster_comm;    // all the processes of a cluster
comms[2] = local_boss_comm; // machine bosses within a cluster
comms[3] = machine_comm;    // all the processes of a machine
Token circulation: between cores
This is where the serious business starts: the
communications. First, we take care of the lowest-level
communications, between the cores of a given machine.
If you are just a random core, not a machine boss, simply receive
the token from one neighbor and forward it to your other neighbor.
if( FALSE == machine_boss ){ // just a random core
    left  = ( local_rank + nb_cores - 1 ) % nb_cores;
    right = ( local_rank + 1 ) % nb_cores;
    MPI_Recv( &token, 1, MPI_INT, left, TAG, comms[3], &status );
    ++token;
    fprintf( stdout, "Hello! I am proc %d on core %d/%d. My hostname is %s. Token: %d\n",
             rank, local_rank, nb_cores, hostname, token );
    MPI_Send( &token, 1, MPI_INT, right, TAG, comms[3] );
}
Token circulation: between machines
If you are a machine boss, you have to receive the token from
another machine boss.
if( FALSE == cluster_boss ) { // machine boss
left = (boss_rank + nb_machines -1 ) % nb_machines;
right = (boss_rank + 1 ) % nb_machines;
lleft = ( local_rank + nb_cores - 1 ) % nb_cores;
MPI_Recv( &token, 1, MPI_INT, left, TAG, comms[2], &status );
++token;
fprintf( stdout, "Hello! I am proc %d on core %d/%d on machine %d/%d. My hostname is %s. Token: %d\n",
rank, local_rank, nb_cores, boss_rank, nb_machines, hostname, token );
Then you circulate it throughout your machine by sending it to
one neighbor on the machine-level communicator, and you wait
for it to complete its circulation, i.e., until you receive it
back from your other neighbor.
if( 1 < nb_cores ){
    MPI_Send( &token, 1, MPI_INT, 1, TAG, comms[3] );
    MPI_Recv( &token, 1, MPI_INT, lleft, TAG, comms[3], &status );
}
Once this is all done, you forward it to the next machine boss.
MPI_Send( &token, 1, MPI_INT, right, TAG, comms[2] );
}
Token circulation: between clusters
If you are a cluster boss, you have to receive the token from
another cluster, on the communicator that "links" all the
cluster bosses. Then you circulate the token in your own
machine and in your own cluster. Once this is all done, you
forward it to another cluster boss.
if( ROOT != rank ) {// cluster boss
left = (top_boss_rank + nb_boss -1 ) % nb_boss;
right = (top_boss_rank + 1 ) % nb_boss;
lleft = ( local_rank + nb_cores - 1 ) % nb_cores;
cleft = ( cluster_rank + nb_machines - 1 ) % nb_machines;
MPI_Recv( &token, 1, MPI_INT, left, TAG, comms[0], &status );
++token;
fprintf( stdout, "Hello! I am proc %d on core %d/%d on machine %d/%d in cluster %d/%d. My hostname is %s. Token: %d\n",
rank, local_rank, nb_cores, cluster_rank, nb_machines, top_boss_rank, nb_boss, hostname, token );
if( 1 < nb_cores ){
MPI_Send( &token, 1, MPI_INT, 1, TAG, comms[3] );
MPI_Recv( &token, 1, MPI_INT, lleft, TAG, comms[3], &status );
}
if( 1 != nb_machines ) {
MPI_Send( &token, 1, MPI_INT, 1, TAG, comms[2] );
MPI_Recv( &token, 1, MPI_INT, cleft, TAG, comms[2], &status );
}
MPI_Send( &token, 1, MPI_INT, right, TAG, comms[0] );
}
If you are the main boss, i.e. rank 0 on the MPI_COMM_WORLD
communicator, you do almost the same thing. The difference is
that you initiate the token circulation at each level (on your
machine, among the machine bosses of your cluster, and finally
on the comm_boss communicator), so each time you first send
the token and end the circulation by receiving it back.
lleft = ( local_rank + nb_cores - 1 ) % nb_cores;
left = ( boss_rank + nb_machines - 1 ) % nb_machines;
bleft = ( top_boss_rank + nb_boss - 1 ) % nb_boss;
fprintf( stdout, "Hello! I am proc %d on core %d/%d on machine %d/%d in cluster %d/%d. My hostname is %s. Token: %d\n",
rank, local_rank, nb_cores, cluster_rank, nb_machines, top_boss_rank, nb_boss, hostname, token );
MPI_Send( &token, 1, MPI_INT, 1, TAG, comms[3] );
MPI_Recv( &token, 1, MPI_INT, lleft, TAG, comms[3], &status );
MPI_Send( &token, 1, MPI_INT, 1, TAG, comms[2] );
MPI_Recv( &token, 1, MPI_INT, left, TAG, comms[2], &status );
MPI_Send( &token, 1, MPI_INT, 1, TAG, comms[0] );
MPI_Recv( &token, 1, MPI_INT, bleft, TAG, comms[0], &status );
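If the token starts at zero (as in the declaration sketch above) and every other process increments it exactly once, as in the excerpts, it comes back to the main boss holding the value size - 1. A small, purely optional sanity check before finalizing, sketched here under those assumptions, makes this easy to verify:
if( ROOT == rank && token != size - 1 )
    fprintf( stderr, "Unexpected token value: %d (expected %d)\n", token, size - 1 );
MPI_Finalize( );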
Full code and execution
Here you can find the full code.