Introduction and goals

The goal of this tutorial is to explain how to write a basic topology-aware MPI application. This application is a multi-level, hierarchical, topology-aware token circulation. It is topology-aware in the sense that communications are organized with respect to the underlying physical topology. It is hierarchical in the sense that processes are organized into a hierarchy of communicators.

This application is intended to be run with the grid-enabled MPI library QCG-OMPI. It uses QCG-OMPI-specific run-time attributes to obtain non-standard information, namely the topology on which the application is being executed.

Usual initializations

The processes are initialized as in any MPI application:

MPI_Init( &argc, &argv );

// Get this process's rank and the total number of processes
err = MPI_Comm_rank( MPI_COMM_WORLD, &rank );
err |= MPI_Comm_size( MPI_COMM_WORLD, &size );
if( err != MPI_SUCCESS ){
    fprintf( stderr, "Initialization error\n" );
    MPI_Abort( MPI_COMM_WORLD, 1 );
    return EXIT_FAILURE;
}
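For reference, the snippets in this tutorial rely on a handful of variables and constants that are declared elsewhere in the full code. A minimal sketch of those declarations is given below; the names match the ones used throughout the tutorial, but the exact values of ROOT, TAG, TRUE and FALSE, as well as the way hostname is filled in, are assumptions.

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

#define ROOT  0
#define TAG   0
#define TRUE  1
#define FALSE 0

int rank, size, err, flag, i;
int *depths;   // number of topology levels seen by each process
int **colors;  // table of colors: one row per process, one entry per level
int token;     // the token that will circulate between the processes
char hostname[MPI_MAX_PROCESSOR_NAME]; // filled, e.g., with gethostname()

MPI_Status status;
MPI_Comm cluster_comm, comm_boss, machine_comm, local_boss_comm;

int cluster_size, cluster_rank, cluster_boss, nb_boss, top_boss_rank;
int nb_cores, local_rank, machine_boss, nb_machines, boss_rank;
int left, right, lleft, cleft, bleft;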
Topology discovery

Topology discovery uses the MPI_Attr_get routine with two QCG-OMPI-specific attribute keys: QCG_TOPOLOGY_DEPTHS and QCG_TOPOLOGY_COLORS.

The topology is represented by a table of colors, with one row per process and one entry per topology level. The number of levels visible to each process is given in an array of depths, which is also used to allocate the memory for the table of colors.

depths = (int *)malloc( size * sizeof( int ) );

if( MPI_Attr_get( MPI_COMM_WORLD,
                  QCG_TOPOLOGY_DEPTHS,
                  &depths,
                  &flag ) != MPI_SUCCESS ) {
    fprintf( stderr, "MPI_Attr_get(depths) failed, aborting\n" );
    MPI_Abort( MPI_COMM_WORLD, 1 );
    return EXIT_FAILURE;
}

colors = (int **)malloc( size * sizeof( int* ) );

for( i = 0 ; i < size ; i++ ) {
    colors[i] = (int *)malloc( depths[i] * sizeof( int ) );
}

if( MPI_Attr_get( MPI_COMM_WORLD,
                  QCG_TOPOLOGY_COLORS,
                  &colors,
                  &flag ) != MPI_SUCCESS ) {
    fprintf( stderr, "MPI_Attr_get(colors) failed, aborting\n" );
    MPI_Abort( MPI_COMM_WORLD, 1 );
    return EXIT_FAILURE;
}
Display the colors

If you want to display the table of colors, you can add a few print statements and have the ROOT process do the printing.

if( ROOT == rank ) {
    // Header row: the rank of each process
    for( i = 0 ; i < size ; i++ ){
        fprintf( stdout, "%d\t", i );
    }
    fprintf( stdout, "\n" );
    // Level 0: global (grid) color
    for( i = 0 ; i < size ; i++ ){
        fprintf( stdout, "%d\t", colors[i][0] );
    }
    fprintf( stdout, "\n" );
    // Level 1: cluster color
    for( i = 0 ; i < size ; i++ ){
        fprintf( stdout, "%d\t", colors[i][1] );
    }
    fprintf( stdout, "\n" );
    // Level 2: machine color (this assumes a three-level topology)
    for( i = 0 ; i < size ; i++ ){
        fprintf( stdout, "%d\t", colors[i][2] );
    }
    fprintf( stdout, "\n\n" );
}
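As a purely illustrative example, on a hypothetical run with 8 processes spread over 2 clusters, each made of 2 machines with 2 cores, the printed table could look like the following (the actual color values are assigned by QCG-OMPI and may differ):

0       1       2       3       4       5       6       7
0       0       0       0       0       0       0       0
0       0       0       0       1       1       1       1
0       0       1       1       2       2       3       3

The first line lists the ranks; all processes share the same color at the first (grid) level, processes of the same cluster share a color at the second level, and processes of the same machine share a color at the third level.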
Construction of the cluster-level communicators

Split the upper-level communicator (MPI_COMM_WORLD) into one communicator per cluster. For this, you use the table of colors obtained with MPI_Attr_get. The first level (colors[rank][0]) represents, by convention, the global communicator (MPI_COMM_WORLD), so you use the second level (colors[rank][1]).

MPI_Comm_split( MPI_COMM_WORLD, colors[rank][1], 0, &cluster_comm );
Here we introduce the notion of boss. A boss is simply the "leader" of a communicator. There are two possibilities:

If we are a boss, set boss to TRUE; otherwise, set it to MPI_UNDEFINED. This value is used as the color when building a new communicator: MPI_Comm_split() returns a valid communicator only for the processes whose color is not MPI_UNDEFINED, and MPI_COMM_NULL for the others. (In the snippets below, FALSE is used instead of MPI_UNDEFINED; non-bosses then simply get a second communicator of their own, which they never use. A variant with MPI_UNDEFINED is sketched after the next code block.)

When two clusters need to communicate with each other, they do so through a specific communicator containing all the same-level bosses. So you build a communicator gathering all the cluster bosses.

MPI_Comm_size( cluster_comm, &cluster_size );
MPI_Comm_rank( cluster_comm, &cluster_rank );

// The cluster boss is the process of rank ROOT in its cluster communicator
( ROOT == cluster_rank ) ? ( cluster_boss = TRUE ) : ( cluster_boss = FALSE );

// Communicator gathering all the cluster bosses
MPI_Comm_split( MPI_COMM_WORLD, cluster_boss, 0, &comm_boss );

if( TRUE == cluster_boss ) {
    MPI_Comm_size( comm_boss, &nb_boss );
    MPI_Comm_rank( comm_boss, &top_boss_rank );
}
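For completeness, here is a minimal sketch of the MPI_UNDEFINED variant mentioned above. It assumes the same cluster_rank, ROOT and TRUE definitions as the rest of the tutorial; with this variant, non-bosses receive MPI_COMM_NULL instead of an unused communicator of their own.

int boss_color = ( ROOT == cluster_rank ) ? TRUE : MPI_UNDEFINED;
MPI_Comm_split( MPI_COMM_WORLD, boss_color, 0, &comm_boss );

if( MPI_COMM_NULL != comm_boss ) {
    // Only the cluster bosses reach this point
    MPI_Comm_size( comm_boss, &nb_boss );
    MPI_Comm_rank( comm_boss, &top_boss_rank );
}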
Construction of the machine-level communicators

This is pretty much the same as for the cluster level, except that we now split each cluster communicator into machine-level communicators.

MPI_Comm_split( cluster_comm, colors[rank][2], 0, &machine_comm );

MPI_Comm_size( machine_comm, &nb_cores );
MPI_Comm_rank( machine_comm, &local_rank );

You can also build a communicator between machine-level bosses of a given cluster.

( ROOT == local_rank ) ? ( machine_boss = TRUE ) : ( machine_boss = FALSE );

// Communicator gathering the machine bosses of this cluster
MPI_Comm_split( cluster_comm, machine_boss, 0, &local_boss_comm );

if( TRUE == machine_boss ){
    MPI_Comm_size( local_boss_comm, &nb_machines );
    MPI_Comm_rank( local_boss_comm, &boss_rank );
}
Local storage of the communicator handles

For convenience, you can store the communicators you just built in a single table, to make further accesses easier.

MPI_Comm* comms = (MPI_Comm*) malloc( 4 * sizeof( MPI_Comm ) );
comms[0] = comm_boss;       // grid level: all the cluster bosses
comms[1] = cluster_comm;    // cluster level: all the processes of a cluster
comms[2] = local_boss_comm; // cluster level: the machine bosses of a cluster
comms[3] = machine_comm;    // machine level: all the processes of a machine
Token circulation: between cores

This is where the serious business starts: communications. First, we take care of the lowest-level communications, between the cores of a given machine.

If you are just a regular core, not a machine boss, you simply receive the token from one neighbor and forward it to your other neighbor.

if( FALSE == machine_boss ){ // just a regular core
    left  = ( local_rank + nb_cores - 1 ) % nb_cores;
    right = ( local_rank + 1 )            % nb_cores;

    MPI_Recv( &token, 1, MPI_INT, left, TAG, comms[3], &status );
    ++token;
    fprintf( stdout, "Hello! I am proc %d on core %d/%d. My hostname is %s. Token: %d\n",
             rank, local_rank, nb_cores, hostname, token );
    MPI_Send( &token, 1, MPI_INT, right, TAG, comms[3] );
}
Token circulation: between machines

If you are a machine boss (but not a cluster boss), you first have to receive the token from another machine boss of your cluster.

if( TRUE == machine_boss && FALSE == cluster_boss ) { // machine boss, but not a cluster boss
    left  = ( boss_rank + nb_machines - 1 ) % nb_machines;
    right = ( boss_rank + 1 )               % nb_machines;
    lleft = ( local_rank + nb_cores - 1 )   % nb_cores;

    MPI_Recv( &token, 1, MPI_INT, left, TAG, comms[2], &status );
    ++token;
    fprintf( stdout, "Hello! I am proc %d on core %d/%d on machine %d/%d. My hostname is %s. Token: %d\n",
             rank, local_rank, nb_cores, boss_rank, nb_machines, hostname, token );

Then you circulate it throughout your machine by sending it to one neighbor on the machine-level communicator, and you wait for it to complete its circulation, i.e., until you receive it back from your other neighbor.

    if( 1 < nb_cores ){
        MPI_Send( &token, 1, MPI_INT, 1,     TAG, comms[3] );
        MPI_Recv( &token, 1, MPI_INT, lleft, TAG, comms[3], &status );
    }

Once this is all done, you forward it to the next machine boss.

    MPI_Send( &token, 1, MPI_INT, right, TAG, comms[2] );
}
Token circulation: between clusters

If you are a cluster boss, you have to receive the token from another cluster boss, on the communicator that "links" all the cluster bosses. Then you circulate the token within your own machine and within your own cluster. Once this is all done, you forward it to the next cluster boss.

if( TRUE == cluster_boss && ROOT != rank ) { // cluster boss, but not the main boss
    left  = ( top_boss_rank + nb_boss - 1 ) % nb_boss;
    right = ( top_boss_rank + 1 )           % nb_boss;
    lleft = ( local_rank + nb_cores - 1 )   % nb_cores;
    cleft = ( boss_rank + nb_machines - 1 ) % nb_machines;

    MPI_Recv( &token, 1, MPI_INT, left, TAG, comms[0], &status );
    ++token;
    fprintf( stdout, "Hello! I am proc %d on core %d/%d on machine %d/%d in cluster %d/%d. My hostname is %s. Token: %d\n",
             rank, local_rank, nb_cores, boss_rank, nb_machines, top_boss_rank, nb_boss, hostname, token );

    if( 1 < nb_cores ){
        MPI_Send( &token, 1, MPI_INT, 1,     TAG, comms[3] );
        MPI_Recv( &token, 1, MPI_INT, lleft, TAG, comms[3], &status );
    }

    if( 1 < nb_machines ) {
        MPI_Send( &token, 1, MPI_INT, 1,     TAG, comms[2] );
        MPI_Recv( &token, 1, MPI_INT, cleft, TAG, comms[2], &status );
    }

    MPI_Send( &token, 1, MPI_INT, right, TAG, comms[0] );
}

If you are the main boss, i.e., rank 0 (ROOT) on the MPI_COMM_WORLD communicator, you do almost the same thing. The difference is that you initiate the token circulation: on each communicator, you first send the token, and you end the circulation by receiving it back.

token = 0; // the main boss initiates the token

lleft = ( local_rank + nb_cores - 1 )   % nb_cores;
left  = ( boss_rank + nb_machines - 1 ) % nb_machines;
bleft = ( top_boss_rank + nb_boss - 1 ) % nb_boss;

fprintf( stdout, "Hello! I am proc %d on core %d/%d on machine %d/%d in cluster %d/%d. My hostname is %s. Token: %d\n",
         rank, local_rank, nb_cores, boss_rank, nb_machines, top_boss_rank, nb_boss, hostname, token );

// Circulation inside the machine
MPI_Send( &token, 1, MPI_INT, 1,     TAG, comms[3] );
MPI_Recv( &token, 1, MPI_INT, lleft, TAG, comms[3], &status );

// Circulation between the machine bosses of the cluster
MPI_Send( &token, 1, MPI_INT, 1,     TAG, comms[2] );
MPI_Recv( &token, 1, MPI_INT, left,  TAG, comms[2], &status );

// Circulation between the cluster bosses
MPI_Send( &token, 1, MPI_INT, 1,     TAG, comms[0] );
MPI_Recv( &token, 1, MPI_INT, bleft, TAG, comms[0], &status );
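At this point, the token has visited every process. Since the token starts at 0 and each process other than the main boss increments it exactly once when it first receives it, the value that comes back to the main boss should be size - 1. A small sanity check along these lines (not part of the original code) can be added here:

// Every process except the main boss increments the token exactly once,
// so it should come back equal to size - 1.
if( token != size - 1 ) {
    fprintf( stderr, "Unexpected token value: %d (expected %d)\n", token, size - 1 );
}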
Full code and execution

Here you can find the full code.
