Introduction and goals
The goal of this tutorial is to explain how to write a basic
topology-aware MPI application. The application is a multi-level,
hierarchical, topology-aware token circulation: it is
topology-aware in the sense that communications are organized
with respect to the underlying physical topology, and
hierarchical in the sense that processes are organized into a
hierarchy of communicators.
This application is intended to be run with the grid-enabled
MPI library QCG-OMPI. It relies on QCG-OMPI-specific attribute
keys to obtain non-standard information, namely the topology
the application is being executed on.
Usual initializations
The process initialization is done as in any MPI application:
MPI_Init( &argc, &argv );
err = MPI_Comm_rank( MPI_COMM_WORLD, &rank );
err |= MPI_Comm_size( MPI_COMM_WORLD, &size );
if( err != MPI_SUCCESS ){
fprintf( stderr, "Initialization error\n" );
MPI_Abort( MPI_COMM_WORLD, 1 );
return EXIT_FAILURE;
}
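The code excerpts below use a number of variables and constants without showing their declarations. Here is a minimal sketch of what the full code presumably declares; the exact values of TAG, TRUE and FALSE are assumptions (any consistent choice works, as long as TRUE and FALSE differ from MPI_UNDEFINED), while ROOT has to be 0, the rank of the main boss:
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

#define ROOT  0   /* the main boss is rank 0 on MPI_COMM_WORLD */
#define TAG   0   /* assumed value; any tag works if used consistently */
#define TRUE  1   /* assumed values; must not collide with MPI_UNDEFINED */
#define FALSE 0

int rank, size, err, flag, i;
int *depths;            /* depths[p]: number of topology levels seen by process p */
int **colors;           /* colors[p][l]: color of process p at level l */
int token = 0;          /* assumed to start at 0 before the circulation */
char hostname[256];     /* host name used in the printouts, filled elsewhere in the full code */
MPI_Status status;
The various communicators (cluster_comm, comm_boss, machine_comm, local_boss_comm) and the ranks and sizes introduced later are declared in the same way.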
Topology discovery
Topology discovery uses the MPI_Attr_get routine with two
QCG-OMPI-specific attribute keys: QCG_TOPOLOGY_DEPTHS and
QCG_TOPOLOGY_COLORS.
The topology is represented as a table of colors. The number
of levels visible to each process is given in an array of
depths, which is also used to allocate the memory that stores
the table of colors.
depths = (int *)malloc( size * sizeof( int ) );
if( MPI_Attr_get( MPI_COMM_WORLD,
QCG_TOPOLOGY_DEPTHS,
&depths,
&flag ) != MPI_SUCCESS ) {
fprintf( stderr, "MPI_Attr_get(depths) failed, aborting\n" );
MPI_Abort( MPI_COMM_WORLD, 1 );
return EXIT_FAILURE;
}
colors = (int **)malloc( size * sizeof( int* ) );
for( i = 0 ; i < size ; i++ ) {
colors[i] = (int *)malloc( depths[i] * sizeof( int ) );
}
if ( MPI_Attr_get( MPI_COMM_WORLD,
QCG_TOPOLOGY_COLORS,
&colors,
&flag ) != MPI_SUCCESS ) {
fprintf( stderr, "MPI_Attr_get(colors) failed, aborting\n" );
MPI_Abort( MPI_COMM_WORLD, 1 );
return EXIT_FAILURE;
}
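MPI_Attr_get also sets flag to indicate whether the requested attribute was actually found on the communicator. A cautious version, sketched here (not part of the original excerpt), could check it after each call:
if( !flag ) {
    fprintf( stderr, "QCG topology attributes not available, aborting\n" );
    MPI_Abort( MPI_COMM_WORLD, 1 );
    return EXIT_FAILURE;
}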
Display the colors
If you want to display the table of colors, you can add a few
printf calls, executed by the ROOT process only.
if( ROOT == rank ) {
for( i = 0 ; i < size ; i++ ){
fprintf( stdout, "%d\t", i );
}
fprintf( stdout, "\n" );
for( i = 0 ; i < size ; i++ ){
fprintf( stdout, "%d\t", colors[i][0] );
}
fprintf( stdout, "\n" );
for( i = 0 ; i < size ; i++ ){
fprintf( stdout, "%d\t", colors[i][1] );
}
fprintf( stdout, "\n" );
for( i = 0 ; i < size ; i++ ){
fprintf( stdout, "%d\t", colors[i][2] );
}
fprintf( stdout, "\n\n" );
}
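As an illustration, on a purely hypothetical deployment of 8 processes over 2 clusters of 2 machines each (the values below are invented to show the convention; the real ones depend on where the job is actually mapped), the output could look like this:
0	1	2	3	4	5	6	7
0	0	0	0	0	0	0	0
0	0	0	0	1	1	1	1
0	0	1	1	2	2	3	3
Each column is a process: the first row of colors is the same for everybody (all processes belong to MPI_COMM_WORLD), the second row identifies the cluster, and the third row identifies the machine.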
Construction of the cluster-level communicators
Split the upper-level communicator (MPI_COMM_WORLD) into one
communicator per cluster. For this, you will have to use the
table of colors obtained with MPI_Attr_get. The first row, by
convention, represents the global communicator
(MPI_COMM_WORLD), so you will use the second row.
MPI_Comm_split( MPI_COMM_WORLD, colors[rank][1], 0, &cluster_comm );
Here we introduce the notion of a boss. A boss is simply the
"leader" of a communicator. There are two possibilities:
- Take the process with the lowest global rank in each
cluster. To know whether a process is a boss before the
corresponding communicator is actually built, compare its
color with the previous process's color: if they differ, the
process is the first one of its cluster and therefore the boss.
if( ROOT == rank )
boss = TRUE;
else {
if ( colors[rank][1] == colors[rank-1][1] )
boss = MPI_UNDEFINED;
else
boss = TRUE;
}
- Build the communicator first and get your rank in it with
MPI_Comm_rank(). If that rank is 0, you are the boss.
MPI_Comm_rank( cluster_comm, &cluster_rank );
if( ROOT == cluster_rank )
boss = TRUE;
else {
boss = MPI_UNDEFINED;
}
In both cases, a boss sets boss to TRUE and every other
process sets it to MPI_UNDEFINED. This value will be used as
the color of a new MPI_Comm_split(): only the processes whose
boss is set to TRUE end up in a valid communicator.
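A minimal sketch of what this looks like from the caller's side (new_comm is a hypothetical name, not a variable of the tutorial): processes that passed MPI_UNDEFINED receive MPI_COMM_NULL instead of a valid handle, and can test for it before using the communicator.
MPI_Comm new_comm;
MPI_Comm_split( MPI_COMM_WORLD, boss, 0, &new_comm );
if( MPI_COMM_NULL == new_comm ) {
    /* boss was MPI_UNDEFINED: this process is not part of new_comm */
}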
When you want two clusters to communicate with each other, you
will want to use a specific communicator containing all the
same-level bosses. So you will build a communicator with all
the cluster bosses.
MPI_Comm_split( MPI_COMM_WORLD, boss, 0, &comm_boss );
MPI_Comm_size( cluster_comm, &cluster_size );
MPI_Comm_rank( cluster_comm, &cluster_rank );
cluster_boss = ( ROOT == cluster_rank ) ? TRUE : FALSE;
if( TRUE == cluster_boss ) {
MPI_Comm_size( comm_boss, &nb_boss );
MPI_Comm_rank( comm_boss, &top_boss_rank );
}
Construction of the machine-level communicators
This is pretty much the same as for the cluster-level, except
that we are now splitting the cluster communicators into
machine-level communicators.
MPI_Comm_split( cluster_comm, colors[rank][2], 0, &machine_comm );
MPI_Comm_size( machine_comm, &nb_cores );
MPI_Comm_rank( machine_comm, &local_rank );
You can also build a communicator between machine-level
bosses of a given cluster.
machine_boss = ( ROOT == local_rank ) ? TRUE : FALSE;
MPI_Comm_split( cluster_comm, machine_boss, 0, &local_boss_comm );
if( TRUE == machine_boss ){
MPI_Comm_size( local_boss_comm, &nb_machines );
MPI_Comm_rank( local_boss_comm, &boss_rank );
}
Local storage of the communicator handles
For convenience, you can store the communicators you just
built in a single table, which makes subsequent accesses
easier.
MPI_Comm* comms = (MPI_Comm*) malloc( 4 * sizeof( MPI_Comm ) );
comms[0] = comm_boss;       // cluster bosses (top level)
comms[1] = cluster_comm;    // all the processes of a cluster
comms[2] = local_boss_comm; // machine bosses within a cluster
comms[3] = machine_comm;    // all the processes of a machine
Token circulation: between cores
This is where the serious business starts: the
communications. First, we take care of the lowest-level
communications, between the cores of a given machine.
If you are just a random core, not a machine boss, simply receive
the token from one neighbor and forward it to your other neighbor.
if( FALSE == machine_boss ){ // just a random core
    left  = ( local_rank + nb_cores - 1 ) % nb_cores;
    right = ( local_rank + 1 ) % nb_cores;
    MPI_Recv( &token, 1, MPI_INT, left, TAG, comms[3], &status );
    ++token;
    fprintf( stdout, "Hello! I am proc %d on core %d/%d. My hostname is %s. Token: %d\n",
             rank, local_rank, nb_cores, hostname, token );
    MPI_Send( &token, 1, MPI_INT, right, TAG, comms[3] );
}
Token circulation: between machines
If you are a machine boss, you have to receive the token from
another machine boss.
if( FALSE == cluster_boss ) { // machine boss
left = (boss_rank + nb_machines -1 ) % nb_machines;
right = (boss_rank + 1 ) % nb_machines;
lleft = ( local_rank + nb_cores - 1 ) % nb_cores;
MPI_Recv( &token, 1, MPI_INT, left, TAG, comms[2], &status );
++token;
fprintf( stdout, "Hello! I am proc %d on core %d/%d on machine %d/%d. My hostname is %s. Token: %d\n",
rank, local_rank, nb_cores, boss_rank, nb_machines, hostname, token );
Then you circulate it throughout your machine by sending it to
one neighbor on the machine-level communicator, and you wait
for it to complete its circulation, i.e., until you receive it
back from your other neighbor.
if( 1 < nb_cores ){
    MPI_Send( &token, 1, MPI_INT, 1, TAG, comms[3] );
    MPI_Recv( &token, 1, MPI_INT, lleft, TAG, comms[3], &status );
}
Once this is all done, you forward it to the next machine boss.
MPI_Send( &token, 1, MPI_INT, right, TAG, comms[2] );
}
Token circulation: between clusters
If you are a cluster boss, you have to receive the token from
another cluster, on the communicator that "links" all the
cluster bosses. Then you circulate the token in your own
machine and in your own cluster. Once this is all done, you
forward it to another cluster boss.
if( ROOT != rank ) {// cluster boss
left = (top_boss_rank + nb_boss -1 ) % nb_boss;
right = (top_boss_rank + 1 ) % nb_boss;
lleft = ( local_rank + nb_cores - 1 ) % nb_cores;
cleft = ( cluster_rank + nb_machines - 1 ) % nb_machines;
MPI_Recv( &token, 1, MPI_INT, left, TAG, comms[0], &status );
++token;
fprintf( stdout, "Hello! I am proc %d on core %d/%d on machine %d/%d in cluster %d/%d. My hostname is %s. Token: %d\n",
rank, local_rank, nb_cores, cluster_rank, nb_machines, top_boss_rank, nb_boss, hostname, token );
if( 1 < nb_cores ){
MPI_Send( &token, 1, MPI_INT, 1, TAG, comms[3] );
MPI_Recv( &token, 1, MPI_INT, lleft, TAG, comms[3], &status );
}
if( 1 != nb_machines ) {
MPI_Send( &token, 1, MPI_INT, 1, TAG, comms[2] );
MPI_Recv( &token, 1, MPI_INT, cleft, TAG, comms[2], &status );
}
MPI_Send( &token, 1, MPI_INT, right, TAG, comms[0] );
}
If you are the main boss, i.e. rank 0 on the MPI_COMM_WORLD
communicator, you do almost the same thing. The difference is
that you initiate the token circulation at each level (on your
machine, among the machine bosses of your cluster, and finally
on the comm_boss communicator), so each time you first send
the token and end the circulation by receiving it back.
lleft = ( local_rank + nb_cores - 1 ) % nb_cores;
left = ( boss_rank + nb_machines - 1 ) % nb_machines;
bleft = ( top_boss_rank + nb_boss - 1 ) % nb_boss;
fprintf( stdout, "Hello! I am proc %d on core %d/%d on machine %d/%d in cluster %d/%d. My hostname is %s. Token: %d\n",
rank, local_rank, nb_cores, cluster_rank, nb_machines, top_boss_rank, nb_boss, hostname, token );
MPI_Send( &token, 1, MPI_INT, 1, TAG, comms[3] );
MPI_Recv( &token, 1, MPI_INT, lleft, TAG, comms[3], &status );
MPI_Send( &token, 1, MPI_INT, 1, TAG, comms[2] );
MPI_Recv( &token, 1, MPI_INT, left, TAG, comms[2], &status );
MPI_Send( &token, 1, MPI_INT, 1, TAG, comms[0] );
MPI_Recv( &token, 1, MPI_INT, bleft, TAG, comms[0], &status );
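If the token starts at zero (as in the declaration sketch above) and every other process increments it exactly once, as in the excerpts, it comes back to the main boss holding the value size - 1. A small, purely optional sanity check before finalizing, sketched here under those assumptions, makes this easy to verify:
if( ROOT == rank && token != size - 1 )
    fprintf( stderr, "Unexpected token value: %d (expected %d)\n", token, size - 1 );
MPI_Finalize( );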
Full code and execution
Here you can find the full code.