Exploiting Redundancy in Communication Avoiding Algorithms for Algorithm-based Fault Tolerance

From wikiRcln
Revision as of 27 February 2015, 12:59, by Coti


Introduction

Abstract

Communication-avoiding algorithms perform redundant computations to minimize the number of inter-process communications. In this paper, we propose to exploit this redundancy for fault-tolerance purposes. We illustrate this idea with the QR factorization of tall and skinny matrices, and we evaluate the number of failures our algorithm can tolerate under different semantics.
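
To make the redundancy idea concrete, here is a minimal sketch that counts how many processes hold a copy of each partial result during a reduction. It assumes a butterfly (pairwise exchange) pattern in which, at exchange s, rank p combines its data with rank p XOR 2^s; this page does not state which pattern is used, so that is an assumption. The function name butterfly_copies is hypothetical, and no actual QR computation is performed: only the bookkeeping of which local blocks each rank has combined.

```python
def butterfly_copies(num_procs):
    """For each exchange step, map every partial result (identified by the
    set of ranks whose local blocks it combines) to the ranks holding a copy."""
    # Assumption of this sketch: the process count is a power of two.
    if num_procs < 1 or num_procs & (num_procs - 1):
        raise ValueError("this sketch assumes a power-of-two process count")
    # combined[rank] = set of ranks whose blocks are folded into rank's data
    combined = [frozenset([rank]) for rank in range(num_procs)]
    history = []
    step = 0
    while (1 << step) < num_procs:
        # Both partners of a pair end up with the same combined result.
        combined = [combined[rank] | combined[rank ^ (1 << step)]
                    for rank in range(num_procs)]
        holders = {}
        for rank, blocks in enumerate(combined):
            holders.setdefault(blocks, []).append(rank)
        history.append(holders)
        step += 1
    return history


for step, holders in enumerate(butterfly_copies(8)):
    copies = len(next(iter(holders.values())))
    print(f"after exchange {step}: {len(holders)} partial results, {copies} copies each")
```

Under this assumed pattern, the number of copies of each partial result doubles at every exchange, so a partial result held by k processes is only lost if all k of them fail.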

General info

  • Conference website: [1]
  • Deadline: March 7
  • Format: 10 pages, IEEE two-column

Team

Semantics of the FT

  • Eventually, get the resulting R on any of the processes
  • Eventually, get the resulting R on process 0
  • Eventually, get the resulting R on all the (surviving) processes
  • Replace the failed process by a new one and continue
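
The first three semantics above can be compared on a toy failure model, sketched below. The model is an assumption made for illustration, not the algorithm of the paper: processes exchange data in a butterfly pattern (rank p pairs with rank p XOR 2^s at exchange s), failures strike just before a given exchange, and a process obtains the final R only if, at every exchange, both it and its partner are alive and still hold valid data. The names surviving_holders and check_semantics are hypothetical. The fourth semantics (replacing a failed process) is not modeled.

```python
def surviving_holders(num_procs, failures):
    """Return (alive, holders): the ranks still alive at the end, and the
    ranks that hold the final R under the toy model described above.

    failures maps an exchange index to the set of ranks that fail just
    before that exchange (hypothetical input format for this sketch).
    """
    alive = set(range(num_procs))
    valid = set(range(num_procs))  # ranks whose partial result is complete
    step = 0
    while (1 << step) < num_procs:
        alive -= failures.get(step, set())
        valid &= alive
        # A rank stays valid only if its partner for this exchange is valid.
        valid = {rank for rank in valid if (rank ^ (1 << step)) in valid}
        step += 1
    return alive, valid


def check_semantics(num_procs, failures):
    """Report which of the first three semantics are satisfied."""
    alive, holders = surviving_holders(num_procs, failures)
    return {
        "any_process": bool(holders),
        "process_0": 0 in holders,
        "all_survivors": bool(alive) and holders == alive,
    }


# Rank 3 fails before the second exchange: ranks 0 and 2 still finish,
# but rank 1 survives without R because its partner is gone.
print(check_semantics(4, {1: {3}}))
# Rank 0 fails before any exchange: its block exists nowhere else.
print(check_semantics(4, {0: {0}}))
```

In this model the three semantics are ordered by strength: a run can satisfy "any process" and "process 0" while failing "all surviving processes", as in the first example.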

Code

How to execute fault tolerant MPI code

  • Open MPI with User-Level Failure Mitigation (i.e., fault-tolerant Open MPI): Open MPI 1.7ft [2]
  • Build Open MPI with the configure options --enable-mpi-ext=ftmpi --with-ft=mpi
  • Run the application with the option -am ft-enable-mpi

Example: mpiexec -am ft-enable-mpi -n 5 ./ring-ft

References

Communication-avoiding algorithms:

  • J. Demmel, L. Grigori, M. F. Hoemmen, and J. Langou: Communication-optimal parallel and sequential QR and LU factorizations: theory and practice, UCB-EECS-2008-89 and LAWN 204 [3]

ABFT:

  • George Bosilca, Rémi Delmas, Jack Dongarra, Julien Langou: Algorithm-based fault tolerance applied to high performance computing [4]
  • Zizhong Chen, Graham E. Fagg, Edgar Gabriel, Julien Langou, Thara Angskun, George Bosilca, and Jack Dongarra: Fault tolerant high performance computing by a coding approach [5]

Tools we are using here:

  • Fault-tolerant version of Open MPI [6]
  • Joshua Hursey, Richard L. Graham: Building a Fault Tolerant MPI Application: A Ring Communication Example [7]