
Exploiting Redundancy in Communication Avoiding Algorithms for Algorithm-based Fault Tolerance

Communication-avoiding algorithms perform redundant computations in order to minimize the number of inter-process communications. In this paper, we propose to exploit this redundancy for fault tolerance. We illustrate this idea with the QR factorization of tall and skinny matrices, and we evaluate the number of failures our algorithm can tolerate under different semantics.
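
To make the redundancy concrete, here is a minimal sketch of a tall-and-skinny QR (TSQR) reduction in which both partners of every exchange recompute the same R factor. It is a sequential simulation in pure Python, not the implementation described in the paper: the function names `qr_r` and `redundant_tsqr`, the butterfly exchange pattern, and the use of modified Gram-Schmidt are all assumptions made for illustration.

```python
import math

def qr_r(rows):
    """R factor of a full-column-rank matrix given as a list of rows.

    Uses modified Gram-Schmidt, so the diagonal of R is positive and R is
    unique. (Illustrative choice; a real code would call LAPACK.)
    """
    m, n = len(rows), len(rows[0])
    cols = [[rows[i][j] for i in range(m)] for j in range(n)]
    r = [[0.0] * n for _ in range(n)]
    for j in range(n):
        r[j][j] = math.sqrt(sum(x * x for x in cols[j]))
        q = [x / r[j][j] for x in cols[j]]
        for k in range(j + 1, n):
            r[j][k] = sum(qi * ck for qi, ck in zip(q, cols[k]))
            cols[k] = [ck - r[j][k] * qi for qi, ck in zip(q, cols[k])]
    return r

def redundant_tsqr(blocks):
    """Simulate a butterfly TSQR on len(blocks) = 2**k 'processes'.

    Returns the list of R factors held by each process at the end.
    """
    p = len(blocks)
    local_r = [qr_r(b) for b in blocks]   # local QR, no communication
    dist = 1
    while dist < p:
        nxt = []
        for rank in range(p):
            buddy = rank ^ dist
            lo, hi = min(rank, buddy), max(rank, buddy)
            # Both partners stack the two R factors and factor them again,
            # so the same intermediate R is computed redundantly.
            nxt.append(qr_r(local_r[lo] + local_r[hi]))
        local_r = nxt
        dist *= 2
    return local_r

if __name__ == "__main__":
    a = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 9.0],
         [2.0, 1.0], [0.0, 1.0], [1.0, 0.0], [4.0, 3.0]]
    blocks = [a[i:i + 2] for i in range(0, len(a), 2)]  # 4 simulated processes
    for rank, r in enumerate(redundant_tsqr(blocks)):
        print(rank, [[round(x, 6) for x in row] for row in r])
```

In this sketch every simulated process ends up with the same final R, and after each exchange the number of processes holding a given intermediate R doubles. Those extra copies are the redundancy that can be used to survive failures.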

  • Eventually, get the resulting R on any of the processes
  • Eventually, get the resulting R on process 0
  • Eventually, get the resulting R on all the (surviving) processes
  • Replace the failed process by a new one and continue
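
The first three semantics above can be compared on a toy model of the redundant reduction. The sketch below is an assumption-laden illustration, not the authors' algorithm: `simulate`, `semantics`, the `fail_before_step` argument and the `use_replicas` policy (fetching the missing R from any surviving member of the partner's group) are hypothetical names and rules introduced here. The fourth semantics, replacing a failed process, is not modelled.

```python
def simulate(p, fail_before_step, use_replicas=False):
    """Toy model of a redundant butterfly reduction on p = 2**k processes.

    fail_before_step maps a rank to the step index before which it crashes.
    Returns (alive, holders): the surviving ranks, and the survivors that
    end up holding the final R (modelled as the set of all contributors).
    """
    alive = set(range(p))
    # data[r] = ranks whose blocks are folded into r's current R factor
    data = {r: frozenset([r]) for r in range(p)}
    step, dist = 0, 1
    while dist < p:
        alive -= {r for r, s in fail_before_step.items() if s == step}
        new_data = {}
        for r in alive:
            if r not in data:
                continue  # r already stalled at an earlier step
            buddy = r ^ dist
            # All ranks in the buddy's group of size dist hold the same R.
            base = buddy - buddy % dist
            group = range(base, base + dist) if use_replicas else [buddy]
            source = next((q for q in group if q in alive and q in data), None)
            if source is not None:
                new_data[r] = data[r] | data[source]
        data = new_data
        step, dist = step + 1, dist * 2
    everyone = frozenset(range(p))
    holders = {r for r in alive if data.get(r) == everyone}
    return alive, holders

def semantics(p, fail_before_step, use_replicas=False):
    """Report which of the three result semantics is satisfied."""
    alive, holders = simulate(p, fail_before_step, use_replicas)
    return {
        "any": bool(holders),
        "process 0": 0 in holders,
        "all surviving": bool(alive) and holders == alive,
    }

if __name__ == "__main__":
    print(semantics(4, {1: 1}))
    print(semantics(4, {1: 1}, use_replicas=True))
    print(semantics(4, {1: 0}))
```

In this model, with 4 processes, a crash of rank 1 before the second exchange still leaves the final R on ranks 0 and 2, which satisfies the first two semantics but not the third. Allowing rank 3 to fetch the missing factor from a replica satisfies all three. A crash before the first exchange, when no copy exists yet, loses the result.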

Communication-avoiding algorithms:

ABFT:

Tools we are using here:

  • Open MPI with user-level failure mitigation (i.e., fault-tolerant Open MPI): Open MPI 1.7ft, http://fault-tolerance.org
  • Compile Open MPI with the options --enable-mpi-ext=ftmpi --with-ft=mpi
  • Run it with the option -am ft-enable-mpi

Example: mpiexec -am ft-enable-mpi -n 5 ./ring-ft
