Exploiting Redundancy in Communication Avoiding Algorithms for Algorithm-based Fault Tolerance
Introduction
Abstract
Communication-avoiding algorithms perform redundant computations in order to minimize the number of inter-process communications. In this paper, we propose to exploit this redundancy for fault tolerance. We illustrate this idea with the QR factorization of tall and skinny matrices, and we evaluate the number of failures our algorithm can tolerate under different semantics.
General info
- Conference website: http://www.mcs.anl.gov/ieeecluster2015
- Deadline: March 7
- Format: 10 pages, IEEE two-column
Team
Semantics of the FT
- Eventually, get the resulting R on any of the processes
- Eventually, get the resulting R on process 0
- Eventually, get the resulting R on all the (surviving) processes
- Replace the failed process by a new one and continue
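The first and third semantics above can be illustrated with a small simulation. The sketch below is not the implementation from the paper: it runs in a single Python process without MPI, it uses modified Gram-Schmidt instead of a Householder QR, and the names `r_factor`, `redundant_tsqr` and the `failures` parameter are hypothetical. It assumes a butterfly exchange in which both partners of a pair compute the same stacked R (the redundant computation), a power-of-two number of processes, and the simplifying rule that a process whose partner has failed drops out.

```python
import math
import random


def r_factor(a):
    """Return the upper-triangular R of a full-column-rank matrix (list of rows),
    computed with modified Gram-Schmidt (positive diagonal)."""
    n = len(a[0])
    cols = [[row[j] for row in a] for j in range(n)]  # column-major copy
    r = [[0.0] * n for _ in range(n)]
    for j in range(n):
        for i in range(j):
            r[i][j] = sum(x * y for x, y in zip(cols[i], cols[j]))
            cols[j] = [y - r[i][j] * x for x, y in zip(cols[i], cols[j])]
        r[j][j] = math.sqrt(sum(y * y for y in cols[j]))
        cols[j] = [y / r[j][j] for y in cols[j]]
    return r


def redundant_tsqr(blocks, failures=None):
    """Simulate a butterfly TSQR over len(blocks) processes (a power of two).

    failures (hypothetical parameter) maps a step number to the set of ranks
    that crash just before the exchange of that step.
    Returns {rank: R} for the ranks that hold the final R.
    """
    p = len(blocks)
    failures = failures or {}
    # Local QR of each process's block of rows.
    local = {rank: r_factor(block) for rank, block in enumerate(blocks)}
    step, dist = 0, 1
    while dist < p:
        for rank in failures.get(step, ()):
            local.pop(rank, None)  # the crashed process loses its data
        nxt = {}
        for rank, r in local.items():
            buddy = rank ^ dist
            if buddy in local:
                # Both partners stack the two R factors in the same order
                # and compute the same result redundantly.
                lo, hi = (r, local[buddy]) if rank < buddy else (local[buddy], r)
                nxt[rank] = r_factor(lo + hi)
            # Assumption: a process whose buddy has failed drops out.
        local = nxt
        step += 1
        dist *= 2
    return local


random.seed(0)
n = 3
blocks = [[[random.random() for _ in range(n)] for _ in range(8)] for _ in range(4)]
survivors = redundant_tsqr(blocks, failures={1: {3}})
print(sorted(survivors))  # ranks still holding the final R: [0, 2]
```

In this run, rank 3 crashes after the first exchange. Its partial result is still available on rank 2, so ranks 0 and 2 finish with the complete R, while rank 1 (whose partner was rank 3) drops out. A crash before the first exchange leaves no copy of the lost block, and under these assumptions no process obtains the final R.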
Code
How to execute fault tolerant MPI code
- Open MPI with User-Level Failure Mitigation (i.e., fault-tolerant Open MPI): Open MPI 1.7ft http://fault-tolerance.org
- Compile Open MPI with the options
--enable-mpi-ext=ftmpi --with-ft=mpi
- Run it with the option
-am ft-enable-mpi
Example: mpiexec -am ft-enable-mpi -n 5 ./ring-ft
References
Communication-avoiding algorithms:
- J. Demmel, L. Grigori, M. F. Hoemmen, and J. Langou: Communication-optimal parallel and sequential QR and LU factorizations: Theory and Practice, UCB-EECS-2008-89 and LAWN 204 http://www.netlib.org/lapack/lawnspdf/lawn204.pdf
ABFT:
- George Bosilca, Rémi Delmas, Jack Dongarra, Julien Langou: Algorithm-based fault tolerance applied to high performance computing http://www.sciencedirect.com/science/article/pii/S0743731508002141
- Zizhong Chen, Graham E. Fagg, Edgar Gabriel, Julien Langou, Thara Angskun, George Bosilca, and Jack Dongarra: Fault tolerant high performance computing by a coding approach http://icl.cs.utk.edu/news_pub/submissions/diskless_ftmpi_ppopp05.pdf
Tools we are using here:
- Fault-tolerant version of Open MPI http://fault-tolerance.org
- Joshua Hursey, Richard L. Graham: Building a Fault Tolerant MPI Application: A Ring Communication Example https://www.open-mpi.org/papers/ipdps-dpdns-2011/ipdps-dpdns-2011.pdf