Communication-avoiding algorithms perform redundant computations in order to minimize the number of inter-process communications. In this paper, we propose to exploit this redundancy for fault tolerance. We illustrate this idea with the QR factorization of tall and skinny matrices, and we evaluate the number of failures our algorithm can tolerate under different semantics.
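The redundancy mentioned above can be illustrated with a small simulation. This is only a sketch, not the algorithm evaluated in the paper: it assumes a butterfly (recursive-doubling) exchange in which, at each step, two partner processes swap their current R factors and both compute the same combined factor. The function name replica_counts and the power-of-two restriction are choices made for this illustration. It counts how many processes hold a copy of each intermediate result after every step, assuming that a single surviving copy is enough to recover it.

```python
from collections import Counter


def replica_counts(num_procs):
    """Return, for each exchange step, the number of copies of every intermediate R.

    Hypothetical helper for illustration; assumes a power-of-two process count.
    """
    if num_procs < 1 or num_procs & (num_procs - 1):
        raise ValueError("this sketch assumes a power-of-two process count")

    # Each process starts with an R factor computed from its own rows only.
    # A factor is represented by the set of ranks whose rows contributed to it.
    held = [frozenset([rank]) for rank in range(num_procs)]

    counts = []
    step = 0
    while (1 << step) < num_procs:
        # Partners differ in bit `step`; both combine the same two factors,
        # so both end up holding an identical result.
        held = [held[rank] | held[rank ^ (1 << step)] for rank in range(num_procs)]
        # Number of processes holding each distinct intermediate factor.
        copies = Counter(held)
        counts.append(min(copies.values()))
        step += 1
    return counts


if __name__ == "__main__":
    for step, copies in enumerate(replica_counts(8), start=1):
        print(f"after step {step}: {copies} copies of each intermediate R, "
              f"up to {copies - 1} holders may fail before it is lost")
```

In this simplified model the number of copies doubles at every step, so later steps survive more failures than earlier ones. How many failures the actual algorithm tolerates depends on the failure semantics considered in the paper.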
--enable-mpi-ext=ftmpi --with-ft=mpi
-am ft-enable-mpi
Example: mpiexec -am ft-enable-mpi -n 5 ./ring-ft
Communication-avoiding algorithms:
ABFT:
Tools we are using here: