
Exploiting Redundancy in Communication Avoiding Algorithms for Algorithm-based Fault Tolerance

Introduction

Abstract

Communication-avoiding algorithms perform redundant computations in order to minimize the number of inter-process communications. In this paper, we propose to exploit this redundancy for fault tolerance. We illustrate the idea with the QR factorization of tall and skinny matrices, and we evaluate the number of failures our algorithm can tolerate under different semantics.
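To make the notion of redundancy concrete, here is a minimal symbolic sketch. It assumes a butterfly (recursive-doubling) exchange in which paired processes swap their intermediate factors and both compute the combined result; the abstract does not specify the communication pattern, so this is an illustration and not the paper's algorithm. The function name butterfly_copies is hypothetical, and each intermediate R factor is represented only by the set of block indices merged into it, with no numerical QR performed.

```python
def butterfly_copies(num_procs):
    """Symbolically simulate a butterfly reduction over num_procs ranks.

    ASSUMPTION: pairwise exchange with partner rank XOR step, both partners
    keep the combined result. Each 'factor' is a frozenset of the block
    indices merged into it (a stand-in for an intermediate R factor).
    Returns one dict per step mapping each distinct result to the number
    of ranks that hold a copy of it.
    """
    if num_procs < 1 or num_procs & (num_procs - 1):
        raise ValueError("num_procs must be a power of two")
    held = [frozenset([rank]) for rank in range(num_procs)]
    history = []
    step = 1
    while step < num_procs:
        # Both partners combine the same pair, so they end up with
        # identical intermediate results: this is the redundancy.
        held = [held[rank] | held[rank ^ step] for rank in range(num_procs)]
        copies = {}
        for result in held:
            copies[result] = copies.get(result, 0) + 1
        history.append(copies)
        step *= 2
    return history


if __name__ == "__main__":
    for s, copies in enumerate(butterfly_copies(8), start=1):
        replicas = sorted(set(copies.values()))
        print(f"step {s}: {len(copies)} distinct results, "
              f"replicas per result = {replicas}")
```

Under this assumed pattern, the number of ranks holding each intermediate result doubles at every step. A result held by k ranks remains available as long as at least one of those k ranks survives, which is the kind of redundancy the abstract proposes to exploit.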

General info

Team

Fault tolerance semantics

Code

How to execute fault-tolerant MPI code

Example: mpiexec -am ft-enable-mpi -n 5 ./ring-ft

References

Communication-avoiding algorithms:

ABFT:

Tools we are using here: