Adaptive control in roll-forward recovery for extreme scale multigrid

Authors: Huber M, Rüde U, Wohlmuth BI
Journal: International Journal of High Performance Computing Applications
Publisher: SAGE Publications Sage UK: London, England
Publication year: 2018
Pages: 1-31
ISSN: 1094-3420
Language: English


With the increasing number of compute components, failures in future exa-scale computer systems are expected to become more frequent. This motivates the study of novel resilience techniques. Here, we extend a recently proposed algorithm-based recovery method for multigrid iterations by introducing an adaptive control. After a fault, the healthy part of the system continues the iterative solution process, while the solution in the faulty domain is reconstructed by an asynchronous on-line recovery. The computations in the faulty and healthy subdomains must be carefully coordinated; in particular, both under-solving and over-solving must be avoided, since each wastes computational resources and therefore increases the overall time-to-solution. To control the local recovery and guarantee an optimal re-coupling, we introduce a stopping criterion based on a mathematical error estimator. It involves hierarchically weighted sums of residuals on uniformly refined meshes and is well suited to parallel high-performance computing. The re-coupling process is steered by local contributions of the error estimator. We propose and compare two criteria that differ in their weights. Failure scenarios for problems with up to 6.9 · 10^11 unknowns on more than 245 766 parallel processes are reported on a state-of-the-art peta-scale supercomputer, demonstrating the robustness of the method.

Keywords: error estimator, high-performance computing, algorithm-based fault tolerance, multigrid

Huber, Markus
Lehrstuhl für Informatik 10 (Systemsimulation)
Rüde, Ulrich Prof. Dr.
Lehrstuhl für Informatik 10 (Systemsimulation)

Technische Universität München (TUM)


Huber, M., Rüde, U., & Wohlmuth, B.I. (2018). Adaptive control in roll-forward recovery for extreme scale multigrid. International Journal of High Performance Computing Applications, 1-31.

