Adaptive control in roll-forward recovery for extreme scale multigrid

Journal article

Publication details

Authors: Huber M, Rüde U, Wohlmuth BI
Journal: International Journal of High Performance Computing Applications
Publisher: SAGE Publications (Sage UK: London, England)
Year of publication: 2018
Pages: 1-31
ISSN: 1094-3420
Language: English


With the increasing number of compute components, failures in future exa-scale computer systems are expected to become more frequent. This motivates the study of novel resilience techniques. Here, we extend a recently proposed algorithm-based recovery method for multigrid iterations by introducing an adaptive control. After a fault, the healthy part of the system continues the iterative solution process, while the solution in the faulty domain is reconstructed by an asynchronous online recovery. The computations in the faulty and healthy subdomains must be carefully coordinated; in particular, both under- and over-solving must be avoided, since either wastes computational resources and therefore increases the overall time-to-solution. To control the local recovery and guarantee an optimal re-coupling, we introduce a stopping criterion based on a mathematical error estimator. It involves hierarchical weighted sums of residuals on uniformly refined meshes and is well suited to parallel high-performance computing. The re-coupling process is steered by local contributions of the error estimator. We propose and compare two criteria that differ in their weights. Failure scenarios when solving up to 6.9 · 10^11 unknowns on more than 245 766 parallel processes are reported on a state-of-the-art peta-scale supercomputer, demonstrating the robustness of the method.
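
The stopping idea described in the abstract can be illustrated with a small sketch. This is not the authors' estimator: the function names, the two weight choices, the comparison rule, and the sample residual norms are all hypothetical, chosen only to show how a hierarchical weighted sum of per-level residuals might steer the decision to re-couple a recovered subdomain.

```python
import math


def hierarchical_estimate(residual_norms, weights):
    """Square root of a weighted sum of squared per-level residual norms.

    Hypothetical stand-in for an error estimator built from residuals on a
    hierarchy of uniformly refined meshes (index 0 = coarsest level).
    """
    if len(residual_norms) != len(weights):
        raise ValueError("one weight per level is required")
    return math.sqrt(sum(w * r * r for w, r in zip(weights, residual_norms)))


def should_recouple(eta_faulty, eta_healthy, factor=1.0):
    """Assumed rule: re-couple once the faulty subdomain's local estimate
    has dropped to the level of the healthy subdomain (scaled by factor)."""
    return eta_faulty <= factor * eta_healthy


# Made-up per-level residual norms, coarse to fine.
healthy = [1e-3, 2e-3, 4e-3]
faulty_early = [5e-1, 3e-1, 2e-1]   # shortly after the fault
faulty_late = [1e-3, 1e-3, 2e-3]    # after some local recovery work

# Two illustrative weightings (assumptions, not taken from the paper).
uniform = [1.0, 1.0, 1.0]
level_scaled = [4.0 ** -level for level in range(3)]

for name, weights in (("uniform", uniform), ("level_scaled", level_scaled)):
    eta_h = hierarchical_estimate(healthy, weights)
    for label, norms in (("early", faulty_early), ("late", faulty_late)):
        eta_f = hierarchical_estimate(norms, weights)
        print(name, label, should_recouple(eta_f, eta_h))
```

Stopping the local recovery too early (under-solving) would re-couple a subdomain whose estimate is still far above the healthy one, while continuing well past the point where the two estimates match (over-solving) would spend work that does not improve the global solution.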

Keywords: error estimator, high-performance computing, algorithm-based fault tolerance, multigrid

FAU authors / FAU editors

Huber, Markus
Lehrstuhl für Informatik 10 (Systemsimulation)
Rüde, Ulrich Prof. Dr.
Lehrstuhl für Informatik 10 (Systemsimulation)

Institutions of other authors

Technische Universität München (TUM)


Huber, M., Rüde, U., & Wohlmuth, B.I. (2018). Adaptive control in roll-forward recovery for extreme scale multigrid. International Journal of High Performance Computing Applications, 1-31.

Huber, Markus, Ulrich Rüde, and B. I. Wohlmuth. "Adaptive control in roll-forward recovery for extreme scale multigrid." International Journal of High Performance Computing Applications (2018): 1-31.


Last updated 2019-01-04 at 17:31