Adaptive control in roll-forward recovery for extreme scale multigrid

Huber M, Rüde U, Wohlmuth BI (2018)


Publication Language: English

Publication Type: Journal article

Publication year: 2018

Journal: International Journal of High Performance Computing Applications

Publisher: SAGE Publications, London, England

Page Range: 1-31

URI: https://arxiv.org/pdf/1804.06373.pdf

DOI: 10.1177/1094342018817088

Abstract

With the increasing number of compute components, failures in future exa-scale computer systems are expected to become more frequent. This motivates the study of novel resilience techniques. Here, we extend a recently proposed algorithm-based recovery method for multigrid iterations by introducing an adaptive control. After a fault, the healthy part of the system continues the iterative solution process, while the solution in the faulty domain is re-constructed by an asynchronous on-line recovery. The computations in both the faulty and healthy subdomains must be coordinated in a sensitive way; in particular, both under- and over-solving must be avoided. Both of these waste computational resources and will therefore increase the overall time-to-solution. To control the local recovery and guarantee an optimal re-coupling, we introduce a stopping criterion based on a mathematical error estimator. It involves hierarchical weighted sums of residuals within the context of uniformly refined meshes and is well-suited in the context of parallel high-performance computing. The re-coupling process is steered by local contributions of the error estimator. We propose and compare two criteria which differ in their weights. Failure scenarios when solving up to 6.9 · 10^11 unknowns on more than 245 766 parallel processes will be reported on a state-of-the-art peta-scale supercomputer demonstrating the robustness of the method.

Keywords: error estimator, high-performance computing, algorithm-based fault tolerance, multigrid


How to cite

APA:

Huber, M., Rüde, U., & Wohlmuth, B.I. (2018). Adaptive control in roll-forward recovery for extreme scale multigrid. International Journal of High Performance Computing Applications, 1-31. https://dx.doi.org/10.1177/1094342018817088

MLA:

Huber, Markus, Ulrich Rüde, and B. I. Wohlmuth. "Adaptive control in roll-forward recovery for extreme scale multigrid." International Journal of High Performance Computing Applications (2018): 1-31.
