A survey of checkpoint/restart techniques on distributed memory systems

Shahzad F, Wittmann M, Kreutzer M, Zeiser T, Hager G, Wellein G (2013)

Publication Language: English

Publication Type: Journal article

Publication year: 2013

Journal: Parallel Processing Letters

Publisher: World Scientific Publishing Co

Book Volume: 23

Page Range: 1340011-1340030

Journal Issue: 04

URI: http://www.worldscientific.com/doi/abs/10.1142/S0129626413400112

DOI: 10.1142/S0129626413400112

Abstract

The road to exascale computing poses many challenges for the High Performance Computing (HPC) community. Each step on the exascale path is mainly the result of a higher level of parallelism of the basic building blocks (i.e., CPUs, memory units, networking components, etc.). The reliability of each of these basic components does not increase at the same rate as the rate of hardware parallelism. This results in a reduction of the mean time to failure (MTTF) of the whole system. A fault tolerance environment is thus indispensable to run large applications on such clusters. Checkpoint/Restart (C/R) is the classic and most popular method to minimize failure damage. Its ease of implementation makes it useful, but typically it introduces significant overhead to the application. Several efforts have been made to reduce the C/R overhead. In this paper we compare various C/R techniques for their overheads by implementing them on two different categories of applications. These approaches are based on parallel-file-system (PFS)-level checkpoints (synchronous/asynchronous) and node-level checkpoints. We utilize the Scalable Checkpoint/Restart (SCR) library for the comparison of node-level checkpoints. For asynchronous PFS-level checkpoints, we use the Damaris library, the SCR asynchronous feature, and application-based checkpointing via dedicated threads. Our baseline for overhead comparison is the naïve application-based synchronous PFS-level checkpointing method. A 3D lattice-Boltzmann (LBM) flow solver and a Lanczos eigenvalue solver are used as prototypical applications in which all the techniques considered here may be applied. © 2013 World Scientific Publishing Company.

Authors with CRIS profile

Faisal Shahzad, Regionales Rechenzentrum Erlangen (RRZE)

Moritz Kreutzer, Regionales Rechenzentrum Erlangen (RRZE)

Thomas Zeiser, Regionales Rechenzentrum Erlangen (RRZE)

Georg Hager, Regionales Rechenzentrum Erlangen (RRZE)

Gerhard Wellein, Professur für Höchstleistungsrechnen

How to cite

APA:

Shahzad, F., Wittmann, M., Kreutzer, M., Zeiser, T., Hager, G., & Wellein, G. (2013). A survey of checkpoint/restart techniques on distributed memory systems. Parallel Processing Letters, 23(04), 1340011-1340030. https://doi.org/10.1142/S0129626413400112

MLA:

Shahzad, Faisal, et al. "A survey of checkpoint/restart techniques on distributed memory systems." Parallel Processing Letters 23.04 (2013): 1340011-1340030.
