A survey of checkpoint/restart techniques on distributed memory systems

Journal article

Publication details

Author(s): Shahzad F, Wittmann M, Kreutzer M, Zeiser T, Hager G, Wellein G
Journal: Parallel Processing Letters
Publisher: World Scientific Publishing Co
Year of publication: 2013
Volume: 23
Issue: 04
Page range: 1340011-1340030
ISSN: 0129-6264
Language: English


The road to exascale computing poses many challenges for the High Performance Computing (HPC) community. Each step on the exascale path is mainly the result of a higher level of parallelism of the basic building blocks (i.e., CPUs, memory units, networking components, etc.). The reliability of each of these basic components does not increase at the same rate as the rate of hardware parallelism. This results in a reduction of the mean time to failure (MTTF) of the whole system. A fault tolerance environment is thus indispensable to run large applications on such clusters. Checkpoint/Restart (C/R) is the classic and most popular method to minimize failure damage. Its ease of implementation makes it useful, but typically it introduces significant overhead to the application. Several efforts have been made to reduce the C/R overhead. In this paper we compare various C/R techniques for their overheads by implementing them on two different categories of applications. These approaches are based on parallel-file-system (PFS)-level checkpoints (synchronous/asynchronous) and node-level checkpoints. We utilize the Scalable Checkpoint/Restart (SCR) library for the comparison of node-level checkpoints. For asynchronous PFS-level checkpoints, we use the Damaris library, the SCR asynchronous feature, and application-based checkpointing via dedicated threads. Our baseline for overhead comparison is the naïve application-based synchronous PFS-level checkpointing method. A 3D lattice-Boltzmann (LBM) flow solver and a Lanczos eigenvalue solver are used as prototypical applications in which all the techniques considered here may be applied. © 2013 World Scientific Publishing Company.
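
The difference between synchronous application-based checkpointing and checkpointing via a dedicated thread, both named in the abstract, can be illustrated with a minimal single-process sketch. It is not the paper's implementation and uses neither the SCR nor the Damaris API. The names `run`, `write_checkpoint`, and `read_checkpoint`, the pickle file format, and the toy state dictionary are all assumptions made for illustration; a real solver would checkpoint its distributed data per MPI process.

```python
import copy
import os
import pickle
import threading


def write_checkpoint(state, path):
    """Write state to a temporary file, then rename it over the target.

    The rename keeps the previous checkpoint intact if a failure occurs
    in the middle of the write.
    """
    tmp_path = path + ".tmp"
    with open(tmp_path, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)


def read_checkpoint(path):
    """Load the most recent checkpoint."""
    with open(path, "rb") as f:
        return pickle.load(f)


def run(n_steps, path, interval, asynchronous=True):
    """Toy iteration loop (hypothetical stand-in for a solver)."""
    # Restart: resume from the last checkpoint if one exists.
    if os.path.exists(path):
        state = read_checkpoint(path)
    else:
        state = {"step": 0, "value": 0}

    writer = None
    while state["step"] < n_steps:
        state["value"] += state["step"]  # placeholder for the real computation
        state["step"] += 1

        if state["step"] % interval == 0:
            if writer is not None:
                writer.join()  # allow at most one checkpoint in flight
            if asynchronous:
                # Copy the state in memory, then let a dedicated thread do
                # the file I/O while the main loop keeps computing.
                snapshot = copy.deepcopy(state)
                writer = threading.Thread(
                    target=write_checkpoint, args=(snapshot, path)
                )
                writer.start()
            else:
                # Synchronous baseline: the main loop blocks on the I/O.
                write_checkpoint(state, path)

    if writer is not None:
        writer.join()
    return state
```

Calling `run(10, "ckpt.pkl", interval=3)` after an interrupted earlier run resumes from the last completed checkpoint instead of step 0. In the asynchronous variant the main loop pays only for the in-memory copy, at the price of holding a second copy of the state while the write is in progress.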

FAU authors / FAU editors

Hager, Georg Dr.
Regionales Rechenzentrum Erlangen (RRZE)
Kreutzer, Moritz
Regionales Rechenzentrum Erlangen (RRZE)
Shahzad, Faisal
Regionales Rechenzentrum Erlangen (RRZE)
Wellein, Gerhard Prof. Dr.
Professorship for High Performance Computing
Wittmann, Markus
Regionales Rechenzentrum Erlangen (RRZE)
Zeiser, Thomas Dr.
Regionales Rechenzentrum Erlangen (RRZE)


Shahzad, F., Wittmann, M., Kreutzer, M., Zeiser, T., Hager, G., & Wellein, G. (2013). A survey of checkpoint/restart techniques on distributed memory systems. Parallel Processing Letters, 23(04), 1340011-1340030. https://dx.doi.org/10.1142/S0129626413400112

Shahzad, Faisal, et al. "A survey of checkpoint/restart techniques on distributed memory systems." Parallel Processing Letters 23.04 (2013): 1340011-1340030.


Last updated 2018-09-08 at 16:39