An Evaluation of Different I/O Techniques for Checkpoint/Restart

Conference contribution


Publication details

Authors: Shahzad F, Wittmann M, Zeiser T, Hager G, Wellein G
Title of edited volume: Proceedings - IEEE 27th International Parallel and Distributed Processing Symposium Workshops and PhD Forum, IPDPSW 2013
Publisher: IEEE Digital Library
Place of publication: n.a.
Year of publication: 2013
Conference proceedings: Parallel and Distributed Processing Symposium Workshops PhD Forum (IPDPSW), 2013 IEEE 27th International
Page range: 1708-1716
Language: English


Abstract


Today's High Performance Computing (HPC) clusters consist of hundreds of thousands of CPUs, memory units, complex networks, and other components. Such an extreme level of hardware parallelism reduces the mean time to failure (MTTF) of the overall cluster. The future of HPC urgently demands environments that allow programs to run successfully even in the presence of failures. Checkpoint/Restart (C/R) is one of the most common techniques for providing fault tolerance. C/R is relatively easy to implement, but it typically adds significant overhead to the runtime of the application. In this paper, a checkpointing technique is presented that significantly reduces the checkpoint overhead and is highly scalable. This is achieved by overlapping the I/O for writing the checkpoint with the computation of the application. For this asynchronous checkpointing technique, a theoretical model is developed to estimate the checkpoint overhead. An implementation of this technique is then benchmarked and compared with other checkpointing strategies. We show that our approach has marginal overhead, as opposed to standard synchronous checkpointing, for typical application scenarios. A comparison with the node-level checkpointing technique using the Scalable Checkpoint/Restart (SCR) library is also presented. © 2013 IEEE.
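
The overlap of checkpoint I/O with computation described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names (`write_checkpoint`, `run`, `restart`), the toy state, and the use of a background thread with `pickle` are all assumptions chosen for illustration. The only synchronous cost in the sketch is the in-memory copy of the state; the file write proceeds while the main loop continues.

```python
import copy
import os
import pickle
import threading


def write_checkpoint(snapshot, path):
    """Write a snapshot to disk, then atomically move it into place."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "wb") as f:
        pickle.dump(snapshot, f)
        f.flush()
        os.fsync(f.fileno())
    # A crash during the write leaves the previous checkpoint intact.
    os.replace(tmp_path, path)


def run(num_steps, checkpoint_interval, path):
    """Toy iterative computation with asynchronous checkpoints (hypothetical)."""
    state = {"step": 0, "values": [0.0] * 1000}
    writer = None
    for step in range(1, num_steps + 1):
        # Stand-in for the real application kernel.
        state["values"] = [v + 1.0 for v in state["values"]]
        state["step"] = step

        if step % checkpoint_interval == 0:
            # Keep at most one checkpoint write in flight.
            if writer is not None:
                writer.join()
            # Synchronous part: copy the state so computation can modify it freely.
            snapshot = copy.deepcopy(state)
            # Asynchronous part: the write overlaps with the following steps.
            writer = threading.Thread(target=write_checkpoint, args=(snapshot, path))
            writer.start()

    if writer is not None:
        writer.join()
    return state


def restart(path):
    """Load the most recent completed checkpoint."""
    with open(path, "rb") as f:
        return pickle.load(f)
```

A run of 12 steps with a checkpoint every 5 steps would leave a checkpoint from step 10 on disk, which `restart` returns. In a real HPC code the same pattern would more likely use a dedicated I/O thread alongside an MPI application, and the memory needed for the extra copy of the state has to be budgeted.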



FAU authors / FAU editors

Hager, Georg Dr.
Regionales Rechenzentrum Erlangen (RRZE)
Shahzad, Faisal
Regionales Rechenzentrum Erlangen (RRZE)
Wellein, Gerhard Prof. Dr.
Professur für Höchstleistungsrechnen
Wittmann, Markus
Regionales Rechenzentrum Erlangen (RRZE)
Zeiser, Thomas Dr.
Regionales Rechenzentrum Erlangen (RRZE)


How to cite

APA:
Shahzad, F., Wittmann, M., Zeiser, T., Hager, G., & Wellein, G. (2013). An Evaluation of Different I/O Techniques for Checkpoint/Restart. In Parallel and Distributed Processing Symposium Workshops PhD Forum (IPDPSW), 2013 IEEE 27th International (pp. 1708-1716). Boston, MA, USA: n.a.: IEEE Digital Library.

MLA:
Shahzad, Faisal, et al. "An Evaluation of Different I/O Techniques for Checkpoint/Restart." Proceedings of the 2013 IEEE 27th International Parallel and Distributed Processing Symposium Workshops & PhD Forum, Boston, MA, USA n.a.: IEEE Digital Library, 2013. 1708-1716.

Last updated 2018-09-08 at 23:11