Shahzad F, Wittmann M, Zeiser T, Wellein G (2012)
Publication Language: English
Publication Type: Book chapter / Article in edited volumes
Publication year: 2012
Publisher: Springer-Verlag
Edited Volumes: Recent Advances in the Message Passing Interface
Series: Lecture Notes in Computer Science
Book Volume: 7490
Pages Range: 289-290
ISBN: 978-3-642-33517-4
URI: http://link.springer.com/chapter/10.1007/978-3-642-33518-1_36
DOI: 10.1007/978-3-642-33518-1_36
Checkpoint/restart (C/R) is a classical approach to introducing fault tolerance in large HPC applications. Although it is relatively easy compared with other fault tolerance approaches, its overhead hinders its wide use. We present an application-level checkpointing technique that significantly reduces the checkpoint overhead. The checkpoint I/O is overlapped with the application's computation by a two-stage checkpointing mechanism that uses dedicated threads for I/O. © 2012 Springer-Verlag.
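The abstract does not spell out the two stages, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes that stage 1 is a fast in-memory copy of the application state and that stage 2 is a write to disk performed by a dedicated thread while the computation continues. The names `AsyncCheckpointer`, `checkpoint`, and `run_demo`, the JSON file format, and the file naming are all hypothetical.

```python
import copy
import json
import os
import queue
import tempfile
import threading


class AsyncCheckpointer:
    """Hypothetical helper: in-memory snapshot first, disk write on a dedicated thread."""

    def __init__(self, directory):
        self.directory = directory
        self._pending = queue.Queue()
        # Dedicated checkpoint thread that performs all file I/O.
        self._thread = threading.Thread(target=self._io_loop, daemon=True)
        self._thread.start()

    def checkpoint(self, step, state):
        # Stage 1 (assumed): blocking in-memory copy, so the application
        # can keep modifying its own data right afterwards.
        snapshot = copy.deepcopy(state)
        self._pending.put((step, snapshot))

    def _io_loop(self):
        # Stage 2 (assumed): write queued snapshots to disk in the background.
        while True:
            item = self._pending.get()
            if item is None:  # sentinel: no more checkpoints
                break
            step, snapshot = item
            final_path = os.path.join(self.directory, f"ckpt_{step}.json")
            tmp_path = final_path + ".tmp"
            with open(tmp_path, "w") as f:
                json.dump(snapshot, f)
            # Rename only after the write finished, so a crash never
            # leaves a partial file under the final name.
            os.replace(tmp_path, final_path)

    def close(self):
        # Flush all pending checkpoints and stop the I/O thread.
        self._pending.put(None)
        self._thread.join()


def run_demo(steps=5, every=2):
    """Toy 'computation' that checkpoints every `every` steps."""
    state = {"step": 0, "values": [0.0] * 4}
    with tempfile.TemporaryDirectory() as directory:
        ckpt = AsyncCheckpointer(directory)
        for step in range(1, steps + 1):
            state["step"] = step
            state["values"] = [v + 1.0 for v in state["values"]]
            if step % every == 0:
                ckpt.checkpoint(step, state)  # returns after the copy
        ckpt.close()
        written = sorted(os.listdir(directory))
        with open(os.path.join(directory, written[-1])) as f:
            last = json.load(f)
    return written, last
```

In this sketch the main loop blocks only for the in-memory copy; the cost of the file write is paid by the background thread. An MPI application would presumably run one such thread per process, which this single-process example does not attempt to show.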
APA:
Shahzad, F., Wittmann, M., Zeiser, T., & Wellein, G. (2012). Asynchronous Checkpointing by Dedicated Checkpoint Threads. In Recent Advances in the Message Passing Interface (pp. 289-290). Springer-Verlag.
MLA:
Shahzad, Faisal, et al. "Asynchronous Checkpointing by Dedicated Checkpoint Threads." Recent Advances in the Message Passing Interface. Springer-Verlag, 2012. 289-290.