Asynchronous Checkpointing by Dedicated Checkpoint Threads

Shahzad F, Wittmann M, Zeiser T, Wellein G (2012)


Publication Language: English

Publication Type: Book chapter / Article in edited volumes

Publication year: 2012

Publisher: Springer-Verlag

Edited Volumes: Recent Advances in the Message Passing Interface

Series: Lecture Notes in Computer Science

Book Volume: 7490

Pages Range: 289-290

ISBN: 978-3-642-33517-4

URI: http://link.springer.com/chapter/10.1007/978-3-642-33518-1_36

DOI: 10.1007/978-3-642-33518-1_36

Abstract

Checkpoint/restart (C/R) is a classical approach to introducing fault tolerance into large HPC applications. Although it is relatively simple compared with other fault-tolerance approaches, its overhead hinders wide adoption. We present an application-level checkpointing technique that significantly reduces the checkpoint overhead. The checkpoint I/O is overlapped with the computation of the application by following a two-stage checkpointing mechanism with dedicated threads for doing I/O. © 2012 Springer-Verlag.
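
The two-stage idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the class name `AsyncCheckpointer`, the JSON file format, the file naming scheme, and the toy `run` loop are all assumptions made for illustration, and the chapter itself targets MPI applications rather than a single Python process. Stage one takes a quick in-memory copy of the application state; stage two hands that copy to a dedicated thread that writes it to disk while the computation continues.

```python
import copy
import json
import os
import queue
import threading


class AsyncCheckpointer:
    """Hypothetical helper: stage 1 copies state in memory,
    stage 2 writes the copy to disk from a dedicated thread."""

    def __init__(self, directory):
        self.directory = directory
        self._pending = queue.Queue()
        self._thread = threading.Thread(target=self._writer, daemon=True)
        self._thread.start()

    def checkpoint(self, step, state):
        # Stage 1: blocking in-memory snapshot, so the application
        # can keep modifying `state` immediately afterwards.
        snapshot = copy.deepcopy(state)
        self._pending.put((step, snapshot))

    def _writer(self):
        # Stage 2: runs on the dedicated checkpoint thread.
        while True:
            item = self._pending.get()
            if item is None:  # shutdown sentinel
                return
            step, snapshot = item
            path = os.path.join(self.directory, f"ckpt_{step:06d}.json")
            tmp = path + ".tmp"
            with open(tmp, "w") as f:
                json.dump(snapshot, f)
            # Rename only after the write completes, so a crash never
            # leaves a partial file under the final checkpoint name.
            os.replace(tmp, path)

    def close(self):
        # Drain outstanding checkpoints, then stop the writer thread.
        self._pending.put(None)
        self._thread.join()


def run(steps, interval, directory):
    """Toy compute loop that checkpoints every `interval` steps."""
    state = {"step": 0, "values": [0.0] * 8}
    ckpt = AsyncCheckpointer(directory)
    for step in range(1, steps + 1):
        state["values"] = [v + 1.0 for v in state["values"]]
        state["step"] = step
        if step % interval == 0:
            ckpt.checkpoint(step, state)
    ckpt.close()
    return state
```

In this sketch the compute loop pays only for the in-memory copy at each checkpoint; the file write happens on the writer thread. A real HPC code would more likely use node-local buffers and a parallel file system, which this example does not attempt to model.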

How to cite

APA:

Shahzad, F., Wittmann, M., Zeiser, T., & Wellein, G. (2012). Asynchronous Checkpointing by Dedicated Checkpoint Threads. In Recent Advances in the Message Passing Interface (pp. 289-290). Springer-Verlag.

MLA:

Shahzad, Faisal, et al. "Asynchronous Checkpointing by Dedicated Checkpoint Threads." Recent Advances in the Message Passing Interface. Springer-Verlag, 2012. 289-290.
