An efficient, dynamically adaptive method to tolerate transient faults in multi-core systems

Aliee H, Zarandi HR (2011)


Publication Status: Published

Publication Type: Conference contribution

Publication year: 2011

Page Range: 53-58

Conference Proceedings Title: Proceedings of the 13th European Workshop on Dependable Computing (EWDC '11)

Event location: Pisa, IT

ISBN: 9781450302845

DOI: 10.1145/1978582.1978594

Abstract

This paper presents an adaptive, CPU-aware fault detection and recovery approach that dynamically adjusts the number of replicas in the system. The technique uses otherwise unused resources as redundancy; it is transparent to users and requires no modification to the application. It exploits the fact that, although future product designs are dedicated to multi-core processors, applications still exhibit poor parallelism, so underutilized CPU resources remain in the system and can be employed for fault tolerance. To this end, the system status is monitored periodically at runtime, and a set of redundant processes is created per application. To prevent performance degradation, the redundant processes are scheduled dynamically. The technique becomes more beneficial as the number of cores increases or when the application is I/O-based and leaves much of the CPU underutilized. Experimental results on a real quad-core system show that, on average, applications from standard benchmark suites such as SPLASH-2 and PARSEC, among others, utilize less than 20% of the CPU, which allows high fault detection and recovery with almost 10% performance overhead. Copyright © 2011 ACM.

How to cite

APA:

Aliee, H., & Zarandi, H.R. (2011). An efficient, dynamically adaptive method to tolerate transient faults in multi-core systems. In Proceedings of the 13th European Workshop on Dependable Computing (EWDC '11) (pp. 53-58). Pisa, IT.

MLA:

Aliee, Hananeh, and Hamid R. Zarandi. "An efficient, dynamically adaptive method to tolerate transient faults in multi-core systems." Proceedings of the 13th European Workshop on Dependable Computing, EWDC 2011, Pisa 2011. 53-58.
