Avritzer A, Grottke M, Menasché DS (2020)
Publication Language: English
Publication Type: Book chapter / Article in edited volumes
Publication year: 2020
Edited Volumes: Handbook of Software Aging and Rejuvenation
Page Range: 197-228
ISBN: 978-981-121-456-1
DOI: 10.1142/9789811214578_0008
Background: In this chapter, we present the application of software aging monitoring and software rejuvenation for the assessment of high-availability systems. In high-availability systems, the metric of interest is the transient performability during system recovery, also referred to as “survivability”. A survivability assessment requires the definition of the failure model. In addition, extensive testing using loads and configurations that are able to model the conditions customers encounter in production is required.
Aim: We describe the application of an agile devops methodology leveraging a failure model that incorporates aging-related failures. This agile devops methodology has been developed to integrate the failure reporting from production (i.e., ticket history), the Markov chain design, the performance test case design, the performance test case execution, the performance check and the decision to release the software.
Applicability domain: The domain of applicability of this study is mission-critical systems that employ high-availability strategies, such as software component hosting supporting open-source development, media streaming hardware and software supporting high-volume media processing, and online banking. Continuous integration, testing and operations are key parts of building software in the new devops paradigm.
Method: Our method involves the following steps, which span development and operations. Each step builds on the output of its predecessor: 1) an analysis of the ticket history generated by operations; 2) a Markov chain design derived from the ticket history; 3) a performance test case design based on the Markov chain analysis; 4) a performance test case execution for each software version; 5) a Markov chain parameterization based on the test case results; 6) an evaluation of the performance metrics of interest using the parameterized Markov chain; 7) performance checks; and 8) a new software release delivered to operations. The development team receives feedback on performance issues and bugs, and provides new releases for performance checking and testing.
Results: We present extensive measurements from a large industrial system, as well as a list of test cases identified from the high-availability Markov chain. These test cases were executed using the industrial system under study in this research, and the obtained test results are presented. These results were used in the Markov model parameterization.
Lessons learned: Several high-availability strategies, such as automated load balancing and software rejuvenation, require that failures be detected efficiently. We have found that high-availability designs built for fast failure recovery also need to provide fast failure detection, because in our experiments the failure recovery rates were dominated by the failure detection time. We believe that this finding can help high-availability system architects select which software aging and rejuvenation features to add to their high-availability designs.
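Illustration (not from the chapter): the effect of detection time on recovery can be sketched with a hypothetical three-state cyclic Markov chain (up, failed but undetected, recovering). The state layout, the function names and all rate values below are assumptions chosen for illustration only; they are not the chapter's model or its measurements.

```python
# Minimal sketch of a cyclic three-state continuous-time Markov chain:
#   UP --failure_rate--> UNDETECTED --detection_rate--> RECOVERING --repair_rate--> UP
# All rates are per hour. The model and the numbers are hypothetical.

def mean_recovery_time(detection_rate: float, repair_rate: float) -> float:
    """Mean time from failure until service is restored (hours)."""
    return 1.0 / detection_rate + 1.0 / repair_rate


def detection_share(detection_rate: float, repair_rate: float) -> float:
    """Fraction of the mean recovery time spent waiting for detection."""
    return (1.0 / detection_rate) / mean_recovery_time(detection_rate, repair_rate)


def steady_state_availability(
    failure_rate: float, detection_rate: float, repair_rate: float
) -> float:
    """Long-run fraction of time in UP.

    In a cyclic chain, each state's steady-state probability is proportional
    to its mean sojourn time (1 / outgoing rate).
    """
    up_time = 1.0 / failure_rate
    down_time = mean_recovery_time(detection_rate, repair_rate)
    return up_time / (up_time + down_time)


if __name__ == "__main__":
    # Assumed values: one failure per 1000 h, 30 min to detect, 1 min to repair.
    failure_rate, detection_rate, repair_rate = 1.0 / 1000.0, 2.0, 60.0
    print("detection share of recovery time:",
          detection_share(detection_rate, repair_rate))
    print("availability:",
          steady_state_availability(failure_rate, detection_rate, repair_rate))
    # Same repair speed, but detection ten times faster.
    print("availability with faster detection:",
          steady_state_availability(failure_rate, 10 * detection_rate, repair_rate))
```

Under these assumed rates, almost all of the downtime is spent waiting for detection, so speeding up the repair step alone would change the overall recovery rate very little.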
APA:
Avritzer, A., Grottke, M., & Menasché, D.S. (2020). Using Software Aging Monitoring and Rejuvenation for the Assessment of High-Availability Systems. In Handbook of Software Aging and Rejuvenation (pp. 197-228).
MLA:
Avritzer, Alberto, Michael Grottke, and Daniel S. Menasché. "Using Software Aging Monitoring and Rejuvenation for the Assessment of High-Availability Systems." Handbook of Software Aging and Rejuvenation. 2020. 197-228.