ClusterCockpit-A web application for job-specific performance monitoring

Eitzinger J, Gruber T, Afzal A, Zeiser T, Wellein G (2019)


Publication Type: Conference contribution

Publication year: 2019

Publisher: Institute of Electrical and Electronics Engineers Inc.

Book Volume: 2019-September

Conference Proceedings Title: Proceedings - IEEE International Conference on Cluster Computing, ICCC

Event location: Albuquerque, NM, US

ISBN: 9781728147345

DOI: 10.1109/CLUSTER.2019.8891017

Abstract

Monitoring is a common component of HPC system software. Until now, monitoring has focused mainly on health checking and system-level performance as well as on job scheduler information, and was targeted towards system administrators. Recently, job-specific performance monitoring based on hardware performance counter metrics has gained attention at academic HPC centers. HPC is becoming a mainstream tool that is also used by non-HPC experts, and HPC centers see a demand to check for pathological jobs and jobs with large optimization potential. The possibility to measure hardware performance counter data with negligible overhead allows assessment of efficient resource utilization and detection of pathological jobs. Pathological jobs are, e.g., jobs with errors in the batch script, jobs which do not terminate, jobs with severe load imbalance, or jobs that do not use any resources. This paper introduces ClusterCockpit, a web front-end tool tailor-made for job-specific performance monitoring. While many recent job-specific performance monitoring efforts concentrate on the measurement and data collection layers, ClusterCockpit provides a modern user interface targeted towards performance analysts as well as application users.

How to cite

APA:

Eitzinger, J., Gruber, T., Afzal, A., Zeiser, T., & Wellein, G. (2019). ClusterCockpit-A web application for job-specific performance monitoring. In Proceedings - IEEE International Conference on Cluster Computing, ICCC. Albuquerque, NM, US: Institute of Electrical and Electronics Engineers Inc.

MLA:

Eitzinger, Jan, et al. "ClusterCockpit-A web application for job-specific performance monitoring." Proceedings of the 2019 IEEE International Conference on Cluster Computing, CLUSTER 2019, Albuquerque, NM: Institute of Electrical and Electronics Engineers Inc., 2019.