Optimizing R with SparkR on a commodity cluster for biomedical research

Journal article

Publication Details

Author(s): Sedlmayr M, Würfl T, Maier C, Haeberle L, Fasching PA, Prokosch HU, Christoph J
Journal: Computer Methods and Programs in Biomedicine
Publication year: 2016
Volume: 137
Pages range: 321-328
ISSN: 0169-2607


Medical researchers are challenged today by the enormous amount of data collected in healthcare. Analysis methods such as genome-wide association studies (GWAS) are often computationally intensive and thus require enormous resources to be performed in a reasonable amount of time. While dedicated clusters and public clouds may deliver the desired performance, their use requires upfront financial efforts or anonymous data, which is often not possible for preliminary or occasional tasks. We explored the possibilities to build a private, flexible cluster for processing scripts in R based on commodity, non-dedicated hardware of our department.

For this, a GWAS-calculation in R on a single desktop computer, a Message Passing Interface (MPI)-cluster, and a SparkR-cluster were compared with regards to the performance, scalability, quality, and simplicity.

The original script had a projected runtime of three years on a single desktop computer. Optimizing the script in R already yielded a significant reduction in computing time (2 weeks). By using R-MPI and SparkR, we were able to parallelize the computation and reduce the time to less than three hours (2.6 h) on already available, standard office computers. While MPI is a proven approach in high-performance clusters, it requires rather static, dedicated nodes. SparkR and its Hadoop siblings allow for a dynamic, elastic environment with automated failure handling. SparkR also scales better with the number of nodes in the cluster than MPI due to optimized data communication.

R is a popular environment for clinical data analysis. The new SparkR solution offers elastic resources and allows supporting big data analysis using R even on non-dedicated resources with minimal change to the original code. To unleash the full potential, additional efforts should be invested to customize and improve the algorithms, especially with regards to data distribution.

FAU Authors / FAU Editors

Christoph, Jan
Lehrstuhl für Medizinische Informatik
Fasching, Peter Andreas PD Dr.
Professur für Translationale Frauenheilkunde und Geburtshilfe
Maier, Christian
Lehrstuhl für Medizinische Informatik
Prokosch, Hans-Ulrich Prof. Dr.
Lehrstuhl für Medizinische Informatik
Sedlmayr, Martin Dr.
Lehrstuhl für Medizinische Informatik
Würfl, Tobias
Lehrstuhl für Informatik 5 (Mustererkennung)

How to cite

Sedlmayr, M., Würfl, T., Maier, C., Haeberle, L., Fasching, P.A., Prokosch, H.-U., & Christoph, J. (2016). Optimizing R with SparkR on a commodity cluster for biomedical research. Computer Methods and Programs in Biomedicine, 137, 321-328. https://dx.doi.org/10.1016/j.cmpb.2016.10.006

Sedlmayr, Martin, et al. "Optimizing R with SparkR on a commodity cluster for biomedical research." Computer Methods and Programs in Biomedicine 137 (2016): 321-328.

