Journal article

Optimizing R with SparkR on a commodity cluster for biomedical research

Publication Details
Author(s): Sedlmayr M, Würfl T, Maier C, Haeberle L, Fasching P, Prokosch HU, Christoph J
Publication year: 2016
Volume: 137
Pages range: 321-328
ISSN: 0169-2607


Medical researchers are challenged today by the enormous amount of data collected in healthcare. Analysis methods such as genome-wide association studies (GWAS) are often computationally intensive and thus require enormous resources to be performed in a reasonable amount of time. While dedicated clusters and public clouds may deliver the desired performance, their use requires upfront financial efforts or anonymous data, which is often not possible for preliminary or occasional tasks. We explored the possibilities to build a private, flexible cluster for processing scripts in R based on commodity, non-dedicated hardware of our department.

For this, a GWAS-calculation in R on a single desktop computer, a Message Passing Interface (MPI)-cluster, and a SparkR-cluster were compared with regards to the performance, scalability, quality, and simplicity.

The original script had a projected runtime of three years on a single desktop computer. Optimizing the script in R already yielded a significant reduction in computing time (2 weeks). By using R-MPI and SparkR, we were able to parallelize the computation and reduce the time to less than three hours (2.6 h) on already available, standard office computers. While MPI is a proven approach in high-performance clusters, it requires rather static, dedicated nodes. SparkR and its Hadoop siblings allow for a dynamic, elastic environment with automated failure handling. SparkR also scales better with the number of nodes in the cluster than MPI due to optimized data communication.

R is a popular environment for clinical data analysis. The new SparkR solution offers elastic resources and allows supporting big data analysis using R even on non-dedicated resources with minimal change to the original code. To unleash the full potential, additional efforts should be invested to customize and improve the algorithms, especially with regards to data distribution.

How to cite
APA: Sedlmayr, M., Würfl, T., Maier, C., Haeberle, L., Fasching, P., Prokosch, H.-U., & Christoph, J. (2016). Optimizing R with SparkR on a commodity cluster for biomedical research. Computer Methods and Programs in Biomedicine, 137, 321-328.

MLA: Sedlmayr, Martin, et al. "Optimizing R with SparkR on a commodity cluster for biomedical research." Computer Methods and Programs in Biomedicine 137 (2016): 321-328.

Last updated on 2017-08-21 at 02:45