Optimizing R with SparkR on a commodity cluster for biomedical research

Sedlmayr M, Würfl T, Maier C, Haeberle L, Fasching P, Prokosch HU, Christoph J (2016)


Publication Type: Journal article

Publication year: 2016

Journal: Computer Methods and Programs in Biomedicine

Book Volume: 137

Pages Range: 321-328

DOI: 10.1016/j.cmpb.2016.10.006

Abstract

Medical researchers are challenged today by the enormous amount of data collected in healthcare. Analysis methods such as genome-wide association studies (GWAS) are often computationally intensive and thus require enormous resources to be performed in a reasonable amount of time. While dedicated clusters and public clouds may deliver the desired performance, their use requires upfront financial efforts or anonymous data, which is often not possible for preliminary or occasional tasks. We explored the possibilities to build a private, flexible cluster for processing scripts in R based on commodity, non-dedicated hardware of our department.

For this, a GWAS-calculation in R on a single desktop computer, a Message Passing Interface (MPI)-cluster, and a SparkR-cluster were compared with regards to the performance, scalability, quality, and simplicity.

The original script had a projected runtime of three years on a single desktop computer. Optimizing the script in R already yielded a significant reduction in computing time (2 weeks). By using R-MPI and SparkR, we were able to parallelize the computation and reduce the time to less than three hours (2.6 h) on already available, standard office computers. While MPI is a proven approach in high-performance clusters, it requires rather static, dedicated nodes. SparkR and its Hadoop siblings allow for a dynamic, elastic environment with automated failure handling. SparkR also scales better with the number of nodes in the cluster than MPI due to optimized data communication.

R is a popular environment for clinical data analysis. The new SparkR solution offers elastic resources and allows supporting big data analysis using R even on non-dedicated resources with minimal change to the original code. To unleash the full potential, additional efforts should be invested to customize and improve the algorithms, especially with regards to data distribution.

How to cite

APA:

Sedlmayr, M., Würfl, T., Maier, C., Haeberle, L., Fasching, P., Prokosch, H.-U., & Christoph, J. (2016). Optimizing R with SparkR on a commodity cluster for biomedical research. Computer Methods and Programs in Biomedicine, 137, 321-328. https://doi.org/10.1016/j.cmpb.2016.10.006

MLA:

Sedlmayr, Martin, et al. "Optimizing R with SparkR on a commodity cluster for biomedical research." Computer Methods and Programs in Biomedicine 137 (2016): 321-328.
