Characterization of missing values in untargeted MS-based metabolomics data and evaluation of missing data handling strategies

Do KT, Wahl S, Raffler J, Molnos S, Laimighofer M, Adamski J, Suhre K, Strauch K, Peters A, Gieger C, Langenberg C, Stewart ID, Theis FJ, Grallert H, Kastenmueller G, Krumsiek J (2018)

Publication Type: Journal article

Publication year: 2018

Journal

Metabolomics Springer Verlag (Germany)

Book Volume: 14

Article Number: 128

Journal Issue: 10

DOI: 10.1007/s11306-018-1420-2

Abstract

Background: Untargeted mass spectrometry (MS)-based metabolomics data often contain missing values that reduce statistical power and can introduce bias in biomedical studies. However, a systematic assessment of the various sources of missing values and strategies to handle these data has received little attention. Missing data can occur systematically, e.g. from run day-dependent effects due to limits of detection (LOD); or it can be random as, for instance, a consequence of sample preparation.

Methods: We investigated patterns of missing data in an MS-based metabolomics experiment of serum samples from the German KORA F4 cohort (n = 1750). We then evaluated 31 imputation methods in a simulation framework and biologically validated the results by applying all imputation approaches to real metabolomics data. We examined the ability of each method to reconstruct biochemical pathways from data-driven correlation networks, and the ability of the method to increase statistical power while preserving the strength of established metabolic quantitative trait loci.

Results: Run day-dependent LOD-based missing data accounts for most missing values in the metabolomics dataset. Although multiple imputation by chained equations performed well in many scenarios, it is computationally and statistically challenging. K-nearest neighbors (KNN) imputation on observations with variable pre-selection showed robust performance across all evaluation schemes and is computationally more tractable.

Conclusion: Missing data in untargeted MS-based metabolomics data occur for various reasons. Based on our results, we recommend that KNN-based imputation is performed on observations with variable pre-selection since it showed robust results in all evaluation schemes.
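The recommended approach, KNN imputation on observations with variable pre-selection, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `knn_impute_obs`, the defaults `k=5` and `min_abs_corr=0.2`, the root-mean-square distance over jointly observed variables, and the column-mean fallback are all assumptions chosen for illustration. The idea shown is that, for each metabolite with missing values, only sufficiently correlated metabolites are used to find the nearest samples, whose observed values are then averaged.

```python
import numpy as np


def knn_impute_obs(X, k=5, min_abs_corr=0.2):
    """Impute NaNs in X (samples x metabolites) from the k nearest samples.

    Illustrative sketch only. For each metabolite j with missing values,
    distances between samples are computed on metabolites whose absolute
    pairwise-complete Pearson correlation with j is at least min_abs_corr
    (the "variable pre-selection" step).
    """
    X = np.asarray(X, dtype=float)
    imputed = X.copy()
    n_vars = X.shape[1]

    for j in range(n_vars):
        miss = np.isnan(X[:, j])
        if not miss.any() or miss.all():
            continue  # nothing to impute, or no donors available

        # Variable pre-selection: keep metabolites correlated with j.
        selected = []
        for v in range(n_vars):
            if v == j:
                continue
            both = ~miss & ~np.isnan(X[:, v])
            if both.sum() < 3:
                continue
            a, b = X[both, j], X[both, v]
            if a.std() == 0 or b.std() == 0:
                continue
            if abs(np.corrcoef(a, b)[0, 1]) >= min_abs_corr:
                selected.append(v)

        donors = np.flatnonzero(~miss)  # samples with metabolite j observed
        for i in np.flatnonzero(miss):
            dist = np.full(len(donors), np.inf)
            if selected:
                diff = X[donors][:, selected] - X[i, selected]
                valid = ~np.isnan(diff)
                counts = valid.sum(axis=1)
                sq = np.where(valid, diff ** 2, 0.0).sum(axis=1)
                ok = counts > 0
                # Root-mean-square difference over jointly observed variables.
                dist[ok] = np.sqrt(sq[ok] / counts[ok])

            finite = np.isfinite(dist)
            if finite.any():
                cand = donors[finite]
                nearest = cand[np.argsort(dist[finite])[:k]]
                imputed[i, j] = X[nearest, j].mean()
            else:
                # Assumed fallback: mean of the observed values of metabolite j.
                imputed[i, j] = np.nanmean(X[:, j])

    return imputed
```

In practice the data would typically be log-transformed and scaled before distances are computed, and `k` and the correlation threshold would be tuned; those choices are left open here.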

Involved external institutions

Helmholtz Zentrum München - Deutsches Forschungszentrum für Gesundheit und Umwelt (HMGU) / Helmholtz Munich
Germany (DE)

Weill Cornell Medicine-Qatar (WCM-Q)
Qatar (QA)

University of Cambridge
United Kingdom (GB)

Deutsches Zentrum für Diabetesforschung e.V. (DZD)
Germany (DE)

How to cite

APA:

Do, K. T., Wahl, S., Raffler, J., Molnos, S., Laimighofer, M., Adamski, J., ... Krumsiek, J. (2018). Characterization of missing values in untargeted MS-based metabolomics data and evaluation of missing data handling strategies. Metabolomics, 14(10), Article 128. https://doi.org/10.1007/s11306-018-1420-2

MLA:

Do, Kieu Trinh, et al. "Characterization of missing values in untargeted MS-based metabolomics data and evaluation of missing data handling strategies." Metabolomics 14.10 (2018).
