Guiding questions to avoid data leakage in biological machine learning applications

Bernett J, Blumenthal DB, Grimm DG, Haselbeck F, Joeres R, Kalinina OV, List M (2024)


Publication Type: Journal article

Publication year: 2024

Journal: Nature Methods

Book Volume: 21

Pages Range: 1444-1453

Journal Issue: 8

DOI: 10.1038/s41592-024-02362-y

Abstract

Machine learning methods for extracting patterns from high-dimensional data are very important in the biological sciences. However, in certain cases, real-world applications cannot confirm the reported prediction performance. One of the main reasons for this is data leakage, which can be seen as the illicit sharing of information between the training data and the test data, resulting in performance estimates that are far better than the performance observed in the intended application scenario. Data leakage can be difficult to detect in biological datasets due to their complex dependencies. With this in mind, we present seven questions that should be asked to prevent data leakage when constructing machine learning models in biological domains. We illustrate the usefulness of our questions by applying them to nontrivial examples. Our goal is to raise awareness of potential data leakage problems and to promote robust and reproducible machine learning-based research in biology.

How to cite

APA:

Bernett, J., Blumenthal, D.B., Grimm, D.G., Haselbeck, F., Joeres, R., Kalinina, O.V., & List, M. (2024). Guiding questions to avoid data leakage in biological machine learning applications. Nature Methods, 21(8), 1444-1453. https://doi.org/10.1038/s41592-024-02362-y

MLA:

Bernett, Judith, et al. "Guiding questions to avoid data leakage in biological machine learning applications." Nature Methods 21.8 (2024): 1444-1453.
