ReClean: Reinforcement Learning for Automated Data Cleaning in ML Pipelines

Abdelaal M, Yayak AB, Klede K, Schöning H (2024)


Publication Language: English

Publication Type: Conference contribution, Original article

Publication year: 2024

Event location: Utrecht NL

URI: https://www.wis.ewi.tudelft.nl/assets/files/dbml2024/DBML24_paper_11.pdf

Open Access Link: https://www.wis.ewi.tudelft.nl/assets/files/dbml2024/DBML24_paper_11.pdf

Abstract

Addressing data quality issues is a challenging task due to the labor-intensive nature of manual data cleaning processes and the inadequacy of automated tools that lack effective repair strategies. In this paper, we introduce ReClean, a novel automated data-cleaning method, dedicated to ML pipelines, that employs reinforcement learning (RL) to optimize data-cleaning tasks. ReClean treats data cleaning as a sequential decision process, where RL agents learn to choose optimal data repair operations that improve ML model convergence and predictive performance. Our extensive experimental evaluation shows that ReClean surpasses existing baseline methods, successfully determining and applying data repair tools to enhance downstream predictive tasks automatically and without supervision.
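
To make the idea in the abstract concrete, the following is a minimal sketch of an agent that learns which repair operation to apply by using downstream model quality as its reward. It is not the authors' implementation: the repair functions (`fill_mean`, `fill_median`, `fill_zero`), the toy data, the one-parameter regression used as the downstream task, and the epsilon-greedy update are all assumptions chosen for illustration. It also collapses the sequential decision process described above into a single decision (a bandit) to keep the example short.

```python
import random
import statistics

# Hypothetical repair operations for a numeric column where None marks a missing value.
def fill_mean(col):
    m = statistics.mean(v for v in col if v is not None)
    return [m if v is None else v for v in col]

def fill_median(col):
    m = statistics.median(v for v in col if v is not None)
    return [m if v is None else v for v in col]

def fill_zero(col):
    return [0.0 if v is None else v for v in col]

REPAIRS = [fill_mean, fill_median, fill_zero]

def downstream_score(x, y):
    """Toy downstream task (an assumption): fit y ~ w * x, return negative MSE."""
    w = sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)
    mse = sum((w * a - b) ** 2 for a, b in zip(x, y)) / len(x)
    return -mse

def learn_repair(x_dirty, y, episodes=50, epsilon=0.2, seed=0):
    """Epsilon-greedy bandit over repair operations, rewarded by the downstream score."""
    rng = random.Random(seed)
    q = [0.0] * len(REPAIRS)       # running mean reward per repair operation
    counts = [0] * len(REPAIRS)
    for step in range(episodes):
        if step < len(REPAIRS):
            action = step          # try every repair once before exploiting
        elif rng.random() < epsilon:
            action = rng.randrange(len(REPAIRS))   # explore
        else:
            action = max(range(len(REPAIRS)), key=q.__getitem__)  # exploit
        reward = downstream_score(REPAIRS[action](x_dirty), y)
        counts[action] += 1
        q[action] += (reward - q[action]) / counts[action]
    best = max(range(len(REPAIRS)), key=q.__getitem__)
    return REPAIRS[best], q

# Toy data: y = 2 * x, one missing x value, and one large but valid x value.
X_DIRTY = [1.0, None, 1.0, 1.0, 100.0, 1.0]
Y = [2.0, 2.0, 2.0, 2.0, 200.0, 2.0]

best_repair, q_values = learn_repair(X_DIRTY, Y)
print(best_repair.__name__, q_values)
```

In this toy setting the agent settles on `fill_median`, because the large value at 100.0 pulls the column mean far from the missing entry's plausible value and so hurts the downstream fit. A sequential variant would add state (for example, which column or error type is being repaired) and learn a value per state and action rather than per action alone.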

How to cite

APA:

Abdelaal, M., Yayak, A.B., Klede, K., & Schöning, H. (2024). ReClean: Reinforcement Learning for Automated Data Cleaning in ML Pipelines. In Proceedings of the 3rd International Workshop on Databases and Machine Learning. Utrecht, NL.

MLA:

Abdelaal, Mohamed, et al. "ReClean: Reinforcement Learning for Automated Data Cleaning in ML Pipelines." Proceedings of the 3rd International Workshop on Databases and Machine Learning, Utrecht 2024.
