Collaborative Data Cleaning Framework: a Pilot Case Study for Machine Learning Development

Authors

  • Nikolaus Parulian, University of Illinois Urbana-Champaign, https://orcid.org/0000-0002-6971-0882
  • Bertram Ludäscher, University of Illinois Urbana-Champaign

DOI:

https://doi.org/10.2218/ijdc.v18i1.924

Abstract

This study experiments with collaborative data cleaning, a pivotal phase in data preparation for both analysis and machine learning. We used a provenance Data Cleaning Model (DCM) for multi-user scenarios to track changes to a dataset, and we conducted comprehensive experiments that simulate multiple data curators working collaboratively on that dataset. Furthermore, we analyzed how different data-cleaning scenarios, each aimed at improving the completeness and correctness of a dataset, affect the performance of downstream machine learning models.

 

Published

2024-12-09

Section

Conference Papers