Improving Scalability in Holistic Data Cleaning

Grant number: 18/20360-5
Support Opportunities: Scholarships abroad - Research Internship - Doctorate
Start date: January 01, 2019
End date: December 31, 2019
Field of knowledge: Physical Sciences and Mathematics - Computer Science - Computing Methodologies and Techniques
Principal Investigator: Caetano Traina Junior
Grantee: Paulo Henrique de Oliveira
Supervisor: Ihab Ilyas
Host Institution: Instituto de Ciências Matemáticas e de Computação (ICMC). Universidade de São Paulo (USP). São Carlos, SP, Brazil
Institution abroad: University of Waterloo, Canada
Associated to the scholarship: 15/15392-7 - Indexing Attribute Domains in Relational DBMS, BP.DR

Abstract

In real-world data, errors are the norm rather than the exception. To increase the value of data in analytics and decision-making, data scientists focus intensively on data cleaning tasks. Over the years, individual problems such as missing value imputation, outlier detection, and deduplication have been addressed separately. Recently, the scientific community has adopted a novel approach whose goal is to leverage all signals and resources (such as constraints, available statistics, and dictionaries) to accurately predict corrective actions: the idea is to take into account the "holistic" nature of the data cleaning process. Driven by the scalability challenges that such an approach introduces, this project aims to develop techniques that improve scalability in the data cleaning process. The internship will be hosted at the University of Waterloo and supervised by Prof. Ihab Francis Ilyas.

Scientific publications
(References retrieved automatically from Web of Science and SciELO through information on FAPESP grants and their corresponding numbers as mentioned in the publications by the authors)
SCABORA, LUCAS C.; SPADON, GABRIEL; OLIVEIRA, PAULO H.; RODRIGUES-JR, JOSE F.; TRAINA-JR, CAETANO. Enhancing recursive graph querying on RDBMS with data clustering approaches. Proceedings of the 35th Annual ACM Symposium on Applied Computing (SAC'20), 8 pp. (16/17078-0, 16/17330-1, 18/17620-5, 18/20360-5, 17/08376-0, 19/04461-9)