Enriching data analytics with incremental data cleaning and attribute domain management

Author(s):
Paulo Henrique de Oliveira
Total Authors: 1
Document type: Doctoral Thesis
Press: São Carlos.
Institution: Universidade de São Paulo (USP). Instituto de Ciências Matemáticas e de Computação (ICMC/SB)
Defense date:
Examining board members:
Caetano Traina Junior; Ricardo Marcondes Marcacini; Vanessa Braganholo Murta; Marcela Xavier Ribeiro
Advisor: Caetano Traina Junior
Abstract

In the present Big Data era, many businesses have become more data-driven, seeking to improve their decision-making processes based on solid Data Analytics practices. Several steps constitute the Data Analytics pipeline, and all of them involve specific approaches and technologies, which are constantly evolving. To accommodate new needs and trends, there is always room for improvement in the steps of Data Analytics. In this context, this PhD research has focused on improving two of those steps: (i) data cleaning and (ii) data analysis. Regarding the first step, we addressed the problem of performing data cleaning incrementally, considering dynamic scenarios with incoming data batches, as well as holistically, that is, jointly taking into account multiple error detection criteria. As a result, we have developed an incremental data cleaning framework that significantly outperforms competitors, achieving higher efficiency while compromising little on repair quality, and that addresses the problem in an innovative way, thereby filling a gap in the literature. Regarding the second step, we addressed the problem of handling queries over an Attribute Domain, which consists of the set of stored values within a domain of attributes, usually across multiple relations. As a result, we have proposed three contributions: (a) the Domain Index, an access method for efficiently performing queries over Attribute Domains, which we refer to as Domain Queries; (b) a comprehensive case study of Domain Indexes applied to the medical domain, focusing on content-based Domain Queries for supporting physicians in decision-making; and (c) an approach for supporting Attribute Domains as first-class citizens in a Relational Database Management System (RDBMS). Together, those contributions target a distinct category of queries which, until this PhD research was carried out, had not been addressed elsewhere in the literature. Experimental results highlight the superior performance enabled by the Domain Index compared to existing techniques of modern RDBMSs, which are not only inefficient in several scenarios but also not always applicable. Ultimately, those contributions enrich downstream data analyses. Hence, this PhD research advances the state of the art in the field of Data Analytics and opens several directions for future work. (AU)

FAPESP's process: 15/15392-7 - Indexing Attribute Domains in Relational DBMS
Grantee: Paulo Henrique de Oliveira
Support Opportunities: Scholarships in Brazil - Doctorate