The influence of pre-processing data techniques on classification algorithms

Grant number: 15/01382-0
Support type: Scholarships in Brazil - Doctorate
Effective date (Start): October 01, 2016
Effective date (End): September 30, 2020
Field of knowledge: Physical Sciences and Mathematics - Computer Science - Computing Methodologies and Techniques
Principal Investigator: André Carlos Ponce de Leon Ferreira de Carvalho
Grantee: Victor Hugo Barella
Home Institution: Instituto de Ciências Matemáticas e de Computação (ICMC). Universidade de São Paulo (USP). São Carlos, SP, Brazil
Associated research grant: 13/07375-0 - CeMEAI - Center for Mathematical Sciences Applied to Industry, AP.CEPID
Associated scholarship(s): 19/13015-2 - Meta-learning applied to imbalanced datasets using data complexity measures, BE.EP.DR


Data pre-processing is one of the most important steps in the data mining process, and also one of the most neglected. Data collection may suffer from manual errors and equipment problems, producing inconsistent, noisy, or missing data. Other aspects, such as class imbalance and overlapping classes, may also make the analysis more difficult. Ignoring these aspects in the learning process can impair the induction of a suitable model, since traditional machine learning algorithms have difficulty inducing a good model in these contexts. Furthermore, most of these problems are commonly handled independently, even though they are interrelated. The aim of this PhD project is to analyze and address noise, class imbalance, overlapping classes, and high dimensionality in an integrated manner, observing the relations between them. Data with these characteristics are often found in Molecular Biology. Thus, the project considers using molecular biology data in the analysis.