
The influence of pre-processing data techniques on classification algorithms

Grant number: 15/01382-0
Support Opportunities: Scholarships in Brazil - Doctorate
Effective date (Start): October 01, 2016
Effective date (End): November 30, 2020
Field of knowledge: Physical Sciences and Mathematics - Computer Science - Computing Methodologies and Techniques
Principal Investigator: André Carlos Ponce de Leon Ferreira de Carvalho
Grantee: Victor Hugo Barella
Host Institution: Instituto de Ciências Matemáticas e de Computação (ICMC). Universidade de São Paulo (USP). São Carlos, SP, Brazil
Associated research grant: 13/07375-0 - CeMEAI - Center for Mathematical Sciences Applied to Industry, AP.CEPID
Associated scholarship(s): 19/13015-2 - Meta-Learning Applied to Imbalanced Datasets Using Data Complexity Measures, BE.EP.DR


Data pre-processing is one of the most important steps in the data mining process, and one of the most neglected. Data collection may suffer from manual errors and equipment problems, producing inconsistent, noisy, or missing data. Other aspects, such as class imbalance and class overlap, may also hinder the analysis. Ignoring these aspects in the learning process can impair the induction of a suitable model, since traditional machine learning algorithms have difficulty inducing a good model in these contexts. Furthermore, although these problems are interrelated, they are usually treated independently. The aim of this PhD project is to analyze and address noise, class imbalance, class overlap, and high dimensionality in an integrated manner, observing the relationships among them. Data with these characteristics are often found in Molecular Biology, so the project plans to use molecular biology data in the analysis.


Scientific publications
(References retrieved automatically from Web of Science and SciELO through information on FAPESP grants and their corresponding numbers as mentioned in the publications by the authors)
BARELLA, VICTOR H.; GARCIA, LUIS P. F.; DE SOUTO, MARCILIO C. P.; LORENA, ANA C.; DE CARVALHO, ANDRE C. P. L. F. Assessing the data complexity of imbalanced datasets. INFORMATION SCIENCES, v. 553, p. 83-109. (13/07375-0, 15/01382-0, 12/22608-8)
Academic Publications
(References retrieved automatically from State of São Paulo Research Institutions)
BARELLA, Victor Hugo. Imbalanced classification tasks: measuring data complexity and recommending techniques. 2021. Doctoral Thesis - Universidade de São Paulo (USP). Instituto de Ciências Matemáticas e de Computação (ICMC/SB) São Carlos.
