Imbalanced classification tasks: measuring data complexity and recommending techniques

Author(s):
Victor Hugo Barella
Total Authors: 1
Document type: Doctoral Thesis
Press: São Carlos.
Institution: Universidade de São Paulo (USP). Instituto de Ciências Matemáticas e de Computação (ICMC/SB)
Defense date:
Examining board members:
André Carlos Ponce de Leon Ferreira de Carvalho; Gustavo Enrique de Almeida Prado Alves Batista; Ronaldo Cristiano Prati; Carlos Manuel Milheiro de Oliveira Pinto Soares
Advisor: André Carlos Ponce de Leon Ferreira de Carvalho
Abstract

Machine learning classification algorithms tend to perform poorly on datasets with class imbalance. Class imbalance is not a problem per se, but it has adverse effects when combined with other data characteristics, such as class overlap and noise. This study aims to measure data characteristics in imbalanced datasets and to recommend techniques for dealing with class imbalance by means of a meta-learning system. Popular data complexity measures were decomposed per class to better assess the characteristics of imbalanced datasets. They were applied to controlled artificial datasets and to real datasets. These measures were correlated with the predictive performance of several classification models. The measures were also evaluated before and after applying popular pre-processing techniques for imbalanced datasets. Moreover, a meta-learning system was implemented using popular meta-features along with the data complexity measures developed in this research. The results showed that decomposing the data complexity measures per class improved their ability to measure complexity in imbalanced datasets. Furthermore, according to the experimental results, these measures were the most important meta-features in the meta-learning system. Based on the results, data science practitioners should consider measuring the data complexity of imbalanced datasets, whether to interpret the data characteristics, select techniques, or develop new techniques. (AU)
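To make the idea of decomposing a complexity measure per class more concrete, the following is a minimal sketch and not the thesis's own implementation. It assumes a neighborhood-based measure, the leave-one-out error rate of a 1-nearest-neighbor classifier, and reports it both over the whole dataset and separately for each class. The function name `n3_per_class` and the toy data are hypothetical, chosen only for illustration.

```python
import numpy as np


def n3_per_class(X, y):
    """Leave-one-out 1-NN error rate, overall and split by class.

    Illustrative assumption: this stands in for one of the data
    complexity measures mentioned in the abstract; the thesis may
    define its measures differently.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    # Pairwise squared Euclidean distances between all examples.
    diff = X[:, None, :] - X[None, :, :]
    dist = np.einsum("ijk,ijk->ij", diff, diff)
    # An example must not be its own nearest neighbor.
    np.fill_diagonal(dist, np.inf)
    nearest = dist.argmin(axis=1)
    # An example counts as an error if its nearest neighbor has another label.
    errors = y[nearest] != y
    overall = float(errors.mean())
    per_class = {c.item(): float(errors[y == c].mean()) for c in np.unique(y)}
    return overall, per_class


# Hypothetical toy data: 8 majority examples (class 0) on a line and
# 2 minority examples (class 1) placed inside the majority region.
X = [[0], [1], [2], [3], [4], [5], [6], [7], [3.4], [4.6]]
y = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]

overall, per_class = n3_per_class(X, y)
print("overall:", overall)
print("per class:", per_class)
```

On this toy data the aggregate error rate is 0.5, while the per-class values are 0.375 for the majority class and 1.0 for the minority class, which illustrates how a single aggregate value can understate the difficulty faced by the minority class.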

FAPESP's process: 15/01382-0 - The influence of pre-processing data techniques on classification algorithms
Grantee: Victor Hugo Barella
Support Opportunities: Scholarships in Brazil - Doctorate