Assessing the data complexity of imbalanced datasets

Barella, Victor H.; Garcia, Luis P. F.; de Souto, Marcilio C. P.; Lorena, Ana C.; de Carvalho, Andre C. P. L. F.

Author(s):	Barella, Victor H. [1]; Garcia, Luis P. F. [2]; de Souto, Marcilio C. P. [3]; Lorena, Ana C. [4]; de Carvalho, Andre C. P. L. F. [1]
Affiliation:	[1] Univ Sao Paulo, Inst Math & Comp Sci, Trabalhador Sao Carlense Av 400, BR-13560970 Sao Paulo, SP - Brazil
	[2] Univ Brasilia, Comp Sci Dept, BR-70910900 Brasilia, DF - Brazil
	[3] Univ Orleans Leonard de Vinci, Fundamental Comp Sci Lab, BP 6759, F-45067 Orleans 2 - France
	[4] Aeronaut Inst Technol, Praca Marechal Eduardo Gomes 50, BR-12228900 Sao Jose Dos Campos, SP - Brazil
Document type:	Journal article
Source:	INFORMATION SCIENCES; v. 553, p. 83-109, APR 2021.
Web of Science Citations:	0
Abstract
Imbalanced datasets are an important challenge in supervised Machine Learning (ML). According to the literature, class imbalance does not necessarily impose difficulties for ML algorithms. Difficulties mainly arise from other characteristics, such as overlapping between classes and complex decision boundaries. For binary classification tasks, calculating imbalance is straightforward, e.g., as the ratio between class sizes. However, measuring more relevant characteristics, such as class overlapping, is not trivial. In recent years, complexity measures able to assess such dataset characteristics have been proposed. In this paper, we investigate their effectiveness on real imbalanced datasets and how they are affected by applying different data imbalance treatments (DITs). To this end, we perform two data-driven experiments: (1) We adapt the complexity measures to the context of imbalanced datasets. The experimental results show that our proposed measures assess the difficulty of imbalanced problems better than the original ones. We also compare the results with the state of the art in data complexity measures for imbalanced datasets. (2) We analyze the behavior of complexity measures before and after applying DITs. According to the results, the difference in data complexity, in general, correlates with the predictive performance improvement obtained by applying DITs to the original datasets. (C) 2020 Elsevier Inc. All rights reserved.
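To make the contrast in the abstract concrete, the following is a minimal Python sketch, not the authors' measures. It computes the imbalance ratio for a binary problem and, as an assumed stand-in for a class-overlap measure, a leave-one-out 1-nearest-neighbour error rate reported per class. The function names `imbalance_ratio` and `per_class_nn_error` and the toy data are hypothetical and chosen only for illustration.

```python
from collections import Counter
from math import dist


def imbalance_ratio(labels):
    """Majority class size divided by minority class size."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


def per_class_nn_error(points, labels):
    """Leave-one-out 1-NN error rate, reported separately for each class.

    Illustrative overlap proxy only; not a measure taken from the paper.
    """
    errors = Counter()
    totals = Counter(labels)
    for i, (p, y) in enumerate(zip(points, labels)):
        # Nearest neighbour among all other points (Euclidean distance).
        j = min((k for k in range(len(points)) if k != i),
                key=lambda k: dist(p, points[k]))
        if labels[j] != y:
            errors[y] += 1
    return {c: errors[c] / totals[c] for c in totals}


# Toy data (assumed): 6 majority points and 2 minority points on a line.
points = [(0.0,), (1.0,), (2.0,), (3.0,), (4.0,), (5.0,), (5.5,), (9.0,)]
labels = ["maj"] * 6 + ["min"] * 2

ir = imbalance_ratio(labels)
err = per_class_nn_error(points, labels)
print(ir, err)
```

On this toy data the imbalance ratio is 3.0, while the per-class error is 1/6 for the majority class and 1/2 for the minority class. A single error rate pooled over all points would be 2/8 and would hide how much harder the minority class is, which is one plausible reason to look at complexity per class in imbalanced problems.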

FAPESP's process:	13/07375-0 - CeMEAI - Center for Mathematical Sciences Applied to Industry
Grantee:	Francisco Louzada Neto
Support Opportunities:	Research Grants - Research, Innovation and Dissemination Centers - RIDC


FAPESP's process:	15/01382-0 - The influence of pre-processing data techniques on classification algorithms
Grantee:	Victor Hugo Barella
Support Opportunities:	Scholarships in Brazil - Doctorate


FAPESP's process:	12/22608-8 - Use of data complexity measures in the support of supervised machine learning
Grantee:	Ana Carolina Lorena
Support Opportunities:	Research Grants - Young Investigators Grants
