(Reference retrieved automatically from Web of Science, based on the FAPESP funding information and the corresponding grant number included in the publication by the authors.)

Assessing the data complexity of imbalanced datasets

Author(s):
Barella, Victor H. [1] ; Garcia, Luis P. F. [2] ; de Souto, Marcilio C. P. [3] ; Lorena, Ana C. [4] ; de Carvalho, Andre C. P. L. F. [1]
Total number of authors: 5
Author affiliation(s):
[1] Univ Sao Paulo, Inst Math & Comp Sci, Trabalhador Sao Carlense Av 400, BR-13560970 Sao Paulo, SP - Brazil
[2] Univ Brasilia, Comp Sci Dept, BR-70910900 Brasilia, DF - Brazil
[3] Univ Orleans Leonard de Vinci, Fundamental Comp Sci Lab, BP 6759, F-45067 Orleans 2 - France
[4] Aeronaut Inst Technol, Praca Marechal Eduardo Gomes 50, BR-12228900 Sao Jose Dos Campos, SP - Brazil
Total number of affiliations: 4
Document type: Scientific article
Source: INFORMATION SCIENCES; v. 553, p. 83-109, APR 2021.
Web of Science citations: 0
Abstract

Imbalanced datasets are an important challenge in supervised Machine Learning (ML). According to the literature, class imbalance does not necessarily impose difficulties for ML algorithms. Difficulties mainly arise from other characteristics, such as overlapping between classes and complex decision boundaries. For binary classification tasks, calculating imbalance is straightforward, e.g., the ratio between class sizes. However, measuring more relevant characteristics, such as class overlapping, is not trivial. In recent years, complexity measures able to assess more relevant dataset characteristics have been proposed. In this paper, we investigate their effectiveness on real imbalanced datasets and how they are affected by applying different data imbalance treatments (DITs). To this end, we perform two data-driven experiments: (1) We adapt the complexity measures to the context of imbalanced datasets. The experimental results show that our proposed measures assess the difficulty of imbalanced problems better than the original ones. We also compare the results with the state of the art in data complexity measures for imbalanced datasets. (2) We analyze the behavior of complexity measures before and after applying DITs. According to the results, the difference in data complexity, in general, correlates with the predictive performance improvement obtained by applying DITs to the original datasets. (C) 2020 Elsevier Inc. All rights reserved. (AU)
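
The abstract mentions that, for binary classification, imbalance can be computed as the ratio between class sizes. A minimal sketch of that idea follows, assuming the common majority-over-minority convention; the function name `imbalance_ratio` and the toy labels are illustrative and are not taken from the paper.

```python
from collections import Counter


def imbalance_ratio(labels):
    """Return the majority-class size divided by the minority-class size.

    Hypothetical helper for illustration; the paper may define
    imbalance differently.
    """
    counts = Counter(labels)
    if len(counts) < 2:
        raise ValueError("need at least two classes to compute a ratio")
    return max(counts.values()) / min(counts.values())


# Toy binary dataset: 90 negative examples and 10 positive examples.
labels = [0] * 90 + [1] * 10
print(imbalance_ratio(labels))
```

A ratio of 1.0 indicates balanced classes, and larger values indicate stronger imbalance. As the abstract points out, this number alone says little about class overlap or decision boundary complexity.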

FAPESP grant: 13/07375-0 - CeMEAI - Center for Mathematical Sciences Applied to Industry
Grantee: Francisco Louzada Neto
Support type: Research Grants - Research, Innovation and Dissemination Centers (CEPIDs)
FAPESP grant: 15/01382-0 - Influence of data treatment on classification algorithms
Grantee: Victor Hugo Barella
Support type: Scholarships in Brazil - Doctorate
FAPESP grant: 12/22608-8 - Use of data complexity measures to support supervised machine learning
Grantee: Ana Carolina Lorena
Support type: Research Grants - Young Investigators