New label noise injection methods for the evaluation of noise filters

Garcia, Luis P. F.; Lehmann, Jens; de Carvalho, Andre C. P. L. F.; Lorena, Ana C.

Author(s):	Garcia, Luis P. F. ^{[1, 2]} ; Lehmann, Jens ^{[2, 3]} ; de Carvalho, Andre C. P. L. F. ^{[1]} ; Lorena, Ana C. ^{[4, 5]} Total number of authors: 4
Author affiliation(s):	^[1] Univ Sao Paulo, Inst Ciencias Matemat & Comp, Trabalhador Sao Carlense Av 400, BR-13560970 Sao Carlos, SP - Brazil ^[2] Univ Leipzig, Inst Appl Informat, Hainstr 11, Leipzig, Saxony - Germany ^[3] Univ Bonn, Comp Sci Inst, Romerstr 164, Bonn, North Rhine Wes - Germany ^[4] Univ Fed Sao Paulo, Inst Ciencia Tecnol, Talim St 330, BR-12231280 Sao Jose Dos Campos, SP - Brazil ^[5] Inst Tecnol Aeronaut, Div Ciencia Comp, Praca Marechal Eduardo Gomes 50, BR-12228900 Sao Jose Dos Campos, SP - Brazil Total number of affiliations: 5
Document type:	Scientific article
Source:	KNOWLEDGE-BASED SYSTEMS; v. 163, p. 693-704, JAN 1 2019.
Web of Science citations:	0
Abstract
Noise is often present in real datasets used for training Machine Learning classifiers. Its disruptive effects on the learning process may include increased complexity of the induced models, longer processing time and reduced predictive power in the classification of new examples. Therefore, treating noisy data in a preprocessing step is crucial for improving data quality and reducing its harmful effects on the learning process. There are various filters that use different concepts to identify noisy examples in a dataset. Their ability in noise preprocessing is usually assessed by how well they identify artificial noise injected into one or more datasets. This is done to overcome the limitation that only a domain expert can guarantee whether a real example is indeed noisy. The most frequently used label noise injection method is the noise at random method, in which a percentage of the training examples have their labels randomly exchanged. This is carried out regardless of the characteristics and example space positions of the selected examples. This paper proposes two novel methods to inject label noise in classification datasets. These methods, based on complexity measures, can produce more challenging and realistic noisy datasets by disturbing the labels of critical examples situated close to the decision borders, and can improve the noise filtering evaluation. An extensive experimental evaluation of different noise filters is performed using public datasets with imputed label noise, and the influence of the noise injection methods is compared in both the data preprocessing and classification steps. (C) 2018 Elsevier B.V. All rights reserved. (AU)
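
As an illustration of the noise at random baseline described in the abstract, the following is a minimal Python sketch, not the authors' implementation. The function name `inject_random_label_noise` and the parameters `noise_rate` and `seed` are hypothetical, and the sketch assumes that "exchanged" means replacing a label with a different class chosen uniformly at random. The two methods proposed in the paper, which target examples near decision borders, are not reproduced here.

```python
import random


def inject_random_label_noise(labels, noise_rate, seed=None):
    """Return a noisy copy of `labels` and the sorted indices that were changed.

    Hypothetical helper: examples are selected uniformly at random,
    independently of their features or position in the example space.
    """
    if not 0.0 <= noise_rate <= 1.0:
        raise ValueError("noise_rate must be in [0, 1]")
    classes = sorted(set(labels))
    if len(classes) < 2:
        raise ValueError("need at least two classes to exchange labels")

    rng = random.Random(seed)
    n_noisy = round(noise_rate * len(labels))
    # Sample distinct positions without replacement.
    noisy_idx = sorted(rng.sample(range(len(labels)), n_noisy))

    noisy = list(labels)
    for i in noisy_idx:
        # Assumption: the new label is any class other than the original one.
        noisy[i] = rng.choice([c for c in classes if c != labels[i]])
    return noisy, noisy_idx


labels = ["a"] * 5 + ["b"] * 5
noisy_labels, noisy_idx = inject_random_label_noise(labels, noise_rate=0.2, seed=0)
```

Because the function returns the changed indices, a noise filter could be scored by comparing the examples it flags against `noisy_idx`.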

FAPESP grant:	16/18615-0 - Advanced machine learning
Grantee:	André Carlos Ponce de Leon Ferreira de Carvalho
Support type:	Research Grants - Research Partnership for Technological Innovation - PITE


FAPESP grant:	13/07375-0 - CeMEAI - Center for Mathematical Sciences Applied to Industry
Grantee:	Francisco Louzada Neto
Support type:	Research Grants - Research, Innovation and Dissemination Centers - CEPIDs


FAPESP grant:	12/22608-8 - Use of data complexity measures in the support of supervised machine learning
Grantee:	Ana Carolina Lorena
Support type:	Research Grants - Young Investigators
