(Reference retrieved automatically from Web of Science based on the FAPESP grant and its corresponding number, as cited in the publication by the authors.)

New label noise injection methods for the evaluation of noise filters

Author(s):
Garcia, Luis P. F. [1, 2] ; Lehmann, Jens [2, 3] ; de Carvalho, Andre C. P. L. F. [1] ; Lorena, Ana C. [4, 5]
Total Authors: 4
Affiliation:
[1] Univ Sao Paulo, Inst Ciencias Matemat & Comp, Trabalhador Sao Carlense Av 400, BR-13560970 Sao Carlos, SP - Brazil
[2] Univ Leipzig, Inst Appl Informat, Hainstr 11, Leipzig, Saxony - Germany
[3] Univ Bonn, Comp Sci Inst, Romerstr 164, Bonn, North Rhine Wes - Germany
[4] Univ Fed Sao Paulo, Inst Ciencia Tecnol, Talim St 330, BR-12231280 Sao Jose Dos Campos, SP - Brazil
[5] Inst Tecnol Aeronaut, Div Ciencia Comp, Praca Marechal Eduardo Gomes 50, BR-12228900 Sao Jose Dos Campos, SP - Brazil
Total Affiliations: 5
Document type: Journal article
Source: KNOWLEDGE-BASED SYSTEMS; v. 163, p. 693-704, JAN 1 2019.
Web of Science Citations: 0
Abstract

Noise is often present in the real datasets used to train Machine Learning classifiers. Its disruptive effects on the learning process include increasing the complexity of the induced models, longer processing times and reduced predictive power when classifying new examples. Therefore, treating noisy data in a preprocessing step is crucial for improving data quality and reducing its harmful effects on learning. There are various filters that use different concepts to identify noisy examples in a dataset. Their noise-preprocessing ability is usually assessed by how well they identify artificial noise injected into one or more datasets. This is done to overcome the limitation that only a domain expert can guarantee whether a real example is indeed noisy. The most frequently used label noise injection method is noise at random, in which a percentage of the training examples have their labels randomly exchanged, regardless of the characteristics and positions of the selected examples in the example space. This paper proposes two novel methods for injecting label noise into classification datasets. These methods, based on complexity measures, can produce more challenging and realistic noisy datasets by disturbing the labels of critical examples situated close to the decision borders, and can improve the evaluation of noise filters. An extensive experimental evaluation of different noise filters is performed on public datasets with imputed label noise, and the influence of the noise injection methods is compared in both the data preprocessing and classification steps. (C) 2018 Elsevier B.V. All rights reserved. (AU)
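To make the contrast concrete, the following is a minimal sketch of the two injection styles the abstract describes: the standard noise-at-random baseline, and a border-focused variant that flips the labels of examples lying close to the decision border. The function names and the 1-nearest-enemy distance used to rank border examples are illustrative choices, not the paper's actual complexity measures.

```python
import random

def inject_random_noise(labels, rate, classes, rng=None):
    """Noise at random: flip the labels of a random fraction of the
    examples, ignoring their position in the example space."""
    rng = rng or random.Random()
    labels = list(labels)
    n_flip = int(rate * len(labels))
    for i in rng.sample(range(len(labels)), n_flip):
        # exchange the label for a different, randomly chosen class
        labels[i] = rng.choice([c for c in classes if c != labels[i]])
    return labels

def inject_borderline_noise(points, labels, rate, classes, rng=None):
    """Border-focused variant: rank examples by squared distance to the
    nearest example of another class ("nearest enemy") and flip the
    labels of the closest ones. This is only an illustrative proxy for
    the complexity-measure criterion proposed in the paper."""
    rng = rng or random.Random()
    labels = list(labels)

    def margin(i):
        # squared distance to the nearest differently labeled example
        return min(
            sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
            for j in range(len(points))
            if labels[j] != labels[i]
        )

    order = sorted(range(len(labels)), key=margin)  # border examples first
    n_flip = int(rate * len(labels))
    for i in order[:n_flip]:
        labels[i] = rng.choice([c for c in classes if c != labels[i]])
    return labels
```

With two well-separated 2-D clusters, `inject_borderline_noise` flips exactly the examples facing the other cluster, whereas `inject_random_noise` may flip interior examples far from the border, which tends to be easier for a noise filter to detect.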

FAPESP's process: 16/18615-0 - Advanced machine learning
Grantee: André Carlos Ponce de Leon Ferreira de Carvalho
Support Opportunities: Research Grants - Research Partnership for Technological Innovation - PITE
FAPESP's process: 13/07375-0 - CeMEAI - Center for Mathematical Sciences Applied to Industry
Grantee: Francisco Louzada Neto
Support Opportunities: Research Grants - Research, Innovation and Dissemination Centers - RIDC
FAPESP's process: 12/22608-8 - Use of data complexity measures in the support of supervised machine learning
Grantee: Ana Carolina Lorena
Support Opportunities: Research Grants - Young Investigators Grants