Unsupervised active learning techniques for labeling training sets: An experimental evaluation on sequential data

Souza, Vinicius M. A.; Rossi, Rafael G.; Batista, Gustavo E. A. P. A.; Rezende, Solange O.

Author(s):	Souza, Vinicius M. A. ^[1] ; Rossi, Rafael G. ^[1, 2] ; Batista, Gustavo E. A. P. A. ^[1] ; Rezende, Solange O. ^[1] Total number of authors: 4
Author affiliation(s):	^[1] Univ Sao Paulo, ICMC, Av Trabalhador Sao Carlense 400, BR-13560970 Sao Carlos, SP - Brazil ^[2] Fed Univ Mato Grosso do Sul UFMS, Campo Grande, MS - Brazil Total number of affiliations: 2
Document type:	Scientific article
Source:	Intelligent Data Analysis; v. 21, n. 5, p. 1061+, 2017.
Web of Science citations:	0
Abstract
Many real-world applications, such as those related to sensors, allow large amounts of inexpensive unlabeled sequential data to be collected. However, the use of supervised machine learning methods is frequently hindered by the high cost of gathering labels for such data, because these methods assume that a considerable amount of labeled data is available to build an accurate classification model. To overcome this bottleneck, active learning methods are designed to selectively label the most informative examples instead of requesting all true labels. Although active learning has been widely used in many problems, most methods assume the presence of labeled data or some prior knowledge about the problem, such as the number of classes. In contrast, in this paper we are interested in the realistic scenario where active learning is performed from scratch on a fully unlabeled dataset, in the absence of any classifier or prior knowledge about the data. In general, methods that consider fully unlabeled data use random sampling to select the examples to label. The goal of this work is to present a broad experimental evaluation of different unsupervised active learning methods for selecting examples from fully unlabeled sequential data. We evaluated methods based on clustering algorithms and on graph centrality measures for instance selection, together with the performance of supervised and semi-supervised learning algorithms in the classification task. Based on our evaluation on a benchmark of sequential data and on a case study of insect species classification, we recommend sampling based on hierarchical clustering or k-Means. These methods perform statistically significantly better than the popular random sampling. In addition, they are simple algorithms and are readily available in many software packages. (AU)
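
To make the clustering-based selection described in the abstract more concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It runs a basic k-Means and then picks, from each cluster, the example nearest to the centroid as the one to send for labeling. Several points are assumptions that the abstract does not state: the selection rule (one representative per cluster, nearest to the centroid), the use of Euclidean distance on fixed-length sequences, the toy data, and the hypothetical names `euclidean` and `kmeans_select`.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length sequences (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans_select(data, k, n_iter=50, seed=0):
    """Return sorted indices of the examples closest to each k-Means centroid.

    Assumption: one representative per non-empty cluster is queried for a label.
    Requires n_iter >= 1 and k <= len(data).
    """
    rng = random.Random(seed)
    # Initialise centroids with k distinct examples chosen at random.
    centroids = [list(data[i]) for i in rng.sample(range(len(data)), k)]
    dim = len(data[0])

    for _ in range(n_iter):
        # Assignment step: attach each example to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for idx, row in enumerate(data):
            nearest = min(range(k), key=lambda c: euclidean(row, centroids[c]))
            clusters[nearest].append(idx)

        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c, members in enumerate(clusters):
            if not members:
                # An empty cluster keeps its previous centroid.
                new_centroids.append(centroids[c])
                continue
            new_centroids.append(
                [sum(data[i][d] for i in members) / len(members) for d in range(dim)]
            )
        if new_centroids == centroids:
            break  # assignments can no longer change
        centroids = new_centroids

    # Selection step: in each non-empty cluster, pick the member nearest
    # to the centroid as the example to be labeled.
    selected = []
    for c, members in enumerate(clusters):
        if members:
            selected.append(min(members, key=lambda i: euclidean(data[i], centroids[c])))
    return sorted(selected)


# Toy data (illustrative only): two well-separated groups of short sequences.
toy = [
    [0.0, 0.1, 0.0], [0.1, 0.0, 0.1], [0.0, 0.0, 0.2],
    [5.0, 5.1, 5.0], [5.1, 5.0, 5.1], [5.0, 5.0, 5.2],
]
to_label = kmeans_select(toy, k=2)
print(to_label)  # one index from each group
```

In a setting like the one the abstract describes, the returned indices would be the only examples shown to a human annotator, and the resulting labels would then train a supervised or semi-supervised classifier. Since no prior knowledge such as the number of classes is assumed, `k` here acts as a labeling budget rather than a class count.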

FAPESP grant:	14/08996-0 - Machine learning for WebSensors: algorithms and applications
Grantee:	Solange Oliveira Rezende
Support type:	Research Grants - Regular


FAPESP grant:	11/12823-6 - Extracting patterns from collections of textual documents using heterogeneous networks
Grantee:	Rafael Geraldeli Rossi
Support type:	Scholarships in Brazil - Doctorate


FAPESP grant:	11/17698-5 - Classification of non-stationary data streams with application to insect-identifying sensors
Grantee:	Vinícius Mourão Alves de Souza
Support type:	Scholarships in Brazil - Doctorate
