Unsupervised active learning techniques for labeling training sets: An experimental evaluation on sequential data

Souza, Vinicius M. A.; Rossi, Rafael G.; Batista, Gustavo E. A. P. A.; Rezende, Solange O.

Author(s):	Souza, Vinicius M. A. [1]; Rossi, Rafael G. [1, 2]; Batista, Gustavo E. A. P. A. [1]; Rezende, Solange O. [1] (Total Authors: 4)
Affiliation:	[1] Univ Sao Paulo, ICMC, Av Trabalhador Sao Carlense 400, BR-13560970 Sao Carlos, SP - Brazil; [2] Fed Univ Mato Grosso do Sul UFMS, Campo Grande, MS - Brazil (Total Affiliations: 2)
Document type:	Journal article
Source:	Intelligent Data Analysis; v. 21, n. 5, p. 1061+, 2017.
Web of Science Citations:	0
Abstract
Many real-world applications, such as those related to sensors, allow collecting large amounts of inexpensive unlabeled sequential data. However, the use of supervised machine learning methods is frequently hindered by the high cost of gathering labels for such data, since these methods assume the availability of a considerable amount of labeled data to build an accurate classification model. To overcome this bottleneck, active learning methods selectively label the most informative examples instead of requesting all true labels. Although active learning has been widely used in many problems, most methods assume the presence of labeled data or some prior knowledge about the problem, such as the number of classes. In contrast, in this paper we are interested in the realistic scenario where active learning is performed from scratch on a fully unlabeled dataset, in the absence of any classifier or prior knowledge about the data. In general, methods that consider fully unlabeled data use random sampling to select examples to label. The goal of this work is to present a broad experimental evaluation of different unsupervised active learning methods for selecting examples from fully unlabeled sequential data. We evaluated methods based on clustering algorithms and graph centrality measures for instance selection, as well as the performance of supervised and semi-supervised learning algorithms in the classification task. Based on our evaluation on a benchmark of sequential data and a case study of insect species classification, we recommend sampling based on hierarchical clustering or k-Means. These methods perform statistically significantly better than the popular random sampling. In addition, they are simple algorithms and readily available in many software packages.

FAPESP's process:	14/08996-0 - Machine learning for WebSensors: algorithms and applications
Grantee:	Solange Oliveira Rezende
Support Opportunities:	Regular Research Grants


FAPESP's process:	11/12823-6 - Pattern extraction from textual document collections using heterogeneous networks
Grantee:	Rafael Geraldeli Rossi
Support Opportunities:	Scholarships in Brazil - Doctorate


FAPESP's process:	11/17698-5 - Classification of non-stationary data stream with application in sensors for insect identification
Grantee:	Vinícius Mourão Alves de Souza
Support Opportunities:	Scholarships in Brazil - Doctorate
