Optimization and label propagation in bipartite heterogeneous networks to improve transductive classification of texts

Rossi, Rafael Geraldeli; Lopes, Alneu de Andrade; Rezende, Solange Oliveira

Author(s):	Rossi, Rafael Geraldeli [1]; Lopes, Alneu de Andrade [1]; Rezende, Solange Oliveira [1]
Total Authors:	3
Affiliation:	[1] Univ Sao Paulo, Inst Math & Comp Sci, BR-05508 Sao Paulo - Brazil
Total Affiliations:	1
Document type:	Journal article
Source:	INFORMATION PROCESSING & MANAGEMENT; v. 52, n. 2, p. 217-257, MAR 2016.
Web of Science Citations:	16
Abstract
Transductive classification is a useful way to classify texts when labeled training examples are insufficient. Several algorithms have been proposed for transductive classification of text collections represented in a vector space model. However, these algorithms are infeasible in practical applications because of the independence assumption among instances or terms and their other drawbacks. Network-based algorithms emerged to avoid the drawbacks of algorithms based on the vector space model and to improve transductive classification. Networks are mostly used for label propagation, in which some labeled objects propagate their labels to other objects through the network connections. Bipartite networks are useful for representing text collections as networks and performing label propagation. Generating this type of network avoids requirements such as collections with hyperlinks or citations, the computation of similarities among all texts in the collection, and the setup of a number of parameters. In a bipartite heterogeneous network, objects correspond to documents and terms, and the connections are given by the occurrences of terms in documents. Label propagation is performed from documents to terms and then from terms to documents, iteratively. Nevertheless, instead of using terms just as a means of label propagation, in this article we propose using the bipartite network structure to define the relevance scores of terms for classes through an optimization process and then propagating these relevance scores to define labels for unlabeled documents. The new document labels are used to redefine the relevance scores of terms, which in turn redefine the labels of unlabeled documents in an iterative process. We demonstrated that the proposed approach surpasses the algorithms for transductive classification based on the vector space model or on networks. Moreover, we demonstrated that the proposed algorithm effectively makes use of unlabeled documents to improve classification and that it is faster than other transductive algorithms. (C) 2015 Elsevier Ltd. All rights reserved.
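To make the abstract's description of propagation concrete, the following is a minimal sketch of generic label propagation on a bipartite document-term network: scores flow from documents to terms and then from terms back to documents, iteratively. It is not the optimization-based algorithm the article proposes. The function name `propagate_labels`, its parameters, the use of term frequencies as edge weights, the averaging rule, and the clamping of labeled documents are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def propagate_labels(docs, seed_labels, classes, n_iters=20):
    """Hypothetical helper: propagate class scores documents -> terms -> documents.

    docs:        dict mapping document id -> list of tokens
    seed_labels: dict mapping labeled document id -> class
    Returns a dict mapping document id -> predicted class (None if unreached).
    """
    # Edge weights of the bipartite network (assumption: raw term frequency).
    weights = {d: Counter(tokens) for d, tokens in docs.items()}

    # One score per class for every document; labeled documents start one-hot.
    doc_scores = {d: {c: 0.0 for c in classes} for d in docs}
    for d, c in seed_labels.items():
        doc_scores[d][c] = 1.0

    for _ in range(n_iters):
        # Documents -> terms: weighted average of the scores of the
        # documents in which each term occurs.
        term_scores = defaultdict(lambda: {c: 0.0 for c in classes})
        term_degree = defaultdict(float)
        for d, tf in weights.items():
            for t, w in tf.items():
                term_degree[t] += w
                for c in classes:
                    term_scores[t][c] += w * doc_scores[d][c]
        for t in term_scores:
            for c in classes:
                term_scores[t][c] /= term_degree[t]

        # Terms -> documents: weighted average of the scores of the terms
        # in each unlabeled document. Labeled documents stay clamped.
        for d, tf in weights.items():
            total = sum(tf.values())
            if d in seed_labels or total == 0:
                continue
            doc_scores[d] = {
                c: sum(w * term_scores[t][c] for t, w in tf.items()) / total
                for c in classes
            }

    predicted = {}
    for d, scores in doc_scores.items():
        best = max(classes, key=lambda c: scores[c])
        predicted[d] = best if scores[best] > 0 else None
    return predicted


# Toy collection: two labeled and two unlabeled documents.
docs = {
    "d1": ["goal", "match", "team"],
    "d2": ["election", "vote", "party"],
    "d3": ["team", "goal", "coach"],
    "d4": ["vote", "party", "senate"],
}
seed_labels = {"d1": "sports", "d2": "politics"}
predicted = propagate_labels(docs, seed_labels, ["sports", "politics"])
```

In this toy collection, `d3` shares terms only with the labeled sports document and `d4` only with the labeled politics document, so each unlabeled document receives its label through the shared terms. No hyperlinks, citations, or pairwise document similarities are needed to build the network.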

FAPESP's process:	11/12823-6 - Pattern extraction from textual document collections using heterogeneous networks
Grantee:	Rafael Geraldeli Rossi
Support Opportunities:	Scholarships in Brazil - Doctorate


FAPESP's process:	11/22749-8 - Challenges in exploratory visualization of multidimensional data: paradigms, scalability and applications
Grantee:	Luis Gustavo Nonato
Support Opportunities:	Research Projects - Thematic Grants


FAPESP's process:	14/08996-0 - Machine learning for WebSensors: algorithms and applications
Grantee:	Solange Oliveira Rezende
Support Opportunities:	Regular Research Grants
