Propagação em grafos bipartidos para extração de tópicos em fluxo de documentos textuais (Propagation in bipartite graphs for topic extraction in streams of textual documents)

Thiago de Paulo Faleiros

Author(s):	Thiago de Paulo Faleiros
Document type:	Doctoral Thesis
Press:	São Carlos.
Institution:	Universidade de São Paulo (USP). Instituto de Ciências Matemáticas e de Computação (ICMC/SB)
Defense date:	2016-06-08
Examining board members:	Alneu de Andrade Lopes; Maria Cristina Ferreira de Oliveira; Gisele Lobo Pappa; Marcos Gonçalves Quiles; Ivan Nunes da Silva
Advisor:	Alneu de Andrade Lopes
Abstract
Handling large amounts of data is a requirement for modern text mining algorithms. In some applications, documents are published continuously, which makes long-term storage costly. Easily adaptable methods are therefore needed that treat documents as a stream and can analyze the data in a single pass, without incurring the high cost of storage. Another requirement is that such an approach can exploit heuristics to improve the quality of the results. Several models for automatically extracting latent information from a collection of documents have been proposed in the literature, among which probabilistic topic models are prominent. Probabilistic topic models achieve good practical results and have been extended into several models that incorporate different types of information. However, properly describing these models, deriving them, and then obtaining appropriate inference algorithms are difficult tasks that require a rigorous mathematical treatment of the operations performed in the process of discovering latent dimensions. Thus, developing a simple and efficient method for the discovery of latent dimensions requires a proper representation of the data. The hypothesis of this thesis is that, by representing textual data as a bipartite graph, one can address the discovery of latent patterns present in the relationships between documents and words in a simple and intuitive way. To validate this hypothesis, we developed a framework based on a label propagation algorithm that uses the bipartite graph representation. The framework, called PBG (Propagation in Bipartite Graph), was first applied in the unsupervised setting to a static collection of documents. A semi-supervised version was then proposed, which needs only a small number of labeled documents for the transductive classification task. Finally, it was applied in the dynamic setting, in which a stream of textual data is considered. Comparative analyses were performed, and the results indicate that PBG is a viable and competitive alternative for tasks in the unsupervised and semi-supervised contexts. (AU)

FAPESP's process:	11/23689-9 - Propagation in bipartite graphs for Topic Extraction in Data Streams
Grantee:	Thiago de Paulo Faleiros
Support Opportunities:	Scholarships in Brazil - Doctorate
