Using complex networks to classify texts

Author(s):
Diego Raphael Amancio
Total Authors: 1
Document type: Doctoral Thesis
Press: São Carlos.
Institution: Universidade de São Paulo (USP). Instituto de Física de São Carlos (IFSC/BT)
Defense date:
Examining board members:
Luciano da Fontoura Costa; Aparecido Nilceu Marana; Maria das Graças Volpe Nunes; Jose Hiroki Saito; Antonio Carlos Roque da Silva Filho
Advisor: Luciano da Fontoura Costa; Osvaldo Novais de Oliveira Junior
Abstract

The automatic classification of texts into pre-established categories is drawing increasing interest owing to the need to organize the ever-growing number of electronic documents. The prevailing approach to classification is based on the analysis of textual content. In this thesis, we investigate the applicability of attributes based on textual style using the complex network (CN) representation, in which nodes represent words and edges represent adjacency relations. We studied the suitability of CN measurements for natural language processing tasks, with classification assisted by supervised and unsupervised machine learning methods. A detailed study of topological measurements in texts revealed that several of them are informative, in the sense that they can distinguish meaningful texts from shuffled ones. Moreover, most measurements depend on syntactic factors, while intermittency measurements are more sensitive to semantic factors. As for the use of the CN model in practical scenarios, there is a significant correlation between authors' style and network topology. We achieved an accuracy rate of 65% in discriminating among eight authors of novels using network and intermittency measurements. During the stylistic analysis, we also found that books belonging to the same literary movement could be identified from their similar topological features. The network model also proved useful for disambiguating word senses. Upon employing only topological information to characterize nodes representing polysemous words, we found a strong relationship between syntax and semantics. For several words, the CN approach performed surprisingly better than the method based on recurrence patterns of neighboring words. The studies carried out in this thesis confirm that stylistic and semantic aspects play a crucial role in the structural organization of word adjacency networks. The word adjacency model investigated here might be useful not only to provide insight into the underlying mechanisms of language, but also to enhance the performance of real applications implementing both CN and traditional approaches. (AU)
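
The word adjacency representation described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the tokenizer, the choice of an undirected network, the use of the clustering coefficient as the example topological measurement, and all function names are assumptions made here for illustration. The record does not say which preprocessing steps or which measurements the author actually used.

```python
import random
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and keep alphabetic word tokens (assumed preprocessing)."""
    return re.findall(r"[a-z]+", text.lower())


def build_adjacency_network(tokens):
    """Map each word to the set of words that occur immediately next to it."""
    network = defaultdict(set)
    for left, right in zip(tokens, tokens[1:]):
        if left != right:  # skip self-loops caused by repeated words
            network[left].add(right)
            network[right].add(left)
    return dict(network)


def clustering_coefficient(network, word):
    """Fraction of pairs of a word's neighbors that are themselves linked."""
    neighbors = sorted(network[word])
    k = len(neighbors)
    if k < 2:
        return 0.0
    linked = sum(
        1
        for i, a in enumerate(neighbors)
        for b in neighbors[i + 1:]
        if b in network[a]
    )
    return 2.0 * linked / (k * (k - 1))


def average_clustering(network):
    """Mean clustering coefficient over all words in the network."""
    if not network:
        return 0.0
    return sum(clustering_coefficient(network, w) for w in network) / len(network)


def shuffled(tokens, seed=0):
    """Return a randomly reordered copy of the tokens (a shuffled-text baseline)."""
    copy = list(tokens)
    random.Random(seed).shuffle(copy)
    return copy
```

To mirror the comparison between meaningful and shuffled texts, one could tokenize a document, compute `average_clustering(build_adjacency_network(tokens))`, and compare it with the same quantity for `shuffled(tokens)`. A measurement whose value changes consistently under shuffling would be informative in the sense used in the abstract.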

FAPESP's process: 10/00927-9 - Using complex networks to classify texts
Grantee: Diego Raphael Amancio
Support Opportunities: Scholarships in Brazil - Doctorate (Direct)