Using complex networks to classify texts

Grant number: 10/00927-9
Support type: Scholarships in Brazil - Doctorate (Direct)
Effective date (Start): June 01, 2010
Effective date (End): July 31, 2013
Field of knowledge: Interdisciplinary Subjects
Principal Investigator: Luciano da Fontoura Costa
Grantee: Diego Raphael Amancio
Home Institution: Instituto de Física de São Carlos (IFSC). Universidade de São Paulo (USP). São Carlos, SP, Brazil
Associated research grant: 05/00587-5 - Mesh (graph) modeling and techniques of pattern recognition: structure, dynamics and applications, AP.TEM

Abstract

The automatic classification of texts into pre-established categories is drawing increasing interest owing to the need to organize the ever-growing number of electronic documents. The prevailing approach to classification is based on the analysis of textual content. In this thesis, we investigate the applicability of attributes based on textual style using the Complex Network (CN) representation, where nodes represent words and edges are adjacency relations. We studied the suitability of CN measurements for natural language processing tasks, with classification being assisted by supervised and unsupervised machine learning methods. A detailed study of topological measurements in texts revealed that several measurements are informative in the sense that they are able to distinguish meaningful from shuffled texts. Moreover, most measurements depend on syntactic factors, while intermittency measurements are more sensitive to semantic factors. As for the use of the CN model in practical scenarios, there is significant correlation between authors' style and network topology. We achieved an accuracy rate of 65% in discriminating eight authors of novels with the use of network and intermittency measurements. During the stylistic analysis, we also found that books belonging to the same literary movement could be identified from their similar topological features. The network model also proved useful for disambiguating word senses. Upon employing only topological information to characterize nodes representing polysemous words, we found a strong relationship between syntax and semantics. For several words, the CN approach performed surprisingly better than the method based on recurrence patterns of neighboring words. The studies carried out in this thesis confirm that stylistic and semantic aspects play a crucial role in the structural organization of word adjacency networks.
The word adjacency model investigated here might be useful not only to provide insight into the underlying mechanisms of the language, but also to enhance the performance of real applications implementing both CN and traditional approaches. (AU)
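
As an illustration of the word adjacency representation described in the abstract, the following minimal Python sketch builds a network in which nodes are words and edges link words that appear next to each other, then computes two common topological measurements (degree and clustering coefficient). It is not the thesis implementation: the function names, the simple regular-expression tokenizer, the use of undirected and unweighted edges, and the absence of any preprocessing such as stopword removal are all assumptions made for brevity.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into word tokens (simplistic assumption)."""
    return re.findall(r"[a-z']+", text.lower())


def build_adjacency_network(tokens):
    """Link each word to the word that immediately follows it.

    Returns an undirected, unweighted graph as {word: set of neighbouring words}.
    """
    graph = defaultdict(set)
    for left, right in zip(tokens, tokens[1:]):
        if left != right:  # skip self-loops caused by immediately repeated words
            graph[left].add(right)
            graph[right].add(left)
    return dict(graph)


def degree(graph, word):
    """Number of distinct words adjacent to `word` in the text."""
    return len(graph.get(word, ()))


def clustering(graph, word):
    """Fraction of pairs of neighbours of `word` that are themselves linked."""
    neighbours = list(graph.get(word, ()))
    k = len(neighbours)
    if k < 2:
        return 0.0
    links = sum(
        1
        for i in range(k)
        for j in range(i + 1, k)
        if neighbours[j] in graph[neighbours[i]]
    )
    return 2.0 * links / (k * (k - 1))


# Toy example text (hypothetical, not taken from the thesis corpus).
text = "the cat sat on the mat and the dog sat on the cat"
network = build_adjacency_network(tokenize(text))

for word in sorted(network):
    print(word, degree(network, word), round(clustering(network, word), 3))
```

Per-word values such as these could be aggregated (for example, averaged over all nodes) into a feature vector for each text and passed to a classifier. Building the same network from a randomly shuffled copy of the tokens gives a baseline for checking which measurements separate meaningful text from shuffled text.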

Scientific publications (15)
(References retrieved automatically from Web of Science and SciELO through information on FAPESP grants and their corresponding numbers as mentioned in the publications by the authors)
AMANCIO, DIEGO R.; OLIVEIRA, JR., OSVALDO N.; COSTA, LUCIANO DA F. Topological-collaborative approach for disambiguating authors' names in collaborative networks. SCIENTOMETRICS, v. 102, n. 1, p. 465-485, JAN 2015. Web of Science Citations: 6.
AMANCIO, DIEGO RAPHAEL; COMIN, CESAR HENRIQUE; CASANOVA, DALCIMAR; TRAVIESO, GONZALO; BRUNO, ODEMIR MARTINEZ; RODRIGUES, FRANCISCO APARECIDO; COSTA, LUCIANO DA FONTOURA. A Systematic Comparison of Supervised Classifiers. PLoS One, v. 9, n. 4 APR 24 2014. Web of Science Citations: 68.
AMANCIO, DIEGO R.; ALTMANN, EDUARDO G.; RYBSKI, DIEGO; OLIVEIRA, JR., OSVALDO N.; COSTA, LUCIANO DA F. Probing the Statistical Properties of Unknown Texts: Application to the Voynich Manuscript. PLoS One, v. 8, n. 7 JUL 2 2013. Web of Science Citations: 27.
SILVA, THIAGO C.; AMANCIO, DIEGO R. Discriminating word senses with tourist walks in complex networks. European Physical Journal B, v. 86, n. 7 JUL 2013. Web of Science Citations: 2.
SILVA, THIAGO CHRISTIANO; AMANCIO, DIEGO RAPHAEL. Network-based stochastic competitive learning approach to disambiguation in collaborative networks. Chaos, v. 23, n. 1 MAR 2013. Web of Science Citations: 2.
AMANCIO, DIEGO R.; ALUISIO, SANDRA M.; OLIVEIRA, JR., OSVALDO N.; COSTA, LUCIANO DA F. Complex networks analysis of language complexity. EPL, v. 100, n. 5 DEC 2012. Web of Science Citations: 20.
AMANCIO, DIEGO R.; OLIVEIRA, JR., OSVALDO N.; COSTA, LUCIANO DA F. A decaying factor accounts for contained activity in neuronal networks with no need of hierarchical or modular organization. JOURNAL OF STATISTICAL MECHANICS-THEORY AND EXPERIMENT, NOV 2012. Web of Science Citations: 1.
AMANCIO, DIEGO R.; OLIVEIRA, JR., OSVALDO N.; COSTA, LUCIANO DA F. On the use of topological features and hierarchical characterization for disambiguating names in collaborative networks. EPL, v. 99, n. 4 AUG 2012. Web of Science Citations: 20.
AMANCIO, D. R.; NUNES, M. G. V.; OLIVEIRA, JR., O. N.; COSTA, L. DA F. Using complex networks concepts to assess approaches for citations in scientific papers. SCIENTOMETRICS, v. 91, n. 3, p. 827-842, JUN 2012. Web of Science Citations: 21.
SILVA, THIAGO C.; AMANCIO, DIEGO R. Word sense disambiguation via high order of learning in complex networks. EPL, v. 98, n. 5 JUN 2012. Web of Science Citations: 21.
AMANCIO, DIEGO RAPHAEL; OLIVEIRA, JR., OSVALDO N.; COSTA, LUCIANO DA FONTOURA. Identification of literary movements using complex networks to represent texts. NEW JOURNAL OF PHYSICS, v. 14, APR 23 2012. Web of Science Citations: 22.
AMANCIO, DIEGO R.; OLIVEIRA, JR., OSVALDO N.; COSTA, LUCIANO DA F. Unveiling the relationship between complex networks metrics and word senses. EPL, v. 98, n. 1 APR 2012. Web of Science Citations: 23.
AMANCIO, DR; NUNES, M.G.V.; OLIVEIRA, ON; DA F. COSTA, L. Using complex networks concepts to assess approaches for citations in scientific papers. SCIENTOMETRICS, p. 1-16, 2012.
AMANCIO, D. R.; OLIVEIRA, JR., O. N.; COSTA, L. DA F. On the concepts of complex networks to quantify the difficulty in finding the way out of labyrinths. PHYSICA A-STATISTICAL MECHANICS AND ITS APPLICATIONS, v. 390, n. 23-24, p. 4673-4683, NOV 1 2011. Web of Science Citations: 6.
AMANCIO, D. R.; NUNES, M. G. V.; OLIVEIRA, JR., O. N.; PARDO, T. A. S.; ANTIQUEIRA, L.; COSTA, L. DA F. Using metrics from complex networks to evaluate machine translation. PHYSICA A-STATISTICAL MECHANICS AND ITS APPLICATIONS, v. 390, n. 1, p. 131-142, JAN 1 2011. Web of Science Citations: 26.
Academic Publications
(References retrieved automatically from State of São Paulo Research Institutions)
AMANCIO, Diego Raphael. Using complex networks to classify texts. 2013. Doctoral Thesis - Universidade de São Paulo (USP). Instituto de Física de São Carlos, São Carlos.
