(Reference retrieved automatically from Web of Science through information on FAPESP grant and its corresponding number as mentioned in the publication by the authors.)

Paragraph-based representation of texts: A complex networks approach

Author(s):
de Arruda, Henrique F. [1] ; Marinho, Vanessa Q. [1] ; Costa, Luciano da F. [2] ; Amancio, Diego R. [1, 3]
Total Authors: 4
Affiliation:
[1] Univ Sao Paulo, Inst Math & Comp Sci, Sao Carlos, SP - Brazil
[2] Univ Sao Paulo, Sao Carlos Inst Phys, Sao Carlos, SP - Brazil
[3] Indiana Univ, Sch Informat Comp & Engn, Bloomington, IN 47408 - USA
Total Affiliations: 3
Document type: Journal article
Source: INFORMATION PROCESSING & MANAGEMENT; v. 56, n. 3, p. 479-494, MAY 2019.
Web of Science Citations: 0
Abstract

An interesting model to represent texts as a graph (also called network) is the word adjacency (co-occurrence) representation, which is known to capture mainly syntactical features of texts. In this study, we propose a novel network model, which is based on the similarity between the content of the paragraphs of the text. By considering this representation, we characterized the networks with respect to measurements developed in the network science area. We characterized these measurements according to their properties regarding their ability to discriminate between real and shuffled texts, and to capture information regarding the content similarity of chunks of text. In order to compare the results with a more sophisticated approach, we employed a methodology based on word2vec. When comparing real and shuffled texts, the results revealed that real texts tend to have a more well-defined community structure. This characteristic can be related to the organization of subjects in real texts. The network-based measurements that were found to be able to discriminate real from shuffled texts were used as features in a classifier. As a result, the obtained accuracy was 98.72%. In order to compare with a different methodology, we used doc2vec-based features in the classifier, yielding an accuracy rate of 70.8%. The proposed network-based features were employed to analyze the Voynich manuscript, which was found to be compatible with real texts according to the considered characteristics. (AU)
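As an illustration of the idea described in the abstract, the following minimal sketch links paragraphs whose content is similar. It is not the authors' construction: the bag-of-words vectors, the cosine similarity measure, the threshold value of 0.3, and the names bag_of_words, cosine and paragraph_network are all assumptions made for this example.

```python
import math
import re
from collections import Counter
from itertools import combinations


def bag_of_words(paragraph):
    """Count lowercase alphabetic tokens in one paragraph."""
    return Counter(re.findall(r"[a-z]+", paragraph.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def paragraph_network(paragraphs, threshold=0.3):
    """Return {(i, j): similarity} for paragraph pairs at or above threshold.

    Nodes are paragraph indices; the threshold is an assumed parameter,
    not a value taken from the publication.
    """
    vectors = [bag_of_words(p) for p in paragraphs]
    edges = {}
    for i, j in combinations(range(len(vectors)), 2):
        sim = cosine(vectors[i], vectors[j])
        if sim >= threshold:
            edges[(i, j)] = sim
    return edges


# Toy input: the first two paragraphs share vocabulary, the third does not.
paragraphs = [
    "networks model texts as graphs of words",
    "graphs of words capture syntax in texts",
    "the manuscript remains undeciphered today",
]
edges = paragraph_network(paragraphs)
print(edges)
```

In a network built this way, paragraphs on the same subject would tend to form densely connected groups, which is the kind of community structure the abstract reports as more pronounced in real texts than in shuffled ones.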

FAPESP's process: 15/05676-8 - Development of new models for authorship recognition using complex networks
Grantee: Vanessa Queiroz Marinho
Support type: Scholarships in Brazil - Master
FAPESP's process: 15/22308-2 - Intermediate representations in Computational Science for knowledge discovery
Grantee: Roberto Marcondes Cesar Junior
Support type: Research Projects - Thematic Grants
FAPESP's process: 17/13464-6 - Modelling citation and information graphs: a complex network approach
Grantee: Diego Raphael Amancio
Support type: Scholarships abroad - Research
FAPESP's process: 16/19069-9 - Using semantical information to classify texts modelled as complex networks
Grantee: Diego Raphael Amancio
Support type: Regular Research Grants
FAPESP's process: 11/50761-2 - Models and methods of e-Science for life and agricultural sciences
Grantee: Roberto Marcondes Cesar Junior
Support type: Research Projects - Thematic Grants