Paragraph-based representation of texts: A complex networks approach

de Arruda, Henrique F.; Marinho, Vanessa Q.; Costa, Luciano da F.; Amancio, Diego R.

Author(s):	de Arruda, Henrique F. [1]; Marinho, Vanessa Q. [1]; Costa, Luciano da F. [2]; Amancio, Diego R. [1, 3]
Total Authors: 4
Affiliation:	[1] Univ Sao Paulo, Inst Math & Comp Sci, Sao Carlos, SP - Brazil; [2] Univ Sao Paulo, Sao Carlos Inst Phys, Sao Carlos, SP - Brazil; [3] Indiana Univ, Sch Informat Comp & Engn, Bloomington, IN 47408 - USA
Total Affiliations: 3
Document type:	Journal article
Source:	INFORMATION PROCESSING & MANAGEMENT; v. 56, n. 3, p. 479-494, MAY 2019.
Web of Science Citations:	1
Abstract
A common way to represent a text as a graph (also called a network) is the word adjacency (co-occurrence) model, which is known to capture mainly syntactic features of texts. In this study, we propose a novel network model based on the similarity between the content of the paragraphs of a text. Using this representation, we characterized the networks with measurements developed in network science. We evaluated these measurements according to their ability to discriminate between real and shuffled texts and to capture information about the content similarity of chunks of text. To compare the results with a more sophisticated approach, we also employed a methodology based on word2vec. When comparing real and shuffled texts, the results revealed that real texts tend to have a better-defined community structure, a characteristic that can be related to the organization of subjects in real texts. The network-based measurements found to discriminate real from shuffled texts were used as features in a classifier, which reached an accuracy of 98.72%. For comparison with a different methodology, we used doc2vec-based features in the classifier, which yielded an accuracy of 70.8%. Finally, the proposed network-based features were employed to analyze the Voynich manuscript, which was found to be compatible with real texts according to the considered characteristics.
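
The general idea of a paragraph similarity network, and of a shuffled counterpart to compare it against, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not state which similarity measure, threshold, or shuffling scheme was used, so the cosine similarity over raw word counts, the threshold of 0.2, and the function names below are all assumptions made for illustration. The community detection and classification steps are not shown.

```python
import math
import random
import re
from collections import Counter
from itertools import combinations


def paragraph_vectors(text):
    """Split text on blank lines and count lowercase word tokens per paragraph."""
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
    return [Counter(re.findall(r"[a-z']+", p.lower())) for p in paragraphs]


def cosine(a, b):
    """Cosine similarity between two word-count vectors (assumed measure)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def similarity_network(vectors, threshold=0.2):
    """Return weighted edges {(i, j): similarity} for paragraph pairs
    whose similarity reaches the (assumed) threshold."""
    edges = {}
    for i, j in combinations(range(len(vectors)), 2):
        s = cosine(vectors[i], vectors[j])
        if s >= threshold:
            edges[(i, j)] = s
    return edges


def shuffle_words(text, seed=0):
    """Shuffle all words across the text, keeping each paragraph's word count.
    One possible null model; the abstract does not specify the scheme."""
    paragraphs = [p.split() for p in re.split(r"\n\s*\n", text) if p.strip()]
    words = [w for p in paragraphs for w in p]
    random.Random(seed).shuffle(words)
    out, pos = [], 0
    for p in paragraphs:
        out.append(" ".join(words[pos:pos + len(p)]))
        pos += len(p)
    return "\n\n".join(out)
```

Under these assumptions, `similarity_network(paragraph_vectors(text))` gives the network of a real text and `similarity_network(paragraph_vectors(shuffle_words(text)))` gives its shuffled counterpart; network measurements computed on the two could then be compared.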

FAPESP's process:	17/13464-6 - Modelling citation and information graphs: a complex network approach
Grantee:	Diego Raphael Amancio
Support Opportunities:	Scholarships abroad - Research


FAPESP's process:	15/22308-2 - Intermediate representations in Computational Science for knowledge discovery
Grantee:	Roberto Marcondes Cesar Junior
Support Opportunities:	Research Projects - Thematic Grants


FAPESP's process:	16/19069-9 - Using semantical information to classify texts modelled as complex networks
Grantee:	Diego Raphael Amancio
Support Opportunities:	Regular Research Grants


FAPESP's process:	11/50761-2 - Models and methods of e-Science for life and agricultural sciences
Grantee:	Roberto Marcondes Cesar Junior
Support Opportunities:	Research Projects - Thematic Grants


FAPESP's process:	15/05676-8 - Development of new models for authorship recognition using complex networks
Grantee:	Vanessa Queiroz Marinho
Support Opportunities:	Scholarships in Brazil - Master
