(Reference retrieved automatically from Web of Science, based on the FAPESP funding information and the corresponding grant number included in the publication by the authors.)

Paragraph-based representation of texts: A complex networks approach

Author(s):
de Arruda, Henrique F. [1] ; Marinho, Vanessa Q. [1] ; Costa, Luciano da F. [2] ; Amancio, Diego R. [1, 3]
Total number of authors: 4
Author affiliation(s):
[1] Univ Sao Paulo, Inst Math & Comp Sci, Sao Carlos, SP - Brazil
[2] Univ Sao Paulo, Sao Carlos Inst Phys, Sao Carlos, SP - Brazil
[3] Indiana Univ, Sch Informat Comp & Engn, Bloomington, IN 47408 - USA
Total number of affiliations: 3
Document type: Scientific article
Source: INFORMATION PROCESSING & MANAGEMENT; v. 56, n. 3, p. 479-494, MAY 2019.
Web of Science citations: 0
Abstract

An interesting model to represent texts as a graph (also called a network) is the word adjacency (co-occurrence) representation, which is known to capture mainly syntactic features of texts. In this study, we propose a novel network model based on the similarity between the content of the paragraphs of a text. Using this representation, we characterized the networks with measurements developed in network science. We evaluated these measurements by their ability to discriminate between real and shuffled texts and to capture information about the content similarity of chunks of text. To compare the results with a more sophisticated approach, we employed a methodology based on word2vec. When comparing real and shuffled texts, the results revealed that real texts tend to have a better-defined community structure, a characteristic that can be related to the organization of subjects in real texts. The network-based measurements found to discriminate real from shuffled texts were used as features in a classifier, which achieved an accuracy of 98.72%. For comparison with a different methodology, we used doc2vec-based features in the classifier, yielding an accuracy of 70.8%. The proposed network-based features were also employed to analyze the Voynich manuscript, which was found to be compatible with real texts according to the considered characteristics. (AU)
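
The abstract describes a network whose nodes are paragraphs and whose links reflect content similarity, but it does not state which similarity measure or linking rule the authors used. The following is a minimal sketch under assumptions of our own: paragraphs are compared by the cosine similarity of their bag-of-words vectors, and two paragraphs are linked when that similarity exceeds a threshold. The function names (`bag_of_words`, `cosine`, `paragraph_network`) and the `threshold` value are hypothetical and are not taken from the paper.

```python
import math
import re
from collections import Counter
from itertools import combinations


def bag_of_words(paragraph):
    """Lowercased word counts for one paragraph (assumed representation)."""
    return Counter(re.findall(r"[a-z]+", paragraph.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items() if word in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def paragraph_network(paragraphs, threshold=0.2):
    """Return weighted edges (i, j, similarity) between paragraph indices.

    An edge is kept only when the similarity exceeds `threshold`;
    the threshold rule is an illustrative assumption, not the paper's.
    """
    bags = [bag_of_words(p) for p in paragraphs]
    edges = []
    for i, j in combinations(range(len(bags)), 2):
        weight = cosine(bags[i], bags[j])
        if weight > threshold:
            edges.append((i, j, weight))
    return edges


paragraphs = [
    "the cat sat on the mat",
    "the cat ate the fish",
    "quantum networks entangle photons",
]
edges = paragraph_network(paragraphs)
```

In this toy input only the first two paragraphs share vocabulary, so they are the only pair that ends up linked. The resulting edge list could then be passed to a graph library to compute measurements such as community structure, and the same procedure could be repeated on a shuffled copy of the text for comparison.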

FAPESP grant: 15/05676-8 - Development of new models for authorship recognition using complex networks
Grantee: Vanessa Queiroz Marinho
Support type: Scholarships in Brazil - Master's
FAPESP grant: 15/22308-2 - Intermediate representations in Computational Science for knowledge discovery
Grantee: Roberto Marcondes Cesar Junior
Support type: Research Grants - Thematic
FAPESP grant: 17/13464-6 - Modeling citation and information graphs: a complex networks-based approach
Grantee: Diego Raphael Amancio
Support type: Scholarships Abroad - Research
FAPESP grant: 16/19069-9 - Document classification using semantic information in complex networks
Grantee: Diego Raphael Amancio
Support type: Research Grants - Regular
FAPESP grant: 11/50761-2 - e-Science models and methods for life and agricultural sciences
Grantee: Roberto Marcondes Cesar Junior
Support type: Research Grants - Thematic