(Reference retrieved automatically from Web of Science, using the FAPESP funding information and the corresponding grant number included in the publication by the authors.)

On the role of words in the network structure of texts: Application to authorship attribution

Author(s):
Akimushkin, Camilo [1] ; Amancio, Diego R. [2] ; Oliveira, Jr., Osvaldo N. [1]
Total number of authors: 3
Author affiliation(s):
[1] Univ Sao Paulo, Sao Carlos Inst Phys, Ave Trabalhador Sao Carlense 400, Sao Carlos, SP - Brazil
[2] Univ Sao Paulo, Inst Math & Comp Sci, Ave Trabalhador Sao Carlense 400, Sao Carlos, SP - Brazil
Total number of affiliations: 2
Document type: Scientific article
Source: PHYSICA A-STATISTICAL MECHANICS AND ITS APPLICATIONS; v. 495, p. 49-58, APR 1 2018.
Web of Science citations: 1
Abstract

Well-established automatic analyses of texts mainly consider frequencies of linguistic units, e.g. letters, words, and bigrams. In a recent, alternative approach, medium and large-scale text structures were used in opposition to the belief that text structure is dominated by the language features. In this paper, we introduce a generalized similarity measure to compare texts which accounts for both the network structure of texts and the role of individual words in the networks. The similarity measure is used for authorship attribution of three collections of books, each composed of 8 authors and 10 books per author. High accuracy rates were obtained with typical values between 90% and 98.75%, much higher than with the traditional term frequency-inverse document frequency (tf-idf) approach for the same collections. These accuracies are also higher than those obtained solely with the topology of networks. We conclude that the different properties of specific words on the macroscopic scale structure of a whole text are as relevant as their frequency of appearance; conversely, considering the identity of nodes brings further knowledge about a piece of text represented as a network. (C) 2017 Elsevier B.V. All rights reserved. (AU)
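
For readers unfamiliar with the tf-idf baseline mentioned in the abstract, the following is a minimal sketch of frequency-based authorship attribution. It is not the authors' implementation and not their network-based similarity measure: the function names (`tfidf_vectors`, `cosine`, `attribute`), the whitespace tokenization, the unsmoothed idf, and the nearest-neighbor decision rule are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {word: tf-idf weight} dict per tokenized document.

    Assumption: tf is the relative frequency of the word in the document
    and idf is log(N / df), with no smoothing.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each word.
    df = Counter(word for doc in docs for word in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            word: (count / len(doc)) * math.log(n_docs / df[word])
            for word, count in counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(word, 0.0) for word, weight in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def attribute(train, unknown_text):
    """Assign unknown_text to the author of its most similar training text.

    train is a list of (author, text) pairs. The unknown text is included
    when computing idf, which is a simplification made for this sketch.
    """
    docs = [text.lower().split() for _, text in train]
    docs.append(unknown_text.lower().split())
    vectors = tfidf_vectors(docs)
    query = vectors[-1]
    # zip stops at the shorter sequence, so the query vector is skipped.
    scores = [(cosine(query, vec), author)
              for vec, (author, _) in zip(vectors, train)]
    return max(scores)[1]


if __name__ == "__main__":
    # Toy data invented for illustration only.
    toy_train = [
        ("A", "the sea was calm and the ship sailed on the sea"),
        ("B", "the market price rose and the traders sold the stock"),
    ]
    print(attribute(toy_train, "the ship left the calm sea"))
```

A baseline of this kind uses only how often each word appears. The abstract's point is that the position of specific words in the network structure of a text carries additional information beyond these frequencies.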

FAPESP grant: 14/20830-0 - Modeling and pattern recognition in texts with complex networks
Grantee: Diego Raphael Amancio
Support type: Research Grants - Regular
FAPESP grant: 13/14262-7 - Nanostructured films of materials of biological interest
Grantee: Osvaldo Novais de Oliveira Junior
Support type: Research Grants - Thematic
FAPESP grant: 16/19069-9 - Document classification using semantic information in complex networks
Grantee: Diego Raphael Amancio
Support type: Research Grants - Regular