Practical Text Phylogeny for Real-World Settings

Shen, Bingyu; Forstall, Christopher W.; Rocha, Anderson de Rezende; Scheirer, Walter J.

Author(s):	Shen, Bingyu ^[1] ; Forstall, Christopher W. ^[2] ; Rocha, Anderson de Rezende ^[3] ; Scheirer, Walter J. ^[1] Total number of authors: 4
Author affiliation(s):	^[1] Univ Notre Dame, Dept Comp Sci & Engn, Notre Dame, IN 46556 - USA ^[2] Mt Allison Univ, Dept Class, Sackville, NB E4L 1G9 - Canada ^[3] Univ Estadual Campinas, Inst Comp, BR-13026063 Campinas, SP - Brazil Total number of affiliations: 3
Document type:	Scientific Article
Source:	IEEE ACCESS; v. 6, p. 41002-41012, 2018.
Web of Science citations:	1
Abstract
The ease with which one can edit and redistribute digital documents on the Internet is one of modernity's great achievements, but it also leads to some vexing problems. With growing academic interest in the study of the evolution of digital writing on the one hand and the rise of disinformation on the other, the problem of identifying the relationship between texts with similar content is becoming more important. Traditional vector space representations of texts have made progress in solving this problem when it is cast as a reconstruction task that organizes related texts into a tree expressing their relationships; this task is dubbed text phylogeny in the information forensics literature. However, as new text representation methods have been successfully applied to many other text analysis problems, it is worth investigating whether they too can be used in text phylogeny tree reconstruction. In this paper, we explore the use of word embeddings as a text representation method, with the aim of improving the accuracy of reconstructed phylogeny trees for real-world data, and compare them with other widely used text representation methods. We evaluate the performance on established benchmarks for this task: a synthetic data set and data collected from Wikipedia. We also apply our framework to a new data set of fan fiction based on some famous fairy tales. Experimental results show that word embeddings are competitive with other feature sets for the published benchmarks, and are highly effective for creative writing.
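To make the reconstruction task in the abstract concrete, here is a minimal sketch and not the authors' method. It assumes each text is represented by the average of its word vectors, that dissimilarity is one minus cosine similarity, that the root (original) text is known in advance, and that the tree is grown by repeatedly attaching the closest unattached text. The names `TOY_VECTORS`, `embed`, `cosine_dissimilarity`, and `build_tree` are hypothetical, and the three-dimensional vectors are invented for illustration; a real system would load pretrained embeddings.

```python
import math

# Hypothetical toy word vectors, invented for illustration only.
TOY_VECTORS = {
    "wolf": (1.0, 0.0, 0.0),
    "girl": (0.0, 1.0, 0.0),
    "forest": (0.0, 0.0, 1.0),
    "hunter": (1.0, 1.0, 0.0),
}


def embed(text):
    """Represent a text as the average of its known word vectors."""
    vecs = [TOY_VECTORS[w] for w in text.lower().split() if w in TOY_VECTORS]
    if not vecs:
        raise ValueError("no known words in text: %r" % text)
    return tuple(sum(component) / len(vecs) for component in zip(*vecs))


def cosine_dissimilarity(a, b):
    """Return 1 - cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def build_tree(texts, root=0):
    """Grow a tree from the assumed root, Prim-style.

    Returns a list where entry i is the index of the parent of text i
    (None for the root).
    """
    vecs = [embed(t) for t in texts]
    parent = [None] * len(texts)
    in_tree = {root}
    while len(in_tree) < len(texts):
        # Pick the cheapest edge from an attached text to an unattached one.
        _, p, c = min(
            (cosine_dissimilarity(vecs[i], vecs[j]), i, j)
            for i in in_tree
            for j in range(len(texts))
            if j not in in_tree
        )
        parent[c] = p
        in_tree.add(c)
    return parent


texts = [
    "girl forest",
    "girl forest wolf",
    "girl forest wolf hunter",
]
print(build_tree(texts))
```

With these toy inputs each successive text attaches to the one it extends, so the result is a chain. A symmetric dissimilarity cannot by itself tell which of two texts came first, which is why this sketch takes the root as given.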

FAPESP grant:	17/12646-3 - Déjà vu: temporal, spatial, and characterization coherence of heterogeneous data for integrity analysis and interpretation
Grantee:	Anderson de Rezende Rocha
Support type:	Research Grant - Thematic
