The redistribution of content across the web, by lawful or unlawful means, has attracted attention in recent years in fields such as forensics, copyright enforcement, security, and social network analysis. The digital objects involved often pass through an evolutionary chain in which different versions of an original document emerge. The relationships among these documents can then be represented by a directed acyclic graph, known in the field as a phylogenetic tree by direct analogy with the trees used in evolutionary studies in biology. Analyzing these trees can reveal clues that point to criminals or offer insights into how information spreads through the web. The automatic reconstruction of phylogeny trees for multimedia documents is therefore an important challenge, with great potential to generate value and benefits for society. The subfield that studies this problem is known as Multimedia Phylogeny, and it has achieved significant results for some types of media, namely images and video. In preliminary tests carried out by the candidate, promising results were also achieved for another, less explored type of media: text documents. In this project, we propose to expand text phylogeny research, using synthetic and real data, with the aim of improving the performance of the existing reconstruction process and addressing the problems that proved most challenging in our preliminary studies.