Graph-based total recall information retrieval on text document corpora


In the real world, text is the most common format for storing information. Automated techniques that help to group documents, extract topics, and classify textual documents with minimal human intervention therefore remain a worthwhile research topic. In this context, the Brazilian and Canadian groups have developed a number of techniques for network-based text mining that complement the traditional vector space model for representing textual corpora; more specifically, they represent textual collections as networks of terms and documents. Algorithms that use a graph representation have several advantages, since a graph representation: (1) avoids sparsity and ensures low memory consumption; (2) enables an optimal description of the topological structure of a dataset and associated operations; (3) provides local and global statistics of the dataset's structure; and (4) allows extracting patterns that are not extracted by algorithms based on the vector space model (Breve et al., 2012). Using such representations, both groups have developed a number of techniques for supervised, unsupervised, and semi-supervised learning. The Brazilian group's methods are based on information propagation in bipartite networks and can be applied to different domains. In textual domains, where a collection of documents may be represented by a document-term bipartite network, the proposals range from text classification to soft clustering, including semi-supervised classification and topic extraction. The Canadian team is involved in a major ongoing project on total recall information retrieval (IR) in large, noisy text datasets, funded by NSERC and Boeing Canada. A different project, which received funding from the Digging into Data program until late 2015 and continues under NSERC Discovery grant funding, addresses total recall IR on a large corpus of biodiversity heritage text.
As a motivating practical problem, this project also aims to expand the functionality and utility of the Biodiversity Heritage Library (BHL) [BHL], a digital library of over 170 thousand volumes and 49 million pages of biodiversity literature dating from the 16th century, openly available to the global biodiversity community. The collaboration between the two teams will aim for novel approaches, so that each team can improve its knowledge and use of the strategies, techniques, and tools employed by the other, in the context of total recall IR for the BHL corpus. These opportunities will extend to the students working on these topics, who will experience international collaboration and internships at the partner institutions as part of their master's or doctoral projects. (AU)
