Research Grants 17/50153-9 - Text mining, Algorithms

Abstract

In the real world, text is the usual format for storing information. Automated techniques that help to group, extract topics from, and classify textual documents, minimizing the need for human intervention, therefore remain a worthwhile research topic. In this context, the Brazilian and Canadian groups have developed a number of techniques related to network-based text mining that complement the traditional vector space model for representing textual corpora. More specifically, these techniques represent textual collections as networks of terms and documents. Algorithms that use a graph representation have several advantages, since such a representation: (1) avoids sparsity and ensures low memory consumption; (2) enables an optimal description of the topological structure of a dataset and associated operations; (3) provides local and global statistics of the dataset's structure; and (4) allows extracting patterns that are not extracted by algorithms based on the vector space model (Breve et al., 2012). Using such representations, both groups have developed a number of techniques for supervised, unsupervised, and semi-supervised learning. The Brazilian group's methods are based on information propagation in bipartite networks and can be applied to different domains. In textual domains, in which a collection of documents may be represented by a document-term bipartite network, the proposals range from text classification to soft clustering, including semi-supervised classification and topic extraction. The Canadian counterpart team is involved in a major ongoing project on total recall information retrieval (IR) in large noisy text datasets, funded by NSERC and Boeing Canada. A different project, which received funding from the Digging into Data program until late 2015 and continues under NSERC Discovery grant funding, addresses total recall IR on a large corpus of biodiversity heritage text.
As a motivating practical problem, this project also aims to expand the functionality and utility of the Biodiversity Heritage Library (BHL) [BHL], a digital library of over 170 thousand volumes and 49 million pages of biodiversity literature, dating back to the 16th century and openly available to the global biodiversity community. The collaboration between the two teams will aim for novel approaches so that each team can improve its knowledge and usage of the strategies, techniques and tools employed by the other, in the context of total recall IR for the BHL corpus. These opportunities will extend to the students working on these topics, who will experience international collaboration and internships at the partner institutions as part of their master's or doctoral projects. (AU)
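To make the document-term bipartite representation mentioned in the abstract more concrete, the following is a minimal Python sketch. It is a generic illustration of propagating class information between documents and terms, not the algorithms developed by either group. The function names (`build_bipartite`, `propagate`), the toy corpus, the class names, and the averaging rule are all assumptions made for this example.

```python
from collections import defaultdict


def build_bipartite(docs):
    """Build a document-term bipartite network (hypothetical helper).

    Edges are weighted by term frequency. Returns two adjacency maps:
    document -> {term: weight} and term -> {document: weight}.
    """
    doc_to_terms = {}
    term_to_docs = defaultdict(dict)
    for doc_id, text in docs.items():
        counts = defaultdict(int)
        for token in text.lower().split():
            counts[token] += 1
        doc_to_terms[doc_id] = dict(counts)
        for term, weight in counts.items():
            term_to_docs[term][doc_id] = weight
    return doc_to_terms, dict(term_to_docs)


def propagate(doc_to_terms, term_to_docs, seeds, classes, iterations=10):
    """Spread class scores from labeled documents through shared terms.

    Assumed rule for this sketch: each node takes the weighted average of
    its neighbors' scores; labeled (seed) documents are kept fixed.
    """
    doc_scores = {d: {c: 0.0 for c in classes} for d in doc_to_terms}
    for doc_id, label in seeds.items():
        doc_scores[doc_id][label] = 1.0

    for _ in range(iterations):
        # Step 1: documents send their scores to the terms they contain.
        term_scores = {}
        for term, nbrs in term_to_docs.items():
            total = sum(nbrs.values())
            term_scores[term] = {
                c: sum(w * doc_scores[d][c] for d, w in nbrs.items()) / total
                for c in classes
            }
        # Step 2: terms send their scores back to unlabeled documents.
        for doc_id, nbrs in doc_to_terms.items():
            if doc_id in seeds or not nbrs:
                continue  # keep seed labels fixed; skip empty documents
            total = sum(nbrs.values())
            doc_scores[doc_id] = {
                c: sum(w * term_scores[t][c] for t, w in nbrs.items()) / total
                for c in classes
            }
    return doc_scores


# Toy corpus: two labeled documents and two unlabeled ones.
docs = {
    "d1": "graph network propagation graph",
    "d2": "species plant biodiversity species",
    "d3": "network propagation classification",
    "d4": "plant species herbarium",
}
seeds = {"d1": "cs", "d2": "bio"}
classes = ["cs", "bio"]

doc_to_terms, term_to_docs = build_bipartite(docs)
scores = propagate(doc_to_terms, term_to_docs, seeds, classes)
labels = {d: max(s, key=s.get) for d, s in scores.items()}
print(labels)
```

In this toy setting, the unlabeled documents receive scores only through the terms they share with labeled documents, which is the basic intuition behind semi-supervised classification on such a network.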

Grant number:	17/50153-9
Support Opportunities:	Regular Research Grants
Start date:	June 01, 2018
End date:	May 31, 2020
Field of knowledge:	Physical Sciences and Mathematics - Computer Science - Computing Methodologies and Techniques
Agreement:	Consortium of Alberta, Laval, Dalhousie and Ottawa (CALDO)
Mobility Program:	SPRINT - Research projects - Mobility

Principal Investigator:	Alneu de Andrade Lopes
Grantee:	Alneu de Andrade Lopes
Principal researcher abroad:	Evangelos Milios
Institution abroad:	Dalhousie University, Canada

Host Institution:	Instituto de Ciências Matemáticas e de Computação (ICMC). Universidade de São Paulo (USP). São Carlos, SP, Brazil

Associated research grant:	15/14228-9 - Social Network Analysis and Mining, AP.R
