Scholarship 19/18378-6 - Computational biology, Data science - BV FAPESP

Auxiliary module for discovering chemical structures in literature: data mining

Grant number: 19/18378-6
Support Opportunities: Scholarships in Brazil - Scientific Initiation
Start date: September 01, 2019
End date: June 30, 2020
Field of knowledge: Interdisciplinary Subjects
Principal Investigator: Ricardo Roberto da Silva
Grantee: Ana Carolina Lunardello Coelho
Host Institution: Faculdade de Ciências Farmacêuticas de Ribeirão Preto (FCFRP). Universidade de São Paulo (USP). Ribeirão Preto, SP, Brazil
Associated research grant: 17/18922-2 - Development of a computing platform extensible and modular for metabolomics and metagenomics analysis: innovation with the discovery of new enzymatic activities and natural products of pharmaceutical interest derived, AP.BTA.JP

Abstract

The rapid growth of information in the scientific literature makes keeping databases up to date one of the biggest challenges in contemporary science. The main objective of data mining methods is to capture information quickly and present it to the user in an easily interpretable form. Providing information in this way makes it possible to retrieve it semi-automatically, which overcomes both the updating limitations of traditional databases and the difficulty of keeping up with the scientific literature. For the information to be used effectively, users must be able to view, edit and archive the text they find on an easily accessible web platform. For many years the scientific community has tried to formalize chemical nomenclature, but the solutions proposed by institutions such as IUPAC (International Union of Pure and Applied Chemistry), IUBMB (International Union of Biochemistry and Molecular Biology) and CAS (Chemical Abstracts Service) are not exhaustive and conflict with one another. Unlike the protocols computers use to exchange information, natural language is ambiguous and highly variable, which is one of the main challenges in processing it. Data mining of text draws on linguistics, the scientific study of language. Two fundamental principles of linguistics are that language is structured at different levels and that ambiguity is present at each level. Topic modeling is a generative probabilistic approach that is increasingly used in data mining and information retrieval because of its ability to process large text collections. One of the most widely applied probabilistic topic models is Latent Dirichlet Allocation (LDA). The objective of this project is to implement a data mining method based on the Latent Dirichlet Allocation model, which will in turn support the associated master's project.
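
As a rough illustration of the kind of topic model the abstract refers to, the sketch below fits LDA to a tiny toy corpus using collapsed Gibbs sampling, one common inference method for LDA. It is not the project's implementation: the function names (`lda_gibbs`, `top_words`), the hyperparameter values and the toy documents are all assumptions made for this example.

```python
import random

def lda_gibbs(docs, num_topics, alpha=0.1, beta=0.01, iterations=200, seed=0):
    """Fit LDA by collapsed Gibbs sampling (illustrative sketch).

    docs: list of token lists.
    Returns (doc_topic, topic_word, vocab), where doc_topic[d][k] counts the
    tokens of document d assigned to topic k and topic_word[k][v] counts how
    often vocabulary word v is assigned to topic k.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    vocab_size = len(vocab)

    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [[0] * vocab_size for _ in range(num_topics)]
    topic_total = [0] * num_topics
    assignments = []

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(z_doc)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v = word_id[w]
                k = assignments[d][i]
                # Remove the current token from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                # Conditional probability of each topic given all other
                # assignments, up to a normalizing constant.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][v] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                # Add the token back under its newly sampled topic.
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1

    return doc_topic, topic_word, vocab

def top_words(topic_word, vocab, n=3):
    """Return the n most frequently assigned words for each topic."""
    result = []
    for counts in topic_word:
        ranked = sorted(range(len(vocab)), key=lambda v: counts[v], reverse=True)
        result.append([vocab[v] for v in ranked[:n]])
    return result

# Hypothetical toy corpus; real input would be tokenized literature text.
toy_docs = [
    "alkaloid isolated plant extract alkaloid".split(),
    "plant extract flavonoid isolated".split(),
    "enzyme gene cluster sequence enzyme".split(),
    "gene sequence cluster genome".split(),
]
doc_topic, topic_word, vocab = lda_gibbs(toy_docs, num_topics=2)
for k, words in enumerate(top_words(topic_word, vocab)):
    print(f"topic {k}: {words}")
```

On a corpus this small the recovered topics depend heavily on the random seed and the hyperparameters, so the output should be read only as a demonstration of the mechanics. For large text collections an established library implementation would be the more practical choice.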
