Research and Innovation: Regulatory Document Extraction Engine (MEDoRe)

Grant number:	22/10596-7
Support Opportunities:	Research Grants - Innovative Research in Small Business - PIPE
Start date:	March 01, 2023
End date:	November 30, 2023
Field of knowledge:	Physical Sciences and Mathematics - Computer Science

Principal Investigator:	Danilo Amaral de Oliveira
Grantee:	Danilo Amaral de Oliveira


Company:	Openlex Soluções Tecnológicas Ltda
CNAE:	Data processing, application service providers, and internet hosting services ; Other information service activities not previously specified

Principal investigators:	Bruno Squizato Faiçal ; Ivan Ervolino
Associated researchers:	Frederico Amaral de Oliveira

Associated research grant(s):	23/16491-5 - Sigalei Analytics: Turning Regulatory Documents into Strategic Decisions, AP.PIPE
Associated scholarship(s):	23/06198-9 - Collection and storage of official journals, BP.TT ; 23/01658-1 - Expanding the functionalities of MEDoRe, BP.TT ; 23/02098-0 - Regulatory Document Extraction Engine (MEDoRe), BP.PIPE

Abstract

Brazil is recognized as a country with a complex business environment, and one of the main reasons is the complexity of its regulatory framework. Compounding this, the government lacks the maturity to standardize and structure its data so that it is easier to understand. Most of the large volume of data generated daily is unstructured, and because of this sheer volume, manually monitoring and analyzing government decisions effectively is impractical. Among the various types of data generated are the official journals, which have great informative value for society and business regarding governmental acts. The official journals make the acts, decisions, and proposals of public bodies publicly available, so that government decisions can be complied with and, in a democratic way, society and companies can monitor or participate in them. Official journals are published daily at the municipal, state, and federal levels. These documents are widely available in PDF format, although only the Diário Oficial da União (DOU) is also available under open data guidelines as an alternative to PDF. PDF is a format for final presentation: it preserves the original layout of a document but often does not maintain its logical structure. This is one of the reasons why the Technical Primer for Publishing Open Data in Brazil states that regulatory documents should be made available in formats with open, non-proprietary specifications, structured so that unrestricted and automated use is possible.
The Secretariat of Logistics and Information Technology (SLTI) recognizes that the PDF format is inadequate for this purpose and that its use is a recurrent mistake made by several public agencies; it emphasizes that PDF makes data reuse difficult or unfeasible because it does not allow automated reuse.
Making regulatory documents available in unstructured formats is an obstacle to the automated reuse of data for broad dissemination, processing, curation, and information extraction. Companies and professionals need to stay up to date on official decisions and information for strategic decision-making in their respective fields. This requires a mechanism that can process the large volume of government data made available, extract the regulatory documents from unstructured formats, and make them available in a structured format such as JSON. This is an open format that can be interpreted by software, which allows the regulatory documents to be presented by expert applications and processed automatically.
To overcome this obstacle, we propose a processing mechanism based on computer vision and extraction of text from images. We expect to visually identify the regions of the journals that correspond to regulatory documents and extract them as plain text. Subsequently, the regulatory documents are classified according to the class they belong to (e.g., law, decree, or resolution). Finally, they are structured in JSON and stored in the infrastructure made available by the host company. Storage and consumption of the structured data will be done through an API already developed by the company. (AU)
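The last two stages described in the abstract (classifying an extracted document and structuring it as JSON) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and not the project's actual implementation: the heading-keyword classifier is a simple stand-in for whatever classification method the project uses, and the names `RegulatoryDocument`, `classify`, `to_json`, and the JSON field names are hypothetical. The computer-vision and text-extraction stages are not shown; the sketch starts from plain text that is assumed to have been extracted already.

```python
import json
import re
from dataclasses import asdict, dataclass

# Hypothetical heading patterns (Portuguese), used only as a stand-in
# classifier; the abstract does not say how classification is performed.
CLASS_PATTERNS = {
    "law": re.compile(r"^\s*LEI\b", re.IGNORECASE),
    "decree": re.compile(r"^\s*DECRETO\b", re.IGNORECASE),
    "resolution": re.compile(r"^\s*RESOLUÇÃO\b", re.IGNORECASE),
}


@dataclass
class RegulatoryDocument:
    """Hypothetical record layout for one extracted regulatory document."""
    journal: str           # e.g. "DOU"
    publication_date: str  # ISO date string supplied by the caller
    doc_class: str         # "law", "decree", "resolution", or "unknown"
    text: str              # plain text extracted from the journal page


def classify(text: str) -> str:
    """Return the first class whose pattern matches the start of the text."""
    for label, pattern in CLASS_PATTERNS.items():
        if pattern.search(text):
            return label
    return "unknown"


def to_json(journal: str, publication_date: str, text: str) -> str:
    """Classify the extracted text and serialize the record as JSON."""
    doc = RegulatoryDocument(
        journal=journal,
        publication_date=publication_date,
        doc_class=classify(text),
        text=text.strip(),
    )
    # ensure_ascii=False keeps accented Portuguese characters readable.
    return json.dumps(asdict(doc), ensure_ascii=False)


if __name__ == "__main__":
    # Invented sample text, not taken from any real official journal.
    sample = "DECRETO Nº 0, DE 1 DE JANEIRO\nTexto ilustrativo do ato."
    print(to_json("DOU", "2023-01-01", sample))
```

A record produced this way could then be sent to a storage API; that step is omitted because the abstract gives no details about the company's API.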
