Veritas: a Brazilian legal documents dataset for natural language processing

Grant number: 21/06783-3
Support Opportunities: Scholarships in Brazil - Scientific Initiation
Effective date (Start): October 01, 2021
Effective date (End): November 30, 2021
Field of knowledge: Physical Sciences and Mathematics - Computer Science
Principal Investigator: Esther Luna Colombini
Grantee: Guilherme Pereira Corrêa
Host Institution: Instituto de Computação (IC). Universidade Estadual de Campinas (UNICAMP). Campinas, SP, Brazil
Host Company: Universidade de São Paulo (USP). Centro de Inovação da USP (INOVA)
Associated research grant: 19/07665-4 - Center for Artificial Intelligence, AP.eScience.CPE


In the last decade, the volume of available data has increased drastically in both quantity and diversity, peaking during the Covid-19 pandemic. Together with the evolution of computational techniques for analyzing this information, this growth has led areas that traditionally concentrated on qualitative approaches to adopt quantitative ones as well. The Brazilian legal domain is no exception: many documents are available online to the public, especially since the publication of Federal Law 11419/2006, which popularized the use of digital documents in legal cases, and work in this area has the potential to directly affect the lives of many people. For this reason, a partnership was established between the Institute of Computation of the State University of Campinas (Unicamp) and the Law School of Ribeirão Preto of the University of São Paulo (USP), with the goal of meeting the demands of computational legal science for social good. Inspired by this partnership, the present project aims to contribute to Machine Learning in the legal domain, with a focus on public interest applications. By the end of the project, we will publish a dataset of Brazilian judicial documents whose content will be collected by a web crawler according to the desired parameters. The dataset is aimed primarily at unsupervised applications. For validation, an unsupervised model will be developed to classify judicial decisions based on the content of the collected documents. The project is thus expected to contribute to the community of Portuguese-speaking researchers in computational legal science, where datasets of this kind are scarce: the available ones are either in English, restricted to a single class of documents, or annotated, which reduces their generality and limits their use. (AU)
