JurisBERT: A New Approach that Converts a Classification Corpus into an STS One

Author(s):
Viegas, Charles F. O.; Costa, Bruno C.; Ishii, Renato P.
Total Authors: 3
Document type: Journal article
Source: COMPUTATIONAL SCIENCE AND ITS APPLICATIONS, ICCSA 2023, PT I; v. 13956, 17 pp., 2023-01-01.
Abstract

We propose in this work a new approach that transforms a classification corpus into an STS (Semantic Textual Similarity) one. We use BERT (Bidirectional Encoder Representations from Transformers) to validate our hypothesis that a multi-level classification dataset can be converted into an STS dataset, which improves the fine-tuning step and demonstrates the value of the proposed corpus. In our approach, we also trained a BERT model from scratch on legal texts, called JurisBERT, which shows a considerable improvement in speed and precision and requires fewer computational resources than other approaches. JurisBERT relies on the concept of a sub-language: a model pre-trained in a language (Brazilian Portuguese) undergoes fine-tuning to better fit a specific domain, in our case the legal field. JurisBERT uses 24k pairs of ementas (case summary headnotes) with similarity degrees ranging from 0 to 3. We collected this data from the search mechanisms available on court websites in order to validate the model with real-world data. Our experiments showed that JurisBERT outperforms other models such as multilingual BERT and BERTimbau, with a 3.30% better F1 score, 5 times shorter training time, and accessible hardware, i.e., a low-cost GPGPU architecture. (AU)
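
The abstract describes two technical steps: deriving similarity-scored pairs from a multi-level classification corpus, and fine-tuning a Portuguese BERT checkpoint on those pairs. The sketch below only illustrates how such a pipeline could look; the record format, the prefix-depth scoring rule, the classification_to_sts and fine_tune helpers, and the use of the sentence-transformers library are assumptions rather than the authors' released code, and base_checkpoint is a placeholder for the domain pre-trained JurisBERT model.

# Illustrative sketch (not the authors' released code): the record format,
# the scoring rule, and the helper names below are assumptions.
import random
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses, models

def similarity_degree(labels_a, labels_b, max_degree=3):
    """Assumed scoring rule: the degree (0-3) is the number of leading
    hierarchy levels two documents share in the multi-level class tree."""
    depth = 0
    for a, b in zip(labels_a, labels_b):
        if a != b:
            break
        depth += 1
    return min(depth, max_degree)

def classification_to_sts(corpus, num_pairs, seed=0):
    """corpus: list of (text, label_path) tuples from the classification set.
    Returns (text_1, text_2, degree) STS examples built by random pairing."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(num_pairs):
        (t1, l1), (t2, l2) = rng.sample(corpus, 2)
        pairs.append((t1, t2, similarity_degree(l1, l2)))
    return pairs

def fine_tune(sts_pairs, base_checkpoint):
    """Fine-tunes a BERT checkpoint on the derived STS pairs with a
    cosine-similarity regression loss; degrees 0-3 are rescaled to [0, 1].
    base_checkpoint would point at the domain pre-trained model (JurisBERT)
    or a general Portuguese BERT such as BERTimbau."""
    word_emb = models.Transformer(base_checkpoint, max_seq_length=512)
    pooling = models.Pooling(word_emb.get_word_embedding_dimension())
    model = SentenceTransformer(modules=[word_emb, pooling])

    examples = [InputExample(texts=[t1, t2], label=deg / 3.0)
                for t1, t2, deg in sts_pairs]
    loader = DataLoader(examples, shuffle=True, batch_size=16)
    loss = losses.CosineSimilarityLoss(model)
    model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=100)
    return model

With a toy corpus of (ementa_text, label_path) tuples, classification_to_sts(corpus, 24_000) would produce scored pairs of the kind described in the abstract, which fine_tune then consumes directly.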

FAPESP's process: 15/24485-9 - Future internet for smart cities
Grantee: Fabio Kon
Support Opportunities: Research Projects - Thematic Grants
FAPESP's process: 14/50937-1 - INCT 2014: on the Internet of the Future
Grantee: Fabio Kon
Support Opportunities: Research Projects - Thematic Grants