Research Grants 22/09490-0 - Literary translation, Classics - BV FAPESP

Digital classics: linking ancient languages to Portuguese and enhancing an automatic model for translation alignment

Abstract

This proposal sets out a collaborative project between Classics (Ancient Greek) and Computing/Digital Humanities, in partnership with the developer Tariq Yousef (U. Leipzig) and the digital humanist Chiara Palladino (Furman U.). The role of the host university in this international partnership is to provide training data for Ancient Greek-Portuguese Translation Alignment tasks (GRC-POR TAs). These tasks generate pairs of lexical units in the two languages, which are used to enhance the multilingual contextual model, Classical BERT, in the semi-automatic creation of TAs that include Brazilian Portuguese.

For the TAs, a local team of graduate and undergraduate students of Greek will use corpora from individual or school projects, performing manual TAs and correcting automatic ones on the UGARIT platform. The scanned texts in both languages are authorized for use and are aligned at the lexical or phrasal level, producing TAs that are evaluated manually and/or automatically.

The workflow is as follows:
A. Ancient Greek to Portuguese translations may be i) new, in preparation with manual alignment or treebanking; ii) taken from dirty TXT files or dependent on OCR; iii) under revision; iv) ready as parallel sentences; v) already manually aligned with the Ancient Greek text.
B. The prepared translated corpus is aligned automatically on the UGARIT platform, which performs the alignment extraction.
C. The data undergoes recursive retraining until it reaches a stable correctness threshold. The automatic alignment workflow in recursive training involves tokenization, embeddings extraction, a similarity matrix (cosine), alignment extraction, refinement, and evaluation.
D. The automatic alignment model is evaluated against the gold standards and the alignment guidelines until it reaches its optimal point.
E. Translation choices are analyzed.

The TA performed manually by trained and in-training classicists follows the current guidelines, and its level of agreement is evaluated using the kappa and overlap coefficients. About 10,000 sentence pairs will be available for alignment, drawn from prose texts (fables, philosophy, history) and from epic and dramatic poetry.

The expected outcomes are a) manual and semi-automatic GRC-POR TAs, developed with an automatic aligner and corrected against the gold standard for GRC-POR; b) a corpus automatically aligned at the lexical level, published online, exportable, and freely available for research and for the teaching of Ancient Greek; c) documentation of TA analyses and practices in POR; and d) the enhancement of the Classical BERT model implemented in the UGARIT automatic aligner, evaluated and intended for a Portuguese-speaking audience, with the trained data available on data servers such as GitHub and Zenodo. (AU)
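
The automatic alignment steps named in the workflow (similarity matrix by cosine, alignment extraction, refinement, evaluation) can be illustrated with a minimal sketch. It is not the UGARIT implementation. The three-dimensional vectors are hypothetical stand-ins for Classical BERT embeddings, the mutual-best-match rule is only one common way to extract alignments from a similarity matrix, and the function names and the `threshold` parameter are assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def similarity_matrix(src_vecs, tgt_vecs):
    """Rows are source tokens, columns are target tokens."""
    return [[cosine(s, t) for t in tgt_vecs] for s in src_vecs]

def extract_alignments(sim, threshold=0.0):
    """Keep (i, j) when j is the best target for i AND i is the best source for j.

    The threshold acts as a simple refinement step (an assumption, not
    something the abstract specifies): weak matches are dropped.
    """
    if not sim or not sim[0]:
        return []
    best_tgt = [max(range(len(row)), key=row.__getitem__) for row in sim]
    best_src = [max(range(len(sim)), key=lambda i, j=j: sim[i][j])
                for j in range(len(sim[0]))]
    return [(i, j) for i, j in enumerate(best_tgt)
            if best_src[j] == i and sim[i][j] >= threshold]

def precision_recall_f1(predicted, gold):
    """Score predicted alignment pairs against a gold-standard set."""
    predicted, gold = set(predicted), set(gold)
    true_pos = len(predicted & gold)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Toy sentence pair, already tokenized. The vectors are invented for the example.
src_tokens = ["ὁ", "ἄνθρωπος", "λέγει"]
tgt_tokens = ["o", "homem", "fala"]
src_vecs = [[1.0, 0.1, 0.0], [0.1, 1.0, 0.2], [0.0, 0.2, 1.0]]
tgt_vecs = [[0.9, 0.2, 0.1], [0.2, 0.9, 0.1], [0.1, 0.1, 0.95]]

sim = similarity_matrix(src_vecs, tgt_vecs)
pairs = extract_alignments(sim)
aligned = [(src_tokens[i], tgt_tokens[j]) for i, j in pairs]
gold = {(0, 0), (1, 1), (2, 2)}  # hypothetical manual alignment
scores = precision_recall_f1(pairs, gold)
```

In a real setting the vectors would come from the contextual model, and the gold pairs from the manually aligned sentences, so the scores would show how far the automatic aligner is from the annotators' decisions.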
