The Brazilian corpus: a 1-billion-word online collection of Brazilian Portuguese

Grant number: 08/00944-0
Support type: Regular Research Grants
Duration: May 01, 2008 - April 30, 2010
Field of knowledge: Linguistics, Literature and Arts - Linguistics
Principal Investigator: Antonio Paulo Berber Sardinha
Grantee: Antonio Paulo Berber Sardinha
Home Institution: Pró-Reitoria de Pós-Graduação (PRPG). Pontifícia Universidade Católica de São Paulo (PUC-SP). São Paulo, SP, Brazil

Abstract

This project aims to build and make available online the Corpus Brasileiro (Brazilian Corpus), to be made up of one billion words of contemporary Portuguese from a wide range of registers. Currently, no online Portuguese collection matches the size and variety planned for the Corpus Brasileiro. The largest online corpora of Portuguese are the Corpus do Português, with 45 million words (http://www.corpusdoportugues.org/), hosted by Brigham Young University (USA), of which only 12,009,402 words come from contemporary Brazilian Portuguese, and Lácio-Web, with 10 million words of Brazilian Portuguese, created by researchers at NILC (Núcleo Interinstitucional de Linguística Computacional; http://www.nilc.icmc.usp.br/lacioweb). The methodology for building the corpus includes: (1) obtaining texts and transcripts both online and offline; (2) structuring the material in SQL databases; (3) making the corpus available for searching online, through PHP forms. The process of building the corpus will follow the architecture proposed by Davies (2005 inter alia), which consists of importing textual data into structured databases and then querying the databases via online PHP forms. Davies reports excellent results in terms of both speed and accuracy. His architecture forms the basis of several online corpora, including the British National Corpus (100 million words; http://corpus.byu.edu/bnc/) and, more recently, the BYU Corpus of American English (326 million words; http://www.americancorpus.org/). The corpus will offer access to both frequency information and concordance lines for user-generated searches. Users will not have access to the full texts, since this would infringe copyright law (Besek, 2003).
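The database-backed approach described above can be illustrated with a minimal sketch. It is not the project's or Davies's actual schema: it assumes a single hypothetical `tokens` table with one row per word, uses Python with an in-memory SQLite database in place of the SQL databases and PHP forms mentioned in the abstract, and all function names and sample sentences are invented for illustration. It shows how a frequency count and short concordance lines might be served without exposing whole texts.

```python
import re
import sqlite3

def build_corpus(texts):
    """Load texts into an in-memory SQLite database, one row per token.

    The table layout (text_id, position, word) is a hypothetical
    simplification, not the schema used by the project.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE tokens (text_id INTEGER, position INTEGER, word TEXT)"
    )
    # An index on the word column keeps lookups fast as the table grows.
    conn.execute("CREATE INDEX idx_word ON tokens (word)")
    for text_id, text in enumerate(texts):
        words = re.findall(r"\w+", text.lower())
        conn.executemany(
            "INSERT INTO tokens VALUES (?, ?, ?)",
            [(text_id, pos, w) for pos, w in enumerate(words)],
        )
    conn.commit()
    return conn

def frequency(conn, word):
    """Return how many times `word` occurs in the corpus."""
    row = conn.execute(
        "SELECT COUNT(*) FROM tokens WHERE word = ?", (word.lower(),)
    ).fetchone()
    return row[0]

def concordance(conn, word, window=3):
    """Return one context line per hit, `window` words on each side."""
    hits = conn.execute(
        "SELECT text_id, position FROM tokens WHERE word = ? "
        "ORDER BY text_id, position",
        (word.lower(),),
    ).fetchall()
    lines = []
    for text_id, pos in hits:
        context = conn.execute(
            "SELECT word FROM tokens WHERE text_id = ? "
            "AND position BETWEEN ? AND ? ORDER BY position",
            (text_id, pos - window, pos + window),
        ).fetchall()
        lines.append(" ".join(w for (w,) in context))
    return lines

# Invented sample sentences standing in for corpus texts.
SAMPLE_TEXTS = [
    "O corpus inclui textos de diversos registros.",
    "Um corpus grande representa melhor o idioma.",
]

conn = build_corpus(SAMPLE_TEXTS)
print(frequency(conn, "corpus"))
for line in concordance(conn, "corpus", window=2):
    print(line)
```

Because each query returns only a count or a few words of context around a hit, a front end built this way can answer searches without ever delivering a complete text, which matches the copyright constraint noted in the abstract.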
The need for corpora as large as one billion words derives from the fact that a corpus is a sample of a large population (language), and in the case of general language, the size of the population is unknown; hence, the larger the sample, the closer it will be to the population and the more representative it will be of the range of variation in language (Berber Sardinha, 2004). The corpus can have a significant social impact, as it will make it possible for everyone to search vast quantities of text and talk and find out for themselves how Brazilian Portuguese is typically used in diverse situations. Potential users of the corpus include linguists, language researchers, teachers of Portuguese (both as a mother tongue and as a foreign language), journalists, writers, grammarians, dictionary makers, and students, as well as a host of other users. (AU)
