#PraCegoVer: A Large Dataset for Image Captioning in Portuguese

dos Santos, Gabriel Oliveira; Colombini, Esther Luna; Avila, Sandra

Author(s):	dos Santos, Gabriel Oliveira; Colombini, Esther Luna; Avila, Sandra. Total number of authors: 3
Document type:	Scientific article
Source:	DATA; v. 7, n. 2, p. 27-pg., 2022-02-01.
Abstract
Automatically describing images in natural-language sentences is essential to the inclusion of visually impaired people on the Internet. This problem is known as Image Captioning. There are many datasets in the literature, but most contain only English captions, whereas datasets with captions in other languages are scarce. We introduce #PraCegoVer, a multi-modal dataset with Portuguese captions based on posts from Instagram. It is the first large dataset for image captioning in Portuguese. In contrast to popular datasets, #PraCegoVer has only one reference per image, and both the mean and the variance of reference sentence length are significantly high, which makes our dataset linguistically challenging. We carry out a detailed analysis to find the main classes and topics in our data. We compare #PraCegoVer to the MS COCO dataset in terms of sentence length and word frequency. We hope that the #PraCegoVer dataset encourages more work on the automatic generation of descriptions in Portuguese. Dataset: https://doi.org/10.5281/zenodo.5710562 Dataset License: CC BY-NC-SA 4.0.
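
The comparison by sentence length and word frequency mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code: the function name `caption_stats`, the whitespace tokenization, and the two toy captions are assumptions made only for illustration, and the paper may tokenize and aggregate differently.

```python
from collections import Counter
from statistics import mean, pvariance


def caption_stats(captions):
    """Return mean caption length, population variance of length
    (both in whitespace-separated words), and word counts."""
    # Assumption: lowercase + whitespace split stands in for a real tokenizer.
    tokenized = [caption.lower().split() for caption in captions]
    lengths = [len(tokens) for tokens in tokenized]
    word_freq = Counter(word for tokens in tokenized for word in tokens)
    return mean(lengths), pvariance(lengths), word_freq


# Toy captions (hypothetical, not taken from the dataset).
captions = [
    "foto de um cachorro na praia",
    "ilustração de uma mulher sorrindo com um livro na mão",
]

avg_len, var_len, freq = caption_stats(captions)
print(avg_len, var_len, freq.most_common(3))
```

Running the same function over the captions of two datasets would give directly comparable length statistics and word-frequency tables.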

FAPESP grant:	19/24041-4 - #PraCegoVer: automatic audio description of images
Grantee:	Gabriel Oliveira dos Santos
Support type:	Scholarships in Brazil - Scientific Initiation


FAPESP grant:	13/08293-7 - CECC - Centro de Engenharia e Ciências Computacionais
Grantee:	Munir Salomao Skaf
Support type:	Research Grants - Research, Innovation and Dissemination Centers (CEPIDs)
