Evaluating the impact of corpora used to train distributed text representation models for noisy and short texts

Lochter, Johannes, V; Pires, Pedro R.; Bossolani, Carlos; Yamakami, Akebo; Almeida, Tiago A.; IEEE

Author(s):	Lochter, Johannes, V ; Pires, Pedro R. ; Bossolani, Carlos ; Yamakami, Akebo ; Almeida, Tiago A. ; IEEE. Total number of authors: 6
Document type:	Scientific article
Source:	2018 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN); v. N/A, p. 8-pg., 2018-01-01.
Abstract
The traditional bag-of-words model has well-known shortcomings. Distributed representations have emerged as an alternative and have become the state of the art in text representation. However, most existing studies evaluated these representations using formal, well-written text corpora to generate the language model, which is later applied to traditional text categorization problems. There is no evidence that the same models are suitable for short and noisy texts (e.g., messages from social networks and instant messengers). In addition, there is no consensus on which of the existing techniques for training distributed representations of words is the best choice to use as a baseline for further comparisons. This study therefore evaluates whether language models created from a corpus extracted from the application domain are more appropriate than those created from a corpus of formal text. Moreover, we compare the most traditional techniques used to create distributed representation models of text by applying them to polarity detection in short and noisy messages. Our experiments were carefully designed and confirmed our hypothesis that training distributed representation models with a corpus composed of text with the same characteristics as the application domain is preferable to using a formal, well-written one. (AU)

FAPESP grant:	17/06495-2 - Distributed vector representation of text applied to the classification of short and noisy messages
Grantee:	Pedro Reis Pires
Support type:	Scholarships in Brazil - Scientific Initiation


FAPESP grant:	17/09387-6 - Distributed text representation model capable of evolving continuously
Grantee:	Tiago Agostinho de Almeida
Support type:	Research Grants - Regular
