Evaluating the impact of corpora used to train distributed text representation models for noisy and short texts

Lochter, Johannes, V; Pires, Pedro R.; Bossolani, Carlos; Yamakami, Akebo; Almeida, Tiago A.; IEEE

Author(s):	Lochter, Johannes, V ; Pires, Pedro R. ; Bossolani, Carlos ; Yamakami, Akebo ; Almeida, Tiago A. ; IEEE
Total Authors:	6
Document type:	Journal article
Source:	2018 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN); v. N/A, p. 8-pg., 2018-01-01.
Abstract
The traditional bag-of-words representation has well-known shortcomings. Distributed representations have emerged as an alternative and have become the state of the art in text representation. However, most existing studies have evaluated these representations using formal, well-written text corpora to generate the language model, which is later applied to traditional text categorization problems. There is no evidence that the same models are suitable for short and noisy texts (e.g., messages from social networks and instant messengers). In addition, there is no consensus on which of the existing techniques for training distributed representations of words is the best available choice to serve as a baseline for further comparisons. This study therefore evaluates whether language models created from a corpus extracted from the application domain are more appropriate than those created from a corpus of formal text. Moreover, we compare the most traditional techniques used to create distributed representation models of text by applying them to polarity detection in short and noisy messages. Our experiments were carefully designed and confirmed our hypothesis that training distributed representation models with a corpus composed of text with the same characteristics as the application domain is preferable to using a formal, well-written one. (AU)
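
One plausible reason the training corpus matters, offered here as an assumption and not as a finding stated in the abstract, is vocabulary coverage: a word-level embedding model can only assign vectors to tokens that occurred in its training corpus. The minimal sketch below measures the out-of-vocabulary (OOV) rate of a few noisy messages against vocabularies built from a formal corpus and from an in-domain corpus. The toy sentences and the helper names `build_vocabulary` and `oov_rate` are hypothetical and are not taken from the paper.

```python
# Illustrative sketch only: the toy corpora and helper names are hypothetical,
# not data or code from the paper.

def build_vocabulary(corpus):
    """Collect the set of lowercase, whitespace-separated tokens in a corpus."""
    return {token for sentence in corpus for token in sentence.lower().split()}


def oov_rate(messages, vocabulary):
    """Return the fraction of message tokens that are missing from the vocabulary."""
    tokens = [t for m in messages for t in m.lower().split()]
    if not tokens:
        return 0.0
    missing = sum(1 for t in tokens if t not in vocabulary)
    return missing / len(tokens)


# Formal, well-written training text (toy example).
formal_corpus = [
    "the film was great and i loved it",
    "the service was terrible and you should avoid it",
]

# Short, noisy training text from the application domain (toy example).
domain_corpus = [
    "gr8 movie luv it",
    "u should avoid it tbh",
    "omg the service was terrible",
]

# Messages to be represented at application time.
target_messages = ["luv it gr8", "omg u should avoid"]

formal_oov = oov_rate(target_messages, build_vocabulary(formal_corpus))
domain_oov = oov_rate(target_messages, build_vocabulary(domain_corpus))

print(f"OOV rate with formal corpus vocabulary: {formal_oov:.2f}")
print(f"OOV rate with domain corpus vocabulary: {domain_oov:.2f}")
```

Coverage is only a proxy. It says nothing about the quality of the learned vectors or about polarity detection accuracy, which is what the study itself evaluates.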

FAPESP's process:	17/06495-2 - Distributed vector representation of documents applied to categorize short and noisy text messages
Grantee:	Pedro Reis Pires
Support Opportunities:	Scholarships in Brazil - Scientific Initiation


FAPESP's process:	17/09387-6 - A continuously evolving distributed text representation model
Grantee:	Tiago Agostinho de Almeida
Support Opportunities:	Regular Research Grants
