Distributed vector representation of documents applied to categorize short and noisy text messages

Grant number: 17/06495-2
Support Opportunities: Scholarships in Brazil - Scientific Initiation
Effective date (Start): June 01, 2017
Effective date (End): May 31, 2018
Field of knowledge: Physical Sciences and Mathematics - Computer Science - Computer Systems
Principal Investigator: Tiago Agostinho de Almeida
Grantee: Pedro Reis Pires
Host Institution: Centro de Ciências em Gestão e Tecnologia (CCGT). Universidade Federal de São Carlos (UFSCAR). Campus de Sorocaba. Sorocaba, SP, Brazil

Abstract

Classifying text messages is becoming increasingly difficult as more people use mobile devices to access the Internet, which leaves messages rife with slang, abbreviations, and typos. The traditional text representation known as bag-of-words has a series of shortcomings that are aggravated when messages are short and noisy. One of the most widely adopted solutions to these problems relies on techniques such as lexical normalization and semantic indexing. Such techniques, however, are language dependent and require constant maintenance. This project investigates distributed vector representations of text as an alternative to bag-of-words for classifying short and noisy messages. In these representations, words that are semantically more similar are represented by vectors that lie closer together in an n-dimensional space. The hypothesis of this project is that, by preserving the semantic similarity between words, these representations circumvent many of the deficiencies of bag-of-words and can provide superior performance. Because they are generated by unsupervised methods, these representations also have the advantage of not needing dictionaries. Given that there are different algorithms for generating distributed vector representations, this work will investigate which one provides the best performance in the categorization task and whether this kind of representation can outperform the traditional bag-of-words. (AU)
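
The contrast drawn in the abstract can be illustrated with a minimal Python sketch. It is not the project's method: the three-dimensional word vectors below are hand-picked toy values (an assumption made purely for illustration, where a real model would learn them from an unlabeled corpus), and averaging word vectors is only one simple way to obtain a message vector. The helper names bag_of_words, average_vector and cosine are hypothetical.

```python
import math

# Toy word vectors, hand-picked for illustration only (assumption);
# a real distributed model would learn them from an unlabeled corpus.
TOY_VECTORS = {
    "great": [0.90, 0.10, 0.00],
    "gr8":   [0.85, 0.15, 0.00],
    "movie": [0.10, 0.90, 0.00],
    "film":  [0.15, 0.85, 0.00],
    "call":  [0.00, 0.10, 0.90],
    "now":   [0.00, 0.20, 0.80],
}


def bag_of_words(tokens, vocabulary):
    """Count how often each vocabulary term occurs in the message."""
    return [tokens.count(term) for term in vocabulary]


def average_vector(tokens, vectors):
    """Represent a message as the mean of its known word vectors."""
    dim = len(next(iter(vectors.values())))
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * dim
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


msg_a = "gr8 film".split()     # noisy spelling
msg_b = "great movie".split()  # same meaning, no shared tokens
msg_c = "call now".split()     # unrelated message
vocab = sorted(TOY_VECTORS)

# Bag-of-words sees no overlap between msg_a and msg_b.
bow_sim_ab = cosine(bag_of_words(msg_a, vocab), bag_of_words(msg_b, vocab))

# Averaged word vectors place msg_a near msg_b and far from msg_c.
dense_sim_ab = cosine(average_vector(msg_a, TOY_VECTORS),
                      average_vector(msg_b, TOY_VECTORS))
dense_sim_ac = cosine(average_vector(msg_a, TOY_VECTORS),
                      average_vector(msg_c, TOY_VECTORS))

print(bow_sim_ab, dense_sim_ab, dense_sim_ac)
```

With these toy values, the bag-of-words vectors of "gr8 film" and "great movie" share no terms and are therefore orthogonal, while their averaged word vectors are nearly identical. This is the kind of behavior the hypothesis relies on, although it says nothing about classification performance on real data.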

Scientific publications
(References retrieved automatically from Web of Science and SciELO through information on FAPESP grants and their corresponding numbers as mentioned in the publications by the authors)
LOCHTER, JOHANNES V.; PIRES, PEDRO R.; BOSSOLANI, CARLOS; YAMAKAMI, AKEBO; ALMEIDA, TIAGO A. Evaluating the impact of corpora used to train distributed text representation models for noisy and short texts. 2018 International Joint Conference on Neural Networks (IJCNN), IEEE, 8 pp. (17/06495-2, 17/09387-6)