Scholarship 17/06495-2 - Machine learning, Natural language processing

Grant number:	17/06495-2
Support Opportunities:	Scholarships in Brazil - Scientific Initiation
Start date:	June 01, 2017
End date:	May 31, 2018
Field of knowledge:	Physical Sciences and Mathematics - Computer Science - Computer Systems

Principal Investigator:	Tiago Agostinho de Almeida
Grantee:	Pedro Reis Pires

Host Institution:	Centro de Ciências em Gestão e Tecnologia (CCGT). Universidade Federal de São Carlos (UFSCAR). Campus de Sorocaba. Sorocaba, SP, Brazil

Abstract

Classifying text messages is becoming increasingly difficult as more people use mobile devices to access the Internet, which leaves messages rife with slang, abbreviations, and typos. The traditional text representation known as bag-of-words has a series of shortcomings that are aggravated when messages are short and noisy. One of the most widely adopted ways to overcome these problems relies on techniques such as lexical normalization and semantic indexing. These solutions, however, have the disadvantage of being language dependent and requiring constant maintenance. This project investigates the use of distributed vector representations of text as an alternative to bag-of-words in the classification of short and noisy messages. In these representations, words that are semantically more similar are represented by vectors that lie closer together in an n-dimensional space. The hypothesis of this project is that, by preserving the semantic similarity between words, these representations circumvent many of the deficiencies of bag-of-words and can provide superior performance. Because they are generated by unsupervised methods, these representations have the advantage of not requiring dictionaries. Since there are different algorithms for generating distributed vector representations, this work will investigate which one provides the best performance in the categorization task and whether this representation can outperform the traditional bag-of-words. (AU)
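The contrast drawn in the abstract can be illustrated with a minimal Python sketch. It is not the project's method: the three-dimensional word vectors below are invented by hand for illustration (a real model would learn them from a corpus by an unsupervised method), and averaging word vectors to represent a message is only one simple choice assumed here. The sketch compares two short messages that mean roughly the same thing but share no tokens.

```python
import math
from collections import Counter

# Hypothetical word vectors, hand-made for illustration only.
# A trained model would learn these values from unlabeled text.
TOY_VECTORS = {
    "great": [0.9, 0.1, 0.0],
    "gr8": [0.8, 0.2, 0.1],
    "movie": [0.1, 0.9, 0.0],
    "film": [0.2, 0.8, 0.1],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def bag_of_words(tokens, vocabulary):
    """Count how often each vocabulary word occurs in the message."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


def mean_embedding(tokens, vectors):
    """Represent a message as the average of its known word vectors (an assumed, simple scheme)."""
    dim = len(next(iter(vectors.values())))
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * dim
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


# Two noisy short messages with similar meaning and no tokens in common.
msg_a = ["great", "movie"]
msg_b = ["gr8", "film"]

vocabulary = sorted(set(msg_a) | set(msg_b))
bow_sim = cosine(bag_of_words(msg_a, vocabulary), bag_of_words(msg_b, vocabulary))
emb_sim = cosine(mean_embedding(msg_a, TOY_VECTORS), mean_embedding(msg_b, TOY_VECTORS))

print(f"bag-of-words similarity: {bow_sim:.2f}")
print(f"embedding similarity:    {emb_sim:.2f}")
```

Under bag-of-words the two messages are orthogonal because they share no tokens, while the averaged toy vectors place them close together. Whether this advantage holds on real data with learned vectors is the question the project sets out to investigate.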

Scientific publications

(References retrieved automatically from Web of Science and SciELO through information on FAPESP grants and their corresponding numbers as mentioned in the publications by the authors)

LOCHTER, JOHANNES V.; PIRES, PEDRO R.; BOSSOLANI, CARLOS; YAMAKAMI, AKEBO; ALMEIDA, TIAGO A. Evaluating the impact of corpora used to train distributed text representation models for noisy and short texts. 2018 International Joint Conference on Neural Networks (IJCNN), IEEE, 8 pp., 2018. (17/06495-2, 17/09387-6)
