A continuously evolving distributed text representation model
Urban insights: deep learning applied to governance in cities
Veritas: a Brazilian legal documents dataset for natural language processing
Grant number: | 17/06495-2 |
Support Opportunities: | Scholarships in Brazil - Scientific Initiation |
Effective date (Start): | June 01, 2017 |
Effective date (End): | May 31, 2018 |
Field of knowledge: | Physical Sciences and Mathematics - Computer Science - Computer Systems |
Principal Investigator: | Tiago Agostinho de Almeida |
Grantee: | Pedro Reis Pires |
Host Institution: | Centro de Ciências em Gestão e Tecnologia (CCGT). Universidade Federal de São Carlos (UFSCAR). Campus de Sorocaba. Sorocaba, SP, Brazil |
Abstract: | Classifying text messages is becoming increasingly difficult as more people use mobile devices to access the Internet, which leaves messages rife with slang, abbreviations, and typos. The traditional bag-of-words text representation has a series of shortcomings that are aggravated when messages are short and noisy. Among the most widely adopted solutions to these problems are techniques such as lexical normalization and semantic indexing. These solutions, however, have the disadvantage of being language dependent and requiring constant maintenance. This project investigates distributed vector representations of text as an alternative to bag-of-words for classifying short and noisy messages. In these representations, semantically similar words are represented by nearby vectors in an n-dimensional space. The hypothesis of this project is that, by preserving the semantic similarity between words, these representations circumvent many of the deficiencies of bag-of-words and can provide superior performance. Because they are generated by unsupervised methods, these representations have the advantage of not needing dictionaries. Given that there are different algorithms for generating distributed vector representations, this work will investigate which provides the best performance in the categorization task and whether this representation can outperform the traditional bag-of-words. (AU) |
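The contrast between bag-of-words and distributed representations described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the three-dimensional vectors in `TOY_VECTORS` are hand-picked rather than learned, the two messages are invented, and averaging word vectors is only one simple way to obtain a message-level vector. It is not the method used in the project.

```python
import math
from collections import Counter

# Hypothetical 3-dimensional word vectors, hand-picked for illustration only.
# Real distributed representations would be learned, without supervision,
# from a large unlabeled corpus and would have many more dimensions.
TOY_VECTORS = {
    "good": [0.90, 0.10, 0.00],
    "gud": [0.85, 0.15, 0.05],  # noisy spelling placed close to "good"
    "movie": [0.10, 0.90, 0.10],
    "film": [0.15, 0.85, 0.10],  # near-synonym placed close to "movie"
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def bag_of_words(tokens, vocabulary):
    """Count vector with one dimension per vocabulary word."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


def mean_vector(tokens, vectors):
    """Average the vectors of the known tokens (an assumed, simple composition rule)."""
    known = [vectors[t] for t in tokens if t in vectors]
    dims = len(next(iter(vectors.values())))
    if not known:
        return [0.0] * dims
    return [sum(column) / len(known) for column in zip(*known)]


# Two invented messages with the same meaning but no words in common.
msg_a = ["good", "movie"]
msg_b = ["gud", "film"]
vocabulary = sorted(set(msg_a) | set(msg_b))

bow_similarity = cosine(bag_of_words(msg_a, vocabulary),
                        bag_of_words(msg_b, vocabulary))
dense_similarity = cosine(mean_vector(msg_a, TOY_VECTORS),
                          mean_vector(msg_b, TOY_VECTORS))

print(f"bag-of-words similarity: {bow_similarity:.3f}")
print(f"averaged-vector similarity: {dense_similarity:.3f}")
```

Under these assumptions, the bag-of-words vectors share no nonzero dimension, so their similarity is zero, while the averaged word vectors remain close because the noisy and the standard spellings were placed near each other in the toy space.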