
A social media monitoring platform: large scale natural language processing using Hadoop

Grant number: 14/22802-4
Support type: Research Grants - Innovative Research in Small Business - PIPE
Duration: October 01, 2015 - June 30, 2016
Field of knowledge: Physical Sciences and Mathematics - Computer Science
Principal Investigator: Thiago Barros Rodrigues Costa
Grantee: Thiago Barros Rodrigues Costa
Company: Nervera Serviços de Informática Ltda. - ME
City: São Paulo
Co-Principal Investigators: Andrei Cristian Roman
Associated grant(s): 17/12888-7 - A big data platform for media aggregation and intelligence, AP.PIPE
Associated scholarship(s): 16/00261-7 - A social media monitoring platform: large scale natural language processing using Hadoop, BP.PIPE

Abstract

A large volume of social media data, such as messages, comments, news and blog posts, is found in the form of natural language. This increases the demand for machine learning and natural language processing techniques such as classifiers and sequence models. Our project aims to develop these types of techniques in a distributed and scalable way, with a focus on Portuguese. We will apply these technologies to build a social media monitoring system. Hadoop is an open-source platform that was initially inspired by technologies developed by Google, such as the MapReduce framework and the Google File System. Yahoo was one of its earliest adopters and continues to use the platform intensively in its business. The increasing adoption of Hadoop by large companies as well as by independent developers has helped to create a rich set of tools, functionalities and code repositories. This is why Hadoop was chosen as our platform for developing our distributed algorithms. The basis of Hadoop is the MapReduce framework, which processes distributed data so that each node in the system stores and processes its own segment of the data. The first phase (Map) reads and locally processes a segment of data. The second phase (Reduce) aggregates the Map output from the different segments and computes the final result. The concrete objective of this project is to implement advanced machine learning and natural language processing techniques in the MapReduce framework. As a result, we expect to extract high-value information from massive unstructured data, specifically from text written in Portuguese. The commercial application will be a social media monitoring platform that will help our customers obtain relevant information for strategic decisions. (AU)
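
The two phases described in the abstract can be illustrated with a minimal sketch. It is not the project's code and does not use the Hadoop API: it simulates the Map, grouping, and Reduce steps in a single Python process, using token counting as a stand-in for a real natural language processing task. The function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) and the sample Portuguese messages are assumptions made for illustration only.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(segment: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Map: read one local segment and emit a (token, 1) pair per token."""
    for message in segment:
        for token in message.lower().split():
            yield token, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group emitted values by key (a real framework does this between phases)."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: aggregate all values emitted for one key into a final count."""
    return key, sum(values)


def run_job(segments: List[List[str]]) -> Dict[str, int]:
    """Run Map on every segment, group the output, then Reduce each key."""
    # On a cluster each segment would be mapped on the node that stores it;
    # here the segments are processed one after another (illustrative only).
    emitted = [pair for segment in segments for pair in map_phase(segment)]
    grouped = shuffle(emitted)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Hypothetical data: two segments, as if stored on two different nodes.
    segments = [
        ["bom dia", "dia de sol"],
        ["sol e praia", "bom dia"],
    ]
    print(run_job(segments))
```

In this sketch the Map step never looks beyond its own segment, which is the property that lets a framework such as Hadoop run it in parallel on the nodes holding the data; only the grouped intermediate pairs need to move between nodes before the Reduce step.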