
A social media monitoring platform: large scale natural language processing using Hadoop


A large volume of social media data, such as messages, comments, news and blog posts, takes the form of natural-language text. This increases the demand for machine learning and natural language processing techniques such as classifiers and sequence models. Our project aims to develop techniques of this kind in a distributed and scalable way, with a focus on Portuguese, and to apply them to build a social media monitoring system. Hadoop is an open-source platform inspired by technologies developed at Google, namely the MapReduce framework and the Google File System. Yahoo was an early driver of its development and continues to use the platform intensively in its business. The growing adoption of Hadoop by large companies and independent developers has produced a rich set of tools, functionality and code repositories, which is why we chose Hadoop as the platform for developing our distributed algorithms. The basis of Hadoop is the MapReduce framework, which processes distributed data so that each node in the system stores and processes one segment of the data. The first phase (Map) reads and processes a segment of data locally. The second phase (Reduce) aggregates the Map output from the different segments and computes the final result. The concrete objective of this project is to implement advanced machine learning and natural language processing techniques in the MapReduce framework. As a result, we expect to extract high-value information from massive unstructured data, specifically from text written in Portuguese. The commercial application will be a social media monitoring platform that helps our customers obtain relevant information for strategic decisions. (AU)
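The two phases described in the abstract can be illustrated with a minimal single-process sketch. It is not the project's implementation and does not use Hadoop itself. The function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`), the whitespace tokenization and the sample Portuguese messages are all assumptions made for illustration. Each inner list stands in for the data segment held by one node.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(segment: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Map: read one local segment of messages and emit (token, 1) pairs."""
    for message in segment:
        # Naive whitespace tokenization; a real pipeline would use a
        # Portuguese-aware tokenizer (assumption for illustration only).
        for token in message.lower().split():
            yield token, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group emitted values by key, as the framework does between phases."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: aggregate all values emitted for one key across segments."""
    return key, sum(values)


def run_job(segments: List[List[str]]) -> Dict[str, int]:
    """Run Map over every segment, then Reduce over the grouped output."""
    mapped = [pair for segment in segments for pair in map_phase(segment)]
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


# Hypothetical data: two segments, as if stored on two different nodes.
segments = [
    ["bom dia brasil", "dia de sol"],
    ["bom jogo", "brasil campeão"],
]
counts = run_job(segments)
print(counts)
```

In a real cluster the Map calls would run in parallel on the nodes that store each segment, and the grouping step would be carried out by the framework rather than by an in-memory dictionary.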
