
Query translation using enhanced morphology and semantic techniques in a statistical machine translation system (MorSeM)

Grant number: 12/02131-2
Support Opportunities: Research Grants - Visiting Researcher Grant - International
Duration: May 06, 2012 - November 30, 2012
Field of knowledge: Physical Sciences and Mathematics - Computer Science
Principal Investigator: Renata Wassermann
Grantee: Renata Wassermann
Visiting researcher: Marta Ruiz Costa-Jussa
Visiting researcher institution: Barcelona Media (BM), Spain
Host Institution: Instituto de Matemática e Estatística (IME). Universidade de São Paulo (USP). São Paulo, SP, Brazil


Multilingual information retrieval is becoming more relevant as more and more languages appear on different platforms, and it is increasingly common for non-native speakers to explore multilingual text collections. Cross-lingual Information Retrieval (CLIR) addresses the situation in which a user tries to retrieve information from a set of documents written in one language using a query in another language. In the CLIR context, this project intends to investigate new linguistic methods for statistical machine translation (SMT) that maximize the quality of query translation. The project will be developed in the context of the OnAIR Project (FAPESP 2010/19111-9), which focuses on facilitating the task of searching for information in long videos in either English or Brazilian Portuguese.
On the one hand, queries differ significantly from the running text that machine translation (MT) systems are designed to translate. Query translation mostly involves domain-specific and isolated terms, with restricted or no additional context information, so standard MT methods can perform poorly. On the other hand, MT is a highly interdisciplinary field, approached from the points of view of human translators, engineers, computer scientists, mathematicians and linguists. Taking advantage of this, the main objective of the project is to explore different linguistic and statistical techniques (focusing on morphology and semantics) to be introduced into a state-of-the-art statistical MT system in order to translate queries correctly.
One of the main problems in machine translation is choosing the correct meaning of a word, which is a classification or disambiguation problem. At the same time, one of the most important aspects of query translation is improving the semantic quality of the translation. In addition, morphology can be a barrier to semantic disambiguation. Therefore, in this project we will introduce morphology tools to try to deal with these challenges. To improve accuracy, a method can be applied to disambiguate the different meanings of a single word. We will study the best way to introduce a bilingual dictionary to solve disambiguation problems. We will then experiment with using the source context of the query to solve disambiguation problems by means of statistical techniques such as the Dice score, vector-space modeling or latent semantic analysis.
We will evaluate and contrast our new methodologies in terms of both MT and CLIR quality. Once the enhanced MT methodology has been studied, analyzed and compared to plain statistical MT systems, we will choose the combination of enhanced techniques that gives the best results for integration with the OnAIR project. The enhanced MT system will be integrated into and adapted to the OnAIR platform to translate queries (and documents if necessary). Automatic query translation will allow users to search videos in one language (English or Brazilian Portuguese) while asking questions in the other language (Brazilian Portuguese or English). (AU)
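As a rough illustration of the dictionary-plus-context idea described above, the following minimal sketch picks, for each query word, the dictionary translation that co-occurs most strongly (by Dice score) with the translations of the other query words. It is not the project's method: the toy corpus, the dictionary entries and the names `build_index`, `dice` and `translate_query` are all assumptions invented for this example, and a real system would work inside an SMT decoder with much larger resources.

```python
from collections import defaultdict

# Hypothetical target-language (Portuguese) corpus, invented for illustration.
TARGET_CORPUS = [
    "a margem do rio estava alagada",
    "o rio transbordou pela margem",
    "o banco aprovou o empréstimo",
    "o banco cobra juros altos",
]

# Hypothetical bilingual dictionary: English word -> candidate translations.
DICTIONARY = {
    "bank": ["banco", "margem"],
    "river": ["rio"],
    "loan": ["empréstimo"],
}


def build_index(corpus):
    """Map each token to the set of document ids in which it occurs."""
    index = defaultdict(set)
    for doc_id, doc in enumerate(corpus):
        for token in doc.lower().split():
            index[token].add(doc_id)
    return index


def dice(index, a, b):
    """Dice score: 2 * |docs(a) & docs(b)| / (|docs(a)| + |docs(b)|)."""
    docs_a, docs_b = index.get(a, set()), index.get(b, set())
    total = len(docs_a) + len(docs_b)
    return 2 * len(docs_a & docs_b) / total if total else 0.0


def translate_query(query, dictionary, index):
    """Translate word by word, choosing the candidate best supported by context.

    Unknown words are passed through unchanged; ties keep the first candidate.
    No reordering or morphology is attempted.
    """
    words = query.lower().split()
    output = []
    for i, word in enumerate(words):
        candidates = dictionary.get(word, [word])
        # Candidate translations of every other word in the query.
        context = [t for j, w in enumerate(words) if j != i
                   for t in dictionary.get(w, [])]
        best = max(candidates,
                   key=lambda c: sum(dice(index, c, t) for t in context))
        output.append(best)
    return output


index = build_index(TARGET_CORPUS)
print(translate_query("river bank", DICTIONARY, index))
print(translate_query("bank loan", DICTIONARY, index))
```

With this toy data, "bank" is rendered as "margem" next to "river" and as "banco" next to "loan", because those are the pairs that share documents in the corpus. Vector-space modeling or latent semantic analysis would replace the `dice` function with a similarity computed over term vectors.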
