Abstract
Visual Question Answering (VQA) is the task of answering a user's question grounded in a given image. It requires combining concepts from Computer Vision and Natural Language Processing. Most existing VQA systems merge the extracted image and question features to predict an answer. Nonetheless, this multi-modal fusion shows a significant ga…