Enhancing visual information in image-based question answering tasks with scene graph data using self-supervised learning

Author(s):
Bruno César de Oliveira Souza
Total Authors: 1
Document type: Master's Dissertation
Place of publication: Campinas, SP.
Institution: Universidade Estadual de Campinas (UNICAMP). Instituto de Computação
Defense date:
Examining board members:
Adín Ramírez Rivera; Thiago Alexandre Salgueiro Pardo
Advisors: Adín Ramírez Rivera; Hélio Pedrini
Abstract

The intersection of vision and language has garnered significant interest as researchers aim for seamless integration between visual recognition and reasoning capabilities. Scene graphs have emerged as a valuable tool in multimodal image-language tasks, exhibiting high performance in tasks such as Visual Question Answering (VQA). However, current methods that rely on idealized annotated scene graphs often struggle to generalize when utilizing predicted scene graphs extracted directly from images. In this study, we address this challenge by introducing the SelfGraphVQA framework. Our approach involves extracting a scene graph from an input image using a pre-trained scene graph generator and subsequently enhancing the visual information through self-supervised techniques. By leveraging self-supervision, our method enhances the utilization of graph representations in VQA tasks, eliminating the need for expensive and potentially biased annotation data. Additionally, we employ image augmentations to create alternative views of the extracted scene graphs, enabling the learning of joint embeddings through a contrastive approach that optimizes the informational content within their representations. In our experiments, we explore three distinct contrastive strategies: node-wise, graph-wise, and permutation equivariance regularization, all tailored to scene graph processing. Through empirical evaluations, we demonstrate the effectiveness of the extracted scene graphs in VQA tasks, surpassing the limitations of relying solely on annotated scene graphs. Furthermore, we show that our self-supervised approach significantly enhances the overall performance of VQA models by placing greater emphasis on visual information. As a result, our framework provides a more practical and efficient solution for VQA tasks that rely on scene graphs to address complex reasoning questions. Overall, our study showcases the efficacy of leveraging self-supervised techniques to enhance scene graph utilization in VQA tasks. By circumventing the limitations of idealized annotated scene graphs, we promote a robust approach to incorporating visual information for multimodal understanding. The SelfGraphVQA framework contributes to the advancement of seamless integration between vision and language, unlocking new possibilities for improved recognition and reasoning in the field of image-language tasks.
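To make the contrastive training described in the abstract concrete, the sketch below shows one plausible way to combine a node-wise and a graph-wise InfoNCE-style objective over embeddings produced from two augmented views of the same images. This is a minimal sketch assuming a PyTorch setup; the names nt_xent, contrastive_step, and gnn are illustrative placeholders, not the dissertation's actual SelfGraphVQA implementation, and it assumes the two views yield node embeddings in corresponding order.

import torch
import torch.nn.functional as F

def nt_xent(z1, z2, temperature=0.5):
    # Symmetric InfoNCE / NT-Xent loss: matching rows of z1 and z2 are positives,
    # all other pairs in the batch act as negatives.
    z1 = F.normalize(z1, dim=-1)
    z2 = F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature                     # (N, N) cosine similarities
    targets = torch.arange(z1.size(0), device=z1.device)   # positives lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

def contrastive_step(gnn, views_a, views_b):
    # views_a / views_b: batches of augmented scene graphs for the same images.
    # `gnn` is a hypothetical scene-graph encoder returning per-node embeddings
    # of shape (total_nodes, D) and pooled per-graph embeddings of shape (batch, D).
    node_a, graph_a = gnn(views_a)
    node_b, graph_b = gnn(views_b)
    loss_node = nt_xent(node_a, node_b)      # node-wise strategy
    loss_graph = nt_xent(graph_a, graph_b)   # graph-wise strategy
    return loss_node + loss_graph

The third strategy mentioned in the abstract, permutation equivariance regularization, is not shown here; under the same assumptions it would add a penalty encouraging the encoder's node outputs to permute consistently when the input nodes are reordered.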

FAPESP's process: 20/14452-4 - Visual question answering task with graph convolution networks
Grantee: Bruno César de Oliveira Souza
Support Opportunities: Scholarships in Brazil - Master