Hierarchical variational visual attention

Darley Freire Barreto

Author(s):	Darley Freire Barreto
Document type:	Master's Dissertation
Press:	Campinas, SP.
Institution:	Universidade Estadual de Campinas (UNICAMP). Instituto de Computação
Defense date:	2021-05-10
Examining board members:	Adín Ramírez Rivera; Esther Luna Colombini; Roberto de Alencar Lotufo
Advisor:	Adín Ramírez Rivera
Abstract
Attention in artificial intelligence was inspired by human visual attention and designed to increase the flexibility of neural models by giving them a sense of relevance. Full visual inputs contain excessive information that can affect models and possibly undermine their performance. By attending to regions of interest in an image, a model can control the flow of information and focus on the parts relevant to the task, possibly reducing the training complexity of the downstream model that uses these attended regions. This work proposes to model attention as samples from a variational distribution, computing the probability of all pixel locations with respect to the predicted distribution and thereby creating a mask over the input image. Three similar models are presented and evaluated. The core idea is to use a neural network to predict the parameters of a Normal distribution whose samples represent the center of an attention mask in pixel space, with the mask size given by the predicted standard deviation. Initially, a model is proposed that predicts four parameters and creates a hierarchical distribution: these parameters define a Normal and a Gamma distribution, and samples from both are used to create a second Normal, which is then used to generate the attention. However, experiments have shown that this approach is not sufficient to predict robust attentional masks. Therefore, a second model with only one level is introduced, so only two parameters need to be predicted to create a Normal distribution and sample the masks. As with the first model, the predicted attentional masks were far less accurate than expected, diverging considerably from the training, validation, and test labels. Finally, a third model is proposed to simplify the second by removing the need to predict the standard deviation, focusing only on the mean of the Normal distribution.
Experiments performed on all three methods with both synthetic and real data show that the modeling and the optimization function considered in this work are not sufficient to guide the model on a generic data set. In the simplest configuration, i.e., predicting only the mean of the attentional distribution, experiments show that the model cannot learn when the data have small sample variability. However, when the number of instances and classes is increased, the model achieves acceptable results compared to the alternatives. Yet, when the number of instances is increased further, the model is once again unable to learn, revealing that there is a threshold between data complexity and modeling capacity.
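The masking mechanism described in the abstract can be illustrated with a minimal sketch. It is not the dissertation's implementation: the function names, the isotropic 2-D Normal, the peak scaling to 1.0, and the reading of the Gamma sample as a precision for the second Normal are all assumptions made for illustration, and the "predicted" parameters are hard-coded stand-ins for the outputs of a neural network.

```python
import math
import random


def gaussian_mask(height, width, center, std):
    """Score every pixel under an isotropic 2-D Normal centered at `center`.

    Assumption: values are scaled so the peak is 1.0; a true density would
    also carry the 1 / (2 * pi * std**2) normalizing factor.
    """
    cy, cx = center
    return [
        [math.exp(-((y - cy) ** 2 + (x - cx) ** 2) / (2.0 * std ** 2))
         for x in range(width)]
        for y in range(height)
    ]


def sample_center(mean, std, rng):
    """One-level case: draw the mask center from Normal(mean, std)."""
    return (rng.gauss(mean[0], std), rng.gauss(mean[1], std))


def sample_hierarchical(mean, mean_std, shape, scale, rng):
    """Hierarchical case (assumed form): a Normal draw gives the location of
    the second Normal and a Gamma draw gives its precision."""
    loc = (rng.gauss(mean[0], mean_std), rng.gauss(mean[1], mean_std))
    precision = rng.gammavariate(shape, scale)
    std = 1.0 / math.sqrt(precision)
    return sample_center(loc, std, rng), std


def apply_mask(image, mask):
    """Weight each pixel of a grayscale image by its mask value."""
    return [
        [pixel * weight for pixel, weight in zip(image_row, mask_row)]
        for image_row, mask_row in zip(image, mask)
    ]


rng = random.Random(0)
predicted_mean, predicted_std = (2.0, 2.0), 1.0  # hypothetical network outputs
center = sample_center(predicted_mean, predicted_std, rng)
mask = gaussian_mask(5, 5, center, predicted_std)
image = [[1.0] * 5 for _ in range(5)]
attended = apply_mask(image, mask)
```

In this sketch the third model from the abstract would correspond to calling `sample_center` with a fixed `std` and only the mean predicted.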

FAPESP's process:	18/10027-7 - An attentional model for videos classification
Grantee:	Darley Freire Barreto
Support Opportunities:	Scholarships in Brazil - Master
