Convolutional neural networks based on visual rhythms and adaptive fusion for a multi-stream architecture applied to human action recognition

Author(s):
Helena de Almeida Maia
Total Authors: 1
Document type: Doctoral Thesis
Place of publication: Campinas, SP.
Institution: Universidade Estadual de Campinas (UNICAMP). Instituto de Computação
Defense date:
Examining board members:
Hélio Pedrini; Rodrigo Luis de Souza da Silva; Tiago José de Carvalho; Esther Luna Colombini; Tiago Fernandes Tavares
Advisor: Marcelo Bernardes Vieira; Hélio Pedrini
Abstract

The large amount of video data produced and released every day makes visual inspection by a human operator impracticable. However, the content of these videos can be useful for various important tasks, such as surveillance and health monitoring. Therefore, automatic methods are needed to detect and understand relevant events in videos. The problem addressed in this work is the recognition of human actions in videos, which aims to classify the action being performed by one or more actors. The complexity of the problem and the volume of video data suggest the use of deep learning-based techniques. However, unlike image-related problems, there is neither a great variety of well-established specialized architectures nor annotated datasets as large as the image-based ones. To circumvent these limitations, we propose and analyze a multi-stream architecture containing image-based networks pre-trained on the large ImageNet dataset. Different image representations are extracted from the videos to feed the streams, in order to provide complementary information to the system. Here, we propose new streams based on visual rhythms, which encode longer-term information compared to still frames and optical flow. As important as the definition of representative and complementary aspects is the choice of proper combination methods that explore the strengths of each modality. Thus, we also analyze different fusion approaches for combining the modalities. In order to define the best parameters of our fusion methods using the training set, we have to reduce overfitting in the individual modalities; otherwise, their 100%-accurate training outputs would not offer a realistic and relevant representation for the fusion method. We therefore investigate an early stopping technique for training the individual networks.
In addition to reducing overfitting, this method also reduces the training cost, since it usually requires fewer epochs to complete the classification process, and it adapts to new streams and datasets thanks to its trainable parameters. Experiments are conducted on the UCF101 and HMDB51 datasets, two challenging benchmarks in the context of action recognition. (AU)
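To make the two central ideas of the abstract concrete, the sketch below illustrates (a) a visual rhythm, i.e., a 2D image built by sampling one line from every frame of a video and stacking the lines over time, and (b) a simple weighted late fusion of per-stream class scores. This is a minimal illustration under assumed conventions: the function names, the single-middle-row sampling, and the fixed fusion weights are simplifications for exposition, not the thesis's actual streams or its adaptive, trainable fusion.

```python
import numpy as np

def visual_rhythm(video, row=None):
    """Illustrative visual rhythm: sample one row per frame and stack.

    video: grayscale video as an array of shape (T, H, W).
    Returns an image of shape (T, W) whose vertical axis is time,
    so motion along the sampled row appears as 2D texture that an
    image-based CNN stream can consume.
    """
    t, h, w = video.shape
    if row is None:
        row = h // 2  # assumption: sample the middle row of each frame
    return video[:, row, :]

def late_fusion(stream_scores, weights):
    """Weighted average of per-stream class-probability vectors.

    stream_scores: list of length-n_classes probability vectors,
    one per stream (e.g., RGB, optical flow, visual rhythm).
    weights: one non-negative weight per stream (fixed here; the
    thesis learns such parameters instead).
    """
    scores = np.stack([np.asarray(s, dtype=float) for s in stream_scores])
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                           # normalize weights to sum to 1
    return (w[:, None] * scores).sum(axis=0)  # fused class probabilities
```

For example, a 10-frame 32x32 video yields a 10x32 rhythm image, and fusing two 2-class streams `[0.2, 0.8]` and `[0.6, 0.4]` with equal weights gives `[0.4, 0.6]`.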

FAPESP's process: 17/09160-1 - Human Action Recognition in Videos
Grantee: Helena de Almeida Maia
Support Opportunities: Scholarships in Brazil - Doctorate