Representation learning of spatio-temporal features from video

Author(s):
Gabriel de Barros Paranhos da Costa
Total Authors: 1
Document type: Doctoral Thesis
Place of publication: São Carlos.
Institution: Universidade de São Paulo (USP). Instituto de Ciências Matemáticas e de Computação (ICMC/SB)
Defense date:
Examining board members:
Moacir Antonelli Ponti; Tiago José de Carvalho; Hélio Pedrini; Marcela Xavier Ribeiro
Advisor: Moacir Antonelli Ponti; Rodrigo Fernandes de Mello
Abstract

One of the main challenges in computer vision is to encode the information present in images and videos into a feature vector that can later be used, for example, to train a machine learning model. Videos pose an extra challenge, since both spatial and temporal information need to be considered. To address the challenges of creating new feature extraction methods, representation learning focuses on building data-driven representations directly from raw data; these methods have achieved state-of-the-art performance on many image-focused computer vision tasks. For these reasons, spatio-temporal representation learning from videos is considered a natural next step. Even though multiple architectures have been proposed for video processing, the results these methods obtain on videos are still comparable to those of hand-crafted feature extraction methods, and considerably below the gains representation learning has delivered on images. We believe that, to advance the area of spatio-temporal representation learning, a better understanding of how information is encoded by these methods is required, allowing for more informed decisions about when each architecture should be used. For this purpose, we propose a novel evaluation protocol that examines a synthetic problem in three settings, where the information relevant to the task appears only in the spatial dimensions, only in the temporal dimension, or in both. We also investigate the advantages of using a representation learning method over hand-crafted feature extraction, especially regarding their use on different (previously unknown) tasks. Lastly, we propose a data-driven regularisation method based on generative networks and knowledge transfer to improve the feature space learnt by representation learning methods. Our results show that, when learning spatio-temporal representations, it is important to include temporal information at every stage. We also notice that, while architectures using convolutions on the temporal dimension achieved the best results among the tested architectures, they had difficulty adapting to changes in the temporal information. When comparing the performance of hand-crafted and learnt representations on multiple tasks, hand-crafted features obtained better results on the task they were designed for, but considerably worse performance on a second, unrelated task. Finally, we show that generative networks have promising applications in knowledge transfer, even though further investigation is required in a spatio-temporal setting. (AU)
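To make the three-setting protocol concrete, the sketch below (Python/NumPy) generates toy clips whose class label is carried only by a spatial cue, only by a temporal cue, or only by their combination. This is an illustration of the idea, not the thesis's actual generator; the function name make_clip, the patterns, the clip dimensions, and the class definitions are all hypothetical.

    import numpy as np

    def make_clip(setting, label, frames=16, size=32, seed=None):
        # Returns a (frames, size, size) float32 clip for a binary task.
        # setting='spatial':  the class decides the drawn pattern (square vs
        #                     plus sign); the pattern moves the same way in
        #                     both classes, so motion is uninformative.
        # setting='temporal': a fixed square moves left-to-right (class 0) or
        #                     right-to-left (class 1); any single frame is
        #                     ambiguous, only the motion carries the label.
        # setting='both':     label = pattern XOR direction, so spatial and
        #                     temporal cues must be combined to classify.
        rng = np.random.default_rng(seed)
        clip = rng.normal(0.0, 0.05, (frames, size, size)).astype(np.float32)
        half = 3  # half-width of the drawn patch

        if setting == 'spatial':
            pattern, direction = label, 0
        elif setting == 'temporal':
            pattern, direction = 0, label
        else:  # 'both'
            pattern = int(rng.integers(2))
            direction = pattern ^ label

        y = size // 2
        for t in range(frames):
            x = half + t if direction == 0 else size - 1 - half - t
            if pattern == 0:   # filled square
                clip[t, y - half:y + half, x - half:x + half] = 1.0
            else:              # plus sign
                clip[t, y - half:y + half, x] = 1.0
                clip[t, y, x - half:x + half] = 1.0
        return clip

Under such a split, a model that averages per-frame 2D features can only solve the 'spatial' setting, while the 'temporal' and 'both' settings require the architecture to encode motion; exposing exactly this kind of difference in behaviour is what the proposed protocol is intended to do.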

FAPESP's process: 15/05310-3 - Representation Learning of spatio-temporal features from video
Grantee: Gabriel de Barros Paranhos da Costa
Support Opportunities: Scholarships in Brazil - Doctorate