
On the spatial dilemma that links deep reenactment and disentangled representation learning in video

Author(s):
Juan Felipe Hernández Albarracín
Total Authors: 1
Document type: Doctoral Thesis
Press: Campinas, SP.
Institution: Universidade Estadual de Campinas (UNICAMP). Instituto de Computação
Defense date:
Examining board members:
Adín Ramírez Rivera; Erickson Rangel do Nascimento; Claudio Rosito Jung; Hélio Pedrini; Sandra Eliza Fontes de Avila
Advisor: Adín Ramírez Rivera
Abstract

The task of Video Motion Retargeting consists of animating the object of interest in a source image or video according to the motion present in a driving video. Modern retargeting techniques necessarily work with some notion of independence between the object of interest (also known as content) and its motion, so that no content traits leak when animating another object. Although the concrete ways of enforcing this independence vary widely in the literature, state-of-the-art approaches have something in common: they work with high-dimensional representations that are redundant with respect to spatial information, so the representation space is usually larger than the data space itself. Naturally, retargeting models that operate in low-dimensional representation spaces are less successful, owing to the amount of spatial information that is lost. Nonetheless, they yield compact features with interesting properties, making them suitable for numerous tasks beyond retargeting alone. In this thesis, we study the capacity of Variational-Autoencoder-based deep generative models to attain good-quality retargeting while operating exclusively in low-dimensional latent spaces. We implemented three models in which the notion of independence between motion and content is applied by learning disentangled representations that explicitly encode these two factors of variation. Each model applies different inductive biases drawn from weakly and self-supervised techniques, as well as more concrete supervisory signals that make it aware of spatial information without the need to represent it explicitly in high-dimensional spaces. Our contribution is twofold: first, we devise models that learn disentangled, compact, and meaningful representations that separately encode content and motion information; second, we explore diverse techniques to cope with the dilemma of giving up spatial information (and consequently retargeting quality) in exchange for meaningful disentangled features. Our results show that our models succeed not only in narrowing the performance gap between low-dimensional-latent-space models and state-of-the-art retargeting models, but also in attaining disentangled representations useful for downstream tasks. (AU)
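
To make the latent-space retargeting idea concrete, the sketch below shows a minimal VAE-style encoder/decoder that splits each frame's code into a low-dimensional content part and motion part, then recombines the source frame's content code with the driving frame's motion code. This is only an illustrative assumption of how such a model could look in PyTorch; the module names, layer sizes, single shared encoder, and absence of any training losses are not taken from the thesis's actual models.

import torch
import torch.nn as nn

class FrameEncoder(nn.Module):
    """Encodes a frame into two low-dimensional Gaussian posteriors:
    one for content (appearance) and one for motion (pose)."""
    def __init__(self, content_dim=32, motion_dim=16):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.content_head = nn.Linear(64, 2 * content_dim)  # mean and log-variance
        self.motion_head = nn.Linear(64, 2 * motion_dim)

    def forward(self, x):
        h = self.backbone(x)
        c_mu, c_logvar = self.content_head(h).chunk(2, dim=-1)
        m_mu, m_logvar = self.motion_head(h).chunk(2, dim=-1)
        return (c_mu, c_logvar), (m_mu, m_logvar)

class FrameDecoder(nn.Module):
    """Reconstructs a frame from a concatenated (content, motion) code."""
    def __init__(self, content_dim=32, motion_dim=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(content_dim + motion_dim, 64 * 8 * 8), nn.ReLU(),
            nn.Unflatten(1, (64, 8, 8)),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, content, motion):
        return self.net(torch.cat([content, motion], dim=-1))

def reparameterize(mu, logvar):
    # Standard VAE reparameterization trick
    return mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)

# Retargeting in latent space: content taken from the source frame, motion
# from the driving frame; no high-dimensional spatial representation is kept.
encoder, decoder = FrameEncoder(), FrameDecoder()
source = torch.rand(1, 3, 32, 32)   # source frame (object of interest)
driving = torch.rand(1, 3, 32, 32)  # driving frame (desired motion)
(c_mu, c_logvar), _ = encoder(source)
_, (m_mu, m_logvar) = encoder(driving)
retargeted = decoder(reparameterize(c_mu, c_logvar),
                     reparameterize(m_mu, m_logvar))
print(retargeted.shape)  # torch.Size([1, 3, 32, 32])

In this toy setup, disentanglement would still have to be enforced during training (for example, with weakly or self-supervised objectives over video pairs, as the abstract describes); the sketch only illustrates why swapping codes across videos is cheap once content and motion live in separate low-dimensional latents.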

FAPESP's process: 17/16144-2 - Video-to-video dynamics transfer with deep generative models
Grantee:Juan Felipe Hernández Albarracín
Support Opportunities: Scholarships in Brazil - Doctorate