A latent space analysis in encoder-decoder models to improve the representation learning for semantic segmentation task on images

Darwin Danilo Saire Pilco

Author(s):	Darwin Danilo Saire Pilco
Total Authors:	1
Document type:	Doctoral Thesis
Press:	Campinas, SP.
Institution:	Universidade Estadual de Campinas (UNICAMP). Instituto de Computação
Defense date:	2022-12-12
Examining board members:	Adín Ramírez Rivera; Alexandre Xavier Falcão; Hélio Pedrini; Roberto Hirata Junior; Moacir Antonelli Ponti
Advisor:	Adín Ramírez Rivera
Abstract
In recent years, Deep Neural Networks (DNNs) have proven to be powerful feature extractors, leading to outstanding results in many areas of knowledge, especially in computer vision. One such task is Semantic Segmentation (SS), the pixel-level classification of images: each pixel is labeled as belonging to a given semantic class. Semantic segmentation has applications in a wide range of fields, such as robotics, mapping, and scene understanding, in which pixel-level labels are paramount. DNNs have significantly improved SS, but they also introduced the problem of spatial precision loss, which often appears at the boundaries of the segmented objects. Multi-task learning, in turn, uses related tasks to improve the performance of a main task, so we adopt a multi-task approach to improve segmentation. However, choosing these related tasks is not trivial. In this thesis, we study the latent space (feature maps) of hourglass (encoder-decoder) models using a multi-task approach that complements the SS task with tasks based on object boundaries: edge detection, semantic contours, and the distance transform. We observe that, by sharing a common latent space, the complementary tasks produce more robust representations that enhance the semantic labels. Furthermore, we explore the influence of contour-based tasks on the latent space, as well as their impact on the performance of the SS process. By analyzing the latent space shaped by multi-task learning, we design a model that addresses the problem of spatial precision loss by providing an internal structure for the feature representations while extracting a global representation that supports that structure. To fit the internal structure, we predict a Gaussian Mixture Model from the data at training time, which, merged with the skip connections at the decoding stage, helps to avoid wrong inductive biases. Our results demonstrate the effectiveness of learning in a multi-task setting for hourglass models, improving the state of the art without any post-processing refinement. We also show that the SS task improves when global and local learned representations with a clustering behavior are provided and combined. However, obtaining a better-fitting representation space requires a dataset with many fine annotations. Finally, we present quantitative and qualitative results on the CamVid, Freiburg Forest, Cityscapes, and Synthia benchmark datasets.
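The boundary-based auxiliary tasks mentioned in the abstract need targets that can, in principle, be derived from the segmentation labels themselves. The following is a minimal sketch of one way to do that, not the thesis's implementation: the function names `boundary_map` and `distance_to_boundary` are hypothetical, the boundary rule (a pixel is a boundary if any 4-connected neighbor has a different label) is an assumption, and the distance computation is brute force, suitable only for small label maps.

```python
import numpy as np


def boundary_map(labels: np.ndarray) -> np.ndarray:
    """Mark pixels that have a 4-connected neighbor with a different label."""
    edges = np.zeros(labels.shape, dtype=bool)
    diff_h = labels[:, 1:] != labels[:, :-1]  # horizontal label changes
    diff_v = labels[1:, :] != labels[:-1, :]  # vertical label changes
    # Flag both pixels on each side of a label change.
    edges[:, 1:] |= diff_h
    edges[:, :-1] |= diff_h
    edges[1:, :] |= diff_v
    edges[:-1, :] |= diff_v
    return edges


def distance_to_boundary(edges: np.ndarray) -> np.ndarray:
    """Euclidean distance from every pixel to its nearest boundary pixel.

    Brute force (pixels x boundary pixels); assumed acceptable only for
    illustration on small arrays.
    """
    ys, xs = np.nonzero(edges)
    if ys.size == 0:
        # No boundary anywhere: distance is undefined, use infinity.
        return np.full(edges.shape, np.inf)
    grid_y, grid_x = np.indices(edges.shape)
    dy = grid_y[..., None] - ys
    dx = grid_x[..., None] - xs
    return np.sqrt(dy ** 2 + dx ** 2).min(axis=-1)


# Toy label map with two classes split down the middle.
labels = np.array([
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
])
edges = boundary_map(labels)
dist = distance_to_boundary(edges)
```

In a multi-task setup of the kind the abstract describes, maps like `edges` and `dist` would serve as targets for additional decoder heads that share the encoder's latent space with the segmentation head.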

FAPESP's process:	17/16597-7 - Semantic Segmentation on Videos
Grantee:	Darwin Danilo Saire Pilco
Support Opportunities:	Scholarships in Brazil - Doctorate
