Semantic segmentation aims to produce a dense classification by assigning a class label to every pixel of each object present in images or videos. Approaches based on convolutional neural networks (CNNs) have proven useful, achieving the best results on this task. Some challenges remain, however, such as the low resolution of the feature maps and the loss of spatial precision, both of which arise at the last convolution layer of the CNN. How to solve these problems and obtain consistent results is still an open question for images, and even more so for videos, which makes semantic segmentation on video particularly difficult. In this Ph.D. project, we propose an hourglass-shaped CNN architecture to address semantic segmentation on video. The proposed architecture is end-to-end trainable and extracts spatiotemporal information to discriminate between the several object classes present in a video, so that its final output is a densely labeled video. To achieve this goal, we need to: (i) learn meaningful spatiotemporal features (by learning convolution kernels) that differentiate the objects in the video while remaining consistent across variations between frames; (ii) learn multidimensional up-sampling and fusion kernels that use the predictions of lower-resolution levels and the existing spatiotemporal features to maintain the relations between voxels through the learned nonlinearities; and (iii) create an end-to-end learning framework (data augmentation and loss functions) that uses the existing annotations (both dense labels and bounding boxes) in video datasets to train the network.
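
To make the resolution problem and the hourglass shape concrete, the following minimal sketch traces how the size of a video feature volume (frames, height, width) could shrink through a down-sampling path and be restored by a mirrored up-sampling path. It is only an illustration under stated assumptions: the input size, the number of levels, the stride of 2, and the function names are hypothetical and are not taken from the project.

```python
from typing import List, Tuple

Shape = Tuple[int, int, int]  # (frames, height, width) of a feature volume


def encoder_shapes(input_shape: Shape, levels: int, stride: int = 2) -> List[Shape]:
    """Return the shape at every down-sampling level, starting with the input.

    Assumption: each level reduces every dimension by `stride`, rounding up,
    as a strided convolution with "same"-style padding would.
    """
    shapes = [input_shape]
    for _ in range(levels):
        t, h, w = shapes[-1]
        # -(-x // s) is ceiling division using integers only.
        shapes.append((-(-t // stride), -(-h // stride), -(-w // stride)))
    return shapes


def hourglass_shapes(input_shape: Shape, levels: int, stride: int = 2):
    """Return (down, up): shapes along the contracting and expanding paths.

    Assumption: the expanding path mirrors the contracting one, so each
    up-sampling step restores the shape of the matching encoder level, where
    the lower-resolution prediction can be fused with that level's features.
    """
    down = encoder_shapes(input_shape, levels, stride)
    up = down[-2::-1]  # every encoder level except the bottleneck, reversed
    return down, up


# Hypothetical clip: 16 frames of 224x224 pixels, three levels.
down, up = hourglass_shapes((16, 224, 224), levels=3)
for level, shape in enumerate(down):
    print(f"down level {level}: {shape}")
for level, shape in enumerate(up, start=1):
    print(f"up step {level}: {shape}")
```

The bottleneck is the smallest volume in the trace, which is where spatial precision is lowest; the final up-sampling step returns to the input size, which is what allows one label per voxel.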