An End-to-End Deep Learning Approach for Video Captioning Through Mobile Devices

Pezzuto Damaceno, Rafael J.; Cesar, Roberto M., Jr.

Author(s):	Pezzuto Damaceno, Rafael J. ; Cesar, Roberto M., Jr. Total Authors: 2
Document type:	Journal article
Source:	PROGRESS IN PATTERN RECOGNITION, IMAGE ANALYSIS, COMPUTER VISION, AND APPLICATIONS, CIARP 2023, PT I; v. 14469, p. 15-pg., 2024-01-01.
Abstract
Video captioning is a computer vision task that aims at generating a description for video content. This can be achieved using deep learning approaches that leverage image and audio data. In this work, we have developed two strategies to tackle this task in the context of resource-constrained devices: (i) generating one caption per frame combined with audio classification, and (ii) generating one caption for a set of frames combined with audio classification. In these strategies, we have utilized one architecture for the image data and another for the audio data. We have developed an application tailored for resource-constrained devices, where the image sensor captures images at a specific frame rate. The audio data is captured from a microphone for a predefined duration at a time. Our application combines the results from both modalities to create a comprehensive description. The main contribution of this work is the introduction of a new end-to-end application that can utilize the developed strategies and be beneficial for environment monitoring. Our method has been implemented on a low-resource computer, which poses a significant challenge. (AU)
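
The two strategies named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not name the image or audio architectures, so both are stood in for by caller-supplied functions (`caption_frame`, `caption_frames`) and a precomputed `audio_label`. All function names, the `Frame` type, the merging of consecutive duplicate captions, and the output format are assumptions made only for illustration.

```python
from typing import Callable, List, Sequence

Frame = bytes  # hypothetical placeholder for raw image data


def combine(captions: Sequence[str], audio_label: str) -> str:
    """Join visual captions and the audio class into one description.

    The output format here is an assumption, not taken from the paper.
    """
    visual = "; ".join(captions)
    return f"{visual} (audio: {audio_label})"


def describe_per_frame(
    frames: Sequence[Frame],
    audio_label: str,
    caption_frame: Callable[[Frame], str],
) -> str:
    """Strategy (i): one caption per frame, combined with the audio class."""
    captions: List[str] = []
    for frame in frames:
        caption = caption_frame(frame)
        # Assumption: skip a caption identical to the previous one so the
        # description does not repeat itself for near-identical frames.
        if not captions or captions[-1] != caption:
            captions.append(caption)
    return combine(captions, audio_label)


def describe_frame_set(
    frames: Sequence[Frame],
    audio_label: str,
    caption_frames: Callable[[Sequence[Frame]], str],
) -> str:
    """Strategy (ii): one caption for a set of frames, plus the audio class."""
    return combine([caption_frames(frames)], audio_label)
```

In such a sketch, a real application would replace the stand-in callables with the image captioning model and obtain `audio_label` from the audio classifier run on the most recent microphone recording.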

FAPESP's process:	15/22308-2 - Intermediate representations in Computational Science for knowledge discovery
Grantee:	Roberto Marcondes Cesar Junior
Support Opportunities:	Research Projects - Thematic Grants


FAPESP's process:	22/15304-4 - Learning context rich representations for computer vision
Grantee:	Nina Sumiko Tomita Hirata
Support Opportunities:	Research Projects - Thematic Grants


FAPESP's process:	22/12204-9 - Development of methods for image captioning: a framework based on computer vision and natural language processing
Grantee:	Rafael Jeferson Pezzuto Damaceno
Support Opportunities:	Scholarships in Brazil - Post-Doctoral
