An End-to-End Deep Learning Approach for Video Captioning Through Mobile Devices

Pezzuto Damaceno, Rafael J.; Cesar, Roberto M., Jr.

Author(s):	Pezzuto Damaceno, Rafael J. ; Cesar, Roberto M., Jr. Total number of authors: 2
Document type:	Scientific article
Source:	PROGRESS IN PATTERN RECOGNITION, IMAGE ANALYSIS, COMPUTER VISION, AND APPLICATIONS, CIARP 2023, PT I; v. 14469, p. 15-pg., 2024-01-01.
Abstract
Video captioning is a computer vision task that aims to generate a description of video content. This can be achieved using deep learning approaches that leverage image and audio data. In this work, we have developed two strategies to tackle this task in the context of resource-constrained devices: (i) generating one caption per frame combined with audio classification, and (ii) generating one caption for a set of frames combined with audio classification. In these strategies, we have utilized one architecture for the image data and another for the audio data. We have developed an application tailored for resource-constrained devices, where the image sensor captures images at a specific frame rate. The audio data is captured from a microphone for a predefined duration at a time. Our application combines the results from both modalities to create a comprehensive description. The main contribution of this work is the introduction of a new end-to-end application that can utilize the developed strategies and be beneficial for environment monitoring. Our method has been implemented on a low-resource computer, which poses a significant challenge. (AU)
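
The two strategies described in the abstract can be sketched roughly as follows. This is only an illustrative outline, not the authors' implementation: the function names, the captioner callables, the audio label argument, and the output format are all assumptions made for the example, and the real models are replaced by placeholders supplied by the caller.

```python
from typing import Callable, List, Sequence


def describe_per_frame(
    frames: Sequence[object],
    caption_frame: Callable[[object], str],  # hypothetical per-frame captioner
    audio_label: str,                        # hypothetical audio classifier output
) -> str:
    """Strategy (i): one caption per frame, combined with an audio label."""
    captions: List[str] = []
    for frame in frames:
        caption = caption_frame(frame)
        # Skip a caption that merely repeats the previous one (an assumption
        # made here to keep the combined description short).
        if not captions or captions[-1] != caption:
            captions.append(caption)
    return "; ".join(captions) + f" (audio: {audio_label})"


def describe_frame_set(
    frames: Sequence[object],
    caption_frames: Callable[[Sequence[object]], str],  # hypothetical multi-frame captioner
    audio_label: str,
) -> str:
    """Strategy (ii): one caption for a set of frames, combined with an audio label."""
    return f"{caption_frames(frames)} (audio: {audio_label})"
```

In such a sketch, the caller would pass in whatever image and audio models run on the device; the functions only show how the outputs of the two modalities might be merged into a single description.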

FAPESP grant:	15/22308-2 - Intermediate representations in Computational Science for knowledge discovery
Grantee:	Roberto Marcondes Cesar Junior
Support type:	Research Grants - Thematic Grants


FAPESP grant:	22/15304-4 - Learning context-rich representations for computer vision
Grantee:	Nina Sumiko Tomita Hirata
Support type:	Research Grants - Thematic Grants


FAPESP grant:	22/12204-9 - Development of methods for image description: a framework based on computer vision and natural language processing
Grantee:	Rafael Jeferson Pezzuto Damaceno
Support type:	Scholarships in Brazil - Post-Doctoral
