Benchmarking Speech-Driven Gesture Generation Models for Generalization to Unseen Voices and Noisy Environments

Author(s):
Gomez Sanchez, Johsac Isbac; Inofuente Colque, Kevin Adier; de Menezes Martins Marques, Leonardo Boulitreau; Paro Costa, Paula Dornhofer; Tonoli, Rodolfo Luis
Total number of authors: 5
Document type: Scientific article
Source: Companion Publication of the 26th International Conference on Multimodal Interaction (ICMI 2024 Companion); 5 pp.; 2024.
Abstract

Speech-driven gesture generation models enhance robot gestures and control avatars in virtual environments by synchronizing gestures with speech prosody. However, state-of-the-art models are trained on a limited number of speakers, with audio typically recorded in controlled conditions, which can result in poor generalization to new voices and noisy environments. This paper presents a robust evaluation method for speech-driven gesture generation models against unseen voices and varying noise levels. We use a voice conversion model to produce synthetic speech that preserves prosodic features, ensuring a thorough test of the model's generalization capabilities. Additionally, we introduce a controlled synthetic noisy dataset to evaluate model performance under different noise conditions. This methodology establishes a comprehensive framework for robustness evaluation in speech-to-gesture synthesis benchmarks. Applying this approach to the state-of-the-art DiffuseStyleGesture+ model reveals a slight performance degradation with diverse voices and increased background noise. Our findings emphasize the need for models that generalize better to real-world conditions, ensuring reliable performance in varied acoustic scenarios.
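The controlled-noise part of this methodology can be sketched concretely. The snippet below is a minimal illustration, not the authors' implementation: it mixes a noise recording into a speech clip at fixed signal-to-noise ratios (SNR), the usual way to build a synthetic noisy evaluation set. The file names and the SNR levels (20, 10, 0 dB) are hypothetical placeholders.

# A minimal sketch, assuming mono WAV inputs; file names and SNR levels
# are hypothetical placeholders, not taken from the paper.
import numpy as np
import soundfile as sf

def mix_at_snr(speech, noise, snr_db):
    """Mix noise into speech so the mixture has the target SNR in dB."""
    # Loop the noise if it is shorter than the speech, then trim to length.
    if len(noise) < len(speech):
        noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
    noise = noise[:len(speech)]
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    # Scale so that 10*log10(speech_power / (scale**2 * noise_power)) == snr_db.
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    mixture = speech + scale * noise
    # Normalize only if the mixture would clip in a fixed-point format.
    peak = np.max(np.abs(mixture))
    return mixture / peak if peak > 1.0 else mixture

speech, sr = sf.read("speech_clip.wav")      # hypothetical evaluation clip
noise, _ = sf.read("background_noise.wav")   # hypothetical noise recording
for snr in (20, 10, 0):                      # example noise levels in dB
    sf.write(f"speech_snr{snr}dB.wav", mix_at_snr(speech, noise, snr), sr)

Sweeping the SNR from clean to heavily degraded, as in the loop above, yields the graded noise conditions under which a model such as DiffuseStyleGesture+ can be re-evaluated.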

FAPESP Grant: 20/09838-0 - BI0S - Brazilian Institute of Data Science
Grantee: João Marcos Travassos Romano
Support type: Research Grants - Engineering Research Centers Program