Audio-Visual Emotion Recognition Using a Hybrid Deep Convolutional Neural Network based on Census Transform

Full text
Author(s):
Cornejo, Jadisha Yarif Ramirez ; Pedrini, Helio ; IEEE
Total number of authors: 3
Document type: Scientific article
Source: 2019 IEEE INTERNATIONAL CONFERENCE ON SYSTEMS, MAN AND CYBERNETICS (SMC); 7 pp., 2019-01-01.
Abstract

Over recent years, emotion recognition based on multimodal channels has received increasing attention from the scientific community. Many application fields can benefit from multimodal emotion recognition, such as human-computer interaction, educational software, behavior prediction, and interpersonal relations. Speech and facial expressions are two natural and effective ways of expressing emotion in human-human interaction. In this work, we introduce a hybrid deep convolutional neural network to extract audio and visual features from videos. Initially, to extract audio data, we transform the audio signal into an image representation used as input to a 2D Convolutional Neural Network (CNN). To extract visual data, we introduce a Census Transform (CT)-based CNN. We then fuse the audio and visual features and reduce them through Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA). Finally, K-Nearest Neighbor (K-NN), Support Vector Machine (SVM), Logistic Regression (LR), and Gaussian Naive Bayes (GNB) classifiers are employed for emotion recognition. Experimental results on the RML, eNTERFACE05, and BAUM-1s datasets demonstrated that our model reached competitive recognition rates compared to other state-of-the-art approaches. (AU)
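The Census Transform mentioned in the abstract encodes each pixel by comparing it against its 3x3 neighborhood, producing an 8-bit code that is robust to monotonic illumination changes. The abstract does not give implementation details, so the following is a minimal sketch of the standard transform (bit set when a neighbor is darker than the center), not the authors' exact preprocessing:

```python
import numpy as np

def census_transform(img):
    """Standard 3x3 Census Transform: each interior pixel becomes an
    8-bit code, one bit per neighbor, set when neighbor < center.
    Output is (h-2, w-2) because border pixels lack a full neighborhood."""
    h, w = img.shape
    center = img[1:h - 1, 1:w - 1]
    out = np.zeros((h - 2, w - 2), dtype=np.uint8)
    # The 8 neighbor offsets in row-major order, skipping the center (1, 1).
    offsets = [(dr, dc) for dr in (0, 1, 2) for dc in (0, 1, 2)
               if (dr, dc) != (1, 1)]
    for bit, (dr, dc) in enumerate(offsets):
        neighbor = img[dr:dr + h - 2, dc:dc + w - 2]
        out |= (neighbor < center).astype(np.uint8) << bit
    return out
```

On a uniform image every comparison fails, so every code is 0; on a gradient the codes capture the local intensity ordering, which is what makes CT features attractive as CNN input under varying lighting.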
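The fusion stage the abstract describes (concatenating audio and visual features, reducing with PCA then LDA, then classifying) can be sketched with scikit-learn. The feature dimensions, sample counts, and PCA component count below are illustrative placeholders, not values from the paper, and the synthetic random features stand in for the CNN outputs:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_samples, n_classes = 120, 6                      # e.g. six basic emotions
audio_feats = rng.normal(size=(n_samples, 256))    # placeholder audio CNN features
visual_feats = rng.normal(size=(n_samples, 512))   # placeholder visual CNN features
y = rng.integers(0, n_classes, size=n_samples)

# Feature-level fusion: concatenate the two modalities per sample.
X = np.hstack([audio_feats, visual_feats])

# PCA (unsupervised) followed by LDA (supervised, at most n_classes - 1
# components), then an SVM -- one of the classifiers the paper compares.
clf = make_pipeline(
    PCA(n_components=50),
    LinearDiscriminantAnalysis(n_components=n_classes - 1),
    SVC(kernel="linear"),
)
scores = cross_val_score(clf, X, y, cv=5)
```

With random labels the cross-validation accuracy hovers near chance; the point of the sketch is the pipeline shape, where PCA decorrelates the fused features before LDA projects them onto a class-discriminative subspace.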

FAPESP Process: 17/12646-3 - Déjà vu: temporal, spatial and characterization coherence of heterogeneous data for integrity analysis and interpretation
Grantee: Anderson de Rezende Rocha
Support type: Research Grants - Thematic Grants