Audio-Visual Emotion Recognition Using a Hybrid Deep Convolutional Neural Network based on Census Transform

Author(s):
Cornejo, Jadisha Yarif Ramirez; Pedrini, Helio; IEEE
Total Authors: 3
Document type: Journal article
Source: 2019 IEEE International Conference on Systems, Man and Cybernetics (SMC); 7 pp., 2019.
Abstract

Over recent years, the recognition of emotions based on multimodal channels has received increasing attention from the scientific community. Many application fields can benefit from multimodal emotion recognition, such as human-computer interaction, educational software, behavior prediction, and interpersonal relations. Speech and facial expressions are two natural and effective ways of expressing emotions in human-human interaction. In this work, we introduce a hybrid deep convolutional neural network to extract audio and visual features from videos. Initially, to extract audio data, we transform the audio signal into an image representation used as input to a 2D Convolutional Neural Network (CNN). To extract visual data, we introduce a Census Transform (CT)-based CNN. Then, we fuse the audio and visual features and reduce their dimensionality through Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA). Finally, K-Nearest Neighbors (K-NN), Support Vector Machine (SVM), Logistic Regression (LR) and Gaussian Naive Bayes (GNB) classifiers are employed for emotion recognition. Experimental results on the RML, eNTERFACE05 and BAUM-1s datasets demonstrate that our model achieves recognition rates competitive with other state-of-the-art approaches. (AU)
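The abstract names the pipeline stages but not their implementation. The Python sketch below is an illustration only, assuming a classic 3x3 Census Transform for the visual branch and a late-fusion stage that concatenates audio and visual feature vectors, reduces them with PCA followed by LDA, and classifies with an SVM (one of the classifiers listed). The feature dimensions, random placeholder features, PCA component count, and SVM kernel are assumptions, not details taken from the paper.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC


def census_transform(img: np.ndarray) -> np.ndarray:
    """Classic 3x3 Census Transform: each pixel becomes an 8-bit code,
    one bit per neighbor, set when that neighbor is darker than the
    center pixel. (The paper's exact CT variant may differ.)"""
    h, w = img.shape
    padded = np.pad(img, 1, mode="edge")
    code = np.zeros((h, w), dtype=np.uint8)
    for dy in range(-1, 2):
        for dx in range(-1, 2):
            if dy == 0 and dx == 0:
                continue  # skip the center pixel itself
            neighbor = padded[1 + dy:1 + dy + h, 1 + dx:1 + dx + w]
            code = (code << 1) | (neighbor < img).astype(np.uint8)
    return code


# Late fusion of precomputed branch features (shapes are hypothetical;
# in the actual system these would come from the trained CNN branches).
rng = np.random.default_rng(0)
n_samples, n_classes = 200, 6
audio_feats = rng.standard_normal((n_samples, 512))   # audio-branch output
visual_feats = rng.standard_normal((n_samples, 512))  # visual-branch output
labels = rng.integers(0, n_classes, n_samples)

fused = np.hstack([audio_feats, visual_feats])  # simple concatenation

# PCA then LDA for dimensionality reduction, then an SVM classifier;
# note LDA can keep at most n_classes - 1 components.
model = make_pipeline(
    PCA(n_components=50),
    LinearDiscriminantAnalysis(n_components=n_classes - 1),
    SVC(kernel="linear"),
)
model.fit(fused, labels)
print("training accuracy:", model.score(fused, labels))

The same fusion stage would accept any of the other classifiers named in the abstract (K-NN, Logistic Regression, Gaussian Naive Bayes) as a drop-in replacement for the SVM step.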

FAPESP's process: 17/12646-3 - Déjà vu: feature-space-time coherence from heterogeneous data for media integrity analytics and interpretation of events
Grantee: Anderson de Rezende Rocha
Support Opportunities: Research Projects - Thematic Grants