Sound events detection and localization improvements by using Gammatone filters and temporal convolutional neural networks

Karen Gissell Rosero Jacome

Author(s):	Karen Gissell Rosero Jacome
Document type:	Master's Dissertation
Press:	Campinas, SP.
Institution:	Universidade Estadual de Campinas (UNICAMP). Faculdade de Engenharia Elétrica e de Computação
Defense date:	2022-03-28
Examining board members:	Bruno Sanches Masiero; Luiz Wagner Pereira Biscainho; Tiago Fernandes Tavares
Advisor:	Felipe Leonel Grijalva Arévalo; Bruno Sanches Masiero
Abstract
The human auditory system extracts meaning from sound, helping us identify and localize sounds in an acoustic environment. The development of computational methods inspired by human capacities and behaviors has created opportunities for improving machine hearing. Recent studies based on deep learning show that the use of convolutional neural networks (CNN) and recurrent neural networks (RNN) is a promising approach to the sound event localization and detection (SELD) task. Although the performance of these systems is, depending on the sound environment, still far from perfect, in some aspects they have already surpassed human performance. This project therefore aims to boost the performance of the studied SELD systems by improving different stages of the process. We propose Gammatone auditory filters for the acoustic feature extraction stage, and we add a temporal convolutional network (TCN) alongside the CNN and RNN layers as an improvement to the traditional SELD architecture. The system supports the detection and localization of up to three sound events, which may or may not belong to the same class. Furthermore, because the datasets contain a limited number of audio samples, we also explore suitable data augmentation techniques. The system is evaluated on databases that represent environments with different levels of difficulty. The results show that Gammatone filters are a good alternative for modifying the linear frequency resolution of the spectrogram, since they model the tonotopic distribution produced in the cochlea. Regarding the network architecture, the TCN block captures long-term dependencies in the data, yielding a deeper feature extraction with a greater number of trainable parameters, without adding much complexity to the system architecture.
Lastly, the data augmentation techniques that showed the best results were frequency masking, random magnitude, and swapping of Ambisonics channels. The proposed system surpassed all the state-of-the-art metrics obtained for four different datasets, maintaining acceptable performance in reverberant environments and in audio scenes with multiple sound sources, and almost perfect performance in an anechoic environment. (AU)
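
To illustrate how Gammatone filters can replace the linear frequency axis of a spectrogram with a cochlea-like one, the following is a minimal sketch and not the dissertation's implementation, whose details are not given in this record. It assumes the Glasberg and Moore ERB formulas, a fourth-order Gammatone magnitude approximation, and hypothetical function names and parameter values (64 bands, 50 Hz lower edge, 24 kHz sample rate in the usage below).

```python
import numpy as np

def erb_bandwidth(fc_hz):
    # Equivalent rectangular bandwidth in Hz (Glasberg and Moore model).
    return 24.7 * (4.37 * fc_hz / 1000.0 + 1.0)

def erb_space(f_low, f_high, n_bands):
    # Center frequencies spaced uniformly on the ERB-rate scale.
    def to_erb_rate(f):
        return 21.4 * np.log10(1.0 + 0.00437 * f)
    def from_erb_rate(e):
        return (10.0 ** (e / 21.4) - 1.0) / 0.00437
    return from_erb_rate(np.linspace(to_erb_rate(f_low), to_erb_rate(f_high), n_bands))

def gammatone_weights(n_fft, sample_rate, n_bands=64, f_low=50.0, order=4):
    # Frequency-domain approximation of the Gammatone power response,
    # sampled at the FFT bin frequencies. All defaults are assumptions.
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sample_rate)
    centers = erb_space(f_low, sample_rate / 2.0, n_bands)
    b = 1.019 * erb_bandwidth(centers)
    detune = (freqs[None, :] - centers[:, None]) / b[:, None]
    weights = (1.0 + detune ** 2) ** (-order)
    # Normalize each band so its weights sum to one.
    weights /= weights.sum(axis=1, keepdims=True)
    return centers, weights

def gammatone_spectrogram(power_spec, weights):
    # power_spec: (n_fft // 2 + 1, n_frames) linear-frequency power spectrogram.
    return weights @ power_spec

# Usage: map a 513-bin spectrogram to 64 Gammatone bands.
centers, weights = gammatone_weights(n_fft=1024, sample_rate=24000)
spec = np.random.default_rng(0).random((513, 10))
gt_spec = gammatone_spectrogram(spec, weights)
```

In this sketch the band centers are dense at low frequencies and sparse at high frequencies, which is the non-linear resolution the abstract attributes to the cochlea's tonotopic distribution.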

FAPESP's process:	19/22945-3 - 3D audio: spatial audio acquisition, coding, and reproduction
Grantee:	Karen Gissell Rosero Jácome
Support Opportunities:	Scholarships in Brazil - Master
