Abstract
Contrastive training on images and texts has been proposed to build multidimensional representation spaces in which related concepts from different modalities lie close together. Some works extend this idea to audio, speech, or ambient sounds by bringing them closer to their descriptions. However, to date, no work in the literature relates concepts across audio, image, and text, or creates environments with more than two types …