Abstract
Humans constantly deal with multimodal information, that is, data sets of different modalities, such as text and image. For machines to process information similarly to humans, they must be able to process multimodal data and understand the joint relationship between these modalities, not just text or image in isolation, for example. This multimodal aspect of learning can be very useful i…