Abstract
Visual recognition in image classification, object detection, and semantic segmentation remains a significant challenge in computer vision, particularly when it comes to learning 3D data representations. The advent of advanced deep learning techniques in vision, inspired by breakthroughs in natural language processing, has led to the emergence of a new multimodal paradigm: Vision-Language…