The agent acquires a vocabulary of neuro-symbolic concepts for objects, relations, and actions, represented through a ...
Abstract: This paper proposes a single-stage training approach that semantically aligns three modalities (audio, visual, and text) using a contrastive learning framework. Contrastive training has ...
Abstract: Visual grounding aims to use a natural language expression to locate specific objects in an image, marking them with either a bounding box or a segmentation mask. The vision research community has ...