Abstract: Pre-trained vision-language models (VLMs) and language models (LMs) have recently garnered significant attention due to their remarkable ability to represent textual concepts, opening up new ...
In vision-language models (VLMs), visual tokens usually incur significant computational overhead despite their lower information density compared to text tokens. To address this, ...