Abstract: In many robotic applications, especially those involving humans and the environment, linguistic and visual information must be processed jointly and bound together. Existing works either ...