JAV-CG is motivated by the observation that real-world multimedia is naturally audio-visual, yet much current multimodal research still treats audio and video separately or asymmetrically.
Large language models and multimodal foundation models have significantly advanced reasoning over text, images, and video. However, audio remains under-integrated in many systems despite being essential to how humans perceive events, actions, intent, emotion, and context. Sounding videos are not simply vision plus a side channel: their audio and visual streams are tightly coupled, temporally grounded, and often semantically inseparable.
JAV-CG aims to provide a focused forum for research that treats audio and video as a joint intelligence problem. We are interested in models that can understand sounding scenes, reason over cross-modal correspondences, generate synchronized multimedia outputs, and transfer knowledge between understanding and creation.
The workshop centers on three interconnected directions: audio-visual comprehension and reasoning, audio-video content generation, and unified frameworks that connect both within a single model family. By bringing together researchers from the multimedia, audio and speech, computer vision, and multimodal LLM (MLLM) communities, JAV-CG seeks to establish a sharper research agenda for next-generation audio-visual AI systems.