ACM Multimedia 2026 Workshop

JAV-CG: The 1st International Workshop on Joint Audio-Video Comprehension and Generation

A dedicated forum for the next wave of audio-video intelligence, spanning robust multimodal understanding, synchronized generation, and unified models that bridge perception and creation in one end-to-end framework.

Why a dedicated workshop on joint audio-video intelligence?

JAV-CG starts from the observation that real-world multimedia is inherently audio-visual, yet much current multimodal research still treats audio and video separately or asymmetrically.

Large language models and multimodal foundation models have significantly advanced reasoning over text, images, and video. However, the audio modality remains under-integrated in many systems despite being essential to how humans perceive events, actions, intent, emotion, and context. Sounding videos are not simply vision plus a side channel. They are tightly coupled, temporally grounded, and often semantically inseparable.

JAV-CG aims to build a focused forum for research that treats audio and video as a joint intelligence problem. We are interested in models that can understand sounding scenes, reason over cross-modal correspondences, generate synchronized multimedia outputs, and transfer knowledge between understanding and creation.

The workshop centers on three interconnected directions: audio-visual comprehension and reasoning, audio-video content generation, and unified frameworks that connect both within a single model family. By bringing together researchers from multimedia, audio and speech, computer vision, and MLLM communities, JAV-CG seeks to establish a sharper research agenda for next-generation audio-visual AI systems.

Submission timeline and event milestones

  • Workshop Paper Submission (official): 11 June 2026. Deadline for workshop contributions under the ACM MM 2026 workshop schedule.
  • Author Notification (official): 30 July 2026. Notification for accepted workshop papers.
  • Camera-Ready (official): 6 August 2026. Final camera-ready material due according to the ACM MM 2026 workshop schedule.
  • Author Registration (official): 13 August 2026. Registration deadline for accepted workshop contributions.
  • ACM Multimedia 2026 (official): 10-14 November 2026. Conference venue: Rio de Janeiro, Brazil.
  • Workshop Day and Final Agenda (placeholder): coming soon. The final keynote sequence, contributed paper slots, and exact workshop day will be announced here.

JAV-CG welcomes submissions across understanding, generation, and unified AV modeling

Submission Types

What to submit

Technical, position, or perspective papers may present new methods, research visions, open challenges, or forward-looking arguments related to joint audio-video comprehension and generation.

Featured papers may summarize original publications or accepted papers from major venues that substantially advance audio-visual understanding, generation, or unified frameworks.

Submissions are expected to be written in English, anonymized for double-blind review, and formatted using the current ACM two-column conference template; a minimal template sketch appears below. The proposal specifies a limit of up to 8 pages, with additional pages permitted for references.

At a glance: 8-page limit plus references · ACM format · double-blind · English only
Submission portal link can be inserted here once the workshop submission venue is finalized.
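
For orientation, the snippet below is a minimal, hedged sketch of an anonymized submission preamble using the standard acmart class; the exact class version and options should be confirmed against the ACM MM 2026 author kit once it is released.

    \documentclass[sigconf,review,anonymous]{acmart}

    \title{Paper Title}
    \author{Anonymous Author(s)}  % suppressed while the anonymous option is active
    \affiliation{%
      \institution{Anonymous Institution}
      \country{Anonymous Country}}  % recent acmart versions require a country field

    \begin{document}
    \begin{abstract}
    Abstract text.
    \end{abstract}
    \maketitle
    Body text in the ACM two-column conference format.
    \end{document}

Here sigconf selects the two-column conference layout, while review and anonymous add line numbers and suppress author identities for double-blind reviewing.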
Topics and Themes

Representative areas

The workshop proposal organizes JAV-CG around three topic clusters. The list below is representative rather than exhaustive.

Audio-Visual Comprehension and Reasoning

  • Sound source localization and source separation
  • Audio-visual event detection and localization
  • Audio-visual question answering and scene understanding
  • Spatial, temporal, and cross-modal reasoning
  • Segmentation, grounding, retrieval, and AV trustworthiness

Audio-Video Content Generation

  • Video-to-audio and text-to-audio-video synthesis
  • Audio-driven video generation and talking head animation
  • Foley generation, spatial audio, and AV editing
  • Music-conditioned generation and multimodal storytelling
  • Controllable and semantically coherent AV synthesis

Unified Audio-Video Frameworks

  • Any-to-any multimodal generation involving audio and video
  • Joint tokenization, alignment, and representation learning (one common formulation is sketched after this list)
  • Unified encoder-decoder or MLLM architectures
  • Pre-training, multi-task learning, and instruction tuning
  • Benchmarks, datasets, and evaluation metrics for AV quality
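
To make the alignment theme concrete (flagged in the list above), one widely used formulation is a symmetric contrastive objective over N paired clip embeddings. The symbols are illustrative rather than prescribed by the workshop: a_i and v_i denote audio and video embeddings of the i-th clip, sim a similarity function such as cosine, and \tau a temperature.

    \mathcal{L}_{\mathrm{AV}} = -\frac{1}{2N} \sum_{i=1}^{N} \left[
        \log \frac{\exp(\mathrm{sim}(a_i, v_i)/\tau)}{\sum_{j=1}^{N} \exp(\mathrm{sim}(a_i, v_j)/\tau)}
      + \log \frac{\exp(\mathrm{sim}(a_i, v_i)/\tau)}{\sum_{j=1}^{N} \exp(\mathrm{sim}(a_j, v_i)/\tau)}
    \right]

Minimizing this loss pulls temporally paired audio and video embeddings together and pushes mismatched pairs apart, a common starting point for the joint representation learning and any-to-any generation themes listed above.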

Invited leaders in audio-visual and multimodal AI

Andrew Zisserman (University of Oxford)
Title: Audio-Visual Understanding and Reasoning
Abstract: Placeholder; to be replaced with the final abstract once available. Suggested focus: robust correspondence learning, open-world audio-visual reasoning, long-form scene understanding, and trustworthy AV intelligence.
Speaker bio: Placeholder; to be replaced with the official bio and portrait.

Limin Wang (Nanjing University)
Title: Audio-Visual Generation and Editing
Abstract: Placeholder; to be replaced with the final abstract once available. Suggested focus: synchronized generation pipelines, multimodal editing, and alignment between motion dynamics and sound.
Speaker bio: Placeholder; to be replaced with the official bio and portrait.

Arsha Nagrani (Google DeepMind)
Title: Unified Audio-Visual Understanding and Generation
Abstract: Placeholder; to be replaced with the final abstract once available. Suggested focus: joint representation learning for AV intelligence and the path toward unified foundation models.
Speaker bio: Placeholder; to be replaced with the official bio and portrait.

Kristen Grauman (University of Texas at Austin)
Title: Embodied Audio-Visual Perception and Interaction
Abstract: Placeholder; to be replaced with the final abstract once available. Suggested focus: embodied agents that listen and look jointly, with grounded interaction in real environments.
Speaker bio: Placeholder; to be replaced with the official bio and portrait.

Antonio Torralba (Massachusetts Institute of Technology)
Title: Audio-Visual Learning and Representation
Abstract: Placeholder; to be replaced with the final abstract once available. Suggested focus: representation learning foundations for scalable multimodal perception and transferable AV features.
Speaker bio: Placeholder; to be replaced with the official bio and portrait.

Kai Han (University of Hong Kong)
Title: Joint Audio-Video Generation
Abstract: Placeholder; to be replaced with the final abstract once available. Suggested focus: controllable AV synthesis, coherence across modalities, and benchmarking generation quality.
Speaker bio: Placeholder; to be replaced with the official bio and portrait.

Morning

Invited talks and panel discussion

Session | Duration | Speaker | Affiliation
Opening Address: Audio-Visual AI in the Age of LLMs | 5 min | Tat-Seng Chua | National University of Singapore
Audio-Visual Understanding and Reasoning | 30 min | Andrew Zisserman | University of Oxford
Audio-Visual Generation and Editing | 30 min | Limin Wang | Nanjing University
Coffee Break | 10 min | - | -
Panel Discussion: Universal Audio-Visual Intelligence | 30 min | All speakers | Multi-institution panel
Unified Audio-Visual Understanding and Generation | 30 min | Arsha Nagrani | Google DeepMind
Embodied Audio-Visual Perception and Interaction | 30 min | Kristen Grauman | University of Texas at Austin

Afternoon

Focused invited talks and contributed paper sessions

Session | Duration | Speaker | Affiliation
Audio-Visual Learning and Representation | 15 min | Antonio Torralba | Massachusetts Institute of Technology
Contributed Paper Session I: Audio-Visual Comprehension | 15 min each | TBD | Accepted papers
Joint Audio-Video Generation | 15 min | Kai Han | University of Hong Kong
Contributed Paper Session II: Audio-Video Generation | 15 min each | TBD | Accepted papers
3D Audio-Visual Modeling and Embodiment | 15 min | Baining Guo | Microsoft Research
Contributed Paper Session III: Unified Audio-Visual Models | 15 min each | TBD | Accepted papers

The proposal notes that JAV-CG is intended as a full-day workshop when possible, while retaining flexibility for a half-day adaptation if required by conference logistics.

Organizers

  • You Qin, National University of Singapore
  • Kai Liu, Zhejiang University
  • Hao Fei, University of Oxford
  • Liang Zheng, Australian National University
  • Roger Zimmermann, National University of Singapore
  • Jiebo Luo, University of Rochester
  • Furu Wei, Microsoft Research
  • Tat-Seng Chua, National University of Singapore