ACM Multimedia 2026 Workshop

JAV-CG: The 1st International Workshop on Joint Audio-Video Comprehension and Generation

A dedicated forum for the next wave of audio-video intelligence, spanning robust multimodal understanding, synchronized generation, and unified models that bridge perception and creation.

Venue: Rio de Janeiro, Brazil
Dates: 10-14 November 2026
Submission: 16 July 2026
Scope: Comprehension, Generation, Unified AV
About

Joint audio-video intelligence as one research problem.

Real-world multimedia is naturally audio-visual, yet many multimodal systems still treat sound as a side channel. JAV-CG focuses on models that listen, see, reason, and generate synchronized multimedia in one coherent framework.

The workshop brings together multimedia, audio and speech, computer vision, and multimodal foundation-model communities to sharpen the research agenda for audio-video comprehension and generation.

Important Dates

Submission timeline and event milestones.

01

Workshop Paper Submission

16 July 2026

Official ACM MM 2026 workshop contribution deadline.

02

Author Notification

06 August 2026

Acceptance notification for workshop submissions.

03

Camera-Ready / Final Metadata

20 August 2026

Final accepted material and metadata due under the ACM MM 2026 workshop schedule.

04

Author Registration

20 August 2026

Registration deadline for accepted workshop contributions.

05

ACM Multimedia 2026

10-14 November 2026

Conference venue: Rio de Janeiro, Brazil.

06

Workshop Day

Coming Soon

The final agenda and exact workshop day will be announced once logistics are confirmed.

Call for Papers

We welcome submissions across understanding, generation, and unified AV modeling.

JAV-CG welcomes archival workshop papers intended for the ACM MM 2026 workshop proceedings, as well as non-archival featured-paper submissions for workshop presentation. Technical, position, and perspective papers may be up to 8 pages plus references.

ACM format · Double blind · English only · Archival + non-archival

Audio-Visual Comprehension

  • Sound source localization and source separation
  • Audio-visual event detection and localization
  • Question answering, grounding, and scene reasoning
  • Trustworthy and long-form audio-visual understanding

Audio-Video Generation

  • Video-to-audio and text-to-audio-video synthesis
  • Audio-driven video generation and talking heads
  • Foley, spatial audio, multimodal editing, and music
  • Controllable synchronized generation across modalities

Unified AV Frameworks

  • Any-to-any multimodal generation involving audio and video
  • Joint tokenization, alignment, and representation learning
  • Unified encoder-decoder or MLLM architectures
  • Benchmarks, datasets, metrics, safety, and evaluation

The submission portal is live on OpenReview.

Best Paper Award

We will present a Best Paper Award to recognize outstanding workshop submissions.

Keynote Speakers

Keynote speaker details below are placeholders and will be updated as speakers are confirmed.

Speaker TBD 01

Affiliation TBD

Title: Talk title TBD
Abstract

Abstract TBD. Final keynote abstract will be added after speaker confirmation.

Speaker Bio

Speaker bio TBD. Official biography and portrait will be added after confirmation.

Speaker TBD 02

Affiliation TBD

Title: Talk title TBD
Abstract

Abstract TBD. Final keynote abstract will be added after speaker confirmation.

Speaker Bio

Speaker bio TBD. Official biography and portrait will be added after confirmation.

Speaker TBD 03

Affiliation TBD

Title: Talk title TBD
Abstract

Abstract TBD. Final keynote abstract will be added after speaker confirmation.

Speaker Bio

Speaker bio TBD. Official biography and portrait will be added after confirmation.

Speaker TBD 04

Affiliation TBD

Title: Talk title TBD
Abstract

Abstract TBD. Final keynote abstract will be added after speaker confirmation.

Speaker Bio

Speaker bio TBD. Official biography and portrait will be added after confirmation.

Speaker TBD 05

Affiliation TBD

Title: Talk title TBD
Abstract

Abstract TBD. Final keynote abstract will be added after speaker confirmation.

Speaker Bio

Speaker bio TBD. Official biography and portrait will be added after confirmation.

Speaker TBD 06

Affiliation TBD

Title: Talk title TBD
Abstract

Abstract TBD. Final keynote abstract will be added after speaker confirmation.

Speaker Bio

Speaker bio TBD. Official biography and portrait will be added after confirmation.

Tentative Schedule

Program schedule is TBD.

Program Block 01

Schedule details TBD

Session        | Duration     | Speaker     | Affiliation
Session TBD 01 | Duration TBD | Speaker TBD | Affiliation TBD
Session TBD 02 | Duration TBD | Speaker TBD | Affiliation TBD
Session TBD 03 | Duration TBD | Speaker TBD | Affiliation TBD
Break TBD      | Duration TBD | -           | -
Session TBD 04 | Duration TBD | Speaker TBD | Affiliation TBD
Session TBD 05 | Duration TBD | Speaker TBD | Affiliation TBD
Program Block 02

Schedule details TBD

Session        | Duration     | Speaker     | Affiliation
Session TBD 06 | Duration TBD | Speaker TBD | Affiliation TBD
Session TBD 07 | Duration TBD | Speaker TBD | Affiliation TBD
Session TBD 08 | Duration TBD | Speaker TBD | Affiliation TBD
Break TBD      | Duration TBD | -           | -
Session TBD 09 | Duration TBD | Speaker TBD | Affiliation TBD
Session TBD 10 | Duration TBD | Speaker TBD | Affiliation TBD
Organizers

An international team spanning multimedia, audio-visual learning, and foundation models.

You Qin

National University of Singapore

Homepage

Shengqiong Wu

University of Oxford

Homepage

Liang Zheng

Australian National University

Homepage

Roger Zimmermann

National University of Singapore

Homepage

Jiebo Luo

University of Rochester

Homepage

Tat-Seng Chua

National University of Singapore

Homepage
Contact

Workshop correspondence