JAV-CG: The 1st International Workshop on Joint Audio-Video Comprehension and Generation
A dedicated forum for the next wave of audio-video intelligence, spanning robust multimodal understanding, synchronized generation, and unified models that bridge perception and creation.
Joint audio-video intelligence as one research problem.
Real-world multimedia is naturally audio-visual, yet many multimodal systems still treat sound as a side channel. JAV-CG focuses on models that listen, see, reason, and generate synchronized multimedia in one coherent framework.
The workshop brings together multimedia, audio and speech, computer vision, and multimodal foundation-model communities to sharpen the research agenda for audio-video comprehension and generation.
Submission timeline and event milestones.
Workshop Paper Submission
16 July 2026
Official ACM MM 2026 workshop contribution deadline.
Author Notification
06 August 2026
Acceptance notification for workshop submissions.
Camera-Ready / Final Metadata
20 August 2026
Final accepted material and metadata due under the ACM MM 2026 workshop schedule.
Author Registration
20 August 2026
Registration deadline for accepted workshop contributions.
ACM Multimedia 2026
10-14 November 2026
Conference venue: Rio de Janeiro, Brazil.
Workshop Day
Coming Soon
Final agenda and exact workshop day will be announced once logistics are confirmed.
Submissions are welcome across understanding, generation, and unified AV modeling.
JAV-CG welcomes archival workshop papers intended for the ACM MM 2026 workshop proceedings, as well as non-archival featured-paper submissions for workshop presentation. Technical, position, and perspective papers may be up to 8 pages, plus unlimited pages for references.
Audio-Visual Comprehension
- Sound source localization and source separation
- Audio-visual event detection and localization
- Question answering, grounding, and scene reasoning
- Trustworthy and long-form audio-visual understanding
Audio-Video Generation
- Video-to-audio and text-to-audio-video synthesis
- Audio-driven video generation and talking heads
- Foley, spatial audio, multimodal editing, and music
- Controllable synchronized generation across modalities
Unified AV Frameworks
- Any-to-any multimodal generation involving audio and video
- Joint tokenization, alignment, and representation learning
- Unified encoder-decoder or MLLM architectures
- Benchmarks, datasets, metrics, safety, and evaluation
Submission portal is live on OpenReview.
We will present a Best Paper Award to recognize outstanding workshop submissions.
Keynote speaker details are placeholders.
Speaker TBD 01
Affiliation TBD
Abstract TBD. Final keynote abstract will be added after speaker confirmation.
Speaker bio TBD. Official biography and portrait will be added after confirmation.
Speaker TBD 02
Affiliation TBD
Abstract TBD. Final keynote abstract will be added after speaker confirmation.
Speaker bio TBD. Official biography and portrait will be added after confirmation.
Speaker TBD 03
Affiliation TBD
Abstract TBD. Final keynote abstract will be added after speaker confirmation.
Speaker bio TBD. Official biography and portrait will be added after confirmation.
Speaker TBD 04
Affiliation TBD
Abstract TBD. Final keynote abstract will be added after speaker confirmation.
Speaker bio TBD. Official biography and portrait will be added after confirmation.
Speaker TBD 05
Affiliation TBD
Abstract TBD. Final keynote abstract will be added after speaker confirmation.
Speaker bio TBD. Official biography and portrait will be added after confirmation.
Speaker TBD 06
Affiliation TBD
Abstract TBD. Final keynote abstract will be added after speaker confirmation.
Speaker bio TBD. Official biography and portrait will be added after confirmation.
Program schedule is TBD.
Schedule details TBD
| Session | Duration | Speaker | Affiliation |
|---|---|---|---|
| Session TBD 01 | Duration TBD | Speaker TBD | Affiliation TBD |
| Session TBD 02 | Duration TBD | Speaker TBD | Affiliation TBD |
| Session TBD 03 | Duration TBD | Speaker TBD | Affiliation TBD |
| Break TBD | Duration TBD | - | - |
| Session TBD 04 | Duration TBD | Speaker TBD | Affiliation TBD |
| Session TBD 05 | Duration TBD | Speaker TBD | Affiliation TBD |
Schedule details TBD
| Session | Duration | Speaker | Affiliation |
|---|---|---|---|
| Session TBD 06 | Duration TBD | Speaker TBD | Affiliation TBD |
| Session TBD 07 | Duration TBD | Speaker TBD | Affiliation TBD |
| Session TBD 08 | Duration TBD | Speaker TBD | Affiliation TBD |
| Break TBD | Duration TBD | - | - |
| Session TBD 09 | Duration TBD | Speaker TBD | Affiliation TBD |
| Session TBD 10 | Duration TBD | Speaker TBD | Affiliation TBD |
An international team across multimedia, AV learning, and foundation models.
You Qin
National University of Singapore
Homepage
Kai Liu
Zhejiang University
Homepage
Shengqiong Wu
University of Oxford
Homepage
Wei Ji
Nanjing University
Homepage
Hao Fei
University of Oxford
Homepage
Liang Zheng
Australian National University
Homepage
Roger Zimmermann
National University of Singapore
Homepage
Jiebo Luo
University of Rochester
Homepage
Tat-Seng Chua
National University of Singapore
Homepage
Workshop correspondence
- You Qin qinyou@u.nus.edu
- Kai Liu kail@zju.edu.cn
- Shengqiong Wu shengqiongwu@gmail.com
- Hao Fei haofei7419@gmail.com