JavisGPT: A Unified Multi-modal LLM for Sounding-Video Comprehension and Generation
TL;DR:
We introduce JavisGPT, a multimodal LLM that understands audiovisual inputs and generates synchronized sounding videos within a single unified model.
Code, models, and the dataset are coming soon.
This paper presents JavisGPT, the first unified multimodal large language model (MLLM) for Joint Audio-Video (JAV) comprehension and generation. JavisGPT adopts a concise encoder-LLM-decoder architecture, featuring a SyncFusion module for spatio-temporal audio-video fusion and synchrony-aware learnable queries that bridge a pretrained JAV-DiT generator. This design enables temporally coherent video-audio understanding and generation from multimodal instructions. We design an effective three-stage training pipeline consisting of multimodal pretraining, audio-video fine-tuning, and large-scale instruction tuning, which progressively builds multimodal comprehension and generation capabilities on top of existing vision-language models. To support this, we further construct JavisInst-Omni, a high-quality instruction dataset with over 200K GPT-4o-curated audio-video-text dialogues spanning diverse, multi-level comprehension and generation scenarios. Extensive experiments on JAV comprehension and generation benchmarks show that JavisGPT outperforms existing MLLMs, particularly in complex and temporally synchronized settings.
Figure 1: JavisGPT bridges the MLLM backbone and a downstream JAV-DiT decoder via learnable queries, supporting understanding and generation of sounding videos within a unified framework.
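As described in the abstract and sketched in Figure 1, the model couples multimodal encoders, a SyncFusion fusion step, the MLLM backbone, and synchrony-aware learnable queries that condition a pretrained JAV-DiT generator. The minimal PyTorch sketch below shows one way these pieces could be wired together; the module names follow the paper, but the cross-attention fusion, hidden sizes, query count, and projection into the JAV-DiT condition space are illustrative assumptions rather than the released implementation.

```python
import torch
import torch.nn as nn

class SyncFusion(nn.Module):
    """Hypothetical spatio-temporal audio-video fusion via cross-attention."""
    def __init__(self, dim=1024, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, video_tokens, audio_tokens):
        # Video tokens attend to audio tokens; the residual keeps the visual stream intact.
        fused, _ = self.attn(video_tokens, audio_tokens, audio_tokens)
        return self.norm(video_tokens + fused)

class JavisGPTSketch(nn.Module):
    """Encoder-LLM-decoder skeleton with synchrony-aware learnable queries (illustrative only)."""
    def __init__(self, dim=1024, num_queries=64):
        super().__init__()
        self.sync_fusion = SyncFusion(dim)
        # Small Transformer as a stand-in for the pretrained MLLM backbone.
        self.llm = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True), num_layers=2)
        self.gen_queries = nn.Parameter(torch.randn(1, num_queries, dim))
        self.to_dit_cond = nn.Linear(dim, dim)  # projection into the JAV-DiT condition space

    def forward(self, video_tokens, audio_tokens, text_tokens):
        fused = self.sync_fusion(video_tokens, audio_tokens)
        # Learnable queries are appended so the backbone can write generation intent into them.
        queries = self.gen_queries.expand(text_tokens.size(0), -1, -1)
        hidden = self.llm(torch.cat([fused, text_tokens, queries], dim=1))
        # The last `num_queries` hidden states would condition the pretrained JAV-DiT generator.
        return self.to_dit_cond(hidden[:, -queries.size(1):])

# Toy usage with random features in place of real encoder outputs.
model = JavisGPTSketch()
cond = model(torch.randn(2, 32, 1024), torch.randn(2, 16, 1024), torch.randn(2, 8, 1024))
print(cond.shape)  # torch.Size([2, 64, 1024])
```

In the actual system, the Transformer stand-in would be the pretrained MLLM backbone, and the query outputs would be passed to JavisDiT as generation conditions.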
Figure 2: The large-scale and diverse JavisInst-Omni dataset for instruction tuning on audio-visual understanding and generation.
To fill this gap, we collect the first large-scale instruction-tuning dataset for unified comprehension and generation of sounding videos, covering both single-modal understanding (i.e., audio understanding and visual understanding) and multi-modal interaction tasks. Specifically, we leverage GPT-4o to build a high-quality JavisInst-Und subset that covers various scenarios for joint audio-video understanding, along with a JavisInst-Gen subset that supports different contexts and styles for sounding-video generation instructions. The full dataset will be released to facilitate future research in the field.
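For illustration only, the snippet below sketches how a JavisInst-Und and a JavisInst-Gen record might be laid out; every field name, placeholder media path, special token, and dialogue line is hypothetical and may differ from the released data.

```python
# Hypothetical record layouts for the two subsets; all content below is made up for illustration.
und_example = {
    "id": "javisinst_und_000001",
    "video": "videos/placeholder.mp4",   # placeholder media path
    "audio": "audios/placeholder.wav",   # placeholder media path
    "conversations": [
        {"role": "user",
         "content": "<video><audio> What is making the sound, and when does it start?"},
        {"role": "assistant",
         "content": "An off-screen dog is barking; the barking starts midway through the clip."},
    ],
}

gen_example = {
    "id": "javisinst_gen_000001",
    "conversations": [
        {"role": "user",
         "content": "Generate a sounding video of waves crashing on a rocky shore at sunset."},
        # Placeholder tokens standing in for whatever signal triggers the JAV-DiT decoder.
        {"role": "assistant", "content": "<gen_video><gen_audio>"},
    ],
}
```

In practice such records could be stored as JSON lines and mixed during the instruction-tuning stage, though the released format may differ.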
Our code is based on the LLaVA-NeXT repository; many thanks for their foundational work. Audiovisual generation also requires close interaction with the downstream JavisDiT model; please refer to its repository for more details.
@inproceedings{liu2025javisgpt,
title={JavisGPT: A Unified Multi-modal LLM for Sounding-Video Comprehension and Generation},
author={Kai Liu and Jungang Li and Yuchong Sun and Shengqiong Wu and Jianzhang Gao and Daoan Zhang and Wei Zhang and Sheng Jin and Sicheng Yu and Geng Zhan and Jiayi Ji and Fan Zhou and Liang Zheng and Shuicheng Yan and Hao Fei and Tat-Seng Chua},
booktitle={The Thirty-ninth Annual Conference on Neural Information Processing Systems},
year={2025},
}