JavisGPT:

A Unified Multi-modal LLM for
Sounding-Video Comprehension and Generation

1ZJU, 2NUS, 3HKUST(GZ), 4RUC, 5UR, 6HZCU, 7NTU, 8SMU, 9USYD, 10ANU
Published as a spotlight paper at NeurIPS 2025 (*Equal Contribution, Correspondence)

TL;DR: We introduce JavisGPT, a multimodal LLM that can understand audiovisual inputs and simultaneously generate synchronized sounding videos in a unified model.
All code, models, and the dataset are coming soon.


Abstract

This paper presents JavisGPT, the first unified multimodal large language model (MLLM) for Joint Audio-Video (JAV) comprehension and generation. JavisGPT adopts a concise encoder-LLM-decoder architecture, featuring a SyncFusion module for spatio-temporal audio-video fusion and synchrony-aware learnable queries to bridge a pretrained JAV-DiT generator. This design enables temporally coherent video-audio understanding and generation from multimodal instructions. We design an effective three-stage training pipeline consisting of multimodal pretraining, audio-video fine-tuning, and large-scale instruction-tuning, to progressively build multimodal comprehension and generation from existing vision-language models. To support this, we further construct JavisInst-Omni, a high-quality instruction dataset with over 200K GPT-4o-curated audio-video-text dialogues that span diverse and multi-level comprehension and generation scenarios. Extensive experiments on JAV comprehension and generation benchmarks show that JavisGPT outperforms existing MLLMs, particularly in complex and temporally synchronized settings.


Technical Description


• JavisGPT Architecture


Figure 1: JavisGPT bridges the MLLM backbone and a downstream JAV-DiT decoder via learnable queries, so as to support understanding and generation for sounding videos in a unified framework.


  • Overall Architecture. We take Qwen2.5-VL as the backbone model to inherit its visual perception ability, and introduce an audio encoder (BEATs) for audio understanding. We introduce a SyncFusion module to explicitly capture the synchrony between the audio and visual streams of input sounding videos, together with learnable JavisQuery tokens that align the LLM's hidden states with the DiT's condition space to guide the generation process.
  • Audiovisual Comprehension. The proposed SyncFusion module injects temporally aligned audio information into the corresponding visual tokens via cross-attention, yielding a fused token sequence in which each token represents a sounding event occurring in a specific patch area at a specific time point.
  • Audiovisual Generation. We adopt and freeze the downstream JAV-DiT model as the decoder of JavisGPT, where a fixed number of learnable tokens gather user context and project the LLM's output embeddings into the condition space of the JAV-DiT model for targeted sounding-video generation (a minimal sketch of both bridging modules follows this list).
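
The sketch below illustrates, in plain PyTorch, how the two bridging pieces described above could be wired: a SyncFusion-style cross-attention that injects time-aligned audio tokens into visual tokens, and a fixed set of learnable query tokens that pool the LLM's hidden states and project them into the generator's condition space. This is a minimal illustration under stated assumptions, not the released implementation; module names, dimensions, the number of heads, and num_queries=64 are assumptions.

import torch
import torch.nn as nn

class SyncFusion(nn.Module):
    # Injects temporally aligned audio features into visual tokens via cross-attention.
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, visual_tokens, audio_tokens):
        # visual_tokens: (B, T*P, D) patch tokens ordered frame by frame
        # audio_tokens:  (B, T*A, D) audio tokens on the same time grid
        fused, _ = self.cross_attn(query=visual_tokens, key=audio_tokens, value=audio_tokens)
        # Residual connection: each visual token keeps "what is visible" and
        # gains "what is sounding" at the corresponding time step.
        return self.norm(visual_tokens + fused)

class JavisQueryBridge(nn.Module):
    # A fixed number of learnable tokens that gather user context from the LLM's
    # hidden states and project it into the condition space of the frozen JAV-DiT.
    def __init__(self, llm_dim, cond_dim, num_queries=64, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, llm_dim) * 0.02)
        self.attn = nn.MultiheadAttention(llm_dim, num_heads, batch_first=True)
        self.proj = nn.Linear(llm_dim, cond_dim)

    def forward(self, llm_hidden):
        # llm_hidden: (B, L, llm_dim) output hidden states of the MLLM backbone
        q = self.queries.unsqueeze(0).expand(llm_hidden.size(0), -1, -1)
        pooled, _ = self.attn(query=q, key=llm_hidden, value=llm_hidden)
        # (B, num_queries, cond_dim): conditions passed to the frozen JAV-DiT decoder
        return self.proj(pooled)

In this setup only the bridge modules (and the LLM, following the staged training recipe summarized in the abstract) would need to learn the mapping into the generator's condition space, while the JAV-DiT decoder itself stays frozen, as described above.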

• JavisInst-Omni Dataset


Figure 2: Large-scale and diversified JavisInst-Omni dataset for instruction tuning on audiovisual understanding and generation.


To fill this gap, we collect the first large-scale instruction-tuning dataset for unified comprehension and generation of sounding videos, covering both single-modal understanding (i.e., audio understanding and visual understanding) and multi-modal interaction tasks. Specifically, we leverage GPT-4o to build a high-quality JavisInst-Und subset that covers diverse scenarios for joint audio-video understanding, along with a JavisInst-Gen subset that supports different contexts and styles for sounding-video generation instructions. The full dataset will be released to facilitate future research in the field.
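
For intuition only, below is a purely hypothetical sketch of what one understanding-style entry and one generation-style entry might look like; the field names, placeholder tokens, and file names are illustrative assumptions and do not reflect the released JavisInst-Omni schema.

# Hypothetical examples, written as Python dicts -- not the actual released data format.
und_example = {
    "subset": "JavisInst-Und",
    "video": "example_clip.mp4",  # sounding video: frames plus its audio track
    "conversations": [
        {"role": "user", "content": "<video> What makes the sound near the end of the clip?"},
        {"role": "assistant", "content": "A door on the right side of the frame slams shut, producing the bang."},
    ],
}

gen_example = {
    "subset": "JavisInst-Gen",
    "conversations": [
        {"role": "user", "content": "Create a short clip of waves crashing on a rocky shore with matching ocean sound."},
        {"role": "assistant", "content": "<generate_jav>"},  # hypothetical placeholder marking a generation turn
    ],
}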

Related Links

Our code is built on the LLaVA-NeXT repository; many thanks for their foundational work. JavisGPT also interacts closely with the downstream JavisDiT generator to support audiovisual generation, and you may refer to this repository for more details.

BibTeX

@inproceedings{liu2025javisgpt,
    title={JavisGPT: A Unified Multi-modal LLM for Sounding-Video Comprehension and Generation},
    author={Kai Liu and Jungang Li and Yuchong Sun and Shengqiong Wu and Jianzhang Gao and Daoan Zhang and Wei Zhang and Sheng Jin and Sicheng Yu and Geng Zhan and Jiayi Ji and Fan Zhou and Liang Zheng and Shuicheng Yan and Hao Fei and Tat-Seng Chua},
    booktitle={The Thirty-ninth Annual Conference on Neural Information Processing Systems},
    year={2025},
}