JavisGPT:

A Unified Multi-modal LLM for
Sounding-Video Comprehension and Generation

ZJU, NUS, HKUST(GZ), RUC, UR, HZCU, NTU, SMU, USYD, ANU
Published as a spotlight paper at NeurIPS 2025 (*Equal Contribution, Correspondence)

TL;DR: We introduce JavisGPT, a multimodal LLM that can understand audiovisual inputs and simultaneously generate synchronized sounding videos in a unified model.
All code, models, and the dataset are coming soon.


Abstract

This paper presents JavisGPT, the first unified multimodal large language model (MLLM) for Joint Audio-Video (JAV) comprehension and generation. JavisGPT adopts a concise encoder-LLM-decoder architecture, featuring a SyncFusion module for spatio-temporal audio-video fusion and synchrony-aware learnable queries to bridge a pretrained JAV-DiT generator. This design enables temporally coherent video-audio understanding and generation from multimodal instructions. We design an effective three-stage training pipeline consisting of multimodal pretraining, audio-video fine-tuning, and large-scale instruction-tuning, to progressively build multimodal comprehension and generation from existing vision-language models. To support this, we further construct JavisInst-Omni, a high-quality instruction dataset with over 200K GPT-4o-curated audio-video-text dialogues that span diverse and multi-level comprehension and generation scenarios. Extensive experiments on JAV comprehension and generation benchmarks show that JavisGPT outperforms existing MLLMs, particularly in complex and temporally synchronized settings.


Technical Description


• JavisGPT Architecture


Figure 1: JavisGPT bridges the MLLM backbone and a downstream JAV-DiT decoder via learnable queries, supporting understanding and generation of sounding videos in a unified framework.


  • Overall Architecture. We take Qwen2.5-VL as the backbone model to inherit its visual perception ability, and introduce an audio encoder (BEATs) for audio understanding. We devise a SyncFusion module to explicitly capture the synchrony between the audio and visual streams of input sounding videos, along with learnable JavisQuery tokens that align the LLM's hidden states with the DiT's condition space to guide the generation process.
  • Audiovisual Comprehension. The proposed SyncFusion module injects temporally-aligned audio information into the corresponding visual tokens via cross-attention, yielding a fused token sequence in which each token represents a sounding event occurring in a specific patch area at a specific time point (a minimal sketch follows this list).
  • Audiovisual Generation. We adopt and freeze the downstream JAV-DiT model as the decoder for JavisGPT: a fixed number of learnable tokens gather the user context and project the LLM's output embeddings into the condition space of the JAV-DiT model for targeted sounding-video generation (see the second sketch below).
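
To make the fusion step more concrete, below is a minimal PyTorch-style sketch of how an audio-to-vision cross-attention module like SyncFusion could look. The class name, tensor shapes, per-frame grouping of audio tokens, and hyper-parameters are illustrative assumptions, not the actual implementation (which will be released with the code).

import torch
import torch.nn as nn

class SyncFusionSketch(nn.Module):
    """Hedged sketch: inject temporally-aligned audio features into visual
    patch tokens via cross-attention (all sizes are placeholder assumptions)."""

    def __init__(self, d_model: int = 1024, n_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, vis_tokens: torch.Tensor, aud_tokens: torch.Tensor) -> torch.Tensor:
        # vis_tokens: (B, T, P, D) visual patch tokens, T frames x P patches per frame
        # aud_tokens: (B, T, A, D) audio tokens grouped so frame t sees its own audio slice
        B, T, P, D = vis_tokens.shape
        A = aud_tokens.shape[2]
        q = vis_tokens.reshape(B * T, P, D)    # queries: per-frame visual patches
        kv = aud_tokens.reshape(B * T, A, D)   # keys/values: time-aligned audio tokens
        fused, _ = self.cross_attn(q, kv, kv)  # audio -> vision cross-attention
        fused = self.norm(q + fused)           # residual connection + layer norm
        # Each output token now carries the sound occurring at its time step,
        # localized to its spatial patch, as described above.
        return fused.reshape(B, T, P, D)

Using the visual patches as queries keeps the spatial layout of the video tokens intact, so the LLM consumes the same grid of tokens as before, only now enriched with synchronized audio evidence.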

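The bridge to the frozen JAV-DiT decoder can be pictured with a similarly compact sketch: a fixed set of learnable query tokens gathers the instruction context from the LLM's output states and is projected into the DiT condition space. Whether the queries are processed inside the LLM or by a separate attention layer is an implementation detail not spelled out here, so the standalone cross-attention below, as well as the query count and dimensions, are assumptions for illustration only.

import torch
import torch.nn as nn

class JavisQueryBridgeSketch(nn.Module):
    """Hedged sketch of a query-based bridge from LLM hidden states to the
    condition space of a frozen JAV-DiT (all sizes are placeholder values)."""

    def __init__(self, n_queries: int = 64, d_llm: int = 3584, d_cond: int = 1152):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_queries, d_llm) * 0.02)  # learnable tokens
        self.attn = nn.MultiheadAttention(d_llm, num_heads=8, batch_first=True)
        self.proj = nn.Linear(d_llm, d_cond)  # map into the DiT condition space

    def forward(self, llm_hidden: torch.Tensor) -> torch.Tensor:
        # llm_hidden: (B, L, d_llm) LLM output states for the user's instruction
        B = llm_hidden.size(0)
        q = self.queries.unsqueeze(0).expand(B, -1, -1)  # broadcast queries over the batch
        ctx, _ = self.attn(q, llm_hidden, llm_hidden)    # gather user context
        return self.proj(ctx)                            # (B, n_queries, d_cond) conditions

Because the JAV-DiT stays frozen, only the queries and the projection need to learn how to express the user's request in the generator's condition space.
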
• JavisInst-Omni Dataset


Figure 2: Large-scale and diversified JavisInst-Omni dataset for instruction tuning on audiovisual understanding and generation.


To fill this gap, we collect the first large-scale instruction-tuning dataset for unified comprehension and generation of sounding videos, covering both single-modal understanding (i.e., audio understanding and visual understanding) and multi-modal interaction tasks. Specifically, we leverage GPT-4o to build a high-quality JavisInst-Und subset that covers various scenarios for joint audio-video understanding, together with a JavisInst-Gen subset that supports different contexts and styles for sounding-video generation instructions. The full dataset will be released to facilitate future research in the field.
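
Purely as an illustration of the kind of sample such a dataset could contain, here is a hypothetical JavisInst-Und-style dialogue entry in the LLaVA-style conversation format our codebase builds on; the field names, placeholder file name, and example dialogue are our own assumptions, not the released schema.

# Hypothetical JavisInst-Omni-style sample (field names and content are
# illustrative assumptions only; the released schema may differ).
example_entry = {
    "id": "javisinst_und_000001",
    "video": "example_clip.mp4",  # sounding video: visual frames + audio track
    "task": "joint_audio_video_understanding",
    "conversations": [
        {
            "from": "human",
            "value": "<video>\nWhat is making the rhythmic sound, and when does it start?",
        },
        {
            "from": "gpt",
            "value": "A drummer on the left side of the stage starts playing about two seconds into the clip.",
        },
    ],
}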

Demonstrations



• Example-1: Audio Comprehension (content recognition)

User:

Is the sound caused by pressure from/against wood?

JavisGPT:

Yes. The creaking of a wooden door or gate as it is opened and closed.


• Example-2: Video Comprehension (temporal relationship)

User:

What happened after the person took the box?
A. Ate the medicine. B. Tidied up the blanket.
C. Put down the cup/glass/bottle. D. Open the computer.

JavisGPT:

A. Ate the medicine


• Example-3: Audio-Video Comprehension (spatiotemporal relationship)

User:

How many instruments in the room did not sound from beginning to end?
Answer the question using a single word.

JavisGPT:

One


• Example-4: Audio-Video Comprehension (multi-turn dialogues)

User:

Answer the following questions based on this sounding video.
What happened in this video?

JavisGPT:

A man is standing in front of his car with a camera in hand, taking pictures.

User:

Do you have any idea what type of vehicle it is?

JavisGPT:

I cannot tell from the angle.

User:

How many people in the video?

JavisGPT:

One man is visible, but there may be someone behind the camera.

User:

Can you hear any noise?

JavisGPT:

Yes, there is some background noise.


• Example-5: Audio-Video Generation

User:

Make a sounding-video out of this: An aerial view of a beach with white sand and green umbrellas lining the shore. Turquoise waters with waves crashing onto the shore can be seen, and four birds are flying overhead. Several people are on the beach. The sound of wind blowing and water splashing can be heard throughout.

JavisGPT:

Great idea! I will create a sounding-video with the beautiful aerial view of the beach with all the described elements.


• Example-6: Audio-Video Generation-then-Understanding

User:

Make a quick clip with a soundtrack. The machine can be heard running steadily in the background as a close-up shot of a car's engine with a light shining on it, showing detailed components and mechanical parts.

JavisGPT:

Sure! I’ll put together a short clip that captures the steady hum of the machine, paired with a close-up of the car’s engine to showcase its intricate components.

User:

Does the machine producing the steady sound appear in the video?

JavisGPT:

Yes. The engine that produces the steady sound appears in the video.

User:

When is the machine heard running steadily?
A. Before the close-up shot of the car's engine
B. During the close-up shot of the car's engine
C. After the close-up shot of the car's engine
D. Only when the light is turned off

JavisGPT:

B. During the close-up shot of the car's engine


• Example-7: Proactive Audio-Video Generation

User:

Hi JavisGPT. I like dark settings.

JavisGPT:

Wow, dark settings really have such a strong vibe!

User:

Combine narration and images to form a complete video rendition of this text. A large crowd of concert-goers gathered in a setting, raising their hands in the air. The audience appears engaged and excited, with various individuals visible within the crowd. The scene suggests a lively and energetic atmosphere. In the audio, a woman can be heard singing and whooping along with music playing in the background.

JavisGPT:

Sure! I’ll blend the narration with visuals to bring the concert scene to life and capture its vibrant energy in a dark setting.


Related Links

Our code is based on the LLaVA-NeXT repository; many thanks for their foundational work. Audiovisual generation relies on close interaction with the downstream JavisDiT, and you may refer to this repository for more details.

BibTeX

@inproceedings{liu2025javisgpt,
    title={JavisGPT: A Unified Multi-modal LLM for Sounding-Video Comprehension and Generation},
    author={Kai Liu and Jungang Li and Yuchong Sun and Shengqiong Wu and Jianzhang Gao and Daoan Zhang and Wei Zhang and Sheng Jin and Sicheng Yu and Geng Zhan and Jiayi Ji and Fan Zhou and Liang Zheng and Shuicheng Yan and Hao Fei and Tat-Seng Chua},
    booktitle={The Thirty-ninth Annual Conference on Neural Information Processing Systems},
    year={2025},
}