JavisDiT++:

Unified Modeling and Optimization for
Joint Audio-Video Generation

1Zhejiang University, 2National University of Singapore, 3University of Toronto,
4HiThink Research, 5University of Rochester, 6Nanyang Technological University

TL;DR: We introduce JavisDiT++, a concise yet powerful DiT model that generates semantically and temporally aligned sounding videos from textual conditions.


Abstract

Recent AIGC advances have rapidly expanded from text-to-image generation toward high-quality multimodal synthesis across video and audio. Within this context, joint audio-video generation (JAVG) has emerged as a fundamental task that produces synchronized and semantically aligned sound and vision from textual descriptions. However, compared with advanced commercial models such as Veo3, existing open-source methods still suffer from limitations in generation quality, temporal synchrony, and alignment with human preferences. To bridge this gap, this paper presents JavisDiT++, a concise yet powerful framework for efficient and effective JAVG. First, we introduce a modality-specific mixture-of-experts (MS-MoE) design that enables effective cross-modal interaction while enhancing single-modal generation quality. Then, we propose a temporal-aligned RoPE (TA-RoPE) strategy to achieve explicit, frame-level synchronization between audio and video tokens. In addition, we develop an audio-video direct preference optimization (AV-DPO) method to align model outputs with human preferences across quality, consistency, and synchrony dimensions. Built upon Wan2.1-1.3B-T2V, our model achieves state-of-the-art performance with only around 1M publicly available training samples, significantly outperforming prior approaches in both qualitative and quantitative evaluations. Comprehensive ablation studies further validate the effectiveness of the proposed modules.

Method


We propose a unified and efficient DiT architecture that jointly models audio and video tokens via full self-attention, enabling dense cross-modal interaction; modality-specific FFNs then refine intra-modal features. Unlike dual-stream frameworks or dynamic routing rules, our MS-MoE deterministically routes tokens by modality, isolating cross-modal interference while retaining the benefits of expert sparsity. As a result, model capacity expands from 1.3B to 2.1B parameters without increasing inference cost.


Figure 1: Model architecture. We introduce a modality-specific MoE (MS-MoE) design for efficient and effective JAVG.
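For illustration, the minimal PyTorch sketch below shows one way a single MS-MoE block could be organized: shared self-attention over the concatenated audio-video sequence, followed by deterministic dispatch of each token to its modality-specific FFN expert. Module names, dimensions, and the boolean modality mask are illustrative assumptions, not the released implementation.

import torch
import torch.nn as nn

class MSMoEBlock(nn.Module):
    """Sketch of one MS-MoE transformer block: shared self-attention over the
    concatenated audio-video sequence, then deterministic routing of each token
    to its modality-specific FFN expert (no learned router needed)."""

    def __init__(self, dim: int = 1024, num_heads: int = 16, ffn_mult: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # One FFN expert per modality; each token passes through exactly one,
        # so parameter count grows while per-token compute stays the same.
        self.video_ffn = nn.Sequential(
            nn.Linear(dim, dim * ffn_mult), nn.GELU(), nn.Linear(dim * ffn_mult, dim))
        self.audio_ffn = nn.Sequential(
            nn.Linear(dim, dim * ffn_mult), nn.GELU(), nn.Linear(dim * ffn_mult, dim))

    def forward(self, x: torch.Tensor, is_video: torch.Tensor) -> torch.Tensor:
        # x: (B, N, D) concatenated audio+video tokens; is_video: (N,) bool mask.
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)          # dense cross-modal interaction
        x = x + attn_out
        h = self.norm2(x)
        out = torch.empty_like(h)
        out[:, is_video] = self.video_ffn(h[:, is_video])    # intra-modal refinement
        out[:, ~is_video] = self.audio_ffn(h[:, ~is_video])
        return x + out

Because the modality mask is static, no learned gating or load-balancing loss is required, which is what keeps inference cost at the single-expert level despite the larger parameter count.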


We then propose a temporal-aligned rotary position encoding (TA-RoPE) scheme that synchronizes audio and video tokens by assigning them shared time indices while offsetting their spatial dimensions to prevent positional overlap. The alignment is achieved purely through position-ID manipulation, without physically reordering tokens, thereby avoiding the inefficiency of interleaving audio-visual tokens as in autoregressive models.


Figure 2: Illustration of temporal-aligned rotary position encoding (TA-RoPE) for video and audio token synchronization.
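The sketch below illustrates the position-ID bookkeeping that such a scheme implies: video tokens keep their usual (t, h, w) grid coordinates, while audio tokens belonging to frame t reuse the same time index and receive spatial indices shifted by a large offset. The exact layout, offset value, and tokens-per-frame count are assumptions for illustration; only the idea of sharing the time axis via position IDs comes from the method description.

import torch

def build_ta_rope_ids(n_frames: int, vid_h: int, vid_w: int,
                      audio_per_frame: int, spatial_offset: int = 10_000):
    """Minimal sketch (assumed layout) of temporal-aligned position IDs.

    Video tokens get (t, h, w) grid indices; audio tokens that fall inside
    frame t reuse the SAME time index t, but their spatial indices are
    shifted by a large offset so they never collide with video positions.
    Tokens are NOT physically interleaved; only the position IDs encode
    the temporal alignment."""
    # Video position IDs: (T*H*W, 3)
    t, h, w = torch.meshgrid(
        torch.arange(n_frames), torch.arange(vid_h), torch.arange(vid_w),
        indexing="ij")
    video_ids = torch.stack([t, h, w], dim=-1).reshape(-1, 3)

    # Audio position IDs: each frame owns `audio_per_frame` audio tokens that
    # share its time index; spatial axes are offset to avoid overlap.
    at = torch.arange(n_frames).repeat_interleave(audio_per_frame)
    aj = torch.arange(audio_per_frame).repeat(n_frames)
    audio_ids = torch.stack(
        [at, torch.full_like(at, spatial_offset), spatial_offset + aj], dim=-1)

    return video_ids, audio_ids  # fed into a standard 3D RoPE afterwards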


We further propose the AV-DPO technique to improve audio-video quality and synchronization by aligning generation with human preferences. It leverages diverse reward models to rank outputs and construct preference pairs based on modality-aware criteria, ensuring consistent selection across modalities. To our knowledge, this is the first application of preference alignment in the JAVG domain.


Figure 3: Illustration of preference data collection and training pipeline of the proposed AV-DPO technique.
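As a rough sketch of how such a pipeline could be wired up, the snippet below ranks sampled audio-video clips with a weighted combination of quality, consistency, and synchrony rewards to form (chosen, rejected) pairs, and applies a Diffusion-DPO-style objective (Wallace et al., 2023) against a frozen reference model. The aggregation rule, weights, and beta value are illustrative assumptions, not the paper's exact formulation.

import torch
import torch.nn.functional as F

def av_preference_pair(candidates, rewards, weights=(1.0, 1.0, 1.0)):
    """Rank sampled audio-video clips by a weighted sum of
    [quality, consistency, synchrony] reward scores and return the
    best/worst as the (chosen, rejected) pair for preference training."""
    # rewards: (N, 3) tensor of per-candidate reward scores
    scores = (rewards * torch.tensor(weights)).sum(dim=-1)
    best, worst = scores.argmax().item(), scores.argmin().item()
    return candidates[best], candidates[worst]

def diffusion_dpo_loss(err_w, err_l, err_w_ref, err_l_ref, beta=5000.0):
    """Diffusion-DPO-style objective: prefer the sample whose denoising error
    improves more over the frozen reference model. err_* are per-sample MSE
    denoising losses from the policy and reference models; beta is illustrative."""
    margin = (err_w_ref - err_w) - (err_l_ref - err_l)
    return -F.logsigmoid(beta * margin).mean()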


Comparison with SOTAs

Text Prompt: A turtle swims in turquoise water among small fish, with birds chirping in the background.

Ours

UniVerse-1

JavisDiT

Veo3

Text Prompt: A girl in a white headscarf, black top, and red skirt plays the flute beside another on piano.

Ours

UniVerse-1

JavisDiT

Veo3


More JAVG Examples

Several pigeons are gathered on a rocky shore near the water. One pigeon splashes energetically, sending up ripples, while the others stand watching. The sound of splashing water, gurgling, and flapping wings fills the air.

A brown bear is walking towards the camera, growling in a natural setting with greenery in the background.

A young girl plays the piano.

A sports car races around a track bordered by grass and fences. The engine roars through the air.

A small wooden cabin stands in a rainy forest, with misty trees around it and the sound of heavy rain and distant thunder.

A black-and-white image shows suited musicians on stage playing saxophones, trumpets, and a tuba, music filling the air before a curtain backdrop.

A man with long curly hair and a beard plays an electric guitar in a studio, wearing a black t-shirt and gray pants. Behind him are a monitor, speakers, and a poster.

A group of sharks swim gracefully underwater, their fins and tails sending ripples through the clear water. The sound of gurgling and splashing follows their movement.

A large cartoonish alien head with big eyes and a small mouth appears on screen, looking concerned. Behind it, a dusk cityscape with tall buildings and a river sets the scene, as a deep boom echoes, followed by a softer one.

A woman in a red dress walks barefoot across a sandy desert at sunset, her hair flowing as wind blows and faint footsteps sound in the sand.

At night, a narrow alley is lined with traffic cones and a rope barrier, its wet ground reflecting street lamps. Worn walls show faded posters, and a bright light glows at the far end. The sound of rain falls over the empty scene.

A large waterfall cascades down a rocky cliff into a body of water below, creating a dramatic scene. The water is a deep blue color, and there are greenish areas on the rocks near the waterfall. The sound of rushing water fills the air, steady and loud.


BibTeX

@inproceedings{liu2026javisdit++,
  author    = {Kai Liu and Yanhao Zheng and Kai Wang and Shengqiong Wu and Rongjunchen Zhang and Jiebo Luo and Dimitrios Hatzinakos and Ziwei Liu and Tat-Seng Chua and Hao Fei},
  title     = {JavisDiT++: Unified Modeling and Optimization for Joint Audio-Video Generation},
  booktitle = {The Fourteenth International Conference on Learning Representations},
  year      = {2026},
}