JavisVerse: A Universe of Joint Audio-Video Intelligence Symphony

A unified family of audio-video models for multimodal generation and understanding, including text-conditional joint audio-video synthesis (JavisDiT) and unified audiovisual comprehension and generation (JavisGPT).

Flagship Research

JavisDiT Text → Audio-Video Generation

Joint Audio-Video Diffusion Transformer with Hierarchical Spatio-Temporal Prior Sync.

A foundation DiT model that produces synchronized video + sound from text. It learns hierarchical spatio-temporal priors to align motion, scene dynamics, and acoustic events in a single diffusion transformer.

End-to-end text-to-audio-visual synthesis

Explicit spatio-temporal + audio-event synchrony

The first open-source model for joint audio-video generation (JAVG)

📄Paper 💻Code 🎬Demos

JavisDiT++ Quality / Alignment Upgrade

Unified Modeling and Optimization for Joint Audio-Video Generation

An enhanced version of JavisDiT that focuses on perceptual quality, tighter audio-video synchrony, and human preference alignment, while also improving generation efficiency.

A unified model architecture for JAVG

Improved generation quality and efficiency

Better action-sound timing synchronization

📄Paper 💻Code 🎬Demos

JavisGPT Unified Audiovisual MLLM

A Unified LLM for Sounding-Video Comprehension and Generation

A multimodal large language model that both understands sounding video (audio + visual context) and generates new audio-visual experiences. It moves from "describe this clip" to "create the next scene with sound".

Audio-visual reasoning and dialog

Conditioned generation (sound + video) as actions

Towards unified perception ↔ creation systems

📄Paper 💻Code 🗣️Homepage

Workshop

JAV-CG ACMMM 2026 Workshop

The 1st International Workshop on Joint Audio-Video Comprehension and Generation

A focused workshop on unified audio-video intelligence, spanning multimodal comprehension, synchronized generation, and integrated frameworks that bridge both within next-generation MLLMs and generative AI systems.

Audio-visual comprehension: localization, QA, and cross-modal reasoning

Audio-video generation with synchronized and semantically coherent synthesis

Unified AV frameworks connecting understanding and creation in one model

🗓️Workshop Site

Survey

AVI Survey Survey Paper

Audio-Visual Intelligence in Large Foundation Models: A Comprehensive Survey

The first comprehensive review of audio-visual intelligence in the era of large foundation models, consolidating fragmented literature into a unified framework across perception, generation, interaction, and beyond.

Unified taxonomy for AVI tasks across perception, generation, and interaction

Methods, datasets, benchmarks, and evaluation practices in one structured view

Highlights open challenges in synchronization, controllability, safety, and reasoning

📄Paper

Awesome AVI Survey Companion

A curated GitHub repository for papers, datasets, benchmarks, and code in AVI

A living companion resource to the survey that organizes the fast-growing AVI landscape into a practical reading map for researchers working on multimodal understanding, generation, and unified AV systems.

Curated papers, datasets, benchmarks, and open-source implementations

Structured by task families across audio-visual perception and generation

Designed as an evolving companion to the AVI survey and the broader field

💻GitHub