JavisVerse: A Universe of Joint Audio-Video Intelligence Symphony

A unified family of audio-video models for multimodal generation and understanding, including text-conditional joint audio-video synthesis (JavisDiT) and unified audiovisual comprehension and generation (JavisGPT).

JavisDiT Text → Audio-Video Generation

Joint Audio-Video Diffusion Transformer with Hierarchical Spatio-Temporal Prior Sync.

A foundation DiT model that produces synchronized video + sound from text. It learns hierarchical spatio-temporal priors to align motion, scene dynamics, and acoustic events in a single diffusion transformer.

End-to-end text-to-audio-visual synthesis
Explicit spatio-temporal + audio-event synchrony
The first open-source model for joint audio-video generation (JAVG)

JavisGPT Unified Audiovisual MLLM

A Unified LLM for Sounding-Video Comprehension and Generation

A multimodal large language model that both understands sounding video (audio + visual context) and generates new audio-visual experiences. It moves from "describe this clip" to "create the next scene with sound".

Audio-visual reasoning and dialog
Conditioned generation (sound + video) as actions
Towards unified perception ↔ creation systems