JavisVerse: A Universe of Joint Audio-Video Intelligence Symphony

A unified family of audio-video models for multimodal generation and understanding, including text-conditional joint audio-video synthesis (JavisDiT) and unified audiovisual comprehension and generation (JavisGPT).

JavisDiT: Text → Audio-Video Generation

Joint Audio-Video Diffusion Transformer with Hierarchical Spatio-Temporal Prior Synchronization

A foundation multimodal generator that produces synchronized video + sound from text. It learns hierarchical spatio-temporal priors to align motion, scene dynamics, and acoustic events in a single diffusion transformer.

End-to-end text-to-audio-visual synthesis
Explicit spatio-temporal + audio-event synchrony
The first open-source model for joint audio-video generation (JAVG)