JavisDiT Text → Audio-Video Generation
Joint Audio-Video Diffusion Transformer with Hierarchical Spatio-Temporal Prior Synchronization.
A foundation DiT model that generates synchronized video and audio from a text prompt. It learns hierarchical spatio-temporal priors to align motion, scene dynamics, and acoustic events within a single diffusion transformer.