JavisDiT: Text → Audio-Video Generation
Joint Audio-Video Diffusion Transformer with Hierarchical Spatio-Temporal Prior Synchronization
A foundation multimodal generator that produces synchronized video and sound from text. It learns hierarchical spatio-temporal priors to align motion, scene dynamics, and acoustic events within a single diffusion transformer.