JavisDiT: Joint Audio-Video Diffusion Transformer with Hierarchical Spatio-Temporal Prior Synchronization
Conference: ICLR 2026 arXiv: 2503.23377 Project Page: https://javisverse.github.io/JavisDiT-page/ Area: Diffusion Models / Video Generation Keywords: Joint Audio-Video Generation, DiT, Spatio-Temporal Synchronization, Contrastive Learning, Benchmark Dataset
TL;DR
This paper proposes JavisDiT, a joint audio-video generation model built on the DiT architecture. It achieves fine-grained spatio-temporal audio-video alignment via a Hierarchical Spatio-Temporal Synchronization Prior Estimator (HiST-Sypo). The work also introduces a new benchmark, JavisBench (10K complex-scene samples), and a new evaluation metric, JavisScore.
Background & Motivation
Background: Audio and video are naturally coupled in real-world scenarios, making joint audio-video generation (JAVG) valuable for film production and short-video creation.
Limitations of Prior Work:
- Cascaded (asynchronous) approaches, which generate audio first and then synthesize video or vice versa, accumulate errors across stages; end-to-end methods are more promising.
- Existing DiT backbones such as AV-DiT and MM-LDM build on image-based DiTs and struggle to model fine-grained spatio-temporal relationships.
- Current synchronization strategies achieve only coarse temporal alignment (parameter sharing) or semantic alignment (embedding alignment), lacking fine-grained spatial synchronization.
- Existing benchmarks such as AIST++ and Landscape contain overly simple scenes that fail to capture complex, multi-event real-world scenarios.
- The AV-Align metric relies on optical flow and audio onset detection, both of which are unreliable in complex scenes.
Key Challenge: Prior JAVG methods lack a mechanism to jointly model the spatial and temporal dimensions of audio-video synchronization at a fine granularity.
Goal: To develop an end-to-end JAVG framework with hierarchical spatio-temporal prior estimation, alongside a more challenging benchmark and a more robust evaluation metric.
Method
Overall Architecture
JavisDiT comprises parallel video and audio branches built from stacked DiT blocks. Within each block, tokens pass sequentially through: ST-SelfAttn → coarse-grained CrossAttn (T5 semantics) → fine-grained ST-CrossAttn (spatio-temporal prior) → bidirectional CrossAttn (cross-modal fusion).
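A minimal PyTorch sketch of one per-branch block in this order may help; the module names, dimensions, and pre-norm residual layout are illustrative assumptions, not the authors' implementation (the bidirectional step is sketched separately under Key Design 2):

```python
import torch
import torch.nn as nn

class JavisBlockSketch(nn.Module):
    """One per-branch DiT block in the order described above.
    Module names, dims, and the pre-norm residual layout are illustrative."""

    def __init__(self, dim: int = 1152, heads: int = 16):
        super().__init__()
        self.st_self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.text_cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.prior_cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norms = nn.ModuleList(nn.LayerNorm(dim) for _ in range(3))

    def forward(self, x, text_emb, st_prior):
        # 1) spatio-temporal self-attention within the branch
        h = self.norms[0](x)
        x = x + self.st_self_attn(h, h, h)[0]
        # 2) coarse-grained cross-attention over T5 text semantics
        h = self.norms[1](x)
        x = x + self.text_cross_attn(h, text_emb, text_emb)[0]
        # 3) fine-grained cross-attention over the spatio-temporal prior
        h = self.norms[2](x)
        x = x + self.prior_cross_attn(h, st_prior, st_prior)[0]
        # 4) bidirectional audio-video cross-attention couples the two
        #    branches next (see MM-BiCrossAttn below)
        return x
```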
Key Design 1: Hierarchical Spatio-Temporal Synchronization Prior Estimator (HiST-Sypo)
Coarse-grained prior: The semantic embeddings from the T5 encoder are directly reused to describe overall sound events.
Fine-grained prior estimation:
- The 77 token-level hidden states from the ImageBind text encoder serve as input.
- \(N_s = 32\) learnable spatial tokens and \(N_t = 32\) learnable temporal tokens serve as queries.
- A 4-layer Transformer encoder-decoder \(\mathcal{P}\) extracts the spatio-temporal priors.
- Its outputs parameterize a Gaussian distribution, from which stochastic spatio-temporal priors are sampled: \((p_s, p_t) \leftarrow \mathcal{P}_\phi(s; \epsilon)\).
- The estimator is trained via contrastive learning with negative samples (asynchronous audio-video pairs) and a dedicated loss (see Loss & Training).
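A hedged sketch of how such an estimator could look, assuming learnable query tokens, a standard `nn.Transformer` encoder-decoder, and Gaussian reparameterization; beyond the 4 layers and \(N_s = N_t = 32\) stated above, all sizes (e.g., `d_model=1024`, 8 heads) are assumptions:

```python
import torch
import torch.nn as nn

class HiSTSypoSketch(nn.Module):
    """Fine-grained prior estimator sketch: learnable spatial/temporal
    queries decode ImageBind text states via a Transformer encoder-decoder,
    and the outputs parameterize a Gaussian for stochastic prior sampling."""

    def __init__(self, dim: int = 1024, n_s: int = 32, n_t: int = 32):
        super().__init__()
        # 32 spatial + 32 temporal learnable query tokens
        self.queries = nn.Parameter(torch.randn(n_s + n_t, dim))
        self.encdec = nn.Transformer(d_model=dim, nhead=8,
                                     num_encoder_layers=4, num_decoder_layers=4,
                                     batch_first=True)
        self.to_mu = nn.Linear(dim, dim)
        self.to_logvar = nn.Linear(dim, dim)
        self.n_s = n_s

    def forward(self, text_states: torch.Tensor):
        # text_states: (B, 77, dim) hidden states from the ImageBind text encoder
        b = text_states.size(0)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        h = self.encdec(src=text_states, tgt=q)                   # (B, 64, dim)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        prior = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterize
        return prior[:, :self.n_s], prior[:, self.n_s:]           # (p_s, p_t)
```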
Key Design 2: Multi-Modal Bidirectional Cross-Attention (MM-BiCrossAttn)
- An attention matrix \(A = q_v k_a^\top\) is computed from video queries \(q_v\) and audio keys \(k_a\).
- \(A v_a\) yields the audio-to-video attention output.
- \(A^\top v_v\) yields the video-to-audio attention output.
- This bidirectional information flow, obtained from a single similarity computation, enables deep cross-modal fusion.
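A single-head sketch of this shared-matrix bidirectional attention; the per-direction softmax normalization and the omission of projections and multi-head splitting are simplifying assumptions:

```python
import torch

def bi_cross_attention(q_v, k_a, v_a, v_v):
    """MM-BiCrossAttn sketch: one similarity matrix A drives both directions.
    Shapes (single head): q_v/v_v are (B, Lv, D); k_a/v_a are (B, La, D)."""
    scale = q_v.size(-1) ** -0.5
    logits = torch.einsum("bvd,bad->bva", q_v, k_a) * scale    # A: (B, Lv, La)
    # audio -> video: each video token aggregates audio values via A
    a2v = torch.softmax(logits, dim=-1) @ v_a                  # (B, Lv, D)
    # video -> audio: A^T, renormalized over video tokens
    v2a = torch.softmax(logits.transpose(1, 2), dim=-1) @ v_v  # (B, La, D)
    return a2v, v2a
```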
Three-Stage Training Strategy
- Audio pre-training (0.8M audio-text pairs): The audio branch is initialized with weights from OpenSora's video branch.
- ST-Prior training (0.6M synchronized audio-video triplets): The HiST-Sypo estimator is trained.
- JAVG training (0.6M samples): The self-attention and ST-Prior modules are frozen; only ST-CrossAttn and Bi-CrossAttn are trained.
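A sketch of the stage-3 trainable-parameter selection described in the last item; the attribute names (`blocks`, `prior_cross_attn`, `bi_cross_attn`) are hypothetical:

```python
def configure_stage3(model):
    """Freeze everything, then unfreeze only the ST-CrossAttn and
    Bi-CrossAttn modules, per the stage-3 recipe above. Attribute names
    (blocks, prior_cross_attn, bi_cross_attn) are hypothetical."""
    for p in model.parameters():
        p.requires_grad = False
    for block in model.blocks:
        for module in (block.prior_cross_attn, block.bi_cross_attn):
            for p in module.parameters():
                p.requires_grad = True
    # return the trainable subset for the optimizer
    return [p for p in model.parameters() if p.requires_grad]
```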
Loss & Training
- Diffusion denoising loss (Flow Matching or DDPM).
- ST-Prior estimator: contrastive learning loss (synchronized positive samples vs. asynchronous negative samples).
- Dynamic temporal masking supports multiple conditional generation tasks.
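For the contrastive objective, a minimal margin-based sketch; the hinge form, the scalar matching scores, and the margin value are assumptions, and the paper defines its own loss:

```python
import torch.nn.functional as F

def st_prior_contrastive_loss(score_pos, score_neg, margin: float = 1.0):
    """Margin-based sketch of the ST-prior contrastive objective:
    synchronization scores of positive (synchronized) pairs are pushed
    above those of negative (asynchronous) pairs. score_pos/score_neg
    are (B,) scalars from a matching head; the hinge form is an assumption."""
    return F.relu(margin - score_pos + score_neg).mean()
```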
Key Experimental Results
Main Results on JavisBench
| Method | FVD ↓ | FAD ↓ | TV-IB ↑ | AV-IB ↑ | JavisScore ↑ |
|---|---|---|---|---|---|
| TempoToken (T2A→A2V) | 539.8 | – | 0.084 | – | – |
| MM-Diffusion (JAVG) | – | – | – | – | – |
| JavisDiT | **Best** | **Best** | **Best** | **Best** | **Best** |

JavisDiT attains the best score on every metric; dashes mark values omitted from this summary (see the paper for the full table).
JavisBench Dataset Characteristics
| Dimension | # Categories | Description |
|---|---|---|
| Event scene | Multiple | Nature, industrial, indoor, etc. |
| Spatial composition | 2 | Single / multiple sounding subjects |
| Temporal composition | 3 | Single event / sequential / concurrent |
| Total samples | 10,140 | 75% multi-event; 57% concurrent events |
Traditional Benchmarks
JavisDiT also significantly outperforms MM-Diffusion and cascaded baselines on the traditional AIST++ and Landscape benchmarks across the FVD, KVD, and FAD metrics.
Highlights & Insights
- Fine-grained spatio-temporal alignment: The model aligns not only when a sound occurs but also where in the frame it occurs—a spatial dimension overlooked by prior work.
- Stochastic prior sampling: The same text prompt can correspond to different spatio-temporal prior distributions, explicitly modeling uncertainty in the location and timing of events.
- Challenging nature of JavisBench: With 75% of samples containing multiple events and 57% containing concurrent events, the benchmark far exceeds the complexity of existing datasets.
- Robustness of JavisScore: The metric computes ImageBind synchronization scores over sliding windows and averages the least-synchronized 40% of windows, making it more reliable than AV-Align in complex scenes (see the sketch after this list).
- Modular design: Single-modality self-attention blocks are frozen, and only cross-modal modules are trained, yielding parameter efficiency.
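A minimal sketch of a JavisScore-style computation, assuming per-timestep ImageBind embeddings for both modalities; the window size, stride, pooling, and aggregation details are assumptions:

```python
import torch

def javis_score_sketch(v_emb, a_emb, window=8, stride=4, worst_frac=0.4):
    """JavisScore-style sketch: cosine similarity between pooled ImageBind
    video/audio embeddings per sliding window, averaged over the
    least-synchronized 40% of windows. v_emb/a_emb: (T, D) per-timestep
    embeddings; window/stride/pooling choices are assumptions."""
    scores = []
    n = min(v_emb.size(0), a_emb.size(0))
    for start in range(0, n - window + 1, stride):
        v = v_emb[start:start + window].mean(dim=0)  # pooled video window
        a = a_emb[start:start + window].mean(dim=0)  # pooled audio window
        scores.append(torch.cosine_similarity(v, a, dim=0))
    scores = torch.stack(scores).sort().values       # ascending: worst first
    k = max(1, int(worst_frac * len(scores)))
    return scores[:k].mean()
```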
Limitations & Future Work
- Video generation resolution is relatively low (240P/24fps), lagging behind state-of-the-art video generation models.
- The model relies on OpenSora pre-trained weights; feasibility of training from scratch has not been verified.
- ImageBind's joint audio-visual embedding space may lack sufficient granularity in extreme scenarios.
- The choice of \(N_s = 32, N_t = 32\) for the HiST-Sypo estimator has not been thoroughly ablated.
- Controllability of generated audio (e.g., specific instrument timbre) is not discussed.
- Although JavisBench contains 10K samples, extension to more diverse linguistic and cultural contexts is needed.
Related Work & Insights
- MM-Diffusion (Ruan et al.): The first end-to-end JAVG model, using simple parameter sharing for alignment.
- SyncFlow (Liu et al.): Employs STDiT3 blocks but lacks bidirectional information exchange.
- Seeing-Hearing (Xing et al.): Relies on simple embedding alignment without fine-grained spatial information.
- OpenSora (Zheng et al.): The source of pre-trained video branch weights; provides the dynamic temporal masking technique.
- Insight: Audio-video synchronization is fundamentally a conditional consistency problem; the paradigm of hierarchical prior estimation combined with contrastive learning is generalizable to other multi-modal alignment scenarios.
Rating
- Novelty: ⭐⭐⭐⭐ — The fine-grained spatio-temporal prior estimation in HiST-Sypo is genuinely innovative.
- Experimental Thoroughness: ⭐⭐⭐⭐ — New benchmark + new metric + multi-method comparison, though some baselines are not open-sourced.
- Writing Quality: ⭐⭐⭐⭐ — Well-structured with clear figures, though some details are deferred to the appendix.
- Value: ⭐⭐⭐⭐ — JAVG is an important yet immature direction; this paper advances standardization in the field.