PrismAudio: Decomposed Chain-of-Thought and Multi-dimensional Rewards for Video-to-Audio Generation
Conference: ICLR 2026
OpenReview: https://openreview.net/forum?id=cIfDKEbAky
Code: Project Page https://PrismAudio.github.io (Code TBD)
Area: Video-to-Audio Generation / Multimodal
Keywords: Video foley, Decomposed Chain-of-Thought, Multi-dimensional rewards, GRPO, Flow matching diffusion
TL;DR
PrismAudio decomposes video-to-audio (V2A) generation into four specialized Chains-of-Thought (CoTs): semantic, temporal, aesthetic, and spatial. Each CoT is paired with a matching reward function, and the model is post-trained with multi-dimensional reinforcement learning using Fast-GRPO. It reports state-of-the-art results across all four perceptual dimensions on VGGSound and on AudioCanvas, a benchmark built by the authors, with fewer parameters and faster inference than prior models.
Background & Motivation
Background: Video-to-audio generation (V2A, also known as video foley) synthesizes sound from silent video, optionally guided by text. Good foley must satisfy four human perceptual dimensions at once: semantic consistency (sound matches objects/events), temporal synchronization (sound onset/rhythm matches action), aesthetic quality (natural, high production value), and spatial accuracy (left/right panning matches visual position). Mainstream methods have evolved from early visual-only conditioning (Diff-Foley, V2A-Mapper) to explicit text conditioning (MMAudio, MovieGen Audio). Recently, ThinkSound introduced CoT reasoning from multimodal LLMs to perform structured "audio planning" before rendering, which improves interpretability.
Limitations of Prior Work: The authors identify three critical flaws in existing methods (especially ThinkSound). First is monolithic planning—all audio analysis is generated in a single reasoning path, conflating semantic, temporal, spatial, and aesthetic tasks, which often leads to multi-modal hallucinations or neglect of specific dimensions in complex scenes. Second is objective entanglement—competing perceptual goals are compressed into a unified reconstruction loss, preventing the model from learning context-dependent trade-offs and degrading into signal-level reconstruction. Third is lack of human preference alignment—training solely on text matching lacks a mechanism to learn what "sounds satisfying to a human ear," resulting in technically correct but perceptually mediocre outputs.
Key Challenge: The four perceptual objectives are interdependent yet mutually restrictive. For instance, focusing solely on semantic consistency might produce a matching but dull sound (poor aesthetics), or a correct sound type that is out of sync. A single loss or reward cannot find the optimal balance between these competing goals.
Goal: To enable the model to generate high-quality reasoning and optimize human preferences across all four dimensions simultaneously, rather than improving one at the expense of others.
Key Insight: Different perceptual dimensions require fundamentally different analytical frameworks (semantics rely on content identification, spatiality on localization logic, aesthetics on subjective quality). Therefore, they should not be mixed in a single reasoning path. By decomposing reasoning and aligning each part with a specialized reward signal, multi-dimensional reinforcement learning can optimize both "reasoning" and "preference" together.
Core Idea: Replace "monolithic reasoning + single reconstruction loss" with "decomposed CoT + dimension-aligned multi-rewards + efficient Fast-GRPO," allowing each dimension to be managed independently while jointly aligning with human preferences.
Method
Overall Architecture
PrismAudio is built in three steps. First, an audio foundation model is upgraded so that it can be conditioned on structured reasoning text. Second, monolithic reasoning is decomposed into four specialized CoTs (semantic, temporal, aesthetic, spatial): training data is generated with Gemini 2.5 Pro and used to fine-tune VideoLLaMA2. Third, the model is post-trained with Fast-GRPO for multi-dimensional RL: each CoT is mapped to a corresponding reward function, candidate audios are sampled, multi-dimensional weighted advantages are computed via group normalization, and the policy is updated using efficient hybrid ODE-SDE sampling.
%%{init: {'flowchart': {'rankSpacing': 24, 'nodeSpacing': 28, 'padding': 6, 'wrappingWidth': 400}}}%%
flowchart TD
A["Silent Video<br/>(+ Optional Text)"] --> B["CoT-aware Audio Base Model<br/>VideoPrism + T5-Gemma"]
A --> C["Four Decomposed CoTs<br/>Semantic/Temporal/Aesthetic/Spatial"]
C -->|Structured Text Condition| B
B --> D["Sample Candidate Audios"]
D --> E["Multi-dimensional Rewards<br/>One per CoT"]
E -->|Group Norm Advantage| F["Fast-GRPO<br/>Hybrid ODE-SDE + Random Window"]
F -->|Update Policy| B
B --> G["Output Stereo Audio"]
Key Designs
1. CoT-aware Audio Base Model: Replacing Encoders That Cannot Understand Video or Reasoning
Models like ThinkSound use backbones based on multimodal diffusion Transformers (MM-DiT) and flow matching, but two bottlenecks hinder multi-dimensional reasoning: insufficient video understanding, and text encoders that cannot process structured CoTs. PrismAudio makes two targeted replacements. On the video side, it replaces per-frame CLIP encoders with VideoPrism, a unified ViT encoder pre-trained on massive video datasets, which captures the temporal semantics needed for multi-dimensional reasoning. On the text side, it upgrades T5 to T5-Gemma, which distills the reasoning capability of decoder-only LLMs into an encoder-decoder architecture, so that generation can be conditioned on CoT text containing logical structure and causal relationships.
2. Multi-dimensional CoT Decomposition: Splitting Monolithic Planning
This step addresses the "monolithic planning" issue. The authors use Gemini 2.5 Pro to construct high-quality CoT training data and fine-tune VideoLLaMA2 to generate four independent reasoning paths for a video: Semantic CoT identifies audio events and their characteristics; Temporal CoT determines rhythm and timing; Aesthetic CoT focuses on sound quality (clarity, reverb, loudness); Spatial CoT analyzes positioning in the stereo image. The four CoTs are concatenated into one structured prompt. In the AudioCanvas ablation, this decomposition outperforms monolithic CoT most clearly on the semantic (CLAP 0.52 vs 0.46) and aesthetic (CE 4.26 vs 3.79) metrics.
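The concatenation step can be pictured with a small sketch. The section tags, the fixed ordering, and the function name `build_structured_prompt` are illustrative assumptions; the summary does not give the actual prompt template used by the authors.

```python
# Hypothetical section order and tags; not the paper's actual template.
COT_DIMENSIONS = ("semantic", "temporal", "aesthetic", "spatial")


def build_structured_prompt(cots: dict[str, str]) -> str:
    """Concatenate the four per-dimension CoTs into one structured text condition."""
    missing = [d for d in COT_DIMENSIONS if d not in cots]
    if missing:
        raise ValueError(f"missing CoT dimensions: {missing}")
    # One tagged section per dimension, always in the same order.
    sections = [f"[{d.upper()}]\n{cots[d].strip()}" for d in COT_DIMENSIONS]
    return "\n\n".join(sections)


# Made-up example reasoning text, for illustration only.
example_cots = {
    "semantic": "A dog barks twice, then a door slams.",
    "temporal": "Barks at about 0.5 s and 1.2 s; door slam near 3.0 s.",
    "aesthetic": "Clear, dry indoor recording with moderate loudness.",
    "spatial": "Dog on the left of the frame; door on the right.",
}
prompt = build_structured_prompt(example_cots)
```

Keeping one tagged section per dimension makes it easy to pair each part of the prompt with the reward that evaluates it.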
3. Multi-dimensional Reward Functions: Specialized Evaluators for Each CoT
To solve "objective entanglement," the authors design a reward for each dimension using established specialized models: Semantic Reward uses MS-CLAP for audio-text alignment; Temporal Reward uses Synchformer for sync detection; Aesthetic Reward uses Meta Audiobox Aesthetics (no-reference MOS prediction); Spatial Reward uses StereoCRW for localization accuracy. These \(K=4\) reward heads \(\{R_k\}\) correspond directly to the four CoTs. This CoT-Reward Correspondence is key to joint improvement across all dimensions.
4. Fast-GRPO: Efficient Multi-dimensional RL via Random Windows
Applying GRPO to flow matching models is computationally expensive because the deterministic ODE must be converted to a stochastic SDE for RL optimization. Standard Flow-GRPO performs SDE sampling at every denoising step, incurring massive overhead. Fast-GRPO restricts stochasticity and optimization to a small, computationally cheap segment. It uses a Hybrid ODE-SDE Sampler: for each iteration, a starting point \(\ell\) and a small window \(w \ll T\) are randomly sampled. Outside the window \(W(\ell)\), the model uses deterministic ODE steps; inside the window, it uses noisy SDE steps. The SDE steps induce an analytical Gaussian policy \(\pi_\theta(x_{t+1}\mid x_t,c)\), allowing a closed-form solution for the GRPO policy ratio \(r_t(\theta)\). Random Window Scheduling ensures the entire trajectory is explored over multiple iterations while reducing the number of function evaluations (NFE) from \(T\) to \(w\).
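A minimal sketch of the hybrid sampler with random window scheduling follows. It is a toy under stated assumptions: the velocity field, the constant noise scale `sigma`, and all function names are hypothetical, and the SDE step simply adds Gaussian noise to the Euler update rather than reproducing the exact SDE used by Flow-GRPO or the paper.

```python
import numpy as np


def sample_window(num_steps: int, window: int, rng: np.random.Generator) -> range:
    """Pick a random start l and return the step indices W(l) = [l, l + w)."""
    start = int(rng.integers(0, num_steps - window + 1))
    return range(start, start + window)


def hybrid_rollout(x0, velocity, num_steps, window, sigma, rng):
    """Euler ODE steps outside the window, noisy (SDE-like) steps inside it.

    Returns the final sample, the window, and (t, mean, std, x_next) per SDE step.
    """
    dt = 1.0 / num_steps
    win = sample_window(num_steps, window, rng)
    x = np.asarray(x0, dtype=float)
    sde_records = []
    for t in range(num_steps):
        mean = x + velocity(x, t * dt) * dt  # deterministic Euler update
        if t in win:
            # Assumed simplification: isotropic Gaussian noise around the Euler mean.
            std = sigma * np.sqrt(dt)
            x_next = mean + std * rng.standard_normal(x.shape)
            sde_records.append((t, mean, std, x_next))
            x = x_next
        else:
            x = mean
    return x, win, sde_records


def gaussian_log_prob(x_next, mean, std):
    """Log-density of N(mean, std^2 I) at x_next (the per-step policy log-prob)."""
    d = x_next.size
    sq = np.sum(((x_next - mean) / std) ** 2)
    return float(-0.5 * sq - d * np.log(std) - 0.5 * d * np.log(2 * np.pi))


rng = np.random.default_rng(0)
toy_velocity = lambda x, t: -x  # hypothetical stand-in for the learned velocity field
x_final, win, records = hybrid_rollout(
    np.ones(4), toy_velocity, num_steps=24, window=4, sigma=0.5, rng=rng
)
```

Because each step inside the window is Gaussian, the policy ratio for that step is the exponential of the difference between the log-probabilities under the new and old parameters, and only the `window` recorded steps enter the policy update.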
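For multi-dimensional optimization, a weighted total reward is calculated for each of the \(N\) candidates in a group: \(R_{\text{total}}^i=\sum_{k=1}^{K}\lambda_k R_k(x_T^i,c)\). The advantage \(A_i\) is then derived via group normalization, which in its standard GRPO form is:
\[A_i=\frac{R_{\text{total}}^i-\operatorname{mean}\big(\{R_{\text{total}}^j\}_{j=1}^{N}\big)}{\operatorname{std}\big(\{R_{\text{total}}^j\}_{j=1}^{N}\big)}\]
The final objective optimizes the clipped GRPO loss restricted to the window \(W(\ell)\). Written in the standard clipped form, with \(r_t^i(\theta)\) the policy ratio of candidate \(i\) at step \(t\) and \(\epsilon\) the clipping range (any additional regularization term is omitted here):
\[\mathcal{J}(\theta)=\mathbb{E}\left[\frac{1}{N}\sum_{i=1}^{N}\frac{1}{w}\sum_{t\in W(\ell)}\min\Big(r_t^i(\theta)A_i,\ \operatorname{clip}\big(r_t^i(\theta),1-\epsilon,1+\epsilon\big)A_i\Big)\right]\]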
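The weighted reward, the group-normalized advantage, and the clipped objective over the window can be sketched numerically as below. This assumes the common GRPO conventions (mean/std normalization within the group, PPO-style clipping with range `clip_eps`) and reward scores oriented so that higher is better; the reward values, equal weights, and function names are made up for illustration.

```python
import numpy as np


def weighted_total_reward(rewards: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """rewards: (N, K) scores per candidate and dimension; weights: (K,) lambdas."""
    return rewards @ weights


def group_normalized_advantage(total: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Subtract the group mean and divide by the group standard deviation."""
    return (total - total.mean()) / (total.std() + eps)


def clipped_window_objective(ratios, advantages, clip_eps=0.2):
    """ratios: (N, w) policy ratios for the steps in W(l); advantages: (N,)."""
    adv = advantages[:, None]  # the same advantage applies to every step of a candidate
    unclipped = ratios * adv
    clipped = np.clip(ratios, 1.0 - clip_eps, 1.0 + clip_eps) * adv
    return float(np.minimum(unclipped, clipped).mean())


# Illustrative numbers only: N=4 candidates, K=4 reward dimensions.
rewards = np.array([
    [0.5, 0.6, 0.7, 0.4],
    [0.4, 0.5, 0.6, 0.3],
    [0.6, 0.7, 0.5, 0.5],
    [0.3, 0.4, 0.8, 0.6],
])
weights = np.full(4, 0.25)  # hypothetical equal lambdas
total = weighted_total_reward(rewards, weights)
adv = group_normalized_advantage(total)
objective = clipped_window_objective(np.ones((4, 3)), adv)
```

With all ratios equal to 1 (the first update after sampling), the objective value is just the mean advantage, which is zero by construction; the learning signal comes from its gradient with respect to the policy parameters.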
Key Experimental Results
Main Results
On the VGGSound test set, PrismAudio (518M) outperforms all baselines with fewer parameters and faster inference (0.63s):
| Method | Params | CLAP↑ (Sem) | DeSync↓ (Temp) | CRW↓ (Spat) | MOS-Q↑ | MOS-C↑ | Time(s)↓ |
|---|---|---|---|---|---|---|---|
| MMAudio | 1.03B | 0.40 | 0.46 | - | 3.95 | 4.03 | 1.30 |
| ThinkSound (Prev. SOTA) | 1.3B | 0.43 | 0.55 | 13.47 | 4.05 | 4.18 | 1.07 |
| Ours (PrismAudio) | 518M | 0.47 | 0.41 | 7.72 | 4.21 | 4.22 | 0.63 |
| Ours w/o CoT-RL | 518M | 0.42 | 0.51 | 10.29 | 4.02 | 4.11 | 0.63 |
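On the out-of-distribution (OOD) AudioCanvas benchmark, PrismAudio leads on every reported metric, and its synchronization holds up (DeSync 0.41 on VGGSound, 0.36 here) while ThinkSound's degrades (0.55 to 0.80):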
| Method | CLAP↑ | DeSync↓ | CE↑ | CRW↓ | MOS-Q↑ | MOS-C↑ |
|---|---|---|---|---|---|---|
| ThinkSound | 0.48 | 0.80 | 4.10 | 22.82 | 3.79 | 3.80 |
| Ours (PrismAudio) | 0.52 | 0.36 | 4.26 | 12.87 | 4.12 | 4.01 |
Ablation Study
CoT Strategy (AudioCanvas):
| Configuration | CLAP↑ | DeSync↓ | CE↑ | CRW↓ |
|---|---|---|---|---|
| Baseline (No CoT) | 0.42 | 0.44 | 3.81 | 15.30 |
| Monolithic CoT | 0.46 | 0.38 | 3.79 | 13.02 |
| MultiCoT (Ours) | 0.52 | 0.36 | 4.26 | 12.87 |
Multi-dimensional vs. Single-dimensional Rewards (Objective Entanglement):
| Reward Focus | CLAP↑ | DeSync↓ | PQ↑ | CRW↓ | FD↓ |
|---|---|---|---|---|---|
| Baseline (No RL) | 0.47 | 0.42 | 6.45 | 15.30 | 1.90 |
| Semantic Only | 0.54 | 0.58 | 6.62 | 11.89 | 1.84 |
| Aesthetic Only | 0.46 | 0.42 | 7.06 | 13.51 | 4.50 |
| Multi-dimensional | 0.52 | 0.36 | 6.68 | 12.87 | 1.53 |
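As a quick arithmetic check on the reward ablation table above, the snippet below lists which metrics get worse than the no-RL baseline under each reward setting. The numbers are copied from the table; the helper name `regressions` and the data layout are illustrative.

```python
# Metric name -> True if higher is better (direction taken from the table arrows).
METRICS = {"CLAP": True, "DeSync": False, "PQ": True, "CRW": False, "FD": False}

# Rows copied from the reward ablation table.
ROWS = {
    "Baseline (No RL)": [0.47, 0.42, 6.45, 15.30, 1.90],
    "Semantic Only": [0.54, 0.58, 6.62, 11.89, 1.84],
    "Aesthetic Only": [0.46, 0.42, 7.06, 13.51, 4.50],
    "Multi-dimensional": [0.52, 0.36, 6.68, 12.87, 1.53],
}


def regressions(name: str) -> list[str]:
    """Return the metrics on which a configuration is worse than the baseline."""
    base = ROWS["Baseline (No RL)"]
    worse = []
    for (metric, higher_better), b, v in zip(METRICS.items(), base, ROWS[name]):
        is_worse = (v < b) if higher_better else (v > b)
        if is_worse:
            worse.append(metric)
    return worse


semantic_regressions = regressions("Semantic Only")
aesthetic_regressions = regressions("Aesthetic Only")
multi_regressions = regressions("Multi-dimensional")
fd_ratio = ROWS["Aesthetic Only"][4] / ROWS["Baseline (No RL)"][4]
```

Each single-reward setting regresses on at least one metric, while the multi-dimensional setting regresses on none.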
Key Findings
- Decomposed Reasoning + Multi-rewards is the Main Driver: On VGGSound, adding CoT-RL improves every reported metric (CLAP 0.42 → 0.47, DeSync 0.51 → 0.41, CRW 10.29 → 7.72, MOS-Q 4.02 → 4.21, MOS-C 4.11 → 4.22). Without it, the base model is already strong but trails the full model on every dimension.
- Single Rewards Cause Trade-offs: Rewarding only aesthetics raises PQ to 7.06 but more than doubles the distribution error (FD 1.90 → 4.50), meaning the sound is "pretty" but drifts away from the content. Rewarding only semantics raises CLAP to 0.54 but worsens synchronization (DeSync 0.42 → 0.58). Only the multi-dimensional reward improves every metric over the no-RL baseline.
- Fast-GRPO Efficiency: Fast-GRPO reaches a higher reward (~0.51) within 200 steps, whereas Flow-GRPO needs 600+ steps to plateau at a lower value (~0.47).
Highlights & Insights
- Reasoning-Reward Alignment: The 1:1 mapping between decomposed CoT and specialized reward heads is a clean solution for disentanglement.
- Efficient Diffusion RL: Fast-GRPO reduces NFE from \(T\) to \(w\) using the random window trick, which is applicable to any flow matching/diffusion task.
- Reusing Specialized Evaluators: Using existing models (MS-CLAP, Synchformer) as rewards bypasses the need for training custom reward models and ensures alignment between training objectives and evaluation metrics.
Limitations & Future Work
- Risk of overfitting to proxy reward models: scores that "surpass ground truth" on some metrics suggest the model may be exploiting biases of the proxy evaluators rather than improving perceived quality.
- Reward weights \(\lambda_k\) are manually tuned; adaptive weight learning could be a future improvement.
- High engineering complexity and dependency on multiple external models (Gemini, VideoLLaMA2, four evaluators).
Related Work & Insights
- vs ThinkSound: PrismAudio solves the dimension interference and entanglement issues of ThinkSound's monolithic reasoning and reconstruction loss via decomposition and multi-dimensional RL.
- vs Flow-GRPO: Fast-GRPO is significantly more efficient by restricting stochastic exploration to a random window rather than the full trajectory.
Rating
- Novelty: ⭐⭐⭐⭐⭐
- Experimental Thoroughness: ⭐⭐⭐⭐⭐
- Writing Quality: ⭐⭐⭐⭐
- Value: ⭐⭐⭐⭐⭐