# ThinkSound: Chain-of-Thought Reasoning in Multimodal Large Language Models for Audio Generation and Editing
Conference: NeurIPS 2025
arXiv: 2506.21448
Code: https://ThinkSound-Project.github.io
Area: LLM Reasoning
Keywords: Video-to-Audio, Chain-of-Thought, Audio Generation, Multimodal Reasoning, Flow Matching
## TL;DR
ThinkSound is a three-stage interactive video-to-audio framework that leverages an MLLM to generate structured CoT reasoning as guidance for a unified audio foundation model. It achieves state-of-the-art performance on VGGSound and MovieGen Audio benchmarks while supporting object-level refinement and natural language instruction-based editing.
## Background & Motivation
Background: Video-to-audio (V2A) generation has evolved from end-to-end diffusion models (Diff-Foley, FoleyCrafter) to multimodal conditional generation (MMAudio, MultiFoley), with substantial quality improvements. However, existing methods remain black-box one-step generators that lack deep reasoning over visual content.
Limitations of Prior Work: Generating realistic audio requires reasoning in the manner of a professional sound designer—determining whether an owl is calling or flapping its wings, recognizing the rustling of branches, and synchronizing multiple audio events. End-to-end approaches compress these reasoning steps, frequently producing generic sounds that are misaligned with subtle visual cues.
Key Challenge: SonicVisionLM employs an MLLM for captioning followed by text-to-audio generation, but loses critical visual detail in the process. DeepSound-V1 introduces CoT reasoning but fragments the pipeline into three independent models. Neither approach fully exploits the reasoning capacity of MLLMs to guide a unified audio generation system.
Goal: To deeply integrate MLLM-based CoT reasoning into the V2A pipeline, enabling stepwise, interactive audio generation and editing.
Key Insight: Emulating the professional sound designer's workflow—first synthesizing the overall soundscape, then refining sounds for specific objects, and finally applying instruction-driven edits—with each stage guided by CoT reasoning.
Core Idea: An MLLM generates audio-specific CoT reasoning chains that serve as structured conditioning signals for a unified flow matching audio foundation model across three stages of audio generation.
## Method
### Overall Architecture
ThinkSound consists of two primary modules: (1) an MLLM fine-tuned from VideoLLaMA2, responsible for analyzing video/text inputs and producing structured CoT reasoning; and (2) a unified audio foundation model based on MM-DiT, which accepts multimodal conditioning from CoT, video, text, and audio context, and generates high-fidelity audio via flow matching. The full pipeline proceeds through three stages: base Foley generation → object-level interactive refinement → instruction-driven editing.
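The three-stage flow described above can be sketched as a simple orchestration loop. This is an illustrative mock, not the paper's actual API: `generate_cot`, `foundation_model`, and `thinksound_pipeline` are hypothetical stand-ins showing how each stage feeds the MLLM's CoT (and, from Stage 2 onward, the previously generated audio as context) into the same unified foundation model.

```python
# Hypothetical sketch of the three-stage ThinkSound pipeline.
# All function names and return shapes are illustrative stand-ins.

def generate_cot(video, roi=None, instruction=None):
    """Stand-in for the fine-tuned MLLM producing a structured CoT string."""
    steps = ["analyze scene", "identify sound sources", "order events in time"]
    if roi is not None:
        steps.append(f"focus on object {roi}")
    if instruction is not None:
        steps.append(f"apply edit: {instruction}")
    return " -> ".join(steps)

def foundation_model(cot, video, audio_context=None):
    """Stand-in for the flow-matching MM-DiT; returns a dummy 'audio' record."""
    return {"cot": cot, "context": audio_context}

def thinksound_pipeline(video, click_roi=None, edit_instruction=None):
    # Stage 1: base Foley generation from a video-level CoT.
    audio = foundation_model(generate_cot(video), video)
    # Stage 2: object-level refinement, conditioned on the existing audio.
    if click_roi is not None:
        cot = generate_cot(video, roi=click_roi)
        audio = foundation_model(cot, video, audio_context=audio)
    # Stage 3: instruction-driven editing, again reusing the same model.
    if edit_instruction is not None:
        cot = generate_cot(video, instruction=edit_instruction)
        audio = foundation_model(cot, video, audio_context=audio)
    return audio
```

The key structural point the sketch captures is that all three stages call one shared foundation model; only the CoT conditioning and the audio context change.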
### Key Designs
- AudioCoT Dataset:
  - Function: Constructs a large-scale multimodal CoT annotation dataset that bridges visual content, textual descriptions, and audio synthesis.
  - Mechanism: A three-stage automated pipeline—(a) VideoLLaMA2 and Qwen2-Audio extract visual and audio information, and GPT-4.1-nano synthesizes CoT chains; (b) Grounded SAM2 extracts ROIs to generate object-level CoT; (c) editing CoT is generated based on four operations (extension/inpainting/addition/removal).
  - Design Motivation: Without large-scale audio CoT data, an MLLM cannot be trained to produce meaningful reasoning chains; existing datasets lack structured reasoning annotations.
- CoT Reasoning MLLM:
  - Function: Fine-tunes VideoLLaMA2 to generate audio-specific structured reasoning.
  - Mechanism: Standard cross-entropy fine-tuning on AudioCoT endows the model with three capabilities: (a) audio understanding (acoustic properties, sound propagation, temporal causality); (b) structured decomposition (breaking complex audio scenes into actionable steps); (c) multimodal instruction following.
  - Design Motivation: General-purpose MLLMs lack the specialized reasoning capacity required for audio generation.
- Unified Audio Foundation Model (CoT-Guided MM-DiT):
  - Function: Generates high-fidelity audio from arbitrary combinations of input modalities.
  - Mechanism: Trained with flow matching. Key design choices include—(a) dual-path text encoding: MetaCLIP encodes visual captions for scene-level context, while T5-v1-xl encodes CoT reasoning to capture fine-grained temporal causal relationships; (b) hybrid Transformer: multi-stream blocks (modality-specific parameters with shared attention) followed by single-stream blocks; (c) adaptive fusion module: video features are upsampled and fused with audio latents via a gating mechanism; (d) classifier-free guidance dropout (random per-modality drop with probability 0.2) to support arbitrary input combinations.
  - Design Motivation: A unified architecture allows all three stages to share a single audio generation model, and CoT conditioning provides more precise generation guidance than raw captions.
- Click-Based Interaction Interface (Stage 2):
  - Function: Enables users to click on specific objects in the video to trigger object-level audio refinement.
  - Mechanism: Grounded SAM2 generates an ROI from the click location and tracks it across frames; the MLLM produces a dedicated CoT for the ROI; the foundation model conditions on the existing audio as context and synthesizes object-specific sounds that are blended into the output.
  - Design Motivation: Enables fine-grained audio control for non-technical users.
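The per-modality classifier-free guidance dropout mentioned in design (d) is the mechanism that lets one model accept arbitrary input combinations: during training, each conditioning modality is independently replaced by a null condition with probability 0.2, so the model also learns the unconditional (and every partially conditioned) case. A minimal sketch, assuming `None` stands in for the null condition and features are plain numbers:

```python
import random

def dropout_conditions(conds, p_drop=0.2, rng=None):
    """Independently drop each conditioning modality with probability p_drop,
    replacing it with None (a stand-in for the learned null embedding), as in
    classifier-free guidance training. `conds` maps modality name -> feature."""
    rng = rng or random.Random()
    return {k: (None if rng.random() < p_drop else v) for k, v in conds.items()}

def cfg_combine(cond_out, uncond_out, scale=4.0):
    """At inference, classifier-free guidance extrapolates from the
    unconditional prediction toward the conditional one."""
    return [u + scale * (c - u) for c, u in zip(cond_out, uncond_out)]
```

Because each modality is dropped independently, the model sees every subset of {CoT, video, text, audio context} during training, which is what makes the single foundation model usable across all three stages.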
### Loss & Training
- VAE: Based on the Stability AI VAE, trained for 500K steps (24×A800); the encoder is then frozen and the decoder is trained for an additional 500K steps.
- Foundation model: 100K steps (8×A100), batch size 256, learning rate \(10^{-4}\).
- Task fine-tuning: 50K steps (8×A100) separately for each of the three stages.
## Key Experimental Results
### Main Results (VGGSound V2A Generation)
| Method | FD↓ | KL_PaSST↓ | DeSync↓ | CLAP_CoT↑ | MOS-Q↑ | MOS-A↑ |
|---|---|---|---|---|---|---|
| MMAudio | 43.26 | 1.65 | 0.44 | 0.40 | 3.84 | 3.97 |
| ThinkSound | 34.56 | 1.52 | 0.46 | 0.46 | 4.02 | 4.18 |
| w/o CoT | 39.84 | 1.59 | 0.48 | 0.41 | 3.91 | 4.04 |
### OOD Evaluation (MovieGen Audio Bench)
| Method | CLAP_CoT↑ | DeSync↓ | MOS-Q↑ | MOS-A↑ |
|---|---|---|---|---|
| MMAudio | 0.45 | 0.77 | 3.95 | 3.62 |
| MovieGen | 0.47 | 1.00 | 3.98 | 3.70 |
| ThinkSound | 0.51 | 0.76 | 4.11 | 3.87 |
### Key Findings
- Significant contribution of CoT reasoning: Removing CoT raises FD from 34.56 to 39.84 (+15%) and lowers CLAP_CoT from 0.46 to 0.41, confirming that CoT supplies critical information on audio events, temporal relationships, and acoustic properties.
- Strong OOD generalization: ThinkSound achieves state-of-the-art performance on the unseen MovieGen benchmark, suggesting that CoT reasoning improves generalization.
- Object-level and editing tasks: ThinkSound substantially outperforms baselines on both object-level generation (FD 43.27 vs. MMAudio 44.46) and audio editing (FD 34.78 vs. AudioLDM-2 61.28).
- Inference efficiency: Generation time is only 1.07s, faster than MMAudio (3.01s) and FoleyCrafter (3.84s).
## Highlights & Insights
- Well-motivated three-stage interactive workflow: The design faithfully emulates the professional sound designer's process (global → local → revision), with MLLM CoT reasoning bridging user intent and audio synthesis at each stage.
- High value of the AudioCoT dataset: The automated CoT annotation pipeline is scalable to additional data sources, addressing the lack of audio CoT training data.
- Dual-path text encoding (MetaCLIP + T5): The complementary combination of scene-level global context and fine-grained CoT reasoning substantially outperforms single-encoder alternatives.
## Limitations & Future Work
- Dependency on additional MLLM inference: Each generation requires a preceding MLLM pass to produce the CoT, adding system complexity (though the paper reports shorter overall generation time, likely attributable to faster audio synthesis).
- CoT quality contingent on GPT-4.1-nano: The dataset construction pipeline relies on a closed-source model, and CoT errors propagate downstream.
- Unclear scale of human evaluation: The number of evaluators and detailed setup for MOS scores are not sufficiently described in the main text.
- No conversational interaction: The three stages constitute a predefined linear pipeline with no support for iterative user-feedback-driven refinement.
## Related Work & Insights
- vs. MMAudio: MMAudio also employs flow matching with multimodal conditioning but lacks CoT reasoning. ThinkSound decomposes complex scenes into manageable sound components via CoT, reducing FD by roughly 20% (43.26 → 34.56).
- vs. SonicVisionLM: SonicVisionLM follows a two-stage video→text→audio pipeline in which visual detail is lost at the intermediate step; ThinkSound retains the video as a direct conditioning input throughout generation.
- vs. DeepSound-V1: DeepSound-V1 also employs CoT but fragments it across three independent models; ThinkSound covers all stages with a single unified foundation model.
## Rating
- Novelty: ⭐⭐⭐⭐ The three-stage interactive framework integrating CoT reasoning into V2A generation is a novel design, and the AudioCoT dataset constitutes an independent contribution.
- Experimental Thoroughness: ⭐⭐⭐⭐ Evaluation spans multiple benchmarks (VGGSound + MovieGen), three sub-tasks, ablation studies, and both objective and subjective metrics.
- Writing Quality: ⭐⭐⭐⭐ The paper is clearly structured with rich figures and detailed method descriptions.
- Value: ⭐⭐⭐⭐ Demonstrates the utility of CoT reasoning in generative tasks (beyond pure understanding or reasoning), opening a new application direction.