# ThinkSound: Chain-of-Thought Reasoning in Multimodal Large Language Models for Audio Generation and Editing
Conference: NeurIPS 2025
arXiv: 2506.21448
Code: https://ThinkSound-Project.github.io
Area: LLM Reasoning
Keywords: Video-to-Audio, Chain-of-Thought, Audio Generation, Multimodal Reasoning, Flow Matching
## TL;DR
ThinkSound is a three-stage interactive video-to-audio framework that leverages an MLLM to generate structured CoT reasoning as guidance for a unified audio foundation model. It achieves state-of-the-art performance on VGGSound and MovieGen Audio benchmarks while supporting object-level refinement and natural language instruction-based editing.
## Background & Motivation
Background: Video-to-audio (V2A) generation has evolved from end-to-end diffusion models (Diff-Foley, FoleyCrafter) to multimodal conditional generation (MMAudio, MultiFoley), with substantial quality improvements. However, existing methods remain black-box one-step generators that lack deep reasoning over visual content.
Limitations of Prior Work: Generating realistic audio requires reasoning in the manner of a professional sound designer—determining whether an owl is calling or flapping its wings, recognizing the rustling of branches, and synchronizing multiple audio events. End-to-end approaches compress these reasoning steps, frequently producing generic sounds that are misaligned with subtle visual cues.
Key Challenge: SonicVisionLM employs an MLLM for captioning followed by text-to-audio generation, but loses critical visual detail in the process. DeepSound-V1 introduces CoT reasoning but fragments the pipeline into three independent models. Neither approach fully exploits the reasoning capacity of MLLMs to guide a unified audio generation system.
Goal: To deeply integrate MLLM-based CoT reasoning into the V2A pipeline, enabling stepwise, interactive audio generation and editing.
Key Insight: Emulating the professional sound designer's workflow—first synthesizing the overall soundscape, then refining sounds for specific objects, and finally applying instruction-driven edits—with each stage guided by CoT reasoning.
Core Idea: An MLLM generates audio-specific CoT reasoning chains that serve as structured conditioning signals for a unified flow matching audio foundation model across three stages of audio generation.
## Method
### Overall Architecture
ThinkSound consists of two primary modules: (1) an MLLM fine-tuned from VideoLLaMA2, responsible for analyzing video/text inputs and producing structured CoT reasoning; and (2) a unified audio foundation model based on MM-DiT, which accepts multimodal conditioning from CoT, video, text, and audio context, and generates high-fidelity audio via flow matching. The full pipeline proceeds through three stages: base Foley generation → object-level interactive refinement → instruction-driven editing.
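The three-stage flow described above can be sketched as a simple orchestration loop. This is an illustrative mock, not the paper's actual API: `generate_cot`, `foundation_model`, and `thinksound_pipeline` are hypothetical stand-ins showing how each stage feeds the MLLM's CoT (and, from Stage 2 onward, the previously generated audio as context) into the same unified foundation model.

```python
# Hypothetical sketch of the three-stage ThinkSound pipeline.
# All function names and return shapes are illustrative stand-ins.

def generate_cot(video, roi=None, instruction=None):
    """Stand-in for the fine-tuned MLLM producing a structured CoT string."""
    steps = ["analyze scene", "identify sound sources", "order events in time"]
    if roi is not None:
        steps.append(f"focus on object {roi}")
    if instruction is not None:
        steps.append(f"apply edit: {instruction}")
    return " -> ".join(steps)

def foundation_model(cot, video, audio_context=None):
    """Stand-in for the flow-matching MM-DiT; returns a dummy 'audio' record."""
    return {"cot": cot, "context": audio_context}

def thinksound_pipeline(video, click_roi=None, edit_instruction=None):
    # Stage 1: base Foley generation from a video-level CoT.
    audio = foundation_model(generate_cot(video), video)
    # Stage 2: object-level refinement, conditioned on the existing audio.
    if click_roi is not None:
        cot = generate_cot(video, roi=click_roi)
        audio = foundation_model(cot, video, audio_context=audio)
    # Stage 3: instruction-driven editing, again reusing the same model.
    if edit_instruction is not None:
        cot = generate_cot(video, instruction=edit_instruction)
        audio = foundation_model(cot, video, audio_context=audio)
    return audio
```

The key structural point the sketch captures is that all three stages call one shared foundation model; only the CoT conditioning and the audio context change.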
### Key Designs
- AudioCoT Dataset:
  - Function: Constructs a large-scale multimodal CoT annotation dataset that bridges visual content, textual descriptions, and audio synthesis.
  - Mechanism: A three-stage automated pipeline—(a) VideoLLaMA2 and Qwen2-Audio extract visual and audio information, and GPT-4.1-nano synthesizes CoT chains; (b) Grounded SAM2 extracts ROIs to generate object-level CoT; (c) editing CoT is generated based on four operations (extension/inpainting/addition/removal).
  - Design Motivation: Without large-scale audio CoT data, an MLLM cannot be trained to produce meaningful reasoning chains; existing datasets lack structured reasoning annotations.
- CoT Reasoning MLLM:
  - Function: Fine-tunes VideoLLaMA2 to generate audio-specific structured reasoning.
  - Mechanism: Standard cross-entropy fine-tuning on AudioCoT endows the model with three capabilities: (a) audio understanding (acoustic properties, sound propagation, temporal causality); (b) structured decomposition (breaking complex audio scenes into actionable steps); (c) multimodal instruction following.
  - Design Motivation: General-purpose MLLMs lack the specialized reasoning capacity required for audio generation.
- Unified Audio Foundation Model (CoT-Guided MM-DiT):
  - Function: Generates high-fidelity audio from arbitrary combinations of input modalities.
  - Mechanism: Trained with flow matching. Key design choices include—(a) dual-path text encoding: MetaCLIP encodes visual captions for scene-level context, while T5-v1-xl encodes CoT reasoning to capture fine-grained temporal causal relationships; (b) hybrid Transformer: multi-stream blocks (modality-specific parameters with shared attention) followed by single-stream blocks; (c) adaptive fusion module: video features are upsampled and fused with audio latents via a gating mechanism; (d) classifier-free guidance dropout (random per-modality drop with probability 0.2) to support arbitrary input combinations.
  - Design Motivation: A unified architecture allows all three stages to share a single audio generation model, and CoT conditioning provides more precise generation guidance than raw captions.
- Click-Based Interaction Interface (Stage 2):
  - Function: Enables users to click on specific objects in the video to trigger object-level audio refinement.
  - Mechanism: Grounded SAM2 generates an ROI from the click location and tracks it across frames; the MLLM produces a dedicated CoT for the ROI; the foundation model conditions on the existing audio as context and synthesizes object-specific sounds that are blended into the output.
  - Design Motivation: Enables fine-grained audio control for non-technical users.
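The per-modality classifier-free guidance dropout mentioned in design (d) is the mechanism that lets one model accept arbitrary input combinations: during training, each conditioning modality is independently replaced by a null condition with probability 0.2, so the model also learns the unconditional (and every partially conditioned) case. A minimal sketch, assuming `None` stands in for the null condition and features are plain numbers:

```python
import random

def dropout_conditions(conds, p_drop=0.2, rng=None):
    """Independently drop each conditioning modality with probability p_drop,
    replacing it with None (a stand-in for the learned null embedding), as in
    classifier-free guidance training. `conds` maps modality name -> feature."""
    rng = rng or random.Random()
    return {k: (None if rng.random() < p_drop else v) for k, v in conds.items()}

def cfg_combine(cond_out, uncond_out, scale=4.0):
    """At inference, classifier-free guidance extrapolates from the
    unconditional prediction toward the conditional one."""
    return [u + scale * (c - u) for c, u in zip(cond_out, uncond_out)]
```

Because each modality is dropped independently, the model sees every subset of {CoT, video, text, audio context} during training, which is what makes the single foundation model usable across all three stages.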
### Loss & Training
- VAE: Based on the Stability AI VAE, trained for 500K steps (24×A800); the encoder is then frozen and the decoder is trained for an additional 500K steps.
- Foundation model: 100K steps (8×A100), batch size 256, learning rate \(10^{-4}\).
- Task fine-tuning: 50K steps (8×A100) separately for each of the three stages.
## Key Experimental Results
### Main Results (VGGSound V2A Generation)
| Method | FD↓ | KL_PaSST↓ | DeSync↓ | CLAP_CoT↑ | MOS-Q↑ | MOS-A↑ |
|---|---|---|---|---|---|---|
| MMAudio | 43.26 | 1.65 | 0.44 | 0.40 | 3.84 | 3.97 |
| ThinkSound | 34.56 | 1.52 | 0.46 | 0.46 | 4.02 | 4.18 |
| w/o CoT | 39.84 | 1.59 | 0.48 | 0.41 | 3.91 | 4.04 |
### OOD Evaluation (MovieGen Audio Bench)
| Method | CLAP_CoT↑ | DeSync↓ | MOS-Q↑ | MOS-A↑ |
|---|---|---|---|---|
| MMAudio | 0.45 | 0.77 | 3.95 | 3.62 |
| MovieGen | 0.47 | 1.00 | 3.98 | 3.70 |
| ThinkSound | 0.51 | 0.76 | 4.11 | 3.87 |
### Key Findings
- Significant contribution of CoT reasoning: Removing CoT raises FD from 34.56 to 39.84 (+15%) and lowers CLAP_CoT from 0.46 to 0.41, confirming that CoT supplies critical information on audio events, temporal relationships, and acoustic properties.
- Strong OOD generalization: ThinkSound achieves state-of-the-art performance on the unseen MovieGen benchmark, suggesting that CoT reasoning improves generalization.
- Object-level and editing tasks: ThinkSound substantially outperforms baselines on both object-level generation (FD 43.27 vs. MMAudio 44.46) and audio editing (FD 34.78 vs. AudioLDM-2 61.28).
- Inference efficiency: Generation time is only 1.07s, faster than MMAudio (3.01s) and FoleyCrafter (3.84s).
## Highlights & Insights
- Well-motivated three-stage interactive workflow: The design faithfully emulates the professional sound designer's process (global → local → revision), with MLLM CoT reasoning bridging user intent and audio synthesis at each stage.
- High value of the AudioCoT dataset: The automated CoT annotation pipeline is scalable to additional data sources, addressing the lack of audio CoT training data.
- Dual-path text encoding (MetaCLIP + T5): The complementary combination of scene-level global context and fine-grained CoT reasoning substantially outperforms single-encoder alternatives.
## Limitations & Future Work
- Dependency on additional MLLM inference: Each generation requires a preceding MLLM pass to produce the CoT, adding system complexity (though the paper reports shorter overall generation time, likely attributable to faster audio synthesis).
- CoT quality contingent on GPT-4.1-nano: The dataset construction pipeline relies on a closed-source model, and CoT errors propagate downstream.
- Unclear scale of human evaluation: The number of evaluators and detailed setup for MOS scores are not sufficiently described in the main text.
- No conversational interaction: The three stages constitute a predefined linear pipeline with no support for iterative user-feedback-driven refinement.
## Related Work & Insights
- vs. MMAudio: MMAudio also employs flow matching with multimodal conditioning but lacks CoT reasoning. ThinkSound decomposes complex scenes into manageable sound components via CoT, reducing FD by roughly 20% (43.26 → 34.56).
- vs. SonicVisionLM: SonicVisionLM follows a two-stage video→text→audio pipeline in which visual detail is lost at the intermediate step; ThinkSound retains the video as a direct conditioning input throughout generation.
- vs. DeepSound-V1: DeepSound-V1 also employs CoT but fragments it across three independent models; ThinkSound covers all stages with a single unified foundation model.
## Rating
- Novelty: ⭐⭐⭐⭐ The three-stage interactive framework integrating CoT reasoning into V2A generation is a novel design, and the AudioCoT dataset constitutes an independent contribution.
- Experimental Thoroughness: ⭐⭐⭐⭐ Evaluation spans multiple benchmarks (VGGSound + MovieGen), three sub-tasks, ablation studies, and both objective and subjective metrics.
- Writing Quality: ⭐⭐⭐⭐ The paper is clearly structured with rich figures and detailed method descriptions.
- Value: ⭐⭐⭐⭐ Demonstrates the utility of CoT reasoning in generative tasks (beyond pure understanding or reasoning), opening a new application direction.