Learning to Highlight Audio by Watching Movies

Conference: CVPR 2025
arXiv: 2505.12154
Code: https://wikichao.github.io/VisAH/ (Project Page)
Area: Audio/Multimodal
Keywords: Visually-guided audio enhancement, acoustic highlighting, multimodal fusion, audio mixing, movie audio

TL;DR

The paper proposes a new task, visually-guided acoustic highlighting, and uses the professionally mixed audio of movies as free supervision. Its Transformer-based multimodal framework, VisAH, converts poorly mixed audio into highlighted audio that is aligned with the video both visually and semantically, and it outperforms every compared baseline on all reported metrics.

Background & Motivation

Background: Visual editing in video content creation is mature (e.g., optimal perspective selection, post-editing), but intelligent processing of the audio track lags behind. Most recording equipment (e.g., camera-mounted microphones) captures all sounds indiscriminately, so the recorded audio lacks acoustic layering.
Limitations of Prior Work: Traditional approaches require separating audio into individual sources (speech, music, sound effects) first, followed by manual volume adjustments of each source. This not only suffers from limited separation precision but also demands extensive manual effort to ensure temporal alignment with the video. Existing music mixing methods are restricted to the music domain, ignoring the diversity of natural audio.
Key Challenge: Audio needs to be "highlighted" based on video content, but there is a lack of direct paired training data (i.e., "poorly mixed audio" and "well-mixed audio" pairs).
Goal: (a) Define a new task—visually-guided acoustic highlighting; (b) Construct a training dataset; (c) Design an end-to-end multimodal model.
Key Insight: Audio in movies is professionally crafted and naturally contains "good mixing" information, which can serve as a free supervision signal.
Core Idea: Use movie audio as ground truth (GT) and create training pairs through a pseudo-data generation pipeline (separation, adjustment, remixing). A Transformer then transforms the audio features in the latent space, guided by visual information.

Method

Overall Architecture

The inputs are a "poorly mixed" audio waveform \(\mathbf{a}\) and the corresponding video frames \(\mathbf{v}\), and the output is the highlighted audio \(\mathbf{s}\). The overall model operates in three stages: (1) a dual U-Net audio encoder extracts frequency-domain and time-domain features; (2) a latent highlighting Transformer leverages visual/textual context to guide audio feature transformation; (3) a dual U-Net decoder reconstructs the highlighted audio waveform.

Key Designs

Dual U-Net Audio Backbone:
- Function: Simultaneously extract audio representations from both the frequency domain (spectrogram) and the time domain (waveform).
- Mechanism: Based on the HybridDemucs architecture, the spectrogram branch progressively reduces the dimension of the magnitude spectrogram through a 5-layer 2D convolutional encoder; the waveform branch acts as a residual path, capturing fine-grained temporal details with 1D convolutions. The outputs of both branches are element-wise summed to obtain a unified audio embedding \(\mathbf{f_a} \in \mathbb{R}^{C_a \times L}\). Notably, the authors removed the mean normalization from the original HybridDemucs, as it tends to suppress ambient sounds.
- Design Motivation: Each single representation has its limits: the frequency domain is good at capturing the frequency patterns of different sound sources, while the time domain reconstructs waveforms more precisely. The dual-branch architecture combines the strengths of both.
Latent Highlighting Transformer:
- Function: Transform latent audio features into "highlighted" representations guided by the visual context.
- Mechanism: Video frame features are extracted with CLIP ViT-L/14, and captions generated by InternVL2-8B are encoded with T5-XXL; each stream passes through its own Transformer encoder to capture temporal context. A Transformer decoder \(\mathcal{D}\) then injects this context into the audio features via cross-attention. The key design is to treat the decoder output as an offset added to the original features (a residual connection) and to pass it through a zero-initialized convolutional layer \(\mathcal{Z}(\cdot)\), so that the module behaves as an identity mapping at the start of training: \(\hat{\mathbf{f}}_\mathbf{a} = \mathbf{f_a} + \mathcal{Z}(\mathcal{D}(\mathbf{f_a}, \hat{\mathbf{f}}_i))\), where \(\hat{\mathbf{f}}_i\) denotes the encoded context features.
- Design Motivation: Visual information in video focuses on salient regions, whereas audio captures the overall environment. Thus, the temporal dynamics of video must be leveraged to guide audio highlighting. Text captions, as an extra modality, can convey deeper semantics like emotion and context.
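
The residual update with a zero-initialized layer can be illustrated with a small NumPy sketch. Everything here is an assumption made for illustration rather than the authors' implementation: the decoder is reduced to single-head cross-attention without learned projections, \(\mathcal{Z}\) is modelled as a 1x1 convolution (a per-position linear map), and the shapes `C_a`, `L`, and `T` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def cross_attention(queries, context):
    """Single-head scaled dot-product attention (no learned projections).

    queries: (L, C) audio positions, context: (T, C) context tokens.
    """
    scores = queries @ context.T / np.sqrt(queries.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ context

def zero_conv(x, weight, bias):
    """1x1 convolution over a (channels, length) feature map."""
    return weight @ x + bias[:, None]

def highlight(f_a, f_ctx, weight, bias):
    """f_hat = f_a + Z(D(f_a, f_ctx)), with D simplified to cross-attention."""
    decoder_out = cross_attention(f_a.T, f_ctx).T  # back to (C_a, L)
    return f_a + zero_conv(decoder_out, weight, bias)

C_a, L, T = 4, 6, 3                  # hypothetical sizes
f_a = rng.normal(size=(C_a, L))      # stand-in for the audio embedding
f_ctx = rng.normal(size=(T, C_a))    # stand-in for encoded context features

# At initialization Z has zero weight and zero bias, so the output equals f_a.
f_hat_init = highlight(f_a, f_ctx, np.zeros((C_a, C_a)), np.zeros(C_a))
identity_at_init = np.allclose(f_hat_init, f_a)

# Once Z has non-zero weights, the context starts to shift the features.
f_hat_later = highlight(f_a, f_ctx, 0.1 * np.eye(C_a), np.zeros(C_a))
changed_later = not np.allclose(f_hat_later, f_a)
```

Because the offset branch contributes nothing at initialization, training starts from the unmodified audio features, and the context only takes effect as the weights of \(\mathcal{Z}\) move away from zero.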
Muddy Mix Pseudo-Data Generation Pipeline:
- Function: Generate "poorly mixed" training inputs from movie audio.
- Mechanism: A three-step pipeline—(a) Separation: decompose movie audio into three sub-streams (speech, music, sound effects) using a three-source separation model, adding a residual to ensure the sum equals the original audio; (b) Adjustment: suppress the loudest source (-6/-9/-12 dB) and boost the other two (+6/+9/+12 dB), categorized across three difficulty levels (High/Medium/Low); (c) Remixing: perform a linear superposition to generate the "poorly mixed" input audio. Finally, 15,078/1,927/1,789 training/validation/test clips were generated from action movies in the CMD dataset.
- Design Motivation: Paired "good mix / poor mix" recordings are virtually impossible to collect directly, but movie audio is professionally mixed and can serve as GT. Pseudo-data generation therefore yields training pairs at scale without manual annotation.
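
The adjustment and remixing steps can be sketched as follows, assuming the separation step has already produced three stems plus a residual. This is a minimal illustration, not the authors' pipeline: using RMS as the loudness measure, leaving the residual unscaled, and the names `make_muddy_mix` and `db_to_gain` are all assumptions.

```python
import numpy as np

def db_to_gain(db):
    """Convert a gain in decibels to a linear amplitude factor."""
    return 10.0 ** (db / 20.0)

def rms(x):
    """Root-mean-square level, used here as a simple loudness proxy."""
    return float(np.sqrt(np.mean(np.square(x))))

def make_muddy_mix(stems, residual, gain_db):
    """Suppress the loudest stem by gain_db and boost the others by gain_db.

    stems: dict mapping a source name to a 1-D waveform.
    residual: waveform that makes the stems sum to the original audio
              (kept unscaled here, which is an assumption).
    """
    loudest = max(stems, key=lambda name: rms(stems[name]))
    mix = residual.copy()
    for name, wav in stems.items():
        db = -gain_db if name == loudest else gain_db
        mix = mix + db_to_gain(db) * wav  # linear superposition
    return mix, loudest

# Toy stems: one second of sine tones at a hypothetical 16 kHz sample rate.
t = np.linspace(0.0, 1.0, 16000, endpoint=False)
stems = {
    "speech": 0.8 * np.sin(2 * np.pi * 220 * t),
    "music": 0.2 * np.sin(2 * np.pi * 440 * t),
    "effects": 0.1 * np.sin(2 * np.pi * 880 * t),
}
residual = np.zeros_like(t)

muddy, loudest = make_muddy_mix(stems, residual, gain_db=6.0)
```

In this toy example the speech stem is the loudest, so it is attenuated while music and effects are boosted; the original (unscaled) sum of the stems plays the role of the GT target.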

Loss & Training

A multi-resolution STFT loss (MR-STFT) computes the L1 distance between the magnitude spectrograms of the predicted audio and the GT audio at three window sizes (2048/1024/512). Training configuration: batch size of 12 per GPU, Adam optimizer, learning rate \(10^{-4}\), 200 epochs. Training takes about 18 hours on two RTX 4090 GPUs.
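
A minimal NumPy sketch of the L1 magnitude term at the three window sizes is given below. It only covers what the description above states; the Hann window, the hop of one quarter of the window, the plain averaging over resolutions, and the function names are assumptions, and a training implementation would use a differentiable framework instead.

```python
import numpy as np

def stft_magnitude(x, win, hop):
    """Magnitude STFT of a 1-D signal using a Hann window (no padding)."""
    window = np.hanning(win)
    n_frames = 1 + (len(x) - win) // hop
    frames = np.stack([x[i * hop:i * hop + win] * window for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))

def mr_stft_l1(pred, target, windows=(2048, 1024, 512)):
    """Mean L1 distance between magnitude spectrograms, averaged over resolutions."""
    losses = []
    for win in windows:
        hop = win // 4  # assumed hop size
        diff = stft_magnitude(pred, win, hop) - stft_magnitude(target, win, hop)
        losses.append(np.mean(np.abs(diff)))
    return float(np.mean(losses))

rng = np.random.default_rng(0)
target = rng.normal(size=16000)   # stand-in for a GT waveform
pred = 0.5 * target               # stand-in for a model prediction

loss_same = mr_stft_l1(target, target)
loss_diff = mr_stft_l1(pred, target)
```

Identical signals give a loss of zero, and any level or spectral mismatch gives a positive value; combining several window sizes trades off time and frequency resolution.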

Key Experimental Results

Main Results

Method	MAG↓	ENV↓	KLD↓	ΔIB↓	W-dis↓
Poorly Mixed Input	22.69	6.30	20.61	1.52	1.94
DnRv3+CDX	26.32	7.62	15.87	1.78	2.84
Learn2Remix	19.07	4.16	61.76	8.27	1.20
LCE-SepReformer	17.18	4.28	30.99	1.88	1.28
VisAH (Ours)	10.08	3.43	11.01	0.80	0.79

VisAH achieves the best score on all five metrics. Relative to the poorly mixed input, it reduces MAG by 56% and W-dis by 59%.
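
The quoted percentages match reductions measured against the "Poorly Mixed Input" row rather than against the strongest baseline. The short check below recomputes them from the table values; the dictionary keys and the helper name are illustrative only.

```python
# Values copied from the main results table.
poorly_mixed = {"MAG": 22.69, "ENV": 6.30, "KLD": 20.61, "dIB": 1.52, "W-dis": 1.94}
visah = {"MAG": 10.08, "ENV": 3.43, "KLD": 11.01, "dIB": 0.80, "W-dis": 0.79}

def relative_reduction(before, after):
    """Percentage by which `after` is lower than `before`."""
    return 100.0 * (before - after) / before

reductions = {
    name: round(relative_reduction(poorly_mixed[name], visah[name]))
    for name in poorly_mixed
}

for name, pct in reductions.items():
    print(f"{name}: {pct}% lower than the poorly mixed input")
```

This gives 56% for MAG and 59% for W-dis, in agreement with the figures in the text.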

Ablation Study

Context Type	MAG↓	KLD↓	ΔIB↓
No Context	10.35	11.95	0.99
+ Semantic Visual (Single Frame)	10.35	11.67	0.91
+ Semantic Text (Single Frame Caption)	10.32	11.83	0.84
+ Temporal Visual (Multi-Frame)	10.24	11.18	0.88
+ Temporal Text (Multi-Frame Caption)	10.08	11.01	0.80

Key Findings

Contextual Information is Crucial: Adding temporal context (visual or textual) yields larger improvements than single-frame semantic context, indicating that audio highlighting requires an understanding of the video's temporal dynamics.
Text Captions are More Effective than Pure Visuals: Temporal text achieves the best results, which is attributed to the VLM-generated captions conveying deeper emotional and scene semantics.
The Number of Transformer Encoder Layers Matters: Visual context performs best with 3 encoder layers (6 layers lead to overfitting), attributed to CLIP visual features already being sufficiently compact, whereas text context keeps improving up to 6 layers.
Data Difficulty Ablation: The model improves substantially at all three difficulty levels (high/medium/low), supporting the soundness of the data generation strategy and the metric design.
Subjective Evaluation: VisAH achieved a 77% top-2 ranking rate, and in 34% of the videos it was ranked above the GT, suggesting that its highlighting can occasionally surpass the original movie soundtrack.

Highlights & Insights

Movies as Free Supervision Signals: Utilizing existing high-quality movie audio as GT to obtain training pairs through pseudo-data generation is an extremely clever data engineering approach that avoids expensive annotations.
Zero-Initialized Residual Design: A zero-initialized convolutional layer adds the Transformer output back to the audio features as a residual, which keeps the model stable early in training. The trick is likely transferable to other conditional generation tasks.
Potential Application Value: The authors demonstrate VisAH improving the audio of video generation models such as MovieGen, suggesting that the method can serve as a general audio post-processing module.

Limitations & Future Work

The data only comes from action movies, limiting scene diversity; expanding to more movie genres may further improve generalization.
Only three-source separation (speech/music/sound effects) is used; finer-grained source separation may bring more delicate highlighting control.
Current evaluations are mainly conducted on 10-second clips; the performance in long-video scenarios remains unverified.
Training relies on specific pre-trained models (CLIP, InternVL2, T5); lightweight alternative solutions can be explored.

Comparison with Related Work

vs Learn2Remix: Learn2Remix is an audio-only automatic mixing method without visual guidance. VisAH introduces visual context to provide a clear semantic target for mixing.
vs Visually-Guided Audio Source Separation: Audio source separation completely isolates the target sources and suppresses the others to zero. VisAH focuses instead on "remixing": it retains all sources but adjusts their relative prominence.
vs LCE: LCE is a text-guided audio editor but lacks the ability to capture global trends. The Transformer architecture of VisAH models temporal dynamics better.

Rating

Novelty: ⭐⭐⭐⭐ Proposes a brand-new task definition and data construction method.
Experimental Thoroughness: ⭐⭐⭐⭐ Quantitative, subjective, ablation, and application demonstrations are all comprehensive.
Writing Quality: ⭐⭐⭐⭐ Clear logic, and fully articulated motivations and methods.
Value: ⭐⭐⭐⭐ Opens up a new direction for audio highlighting with practical application prospects.