AC-Foley: Reference-Audio-Guided Video-to-Audio Synthesis with Acoustic Transfer

Conference: ICLR 2026 · arXiv: 2603.15597 · Code: None · Area: Audio Generation · Keywords: Video-to-Audio, Foley Synthesis, Reference Audio Control, Timbre Transfer, Flow Matching

TL;DR

This paper proposes AC-Foley, a reference-audio-guided video-to-audio synthesis framework. Through two-stage training (acoustic feature learning, then temporal adaptation) and multimodal conditional flow matching, it achieves fine-grained timbre control, timbre transfer, and zero-shot sound effect generation, significantly outperforming existing methods in audio quality and acoustic fidelity.

Background & Motivation

Background: Existing V2A methods primarily synthesize audio from text prompts combined with visual information, achieving audio-visual synchronization at the semantic level.

Limitations of Prior Work: (a) Dataset granularity gap — training annotations group acoustically distinct sounds (e.g., barks from different dog breeds) under coarse labels; (b) Descriptive limits of text — language cannot encode fine-grained acoustic features (e.g., "metal impact" cannot distinguish the time-frequency characteristics of a hammer striking an anvil from those of a falling steel chain). These constraints prevent text-based control from achieving fine-grained sound effect synthesis.

Key Challenge: Foley artists need to synthesize multiple acoustic variants for the same visual action (e.g., footsteps on surfaces of different materials), yet text cannot precisely describe timbral differences, and training data lacks such fine-grained annotations.

Goal: To directly control acoustic characteristics using reference audio, bypassing the semantic ambiguity inherent in text.

Key Insight: A VAE encodes the reference audio to preserve its complete acoustic signature (rather than using encoders such as CLAP that extract only semantic information), and two-stage training is employed to learn how to adapt the reference timbre to the temporal structure of the video.

Core Idea: Replace text with raw audio signals as the control condition; leverage a VAE to preserve timbral features and employ two-stage training to achieve adaptive transfer from reference audio to video temporal dynamics.

Method

Overall Architecture

The inputs are a silent video and a reference audio clip (with an optional text prompt). Under a conditional flow matching framework, a multimodal Transformer generates audio that is synchronized with the video while preserving the timbral characteristics of the reference audio; the three modalities interact through joint training.
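
Since no code is released, here is a minimal sketch of how inference under a conditional flow matching framework typically proceeds: a trained velocity network (v_theta below, standing in for the paper's multimodal Transformer) is integrated from Gaussian noise to an audio latent with plain Euler steps. The step count and the conditioning container are illustrative assumptions, not the paper's exact settings.

```python
import torch

@torch.no_grad()
def generate_audio_latent(v_theta, cond, latent_shape, num_steps=25, device="cpu"):
    """Integrate the learned velocity field v_theta(t, C, x_t) from noise
    (t = 0) to an audio latent (t = 1) with Euler steps; the audio VAE
    decoder (not shown) then turns the latent into a waveform."""
    x = torch.randn(latent_shape, device=device)   # x_0 ~ N(0, I)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = torch.full((latent_shape[0],), i * dt, device=device)
        x = x + dt * v_theta(t, cond, x)           # Euler update along the flow
    return x                                       # x_1: generated audio latent
```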

Key Designs

  1. Multimodal Conditional Flow Matching:

    • Function: Extends conditional flow matching to three-modality conditional generation over video, audio, and text.
    • Mechanism: The velocity field \(v_\theta(t, \mathcal{C}, x_t)\) is guided by multimodal conditions \(\mathcal{C} = \{V, A, T\}\). The condition vector \(\mathbf{c}\) integrates CLIP visual/text features, Synchformer synchronization features, VAE audio features, and timestep embeddings, modulating Transformer inputs via adaLN.
    • Design Motivation: Flow matching offers faster inference than diffusion models, and joint multimodal training allows different control signals to complement one another.
  2. Audio Control Module:

    • Function: Encodes the reference audio using a VAE to preserve complete acoustic features.
    • Mechanism: A pretrained VAE encoder maps the reference audio to the latent space (instead of CLAP), and average pooling is applied to extract acoustic features. CLAP captures only semantic-level audio information, whereas the VAE retains full spectral and timbral characteristics.
    • Design Motivation: Since text suffers from insufficient semantic granularity, conditioning directly on audio — which preserves waveform-level acoustic information rather than semantic labels — is a principled solution.
  3. Two-Stage Training Strategy:

    • Function: Learns acoustic feature extraction and temporal adaptation in separate stages.
    • Mechanism: Stage 1 (Acoustic Feature Learning) trains on overlapping audio-visual clip pairs to establish the ability to extract acoustic features from the reference audio. Stage 2 (Temporal Adaptation) uses non-overlapping audio segments from different temporal positions within the same video as conditions, exploiting intra-video audio self-similarity (e.g., footsteps in the same scene share acoustic characteristics) to force the model to align reference timbral features with the video's temporal structure.
    • Design Motivation: The non-overlapping conditioning in Stage 2 is the key design choice — it compels the model to learn "timbre transfer" rather than "waveform copying," resolving the temporal misalignment and audio-visual incoherence caused by naive duplication.
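
A minimal sketch of the Stage 1 vs. Stage 2 reference-clip sampling described above, assuming references are cut from the same video's audio track; the function name, jitter scheme, and gap handling are illustrative rather than the paper's exact recipe.

```python
import random

def sample_reference_clip(audio, target_start, target_len, stage, min_gap=0):
    """Pick a reference-audio window for the target clip
    [target_start, target_start + target_len) from the same video.

    Stage 1: the reference overlaps the target, so the model learns to
    read acoustic features straight out of the condition.
    Stage 2: the reference comes from a non-overlapping position, so
    matching the target forces timbre transfer, not waveform copying.
    """
    total = len(audio)
    if stage == 1:
        # Overlapping window: jitter the start around the target clip.
        jitter = random.randint(-target_len // 2, target_len // 2)
        start = min(max(target_start + jitter, 0), total - target_len)
    else:
        # Non-overlapping window: sample from before or after the target.
        regions = []
        if target_start - min_gap >= target_len:
            regions.append((0, target_start - min_gap - target_len))
        after = target_start + target_len + min_gap
        if total - after >= target_len:
            regions.append((after, total - target_len))
        if not regions:
            raise ValueError("video too short for a non-overlapping reference")
        lo, hi = random.choice(regions)
        start = random.randint(lo, hi)
    return audio[start:start + target_len]
```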

Loss & Training

The standard conditional flow matching objective (velocity field regression) is used. Each modality's condition is randomly dropped out during training with a certain probability, enabling flexible condition combinations at inference time.
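
A hedged sketch of this objective: linear-path conditional flow matching (velocity regression toward x1 − x0) with independent per-modality condition dropout. The null embeddings, drop probability, and dict-based conditioning interface are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def cfm_training_step(v_theta, x1, conds, null_conds, p_drop=0.1):
    """One conditional flow matching step: regress the network onto the
    straight-line velocity x1 - x0, with each modality's condition
    independently dropped so any combination works at inference time."""
    b = x1.shape[0]
    x0 = torch.randn_like(x1)                    # noise endpoint
    t = torch.rand(b, device=x1.device)          # t ~ U[0, 1]
    t_ = t.view(b, *([1] * (x1.dim() - 1)))
    xt = (1 - t_) * x0 + t_ * x1                 # point on the linear path
    target_v = x1 - x0                           # constant target velocity

    # Independently drop video / audio / text conditions per sample,
    # replacing them with learned null embeddings.
    dropped = {}
    for name, c in conds.items():
        keep = (torch.rand(b, device=x1.device) >= p_drop)
        keep = keep.view(b, *([1] * (c.dim() - 1)))
        dropped[name] = torch.where(keep, c, null_conds[name])

    pred_v = v_theta(t, dropped, xt)
    return F.mse_loss(pred_v, target_v)
```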

Key Experimental Results

Main Results

| Method | FD↓ | KL↓ | MCD↓ | Timbre Fidelity |
| --- | --- | --- | --- | --- |
| MMAudio (text only) | Baseline | Baseline | Baseline | No control |
| CondFoley | Moderate | Moderate | Moderate | Limited |
| AC-Foley (audio conditioned) | −20% | −28% | −22% | Precise |
| AC-Foley (no audio condition) | Competitive | Competitive | Competitive | — |

Ablation Study

| Configuration | Audio Quality | Timbre Fidelity | Notes |
| --- | --- | --- | --- |
| Full model | Best | Best | Two-stage training + VAE encoding |
| Stage 1 only | Moderate | Temporal misalignment | Lacks temporal adaptation |
| CLAP instead of VAE | Degraded | Timbral detail lost | CLAP captures only semantics |
| No audio condition | Competitive | — | Degenerates to standard V2A |

Key Findings

  • Providing different reference audio clips (e.g., Chihuahua bark vs. large-breed bark) for the same dog video yields acoustically distinct outputs, validating fine-grained control capability.
  • Timbre transfer experiments succeed across categories (e.g., transferring donkey vocalizations to a lion video), demonstrating cross-category acoustic feature transfer.
  • Zero-shot generation: using a suppressed gunshot as reference audio for a shooting video produces a silencer-like effect that text prompts cannot adequately describe.
  • Even without a reference audio input, AC-Foley remains competitive with state-of-the-art V2A methods, indicating that joint multimodal training itself improves baseline generation capability.

Highlights & Insights

  • A principled bypass of text: Rather than improving text descriptions, the method uses audio directly as the control signal — "letting sound describe sound" is fundamentally more effective than "letting words describe sound."
  • Elegant two-stage training design: Exploiting intra-video audio self-similarity to force the model to learn "transfer" rather than "copy" is an ingenious training strategy.
  • VAE vs. CLAP: Prior work defaults to CLAP for audio encoding, but CLAP is designed for semantic retrieval. Preserving timbre requires lower-level waveform features, making VAE the principled choice.
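
To make the contrast concrete, here is a sketch of VAE-based timbre conditioning with the encoder as a stand-in module; the latent shape and pooling axis are assumptions.

```python
import torch
import torch.nn as nn

def timbre_condition(vae_encoder: nn.Module, ref_audio: torch.Tensor) -> torch.Tensor:
    """Average-pool VAE latents over time: the pooled vector keeps the
    reference's spectral/timbral statistics while discarding its temporal
    layout, so the model cannot simply paste the reference back in."""
    z = vae_encoder(ref_audio)   # assumed shape: (batch, time, latent_dim)
    return z.mean(dim=1)         # (batch, latent_dim) timbre vector

# A CLAP-style encoder instead maps audio into a text-aligned retrieval
# space; two different-sounding barks can land on near-identical
# embeddings, losing exactly the detail that timbre transfer needs.
```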

Limitations & Future Work

  • Reference audio must be provided by the creator, increasing the barrier to use.
  • Generation quality may degrade when the reference audio is semantically inconsistent with the video content.
  • Two-stage training adds training complexity.
  • Flexibility with respect to reference audio length may be constrained by the VAE's processing capacity.

Comparison with Related Methods

  • vs. MMAudio: MMAudio jointly trains on video and text modalities but does not support audio-conditioned control; AC-Foley extends to three modalities and enables precise timbre control.
  • vs. CondFoley: CondFoley requires equal-length reference audio-video pairs, limiting flexibility; AC-Foley supports variable-length references.
  • vs. MultiFoley: MultiFoley performs audio continuation/extension and is constrained by the diversity of input audio; AC-Foley performs timbre transfer applicable across different semantic categories.

Rating

  • Novelty: ⭐⭐⭐⭐ The idea of replacing text with reference audio control is intuitive yet highly effective; the two-stage training design is elegant.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Three application scenarios are fully covered: fine-grained control, timbre transfer, and zero-shot generation.
  • Writing Quality: ⭐⭐⭐⭐ Motivation is clearly articulated; application examples are vivid and intuitive.
  • Value: ⭐⭐⭐⭐⭐ Provides a much-needed fine-grained control tool for Foley production practice.