Efficient Audio-Visual Speech Separation with Discrete Lip Semantics and Multi-Scale Global-Local Attention¶

Conference: ICLR 2026
arXiv: 2509.23610
Code: Available (https://cslikai.cn/Dolphin)
Area: Audio & Speech
Keywords: audio-visual speech separation, discrete lip semantics, vector quantization, global-local attention, lightweight

TL;DR¶

This paper proposes Dolphin, which maps lip movements to discrete semantic tokens with a lightweight dual-path video encoder (DP-LipCoder) and separates speech with a Global-Local Attention (GLA) separator. Dolphin surpasses state-of-the-art methods on three benchmarks while cutting parameters by more than 50%, MACs by 2.4×, and GPU inference latency by 6×.

Background & Motivation¶

Audio-Visual Speech Separation (AVSS) leverages visual cues (lip movements) to extract target speaker speech from noisy mixed audio. Existing methods face two fundamental tensions:

Path dependency in visual encoders: Large-scale pretrained video backbones (e.g., 3D ResNet-18) offer strong semantic alignment but at prohibitive computational cost; direct compression severely degrades semantic representation; lightweight encoders designed from scratch can only extract shallow pixel-level features.

Efficiency–quality trade-off in separators: High-performance methods (e.g., AV-Mossformer2) have parameter counts too large for practical deployment; lightweight alternatives (RTFSNet, AVLiT) rely on multiple iterations, so their inference latency remains high.

Method¶

Overall Architecture¶

Dolphin consists of five core components:
- Pretrained video encoder DP-LipCoder: Maps lip video to reconstruction features $\mathbf{V}_r$ and semantic features $\mathbf{V}_s$
- Audio encoder: A 1D convolutional layer encoding mixed audio into $\mathbf{X} \in \mathbb{R}^{N_a \times T_a}$
- Audio-Visual Fusion (AVF) module: Fuses visual and audio features
- Separator: An encoder-decoder architecture based on TDANet, with GLA blocks embedded at each layer
- Audio decoder: A 1D transposed convolution outputting the time-domain separated signal

Key Designs¶

1. DP-LipCoder: Dual-Path Lightweight Video Encoder¶

A dual-path autoencoder is designed based on the MagVIT video generation architecture:

Reconstruction path: Extracts compressed visual features $\mathbf{V}_r$, preserving auxiliary cues such as speaker identity and facial expressions. The encoder consists of cascaded 3D residual blocks, spatial attention blocks, and alternating spatial downsampling.
Semantic path: An encoder with the same structure but non-shared parameters, augmented at the end with a Vector Quantization (VQ) module. Knowledge distillation from AV-HuBERT maps continuous video to audio-aligned discrete semantic tokens $\mathbf{V}_s$.
Decoder: The outputs of both paths are summed and fused for video reconstruction.

Three training losses are jointly optimized: $$\mathcal{L} = \mathcal{L}_{\text{commit}} + \mathcal{L}_{\text{distill}} + \mathcal{L}_{\text{recon}}$$

Loss	Role
$\mathcal{L}_{\text{recon}}$	Reconstruction loss; drives the reconstruction path to capture speaker visual cues
$\mathcal{L}_{\text{distill}}$	AV-HuBERT teacher distillation; guides the semantic path to extract audio-aligned features
$\mathcal{L}_{\text{commit}}$	VQ commitment loss; constrains consistency between encoder outputs and the codebook
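
The quantization step and its commitment term can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the codebook size, feature dimension, and the function name `vector_quantize` are hypothetical, and the distillation and reconstruction terms are omitted.

```python
import numpy as np

def vector_quantize(z, codebook):
    """Map each row of z (T, D) to its nearest row of codebook (K, D)."""
    # Squared Euclidean distance between every frame and every code: (T, K)
    d2 = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    tokens = d2.argmin(axis=1)   # one discrete token index per frame
    z_q = codebook[tokens]       # quantized features, shape (T, D)
    # Commitment term: mean squared gap between encoder outputs and their codes.
    # During training the selected codes would be held constant for this term.
    commit = float(np.mean((z - z_q) ** 2))
    return tokens, z_q, commit

rng = np.random.default_rng(0)
codebook = rng.normal(size=(8, 4))   # hypothetical: 8 codes of dimension 4
z = codebook[[3, 3, 5]] + 0.01 * rng.normal(size=(3, 4))  # frames near codes 3, 3, 5
tokens, z_q, commit = vector_quantize(z, codebook)
print(tokens, commit)
```

Frames that sit close to a code are assigned that code's index, and the commitment value shrinks as encoder outputs move toward the codebook.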

During AVSS inference, only the encoder and VQ module are executed; the decoder is not needed. Compared to 3D ResNet-18, parameters are reduced by 93% (0.78M vs. 11.19M) and MACs by 70%, while SI-SNRi drops by only 0.2 dB.
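
As a quick check, the reductions quoted above follow from the 3D ResNet-18 and DP-LipCoder rows of Table 1. The variable names are illustrative only.

```python
# Figures taken from Table 1 (3D ResNet-18 vs. DP-LipCoder)
params_resnet, params_dp = 11.19, 0.78   # parameters, millions
macs_resnet, macs_dp = 7.95, 2.38        # MACs, G/s
sisnri_resnet, sisnri_dp = 17.0, 16.8    # SI-SNRi, dB

param_cut = 1 - params_dp / params_resnet
mac_cut = 1 - macs_dp / macs_resnet
sisnri_gap = sisnri_resnet - sisnri_dp
print(f"{param_cut:.0%} fewer parameters, {mac_cut:.0%} fewer MACs, {sisnri_gap:.1f} dB gap")
# prints: 93% fewer parameters, 70% fewer MACs, 0.2 dB gap
```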

2. GLA Block: Global-Local Attention¶

GA Block (Global Attention):
- Coarse-grained self-attention (CSA): downsamples the sequence to length $T_a/2^Q$, applies MHSA, then upsamples back to the original length
- Because self-attention cost grows quadratically with sequence length, the attention cost falls to $1/2^{2Q}$ of that at full resolution
- Followed by an FFN (with DWConv1D, kernel=3)
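
The downsample, attend, upsample pattern of CSA can be sketched as follows. This is a simplified illustration, not the paper's layer: it assumes average pooling, nearest-neighbour upsampling, a single head, and no learned query/key/value projections, none of which are specified above.

```python
import numpy as np

def coarse_self_attention(x, q):
    """x: (channels, T) with T divisible by 2**q. Single-head attention on a pooled sequence."""
    c, t = x.shape
    s = 2 ** q
    coarse = x.reshape(c, t // s, s).mean(axis=-1)   # average-pool to length T / 2**q
    scores = coarse.T @ coarse / np.sqrt(c)          # (T/s, T/s) similarity matrix
    scores -= scores.max(axis=-1, keepdims=True)     # subtract row max for stability
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)               # row-wise softmax
    attended = (w @ coarse.T).T                      # (channels, T/s)
    return np.repeat(attended, s, axis=-1)           # nearest-neighbour upsample to T

x = np.random.default_rng(0).normal(size=(16, 64))
y = coarse_self_attention(x, q=2)
ratio = (64 // 4) ** 2 / 64 ** 2   # attention-matrix entries, coarse vs. full resolution
print(y.shape, ratio)
```

With `q=2` the attention matrix has 16 × 16 entries instead of 64 × 64, which is the $1/2^{2Q}$ ratio for $Q=2$.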

LA Block (Local Attention):
- Heat Diffusion Attention (HDA): Learnable multi-scale filtering built on a physics-based prior from the heat diffusion equation
- Maps to the frequency domain via DCT and applies exponential decay filtering, where $p$ is the DCT frequency index:
$$\tilde{\mathbf{A}}(p) = \mathbf{A}(p) \cdot \exp(-\mathbf{k}_c (p\pi/T_a)^2)$$
- $\mathbf{k}_c \in \mathbb{R}^{N_a}$ is a learnable channel-adaptive diffusion coefficient
- Maps back to the time domain via IDCT, followed by a gating mechanism: $\breve{\mathbf{F}}_0 = \mathcal{P}(\hat{\mathbf{x}} \odot \text{SiLU}(\mathbf{z}))$
- Advantage: Not limited by a finite convolutional receptive field; fewer parameters than large-kernel Conv1D with finer modeling
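
The frequency-domain filter in HDA can be reproduced in a few lines of NumPy. The sketch below covers only the DCT, exponential decay, and IDCT steps; the gating and projection $\mathcal{P}$ are left out, the orthonormal DCT-II convention is an assumption, and the coefficients `k` are fixed by hand here rather than learned.

```python
import numpy as np

def dct_matrix(t):
    """Orthonormal DCT-II matrix C of shape (t, t); C @ C.T is the identity."""
    n = np.arange(t)
    c = np.sqrt(2.0 / t) * np.cos(np.pi * (n[None, :] + 0.5) * n[:, None] / t)
    c[0] /= np.sqrt(2.0)
    return c

def heat_diffusion_filter(x, k):
    """x: (channels, T); k: (channels,) non-negative diffusion coefficients."""
    t = x.shape[-1]
    c = dct_matrix(t)
    a = x @ c.T                                   # DCT along the time axis
    p = np.arange(t)                              # frequency index
    decay = np.exp(-k[:, None] * (p[None, :] * np.pi / t) ** 2)
    return (a * decay) @ c                        # inverse DCT back to time

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 32))
k = np.array([0.0, 50.0])                         # channel 0: no diffusion; channel 1: strong
y = heat_diffusion_filter(x, k)
print(x.var(axis=-1), y.var(axis=-1))
```

A coefficient of zero leaves the channel unchanged, while a larger coefficient attenuates high frequencies and smooths the signal, so each channel can settle on its own effective scale.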

3. Encoder-Decoder Separator¶

Encoder: $Q=4$ layers, each with 2 GLA blocks + downsampling, extracting multi-scale features
Features from all scales are downsampled to the lowest resolution and summed to obtain a global representation $\mathcal{G}$, enhanced by a top-level GA block
Decoder: $Q=4$ layers, each with a TDA block (upsampling) + 3 GLA blocks
Directly outputs target speaker features without mask multiplication, avoiding distortion introduced by conventional masking
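
The aggregation of multi-scale encoder features into $\mathcal{G}$ might look like the sketch below. Average pooling is assumed as the downsampling operator, and the number of scales and tensor shapes are hypothetical; the top-level GA block is not included.

```python
import numpy as np

def global_representation(features):
    """features: list of (channels, T_i) arrays whose lengths are multiples of the shortest."""
    t_min = min(f.shape[-1] for f in features)
    pooled = []
    for f in features:
        c, t = f.shape
        # Average-pool each scale down to the lowest temporal resolution
        pooled.append(f.reshape(c, t_min, t // t_min).mean(axis=-1))
    return np.sum(pooled, axis=0)                 # element-wise sum across scales

# Hypothetical scales with lengths 64, 32, 16, 8, 4
feats = [np.ones((8, 64 // 2 ** i)) for i in range(5)]
g = global_representation(feats)
print(g.shape)
```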

4. Audio-Visual Fusion Module¶

Two fusion mechanisms from RTFSNet are extended to the time domain: video-guided gating fusion $\mathcal{F}_1$ and attention-based cross-feature-space fusion $\mathcal{F}_2$, with visual features upsampled along the temporal dimension only.

Loss & Training¶

Separator optimization objective: SI-SNR
Adam optimizer, lr=1e-3; the learning rate is halved when validation loss has not improved for 15 epochs, and training stops early after 30 epochs without improvement
L2 gradient clipping threshold of 5, batch size=48, 8× RTX 5090 GPUs
DP-LipCoder parameters are frozen; only the separation network is trained
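
For reference, SI-SNR can be computed as in the sketch below, using the standard definition with zero-mean signals. The function name and `eps` value are illustrative choices; in training, the negative of this quantity is typically minimized.

```python
import numpy as np

def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant signal-to-noise ratio in dB for 1-D signals."""
    estimate = estimate - estimate.mean()
    target = target - target.mean()
    # Project the estimate onto the target to obtain the scaled reference
    s_target = (estimate @ target) / (target @ target + eps) * target
    e_noise = estimate - s_target
    return 10 * np.log10((s_target @ s_target + eps) / (e_noise @ e_noise + eps))

rng = np.random.default_rng(0)
target = rng.normal(size=16000)
noisy = target + 0.1 * rng.normal(size=16000)
print(si_snr(noisy, target), si_snr(3.0 * noisy, target))
```

Rescaling the estimate does not change the score, which is why the metric is called scale-invariant.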

Key Experimental Results¶

Main Results¶

Table 1: Pretrained Video Encoder Comparison (LRS2)

Method	SI-SNRi(dB)↑	SDRi(dB)↑	PESQ↑	Params(M)↓	MACs(G/s)↓
3D ResNet-18	17.0	17.1	3.30	11.19	7.95
AE	15.2	15.4	3.15	0.05	0.17
LipCoder	16.3	16.4	3.24	0.65	5.33
DP-LipCoder	16.8	16.9	3.29	0.78	2.38

Table 2: AVSS SOTA Comparison (Three Datasets)

Method	LRS2 SI-SNRi	LRS3 SI-SNRi	VoxCeleb2 SI-SNRi
IIANet	16.0	18.3	13.6
AV-Mossformer2	15.1	17.7	14.0
Dolphin	16.8	18.8	14.6

Table 3: Efficiency Comparison (Including Video Encoder)

Method	Params(M)↓	MACs(G)↓	GPU Latency(ms)↓
IIANet	15.01	26.51	142.30
AV-Mossformer2	68.52	124.46	62.30
Dolphin	7.00	10.89	33.24

Ablation Study¶

GLA Component Ablation (LRS2):

GA	LA	SI-SNRi↑	Params(M)↓
✗	✗	10.4	2.04
✓	✗	15.9	5.23
✗	✓	15.6	3.81
✓	✓	16.8	7.00

HDA layer vs. Conv1D: HDA achieves 16.9 dB SI-SNRi, outperforming Conv1D at 16.5 dB, with fewer parameters (7.00M vs. 7.57M).

Key Findings¶

VQ discrete encoding improves SI-SNRi by at least 1.0 dB over continuous autoencoders; the VQ module alone contributes approximately 0.5 dB.
DP-LipCoder generalizes to other AVSS models: replacing the video encoder reduces parameters by 10M+ with only marginal performance degradation.
A single-iteration architecture with GLA outperforms multi-iteration approaches.
Compared to SOTA IIANet: parameters reduced by 53%, MACs by 59%, and GPU inference is 4.3× faster.
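
The IIANet comparison can be recomputed directly from the Table 3 rows. The dictionary keys are illustrative names.

```python
# Rows from Table 3
iianet = {"params_m": 15.01, "macs_g": 26.51, "latency_ms": 142.30}
dolphin = {"params_m": 7.00, "macs_g": 10.89, "latency_ms": 33.24}

param_cut = 1 - dolphin["params_m"] / iianet["params_m"]
mac_cut = 1 - dolphin["macs_g"] / iianet["macs_g"]
mac_ratio = iianet["macs_g"] / dolphin["macs_g"]
speedup = iianet["latency_ms"] / dolphin["latency_ms"]
print(f"{param_cut:.0%} params, {mac_cut:.0%} MACs ({mac_ratio:.1f}x), {speedup:.1f}x faster")
# prints: 53% params, 59% MACs (2.4x), 4.3x faster
```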

Highlights & Insights¶

Superiority of discrete representations: Mapping the video stream to a "visual vocabulary" yields more compact and discriminative representations than continuous alternatives, an idea that may carry over to other multimodal systems.
Heat diffusion physics prior: Incorporating the heat equation into local attention enables fine-grained local feature modeling by learning only scaling and gating parameters, reducing overfitting risk.
Dual-path complementarity philosophy: The reconstruction path retains auxiliary identity and expression information, while the semantic path extracts audio-aligned information.

Limitations & Future Work¶

The method relies on clean, synchronized lip video and lacks robustness to large head pose variations, occlusion, and extreme lighting conditions.
Deployment on extremely resource-constrained devices remains challenging; quantization and pruning are worth exploring.
Discrete tokens may lose fine-grained articulatory cues; hierarchical codebooks or hybrid discrete-continuous representations could be explored.
Validation is conducted only on English datasets; cross-lingual generalization remains to be investigated.

Related Work & Insights¶

TDANet provides the foundational separation architecture; this work adds GLA blocks and removes iterative processing.
AV-HuBERT serves as the teacher model for semantic distillation.
MagVIT's video generation architecture is creatively repurposed as a video encoder.
Insight: Physics-based priors (heat diffusion) can be injected into attention mechanisms as inductive biases.

Rating¶

Novelty: ⭐⭐⭐⭐ — Dual-path discrete encoding + heat diffusion local attention
Technical Depth: ⭐⭐⭐⭐ — Multi-module design with comprehensive ablation studies
Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Three datasets, multi-dimensional efficiency comparisons, and ablations
Value: ⭐⭐⭐⭐⭐ — Significant efficiency gains with clear deployment scenarios
