EVLF: Early Vision-Language Fusion for Generative Dataset Distillation

Conference: CVPR 2026
arXiv: 2603.07476
Code: GitHub
Area: Image Restoration
Keywords: Dataset Distillation, Diffusion Models, Vision-Language Fusion, Early Fusion, Plug-and-Play

TL;DR

This paper proposes EVLF, a plug-and-play early vision-language fusion method that operates at the encoder-backbone interface. It targets the text dominance and degraded visual fidelity that late-stage semantic injection causes in diffusion-based dataset distillation.

Background & Motivation

Dataset Distillation (DD) aims to synthesize compact training sets that enable models to achieve high accuracy with few samples. Diffusion-based DD methods (e.g., D4M, MGD3) have become mainstream, yet share a core structural problem:

Semantic dominance from late fusion: In standard diffusion pipelines, textual semantics are injected via cross-attention during the denoising stage (late fusion), causing text signals to excessively dominate the generation trajectory.

Degraded visual fidelity: Since encoder-derived visual latents carry only visual information, late-injected semantics operate in a "corrective" rather than "co-evolutionary" manner, producing samples that match class labels but suffer from visual distortion.

Manifestations include: unnatural shapes, text-like textures, and overly simplified contours in generated samples.

Core insight: Moving semantic fusion from the denoising stage to the encoder output stage (encoder-backbone interface) allows visual and semantic signals to co-evolve from the very beginning of the diffusion process.

Method

Overall Architecture

The EVLF pipeline:

1. The VAE encoder produces visual latents $z_{\text{img}} = \mathcal{E}(x)$.
2. The text encoder produces class embeddings $e_{\text{text}} = \mathcal{T}(y)$.
3. A lightweight cross-attention module fuses both at the encoder output: $z_{\text{fused}} = \text{CA}(z_{\text{img}}, e_{\text{text}})$.
4. The fused $z_{\text{fused}}$ serves as the initial condition for subsequent diffusion generation.
5. Optionally, the denoiser is fine-tuned to adapt to the fused latent distribution.
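
Step 3 can be made concrete with a small NumPy sketch. It is an illustration under stated assumptions, not the authors' implementation: attention is single-head, image and text tokens are assumed to be already projected to a shared width `d`, layer normalization has no learned scale or shift, and $\psi$ is treated as an optional output map (identity by default). The function name `early_fusion`, the toy shapes, and the random weights are hypothetical. Image tokens form the queries and text tokens the keys and values, following the assignment described under Key Designs.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each token vector to zero mean and unit variance (no learned parameters)."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def early_fusion(z_img, e_text, W_q, W_k, W_v, psi=None):
    """Single-head cross-attention in which image tokens query text tokens.

    z_img:  (N, d) image latent tokens
    e_text: (M, d) text tokens
    Returns the fused tokens (N, d) and the attention map (N, M).
    """
    d = W_q.shape[1]
    Q = z_img @ W_q                       # queries come from the image side
    K = e_text @ W_k                      # keys and values come from the text side
    V = e_text @ W_v
    attn = softmax(Q @ K.T / np.sqrt(d))  # each image token's weights over text tokens
    fused = layer_norm(z_img + attn @ V)  # residual keeps the visual latent as the anchor
    if psi is not None:                   # psi: hypothetical output map, identity if omitted
        fused = psi(fused)
    return fused, attn

# Toy shapes and random weights, chosen only for illustration.
rng = np.random.default_rng(0)
N, M, d = 16, 4, 8
z_img = rng.normal(size=(N, d))
e_text = rng.normal(size=(M, d))
W_q, W_k, W_v = (rng.normal(scale=d ** -0.5, size=(d, d)) for _ in range(3))

z_fused, attn = early_fusion(z_img, e_text, W_q, W_k, W_v)
print(z_fused.shape, attn.shape)  # (16, 8) (16, 4)
```

Because the attention output is added back to the image tokens, the result keeps one row per image token and can stand in for $z_{\text{img}}$ in the steps that follow.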

Key Designs

Early-fusion cross-attention module: Image tokens serve as Query and text tokens as Key/Value, so semantics are injected with vision as the anchor rather than letting text dominate:

$$Q = \tilde{z}W_Q, \quad K = \tilde{e}W_K, \quad V = \tilde{e}W_V$$

$$z_{\text{fused}} = \psi\left(\text{LN}\left(\tilde{z} + \text{softmax}\left(\frac{QK^\top}{\sqrt{d}}\right)V\right)\right)$$

Here $\tilde{z}$ and $\tilde{e}$ are the image and text token sequences, and LN is layer normalization.

Design Motivation: Using image features as Query ensures semantics "guide without overwriting" visual structure.
Dual-loss training objective:
- Visual preservation loss $\mathcal{L}_{\text{MSE}} = \|z_{\text{fused}} - z_{\text{img}}\|_2^2$: ensures the fused latent does not deviate from the original visual structure.
- Semantic alignment loss $\mathcal{L}_{\text{InfoNCE}}$: maps $z_{\text{fused}}$ to the text embedding space via a learnable projector and performs in-batch contrastive learning to align same-class samples.
- Total loss: $\mathcal{L}_{\text{CA}} = \lambda_1 \mathcal{L}_{\text{InfoNCE}} + \lambda_2 \mathcal{L}_{\text{MSE}}$
Plug-and-play design: EVLF inserts only at the encoder-backbone interface, independent of any specific training schedule or loss function, and can be seamlessly integrated into arbitrary encoder-based diffusion DD pipelines such as D4M and MGD3.
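
The dual-loss objective can be sketched in NumPy as follows. This is a minimal illustration, not the paper's code, and several details are assumptions: latents are pooled to one vector per sample, the learnable projector is a single matrix `proj`, every batch entry with the same class label counts as a positive in the contrastive term, and the temperature `tau=0.07` is a common default rather than a value from the source. All function names are hypothetical.

```python
import numpy as np

def mse_loss(z_fused, z_img):
    """Visual preservation: squared L2 distance per sample, averaged over the batch."""
    return float(np.mean(np.sum((z_fused - z_img) ** 2, axis=-1)))

def info_nce(z_fused, e_text, labels, proj, tau=0.07):
    """In-batch contrastive loss between projected fused latents and text embeddings.

    Assumption: batch entries sharing a class label are treated as positives.
    """
    p = z_fused @ proj                                   # project into the text space
    p = p / np.linalg.norm(p, axis=1, keepdims=True)     # unit norm -> cosine similarity
    t = e_text / np.linalg.norm(e_text, axis=1, keepdims=True)
    logits = p @ t.T / tau                               # (B, B) similarity matrix
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    w = np.exp(logits)
    pos = labels[:, None] == labels[None, :]             # same-class mask
    return float(np.mean(-np.log((w * pos).sum(axis=1) / w.sum(axis=1))))

def ca_loss(z_fused, z_img, e_text, labels, proj, lam1=0.1, lam2=1.0):
    """Total loss: lam1 * InfoNCE + lam2 * MSE."""
    return lam1 * info_nce(z_fused, e_text, labels, proj) + lam2 * mse_loss(z_fused, z_img)

# Toy batch: 6 samples from 3 classes (illustrative values only).
rng = np.random.default_rng(0)
B, d, d_text = 6, 8, 5
labels = np.array([0, 0, 1, 1, 2, 2])
z_img = rng.normal(size=(B, d))
z_fused = z_img + 0.1 * rng.normal(size=(B, d))  # fused latent stays near the visual one
class_emb = rng.normal(size=(3, d_text))         # one text embedding per class
e_text = class_emb[labels]
proj = rng.normal(size=(d, d_text))

nce = info_nce(z_fused, e_text, labels, proj)
mse = mse_loss(z_fused, z_img)
total = ca_loss(z_fused, z_img, e_text, labels, proj)
print(f"InfoNCE={nce:.4f}  MSE={mse:.4f}  total={total:.4f}")
```

The two terms pull in different directions: the MSE term is zero only when the fused latent equals the visual latent, while the contrastive term rewards moving toward the class text embedding. The weights $\lambda_1$ and $\lambda_2$ set the balance.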

Loss & Training

Cross-attention module trained for 4 epochs, batch size 16, AdamW optimizer
$\lambda_1 = 0.1$ (fixed); $\lambda_2$ linearly increases from 0.05 to 1.0 over the first 2 epochs
Optional denoiser fine-tuning using standard diffusion loss on $z_{\text{fused}}$
Denoiser fine-tuning is required when integrating with D4M; the denoiser is kept frozen when integrating with MGD3
Training feasible on a single NVIDIA A5000 GPU
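
The $\lambda_2$ warm-up listed above can be written as a small schedule function. This is a sketch: the source gives the endpoints (0.05 to 1.0) and the duration (first 2 epochs) but not whether the weight is updated per step or per epoch, so the per-step interpolation and the name `lambda2_schedule` are assumptions.

```python
def lambda2_schedule(step, steps_per_epoch, start=0.05, end=1.0, warmup_epochs=2):
    """Linearly ramp lambda_2 from `start` to `end` over the warm-up, then hold at `end`."""
    warmup_steps = warmup_epochs * steps_per_epoch
    if step >= warmup_steps:
        return end
    return start + (end - start) * step / warmup_steps

# With a hypothetical 100 steps per epoch and 4 training epochs:
steps_per_epoch = 100
checkpoints = [0, 100, 200, 399]  # start, end of epoch 1, end of warm-up, last step
values = [lambda2_schedule(s, steps_per_epoch) for s in checkpoints]
print([round(v, 3) for v in values])  # [0.05, 0.525, 1.0, 1.0]
```

Starting with a small $\lambda_2$ lets the contrastive term shape the fused latent first; raising it to 1.0 then tightens the constraint that keeps the fused latent close to the visual one.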

Key Experimental Results

Main Results

Dataset	IPC	Metric	D4M	D4M+EVLF	MGD3	MGD3+EVLF
ImageWoof	10	ResNetAP-10	33.2	37.3	36.6	39.3
ImageWoof	50	ResNetAP-10	51.7	55.8	55.6	59.0
ImageNette	20	ResNetAP-10	66.3	71.7	69.2	72.5
CIFAR-10	10	Accuracy	37.6	45.7	-	-
Tiny-ImageNet	10	Accuracy	42.5	49.2	-	-
ImageNet-1K	50	Accuracy	60.1	60.6	60.3	61.9
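
The absolute gains implied by the table can be computed directly. The snippet below only restates the table values and subtracts them; the dictionary layout is an illustrative choice.

```python
# (dataset, IPC) -> {pipeline: (baseline, with EVLF)}, copied from the table above
results = {
    ("ImageWoof", 10): {"D4M": (33.2, 37.3), "MGD3": (36.6, 39.3)},
    ("ImageWoof", 50): {"D4M": (51.7, 55.8), "MGD3": (55.6, 59.0)},
    ("ImageNette", 20): {"D4M": (66.3, 71.7), "MGD3": (69.2, 72.5)},
    ("CIFAR-10", 10): {"D4M": (37.6, 45.7)},
    ("Tiny-ImageNet", 10): {"D4M": (42.5, 49.2)},
    ("ImageNet-1K", 50): {"D4M": (60.1, 60.6), "MGD3": (60.3, 61.9)},
}

# Gain in percentage points for every (dataset, IPC, pipeline) entry
gains = {
    (dataset, ipc, pipeline): round(with_evlf - base, 1)
    for (dataset, ipc), row in results.items()
    for pipeline, (base, with_evlf) in row.items()
}

for (dataset, ipc, pipeline), gain in gains.items():
    print(f"{dataset:14s} IPC={ipc:<3d} {pipeline:5s} +{gain}")
```

Every gain is positive. The largest is 8.1 points (CIFAR-10, D4M) and the smallest is 0.5 points (ImageNet-1K, D4M).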

Ablation Study

Configuration	IPC=10	IPC=20	IPC=50	Notes
D4M baseline	47.7	56.3	67.8	ResNetAP-10 on ImageIDC
+Denoiser fine-tuning	54.1	61.1	70.3	Fine-tuning is effective
+Cross-attention	51.1	57.5	69.1	Cross-attention is effective
+Both combined	57.3	62.0	72.1	The two components are complementary; combining them gives the best results
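
Reading the IPC=10 column as differences from the D4M baseline gives a quick check on how the two components combine. This is simple arithmetic on the table values, not an analysis taken from the paper, and $\Delta_{\text{FT}}$, $\Delta_{\text{CA}}$, $\Delta_{\text{both}}$ are shorthand introduced here for the three gains:

$$\Delta_{\text{FT}} = 54.1 - 47.7 = 6.4, \quad \Delta_{\text{CA}} = 51.1 - 47.7 = 3.4, \quad \Delta_{\text{both}} = 57.3 - 47.7 = 9.6$$

The combined gain of 9.6 is close to the sum $6.4 + 3.4 = 9.8$, so at this setting the two contributions are roughly additive. The same calculation gives 5.7 versus 6.0 at IPC=20 and 4.3 versus 3.8 at IPC=50.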

Key Findings

EVLF improves accuracy in every reported dataset and IPC setting, with the largest gain at low IPC (8.1 percentage points on CIFAR-10 at IPC=10)
t-SNE visualizations show that EVLF-generated samples exhibit broader distribution coverage and improved diversity
Enabling EVLF ($\lambda_1 > 0$) yields significant improvements in both accuracy and coverage, with results being insensitive to the specific value of $\lambda_1$
Transfer learning experiments indicate that datasets distilled with EVLF exhibit superior feature transferability

Highlights & Insights

Precise diagnosis: The paper accurately identifies late-stage semantic injection as the root cause of text over-correction in diffusion-based DD, and illustrates this with intuitive visualizations (Fig. 1).
Elegant design: A single lightweight cross-attention module achieves plug-and-play improvement without modifying any other part of the pipeline.
Comprehensive experiments: Evaluations span CIFAR-10/100, Tiny-ImageNet, ImageNet-1K and its subsets, across multiple IPC settings and architectures.
Conceptual inspiration: The comparison between early and late fusion reveals the importance of semantic injection timing in conditional generation.

Limitations & Future Work

Currently supports only class-level conditioning; instance-level or multi-label scenarios are not addressed.
Gains on the large-scale ImageNet-1K setting are relatively modest (about 0.5 to 1.6 percentage points).
More sophisticated fusion mechanisms (e.g., multi-layer fusion, adaptive fusion weights) remain unexplored.
Future directions include instance-aware and compositional prompting, as well as extension to finer-grained control.

Related Work & Insights

D4M: A prototype-driven LDM DD method; EVLF serves as a plug-and-play enhancement for it.
MGD3: A multimodal-guided DD method; EVLF integrates seamlessly into it as well.
MinimaxDiffusion: A DiT-based DD method using minimax optimization, focusing on discriminability and representativeness.
Inspiration: The early-fusion concept of EVLF generalizes to optimizing semantic injection timing in other conditional generation tasks.

Rating

Novelty: ⭐⭐⭐⭐ — The idea of replacing late injection with early fusion is simple yet insightful.
Experimental Thoroughness: ⭐⭐⭐⭐ — 7 datasets, multiple IPC settings, multiple architectures, ablation/visualization/transfer experiments.
Writing Quality: ⭐⭐⭐⭐ — Clear problem diagnosis, excellent framework diagram, and coherent logical flow.
Value: ⭐⭐⭐⭐ — A general plug-and-play method with practical value for the DD community.