Language-Free Generative Editing from One Visual Example

Conference: CVPR 2026
arXiv: 2603.25441
Code: Project Page
Area: Image Generation
Keywords: Image Editing, Diffusion Models, Visual Conditioning, Language-Free, Training-Free

TL;DR

This paper shows that text-guided diffusion models suffer a text-visual alignment failure on simple visual transformations such as rain, haze, and blur. It proposes VDC, a framework that learns a purely visual conditioning signal from a single example pair (before and after the transformation) to guide diffusion-based editing, using no text prompts and no fine-tuning of the diffusion model. VDC is reported to surpass text-based and fine-tuning-based methods on tasks including deraining, dehazing, and denoising.

Background & Motivation

Text-guided diffusion models have made remarkable progress in image editing, yet state-of-the-art methods fail badly on simple, everyday visual transformations such as rain, blur, and haze.

Key Challenge: Diffusion models are trained on image-caption pairs and only learn concepts that captions describe explicitly. Visual phenomena that are rarely or ambiguously described (e.g., raindrops, haze) are poorly aligned in text-visual space: attention maps under the text condition "rain" remain object-centric and unrelated to rain characteristics.

Existing solutions:
- Fine-tuning: computationally expensive and data-demanding
- Stronger text conditioning: still cannot overcome the fundamental limit of text-visual alignment

Core Idea: The editing capability of diffusion models is not lost — it is merely hidden by text. Diffusion models already encode rich visual representations; accessing these representations through vision rather than language is feasible.

Method

Overall Architecture

VDC (Visual Diffusion Conditioning):
1. Takes as input a visual example pair (before \(R_B\) and after \(R_A\) transformation)
2. Optimizes a lightweight MLP to generate a visual conditioning embedding \(C^s\)
3. Performs DDIM inversion on the real input image, then applies conditioning guidance for editing
4. An inversion correction step preserves detail fidelity
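
Step 3 relies on DDIM inversion. As a minimal sketch of the standard deterministic DDIM update (not the authors' code), the snippet below runs inversion followed by sampling on a toy latent. `toy_eps` is a hypothetical stand-in for the diffusion model's noise predictor, and the `alphas` schedule and all function names are assumptions made for illustration.

```python
import math

def ddim_step(z, eps, alpha_from, alpha_to):
    """Deterministic DDIM update from noise level alpha_from to alpha_to.

    z and eps are lists of floats; the alphas are cumulative products
    (alpha-bar) in (0, 1].
    """
    out = []
    for z_i, e_i in zip(z, eps):
        # Predict the clean latent, then re-noise it to the target level.
        z0_pred = (z_i - math.sqrt(1.0 - alpha_from) * e_i) / math.sqrt(alpha_from)
        out.append(math.sqrt(alpha_to) * z0_pred + math.sqrt(1.0 - alpha_to) * e_i)
    return out

def toy_eps(z, t):
    # Hypothetical noise predictor; a real one would be the diffusion network.
    return [0.1 * z_i for z_i in z]

def ddim_invert(z0, alphas, eps_fn):
    """Walk a clean latent towards noise (alphas must be decreasing)."""
    z = list(z0)
    for t in range(len(alphas) - 1):
        z = ddim_step(z, eps_fn(z, t), alphas[t], alphas[t + 1])
    return z

def ddim_sample(z_T, alphas, eps_fn):
    """Walk a noisy latent back to the clean level."""
    z = list(z_T)
    for t in range(len(alphas) - 1, 0, -1):
        z = ddim_step(z, eps_fn(z, t), alphas[t], alphas[t - 1])
    return z

# Assumed schedule: 20 noise levels from almost clean to heavily noised.
alphas = [0.999 - i * (0.899 / 19) for i in range(20)]
z0 = [1.0, -0.5, 2.0]
z_rec = ddim_sample(ddim_invert(z0, alphas, toy_eps), alphas, toy_eps)
recon_error = max(abs(a - b) for a, b in zip(z0, z_rec))
```

In this toy setting the round trip is exact only when the noise estimate does not depend on the latent. With `toy_eps` the estimate is evaluated at a different latent on the way back, so `recon_error` is non-zero, which illustrates the kind of accumulated error the inversion correction step is meant to reduce.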

Key Designs

Condition Steering:
- Function: Guides unconditional diffusion sampling using posterior score steering
- Mechanism: Derived from the posterior score function, the noise prediction is rewritten as a weighted combination of conditional and unconditional predictions: \(\epsilon_\theta(z_t, -C^s) = (1-w) \cdot \epsilon_\theta(z_t, C^s) + w \cdot \epsilon_\theta(z_t, \phi)\), where \(\phi\) is the null (unconditional) condition and \(w\) is the steering weight. For removal-type tasks (deraining/dehazing), a negative conditioning direction is used to steer away from high-density regions of the target feature.
- Design Motivation: Editing the inverted image (\(out = Z(\phi) + Z(C_\theta)\)) rather than generating from scratch (\(out = Z(C_\theta)\)) resembles a global residual connection in image-to-image networks, avoiding generation artifacts.
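
The weighted combination in the steering formula can be sketched in a few lines. This is an illustrative element-wise blend on toy vectors, assuming the two noise predictions are already available; the function name `steer` and the example values are hypothetical, and the weight the authors actually use is not stated here.

```python
def steer(eps_cond, eps_uncond, w):
    """Blend conditional and unconditional noise predictions.

    Implements (1 - w) * eps_cond + w * eps_uncond element-wise.
    """
    return [(1.0 - w) * c + w * u for c, u in zip(eps_cond, eps_uncond)]

eps_cond = [1.0, 2.0]     # toy prediction under the learned condition C^s
eps_uncond = [0.0, 1.0]   # toy prediction under the null condition
print(steer(eps_cond, eps_uncond, 0.0))  # equals eps_cond
print(steer(eps_cond, eps_uncond, 1.0))  # equals eps_uncond
print(steer(eps_cond, eps_uncond, 2.0))  # extrapolates past eps_uncond, away from eps_cond
```

With \(w > 1\) the result moves beyond the unconditional prediction in the direction opposite to the conditional one, which matches the idea of steering away from the target feature in removal-type tasks.
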
Condition Generator:
- Function: Generates editing condition embeddings from a single visual example pair
- Mechanism: Inspired by implicit neural representations (INR), a lightweight 3-layer MLP maps token indices to conditioning embeddings, with Fourier feature transformation applied to inputs for enhanced expressiveness. An independent \(\text{MLP}_t\) is trained per diffusion timestep \(t\), with the optimization objective: \(\mathcal{L} = \|Z_0^B - Z^A\|_2^2 + \|D(Z_0^B) - R_A\|_2^2\)
- Design Motivation: Fully decoupled from text space. The MLP+INR formulation is more stable than directly optimizing text embeddings and enables optimization over all 77 tokens (text embedding optimization typically handles only a small number of tokens); the continuous function ensures natural inter-token communication.
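
To make the token-index-to-embedding mapping concrete, here is a minimal pure-Python sketch of a Fourier-feature encoding followed by a 3-layer MLP. Only the forward pass is shown; the optimization against the loss is omitted. The layer widths, the number of frequencies, the ReLU activation, the random initialisation, and all function names are assumptions for illustration, not details taken from the paper.

```python
import math
import random

NUM_TOKENS = 77  # from the text: all 77 tokens are produced

def fourier_features(token_idx, num_freqs=4):
    """Encode one token index as [sin, cos] features at several frequencies."""
    x = token_idx / (NUM_TOKENS - 1)  # normalise the index to [0, 1]
    feats = []
    for k in range(num_freqs):
        angle = (2.0 ** k) * math.pi * x
        feats.extend([math.sin(angle), math.cos(angle)])
    return feats

def init_mlp(in_dim, hidden_dim, out_dim, seed=0):
    """Randomly initialise a 3-layer MLP as a list of (weights, biases)."""
    rng = random.Random(seed)
    dims = [in_dim, hidden_dim, hidden_dim, out_dim]
    layers = []
    for d_in, d_out in zip(dims[:-1], dims[1:]):
        scale = 1.0 / math.sqrt(d_in)
        weights = [[rng.gauss(0.0, scale) for _ in range(d_out)] for _ in range(d_in)]
        layers.append((weights, [0.0] * d_out))
    return layers

def mlp_forward(layers, x):
    """Apply the MLP to one feature vector, with ReLU between layers."""
    for i, (weights, biases) in enumerate(layers):
        x = [sum(x[r] * weights[r][c] for r in range(len(x))) + biases[c]
             for c in range(len(biases))]
        if i < len(layers) - 1:
            x = [max(v, 0.0) for v in x]
    return x

def generate_condition(layers, num_freqs=4):
    """Return one embedding per token index (a NUM_TOKENS x out_dim table)."""
    return [mlp_forward(layers, fourier_features(i, num_freqs))
            for i in range(NUM_TOKENS)]

# Assumed sizes for illustration; the real embedding width is model-dependent.
layers = init_mlp(in_dim=8, hidden_dim=16, out_dim=32)
condition = generate_condition(layers)
```

Because neighbouring token indices receive similar Fourier features, their embeddings come from the same smooth function, which is one way to read the "natural inter-token communication" argument. Following the text, one such `layers` list would be kept per diffusion timestep.
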
DDIM Inversion Correction:
- Function: Reduces accumulated errors from DDIM inversion
- Mechanism: Iteratively optimizes the inversion starting point: \(Z_p \leftarrow Z_p - \text{AdamGrad}(\|\hat{Z}_0 - Z_0\|_2^2)\), i.e., repeatedly executing forward-backward cycles to adjust the noisy latent for more accurate reconstruction.
- Design Motivation: Accumulated errors in DDIM inversion lead to distortion in edited images; the correction step preserves perceptual quality without added complexity.
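
The correction loop can be illustrated on a toy problem. In the sketch below, `reconstruct` is a hypothetical linear stand-in for the forward-backward cycle that maps the noisy latent \(Z_p\) to a reconstruction \(\hat{Z}_0\), and a hand-written Adam update minimises \(\|\hat{Z}_0 - Z_0\|_2^2\). The step count, learning rate, and toy values are assumptions.

```python
import math

def reconstruct(z_p):
    # Hypothetical stand-in for one forward-backward cycle that maps the
    # noisy latent z_p to a reconstructed clean latent.
    return [0.9 * v + 0.05 for v in z_p]

def loss_and_grad(z_p, z0):
    """Squared reconstruction error and its gradient w.r.t. z_p."""
    residual = [r - t for r, t in zip(reconstruct(z_p), z0)]
    loss = sum(r * r for r in residual)
    grad = [2.0 * 0.9 * r for r in residual]  # chain rule through reconstruct
    return loss, grad

def correct_inversion(z_p, z0, steps=500, lr=0.01, b1=0.9, b2=0.999, eps=1e-8):
    """Refine z_p with Adam so that reconstruct(z_p) approaches z0."""
    z_p = list(z_p)
    m = [0.0] * len(z_p)
    v = [0.0] * len(z_p)
    for k in range(1, steps + 1):
        _, grad = loss_and_grad(z_p, z0)
        for i, g in enumerate(grad):
            m[i] = b1 * m[i] + (1.0 - b1) * g          # first-moment estimate
            v[i] = b2 * v[i] + (1.0 - b2) * g * g      # second-moment estimate
            m_hat = m[i] / (1.0 - b1 ** k)             # bias correction
            v_hat = v[i] / (1.0 - b2 ** k)
            z_p[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return z_p

z0 = [1.0, -2.0, 3.0]
z_init = list(z0)  # pretend this is an imperfect inverted latent
initial_loss, _ = loss_and_grad(z_init, z0)
z_corrected = correct_inversion(z_init, z0)
final_loss, _ = loss_and_grad(z_corrected, z0)
```

In a real pipeline the gradient would come from automatic differentiation through the sampler rather than from a closed form; here `final_loss` should end well below `initial_loss`, showing the starting point being adjusted until the reconstruction matches the target.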

Loss & Training

Condition generator optimization: \(\mathcal{L} = \|Z_0^B - Z^A\|_2^2 + \|D(Z_0^B) - R_A\|_2^2\) (joint loss in latent and pixel space), optimized with Adam, with an independent MLP per diffusion timestep.
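
As a small worked example of the joint objective, the sketch below evaluates the latent-space and pixel-space terms on toy vectors. `decode` stands for \(D\), assumed here to be the latent-to-pixel decoder; the identity decoder and the input values are hypothetical.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def joint_loss(z0_b, z_a, r_a, decode):
    """Latent-space term plus pixel-space term of the objective."""
    latent_term = squared_l2(z0_b, z_a)
    pixel_term = squared_l2(decode(z0_b), r_a)
    return latent_term + pixel_term

identity_decoder = lambda z: list(z)  # hypothetical stand-in for the decoder D
loss = joint_loss([1.0, 2.0], [1.0, 0.0], [0.0, 2.0], identity_decoder)
print(loss)  # 4.0 from the latent term + 1.0 from the pixel term = 5.0
```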

Key Experimental Results

Main Results

Method	SR FID↓	DeBlur FID↓	DeNoise FID↓	DeRain FID↓	DeHaze FID↓	Colorization FID↓
P2P	126.47	45.62	142.95	139.19	44.09	121.87
Null-Opt	73.48	51.89	160.88	167.61	—	—
VDC (Ours)	Best	Best	Best	Best	Best	Best

VDC is reported to achieve the best FID on all six tasks, outperforming both training-free and fully fine-tuned text-based editing methods.

Ablation Study

Configuration	Effect
Without inversion correction	LPIPS degrades significantly
Without Fourier features	Unstable condition generation
Global unified condition (non-per-step)	Reduced accuracy
Text embedding optimization instead of MLP	Unstable; only a small number of tokens can be optimized

Key Findings

- Diffusion model attention maps under text conditions such as "rain" are completely misaligned: they remain object-centric rather than degradation-centric
- Attention maps recovered by VDC correctly focus on rain streaks and haze regions
- A single visual example pair is sufficient to learn generalizable editing conditions
- VDC even surpasses methods specifically trained for image restoration tasks

Highlights & Insights

- Reveals a fundamental failure of text-visual alignment on appearance-level transformations, a finding of independent scientific value
- The insight that "diffusion editing capability is hidden rather than lost" is central: visual conditioning can unlock capabilities inaccessible to text
- The INR-inspired condition generator design is elegant: continuous functions combined with Fourier features enable stable full-token optimization
- The diffusion model itself is never trained or fine-tuned; only the lightweight condition generator is optimized per task, at substantially lower computational cost than fine-tuning approaches

Limitations & Future Work

- Requires a visual example pair (before and after transformation); obtaining such pairs may not always be straightforward
- Each new editing task requires re-optimizing the condition generator (~100 iterations)
- Validated only on image restoration/degradation tasks; applicability to semantic-level editing (e.g., object replacement) remains unexplored
- Relies on the quality of DDIM inversion; highly complex images may yield poor inversion results

- vs. Prompt-to-Prompt / Null-Opt: These methods modify text prompts or attention maps but still depend on text-visual alignment; VDC bypasses text entirely.
- vs. InstructPix2Pix and variants: Require large-scale instruction fine-tuning datasets and training; VDC requires zero training.
- vs. Diffusion-based inverse problem methods: Assume a known degradation operator (e.g., blur kernel) and cannot handle complex spatially-varying degradations; VDC learns from visual examples.

Rating

- Novelty: ⭐⭐⭐⭐⭐ A paradigm shift from text to purely visual conditioning; revealing the text-visual alignment failure carries significant scientific value
- Experimental Thoroughness: ⭐⭐⭐⭐ Six editing benchmarks with multiple baselines, though semantic-level editing experiments are absent
- Writing Quality: ⭐⭐⭐⭐⭐ Motivation is exceptionally clear; the attention map visualizations in Figure 2 are highly convincing
- Value: ⭐⭐⭐⭐⭐ Offers important insights for diffusion-based editing and image restoration; the framework is concise and practical