From Scale to Speed: Adaptive Test-Time Scaling for Image Editing

Conference: CVPR 2026
Paper: CVF Open Access
Code: To be confirmed
Area: Diffusion Models / Image Generation
Keywords: Image Editing, Test-Time Scaling, Image-CoT, Adaptive Sampling, Early-Stopping Pruning

TL;DR

Addressing the issue that T2I-oriented Image-CoT wastes computational power when directly applied to image editing, this paper proposes ADE-CoT. It dynamically allocates sampling budgets based on editing difficulty, replaces generic MLLM scoring with specialized "edit region + instruction consistency" verifiers for early pruning, and employs a depth-first "stop when sufficient" mechanism to eliminate redundant sampling. ADE-CoT achieves better image quality while accelerating inference by over 2× compared to Best-of-N across three SOTA editing models.

Background & Motivation

Background: Image Chain-of-Thought (Image-CoT) is a training-free, plug-and-play "test-time scaling" strategy. It improves generation quality by sampling multiple candidates during inference and selecting the best one via a verifier. Originally designed for Text-to-Image (T2I), the standard approach involves perturbing initial noise to sample \(N\) candidates followed by Best-of-N (BoN) selection. Advanced methods use MLLMs as verifiers to score intermediate states during denoising, pruning low-potential trajectories early to save computation.

Limitations of Prior Work: Fundamental differences exist between image editing and T2I. T2I is an open-ended task where large-scale sampling continuously yields diverse reasonable results. In contrast, editing is goal-oriented, where the solution space is strictly constrained by the source image and instructions; only a few correct answers exist regardless of noise perturbations or prompt rewriting. Applying T2I's Image-CoT directly to editing reveals three issues: (1) Inefficient resource allocation: All edits use a fixed budget (e.g., 32 samples), but simple edits (with high initial scores) gain almost nothing from Image-CoT, wasting power on easy samples. (2) Unreliable early verification: Edits often involve local or subtle changes difficult to distinguish in early denoising stages; generic MLLM scores misjudge these—40% of early low-scoring samples eventually achieve high scores but are erroneously pruned. (3) Result redundancy: Large-scale sampling produces multiple correct results with identical scores (edits with best scores in \([7,9)\) often have 15+ candidates sharing the top score), but editing only requires one result that aligns with the intent.

Key Challenge: The entire mechanism of existing Image-CoT (fixed budgets, general scoring, breadth-first parallel sampling of all candidates) is designed for "open-ended, the more the better" T2I, which is misaligned with the "goal-oriented, one is enough" nature of editing.

Goal / Core Idea: Shift the focus from "scale" to "speed." The paper proposes the on-demand ADE-CoT with three targeted strategies: dynamic budget allocation by difficulty, accurate early pruning using edit-specific metrics, and opportunistic depth-first stopping to cut redundancy—significantly improving efficiency while maintaining editing accuracy.

Method

Overall Architecture

ADE-CoT takes a source image \(I_{src}\) and an editing instruction \(c\) to produce an edited image \(I\) semantically aligned with \(c\). The pipeline transforms traditional BoN's "fixed budget + breadth-first sampling + general scoring" into a three-stage adaptive process.

First, the editing difficulty is estimated using the initial score of a single candidate to dynamically determine the total budget (difficulty-aware budget). Subsequently, a breadth-first search is conducted in the early stage of denoising, but with an edit-specific verifier (region localization + instruction-caption consistency) instead of general scores for pruning. Visually similar redundant candidates are discarded, and surviving candidates are ranked. Finally, the process switches to depth-first search in the late stage, generating candidates one by one based on early scores. Each is verified by an instance-specific verifier; the process stops immediately once \(N_{high}\) intent-aligned results are collected. These stages correspond to "saving on simple samples," "pruning more accurately," and "cutting redundancy."

%%{init: {'flowchart': {'rankSpacing': 24, 'nodeSpacing': 28, 'padding': 6, 'wrappingWidth': 400}}}%%
flowchart TD
    A["Source Image + Instruction"] --> B["Difficulty-Aware Budget Allocation<br/>Difficulty via Initial Single-Sample Score<br/>Determine Dynamic Budget Na"]
    B --> C["Specialized Early Verification & Pruning<br/>One-step Preview + Region/Caption Scoring<br/>Filter Similarity + Rank by Score"]
    C --> D["Depth-First Opportunistic Stopping<br/>Sequential Late-Stage Generation<br/>Verification via Instance-Specific Verifier"]
    D -->|"Collect Nhigh Aligned Results"| E["Output Optimal Edited Image"]

Key Designs

1. Difficulty-Aware Resource Allocation: Allocating less for simple edits and more for difficult ones

This addresses the "fixed budget wastes compute" pain point. A single candidate is initially generated and scored by a verifier \(\text{Vrf}\) to yield \(S\), acting as a proxy for difficulty. A high score suggests a simple edit where further sampling offers diminishing returns, whereas a low score indicates difficulty justifying a larger search. The adaptive budget \(N_a\) is allocated as:

\[N_a = N_{\min} + \lceil (N - N_{\min}) \times (1 - S/S_{\max})^{\gamma} \rceil\]

where \(N_{\min}\) and \(N\) are the minimum and original budgets, \(S_{\max}\) is the maximum score, and \(\gamma\) controls sensitivity. As \(S \to S_{\max}\) (easy), \(N_a\) converges to \(N_{\min}\); as \(S \to 0\) (difficult), \(N_a\) approaches \(N\). This directs computation toward difficult edits. Because \(1 - S/S_{\max} \le 1\), a larger \(\gamma\) shrinks the budget for every intermediate score. In experiments, \(\gamma\) was set to 0.15: NFE decreased steadily as \(\gamma\) grew, while quality remained stable up to that value.
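
To make the budget rule concrete, here is a minimal sketch of the formula above. \(N=32\) and \(\gamma=0.15\) come from the text; `n_min=4` and `s_max=10.0` are illustrative assumptions, not values stated here.

```python
import math

def adaptive_budget(score, n_total=32, n_min=4, s_max=10.0, gamma=0.15):
    """N_a = N_min + ceil((N - N_min) * (1 - S / S_max) ** gamma).

    n_min and s_max are ASSUMED defaults chosen for illustration only.
    """
    if not 0.0 <= score <= s_max:
        raise ValueError("score must lie in [0, s_max]")
    difficulty = (1.0 - score / s_max) ** gamma  # 1.0 = hardest, 0.0 = easiest
    return n_min + math.ceil((n_total - n_min) * difficulty)

# Budget for a few initial scores, from hardest to easiest.
for s in (0.0, 5.0, 9.0, 10.0):
    print(s, adaptive_budget(s))
```

Under these assumed defaults, a small exponent such as 0.15 keeps the budget close to \(N\) for mid-range scores and only cuts it sharply as \(S\) nears \(S_{\max}\).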

2. Edit-Specific Verification + Similarity Filtering: Replacing generic scoring with "correctness of modification logic"

This stage targets the problem that generic MLLM scores wrongly prune about 40% of the early low-scoring samples that would have ended with high scores. It has three components. First is the one-step preview: scoring the noisy latent \(x_{t_e}\) at early time \(t_e\) directly is difficult. Since modern editing models often use flow matching, the authors use one-step extrapolation to approximate the clean latent \(x_{0|t_e} = x_{t_e} - \sigma_{t_e}\epsilon_\theta(x_{t_e}, T_{t_e})\), which is decoded into a preview image without extra denoising steps.
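
The extrapolation can be sketched in a few lines. This is a toy illustration, not the authors' code: it assumes a linear flow-matching path \(x_t = (1-\sigma)x_0 + \sigma\,\text{noise}\), for which an ideal model output equals \(\text{noise} - x_0\). The function name and all array shapes are hypothetical.

```python
import numpy as np

def one_step_preview(x_t, sigma_t, model_output):
    """Extrapolate a clean-latent estimate: x_{0|t} = x_t - sigma_t * model_output."""
    return x_t - sigma_t * model_output

# Toy check under the ASSUMED linear path x_t = (1 - sigma) * x_0 + sigma * noise,
# where a perfect velocity prediction is (noise - x_0).
rng = np.random.default_rng(0)
x0 = rng.normal(size=(4, 8, 8))       # stand-in for a clean latent
noise = rng.normal(size=x0.shape)
sigma = 0.7                           # early step: the latent is still mostly noise
x_t = (1.0 - sigma) * x0 + sigma * noise
ideal_output = noise - x0             # what an ideal model would return here

preview = one_step_preview(x_t, sigma, ideal_output)
print(np.allclose(preview, x0))       # exact recovery only because the output is ideal
```

With a real model the prediction is imperfect, so the preview is only an approximation of the final image, which is why it is paired with dedicated verifiers.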

Second, two edit-specific verifiers complement the general score \(S_{gen}\). Edit region correctness: MLLMs are prompted with \(P_{reg}\) to identify objects to be modified/retained, which are fed into Grounded SAM2 to generate a binary mask \(M\) of the expected edit region. The mean change map over the \(C\) color channels, \(\Delta = \frac{1}{C}\sum_{k=1}^{C}|I^{(k)} - I_{src}^{(k)}|\), is calculated, and a pixel-level softmax over \(\Delta\) gives weights that are summed inside the mask: \(S_{reg} = \sum_{H,W} M \odot \text{softmax}_{H,W}(\Delta)\). Higher \(S_{reg}\) indicates changes concentrated in the intended area. Instruction-caption consistency: Since ground-truth captions are unavailable at test time, MLLMs generate a target caption \(c_{cap}\) based on the source image and instruction via prompt \(P_{cap}\), followed by calculating \(S_{cap} = \text{CLIPScore}(I, c_{cap})\). The unified score \(S = S_{gen} + \lambda_{reg}S_{reg} + \lambda_{cap}S_{cap}\) is used to prune candidates below a rejection threshold \(S_{rj}\). Crucially, \(S_{reg}\) and \(S_{cap}\) require only one MLLM query per edit. This reduced misjudgments in the high-score range \([6,9)\) by 63% (from 235 to 86) while maintaining low-score pruning accuracy.
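
A small NumPy sketch of the region score and the unified score may help. The mask is passed in directly (the Grounded SAM2 step is not reproduced), the \(\lambda\) defaults of 1.0 are placeholders, and the pixel intensity scale is an assumption: the softmax sharpness depends on it, and no temperature is specified here.

```python
import numpy as np

def region_score(edited, source, mask):
    """S_reg: softmax-weighted share of the pixel change that falls inside the mask.

    edited/source: (H, W, C) arrays; mask: (H, W) binary array (ASSUMED given).
    """
    delta = np.abs(edited.astype(np.float64) - source.astype(np.float64)).mean(axis=-1)
    weights = np.exp(delta - delta.max())   # numerically stable softmax over all pixels
    weights /= weights.sum()
    return float((mask * weights).sum())

def unified_score(s_gen, s_reg, s_cap, lam_reg=1.0, lam_cap=1.0):
    """S = S_gen + lambda_reg * S_reg + lambda_cap * S_cap (lambda values are placeholders)."""
    return s_gen + lam_reg * s_reg + lam_cap * s_cap

# Toy images: the expected edit region is a 2x2 patch.
source = np.zeros((8, 8, 3))
mask = np.zeros((8, 8))
mask[2:4, 2:4] = 1.0
on_target = source.copy()
on_target[2:4, 2:4] = 5.0    # change lands inside the mask
off_target = source.copy()
off_target[6:8, 6:8] = 5.0   # change lands elsewhere

print(region_score(on_target, source, mask), region_score(off_target, source, mask))
```

The on-target edit scores close to 1 and the off-target edit close to 0, which is the behavior the text describes for \(S_{reg}\).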

Third, visual similarity filtering: DINOv2 is used to extract visual embeddings of preview images; if the similarity between two candidates exceeds threshold \(\tau_{sim}\), the lower-scoring one is discarded to eliminate redundancy. Remaining candidates are ranked by \(S\) for the next stage.
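
One way to realize this filter is a greedy pass in descending score order, sketched below. Cosine similarity, the greedy order, and `tau_sim=0.9` are assumptions for illustration; the embeddings would come from DINOv2 in the described pipeline but are plain vectors here.

```python
import numpy as np

def filter_similar(embeddings, scores, tau_sim=0.9):
    """Return indices of kept candidates, best score first.

    A candidate is dropped when its cosine similarity to an already kept,
    higher-scoring candidate exceeds tau_sim (ASSUMED threshold).
    """
    emb = np.asarray(embeddings, dtype=np.float64)
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    order = np.argsort(-np.asarray(scores, dtype=np.float64), kind="stable")
    kept = []
    for i in order:
        if all(float(emb[i] @ emb[j]) <= tau_sim for j in kept):
            kept.append(int(i))
    return kept

# Candidates 0 and 1 are near-duplicates; candidate 2 is visually distinct.
embeddings = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]]
scores = [7.0, 6.5, 6.8]
print(filter_similar(embeddings, scores))
```

Because the returned indices are already sorted by score, the same list can serve as the ranking handed to the next stage.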

3. Depth-First Opportunistic Stopping: Stopping when sufficient rather than sampling all

This stage targets the redundancy of collecting many equally scored correct results. Unlike the breadth-first approach of BoN/PRM/PARM, a depth-first approach is adopted: candidates are generated sequentially, in the order given by their early scores. It consists of two parts. Late-stage retention: At a later time \(t_l\) (\(t_e < t_l < T\)), a second preview and unified score are generated, using an adaptive threshold to prune sub-optimal samples. Instance-specific verifier: Generic scores \(S_{gen}\) often fail to distinguish between multiple high-scoring candidates or miss subtle errors. The authors found that "two-stage QA" guides MLLMs to notice key details: the MLLM first generates a set of yes/no questions (covering instruction following, aesthetics, etc.) via prompt \(P_q\), then answers them via \(P_a\). The instance-specific score \(S_{spec}\) is the count of "yes" answers, so incorrect candidates are penalized. Generation stops once \(N_{high}\) (default 4) intent-aligned results are collected, from which the highest scorer is output.
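
The stopping logic can be sketched as a simple loop. This is an illustrative skeleton, not the authors' implementation: `finish`, `is_aligned`, and `spec_score` are hypothetical callables standing in for late-stage denoising, the intent-alignment check, and the yes-count score \(S_{spec}\); late-stage retention is omitted, and returning `None` when nothing is aligned is an assumption.

```python
def depth_first_select(ranked_candidates, finish, is_aligned, spec_score, n_high=4):
    """Finish candidates one at a time and stop once n_high aligned results exist.

    Returns (best_image_or_None, number_of_candidates_finished).
    """
    collected = []
    finished = 0
    for cand in ranked_candidates:
        image = finish(cand)              # complete the remaining denoising steps
        finished += 1
        if is_aligned(image):
            collected.append((spec_score(image), image))
            if len(collected) >= n_high:  # enough good results: skip the rest
                break
    best = max(collected, key=lambda pair: pair[0])[1] if collected else None
    return best, finished

# Toy stand-ins: integers play the role of images, even numbers count as aligned.
best, finished = depth_first_select(
    ranked_candidates=range(10),
    finish=lambda c: c,
    is_aligned=lambda img: img % 2 == 0,
    spec_score=lambda img: img % 5,
)
print(best, finished)
```

In the toy run the loop finishes 7 of the 10 candidates before the fourth aligned result arrives, so the last three are never generated.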

Loss & Training

ADE-CoT is a training-free, plug-and-play test-time method. Key hyperparameters for the three SOTA models (Kontext/BAGEL/Step1X-Edit): total steps \(T=28/50/28\), \(t_e=8/16/8\), \(t_l=16/36/16\); these step counts match the BoN baseline NFE of 896/1600/896 at \(N=32\) and satisfy \(t_e < t_l < T\). Qwen-VL-MAX is used for MLLM queries, VIE-Score for general scoring, and 5 yes/no questions per edit.

Key Experimental Results

Main Results

Evaluated on GEdit-Bench (real user edits), AnyEdit-Test (local/global/implicit edits), and Reason-Edit (complex reasoning) using FLUX.1 Kontext, BAGEL, and Step1X-Edit. Efficiency is measured by NFE (total denoising steps), alongside Inference Efficiency \(\eta = \frac{1}{M}\sum_i \sigma_i \cdot \frac{S(i)}{S_{\max}} \cdot \frac{NT}{NFE(i)}\) (where \(\sigma_i=1\) if the result is not inferior to BoN) and Result Efficiency \(\xi = \frac{1}{M}\sum_i \sigma_i \frac{NFE^{min}(i)}{NFE(i)}\) (measuring redundancy; higher means fewer wasted evaluations). Main results on GEdit-Bench with fixed budget \(N=32\):

| Model | Method | G_O ↑ | η ↑ | ξ ↑ |
|---|---|---|---|---|
| FLUX.1 Kontext | BoN | 6.641 | 0.66 | 0.12 |
| FLUX.1 Kontext | TTS-EF | 6.376 | 0.98 | 0.57 |
| FLUX.1 Kontext | ADE-CoT | 6.695 | 1.47 | 0.66 |
| BAGEL | BoN | 6.908 | 0.69 | 0.14 |
| BAGEL | ADE-CoT | 6.972 | 1.27 | 0.62 |
| Step1X-Edit | BoN | 7.157 | 0.72 | 0.13 |
| Step1X-Edit | ADE-CoT | 7.196 | 1.45 | 0.62 |

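On GEdit-Bench, ADE-CoT raises inference efficiency \(\eta\) over BoN by about 2.2× on FLUX.1 Kontext (0.66 to 1.47), 1.8× on BAGEL (0.69 to 1.27), and 2.0× on Step1X-Edit (0.72 to 1.45), with slightly higher \(G_O\) in each case. Result efficiency \(\xi\) improves by 4.9×/2.7×/2.9× on average across benchmarks. The baselines fall short because PRM/PARM misjudge early previews, and TTS-EF is unreliable when sampling scales up: on Kontext its \(G_O\) (6.376) is below that of BoN (6.641).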
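
For readers who want to compute the two metrics, here is a minimal sketch. `s_max=10.0` is an assumption, and the demo records are made-up numbers, not results from the paper. \(\xi\) is written with \(NFE^{min}\) in the numerator, the orientation under which higher is better and values stay below 1, as in the table; the exact definition of \(NFE^{min}(i)\) is not given here.

```python
def inference_efficiency(records, n, t, s_max=10.0):
    """eta = mean_i sigma_i * (S_i / S_max) * (N * T / NFE_i); s_max is ASSUMED."""
    total = sum(r["sigma"] * r["score"] / s_max * (n * t) / r["nfe"] for r in records)
    return total / len(records)

def result_efficiency(records):
    """xi = mean_i sigma_i * NFE_min_i / NFE_i (orientation ASSUMED, see lead-in)."""
    return sum(r["sigma"] * r["nfe_min"] / r["nfe"] for r in records) / len(records)

# Hypothetical BoN-style records: every edit spends the full N * T = 32 * 28 = 896 steps.
bon = [
    {"sigma": 1, "score": 6.0, "nfe": 896, "nfe_min": 112},
    {"sigma": 1, "score": 8.0, "nfe": 896, "nfe_min": 112},
]
# Hypothetical adaptive records: same scores reached with fewer steps.
adaptive = [
    {"sigma": 1, "score": 6.0, "nfe": 448, "nfe_min": 112},
    {"sigma": 1, "score": 8.0, "nfe": 224, "nfe_min": 112},
]

print(inference_efficiency(bon, n=32, t=28), result_efficiency(bon))
print(inference_efficiency(adaptive, n=32, t=28), result_efficiency(adaptive))
```

When \(NFE(i) = NT\), as for BoN, \(\eta\) reduces to the mean normalized score, which is consistent with the BoN rows in the table (for example 6.641 against \(\eta = 0.66\)).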

Ablation Study (Incremental addition, GEdit-Bench, G_O / NFE)

| Configuration | Kontext | BAGEL | Step1X-Edit |
|---|---|---|---|
| Baseline (BoN) | 6.641 / 896 | 6.908 / 1600 | 7.157 / 896 |
| + Difficulty-Aware Budget | 6.641 / 797 | 6.909 / 1391 | 7.157 / 778 |
| + Early Pruning (\(S_{gen}\)) | 6.642 / 719 | 6.912 / 1351 | 7.157 / 719 |
| + Early Pruning (Unified \(S\)) | 6.647 / 673 | 6.916 / 1290 | 7.161 / 638 |
| + Similarity Filtering | 6.651 / 508 | 6.915 / 1087 | 7.162 / 522 |
| + Late Retention | 6.652 / 464 | 6.935 / 972 | 7.163 / 462 |
| + Instance-Specific Verifier | 6.702 / 464 | 6.984 / 972 | 7.206 / 462 |
| + Opportunistic Stop (Full) | 6.695 / 418 | 6.972 / 882 | 7.196 / 434 |

Key Findings

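  • NFE reduction is led by similarity filtering and difficulty-aware budgeting: on Kontext, NFE drops from 896 to 418 (≈2.1× fewer evaluations). Similarity filtering gives the largest single cut (673 to 508), followed by the difficulty-aware budget (896 to 797); opportunistic stopping removes a further 46 (464 to 418).
  • Instance-specific verifier is the main driver for quality improvement: It captures detailed errors (e.g., "head tilted rather than forward") that generic scores miss, raising \(G_O\) by 0.050/0.049/0.043 on Kontext/BAGEL/Step1X-Edit at unchanged NFE. Opportunistic stopping then gives back about 0.01 of that gain in exchange for fewer evaluations.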
  • Unified score \(S\) is more accurate and efficient than \(S_{gen}\): Using \(S\) allows for higher rejection thresholds, further reducing NFE without losing quality. One-step preview via flow-matching extrapolation outperformed adding extra denoising steps.
  • \(N_{high}=4, \gamma=0.15\) are optimal balance points: Performance saturates after \(N_{high} \ge 4\) while NFE continues to rise; quality declines only after \(\gamma\) exceeds 0.15.
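
The per-stage NFE savings can be read off the Kontext column of the ablation table with a few lines of arithmetic. The NFE values below are copied from that table; only the helper name is invented.

```python
# (configuration, NFE) pairs from the Kontext column of the ablation table.
kontext_nfe = [
    ("Baseline (BoN)", 896),
    ("+ Difficulty-Aware Budget", 797),
    ("+ Early Pruning (S_gen)", 719),
    ("+ Early Pruning (Unified S)", 673),
    ("+ Similarity Filtering", 508),
    ("+ Late Retention", 464),
    ("+ Instance-Specific Verifier", 464),
    ("+ Opportunistic Stop (Full)", 418),
]

def stage_savings(stages):
    """NFE removed by each added component relative to the previous row."""
    return [(name, prev - cur) for (_, prev), (name, cur) in zip(stages, stages[1:])]

savings = stage_savings(kontext_nfe)
for name, saved in savings:
    print(f"{name}: -{saved}")
print("total:", sum(saved for _, saved in savings))
```

The same helper can be applied to the BAGEL and Step1X-Edit columns to compare how the savings are distributed across models.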

Highlights & Insights

  • The insight "task nature determines scaling strategy" is compelling: T2I is open-ended (more is better), while editing is goal-oriented (one is enough). This dichotomy makes the shift from "scale" to "speed" logical rather than just a collection of tricks.
  • One-step preview with flow-matching is clever: providing an early glimpse of the clean latent without extra cost serves as a cheap foundation for all subsequent verification.
  • Two-stage yes/no QA transforms generic scoring into a "task-specific checklist," effectively adding targeted attention to the verifier—a concept transferable to any scenario where generic scores fail to distinguish candidates.
  • \(S_{reg}\), built on Grounded SAM2, quantifies "modifying the right area" without requiring ground truth, which is uncommon among region-level verifiers for editing tasks.

Limitations & Future Work

  • The pipeline heavily relies on external models (Qwen-VL, SAM2, CLIP, etc.); the reliability of \(S_{reg}/S_{cap}\) is capped by these components. MLLM errors propagate to pruning and stopping decisions.
  • Using a single candidate's score as the difficulty proxy is a coarse estimate. Randomness in that one sample can misallocate the budget, and a variance analysis would be needed to quantify the effect.
  • Hyperparameter sensitivity: Several thresholds (\(\gamma, \tau_{sim}, \lambda, S_{rj}, t_e, t_l\)) require tuning across models. While default values worked for three models, the generalization across broader datasets and models remains to be fully explored.
  • The framework is bound by the base model's editing capability: It optimizes selection within a model's output distribution rather than fundamentally improving the model's intrinsic editing ability.

Comparison with Related Work

  • vs Best-of-N (BoN): BoN uses a fixed budget and breadth-first sampling. ADE-CoT uses dynamic budgets and opportunistic stopping, achieving 2× acceleration with equal or superior quality.
  • vs PRM / PARM: These also prune during denoising but rely on generic MLLM scores that misjudge subtle editing details; ADE-CoT's specialized verifiers reduce high-score region misjudgments by 63%.
  • vs TTS-EF (ICEdit): TTS-EF introduced Image-CoT to editing via extra denoising steps for early previews but only selects a single best candidate (unreliable for large scales). ADE-CoT uses "one-step previews," specialized scores, and depth-first stopping for a win-win in efficiency and quality.

Rating

  • Novelty: ⭐⭐⭐⭐ The shift from open-ended to goal-oriented scaling is a strong design principle. While individual components (pruning/QA/budget) are known, their synthesis is highly effective.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ Extensive benchmarks, three SOTA models, clear ablation of every strategy, and well-designed efficiency metrics (\(\eta, \xi\)).
  • Writing Quality: ⭐⭐⭐⭐ Clear problem analysis and motivation; mathematical formulations are straightforward.
  • Value: ⭐⭐⭐⭐ Training-free and plug-and-play; 2× speedup is highly practical for deploying Image-CoT in real-world editing applications.