From Scale to Speed: Adaptive Test-Time Scaling for Image Editing¶

Conference: CVPR 2026
Paper: CVF Open Access
Code: To be confirmed
Area: Diffusion Models / Image Generation
Keywords: Image Editing, Test-Time Scaling, Image-CoT, Adaptive Sampling, Early-Stopping Pruning

TL;DR¶

T2I-oriented Image-CoT wastes compute when applied directly to image editing. To fix this, the paper proposes ADE-CoT, which (1) allocates the sampling budget dynamically according to editing difficulty, (2) replaces generic MLLM scoring with specialized "edit region + instruction consistency" verifiers for early pruning, and (3) uses a depth-first "stop when sufficient" mechanism to eliminate redundant sampling. Across three SOTA editing models, ADE-CoT achieves better image quality than Best-of-N while accelerating inference by over 2×.

Background & Motivation¶

Background: Image Chain-of-Thought (Image-CoT) is a training-free, plug-and-play "test-time scaling" strategy. It improves generation quality by sampling multiple candidates during inference and selecting the best one via a verifier. Originally designed for Text-to-Image (T2I), the standard approach involves perturbing initial noise to sample \(N\) candidates followed by Best-of-N (BoN) selection. Advanced methods use MLLMs as verifiers to score intermediate states during denoising, pruning low-potential trajectories early to save computation.

Limitations of Prior Work: Image editing differs fundamentally from T2I. T2I is an open-ended task in which large-scale sampling keeps yielding diverse, reasonable results. Editing is goal-oriented: the solution space is tightly constrained by the source image and the instruction, so only a few correct answers exist regardless of noise perturbations or prompt rewriting. Applying T2I's Image-CoT directly to editing exposes three issues: (1) Inefficient resource allocation: every edit receives the same fixed budget (e.g., 32 samples), yet simple edits (those with high initial scores) gain almost nothing from Image-CoT, so compute is wasted on easy samples. (2) Unreliable early verification: edits often involve local or subtle changes that are hard to distinguish in early denoising stages, and generic MLLM scores misjudge them; 40% of samples that score low early eventually reach high scores but are erroneously pruned. (3) Result redundancy: large-scale sampling produces many correct results with identical scores (edits whose best score falls in \([7,9)\) often have 15+ candidates sharing the top score), whereas editing needs only one result that matches the intent.

Key Challenge: The entire mechanism of existing Image-CoT (fixed budgets, general scoring, breadth-first parallel sampling of all candidates) is designed for "open-ended, the more the better" T2I, which is misaligned with the "goal-oriented, one is enough" nature of editing.

Goal / Core Idea: Shift the focus from "scale" to "speed." The paper proposes the on-demand ADE-CoT with three targeted strategies: dynamic budget allocation by difficulty, accurate early pruning using edit-specific metrics, and opportunistic depth-first stopping to cut redundancy—significantly improving efficiency while maintaining editing accuracy.

Method¶

Overall Architecture¶

ADE-CoT takes a source image \(I_{src}\) and an editing instruction \(c\) to produce an edited image \(I\) semantically aligned with \(c\). The pipeline transforms traditional BoN's "fixed budget + breadth-first sampling + general scoring" into a three-stage adaptive process.

First, the editing difficulty is estimated using the initial score of a single candidate to dynamically determine the total budget (difficulty-aware budget). Subsequently, a breadth-first search is conducted in the early stage of denoising, but with an edit-specific verifier (region localization + instruction-caption consistency) instead of general scores for pruning. Visually similar redundant candidates are discarded, and surviving candidates are ranked. Finally, the process switches to depth-first search in the late stage, generating candidates one by one based on early scores. Each is verified by an instance-specific verifier; the process stops immediately once \(N_{high}\) intent-aligned results are collected. These stages correspond to "saving on simple samples," "pruning more accurately," and "cutting redundancy."

%%{init: {'flowchart': {'rankSpacing': 24, 'nodeSpacing': 28, 'padding': 6, 'wrappingWidth': 400}}}%%
flowchart TD
    A["Source Image + Instruction"] --> B["Difficulty-Aware Budget Allocation<br/>Difficulty via Initial Single-Sample Score<br/>Determine Dynamic Budget Na"]
    B --> C["Specialized Early Verification & Pruning<br/>One-step Preview + Region/Caption Scoring<br/>Filter Similarity + Rank by Score"]
    C --> D["Depth-First Opportunistic Stopping<br/>Sequential Late-Stage Generation<br/>Verification via Instance-Specific Verifier"]
    D -->|"Collect Nhigh Aligned Results"| E["Output Optimal Edited Image"]

Key Designs¶

1. Difficulty-Aware Resource Allocation: Allocating less for simple edits and more for difficult ones

This addresses the "fixed budget wastes compute" pain point. A single candidate is initially generated and scored by a verifier \(\text{Vrf}\) to yield \(S\), acting as a proxy for difficulty. A high score suggests a simple edit where further sampling offers diminishing returns, whereas a low score indicates difficulty justifying a larger search. The adaptive budget \(N_a\) is allocated as:

\[N_a = N_{\min} + \lceil (N - N_{\min}) \times (1 - S/S_{\max})^{\gamma} \rceil\]

where \(N_{\min}\) and \(N\) are the minimum and original budgets, \(S_{\max}\) is the maximum score, and \(\gamma\) controls sensitivity. As \(S \to S_{\max}\) (easy), \(N_a\) converges to \(N_{\min}\); as \(S \to 0\) (difficult), \(N_a\) approaches \(N\). This directs computation precisely toward difficult edits. In experiments, \(\gamma\) was set to 0.15, as NFE decreased steadily while quality remained stable until that point.
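A minimal sketch of the budget rule above, in plain Python. The function name `adaptive_budget` and the defaults `n_min=4` and `s_max=10.0` are assumptions for illustration (this summary does not state \(N_{\min}\) or \(S_{\max}\)); `n_max=32` and `gamma=0.15` follow the values quoted in the text.

```python
import math


def adaptive_budget(score, n_max=32, n_min=4, s_max=10.0, gamma=0.15):
    """N_a = N_min + ceil((N - N_min) * (1 - S / S_max) ** gamma).

    n_min and s_max are assumed values, not taken from the paper.
    """
    if not 0.0 <= score <= s_max:
        raise ValueError("score must lie in [0, s_max]")
    difficulty = 1.0 - score / s_max          # 0 = easy edit, 1 = hard edit
    return n_min + math.ceil((n_max - n_min) * difficulty ** gamma)


# Budget for an easy, a fairly easy, and a hard first sample.
budgets = {s: adaptive_budget(s) for s in (10.0, 8.0, 0.0)}
```

Under these assumed defaults, an initial score of 8 still receives 26 of 32 samples: a small \(\gamma\) keeps the budget close to \(N\) for all but the highest-scoring edits, and only a score at \(S_{\max}\) falls all the way to \(N_{\min}\).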

2. Edit-Specific Verification + Similarity Filtering: Replacing generic scoring with "correctness of modification logic"

This targets the "40% misjudgment of high-potential samples by generic MLLMs" issue. The stage has three components. The first is the one-step preview: scoring the noisy latent \(x_{t_e}\) at an early time \(t_e\) directly is difficult. Since modern editing models often use flow matching, the authors use one-step extrapolation to approximate the clean latent, \(x_{0|t_e} = x_{t_e} - \sigma_{t_e}\epsilon_\theta(x_{t_e}, T_{t_e})\), which is decoded into a preview image without extra denoising steps.
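The extrapolation can be illustrated with a toy example. The sketch below assumes the common rectified-flow parameterization, \(x_t = (1-\sigma_t)x_0 + \sigma_t\epsilon\) with the network predicting the velocity \(\epsilon - x_0\); that parameterization and the name `one_step_preview` are assumptions, since the summary gives only the extrapolation formula. Latents are plain lists of floats to keep the example dependency-free.

```python
def one_step_preview(x_t, model_output, sigma_t):
    """Approximate the clean latent: x0_hat = x_t - sigma_t * model_output."""
    return [x - sigma_t * v for x, v in zip(x_t, model_output)]


# Toy check with a known clean latent and a known noise sample.
x0 = [1.0, -2.0, 0.5]
eps = [0.3, 0.1, -0.7]
sigma = 0.7                                      # early step: mostly noise

# Noisy latent on the straight path between x0 and eps (assumed form).
x_t = [(1 - sigma) * a + sigma * e for a, e in zip(x0, eps)]

# Ideal velocity on that path; a real model only approximates this.
v_exact = [e - a for a, e in zip(x0, eps)]

preview = one_step_preview(x_t, v_exact, sigma)  # recovers x0 up to rounding
```

With the exact velocity the preview equals the clean latent; with a learned model the prediction at an early step is imperfect, so the decoded preview is only an approximation of the final edit.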

Second, two edit-specific verifiers complement the general score \(S_{gen}\). Edit region correctness: an MLLM is prompted with \(P_{reg}\) to identify the objects to be modified or retained, and these are fed into Grounded SAM2 to generate a binary mask \(M\) of the expected edit region. The channel-averaged change map \(\Delta = \frac{1}{C}\sum_{c=1}^{C}|I^{(c)} - I_{src}^{(c)}|\) is computed, and a pixel-level softmax over \(\Delta\) weights the aggregation inside the mask: \(S_{reg} = \sum_{H,W} M \odot \text{softmax}_{H,W}(\Delta)\). A higher \(S_{reg}\) indicates that changes are concentrated in the intended area. Instruction-caption consistency: since ground-truth captions are unavailable at test time, the MLLM generates a target caption \(c_{cap}\) from the source image and instruction via prompt \(P_{cap}\), and \(S_{cap} = \text{CLIPScore}(I, c_{cap})\) is computed. Candidates whose unified score \(S = S_{gen} + \lambda_{reg}S_{reg} + \lambda_{cap}S_{cap}\) falls below a rejection threshold \(S_{rj}\) are pruned. Crucially, \(S_{reg}\) and \(S_{cap}\) require only one MLLM query per edit. The unified score reduced misjudgments in the high-score range \([6,9)\) by 63% (235 to 86) while maintaining low-score pruning accuracy.
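The region score and the unified score can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: images are flattened lists of RGB tuples with values in [0, 1], the mask is a flat list of 0/1 values (in the real pipeline it would come from Grounded SAM2), the weights `lam_reg` and `lam_cap` default to 1.0 because the summary does not give them, and no softmax temperature is applied because none is stated.

```python
import math


def change_map(image, source):
    """Per-pixel mean absolute difference over channels (the Delta map)."""
    return [sum(abs(a - b) for a, b in zip(p, q)) / len(p)
            for p, q in zip(image, source)]


def region_score(delta, mask):
    """S_reg: softmax of delta over all pixels, summed inside the mask."""
    peak = max(delta)
    weights = [math.exp(d - peak) for d in delta]   # numerically stable softmax
    total = sum(weights)
    return sum(w for w, m in zip(weights, mask) if m) / total


def unified_score(s_gen, s_reg, s_cap, lam_reg=1.0, lam_cap=1.0):
    """S = S_gen + lam_reg * S_reg + lam_cap * S_cap (weights are assumed)."""
    return s_gen + lam_reg * s_reg + lam_cap * s_cap


# Four pixels; the edit changed only the first two.
source = [(0.2, 0.2, 0.2)] * 4
edited = [(0.9, 0.1, 0.1), (0.8, 0.2, 0.3), (0.2, 0.2, 0.2), (0.2, 0.2, 0.2)]
delta = change_map(edited, source)

on_target = region_score(delta, mask=[1, 1, 0, 0])   # mask covers the change
off_target = region_score(delta, mask=[0, 0, 1, 1])  # mask misses the change
```

Because the softmax is normalized over the whole image, \(S_{reg}\) lies in [0, 1] and the two complementary masks sum to 1; the mask that covers the changed pixels receives the larger share.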

Third, visual similarity filtering: DINOv2 is used to extract visual embeddings of preview images; if the similarity between two candidates exceeds threshold \(\tau_{sim}\), the lower-scoring one is discarded to eliminate redundancy. Remaining candidates are ranked by \(S\) for the next stage.
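One way to realize the similarity filter is a greedy pass in descending score order, sketched below. The greedy strategy, the function names, and the threshold `tau_sim=0.95` are assumptions for illustration; the embeddings here are toy vectors standing in for DINOv2 features of the preview images.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def filter_similar(candidates, tau_sim=0.95):
    """Keep the best-scoring member of each near-duplicate group.

    candidates: list of (score, embedding) pairs.
    Returns the survivors, ranked by score in descending order.
    """
    kept = []
    for score, emb in sorted(candidates, key=lambda c: c[0], reverse=True):
        # Drop a candidate if it is too similar to a higher-scoring survivor.
        if all(cosine(emb, kept_emb) <= tau_sim for _, kept_emb in kept):
            kept.append((score, emb))
    return kept


candidates = [
    (6.5, [0.99, 0.05]),   # near-duplicate of the 7.0 candidate
    (7.0, [1.0, 0.0]),
    (6.0, [0.0, 1.0]),     # visually distinct
]
survivors = filter_similar(candidates)
```

The output is already sorted by score, so it can be handed directly to the next stage as the generation order.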

3. Depth-First Opportunistic Stopping: Stopping when sufficient rather than sampling all

This targets the redundancy of obtaining multiple identical correct results. Unlike the breadth-first approach of BoN/PRM/PARM, ADE-CoT finishes candidates depth-first: they are generated sequentially, in the order given by their early scores. The stage has two parts. Late-stage retention: at a later time \(t_l\) (\(t_e < t_l < T\)), a second preview and unified score are computed, and an adaptive threshold prunes sub-optimal samples. Instance-specific verifier: the generic score \(S_{gen}\) often fails to distinguish between multiple high-scoring candidates or misses subtle errors. The authors found that a "two-stage QA" guides MLLMs to notice key details: the MLLM first generates a set of yes/no questions (covering instruction following, aesthetics, etc.) via prompt \(P_q\), then answers them for each candidate via \(P_a\). The instance-specific score \(S_{spec}\) is the count of "yes" answers, so incorrect candidates are penalized. Generation stops once \(N_{high}\) (default 4) intent-aligned results are collected, and the highest scorer among them is output.
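The stopping loop can be sketched as follows. Several details are assumptions: a candidate counts as "intent-aligned" only when every question is answered "yes", the fallback when nothing qualifies is the best finished candidate, and late-stage retention is omitted for brevity. The callables `finish` (run the remaining denoising steps) and `ask` (return the MLLM's yes/no answers) are hypothetical stand-ins.

```python
def instance_score(answers):
    """S_spec: number of 'yes' answers to the instance-specific questions."""
    return sum(1 for a in answers if a)


def depth_first_select(ranked, finish, ask, n_high=4, n_questions=5):
    """Finish candidates one at a time and stop once n_high are aligned.

    Returns (best_image, number_of_candidates_fully_generated).
    """
    finished, aligned = [], []
    for cand in ranked:                       # best early score first
        image = finish(cand)                  # complete late-stage denoising
        score = instance_score(ask(image))
        finished.append((score, image))
        if score == n_questions:              # assumed acceptance rule
            aligned.append((score, image))
            if len(aligned) == n_high:
                break                         # enough good results: stop early
    pool = aligned or finished                # fall back if nothing aligned
    best = max(pool, key=lambda p: p[0])[1] if pool else None
    return best, len(finished)


# Stub models: candidates 1, 2, 4 and 5 pass all five questions.
good = {"img1", "img2", "img4", "img5"}
best, n_generated = depth_first_select(
    ranked=list(range(10)),
    finish=lambda c: f"img{c}",
    ask=lambda img: [True] * 5 if img in good else [True] * 3 + [False] * 2,
)
```

In this toy run only 6 of the 10 ranked candidates are fully generated before the fourth aligned result is found; ties on \(S_{spec}\) resolve to the candidate with the better early rank.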

Loss & Training¶

ADE-CoT is a training-free, plug-and-play test-time method, so there is no loss or training stage. Key hyperparameters for the three SOTA models (Kontext/BAGEL/Step1X-Edit): total steps \(T=28/50/28\) (consistent with the BoN NFE of 896/1600/896 at \(N=32\)), \(t_e=8/16/8\), \(t_l=16/36/16\). Qwen-VL-MAX is used for MLLM queries, VIE-Score for general scoring, and 5 yes/no questions are generated per edit.

Key Experimental Results¶

Main Results¶

Evaluation covers GEdit-Bench (real user edits), AnyEdit-Test (local/global/implicit edits), and Reason-Edit (complex reasoning), using FLUX.1 Kontext, BAGEL, and Step1X-Edit. Cost is measured by NFE (number of function evaluations, i.e., total denoising steps). Two derived metrics are reported: Inference Efficiency \(\eta = \frac{1}{M}\sum_i \sigma_i \cdot \frac{S(i)}{S_{\max}} \cdot \frac{NT}{NFE(i)}\) (where \(\sigma_i=1\) if the result is not inferior to BoN and 0 otherwise) and Result Efficiency \(\xi = \frac{1}{M}\sum_i \sigma_i \frac{NFE^{min}(i)}{NFE(i)}\) (measuring redundancy; higher means less wasted sampling). Main results on GEdit-Bench with fixed budget \(N=32\):

Model	Method	G_O ↑	η ↑	ξ ↑
FLUX.1 Kontext	BoN	6.641	0.66	0.12
FLUX.1 Kontext	TTS-EF	6.376	0.98	0.57
FLUX.1 Kontext	ADE-CoT	6.695	1.47	0.66
BAGEL	BoN	6.908	0.69	0.14
BAGEL	ADE-CoT	6.972	1.27	0.62
Step1X-Edit	BoN	7.157	0.72	0.13
Step1X-Edit	ADE-CoT	7.196	1.45	0.62

On GEdit-Bench, ADE-CoT raises inference efficiency \(\eta\) by roughly 1.8× to 2.2× over BoN (1.47 vs. 0.66 on Kontext, 1.27 vs. 0.69 on BAGEL, 1.45 vs. 0.72 on Step1X-Edit) while also improving \(G_O\). Result efficiency \(\xi\) improves by 4.9×/2.7×/2.9× on average across benchmarks. Baselines fail because PRM/PARM misjudge early previews, and TTS-EF is unreliable when sampling scales up.
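The two efficiency metrics can be written out directly, which also gives a sanity check on the BoN rows. In this sketch \(\xi\) is computed as \(NFE^{min}(i)/NFE(i)\), the orientation under which values fall in [0, 1] and higher means less redundancy, matching the ↑ column. `s_max=10.0` is an assumption, `nfe_min` is interpreted as the smallest NFE that would have sufficed for edit \(i\), and the value 112 is a toy number, not a reported result.

```python
def inference_efficiency(records, n, t, s_max=10.0):
    """eta = mean over edits of sigma_i * (S_i / S_max) * (N * T / NFE_i)."""
    return sum(r["sigma"] * (r["score"] / s_max) * (n * t / r["nfe"])
               for r in records) / len(records)


def result_efficiency(records):
    """xi = mean over edits of sigma_i * NFE_min_i / NFE_i."""
    return sum(r["sigma"] * r["nfe_min"] / r["nfe"]
               for r in records) / len(records)


# BoN on Kontext: every edit spends the full N * T = 32 * 28 evaluations,
# so eta reduces to the mean score divided by s_max.
bon = [{"sigma": 1, "score": 6.641, "nfe": 32 * 28, "nfe_min": 112}]
eta_bon = inference_efficiency(bon, n=32, t=28)   # about 0.664
xi_bon = result_efficiency(bon)                   # 0.125 with the toy nfe_min
```

With a constant per-edit cost of \(NT\), \(\eta\) for BoN equals \(G_O / S_{\max}\), which reproduces the tabulated 0.66, 0.69, and 0.72 from \(G_O\) of 6.641, 6.908, and 7.157 when \(S_{\max}=10\).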

Ablation Study (Incremental addition, GEdit-Bench, G_O / NFE)¶

Configuration	Kontext	BAGEL	Step1X-Edit
Baseline (BoN)	6.641 / 896	6.908 / 1600	7.157 / 896
+ Difficulty-Aware Budget	6.641 / 797	6.909 / 1391	7.157 / 778
+ Early Pruning (\(S_{gen}\))	6.642 / 719	6.912 / 1351	7.157 / 719
+ Early Pruning (Unified \(S\))	6.647 / 673	6.916 / 1290	7.161 / 638
+ Similarity Filtering	6.651 / 508	6.915 / 1087	7.162 / 522
+ Late Retention	6.652 / 464	6.935 / 972	7.163 / 462
+ Instance-Specific Verifier	6.702 / 464	6.984 / 972	7.206 / 462
+ Opportunistic Stop (Full)	6.695 / 418	6.972 / 882	7.196 / 434
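The NFE columns of the ablation can be checked with a few lines of arithmetic. The snippet below only re-uses the numbers from the table above; the helper names are illustrative.

```python
STAGES = [
    "Baseline (BoN)", "Difficulty-Aware Budget", "Early Pruning (S_gen)",
    "Early Pruning (Unified S)", "Similarity Filtering", "Late Retention",
    "Instance-Specific Verifier", "Opportunistic Stop (Full)",
]
NFE = {
    "Kontext": [896, 797, 719, 673, 508, 464, 464, 418],
    "BAGEL": [1600, 1391, 1351, 1290, 1087, 972, 972, 882],
    "Step1X-Edit": [896, 778, 719, 638, 522, 462, 462, 434],
}


def speedup(series):
    """Overall NFE reduction factor from the BoN baseline to the full method."""
    return series[0] / series[-1]


def largest_drop(series):
    """Stage that removes the most NFE, and how many evaluations it removes."""
    drops = [prev - cur for prev, cur in zip(series, series[1:])]
    i = max(range(len(drops)), key=drops.__getitem__)
    return STAGES[i + 1], drops[i]


for model, series in NFE.items():
    stage, drop = largest_drop(series)
    print(f"{model}: {speedup(series):.2f}x overall, "
          f"largest drop {drop} at '{stage}'")
```

The overall reductions come out at about 2.14× (Kontext), 1.81× (BAGEL), and 2.06× (Step1X-Edit). The largest single drop is similarity filtering on Kontext (165) and the difficulty-aware budget on BAGEL (209) and Step1X-Edit (118), with similarity filtering a close second on the latter two.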

Key Findings¶

NFE reduction is spread across stages, with similarity filtering the largest contributor on Kontext: NFE drops from 896 to 418 (≈2.1× speedup), of which similarity filtering removes 165 (673 → 508), the difficulty-aware budget 99 (896 → 797), and opportunistic stopping a final 46 (464 → 418).
Instance-specific verifier is the main driver for quality improvement: It captures detailed errors (e.g., "head tilted rather than forward") that generic scores miss, significantly boosting \(G_O\).
Unified score \(S\) is more accurate and efficient than \(S_{gen}\): Using \(S\) allows for higher rejection thresholds, further reducing NFE without losing quality. One-step preview via flow-matching extrapolation outperformed adding extra denoising steps.
\(N_{high}=4, \gamma=0.15\) are optimal balance points: Performance saturates after \(N_{high} \ge 4\) while NFE continues to rise; quality declines only after \(\gamma\) exceeds 0.15.

Highlights & Insights¶

The insight "task nature determines scaling strategy" is compelling: T2I is open-ended (more is better), while editing is goal-oriented (one is enough). This dichotomy makes the shift from "scale" to "speed" logical rather than just a collection of tricks.
One-step preview with flow-matching is clever: providing an early glimpse of the clean latent without extra cost serves as a cheap foundation for all subsequent verification.
Two-stage yes/no QA transforms generic scoring into a "task-specific checklist," effectively adding targeted attention to the verifier—a concept transferable to any scenario where generic scores fail to distinguish candidates.
\(S_{reg}\) uses Grounded SAM2 to quantify "modifying the right area" without requiring ground truth, which is uncommon among region-level checks for editing tasks.

Limitations & Future Work¶

The pipeline heavily relies on external models (Qwen-VL, SAM2, CLIP, etc.); the reliability of \(S_{reg}/S_{cap}\) is capped by these components. MLLM errors propagate to pruning and stopping decisions.
Using a single candidate's score as the difficulty proxy is a coarse estimate. Randomness in that one sample can lead to budget misallocation, which calls for further variance analysis.
Hyperparameter sensitivity: Several thresholds (\(\gamma, \tau_{sim}, \lambda, S_{rj}, t_e, t_l\)) require tuning across models. While default values worked for three models, the generalization across broader datasets and models remains to be fully explored.
The framework is bound by the base model's editing capability: It optimizes selection within a model's output distribution rather than fundamentally improving the model's intrinsic editing ability.

vs Best-of-N (BoN): BoN uses a fixed budget and breadth-first sampling. ADE-CoT uses dynamic budgets and opportunistic stopping, achieving 2× acceleration with equal or superior quality.
vs PRM / PARM: These also prune during denoising but rely on generic MLLM scores that misjudge subtle editing details; ADE-CoT's specialized verifiers reduce high-score region misjudgments by 63%.
vs TTS-EF (ICEdit): TTS-EF introduced Image-CoT to editing via extra denoising steps for early previews but only selects a single best candidate (unreliable for large scales). ADE-CoT uses "one-step previews," specialized scores, and depth-first stopping for a win-win in efficiency and quality.

Rating¶

Novelty: ⭐⭐⭐⭐ The shift from open-ended to goal-oriented scaling is a strong design principle. While individual components (pruning/QA/budget) are known, their synthesis is highly effective.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ Extensive benchmarks, three SOTA models, clear ablation of every strategy, and well-designed efficiency metrics (\(\eta, \xi\)).
Writing Quality: ⭐⭐⭐⭐ Clear problem analysis and motivation; mathematical formulations are straightforward.
Value: ⭐⭐⭐⭐ Training-free and plug-and-play; 2× speedup is highly practical for deploying Image-CoT in real-world editing applications.