Low-Resolution Editing is All You Need for High-Resolution Editing

Conference: CVPR 2026
arXiv: 2511.19945
Code: None
Area: Diffusion Models / Image Editing
Keywords: High-resolution image editing, test-time optimization, detail transfer function, patch synchronization, diffusion models

TL;DR

ScaleEdit is presented as the first work to formally define the high-resolution image editing task. It learns a 1×1 convolutional transfer function in the intermediate feature space of a pretrained generative model to inject fine-grained textural details from the source image, and uses a Blended-Tweedie-based patch synchronization strategy to keep patches globally consistent. Operating entirely via test-time optimization, the method is evaluated quantitatively at 1K and 2K and demonstrated at up to 8K.

Background & Motivation

Background: Text-driven image editing methods (e.g., Step1X-Edit, ICEdit, KV-Edit, Nano Banana) have achieved impressive results at low resolutions (≤1K), but are constrained by the input resolution of pretrained models and cannot directly handle larger images.
Limitations of Prior Work: A naive solution is to edit at low resolution and then apply super-resolution. However, the edit is never conditioned on the high-resolution source, so the micro-level textural details lost during downsampling cannot be reconstructed from the low-resolution edited result.
Key Challenge: High-resolution editing requires simultaneously preserving semantic correctness and fine-grained texture fidelity, yet pretrained generative models operate at a fixed resolution (typically \(512^2\)), making direct high-resolution inference infeasible.
Goal: How can the strong priors of low-resolution editing methods be leveraged while faithfully preserving the fine-grained details present in the high-resolution source image?
Key Insight: The low-resolution and high-resolution diffusion trajectories are related by a mapping that can be learned in the intermediate feature space of the diffusion model. A lightweight feature transfer function can learn this mapping and inject high-resolution details into the low-resolution editing result.
Core Idea: A learnable 1×1 convolution is used as the feature transfer function to inject fine-grained details from the high-resolution source image into the generation trajectory of the low-resolution editing result during the reverse diffusion process. Non-overlapping patch synchronization is applied to eliminate boundary artifacts.

Method

Overall Architecture

Given a high-resolution source image \(I_{\text{src}}^{\text{high}}\), its downsampled counterpart \(I_{\text{src}}^{\text{low}}\), and a low-resolution edited reference \(I_{\text{ref}}^{\text{low}}\) (produced by a standard editing method such as Nano Banana), the goal is to generate a high-resolution edited result \(I_{\text{ref}}^{\text{high}}\). The method proceeds in three steps: (1) all images are divided into \(N \times M\) non-overlapping patches matching the model's native resolution, and diffusion trajectories for each patch are extracted via the DDIM forward process; (2) a feature transfer function is learned for each patch to inject high-resolution source details into the edited result; (3) adjacent patches are synchronized via Blended-Tweedie combined with a resampling strategy to eliminate boundary artifacts.
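
Step (1) can be illustrated with a minimal sketch of the patch layout. The function name `patch_grid` and the box convention `(top, left, bottom, right)` are hypothetical, and the sketch assumes the image dimensions are exact multiples of the native resolution; how the method handles other sizes is not stated here.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (top, left, bottom, right), hypothetical convention


def patch_grid(height: int, width: int, patch: int) -> List[List[Box]]:
    """Return an N x M grid of non-overlapping patch boxes covering the image."""
    # ASSUMPTION: dimensions divide evenly; padding or cropping is out of scope.
    if height % patch or width % patch:
        raise ValueError("this sketch assumes dimensions divisible by the patch size")
    n_rows, n_cols = height // patch, width // patch
    return [
        [(r * patch, c * patch, (r + 1) * patch, (c + 1) * patch) for c in range(n_cols)]
        for r in range(n_rows)
    ]


# A 2048 x 2048 image with a 512 x 512 native resolution gives a 4 x 4 grid.
grid = patch_grid(2048, 2048, 512)
```

The same grid would be applied to the source and reference images alike, so that patch \(i\) refers to the same region in each of them.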

Key Designs

Detail Enhancement Module:
- Function: Transfers fine-grained textural details from the high-resolution source image into the edited target image.
- Mechanism: A timestep-conditioned transfer function \(\Delta\mathbf{h}_t[i] = \phi_\theta(\mathbf{h}_t[i], t)\) is defined in the intermediate feature space of the pretrained generative model, implemented as a 1×1 convolution. The optimization objective guides the low-resolution source generation trajectory toward the high-resolution source trajectory: \(\mathcal{L} = \|\mathbf{x}_{t-1}^{\text{high}}[i] - f^{\text{rev}}(\tilde{\mathbf{x}}_t[i], t; \Delta\mathbf{h}_t[i])\|_2^2\), where \(f^{\text{rev}}\) denotes one reverse diffusion step and \(\tilde{\mathbf{x}}_t[i]\) the latent of patch \(i\) on the low-resolution source trajectory. The optimized transfer function is then applied during the reverse process of the reference image to inject details. A control parameter \(\tau\) restricts the transfer function to the first \(\tau\) timesteps, balancing detail transfer against content preservation.
- Design Motivation: A constant feature offset cannot handle edits with large semantic changes (e.g., cat→dog), because different image regions require different degrees of adjustment. A 1×1 convolution computes the offset from the features at each position, so the channel-wise blending adapts to local content while the spatial layout is preserved.
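
The module above can be sketched in plain Python as a per-pixel channel mixing. This is an illustration under stated assumptions, not the authors' implementation: the names `conv1x1` and `inject` are hypothetical, the offset is assumed to be added to the features, timestep conditioning is modeled as one set of weights per step, and "the first \(\tau\) timesteps" is read as the first \(\tau\) reverse steps counted from 0.

```python
from typing import Dict, List, Tuple

Vector = List[float]
FeatureMap = List[List[Vector]]  # indexed as [row][col][channel]


def conv1x1(h: FeatureMap, weight: List[Vector], bias: Vector) -> FeatureMap:
    """Apply the same C_out x C_in channel mixing at every spatial position."""
    return [
        [
            [sum(w * x for w, x in zip(w_row, pixel)) + b for w_row, b in zip(weight, bias)]
            for pixel in row
        ]
        for row in h
    ]


def inject(
    h: FeatureMap,
    params_per_step: Dict[int, Tuple[List[Vector], Vector]],
    step: int,
    tau: int,
) -> FeatureMap:
    """Return h + delta_h during the first tau reverse steps, otherwise h unchanged."""
    if step >= tau:  # ASSUMPTION: steps are counted from 0 at the start of denoising
        return h
    weight, bias = params_per_step[step]
    delta = conv1x1(h, weight, bias)
    # ASSUMPTION: the offset is combined with the features by addition.
    return [
        [[x + d for x, d in zip(px, dpx)] for px, dpx in zip(row, drow)]
        for row, drow in zip(h, delta)
    ]


# Toy example: a 1 x 2 feature map with 2 channels and one optimized step.
h = [[[1.0, 2.0], [3.0, 4.0]]]
params = {0: ([[0.0, 1.0], [1.0, 0.0]], [0.5, -0.5])}
injected = inject(h, params, step=0, tau=1)
untouched = inject(h, params, step=1, tau=1)
```

Although the weights are shared across positions, the two pixels receive different offsets because their input features differ, which is what a single constant offset vector cannot provide.
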
Blended-Tweedie Synchronization:
- Function: Ensures visual consistency at the boundaries between adjacent patches.
- Mechanism: An auxiliary latent \(\tilde{\mathbf{A}}_t[i,i+1]\) is constructed by concatenating the lower half of the current patch and the upper half of the adjacent patch. The Tweedie estimate \(\hat{\mathbf{x}}_{t\to 0}^{\text{aux}}\) of this auxiliary latent is computed and linearly interpolated with the Tweedie estimates of the original patches. The blending weight \(\mathbf{M}(v,t) = \frac{2v}{H_p} \cdot (1 - t/\tau)\), with \(v\) the distance from the boundary and \(H_p\) the patch height, increases linearly from the boundary toward the patch center and grows as denoising proceeds (as \(t\) decreases).
- Design Motivation: Independently denoised patches produce discontinuities at boundaries. The auxiliary latent spans the boundary region, and its Tweedie estimate naturally captures a smoother transition; blending it with the original patches achieves inter-patch consistency.
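
A small sketch of the two ingredients described above, the auxiliary latent and the blending weight. Latents are represented as lists of rows; the function names are hypothetical. The source does not state which estimate \(\mathbf{M}\) multiplies, so the `blend` helper assumes that \(\mathbf{M}\) weights the patch's own estimate and \(1-\mathbf{M}\) the auxiliary one, which makes the auxiliary estimate dominate at the boundary.

```python
from typing import List, Sequence


def auxiliary_latent(current: Sequence[list], below: Sequence[list]) -> List[list]:
    """Stack the lower half of the current patch on the upper half of the patch below."""
    half = len(current) // 2
    return list(current[half:]) + list(below[:half])


def blend_weight(v: float, t: float, patch_height: float, tau: float) -> float:
    """M(v, t) = (2 v / H_p) * (1 - t / tau), with v the distance from the boundary."""
    return (2.0 * v / patch_height) * (1.0 - t / tau)


def blend(patch_value: float, aux_value: float, m: float) -> float:
    # ASSUMPTION: m weights the patch's own Tweedie estimate, (1 - m) the auxiliary one.
    return m * patch_value + (1.0 - m) * aux_value


# Two 4-row patches, labeled by row index so the stacking is easy to follow.
current_patch = [[0], [1], [2], [3]]
patch_below = [[4], [5], [6], [7]]
aux = auxiliary_latent(current_patch, patch_below)

w_boundary = blend_weight(v=0, t=5, patch_height=64, tau=15)
w_center = blend_weight(v=32, t=0, patch_height=64, tau=15)
```

With these definitions the weight is 0 on the boundary at every timestep and reaches 1 only at the patch center once \(t = 0\).
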
Resampling Strategy:
- Function: Addresses the absence of a corresponding transfer function \(\Delta\mathbf{h}_t\) for the auxiliary latent.
- Mechanism: A single forward step (without the transfer function) is applied to the detail-injected latent \(\tilde{\mathbf{y}}_{t-1}[i]\): \(\tilde{\mathbf{y}}_t^{rsp}[i] = f^{fwd}(\tilde{\mathbf{y}}_{t-1}[i], t-1)\), producing a resampled latent that retains injected details without depending on \(\Delta\mathbf{h}_t\). This resampled latent is then used to construct the auxiliary latent for synchronization, and the actual reverse step is completed using the blended Tweedie estimate combined with the noise prediction from the resampled latent.
- Design Motivation: Optimizing a separate transfer function for the auxiliary latent would be computationally prohibitive. Resampling decouples synchronization from detail injection, requiring only one additional forward and reverse operation.
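
The resampling step can be illustrated with the standard deterministic DDIM update, written here for scalar latents to keep the arithmetic visible. The noise predictor `toy_eps` is a hypothetical stand-in for the pretrained model, the \(\bar\alpha\) values are arbitrary, and the helper names are not from the source.

```python
import math


def tweedie_x0(x_t: float, eps: float, abar_t: float) -> float:
    """Tweedie (clean-sample) estimate from a noisy latent and a noise prediction."""
    return (x_t - math.sqrt(1.0 - abar_t) * eps) / math.sqrt(abar_t)


def ddim_move(x: float, eps: float, abar_from: float, abar_to: float) -> float:
    """Deterministic DDIM update between two noise levels (forward or reverse)."""
    x0 = tweedie_x0(x, eps, abar_from)
    return math.sqrt(abar_to) * x0 + math.sqrt(1.0 - abar_to) * eps


def reverse_with_blended_x0(x0_blend: float, eps: float, abar_prev: float) -> float:
    """Complete the reverse step from a (blended) Tweedie estimate and a noise prediction."""
    return math.sqrt(abar_prev) * x0_blend + math.sqrt(1.0 - abar_prev) * eps


def toy_eps(x: float, abar: float) -> float:
    return 0.1  # hypothetical constant predictor, used only to make the sketch runnable


abar_prev, abar_t = 0.9, 0.8  # abar_{t-1} > abar_t: step t is noisier than step t-1
y_prev = 0.5                  # detail-injected latent at step t-1

# Forward step without any transfer function: the resampled latent at step t.
y_rsp = ddim_move(y_prev, toy_eps(y_prev, abar_prev), abar_prev, abar_t)

# Reverse step from the resampled latent; here no blending is applied to x0.
eps_rsp = toy_eps(y_rsp, abar_t)
y_back = reverse_with_blended_x0(tweedie_x0(y_rsp, eps_rsp, abar_t), eps_rsp, abar_prev)
```

With the constant toy predictor the forward and reverse steps invert each other exactly; with a real network the round trip is only approximate, and in the method the Tweedie estimate passed to the last call would be the blended one.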

Loss & Training

ScaleEdit operates entirely via test-time optimization with no training required. The transfer function is optimized independently for each patch at each timestep. Stable Diffusion v2.1-base or FLUX.1-dev is used as the backbone, with total timesteps \(T=50\), \(\tau=15\), and an empty prompt. Null-text inversion is employed for accurate reconstruction. Low-resolution editing is performed using the Nano Banana method.
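
The reported settings can be collected in a small configuration sketch. The class and field names are hypothetical. The count below assumes that a transfer function is needed only for the \(\tau\) timesteps in which it is active, which is one reading of the description rather than a stated fact.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScaleEditSettings:
    """Hypothetical container for the hyperparameters reported above."""
    backbone: str = "Stable Diffusion v2.1-base"  # FLUX.1-dev is the reported alternative
    total_steps: int = 50                         # T
    tau: int = 15                                 # steps with the transfer function active
    prompt: str = ""                              # empty prompt


def num_transfer_functions(settings: ScaleEditSettings, n_rows: int, n_cols: int) -> int:
    """Per-patch, per-timestep optimizations for an n_rows x n_cols patch grid."""
    # ASSUMPTION: only the tau active timesteps require an optimized transfer function.
    return n_rows * n_cols * settings.tau


settings = ScaleEditSettings()
count_4x4 = num_transfer_functions(settings, 4, 4)
```

Under that assumption, a 4 × 4 patch grid requires 240 small optimization problems per image, which is consistent with the slow inference noted among the limitations.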

Key Experimental Results

Main Results

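| Setting | Method | HaarPSI↑ | M-MSE↓ | M-SSIM↑ | M-PSNR↑ | LPIPS↓ |
|---|---|---|---|---|---|---|
| 1K-editing | DiT-SR | 0.335 | 0.058 | 0.695 | 21.53 | 0.477 |
| 1K-editing | PiSA-SR | 0.328 | 0.058 | 0.668 | 21.27 | 0.465 |
| 1K-editing | ScaleEdit (Ours) | 0.342 | 0.054 | 0.739 | 22.13 | 0.460 |
| 2K-editing | DiT-SR | 0.316 | 0.057 | 0.754 | 21.38 | 0.507 |
| 2K-editing | PiSA-SR | 0.312 | 0.056 | 0.755 | 21.32 | 0.472 |
| 2K-editing | ScaleEdit (Ours) | 0.331 | 0.053 | 0.806 | 21.96 | 0.496 |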
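
The table can be checked mechanically. The snippet below copies the numbers above into a dictionary and picks the best method per metric, respecting the direction of each arrow; the names `RESULTS`, `METRICS`, and `best_per_metric` are introduced only for this illustration.

```python
from typing import Dict, Tuple

# (HaarPSI, M-MSE, M-SSIM, M-PSNR, LPIPS), copied from the table above.
RESULTS: Dict[str, Dict[str, Tuple[float, ...]]] = {
    "1K": {
        "DiT-SR": (0.335, 0.058, 0.695, 21.53, 0.477),
        "PiSA-SR": (0.328, 0.058, 0.668, 21.27, 0.465),
        "ScaleEdit": (0.342, 0.054, 0.739, 22.13, 0.460),
    },
    "2K": {
        "DiT-SR": (0.316, 0.057, 0.754, 21.38, 0.507),
        "PiSA-SR": (0.312, 0.056, 0.755, 21.32, 0.472),
        "ScaleEdit": (0.331, 0.053, 0.806, 21.96, 0.496),
    },
}

# (metric name, True if higher is better)
METRICS = [("HaarPSI", True), ("M-MSE", False), ("M-SSIM", True), ("M-PSNR", True), ("LPIPS", False)]


def best_per_metric(rows: Dict[str, Tuple[float, ...]]) -> Dict[str, str]:
    """Return the best-scoring method for each metric in one resolution setting."""
    best = {}
    for idx, (name, higher_is_better) in enumerate(METRICS):
        pick = max if higher_is_better else min
        best[name] = pick(rows, key=lambda method: rows[method][idx])
    return best


best_1k = best_per_metric(RESULTS["1K"])
best_2k = best_per_metric(RESULTS["2K"])
```

On these numbers ScaleEdit is best on every metric at 1K and on four of five at 2K; the exception is LPIPS at 2K, where PiSA-SR has the lower score.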

Ablation Study

| Configuration | Key Metric | Note |
|---|---|---|
| Without synchronization | Visible boundary artifacts | Independent patch denoising produces noticeable seams |
| With synchronization | Smooth, natural boundaries | Blended-Tweedie + resampling eliminates artifacts |
| Constant vector vs. 1×1 convolution | Latter significantly better | Spatially adaptive transfer function is more robust |

Key Findings

ScaleEdit outperforms the super-resolution baselines on every reported metric at 1K and on all but LPIPS at 2K, where PiSA-SR scores lower (0.472 vs. 0.496). This largely supports the argument that an "edit-then-upscale" pipeline cannot recover source image details.
The advantage is especially pronounced on masked metrics (M-MSE, M-SSIM, M-PSNR), indicating that the method better preserves regions of the source image that should remain unchanged.
The method generalizes to Transformer-based architectures such as FLUX and is not restricted to U-Net backbones.
The approach scales to 8K resolution editing without any additional training.
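
The masked metrics are not defined in these notes. The sketch below assumes that the "M-" prefix means the metric is computed only over pixels that the edit should leave unchanged; the function names, the flattened-pixel representation, and the peak value of 1.0 are all assumptions, and the paper's exact normalization may differ.

```python
import math
from typing import Sequence


def masked_mse(a: Sequence[float], b: Sequence[float], keep: Sequence[bool]) -> float:
    """Mean squared error over the pixels where keep is True."""
    diffs = [(x - y) ** 2 for x, y, k in zip(a, b, keep) if k]
    if not diffs:
        raise ValueError("the mask selects no pixels")
    return sum(diffs) / len(diffs)


def masked_psnr(a: Sequence[float], b: Sequence[float], keep: Sequence[bool], peak: float = 1.0) -> float:
    """PSNR in dB over the kept pixels; infinite when they match exactly."""
    mse = masked_mse(a, b, keep)
    return math.inf if mse == 0 else 10.0 * math.log10(peak ** 2 / mse)


# Toy flattened images: the third pixel is inside the edited region and is ignored.
source = [0.0, 0.5, 1.0, 0.2]
edited = [0.1, 0.5, 0.0, 0.2]
keep = [True, True, False, True]

mse = masked_mse(source, edited, keep)
psnr = masked_psnr(source, edited, keep)
```

Because the large change at the third pixel lies inside the edited region, it does not affect either score; only deviations in regions meant to stay unchanged are penalized.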

Highlights & Insights

ScaleEdit is the first work to formally define the high-resolution image editing task and distinguish it from naive "edit + super-resolution" pipelines.
The transfer function design is elegant—a 1×1 convolution performs channel-wise adaptive blending in feature space, achieving both efficiency and effectiveness.
The non-overlapping synchronization strategy substantially reduces computational overhead compared to traditional overlapping inference approaches, whose cost scales with the overlap ratio.
The test-time optimization framework requires no training data and is designed to work with different editing methods and generative models.
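
The cost argument for overlapping inference can be made concrete with a rough count of patches per denoising step. This is back-of-the-envelope arithmetic under a simple sliding-window assumption; the function names are hypothetical, and the count for the non-overlapping case deliberately ignores the auxiliary boundary latents and resampling passes that ScaleEdit adds.

```python
import math


def patches_per_side(length: int, patch: int, overlap: float) -> int:
    """Number of sliding windows along one side for a given overlap ratio in [0, 1)."""
    stride = int(patch * (1.0 - overlap))
    if stride <= 0:
        raise ValueError("overlap must be smaller than 1")
    return math.ceil((length - patch) / stride) + 1


def patches_per_step(height: int, width: int, patch: int, overlap: float = 0.0) -> int:
    """Patches the denoiser must process per step for a sliding-window layout."""
    return patches_per_side(height, patch, overlap) * patches_per_side(width, patch, overlap)


no_overlap = patches_per_step(2048, 2048, 512, overlap=0.0)
quarter_overlap = patches_per_step(2048, 2048, 512, overlap=0.25)
half_overlap = patches_per_step(2048, 2048, 512, overlap=0.5)
```

For a 2048 × 2048 image and 512 × 512 patches, the count rises from 16 without overlap to 25 at 25% overlap and 49 at 50% overlap.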

Limitations & Future Work

Test-time optimization must be performed independently for each image, resulting in slow inference due to repeated forward, reverse, and iterative optimization steps.
The hyperparameter \(\tau\) requires manual tuning to balance detail transfer against content preservation.
The method depends on the quality of the low-resolution edited result—if the low-resolution editing fails, the high-resolution output cannot be recovered.
Patch size is fixed to the model's native resolution, offering no flexibility in adjustment.
Results are demonstrated only for Stable Diffusion and FLUX; generalization to other architectures remains unverified.

Related Work & Insights

The design of the transfer function draws inspiration from Null-text inversion, which aligns forward and reverse trajectories by optimizing learnable parameters.
The challenge of patch synchronization also arises in video diffusion and panoramic image generation; the proposed Blended-Tweedie strategy may have broader applicability in those settings.
High-resolution content creation is an industry-critical need that has received insufficient attention in the academic community.

Rating

Novelty: ⭐⭐⭐⭐⭐ First formal definition of the high-resolution editing task; novel transfer function design and non-overlapping synchronization strategy.
Experimental Thoroughness: ⭐⭐⭐⭐ Covers quantitative, qualitative, ablation, and 8K demonstrations, though the dataset scale is limited (100 source images).
Writing Quality: ⭐⭐⭐⭐⭐ Problem definition is clear, methodological derivation is rigorous, and figures are well-crafted.
Value: ⭐⭐⭐⭐ High-resolution editing addresses a practical need; the general-purpose framework can serve as a plug-and-play solution.