
PixPerfect: Seamless Latent Diffusion Local Editing with Discriminative Pixel-Space Refinement

Conference: NeurIPS 2025
arXiv: 2512.03247
Code: None
Area: Image Editing / Diffusion Models
Keywords: Image inpainting, pixel-level refinement, discriminative pixel space, artifact simulation, local editing

TL;DR

This paper proposes PixPerfect, a general-purpose pixel-level refinement framework for local editing with latent diffusion models (LDMs). Through a discriminative pixel-space loss and a comprehensive artifact simulation pipeline, it eliminates the color discrepancies, texture mismatches, and visible seams that LDM editing leaves behind, achieving substantial improvements in visual fidelity across inpainting, object removal, and object insertion.

Background & Motivation

LDMs have achieved remarkable progress in image inpainting and local editing. However, because they encode and decode through a low-dimensional latent space, pixel-level inconsistencies are inevitably introduced at editing boundaries, including color shifts, texture mismatches, and visible seams. These artifacts are further exacerbated by more expressive latent representations such as FLUX's.

Existing solutions fall into two categories: (1) latent-space modifications (e.g., Asymmetric VQGAN injecting background information into the decoder, ASUKA introducing color enhancement), which are architecture-specific and generalize poorly; and (2) post-processing pixel-level harmonization (e.g., Poisson blending, DiffHarmony++), which cannot fully eliminate subtle artifacts. The root cause is that conventional pixel-space objective functions are insufficiently sensitive to subtle color and texture deviations.

PixPerfect's starting point is to design a discriminative pixel space that amplifies perceptual differences, paired with a comprehensive artifact simulation pipeline and a direct pixel-level refinement scheme, enabling universal artifact elimination across architectures and tasks.

Method

Overall Architecture

Given an image \(x_\text{gen}\) partially synthesized by an LDM and an editing mask \(m\), PixPerfect employs a GAN-based refinement network \(G\) to produce \(x_\text{pred} = G(x_\text{gen}, m)\), aligning it with the pixel-consistent ground-truth image \(x_\text{gt}\) in and around the edited region. The network builds on the CMGAN architecture, with 41M parameters and a fully convolutional design.

Key Designs

  1. Discriminative Pixel Space:

    • Core Problem: Standard \(\ell_1\) + perceptual + adversarial losses are insufficiently sensitive to subtle hue/texture shifts.
    • A differentiable tone-mapping function \(f_\theta: \mathbb{R}^3 \to \mathbb{R}^3\) is defined to transform the RGB color space into a discriminative color space, amplifying color and texture differences between the synthesized region and the background.
    • Parameterized via polynomial regression (maximum degree \(D=5\)); regression inputs are predicted image pixel values, and regression targets are the amplified-discrepancy image \(y_\text{amp} = x_\text{gt} + \beta(x_\text{pred} - x_\text{gt})\), \(\beta \in [20, 40]\).
    • Regression coefficients are computed via the Moore–Penrose pseudoinverse, adaptively per sample.
    • Losses with the same structure as the pixel-space losses (L1 + perceptual + adversarial) are then applied in the discriminative space; total loss \(= \mathcal{L}_\text{pixel-space} + \mathcal{L}_\text{disc-space}\). A minimal sketch of the fit appears after this list.
  2. Artifact Simulation Pipeline:

    • Addresses the problem of inconsistent artifact distributions in real diffusion outputs and the unavailability of ground truth.
    • Non-uniform color shifts: Uniform color jitter followed by alpha blending with a random gradient alpha map to simulate spatially varying hue/brightness shifts.
    • Texture pattern mismatches: Random VAE reconstruction + Gaussian smoothing applied within the masked region; JPEG compression artifacts applied to the background; separate random noise added to each.
    • Content discontinuities: An existing inpainting method reconstructs a narrow band at the mask boundary, after which original background pixels are pasted back to produce boundary discontinuities.
    • Mixed soft/hard boundaries: Random morphological dilation/erosion and Gaussian blurring applied to the composite mask.
    • Artifact types are combined with varying probabilities (content discontinuity 0.5, color enhancement 0.8, texture 0.5, boundary 1.0, etc.). The non-uniform color shift is sketched after this list.
  3. Inference-Time Pooling:

    • \(N\) random color jitter variants are applied to the masked region of the input image.
    • The refinement network is run on each variant, and the output with the smallest input–output discrepancy is selected as the final result.
    • This constitutes a simple yet effective test-time scaling strategy; a sketch also follows below.
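To make the discriminative pixel space concrete, here is a minimal sketch of the adaptive fit, using the degree-5 polynomial from the method description. The basis (a bias term plus per-channel powers, no cross-channel terms) is an assumption, as the exact parameterization is not reproduced here; coefficients are solved with the Moore–Penrose pseudoinverse against the amplified-discrepancy target, and the fitted map is applied differentiably to both images before the losses are computed.

```python
import torch

def poly_features(x, degree=5):
    # x: (N, 3) RGB values in [0, 1]. Basis: a bias term plus per-channel
    # powers up to `degree` (whether cross-channel terms are also used is
    # not stated; this basis is an assumption).
    feats = [torch.ones_like(x[:, :1])]
    for d in range(1, degree + 1):
        feats.append(x ** d)
    return torch.cat(feats, dim=1)            # (N, 3 * degree + 1)

def discriminative_space(x_pred, x_gt, beta=30.0, degree=5):
    # x_pred, x_gt: (H, W, 3) images in [0, 1]; beta is sampled in [20, 40]
    # in the paper, fixed here for brevity.
    p = x_pred.reshape(-1, 3)
    g = x_gt.reshape(-1, 3)
    # Amplified-discrepancy regression target: y = x_gt + beta * (x_pred - x_gt).
    y = g + beta * (p - g)
    # Fit the tone-mapping coefficients adaptively for this sample via the
    # Moore-Penrose pseudoinverse; no gradients flow through the fit itself.
    with torch.no_grad():
        A = poly_features(p, degree)          # (N, K)
        W = torch.linalg.pinv(A) @ y          # (K, 3)
    # Apply the fitted map differentiably to both images; the L1, perceptual,
    # and adversarial losses are then computed between the two mapped images.
    fx_pred = poly_features(p, degree) @ W
    fx_gt = poly_features(g, degree) @ W
    return fx_pred.reshape_as(x_pred), fx_gt.reshape_as(x_gt)
```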
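Next, a sketch of one piece of the artifact simulation pipeline, the non-uniform color shift: a uniform per-channel jitter is alpha-blended back through a random gradient alpha map so the hue/brightness shift varies spatially. The jitter range and the linear-gradient parameterization are illustrative assumptions.

```python
import torch

def nonuniform_color_shift(x, mask, max_shift=0.1):
    # x: (B, 3, H, W) in [0, 1]; mask: (B, 1, H, W), 1 inside the edit region.
    # Uniform per-channel jitter, then alpha-blended back through a random
    # linear-gradient alpha map so the shift varies smoothly across space.
    b, _, h, w = x.shape
    shift = (torch.rand(b, 3, 1, 1, device=x.device) - 0.5) * 2 * max_shift
    x_jit = (x + shift).clamp(0, 1)
    # Random linear gradient alpha map in [0, 1] with a random orientation.
    ys = torch.linspace(0, 1, h, device=x.device).view(1, 1, h, 1)
    xs = torch.linspace(0, 1, w, device=x.device).view(1, 1, 1, w)
    wy, wx = torch.rand(2, device=x.device)
    alpha = (wy * ys + wx * xs) / (wy + wx + 1e-8)
    blended = alpha * x_jit + (1 - alpha) * x
    return x * (1 - mask) + blended * mask    # only the masked region changes
```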
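Finally, inference-time pooling can be sketched as follows. Here `refiner` stands for the trained refinement network; the exact jitter form and the mean-L1 discrepancy measure are assumptions.

```python
import torch

def pooled_refine(refiner, x_gen, mask, n_variants=4, jitter=0.05):
    # Run the refiner on N color-jittered copies of the masked region and
    # keep the output closest to the input it was run on (one reading of
    # "smallest input-output discrepancy"; measured here with mean L1).
    best_out, best_score = None, float("inf")
    for _ in range(n_variants):
        shift = (torch.rand(1, 3, 1, 1, device=x_gen.device) - 0.5) * 2 * jitter
        x_var = (x_gen + mask * shift).clamp(0, 1)   # jitter masked region only
        out = refiner(x_var, mask)
        score = (out - x_var).abs().mean().item()
        if score < best_score:
            best_out, best_score = out, score
    return best_out
```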

Loss & Training

  • Total loss \(= \mathcal{L}_\text{pixel-space} + \mathcal{L}_\text{disc-space}\), with weights \(w_1=64\), \(w_2=5\), \(w_3=1\) (emphasizing color consistency).
  • Perceptual loss uses LPIPS; adversarial loss uses a mask-conditioned discriminator.
  • Moderate Gaussian noise augmentation is applied during training to stabilize GAN training.
  • The discriminative-space loss includes a warm-up period (disabled in the initial phase).
  • Adam optimizer, lr \(= 5\times10^{-4}\), batch size \(= 32\); approximately 300 million training images; trained on 32 A100 GPUs for approximately one week.
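A sketch of the per-space loss composition, assuming \(w_1\), \(w_2\), \(w_3\) weight the L1, LPIPS, and adversarial terms respectively (consistent with the note about emphasizing color consistency) and a hinge-style generator loss:

```python
import torch
import lpips  # https://github.com/richzhang/PerceptualSimilarity (pip install lpips)

# Term weights per the paper; mapping w1/w2/w3 onto the L1, LPIPS, and
# adversarial terms is an assumption consistent with "emphasizing color
# consistency" (the large weight sits on the pixel-wise L1 term).
W1, W2, W3 = 64.0, 5.0, 1.0
lpips_fn = lpips.LPIPS(net="vgg")

def space_loss(pred, gt, adv_logits):
    # The same structure is used twice: once on raw pixels, once on the
    # tone-mapped images from the discriminative-space fit.
    l1 = (pred - gt).abs().mean()
    perc = lpips_fn(pred * 2 - 1, gt * 2 - 1).mean()  # LPIPS expects [-1, 1]
    adv = -adv_logits.mean()                          # hinge generator loss (assumed)
    return W1 * l1 + W2 * perc + W3 * adv

# total = space_loss(x_pred, x_gt, disc(x_pred, m)) \
#       + space_loss(fx_pred, fx_gt, disc_d(fx_pred, m))
```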

Key Experimental Results

Main Results (Inpainting Task)

| Method | Dataset | FID↓ | LPIPS↓ | L1↓ | PSNR↑ |
|---|---|---|---|---|---|
| FLUX-Fill | MISATO | 14.66 | 0.195 | 0.062 | 20.90 |
| FLUX-Fill + AsyVQ | MISATO | 15.99 | 0.202 | 0.057 | 20.91 |
| FLUX-Fill + DH++ | MISATO | 14.02 | 0.190 | 0.056 | 20.89 |
| FLUX-Fill + PixPerfect | MISATO | 10.87 | 0.141 | 0.036 | 22.18 |
| FLUX-Fill | Places2 | 19.05 | 0.240 | 0.074 | 19.33 |
| FLUX-Fill + PixPerfect | Places2 | 15.61 | 0.194 | 0.052 | 20.04 |

Ablation Study (MISATO Dataset, Based on FLUX-Fill)

| Configuration | FID↓ | LPIPS↓ | L1↓ |
|---|---|---|---|
| FLUX-Fill baseline | 14.66 | 0.195 | 0.062 |
| + paste-back | 14.40 | 0.170 | 0.040 |
| + refiner | 13.99 | 0.170 | 0.040 |
| + enhance loss (\(d=6\), default) | 10.90 | 0.143 | 0.037 |
| + Haar reweighted loss | 11.38 | 0.143 | 0.038 |
| + VGG high-dim discriminative space | 11.05 | 0.142 | 0.036 |
| + inference-time pooling (PixPerfect) | 10.87 | 0.141 | 0.036 |

Key Findings

  • As a plug-and-play module, PixPerfect consistently improves all metrics across multiple diffusion models including SDv1.5, SDv2, and FLUX-Fill.
  • On the object removal task, PixPerfect reduces OmniPaint's FID from 23.05 to 18.87 and improves PSNR from 24.67 to 27.96.
  • Inference overhead is approximately 2.7 seconds for 512×512 images, accounting for only 21.8% of FLUX-Fill sampling time.
  • The discriminative pixel space is the key contribution—degree \(d=6\) polynomial performs best; \(d=2\) is too shallow, and \(d=10\) overfits.
  • The latent space exhibits spatial entanglement: simply replacing the masked region's latent representation causes global background shifts upon decoding.

Highlights & Insights

  • The discriminative pixel space design is particularly elegant: adaptive polynomial regression amplifies subtle color discrepancies to a magnitude the network can actually learn from, striking a good balance between computational efficiency and expressive power.
  • The artifact simulation pipeline is comprehensive: it covers diverse real-world artifact patterns including color shifts, texture mismatches, content discontinuities, and mixed soft/hard boundaries, avoiding the difficulty of relying on real diffusion outputs.
  • Refinement in pixel space rather than latent space is the correct strategy: the paper demonstrates that the latent space suffers from spatial entanglement, whereas pixel-space refinement enjoys natural spatial locality.
  • Inference-time pooling is a clever test-time scaling approach.

Limitations & Discussion

  • The method cannot correct major semantic errors from the underlying generative model; it only addresses low-level artifacts.
  • Performance depends on the quality of the initial prediction and the predefined editing region.
  • The method inherits biases from upstream diffusion models and training data.
  • Training requires 300 million images and substantial GPU resources.
  • Compared to Asymmetric VQGAN (latent-space decoder modification) and DiffHarmony++ (learned harmonization), PixPerfect achieves substantial improvements across all metrics.
  • Poisson blending, while classical, requires a ground-truth gradient field and is not deployable in practice.
  • Implication for image editing pipelines: any LDM-based local editing method should incorporate a PixPerfect-style refinement module at its back end.

Rating

  • Novelty: ⭐⭐⭐⭐
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐
  • Writing Quality: ⭐⭐⭐⭐
  • Value: ⭐⭐⭐⭐