RepLDM: Reprogramming Pretrained Latent Diffusion Models for High-Quality, High-Efficiency, High-Resolution Image Generation¶
Conference: NeurIPS 2025 (Spotlight)
arXiv: 2410.06055
Code: GitHub
Area: Image Generation
Keywords: Latent diffusion models, high-resolution generation, model reprogramming, attention guidance, progressive upsampling
TL;DR¶
This paper proposes RepLDM, a training-free reprogramming framework that lets pretrained latent diffusion models generate high-quality images well beyond their training resolution. It runs in two stages, an attention guidance stage followed by a progressive upsampling stage, and substantially reduces inference time compared with prior high-resolution methods.
Background & Motivation¶
Latent diffusion models (e.g., Stable Diffusion) enable high-resolution image synthesis, but they are trained at a fixed resolution and frequently exhibit severe structural distortions, such as object duplication, when generating beyond it.
Limitations of existing approaches:
High retraining cost: Training models from scratch at high resolutions requires substantial computational resources.
Long inference time: Existing reprogramming methods save on training but require a large number of denoising steps at inference.
Poor quality: Direct upsampling in latent space leads to severe artifacts.
Structural inconsistency: Self-attention behavior at high resolutions deviates from that at the training resolution.
Method¶
Overall Architecture¶
RepLDM consists of two stages: (1) an attention guidance stage that generates a high-quality initialization at the training resolution, and (2) a progressive upsampling stage that incrementally increases resolution in pixel space.
Key Designs¶
1. Attention Guidance Stage
- Denoising is performed at the training resolution (e.g., 512×512).
- A training-free self-attention guidance mechanism is introduced to enhance structural consistency.
- Core idea: the Key/Value matrices of self-attention are modified to guide the generation of more structurally coherent latent representations.
- The resulting latent representations are better suited for subsequent upsampling compared to standard methods.
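The K/V modification can be sketched as follows. The paper modifies the Key/Value matrices of self-attention; the specific blending rule below (mixing each token's K/V toward the per-token mean to bias queries toward globally consistent structure) is an illustrative assumption, not the paper's exact formula:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def guided_self_attention(q, k, v, guidance_scale=0.5):
    """Self-attention whose keys/values are blended toward their
    per-token mean, nudging every query toward a shared global
    structure. Illustrative sketch: the exact K/V modification in
    RepLDM may differ."""
    k_g = (1.0 - guidance_scale) * k + guidance_scale * k.mean(axis=0, keepdims=True)
    v_g = (1.0 - guidance_scale) * v + guidance_scale * v.mean(axis=0, keepdims=True)
    scores = q @ k_g.T / np.sqrt(q.shape[-1])   # scaled dot-product
    return softmax(scores) @ v_g

# Toy example: 16 tokens with 8-dimensional features.
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((16, 8)) for _ in range(3))
out = guided_self_attention(q, k, v)
```

With `guidance_scale=0` this reduces to plain scaled dot-product attention, so the mechanism is a strict, tunable perturbation of the pretrained model's attention.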
2. Progressive Upsampling Stage
- Key insight: upsampling is performed in pixel space rather than latent space.
- Upsampling in latent space causes encoder–decoder mismatch, producing severe artifacts.
- Pixel-space upsampling preserves VAE decode–encode consistency.
- Progressive schedule: 512 → 768 → 1024 → ..., with only a small number of denoising steps required at each stage.
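The control flow of this stage can be sketched in a few lines. The `refine` callable below is a hypothetical placeholder for the VAE-encode, few-step-diffusion, VAE-decode round trip; the nearest-neighbour resize stands in for a proper resampler:

```python
import numpy as np

def upsample_pixels(img, size):
    """Nearest-neighbour resize in pixel space (stand-in for a
    proper resampling filter)."""
    h, w = img.shape[:2]
    ys = np.arange(size) * h // size
    xs = np.arange(size) * w // size
    return img[np.ix_(ys, xs)]

def progressive_upsample(img, schedule=(768, 1024), refine=lambda x: x):
    """Pixel-space progressive upsampling: at each target resolution,
    upsample the decoded image and then refine it. `refine` is a
    hypothetical placeholder for VAE encode -> few denoising steps
    -> VAE decode on the real pipeline."""
    for size in schedule:
        img = refine(upsample_pixels(img, size))
    return img

# Toy example: a 512x512 RGB image pushed through the 768 -> 1024 schedule.
base = np.zeros((512, 512, 3), dtype=np.float32)
hi = progressive_upsample(base)
```

Because each step upsamples an actual decoded image, every intermediate stays on the VAE's decode domain, which is exactly the consistency argument made above.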
3. Efficient Denoising
- The high-quality initialization provided by Stage 1 allows Stage 2 to require very few denoising steps (e.g., 5–10 steps).
- Total inference time is substantially lower than existing methods.
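The efficiency argument is that a strong initialization only needs light re-noising plus a handful of denoising steps, in the spirit of SDEdit-style refinement. A toy sketch of that idea, where the 0.8/0.2 update is a stand-in for a real UNet denoising step:

```python
import numpy as np

def few_step_refine(x, noise_strength=0.3, steps=5, seed=0):
    """Partially re-noise a good initialization, then take a few
    denoising steps. The toy 'denoiser' below just pulls the sample
    back toward the clean input; a real pipeline would call the
    pretrained UNet at each step. Illustrative assumption, not the
    paper's exact scheduler."""
    rng = np.random.default_rng(seed)
    x_t = (1.0 - noise_strength) * x + noise_strength * rng.standard_normal(x.shape)
    for _ in range(steps):
        x_t = 0.8 * x_t + 0.2 * x  # one toy denoising step
    return x_t

clean = np.ones((8, 8))
refined = few_step_refine(clean)
```

With 5 steps the residual noise shrinks geometrically (by 0.8^5 here), which is why a few steps suffice once the initialization is close to the target.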
Loss & Training¶
- Training-free: RepLDM involves no parameter training or fine-tuning.
- Only the attention computation and upsampling strategy within the inference pipeline are modified.
- The standard denoising objective of the pretrained model is used.
Key Experimental Results¶
Main Results¶
Image generation quality at 1024×1024 resolution (based on SD1.5, 512→1024):
| Method | FID ↓ | CLIP Score ↑ | Inference Time | Structural Consistency |
|---|---|---|---|---|
| Direct Generation (SD) | 85.2 | 0.265 | 8.5s | Poor |
| MultiDiffusion | 42.3 | 0.285 | 45.2s | Moderate |
| ScaleCrafter | 38.5 | 0.292 | 52.8s | Moderate |
| DemoFusion | 32.1 | 0.301 | 68.5s | Good |
| RepLDM (Ours) | 25.8 | 0.315 | 18.2s | Excellent |
At 2048×2048 resolution:
| Method | FID ↓ | CLIP Score ↑ | Inference Time |
|---|---|---|---|
| ScaleCrafter | 52.3 | 0.275 | 185s |
| DemoFusion | 45.8 | 0.288 | 235s |
| RepLDM (Ours) | 35.2 | 0.302 | 52s |
Ablation Study¶
Contribution of each component (512→1024):
| Configuration | FID | CLIP Score | Inference Time |
|---|---|---|---|
| Standard upsampling (latent space) | 65.3 | 0.272 | 22s |
| + Attention guidance | 42.5 | 0.295 | 25s |
| + Pixel-space upsampling | 35.2 | 0.305 | 20s |
| + Progressive schedule (RepLDM) | 25.8 | 0.315 | 18.2s |
Key Findings¶
- RepLDM simultaneously outperforms the state of the art in both quality (FID 25.8 vs. 32.1) and speed (18.2s vs. 68.5s).
- Pixel-space upsampling is the critical factor for quality improvement, eliminating latent-space artifacts.
- Attention guidance provides a better initialization for upsampling, reducing the number of subsequent denoising steps required.
- The speed advantage becomes more pronounced at 4× upsampling (512→2048): 52s vs. 235s.
Highlights & Insights¶
- Spotlight paper: Recognized as a Spotlight at NeurIPS 2025, attesting to its impact and quality.
- Triple advantage: High quality, high efficiency, and high resolution are achieved simultaneously rather than traded off against one another.
- Key insight: The choice to perform upsampling in pixel space rather than latent space circumvents a fundamental source of degradation.
Limitations & Future Work¶
- The method currently targets UNet-based architectures (SD1.5/SDXL); adaptation to DiT-based architectures remains to be validated.
- Text–image alignment may degrade for long prompts at high resolutions.
- The selection of intermediate resolutions in the progressive schedule lacks an adaptive mechanism.
- Evaluation relies primarily on FID and CLIP Score; human perceptual evaluation is absent.
Related Work & Insights¶
- DemoFusion (Du et al.): A denoising fusion method for progressive upsampling.
- ScaleCrafter (He et al.): Adapts convolutional operations to accommodate high resolutions.
- MultiDiffusion: Patch-based diffusion for panoramic image generation.
Rating¶
- ⭐ Novelty: 8/10 — The pixel-space upsampling insight is simple yet critical.
- ⭐ Value: 9/10 — Fast, high-quality, and openly available.
- ⭐ Writing Quality: 8/10 — Overall quality consistent with Spotlight recognition.