Compression-Aware One-Step Diffusion Model for JPEG Artifact Removal¶
Conference: ICCV 2025 · arXiv: 2502.09873 · Code: github.com/jp-guo/CODiff
Area: Image Generation / Image Restoration
Keywords: JPEG artifact removal, one-step diffusion model, compression prior, dual learning, image restoration
TL;DR¶
This paper proposes CODiff, a compression-aware one-step diffusion model for JPEG artifact removal. The core contribution is a Compression-aware Visual Embedder (CaVE) that extracts JPEG compression priors via an explicit–implicit dual learning strategy, guiding the diffusion model toward high-quality restoration. CODiff comprehensively outperforms existing methods on LIVE-1, Urban100, and DIV2K-Val while achieving extremely high inference efficiency.
Background & Motivation¶
Problem Definition¶
JPEG artifact removal aims to eliminate compression distortions such as blocking and banding from compressed images, recovering lost visual information. At high compression rates (e.g., QF=5), information loss is severe, and conventional CNN/Transformer-based methods struggle to cope.
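To make the degradation concrete, the snippet below compresses a synthetic image at a high and a very low quality factor and compares the encoded sizes; at QF=5 the quantization is so coarse that most detail is discarded and blocking artifacts dominate the decoded image. This is an illustrative sketch using Pillow, not code from the paper.

```python
import io

import numpy as np
from PIL import Image

# Build a synthetic test image (random noise stresses the JPEG encoder).
rng = np.random.default_rng(0)
img = Image.fromarray(rng.integers(0, 256, (256, 256, 3), dtype=np.uint8))

def jpeg_bytes(image, qf):
    """Encode an image as JPEG at the given quality factor, return the bytes."""
    buf = io.BytesIO()
    image.save(buf, format="JPEG", quality=qf)
    return buf.getvalue()

hq = jpeg_bytes(img, 95)  # mild compression
lq = jpeg_bytes(img, 5)   # extreme compression, as in the paper's hardest setting
print(len(hq), len(lq))   # the QF=5 file is far smaller: most information is lost
```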
Limitations of Prior Work¶
CNN/Transformer methods (FBCNN, PromptCIR, etc.): Limited effectiveness at high compression rates, as lost information cannot be fully recovered from the remaining signal.
Multi-step diffusion models (DiffBIR, SUPIR): While possessing strong generative priors, their multi-step denoising incurs enormous computational overhead (188T MACs, ~50 seconds for 50 steps).
Existing one-step diffusion models (OSEDiff): Inference-efficient but agnostic to JPEG compression priors, unable to distinguish compression artifacts from natural image features.
Insufficient use of compression priors: Prior quality factor (QF) learning methods treat QF (a single integer) as the sole learning target, capturing very limited information; quantization table methods provide only static numerical values.
Root Cause¶
How can JPEG compression priors be effectively extracted and leveraged to guide a diffusion model while preserving one-step inference efficiency?
Core Idea: Design a Compression-aware Visual Embedder (CaVE) that comprehensively captures JPEG compression characteristics via a dual strategy—explicit learning (QF prediction) and implicit learning (high-quality image reconstruction)—and inject the extracted priors into a one-step diffusion model.
Method¶
Overall Architecture¶
CODiff adopts a two-stage training pipeline: - Stage 1: Train CaVE to extract JPEG compression prior embeddings. - Stage 2: Inject the priors extracted by CaVE into a pretrained Stable Diffusion model (fine-tuned via LoRA), training the generator with perceptual loss and GAN loss.
Key Designs¶
1. Compression-aware Visual Embedder (CaVE)¶
- Function: Encodes the low-quality image \(\mathbf{I}_L\) into a set of feature vectors \(\mathbf{c}_L = \{\mathbf{c}_{L_k} \in \mathbb{R}^d\}_{k=1}^K\) serving as JPEG compression priors.
- Core Architecture: UNet encoder + lightweight QF predictor + UNet decoder.
- Design Motivation: Exploits the multi-scale feature extraction capacity of UNet to capture compression-related information across multiple resolutions.
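A shape-level sketch of the CaVE interface is given below. The dimensions (K=4, d=64) and the average-pooling "encoder" are hypothetical stand-ins; the paper's CaVE is a UNet encoder with a lightweight QF predictor, and only the input/output shapes here follow the text.

```python
import numpy as np

rng = np.random.default_rng(0)
K, d = 4, 64  # number of prior vectors and their width (hypothetical values)

def avg_pool2x(x):
    """2x2 average pooling over an (H, W, C) array."""
    h, w, c = x.shape
    return x.reshape(h // 2, 2, w // 2, 2, c).mean(axis=(1, 3))

def cave_encode(img):
    """Map a low-quality image to K d-dim compression-prior vectors c_L."""
    feat = img
    for _ in range(4):  # downsample 256 -> 16, mimicking a UNet encoder's strides
        feat = avg_pool2x(feat)
    flat = feat.reshape(-1)
    proj = rng.standard_normal((K * d, flat.size)) / np.sqrt(flat.size)
    return (proj @ flat).reshape(K, d)

def qf_head(c):
    """Lightweight QF predictor: a single linear layer on the mean prior vector."""
    w = rng.standard_normal(d) / np.sqrt(d)
    return float(w @ c.mean(axis=0))

c_L = cave_encode(rng.random((256, 256, 3)))
print(c_L.shape)  # (4, 64): a set of K vectors in R^d, matching the paper's c_L
```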
2. Dual Learning Strategy¶
- Explicit learning: Trains CaVE to predict QF from the embedding using an \(\ell_1\) loss: \(\mathcal{L}_{QF} = \frac{1}{B}\sum_{i=1}^{B}\|QF_{pred}^i - QF_{gt}^i\|_1\)
  - Motivation: Encourages the embedding to explicitly distinguish different compression levels.
  - Limitation: QF prediction alone fails to generalize to unseen compression levels (confirmed by t-SNE visualization).
- Implicit learning: Trains CaVE to reconstruct the high-quality image from the embedding using an \(\ell_1\) loss: \(\mathcal{L}_{rec} = \frac{1}{B}\sum_{i=1}^{B}\|\hat{\mathbf{I}}_H^i - \mathbf{I}_H^i\|_1\)
  - Motivation: The reconstruction objective forces the embedding to capture the complete information of the compression process, rather than a single QF integer.
- Joint objective: \(\mathcal{L}_{CaVE} = \mathcal{L}_{QF} + \lambda \cdot \mathcal{L}_{rec}\), where \(\lambda = 1000\).
- Key finding: After dual learning, CaVE can effectively distinguish unseen compression levels (QF=1, 5) at test time, whereas purely explicit learning cannot.
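The joint objective can be sketched numerically as follows. The tensors are random stand-ins, and the reduction inside \(\|\cdot\|_1\) (per-pixel mean vs. sum) is an assumption not fixed by these notes; only the form \(\mathcal{L}_{QF} + 1000 \cdot \mathcal{L}_{rec}\) comes from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
B = 8  # batch size (illustrative)
qf_pred = rng.uniform(8, 95, B)   # predicted quality factors
qf_gt = rng.uniform(8, 95, B)     # ground-truth quality factors
img_rec = rng.random((B, 256, 256, 3))  # CaVE-decoded reconstructions
img_gt = rng.random((B, 256, 256, 3))   # high-quality targets

# Explicit term: l1 between predicted and ground-truth QF, averaged over the batch.
l_qf = np.abs(qf_pred - qf_gt).mean()
# Implicit term: l1 between reconstruction and target; per-pixel mean is assumed here.
l_rec = np.abs(img_rec - img_gt).mean()
# Joint CaVE objective with lambda = 1000, as stated in the paper.
l_cave = l_qf + 1000.0 * l_rec
print(l_qf, l_rec, l_cave)
```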
3. One-Step Diffusion Generator¶
- Function: Takes the latent representation of the low-quality image as input (replacing Gaussian noise) and recovers the high-quality image in a single denoising step.
- Core formula: \(\hat{\mathbf{z}}_H = \frac{\mathbf{z}_L - \sqrt{1-\bar{\alpha}_{T_L}}\, \varepsilon_\theta(\mathbf{z}_L; \mathbf{c}_L, T_L)}{\sqrt{\bar{\alpha}_{T_L}}}\)
- Training: The VAE encoder and UNet are fine-tuned via LoRA (rank=16); the VAE decoder is frozen.
- Discriminator: Pretrained SD UNet encoder + lightweight MLP.
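A sanity check on the one-step formula: if the latent really were \(\sqrt{\bar{\alpha}}\,\mathbf{z}_H + \sqrt{1-\bar{\alpha}}\,\varepsilon\) and the network predicted \(\varepsilon\) exactly, the formula would invert the noising in closed form. In CODiff, \(\mathbf{z}_L\) is the latent of the low-quality image rather than a truly noised \(\mathbf{z}_H\), so this is only the idealized case; \(\bar{\alpha}_{T_L} = 0.35\) is a placeholder value.

```python
import numpy as np

rng = np.random.default_rng(0)
abar = 0.35  # \bar{alpha}_{T_L}, hypothetical value
z_H = rng.standard_normal((4, 32, 32))  # "clean" latent
eps = rng.standard_normal((4, 32, 32))  # noise the network would have to predict

# Idealized forward noising (not what CODiff actually feeds in, see lead-in).
z_L = np.sqrt(abar) * z_H + np.sqrt(1 - abar) * eps
# The paper's one-step formula with a perfect epsilon prediction.
z_hat = (z_L - np.sqrt(1 - abar) * eps) / np.sqrt(abar)

print(np.allclose(z_hat, z_H))  # True: the formula inverts the noising exactly
```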
Loss & Training¶
Stage 2 total loss: \(\mathcal{L} = \mathcal{L}_{per} + \lambda_G \mathcal{L}_{\mathcal{G}}\)
- Perceptual loss: \(\mathcal{L}_{per} = \mathcal{L}_2(\hat{\mathbf{I}}_H, \mathbf{I}_H) + \lambda_D \mathcal{L}_{DISTS}(\hat{\mathbf{I}}_H, \mathbf{I}_H)\)
- GAN loss: Standard adversarial loss with \(\lambda_G = 5 \times 10^{-3}\)
- Distillation is not used (unlike OSEDiff); GAN training is employed instead to enhance perceptual realism.
Training details:
- Stage 1: 200K iterations on 4× A6000 GPUs
- Stage 2: 100K iterations on 4× A6000 GPUs, AdamW, lr = 5e-5
- Training QF range: 8–95; patch size: 256×256
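The Stage-2 objective can be sketched as below. DISTS is a learned perceptual metric, so a crude correlation-based stand-in is used here; that stub, the \(\lambda_D = 1.0\) weight (not specified in these notes), and the generator-loss value are all hypothetical. Only \(\lambda_G = 5 \times 10^{-3}\) and the overall structure come from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
pred = rng.random((256, 256, 3))    # generator output (stand-in)
target = rng.random((256, 256, 3))  # high-quality ground truth (stand-in)

# L2 term of the perceptual loss.
l2 = ((pred - target) ** 2).mean()
# Placeholder for DISTS: 1 - Pearson correlation of flattened images.
# The real DISTS uses deep features; this stub only keeps the "0 when identical" shape.
dists_stub = 1.0 - np.corrcoef(pred.ravel(), target.ravel())[0, 1]
lambda_d = 1.0    # assumed; the paper's value is not given in these notes
lambda_g = 5e-3   # from the paper
# Illustrative non-saturating generator loss at D(x) = 0.5.
l_gan = -np.log(0.5)

l_total = l2 + lambda_d * dists_stub + lambda_g * l_gan
print(l_total)
```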
Key Experimental Results¶
Main Results¶
LIVE-1 dataset (QF=5, perceptual quality metrics):
| Method | Steps | LPIPS↓ | DISTS↓ | MUSIQ↑ | MANIQA↑ | CLIPIQA↑ |
|---|---|---|---|---|---|---|
| JPEG | — | 0.4384 | 0.3242 | 40.33 | 0.2294 | 0.1716 |
| FBCNN | 1 | 0.3736 | 0.2353 | 63.56 | 0.3425 | 0.2763 |
| PromptCIR | 1 | 0.3797 | 0.2334 | 60.34 | 0.2790 | 0.2655 |
| DiffBIR* | 50 | 0.3509 | 0.2035 | 58.09 | 0.2812 | 0.3776 |
| OSEDiff* | 1 | 0.2675 | 0.1653 | 65.51 | 0.3417 | 0.5623 |
| CODiff | 1 | 0.2062 | 0.1121 | 73.16 | 0.5321 | 0.7212 |
Computational efficiency (1024×1024 input):
| Method | Steps | Params (G) | MACs (T) | Time (s) |
|---|---|---|---|---|
| DiffBIR | 50 | 1.52 | 188.24 | 50.81 |
| SUPIR | 50 | 4.49 | 464.29 | 24.33 |
| OSEDiff | 1 | 1.40 | 10.39 | 0.65 |
| CODiff | 1 | 1.00 | 9.46 | 0.57 |
Ablation Study¶
Comparison of prompt generation strategies (LIVE-1, QF=5):
| Method | LPIPS↓ | MUSIQ↑ | MANIQA↑ |
|---|---|---|---|
| Empty string | 0.3485 | 62.56 | 0.3793 |
| Learnable | 0.3471 | 63.39 | 0.3900 |
| DAPE | 0.3463 | 62.54 | 0.3793 |
| CaVE (Ours) | 0.3426 | 67.13 | 0.4584 |
The effectiveness of the dual learning strategy is clearly demonstrated through t-SNE visualization: CaVE trained with only explicit learning fails to distinguish unseen QF values (1, 5), whereas dual learning yields well-separated clusters.
Key Findings¶
- CODiff comprehensively outperforms existing methods across all three datasets and all QF levels, including 50-step DiffBIR and SUPIR.
- The advantage is most pronounced at extreme compression (QF=5): LPIPS drops from 0.2675 to 0.2062, and MANIQA rises from 0.3417 to 0.5321.
- Inference is 89× faster than DiffBIR (0.57s vs. 50.81s) with 34% fewer parameters.
- CaVE is the core driver of performance gains: in the prompt ablation its MANIQA score is 0.4584 vs. 0.3793 for DAPE (about 21% higher).
- Dual learning substantially outperforms purely explicit or purely implicit learning.
Highlights & Insights¶
- Elegant exploitation of compression priors: Rather than merely predicting QF, the model also employs a reconstruction objective to understand the full compression process—a generalizable idea of using multi-task learning to enrich representations.
- Lightweight design: CaVE relies only on a UNet encoder (without a full UNet), making it far more lightweight than auxiliary modules such as ControlNet or DAPE.
- No distillation required: Unlike OSEDiff, CODiff replaces distillation with GAN training, removing the performance ceiling imposed by a teacher model.
- t-SNE visualization: Intuitively demonstrates how dual learning enables generalization to unseen compression levels.
Limitations & Future Work¶
- The training QF range is 8–95; generalization to extremely low QF values (1–7) relies on the implicit extrapolation capacity of dual learning.
- The method is designed specifically for JPEG compression and does not explore modern formats such as WebP or HEIF.
- GAN training may introduce mode collapse risks.
- Evaluation is not conducted on real-world compressed images (as opposed to synthetically degraded ones).
- The UNet decoder of CaVE is used only during Stage 1 training and discarded in Stage 2, so its training cost yields no inference-time benefit.
Related Work & Insights¶
- FBCNN first proposed predicting an adjustable QF as a compression prior; CODiff substantially improves upon this via dual learning.
- OSEDiff demonstrated the viability of one-step diffusion for image restoration; CODiff further demonstrates the importance of domain-specific priors.
- The design philosophy of CaVE (multi-task representation extraction) is generalizable to other degradation types (blur, noise).
Rating¶
- Novelty: ⭐⭐⭐⭐ — The dual learning strategy and injection of compression priors into a diffusion model constitute an effective and novel design.
- Experimental Thoroughness: ⭐⭐⭐⭐ — Three datasets, multiple QF levels, comprehensive metric suite, and complete ablation study.
- Writing Quality: ⭐⭐⭐⭐ — Clear figures; t-SNE visualizations are persuasive.
- Value: ⭐⭐⭐⭐⭐ — High practical value with fast inference, strong performance, and open-source code.