PICD: Versatile Perceptual Image Compression with Diffusion Rendering¶

Conference: CVPR 2025
arXiv: 2505.05853
Code: None
Area: Diffusion Models / Image Compression
Keywords: Perceptual Image Compression, Diffusion Model Rendering, Screen Content Compression, Text Accuracy, Image Codec

TL;DR¶

PICD proposes a versatile perceptual image compression framework. It losslessly encodes text information and "renders" it together with the compressed image using a diffusion model, strengthening the conditional diffusion model at three levels (domain level, adapter level, and instance level). This yields high visual quality and high text accuracy on both screen content and natural images.

Background & Motivation¶

Background: Perceptual Image Compression utilizes generative models (GANs or diffusion models) as decoders to maintain high visual quality at low bitrates. Representative methods include MS-ILLM, PerCo, CDC, etc. These methods achieve perceptual losslessness by matching the marginal distribution of reconstructed images to the original image distribution.

Limitations of Prior Work: Existing perceptual codecs perform well on natural images but poorly on screen content (e.g., screenshots, web pages). The core issue is that perceptual codecs only guarantee marginal distribution matching, ignoring the accuracy of specific textual content—for instance, reconstructing the letter "a" from the original image as "c" is still considered perceptually lossless, which is unacceptable for screen content. Conversely, existing screen content codecs (such as VTM-SCC) prioritize textual accuracy but suffer from poor perceptual quality, resulting in severe blurriness under low bitrates.

Key Challenge: There is a conflict between text accuracy and perceptual quality—the former requires precise reconstruction of specific pixels, while the latter allows for "reasonable substitutions." Under the rate-distortion-perception three-way trade-off, satisfying both simultaneously is highly challenging.

Goal: To design a versatile perceptual codec effective for both screen content and natural images, achieving both high text accuracy and high visual quality.

Key Insight: It is observed that text information itself has very low entropy (compressible losslessly into just a few KB). Furthermore, according to \(H(Y|Z) + H(Z) = H(Y)\), encoding text first followed by conditional image encoding theoretically does not increase the total bitrate. Therefore, text and images can be encoded separately, and then "rendered" together into a complete reconstructed image using a diffusion model.
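The identity quoted above is a special case of the entropy chain rule. As a short worked step, assume that \(Z\) can be recovered deterministically from \(Y\), so that \(H(Z|Y) = 0\); this condition is stated here as an assumption for the derivation and is not quoted from the paper:

\[
H(Z) + H(Y \mid Z) = H(Y, Z) = H(Y) + H(Z \mid Y) = H(Y) \quad \text{when } H(Z \mid Y) = 0 .
\]

Without that condition, the left-hand side exceeds \(H(Y)\) by exactly \(H(Z \mid Y)\), which is the most that sending the text separately can cost.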

Core Idea: OCR is used to extract textual information for lossless encoding. The compressed image and textual information are then fed into the diffusion model as conditions, realizing high-quality "diffusion rendering" via a three-level conditional enhancement.

Method¶

Overall Architecture¶

The PICD encoder (1) extracts text content and location information \(Z\) from the screen image using OCR (Tesseract) and compresses it losslessly (cmix + Exponential-Golomb coding), then (2) lossily compresses the image \(X\) conditioned on \(Z\) using the MLIC codec, producing the bitstream \(Y\) and the reconstructed image \(\bar{X}\). The decoder (1) decodes the text \(Z\) first and renders it into a text glyph image \(\bar{Z}\), then (2) feeds \(\bar{X}\) and \(\bar{Z}\) into a conditional diffusion model for "diffusion rendering", producing the final reconstruction \(\hat{X} \sim p_\theta(X|\bar{X}, Z)\). For natural images, the pipeline is simplified: the OCR text condition is omitted, and captions generated by BLIP serve as the text input.
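To make the Exponential-Golomb part of the lossless text coding concrete, here is a minimal sketch of the order-0 code applied to one text bounding box. It assumes that the integer location fields are the part coded with Exponential-Golomb and that a box is stored as `(x, y, width, height)`; both the field layout and the sample values are hypothetical and not taken from the paper.

```python
def exp_golomb_encode(n: int) -> str:
    """Order-0 Exponential-Golomb code of a non-negative integer, as a bit string."""
    if n < 0:
        raise ValueError("n must be non-negative")
    binary = bin(n + 1)[2:]                  # binary representation of n + 1
    return "0" * (len(binary) - 1) + binary  # zero prefix encodes the length


def exp_golomb_decode(bits: str) -> list[int]:
    """Decode a concatenation of order-0 Exponential-Golomb codewords."""
    values, i = [], 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":                # count the leading zeros
            zeros += 1
            i += 1
        values.append(int(bits[i:i + zeros + 1], 2) - 1)
        i += zeros + 1
    return values


# Hypothetical bounding box (x, y, width, height) of one OCR text line.
box = [12, 340, 96, 18]
bitstream = "".join(exp_golomb_encode(v) for v in box)
decoded = exp_golomb_decode(bitstream)
```

Small values get short codewords (`0` maps to `1`, `3` maps to `00100`), which suits fields such as text height that tend to be small.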

Key Designs¶

Domain-Level Conditioning:
- Function: Adapts the pre-trained Stable Diffusion to the image domain of screen content.
- Mechanism: Uses the WebUI dataset (400k webpage screenshots) and concatenates OCR-extracted text content as prompts (format: "a screenshot with text: ...") to fine-tune Stable Diffusion with LoRA (rank=256). Before fine-tuning, the model cannot generate screenshot-like images; after fine-tuning, it correctly generates images with typical screen layouts.
- Design Motivation: Stable Diffusion's pre-training did not meaningfully cover screen content, so without fine-tuning it cannot model the distribution of screenshots. This is the most fundamental and critical step of the improvement.
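The prompt format used for domain-level fine-tuning can be sketched as a small helper. The function name `build_prompt` and the `max_chars` cutoff are hypothetical; the summary only specifies the `"a screenshot with text: ..."` template, and how long OCR outputs are truncated is an assumption here.

```python
def build_prompt(ocr_words: list[str], max_chars: int = 200) -> str:
    """Join OCR words into the 'a screenshot with text: ...' prompt template."""
    # Drop empty or whitespace-only OCR fragments before joining.
    text = " ".join(w.strip() for w in ocr_words if w.strip())
    # Hypothetical length cap, since text encoders accept a bounded prompt length.
    if len(text) > max_chars:
        text = text[:max_chars].rstrip()
    return f"a screenshot with text: {text}"


prompt = build_prompt(["Sign", " in ", "", "Create account"])
```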
Adapter-Level Conditioning:
- Function: Efficiently injects the information of the compressed image \(\bar{X}\) and the text glyph image \(\bar{Z}\) into the diffusion model.
- Mechanism: Proposes a hybrid adapter strategy. The glyph image \(\bar{Z}\) passes only through the ControlNet feature encoder, so that the SD-VAE encoder cannot corrupt the text information. The compressed image \(\bar{X}\) passes through both the ControlNet feature encoder and the SD-VAE encoder, which provide complementary information, plus an additional pixel shuffle transformation that keeps pixel-level information intact. The three-way features are concatenated and then injected into the UNet via SPADE layers.
- Design Motivation: Vanilla ControlNet is ill-suited for low-level vision tasks (lacking control strength in its encoder and residual layers), and StableSR's VAE encoder degrades glyph images. The hybrid scheme harnesses the strengths of both—the VAE encoder offers excellent image representation while the ControlNet encoder safeguards textual details.
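One common way to bring a full-resolution image down to a coarser feature grid without discarding any pixel is the space-to-depth rearrangement, often called pixel unshuffle. The sketch below illustrates that rearrangement on nested lists. Whether PICD applies exactly this direction of the pixel shuffle, and with which factor, is an assumption; the function name and the toy input are hypothetical.

```python
def pixel_unshuffle(image, r):
    """Space-to-depth: (C, H, W) nested lists -> (C*r*r, H//r, W//r)."""
    c, h, w = len(image), len(image[0]), len(image[0][0])
    if h % r or w % r:
        raise ValueError("height and width must be divisible by r")
    out = []
    for ch in range(c):
        for dy in range(r):          # row offset inside each r x r block
            for dx in range(r):      # column offset inside each r x r block
                out.append([[image[ch][y * r + dy][x * r + dx]
                             for x in range(w // r)]
                            for y in range(h // r)])
    return out


# One-channel 4x4 toy image; every pixel survives in one of the 4 output channels.
image = [[[0, 1, 2, 3],
          [4, 5, 6, 7],
          [8, 9, 10, 11],
          [12, 13, 14, 15]]]
features = pixel_unshuffle(image, 2)
```

Because the operation only moves values between the spatial and channel axes, it is exactly invertible, unlike a learned encoder that may smooth fine detail.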
Instance-Level Conditioning:
- Function: Further enhances compliance with the conditioning information during the sampling process.
- Mechanism: Adds gradient guidance after each step of DDPM sampling, consisting of two loss terms: (a) OCR loss—ensuring that the OCR output of the intermediate denoised result \(\mathbb{E}[X_0|X_t]\) aligns with \(\bar{Z}\) (utilizing MSE of OCR logits); (b) re-compression loss—ensuring that the intermediate denoised result resembles \(\bar{X}\) after MLIC coding (reducing color shift). The guidance strength is controlled by hyperparameters \(\zeta_1, \zeta_2\).
- Design Motivation: Trained conditional diffusion models may imperfectly adhere to conditions during actual sampling. Utilizing instance-level guidance enforces text accuracy and minimizes color deviations at inference time, echoing the concept of classifier guidance adapted for compression tasks.
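The guidance update can be illustrated with a toy example in which both loss terms are plain mean squared errors with analytic gradients. In the actual method the gradients would flow through the OCR network and the MLIC codec applied to \(\mathbb{E}[X_0|X_t]\); here identity maps stand in for both, and all names and numbers are hypothetical.

```python
def mse(x, target):
    """Mean squared error between two equal-length vectors."""
    return sum((xi - ti) ** 2 for xi, ti in zip(x, target)) / len(x)


def mse_grad(x, target):
    """Gradient of mse(x, target) with respect to x."""
    n = len(x)
    return [2.0 * (xi - ti) / n for xi, ti in zip(x, target)]


def guided_update(x, ocr_target, recompress_target, zeta1, zeta2):
    """One guidance step: move x against the gradients of both loss terms."""
    g_ocr = mse_grad(x, ocr_target)          # stand-in for the OCR loss gradient
    g_rec = mse_grad(x, recompress_target)   # stand-in for the re-compression loss gradient
    return [xi - zeta1 * a - zeta2 * b for xi, a, b in zip(x, g_ocr, g_rec)]


x = [0.0, 1.0, 2.0]                  # toy intermediate sample
ocr_target = [1.0, 1.0, 1.0]         # toy target for the OCR term
recompress_target = [0.5, 1.0, 1.5]  # toy target for the re-compression term
x_new = guided_update(x, ocr_target, recompress_target, zeta1=0.3, zeta2=0.3)
```

With these toy values a single step lowers both losses, which mirrors the role of \(\zeta_1\) and \(\zeta_2\) as step sizes that trade off the two terms.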

Loss & Training¶

Components are trained separately: (1) Domain-level fine-tuning of Stable Diffusion with LoRA; (2) Fine-tuning of the ControlNet branch for the MLIC encoder (freezing the pre-trained MLIC, training the cloned encoder parameters and zero-convolution layers); (3) Joint training of SPADE layers and ControlNet encoders at the adapter level. Instance-level guidance requires no training and represents a pure inference-time optimization.

Key Experimental Results¶

Main Results (Screen Content Images SCI1K)¶

| Method | BD-TEXT↑ | BD-PSNR↑ | BD-FID↓ | BD-CLIP↑ | BD-DISTS↓ |
|---|---|---|---|---|---|
| MLIC (Baseline) | 0.000 | 0.00 | 0.00 | 0.000 | 0.000 |
| VTM-SCC | -0.168 | -1.99 | 31.84 | -0.062 | 0.047 |
| PerCo | -0.057 | -5.01 | -19.90 | -0.023 | -0.035 |
| MS-ILLM | 0.025 | -2.59 | -2.03 | -0.121 | -0.034 |
| PICD | 0.107 | -2.97 | -20.68 | 0.030 | -0.050 |
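The BD-* columns report an average metric gap relative to the anchor codec (MLIC, hence its zero row) over a shared bitrate range. The sketch below shows the idea under a simplification: the standard Bjøntegaard procedure fits a smooth curve to each rate-quality curve, while this version uses piecewise-linear interpolation over log-bitrate. The sample curves are hypothetical and not taken from the paper.

```python
import math


def _interp(x, xs, ys):
    """Piecewise-linear interpolation of the curve (xs, ys) at x."""
    for x0, x1, y0, y1 in zip(xs, xs[1:], ys, ys[1:]):
        if x0 <= x <= x1:
            return y0 + (y1 - y0) * (x - x0) / (x1 - x0)
    raise ValueError("x lies outside the curve")


def _area(xs, ys, lo, hi):
    """Trapezoidal area under the piecewise-linear curve between lo and hi."""
    knots = [lo] + [x for x in xs if lo < x < hi] + [hi]
    vals = [_interp(k, xs, ys) for k in knots]
    return sum((b - a) * (va + vb) / 2
               for a, b, va, vb in zip(knots, knots[1:], vals, vals[1:]))


def bd_metric(anchor, test):
    """Average metric gap (test minus anchor) over the shared log-bitrate range.

    anchor, test: lists of (bpp, metric) pairs sorted by increasing bpp.
    """
    ax, ay = [math.log(r) for r, _ in anchor], [m for _, m in anchor]
    tx, ty = [math.log(r) for r, _ in test], [m for _, m in test]
    lo, hi = max(ax[0], tx[0]), min(ax[-1], tx[-1])
    if hi <= lo:
        raise ValueError("curves do not overlap in bitrate")
    return (_area(tx, ty, lo, hi) - _area(ax, ay, lo, hi)) / (hi - lo)


# Hypothetical rate-quality points; the test curve sits one unit above the anchor.
anchor_curve = [(0.1, 30.0), (0.2, 32.0), (0.4, 34.0)]
test_curve = [(0.1, 31.0), (0.2, 33.0), (0.4, 35.0)]
delta = bd_metric(anchor_curve, test_curve)
```

For metrics marked ↑ a positive value favors the test codec, and for metrics marked ↓ a negative value does.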

Ablation Study¶

| Setting | Text Acc↑ | PSNR↑ | FID↓ | CLIP↑ | LPIPS↓ |
|---|---|---|---|---|---|
| w/o glyph image (a) | 0.3468 | 19.10 | 45.83 | 0.8209 | 0.1694 |
| + ControlNet (b) | 0.4404 | 18.84 | 45.35 | 0.8617 | 0.1646 |
| + Proposed hybrid adapter (d) | 0.4081 | 19.88 | 37.90 | 0.8922 | 0.1376 |
| + Instance-level guidance (f) | 0.4445 | 23.70 | 35.54 | 0.9059 | 0.1172 |
| + Domain-level fine-tuning (g, full) | 0.4568 | 23.67 | 34.77 | 0.9082 | 0.1168 |

Key Findings¶

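PICD is the only method that improves over the MLIC baseline on text accuracy (BD-TEXT) and on BD-FID, BD-CLIP, and BD-DISTS simultaneously. MS-ILLM also improves BD-TEXT, BD-FID, and BD-DISTS, but by smaller margins, and it regresses on BD-CLIP.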
Instance-level guidance yields the most significant PSNR improvement (from 19.88 to 23.70), demonstrating that inference-time optimization is highly effective.
Although domain-level fine-tuning exhibits minimal impact on PSNR, it improves both FID and CLIP, indicating that it primarily enhances the distribution quality of generated images.
On natural images (Kodak, CLIC), PICD also achieves the best or second-best results in terms of FID metrics, validating its generalizability.
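The step-by-step gains quoted above can be reproduced from the ablation table with a few lines of arithmetic. The values are copied from the table. The settings are labeled (a), (b), (d), (f), (g), so each difference is taken between listed rows only and may bundle intermediate settings that are not shown here.

```python
# (setting, text accuracy, PSNR, FID) copied from the ablation table.
ablation = [
    ("w/o glyph image (a)", 0.3468, 19.10, 45.83),
    ("+ ControlNet (b)", 0.4404, 18.84, 45.35),
    ("+ Proposed hybrid adapter (d)", 0.4081, 19.88, 37.90),
    ("+ Instance-level guidance (f)", 0.4445, 23.70, 35.54),
    ("+ Domain-level fine-tuning (g, full)", 0.4568, 23.67, 34.77),
]


def step_deltas(rows):
    """Change of each metric relative to the previous listed row."""
    out = []
    for prev, cur in zip(rows, rows[1:]):
        out.append((cur[0],
                    round(cur[1] - prev[1], 4),   # text accuracy change
                    round(cur[2] - prev[2], 2),   # PSNR change
                    round(cur[3] - prev[3], 2)))  # FID change (negative is better)
    return out


deltas = step_deltas(ablation)
for name, d_text, d_psnr, d_fid in deltas:
    print(f"{name}: text {d_text:+.4f}, PSNR {d_psnr:+.2f}, FID {d_fid:+.2f}")
```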

Highlights & Insights¶

The theoretical justification for text-image decoupled coding is elegant. It argues via information theory that encoding text first, followed by conditional image encoding, remains optimal (\(H(Y|Z) + H(Z) = H(Y)\)), so in theory it introduces no extra bitrate penalty while ensuring lossless text recovery.
The design philosophy of three-level conditioning is highly generalizable—adjusting distribution at the domain level, injecting conditions at the adapter level, and fine-tuning output at the instance level. This coarse-to-fine hierarchical conditioning framework is applicable to any scenario requiring precise output control in diffusion models.
Formulating compression as a conditional generation problem is an intriguing paradigm shift: the decoder no longer "reconstructs" a signal but rather "generates" one that matches specific constraints.

Limitations & Future Work¶

Instance-level guidance requires forward and backward passes through the OCR model, significantly increasing decoding latency.
Text extraction relies on the accuracy of Tesseract OCR, which might prove unreliable for stylized fonts or handwriting.
Diffusion model decoding is inherently orders of magnitude slower than traditional decoding; practical deployments will require distillation or few-step sampling strategies.
Currently, only English text is processed. Generalizing the approach to other languages (such as Chinese or Arabic) demands further validation.

Comparison with Related Work¶

vs PerCo: Both are perceptual codecs using a diffusion model as the decoder, but PerCo overlooks text accuracy, causing severe text errors on screen content.
vs CDC: A diffusion-based perceptual codec that yields good perceptual quality but suffers from poor text accuracy on screen content.
vs VTM-SCC: A traditional screen content codec that offers sharp text but suffers from severe overall blurriness under low bitrates.
vs Text-Sketch: Also focuses on text accuracy but employs a sketch-based approach; PICD's diffusion rendering paradigm achieves significantly better perceptual quality.

Rating¶

Novelty: ⭐⭐⭐⭐⭐ First to systematically address text accuracy in perceptual compression, featuring an elegantly designed three-level conditioning framework.
Experimental Thoroughness: ⭐⭐⭐⭐ Evaluated on both screen and natural images with comprehensive ablations, though decoding latency comparison is missing.
Writing Quality: ⭐⭐⭐⭐ Clear theoretical analysis, although the methodology section is somewhat complex due to multiple components.
Value: ⭐⭐⭐⭐ Addresses a practical pain point (blurry text in screenshot compression), but the decoding speed of diffusion models restricts real-world application scenarios.