Diff-ICMH: Harmonizing Machine and Human Vision in Image Compression with Generative Prior

Basic Information

  • arXiv: 2511.22549
  • Conference: NeurIPS 2025
  • Authors: Ruoyu Feng, Yunpeng Qi, Jinming Liu, Yixin Gao, Xin Li, Xin Jin, Zhibo Chen
  • Affiliation: USTC, Eastern Institute of Technology (Ningbo)
  • Code: https://github.com/RuoyuFeng/Diff-ICMH

TL;DR

This paper proposes Diff-ICMH, a diffusion-based generative image compression framework that preserves semantic integrity via a Semantic Consistency (SC) loss and activates generative priors via a Tag Guidance Module (TGM). Using a single encoder-decoder and a single bitstream, the framework simultaneously serves 10+ machine intelligence tasks and human visual perception without any task-specific adaptation.

Background & Motivation

Image compression faces a fundamental split between two optimization objectives:

  1. Human-oriented compression: Optimizes pixel fidelity (PSNR) or perceptual quality (LPIPS/FID), but offers poor support for machine vision tasks.
  2. Image Coding for Machines (ICM):
    • Traditional codec approaches: Adapted via quantization parameter tuning or bit allocation, constrained by non-differentiable fidelity-driven designs.
    • Task-driven end-to-end methods: Strong on specific tasks but poor cross-task generalization.
    • Feature compression methods: Directly compress intermediate features, tightly coupled to specific models, and incompatible with human viewing.

Key Insight: Semantic integrity and perceptual realism are jointly required by both machine intelligence and human perception — these two objectives are not inherently opposed.

Core Problem

How to design a universal image codec that efficiently serves multiple downstream machine intelligence tasks and human visual perception simultaneously from a single bitstream?

Method

1. Design Philosophy

Fidelity-driven compression introduces two primary sources of information loss:

  • Semantic distortion: Loss of core semantic information → directly impairs task analysis.
  • Perceptual mismatch: Textures and details deviate from the natural distribution → domain shift causes accumulated errors in feature extraction.

Experimental validation (Figure 3): At deep layers of ResNet50 (layer4), fidelity-driven codecs (VTM, ELIC) exhibit far greater feature deviation than generative codecs (MS-ILLM), demonstrating that realistic textures effectively mitigate error accumulation in deep layers.

2. Overall Architecture

  • Encoding side: Input image \(\mathbf{x}\) is compressed into latent features \(\hat{\mathbf{z}}\), targeting the VAE latent space of Stable Diffusion (\(8\times\) spatial downsampling).
  • Tag extraction: Recognize Anything extracts word-level semantic tags \(\mathbf{c}\).
  • Bitstream: Compressed latents + tag IDs (losslessly encoded, ~100 bits/image).
  • Decoding side: \(\hat{\mathbf{z}}\) is fed as a condition into a ControlNet-style control module, jointly with a frozen Stable Diffusion model for generative reconstruction.
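
To make the data flow concrete, here is a minimal numpy sketch of the encoding side under toy assumptions: average pooling stands in for the learned encoder targeting the \(8\times\)-downsampled VAE latent space, and the tag IDs are placeholders for the Recognize Anything output. The function name `encode` and the dictionary size are illustrative, not from the paper.

```python
import numpy as np

def encode(image: np.ndarray, num_tags: int = 7, dict_size: int = 4096):
    """Toy stand-in for Diff-ICMH encoding: 8x-downsampled latent + tag IDs.

    The real encoder maps into Stable Diffusion's VAE latent space; here we
    simply average-pool 8x8 patches to illustrate the spatial reduction.
    """
    h, w, c = image.shape
    latent = image.reshape(h // 8, 8, w // 8, 8, c).mean(axis=(1, 3))
    tag_ids = np.arange(num_tags)  # placeholder for Recognize Anything tags
    # Each tag index costs log2(dict_size) bits when losslessly encoded.
    tag_bits = num_tags * int(np.ceil(np.log2(dict_size)))
    return latent, tag_ids, tag_bits

latent, tag_ids, tag_bits = encode(np.zeros((512, 512, 3)))
print(latent.shape, tag_bits)  # (64, 64, 3) 84
```

With a 4096-entry dictionary, seven tags cost 84 bits, consistent with the ~100 bits/image overhead reported in the paper.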

3. Semantic Consistency Loss (SC Loss)

The pretrained diffusion model's feature extraction capability is leveraged as a semantic space:

\[\mathcal{L}_\text{sem} = -\mathbb{E}_{\mathbf{z}, \hat{\mathbf{z}}} \left[ \frac{1}{N} \sum_{n=1}^N \text{sim}(f(\mathbf{z})_n, f(\hat{\mathbf{z}})_n) \right]\]

where \(f(\cdot)\) denotes the forward pass of the frozen diffusion model and \(\text{sim}\) is cosine similarity:

\[\text{sim}(\mathbf{z}, \hat{\mathbf{z}}) = \frac{\mathbf{z}^\top \hat{\mathbf{z}}}{\|\mathbf{z}\|_2 \|\hat{\mathbf{z}}\|_2}\]

Key design choices:

  • Applying the loss at the middle block of the U-Net yields the best performance — deep features better capture abstract semantics.
  • Using noise-free input (\(t=0\)) — the diffusion model's semantic feature extraction is optimal on clean signals.
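
The SC loss reduces to a negative mean cosine similarity over feature positions. A minimal numpy sketch, where `z` and `z_hat` stand in for the frozen U-Net middle-block features \(f(\mathbf{z})\) and \(f(\hat{\mathbf{z}})\) flattened to shape (N, D):

```python
import numpy as np

def sc_loss(z: np.ndarray, z_hat: np.ndarray) -> float:
    """Semantic Consistency loss: negative mean cosine similarity between
    per-position features of the original and reconstructed latents."""
    num = (z * z_hat).sum(axis=1)
    den = np.linalg.norm(z, axis=1) * np.linalg.norm(z_hat, axis=1)
    return float(-np.mean(num / den))

z = np.array([[1.0, 0.0], [0.0, 1.0]])
print(sc_loss(z, z))                # -1.0  (identical features, max similarity)
print(sc_loss(z, z[::-1].copy()))  #  0.0  (orthogonal features)
```

Minimizing this loss pushes reconstructed features toward the original's direction at every position, which is what preserves semantics independently of pixel fidelity.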

4. Tag Guidance Module (TGM)

  • A pretrained tag extractor \(\mathcal{E}_t\) (Recognize Anything) generates image-level tags.
  • Tags are mapped to numerical indices in a predefined dictionary and losslessly encoded.
  • At decoding, indices are converted back to text strings and used as conditions for the diffusion model and control module.
  • Classifier-Free Guidance (CFG scale = 5.0) is applied at inference to enhance semantic clarity.
  • Overhead is minimal: approximately 100 bits/image.
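
The tag round-trip is a simple dictionary lookup. A sketch with a hypothetical seven-word dictionary (the paper uses Recognize Anything's full vocabulary):

```python
# Hypothetical tag dictionary; the real one is Recognize Anything's vocabulary.
TAG_DICT = ["dog", "grass", "ball", "sky", "tree", "person", "car"]

def encode_tags(tags):
    """Map word-level tags to dictionary indices for lossless transmission."""
    return [TAG_DICT.index(t) for t in tags]

def decode_tags(indices):
    """Recover text tags at the decoder to condition the diffusion model."""
    return [TAG_DICT[i] for i in indices]

ids = encode_tags(["dog", "ball"])
print(decode_tags(ids))  # ['dog', 'ball']
```

With a 4096-entry dictionary each index costs 12 bits, so a handful of tags stays near the ~100 bits/image overhead the paper reports.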

5. Complete Loss Function

\[\mathcal{L}_\text{final} = \lambda_\text{rate} \mathcal{L}_\text{rate} + \lambda_\text{dist} \mathcal{L}_\text{dist} + \lambda_\text{diff} \mathcal{L}_\text{diff} + \lambda_\text{sem} \mathcal{L}_\text{sem}\]

Components:

  • \(\mathcal{L}_\text{rate}\): Rate loss (estimated entropy of quantized latents and hyperprior).
  • \(\mathcal{L}_\text{dist} = \|\mathcal{E}_\text{VAE}(\mathbf{x}) - \mathcal{D}_c(\hat{\mathbf{y}})\|_2^2\): Latent-space reconstruction loss.
  • \(\mathcal{L}_\text{diff} = \mathbb{E}[\|\boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta(\mathbf{z}_t, \hat{\mathbf{z}}, \mathbf{c}, t)\|_2^2]\): Diffusion noise prediction loss.
  • Weights: \(\lambda_\text{dist} = \lambda_\text{diff} = 1\), \(\lambda_\text{sem} = 2\), \(\lambda_\text{rate} \in \{2, 4, 8, 16, 32\}\).
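
The full objective is a straightforward weighted sum; a sketch with the paper's reported weights (default \(\lambda_\text{rate}\) chosen arbitrarily from the listed set):

```python
def final_loss(l_rate, l_dist, l_diff, l_sem,
               lam_rate=8.0, lam_dist=1.0, lam_diff=1.0, lam_sem=2.0):
    """Weighted combination of the four training losses; lam_rate is swept
    over {2, 4, 8, 16, 32} to trade bitrate against quality."""
    return lam_rate * l_rate + lam_dist * l_dist + lam_diff * l_diff + lam_sem * l_sem

# 8*0.1 + 1*0.2 + 1*0.3 + 2*0.4
print(round(final_loss(0.1, 0.2, 0.3, 0.4), 6))  # 2.1
```

Larger \(\lambda_\text{rate}\) penalizes entropy more heavily, producing the lower-bitrate operating points on the rate-distortion curve.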

6. Training Details

  • Dataset: LSDIR, randomly cropped to \(512\times512\).
  • Base model: Stable Diffusion 2.1 (frozen).
  • Two-stage training: 200K steps at high bitrate → 200K steps of multi-rate fine-tuning.
  • DDIM sampling with 50 steps at inference.

Key Experimental Results

Machine Intelligence Task Performance (10+ Tasks, Single Codec/Bitstream)

  • COCO object detection / instance segmentation / panoptic segmentation: Consistently outperforms VTM, ELIC, and most perception-oriented methods.
  • Flickr30K cross-modal retrieval: Significant advantage at ultra-low bitrates (0.01–0.05 bpp).
  • ADE20K open-vocabulary segmentation: Substantially outperforms competing methods at 0.02–0.1 bpp.
  • All results are achieved without any task-specific fine-tuning.

Perceptual Quality (Kodak / Tecnick / CLIC2020)

| Metric Type | Diff-ICMH vs. Competitors |
| --- | --- |
| PSNR / MS-SSIM (fidelity) | Below VTM/ELIC (inherent to generative methods); on par with DiffEIC |
| LPIPS ↓ | Outperforms all competing methods |
| FID ↓ | State of the art |
| DISTS ↓ | State of the art |

  • Perceptual advantages are most pronounced at ultra-low bitrates.

Ablation Study (COCO Object Detection mAP)

  • SC loss + TGM combination: ~4 mAP improvement over the baseline at ~0.025 bpp.
  • Optimal SC loss configuration: \(\lambda_\text{sem}=2.0\), middle block, noise-free input.

Highlights & Insights

  1. One codec, multiple uses: A single codec and bitstream support 10+ downstream tasks and human viewing without any adaptation.
  2. Dual guarantee of semantics and perception: SC loss preserves semantic integrity; the generative framework ensures perceptual realism.
  3. TGM with negligible overhead: Only ~100 bits of tag information suffice to substantially activate the generative prior.
  4. Latent-space compression design: Decoding to the VAE latent space rather than pixel space naturally filters out semantically irrelevant redundancy.
  5. Ultra-low bitrate advantage: Performance gains are most significant under extreme conditions of 0.01–0.05 bpp.

Limitations & Future Work

  1. Decoding speed: Diffusion denoising requires 50 iterative forward passes, making decoding substantially slower than traditional codecs.
  2. Fidelity loss: PSNR/MS-SSIM are below fidelity-driven methods, making the approach unsuitable for scenarios requiring pixel-accurate reconstruction.
  3. VAE bottleneck: The \(8\times\) downsampled latent space may discard fine spatial details (e.g., pose estimation).
  4. Training cost: Two-stage training requires GPU resources commensurate with Stable Diffusion.

Comparison with Prior Methods

  • vs. VTM/ELIC (fidelity-driven): Diff-ICMH achieves lower PSNR but substantially outperforms them on machine intelligence tasks and perceptual quality.
  • vs. DiffEIC (diffusion-based compression): Clearly better on machine tasks; on par or better on perceptual quality.
  • vs. TransTIC/Adapter-ICMH (task-adaptive): Diff-ICMH achieves superior performance without any adaptation.
  • vs. feature compression methods: Diff-ICMH operates in the image domain, so the same bitstream also supports human viewing.

Takeaways

  • Semantic preservation is the key to universal compression: The success of SC loss demonstrates that semantic information is the shared foundation of both machine task performance and human understanding.
  • A new role for generative compression: Beyond pursuing perceptual quality, generative compression represents a pathway toward universal intelligent codecs.
  • CFG in compression: Classifier-Free Guidance is extended from generation to conditional enhancement in compression.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ — First work to design a universal codec for both machine and human use from a unified semantic-perceptual perspective.
  • Technical Depth: ⭐⭐⭐⭐⭐ — SC loss design is well-motivated and rigorously validated through ablation studies.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ — 10 tasks + 3 perceptual datasets + comprehensive ablations.
  • Writing Quality: ⭐⭐⭐⭐☆ — Framework is clearly presented, though the extensive reference list slightly reduces readability.