# Diff-ICMH: Harmonizing Machine and Human Vision in Image Compression with Generative Prior

## Basic Information
- arXiv: 2511.22549
- Conference: NeurIPS 2025
- Authors: Ruoyu Feng, Yunpeng Qi, Jinming Liu, Yixin Gao, Xin Li, Xin Jin, Zhibo Chen
- Affiliation: USTC, Eastern Institute of Technology (Ningbo)
- Code: https://github.com/RuoyuFeng/Diff-ICMH
## TL;DR
This paper proposes Diff-ICMH, a diffusion-based generative image compression framework that preserves semantic integrity via a Semantic Consistency (SC) loss and activates generative priors via a Tag Guidance Module (TGM). Using a single encoder-decoder and a single bitstream, the framework simultaneously serves 10+ machine intelligence tasks and human visual perception without any task-specific adaptation.
## Background & Motivation
Image compression faces a fundamental split between two optimization objectives:
- Human-oriented compression: Optimizes pixel fidelity (PSNR) or perceptual quality (LPIPS/FID), but offers poor support for machine vision tasks.
- Image Coding for Machines (ICM):
    - Traditional codec approaches: adapted via quantization-parameter tuning or bit allocation, constrained by non-differentiable, fidelity-driven designs.
    - Task-driven end-to-end methods: strong on specific tasks but poor cross-task generalization.
    - Feature compression methods: directly compress intermediate features, tightly coupled to specific models, and incompatible with human viewing.
Key Insight: Semantic integrity and perceptual realism are jointly required by both machine intelligence and human perception — these two objectives are not inherently opposed.
## Core Problem
How can one design a universal image codec that efficiently serves multiple downstream machine intelligence tasks and human visual perception simultaneously, all from a single bitstream?
## Method

### 1. Design Philosophy

Fidelity-driven compression introduces two primary sources of information loss:

- Semantic distortion: loss of core semantic information → directly impairs task analysis.
- Perceptual mismatch: textures and details deviate from the natural distribution → the resulting domain shift causes accumulated errors in feature extraction.
Experimental validation (Figure 3): At deep layers of ResNet50 (layer4), fidelity-driven codecs (VTM, ELIC) exhibit far greater feature deviation than generative codecs (MS-ILLM), demonstrating that realistic textures effectively mitigate error accumulation in deep layers.
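As a concrete illustration, here is a minimal sketch of how such deep-feature deviation can be probed with torchvision's ResNet50. It mirrors the spirit of the Figure 3 analysis, not the authors' exact protocol:

```python
import torch
import torch.nn.functional as F
from torchvision.models import resnet50, ResNet50_Weights
from torchvision.models.feature_extraction import create_feature_extractor

# Frozen ImageNet backbone acts as the downstream "machine observer".
backbone = resnet50(weights=ResNet50_Weights.IMAGENET1K_V2).eval()
extractor = create_feature_extractor(backbone, return_nodes={"layer4": "feat"})

@torch.no_grad()
def deep_feature_deviation(x_orig: torch.Tensor, x_rec: torch.Tensor) -> float:
    """1 - cosine similarity of layer4 features; larger means more drift."""
    f_o = extractor(x_orig)["feat"].flatten(1)
    f_r = extractor(x_rec)["feat"].flatten(1)
    return (1.0 - F.cosine_similarity(f_o, f_r, dim=1)).mean().item()

# Usage: compare a codec's reconstruction against its source (both tensors
# preprocessed the way the backbone expects).
x = torch.rand(1, 3, 224, 224)
x_rec = (x + 0.05 * torch.randn_like(x)).clamp(0, 1)  # stand-in reconstruction
print(deep_feature_deviation(x, x_rec))
```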
### 2. Overall Architecture
- Encoding side: Input image \(\mathbf{x}\) is compressed into latent features \(\hat{\mathbf{z}}\), targeting the VAE latent space of Stable Diffusion (\(8\times\) spatial downsampling).
- Tag extraction: Recognize Anything extracts word-level semantic tags \(\mathbf{c}\).
- Bitstream: Compressed latents + tag IDs (losslessly encoded, ~100 bits/image).
- Decoding side: \(\hat{\mathbf{z}}\) is fed as a condition into a ControlNet-style control module, which works jointly with a frozen Stable Diffusion model for generative reconstruction (a shape-level sketch follows this list).
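To make the data flow concrete, the following stand-in traces the pipeline; every module here is a placeholder (the real system uses a learned analysis transform with an entropy model, Recognize Anything for tagging, and a ControlNet-style module on frozen SD 2.1):

```python
import torch
import torch.nn as nn

class StubAnalysis(nn.Module):
    """Placeholder encoder: 8x spatial downsampling into a 4-channel latent,
    matching the SD VAE latent grid; the real codec also entropy-codes it."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 4, kernel_size=8, stride=8)

    def forward(self, x):
        return torch.round(self.conv(x))  # rounding stands in for quantization

encoder = StubAnalysis()
x = torch.randn(1, 3, 512, 512)           # input image
z_hat = encoder(x)                        # (1, 4, 64, 64): latent to transmit
tags = ["person", "surfboard", "beach"]   # word-level tags (tagger omitted)
# Decoding (not shown): z_hat conditions the ControlNet-style control module
# on frozen Stable Diffusion 2.1, the joined tag string is the text prompt,
# and 50 DDIM steps yield a latent that the SD VAE decoder maps to pixels.
print(z_hat.shape)  # torch.Size([1, 4, 64, 64])
```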
### 3. Semantic Consistency Loss (SC Loss)

The pretrained diffusion model's feature-extraction capability is leveraged as a semantic space:

$$\mathcal{L}_\text{sem} = -\mathbb{E}_{\mathbf{z}, \hat{\mathbf{z}}} \left[ \frac{1}{N} \sum_{n=1}^N \text{sim}\big(f(\mathbf{z})_n, f(\hat{\mathbf{z}})_n\big) \right]$$

where \(f(\cdot)\) denotes the forward pass of the frozen diffusion model and \(\text{sim}\) is cosine similarity:

$$\text{sim}(\mathbf{z}, \hat{\mathbf{z}}) = \frac{\mathbf{z}^T \hat{\mathbf{z}}}{\|\mathbf{z}\|_2 \, \|\hat{\mathbf{z}}\|_2}$$

Key design choices (a PyTorch sketch follows this list):

- Middle-block features: applying the loss at the U-Net's middle block yields the best performance, since deep features better capture abstract semantics.
- Noise-free input (\(t=0\)): the diffusion model's semantic feature extraction works best on clean signals.
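A direct PyTorch rendering of \(\mathcal{L}_\text{sem}\); `mid_block_features` is a stand-in for the frozen U-Net's middle-block forward pass at \(t=0\):

```python
import torch
import torch.nn.functional as F

def sc_loss(mid_block_features, z: torch.Tensor, z_hat: torch.Tensor) -> torch.Tensor:
    """Negative mean per-position cosine similarity between frozen-model
    features of the original and reconstructed latents."""
    f = mid_block_features(z).flatten(2)        # (B, C, N)
    f_hat = mid_block_features(z_hat).flatten(2)
    sim = F.cosine_similarity(f, f_hat, dim=1)  # (B, N): one score per position
    return -sim.mean()

# Toy check with a random convolution as the "feature extractor":
net = torch.nn.Conv2d(4, 16, 3, padding=1)
z, z_hat = torch.randn(2, 4, 64, 64), torch.randn(2, 4, 64, 64)
print(sc_loss(net, z, z_hat))
```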
### 4. Tag Guidance Module (TGM)
- A pretrained tag extractor \(\mathcal{E}_t\) (Recognize Anything) generates image-level tags.
- Tags are mapped to numerical indices in a predefined dictionary and losslessly encoded.
- At decoding, indices are converted back to text strings and used as conditions for the diffusion model and control module.
- Classifier-Free Guidance (CFG scale = 5.0) is applied at inference to enhance semantic clarity.
- Overhead is minimal: approximately 100 bits/image (an illustrative round-trip follows this list).
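An illustrative round-trip for the tag bitstream, plus the CFG combination rule. `TAG_VOCAB` is a toy stand-in for Recognize Anything's tag list; only the guidance scale of 5.0 comes from the paper:

```python
import math

TAG_VOCAB = ["person", "dog", "beach", "sky", "surfboard"]  # toy dictionary
TAG_TO_ID = {t: i for i, t in enumerate(TAG_VOCAB)}

def encode_tags(tags):
    # Fixed-length codes: ceil(log2(|vocab|)) bits per tag. With a vocabulary
    # of a few thousand tags (~12 bits each) and a handful of tags per image,
    # this lands near the reported ~100 bits/image.
    bits = math.ceil(math.log2(len(TAG_VOCAB))) * len(tags)
    return [TAG_TO_ID[t] for t in tags], bits

def decode_tags(ids):
    return ", ".join(TAG_VOCAB[i] for i in ids)  # text condition for decoding

def cfg(eps_cond, eps_uncond, scale: float = 5.0):
    # Classifier-free guidance: extrapolate the conditional noise prediction
    # away from the unconditional one (scale 5.0 at inference, per the paper).
    return eps_uncond + scale * (eps_cond - eps_uncond)

ids, nbits = encode_tags(["person", "surfboard", "beach"])
print(decode_tags(ids), f"({nbits} bits)")
```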
### 5. Complete Loss Function
Components:

- \(\mathcal{L}_\text{rate}\): rate loss (estimated entropy of the quantized latents and hyperprior).
- \(\mathcal{L}_\text{dist} = \|\mathcal{E}_\text{VAE}(\mathbf{x}) - \mathcal{D}_c(\hat{\mathbf{y}})\|_2^2\): latent-space reconstruction loss.
- \(\mathcal{L}_\text{diff} = \mathbb{E}[\|\boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta(\mathbf{z}_t, \hat{\mathbf{z}}, \mathbf{c}, t)\|_2^2]\): diffusion noise-prediction loss.
- Weights: \(\lambda_\text{dist} = \lambda_\text{diff} = 1\), \(\lambda_\text{sem} = 2\), \(\lambda_\text{rate} \in \{2, 4, 8, 16, 32\}\).
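Given these weights, the overall objective is presumably the weighted sum below (the paper's exact grouping may differ):

$$\mathcal{L} = \lambda_\text{rate}\,\mathcal{L}_\text{rate} + \lambda_\text{dist}\,\mathcal{L}_\text{dist} + \lambda_\text{diff}\,\mathcal{L}_\text{diff} + \lambda_\text{sem}\,\mathcal{L}_\text{sem}$$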
### 6. Training Setup
- Dataset: LSDIR, randomly cropped to \(512\times512\).
- Base model: Stable Diffusion 2.1 (frozen).
- Two-stage training: 200K steps at high bitrate → 200K steps of multi-rate fine-tuning.
- DDIM sampling with 50 steps at inference (the update rule is given below).
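Each of the 50 steps applies the standard deterministic DDIM update (\(\eta = 0\)); writing \(\boldsymbol{\epsilon}_\theta = \boldsymbol{\epsilon}_\theta(\mathbf{z}_t, \hat{\mathbf{z}}, \mathbf{c}, t)\):

$$\hat{\mathbf{z}}_0 = \frac{\mathbf{z}_t - \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon}_\theta}{\sqrt{\bar{\alpha}_t}}, \qquad \mathbf{z}_{t-1} = \sqrt{\bar{\alpha}_{t-1}}\,\hat{\mathbf{z}}_0 + \sqrt{1-\bar{\alpha}_{t-1}}\,\boldsymbol{\epsilon}_\theta$$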
## Key Experimental Results

### Machine Intelligence Task Performance (10+ Tasks, Single Codec/Bitstream)
- COCO object detection / instance segmentation / panoptic segmentation: Consistently outperforms VTM, ELIC, and most perception-oriented methods.
- Flickr30K cross-modal retrieval: Significant advantage at ultra-low bitrates (0.01–0.05 bpp).
- ADE20K open-vocabulary segmentation: Substantially outperforms competing methods at 0.02–0.1 bpp.
- All results are achieved without any task-specific fine-tuning.
### Perceptual Quality (Kodak / Tecnick / CLIC2020)
| Metric Type | Diff-ICMH vs. Competitors |
|---|---|
| PSNR/MS-SSIM (fidelity) | Below VTM/ELIC (inherent to generative methods); on par with DiffEIC |
| LPIPS ↓ | Outperforms all methods |
| FID ↓ | State of the Art |
| DISTS ↓ | State of the Art |
- Perceptual advantages are most pronounced at ultra-low bitrates.
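To ground these bitrates: a \(768\times512\) Kodak image at 0.02 bpp occupies \(768 \times 512 \times 0.02 \approx 7{,}864\) bits, i.e. under 1 KB, of which the ~100-bit tag stream accounts for roughly 1%.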
### Ablation Study (COCO Object Detection mAP)
- SC loss + TGM combination: ~4 mAP improvement over the baseline at ~0.025 bpp.
- Optimal SC loss configuration: \(\lambda_\text{sem}=2.0\), middle block, noise-free input.
## Highlights & Insights
- One codec, multiple uses: A single codec and bitstream support 10+ downstream tasks and human viewing without any adaptation.
- Dual guarantee of semantics and perception: SC loss preserves semantic integrity; the generative framework ensures perceptual realism.
- TGM with negligible overhead: Only ~100 bits of tag information suffice to substantially activate the generative prior.
- Latent-space compression design: Decoding to the VAE latent space rather than pixel space naturally filters out semantically irrelevant redundancy.
- Ultra-low bitrate advantage: Performance gains are most significant under extreme conditions of 0.01–0.05 bpp.
## Limitations & Future Work
- Decoding speed: Diffusion denoising requires 50 iterative forward passes, making decoding substantially slower than traditional codecs.
- Fidelity loss: PSNR/MS-SSIM are below fidelity-driven methods, making the approach unsuitable for scenarios requiring pixel-accurate reconstruction.
- VAE bottleneck: the \(8\times\)-downsampled latent space may discard fine spatial detail, which can hurt spatially precise tasks such as pose estimation.
- Training cost: Two-stage training requires GPU resources commensurate with Stable Diffusion.
## Related Work & Insights
- vs. VTM/ELIC (fidelity-driven): Diff-ICMH achieves lower PSNR but substantially outperforms on machine intelligence tasks and perceptual quality.
- vs. DiffEIC (diffusion-based compression): surpasses it comprehensively on machine tasks; on par with or better on perceptual quality.
- vs. TransTIC/Adapter-ICMH (task-adaptive): Diff-ICMH achieves superior performance without any adaptation.
- vs. feature compression methods: Diff-ICMH operates in the image domain, simultaneously supporting human viewing.
- Semantic preservation is the key to universal compression: The success of SC loss demonstrates that semantic information is the shared foundation of both machine task performance and human understanding.
- A new role for generative compression: Beyond pursuing perceptual quality, generative compression represents a pathway toward universal intelligent codecs.
- CFG in compression: Classifier-Free Guidance is extended from generation to conditional enhancement in compression.
## Rating
- Novelty: ⭐⭐⭐⭐⭐ — First work to design a universal codec for both machine and human use from a unified semantic-perceptual perspective.
- Technical Depth: ⭐⭐⭐⭐⭐ — SC loss design is well-motivated and rigorously validated through ablation studies.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — 10 tasks + 3 perceptual datasets + comprehensive ablations.
- Writing Quality: ⭐⭐⭐⭐☆ — Framework is clearly presented, though the extensive reference list slightly reduces readability.