DiT-IC: Aligned Diffusion Transformer for Efficient Image Compression
Conference: CVPR 2025
arXiv: 2603.13162
Code: https://njuvision.github.io/DiT-IC/
Area: Image Compression / Diffusion Models
Keywords: Diffusion Transformer, Image Compression, One-step Inference, Variance-guided, Self-distillation
TL;DR
DiT-IC adapts a pre-trained text-to-image (T2I) diffusion Transformer into a one-step reconstruction model for image compression. Operating in a 32x-downsampled deep latent space, it achieves SOTA perceptual quality through three alignment mechanisms (variance-guided reconstruction flow, self-distillation alignment, and latent conditional guidance) while decoding 30x faster than existing diffusion codecs.
Background & Motivation
Background: Diffusion-based image compression provides outstanding perceptual quality, but the sampling overhead (4-50 steps) and high memory footprint limit its practicality.
Limitations of Prior Work: Existing diffusion codecs build on U-Net backbones, which confines them to a shallow 8x-downsampled latent space, whereas traditional VAE codecs operate at 16x to 64x downsampling.
Key Challenge: Compression reconstruction starts from structured latent variables that already lie close to the data manifold, so multi-step denoising is potentially redundant; however, directly fine-tuning pre-trained generative models leads to manifold mismatch.
Goal: To enable one-step diffusion inference in a 32x deep latent space.
Key Insight: Three "alignment" mechanisms adapt the pre-trained T2I DiT (SANA) into a compression reconstruction model.
Core Idea: Threefold alignment from generation to reconstruction: variance-guided denoising, self-distillation from multi-step to one-step, and latent conditioning to replace text.
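
To make the gap between the 8x and 32x latent spaces concrete, the following minimal sketch counts latent positions for a given image size. It assumes one token per latent position (patch size 1), which is an illustrative simplification; the helper name `latent_positions` is hypothetical and not taken from the paper.

```python
def latent_positions(height: int, width: int, downsample: int) -> int:
    """Number of spatial positions in a latent grid after downsampling.

    Assumption: one token per latent position (patch size 1); real models
    may group positions into larger patches.
    """
    if height % downsample or width % downsample:
        raise ValueError("image size must be divisible by the downsampling factor")
    return (height // downsample) * (width // downsample)


shallow = latent_positions(2048, 2048, 8)   # 8x latent, as in U-Net-based codecs
deep = latent_positions(2048, 2048, 32)     # 32x latent, as targeted here
ratio = shallow // deep
print(shallow, deep, ratio)
```

For a 2048x2048 input this gives 65536 versus 4096 positions, a 16x reduction in sequence length before any attention cost is considered.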
Method
Overall Architecture
An ELIC encoder is paired with a SANA DiT reconstructor. The encoder produces quantized latent variables, and the DiT performs one-step variance-guided flow matching in the 32x latent space. LoRA is used for efficient fine-tuning.
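
As a rough illustration of why LoRA keeps fine-tuning cheap, the sketch below compares trainable parameters for a single linear layer under a rank-r update (two factors of shape d_in x r and r x d_out) against updating the full weight. The layer width of 1024 is a hypothetical value chosen for the example; only the ranks 32 and 64 come from these notes.

```python
def full_params(d_in: int, d_out: int) -> int:
    """Trainable weights when the whole d_in x d_out matrix is fine-tuned."""
    return d_in * d_out


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable weights for a rank-`rank` LoRA update: A (d_in x r) + B (r x d_out)."""
    return rank * (d_in + d_out)


# Hypothetical layer width, for illustration only.
d = 1024
for rank in (32, 64):
    fraction = lora_params(d, d, rank) / full_params(d, d)
    print(f"rank {rank}: {lora_params(d, d, rank)} params, {fraction:.2%} of full")
```

Under this assumed width, rank 64 trains 12.5% of the layer's weights and rank 32 trains 6.25%.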
Key Designs
- Variance-Guided Reconstruction Flow
  - Function: Folds multi-step diffusion into a one-step adaptive transformation.
  - Mechanism: Uses the encoder variance as a measure of spatial uncertainty and maps it to pixel-wise pseudo-timesteps. Regions with high variance receive stronger denoising.
  - Design Motivation: Compression noise is spatially heterogeneous, so a globally uniform timestep is insufficient.
- Self-Distillation Alignment
  - Function: Distills multi-step diffusion behavior into a single step.
  - Mechanism: Freezes the encoder and aligns the DiT output with the encoder's latent variables (cosine alignment + margin).
  - Design Motivation: There is no off-the-shelf multi-step teacher in deep latent spaces.
- Latent Conditional Guidance
  - Function: Replaces text conditioning with compressed representations.
  - Mechanism: A lightweight projection maps features to the same embedding space as the pre-trained text encoder, using CLIP-style contrastive alignment.
  - Design Motivation: Text lacks sufficient detail for fine-grained spatial information and requires a heavy encoder.
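
The variance-guided idea can be sketched in a few lines: per-position variance is mapped to a per-position pseudo-timestep, and a single Euler step of size t is taken along a predicted velocity. Everything specific here is an assumption for illustration: the min-max mapping, the range `[t_min, t_max]`, the update rule `x0 = x_t - t * v`, and the `toy_velocity` stand-in for the DiT are hypothetical and not the paper's exact formulation.

```python
from typing import Callable, List

Grid = List[List[float]]


def variance_to_timesteps(variance: Grid, t_min: float = 0.05, t_max: float = 0.6) -> Grid:
    """Map per-position variance to pseudo-timesteps in [t_min, t_max].

    Assumption: simple min-max normalisation; the actual mapping may differ.
    """
    flat = [v for row in variance for v in row]
    lo, hi = min(flat), max(flat)
    span = hi - lo
    if span == 0.0:
        # Uniform uncertainty: use the midpoint timestep everywhere.
        mid = 0.5 * (t_min + t_max)
        return [[mid for _ in row] for row in variance]
    return [[t_min + (t_max - t_min) * (v - lo) / span for v in row] for row in variance]


def one_step_reconstruct(
    latent: Grid, timesteps: Grid, velocity_fn: Callable[[Grid, Grid], Grid]
) -> Grid:
    """Single Euler step x0 = x_t - t * v(x_t, t), with t varying per position."""
    velocity = velocity_fn(latent, timesteps)
    return [
        [x - t * v for x, t, v in zip(x_row, t_row, v_row)]
        for x_row, t_row, v_row in zip(latent, timesteps, velocity)
    ]


def toy_velocity(latent: Grid, timesteps: Grid) -> Grid:
    """Hypothetical stand-in for the DiT: returns the latent itself as velocity."""
    return [list(row) for row in latent]


variance = [[0.0, 1.0], [2.0, 4.0]]
latent = [[1.0, 1.0], [1.0, 1.0]]
timesteps = variance_to_timesteps(variance, t_min=0.0, t_max=0.8)
recon = one_step_reconstruct(latent, timesteps, toy_velocity)
print(timesteps)
print(recon)
```

In this toy setup the position with the largest variance receives the largest timestep and therefore the largest correction, which mirrors the statement that high-variance regions are denoised more strongly.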
Loss & Training
Two-stage IBP: Stage 1 runs for 100K iterations (256) and Stage 2 for 60K iterations (512). LoRA ranks are 32 for the VAE and 64 for the DiT.
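
The "cosine alignment + margin" objective used for self-distillation alignment could take a hinge form such as the one below, where each predicted token is penalised only while its cosine similarity to the frozen-encoder target stays below `1 - margin`. This exact form, the margin value, and the function names are assumptions for illustration; the notes do not specify the loss in detail.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors, with a small epsilon for stability."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-12)


def margin_cosine_loss(
    pred_tokens: List[Sequence[float]],
    target_tokens: List[Sequence[float]],
    margin: float = 0.1,
) -> float:
    """Mean hinge loss max(0, 1 - margin - cos(pred, target)) over tokens.

    Targets are treated as constants, standing in for frozen-encoder latents.
    """
    losses = [
        max(0.0, 1.0 - margin - cosine_similarity(p, t))
        for p, t in zip(pred_tokens, target_tokens)
    ]
    return sum(losses) / len(losses)


pred = [[1.0, 0.0], [0.0, 1.0]]      # first token aligned, second orthogonal
target = [[1.0, 0.0], [1.0, 0.0]]
loss = margin_cosine_loss(pred, target, margin=0.1)
print(loss)
```

The margin lets well-aligned tokens contribute zero loss, so the objective focuses on tokens that still deviate noticeably from the target.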
Key Experimental Results
Main Results
| Method | Steps | Latency | LPIPS BD-rate | DISTS BD-rate |
|---|---|---|---|---|
| StableCodec | 1 | 0.34s | -79.19% | -83.95% |
| ResULIC | 4 | 0.83s | -62.27% | -65.64% |
| OSCAR | 1 | 0.32s | -19.04% | -58.38% |
| DiT-IC | 1 | 0.15s | -83.65% | -87.88% |
Ablation Study
| Configuration | LPIPS | DISTS |
|---|---|---|
| Full | 0.00% | 0.00% |
| Train from scratch | +22.00% | +32.45% |
| Full finetuning | +7.95% | +8.05% |
Key Findings
- SOTA perceptual quality with a decoding time of 0.15s, roughly 80x faster than DiffEIC (12.4s).
- Capable of reconstructing 2048x2048 images on a 16GB laptop GPU.
- Pre-trained initialization is critical; training from scratch degrades LPIPS by +22.00% and DISTS by +32.45% relative to the full model.
- LoRA 32/64 outperforms full finetuning, which is worse by +7.95% (LPIPS) and +8.05% (DISTS).
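
The latency figures quoted in these notes can be turned into speedup ratios with a short calculation. The numbers below are copied from the main results table and the DiffEIC figure in the findings; nothing else is assumed.

```python
# Decoding latencies in seconds, as reported in these notes.
latency_s = {
    "StableCodec": 0.34,
    "ResULIC": 0.83,
    "OSCAR": 0.32,
    "DiffEIC": 12.4,
}
dit_ic_latency_s = 0.15

# Speedup of DiT-IC over each baseline = baseline latency / DiT-IC latency.
speedups = {name: t / dit_ic_latency_s for name, t in latency_s.items()}

for name, s in sorted(speedups.items(), key=lambda kv: kv[1]):
    print(f"{name}: {s:.1f}x")
```

The DiffEIC ratio comes out at about 82.7, consistent with the rounded "80x" claim, while the one-step and four-step baselines in the table are between roughly 2x and 5.5x slower than DiT-IC.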
Highlights & Insights
- First to apply DiT to image compression, demonstrating the feasibility of operating in deep latent spaces.
- Three alignment mechanisms build upon one another progressively.
- One-step + deep latent space = extremely low latency and memory footprint.
Limitations & Future Work
- Insufficient information in latent variables at extremely low bitrates.
- The training data is limited to 150K samples.
- Lack of comparison with concurrent works, such as OneDC.
Related Work & Insights
- StableCodec pioneered one-step diffusion compression but is limited by the U-Net architecture and cannot operate in deep latent spaces.
- SANA’s linear-attention DiT provides the architectural foundation for efficient diffusion; this work is the first to adapt it for compression tasks.
- ResULIC performs 4-step diffusion in a 32x latent space, whereas this work shows that a single step can achieve better performance within the same space.
- The self-distillation approach (using the encoder itself as the target) is more elegant than distilling from external teachers and can be generalized to other reconstruction tasks.
- The feature alignment paradigm of VA-VAE inspired the self-distillation design in this work.
Rating
- Novelty: ⭐⭐⭐⭐ First to apply DiT to compression; each of the three alignment mechanisms has unique merits.
- Experimental Thoroughness: ⭐⭐⭐⭐ Three test datasets + multiple baselines + detailed ablations.
- Writing Quality: ⭐⭐⭐⭐ Clear architecture diagrams, with step-by-step ablation demonstrations.
- Value: ⭐⭐⭐⭐⭐ 30x speedup + SOTA quality = a critical step toward practical diffusion-based compression.