# Learned Image Compression with Hierarchical Progressive Context Modeling
Conference: ICCV 2025 · arXiv: 2507.19125 · Code: github.com/lyq133/LIC-HPCM · Area: Image Compression · Keywords: Learned image compression, context modeling, entropy coding, hierarchical coding, progressive fusion
## TL;DR
This paper proposes the Hierarchical Progressive Context Model (HPCM), which partitions the latent representation into multi-scale sub-representations and encodes them sequentially from coarse to fine. A cross-attention-based progressive context fusion mechanism carries context across coding steps, enabling more efficient long-range dependency modeling and more accurate entropy parameter estimation, and yielding a better trade-off between compression performance and computational complexity.
## Background & Motivation
- The core of learned image compression lies in the entropy model: more accurate probability distribution estimation → fewer bits → better compression performance.
- Conditional entropy models jointly use a hyperprior and a context model (encoding latents in groups sequentially) to capture contextual information.
- Two key issues with existing methods:
- Inefficient long-range dependency modeling: Transformer architectures can capture long-range dependencies but introduce high complexity.
- Insufficient utilization of diverse context across coding steps: Each step only uses the hyperprior and already-coded latents, without fully exploiting contextual information accumulated in previous steps.
- Goal: Achieve more efficient context information acquisition through hierarchical coding schedules and progressive context fusion.
## Method
### Overall Architecture
The paper follows the standard learned image compression pipeline: analysis transform → quantization → entropy coding → decoding → synthesis transform. The core improvement lies in the context modeling component of the entropy model, where HPCM replaces the conventional single-scale context model.
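To fix ideas, here is a minimal sketch of this pipeline; the callables `g_a`, `g_s`, `hyper_codec`, and `hpcm` are hypothetical placeholders, and only the overall structure follows the paper:

```python
import torch

def compress_forward(x, g_a, g_s, hyper_codec, hpcm):
    """Hypothetical forward pass; only the pipeline structure mirrors the paper."""
    y = g_a(x)                           # analysis transform: image -> latent
    y_hat = torch.round(y)               # quantization (noise/STE proxy in training)
    z_bits, hyper_ctx = hyper_codec(y)   # hyperprior side information
    mu, alpha = hpcm(y_hat, hyper_ctx)   # HPCM estimates entropy parameters stepwise
    x_hat = g_s(y_hat)                   # synthesis transform: latent -> image
    return x_hat, (mu, alpha), z_bits
```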
### Key Designs
- Hierarchical Coding Schedule: The quantized latent \(\hat{y}\) is divided via specialized sampling into three sub-representations at different scales: \(\hat{y}^{S1}\) (4× downsampled), \(\hat{y}^{S2}\) (2× downsampled), and \(\hat{y}^{S3}\) (original scale). Encoding proceeds sequentially from the smallest scale, progressively modeling dependencies from long-range to short-range. Key advantage: large receptive fields are obtained at smaller scales with lightweight operations (ERF visualizations confirm that the effective receptive field of \(g_{ep}^{S1}\) is significantly larger than that of \(g_{ep}^{S3}\)), avoiding the high complexity of global attention at the original resolution. After \(\hat{y}^{S1}\) is encoded, its values are filled back into the corresponding positions of \(\hat{y}^{S2}\), and so on (see the sampling sketch after this list). Different channel groups use different spatial partitioning strategies to enable spatial-channel interaction. The number of coding steps per scale is (2, 3, 6), for 11 steps in total.
- Progressive Context Fusion (PCF): The core innovation is integrating contextual information from the previous coding step into the current one. At step \(i\), the context for the next step, \(C_{i+1}^{S2}\), is obtained by fusing the current context \(C_i^{S2}\) with the entropy parameter state \(\psi_i\). Fusion is implemented via cross-attention: \(Q = \text{Linear}(\psi_i)\), \(K, V = \text{Linear}(C_i^{S2})\), \(C_{i+1}^{S2} = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V + \text{Linear}(\psi_i)\) (see the cross-attention sketch after this list). Cross-scale propagation: the small-scale context \(C_6^{S2}\) is filled into the corresponding positions of the large-scale \(C_6^{S3}\), and the remaining positions are filled with hyperprior information.
- Parameter-Efficient Design and Other Improvements: Advanced network architectures (deep residual connections, non-local attention) are employed to construct the transform and context model networks. Network parameters are shared across coding steps, significantly reducing model size. The Python-C data exchange interface is optimized for more efficient arithmetic coding.
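A minimal sketch of the hierarchical split and fill-back, assuming a plain strided sampling pattern; the paper's specialized sampling and per-channel-group partitions may differ, and both function names are illustrative:

```python
import torch

def split_hierarchy(y_hat):
    # Three scales via strided spatial subsampling (illustrative pattern).
    y_s1 = y_hat[..., ::4, ::4]  # coarsest: 4x downsampled
    y_s2 = y_hat[..., ::2, ::2]  # middle: 2x downsampled
    y_s3 = y_hat                 # finest: original resolution
    return y_s1, y_s2, y_s3

def fill_back(y_coarse, y_fine, stride=2):
    # After a coarse scale is (de)coded, write its values back into the
    # matching positions of the next finer scale before coding that scale.
    y_fine = y_fine.clone()
    y_fine[..., ::stride, ::stride] = y_coarse
    return y_fine
```

Under this pattern, `fill_back(y_s1, y_s2)` runs after coding \(\hat{y}^{S1}\) and `fill_back(y_s2, y_s3)` after \(\hat{y}^{S2}\), since each scale occupies every second position of the next.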
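The cross-attention fusion can be read off the equation above; this is a single-head sketch of that formula, not the repository's implementation (the class name and layer layout are assumptions):

```python
import torch
import torch.nn as nn

class ProgressiveContextFusion(nn.Module):
    """Sketch of C_{i+1} = softmax(Q K^T / sqrt(d_k)) V + Linear(psi_i)."""
    def __init__(self, dim):
        super().__init__()
        self.to_q = nn.Linear(dim, dim)       # Q = Linear(psi_i)
        self.to_kv = nn.Linear(dim, 2 * dim)  # K, V = Linear(C_i)
        self.skip = nn.Linear(dim, dim)       # residual path: Linear(psi_i)

    def forward(self, psi_i, c_i):
        # psi_i, c_i: (B, N, dim) token sequences over latent positions
        q = self.to_q(psi_i)
        k, v = self.to_kv(c_i).chunk(2, dim=-1)
        attn = torch.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
        return attn @ v + self.skip(psi_i)    # C_{i+1}
```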
### Loss & Training
- Rate-distortion loss: \(L = \mathcal{R}(\hat{y}) + \mathcal{R}(\hat{z}) + \lambda \cdot \mathcal{D}(x, \hat{x})\) (a minimal rate sketch follows this list)
- Probability distribution model: generalized Gaussian model \(\mathcal{N}_\beta(\mu, \alpha)\) with \(\beta=1.5\)
- MSE optimization: \(\lambda \in \{0.0018, 0.0035, 0.0067, 0.0130, 0.0250, 0.0483\}\)
- MS-SSIM optimization: \(\lambda \in \{2.40, 4.58, 8.73, 16.64, 31.73, 60.50\}\)
- Training data: Flicker2W, 256×256 patches, batch size = 32
- 2 million training steps, Adam optimizer, learning rate decayed from 1e-4 to 1e-6 in stages
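As a concrete reading of the loss, here is a minimal numpy/scipy sketch that estimates the latent rate under the generalized Gaussian \(\mathcal{N}_\beta(\mu, \alpha)\) by integrating the density over each integer quantization bin; the hyperprior bits `z_bits` are treated as given, and all names are illustrative:

```python
import numpy as np
from scipy.stats import gennorm  # generalized Gaussian: shape beta, loc mu, scale alpha

def latent_rate_bits(y_hat, mu, alpha, beta=1.5):
    # P(y_hat) = F(y_hat + 0.5) - F(y_hat - 0.5) under N_beta(mu, alpha)
    p = (gennorm.cdf(y_hat + 0.5, beta, loc=mu, scale=alpha)
         - gennorm.cdf(y_hat - 0.5, beta, loc=mu, scale=alpha))
    return -np.log2(np.clip(p, 1e-9, 1.0)).sum()

def rd_loss(x, x_hat, y_hat, mu, alpha, z_bits, lam, num_pixels):
    # L = R(y_hat) + R(z_hat) + lambda * D(x, x_hat), with D = MSE here
    bpp = (latent_rate_bits(y_hat, mu, alpha) + z_bits) / num_pixels
    return bpp + lam * np.mean((x - x_hat) ** 2)
```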
## Key Experimental Results
### Main Results
BD-Rate performance (relative to VTM-22.0, PSNR; negative values indicate bit savings; a BD-rate computation sketch follows the table):
| Method | Params (M) | kMACs/pixel | Kodak | CLIC Pro | Tecnick |
|---|---|---|---|---|---|
| ELIC (CVPR'22) | 36.93 | 573.88 | -3.22% | -3.89% | -4.57% |
| TCM (CVPR'23) | 76.57 | 1823.58 | -10.70% | -8.32% | -11.84% |
| MLIC++ (ICML'23) | 116.72 | 1282.81 | -15.15% | -14.05% | -17.90% |
| FLIC (ICLR'24) | 70.96 | 1096.04 | -13.20% | -9.88% | -15.27% |
| HPCM-Base | 68.50 | 918.57 | -15.31% | -14.23% | -18.16% |
| HPCM-Large | 89.71 | 1261.29 | -19.19% | -18.37% | -22.20% |
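For reference, BD-rate here is the standard Bjøntegaard delta rate. A common numpy implementation fits a cubic in log-rate as a function of PSNR for the anchor and test codecs and averages the gap over the overlapping quality range (sketch; the function name is illustrative):

```python
import numpy as np

def bd_rate(rate_anchor, psnr_anchor, rate_test, psnr_test):
    # Cubic fit of log(rate) vs. PSNR for anchor (e.g. VTM-22.0) and test codec.
    p_a = np.polyfit(psnr_anchor, np.log(rate_anchor), 3)
    p_t = np.polyfit(psnr_test, np.log(rate_test), 3)
    lo = max(min(psnr_anchor), min(psnr_test))  # overlapping PSNR range
    hi = min(max(psnr_anchor), max(psnr_test))
    avg_a = np.polyval(np.polyint(p_a), [lo, hi])
    avg_t = np.polyval(np.polyint(p_t), [lo, hi])
    diff = ((avg_t[1] - avg_t[0]) - (avg_a[1] - avg_a[0])) / (hi - lo)
    return (np.exp(diff) - 1) * 100  # negative values = bit savings
```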
Comparison of Different Entropy Models under the Same Transform:
| Entropy Model | kMACs/pixel | Kodak BD-Rate |
|---|---|---|
| CHARM | 495.75 | +0.86% |
| DCVC-DC intra | 542.14 | -9.18% |
| HPCM-Base | 918.57 | -15.31% |
### Ablation Study
Ablation on Hierarchical Coding Schedule:
| Configuration | kMACs/pixel | BD-Rate |
|---|---|---|
| HPCM-Base (2,3,6) default | 918.57 | 0.00% |
| Remove hierarchical extraction (model at original scale) | 1107.48 | +1.07% |
| Coding steps (2,3,3) | 663.90 | +2.39% |
| Coding steps (2,3,12) | 1427.91 | -2.55% |
| Coding steps (4,3,6) | 925.59 | +0.35% |
Ablation on Progressive Context Fusion:
| Configuration | kMACs/pixel | BD-Rate |
|---|---|---|
| HPCM-Base (with PCF) | 918.57 | 0.00% |
| Remove progressive fusion | 872.80 | +4.71% |
| Use \(\psi_i\) as progressive context | 872.80 | +1.17% |
### Key Findings
- Progressive context fusion is the most critical component: removing it costs +4.71% BD-rate, indicating that cross-step context accumulation is essential for accurate entropy estimation.
- The hierarchical coding schedule effectively enables long-range dependency modeling: larger effective receptive fields are naturally obtained at smaller scales (verified via ERF visualizations).
- Cross-attention fusion outperforms simply propagating the entropy parameter state: using \(\psi_i\) alone as the progressive context costs +1.17% BD-rate, while removing progressive fusion entirely costs +4.71%, highlighting the importance of an accurate context integration mechanism.
- HPCM-Large saves approximately 19–22% in bit-rate over VTM-22.0 on Kodak and Tecnick, setting the state of the art among published learned image codecs.
- The number of coding steps matters most at the largest scale S3 (more latents there mean greater benefit), but increasing the schedule to (2,3,12) yields only an additional 2.55% BD-rate gain at a substantially higher cost (1427.91 vs. 918.57 kMACs/pixel).
- The attention maps of the PCF module concentrate on high-bit-rate regions (complex textures), indicating that the model learns to allocate more context modeling effort to difficult regions.
## Highlights & Insights
- The hierarchical schedule is elegantly designed: long-range dependencies are modeled without the global attention of Transformers, since lightweight operations at smaller scales naturally yield large effective receptive fields.
- The progressive fusion mechanism shares conceptual similarity with hidden state propagation in RNNs, but achieves more flexible information selection via cross-attention.
- Cross-scale context propagation is simple and effective: accumulated context from smaller scales is filled directly into the corresponding positions of larger scales, with hyperprior information supplementing the remaining positions.
- Parameter sharing across coding steps reduces model size while preserving context-aware adaptability.
- Optimizing the Python-C interface for arithmetic coding is also a practical engineering contribution.
## Limitations & Future Work
- Encoding/decoding time (~80–90 ms) is competitive but still has room for improvement, particularly due to the serial bottleneck of arithmetic coding.
- The (2,3,6) step allocation is manually chosen as a trade-off; automated learning-based allocation could be explored.
- Optimization is currently focused on MSE/MS-SSIM; perceptual quality and subjective evaluation remain largely unexplored.
- Integration with variable-rate techniques to achieve multi-rate compression with a single model is a promising direction.
## Related Work & Insights
- The checkerboard model (He et al.) and channel-conditional context model (Minnen et al.) form the foundation of context modeling in this field.
- MLIC++'s multi-reference context model inspired the idea of diverse context utilization, but HPCM achieves this more efficiently through hierarchical coding and progressive fusion.
- The hierarchical coding concept shares philosophical roots with multi-resolution approaches in traditional coding (e.g., wavelet decomposition in JPEG2000).
- The cross-attention progressive fusion mechanism is potentially transferable to other tasks requiring multi-step context accumulation.
## Rating
- Novelty: ⭐⭐⭐⭐ The combination of hierarchical coding and progressive fusion is novel, and the cross-attention fusion design is insightful.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Evaluation on 3 datasets with comparisons against VTM and multiple state-of-the-art methods; well-designed ablation studies.
- Writing Quality: ⭐⭐⭐⭐ Figures are clear (especially the coding schedule diagram in Fig. 2); the method is described in a well-organized manner.
- Value: ⭐⭐⭐⭐⭐ State-of-the-art compression performance with a favorable performance-complexity trade-off; valuable for both academic research and industrial applications.