# Learned Image Compression with Hierarchical Progressive Context Modeling
Conference: ICCV 2025 · arXiv: 2507.19125 · Code: github.com/lyq133/LIC-HPCM · Area: Image Compression · Keywords: Learned image compression, context modeling, entropy coding, hierarchical coding, progressive fusion
## TL;DR
This paper proposes the Hierarchical Progressive Context Model (HPCM), which partitions the latent representation into multi-scale sub-representations and encodes them sequentially from coarse to fine. A cross-attention-based progressive context fusion mechanism carries context across coding steps, enabling more efficient long-range dependency modeling and more accurate entropy parameter estimation, and yielding a better trade-off between compression performance and computational complexity.
## Background & Motivation
- The core of learned image compression lies in the entropy model: more accurate probability distribution estimation → fewer bits → better compression performance.
- Conditional entropy models jointly use a hyperprior and a context model (encoding latents in groups sequentially) to capture contextual information.
- Two key issues with existing methods:
- Inefficient long-range dependency modeling: Transformer architectures can capture long-range dependencies but introduce high complexity.
- Insufficient utilization of diverse context across coding steps: Each step only uses the hyperprior and already-coded latents, without fully exploiting contextual information accumulated in previous steps.
- Goal: Achieve more efficient context information acquisition through hierarchical coding schedules and progressive context fusion.
## Method
### Overall Architecture
The paper follows the standard learned image compression pipeline: analysis transform → quantization → entropy coding → decoding → synthesis transform. The core improvement lies in the context modeling component of the entropy model, where HPCM replaces the conventional single-scale context model.
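To fix ideas, here is a minimal sketch of this pipeline; the callables `g_a`, `g_s`, `hyper_codec`, and `hpcm` are hypothetical placeholders, and only the overall structure follows the paper:

```python
import torch

def compress_forward(x, g_a, g_s, hyper_codec, hpcm):
    """Hypothetical forward pass; only the pipeline structure mirrors the paper."""
    y = g_a(x)                           # analysis transform: image -> latent
    y_hat = torch.round(y)               # quantization (noise/STE proxy in training)
    z_bits, hyper_ctx = hyper_codec(y)   # hyperprior side information
    mu, alpha = hpcm(y_hat, hyper_ctx)   # HPCM estimates entropy parameters stepwise
    x_hat = g_s(y_hat)                   # synthesis transform: latent -> image
    return x_hat, (mu, alpha), z_bits
```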
### Key Designs
- Hierarchical Coding Schedule: The quantized latent \(\hat{y}\) is divided via specialized sampling into three sub-representations at different scales: \(\hat{y}^{S1}\) (4× downsampled), \(\hat{y}^{S2}\) (2× downsampled), and \(\hat{y}^{S3}\) (original scale). Encoding proceeds sequentially from the smallest scale, progressively modeling dependencies from long-range to short-range. Key advantage: large receptive fields are obtained at smaller scales with lightweight operations (ERF visualizations confirm that the effective receptive field of \(g_{ep}^{S1}\) is significantly larger than that of \(g_{ep}^{S3}\)), avoiding the high complexity of global attention at the original resolution. After \(\hat{y}^{S1}\) is encoded, its values are filled back into the corresponding positions of \(\hat{y}^{S2}\), and so on (see the sampling sketch after this list). Different channel groups use different spatial partitioning strategies to enable spatial-channel interaction. The number of coding steps per scale is (2, 3, 6), for 11 steps in total.
- Progressive Context Fusion (PCF): The core innovation is integrating contextual information from the previous coding step into the current one. At step \(i\), the context for the next step, \(C_{i+1}^{S2}\), is obtained by fusing the current context \(C_i^{S2}\) with the entropy parameter state \(\psi_i\). Fusion is implemented via cross-attention: \(Q = \text{Linear}(\psi_i)\), \(K, V = \text{Linear}(C_i^{S2})\), \(C_{i+1}^{S2} = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V + \text{Linear}(\psi_i)\) (see the cross-attention sketch after this list). Cross-scale propagation: the small-scale context \(C_6^{S2}\) is filled into the corresponding positions of the large-scale \(C_6^{S3}\), and the remaining positions are filled with hyperprior information.
- Parameter-Efficient Design and Other Improvements: Advanced network architectures (deep residual connections, non-local attention) are employed to construct the transform and context model networks. Network parameters are shared across coding steps, significantly reducing model size. The Python-C data exchange interface is optimized for more efficient arithmetic coding.
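A minimal sketch of the hierarchical split and fill-back, assuming a plain strided sampling pattern; the paper's specialized sampling and per-channel-group partitions may differ, and both function names are illustrative:

```python
import torch

def split_hierarchy(y_hat):
    # Three scales via strided spatial subsampling (illustrative pattern).
    y_s1 = y_hat[..., ::4, ::4]  # coarsest: 4x downsampled
    y_s2 = y_hat[..., ::2, ::2]  # middle: 2x downsampled
    y_s3 = y_hat                 # finest: original resolution
    return y_s1, y_s2, y_s3

def fill_back(y_coarse, y_fine, stride=2):
    # After a coarse scale is (de)coded, write its values back into the
    # matching positions of the next finer scale before coding that scale.
    y_fine = y_fine.clone()
    y_fine[..., ::stride, ::stride] = y_coarse
    return y_fine
```

Under this pattern, `fill_back(y_s1, y_s2)` runs after coding \(\hat{y}^{S1}\) and `fill_back(y_s2, y_s3)` after \(\hat{y}^{S2}\), since each scale occupies every second position of the next.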
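The cross-attention fusion can be read off the equation above; this is a single-head sketch of that formula, not the repository's implementation (the class name and layer layout are assumptions):

```python
import torch
import torch.nn as nn

class ProgressiveContextFusion(nn.Module):
    """Sketch of C_{i+1} = softmax(Q K^T / sqrt(d_k)) V + Linear(psi_i)."""
    def __init__(self, dim):
        super().__init__()
        self.to_q = nn.Linear(dim, dim)       # Q = Linear(psi_i)
        self.to_kv = nn.Linear(dim, 2 * dim)  # K, V = Linear(C_i)
        self.skip = nn.Linear(dim, dim)       # residual path: Linear(psi_i)

    def forward(self, psi_i, c_i):
        # psi_i, c_i: (B, N, dim) token sequences over latent positions
        q = self.to_q(psi_i)
        k, v = self.to_kv(c_i).chunk(2, dim=-1)
        attn = torch.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
        return attn @ v + self.skip(psi_i)    # C_{i+1}
```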
### Loss & Training
- Rate-distortion loss: \(L = \mathcal{R}(\hat{y}) + \mathcal{R}(\hat{z}) + \lambda \cdot \mathcal{D}(x, \hat{x})\) (a minimal rate sketch follows this list)
- Probability distribution model: generalized Gaussian model \(\mathcal{N}_\beta(\mu, \alpha)\) with \(\beta=1.5\)
- MSE optimization: \(\lambda \in \{0.0018, 0.0035, 0.0067, 0.0130, 0.0250, 0.0483\}\)
- MS-SSIM optimization: \(\lambda \in \{2.40, 4.58, 8.73, 16.64, 31.73, 60.50\}\)
- Training data: Flicker2W, 256×256 patches, batch size = 32
- 2 million training steps, Adam optimizer, learning rate decayed from 1e-4 to 1e-6 in stages
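As a concrete reading of the loss, here is a minimal numpy/scipy sketch that estimates the latent rate under the generalized Gaussian \(\mathcal{N}_\beta(\mu, \alpha)\) by integrating the density over each integer quantization bin; the hyperprior bits `z_bits` are treated as given, and all names are illustrative:

```python
import numpy as np
from scipy.stats import gennorm  # generalized Gaussian: shape beta, loc mu, scale alpha

def latent_rate_bits(y_hat, mu, alpha, beta=1.5):
    # P(y_hat) = F(y_hat + 0.5) - F(y_hat - 0.5) under N_beta(mu, alpha)
    p = (gennorm.cdf(y_hat + 0.5, beta, loc=mu, scale=alpha)
         - gennorm.cdf(y_hat - 0.5, beta, loc=mu, scale=alpha))
    return -np.log2(np.clip(p, 1e-9, 1.0)).sum()

def rd_loss(x, x_hat, y_hat, mu, alpha, z_bits, lam, num_pixels):
    # L = R(y_hat) + R(z_hat) + lambda * D(x, x_hat), with D = MSE here
    bpp = (latent_rate_bits(y_hat, mu, alpha) + z_bits) / num_pixels
    return bpp + lam * np.mean((x - x_hat) ** 2)
```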
## Key Experimental Results
### Main Results
BD-Rate performance (relative to VTM-22.0, PSNR; negative values indicate bit savings; a BD-rate computation sketch follows the table):
| Method | Params (M) | kMACs/pixel | Kodak | CLIC Pro | Tecnick |
|---|---|---|---|---|---|
| ELIC (CVPR'22) | 36.93 | 573.88 | -3.22% | -3.89% | -4.57% |
| TCM (CVPR'23) | 76.57 | 1823.58 | -10.70% | -8.32% | -11.84% |
| MLIC++ (ICML'23) | 116.72 | 1282.81 | -15.15% | -14.05% | -17.90% |
| FLIC (ICLR'24) | 70.96 | 1096.04 | -13.20% | -9.88% | -15.27% |
| HPCM-Base | 68.50 | 918.57 | -15.31% | -14.23% | -18.16% |
| HPCM-Large | 89.71 | 1261.29 | -19.19% | -18.37% | -22.20% |
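For reference, BD-rate here is the standard Bjøntegaard delta rate. A common numpy implementation fits a cubic in log-rate as a function of PSNR for the anchor and test codecs and averages the gap over the overlapping quality range (sketch; the function name is illustrative):

```python
import numpy as np

def bd_rate(rate_anchor, psnr_anchor, rate_test, psnr_test):
    # Cubic fit of log(rate) vs. PSNR for anchor (e.g. VTM-22.0) and test codec.
    p_a = np.polyfit(psnr_anchor, np.log(rate_anchor), 3)
    p_t = np.polyfit(psnr_test, np.log(rate_test), 3)
    lo = max(min(psnr_anchor), min(psnr_test))  # overlapping PSNR range
    hi = min(max(psnr_anchor), max(psnr_test))
    avg_a = np.polyval(np.polyint(p_a), [lo, hi])
    avg_t = np.polyval(np.polyint(p_t), [lo, hi])
    diff = ((avg_t[1] - avg_t[0]) - (avg_a[1] - avg_a[0])) / (hi - lo)
    return (np.exp(diff) - 1) * 100  # negative values = bit savings
```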
Comparison of Different Entropy Models under the Same Transform:
| Entropy Model | kMACs/pixel | Kodak BD-Rate |
|---|---|---|
| CHARM | 495.75 | +0.86% |
| DCVC-DC intra | 542.14 | -9.18% |
| HPCM-Base | 918.57 | -15.31% |
### Ablation Study
Ablation on Hierarchical Coding Schedule:
| Configuration | kMACs/pixel | BD-Rate |
|---|---|---|
| HPCM-Base (2,3,6) default | 918.57 | 0.00% |
| Remove hierarchical extraction (model at original scale) | 1107.48 | +1.07% |
| Coding steps (2,3,3) | 663.90 | +2.39% |
| Coding steps (2,3,12) | 1427.91 | -2.55% |
| Coding steps (4,3,6) | 925.59 | +0.35% |
Ablation on Progressive Context Fusion:
| Configuration | kMACs/pixel | BD-Rate |
|---|---|---|
| HPCM-Base (with PCF) | 918.57 | 0.00% |
| Remove progressive fusion | 872.80 | +4.71% |
| Use \(\psi_i\) as progressive context | 872.80 | +1.17% |
### Key Findings
- Progressive context fusion is the most critical component: removing it costs +4.71% BD-rate, indicating that cross-step context accumulation is essential for accurate entropy estimation.
- The hierarchical coding schedule effectively enables long-range dependency modeling: larger effective receptive fields are naturally obtained at smaller scales (verified via ERF visualizations).
- Cross-attention fusion outperforms simply propagating the entropy parameter state: using \(\psi_i\) alone as the progressive context costs +1.17% BD-rate, while removing progressive fusion entirely costs +4.71%, highlighting the importance of an accurate context integration mechanism.
- HPCM-Large saves approximately 19–22% in bit-rate over VTM-22.0 on Kodak and Tecnick, setting the state of the art among published learned image codecs.
- The number of coding steps matters most at the largest scale S3 (more latents there mean greater benefit), but increasing the schedule to (2,3,12) yields only an additional 2.55% BD-rate gain at a substantially higher cost (1427.91 vs. 918.57 kMACs/pixel).
- The attention maps of the PCF module concentrate on high-bit-rate regions (complex textures), indicating that the model learns to allocate more context modeling effort to difficult regions.
## Highlights & Insights
- The hierarchical schedule is elegantly designed: long-range dependencies are modeled without the global attention of Transformers, since lightweight operations at smaller scales naturally yield large effective receptive fields.
- The progressive fusion mechanism shares conceptual similarity with hidden state propagation in RNNs, but achieves more flexible information selection via cross-attention.
- Cross-scale context propagation is simple and effective: accumulated context from smaller scales is filled directly into the corresponding positions of larger scales, with hyperprior information supplementing the remaining positions.
- Parameter sharing across coding steps reduces model size while preserving context-aware adaptability.
- Optimizing the Python-C interface for arithmetic coding is also a practical engineering contribution.
## Limitations & Future Work
- Encoding/decoding time (~80–90 ms) is competitive but still has room for improvement, particularly due to the serial bottleneck of arithmetic coding.
- The (2,3,6) step allocation is manually chosen as a trade-off; automated learning-based allocation could be explored.
- Optimization is currently focused on MSE/MS-SSIM; perceptual quality and subjective evaluation remain largely unexplored.
- Integration with variable-rate techniques to achieve multi-rate compression with a single model is a promising direction.
## Related Work & Insights
- The checkerboard model (He et al.) and channel-conditional context model (Minnen et al.) form the foundation of context modeling in this field.
- MLIC++'s multi-reference context model inspired the idea of diverse context utilization, but HPCM achieves this more efficiently through hierarchical coding and progressive fusion.
- The hierarchical coding concept shares philosophical roots with multi-resolution approaches in traditional coding (e.g., wavelet decomposition in JPEG2000).
- The cross-attention progressive fusion mechanism is potentially transferable to other tasks requiring multi-step context accumulation.
## Rating
- Novelty: ⭐⭐⭐⭐ The combination of hierarchical coding and progressive fusion is novel, and the cross-attention fusion design is insightful.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Evaluation on 3 datasets with comparisons against VTM and multiple state-of-the-art methods; well-designed ablation studies.
- Writing Quality: ⭐⭐⭐⭐ Figures are clear (especially the coding schedule diagram in Fig. 2); the method is described in a well-organized manner.
- Value: ⭐⭐⭐⭐⭐ State-of-the-art compression performance with a favorable performance-complexity trade-off; valuable for both academic research and industrial applications.