# ARCHE: Autoregressive Residual Compression with Hyperprior and Excitation
Conference: CVPR 2026 | arXiv: 2603.10188 | Code: https://github.com/sof-il/ARCHE | Area: Model Compression
Keywords: Learned Image Compression, Autoregressive Prior, Hyperprior, Squeeze-and-Excitation, Latent Residual Prediction
## TL;DR
This paper proposes ARCHE, an end-to-end learned image compression framework built on a purely convolutional architecture, free of Transformers and recurrent modules. It integrates five complementary components: a hierarchical hyperprior, a Masked PixelCNN spatial autoregressive context, channel conditioning, SE channel recalibration, and latent residual prediction. On Kodak it achieves a 48% BD-Rate reduction over the Ballé hyperprior baseline and −5.6% over VVC Intra, with only 95M parameters and 222 ms decoding time.
## Background & Motivation
Learned image compression has recently surpassed traditional coding standards (JPEG, JPEG 2000) through joint end-to-end optimization of the analysis transform, quantization, and entropy model. Current state-of-the-art methods face a fundamental trade-off between model representational capacity and computational efficiency.
- Transformer- and attention-based methods offer strong global modeling capability but are large, slow to infer, and difficult to deploy.
- Spatial autoregressive models based on ConvLSTM can precisely capture local dependencies but suffer from severe sequential bottlenecks due to element-wise decoding order.
- Pure channel autoregressive methods (Minnen & Singh) improve parallelism but sacrifice fine-grained spatial dependency modeling.
Key Insight: Rather than pursuing architectural complexity, ARCHE deepens the combined modeling of multiple statistical dependencies within a purely convolutional framework. The Core Idea is that "the synergy of complementary dependency modeling outperforms any single complex architecture."
## Method
### Overall Architecture
ARCHE adopts a variational autoencoder (VAE) structure: the analysis transform \(g_a\) encodes image \(x\) into a latent representation \(y\), which is quantized and transmitted via entropy coding; the synthesis transform \(g_s\) reconstructs \(\hat{x}\) from the quantized representation \(\hat{y}\). The entropy model follows a hierarchical design — the hyperprior provides global statistics, channel conditioning and Masked PixelCNN progressively refine probability estimates, and latent residual prediction compensates for quantization noise. The optimization objective is the rate-distortion loss \(L = R + \lambda D\).
### Key Designs
- Autoregressive Hyperprior:
  - Function: Captures global statistical variation in the latent space.
  - Mechanism: The hyper-analysis transform \(h_a(y; \phi_h)\) maps \(y\) to a second-level latent variable \(z\), which is transmitted as side information; the hyper-synthesis transform outputs conditional prior parameters. A spatial autoregressive prior is introduced, modeling dependencies via masked convolutions: \(p(\hat{y}|\hat{z}) = \prod_i p(\hat{y}_i \mid \hat{y}_{<i}, \hat{z})\)
  - Design Motivation: A factorized prior assumes conditional independence among latent elements, so it cannot capture the spatial correlations that remain in the latents due to the limited receptive field of the convolutional transforms.
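To make the rate side of this concrete, here is a minimal sketch (not the paper's code; function names are illustrative) of how a hyperprior-style entropy model scores quantized latents: the hyper-synthesis output supplies per-element Gaussian parameters, and each integer symbol is charged the negative log of its discretized-Gaussian probability mass.

```python
import math

def gaussian_cdf(x, mu, sigma):
    """CDF of N(mu, sigma^2), via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def latent_bits(y_hat, mu, sigma):
    """Bits to code quantized latents under a discretized Gaussian
    conditional prior p(y_hat | z), as in Balle-style hyperprior models:
    each integer symbol costs -log2 of its probability mass."""
    total = 0.0
    for y, m, s in zip(y_hat, mu, sigma):
        p = gaussian_cdf(y + 0.5, m, s) - gaussian_cdf(y - 0.5, m, s)
        total += -math.log2(max(p, 1e-12))
    return total

# A symbol near its predicted mean is cheap to code...
cheap = latent_bits([0.0], [0.0], [1.0])      # ~1.4 bits
# ...while a surprising one is expensive.
expensive = latent_bits([5.0], [0.0], [1.0])  # ~18 bits
```

The better the conditional parameters \((\mu, \sigma)\) fit the latents, the fewer bits are spent, which is exactly what the autoregressive refinements below improve.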
- Masked PixelCNN Context Model:
  - Function: Refines entropy estimation by exploiting the spatial local structure of the latent representation.
  - Mechanism: Causal convolutions (Type A/B masks) based on PixelCNN use only the upper and left neighborhoods in raster-scan order to predict the conditional distribution parameters at each position; multiple masked convolution layers are stacked to enlarge the receptive field.
  - Design Motivation: Compared to ConvLSTM, masked convolutions enable parallel computation in a single forward pass, significantly reducing computational overhead and training instability.
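The Type A/B causal masks can be sketched in a few lines of NumPy (mask construction only, assuming standard raster-scan order; the surrounding layer code is omitted):

```python
import numpy as np

def causal_mask(k, mask_type="A"):
    """Build a k x k raster-scan mask for a PixelCNN-style masked conv.
    Type A (first layer) also hides the centre pixel; Type B keeps it,
    so stacked layers enlarge the causal receptive field."""
    m = np.ones((k, k), dtype=np.float32)
    c = k // 2
    start = c + 1 if mask_type == "B" else c
    m[c, start:] = 0.0   # zero the centre row from (or after) the centre pixel
    m[c + 1:, :] = 0.0   # zero every row below the centre
    return m

# For a 3x3 kernel, Type A sees only the upper/left neighbourhood:
# [[1. 1. 1.]
#  [1. 0. 0.]
#  [0. 0. 0.]]
```

Multiplying a convolution kernel element-wise by this mask before applying it enforces causality while every spatial position is still computed in one parallel forward pass, unlike ConvLSTM's sequential recurrence.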
- Channel Conditioning (CC):
  - Function: Models statistical co-occurrence relationships across channels.
  - Mechanism: The latent tensor is divided into \(C\) channel slices decoded in causal order; already-decoded channel features are processed through a lightweight convolutional stack to extract cross-channel statistical patterns.
  - Design Motivation: Cross-channel dependencies are typically low-frequency and smooth, so a lightweight network captures them effectively.
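A schematic of the causal slice order (the reduction below is a stand-in, not the paper's network): slice \(m\)'s entropy parameters may depend on the hyperprior features and on slices \(0..m-1\) only.

```python
import numpy as np

rng = np.random.default_rng(0)

def slice_params(hyper_feat, decoded_slices):
    """Stand-in for the lightweight conv stack: fuse hyperprior features
    with all previously decoded slices into per-element (mu, sigma)."""
    ctx = np.concatenate([hyper_feat] + decoded_slices, axis=0)
    mu = ctx.mean(axis=0, keepdims=True)           # illustrative reduction
    sigma = ctx.std(axis=0, keepdims=True) + 1e-6
    return mu, sigma

hyper_feat = rng.standard_normal((4, 8, 8))        # hyper-synthesis output
decoded = []
for m in range(10):                                # 10 slices, causal order
    mu, sigma = slice_params(hyper_feat, decoded)
    y_m = mu + sigma * rng.standard_normal(mu.shape)  # stand-in for entropy decode
    decoded.append(y_m)
```

Because the loop only ever reads already-decoded slices, the channel dimension stays causal while each slice is decoded fully in parallel over its spatial positions.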
- Squeeze-and-Excitation (SE) Channel Recalibration:
  - Function: Adaptively re-weights channel responses within the slice transforms.
  - Mechanism: Squeeze obtains channel descriptors \(s\) via global average pooling; Excitation learns channel-wise attention weights through two FC layers: \(w = \sigma(W_2 \cdot \text{ReLU}(W_1 \cdot s))\)
  - Design Motivation: SE lets the network concentrate capacity on more informative channels with negligible parameter overhead.
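The SE recalibration itself is only a few lines; a NumPy sketch with random weights (shapes follow the usual \(C \to C/r \to C\) bottleneck, with \(r\) the reduction ratio):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def se_recalibrate(x, w1, w2):
    """Squeeze-and-Excitation: global-average-pool to channel descriptors,
    a two-FC bottleneck with ReLU, sigmoid gates, channel-wise re-scaling."""
    s = x.mean(axis=(1, 2))                   # squeeze: (C,)
    w = sigmoid(w2 @ np.maximum(w1 @ s, 0))   # excite:  (C,) gates in (0, 1)
    return x * w[:, None, None]               # recalibrate each channel

C, r = 16, 4   # toy sizes; the paper uses r = 16 on 320 latent channels
rng = np.random.default_rng(0)
x = rng.standard_normal((C, 8, 8))
w1 = rng.standard_normal((C // r, C)) * 0.1   # random stand-in weights
w2 = rng.standard_normal((C, C // r)) * 0.1
y = se_recalibrate(x, w1, w2)
```

Since the gates lie in (0, 1), SE can only attenuate channels, which is why its per-channel re-weighting adds almost no parameters: \(2C^2/r\) weights versus the convolutions' \(k^2 C^2\).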
- Latent Residual Prediction (LRP):
  - Function: Compensates for the irreversible noise introduced by quantization.
  - Mechanism: A residual correction term \(r_m\) is predicted and applied as a bounded correction via the softsign activation: \(\hat{y}'_m = \hat{y}_m + \lambda_{LRP} \cdot \text{softsign}(r_m)\)
  - Design Motivation: Softsign yields smoother gradients than tanh, leading to more stable training.
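The boundedness is easy to see numerically: softsign maps any residual into (−1, 1), so the applied correction can never exceed \(\lambda_{LRP}\) in magnitude. A minimal sketch (the value of `lam` is a stand-in; the paper's setting may differ):

```python
def softsign(r):
    """softsign(r) = r / (1 + |r|), smoothly saturating in (-1, 1)."""
    return r / (1.0 + abs(r))

def lrp_correct(y_hat, r, lam=0.5):
    """Bounded latent-residual correction y' = y + lam * softsign(r);
    the correction magnitude is strictly below lam."""
    return y_hat + lam * softsign(r)

# Even a huge predicted residual yields a correction below lam:
delta = lrp_correct(0.0, 1000.0, lam=0.5)   # ~0.4995, never reaching 0.5
```

Unlike tanh, whose gradient decays exponentially in the tails, softsign's gradient \(1/(1+|r|)^2\) decays only polynomially, which is the stability argument above.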
### Loss & Training
- Rate-distortion loss \(L = R + \lambda D\), with MSE as the distortion metric \(D\).
- Eight \(\lambda\) values spanning near-lossless to high-compression operating points.
- Trained on the CLIC dataset with random 256×256 crops, 400 epochs, batch size 8, Adam optimizer with lr=1e-4.
- Latent depth of 320 divided into 10 slices; hyperprior depth of 192; SE reduction ratio of 16.
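The training objective in the list above reduces to a one-liner; a sketch with illustrative numbers (the paper's eight \(\lambda\) values are not reproduced here):

```python
def rd_loss(total_bits, num_pixels, mse, lam):
    """Rate-distortion objective L = R + lambda * D, with R measured in
    bits per pixel and D the reconstruction MSE; larger lam trades more
    rate for lower distortion (higher-quality operating point)."""
    bpp = total_bits / num_pixels
    return bpp + lam * mse

# e.g. a 256x256 crop coded at 0.5 bpp with MSE 0.001 and lambda = 100:
loss = rd_loss(0.5 * 256 * 256, 256 * 256, 0.001, 100.0)  # 0.5 + 0.1
```

Sweeping \(\lambda\) while retraining traces out the rate-distortion curve from which the BD-Rate numbers below are computed.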
## Key Experimental Results
### Main Results: Kodak BD-Rate Comparison (PSNR)
| Method | BD-Rate vs. Ballé (%) | BD-Rate vs. VVC (%) |
|---|---|---|
| Minnen et al. | −8.00 | +90.61 |
| Minnen & Singh | −16.28 | +63.55 |
| WeConvene | −6.92 | +92.47 |
| Iliopoulou et al. (prior work) | −24.22 | +30.19 |
| ARCHE | −48.01 | −5.61 |
### Ablation Study: Contribution of Each Component
| Configuration | BD-Rate Change | Notes |
|---|---|---|
| Full ARCHE | Reference (best) | All five components combined |
| w/o AR + MCM | Largest performance drop | Degenerates to pure hyperprior model |
| w/o MCM | Significant drop | Spatial context modeling is critical |
| w/o SE | Moderate drop at low bitrate | Channel recalibration benefits fine-grained structure |
| 10 slices vs. 1 slice | ~11% BD-Rate improvement | More slices help, with diminishing returns |
### Key Findings
- ARCHE is the first learned codec to surpass VVC Intra within a purely convolutional framework (BD-Rate −5.61%).
- With 95M parameters and 222 ms decoding time, it is lighter and faster than Minnen & Singh (121.7M, 249 ms) and the authors' prior work (124.3M, 265 ms).
- At low bitrates, ARCHE preserves sharper textures and more natural color transitions.
- Replacing ConvLSTM with Masked PixelCNN improves decoding speed and training stability.
## Highlights & Insights
- Design philosophy of "complementarity over complexity": The five components each address distinct levels of statistical redundancy; their combined effect far exceeds that of any single powerful module.
- Masked PixelCNN vs. ConvLSTM: Maintains causality while enabling parallel training.
- SE: low cost, high return: Plug-and-play and transferable to other compression frameworks.
## Limitations & Future Work
- Optimization is limited to MSE; perceptual losses are not employed.
- Performance on high-resolution images (4K) has not been evaluated.
- The implementation is tied to TensorFlow 2.11 and the TFC (tensorflow-compression) library, limiting portability.
- No direct comparison with recent Transformer-hybrid methods is provided.
## Related Work & Insights
- The Ballé hyperprior model serves as the foundation of learned compression; ARCHE validates a "progressive enhancement" strategy by stacking four complementary components on top of it.
- WeConvene's wavelet-domain autoregression is complementary to ARCHE's spatial-domain autoregression.
## Rating
- Novelty: ⭐⭐⭐ — Individual components are not entirely novel, but their combination and complementarity analysis offer meaningful contributions.
- Experimental Thoroughness: ⭐⭐⭐⭐ — Dual-dataset evaluation (Kodak + Tecnick) with BD-Rate, visual quality, ablation, and computational cost analyses.
- Writing Quality: ⭐⭐⭐⭐ — Clear structure with thorough motivation and derivation for each component.
- Value: ⭐⭐⭐⭐ — Demonstrates that purely convolutional architectures remain competitive with state-of-the-art methods while being more practical.