# ARCHE: Autoregressive Residual Compression with Hyperprior and Excitation
Conference: CVPR 2026 | arXiv: 2603.10188 | Code: https://github.com/sof-il/ARCHE | Area: Model Compression
Keywords: Learned Image Compression, Autoregressive Prior, Hyperprior, Squeeze-and-Excitation, Latent Residual Prediction
## TL;DR
This paper proposes ARCHE, an end-to-end learned image compression framework built on a purely convolutional architecture, free of Transformers and recurrent modules. It integrates five complementary components: a hierarchical hyperprior, a Masked PixelCNN spatial autoregressive context, channel conditioning, SE channel recalibration, and latent residual prediction. On Kodak it achieves a 48% BD-Rate reduction over the Ballé hyperprior baseline and −5.6% over VVC Intra, with only 95M parameters and 222 ms decoding time.
## Background & Motivation
Learned image compression has recently surpassed traditional coding standards (JPEG, JPEG 2000) through joint end-to-end optimization of the analysis transform, quantization, and entropy model. Current state-of-the-art methods face a fundamental trade-off between model representational capacity and computational efficiency.
- Transformer- and attention-based methods offer strong global modeling capability but are large, slow to infer, and difficult to deploy.
- Spatial autoregressive models based on ConvLSTM can precisely capture local dependencies but suffer from severe sequential bottlenecks due to element-wise decoding order.
- Pure channel autoregressive methods (Minnen & Singh) improve parallelism but sacrifice fine-grained spatial dependency modeling.
Key Insight: Rather than pursuing architectural complexity, ARCHE deepens the combined modeling of multiple statistical dependencies within a purely convolutional framework. The Core Idea is that "the synergy of complementary dependency modeling outperforms any single complex architecture."
## Method
### Overall Architecture
ARCHE adopts a variational autoencoder (VAE) structure: the analysis transform \(g_a\) encodes image \(x\) into a latent representation \(y\), which is quantized and transmitted via entropy coding; the synthesis transform \(g_s\) reconstructs \(\hat{x}\) from the quantized representation \(\hat{y}\). The entropy model follows a hierarchical design — the hyperprior provides global statistics, channel conditioning and Masked PixelCNN progressively refine probability estimates, and latent residual prediction compensates for quantization noise. The optimization objective is the rate-distortion loss \(L = R + \lambda D\).
### Key Designs
- Autoregressive Hyperprior:
  - Function: Captures global statistical variation in the latent space.
  - Mechanism: The hyper-analysis transform \(h_a(y; \phi_h)\) maps \(y\) to a second-level latent variable \(z\), which is transmitted as side information; the hyper-synthesis transform outputs conditional prior parameters. A spatial autoregressive prior is introduced, modeling dependencies via masked convolutions: \(p(\hat{y}|\hat{z}) = \prod_i p(\hat{y}_i \mid \hat{y}_{<i}, \hat{z})\)
  - Design Motivation: A factorized prior assumes conditional independence among latent elements, so it cannot capture the spatial correlations that remain in the latents due to the limited receptive field of the convolutional transforms.
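To make the rate side of this concrete, here is a minimal sketch (not the paper's code; function names are illustrative) of how a hyperprior-style entropy model scores quantized latents: the hyper-synthesis output supplies per-element Gaussian parameters, and each integer symbol is charged the negative log of its discretized-Gaussian probability mass.

```python
import math

def gaussian_cdf(x, mu, sigma):
    """CDF of N(mu, sigma^2), via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def latent_bits(y_hat, mu, sigma):
    """Bits to code quantized latents under a discretized Gaussian
    conditional prior p(y_hat | z), as in Balle-style hyperprior models:
    each integer symbol costs -log2 of its probability mass."""
    total = 0.0
    for y, m, s in zip(y_hat, mu, sigma):
        p = gaussian_cdf(y + 0.5, m, s) - gaussian_cdf(y - 0.5, m, s)
        total += -math.log2(max(p, 1e-12))
    return total

# A symbol near its predicted mean is cheap to code...
cheap = latent_bits([0.0], [0.0], [1.0])      # ~1.4 bits
# ...while a surprising one is expensive.
expensive = latent_bits([5.0], [0.0], [1.0])  # ~18 bits
```

The better the conditional parameters \((\mu, \sigma)\) fit the latents, the fewer bits are spent, which is exactly what the autoregressive refinements below improve.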
- Masked PixelCNN Context Model:
  - Function: Refines entropy estimation by exploiting the spatial local structure of the latent representation.
  - Mechanism: Causal convolutions (Type A/B masks) based on PixelCNN use only the upper and left neighborhoods in raster-scan order to predict the conditional distribution parameters at each position; multiple masked convolution layers are stacked to enlarge the receptive field.
  - Design Motivation: Compared to ConvLSTM, masked convolutions enable parallel computation in a single forward pass, significantly reducing computational overhead and training instability.
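The Type A/B causal masks can be sketched in a few lines of NumPy (mask construction only, assuming standard raster-scan order; the surrounding layer code is omitted):

```python
import numpy as np

def causal_mask(k, mask_type="A"):
    """Build a k x k raster-scan mask for a PixelCNN-style masked conv.
    Type A (first layer) also hides the centre pixel; Type B keeps it,
    so stacked layers enlarge the causal receptive field."""
    m = np.ones((k, k), dtype=np.float32)
    c = k // 2
    start = c + 1 if mask_type == "B" else c
    m[c, start:] = 0.0   # zero the centre row from (or after) the centre pixel
    m[c + 1:, :] = 0.0   # zero every row below the centre
    return m

# For a 3x3 kernel, Type A sees only the upper/left neighbourhood:
# [[1. 1. 1.]
#  [1. 0. 0.]
#  [0. 0. 0.]]
```

Multiplying a convolution kernel element-wise by this mask before applying it enforces causality while every spatial position is still computed in one parallel forward pass, unlike ConvLSTM's sequential recurrence.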
- Channel Conditioning (CC):
  - Function: Models statistical co-occurrence relationships across channels.
  - Mechanism: The latent tensor is divided into \(C\) channel slices decoded in causal order; already-decoded channel features are processed through a lightweight convolutional stack to extract cross-channel statistical patterns.
  - Design Motivation: Cross-channel dependencies are typically low-frequency and smooth, so a lightweight network captures them effectively.
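A schematic of the causal slice order (the reduction below is a stand-in, not the paper's network): slice \(m\)'s entropy parameters may depend on the hyperprior features and on slices \(0..m-1\) only.

```python
import numpy as np

rng = np.random.default_rng(0)

def slice_params(hyper_feat, decoded_slices):
    """Stand-in for the lightweight conv stack: fuse hyperprior features
    with all previously decoded slices into per-element (mu, sigma)."""
    ctx = np.concatenate([hyper_feat] + decoded_slices, axis=0)
    mu = ctx.mean(axis=0, keepdims=True)           # illustrative reduction
    sigma = ctx.std(axis=0, keepdims=True) + 1e-6
    return mu, sigma

hyper_feat = rng.standard_normal((4, 8, 8))        # hyper-synthesis output
decoded = []
for m in range(10):                                # 10 slices, causal order
    mu, sigma = slice_params(hyper_feat, decoded)
    y_m = mu + sigma * rng.standard_normal(mu.shape)  # stand-in for entropy decode
    decoded.append(y_m)
```

Because the loop only ever reads already-decoded slices, the channel dimension stays causal while each slice is decoded fully in parallel over its spatial positions.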
- Squeeze-and-Excitation (SE) Channel Recalibration:
  - Function: Adaptively re-weights channel responses within the slice transforms.
  - Mechanism: Squeeze obtains channel descriptors \(s\) via global average pooling; Excitation learns channel-wise attention weights through two FC layers: \(w = \sigma(W_2 \cdot \text{ReLU}(W_1 \cdot s))\)
  - Design Motivation: SE lets the network concentrate capacity on more informative channels with negligible parameter overhead.
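The SE recalibration itself is only a few lines; a NumPy sketch with random weights (shapes follow the usual \(C \to C/r \to C\) bottleneck, with \(r\) the reduction ratio):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def se_recalibrate(x, w1, w2):
    """Squeeze-and-Excitation: global-average-pool to channel descriptors,
    a two-FC bottleneck with ReLU, sigmoid gates, channel-wise re-scaling."""
    s = x.mean(axis=(1, 2))                   # squeeze: (C,)
    w = sigmoid(w2 @ np.maximum(w1 @ s, 0))   # excite:  (C,) gates in (0, 1)
    return x * w[:, None, None]               # recalibrate each channel

C, r = 16, 4   # toy sizes; the paper uses r = 16 on 320 latent channels
rng = np.random.default_rng(0)
x = rng.standard_normal((C, 8, 8))
w1 = rng.standard_normal((C // r, C)) * 0.1   # random stand-in weights
w2 = rng.standard_normal((C, C // r)) * 0.1
y = se_recalibrate(x, w1, w2)
```

Since the gates lie in (0, 1), SE can only attenuate channels, which is why its per-channel re-weighting adds almost no parameters: \(2C^2/r\) weights versus the convolutions' \(k^2 C^2\).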
- Latent Residual Prediction (LRP):
  - Function: Compensates for the irreversible noise introduced by quantization.
  - Mechanism: A residual correction term \(r_m\) is predicted and applied as a bounded correction via the softsign activation: \(\hat{y}'_m = \hat{y}_m + \lambda_{LRP} \cdot \text{softsign}(r_m)\)
  - Design Motivation: Softsign yields smoother gradients than tanh, leading to more stable training.
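The boundedness is easy to see numerically: softsign maps any residual into (−1, 1), so the applied correction can never exceed \(\lambda_{LRP}\) in magnitude. A minimal sketch (the value of `lam` is a stand-in; the paper's setting may differ):

```python
def softsign(r):
    """softsign(r) = r / (1 + |r|), smoothly saturating in (-1, 1)."""
    return r / (1.0 + abs(r))

def lrp_correct(y_hat, r, lam=0.5):
    """Bounded latent-residual correction y' = y + lam * softsign(r);
    the correction magnitude is strictly below lam."""
    return y_hat + lam * softsign(r)

# Even a huge predicted residual yields a correction below lam:
delta = lrp_correct(0.0, 1000.0, lam=0.5)   # ~0.4995, never reaching 0.5
```

Unlike tanh, whose gradient decays exponentially in the tails, softsign's gradient \(1/(1+|r|)^2\) decays only polynomially, which is the stability argument above.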
### Loss & Training
- Rate-distortion loss \(L = R + \lambda D\), with MSE as the distortion metric \(D\).
- Eight \(\lambda\) values spanning near-lossless to high-compression operating points.
- Trained on the CLIC dataset with random 256×256 crops, 400 epochs, batch size 8, Adam optimizer with lr=1e-4.
- Latent depth of 320 divided into 10 slices; hyperprior depth of 192; SE reduction ratio of 16.
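The training objective in the list above reduces to a one-liner; a sketch with illustrative numbers (the paper's eight \(\lambda\) values are not reproduced here):

```python
def rd_loss(total_bits, num_pixels, mse, lam):
    """Rate-distortion objective L = R + lambda * D, with R measured in
    bits per pixel and D the reconstruction MSE; larger lam trades more
    rate for lower distortion (higher-quality operating point)."""
    bpp = total_bits / num_pixels
    return bpp + lam * mse

# e.g. a 256x256 crop coded at 0.5 bpp with MSE 0.001 and lambda = 100:
loss = rd_loss(0.5 * 256 * 256, 256 * 256, 0.001, 100.0)  # 0.5 + 0.1
```

Sweeping \(\lambda\) while retraining traces out the rate-distortion curve from which the BD-Rate numbers below are computed.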
## Key Experimental Results
### Main Results: Kodak BD-Rate Comparison (PSNR)
| Method | BD-Rate vs. Ballé (%) | BD-Rate vs. VVC (%) |
|---|---|---|
| Minnen et al. | −8.00 | +90.61 |
| Minnen & Singh | −16.28 | +63.55 |
| WeConvene | −6.92 | +92.47 |
| Iliopoulou et al. (prior work) | −24.22 | +30.19 |
| ARCHE | −48.01 | −5.61 |
### Ablation Study: Contribution of Each Component
| Configuration | BD-Rate Change | Notes |
|---|---|---|
| Full ARCHE | Reference (best) | All five components combined |
| w/o AR + MCM | Largest performance drop | Degenerates to pure hyperprior model |
| w/o MCM | Significant drop | Spatial context modeling is critical |
| w/o SE | Moderate drop at low bitrate | Channel recalibration benefits fine-grained structure |
| 10 slices vs. 1 slice | ~11% BD-Rate improvement | More slices help, with diminishing returns |
### Key Findings
- ARCHE is the first learned codec to surpass VVC Intra within a purely convolutional framework (BD-Rate −5.61%).
- With 95M parameters and 222 ms decoding time, it is lighter and faster than Minnen & Singh (121.7M, 249 ms) and the authors' prior work (124.3M, 265 ms).
- At low bitrates, ARCHE preserves sharper textures and more natural color transitions.
- Replacing ConvLSTM with Masked PixelCNN improves decoding speed and training stability.
## Highlights & Insights
- Design philosophy of "complementarity over complexity": The five components each address distinct levels of statistical redundancy; their combined effect far exceeds that of any single powerful module.
- Masked PixelCNN vs. ConvLSTM: Maintains causality while enabling parallel training.
- SE: low cost, high return: Plug-and-play and transferable to other compression frameworks.
## Limitations & Future Work
- Optimization is limited to MSE; perceptual losses are not employed.
- Performance on high-resolution images (4K) has not been evaluated.
- The implementation is tied to TensorFlow 2.11 and the TFC (tensorflow-compression) library, limiting portability.
- No direct comparison with recent Transformer-hybrid methods is provided.
## Related Work & Insights
- The Ballé hyperprior model serves as the foundation of learned compression; ARCHE validates a "progressive enhancement" strategy by stacking four complementary components on top of it.
- WeConvene's wavelet-domain autoregression is complementary to ARCHE's spatial-domain autoregression.
## Rating
- Novelty: ⭐⭐⭐ — Individual components are not entirely novel, but their combination and complementarity analysis offer meaningful contributions.
- Experimental Thoroughness: ⭐⭐⭐⭐ — Dual-dataset evaluation (Kodak + Tecnick) with BD-Rate, visual quality, ablation, and computational cost analyses.
- Writing Quality: ⭐⭐⭐⭐ — Clear structure with thorough motivation and derivation for each component.
- Value: ⭐⭐⭐⭐ — Demonstrates that purely convolutional architectures remain competitive with state-of-the-art methods while being more practical.