DLF: Extreme Image Compression with Dual-generative Latent Fusion¶
Conference: ICCV 2025 · arXiv: 2503.01428 · Code: dlfcodec.github.io · Area: Model Compression / Image Compression · Keywords: extreme low-bitrate image compression, generative codec, dual-branch encoding, vector quantization, semantic-detail decomposition
TL;DR¶
This paper proposes the Dual-generative Latent Fusion (DLF) framework, which decomposes the image latent space into semantic and detail branches for separate compression, and eliminates inter-branch redundancy via a cross-branch interactive design. At extreme low bitrates (<0.01 bpp), DLF achieves state-of-the-art reconstruction quality with BD-Rate savings of up to 67.82% over MS-ILLM, while decoding significantly faster than diffusion-based approaches.
Background & Motivation¶
State of the Field¶
Extreme low-bitrate image compression (e.g., below 0.01 bpp) is a central challenge in the image compression community. Traditional codecs such as VVC and MSE-optimized neural codecs suffer from severe blurring at low bitrates. Recent methods based on generative tokenizers (e.g., VQGAN) achieve extremely high compression ratios by encoding images into a small number of token indices, yet face a fundamental tension.
Limitations of Prior Work¶
Capacity bottleneck of a single codebook: Generative tokenizers such as VQGAN learn compact codebooks by clustering semantics across an entire dataset, prioritizing dataset-level common content. As codebook size shrinks, instance-specific details (e.g., precise geometric features) are severely distorted.
Limitations of diffusion-based approaches: Methods such as PerCo and DiffEIC improve perceptual realism but fall short on fidelity and suffer from extremely slow decoding (over 4 seconds per image).
Redundancy in dual-branch designs: The existing dual-branch method HybridFlow uses independent encoders, resulting in substantial information redundancy between the two bitstreams.
Root Cause¶
A single quantization strategy cannot serve both dataset-level common semantics and instance-specific details. This raises the central question: can the latent space be flexibly decomposed into a "semantic" component and a "detail" component, each compressed with the most appropriate strategy, while cross-branch interaction eliminates redundancy?
Method¶
Overall Architecture¶
DLF adopts a dual-branch encoding architecture. An input image \(X \in \mathbb{R}^{3 \times H \times W}\) is first projected via patch embedding to \(Emb(X) \in \mathbb{R}^{C \times h \times w}\) (\(h=H/16, w=W/16\)), then fed into the semantic and detail branches in parallel. The semantic branch uses a 1-D tokenizer to cluster high-level semantics into compact tokens, while the detail branch uses scalar quantization (SQ) to encode perceptually important details. The two branches interact through multi-layer Interactive Transform (IT) modules. After decoding, semantic and detail features are fused via a latent adaptor and passed to a pretrained VQGAN decoder to generate the reconstructed image.
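As a sanity check on the shapes involved, here is a minimal walk-through with illustrative dimensions (the channel width `C` is a hypothetical stand-in; the paper's actual value is not restated here):

```python
# Shape walk-through of the DLF pipeline (illustrative dimensions only).
H, W, C = 512, 768, 16               # input size and a hypothetical embedding width
h, w = H // 16, W // 16              # patch embedding downsamples by 16 -> (C, 32, 48)
N = (h * w) // (16 * 16)             # number of 16x16 windows -> 6

semantic_tokens = (N, C, 32)         # y_s: 32 learned 1-D tokens per window
detail_latent = (C, h // 2, w // 2)  # y_d: detail branch after 2x downsampling

print(semantic_tokens, detail_latent)
```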
Key Designs¶
1. Semantic Branch (1-D Tokenizer)¶
- Function: Compresses high-level image semantics into a minimal number of 1-D tokens.
- Mechanism: \(Emb(X)\) is partitioned into \(16 \times 16\) windows. Within each window, 256 2-D image tokens and 32 additional 1-D tokens are jointly fed into a ViT. Through cascaded attention, the 1-D tokens efficiently aggregate key semantic information from the image tokens, yielding \(y_s \in \mathbb{R}^{N \times C \times 32}\), where \(N = \frac{h \times w}{16 \times 16}\).
- Quantization: Vector quantization (VQ) with fixed-length coding for codebook index transmission.
- Design Motivation: Compared to manually predefined mask-based token reduction strategies, learning-based semantic compression is more flexible and adapts better to diverse images.
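A minimal sketch of the semantic branch's VQ step and its bit cost, assuming a hypothetical codebook size `K = 1024` and channel width `C = 16` (not the paper's actual configuration):

```python
import numpy as np

# Nearest-codeword vector quantization for one window's 1-D tokens
# (assumption: K and C are illustrative, not the paper's values).
rng = np.random.default_rng(0)
K, C, n_tokens = 1024, 16, 32
codebook = rng.standard_normal((K, C))
y_s = rng.standard_normal((n_tokens, C))   # 32 1-D tokens from one 16x16 window

# Each token is replaced by its closest codebook entry; only indices are sent.
d = ((y_s[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)  # (32, K) distances
idx = d.argmin(axis=1)                     # transmitted indices
y_hat = codebook[idx]                      # dequantized tokens

# Fixed-length coding: each index costs log2(K) bits, so one window costs
# 32 * log2(1024) = 320 bits for a 256x256-pixel region -> ~0.0049 bpp,
# consistent with the sub-0.01 bpp operating range.
bits_per_window = n_tokens * np.log2(K)
bpp = bits_per_window / (256 * 256)
print(bits_per_window, round(bpp, 4))      # → 320.0 0.0049
```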
2. Detail Branch (SQ + Adaptive Bit Allocation)¶
- Function: Captures instance-level detail information that the VQ codebook cannot represent.
- Mechanism: Shifted window attention combined with ConvNeXT extracts local and global details, downsampled to \(y_d \in \mathbb{R}^{C \times h/2 \times w/2}\). SQ with a quadtree-based entropy model is used for arithmetic coding.
- Quantization formula: \(\hat{y}_s = VQ(y_s), \quad \hat{y}_d = Q(y_d)\)
- Design Motivation: SQ provides a far larger quantization space than VQ, enabling learnable, spatially adaptive quantization step sizes—allocating more bits to distinctive object contours and fewer bits to common content well-represented by the semantic branch.
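The adaptive bit allocation can be sketched as uniform scalar quantization with a per-position step size. The step map below is hand-set for illustration, whereas DLF learns it end-to-end:

```python
import numpy as np

# Scalar quantization with a spatially adaptive step size
# (assumption: the step map is hand-set here; in DLF it is learned).
rng = np.random.default_rng(1)
y_d = rng.standard_normal((4, 4))      # a tiny detail-latent patch

# Smaller step = finer quantization = more bits. Give the "contour" region
# (left half) a fine step and the "common content" region a coarse one.
step = np.where(np.arange(4)[None, :] < 2, 0.25, 1.0)

y_hat = np.round(y_d / step) * step    # uniform SQ with per-position step
err = np.abs(y_d - y_hat)              # bounded by step/2 everywhere
print(err.max() <= 0.5)
```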
3. Cross-Branch Interaction (Interactive Transform, IT)¶
- Function: Eliminates information redundancy between the semantic and detail bitstreams.
- Mechanism: Detail features \(f_d\) are rearranged according to the same windowing strategy as the semantic branch into \(\tilde{f}_d \in \mathbb{R}^{N \times C \times 256}\), then jointly processed with semantic features through multi-head self-attention. The processed detail features are subsequently restored to their original shape.
- Design Motivation: (1) Self-attention dynamically redistributes semantic and detail information across branches, reducing redundancy while allowing detail information to correct generation errors in the semantic branch. (2) Cross-window perceptual capability is introduced to the semantic branch—detail features contribute global information, expanding the semantic branch's receptive field.
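A toy version of the interaction for a single window, assuming single-head attention with random weights, purely to show the data flow (DLF uses multi-head attention with learned parameters):

```python
import numpy as np

# Interactive Transform sketch: detail tokens are windowed to match the
# semantic branch, concatenated, and mixed by self-attention.
rng = np.random.default_rng(2)
C = 16
f_s = rng.standard_normal((32, C))       # 32 semantic 1-D tokens (one window)
f_d = rng.standard_normal((256, C))      # 16x16 detail tokens, same window

x = np.concatenate([f_s, f_d], axis=0)   # (288, C): joint token sequence
Wq, Wk, Wv = [rng.standard_normal((C, C)) * 0.1 for _ in range(3)]
q, k, v = x @ Wq, x @ Wk, x @ Wv

a = q @ k.T / np.sqrt(C)                 # (288, 288) attention logits
a = np.exp(a - a.max(axis=1, keepdims=True))
a /= a.sum(axis=1, keepdims=True)        # softmax: every token attends to all

out = a @ v                              # semantic rows now mix in detail info
f_s_out, f_d_out = out[:32], out[32:]    # split back; detail part is later
print(f_s_out.shape, f_d_out.shape)      # reshaped to its spatial layout
```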
Loss & Training¶
A two-stage progressive training scheme is adopted:
- Stage 1 (Latent Space Alignment): A rate-distortion loss is applied in latent space, supervising the fused feature \(\hat{h}\) with reference features \(\tilde{h}\) generated by a pretrained VQGAN encoder. Training uses 256×256 patches, batch size 8, fixed \(\lambda = 24.0\).
- Stage 2 (End-to-End Fine-tuning): The full model is fine-tuned with generative losses in pixel space. Training uses 512×512 patches, batch size 4, with \(\lambda \in \{5.8, 8.5, 16.0, 28.0\}\) to cover different bitrates.
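The Stage-1 objective can be schematized as the usual rate-distortion trade-off \(L = R + \lambda D\) computed in latent space; the rate value below is a stand-in, and the exact rate term from the paper is not reproduced:

```python
import numpy as np

# Stage-1 objective sketch: L = R + lambda * D in latent space, where D
# compares the fused feature h_hat against the pretrained VQGAN encoder's
# reference h_tilde (all values here are illustrative).
rng = np.random.default_rng(3)
h_tilde = rng.standard_normal((16, 32, 32))                 # reference latent
h_hat = h_tilde + 0.1 * rng.standard_normal(h_tilde.shape)  # fused feature

lam = 24.0                              # fixed lambda used in Stage 1
rate = 0.005                            # stand-in bits-per-pixel estimate
dist = np.mean((h_hat - h_tilde) ** 2)  # latent-space MSE

loss = rate + lam * dist
print(float(loss))
```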
Key Experimental Results¶
Main Results¶
| Method | Dataset | LPIPS BD-Rate | DISTS BD-Rate | Note |
|---|---|---|---|---|
| DLF | Kodak | -43.05% | -67.82% | Ours |
| GLC | Kodak | -17.24% | -33.41% | Tokenizer-based |
| DiffEIC | Kodak | +66.05% | +14.67% | Diffusion-based |
| PerCo | Kodak | +101.74% | -4.02% | Diffusion-based |
| HybridFlow | Kodak | +65.30% | — | Dual-branch |
| MS-ILLM | Kodak | 0.00% (anchor) | 0.00% (anchor) | Anchor |
| DLF | CLIC2020 | -27.93% | -53.55% | Ours |
Ablation Study¶
| Configuration | Kodak LPIPS | Kodak DISTS | CLIC LPIPS | CLIC DISTS | Note |
|---|---|---|---|---|---|
| w/ SQ detail (DLF) | 0.0% | 0.0% | 0.0% | 0.0% | Full model (anchor) |
| w/o detail | +17.5% | +20.2% | +47.9% | +47.6% | Remove detail branch |
| w/o interactive | +64.1% | +73.6% | +68.8% | +61.8% | Remove IT modules |
| w/ VQ detail | +18.3% | +40.7% | +27.3% | +58.1% | Replace SQ with VQ in detail branch |
Complexity Analysis¶
| Model | Encoding Time | Decoding Time | DISTS BD-Rate |
|---|---|---|---|
| MS-ILLM | 0.064s | 0.070s | 0.00% |
| PerCo | 0.461s | 2.443s | -4.02% |
| DiffEIC | 0.152s | 4.093s | +14.67% |
| DLF | 0.178s | 0.252s | -67.82% |
Key Findings¶
- Removing the cross-branch interaction (IT modules) causes the most severe performance degradation (>60% BD-Rate loss), demonstrating that independent dual branches suffer from substantial redundancy.
- Using SQ in the detail branch significantly outperforms VQ, confirming the importance of a large quantization space for representing diverse details.
- DLF decodes 10–16× faster than diffusion-based methods while achieving significantly better fidelity.
- On CLIC2020 at 768×768, DLF substantially outperforms PerCo in FID, demonstrating its advantage on high-quality datasets.
Highlights & Insights¶
- Semantic-detail decomposition: Decoupling "dataset-level common content" and "instance-level diversity" into two separate bitstreams, each compressed with the most suitable quantization strategy, is an elegant and effective design principle.
- Critical role of cross-branch interaction: Ablation results clearly show that independent dual branches without interaction perform even worse than a single branch; the interactive design is the key to the success of the dual-branch framework.
- Deeper insight into SQ vs. VQ: VQ's limited codebook is naturally suited for encoding clustered common semantics, while SQ's large quantization space accommodates diverse details. A hybrid quantization strategy is therefore optimal.
- Strong competition against diffusion-based methods: DLF achieves comparable perceptual realism while substantially improving fidelity and decoding speed.
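The SQ-vs-VQ asymmetry noted above is easy to quantify with illustrative numbers (neither `K` nor `L`/`C` is the paper's configuration):

```python
# Back-of-envelope comparison of quantization spaces (illustrative numbers):
# a VQ codebook with K entries can represent only K distinct vectors, while
# SQ with L levels on each of C channels spans L**C combinations.
K = 1024                       # hypothetical VQ codebook size
L, C = 16, 8                   # hypothetical SQ levels and channel count
vq_points = K
sq_points = L ** C             # 16**8 = 2**32 = 4294967296
print(sq_points // vq_points)  # → 4194304: ~4.2 million times more points
```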
Limitations & Future Work¶
- Not yet real-time: Encoding at 0.178s and decoding at 0.252s remain insufficient for real-time applications.
- High training cost: Two-stage training is required, with dependency on a pretrained VQGAN tokenizer.
- Limited bitrate control granularity: Bitrate is controlled via \(\lambda\) adjustment, offering limited flexibility.
- Evaluation limited to perceptual metrics: Downstream task evaluation (e.g., detection, segmentation) is absent.
Related Work & Insights¶
- The redundancy in HybridFlow's independent dual-branch design demonstrates the necessity of cross-branch redundancy elimination; through the IT modules, DLF achieves a 2.6× higher compression ratio.
- The 1-D tokenizer provides a foundation for compact semantic compression, upon which DLF adds detail encoding.
- Comparison with diffusion-based methods demonstrates the advantage of the tokenizer-based approach in terms of speed and fidelity.
Rating¶
- Novelty: ⭐⭐⭐⭐ — The semantic-detail decomposition combined with cross-branch interaction is an elegant design, though the dual-branch concept itself is not entirely new.
- Experimental Thoroughness: ⭐⭐⭐⭐ — Multi-dataset evaluation, comprehensive ablation, and complexity analysis provide a thorough empirical study.
- Writing Quality: ⭐⭐⭐⭐⭐ — Motivation is clearly articulated, deriving the proposed method from the limitations of VQ in a logically coherent manner.
- Value: ⭐⭐⭐⭐ — Represents a significant advance in extreme low-bitrate compression and offers important reference value for the tokenizer-based compression direction.