Efficient Learned Data Compression via Dual-Stream Feature Decoupling

Conference: ACL 2026
arXiv: 2604.07239
Code: https://github.com/huidong-ma/FADE
Area: Model Compression / Data Compression
Keywords: learned data compression, dual-stream feature decoupling, probability modeling, parallel pipeline, lossless compression

TL;DR

This paper proposes FADE, a learned lossless compression framework. Its Dual-stream Multi-scale Decoupler separates micro-syntactic and macro-semantic features into two parallel shallow streams instead of one deep serial stack. Combined with a Hierarchical Gated Refiner and a Concurrent Stream Parallel Pipeline, this yields state-of-the-art compression ratio and throughput at the same time.

Background & Motivation

Background: Learned Data Compression (LDC) leverages deep learning for probabilistic prediction, significantly surpassing traditional methods (Gzip, zstd, etc.) in compression ratio. Mainstream approaches adopt autoregressive frameworks that predict the conditional probability distribution \(P(x_t|x_{<t})\) at each step and compress via entropy coding.
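
The link between prediction and compressed size can be made concrete: an ideal entropy coder spends about \(-\log_2 P(x_t|x_{<t})\) bits on each symbol. The sketch below is a minimal illustration of that accounting, not FADE's model. It assumes a simple adaptive order-1 count model (conditioning on the previous byte only) in place of a neural predictor, and the function name and start-of-stream context are hypothetical.

```python
import math
from collections import defaultdict

def ideal_code_length_bits(data: bytes, alphabet_size: int = 256) -> float:
    """Sum of -log2 P(x_t | x_{t-1}) under an adaptive order-1 count model."""
    # Every count starts at 1 (Laplace smoothing) so no symbol has zero probability.
    counts = defaultdict(lambda: [1] * alphabet_size)
    totals = defaultdict(lambda: alphabet_size)
    bits = 0.0
    prev = 0  # hypothetical start-of-stream context
    for sym in data:
        p = counts[prev][sym] / totals[prev]  # predict before seeing the symbol
        bits += -math.log2(p)                 # cost an ideal entropy coder would pay
        counts[prev][sym] += 1                # update afterwards, as a decoder also could
        totals[prev] += 1
        prev = sym
    return bits

sample = b"ab" * 128
bits = ideal_code_length_bits(sample)
ratio = (8 * len(sample)) / bits  # raw bits divided by ideal coded bits
```

Because the model is updated only with symbols that have already been coded, a decoder can replay the same updates and reproduce the same probabilities. Better predictions translate directly into fewer bits.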

Limitations of Prior Work: Existing methods have two structural limitations. (1) Single-stream architectures struggle to capture micro-syntactic features (local N-gram patterns) and macro-semantic features (long-range dependencies) at the same time, so they fall back on deep MLP stacks to approximate complex distributions, which increases autoregressive decoding latency. (2) In heterogeneous systems, GPU-based probability generation and CPU-based arithmetic coding run at different speeds, which stalls the pipeline; meanwhile, the serial autoregressive decoding loop caps the speedup that added parallel hardware can deliver, as Amdahl's Law predicts.
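
To see why the serial decoding loop matters, recall Amdahl's Law: if only a fraction \(p\) of the work can be parallelized across \(N\) workers, the speedup is at most \(1 / ((1-p) + p/N)\). The snippet below evaluates that bound. The 50% parallel fraction is a hypothetical figure chosen for illustration, not a measurement from the paper.

```python
def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Upper bound on speedup when only `parallel_fraction` of the work scales."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)

# Hypothetical workload: half of the decode time is the serial autoregressive loop.
for n in (2, 8, 64):
    print(n, round(amdahl_speedup(0.5, n), 2))
# prints: 2 1.33 / 8 1.78 / 64 1.97 (one pair per line)
```

With half the work serial, the speedup never reaches 2x no matter how many workers are added, which is why shrinking the serial portion itself is the more effective lever.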

Key Challenge: Accurate probabilistic modeling (high compression ratio) requires deep networks, yet deep serial execution incurs high latency. Analysis of mutual information decay curves reveals that data sequences exhibit two distinct dependency patterns—"micro-syntactic" (sharp initial decay) and "macro-semantic" (persistent non-zero tails). Single-stream MLPs fit both heterogeneous feature types with shared parameters, leading to diffuse saliency distributions.
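
A mutual information decay curve of the kind described here can be estimated directly from symbol counts. The following is a rough sketch using a plug-in estimator of \(I(X_t; X_{t+k})\) at lag \(k\); the function name is hypothetical, and the paper's exact estimation procedure is not given in this summary.

```python
import math
from collections import Counter

def lagged_mutual_information(seq, lag: int) -> float:
    """Empirical mutual information I(X_t; X_{t+lag}) in bits."""
    pairs = list(zip(seq, seq[lag:]))
    n = len(pairs)
    joint = Counter(pairs)
    left = Counter(a for a, _ in pairs)
    right = Counter(b for _, b in pairs)
    mi = 0.0
    for (a, b), c in joint.items():
        # p(a,b) * log2(p(a,b) / (p(a) * p(b))), written in terms of raw counts
        mi += (c / n) * math.log2(c * n / (left[a] * right[b]))
    return mi

text = b"the quick brown fox jumps over the lazy dog. " * 50
decay_curve = [lagged_mutual_information(text, k) for k in range(1, 17)]
```

Plotting `decay_curve` against the lag gives the decay profile; a steep initial drop would correspond to the micro-syntactic regime and a persistent non-zero tail to the macro-semantic one. Note that the plug-in estimator is biased upward on short sequences, so small tail values should be read with care.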

Goal: To substantially reduce latency and improve throughput while maintaining or improving compression ratio.

Key Insight: Information-theoretic analysis of the dual dependency patterns in data motivates explicit feature decoupling—replacing deep serial processing with shallow parallel streams—to address bottlenecks at both the model and system levels simultaneously.

Core Idea: A CNN branch captures micro-level local patterns while an MLP branch captures macro-level global dependencies; a content-adaptive router dynamically fuses the two streams, followed by a Hierarchical Gated Refiner for instance-adaptive refinement.

Method

Overall Architecture

FADE comprises three core innovations: (1) a Dual-stream Multi-scale Decoupler (DMD) that separates features into parallel local CNN and global MLP streams; (2) a Hierarchical Gated Refiner (HGR) that achieves instance-adaptive probability modeling through coarse-to-fine two-level refinement; and (3) a Concurrent Stream Parallel Pipeline (CSPP) that integrates data parallelism and temporal parallelism to enable zero-wait processing.

Key Designs

Dual-stream Multi-scale Decoupler (DMD):
- Function: Separates micro-syntactic and macro-semantic features into parallel streams with distinct inductive biases, replacing deep serial stacking.
- Mechanism: The global stream uses a GeGLU-based Rolling Cache to capture long-range dependencies—maintaining a rolling cache \(\bm{M}\) updated at each step as \(\bm{M}_t = \text{Roll}(\bm{M}_{t-1}, \text{GeGLU}(\bm{X}_t))\); the local stream applies 1D convolutions with strong local inductive bias to precisely capture N-gram patterns. A content-adaptive router generates element-wise mixing weights via Sigmoid gating: \(\bm{H}_{\text{mix}} = \bm{\alpha} \odot \bm{H}_{\text{global}} + (1-\bm{\alpha}) \odot \bm{H}_{\text{local}}\).
- Design Motivation: Mutual information decay analysis and feature saliency heatmaps confirm that single-stream MLPs exhibit diffuse saliency distributions and fail to capture sharp micro-syntactic fluctuations. Two parallel shallow streams replace one deep serial stream, simultaneously addressing feature interference and latency issues.
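
The DMD data flow described above can be sketched in plain Python. This is an illustrative toy under several assumptions, not the authors' implementation: `Roll` is taken to mean "drop the oldest cache slot and append the newest", the router is assumed to read only the current input, the local stream is a dense causal 1D convolution with one matrix per tap, and all weight names and sizes are hypothetical.

```python
import math
import random

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(W, x):
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def geglu(x, W_gate, W_value):
    # GeGLU(x) = GELU(W_gate x) * (W_value x), element-wise
    return [gelu(g) * v for g, v in zip(matvec(W_gate, x), matvec(W_value, x))]

def roll(cache, new_slot):
    # Assumed meaning of Roll: drop the oldest slot, append the newest.
    return cache[1:] + [new_slot]

def mix(alpha, h_global, h_local):
    # H_mix = alpha * H_global + (1 - alpha) * H_local, element-wise
    return [a * g + (1.0 - a) * l for a, g, l in zip(alpha, h_global, h_local)]

def dmd_step(x_window, cache, p):
    """One step. x_window holds the last k input vectors, newest last."""
    x_t = x_window[-1]
    # Global stream: update the rolling cache, then project its flattened contents.
    cache = roll(cache, geglu(x_t, p["W_gate"], p["W_value"]))
    h_global = matvec(p["W_global"], [v for slot in cache for v in slot])
    # Local stream: causal 1D convolution, one d x d tap per step in the window.
    h_local = [0.0] * len(x_t)
    for tap, x in zip(p["conv_taps"], x_window):
        h_local = [acc + v for acc, v in zip(h_local, matvec(tap, x))]
    # Router: element-wise sigmoid gate (assumed to depend on x_t only).
    alpha = [sigmoid(a) for a in matvec(p["W_router"], x_t)]
    return mix(alpha, h_global, h_local), cache

rng = random.Random(0)
d, k, cache_len = 4, 3, 2  # hypothetical toy sizes
params = {
    "W_gate": rand_matrix(d, d, rng),
    "W_value": rand_matrix(d, d, rng),
    "W_global": rand_matrix(d, cache_len * d, rng),
    "conv_taps": [rand_matrix(d, d, rng) for _ in range(k)],
    "W_router": rand_matrix(d, d, rng),
}
cache = [[0.0] * d for _ in range(cache_len)]
window = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(k)]
h_mix, cache = dmd_step(window, cache, params)
```

The two streams do not depend on each other, so in a real implementation they could run concurrently; only the final mixing step needs both results.
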
Hierarchical Gated Refiner (HGR):
- Function: Performs coarse-to-fine instance-adaptive refinement on the fused features from DMD to improve probability estimation accuracy.
- Mechanism: A two-level cascade: (a) Coarse-grained channel interaction: batch matrix multiplication (BMM) with persistent memory \(\bm{W}_U \in \mathbb{R}^{B \times d_h \times d_h}\), where each batch index corresponds to a fixed data stream and evolves via backpropagation to capture stream-specific patterns; content-aware self-gating \(\bm{H}_{\text{coarse}} = (\bm{H}_a \odot \sigma(\bm{H}_{\text{mix}} \bm{W}_c)) + \lambda_c \cdot \bm{H}_{\text{mix}}\) suppresses noise. (b) Fine-grained nonlinear refinement: further refinement via GeGLU and projection.
- Design Motivation: The globally shared parameters of DMD cannot adapt to non-stationary feature distribution shifts in online compression. Persistent memory enables each data stream to retain its own patterns, while the gating mechanism selectively amplifies useful features and suppresses noise.
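
A toy version of the two HGR stages may help make the shapes concrete. It is a sketch under stated assumptions: \(\bm{H}_a\) is not defined in this summary, so it is assumed here to be the BMM output \(\bm{H}_{\text{mix}}[b]\,\bm{W}_U[b]\); the fine stage is assumed to be a GeGLU followed by a linear projection to logits; and every name and size below is hypothetical.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

def vecmat(x, W):
    # Row vector x (length n) times matrix W (n x m).
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

def coarse_stage(h_mix, W_u_b, W_c, lam_c):
    h_a = vecmat(h_mix, W_u_b)                       # assumed: H_a is the BMM output
    gate = [sigmoid(v) for v in vecmat(h_mix, W_c)]  # content-aware self-gate
    # H_coarse = H_a * sigmoid(H_mix W_c) + lambda_c * H_mix
    return [a * g + lam_c * x for a, g, x in zip(h_a, gate, h_mix)]

def fine_stage(h, W_gate, W_value, W_out):
    glu = [gelu(g) * v for g, v in zip(vecmat(h, W_gate), vecmat(h, W_value))]
    return vecmat(glu, W_out)

def hgr_forward(H_mix, W_U, W_c, lam_c, W_gate, W_value, W_out):
    # Batch index b selects the persistent matrix W_U[b] owned by that stream.
    return [fine_stage(coarse_stage(h, W_U[b], W_c, lam_c), W_gate, W_value, W_out)
            for b, h in enumerate(H_mix)]

rng = random.Random(0)

def rand_matrix(rows, cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

B, d_h, vocab = 3, 4, 5  # hypothetical toy sizes
W_U = [rand_matrix(d_h, d_h) for _ in range(B)]  # one matrix per stream
W_c, W_gate, W_value = rand_matrix(d_h, d_h), rand_matrix(d_h, d_h), rand_matrix(d_h, d_h)
W_out = rand_matrix(d_h, vocab)

same_input = [0.3, -0.2, 0.5, 0.1]
logits = hgr_forward([same_input] * B, W_U, W_c, 0.5, W_gate, W_value, W_out)
```

Feeding the same feature vector to all three streams still produces three different logit rows, because each stream multiplies by its own slice of `W_U`. That is the stream-specific behavior the persistent memory is meant to provide.
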
Concurrent Stream Parallel Pipeline (CSPP):
- Function: Overcomes autoregressive serial constraints to achieve fully pipelined parallelism for both compression and decompression.
- Mechanism: Parallelism along two dimensions: (a) Temporal parallelism: asynchronous ping-pong buffering decouples GPU and CPU producer-consumer threads; zero-copy pointer swapping eliminates memory contention. (b) Data parallelism: the input stream is partitioned into \(N\) independent sub-streams, each maintaining internal causality; \(N\) workers execute concurrently via a dual-barrier protocol, reducing complexity from \(O(B)\) to \(O(B/N)\). Both parallelism types are fused during compression; only data parallelism is applied during decompression due to causal constraints.
- Design Motivation: Existing methods can exploit temporal parallelism during compression, but decompression reverts to serial execution due to autoregressive causality. CSPP bypasses global causal dependencies via sub-stream partitioning, enabling decompression speed to match compression speed.
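
The data-parallel half of CSPP can be illustrated with a small round-trip example. This is only a sketch of the partitioning idea: `zlib` stands in for the learned probability model plus arithmetic coder, contiguous splitting is an assumption (the summary does not say how sub-streams are formed), and the function names are hypothetical. The dual-barrier protocol and ping-pong buffering are not modeled.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

def split_substreams(data: bytes, n: int):
    """Partition data into n contiguous sub-streams of near-equal length."""
    size = -(-len(data) // n)  # ceiling division
    return [data[i * size:(i + 1) * size] for i in range(n)]

def compress_parallel(data: bytes, n: int):
    # Each worker codes its own sub-streams with no shared state.
    with ThreadPoolExecutor(max_workers=n) as pool:
        return list(pool.map(zlib.compress, split_substreams(data, n)))

def decompress_parallel(blobs):
    # Sub-streams are independent, so decoding can also proceed concurrently.
    with ThreadPoolExecutor(max_workers=len(blobs)) as pool:
        return b"".join(pool.map(zlib.decompress, blobs))

payload = bytes(range(256)) * 40
blobs = compress_parallel(payload, 4)
restored = decompress_parallel(blobs)
```

Because `pool.map` returns results in input order, concatenating the decoded pieces restores the original byte order. The cost of this design is that no sub-stream can use context from another, which is the cross-stream dependency loss noted under Limitations.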

Loss & Training

Cross-entropy loss is used to optimize probabilistic prediction accuracy. The persistent memory in HGR adapts to stream-specific patterns via online backpropagation.
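
The predict-then-update loop behind online training can be sketched with a deliberately tiny model. The example below assumes a context-free softmax over a small vocabulary, which is a stand-in and not FADE's network; it uses the standard cross-entropy gradient with respect to the logits, \(p_j - \mathbb{1}[j = s]\). The function names and learning rate are hypothetical.

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def online_code_length(symbols, vocab_size, lr=0.5):
    """Predict each symbol, pay -log2 p, then take one SGD step on cross-entropy."""
    logits = [0.0] * vocab_size  # context-free model: a deliberately tiny stand-in
    bits = []
    for s in symbols:
        p = softmax(logits)
        bits.append(-math.log2(p[s]))  # coding cost of this symbol
        # Gradient of cross-entropy w.r.t. logit j is p_j - 1[j == s].
        for j in range(vocab_size):
            logits[j] -= lr * (p[j] - (1.0 if j == s else 0.0))
    return bits

costs = online_code_length([0] * 50, vocab_size=4)
```

The first symbol costs exactly 2 bits (a uniform guess over 4 symbols), and the cost falls as the model adapts. Since each update uses only symbols that have already been coded, a decoder can apply the identical updates and stay synchronized with the encoder.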

Key Experimental Results

Main Results

Method	Avg. Compression Ratio↑	Throughput↑	Latency↓	GPU Memory↓
Traditional (Gzip/zstd)	Low	High	Low	—
PAC	Medium-high	Medium	Medium	Medium
SEP	High	Medium-high	Medium	Medium-high
EDPC	High	High	Medium-low	Medium-low
FADE	Highest	Highest	Lowest	Lowest

Ablation Study

Configuration	Compression Ratio	Throughput	Note
Full FADE	Best	Best	Complete model
w/o Local Stream	Degraded	Slightly higher	Loses micro-syntactic capture
w/o HGR	Degraded	Slightly higher	Loses instance adaptability
w/o CSPP	Unchanged	Significantly lower	Importance of system parallelism

Key Findings

- FADE achieves state-of-the-art performance in both compression ratio and throughput simultaneously, breaking the previously observed trade-off between the two.
- Dual-stream decoupling replaces deep serial processing with shallow parallel streams, significantly reducing latency while improving representational capacity.
- Persistent memory enables HGR to achieve stream-specific adaptation during online compression, yielding more accurate modeling than globally shared parameters.
- The data parallelism strategy in CSPP brings decompression speed close to compression speed, resolving a longstanding asymmetry.
- Superior performance is demonstrated across heterogeneous data types including text, audio, images, video, floating-point, and genomic data.

Highlights & Insights

- A complete chain from information-theoretic analysis to architecture design: The dual dependency patterns are first verified via mutual information decay and self-similarity matrices, then used directly to motivate the decoupled architecture. This "analysis-driven design" is more principled than intuition-driven approaches.
- Shallow parallel streams replacing deep serial stacks: Latency is reduced without sacrificing representational capacity; the core insight is that "separation + specialization" outperforms "unification + stacking."
- Innovative use of persistent memory: Each batch index in the BMM corresponds to a learnable weight matrix that continuously evolves via backpropagation during online compression, effectively "memorizing the unique patterns of each data stream."

Limitations & Future Work

- Data parallelism requires partitioning the input into independent sub-streams, causing cross-stream dependencies to be ignored.
- The size of persistent memory scales linearly with batch size, which may incur significant memory overhead in large-scale parallel settings.
- Compression ratio still lags behind LLM-based compression methods (e.g., LLMZip), though FADE offers substantial efficiency advantages.
- The weight allocation strategy of the adaptive router is relatively simple; more sophisticated MoE-style routing could be explored.
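
The linear growth of the persistent memory follows from its shape, \(\bm{W}_U \in \mathbb{R}^{B \times d_h \times d_h}\). The short calculation below makes the trend explicit. The batch sizes, \(d_h = 256\), and float32 storage are hypothetical values chosen for illustration; the paper's actual configuration is not stated in this summary.

```python
def persistent_memory_bytes(batch_size: int, d_h: int, bytes_per_param: int = 4) -> int:
    """Storage for W_U with shape (B, d_h, d_h); float32 by default."""
    return batch_size * d_h * d_h * bytes_per_param

# Hypothetical sizes, chosen only to show the linear trend in B.
for b in (128, 512, 2048):
    print(b, persistent_memory_bytes(b, d_h=256) / 2**20, "MiB")
# prints: 128 32.0 MiB / 512 128.0 MiB / 2048 512.0 MiB (one per line)
```

The cost is linear in \(B\) but quadratic in \(d_h\), so the hidden width matters more than the stream count when budgeting memory.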

Comparison with Related Work

- vs PAC/OREO: Lightweight MLP-based methods that accelerate via masking and caching. FADE further improves efficiency and representational power through dual-stream decoupling.
- vs SEP: SEP introduces semantic enhancement modules and multi-stream pipelines. FADE's CSPP achieves more complete parallelization.
- vs EDPC: EDPC proposes a dual-path framework with a latent transformation engine. FADE's DMD more explicitly aligns decoupling with micro/macro-level patterns.

Rating

- Novelty: ⭐⭐⭐⭐ The dual-stream decoupling design is grounded in clear theoretical motivation and validated experimentally.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Seven datasets covering text, audio, images, video, floating-point, genomic, and heterogeneous data.
- Writing Quality: ⭐⭐⭐⭐ Well-structured, with a coherent progression from analysis to design to system-level implementation.
- Value: ⭐⭐⭐⭐ Addresses bottlenecks at both the model and system levels simultaneously, offering high practical engineering utility.