AdmTree: Compressing Lengthy Context with Adaptive Semantic Trees¶
Conference: NeurIPS 2025 arXiv: 2512.04550 Code: None (no link provided in the paper) Area: LLM Efficiency / Model Compression Keywords: Context Compression, Semantic Tree, Gist Token, Hierarchical Compression, Long Context
TL;DR¶
This paper proposes AdmTree — an adaptive hierarchical context compression framework that constructs leaf gist tokens via information-density-driven dynamic segmentation, then aggregates them bottom-up into a binary semantic tree to achieve multi-granularity semantic preservation. It addresses two fundamental challenges: local detail loss in explicit methods and positional bias in implicit methods, outperforming the SOTA baseline Activation Beacon by over 10% on LongBench.
Background & Motivation¶
Background: Long-context processing faces computational bottlenecks due to the quadratic complexity of self-attention. Context compression approaches fall into two categories: explicit methods (directly removing unimportant text) and implicit methods (encoding context into compact vectors/gist tokens).
Limitations of Prior Work: (a) Explicit methods (e.g., LongLLMLingua) preserve global semantics but disrupt local coherence — finer summarization granularity leads to worse performance; (b) Implicit methods (e.g., AutoCompressor, ICAE) suffer from positional bias — middle and early information is forgotten ("lost in the middle"); (c) Recurrent compression (e.g., Activation Beacon) uses fixed segmentation without accounting for information density variation, and unidirectional compression leads to progressive information degradation.
Key Challenge: How to simultaneously preserve global and local semantics uniformly across all positions?
Key Insight: Drawing inspiration from hierarchical information processing in cognitive science — using tree structures to balance breadth (cross-context coverage) and depth (fine granularity).
Core Idea: Information-density-adaptive segmentation + binary semantic tree hierarchical aggregation + bidirectional attention to eliminate positional bias.
Method¶
Overall Architecture¶
Long text \(\mathbf{X}\) → adaptive segmentation by information density → insertion of one gist token (leaf node) after each segment → leaf gist tokens aggregated pairwise bottom-up to construct binary semantic tree \(\mathcal{T}\) → tree-based compression encoding → conditional generation. The LLM backbone is frozen; only gist attention heads, gist embeddings, and the aggregator are trained.
Key Designs¶
- Adaptive Leaf Gist Token Construction:
- Function: Dynamically allocates gist token budget based on information density
- Mechanism: The context is first uniformly split into initial segments; an information content score \(\text{Score}(\mathbf{X}_i) = \text{PPL} \cdot \exp(-\lambda \cdot \text{Entropy})\) is computed for each segment; high-scoring segments receive more gist tokens (finer granularity), while low-scoring segments receive fewer
- Allocation rule: The top 25% of segments receive \(n/\tau\) gist tokens, the middle 25% receive \(n/(2\tau)\), and the bottom 50% receive \(n/(4\tau)\) (the global compression ratio is unchanged)
- Design Motivation: Information-dense regions require more "storage capacity," while sparse regions can be compressed more aggressively
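The scoring and tiered allocation above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `lam`, `n`, and `tau` are placeholder hyperparameter values, and ties are broken by segment order.

```python
import math

def info_score(ppl: float, entropy: float, lam: float = 0.5) -> float:
    """Information-content score from the paper: PPL * exp(-lambda * entropy).
    lam=0.5 is an illustrative default; lambda is a tunable hyperparameter."""
    return ppl * math.exp(-lam * entropy)

def allocate_gist_budget(scores, n=16, tau=2):
    """Tiered allocation: top 25% of segments get n/tau gist tokens,
    the next 25% get n/(2*tau), and the bottom 50% get n/(4*tau)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    m = len(scores)
    budget = [0] * m
    for rank, idx in enumerate(order):
        if rank < m // 4:
            budget[idx] = n // tau          # high-density: finest granularity
        elif rank < m // 2:
            budget[idx] = n // (2 * tau)    # medium density
        else:
            budget[idx] = n // (4 * tau)    # sparse: compress aggressively
    return budget

# Segment 1 has high PPL and low entropy, so it gets the largest budget.
scores = [info_score(p, e) for p, e in [(12.0, 1.0), (30.0, 0.5), (8.0, 2.0), (5.0, 2.5)]]
print(allocate_gist_budget(scores))  # -> [4, 8, 2, 2]
```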
- Semantic Binary Tree Construction:
- Function: Aggregates leaf gist tokens bottom-up into a hierarchical semantic representation
- Mechanism: Every two leaf nodes generate a parent node via an aggregation function \(h_v = \text{Agg}(\{h_u \mid u \in C_v\})\); the aggregation function consists of a single self-attention layer followed by average pooling
- Incremental construction: When processing a new sub-segment, only the existing tree \(\mathcal{T}_{<k}\) is reused and the new leaf node is inserted
- Design Motivation: (a) The tree structure allows information from later nodes to bidirectionally influence earlier nodes, eliminating unidirectional degradation; (b) Multiple levels naturally provide global-to-local semantic abstraction
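The pairwise bottom-up construction can be sketched as below. The real aggregator is a single self-attention layer followed by average pooling; a plain vector mean stands in here so the sketch stays dependency-free.

```python
def aggregate(children):
    """Placeholder for Agg(.): the paper uses one self-attention layer
    plus average pooling; a plain mean over child vectors stands in here."""
    dim = len(children[0])
    return [sum(v[d] for v in children) / len(children) for d in range(dim)]

def build_tree(leaves):
    """Pairwise bottom-up construction: each pair of sibling nodes produces
    one parent until a single root remains. Returns all levels, leaves
    first, matching the bottom-to-top flattening order used at encoding."""
    levels = [list(leaves)]
    while len(levels[-1]) > 1:
        cur = levels[-1]
        # An odd trailing node is passed up as a one-child "pair".
        parents = [aggregate(cur[i:i + 2]) for i in range(0, len(cur), 2)]
        levels.append(parents)
    return levels

leaves = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0], [7.0, 6.0]]
levels = build_tree(leaves)
# levels[1] holds two parents, levels[2] the root.
```

Incremental construction corresponds to appending a new leaf to `levels[0]` and recomputing only the ancestors along its path, rather than rebuilding the whole tree.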
- Tree-Based Compression Encoding:
- Function: Leverages the semantic tree as context when encoding the current sub-segment
- Mechanism: A dual-branch attention scheme is used — text tokens use the original LLM attention heads, while gist tokens use newly trained attention heads (\(W_q^{gt}, W_k^{gt}, W_v^{gt}\)); the outputs are concatenated for joint self-attention
- Tree nodes are flattened into the sequence in left-to-right, bottom-to-top order
- Attention scope: Gist tokens can attend to all preceding tree nodes and text tokens in the current segment
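The attention scope above can be made concrete with a toy visibility mask. This is an assumption-laden sketch: the paper states that gist tokens attend to all preceding tree nodes and the current segment's text tokens; the causal treatment of tree-node rows and text rows here is an illustrative guess, and head-level details (dual-branch \(W^{gt}\) projections) are omitted.

```python
def attention_scope(n_tree, n_text, n_gist):
    """Boolean visibility mask for one encoding step.
    Sequence layout: [flattened tree nodes | segment text | segment gists],
    with tree nodes in left-to-right, bottom-to-top order.
    mask[i][j] == True means token i may attend to token j."""
    total = n_tree + n_text + n_gist
    mask = [[False] * total for _ in range(total)]
    for i in range(total):
        if i < n_tree:
            # Tree-node rows: causal over earlier tree nodes (assumption).
            for j in range(i + 1):
                mask[i][j] = True
        elif i < n_tree + n_text:
            # Text rows: full view of the tree, causal within the text.
            for j in range(i + 1):
                mask[i][j] = True
        else:
            # Gist rows: all tree nodes plus all segment text (per the paper).
            for j in range(n_tree + n_text):
                mask[i][j] = True
    return mask

m = attention_scope(2, 3, 1)  # 2 tree nodes, 3 text tokens, 1 gist token
```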
Loss & Training¶
- The LLM backbone (e.g., LLaMA-2-7B-Chat) is fully frozen
- Only the following are trained: gist attention heads \(\theta_{gt\_attn}\), gist embeddings \(\theta_{gt\_emb}\), and the aggregator \(\theta_{agg}\)
- Loss: Standard next-token prediction conditioned on the tree and local context
- Compression ratios: 4K–8K ×2, 8K–16K ×4, 16K–32K ×8
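The length-dependent compression schedule can be written as a small lookup. The boundary handling at exactly 8K and 16K tokens is an assumption; the paper only reports the three ranges.

```python
def compression_ratio(context_len_tokens: int) -> int:
    """Compression schedule reported in the paper:
    4K-8K tokens -> x2, 8K-16K -> x4, 16K-32K -> x8.
    Inclusive upper boundaries are an assumption."""
    if context_len_tokens <= 8 * 1024:
        return 2
    if context_len_tokens <= 16 * 1024:
        return 4
    return 8
```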
Key Experimental Results¶
Main Results (LongBench, LLaMA-2-7B)¶
| Method | SingleDoc | MultiDoc | Summ. | FewShot | Code | AVG |
|---|---|---|---|---|---|---|
| Original LLM | 24.7 | 22.4 | 24.6 | 63.2 | 57.7 | 37.2 |
| AutoCompressor | Poor | Poor | Poor | Poor | Poor | Low |
| ICAE | Poor | Poor | Poor | Poor | Poor | Low |
| Activation Beacon | Good | Good | Good | Good | Good | ~42 |
| AdmTree | Best | Best | Best | Best | Best | ~52+ |

(Qualitative entries summarize the paper's results table; exact per-task baseline scores are not reproduced here.)
Key Comparisons¶
| Dimension | AdmTree Advantage | Notes |
|---|---|---|
| vs Activation Beacon | +10% avg | On LLaMA-2-7B |
| vs Activation Beacon | +5% avg | On Qwen-2-7B |
| Max gain on QA tasks | +20 points | On multi-document QA |
| Latency | Comparable to recurrent methods | Aggregator overhead is negligible |
Key Findings¶
- Largest gains on QA tasks: QA requires preserving both global positional information and local detail simultaneously; AdmTree's tree structure is well suited to this requirement
- Adaptive vs. uniform allocation: Adaptive gist allocation provides richer detail preservation in high-information-density regions
- Bidirectional aggregation eliminates positional bias: Self-attention aggregation allows later information to influence earlier nodes, resolving the "lost in the middle" problem
- Interpretability: Attention visualization over the tree structure reveals how information flows across different levels of granularity
Highlights & Insights¶
- Cognitively inspired hierarchical design: Mimics human hierarchical information processing — retaining details first, then progressively abstracting. The tree structure naturally provides multi-granularity semantics
- Information-density-driven dynamic segmentation: \(\text{PPL} \times \exp(-\lambda \cdot \text{Entropy})\) serves as a simple yet effective information content metric; high-information segments automatically receive more gist tokens
- Fully frozen LLM backbone: Highly parameter-efficient — only a small number of gist-related parameters are trained
- Incremental construction: The semantic tree can be dynamically updated as new context arrives, making it suitable for streaming long-text processing
Limitations & Future Work¶
- Limited tree depth: For \(M\) leaf gist tokens, the binary tree has depth \(\log_2 M\); extremely long contexts may therefore require very deep trees
- Simple segmentation scoring function: The PPL+Entropy score may not capture all relevant semantic dimensions
- Validated on only two LLMs: LLaMA-2-7B and Qwen-2-7B; generalization to larger models and different architectures remains unknown
- Future directions: (1) Non-binary tree structures with semantically adaptive branching factors; (2) More sophisticated information density metrics; (3) Integration with KV cache compression methods
Related Work & Insights¶
- vs Activation Beacon: Beacon applies fixed-size recurrent compression (a linear structure); AdmTree employs an adaptive tree (a hierarchical structure), yielding consistently better semantic preservation
- vs AutoCompressor/ICAE: These methods use a fixed number of gist tokens and cannot adapt to varying context lengths and complexities; AdmTree's adaptive allocation addresses this limitation
- vs SnapKV: SnapKV is a KV cache eviction method orthogonal to context compression; AdmTree performs compression at the input level
- vs LongLLMLingua: Explicit methods delete text; AdmTree is an implicit method that compresses context into vectors, achieving more complete information preservation
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ The semantic tree compression framework is novel; the combination of information-density adaptation and hierarchical aggregation demonstrates strong originality
- Experimental Thoroughness: ⭐⭐⭐⭐ Comprehensive evaluation on LongBench across two LLMs and multiple task types
- Writing Quality: ⭐⭐⭐⭐ Preliminary experiments (Figure 1) provide well-motivated analysis; method description is systematic and complete
- Value: ⭐⭐⭐⭐⭐ Substantially surpasses SOTA (+10%), addresses core challenges in long-context compression, and demonstrates strong practical utility