AdmTree: Compressing Lengthy Context with Adaptive Semantic Trees¶
Conference: NeurIPS 2025 arXiv: 2512.04550 Code: None (no link provided in the paper) Area: LLM Efficiency / Model Compression Keywords: Context Compression, Semantic Tree, Gist Token, Hierarchical Compression, Long Context
TL;DR¶
This paper proposes AdmTree — an adaptive hierarchical context compression framework that constructs leaf gist tokens via information-density-driven dynamic segmentation, then aggregates them bottom-up into a binary semantic tree to achieve multi-granularity semantic preservation. It addresses two fundamental challenges: local detail loss in explicit methods and positional bias in implicit methods, outperforming the SOTA baseline Activation Beacon by over 10% on LongBench.
Background & Motivation¶
Background: Long-context processing faces computational bottlenecks due to the quadratic complexity of self-attention. Context compression approaches fall into two categories: explicit methods (directly removing unimportant text) and implicit methods (encoding context into compact vectors/gist tokens).
Limitations of Prior Work: (a) Explicit methods (e.g., LongLLMLingua) preserve global semantics but disrupt local coherence — finer summarization granularity leads to worse performance; (b) Implicit methods (e.g., AutoCompressor, ICAE) suffer from positional bias — middle and early information is forgotten ("lost in the middle"); (c) Recurrent compression (e.g., Activation Beacon) uses fixed segmentation without accounting for information density variation, and unidirectional compression leads to progressive information degradation.
Key Challenge: How to simultaneously preserve global and local semantics uniformly across all positions?
Key Insight: Drawing inspiration from hierarchical information processing in cognitive science — using tree structures to balance breadth (cross-context coverage) and depth (fine granularity).
Core Idea: Information-density-adaptive segmentation + binary semantic tree hierarchical aggregation + bidirectional attention to eliminate positional bias.
Method¶
Overall Architecture¶
Long text \(\mathbf{X}\) → adaptive segmentation by information density → insertion of one gist token (leaf node) after each segment → leaf gist tokens aggregated pairwise bottom-up to construct binary semantic tree \(\mathcal{T}\) → tree-based compression encoding → conditional generation. The LLM backbone is frozen; only gist attention heads, gist embeddings, and the aggregator are trained.
Key Designs¶
- Adaptive Leaf Gist Token Construction:
- Function: Dynamically allocates gist token budget based on information density
- Mechanism: The context is first uniformly split into initial segments; an information content score \(\text{Score}(\mathbf{X}_i) = \text{PPL} \cdot \exp(-\lambda \cdot \text{Entropy})\) is computed for each segment; high-scoring segments receive more gist tokens (finer granularity), while low-scoring segments receive fewer
- Allocation rule: The top 25% of segments receive \(n/\tau\) gist tokens, the middle 25% receive \(n/(2\tau)\), and the bottom 50% receive \(n/(4\tau)\) (the global compression ratio is unchanged)
- Design Motivation: Information-dense regions require more "storage capacity," while sparse regions can be compressed more aggressively
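The scoring and tiered allocation above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `lam`, `n`, and `tau` are placeholder hyperparameter values, and ties are broken by segment order.

```python
import math

def info_score(ppl: float, entropy: float, lam: float = 0.5) -> float:
    """Information-content score from the paper: PPL * exp(-lambda * entropy).
    lam=0.5 is an illustrative default; lambda is a tunable hyperparameter."""
    return ppl * math.exp(-lam * entropy)

def allocate_gist_budget(scores, n=16, tau=2):
    """Tiered allocation: top 25% of segments get n/tau gist tokens,
    the next 25% get n/(2*tau), and the bottom 50% get n/(4*tau)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    m = len(scores)
    budget = [0] * m
    for rank, idx in enumerate(order):
        if rank < m // 4:
            budget[idx] = n // tau          # high-density: finest granularity
        elif rank < m // 2:
            budget[idx] = n // (2 * tau)    # medium density
        else:
            budget[idx] = n // (4 * tau)    # sparse: compress aggressively
    return budget

# Segment 1 has high PPL and low entropy, so it gets the largest budget.
scores = [info_score(p, e) for p, e in [(12.0, 1.0), (30.0, 0.5), (8.0, 2.0), (5.0, 2.5)]]
print(allocate_gist_budget(scores))  # -> [4, 8, 2, 2]
```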
- Semantic Binary Tree Construction:
- Function: Aggregates leaf gist tokens bottom-up into a hierarchical semantic representation
- Mechanism: Every two leaf nodes generate a parent node via an aggregation function \(h_v = \text{Agg}(\{h_u \mid u \in C_v\})\); the aggregation function consists of a single self-attention layer followed by average pooling
- Incremental construction: When processing a new sub-segment, only the existing tree \(\mathcal{T}_{<k}\) is reused and the new leaf node is inserted
- Design Motivation: (a) The tree structure allows information from later nodes to bidirectionally influence earlier nodes, eliminating unidirectional degradation; (b) Multiple levels naturally provide global-to-local semantic abstraction
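The pairwise bottom-up construction can be sketched as below. The real aggregator is a single self-attention layer followed by average pooling; a plain vector mean stands in here so the sketch stays dependency-free.

```python
def aggregate(children):
    """Placeholder for Agg(.): the paper uses one self-attention layer
    plus average pooling; a plain mean over child vectors stands in here."""
    dim = len(children[0])
    return [sum(v[d] for v in children) / len(children) for d in range(dim)]

def build_tree(leaves):
    """Pairwise bottom-up construction: each pair of sibling nodes produces
    one parent until a single root remains. Returns all levels, leaves
    first, matching the bottom-to-top flattening order used at encoding."""
    levels = [list(leaves)]
    while len(levels[-1]) > 1:
        cur = levels[-1]
        # An odd trailing node is passed up as a one-child "pair".
        parents = [aggregate(cur[i:i + 2]) for i in range(0, len(cur), 2)]
        levels.append(parents)
    return levels

leaves = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0], [7.0, 6.0]]
levels = build_tree(leaves)
# levels[1] holds two parents, levels[2] the root.
```

Incremental construction corresponds to appending a new leaf to `levels[0]` and recomputing only the ancestors along its path, rather than rebuilding the whole tree.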
- Tree-Based Compression Encoding:
- Function: Leverages the semantic tree as context when encoding the current sub-segment
- Mechanism: A dual-branch attention scheme is used — text tokens use the original LLM attention heads, while gist tokens use newly trained attention heads (\(W_q^{gt}, W_k^{gt}, W_v^{gt}\)); the outputs are concatenated for joint self-attention
- Tree nodes are flattened into the sequence in left-to-right, bottom-to-top order
- Attention scope: Gist tokens can attend to all preceding tree nodes and text tokens in the current segment
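The attention scope above can be made concrete with a toy visibility mask. This is an assumption-laden sketch: the paper states that gist tokens attend to all preceding tree nodes and the current segment's text tokens; the causal treatment of tree-node rows and text rows here is an illustrative guess, and head-level details (dual-branch \(W^{gt}\) projections) are omitted.

```python
def attention_scope(n_tree, n_text, n_gist):
    """Boolean visibility mask for one encoding step.
    Sequence layout: [flattened tree nodes | segment text | segment gists],
    with tree nodes in left-to-right, bottom-to-top order.
    mask[i][j] == True means token i may attend to token j."""
    total = n_tree + n_text + n_gist
    mask = [[False] * total for _ in range(total)]
    for i in range(total):
        if i < n_tree:
            # Tree-node rows: causal over earlier tree nodes (assumption).
            for j in range(i + 1):
                mask[i][j] = True
        elif i < n_tree + n_text:
            # Text rows: full view of the tree, causal within the text.
            for j in range(i + 1):
                mask[i][j] = True
        else:
            # Gist rows: all tree nodes plus all segment text (per the paper).
            for j in range(n_tree + n_text):
                mask[i][j] = True
    return mask

m = attention_scope(2, 3, 1)  # 2 tree nodes, 3 text tokens, 1 gist token
```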
Loss & Training¶
- The LLM backbone (e.g., LLaMA-2-7B-Chat) is fully frozen
- Only the following are trained: gist attention heads \(\theta_{gt\_attn}\), gist embeddings \(\theta_{gt\_emb}\), and the aggregator \(\theta_{agg}\)
- Loss: Standard next-token prediction conditioned on the tree and local context
- Compression ratios: 4K–8K ×2, 8K–16K ×4, 16K–32K ×8
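The length-dependent compression schedule can be written as a small lookup. The boundary handling at exactly 8K and 16K tokens is an assumption; the paper only reports the three ranges.

```python
def compression_ratio(context_len_tokens: int) -> int:
    """Compression schedule reported in the paper:
    4K-8K tokens -> x2, 8K-16K -> x4, 16K-32K -> x8.
    Inclusive upper boundaries are an assumption."""
    if context_len_tokens <= 8 * 1024:
        return 2
    if context_len_tokens <= 16 * 1024:
        return 4
    return 8
```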
Key Experimental Results¶
Main Results (LongBench, LLaMA-2-7B)¶
| Method | SingleDoc | MultiDoc | Summ. | FewShot | Code | AVG |
|---|---|---|---|---|---|---|
| Original LLM | 24.7 | 22.4 | 24.6 | 63.2 | 57.7 | 37.2 |
| AutoCompressor | Poor | Poor | Poor | Poor | Poor | Low |
| ICAE | Poor | Poor | Poor | Poor | Poor | Low |
| Activation Beacon | Good | Good | Good | Good | Good | ~42 |
| AdmTree | Best | Best | Best | Best | Best | ~52+ |

(Qualitative entries summarize the paper's results table; exact per-task baseline scores are not reproduced here.)
Key Comparisons¶
| Dimension | AdmTree Advantage | Notes |
|---|---|---|
| vs Activation Beacon | +10% avg | On LLaMA-2-7B |
| vs Activation Beacon | +5% avg | On Qwen-2-7B |
| Max gain on QA tasks | +20 points | On multi-document QA |
| Latency | Comparable to recurrent methods | Aggregator overhead is negligible |
Key Findings¶
- Largest gains on QA tasks: QA requires preserving both global positional information and local detail simultaneously; AdmTree's tree structure is well suited to this requirement
- Adaptive vs. uniform allocation: Adaptive gist allocation provides richer detail preservation in high-information-density regions
- Bidirectional aggregation eliminates positional bias: Self-attention aggregation allows later information to influence earlier nodes, resolving the "lost in the middle" problem
- Interpretability: Attention visualization over the tree structure reveals how information flows across different levels of granularity
Highlights & Insights¶
- Cognitively inspired hierarchical design: Mimics human hierarchical information processing — retaining details first, then progressively abstracting. The tree structure naturally provides multi-granularity semantics
- Information-density-driven dynamic segmentation: \(\text{PPL} \times \exp(-\lambda \cdot \text{Entropy})\) serves as a simple yet effective information content metric; high-information segments automatically receive more gist tokens
- Fully frozen LLM backbone: Highly parameter-efficient — only a small number of gist-related parameters are trained
- Incremental construction: The semantic tree can be dynamically updated as new context arrives, making it suitable for streaming long-text processing
Limitations & Future Work¶
- Limited tree depth: For \(M\) leaf gist tokens, the binary tree has depth \(\log_2 M\); extremely long contexts may therefore require very deep trees
- Simple segmentation scoring function: The PPL+Entropy score may not capture all relevant semantic dimensions
- Validated on only two LLMs: LLaMA-2-7B and Qwen-2-7B; generalization to larger models and different architectures remains unknown
- Future directions: (1) Non-binary tree structures with semantically adaptive branching factors; (2) More sophisticated information density metrics; (3) Integration with KV cache compression methods
Related Work & Insights¶
- vs Activation Beacon: Beacon applies fixed-size recurrent compression (a linear structure); AdmTree employs an adaptive tree (a hierarchical structure), yielding consistently better semantic preservation
- vs AutoCompressor/ICAE: These methods use a fixed number of gist tokens and cannot adapt to varying context lengths and complexities; AdmTree's adaptive allocation addresses this limitation
- vs SnapKV: SnapKV is a KV cache eviction method orthogonal to context compression; AdmTree performs compression at the input level
- vs LongLLMLingua: Explicit methods delete text; AdmTree is an implicit method that compresses context into vectors, achieving more complete information preservation
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ The semantic tree compression framework is novel; the combination of information-density adaptation and hierarchical aggregation demonstrates strong originality
- Experimental Thoroughness: ⭐⭐⭐⭐ Comprehensive evaluation on LongBench across two LLMs and multiple task types
- Writing Quality: ⭐⭐⭐⭐ Preliminary experiments (Figure 1) provide well-motivated analysis; method description is systematic and complete
- Value: ⭐⭐⭐⭐⭐ Substantially surpasses SOTA (+10%), addresses core challenges in long-context compression, and demonstrates strong practical utility