Hyper-ICL: Attention Calibration with Hyperbolic Anchor Distillation for Multimodal ICL
Conference: ICML 2026
arXiv: 2605.29103
Code: To be confirmed
Area: Multimodal / Vision-Language Models / In-Context Learning
Keywords: Multimodal ICL, Hyperbolic Embedding, Attention Calibration, Cross-modal Hierarchical Structure
TL;DR
Hyper-ICL injects a structural prior into multimodal in-context learning (ICL) for large vision-language models (LVLMs): it lifts CLIP embeddings into hyperbolic space to form structured "hyperspherical anchors" and uses them for hierarchy-aware attention distillation. In the reported experiments it outperforms the Random, TopK-CLIP, and RICES demo selection strategies on all six tasks, including VQA, captioning, and caption editing.
Background & Motivation
Background: Multimodal ICL requires models to learn from few-shot demonstrations and apply this knowledge to new queries. However, LVLMs face two major challenges when selecting and combining multimodal demos: attention mismatch and structural blind spots.
Limitations of Prior Work: (1) Existing methods select demos based on Euclidean similarity, ignoring the hierarchical structure between images, text, and categories; (2) LVLM attention struggles to focus on the most relevant information within demos, especially when modality information is inconsistent; (3) Traditional high-dimensional Euclidean space fails to capture heterogeneous semantic hierarchies efficiently.
Key Challenge: Multimodal semantics are inherently hierarchical (Image → Local Region → Semantic Concept → Category Label). The number of nodes in a hierarchy grows exponentially with depth, whereas the volume of a Euclidean ball grows only polynomially with its radius, so Euclidean embeddings struggle to represent such hierarchies efficiently without high dimensionality or distortion.
Goal: Inject structural priors into multimodal ICL to guide attention toward hierarchically relevant demos.
Key Insight: In hyperbolic space, volume grows exponentially with radius, so the embedding radius can track node depth: general concepts sit near the origin and specific ones near the boundary. This makes the geometry well suited to hierarchical representation.
Core Idea: Map CLIP multimodal embeddings into hyperbolic space to form "hyperspherical anchors," which serve as distillation targets to guide LVLM attention. This retains the strong semantics of pre-trained CLIP while enhancing hierarchical understanding.
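The growth-rate argument behind the key insight can be written out as a short comparison. The branching factor \(b\) and depth \(\ell\) are illustrative symbols that do not appear in the source; the volume expressions are the standard ones for \(d\)-dimensional Euclidean space and for hyperbolic space of curvature \(-1\), stated up to constant factors.

\[
N(\ell) = b^{\ell}, \qquad
V_{\mathbb{R}^d}(r) \propto r^{d}, \qquad
V_{\mathbb{H}^d}(r) \propto \int_0^{r} \sinh^{d-1}(t)\,dt \;\sim\; e^{(d-1)r} \quad (r \to \infty).
\]

A tree with \(N(\ell)\) nodes at depth \(\ell\) therefore fits in a hyperbolic ball whose radius grows linearly with depth, while a Euclidean ball of fixed dimension would need a radius that grows exponentially with depth to keep the same nodes separated.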
Method
Overall Architecture
The framework has two stages. (1) Offline: construct a hyperbolic anchor bank (CLIP embedding → hyperbolic projection → hierarchical distillation anchors). (2) Online: compute the hyperbolic embedding of the query, select demos using hyperbolic hierarchical attention, and distill the result into the LVLM's attention.
Key Designs
Hyperspherical Anchors + Hyperbolic Projection:
- Function: Map CLIP embeddings into hyperbolic space to create hierarchy-aware anchors.
- Mechanism: Use the exponential map at the origin, \(\exp_o(x) = \tanh(\sqrt{c}\,\|x\|) \frac{x}{\sqrt{c}\,\|x\|}\), to project the CLIP embedding \(x\) onto the Poincaré ball \(\mathbb{B}^d_c\) (curvature parameter \(c > 0\), i.e. sectional curvature \(-c\)). A hyperspherical manifold of semantically related demos is then formed by minimizing a contrastive loss under the hyperbolic metric: \(\mathcal{L}_{\text{anchor}} = -\sum_i \log \frac{\exp(-d_{\mathbb{B}}(x_i, x_i^+) / \tau)}{\sum_j \exp(-d_{\mathbb{B}}(x_i, x_j) / \tau)}\), where \(x_i^+\) is a semantically related (positive) sample for \(x_i\), \(\tau\) is a temperature, \(\oplus\) denotes Möbius addition, and \(d_{\mathbb{B}}(u, v) = \frac{2}{\sqrt{c}} \operatorname{arctanh}(\sqrt{c}\, \|(-u) \oplus v\|)\).
- Design Motivation: CLIP Euclidean embeddings compress heterogeneous semantics into a shared sphere, failing to express hierarchy; hyperbolic space naturally supports hierarchical representation, where root concepts are at the center and leaf nodes are at the boundary.
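The projection, distance, and contrastive loss described above can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the function names, the toy 2-D "embeddings", the clipping constant, and the choice to exclude each anchor from its own denominator are all assumptions.

```python
import numpy as np

def mobius_add(u, v, c):
    """Mobius addition u (+) v on the Poincare ball with curvature parameter c."""
    uv = np.sum(u * v, axis=-1, keepdims=True)
    uu = np.sum(u * u, axis=-1, keepdims=True)
    vv = np.sum(v * v, axis=-1, keepdims=True)
    num = (1 + 2 * c * uv + c * vv) * u + (1 - c * uu) * v
    den = 1 + 2 * c * uv + c**2 * uu * vv
    return num / den

def expmap0(x, c, eps=1e-9):
    """Exponential map at the origin: Euclidean vector -> point inside the ball."""
    norm = np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)
    return np.tanh(np.sqrt(c) * norm) * x / (np.sqrt(c) * norm)

def poincare_dist(u, v, c, eps=1e-7):
    """d(u, v) = 2/sqrt(c) * artanh(sqrt(c) * ||(-u) (+) v||)."""
    diff = np.linalg.norm(mobius_add(-u, v, c), axis=-1)
    arg = np.clip(np.sqrt(c) * diff, 0.0, 1.0 - eps)  # keep artanh finite
    return 2.0 / np.sqrt(c) * np.arctanh(arg)

def anchor_loss(z, pos_idx, c=1.0, tau=0.1):
    """-sum_i log softmax_j(-d(z_i, z_j) / tau)[pos_idx[i]].

    Assumption: the anchor itself (j == i) is excluded from the denominator.
    """
    logits = -poincare_dist(z[:, None, :], z[None, :, :], c) / tau
    np.fill_diagonal(logits, -np.inf)
    m = logits.max(axis=1, keepdims=True)
    log_z = m + np.log(np.exp(logits - m).sum(axis=1, keepdims=True))
    log_p = logits - log_z
    return -log_p[np.arange(len(z)), pos_idx].sum()

# Toy stand-ins for CLIP features: two tight pairs of related items.
x = np.array([[0.1, 0.0], [0.12, 0.0], [-0.5, 0.5], [-0.5, 0.52]])
z = expmap0(x, c=1.0)
loss = anchor_loss(z, pos_idx=np.array([1, 0, 3, 2]))
```

A useful sanity check on the projection is that a vector of norm \(r\) lands at hyperbolic distance \(2r\) from the origin, whatever the value of \(c\).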
Hierarchy-aware Distillation Attention:
- Function: Use hyperbolic anchors to guide LVLM attention distribution, strengthening the model's focus on hierarchically relevant demos.
- Mechanism: Add a calibration term to the LVLM attention logits before the softmax: \(\alpha_{i, j}^* = \text{softmax}_j\left(q_i^\top k_j / \sqrt{d} + \lambda \mathcal{H}(x_i, x_j)\right)\), where \(\lambda\) weights the prior and \(\mathcal{H}(\cdot, \cdot)\) is an inverse function of the hyperbolic hierarchical distance, so hierarchically closer pairs receive a larger bonus. The LVLM learns the hierarchical prior through a distillation loss: \(\mathcal{L}_{\text{distill}} = \text{KL}(\alpha_{\text{teacher}} \| \alpha_{\text{student}})\).
- Design Motivation: Directly modifying LVLM attention weights risks damaging pre-trained knowledge; distillation allows the LVLM to internalize the hierarchical structure naturally.
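The calibration and distillation terms above can be illustrated with a small NumPy sketch. It assumes that the calibrated attention \(\alpha^*\) plays the teacher role and the uncalibrated attention the student role, and it uses negative distances as a stand-in for \(\mathcal{H}\); neither choice is spelled out in the summary, and all function names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def calibrated_attention(Q, K, H, lam):
    """Return (calibrated, plain) attention over keys for each query row.

    H[i, j] is the hierarchical prior for query i and key j
    (assumption: larger means hierarchically closer).
    """
    d = Q.shape[-1]
    logits = Q @ K.T / np.sqrt(d)
    return softmax(logits + lam * H), softmax(logits)

def kl_div(p, q, eps=1e-12):
    """Row-wise KL(p || q)."""
    return np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)

rng = np.random.default_rng(0)
Q = rng.normal(size=(2, 4))   # 2 query tokens
K = rng.normal(size=(3, 4))   # 3 demo tokens
H = np.array([[0.0, -1.0, -2.0],
              [0.0, -1.0, -2.0]])  # e.g. negative hyperbolic distances
teacher, student = calibrated_attention(Q, K, H, lam=1.0)
distill_loss = kl_div(teacher, student).mean()
```

In training, the student distribution would come from the LVLM's own attention and the loss would be backpropagated into the model; the sketch only shows how the two distributions and their divergence are formed.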
Hyperbolic Demo Selection Algorithm:
- Function: Select the set of demos most hierarchically relevant to the query from a candidate pool.
- Mechanism: Project the query into hyperbolic space to obtain \(\hat{x}_q\). Compute each candidate demo's hyperbolic distance to the query and its hierarchical depth, then rank candidates by \(\text{score}(x_i) = -d_{\mathbb{B}}(x_i, \hat{x}_q) + \mu \cdot \text{depth}_{\mathbb{B}}(x_i)\), where \(\mu\) weights the depth term. The top-K candidates become the in-context demos.
- Design Motivation: Cosine-similarity strategies consider only semantic distance; adding hierarchical depth is intended to avoid selecting demos that are too generic or too specific.
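The scoring rule above can be sketched as follows. The summary does not define \(\text{depth}_{\mathbb{B}}\); this sketch assumes it is the hyperbolic distance from the origin, and the function names and toy points are hypothetical. The distance uses the closed-form arccosh expression, which is equivalent to the arctanh/Möbius form.

```python
import numpy as np

def poincare_dist(u, v, c=1.0):
    """Closed-form Poincare-ball distance for curvature parameter c."""
    sq = np.sum((u - v) ** 2, axis=-1)
    den = (1 - c * np.sum(u * u, axis=-1)) * (1 - c * np.sum(v * v, axis=-1))
    return np.arccosh(1 + 2 * c * sq / den) / np.sqrt(c)

def select_demos(cands, query, k, mu=0.1, c=1.0):
    """Rank candidates by score = -d(x_i, query) + mu * depth(x_i); return top-k."""
    dist = poincare_dist(cands, query, c)
    # Assumption: depth is the hyperbolic distance from the origin.
    depth = poincare_dist(cands, np.zeros_like(query), c)
    score = -dist + mu * depth
    order = np.argsort(-score)  # highest score first
    return order[:k], score

# Toy points already inside the unit ball (c = 1).
query = np.array([0.3, 0.0])
cands = np.array([[0.31, 0.0], [-0.3, 0.0], [0.0, 0.6], [0.3, 0.1]])
top_k, scores = select_demos(cands, query, k=2, mu=0.1)
```

With \(\mu = 0\) the rule reduces to nearest-neighbour retrieval under the hyperbolic metric. Note that a positive \(\mu\) in this linear form always favours deeper candidates, so \(\mu\) has to stay small relative to typical distances for the ranking to remain query-driven.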
Key Experimental Results
Main Results
| Task | Model | Random | TopK-CLIP | RICES | Hyper-ICL | Gain over RICES |
|---|---|---|---|---|---|---|
| VQA v2 | IDEFICS-9B | 28.4 | 31.2 | 33.7 | 37.9 | +4.2 |
| OK-VQA | IDEFICS-9B | 19.8 | 22.3 | 24.1 | 28.5 | +4.4 |
| COCO Caption | IDEFICS-9B | 67.5 | 71.8 | 74.2 | 78.6 | +4.4 |
| Caption Editing | IDEFICS-9B | 31.2 | 35.7 | 38.4 | 42.1 | +3.7 |
| Image-Text Match | Otter-9B | 52.3 | 56.8 | 59.4 | 64.7 | +5.3 |
| Visual Reasoning | Otter-9B | 42.8 | 46.3 | 48.9 | 54.2 | +5.3 |
Ablation Study
| Configuration | VQA v2 | COCO Caption | Description |
|---|---|---|---|
| Hyperbolic Anchors Only (No Attention Calib.) | 35.1 | 76.2 | Contribution of anchors |
| Attention Calib. Only (Euclidean Anchors) | 33.8 | 75.5 | Contribution of attention mechanism |
| Attention Calib. + Hyperbolic Anchors | 37.9 | 78.6 | Full Hyper-ICL |
| Hyperbolic Metric → Euclidean Metric | 33.7 | 74.2 | Effect of replacing the hyperbolic metric with the Euclidean one (matches the RICES baseline) |
| Demo Count K=2 / K=4 / K=8 | 35.2 / 37.9 / 38.4 | 76.8 / 78.6 / 79.2 | K=8 scores highest, with diminishing returns beyond K=4 |
Curvature Sensitivity
| Curvature c | VQA v2 ACC | COCO BLEU-4 |
|---|---|---|
| 0.5 | 35.6 | 76.4 |
| 1.0 | 37.9 | 78.6 |
| 2.0 | 36.4 | 77.2 |
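Among the three tested values, \(c = 1.0\) performs best. As \(c \to 0\) the Poincaré ball degenerates toward Euclidean geometry and loses its hierarchical bias, while high curvature causes numerical instability.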
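The low-curvature limit can be checked numerically. The sketch below uses the closed-form Poincaré distance and two arbitrary toy points (both assumptions for illustration) and prints the ratio of hyperbolic to Euclidean distance for decreasing \(c\).

```python
import numpy as np

def poincare_dist(u, v, c):
    """Closed-form Poincare-ball distance for curvature parameter c."""
    sq = np.sum((u - v) ** 2, axis=-1)
    den = (1 - c * np.sum(u * u, axis=-1)) * (1 - c * np.sum(v * v, axis=-1))
    return np.arccosh(1 + 2 * c * sq / den) / np.sqrt(c)

u = np.array([0.1, 0.2])
v = np.array([-0.3, 0.1])
euclid = np.linalg.norm(u - v)

# Ratio of hyperbolic to Euclidean distance as the curvature shrinks.
for c in (1.0, 0.1, 1e-4):
    print(c, poincare_dist(u, v, c) / euclid)
```

As \(c\) shrinks the ratio approaches 2, meaning the hyperbolic metric becomes a rescaled Euclidean metric and ranks neighbours the same way. At the other extreme, the ball has radius \(1/\sqrt{c}\), so raising \(c\) pushes fixed points toward the boundary, where the denominator above approaches zero.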
Key Findings
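- Hyperbolic anchors and attention calibration are synergistic: relative to RICES on VQA v2 / COCO Caption, anchors alone add +1.4 / +2.0 and calibration with Euclidean anchors adds +0.1 / +1.3, while the full method adds +4.2 / +4.4.
- The advantage of the hyperbolic metric is largest on the hierarchy-heavy tasks: visual reasoning and image-text matching each gain +5.3 over RICES, versus +3.7 to +4.4 on the other four tasks.
- Behavior with demo count: Hyper-ICL keeps improving from K=4 to K=8 (+0.5 on VQA v2, +0.6 on COCO Caption), although the gain is smaller than from K=2 to K=4 (+2.7 and +1.8). Traditional methods are reported to flatten more sharply over the same range; their per-K scores are not listed in the tables here.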
Highlights & Insights
- First Application of Hyperbolic Geometry to Multimodal ICL: Breaks through the representational limits of Euclidean space and introduces new geometric tools.
- Elegant Balance in Distillation Mechanism: Uses soft target distillation rather than hard constraint modification, injecting hierarchical priors while preserving LVLM pre-trained knowledge.
- Complete Closed-loop Design: Forms a unified hyperbolic framework from anchor construction and demo selection to attention calibration, with components reinforcing each other.
Limitations & Future Work
- Numerical stability of hyperbolic computation: High curvature or points near the boundary can lead to gradient explosion/vanishing and require careful handling.
- Scale and coverage of the anchor bank: Since it is constructed based on CLIP embeddings, hierarchical mismatch may occur for concepts outside the CLIP training distribution.
- Inference overhead: Additional hyperbolic metric calculations are required for each ICL instance (lightweight but with cumulative overhead).
- Improvements: Explore more stable hyperbolic optimization algorithms; extend to other modalities like video and audio; investigate applicability to larger LVLMs (e.g., LLaVA-1.6, GPT-4V).
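One common way to handle the boundary instability noted in the first limitation is to clip points back inside the ball before computing distances. The sketch below is a generic safeguard, not something the summary attributes to the authors; the function name and the margin `eps` are assumptions.

```python
import numpy as np

def project_to_ball(x, c, eps=1e-5):
    """Rescale points so their norm stays below (1 - eps) / sqrt(c)."""
    max_norm = (1.0 - eps) / np.sqrt(c)
    norm = np.linalg.norm(x, axis=-1, keepdims=True)
    # Points already inside keep scale 1; points outside are shrunk onto max_norm.
    scale = np.minimum(1.0, max_norm / np.maximum(norm, 1e-15))
    return x * scale

pts = np.array([[0.1, 0.2],    # safely inside the unit ball
                [3.0, 4.0]])   # far outside
safe = project_to_ball(pts, c=1.0)
```

The margin trades accuracy for stability: a larger `eps` caps the maximum representable depth, so very fine-grained leaf concepts become harder to separate.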
Related Work & Insights
- vs RICES: RICES selects demos based on CLIP similarity; Hyper-ICL introduces hyperbolic hierarchical structures to improve both selection and attention mechanisms.
- vs Poincaré Embedding: Classic Poincaré embeddings target tree-structured corpora; this work extends them to the dynamic scenario of LVLM in-context learning.
- vs Attention Calibration (in NLP): Prior work focuses only on textual attention; this work expands the concept to multimodal scenarios for the first time.
Rating
- Novelty: ⭐⭐⭐⭐⭐ First attempt to apply hyperbolic geometry to multimodal ICL, showing clear interdisciplinary innovation.
- Experimental Thoroughness: ⭐⭐⭐⭐ Covers 6 tasks and 2 LVLMs with detailed ablation studies.
- Writing Quality: ⭐⭐⭐⭐ Mathematical formulas are clear, though some hyperbolic geometry concepts may require reader background.
- Value: ⭐⭐⭐⭐⭐ Provides a new paradigm in the frontier field of multimodal ICL, with consistent multi-task improvements indicating high potential.