
Hyperbolic Hierarchical Alignment Reasoning Network for Text-3D Retrieval

Conference: AAAI 2026 arXiv: 2511.11045 Code: https://github.com/liwrui/H2ARN Area: Image Generation Keywords: Text-3D Retrieval, Hyperbolic Space, Hierarchical Alignment, Entailment Cones, Contribution-Aware Aggregation

TL;DR

This paper proposes H2ARN, which embeds text and 3D point cloud data in the Lorentz hyperbolic space. It addresses hierarchical representation collapse via a hierarchical ordering loss (entailment cones), and mitigates redundancy-induced saliency dilution via contribution-aware hyperbolic aggregation. The method achieves state-of-the-art performance on Text-3D retrieval and introduces the T3DR-HIT v2 dataset, which is 2.6× larger than its predecessor.

Background & Motivation

With the explosive growth of 3D data, text-3D retrieval has become increasingly important. Existing methods face two fundamental challenges:

Challenge 1: Hierarchical Representation Collapse (HRC)

Both text and 3D data exhibit natural tree-structured hierarchies:

  • Semantic level: from abstract concepts to fine-grained details (e.g., "ceramic vase" → "trophy-shaped vase with handles" → "surface engraved with patterns")
  • Geometric level: from global structure to local parts (entire scene → object → handles, textures)

Key Challenge: In such hierarchies, the number of nodes grows exponentially with depth, whereas the volume of Euclidean space and conventional Riemannian spaces grows at most polynomially. Embedding an exponentially growing tree structure into a polynomially growing space inevitably produces a crowding effect, compressing semantically distinct but structurally similar samples into nearby positions.
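
To make the mismatch concrete, here is a toy computation (our illustration, not from the paper): the node count of a complete 3-ary tree is exponential in depth, while the volume of a Euclidean ball grows only polynomially in its radius for any fixed dimension.

```python
# Toy comparison (ours, not from the paper): node count of a complete b-ary
# tree vs. polynomial volume growth of a Euclidean ball of comparable radius.
def tree_nodes(branching, depth):
    # 1 + b + b^2 + ... + b^depth: exponential in depth
    return sum(branching ** k for k in range(depth + 1))

def euclidean_ball_volume(radius, dim):
    # Up to a constant factor, the ball volume is radius**dim: polynomial in
    # the radius for any fixed dimension.
    return radius ** dim

for depth in (5, 10, 20):
    print(depth, tree_nodes(3, depth), euclidean_ball_volume(depth, 3))
```

Already at depth 20 the tree has billions of nodes while the matching ball volume is only 8,000; hyperbolic volume, growing like \(e^{r\sqrt{c}}\), keeps pace with the tree.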

Challenge 2: Redundancy-Induced Saliency Dilution (RISD)

Real-world 3D data contains substantial redundancy (scanning artifacts, decorative textures), and text descriptions include non-discriminative elements (prepositions, function words). Existing methods commonly use mean pooling to aggregate local features, implicitly assuming equal contributions from all parts. As a result, critical geometric and semantic cues are averaged into redundant noise, weakening the ability to distinguish hard negative samples.

Method

Overall Architecture

H2ARN consists of two main modules:

  1. Structural Context Encoder (Euclidean space): extracts and enriches contextual representations of local features
  2. Hyperbolic Hierarchical Alignment Module (hyperbolic space): performs contribution-aware aggregation and hierarchical alignment in hyperbolic space

Key Designs

1. Lorentz Hyperbolic Space Embedding

The Lorentz model is chosen over the Poincaré ball because it realizes hyperbolic space isometrically as a hyperboloid in \((d+1)\)-dimensional Minkowski space, which avoids the numerical instability of the ball model near its boundary and admits stable, closed-form geodesic operations.

Hyperboloid definition: \[\mathbb{H}_c^d = \{\mathbf{u} \in \mathbb{R}^{d+1} : \langle \mathbf{u}, \mathbf{u} \rangle_{\mathcal{L}} = -\tfrac{1}{c},\ u_{d+1} > 0\}\]

Lorentz inner product: \(\langle \mathbf{u}, \mathbf{v} \rangle_{\mathcal{L}} = \langle \tilde{\mathbf{u}}, \tilde{\mathbf{v}} \rangle_E - u_{d+1}v_{d+1}\)

Lorentz distance (geodesic length): \[d_{\mathbb{H}}(\mathbf{u}, \mathbf{v}) = \frac{1}{\sqrt{c}} \operatorname{arccosh}(-c \langle \mathbf{u}, \mathbf{v} \rangle_{\mathcal{L}})\]

Key advantage of hyperbolic space: Volume grows exponentially with radius (\(\sim e^{r\sqrt{c}}\)), naturally suited for embedding tree-structured hierarchies. Points near the origin represent the most abstract concepts, while points farther from the origin correspond to more concrete instances.
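
The definitions above can be sketched in a few lines of NumPy (a minimal illustration with the time coordinate placed last, matching the paper's notation; function names are ours):

```python
import numpy as np

def lorentz_inner(u, v):
    # <u, v>_L: Euclidean inner product of the spatial parts minus
    # u_{d+1} * v_{d+1} (the paper places the time coordinate last).
    return np.dot(u[:-1], v[:-1]) - u[-1] * v[-1]

def exp_map_origin(z, c=1.0):
    # Exponential map exp_o^c at the origin o = (0, ..., 0, 1/sqrt(c)):
    # lifts a Euclidean feature z in R^d onto the hyperboloid H_c^d.
    sqrt_c, norm = np.sqrt(c), np.linalg.norm(z)
    if norm < 1e-12:
        return np.append(np.zeros_like(z), 1.0 / sqrt_c)
    return np.append(np.sinh(sqrt_c * norm) * z / (sqrt_c * norm),
                     np.cosh(sqrt_c * norm) / sqrt_c)

def lorentz_distance(u, v, c=1.0):
    # d_H(u, v) = arccosh(-c <u, v>_L) / sqrt(c); the clip guards against the
    # argument dipping below 1 through floating-point rounding.
    return np.arccosh(np.clip(-c * lorentz_inner(u, v), 1.0, None)) / np.sqrt(c)

c = 0.5
u = exp_map_origin(np.array([0.3, -0.2]), c)
o = exp_map_origin(np.zeros(2), c)
# u satisfies the hyperboloid constraint, and the geodesic distance from the
# origin to exp_o(z) recovers the Euclidean norm of z.
assert abs(lorentz_inner(u, u) + 1.0 / c) < 1e-9
assert abs(lorentz_distance(o, u, c) - np.linalg.norm([0.3, -0.2])) < 1e-9
```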

2. Contribution-Aware Hyperbolic Aggregation

Core Idea: The Lorentz distance is used to measure each local feature's contribution to the global semantics, rather than applying simple mean pooling.

Procedure:

  1. Compute an initial anchor \(\bar{z} = \frac{1}{L}\sum_{i=1}^{L} z_i\) (Euclidean mean pooling)
  2. Project the anchor and all leaf nodes into hyperbolic space via the exponential map \(\exp_\mathbf{o}^c\)
  3. Compute the Lorentz distance from each leaf node to the anchor
  4. Derive contribution weights \(\omega_i\) via softmax over negative distances
  5. Perform weighted summation in Euclidean space: \(z^\star = \sum_{i=1}^{L} \omega_i z_i\)
  6. Map the result back to hyperbolic space to obtain the final global representation \(\mathbf{h} = \exp_\mathbf{o}^c(z^\star)\)

Key property: Because \(z^\star\) is a convex combination of the \(z_i\), its norm cannot exceed the largest \(\|z_i\|\) and, with redundant directions averaged out, is typically smaller; its hyperbolic image therefore lies closer to the origin, capturing a more abstract and denoised global concept. A learnable modality-specific scaling factor \(\alpha\) is introduced to prevent numerical overflow in the exponential map.
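
The six-step procedure can be sketched as follows (our NumPy illustration; the learnable scaling factor \(\alpha\) and any softmax temperature are omitted for brevity):

```python
import numpy as np

def exp_map_origin(z, c=1.0):
    # Lift a Euclidean feature z onto the Lorentz hyperboloid (time coord last).
    sqrt_c, norm = np.sqrt(c), np.linalg.norm(z)
    if norm < 1e-12:
        return np.append(np.zeros_like(z), 1.0 / sqrt_c)
    return np.append(np.sinh(sqrt_c * norm) * z / (sqrt_c * norm),
                     np.cosh(sqrt_c * norm) / sqrt_c)

def lorentz_distance(u, v, c=1.0):
    inner = np.dot(u[:-1], v[:-1]) - u[-1] * v[-1]
    return np.arccosh(np.clip(-c * inner, 1.0, None)) / np.sqrt(c)

def contribution_aware_aggregate(Z, c=1.0):
    """Steps 1-6 from the text; Z is an (L, d) array of local features."""
    anchor = Z.mean(axis=0)                       # 1. Euclidean mean anchor
    h_anchor = exp_map_origin(anchor, c)          # 2. lift anchor and leaves
    dists = np.array([lorentz_distance(exp_map_origin(z, c), h_anchor, c)
                      for z in Z])                # 3. leaf-to-anchor distances
    w = np.exp(-dists)
    w /= w.sum()                                  # 4. softmax over -distances
    z_star = w @ Z                                # 5. weighted sum in Euclidean space
    return exp_map_origin(z_star, c), w           # 6. map back to hyperbolic space

# A far-away redundant leaf (third row) lies hyperbolically far from the
# anchor and therefore receives a lower weight than the coherent leaves.
Z = np.array([[1.0, 0.0], [1.1, 0.1], [5.0, 0.0]])
h, w = contribution_aware_aggregate(Z)
assert w[2] < w[0]
```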

3. Dual Geometric Loss

Multi-positive contrastive loss \(\mathcal{L}_{cont}\): Similarity is defined based on negative Lorentz distance, using a symmetric InfoNCE loss: \[s(i,j) = -d_{\mathbb{H}}(\mathbf{h}_{t,i}, \mathbf{h}_{p,j}) / \tau\]

\[\mathcal{L}_{cont} = \frac{1}{2}(\mathcal{L}_{t \to p} + \mathcal{L}_{p \to t})\]

Hierarchical ordering loss \(\mathcal{L}_{ord}\): Entailment cones encode the partial order relation "text entails 3D."

A hyperbolic cone is defined for each text embedding \(\mathbf{h}_t\):

  • Cone axis: \(\mathbf{h}_t\)
  • Half-aperture angle, which shrinks as the point moves farther from the origin: \(\phi(\mathbf{h}_t) = \arcsin\left(\frac{2K}{\sqrt{c}\|\tilde{\mathbf{h}}_t\|_E}\right)\)

No penalty is imposed when the paired 3D embedding \(\mathbf{h}_p\) lies inside the cone; otherwise, the penalty is proportional to the angular deviation: \[\mathcal{L}_{ord} = \max(0, \theta(\mathbf{h}_t, \mathbf{h}_p) - \phi(\mathbf{h}_t))\]

This asymmetric geometric constraint forces text embeddings to occupy more generic "ancestor" positions.
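
The cone test can be sketched as follows (our NumPy illustration; the exterior angle \(\theta\) uses the MERU-style closed form common in entailment-cone work, which we assume here since the paper's exact expression is not reproduced in this summary):

```python
import numpy as np

def exp_map_origin(z, c=1.0):
    # Lift a Euclidean vector onto the Lorentz hyperboloid (time coord last).
    sqrt_c, norm = np.sqrt(c), np.linalg.norm(z)
    if norm < 1e-12:
        return np.append(np.zeros_like(z), 1.0 / sqrt_c)
    return np.append(np.sinh(sqrt_c * norm) * z / (sqrt_c * norm),
                     np.cosh(sqrt_c * norm) / sqrt_c)

def half_aperture(h_t, K=0.1, c=1.0):
    # phi(h_t) = arcsin(2K / (sqrt(c) * ||spatial part||)); clip keeps arcsin defined.
    arg = 2.0 * K / (np.sqrt(c) * np.linalg.norm(h_t[:-1]))
    return np.arcsin(np.clip(arg, -1.0, 1.0))

def exterior_angle(h_t, h_p, c=1.0):
    # Angle at h_t between the geodesic from the origin and the geodesic to
    # h_p (MERU-style closed form; an assumption, see lead-in above).
    inner = np.dot(h_t[:-1], h_p[:-1]) - h_t[-1] * h_p[-1]
    num = h_p[-1] + c * inner * h_t[-1]
    den = np.linalg.norm(h_t[:-1]) * np.sqrt(max((c * inner) ** 2 - 1.0, 1e-12))
    return np.arccos(np.clip(num / den, -1.0, 1.0))

def ordering_loss(h_t, h_p, K=0.1, c=1.0):
    # L_ord = max(0, theta(h_t, h_p) - phi(h_t)): zero inside the cone.
    return max(0.0, exterior_angle(h_t, h_p, c) - half_aperture(h_t, K, c))

h_t = exp_map_origin(np.array([0.5, 0.0]))       # text embedding (cone apex)
inside = exp_map_origin(np.array([1.0, 0.0]))    # same direction, farther out
outside = exp_map_origin(np.array([-1.0, 0.0]))  # opposite direction
assert ordering_loss(h_t, inside) < 1e-3         # inside the cone: no penalty
assert ordering_loss(h_t, outside) > 1.0         # outside: large penalty
```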

Total loss: \(\mathcal{L}_{total} = \mathcal{L}_{cont} + \lambda \mathcal{L}_{ord}\)

Loss & Training

  • Text encoder: CLIP; point cloud encoder: DGCNN
  • Shared latent dimension \(d=512\)
  • Curvature \(c\) and scaling factor \(\alpha\) are both learnable (positivity enforced via log-parameterization)
  • AdamW optimizer, learning rate \(2 \times 10^{-3}\), \(\lambda=0.2\), \(\tau=0.07\), \(K=0.1\)
  • 100 epochs, batch size 256
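
The log-parameterization of \(c\) and \(\alpha\) can be sketched as follows (our illustration; identifiers are not from the released code):

```python
import numpy as np

class PositiveScalar:
    # Log-parameterization: optimize the unconstrained rho and expose
    # exp(rho), so the effective value (curvature c or scaling alpha)
    # stays strictly positive under any gradient update.
    def __init__(self, init_value):
        self.rho = np.log(init_value)

    @property
    def value(self):
        return float(np.exp(self.rho))

curvature = PositiveScalar(1.0)
curvature.rho -= 5.0          # even a large unconstrained update...
assert curvature.value > 0.0  # ...cannot push the curvature to zero or below
```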

Key Experimental Results

Main Results

T3DR-HIT v2 Dataset

| Method | Attention Config | Text→PC R@1 | Text→PC R@5 | PC→Text R@1 | Rsum |
|---|---|---|---|---|---|
| RMARN (CLIP+PointNet) | 16 heads, 6 layers | 7.6 | 25.2 | 6.5 | 127.3 |
| RMARN (CLIP+DGCNN) | 32 heads, 8 layers | 13.4 | 38.3 | 18.4 | 220.3 |
| H2ARN (Ours) | 64 heads, 6 layers | 16.4 | 44.5 | 19.6 | 238.5 |

T3DR-HIT Original Dataset

| Method | Text→PC R@1 | Text→PC R@5 | Text→PC R@10 | Rsum |
|---|---|---|---|---|
| RMARN (best config) | 31 | 61 | 69 | 161 |
| H2ARN (Ours) | 32 | 63 | 73 | 168 |

Ablation Study

| Configuration | Text→PC R@1 | PC→Text R@1 | Rsum | Notes |
|---|---|---|---|---|
| Full H2ARN | 16.4 | 19.6 | 238.5 | |
| w/o \(\mathcal{L}_{ord}\) | 15.3 | 18.4 | 229.6 | Ordering loss is critical |
| w/o aggregation | 15.2 | 16.9 | 233.5 | Contribution-aware aggregation is effective |
| w/o both | 14.3 | 14.5 | 222.0 | Severe performance degradation |
| Euclidean + mean pooling | 10.1 | 12.5 | 196.3 | Hyperbolic space advantage is clear |
| Euclidean + contribution aggregation | 12.5 | 14.2 | 215.1 | Aggregation is effective even in Euclidean space |
| Hyperbolic full (H2ARN) | 16.4 | 19.6 | 238.5 | Optimal |

Key Findings

  1. Decisive advantage of hyperbolic space: Rsum improves by 42.2 points from Euclidean mean pooling (196.3) to full H2ARN (238.5)
  2. Importance of \(\mathcal{L}_{ord}\): Removing the hierarchical ordering loss causes an Rsum drop of approximately 9 points
  3. Contribution-aware aggregation is effective even in Euclidean space (196→215), demonstrating its independent value in addressing RISD
  4. 64-head 6-layer attention configuration achieves the best Rsum (238.5) across configurations, though different configurations lead on individual metrics
  5. T3DR-HIT v2 scaling is effective: The 2.6× data expansion increases task difficulty and validates the model's scalability

Highlights & Insights

  • The motivation for applying hyperbolic space to cross-modal retrieval is rigorously argued, grounded in the fundamental contradiction between the exponential growth of tree hierarchies and the polynomial growth of Euclidean space
  • The entailment cone design elegantly encodes the linguistic intuition that "text is more abstract than 3D" as a geometric constraint
  • Contribution-aware aggregation naturally distinguishes salient from redundant features using hyperbolic distances, without requiring additional supervision
  • Learnable curvature and scaling parameters enhance model flexibility

Limitations & Future Work

  • Only RMARN is used as a comparison baseline; comparisons with a broader set of methods are lacking
  • The T3DR-HIT dataset remains relatively small-scale (8,935 pairs) even after expansion
  • Supplementary text for fine-grained artifact data is generated by LLaVA, which may introduce hallucinations
  • The current framework supports only text and point cloud modalities; other 3D representations such as meshes and multi-view images are not addressed
  • Hyperbolic space operations (exponential map, arccosh, etc.) incur higher computational cost than their Euclidean counterparts

Context & Positioning

  • Introducing hyperbolic space into cross-modal retrieval is an emerging trend; this work is the first to apply it to the Text-3D setting
  • RMARN employs Riemannian attention but still operates in Euclidean or low-curvature spaces; H2ARN's constant negative curvature Lorentz model represents a more principled departure
  • The concept of entailment cones originates from hierarchical relation modeling in natural language processing; its application to cross-modal settings constitutes a key novelty
  • This work offers insights for fine-grained and mixed-granularity methods in image-text retrieval: hyperbolic space may provide advantages in handling multi-granularity alignment

Rating

  • Novelty: ⭐⭐⭐⭐ — The combination of hyperbolic space, entailment cones, and contribution-aware aggregation is novel, though each component has prior precedents
  • Experimental Thoroughness: ⭐⭐⭐ — Ablation studies are thorough, but the number of comparison baselines is insufficient
  • Writing Quality: ⭐⭐⭐⭐⭐ — Motivation is rigorously argued, mathematical derivations are complete, and figures are clear
  • Value: ⭐⭐⭐⭐ — Represents a significant advance in Text-3D retrieval; the hyperbolic geometry methodology is generalizable to other cross-modal tasks