Uncertainty-guided Compositional Alignment with Part-to-Whole Semantic Representativeness in Hyperbolic Vision-Language Models¶
Conference: CVPR2026
arXiv: 2603.22042
Code: github.com/jeeit17/UNCHA
Area: Multimodal VLM
Keywords: Hyperbolic VLM, Uncertainty Modeling, Part-to-Whole Alignment, Compositional Understanding, Entailment Loss
TL;DR¶
This paper proposes UNCHA, a framework that models the semantic representativeness of image parts with respect to the whole scene via hyperbolic uncertainty in hyperbolic VLMs. By incorporating uncertainty-guided contrastive loss and entailment loss, UNCHA enhances compositional scene understanding and outperforms existing hyperbolic VLMs across multiple downstream tasks.
Background & Motivation¶
Limitations of Prior Work¶
Background: VLMs such as CLIP struggle to capture hierarchical relationships (e.g., part-whole, parent-child structures) in Euclidean space, and exhibit bias in multi-object compositional scenes.
Hyperbolic VLMs (e.g., MERU, ATMG, HyCoCLIP) better preserve hierarchical structures via the negative curvature and exponential volume growth of hyperbolic space. However, existing methods do not model the varying semantic representativeness of different parts with respect to the whole — a crop containing the core object of a scene is more representative of the whole than a background crop.
When all parts are treated equally, the model cannot distinguish more representative parts from less representative ones.
Method¶
Overall Architecture¶
UNCHA extends HyCoCLIP with uncertainty modeling: (1) define hyperbolic uncertainty to reflect semantic representativeness → (2) incorporate it into the contrastive loss → (3) calibrate uncertainty via an entailment loss.
Key Designs¶
- Hyperbolic Uncertainty Model:
  - \(u(x) = \log(1 + \exp(-\|x\|_2))\)
  - Exploits the monotonic relationship between hyperbolic radius (geodesic distance from the origin) and uncertainty
  - Near the origin = more abstract = higher uncertainty; far from the origin = more concrete = lower uncertainty
  - Parts more representative of the whole → lower uncertainty
- Uncertainty-Guided Contrastive Loss:
  - Adaptive temperature: \(\tau_{un,i}^I = \exp(u(i_i^{part})/2) \cdot \tau_{gl}\)
  - High-uncertainty parts → larger temperature → smaller contribution to the contrastive loss
  - Also incorporates a local contrastive loss that aligns part images with part texts
- Uncertainty Calibration via Entailment Loss:
  - Piecewise continuous entailment loss: \(L_{ent}^* = \max(0, \phi - \eta\omega) + \alpha\phi\) (a Leaky-ReLU-style relaxation)
  - Uncertainty calibration: \(L_{ent}^{cal} = \lfloor L_{ent}^* \rfloor e^{-u(p)} + u(p) + \mathcal{H}(\tilde{u}(p))\)
  - Encourages higher uncertainty when the entailment relationship is weak, while the \(u(p)\) term keeps the uncertainty from growing without bound
  - The entropy regularizer \(\mathcal{H}\) prevents the uncertainty distribution from degenerating to uniform (i.e., all parts receiving the same uncertainty)
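The uncertainty function and adaptive temperature above can be sketched in a few lines of numpy. This is a minimal illustration of the formulas, not the authors' implementation; the variable names and the value of `tau_global` (standing in for \(\tau_{gl}\)) are assumptions.

```python
import numpy as np

def hyperbolic_uncertainty(x):
    """u(x) = log(1 + exp(-||x||_2)): softplus of the negative embedding
    norm. Points near the origin (abstract/ambiguous) get high
    uncertainty; points far from the origin (concrete) get low uncertainty."""
    return np.log1p(np.exp(-np.linalg.norm(x, axis=-1)))

def adaptive_temperature(u_part, tau_global=0.07):
    """tau = exp(u/2) * tau_gl: high-uncertainty parts receive a larger
    temperature, so they contribute less sharply to the contrastive loss."""
    return np.exp(u_part / 2.0) * tau_global

# Example: an ambiguous near-origin crop vs. a concrete far-from-origin one
ambiguous = np.array([0.1, 0.0, 0.1])
concrete = np.array([3.0, 4.0, 0.0])  # norm 5
u_amb = hyperbolic_uncertainty(ambiguous)
u_con = hyperbolic_uncertainty(concrete)
assert u_amb > u_con                                       # closer to origin -> higher uncertainty
assert adaptive_temperature(u_amb) > adaptive_temperature(u_con)
```

The monotonic link between radius and uncertainty is what lets the temperature scaling down-weight background crops without any extra supervision.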
Loss & Training¶
Total loss: \(L = \mathcal{L}_{con}^{un} + \lambda_{ent}\mathcal{L}_{ent}^{cal}\), where \(\lambda_{ent}\) balances the two terms.
The hyperbolic space is based on the Lorentz model, using exponential and logarithmic maps to transition between the manifold and the tangent space.
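As a concrete illustration of the manifold machinery mentioned above, here is a minimal numpy sketch of the exponential and logarithmic maps at the origin of the Lorentz model with curvature −1. It is a generic textbook formulation under these assumptions, not the paper's code.

```python
import numpy as np

def expmap0(v):
    """Exponential map at the Lorentz origin o = (1, 0, ..., 0).
    `v` is a purely spatial tangent vector in R^d; returns a point
    on the hyperboloid in R^{d+1} satisfying <x, x>_L = -1."""
    n = np.linalg.norm(v)
    if n < 1e-12:
        return np.concatenate(([1.0], np.zeros_like(v)))
    return np.concatenate(([np.cosh(n)], np.sinh(n) * v / n))

def logmap0(x):
    """Logarithmic map at the origin (inverse of expmap0): maps a
    hyperboloid point back to the tangent space."""
    xs = x[1:]
    n = np.linalg.norm(xs)
    if n < 1e-12:
        return np.zeros_like(xs)
    d = np.arccosh(np.clip(x[0], 1.0, None))  # geodesic distance to origin
    return d * xs / n

v = np.array([0.3, -0.2, 0.5])
x = expmap0(v)
# x lies on the hyperboloid: -x_0^2 + ||x_spatial||^2 = -1
assert abs(-x[0] ** 2 + np.sum(x[1:] ** 2) + 1.0) < 1e-9
assert np.allclose(logmap0(x), v)  # round trip recovers the tangent vector
```

Note that the hyperbolic radius used by the uncertainty model is exactly the geodesic distance `arccosh(x[0])` computed inside `logmap0`.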
Key Experimental Results¶
Main Results¶
| Model | ImageNet | CIFAR-10 | CUB | Cars | Pets | Notes |
|---|---|---|---|---|---|---|
| CLIP (ViT-S/16) | 36.7 | 70.2 | 9.8 | 6.9 | 44.6 | Baseline |
| MERU | 35.4 | 71.2 | 11.3 | 5.2 | 42.7 | Hyperbolic baseline |
| HyCoCLIP | Improved | Improved | Improved | Improved | Improved | + Part alignment |
| UNCHA | Best | Best | Best | Best | Best | + Uncertainty modeling |
Ablation Study¶
| Configuration | Key Metric | Notes |
|---|---|---|
| w/o uncertainty guidance | Performance drop | Treating all parts equally is insufficient |
| w/o entropy regularization | Embedding space collapse | Uncertainty tends toward uniformity |
| Uncertainty vs. similarity | r = −0.739 | Strong negative correlation validates the modeling |
Key Findings¶
- A strong negative correlation (r = −0.739) between uncertainty and part-whole similarity validates the effectiveness of the proposed modeling
- Semantically more representative parts exhibit lower uncertainty, while ambiguous or unrepresentative crops exhibit higher uncertainty
- UNCHA outperforms existing hyperbolic VLMs across multiple downstream tasks including zero-shot classification, retrieval, and multi-label classification
Highlights & Insights¶
- Using the hyperbolic radius as a proxy for uncertainty is a natural and elegant design choice
- The entropy regularization that keeps uncertainty from degenerating to a uniform distribution reflects careful attention to the failure modes of uncertainty modeling
- The Leaky-ReLU-style relaxation of the entailment loss addresses the vanishing-gradient problem of the standard hinge formulation, which yields zero gradient once embeddings are pushed inside the entailment cone
- Visualization analysis intuitively demonstrates the correspondence between uncertainty and semantic representativeness
Limitations & Future Work¶
- The computational complexity of hyperbolic space limits scalability to larger models
- Part images are generated via random cropping; more intelligent part segmentation strategies remain unexplored
- Validation is limited to ViT-S/16 and ViT-B/16; larger visual encoders have yet to be evaluated
- The setting of the uncertainty threshold \(\tau_A\) is relatively heuristic
Related Work & Insights¶
- MERU first introduced hyperbolic VLMs but modeled only cross-modal entailment
- HyCoCLIP extended entailment to intra-modal settings but did not differentiate part representativeness
- The idea of using hyperbolic radius as an uncertainty proxy is generalizable to any hyperbolic representation learning scenario
Rating¶
- Novelty: ⭐⭐⭐⭐ Novel combination of hyperbolic uncertainty and semantic representativeness modeling
- Experimental Thoroughness: ⭐⭐⭐⭐ Zero-shot classification across 16 datasets with multi-dimensional evaluation
- Writing Quality: ⭐⭐⭐⭐ Detailed mathematical derivations and clear structure
- Value: ⭐⭐⭐⭐ Advances compositional understanding capabilities of hyperbolic VLMs