PanoAffordanceNet: Towards Holistic Affordance Grounding in 360° Indoor Environments¶
Conference: CVPR 2026 arXiv: 2603.09760 Code: https://github.com/GL-ZHU925/PanoAffordanceNet Area: Robotics / Affordance Perception Keywords: Panoramic affordance grounding, 360° indoor perception, distortion-aware modulation, omnidirectional densification, one-shot learning
TL;DR¶
PanoAffordanceNet introduces a novel task of holistic affordance grounding in 360° indoor environments. It employs a Distortion-Aware Spectrum Modulator (DASM) to correct ERP geometric distortions, an Omnidirectional Sphere Densification Head (OSDH) to recover continuous affordance regions from sparse activations, and multi-level training objectives. The method achieves substantial gains over existing approaches on 360-AGD, the first panoramic affordance dataset constructed by the authors.
Background & Motivation¶
Background: Visual affordance research aims to localize interactive regions on objects, serving as a bridge between visual perception and physical manipulation. Existing methods have evolved from fully supervised to weakly supervised approaches (LOCATE/WSMA), and further to foundation-model-driven open-vocabulary methods (OOAL/AffordanceLLM). However, nearly all are validated on object-centric paradigms and limited-field-of-view images.
Limitations of Prior Work: (1) Service robots operate in 360° physical spaces, yet existing methods process only perspective images with restricted fields of view (FOV), creating a mismatch with the 360° action space; (2) Directly applying perspective methods to panoramic images causes severe performance degradation — equirectangular projection (ERP) introduces significant geometric distortion (polar stretching), non-uniform sampling results in sparse and scattered affordance region distributions, and precise alignment of abstract affordance semantics with multi-scale regions is extremely difficult.
Key Challenge: Panoramic images are not merely an extension of the field of view — they fundamentally alter the spatial distribution of features. The triple challenges of latitude-dependent ERP distortion, fragmented affordance region distribution, and semantic drift under weak supervision are deeply intertwined and entirely beyond the reach of existing methods.
Goal: (1) How to preserve local interaction details and global affordance structure under ERP distortion; (2) How to recover continuous and complete affordance regions from sparse, fragmented initial activations; (3) How to precisely align semantics and visual regions under extremely sparse (one-shot) annotations.
Key Insight: The problem is decomposed into three independent channels: spectrum-domain processing for distortion (high-frequency and low-frequency correction separately), spherical topology-domain processing for fragmentation (self-similarity propagation), and contrastive learning-domain processing for semantic drift (region–text alignment).
Core Idea: A three-stage design combining spectral distortion correction, spherical densification, and multi-level constraints enables one-shot holistic affordance grounding in 360° indoor environments.
Method¶
Overall Architecture¶
The end-to-end pipeline consists of four modules: (1) dual-encoder feature extraction — DINOv2 visual encoder (with LoRA adaptation) and CLIP text encoder (with CoOp learnable prompts); (2) DASM Distortion-Aware Spectrum Modulator — dual-band decomposition with latitude-adaptive correction; (3) sphere-aware hierarchical decoder — global semantic discovery with OSDH densification; (4) multi-level training objectives — pixel-level, distribution-level, and region–text contrastive losses. Input: 560×1120 panoramic images with one-shot annotations.
Key Designs¶
- Distortion-Aware Spectrum Modulator (DASM):
- Function: Corrects latitude-dependent geometric distortion and semantic diffusion introduced by ERP projection.
- Mechanism: Cross-modal attention is first applied to inject text guidance into visual features \(\mathbf{F}'_v\), activating semantically relevant regions. Features are then decomposed into two branches: high-frequency (Laplacian operator \(\nabla^2\)) and low-frequency (Gaussian smoothing \(\mathcal{K}_\sigma\)). The High-Frequency Enhancement Module (HFEM) sharpens interaction boundaries in equatorial regions and suppresses polar-magnification artifacts; the Low-Frequency Stabilization Module (LFSM) maintains global structural consistency in polar regions and alleviates semantic fragmentation caused by stretching. The two branches are fused via a hybrid gating mechanism using a language-driven channel gate \(\mathbf{g}_{ch}\) and an adaptive spatial gate \(\mathbf{g}_{sp}\): \(\mathbf{F}_{\text{freq}} = \mathbf{F}'_v + \sum_{k} \lambda_k (\mathbf{g}_{ch} \odot \mathbf{g}_{sp} \odot \mathbf{F}_k)\)
- Design Motivation: ERP preserves sharp edges at the equator but stretches structures at the poles — high-frequency and low-frequency components require correction strategies in opposite directions, motivating independent dual-band processing followed by gated fusion.
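The dual-band split and gated fusion can be sketched in a few lines of PyTorch. This is a minimal illustration of the fusion formula above, not the paper's implementation: HFEM and LFSM are reduced to 1×1 convolutions, and all module names, dimensions, and kernel choices are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DASMSketch(nn.Module):
    """Illustrative dual-band modulator: Laplacian high-pass + Gaussian low-pass,
    fused back into the text-conditioned features F'_v via channel/spatial gates."""
    def __init__(self, dim, text_dim):
        super().__init__()
        # Fixed 3x3 Laplacian (nabla^2) and Gaussian (K_sigma) kernels, per channel.
        lap = torch.tensor([[0., 1., 0.], [1., -4., 1.], [0., 1., 0.]])
        gau = torch.tensor([[1., 2., 1.], [2., 4., 2.], [1., 2., 1.]]) / 16.0
        self.register_buffer("lap", lap.view(1, 1, 3, 3))
        self.register_buffer("gau", gau.view(1, 1, 3, 3))
        self.hfem = nn.Conv2d(dim, dim, 1)        # stand-in for HFEM
        self.lfsm = nn.Conv2d(dim, dim, 1)        # stand-in for LFSM
        self.ch_gate = nn.Linear(text_dim, dim)   # language-driven channel gate g_ch
        self.sp_gate = nn.Conv2d(dim, 1, 1)       # adaptive spatial gate g_sp
        self.lam = nn.Parameter(torch.full((2,), 0.5))  # per-band weights lambda_k

    def forward(self, fv, text):                  # fv: (B,C,H,W), text: (B,D)
        c = fv.shape[1]
        hi = F.conv2d(fv, self.lap.repeat(c, 1, 1, 1), padding=1, groups=c)
        lo = F.conv2d(fv, self.gau.repeat(c, 1, 1, 1), padding=1, groups=c)
        bands = [self.hfem(hi), self.lfsm(lo)]
        g_ch = torch.sigmoid(self.ch_gate(text))[:, :, None, None]  # (B,C,1,1)
        out = fv                                  # residual: F_freq = F'_v + sum_k ...
        for lam_k, fk in zip(self.lam, bands):
            g_sp = torch.sigmoid(self.sp_gate(fk))                  # (B,1,H,W)
            out = out + lam_k * (g_ch * g_sp * fk)
        return out
```

The residual form mirrors the paper's fusion equation: each band is modulated by both gates, scaled by a learnable \(\lambda_k\), and added back to \(\mathbf{F}'_v\).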
- Omnidirectional Sphere Densification Head (OSDH):
- Function: Recovers topologically continuous and complete affordance regions from sparse, fragmented initial activations.
- Mechanism: Visual self-similarity is leveraged as a structural inductive bias. Visual features are projected onto a unit hypersphere to construct a cosine similarity affinity matrix \(\mathcal{S}_{ij}\). High-confidence seed points are selected via top-k ranking, and a Sigmoid confidence map \(\mathcal{C}\) based on mean and standard deviation is applied to suppress noise on the seeds. Seed activations are then propagated via max-diffusion: \(\mathbf{A}_{\text{refined}} = \mathbf{A}_{\text{init}} + \alpha \cdot \max_{j \in \mathcal{K}}(\mathcal{S}_{ij} \cdot \mathcal{C}_j)\)
- Design Motivation: Affordance regions in panoramic images appear fragmented due to non-uniform sampling, yet visual features within the same affordance region exhibit high self-similarity. Seed propagation exploits this inductive bias to recover dense predictions from sparse activations.
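The max-diffusion step can be written directly from the formula above. A minimal sketch, assuming flattened per-location features; the seed count, temperature of the confidence map, and \(\alpha\) are illustrative, not the paper's values.

```python
import torch
import torch.nn.functional as F

def osdh_refine(feat, a_init, k=10, alpha=0.5):
    """Sketch of OSDH seed propagation.
    feat:   (N, C) visual features at N spherical locations (N = H*W)
    a_init: (N,)   sparse initial affordance activation
    """
    # Project onto the unit hypersphere -> cosine affinity matrix S_ij.
    f = F.normalize(feat, dim=-1)
    sim = f @ f.t()                                   # (N, N)
    # Top-k highest-activation locations act as seeds K.
    seed_idx = torch.topk(a_init, k).indices
    # Sigmoid confidence map C from activation statistics suppresses noisy seeds.
    conf = torch.sigmoid((a_init - a_init.mean()) / (a_init.std() + 1e-6))
    # Max-diffusion: each location keeps its strongest (affinity * confidence)
    # link to any seed: A_refined = A_init + alpha * max_j(S_ij * C_j).
    prop = (sim[:, seed_idx] * conf[seed_idx]).max(dim=1).values
    return a_init + alpha * prop
```

Because a seed's self-similarity is 1, seed locations are always reinforced, while non-seed locations grow only in proportion to their feature similarity to some confident seed.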
- Region–Text Contrastive Loss (\(\mathcal{L}_{RTC}\)):
- Function: Establishes precise correspondence between visual regions and affordance semantic concepts, suppressing semantic drift.
- Mechanism: Ground-truth masks are used to pool visual features into region-level representations \(\mathbf{v}_c = \sum_l \hat{M}_{c,l} \mathbf{f}''_{v,l} / \sum_k \hat{M}_{c,k}\), which are then contrastively aligned with corresponding text embeddings via InfoNCE. The loss is jointly optimized with pixel-level BCE and distribution-level KL divergence losses: \(\mathcal{L}_{total} = \lambda_1 \mathcal{L}_{BCE} + \lambda_2 \mathcal{L}_{KL} + \lambda_3 \mathcal{L}_{RTC}\)
- Design Motivation: A single object may support multiple affordances (e.g., "sit" vs. "lean" on a sofa), and pixel-level supervision alone cannot distinguish them. Region–text contrastive learning precisely anchors language supervision to specific visual regions.
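The masked pooling and InfoNCE alignment can be sketched as follows. Function names, the symmetric two-direction form, and the temperature are assumptions for illustration; only the pooling formula \(\mathbf{v}_c = \sum_l \hat{M}_{c,l} \mathbf{f}''_{v,l} / \sum_k \hat{M}_{c,k}\) is taken directly from the text.

```python
import torch
import torch.nn.functional as F

def region_text_contrastive(feat, masks, text_emb, tau=0.07):
    """Sketch of L_RTC.
    feat:     (L, C) pixel-level visual features f''_v
    masks:    (K, L) soft ground-truth masks M_c, one per affordance present
    text_emb: (K, C) matching affordance text embeddings
    """
    # Mask-weighted average pooling -> region representations v_c.
    v = (masks @ feat) / masks.sum(dim=1, keepdim=True).clamp(min=1e-6)
    v = F.normalize(v, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.t() / tau              # (K, K) region-text similarities
    target = torch.arange(len(v))         # diagonal = matched pairs
    # Symmetric InfoNCE over region->text and text->region directions.
    return 0.5 * (F.cross_entropy(logits, target) +
                  F.cross_entropy(logits.t(), target))
```

Each region representation is pulled toward its own affordance text and pushed away from the other affordances present in the scene, which is exactly the mechanism that separates "sit" from "lean" on the same sofa.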
Loss & Training¶
AdamW optimizer with cosine annealing, learning rate 1e-5, trained for 20k iterations on 2× A6000 GPUs with batch size 4. DINOv2 is adapted with LoRA (rank=16); the CLIP text encoder is frozen but augmented with CoOp learnable prompts. Panorama-specific data augmentation includes random rotation ±3°, scaling ±5%, and horizontal wraparound shifts.
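Of the listed augmentations, the horizontal wraparound shift is the one that is exact for ERP: the image is periodic in longitude, so rolling along the width axis is a lossless yaw rotation of the camera. A minimal sketch (function name is illustrative; the rotation and scaling augmentations are omitted since they require interpolation):

```python
import torch

def wraparound_shift(img, shift=None):
    """Horizontal wraparound augmentation for an equirectangular image.
    img: (C, H, W) tensor. Rolling along W rotates the camera in yaw
    without introducing any border artifacts, because ERP columns wrap.
    """
    w = img.shape[-1]
    if shift is None:
        shift = int(torch.randint(0, w, (1,)))  # random yaw in pixels
    return torch.roll(img, shifts=shift, dims=-1)
```

Unlike crops or reflections on perspective images, this augmentation produces another valid panorama of the same scene, which is why it is a standard choice for 360° training pipelines.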
Key Experimental Results¶
Main Results¶
One-shot affordance grounding on the 360-AGD dataset:
| Method | Easy KLD↓ | Easy SIM↑ | Easy NSS↑ | Hard KLD↓ | Hard SIM↑ | Hard NSS↑ |
|---|---|---|---|---|---|---|
| OOAL | 2.868 | 0.117 | 1.267 | 3.067 | 0.097 | 1.484 |
| OS-AGDO | 2.853 | 0.124 | 1.299 | 2.965 | 0.115 | 1.484 |
| PanoAffordanceNet | 1.270 | 0.506 | 4.490 | 1.306 | 0.474 | 4.398 |
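For reference, KLD, SIM, and NSS are the standard saliency-style metrics used throughout affordance grounding. The sketch below gives their common definitions; normalization details may differ from the paper's evaluation code.

```python
import torch

def kld(pred, gt, eps=1e-7):
    """KL divergence of GT from prediction over normalized heatmaps (lower = better)."""
    p = pred / (pred.sum() + eps)
    g = gt / (gt.sum() + eps)
    return (g * torch.log(eps + g / (p + eps))).sum()

def sim(pred, gt, eps=1e-7):
    """Histogram intersection of the two normalized maps (higher = better)."""
    p = pred / (pred.sum() + eps)
    g = gt / (gt.sum() + eps)
    return torch.minimum(p, g).sum()

def nss(pred, fixation_mask, eps=1e-7):
    """Mean z-scored prediction at binary GT locations (higher = better)."""
    z = (pred - pred.mean()) / (pred.std() + eps)
    return z[fixation_mask > 0].mean()
```

These definitions explain the table's scale: SIM is bounded in [0, 1] (a perfect match scores 1), while NSS is unbounded and grows with how far predicted activation at GT locations sits above the map's mean.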
Generalization on the perspective AGD20K dataset:
| Method | Seen KLD↓ | Seen SIM↑ | Unseen KLD↓ | Unseen SIM↑ |
|---|---|---|---|---|
| OOAL | 0.740 | 0.577 | 1.070 | 0.461 |
| Ours | 0.739 | 0.616 | 1.185 | 0.475 |
Ablation Study¶
Component ablation (Hard Split):
| LoRA | DASM | OSDH | KLD↓ | SIM↑ | NSS↑ |
|---|---|---|---|---|---|
| | | | 1.475 | 0.416 | 4.196 |
| ✓ | | | 1.421 | 0.429 | 4.257 |
| ✓ | ✓ | | 1.380 | 0.450 | 4.317 |
| ✓ | | ✓ | 1.359 | 0.448 | 4.339 |
| ✓ | ✓ | ✓ | 1.306 | 0.474 | 4.398 |
Loss function ablation:
| \(\mathcal{L}_{KL}\) | \(\mathcal{L}_{RTC}\) | \(\mathcal{L}_{BCE}\) | KLD↓ | SIM↑ | NSS↑ |
|---|---|---|---|---|---|
| ✓ | | | 1.596 | 0.395 | 3.891 |
| ✓ | ✓ | | 1.430 | 0.450 | 4.041 |
| ✓ | ✓ | ✓ | 1.306 | 0.474 | 4.398 |
Key Findings¶
- PanoAffordanceNet reduces KLD by over 55%, improves SIM by more than 4×, and NSS by more than 3× on 360-AGD, a decisive margin over both baselines.
- The three modules contribute complementarily: DASM primarily reduces KLD (geometric correction), OSDH primarily improves SIM/NSS (region continuity), and LoRA provides foundational adaptation.
- \(\mathcal{L}_{RTC}\) contributes most to semantically sensitive metrics (SIM/NSS), validating the critical role of region–text alignment in distinguishing multiple affordances.
- KLD varies by only 0.006 across top-k values in the range 5–20, demonstrating that OSDH is highly robust to this hyperparameter.
- LoRA rank=16 is optimal; higher ranks (e.g., 32) cause overfitting and degrade DINOv2's pre-trained semantics.
- Competitive performance on the perspective AGD20K dataset confirms that the method does not rely on panorama-specific assumptions.
Highlights & Insights¶
- Forward-looking new task definition: This work is the first to advance affordance research from an object-centric paradigm to a 360° scene-level setting, directly addressing the practical needs of service robots. The 360-AGD dataset fills a critical gap in panoramic affordance research.
- Elegant symmetric design of the dual-frequency channels: The equatorial region requires high-frequency enhancement (sharpening boundaries), while polar regions require low-frequency stabilization (preventing fragmentation) — the two correction directions are precisely opposite. Independent dual-band processing followed by gated fusion is a natural and principled solution.
- OSDH recovers topological structure via visual self-similarity: No additional geometric information (e.g., depth maps) is required; cosine similarity of features alone drives sparse-to-dense propagation. The design is conceptually clean, hyperparameter-insensitive, and transferable to any setting requiring dense prediction recovery from sparse annotations.
Limitations & Future Work¶
- The 360-AGD dataset is relatively small in scale (total sample count not disclosed), and the complexity gap between Easy and Hard splits warrants further validation.
- Only 19 affordance categories are evaluated; real-world indoor affordances are far more diverse and complex.
- The one-shot setting limits coverage of long-tail affordances; few-shot or zero-shot extensions are a natural next step.
- The method processes static images and does not account for temporal affordance changes in dynamic scenes.
- ERP remains an intermediate representation; operating directly on the sphere (e.g., via spherical convolutions) may offer a more principled approach.
Related Work & Insights¶
- vs. OOAL: OOAL is the current one-shot affordance state of the art but is designed entirely for perspective images; its SIM on panoramic data is only 0.117 versus 0.506 achieved by the proposed method.
- vs. WorldAfford: WorldAfford also targets scene-level affordance but relies on SAM object segmentation and LLM reasoning, making it non-end-to-end; the proposed method is end-to-end and requires no segmentation.
- vs. 3D affordance methods: 3D methods provide precise geometric constraints but require costly annotations and lack mature foundation models; panoramic images serve as a middle ground between 2D and 3D, offering 360° spatial coverage while retaining the generalization capability of 2D foundation models.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ — New task definition, new dataset, and purpose-built method design; a pioneering contribution.
- Experimental Thoroughness: ⭐⭐⭐⭐ — Thorough ablations; only two baselines (inherent to the new task), but cross-domain generalization validation is convincing.
- Writing Quality: ⭐⭐⭐⭐ — Problem motivation is clear and method description is detailed, though notation is dense.
- Value: ⭐⭐⭐⭐⭐ — Opens a new research direction in panoramic affordance with direct application value for global perception in service robotics.