Enhancing Rotation-Invariant 3D Learning with Global Pose Awareness and Attention Mechanisms
Conference: AAAI 2026 · arXiv: 2511.08833 · Code: GitHub · Area: 3D Vision · Keywords: rotation invariance, point cloud classification, part segmentation, attention mechanism, global pose awareness
TL;DR
This paper proposes the Shadow-informed Pose Feature (SiPF) and the RIAttnConv operator. By introducing a global "shadow" reference point generated via Bingham distribution learning, the method enhances the global pose awareness of local rotation-invariant features, resolving the "Wing-tip Feature Collapse" problem where symmetric structures (e.g., left and right wings of an airplane) cannot be distinguished. The approach achieves state-of-the-art performance on ModelNet40 classification and ShapeNetPart segmentation.
Background & Motivation
Background: The mainstream paradigm in rotation-invariant (RI) 3D point cloud learning replaces raw coordinates with handcrafted local geometric features (e.g., PPF, RI tensors) to ensure invariance under arbitrary rotations. Representative methods include PaRI-Conv, RISurConv, and PaRot.
Limitations of Prior Work: These methods achieve rotation invariance by discarding absolute coordinate information, but at the cost of losing global pose context. This causes geometrically similar but spatially distinct parts (e.g., the left and right wings of an airplane) to produce identical feature representations.
Key Challenge: The paper formally defines the "Wing-tip Feature Collapse" phenomenon: for symmetric points \(p_{\text{left}}\) and \(p_{\text{right}}\), since their local neighborhoods satisfy \(\Omega(p_{\text{right}}) = \Omega(p_{\text{left}}) R_{\text{sym}}\), any RI function must yield \(f(p_{\text{left}}) = f(p_{\text{right}})\). This is a fundamental limitation of finite receptive fields.
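The collapse can be reproduced with a toy example (mine, not the paper's descriptor): any feature computed purely from pairwise distances within a patch is rotation-invariant, so a patch and its copy transformed by a symmetry rotation \(R_{\text{sym}}\) yield identical features.

```python
import numpy as np

def toy_ri_feature(neighborhood):
    """A toy rotation-invariant descriptor: sorted pairwise distances.
    Any function of distances alone is unchanged by rotation."""
    d = np.linalg.norm(neighborhood[:, None, :] - neighborhood[None, :, :], axis=-1)
    return np.sort(d[np.triu_indices(len(neighborhood), k=1)])

rng = np.random.default_rng(0)
omega_left = rng.normal(size=(8, 3))           # local patch around p_left

# A symmetry rotation R_sym mapping the left patch onto the right one
theta = np.pi                                   # 180-degree rotation about z
R_sym = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                  [np.sin(theta),  np.cos(theta), 0.0],
                  [0.0, 0.0, 1.0]])
omega_right = omega_left @ R_sym                # Omega(p_right) = Omega(p_left) R_sym

# Any RI function must agree on the two patches: f(p_left) == f(p_right)
assert np.allclose(toy_ri_feature(omega_left), toy_ri_feature(omega_right))
```

This is exactly the indistinguishability the paper sets out to break with a global reference.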
Goal: To inject global pose information while preserving rotation invariance, enabling the model to distinguish geometrically similar but spatially distinct structures.
Key Insight: A "shadow" reference point is introduced for each point — projected to a new location via a learned shared rotation matrix — serving as a globally consistent anchor that encodes relative positional information.
Core Idea: A global rotation learned via Bingham distribution generates "shadow points," which are encoded into local PPF features to form SiPF. Combined with an attention-based convolution operator, this enables rotation-invariant learning with global pose awareness.
Method
Overall Architecture
The input is a 3D point cloud; the output is either a classification label or per-point part segmentation. The pipeline consists of three core components:

1. Task-adaptive Shadow Locating: Learns a global rotation \(R_g\) to generate shadow points.
2. SiPF Feature Extraction: Constructs an 8D descriptor encoding both local geometry and global pose information.
3. RIAttnConv: An attention-based rotation-invariant convolution operator that uses SiPF to guide feature aggregation.
Key Designs
- Shadow-informed Pose Feature (SiPF):
- Function: Encodes global pose information into local rotation-invariant features.
- Mechanism: For a reference point \(p_r\), a shadow point \(p_r' = p_r R_g\) is generated using a shared rotation \(R_g\). Building upon the standard PPF (4D: distance + 3 angles), the SiPPF is additionally computed as the normalized difference between the PPFs of the reference point and the neighbor point with respect to the shadow point: \(\text{SiPPF}(p_r, p_r', p_j) = \frac{\text{PPF}(p_r, p_r') - \text{PPF}(p_j, p_r')}{\|\text{PPF}(p_r, p_r') - \text{PPF}(p_j, p_r')\|_2}\)
- The final SiPF is an 8D vector: \(\mathcal{P}_r^j = (\text{PPF}(p_r, p_j), \text{SiPPF}(p_r, p_r', p_j))\)
- Design Motivation: PPF produces identical values for neighbor points symmetrically distributed on the circumference of the LRF principal axis, losing positional information. The shadow point provides a globally consistent reference direction that breaks this symmetry.
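The two formulas above can be sketched in numpy as follows. The PPF angle convention (distance plus three angles against the connecting direction) and the assumption that the shadow point carries the rotated normal \(n_r R_g\) are my reading, not spelled out here:

```python
import numpy as np

def ppf(p1, n1, p2, n2):
    """4D Point Pair Feature: (||d||, angle(n1, d), angle(n2, d), angle(n1, n2)).
    Assumes unit normals."""
    d = p2 - p1
    dist = np.linalg.norm(d)
    u = d / (dist + 1e-9)
    ang = lambda a, b: np.arccos(np.clip(a @ b, -1.0, 1.0))
    return np.array([dist, ang(n1, u), ang(n2, u), ang(n1, n2)])

def sipf(p_r, n_r, p_j, n_j, R_g):
    """8D SiPF = (PPF(p_r, p_j), SiPPF(p_r, p_r', p_j)).
    Assumption: the shadow point's normal is the rotated normal n_r R_g."""
    p_shadow, n_shadow = p_r @ R_g, n_r @ R_g           # shadow point p_r' = p_r R_g
    local = ppf(p_r, n_r, p_j, n_j)                     # standard 4D PPF
    diff = ppf(p_r, n_r, p_shadow, n_shadow) - ppf(p_j, n_j, p_shadow, n_shadow)
    sippf = diff / (np.linalg.norm(diff) + 1e-9)        # normalized difference
    return np.concatenate([local, sippf])

# Illustrative inputs: unit normals and a 90-degree shadow rotation about z
Rz = np.array([[0., -1., 0.], [1., 0., 0.], [0., 0., 1.]])
f = sipf(np.array([1., 0., 0.]), np.array([0., 0., 1.]),
         np.array([0., 1., 0.]), np.array([0., 1., 0.]), Rz)
```

The first four entries are the usual local PPF; the last four form a unit vector pointing "toward" the shared shadow anchor in feature space.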
- Task-adaptive Shadow Locating:
- Function: Adaptively learns the optimal global rotation \(R_g\) for generating shadow points.
- Mechanism: Rotation uncertainty is modeled using the Bingham distribution on the unit quaternion sphere \(S^3\): \(\mathcal{B}(q | \mathbf{V}, \mathbf{\Lambda}) = \frac{1}{F(\mathbf{\Lambda})} \exp(q^\top \mathbf{V} \mathbf{\Lambda} \mathbf{V}^\top q)\)
- The mode vector is extracted from \(\mathbf{V}\) as the optimal rotation candidate for the current epoch.
- Joint loss (the \(\sqrt{(\cdot)^2}\) term is simply an absolute value): \(\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{task}} + \delta \cdot |\mathcal{L}_{\text{bingham}} - 0.1 \cdot \mathcal{L}_{\text{task}}|\)
- Design Motivation: An arbitrarily chosen \(R_g\) may fail for certain geometric configurations (e.g., degenerating to standard PPF when the shadow point aligns with the LRF principal axis). End-to-end learning with Bingham distribution modeling automatically identifies an optimal rotation that avoids such degeneracy.
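A hedged sketch of how a Bingham mode yields a rotation: assuming the usual convention that \(\mathbf{\Lambda}\)'s largest entry is 0, the mode quaternion is the corresponding column of \(\mathbf{V}\), which is then converted to \(R_g\). The toy \(\mathbf{V}, \mathbf{\Lambda}\) values are illustrative, not the learned ones:

```python
import numpy as np

def bingham_mode(V, Lam):
    """Mode of Bingham(q | V, Lambda): the column of V paired with the
    largest concentration parameter (conventionally 0, the rest negative)."""
    return V[:, np.argmax(Lam)]

def quat_to_rotmat(q):
    """Unit quaternion (w, x, y, z) -> 3x3 rotation matrix."""
    w, x, y, z = q / np.linalg.norm(q)
    return np.array([
        [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
        [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
        [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
    ])

# Toy example: orthonormal V (here the identity basis of S^3) and
# hypothetical concentrations Lambda; the mode picks one quaternion column.
V = np.eye(4)
Lam = np.array([0.0, -5.0, -20.0, -20.0])
R_g = quat_to_rotmat(bingham_mode(V, Lam))

assert np.allclose(R_g @ R_g.T, np.eye(3))    # valid rotation: orthogonal
assert np.isclose(np.linalg.det(R_g), 1.0)    # ... with determinant +1
```

Because the mode is extracted from a full distribution rather than fixed, the candidate \(R_g\) can move away from degenerate configurations as training proceeds.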
- RIAttnConv Operator:
- Function: Aggregates neighbor features via an attention mechanism guided by SiPF.
- Mechanism: An MLP maps SiPF \(\mathcal{P}_r^j\) to adaptive kernel weights \(W_j^r\), followed by scaled dot-product attention: \(Q = \mathbf{W}_r, K = \mathbf{X}_r, V = \mathbf{W}_r \cdot \mathbf{X}_r\)
- Combined with Reversed EdgeConv: neighbor features are first aggregated to obtain \(\hat{x}_r\), then fused with the reference point feature \(x_r\): \(x_r' = g((\hat{x}_r - x_r) \oplus x_r)\)
- Design Motivation: In conventional methods, kernel weights depend solely on local relative pose, producing identical weights when local geometry is the same. The global information introduced by SiPF causes kernel weights to differ across spatial locations, enabling discrimination of symmetric structures.
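The data flow above, sketched in numpy with illustrative shapes and random weights (a single-layer "MLP" and a plain max-pool aggregation stand in for the paper's actual modules, so this is a structural sketch, not the implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
k, c = 16, 32                                   # neighbors, channels (illustrative)

sipf_feats = rng.normal(size=(k, 8))            # 8D SiPF per neighbor
X = rng.normal(size=(k, c))                     # neighbor features X_r
x_r = rng.normal(size=(c,))                     # reference point feature

# "MLP" mapping SiPF -> adaptive kernel weights W_j^r (single-layer sketch)
W_mlp = rng.normal(size=(8, c)) * 0.1
W = np.tanh(sipf_feats @ W_mlp)                 # (k, c) adaptive weights

def row_softmax(A):
    e = np.exp(A - A.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Scaled dot-product attention with Q = W, K = X, V = W * X (elementwise)
attn = row_softmax(W @ X.T / np.sqrt(c))        # (k, k) attention map
out = attn @ (W * X)                            # (k, c) attended features
x_hat = out.max(axis=0)                         # aggregate over neighbors

# Reversed EdgeConv fusion: g((x_hat - x_r) ⊕ x_r); here g is a linear map
g = rng.normal(size=(2 * c, c)) * 0.1
x_out = np.concatenate([x_hat - x_r, x_r]) @ g  # fused output feature

assert x_out.shape == (c,)
```

The key point the sketch makes concrete: because W is computed from SiPF, two neighborhoods with identical local geometry but different global pose receive different kernel weights.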
Loss & Training
Classification uses cross-entropy loss with Bingham regularization. The optimizer is SGD with an initial learning rate of 0.1, cosine annealing to 0.001, and training for 300 epochs. Batch size is 32 for classification and 16 for segmentation; dropout is set to 0.5.
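The schedule reduces to the standard cosine-annealing rule; a sketch (not the authors' training code):

```python
import math

def cosine_lr(epoch, total_epochs=300, lr_max=0.1, lr_min=0.001):
    """Standard cosine annealing: lr_max at epoch 0, decaying to lr_min
    at the final epoch."""
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * epoch / total_epochs))

assert abs(cosine_lr(0) - 0.1) < 1e-12      # starts at the initial LR
assert abs(cosine_lr(300) - 0.001) < 1e-12  # anneals to the final LR
```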
Key Experimental Results
Main Results
ModelNet40 Shape Classification (%):
| Method | Input | z/z | z/SO(3) | SO(3)/SO(3) |
|---|---|---|---|---|
| DGCNN | pc | 92.2 | 20.6 | 81.1 |
| PaRI-Conv | pc+n | - | - | 83.3 |
| PaRot | pc | 90.9 | 91.0 | 90.8 |
| Ours | pc | 91.8 | 91.8 | 91.8 |
| Ours | pc+n | 92.6 | 92.6 | 92.6 |
ShapeNetPart Part Segmentation (z/SO(3)):
| Method | C. mIoU | I. mIoU |
|---|---|---|
| PaRI-Conv (pc+n) | - | 84.6 |
| LocoTrans (pc) | 80.1 | 84.0 |
| Ours (pc) | 81.7 | 84.4 |
| Ours (pc+n) | 82.9 | 85.0 |
Ablation Study
| RI Representation | Dim. | C. mIoU | I. mIoU |
|---|---|---|---|
| PPF | 4 | 81.1 | 84.1 |
| Aug. PPF | 8 | 81.8 | 84.2 |
| SiPF-w/o Direction | 5 | 82.4 | 84.5 |
| SiPF | 8 | 82.9 | 85.0 |
Key Findings
- Using coordinates alone (without normals), the method achieves 91.8% classification accuracy, surpassing several methods that require normals.
- SiPF improves C. mIoU by 1.8 percentage points over standard PPF on the segmentation task (81.1 → 82.9).
- RIAttnConv achieves superior segmentation performance compared to PaRI-Conv at a comparable parameter count (3.01M) and FLOPs (4795M).
- The method maintains state-of-the-art performance on the real-world ScanObjectNN dataset (84.0% z/SO(3)), demonstrating robustness to noise and occlusion.
Highlights & Insights
- The formal analysis of Wing-tip Feature Collapse is compelling: The Patch-Swapping Transformation rigorously proves the fundamental limitation of finite-receptive-field RI methods, providing a theoretical basis for incorporating global information.
- The "shadow point" concept is highly intuitive: Projecting each point to a new position via a learned global rotation as a reference anchor preserves rotation invariance (since the rotation is shared) while injecting global positional information.
- Bingham distribution for rotation uncertainty modeling: Rather than fixing a single rotation matrix, a probabilistic distribution is used to adaptively learn the optimal rotation, preventing degenerate cases.
Limitations & Future Work
- Evaluation is limited to object-level datasets; scene-level point clouds (e.g., S3DIS, ScanNet) represent an important direction for future work.
- The learning of the Bingham distribution may be unstable during early training; the paper does not discuss convergence behavior in detail.
- The effectiveness of shadow points relies on objects having globally asymmetric geometry and may fail for fully symmetric objects (e.g., spheres).
- The large neighborhood size of \(k=40\) used in segmentation may incur high computational cost on large-scale point clouds.
Related Work & Insights
- vs PaRI-Conv: PaRI-Conv enhances local descriptors with an 8D Aug. PPF but remains confined to local information; SiPF introduces global information via shadow points and represents a natural extension of the PPF family.
- vs VN-DGCNN: VN-DGCNN preserves pose information through equivariant networks but is constrained by linear combination restrictions; the SiPF approach is more flexible.
- vs LocoTrans: LocoTrans augments local features with an equivariant backbone at high computational cost (6.72M params, 7998M FLOPs); the proposed method is more efficient.
Rating
- Novelty: ⭐⭐⭐⭐⭐ Formal treatment of wing-tip collapse + shadow points + Bingham distribution — elegant and theoretically grounded.
- Experimental Thoroughness: ⭐⭐⭐⭐ Three benchmarks with comprehensive ablations, but lacks scene-level evaluation.
- Writing Quality: ⭐⭐⭐⭐⭐ Rigorous theoretical derivations, clear illustrations, and well-structured exposition.
- Value: ⭐⭐⭐⭐ Offers an elegant solution for injecting global information into rotation-invariant point cloud learning.