RISurConv: Rotation Invariant Surface Attention-Augmented Convolutions for 3D Point Cloud Classification and Segmentation¶

Conference: ECCV 2024
arXiv: 2408.06110
Code: Yes
Area: 3D Vision
Keywords: Point Cloud Classification, Rotation Invariance, Self-Attention, Surface Attributes, 3D Segmentation

TL;DR¶

RISurConv constructs local triangular surfaces and extracts highly representative Rotation Invariant Surface Properties (RISP) from them. Combined with attention-augmented convolutions, it is presented as the first rotation-invariant point cloud analysis network to surpass non-rotation-invariant methods in accuracy.

Background & Motivation¶

Key Challenge¶

Background: Deep learning on 3D point clouds mostly addresses translation and point-permutation invariance; rotation invariance is less studied. Existing rotation-invariant methods (such as RIConv and RIConv++) guarantee invariance through hand-crafted features, but their accuracy trails that of non-rotation-invariant methods (such as PointTransformer v2). The main reasons are that global information is lost when rotation-invariant features are generated, and that the Local Reference Frame (LRF) or Local Reference Axis (LRA) these methods rely on is unstable.
Goal: Narrow or even eliminate the accuracy gap between rotation-invariant and non-rotation-invariant methods.

Method¶

Overall Architecture¶

Construct a \(K\)-nearest neighbor (KNN) local point set for each reference point.
Build two triangular surfaces for each neighbor and extract 14-dimensional Rotation Invariant Surface Properties (RISP).
Embed RISP using an MLP, and then refine features through two self-attention (SA) layers.
Stack five RISurConv layers, followed by a Transformer Encoder and fully connected layers, to output classification or segmentation results.
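
The first step above, building a KNN neighborhood around a reference point, can be sketched in a few lines. This is a minimal brute-force illustration using only the standard library, not the authors' implementation; the function name `knn_indices` is hypothetical, and excluding the reference point from its own neighborhood is an assumption.

```python
import math

def knn_indices(points, ref_index, k):
    """Indices of the k points closest to points[ref_index].

    Assumption: the reference point itself is excluded from its neighborhood.
    """
    ref = points[ref_index]
    candidates = [i for i in range(len(points)) if i != ref_index]
    # Brute-force sort by Euclidean distance; fine for a small illustration.
    candidates.sort(key=lambda i: math.dist(ref, points[i]))
    return candidates[:k]

points = [
    (0.0, 0.0, 0.0),
    (1.0, 0.0, 0.0),
    (0.0, 2.0, 0.0),
    (0.0, 0.0, 3.0),
    (5.0, 5.0, 5.0),
]
neighbors = knn_indices(points, ref_index=0, k=3)
print(neighbors)  # [1, 2, 3]
```

A practical pipeline would typically use a batched GPU distance computation or a spatial index instead of sorting every candidate.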

Key Designs¶

Rotation Invariant Surface Properties (RISP): For each neighboring point \(x_i\), its two adjacent neighbors \(x_{i-1}\) and \(x_{i+1}\) are selected to construct two triangular surfaces. The 14-dimensional feature extracted from them comprises one distance \(L_0\), 5 Euclidean space angles (triangle interior angles and dihedral angles), and 8 tangent space angles (angles between normal vectors and edges). Together these quantities describe the two triangles and their relative arrangement, which is the basis for the claim that RISP is geometrically complete.
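
To make the idea of rotation-invariant surface quantities concrete, the sketch below computes four such quantities for a pair of triangles and checks that they do not change when every point is rotated. It is not the paper's 14-dimensional RISP definition: the choice of features, the assumption that both triangles share the edge from a reference point `p` to `x_i`, and the helper names (`surface_features`, `rotate`) are all illustrative.

```python
import math

def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def norm(a):
    return math.sqrt(dot(a, a))

def angle(a, b):
    # atan2 form is numerically stabler than acos for nearly parallel vectors.
    return math.atan2(norm(cross(a, b)), dot(a, b))

def surface_features(p, x_prev, x_i, x_next):
    """Four illustrative invariants of two triangles sharing the edge p -> x_i."""
    e, e_prev, e_next = sub(x_i, p), sub(x_prev, p), sub(x_next, p)
    n_prev = cross(e_prev, e)   # normal of triangle (p, x_prev, x_i)
    n_next = cross(e, e_next)   # normal of triangle (p, x_i, x_next)
    return (norm(e),                # length of the shared edge
            angle(e, e_prev),       # interior angle at p, first triangle
            angle(e, e_next),       # interior angle at p, second triangle
            angle(n_prev, n_next))  # angle between the two triangle normals

def rotate(v, yaw, roll):
    """Rotate v about the z axis by yaw, then about the x axis by roll."""
    x, y, z = v
    x, y = (x * math.cos(yaw) - y * math.sin(yaw),
            x * math.sin(yaw) + y * math.cos(yaw))
    y, z = (y * math.cos(roll) - z * math.sin(roll),
            y * math.sin(roll) + z * math.cos(roll))
    return (x, y, z)

pts = [(0.1, 0.2, 0.3), (1.0, 0.1, 0.0), (0.5, 0.9, 0.2), (-0.3, 0.8, 0.7)]
before = surface_features(*pts)
after = surface_features(*(rotate(q, 0.7, -1.3) for q in pts))
invariant = all(math.isclose(a, b, abs_tol=1e-9) for a, b in zip(before, after))
print(invariant)  # True
```

Lengths and angles between vectors are preserved by any rotation, and cross products rotate along with their inputs, which is why features built only from these quantities need no canonical orientation.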

RISurConv Operator: Consists of two self-attention modules. SA1 refines features among the \(K\) points within each neighborhood, and SA2 refines global features among the \(N\) representative points. Local and global refinement together strengthen the feature representation.
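
The two-stage attention pattern described above can be sketched as follows: attention within each neighborhood of \(K\) points, a pooling step, then attention across the \(N\) pooled features. This is a simplified illustration rather than the RISurConv operator itself. It assumes single-head scaled dot-product attention with no learned projections (queries, keys, and values are all the raw input rows) and assumes max pooling between the two stages; neither detail is stated in this summary.

```python
import math

def self_attention(rows):
    """Scaled dot-product self-attention with Q = K = V = rows.

    Assumption: no learned projection matrices, purely for illustration.
    """
    d = len(rows[0])
    out = []
    for q in rows:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in rows]
        m = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        out.append([sum(w * v[j] for w, v in zip(weights, rows)) for j in range(d)])
    return out

def max_pool(rows):
    """Channel-wise maximum over a set of feature rows."""
    return [max(col) for col in zip(*rows)]

# N = 2 reference points, K = 3 neighbors each, C = 2 channels.
neighborhoods = [
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    [[0.5, 0.5], [2.0, 0.0], [0.0, 2.0]],
]
local = [self_attention(nb) for nb in neighborhoods]  # SA1-like: within each neighborhood
pooled = [max_pool(nb) for nb in local]               # one feature vector per reference point
refined = self_attention(pooled)                      # SA2-like: across reference points
print(len(refined), len(refined[0]))  # 2 2
```

Each output row is a convex combination of the input rows, so attention re-weights existing features rather than creating values outside their range.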

Loss & Training¶

Cross-entropy loss is used for classification, and standard segmentation loss is used for segmentation.
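
For reference, the cross-entropy loss for a single sample can be written directly from its definition. The sketch below is a generic formulation, not code from the paper; the function name `cross_entropy` and the example logits are illustrative.

```python
import math

def cross_entropy(logits, target):
    """Softmax cross-entropy for one sample: -log softmax(logits)[target]."""
    m = max(logits)  # shift by the max so exp() cannot overflow
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

# Hypothetical 3-class logits where class 0 has the highest score.
loss_correct = cross_entropy([2.0, 0.5, -1.0], target=0)
loss_wrong = cross_entropy([2.0, 0.5, -1.0], target=2)
print(loss_correct < loss_wrong)  # True
```

The loss is small when the target class has the highest logit and grows as probability mass moves to other classes.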

Key Experimental Results¶

Main Results¶

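ModelNet40 Classification Accuracy (Overall Accuracy %). Columns give the train/test rotation setting: z means rotations about the z axis only and SO3 means arbitrary 3D rotations, so z/SO3 trains with z-axis rotations and tests with arbitrary rotations. Std. is the standard deviation across the three settings: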

Method	Rotation Invariant	z/z	SO3/SO3	z/SO3	Std.
PointNet++	✗	89.3	85.0	28.6	33.8
PointTransformer v2	✗	94.2	88.3	51.8	23.0
RIConv++	✓	91.3	91.3	91.3	0.0
RISurConv	✓	96.0	96.0	96.0	0.0
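
The Std. column can be checked against the three accuracy columns. The sketch below assumes it is the sample standard deviation (dividing by n - 1) over the z/z, SO3/SO3, and z/SO3 settings; that is an inference from the numbers, not something this summary states. The values agree with the table to within 0.1.

```python
from statistics import stdev

# (z/z, SO3/SO3, z/SO3) overall accuracy from the table above.
results = {
    "PointNet++": (89.3, 85.0, 28.6),
    "PointTransformer v2": (94.2, 88.3, 51.8),
    "RIConv++": (91.3, 91.3, 91.3),
    "RISurConv": (96.0, 96.0, 96.0),
}
reported_std = {
    "PointNet++": 33.8,
    "PointTransformer v2": 23.0,
    "RIConv++": 0.0,
    "RISurConv": 0.0,
}

spread = {name: stdev(acc) for name, acc in results.items()}
for name, value in spread.items():
    print(f"{name}: computed {value:.2f}, reported {reported_std[name]}")
```

A spread of zero is the signature of a truly rotation-invariant method: its accuracy cannot depend on which rotations appear at training or test time.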

ScanObjectNN Real-World Classification (PB_T50_RS):

Method	z/z	SO3/SO3	z/SO3
RIConv++	80.3	80.3	80.3
RISurConv	93.1	93.1	93.1

ShapeNet Part Segmentation (mIoU %):

Method	SO3/SO3	z/SO3
RIConv++ (xyz+nor)	80.5	80.5
RISurConv (xyz+nor)	81.5	81.5

Ablation Study¶

Ablation Item	Accuracy
Full Model (A)	96.0
Remove \(L_0\) (B)	95.5
Tangent Space Angles Only (C)	90.9
\(L_0 + \phi\) Only (D)	88.2
Remove SA1+SA2+TE (E)	92.8
Remove Transformer Encoder (F)	94.3
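
The ablation numbers are easier to compare as drops relative to the full model. The sketch below only rearranges the values from the table above; the dictionary keys are shortened descriptions, not labels from the paper.

```python
full_model = 96.0
ablations = {
    "remove L0": 95.5,
    "tangent space angles only": 90.9,
    "L0 + phi only": 88.2,
    "remove SA1 + SA2 + TE": 92.8,
    "remove Transformer Encoder": 94.3,
}

# Accuracy lost relative to the full model, in percentage points.
drops = {name: round(full_model - acc, 1) for name, acc in ablations.items()}
for name, drop in sorted(drops.items(), key=lambda item: item[1]):
    print(f"{name}: -{drop}")
```

Removing the distance \(L_0\) costs 0.5 points, while restricting the feature set to a subset of the angles costs 5.1 to 7.8 points, which is the basis for ranking angle features above the distance feature.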

Key Findings¶

RISurConv is reported as the first rotation-invariant method to surpass non-rotation-invariant methods on ModelNet40 (96.0% vs. 94.2% for PointTransformer v2).
RISurConv outperforms RIConv++ by 12.8 percentage points on ScanObjectNN (93.1% vs. 80.3%).
Angle features are more critical than distance features, with tangent space and Euclidean space angles serving as complementary properties.
Self-attention modules facilitate a more uniform feature distribution.

Highlights & Insights¶

Outperforming Non-Rotation-Invariant Methods for the First Time: It is demonstrated that rotation-invariant features do not have to come at the expense of accuracy.
Local Triangular Surface Construction: Captures local geometric structures more effectively compared to point-wise operations.
Completeness of RISP: The 14-dimensional features completely describe the double-triangle structure; adding more features does not further improve performance.
Self-attention re-weights features so that more informative neighbors and points contribute more to the refined representation.

Limitations & Future Work¶

Relatively large parameter count (14M vs. 0.4M for RIConv++), which slows inference.
The quality of normal vector estimation affects classification accuracy (w/o normal is 0.4% lower than w/ normal).
There is still room for improvement regarding fine-grained classification.

Related Work & Insights¶

RIConv only considers local features, leading to diminished accuracy.
GCAConv employs LRF, but LRF is inherently unstable.
Insight: Transitioning from point-wise operations to surface-wise representation is a promising path for enhancing 3D feature representation.

Rating¶

Novelty: ★★★★★ The local triangular surface and RISP feature designs are ingenious, breaking the performance ceiling of rotation invariance for the first time.
Practicality: ★★★★☆ Rotation invariance is highly important for robotics and autonomous driving scenarios.
Experimental Quality: ★★★★★ Comprehensive validation on multiple datasets with detailed ablation studies.