Rewis3d: Reconstruction Improves Weakly-Supervised Semantic Segmentation¶

Conference: CVPR 2025
arXiv: 2603.06374
Code: To be released
Area: 3D Vision / Semantic Segmentation
Keywords: Weakly-Supervised Semantic Segmentation, 3D Reconstruction, Cross-Modal Consistency, Point Annotation, Scribble Annotation, Mean Teacher

TL;DR¶

Rewis3d uses feed-forward 3D reconstruction (MapAnything) to turn 2D videos into 3D point clouds that serve as an auxiliary supervision signal. With a dual Student-Teacher architecture and a confidence-weighted cross-modal consistency (CMC) loss, it improves weakly-supervised 2D semantic segmentation by roughly 2-7 mIoU points under sparse annotations (points, scribbles, coarse labels), while inference remains purely 2D.

Background & Motivation¶

Background: Semantic segmentation heavily relies on dense pixel-level annotations. While sparse annotations can significantly cut down annotation costs, they incur a performance gap. Existing weakly-supervised methods like SASFormer (self-attention propagation) and TEL (minimum spanning tree pseudo-labels) struggle to fully bridge this gap in complex scenes.

Limitations of Prior Work: Pure 2D methods rely solely on single-frame appearance cues to propagate sparse supervision, which are limited by occlusions, scale variations, and long-range dependencies in geometrically complex outdoor scenes.

Key Challenge: 3D geometric structures can provide strong scene constraints to achieve cross-view consistency, but traditional methods require 3D sensors like LiDAR, which limits their applicability.

Goal: How to enhance weakly-supervised semantic segmentation using 3D geometric priors obtained solely from 2D video sequences, while requiring no 3D data during inference?

Key Insight: Recent feed-forward 3D reconstruction models such as MapAnything can reconstruct dense 3D point clouds directly from videos, so 3D can serve as an auxiliary supervision signal during training without any 3D sensor.

Core Idea: 3D geometric structures provide cross-view consistency constraints; sparse annotations from one viewpoint can be propagated to all visible viewpoints via 3D reconstruction.

Method¶

Overall Architecture¶

Input 2D video \(\rightarrow\) Feed-forward reconstruction with MapAnything to obtain dense 3D point cloud + confidence \(\rightarrow\) Independent training of 2D and 3D Mean Teacher branches (Base Training) \(\rightarrow\) Cross-Modal Consistency (CMC): 2D teacher \(\rightarrow\) 3D student, 3D teacher \(\rightarrow\) 2D student \(\rightarrow\) Output pure 2D segmentation model.
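Both branches in the pipeline above follow the Mean Teacher pattern, in which the teacher's weights are an exponential moving average (EMA) of the student's. The following is a minimal sketch of that update, assuming parameters are stored in a plain dict; the function name `ema_update` and the decay value 0.99 are illustrative assumptions, not values taken from the paper.

```python
def ema_update(teacher, student, decay=0.99):
    """Update teacher parameters in place as an EMA of the student parameters.

    teacher, student: dicts mapping parameter name -> float (hypothetical layout).
    decay: assumed smoothing factor; the value used by the authors is not given here.
    """
    for name, student_value in student.items():
        teacher[name] = decay * teacher[name] + (1.0 - decay) * student_value
    return teacher


teacher = {"w": 1.0}
student = {"w": 0.0}
ema_update(teacher, student, decay=0.9)  # teacher["w"] moves 10% of the way to the student
```

In this pattern the teacher is never trained by gradient descent; it only tracks the student, which tends to make its predictions smoother targets.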

Key Designs¶

3D Scene Reconstruction and View-Aware Sampling:
- Function: Reconstructs dense 3D point clouds from videos using MapAnything, obtaining point-wise reconstruction confidence \(c_i^{\text{rec}}\).
- Mechanism: Generates a view-specific subsample of 120K points for each target image: 60% from the viewpoint's own points (ensuring dense 2D-3D correspondence) and 40% from the spatial neighborhood (providing global context).
- Design Motivation: Random sampling over the entire scene (\(60\text{M}+ \rightarrow 120\text{K}\)) results in only about 140 corresponding points per frame, which is insufficient for training the CMC loss.
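The 60/40 split described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `view_aware_sample`, the index-list inputs, and the rule for filling the budget when a view has too few points are all hypothetical, since the source specifies only the 120K budget and the 60%/40% proportions.

```python
import random


def view_aware_sample(view_point_ids, neighbor_point_ids, budget=120_000,
                      own_fraction=0.6, seed=0):
    """Return (own, context) lists of point indices for one target image.

    view_point_ids: indices of points reconstructed from the target view.
    neighbor_point_ids: indices of points in the spatial neighborhood.
    """
    rng = random.Random(seed)
    # About 60% of the budget comes from the view's own points.
    n_own = min(int(round(budget * own_fraction)), len(view_point_ids))
    # Assumption: the remaining budget is filled from the neighborhood.
    n_ctx = min(budget - n_own, len(neighbor_point_ids))
    own = rng.sample(view_point_ids, n_own)
    context = rng.sample(neighbor_point_ids, n_ctx)
    return own, context


own, context = view_aware_sample(list(range(1000)), list(range(1000, 5000)), budget=100)
```

Because the `own` points each come from a pixel of the target image, every one of them yields a 2D-3D correspondence that the CMC loss can use.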
Dual Student-Teacher + Cross-Modal Consistency (CMC):
- Function: Employs SegFormer-B4 for the 2D branch and Point Transformer V3 for the 3D branch, each equipped with an EMA Teacher.
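- Core of CMC: The teacher of one modality supervises the student of the other modality. 3D teacher \(\rightarrow\) 2D student: \(\mathcal{L}_C^{2D} = -\sum_i w_i \cdot \log(S_{2D}^{y_i}(I_j))\), where \(i\) runs over the 3D points \(p_i\) that have a corresponding pixel in image \(I_j\), \(y_i\) is the class predicted for \(p_i\) by the 3D teacher, and \(S_{2D}^{y_i}\) is the 2D student's probability for that class at the corresponding pixel.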
- Dual Confidence Weighting: \(w_i = \max(\text{softmax}(T_{3D}(p_i))) \cdot c_i^{\text{rec}}\), which is the prediction confidence multiplied by the reconstruction confidence.
- Design Motivation: Dual filtering ensures that only points with reliable predictions and reliable geometry provide supervision.
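To make the dual confidence weighting concrete, here is a small pure-Python sketch of the 3D teacher \(\rightarrow\) 2D student term. It assumes one entry per 2D-3D correspondence and takes the teacher's argmax class as the pseudo-label; the function names and the list-of-logits layout are hypothetical, and a real implementation would use a tensor library with the teacher output detached from the gradient.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cmc_loss_2d(teacher3d_logits, student2d_logits, rec_conf):
    """Confidence-weighted cross-entropy from 3D teacher to 2D student.

    Each argument holds one entry per 2D-3D correspondence i:
    teacher3d_logits[i], student2d_logits[i]: per-class logits (lists of floats)
    rec_conf[i]: reconstruction confidence c_i^rec in [0, 1]
    """
    loss = 0.0
    for t_logits, s_logits, c_rec in zip(teacher3d_logits, student2d_logits, rec_conf):
        t_prob = softmax(t_logits)
        y = max(range(len(t_prob)), key=t_prob.__getitem__)  # pseudo-label y_i
        w = t_prob[y] * c_rec  # w_i = prediction confidence * reconstruction confidence
        loss += -w * math.log(softmax(s_logits)[y])
    return loss


teacher = [[4.0, 0.0, 0.0]]
loss_agree = cmc_loss_2d(teacher, [[4.0, 0.0, 0.0]], [1.0])
loss_disagree = cmc_loss_2d(teacher, [[0.0, 4.0, 0.0]], [1.0])
```

A point with zero reconstruction confidence contributes nothing, and an uncertain teacher prediction is down-weighted in the same way, which is the filtering behavior the design motivation describes.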
3D Label Propagation:
- Function: Projects 2D sparse annotations back into 3D points to accumulate multi-view labels.
- Mechanism: Each 3D point originates from a specific 2D pixel, mapping labels 1:1. These are accumulated across all source images to form a unified sparse 3D label map.
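The accumulation step can be sketched as a simple merge over frames. The data layout below (dicts keyed by pixel coordinates, a per-frame pixel-to-point index map) and the name `propagate_labels` are assumptions made for illustration; the source states only that labels map 1:1 from pixels to points and are accumulated over all source images.

```python
def propagate_labels(frames):
    """Build a sparse 3D label map from per-frame sparse 2D annotations.

    frames: iterable of (sparse_labels, pixel_to_point) pairs, where
      sparse_labels: {(row, col): class_id} for annotated pixels only
      pixel_to_point: {(row, col): global_point_index} for reconstructed pixels
    Returns {global_point_index: class_id}.
    """
    labels_3d = {}
    for sparse_labels, pixel_to_point in frames:
        for pixel, class_id in sparse_labels.items():
            point = pixel_to_point.get(pixel)
            if point is not None:  # skip annotated pixels with no reconstructed point
                labels_3d[point] = class_id
    return labels_3d


frame_a = ({(0, 0): 3}, {(0, 0): 10, (0, 1): 11})
frame_b = ({(5, 5): 7}, {(5, 5): 42})
labels = propagate_labels([frame_a, frame_b])  # {10: 3, 42: 7}
```

Since each point originates from exactly one pixel, labels from different frames land on different points and no conflict resolution is needed in this sketch.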

Loss & Training¶

\[\mathcal{L}_{\text{Total}} = \sum_{m \in \{2D, 3D\}} (\mathcal{L}_S^m + \mathcal{L}_U^m) + \lambda_{2D} \mathcal{L}_C^{2D} + \lambda_{3D} \mathcal{L}_C^{3D}\]

Base Training runs for 15 epochs \(\rightarrow\) CMC linearly ramps up over 5 epochs to \(\lambda=0.1\).
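The schedule can be written as a small function of the epoch index. This sketch assumes 0-indexed epochs, the same \(\lambda\) for both modalities, and a ramp that reaches 0.1 at the end of its fifth epoch; the exact step at which the ramp starts and ends is an assumption, and `cmc_weight` is a hypothetical name.

```python
def cmc_weight(epoch, base_epochs=15, ramp_epochs=5, lam_max=0.1):
    """CMC loss weight for a 0-indexed epoch.

    Epochs 0..base_epochs-1: base training only (weight 0).
    Next ramp_epochs epochs: linear increase up to lam_max.
    Afterwards: constant lam_max.
    """
    if epoch < base_epochs:
        return 0.0
    progress = min(1.0, (epoch - base_epochs + 1) / ramp_epochs)
    return lam_max * progress


schedule = [cmc_weight(e) for e in range(22)]
```

Delaying the cross-modal term until both teachers have trained on their own modality keeps early, unreliable pseudo-labels from being exchanged between branches.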

Key Experimental Results¶

Main Results (Scribble Supervision, mIoU%)¶

Method	Waymo	KITTI-360	NYUv2	SS/FS Ratio (Waymo)
Fully Supervised	59.0	68.4	51.1	—
EMA Baseline	49.4	60.3	42.9	83.7%
SASFormer	37.8	46.4	44.7	64.1%
TEL	42.4	59.2	38.3	71.9%
Rewis3d (Ours)	53.3	63.4	46.1	90.3%
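The SS/FS Ratio column appears to be the sparsely supervised mIoU divided by the fully supervised mIoU on Waymo; that reading is an inference from the numbers rather than something the table states. The short check below reproduces the reported percentages from the Waymo column.

```python
# Waymo mIoU values copied from the table above.
fully_supervised_waymo = 59.0
waymo_miou = {
    "EMA Baseline": 49.4,
    "SASFormer": 37.8,
    "TEL": 42.4,
    "Rewis3d": 53.3,
}

# Ratio of sparse-supervised to fully supervised mIoU, in percent.
ratios = {
    name: round(100 * miou / fully_supervised_waymo, 1)
    for name, miou in waymo_miou.items()
}
print(ratios)
```

Under this reading, Rewis3d recovers about 90% of fully supervised performance on Waymo from scribbles alone.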

Different Annotation Types (Cityscapes)¶

Annotation Type	EMA	Rewis3d	Gain
Point	50.5	56.5	+6.0
Scribble	61.2	68.1	+6.9
Coarse	66.5	68.6	+2.1

Ablation Study (Waymo)¶

Configuration	mIoU
No Filtering	51.9
+ Prediction Confidence	52.7
+ Dual Confidence	53.3
Random Sampling	51.9
View-Aware	53.3 (+1.4)
Single-Frame Reconstruction	52.1
Multi-View Reconstruction	53.3 (+1.2)

Key Findings¶

Reconstructed 3D outperforms real LiDAR (+1.5 mIoU): Reconstructed point clouds are denser and contain confidence scores for filtering.
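Dual confidences are complementary: prediction confidence contributes +0.8 and reconstruction confidence +0.2, while combining them yields +1.4 mIoU over no filtering (51.9 \(\rightarrow\) 53.3).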
Sparse annotations benefit most from geometric supervision (Point +6.0, Scribble +6.9), while the denser coarse labels gain less (+2.1).

Highlights & Insights¶

"No 3D at Inference" Design: 3D is only used during training. Inference is fully 2D, which offers high practicality.
Counter-Intuitive Finding (Reconstructed 3D > Real 3D): Dense reconstruction combined with confidence filtering performs better than sparse physical sensor data.
View-Aware Sampling: Solves the practical challenge of managing 2D-3D correspondence density in large-scale point clouds.

Limitations & Future Work¶

Relies on video sequences for 3D reconstruction—limited applicability to single-image datasets.
MapAnything can produce noise when reconstructing dynamic objects.
High 3D preprocessing overhead (200+ images \(\rightarrow\) 60M+ points).

Related Work & Insights¶

vs SASFormer/TEL: Pure 2D methods are constrained in complex outdoor scenarios, whereas Rewis3d introduces 3D constraints to yield a 7-15% improvement.
vs 2DPASS: 2DPASS requires physical LiDAR, whereas Rewis3d only requires 2D video sequences and performs pure 2D inference.
vs Pure 2D Weakly-Supervised Methods: These lack spatial constraints, whereas Rewis3d provides additional geometric priors through 3D reconstruction.

Rating¶

Novelty: ⭐⭐⭐⭐ Utilizing 3D reconstruction as an auxiliary signal for weak supervision is a novel concept.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ Evaluated on 4 datasets, 3 annotation types, with extensive ablation studies.
Writing Quality: ⭐⭐⭐⭐ Well-structured and logically clear.
Value: ⭐⭐⭐⭐⭐ Highly practical, providing a significant boost to weakly-supervised segmentation.

Rating¶

Novelty: ⭐⭐⭐⭐ Novel approach of auxiliary weakly-supervised learning via reconstruction.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ Four datasets (Waymo, KITTI-360, NYUv2, Cityscapes) with extensive ablations.
Writing Quality: ⭐⭐⭐⭐ Clear.
Value: ⭐⭐⭐⭐ Introduces a new weakly-supervised paradigm.
