# Rewis3d: Reconstruction Improves Weakly-Supervised Semantic Segmentation
- **Conference**: CVPR 2026
- **arXiv**: 2603.06374
- **Code**: To be released
- **Area**: Segmentation / Weakly-Supervised Segmentation / 3D Vision
- **Keywords**: Weakly-supervised semantic segmentation, 3D reconstruction, cross-modal consistency, dual student-teacher, sparse annotation
## TL;DR
Rewis3d is the first work to introduce feed-forward 3D scene reconstruction as an auxiliary supervision signal for weakly-supervised semantic segmentation. Through a dual student-teacher architecture, it achieves bidirectional cross-modal consistency (CMC) learning between 2D images and reconstructed 3D point clouds. Combined with dual-confidence filtering and view-aware sampling, the method improves mIoU by 2–7% across multiple datasets under sparse annotations (points, scribbles, coarse labels), while requiring only 2D input at inference time.
## Background & Motivation
- Background: Semantic segmentation relies on large amounts of dense pixel-level annotations. Weakly-supervised semantic segmentation (WSSS) with points/scribbles/coarse labels can substantially reduce annotation costs. Existing WSSS methods such as SASFormer and TEL propagate annotation information within the image plane via specialized architectures and loss functions.
- Limitations of Prior Work: These methods operate exclusively in the 2D image plane and struggle with occlusion and scale variation in geometrically complex outdoor scenes, resulting in limited annotation propagation.
- Key Challenge: Sparse annotations carry insufficient information to cover the entire scene, and 2D methods lack cross-view consistency constraints to compensate for this deficiency.
- Goal: How can additional geometric structural information be leveraged to enhance the propagation of sparse annotations?
- Key Insight: Recent feed-forward 3D reconstruction methods (e.g., MapAnything) can directly recover high-fidelity 3D point clouds from 2D video sequences. When an object is sparsely annotated in one viewpoint, its 3D structure enables annotation transfer to all other viewpoints where the object appears.
- Core Idea: Leverage reconstructed 3D geometry as a cross-view consistency bridge to enable bidirectional 2D–3D knowledge transfer during training, while maintaining a purely 2D inference pipeline.
## Method

### Overall Architecture
Rewis3d consists of three core components: (1) a 2D segmentation branch, (2) a 3D segmentation branch, and (3) a cross-modal consistency (CMC) module. The input is a 2D video sequence with sparse annotations. MapAnything is first used to reconstruct the video into a 3D point cloud, and sparse 2D annotations are projected into 3D space. Each branch is trained with its own Mean Teacher pair, after which the CMC module enforces bidirectional supervision, with each branch's teacher supervising the other branch's student. Only the 2D branch is used at inference time.
### Key Designs
- **3D Scene Reconstruction and Preprocessing**
    - Function: Generate a 3D point cloud with per-point confidence from the 2D video, and create a dedicated point-cloud subsample for each target image.
    - Mechanism: MapAnything is used for feed-forward multi-view stereo reconstruction, directly outputting point clouds \(P=\{p_i\}\) with per-point reconstruction confidence \(c_i^{\text{rec}}\). A view-aware sampling strategy is proposed (see the sketch after this item): for each target image, a subsample of 120K points is generated, with 60% (72K) drawn from points corresponding to the current viewpoint to ensure dense 2D–3D correspondences, and 40% (48K) drawn from the surrounding scene for contextual coverage. 3D labels are generated by back-projecting sparse 2D annotations onto the corresponding 3D points.
    - Design Motivation: A complete scene may contain 60M+ points; globally random sampling down to 120K yields only ~140 correspondences per image, far too sparse to train the CMC loss. View-aware sampling guarantees approximately 72K correspondences per image.
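Below is a minimal sketch of the view-aware sampling step, assuming the reconstruction exposes a per-point index of the view each point was triangulated from; the function and field names are illustrative, not from the paper:

```python
import numpy as np

def view_aware_sample(points, view_ids, target_view, total=120_000,
                      frac_view=0.6, rng=None):
    """Subsample the scene cloud for one target image: ~60% from the
    target view's own points (dense 2D-3D correspondences), ~40% from
    the rest of the scene (context). `view_ids` (N,), assigning each of
    the N points to a source view, is an assumed reconstruction output.
    """
    rng = rng or np.random.default_rng(0)
    in_view = np.flatnonzero(view_ids == target_view)
    out_view = np.flatnonzero(view_ids != target_view)
    n_view = min(int(total * frac_view), len(in_view))   # ~72K in-view points
    n_ctx = min(total - n_view, len(out_view))           # ~48K context points
    idx = np.concatenate([
        rng.choice(in_view, size=n_view, replace=False),
        rng.choice(out_view, size=n_ctx, replace=False),
    ])
    return points[idx], idx  # idx also selects the matching labels/confidences
```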
- **Dual Student-Teacher Architecture**
    - Function: Establish a student-teacher pair for each branch to provide stable pseudo-labels for unsupervised learning and cross-modal supervision.
    - Mechanism: Each branch adopts Mean Teacher, with teacher weights updated via EMA: \(\theta_t^{\text{teacher}} \leftarrow \alpha \theta_{t-1}^{\text{teacher}} + (1-\alpha)\theta_t^{\text{student}}\) (\(\alpha=0.99\)). Students are trained with a cross-entropy loss \(\mathcal{L}_S\) on annotated regions and a KL-divergence consistency loss \(\mathcal{L}_U\) on unannotated regions to align with teacher pseudo-labels. A confidence weight \(w_t\) (the proportion of pixels whose maximum teacher class probability exceeds a threshold \(\tau\)) adaptively scales the consistency loss (see the sketch after this item).
    - Design Motivation: Mean Teacher is particularly effective in weakly-supervised settings; the EMA-updated teacher provides stable supervision targets and serves as a reliable pseudo-label source for the cross-modal loss.
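A minimal sketch of the per-branch Mean Teacher machinery, assuming PyTorch modules; the threshold value `tau` below is a placeholder, since the paper's exact \(\tau\) is not quoted here:

```python
import torch

@torch.no_grad()
def ema_update(teacher, student, alpha=0.99):
    """Teacher <- alpha * teacher + (1 - alpha) * student, per parameter."""
    for t_p, s_p in zip(teacher.parameters(), student.parameters()):
        t_p.mul_(alpha).add_(s_p, alpha=1.0 - alpha)

def confidence_weight(teacher_logits, tau=0.95):
    """w_t: fraction of pixels whose max teacher class probability exceeds
    tau; used to scale the consistency loss L_U. tau=0.95 is an assumed
    placeholder value.
    """
    probs = teacher_logits.softmax(dim=1)                  # (B, C, H, W)
    return (probs.max(dim=1).values > tau).float().mean()
```

In this setup, `ema_update` would be called once after each optimizer step on the student.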
- **Weighted Cross-Modal Consistency (CMC) Loss**
    - Function: Enable bidirectional knowledge transfer between the 2D and 3D branches: the 3D teacher supervises the 2D student, and the 2D teacher supervises the 3D student.
    - Mechanism: For the 3D teacher → 2D student direction, a weighted cross-entropy loss is applied over the 2D–3D correspondences \(i\): \(\mathcal{L}_C^{2D} = -\sum_i w_i \cdot \log(S_{2D}^{y_i}(x_i))\), where \(y_i\) is the 3D teacher's pseudo-label for point \(p_i\), \(x_i\) is its corresponding pixel, and \(w_i = \max(\text{softmax}(T_{3D}(p_i))) \cdot c_i^{\text{rec}}\) combines prediction confidence with reconstruction confidence (see the sketch after this item). \(\mathcal{L}_C^{3D}\) is defined symmetrically. Stronger data augmentation is applied to students (2D: RandomCrop/Cutout/AugMix; 3D: RandomRotation/RandomScale/RandomJitter), while teachers receive weaker augmentation.
    - Design Motivation: Dual-confidence filtering ensures that supervision primarily originates from regions with reliable predictions and high-quality reconstructed geometry, suppressing the influence of noisy pseudo-labels.
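A sketch of the 3D teacher → 2D student direction of the CMC loss, assuming the student logits have already been gathered at the pixels corresponding to the sampled points; the normalization by total weight at the end is an implementation choice for scale stability, not stated in the paper:

```python
import torch
import torch.nn.functional as F

def cmc_loss_2d(student_logits_2d, teacher_logits_3d, rec_conf):
    """Weighted cross-entropy from 3D-teacher pseudo-labels to the 2D student.

    student_logits_2d: (M, C) 2D-student logits at the M corresponding pixels.
    teacher_logits_3d: (M, C) frozen 3D-teacher logits at the M sampled points.
    rec_conf:          (M,)   per-point reconstruction confidence c_i^rec.
    """
    probs_3d = teacher_logits_3d.softmax(dim=-1)
    conf, pseudo = probs_3d.max(dim=-1)     # prediction confidence, pseudo-label
    w = conf * rec_conf                     # dual-confidence weight w_i
    log_p = F.log_softmax(student_logits_2d, dim=-1)
    nll = -log_p.gather(-1, pseudo.unsqueeze(-1)).squeeze(-1)
    return (w * nll).sum() / w.sum().clamp_min(1e-6)
```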
### Loss & Training
The total loss is: \(\mathcal{L}_{\text{Total}} = \sum_{m \in \{2D, 3D\}} (\mathcal{L}_S^m + \mathcal{L}_U^m) + \lambda_{2D}\mathcal{L}_C^{2D} + \lambda_{3D}\mathcal{L}_C^{3D}\). Training proceeds in two stages: Stage 1 (Base Training, 15 epochs) trains both branches independently; Stage 2 introduces the CMC loss with a linear ramp-up to \(\lambda=0.1\) over 5 epochs. The 2D backbone is SegFormer-B4 and the 3D backbone is Point Transformer V3.
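The two-stage schedule can be summarized as a small helper; the exact shape of the ramp within Stage 2 is an assumption consistent with the stated linear ramp-up to \(\lambda=0.1\) over 5 epochs:

```python
def cmc_weight(epoch, base_epochs=15, ramp_epochs=5, lam_max=0.1):
    """Lambda schedule: 0 during Stage 1 (base training), then a linear
    ramp to lam_max over the first ramp_epochs epochs of Stage 2.
    """
    if epoch < base_epochs:
        return 0.0
    return lam_max * min(1.0, (epoch - base_epochs + 1) / ramp_epochs)
```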
## Key Experimental Results

### Main Results
| Dataset | Metric | Rewis3d (Recon) | EMA (Baseline) | TEL | SASFormer | Gain (vs EMA) |
|---|---|---|---|---|---|---|
| Waymo | mIoU | 53.3% | 49.4% | 42.4% | 37.8% | +3.9% |
| KITTI-360 | mIoU | 63.4% | 60.3% | 59.2% | 46.4% | +3.1% |
| NYUv2 | mIoU | 46.1% | 42.9% | 38.3% | 44.7% | +3.2% |
Cityscapes generalization across annotation types:
| Annotation Type | EMA | Rewis3d | Gain |
|---|---|---|---|
| Point | 50.5% | 56.5% | +6.0% |
| Scribble | 61.2% | 68.1% | +6.9% |
| Coarse | 66.5% | 68.6% | +2.1% |
### Ablation Study
| Configuration | Waymo mIoU | Notes |
|---|---|---|
| EMA baseline (2D only) | 49.4% | No 3D auxiliary |
| + CMC (no filtering) | 51.9% | No confidence filtering |
| + Prediction confidence | 52.7% | +0.8% vs. no filtering |
| + Reconstruction confidence | 52.1% | +0.2% vs. no filtering |
| + Dual confidence (full) | 53.3% | Complementary effect |
| Random sampling | 51.9% | ~140 correspondences/image |
| View-aware sampling | 53.3% | +1.4% vs. random sampling |
| Single-frame reconstruction | 52.1% | Insufficient geometry |
| Multi-view reconstruction | 53.3% | +1.2% vs. single-frame |
### Key Findings
- Reconstructed 3D outperforms real 3D: Point clouds reconstructed by MapAnything consistently outperform ground-truth LiDAR/depth data. Two reasons: (1) reconstructed point clouds are denser and more complete than LiDAR; (2) reconstruction provides per-point confidence usable for dual-confidence filtering, a signal unavailable from real 3D sensors.
- View-aware sampling contributes most: +1.4 mIoU, ensuring a sufficient number of 2D–3D correspondences for the CMC loss.
- Sparser annotations yield larger gains: Performance gaps widen further under extremely sparse scribble annotations, highlighting the value of geometric supervision under label scarcity.
- Architecture-agnostic: Consistent improvements are observed when replacing the backbone with EoMT.
## Highlights & Insights
- Counterintuitive finding — reconstructed beats real: Reconstructed 3D outperforms LiDAR due to higher density and availability of confidence scores, challenging the assumption that real sensor data is always superior. This suggests that other cross-modal tasks may also benefit from substituting reconstructed data for sensor data.
- Elegant view-aware sampling design: The 60/40 allocation simultaneously ensures dense 2D–3D correspondences for the target image (enabling CMC) and retains global context (supporting 3D segmentation).
- Pure 2D inference: The 3D branch is used exclusively during training, introducing zero additional overhead at inference, making the method highly practical. This "multi-modal training, single-modal inference" paradigm transfers to other tasks where extra modalities are available at training time but not at deployment.
## Limitations & Future Work
- The current 3D reconstruction model does not explicitly handle dynamic objects; moving objects in driving scenes introduce geometric noise.
- Video sequences are required as input for 3D reconstruction, limiting applicability to purely single-image datasets (though Cityscapes experiments show +2.7% even with single-frame reconstruction).
- The computational cost of MapAnything reconstruction itself is not discussed in detail.
- Future work: Integrating reconstruction models that explicitly handle dynamic scenes could further improve performance.
## Related Work & Insights
- vs. SASFormer/TEL: Pure 2D methods propagate annotations within the image plane, constrained by appearance similarity; this work introduces 3D geometric constraints for cross-view propagation.
- vs. 2DPASS/Unal: These methods perform 2D→3D distillation for LiDAR segmentation; this work reverses the direction (3D→2D) and uses reconstructed geometry rather than real sensors.
- vs. WSSS with image-level labels: Point/scribble/coarse annotations provide spatial localization, which combines more effectively with 3D geometry.
## Rating
- Novelty: ⭐⭐⭐⭐ First to introduce 3D reconstruction signals into WSSS; the direction is novel, though individual components (Mean Teacher/CMC/confidence filtering) are relatively standard.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ 4 datasets, 3 annotation types, comprehensive ablations, real vs. reconstructed 3D comparisons.
- Writing Quality: ⭐⭐⭐⭐⭐ Clear logic, rich figures, thorough ablations.
- Value: ⭐⭐⭐⭐ Highly practical (zero inference overhead), though reliance on video input limits some application scenarios.