UniC-Lift: Unified 3D Instance Segmentation via Contrastive Learning

Conference: AAAI 2026 · arXiv: 2512.24763 · Code: github.com/val-iisc/UniC-Lift · Area: 3D Vision · Keywords: 3D instance segmentation, 3D Gaussian splatting, contrastive learning, multi-view consistency, embedding-to-label

TL;DR

This paper proposes UniC-Lift, a unified single-stage 3D instance segmentation framework that optimizes learnable vector embeddings attached to 3DGS primitives with contrastive and triplet losses, and directly decodes consistent 3D segmentation labels through a simple Embedding-to-Label procedure — eliminating post-processing clustering steps such as HDBSCAN and cutting per-scene training time from 15+ hours to under 40 minutes.

Background & Motivation

Problem Definition

3D scene understanding is a critical task in AR/VR, autonomous driving, and path planning. Existing methods typically achieve 3D segmentation by "lifting" 2D segmentation labels into 3D representations such as NeRF or 3DGS. However, instance labels generated by 2D segmentation models across different viewpoints are inconsistent — the same object may be assigned different instance IDs in different views.

Limitations of Prior Work

Two-stage methods (preprocessing + segmentation): Methods such as Panoptic-Lifting and DM-NeRF rely on the Linear Assignment Problem to match 2D predictions with 3D representations, incurring substantial computational overhead and requiring 20+ hours of training per scene.

Two-stage methods (contrastive learning + clustering): Methods such as Contrastive-Lift optimize feature embeddings via contrastive learning and then apply HDBSCAN clustering as post-processing to assign labels, introducing hyperparameter sensitivity and requiring 15+ hours of training.

Feature distillation methods: Methods such as DFF distill high-dimensional features from CLIP/DINO into 3D representations, but training is extremely slow (approximately 2 days).

Root Cause

Can contrastive learning and label decoding be unified into a single-stage process? The authors observe that when the embedding space is constrained to \([0,1]^d\) via sigmoid, contrastive loss naturally drives embeddings of different instances to converge toward distinct corners of the hypercube, each corresponding to a unique binary code that can be directly decoded into discrete labels. This insight renders post-processing clustering entirely unnecessary.

Method

Overall Architecture

UniC-Lift builds upon 3DGS. The core idea is to attach a \(d\)-dimensional learnable vector embedding \(\boldsymbol{v} \in \mathbb{R}^d\) to each 3D Gaussian primitive, obtain 2D embedding maps via differentiable rendering, apply contrastive and triplet losses for optimization, and directly obtain instance labels through thresholding and binary decoding.

Overall pipeline:

  • Input: multi-view RGB images + camera poses + 2D segmentation masks (potentially inconsistent across views)
  • 3DGS optimization: color parameters and vector embeddings are optimized jointly
  • Losses: rendering loss + cluster contrastive loss + triplet loss + 3D neighborhood regularization
  • Inference: rendered embeddings → sigmoid → thresholding → binary decoding → instance labels

Key Designs

1. Rendering of Learnable Vector Embeddings

Each 3D Gaussian primitive is augmented with a \(d\)-dimensional view-independent vector embedding \(\boldsymbol{v}\), rendered to the 2D plane via the same alpha-compositing scheme used for color:

\[\boldsymbol{\mathcal{V}} = \sum_{i \in T} \boldsymbol{v}_i \alpha'_i \prod_{j=1}^{i-1}(1-\alpha'_j)\]

Design Motivation: Leveraging the differentiable rendering framework of 3DGS allows embedding optimization to naturally account for 3D geometric relationships, achieving multi-view consistency.
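Below is a minimal per-ray sketch of this compositing rule, assuming the Gaussians hit by the ray are already depth-sorted and their projected opacities \(\alpha'_i\) are given (in the actual method this runs inside the differentiable 3DGS rasterizer, which renders all pixels in parallel):

```python
import torch

def composite_embeddings(v: torch.Tensor, alpha: torch.Tensor) -> torch.Tensor:
    """Alpha-composite per-Gaussian embeddings along a single ray.

    v:     (N, d) embeddings of the N Gaussians on the ray, sorted front-to-back.
    alpha: (N,) effective opacities alpha'_i after 2D projection.
    Returns the (d,) rendered pixel embedding.
    """
    # Transmittance T_i = prod_{j<i} (1 - alpha'_j): an exclusive cumulative
    # product, with T = 1 for the front-most Gaussian.
    transmittance = torch.cumprod(
        torch.cat([alpha.new_ones(1), 1.0 - alpha[:-1]]), dim=0
    )
    weights = alpha * transmittance               # per-Gaussian blending weight
    return (weights.unsqueeze(-1) * v).sum(dim=0)
```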

2. Cluster Contrastive Loss

For the rendered embedding map \(\mathbb{V} \in \mathbb{R}^{H \times W \times d}\) at each viewpoint, pixels are partitioned into \(K\) disjoint sets \(\{\Omega_1, ..., \Omega_K\}\) according to 2D segmentation masks. The centroid \(\boldsymbol{m}_{\Omega_i}\) of each set is computed, and the loss minimizes intra-cluster distances while maximizing inter-cluster centroid distances:

\[\mathcal{L}_{cluster} = \sum_{\Omega_i} \sum_{u \in \Omega_i} \|\mathbb{V}(u) - \boldsymbol{m}_{\Omega_i}\|_2^2 - \sum_{i \neq j} \|\boldsymbol{m}_{\Omega_i} - \boldsymbol{m}_{\Omega_j}\|_2^2\]

Design Motivation: This is a standard contrastive objective — pulling embeddings of the same instance together and pushing embeddings of different instances apart.
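A single-view sketch of this objective, assuming the 2D masks arrive as an integer id image; the paper's exact normalization of the two terms may differ:

```python
import torch

def cluster_loss(emb_map: torch.Tensor, masks: torch.Tensor) -> torch.Tensor:
    """Cluster contrastive loss for one rendered view.

    emb_map: (H, W, d) rendered embedding map.
    masks:   (H, W) integer instance ids from the 2D segmentation.
    """
    flat = emb_map.reshape(-1, emb_map.shape[-1])         # (H*W, d)
    ids = masks.reshape(-1)
    pull = emb_map.new_zeros(())
    centroids = []
    for k in ids.unique():
        cluster = flat[ids == k]                          # pixels of set Omega_k
        m_k = cluster.mean(dim=0)                         # centroid m_{Omega_k}
        centroids.append(m_k)
        pull = pull + ((cluster - m_k) ** 2).sum()        # intra-cluster term
    centroids = torch.stack(centroids)                    # (K, d)
    push = torch.cdist(centroids, centroids).pow(2).sum() # inter-centroid term
    return pull - push
```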

3. Triplet Loss with Boundary Hard Mining

Applying the cluster loss directly on raw rendered embeddings does not guarantee consistent penalization. Instead, embeddings are first constrained to \([0,1]\) via sigmoid, then projected through a linear layer \(\mathbb{W}\), and the triplet loss is computed in the projected space:

\[\mathcal{L}_{triplet} = \sum_{(a,p,n) \in \Delta} \max(0, \|a-p\|_2^2 - \|a-n\|_2^2 + \delta)\]

Key Design: Positive and negative samples are mined from segmentation boundaries rather than sampled randomly. Boundary triplets provide non-zero gradients and are more informative than random triplets, accelerating convergence (25k vs. 50k iterations to reach equivalent quality).

Role of the linear layer: Applying hard mining directly on feature embeddings is shown to be unstable. Computing the triplet loss after a linear transformation of the rendered embeddings stabilizes training and significantly improves performance.
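An illustrative sketch of the mining step under two simplifying assumptions that are ours, not the paper's: boundaries are detected as horizontally adjacent pixels with differing ids, and positives are drawn from other boundary pixels of the same instance. `proj` stands in for the linear layer \(\mathbb{W}\):

```python
import torch
import torch.nn.functional as F

def boundary_triplet_loss(emb_map, masks, proj, margin=1.0, max_triplets=3000):
    """Triplet loss on sigmoid-constrained, linearly projected embeddings.

    emb_map: (H, W, d) rendered embeddings; masks: (H, W) instance ids;
    proj:    torch.nn.Linear(d, d_proj), the stabilizing projection.
    """
    z = proj(torch.sigmoid(emb_map))              # constrain to [0,1], project
    # Boundary pixels: instance id changes between horizontal neighbours.
    edge = masks[:, 1:] != masks[:, :-1]          # (H, W-1)
    ys, xs = edge.nonzero(as_tuple=True)
    if ys.numel() == 0:
        return z.sum() * 0.0                      # no boundaries in this view
    keep = torch.randperm(ys.numel())[:max_triplets]
    ys, xs = ys[keep], xs[keep]
    anchor = z[ys, xs]                            # one side of each boundary
    negative = z[ys, xs + 1]                      # opposite side, different id
    labels = masks[ys, xs]
    perm = torch.randperm(ys.numel())             # candidate positives
    same = labels[perm] == labels                 # keep id-matching pairs only
    a, p, n = anchor[same], anchor[perm][same], negative[same]
    loss = F.relu((a - p).pow(2).sum(-1) - (a - n).pow(2).sum(-1) + margin)
    return loss.mean() if loss.numel() > 0 else z.sum() * 0.0
```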

4. 3D Neighborhood Regularization

For each Gaussian primitive \(i\), a spatial neighborhood \(\mathcal{N}(i) = \{\, j \mid \|\boldsymbol{\mu}_i - \boldsymbol{\mu}_j\|_2^2 \leq \tau \,\}\) is defined, and the loss penalizes embedding discrepancies between spatially adjacent Gaussians:

\[\mathcal{L}_{3D} = \sum_{i=1}^{|\mathcal{G}|} \sum_{j \in \mathcal{N}(i)} \|\boldsymbol{v}_i - \boldsymbol{v}_j\|_2^2\]

This loss is activated only after 15,000 iterations (once adaptive density control stabilizes), to avoid interfering with the Gaussian splitting and cloning process.
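A subsampled sketch of this regularizer; the random subset and brute-force pairwise distances are readability simplifications (a real implementation would query neighbours with a KNN or spatial hash over the Gaussian centers):

```python
import torch

def neighborhood_reg(means, v, tau=0.01, n_samples=4096):
    """3D neighbourhood regularizer over Gaussian embeddings.

    means: (N, 3) Gaussian centers mu_i;  v: (N, d) vector embeddings.
    Penalizes embedding differences between Gaussians whose squared
    center distance is at most tau.
    """
    idx = torch.randperm(means.shape[0])[:n_samples]
    mu, emb = means[idx], v[idx]
    d2 = torch.cdist(mu, mu).pow(2)                           # (n, n)
    self_mask = torch.eye(len(idx), dtype=torch.bool, device=mu.device)
    i, j = ((d2 <= tau) & ~self_mask).nonzero(as_tuple=True)  # neighbour pairs
    return (emb[i] - emb[j]).pow(2).sum(-1).sum()
```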

5. Embedding-to-Label Procedure

This is the central contribution of the paper. The decoding process is remarkably simple:

  1. Apply sigmoid to the rendered embeddings: \(\hat{\boldsymbol{\mathcal{V}}} = \sigma(\boldsymbol{\mathcal{V}})\)
  2. Threshold to binary vectors: \(\tilde{\boldsymbol{\mathcal{V}}} = \mathbf{1}[\hat{\boldsymbol{\mathcal{V}}} > 0.5]\)
  3. Decode binary vectors to labels: \(l = \sum_k \tilde{\boldsymbol{\mathcal{V}}}_k \cdot 2^{k-1}\)

This reduces the time complexity of label prediction at inference to \(O(n)\) (where \(n\) is the number of pixels), compared to \(O(n \log c)\) for Contrastive-Lift (where \(c\) is the number of clusters), significantly improving inference speed.
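The whole procedure amounts to a few vectorized tensor operations. A minimal sketch, directly mirroring the three steps above:

```python
import torch

def embeddings_to_labels(emb_map: torch.Tensor) -> torch.Tensor:
    """Embedding-to-Label: sigmoid -> threshold at 0.5 -> binary decode.

    emb_map: (H, W, d) rendered embedding map.
    Returns an (H, W) integer label map with values in [0, 2^d - 1].
    """
    bits = (torch.sigmoid(emb_map) > 0.5).long()     # steps 1 + 2
    powers = 2 ** torch.arange(emb_map.shape[-1], device=emb_map.device)
    return (bits * powers).sum(-1)                   # step 3: sum_k b_k * 2^(k-1)
```

For \(d = 12\) this maps every pixel to one of \(2^{12} = 4096\) possible codes in a single linear pass, with no clustering step.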

Loss & Training

Total loss:

\[\mathcal{L}_{total} = \mathcal{L}_{rendering} + \lambda_{cluster} \mathcal{L}_{cluster} + \lambda_{triplet} \mathcal{L}_{triplet} + \lambda_{3D} \mathcal{L}_{3D}\]

Hyperparameters: \(\lambda_{cluster} = \lambda_{triplet} = \lambda_{3D} = 0.1\), triplet margin \(\delta = 1\), neighborhood threshold \(\tau = 0.01\). Embedding dimension \(d = 12\), maximum number of triplets 3000.

Training strategy:

  • ADAM optimizer, learning rate \(1 \times 10^{-4}\)
  • 30k iterations in total on a single RTX A6000 GPU
  • Gradients from the segmentation losses are excluded from 3DGS adaptive density control
  • Triplet loss and 3D regularization loss are activated only after adaptive density control completes
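A minimal sketch of how the weights and schedule compose, assuming (as stated above) that adaptive density control ends at 15k iterations:

```python
# Loss weights reported in the paper.
LAMBDA_CLUSTER = LAMBDA_TRIPLET = LAMBDA_3D = 0.1

def total_loss(l_render, l_cluster, l_triplet, l_3d, iteration, adc_end=15_000):
    """Combine rendering + segmentation losses; the triplet and 3D terms
    are switched on only after adaptive density control has completed."""
    loss = l_render + LAMBDA_CLUSTER * l_cluster
    if iteration >= adc_end:
        loss = loss + LAMBDA_TRIPLET * l_triplet + LAMBDA_3D * l_3d
    return loss
```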

Key Experimental Results

Main Results

ScanNet and Replica3D (\(\text{PQ}^\text{scene}\) metric):

| Dataset | DM-NeRF | Panoptic-Lifting | Contrastive-Lift | Gaussian-Grouping | UniC-Lift |
| --- | --- | --- | --- | --- | --- |
| ScanNet | 41.7 | 58.9 | 62.3 | 61.83 | 63.0 |
| Replica3D | 44.1 | 57.9 | 59.1 | 66.52 | 88.7 |

On Replica3D, UniC-Lift outperforms Gaussian-Grouping by more than 22 \(\text{PQ}^\text{scene}\) points (88.7 vs. 66.52), roughly a 1.3× improvement.

Messy-Rooms dataset (\(\text{PQ}^\text{scene}\) metric, varying number of objects):

| Method | 25 obj | 50 obj | 100 obj | 500 obj | Mean |
| --- | --- | --- | --- | --- | --- |
| Panoptic-Lifting | 73.2 | 69.9 | 64.3 | 51.0 | 63.2 |
| Contrastive-Lift | 78.9 | 75.8 | 69.1 | 55.0 | 69.0 |
| UniC-Lift | 86.0 | 79.1 | 70.8 | 57.4 | 71.5 |

UniC-Lift achieves the best performance on 6 of the 8 Messy-Rooms scenes.

Ablation Study

Effect of individual loss components (Replica3D):

| Configuration | \(\text{PQ}^\text{scene}\) | mIoU | Note |
| --- | --- | --- | --- |
| CL + 3D Reg | 88.0 | 94.4 | w/o triplet loss |
| CL only | 83.7 | 91.8 | w/o triplet or 3D reg |
| CL + Triplet (MLP) | 89.0 | 95.2 | w/o 3D reg |
| CL + Triplet (no MLP) + 3D | 88.0 | 94.0 | w/o linear projection |
| All (CL + Triplet (MLP) + 3D) | 89.0 | 95.4 | Final configuration |

Training time comparison (single Replica scene, NVIDIA A6000):

| Method | Training Time |
| --- | --- |
| Panoptic-Lifting | >20 hours |
| Contrastive-Lift | >15 hours |
| UniC-Lift | <40 minutes |

Key Findings

  1. The Embedding-to-Label procedure eliminates clustering post-processing: Compared to Contrastive-Lift + 3DGS, UniC-Lift achieves comparable quantitative metrics while completely eliminating the clustering step (42 min vs. 85 min).
  2. Boundary hard mining significantly accelerates convergence: Boundary triplets reach the same quality in 25k iterations that random triplets need 50k iterations to achieve.
  3. Low-resolution mask training does not degrade results: Training with 0.5× resolution masks produces visually indistinguishable results from full-resolution training.
  4. Sparse masks suffice for training: Using only 5% of segmentation masks yields results close to training with the full mask set.

Highlights & Insights

  1. Elegance of the core innovation: The observation that contrastive learning in a sigmoid-constrained space naturally drives embeddings toward hypercube corners — reducing label assignment to simple binary decoding — is a particularly elegant and insightful mathematical finding.
  2. From \(O(n \log c)\) to \(O(n)\): The reduction in inference complexity has practical significance, especially for large-scale scenes.
  3. 20–30× training speedup: Reducing training time from 15–20 hours to under 40 minutes makes the method practically deployable.
  4. Downstream applications: High-quality 3D segmentation directly supports object extraction and scene editing, demonstrating the practical utility of the approach.

Limitations & Future Work

  1. Embedding dimensionality limits instance capacity: A \(d=12\) embedding theoretically supports at most \(2^{12} = 4096\) instances, which may be insufficient for very large-scale scenes.
  2. Restricted to static scenes: The method has not been extended to dynamic scenes.
  3. Dependence on 2D segmentation quality: While consistency is not required, reasonable 2D segmentation inputs are still necessary.
  4. Residual artifacts at boundaries: Although the hard mining strategy mitigates boundary issues, the authors acknowledge that this problem is not fully resolved.

Related Work & Connections

  • Contrastive-Lift: The most direct baseline for this work; UniC-Lift merges its two-stage pipeline into a single stage.
  • 3DGS series: UniC-Lift exploits 3DGS's efficient rendering and its support for extensible per-primitive attribute storage.
  • Binary coding perspective: The idea of driving embeddings toward hypercube corners can be generalized to other settings requiring discrete representations.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ — The Embedding-to-Label idea is highly elegant and novel
  • Experimental Thoroughness: ⭐⭐⭐⭐ — Three datasets with thorough ablations; experiments on larger-scale scenes are lacking
  • Writing Quality: ⭐⭐⭐⭐ — Motivation is clearly articulated; toy experiments effectively illustrate the core idea
  • Value: ⭐⭐⭐⭐⭐ — A 20–30× speedup makes real-world deployment genuinely feasible