
OnlinePG: Online Open-Vocabulary Panoptic Mapping with 3D Gaussian Splatting

Conference: CVPR 2026 | arXiv: 2603.18510 | Institution: State Key Lab of CAD&CG, Zhejiang University; VIVO BlueImage Lab; HKUST | Area: 3D Vision | Keywords: Panoptic Mapping, Open-Vocabulary, 3D Gaussian Splatting, Online Reconstruction, Instance Segmentation

TL;DR

This paper proposes OnlinePG, the first online open-vocabulary panoptic mapping system built upon 3DGS. It adopts a local-to-global paradigm: within a sliding window, a multi-cue clustering graph (geometric overlap + semantic similarity + view consensus) constructs locally consistent 3D instances, which are then incrementally merged into a global map via bidirectional bipartite matching. OnlinePG achieves state-of-the-art semantic and panoptic segmentation among online methods, surpassing OnlineAnySeg by +17.2 mIoU on ScanNet (48.48), while running at 10–18 FPS.

Background & Motivation

Background: Open-vocabulary 3D scene understanding is foundational for embodied intelligence. Recent works lift 2D VLM features (CLIP, LSeg, SAM) into 3D space via NeRF/3DGS, achieving strong results in offline settings (LangSplat, OpenGaussian, PanoGS, etc.).

Limitations of Prior Work:

  • (a) Offline Constraint: Most methods (PanoGS, LangSplat, OpenGaussian) require pre-collected complete data and global optimization, making them unsuitable for real-time robotic tasks.
  • (b) Lack of Instance-Level Understanding: The online method O2V-Mapping provides only semantic segmentation, without distinguishing individual instances of the same class.
  • (c) 2D Segmentation Noise: 2D segmentations from VLMs are inconsistent across views (over- and under-segmentation), so directly lifting them to 3D accumulates noise.
  • (d) Slow Contrastive Learning Convergence: Offline methods (InstanceGaussian, PanoGS) rely on slowly converging contrastive feature learning for instance clustering, which is unsuitable for online systems.

Key Challenge: 2D VLM segmentation results are inconsistent across views (over/under-segmentation), and direct 3D lifting produces noisy instances. The core challenge is obtaining 3D-consistent panoptic instances and semantics under online streaming input.

Goal: (1) Online panoptic (instance + semantic) mapping; (2) deriving consistent 3D instances from noisy 2D segmentations; (3) open-vocabulary querying.

Key Insight: Resolve 2D inconsistencies locally within a sliding window via multi-cue clustering, then incrementally merge into a global map.

Core Idea: A local-to-global paradigm — multi-cue segment clustering within a sliding window → locally consistent instances → bidirectional bipartite matching → globally consistent panoptic map.

Method

Overall Architecture

Input: RGB-D stream with poses. Output: 3D Gaussian panoptic map with open-vocabulary query capability.

Pipeline overview:

RGB-D Stream → Keyframe Selection (every 20 frames) → Sliding Window (size 12)
  → 2D VLM Feature Extraction (LSeg + EntitySeg)
  → 3D Gaussian Segment Initialization (grouped by mask)
  → [every 7 keyframes] Multi-Cue Clustering Graph (geometric + semantic + view consensus)
  → Locally Consistent 3D Instances + Spatial Attribute Grid
  → Bidirectional Bipartite Matching → Global Map Incremental Fusion
  → Global Panoptic Map + Open-Vocabulary Query
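Read top to bottom, this is a streaming loop with a periodic clustering-and-fusion step. A minimal Python sketch follows, assuming the paper's settings (keyframe stride 20, window size 12, clustering every 7 keyframes); the callables and their signatures are hypothetical placeholders for the stages above, not the authors' API.

```python
from collections import deque

# Paper settings: keyframe every 20 frames, 12-keyframe window,
# clustering + fusion every 7 keyframes.
KEYFRAME_STRIDE = 20
WINDOW_SIZE = 12
CLUSTER_EVERY = 7

def online_panoptic_mapping(rgbd_stream, extract_vlm, init_segments,
                            cluster_window, fuse_into_global, global_map):
    """Skeleton of the online loop; all helper callables are caller-supplied stubs."""
    window = deque(maxlen=WINDOW_SIZE)  # sliding window of keyframe segments
    n_keyframes = 0
    for i, frame in enumerate(rgbd_stream):  # frame = (rgb, depth, pose)
        if i % KEYFRAME_STRIDE != 0:
            continue  # non-keyframes would only feed 3DGS photometric optimization
        feats, masks = extract_vlm(frame)                  # LSeg features + EntitySeg masks
        window.append(init_segments(frame, feats, masks))  # per-mask 3D segments
        n_keyframes += 1
        if n_keyframes % CLUSTER_EVERY == 0:
            local_instances = cluster_window(window)       # multi-cue clustering graph
            fuse_into_global(local_instances, global_map)  # bidirectional matching
    return global_map
```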

Key Design 1: Scene Representation — 3D Gaussians + Voxelized Spatial Attributes

Function: Model geometric and semantic information using two complementary representations.

3D Gaussian Primitives: Each Gaussian is defined as \(\mathcal{G}_i := \{\boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i, \sigma_i, \boldsymbol{c}_i\}\), comprising position \(\boldsymbol{\mu}_i \in \mathbb{R}^3\), covariance \(\boldsymbol{\Sigma}_i \in \mathbb{R}^{3 \times 3}\), opacity \(\sigma_i\), and color \(\boldsymbol{c}_i \in \mathbb{R}^3\), used for geometric reconstruction and differentiable rendering.

Voxelized Spatial Attributes: The reconstructed region is voxelized (voxel size 3 cm), and each occupied voxel stores four attributes:

| Attribute | Symbol | Dimension | Purpose |
|---|---|---|---|
| Language Feature | \(\mathcal{F}\) | \(\mathbb{R}^{512}\) | Stores VLM language features for open-vocabulary querying |
| Feature Confidence | \(\mathcal{C}\) | \(\mathbb{R}\) | Weight for multi-view feature fusion |
| Instance Label | \(\mathcal{T}\) | \(\mathbb{R}\) | Panoptic instance ID |
| Instance Weight | \(\mathcal{K}\) | \(\mathbb{R}\) | Confidence of instance label for global fusion decisions |

Design Motivation: 3D Gaussians are a continuous representation suited for geometric optimization and rendering; discrete voxel grids are suited for storing instance labels and performing incremental updates. The two representations are complementary, each leveraged for its strengths.
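To make the split concrete, here is a minimal sketch of such a sparse attribute grid in Python. The class layout, method names, and dict-of-voxel-keys storage are illustrative assumptions, not the paper's implementation; it also shows how the stored \(\mathcal{F}\) features would serve an open-vocabulary query via cosine similarity.

```python
import numpy as np

class VoxelAttributeGrid:
    """Sparse voxel grid holding the four per-voxel attributes from the table above.
    Hypothetical sketch; storage layout is an assumption, not the paper's code."""

    def __init__(self, voxel_size=0.03, feat_dim=512):  # 3 cm voxels, 512-d features
        self.voxel_size = voxel_size
        self.feat = {}    # F: language feature
        self.conf = {}    # C: feature confidence
        self.label = {}   # T: panoptic instance ID
        self.weight = {}  # K: instance-label weight

    def key(self, xyz):
        """Quantize a 3D point to its voxel index."""
        return tuple(np.floor(np.asarray(xyz) / self.voxel_size).astype(int))

    def fuse_feature(self, xyz, f_new, c_new):
        """Confidence-weighted running average of language features (cf. update rules below)."""
        k = self.key(xyz)
        c_old = self.conf.get(k, 0.0)
        f_old = self.feat.get(k, np.zeros_like(f_new))
        self.conf[k] = c_old + c_new
        self.feat[k] = (c_old * f_old + c_new * f_new) / self.conf[k]

    def query(self, text_feat):
        """Open-vocabulary query: cosine similarity of each voxel feature to a text embedding."""
        keys = list(self.feat)
        F = np.stack([self.feat[k] for k in keys])
        sims = (F @ text_feat) / (np.linalg.norm(F, axis=1) * np.linalg.norm(text_feat) + 1e-8)
        return dict(zip(keys, sims))
```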

Key Design 2: Multi-Cue Segment Clustering — From Noisy 2D to Consistent 3D Instances

Function: Merge inconsistent 3D segments from multiple keyframes within the sliding window into consistent 3D instances.

Step 1 — 3D Gaussian Segment Initialization: 3D Gaussian primitives are initialized by projecting depth maps from each keyframe and grouped into 3D segments \(\mathcal{S}_i := \{\mathcal{G}_j\}\) according to 2D mask IDs. The full set of segments in the window is \(\mathcal{S} := \{\mathcal{S}_1, \cdots, \mathcal{S}_n\}\), where \(n = \sum_{i \in \mathcal{W}} |m_i|\) is the total number of 2D masks across the window's keyframes.

Step 2 — Building the Multi-Cue Clustering Graph: Graph vertices are 3D segments \(\mathcal{S}_i\); edges \(\mathcal{E}_{ij}\) are determined jointly by three affinity cues:

(1) Geometric Overlap Cue \(\mathcal{O}\): Segments are voxelized and the bidirectional visible voxel overlap ratio is computed as a symmetric average:

\[\mathcal{O}(\mathcal{S}_i, \mathcal{S}_j) = \frac{1}{2} \cdot \left(\frac{|\mathcal{S}_i \cap \mathcal{S}_j|}{\text{Cont.}(\mathcal{S}_i, \mathcal{S}_j)} + \frac{|\mathcal{S}_i \cap \mathcal{S}_j|}{\text{Cont.}(\mathcal{S}_j, \mathcal{S}_i)}\right)\]

where \(\text{Cont.}(\mathcal{S}_i, \mathcal{S}_j)\) counts the visible voxels of \(\mathcal{S}_j\) contained when projected back to the viewpoint of \(\mathcal{S}_i\). The symmetric average mitigates bias between segments of different sizes.
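A small sketch of this cue, assuming the paper's 3 cm voxelization; because the visibility computation behind \(\text{Cont.}\) is paper-specific, the two containment counts are taken as inputs here.

```python
import numpy as np

def voxel_keys(points, voxel_size=0.03):
    """Quantize a segment's points (n, 3) to a set of voxel indices (3 cm grid)."""
    return {tuple(v) for v in np.floor(points / voxel_size).astype(int)}

def geometric_overlap(vox_i, vox_j, cont_ij, cont_ji):
    """Symmetric-average overlap cue O(S_i, S_j).
    cont_ij / cont_ji are the visible-voxel containment counts Cont.(S_i, S_j)
    and Cont.(S_j, S_i), supplied by the caller's visibility check."""
    inter = len(vox_i & vox_j)
    if cont_ij == 0 or cont_ji == 0:
        return 0.0
    return 0.5 * (inter / cont_ij + inter / cont_ji)
```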

(2) Semantic Similarity Cue \(\mathcal{X}\): Segment-level language features \(z_i = \Phi(\{f(u,v): m(u,v) = i\})\) are obtained by average-pooling LSeg feature maps over 2D masks, and cosine similarity is computed:

\[\mathcal{X}(\mathcal{S}_i, \mathcal{S}_j) = \frac{z_i \cdot z_j}{\|z_i\| \cdot \|z_j\|}\]

(3) View Consensus Cue \(\mathcal{V}\): The proportion of co-visible keyframes in which two segments are labeled as the same instance:

\[\mathcal{V}(\mathcal{S}_i, \mathcal{S}_j) = \frac{N_{\text{supp}}(\mathcal{S}_i, \mathcal{S}_j)}{N_{\text{vis}}(\mathcal{S}_i, \mathcal{S}_j)}\]

Step 3 — Clustering Criterion and Connected Component Merging: A composite criterion determines whether two segments should be merged:

\[\Delta_{ij} = \left[(\mathcal{O}_{ij} + \mathcal{X}_{ij}) > \lambda_1\right] \lor \left[\mathcal{V}_{ij} > \lambda_2\right]\]

where \(\lambda_1 = 1.5\) and \(\lambda_2 = 0.8\). Consistent instances are obtained via connected component analysis: \(\mathcal{I} = \text{Cluster}(\{\mathcal{S}_i\}, \{\mathcal{E}_{ij}\})\).

Design Motivation: No single cue suffices — geometric overlap fails for co-located segments with different semantics; semantic similarity fails for same-class different instances; view consensus is unreliable under sparse observations. The three cues are complementary, and the OR logic in the merging criterion enables correct associations across diverse scene configurations.
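Given precomputed cue matrices, the merge criterion and connected-component step reduce to a few lines. A NumPy/SciPy sketch follows; the matrix-based data layout is an assumption for illustration, not the paper's code.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def semantic_similarity(Z):
    """Pairwise cosine-similarity cue X from segment-level language features Z (n, d)."""
    Zn = Z / (np.linalg.norm(Z, axis=1, keepdims=True) + 1e-8)
    return Zn @ Zn.T

def cluster_segments(O, Z, V, lam1=1.5, lam2=0.8):
    """Connected-component clustering over the multi-cue graph (paper's lambda settings).
    O: (n, n) geometric overlap; Z: (n, d) segment features; V: (n, n) view consensus."""
    X = semantic_similarity(Z)
    edges = ((O + X) > lam1) | (V > lam2)  # OR-combined merge criterion Delta_ij
    np.fill_diagonal(edges, False)
    _, labels = connected_components(csr_matrix(edges), directed=False)
    return labels  # labels[i] = local instance ID of segment S_i
```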

Key Design 3: Bidirectional Bipartite Matching — Robust Local-to-Global Fusion

Function: Incrementally merge locally consistent instances from the sliding window into the global map, ensuring global consistency.

Step 1 — Constructing Bidirectional Matching Matrices:

Forward matrix \(\mathcal{M}_{l \to g} \in \mathbb{R}^{n_l \times n_g}\) (semantic similarity + geometric containment ratio):

\[\mathcal{M}_{l \to g} = \frac{z_l \cdot z_g}{\|z_l\| \cdot \|z_g\|} + \frac{|\mathcal{I}_l \cap \mathcal{I}_g|}{\text{Cont.}(\mathcal{I}_l, \mathcal{I}_g)}\]

Backward matrix \(\mathcal{M}_{g \to l} \in \mathbb{R}^{n_g \times n_l}\): reverses the geometric containment direction.

Step 2 — Hungarian Algorithm + Intersection Confirmation:

\[\mathcal{A} = \text{Hung.}(\mathcal{M}_{l \to g}) \cap \text{Hung.}(\mathcal{M}_{g \to l})^T\]
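A sketch of this step using SciPy's Hungarian solver (`scipy.optimize.linear_sum_assignment`); the optional `min_score` floor is a hypothetical safeguard against degenerate pairs, not a documented part of the method.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def bidirectional_match(M_lg, M_gl, min_score=0.0):
    """Intersection of forward and backward Hungarian assignments.
    M_lg: (n_l, n_g) local-to-global affinities; M_gl: (n_g, n_l) global-to-local."""
    r_f, c_f = linear_sum_assignment(M_lg, maximize=True)  # forward matches (l, g)
    fwd = set(zip(r_f, c_f))
    r_b, c_b = linear_sum_assignment(M_gl, maximize=True)  # backward matches (g, l)
    bwd = {(l, g) for g, l in zip(r_b, c_b)}
    # Keep only correspondences confirmed in both directions.
    return [(l, g) for (l, g) in fwd & bwd if M_lg[l, g] > min_score]
```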

Step 3 — Global Map Update Rules:

  • Language feature fusion (confidence-weighted average): \(\mathcal{F}_g^t(v) = \frac{\mathcal{C}_l^t \cdot \mathcal{F}_l^t + \mathcal{C}_g^{t-1} \cdot \mathcal{F}_g^{t-1}}{\mathcal{C}_g^t}\), with \(\mathcal{C}_g^t = \mathcal{C}_l^t + \mathcal{C}_g^{t-1}\)
  • Matched instances: retain the global label and accumulate weight, \(\mathcal{K}_g^t = \mathcal{K}_g^{t-1} + \mathcal{K}_l^t\)
  • Unmatched, local weight ≤ global: retain the global label and reduce its weight, \(\mathcal{K}_g^t = \mathcal{K}_g^{t-1} - \mathcal{K}_l^t\)
  • Unmatched, local weight > global: replace with the local label, \(\mathcal{T}_g^t = \mathcal{T}_l^t\), taking the weight difference as the new weight
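These rules translate directly into a small per-voxel update; a sketch in the notation above (function and argument names are illustrative).

```python
def update_instance_label(T_g, K_g, T_l, K_l, matched):
    """Per-voxel instance-label update following the three rules above.
    T_*: instance labels, K_*: instance weights; returns the new (label, weight)."""
    if matched:                # local and global refer to the same instance
        return T_g, K_g + K_l  # keep global label, accumulate weight
    if K_l <= K_g:             # unmatched, but global evidence is still stronger
        return T_g, K_g - K_l  # keep global label, decay its weight
    return T_l, K_l - K_g      # unmatched and local wins: replace the label
```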

Design Motivation: The local map contains newly explored regions while the global map encodes historical ones, making the geometric containment ratio inherently asymmetric. Bidirectional matching requires agreement in both directions before confirming a correspondence, preventing erroneous fusion. The weight competition mechanism allows high-confidence segmentations to progressively supersede low-confidence ones.

Loss & Training

3DGS optimization uses a weighted combination of appearance and geometric L1 losses:

\[\mathcal{L} = \alpha \cdot \mathcal{L}_c + (1-\alpha) \cdot \mathcal{L}_d, \quad \alpha = 0.9\]

Upon each new keyframe insertion, 5 historical frames are randomly selected for 20 optimization iterations. Semantic and instance information do not participate in gradient optimization; they are maintained via the discrete update mechanism of the voxel grid.
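A minimal PyTorch sketch of this objective, assuming rendered and ground-truth RGB/depth tensors of matching shape.

```python
import torch

def mapping_loss(render_rgb, gt_rgb, render_depth, gt_depth, alpha=0.9):
    """Weighted appearance + depth L1 loss for 3DGS optimization (alpha = 0.9)."""
    l_c = torch.abs(render_rgb - gt_rgb).mean()      # appearance (color) L1
    l_d = torch.abs(render_depth - gt_depth).mean()  # geometric (depth) L1
    return alpha * l_c + (1.0 - alpha) * l_d
```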

Key Experimental Results

Main Results (3D Semantic and Panoptic Segmentation)

| Method | Online | Panoptic | ScanNet mIoU↑ | ScanNet mAcc↑ | ScanNet PRQ(T)↑ | ScanNet PRQ(S)↑ | Replica mIoU↑ | Replica PRQ(T)↑ | Replica PRQ(S)↑ |
|---|---|---|---|---|---|---|---|---|---|
| LangSplat* | ✗ | ✗ | 29.47 | 45.29 | 22.57 | 28.44 | 4.82 | 8.29 | 1.28 |
| OpenGaussian* | ✗ | ✗ | 24.89 | 37.35 | 22.87 | 19.71 | — | — | — |
| OpenScene* | ✗ | ✗ | 47.63 | 69.74 | 43.53 | 40.43 | 49.03 | 33.04 | 11.84 |
| InstanceGaussian | ✗ | ✓ | 34.14 | 54.95 | 39.04 | 27.41 | — | — | — |
| PanoGS | ✗ | ✓ | 50.72 | 70.20 | 33.84 | 36.22 | 54.98 | 43.04 | 30.60 |
| O2V-Mapping | ✓ | ✗ | 33.74 | 55.52 | — | — | 24.35 | — | — |
| OnlineAnySeg | ✓ | ✓ | 31.28 | 52.20 | 35.98 | 26.27 | 37.48 | 34.19 | 9.13 |
| OnlinePG (Ours) | ✓ | ✓ | 48.48 | 66.01 | 37.97 | 41.81 | 47.93 | 41.02 | 12.83 |

* Methods marked with an asterisk use supervised 3D instance segmentation as auxiliary input for PRQ computation; "—" denotes results not reported.

Ablation Study 1: Matching Strategy (ScanNet)

| Matching Strategy | PRQ(T) | PRQ(S) |
|---|---|---|
| #1 Nearest Neighbor Matching | 24.67 | 22.98 |
| #2 Forward-Only \(\mathcal{M}_{l \to g}\) | 35.83 | 38.40 |
| #3 Backward-Only \(\mathcal{M}_{g \to l}\) | 33.71 | 42.72 |
| #4 Bidirectional Bipartite Matching (Full) | 37.97 | 41.81 |

Ablation Study 2: System Components (ScanNet)

| Configuration | mIoU | PRQ(T) | PRQ(S) |
|---|---|---|---|
| #1 w/o Segment Clustering (direct keyframe segment fusion) | 48.48 | 32.25 | 30.68 |
| #2 w/o Feature Grid (instance-level coarse features only) | 30.40 | 26.71 | 24.92 |
| #3 Full System | 48.48 | 37.97 | 41.81 |

Key Findings

  • Best among online methods: mIoU exceeds OnlineAnySeg by +17.2 (ScanNet) and +10.5 (Replica).
  • Large gains in panoptic segmentation: PRQ(S) surpasses OnlineAnySeg by +15.5 (ScanNet), with particularly notable improvements on Stuff categories.
  • Approaches or exceeds offline SOTA: The mIoU gap with PanoGS is only 2.24; PRQ(S) surpasses PanoGS (41.81 vs. 36.22).
  • Segment clustering is critical: Removing it causes PRQ(S) to drop by 11.13, especially for large-area Stuff regions.
  • Feature grid is critical: Removing it causes mIoU to drop by 18.08; instance-level coarse features lead to severe semantic drift.
  • Bidirectional vs. nearest-neighbor matching: PRQ improves by +13.3/+18.8; backward verification is especially important for Stuff (+4.3).
  • Multi-cue clustering: Outperforms single-cue baselines by 8–18 PRQ points, with only ~40 ms additional latency.
  • Real-time performance: 18 FPS for simple scenes, 10 FPS for complex scenes (excluding VLM frontend).
  • Open-vocabulary advantage: Significantly outperforms OnlineAnySeg on long-tail concepts (e.g., "bag") and multi-instance queries of the same semantic class (e.g., "pillow" ×N).

Highlights & Insights

  1. First online 3DGS system with panoptic + open-vocabulary capabilities: Unifies geometric reconstruction, instance segmentation, and open-vocabulary understanding.
  2. Local-to-Global paradigm: 2D inconsistencies are resolved locally within the sliding window (small-scale, tractable), and only lightweight matching is performed for incremental global fusion — decomposing a hard global problem into two manageable steps.
  3. Multi-cue clustering graph: Geometric, semantic, and view-consensus cues are complementary; processes a 12-frame window in only ~350 ms.
  4. Bidirectional bipartite matching: Taking the intersection of forward and backward Hungarian assignments is more robust than unidirectional matching and elegantly handles the asymmetry between local and global maps.
  5. Complementary voxel attributes and Gaussians: Gaussians handle continuous rendering optimization; voxels handle discrete semantic/instance updates — each exploited for its respective strengths.
  6. Substantial lead among online methods: Both mIoU and PRQ significantly exceed OnlineAnySeg and O2V-Mapping, and several metrics surpass the offline method PanoGS.

Limitations & Future Work

  1. No support for dynamic objects: The system handles only static scenes; dynamic objects lead to erroneous reconstruction.
  2. Dependency on depth and pose inputs: Requires RGB-D data and known camera poses, limiting applicability to RGB-only settings.
  3. VLM frontend latency excluded: The reported 10–18 FPS does not include inference time for LSeg and EntitySeg; the full system's real-time capability is bottlenecked by the frontend.
  4. Primarily indoor scenes: Evaluated only on ScanNetV2 and Replica indoor datasets; large-scale outdoor scenes are not assessed.
  5. Fixed hyperparameters: \(\lambda_1\), \(\lambda_2\), voxel size, clustering frequency, etc. are manually set; adaptive tuning could improve efficiency.

Rating

| Dimension | Score |
|---|---|
| Novelty | ⭐⭐⭐⭐ |
| Technical Depth | ⭐⭐⭐⭐ |
| Experimental Thoroughness | ⭐⭐⭐⭐⭐ |
| Writing Quality | ⭐⭐⭐⭐ |
| Value | ⭐⭐⭐⭐⭐ |