MSGNav: Unleashing the Power of Multi-modal 3D Scene Graph for Zero-Shot Embodied Navigation
Conference: CVPR 2026
arXiv: 2511.10376
Code: https://github.com/ylwhxht/MSGNav
Area: 3D Vision / Embodied Navigation
Keywords: Multi-modal 3D scene graph, zero-shot navigation, open-vocabulary, closed-loop reasoning, last-mile problem
TL;DR
This paper proposes a Multi-modal 3D Scene Graph (M3DSG) that replaces conventional text-based relation edges with dynamically assigned image edges, and builds a zero-shot navigation system MSGNav comprising four modules: Key Subgraph Selection, Adaptive Vocabulary Update, Closed-Loop Reasoning, and Visibility Viewpoint Decision. MSGNav achieves 52.0% SR on GOAT-Bench and 74.1% SR on HM3D-ObjNav, both state-of-the-art.
Background & Motivation
Background: Embodied navigation requires robots to autonomously explore unknown environments and reach targets specified by object category, textual description, or reference image. Real-world deployment demands open-vocabulary generalization and low training overhead, making zero-shot approaches more attractive than task-specific RL training. Recent zero-shot methods based on explicit 3D scene graphs with LLM reasoning (e.g., SG-Nav, ConceptGraph) have achieved competitive performance on standard benchmarks.
Limitations of Prior Work: Conventional 3D scene graphs over-abstract inter-object relationships into simple text labels (e.g., "top", "beside"), causing three critical issues: (1) Costly construction — frequent MLLM calls to infer textual relations generate substantial token and time overhead; (2) Loss of visual evidence — converting visual observations into pure-text graphs discards visual information, introducing ambiguity and sensitivity to perception errors; (3) Vocabulary constraints — novel categories outside the predefined vocabulary cannot be represented, limiting generalization.
Key Challenge: Scene graphs must encode rich spatial-semantic relationships to support VLM reasoning, yet pure text edges irreversibly compress visual information, while retaining all raw images causes token explosion at inference time. Moreover, existing methods neglect the "last-mile" problem — knowing where a target is does not guarantee finding a suitable navigation viewpoint toward it.
Goal: To construct a 3D scene graph that preserves visual information while remaining efficient and scalable, and to leverage it for robust open-vocabulary embodied navigation in a zero-shot setting.
Key Insight: Replace text relation edges with dynamically assigned images, and employ a greedy subgraph selection algorithm to compress each VLM query to an average of only ~4 images; additionally introduce a visibility scoring mechanism to address last-mile viewpoint selection.
Core Idea: Using images rather than text as relation edges in the scene graph preserves visual evidence while avoiding frequent MLLM calls; combined with subgraph selection, this enables efficient inference.
Method
Overall Architecture
MSGNav is a zero-shot embodied navigation system. The core pipeline proceeds as follows: (1) incrementally construct the Multi-modal 3D Scene Graph M3DSG — nodes represent detected objects (with attributes including category, 3D coordinates, and room position), while edges store collections of RGB-D images capturing co-occurring object pairs; (2) extract a compact, goal-relevant subgraph from the full scene graph via three-stage Key Subgraph Selection (KSS: Compress–Focus–Prune); (3) dynamically expand the open vocabulary through Adaptive Vocabulary Update (AVU); (4) avoid repeated errors via a decision memory in Closed-Loop Reasoning (CLR); (5) select the optimal final navigation viewpoint through the Visibility Viewpoint Decision (VVD) module. The entire system uses GPT-4o as the VLM backbone and requires no training or fine-tuning.
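To make the control flow concrete, below is a minimal sketch of one decision step. Every callable (observe, update_graph, kss_select, avu_update, vvd_best_viewpoint) is a hypothetical stand-in for the corresponding module described above, not the authors' actual interface.

```python
# Minimal sketch of one MSGNav decision step (assumed interfaces, not the paper's code).

def msgnav_step(graph, goal, vlm, memory,
                observe, update_graph, kss_select, avu_update, vvd_best_viewpoint):
    """Compose the five modules described above for a single step."""
    rgb, depth, pose = observe()                       # current RGB-D observation and pose
    update_graph(graph, rgb, depth, pose)              # (1) incremental M3DSG construction
    subgraph, images = kss_select(graph, goal, vlm)    # (2) key subgraph selection (KSS)
    avu_update(graph, vlm)                             # (3) adaptive vocabulary update (AVU)
    decision = vlm(goal, subgraph, images, memory)     # (4) closed-loop reasoning (CLR)
    memory.append(decision)                            # remember feedback to avoid repeated errors
    if getattr(decision, "target_found", False):
        return vvd_best_viewpoint(graph, decision)     # (5) last-mile viewpoint selection (VVD)
    return decision                                    # otherwise: next exploration waypoint
```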
Key Designs
- Multi-modal 3D Scene Graph (M3DSG):
  - Function: Constructs a scene representation \(\mathbf{S}=(\mathbf{O}, \mathbf{E})\), where \(\mathbf{O}\) is the set of object nodes (with category, 3D coordinates, point cloud, visual features, and room position) and \(\mathbf{E}\) is the set of image edges — each edge stores a collection of RGB-D images containing the corresponding object pair.
  - Mechanism: YOLO-W detects objects, SAM extracts masks, and CLIP computes visual features; cross-frame object matching and merging are performed via spatial and visual similarity. Edge updates simply append the current frame to the image collection of co-occurring object pairs, requiring no MLLM inference for relation prediction, yielding highly efficient construction (a data-structure sketch is given after this list).
  - Design Motivation: The three fundamental flaws of text relation edges (costly, information-lossy, vocabulary-limited) all stem from the need to convert visual observations into text via MLLMs. Directly storing images eliminates this bottleneck at its root.
- Key Subgraph Selection (KSS, Compress–Focus–Prune):
  - Function: Extracts a compact, goal-relevant subgraph from the full scene graph (which may contain hundreds of nodes), reducing the token cost of VLM inference by over 95%.
  - Mechanism: The Compress stage reduces the scene graph to an adjacency list containing only node IDs and categories; the Focus stage feeds the compressed graph to the VLM to select the top-\(k\) relevant objects \(\mathbf{O}^{rel}\); the Prune stage applies a greedy dynamic allocation algorithm (Algorithm 1) to select the minimum number of images covering the maximum number of edges — averaging approximately 4 images per query (a greedy-selection sketch is given after this list).
  - Design Motivation: The scene graph grows continuously during exploration, making direct processing of the full graph both inefficient and liable to exceed VLM context limits; a strategy is needed that minimizes input while maximizing retained information.
- Visibility Viewpoint Decision (VVD):
  - Function: Addresses the "last-mile" problem — once the agent has correctly localized the target, this module selects a final navigation viewpoint with good visibility rather than simply choosing the nearest reachable point.
  - Mechanism: Candidate viewpoints \(\mathbf{V}_c\) are uniformly sampled around the target point cloud \(\mathcal{PC}_{\bar{o}}\); for each candidate \(\mathbf{v}_i\), a visibility score is computed as \(S_{\mathbf{v}_i} = \frac{1}{|\mathcal{PC}_{\bar{o}}|}\sum_{\mathbf{p} \in \mathcal{PC}_{\bar{o}}} \mathbb{1}_{\mathcal{E}(\mathbf{v}_i, \mathbf{p})}\), where \(\mathcal{E}(\mathbf{v}_i, \mathbf{p})\) indicates that the line of sight from \(\mathbf{v}_i\) to point \(\mathbf{p}\) is unobstructed; the viewpoint with the highest score, \(\mathbf{v}_{best}\), is selected (a visibility-scoring sketch is given after this list).
  - Design Motivation: Empirical analysis reveals that existing methods frequently fail not because they cannot approach the target (agents typically end up within 0.25–1.0 m of it) but because a poor final viewpoint (too close or occluded) causes the episode to fail evaluation. VVD improves SR at the standard 0.25 m threshold from 33.91% to 51.97%.
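To ground the M3DSG description, here is a minimal data-structure sketch. Field names and the frame-index edge payload are assumptions; the paper specifies only that nodes carry category, 3D coordinates, point cloud, features, and room position, and that edges store RGB-D images of co-occurring object pairs.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

import numpy as np


@dataclass
class ObjectNode:
    node_id: int
    category: str
    centroid: np.ndarray        # 3D coordinates of the object
    points: np.ndarray          # accumulated object point cloud, shape (N, 3)
    clip_feature: np.ndarray    # CLIP feature used for cross-frame matching
    room: str = "unknown"       # room position


@dataclass
class M3DSG:
    nodes: Dict[int, ObjectNode] = field(default_factory=dict)
    # Image edges: each co-occurring object pair maps to the list of frame ids
    # (standing in for the stored RGB-D images) in which both objects appear.
    edges: Dict[Tuple[int, int], List[int]] = field(default_factory=dict)

    def add_frame(self, frame_id: int, detections: List[ObjectNode]) -> None:
        """Incremental update: register detections and append the frame to pair edges.

        Cross-frame matching/merging by spatial and visual similarity is omitted here.
        """
        for det in detections:
            self.nodes.setdefault(det.node_id, det)
        ids = sorted(d.node_id for d in detections)
        for i in range(len(ids)):
            for j in range(i + 1, len(ids)):
                self.edges.setdefault((ids[i], ids[j]), []).append(frame_id)
```

Because edge updates are append-only, graph construction requires no per-frame MLLM calls, which is the source of the efficiency claim above.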
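The Prune stage of KSS is, in effect, a greedy set cover over images: repeatedly pick the frame that covers the most still-uncovered goal-relevant edges. The sketch below is a reconstruction under that reading of Algorithm 1, with assumed input formats (edge to frame-id lists), not the authors' exact code.

```python
from typing import Dict, List, Set, Tuple

Edge = Tuple[int, int]


def greedy_image_selection(edge_to_images: Dict[Edge, List[int]],
                           relevant_edges: Set[Edge]) -> List[int]:
    """Pick a small set of images that together cover the goal-relevant edges."""
    # Invert the mapping: which relevant edges does each image cover?
    image_to_edges: Dict[int, Set[Edge]] = {}
    for edge, images in edge_to_images.items():
        if edge in relevant_edges:
            for img in images:
                image_to_edges.setdefault(img, set()).add(edge)

    selected: List[int] = []
    uncovered = {e for e in relevant_edges if edge_to_images.get(e)}
    while uncovered:
        # Greedy step: pick the image covering the most still-uncovered edges.
        best_img, gain = max(
            ((img, edges & uncovered) for img, edges in image_to_edges.items()),
            key=lambda pair: len(pair[1]),
        )
        if not gain:
            break
        selected.append(best_img)
        uncovered -= gain
    return selected
```

With the relevant edges restricted to the top-\(k\) objects from the Focus stage, this terminates after a handful of picks, consistent with the roughly 4 images per query reported in the paper.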
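The visibility score \(S_{\mathbf{v}_i}\) can be reproduced by ray-marching from each candidate viewpoint to every target point through an occupancy grid; the grid representation and step size below are assumptions rather than the paper's exact occlusion test \(\mathcal{E}\).

```python
import numpy as np


def visibility_score(viewpoint: np.ndarray, target_points: np.ndarray,
                     occupied: np.ndarray, origin: np.ndarray,
                     voxel_size: float, step: float = 0.05) -> float:
    """Fraction of target points whose line of sight from the viewpoint is unobstructed.

    occupied: boolean voxel grid (X, Y, Z); origin: world coordinates of voxel (0, 0, 0).
    """
    visible = 0
    for p in target_points:
        ray = p - viewpoint
        dist = float(np.linalg.norm(ray))
        direction = ray / (dist + 1e-8)
        blocked = False
        # March along the ray, stopping just short of the target point itself.
        for t in np.arange(step, max(dist - voxel_size, step), step):
            voxel = np.floor((viewpoint + t * direction - origin) / voxel_size).astype(int)
            if np.any(voxel < 0) or np.any(voxel >= occupied.shape):
                continue
            if occupied[tuple(voxel)]:
                blocked = True
                break
        visible += 0 if blocked else 1
    return visible / max(len(target_points), 1)


def best_viewpoint(candidates: np.ndarray, target_points: np.ndarray,
                   occupied: np.ndarray, origin: np.ndarray, voxel_size: float) -> np.ndarray:
    """Select v_best: the candidate viewpoint with the highest visibility score."""
    scores = [visibility_score(v, target_points, occupied, origin, voxel_size)
              for v in candidates]
    return candidates[int(np.argmax(scores))]
```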
Loss & Training
MSGNav is a zero-shot method requiring no training or fine-tuning. It employs pretrained YOLO-W (detection), SAM (segmentation), CLIP (features), and GPT-4o (reasoning). AVU initializes the vocabulary \(V_0\) from ScanNet-200 and continuously expands it during exploration by having the VLM propose new vocabulary \(\hat{V}_t\) from image edges: \(V_t = V_{t-1} \cup \hat{V}_t\). CLR maintains a decision memory \(\mathbf{M}_t = \mathbf{M}_{t-1} \cup \mathcal{R}_t\), incorporating historical action feedback into current decisions.
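In code, both updates reduce to simple unions; a toy sketch, with a hypothetical propose_vocabulary callable standing in for the VLM query over image edges:

```python
from typing import Callable, Iterable, List, Set


def avu_step(vocab: Set[str], image_edges,
             propose_vocabulary: Callable[..., Iterable[str]]) -> Set[str]:
    """Adaptive Vocabulary Update: V_t is the union of V_{t-1} and the VLM-proposed categories."""
    proposed = propose_vocabulary(image_edges)   # VLM proposes new categories from image edges
    return vocab | set(proposed)


def clr_step(memory: List, feedback) -> List:
    """Closed-Loop Reasoning memory: M_t = M_{t-1} plus the latest feedback, kept as an ordered log."""
    return memory + [feedback]
```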
Key Experimental Results
Main Results
| Dataset | Metric | MSGNav | Prev. SOTA | Gain |
|---|---|---|---|---|
| GOAT-Bench | SR | 52.0% | 47.2% (MTU3D) | +4.8% |
| GOAT-Bench | SPL | 29.6% | 27.7% (MTU3D) | +1.9% |
| HM3D-ObjNav | SR | 74.1% | 72.2% (WMNav) | +1.9% |
| HM3D-ObjNav | SPL | 33.4% | 33.3% (WMNav) | +0.1% |
Note: GOAT-Bench involves three multi-modal goal types (category, language, image). MSGNav is a zero-shot method, whereas MTU3D requires training. On GOAT-Bench, SR by goal type is: Category 63.6%, Language 57.2%, Image 59.1%.
Ablation Study
| Configuration | SR | SPL | Notes |
|---|---|---|---|
| Baseline (3D-Mem) | 28.8% | 20.2% | No modules |
| +M3DSG | 43.8% | 28.0% | Core contribution of scene graph: +15.0% SR |
| +M3DSG+VVD | 56.3% | 34.7% | Last-mile module: +12.5% SR |
| +M3DSG+VVD+AVU+CLR | 60.0% | 37.0% | Full system |
Scene graph comparison (GOAT-Bench):
| Scene Graph Type | SR | SPL |
|---|---|---|
| Node-only (no edges) | 51.8% | 31.2% |
| Conventional text relation graph | 56.2% | 32.7% |
| M3DSG (image edges) | 60.0% | 37.0% |
Effect of VVD module at different success thresholds:
| Success threshold d(m) | SR w/o VVD | SR w/ VVD |
|---|---|---|
| 0.25 (standard) | 33.91% | 51.97% |
| 0.55 | 57.44% | 63.03% |
| 1.00 | 62.38% | 66.52% |
Key Findings
- M3DSG yields the largest gains over conventional scene graphs on Language and Image goals (+4.4% and +6.8% SR, respectively), indicating that image edges most benefit spatial reasoning for language- and vision-based targets.
- AVU and CLR each have limitations in isolation (AVU introduces noisy vocabulary; CLR can be overly conservative), but their combination is strongly complementary.
- A large proportion of failures occur in the 0.25–1.0m range — the agent has reached the vicinity of the target but suffers from poor viewpoints; VVD recovers these cases.
- KSS reduces token cost by over 95%, with an average of approximately 4 images per query.
Highlights & Insights
- The idea of replacing text relation edges with images directly addresses the fundamental flaw of conventional scene graphs — irreversible information compression. This seemingly simple design yields substantial gains; M3DSG alone contributes +15% SR.
- The identification and quantitative analysis of the "last-mile" problem are highly insightful: by examining SR across different success thresholds, the authors clearly demonstrate that the true cause of many task failures is viewpoint quality rather than localization accuracy.
- The greedy dynamic allocation algorithm (Algorithm 1) formalizes subgraph selection as a set cover problem, ensuring maximum relational edge coverage with minimum image count.
- That a zero-shot method outperforms the trained MTU3D highlights the decisive role of scene representation quality in navigation performance.
Limitations & Future Work
- Inference latency of VFMs and VLMs is the primary bottleneck limiting real-time deployment; the authors acknowledge that scene-graph-based methods suffer from low inference efficiency.
- VVD mitigates but does not fully resolve the last-mile problem (a performance gap remains even at the relaxed 1.0m threshold); the authors suggest further optimization via RL-based approaches.
- The advantage in SPL is marginal on HM3D-ObjNav (+0.1%), suggesting that VVD may trade shortest-path optimality for better viewpoints.
- Image edge storage in M3DSG grows with exploration; memory management strategies warrant further optimization.
- Validation is currently limited to simulated environments; real-world deployment performance remains unknown.
Related Work & Insights
- vs. ConceptGraph: ConceptGraph uses MLLMs to generate pure text relation edges, incurring high construction cost and losing visual information. M3DSG directly stores visual evidence via image edges, achieving higher construction efficiency with no information loss, and surpasses ConceptGraph by 3.8% SR on GOAT-Bench.
- vs. 3D-Mem: 3D-Mem also emphasizes the value of raw images but lacks a structured graph representation. MSGNav adds scene graph structure and multi-module reasoning on top of this foundation, improving SR from 28.8% to 60.0%.
- vs. SG-Nav: SG-Nav uses hierarchical prompting for navigation reasoning but is constrained by the expressiveness of text-based scene graphs. MSGNav's multi-modal scene graph provides richer contextual information.
- vs. VLFM/CompassNav: These zero-shot methods use VLMs for frontier evaluation but lack the relational reasoning capability afforded by graph structure.
Rating
- Novelty: ⭐⭐⭐⭐ — Using images in place of text relation edges is a novel and practical design; the identification and quantification of the last-mile problem demonstrate strong insight.
- Experimental Thoroughness: ⭐⭐⭐⭐ — State-of-the-art results on two mainstream benchmarks; multi-dimensional ablations (modules, scene graph types, VVD thresholds); solid analysis.
- Writing Quality: ⭐⭐⭐⭐ — Problem motivation is clearly articulated; the narrative mapping three limitations to three corresponding solutions flows naturally.
- Value: ⭐⭐⭐⭐ — Introduces a new scene representation paradigm for zero-shot embodied navigation; the last-mile problem analysis offers meaningful inspiration to the community.