
g3D-LF: Generalizable 3D-Language Feature Fields for Embodied Tasks

Conference: CVPR 2025
arXiv: 2411.17030
Code: MrZihan/g3D-LF
Area: Autonomous Driving / Embodied AI
Keywords: 3D Feature Fields, Vision-Language Navigation, Contrastive Learning, BEV Maps, Zero-Shot Object Navigation, Situated Question Answering

TL;DR

This paper proposes g3D-LF, which constructs generalizable 3D-language feature fields for unseen environments by pre-training with multi-level contrastive learning on approximately 5,000 indoor 3D scenes and nearly 1 million language descriptions. It achieves state-of-the-art (SOTA) or near-SOTA performance across four embodied tasks: monocular VLN, panoramic VLN, zero-shot object navigation, and situated question answering.

Background & Motivation

Background

Embodied agents need to understand 3D environments and interact with humans to perform tasks such as navigation and question answering. Generalizable 3D feature fields are a well-suited scene representation: they transfer to unseen scenes, can be constructed and updated in real time, and support an open-vocabulary semantic space. Existing feature field models (such as HNR) have demonstrated potential in tasks like VLN.

Limitations of Prior Work

  1. Lack of 3D spatial relationship understanding: Supervision for existing feature field models originates from 2D foundation models (e.g., CLIP, DINOv2), which limits their ability to grasp 3D spatial relationships.
  2. Semantic gap with language: These models lack language supervision during training, resulting in a significant semantic gap between the predicted representations and natural language.
  3. Deficiency in long-text comprehension: Large-scale representations (e.g., panoramas, BEV maps) are difficult to align fully with descriptive long texts that convey spatial relations and layouts.

Key Challenge

How can 3D feature fields not only encode visual features but also align with natural language at multiple granularities, so that they work effectively across a range of language-guided embodied tasks?

Key Insight

Leverage existing large-scale 3D-language datasets (e.g., SceneVerse) to align representations of different scales (region/view/panorama/BEV) with language of varying granularities (object categories/object descriptions/spatial relations/scene layouts) through a carefully designed multi-level contrastive learning framework.

Core Idea

Construct a multi-level contrastive learning framework with matched language alignment strategies tailored for the different scale outputs of the 3D feature field: object-vocabulary contrast at the region level, CLIP distillation combined with language contrast at the view level, and affinity-matrix-based fine-grained long-text contrast at the panorama/BEV levels.

Method

Overall Architecture

g3D-LF takes posed RGB-D images as input, extracting features using a CLIP image encoder and mapping them to 3D coordinates to construct a feature field. Novel-view feature maps are predicted via volume rendering, and then processed by multi-scale encoders (view encoder, panorama encoder, BEV encoder) to generate representations of different scales. During the pre-training stage, multi-level contrastive learning is used to align these representations with multi-granularity language.
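
The first step of the pipeline, lifting per-patch features into 3D, can be illustrated with a minimal NumPy sketch. It assumes a pinhole camera whose intrinsics `K` are expressed in patch-grid units, depth measured along the camera z-axis, and a 4×4 camera-to-world pose; the function name and these conventions are illustrative assumptions, not taken from the g3D-LF code.

```python
import numpy as np

def backproject_patches(depth, feats, K, cam_to_world):
    """Lift patch-level features to world-space feature points.

    depth:        (H, W) depth at each patch centre, along camera z (assumed).
    feats:        (H, W, C) patch features, e.g. from a CLIP image encoder.
    K:            (3, 3) pinhole intrinsics in patch-grid units (assumed).
    cam_to_world: (4, 4) camera pose.
    Returns (points, features) for patches with valid (positive) depth.
    """
    H, W = depth.shape
    # Patch-centre coordinates: u indexes columns, v indexes rows.
    u, v = np.meshgrid(np.arange(W) + 0.5, np.arange(H) + 0.5)
    z = depth
    x = (u - K[0, 2]) / K[0, 0] * z
    y = (v - K[1, 2]) / K[1, 1] * z
    pts_cam = np.stack([x, y, z, np.ones_like(z)], axis=-1).reshape(-1, 4)
    pts_world = (cam_to_world @ pts_cam.T).T[:, :3]
    valid = depth.reshape(-1) > 0  # drop patches without a depth reading
    return pts_world[valid], feats.reshape(H * W, -1)[valid]
```

Accumulating the returned pairs over all observed frames gives a point set playing the role of \(\mathcal{M}\); because new frames only append points, the field can be updated incrementally as the agent moves.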

Key Designs

1. Multi-scale 3D-Language Feature Field Encoding

  • Function: Construct generalizable 3D feature fields from RGB-D observations and generate scene representations at different scales.
  • Mechanism:
    • Feature Field Encoding: CLIP extracts patch-level visual features for each frame, which are mapped to 3D world coordinates using depth maps to form a set of feature points \(\mathcal{M}\).
    • Ray-View-Panorama Encoding: An MLP aggregates neighboring feature points within the feature field to predict semantic representations and volume density. Novel-view feature maps \(\mathbf{R} \in \mathbb{R}^{12 \times 12 \times 768}\) are synthesized via volume rendering, which are then processed by a Transformer view encoder and a panorama encoder to obtain multi-view representations.
    • Ray-BEV Encoding: Rendering rays are cast vertically from top to bottom to construct an agent-centric 16.8m×16.8m BEV map \(\hat{\mathbf{R}} \in \mathbb{R}^{168 \times 168 \times 768}\), which is downsampled via convolution and encoded by a Transformer BEV encoder to obtain the BEV representation.
  • Design Motivation: Novel-view representations are suitable for local target recognition and navigation planning; panoramic representations are suited for directional understanding; and BEV maps are ideal for broad spatial layout comprehension. Different embodied tasks require representations at different scales.
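
The volume rendering step can be sketched with standard NeRF-style quadrature. The sketch below assumes that the MLP has already produced a density and a feature vector for each sample along a ray; the function name and the sampling scheme are illustrative assumptions rather than the paper's exact implementation.

```python
import numpy as np

def render_ray_feature(sigmas, feats, deltas):
    """Composite per-sample features along one ray.

    sigmas: (K,) non-negative volume densities at the K samples.
    feats:  (K, C) predicted semantic features at the samples.
    deltas: (K,) distances between consecutive samples.
    Returns the rendered (C,) feature and the (K,) compositing weights.
    """
    alphas = 1.0 - np.exp(-sigmas * deltas)  # opacity of each segment
    # Transmittance: probability that the ray reaches sample k unoccluded.
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alphas[:-1]]))
    weights = trans * alphas
    return weights @ feats, weights
```

Under this formulation, a \(12 \times 12\) novel-view feature map corresponds to 144 such rays cast through the view frustum, while the BEV map uses the same compositing with rays cast vertically downward.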

2. Multi-Level Contrastive Learning

  • Function: Align 3D feature representations of different scales with language of corresponding granularities.
  • Mechanism:
    • Balanced Object-level Alignment: CLIP text embeddings of 1,883 indoor object categories form a vocabulary \(\mathcal{O}\), which supervises the volume-rendered ray representations through a cross-entropy classification loss. A balanced loss is adopted: the weight coefficient \(\alpha\) is increased for the 10% of rays with the highest loss, which typically correspond to hard-to-recognize small objects such as table lamps.
    • View-level CLIP Distillation: Perform contrastive learning on the predicted novel-view/panoramic/BEV representations using CLIP features of ground-truth images to maintain consistency with large-scale vision-language pre-trained models.
    • Fine-grained Long-text Contrast: Compute an affinity matrix \(\mathbf{A}_{(i,l)} = \text{CosSim}(\hat{\mathbf{R}}'_i, \mathbf{W}_l) / \tau\) between BEV map windows (5×5 regions) \(\hat{\mathbf{R}}'_i\) and the word features \(\mathbf{W}_l\) of a long-text description, with temperature \(\tau\). The mean of the top-\(L\) similarities is taken as the fine-grained matching score, which is optimized with a bidirectional cross-entropy loss.
  • Design Motivation: The scale of 3D-language data is much smaller than that of image-language data (millions vs. billions), necessitating the joint use of CLIP distillation (to maintain generalizability) and direct language contrast (to enhance language alignment). Fine-grained contrast achieves token-level matching through the affinity matrix, which is more suitable for long texts than global similarity. The balanced loss addresses the long-tail distribution problem dominated by floors/walls in indoor scenes.
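
The fine-grained long-text contrast can be sketched as follows. The text above does not specify how the top-\(L\) entries are selected, so this sketch assumes they are taken over the flattened affinity matrix; `top_k`, `tau=0.07`, and all function names are hypothetical choices for illustration.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def matching_score(windows, tokens, tau=0.07, top_k=8):
    """Fine-grained score between one BEV map and one text.

    windows: (N, C) BEV window features; tokens: (T, C) word features.
    """
    affinity = l2_normalize(windows) @ l2_normalize(tokens).T / tau  # (N, T)
    flat = affinity.reshape(-1)
    k = min(top_k, flat.size)
    # Assumption: average the k largest entries of the whole matrix.
    return np.partition(flat, -k)[-k:].mean()

def log_softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def bidirectional_contrastive_loss(batch_windows, batch_tokens, tau=0.07, top_k=8):
    """Symmetric cross-entropy over pairwise scores; pair i matches text i."""
    B = len(batch_windows)
    scores = np.array([
        [matching_score(batch_windows[i], batch_tokens[j], tau, top_k)
         for j in range(B)]
        for i in range(B)
    ])
    diag = np.arange(B)
    map_to_text = -log_softmax(scores, axis=1)[diag, diag].mean()
    text_to_map = -log_softmax(scores, axis=0)[diag, diag].mean()
    return 0.5 * (map_to_text + text_to_map)
```

Because only the strongest window-word matches contribute to the score, a long description is not penalized for words that have no counterpart in a given region, which is the motivation given above for preferring this over a single global similarity.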

3. Embodied Task Integration

  • Function: Seamlessly integrate pre-trained g3D-LF into baselines of various embodied tasks.
  • Mechanism:
    • VLN (Monocular): Replace the feature field in VLN-3DFF and incorporate the BEV map to enhance spatial layout understanding.
    • VLN (Panoramic): Replace the feature field in HNR to provide language-aligned panoramic representations for navigation planning.
    • Zero-Shot Object Navigation: Replace BLIP-2 in VLFM, using the 12 novel-view feature maps and the BEV map predicted by g3D-LF to compute similarity with the target object, building a value map to guide navigation.
    • Situated Question Answering: Train a localization decoder using the BEV map to predict location heatmaps and use panoramic representations to predict orientation, ultimately answering questions jointly.
  • Design Motivation: As a general-purpose 3D-language representation model, g3D-LF can enhance various embodied task baselines in a plug-and-play manner, eliminating the need to redesign representations for each specific task.
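
For the zero-shot object navigation integration, the value-map idea can be sketched as below. This is a simplified illustration: it scores each cell of a language-aligned feature map against the target's text embedding and keeps a running maximum, which is a stand-in rather than VLFM's or g3D-LF's actual fusion rule; all names are hypothetical.

```python
import numpy as np

def cosine_similarity_map(feature_map, text_embedding, eps=1e-8):
    """feature_map: (H, W, C); text_embedding: (C,). Returns (H, W) scores."""
    f = feature_map / (np.linalg.norm(feature_map, axis=-1, keepdims=True) + eps)
    t = text_embedding / (np.linalg.norm(text_embedding) + eps)
    return f @ t

def update_value_map(value_map, similarity, observed_mask):
    """Running-max update restricted to currently observed cells (assumption)."""
    updated = value_map.copy()
    updated[observed_mask] = np.maximum(
        value_map[observed_mask], similarity[observed_mask]
    )
    return updated

def best_frontier(value_map, frontiers):
    """Pick the frontier cell (row, col) with the highest value."""
    values = [value_map[r, c] for r, c in frontiers]
    return frontiers[int(np.argmax(values))]
```

A typical loop would initialize the value map to the minimum cosine score (for example `np.full((H, W), -1.0)`), update it after each observation, and navigate toward the frontier returned by `best_frontier`.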

Key Experimental Results

Vision-Language Navigation (VLN)

Monocular Setting (R2R-CE Val Unseen):

| Method | SR↑ | SPL↑ |
| --- | --- | --- |
| NaVid (LLM) | 37.4 | 35.9 |
| VLN-3DFF | 44.9 | 30.4 |
| g3D-LF | 47.2 | 34.6 |

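SR improves by 2.3 percentage points over VLN-3DFF (44.9 → 47.2), and SPL by 4.2 points (30.4 → 34.6). g3D-LF also exceeds the LLM-based NaVid in SR (47.2 vs. 37.4), although NaVid retains a slightly higher SPL (35.9 vs. 34.6).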
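
For reference, the two metrics in the table can be computed as in the sketch below, which follows the standard definitions of Success Rate and Success weighted by Path Length (SPL) used in embodied navigation benchmarks. The function names are illustrative, and shortest-path lengths are assumed to be positive.

```python
def success_rate(successes):
    """Fraction of episodes that end in success (successes are 0/1 flags)."""
    return sum(successes) / len(successes)

def spl(successes, shortest_lengths, path_lengths):
    """Success weighted by Path Length.

    Each successful episode contributes l / max(p, l), where l is the
    shortest-path length and p is the length of the path actually taken.
    """
    total = 0.0
    for s, l, p in zip(successes, shortest_lengths, path_lengths):
        total += s * l / max(p, l)
    return total / len(successes)

# Three episodes: an inefficient success, a failure, and an optimal success.
example_spl = spl([1, 0, 1], [10.0, 5.0, 8.0], [20.0, 5.0, 8.0])  # 0.5
```

SPL penalizes detours, which is why a method can lead in SR while trailing in SPL, as in the NaVid comparison here. Both metrics are reported as percentages in the table.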

Panoramic Setting (R2R-CE Val Unseen / Test Unseen): SPL reaches 52 / 51, both SOTA.

Zero-Shot Object Navigation

| Method | HM3D SR/SPL | MP3D SR/SPL |
| --- | --- | --- |
| VLFM (BLIP-2) | 52.5/30.4 | 36.4/17.5 |
| g3D-LF | 55.6/31.8 | 39.0/18.8 |

g3D-LF achieves performance comparable to or exceeding VLM-based methods using only the feature field, without any VLM or LLM.

Situated Question Answering (SQA3D)

  • Localization accuracy: Acc@0.5m and Acc@1m significantly outperform baselines.
  • g3D-LF achieves strong localization performance using only image inputs, without point clouds.
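
The thresholded localization metrics can be sketched as below. This assumes that Acc@d is the fraction of predicted positions lying within d metres of the ground truth, and that a position is read from the heatmap by taking its argmax. The 0.1 m cell size follows from the 168-cell, 16.8 m BEV map described earlier; the map-centre origin and the function names are assumptions.

```python
import numpy as np

def localization_accuracy(pred_xy, gt_xy, threshold_m):
    """Fraction of predictions within threshold_m metres of the ground truth."""
    diff = np.asarray(pred_xy, dtype=float) - np.asarray(gt_xy, dtype=float)
    dist = np.linalg.norm(diff, axis=-1)
    return float((dist <= threshold_m).mean())

def heatmap_to_xy(heatmap, cell_size_m=0.1):
    """Convert the argmax cell of a heatmap to metric (x, y).

    Assumes the origin lies at the map centre and cells are square.
    """
    H, W = heatmap.shape
    r, c = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    x = (c + 0.5) * cell_size_m - W * cell_size_m / 2
    y = (r + 0.5) * cell_size_m - H * cell_size_m / 2
    return np.array([x, y])
```

With predictions `[[0, 0], [1, 1], [3, 0]]` and ground truth `[[0.3, 0], [1, 1.8], [0, 0]]`, the errors are 0.3 m, 0.8 m, and 3 m, so one of three predictions counts at 0.5 m and two of three at 1 m.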

Key Findings

  1. g3D-LF is the first method to apply indoor 3D feature fields to zero-shot object navigation.
  2. The advantage in monocular VLN is significantly greater than in panoramic VLN, as the feature field compensates for the field-of-view limitations of monocular cameras.
  3. The balanced loss is crucial for small object recognition (without it, the recognition rate of small objects like table lamps drops significantly).
  4. Fine-grained long-text contrast significantly improves the spatial semantic representation quality of the BEV map.
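
Finding 3 refers to the balanced object-level loss. A minimal sketch of that idea follows, assuming per-ray logits are similarities between rendered ray features and the vocabulary text embeddings. The value `alpha=2.0` and the normalization by the sum of weights are illustrative assumptions; the source only states that \(\alpha\) is increased for the top 10% highest-loss rays.

```python
import numpy as np

def balanced_cross_entropy(logits, labels, alpha=2.0, hard_fraction=0.1):
    """Cross-entropy that up-weights the hardest rays.

    logits: (R, V) ray-to-vocabulary similarity scores.
    labels: (R,) integer category indices.
    Returns the weighted loss and the indices of the up-weighted rays.
    """
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    per_ray = -log_probs[np.arange(len(labels)), labels]
    n_hard = max(1, int(round(hard_fraction * len(per_ray))))
    hard_idx = np.argsort(per_ray)[-n_hard:]  # highest-loss rays
    weights = np.ones_like(per_ray)
    weights[hard_idx] = alpha
    loss = float((weights * per_ray).sum() / weights.sum())
    return loss, hard_idx
```

Since floors and walls dominate indoor views, most rays are easy; re-weighting by loss rank shifts the gradient toward the small minority of rays that land on small objects, without needing per-class frequency statistics.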

Highlights & Insights

  1. Well-motivated multi-level contrastive learning design: Each representation scale is paired with a matching language granularity and its own contrastive strategy, and each choice is backed by clear motivation and experimental validation.
  2. Strong versatility: A single pre-trained model can enhance the performance of four different embodied tasks in a plug-and-play manner, validating the potential of 3D-language feature fields as a general-purpose representation.
  3. Highly efficient data utilization: g3D-LF reuses existing datasets such as SceneVerse (around 1M language descriptions) without dedicated annotation, and pre-training takes about 10 days on two RTX 6000 Ada GPUs.
  4. Surpasses LLM-based methods without using an LLM: In monocular VLN and zero-shot navigation, g3D-LF outperforms LLM-based methods such as NaVid and InstructNav while avoiding the computational cost of a large language model.

Limitations & Future Work

  1. Pre-training is limited to indoor scenes (ScanNet, HM3D, Structured3D), preventing direct application to outdoor autonomous driving scenarios.
  2. Feature field construction relies heavily on depth map quality; in the real world (non-simulation), noise from depth sensors can degrade representation quality.
  3. Question answering accuracy (EM@1) on SQA3D is significantly lower than that of LLM-based methods, indicating that feature field representations still have limitations in complex reasoning.
  4. The volume rendering process incurs high computational overhead, potentially affecting real-time performance.

Related Work & Insights

  • Relationship with HNR: Building upon the generalizable feature field architecture of HNR, g3D-LF introduces 3D-language pre-training, representing a natural upgrade to HNR.
  • Comparison with CLIP/DINOv2 Distillation: Retains CLIP distillation to guarantee generalization, while adding direct language contrast to compensate for the deficiencies of 2D vision models in 3D spatial understanding.
  • Generalizability of Fine-Grained Contrastive Learning: The affinity-matrix-based TopK matching strategy can be extended to other tasks requiring region-to-long-text alignment.
  • Inspirations for VLA (Vision-Language-Action): g3D-LF demonstrates that 3D-language pre-training can significantly boost the performance of embodied tasks, a direction worthy of exploration with larger-scale data and a wider range of tasks.

Rating

⭐⭐⭐⭐ (4/5)

The study is systematic, closing the loop from data preparation through model design to multi-task validation, and the multi-level contrastive learning design is carefully thought out. However, the core modeling architecture is largely inherited from HNR, with innovation concentrated in the pre-training strategies. Furthermore, the focus on indoor scenes limits its applicability to domains such as outdoor autonomous driving.