PV-Ground: Text-Guided Point-Voxel Interaction for 3D Visual Grounding¶
Conference: CVPR 2026
Paper: CVF Open Access
Code: https://github.com/AaNnWwTt/PV-Ground
Area: 3D Vision
Keywords: 3D Visual Grounding, Voxel Convolution, Keypoint Sampling, Text-guided, Multi-modal Fusion
TL;DR¶
PV-Ground identifies that existing 3D visual grounding (3D VG) methods typically use point cloud backbones that aggressively downsample roughly 50,000 points to 2048, creating a detail bottleneck. It introduces sparse voxel convolution to preserve high-resolution features and distills the voxel feature pyramid into compact keypoints for interaction. A text-guided differentiable soft sampling module adaptively concentrates keypoints on task-relevant objects, improving single-stage grounding accuracy by roughly 5 points on ScanRefer and ReferIt3D (Nr3D).
Background & Motivation¶
Background: Typical models for 3D visual grounding (localizing target objects in a 3D scene given a free-form text query) consist of four parts: "3D Scene Encoder + Text Encoder + Multi-modal Interaction Module + Grounding Head." Most research has focused on backend multi-modal interaction—designing complex attention mechanisms, graph neural networks, or multi-task strategies—while frontend scene feature representation has long been overlooked.
Limitations of Prior Work: Mainstream methods still rely on point cloud backbones like PointNet++ for scene encoding. To keep the subsequent multi-layer attention tractable, they aggressively downsample the original ~50,000 points to 2048 or fewer keypoints. This makes attention affordable but creates a severe information bottleneck: fine-grained geometric details are lost, which is fatal for localizing small objects, partially occluded objects, or instances distinguishable only by subtle cues.
Key Challenge: Sparse voxel convolution is already mainstream in detection/segmentation, preserving high-resolution spatial details with efficient inference. However, direct application to 3D VG is problematic: the number of voxels in standard upsampling decoders grows exponentially, making dense high-resolution feature maps computationally infeasible for attention-based interaction with text. This creates a dilemma: point backbones "allow interaction but lose detail," while voxel backbones "preserve detail but prevent interaction."
Goal: Design an architecture that enjoys high-fidelity voxel representations while maintaining the computational convenience of keypoint-based multi-modal fusion.
Key Insight: The authors advocate that "voxels handle fidelity, while keypoints handle interaction." Sparse voxel convolution extracts a high-resolution scene feature pyramid, and a small set of compact keypoints serves as aggregation anchors to distill the pyramid. Furthermore, since 3D VG is a text-conditioned task, keypoints need not cover the entire scene uniformly; representation capability should be concentrated on regions relevant to the text description.
Core Idea: Integrate a voxel backbone with keypoint aggregation to bridge "fidelity \(\leftrightarrow\) interaction," and use text-guided differentiable soft sampling to adaptively cluster keypoints around text-relevant objects, filtering out distractors.
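To make the bottleneck argument concrete, the sketch below counts the entries of one dense self-attention score matrix at the token counts mentioned in this note. It assumes plain quadratic attention with no sparsity or windowing; `attention_pairs` is a hypothetical helper introduced here, not something from the paper or its code.

```python
def attention_pairs(num_tokens):
    """Entries in one dense self-attention score matrix over num_tokens tokens."""
    return num_tokens * num_tokens

# Token counts taken from the text: raw points, the usual downsampled set,
# and the keypoint counts PV-Ground uses before and after text-guided sampling.
for n in (50_000, 2048, 1024, 256):
    print(f"{n:>6} tokens -> {attention_pairs(n):,} score entries")

# How much larger the score matrix is at full resolution than at 2048 keypoints.
ratio = attention_pairs(50_000) / attention_pairs(2048)
print(f"50,000 vs 2048 tokens: about {ratio:.0f}x more entries")
```

Under this assumption, attending over every raw point costs several hundred times more than attending over 2048 keypoints, which is the pressure that pushes point backbones toward aggressive downsampling.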
Method¶
Overall Architecture¶
PV-Ground combines a voxel backbone with a keypoint aggregation-interaction mechanism. The input is an \(N\times 6\) point cloud (XYZ + RGB), voxelized and fed into multi-layer sparse 3D convolutions. Each block downsamples with stride 2 to form a multi-scale voxel feature pyramid (with 8× downsampled features stacked along the Z-axis as BEV maps). Next, a set of keypoints is sampled via FPS as aggregation anchors to distill the voxel pyramid into compact keypoint features via Voxel Set Abstraction (VSA) in the Point-Voxel Interaction (PVI) module. Then, the Text-Guided Sampling (TGS) module uses RoBERTa text features to soft-sample 1024 uniform keypoints down to 256 target-relevant keypoints. Finally, these keypoints undergo deep multi-modal interaction and are fed into a multi-task head (following MCLN) to output bounding boxes and segmentation masks.
%%{init: {'flowchart': {'rankSpacing': 24, 'nodeSpacing': 28, 'padding': 6, 'wrappingWidth': 400}}}%%
flowchart TD
A["Input: Point Cloud N×6 (XYZ+RGB)"] --> B["Sparse Voxel Convolution<br/>Multi-scale Voxel Pyramid + BEV"]
B --> C["Efficient Point-Voxel Interaction (PVI)<br/>FPS 1024 Keypoints + VSA Pyramid Aggregation"]
C --> D["Text-Guided Keypoint Sampling (TGS)<br/>Cross-Attn + Gumbel-Softmax Soft Sampling to 256"]
T["RoBERTa Text Features"] --> D
D --> E["Multi-modal Interaction + MCLN Multi-task Head"]
E --> F["Output: Bounding Boxes + Segmentation Masks"]
Key Designs¶
1. Efficient Point-Voxel Interaction (PVI): Distilling Voxel Pyramids into Compact Interactable Representations
This is the core mechanism for bridging "fidelity" and "interactivity." The model voxelizes input points \(P\in\mathbb{R}^{N\times 6}\) into an \(L_v\times W_v\times H_v\) grid and extracts a voxel feature pyramid using sparse convolutions. To avoid direct cross-modal attention on dense voxel maps \(F_{vox}\), Voxel-to-Keypoint aggregation is introduced: \(N\) keypoints \(P'\in\mathbb{R}^{N\times 3}\) are sampled via FPS to serve as anchors (here \(N\) denotes the keypoint count, 1024 in the experiments, not the number of input points). For each keypoint \(kp_i\), a spherical query of radius \(r_l\) is defined at each pyramid layer \(l\). Voxel features \(\{v_j^l\}\) within the radius (satisfying \(\lVert v_j^l - kp_i\rVert^2 < r_l^2\)) are collected into set \(S_i^l\) and processed via a PointNet-style block: \(f_i^l = \text{maxpool}(\text{MLP}(\mathcal{M}(S_i^l)))\), where \(\mathcal{M}(\cdot)\) is random sampling of neighbor voxels. The per-layer features are fused into a compact keypoint set \(\mathcal{V}=\{f_i^g\}\in\mathbb{R}^{N\times C}\). Unlike "hard pruning" in TSP3D, this soft aggregation preserves information.
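The aggregation step can be sketched in pure Python as follows. This is a toy illustration under stated assumptions, not the authors' implementation: the function names, the single-layer ReLU "MLP", the `max_neighbors` cap, and the reuse of one voxel set for both radii are all simplifications introduced here (the real pyramid has a different voxel set per layer and runs on GPU).

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two 3D points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def farthest_point_sampling(points, m, start=0):
    """Greedy FPS: return m indices, each farthest from the points already chosen."""
    chosen = [start]
    # Squared distance from every point to its nearest chosen point so far.
    min_d2 = [sq_dist(p, points[start]) for p in points]
    while len(chosen) < m:
        nxt = max(range(len(points)), key=min_d2.__getitem__)
        chosen.append(nxt)
        min_d2 = [min(d, sq_dist(p, points[nxt])) for d, p in zip(min_d2, points)]
    return chosen

def aggregate(keypoint, voxel_xyz, voxel_feat, radius, weight, max_neighbors, rng):
    """Ball query, random subsampling, one-layer ReLU MLP, channel-wise max-pool."""
    idx = [j for j, v in enumerate(voxel_xyz) if sq_dist(v, keypoint) < radius ** 2]
    if len(idx) > max_neighbors:
        idx = rng.sample(idx, max_neighbors)  # plays the role of M(.)
    c_out = len(weight[0])
    pooled = [0.0] * c_out  # an empty ball yields zeros
    for j in idx:
        for c in range(c_out):
            h = sum(f * w[c] for f, w in zip(voxel_feat[j], weight))
            # Starting from 0 makes this equal to max over neighbours of ReLU(h).
            pooled[c] = max(pooled[c], h)
    return pooled

rng = random.Random(0)
voxel_xyz = [[rng.uniform(0.0, 4.0) for _ in range(3)] for _ in range(500)]   # toy voxel centres
voxel_feat = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(500)]    # toy 4-dim features
weight = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(4)]          # 4 -> 8 channels

kp_idx = farthest_point_sampling(voxel_xyz, 16)
keypoints = [voxel_xyz[i] for i in kp_idx]

fused = []
for kp in keypoints:
    row = []
    for r in (0.4, 0.8):  # two query radii standing in for two pyramid layers
        row += aggregate(kp, voxel_xyz, voxel_feat, r, weight, 16, rng)
    fused.append(row)

print(len(fused), "keypoints with", len(fused[0]), "fused channels each")
```

The point of the design is visible in the shapes: however many voxels the pyramid holds, the output is always one fixed-width feature vector per keypoint, so downstream attention cost depends only on the keypoint count.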
2. Text-Guided Keypoint Sampling (TGS): Adaptive Clustering of Keypoints on Relevant Objects
FPS samples keypoints uniformly, which is optimal for detection but sub-optimal for text-conditioned 3D VG. TGS redistributes keypoints toward text-relevant candidates. Initial point features \(V\) and text features \(T\) undergo cross-attention: \(V_t = \text{CrossAtt}(V, \text{SelfAtt}(T))\). A fully connected layer then predicts sampling weights \(SW\in\mathbb{R}^{N\times n}\). Gumbel-Softmax reparameterization is used to convert discrete sampling into differentiable soft assignment: \(SW_{gs}=\text{softmax}((\log(SW)+g)/\tau)\). This allows \(n \ll N\) target-relevant keypoints \(P_k\) and \(V_k\) to be derived through weighted soft-sampling. Unlike Top-K "hard selection" (e.g., EDA/3D-SPS), soft sampling allows the model to utilize all potential features and remain end-to-end trainable.
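A minimal pure-Python sketch of the Gumbel-Softmax soft sampling step is shown below. It assumes the softmax is taken over the \(N\) input keypoints separately for each of the \(n\) output slots, and that the entries of \(SW\) are strictly positive so that \(\log(SW)\) is defined; the note does not state either detail, and the function names are hypothetical.

```python
import math
import random

def gumbel_noise(rng):
    """Sample g = -log(-log(u)) with u ~ Uniform(0, 1), clamped away from 0 and 1."""
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
    return -math.log(-math.log(u))

def soft_sample(points, sw, tau, rng):
    """Soft-sample n rows from `points` (N rows) using positive weights sw (N x n).

    Each output row is a convex combination of all N input rows, so every
    input point keeps a nonzero weight.
    """
    n_in, n_out = len(sw), len(sw[0])
    dim = len(points[0])
    sampled, assignments = [], []
    for k in range(n_out):
        scores = [(math.log(sw[i][k]) + gumbel_noise(rng)) / tau for i in range(n_in)]
        top = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        probs = [e / total for e in exps]
        assignments.append(probs)
        sampled.append([sum(probs[i] * points[i][d] for i in range(n_in)) for d in range(dim)])
    return sampled, assignments

rng = random.Random(0)
points = [[float(i), 0.0, 0.0] for i in range(8)]                        # N = 8 toy keypoints
sw = [[rng.uniform(0.05, 1.0) for _ in range(3)] for _ in range(8)]      # N x n weights, n = 3
p_k, sw_gs = soft_sample(points, sw, tau=1.0, rng=rng)

for k, row in enumerate(p_k):
    print(f"soft keypoint {k}: x = {row[0]:.3f}")
```

Applying the same assignment weights to the feature rows would give \(V_k\) alongside \(P_k\). Lowering `tau` pushes each assignment toward a one-hot choice, while larger values spread weight across more input points.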
3. Multi-task Regression Head: Reusing MCLN for Boxes and Masks
The keypoint features \(V_k\) from TGS can be integrated into existing point-based frameworks. This work adopts the decoder and prediction head from MCLN due to its strong performance and multi-task design, outputting both bounding boxes and segmentation masks. The authors also verified that feeding these keypoint features into other decoders (e.g., BUTD-DETR, EDA) yields consistent gains, which suggests PV-Ground works as a plug-and-play frontend improvement.
Loss & Training¶
The model is trained on a single RTX 4090 with a batch size of 10. Input point cloud XYZ is cropped to \([-8,-8,-0.2]\sim[8,8,3.8]\) m with a voxel resolution of 0.02 m. \(N=1024\) initial keypoints are sampled via FPS, and VSA uses radii \(r_l=[0.2,0.4,0.8,1.6]\) m. TGS soft-samples \(n=256\) keypoints. The Gumbel temperature \(\tau\) is set to \(1.0\). Multi-task losses follow MCLN.
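The crop range and voxel size above imply a specific grid resolution, worked out in the short sketch below. It assumes the grid simply tiles the crop range and that 8× downsampling divides each axis evenly; the actual sparse-convolution output shapes depend on kernel and padding choices not given here, and `grid_shape` is a hypothetical helper.

```python
def grid_shape(lo, hi, voxel_size):
    """Number of voxels per axis when the crop range is tiled at voxel_size."""
    return tuple(int(round((h - l) / voxel_size)) for l, h in zip(lo, hi))

full = grid_shape((-8.0, -8.0, -0.2), (8.0, 8.0, 3.8), 0.02)
coarse = tuple(s // 8 for s in full)  # after three stride-2 blocks (8x per axis)

print("full-resolution grid:", full)            # (800, 800, 200)
print("dense cell count:", full[0] * full[1] * full[2])
print("8x-downsampled grid:", coarse)           # (100, 100, 25)
print("keypoints kept by TGS:", 256 / 1024)     # fraction of FPS keypoints
```

The dense cell count (128 million at full resolution) is only an upper bound, since sparse convolution stores occupied voxels only, but it shows why attention over the voxel grid itself is avoided.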
Key Experimental Results¶
Main Results¶
Using point-based MCLN as the main baseline (†) on ScanRefer, the gains in the single-stage pipeline are particularly significant:
| Dataset | Config | Method | Acc@0.25 | Acc@0.5 |
|---|---|---|---|---|
| ScanRefer | Single-stage Overall | MCLN† | 54.30 | 42.64 |
| ScanRefer | Single-stage Overall | TSP3D | 56.45 | 46.71 |
| ScanRefer | Single-stage Overall | PV-Ground | 59.31 (+5.0) | 47.77 (+5.1) |
| ScanRefer | Two-stage Overall | MCLN† | 57.17 | 45.53 |
| ScanRefer | Two-stage Overall | PV-Ground | 59.87 (+2.7) | 47.56 (+2.0) |
On ReferIt3D: Single-stage Nr3D 51.3 vs. MCLN 45.7 (+5.6). For referring segmentation (ScanRefer), PV-Ground achieves 62.2/54.8 Acc@0.25/0.5 and 47.9 mIoU, outperforming MCLN (58.7/50.7, 44.7).
Ablation Study¶
Single-stage ScanRefer, incrementally adding PVI and TGS (Baseline is point-based MCLN):
| ID | PVI | TGS | Grounding Acc@0.25 | Grounding Acc@0.5 | Seg mIoU |
|---|---|---|---|---|---|
| (a) | | | 54.30 | 42.64 | 43.49 |
| (b) | ✓ | | 56.33 | 45.19 | 44.75 |
| (c) | | ✓ | 58.15 | 47.23 | 46.72 |
| (d) | ✓ | ✓ | 59.31 | 47.77 | 47.73 |
Key Findings¶
- Modules are Complementary: Adding only PVI (b) improves Acc@0.25 from 54.30 to 56.33. Adding only TGS (c) reaches 58.15. Combining both (d) yields 59.31.
- TGS Contribution: On its own, TGS contributes more than PVI (+3.85 vs. +2.03 Acc@0.25 over the baseline). Using TGS even with a PointNet++ backbone (c) brings major gains, validating the importance of filtering distractors in text-conditioned tasks.
- Gains in Difficult Scenarios: Improvements are most pronounced in the "multiple" (distractor-heavy) setting where fine-grained details and de-noising are critical.
- Visualization: FPS seeds cover the entire scene uniformly, while TGS keypoints accurately cluster around described objects (sofa, door, sink, toilet).
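The gains quoted in these findings can be rechecked directly from the ablation table. The sketch below only does that arithmetic on the single-stage ScanRefer Acc@0.25 column; the variable names are introduced here for illustration.

```python
# Acc@0.25 on single-stage ScanRefer, copied from the ablation table.
acc = {"baseline": 54.30, "pvi_only": 56.33, "tgs_only": 58.15, "both": 59.31}

# Gain of each configuration over the point-based MCLN baseline.
gain = {k: round(v - acc["baseline"], 2) for k, v in acc.items() if k != "baseline"}
sum_of_parts = round(gain["pvi_only"] + gain["tgs_only"], 2)

print(gain)
print("sum of individual gains:", sum_of_parts)
print("joint gain:", gain["both"])
```

The joint gain (+5.01) is smaller than the sum of the individual gains (+5.88), so the two modules help together but their improvements partly overlap.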
Highlights & Insights¶
- Revisiting the Frontend: While most research focuses on backend interaction, this work targets "scene feature representation," a long-neglected bottleneck.
- Point-Voxel Synergy: Voxel fidelity + Keypoint interactivity. Distilling exponentially expanding voxels into a fixed number of keypoints bypasses the computation limits of dense voxel-text attention.
- Soft Sampling over Hard Selection: Gumbel-Softmax enables differentiable allocation, allowing "unselected" points to still propagate gradients, mitigating irreversible information loss from Top-K pruning.
- Plug-and-play Frontend: The keypoint features can replace queries in existing decoders like BUTD-DETR or EDA to provide similar gains.
Limitations & Future Work¶
- Dependency on Existing Heads: Decoding and regression reuse MCLN; the contribution is concentrated on representation rather than grounding head innovation.
- Computational Overhead: Voxel convolution + VSA + TGS adds complexity over point-only backbones. Detailed throughput/latency comparisons are limited.
- Hyperparameter Sensitivity: Robustness of parameters like keypoint count (1024 to 256) and temperature \(\tau\) across diverse datasets requires further study.
- Scene Scale: Focused on indoor ScanNet; generalization to large-scale outdoor scenes or open-vocabulary descriptions remains to be seen.
Related Work & Insights¶
- vs. MCLN (Baseline): MCLN uses point backbones limited by aggressive downsampling; PV-Ground replaces the frontend for better fidelity and task-relevance.
- vs. TSP3D: TSP3D uses hard pruning of voxels which risks deleting small objects; PV-Ground uses soft aggregation and differentiable sampling.
- vs. EDA / 3D-SPS: These use Top-K hard selection; PV-Ground adopts differentiable soft assignment, allowing all points to participate in learning.
- vs. PointNet++: The core argument is that point backbone downsampling is a systemic bottleneck for 3D VG.
Rating¶
- Novelty: ⭐⭐⭐⭐ First point-voxel framework for 3D VG + text-guided soft sampling.
- Experimental Thoroughness: ⭐⭐⭐⭐ Covers multiple datasets, tasks, and stages with clear ablations.
- Writing Quality: ⭐⭐⭐⭐ Compelling motivation and visualizations; formulas are complete.
- Value: ⭐⭐⭐⭐ Practical plug-and-play frontend that shifts the focus back to representation.