PointAlign: Feature-Level Alignment Regularization for 3D Vision-Language Models¶
- Conference: CVPR 2026
- arXiv: 2603.00412
- Code: https://github.com/yharoldsu0627/PointAlign
- Area: Multimodal VLM
- Keywords: 3D point cloud understanding, vision-language models, feature alignment, geometric information preservation, regularization
TL;DR¶
PointAlign applies feature-level alignment regularization in 3D VLMs, aligning point cloud tokens at intermediate LLM layers with frozen Q-Former outputs. By training only a lightweight alignment projector and LoRA adapters, the method prevents geometric information from degrading during language modeling, yielding up to a 7.5pp improvement on open-vocabulary classification.
Background & Motivation¶
Background: 3D vision-language models (3D VLMs) are critical for applications such as robotics, autonomous driving, and AR, yet remain limited by the scarcity of 3D-text paired data.
Limitations of Prior Work: Existing methods (PointLLM, ShapeLLM, MiniGPT-3D) rely solely on next-token prediction loss for training, with only language tokens providing supervision signals. This leads to:
- Low utilization efficiency of the limited 3D data
- Progressive degradation and loss of valuable geometric information in intermediate representations as they propagate through LLM layers
Key Challenge: The language modeling objective only rewards geometric features that directly contribute to next-token prediction, while structural cues useful for spatial reasoning but irrelevant to the current language task are discarded during training.
Goal: To explicitly supervise point cloud tokens at intermediate LLM layers to preserve fine-grained 3D geometric-semantic information, without introducing any inference overhead.
Key Insight: Q-Former outputs are observed to encode both geometric and semantic information (owing to point cloud-text paired training), making them ideal targets for internal supervision.
Core Idea: A consistency loss is employed to align point cloud tokens at intermediate LLM layers with frozen Q-Former outputs via a lightweight alignment projector, which is discarded at inference time with zero additional cost.
Method¶
Overall Architecture¶
Two-stage training: Stage 1 follows MiniGPT-3D pretraining; Stage 2 freezes the encoder, Q-Former, and projector, training only LoRA and the alignment projector. The alignment projector is used exclusively during training and discarded at inference.
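The frozen/trainable split in Stage 2 can be sketched as a simple trainability map. This is a hedged illustration only: the module names below are hypothetical labels for the components the paper describes, not identifiers from the released code.

```python
# Hypothetical Stage-2 trainability map for PointAlign, per the
# description above: encoder, Q-Former, and modality projector frozen;
# only LoRA adapters and the alignment projector receive updates.
STAGE2_TRAINABLE = {
    "point_encoder": False,        # frozen in Stage 2
    "q_former": False,             # frozen; also serves as the alignment target
    "modality_projector": False,   # frozen Stage-1 projector
    "llm_backbone": False,         # base weights frozen; only LoRA updates
    "lora_adapters": True,
    "alignment_projector": True,   # training-only; discarded at inference
}

def trainable_modules(table):
    """Return the sorted names of modules that receive gradient updates."""
    return sorted(name for name, trainable in table.items() if trainable)
```

In a real training loop, this map would drive `requires_grad` flags, so the optimizer only ever sees the two small trainable components.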
Key Designs¶
- **Alignment Target Selection — Q-Former Output \(\bar{Q}\):**
  - Why not point cloud encoder outputs? The encoder captures only geometric features and lacks semantic information.
  - Why not deep LLM representations? Deep representations may have already lost 3D information.
  - Q-Former outputs, trained under direct point cloud-text supervision, retain the richest combination of geometric and semantic information, making them the optimal alignment target.
- **Alignment Projector \(f_\pi\) (3-layer Linear + SiLU):**
  - Maps point cloud tokens \(T_{pc}^{(\ell)}\) at LLM layer \(\ell\) into the Q-Former feature space.
  - Architecture: \(\mathbb{R}^C \to \mathbb{R}^{d_h} \to \mathbb{R}^{d_h} \to \mathbb{R}^{D_1}\), with only 8.39M parameters.
  - Completely discarded at inference, so it adds zero overhead.
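A rough NumPy sketch of the projector follows. Only the overall shape \(\mathbb{R}^C \to \mathbb{R}^{d_h} \to \mathbb{R}^{d_h} \to \mathbb{R}^{D_1}\) and the 8.39M parameter count are stated in the paper; the concrete dimensions, initialization, and the placement of SiLU between (rather than after) layers are assumptions.

```python
import numpy as np

def silu(x):
    # SiLU(x) = x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

class AlignmentProjector:
    """Sketch of the 3-layer Linear + SiLU projector mapping LLM-space
    point cloud tokens (dim C) into the Q-Former feature space (dim D1).
    Dimensions are illustrative, not the paper's actual values."""

    def __init__(self, c, d_h, d1, rng):
        self.W1 = rng.standard_normal((c, d_h)) * 0.02
        self.b1 = np.zeros(d_h)
        self.W2 = rng.standard_normal((d_h, d_h)) * 0.02
        self.b2 = np.zeros(d_h)
        self.W3 = rng.standard_normal((d_h, d1)) * 0.02
        self.b3 = np.zeros(d1)

    def __call__(self, t_pc):                 # t_pc: (o, C) point cloud tokens
        h = silu(t_pc @ self.W1 + self.b1)    # C   -> d_h
        h = silu(h @ self.W2 + self.b2)       # d_h -> d_h
        return h @ self.W3 + self.b3          # d_h -> D1

    def num_params(self):
        return sum(w.size for w in
                   (self.W1, self.b1, self.W2, self.b2, self.W3, self.b3))
```

Because the projector lives entirely on the training-time branch, deleting it after Stage 2 changes nothing about the inference graph.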
- **Alignment Loss:**
  - Cosine similarity loss: \(\mathcal{L}_{align} = -\frac{1}{o}\sum_{i=1}^{o} \frac{\tilde{Q}_i^\top \bar{Q}_i}{\|\tilde{Q}_i\|_2 \|\bar{Q}_i\|_2}\), where \(\tilde{Q} = f_\pi(T_{pc}^{(\ell)})\) are the projected point cloud tokens and \(o\) is the number of aligned tokens.
  - Focuses on feature direction rather than magnitude, better suited to cross-space alignment.
  - Gradients through the Q-Former output \(\bar{Q}\) are detached so that backpropagation cannot affect the frozen modules.
  - Total loss: \(\mathcal{L}_{total} = \mathcal{L}_{ntp} + \lambda \mathcal{L}_{align}\)
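The loss above can be sketched in a few lines of NumPy. Treating \(\bar{Q}\) as a constant stands in for the gradient detach; the value of \(\lambda\) and the epsilon guard are assumptions for illustration.

```python
import numpy as np

def alignment_loss(q_tilde, q_bar, eps=1e-8):
    """Negative mean cosine similarity between projected point cloud
    tokens q_tilde = f_pi(T_pc) and frozen Q-Former outputs q_bar.
    Shapes: (o, D1). q_bar is a constant here, mirroring the detach."""
    num = np.sum(q_tilde * q_bar, axis=-1)
    den = (np.linalg.norm(q_tilde, axis=-1)
           * np.linalg.norm(q_bar, axis=-1) + eps)
    return -np.mean(num / den)

def total_loss(l_ntp, q_tilde, q_bar, lam=1.0):
    # L_total = L_ntp + lambda * L_align; lambda is a tunable weight.
    return l_ntp + lam * alignment_loss(q_tilde, q_bar)
```

When the projected tokens point in the same direction as the Q-Former features, the alignment term reaches its minimum of −1 per token, independent of feature magnitude, which is exactly the scale-invariance the bullet above argues for.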
Loss & Training¶
Stage 2 jointly trains LoRA and the alignment projector using \(\mathcal{L}_{total} = \mathcal{L}_{ntp} + \lambda \mathcal{L}_{align}\), updating only a minimal number of parameters.
Key Experimental Results¶
Main Results (3D Object Classification)¶
| Model | LLM Size | ModelNet40 Avg | Objaverse Avg | Overall Avg |
|---|---|---|---|---|
| PointLLM-7B | 7B | 50.85 | 62.50 | 56.68 |
| PointLLM-13B | 13B | 52.19 | 62.25 | 57.22 |
| MiniGPT-3D (Baseline) | 2.7B | 61.24 | 66.75 | 64.00 |
| PointAlign (Ours) | 2.7B | 61.17 | 71.00 | 66.08 |
Ablation Study¶
| Configuration | Objaverse Avg | Notes |
|---|---|---|
| Baseline (MiniGPT-3D) | 66.75 | No alignment regularization |
| + Align to encoder features | 67.50 | Geometric information only; limited benefit |
| + Align to Q-Former outputs | 71.00 | Geometric + semantic information; best performance |
| + Align to deep LLM features | 68.25 | Partial loss of 3D information |
Key Findings¶
- Achieves a 4.25pp improvement on the most challenging open-vocabulary Objaverse classification (66.75→71.00 on the 2-prompt average), rising to 7.5pp under instruction-based evaluation, plus 4.88pp on 3D captioning.
- Performance on ModelNet40 is nearly unchanged (−0.07pp), indicating that alignment regularization primarily benefits difficult or open-ended scenarios.
- A 2.7B-parameter model surpasses the 13B PointLLM, demonstrating the data efficiency of the approach.
- Regarding alignment layer \(\ell\): intermediate layers yield the best results, with performance degrading for both shallower and deeper choices.
Highlights & Insights¶
- Training trick with zero inference overhead: The alignment projector is used only during training and discarded at inference — a design pattern worth adopting in other VLMs, analogous to the use of auxiliary heads in knowledge distillation.
- Extension of 2D VLM findings to 3D: Prior work in 2D VLMs has shown that visual representations degrade in deeper layers without explicit visual supervision; this paper extends that finding to 3D point clouds and provides a concrete solution.
- Data efficiency: Under conditions of extreme 3D data scarcity, internal alignment regularization maximizes the utility of limited data.
Limitations & Future Work¶
- The method is built upon MiniGPT-3D; the relatively small model scale may limit the performance ceiling.
- The alignment layer \(\ell\) requires hyperparameter tuning, and different model architectures may require different settings.
- Validation is limited to single-object understanding; scene-level multi-object 3D understanding is not addressed.
- Future work could explore multi-layer alignment (aligning multiple intermediate layers rather than a single one) or dynamic alignment layer selection.
Related Work & Insights¶
- vs. PointLLM: PointLLM achieves 3D-text alignment through full model fine-tuning, incurring high computational cost (200+ GPU-hours); PointAlign surpasses its performance by training only a small number of parameters.
- vs. representation supervision methods in 2D VLMs: Reconstruction-based methods in 2D (supervising intermediate representations by recovering visual inputs) tend to capture low-level textures; 3D understanding requires capturing structural relationships and geometric configurations, making direct alignment with Q-Former semantic features more appropriate.
Supplementary Analysis¶
- The alignment projector contains only 8.39M parameters, negligible compared to MiniGPT-3D's 2.7B.
- During Stage 2 training, gradients through Q-Former output \(\bar{Q}\) are detached — a critical design choice, as omitting this would allow the alignment loss to modify the Q-Former via backpropagation.
- On open-vocabulary Objaverse classification, instruction-based evaluation shows a larger gain (65→72.5, +7.5pp), suggesting that alignment regularization is especially beneficial for understanding free-form instructions.
- Feature similarity visualizations demonstrate that in the baseline, the cosine similarity between point cloud tokens at intermediate layers and Q-Former outputs decreases with depth, whereas PointAlign maintains stable similarity throughout.
- The method is built on MiniGPT-3D's Q-Former architecture; adaptation to other 3D VLMs using direct projection schemes (e.g., PointLLM) would require adjustments to the alignment target.
Rating¶
- Novelty: ⭐⭐⭐ Clear motivation, moderate technical novelty
- Experimental Thoroughness: ⭐⭐⭐⭐ Comprehensive ablation analysis with convincing feature quality visualizations
- Writing Quality: ⭐⭐⭐⭐ Well-structured with clearly articulated motivation
- Value: ⭐⭐⭐⭐ Practically informative for the 3D VLM community