PointAlign: Feature-Level Alignment Regularization for 3D Vision-Language Models

Conference: CVPR 2026
arXiv: 2603.00412
Code: https://github.com/yharoldsu0627/PointAlign
Area: Multimodal VLM
Keywords: 3D point cloud understanding, vision-language models, feature alignment, geometric information preservation, regularization

TL;DR

PointAlign applies feature-level alignment regularization to point cloud tokens at an intermediate LLM layer of a 3D VLM, using frozen Q-Former outputs as the alignment target. By training only a lightweight alignment projector and LoRA adapters, it limits the degradation of geometric information during language modeling and improves open-vocabulary Objaverse classification by up to 7.50pp (4.25pp on the 2-prompt average).

Background & Motivation

Background: 3D vision-language models (3D VLMs) are critical for applications such as robotics, autonomous driving, and AR, yet remain limited by the scarcity of 3D-text paired data.

Limitations of Prior Work: Existing methods (PointLLM, ShapeLLM, MiniGPT-3D) rely solely on next-token prediction loss for training, with only language tokens providing supervision signals. This leads to:
- Low utilization efficiency of limited 3D data
- Progressive degradation and loss of valuable geometric information in intermediate representations as they propagate through LLM layers

Key Challenge: The language modeling objective only rewards geometric features that directly contribute to next-token prediction, while structural cues useful for spatial reasoning but irrelevant to the current language task are discarded during training.

Goal: To explicitly supervise point cloud tokens at intermediate LLM layers to preserve fine-grained 3D geometric-semantic information, without introducing any inference overhead.

Key Insight: Q-Former outputs are observed to encode both geometric and semantic information (owing to point cloud-text paired training), making them ideal targets for internal supervision.

Core Idea: A consistency loss is employed to align point cloud tokens at intermediate LLM layers with frozen Q-Former outputs via a lightweight alignment projector, which is discarded at inference time with zero additional cost.

Method

Overall Architecture

Two-stage training: Stage 1 follows MiniGPT-3D pretraining. Stage 2 freezes the point cloud encoder, the Q-Former, and the existing projector (not to be confused with the alignment projector), and trains only the LoRA adapters and the newly added alignment projector. The alignment projector is used exclusively during training and discarded at inference.

Key Designs

Alignment Target Selection — Q-Former Output \(\bar{Q}\):
- Why not use point cloud encoder outputs? The encoder captures only geometric features and lacks semantic information.
- Why not use deep LLM representations? Deep representations may have already lost 3D information.
- Q-Former outputs, under direct supervision, retain the richest combination of geometric and semantic information, making them the optimal alignment target.
Alignment Projector \(f_\pi\) (3-layer Linear + SiLU):
- Maps point cloud tokens \(T_{pc}^{(\ell)}\) at LLM layer \(\ell\) into the Q-Former feature space.
- Architecture: \(\mathbb{R}^C \to \mathbb{R}^{d_h} \to \mathbb{R}^{d_h} \to \mathbb{R}^{D_1}\), with only 8.39M parameters.
- Completely discarded at inference with zero overhead.
Alignment Loss:
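- Cosine similarity loss: \(\mathcal{L}_{align} = -\frac{1}{o}\sum_{i=1}^{o} \frac{\tilde{Q}_i^\top \bar{Q}_i}{\|\tilde{Q}_i\|_2 \|\bar{Q}_i\|_2}\), where \(\tilde{Q} = f_\pi(T_{pc}^{(\ell)})\) denotes the projected point cloud tokens and \(o\) is the number of aligned token pairs.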
- Focuses on feature direction rather than magnitude, better suited for cross-space alignment.
- Gradients through Q-Former output \(\bar{Q}\) are detached to prevent backpropagation from affecting frozen modules.
- Total loss: \(\mathcal{L}_{total} = \mathcal{L}_{ntp} + \lambda \mathcal{L}_{align}\)
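
The alignment loss and total loss above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the authors' implementation: the function names, the toy 2-dimensional vectors, and the value of `lam` are assumptions made for the example, and a real training loop would use an autograd framework and treat the Q-Former targets as constants.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def alignment_loss(projected_tokens, qformer_targets):
    """Negative mean cosine similarity over the o token pairs (L_align)."""
    assert len(projected_tokens) == len(qformer_targets)
    sims = [cosine(q_tilde, q_bar)
            for q_tilde, q_bar in zip(projected_tokens, qformer_targets)]
    return -sum(sims) / len(sims)

def total_loss(ntp_loss, align_loss, lam):
    """L_total = L_ntp + lambda * L_align."""
    return ntp_loss + lam * align_loss

# Toy targets standing in for Q-Former outputs (hypothetical values).
q_bar = [[1.0, 0.0], [0.0, 2.0]]
# Same directions as the targets but different magnitudes.
same_direction = [[3.0, 0.0], [0.0, 0.5]]
# Directions orthogonal to the targets.
orthogonal = [[0.0, 1.0], [1.0, 0.0]]

print(alignment_loss(same_direction, q_bar))
print(alignment_loss(orthogonal, q_bar))
print(total_loss(ntp_loss=2.0, align_loss=-1.0, lam=0.5))  # lam is a made-up weight
```

Because only direction matters, the rescaled vectors in `same_direction` reach the minimum loss of -1, while the orthogonal ones give a loss of zero magnitude.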

Loss & Training

Stage 2 jointly trains LoRA and the alignment projector using \(\mathcal{L}_{total} = \mathcal{L}_{ntp} + \lambda \mathcal{L}_{align}\), updating only a minimal number of parameters.
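
For reference, the Stage 2 objective can be written out with the detachment made explicit. The stop-gradient operator \(\operatorname{sg}[\cdot]\) and the symbol \(\theta\) for the trainable parameters are notation assumed here for illustration; the remaining symbols follow the definitions in the Method section.

```latex
\begin{aligned}
\tilde{Q} &= f_\pi\big(T_{pc}^{(\ell)}\big), \\
\mathcal{L}_{align} &= -\frac{1}{o}\sum_{i=1}^{o}
  \frac{\tilde{Q}_i^\top \operatorname{sg}[\bar{Q}_i]}
       {\|\tilde{Q}_i\|_2 \, \|\operatorname{sg}[\bar{Q}_i]\|_2}, \\
\mathcal{L}_{total} &= \mathcal{L}_{ntp} + \lambda \mathcal{L}_{align}, \\
\nabla_\theta \mathcal{L}_{total} &= \nabla_\theta \mathcal{L}_{ntp}
  + \lambda \nabla_\theta \mathcal{L}_{align},
  \qquad \theta \in \{\text{LoRA},\ f_\pi\}.
\end{aligned}
```

Under this reading, the alignment projector \(f_\pi\) receives gradients only from \(\mathcal{L}_{align}\), while the LoRA adapters receive gradients from both terms.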

Key Experimental Results

Main Results (3D Object Classification)

Model	LLM Size	ModelNet40 Avg	Objaverse Avg	Overall Avg
PointLLM-7B	7B	50.85	62.50	56.68
PointLLM-13B	13B	52.19	62.25	57.22
MiniGPT-3D (Baseline)	2.7B	61.24	66.75	64.00
PointAlign (Ours)	2.7B	61.17	71.00	66.08
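
The Overall Avg column appears to be the plain mean of the two benchmark averages. The short check below recomputes it from the table values, assuming that definition; the variable and function names are illustrative only.

```python
# (ModelNet40 Avg, Objaverse Avg, reported Overall Avg), copied from the table.
results = {
    "PointLLM-7B": (50.85, 62.50, 56.68),
    "PointLLM-13B": (52.19, 62.25, 57.22),
    "MiniGPT-3D": (61.24, 66.75, 64.00),
    "PointAlign": (61.17, 71.00, 66.08),
}

def overall_avg(modelnet, objaverse):
    """Unweighted mean of the two benchmark averages (assumed definition)."""
    return (modelnet + objaverse) / 2

for name, (modelnet, objaverse, reported) in results.items():
    computed = overall_avg(modelnet, objaverse)
    print(f"{name}: computed {computed:.3f}, reported {reported:.2f}")

# Gain of PointAlign over the MiniGPT-3D baseline on the reported overall average.
overall_gain = results["PointAlign"][2] - results["MiniGPT-3D"][2]
print(f"overall gain: {overall_gain:+.2f}pp")
```

Each recomputed mean agrees with the reported value to within rounding at the second decimal place.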

Ablation Study

Configuration	Objaverse Avg	Notes
Baseline (MiniGPT-3D)	66.75	No alignment regularization
+ Align to encoder features	67.50	Geometric information only; limited benefit
+ Align to Q-Former outputs	71.00	Geometric + semantic information; best performance
+ Align to deep LLM features	68.25	Partial loss of 3D information
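
To make the ablation easier to read, the gains over the baseline can be computed directly from the table. This is simple arithmetic on the reported numbers; the dictionary keys are shorthand labels chosen for the example.

```python
BASELINE_OBJAVERSE = 66.75  # MiniGPT-3D, no alignment regularization

# Objaverse Avg for each alignment target, copied from the ablation table.
ablation = {
    "encoder features": 67.50,
    "Q-Former outputs": 71.00,
    "deep LLM features": 68.25,
}

# Gain in percentage points relative to the baseline.
gains = {target: score - BASELINE_OBJAVERSE for target, score in ablation.items()}
best_target = max(gains, key=gains.get)

for target, gain in gains.items():
    print(f"align to {target}: {gain:+.2f}pp")
print(f"best alignment target: {best_target}")
```

The Q-Former target yields a gain of 4.25pp on the Objaverse average, compared with 0.75pp and 1.50pp for the other two targets.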

Key Findings

Achieves up to a 7.50pp improvement on the most challenging open-vocabulary Objaverse classification (65→72.5 under instruction-based evaluation; 66.75→71.00, i.e., +4.25pp, on the 2-prompt average) and 4.88pp on 3D captioning.
Performance on ModelNet40 is nearly unchanged (−0.07pp), indicating that alignment regularization primarily benefits difficult or open-ended scenarios.
A 2.7B-parameter model surpasses the 13B PointLLM (66.08 vs. 57.22 overall average), demonstrating the data and parameter efficiency of the approach.
Regarding alignment layer \(\ell\): intermediate layers yield the best results, with performance degrading for both shallower and deeper choices.

Highlights & Insights

Training trick with zero inference overhead: The alignment projector is used only during training and discarded at inference — a design pattern worth adopting in other VLMs, analogous to the use of auxiliary heads in knowledge distillation.
Extension of 2D VLM findings to 3D: Prior work in 2D VLMs has shown that visual representations degrade in deeper layers without explicit visual supervision; this paper extends that finding to 3D point clouds and provides a concrete solution.
Data efficiency: Under conditions of extreme 3D data scarcity, internal alignment regularization maximizes the utility of limited data.

Limitations & Future Work

The method is built upon MiniGPT-3D; the relatively small model scale may limit the performance ceiling.
The alignment layer \(\ell\) requires hyperparameter tuning, and different model architectures may require different settings.
Validation is limited to single-object understanding; scene-level multi-object 3D understanding is not addressed.
Future work could explore multi-layer alignment (aligning multiple intermediate layers rather than a single one) or dynamic alignment layer selection.

vs. PointLLM: PointLLM achieves 3D-text alignment through full model fine-tuning, incurring high computational cost (200+ GPU-hours); PointAlign surpasses its performance by training only a small number of parameters.
vs. representation supervision methods in 2D VLMs: Reconstruction-based methods in 2D (supervising intermediate representations by recovering visual inputs) tend to capture low-level textures; 3D understanding requires capturing structural relationships and geometric configurations, making direct alignment with Q-Former semantic features more appropriate.

Supplementary Analysis

The alignment projector contains only 8.39M parameters, negligible compared to MiniGPT-3D's 2.7B.
During Stage 2 training, gradients through the Q-Former output \(\bar{Q}\) are detached. This is a critical design choice: it keeps \(\bar{Q}\) a fixed target, so the alignment loss can only pull the LLM-side features toward it and cannot propagate gradients back into the Q-Former branch.
On open-vocabulary Objaverse classification, instruction-based evaluation shows a larger gain (65→72.5, +7.5pp) than the 2-prompt average (+4.25pp), suggesting that alignment regularization is especially beneficial for understanding free-form instructions.
Feature similarity visualizations demonstrate that in the baseline, the cosine similarity between point cloud tokens at intermediate layers and Q-Former outputs decreases with depth, whereas PointAlign maintains stable similarity throughout.
The method is built on MiniGPT-3D's Q-Former architecture; adaptation to other 3D VLMs using direct projection schemes (e.g., PointLLM) would require adjustments to the alignment target.
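
As a quick sanity check on the parameter-overhead claim at the start of this section, the ratio can be computed from the two quoted figures. Note that 2.7B is the nominal model size given in the text rather than an exact parameter count, so the result is approximate; the constant and function names are illustrative.

```python
# Figures quoted in the text; 2.7B is a nominal size, not an exact count.
ALIGN_PROJECTOR_PARAMS = 8.39e6
BACKBONE_PARAMS = 2.7e9

def overhead_percent(extra_params, base_params):
    """Extra parameters expressed as a percentage of the base model size."""
    return 100.0 * extra_params / base_params

pct = overhead_percent(ALIGN_PROJECTOR_PARAMS, BACKBONE_PARAMS)
print(f"training-only overhead: {pct:.2f}% of the backbone")
```

The projector therefore adds roughly 0.3% extra parameters during training, and none at inference since it is discarded.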

Rating

Novelty: ⭐⭐⭐ Clear motivation, moderate technical novelty
Experimental Thoroughness: ⭐⭐⭐⭐ Comprehensive ablation analysis with convincing feature quality visualizations
Writing Quality: ⭐⭐⭐⭐ Well-structured with clearly articulated motivation
Value: ⭐⭐⭐⭐ Practically informative for the 3D VLM community