See&Trek: Training-Free Spatial Prompting for Multimodal Large Language Model¶
Conference: NeurIPS 2025 arXiv: 2509.16087 Code: None Area: Multimodal VLM Keywords: Spatial Understanding, Multimodal Large Language Model, Visual Prompting, Visual Odometry, Training-Free
TL;DR¶
This paper proposes See&Trek, a training-free and GPU-free spatial prompting framework that enhances spatial understanding in MLLMs through maximum semantic richness sampling and motion reconstruction, yielding gains of up to 3.5% on VSI-Bench.
Background & Motivation¶
Background: Multimodal large language models (MLLMs) have made remarkable progress on image understanding and VQA tasks, yet spatial reasoning remains a critical weakness, particularly in scenarios involving object localization, motion prediction, and physical interaction.
Limitations of Prior Work: Current MLLMs commonly adopt uniform temporal sampling strategies (e.g., extracting 8 or 32 frames) when processing video, which introduces two fundamental problems:
- Visual Homogeneity: Uniform sampling tends to select frames with few salient features (e.g., walls, ceilings), reducing the signal-to-noise ratio of the input frames.
- Unknown Motion: Relying solely on sampled frames without explicit ego-motion information prevents the model from inferring object motion and displacement, forcing it to fall back on commonsense priors acquired during pretraining.
Key Challenge: Spatial reasoning requires rich visual semantics and explicit motion information, yet existing pipelines lack both.
Key Insight: Leveraging off-the-shelf perception models (YOLO) and visual odometry (VO) to inject spatial cues into MLLMs without any training.
Core Idea: Visual diversity is increased through maximum semantic richness sampling, while camera trajectory is recovered via motion reconstruction and encoded onto keyframes.
Method¶
Overall Architecture¶
Given a video sequence, See&Trek proceeds in three steps: (1) detecting objects with YOLO and selecting semantically richest keyframes via Maximum Semantic Richness Sampling; (2) estimating the camera motion trajectory using ORB features and the essential matrix (Motion Reconstruction); (3) encoding motion information onto keyframes as spatiotemporal tokens (Spatiotemporal Encoding), which are then combined with text prompts and fed into the MLLM.
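Step (1), the semantic-richness keyframe selection, can be sketched as a greedy, segment-wise search. Below is a minimal Python sketch, assuming per-frame object-category sets (e.g., from a YOLO detector) are precomputed; the function and variable names are illustrative, not taken from the paper's code:

```python
def balanced_topk(frame_categories, k):
    """Select k temporally spread, semantically diverse keyframes.

    frame_categories: list of sets, one set of detected object
    categories per frame (e.g., produced by a YOLO detector).
    k: number of keyframes to select (assumed >= 2 here).
    Returns sorted frame indices.
    """
    n = len(frame_categories)
    # Anchor: the frame with the most distinct object categories.
    anchor = max(range(n), key=lambda t: len(frame_categories[t]))
    selected = [anchor]
    seen = set(frame_categories[anchor])

    # Split the timeline into k-1 segments and pick one frame per
    # segment, preferring category novelty, then raw object count.
    bounds = [round(i * n / (k - 1)) for i in range(k)]
    for lo, hi in zip(bounds, bounds[1:]):
        candidates = [t for t in range(lo, hi) if t not in selected]
        if not candidates:
            continue
        best = max(
            candidates,
            key=lambda t: (len(frame_categories[t] - seen),  # novelty first
                           len(frame_categories[t])),        # then richness
        )
        selected.append(best)
        seen |= frame_categories[best]
    return sorted(selected[:k])
```

The lexicographic key (novel categories first, total count as tiebreak) is one plausible reading of the paper's "fewest category overlaps, most objects" criterion.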
Key Designs¶
- Maximum Semantic Richness Sampling (Balanced-TopK):
    - Function: Select \(K\) keyframes from the video that are both semantically rich and temporally well-distributed.
    - Mechanism: YOLO is first applied to detect the object category set \(\mathcal{C}_t\) in each frame; the frame with the most distinct object categories is selected as the anchor frame. The valid interval is then divided into \(K-1\) temporal segments, and within each segment the frame with the least category overlap with already-selected frames and the largest object count is chosen.
    - Design Motivation: Naive TopK selection is biased toward temporally clustered high-object-count frames. Balanced-TopK addresses this temporal locality bias by combining temporal segmentation with category deduplication, balancing semantic richness against temporal diversity.
- Motion Reconstruction:
    - Function: Estimate the camera motion trajectory from monocular video.
    - Mechanism: ORB features are extracted and matched between adjacent frames; RANSAC is applied to estimate the essential matrix \(\mathbf{E}\), which is decomposed via SVD to obtain the relative rotation \(\mathbf{R}_t\) and translation \(\mathbf{T}_t\). The global trajectory is accumulated recursively as \(\mathbf{T}_t^{world} = \mathbf{R}_{t-1}^{world}\mathbf{T}_t + \mathbf{T}_{t-1}^{world}\).
    - Design Motivation: Explicit camera motion information enables MLLMs to reason about spatial relationships based on evidence rather than speculation.
- Spatiotemporal Encoding:
    - Function: Visually encode trajectory information onto keyframes.
    - Mechanism: Each keyframe is assigned a color marker (from a continuous colormap reflecting temporal order) and a frame index, overlaid directly on the upper-right corner of the image. BEV and 3D trajectory visualizations are additionally generated.
    - Design Motivation: This addresses the association problem between keyframes and motion trajectories: MLLMs cannot link independent frames to standalone trajectory plots, so spatiotemporal encoding bridges the gap through visual markers.
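The recursive trajectory accumulation in Motion Reconstruction can be made concrete. Here is a minimal NumPy sketch assuming the per-step relative poses \((\mathbf{R}_t, \mathbf{T}_t)\) have already been recovered (e.g., via OpenCV's `cv2.findEssentialMat` followed by `cv2.recoverPose` on matched ORB features); note that monocular VO recovers translation only up to scale:

```python
import numpy as np

def accumulate_trajectory(relative_poses):
    """Chain per-step relative poses into a global trajectory.

    relative_poses: iterable of (R_t, T_t) pairs, where R_t is a 3x3
    rotation and T_t a 3x1 translation of frame t relative to frame
    t-1 (e.g., recovered from the essential matrix).
    Implements T_t^world = R_{t-1}^world @ T_t + T_{t-1}^world.
    Returns the list of 3x1 world-frame camera positions.
    """
    R_world = np.eye(3)
    t_world = np.zeros((3, 1))
    positions = [t_world.copy()]
    for R, t in relative_poses:
        # Update position using the PREVIOUS world rotation, matching
        # the recursion above, then compose the rotation.
        t_world = R_world @ t + t_world
        R_world = R_world @ R
        positions.append(t_world.copy())
    return positions
```

The position update deliberately precedes the rotation update, so each step is expressed in the world frame of the previous camera pose, exactly as in the formula.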
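The spatiotemporal marker overlay can likewise be sketched with Pillow. This is a minimal illustration only; the red-to-blue colormap and the marker geometry are assumptions, not the paper's exact rendering:

```python
from PIL import Image, ImageDraw

def encode_keyframe(img, index, total):
    """Overlay a temporal-order color marker and frame index in the
    upper-right corner of a keyframe, leaving the input untouched.

    index: temporal position of this keyframe (0-based).
    total: total number of keyframes.
    """
    # Map temporal order onto a simple red -> blue colormap.
    frac = index / max(total - 1, 1)
    color = (int(255 * (1 - frac)), 0, int(255 * frac))
    out = img.copy()
    draw = ImageDraw.Draw(out)
    w, _ = out.size
    # Colored square marker with the frame index drawn on top.
    draw.rectangle([w - 40, 8, w - 8, 40], fill=color)
    draw.text((w - 36, 14), str(index), fill=(255, 255, 255))
    return out
```

Because the marker encodes order through color and the index through text, a downstream MLLM can in principle associate each keyframe with its position on a separately rendered trajectory plot colored with the same map.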
Loss & Training¶
Completely training-free. Only CPU-level computation is required for ORB matching and YOLO inference, with a single forward pass through the MLLM.
Key Experimental Results¶
Main Results — VSI-Bench¶
| Model | Baseline Avg. | +See&Trek | Gain |
|---|---|---|---|
| InternVL3-1B | 29.5 | 32.0 | +3.5% |
| InternVL3-8B | 40.2 | 43.2 | +3.0% |
| InternVL3-14B | 44.2 | 45.6 | +1.4% |
| Qwen2.5-VL-7B | 27.3 | 29.0 | +2.6% |
| LLaVA-OneVision-7B | 31.4 | 33.0 | +1.6% |
| Kimi-VL-A3B | 33.4 | 35.1 | +1.7% |
Ablation Study — STI-Bench¶
| Model | Baseline | +See&Trek | Static Gain | Dynamic Gain |
|---|---|---|---|---|
| InternVL3-8B | 35.2 | 36.5 | +0.5% | +3.4% |
| Qwen2.5-VL-7B | 30.3 | 33.2 | +1.8% | +4.3% |
Key Findings¶
- See&Trek yields the largest gains on Relative Direction (Rel. Dir.) and Approach Order (Appr. Order), the two subtasks most dependent on motion information.
- Minor performance drops are occasionally observed on Object Count (Obj. Count), as semantic sampling may miss frames in which certain object categories appear.
- Smaller models (1B/3B) benefit more, indicating that the method effectively compensates for limited spatial reasoning capacity in compact models.
- Fully training- and GPU-free; plug-and-play compatible with both open-source and proprietary models.
Highlights & Insights¶
- Zero-cost enhancement: Spatial understanding is improved purely through input augmentation without modifying model parameters or requiring additional training — a highly practical engineering approach.
- Elegant design of Balanced-TopK: By combining temporal segmentation with category deduplication, the strategy maximizes information content within a limited frame budget, and is transferable to other video understanding scenarios.
- Motion reconstruction as "free features": Classical CV visual odometry pipelines provide MLLMs with a structured spatial prior that is more reliable than end-to-end learned representations.
Limitations & Future Work¶
- VO relies on feature matching and may fail in texture-poor or fast-motion scenarios.
- Balanced-TopK depends on YOLO's detection capabilities and is ineffective for object categories outside YOLO's vocabulary.
- The method is applicable only to video inputs and cannot handle single-image spatial reasoning.
- Occasional negative effects on Object Count tasks suggest the need for task-adaptive activation of individual components.
Related Work & Insights¶
- vs. SpatialRGPT/LLaVA-3D: These methods require additional modalities such as depth maps or point clouds, whereas See&Trek operates on RGB video alone.
- vs. VideoRAG/VideoTree: These approaches require fine-tuning the MLLM or rely on VLMs for retrieval, incurring significant computational overhead; See&Trek is entirely training-free.
- Future work could integrate the See&Trek framework with depth estimation models to obtain more precise spatial information.
Rating¶
- Novelty: ⭐⭐⭐⭐ First zero-training spatial prompting framework with a concise and effective design.
- Experimental Thoroughness: ⭐⭐⭐⭐ Covers 10+ models and two benchmarks with comprehensive results.
- Writing Quality: ⭐⭐⭐⭐ Clear logical structure with highly informative figures and tables.
- Value: ⭐⭐⭐⭐ Strong engineering practicality; plug-and-play deployment.