Tool-Augmented Spatiotemporal Reasoning for Streamlining Video Question Answering Task¶
Conference: NeurIPS 2025 · arXiv: 2512.10359 · Code: GitHub · Area: Video Understanding / LLM Agent · Keywords: Video Question Answering, Tool Augmentation, Spatiotemporal Reasoning, MLLM, Progressive Localization
TL;DR¶
This paper proposes the STAR framework, which builds a video analysis toolbox of 22 tools and lets an LLM alternately invoke temporal and spatial tools to progressively localize a 3D Region of Interest (3D RoI) within the video, improving over a GPT-4o baseline by 8.2% on VideoMME and 4.6% on LongVideoBench.
Background & Motivation¶
- Background: Two dominant paradigms exist for VideoQA: Video-LLMs are inefficient because they must process large numbers of frames, while tool-augmented LLMs are limited by narrow tool coverage and weak scheduling.
- Limitations of Prior Work: Existing tool-augmented approaches suffer from: (1) tools covering only a single dimension (temporal or spatial), (2) imbalanced tool quantity and diversity, and (3) lack of effective scheduling strategies leading to disordered tool chains.
- Key Challenge: Jointly modeling intra-frame spatial relationships and cross-frame temporal dynamics is necessary, yet beyond the reach of existing methods.
- Goal: To construct a comprehensive video toolbox and design an alternating spatiotemporal reasoning framework.
- Key Insight: The localization of key video regions is formulated as a progressive narrowing process over a 3D RoI.
- Core Idea: A spatiotemporally interleaved tool chain — alternately applying temporal tools to narrow the temporal scope and spatial tools to narrow the spatial scope.
Method¶
Overall Architecture¶
Video Toolbox (22 tools: temporal + spatial + general) → STAR Framework (alternating spatiotemporal tool invocation) → LLM final reasoning and answer generation.
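A minimal sketch of this control flow, assuming hypothetical `llm.select_tool` / `llm.answer` planner methods and tool wrappers that return a text observation plus an updated RoI (all names are illustrative, not the paper's API):

```python
from dataclasses import dataclass

@dataclass
class RoI3D:
    """Current 3D Region of Interest: a temporal window plus a spatial box."""
    t_start: float                      # seconds
    t_end: float
    box: tuple = (0.0, 0.0, 1.0, 1.0)   # normalized (x1, y1, x2, y2)

def star_answer(video, question, toolbox, llm, max_steps=10):
    """STAR-style loop: the LLM alternately picks temporal and spatial
    tools that shrink the RoI until the question becomes answerable."""
    roi = RoI3D(t_start=0.0, t_end=video.duration)
    observations = []
    for _ in range(max_steps):
        decision = llm.select_tool(question, roi, observations, toolbox)
        if decision.action == "answer":          # LLM decides it has enough
            break
        tool = toolbox[decision.tool_name]
        result = tool(video, roi, decision.args)  # text + updated RoI
        observations.append(result.text)          # natural-language interface
        roi = result.roi                          # progressively narrowed
    return llm.answer(question, observations)
```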
Key Designs¶
- Video Toolbox (22 Tools):
- Temporal tools: Adaptive Keyframe Search (AKeyS), temporal grounding, adaptive frame sampling, scene cutting, etc.
- Spatial tools: YOLO-World object detection, Grounding DINO object localization, Patch Zoomer region magnification, image cropping, OCR, etc.
- General tools: VQA, image captioning, video summarization, etc.
- Design principles: spatial-temporal decoupling, natural language interfaces (e.g., converting bounding boxes to textual descriptions), dual granularity at segment and frame levels.
- STAR Framework (Spatiotemporal Reasoning):
- Function: The LLM autonomously and alternately invokes spatiotemporal tools to achieve progressive 3D RoI localization.
- Mechanism: Temporal tools first narrow the temporal range → spatial tools then narrow the spatial range → temporal tools narrow further → ... until the key region required to answer the question is identified.
- Properties: autonomy (LLM-driven decision-making), adaptability (adjusts to video length and content), and progressiveness (starts from a small number of frames and expands iteratively).
- Tool Chain Strategy Comparison: Spatiotemporal interleaving > spatial-temporal separation > shortcut strategies. The interleaving strategy ensures mutual feedback between temporal and spatial reasoning.
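The three chain strategies can be read as tool-order policies over the same toolbox; a minimal sketch of the difference (a simplification, not the paper's actual scheduler):

```python
def next_tool_type(step: int, strategy: str, n_temporal: int = 3) -> str:
    """Which dimension a chain strategy queries at a given step."""
    if strategy == "interleave":   # STAR: T, S, T, S, ... with mutual feedback
        return "temporal" if step % 2 == 0 else "spatial"
    if strategy == "separate":     # finish all temporal steps, then spatial
        return "temporal" if step < n_temporal else "spatial"
    if strategy == "shortcut":     # skip localization, answer directly
        return "general"
    raise ValueError(f"unknown strategy: {strategy}")
```

Under interleaving, a spatial observation at step k can redirect the temporal search at step k+1; the separated and shortcut policies forgo this feedback.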
Loss & Training¶
- No training required; the framework operates purely at inference time.
- GPT-4o serves as the core reasoning engine.
- Tools are built upon lightweight models such as YOLO-World and Grounding DINO.
- Progressive frame processing reduces computational cost.
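One way to picture the progressive frame processing: grow the frame budget geometrically, but spend it only inside the current temporal RoI rather than densely decoding the whole video up front. A self-contained sketch under that assumption:

```python
import numpy as np

def progressive_frames(t_start: float, t_end: float, round_idx: int,
                       base: int = 4) -> list[float]:
    """Timestamps to decode in this round: the budget doubles each round
    (4, 8, 16, ...) but is always spent inside the narrowed temporal RoI."""
    n = base * (2 ** round_idx)
    return np.linspace(t_start, t_end, num=n).tolist()
```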
Key Experimental Results¶
| Benchmark | STAR + GPT-4o (vs. GPT-4o, 32 frames) | Notes |
|---|---|---|
| VideoMME | +8.2% | Also surpasses the 7B Video-LLM (Qwen-VL-7B) |
| LongVideoBench | +4.6% | Significant improvement |
| EgoSchema | Best result | Highest frame efficiency |
Key Findings¶
- Spatiotemporally interleaved tool chain > separated tool chain > shortcut tool chain.
- As the number of sampled frames increases, STAR's accuracy improves consistently while maintaining the highest frame efficiency among the compared methods.
- Spatial tools (object detection + magnification) contribute most to fine-grained spatial questions.
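A Patch Zoomer-style magnification step is essentially a crop-and-upsample around a detected box, so that downstream VQA/OCR tools see fine-grained detail at higher resolution. A minimal PIL sketch (the paper's actual tool interface may differ):

```python
from PIL import Image

def patch_zoom(frame: Image.Image, box: tuple, scale: int = 2) -> Image.Image:
    """Crop the normalized (x1, y1, x2, y2) region and upsample it."""
    w, h = frame.size
    x1, y1, x2, y2 = box[0] * w, box[1] * h, box[2] * w, box[3] * h
    patch = frame.crop((int(x1), int(y1), int(x2), int(y2)))
    return patch.resize((patch.width * scale, patch.height * scale),
                        Image.LANCZOS)
```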
Toolbox Category Details¶
| Tool Type | Count | Representative Tools |
|---|---|---|
| Temporal Tools | 7 | AKeyS, Temporal Grounding, Scene Cutting |
| Spatial Tools | 8 | YOLO-World, Grounding DINO, Patch Zoomer |
| General Tools | 7 | VQA, Image Captioning, Video Summarization |
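The natural-language-interface design principle (tool outputs verbalized before reaching the LLM) could be organized as a registry like the following sketch; `run_grounding_dino` is a hypothetical stub, not the paper's code:

```python
def run_grounding_dino(frame, query):
    """Hypothetical wrapper around a Grounding DINO checkpoint; should
    return [(label, (x1, y1, x2, y2)), ...] in normalized coordinates."""
    raise NotImplementedError("plug a real detector in here")

def boxes_to_text(label_boxes) -> str:
    """Verbalize detector output so the LLM consumes text, not tensors."""
    if not label_boxes:
        return "Nothing detected."
    parts = [f"{label} at ({x1:.2f}, {y1:.2f})-({x2:.2f}, {y2:.2f})"
             for label, (x1, y1, x2, y2) in label_boxes]
    return "Detected: " + "; ".join(parts)

TOOLBOX = {
    # Every tool, regardless of category, is registered as a callable
    # returning natural language, which is what makes the box plug-and-play.
    "object_localization": lambda frame, query:
        boxes_to_text(run_grounding_dino(frame, query)),
}
```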
Tool Chain Strategy Comparison¶
| Strategy | VideoMME (Δ vs. interleaving) | LongVideoBench (Δ vs. interleaving) | Frame Efficiency |
|---|---|---|---|
| Spatiotemporal Interleaving | Best | Best | Highest |
| Spatial-Temporal Separation | −3.1% | −2.8% | Moderate |
| Shortcut | −5.4% | −4.9% | Lowest |
Highlights & Insights¶
- The plug-and-play design of 22 tools makes the system highly extensible.
- Progressive 3D RoI localization constitutes an intuitively clear paradigm for video understanding.
- Lightweight tool augmentation of GPT-4o surpasses dedicated Video-LLMs.
- Tool chain visualization demonstrates the interpretability of the reasoning process.
Limitations & Future Work¶
- Reliance on GPT-4o as the reasoning engine incurs high costs and limits scalability.
- Each tool invocation adds latency, so long tool chains noticeably increase total inference time.
- Errors across tools may propagate in a cascading manner, with early failures potentially invalidating the entire reasoning chain.
- No comparison with RL fine-tuning methods (e.g., TempSamp-R1), leaving the relative merit against end-to-end approaches unclear.
- Maintaining and updating 22 tools incurs non-trivial overhead, and tool quality varies.
- The toolbox lacks audio analysis tools for VideoQA tasks requiring auditory understanding.
- The maximum tool chain length is not constrained, potentially leading to excessively long chains that increase cost and error rates.
- Progressive 3D RoI localization may over-process very short videos (<5 seconds), introducing unnecessary computational overhead.
Related Work & Insights¶
- vs. DoraemonGPT: DoraemonGPT routes tool outputs through SQL-style queries over a symbolic memory, which is brittle in practice; STAR's natural-language tool interfaces are more reliable.
- vs. VideoAgent: VideoAgent primarily employs temporal tools; STAR covers both temporal and spatial dimensions.
- vs. Video-LLM (Qwen-VL): STAR augments GPT-4o with lightweight tools and surpasses dedicated Video-LLMs.
Supplementary Discussion¶
- The core innovation lies in moving tool use from a single dimension (temporal-only or spatial-only) to joint spatiotemporal reasoning, giving a more comprehensive view of the video.
- The experimental design covers diverse scenarios and baseline comparisons, and the reported gains are consistent across benchmarks.
- The modular design facilitates extension to related tasks and new datasets.
- Open-sourcing code and data is of significant value for community reproduction and subsequent research.
- Compared to concurrent work, this paper demonstrates greater depth in problem formulation and more comprehensive experimental analysis.
- The paper is clearly structured, moving cleanly from problem definition through method design to experimental validation.
- The computational overhead of the method is reasonable, making it deployable in practical applications.
- Future work may explore integration with additional modalities such as audio and 3D point clouds.
- Validating the scalability of the method on larger-scale data and models is an important direction for follow-up research.
Rating¶
- Novelty: ⭐⭐⭐⭐ The spatiotemporally interleaved tool chain framework is a novel design.
- Experimental Thoroughness: ⭐⭐⭐⭐ Multi-benchmark evaluation with tool chain strategy comparisons.
- Writing Quality: ⭐⭐⭐⭐ Clear illustrations and well-organized tool categorization.
- Value: ⭐⭐⭐⭐ Valuable reference for both video analysis assistants and tool learning research.