Spec-o3: A Tool-Augmented Vision-Language Agent for Rare Celestial Object Candidate Identification¶
Conference: ACL 2026 · arXiv: 2601.06498 · Code: Project HomePage · Area: LLM Agent · Keywords: Tool-augmented agent, interleaved multimodal chain-of-thought, spectral inspection, reinforcement learning, domain VLM
TL;DR¶
This paper proposes Spec-o3, a tool-augmented vision-language agent that simulates the spectral inspection workflow of professional astronomers via Interleaved Multimodal Chain-of-Thought (iMCoT). Through a two-stage training pipeline of cold-start SFT followed by outcome-based RL, Spec-o3 improves macro-F1 from 28.3% to 76.5% on rare celestial object identification, achieving ~50× speedup over manual inspection.
Background & Motivation¶
Background: Modern spectroscopic survey projects (LAMOST, SDSS, DESI) generate massive datasets. Constructing catalogs of rare celestial objects requires a two-stage pipeline — deep learning algorithms for candidate screening, followed by expert visual vetting. The vetting stage remains heavily dependent on human labor.
Limitations of Prior Work: (1) Deep learning classifiers produce opaque probability scores with poor out-of-distribution generalization, making them difficult for experts to trust; (2) post-hoc explanation methods (Grad-CAM, SHAP, etc.) yield coarse feature attributions that cannot be reliably mapped to astrophysical structures; (3) manual vetting does not scale — for example, the LAMOST CV catalog required experts to visually inspect 170,000 candidates, ultimately confirming only 323 targets.
Key Challenge: The volume of candidates from next-generation surveys will continue to grow exponentially, while the throughput of manual inspection cannot scale accordingly, creating a critical bottleneck in astronomy.
Goal: Design a trustworthy, highly generalizable automated vetting agent capable of inspecting spectra in the manner of a trained astronomer.
Key Insight: The expert inspection workflow is inherently a "look-and-think" process — first assessing the global morphology, then iteratively zooming into wavelength regions of interest to examine diagnostic features, and finally reaching a verdict. Combining VLMs with spectral visualization tools allows this iterative process to be faithfully simulated.
Core Idea: Interleaved Multimodal Chain-of-Thought (iMCoT) — alternating between textual reasoning and fine-grained spectral plots rendered by tools, coupled with two-stage post-training to achieve expert-level vetting capability.
Method¶
Overall Architecture¶
Built on Qwen2.5-VL, the agent receives a text prompt \(T_0\) (containing the discrimination query and expert diagnostic guidelines) along with an initial global spectral plot \(I_0\). The agent performs textual reasoning within <think>...</think>, issues tool calls to generate zoomed-in views \(I_{t+1}\) of local wavelength regions, and iterates until issuing a final verdict within <answer>...</answer>. The trajectory is formalized as \(\tau = (T_0, I_0, T_1, I_1, T_2, I_2, \ldots, T_N)\).
Key Designs¶
- Interleaved Multimodal Chain-of-Thought (iMCoT): The state is defined as \(s_t = \{I_{\leq t}, T_{\leq t}\}\). At each step, the agent autonomously decides whether to output an answer directly or to invoke the visualization tool \(Tool_t\) to acquire finer-grained evidence. The tool accepts a wavelength interval \(\Delta\lambda_t = (\lambda_t^{\min}, \lambda_t^{\max})\) and an optional diagnostic label \(l_t\), and returns a locally re-rendered plot. The design motivation is to faithfully replicate the astronomer's workflow of global assessment → local verification → final decision, making the reasoning process auditable and physically consistent.
- Cold-Start SFT: Approximately 4k spectra (SNR > 10) are sampled from the official LAMOST catalog, covering five rare object types (CV, CS, SS, MG, WD). GPT-5 first generates initial reasoning trajectories following expert guidelines; astronomers then conduct three rounds of review (initial screening → revision → final vote), yielding ~1k high-quality expert trajectories. Token-level loss masking is applied to tool-returned content during training to prevent the model from memorizing rendered outputs. The design motivation is to inject domain priors and tool-use capabilities from a small number of high-quality expert demonstrations.
- Outcome-Based Agentic RL: The GRPO framework is used for outcome-based RL, leveraging only labeled data (without complete trajectories). The reward is defined by prediction correctness and format compliance: correct + valid format \(\to r=1\); correct + format violation \(\to r=1-\alpha\); incorrect + valid format \(\to r=0\); incorrect + format violation \(\to r=-\alpha\). The design motivation is that cold-start performance is bounded by scarce expert trajectories, while RL exploits more abundant labeled data to further optimize tool-use strategies.
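The four reward cases can be written out directly. Below is a sketch of the outcome reward together with the group-relative advantage normalization GRPO applies over a rollout group; `alpha` is the format-penalty weight, whose value this summary does not give (0.5 here is a placeholder), and the function names are ours.

```python
import statistics

def outcome_reward(pred: str, gold: str, format_ok: bool, alpha: float = 0.5) -> float:
    """The four reward cases from the paper; alpha is the format-penalty
    weight (its actual value is not given in this summary)."""
    if pred == gold:
        return 1.0 if format_ok else 1.0 - alpha
    return 0.0 if format_ok else -alpha

def grpo_advantages(rewards: list[float]) -> list[float]:
    """Group-relative advantages as in GRPO: normalize each rollout's
    reward by the mean and std of its group (8 rollouts per question)."""
    mu = statistics.mean(rewards)
    sd = statistics.pstdev(rewards) or 1.0  # guard against a zero-variance group
    return [(r - mu) / sd for r in rewards]
```

Because the reward depends only on the final verdict and format, no trajectory-level supervision is needed during this stage, which is exactly why RL can consume the larger labeled-only pool.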
Loss & Training¶
Cold-start training uses standard SFT loss with token-level loss masking on tool-returned tokens. RL training uses GRPO (8 rollouts/question, up to 8 tool calls/trajectory). The two stages are applied sequentially. Base models are Qwen2.5-VL-3B/7B, trained on 8×H100 GPUs.
Key Experimental Results¶
Main Results (SpecVI-Bench, 5 Rare Object Classes)¶
| Model | CV F1 | CS F1 | SS F1 | MG F1 | WD F1 | Avg. F1 |
|---|---|---|---|---|---|---|
| GaiaNet (DL Expert Model) | 67.2 | 87.1 | 70.3 | 51.8 | 48.2 | 64.9 |
| o3 (OpenAI) | 57.1 | 53.1 | 53.3 | 60.0 | 37.8 | 52.3 |
| Qwen2.5-VL-7B (base) | 25.4 | 31.5 | 27.3 | 29.0 | 28.1 | 28.3 |
| S1-VL-32B-SFT | 60.7 | 42.8 | 43.7 | 36.3 | 27.4 | 42.2 |
| Spec-o3-7B | 81.0 | 80.2 | 84.5 | 83.4 | 53.6 | 76.5 |
Ablation Study¶
| # | SFT | RL | Tool | 3B F1 | 7B F1 |
|---|---|---|---|---|---|
| 0 (Full) | ✓ | ✓ | ✓ | 73.3 | 76.5 |
| 1 | ✗ | ✓ | ✓ | 35.7 (-37.6) | 40.5 (-36.0) |
| 2 | ✓ | ✗ | ✓ | 33.1 (-40.2) | 41.6 (-34.9) |
| 3 | ✓ | ✓ | ✗ | 43.5 (-29.8) | 55.8 (-20.7) |
Key Findings¶
- Mutual dependence of two-stage training: RL-only or SFT-only achieves only ~35–41% F1, whereas the combined pipeline jumps to 73–76% — cold-start provides domain priors while RL optimizes the tool-use strategy.
- Tools are critical: Removing tools drops the 7B model from 76.5% to 55.8% F1, confirming that static global views are insufficient to detect subtle diagnostic features.
- Zero-shot cross-survey generalization: Spec-o3 maintains 77–81% F1 on SDSS/DESI, while expert DL models degrade by 14–20%, indicating that Spec-o3 relies on transferable diagnostic evidence rather than survey-specific artifacts.
- Zero-shot cross-task generalization: The model achieves 76.4% F1 on unseen O/B/A-type spectra (o3: 60.9%), confirming that a general tool-assisted inspection paradigm has been learned.
- Inference efficiency: ~0.2s/sample on 8×H100, approximately 50× faster than expert manual inspection.
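As a back-of-envelope check (our arithmetic, not the paper's), the two reported numbers are mutually consistent: ~0.2 s/sample at a ~50× speedup implies manual inspection takes on the order of ten seconds per spectrum.

```python
# Our arithmetic, not the paper's: check the two reported figures against
# each other by recovering the implied manual inspection time.
agent_s_per_sample = 0.2      # reported throughput on 8xH100
speedup = 50                  # reported speedup vs. expert manual inspection
implied_manual_s = agent_s_per_sample * speedup  # ~10 s per spectrum
```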
Highlights & Insights¶
- This is the first work to apply the "think-with-image" paradigm to scientific data analysis, extending it from natural images to astronomical spectra and demonstrating the immense potential of tool-augmented VLMs in vertical domains.
- The data construction pipeline is exceptionally rigorous: GPT-5 generation → astronomer screening → revision → dual auditing → final vote, ensuring high cold-start data quality.
- Cold-start data efficiency is high: Reducing SFT data from ~1k to ~200 trajectories causes only marginal performance degradation (CV F1: 80.7 → 77.8).
- Tool-use behavior is sufficiently reliable after cold-start that no dedicated tool-use reward is required during RL.
Limitations & Future Work¶
- Evaluation is limited to a small set of rare object types; broader spectral subclasses remain uncovered.
- The expert inspection workflow is abstracted as a "zoom-and-reason" loop, whereas real catalog construction also requires cross-matching with external databases and other observational modalities.
- Cold-start still requires expert involvement; the barrier to extending the approach to new tasks or surveys is non-trivial (though the synthetic data pipeline demonstrates the potential to reduce this requirement).
- No production-ready risk control mechanisms (e.g., calibration, abstention, triage) have been provided.
- WD classification achieves a relatively low F1 of 53.6%, likely because white dwarf spectral features are more subtle and may require more refined diagnostic strategies.
Related Work & Insights¶
- o3's think-with-image: Spec-o3 extends this paradigm from general VQA to scientific data analysis, linking abstract diagnostics to visualized evidence.
- GRPO (DeepSeek-R1): Outcome-based RL is extended from mathematical reasoning to scientific tool-use scenarios.
- Key insight: In vertical-domain VLMs, the critical factor is not model scale but rather tool-use and reasoning paradigms aligned with domain-specific workflows.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ First iMCoT agent for astronomical spectroscopy; the two-stage training strategy is elegantly designed.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Cross-survey, cross-task, and edge-case generalization evaluations, combined with human expert assessment and ablation studies — extremely comprehensive.
- Writing Quality: ⭐⭐⭐⭐ Domain background is well introduced and the methodology is clearly described, though the barrier for non-astronomy readers is relatively high.
- Value: ⭐⭐⭐⭐⭐ Directly addresses a real bottleneck in astronomical observation; the ~50× speedup carries substantial engineering value, and the paradigm is transferable to other scientific domains.