STAR-R1: Multi-View Spatial TrAnsformation Reasoning by Reinforcing Multimodal LLMs

Conference: CVPR 2026
Paper: CVF Open Access
Area: Multimodal VLM
Keywords: Multi-view spatial reasoning, Reinforcement Learning, Cross-view correspondence, GRPO, Process supervision

TL;DR

STAR-R1 applies a two-stage training recipe, "process-supervised SFT cold start + reference-aware RL", to Qwen2.5-VL-7B. The model learns to mimic human behavior by first anchoring key references and then aligning them across views to reconstruct the scene. It significantly outperforms open-source models and several closed-source models on multi-view spatial understanding benchmarks such as TVR, MMSI-Bench, MindCube-Tiny, and SPAR-Bench.

Background & Motivation

Background: Reinforcement Learning (RL) has been proven to significantly enhance the reasoning capabilities of LLMs and MLLMs (evidenced by the surge of multimodal-R1 works following DeepSeek-R1). However, most efforts focus on mathematics, general VQA, or temporal video tasks. Multi-view spatial reasoning—where a model must establish object correspondences across multiple images from different perspectives and then infer consistent scene semantics—remains largely unexplored.

Limitations of Prior Work: The authors diagnosed a representative dual-view task, TVR (Transformation-Driven Visual Reasoning), and found existing approaches lacking. Supervised Fine-Tuning (SFT) tends to memorize transformation patterns in labels but fails at explicit spatial reasoning, leading to errors when perspectives shift (e.g., reporting non-existent changes like "2.color.cyan"). While vanilla RL (GRPO) encourages explicit cross-view correspondence, it frequently misses key objects or provides incorrect mappings during cold starts, and its output format is often inconsistent.

Key Challenge: The authors summarize this as "SFT memorizes, RL generalizes." Table 1 provides quantitative evidence: on ID sets without perspective changes, SFT achieves 84.2 TAcc, well above RL's 76.3. On OOD sets with perspective changes, however, SFT drops to 30.9 while RL reaches 53.9, a 23.0-point lead. Behavioral analysis shows that RL models explicitly establish cross-view object correspondences in 81% of OOD samples (compared to 67% in ID scenarios). This indicates more thorough cross-view verification under complex conditions, which the authors identify as the root of RL's robustness.

Goal / Key Insight: Since SFT provides structure and RL provides generalization, the goal is to combine their strengths. The Core Idea is to first inject a structured reasoning trajectory ("per-view analysis → cross-view mapping → spatial reasoning") via process-supervised SFT, then use reference-aware RL—providing fine-grained rewards for both reference selection and final answers—to allow the model to explore and solidify cross-view correspondences.

Method

Overall Architecture

STAR-R1 is a two-stage training framework based on Qwen2.5-VL-7B. It first conducts exploratory experiments on the TVR task to confirm that RL can induce human-like "anchoring → verification" behavior. This logic is then extended to real-world multi-view tasks via a three-step reasoning paradigm and two-stage training.

During inference, the model follows a fixed three-step process: ① Per-view reference analysis: Identifying key references in each image and encoding directional relationships as triplets [object 1, object 2, relation]; ② Cross-view spatial mapping: Comparing visual features and configurations to merge local relationships into a unified scene-level spatial map; ③ Spatial reasoning and answer inference: Reasoning on the reconstructed map to output a standardized <answer>...</answer>.

During training, Stage 1 uses Gemini-2.5-Pro to generate high-quality CoT data following the three-step format, keeping only correct samples for an SFT cold start (4.1k samples). Stage 2 applies RL (19.2k samples) where rewards target both "reference selection" and "answer accuracy." The precision of the reward design (especially dense rewards + dual penalties) is a key contribution derived from the TVR exploration.
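The result-oriented filtering in Stage 1 can be illustrated with a minimal sketch. It assumes each teacher-generated CoT ends with an `<answer>...</answer>` span and that a sample is kept only when that span matches the ground-truth answer. The `CotSample` class, its field names, and the case-insensitive string comparison are hypothetical choices made for illustration, not details taken from the paper.

```python
import re
from dataclasses import dataclass


@dataclass
class CotSample:
    # Hypothetical record layout; the paper does not specify one.
    reasoning: str  # teacher-generated chain of thought, including <answer> tags
    answer: str     # ground-truth answer for the question


ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def extract_answer(text):
    """Return the content of the first <answer> span, or None if absent."""
    match = ANSWER_RE.search(text)
    return match.group(1).strip() if match else None


def keep_sample(sample):
    """Keep a sample only if the teacher's final answer matches the label."""
    predicted = extract_answer(sample.reasoning)
    if predicted is None:
        return False
    # Assumption: answers are compared as case-insensitive strings.
    return predicted.lower() == sample.answer.strip().lower()


def filter_cot(samples):
    """Result-oriented filtering: drop CoTs whose final answer is wrong."""
    return [s for s in samples if keep_sample(s)]
```

A filter like this only checks the final answer, so a CoT with flawed intermediate steps but a correct answer would still pass.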

%%{init: {'flowchart': {'rankSpacing': 24, 'nodeSpacing': 28, 'padding': 6, 'wrappingWidth': 400}}}%%
flowchart TD
    A["Multi-view Images + Question"] --> B["Three-step Structured Reasoning Paradigm<br/>Per-view Analysis → Cross-view Mapping → Reasoning"]
    B --> C["Process-supervised SFT Cold Start<br/>Gemini-2.5-Pro CoT + Result-oriented Filtering"]
    C --> D["Reference-aware RL<br/>R_total = R_ans + R_ref"]
    D --> E["Dense Rewards + Dual Penalty<br/>Fine-grained Scoring / Anti-reward hacking"]
    E --> F["Cross-view Correspondence + Scene Reconstruction → Answer"]

Key Designs

1. SFT vs RL Diagnosis and Two-stage Integration: Process-supervised cold start for structure, RL for generalization

Pure SFT overfits to annotation patterns and fails under perspective changes, while pure RL misses objects and produces inconsistent formatting. The authors therefore combine the two. Stage 1 is a process-supervised cold start that forces each CoT to follow the three-step trajectory and uses result-oriented filtering to exclude incorrect reasoning paths. SFT establishes the "reasoning skeleton" and format, leaving "optimal trajectory exploration" to Stage 2 RL.

2. Fine-grained Dense Accuracy Reward: Converting sparse signals into dense signals

TVR requires a triplet (index, attribute, value) to be entirely correct to receive a point. Using binary rewards (1 for all correct, 0 otherwise) limits exploration efficiency. The authors refined the reward to the level of each transformation \(t_i\), applying tiered positive rewards based on the match:

\[R^{\text{pos}}(t_i)=\begin{cases}5.0,& t_i\text{ exact match};\\ 1.5,& (\text{index}_i,\text{attribute}_i)\text{ match};\\ 0.5,& \text{only index}_i\text{ match};\\ 0.0,& \text{otherwise}.\end{cases}\]

The total positive reward is \(R^{\text{pos}}=\sum_{i=1}^{n}R^{\text{pos}}(t_i)\). Formatting rewards ensure reasoning is within <think> and answers within <answer> (\(R^{\text{format}}=1\)). This progressive scoring provides clear climbing signals.
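The tiered scoring and the format reward can be sketched as follows. Several details are assumptions, since the text above does not state them: each predicted triplet is scored against the ground-truth triplet that gives it the highest tier, a non-compliant response gets a format reward of 0, and the regular expression is one plausible way to check the `<think>`/`<answer>` layout. The function names are illustrative.

```python
import re

# A transformation is a (index, attribute, value) triplet, e.g. (2, "color", "cyan").


def tier_reward(pred, gt_list):
    """Score one predicted triplet with the tiers 5.0 / 1.5 / 0.5 / 0.0.

    Assumption: the prediction is compared against every ground-truth
    triplet and receives the best tier it reaches.
    """
    best = 0.0
    for gt in gt_list:
        if pred == gt:
            return 5.0                      # exact match
        if pred[:2] == gt[:2]:
            best = max(best, 1.5)           # index and attribute match
        elif pred[0] == gt[0]:
            best = max(best, 0.5)           # only the index matches
    return best


def positive_reward(preds, gts):
    """R_pos: sum of per-transformation tier rewards."""
    return sum(tier_reward(p, gts) for p in preds)


FORMAT_RE = re.compile(
    r"\s*<think>.*?</think>\s*<answer>.*?</answer>\s*", re.DOTALL
)


def format_reward(response):
    """R_format = 1 for a compliant response (0 otherwise is an assumption)."""
    return 1.0 if FORMAT_RE.fullmatch(response) else 0.0


gts = [(2, "color", "cyan"), (0, "size", "large")]
preds = [
    (2, "color", "cyan"),      # exact match
    (0, "size", "small"),      # index and attribute match
    (1, "shape", "cube"),      # no match
    (2, "material", "metal"),  # only index matches
]
print(positive_reward(preds, gts))  # 5.0 + 1.5 + 0.0 + 0.5 = 7.0
```

Note that the last prediction still earns 0.5 even though it is wrong, which shows how purely positive tiers could be farmed by guessing.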

3. Dual Penalty Mechanism: Blocking reward hacking

To prevent models from enumerating all possible triplets to farm points, the authors introduced dual penalties: a deduction of \(-1.0\) for each incorrect prediction (counted by \(n_{\text{miss}}\)), and an additional penalty when the number of predicted transformations \(n_{\text{pred}}\) is less than the number of ground-truth transformations \(n_{\text{gt}}\):

\[R^{\text{pun}}=\begin{cases}-n_{\text{miss}}-(n_{\text{gt}}-n_{\text{pred}}),& n_{\text{pred}}<n_{\text{gt}};\\ -n_{\text{miss}},& \text{otherwise}.\end{cases}\]

The final accuracy reward is \(R^{\text{acc}}=R^{\text{pos}}+R^{\text{pun}}\).
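A short sketch of the penalty term makes the anti-enumeration effect concrete. The functions take the counts \(n_{\text{miss}}\), \(n_{\text{pred}}\), \(n_{\text{gt}}\) and the positive reward as inputs. How those counts are derived from a model response is not specified here, so the example values below are hand-picked for illustration.

```python
def penalty(n_miss, n_pred, n_gt):
    """R_pun: -1 per incorrect prediction, plus -1 per missing prediction
    when the model predicts fewer transformations than the ground truth."""
    if n_pred < n_gt:
        return -float(n_miss) - (n_gt - n_pred)
    return -float(n_miss)


def accuracy_reward(r_pos, n_miss, n_pred, n_gt):
    """R_acc = R_pos + R_pun."""
    return r_pos + penalty(n_miss, n_pred, n_gt)


# Honest answer: 2 ground-truth transformations, both predicted exactly.
print(accuracy_reward(r_pos=10.0, n_miss=0, n_pred=2, n_gt=2))   # 10.0

# Enumeration: the same 2 exact hits buried among 8 wrong guesses.
print(accuracy_reward(r_pos=10.0, n_miss=8, n_pred=10, n_gt=2))  # 2.0

# Under-prediction: 3 ground-truth transformations, only 1 predicted (exactly).
print(accuracy_reward(r_pos=5.0, n_miss=0, n_pred=1, n_gt=3))    # 3.0
```

In this illustration, enumerating wrong triplets cuts the reward from 10.0 to 2.0, while answering too conservatively is also penalized, so neither extreme is favored.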

4. Reference-aware RL Reward: Optimizing "answering" and "anchoring" simultaneously

In real-world tasks, binary answer signals are insufficient. The authors introduced complementary rewards: reference reward \(R^{\text{ref}}\) for accurately identifying key references across views, and result reward \(R^{\text{ans}}\) for the final answer:

\[R^{\text{total}}=R^{\text{ans}}+R^{\text{ref}}.\]

This forces the model to solidify cross-view grounding while pursuing the correct answer.

Loss & Training

Base model: Qwen2.5-VL-7B, 8×H20 GPUs. TVR exploration used single-stage RL for efficiency. Real-world tasks followed the full two-stage pipeline (4.1k SFT + 19.2k RL samples from MindCube and SPAR-7M). RL was based on GRPO with extended dense rewards. Response length curves show an initial drop followed by a gradual rise and stabilization, as the model learns to systematically compare objects concisely.
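For readers unfamiliar with GRPO, the sketch below shows the group-relative advantage computation that GRPO is generally known for: several responses are sampled for the same prompt, and each response's reward is normalized by the group mean and standard deviation. This is a generic illustration, not the authors' training code. The use of the population standard deviation and the `eps` constant are assumptions, and implementations differ on both.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of responses to the same prompt.

    Each advantage is (reward - group mean) / (group std + eps), so responses
    that beat their siblings get positive advantages and the rest negative.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; some code uses sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled responses for one prompt, scored with a dense reward.
rewards = [10.0, 2.0, 3.0, 1.0]
for r, a in zip(rewards, group_advantages(rewards)):
    print(f"reward={r:5.1f}  advantage={a:+.3f}")
```

With a binary reward, a group in which every response fails yields identical rewards and therefore zero advantages, which gives no learning signal. Dense rewards make such all-equal groups less likely.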

Key Experimental Results

Main Results

Benchmark / Task	Metric	STAR-R1 (7B)	Best Comparison	Gain (points)
TVR	TAcc↑	61.4	o3 36.0 / GPT-4o 23.5	+25.4 / +37.9
TVR	NDiff↓	0.3	Qwen2.5-VL-7B 1.5	-1.2
MMSI-Bench	Acc↑	31.4	GPT-4o 30.3	+1.1
MindCube-Tiny Rotation	Acc↑	98.5	Prev. SOTA 53.0	+45.5
MindCube-Tiny Around	Acc↑	82.8	Prev. SOTA 70.4	+12.4
SPAR-Bench ObjRel-OC-MV	Acc↑	86.0	SOTA 64.0	+22.0
SPAR-Bench ObjRel-OO-MV	Acc↑	76.7	SOTA 59.0 (Human 80)	+17.7

STAR-R1 achieved state-of-the-art results across all TVR metrics. On SPAR-Bench, it outperformed methods trained on 10× more data and approached the human level on the ObjRel-OO-MV task (76.7 vs. 80).

Ablation Study (TVR Reward Design)

Configuration	TAcc	NDiff↓	Note
STAR-R1 (Full)	61.4	0.31	Complete reward
w/o obj reward	58.0	0.37	Object reward removed; less efficient exploration
w/o attr reward	56.8	0.40	Larger drop than removing the object reward
w/o under-pred penalty	58.2	0.41	Loss of full exploration constraint
w/o incorrect penalty	54.3	0.44	Triggered enumerative reward hacking
w/ naive GRPO	54.5	0.43	Naive GRPO unsuited for TVR structure
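To make the relative impact of each component easier to compare, the snippet below recomputes the TAcc drop of every ablation against the full model, using only the numbers in the table above. The dictionary keys are shorthand labels chosen for this example.

```python
FULL_TACC = 61.4

# TAcc values copied from the ablation table.
ablations = {
    "w/o obj reward": 58.0,
    "w/o attr reward": 56.8,
    "w/o under-pred penalty": 58.2,
    "w/o incorrect penalty": 54.3,
    "w/ naive GRPO": 54.5,
}

# Drop in TAcc points relative to the full reward design, largest first.
drops = {name: round(FULL_TACC - tacc, 1) for name, tacc in ablations.items()}
for name, drop in sorted(drops.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<24} -{drop:.1f}")
```

Sorted this way, removing the incorrect-prediction penalty costs the most (7.1 points), followed closely by naive GRPO (6.9), then the attribute reward (4.6), the object reward (3.4), and the under-prediction penalty (3.2).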

Key Findings

Incorrect prediction penalty is the "anti-cheat" core: Without it, the model defaults to enumerating all triplets.
Attribute rewards are more critical than object rewards: Removing attribute-level scoring (56.8) caused a larger drop than removing object-level scoring (58.0).
Scale of complexity: Performance decreases as the number of objects increases, indicating that cross-view correspondence difficulty scales sharply with scene complexity.
Rotation task relies heavily on RL: STAR-R1 outperformed STAR-SFT by 44.5% on this task; removing \(R^{\text{ref}}\) led to a 17.5% drop.

Highlights & Insights

"SFT Memorizes, RL Generalizes" is a robust insight: Quantitatively verified through ID/OOD behavioral statistics, providing a clear methodology for balancing structure and generalization.
Serious approach to anti-reward hacking: Pairing tiered positive rewards with a dual penalty (one for incorrect predictions, one for missing predictions) is far more reliable than purely positive reinforcement, and the tactic is transferable to other RLVR tasks.
The structured reasoning paradigm serves as both interface and supervision: It unifies the CoT template with verifiable reward anchors.
Sample efficiency: Achieving SOTA with only 4k SFT and 19k RL samples against baselines using 10× more data validates the effectiveness of structured cold starts.

Limitations & Future Work

Task domain constraints: TVR uses synthetic CLEVR-style scenes; generalization to open-world multi-view tasks like long-range navigation requires further verification.
Dependency on closed-source teachers: Stage 1 CoTs are generated by Gemini-2.5-Pro, so the teacher's capability caps the quality of the cold-start data, and result-oriented filtering may discard valid reasoning paths that happen to end in incorrect final answers.
Manual reward design: The multi-tiered scoring is currently tailored specifically for TVR structures, increasing the cost of migration to non-structured answer formats.
Ground truth for reference labels: How \(R^{\text{ref}}\) is supervised in real-world tasks is left to the supplementary material, which may introduce uncertainty during replication.

vs LMM-R1 / Video-R1: While these explore multimodal RL for math or video, they are general-purpose. STAR-R1 is the first to optimize RL specifically for multi-view spatial understanding with reference-aware rewards.
vs Naive GRPO: STAR-R1 extends GRPO with dense rewards and penalties; the TAcc gap on TVR (61.4 vs 54.5) indicates that naive GRPO is insufficient for structured multi-step outputs.
Transferable Insight: Using explicit structured reasoning trajectories as both SFT templates and RL reward anchors can be extended to any multimodal task requiring verifiable intermediate steps (e.g., flowchart or diagram reasoning).

Rating

Novelty: ⭐⭐⭐⭐ Systematic application of RL to multi-view spatial reasoning.
Experimental Thoroughness: ⭐⭐⭐⭐ Four benchmarks plus extensive reward ablation and behavioral analysis.
Writing Quality: ⭐⭐⭐⭐ Clear "SFT vs RL" storyline.
Value: ⭐⭐⭐⭐ Demonstrates that an open-source 7B can achieve near-human performance in spatial intelligence using efficient RL strategies.