STAR-R1: Multi-View Spatial TrAnsformation Reasoning by Reinforcing Multimodal LLMs
Conference: CVPR 2026
Paper: CVF Open Access
Area: Multimodal VLM
Keywords: Multi-view spatial reasoning, Reinforcement Learning, Cross-view correspondence, GRPO, Process supervision
TL;DR
STAR-R1 trains Qwen2.5-VL-7B in two stages: a process-supervised SFT cold start followed by reference-aware RL. The model learns to mimic human behavior by first anchoring key references and then aligning them across views to reconstruct the scene. It outperforms open-source models and several closed-source models on multi-view spatial understanding benchmarks, including TVR, MMSI-Bench, MindCube-Tiny, and SPAR-Bench.
Background & Motivation
Background: Reinforcement Learning (RL) has been shown to significantly enhance the reasoning capabilities of LLMs and MLLMs, as evidenced by the surge of multimodal-R1 works following DeepSeek-R1. However, most efforts focus on mathematics, general VQA, or temporal video tasks. Multi-view spatial reasoning, where a model must establish object correspondences across multiple images from different perspectives and then infer consistent scene semantics, remains largely unexplored.
Limitations of Prior Work: The authors diagnosed a representative dual-view task, TVR (Transformation-Driven Visual Reasoning), and found existing approaches lacking. Supervised Fine-Tuning (SFT) tends to memorize transformation patterns in labels but fails at explicit spatial reasoning, leading to errors when perspectives shift (e.g., reporting non-existent changes like "2.color.cyan"). While vanilla RL (GRPO) encourages explicit cross-view correspondence, it frequently misses key objects or provides incorrect mappings during cold starts, and its output format is often inconsistent.
Key Challenge: The authors summarize this as "SFT memorizes, RL generalizes." Table 1 provides quantitative evidence: on ID sets without perspective changes, SFT achieves 84.2 TAcc, well above RL's 76.3. On OOD sets with perspective changes, however, SFT drops to 30.9 while RL reaches 53.9, a lead of 23.0 points. Behavioral analysis shows that RL models explicitly establish cross-view object correspondences in 81% of OOD samples (compared to 67% of ID samples), indicating more thorough cross-view verification under harder conditions, which underlies RL's robustness.
Goal / Key Insight: Since SFT provides structure and RL provides generalization, the goal is to combine their strengths. The Core Idea is to first inject a structured reasoning trajectory ("per-view analysis → cross-view mapping → spatial reasoning") via process-supervised SFT, then use reference-aware RL—providing fine-grained rewards for both reference selection and final answers—to allow the model to explore and solidify cross-view correspondences.
Method
Overall Architecture
STAR-R1 is a two-stage training framework built on Qwen2.5-VL-7B. The authors first run exploratory experiments on the TVR task to confirm that RL can induce human-like "anchoring → verification" behavior. They then extend this logic to real-world multi-view tasks via a three-step reasoning paradigm and two-stage training.
During inference, the model follows a fixed three-step process: ① Per-view reference analysis: Identifying key references in each image and encoding directional relationships as triplets [object 1, object 2, relation]; ② Cross-view spatial mapping: Comparing visual features and configurations to merge local relationships into a unified scene-level spatial map; ③ Spatial reasoning and answer inference: Reasoning on the reconstructed map to output a standardized <answer>...</answer>.
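The three-step paradigm can be pictured with a small data-structure sketch. Everything here is an illustrative assumption rather than the paper's implementation: the names `Triplet`, `merge_views`, and `correspondence`, as well as the toy objects, are hypothetical, and in STAR-R1 these steps happen inside the model's free-text reasoning.

```python
from typing import NamedTuple


class Triplet(NamedTuple):
    """A directional relation [object 1, object 2, relation] seen in one view."""
    obj1: str
    obj2: str
    relation: str


def merge_views(per_view, correspondence):
    """Merge per-view triplets into one scene-level set of relations.

    per_view: dict mapping view id -> list of Triplet with view-local names.
    correspondence: dict mapping (view id, local name) -> scene-level name.
    Simplification: relations stay in their original view frame; converting
    directions between viewpoints is omitted from this sketch.
    """
    scene = set()
    for view_id, triplets in per_view.items():
        for t in triplets:
            scene.add(Triplet(
                correspondence[(view_id, t.obj1)],
                correspondence[(view_id, t.obj2)],
                t.relation,
            ))
    return scene


# Step 1: per-view reference analysis (toy example).
per_view = {
    "view1": [Triplet("sofa", "lamp", "left_of")],
    "view2": [Triplet("floor lamp", "door", "in_front_of")],
}
# Step 2: cross-view mapping decides both lamp mentions are the same object.
correspondence = {
    ("view1", "sofa"): "sofa",
    ("view1", "lamp"): "lamp",
    ("view2", "floor lamp"): "lamp",
    ("view2", "door"): "door",
}
scene_map = merge_views(per_view, correspondence)
# Step 3: reasoning runs over the merged map, e.g. everything tied to the lamp.
lamp_relations = sorted(t for t in scene_map if "lamp" in (t.obj1, t.obj2))
```

The point of the sketch is that the shared anchor (the lamp) is what links the two views into one map, which is the behavior the reference-aware training is meant to reinforce.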
During training, Stage 1 uses Gemini-2.5-Pro to generate high-quality CoT data following the three-step format, keeping only correct samples for an SFT cold start (4.1k samples). Stage 2 applies RL (19.2k samples) where rewards target both "reference selection" and "answer accuracy." The precision of the reward design (especially dense rewards + dual penalties) is a key contribution derived from the TVR exploration.
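The result-oriented filtering in Stage 1 can be sketched as follows. This is a minimal illustration under stated assumptions: the sample keys `"cot"` and `"gt"`, the helper names, and the case-insensitive exact-match comparison are all hypothetical, since the source only says that correct samples are kept.

```python
import re

ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def extract_answer(cot_text):
    """Return the text inside the last <answer>...</answer> pair, or None."""
    matches = ANSWER_RE.findall(cot_text)
    return matches[-1].strip() if matches else None


def filter_correct(samples):
    """Keep only samples whose extracted answer equals the ground truth.

    Each sample is a dict with hypothetical keys "cot" (teacher output)
    and "gt" (ground-truth answer).
    """
    kept = []
    for sample in samples:
        pred = extract_answer(sample["cot"])
        if pred is not None and pred.lower() == sample["gt"].strip().lower():
            kept.append(sample)
    return kept


samples = [
    {"cot": "<think>...</think><answer>B</answer>", "gt": "B"},
    {"cot": "<think>...</think><answer>C</answer>", "gt": "A"},
    {"cot": "no tags at all", "gt": "A"},
]
kept = filter_correct(samples)  # only the first sample survives
```

Filtering on the final answer alone cannot tell sound reasoning from lucky reasoning, so a real pipeline might add a check that the three-step structure is present.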
%%{init: {'flowchart': {'rankSpacing': 24, 'nodeSpacing': 28, 'padding': 6, 'wrappingWidth': 400}}}%%
flowchart TD
A["Multi-view Images + Question"] --> B["Three-step Structured Reasoning Paradigm<br/>Per-view Analysis → Cross-view Mapping → Reasoning"]
B --> C["Process-supervised SFT Cold Start<br/>Gemini-2.5-Pro CoT + Result-oriented Filtering"]
C --> D["Reference-aware RL<br/>R_total = R_ans + R_ref"]
D --> E["Dense Rewards + Dual Penalty<br/>Fine-grained Scoring / Anti-reward hacking"]
E --> F["Cross-view Correspondence + Scene Reconstruction → Answer"]
Key Designs
1. SFT vs RL Diagnosis and Two-stage Integration: Process-supervised cold start for structure, RL for generalization. Pure SFT overfits to annotation patterns and fails under perspective changes, while pure RL misses objects and produces inconsistent formatting. The authors therefore chose a hybrid approach. Stage 1 uses a process-supervised cold start, forcing each CoT to follow the three-step trajectory and using result-oriented filtering to exclude incorrect reasoning paths. SFT establishes the "reasoning skeleton" and format, leaving "optimal trajectory exploration" to Stage 2 RL.
2. Fine-grained Dense Accuracy Reward: Converting sparse signals into dense signals
TVR requires a triplet (index, attribute, value) to be entirely correct to receive a point. Binary rewards (1 for all correct, 0 otherwise) therefore limit exploration efficiency. The authors refine the reward to the level of each transformation \(t_i\), applying tiered positive rewards according to how much of the triplet matches, with separate credit at the object and attribute levels.
The total positive reward is \(R^{\text{pos}}=\sum_{i=1}^{n}R^{\text{pos}}(t_i)\). Formatting rewards ensure reasoning is within <think> and answers within <answer> (\(R^{\text{format}}=1\)). This progressive scoring provides clear climbing signals.
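A sketch of the tiered positive reward and the format check may help make the "climbing signal" concrete. The tier values `(0.5, 1.5, 5.0)` are placeholders chosen for illustration, and the matching rule and function names are assumptions; the source states only that rewards are tiered per transformation and that \(R^{\text{format}}=1\) for well-formed output.

```python
import re

FORMAT_RE = re.compile(
    r"\s*<think>.*?</think>\s*<answer>.*?</answer>\s*", re.DOTALL
)


def format_reward(response):
    """R_format = 1 when reasoning is in <think> and the answer in <answer>."""
    return 1.0 if FORMAT_RE.fullmatch(response) else 0.0


def positive_reward(pred, gt, tiers=(0.5, 1.5, 5.0)):
    """Sum tiered rewards R_pos(t_i) over predicted transformations.

    pred, gt: lists of (index, attribute, value) triplets.
    tiers: placeholder values for (object only, object + attribute, full match).
    """
    obj_tier, attr_tier, full_tier = tiers
    gt_full = set(gt)
    gt_obj_attr = {(i, a) for i, a, _ in gt}
    gt_obj = {i for i, _, _ in gt}
    total = 0.0
    for i, a, v in dict.fromkeys(pred):  # ignore duplicate predictions
        if (i, a, v) in gt_full:
            total += full_tier
        elif (i, a) in gt_obj_attr:
            total += attr_tier
        elif i in gt_obj:
            total += obj_tier
    return total


gt = [(2, "color", "cyan"), (5, "size", "small")]
pred = [
    (2, "color", "cyan"),   # full match
    (5, "size", "large"),   # object and attribute match
    (5, "color", "red"),    # object match only
    (7, "shape", "cube"),   # no match, earns nothing
]
r_pos = positive_reward(pred, gt)
r_fmt = format_reward("<think>compare views</think><answer>2.color.cyan</answer>")
```

Under a binary reward this prediction would score 0; with tiers, partially correct triplets still contribute, which is the denser signal the paragraph describes.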
3. Dual Penalty Mechanism: Blocking reward hacking. To prevent models from enumerating all possible triplets to farm points, the authors introduce dual penalties: a deduction of \(-1.0\) for each incorrect prediction (counted by \(n_{\text{miss}}\)), and an additional penalty when the number of predicted transformations \(n_{\text{pred}}\) is less than the ground-truth count \(n_{\text{gt}}\).
The final accuracy reward is \(R^{\text{acc}}=R^{\text{pos}}+R^{\text{pun}}\).
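The effect of the penalties on enumeration can be checked with a small worked sketch. The \(-1.0\) per incorrect prediction comes from the text; the form and weight of the under-prediction term (`under_pred_weight`, linear in the shortfall) and the positive-reward values used in the examples are assumptions for illustration only.

```python
def punishment(n_miss, n_pred, n_gt, miss_penalty=-1.0, under_pred_weight=-1.0):
    """R_pun: a per-miss deduction plus an under-prediction term.

    miss_penalty follows the source (-1.0 per incorrect prediction).
    under_pred_weight and its linear form are hypothetical.
    """
    r_pun = miss_penalty * n_miss
    if n_pred < n_gt:
        r_pun += under_pred_weight * (n_gt - n_pred)
    return r_pun


def accuracy_reward(r_pos, n_miss, n_pred, n_gt):
    """R_acc = R_pos + R_pun."""
    return r_pos + punishment(n_miss, n_pred, n_gt)


# Ground truth has 2 transformations; assume each full match earns 5.0.
concise = accuracy_reward(r_pos=10.0, n_miss=0, n_pred=2, n_gt=2)
# Enumerating 20 triplets still hits both targets but adds 18 misses.
enumerate_all = accuracy_reward(r_pos=10.0, n_miss=18, n_pred=20, n_gt=2)
# Predicting only one (correct) transformation triggers the second penalty.
too_few = accuracy_reward(r_pos=5.0, n_miss=0, n_pred=1, n_gt=2)
```

With these numbers the concise answer scores 10.0, enumeration scores -8.0, and the under-predicting answer scores 4.0, so both failure modes rank below the complete, precise prediction.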
4. Reference-aware RL Reward: Optimizing "answering" and "anchoring" simultaneously. In real-world tasks, binary answer signals are insufficient. The authors add two complementary rewards, a reference reward \(R^{\text{ref}}\) for accurately identifying key references across views and a result reward \(R^{\text{ans}}\) for the final answer, combined as \(R^{\text{total}}=R^{\text{ans}}+R^{\text{ref}}\).
This forces the model to solidify cross-view grounding while pursuing the correct answer.
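One way the two signals might be combined is sketched below. Only the sum \(R^{\text{ans}}+R^{\text{ref}}\) comes from the source; how \(R^{\text{ref}}\) is actually scored is not specified in these notes, so the recall-style overlap used here, the exact-match answer check, and all function names are assumptions.

```python
def answer_reward(pred_answer, gt_answer):
    """R_ans: 1.0 for an exact (case-insensitive) match, else 0.0. Assumed form."""
    return 1.0 if pred_answer.strip().lower() == gt_answer.strip().lower() else 0.0


def reference_reward(pred_refs, gt_refs):
    """R_ref: fraction of annotated key references the model named.

    This recall-style score is a hypothetical stand-in for the paper's
    reference reward.
    """
    gt = {r.strip().lower() for r in gt_refs}
    if not gt:
        return 0.0
    pred = {r.strip().lower() for r in pred_refs}
    return len(pred & gt) / len(gt)


def total_reward(pred_answer, gt_answer, pred_refs, gt_refs):
    """R_total = R_ans + R_ref."""
    return answer_reward(pred_answer, gt_answer) + reference_reward(pred_refs, gt_refs)


# Correct answer, but only one of the two annotated references was anchored.
r_partial = total_reward("B", "B", ["sofa", "window"], ["sofa", "lamp"])
# Correct answer with both references anchored.
r_full = total_reward("B", "B", ["Sofa", "lamp"], ["sofa", "lamp"])
```

Because the reference term is graded separately, a response that guesses the right answer without grounding it scores lower than one that anchors the expected references.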
Loss & Training
Base model: Qwen2.5-VL-7B, trained on 8×H20 GPUs. TVR exploration used single-stage RL for efficiency. Real-world tasks followed the full two-stage pipeline (4.1k SFT + 19.2k RL samples from MindCube and SPAR-7M). RL was based on GRPO, extended with the dense rewards described above. Response-length curves show an initial drop followed by a gradual rise and stabilization, as the model learns to compare objects systematically yet concisely.
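For readers unfamiliar with GRPO, the step that turns these rewards into a learning signal is the group-relative advantage: each sampled response is scored against the other responses to the same prompt. The sketch below shows only that normalization; the function name, the `eps` term, and the use of the population standard deviation are assumptions, and the clipped policy-gradient objective and KL term are omitted.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of responses to the same prompt.

    A_i = (r_i - mean(r)) / (std(r) + eps). Implementations differ on
    population vs. sample standard deviation; population is used here.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled responses for one prompt, scored by a dense reward (toy values).
rewards = [7.0, -8.0, 4.0, 7.0]
advantages = group_relative_advantages(rewards)
# If every response receives the same reward, no response is preferred.
flat = group_relative_advantages([1.0, 1.0, 1.0])
```

This also suggests why dense rewards matter here: with a binary reward, groups in which every response fails produce identical scores and therefore zero advantage, whereas tiered scores still separate better attempts from worse ones.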
Key Experimental Results
Main Results
| Benchmark / Task | Metric | STAR-R1 (7B) | Best Comparison | Gain (absolute) |
|---|---|---|---|---|
| TVR | TAcc↑ | 61.4 | o3 36.0 / GPT-4o 23.5 | +25.4 / +37.9 |
| TVR | NDiff↓ | 0.3 | Qwen2.5-VL-7B 1.5 | -1.2 |
| MMSI-Bench | Acc↑ | 31.4 | GPT-4o 30.3 | +1.1 |
| MindCube-Tiny Rotation | Acc↑ | 98.5 | Prev. SOTA 53.0 | +45.5 |
| MindCube-Tiny Around | Acc↑ | 82.8 | Prev. SOTA 70.4 | +12.4 |
| SPAR-Bench ObjRel-OC-MV | Acc↑ | 86.0 | Prev. SOTA 64.0 | +22.0 |
| SPAR-Bench ObjRel-OO-MV | Acc↑ | 76.7 | Prev. SOTA 59.0 (Human 80) | +17.7 |
STAR-R1 achieved state-of-the-art results across all TVR metrics. On SPAR-Bench, it outperformed methods trained on 10× more data and approached the human level listed in the table (76.7 vs. 80 on ObjRel-OO-MV).
Ablation Study (TVR Reward Design)
| Configuration | TAcc | NDiff↓ | Note |
|---|---|---|---|
| STAR-R1 (Full) | 61.4 | 0.31 | Complete reward |
| w/o obj reward | 58.0 | 0.37 | Object-level reward removed; lower exploration efficiency |
| w/o attr reward | 56.8 | 0.40 | Attribute-level reward removed; larger drop than removing the object reward |
| w/o under-pred penalty | 58.2 | 0.41 | Loss of full exploration constraint |
| w/o incorrect penalty | 54.3 | 0.44 | Triggered enumerative reward hacking |
| w/ naive GRPO | 54.5 | 0.43 | Naive GRPO unsuited for TVR structure |
Key Findings
- Incorrect prediction penalty is the "anti-cheat" core: Without it, the model defaults to enumerating all triplets.
- Attribute rewards are more critical than object rewards: Removing attribute-level scoring (56.8) caused a larger drop than removing object-level scoring (58.0).
- Scale of complexity: Performance decreases as the number of objects increases, indicating that cross-view correspondence difficulty scales sharply with scene complexity.
- Rotation task relies heavily on RL: STAR-R1 outperformed STAR-SFT by 44.5 points on this task; removing \(R^{\text{ref}}\) led to a 17.5-point drop.
Highlights & Insights
- "SFT Memorizes, RL Generalizes" is a robust insight: Quantitatively verified through ID/OOD behavioral statistics, providing a clear methodology for balancing structure and generalization.
- Serious approach to anti-reward hacking: Countering enumeration with a dual penalty is more reliable than purely positive reinforcement, and the tactic is transferable to other RLVR tasks.
- The structured reasoning paradigm serves as both interface and supervision: It unifies the CoT template with verifiable reward anchors.
- Sample efficiency: Achieving SOTA with only 4k SFT and 19k RL samples against baselines using 10× more data validates the effectiveness of structured cold starts.
Limitations & Future Work
- Task domain constraints: TVR uses synthetic CLEVR-style scenes; generalization to open-world multi-view tasks like long-range navigation requires further verification.
- Dependency on closed-source teachers: Stage 1 CoTs are generated by Gemini-2.5-Pro, so the teacher caps the quality of the cold-start data, and result-oriented filtering may discard valid reasoning paths that happen to end in incorrect final answers.
- Manual reward design: The multi-tiered scoring is currently tailored specifically for TVR structures, increasing the cost of migration to non-structured answer formats.
- Ground truth for reference labels: How \(R^{\text{ref}}\) is supervised in real-world tasks is described only in the supplementary material, which may introduce uncertainty during replication.
Related Work & Insights
- vs LMM-R1 / Video-R1: While these explore multimodal RL for math or video, they are general-purpose. STAR-R1 is the first to optimize RL specifically for multi-view spatial understanding with reference-aware rewards.
- vs Naive GRPO: STAR-R1 extends GRPO with dense rewards and penalties; the performance gap on TVR (61.4 vs 54.5 TAcc) indicates that naive GRPO is insufficient for structured multi-step outputs.
- Transferable Insight: Using explicit structured reasoning trajectories as both SFT templates and RL reward anchors can be extended to any multimodal task requiring verifiable intermediate steps (e.g., flowchart or diagram reasoning).
Rating
- Novelty: ⭐⭐⭐⭐ Systematic application of RL to multi-view spatial reasoning.
- Experimental Thoroughness: ⭐⭐⭐⭐ Four benchmarks plus extensive reward ablation and behavioral analysis.
- Writing Quality: ⭐⭐⭐⭐ Clear "SFT vs RL" storyline.
- Value: ⭐⭐⭐⭐ Demonstrates that an open-source 7B can achieve near-human performance in spatial intelligence using efficient RL strategies.