STELAR-Vision: Self-Topology-Aware Efficient Learning for Aligned Reasoning in Vision
Conference: AAAI 2026 arXiv: 2508.08688 Code: stellar-neuron.github.io/stelar-vision Area: Reinforcement Learning Keywords: Topology-aware reasoning, vision-language models, chain/tree/graph of thought, reinforcement learning, efficient inference
TL;DR
This paper proposes STELAR-Vision, a topology-aware training framework for visual language reasoning. Via the TopoAug data generation pipeline, it introduces diverse reasoning topologies—Chain, Tree, and Graph—and combines SFT with RL (SimPO) post-training. The framework achieves +9.7% accuracy on in-distribution data and up to +28.4% on out-of-distribution benchmarks, while reducing output length by 18.1% through Frugal Learning.
Background & Motivation
State of the Field
Current vision-language models (VLMs) for reasoning rely predominantly on the Chain-of-Thought (CoT) paradigm. The authors identify a critical issue: different problems are better suited to different reasoning topologies. CoT is only one of many possible reasoning structures; Tree- and Graph-based reasoning demonstrably outperforms CoT on certain problem types.
Limitations of Prior Work
Through systematic evaluation of Qwen2-VL-7B and GPT-4o-Mini on the MATH-V dataset, the authors find:
- Chain wins 49% of cases, Tree 28%, Graph 23%—Tree and Graph together account for the majority
- Different subjects favor different topologies: Tree/Graph reasoning is notably superior in graph theory, statistics, and related disciplines
- Chain reasoning is the most verbose: its output token length distribution is heavily right-skewed with the highest mean; Tree/Graph distributions are more concentrated and concise
Root Cause
- Training data for existing VLMs is almost exclusively CoT-style, causing models to default to chain reasoning even when it is suboptimal
- CoT reasoning is prone to "overthinking," generating unnecessarily lengthy outputs
- Topological diversity correlates with output length—introducing Tree/Graph topologies naturally reduces output redundancy
- Hypothesis 1: Training on topologically diverse data (without increasing data volume) enables models to adaptively select the optimal topology
- Hypothesis 2: Building on this, a mechanism can be designed to encourage concise outputs, substantially improving efficiency with minimal accuracy loss
Method
Overall Architecture
STELAR-Vision comprises three core components:
1. TopoAug: a synthetic data generation pipeline that produces reasoning responses with multiple topology structures (Chain, Tree, and Graph) for each problem
2. Two-stage post-training: SFT followed by RL (SimPO)
3. Frugal Learning: a training variant that incentivizes concise outputs
Key Designs
1. TopoAug: Topology-Augmented Data Generation Pipeline
For each problem, two models (Qwen2-VL-7B-Instruct and GPT-4o-Mini) are used to repeatedly generate reasoning responses across three topologies: Chain, Tree, and Graph. Each topology supports configurable parameters such as maximum depth, number of child nodes, and number of neighbors.
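As an illustration, the configurable topology parameters could be organized as follows; `TopologyConfig` and its field names are hypothetical, mirroring only the knobs named above (maximum depth, number of child nodes, number of neighbors), not the authors' code:

```python
from dataclasses import dataclass

@dataclass
class TopologyConfig:
    name: str             # "chain", "tree", or "graph"
    max_depth: int        # maximum reasoning depth
    n_children: int = 1   # branching factor (trees/graphs)
    n_neighbors: int = 0  # cross-links between nodes (graphs only)

# Illustrative settings, not the paper's actual hyperparameters
CONFIGS = [
    TopologyConfig("chain", max_depth=8),
    TopologyConfig("tree", max_depth=4, n_children=3),
    TopologyConfig("graph", max_depth=4, n_children=2, n_neighbors=2),
]

def build_prompt(question: str, cfg: TopologyConfig) -> str:
    """Render a topology-specific instruction for the generator model."""
    return (f"Solve the problem using {cfg.name}-structured reasoning "
            f"(depth <= {cfg.max_depth}, branching <= {cfg.n_children}).\n"
            f"{question}")
```

Each config is used to repeatedly sample responses from both generator models, yielding a pool of per-problem, per-topology responses.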
Two types of labels are computed per problem:
- Topology label \(\mathcal{F}_{q,t}\): a continuous value in \([0,1]\) representing the accuracy of topology \(t\) on problem \(q\):

  \[\mathcal{F}_{q,t} = \frac{N_{\text{correct}}(q, t)}{N_{\text{total}}(q, t)}\]

- Outcome label \(\mathcal{H}_r\): a binary value in \(\{0, 1\}\) indicating whether each individual response is correct
Problems are categorized into three difficulty levels based on the topology label distribution:
- Easy: all three topology scores exceed the 85th percentile
- Hard: all three topology scores fall below the 15th percentile
- Medium: all remaining cases
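The label computation and difficulty bucketing might look like the sketch below; pooling scores across all problems to compute the 15th/85th percentiles is an assumption, since the paper does not specify the percentile pool:

```python
import numpy as np

def topology_label(n_correct: int, n_total: int) -> float:
    """F_{q,t}: empirical accuracy of topology t on problem q."""
    return n_correct / n_total

def difficulty(scores_per_problem: dict, lo_pct=15, hi_pct=85) -> dict:
    """Bucket problems into easy/medium/hard from their topology labels.
    scores_per_problem: {q: {"chain": f, "tree": f, "graph": f}}
    Assumption: percentiles are taken over the pooled scores of all problems.
    """
    all_scores = np.array([s for d in scores_per_problem.values()
                           for s in d.values()])
    lo, hi = np.percentile(all_scores, [lo_pct, hi_pct])
    labels = {}
    for q, d in scores_per_problem.items():
        vals = list(d.values())
        if all(v > hi for v in vals):      # all three topologies score high
            labels[q] = "easy"
        elif all(v < lo for v in vals):    # all three topologies score low
            labels[q] = "hard"
        else:
            labels[q] = "medium"
    return labels
```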
Using two models of different scales ensures a balanced distribution of positive and negative samples and promotes diversity in reasoning topologies.
2. Two-Stage Post-Training
Stage 1: SFT (Supervised Fine-Tuning)
Data preparation follows a three-step filtering process:
1. Balanced sampling from Easy/Medium/Hard problems
2. Retaining only responses with outcome label \(\mathcal{H}_r = 1\)
3. Rejection sampling with a 7B ORM (Outcome Reward Model trained on both topology and outcome labels) to select higher-quality samples
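The three filtering steps can be sketched as follows; `orm_score` stands in for the 7B outcome reward model, and `per_bucket`/`keep_top` are illustrative knobs, not values from the paper:

```python
import random

def filter_sft_data(samples, orm_score, per_bucket=1000, keep_top=0.5):
    """Three-step SFT data filtering (sketch).
    samples: list of dicts with keys 'difficulty', 'correct', 'response'.
    orm_score: callable scoring a response (proxy for the 7B ORM).
    """
    kept = []
    for bucket in ("easy", "medium", "hard"):
        # 1) balanced sampling across difficulty buckets
        pool = [s for s in samples if s["difficulty"] == bucket]
        pool = random.sample(pool, min(per_bucket, len(pool)))
        # 2) keep only correct responses (outcome label H_r = 1)
        pool = [s for s in pool if s["correct"]]
        # 3) rejection sampling: rank by ORM score, keep the top fraction
        pool.sort(key=lambda s: orm_score(s["response"]), reverse=True)
        kept.extend(pool[: int(len(pool) * keep_top)])
    return kept
```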
TopoAug data is mixed with three general VQA datasets (OKVQA, A-OKVQA, LLaVA-150k), which receive no topology augmentation. LoRA fine-tuning is applied with the standard next-token prediction loss.
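In standard notation (not reproduced from the paper), this next-token prediction loss is:

\[
\mathcal{L}_{\text{SFT}} = -\,\mathbb{E}_{(x,y)\sim\mathcal{D}}\sum_{t=1}^{|y|}\log \pi_\theta\left(y_t \mid x,\, y_{<t}\right)
\]

where \(x\) is the image-question input and \(y\) a filtered reasoning response.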
Stage 2: RL (Reinforcement Learning)
Starting from the SFT checkpoint, preference optimization is performed with SimPO.
Correct responses serve as preferred responses. A key detail: topology prompts are removed during training, compelling the model to autonomously infer the optimal reasoning structure at test time.
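For reference, SimPO's objective uses the length-normalized log-likelihood itself as an implicit reward, with no reference model (this is the standard form of the SimPO loss, not copied from this paper):

\[
\mathcal{L}_{\text{SimPO}} = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\left[\log \sigma\!\left(\frac{\beta}{|y_w|}\log \pi_\theta(y_w \mid x) - \frac{\beta}{|y_l|}\log \pi_\theta(y_l \mid x) - \gamma\right)\right]
\]

where \(y_w\) is a correct (preferred) response, \(y_l\) a dispreferred one, and \(\gamma\) a target reward margin. The \(1/|y|\) length normalization is what lets the objective align with decoding behavior.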
3. Frugal Learning: Training Variants for Efficient Inference
Two variants are proposed:
- STELAR-Vision-Short†: During SFT, only "short and correct" responses (token length below the 25th percentile) are retained; during RL, "short and correct" responses serve as winners
- STELAR-Vision-Short‡: Extends Short† by additionally treating "correct but verbose" responses as losers, penalizing both incorrect and excessively long outputs
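A minimal sketch of the Short† selection rule, assuming the 25th-percentile cutoff is computed over all sampled responses for a problem (the paper does not spell out the percentile pool):

```python
import numpy as np

def select_short_winners(responses, pct=25):
    """Short-dagger filter: keep responses that are both correct and shorter
    than the pct-th percentile of token lengths.
    responses: list of (token_length, is_correct) pairs for one problem.
    """
    lengths = np.array([length for length, _ in responses])
    cutoff = np.percentile(lengths, pct)  # e.g. 25th percentile of lengths
    return [(length, ok) for length, ok in responses if ok and length < cutoff]
```

During SFT these survivors are the training targets; during RL they serve as the preferred (winner) responses in the SimPO pairs.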
Loss & Training
- Base model: Qwen2-VL-7B-Instruct
- SFT data volume: approximately 50K–60K samples
- SFT training time: approximately 5–7 hours (8×A100/H100); RL training time: approximately 8–10 hours
- A single set of weights is evaluated on all 5 OOD datasets, without any dataset-specific fine-tuning
Key Experimental Results
Main Results
| Model | VLM_S2H | MATH-V | Overall (ID) | Geometry3K | We-Math | PolyMath | SciBench | LogicVista |
|---|---|---|---|---|---|---|---|---|
| GPT-4o | 32.0 | 28.0 | 30.7 | 57.0 | 66.4 | 25.0 | 31.1 | 34.6 |
| Qwen2VL-7B-Instruct | 21.0 | 13.0 | 18.3 | 35.2 | 46.6 | 16.0 | 10.7 | 17.0 |
| Qwen2VL-72B-Instruct | 21.0 | 20.0 | 20.7 | 50.2 | 60.6 | 13.0 | 25.4 | 28.8 |
| Chain-Only | 25.0 | 21.0 | 23.7 | 31.4 | 42.2 | 17.2 | 10.7 | 25.4 |
| STELAR-Vision | 31.0 | 22.0 | 28.0 | 36.8 | 51.0 | 23.8 | 12.4 | 29.0 |
STELAR-Vision surpasses the base model by +9.7% on in-distribution data and outperforms the 10× larger Qwen2VL-72B-Instruct by +7.3%.
Ablation Study
| Model | SFT | RL | VLM_S2H | MATH-V | Overall |
|---|---|---|---|---|---|
| Qwen2VL-7B-Instruct | × | × | 21.0 | 13.0 | 18.3 |
| Chain-Only-SFT | ✓ | × | 18.5 | 19.0 | 18.7 |
| Chain-Only | ✓ | ✓ | 25.0 | 21.0 | 23.7 |
| STELAR-Vision-SFT | ✓ | × | 28.0 | 24.0 | 26.7 |
| STELAR-Vision | ✓ | ✓ | 31.0 | 22.0 | 28.0 |
Overall gain of topology augmentation over Chain-Only: 23.7% → 28.0% (+4.3%).
Frugal Learning efficiency comparison:
| Model | Accuracy (%) | ID Generated Tokens | OOD Generated Tokens |
|---|---|---|---|
| Qwen2VL-7B-Instruct | 26.2 | 613.5 | 543.3 |
| Chain-Only | 28.7 | 878.4 | 742.6 |
| STELAR-Vision | 31.6 | 556.7 | 523.4 |
| STELAR-Vision-Short† | 28.7 | 455.7 | 498.6 |
STELAR-Vision-Short† reduces output length by 18.1% while still outperforming the base model by +2.5%.
Key Findings
- Topological diversity expands the exploration space for RL: RL yields consistent gains on TopoAug data, whereas its benefit diminishes with Chain-Only data
- Post-training models autonomously select topologies (without explicit prompting): on Geometry3K, 96.4% of responses use Tree; on LogicVista, 61.7% use Chain—indicating that the model has genuinely learned to select the optimal structure based on problem characteristics
- SFT may overfit, causing SFT-only variants to underperform the full model on certain OOD datasets
- Frugal Learning fails for Chain-Only-Short†: after RL fine-tuning, Chain-Only models tend to generate verbose responses, and Frugal Learning alone cannot effectively constrain output length
Highlights & Insights
- Systematic validation of the value of reasoning topology diversity—different problems genuinely require different reasoning structures, offering a compelling rebuttal to the assumption that CoT is universally optimal
- Punching above its weight: the 7B model outperforms the 72B model by 7.3%, demonstrating that training paradigm matters more than model scale
- Frugal Learning is only effective when built on topological diversity—Chain-only training cannot simultaneously achieve conciseness and accuracy
- The topology selection distributions on OOD datasets closely align with problem structure (simple logic → Chain; complex geometry → Tree/Graph), confirming genuine generalization
Limitations & Future Work
- The current topology types are predefined as {Chain, Tree, Graph}; more flexible end-to-end topology discovery remains unexplored
- Incompatibility with Qwen2.5-VL (which is unstable when generating diverse topologies) limits base-model selection
- The dynamic relationship between problem structure and optimal topology has not been deeply investigated
- The Short‡ variant of Frugal Learning (which additionally penalizes correct but verbose responses) performs worse, suggesting conflicting optimization signals
- OOD generalization is strong overall, but gains on certain specific datasets (e.g., SciBench) are limited
Related Work & Insights
- CoT, ToT, and GoT have previously been employed primarily through sampling or rule-based methods; this paper is the first to incorporate multi-topology training into a VLM post-training framework
- The use of SimPO is a judicious choice: it requires no separate reward model and aligns naturally with decoding behavior
- Complementary to CuRPO: CuRPO identifies CoT as harmful in visual grounding and mitigates this via curriculum learning, while this paper finds CoT insufficient and introduces additional topologies as alternatives
- The Frugal Learning direction is related to L1 (RL-based reasoning length control) and SelfBudgeter, but is simpler in design
Rating
- Novelty: ⭐⭐⭐⭐⭐ — Introducing multi-topology reasoning structures into VLM post-training is a novel and persuasive contribution
- Experimental Thoroughness: ⭐⭐⭐⭐ — Covers 7 datasets (including 5 OOD), multiple baselines, and complete ablations; however, experiments are limited to the 7B scale
- Writing Quality: ⭐⭐⭐⭐ — Systematic analysis with rich figures and tables, though the extended version is lengthy
- Value: ⭐⭐⭐⭐⭐ — The approach of achieving strong performance with a small model is highly practical, and the TopoAug pipeline can be directly reused