Vision-Zero: Scalable VLM Self-Improvement via Strategic Gamified Self-Play¶
- Conference: ICLR 2026
- arXiv: 2509.25541
- Code: GitHub
- Area: Multimodal VLM
- Keywords: VLM, Self-Play, Reinforcement Learning, Zero-Shot, Gamification, Self-Improvement
TL;DR¶
This paper proposes Vision-Zero, the first annotation-free gamified self-play framework for VLMs. By casting visual reasoning as a "Who is the Spy?"-style game and combining it with the Iterative-SPO training algorithm, Vision-Zero achieves scalable self-improvement and surpasses SOTA methods trained on human-annotated data across reasoning, chart understanding, and vision-centric tasks.
Background & Motivation¶
Current VLM training faces two core bottlenecks:
Data scarcity: Multimodal annotation is prohibitively expensive (COCO Attributes: $60,480 for 200K objects; Ego4D: >250K annotation hours).
Knowledge ceiling: Model capability is bounded by the upper limit of human annotations, preventing the discovery of strategies beyond human experience.
Self-play has demonstrated the ability to break knowledge ceilings in domains such as Go (AlphaGo) and esports (OpenAI Five). However, extending self-play to VLMs is non-trivial: the game environment must simultaneously handle visual and language modalities while satisfying requirements of skill alignment, scalable difficulty, diversity, and low data demand.
Vision-Zero's design philosophy draws inspiration from the social deduction game "Who is the Spy?": civilians observe a real image while the spy receives a blank input, and the two sides engage in interactive strategic gameplay, enabling the model to autonomously generate its own training data.
Method¶
Game Environment¶
Roles: \(n_c\) civilians (observing the real image \(I_c\)) + 1 spy (receiving a blank image \(I_s\)).
Two-stage gameplay:
Clue Stage:
- Each player provides a language clue based on their role and observation.
- The spy must infer the hidden image content solely from the civilians' clues and disguise its identity.
- Civilians must provide accurate clues while minimizing information leakage to the spy.
Decision Stage:
- Civilians analyze all clues along with their own image and vote to identify the spy.
- The spy does not participate in voting.
- Players may respond with "n/a" to indicate uncertainty.
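To make the flow concrete, here is a minimal sketch of one game round in Python. The `model.generate` interface, the vote format, and the reward constants attached at the end are illustrative assumptions, not the paper's actual implementation:

```python
import random

BLANK_IMAGE = None  # the spy observes a blank input I_s instead of I_c

def play_round(model, image, n_civilians=3):
    """One 'Who is the Spy?' round: clue stage, then decision stage."""
    spy = random.randrange(n_civilians + 1)  # secretly assign the spy role
    clues = []

    # Clue stage: each player speaks once; the spy sees only prior clues.
    for player in range(n_civilians + 1):
        obs = BLANK_IMAGE if player == spy else image
        clues.append(model.generate(role="clue", image=obs, history=clues))

    # Decision stage: civilians vote for a suspect; the spy does not vote.
    votes = {}
    for player in range(n_civilians + 1):
        if player == spy:
            continue
        # Expected output: a player index, or "n/a" to signal uncertainty.
        votes[player] = model.generate(role="vote", image=image, history=clues)

    # Decision-stage rewards per the paper: +1 correct, -0.5 "n/a", -1 wrong.
    rewards = {p: 1.0 if v == spy else (-0.5 if v == "n/a" else -1.0)
               for p, v in votes.items()}
    return clues, votes, rewards
```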
Annotation-Free, Domain-Agnostic Data Input¶
Training requires only arbitrary images. Three data types are validated experimentally:
- CLEVR data: 2,000 automatically rendered images (4–6 random objects each).
- Chart data: 1,000 ChartQA images.
- Real-world data: 1,000 ImgEdit images.
Iterative Self-Play Policy Optimization (Iterative-SPO)¶
Clue Stage — Self-Play Optimization:
Zero-sum rewards:
The sum of the spy's and civilians' rewards is zero; a player who attracts more votes receives a lower reward.
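In symbols (our notation; the paper's exact reward shaping may differ), with \(R_{\text{spy}}\) and \(R_{\text{civ}}^{(i)}\) the per-round rewards:

\[
R_{\text{spy}} + \sum_{i=1}^{n_c} R_{\text{civ}}^{(i)} = 0,
\]

with each player's reward decreasing in the number of votes cast against them.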
Role Advantage Estimation (RAE): mitigates the win-rate imbalance caused by the information asymmetry between the two roles.
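The formula is not reproduced in this summary; a standard role-conditioned baseline in the spirit of SPIRAL's RAE (the EMA decay \(\alpha\) is an assumed detail) keeps one running baseline \(b_\rho\) per role \(\rho \in \{\text{civilian}, \text{spy}\}\):

\[
b_\rho \leftarrow \alpha\, b_\rho + (1 - \alpha)\, R, \qquad \hat{A}_\rho = R - b_\rho,
\]

so each role's reward is judged against that role's own running average rather than a shared baseline, compensating for the spy's structurally harder position.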
Clue stage objective:
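As a sketch (not the paper's exact loss), a REINFORCE-style objective using the RAE advantage over the clue policy would read:

\[
\mathcal{J}_{\text{clue}}(\theta) = \mathbb{E}_{c \sim \pi_\theta}\!\left[\, \hat{A}_\rho \, \log \pi_\theta(c \mid I_\rho, h) \,\right],
\]

where \(c\) is the generated clue, \(I_\rho\) the role-dependent image (\(I_c\) or \(I_s\)), and \(h\) the dialogue history.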
Decision Stage — RLVR Optimization:
Discrete rewards: correctly identifying the spy yields +1; answering "n/a" yields −0.5; an incorrect vote yields −1.
Group-normalized GRPO objective:
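For reference, the standard GRPO form (the paper's variant may add a KL penalty or differ in details) normalizes rewards within a group of \(G\) sampled responses:

\[
\hat{A}_i = \frac{r_i - \operatorname{mean}(\{r_j\}_{j=1}^{G})}{\operatorname{std}(\{r_j\}_{j=1}^{G})}, \qquad
\mathcal{J}(\theta) = \mathbb{E}\!\left[\frac{1}{G}\sum_{i=1}^{G} \min\!\big(\rho_i \hat{A}_i,\ \operatorname{clip}(\rho_i,\, 1-\epsilon,\, 1+\epsilon)\,\hat{A}_i\big)\right],
\]

where \(\rho_i = \pi_\theta(o_i \mid q) / \pi_{\theta_{\text{old}}}(o_i \mid q)\) is the per-response importance ratio.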
Alternating training: stages are switched via hysteresis thresholds, as sketched in code below:
- Decision → Clue: when \(\overline{\mathrm{acc}}_t \geq \tau_{acc}^{\uparrow}\) and \(\overline{\mathrm{na}}_t \leq \tau_{na}^{\downarrow}\) (the spy is identified too easily, so training moves to the clue stage to raise the difficulty).
- Clue → Decision: when \(1 - \overline{\mathrm{acc}}_t \geq \tau_{err}^{\uparrow}\) or \(\overline{\mathrm{na}}_t \geq \tau_{na}^{\uparrow}\) (the spy is too difficult to identify, so decision-stage training is reinforced).
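The switching rule is easy to state in code; the sketch below uses illustrative threshold values (the paper tunes these manually, and the actual values are not reproduced here):

```python
# Hysteresis-based stage switching for Iterative-SPO (illustrative thresholds).
TAU_ACC_UP = 0.8   # accuracy above this: the spy is caught too easily
TAU_NA_DOWN = 0.1  # "n/a" rate below this: voters are confident
TAU_ERR_UP = 0.5   # error rate above this: the spy is too hard to catch
TAU_NA_UP = 0.4    # "n/a" rate above this: voters are lost

def next_stage(stage: str, acc: float, na_rate: float) -> str:
    """Pick the stage to train next from recent decision-stage statistics."""
    if stage == "decision" and acc >= TAU_ACC_UP and na_rate <= TAU_NA_DOWN:
        return "clue"      # raise difficulty: train clues/disguise instead
    if stage == "clue" and ((1 - acc) >= TAU_ERR_UP or na_rate >= TAU_NA_UP):
        return "decision"  # reinforce the voters' identification ability
    return stage           # inside the hysteresis band: keep current stage
```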
Advantages¶
- Domain-agnostic: gameplay exploits inter-image differences without depending on specific image types.
- Simultaneous multi-capability enhancement: reasoning, spatial understanding, visual understanding, and OCR.
- Extremely low cost: no human annotation required; data can be rapidly generated using ChatGPT/NanoBanana.
Key Experimental Results¶
Reasoning and Mathematics Tasks¶
| Method | MathVista | MathVision | WeMath | MathVerse | LogicVista | Avg |
|---|---|---|---|---|---|---|
| Qwen2.5-VL-7B | 68.2 | 25.4 | 36.1 | 49.0 | 47.2 | 41.1 |
| MM-Eureka-7B | 73.0 | 26.9 | 36.2 | 50.3 | 42.9 | 42.9 |
| ViGaL-S+R | 71.9 | 27.5 | 36.9 | 52.4 | 46.5 | 43.0 |
| VZ (CLEVR) | 72.2 | 28.4 | 39.2 | 53.2 | 49.8 | 44.3 |
| VZ (Real) | 73.1 | 28.5 | 40.1 | 52.1 | 50.8 | 44.5 |
Using only unannotated data, Vision-Zero surpasses all baselines that rely on human-annotated data.
Chart Understanding and Vision-Centric Tasks¶
Vision-Zero (Chart) achieves substantial gains on chart benchmarks such as CharXiv and FunctionQA, with additional improvements on vision-centric tasks including MMVP and BLINK.
Training Dynamics¶
- The civilian win rate against the spy increases steadily throughout training.
- Clue length (token count) grows over training, indicating that the model learns to describe and reason more elaborately.
- Iterative-SPO effectively prevents the premature convergence associated with naive self-play.
Ablation Study¶
| Ablation | MathVista | MathVision |
|---|---|---|
| Clue stage only | 70.8 | 27.1 |
| Decision stage only | 71.5 | 27.6 |
| Iterative-SPO | 73.1 | 28.5 |
Alternating training substantially outperforms single-stage training.
Comparison with Gobang¶
On MathVision, Vision-Zero yields a +3% improvement over 100 training rounds, whereas the same setup trained on Gobang shows no gain, indicating that visually grounded strategic games generalize to reasoning benchmarks where abstract board games do not.
Highlights & Insights¶
- Zero human involvement: no human annotation or feedback is required at any stage.
- Domain-agnostic inputs: CLEVR, chart, and natural images are all effective.
- Theoretically elegant Iterative-SPO: alternating self-play and RLVR avoids local equilibria.
- Surpassing annotated baselines: an annotation-free method outperforms SOTA trained on expensive human-labeled data.
- Simultaneous multi-capability gains: improvements span reasoning, chart understanding, and vision-centric tasks.
Limitations & Future Work¶
- The number of roles in each game (\(n_c + 1\)) is fixed; more complex multi-role settings remain unexplored.
- Whether the strategic space of the "Who is the Spy?" game sufficiently covers all visual reasoning capabilities is unclear.
- The spy receives a blank image rather than a visually similar one, deviating from the original game design.
- The hysteresis threshold hyperparameters in Iterative-SPO require manual tuning.
- Gains on certain vision-centric tasks (e.g., RealWorldQA) remain limited.
Related Work & Insights¶
- LLM self-play: SPIRAL (Liu et al., 2025) enhances reasoning via board games; Absolute Zero (Zhao et al., 2025) achieves SOTA on mathematics and coding.
- VLM post-training: R1-OneVision, MM-Eureka, and VLAA-Thinker employ RLVR with human annotations.
- Gamified VLM training: ViGaL (Xie et al., 2025) uses snake and rotation games but requires game data collection.
- Self-play theory: AlphaGo Zero (Silver et al., 2017), TD-Gammon (Tesauro, 1995).
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ — First annotation-free gamified self-play framework for VLMs.
- Practicality: ⭐⭐⭐⭐⭐ — Minimal cost, domain-agnostic, and plug-and-play.
- Clarity: ⭐⭐⭐⭐ — Framework is well-structured, though notation-heavy.
- Significance: ⭐⭐⭐⭐⭐ — Opens a new paradigm for VLM self-evolution.