Vision-Zero: Scalable VLM Self-Improvement via Strategic Gamified Self-Play

  • Conference: ICLR 2026
  • arXiv: 2509.25541
  • Code: GitHub
  • Area: Multimodal VLM
  • Keywords: VLM, Self-Play, Reinforcement Learning, Zero-Shot, Gamification, Self-Improvement

TL;DR

This paper proposes Vision-Zero, the first annotation-free gamified self-play framework for VLMs. By casting visual reasoning as a "Who is the Spy?"-style game and combining it with the Iterative-SPO training algorithm, Vision-Zero achieves scalable self-improvement and surpasses SOTA methods trained on human-annotated data across reasoning, chart understanding, and vision-centric tasks.

Background & Motivation

Current VLM training faces two core bottlenecks:

  1. Data scarcity: multimodal annotation is prohibitively expensive (COCO Attributes: $60,480 for 200K objects; Ego4D: >250K annotation hours).
  2. Knowledge ceiling: model capability is capped by the quality of human annotations, preventing the discovery of strategies beyond human experience.

Self-play has demonstrated the ability to break knowledge ceilings in domains such as Go (AlphaGo) and esports (OpenAI Five). However, extending self-play to VLMs is non-trivial: the game environment must simultaneously handle visual and language modalities while satisfying requirements of skill alignment, scalable difficulty, diversity, and low data demand.

Vision-Zero's design philosophy draws inspiration from the social deduction game "Who is the Spy?": civilians observe a real image while the spy receives a blank input, and the two sides engage in interactive strategic gameplay, enabling the model to autonomously generate its own training data.

Method

Game Environment

Roles: \(n_c\) civilians (observing the real image \(I_c\)) + 1 spy (receiving a blank image \(I_s\)).

Two-stage gameplay:

Clue Stage:

  • Each player provides a language clue based on their role and observation.
  • The spy must infer the hidden image content solely from the civilians' clues and disguise its identity.
  • Civilians must provide accurate clues while minimizing information leakage to the spy.

Decision Stage:

  • Civilians analyze all clues along with their own image and vote to identify the spy.
  • The spy does not participate in voting.
  • Players may respond with "n/a" to indicate uncertainty.
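
To make the two-stage protocol concrete, here is a minimal sketch of one round. The `ask_model` call, the prompt strings, and the vote format are hypothetical placeholders, not the paper's implementation.

```python
import random

def play_round(ask_model, real_image, blank_image, n_civilians=3):
    """One 'Who is the Spy?' round (sketch). `ask_model(image, prompt) -> str`
    is a hypothetical stand-in for any VLM inference call."""
    spy = random.randrange(n_civilians + 1)            # secretly assign the spy seat
    images = [blank_image if i == spy else real_image
              for i in range(n_civilians + 1)]

    # Clue stage: every player (spy included) emits one language clue.
    clues = []
    for img in images:
        clues.append(ask_model(img, f"Clues so far: {clues}. Give one clue about your image."))

    # Decision stage: civilians vote on the spy's identity; "n/a" marks uncertainty.
    votes = [ask_model(images[i], f"All clues: {clues}. Which player is the spy, or n/a?")
             for i in range(n_civilians + 1) if i != spy]   # the spy does not vote
    return spy, clues, votes
```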

Annotation-Free, Domain-Agnostic Data Input

Training requires only arbitrary images. Three data types are validated experimentally:

  • CLEVR data: 2,000 automatically rendered images (4–6 random objects each).
  • Chart data: 1,000 ChartQA images.
  • Real-world data: 1,000 ImgEdit images.

Iterative Self-Play Policy Optimization (Iterative-SPO)

Clue Stage — Self-Play Optimization:

Zero-sum rewards:

\[r_s^{clue} = -\beta(v_s - \bar{v}_c), \quad r_{c_j}^{clue} = \frac{\beta}{n_c}(v_s - \bar{v}_c) - \lambda(v_{c_j} - \bar{v}_c)\]

The sum of spy and civilian rewards is zero; players receiving more votes obtain lower rewards.
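
This can be transcribed directly, assuming per-player vote counts are available; the \(\beta\) and \(\lambda\) defaults below are placeholder values, not the paper's. Summing confirms the zero-sum property: the \(\lambda\) penalty terms cancel because they are deviations from the civilian mean.

```python
def clue_rewards(v_spy, v_civ, beta=1.0, lam=0.5):
    """v_spy: votes the spy received; v_civ: votes each civilian received.
    beta and lam are placeholder hyperparameter values."""
    n_c = len(v_civ)
    v_bar = sum(v_civ) / n_c                           # mean civilian vote count
    r_spy = -beta * (v_spy - v_bar)
    r_civ = [beta / n_c * (v_spy - v_bar) - lam * (v_j - v_bar) for v_j in v_civ]
    assert abs(r_spy + sum(r_civ)) < 1e-9              # zero-sum sanity check
    return r_spy, r_civ
```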

Role Advantage Estimation (RAE): mitigates win-rate imbalance caused by information asymmetry:

\[A_k^{clue} = r_k^{clue} - b_k, \quad b_k \leftarrow \alpha b_k + (1-\alpha)\, r_k^{clue}\]
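
A sketch of the per-role EMA baseline implied by the update above; the value of \(\alpha\) and the choice to compute the advantage against the pre-update baseline are assumptions.

```python
class RoleBaseline:
    """Role Advantage Estimation: per-role EMA baseline over past rewards (sketch)."""
    def __init__(self, alpha=0.9):                     # alpha is an assumed value
        self.alpha, self.b = alpha, {}                 # role -> running baseline

    def advantage(self, role, reward):
        b = self.b.get(role, 0.0)
        self.b[role] = self.alpha * b + (1 - self.alpha) * reward   # EMA update
        return reward - b                              # A_k = r_k - b_k
```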

Clue stage objective:

\[\mathcal{L}^{clue}(\theta) = -\mathbb{E}\left[\frac{1}{n}\sum_{k \in \mathcal{K}} A_k^{clue} \log \pi_\theta^k(u_k | I_k, h)\right] + \tau_{clue} \cdot D_{KL}(\pi_\theta^k \| \pi_{ref}^k)\]
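
In PyTorch-like terms, this objective is a REINFORCE term weighted by the RAE advantage plus a KL penalty toward a frozen reference policy. The single-sample KL estimate and the \(\tau_{clue}\) default below are assumptions.

```python
import torch

def clue_loss(logp, logp_ref, adv, tau_clue=0.01):
    """logp/logp_ref: per-player log-probs of the emitted clues under the current
    and frozen reference policies; adv: RAE advantages (sketch)."""
    pg = -(adv.detach() * logp).mean()                 # policy-gradient term
    kl = (logp - logp_ref.detach()).mean()             # Monte-Carlo estimate of KL(pi || pi_ref)
    return pg + tau_clue * kl
```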

Decision Stage — RLVR Optimization:

Discrete rewards: correctly identifying the spy yields +1; answering n/a yields −0.5; incorrect answers yield −1.
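
The verifiable reward is a simple three-way rule; the string encoding of the prediction is an assumption.

```python
def decision_reward(pred, spy_id):
    """Decision-stage verifiable reward: +1 correct, -0.5 abstain ('n/a'), -1 wrong."""
    if pred == "n/a":
        return -0.5
    return 1.0 if pred == spy_id else -1.0
```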

Group-normalized GRPO objective:

\[\mathcal{L}^{dec}(\theta) = -\mathbb{E}\left[\frac{1}{n_c}\sum_{i=1}^{n_c} A_{c_i}^{dec} \log q_\theta(\hat{s}_{c_i} | H)\right] + \tau_{dec} \cdot D_{KL}(q_\theta \| q_{ref})\]
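
GRPO computes advantages by normalizing rewards within a group of rollouts of the same game. A minimal sketch, assuming standard mean/std normalization; the paper's exact scheme may differ.

```python
import torch

def group_normalized_advantages(rewards):
    """Normalize decision-stage rewards within one rollout group (sketch)."""
    r = torch.as_tensor(rewards, dtype=torch.float32)
    return (r - r.mean()) / (r.std() + 1e-6)           # zero-mean, unit-scale advantages
```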

Alternating training: stages are switched via hysteresis thresholds, as in the sketch below:

  • Decision → Clue: when \(\bar{acc}_t \geq \tau_{acc}^\uparrow\) and \(\bar{na}_t \leq \tau_{na}^\downarrow\) (the spy is identified too easily; clue-stage difficulty is increased).
  • Clue → Decision: when \(1 - \bar{acc}_t \geq \tau_{err}^\uparrow\) or \(\bar{na}_t \geq \tau_{na}^\uparrow\) (the spy is too hard to identify; decision-stage training is reinforced).
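
The switching rule amounts to a small state machine; the threshold names mirror the text, and their numeric values are tuned hyperparameters (assumed here to be supplied by the caller).

```python
def next_stage(stage, acc, na, thr):
    """Hysteresis switch between clue-stage self-play and decision-stage RLVR (sketch)."""
    if stage == "decision" and acc >= thr["acc_up"] and na <= thr["na_down"]:
        return "clue"        # spy found too easily -> make clue play harder
    if stage == "clue" and ((1 - acc) >= thr["err_up"] or na >= thr["na_up"]):
        return "decision"    # spy too hard to find -> train the voters
    return stage
```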

Advantages

  1. Domain-agnostic: gameplay exploits inter-image differences without depending on specific image types.
  2. Simultaneous multi-capability enhancement: reasoning, spatial understanding, visual understanding, and OCR.
  3. Extremely low cost: no human annotation required; data can be rapidly generated using ChatGPT/NanoBanana.

Key Experimental Results

Reasoning and Mathematics Tasks

| Method        | MathVista | MathVision | WeMath | MathVerse | LogicVista | Avg  |
|---------------|-----------|------------|--------|-----------|------------|------|
| Qwen2.5-VL-7B | 68.2      | 25.4       | 36.1   | 49.0      | 47.2       | 41.1 |
| MM-Eureka-7B  | 73.0      | 26.9       | 36.2   | 50.3      | 42.9       | 42.9 |
| ViGaL-S+R     | 71.9      | 27.5       | 36.9   | 52.4      | 46.5       | 43.0 |
| VZ (CLEVR)    | 72.2      | 28.4       | 39.2   | 53.2      | 49.8       | 44.3 |
| VZ (Real)     | 73.1      | 28.5       | 40.1   | 52.1      | 50.8       | 44.5 |

Using only unannotated data, Vision-Zero surpasses all baselines that rely on human-annotated data.

Chart Understanding and Vision-Centric Tasks

Vision-Zero (Chart) achieves substantial gains on chart benchmarks such as CharXiv and FunctionQA, with additional improvements on vision-centric tasks including MMVP and BLINK.

Training Dynamics

  • The civilian win rate against the spy increases steadily throughout training.
  • Clue length (token count) grows over training, indicating that the model learns to describe and reason more elaborately.
  • Iterative-SPO effectively prevents the premature convergence associated with naive self-play.

Ablation Study

| Ablation            | MathVista | MathVision |
|---------------------|-----------|------------|
| Clue stage only     | 70.8      | 27.1       |
| Decision stage only | 71.5      | 27.6       |
| Iterative-SPO       | 73.1      | 28.5       |

Alternating training substantially outperforms single-stage training.

Comparison with Gobang

On MathVision, Vision-Zero yields a +3% improvement after 100 training rounds, while Gobang-based self-play shows no gain, demonstrating that visual reasoning games generalize better than purely abstract board games.

Highlights & Insights

  1. Zero human involvement: no human annotation or feedback is required at any stage.
  2. Domain-agnostic inputs: CLEVR, chart, and natural images are all effective.
  3. Theoretically elegant Iterative-SPO: alternating self-play and RLVR avoids local equilibria.
  4. Surpassing annotated baselines: an annotation-free method outperforms SOTA trained on expensive human-labeled data.
  5. Simultaneous multi-capability gains: improvements span reasoning, chart understanding, and vision-centric tasks.

Limitations & Future Work

  1. The number of roles in each game (\(n_c + 1\)) is fixed; more complex multi-role settings remain unexplored.
  2. Whether the strategic space of the "Who is the Spy?" game sufficiently covers all visual reasoning capabilities is unclear.
  3. The spy receives a blank image rather than a visually similar one, deviating from the original game design.
  4. The hysteresis threshold hyperparameters in Iterative-SPO require manual tuning.
  5. Gains on certain vision-centric tasks (e.g., RealWorldQA) remain limited.

Related Work

  • LLM self-play: SPIRAL (Liu et al., 2025) enhances reasoning via board games; Absolute Zero (Zhao et al., 2025) achieves SOTA on mathematics and coding.
  • VLM post-training: R1-OneVision, MM-Eureka, and VLAA-Thinker employ RLVR with human annotations.
  • Gamified VLM training: ViGaL (Xie et al., 2025) uses snake and rotation games but requires game data collection.
  • Self-play theory: AlphaGo (Silver et al., 2017), TD-Gammon (Tesauro, 1995).

Rating

  • Novelty: ⭐⭐⭐⭐⭐ — First annotation-free gamified self-play framework for VLMs.
  • Practicality: ⭐⭐⭐⭐⭐ — Minimal cost, domain-agnostic, and plug-and-play.
  • Clarity: ⭐⭐⭐⭐ — Framework is well-structured, though notation-heavy.
  • Significance: ⭐⭐⭐⭐⭐ — Opens a new paradigm for VLM self-evolution.