EvoLMM: Self-Evolving Large Multimodal Models with Continuous Rewards
Conference: CVPR 2026 · arXiv: 2511.16672 · Code: mbzuai-oryx/EvoLMM · Area: Multimodal VLM · Keywords: Self-evolving learning, large multimodal models, continuous rewards, Proposer-Solver, unsupervised reinforcement learning, self-consistency
TL;DR
This paper proposes EvoLMM, a fully unsupervised self-evolving framework that instantiates two roles from a single backbone LMM: a Proposer (generating visual questions) and a Solver (producing multiple answers). By replacing discrete majority voting with continuous self-consistency rewards, the model improves multimodal mathematical reasoning using only raw images (ChartQA +2.7%, MathVista +2.1%).
Background & Motivation
- Annotation dependency bottleneck: Existing LMM training relies heavily on human-annotated data (QA pairs, metadata), limiting scalability and domain generalization.
- Dependence on external reward models: Many methods require additional reward models or human evaluation to provide learning signals, increasing system complexity.
- Instability of discrete rewards: Prior self-evolving LLM work (e.g., SQLM) uses majority-voting discrete rewards, which frequently produce zero reward in multimodal settings due to early high inconsistency among Solver outputs, leading to unstable optimization.
- Poor transferability of language-domain methods: Existing self-evolving research (e.g., Multi-Agent Evolve) is confined to purely textual domains; multimodal scenarios require visual grounding, making self-evaluation more challenging.
- Residual semi-supervision: Existing multimodal self-improvement methods (Vision-SR1, ViPER, Vision-Zero) still rely on reconstruction objectives, SFT warm-starts, or implicit supervision from external generative models (GPT-4o/Gemini).
- Core problem: Can an LMM autonomously improve its reasoning ability without any human annotations or external rewards?
Method
Overall Architecture
EvoLMM instantiates two collaborative roles from the same pretrained backbone:
- Proposer \(\pi_\phi(q|x)\): Given a raw image \(x\), generates visually grounded mathematical questions \(q\).
- Solver \(\pi_\theta(y|x,q)\): Samples \(N=5\) independent answers \(y_{1:N}\) for question \(q\).
Both roles form a closed-loop training pipeline via internal consistency signals, without using any ground-truth labels.
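A minimal sketch of this closed loop follows, assuming a single `backbone.generate` interface that is switched between the two roles via prompting; the helper names are hypothetical and the paper's actual prompting and decoding setup is not reproduced here.

```python
def self_evolve_step(backbone, image, n_answers=5):
    """One Proposer-Solver round on a single raw image (no labels involved)."""
    # Proposer role: propose a visually grounded math question about the image.
    question = backbone.generate(role="proposer", image=image)
    # Solver role: sample N=5 independent answers to that question.
    answers = [
        backbone.generate(role="solver", image=image, question=question, sample=True)
        for _ in range(n_answers)
    ]
    # Both rewards described below are computed purely from these samples.
    return question, answers
```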
Continuous Self-Consistency Solver Reward
- \(p(y_i|x,q)\): consistency score of \(y_i\) under the empirical answer distribution.
- \(\gamma=0.7\): reward softening exponent that amplifies differences at intermediate confidence levels.
- Length penalty: encourages concise answers (target length \(\tau=6\) words).
- Key advantage: Unlike discrete majority voting, even partial agreement (e.g., 2 out of 5) yields non-zero gradients.
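A minimal sketch of this reward, under the simplifying assumptions that answers are matched by exact string comparison and that the length penalty is a multiplicative exponential term; the paper only specifies the softening exponent \(\gamma = 0.7\) and the target length \(\tau = 6\) words, so `len_weight` is a hypothetical knob.

```python
from collections import Counter
import math

def solver_rewards(answers, gamma=0.7, tau_len=6, len_weight=0.1):
    """Continuous self-consistency reward for each sampled answer (sketch)."""
    n = len(answers)
    counts = Counter(a.strip().lower() for a in answers)
    rewards = []
    for a in answers:
        p = counts[a.strip().lower()] / n            # empirical consistency p(y_i | x, q)
        consistency = p ** gamma                     # softening: 2/5 agreement -> 0.4**0.7 ~ 0.53
        length = len(a.split())
        penalty = math.exp(-len_weight * abs(length - tau_len))  # nudge toward ~tau_len words
        rewards.append(consistency * penalty)
    return rewards

# Partial agreement (2 of 5) still yields a graded, non-zero reward,
# whereas a majority-vote reward would collapse to 0/1.
print(solver_rewards(["12", "12", "15", "7", "9"]))
```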
Entropy-Guided Continuous Proposer Reward
- \(H(x,q)\): entropy of the Solver's answer distribution, measuring question difficulty.
- Bandpass filter design: the reward peaks when \(H\) is near \(\mu_H=0.90\) and decays with width \(\sigma_H=0.35\).
- When \(H \approx 0\) (trivially easy questions) or \(H\) is excessively large (unsolvable questions), the reward is low.
- Adaptive curriculum: as the Solver improves, the Proposer is incentivized to generate slightly harder but still solvable questions.
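A minimal sketch of the Proposer reward, assuming a Gaussian band in entropy space centered at \(\mu_H = 0.90\) with width \(\sigma_H = 0.35\); the exact band shape is an assumption consistent with the bandpass description above.

```python
from collections import Counter
import math

def proposer_reward(answers, mu_h=0.90, sigma_h=0.35):
    """Entropy-guided bandpass reward for the Proposer (sketch)."""
    n = len(answers)
    counts = Counter(a.strip().lower() for a in answers)
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    # Gaussian band: low reward for trivially easy (H ~ 0) or unsolvable (H large) questions.
    return math.exp(-((entropy - mu_h) ** 2) / (2 * sigma_h ** 2))

# All-agree (H = 0) and all-differ (H = ln 5 ~ 1.61) both score low;
# moderate disagreement (3/1/1 split, H ~ 0.95) lands near the peak of the band.
print(proposer_reward(["12"] * 5))                    # ~0.04
print(proposer_reward(["1", "2", "3", "4", "5"]))     # ~0.13
print(proposer_reward(["12", "12", "12", "7", "9"]))  # ~0.99
```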
Training Optimization
- Policy gradient: REINFORCE with EMA baseline \(b_A\) for variance reduction.
- KL regularization: Token-level KL divergence constraint to prevent excessive deviation from the pretrained distribution.
- Dynamic KL controller (per role \(A\), i.e., Proposer or Solver; see the sketch after this list): \(\beta_A \leftarrow \text{clip}(\beta_A \cdot \exp(\eta \cdot (\overline{KL}_A - \tau_A)/\tau_A), \beta_{\min}, \beta_{\max})\)
- Update frequency: Solver updates every step; Proposer updates every 5 steps.
- Parameter efficiency: Dual LoRA adapters with a frozen backbone.
- Training configuration: 6,000 steps, batch size 1, AdamW, lr=1e-6, 8× AMD MI250X GPUs.
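A minimal sketch of the dynamic KL controller and the REINFORCE update with an EMA baseline, written with plain floats for brevity; \(\eta\), the clip range, and the EMA decay are illustrative assumptions, and per-token quantities are collapsed to scalars.

```python
import math

def update_kl_coeff(beta, mean_kl, kl_target, eta=0.1, beta_min=1e-3, beta_max=1.0):
    """beta <- clip(beta * exp(eta * (mean_KL - tau) / tau), beta_min, beta_max).
    beta grows when the role's policy drifts too far from the pretrained
    distribution and shrinks when it stays well inside the trust region."""
    beta = beta * math.exp(eta * (mean_kl - kl_target) / kl_target)
    return min(max(beta, beta_min), beta_max)

def reinforce_step(logprob_sum, reward, ema_baseline, beta, mean_token_kl, decay=0.99):
    """REINFORCE with an EMA baseline for variance reduction plus a KL penalty."""
    ema_baseline = decay * ema_baseline + (1 - decay) * reward
    advantage = reward - ema_baseline
    loss = -advantage * logprob_sum + beta * mean_token_kl
    return loss, ema_baseline
```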
Key Experimental Results
Main Results: Continuous vs. Discrete Rewards (Qwen2.5-VL-7B)
| Method | ChartQA | MathVista | MathVision | MathVerse | ScienceQA | AI2D | MMMU |
|---|---|---|---|---|---|---|---|
| Baseline | 84.00 | 68.46 | 23.91 | 43.78 | 88.30 | 82.61 | 51.11 |
| + Discrete Reward | 84.62 | 68.88 | 22.52 | 42.10 | 87.98 | 82.18 | 50.84 |
| + Continuous Reward (Ours) | 86.70 | 70.52 | 24.81 | 44.88 | 89.50 | 83.41 | 52.01 |
| Δ | +2.70 | +2.06 | +0.90 | +1.10 | +1.20 | +0.80 | +0.90 |
Discrete rewards even cause performance degradation on MathVision/MathVerse, demonstrating the fragility of majority voting in multimodal reasoning.
Cross-Backbone Generalization (Table 3)
| Model | ChartQA Δ | MathVista Δ | ScienceQA Δ |
|---|---|---|---|
| Qwen2.5-VL-7B | +2.70 | +2.06 | +1.20 |
| InternVL3-8B | +2.57 | +2.00 | +0.36 |
| Gemma3-12B-It | +2.97 | +2.00 | +1.08 |
| Llama-3.2-11B | +3.00 | +2.00 | +1.20 |
Consistent gains across all four backbones (up to +3 points on ChartQA) demonstrate architecture-agnostic applicability.
Ablation Study: Fine-tuning Strategies (Table 2)
| Strategy | ChartQA | MathVista | MathVision |
|---|---|---|---|
| LoRA (default) | 86.70 | 70.52 | 24.81 |
| QLoRA | 85.32 | 68.92 | 23.97 |
| Full Fine-tune | 84.20 | 68.41 | 23.37 |
- Full fine-tuning leads to degradation, as full-parameter updates are prone to overfitting and conflict with KL regularization in the unsupervised setting.
- QLoRA suffers from quantization noise that reduces Solver consistency.
Scaling with Model Size (Table 4)
On Qwen2.5-VL-72B, ChartQA improves from 88.20 to 91.04 (+2.84) and MathVista from 73.93 to 76.44 (+2.51), with larger models yielding greater absolute gains.
Highlights & Insights
- Fully unsupervised: Only ~6k raw images (no QA pairs or metadata) are needed to stably improve reasoning ability.
- Elegant continuous reward design: Soft consistency-probability-based rewards on the Solver side combined with entropy-based bandpass rewards on the Proposer side effectively avoid gradient vanishing from discrete rewards.
- Emergent curriculum learning: The Proposer automatically transitions from simple to moderately difficult questions without manual curriculum design.
- Lightweight and reproducible: Dual LoRA with a frozen backbone converges in 6,000 training steps.
- Strong generalization: Effective across 4 different architectures (Qwen, InternVL, Gemma, Llama) and 2 model scales (7B/72B).
Limitations & Future Work
- Limited improvement magnitude: Gains are at most ~+3%, with diminishing returns on near-saturated benchmarks (e.g., ScienceQA 88→89%).
- Restricted to mathematical reasoning: The method has not been extended to broader tasks (OCR, VQA, visual dialogue), leaving generality unclear.
- Small training data scale: Only ~6k images are used; it remains unexplored whether larger-scale data would yield greater gains.
- Single-turn question answering: The Proposer generates a single question per image; multi-turn interaction or chained questioning is not considered.
- Fixed Solver sampling count: \(N=5\) is used as a fixed hyperparameter; the impact of different values of \(N\) is not thoroughly explored.
- No direct comparison with RLHF/RLVR: A fair comparison against supervised reinforcement learning methods is absent.
Related Work & Insights
| Method | Supervision Type | Reward Form | External Model | Multimodal |
|---|---|---|---|---|
| SQLM [5] | No annotation | Discrete majority voting | None | ❌ Text only |
| Multi-Agent Evolve [6] | No annotation | Judge role | Implicit Judge | ❌ Text only |
| Vision-SR1 [18] | SFT warm-start | Perception+reasoning decomposition | None | ✅ |
| ViPER [47] | Reconstruction objective | Image/instance reconstruction | OmniGen2/Qwen-Image | ✅ |
| Vision-Zero [36] | Synthetic image pairs | Social reasoning game | GPT-4o/Gemini | ✅ |
| EvoLMM (Ours) | Fully unsupervised | Continuous self-consistency | None | ✅ |
EvoLMM is the only multimodal self-evolving approach in this comparison that relies on no form of external supervision or auxiliary models.
Rating
- Novelty: ⭐⭐⭐⭐ (The continuous replacement for discrete majority-vote rewards and the entropy-guided bandpass Proposer reward are both cleverly designed)
- Experimental Thoroughness: ⭐⭐⭐⭐ (Cross-backbone, cross-scale, and multiple ablations, though comparisons with supervised RL are missing)
- Writing Quality: ⭐⭐⭐⭐ (Clear structure, rich figures and tables, complete mathematical derivations)
- Value: ⭐⭐⭐⭐ (Provides a clean and effective baseline for unsupervised multimodal self-evolution, though absolute gains are modest)