NoisyRollout: Reinforcing Visual Reasoning with Data Augmentation¶
Conference: NeurIPS 2025 arXiv: 2504.13055 Code: GitHub Area: Reinforcement Learning / VLM Reasoning Keywords: Visual Reasoning, Policy Exploration, Data Augmentation, GRPO, Noise Annealing
TL;DR¶
This paper proposes NoisyRollout, a data augmentation method with zero additional training cost. During GRPO-based VLM training, it mixes rollouts from clean and moderately perturbed images to enhance policy exploration diversity. Using only 2.1K samples, it achieves state-of-the-art performance among open-source RL fine-tuned models across five out-of-domain benchmarks.
Background & Motivation¶
- Scaling test-time computation via reinforcement learning is an important direction for enhancing model intelligence, but VLMs face unique challenges:
- Insufficient policy exploration: Conventional methods such as raising the sampling temperature introduce only superficial diversity and fail to guide policies toward more robust behaviors.
- Visual perception deficiencies: VLMs frequently exhibit perception errors that propagate into subsequent reasoning steps.
- Existing VLM-RL works largely transplant methods from the LLM domain without accounting for the specific challenges of visual perception.
- Core insight: If a model can successfully reason over perturbed images, its reasoning path is more robust; reward discrepancies between clean and perturbed images serve as implicit contrastive signals that improve perception.
Method¶
Overall Architecture¶
For each training sample \((I, \mathbf{q})\), the old policy generates two sets of rollouts: \(n_1\) rollouts from the clean image and \(n_2\) rollouts from the perturbed image \(\tilde{I} = T_{\alpha_t}(I)\). All rollouts are mixed to compute the reward baseline and advantage values. Crucially, policy updates are conditioned solely on the clean image; perturbed images are used only for collecting diverse rollouts. A noise annealing schedule progressively reduces the perturbation intensity throughout training.
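The mixed-rollout step above can be sketched in a few lines: \(n_1\) clean and \(n_2\) noisy rollouts are scored, then normalized against a single shared baseline. This is a minimal illustration of the idea, not the authors' implementation; the function name and the toy reward values are hypothetical.

```python
import numpy as np

def mixed_group_advantages(clean_rewards, noisy_rewards, eps=1e-8):
    """GRPO-style advantages over one mixed rollout group.

    The n1 clean rollouts and n2 noisy rollouts share a single
    baseline: the mean/std over all n1 + n2 rewards.
    """
    rewards = np.concatenate([clean_rewards, noisy_rewards])
    baseline, scale = rewards.mean(), rewards.std()
    return (rewards - baseline) / (scale + eps)

# Toy example: 6 clean + 6 noisy rule-based rewards (1 = correct, 0 = incorrect)
clean = np.array([1, 1, 0, 1, 0, 1], dtype=float)
noisy = np.array([0, 1, 0, 0, 1, 0], dtype=float)
adv = mixed_group_advantages(clean, noisy)
```

Because both rollout sets share one baseline, a noisy rollout that succeeds where clean rollouts fail receives a large positive advantage, which is exactly the implicit contrastive signal described above.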
Key Designs¶
- Mixed Rollout Strategy:
- Function: Mixes reasoning trajectories from clean and perturbed images for GRPO optimization.
- Mechanism: \(n_1\) clean rollouts and \(n_2\) noisy rollouts jointly form a single group, sharing a unified reward mean and standard deviation as the normalization baseline.
- Design Motivation:
- Successful trajectories on perturbed images provide alternative, more robust reasoning paths.
- Reward discrepancies between clean and perturbed inputs expose perceptual fragility, functioning as implicit contrastive learning.
- Noise Annealing Schedule:
- Function: Gradually reduces image perturbation intensity during training.
- Mechanism: Employs a sigmoid-shaped annealing schedule \(\alpha_t = \alpha_0 \cdot \left(1 - \frac{1}{1 + e^{-\lambda(t-\gamma)/t_{max}}}\right)\).
- Design Motivation: High noise in early training encourages exploration; low noise in later stages reduces distributional shift and ensures stable convergence.
- Policy Updates Conditioned on Clean Inputs Only:
- Mechanism: Although some rollouts are collected from the perturbed image, the importance ratio is computed as \(\frac{\pi_\theta(\mathbf{o}_i | I, \mathbf{q})}{\pi_{\theta_{old}}(\mathbf{o}_i | I, \mathbf{q})}\), conditioning both policies on the clean image.
- Design Motivation: Prevents the policy from learning noise-dependent behaviors, ensuring optimal performance on clean inputs at inference time.
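The sigmoid-shaped annealing schedule above can be sketched as a small helper. The values of \(\alpha_0\), \(\lambda\), and \(\gamma\) below are illustrative placeholders, not the paper's tuned hyperparameters; \(\gamma\) is taken here as the step at the schedule midpoint.

```python
import math

def noise_intensity(t, t_max, alpha0=0.6, lam=10.0, gamma=None):
    """alpha_t = alpha0 * (1 - sigmoid(lam * (t - gamma) / t_max)).

    Decays smoothly from roughly alpha0 toward 0; gamma defaults
    to the midpoint step t_max / 2 (an illustrative choice).
    """
    if gamma is None:
        gamma = t_max / 2
    return alpha0 * (1.0 - 1.0 / (1.0 + math.exp(-lam * (t - gamma) / t_max)))

# High noise early (exploration), low noise late (stable convergence)
schedule = [noise_intensity(t, t_max=100) for t in (0, 50, 100)]
```

Larger \(\lambda\) makes the transition from exploration to exploitation sharper, while \(\gamma\) shifts where that transition occurs.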
Loss & Training¶
- Rule-based rewards are used (correct = 1, incorrect = 0) with no KL divergence penalty.
- Default configuration: Gaussian noise, \(n_1 = 6\), \(n_2 = 6\) (12 rollouts in total, the same budget as vanilla GRPO).
- The visual encoder is frozen; learning rate is set to 1e-6.
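The clean-conditioned update can be illustrated with a toy clipped surrogate: even for trajectories sampled from the perturbed image, both log-probabilities are evaluated on the clean input, and no KL penalty is added, matching the setup above. This is a hedged sketch; `clip_eps` and the function name are assumptions, and the real implementation operates on per-token log-probs from the VLM.

```python
import numpy as np

def clean_conditioned_loss(logp_new, logp_old, adv, clip_eps=0.2):
    """PPO-style clipped surrogate with no KL penalty.

    logp_new / logp_old are log pi(o | I_clean, q) under the current
    and old policies; both are conditioned on the CLEAN image, even
    for rollouts that were sampled from the perturbed image.
    """
    ratio = np.exp(logp_new - logp_old)
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    return -np.minimum(ratio * adv, clipped * adv).mean()

# With identical policies (ratio = 1), the loss reduces to -mean(advantage)
loss = clean_conditioned_loss(np.zeros(4), np.zeros(4),
                              np.array([1.0, -1.0, 0.5, 0.5]))
```

Keeping the ratio on clean inputs means the noisy rollouts only shape *which* trajectories get reinforced, never *what* the policy conditions on at inference time.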
Key Experimental Results¶
Main Results¶
Qwen2.5-VL-7B-Instruct, trained on only 2.1K Geometry3K samples:
| Method | MathVerse | MathVision | MathVista | WeMath | HallusionBench |
|---|---|---|---|---|---|
| Qwen2.5-VL-7B (base) | 46.2 | 25.0 | 67.5 | 63.1 | 64.6 |
| + Vanilla GRPO | 50.8 | 27.3 | 70.5 | 67.4 | 69.8 |
| + NoisyRollout | 53.2 | 28.5 | 72.6 | 69.6 | 72.1 |
Ablation Study¶
- Rollout diversity analysis: NoisyRollout significantly increases rollout diversity (measured by pairwise cosine distance) in early training, with an effect comparable to raising the sampling temperature to 1.2.
- Temperature comparison: NoisyRollout (temperature 1.0) consistently outperforms vanilla GRPO at any temperature (0.8–1.4) across all benchmarks, demonstrating more targeted diversity.
- Noise type: Both Gaussian noise and rotation are effective; Gaussian noise performs marginally better.
- Ratio experiment: \(n_1 = 6, n_2 = 6\) (50% noisy rollouts) is the optimal configuration.
- 32B model: NoisyRollout remains effective at larger scale (MathVision 41.6 vs. GRPO 40.0).
Key Findings¶
- Using only 2.1K training samples, NoisyRollout surpasses competitors trained on 15K–260K samples (e.g., OpenVLThinker, R1-VL), demonstrating exceptional data efficiency.
- Improvements on HallusionBench (+2.3%) indicate that NoisyRollout enhances not only reasoning but also visual perception.
- Noise annealing is critical for training stability—fixed perturbation intensity leads to instability in later training stages.
- Consistent gains are observed across different datasets (Geometry3K vs. MMK12) and model scales (7B vs. 32B).
Highlights & Insights¶
- The design is remarkably simple ("free lunch"): no additional training cost, no modification to the RL objective, and no increase in total rollout count.
- Using visual perturbation as a policy exploration tool is a novel idea—it exploits the visual perception characteristics of VLMs to provide meaningful diversity.
- The implicit contrastive learning mechanism is elegant: reward discrepancies between clean and perturbed inputs naturally regularize perceptual behavior.
- Data efficiency is striking: 2.1K samples suffice to achieve state-of-the-art results on five out-of-domain benchmarks.
Limitations & Future Work¶
- The perturbation types (Gaussian noise, rotation) are relatively simple; more complex augmentations (e.g., occlusion, style transfer) remain unexplored.
- Why certain perturbation strategies, such as cropping, fail is not thoroughly analyzed.
- Hyperparameter selection for the noise annealing schedule (\(\alpha_0, \lambda, \gamma\)) remains largely manual.
- Applicability to non-visual reasoning tasks (e.g., pure text reasoning) is not discussed.
Related Work & Insights¶
- NoisyRollout is complementary to concurrent works such as DeepVideo-R1: NoisyRollout improves exploration strategy while DeepVideo-R1 refines the optimization objective.
- The mixed rollout concept is generalizable to other RL fine-tuning scenarios (e.g., code generation, mathematical reasoning).
- Noise annealing aligns with the principle of curriculum learning: transitioning from broad exploration to narrow exploitation.
Rating¶
- ⭐⭐⭐⭐⭐ — The method is concise, efficient, and demonstrates strong generalization, representing a practical contribution to the VLM-RL field.