
NoisyRollout: Reinforcing Visual Reasoning with Data Augmentation

Conference: NeurIPS 2025 · arXiv: 2504.13055 · Code: GitHub · Area: Reinforcement Learning / VLM Reasoning · Keywords: Visual Reasoning, Policy Exploration, Data Augmentation, GRPO, Noise Annealing

TL;DR

This paper proposes NoisyRollout, a data augmentation method with zero additional training cost. During GRPO-based VLM training, it mixes rollouts from clean and moderately perturbed images to enhance policy exploration diversity. Using only 2.1K samples, it achieves state-of-the-art performance among open-source RL fine-tuned models across five out-of-domain benchmarks.

Background & Motivation

  • Scaling test-time (inference) computation via reinforcement learning is an important direction for enhancing model capabilities, but VLMs face unique challenges:
    • Insufficient policy exploration: Conventional methods such as raising the sampling temperature introduce only superficial diversity and fail to guide policies toward more robust behaviors.
    • Visual perception deficiencies: VLMs frequently exhibit perception errors that propagate into subsequent reasoning steps.
  • Existing VLM-RL works largely transplant methods from the LLM domain without accounting for the specific challenges of visual perception.
  • Core insight: If a model can successfully reason over perturbed images, its reasoning path is more robust; reward discrepancies between clean and perturbed images serve as implicit contrastive signals that improve perception.

Method

Overall Architecture

For each training sample \((I, \mathbf{q})\), the old policy generates two sets of rollouts: \(n_1\) rollouts from the clean image and \(n_2\) rollouts from the perturbed image \(\tilde{I} = T_{\alpha_t}(I)\). All rollouts are mixed to compute the reward baseline and advantage values. Crucially, policy updates are conditioned solely on the clean image; perturbed images are used only for collecting diverse rollouts. A noise annealing schedule progressively reduces the perturbation intensity throughout training.
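
To make the mixing step concrete, here is a minimal numpy sketch of the shared-baseline advantage computation; the function name and the 0/1 reward values are illustrative stand-ins, following the rule-based reward scheme described under Loss & Training.

```python
import numpy as np

def mixed_group_advantages(clean_rewards: np.ndarray,
                           noisy_rewards: np.ndarray,
                           eps: float = 1e-6) -> np.ndarray:
    """GRPO advantages over ONE mixed group: the n1 clean-image rollouts
    and the n2 noisy-image rollouts share a single mean/std baseline."""
    rewards = np.concatenate([clean_rewards, noisy_rewards])
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Rule-based 0/1 rewards for n1 = n2 = 6 rollouts (illustrative values).
adv = mixed_group_advantages(np.array([1., 1., 0., 1., 0., 1.]),
                             np.array([0., 1., 0., 0., 1., 0.]))
print(adv.round(2))  # successes get positive advantage, failures negative
```

Because the baseline is shared across both rollout sets, a trajectory that succeeds despite the perturbation is rewarded relative to the whole group, which is exactly the implicit contrastive signal described in the key designs below.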

Key Designs

  1. Mixed Rollout Strategy:

    • Function: Mixes reasoning trajectories from clean and perturbed images for GRPO optimization.
    • Mechanism: \(n_1\) clean rollouts and \(n_2\) noisy rollouts jointly form a single group, sharing a unified reward mean and standard deviation as the normalization baseline.
    • Design Motivation:
      • Successful trajectories on perturbed images provide alternative, more robust reasoning paths.
      • Reward discrepancies between clean and perturbed inputs expose perceptual fragility, functioning as implicit contrastive learning.
  2. Noise Annealing Schedule:

    • Function: Gradually reduces image perturbation intensity during training.
    • Mechanism: Employs a sigmoid-shaped annealing schedule \(\alpha_t = \alpha_0 \cdot \left(1 - \frac{1}{1 + e^{-\lambda(t-\gamma)/t_{\max}}}\right)\); see the sketch after this list.
    • Design Motivation: High noise in early training encourages exploration; low noise in later stages reduces distributional shift and ensures stable convergence.
  3. Policy Updates Conditioned on Clean Inputs Only:

    • Mechanism: Rollouts may be collected from the perturbed image, but the importance ratio in the policy gradient is always \(\frac{\pi_\theta(\mathbf{o}_i \mid I, \mathbf{q})}{\pi_{\theta_{old}}(\mathbf{o}_i \mid I, \mathbf{q})}\), i.e., conditioned on the clean image.
    • Design Motivation: Prevents the policy from learning noise-dependent behaviors, ensuring optimal performance on clean inputs at inference time.
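
As referenced in design 2, below is a minimal sketch of the annealing schedule together with one plausible form of the distortion \(T_\alpha\). The hyperparameter values and the additive-Gaussian form of the perturbation are illustrative assumptions, not the paper's exact settings.

```python
import math
import numpy as np

def annealed_noise_strength(t: int, t_max: int, alpha0: float,
                            lam: float, gamma: float) -> float:
    """Sigmoid-shaped schedule: alpha_t = alpha0 * (1 - sigmoid(lam*(t - gamma)/t_max)).
    Stays near alpha0 early, crosses alpha0/2 at step gamma, decays toward 0."""
    return alpha0 * (1.0 - 1.0 / (1.0 + math.exp(-lam * (t - gamma) / t_max)))

def perturb(image: np.ndarray, alpha: float, rng: np.random.Generator) -> np.ndarray:
    """One plausible T_alpha: additive Gaussian noise scaled by alpha
    (a stand-in; the paper's exact distortion operator may differ)."""
    return image + alpha * rng.normal(0.0, 1.0, size=image.shape)

# Illustrative values only: alpha0 = 0.6, lam = 10, midpoint gamma = 50
# over a 100-step run. Strength decays ~0.60 -> 0.30 -> ~0.00.
for t in (0, 50, 100):
    print(t, round(annealed_noise_strength(t, 100, 0.6, 10.0, 50.0), 3))
```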

Loss & Training

\[\mathcal{J}(\theta) = \mathbb{E}\left[\frac{1}{n_1+n_2}\sum_{i=1}^{n_1+n_2} \min\left(r_i(\theta)\,\hat{A}_i,\ \text{clip}\left(r_i(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_i\right)\right], \qquad r_i(\theta) = \frac{\pi_\theta(\mathbf{o}_i \mid I, \mathbf{q})}{\pi_{\theta_{old}}(\mathbf{o}_i \mid I, \mathbf{q})}\]
  • Advantages are normalized over the mixed group: \(\hat{A}_i = \frac{R_i - \operatorname{mean}(\{R_j\}_{j=1}^{n_1+n_2})}{\operatorname{std}(\{R_j\}_{j=1}^{n_1+n_2})}\).
  • Rule-based rewards are used (correct = 1, incorrect = 0) with no KL divergence penalty.
  • Default configuration: Gaussian noise, with \(n_1 = 6\) clean and \(n_2 = 6\) noisy rollouts (12 in total, the same rollout budget as vanilla GRPO).
  • The visual encoder is frozen; learning rate is set to 1e-6.
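
A hedged numpy sketch of the objective above at the sequence level; the actual implementation optimizes token-level log-probabilities inside a training framework, and \(\epsilon = 0.2\) is a typical clipping value rather than a confirmed setting from the paper.

```python
import numpy as np

def noisyrollout_objective(logp_new: np.ndarray, logp_old: np.ndarray,
                           advantages: np.ndarray, eps: float = 0.2) -> float:
    """Clipped GRPO surrogate over the mixed group of n1 + n2 rollouts.
    Both log-probability arrays are evaluated with the CLEAN image I as
    input, even for trajectories sampled from the perturbed image."""
    ratio = np.exp(logp_new - logp_old)  # pi_theta / pi_theta_old
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    return float(np.minimum(ratio * advantages, clipped * advantages).mean())
```

In training, this quantity is maximized with respect to \(\theta\) (e.g., by minimizing its negative).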

Key Experimental Results

Main Results

Qwen2.5-VL-7B-Instruct, trained on only 2.1K Geometry3K samples:

| Method | MathVerse | MathVision | MathVista | WeMath | HallusionBench |
| --- | --- | --- | --- | --- | --- |
| Qwen2.5-VL-7B (base) | 46.2 | 25.0 | 67.5 | 63.1 | 64.6 |
| + Vanilla GRPO | 50.8 | 27.3 | 70.5 | 67.4 | 69.8 |
| + NoisyRollout | 53.2 | 28.5 | 72.6 | 69.6 | 72.1 |

Ablation Study

  • Rollout diversity analysis: NoisyRollout markedly increases rollout diversity (measured by pairwise cosine distance) in early training, an effect comparable to raising the sampling temperature to 1.2.
  • Temperature comparison: NoisyRollout at temperature 1.0 consistently outperforms vanilla GRPO at every tested temperature (0.8–1.4) across all benchmarks, showing that its added diversity is more targeted.
  • Noise type: Both Gaussian noise and rotation are effective; Gaussian noise performs marginally better.
  • Ratio experiment: \(n_1 = 6, n_2 = 6\) (50% noisy rollouts) is the optimal configuration.
  • 32B model: NoisyRollout remains effective at larger scale (MathVision 41.6 vs. GRPO 40.0).

Key Findings

  • Using only 2.1K training samples, NoisyRollout surpasses competitors trained on 15K–260K samples (e.g., OpenVLThinker, R1-VL), demonstrating exceptional data efficiency.
  • Improvements on HallusionBench (+2.3 points over vanilla GRPO) indicate that NoisyRollout enhances not only reasoning but also visual perception.
  • Noise annealing is critical for training stability—fixed perturbation intensity leads to instability in later training stages.
  • Consistent gains are observed across different datasets (Geometry3K vs. MMK12) and model scales (7B vs. 32B).

Highlights & Insights

  • The design is remarkably simple ("free lunch"): no additional training cost, no modification to the RL objective, and no increase in total rollout count.
  • Using visual perturbation as a policy exploration tool is a novel idea—it exploits the visual perception characteristics of VLMs to provide meaningful diversity.
  • The implicit contrastive learning mechanism is elegant: reward discrepancies between clean and perturbed inputs naturally regularize perceptual behavior.
  • Data efficiency is striking: 2.1K samples suffice to achieve state-of-the-art results on five out-of-domain benchmarks.

Limitations & Future Work

  • The perturbation types (Gaussian noise, rotation) are relatively simple; more complex augmentations (e.g., occlusion, style transfer) remain unexplored.
  • Why certain perturbation strategies (e.g., cropping) fail is not thoroughly analyzed.
  • Hyperparameter selection for the noise annealing schedule (\(\alpha_0, \lambda, \gamma\)) remains largely manual.
  • Applicability to non-visual reasoning tasks (e.g., pure text reasoning) is not discussed.
  • NoisyRollout is complementary to concurrent works such as DeepVideo-R1: NoisyRollout improves exploration strategy while DeepVideo-R1 refines the optimization objective.
  • The mixed rollout concept is generalizable to other RL fine-tuning scenarios (e.g., code generation, mathematical reasoning).
  • Noise annealing aligns with the principle of curriculum learning: transitioning from broad exploration to narrow exploitation.

Rating

  • ⭐⭐⭐⭐⭐ — The method is concise, efficient, and demonstrates strong generalization, representing a practical contribution to the VLM-RL field.