
DeepVideo-R1: Video Reinforcement Fine-Tuning via Difficulty-aware Regressive GRPO

Conference: NeurIPS 2025
arXiv: 2506.07464
Code: GitHub
Area: LLM Alignment / Video Large Language Models
Keywords: Video reasoning, reinforcement fine-tuning, GRPO, regression objective, difficulty-aware augmentation

TL;DR

This paper proposes DeepVideo-R1, which reformulates GRPO as Reg-GRPO, an objective that directly regresses advantage values (eliminating the clipping and min safeguards), and mitigates the vanishing-advantage problem via difficulty-aware data augmentation, achieving improvements of up to 10.1 percentage points over standard GRPO on video reasoning tasks.

Background & Motivation

State of the Field

Background: RL-based post-training (e.g., GRPO) has proven effective for enhancing LLM reasoning, yet its application to Video Large Language Models (VideoLLMs) remains underexplored.

Limitations of Prior Work

Limitations of Prior Work: Applying GRPO to VideoLLMs faces two critical issues, described below.

Root Cause

Key Challenge: Dependence on safeguard mechanisms: PPO-style clipping and min operations produce zero gradients when the policy deviates excessively, impeding exploration and convergence.
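For intuition: in the PPO-style surrogate \(\min\big(r_\theta A,\ \operatorname{clip}(r_\theta, 1-\epsilon, 1+\epsilon)\, A\big)\) with ratio \(r_\theta = \pi_\theta / \pi_{\theta_{old}}\), whenever the clipped branch is the one selected by the \(\min\), the term no longer depends on \(\theta\), so that sample contributes exactly zero gradient.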

Starting Point

Key Insight: Vanishing advantage: When samples are too easy or too hard, all responses within a group receive identical rewards, causing zero advantage values and loss of training signal.
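Concretely, with the standard group-normalized advantage \(\hat{A}^{(i)} = \frac{r^{(i)} - \mu_r}{\sigma_r + \epsilon}\): if all responses in a group receive the same reward (all correct on a trivial clip, or all wrong on an impossible one), then \(r^{(i)} = \mu_r\) for every \(i\), every advantage is zero, and the group contributes no learning signal.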

Remarks

  • Video reasoning requires complex spatiotemporal semantic understanding, making both issues particularly pronounced in video tasks.
  • Existing work has primarily focused on reward function design, leaving algorithmic improvements to GRPO itself relatively underexplored.

Method

Overall Architecture

DeepVideo-R1 comprises two key innovations: (1) Reg-GRPO reformulates the GRPO objective to directly regress group-relative advantage values, eliminating safeguard mechanisms such as clipping and min operations; (2) difficulty-aware data augmentation dynamically adjusts inputs based on sample difficulty to ensure diverse reward signals.

Key Designs

  1. Regressive GRPO (Reg-GRPO):

    • Function: Transforms the RL objective from PPO-style optimization to direct regression of advantage values.
    • Mechanism: Leverages a reparameterization of the closed-form solution to the KL-constrained RL objective, defining the predicted advantage as \(\hat{A}_\theta^{(i)} = \frac{\rho(\mathbf{x}, \mathbf{y}^{(i)}) - \mu_\rho}{\sigma_\rho}\), where \(\rho = \log \frac{\pi_\theta}{\pi_{\theta_{old}}}\) and \(\mu_\rho, \sigma_\rho\) are the mean and standard deviation of \(\rho\) over the group of responses, and minimizes the MSE against the target (group-normalized reward) advantage.
    • Design Motivation: The regression loss is inherently free of clipping truncation, and normalization naturally eliminates the partition function \(Z(\mathbf{x})\).
  2. Difficulty-aware Data Augmentation:

    • Function: Dynamically adjusts video-text inputs based on sample difficulty.
    • Mechanism: Uses the mean reward from a replay buffer as a reference to compute difficulty \(\Delta_\mathcal{R}(\mathbf{x})\).
    • Design Motivation: Samples of moderate difficulty yield the most diverse reward distributions, ensuring effective gradient signals.
  3. Bidirectional Difficulty Adjustment:

    • Reducing difficulty (hard samples): Partial reasoning clues extracted from successful reasoning trajectories are injected into the prompt, with intensity adaptively scaled by difficulty.
    • Increasing difficulty (easy samples): Gaussian noise or masking is applied to video frames, with intensity proportional to the degree of easiness.
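A rough sketch of how this bidirectional adjustment could be wired together (a minimal illustration, not the authors' code: the `Sample` fields, the helpers `inject_reasoning_clue` and `add_frame_noise`, and the use of the signed difficulty as augmentation strength are all assumptions):

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

@dataclass
class Sample:
    prompt: str        # question / instruction text
    frames: List       # decoded video frames

def adjust_difficulty(sample: Sample,
                      group_rewards: Sequence[float],
                      buffer_mean_reward: float,
                      inject_reasoning_clue: Callable,
                      add_frame_noise: Callable) -> Sample:
    """Difficulty-aware augmentation sketch for one video-question sample.

    group_rewards: rewards of the current group's responses for this sample.
    buffer_mean_reward: mean reward over the last W steps (replay-buffer baseline).
    """
    mean_r = sum(group_rewards) / len(group_rewards)
    # Signed difficulty: positive -> harder than the running baseline, negative -> easier.
    delta = buffer_mean_reward - mean_r

    if delta > 0:
        # Too hard: inject partial reasoning clues from a successful trajectory,
        # with strength scaled by how far below the baseline this sample scores.
        sample.prompt = inject_reasoning_clue(sample.prompt, strength=delta)
    elif delta < 0:
        # Too easy: corrupt the video frames (Gaussian noise / masking),
        # with strength scaled by how far above the baseline this sample scores.
        sample.frames = add_frame_noise(sample.frames, strength=-delta)
    return sample
```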

Loss & Training

\[\mathcal{L}_{\text{Reg-GRPO}}(\theta) = \mathbb{E}\left[(\hat{A}^{(i)} - \hat{A}_\theta^{(i)})^2 + \beta\, \mathbb{D}_{KL}[\pi_\theta \,\|\, \pi_{ref}]\right]\]
  • The KL divergence constraint prevents excessive deviation from the reference model.
  • A replay buffer storing the most recent \(W\) steps of data is used for dynamic difficulty baseline computation.
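A minimal PyTorch-style sketch of this loss for a single group of \(G\) responses (assuming summed per-response log-probabilities are already available; the function name, the default \(\beta\), and the simple sample-based KL estimate are illustrative assumptions, not the authors' implementation):

```python
import torch
import torch.nn.functional as F

def reg_grpo_loss(logp_cur, logp_old, logp_ref, rewards, beta=0.04, eps=1e-6):
    """Reg-GRPO loss sketch for one group of G responses to the same prompt.

    logp_cur / logp_old / logp_ref: (G,) summed log-probabilities of each response
        under the current policy, the old (rollout) policy, and the frozen reference.
    rewards: (G,) scalar rewards for the G responses.
    """
    # Target advantage: group-normalized rewards, as in standard GRPO.
    adv_target = (rewards - rewards.mean()) / (rewards.std() + eps)

    # Predicted advantage: group-normalized log-ratio rho = log(pi_theta / pi_old).
    rho = logp_cur - logp_old
    adv_pred = (rho - rho.mean()) / (rho.std() + eps)

    # Regression term: plain MSE, with no clipping or min safeguard.
    reg = F.mse_loss(adv_pred, adv_target.detach())

    # Crude sample-based estimate of KL(pi_theta || pi_ref).
    kl = (logp_cur - logp_ref).mean()

    return reg + beta * kl
```

The squared-error term has a nonzero gradient wherever the predicted and target advantages differ, which is the mechanism behind the "no zero-gradient regions" point in the findings below.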

Key Experimental Results

Main Results

Performance on SEED-Bench-R1 validation set and LongVideoBench:

| Method | SEED-Bench-R1 (Acc) | LongVideoBench |
| --- | --- | --- |
| Qwen2.5-VL-7B (SFT) | 55.4 | 57.3 |
| + GRPO | 55.8 | 54.1 |
| + Reg-GRPO | 63.2 | 59.4 |
| + DeepVideo-R1 | 65.9 | 60.7 |

DeepVideo-R1 achieves a 10.1-point gain over GRPO on SEED-Bench-R1.

Ablation Study

  • Reg-GRPO vs. GRPO: Reg-GRPO consistently outperforms GRPO across all benchmarks with faster convergence.
  • Contribution of difficulty augmentation: Provides an additional 2.3-point gain on top of Reg-GRPO.
  • Difficulty reduction vs. difficulty increase: Each component is individually effective; combining both yields the best results.
  • Zero-advantage ratio: Difficulty augmentation reduces the zero-advantage ratio from approximately 30% to approximately 10%.

Key Findings

  • The regression objective yields more stable gradients, eliminating zero-gradient regions caused by clipping.
  • Difficulty-aware augmentation directly addresses the root cause of vanishing advantage — zero reward variance within a group.
  • Consistent improvements on both in-distribution and out-of-distribution tasks indicate enhanced generalization.

Highlights & Insights

  • The derivation of Reg-GRPO is concise: starting from the closed-form RL solution, the partition function is naturally eliminated through group normalization (sketched after this list).
  • Difficulty-aware augmentation constitutes an RL-native implementation of curriculum learning.
  • Extracting reasoning clues from successful trajectories as a difficulty-reduction strategy represents an interesting self-guided approach.
  • The method is not restricted to the video domain and is applicable to any setting employing GRPO.
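To make the first point concrete, here is a sketch of the derivation in the summary's notation (treating \(\pi_{\theta_{old}}\) as the reference in the closed-form step is an assumption about the paper's exact setup): the KL-constrained objective is maximized by

\[\pi^*(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})}\, \pi_{\theta_{old}}(\mathbf{y} \mid \mathbf{x}) \exp\!\left(\frac{1}{\beta} A(\mathbf{x}, \mathbf{y})\right) \quad\Longrightarrow\quad A(\mathbf{x}, \mathbf{y}) = \beta \log \frac{\pi^*(\mathbf{y} \mid \mathbf{x})}{\pi_{\theta_{old}}(\mathbf{y} \mid \mathbf{x})} + \beta \log Z(\mathbf{x}).\]

Because \(\beta \log Z(\mathbf{x})\) is constant across all responses to the same prompt and \(\beta\) is a common scale, standardizing the log-ratio within the group (subtracting \(\mu_\rho\), dividing by \(\sigma_\rho\)) removes both, so the regression target needs neither the partition function nor an explicit \(\beta\).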

Limitations & Future Work

  • Validation is limited to 7B-scale models; scaling behavior at larger sizes remains unknown.
  • Reasoning clue extraction requires additional generation steps, increasing data preparation cost.
  • Comparisons with other alignment methods such as DPO are absent.
  • Sensitivity analysis of the replay buffer window size \(W\) is insufficiently thorough.
  • Reg-GRPO shares conceptual similarities with REBEL (direct reward regression) but is more naturally motivated in the group-relative setting.
  • The approach is complementary to NoisyRollout: NoisyRollout improves exploration diversity while this work improves the optimization objective.
  • The difficulty-aware paradigm can be integrated with VideoLLM RL works such as VideoChat-R1 and TimeZero.

Rating

  • ⭐⭐⭐⭐ — The RL algorithm improvement is theoretically well-grounded and empirically significant, and the difficulty augmentation strategy is practical; however, validation at larger scales remains limited.