TempSamp-R1: Effective Temporal Sampling with Reinforcement Fine-Tuning for Video LLMs¶
Conference: NeurIPS 2025 arXiv: 2509.18056 Code: GitHub Area: Video Understanding Keywords: Video Temporal Grounding, Reinforcement Learning Fine-Tuning, GRPO, Mixed-Policy Sampling, Chain-of-Thought
TL;DR¶
This paper proposes TempSamp-R1, a mixed-policy reinforcement fine-tuning framework that integrates high-quality off-policy (ground truth) guidance into GRPO's on-policy sampling and introduces nonlinear soft advantage estimation to stabilize training, achieving state-of-the-art performance on video temporal grounding (Charades-STA R1@0.7: 52.9%, ActivityNet R1@0.5: 56.0%).
Background & Motivation¶
- Background: MLLMs excel at general video understanding but still struggle with temporal grounding tasks, which require precise comprehension of spatiotemporal relationships in long videos.
- Limitations of Prior Work: SFT-based methods overfit to static timestamp annotations and lack flexible temporal reasoning. GRPO's on-policy sampling is inefficient in large temporal search spaces, making it difficult to find temporally precise solutions.
- Key Challenge: GRPO uses ground truth only for computing IoU rewards rather than treating it as a dynamic learning resource; additionally, the high rewards of off-policy solutions bias advantage estimation.
- Goal: Design a more stable and efficient RL fine-tuning framework that fully leverages annotation information to guide policy learning.
- Key Insight: Incorporate ground truth directly as an off-policy solution in policy optimization, while correcting the reward-distribution bias this introduces.
- Core Idea: Mixed-policy sampling + nonlinear soft advantage estimation + hybrid CoT training.
Method¶
Overall Architecture¶
Built on Qwen2.5-VL-7B, the framework generates \(G-1\) on-policy samples and 1 off-policy solution (ground truth) per query. A soft advantage estimation module computes advantage values for stable policy optimization.
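The group construction and joint normalization can be sketched as follows (a minimal pure-Python sketch; the function name and the `1e-8` stabilizer are illustrative, not from the paper):

```python
import math

def mixed_policy_advantages(on_policy_rewards, gt_reward):
    """GRPO-style joint normalization over a mixed group:
    G-1 on-policy samples plus 1 off-policy (ground-truth) solution.
    Sketch only; names and details are illustrative."""
    rewards = list(on_policy_rewards) + [gt_reward]
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    # Jointly normalized advantages; the last entry belongs to the off-policy solution.
    return [(r - mean) / (std + 1e-8) for r in rewards]

# G = 4: three on-policy rollouts with modest IoU rewards plus the ground truth (IoU = 1.0)
adv = mixed_policy_advantages([0.1, 0.2, 0.3], 1.0)
```

Note how the ground truth's high reward pushes every on-policy sample's advantage negative; this is exactly the bias that the paper's nonlinear soft advantage estimation is designed to mitigate.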
Key Designs¶
- Mixed-Policy Sampling:
    - Function: Provides temporally precise guidance solutions.
    - Mechanism: For each query, \(G-1\) solutions are sampled from the current policy, and 1 ground truth is added as an off-policy solution; advantages are computed via joint normalization.
    - Design Motivation: Pure on-policy sampling rarely produces high-IoU solutions in large temporal search spaces; ground truth provides precise temporal anchoring.
- Nonlinear Soft Advantage Estimation:
    - Function: Mitigates advantage bias caused by high rewards from off-policy solutions.
    - Mechanism: An asymmetric transformation is applied to rewards: \(\tilde{r}_i = \tau + \alpha_1 \cdot \ln((r_i - \tau) + 1)\) when \(r_i \geq \tau\), with exponential expansion in the low-reward region. This compresses advantage gaps near optimal solutions while amplifying differences among suboptimal ones.
    - Design Motivation: Directly using off-policy rewards causes all on-policy solutions to receive negative advantages, suppressing effective exploration.
- Hybrid CoT Training:
    - Function: Enables a single model to support both CoT and non-CoT inference.
    - Mechanism: Two-stage training — the initialization stage trains direct answering without reasoning; subsequently, format rewards encourage \<Think> reasoning steps.
    - Design Motivation: Charades-STA and ActivityNet benefit from CoT, while QVHighlights benefits from direct prediction; the hybrid mode is complementary.
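The asymmetric transformation can be sketched as below. The log branch follows the formula above; the exponential low-reward branch is an assumed symmetric form (the paper only states that the low-reward region is expanded), and the threshold \(\tau\) and coefficients are illustrative:

```python
import math

def soft_reward(r, tau=0.5, alpha1=1.0, alpha2=1.0):
    """Asymmetric nonlinear reward shaping (sketch).
    tau, alpha1, alpha2 are illustrative hyperparameters, not the paper's values."""
    if r >= tau:
        # Logarithmic compression: shrinks advantage gaps near the (off-policy) optimum.
        return tau + alpha1 * math.log((r - tau) + 1.0)
    # Assumed exponential expansion: amplifies differences among suboptimal solutions.
    return tau - alpha2 * (math.exp(tau - r) - 1.0)
```

With this shaping, a near-optimal on-policy solution is no longer dwarfed by the ground truth's reward, while low-reward solutions remain clearly separated from each other.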
Loss & Training¶
- IoU Reward: \(R_{IoU}\) based on the intersection-over-union between predicted and ground truth temporal intervals.
- Timestamp Matching Reward: \(R_{ts} = \lambda_{rec} \cdot \mathrm{F2} + \lambda_{score} \cdot \frac{1}{1+\mathrm{WMSE}}\)
- Format Reward: Regex-based verification of the \<Think>...\</Think>\<Answer>...\</Answer> structure.
- Training: 4 × A100, batch size = 1 per GPU, \(G=4\) (3 on-policy + 1 off-policy).
- Two-stage training: the initialization stage trains direct answering; the subsequent stage adds format rewards to encourage reasoning steps. Video frame rate is fixed at 2 FPS.
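The IoU and format rewards can be sketched as follows (a minimal sketch; the helper names and the exact regex are illustrative, and the timestamp-matching reward is omitted since its F-score and WMSE terms are benchmark-specific):

```python
import re

def temporal_iou(pred, gt):
    """R_IoU: intersection-over-union of predicted vs. ground-truth
    [start, end] intervals (e.g., in seconds)."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

# Illustrative regex check for the <Think>...</Think><Answer>...</Answer> structure.
FORMAT_RE = re.compile(r"^<Think>.*?</Think>\s*<Answer>.*?</Answer>$", re.DOTALL)

def format_reward(text):
    return 1.0 if FORMAT_RE.match(text.strip()) else 0.0
```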
Key Experimental Results¶
Main Results¶
| Dataset | Metric | TempSamp-R1 | TimeZero | VideoChat-R1 | Gain |
|---|---|---|---|---|---|
| Charades-STA | R1@0.7 | 52.9% | 47.9% | 50.2% | +2.7% |
| ActivityNet | R1@0.5 | 56.0% | 47.3% | - | +8.7% |
| QVHighlights | mAP | 30.0% | - | - | +3.0% |
Ablation Study¶
| Configuration | mIoU | Note |
|---|---|---|
| SFT only | 20.6 | Baseline |
| GRPO only | 30.7 | Pure on-policy |
| TempSamp-R1 | 34.7 | Mixed-policy + soft advantage |
Key Findings¶
- Selecting the optimal inference mode (CoT vs. non-CoT) in hybrid CoT training yields an additional 4%+ improvement in mIoU.
- Performance remains robust in low-data regimes.
- Nonlinear reward shaping is particularly effective on datasets with large search spaces, such as ActivityNet.
- Few-shot experiments: with only 50 training videos, TempSamp-R1 achieves mIoU 44.7% (vs. SFT 41.9%); with 500 videos, it reaches R1@0.5 64.0% (vs. SFT 51.4%, GRPO 55.3%), with a training time of 218 minutes (vs. GRPO 338 minutes).
- Advantage shaping ablation: directly injecting GT rewards (Mixed-policy only) drops R1@0.5 to 63.0% (due to distribution shift and reduced diversity); reward downscaling achieves 70.3%; advantage anchoring achieves 70.7%; nonlinear shaping yields the best result at 72.1%.
- Skewness analysis: GRPO exhibits consistently negative skew (dominated by low-reward solutions); Mixed-policy alone shows high positive skew (over-reliance on high-reward solutions); nonlinear shaping maintains near-zero skewness, ensuring stable optimization.
- Cross-domain generalization: training on Charades-STA and testing on ActivityNet, TempSamp-R1 outperforms GRPO by +4.0% mIoU and +4.7% R1@0.5.
Highlights & Insights¶
- Ground truth is cleverly repurposed from an "evaluation tool" to a "learning resource."
- The asymmetric design of nonlinear advantage estimation is applicable to any RL scenario involving high-quality external solutions.
- Hybrid CoT training enables a single model to adapt to queries of varying complexity.
- The method is simple, requiring only minor modifications on top of GRPO.
- Ablation on sample count: with only 2 on-policy samples + 1 off-policy sample, R1@0.7 already improves from GRPO's 34.4% to 44.6% (+10.2%), representing the largest gain. As the sample count increases to 4/6/8, the gap narrows but TempSamp-R1 consistently leads.
Limitations & Future Work¶
- Relies on ground truth as off-policy guidance, making it inapplicable in unannotated settings.
- The threshold \(\tau\) and coefficients in the nonlinear transformation require manual tuning.
- Combination with larger models (e.g., 72B) has not been explored.
- The video frame rate is fixed at 2 FPS; higher frame rates may improve performance.
- Reward distribution analysis: GRPO exhibits low median and high variance on ActivityNet; TempSamp-R1 shows a significantly higher median with reduced variance, indicating that mixed-policy sampling consistently finds higher-quality solutions.
Related Work & Insights¶
- vs. TimeZero: TimeZero relies on pure GRPO on-policy sampling, whereas TempSamp-R1 incorporates off-policy guidance. ActivityNet R1@0.5: TempSamp-R1 56.0% vs. TimeZero 47.3% (+8.7%).
- vs. VideoChat-R1: VideoChat-R1 focuses on reward function design, while TempSamp-R1 focuses on sampling strategy optimization. Charades-STA R1@0.7: TempSamp-R1 52.9% vs. VideoChat-R1 50.2% (+2.7%).
- vs. iMOVE (SFT): SFT methods overfit to timestamps, whereas TempSamp-R1 learns flexible reasoning through RL.
Implementation Details¶
Built on Qwen2.5-VL-7B, trained on 4 × A100, batch size = 1 per GPU. \(G=4\) (3 on-policy + 1 off-policy), video frame rate 2 FPS.
Rating¶
- Novelty: ⭐⭐⭐⭐ Mixed-policy sampling combined with soft advantage estimation constitutes an effective improvement to RL fine-tuning.
- Experimental Thoroughness: ⭐⭐⭐⭐ Comprehensive evaluation across 3 datasets.
- Writing Quality: ⭐⭐⭐⭐ Method motivation and design rationale are clearly articulated.
- Value: ⭐⭐⭐⭐ Offers practical reference for both video temporal grounding and RL fine-tuning research.