Scaling RL to Long Videos

Conference: NeurIPS 2025 · arXiv: 2507.07966 · Code: GitHub · Area: Video Generation · Keywords: Long video reasoning, reinforcement learning, vision-language models, sequence parallelism, chain-of-thought

TL;DR

This paper proposes LongVILA-R1, a full-stack framework that extends VLM reasoning to long videos (up to 8,192 frames) via a 104K long-video reasoning dataset (LongVideo-Reason), a two-stage CoT-SFT + RL training pipeline, and MR-SP, a multimodal reinforcement sequence-parallelism training system, reaching 65.1%/71.1% (without/with subtitles) on VideoMME.

Background & Motivation

  1. Background: Long video understanding demands temporal, spatial, goal-oriented, and narrative reasoning capabilities. Closed-source models such as GPT-4o and Gemini-1.5-Pro have demonstrated strong performance, while open-source VLMs have also made progress on short videos.

  2. Limitations of Prior Work: (1) High-quality long-video reasoning datasets are lacking — unlike math or code reasoning with structured annotations, long-video reasoning requires labeling complex temporal dynamics and narrative elements. (2) RL training frameworks for long videos are difficult to construct — processing hundreds to thousands of frames incurs enormous memory overhead and extremely long rollout times.

  3. Key Challenge: Existing RL frameworks (e.g., R1-V, EasyR1) are not designed for long videos; GRPO's group sampling becomes computationally prohibitive at long context lengths, and every rollout redundantly re-encodes the same visual input.

  4. Goal: To holistically address the three core challenges of long-video reasoning: data, training methodology, and training systems.

  5. Key Insight: On the data side, NVILA-8B and DeepSeek-R1-671B are used to automatically generate long-video CoT annotations; on the system side, sequence parallelism and video embedding caching are employed to accelerate RL training.

  6. Core Idea: Caching video embeddings combined with sequence parallelism makes long-video RL training feasible, while high-quality CoT data and difficulty filtering are critical to the emergence of reasoning capabilities.

Method

Overall Architecture

Three major components:

  1. LongVideo-Reason dataset: 104K QA pairs derived from 18K long videos
  2. Two-stage training: Stage-1 CoT-SFT (36K examples) → Stage-2 GRPO RL (68K + 102K examples)
  3. MR-SP training system: sequence parallelism + video-embedding caching

Key Designs

1. LongVideo-Reason Data Construction

  • Function: Provides large-scale, high-quality reasoning annotations for long videos
  • Mechanism:
    • Annotation pipeline: videos are segmented into ~10 s clips → NVILA-8B generates a caption for each clip → the clip captions are aggregated into a full-video description → DeepSeek-R1-671B generates Question-Reasoning-Answer triplets from that description. Four reasoning types are covered: temporal, goal/purpose, spatial, and narrative reasoning.
    • Difficulty filtering: each question is rolled out 10 times; samples that are too easy (all rollouts correct) or too hard (all wrong) are discarded, retaining only moderate-difficulty instances, since GRPO needs reward diversity across rollouts (see the sketch after this list).
  • Design Motivation: GRPO is sensitive to within-group sampling: if every rollout in a group is correct, or every one is wrong, the group-normalized advantages are all zero and the gradient vanishes, so data of moderate difficulty is required.
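
A minimal sketch of this difficulty filter, assuming a hypothetical `rollout_fn` that samples one answer from the model and a hypothetical `is_correct` checker; both are stand-ins for illustration, not the paper's API:

```python
from typing import Callable, List

def filter_by_difficulty(
    samples: List[dict],
    rollout_fn: Callable[[dict], str],        # hypothetical: samples one answer per call
    is_correct: Callable[[dict, str], bool],  # hypothetical: rule-based answer check
    num_rollouts: int = 10,
) -> List[dict]:
    """Keep only moderate-difficulty samples for GRPO training.

    A sample where all rollouts agree (all correct or all wrong) yields
    identical rewards, zero group-normalized advantages, and thus no gradient.
    """
    kept = []
    for sample in samples:
        n_correct = sum(
            is_correct(sample, rollout_fn(sample)) for _ in range(num_rollouts)
        )
        if 0 < n_correct < num_rollouts:  # require reward diversity in the group
            kept.append(sample)
    return kept
```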

2. MR-SP Multimodal Reinforcement Sequence Parallelism

  • Function: Makes long-video RL training feasible and efficient
  • Mechanism:
    • Stage 1 — Rollout parallel encoding: Video frames are evenly distributed across multiple GPUs, each encoding independently, with embeddings aggregated via all-gather. Key optimization: video embeddings are cached and reused across 8–16 rollouts to avoid redundant encoding.
    • Stage 2 — Prefilling sequence parallelism: The prefilling step for both the policy and reference models is parallelized — global embeddings are padded to a uniform length and sharded across GPUs, with each GPU computing logits for only a subset of tokens.
  • Design Motivation: Long videos produce massive numbers of visual tokens (\(10^4\)–\(10^5\)), which cannot fit on a single GPU. Embedding caching eliminates \(G\)-fold redundant encoding, where \(G\) is the number of rollouts per prompt (see the sketch below).
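
A single-process sketch of the embedding cache, assuming the vision tower is frozen during RL; the frame sharding across GPUs and the all-gather step are elided, and `encoder` is a generic stand-in for the vision tower:

```python
import torch
from torch import nn

class CachedVideoEncoder:
    """Encode each video once, then reuse the embeddings across all G rollouts.

    Minimal sketch: the real MR-SP Stage 1 additionally shards frames across
    GPUs, encodes them in parallel, and reassembles embeddings via all-gather.
    """

    def __init__(self, encoder: nn.Module):
        self.encoder = encoder                     # stand-in for the vision tower
        self._cache: dict[str, torch.Tensor] = {}

    @torch.no_grad()  # assumes a frozen vision tower during RL
    def encode(self, video_id: str, frames: torch.Tensor) -> torch.Tensor:
        # Cache hit: later rollouts of the same video skip re-encoding,
        # removing the G-fold redundant vision-tower cost.
        if video_id not in self._cache:
            self._cache[video_id] = self.encoder(frames)
        return self._cache[video_id]
```

With group size \(G = 8\), this turns eight vision-tower passes per prompt into one; the cached embeddings then feed the sequence-parallel prefilling of Stage 2.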

3. Two-Stage Training Strategy

  • Function: First establishes a reasoning foundation, then scales it via RL
  • Mechanism: Stage-1 applies SFT on 36K high-quality CoT examples (formatted as <think></think><answer></answer>) to equip the model with basic reasoning and instruction-following abilities. Stage-2 applies GRPO on 68K curated examples plus 102K open-source examples, using rule-based rewards for answer accuracy and output format (a reward sketch follows this list), further scaling reasoning ability.
  • Design Motivation: Skipping SFT and directly applying RL degrades performance; SFT provides necessary reasoning warm-up.
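
A hedged sketch of the rule-based reward described above, assuming the <think></think><answer></answer> template; the 0.1/1.0 weights are illustrative choices, not the paper's:

```python
import re

ANSWER_RE = re.compile(
    r"\s*<think>(.+?)</think>\s*<answer>(.+?)</answer>\s*", re.DOTALL
)

def rule_based_reward(response: str, ground_truth: str) -> float:
    """Format reward plus accuracy reward, both rule-based (no learned judge)."""
    match = ANSWER_RE.fullmatch(response)
    if match is None:
        return 0.0                      # malformed output: no reward at all
    answer = match.group(2).strip()
    format_reward = 0.1                 # illustrative weight for a valid template
    accuracy = 1.0 if answer.lower() == ground_truth.strip().lower() else 0.0
    return format_reward + accuracy
```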

Loss & Training

  • Stage-1: Standard SFT cross-entropy loss
  • Stage-2: GRPO objective \(\mathcal{J}(\theta)\) with PPO-style clipping and KL regularization (written out below); group size \(G=8\), with advantage \(A_i\) computed via within-group reward normalization
  • Reward: Rule-based (format correctness + answer accuracy)
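
For reference, the standard GRPO objective this summary refers to (following the usual formulation; details in the paper may differ):

\[
\mathcal{J}(\theta) = \mathbb{E}\!\left[\frac{1}{G}\sum_{i=1}^{G} \min\!\Big(r_i(\theta)\,A_i,\; \operatorname{clip}\big(r_i(\theta),\,1-\epsilon,\,1+\epsilon\big)\,A_i\Big)\right] - \beta\,\mathbb{D}_{\mathrm{KL}}\!\big(\pi_\theta \,\Vert\, \pi_{\mathrm{ref}}\big),
\]

with importance ratio \(r_i(\theta) = \pi_\theta(o_i \mid q)\,/\,\pi_{\theta_{\mathrm{old}}}(o_i \mid q)\) and group-normalized advantage

\[
A_i = \frac{R_i - \operatorname{mean}\big(\{R_j\}_{j=1}^{G}\big)}{\operatorname{std}\big(\{R_j\}_{j=1}^{G}\big)}.
\]

A group of uniformly correct or uniformly wrong rollouts makes every \(A_i = 0\), which is exactly the gradient-vanishing condition the difficulty filter guards against.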

Key Experimental Results

Main Results

VideoMME Benchmark

| Model | w/o sub | Short | Medium | Long | w/ sub |
| --- | --- | --- | --- | --- | --- |
| LongVILA-7B | 60.1 | 69.0 | 58.3 | 53.0 | 65.1 |
| LongVILA-R1-7B | 65.1 | 76.8 | 63.2 | 55.2 | 71.1 |
| Video-R1-7B | 61.4 | - | - | - | - |
| Gemini-1.5-Pro | 75.0 | - | - | - | 81.3 |
| LLaVA-Video-7B | 63.3 | - | - | - | 69.7 |

LongVideo-Reason-eval

| Model | Temporal | Goal | Plot | Spatial | Overall |
| --- | --- | --- | --- | --- | --- |
| Video-R1-7B | 61.4 | 85.0 | 62.0 | 58.5 | 68.1 |
| Gemini-1.5-Pro | 65.4 | 81.9 | 67.8 | 53.3 | 69.3 |
| LongVILA-7B | 58.0 | 80.2 | 57.1 | 46.7 | 62.7 |
| LongVILA-R1-7B | 68.1 | 85.7 | 70.6 | 53.3 | 72.0 |

Ablation Study

| Setting | CoT-SFT | RL | LongVideo-Reason-eval |
| --- | --- | --- | --- |
| Base | - | - | 62.7 |
| SFT only (our CoT data) | ✓ | - | Significant gain |
| Direct RL (no SFT) | - | ✓ | Performance drop |
| Full pipeline | ✓ | ✓ | 72.0 |

MR-SP Training Efficiency (8×A100, LongVILA-7B)

| Frames | w/o MR-SP | MR-SP Stage 1 only | Full MR-SP |
| --- | --- | --- | --- |
| 256 | Normal | Accelerated | Accelerated |
| 512 | Slow | Accelerated but OOM | 2.1× speedup |
| 1024 | OOM | OOM | Runnable |

Key Findings

  • Reasoning gains from RL continue to scale with increasing frame count (LongVILA-R1 improves consistently from 16→512 frames, whereas LongVILA saturates or degrades beyond 256 frames)
  • MR-SP achieves a 2.1× speedup at 512 frames and is the only viable solution at 1024 frames
  • CoT-SFT is a necessary prerequisite for RL — skipping it leads to performance degradation
  • RL training on hour-long videos (3600 frames) is feasible on a single A100 node

Highlights & Insights

  • Full-stack solution: the data, training, and system components are designed to fit together rather than as isolated single-point contributions
  • MR-SP transforms long-video RL from infeasible to practically viable; embedding caching to eliminate \(G\)-fold redundant encoding is the key optimization
  • The data filtering strategy (removing too-easy and too-hard samples) is critical for GRPO, serving as a practical remedy for the gradient-vanishing condition noted above
  • The sustained improvement in reasoning ability with increasing frame count validates the value of long-video RL

Limitations & Future Work

  • Data generation requires approximately 80,000 H100 GPU hours, resulting in substantial cost
  • The reasoning pipeline relies on segment-level captioning followed by LLM-based reasoning generation, which may introduce caption noise
  • Experiments are currently limited to the 7B model scale; performance at larger scales remains unexplored
  • Spatial reasoning scores show limited improvement (53.3%), which is identified as a known weakness
  • Dependencies: the framework builds on LongVILA's MM-SP infrastructure, and data generation relies on external models (NVILA-8B for captioning, DeepSeek-R1-671B for reasoning generation)
  • Positioning: prior work such as Video-R1 handles only 16-frame short videos; this work extends RL for VLMs to the long-video setting
  • Insight: effective RL training hinges on matching data difficulty to model capability and on robust systems engineering

Rating

  • Novelty: ⭐⭐⭐⭐ Full-stack integration rather than a single-point contribution; MR-SP represents a significant systems contribution
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ Multi-benchmark comparisons, comprehensive ablations, and quantified training efficiency
  • Writing Quality: ⭐⭐⭐⭐ Content-dense but well-organized
  • Value: ⭐⭐⭐⭐⭐ Provides a reproducible, complete solution for long-video VLM reasoning with high open-source utility