Recurrent Reasoning with Vision-Language Models for Estimating Long-Horizon Embodied Task Progress¶
Conference: CVPR 2026 · arXiv: 2603.17312 · Code: HuggingFace · Area: Multimodal VLM · Keywords: task progress estimation, embodied intelligence, recurrent reasoning, Chain-of-Thought, reinforcement learning
TL;DR¶
This paper proposes R²VLM, a recurrent reasoning framework that processes local video segments sequentially, maintains a dynamically updated CoT record tracking task decomposition and completion status, and leverages a multi-dimensional RL reward scheme to achieve state-of-the-art performance in long-horizon embodied task progress estimation. The framework additionally supports downstream applications including policy learning, reward modeling, and proactive assistance.
Background & Motivation¶
Background: Embodied agents require accurate estimation of execution progress over multi-step, long-horizon tasks to support long-range planning and context-aware decision-making.
Limitations of Prior Work:
- Methods such as GVL and ROVER exploit VLMs' video understanding capabilities and large context windows while neglecting their reasoning potential.
- Processing long video trajectories (often thousands of frames) incurs prohibitive computational overhead, making real-time deployment impractical.
- Tasks involving multiple temporally dependent subtasks demand reasoning capabilities to align visual observations with logical dependencies.
Key Challenge: Full-video processing is computationally prohibitive, whereas local segment processing lacks global context; video understanding alone is insufficient to handle complex temporal-logical dependencies.
Goal: Efficient, accurate, and interpretable estimation of long-horizon embodied task progress.
Key Insight: Emulating human cognition — "observe a segment, reason about it, and retain key information" — by recurrently processing video segments while maintaining structured memory.
Core Idea: Recurrent reasoning combined with a dynamic CoT as a cross-timestep memory carrier, avoiding full-video processing while preserving global context.
Method¶
Overall Architecture¶
The input video is segmented into short clips \(v_t\) (4 s or 2 s). At each reasoning step, the model receives the current clip \(v_t\) and the historical CoT \(c_{t-1}\), and produces an updated CoT \(c_t\) and a progress estimate \(p_t\): \(c_t, p_t = f_\theta(\tau, v_t, c_{t-1})\), where \(\tau\) denotes the task description.
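The recurrent update can be sketched as a plain loop over clips. This is a minimal illustration, not the authors' implementation: `CoT`, `vlm_step`, and the hard-coded subtask list are hypothetical stand-ins for the VLM call \(f_\theta\) and its structured output.

```python
from dataclasses import dataclass

@dataclass
class CoT:
    """Hypothetical stand-in for the paper's structured CoT record."""
    steps: list  # task decomposition: list of subtask descriptions
    done: list   # completion flags, parallel to `steps`

def vlm_step(task, clip, cot):
    """Placeholder for one VLM call c_t, p_t = f_theta(tau, v_t, c_{t-1}).
    Faked here: each clip is assumed to complete the next pending step."""
    done = list(cot.done)
    if False in done:
        done[done.index(False)] = True
    new_cot = CoT(steps=list(cot.steps), done=done)
    progress = sum(new_cot.done) / len(new_cot.steps)  # proportion of completed steps
    return new_cot, progress

def estimate_progress(task, clips):
    # c_0: initial decomposition from commonsense (faked as three fixed steps)
    cot = CoT(steps=["pick up mug", "go to sink", "wash mug"], done=[False] * 3)
    history = []
    for clip in clips:                       # recurrent pass over local segments
        cot, p = vlm_step(task, clip, cot)   # CoT carries global context forward
        history.append(p)
    return history

print(estimate_progress("wash the mug", clips=[0, 1]))  # progress after each clip
```

The key design point visible even in this toy loop: only the current clip and the compact CoT are in context at each step, so cost stays constant in trajectory length.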
Key Designs¶
- Recurrent Reasoning Framework:
  - Initial iteration: the VLM leverages commonsense knowledge to generate an initial CoT \(c_0\) representing the task decomposition.
  - Subsequent iterations: the decomposition is dynamically refined (merging/splitting/reordering steps) based on new video segments, and completion status is updated accordingly.
  - Three core advantages: (1) the CoT improves accuracy and interpretability; (2) the historical CoT provides global context; (3) inheriting prior-round reasoning ensures logical consistency.
  - Design motivation: propagating global information through the CoT eliminates the redundant computation of reprocessing full-length videos.
- CoT Structure:
  - Three components: (i) task decomposition (listing subtasks); (ii) key step analysis (completed/pending); (iii) progress estimation based on the proportion of completed steps.
  - The decomposition can be dynamically adjusted each round, since the environment is partially observable and actual execution may not fully align with the initial decomposition.
- Multi-Dimensional RL Reward System (PPO):
  - Format Reward (\(R_{fmt}\)): verifies the output format (think/answer tags); compliant = 1, otherwise 0.
  - Bin Reward (\(R_{bin}\)): checks whether the predicted progress falls within the correct step interval; correct = 1.0, adjacent = 0.25.
  - MAE Reward (\(R_{mae}\)): \(\max(1 - |p_t - p_t^{gt}|/\delta_1, 0)\), providing fine-grained supervision.
  - Improvement Reward (\(R_{imp}\)): encourages each round's prediction error to be lower than the previous round's, reflecting the self-correction capability of recurrent reasoning.
  - Finish Reward (\(R_{fin}\)): rewards correct judgment of task completion.
  - Overall reward: \(R_{overall} = R_{fmt} \cdot (R_{bin} \cdot R_{mae} + \alpha R_{imp} + \beta R_{fin})\)
  - PPO over GRPO: in the multi-turn setting, \(c_{t-1}\) differs across trajectories, violating GRPO's requirement that multiple candidates be sampled from identical inputs.
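The reward composition can be sketched as follows. This is an illustrative reconstruction under stated assumptions: the paper does not give \(\delta_1\), \(\alpha\), or \(\beta\), so the values below are placeholders, and the clipping of \(R_{imp}\) to \([-1, 0.8]\) follows the range described later in this note.

```python
def format_reward(tags_ok: bool) -> float:
    # R_fmt: 1 if the think/answer tags are well-formed, else 0 (gates everything)
    return 1.0 if tags_ok else 0.0

def bin_reward(pred_bin: int, gt_bin: int) -> float:
    # R_bin: 1.0 for the correct step interval, 0.25 for an adjacent one
    if pred_bin == gt_bin:
        return 1.0
    return 0.25 if abs(pred_bin - gt_bin) == 1 else 0.0

def mae_reward(p: float, p_gt: float, delta1: float = 0.1) -> float:
    # R_mae = max(1 - |p_t - p_gt| / delta_1, 0); delta_1 = 0.1 is an assumption
    return max(1.0 - abs(p - p_gt) / delta1, 0.0)

def improvement_reward(err_t: float, err_prev: float) -> float:
    # R_imp: positive when this round's error drops below the previous round's,
    # clipped to the asymmetric range [-1, 0.8] (exact scaling is an assumption)
    return max(min(err_prev - err_t, 0.8), -1.0)

def overall_reward(tags_ok, pred_bin, gt_bin, p, p_gt, err_prev,
                   fin=0.0, alpha=0.5, beta=0.5):
    # R_overall = R_fmt * (R_bin * R_mae + alpha * R_imp + beta * R_fin);
    # alpha/beta weights here are illustrative, not the paper's values
    r_imp = improvement_reward(abs(p - p_gt), err_prev)
    return format_reward(tags_ok) * (
        bin_reward(pred_bin, gt_bin) * mae_reward(p, p_gt)
        + alpha * r_imp + beta * fin
    )
```

Note how the multiplicative \(R_{fmt}\) zeroes out everything for malformed outputs, while \(R_{bin} \cdot R_{mae}\) couples the coarse interval check with the fine-grained error term.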
Loss & Training¶
Two-stage training: (1) SFT to learn the reasoning pattern; (2) multi-turn PPO reinforcement learning initialized from the cold-start SFT checkpoint.
Key Experimental Results¶
Main Results¶
| Model | Size | ALFRED \(p_{mae}\)↓ | ALFRED \(bin\)↑ | Ego4D \(p_{mae}\)↓ | Ego4D \(bin\)↑ |
|---|---|---|---|---|---|
| GPT-5 | - | 18.35 | 0.505 | 25.04 | 0.259 |
| Gemini-2.5-Pro | - | 16.27 | 0.481 | 28.22 | 0.217 |
| Qwen2.5-VL-72B | 72B | 24.88 | 0.342 | 26.88 | 0.254 |
| R²VLM (SFT+RL) | 7B | 6.34 | 0.758 | 11.88 | 0.526 |
Ablation Study¶
| Configuration | ALFRED \(p_{mae}\)↓ | Note |
|---|---|---|
| SFT only | 7.52 | Basic supervised fine-tuning |
| + RL (w/o \(R_{imp}\)) | 6.89 | Missing cross-round improvement signal |
| + RL (w/o \(R_{bin}\)) | 7.11 | Missing coarse-grained step constraint |
| Full R²VLM | 6.34 | All rewards combined, best performance |
Key Findings¶
- R²VLM at 7B outperforms GPT-5 and Gemini-2.5-Pro across the board, reducing ALFRED MAE by over 65% relative to GPT-5.
- The Improvement Reward contributes significantly to multi-turn reasoning, demonstrating the value of self-correction in recurrent reasoning.
- Strong generalization is observed across three downstream tasks: progress-augmented policy learning, reward modeling, and proactive assistance.
- Recurrent reasoning avoids full-video processing, yielding substantially faster inference compared to global methods.
Highlights & Insights¶
- Recurrent Reasoning + CoT as Memory: the CoT is extended from a one-shot reasoning tool to a structured cross-timestep memory carrier, maintaining global consistency while avoiding long-video computation. This paradigm is transferable to any VLM task requiring long-span temporal reasoning.
- Improvement Reward Design: directly measures the model's self-correction capability by rewarding cross-round error reduction — a design unique to multi-turn reasoning settings.
- Automated Data Generation Pipeline: expert trajectories from ALFRED/Ego4D are automatically converted into video segment + CoT training pairs, including a strategy for generating distractor task descriptions.
- Multi-Downstream Validation: beyond progress estimation, the framework demonstrates value as an RL reward model and a proactive assistance system.
Limitations & Future Work¶
- The quality of CoT step decomposition heavily depends on the VLM's commonsense reasoning, which may be inaccurate for complex novel tasks.
- ALFRED is a simulation environment; performance on real-world data (Ego4D) still exhibits a notable gap.
- Clip length is fixed (4s/2s), with no mechanism for dynamically adjusting segment granularity.
- Experiments are limited to Qwen2.5-VL-7B; the effectiveness on larger or stronger base models remains unverified.
Related Work & Insights¶
- vs. GVL / ROVER: these methods rely on VLM in-context learning and large context windows without explicit reasoning, limiting performance on complex long-horizon tasks. R²VLM achieves substantial improvements through recurrent reasoning and RL.
- vs. Hierarchical Reward Methods: conventional approaches require manually designed task hierarchy decompositions, whereas R²VLM automatically learns decomposition and reasoning strategies through training.
Supplementary Analysis¶
- Progress is defined as the proportion of completed steps rather than elapsed time, better reflecting long-horizon task structure given the large variance in per-step duration.
- The distractor task generation strategy is elegant: the first \(n_r\) steps are forced to match the original task while subsequent steps differ, enabling precise control over ground-truth progress.
- Human-verified benchmark retention rates of 93% (ALFRED) and 74% (Ego4D) indicate high quality of the automatically generated data.
- The technical rationale for choosing PPO over GRPO is that GRPO requires generating multiple candidates from the same input, which is incompatible with the recurrent setting where \(c_{t-1}\) differs across trajectories.
- The asymmetric range [-1, 0.8] of the Improvement Reward amplifies the penalty for error increases, encouraging conservative yet stable progress estimation.
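The distractor construction described above can be made concrete with a small sketch. Assumptions are labeled: the helper name and the rule that progress stalls once execution diverges from the distractor's steps are my reading of "first \(n_r\) steps match, subsequent steps differ", not code from the paper.

```python
def distractor_progress(num_steps: int, n_r: int, completed: int) -> float:
    """Hypothetical helper: ground-truth progress of a trajectory scored
    against a distractor task whose first n_r steps match the executed
    task and whose remaining steps differ (assumed equal step count).

    num_steps: total steps in the (distractor) decomposition
    n_r:       length of the shared prefix
    completed: steps of the original task actually completed in the video
    """
    # Only steps inside the shared prefix count toward the distractor task,
    # so the ground-truth progress is pinned exactly by n_r.
    matched = min(completed, n_r)
    return matched / num_steps

# e.g. 4-step distractor sharing 2 steps; video completes 3 original steps
print(distractor_progress(num_steps=4, n_r=2, completed=3))  # progress capped at 2/4
```

This is what makes the strategy "precise": choosing \(n_r\) fixes the ground-truth progress label for the mismatched task description, giving clean supervision for negative pairs.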
Rating¶
- Novelty: ⭐⭐⭐⭐ — the recurrent reasoning + CoT memory framework is elegantly designed.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — two datasets, four metrics, and three downstream applications.
- Writing Quality: ⭐⭐⭐⭐ — clear structure with detailed method descriptions.
- Value: ⭐⭐⭐⭐⭐ — significant implications for progress estimation and reward modeling in embodied AI.