
Recurrent Reasoning with Vision-Language Models for Estimating Long-Horizon Embodied Task Progress

Conference: CVPR 2026 | arXiv: 2603.17312 | Code: HuggingFace | Area: Multimodal VLM | Keywords: task progress estimation, embodied intelligence, recurrent reasoning, Chain-of-Thought, reinforcement learning

TL;DR

This paper proposes R²VLM, a recurrent reasoning framework that processes local video segments sequentially, maintains a dynamically updated CoT record tracking task decomposition and completion status, and leverages a multi-dimensional RL reward scheme to achieve state-of-the-art performance in long-horizon embodied task progress estimation. The framework additionally supports downstream applications including policy learning, reward modeling, and proactive assistance.

Background & Motivation

Background: Embodied agents require accurate estimation of execution progress over multi-step, long-horizon tasks to support long-range planning and context-aware decision-making.

Limitations of Prior Work:

  • Methods such as GVL and ROVER exploit VLMs' video understanding capabilities and large context windows while neglecting their reasoning potential.
  • Processing long video trajectories (often thousands of frames) incurs prohibitive computational overhead, making real-time deployment impractical.
  • Tasks involving multiple temporally dependent subtasks demand reasoning capabilities to align visual observations with logical dependencies.

Key Challenge: Full-video processing is computationally prohibitive, whereas local segment processing lacks global context; video understanding alone is insufficient to handle complex temporal-logical dependencies.

Goal: Efficient, accurate, and interpretable estimation of long-horizon embodied task progress.

Key Insight: Emulating human cognition — "observe a segment, reason about it, and retain key information" — by recurrently processing video segments while maintaining structured memory.

Core Idea: Recurrent reasoning combined with a dynamic CoT as a cross-timestep memory carrier, avoiding full-video processing while preserving global context.

Method

Overall Architecture

The input video is segmented into short clips \(v_t\) (4s/2s). At each reasoning step, the model receives the current clip \(v_t\) and the historical CoT \(c_{t-1}\), producing an updated CoT \(c_t\) and a progress estimate \(p_t\): \(c_t, p_t = f_\theta(\tau, v_t, c_{t-1})\).
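A minimal sketch of this recurrent loop is shown below; the frame segmentation helper and the `vlm_step` call are placeholders standing in for the actual VLM interface, which is not specified here.

```python
from typing import List, Optional, Tuple

def segment_video(frames: List, clip_len: int) -> List[List]:
    """Split a frame sequence into consecutive short clips v_1, v_2, ... (placeholder)."""
    return [frames[i:i + clip_len] for i in range(0, len(frames), clip_len)]

def vlm_step(task: str, clip: List, history_cot: Optional[str]) -> Tuple[str, float]:
    """Placeholder for one VLM call: c_t, p_t = f_theta(tau, v_t, c_{t-1}).
    On the first call (history_cot is None) the model produces the initial
    decomposition c_0 from commonsense knowledge alone."""
    raise NotImplementedError("wrap the actual VLM inference here")

def estimate_progress(task: str, frames: List, clip_len: int = 8) -> List[float]:
    cot: Optional[str] = None
    trace: List[float] = []
    for clip in segment_video(frames, clip_len):
        cot, p_t = vlm_step(task, clip, cot)   # the updated CoT carries global context forward
        trace.append(p_t)
    return trace
```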

Key Designs

  1. Recurrent Reasoning Framework:

    • Initial iteration: the VLM leverages commonsense knowledge to generate an initial CoT \(c_0\) representing task decomposition.
    • Subsequent iterations: task decomposition is dynamically refined (merging/splitting/reordering steps) based on new video segments, with completion status updated accordingly.
    • Three core advantages: (1) CoT improves accuracy and interpretability; (2) historical CoT provides global context; (3) inheriting prior-round reasoning ensures logical consistency.
    • Design Motivation: eliminates redundant computation from processing full-length videos by propagating global information through the CoT.
  2. CoT Structure:

    • Three components: (i) task decomposition (listing subtasks); (ii) key step analysis (completed/pending); (iii) progress estimation based on the proportion of completed steps.
    • The decomposition can be dynamically adjusted each round, as the environment is partially observable and the actual execution may not fully align with the initial decomposition.
  3. Multi-Dimensional RL Reward System (PPO):

    • Format Reward (\(R_{fmt}\)): verifies output format (think/answer tags); compliant = 1.
    • Bin Reward (\(R_{bin}\)): checks whether the predicted progress falls within the correct step interval; correct = 1.0, adjacent = 0.25.
    • MAE Reward (\(R_{mae}\)): \(\max(1 - |p_t - p_t^{gt}|/\delta_1, 0)\), providing fine-grained supervision.
    • Improvement Reward (\(R_{imp}\)): encourages each round's prediction error to be lower than the previous round, reflecting the self-correction capability of recurrent reasoning.
    • Finish Reward (\(R_{fin}\)): rewards correct judgment of task completion.
    • Overall reward: \(R_{overall} = R_{fmt} \cdot (R_{bin} \cdot R_{mae} + \alpha R_{imp} + \beta R_{fin})\); a minimal sketch of this combination follows the list below.
    • PPO over GRPO: in the multi-turn setting, \(c_{t-1}\) differs across trajectories, violating GRPO's requirement of identical inputs for generating multiple candidates.
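A minimal sketch of how these reward terms might be combined, following the overall formula above. The tag-checking pattern, the default values of \(\delta_1\), \(\alpha\), \(\beta\), and the exact scaling of \(R_{imp}\) (here the raw error decrease clipped to the asymmetric range [-1, 0.8] noted in the supplementary analysis) are assumptions, not the paper's exact implementation.

```python
import re
from typing import Optional

def overall_reward(output: str, p_pred: float, p_gt: float,
                   prev_err: Optional[float], pred_bin: int, gt_bin: int,
                   pred_done: bool, gt_done: bool,
                   delta1: float = 0.2, alpha: float = 0.5, beta: float = 0.5) -> float:
    # R_fmt: 1 if the response carries the required think/answer tags (pattern assumed).
    r_fmt = 1.0 if re.search(r"<think>.*</think>\s*<answer>.*</answer>", output, re.S) else 0.0

    # R_bin: 1.0 for the correct step interval, 0.25 for an adjacent interval, else 0.
    r_bin = 1.0 if pred_bin == gt_bin else (0.25 if abs(pred_bin - gt_bin) == 1 else 0.0)

    # R_mae: fine-grained term, max(1 - |p_t - p_t^gt| / delta1, 0).
    err = abs(p_pred - p_gt)
    r_mae = max(1.0 - err / delta1, 0.0)

    # R_imp: rewards shrinking the error relative to the previous round; the exact
    # scaling is not given here, so the raw decrease is clipped to [-1, 0.8] (assumption).
    r_imp = 0.0 if prev_err is None else max(-1.0, min(0.8, prev_err - err))

    # R_fin: correct judgement of whether the task has finished.
    r_fin = 1.0 if pred_done == gt_done else 0.0

    # R_overall = R_fmt * (R_bin * R_mae + alpha * R_imp + beta * R_fin)
    return r_fmt * (r_bin * r_mae + alpha * r_imp + beta * r_fin)
```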

Loss & Training

Two-stage training: (1) SFT to learn the reasoning pattern; (2) multi-turn PPO reinforcement learning initialized from the cold-start SFT checkpoint.
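A schematic of how one multi-turn rollout might be collected during the PPO stage (hypothetical `policy` interface, not the paper's code). Because each prompt embeds that trajectory's own \(c_{t-1}\), different trajectories never share an identical input, which is the stated reason PPO is preferred over GRPO.

```python
from typing import Callable, List, Optional, Tuple

def collect_trajectory(policy, task: str, clips: List,
                       reward_fn: Callable[[str, float, int], float]) -> List[Tuple[str, str, float]]:
    """Collect one multi-turn trajectory for PPO (schematic sketch)."""
    cot: Optional[str] = None
    transitions: List[Tuple[str, str, float]] = []
    for t, clip in enumerate(clips):
        prompt = policy.build_prompt(task, clip, cot)   # hypothetical: embeds this trajectory's c_{t-1}
        response = policy.generate(prompt)              # hypothetical VLM sampling call
        cot, p_pred = policy.parse(response)            # extract updated CoT c_t and progress p_t
        transitions.append((prompt, response, reward_fn(response, p_pred, t)))
    return transitions
```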

Key Experimental Results

Main Results

| Model | Size | ALFRED \(p_{mae}\) | ALFRED \(bin\) | Ego4D \(p_{mae}\) | Ego4D \(bin\) |
| --- | --- | --- | --- | --- | --- |
| GPT-5 | - | 18.35 | 0.505 | 25.04 | 0.259 |
| Gemini-2.5-Pro | - | 16.27 | 0.481 | 28.22 | 0.217 |
| Qwen2.5-VL-72B | 72B | 24.88 | 0.342 | 26.88 | 0.254 |
| R²VLM (SFT+RL) | 7B | 6.34 | 0.758 | 11.88 | 0.526 |

Ablation Study

| Configuration | ALFRED \(p_{mae}\) | Note |
| --- | --- | --- |
| SFT only | 7.52 | Basic supervised fine-tuning |
| + RL (w/o \(R_{imp}\)) | 6.89 | Missing cross-round improvement signal |
| + RL (w/o \(R_{bin}\)) | 7.11 | Missing coarse-grained step constraint |
| Full R²VLM | 6.34 | All rewards combined; best performance |

Key Findings

  • R²VLM at 7B comprehensively outperforms GPT-5 and Gemini-2.5-Pro, reducing MAE by over 65%.
  • The Improvement Reward contributes significantly to multi-turn reasoning, demonstrating the value of self-correction in recurrent reasoning.
  • Strong generalization is observed across three downstream tasks: progress-augmented policy learning, reward modeling, and proactive assistance.
  • Recurrent reasoning avoids full-video processing, yielding substantially faster inference compared to global methods.

Highlights & Insights

  • Recurrent Reasoning + CoT as Memory: the CoT is extended from a one-shot reasoning tool to a structured cross-timestep memory carrier, maintaining global consistency while avoiding long-video computation. This paradigm is transferable to any VLM task requiring long-span temporal reasoning.
  • Improvement Reward Design: directly measures the model's self-correction capability by rewarding cross-round error reduction — a design unique to multi-turn reasoning settings.
  • Automated Data Generation Pipeline: expert trajectories from ALFRED/Ego4D are automatically converted into video segment + CoT training pairs, including a strategy for generating distractor task descriptions.
  • Multi-Downstream Validation: beyond progress estimation, the framework demonstrates value as an RL reward model and a proactive assistance system.

Limitations & Future Work

  • The quality of CoT step decomposition heavily depends on the VLM's commonsense reasoning, which may be inaccurate for complex novel tasks.
  • ALFRED is a simulation environment; performance on real-world data (Ego4D) still exhibits a notable gap.
  • Clip length is fixed (4s/2s), with no mechanism for dynamically adjusting segment granularity.
  • Experiments are limited to Qwen2.5-VL-7B; the effectiveness on larger or stronger base models remains unverified.

Comparison with Related Work

  • vs. GVL / ROVER: these methods rely on VLM in-context learning and large context windows without explicit reasoning, limiting performance on complex long-horizon tasks; R²VLM achieves substantial improvements through recurrent reasoning and RL.
  • vs. Hierarchical Reward Methods: conventional approaches require manually designed task hierarchy decompositions, whereas R²VLM automatically learns decomposition and reasoning strategies through training.

Supplementary Analysis

  • Progress is defined as the proportion of completed steps rather than elapsed time, better reflecting long-horizon task structure given the large variance in per-step duration.
  • The distractor task generation strategy is elegant: the first \(n_r\) steps are forced to match the original task while subsequent steps differ, enabling precise control over the ground-truth progress label (see the sketch after this list).
  • Human-verified benchmark retention rates of 93% (ALFRED) and 74% (Ego4D) indicate high quality of the automatically generated data.
  • The technical rationale for choosing PPO over GRPO is that GRPO requires generating multiple candidates from the same input, which is incompatible with the recurrent setting where \(c_{t-1}\) differs across trajectories.
  • The asymmetric range [-1, 0.8] of the Improvement Reward amplifies the penalty for error increases, encouraging conservative yet stable progress estimation.
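A minimal sketch of the distractor construction noted above, under the (assumed) reading that the ground-truth progress with respect to the distractor description equals the fraction of its steps actually completed in the video:

```python
from typing import List, Tuple

def build_distractor(original_steps: List[str], alternative_steps: List[str],
                     n_r: int) -> Tuple[List[str], float]:
    """Distractor task: the first n_r steps are copied from the original task, the rest
    diverge. A video showing exactly the shared prefix then has a controlled ground-truth
    progress of n_r / len(distractor) with respect to the distractor description."""
    distractor = original_steps[:n_r] + alternative_steps
    gt_progress = n_r / len(distractor)
    return distractor, gt_progress
```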

Rating

  • Novelty: ⭐⭐⭐⭐ — the recurrent reasoning + CoT memory framework is elegantly designed.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ — two datasets, four metrics, and three downstream applications.
  • Writing Quality: ⭐⭐⭐⭐ — clear structure with detailed method descriptions.
  • Value: ⭐⭐⭐⭐⭐ — significant implications for progress estimation and reward modeling in embodied AI.