
Self-Improving Loops for Visual Robotic Planning

Conference: ICLR 2026
arXiv: 2506.06658
Code: https://diffusion-supervision.github.io/silvr/
Area: Image Generation
Keywords: Visual Planning, Self-Improvement, Video Generation Models, Inverse Dynamics Model, Online Experience

TL;DR

This paper proposes SILVR, a framework that continually self-improves on unseen tasks by iteratively fine-tuning an in-domain video generation model on self-collected online trajectories. SILVR achieves up to a 285% performance improvement on MetaWorld and real-robot benchmarks.

Background & Motivation

Limitations of Prior Work

Background: Video generation models have demonstrated strong capabilities as text-conditioned visual planners for robotic task planning. However, generalization to unseen tasks remains a significant challenge. Existing methods rely primarily on offline data (pre-collected demonstrations or internet videos) and lack the ability to continuously improve from online self-collected experience.

Core Problem: Can one design a visual planning agent that self-improves online?

Method

Overall Architecture

The core loop of SILVR:

1. The video model generates visual plans
2. The inverse dynamics model converts the plans into actions
3. Environment interaction collects trajectories
4. Successful trajectories are filtered
5. The video model is fine-tuned on them, and the loop repeats

Key Design 1: Video Model as Visual Planner

Based on the UniPi framework:

  • A text-to-video model predicts future frame sequences as task plans
  • An inverse dynamics model (IDM) converts consecutive frame pairs into executable actions
  • Supports AVDC-based in-domain video models with text cross-attention
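
To make the decomposition concrete, here is a minimal sketch of planning with the video model and acting with the IDM. The `video_model.sample` and `idm` interfaces, the Gym-style `env`, and the `info["success"]` flag are illustrative assumptions, not the paper's API.

```python
def plan_and_execute(video_model, idm, env, first_frame, task_text, horizon=8):
    """Generate one visual plan, then convert it into actions with the IDM."""
    # The text-conditioned video model predicts a sequence of future frames
    # given the current observation and the task instruction (assumed API).
    frames = video_model.sample(first_frame, task_text, num_frames=horizon)

    actions, success = [], False
    for t in range(len(frames) - 1):
        # The IDM infers the low-level action that transitions frame t
        # into frame t + 1; the action is then executed in the environment.
        action = idm(frames[t], frames[t + 1])
        actions.append(action)
        obs, reward, done, info = env.step(action)
        success = success or info.get("success", False)  # assumed success flag
        if done:
            break
    return frames, actions, success
```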

Key Design 2: Inverse Probability Adaptation (IPA)

Integrates the in-domain video model with an internet-pretrained video prior (AnimateDiff, ~2B parameters) via score composition:

\[\tilde{\epsilon}_{\text{inv}} = \epsilon_{\text{general}}(\tau_t, t) + \alpha(\epsilon_{\text{general}}(\tau_t, t|\text{text}) + \gamma\epsilon_\theta(\tau_t, t|\text{text}) - \epsilon_{\text{general}}(\tau_t, t))\]
  • \(\gamma\): prior strength
  • \(\alpha\): text guidance scale
  • The in-domain model supplies environment-specific visual knowledge
  • The pretrained model provides text generalization and motion priors
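
A minimal sketch of this score composition, directly mirroring the equation above; `eps_general` and `eps_theta` stand in for the internet-pretrained prior and the in-domain model, and the default guidance values are placeholders rather than the paper's settings.

```python
def composed_eps(eps_general, eps_theta, tau_t, t, text, alpha=7.5, gamma=1.0):
    """Compose the pretrained video prior with the in-domain model,
    following the adaptation equation above."""
    # Unconditional prediction of the internet-pretrained prior.
    e_uncond = eps_general(tau_t, t, text=None)
    # Text-conditioned predictions of the prior and the in-domain model.
    e_general_text = eps_general(tau_t, t, text=text)
    e_theta_text = eps_theta(tau_t, t, text=text)
    # gamma and alpha correspond to the prior strength and text guidance
    # scale in the formula; the guidance term mixes both text-conditioned
    # scores against the unconditional prior.
    return e_uncond + alpha * (e_general_text + gamma * e_theta_text - e_uncond)
```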

Key Design 3: Self-Improving Loop

Each iteration (Algorithm 1):

1. Optionally adapt with the internet video prior
2. Execute \(N\) visual planning interactions
3. Filter trajectories using the filter function \(f_r\) (success signal)
4. Fine-tune the in-domain video model on the accumulated data
5. Optionally distill into a lightweight policy
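
As an illustration, a high-level sketch of the loop: `plan_and_execute` (with the model, IDM, and environment already bound), `filter_fn`, and `finetune` are assumed caller-supplied callables, and the optional prior-adaptation and distillation steps of Algorithm 1 are omitted for brevity.

```python
import random

def self_improving_loop(plan_and_execute, finetune, filter_fn, tasks,
                        num_iterations=5, rollouts_per_iter=50):
    """Plan, act, filter successful trajectories, then fine-tune the
    in-domain video planner on the accumulated online experience."""
    dataset = []  # (frames, task) pairs that passed the filter
    for _ in range(num_iterations):
        # Execute N visual-planning interactions in the environment.
        for _ in range(rollouts_per_iter):
            task = random.choice(tasks)
            frames, actions, success = plan_and_execute(task)
            # Keep only trajectories accepted by the filter f_r
            # (ground-truth success signal or a VLM judgment).
            if filter_fn(frames, task, success):
                dataset.append((frames, task))
        # Fine-tune the in-domain video model on all data collected so far.
        finetune(dataset)
    return dataset
```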

Experiments

MetaWorld Results (12 Unseen Tasks, Average Success Rate)

Method              Iter 0  Iter 1  Iter 2  Iter 3  Iter 4
DSRL (GT filter)       9.4     8.3     7.4     7.5     7.7
BCIL (GT filter)       5.6    12.3    20.9    23.3    23.2
SILVR (GT filter)     14.7    27.7    33.5    43.5    44.2
SILVR (VLM filter)    17.0    24.4    28.7    34.4    38.4

After iteration 4, SILVR achieves a success rate of 44.2%, substantially outperforming BCIL (23.2%) and DSRL (7.7%).

Real-Robot Experiments

  • Cup-pushing task: Continual improvement on unseen object colors
  • Drawer-opening task: SILVR with internet video prior successfully guides self-improvement
  • Without the internet video prior, improvement stagnates or degrades in real-world experiments

Distillation

Distilling the final SILVR video planner into a Diffusion Policy yields a further gain (44.2% → 49.2%) while avoiding the cost of video generation at inference time.

Ablation Study

  • No data filtering: slow improvement on MetaWorld; improvement is still observed in the real world
  • VLM filter instead of GT filter: Gemini-2.5-Pro performs best, and self-improvement remains achievable
  • Suboptimal initial data: SILVR still achieves continual improvement
  • 10 iterations: performance saturates after iteration 5

Key Findings

  • The decoupled design of visual planning (dynamics modeling vs. action prediction) facilitates generalization
  • Internet video priors are critical for real-world experiments
  • Without filtering, suboptimal experience can still convey useful information through score composition
  • SILVR is more sample-efficient than RL fine-tuning baselines

Highlights & Insights

  • First systematic self-improving framework for visual robotic planning
  • Seamlessly integrates offline data with online experience
  • Introduction of internet video priors elegantly addresses real-world generalization
  • Robust to the choice of filtering signal (GT / VLM / no filter all remain functional)
  • Distillation balances planning quality and inference speed

Limitations & Future Work

  • Assumes the initial model has a reasonable baseline success rate (cold-start problem)
  • The choice of large-scale pretrained video models introduces efficiency/quality trade-offs
  • Performance saturates in longer runs (after about iteration 5 in the 10-iteration experiment), potentially converging to a local policy optimum
  • Video generation inference speed remains a deployment bottleneck
  • Exploration mechanisms for escaping unimodal behavior are not investigated

Related Work

  • Video planning: UniPi, AVDC, and related video models for decision-making
  • Self-improving models: self-improvement in LLMs, VideoAgent, etc.
  • RL fine-tuning of BC policies: DPPO, DSRL, ResIP, etc.

Rating

  • Novelty: ⭐⭐⭐⭐ — The combination of visual planning and self-improving loops is novel
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Comprehensive evaluation across simulation and real robot, with multiple ablations and baselines
  • Value: ⭐⭐⭐⭐ — Real-robot validation and distillation scheme address practical deployment
  • Writing Quality: ⭐⭐⭐⭐ — Ablation studies on various design decisions are thorough