
EC-Flow: Enabling Versatile Robotic Manipulation from Action-Unlabeled Videos via Embodiment-Centric Flow

Conference: ICCV 2025 arXiv: 2507.06224 Code: ec-flow1.github.io Area: Image Generation Keywords: robotic manipulation, action-unlabeled video learning, optical flow prediction, URDF kinematics, diffusion models

TL;DR

EC-Flow introduces an "embodiment-centric flow" paradigm that predicts pixel-level motion trajectories of the robot body from action-unlabeled RGB videos, and converts visual predictions into executable actions via URDF kinematic constraints. It substantially outperforms object-centric methods in scenarios involving deformable objects, occlusion, and non-displacement manipulation.

Background & Motivation

Current language-guided robotic manipulation systems primarily rely on imitation learning (e.g., vision-language-action models), which requires large-scale datasets with low-level action annotations. Although vast quantities of action-unlabeled manipulation videos exist on the internet, learning manipulation policies from such data remains an open problem.

Existing video-based manipulation learning methods fall into two categories:

Video prediction + imitation learning: Future frames are predicted first, and actions are inferred via an inverse dynamics model. However, this still requires action-annotated data to train the inverse dynamics model.

Object-centric optical flow methods (e.g., AVDC, Track2Act): Object flow trajectories are predicted and actions are inferred from object transformations. While no action annotations are required, three fundamental limitations remain:

  • Rigid-body assumption: all parts of the object are assumed to undergo a single rigid transformation, making the approach inapplicable to deformable objects (e.g., folding clothes).
  • Occlusion fragility: actions are inferred entirely from object state changes, so the method fails when the object is occluded.
  • Failure on non-displacement tasks: rotational motions (e.g., turning a switch) or subtle movements (e.g., pressing a mouse button) cannot be captured.

Core insight: Shifting the prediction target from object flow to embodiment flow fundamentally avoids all three limitations above. Robot body motion: (a) does not depend on object properties → applicable to any object; (b) remains visible in most scenarios → robust to occlusion; (c) directly reflects the executed action → handles non-displacement tasks.

Method

Overall Architecture

EC-Flow consists of two core modules:

  1. Embodiment-centric flow prediction (Sec. 3.2): given an initial frame and a language instruction, the model predicts 2D trajectories over \(T\) future steps for randomly sampled points on the robot body.
  2. Kinematics-aware action computation (Sec. 3.3): using kinematic constraints from the robot's URDF file, the predicted 2D flow is converted into 6-DoF end-effector pose transformations.
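
A minimal sketch of this two-module pipeline is given below. The function boundaries, shapes, and placeholder bodies are illustrative assumptions, not the authors' implementation; the actual diffusion model and URDF solver are detailed under Key Designs.

```python
import numpy as np

T_STEPS, N_POINTS = 8, 400   # prediction horizon T and number of sampled body points N_p

def predict_embodiment_flow(initial_frame: np.ndarray, instruction: str) -> np.ndarray:
    """Module 1 (Sec. 3.2): predict 2D trajectories of N_p robot-body points over T steps.
    Placeholder body: the paper uses a conditional diffusion model; here the points stay static."""
    h, w = initial_frame.shape[:2]
    start = np.random.rand(N_POINTS, 2) * np.array([w, h])   # points sampled on the robot mask
    return np.repeat(start[None], T_STEPS, axis=0)           # (T, N_p, 2)

def flow_to_ee_actions(flow: np.ndarray, urdf_path: str) -> list:
    """Module 2 (Sec. 3.3): convert 2D flow into 6-DoF end-effector transforms using URDF
    kinematics. Placeholder body: returns identity poses instead of the reprojection fit."""
    return [np.eye(4) for _ in range(flow.shape[0] - 1)]

# Toy usage: one RGB frame plus a language instruction yields a sequence of EE pose targets.
frame = np.zeros((480, 640, 3), dtype=np.uint8)
flow = predict_embodiment_flow(frame, "press the button")
actions = flow_to_ee_actions(flow, "robot.urdf")
print(flow.shape, len(actions))   # (8, 400, 2) 7
```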

Key Designs

  1. Flow Prediction with Diffusion Models:

    • Data construction: Grounded SAM is used to segment the robot body mask → \(N_p = 400\) points are randomly sampled within the mask → CoTracker tracks their 2D trajectories throughout the video.
    • A conditional diffusion model is adopted for flow prediction, with conditioning signal \(\mathbf{c} = [\tilde{\mathbf{v}}, \tilde{\mathbf{l}}, \tilde{\mathbf{s}}]\) comprising: visual context (initial frame encoded by ResNet-50), language guidance (CLIP text encoder), and initial state (starting point coordinates).
    • Training objective: \(\mathcal{L}_{\text{flow}} = \mathbb{E}_{t, \mathbf{z}_0, \epsilon}[\|\epsilon - \epsilon_\theta(\mathbf{z}_t, t, \mathbf{c})\|_2^2]\)
    • At inference, DDIM sampling generates the full \(T = 8\)-step trajectories.
  2. Target Image Prediction — Aligning Object Interaction with Language Instructions:

    • Challenge: Pure embodiment flow prediction may generate actions that do not effectively interact with the target object.
    • Solution: An auxiliary target image prediction branch is introduced, sharing the diffusion timestep with the flow prediction branch.
    • The target image generator receives augmented conditioning \(\mathbf{c}_t^{\text{img}} = [\tilde{\mathbf{v}}, \tilde{\mathbf{l}}, \tilde{\mathbf{s}}, f_t^{\text{flow}}]\), where \(f_t^{\text{flow}}\) denotes the flow prediction output.
    • Total loss: \(\mathcal{L} = \mathcal{L}_{\text{flow}} + \lambda \mathcal{L}_{\text{image}}\) with \(\lambda = 0.4\) (a minimal training-step sketch is given after this list).
    • Design motivation: Joint optimization establishes an implicit constraint — generated actions must lead to physically plausible object state changes aligned with the language instruction.
  3. URDF-Aware Action Computation:

    • Step 1 (point assignment): The geometric properties of the URDF file are used to compute 2D bounding boxes for each joint; sampled points are assigned to their corresponding joints (retaining only points uniquely belonging to a single joint).
    • Step 2 (action optimization): the end-effector pose is optimized by minimizing the reprojection error below (a worked optimization sketch follows this list):

    \(\mathbf{T}_{ee}^* = \arg\min_{\mathbf{T}_{ee}} \sum_{j=1}^{M} \sum_{i=1}^{N_j} \left\| \pi\!\left(\mathbf{T}_{ee} \cdot {}_{j}^{ee}\mathbf{T} \cdot \mathbf{P}_{ji}^{3D}\right) - \mathbf{P}_{ji}^{2D}(t{+}1) \right\|_2\)

    where \(\pi(\cdot)\) is the camera projection, \({}_{j}^{ee}\mathbf{T}\) the fixed transform from joint \(j\)'s frame to the end-effector frame, and \(\mathbf{P}_{ji}^{2D}(t{+}1)\) the predicted 2D position at the next step of point \(i\) assigned to joint \(j\).

    • Design motivation: Different joints are subject to different kinematic constraints and cannot be handled by a single unified transformation; the URDF is a standard configuration file for robotic systems, requiring no additional annotation.
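
A minimal PyTorch-style sketch of the joint diffusion objective described in items 1–2 above. The network architectures, feature dimensions, and the way the flow branch's output is fed to the image branch are illustrative assumptions, not the authors' implementation; only the loss structure \(\mathcal{L} = \mathcal{L}_{\text{flow}} + \lambda \mathcal{L}_{\text{image}}\) with a shared timestep follows the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

T_DIFF, N_POINTS, HORIZON = 1000, 400, 8   # diffusion steps (assumed), N_p, T
LAMBDA_IMG = 0.4                           # weight of the target-image loss

# Placeholder denoising networks; the paper conditions on ResNet-50 visual features,
# CLIP text features, and the initial point coordinates (here fused into one 512-d vector).
FLOW_DIM, IMG_DIM, COND_DIM = HORIZON * N_POINTS * 2, 64 * 64 * 3, 512
flow_eps_net = nn.Sequential(nn.Linear(FLOW_DIM + COND_DIM + 1, 1024), nn.SiLU(),
                             nn.Linear(1024, FLOW_DIM))
img_eps_net = nn.Sequential(nn.Linear(IMG_DIM + COND_DIM + FLOW_DIM + 1, 1024), nn.SiLU(),
                            nn.Linear(1024, IMG_DIM))

betas = torch.linspace(1e-4, 0.02, T_DIFF)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

def q_sample(z0, t, eps):
    """DDPM forward process: z_t = sqrt(a_bar_t) * z_0 + sqrt(1 - a_bar_t) * eps."""
    a_bar = alphas_bar[t].view(-1, 1)
    return a_bar.sqrt() * z0 + (1.0 - a_bar).sqrt() * eps

def training_step(flow_gt, target_img_gt, cond):
    """One joint step of L = L_flow + lambda * L_image; both branches share the timestep t."""
    B = flow_gt.shape[0]
    t = torch.randint(0, T_DIFF, (B,))
    t_feat = t.float().unsqueeze(1) / T_DIFF

    # Flow branch: predict the noise added to the ground-truth embodiment flow.
    z0_flow = flow_gt.flatten(1)
    eps_f = torch.randn_like(z0_flow)
    eps_f_hat = flow_eps_net(torch.cat([q_sample(z0_flow, t, eps_f), cond, t_feat], dim=1))
    loss_flow = F.mse_loss(eps_f_hat, eps_f)

    # Target-image branch: additionally conditioned on the flow branch's output
    # (standing in for f_t^flow), so gradients couple the two objectives.
    z0_img = target_img_gt.flatten(1)
    eps_i = torch.randn_like(z0_img)
    eps_i_hat = img_eps_net(torch.cat([q_sample(z0_img, t, eps_i), cond, eps_f_hat, t_feat], dim=1))
    loss_img = F.mse_loss(eps_i_hat, eps_i)

    return loss_flow + LAMBDA_IMG * loss_img

# Toy usage with random tensors standing in for a real batch of 4 samples.
loss = training_step(torch.randn(4, HORIZON, N_POINTS, 2),
                     torch.randn(4, 3, 64, 64),
                     torch.randn(4, COND_DIM))
loss.backward()
```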
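
And a small sketch of the reprojection-based pose fit in item 3, using SciPy's least_squares over a 6-DoF pose parameterized as translation plus axis-angle rotation. The camera intrinsics, frame conventions, and toy data are assumptions for illustration; the paper's actual solver details may differ.

```python
import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation

K = np.array([[600.0, 0.0, 320.0],   # assumed pinhole intrinsics
              [0.0, 600.0, 240.0],
              [0.0, 0.0, 1.0]])

def project(points_3d):
    """Pinhole projection pi(.): 3D camera-frame points -> 2D pixel coordinates."""
    uvw = (K @ points_3d.T).T
    return uvw[:, :2] / uvw[:, 2:3]

def solve_ee_pose(points_3d_per_joint, flow_2d_per_joint, T_ee_from_joint):
    """Fit the end-effector transform T_ee that minimizes the reprojection error between
    transformed per-joint 3D points and their predicted next-step 2D flow positions."""
    def residuals(x):
        # x = [tx, ty, tz, rx, ry, rz]: translation + axis-angle rotation of T_ee.
        T_ee = np.eye(4)
        T_ee[:3, :3] = Rotation.from_rotvec(x[3:]).as_matrix()
        T_ee[:3, 3] = x[:3]
        errs = []
        for P3d, p2d, T_j in zip(points_3d_per_joint, flow_2d_per_joint, T_ee_from_joint):
            P_h = np.hstack([P3d, np.ones((len(P3d), 1))])   # homogeneous coordinates
            moved = (T_ee @ T_j @ P_h.T).T[:, :3]            # apply the rigid transform chain
            errs.append(project(moved) - p2d)                # per-point reprojection error
        return np.concatenate(errs).ravel()

    return least_squares(residuals, x0=np.zeros(6)).x

# Toy usage: one joint with 5 points in front of the camera, identity joint-to-EE transform,
# and a predicted flow that shifts every point by 2 pixels.
pts = np.random.rand(5, 3) * 0.2 + np.array([0.0, 0.0, 1.0])
flow_next = project(pts) + 2.0
print(solve_ee_pose([pts], [flow_next], [np.eye(4)]))
```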

Loss & Training

  • Flow prediction: diffusion noise prediction MSE loss.
  • Target image: diffusion noise prediction MSE loss (weight 0.4).
  • Training configuration: 8 × RTX 4090 GPUs, batch size 56, flow lr = 5e-5, image lr = 1e-4, DDIM 250-step sampling.
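
For quick reference, the reported training setup collected as a plain Python config dict; the key names are illustrative, not taken from the released code.

```python
# Reported training configuration (values from the paper; key names are placeholders).
TRAIN_CONFIG = {
    "gpus": "8x RTX 4090",
    "batch_size": 56,
    "lr_flow": 5e-5,        # learning rate of the flow-prediction branch
    "lr_image": 1e-4,       # learning rate of the target-image branch
    "lambda_image": 0.4,    # weight of L_image in L = L_flow + lambda * L_image
    "ddim_steps": 250,      # DDIM sampling steps at inference
    "horizon_T": 8,         # number of predicted future steps
    "num_points_Np": 400,   # points sampled on the robot body
}
```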

Key Experimental Results

Main Results — Meta-World Benchmark

| Method | Data Requirement | Avg. Success Rate ↑ |
|---|---|---|
| BC-Scratch | Action annotations | 0.204 |
| BC-R3M | Action annotations | 0.360 |
| Diffusion Policy | Action annotations | 0.298 |
| UniPi | Video + action annotations | 0.093 |
| AVDC (object-centric) | Video only | 0.489 |
| Track2Act (object-centric) | Video only | 0.556 |
| EC-Flow (embodiment-centric) | Video only | 0.720 |

EC-Flow surpasses the strongest baseline (Track2Act) by 16.4 percentage points. Particularly notable gains appear on occlusion and non-displacement tasks such as btn-top-press (1.00 vs. 0.40) and hammer-strike (0.88 vs. 0.24).

Ablation Study — Contribution of Each Component

| # | Configuration | Avg. Success Rate |
|---|---|---|
| 1 | End-to-end EC-Flow (full model) | 0.720 |
| 2 | Video prediction + GT flow tracking | 0.636 |
| 3 | End-to-end EC-Flow w/o target image prediction | 0.582 |
| 4 | End-to-end EC-Flow with end-effector points only (no full-body modeling) | 0.604 |
| 5 | Video prediction + GT flow tracking | 0.667 |

Target image prediction (#3 vs. #1: +13.8 points) and full-body point modeling (#4 vs. #1: +11.6 points) contribute the most.

Key Findings

  1. Occlusion robustness: Object-centric methods fail completely when the object is occluded, whereas EC-Flow infers actions from the motion of other visible robot joints.
  2. Hallucination in video prediction: Video prediction models may generate hallucinated multiple robot arms, causing optical flow tracking errors; end-to-end flow prediction avoids this issue.
  3. Real-world validation (7 tasks): EC-Flow achieves a 45% improvement on deformable object manipulation and an 80% improvement on non-displacement tasks relative to Track2Act.
  4. Cross-embodiment data availability: Preliminary experiments show that incorporating 50 human videos improves success rate from 46% to 70% in 2-demo scenarios.

Highlights & Insights

  • The paradigm shift from "object-centric" to "embodiment-centric" is the paper's most fundamental contribution — remarkably concise yet highly effective.
  • Using target image prediction as an auxiliary task to align language, objects, and actions is an elegant design choice that avoids the difficulty of directly predicting object optical flow.
  • The entire system requires only RGB video and a URDF file, with no action annotations, 3D point clouds, or specialized sensors, resulting in extremely low deployment cost.

Limitations & Future Work

  • The current system requires manual specification of the manipulation starting pose; automatic grasp pose generation has not been integrated.
  • Gripper state (open/close) is borrowed from object-centric methods and still depends on object segmentation.
  • DDIM 250-step sampling leads to slow inference (4.37s for flow prediction); flow matching could be used to accelerate this.
  • Depth perception relies on a D435i camera, limiting accuracy on tasks involving complex deformations (e.g., folding clothes).
  • Distinction from General Flow: General Flow requires RGBD input to predict 3D flow fields, whereas EC-Flow uses RGB only.
  • Distinction from MT-π: MT-π relies on 5 predefined gripper keypoints that must always be visible; EC-Flow is robust to partial occlusion.
  • Insight: The embodiment-centric approach may generalize to more complex embodied scenarios such as dual-arm collaboration and dexterous hand manipulation.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ (The paradigm shift to embodiment-centric optical flow is highly elegant)
  • Experimental Thoroughness: ⭐⭐⭐⭐ (Simulation + real-world, thorough ablations, though real-world experiments are limited in scale)
  • Writing Quality: ⭐⭐⭐⭐⭐ (Clear motivation, detailed method description)
  • Value: ⭐⭐⭐⭐⭐ (Significantly lowers the barrier for learning manipulation from video)