OmniEVA: Embodied Versatile Planner via Task-Adaptive 3D-Grounded and Embodiment-aware Reasoning¶

Conference: ICLR 2026
arXiv: 2509.09332
Code: Project Page
Area: Embodied Intelligence / 3D Reasoning
Keywords: MLLM, Task-Adaptive 3D Grounding, Gated Routing, Embodiment-aware Reasoning, GRPO

TL;DR¶

This paper proposes OmniEVA, which addresses two gaps in spatial MLLMs: poor geometric adaptability (2D-only input or hard-coded 3D injection) and missing embodiment constraints (plans that are valid in theory but physically unexecutable). It does so with a task-adaptive gated router that injects 3D positional encodings only when geometric reasoning is required, and an embodiment-aware reasoning framework that integrates physical constraints into the planning loop. OmniEVA achieves state-of-the-art performance on 7 out of 8 benchmarks.

Background & Motivation¶

State of the Field¶

Multimodal large language models (MLLMs) are increasingly applied to embodied intelligence, which requires spatial understanding, reasoning, and action. Two dominant paradigms exist: (1) direct 2D RGB input, which lacks 3D geometric information; and (2) 3D-LLMs with hard-coded 3D injection, which lack flexibility.

Limitations of Prior Work¶

(1) Geometric Adaptability Gap: 2D-only models fail on tasks that need 3D reasoning (e.g., stacking, occlusion handling, navigation), while 3D-LLMs with hard-coded 3D injection lose accuracy when the 3D input is irrelevant to the task or itself noisy.

(2) Embodiment Constraint Gap: Generated plans can be theoretically valid but physically unexecutable (e.g., infeasible grasp poses, workspace violations, kinematic infeasibility).

Root Cause¶

Existing models inject 3D information either never or always, regardless of the task, and they are trained on web images and videos that carry no information about a robot's physical constraints.

Key Insight: (1) A gated router decides per task and scene whether 3D information is needed, enabling on-demand injection; (2) post-training with TE-GRPO (Task- and Embodiment-aware GRPO) teaches the model to respect physical constraints.

Method¶

Task-Adaptive Gated Router (TAGR)¶

3D Positional Encoding: Depth maps are unprojected to world coordinates, patch-level averaged, and encoded via sinusoidal functions to obtain \(V^p \in \mathbb{R}^{N \times H_p \times W_p \times d_v}\).
Gating Decision:
Task condition: a sentence Transformer encodes the instruction to produce \(V^T\)
Scene condition: mean-pooled visual encoder output \(V_{avg}^I\)
The concatenated features are passed through an MLP that outputs two gate logits (one per branch), and Gumbel-Softmax turns them into a binary decision
Dynamic Injection:
Gate = 1: \(V^{final} = V^I + V^p\) (3D positional encoding added)
Gate = 0: \(V^{final} = V^I\) (2D only)
The routing is automatically determined per task and scene, avoiding noise from unnecessary 3D injection
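
The routing steps above can be illustrated with a minimal, standard-library-only sketch. It is not the authors' implementation: every function name, the camera intrinsics, the frequency schedule, and the toy dimensions are assumptions made for illustration, and the camera-to-world pose transform mentioned in the text is omitted (points stay in the camera frame). A real model would use a tensor library and a straight-through estimator so the discrete gate remains trainable.

```python
import math
import random

def unproject(u, v, depth, fx, fy, cx, cy):
    """Back-project pixel (u, v) with metric depth to a camera-frame 3D point (pinhole model)."""
    return ((u - cx) * depth / fx, (v - cy) * depth / fy, depth)

def patch_mean(points):
    """Average the 3D points that fall inside one visual patch."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(3))

def sinusoidal_encoding(point, dims_per_axis=4):
    """Encode each coordinate with sin/cos pairs at geometrically spaced frequencies."""
    enc = []
    for coord in point:
        for k in range(dims_per_axis // 2):
            freq = 1.0 / (10000 ** (2 * k / dims_per_axis))
            enc.extend([math.sin(coord * freq), math.cos(coord * freq)])
    return enc

def sample_gate(logits, tau=1.0, rng=random):
    """Gumbel-Softmax sample over [skip_3d, use_3d] logits (forward pass only).

    Returns the hard decision (0 or 1) and the soft probabilities.
    """
    perturbed = []
    for logit in logits:
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)  # keep log() finite
        gumbel = -math.log(-math.log(u))
        perturbed.append((logit + gumbel) / tau)
    top = max(perturbed)
    exps = [math.exp(x - top) for x in perturbed]
    total = sum(exps)
    probs = [e / total for e in exps]
    return probs.index(max(probs)), probs

def inject(visual_token, pos_encoding, gate):
    """Gate = 1 adds the 3D positional encoding; gate = 0 keeps the 2D feature unchanged."""
    if gate == 1:
        return [v + p for v, p in zip(visual_token, pos_encoding)]
    return list(visual_token)

if __name__ == "__main__":
    # Two pixels of one patch, with hypothetical intrinsics.
    pts = [unproject(300, 240, 1.5, 500.0, 500.0, 320.0, 240.0),
           unproject(340, 240, 2.5, 500.0, 500.0, 320.0, 240.0)]
    pos = sinusoidal_encoding(patch_mean(pts))      # 12-dimensional toy encoding
    gate, probs = sample_gate([0.2, 1.3])           # logits would come from the MLP
    print(gate, inject([0.0] * len(pos), pos, gate))
```

Because the gate is sampled, repeated calls with close logits can return different decisions. At inference time one would typically take the argmax of the logits instead, which is again an assumption rather than something stated in the source.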

Embodiment-aware Reasoning¶

Primitive Skill Decomposition:
Where2Go: navigation target selection
Where2Grasp: grasp pose estimation
Where2Approach: approach pose determination
Where2Fit: placement feasibility assessment
TE-GRPO (Task- and Embodiment-aware GRPO):
Applied as a post-training stage using Group Relative Policy Optimization (GRPO)
Reward function accounts for: task objectives, object affordances, workspace boundaries, and kinematic feasibility
Ensures that generated plans are physically executable
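
To make the reward design concrete, the sketch below combines the four listed reward factors into one scalar and then applies the group-relative normalization that GRPO uses in place of a learned value baseline. The function names, the boolean checks, and the weights are hypothetical; the source does not specify how the terms are computed or weighted.

```python
import math

def embodiment_reward(task_success, affordance_ok, in_workspace, kinematically_feasible,
                      weights=(1.0, 0.5, 0.5, 0.5)):
    """Weighted sum of task and embodiment checks (weights are illustrative assumptions)."""
    terms = (task_success, affordance_ok, in_workspace, kinematically_feasible)
    return sum(w * float(t) for w, t in zip(weights, terms))

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: normalize each reward by the mean and std of its sampling group."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]

if __name__ == "__main__":
    # Four plans sampled for the same prompt; each tuple is a set of hypothetical check results.
    checks = [
        (True, True, True, True),     # fully executable plan
        (True, True, False, True),    # target lies outside the workspace
        (True, False, True, False),   # poor grasp region and unreachable pose
        (False, True, True, True),    # executable but does not solve the task
    ]
    rewards = [embodiment_reward(*c) for c in checks]
    print(rewards)
    print(group_relative_advantages(rewards))
```

Plans that violate a constraint receive a below-average reward within their group and therefore a negative advantage, which is the mechanism by which this kind of post-training can push the policy toward executable outputs.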

Two-Stage Training¶

Stage 1: Supervised Fine-Tuning (SFT) on 2D + 3D VQA and embodied reasoning data
Stage 2: TE-GRPO post-training via reinforcement learning to optimize executability

Key Experimental Results¶

8 Benchmarks (2D + 3D + Video)¶

Main Results¶

Benchmark Type	Model	Performance
2D Spatial Reasoning	OmniEVA	SOTA
3D Spatial Reasoning	OmniEVA	SOTA (7/8)
Object Navigation (HM3D)	OmniEVA	Leaderboard #1
Object Navigation (MP3D)	OmniEVA	Leaderboard #1

4 Primitive Skill Benchmarks¶

Ablation Study¶

Skill	Gain of OmniEVA over prior SOTA	Description
Where2Go	+5%	Navigation target selection
Where2Grasp	+8%	Grasp pose estimation
Where2Approach	+6%	Approach pose determination
Where2Fit	+7%	Placement feasibility assessment

Key Findings¶

The gated router deactivates 3D injection for ~40% of tasks, which suggests that a sizable share of tasks can be solved without geometric input and is consistent with the adaptive strategy.
Hard-coded 3D injection in baseline models degrades performance on 2D tasks that do not require 3D input, demonstrating the value of TAGR.
TE-GRPO post-training raises the proportion of physically executable plans from ~65% (SFT alone) to ~90%.

Highlights & Insights¶

"3D on Demand" Design Philosophy: Rather than applying 3D to all tasks, the model learns when 3D reasoning is necessary — a more flexible and accurate approach than hand-crafted rules.
Contribution of Primitive Skill Benchmarks: The four new benchmarks (Where2Go / Where2Grasp / Where2Approach / Where2Fit) provide the first systematic evaluation of embodied plan executability.
TE-GRPO Bridges LLM Training and Robotics: Combining GRPO — a mainstream LLM post-training method — with physics-constraint rewards represents a natural and effective integration of LLMs into robotics.

Rating¶

Novelty: ⭐⭐⭐⭐⭐ Dual innovation in task-adaptive 3D grounding and embodiment-aware reasoning
Experimental Thoroughness: ⭐⭐⭐⭐⭐ 8 + 4 benchmarks, ablations, and leaderboard results
Writing Quality: ⭐⭐⭐⭐ Architecture descriptions are clear and well-organized
Value: ⭐⭐⭐⭐⭐ Significant contribution to embodied MLLMs