MindPower: Enabling Theory-of-Mind Reasoning in VLM-based Embodied Agents¶
Conference: CVPR 2026
arXiv: 2511.23055
Code: zhangdaxia22.github.io/MindPower/ (Benchmark)
Area: Multimodal VLM
Keywords: Theory of Mind, BDI Reasoning, Embodied Agent, Mind-Reward, GRPO
TL;DR¶
MindPower proposes a robot-centric Theory-of-Mind (ToM) reasoning framework that organizes perception → belief → desire → intention → decision → action into a six-layer reasoning hierarchy and optimizes reasoning consistency with a Mind-Reward used to drive GRPO training, surpassing GPT-4o by 12.77% on decision-making and 12.49% on action generation.
Background & Motivation¶
Background: The embodied agent field is advancing rapidly — PaLM-E, RoboBench, and Smart-Help have demonstrated task decomposition and execution. VLMs (GPT-4o, Gemini, Qwen-VL) perform well at the perception level but remain weak at inferring human intent and providing proactive assistance. Existing ToM benchmarks (MuMA-ToM, MMToM-QA) evaluate only the inference of mental states of characters observed in videos.
Limitations of Prior Work: (1) Existing VLM-based agents can only execute explicit instructions and lack the ability to infer human beliefs, desires, and intentions; (2) existing ToM benchmarks adopt a "role-centric" perspective — inferring the mental states of video characters without involving the agent's own viewpoint or requiring the generation of decisions and actions; (3) VLMs at the perception level are susceptible to scene bias (e.g., predicting "cleaning" upon observing a kitchen rather than reasoning about actual intent).
Key Challenge: An agent must understand "what others are thinking" in order to offer proactive assistance, yet must simultaneously reason from its own perspective — e.g., "I know the apple is actually in the refrigerator, even though Alice believes it is on the table." Neither existing benchmarks nor methods have established this dual-perspective reasoning loop.
Goal: Enable embodied agents to infer human mental states (beliefs, desires, intentions) from their own perspective and thereby produce proactive decisions and actions.
Key Insight: Systematically introduce the cognitive-science BDI (Belief-Desire-Intention) framework into embodied agents, constructing a three-level six-layer continuous reasoning hierarchy, and optimize reasoning consistency through a structured reward function (Mind-Reward) via reinforcement learning.
Core Idea: Connect perception to action via a robot-centric BDI reasoning hierarchy of three levels and six layers, and optimize the consistency of the reasoning chain through GRPO using an atomic-action-matching Mind-Reward.
Method¶
Overall Architecture¶
MindPower consists of three components: (1) MindPower Benchmark — 590 household scenarios (VirtualHome + ThreeDWorld) with two tasks (false-belief correction and implicit goal inference); (2) MindPower Reasoning Hierarchy — a three-level six-layer reasoning structure; (3) Mind-Reward + GRPO — two-stage training (SFT cold-start + GRPO reinforcement). The base model is Qwen2.5-VL-7B.
Key Designs¶
- MindPower Reasoning Hierarchy (Three-Level Six-Layer Structure):
  - Function: Formalizes the embodied decision-making process as a continuous reasoning chain from perception to action.
  - Mechanism:
    - Level-1 Perception: <Perception> observes the environment and human behavior to answer "what is happening now."
    - Level-2 Mind Reasoning: <Belief> (infers both the agent's own and the human's beliefs, including second-order beliefs — "I think Alice believes the apple is on the table") → <Desire> (identifies the assistance goal — "what help does Alice need") → <Intention> (forms a specific action intention).
    - Level-3 Decision and Action: <Decision> (selects a plan) → <Action> (outputs an atomic operation sequence such as walk(fridge), open(fridge), pick(apple)).
  - Design Motivation: Existing VLMs rely on one-step decision-making without intermediate reasoning. The BDI hierarchy ensures that every decision is grounded in traceable belief-desire-intention support, improving interpretability and consistency.
- Robot-Centric Perspective (vs. Role-Centric):
  - Function: Requires the agent to simultaneously infer its own beliefs and the human's beliefs, forming a closed dual-perspective reasoning loop.
  - Mechanism: In the false-belief correction task — the agent observes that an object has been moved (Stage 2); when the human returns to search for it (Stage 3), the agent must reason that "Alice believes the apple is on the table (false belief)" + "I know the apple is actually in the refrigerator (agent's own belief)" → "I should retrieve the apple from the refrigerator for Alice."
  - Design Motivation: Existing benchmarks such as MuMA-ToM and MMToM-QA are limited to multiple-choice questions about character mental states and do not involve the agent's own perspective. Genuine collaboration requires the agent to simultaneously maintain mental models of itself and others.
- Mind-Reward (Atomic Action Matching Reward):
  - Function: Designs a structured reward function to drive GRPO optimization and ensure consistency across the reasoning chain from perception to action.
  - Mechanism: Atomic action sequences are extracted from each reasoning layer's output via an LLM (Qwen3-Max), and three alignment metrics are computed: atomic accuracy (ROUGE-1), local consistency (ROUGE-2), and global consistency (ROUGE-L). \(R_{Mind} = \alpha_1 R_{atomic} + \alpha_2 R_{local} + \alpha_3 R_{global}\), supplemented by a Format-Reward to ensure structural completeness of the hierarchy (a sketch of this computation follows this list).
  - Design Motivation: The reasoning layers are sequential — temporal and logical dependencies exist from perception to action. Process-level rewards better ensure the quality of intermediate reasoning steps than evaluating only the final output.
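Below is a minimal sketch of how such a structured reward could be computed, assuming each atomic operation is treated as a single token, equal weights \(\alpha_1 = \alpha_2 = \alpha_3\), and XML-style closing tags for the six layers; the helper names (`mind_reward`, `format_reward`, `ngrams`) and these details are illustrative assumptions, not the paper's exact implementation:

```python
import re
from collections import Counter

TAGS = ["Perception", "Belief", "Desire", "Intention", "Decision", "Action"]

def ngrams(tokens, n):
    """Return the multiset of n-grams in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(cand, ref, n):
    """ROUGE-N F1 between candidate and reference token lists."""
    c, r = ngrams(cand, n), ngrams(ref, n)
    overlap = sum((c & r).values())  # clipped n-gram matches
    if overlap == 0:
        return 0.0
    p, rec = overlap / sum(c.values()), overlap / sum(r.values())
    return 2 * p * rec / (p + rec)

def rouge_l(cand, ref):
    """ROUGE-L F1 based on longest-common-subsequence length."""
    m, n = len(cand), len(ref)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if cand[i] == ref[j]
                                else max(dp[i][j + 1], dp[i + 1][j]))
    lcs = dp[m][n]
    if lcs == 0:
        return 0.0
    p, rec = lcs / m, lcs / n
    return 2 * p * rec / (p + rec)

def format_reward(output: str) -> float:
    """1.0 if all six tags appear in hierarchy order, else 0.0 (assumed format)."""
    pattern = ".*?".join(f"<{t}>.*?</{t}>" for t in TAGS)
    return 1.0 if re.search(pattern, output, re.DOTALL) else 0.0

def mind_reward(pred_actions, gold_actions, alphas=(1 / 3, 1 / 3, 1 / 3)):
    """Weighted sum of atomic (ROUGE-1), local (ROUGE-2), global (ROUGE-L) scores."""
    a1, a2, a3 = alphas
    return (a1 * rouge_n(pred_actions, gold_actions, 1)
            + a2 * rouge_n(pred_actions, gold_actions, 2)
            + a3 * rouge_l(pred_actions, gold_actions))

# Example: action sequences as lists of atomic operations
pred = ["walk(fridge)", "open(fridge)", "pick(apple)"]
gold = ["walk(fridge)", "open(fridge)", "pick(apple)", "give(apple)"]
print(mind_reward(pred, gold))  # partial credit for the matching prefix
```

Under this tokenization, ROUGE-1 rewards individual action matches while ROUGE-2 and ROUGE-L reward correct ordering, which would mirror the atomic/local/global decomposition described above.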
Loss & Training¶
- Two-stage training: (1) SFT cold-start (5 epochs) to establish basic reasoning capability; (2) GRPO reinforcement (400 iterations, 8 generated samples per step) using Mind-Reward + Format-Reward.
- GRPO updates the policy via intra-group relative advantage: \(A_i = (R_i - \text{mean}(\{R_j\})) / \text{std}(\{R_j\})\) (a minimal sketch follows this list).
- Training is conducted on a single H800 GPU with Qwen2.5-VL-7B as the base model.
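A minimal sketch of the group-relative advantage from the formula above, assuming the per-sample rewards (Mind-Reward + Format-Reward) have already been summed for the 8 generations of one prompt; the clipped policy-gradient and KL-regularization terms of the full GRPO objective are omitted, and the epsilon guard is an added assumption for numerical safety:

```python
import numpy as np

def grpo_advantages(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Normalize each sample's reward by its group's mean and std:
    A_i = (R_i - mean({R_j})) / std({R_j})."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: total rewards for 8 generations sampled from one prompt
group_rewards = np.array([0.62, 0.41, 0.77, 0.30, 0.55, 0.68, 0.22, 0.49])
print(grpo_advantages(group_rewards))  # above-average samples get positive advantage
```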
Key Experimental Results¶
Main Results¶
| Method | Decision (S) | Action SR | Action AC | BPC |
|---|---|---|---|---|
| GPT-4o (image) | 34.35 | 1.82 | 2.91 | 8.05 |
| Gemini-2.5 Pro | 33.87 | 2.08 | 2.54 | 8.56 |
| Video-R1 (best open-source) | 30.33 | 1.43 | 1.72 | 6.45 |
| Qwen2.5-VL-7B (base) | 26.56 | 0.29 | 0.22 | 6.07 |
| Ours (SFT+Mind-Reward) | 47.12 | 11.75 | 15.40 | 8.87 |
| Human Baseline | 56.66 | 19.37 | 26.26 | 8.19 |
Ablation Study¶
| Training Configuration | Action AC | Decision (S) | BPC |
|---|---|---|---|
| Qwen2.5-VL-7B (no training) | 0.22 | 26.56 | 6.07 |
| Mind-Reward only (no SFT) | 0.40 | - | - |
| SFT only (no RL) | 10.48 | 42.35 | 8.32 |
| SFT + Mind-Reward | 15.40 | 47.12 | 8.87 |
| Reasoning Strategy (GPT-4o) | Decision | Action AC |
|---|---|---|
| Direct output (no reasoning) | 33.11 | 0.82 |
| Standard CoT (<think>) | 29.46 | 0.90 |
| MindPower Hierarchy | 34.35 | 2.91 |
Key Findings¶
- SFT alone yields substantial improvements (Action AC: 0.22→10.48), demonstrating the intrinsic effectiveness of the BDI reasoning hierarchy structure.
- RL further improves performance by approximately 5 points over SFT (10.48→15.40), but RL without SFT is nearly ineffective (0.40).
- The MindPower Hierarchy significantly outperforms standard CoT (Decision +4.89%) — structured BDI reasoning is more effective than generic "thinking."
- Open-source VLMs exhibit severe deficiency in robot-centric perspective and are easily misled by scene bias (e.g., kitchen → cleaning, bedroom → tidying).
- A significant gap remains relative to the human baseline (Decision: 47.12 vs. 56.66; Action: 15.40 vs. 26.26).
Highlights & Insights¶
- The systematic introduction of the cognitive-science BDI framework into embodied agents produces an interpretable reasoning chain in which every decision is grounded in traceable belief support.
- The robot-centric perspective is the core innovation — the agent not only infers others' mental states but also explicitly models its own beliefs, enabling second-order reasoning.
- Mind-Reward decomposes reasoning quality into atomic-, local-, and global-level consistency evaluations, offering more controllability than black-box LLM scoring.
- The two task designs are insightful: false-belief correction (the agent detects that an object has been moved) and implicit goal inference (inferring needs from search behavior).
Limitations & Future Work¶
- The dataset contains only 590 scenarios, all sourced from simulators (VirtualHome + ThreeDWorld), limiting scene diversity.
- The action space is coarse (high-level atomic operations such as walk(fridge)), with no coverage of low-level motion control.
- Mind-Reward depends on Qwen3-Max for atomic action extraction, introducing an additional LLM dependency.
- Whether automated open-ended evaluation metrics (BERTScore, ROUGE) truly reflect reasoning quality remains questionable.
- Evaluation is limited to the 7B model; performance at larger scales has not been verified.
Related Work & Insights¶
- MuMA-ToM / MMToM-QA: Limited to multiple-choice inference of character mental states; MindPower requires complete BDI reasoning from the agent's own perspective together with action generation.
- Smart-Help / AToM-Bot: Address human-robot collaborative assistance but lack explicit mental reasoning; MindPower explicitly models the detection and correction of belief inconsistencies.
- Video-R1 / VideoChat-R1: Apply RL training for video understanding but do not involve ToM reasoning or embodied decision-making.
- Insights: The BDI reasoning hierarchy can be generalized as a "structured CoT" for other tasks requiring inference of others' intentions; the process decomposition and atomic matching approach of Mind-Reward provides a valuable reference for designing other process-level rewards.
Rating¶
- ⭐⭐⭐⭐⭐ Novelty: Robot-centric ToM combined with the BDI reasoning hierarchy represents an entirely new perspective and a cross-disciplinary innovation at the intersection of cognitive science and AI.
- ⭐⭐⭐⭐ Experimental Thoroughness: Comparisons against multiple closed-source and open-source VLMs, a human baseline, and detailed ablations are provided, though the dataset scale is limited.
- ⭐⭐⭐⭐ Writing Quality: Concepts are clearly articulated with well-organized structure; the three-level six-layer formalization is easy to follow.
- ⭐⭐⭐⭐ Value: Endowing embodied agents with ToM capability is an important research direction; practical deployment remains distant but the direction is well-defined.