TP-MDDN: Task-Preferenced Multi-Demand-Driven Navigation with Autonomous Decision-Making
Conference: NeurIPS 2025 | arXiv: 2511.17225 | Code: Unavailable | Area: 3D Vision | Keywords: Embodied Navigation, Long-Horizon Task Planning, Multi-Demand-Driven, Semantic Map, Large Language Models
TL;DR
This paper proposes the Task-Preferenced Multi-Demand-Driven Navigation (TP-MDDN) benchmark and the AWMSystem autonomous decision-making framework, which achieves long-horizon multi-subtask navigation through three LLM modules—instruction decomposition, dynamic goal selection, and task status monitoring—coupled with a multi-dimensional cumulative semantic map.
Background & Motivation
In daily life, people frequently need to complete multiple sequential demands (e.g., cleaning → resting → eating), each accompanied by personal preferences. Conventional Demand-Driven Navigation (DDN) handles only a single demand at a time and fails to capture the complexity of real-world multi-demand scenarios.
Limitations of prior work:
Single-demand constraint: Methods such as DDN and MO-DDN process one demand instruction at a time and cannot efficiently manage the sequential execution and state tracking of multiple subtasks.
Absence of task preferences: Existing DDN methods do not explicitly model user preferences. For instance, "organizing the living space" may imply cleaning tools, decorations, or storage boxes, requiring preference constraints for precise execution.
Long-horizon navigation challenges: Frequent invocation of large language models incurs high inference costs, and errors such as collisions and boundary violations during navigation lack real-time correction mechanisms.
Insufficient environmental memory: Existing methods encode limited environmental memory, making it difficult to maintain spatial-semantic consistency across long-horizon tasks.
The core idea of this paper is to extend single-instance DDN into a long-horizon navigation benchmark that encompasses multiple sub-demands and explicit task preferences, and to design a modular autonomous decision-making system to address this setting effectively.
Method
Overall Architecture
The system comprises four core components: (1) the AWMSystem autonomous decision-making framework (containing three foundation model modules: BreakLLM, LocateLLM, and StatusMLLM); (2) the MASMap multi-dimensional accumulative semantic map; (3) a dual-rhythm action generation framework; and (4) an adaptive error corrector.
Key Designs
- MASMap Multi-Dimensional Accumulative Semantic Map: Integrates 3D point cloud accumulation with a 2D semantic map to balance accuracy and efficiency without additional training.
- Raw data processing: RAM-Grounded-SAM is applied to RGB images for object detection and segmentation; 3D point clouds are extracted from depth maps.
- Real-time accumulation: An overlap metric determines whether a currently detected object corresponds to a previously recorded one. If \(os^* > 0.8\) and \(ros^* > 0.8\), the point clouds are merged and labels updated; if \(\max os^* < 0.25\), the object is treated as a new entry.
- Global semantic map fusion: Object point cloud centroids are recorded and 3D data are subsequently cleared to reduce memory usage; historical object matching employs 2D IoU and the Hungarian algorithm.
- Memory is divided into long-term memory (global accumulative map + historical planning targets) and short-term memory (local 3D point clouds + current subtask status).
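The IoU-plus-Hungarian matching step for fusing detections into the global map can be sketched as follows. This is a minimal illustration, not the paper's implementation: `match_objects`, the `(x1, y1, x2, y2)` box format, and the choice to leave mid-range overlaps unresolved are assumptions, while the 0.8 merge and 0.25 new-entry thresholds follow the paper's \(os^*\) rules.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def iou_2d(box_a, box_b):
    """Axis-aligned 2D IoU between boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def match_objects(current_boxes, memory_boxes, merge_thr=0.8, new_thr=0.25):
    """Match current detections to remembered objects via the Hungarian
    algorithm on (1 - IoU) costs. Returns (merged_pairs, new_indices):
    (current, memory) pairs to merge, and detections added as new entries."""
    if not memory_boxes:
        return [], list(range(len(current_boxes)))
    cost = np.array([[1.0 - iou_2d(c, m) for m in memory_boxes]
                     for c in current_boxes])
    rows, cols = linear_sum_assignment(cost)
    merged, new, matched = [], [], set()
    for r, c in zip(rows, cols):
        iou = 1.0 - cost[r, c]
        if iou > merge_thr:      # strong overlap: same object, merge
            merged.append((r, c))
        elif iou < new_thr:      # negligible overlap: treat as new object
            new.append(r)
        # overlaps between the thresholds are left for re-observation
        matched.add(r)
    new += [i for i in range(len(current_boxes)) if i not in matched]
    return merged, sorted(set(new))
```

In the real system the cost would be computed against projected centroids of remembered objects rather than raw boxes; the assignment structure is the same.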
- AWMSystem Three-Module Decision Chain:
- BreakLLM: Automatically decomposes long-horizon instructions into a subtask list \(d_{sub}\) and a status list \(Sub_{Status}\).
- LocateLLM: Determines the next target by integrating object memory, subtask status, and execution feedback. An auxiliary feedback mechanism is introduced: when the consecutive failure count for the same target reaches \(n_{CFE} \geq n_{tolerance}\), a prompt instructing the model to avoid reselecting that target is generated to prevent execution loops.
- StatusMLLM: Triggered when the policy network outputs a Done action; a multimodal LLM evaluates whether the current subtask has been completed and outputs the reasoning result along with the updated status.
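The consecutive-failure feedback that keeps LocateLLM out of execution loops can be sketched as a small counter. The class and method names here are hypothetical; `n_tolerance` plays the role of the paper's \(n_{tolerance}\), and the per-target count corresponds to \(n_{CFE}\).

```python
class FailureFeedback:
    """Tracks consecutive failures per target and builds the avoidance
    prompt for LocateLLM once the count reaches n_tolerance
    (names and prompt wording are illustrative)."""

    def __init__(self, n_tolerance=3):
        self.n_tolerance = n_tolerance
        self.counts = {}  # target name -> consecutive failure count

    def record(self, target, success):
        """Reset the counter on success, otherwise increment it."""
        if success:
            self.counts.pop(target, None)
        else:
            self.counts[target] = self.counts.get(target, 0) + 1

    def avoidance_hint(self):
        """Return a prompt fragment listing targets to avoid, or ''."""
        blocked = [t for t, n in self.counts.items() if n >= self.n_tolerance]
        if not blocked:
            return ""
        return ("Do not reselect these repeatedly failed targets: "
                + ", ".join(blocked))
```

The returned hint would simply be appended to the LocateLLM prompt before the next target-selection call.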
- Dual-Rhythm Action Generator: Decouples planning into slow-rhythm and fast-rhythm phases to balance reasoning depth and efficiency.
- Slow-rhythm phase: Extracts the current target point cloud → LocateLLM decision → computes a feasibility value map → A* path planning → decomposes into waypoint sequences → execution.
- Fast-rhythm phase: A pretrained policy network directly outputs low-level actions (MoveAhead/RotateRight, etc.); a Done output from the policy network triggers StatusMLLM evaluation.
- The feasibility value map combines an obstacle avoidance value \(a_{obs}(n_i)\) and a semantic target proximity value \(a_{tgt}(n_i)\).
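A minimal sketch of such a feasibility value map, in the spirit of \(a_{obs}(n_i)\) and \(a_{tgt}(n_i)\): the paper does not spell out the exact functional forms here, so distance-based stand-ins (an exponential obstacle penalty and an inverse-distance target bonus) and the weights `w_obs`/`w_tgt` are assumptions.

```python
import numpy as np


def feasibility_map(occupancy, target_xy, sigma_obs=2.0, w_obs=1.0, w_tgt=1.0):
    """Combine an obstacle-avoidance value a_obs and a target-proximity
    value a_tgt on a 2D grid. occupancy: bool array, True = obstacle."""
    yy, xx = np.indices(occupancy.shape)
    obs = np.argwhere(occupancy)
    if len(obs):
        # distance from every cell to its nearest obstacle (brute force)
        d = np.sqrt((yy[..., None] - obs[:, 0]) ** 2
                    + (xx[..., None] - obs[:, 1]) ** 2)
        dist_obs = d.min(axis=-1)
    else:
        dist_obs = np.full(occupancy.shape, np.inf)
    a_obs = 1.0 - np.exp(-dist_obs / sigma_obs)   # higher far from obstacles
    dist_tgt = np.hypot(yy - target_xy[0], xx - target_xy[1])
    a_tgt = 1.0 / (1.0 + dist_tgt)                # higher near the target
    value = w_obs * a_obs + w_tgt * a_tgt
    value[occupancy] = -np.inf                    # obstacle cells infeasible
    return value
```

A* planning over this map would then prefer high-value (safe, target-directed) cells along the route.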
- Adaptive Error Corrector: When a potential MoveAhead collision is detected, a new path is planned from the current position. The trajectory is divided into an initial segment (fine sampling interval \(n_{block}\) to support detailed reasoning near obstacles) and a subsequent segment (standard sampling frequency \(n_{waypoint}\)); the feasibility value map is recomputed to generate a corrected trajectory.
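The two-segment waypoint sampling of the corrected trajectory can be sketched as follows. `sample_waypoints`, `fine_len`, and the list-slicing scheme are illustrative assumptions; `n_block` and `n_waypoint` mirror the paper's sampling intervals for the initial and subsequent segments.

```python
def sample_waypoints(path, n_block=1, n_waypoint=4, fine_len=8):
    """Resample a corrected path: a finely sampled initial segment near
    the obstacle (interval n_block), then a coarser remainder (interval
    n_waypoint). path: list of (x, y) cells from the replanner."""
    head = path[:fine_len]            # segment adjacent to the obstacle
    tail = path[fine_len:]            # remainder of the trajectory
    waypoints = head[::n_block] + tail[::n_waypoint]
    if path and waypoints[-1] != path[-1]:
        waypoints.append(path[-1])    # always keep the goal cell
    return waypoints
```

Fine sampling near the obstacle gives the corrector more decision points exactly where collisions occur, while coarse sampling keeps the rest of the trajectory cheap to execute.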
Loss & Training
The system is designed for zero-shot deployment and requires no end-to-end retraining. The policy network reuses the pretrained DDN model, and Qwen2.5-VL-72B serves as the large language model for inference and decision-making. Evaluation metrics include Success Rate (SR), Independent Success weighted by Path Length (ISPL), Success Trajectory Length (STL), and Independent Success Rate (ISR).
Key Experimental Results
Main Results
Comparison with SOTA methods on the TP-MDDN benchmark:
| Method | Zero-Shot | LLM Reasoning | Explicit History | STL↑ | ISR↑ | SR↑ | ISPL↑ |
|---|---|---|---|---|---|---|---|
| DDN | ✗ | ✗ | ✗ | 15.50 | 44.67 | 16.00 | 40.66 |
| MO-DDN | ✗ | ✓ | ✓ | 12.11 | 39.78 | 13.33 | 36.25 |
| InstructNav | ✓ | ✓ | ✓ | 9.50 | 42.44 | 16.00 | 39.41 |
| AWM-Nav | ✗ | ✓ | ✓ | 20.11 | 62.89 | 32.00 | 44.19 |
Ablation Study
Ablation of individual components (selected results):
| Configuration | ISR↑ | SR↑ | ISPL↑ | Notes |
|---|---|---|---|---|
| GLEE segmentor | 51.11 | 21.33 | 41.05 | Insufficient precision |
| YOLO segmentor | 58.00 | 29.33 | 43.69 | Suboptimal |
| RAM-Grounded-SAM | 62.89 | 32.00 | 44.19 | Best |
| Qwen2.5-VL-7B inference | 47.78 | 19.33 | 36.45 | Insufficient parameters |
| GPT-4o inference | 56.44 | 28.67 | 39.95 | Weaker context understanding than 72B |
| Qwen2.5-VL-72B | 62.89 | 32.00 | 44.19 | Best |
| w/o adaptive error correction | 60.67 | 27.33 | 42.46 | Reduced path planning robustness |
| w/o StatusMLLM | 60.44 | 28.00 | 42.20 | Subtask status assessment disabled |
Key Findings
- AWM-Nav achieves a success rate 16 percentage points higher than both DDN and InstructNav, while also being substantially more efficient than InstructNav (6.82 min vs. 88.90 min).
- In the dual-rhythm strategy, slow-rhythm action execution takes approximately 22 times longer than fast-rhythm execution per step, yet the overall inference efficiency far exceeds InstructNav's per-step LLM invocation approach.
- Model scale significantly impacts intelligent planning capability in long-horizon navigation (the gap between 7B and 72B is substantial).
- Even with strong reasoning capability, adaptive error correction remains critical, as collisions and boundary violations are unavoidable physical constraints.
Highlights & Insights
- Comprehensive benchmark design: TP-MDDN explicitly defines a long-horizon navigation task format with multiple sub-demands and task preferences, filling a gap in the field.
- Modular system design: The three LLM modules serve distinct roles (decomposition, localization, and monitoring), cleanly decoupling the core challenges of long-horizon navigation.
- Balance between efficiency and performance: The dual-rhythm design avoids the substantial overhead of per-step LLM invocation while preserving strong reasoning capability.
- Lightweight MASMap design: Retaining only 2D centroid coordinates after 3D point cloud detection substantially reduces memory overhead.
Limitations & Future Work
- The dual-rhythm action generation framework can switch modes unintentionally; transitions between the slow-rhythm and fast-rhythm phases are not always smooth.
- Excessive reliance on pretrained large language models may lead to instruction misinterpretation, adversely affecting navigation decisions.
- Future work may employ reinforcement learning to optimize mode-switching strategies and train domain-specific language models to reduce dependence on general-purpose large models.
- The benchmark is currently evaluated only in the ProcTHOR simulation environment; transferability to real-world settings remains to be verified.
Related Work & Insights
The system design draws on the world model concepts of WMNav and the autonomous evolution mechanism of Voyager. Compared with InstructNav, this paper introduces explicit task decomposition and state tracking, rendering long-horizon navigation more controllable. The IoU-based fusion strategy and memory cleanup mechanism of MASMap provide an effective spatial memory management solution for other long-horizon embodied tasks.
Rating
- Novelty: ⭐⭐⭐⭐ The system design is novel and the TP-MDDN benchmark fills a gap in the field, though the innovation of individual modules is limited.
- Experimental Thoroughness: ⭐⭐⭐⭐ Ablation studies cover all components, but evaluation is limited to simulated environments.
- Writing Quality: ⭐⭐⭐⭐ The structure is clear and the system description is thorough, though the extensive use of formulas makes the presentation somewhat verbose.
- Value: ⭐⭐⭐⭐ The paper provides a complete solution for multi-demand long-horizon navigation and advances the development of embodied AI.