Why LLM Web Agents Fail: A Hierarchical Planning Perspective
Conference: ACL 2026
arXiv: 2603.14248
Code: https://github.com/Ziyu-Yao-NLP-Lab/llm-hierarchical-web-agents
Area: LLM Agent / Web Navigation
Keywords: Web agent failure analysis, hierarchical planning, natural language vs PDDL, execution bottlenecks
TL;DR
This paper systematically analyzes the failure causes of LLM web agents through a hierarchical planning framework (high-level planning, low-level execution, and replanning). It finds that PDDL representation outperforms natural language planning, but low-level execution and perceptual grounding are the primary bottlenecks.
Background & Motivation
Background: The performance of LLM web agents on long-horizon tasks remains significantly lower than human levels. However, existing evaluations focus primarily on end-to-end success rates, offering limited understanding of specific failure sources.
Limitations of Prior Work: End-to-end metrics such as task success rate mask the underlying problem: they cannot tell whether a failure stems from a high-level planning error, weak low-level execution, or a replanning mechanism that did not recover.
Key Challenge: Each component has its own bottleneck, but existing methods optimize overall performance without separating them, which leaves it unclear where improvement effort should go.
Goal: To establish a systematic hierarchical evaluation framework that decomposes web agent capabilities into three independent dimensions for diagnosis.
Key Insight: Inspired by automated planning (e.g., HTN planning), humans solve complex tasks via a three-layer process: "abstract strategy \(\rightarrow\) concrete execution \(\rightarrow\) dynamic replanning." LLM agents should be decomposable using the same logic.
Core Idea: Utilize a hierarchical planning framework instead of black-box end-to-end evaluation to precisely locate the failure causes of LLM agents.
Method
Overall Architecture
The proposed hierarchical planning evaluation framework consists of a four-stage process:
- High-level Planning: The LLM decomposes a natural language instruction into a sequence of high-level subgoals \(P = [g_1, g_2, \ldots, g_n]\), where each \(g_i\) represents an abstract, meaningful step.
- Low-level Execution: For each subgoal \(g_i\), the agent generates a series of executable low-level actions \(a_t \in \mathcal{A}\), producing an execution trajectory \(\tau_i = (o_t, a_t, o_{t+1}, \ldots, o_{t+k})\).
- Post-condition Check: An LLM judge verifies whether the execution result satisfies the intended effect of the subgoal, formulated as \(\Phi(g_i, s') = 1\), where \(s'\) is taken here to denote the state reached after executing \(g_i\).
- Replanning: If a subgoal fails or hits a dead end, the agent decides between local adjustment (continuing from the last successful subgoal) or global replanning (generating a new plan from scratch).
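The four stages above can be sketched as a single control loop. This is a minimal illustration, not the authors' implementation: `run_hierarchical_agent`, `RunResult`, and the `plan` / `execute` / `check` callables are hypothetical names, and only the local-adjustment form of replanning (continuing from the last successful subgoal) is modeled.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class RunResult:
    completed: List[str] = field(default_factory=list)  # subgoals that passed the check
    replans: int = 0
    success: bool = False


def run_hierarchical_agent(
    instruction: str,
    plan: Callable[[str, List[str]], List[str]],   # (instruction, completed) -> remaining subgoals
    execute: Callable[[str], str],                 # subgoal -> resulting state s'
    check: Callable[[str, str], bool],             # Phi(g, s') as a boolean
    max_replans: int = 1,
) -> RunResult:
    """Plan -> execute -> post-condition check -> replan (illustrative sketch)."""
    result = RunResult()
    subgoals = plan(instruction, result.completed)
    while True:
        failed = False
        for goal in subgoals:
            state = execute(goal)          # low-level actions for this subgoal
            if check(goal, state):         # post-condition satisfied
                result.completed.append(goal)
            else:
                failed = True
                break
        if not failed:
            result.success = True
            return result
        if result.replans >= max_replans:
            return result                  # replanning budget exhausted
        result.replans += 1
        # Local adjustment: ask for a plan covering what is still left to do.
        subgoals = plan(instruction, result.completed)
```

Global replanning could be modeled in the same loop by clearing `result.completed` before calling `plan` again; which of the two the agent chooses is a decision the paper leaves to the agent itself.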
Key Designs
- PDDL vs Natural Language Representation:
  - Function: Compares the impact of two high-level planning representations on plan quality.
  - Mechanism: Natural language (NL) is flexible but prone to over-specification (implicit requirements) or over-decomposition. PDDL enforces clear plan semantics via formal structures (preconditions, effects).
  - Design Motivation: NL plans often incorporate low-level details. This design tests whether PDDL can generate more abstract and executable plans through symbolic constraints.
- Multi-dimensional Failure Mode Analysis:
  - Function: Measures performance across three dimensions: planning alignment, execution reliability, and replanning effectiveness.
  - Mechanism: Defines six alignment metrics (Perfect Match / Partial / Missing / Decomposed / Unmatched / Matched Rate) to quantify the deviation of high-level plans from human reference plans. It also defines low-level execution metrics (subgoal completion rate / plan completion rate / task success rate / action efficiency) and evaluates performance changes before and after replanning.
  - Design Motivation: Isolating capabilities allows for clearer improvement directions. For example, if low-level execution is poor despite high-quality planning, the focus should be on perception grounding rather than planning capacity.
- LLM as a Judge for Verification:
  - Function: Automatically judges subgoal completion and overall task success.
  - Mechanism: Uses gpt-5-nano as a judge to determine goal achievement based on execution trajectories and final webpage states. Human verification on 50 samples showed an accuracy of 82%-86%.
  - Design Motivation: In real web environments, rule-based success judgment is fragile; LLMs can understand semantics rather than relying on mechanical matching.
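To make the "preconditions, effects" idea behind the PDDL representation concrete, here is a small STRIPS-style sketch in Python. It is an illustration only: the `Action` class, the predicate names (`site_open`, `results_shown`, `item_in_cart`), and the three example actions are hypothetical and are not taken from the paper's domain definition.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional


@dataclass(frozen=True)
class Action:
    name: str
    preconditions: FrozenSet[str]            # facts that must hold before the step
    add_effects: FrozenSet[str]              # facts made true by the step
    del_effects: FrozenSet[str] = frozenset()  # facts made false by the step


def apply(state: FrozenSet[str], action: Action) -> Optional[FrozenSet[str]]:
    """Return the successor state, or None if a precondition is unmet."""
    if not action.preconditions <= state:
        return None
    return (state - action.del_effects) | action.add_effects


def validate_plan(initial: FrozenSet[str], plan: List[Action], goal: FrozenSet[str]) -> bool:
    """A plan is valid if every step is applicable in order and the goal holds at the end."""
    state: Optional[FrozenSet[str]] = initial
    for action in plan:
        state = apply(state, action)
        if state is None:
            return False
    return goal <= state


# Hypothetical shopping-style subgoals expressed as symbolic actions.
open_site = Action("open_site", frozenset(), frozenset({"site_open"}))
search = Action("search_item", frozenset({"site_open"}), frozenset({"results_shown"}))
add_to_cart = Action("add_to_cart", frozenset({"results_shown"}), frozenset({"item_in_cart"}))

print(validate_plan(frozenset(), [open_site, search, add_to_cart], frozenset({"item_in_cart"})))
print(validate_plan(frozenset(), [search, open_site, add_to_cart], frozenset({"item_in_cart"})))
```

The second call fails because `search_item` requires `site_open`, which is the kind of ordering constraint a free-form NL plan does not make checkable.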
Key Experimental Results
High-level Planning: Natural Language vs PDDL
| Metric | NL (Pre-replan) | PDDL (Pre-replan) | NL (Post-replan) | PDDL (Post-replan) |
|---|---|---|---|---|
| Perfect Match | 60.6% | 67.7% | 56.1% | 59.0% |
| Partial | 5.7% | 7.4% | 6.1% | 6.9% |
| Missing | 4.2% | 2.2% | 4.0% | 14.5% |
| Decomposed | 29.5% | 22.7% | 33.8% | 19.6% |
| Unmatched | 29.4% | 15.4% | 35.0% | 15.4% |
| Matched (Effective steps) | 70.6% | 84.6% | 65.0% | 84.6% |
Key Findings: PDDL plans achieve a higher Perfect Match rate (67.7% vs 60.6%) and fewer Unmatched steps (15.4% vs 29.4%). NL plans tend toward over-decomposition (29.5%), leading to redundant steps.
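A quick consistency check helps when reading the alignment table: the first four rows sum to 100% in every column, and Unmatched plus Matched also sum to 100%. One reading, which is an assumption here rather than something stated in this summary, is that the first group partitions the reference-plan steps while the second partitions the generated steps. The sketch below only re-adds the reported numbers; the dictionary keys are made-up labels.

```python
from typing import Dict, Tuple

# Values copied from the alignment table above (percentages).
ALIGNMENT: Dict[str, Dict[str, float]] = {
    "NL pre-replan": {"perfect": 60.6, "partial": 5.7, "missing": 4.2,
                      "decomposed": 29.5, "unmatched": 29.4, "matched": 70.6},
    "PDDL pre-replan": {"perfect": 67.7, "partial": 7.4, "missing": 2.2,
                        "decomposed": 22.7, "unmatched": 15.4, "matched": 84.6},
    "NL post-replan": {"perfect": 56.1, "partial": 6.1, "missing": 4.0,
                       "decomposed": 33.8, "unmatched": 35.0, "matched": 65.0},
    "PDDL post-replan": {"perfect": 59.0, "partial": 6.9, "missing": 14.5,
                         "decomposed": 19.6, "unmatched": 15.4, "matched": 84.6},
}

# Assumed grouping: which rows partition which set of steps.
REFERENCE_SIDE = ("perfect", "partial", "missing", "decomposed")
GENERATED_SIDE = ("unmatched", "matched")


def group_totals(column: Dict[str, float]) -> Tuple[float, float]:
    """Sum each group of categories, rounded to the table's one-decimal precision."""
    ref = round(sum(column[k] for k in REFERENCE_SIDE), 1)
    gen = round(sum(column[k] for k in GENERATED_SIDE), 1)
    return ref, gen


for name, column in ALIGNMENT.items():
    print(name, group_totals(column))
```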
Low-level Execution: Identifying the True Bottleneck
| Metric | gpt-5-nano (human plan) | gpt-5-nano (NL plan) | gpt-5-nano (PDDL plan) |
|---|---|---|---|
| Subgoal Completion Rate | 38.5% | 26.8% | 32.1% |
| Plan Completion Rate | 38.5% | - | - |
| Final Task Success Rate | 36.4% | 18.5% | 24.7% |
Low-level Execution Failure Modes:
| Failure Mode | Incidence | Root Cause |
|---|---|---|
| Hallucinated links (goto action) | 32.0% | LLM invents non-existent URLs |
| Redundant actions | 34.2% | Insufficient environment understanding; invalid operations |
| Out-of-domain links | 16.7% | Navigating away from target site (e.g., to Wikipedia via search) |
| Repetitive execution | 10.4% | Failure to learn from feedback; getting stuck in loops |
Key Findings: Even when provided with perfect human-annotated high-level plans, the LLM executor success rate is only 36.4%, indicating that low-level execution and perceptual grounding are the true bottlenecks.
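Three of the failure modes in the table lend themselves to simple trajectory heuristics. The sketch below is one possible way to flag them, assuming the trajectory records each `goto` URL, the set of links actually observed on the page, and the target site's domain. The function names, label strings, and the rule that out-of-domain takes priority over hallucinated are all assumptions for illustration, not the paper's annotation procedure.

```python
from typing import List, Set, Tuple
from urllib.parse import urlparse


def classify_goto(url: str, observed_links: Set[str], target_domain: str) -> str:
    """Label a goto action as 'out_of_domain', 'hallucinated_link', or 'ok'."""
    host = urlparse(url).hostname or ""
    in_domain = host == target_domain or host.endswith("." + target_domain)
    if not in_domain:
        return "out_of_domain"        # e.g. drifting to Wikipedia via search
    if url not in observed_links:
        return "hallucinated_link"    # URL never appeared in any observation
    return "ok"


def count_repeats(actions: List[Tuple[str, str]]) -> int:
    """Count actions that exactly repeat the immediately preceding action."""
    return sum(1 for prev, cur in zip(actions, actions[1:]) if prev == cur)


seen = {"https://shop.example.com/cart"}
print(classify_goto("https://shop.example.com/cart", seen, "example.com"))
print(classify_goto("https://shop.example.com/deals-2026", seen, "example.com"))
print(classify_goto("https://en.wikipedia.org/wiki/Shoes", seen, "example.com"))
print(count_repeats([("click", "#next"), ("click", "#next"), ("click", "#next"), ("type", "q")]))
```

Redundant actions are harder to detect mechanically because they depend on whether a step changed anything useful, which is presumably why an LLM judge or manual inspection is needed there.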
Effect of Replanning
| Configuration | Subgoal Completion | Task Success | Gain (Subgoal Completion) |
|---|---|---|---|
| Pre-replan (NL) | 26.8% | 18.5% | Base |
| Post-replan (NL) | 31.2% | 22.3% | +4.4pp |
| Pre-replan (PDDL) | 32.1% | 24.7% | Base |
| Post-replan (PDDL) | 35.5% | 28.9% | +3.4pp |
Single-turn replanning raises subgoal completion by 3.4-4.4 percentage points and task success by 3.8-4.2 percentage points, suggesting that while the replanning mechanism is effective, its impact is limited.
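The gains can be recomputed directly from the replanning table. The short sketch below does the subtraction for both metrics; it shows that the table's gain column matches the subgoal-completion difference, while the task-success difference is slightly different. The `RESULTS` layout and the `gains` helper are illustrative names.

```python
from typing import Dict, Tuple

# (subgoal completion %, task success %) copied from the replanning table.
RESULTS: Dict[str, Dict[str, Tuple[float, float]]] = {
    "NL": {"pre": (26.8, 18.5), "post": (31.2, 22.3)},
    "PDDL": {"pre": (32.1, 24.7), "post": (35.5, 28.9)},
}


def gains(results: Dict[str, Dict[str, Tuple[float, float]]]) -> Dict[str, Tuple[float, ...]]:
    """Post-replan minus pre-replan, in percentage points, for each metric."""
    return {
        name: tuple(round(post - pre, 1) for pre, post in zip(r["pre"], r["post"]))
        for name, r in results.items()
    }


for name, (subgoal_pp, task_pp) in gains(RESULTS).items():
    print(f"{name}: subgoal +{subgoal_pp}pp, task success +{task_pp}pp")
# NL: subgoal +4.4pp, task success +3.8pp
# PDDL: subgoal +3.4pp, task success +4.2pp
```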
Comparison Across LLMs
- gpt-5-nano: Strongest performance, 36.4% task success rate (human plan).
- claude-haiku-4.5: 29.2% success rate; highest rate of repetitive failures (16.7%), indicating weak feedback utilization.
- gemini-flash-2.5: 17.3% success rate; worst low-level execution with the highest redundant action rate (41.2%), despite compact planning.
Highlights & Insights
- Innovation in Hierarchical Diagnostic Framework: Rather than aiming for incremental end-to-end gains, the paper provides a systematic method to isolate and evaluate three layers of capability. This approach, borrowed from automated planning, is applied to LLM web agent failure analysis for the first time.
- Quantitative Advantage of PDDL: The paper provides quantitative evidence that, in this setting, a formal representation outperforms natural language. Although PDDL has a steeper learning curve, it yields more precise plans with less redundancy and higher executability.
- Low-level Execution as the Core Issue: The study challenges the common assumption that "improving LLM reasoning will automatically improve web agents." Evidence shows that even with perfect high-level planning, execution fails at a rate of 63.6%, shifting the focus toward perceptual grounding.
- Fine-grained Classification of Failure Modes: Categorizing execution failures into hallucinated links, redundant actions, out-of-domain jumps, and repetitive loops (ranging from 10.4% to 34.2% in the reported breakdown) provides a concrete roadmap for optimization.
- Limited Utility of Simple Replanning: Gains from single-turn replanning are modest (about +4pp), indicating a need for more sophisticated adaptation mechanisms beyond simple retries.
Limitations & Future Work
Author-acknowledged Limitations:
- Experiments focused on a limited set of high-level representations (NL and PDDL), action spaces (3 types), and agent configurations.
- Does not account for multimodal settings (visual information).
- High-level plan evaluation requires human-annotated reference plans, which limits flexibility.

Self-identified Limitations:
- Evaluation is restricted to 104 tasks from Mind2Web-Live, which is a relatively small sample size.
- While LLM-as-a-Judge has 82%-86% accuracy, it may still misjudge edge cases in highly complex web pages.
- Replanning was explored for only one round; multi-round iterative convergence remains unstudied.
- The potential of hybrid options (e.g., PDDL planning combined with neural network low-level executors) was not explored.

Specific Improvement Directions:
1. Perceptual Grounding: Incorporate visual features or structured page representations (e.g., symbolic DOM trees) to mitigate link hallucinations.
2. Action Space Design: Allow agents to explicitly express "uncertainty" or "need for clarification" rather than forced guessing.
3. Distributed Execution: Decouple the planning module (using PDDL) from the execution module (using specialized tools or neural nets) to optimize each independently.
4. Multi-round Replanning Strategies: Design adaptive feedback mechanisms that teach the agent when to backtrack versus when to proceed.
Related Work & Insights
- vs. WebArena/Mind2Web (End-to-End Evaluation): These benchmarks only measure success rates without diagnosing the source of failure. This work redefines evaluation through a diagnostic lens rather than pure performance testing.
- vs. Prior PDDL Planning Work (Silver et al.): Previous research applied PDDL+LLM to classical planning; this paper introduces it to web agents with quantitative comparisons, finding formal representations maintain advantages in open-world web environments.
- vs. Low-level Execution Improvements (e.g., WALT tool calling): While those methods improve execution using APIs, they do not diagnose why execution fails. This framework is orthogonally compatible with those directions.
- vs. Adaptive Agents (e.g., Reflexion): While these methods use feedback for improvement, this paper quantifies the specific gain of single-round replanning (about +4pp) and highlights the need for multi-layered feedback.
Rating
- Novelty: ⭐⭐⭐⭐ Applying hierarchical planning to web agent diagnostics is a fresh perspective. The quantitative PDDL vs. NL comparison is a significant contribution.
- Experimental Thoroughness: ⭐⭐⭐⭐ Rigorous design covering multiple LLMs and multi-dimensional metrics, though the task count (104) is somewhat low. Excellent failure mode analysis.
- Writing Quality: ⭐⭐⭐⭐⭐ Extremely logical flow from motivation through framework to experiments and recommendations. High readability.
- Value: ⭐⭐⭐⭐ Highly practical, providing clear guidance to the community by identifying low-level execution as the primary bottleneck over planning.