Web-CogReasoner: Towards Knowledge-Induced Cognitive Reasoning in Web Agents
- Conference: ICLR 2026
- arXiv: 2508.01858
- Code: github.com/Gnonymous/Web-CogReasoner
- Area: LLM Agent
- Keywords: Web Agent, Cognitive Reasoning, Bloom's Taxonomy, Knowledge-Driven CoT, Curriculum Learning
TL;DR
Web-CogReasoner draws on Bloom's Taxonomy from educational psychology to decompose Web Agent capabilities into a three-tier hierarchy of factual, conceptual, and procedural knowledge, and builds a knowledge-driven Chain-of-Thought reasoning framework on top of that hierarchy. The resulting 7B model scores 84.4 overall on Web-CogBench, ahead of Claude Sonnet 4 (76.8) and Gemini 2.5 Pro (80.4), and reaches a 30.2% success rate on WebVoyager versus 26.2% for OpenWebVoyager_Max.
Background & Motivation
- Multimodal large language models have advanced Web Agent development, yet general-domain pretraining knowledge remains a performance bottleneck on specialized tasks.
- Existing knowledge augmentation approaches lack systematicity or theoretical grounding.
- This paper takes Bloom's Taxonomy, a foundational educational theory, as its starting point, decomposing learning into two phases:
  - Phase 1 (Knowledge Content Learning): accumulating factual and conceptual knowledge → addressing "what to learn"
  - Phase 2 (Cognitive Process): developing procedural knowledge → addressing "how to act"
- This mirrors the human learning trajectory: first acquiring knowledge through education, then learning to apply, analyze, and create based on that knowledge and experience.
- Core problem: how to systematically inject hierarchical knowledge into Web Agents to endow them with cognitive reasoning capabilities?
Method
1. Web-CogKnowledge Framework
A three-tier knowledge hierarchy grounded in Bloom's Taxonomy:
Factual Knowledge:
- Concrete information extracted from webpage content
- e.g., identifying attributes of web elements, predicting the direct outcome of a single interaction
Conceptual Knowledge:
- Semantic relationships and abstract patterns within webpage content and structure
- e.g., inferring the function of interface components, understanding the overall purpose and structure of a webpage
Procedural Knowledge:
- Actionable knowledge for completing specific tasks through interaction
- e.g., executing goal-directed action sequences, inferring user intent, handling unexpected interruptions
2. Web-CogDataset
- Multimodal metadata crawled from 14 mainstream real-world websites
- 12 fine-grained, progressively challenging task types:
  - Factual tier: element attribute recognition, next-page prediction, source element prediction
  - Conceptual tier: element understanding, webpage understanding
  - Procedural tier: user intent prediction, popup dismissal, single-step exploration, etc.
- Data progressively deepens from perception → understanding → action
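The perception → understanding → action ordering can be illustrated with a small grouping routine. This is a minimal sketch, not the authors' data pipeline: `TASK_TIER`, `curriculum_stages`, and the sample format (a dict with a `"task"` key) are hypothetical, and only the task types named in this summary are included (the remaining procedural task types are not listed here).

```python
# Hypothetical sketch: group training samples into ordered curriculum stages.
# Tier names and task names come from the summary; the sample format is assumed.
TIER_ORDER = ["factual", "conceptual", "procedural"]

TASK_TIER = {
    "element attribute recognition": "factual",
    "next-page prediction": "factual",
    "source element prediction": "factual",
    "element understanding": "conceptual",
    "webpage understanding": "conceptual",
    "user intent prediction": "procedural",
    "popup dismissal": "procedural",
    "single-step exploration": "procedural",
}


def curriculum_stages(samples):
    """Return [(tier, samples_in_tier), ...] in factual -> procedural order."""
    stages = {tier: [] for tier in TIER_ORDER}
    for sample in samples:
        stages[TASK_TIER[sample["task"]]].append(sample)
    return [(tier, stages[tier]) for tier in TIER_ORDER]


demo = [
    {"task": "popup dismissal", "id": 1},
    {"task": "element understanding", "id": 2},
    {"task": "next-page prediction", "id": 3},
]
ordered = curriculum_stages(demo)
```

Whether each stage trains only on its own tier or also replays earlier tiers is not stated in this summary; the sketch leaves that choice to the caller.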
3. Web-CogBench
- 876 test instances spanning 8 task types
- Evaluation across three cognitive dimensions:
  - Memorizing: assesses the ability to recall specific information (corresponding to factual knowledge)
  - Understanding: assesses semantic interpretation ability (corresponding to conceptual knowledge)
  - Exploring: assesses the ability to plan and execute goal-directed actions (corresponding to procedural knowledge)
- Multiple evaluation metrics including ROUGE-L, Accuracy, and LVM Judge
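For readers unfamiliar with ROUGE-L, the sketch below shows the standard longest-common-subsequence formulation as an F1 score. It assumes lowercase whitespace tokenization and equal weighting of precision and recall; the summary does not say which ROUGE-L variant or implementation Web-CogBench uses, so treat `rouge_l_f1` as an illustration rather than the benchmark's scorer.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """ROUGE-L F1 over whitespace tokens (tokenization is an assumption)."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)
```

For example, scoring the candidate "the blue search button" against the reference "search button" gives precision 0.5 and recall 1.0, so the F1 is 2/3.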
4. Web-CogReasoner Reasoning Framework
Modeled as a POMDP: \(P = (S, A, O, K, T, R)\)
Knowledge-driven CoT reasoning process:
\[\textbf{Task Prompt} \rightarrow \textbf{Knowledge-driven CoT} \rightarrow \textbf{Plan} \rightarrow \textbf{Action}\]
Reasoning proceeds through three tiers:
1. Factual Reasoning: "What is on the page?" → identifying page elements and states
2. Conceptual Reasoning: "What does this mean?" → inferring roles and interaction relationships
3. Procedural Reasoning: "How to complete the task?" → planning goal-directed steps
The backbone model is Qwen2.5-VL-7B, fine-tuned via supervised training on Web-CogDataset.
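To make the Task Prompt → Knowledge-driven CoT → Plan → Action flow concrete, here is a minimal prompt-template sketch. The three guiding questions are quoted from this summary; the function name `build_cot_prompt`, the field labels, and the overall layout are hypothetical and are not taken from the paper or its code release.

```python
# Hypothetical template: the tier names and questions come from the summary,
# but the layout and field labels are illustrative assumptions.
REASONING_TIERS = [
    ("Factual Reasoning", "What is on the page?"),
    ("Conceptual Reasoning", "What does this mean?"),
    ("Procedural Reasoning", "How to complete the task?"),
]


def build_cot_prompt(task, observation):
    """Lay out a prompt that walks through the three tiers before acting."""
    lines = [f"Task: {task}", f"Observation: {observation}", ""]
    for index, (name, question) in enumerate(REASONING_TIERS, start=1):
        lines.append(f"{index}. {name}: {question}")
    lines += ["", "Plan:", "Action:"]
    return "\n".join(lines)


prompt = build_cot_prompt("Find the cheapest flight", "<page screenshot>")
```

The ordering is the design point: factual and conceptual reasoning are elicited before the plan and action slots, so the action is conditioned on what the model has already stated about the page.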
Key Experimental Results
Web-CogBench Comprehensive Evaluation
| Model | Elem Attr | Next Page | Source Elem | Elem Und | WebPage Und | User Intent | Popup | Step Exp | Overall |
|---|---|---|---|---|---|---|---|---|---|
| Claude Sonnet 4 | 79.7 | 93.5 | 62.5 | 62.8 | 54.3 | 64.7 | 100 | 96.8 | 76.8 |
| Gemini 2.5 Pro | 79.8 | 94.6 | 84.4 | 62.6 | 73.5 | 51.9 | 96.6 | 98.4 | 80.4 |
| UI-TARs-7B-SFT | 63.5 | 88.0 | 31.3 | 48.0 | 48.0 | 32.4 | 25.9 | 33.9 | 46.4 |
| Web-CogReasoner | 91.4 | 93.5 | 87.5 | 69.2 | 79.0 | 61.4 | 98.3 | 95.2 | 84.4 |
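As a sanity check on how the columns aggregate, the sketch below recomputes per-dimension and overall scores from the Web-CogReasoner row. It assumes each dimension score is the unweighted mean of its task columns (grouped as in the dataset tier list) and that Overall is the unweighted mean of all eight tasks; the summary does not state the aggregation rule, but under this assumption the results match the final row of the ablation table.

```python
# Web-CogReasoner row from the table above.
scores = {
    "Elem Attr": 91.4, "Next Page": 93.5, "Source Elem": 87.5,
    "Elem Und": 69.2, "WebPage Und": 79.0,
    "User Intent": 61.4, "Popup": 98.3, "Step Exp": 95.2,
}

# Assumed column-to-dimension grouping, following the dataset tier list.
DIMENSIONS = {
    "Memorizing": ["Elem Attr", "Next Page", "Source Elem"],
    "Understanding": ["Elem Und", "WebPage Und"],
    "Exploring": ["User Intent", "Popup", "Step Exp"],
}


def mean(values):
    values = list(values)
    return sum(values) / len(values)


dimension_scores = {
    name: round(mean(scores[col] for col in cols), 1)
    for name, cols in DIMENSIONS.items()
}
overall = round(mean(scores.values()), 1)
```

Here `dimension_scores` comes out as Memorizing 90.8, Understanding 74.1, Exploring 85.0, and `overall` as 84.4.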
VisualWebBench
| Model | Perception Avg | Reasoning Avg | Overall Avg |
|---|---|---|---|
| Claude Sonnet 4 | 80.7 | 91.2 | 85.9 |
| UI-TARs-7B-SFT | 82.4 | 89.7 | 86.0 |
| Web-CogReasoner | 79.0 | 93.6 | 86.3 |
Ablation Study on Curriculum Learning
| Configuration | Memorizing | Understanding | Exploring | Overall |
|---|---|---|---|---|
| Qwen2.5-VL-7B (Base) | 67.6 | 61.0 | 77.9 | 69.8 |
| + Factual (S1) | 85.5 (+17.9) | 64.2 | 60.1 | 72.1 |
| + Conceptual (S2) | 88.1 | 75.5 (+11.3) | 65.8 | 78.3 |
| + Procedural (S3) | 90.8 | 74.1 | 85.0 (+19.2) | 84.4 |
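The parenthesized gains in the table are measured against the preceding stage, not against the base model. The short sketch below recomputes them from the table values; the variable and function names are illustrative only.

```python
# Rows of the ablation table, in training order.
stages = [
    ("Base", {"Memorizing": 67.6, "Understanding": 61.0, "Exploring": 77.9, "Overall": 69.8}),
    ("S1", {"Memorizing": 85.5, "Understanding": 64.2, "Exploring": 60.1, "Overall": 72.1}),
    ("S2", {"Memorizing": 88.1, "Understanding": 75.5, "Exploring": 65.8, "Overall": 78.3}),
    ("S3", {"Memorizing": 90.8, "Understanding": 74.1, "Exploring": 85.0, "Overall": 84.4}),
]


def stage_deltas(rows):
    """Change of each metric relative to the previous stage, in points."""
    deltas = {}
    for (_, prev), (name, cur) in zip(rows, rows[1:]):
        deltas[name] = {k: round(cur[k] - prev[k], 1) for k in cur}
    return deltas


deltas = stage_deltas(stages)
total_gain = round(stages[-1][1]["Overall"] - stages[0][1]["Overall"], 1)
```

The same computation shows that Exploring drops by 17.8 points after the factual stage (77.9 → 60.1) and only recovers once procedural data is added in S3, which is consistent with each tier mainly driving its own dimension.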
Online Web Tasks: WebVoyager Success Rate
| Agent | Overall |
|---|---|
| Claude Sonnet 4 | 47.7% |
| Gemini 2.5 Pro | 54.9% |
| OpenWebVoyager_Max | 26.2% |
| Web-CogReasoner | 30.2% |
Web-CogReasoner achieves the best success rate among open-source models and also uses the fewest average steps (4.73 steps per successful task).
Highlights & Insights
- Solid theoretical grounding: Bloom's Taxonomy provides a principled basis for decomposing agent capabilities into trainable knowledge tiers.
- Structured knowledge-driven CoT: Rather than simple prompt engineering, the knowledge hierarchy is embedded directly into the reasoning process.
- Pronounced curriculum learning effect: Ablation experiments quantify each knowledge tier's contribution, yielding an overall gain of 14.6 points on Web-CogBench (69.8 → 84.4).
- Comprehensive evaluation framework: Web-CogBench systematically assesses agents along three cognitive dimensions: memorizing, understanding, and exploring.
- Surpasses closed-source models overall: The 7B model's overall Web-CogBench score (84.4) exceeds Claude Sonnet 4 (76.8) and Gemini 2.5 Pro (80.4), although each closed-source model still leads on some individual tasks (e.g., Claude Sonnet 4 on User Intent, Gemini 2.5 Pro on Next Page).
- Efficient execution: The fewest average steps in online tasks indicate that knowledge-driven reasoning reduces redundant exploration.
Limitations & Future Work
- Performance on online tasks (Mind2Web cross-web) lags behind Claude/Gemini; generalizing to entirely unseen websites remains challenging.
- The 14 training websites may offer insufficient coverage to encompass all webpage types.
- The POMDP formulation lacks an explicit memory mechanism, which may limit long-horizon decision-making.
- Integration with reinforcement learning training has not been explored; the current approach relies solely on SFT.
- The partitioning of knowledge tiers may be overly rigid, as the three knowledge types are often intertwined in real-world tasks.
Related Work & Insights
- Compared to UI-TARs (strong visual perception but weak cognitive reasoning): Web-CogReasoner leads substantially on Web-CogBench (84.4 vs. 46.4 overall), suggesting that visual perception alone is insufficient for cognitive tasks.
- Compared to OpenWebVoyager (requires retraining on target websites): Web-CogReasoner generalizes through structured knowledge alone.
- Compared to general LVMs (Claude/Gemini): the open-source 7B model is competitive on knowledge-intensive tasks.
- Compared to earlier curriculum learning approaches such as AutoWebGLM: Web-CogReasoner offers a more systematic knowledge hierarchy.
Broader Implications
- Bloom's Taxonomy could be extended to other agent domains (e.g., GUI agents, robotic manipulation).
- The knowledge-driven CoT paradigm is applicable to other tasks requiring structured reasoning.
- The progressive factual → conceptual → procedural training strategy offers a valuable reference for multi-stage training pipelines.
- The cognitive-dimension evaluation methodology of Web-CogBench can inspire better agent benchmark design.
Rating
- Novelty: ⭐⭐⭐⭐ (The integration of Bloom's Taxonomy with Web Agents is novel, though knowledge injection itself is not an entirely new idea)
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ (4 benchmarks, detailed ablations, component analysis, and online task evaluation)
- Writing Quality: ⭐⭐⭐⭐ (Clear structure, though the paper is lengthy and some sections could be condensed)
- Value: ⭐⭐⭐⭐ (Provides a systematic methodology for knowledge-based Web Agent training; Web-CogBench has practical utility)