Web-CogReasoner: Towards Knowledge-Induced Cognitive Reasoning for Web Agents
- Conference: ICLR 2026
- arXiv: 2508.01858
- Code: https://github.com/Gnonymous/Web-CogReasoner
- Area: LLM Agent
- Keywords: Web Agent, Cognitive Reasoning, Bloom's Taxonomy, Chain-of-Thought, Knowledge-Driven
TL;DR
Inspired by Bloom's educational taxonomy, this paper proposes the Web-CogKnowledge Framework, which decomposes Web Agent capabilities into a progressive three-tier knowledge hierarchy—Factual→Conceptual→Procedural—and trains Web-CogReasoner using a Knowledge-driven CoT (KCoT) reasoning framework. The resulting model achieves 84.4% on Web-CogBench, surpassing Claude Sonnet 4 (76.8%) and Gemini 2.5 Pro (80.4%).
Background & Motivation
Web Agents have evolved from early rule-based systems to LLM/LVM-powered intelligent systems. The central challenge is that general-purpose pretraining knowledge creates a performance bottleneck on specialized tasks.
Specifically:
Text-only Agents: process only HTML/Accessibility Trees, missing visual cues.
Vision-only Agents: reason directly from screenshots but lack structured data.
Hybrid Agents: integrate both modalities but still lack a systematic knowledge foundation.
Prior knowledge-augmentation methods tend to lack theoretical grounding or systematic organization. The paper draws a key insight from educational psychology: human learning first involves knowledge acquisition (Stage 1), followed by application, synthesis, and creation based on that foundation (Stage 2). Translated to Web Agents:
- Stage 1 (Knowledge Content Learning): Establishes multi-tier foundations—factual knowledge (basic concepts) and conceptual knowledge (relational understanding), corresponding to what to know.
- Stage 2 (Cognitive Process): Develops procedural knowledge—logical reasoning frameworks, corresponding to how to act.
Method
Overall Architecture
The Web-CogKnowledge Framework consists of three components:

1. Web-CogKnowledge: A hierarchical knowledge taxonomy.
2. Web-CogDataset: Structured training data constructed from 14 real-world websites.
3. Web-CogBench: A benchmark evaluating agents' knowledge mastery and cognitive capabilities.
Agent interaction is modeled as a POMDP: \(P = (S, A, O, K, T, R)\), where at each step the agent receives a screenshot and an Accessibility Tree, generates a reasoning trace \(h_t\), and selects an action \(a_t\).
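To make the POMDP formulation concrete, here is a minimal sketch of one decision step. All names (`Observation`, `agent_step`, the prompt layout) are hypothetical illustrations, not the paper's API; the point is that the agent conditions on the observation \(o_t\) (screenshot plus Accessibility Tree) and its knowledge \(K\), never on the hidden state.

```python
from dataclasses import dataclass

# Hypothetical sketch of one step under P = (S, A, O, K, T, R):
# the agent sees only an observation o_t, not the state s_t, and
# produces a reasoning trace h_t plus an action a_t.

@dataclass
class Observation:
    screenshot: bytes        # rendered page image
    accessibility_tree: str  # serialized AX tree

@dataclass
class StepOutput:
    reasoning: str  # h_t, the reasoning trace
    action: str     # a_t, e.g. 'click(id=42)'

def agent_step(obs: Observation, knowledge: dict, policy) -> StepOutput:
    """One decision step: condition the policy on observation + knowledge."""
    prompt = f"[AX tree]\n{obs.accessibility_tree}\n[Knowledge]\n{knowledge}"
    reasoning, action = policy(prompt)  # policy is the fine-tuned model
    return StepOutput(reasoning=reasoning, action=action)
```

A real implementation would also encode the screenshot for the vision encoder; the textual prompt here is a simplification.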
Key Designs
Three-Tier Knowledge Taxonomy:
| Knowledge Tier | Definition | Cognitive Capability | Example Tasks |
|---|---|---|---|
| Factual Knowledge | Concrete information about web elements | Memorizing | Identifying element attributes, predicting single-step interaction outcomes |
| Conceptual Knowledge | Semantic relationships and abstract patterns | Understanding | Inferring UI component functions, understanding page structure |
| Procedural Knowledge | Operational methods for completing tasks | Exploring | Executing goal-directed sequences, handling interruptions |
Web-CogDataset (12 fine-grained task types):

- Metadata selectively crawled from 14 representative websites.
- Progressive training tasks designed according to the three knowledge tiers.
- Covers the full pipeline from perception → understanding → execution.
Knowledge-driven CoT Reasoning—the core inference chain:
Each reasoning step is decomposed into three layers:

1. Factual layer: "What is on the page?" — identifying elements and states.
2. Conceptual layer: "What does this mean?" — inferring roles and interaction relationships.
3. Procedural layer: "How to complete the task?" — planning goal-directed steps.
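The three-layer decomposition can be sketched as a structured trace format. Section tags and function names below are illustrative assumptions, not the paper's actual template; they only show how a decision can be forced through the Factual→Conceptual→Procedural tiers before an action is emitted.

```python
# Hypothetical KCoT-style trace: each tier gets its own tagged line,
# and the action is only emitted after all three tiers are filled in.

def compose_kcot(factual: str, conceptual: str, procedural: str, action: str) -> str:
    return (
        f"[Factual] {factual}\n"        # what is on the page
        f"[Conceptual] {conceptual}\n"  # what it means
        f"[Procedural] {procedural}\n"  # how to advance the task
        f"[Action] {action}"
    )

def parse_kcot(trace: str) -> dict:
    """Recover the tiered fields from a generated trace."""
    out = {}
    for line in trace.splitlines():
        tag, _, body = line.partition("] ")
        out[tag.lstrip("[").lower()] = body
    return out

trace = compose_kcot(
    factual="A search box (id=7) and a 'Go' button (id=8) are visible.",
    conceptual="The box and button form the site's search component.",
    procedural="Type the query, then click 'Go'.",
    action='type(id=7, "wireless mouse")',
)
```

Parsing the trace back into fields makes the tier structure checkable at training time, which is one plausible way to enforce the dependency ordering.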
Curriculum Learning Strategy—progressive training by knowledge tier:

- S1: Factual knowledge training.
- S2: Conceptual knowledge training.
- S3: Procedural knowledge training.
Loss & Training
Supervised fine-tuning (SFT) is applied to Qwen2.5-VL-7B, with training data organized into three stages:

- Stage S1: Factual knowledge samples—element attribute recognition, single-step interaction prediction.
- Stage S2: Conceptual knowledge samples—element function understanding, page structure comprehension.
- Stage S3: Procedural knowledge samples—multi-step task execution, intent reasoning, interruption handling.
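The staged training above amounts to sequential SFT where each stage resumes from the previous stage's checkpoint. The sketch below assumes that structure; the stage names, task-type lists, and `sft_step` signature are illustrative, and the paper's actual data mixtures and hyperparameters will differ.

```python
# Hypothetical S1 -> S2 -> S3 curriculum: plain sequential fine-tuning,
# where later tiers build on the weights produced by earlier tiers.

CURRICULUM = [
    ("S1_factual",    ["element_attributes", "single_step_outcome"]),
    ("S2_conceptual", ["component_function", "page_structure"]),
    ("S3_procedural", ["multi_step_execution", "intent_reasoning", "interruption_handling"]),
]

def run_curriculum(model, datasets: dict, sft_step):
    """Train the stages strictly in order and record the stage sequence."""
    log = []
    for stage, task_types in CURRICULUM:
        for task in task_types:
            model = sft_step(model, datasets[task])  # one SFT pass per task type
        log.append(stage)
    return model, log
```

The strict ordering is what the ablations probe: skipping S1 or S2 before S3 corresponds to dropping entries from `CURRICULUM`, which the paper shows sharply degrades online success rates.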
Training follows imitation learning (IL) using knowledge-guided reasoning templates. KCoT proves critical: removing it causes the online task success rate to drop sharply from 42.9% to 25.35%.
Key Experimental Results
Main Results (Web-CogBench)
Aggregate performance across 8 tasks:
| Model | Memorizing | Understanding | Exploring | Overall |
|---|---|---|---|---|
| Claude Sonnet 4 | – | – | – | 76.8 |
| Gemini 2.5 Pro | – | – | – | 80.4 |
| Qwen2.5-VL-7B | 53.2 | 60.0 | – | 69.8 |
| UI-TARS-7B-SFT | 63.5 | 48.0 | – | 46.4 |
| Web-CogReasoner | 91.4 | 69.2 | – | 84.4 |
WebVoyager online task success rate (average across 15 websites):
| Agent | Overall |
|---|---|
| Claude Sonnet 4 | 47.7% |
| Gemini 2.5 Pro | 54.9% |
| OpenWebVoyager-Max | 26.2% |
| Web-CogReasoner | 30.2% |
VisualWebBench composite scores:
| Model | Perception Avg. | Reasoning Avg. | Overall |
|---|---|---|---|
| Claude Sonnet 4 | 80.7 | 91.2 | 85.9 |
| Gemini 2.5 Pro | 80.3 | 93.0 | 86.6 |
| UI-TARS-7B-SFT | 82.4 | 89.7 | 86.0 |
| Web-CogReasoner | 79.0 | 93.6 | 86.3 |
Ablation Study (Curriculum Learning Progressive Gains)
Effect of progressive knowledge training on Web-CogBench:
| Configuration | Memorizing | Understanding | Exploring | Overall |
|---|---|---|---|---|
| Base model | 67.6 | 61.0 | 77.9 | 69.8 |
| +S1 (Factual) | 85.5 (+17.9) | 64.2 | 60.1 | 72.1 |
| +S1+S2 (Conceptual) | 88.1 | 75.5 (+11.3) | 65.8 | 78.3 |
| +S1+S2+S3 (Procedural) | 90.8 | 74.1 | 85.0 (+19.2) | 84.4 |
Tier dependency verification (WebVoyager):
| Configuration | Overall |
|---|---|
| S3 only | 13.14% |
| S1+S3 | 23.47% |
| S1+S2+S3 (w/o KCoT) | 25.35% |
| S1+S2+S3 (w/ KCoT) | 42.9% |
Key Findings
- Lower-tier knowledge is a prerequisite for higher-tier knowledge: S1+S3 achieves nearly twice the success rate of S3 only (23.47% vs. 13.14%), demonstrating that procedural exploration depends on factual grounding.
- KCoT functions as a knowledge activator: The full knowledge configuration (S1+S2+S3) without KCoT yields only 25.35%, while adding KCoT brings this to 42.9%—knowledge and the reasoning framework are mutually indispensable.
- Lesson from UI-TARS: UI-TARS performs well on VisualWebBench (86.0 overall) but scores only 46.4 on Web-CogBench, demonstrating that strong perceptual ability does not equate to cognitive reasoning capability.
- Execution efficiency: Web-CogReasoner achieves the lowest average number of steps per task (4.73) among all compared methods, indicating that structured knowledge leads to more efficient decision-making.
Highlights & Insights
- Educational theory guiding AI training: This work is the first to systematically apply Bloom's Taxonomy to Web Agent training, designing the training pipeline along both "what to teach" and "how to teach" dimensions.
- Rigorous validation of knowledge tier dependencies: Ablation experiments clearly demonstrate the Factual→Conceptual→Procedural dependency chain—skipping lower tiers and training higher-tier capabilities directly leads to failure.
- KCoT as a "knowledge activator": Unlike simply increasing training data volume, KCoT provides an explicit knowledge organization mechanism that enables the model to invoke acquired knowledge in a tier-wise manner during decision-making.
- Surpassing open-source baselines: The 7B open-source model outperforms or approaches closed-source commercial models such as Claude and Gemini across multiple evaluation dimensions.
Limitations & Future Work
- Reliance on imitation learning: Current training is entirely based on SFT/IL and lacks the exploratory capacity of reinforcement learning, potentially limiting the agent's ability to discover novel strategies.
- Significant gap to commercial models on online tasks: on WebVoyager, Web-CogReasoner reaches 30.2% versus Gemini 2.5 Pro's 54.9%, so real-world web interaction capability still has considerable room for improvement.
- Weak cross-website generalization: Cross-site generalization on Online Mind2Web (10.1%) is substantially below Claude (21.7%), and unseen websites remain a challenge.
- Limited website coverage in training data: The training set is constructed from only 14 websites; generalization to a broader range of domains and sites remains to be validated.
- Future directions: Integrating reinforcement learning to enhance exploration, generalization, and autonomous discovery of procedural knowledge.
Related Work & Insights
- Complementarity with UI-TARS: UI-TARS excels at visual perception but is weak in cognitive reasoning; Web-CogReasoner bridges the gap from perception to reasoning through its knowledge framework.
- Relationship with AutoWebGLM: AutoWebGLM also employs curriculum learning but focuses solely on the skill dimension of structure recognition → component understanding → task execution, without establishing a knowledge-tier theory.
- Relationship with CoT reasoning: KCoT is not a general-purpose chain-of-thought but a structured reasoning process organized by knowledge tier—Factual→Conceptual→Procedural—where each layer has explicit resource dependencies.
Rating
- Novelty: ⭐⭐⭐⭐ (Applying Bloom's Taxonomy to Web Agent training is an original contribution; the KCoT framework is theoretically grounded.)
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ (4 benchmarks, detailed ablations, tier dependency verification, efficiency analysis, and cross-domain generalization tests.)
- Writing Quality: ⭐⭐⭐⭐ (Theoretical framework is clearly articulated; educational motivation is self-consistent; experimental organization is sound.)
- Value: ⭐⭐⭐⭐ (Provides a systematic training methodology for Web Agents; the dataset and benchmark contribute community value.)