SlideAgent: Hierarchical Agentic Framework for Multi-Page Visual Document Understanding

Conference: ACL 2026
arXiv: 2510.26615
Code: SlideAgent
Area: Information Retrieval / Document Understanding
Keywords: Multi-page document understanding, hierarchical agent, visual document QA, slide understanding, element-level reasoning

TL;DR

SlideAgent is a hierarchical agentic framework that builds structured knowledge representations at three specialized levels: global, page, and element. It improves fine-grained understanding of multi-page visual documents, particularly slides, with overall gains of 5.5 to 10.6 points over the GPT-4o backbone across four benchmarks.

Background & Motivation

Background: Multi-page visual documents (e.g., financial reports, academic presentations, technical manuals) are prevalent in high-stakes fields such as finance, science, and education. These documents rely not only on text but also on layout, icons, color coding, and cross-page references to convey information.

Limitations of Prior Work: Current Multi-modal Large Language Models (MLLMs) face three challenges in multi-page visual document understanding: (1) Insufficient fine-grained reasoning — MLLMs tend to process pages holistically, ignoring element-level details (e.g., specific data segments in charts); (2) Lack of domain-specific visual semantics — pre-training based on natural images results in poor understanding of professional charts, icon meanings, and spatial layouts; (3) Metadata dependency — many systems rely on clean document metadata (e.g., chart bounding boxes, hierarchical labels), which is often missing or corrupted in real-world scenarios.

Key Challenge: MLLMs may fail when reasoning about a full page (e.g., miscounting categories in a chart) but can correctly identify information once relevant charts are cropped. This indicates that models possess reasoning capabilities but lack effective fine-grained information extraction mechanisms.

Goal: To build a universal agent framework capable of processing multi-page multi-modal documents without relying on metadata, achieving precise document understanding through hierarchical knowledge construction and selective agent activation.

Core Idea: Drawing from human information processing, document understanding is decomposed into three levels: global (overall theme), page (single-page features and cross-page relationships), and element (fine-grained parsing of charts, text blocks, and icons). Specialized agents at each level collaborate during both knowledge construction and reasoning phases.

Method

Overall Architecture

SlideAgent operates in two phases: (1) Knowledge Construction — a hierarchical, query-agnostic knowledge base \(\mathcal{K}=\{\mathcal{K}_g, \mathcal{K}_p, \mathcal{K}_e\}\) is built top-down; (2) Reasoning — specialized agents are selectively activated based on user query classification for multi-level retrieval and answer synthesis. The framework is model-agnostic and can be paired with various backbones such as GPT-4o or InternVL3-8B.

Key Designs

  1. Hierarchical Knowledge Construction:

    • Function: Constructs a structured, query-agnostic knowledge representation for the document.
    • Mechanism: The global agent \(\mathcal{M}_g\) generates document-level summaries and themes by sampling the first few pages; the page agent \(\mathcal{M}_p\) processes pages sequentially, generating page-level knowledge \(\mathcal{K}_p^i = \mathcal{M}_p(v_i, \mathcal{K}_g^{(0)}, \mathcal{K}_p^{i-1})\) conditioned on global and previous page context; the element agent \(\mathcal{M}_e\) decomposes each page into elements (text blocks, charts, icons) via layout parsing and generates semantic roles and functional descriptions for each.
    • Design Motivation: Global knowledge provides the thematic framework, page knowledge provides sequential context and cross-page associations, and element knowledge provides fine-grained spatial and content information—the three are complementary and essential.
  2. Query Classification & Selective Activation:

    • Function: Precisely activates the required subset of agents based on the query type to avoid unnecessary computation and noise.
    • Mechanism: Queries are classified into four categories: global understanding (global agent only), factual queries (page + element agents), multi-hop reasoning (all agents), and layout/visual relations (element agent). All agents are activated by default for unclassifiable queries.
    • Design Motivation: Different queries require different levels of granularity. Excessive agent activation increases computational overhead and may introduce noise; selective activation balances efficiency and accuracy.
  3. Subquery Generation & Multi-level Retrieval:

    • Function: Expands the original query into multiple subqueries for precise retrieval at the page and element levels.
    • Mechanism: Key entities are extracted from the query and used to generate subqueries. The original query and its subqueries are concatenated for joint retrieval of the top-k pages and their elements. Sparse (BM25), dense (SFR), and multi-modal (COLPALI) retrievers are supported.
    • Design Motivation: Original queries are often short, leading to noisy retrieval; subquery generation enhances semantic coverage, particularly benefiting multi-hop reasoning scenarios.
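
The routing and retrieval steps in designs 2 and 3 can be sketched as follows. This is a minimal illustration, not the authors' implementation: the category keys, the function names, and the token-overlap scorer are assumptions made for this example (the scorer is only a stand-in for BM25, SFR, or COLPALI), and in the framework itself the classification and subquery generation are performed by prompted MLLMs.

```python
import re
from typing import Dict, List, Set

# Query category -> agents to activate. The category keys are illustrative
# names for the four classes described above.
ROUTES: Dict[str, Set[str]] = {
    "global": {"global"},
    "factual": {"page", "element"},
    "multi_hop": {"global", "page", "element"},
    "layout": {"element"},
}
ALL_AGENTS: Set[str] = {"global", "page", "element"}


def select_agents(query_type: str) -> Set[str]:
    """Return the agents for a query type; unclassifiable types get all agents."""
    return ROUTES.get(query_type, ALL_AGENTS)


def expand_query(query: str, subqueries: List[str]) -> str:
    """Concatenate the original query with its subqueries for joint retrieval."""
    return " ".join([query, *subqueries])


def _tokens(text: str) -> Set[str]:
    return set(re.findall(r"\w+", text.lower()))


def retrieve_top_k(query: str, page_texts: List[str], k: int = 2) -> List[int]:
    """Rank pages by token overlap with the query (toy stand-in for a real retriever)."""
    q = _tokens(query)
    order = sorted(
        range(len(page_texts)),
        key=lambda i: (-len(q & _tokens(page_texts[i])), i),  # ties broken by page order
    )
    return order[:k]


pages = [
    "revenue grew in 2023",
    "team photo and agenda",
    "operating margin by region 2023",
]
query = "What was revenue in 2023?"
subqueries = ["operating margin by region"]  # hypothetical generated subquery

plain_hits = retrieve_top_k(query, pages)
expanded_hits = retrieve_top_k(expand_query(query, subqueries), pages)
```

In this toy example the expanded query moves the page about operating margin to the top of the ranking, which is the kind of coverage gain the subquery step is meant to provide for short queries.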

Loss & Training

SlideAgent is training-free: all agents are implemented via prompt engineering on existing MLLMs, with no additional training or fine-tuning. During knowledge construction, prompt templates guide the agents to generate structured knowledge. Global knowledge additionally goes through a refine step (a single full-field rewrite) that synthesizes information from all pages and reduces bias toward the initial pages.
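
The top-down construction loop, including the page-level recurrence \(\mathcal{K}_p^i = \mathcal{M}_p(v_i, \mathcal{K}_g^{(0)}, \mathcal{K}_p^{i-1})\) and the final refine step, might look roughly like the sketch below. All names here (`KnowledgeBase`, `build_knowledge`, the agent callables, `n_sample`) are hypothetical; the agents are passed in as plain functions standing in for prompted MLLM calls, and pages are represented as strings rather than images.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Sequence


@dataclass
class KnowledgeBase:
    global_k: str = ""                                        # K_g (after refinement)
    page_k: List[str] = field(default_factory=list)           # K_p^i, one entry per page
    element_k: List[List[str]] = field(default_factory=list)  # K_e, elements per page


def build_knowledge(
    pages: Sequence[str],
    global_agent: Callable[[Sequence[str]], str],
    page_agent: Callable[[str, str, str], str],
    element_agent: Callable[[str], List[str]],
    refine: Callable[[str, List[str]], str],
    n_sample: int = 3,  # assumed number of leading pages sampled for K_g^(0)
) -> KnowledgeBase:
    kb = KnowledgeBase()
    k_g0 = global_agent(pages[:n_sample])    # initial global knowledge K_g^(0)
    prev = ""                                # K_p^0: no previous page yet
    for page in pages:
        prev = page_agent(page, k_g0, prev)  # conditioned on K_g^(0) and K_p^{i-1}
        kb.page_k.append(prev)
        kb.element_k.append(element_agent(page))
    kb.global_k = refine(k_g0, kb.page_k)    # single rewrite using all pages
    return kb


# Toy stand-ins for the prompted agents, used only to exercise the control flow.
kb = build_knowledge(
    pages=["intro slide", "revenue chart", "summary table", "appendix"],
    global_agent=lambda ps: "theme from " + str(len(ps)) + " pages",
    page_agent=lambda v, g, prev: f"{v} <- {prev or 'start'}",
    element_agent=lambda v: v.split(),
    refine=lambda g, pk: f"{g}, refined over {len(pk)} pages",
)
```

Because the knowledge base is query-agnostic, a structure like this would be built once per document and reused for every query.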

Key Experimental Results

Main Results

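| Dataset | Backbone | Metric | SlideAgent | Backbone alone | Gain |
|---|---|---|---|---|---|
| SlideVQA | GPT-4o | Overall | 84.9 | 77.0 | +7.9% |
| TechSlides | GPT-4o | Overall | 70.9 | 63.4 | +7.5% |
| FinSlides | GPT-4o | Overall | 85.5 | 80.0 | +5.5% |
| InfoVQA | GPT-4o | Overall | 79.6 | 69.0 | +10.6% |
| SlideVQA | InternVL3 | Overall | 72.7 | 63.0 | +9.8% |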

Ablation Study

| Configuration | Change in Overall (GPT-4o) | Description |
|---|---|---|
| w/o Page Agent | -6.3 | Largest drop; page-level reasoning is critical for cross-page coherence. |
| w/o Element Agent | -4.6 | Fine-grained reasoning is vital for numerical questions. |
| w/o Global Agent | -2.8 | Smallest drop; low-level agents partially embed global context. |
| w/o Subquery | -5.0 | Impact is particularly significant in retrieval scenarios. |

Key Findings

  • Hierarchical knowledge construction improves not only QA performance but also page-level retrieval (text retriever SFR achieved a +6.4 MRR gain).
  • Multi-hop reasoning queries saw the largest improvement (+9.8%), demonstrating the value of structured knowledge for complex reasoning.
  • In the oracle setting (providing ground-truth pages), a +7.7% gain was still achieved, indicating the independent value of element-level retrieval.
  • Only 12.5% of errors were attributed to OCR/parsing failures, with most errors originating from question ambiguity and answer labeling issues.
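
The summary does not spell out how the retrieval metric is computed. For reference, the standard definition of mean reciprocal rank (MRR) averages, over queries, the reciprocal of the rank at which the first correct page appears. The sketch below assumes one gold page per query and scores a query as 0 when the gold page is not retrieved; the function name and the toy data are illustrative only.

```python
from typing import List, Sequence


def mean_reciprocal_rank(ranked_lists: Sequence[List[int]], gold_pages: Sequence[int]) -> float:
    """Average of 1/rank of the gold page over all queries (0 if it is missing)."""
    total = 0.0
    for ranked, gold in zip(ranked_lists, gold_pages):
        for rank, page in enumerate(ranked, start=1):
            if page == gold:
                total += 1.0 / rank
                break
    return total / len(gold_pages)


# Three toy queries: gold page found at rank 2, at rank 1, and not found.
rankings = [[3, 1, 2], [2, 3, 1], [1, 2, 3]]
gold = [1, 2, 9]
mrr = mean_reciprocal_rank(rankings, gold)  # (1/2 + 1 + 0) / 3
```

A gain such as +6.4 is most naturally read with MRR scaled to 0-100, although that scaling is an assumption here.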

Highlights & Insights

  • Hierarchical Divide-and-Conquer: Adopting a "global-page-element" processing model inspired by human cognition is both systematic and intuitive, facilitating modular extension.
  • Training-free Plug-and-Play: The framework is entirely based on prompt engineering and existing MLLMs, allowing direct application to any backbone model.
  • Necessity of Element-level Reasoning: The visual case in Figure 1 demonstrates MLLM failure in whole-page reasoning vs. success after element-level cropping, providing compelling evidence.
  • Knowledge Construction Gains for Retrieval: The generated structured knowledge (page descriptions and subqueries) serves both as context for QA and as an enhancement signal for retrieval.
  • Model Agnosticism: Consistent and significant improvements across diverse backbones like GPT-4o and InternVL3-8B.

Limitations & Future Work

  • Element boundaries depend on OCR and layout parsing tools; quality may vary based on the choice of tools.
  • Initial global knowledge sampling (first three pages) may lack representation for long documents; content-based page selection could be explored.
  • The experiments primarily use a text retriever (SFR); the potential of multi-modal retrievers requires further investigation.
  • Multi-turn dialogue scenarios remain unaddressed; extension to interactive document QA is a key direction.
  • Knowledge construction incurs high computational overhead, requiring individual MLLM calls per page.

Comparison with Related Work

  • vs. ViDoRAG: ViDoRAG also uses a multi-agent architecture, but SlideAgent's hierarchical design and element-level parsing are finer-grained, and it achieves superior performance across all datasets.
  • vs. VDocRAG: VDocRAG combines retrieval and reasoning but lacks element-level decomposition; SlideAgent shows a distinct advantage in numerical reasoning (Num).
  • vs. COLPALI: COLPALI is a pure multi-modal retrieval method; SlideAgent demonstrates that combining text retrieval with structured knowledge can match or exceed multi-modal retrieval performance.

Rating

  • Novelty: ⭐⭐⭐⭐ Combination of hierarchical agents and element-level reasoning is novel in the document understanding field.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ Comprehensive evaluation across 4 datasets, 15+ baselines, and detailed ablation and error analysis.
  • Writing Quality: ⭐⭐⭐⭐ Clear structure, intuitive case studies, and rigorous methodology descriptions.
  • Value: ⭐⭐⭐⭐ High generalizability with direct application potential for enterprise-level document understanding.