# Towards Scalable Web Accessibility Audit with MLLMs as Copilots

**Conference:** AAAI 2026 · **arXiv:** 2511.03471 · **Code:** eaglelab-zju/AAA · **Area:** Multimodal VLM · **Keywords:** web accessibility, WCAG-EM, multimodal LLM, graph neural network, page sampling
## TL;DR
This paper proposes the AAA framework, which operationalizes the WCAG-EM standard through two key innovations—GRASP (Graph-based multimodal page sampling) and MaC (MLLM as Copilot)—enabling scalable end-to-end web accessibility auditing.
## Background & Motivation
Web accessibility is fundamental to digital inclusion, yet recent surveys reveal that 94.8% of top-one-million website homepages contain WCAG violations. The root cause lies not in a lack of education or tooling, but in the resource bottleneck of the audit process itself:
- Limitations of existing tools: Tools such as WAVE and Axe only perform hard-coded rule checks (e.g., missing alt text, insufficient contrast) and cannot cover semantic or cognitive accessibility issues.
- Difficulty executing WCAG-EM: Although W3C's five-step audit methodology is well-standardized, no technical framework exists to support its execution at scale.
- Inadequate page sampling: Existing clustering methods such as SDC rely solely on shallow textual statistics, ignoring multimodal signals such as visual layout and hyperlink structure.
- Human evaluation bottleneck: Manually identifying accessibility-critical components (structured pages, complete processes) demands substantial expert effort.
## Method

### Overall Architecture of the AAA Framework
The framework aligns with the five-step WCAG-EM process: website crawling → automated checking → page sampling → manual inspection → reporting/remediation. The core innovations lie in the page sampling and manual inspection stages.
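The five-stage flow can be sketched as a simple pipeline that threads one audit state through each step. This is an illustrative skeleton, not the paper's implementation; every stage body below is a placeholder.

```python
from dataclasses import dataclass, field

@dataclass
class AuditState:
    pages: list = field(default_factory=list)            # crawled pages
    auto_findings: list = field(default_factory=list)    # rule-checker output
    sample: list = field(default_factory=list)           # sampled pages
    manual_findings: list = field(default_factory=list)  # copilot-assisted review
    report: str = ""

# Placeholder stages standing in for: crawling, automated checking,
# GRASP sampling, MaC-assisted manual inspection, and reporting.
def crawl(url, s):        s.pages = [f"{url}/", f"{url}/contact"]; return s
def auto_check(url, s):   s.auto_findings = [("missing-alt", p) for p in s.pages]; return s
def sample_pages(url, s): s.sample = s.pages[:1]; return s
def inspect(url, s):      s.manual_findings = [("cognitive-load", p) for p in s.sample]; return s
def report(url, s):
    s.report = f"{len(s.auto_findings)} automated + {len(s.manual_findings)} manual findings"
    return s

def run_audit(url: str) -> AuditState:
    """Thread one state object through the five WCAG-EM stages in order."""
    s = AuditState()
    for stage in (crawl, auto_check, sample_pages, inspect, report):
        s = stage(url, s)
    return s
```

The point of the skeleton is that each stage only reads what earlier stages produced, mirroring the strict ordering of the WCAG-EM process.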
### GRASP: Graph-Based Multimodal Page Sampling
Page representativeness is defined along three dimensions:

1. Textual semantic representativeness: BERT extracts contextualized semantic representations from DOM text.
2. Visual layout representativeness: ViT learns layout-level visual representations from page screenshots.
3. Link-structural representativeness: a GNN over the hyperlink graph learns structural representations.
The modalities are fused by concatenation, \(\mathbf{X} = \mathbf{H}_t \,\Vert\, \mathbf{H}_v\). After GNN message passing over the hyperlink graph, k-means clustering is applied, and the page closest to each cluster centroid is selected as a sample. A representativeness-enhanced graph learning module additionally uses the clustering results to prune noisy hyperlink edges and to add edges between semantically similar but non-adjacent pages.
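The fuse-cluster-select step can be sketched as follows. This is a minimal NumPy illustration of the idea, not GRASP itself: the GNN message passing and the graph-refinement module are omitted, and `grasp_sample` is a hypothetical name.

```python
import numpy as np

def grasp_sample(h_text, h_vis, k, n_iter=50, seed=0):
    """Fuse modality embeddings (X = H_t || H_v), run k-means,
    and return the index of the page nearest each centroid."""
    X = np.concatenate([h_text, h_vis], axis=1)  # modality fusion by concatenation
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # assign each page to its nearest centroid, then update centroids
        d = np.linalg.norm(X[:, None] - centroids[None], axis=2)
        labels = d.argmin(axis=1)
        for c in range(k):
            members = X[labels == c]
            if len(members):
                centroids[c] = members.mean(axis=0)
    d = np.linalg.norm(X[:, None] - centroids[None], axis=2)
    labels = d.argmin(axis=1)
    # representative page = cluster member closest to its centroid
    sampled = []
    for c in range(k):
        idx = np.where(labels == c)[0]
        if len(idx):
            sampled.append(int(idx[d[idx, c].argmin()]))
    return sorted(sampled)
```

Selecting the centroid-nearest member (rather than the centroid itself) guarantees that every sampled item is a real, auditable page.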
### MaC: MLLM as a Multi-Role Copilot
- Assistant: Automatically identifies WCAG-EM-defined structured pages (common/relevant/essential/technology-dependent) to assist individual-feature-based page sampling; pre-extracts accessibility-critical elements (search bars, forms, CAPTCHAs, etc.).
- Auditor: Evaluates cognitive accessibility issues overlooked by conventional tools (WCAG 2.2 SC 3.3.8/3.3.9), such as the cognitive load imposed by CAPTCHAs.
- Consultant: Provides remediation suggestions (identified as a future direction).
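The paper does not publish its prompts, but the three roles can be framed as distinct system prompts over one chat-completion interface. The wording below is illustrative only; `MAC_ROLES` and `build_messages` are hypothetical names.

```python
# Illustrative system prompts for the three MaC roles (wording assumed).
MAC_ROLES = {
    "assistant": (
        "You are an accessibility audit assistant. Given a page screenshot "
        "and DOM excerpt, classify the page as one of: common, relevant, "
        "essential, technology-dependent; then list accessibility-critical "
        "components (search bars, forms, CAPTCHAs, contact information)."
    ),
    "auditor": (
        "You are an accessibility auditor. Assess cognitive accessibility "
        "per WCAG 2.2 SC 3.3.8/3.3.9, e.g. the cognitive load a CAPTCHA "
        "imposes, and report violations with evidence."
    ),
    "consultant": (
        "You are an accessibility consultant. For each reported violation, "
        "propose a concrete remediation."
    ),
}

def build_messages(role: str, page_context: str) -> list[dict]:
    """Assemble a chat-completion style message list for one MaC role."""
    if role not in MAC_ROLES:
        raise ValueError(f"unknown role: {role}")
    return [
        {"role": "system", "content": MAC_ROLES[role]},
        {"role": "user", "content": page_context},
    ]
```

Keeping the roles as separate prompts (rather than one monolithic instruction) mirrors the paper's division of labor: the assistant runs before sampling, the auditor during manual inspection, the consultant afterward.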
## Four New Datasets
- TPS: 97,246 pages from 495 websites, including DOM, screenshots, Axe checks, and adjacency matrices.
- APR: 968 pages across 5 website categories, annotated for 4 types of WCAG-EM structured pages.
- CCT: 1,985 CAPTCHA images across 17 authentication task types, for evaluating cognitive accessibility.
- CPE: 1,199 pages annotated for 5 component categories: search, filter, form, CAPTCHA, and contact information.
## Key Experimental Results

### GRASP Page Sampling (Average over 495 Websites)
| Method | Layout \(S_{\text{sampled}}\)↓ | Layout \(D_{\text{intra-inter}}\)↑ | Text \(S_{\text{sampled}}\)↓ | Text \(D_{\text{intra-inter}}\)↑ |
|---|---|---|---|---|
| SDC_content | 56.66 | 9.96 | 89.29 | 2.73 |
| SDC_tags | 54.18 | 10.76 | 88.76 | 2.12 |
| GRASP_GCN | 51.54 | 13.05 | 86.99 | 1.59 |
| GRASP_IGNN | 44.31 | 14.94 | 80.45 | 7.40 |
GRASP_IGNN substantially outperforms baselines in both representation spaces, demonstrating that heterogeneous graph modeling is better suited to website hyperlink structures.
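The table's two metric families can be read as diversity and separation measures. A plausible reading, sketched below with assumed definitions (the paper's exact formulas may differ): \(S_{\text{sampled}}\) as mean pairwise similarity among sampled pages (lower = more diverse sample), and \(D_{\text{intra-inter}}\) as inter-cluster separation minus intra-cluster spread (higher = cleaner clusters).

```python
import numpy as np

def sampled_similarity(X, sampled):
    """Mean pairwise cosine similarity among sampled pages
    (lower = more diverse). Definition assumed for illustration."""
    V = X[sampled]
    V = V / np.linalg.norm(V, axis=1, keepdims=True)
    sim = V @ V.T
    n = len(sampled)
    return (sim.sum() - n) / (n * (n - 1))  # exclude the diagonal self-similarities

def intra_inter_gap(X, labels):
    """Mean inter-centroid distance minus mean intra-cluster distance
    to centroid (higher = better-separated clusters). Definition assumed."""
    ks = np.unique(labels)
    cents = np.array([X[labels == k].mean(axis=0) for k in ks])
    intra = np.mean([np.linalg.norm(X[labels == k] - cents[i], axis=1).mean()
                     for i, k in enumerate(ks)])
    inter = np.mean([np.linalg.norm(cents[i] - cents[j])
                     for i in range(len(ks)) for j in range(i + 1, len(ks))])
    return inter - intra
```

Under this reading, GRASP_IGNN's lower \(S_{\text{sampled}}\) and higher \(D_{\text{intra-inter}}\) mean its samples are both more diverse and drawn from better-separated clusters.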
### MaC F1 on APR/CPE
- GPT-4o achieves F1 = 98.01% on search bar identification and F1 = 95.33% on CAPTCHA detection.
- The open-weight Qwen2.5-VL-72B achieves F1 = 80.21% on relevant page identification, far surpassing GPT-4o (35.44%).
- For cognitive CAPTCHA classification, fine-tuned InternVL2-8B achieves macro-F1 = 45.58%, outperforming GPT-4o (29.16%).
## Highlights & Insights
- First end-to-end web accessibility audit (WAA) framework: Aligns with the full five-step WCAG-EM process, covering the entire audit lifecycle.
- Multimodal page sampling: The first approach to jointly integrate textual, visual, and link-structural representativeness; GRASP_IGNN significantly outperforms text-only methods.
- Multi-role MLLM positioning: Goes beyond the narrow scope of evaluation and remediation to explore MLLM applications in sampling, pre-audit localization, and cognitive accessibility assessment.
- Potential of small models: Experiments demonstrate that fine-tuned 8B models can serve as domain specialists with high cost-effectiveness.
## Limitations & Future Work
- GRASP relies on the quality of BERT/ViT pretraining; its effectiveness on non-English websites has not been validated.
- MLLMs still have substantial room for improvement on tasks such as relevant page identification (GPT-4o F1 only 35%).
- The highest macro-F1 for cognitive CAPTCHA classification is 45.58%, which remains far from practical requirements.
- Dataset scale is limited (APR covers only 968 pages across 5 website categories), and generalizability requires further validation.
## Rating
- Novelty: ⭐⭐⭐⭐ — First work to systematically integrate MLLMs and GNNs into the full WCAG-EM audit pipeline.
- Experimental Thoroughness: ⭐⭐⭐⭐ — Sampling experiments on 495 websites, comparisons with 5 MLLMs, and 4 datasets provide broad coverage.
- Writing Quality: ⭐⭐⭐⭐ — Framework is clearly presented and well-aligned with the standard, though the paper is detail-heavy.
- Value: ⭐⭐⭐⭐ — Directly applicable to large-scale web accessibility auditing.