
Towards Scalable Web Accessibility Audit with MLLMs as Copilots

Conference: AAAI 2026 arXiv: 2511.03471 Code: eaglelab-zju/AAA Area: Multimodal VLM Keywords: web accessibility, WCAG-EM, multimodal LLM, graph neural network, page sampling

TL;DR

This paper proposes the AAA framework, which operationalizes the WCAG-EM standard through two key innovations—GRASP (Graph-based multimodal page sampling) and MaC (MLLM as Copilot)—enabling scalable end-to-end web accessibility auditing.

Background & Motivation

Web accessibility is fundamental to digital inclusion, yet recent surveys reveal that 94.8% of the homepages of the top one million websites contain WCAG violations. The root cause lies not in a lack of education or tooling, but in the resource bottleneck of the audit process itself:

  • Limitations of existing tools: Tools such as WAVE and Axe only perform hard-coded rule checks (e.g., missing alt text, insufficient contrast) and cannot cover semantic or cognitive accessibility issues.
  • Difficulty executing WCAG-EM: Although W3C's five-step audit methodology is well-standardized, no technical framework exists to support its execution at scale.
  • Inadequate page sampling: Existing clustering-based methods such as SDC rely solely on shallow textual statistics, ignoring multimodal signals such as visual layout and hyperlink structure.
  • Human evaluation bottleneck: Manually identifying accessibility-critical components (structured pages, complete processes) demands substantial expert effort.

Method

Overall Architecture of the AAA Framework

The framework aligns with the five-step WCAG-EM process: website crawling → automated checking → page sampling → manual inspection → reporting/remediation. The core innovations lie in the page sampling and manual inspection stages.

GRASP: Graph-Based Multimodal Page Sampling

Page representativeness is defined along three dimensions:

  1. Textual semantic representativeness: BERT extracts contextualized semantic representations from DOM text.
  2. Visual layout representativeness: ViT learns layout-level visual representations from page screenshots.
  3. Link-structural representativeness: a GNN over the hyperlink graph learns structural representations.

Text and visual features are fused by concatenation, \(\mathbf{X} = \mathbf{H}_t \,\|\, \mathbf{H}_v\), and refined by GNN message passing over the hyperlink graph. k-means clustering is then applied to the resulting node embeddings, and the page closest to each cluster centroid is selected as a sample. A representativeness-enhanced graph learning module is additionally introduced, which uses the clustering results to prune noisy hyperlink edges and to add edges between semantically similar but non-adjacent pages.
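The sampling step can be sketched in a few lines. This is a minimal stand-in, not the paper's implementation: the random matrices substitute for BERT/ViT embeddings, a plain Lloyd's k-means replaces the learned pipeline, and the GNN refinement over the hyperlink graph is omitted entirely.

```python
import numpy as np

def grasp_sample(h_text, h_vis, k, iters=50, seed=0):
    """GRASP-style sampling sketch: fuse per-page embeddings, cluster with
    k-means, and return the page index nearest each centroid.

    h_text / h_vis stand in for BERT text and ViT layout embeddings; the
    paper additionally refines the fused features with a GNN over the
    hyperlink graph, which this sketch omits.
    """
    x = np.concatenate([h_text, h_vis], axis=1)  # X = H_t || H_v
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(iters):  # plain Lloyd's k-means
        d = np.linalg.norm(x[:, None, :] - centroids[None], axis=-1)
        labels = d.argmin(axis=1)
        for c in range(k):
            if (labels == c).any():
                centroids[c] = x[labels == c].mean(axis=0)
    # one representative per non-empty cluster: the member nearest its centroid
    d = np.linalg.norm(x[:, None, :] - centroids[None], axis=-1)
    sampled = []
    for c in range(k):
        members = np.flatnonzero(labels == c)
        if members.size:
            sampled.append(int(members[d[members, c].argmin()]))
    return sorted(set(sampled))

rng = np.random.default_rng(1)
pages = grasp_sample(rng.normal(size=(40, 16)), rng.normal(size=(40, 8)), k=4)
print(pages)  # up to four representative page indices
```

Selecting the member nearest each centroid (rather than the centroid itself) guarantees every sampled point is a real, auditable page.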

MaC: MLLM as a Multi-Role Copilot

  • Assistant: Automatically identifies WCAG-EM-defined structured pages (common/relevant/essential/technology-dependent) to assist individual-feature-based page sampling; pre-extracts accessibility-critical elements (search bars, forms, CAPTCHAs, etc.).
  • Auditor: Evaluates cognitive accessibility issues overlooked by conventional tools (WCAG 2.2 SC 3.3.8/3.3.9), such as the cognitive load imposed by CAPTCHAs.
  • Consultant: Provides remediation suggestions (identified as a future direction).
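The three roles above can be viewed as different prompt templates over the same MLLM. Below is a hypothetical sketch of the Assistant role; the prompt wording and function name are illustrative inventions, not the paper's actual templates, though the four structured-page types and the component list come from the paper.

```python
# WCAG-EM's four structured-page types (from the paper).
STRUCTURED_TYPES = ("common", "relevant", "essential", "technology-dependent")

def build_assistant_prompt(url: str, dom_excerpt: str) -> str:
    """Build a classification prompt to send to an MLLM alongside the page
    screenshot. Hypothetical wording; the paper's templates are not reproduced.
    """
    options = ", ".join(STRUCTURED_TYPES)
    return (
        "You are an accessibility-audit assistant following WCAG-EM.\n"
        f"Classify the page at {url} into exactly one structured-page "
        f"type: {options}.\n"
        "Also list any accessibility-critical components visible on the page "
        "(search bar, filter, form, CAPTCHA, contact information).\n"
        f"DOM excerpt:\n{dom_excerpt[:2000]}\n"
        "Answer with the type label on the first line."
    )

prompt = build_assistant_prompt("https://example.com/login",
                                "<form><input type='password'></form>")
print(prompt)
```

The Auditor and Consultant roles would swap in prompts targeting WCAG 2.2 success criteria (e.g., SC 3.3.8/3.3.9 for cognitive load) and remediation advice, respectively.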

Four New Datasets

  • TPS: 97,246 pages from 495 websites, including DOM, screenshots, Axe checks, and adjacency matrices.
  • APR: 968 pages across 5 website categories, annotated for 4 types of WCAG-EM structured pages.
  • CCT: 1,985 CAPTCHA images across 17 authentication task types, for evaluating cognitive accessibility.
  • CPE: 1,199 pages annotated for 5 component categories: search, filter, form, CAPTCHA, and contact information.

Key Experimental Results

GRASP Page Sampling (Average over 495 Websites)

| Method | Layout \(S_{sampled}\) | Layout \(D_{intra-inter}\) | Text \(S_{sampled}\) | Text \(D_{intra-inter}\) |
| --- | --- | --- | --- | --- |
| SDC_content | 56.66 | 9.96 | 89.29 | 2.73 |
| SDC_tags | 54.18 | 10.76 | 88.76 | 2.12 |
| GRASP_GCN | 51.54 | 13.05 | 86.99 | 1.59 |
| GRASP_IGNN | 44.31 | 14.94 | 80.45 | 7.40 |

GRASP_IGNN substantially outperforms the baselines in both the layout and text representation spaces, suggesting that heterogeneous graph modeling is better suited to website hyperlink structures than text-only clustering.
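The paper's exact metric definitions are not reproduced here. One plausible reading, consistent with the direction of the numbers above, is that \(S_{sampled}\) measures average pairwise similarity among sampled pages (lower = more diverse coverage) and \(D_{intra-inter}\) a separation margin between intra- and inter-cluster distances (higher = tighter, better-separated clusters). A sketch under that assumption only:

```python
import numpy as np

def pairwise_cosine_mean(x):
    """Mean off-diagonal pairwise cosine similarity of row vectors
    (a guessed stand-in for S_{sampled}; lower = more diverse samples)."""
    xn = x / np.linalg.norm(x, axis=1, keepdims=True)
    sim = xn @ xn.T
    n = len(x)
    return (sim.sum() - n) / (n * (n - 1))

def separation_margin(x, labels):
    """Mean inter-centroid distance minus mean intra-cluster distance
    (a guessed stand-in for D_{intra-inter}; higher = better separation)."""
    ks = np.unique(labels)
    centroids = np.stack([x[labels == k].mean(axis=0) for k in ks])
    intra = np.mean([np.linalg.norm(x[labels == k] - centroids[i], axis=1).mean()
                     for i, k in enumerate(ks)])
    diffs = centroids[:, None, :] - centroids[None, :, :]
    inter = np.linalg.norm(diffs, axis=-1)
    inter_mean = inter[np.triu_indices(len(ks), 1)].mean()
    return inter_mean - intra

# Two tight, well-separated clusters yield a large positive margin.
x = np.array([[0., 1.], [0., 1.1], [5., 0.], [5., 0.1]])
labels = np.array([0, 0, 1, 1])
print(round(separation_margin(x, labels), 2))  # → 5.05
```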

MaC F1 on APR/CPE

  • GPT-4o achieves F1 = 98.01% on search bar identification and F1 = 95.33% on CAPTCHA detection.
  • The open-weight Qwen2.5-VL-72B achieves F1 = 80.21% on relevant-page identification, far surpassing GPT-4o (35.44%).
  • For cognitive CAPTCHA classification, fine-tuned InternVL2-8B achieves macro-F1 = 45.58%, outperforming GPT-4o (29.16%).

Highlights & Insights

  • First end-to-end web accessibility audit (WAA) framework: Aligns with the full five-step WCAG-EM process, covering the entire audit lifecycle.
  • Multimodal page sampling: The first approach to jointly integrate textual, visual, and link-structural representativeness; GRASP_IGNN significantly outperforms text-only methods.
  • Multi-role MLLM positioning: Goes beyond the narrow scope of evaluation and remediation to explore MLLM applications in sampling, pre-audit localization, and cognitive accessibility assessment.
  • Potential of small models: Experiments demonstrate that fine-tuned 8B models can serve as domain specialists with high cost-effectiveness.

Limitations & Future Work

  • GRASP relies on the quality of BERT/ViT pretraining; its effectiveness on non-English websites has not been validated.
  • MLLMs still have substantial room for improvement on tasks such as relevant-page identification (GPT-4o reaches only 35.44% F1).
  • The highest macro-F1 for cognitive CAPTCHA classification is 45.58%, which remains far from practical requirements.
  • Dataset scale is limited (APR covers only 968 pages across 5 website categories), and generalizability requires further validation.

Rating

  • Novelty: ⭐⭐⭐⭐ — First work to systematically integrate MLLMs and GNNs into the full WCAG-EM audit pipeline.
  • Experimental Thoroughness: ⭐⭐⭐⭐ — Sampling experiments on 495 websites, comparisons with 5 MLLMs, and 4 datasets provide broad coverage.
  • Writing Quality: ⭐⭐⭐⭐ — Framework is clearly presented and well-aligned with the standard, though the paper is detail-heavy.
  • Value: ⭐⭐⭐⭐ — Directly applicable to large-scale web accessibility auditing.