Position: Assistive Agents Need Accessibility Alignment

Conference: ICML 2026
arXiv: 2605.13579
Code: None
Area: Agent / Accessible AI / Human-centric Alignment
Keywords: Blind assistance, accessibility alignment, agentic AI, risk calibration, lifecycle design

TL;DR

This is a position paper. Through a systematic review of 778 blind assistance task instances from 417 papers, the authors argue that "accessibility alignment" should be considered a primary alignment objective for agents, on par with helpful/harmless/honest, and propose a design pipeline covering four dimensions: goal, interaction, risk, and lifecycle.

Background & Motivation

Background: Current agentic AI is rapidly advancing in multi-step reasoning, tool use, and autonomous decision-making. Researchers are applying these agents to accessibility scenarios such as blind navigation, street scene understanding, and UI operation, aiming to replace traditional white canes/screen readers with general-purpose agents.

Limitations of Prior Work: The authors present substantial evidence that even SOTA agents like GPT-4o, ChatGPT real-time video chat, and StreetReaderAI still produce "confident but incorrect" instructions in scenarios such as dynamic street scenes, blurry medication labels, and crossing streets. Since blind users cannot independently verify visual outputs, errors often go undetected and can directly cause physical harm.

Key Challenge: The root cause is that current agent design, training, and evaluation all implicitly assume: users can quickly verify outputs visually; errors are low-cost and easily iterated upon; and users and agents share the same visual context. All three assumptions fail for BVI (Blind and Visually Impaired) users, leading to four systemic failures: silent failure, overconfident hallucination, miscalibrated autonomy, and cognitive overload.

Goal: (1) Use large-scale instance data to characterize the real distribution of blind assistance tasks; (2) Argue that accessibility is not a UI patch but an agent alignment issue; (3) Provide a practical design pipeline.

Key Insight: Although framed as a position paper, the work is grounded in data. The authors first build a statistical profile of 778 real task instances, then diagnose failures through a "stressor → failure mode → violated design assumption" causal chain, and finally propose an alignment framework. The framework is thus derived from empirical evidence rather than intuition.

Core Idea: Elevate accessibility from the HCI interface layer to the agent core, making it a primary alignment objective alongside helpfulness, harmlessness, and honesty, and operationalize it through a four-dimensional framework: goal, interaction, risk, and lifecycle.

Method

This paper does not present a traditional algorithm. Instead, it builds a complete chain of reasoning that runs from task classification through failure diagnosis and alignment definition to an engineering pipeline. This summary treats that chain as the "method."

Overall Architecture

The argument pipeline consists of four steps: (1) Build a task-centered taxonomy from 778 task instances; (2) Extract four types of stressors and their induced failure modes in BVI scenarios; (3) Attribute failure modes to three implicit agent assumptions plus capability-need mismatch; (4) Propose remedies using a four-dimensional alignment framework and a three-stage lifecycle pipeline.

Key Designs

  1. Four-category Taxonomy of 778 Tasks (Empirical Foundation):

    • Function: Blind assistance tasks are categorized into Mobility & Safety (34%), Reading & Text Access (35%), Object Recognition & Daily Operations (12%), and VQA Goal-directed Query (18%), with instance counts for each subcategory (e.g., hazard perception 108, path planning 116, interactive digital reading 100).
    • Mechanism: Task descriptions were extracted from 417 papers (2012-2025) across CV/GenAI/Robotics/HCI, followed by qualitative coding to obtain fine-grained tasks and frequency distributions. This empirical anchor grounds all subsequent arguments, avoiding pure speculation.
    • Design Motivation: To refute the notion that "accessibility is a marginal issue," the authors first use data to show these tasks are numerous, broadly distributed, and heavily concentrated in high-risk categories like mobility and reading.
  2. 4 Stressors × 4 Failure Modes Diagnostic Matrix:

    • Function: Four environmental characteristics of BVI scenarios act as stressors: limited verifiability (users cannot independently verify outputs), high-cost errors (irreversible and physically harmful), cognitive burden (narrow audio/haptic bandwidth), and privacy exposure (highly sensitive home/medical contexts). From these, the authors derive four failure modes: silent failure, overconfident hallucination, miscalibrated autonomy, and interaction-induced cognitive overload. They also specify which stressor combinations drive each failure.
    • Mechanism: Each failure mode is anchored to a specific stressor combination (e.g., silent failure is driven by limited verifiability combined with the asymmetric cost of errors), forming a causal chain: "environmental constraint → failure phenomenon → design responsibility."
    • Design Motivation: Transform "accessibility failures" from anecdotal complaints into engineering problems that can be traced back to their causes: if agent design addresses these four stressors, the corresponding failure modes can be eliminated.
  3. Four-dimensional Accessibility Alignment Framework + Lifecycle Pipeline:

    • Function: Alignment is decomposed into Goal (accessibility-defined success, including safety margin/critical-field reliability/recovery procedure), Interaction (chunked/landmark-based low-bandwidth non-visual protocols), Risk (uncertainty-triggered conservative actions + privacy by default), and Lifecycle (logging/feedback/safety updates). Each dimension corresponds to a concrete design artifact, such as Task Card, Accessibility Success Specification, Interaction Contract, Risk and Uncertainty Policy, Privacy Manifest, and Autonomy Calibration Specification.
    • Mechanism: The process is structured into Design, Deployment, and Post-deployment stages. The Design stage produces the six artifacts; the Deployment stage translates them into runtime guardrails (risk-triggered autonomy downgrade, safe pause, escalation); the Post-deployment stage covers near-miss logging, incident triage mapped to alignment dimensions, and safety updates with regression testing. The authors illustrate each concept (red-line failure, uncertainty trigger, evaluation shift, and runtime implication) with two cases: navigation and medication label reading.
    • Design Motivation: Position papers are often criticized for "raising flags without solutions," so the authors deliberately bind alignment dimensions to concrete artifacts and runtime behaviors, making the framework auditable and open to challenge.
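
To make the Deployment-stage guardrails more concrete, the following is a minimal Python sketch of a risk-triggered autonomy downgrade. It is not the authors' implementation: the `Observation` fields, the `guardrail` function, and the `low`/`high` thresholds are all hypothetical names and values chosen for illustration. Only the action vocabulary (autonomy downgrade, safe pause, escalation) comes from the paper summary above.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    PROCEED = "proceed"               # act autonomously
    DOWNGRADE = "autonomy_downgrade"  # describe options and let the user decide
    SAFE_PAUSE = "safe_pause"         # stop and ask the user to hold position
    ESCALATE = "human_escalation"     # hand off to a human assistant


@dataclass(frozen=True)
class Observation:
    confidence: float    # assumed to be a calibrated confidence in [0, 1]
    high_risk: bool      # True for red-line situations such as a street crossing
    trigger_fired: bool  # True if any uncertainty trigger is active (e.g., occlusion)


def guardrail(obs: Observation, low: float = 0.5, high: float = 0.9) -> Action:
    """Map an observation to a runtime action; thresholds are hypothetical."""
    if not 0.0 <= obs.confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    # High-risk situations demand more confidence before the agent may proceed.
    required = high if obs.high_risk else low
    uncertain = obs.trigger_fired or obs.confidence < required
    if not uncertain:
        return Action.PROCEED
    if obs.high_risk:
        # Conservative by default: pause, and escalate when confidence is very low.
        return Action.ESCALATE if obs.confidence < low else Action.SAFE_PAUSE
    return Action.DOWNGRADE


# Usage: a confident reading with an active trigger at a crossing still pauses.
decision = guardrail(Observation(confidence=0.95, high_risk=True, trigger_fired=True))
print(decision.value)
```

The design choice worth noting is that the uncertainty trigger overrides a high confidence score, which reflects the summary's point that conservative behavior should be an enforced runtime property rather than an optional feature.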

Loss & Training

As a position paper, there is no training objective. The authors suggest that future evaluation metrics should shift from task-completion indicators like SPL/path length/OCR accuracy to safety-aware metrics such as unsafe instruction rate, risk-trigger compliance, abstention precision/recall, critical-field accuracy, and critical hallucination rate.
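
As an illustration of how such safety-aware metrics could be computed from a decision log, here is a short Python sketch. The summary names the metrics but does not define them, so the definitions below are assumptions: abstention precision is taken as the share of abstentions that were warranted, abstention recall as the share of warranted abstentions that actually happened, and unsafe instruction rate as the share of issued instructions judged unsafe. The `Decision` record and its fields are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class Decision:
    abstained: bool        # the agent declined to give an instruction
    should_abstain: bool   # ground truth: the evidence was insufficient
    unsafe: bool = False   # only meaningful when an instruction was issued


def abstention_precision_recall(decisions: Sequence[Decision]) -> Tuple[float, float]:
    """Assumed definitions: precision over abstentions, recall over warranted ones."""
    true_pos = sum(d.abstained and d.should_abstain for d in decisions)
    n_abstained = sum(d.abstained for d in decisions)
    n_warranted = sum(d.should_abstain for d in decisions)
    precision = true_pos / n_abstained if n_abstained else 0.0
    recall = true_pos / n_warranted if n_warranted else 0.0
    return precision, recall


def unsafe_instruction_rate(decisions: Sequence[Decision]) -> float:
    """Share of issued (non-abstained) instructions that were judged unsafe."""
    issued = [d for d in decisions if not d.abstained]
    return sum(d.unsafe for d in issued) / len(issued) if issued else 0.0


# Usage with a tiny hand-made log (illustrative data, not from the paper).
log = [
    Decision(abstained=True, should_abstain=True),                 # correct abstention
    Decision(abstained=True, should_abstain=False),                # unnecessary abstention
    Decision(abstained=False, should_abstain=True, unsafe=True),   # missed abstention
    Decision(abstained=False, should_abstain=False),               # safe instruction
]
precision, recall = abstention_precision_recall(log)
rate = unsafe_instruction_rate(log)
```

Unlike SPL or OCR accuracy, these quantities reward an agent for declining to answer when evidence is insufficient, which is the behavioral shift the authors argue for.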

Key Experimental Results

This paper does not present quantitative experiments; the "experiments" are statistical descriptions of 778 task instances and qualitative demonstrations via two case studies.

Main Results

Distribution table of 778 task instances:

| Category | Instances | Proportion | Representative Subtasks (Instances) |
| --- | --- | --- | --- |
| Reading & Text Access | ~293 | 35% | General Document Reading (95) / Interactive Digital Reading (100) / Non-linear Visual Doc (98) |
| Mobility & Safety | ~253 | 34% | Hazard Perception (108) / Path Planning & Navigation (116) / Localization & Relocation (29) |
| VQA Goal-directed Query | ~141 | 18% | Situational Understanding (96) / Goal-directed Object Queries (45) |
| Object Recognition & Daily Operations | ~91 | 12% | Object Understanding (56) / Object-Centered Interaction (35) |

Ablation Study

Two case studies contrast non-aligned baselines with accessibility-aligned designs along four operational dimensions:

| Case | Red-line failure | Uncertainty trigger | Evaluation Metric Shift | Runtime Behavior |
| --- | --- | --- | --- | --- |
| Navigation Assistance | Issues decisive crossing instructions even with localization drift/unreliable intersection geometry | Localization drift/occlusion/ambiguous map evidence/dynamic obstacles | SPL, path length → unsafe instruction rate, risk-trigger compliance, recovery success rate, confidence calibration | Conservative path selection, autonomy downgrade, landmark instructions, safe pause, human escalation |
| Medication Label Reading | Confidently reports dosage/contraindications/interactions from blurry/partial evidence | Blur/occlusion/bent packaging/OCR-VLM candidate conflict/low-confidence numeric fields | OCR accuracy, CER/WER, answer accuracy → critical-field accuracy, critical hallucination rate, abstention precision/recall, recapture success | Field-level confidence, ambiguity detection, structured output, recapture policy, key field verification, abstention, escalation to pharmacist |
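
The medication label case can be sketched in Python as a field-level reporting policy. This is an illustration under stated assumptions, not the authors' system: `FieldReading`, `report_field`, the thresholds, and the `max_recaptures` limit are hypothetical. The critical fields (dosage, contraindications, interactions) and the behaviors (abstention on OCR-VLM conflict or low confidence, recapture, escalation to a pharmacist) follow the case description above.

```python
from dataclasses import dataclass

# Critical fields named in the case study; errors here are red-line failures.
CRITICAL_FIELDS = {"dosage", "contraindications", "interactions"}


@dataclass(frozen=True)
class FieldReading:
    name: str          # label field, e.g. "dosage"
    ocr_value: str     # candidate from an OCR engine (hypothetical upstream step)
    vlm_value: str     # candidate from a vision-language model (hypothetical)
    confidence: float  # assumed field-level confidence in [0, 1]


def _normalize(text: str) -> str:
    # Compare candidates case-insensitively and ignore extra whitespace.
    return " ".join(text.lower().split())


def report_field(reading: FieldReading, attempts: int = 0,
                 critical_threshold: float = 0.95,
                 default_threshold: float = 0.7,
                 max_recaptures: int = 2) -> dict:
    """Report a field only when candidates agree and confidence is sufficient."""
    is_critical = reading.name in CRITICAL_FIELDS
    threshold = critical_threshold if is_critical else default_threshold
    conflict = _normalize(reading.ocr_value) != _normalize(reading.vlm_value)
    if conflict or reading.confidence < threshold:
        # Abstain instead of guessing; ask for a new photo, then escalate.
        next_step = ("escalate_to_pharmacist" if attempts >= max_recaptures
                     else "recapture")
        return {"field": reading.name, "status": "abstain",
                "value": None, "next_step": next_step}
    return {"field": reading.name, "status": "reported",
            "value": reading.ocr_value, "next_step": None}


# Usage: conflicting dosage candidates lead to abstention and a recapture request.
result = report_field(FieldReading("dosage", "10 mg", "70 mg", 0.98))
print(result["status"], result["next_step"])
```

Using a stricter threshold for critical fields than for other fields is one simple way to encode the asymmetric cost of errors. The specific values would need to be calibrated against real data.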

Key Findings

  • Reading and Mobility together account for 69%, both being high-risk tasks where errors can have serious consequences. This indicates that accessibility agent development must focus on safety guarantees in the absence of verification, rather than general multimodal capabilities.
  • The same set of stressors can trigger different runtime behaviors in different cases, but the principle that "uncertainty must be expressed at decision points, not upstream" is universal—this is the core engineering lesson distilled from the two cases.
  • Silent failure and hallucination increase cognitive burden (users are forced to mentally verify each output), while miscalibrated autonomy blocks verification in high-risk situations and wastes bandwidth in low-risk ones. The four failure modes reinforce each other, necessitating a unified framework for joint handling rather than isolated patches.

Highlights & Insights

  • The framing of "accessibility as alignment" is firmly established: By emphasizing that BVI users cannot verify and errors are irreversible, the authors directly link accessibility to the mainstream helpful/harmless/honest triad in the RLHF era, making it easier for the academic community to accept as a primary objective rather than a UI engineering issue.
  • Anchoring with 778 instances is crucial. The greatest pitfall for position papers is "stance based on intuition," but this approach—large-scale literature coding followed by framework development—is much harder to refute and can be transferred to any survey-style paper seeking to establish a framing.
  • The three-layer causal decomposition of stressor → failure mode → assumption is a reusable trick: first explain "observable system bugs" via "objective environmental constraints," then attribute bugs to violated design assumptions. This approach is both explanatory and directly leads to remedies.
  • The lifecycle pipeline insists that "conservative by default + escalation pathway must be enforced runtime properties, not nice-to-haves." This principle applies beyond LLM safety, for example to medical and financial agents.

Limitations & Future Work

  • The taxonomy is derived from papers rather than real deployments, possibly underestimating the distribution of needs in non-academic settings. For example, real BVI users may prioritize social and employment scenarios rarely addressed in the literature.
  • The framework remains at the design specification level and does not provide directly deployable architectures or systems. Future work should include longitudinal deployments and quantitative trust/uncertainty calibration metrics for validation.
  • The coupling between the four dimensions (Goal/Interaction/Risk/Lifecycle) is still somewhat intuitive and lacks formal compatibility proofs. For example, it remains unclear how to formally verify that an agent meets a given Accessibility Success Specification.
  • The focus is mainly on the BVI population; whether the same framework can be directly applied to other disabilities such as hearing, motor, or cognitive impairments is not discussed.

Comparison with Related Work

  • vs HCI accessibility approaches (Lazar et al.): HCI treats accessibility as a UI/screen reader issue; this paper argues that agents are autonomous decision-makers, and the root cause of errors lies in policy/goal, not interface, requiring alignment at the architectural level.
  • vs General Agent Scaling approaches (Ferrag, Acharya et al.): The scaling approach assumes accessibility will be solved as capabilities improve; this paper uses empirical evidence from ChatGPT-4o and StreetReaderAI to show that scaling does not eliminate silent failure and that increased confidence can amplify harm.
  • vs RLHF triad alignment: The HHH framework assumes users are sighted and can correct errors; this paper adds verifiability, risk asymmetry, and interaction bandwidth as new dimensions to the alignment paradigm, representing a natural extension of alignment research to underrepresented users.

Rating

  • Novelty: ⭐⭐⭐⭐ Reframing accessibility as an alignment issue is an early systematic call in the agent literature, though the framing itself has roots within HCI.
  • Experimental Thoroughness: ⭐⭐⭐ The statistical profiling of 778 instances is solid, but lacks real-world deployment or quantitative user studies.
  • Writing Quality: ⭐⭐⭐⭐⭐ The structure is clear, with a strong causal chain from stressor → failure → assumption → framework → pipeline, and the case studies ground abstract principles in concrete design.
  • Value: ⭐⭐⭐⭐ Provides direct design guidance for researchers working on agent safety, healthcare, and assistive scenarios; the framework is reusable.