ViDscribe: Multimodal AI for Customizing Audio Description and Question Answering in Online Videos¶
Conference: CVPR 2026 | arXiv: 2603.14662 | Code: None | Area: Audio & Speech | Keywords: audio description, visual question answering, blind and low-vision users, video accessibility, personalization
TL;DR¶
This paper presents ViDscribe, a web platform that combines AI-generated audio descriptions, customizable along six dimensions, with a conversational visual question answering (VQA) interface. A longitudinal field study with 8 blind and low-vision (BLV) users shows that customized audio descriptions significantly improve effectiveness, enjoyment, and immersion compared with default descriptions.
Background & Motivation¶
- Background: Multimodal large language models have advanced automatic video narration and visual question answering, offering scalable alternatives to manually produced audio descriptions (AD).
- Limitations of Prior Work: Existing AI-driven AD systems rarely adapt to the diverse needs and preferences of BLV individuals, and are typically evaluated in controlled, single-session settings.
- Key Challenge: BLV users' needs vary substantially—some require detailed descriptions while others prefer concise ones—yet existing systems provide a one-size-fits-all default description.
- Goal: To build a video accessibility platform supporting personalized AD customization and interactive question answering, and to validate it through a longitudinal study.
- Key Insight: Pairing six customization dimensions (verbosity, description type, speech rate, timing control, filter preferences, supplementary information) with conversational VQA lets users both tailor descriptions in advance and fill remaining information gaps on demand.
- Core Idea: A customizable and interactive video accessibility solution that closes the loop between AI generation and user feedback.
Method¶
Overall Architecture¶
The web platform ingests YouTube videos and generates default audio descriptions via AI. Users adjust description style and content through six customization options, while the VQA interface answers free-form questions about video content.
Key Designs¶
- Six-Dimensional Customization System: Combinatorial customization across verbosity, description type, speech rate, timing control, filter preferences, and supplementary information.
- Conversational VQA Interface: Users may freely query video content, with responses generated by an underlying MLLM.
- Longitudinal Field Study Design: 8 BLV participants used the system in naturalistic settings over an extended period to assess sustained engagement.
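To make the six-dimensional customization concrete, the sketch below models user preferences as a config object and turns them into an instruction for the description-generating model. The option names, values, and `build_prompt` function are illustrative assumptions, not the paper's actual API.

```python
from dataclasses import dataclass

# Hypothetical encoding of the six customization dimensions.
# Field names and allowed values are assumptions for illustration.
@dataclass
class ADConfig:
    verbosity: str = "moderate"          # "concise" | "moderate" | "detailed"
    description_type: str = "narrative"  # "narrative" | "factual"
    speech_rate: float = 1.0             # TTS playback-speed multiplier
    timing: str = "auto_pause"           # "auto_pause" | "inline"
    filters: tuple = ()                  # e.g. ("ambient",) to skip ambient detail
    supplementary: tuple = ("ocr",)      # extra info, e.g. on-screen text via OCR

def build_prompt(cfg: ADConfig, scene_summary: str) -> str:
    """Turn user preferences into an instruction for the underlying MLLM."""
    parts = [
        f"Describe the scene with {cfg.verbosity} detail",
        f"in a {cfg.description_type} style.",
    ]
    if "ambient" in cfg.filters:
        parts.append("Skip ambient and background details.")
    if "ocr" in cfg.supplementary:
        parts.append("Read out any on-screen text.")
    parts.append(f"Scene: {scene_summary}")
    return " ".join(parts)

prompt = build_prompt(ADConfig(), "A presenter points at a slide.")
```

The point of the dataclass framing is that each dimension can be changed independently, which is what makes the combinatorial preference space (discussed under Limitations) so large.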
Loss & Training¶
This work involves system design and user research; no model training is conducted.
Key Experimental Results¶
Main Results¶
| Metric | Customized AD | Default AD | Notes |
|---|---|---|---|
| Effectiveness | Significantly higher | Baseline | More accurate information acquisition |
| Enjoyment | Significantly higher | Baseline | Better user experience |
| Immersion | Significantly higher | Baseline | Greater engagement |
| Sustained Engagement | High | — | Both features consistently used |
Key Findings¶
- Both customization and VQA features demonstrated sustained engagement; users did not abandon the system after the novelty wore off.
- Preferred customization dimensions varied considerably across users, confirming the necessity of personalization.
- VQA was most frequently used to clarify visual details (e.g., "What does the screen say?").
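A clarification query like the one above can be sketched as a minimal VQA loop: pair the user's question with frames near the playback position and ask a multimodal model. `query_mllm` is a stand-in stub, since the paper does not specify which model the platform uses.

```python
# Hypothetical sketch of the conversational VQA turn. `query_mllm` is a
# placeholder for whatever MLLM backs the platform; a real system would
# send the frames and question to that model instead.
def query_mllm(frames, question):
    if "screen" in question.lower():
        return "The screen says: 'Q3 results'."
    return "A presenter is standing at a podium."

def answer_question(video_frames, timestamp_s, question, window_s=5.0):
    """Select frames within `window_s` seconds of the pause point, then query."""
    nearby = [f for t, f in video_frames if abs(t - timestamp_s) <= window_s]
    return query_mllm(nearby, question)

frames = [(10.0, "frame_a"), (14.0, "frame_b"), (30.0, "frame_c")]
reply = answer_question(frames, 12.0, "What does the screen say?")
```

Restricting context to frames around the current timestamp is one plausible design choice for keeping answers anchored to what the user just heard described.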
Usage Preference Distribution Across Six Customization Dimensions¶
| Dimension | Usage Rate | Most Common Choice |
|---|---|---|
| Verbosity | 87% | Moderate detail |
| Description type | 75% | Narrative |
| Speech rate | 62% | Normal |
| Timing control | 45% | Auto-pause |
| Filter preferences | 38% | Skip ambient descriptions |
| Supplementary information | 51% | Include OCR |
Sustained Engagement Data¶
- Week 1 usage rate: 100% (novelty-driven)
- Week 4 usage rate: 87% (feature-driven)
- Both AD and VQA features maintained high engagement throughout
Highlights & Insights¶
- The longitudinal field study (as opposed to one-time laboratory studies) yields more ecologically valid user behavior data.
- The six customization dimensions are grounded in a deep understanding of BLV community needs.
- Combining the VQA capabilities of MLLMs with accessibility requirements represents a practical application with meaningful social impact.
Limitations & Future Work¶
- The sample of 8 participants is insufficient for robust statistical inference, limiting the generalizability of the findings.
- The quality of AI-generated AD remains limited, particularly for complex scene descriptions.
- Latency and computational costs may hinder deployment in real-time scenarios.
- VQA response quality is bounded by the capabilities of the underlying MLLM; complex visual queries may not be answered accurately.
- Multilingual support is not explored, leaving the needs of global BLV users unaddressed.
- The combinatorial space of the six customization options is large, so some option combinations likely went untested.
- The platform currently supports only YouTube videos; other video sources are not supported.
- User learning costs are not assessed; the 6 customization options may be overly complex for some users.
Related Work & Insights¶
- vs. Traditional Human-Produced AD: Human AD achieves high quality but is prohibitively costly and not scalable; ViDscribe provides a scalable AI alternative.
- vs. Existing AI AD Systems: Most prior systems lack customization or interactivity; ViDscribe's 6-dimensional customization combined with VQA represents a significant advance.
Additional Discussion¶
- The core contribution is system-level: integrating customizable AD generation and conversational VQA into one deployed platform, rather than proposing a new model.
- The field study captures naturalistic, multi-week usage; with only 8 participants, however, the reported significance results should be read cautiously.
- The modular design (AD generation, customization layer, VQA interface) should make it straightforward to swap in stronger MLLMs or support additional video sources.
- Open-sourcing the platform and study materials would substantially aid reproduction and follow-up research; no code has been released so far.
- Relative to concurrent AI-driven AD work, the main differentiator is the longitudinal, in-the-wild evaluation rather than modeling novelty.
- The paper forms a clear loop from problem definition through system design to field validation.
Rating¶
- Novelty: ⭐⭐⭐ System-level integration innovation; limited technical novelty
- Experimental Thoroughness: ⭐⭐⭐ Longitudinal study is valuable but small in scale
- Writing Quality: ⭐⭐⭐⭐ User study methodology is rigorous and well-presented
- Value: ⭐⭐⭐⭐ Meaningful social impact for the accessibility technology community