ViDscribe: Multimodal AI for Customizing Audio Description and Question Answering in Online Videos

Conference: CVPR 2026 | arXiv: 2603.14662 | Code: https://vidscribe.org/ | Area: Audio & Speech
Keywords: Audio Description, Video Accessibility, Multimodal Large Language Models, User Customization, Visual Question Answering

TL;DR

ViDscribe is a web-based platform that leverages a multimodal large language model (Gemini 3 Pro) to provide customizable AI-generated audio descriptions (AD) and interactive visual question answering (VQA) for blind and low-vision (BLV) users. The system supports arbitrary YouTube videos and is validated through a one-week longitudinal user study, which shows that customized AD outperforms default AD in effectiveness, enjoyment, and immersion.

Background & Motivation

  1. Background: Audio description is a critical assistive technology that helps BLV users understand the visual content of videos. Traditional human-authored AD is expensive, time-consuming, and requires specialized expertise, leaving the vast majority of online videos undescribed. Recent advances in MLLMs have made automatic AD generation increasingly feasible.
  2. Limitations of Prior Work: Existing AI-AD systems adopt a one-size-fits-all strategy that fails to accommodate the diverse needs and preferences of BLV users. Evaluations are typically conducted in controlled, short-term laboratory settings, lacking longitudinal usage data.
  3. Key Challenge: BLV users' needs vary according to degree of visual impairment, viewing context, and content type, yet existing systems cannot dynamically adapt their description strategies.
  4. Goal: To build an AI-AD platform supporting user customization and interactive VQA, and to validate its value through a longitudinal study.
  5. Key Insight: The system offers six customization options (frequency, length, focus, subjectivity, color, and free text) together with real-time VQA functionality.
  6. Core Idea: Translate MLLM capabilities into controllable parameters, enabling BLV users to adjust AD generation strategies according to personal preference.

Method

Overall Architecture

ViDscribe employs a React frontend and an AWS Lambda backend, with Gemini 3 Pro as the core inference engine. Users paste a YouTube URL and select customization settings, after which the system automatically generates synchronized AD. Insertion timing is determined via audio analysis, while description content is generated by the MLLM conditioned on the customization parameters. The interface is fully compatible with screen readers and keyboard navigation.
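As a rough illustration of how customization settings might be folded into the MLLM prompt, the sketch below defines a settings record for the six options described later and renders it as prompt text. The field names, defaults, and wording are assumptions for illustration; the paper does not publish its actual schema or prompts.

```python
from dataclasses import dataclass

# Hypothetical parameter names; the paper does not disclose its exact schema.
@dataclass
class ADSettings:
    frequency_s: int = 15        # insert a description every 8 / 15 / 30 seconds
    max_words: int = 50          # slider range in the paper: 15-100 words
    focus: str = "general"       # general / characters / environment / instructional
    subjective: bool = False     # objective facts vs. subjective interpretation
    describe_color: bool = True  # whether color attributes are mentioned
    free_text: str = ""          # user-defined instructions

def build_prompt(s: ADSettings) -> str:
    """Fold the six settings into a prompt fragment that conditions AD generation."""
    lines = [
        f"Insert one description roughly every {s.frequency_s} seconds.",
        f"Keep each description under {s.max_words} words.",
        f"Focus on {s.focus} content.",
        "Offer subjective interpretation where helpful." if s.subjective
        else "Stick to objective, factual description.",
        "Mention the colors of salient objects." if s.describe_color
        else "Do not mention colors.",
    ]
    if s.free_text:
        lines.append(f"Additional user instruction: {s.free_text}")
    return "\n".join(lines)
```

In a deployment, this fragment would be prepended to the video, timestamps, and AD guidelines sent to the model, so that every regeneration reflects the user's current settings.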

Key Designs

  1. Six-Dimensional Customization Control:

    • Function: Accommodate the diverse needs of BLV users.
    • Mechanism: (A) Frequency — insert a description every 8/15/30 seconds; (B) Length — slider controlling 15–100 words per description; (C) Focus — general / character-oriented / environment-oriented / instructional content; (D) Subjectivity — objective factual description vs. subjective interpretation; (E) Color — whether to describe color attributes; (F) Free text — user-defined instructions. All settings are converted into prompt parameters to condition AD generation.
    • Design Motivation: Prior research and BLV community feedback indicate that different users require different types of descriptions in different contexts.
  2. Adaptive AD Generation:

    • Function: Generate descriptions aligned with user preferences at appropriate temporal positions.
    • Mechanism: The pipeline operates in two stages. (a) AD timing module: audio is extracted and analyzed for three signals — silence, speech-free segments, and scene changes — with natural pause points identified where signals overlap; excessively long intervals are recursively subdivided. (b) Description generation module: Gemini 3 Pro receives the video, timestamps, user customization settings, and 42 AD guidelines to produce personalized descriptions.
    • Design Motivation: Effective AD must not only be content-accurate but must also appear at appropriate moments without interrupting dialogue.
  3. Interactive VQA:

    • Function: Allow users to ask questions at any point during playback to obtain additional visual information.
    • Mechanism: The user presses a shortcut key to pause and inputs a question via typing or speech (e.g., "Who just entered the room?"). The system forwards the question, the current timestamp, the video's AD, and representative frames to Gemini 3 Pro to generate a context-aware answer, which is then played back via text-to-speech.
    • Design Motivation: Passive descriptions cannot cover all information; VQA empowers users to actively retrieve missing details.
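The timing module's recursive subdivision step (Key Design 2a) can be sketched as follows: given sorted candidate pause points where the silence, speech-free, and scene-change signals overlap, any gap longer than the user's chosen frequency is split at its midpoint. The midpoint-splitting rule is an assumption for illustration; the paper says only that overly long intervals are recursively subdivided.

```python
def insertion_times(pauses, max_gap):
    """Return AD insertion timestamps (seconds).

    pauses  -- sorted candidate pause points where the three audio/visual
               signals (silence, speech-free segments, scene changes) overlap
    max_gap -- the user's chosen description frequency (e.g. 8, 15, or 30 s)

    Gaps longer than max_gap are recursively split at their midpoint
    (assumed rule) so no stretch of video goes undescribed for too long.
    """
    def split(a, b):
        if b - a <= max_gap:
            return []
        mid = (a + b) / 2
        return split(a, mid) + [mid] + split(mid, b)

    times, prev = [], 0.0
    for p in pauses:
        times.extend(split(prev, p))  # fill an overlong gap before this pause
        times.append(p)               # always describe at a natural pause
        prev = p
    return times
```

For example, with pauses at 20 s and 28 s and a 15 s frequency, the 0-20 s gap is split once at 10 s, while the 20-28 s gap is short enough to leave alone.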

Loss & Training

No training is required. The system relies entirely on the zero-shot inference capability of Gemini 3 Pro.

Key Experimental Results

Main Results (Longitudinal User Study)

Metric (5-point scale)   Default AD   Customized AD   Gain
Effectiveness            4.00         4.32            +0.32
Enjoyment                3.45         3.97            +0.52
Immersion                3.72         4.06            +0.34
VQA Helpfulness          n/a          3.46            n/a
SUS Usability            n/a          70.6            above the 68 baseline

Customized AD outperforms default AD on all dimensions, with the largest gain observed in enjoyment.

Ablation Study (Customization Preference Analysis)

Customization Type   Most Frequent Choice    Proportion
Frequency            8 s (frequent)          54.9%
Length               26–50 words (medium)    49.0%
Focus                General content         52.9%
Subjectivity         Objective description   72.5%
Color                Include color           80.4%

Key Findings

  • 63% of videos were watched with customized settings, indicating that BLV users both need and are willing to use customization features.
  • User preferences shifted over time toward shorter and less frequent descriptions, reflecting preference evolution as users became more proficient.
  • The VQA feature received 66 questions in total; the most common queries concerned character identity and scene details.
  • 6 out of 8 participants stated they would recommend ViDscribe to BLV friends.
  • VQA ratings were slightly lower (3.46), partly because the current implementation draws only on frames near the current timestamp and therefore cannot answer questions that require full-video understanding.

Highlights & Insights

  • Longitudinal Real-World Study: This represents the first evaluation of customized AI-AD and VQA in a one-week, real-world usage context rather than a short-term laboratory experiment.
  • Temporal Evolution of Customization Preferences: User preferences are shown to change over time, offering guidance for the design of adaptive systems.
  • Fully Deployable System: The contribution extends beyond methodology to provide a practically usable accessibility tool.

Limitations & Future Work

  • The sample size is small (8 participants) and no statistical significance tests were conducted.
  • VQA relies only on frames near the current timestamp and cannot answer questions requiring full-video understanding.
  • Customization settings require manual adjustment; future work could explore automatic learning of user preferences.
  • Description quality is bounded by the capabilities of Gemini 3 Pro.
  • Future iterations could incorporate user preference memory and cross-session learning.
Comparison with Related Systems

  • vs. YouDescribe: YouDescribe relies on volunteer human describers and does not scale; ViDscribe generates descriptions automatically.
  • vs. NarrationBot: NarrationBot produces fixed descriptions without any customization.
  • vs. DescribePro: DescribePro assists human describers; ViDscribe is fully automated.

Rating

  • Novelty: ⭐⭐⭐ Primarily a system integration contribution; technical innovation is limited.
  • Experimental Thoroughness: ⭐⭐⭐ Longitudinal study design is strong, but sample size is small.
  • Writing Quality: ⭐⭐⭐⭐ User study is described in thorough detail.
  • Value: ⭐⭐⭐⭐ Practically meaningful contribution to the accessibility community.