
mmWalk: Towards Multi-modal Multi-view Walking Assistance

Conference: NeurIPS 2025
arXiv: 2510.11520
Code: https://github.com/KediYing/mmWalk
Area: Robotics
Keywords: blind and low-vision assistance, VQA benchmark, multi-modal, multi-view, CARLA simulation

TL;DR

mmWalk constructs the first multi-modal, multi-view dataset for walking assistance targeting blind and low-vision (BLV) individuals: 62K frames / 559K panoramic images generated with the CARLA simulator, plus 69K VQA pairs. Benchmarking shows that state-of-the-art VLMs perform inadequately on safety-critical tasks such as risk assessment and navigation-landmark recognition (best accuracy only 55.21%), while fine-tuning on mmWalk yields a 16.7% generalization improvement on a real-world dataset.

Background & Motivation

Background: Walking assistance for BLV individuals relies on AI systems capable of understanding complex outdoor scenes. Existing datasets are predominantly for indoor or driving scenarios and lack multi-modal data captured from pedestrian, guide-dog, and drone perspectives.

Limitations of Prior Work: Although VLMs perform well on general VQA tasks, their capabilities on walking-safety-critical tasks—such as detecting uneven surfaces, assessing road-crossing risks, and identifying navigation landmarks—have never been systematically evaluated.

Key Challenge: Collecting real-world BLV walking data faces ethical and privacy barriers (e.g., GDPR), and annotation is difficult. A compliant and controllable data generation approach is therefore needed.

Goal: Construct a systematic benchmark to evaluate the walking assistance capabilities of VLMs and identify safety blind spots in current models.

Key Insight: Use the CARLA simulator to generate multi-modal (RGB / depth / semantic segmentation) × multi-view (pedestrian / guide-dog / drone) walking scenes, and design 9 categories of VQA questions spanning 3 difficulty levels.

Core Idea: Generate compliant multi-view walking data via CARLA simulation, design a hierarchical VQA benchmark to systematically evaluate VLM walking assistance capabilities, and expose critical deficiencies in safety-critical tasks.

Method

Overall Architecture

Dataset: 120 manually controlled walking trajectories across 77 scene categories → 62,167 frames, each captured from 3 views in 3 modalities (62,167 × 9 = 559,503 panoramic images), covering 8 corner-case types and 18 navigation-landmark categories.

Benchmark: 69,391 VQA pairs across 9 question categories and 3 difficulty levels (easy / medium / hard), with QA pairs generated from GPT-4o templates.
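
For concreteness, below is a hypothetical sketch of how one mmWalk frame might be organized. The field names and layout are assumptions rather than the released schema, but the arithmetic matches the reported totals.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

VIEWS = ("pedestrian", "guide_dog", "drone")
MODALITIES = ("rgb", "depth", "semantic_segmentation")

@dataclass
class WalkFrame:
    """One of the 62,167 frames: 3 views x 3 modalities = 9 panoramic images."""
    trajectory_id: int                                    # one of 120 walking trajectories
    frame_id: int
    scene_category: str                                   # one of 77 scene categories
    corner_case: Optional[str] = None                     # one of 8 corner-case types, if present
    landmarks: List[str] = field(default_factory=list)    # drawn from 18 landmark categories
    images: Dict[str, str] = field(default_factory=dict)  # "view/modality" -> image path

# Sanity check on the reported totals: 62,167 frames x 9 images per frame.
assert 62_167 * len(VIEWS) * len(MODALITIES) == 559_503
```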

Key Designs

  1. Multi-modal Multi-view Data Collection:

    • Function: Collect synchronized multi-modal data from three viewpoints—pedestrian, guide-dog, and drone.
    • Mechanism: A pedestrian agent is manually controlled in CARLA along predefined routes; RGB, depth, and semantic-segmentation panoramic images are recorded synchronously at each frame (see the sensor-rig sketch after this list). The 8 corner-case types include road crossings, uneven ground, obstacles, narrow passages, entrances, overhead obstacles, and dead ends, among others.
    • Design Motivation: Multi-view data simulates guide-dog and drone-assisted scenarios, more closely approximating real BLV assistive systems.
  2. Hierarchical VQA Benchmark (mmWalkVQA):

    • Function: Design 9 categories of VQA questions covering diverse capability dimensions.
    • Mechanism: Easy (weather/action recognition, existence judgment) → Medium (counting, attribute recognition, spatial reasoning, description) → Hard (viewpoint comparison, risk assessment, navigation landmarks). QA pairs are generated by GPT-4o from scene metadata and templates (an illustrative generation sketch follows this list).
    • Design Motivation: The hierarchical design enables precise identification of VLM capability bottlenecks—spatial reasoning and risk assessment are safety-critical dimensions.
  3. Benchmark Evaluation + Fine-tuning Validation:

    • Function: Evaluate zero-shot / few-shot / fine-tuned performance of 6 state-of-the-art VLMs.
    • Mechanism: Models evaluated include LLaVA-OneVision/Next, Qwen2VL, InternVL2, Janus-Pro, and Chameleon. After fine-tuning InternVL2, accuracy improves from 41.35% → 55.21% on mmWalk and from 18.5% → 21.55% on the real-world dataset EgoTextVQA.
    • Design Motivation: Validate the training value of the dataset and the simulation-to-real domain transfer capability.
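
Referenced from item 1: a minimal sketch, using the CARLA Python API, of how a three-view, three-modality sensor rig could be attached to a walker agent. The mount heights, image sizes, and output paths are assumptions, and the standard CARLA cameras used here are pinhole, whereas the paper records panoramic images; this is illustrative only, not the paper's recording pipeline.

```python
import carla

# Connect to a running CARLA server and enable synchronous mode so that
# all sensors render the same simulation frame.
client = carla.Client("localhost", 2000)
client.set_timeout(10.0)
world = client.get_world()
settings = world.get_settings()
settings.synchronous_mode = True
settings.fixed_delta_seconds = 0.05
world.apply_settings(settings)

bp_lib = world.get_blueprint_library()

# Spawn a walker (pedestrian) agent at an arbitrary location (assumed spawn point).
walker_bp = bp_lib.filter("walker.pedestrian.*")[0]
walker = world.spawn_actor(walker_bp, carla.Transform(carla.Location(x=0.0, y=0.0, z=1.0)))

# Three modalities per view; CARLA ships RGB, depth, and semantic-segmentation cameras.
MODALITIES = {
    "rgb": "sensor.camera.rgb",
    "depth": "sensor.camera.depth",
    "semantic": "sensor.camera.semantic_segmentation",
}
# Three viewpoints; the heights and offsets are rough assumptions, not the paper's rig.
VIEWS = {
    "pedestrian": carla.Transform(carla.Location(z=1.6)),                           # eye level
    "guide_dog": carla.Transform(carla.Location(x=0.6, z=0.6)),                     # low, ahead of the walker
    "drone": carla.Transform(carla.Location(z=15.0), carla.Rotation(pitch=-90.0)),  # bird's-eye
}

sensors = []
for view, mount in VIEWS.items():
    for modality, bp_id in MODALITIES.items():
        bp = bp_lib.find(bp_id)
        bp.set_attribute("image_size_x", "1024")
        bp.set_attribute("image_size_y", "512")
        cam = world.spawn_actor(bp, mount, attach_to=walker)
        # Save every frame to disk, keyed by view and modality.
        cam.listen(lambda img, v=view, m=modality:
                   img.save_to_disk(f"out/{v}/{m}/{img.frame:06d}.png"))
        sensors.append(cam)

# Advance the simulation; in the dataset the walker is steered manually along routes.
for _ in range(100):
    world.tick()
```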
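
Referenced from item 2: an illustrative sketch of template-driven QA generation with GPT-4o. The prompt wording, metadata fields, and the generate_qa helper are assumptions for illustration, not the paper's actual templates or schema.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# A hypothetical template for a hard-level risk-assessment question;
# the paper's real templates and metadata are not reproduced here.
RISK_TEMPLATE = (
    "You are writing VQA pairs for a blind and low-vision walking assistant.\n"
    "Scene metadata: weather={weather}, corner_case={corner_case}, "
    "landmarks={landmarks}.\n"
    "Write one question asking whether it is safe to proceed, and a concise "
    "ground-truth answer grounded only in the metadata."
)

def generate_qa(metadata: dict) -> str:
    """Fill the template from scene metadata and ask GPT-4o for a QA pair."""
    prompt = RISK_TEMPLATE.format(**metadata)
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,
    )
    return resp.choices[0].message.content

# Example call with made-up metadata.
print(generate_qa({"weather": "rain", "corner_case": "road crossing",
                   "landmarks": ["zebra crossing", "traffic light"]}))
```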

Loss & Training

  • Standard VLM fine-tuning (instruction tuning)
  • Evaluation metric: normalized score (maximum 100%)
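
A minimal sketch of such a normalized metric, assuming simple per-category exact-match accuracy scaled to 0-100; the paper's exact scoring rules may differ.

```python
from collections import defaultdict
from typing import Dict, List

def normalized_score(predictions: List[str], answers: List[str],
                     categories: List[str]) -> Dict[str, float]:
    """Per-category accuracy scaled to 0-100 (exact match is an assumption)."""
    correct, total = defaultdict(int), defaultdict(int)
    for pred, gold, cat in zip(predictions, answers, categories):
        total[cat] += 1
        correct[cat] += int(pred.strip().lower() == gold.strip().lower())
    scores = {cat: 100.0 * correct[cat] / total[cat] for cat in total}
    scores["overall"] = 100.0 * sum(correct.values()) / sum(total.values())
    return scores
```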

Key Experimental Results

Main Results

Model         Zero-shot   3-shot    Fine-tuned
InternVL2     41.35%      41.72%    55.21%
LLaVA-Next    35.64%      43.71%    —
Qwen2VL       39.23%      —         —

Task Difficulty Analysis

Task Type                   Best Score (any model)   Notes
Weather/Action (E1)         ~70%                     Easy
Spatial Reasoning (M1)      ~30%                     Most difficult
Risk Assessment (H1)        ~35%                     Safety-critical
Navigation Landmarks (H2)   ~25%                     Severely deficient

Key Findings

  • All VLMs perform very poorly on risk assessment and navigation landmarks (<35%), indicating that current models remain far from meeting BLV safety assistance requirements.
  • Fine-tuning yields a 13.86-percentage-point improvement on mmWalk (InternVL2, 41.35% → 55.21%), underscoring the importance of domain-specific data.
  • Simulation-to-real transfer is effective: fine-tuning on mmWalk improves performance on the real-world EgoTextVQA benchmark by 16.7% (from 18.5% to 21.55%).
  • Spatial reasoning is a common weakness across all evaluated models.

Highlights & Insights

  • Safety-oriented benchmark design: By categorizing VQA questions into safety-related and non-safety-related types, this work is the first to quantify the safety risks of VLMs in BLV assistance contexts.
  • Effective simulation-to-real transfer: Fine-tuning on CARLA data yields gains on real-world datasets, validating the practical utility of simulation data.
  • Thoughtful multi-view design: The guide-dog viewpoint (low angle) and drone viewpoint (bird's-eye view) provide complementary information.

Limitations & Future Work

  • A domain gap remains between simulated and real-world data.
  • The scale of 69K QA pairs is limited.
  • Multi-modal signals such as IMU data, temporal frame sequences, and semantic labels are not fully exploited.
  • The actual user experience of BLV individuals has not been evaluated.

Comparison with Related Work

  • vs. Ego4D / EgoTextVQA: These are general egocentric VQA benchmarks and do not target the safety requirements specific to BLV assistance.
  • vs. ATmaps: ATmaps defines standards for navigation landmarks; mmWalk integrates these into the VQA evaluation framework.

Rating

  • Novelty: ⭐⭐⭐⭐ First multi-modal multi-view benchmark targeting BLV walking assistance
  • Experimental Thoroughness: ⭐⭐⭐⭐ 6 VLMs + hierarchical evaluation + fine-tuning + cross-domain validation
  • Writing Quality: ⭐⭐⭐⭐ Detailed description of dataset design
  • Value: ⭐⭐⭐⭐ Reveals VLM gaps on safety-critical assistive tasks