
mmWalk: Towards Multi-modal Multi-view Walking Assistance

Conference: NeurIPS 2025
arXiv: 2510.11520
Code: https://github.com/KediYing/mmWalk
Area: Robotics
Keywords: blind and low-vision assistance, VQA benchmark, multi-modal, multi-view, CARLA simulation

TL;DR

mmWalk constructs the first multi-modal, multi-view dataset for walking assistance targeting blind and low-vision (BLV) individuals: 62K frames / 559K panoramic images generated with the CARLA simulator, plus 69K VQA pairs. Benchmarking shows that state-of-the-art VLMs perform inadequately on safety-critical tasks such as risk assessment and navigation-landmark recognition (best accuracy only 55.21%), while fine-tuning on mmWalk yields a 16.7% generalization improvement on a real-world dataset.

Background & Motivation

Background: Walking assistance for BLV individuals relies on AI systems capable of understanding complex outdoor scenes. Existing datasets are predominantly for indoor or driving scenarios and lack multi-modal data captured from pedestrian, guide-dog, and drone perspectives.

Limitations of Prior Work: Although VLMs perform well on general VQA tasks, their capabilities on walking-safety-critical tasks—such as detecting uneven surfaces, assessing road-crossing risks, and identifying navigation landmarks—have never been systematically evaluated.

Key Challenge: Collecting real-world BLV walking data faces ethical and privacy barriers (e.g., GDPR), and annotation is difficult. A compliant and controllable data generation approach is therefore needed.

Goal: Construct a systematic benchmark to evaluate the walking assistance capabilities of VLMs and identify safety blind spots in current models.

Key Insight: Use the CARLA simulator to generate multi-modal (RGB / depth / semantic segmentation) × multi-view (pedestrian / guide-dog / drone) walking scenes, and design 9 categories of VQA questions spanning 3 difficulty levels.

Core Idea: Generate compliant multi-view walking data via CARLA simulation, design a hierarchical VQA benchmark to systematically evaluate VLM walking assistance capabilities, and expose critical deficiencies in safety-critical tasks.

Method

Overall Architecture

Dataset: 120 manually controlled walking trajectories across 77 scene categories → 62,167 frames, each captured from 3 views in 3 modalities (62,167 × 9 = 559,503 panoramic images), covering 8 corner-case types and 18 navigation-landmark categories.

Benchmark: 69,391 VQA pairs across 9 question categories and 3 difficulty levels (easy / medium / hard), with QA pairs generated from GPT-4o templates.
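
For concreteness, below is a hypothetical sketch of how one mmWalk frame might be organized. The field names and layout are assumptions rather than the released schema, but the arithmetic matches the reported totals.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

VIEWS = ("pedestrian", "guide_dog", "drone")
MODALITIES = ("rgb", "depth", "semantic_segmentation")

@dataclass
class WalkFrame:
    """One of the 62,167 frames: 3 views x 3 modalities = 9 panoramic images."""
    trajectory_id: int                                    # one of 120 walking trajectories
    frame_id: int
    scene_category: str                                   # one of 77 scene categories
    corner_case: Optional[str] = None                     # one of 8 corner-case types, if present
    landmarks: List[str] = field(default_factory=list)    # drawn from 18 landmark categories
    images: Dict[str, str] = field(default_factory=dict)  # "view/modality" -> image path

# Sanity check on the reported totals: 62,167 frames x 9 images per frame.
assert 62_167 * len(VIEWS) * len(MODALITIES) == 559_503
```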

Key Designs

  1. Multi-modal Multi-view Data Collection:

    • Function: Collect synchronized multi-modal data from three viewpoints—pedestrian, guide-dog, and drone.
    • Mechanism: A pedestrian agent is manually controlled in CARLA along predefined routes; RGB, depth, and semantic-segmentation panoramic images are recorded synchronously at each frame (see the sensor-rig sketch after this list). The 8 corner-case types include road crossings, uneven ground, obstacles, narrow passages, entrances, overhead obstacles, and dead ends, among others.
    • Design Motivation: Multi-view data simulates guide-dog and drone-assisted scenarios, more closely approximating real BLV assistive systems.
  2. Hierarchical VQA Benchmark (mmWalkVQA):

    • Function: Design 9 categories of VQA questions covering diverse capability dimensions.
    • Mechanism: Easy (weather/action recognition, existence judgment) → Medium (counting, attribute recognition, spatial reasoning, description) → Hard (viewpoint comparison, risk assessment, navigation landmarks). QA pairs are generated by GPT-4o from scene metadata and templates (an illustrative generation sketch follows this list).
    • Design Motivation: The hierarchical design enables precise identification of VLM capability bottlenecks—spatial reasoning and risk assessment are safety-critical dimensions.
  3. Benchmark Evaluation + Fine-tuning Validation:

    • Function: Evaluate zero-shot / few-shot / fine-tuned performance of 6 state-of-the-art VLMs.
    • Mechanism: Models evaluated include LLaVA-OneVision/Next, Qwen2VL, InternVL2, Janus-Pro, and Chameleon. After fine-tuning InternVL2, accuracy improves from 41.35% → 55.21% on mmWalk and from 18.5% → 21.55% on the real-world dataset EgoTextVQA.
    • Design Motivation: Validate the training value of the dataset and the simulation-to-real domain transfer capability.
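
Referenced from item 1: a minimal sketch, using the CARLA Python API, of how a three-view, three-modality sensor rig could be attached to a walker agent. The mount heights, image sizes, and output paths are assumptions, and the standard CARLA cameras used here are pinhole, whereas the paper records panoramic images; this is illustrative only, not the paper's recording pipeline.

```python
import carla

# Connect to a running CARLA server and enable synchronous mode so that
# all sensors render the same simulation frame.
client = carla.Client("localhost", 2000)
client.set_timeout(10.0)
world = client.get_world()
settings = world.get_settings()
settings.synchronous_mode = True
settings.fixed_delta_seconds = 0.05
world.apply_settings(settings)

bp_lib = world.get_blueprint_library()

# Spawn a walker (pedestrian) agent at an arbitrary location (assumed spawn point).
walker_bp = bp_lib.filter("walker.pedestrian.*")[0]
walker = world.spawn_actor(walker_bp, carla.Transform(carla.Location(x=0.0, y=0.0, z=1.0)))

# Three modalities per view; CARLA ships RGB, depth, and semantic-segmentation cameras.
MODALITIES = {
    "rgb": "sensor.camera.rgb",
    "depth": "sensor.camera.depth",
    "semantic": "sensor.camera.semantic_segmentation",
}
# Three viewpoints; the heights and offsets are rough assumptions, not the paper's rig.
VIEWS = {
    "pedestrian": carla.Transform(carla.Location(z=1.6)),                           # eye level
    "guide_dog": carla.Transform(carla.Location(x=0.6, z=0.6)),                     # low, ahead of the walker
    "drone": carla.Transform(carla.Location(z=15.0), carla.Rotation(pitch=-90.0)),  # bird's-eye
}

sensors = []
for view, mount in VIEWS.items():
    for modality, bp_id in MODALITIES.items():
        bp = bp_lib.find(bp_id)
        bp.set_attribute("image_size_x", "1024")
        bp.set_attribute("image_size_y", "512")
        cam = world.spawn_actor(bp, mount, attach_to=walker)
        # Save every frame to disk, keyed by view and modality.
        cam.listen(lambda img, v=view, m=modality:
                   img.save_to_disk(f"out/{v}/{m}/{img.frame:06d}.png"))
        sensors.append(cam)

# Advance the simulation; in the dataset the walker is steered manually along routes.
for _ in range(100):
    world.tick()
```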
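
Referenced from item 2: an illustrative sketch of template-driven QA generation with GPT-4o. The prompt wording, metadata fields, and the generate_qa helper are assumptions for illustration, not the paper's actual templates or schema.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# A hypothetical template for a hard-level risk-assessment question;
# the paper's real templates and metadata are not reproduced here.
RISK_TEMPLATE = (
    "You are writing VQA pairs for a blind and low-vision walking assistant.\n"
    "Scene metadata: weather={weather}, corner_case={corner_case}, "
    "landmarks={landmarks}.\n"
    "Write one question asking whether it is safe to proceed, and a concise "
    "ground-truth answer grounded only in the metadata."
)

def generate_qa(metadata: dict) -> str:
    """Fill the template from scene metadata and ask GPT-4o for a QA pair."""
    prompt = RISK_TEMPLATE.format(**metadata)
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,
    )
    return resp.choices[0].message.content

# Example call with made-up metadata.
print(generate_qa({"weather": "rain", "corner_case": "road crossing",
                   "landmarks": ["zebra crossing", "traffic light"]}))
```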

Loss & Training

  • Standard VLM fine-tuning (instruction tuning)
  • Evaluation metric: normalized score (maximum 100%)
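
A minimal sketch of such a normalized metric, assuming simple per-category exact-match accuracy scaled to 0-100; the paper's exact scoring rules may differ.

```python
from collections import defaultdict
from typing import Dict, List

def normalized_score(predictions: List[str], answers: List[str],
                     categories: List[str]) -> Dict[str, float]:
    """Per-category accuracy scaled to 0-100 (exact match is an assumption)."""
    correct, total = defaultdict(int), defaultdict(int)
    for pred, gold, cat in zip(predictions, answers, categories):
        total[cat] += 1
        correct[cat] += int(pred.strip().lower() == gold.strip().lower())
    scores = {cat: 100.0 * correct[cat] / total[cat] for cat in total}
    scores["overall"] = 100.0 * sum(correct.values()) / sum(total.values())
    return scores
```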

Key Experimental Results

Main Results

Model         Zero-shot   3-shot    Fine-tuned
InternVL2     41.35%      41.72%    55.21%
LLaVA-Next    35.64%      43.71%    —
Qwen2VL       39.23%      —         —

Task Difficulty Analysis

Task Type                   Best Score (any model)   Notes
Weather/Action (E1)         ~70%                     Easy
Spatial Reasoning (M1)      ~30%                     Most difficult
Risk Assessment (H1)        ~35%                     Safety-critical
Navigation Landmarks (H2)   ~25%                     Severely deficient

Key Findings

  • All VLMs perform very poorly on risk assessment and navigation landmarks (<35%), indicating that current models remain far from meeting BLV safety assistance requirements.
  • Fine-tuning yields a 13.86-percentage-point improvement on mmWalk (InternVL2, 41.35% → 55.21%), underscoring the importance of domain-specific data.
  • Simulation-to-real transfer is effective: fine-tuning on mmWalk improves performance on the real-world EgoTextVQA benchmark by 16.7% (from 18.5% to 21.55%).
  • Spatial reasoning is a common weakness across all evaluated models.

Highlights & Insights

  • Safety-oriented benchmark design: By categorizing VQA questions into safety-related and non-safety-related types, this work is the first to quantify the safety risks of VLMs in BLV assistance contexts.
  • Effective simulation-to-real transfer: Fine-tuning on CARLA data yields gains on real-world datasets, validating the practical utility of simulation data.
  • Thoughtful multi-view design: The guide-dog viewpoint (low angle) and drone viewpoint (bird's-eye view) provide complementary information.

Limitations & Future Work

  • A domain gap remains between simulated and real-world data.
  • The scale of 69K QA pairs is limited.
  • Multi-modal signals such as IMU data, temporal frame sequences, and semantic labels are not fully exploited.
  • The actual user experience of BLV individuals has not been evaluated.

Comparison with Related Work

  • vs. Ego4D / EgoTextVQA: These are general egocentric VQA benchmarks and do not target the safety requirements specific to BLV assistance.
  • vs. ATmaps: ATmaps defines standards for navigation landmarks; mmWalk integrates these into the VQA evaluation framework.

Rating

  • Novelty: ⭐⭐⭐⭐ First multi-modal multi-view benchmark targeting BLV walking assistance
  • Experimental Thoroughness: ⭐⭐⭐⭐ 6 VLMs + hierarchical evaluation + fine-tuning + cross-domain validation
  • Writing Quality: ⭐⭐⭐⭐ Detailed description of dataset design
  • Value: ⭐⭐⭐⭐ Reveals VLM gaps on safety-critical assistive tasks