LaF-GRPO: In-Situ Navigation Instruction Generation for the Visually Impaired via GRPO with LLM-as-Follower Reward

Conference: AAAI 2026
arXiv: 2506.04070
Code: https://github.com/YiyiyiZhao/NIG4VI
Area: Robotics
Keywords: Visually impaired navigation, GRPO, VLM post-training, LLM-as-Follower, navigation instruction generation

TL;DR

This paper proposes the LaF-GRPO framework, which employs an LLM to simulate visually impaired users' responses to navigation instructions and uses these simulated responses as a reward signal. Through GRPO-based post-training of a VLM with this reward, the framework generates more precise and safer navigation instructions for the visually impaired. The authors also construct the NIG4VI benchmark dataset comprising 27k samples.

Background & Motivation

Approximately 2.2 billion people worldwide are affected by visual impairment. Navigation instruction generation for the visually impaired (NIG-VI) is a critical yet under-studied area. Unlike navigation instruction generation targeting general embodied agents, NIG-VI is human-centric and demands instructions that: (1) incorporate non-visual cues (e.g., sound and touch), (2) provide precise directional and distance guidance (e.g., clock-face direction combined with step count), and (3) include adaptive safety warnings for obstacles.

Limitations of Prior Work:

  • Early methods (e.g., ASSISTER) are constrained by the BERT architecture, resulting in limited generative capacity.
  • VLM + GRPO paradigms show promise, but their reward signals typically require large volumes of human feedback data, incurring high collection costs.
  • Existing datasets are largely closed-source, lack precise spatial coordinates, or are too small in scale.

Core Motivation: Can an LLM substitute for real visually impaired users by simulating their comprehension and execution of navigation instructions, thereby providing low-cost reward feedback? This serves as the central innovation of the paper — the LLM-as-Follower (LaF) concept.

Method

Overall Architecture

LaF-GRPO is built upon the Speaker-Follower paradigm and Theory of Mind (ToM), consisting of two core components:

  1. Action Interpreter: An SFT-trained LLM (LLaMA-3-8B-Instruct) that simulates visually impaired users' responses to navigation instructions.
  2. Navigation Instruction Generator: A VLM (Qwen2.5-VL-3B/7B) trained via SFT + LaF-GRPO post-training to generate navigation instructions.

Key Designs

1. NIG-VI Task Formulation

At each step \(i\), the VLM receives a front-view image \(x_{\text{image}}^{(i)}\), the current pose \(x_{\text{pose}}^{(i)} = (x_{\text{loc}}^{(i)}, x_{\text{rot}}^{(i)})\), and the next target waypoint \(p_{i+1}\), and generates stepwise navigation instructions:

\[y_j^{(i)} \sim \pi_\theta\left(y_j^{(i)} \mid x_{\text{image}}^{(i)}, x_{\text{loc}}^{(i)}, x_{\text{rot}}^{(i)}, p_{i+1}, y_{<j}^{(i)}\right)\]

The path \(P = [p_1, \ldots, p_K]\) is generated by the A* algorithm.
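To make the task inputs concrete, below is a hypothetical sketch of how one step's pose and next waypoint might be packaged into the text prompt that accompanies the front-view image. The prompt wording and field names are assumptions for illustration, not the paper's released format.

```python
# Hypothetical prompt construction for one NIG-VI step (wording and field
# names are assumed for illustration; not taken from the paper's code).
def build_step_prompt(loc: tuple, rot_deg: float, next_waypoint: tuple) -> str:
    """Combine the agent's pose and the next A*-planned waypoint into the
    text side of the VLM input; the front-view image is passed separately."""
    return (
        "You are guiding a visually impaired pedestrian.\n"
        f"Current location (x, y): {loc}; heading: {rot_deg} degrees.\n"
        f"Next waypoint: {next_waypoint}.\n"
        "Give one step of instructions using a clock-face direction, an "
        "approximate distance, and safety warnings for nearby obstacles."
    )
```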

2. Action Interpreter

The core idea is to have the LLM role-play as a visually impaired user — lacking a visual encoder, it can only "listen" to instructions and then predict likely user actions. It outputs a structured dictionary \(\mathcal{A}\) containing:

  • move: A movement action with direction (clock-face direction) and distance parameters.
  • detailed_hazard_alert: A boolean flag indicating whether the user perceives an obstacle warning.

Training data are derived from ground-truth instruction–action pairs in NIG4VI, achieving a parsing accuracy of >98% on the validation set.
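As a minimal sketch, the interpreter's reply could be validated against the expected schema as follows; the key names and clock-face strings are inferred from the description above, not taken from the released code.

```python
import json

# Clock-face directions the "move" action may use (format assumed).
CLOCK_DIRECTIONS = {f"{h} o'clock" for h in range(1, 13)}

def parse_action(raw: str):
    """Parse the Action Interpreter's reply into the structured dictionary A;
    return None on any schema violation (such failures count against the
    >98% parsing accuracy)."""
    try:
        action = json.loads(raw)
        move = action["move"]
        assert move["direction"] in CLOCK_DIRECTIONS
        float(move["distance"])           # numeric distance parameter
        action["detailed_hazard_alert"]   # boolean warning flag must be present
        return action
    except (json.JSONDecodeError, KeyError, TypeError, ValueError, AssertionError):
        return None
```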

3. LaF-GRPO Reward Function

Three reward functions operate jointly:

Format reward (\(r_{\text{format}} \in \{0, 1\}\)): Checks whether the output conforms to the <think>...</think><answer>...</answer> format.

Text generation reward (\(r_{\text{meteor}}\)): Computes the METEOR score between the output and the ground truth to evaluate semantic overlap.

LLM-as-Follower reward (\(r_{\text{LaF}}\)): The VLM-generated instruction is fed into the Action Interpreter, and the interpreted action is compared against the ground-truth action:

\[r_{\text{LaF}} = w_{\text{dir}} \delta(a_{\text{dir}}, a_{\text{dir}}^{\text{ref}}) + w_{\text{dist}} \delta(a_{\text{dist}}, a_{\text{dist}}^{\text{ref}}) + w_{\text{alert}} \delta(a_{\text{alert}}, a_{\text{alert}}^{\text{ref}})\]

where \(\delta(\cdot)\) denotes exact match, and weights are set as \((w_{\text{dir}}, w_{\text{dist}}, w_{\text{alert}}) = (0.4, 0.4, 0.2)\). Spatial parameters (direction and distance) receive higher weights than safety alerts, as they are direct determinants of navigation success.
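A minimal sketch of the three rewards follows, assuming an NLTK-based METEOR implementation and the action schema sketched earlier; the helper names are illustrative, not from the paper's code.

```python
import re
from nltk.translate.meteor_score import meteor_score  # needs nltk wordnet data

W_DIR, W_DIST, W_ALERT = 0.4, 0.4, 0.2  # weights reported in the paper

def format_reward(output: str) -> float:
    """1 if the output follows <think>...</think><answer>...</answer>, else 0."""
    pattern = r"<think>.*?</think>\s*<answer>.*?</answer>"
    return 1.0 if re.fullmatch(pattern, output.strip(), re.DOTALL) else 0.0

def meteor_reward(answer: str, reference: str) -> float:
    """Semantic-overlap reward; NLTK expects pre-tokenized inputs."""
    return meteor_score([reference.split()], answer.split())

def laf_reward(action: dict, ref: dict) -> float:
    """Weighted exact-match of the interpreted action against ground truth."""
    return (
        W_DIR * (action["move"]["direction"] == ref["move"]["direction"])
        + W_DIST * (action["move"]["distance"] == ref["move"]["distance"])
        + W_ALERT * (action["detailed_hazard_alert"] == ref["detailed_hazard_alert"])
    )
```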

Loss & Training

The standard GRPO objective is adopted. For each query, \(G=8\) outputs are sampled and within-group relative advantages are computed:

\[\mathcal{J}_{\text{GRPO}}(\theta) = \mathbb{E}_{q,\, \{o_i\} \sim \pi_{\theta_{\text{old}}}} \left[ \frac{1}{G} \sum_{i=1}^{G} \mathcal{L}_i - \beta\, \mathbb{D}_{\text{KL}}\left(\pi_\theta \,\|\, \pi_{\text{ref}}\right) \right]\]

where \(\mathcal{L}_i\) is the clipped policy-ratio surrogate for output \(o_i\), weighted by its group-normalized advantage.

Training is conducted on a single NVIDIA H20 GPU (96 GB), with approximately 15 hours required for 3k training samples. Two training modes are supported:

  • Zero-(LaF-GRPO): LaF-GRPO applied directly to the base model.
  • SFT+(LaF-GRPO): SFT followed by LaF-GRPO.
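For reference, the group-relative advantage at the core of GRPO can be sketched as below. This is the generic GRPO computation, not code from the paper, and the additive combination of the three rewards is an assumption.

```python
import torch

def total_reward(r_format: float, r_meteor: float, r_laf: float) -> float:
    # Additive combination is assumed; the paper states the three rewards
    # operate jointly but the aggregation is not spelled out here.
    return r_format + r_meteor + r_laf

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Normalize each sampled output's reward by the mean/std of its group
    of G = 8 samples for the same query. rewards: shape (num_queries, G)."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)
```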

Key Experimental Results

NIG4VI Benchmark Dataset

The dataset is collected in the CARLA simulator under diverse environmental and weather conditions. The 27k samples span 6 towns. The training set comprises 1,500 samples (Town01); the test set is split into Intra-town (613) and Inter-town (11,223) subsets. Two versions are provided: with and without pre-calculated spatial information, enabling evaluation of model reasoning at different levels of abstraction.

Main Results

All rows: Intra-town split, without pre-calculation.

| Model / Method | BLEU↑ | ROUGE↑ | METEOR↑ | SPICE↑ |
|---|---|---|---|---|
| GPT-4o (Zero-Shot) | 1.748 | 0.169 | 0.249 | 0.149 |
| Claude-3.5 (Zero-Shot) | 2.803 | 0.216 | 0.304 | 0.211 |
| Gemini-2 (Zero-Shot) | 4.105 | 0.236 | 0.232 | 0.232 |
| Qwen-VL-7B (Zero-Shot) | 3.204 | 0.202 | 0.211 | 0.166 |
| Qwen-VL-7B Zero-(LaF-GRPO) | 3.272 | 0.234 | 0.256 | 0.222 |
| Qwen-VL-7B SFT | 9.937 | 0.291 | 0.518 | 0.275 |
| Qwen-VL-7B SFT+(LaF-GRPO) | 10.037 | 0.284 | 0.545 | 0.283 |
| Qwen-VL-3B SFT+(LaF-GRPO) | 10.921 | 0.323 | 0.528 | 0.274 |

Key Findings: SFT+(LaF-GRPO) achieves a METEOR of 0.542 on the Inter-town split, substantially outperforming GPT-4o (0.323). Furthermore, instructions generated by LaF-GRPO are more concise (34.1 tokens vs. 117.9 tokens for GPT-4o).

Ablation Study

| Reward Configuration | BLEU↑ | ROUGE↑ | METEOR↑ | SPICE↑ | Note |
|---|---|---|---|---|---|
| Format only | 10.251 | 0.318 | 0.524 | 0.278 | Format reward only |
| Format + METEOR | 10.912 | 0.317 | 0.525 | 0.279 | + text generation reward |
| Format + METEOR + LaF | 10.921 | 0.323 | 0.528 | 0.274 | Full LaF-GRPO |

Training data size ablation (7B model): scaling from 1k to 2k to 3k samples improves METEOR from 0.529 to 0.545, indicating high data efficiency.

Key Findings

  1. Zero-(LaF-GRPO) substantially outperforms Zero-Shot: BLEU improves by approximately 14%, validating the immediate effectiveness of LaF-GRPO.
  2. SFT+(LaF-GRPO) achieves state-of-the-art performance: surpassing strong commercial models including GPT-4o and Claude-3.5.
  3. LaF reward vs. standard GRPO: In a human preference study, 76% of participants preferred instructions generated by LaF-GRPO (Cohen's κ = 0.83).
  4. Safer instructions: LaF-GRPO produces safety-oriented prompts such as "probe the left side with your cane" and "listen for traffic sounds."

Highlights & Insights

  1. The LLM-as-Follower concept is highly innovative — leveraging an LLM to simulate the cognitive and behavioral patterns of a specific user group offers a low-cost alternative to RLHF.
  2. Theory of Mind (ToM) in practice: Having the LLM model the cognitive map of visually impaired users represents an exemplary application of ToM in assistive technology.
  3. Ergonomic reward design: The higher weights assigned to direction and distance (0.4 each) relative to safety alerts (0.2) reflect the actual priority ordering in navigation tasks.
  4. The clock-face direction system (e.g., "1 o'clock direction") is more intuitive than angular representations, constituting a user-centered design choice for visually impaired individuals.
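As a small illustration of this design choice (assumed, not from the paper), a relative bearing in degrees can be mapped to the nearest clock-face direction as follows:

```python
# Illustrative helper (assumed, not from the paper): map a relative bearing
# to the clock-face direction used in the generated instructions.
def heading_to_clock(angle_deg: float) -> str:
    """0 degrees = straight ahead = 12 o'clock; positive angles turn clockwise
    (to the user's right), with each clock hour spanning 30 degrees."""
    hour = round((angle_deg % 360) / 30) % 12
    return f"{12 if hour == 0 else hour} o'clock"

# Example: heading_to_clock(30.0) -> "1 o'clock"
```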

Limitations & Future Work

  1. Validation is conducted exclusively in a simulated environment (CARLA); real-world testing has not been performed.
  2. Proxy users, rather than actual visually impaired individuals, participated in the evaluation, which may introduce bias relative to real users' perception and needs.
  3. Generalizability of the Action Interpreter: Whether the 98% parsing accuracy can be maintained in more complex real-world scenarios remains an open question.
  4. Language diversity: The current framework supports English only; multilingual extension is an important future direction.

Broader Takeaways

  • Domain-specific applications of GRPO: AlphaDrive (autonomous driving), MedVLM-R1 (medical imaging), and the present work (assistive technology for the visually impaired) collectively demonstrate the broad applicability of GRPO.
  • Dataset design: The dual-version design of NIG4VI (with/without pre-computation) is worth emulating, as it enables evaluation of model reasoning at different levels of abstraction.
  • This work may inspire extensions of the LaF concept to other assistive technologies (e.g., hearing-impaired assistance, navigation for elderly users).

Rating

  • Novelty: ⭐⭐⭐⭐⭐ — The LLM-as-Follower concept is original; this represents the first application of GRPO to assistive technology for the visually impaired.
  • Experimental Thoroughness: ⭐⭐⭐⭐ — Multi-model and multi-paradigm comparisons are comprehensive, though real-world and real-user experiments are absent.
  • Writing Quality: ⭐⭐⭐⭐ — The structure is clear and mathematical formulations are well presented.
  • Value: ⭐⭐⭐⭐⭐ — The work carries significant practical implications for assistive technology for the visually impaired.