LaF-GRPO: In-Situ Navigation Instruction Generation for the Visually Impaired via GRPO with LLM-as-Follower Reward¶
Conference: AAAI 2026
arXiv: 2506.04070
Code: https://github.com/YiyiyiZhao/NIG4VI
Area: Robotics
Keywords: Visually impaired navigation, GRPO, VLM post-training, LLM-as-Follower, navigation instruction generation
TL;DR¶
This paper proposes the LaF-GRPO framework, which employs an LLM to simulate the responses of visually impaired users to navigation instructions as a reward signal. By applying GRPO-based post-training to a VLM, the framework generates more precise and safer navigation instructions for the visually impaired. The authors also construct the NIG4VI benchmark dataset comprising 27k samples.
Background & Motivation¶
Approximately 2.2 billion people worldwide are affected by visual impairment. Navigation instruction generation for the visually impaired (NIG-VI) is a critical yet under-studied area. Unlike navigation instruction generation targeting general embodied agents, NIG-VI is human-centric and demands instructions that: (1) incorporate non-visual cues (e.g., sound and touch), (2) provide precise directional and distance guidance (e.g., clock-face direction combined with step count), and (3) include adaptive safety warnings for obstacles.
Limitations of Prior Work:
- Early methods (e.g., ASSISTER) are constrained by the BERT architecture, resulting in limited generative capacity.
- VLM + GRPO paradigms show promise but require large volumes of human feedback data, incurring high collection costs.
- Existing datasets are largely closed-source, lack precise spatial coordinates, or are too small in scale.
Core Motivation: Can an LLM substitute for real visually impaired users by simulating their comprehension and execution of navigation instructions, thereby providing low-cost reward feedback? This is the paper's central innovation: the LLM-as-Follower (LaF) concept.
Method¶
Overall Architecture¶
LaF-GRPO is built upon the Speaker-Follower paradigm and Theory of Mind (ToM), consisting of two core components:
- Action Interpreter: An SFT-trained LLM (LLaMA-3-8B-Instruct) that simulates visually impaired users' responses to navigation instructions.
- Navigation Instruction Generator: A VLM (Qwen2.5-VL-3B/7B) trained via SFT + LaF-GRPO post-training to generate navigation instructions.
Key Designs¶
1. NIG-VI Task Formulation¶
At each step \(i\), the VLM receives a front-view image \(x_{\text{image}}^{(i)}\), the current pose \(x_{\text{pose}}^{(i)} = (x_{\text{loc}}^{(i)}, x_{\text{rot}}^{(i)})\), and the next target waypoint \(p_{i+1}\), and generates the stepwise navigation instruction \(y^{(i)} = \text{VLM}(x_{\text{image}}^{(i)}, x_{\text{pose}}^{(i)}, p_{i+1})\).
The path \(P = [p_1, \ldots, p_K]\) is generated by the A* algorithm.
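Since the waypoint path comes from A*, a minimal grid-based A* sketch may be a useful refresher. The 4-connected toy grid, unit step costs, and Manhattan heuristic below are illustrative assumptions and are unrelated to the paper's CARLA maps:

```python
import heapq

def astar(grid, start, goal):
    """A* on a 4-connected grid; cells equal to 1 are obstacles.

    Returns the list of cells from start to goal (inclusive), or None.
    """
    rows, cols = len(grid), len(grid[0])

    def h(cell):  # Manhattan distance: admissible for unit-cost grid moves
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    open_heap = [(h(start), 0, start, None)]  # (f, g, cell, parent)
    came_from, g_cost = {}, {start: 0}
    while open_heap:
        _, g, cur, parent = heapq.heappop(open_heap)
        if cur in came_from:  # already expanded with an equal-or-better g
            continue
        came_from[cur] = parent
        if cur == goal:  # reconstruct the path by walking parents back
            path = []
            while cur is not None:
                path.append(cur)
                cur = came_from[cur]
            return path[::-1]
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nxt = (cur[0] + dr, cur[1] + dc)
            if 0 <= nxt[0] < rows and 0 <= nxt[1] < cols and grid[nxt[0]][nxt[1]] == 0:
                ng = g + 1
                if ng < g_cost.get(nxt, float("inf")):
                    g_cost[nxt] = ng
                    heapq.heappush(open_heap, (ng + h(nxt), ng, nxt, cur))
    return None  # goal unreachable
```

In the paper's setting the resulting cell sequence would correspond to the waypoints \(p_1, \ldots, p_K\) fed to the generator one step at a time.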
2. Action Interpreter¶
The core idea is to have the LLM role-play as a visually impaired user — lacking a visual encoder, it can only "listen" to instructions and then predict likely user actions. It outputs a structured dictionary \(\mathcal{A}\) containing:
- move: A movement action with direction (clock-face direction) and distance parameters.
- detailed_hazard_alert: A boolean flag indicating whether the user perceives an obstacle warning.
Training data are derived from ground-truth instruction–action pairs in NIG4VI, achieving a parsing accuracy of >98% on the validation set.
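A minimal sketch of validating the Action Interpreter's structured output into a canonical action dict. The `move` and `detailed_hazard_alert` keys come from the paper's description; the nested field names and example values are assumptions about the exact schema:

```python
def parse_action(raw: dict) -> dict:
    """Validate an interpreter output dict into a canonical action.

    Assumed input shape (exact schema is an assumption):
      {"move": {"direction": "1 o'clock", "distance": "5 steps"},
       "detailed_hazard_alert": True}
    """
    move = raw.get("move") or {}
    direction = str(move.get("direction", "")).strip().lower()
    distance = str(move.get("distance", "")).strip().lower()
    alert = bool(raw.get("detailed_hazard_alert", False))
    if not direction or not distance:
        raise ValueError("incomplete move action")
    return {"direction": direction, "distance": distance, "alert": alert}
```

Rejecting incomplete actions up front is one plausible way the >98% parsing accuracy on the validation set could be measured.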
3. LaF-GRPO Reward Function¶
Three reward functions operate jointly:
Format reward (\(r_{\text{format}} \in \{0, 1\}\)): Checks whether the output conforms to the <think>...</think><answer>...</answer> format.
Text generation reward (\(r_{\text{meteor}}\)): Computes the METEOR score between the output and the ground truth to evaluate semantic overlap.
LLM-as-Follower reward (\(r_{\text{LaF}}\)): The VLM-generated instruction is fed into the Action Interpreter, and the interpreted action is compared against the ground-truth action:

\(r_{\text{LaF}} = w_{\text{dir}}\,\delta(a_{\text{dir}}, a^{*}_{\text{dir}}) + w_{\text{dist}}\,\delta(a_{\text{dist}}, a^{*}_{\text{dist}}) + w_{\text{alert}}\,\delta(a_{\text{alert}}, a^{*}_{\text{alert}})\)

where \(\delta(\cdot)\) denotes exact match, \(a\) and \(a^{*}\) are the interpreted and ground-truth action fields, and the weights are set as \((w_{\text{dir}}, w_{\text{dist}}, w_{\text{alert}}) = (0.4, 0.4, 0.2)\). Spatial parameters (direction and distance) receive higher weights than safety alerts, as they are direct determinants of navigation success.
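The format and LaF rewards can be sketched as follows. The weights (0.4, 0.4, 0.2) and the <think>/<answer> template come from the paper; the regex, function names, and the flat action-dict shape are illustrative assumptions:

```python
import re

# Weights from the paper: spatial terms dominate the safety-alert term.
W_DIR, W_DIST, W_ALERT = 0.4, 0.4, 0.2

def format_reward(output: str) -> float:
    """1.0 iff the output matches <think>...</think><answer>...</answer>."""
    pattern = r"<think>.*?</think>\s*<answer>.*?</answer>"
    return 1.0 if re.fullmatch(pattern, output.strip(), flags=re.DOTALL) else 0.0

def laf_reward(pred: dict, gold: dict) -> float:
    """Weighted exact-match between interpreted and ground-truth actions.

    `pred`/`gold` are flat dicts with "direction", "distance", and
    "alert" keys (a simplified stand-in for the structured action dict).
    """
    r = 0.0
    r += W_DIR if pred["direction"] == gold["direction"] else 0.0
    r += W_DIST if pred["distance"] == gold["distance"] else 0.0
    r += W_ALERT if pred["alert"] == gold["alert"] else 0.0
    return r
```

A fully matching action yields 1.0; matching only the direction yields 0.4, reflecting the priority ordering discussed above.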
Loss & Training¶
The standard GRPO objective is adopted. For each query, \(G = 8\) outputs are sampled and within-group relative advantages are computed:

\(\hat{A}_{i} = \dfrac{r_{i} - \operatorname{mean}(\{r_{1}, \ldots, r_{G}\})}{\operatorname{std}(\{r_{1}, \ldots, r_{G}\})}\)

Training is conducted on a single NVIDIA H20 GPU (96 GB); 3k training samples take approximately 15 hours. Two training modes are supported:
- Zero-(LaF-GRPO): LaF-GRPO applied directly to the base model.
- SFT+(LaF-GRPO): SFT followed by LaF-GRPO.
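The within-group advantage normalization is the standard GRPO computation and can be sketched in a few lines (the function name and stabilizing epsilon are illustrative choices):

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-6):
    """Within-group relative advantages: (r_i - mean) / (std + eps).

    `rewards` holds the scores of the G outputs sampled for one query;
    eps guards against a zero std when all rewards in the group tie.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Because advantages are centered within each group, they sum to (approximately) zero: outputs better than the group average are reinforced, worse ones are penalized.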
Key Experimental Results¶
NIG4VI Benchmark Dataset¶
The dataset is collected in the CARLA simulator under diverse environmental and weather conditions. The 27k samples span 6 towns. The training set comprises 1,500 samples (Town01); the test set is split into Intra-town (613) and Inter-town (11,223) subsets. Both pre-computed and non-pre-computed versions are provided.
Main Results¶
| Model / Method | BLEU↑ | ROUGE↑ | METEOR↑ | SPICE↑ | Setting |
|---|---|---|---|---|---|
| GPT-4o (Zero-Shot) | 1.748 | 0.169 | 0.249 | 0.149 | Intra, w/o pre-cal |
| Claude-3.5 (Zero-Shot) | 2.803 | 0.216 | 0.304 | 0.211 | Intra, w/o pre-cal |
| Gemini-2 (Zero-Shot) | 4.105 | 0.236 | 0.232 | 0.232 | Intra, w/o pre-cal |
| Qwen-VL-7B (Zero-Shot) | 3.204 | 0.202 | 0.211 | 0.166 | Intra, w/o pre-cal |
| Qwen-VL-7B Zero-(LaF-GRPO) | 3.272 | 0.234 | 0.256 | 0.222 | Intra, w/o pre-cal |
| Qwen-VL-7B SFT | 9.937 | 0.291 | 0.518 | 0.275 | Intra, w/o pre-cal |
| Qwen-VL-7B SFT+(LaF-GRPO) | 10.037 | 0.284 | 0.545 | 0.283 | Intra, w/o pre-cal |
| Qwen-VL-3B SFT+(LaF-GRPO) | 10.921 | 0.323 | 0.528 | 0.274 | Intra, w/o pre-cal |
Key Findings: SFT+(LaF-GRPO) achieves a METEOR of 0.542 on the Inter-town split, substantially outperforming GPT-4o (0.323). Furthermore, instructions generated by LaF-GRPO are more concise (34.1 tokens vs. 117.9 tokens for GPT-4o).
Ablation Study¶
| Reward Configuration | BLEU↑ | ROUGE↑ | METEOR↑ | SPICE↑ | Note |
|---|---|---|---|---|---|
| Format only | 10.251 | 0.318 | 0.524 | 0.278 | Format reward only |
| Format + Meteor | 10.912 | 0.317 | 0.525 | 0.279 | With text generation reward |
| Format + Meteor + LaF | 10.921 | 0.323 | 0.528 | 0.274 | Full LaF-GRPO |
Training data size ablation (7B model): scaling from 1k to 2k to 3k samples improves METEOR from 0.529 to 0.545, indicating high data efficiency.
Key Findings¶
- Zero-(LaF-GRPO) substantially outperforms Zero-Shot: for Qwen-VL-7B, SPICE rises from 0.166 to 0.222 and ROUGE from 0.202 to 0.234, validating the immediate effectiveness of LaF-GRPO even without SFT.
- SFT+(LaF-GRPO) achieves state-of-the-art performance: surpassing strong commercial models including GPT-4o and Claude-3.5.
- LaF reward vs. standard GRPO: In a human preference study, 76% of participants preferred instructions generated by LaF-GRPO (Cohen's κ = 0.83).
- Safer instructions: LaF-GRPO produces safety-oriented prompts such as "probe the left side with your cane" and "listen for traffic sounds."
Highlights & Insights¶
- The LLM-as-Follower concept is highly innovative — leveraging an LLM to simulate the cognitive and behavioral patterns of a specific user group offers a low-cost alternative to RLHF.
- Theory of Mind (ToM) in practice: Having the LLM model the cognitive map of visually impaired users represents an exemplary application of ToM in assistive technology.
- Ergonomic reward design: The higher weights assigned to direction and distance (0.4 each) relative to safety alerts (0.2) reflect the actual priority ordering in navigation tasks.
- The clock-face direction system (e.g., "1 o'clock direction") is more intuitive than angular representations, constituting a user-centered design choice for visually impaired individuals.
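The clock-face convention can be illustrated with a small degrees-to-clock conversion; the rounding rule below is an assumed mapping for illustration, not the paper's exact procedure:

```python
def heading_to_clock(deg: float) -> str:
    """Map a relative heading in degrees (0 = straight ahead, clockwise
    positive) to the nearest clock-face direction; 12 o'clock = ahead."""
    hour = round((deg % 360) / 30) % 12  # 30 degrees per clock hour
    return f"{12 if hour == 0 else hour} o'clock"
```

For example, a 30-degree offset to the right becomes "1 o'clock", which is typically easier to act on than "turn 30 degrees".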
Limitations & Future Work¶
- Validation is conducted exclusively in a simulated environment (CARLA); real-world testing has not been performed.
- Proxy users rather than actual visually impaired individuals participated in the evaluation, so the judged comprehension may not faithfully reflect real users' cognition.
- Generalizability of the Action Interpreter: Whether the 98% parsing accuracy can be maintained in more complex real-world scenarios remains an open question.
- Language diversity: The current framework supports English only; multilingual extension is an important future direction.
Related Work & Insights¶
- Domain-specific applications of GRPO: AlphaDrive (autonomous driving), MedVLM-R1 (medical imaging), and the present work (assistive technology for the visually impaired) collectively demonstrate the broad applicability of GRPO.
- Dataset design: The dual-version design of NIG4VI (with/without pre-computation) is worth emulating, as it enables evaluation of model reasoning at different levels of abstraction.
- This work may inspire extensions of the LaF concept to other assistive technologies (e.g., hearing-impaired assistance, navigation for elderly users).
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ — The LLM-as-Follower concept is original; this represents the first application of GRPO to assistive technology for the visually impaired.
- Experimental Thoroughness: ⭐⭐⭐⭐ — Multi-model and multi-paradigm comparisons are comprehensive, though real-world and real-user experiments are absent.
- Writing Quality: ⭐⭐⭐⭐ — The structure is clear and mathematical formulations are well presented.
- Value: ⭐⭐⭐⭐⭐ — The work carries significant practical implications for assistive technology for the visually impaired.