LaF-GRPO: In-Situ Navigation Instruction Generation for the Visually Impaired via GRPO with LLM-as-Follower Reward

Conference: AAAI 2026
arXiv: 2506.04070
Code: https://github.com/YiyiyiZhao/NIG4VI
Area: Robotics
Keywords: Visually impaired navigation, GRPO, VLM post-training, LLM-as-Follower, navigation instruction generation

TL;DR

This paper proposes the LaF-GRPO framework, which employs an LLM to simulate visually impaired users' responses to navigation instructions and uses these simulated responses as a reward signal. Through GRPO-based post-training of a VLM with this reward, the framework generates more precise and safer navigation instructions for the visually impaired. The authors also construct the NIG4VI benchmark dataset comprising 27k samples.

Background & Motivation

Approximately 2.2 billion people worldwide are affected by visual impairment. Navigation instruction generation for the visually impaired (NIG-VI) is a critical yet under-studied area. Unlike navigation instruction generation targeting general embodied agents, NIG-VI is human-centric and demands instructions that: (1) incorporate non-visual cues (e.g., sound and touch), (2) provide precise directional and distance guidance (e.g., clock-face direction combined with step count), and (3) include adaptive safety warnings for obstacles.

Limitations of Prior Work:

  • Early methods (e.g., ASSISTER) are constrained by the BERT architecture, resulting in limited generative capacity.
  • VLM + GRPO paradigms show promise, but their reward signals typically require large volumes of human feedback data, incurring high collection costs.
  • Existing datasets are largely closed-source, lack precise spatial coordinates, or are too small in scale.

Core Motivation: Can an LLM substitute for real visually impaired users by simulating their comprehension and execution of navigation instructions, thereby providing low-cost reward feedback? This serves as the central innovation of the paper — the LLM-as-Follower (LaF) concept.

Method

Overall Architecture

LaF-GRPO is built upon the Speaker-Follower paradigm and Theory of Mind (ToM), consisting of two core components:

  1. Action Interpreter: An SFT-trained LLM (LLaMA-3-8B-Instruct) that simulates visually impaired users' responses to navigation instructions.
  2. Navigation Instruction Generator: A VLM (Qwen2.5-VL-3B/7B) trained via SFT + LaF-GRPO post-training to generate navigation instructions.

Key Designs

1. NIG-VI Task Formulation

At each step \(i\), the VLM receives a front-view image \(x_{\text{image}}^{(i)}\), the current pose \(x_{\text{pose}}^{(i)} = (x_{\text{loc}}^{(i)}, x_{\text{rot}}^{(i)})\), and the next target waypoint \(p_{i+1}\), and generates stepwise navigation instructions:

\[y_j^{(i)} \sim \pi_\theta\left(y_j^{(i)} \mid x_{\text{image}}^{(i)}, x_{\text{loc}}^{(i)}, x_{\text{rot}}^{(i)}, p_{i+1}, y_{<j}^{(i)}\right)\]

The path \(P = [p_1, \ldots, p_K]\) is generated by the A* algorithm.
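To make the task inputs concrete, below is a hypothetical sketch of how one step's pose and next waypoint might be packaged into the text prompt that accompanies the front-view image. The prompt wording and field names are assumptions for illustration, not the paper's released format.

```python
# Hypothetical prompt construction for one NIG-VI step (wording and field
# names are assumed for illustration; not taken from the paper's code).
def build_step_prompt(loc: tuple, rot_deg: float, next_waypoint: tuple) -> str:
    """Combine the agent's pose and the next A*-planned waypoint into the
    text side of the VLM input; the front-view image is passed separately."""
    return (
        "You are guiding a visually impaired pedestrian.\n"
        f"Current location (x, y): {loc}; heading: {rot_deg} degrees.\n"
        f"Next waypoint: {next_waypoint}.\n"
        "Give one step of instructions using a clock-face direction, an "
        "approximate distance, and safety warnings for nearby obstacles."
    )
```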

2. Action Interpreter

The core idea is to have the LLM role-play as a visually impaired user — lacking a visual encoder, it can only "listen" to instructions and then predict likely user actions. It outputs a structured dictionary \(\mathcal{A}\) containing:

  • move: A movement action with direction (clock-face direction) and distance parameters.
  • detailed_hazard_alert: A boolean flag indicating whether the user perceives an obstacle warning.

Training data are derived from ground-truth instruction–action pairs in NIG4VI, achieving a parsing accuracy of >98% on the validation set.
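As a minimal sketch, the interpreter's reply could be validated against the expected schema as follows; the key names and clock-face strings are inferred from the description above, not taken from the released code.

```python
import json

# Clock-face directions the "move" action may use (format assumed).
CLOCK_DIRECTIONS = {f"{h} o'clock" for h in range(1, 13)}

def parse_action(raw: str):
    """Parse the Action Interpreter's reply into the structured dictionary A;
    return None on any schema violation (such failures count against the
    >98% parsing accuracy)."""
    try:
        action = json.loads(raw)
        move = action["move"]
        assert move["direction"] in CLOCK_DIRECTIONS
        float(move["distance"])           # numeric distance parameter
        action["detailed_hazard_alert"]   # boolean warning flag must be present
        return action
    except (json.JSONDecodeError, KeyError, TypeError, ValueError, AssertionError):
        return None
```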

3. LaF-GRPO Reward Function

Three reward functions operate jointly:

Format reward (\(r_{\text{format}} \in \{0, 1\}\)): Checks whether the output conforms to the <think>...</think><answer>...</answer> format.

Text generation reward (\(r_{\text{meteor}}\)): Computes the METEOR score between the output and the ground truth to evaluate semantic overlap.

LLM-as-Follower reward (\(r_{\text{LaF}}\)): The VLM-generated instruction is fed into the Action Interpreter, and the interpreted action is compared against the ground-truth action:

\[r_{\text{LaF}} = w_{\text{dir}} \delta(a_{\text{dir}}, a_{\text{dir}}^{\text{ref}}) + w_{\text{dist}} \delta(a_{\text{dist}}, a_{\text{dist}}^{\text{ref}}) + w_{\text{alert}} \delta(a_{\text{alert}}, a_{\text{alert}}^{\text{ref}})\]

where \(\delta(\cdot)\) denotes exact match, and weights are set as \((w_{\text{dir}}, w_{\text{dist}}, w_{\text{alert}}) = (0.4, 0.4, 0.2)\). Spatial parameters (direction and distance) receive higher weights than safety alerts, as they are direct determinants of navigation success.
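A minimal sketch of the three rewards follows, assuming an NLTK-based METEOR implementation and the action schema sketched earlier; the helper names are illustrative, not from the paper's code.

```python
import re
from nltk.translate.meteor_score import meteor_score  # needs nltk wordnet data

W_DIR, W_DIST, W_ALERT = 0.4, 0.4, 0.2  # weights reported in the paper

def format_reward(output: str) -> float:
    """1 if the output follows <think>...</think><answer>...</answer>, else 0."""
    pattern = r"<think>.*?</think>\s*<answer>.*?</answer>"
    return 1.0 if re.fullmatch(pattern, output.strip(), re.DOTALL) else 0.0

def meteor_reward(answer: str, reference: str) -> float:
    """Semantic-overlap reward; NLTK expects pre-tokenized inputs."""
    return meteor_score([reference.split()], answer.split())

def laf_reward(action: dict, ref: dict) -> float:
    """Weighted exact-match of the interpreted action against ground truth."""
    return (
        W_DIR * (action["move"]["direction"] == ref["move"]["direction"])
        + W_DIST * (action["move"]["distance"] == ref["move"]["distance"])
        + W_ALERT * (action["detailed_hazard_alert"] == ref["detailed_hazard_alert"])
    )
```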

Loss & Training

The standard GRPO objective is adopted. For each query, \(G=8\) outputs are sampled and within-group relative advantages are computed:

\[\mathcal{J}_{\text{GRPO}}(\theta) = \mathbb{E}_{q,\, \{o_i\} \sim \pi_{\theta_{\text{old}}}} \left[ \frac{1}{G} \sum_{i=1}^{G} \mathcal{L}_i - \beta\, \mathbb{D}_{\text{KL}}\left(\pi_\theta \,\|\, \pi_{\text{ref}}\right) \right]\]

where \(\mathcal{L}_i\) is the clipped policy-ratio surrogate for output \(o_i\), weighted by its group-normalized advantage.

Training is conducted on a single NVIDIA H20 GPU (96 GB), with approximately 15 hours required for 3k training samples. Two training modes are supported:

  • Zero-(LaF-GRPO): LaF-GRPO applied directly to the base model.
  • SFT+(LaF-GRPO): SFT followed by LaF-GRPO.
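For reference, the group-relative advantage at the core of GRPO can be sketched as below. This is the generic GRPO computation, not code from the paper, and the additive combination of the three rewards is an assumption.

```python
import torch

def total_reward(r_format: float, r_meteor: float, r_laf: float) -> float:
    # Additive combination is assumed; the paper states the three rewards
    # operate jointly but the aggregation is not spelled out here.
    return r_format + r_meteor + r_laf

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Normalize each sampled output's reward by the mean/std of its group
    of G = 8 samples for the same query. rewards: shape (num_queries, G)."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)
```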

Key Experimental Results

NIG4VI Benchmark Dataset

The dataset is collected in the CARLA simulator under diverse environmental and weather conditions. The 27k samples span 6 towns. The training set comprises 1,500 samples (Town01); the test set is split into Intra-town (613) and Inter-town (11,223) subsets. Two versions are provided: with and without pre-calculated spatial information, enabling evaluation of model reasoning at different levels of abstraction.

Main Results

All rows: Intra-town split, without pre-calculation.

| Model / Method | BLEU↑ | ROUGE↑ | METEOR↑ | SPICE↑ |
|---|---|---|---|---|
| GPT-4o (Zero-Shot) | 1.748 | 0.169 | 0.249 | 0.149 |
| Claude-3.5 (Zero-Shot) | 2.803 | 0.216 | 0.304 | 0.211 |
| Gemini-2 (Zero-Shot) | 4.105 | 0.236 | 0.232 | 0.232 |
| Qwen-VL-7B (Zero-Shot) | 3.204 | 0.202 | 0.211 | 0.166 |
| Qwen-VL-7B Zero-(LaF-GRPO) | 3.272 | 0.234 | 0.256 | 0.222 |
| Qwen-VL-7B SFT | 9.937 | 0.291 | 0.518 | 0.275 |
| Qwen-VL-7B SFT+(LaF-GRPO) | 10.037 | 0.284 | 0.545 | 0.283 |
| Qwen-VL-3B SFT+(LaF-GRPO) | 10.921 | 0.323 | 0.528 | 0.274 |

Key Findings: SFT+(LaF-GRPO) achieves a METEOR of 0.542 on the Inter-town split, substantially outperforming GPT-4o (0.323). Furthermore, instructions generated by LaF-GRPO are more concise (34.1 tokens vs. 117.9 tokens for GPT-4o).

Ablation Study

| Reward Configuration | BLEU↑ | ROUGE↑ | METEOR↑ | SPICE↑ | Note |
|---|---|---|---|---|---|
| Format only | 10.251 | 0.318 | 0.524 | 0.278 | Format reward only |
| Format + METEOR | 10.912 | 0.317 | 0.525 | 0.279 | + text generation reward |
| Format + METEOR + LaF | 10.921 | 0.323 | 0.528 | 0.274 | Full LaF-GRPO |

Training data size ablation (7B model): scaling from 1k to 2k to 3k samples improves METEOR from 0.529 to 0.545, indicating high data efficiency.

Key Findings

  1. Zero-(LaF-GRPO) substantially outperforms Zero-Shot: BLEU improves by approximately 14%, validating the immediate effectiveness of LaF-GRPO.
  2. SFT+(LaF-GRPO) achieves state-of-the-art performance: surpassing strong commercial models including GPT-4o and Claude-3.5.
  3. LaF reward vs. standard GRPO: In a human preference study, 76% of participants preferred instructions generated by LaF-GRPO (Cohen's κ = 0.83).
  4. Safer instructions: LaF-GRPO produces safety-oriented prompts such as "probe the left side with your cane" and "listen for traffic sounds."

Highlights & Insights

  1. The LLM-as-Follower concept is highly innovative — leveraging an LLM to simulate the cognitive and behavioral patterns of a specific user group offers a low-cost alternative to RLHF.
  2. Theory of Mind (ToM) in practice: Having the LLM model the cognitive map of visually impaired users represents an exemplary application of ToM in assistive technology.
  3. Ergonomic reward design: The higher weights assigned to direction and distance (0.4 each) relative to safety alerts (0.2) reflect the actual priority ordering in navigation tasks.
  4. The clock-face direction system (e.g., "1 o'clock direction") is more intuitive than angular representations, constituting a user-centered design choice for visually impaired individuals.
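As a small illustration of this design choice (assumed, not from the paper), a relative bearing in degrees can be mapped to the nearest clock-face direction as follows:

```python
# Illustrative helper (assumed, not from the paper): map a relative bearing
# to the clock-face direction used in the generated instructions.
def heading_to_clock(angle_deg: float) -> str:
    """0 degrees = straight ahead = 12 o'clock; positive angles turn clockwise
    (to the user's right), with each clock hour spanning 30 degrees."""
    hour = round((angle_deg % 360) / 30) % 12
    return f"{12 if hour == 0 else hour} o'clock"

# Example: heading_to_clock(30.0) -> "1 o'clock"
```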

Limitations & Future Work

  1. Validation is conducted exclusively in a simulated environment (CARLA); real-world testing has not been performed.
  2. Proxy users, rather than actual visually impaired individuals, participated in the evaluation, which may introduce bias relative to real users' perception and needs.
  3. Generalizability of the Action Interpreter: Whether the 98% parsing accuracy can be maintained in more complex real-world scenarios remains an open question.
  4. Language diversity: The current framework supports English only; multilingual extension is an important future direction.

Broader Takeaways

  • Domain-specific applications of GRPO: AlphaDrive (autonomous driving), MedVLM-R1 (medical imaging), and the present work (assistive technology for the visually impaired) collectively demonstrate the broad applicability of GRPO.
  • Dataset design: The dual-version design of NIG4VI (with/without pre-computation) is worth emulating, as it enables evaluation of model reasoning at different levels of abstraction.
  • This work may inspire extensions of the LaF concept to other assistive technologies (e.g., hearing-impaired assistance, navigation for elderly users).

Rating

  • Novelty: ⭐⭐⭐⭐⭐ — The LLM-as-Follower concept is original; this represents the first application of GRPO to assistive technology for the visually impaired.
  • Experimental Thoroughness: ⭐⭐⭐⭐ — Multi-model and multi-paradigm comparisons are comprehensive, though real-world and real-user experiments are absent.
  • Writing Quality: ⭐⭐⭐⭐ — The structure is clear and mathematical formulations are well presented.
  • Value: ⭐⭐⭐⭐⭐ — The work carries significant practical implications for assistive technology for the visually impaired.