Explore-on-Graph: Incentivizing Autonomous Exploration of LLMs on Knowledge Graphs

Conference: ICLR 2026 | arXiv: 2602.21728 | Code: Available | Area: Graph Learning | Keywords: Knowledge Graph Question Answering, Autonomous Exploration, Reinforcement Learning, Path-refined Reward, GRPO

TL;DR

This paper proposes Explore-on-Graph (EoG), which leverages SFT followed by two-phase reinforcement learning (outcome reward + path-refined reward) to incentivize LLMs to autonomously explore reasoning paths on knowledge graphs beyond the training distribution, surpassing GPT-5 and Gemini 2.5 Pro on five KGQA benchmarks.

Background & Motivation

  • LLMs are prone to hallucination and knowledge gaps in QA tasks; knowledge graphs (KGs) serve as ideal grounding sources for factual verification.
  • Existing methods fall into two categories, both with generalization limitations:
    • Rule-based methods (e.g., ToG, DoG): reasoning is constrained by predefined rules; training-free, but unable to handle out-of-distribution patterns.
    • Imitation-based methods (e.g., RoG, KG-Agent): imitate the reasoning patterns seen in training data and struggle to generalize to novel paths.
  • Key insight: real-world KG reasoning involves atypical paths (e.g., indirect relations via colleague or subsidiary), requiring autonomous exploration.
  • Example: common path "Google→employee→Carol→lives_in→London"; atypical path "Google→employee→Bob→colleague→John→resides_at→London".

Method

Overall Architecture

Two-stage training:

  1. SFT Stage: trains the LLM on long-CoT data to acquire structured graph reasoning capabilities (cold start).
  2. RL Stage:
    • Phase 1: GRPO + outcome reward (answer F1 score) to incentivize discovery of correct paths.
    • Phase 2: GRPO + joint reward (outcome + path-refined) to improve exploration efficiency and semantic quality.

Key Designs

Long-CoT Dataset Construction:

  • Carefully designed prompts require structured, logically rigorous, and KG-aligned reasoning.
  • Gemini 2.5 Flash is used for knowledge distillation to generate reasoning paths (think tags) and final answers (answer tags).
  • Additional rule-based filtering removes substandard reasoning traces, ensuring structural and factual correctness (a minimal filter sketch follows this list).
  • Directly tasking an LLM to explore KGs faces a vast action space and extreme reward sparsity; the SFT cold start is therefore indispensable.
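To make the filtering step concrete, here is a minimal sketch of what such a rule-based filter could look like. The tag names follow the think/answer tags mentioned above; the `|`-separated answer format and the exact checks are assumptions for illustration, not the paper's implementation.

```python
import re

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)

def keep_trace(trace: str, gold_answers: set[str]) -> bool:
    """Rule-based filter for a distilled long-CoT trace.

    Keeps a trace only if (1) it contains exactly one well-formed think tag
    and one answer tag (structural correctness), and (2) the predicted
    answers overlap with the gold answer set (factual correctness).
    The '|'-separated answer format is an illustrative assumption.
    """
    thinks = THINK_RE.findall(trace)
    answers = ANSWER_RE.findall(trace)
    if len(thinks) != 1 or len(answers) != 1:
        return False                       # structurally malformed trace
    predicted = {a.strip() for a in answers[0].split("|") if a.strip()}
    return bool(predicted & gold_answers)  # factually correct final answer
```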

Phase 1: Outcome Reward:

  • The GRPO algorithm samples \(S\) exploration paths per question.
  • Reward = entity-level F1 score (rather than simple Hit@1), better suited for multi-answer scenarios (a minimal sketch follows this list).
  • Paths that fail to produce a correctly formatted answer tag automatically receive a reward of 0, implicitly encouraging proper formatting.
  • Policy optimization is driven by within-group relative advantage normalization in GRPO.
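A minimal sketch of the entity-level F1 outcome reward, assuming the predicted answers are parsed from the answer tag as a `|`-separated list (that delimiter is an illustrative assumption):

```python
def outcome_reward(prediction: str | None, gold: set[str]) -> float:
    """Entity-level F1 between predicted and gold answer sets (Phase 1 reward).

    `prediction` is the content of the answer tag, or None if the rollout
    produced no well-formed tag; such rollouts receive a reward of 0.
    """
    if prediction is None:
        return 0.0
    pred = {e.strip().lower() for e in prediction.split("|") if e.strip()}
    gold_norm = {e.lower() for e in gold}
    tp = len(pred & gold_norm)             # correctly predicted entities
    if tp == 0 or not pred or not gold_norm:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(gold_norm)
    return 2 * precision * recall / (precision + recall)
```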

Phase 2: Path-refined Reward:

Ground-truth path acquisition (Search-and-Verify Pipeline):

  1. Identify topic entities in the question and answer entities in the KG.
  2. BFS searches all connecting paths (under a maximum hop constraint) to ensure high recall (see the sketch after this list).
  3. An LLM (Gemini-2.5-Flash) semantically verifies whether each path matches the question intent, filtering out spurious topological connections.
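A sketch of the BFS step, assuming the KG is represented as an adjacency map from an entity to its outgoing (relation, object) edges; the data structure and the default hop limit are assumptions, and the LLM verification step happens on the returned paths afterwards.

```python
from collections import deque

def bfs_paths(kg: dict[str, list[tuple[str, str]]],
              topic: str, answers: set[str], max_hops: int = 4):
    """Enumerate all relation paths from a topic entity to any answer entity
    within `max_hops` hops, for high recall. Each returned path is a list of
    (subject, relation, object) triples."""
    results = []
    queue = deque([(topic, [])])           # (current entity, path so far)
    while queue:
        entity, path = queue.popleft()
        if entity in answers and path:
            results.append(path)           # candidate ground-truth path
        if len(path) >= max_hops:
            continue                       # respect the hop constraint
        for relation, obj in kg.get(entity, []):
            queue.append((obj, path + [(entity, relation, obj)]))
    return results
```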

Path reward computation: checks whether every triple \((s, r, o)\) in the ground-truth path appears as a substring in the generated reasoning text, and uses the matching ratio as the path reward score.

Joint reward: \(R_{\text{joint}} = R_{\text{outcome}} + \alpha \cdot R_{\text{path}}\), where \(\alpha\) controls the weight of the path reward.
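A minimal sketch of the Phase 2 reward computation. Interpreting "appears as a substring" as requiring each element of a ground-truth triple to occur in the generated reasoning text is an assumption; the default \(\alpha\) value below is likewise hypothetical.

```python
def path_reward(reasoning_text: str, gt_path: list[tuple[str, str, str]]) -> float:
    """Fraction of ground-truth triples (s, r, o) whose elements all appear
    as substrings of the generated reasoning text (path-refined reward)."""
    if not gt_path:
        return 0.0
    hits = sum(
        all(part in reasoning_text for part in (s, r, o))
        for s, r, o in gt_path
    )
    return hits / len(gt_path)

def joint_reward(outcome: float, path: float, alpha: float = 0.5) -> float:
    """R_joint = R_outcome + alpha * R_path (alpha = 0.5 is a placeholder)."""
    return outcome + alpha * path
```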

Design Motivation for Two-Phase RL: Phase 1 establishes basic exploration capability via outcome reward; Phase 2 refines exploration quality and efficiency via path reward.

Loss & Training

  • SFT stage: standard language modeling loss (cross-entropy) trained on long-CoT data.
  • RL stage: GRPO objective with importance sampling ratio, clipping, and KL divergence regularization (written out below).
  • Backbone models: Qwen2.5-7B-Instruct and Llama-3.1-8B-Instruct.
  • SFT data distilled from Gemini-2.5-Flash; RL implemented with the verl framework using GRPO.
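For reference, the standard GRPO objective with these ingredients takes the following form; the clipping threshold \(\epsilon\), KL coefficient \(\beta\), and notation are the usual GRPO conventions rather than values quoted from the paper:

\[
\mathcal{J}_{\mathrm{GRPO}}(\theta)=\mathbb{E}\left[\frac{1}{S}\sum_{i=1}^{S}\frac{1}{|o_i|}\sum_{t=1}^{|o_i|}\min\!\Big(r_{i,t}(\theta)\,\hat{A}_{i},\ \mathrm{clip}\big(r_{i,t}(\theta),\,1-\epsilon,\,1+\epsilon\big)\,\hat{A}_{i}\Big)\right]-\beta\,D_{\mathrm{KL}}\big(\pi_{\theta}\,\|\,\pi_{\mathrm{ref}}\big)
\]

where \(r_{i,t}(\theta)=\pi_{\theta}(o_{i,t}\mid q,o_{i,<t})/\pi_{\theta_{\mathrm{old}}}(o_{i,t}\mid q,o_{i,<t})\) is the importance sampling ratio and \(\hat{A}_{i}=\big(R_i-\mathrm{mean}(\{R_j\}_{j=1}^{S})\big)/\mathrm{std}(\{R_j\}_{j=1}^{S})\) is the group-normalized advantage over the \(S\) sampled rollouts.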

Key Experimental Results

Main Results (Five KGQA Benchmarks)

| Method | Model | WebQSP Hit@1 | CWQ Hit@1 | GrailQA Hit@1 | QALD10 Hit@1 | 2WikiMH Hit@1 |
| --- | --- | --- | --- | --- | --- | --- |
| DoG | Llama-3.1-8B | 91.4 | 76.2 | - | - | 84.1 |
| GCR | Llama-3.1-8B | 92.2 | 75.8 | - | - | - |
| GPT-5 | - | 86.1 | 74.1 | 90.5 | 59.2 | 84.2 |
| Gemini-2.5-Pro | - | 92.1 | 71.9 | 91.6 | 58.6 | 85.1 |
| EoG | Qwen2.5-7B | 90.7 | 82.7 | 91.7 | 67.3 | 83.9 |
| EoG | Llama-3.1-8B | 92.8 | 86.6 | 92.1 | 70.6 | 85.3 |

EoG (Llama-3.1-8B) achieves 86.6 Hit@1 on CWQ, substantially outperforming GPT-5 (74.1) and Gemini-2.5-Pro (71.9).

| Complex Reasoning (CWQ F1) | Conjunction | Superlative | 1-hop | ≥4-hop |
| --- | --- | --- | --- | --- |
| GCR | 63.7 | 52.6 | 66.3 | 45.8 |
| DoG | 53.3 | 45.9 | 50.3 | 46.7 |
| EoG | 70.2 | 64.7 | 76.2 | 69.6 |

EoG exhibits the largest advantage in the most challenging ≥4-hop reasoning setting (69.6 vs. 45.8/46.7).

Ablation Study

| Variant | CWQ Hit@1 | CWQ F1 | WebQSP Hit@1 | WebQSP F1 |
| --- | --- | --- | --- | --- |
| EoG (Full) | 82.6 | 73.9 | 92.8 | 81.3 |
| w/o Path Reward | 81.5 | 70.8 | 90.2 | 77.3 |
| w/o Outcome Reward | 62.7 | 51.4 | 65.5 | 56.2 |
| w/o SFT | 70.3 | 63.1 | 75.9 | 65.8 |
| w/o SFT + w/ ICL | 70.7 | 63.8 | 77.2 | 66.5 |

Key Findings

  1. The outcome reward is the most critical component; removing it drops CWQ Hit@1 from 82.6 to 62.7.
  2. The path reward improves exploration efficiency: it reduces output length (CWQ: 2067→1528 tokens) while improving comprehensiveness and relevance.
  3. SFT cold-start is indispensable; pure RL without SFT degrades performance substantially, and ICL cannot compensate.
  4. An excessively small \(\alpha\) leads to incorrect or meaningless paths, while an excessively large \(\alpha\) causes the model to neglect answer correctness.
  5. The path reward yields the greatest improvement on ≥4-hop reasoning, indicating that path-level signals are most critical for deep multi-hop inference.
  6. EoG leads comprehensively across a six-dimensional reasoning quality evaluation, particularly on reasoning depth and exploratory capability.

Highlights & Insights

  • Open-source 7–8B models trained via RL exploration surpass closed-source GPT-5 and Gemini-2.5-Pro, demonstrating the power of autonomous exploration.
  • The path-refined reward design is elegant: ground-truth paths are obtained via BFS + LLM semantic verification, and rewards are computed via triple substring matching.
  • The contrast between "exploration" and "imitation" capabilities is compelling: imitation is constrained by the training distribution, while exploration can discover out-of-distribution paths.
  • The curriculum design of two-phase RL (outcome reward first, then joint reward) is worth adopting in related settings.
  • The largest gains appear in the most challenging ≥4-hop and superlative scenarios, validating the value of exploration for complex reasoning.

Limitations & Future Work

  • Ground-truth path acquisition relies on BFS + LLM verification, which may be computationally expensive for very large-scale KGs.
  • The path reward is based on substring matching and may be sensitive to surface-form variation for the same entity.
  • Validation is limited to Freebase and Wikidata; domain-specific KGs (e.g., biomedical) remain untested.
  • Training data is distilled from Gemini-2.5-Flash, so data quality is bounded by the teacher model.
  • Adaptation to dynamic KGs (where knowledge evolves over time) has not been explored.
  • EoG aligns with the RL-based reasoning paradigm of DeepSeek-R1, extending it to the domain of graph-structured reasoning.
  • Rule-based methods (ToG, DoG) provide structural guarantees but lack flexibility; EoG implicitly learns structural constraints through RL.
  • The path-refined reward is conceptually similar to the process reward model (PRM), but more direct (based on factual matching rather than model judgment).
  • EoG offers inspiration for KG-enhanced RAG: exploration strategies can be introduced into knowledge-augmented retrieval.

Rating

  • Novelty: 4/5 (The idea of RL-based KG exploration is novel; the path-refined reward design is particularly original.)
  • Experimental Thoroughness: 5/5 (Five datasets, multiple baselines including closed-source models, detailed ablations and complex-scenario analyses.)
  • Writing Quality: 4/5 (Clear framework presentation; illustrative examples are effective.)
  • Value: 5/5 (Results showing an 8B model surpassing GPT-5 are highly convincing; practical utility is significant.)