Planning without Search: Refining Frontier LLMs with Offline Goal-Conditioned RL¶
- Conference: NeurIPS 2025
- arXiv: 2505.18098
- Code: Project Page
- Area: NLP Understanding / LLM Agent Planning
- Keywords: Offline Reinforcement Learning, Goal-Conditioned Value Function, LLM Agent Planning, Natural Language Critic, Multi-Turn Interactive Tasks
TL;DR¶
This paper proposes PNLC, a method that trains a lightweight goal-conditioned value function as a "natural language critic" to guide LLM agents in multi-turn planning and self-refinement at the thought-step level. Without fine-tuning the LLM or searching at inference time, PNLC significantly outperforms existing methods on complex interactive tasks such as web navigation, social reasoning, and persuasion, while running 8–10× faster at inference than search-based baselines.
Background & Motivation¶
Background: Complex, goal-oriented interactive tasks (e.g., negotiation, persuasion, social reasoning games) demand long-horizon reasoning and strategic behavior from LLMs. Existing approaches fall into two categories: (a) multi-turn RL fine-tuning, which is sample-inefficient and computationally expensive, and (b) inference-time search (e.g., MCTS), which requires many LLM calls and incurs high latency.
Limitations of Prior Work: RL fine-tuning cannot be applied to frontier models exposed only through an API (e.g., GPT-4o); MCTS-style search takes ~46 seconds per sample on WebShop; and LLM self-evaluation tends to be overly optimistic, making effective self-refinement difficult.
Key Challenge: How can LLM agents be endowed with long-horizon planning capabilities for complex interactive tasks without directly fine-tuning the LLM or substantially increasing inference cost?
Goal: Equip the LLM with a lightweight, learnable module that provides value estimates over multiple possible outcomes during inference, enabling effective self-refinement without touching the LLM's weights.
Key Insight: Rather than training a policy, train a critic. An offline RL approach is used to train a goal-conditioned value function, which is then deployed at inference time as a "natural language critic" supplying rich outcome evaluation to the LLM.
Core Idea: Train a lightweight MLP value function at the thought-step level to predict goal-achievement probability. At inference time, a natural language critic generates multiple positive/negative goals with associated probabilities to guide iterative self-refinement of high-level strategies — without any search.
Method¶
Overall Architecture¶
PNLC consists of two phases: offline training and inference-time planning. In the offline phase, trajectory data is processed via summarization → embedding → training the goal-conditioned value function. At inference time: current state + proposed thought → critic generates goals + values → LLM self-refines.
Formally, the MDP is defined as \(M=(\mathcal{S}, \mathcal{A}, P, r, \rho, \gamma)\). Agent actions \(a_t\) are decomposed into a thought \(a_t^{\text{tht}}\) and an environment action \(a_t^{\text{env}}\). The goal-conditioned Q-function \(Q(s, a^{\text{tht}}, g)\) predicts the probability of achieving goal \(g\) from state \(s\) after committing to thought \(a^{\text{tht}}\).
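To make the quantities in the training loss below concrete, the goal-conditioned reward can be read as a sparse goal-achievement indicator. This formalization follows the standard hindsight-relabeling convention and is an assumption here, not notation quoted from the paper:

\[
r(s, g) = \mathbb{1}\big[\, s \text{ achieves } g \,\big], \qquad
Q(s, a^{\text{tht}}, g) \approx \Pr\big[\, g \text{ is eventually achieved} \mid s, a^{\text{tht}} \,\big].
\]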
Key Designs¶
- Offline Goal-Conditioned Value Function Training:
  - Function: Learns a goal-conditioned value function from task-relevant trajectory datasets.
  - Mechanism: (a) trajectory summarization compresses full interaction histories into concise, decision-relevant descriptions; (b) embedding converts the text into low-dimensional vectors using GPT-3 embeddings; (c) goal sampling randomly draws future states from each trajectory as goals. Training follows the IQL algorithm with the loss \(L_Q = \mathbb{E}[(r(s,g) + \gamma\hat{V}(s',g) - Q(s,a^{\text{tht}},g))^2]\).
  - Design Motivation: Trajectory summarization reduces decision-space complexity; embeddings allow the value function to be only a 2-layer MLP (<1M parameters); random goal sampling enables multi-dimensional evaluation.
- Inference-Time Natural Language Critic:
  - Function: Generates natural language feedback about possible outcomes for the LLM.
  - Mechanism: (a) the LLM generates 4 hypothetical goals (2 positive + 2 negative); (b) the value function estimates the achievement probability of each goal; (c) the results are verbalized as natural language descriptions (e.g., "70% probability the user will agree, 30% risk of rejection"); (d) the LLM iteratively refines its thought based on this feedback (up to 2 rounds).
  - Design Motivation: Goal-conditioned value functions provide multi-dimensional feedback that is more informative than a scalar value; pairing positive and negative goals helps the LLM identify risks.
- Lightweight MLP Value Function:
  - Function: Supports fast training and inference via a minimal architecture.
  - Mechanism: The input is the concatenation of the state, thought, and goal embeddings; a fully connected network with two 128-unit hidden layers outputs a scalar probability.
  - Design Motivation: An LLM-scale Transformer cannot serve as a value function for API-only models; a <1M-parameter MLP is sufficiently expressive because the embeddings already encode the semantic information. (A code sketch of the critic and its inference-time loop follows this list.)
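To ground the designs above, here is a minimal PyTorch sketch of the critic and the inference-time refinement loop. The helper names (`llm`, `embed`, `refine_with_critic`) and the embedding width are illustrative assumptions, not the paper's actual interfaces:

```python
import torch
import torch.nn as nn

EMB_DIM = 1536  # assumed embedding width; the paper reports using GPT-3 text embeddings

class GoalConditionedQ(nn.Module):
    """Lightweight critic: (state, thought, goal) embeddings -> P(goal achieved)."""

    def __init__(self, emb_dim: int = EMB_DIM, hidden: int = 128):
        super().__init__()
        # Two 128-unit hidden layers, consistent with the <1M-parameter MLP described above
        self.net = nn.Sequential(
            nn.Linear(3 * emb_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, s, a_tht, g):
        x = torch.cat([s, a_tht, g], dim=-1)
        return torch.sigmoid(self.net(x)).squeeze(-1)  # probability in [0, 1]


def refine_with_critic(llm, embed, q_fn, state_text, thought_text, rounds=2):
    """PNLC-style refinement loop: propose goals, score them, revise the thought.

    `llm(prompt) -> str` and `embed(text) -> torch.Tensor` are hypothetical
    wrappers around an API model and an embedding endpoint.
    """
    s = embed(state_text)
    for _ in range(rounds):  # the paper caps refinement at 2 rounds
        a = embed(thought_text)
        # (a) the LLM proposes 2 positive and 2 negative hypothetical outcomes
        goals = llm(
            f"State: {state_text}\nPlan: {thought_text}\n"
            "List 2 desirable and 2 undesirable possible outcomes, one per line."
        ).splitlines()
        # (b, c) the critic scores each goal; scores are verbalized as feedback
        feedback = "; ".join(
            f"'{g_text}': {q_fn(s, a, embed(g_text)).item():.0%} likely"
            for g_text in goals if g_text.strip()
        )
        # (d) the LLM revises its high-level thought in light of the feedback
        thought_text = llm(
            f"Plan: {thought_text}\nCritic feedback: {feedback}\n"
            "Revise the plan to pursue likely positive outcomes and avoid risks."
        )
    return thought_text
```

With an embedding width of 1536, the first layer dominates the parameter count (3 × 1536 × 128 ≈ 590k), keeping the critic comfortably under 1M parameters.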
Loss & Training¶
A goal-conditioned variant of IQL (Implicit Q-Learning) is used. The Q-function is trained with MSE loss; the V-function uses expectile regression (\(\tau=0.8\)). Only 2.5k low-quality trajectories (generated by GPT-3.5) are required to train an effective critic.
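A hedged sketch of the corresponding training step, assuming sparse 0/1 rewards from hindsight goal sampling and the standard IQL expectile update; hyperparameters other than \(\tau=0.8\) are illustrative:

```python
import torch
import torch.nn.functional as F

TAU = 0.8     # expectile parameter from the paper
GAMMA = 0.99  # discount factor (assumed, not reported here)

def expectile_loss(diff, tau=TAU):
    """Asymmetric squared loss: the canonical IQL weight |tau - 1(diff < 0)|."""
    weight = torch.abs(tau - (diff < 0).float())
    return (weight * diff.pow(2)).mean()

def iql_losses(q_net, v_net, q_target, batch):
    """Goal-conditioned IQL losses on (s, a_tht, s', g, r, done) embedding tuples.

    Goals g are future states sampled in hindsight from the same trajectory,
    with r(s, g) = 1 when the goal is achieved and 0 otherwise (assumed).
    """
    s, a_tht, s_next, g, r, done = batch
    # V update: expectile regression of V(s, g) toward the target network's Q
    with torch.no_grad():
        q_t = q_target(s, a_tht, g)
    v_loss = expectile_loss(q_t - v_net(s, g))
    # Q update: MSE toward the one-step bootstrapped target r + gamma * V(s', g)
    with torch.no_grad():
        target = r + GAMMA * (1.0 - done) * v_net(s_next, g)
    q_loss = F.mse_loss(q_net(s, a_tht, g), target)
    return q_loss, v_loss
```

The target copy of the Q-network (`q_target`, e.g., Polyak-averaged) and the two optimizer steps are omitted for brevity.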
Key Experimental Results¶
Main Results¶
| Method | WebShop Score | Avalon Win Rate | Persuasion (Donation) | Inference Time |
|---|---|---|---|---|
| ReAct | 55.1 | 21.0% | 0.54 | 5s |
| Reflexion | 60.8 | 26.0% | 0.54 | ~15s |
| LATS (n=30) | 74.9 | 38.0% | 0.78 | ~46s |
| Agent Q (n=30) | 77.1 | — | — | ~46s |
| Online ArCHer | 62.3 | 19.0% | 0.36 | — |
| PNLC (Ours) | 78.2 | 47.0% | 0.87 | 5–6s |
Ablation Study¶
| Configuration | WebShop | Avalon | Persuasion |
|---|---|---|---|
| PNLC (full) | 78.2 | 47.0% | 0.87 |
| w/o goal conditioning (scalar value only) | 55.4 | 25.0% | 0.53 |
| w/o refinement step | 55.6 | 28.0% | 0.61 |
| ReAct+Replan (LLM self-evaluation) | 59.1 | 22.0% | 0.62 |
Key Findings¶
- PNLC achieves SOTA across all three diverse tasks with only 5–6s inference time, approximately 8× faster than LATS (n=30).
- Goal conditioning is critical: removing it reduces performance to be indistinguishable from ReAct (55.4 vs. 55.1), demonstrating that multi-dimensional goal feedback is essential.
- Offline learning outperforms LLM intuition: allowing the LLM to self-assess goal probabilities (ReAct+Replan) yields substantially lower performance than the data-driven critic (59.1 vs. 78.2), confirming that LLMs are overly optimistic about goal reachability in long-horizon tasks.
- RL fine-tuning fares poorly: Online ArCHer, which fine-tunes a smaller model, is the weakest method on Avalon and Persuasion, and on WebShop it trails the search-based methods and PNLC despite beating the prompting baselines.
Highlights & Insights¶
- "Train the critic, not the policy" paradigm: This elegantly circumvents the limitation of non-fine-tunable API models by shifting the learning burden to a lightweight module, offering substantial practical deployment value.
- Abstraction at the thought level: Learning the value function at the level of thoughts (high-level strategic intentions) rather than actions (raw text) substantially reduces decision-space complexity.
- Interpretable feedback via goal-conditioned value functions: Natural language descriptions of multiple positive/negative goals with associated probabilities are more amenable to LLM comprehension and utilization than scalar values.
Limitations & Future Work¶
- Task-specific value functions: A separate value function must be trained for each new task; cross-task transfer remains an open problem.
- Reliance on LLM goal generation and refinement: The approach may fail in specialized domains beyond the LLM's knowledge.
- No data sensitivity analysis: The minimum quantity and quality of trajectories required to train an effective critic remain uncharacterized.
- Value function calibration: Whether the probability estimates are reliable has not been analyzed.
Related Work & Insights¶
- vs. RL fine-tuning (ArCHer): The proposed method requires no LLM parameter updates, supports API-only models, and incurs orders-of-magnitude lower training cost.
- vs. inference-time search (LATS/MCTS): Inference cost stays near-constant (a handful of critic and LLM calls per step) instead of growing with search width and depth, an enormous practical advantage in deployment.
- vs. self-refinement (Reflexion): Reflexion requires multiple full trajectory rollouts, whereas PNLC needs only a single refinement step with a lightweight critic.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ — Pioneering "planning without search" paradigm combining critic learning with LLM planning; goal-conditioned value functions at the thought level are highly novel.
- Experimental Thoroughness: ⭐⭐⭐⭐ — Comprehensive evaluation across three diverse tasks with detailed ablations, though data sensitivity analysis is absent.
- Writing Quality: ⭐⭐⭐⭐ — Clear logic and intuitive figures.
- Value: ⭐⭐⭐⭐⭐ — Compatible with any API-accessible LLM; represents a breakthrough in inference efficiency with high practical deployment value.