Interactive and Expressive Code-Augmented Planning with Large Language Models¶

Conference: ACL 2025
arXiv: 2411.13826
Area: LLM/NLP
Keywords: REPL-Plan, LLM-REPL, code-augmented planning, interactive decision-making, task decomposition

TL;DR¶

This paper proposes REPL-Plan, a top-down planning approach that enables LLMs to interact with an extended REPL (Read-Eval-Print Loop). This method leverages the full expressiveness of code while enabling dynamic error correction and handling of vague subproblems, achieving robust performance on ALFWorld, WebShop, and real-world web navigation tasks.

Background & Motivation¶

LLM Planning Challenges: While LLMs exhibit strong capabilities in commonsense reasoning and interactive decision-making, they often make mistakes in complex, long-horizon planning tasks—frequently resulting in hallucinations and incorrect short-term decisions.

Limitations of Code-Augmented Planning: Prior works structure LLM outputs using code to improve planning, including utilizing variables to track information and decomposing subtasks with functions. However, pure code-based approaches suffer from three inherent issues:

- Vague Subproblems: Many tasks require handling unstructured data or subjective judgments (e.g., "purchase the item that best matches the user's description"), which are difficult to address directly with code.
- Bottom-Up Coding Style: Coding typically requires writing subprocess functions before dealing with the main task. This demands precise advance planning, which places high demands on generating accurate code in a single trial.
- Coding Errors: Even for skilled human programmers, writing bug-free code in one go is extremely challenging.

Inspiration from Human Developers: Real-world developers use REPLs (such as IPython or Jupyter Notebooks) for interactive programming: entering code line-by-line, inspecting results, and debugging errors. This interactive mode naturally supports dynamic error correction and exploratory development.

Method¶

Overall Architecture¶

The core of REPL-Plan is LLM-REPL—a recursive extension of REPL, where the LLM writes code line-by-line and interacts with the planning environment via code.

LLM-REPL Design¶

LLM-REPL adds three key primitive functions to the standard REPL:

Primitive	Function
`[subtask](args)`	Generates a child LLM-REPL to handle the subtask, reusing the historical execution state if it already exists
`get_args()`	Retrieves arguments passed by the parent REPL
`answer(a)`	Returns the result to the parent REPL, handing execution control back to the parent

Environment Interaction Primitives:

Primitive	Function
`act(a)`	Executes action \(a\) in the environment
`get_obs()`	Retrieves the current observation text
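
To make the five primitives concrete, the following is a minimal Python sketch of how they could be wired together. It is not the authors' implementation: `ToyEnv`, `ReplFrame`, and `namespace()` are hypothetical names introduced here for illustration, and the three "LLM-written" lines are hard-coded rather than generated.

```python
from typing import Any, Dict, List, Tuple


class ToyEnv:
    """Hypothetical text environment; a stand-in, not ALFWorld or WebShop."""

    def __init__(self) -> None:
        self.actions: List[str] = []
        self.obs = "You are in a kitchen."

    def step(self, action: str) -> None:
        self.actions.append(action)
        self.obs = f"You did: {action}"


class ReplFrame:
    """State of one REPL: its arguments, its answer, and the shared environment."""

    def __init__(self, env: ToyEnv, args: Tuple[Any, ...] = ()) -> None:
        self.env = env
        self.args = args
        self.result: Any = None
        self.done = False

    # Environment interaction primitives
    def act(self, a: str) -> None:
        self.env.step(a)

    def get_obs(self) -> str:
        return self.env.obs

    # Parent/child communication primitives
    def get_args(self) -> Tuple[Any, ...]:
        return self.args

    def answer(self, a: Any) -> None:
        self.result = a
        self.done = True  # execution control goes back to the parent

    def namespace(self) -> Dict[str, Any]:
        """Names visible to the code executed inside this REPL."""
        return {"act": self.act, "get_obs": self.get_obs,
                "get_args": self.get_args, "answer": self.answer}


env = ToyEnv()
child = ReplFrame(env, args=("mug",))
ns = child.namespace()

# Lines an LLM might enter one at a time (hard-coded for this sketch).
for line in ["(target,) = get_args()",
             "act(f'take {target}')",
             "answer(get_obs())"]:
    exec(line, ns)
    if child.done:
        break
```

Executing each line separately is what would let a real system show the LLM the output or error of one line before it writes the next.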

Key Designs¶

Recursive Subtask Generation: When the code calls an undefined function (triggering a NameError), a new child LLM-REPL is automatically created to handle this subtask. The child REPL has an independent variable space and communicates local context via get_args() and answer().
Top-Down Planning: Unlike traditional bottom-up coding, the LLM starts with the main task and recursively creates child REPLs to resolve subproblems as they arise—akin to the real-world software engineering practice of "designing the interface first, then implementing the details."
Dynamic Error Correction: The line-by-line execution nature of REPL allows the LLM to inspect execution feedback and error logs at each step, enabling immediate code revision without rewriting the entire program.
k-Shot REPL Pool: REPL-Plan keeps a global pool of REPLs that stores the code and environments generated in previous tasks and demonstrations. When a new task invokes a REPL of the same name, it can reuse that REPL's code/output history, facilitating in-context learning.
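
The NameError-driven spawning and the REPL pool can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the paper's code: `REPL_POOL`, `run_child`, and `run_line` are hypothetical names, the pool maps a subtask name to hard-coded lines instead of LLM-generated ones, and a child simply re-runs its pooled lines rather than resuming a stored execution state.

```python
import re
from typing import Any, Dict, List, Tuple

# Hypothetical pool: subtask name -> lines a child REPL would execute.
REPL_POOL: Dict[str, List[str]] = {
    "double": ["(x,) = get_args()", "answer(x * 2)"],
    "quadruple": ["(x,) = get_args()", "answer(double(double(x)))"],
}


def run_child(name: str, args: Tuple[Any, ...]) -> Any:
    """Run the pooled lines for `name` in a fresh namespace; return its answer."""
    result: Dict[str, Any] = {}
    child_ns: Dict[str, Any] = {
        "get_args": lambda: args,
        "answer": lambda a: result.setdefault("value", a),
    }
    for line in REPL_POOL[name]:
        run_line(line, child_ns)
        if "value" in result:  # answer() hands control back to the parent
            break
    return result.get("value")


def run_line(line: str, ns: Dict[str, Any]) -> None:
    """Execute one line; on a NameError for a pooled name, bind a child and retry."""
    while True:
        try:
            exec(line, ns)
            return
        except NameError as err:
            match = re.match(r"name '(\w+)' is not defined", str(err))
            if match is None or match.group(1) not in REPL_POOL:
                raise  # a genuine bug; a real system would show it to the LLM
            name = match.group(1)
            # Bind the undefined name to a function that spawns a child REPL.
            ns[name] = lambda *a, _name=name: run_child(_name, a)


parent_ns: Dict[str, Any] = {}
run_line("y = double(21)", parent_ns)
run_line("z = quadruple(3)", parent_ns)  # the child itself spawns `double`
```

Each child gets its own namespace, so the child's variable `x` never appears in `parent_ns`. Retrying the whole line after binding the name is a simplification: it is only safe when the line has no side effects before the undefined call.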

Walkthrough Example¶

Taking e-commerce search as an example, the main REPL invokes filter_search(description) to iterate through search pages. This function in turn calls filter_page(description) to filter matching items on the current page, which then calls parse_items() to parse page elements and item_matches(item, description) to check for matches—the entire process is recursively generated from the top down.
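
The call chain in this walkthrough can be written out as ordinary Python, with the main function first and its helpers after it. This is a sketch under assumptions: `PAGES`, `STATE`, the `[next]` marker, and the `"next"` action are hypothetical stand-ins for a real shopping site, and `item_matches` uses a keyword check where REPL-Plan would rely on the LLM's judgment.

```python
from typing import Dict, List, Tuple

# Hypothetical paginated results; "[next]" marks that another page exists.
PAGES: List[str] = [
    "B01: red cotton shirt\nB02: blue denim jacket\n[next]",
    "B03: green wool scarf\nB04: blue cotton shirt",
]
STATE: Dict[str, int] = {"page": 0}


def get_obs() -> str:
    return PAGES[STATE["page"]]


def act(a: str) -> None:
    if a == "next" and STATE["page"] + 1 < len(PAGES):
        STATE["page"] += 1


def filter_search(description: str, max_pages: int = 10) -> List[str]:
    """Main task: walk through result pages and collect matching item IDs."""
    matches: List[str] = []
    for _ in range(max_pages):
        matches.extend(filter_page(description))
        if "[next]" not in get_obs():
            break
        act("next")
    return matches


def filter_page(description: str) -> List[str]:
    """Subtask: keep the items on the current page that match."""
    return [item_id for item_id, title in parse_items()
            if item_matches(title, description)]


def parse_items() -> List[Tuple[str, str]]:
    """Subtask: turn the observation text into (item_id, title) pairs."""
    items = []
    for line in get_obs().splitlines():
        if ": " in line:
            item_id, title = line.split(": ", 1)
            items.append((item_id, title))
    return items


def item_matches(item: str, description: str) -> bool:
    """Vague subtask: a keyword stand-in for an LLM's subjective judgment."""
    return all(word in item.split() for word in description.split())


result = filter_search("blue shirt")
print(result)
```

Python resolves function names at call time, so the definitions can follow the same top-down order in which REPL-Plan would create them.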

Key Experimental Results¶

ALFWorld (Household Simulation Environment)¶

Method	Success Rate (%)	External Memory
ReAct	53.7	No
ADaPT	82.1	No
THREAD	95.5	No
REPL-Plan	97.0	No
Reflexion	76.1	Yes
RAP	85.8	Yes

REPL-Plan achieves a 97.0% success rate without using external memory, outperforming all baselines.

WebShop (Online Shopping Environment)¶

Model	Setting	Strategy	Method	Success Rate (%)	Score (%)
GPT-3.5	k=3	Top-3	THREAD	49	76.3
GPT-3.5	k=3	Top-3	REPL-Plan	47	74.2
GPT-4o-mini	k=10	Top-3	THREAD	21	42.1
GPT-4o-mini	k=10	Top-3	REPL-Plan	37	69.9
GPT-4o-mini	k=10	Top-20	REPL-Plan	52	77.1

In the simple setting (\(k=3\)), REPL-Plan performs comparably to THREAD, but exhibits a significant advantage in larger search spaces (\(k=10\)) and more complex strategies (Top-20).

Real-World Web Tasks¶

Method	Simple Tasks (%)	Complex Tasks (%)
ReAct	86.7	17.6
THREAD	13.3	0.0
REPL-Plan	86.7	39.6

In complex real-world web tasks involving loops, REPL-Plan significantly outperforms both baselines (39.6% vs. at most 17.6%). Web observations in these tasks range from 4k to 20k tokens.

Ablation Study (WebShop \(k=3\), \(n=25\))¶

Ablation	GPT-3.5 SR (%)	GPT-4o-mini SR (%)
Full Model	52	44
Buggy Demonstrations	52	40
W/o Subtask REPL	24	20
Zero-shot Sub-REPL	28	16

Recursive subtask REPL is critical: Success rate drops by more than half after removal (52% to 24% for GPT-3.5, 44% to 20% for GPT-4o-mini).
Robust error correction: Performance remains almost unchanged even after injecting code bugs.
Limited zero-shot capability: When the demonstrations for a sub-REPL are removed, the LLM correctly infers the sub-REPL's behavior in only about half of the trials.
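
The relative size of each drop can be checked directly from the ablation table. The short calculation below uses only the success rates listed above; the `retained` helper is a name introduced here for illustration.

```python
# Success rates (%) copied from the ablation table above.
FULL = {"GPT-3.5": 52, "GPT-4o-mini": 44}
ABLATIONS = {
    "Buggy Demonstrations": {"GPT-3.5": 52, "GPT-4o-mini": 40},
    "W/o Subtask REPL": {"GPT-3.5": 24, "GPT-4o-mini": 20},
    "Zero-shot Sub-REPL": {"GPT-3.5": 28, "GPT-4o-mini": 16},
}


def retained(ablation: str, model: str) -> float:
    """Fraction of the full model's success rate that survives the ablation."""
    return ABLATIONS[ablation][model] / FULL[model]


for name in ABLATIONS:
    for model in FULL:
        print(f"{name:22s} {model:12s} {retained(name, model):.2f}")
```

Removing the subtask REPL leaves roughly 45% of the full model's success rate on both models, while buggy demonstrations leave 91% or more.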

Highlights & Insights¶

Unifying Code Expressiveness and Dynamics: REPL-Plan represents the first planning framework that combines full code expressiveness (loops, variables, functions) with dynamic error correction and the capability to handle vague problems.
Top-Down Recursive Decomposition: Through the recursive generation of LLM-REPL, it naturally realizes an "API-first, implementation-later" development pattern, bypassing the challenges of bottom-up planning.
Robustness to Hallucinations: Code control flow mitigates the impact of LLM hallucinations—even if an incorrect element ID is predicted, loop structures guarantee that the agent continues searching instead of getting permanently stuck.
Scalability: The advantages of REPL-Plan are particularly pronounced in tasks with long observations (4k-20k tokens) and high complexity.
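
The robustness point about control flow can be illustrated with a toy loop. This is a hypothetical sketch, not code from the paper: `VALID_IDS`, `click`, and `click_first_valid` are invented names, and the candidate list stands in for element IDs an LLM might predict, some of them hallucinated.

```python
from typing import List, Optional

# Hypothetical page state: only this element actually exists.
VALID_IDS = {"btn-7"}


def click(element_id: str) -> str:
    """Hypothetical environment call; returns an observation string."""
    if element_id in VALID_IDS:
        return "clicked"
    return "error: no such element"


def click_first_valid(candidates: List[str]) -> Optional[str]:
    """Try each predicted ID in turn; a wrong ID costs one step, not the episode."""
    for element_id in candidates:
        if click(element_id) == "clicked":
            return element_id
    return None


chosen = click_first_valid(["btn-3", "btn-7", "btn-9"])
print(chosen)
```

Because the loop owns the retry logic, a single wrong prediction only produces an error observation and the search moves on to the next candidate.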

Limitations & Future Work¶

Reliance on High-Quality Demonstrations: The capacity to infer sub-REPLs zero-shot is limited, leading to a substantial drop in performance when no demonstrations are provided.
API Call Overhead: Since each line of code requires LLM inference, recursively generating multiple child REPLs significantly increases API call frequency and costs.
Environmental Constraints: Currently validated only in text environments; generalizability to multimodal environments remains to be explored.
Quality of Sub-REPL Task Descriptions: The system relies on the LLM to generate task descriptions for child REPLs; inaccurate descriptions can lead to cascading failures.

Related Work¶

LLM Agents: ReAct (Yao et al., 2022b) and Reflexion (Shinn et al., 2023) primarily use observation-action sequences; THREAD (Schroeder et al., 2024) employs textual divide-and-conquer strategies.
Code-Augmented LLMs: PAL (Gao et al., 2023) leverages code to assist reasoning; ProgPrompt (Singh et al., 2022) utilizes program structures for planning.
Recursive Task Decomposition: ADaPT (Prasad et al., 2024) performs adaptive decomposition; THREAD utilizes tree-like subtask structures.

Rating¶

⭐⭐⭐⭐: An elegant design with a clean intuition. It systematically brings the human practice of interactive programming (the REPL) into LLM planning and shows a distinct edge on real-world web navigation tasks. The main caveats are API call costs and reliance on demonstrations.