ACPBench Hard: Unrestrained Reasoning about Action, Change, and Planning¶

Conference: ICLR 2026
arXiv: 2503.24378
Code: https://ibm.github.io/ACPBench
Area: Model Compression
Keywords: planning benchmark, PDDL, generative evaluation, symbolic validator, action reasoning

TL;DR¶

This paper introduces ACPBench Hard, an open-ended generative benchmark for planning reasoning comprising 8 task types grounded in PDDL formal systems (13 domains × 8 tasks × 10 questions = 1,040 questions) and equipped with symbolic validators that provide rigorous correctness guarantees. A systematic evaluation of 15 LLMs shows that even the strongest reasoning model, o1-preview, scores at most 66% on six of the eight tasks, and that every model struggles on the most fundamental task, enumerating applicable actions: small models score near 0% and the best model reaches only 44%. These results expose fundamental deficiencies in current LLMs' planning reasoning capabilities.

Background & Motivation¶

Existing LLM planning evaluations face two bottlenecks. First, benchmarks such as PlanBench and AutoPlanBench focus solely on end-to-end plan generation and validation, so when a black-box model fails they cannot pinpoint the cause. ACPBench v1 addressed this by decomposing the planning process into 7 atomic reasoning tasks (applicability, state progression, reachability, etc.), but it used a Boolean/multiple-choice format. Second, the multiple-choice format is misaligned with what real planners do: a planner must generate answers from a large action space rather than select from 4 options. Answering multiple-choice questions correctly does not imply the ability to complete generation tasks, and evaluating open-ended generation is itself far harder (verification for some tasks is PSPACE-complete).

The core idea of this paper is to upgrade ACPBench's 7 tasks from multiple-choice to open-ended generation, add a "next action" task (corresponding to optimal planning), and design a PDDL-based symbolic validator for each of the 8 tasks, entirely avoiding the unreliability of LLM-as-judge evaluation.

Method¶

8 Atomic Planning Reasoning Tasks¶

ACPBench Hard decomposes planning capability into 8 atomic tasks spanning three levels: action-level, state-level, and plan-level.

Level	Task	Abbr.	Generation Target	Verification Complexity
Action	Applicability	App	Enumerate all executable actions in the current state	$O(
Action	Progression	Prog	Given an action, list positive effects (newly true) and negative effects (become false)	$O(
State	Proposition Reachability	Reach	Identify propositions that can never become true from the current state	PSPACE-complete
State	Action Reachability	AReach	Identify actions that can never become executable	PSPACE-complete
Plan	Plan Validation	Val	Identify the first non-executable action in an action sequence	$O(1)$
Plan	Plan Justification	Just	Remove 1–2 redundant actions from a plan and output the simplified plan	$O(
Plan	Landmarks	Land	Identify necessary subgoals that every valid plan must pass through	PSPACE-complete
Plan	Next Action (new)	NextA	Select an action that reduces the optimal goal distance by 1	PSPACE-complete
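
To make the simpler task definitions concrete, the following is a minimal sketch of Applicability, Progression, and Plan Validation under STRIPS-style semantics (an action is applicable when its preconditions are a subset of the state). The `Action` class, the function names, and the two-block toy instance are illustrative assumptions and are not taken from the ACPBench Hard code base.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional, Set, Tuple

State = FrozenSet[str]


@dataclass(frozen=True)
class Action:
    """Hypothetical ground STRIPS action (not the benchmark's own data model)."""
    name: str
    pre: FrozenSet[str]
    add: FrozenSet[str]
    delete: FrozenSet[str]


def applicable(state: State, actions: List[Action]) -> Set[str]:
    """App: names of all actions whose preconditions hold in `state`."""
    return {a.name for a in actions if a.pre <= state}


def progress(state: State, action: Action) -> Tuple[State, Set[str], Set[str]]:
    """Prog: successor state plus newly true and newly false propositions."""
    positive = set(action.add - state)
    negative = set((action.delete & state) - action.add)
    next_state = (state - action.delete) | action.add
    return next_state, positive, negative


def first_inapplicable(state: State, plan: List[Action]) -> Optional[int]:
    """Val: index of the first action that cannot be executed, or None."""
    for index, action in enumerate(plan):
        if not action.pre <= state:
            return index
        state, _, _ = progress(state, action)
    return None


def fs(*props: str) -> FrozenSet[str]:
    return frozenset(props)


# Toy two-block instance, assumed purely for illustration.
s0 = fs("clear a", "clear b", "ontable a", "ontable b", "handempty")
pickup_a = Action("pickup a", fs("clear a", "ontable a", "handempty"),
                  fs("holding a"), fs("clear a", "ontable a", "handempty"))
pickup_b = Action("pickup b", fs("clear b", "ontable b", "handempty"),
                  fs("holding b"), fs("clear b", "ontable b", "handempty"))
stack_a_b = Action("stack a b", fs("holding a", "clear b"),
                   fs("on a b", "clear a", "handempty"),
                   fs("holding a", "clear b"))

print(sorted(applicable(s0, [pickup_a, pickup_b, stack_a_b])))
print(progress(s0, pickup_a)[1:])
print(first_inapplicable(s0, [pickup_a, pickup_b, stack_a_b]))
```

In this toy instance the second plan step fails because the hand is no longer empty after the first pickup, so the Val answer is index 1 (0-based).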

Key Designs¶

Symbolic Validator System: Each task is equipped with a dedicated verification algorithm. Simple tasks (App/Prog/Val) rely on set comparison; harder tasks (Reach/AReach/Land/NextA) first query a precomputed cache and invoke a PDDL planner upon a cache miss, thereby guaranteeing completeness and correctness of verification. For example, Landmarks verification is performed by constructing an auxiliary planning task $\Pi'$ (adding a marker proposition $p_\text{nach}$) to check whether a valid plan exists that bypasses the candidate landmark.
Data Construction Pipeline: Based on the 13 PDDL domains from ACPBench, 10 questions are generated per domain per task (1,040 questions total). Plans are generated by a top-quality planner, with a diverse planner as a fallback. Questions are converted from PDDL to natural language via templates.
Lenient Syntax Parser: To handle inconsistent model output formats, a grammar-based lenient parser is designed that automatically discards syntactically invalid tokens to maximally extract valid answers.
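
The bypass idea behind Landmarks verification can be sketched as follows. A proposition is a landmark exactly when the task is solvable but no plan reaches the goal without ever making that proposition true. Note the substitution: the paper compiles this check into an auxiliary planning task and hands it to a PDDL planner, whereas this sketch uses a plain breadth-first search that prunes any state containing the candidate. All names and the toy delivery instance are assumptions made for illustration.

```python
from collections import deque
from typing import FrozenSet, List, NamedTuple, Optional


class Action(NamedTuple):
    """Hypothetical ground STRIPS action used only in this sketch."""
    name: str
    pre: FrozenSet[str]
    add: FrozenSet[str]
    delete: FrozenSet[str]


def goal_reachable(init: FrozenSet[str], goal: FrozenSet[str],
                   actions: List[Action],
                   forbidden: Optional[str] = None) -> bool:
    """Breadth-first search; states containing `forbidden` are never visited."""
    if forbidden is not None and forbidden in init:
        return False
    seen = {init}
    queue = deque([init])
    while queue:
        state = queue.popleft()
        if goal <= state:
            return True
        for action in actions:
            if not action.pre <= state:
                continue
            successor = (state - action.delete) | action.add
            if forbidden is not None and forbidden in successor:
                continue
            if successor not in seen:
                seen.add(successor)
                queue.append(successor)
    return False


def is_landmark(prop: str, init: FrozenSet[str], goal: FrozenSet[str],
                actions: List[Action]) -> bool:
    """True if the task is solvable and every plan makes `prop` true."""
    solvable = goal_reachable(init, goal, actions)
    bypass_exists = goal_reachable(init, goal, actions, forbidden=prop)
    return solvable and not bypass_exists


def fs(*props: str) -> FrozenSet[str]:
    return frozenset(props)


# Toy delivery task: the truck can drive A->B directly or via C.
init = fs("truck at A", "pkg at A")
goal = fs("pkg at B")
actions = [
    Action("load", fs("truck at A", "pkg at A"), fs("pkg in truck"), fs("pkg at A")),
    Action("drive A B", fs("truck at A"), fs("truck at B"), fs("truck at A")),
    Action("drive A C", fs("truck at A"), fs("truck at C"), fs("truck at A")),
    Action("drive C B", fs("truck at C"), fs("truck at B"), fs("truck at C")),
    Action("unload", fs("truck at B", "pkg in truck"), fs("pkg at B"), fs("pkg in truck")),
]

print(is_landmark("pkg in truck", init, goal, actions))
print(is_landmark("truck at C", init, goal, actions))
```

Explicit-state search only scales to tiny instances; the general problem is PSPACE-complete, which is consistent with the precomputed cache and planner fallback described above.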

Key Experimental Results¶

Small/Medium Model Results¶

Model	App	AReach	Just	Land	NextA	Prog	Reach	Val
Granite 3.1 8B	0.00	0.00	0.21	0.08	0.22	0.36	0.33	0.09
Llama 3.1 8B	0.00	0.00	0.22	0.06	0.25	0.40	0.33	0.13
DeepSeek Coder 33B	0.02	0.02	0.21	0.10	0.17	0.42	0.18	0.15
Granite 34B Code	0.02	0.00	0.17	0.11	0.18	0.43	0.28	0.12

Small models score near zero on App and AReach; their best result on any task, Prog, peaks at only 43% (Granite 34B Code).

Large Model & Reasoning Model Results¶

Model	App	AReach	Just	Land	NextA	Prog	Reach	Val
Mixtral 8x22B	0.10	0.02	0.31	0.26	0.32	0.68	0.37	0.23
Llama 3.1 70B	0.12	0.02	0.44	0.20	0.42	0.65	0.28	0.20
GPT-4o mini	0.07	0.01	0.14	0.04	0.35	0.59	0.22	0.27
Llama 3.1 405B	0.14	0.04	0.59	0.15	0.48	0.74	0.26	0.48
GPT-4o	0.25	0.01	0.54	0.29	0.55	0.78	0.32	0.62
DeepSeek V3	0.21	0.05	0.65	0.12	0.47	0.76	0.32	0.56
o1-mini	0.38	0.06	0.44	0.38	0.64	0.70	0.60	0.78
GPT OSS 20B	0.03	0.09	0.14	0.47	0.62	0.72	0.50	0.14
GPT OSS 120B	0.00	0.13	0.05	0.49	0.78	0.79	0.68	0.70
o1-preview	0.44	0.12	0.46	0.56	0.80	0.89	0.66	0.26
DeepSeek R1	0.05	0.01	0.52	0.20	0.36	0.77	0.24	0.53

Representation Ablation (DeepSeek V3)¶

Representation	App	AReach	Just	Land	NextA	Prog	Reach	Val	Avg.
NL	0.21	0.05	0.65	0.12	0.47	0.76	0.32	0.56	0.39
PDDL	0.31	0.07	0.74	0.21	0.53	0.87	0.33	0.55	0.44
PDDL+NL	0.32	0.09	0.68	0.19	0.60	0.88	0.37	0.61	0.47

Adding the formal PDDL description to the natural-language prompt (PDDL+NL) raises average accuracy from 39% to 47%; PDDL alone reaches 44%. However, when PDDL is available, using a classical planner directly is more appropriate.

Key Findings¶

Applicability is near-universally failed: On the most fundamental task, enumerating all executable actions, small models score ≈0% and the strongest model, o1-preview, reaches only 44%. The strict requirement to generate the complete set is the primary cause: switching to Jaccard similarity scoring raises o1-preview from 44% to 57% and Mixtral from 10% to 38%.
Action Reachability is the hardest task: It requires reasoning about the joint reachability of multiple propositions in action preconditions; o1-preview scores only 12%. Most correct answers arise from recognizing the "None" case (all actions are reachable).
No universally dominant model: No model achieves the best performance across all 8 tasks. Among non-reasoning models, GPT-4o leads on 5 of 8 tasks, but DeepSeek V3 outperforms it by 11 percentage points on Just and 4 points on AReach.
Questionable cost-effectiveness of reasoning models: The o1 series incurs far greater computational cost than standard LLMs, yet achieves significant advantages only on Prog (89%) and NextA (80%); o1-preview scores ≤66% on six of the eight tasks.
Anomalously low Val score for o1-preview (26%): In 86% of error cases, the predicted index differs from the correct index by exactly 1, indicating the model is close but consistently off by one step.
Progression is the easiest task (o1-preview: 89%), yet models still make basic errors such as failing to recognize that after stacking a block, the top block becomes clear.
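
The gap between strict and lenient scoring on Applicability can be illustrated with a small sketch. Exact match gives credit only when the predicted set equals the gold set, while Jaccard similarity gives partial credit for overlap. The function names and the normalization step are assumptions for illustration; the benchmark's own scoring code may differ.

```python
from typing import Iterable, Set


def normalize(actions: Iterable[str]) -> Set[str]:
    """Lower-case and strip each action string (an assumed normalization)."""
    return {a.strip().lower() for a in actions}


def exact_match(predicted: Iterable[str], gold: Iterable[str]) -> float:
    """1.0 only if the predicted set equals the gold set, else 0.0."""
    return 1.0 if normalize(predicted) == normalize(gold) else 0.0


def jaccard(predicted: Iterable[str], gold: Iterable[str]) -> float:
    """Intersection over union; two empty sets count as a perfect match."""
    p, g = normalize(predicted), normalize(gold)
    if not p and not g:
        return 1.0
    return len(p & g) / len(p | g)


gold = ["pick-up a", "pick-up b", "pick-up c"]
predicted = ["pick-up a", "pick-up b"]  # one applicable action is missing

print(exact_match(predicted, gold))
print(jaccard(predicted, gold))
```

A prediction that misses a single action out of three scores 0 under exact match but 2/3 under Jaccard, which is how a model can look far better under the lenient metric.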

Highlights & Insights¶

Precise diagnosis of capability deficits: By decomposing the planning process into 8 atomic tasks, the framework pinpoints exactly where models fail. App scores of ≈0% for small models and at most 44% overall indicate that LLMs cannot reliably enumerate executable actions, a prerequisite for all planning.
Methodological significance of symbolic validators: The framework provides a fully reliable automatic evaluation scheme for open-ended generation tasks; the construction of certain verification algorithms (e.g., the auxiliary planning task for Landmarks) constitutes an independent technical contribution.
The gap between generation and multiple-choice: The same model (GPT-4o) exhibits far lower error rates under multiple-choice format than under generative format (except for Val), demonstrating that multiple-choice questions substantially overestimate models' planning reasoning capabilities.

Limitations & Future Work¶

Template-based natural language is insufficiently naturalistic and diverges from real-world planning scenarios
Coverage of only 13 PDDL domains may be insufficient to represent all planning reasoning patterns
Only the final answer of a single generation is evaluated, without considering iterative self-correction over multiple steps
The lenient parser extracts only the first answer, potentially missing subsequent attempts by the model
Future directions include constructing training data with chain-of-thought (CoT) reasoning and extending to new task types such as object counting

Comparison with Related Work¶

vs. ACPBench v1: v1 uses Boolean/multiple-choice format; this work upgrades to open-ended generation, a qualitative increase in difficulty
vs. PlanBench / AutoPlanBench: Those works focus on end-to-end plan generation/validation and cannot diagnose atomic capability deficits
vs. ActionReasoningBench: That benchmark conflates multiple capabilities into a single question and relies on LLM-as-judge evaluation; this work assigns one atomic capability per task with a corresponding symbolic validator

Rating¶

Novelty: ⭐⭐⭐⭐ Generative planning reasoning benchmark + complete symbolic validators
Experimental Thoroughness: ⭐⭐⭐⭐⭐ 15 models × 8 tasks + representation ablation + complexity analysis + per-domain analysis
Writing Quality: ⭐⭐⭐⭐ Task definitions and verification algorithms are clearly presented; observations and analyses are detailed
Value: ⭐⭐⭐⭐⭐ A benchmark platform for diagnosing LLM planning reasoning capabilities