Once Upon an Input: Reasoning via Per-Instance Program Synthesis¶
Conference: NeurIPS 2025 | arXiv: 2510.22849 | Code: https://github.com/adaminsky/pips | Area: Code Intelligence | Keywords: Program Synthesis, LLM Reasoning, Code Generation, Neuro-Symbolic, Multi-Step Reasoning
TL;DR¶
This paper proposes PIPS (Per-Instance Program Synthesis), which iteratively refines programs through instance-level program synthesis and structured feedback, while dynamically selecting between direct reasoning and program synthesis via a confidence measure. PIPS achieves up to an 8.6% absolute improvement in harmonic-mean accuracy over PoT across 30 benchmarks.
Background & Motivation¶
Background: LLMs have made substantial progress in zero-shot reasoning, with methods such as Chain-of-Thought (CoT) and Program-of-Thought (PoT) further enhancing multi-step reasoning capabilities.
Limitations of Prior Work: Existing instance-level program synthesis methods (e.g., PoT) face three core challenges:
- Open-domain problems: It is unclear when to use programs versus CoT; forcing code generation on non-algorithmic problems (e.g., sentiment understanding) tends to produce trivial programs (i.e., hardcoded answers).
- Absence of task specifications: No specification of correct program behavior is available to guide search, causing over 50% of PoT outputs to be trivial programs.
- Unstructured inputs: Programs require structured inputs, whereas reasoning problems are typically presented as unstructured text or images.
Key Challenge: More than 50% of PoT-generated programs hardcode answers, 6.3% contain syntax errors, and 11.5% return type errors.
Key Insight: Address three issues at the instance level — (1) decide whether to use a program, (2) iteratively refine programs with structural feedback, and (3) extract symbolic inputs before generating programs.
Method¶
Overall Architecture¶
PIPS formulates the reasoning problem as \(y = P(c(x))\), where \(c\) maps raw inputs to structured symbolic inputs and \(P\) is an executable program. The pipeline proceeds as: confidence assessment → (selection of CoT or synthesis) → symbolic extraction → iterative program generation and evaluation.
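The control flow above can be sketched as follows. The stage functions (`assess`, `choose`, `extract`, `synthesize`, `answer_directly`) are hypothetical stand-ins for the paper's LLM-backed components, injected as parameters so the flow itself is self-contained:

```python
def pips_pipeline(x, assess, choose, extract, synthesize, answer_directly):
    """High-level PIPS control flow (a sketch, not the authors' code).

    assess(x)          -> 10-dim confidence vector S(x)
    choose(scores)     -> True if program synthesis is selected
    extract(x)         -> structured symbolic input c(x)
    synthesize(x, c_x) -> answer from an iteratively refined program P(c(x))
    answer_directly(x) -> plain CoT answer
    """
    scores = assess(x)                 # confidence assessment
    if not choose(scores):             # classifier prefers direct reasoning
        return answer_directly(x)
    c_x = extract(x)                   # symbolic input extraction
    return synthesize(x, c_x)          # program generation + evaluation loop
```

With stub stages substituted for the LLM calls, the same function exercises both branches, which makes the selection logic easy to verify in isolation.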
Key Designs¶
- Selective Synthesis:
  - Function: Decides at the instance level whether to apply CoT or program synthesis.
  - Mechanism: Ten criteria are designed for LLM self-evaluation (e.g., formalizability, probability of successful execution, logical robustness), producing a confidence vector \(S(x) = (p_1(x),\ldots,p_{10}(x)) \in [0,1]^{10}\); a logistic regression classifier makes the final decision.
  - Design Motivation: Experiments show that applying PoT to non-algorithmic tasks almost always produces trivial code (equivalent to CoT but with an additional Python call), making it preferable to skip synthesis entirely.
- Specification-Free Program Search:
  - Function: Iteratively improves programs in the absence of test cases or task specifications.
  - Mechanism: An evaluator \(E\) checks structural properties of programs: non-triviality (no hardcoding), syntactic correctness, type correctness, and absence of placeholders. The loop proceeds: generate → evaluate → provide feedback → regenerate, for up to \(k\) rounds.
  - Design Motivation: Traditional program synthesis relies on input-output examples or logical specifications, which are unavailable in single-instance reasoning; detecting common failure patterns (hardcoding, syntax errors, etc.) serves as a surrogate specification.
- Symbolic Input Extraction:
  - Function: Converts unstructured data (text, images) into structured JSON inputs.
  - Mechanism: An LLM infers an ad hoc schema to extract entities, attributes, and relations, yielding an explicit program input \(c(x)\).
  - Design Motivation: Programs generated by PoT without explicit inputs must either hardcode data or process raw images in code (e.g., 12.7% use OpenCV/Pillow for image handling), both of which are fragile.
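The selective-synthesis decision reduces to logistic regression over the ten self-evaluation scores. A minimal sketch, where the weights and bias are illustrative placeholders rather than the paper's fitted values:

```python
import math

def should_synthesize(scores, weights, bias):
    """Map the confidence vector S(x) = (p_1, ..., p_10) in [0, 1]^10
    to a binary decision via logistic regression: synthesize a program
    when the predicted probability of success is at least 0.5."""
    z = bias + sum(w * p for w, p in zip(weights, scores))
    prob = 1.0 / (1.0 + math.exp(-z))
    return prob >= 0.5

# Illustrative parameters (assumed, not from the paper): with equal
# weights and bias -5, the decision roughly asks whether the average
# criterion score exceeds 0.5.
WEIGHTS = [1.0] * 10
BIAS = -5.0
```

In the paper the classifier is trained; here the fixed parameters only illustrate how the ten scores collapse into a single CoT-vs-synthesis choice.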
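The specification-free evaluator can be approximated with static checks over a candidate program's AST. The sketch below implements three of the checks named above (syntax, placeholders, non-triviality) for Python programs; it is an illustration of the idea, not the paper's evaluator:

```python
import ast

def evaluate_program(source: str) -> list:
    """Structural checks in the spirit of PIPS's evaluator E.

    Returns a list of feedback strings for the regeneration prompt;
    an empty list means the program passes all checks.
    """
    feedback = []
    # Syntactic correctness: the program must parse at all.
    try:
        tree = ast.parse(source)
    except SyntaxError as e:
        return [f"syntax error: {e.msg}"]
    # Placeholders: reject stubs such as `pass` or `...`.
    for node in ast.walk(tree):
        if isinstance(node, ast.Pass) or (
            isinstance(node, ast.Constant) and node.value is Ellipsis
        ):
            feedback.append("placeholder found: program is incomplete")
            break
    # Non-triviality: a function whose entire body is `return <constant>`
    # hardcodes the answer instead of computing it.
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef) and len(node.body) == 1:
            stmt = node.body[0]
            if isinstance(stmt, ast.Return) and isinstance(stmt.value, ast.Constant):
                feedback.append(f"trivial program: {node.name} returns a constant")
    return feedback
```

Each failed check becomes natural-language feedback for the next generation round, which is how structural properties substitute for the missing input-output specification.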
Loss & Training¶
No training is required; the method is entirely based on inference-time prompt engineering and iterative feedback mechanisms.
Key Experimental Results¶
Main Results¶
| Model | Task Set (HMean) | PIPS | PoT | CoT | Gain |
|---|---|---|---|---|---|
| Gemini-2.0-Flash | All 30 tasks | 20.8 | 12.2 | 11.4 | +8.6% vs PoT |
| GPT-4.1-mini | All 30 tasks | — | — | — | +0.8% vs PoT |
| o4-mini | All 30 tasks | — | — | — | +5.7% vs PoT |
| Gemini-2.0-Flash | Algorithmic tasks | Substantially higher | — | — | +15.9% |
Ablation Study¶
| Configuration | HMean Accuracy |
|---|---|
| PIPS (full) | 20.8% |
| PIPS (w/o switch) | 18.3% (−2.5%) |
| PIPS-0 (w/o switch, w/o iteration) | 12.9% (−7.9%) |
| PIPS-0 (w/o switch, w/o symbolic extraction, w/o iteration) | 4.3% (−16.5%) |
Key Findings¶
- Even at \(k=0\) (no evaluator), PIPS outperforms PoT by 5.6%, demonstrating that symbolic extraction alone contributes substantially.
- The confidence-based switch correctly classifies 65.3% of the critical instances (24.8% of all instances), yielding a 2.2% absolute accuracy gain.
- Trivial programs are reduced by 75.6% and syntax errors by 86.8%.
- On multimodal tasks (CLEVR, Leaf), PIPS never resorts to OpenCV/Pillow, whereas PoT does so in 12.7% of cases.
Highlights & Insights¶
- Instance-level method selection is an important but underexplored problem: the degree of "algorithmicity" varies considerably across instances within the same task set, making a one-size-fits-all approach inadequate.
- The structured feedback loop is cleverly designed: it requires no test cases and instead uses static/dynamic code quality checks as a general-purpose program improvement signal.
- Symbolic extraction decouples perception from reasoning: this principle is highly transferable to any scenario where LLMs must process structured data.
Limitations & Future Work¶
- The confidence-based switch relies on LLM self-assessment, which varies considerably in quality across models.
- Symbolic extraction may lose information, particularly fine-grained spatial relations in visual tasks.
- Iteration increases latency and cost, especially for large \(k\).
- For purely creative or subjective problems, neither CoT nor program synthesis is well-suited.
Related Work & Insights¶
- vs. PAL/FCoT: These methods use fixed per-task programs and do not adapt to instance-level variation; PIPS generates programs at the instance level.
- vs. Code Interpreter (CI): CI is agent-based, incurring higher cost and lacking the targeted design of PIPS.
- The symbolic extraction idea introduced in this paper can be applied to multimodal reasoning by first structuring image content before processing it programmatically.
Rating¶
- Novelty: ⭐⭐⭐⭐ First systematic treatment of the three core challenges in instance-level program synthesis.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ 30 benchmarks, 3 frontier LLMs, and highly detailed analysis.
- Writing Quality: ⭐⭐⭐⭐⭐ Clear problem decomposition and rigorous experimental design.
- Value: ⭐⭐⭐⭐ Offers practical guidance for LLM reasoning.