
Once Upon an Input: Reasoning via Per-Instance Program Synthesis

Conference: NeurIPS 2025 | arXiv: 2510.22849 | Code: https://github.com/adaminsky/pips | Area: Code Intelligence | Keywords: Program Synthesis, LLM Reasoning, Code Generation, Neuro-Symbolic, Multi-Step Reasoning

TL;DR

This paper proposes PIPS (Per-Instance Program Synthesis), which synthesizes a program for each problem instance, iteratively refining it through structured feedback, while dynamically selecting between direct reasoning and program synthesis via a confidence measure. With Gemini-2.0-Flash, PIPS improves harmonic mean accuracy by 8.6 points (absolute) over PoT across 30 benchmarks.
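
For context, the harmonic mean over per-task accuracies \(a_1,\ldots,a_n\) (assuming the standard definition here) heavily penalizes a method that fails badly on any single task, so it rewards robustness across all 30 benchmarks:

\[
\mathrm{HMean} = \frac{n}{\sum_{i=1}^{n} 1/a_i}
\]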

Background & Motivation

Background: LLMs have made substantial progress in zero-shot reasoning, with methods such as Chain-of-Thought (CoT) and Program-of-Thought (PoT) further enhancing multi-step reasoning capabilities.

Limitations of Prior Work: Existing instance-level program synthesis methods (e.g., PoT) face three core challenges:

  • Open-domain problems: it is unclear when to use programs versus CoT, and forcing code generation on non-algorithmic problems (e.g., sentiment understanding) tends to produce trivial programs, i.e., hardcoded answers.
  • Absence of task specifications: no specification of correct program behavior is available to guide the search, causing over 50% of PoT outputs to be trivial programs.
  • Unstructured inputs: programs require structured inputs, whereas reasoning problems are typically presented as unstructured text or images.

Key Challenge: More than 50% of PoT-generated programs hardcode answers, 6.3% contain syntax errors, and 11.5% return type errors.

Key Insight: Address three issues at the instance level — (1) decide whether to use a program, (2) iteratively refine programs with structural feedback, and (3) extract symbolic inputs before generating programs.

Method

Overall Architecture

PIPS formulates the reasoning problem as \(y = P(c(x))\), where \(c\) maps raw inputs to structured symbolic inputs and \(P\) is an executable program. The pipeline proceeds as: confidence assessment → (selection of CoT or synthesis) → symbolic extraction → iterative program generation and evaluation.
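
A minimal control-flow sketch of this pipeline, with every stage passed in as a callable; all helper names and signatures below are illustrative assumptions, not the authors' API:

```python
from typing import Any, Callable, Tuple

def pips(
    x: Any,
    confidence: Callable[[Any], list],                  # S(x) = (p_1, ..., p_10)
    gate: Callable[[list], bool],                       # logistic-regression switch
    cot: Callable[[Any], Any],                          # direct chain-of-thought
    extract: Callable[[Any], dict],                     # symbolic input c(x)
    synthesize: Callable[[Any, dict, str], str],        # LLM program generator
    evaluate: Callable[[str, dict], Tuple[bool, str]],  # structural evaluator E
    run: Callable[[str, dict], Any],                    # executes P on c(x)
    k: int = 3,
) -> Any:
    """Per-instance control flow: switch, extract, then refine up to k rounds."""
    if not gate(confidence(x)):   # low confidence in formalizability: answer directly
        return cot(x)
    c_x = extract(x)              # unstructured x -> structured symbolic input
    program, feedback = "", ""
    for _ in range(k + 1):        # generate -> evaluate -> feedback -> regenerate
        program = synthesize(x, c_x, feedback)
        ok, feedback = evaluate(program, c_x)
        if ok:
            break
    return run(program, c_x)      # y = P(c(x))
```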

Key Designs

  1. Selective Synthesis:

    • Function: Decides at the instance level whether to apply CoT or program synthesis.
    • Mechanism: Ten criteria are designed for LLM self-evaluation (e.g., formalizability, probability of successful execution, logical robustness), producing a confidence vector \(S(x) = (p_1(x),\ldots,p_{10}(x)) \in [0,1]^{10}\); a logistic regression classifier makes the final decision (see the first sketch after this list).
    • Design Motivation: Experiments show that applying PoT to non-algorithmic tasks almost always produces trivial code (equivalent to CoT but with an additional Python call), making it preferable to skip synthesis entirely.
  2. Specification-Free Program Search:

    • Function: Iteratively improves programs in the absence of test cases or task specifications.
    • Mechanism: An evaluator \(E\) checks structural properties of the program: non-triviality (no hardcoding), syntactic correctness, type correctness, and absence of placeholders. The loop proceeds generate → evaluate → feedback → regenerate for up to \(k\) rounds (see the second sketch after this list).
    • Design Motivation: Traditional program synthesis relies on input-output examples or logical specifications, which are unavailable in single-instance reasoning; detecting common failure patterns (hardcoding, syntax errors, etc.) serves as a surrogate specification.
  3. Symbolic Input Extraction:

    • Function: Converts unstructured data (text, images) into structured JSON inputs.
    • Mechanism: An LLM infers an ad hoc schema to extract entities, attributes, and relations, yielding an explicit program input \(c(x)\) (see the third sketch after this list).
    • Design Motivation: Programs generated by PoT without explicit inputs must either hardcode data or process raw images in code (e.g., 12.7% use OpenCV/Pillow for image handling), both of which are fragile.
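
First sketch, the selective-synthesis switch. The three named criteria come from the paper; the remaining criterion names and the synthetic training data standing in for labeled instances are placeholders:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Three criteria named in the paper; the other seven names are placeholders.
CRITERIA = ["formalizability", "execution_success", "logical_robustness"] \
    + [f"criterion_{i}" for i in range(4, 11)]

def confidence_vector(self_scores: dict) -> np.ndarray:
    """Assemble S(x) in [0,1]^10 from the LLM's per-criterion self-scores."""
    return np.array([self_scores[c] for c in CRITERIA], dtype=float)

# Placeholder training data: in practice, label an instance 1 when program
# synthesis beat CoT on it, 0 otherwise, and use the real confidence vectors.
rng = np.random.default_rng(0)
X_train = rng.uniform(size=(200, len(CRITERIA)))
y_train = (X_train[:, 0] > 0.5).astype(int)

switch = LogisticRegression().fit(X_train, y_train)

def should_synthesize(self_scores: dict) -> bool:
    x = confidence_vector(self_scores).reshape(1, -1)
    return bool(switch.predict(x)[0])
```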
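Second sketch, the specification-free evaluator: structural checks stand in for test cases. The triviality and placeholder heuristics below are illustrative simplifications, not the paper's exact checks; type checking, which requires executing the program on \(c(x)\), is omitted:

```python
import ast

def evaluate_structure(program: str) -> tuple[bool, str]:
    """Check structural properties; return (ok, feedback for regeneration)."""
    # 1. Syntactic correctness.
    try:
        tree = ast.parse(program)
    except SyntaxError as err:
        return False, f"Syntax error at line {err.lineno}; emit valid Python."

    # 2. Leftover placeholders (crude string heuristic).
    if "TODO" in program or "..." in program:
        return False, "Placeholders remain; implement the missing logic."

    # 3. Non-triviality: a function whose body is a bare `return <constant>`
    #    is the hardcoded-answer failure mode.
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            body = [n for n in node.body
                    if not (isinstance(n, ast.Expr)            # skip docstrings
                            and isinstance(n.value, ast.Constant))]
            if len(body) == 1 and isinstance(body[0], ast.Return) \
                    and isinstance(body[0].value, ast.Constant):
                return False, "Answer is hardcoded; compute it from the input."

    return True, ""
```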
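Third sketch, symbolic input extraction. Here `llm` is assumed to be any callable from prompt string to completion string; the prompt wording and the example output are illustrative:

```python
import json
from typing import Callable

EXTRACTION_PROMPT = """\
Infer a JSON schema for the entities, attributes, and relations in the
problem below, then return only the extracted data as one JSON object.

Problem: {problem}
JSON:"""

def extract_symbolic_input(llm: Callable[[str], str], problem: str) -> dict:
    """Map unstructured text to the structured program input c(x)."""
    return json.loads(llm(EXTRACTION_PROMPT.format(problem=problem)))

# For "Alice has 3 apples and gives 2 to Bob", a well-behaved model might return:
# {"people": [{"name": "Alice", "apples": 3}, {"name": "Bob", "apples": 0}],
#  "transfers": [{"from": "Alice", "to": "Bob", "amount": 2}]}
```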

Loss & Training

No LLM training is required; the method relies entirely on inference-time prompting and iterative feedback (only the lightweight logistic regression switch is fit).

Key Experimental Results

Main Results

| Model | Task Subset (HMean) | PIPS | PoT | CoT | Gain |
|---|---|---|---|---|---|
| Gemini-2.0-Flash | All 30 tasks | 20.8 | 12.2 | 11.4 | +8.6% vs PoT |
| GPT-4.1-mini | All 30 tasks | – | – | – | +0.8% vs PoT |
| o4-mini | All 30 tasks | – | – | – | +5.7% vs PoT |
| Gemini-2.0-Flash | Algorithmic tasks | substantially higher | – | – | +15.9% |

Ablation Study

| Configuration | HMean Accuracy |
|---|---|
| PIPS (full) | 20.8% |
| PIPS (w/o switch) | 18.3% (−2.5%) |
| PIPS-0 (w/o switch, w/o iteration) | 12.9% (−7.9%) |
| PIPS-0 (w/o switch, w/o symbolic extraction, w/o iteration) | 4.3% (−16.5%) |

Key Findings

  • Even at \(k=0\) (no evaluator), PIPS outperforms PoT by 5.6%, demonstrating that symbolic extraction alone contributes substantially.
  • The choice between CoT and synthesis changes the outcome on 24.8% of instances; the confidence-based switch classifies 65.3% of these correctly, yielding a 2.2% absolute accuracy gain.
  • Trivial programs are reduced by 75.6% and syntax errors by 86.8%.
  • On multimodal tasks (CLEVR, Leaf), PIPS never resorts to OpenCV/Pillow, whereas PoT does so in 12.7% of cases.

Highlights & Insights

  • Instance-level method selection is an important but underexplored problem: the degree of "algorithmicity" varies considerably across instances within the same task set, making a one-size-fits-all approach inadequate.
  • The structured feedback loop is cleverly designed: it requires no test cases and instead uses static/dynamic code quality checks as a general-purpose program improvement signal.
  • Symbolic extraction decouples perception from reasoning: this principle is highly transferable to any scenario where LLMs must turn unstructured inputs into structured data before reasoning over them.

Limitations & Future Work

  • The confidence-based switch relies on LLM self-assessment, which varies considerably in quality across models.
  • Symbolic extraction may lose information, particularly fine-grained spatial relations in visual tasks.
  • Iteration increases latency and cost, especially for large \(k\).
  • For purely creative or subjective problems, neither CoT nor program synthesis is well-suited.
  • Future direction: the symbolic extraction idea introduced in this paper can be applied to multimodal reasoning by first structuring image content and then processing it programmatically.

Comparison with Related Methods

  • vs. PAL/FCoT: these methods use fixed per-task programs and do not adapt to instance-level variation; PIPS synthesizes a fresh program for each instance.
  • vs. Code Interpreter (CI): CI is agent-based, incurring higher cost and lacking the targeted structural-feedback design of PIPS.

Rating

  • Novelty: ⭐⭐⭐⭐ First systematic treatment of the three core challenges in instance-level program synthesis.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ 30 benchmarks, 3 frontier LLMs, and highly detailed analysis.
  • Writing Quality: ⭐⭐⭐⭐⭐ Clear problem decomposition and rigorous experimental design.
  • Value: ⭐⭐⭐⭐ Offers practical guidance for LLM reasoning.