Once Upon an Input: Reasoning via Per-Instance Program Synthesis¶
Conference: NeurIPS 2025 | arXiv: 2510.22849 | Code: https://github.com/adaminsky/pips | Area: Code Intelligence | Keywords: Program Synthesis, LLM Reasoning, Code Generation, Neuro-Symbolic, Multi-Step Reasoning
TL;DR¶
This paper proposes PIPS (Per-Instance Program Synthesis), which iteratively refines programs through instance-level program synthesis and structured feedback, while dynamically selecting between direct reasoning and program synthesis via a confidence measure. PIPS achieves up to an 8.6% absolute improvement in harmonic-mean accuracy over PoT across 30 benchmarks.
Background & Motivation¶
Background: LLMs have made substantial progress in zero-shot reasoning, with methods such as Chain-of-Thought (CoT) and Program-of-Thought (PoT) further enhancing multi-step reasoning capabilities.
Limitations of Prior Work: Existing instance-level program synthesis methods (e.g., PoT) face three core challenges:
- Open-domain problems: It is unclear when to use programs versus CoT; forcing code generation on non-algorithmic problems (e.g., sentiment understanding) tends to produce trivial programs (i.e., hardcoded answers).
- Absence of task specifications: No specification of correct program behavior is available to guide search, causing over 50% of PoT outputs to be trivial programs.
- Unstructured inputs: Programs require structured inputs, whereas reasoning problems are typically presented as unstructured text or images.
Key Challenge: More than 50% of PoT-generated programs hardcode answers, 6.3% contain syntax errors, and 11.5% return type errors.
Key Insight: Address three issues at the instance level — (1) decide whether to use a program, (2) iteratively refine programs with structural feedback, and (3) extract symbolic inputs before generating programs.
Method¶
Overall Architecture¶
PIPS formulates the reasoning problem as \(y = P(c(x))\), where \(c\) maps raw inputs to structured symbolic inputs and \(P\) is an executable program. The pipeline proceeds as: confidence assessment → (selection of CoT or synthesis) → symbolic extraction → iterative program generation and evaluation.
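The control flow above can be sketched as follows. The stage functions (`assess`, `choose`, `extract`, `synthesize`, `answer_directly`) are hypothetical stand-ins for the paper's LLM-backed components, injected as parameters so the flow itself is self-contained:

```python
def pips_pipeline(x, assess, choose, extract, synthesize, answer_directly):
    """High-level PIPS control flow (a sketch, not the authors' code).

    assess(x)          -> 10-dim confidence vector S(x)
    choose(scores)     -> True if program synthesis is selected
    extract(x)         -> structured symbolic input c(x)
    synthesize(x, c_x) -> answer from an iteratively refined program P(c(x))
    answer_directly(x) -> plain CoT answer
    """
    scores = assess(x)                 # confidence assessment
    if not choose(scores):             # classifier prefers direct reasoning
        return answer_directly(x)
    c_x = extract(x)                   # symbolic input extraction
    return synthesize(x, c_x)          # program generation + evaluation loop
```

With stub stages substituted for the LLM calls, the same function exercises both branches, which makes the selection logic easy to verify in isolation.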
Key Designs¶
- Selective Synthesis:
  - Function: Decides at the instance level whether to apply CoT or program synthesis.
  - Mechanism: Ten criteria are designed for LLM self-evaluation (e.g., formalizability, probability of successful execution, logical robustness), producing a confidence vector \(S(x) = (p_1(x),\ldots,p_{10}(x)) \in [0,1]^{10}\); a logistic regression classifier makes the final decision.
  - Design Motivation: Experiments show that applying PoT to non-algorithmic tasks almost always produces trivial code (equivalent to CoT but with an additional Python call), making it preferable to skip synthesis entirely.
- Specification-Free Program Search:
  - Function: Iteratively improves programs in the absence of test cases or task specifications.
  - Mechanism: An evaluator \(E\) checks structural properties of programs: non-triviality (no hardcoding), syntactic correctness, type correctness, and absence of placeholders. The loop proceeds: generate → evaluate → provide feedback → regenerate, for up to \(k\) rounds.
  - Design Motivation: Traditional program synthesis relies on input-output examples or logical specifications, which are unavailable in single-instance reasoning; detecting common failure patterns (hardcoding, syntax errors, etc.) serves as a surrogate specification.
- Symbolic Input Extraction:
  - Function: Converts unstructured data (text, images) into structured JSON inputs.
  - Mechanism: An LLM infers an ad hoc schema to extract entities, attributes, and relations, yielding an explicit program input \(c(x)\).
  - Design Motivation: Programs generated by PoT without explicit inputs must either hardcode data or process raw images in code (e.g., 12.7% use OpenCV/Pillow for image handling), both of which are fragile.
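The selective-synthesis decision reduces to logistic regression over the ten self-evaluation scores. A minimal sketch, where the weights and bias are illustrative placeholders rather than the paper's fitted values:

```python
import math

def should_synthesize(scores, weights, bias):
    """Map the confidence vector S(x) = (p_1, ..., p_10) in [0, 1]^10
    to a binary decision via logistic regression: synthesize a program
    when the predicted probability of success is at least 0.5."""
    z = bias + sum(w * p for w, p in zip(weights, scores))
    prob = 1.0 / (1.0 + math.exp(-z))
    return prob >= 0.5

# Illustrative parameters (assumed, not from the paper): with equal
# weights and bias -5, the decision roughly asks whether the average
# criterion score exceeds 0.5.
WEIGHTS = [1.0] * 10
BIAS = -5.0
```

In the paper the classifier is trained; here the fixed parameters only illustrate how the ten scores collapse into a single CoT-vs-synthesis choice.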
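The specification-free evaluator can be approximated with static checks over a candidate program's AST. The sketch below implements three of the checks named above (syntax, placeholders, non-triviality) for Python programs; it is an illustration of the idea, not the paper's evaluator:

```python
import ast

def evaluate_program(source: str) -> list:
    """Structural checks in the spirit of PIPS's evaluator E.

    Returns a list of feedback strings for the regeneration prompt;
    an empty list means the program passes all checks.
    """
    feedback = []
    # Syntactic correctness: the program must parse at all.
    try:
        tree = ast.parse(source)
    except SyntaxError as e:
        return [f"syntax error: {e.msg}"]
    # Placeholders: reject stubs such as `pass` or `...`.
    for node in ast.walk(tree):
        if isinstance(node, ast.Pass) or (
            isinstance(node, ast.Constant) and node.value is Ellipsis
        ):
            feedback.append("placeholder found: program is incomplete")
            break
    # Non-triviality: a function whose entire body is `return <constant>`
    # hardcodes the answer instead of computing it.
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef) and len(node.body) == 1:
            stmt = node.body[0]
            if isinstance(stmt, ast.Return) and isinstance(stmt.value, ast.Constant):
                feedback.append(f"trivial program: {node.name} returns a constant")
    return feedback
```

Each failed check becomes natural-language feedback for the next generation round, which is how structural properties substitute for the missing input-output specification.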
Loss & Training¶
No training is required; the method is entirely based on inference-time prompt engineering and iterative feedback mechanisms.
Key Experimental Results¶
Main Results¶
| Model | Task Set (HMean) | PIPS | PoT | CoT | Gain |
|---|---|---|---|---|---|
| Gemini-2.0-Flash | All 30 tasks | 20.8 | 12.2 | 11.4 | +8.6% vs PoT |
| GPT-4.1-mini | All 30 tasks | — | — | — | +0.8% vs PoT |
| o4-mini | All 30 tasks | — | — | — | +5.7% vs PoT |
| Gemini-2.0-Flash | Algorithmic tasks | Substantially higher | — | — | +15.9% |
Ablation Study¶
| Configuration | HMean Accuracy |
|---|---|
| PIPS (full) | 20.8% |
| PIPS (w/o switch) | 18.3% (−2.5%) |
| PIPS-0 (w/o switch, w/o iteration) | 12.9% (−7.9%) |
| PIPS-0 (w/o switch, w/o symbolic extraction, w/o iteration) | 4.3% (−16.5%) |
Key Findings¶
- Even at \(k=0\) (no evaluator), PIPS outperforms PoT by 5.6%, demonstrating that symbolic extraction alone contributes substantially.
- The confidence-based switch correctly classifies 65.3% of the critical instances (24.8% of all instances), yielding a 2.2% absolute accuracy gain.
- Trivial programs are reduced by 75.6% and syntax errors by 86.8%.
- On multimodal tasks (CLEVR, Leaf), PIPS never resorts to OpenCV/Pillow, whereas PoT does so in 12.7% of cases.
Highlights & Insights¶
- Instance-level method selection is an important but underexplored problem: the degree of "algorithmicity" varies considerably across instances within the same task set, making a one-size-fits-all approach inadequate.
- The structured feedback loop is cleverly designed: it requires no test cases and instead uses static/dynamic code quality checks as a general-purpose program improvement signal.
- Symbolic extraction decouples perception from reasoning: this principle is highly transferable to any scenario where LLMs must process structured data.
Limitations & Future Work¶
- The confidence-based switch relies on LLM self-assessment, which varies considerably in quality across models.
- Symbolic extraction may lose information, particularly fine-grained spatial relations in visual tasks.
- Iteration increases latency and cost, especially for large \(k\).
- For purely creative or subjective problems, neither CoT nor program synthesis is well-suited.
Related Work & Insights¶
- vs. PAL/FCoT: These methods use fixed per-task programs and do not adapt to instance-level variation; PIPS generates programs at the instance level.
- vs. Code Interpreter (CI): CI is agent-based, incurring higher cost and lacking the targeted design of PIPS.
- The symbolic extraction idea introduced in this paper can be applied to multimodal reasoning by first structuring image content before processing it programmatically.
Rating¶
- Novelty: ⭐⭐⭐⭐ First systematic treatment of the three core challenges in instance-level program synthesis.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ 30 benchmarks, 3 frontier LLMs, and highly detailed analysis.
- Writing Quality: ⭐⭐⭐⭐⭐ Clear problem decomposition and rigorous experimental design.
- Value: ⭐⭐⭐⭐ Offers practical guidance for LLM reasoning.