Program Synthesis via Test-Time Transduction

Conference: NeurIPS 2025 · arXiv: 2509.17393 · Code: GitHub · Area: Program Synthesis / LLM Reasoning
Keywords: program synthesis, transductive inference, active learning, hypothesis elimination, LLM, edge cases

TL;DR

This paper proposes SYNTRA, a framework that reframes program synthesis as transductive learning: at test time, it leverages visible test inputs and LLM judgment to iteratively eliminate inconsistent candidate program hypotheses. A greedy maximin algorithm minimizes the number of LLM queries, yielding relative accuracy improvements of up to 196% across four benchmarks.

Background & Motivation

Background: LLM-based program synthesis typically follows an inductive paradigm — programs are generated from a small number of training examples and expected to generalize to unseen inputs. A common strategy is to sample multiple candidate programs and filter them using training examples.

Limitations of Prior Work: Training examples are often scarce (e.g., a few rows filled by users in spreadsheet scenarios), making synthesized programs prone to errors on edge cases (atypical inputs encountered at test time). The underlying issue is epistemic uncertainty — the model has no knowledge of what inputs will appear at test time.

Key Challenge: Inductive synthesis pursues "deriving a general program from few examples," yet Vapnik's transductive principle states that "one should not solve a more general problem in order to solve a specific one." When test inputs are available, exploiting them directly is more efficient than seeking generalization.

Goal: When test inputs are visible at synthesis time, how can they be exploited to improve program robustness?

Key Insight: The problem is modeled as active learning over a finite hypothesis class — an LLM predicts outputs (pseudo-labels) for selected test inputs, and candidate programs inconsistent with those pseudo-labels are eliminated. This process iterates until a single hypothesis remains. The key question is which test input to query so that each elimination step is as informative as possible.

Core Idea: The LLM's transductive prediction capability serves as a "judge" that adjudicates among candidate outputs on selected test inputs, performing a binary-search-style elimination of inconsistent hypotheses from the candidate program space.

Method

Overall Architecture

The input consists of a program specification \(S\) (containing training I/O pairs and an optional natural language description) along with \(N\) test inputs. The output is a single program. The pipeline proceeds as follows: (1) sample multiple candidate programs \(\mathcal{P}\) via LLM; (2) filter using training examples to obtain \(\mathcal{P}'\); (3) execute candidate programs on test inputs and deduplicate by output tuple to form the hypothesis class \(\mathcal{H}\); (4) iteratively select a test input → query the LLM for its predicted output → eliminate inconsistent hypotheses → repeat until one hypothesis remains.
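As a toy illustration of steps (1)–(3), the sketch below executes candidate programs on visible test inputs and deduplicates them by output tuple. The candidate programs and inputs here are invented for illustration, not taken from the paper.

```python
# Hypothetical candidate programs sampled for a string-capitalization task.
candidates = [
    lambda s: s.upper(),              # candidate A
    lambda s: s[:1].upper() + s[1:],  # candidate B
    lambda s: s.capitalize(),         # candidate C (same outputs as B here)
]
test_inputs = ["abc", "hello world"]  # the N visible test inputs

# Step (3): each hypothesis is the N-tuple of outputs over the test inputs;
# programs with identical output tuples collapse into a single hypothesis.
buckets = {}
for prog in candidates:
    key = tuple(prog(x) for x in test_inputs)
    buckets.setdefault(key, []).append(prog)

hypothesis_class = list(buckets)      # the deduplicated class H
```

Here three candidate programs collapse into two hypotheses, illustrating why operating in the output space rather than the program space shrinks the search.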

Key Designs

  1. Construction of the Finite Hypothesis Class:

    • Function: Converts candidate programs into a set of \(N\)-tuples of outputs over the test inputs.
    • Mechanism: Multiple candidate programs may produce identical output tuples on the test inputs — after deduplication, the hypothesis class \(\mathcal{H}\) is obtained, where each hypothesis is an \(N\)-dimensional output vector. Realizability is assumed (i.e., the correct hypothesis is contained in \(\mathcal{H}\)).
    • Design Motivation: Operating in the output space rather than the program space substantially reduces the number of hypotheses.
  2. Greedy Maximin Query Selection:

    • Function: Selects the most discriminative test input for querying the LLM.
    • Mechanism: For each test input \(i\), the worst-case elimination count over the current version space \(\mathcal{V}_t\) is computed as \(\min_y |\{h \in \mathcal{V}_t \mid h[i] \neq y\}|\); the input \(i^*\) with the largest maximin value is selected. Ties are broken by choosing the input with the shortest total candidate output length, as the LLM judges shorter outputs more reliably.
    • Design Motivation: Analogous to binary search — each query eliminates the maximum number of hypotheses in the worst case, minimizing total query count.
  3. LLM Transductive Prediction (Pseudo-labeling):

    • Function: Given a test input and a set of candidate outputs, the LLM selects the correct output.
    • Mechanism: The specification \(S\), the test input \(\tilde{x}_{i^*}\), and the set of candidate outputs at that position \(\{h[i^*] \mid h \in \mathcal{V}_t\}\) are jointly provided to the LLM, which applies its reasoning ability and world knowledge to select the answer. This is a multiple-choice task rather than open-ended generation.
    • Design Motivation: LLMs achieve higher accuracy on multiple-choice tasks than on open-ended generation; this exploits commonsense reasoning rather than solely code generation ability.
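The greedy maximin step (design 2) admits a compact sketch. The toy version space and the hard-coded judge answer below stand in for the LLM's pseudo-label and are illustrative only.

```python
def maximin_score(version_space, i):
    # Worst-case elimination count for querying test input i:
    # min over candidate answers y of |{h in V_t : h[i] != y}|.
    answers = {h[i] for h in version_space}
    return min(sum(1 for h in version_space if h[i] != y) for y in answers)

def select_query(version_space, n_inputs):
    # Pick the input with the largest maximin score; break ties by the
    # shortest total candidate-output length (shorter outputs are judged
    # more reliably by the LLM).
    def key(i):
        total_len = sum(len(str(h[i])) for h in version_space)
        return (-maximin_score(version_space, i), total_len)
    return min(range(n_inputs), key=key)

# Toy version space: three surviving hypotheses over N = 3 test inputs.
V = [("A", "x", "p"), ("A", "y", "p"), ("B", "z", "q")]
i_star = select_query(V, 3)  # input 1 separates all three hypotheses

# In SYNTRA the LLM would now choose among {"x", "y", "z"}; we hard-code it.
pseudo_label = "y"
V = [h for h in V if h[i_star] == pseudo_label]  # eliminate inconsistent h
```

In this example input 1 guarantees two eliminations regardless of the judge's answer (a 3-way split), whereas inputs 0 and 2 guarantee only one, so the maximin rule selects input 1.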

Loss & Training

No additional training is involved. The entire framework operates purely at inference time (test-time), leveraging the code generation and reasoning capabilities of pretrained LLMs.

In practice, the program synthesis model \(\sigma\) samples multiple candidate programs via an LLM, filters them using training examples, and executes them on test inputs to construct the hypothesis class. The transduction model \(\tau\) is also implemented via an LLM, using its reasoning ability and world knowledge to select the correct answer from candidate outputs. Both models may use the same or different LLMs. The entire process requires no gradient updates or model fine-tuning.

Key Experimental Results

Main Results

| Benchmark | Task | SYNTRA Gain |
|---|---|---|
| Playgol | String transformation (inductive synthesis) | Up to +196% |
| MBPP+ | Natural language → code | Significant improvement |
| 1D-ARC | Visual reasoning | Significant improvement |
| MiniGrid | Procedural world modeling | Significant improvement |

Query Efficiency

| Strategy | Additional LLM Calls |
|---|---|
| Random input selection | More (baseline) |
| Maximin selection | Roughly halved relative to random selection |

Key Findings

  • The number of required queries grows sub-linearly with the number of test inputs \(N\), making the approach scalable to large test sets.
  • LLMs achieve substantially higher accuracy on "multiple-choice" tasks (selecting from candidate outputs) than on "generation" tasks (directly predicting the output).
  • The maximin strategy offers the greatest advantage in edge-case-dense scenarios, where different hypotheses diverge most.

Highlights & Insights

  • Precise Application of Vapnik's Transductive Principle: The philosophy of "do not solve the general problem" is operationalized into a concrete program synthesis framework — elegant and theoretically well-motivated.
  • Role Shift of the LLM: The LLM serves not only as a code generator but also as a "judge" (transduction model) — combining both capabilities yields stronger performance than either alone.
  • Binary-Search-Style Reasoning: The maximin algorithm performs logarithmic-scale search over the hypothesis space, with theoretically grounded query efficiency.
  • Practical Scenario Alignment: Applications such as spreadsheet automation and data transformation naturally satisfy the "test inputs visible" condition, making the framework directly deployable.

Limitations & Future Work

  • Realizability Assumption: The framework assumes the correct program is contained in the candidate set; if LLM sampling fails to include the correct program, the framework breaks down. More samples or ensemble models can mitigate this but increase cost.
  • LLM Pseudo-label Errors: If the LLM selects the wrong output, the correct hypothesis is erroneously eliminated, potentially causing cascading errors that may be unrecoverable.
  • Python-Only Evaluation: Effectiveness under domain-specific languages (DSLs) or other programming paradigms has not been validated; DSLs may exhibit different hypothesis space structures.
  • Scalability of the Hypothesis Class: When the number of candidate programs is extremely large, constructing and eliminating hypotheses in \(\mathcal{H}\) may become a bottleneck.
  • Requires Pre-visible Test Inputs: In purely inductive settings (e.g., API design, library development) where no predefined test inputs exist, the framework does not directly apply.
  • Future Directions: (1) Introduce soft elimination (probabilistic elimination based on LLM confidence) to reduce cascading errors; (2) combine with execution feedback — execute additional test cases to further verify remaining hypotheses; (3) automatically generate additional test inputs to further discriminate among hypotheses; (4) explore integration with self-repair methods such as Reflexion.

Comparison with Related Work

  • vs. Standard LLM Program Synthesis (Codex/CodeLlama): Standard approaches filter generated programs using training examples; SYNTRA adds a second filtering stage using test inputs.
  • vs. AlphaCode/CodeT: These works sample large numbers of programs and select the best via clustering or voting; SYNTRA replaces voting with structured hypothesis elimination.
  • vs. Direct LLM Reasoning on 1D-ARC: Pure LLM reasoning cannot guarantee output consistency; synthesizing and executing programs ensures consistency across all test inputs.
  • Broader Implications: The transduction + hypothesis elimination framework generalizes to other LLM multiple-choice settings, such as tool selection and plan disambiguation.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ The problem formulation of transductive program synthesis is novel, with an elegant framework and clear theoretical motivation.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Four benchmarks spanning string transformation, code generation, visual reasoning, and world modeling, with complete query efficiency analysis.
  • Writing Quality: ⭐⭐⭐⭐⭐ Rigorous problem formulation, clear algorithmic description, and precise invocation of Vapnik's transductive principle.
  • Value: ⭐⭐⭐⭐⭐ Opens a new paradigm for program synthesis; the +196% improvement is remarkable.