Searching Latent Program Spaces

Conference: NeurIPS 2025
arXiv: 2411.08706
Code: Available (presumed open-source; trained on the re-ARC dataset)
Area: Program Synthesis / Few-Shot Learning / Test-Time Adaptation
Keywords: Latent Program Network, Program Synthesis, Test-Time Search, ARC-AGI, Variational Inference

TL;DR

This paper proposes the Latent Program Network (LPN), which uses an encoder to map input–output examples into a latent program representation, then performs gradient-based search in the latent space at test time to adapt to new tasks. LPN substantially outperforms in-context learning and test-time training methods on the ARC-AGI benchmark.

Background & Motivation

A central challenge in general intelligence is skill acquisition efficiency—learning new tasks from minimal experience and generalizing beyond the training distribution. Existing approaches face a dilemma:

Inductive methods (Program Synthesis): Search for programs within a domain-specific language (DSL). These generalize well but suffer from combinatorial search space explosion and require hand-designed DSLs that do not scale.

Transductive methods (In-Context Learning): Neural networks directly predict outputs from examples. These are scalable but lack structured test-time adaptation and cannot guarantee consistency with the provided demonstrations.

Test-Time Training (TTT) addresses consistency by fine-tuning parameters at test time, but searching the full parameter space is computationally expensive and prone to overfitting.

The core insight of LPN is: searching for programs in a compact latent space is more efficient and less prone to overfitting than fine-tuning in parameter space.

Method

Overall Architecture

LPN consists of three core components:

  1. Encoder \(q_\phi(z|x,y)\): maps input–output pairs to a latent program distribution.
  2. Latent space optimization \(z' = f(p_\theta, z, x, y)\): searches for a better program representation within the latent space.
  3. Decoder \(p_\theta(y|x,z')\): generates the output conditioned on the latent program and a new input.

The inference pipeline proceeds as follows: the encoder produces an intuitive initial guess (System 1) → latent space optimization searches for a better hypothesis (System 2) → the decoder executes the program.
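
A minimal PyTorch-style sketch of this inference pipeline, not the authors' implementation: the `encoder(x, y) -> (mu, logvar)` interface, a `decoder(x, z)` that returns a distribution over outputs, and the step count and learning rate are all assumptions made for illustration.

```python
import torch

@torch.no_grad()
def encode_task(encoder, xs, ys):
    """System 1: map each demonstration to a Gaussian latent and average the means."""
    mus = torch.stack([encoder(x, y)[0] for x, y in zip(xs, ys)])  # (n, d)
    return mus.mean(dim=0)                                         # initial guess z'_0

def latent_search(decoder, xs, ys, z0, steps=100, lr=0.1):
    """System 2: gradient ascent on the demonstrations' log-likelihood; keep the best z."""
    z = z0.clone().requires_grad_(True)
    best_z, best_ll = z0.clone(), -float("inf")
    opt = torch.optim.SGD([z], lr=lr)
    for _ in range(steps):
        ll = sum(decoder(x, z).log_prob(y).sum() for x, y in zip(xs, ys))
        if ll.item() > best_ll:
            best_ll, best_z = ll.item(), z.detach().clone()
        opt.zero_grad()
        (-ll).backward()  # maximize the log-likelihood by descending its negative
        opt.step()
    return best_z

def predict(encoder, decoder, xs, ys, x_test):
    z0 = encode_task(encoder, xs, ys)            # fast, amortized initial guess
    z_star = latent_search(decoder, xs, ys, z0)  # deliberate test-time search
    return decoder(x_test, z_star).sample()      # or greedy/argmax decoding for exact match
```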

Key Designs

  1. Probabilistic encoder with mean aggregation: The encoder independently maps each I/O pair to a Gaussian distribution \(q_\phi(z|x,y) = \mathcal{N}(\mu, \Sigma)\). For multiple I/O pairs from the same task, the per-pair latents are aggregated by taking their mean; during training, the latent used to reconstruct pair \(i\) averages the other pairs' latents, \(z_i = \frac{1}{n-1}\sum_{j \neq i} z_j\) (see the training detail below and the sketch after this list). This ensures permutation invariance over the set of demonstrations and avoids the quadratic complexity of Transformer self-attention. The design motivation is that a single I/O pair is compatible with multiple programs, and probabilistic modeling naturally handles this ambiguity.

  2. Gradient ascent in latent space: At test time, starting from the encoder-initialized \(z'_0\), the method iterates: \(z'_k = z'_{k-1} + \alpha \cdot \nabla_z \sum_{i=1}^{n} \log p_\theta(y_i | x_i, z)\Big|_{z=z'_{k-1}}\) The search objective is to maximize the decoder's log-likelihood over all provided I/O pairs. Compared to TTT, which fine-tunes the full parameter space, LPN searches only in a low-dimensional latent space (e.g., 256 dimensions), yielding higher efficiency and reduced overfitting. The \(z'\) with the highest likelihood is selected rather than the last iterate.

  3. Search-aware training: A small number of gradient search steps (e.g., one step) are also performed during training, so that the latent space is explicitly optimized to support test-time search. Experiments show that training with Grad 1 enables 100-step test-time search to reach 99.5% accuracy, whereas training with Grad 0 yields only 67.5%. A stop-gradient is applied through the latent update, which reduces computational overhead and empirically also improves performance (see the sketch below).
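
A hedged sketch of designs 1 and 3, assuming the per-pair latent samples are stacked into an `(n, d)` tensor and that a hypothetical helper `decoder_logprob(xs, ys, z)` returns the per-pair log-likelihoods \(\log p_\theta(y_i | x_i, z_i)\); this is illustrative, not the authors' code.

```python
import torch

def leave_one_out_mean(z):
    """Aggregate per-pair latents: row i is the mean of all z_j with j != i."""
    total = z.sum(dim=0, keepdim=True)        # (1, d)
    return (total - z) / (z.shape[0] - 1)     # (n, d)

def inner_search_step(decoder_logprob, xs, ys, z, lr=0.1):
    """One training-time gradient step in latent space (the 'Grad 1' regime)."""
    ll = decoder_logprob(xs, ys, z).sum()     # sum_i log p_theta(y_i | x_i, z_i)
    (grad,) = torch.autograd.grad(ll, z, create_graph=False)
    # Stop-gradient on the update term: the outer loss is not differentiated through
    # the inner step (no second-order terms), but it still backpropagates into z itself.
    return z + lr * grad.detach()
```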

Loss & Training

Training uses a variational objective comprising a reconstruction loss and a KL divergence term:

\[\mathcal{L}_{\text{total}}(\phi, \theta) = \mathcal{L}_{\text{rec}}(\phi, \theta) + \beta \mathcal{L}_{\text{KL}}(\phi)\]
  • Reconstruction loss: \(\mathcal{L}_{\text{rec}} = \sum_{i=1}^{n} -\log p_\theta(y_i | x_i, z'_i)\) (cross-entropy)
  • KL loss: \(\mathcal{L}_{\text{KL}} = \sum_{i=1}^{n} D_{\text{KL}}(q_\phi(z|x_i,y_i) \| \mathcal{N}(0,I))\)

A key training detail: when reconstructing the \(i\)-th output, the program is inferred from the encodings of the remaining \(n-1\) I/O pairs, preventing the encoder from directly compressing the output as a shortcut.
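
Tying the pieces together, here is a hedged sketch of one training-loss evaluation under the same assumed interfaces as the earlier sketches: a per-pair `encoder(x, y) -> (mu, logvar)`, the hypothetical `decoder_logprob(xs, ys, z)` helper, and the `leave_one_out_mean` / `inner_search_step` functions above. The `beta` and learning-rate values are illustrative, not taken from the paper.

```python
import torch

def lpn_training_loss(encoder, decoder_logprob, xs, ys, beta=0.1, inner_lr=0.1):
    """One LPN training-loss evaluation for a task with n demonstration pairs."""
    # One Gaussian posterior per I/O pair: (mu_i, logvar_i)
    mus, logvars = zip(*(encoder(x, y) for x, y in zip(xs, ys)))
    mu, logvar = torch.stack(mus), torch.stack(logvars)      # (n, d) each

    # Reparameterized per-pair latent samples
    z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()

    # KL of each per-pair posterior against the standard-normal prior N(0, I)
    kl = 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1.0).sum()

    # The latent for pair i is the leave-one-out mean of the other pairs' latents,
    # refined by a single search-aware gradient step (helpers from the sketch above).
    z_loo = leave_one_out_mean(z)
    z_ref = inner_search_step(decoder_logprob, xs, ys, z_loo, lr=inner_lr)

    # Cross-entropy reconstruction of each output from its refined latent
    rec = -decoder_logprob(xs, ys, z_ref).sum()
    return rec + beta * kl
```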

Key Experimental Results

Main Results (Pattern Task, Exact Match %)

| Training \ Inference Steps | Grad 0 | Grad 1 | Grad 5 | Grad 20 | Grad 100 |
| --- | --- | --- | --- | --- | --- |
| Grad 0 | 3.2 | 3.6 | 18.8 | 52.5 | 67.5 |
| Grad 1 | 8.6 | 44.6 | 85.4 | 98.4 | 99.5 |
| Grad 5 | 0.0 | 0.4 | 31.9 | 88.5 | 98.1 |

In-Context baseline: 77.7 (no gradient-based test-time search).

ARC-AGI 2024 Results

| FLOPs | LPN (ID) | TTT (ID) | LPN (OOD) | TTT (OOD) |
| --- | --- | --- | --- | --- |
| 2E+11 | 68.75 | 45.75 | 7.75 | 5.85 |
| 2E+12 | 75.95 | 51.75 | 10.25 | 7.35 |
| 2E+13 | 80.00 | 58.50 | 15.25 | 13.50 |
| 2E+14 | 76.25 | 58.75 | 15.50 | 16.00 |
| 2E+15 | 78.50 | 57.00 | 15.50 | 15.25 |

Ablation Study (OOD Pattern Task)

| Method | Grad 0 | Grad 10 | Grad 100 | Notes |
| --- | --- | --- | --- | --- |
| In-Context | 0.0 | – | – | No test-time adaptation |
| TTT | 0.0 | 1.8 | 0.3 | Overfitting |
| LPN Grad 0 | 0.3 | 18.8 | 41.1 | Base variant |
| LPN Grad 1 | 0.0 | 59.9 | 88.0 | Best |

Key Findings

  • Test-time search yields qualitative gains: On OOD tasks, enabling gradient search roughly doubles LPN's performance (7.75% → 15.25% on ARC-AGI).
  • Search-aware training is critical: Training with Grad 1 optimizes the latent space to support efficient search (99.5% vs. 67.5%).
  • TTT overfits on small models: OOD performance under TTT degrades from 1.8% at 10 steps to 0.3% at 100 steps, whereas LPN continues to improve with more search steps.
  • Encoder initialization is indispensable: Initializing search from the prior \(p(z)\) rather than the encoder leads to a substantial performance drop.
  • Generalization across specification sizes: LPN generalizes smoothly to more I/O pairs than seen during training, whereas in-context learning overfits to the training specification size.
  • More efficient than pretrained LLM-based methods: LPN with 178M parameters outperforms CodeT5 (220M) and text-davinci (175B) on ARC-AGI.

Highlights & Insights

  • Searching for programs in latent space is the core innovation—combining the adaptability of inductive methods with the scalability of transductive methods.
  • The System 1 / System 2 analogy is intuitive: the encoder provides fast intuition, while latent optimization performs deliberate reasoning.
  • The permutation invariance of mean aggregation avoids the quadratic complexity of attention, making the approach scalable to larger sets of demonstrations.
  • Search-aware training is the key inductive bias that makes the method work.

Limitations & Future Work

  • The training data distribution is limited—training only on re-ARC constrains the expressiveness of the latent space.
  • Gradient search is a first-order method and may get stuck in local optima; global search strategies such as CMA-ES could be explored.
  • Programs are represented as continuous vectors, sacrificing interpretability.
  • At high computational budgets, TTT may catch up to LPN (the two methods are on par at 2E+14 FLOPs on ARC-AGI OOD).
  • Compositional generalization remains limited—while there are successful cases, robustness is insufficient.

Related Work

  • LEAPS performs latent-space search over Karel programs; LPN generalizes this idea to more general tasks.
  • Neural Processes (Garnelo et al.) provide the core architectural inspiration—introducing a bottleneck to learn implicit representations.
  • Semi-amortized variational inference offers a theoretical perspective: the encoder's amortized inference combined with latent optimization closes the amortization gap.
  • LEO (Rusu et al.) performs MAML in latent space but requires a hypernetwork to generate parameters; LPN leverages the in-context capabilities of Transformers to directly condition on the latent variable.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ — The idea of searching in a latent program space is original and elegant.
  • Experimental Thoroughness: ⭐⭐⭐⭐ — Validated across Pattern/String/ARC-AGI tasks with comprehensive ablations.
  • Writing Quality: ⭐⭐⭐⭐ — Clear motivation and systematic experimental analysis.
  • Value: ⭐⭐⭐⭐ — Introduces a new paradigm for test-time adaptation.