
EnCompass: Enhancing Agent Programming with Search Over Program Execution Paths

Conference: NeurIPS 2025
arXiv: 2512.03571
Code: To be confirmed
Area: LLM/NLP
Keywords: Agent framework, inference-time search, nondeterministic programming, beam search, program execution paths

TL;DR

This paper proposes the Probabilistic Angelic Nondeterminism (PAN) programming model and the EnCompass Python framework, which decouple an agent's core workflow logic from its inference-time search strategy. Programmers only need to insert branchpoint() markers at LLM call sites and switch among best-of-N, beam search, tree search, and other strategies via a few configuration parameters, reducing the amount of code modification by 3–6×.

Background & Motivation

Background: Inference-time compute scaling has become a key approach for improving LLM agent performance, encompassing strategies such as best-of-N sampling, refinement (Reflexion), and tree search (ToT, LATS).

Limitations of Prior Work:

  • Programmers typically hard-code inference-time strategies into agent workflows, deeply coupling search logic with business logic.
  • Switching search strategies (e.g., from best-of-N to beam search) requires extensive structural refactoring.
  • Each agent must re-implement the same search strategies from scratch, precluding reuse.
  • Complex search strategies are abandoned because they are hard to implement, forfeiting potentially better scaling behavior.

Key Challenge: The coupling of agent workflows (what to do) with inference-time strategies (how to search) reduces code readability and constrains the space of strategies that can be explored.

Key Insight: Inspired by the "model–inference separation" paradigm from probabilistic programming—which separates model definition from inference algorithms—agent programming should similarly decouple workflows from search strategies.

Core Idea: Inference-time strategies are fundamentally searches over nondeterministic program execution paths; marking uncertainty points with branchpoint() allows the framework to automatically construct the search tree.
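This core idea can be made concrete with a small self-contained toy (my own illustration, not the EnCompass library): the agent is written as straight-line code in which a `branchpoint()` stand-in marks each unreliable step, and a best-of-N search over re-executions finds a path on which every step happened to succeed.

```python
import random

random.seed(0)  # make this sketch reproducible

def branchpoint(candidates):
    """Stand-in for an unreliable operation such as an LLM call:
    nondeterministically pick one of several possible outputs."""
    return random.choice(candidates)

def agent(task):
    # Written as if every step succeeds ("angelic" nondeterminism);
    # the search below is responsible for finding such a path.
    plan = branchpoint([f"plan-A:{task}", f"plan-B:{task}"])
    answer = branchpoint([plan + "/good", plan + "/bad"])
    return answer

def score(answer):
    # Toy evaluator: 1.0 if the final step came out well.
    return 1.0 if answer.endswith("/good") else 0.0

def best_of_n(fn, task, n=16):
    """Depth-1 search tree: draw n full execution paths, keep the best."""
    return max((fn(task) for _ in range(n)), key=score)

result = best_of_n(agent, "sort a list")
```

Here the "search tree" is trivial (N independent root-to-leaf paths); the framework's point is that richer strategies explore the same tree without changing the agent body.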

Method

Overall Architecture

EnCompass is a Python framework implementing the PAN programming model. Programmers annotate agent functions with the @encompass.compile decorator, insert branchpoint() calls before unreliable operations such as LLM calls, and invoke record_score() at evaluation points. At runtime, the decorator compiles the function into a search-space object, which is then explored via .search(algo, **config) using the desired search algorithm.
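In pseudocode, that workflow might look as follows. The names @encompass.compile, branchpoint(), record_score(), and .search() come from the paper; everything else (the import path, the llm_translate and run_tests helpers, the exact search arguments) is a hypothetical sketch, not the verified EnCompass interface.

```python
import encompass  # import path assumed for illustration

@encompass.compile
def translate_file(java_source):
    encompass.branchpoint()                   # unreliable step: LLM call
    python_code = llm_translate(java_source)  # hypothetical LLM helper
    encompass.record_score(run_tests(python_code))  # hypothetical evaluator
    return python_code

# Switching strategies is a configuration change, not a refactor:
best = translate_file.search("best_of_n", n=16)
best = translate_file.search("beam", beam_width=4, branching_factor=2)
```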

Key Designs

  1. PAN Programming Model:

    • Function: Models nondeterministic program execution as a Markov chain; branchpoint() serves as a marker, and the program state is defined as (location, variable values).
    • Mechanism: Angelic nondeterminism—programmers write code assuming unreliable operations always produce good outputs, while the runtime search is responsible for finding execution paths that actually do so.
    • Distinction from classical search: Unlike classical graph search, a node's children cannot be exhaustively enumerated; they can only be sampled. Classical search algorithms are therefore adapted by fixing a branching factor that caps how many children are sampled per node.
  2. Unified Inference-Time Strategies:

    • Best-of-N → depth-1 search tree: A single branchpoint() is added at the start; N samples are drawn and the highest-scoring one is selected.
    • Beam Search → multi-level search tree: A branchpoint() is added before each step; beam width and branching factor control search breadth.
    • Refinement → backtracking with memory: A NoCopy shared feedback list allows different execution paths to share the history of previous attempts.
    • Self-Consistency → group evaluation: record_score() supports group evaluation functions such as majority voting.
    • Key insight: Global BoN (beam width = N, branching = 1) and Local BoN (beam width = 1, branching = N) are the two extremes of beam search; beam search interpolates between them.
  3. EnCompass Compiler:

    • Function: Compiles Python functions into search-space objects that can be manipulated by search algorithms.
    • Mechanism: The @encompass.compile decorator transforms a function into a state machine, with branchpoint() becoming state-transition points.
    • vs. hand-written state machines: EnCompass requires 3–6× fewer code modifications than plain Python while preserving code readability.
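The claim that global and local best-of-N are the two extremes of beam search can be made concrete with a small self-contained toy (my own illustration with synthetic sampling and scoring, not the EnCompass implementation):

```python
import random

random.seed(0)  # make this sketch reproducible

def sample_step(prefix):
    """Stand-in for one LLM call: extend a partial path with a sample."""
    return prefix + [random.random()]

def path_score(path):
    return sum(path)  # synthetic score: higher is better

def beam_search(steps, beam_width, branching):
    """Keep `beam_width` partial paths; sample `branching` children of each."""
    beams = [[] for _ in range(beam_width)]
    for _ in range(steps):
        candidates = [sample_step(p) for p in beams for _ in range(branching)]
        candidates.sort(key=path_score, reverse=True)
        beams = candidates[:beam_width]  # prune to the beam width
    return max(beams, key=path_score)

# Fixed budget of beam_width * branching = 8 samples per step:
global_bon = beam_search(steps=3, beam_width=8, branching=1)  # 8 independent full paths, pick best at the end
local_bon  = beam_search(steps=3, beam_width=1, branching=8)  # greedy: keep the best single sample per step
balanced   = beam_search(steps=3, beam_width=4, branching=2)  # beam search interpolates between the two
```

With branching = 1 nothing is ever pruned mid-run (global BoN); with beam width = 1 every step commits greedily (local BoN); intermediate settings trade breadth against per-step selection.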

Key Experimental Results

Main Results: Repository-Level Code Translation (Java → Python)

Search Strategy                ps0 Performance   Notes
No search (baseline)           ~50%              Temperature 0.0, single execution
Global BoN                     ~65%              N = 16
Local BoN (coarse)             ~68%              N = 16, file-level
Beam (coarse) + Beam (fine)    ~78%              File-level + method-level beam search

Code Modification Volume Comparison

Case Study          Without EnCompass           With EnCompass           Savings
                    (lines / words added)       (lines / words added)
Code translation    +423 lines / +2735 words    +75 lines / +514 words   5.3×
Hypothesis Search   +21 lines / +120 words      +8 lines / +27 words     4.4×
Reflexion           +27 lines / +181 words      +9 lines / +32 words     5.7×

Key Findings

  • Beam search scales substantially better than simple BoN: it achieves the steepest slope of performance versus log(cost).
  • The best-performing strategy is precisely the most complex to implement manually, underscoring the value of lowering the implementation barrier.
  • Switching strategies requires changing only 2 lines of parameters, whereas hand-written implementations require refactoring the entire state machine.
  • On ps1–ps4 (larger repositories), beam search consistently outperforms BoN.

Highlights & Insights

  • "Separation of concerns" reduces the cost of exploring search strategies: The primary contribution is not a new algorithm but enabling programmers to experiment with diverse search strategies at minimal cost. This parallels the significance of PyTorch for deep learning—lowering the barrier to experimentation unleashes innovation.
  • Unified perspective: Framing BoN, beam search, refinement, and self-consistency as different strategies for searching over an execution-path tree is conceptually elegant.
  • Probabilistic programming for agents: PAN stands in relation to agent programming as Stan/PyMC stands to probabilistic inference—separating model specification from inference.

Limitations & Future Work

  • Applicable only to "program-in-control" style agents; not suited for scenarios where the LLM autonomously decides tool calls (e.g., SWE-Bench, WebArena).
  • Case studies are relatively small in scale (MIT OCW assignment repositories); validation on large-scale industrial agents is lacking.
  • The search space grows exponentially with the number of branchpoints, requiring effective pruning and scoring strategies.
  • The evaluation metric (self-validation %) may not be fully reliable.
  • Integration examples with existing agent frameworks such as LangGraph are absent.

Comparison with Related Frameworks

  • vs. LangGraph: LangGraph models workflows as state machines, but search must still be implemented manually; EnCompass elevates search strategies to first-class citizens.
  • vs. DSPy: DSPy automates prompt engineering; EnCompass automates search strategies. The two are orthogonal and complementary.
  • vs. ToT/LATS: These are concrete search algorithms; EnCompass is a framework that makes such algorithms easily applicable to arbitrary agents.

Rating

  • Novelty: ⭐⭐⭐⭐ The PAN model and the "separation of concerns" perspective are highly valuable.
  • Experimental Thoroughness: ⭐⭐⭐ Case studies are persuasive but limited in scale.
  • Writing Quality: ⭐⭐⭐⭐⭐ Concepts are clearly articulated, code examples are abundant, and the paper is easy to follow.
  • Value: ⭐⭐⭐⭐ Has direct implications for agent development practice.