SOCIA-EVO: Automated Simulator Construction via Dual-Anchored Bi-Level Optimization¶
Conference: ACL 2026 arXiv: 2604.17351 Code: https://github.com/cruiseresearchgroup/SOCIA/tree/evo Area: Code Intelligence Keywords: Automated simulator construction, dual-anchored evolutionary framework, bi-level optimization, strategy playbook, distributional fidelity
TL;DR¶
This paper proposes SOCIA-EVO, an LLM agent framework that reformulates automated simulator construction as a dual-anchored evolutionary process. It anchors empirical constraints via a static Blueprint, decouples structural revision and parameter calibration through bi-level optimization, and manages repair hypotheses via a self-curated strategy Playbook with Bayesian-weighted retrieval guided by execution feedback. SOCIA-EVO significantly outperforms baselines such as Reflexion and G-SIM on three simulation tasks: user modeling, mask-wearing diffusion, and personal mobility.
Background & Motivation¶
Background: Automatically constructing simulators from observational data (data-driven simulation) is foundational to understanding complex systems. Unlike general software engineering where functional correctness suffices, simulator construction is inherently a scientific modeling task requiring distributional fidelity—generated programs must reproduce the statistical regularities, causal mechanisms, and emergent behaviors present in real data.
Limitations of Prior Work: Standard LLM agents applied to long-horizon simulator construction exhibit two critical failure modes: (1) context drift—as simulator complexity grows, constraints established during initial data analysis lose salience and agents may hallucinate mechanisms absent from the data; (2) optimization instability—agents conflate structural errors (e.g., incorrect transition logic) with parameter mismatches (e.g., suboptimal rates), rewriting correct logic when simple parameter tuning would suffice, leading to oscillatory "whack-a-mole" revisions.
Key Challenge: LLMs excel at discrete logical reasoning but struggle with high-dimensional continuous parameter search. Furthermore, they lack a persistent mechanism for validating repair strategies—previously attempted fixes and their quantified outcomes are lost as the context window advances, causing repeated proposals of superficially plausible but empirically ineffective repairs.
Goal: Design an agent framework that maintains long-horizon consistency, distinguishes structural from parametric issues, and accumulates validated repair experience.
Key Insight: Introduce a dual-anchoring mechanism—a static Blueprint prevents context drift, while a dynamic strategy Playbook accumulates and validates repair hypotheses; bi-level optimization strictly decouples structural revision (LLM-driven outer loop) from parameter calibration (numerical optimizer inner loop).
Core Idea: Simulator construction = Blueprint-anchored search space + bi-level optimization decoupling structure and parameters + self-curated Playbook for repair hypotheses.
Method¶
Overall Architecture¶
SOCIA-EVO employs a closed-loop iterative workflow with six specialized agents. The process begins with a data analysis agent generating a static Blueprint \(\mathcal{B}\), followed by an evolutionary loop: a code generation agent produces simulator code \(P_t\) and calibrator \(C_t\) conditioned on \(\mathcal{B}\) and current Playbook strategies \(\mathcal{K}\); an execution agent runs the inner-loop optimization over \(\theta\) and collects metrics \(\mathcal{M}_t\); a feedback agent diagnoses deviations; a Playbook manager updates the strategy pool; and an iteration controller evaluates convergence.
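The closed loop above can be sketched as a short control-flow skeleton. This is a toy illustration, not the paper's released code: the caller-supplied `agents` dict stands in for the six specialized agents, and all names and interfaces are assumptions made for readability.

```python
# Toy sketch of SOCIA-EVO's closed-loop workflow. The "agents" here are
# caller-supplied stand-ins for the paper's six specialized agents; names
# and interfaces are illustrative, not taken from the released code.

def build_simulator(data, max_iters, agents, patience=2):
    """Outer evolutionary loop: build the Blueprint once, then iterate
    generate -> calibrate -> diagnose -> curate until convergence."""
    blueprint = agents["analyze"](data)          # static Blueprint B (locked)
    playbook, program = [], None
    best, best_loss, stall = None, float("inf"), 0
    for _ in range(max_iters):
        # Code-generation agent proposes a structure P_t and a calibrator C_t
        program, calibrator = agents["generate"](blueprint, playbook, program)
        # Inner loop: numerically fit parameters theta for the fixed structure
        theta, loss = calibrator(program, data)
        # Feedback agent diagnoses deviations; Playbook manager curates strategies
        findings = agents["feedback"](loss, blueprint)
        playbook = agents["curate"](playbook, findings, loss)
        if best_loss - loss > 1e-9:
            best, best_loss, stall = (program, theta), loss, 0
        else:
            stall += 1
            if stall >= patience:                # controller halts on plateau
                break
    return best, best_loss
```

Note that the Blueprint is computed exactly once before the loop and never rewritten, while the Playbook is updated on every iteration, mirroring the static/dynamic split of the dual-anchoring design.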
Key Designs¶
- Dual-Anchoring Mechanism (Static Blueprint + Dynamic Playbook):
- Function: Prevents context drift and accumulates validated repair experience.
- Mechanism: The Blueprint \(\mathcal{B}\) is an immutable authoritative specification that strictly defines simulation topology, agent modes, evaluation metrics, etc., and is locked after human expert validation. The Playbook \(\mathcal{K}\) is a dynamic repair strategy pool, where each strategy \(S_i = \langle R_i, I_i, \Sigma_i \rangle\) contains diagnostic content (with metric binding set \(\Lambda_i\)), meta-information (usage/success/failure counts), and a lifecycle state. Strategies undergo state transitions Open→Queued→InProgress→Resolved, with validation success (\(\Delta\mathcal{M}/\mathcal{M}_t > \tau\)), falsification (\(< -\tau\)), or inconclusive status determined via metric-change thresholds.
- Design Motivation: The Blueprint provides immutable constraints to prevent generation from deviating from real data; the Playbook treats repair strategies as hypotheses rather than trusted corrections, verifying or falsifying them through execution evidence to enable self-curation.
- Bi-Level Optimization:
- Function: Decouples structural revision (discrete search) from parameter calibration (continuous optimization).
- Mechanism: The outer loop has the code generation agent search the program space to produce structure \(P_t\) and calibrator \(C_t\): \((P_t, C_t) \leftarrow \pi_{code}(P_{t-1}, \mathcal{B}, \text{Knapsack}(\mathcal{K}))\). The inner loop executes calibrator \(C_t\), using numerical methods such as Bayesian optimization to solve for optimal parameters \(\theta_t^* = \arg\min_\theta \mathcal{L}(\text{Sim}(P_t, \theta), \mathcal{D}_{obs})\). Critically, the LLM does not guess parameter values but instead generates an optimization program that finds them.
- Design Motivation: Ensures that metrics observed by the feedback agent reflect the intrinsic capability of structure \(P_t\) at parameter optimality, filtering out noise from untuned parameters and enabling precise attribution of defects to structural logic.
- Knapsack-Based Context Engineering:
- Function: Selects the most valuable repair strategies within a limited context budget.
- Mechanism: Solves a 0-1 knapsack problem over the Open/Queued pool: \(\max \sum_i v_i x_i\) subject to \(\sum_i c_i x_i \leq L_{budget}\). Strategy value is \(v_i = w_{sev} \cdot U_i^{queue} \cdot \Phi_{rel}(S_i)\), where \(w_{sev}\) is a severity weight, \(U_i^{queue}\) is a backlog reward for starvation prevention, and \(\Phi_{rel}\) is a Bayesian reliability estimate under a Beta-Bernoulli model: \(\Phi_{rel}(S_i) = (s_i+1)/(s_i+f_i+2)\). Prompts are structured in a three-zone layout (system/context/instruction) to counteract the "lost-in-the-middle" effect.
- Design Motivation: Injecting all strategies dilutes attention (experiments confirm that a full 3200-token Playbook degrades performance to near-amnesiac levels); Bayesian reliability causes empirically falsified strategies to naturally decay.
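The strategy-selection step above can be made concrete with a short sketch. It follows the formulas in the text, \(\Phi_{rel}(S_i) = (s_i+1)/(s_i+f_i+2)\) and \(v_i = w_{sev} \cdot U_i^{queue} \cdot \Phi_{rel}(S_i)\), and solves the 0-1 knapsack by standard dynamic programming; the dict field names and the DP formulation are this sketch's assumptions, not the paper's implementation.

```python
# Sketch of Playbook strategy selection under a token budget L_budget.
# Each strategy carries a token cost, a severity weight w_sev, a queue-age
# reward u_queue, and validated success/failure counts (s, f).

def reliability(successes, failures):
    # Beta(1,1)-prior posterior mean of the repair-success probability:
    # (s + 1) / (s + f + 2). Falsified strategies (many failures) decay.
    return (successes + 1) / (successes + failures + 2)

def select_strategies(strategies, budget):
    """Pick the max-value subset of strategies whose total cost fits budget."""
    n = len(strategies)
    values = [s["w_sev"] * s["u_queue"] * reliability(s["s"], s["f"])
              for s in strategies]
    costs = [s["cost"] for s in strategies]
    # Classic 0-1 knapsack DP over the token budget
    dp = [[0.0] * (budget + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for b in range(budget + 1):
            dp[i][b] = dp[i - 1][b]
            if costs[i - 1] <= b:
                dp[i][b] = max(dp[i][b],
                               dp[i - 1][b - costs[i - 1]] + values[i - 1])
    # Backtrack to recover the chosen subset
    chosen, b = [], budget
    for i in range(n, 0, -1):
        if dp[i][b] != dp[i - 1][b]:
            chosen.append(strategies[i - 1]["id"])
            b -= costs[i - 1]
    return list(reversed(chosen))
```

A strategy with many validated successes (high \(\Phi_{rel}\)) outranks a cheap but repeatedly falsified one, so the selection naturally forgets bad hypotheses without deleting them.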
Loss & Training¶
SOCIA-EVO does not perform model training; instead, it optimizes the simulator through iterative evolution. The inner loop uses Optuna Bayesian optimization or a random calibrator. The outer-loop iteration controller halts upon improvement plateaus or regressions to prevent overcorrection. All experiments use GPT-5.1 as the backbone, with means and 95% confidence intervals reported over 5 random seeds.
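The inner-loop idea, solving \(\theta_t^* = \arg\min_\theta \mathcal{L}(\text{Sim}(P_t, \theta), \mathcal{D}_{obs})\) numerically rather than letting the LLM guess \(\theta\), can be sketched with the paper's simpler option, a random calibrator. The exponential-waiting-time "simulator" and the 1-D Wasserstein loss here are illustrative assumptions, not a task from the paper.

```python
# Minimal sketch of the inner-loop calibrator: for a fixed structure P,
# search for theta* = argmin_theta L(Sim(P, theta), D_obs). The paper uses
# Optuna's Bayesian optimization or a random calibrator; this toy uses
# random search, with a 1-D Wasserstein distance as the loss L.
import random

def simulate(theta, n, rng):
    # Toy structure P: inter-event times drawn with rate parameter theta
    return [rng.expovariate(theta) for _ in range(n)]

def wasserstein_1d(xs, ys):
    # 1-D Wasserstein distance between two equal-size samples
    return sum(abs(x - y) for x, y in zip(sorted(xs), sorted(ys))) / len(xs)

def calibrate(observed, trials=200, seed=0):
    rng = random.Random(seed)
    best_theta, best_loss = None, float("inf")
    for _ in range(trials):
        theta = rng.uniform(0.1, 10.0)              # candidate parameter
        loss = wasserstein_1d(simulate(theta, len(observed), rng), observed)
        if loss < best_loss:
            best_theta, best_loss = theta, loss
    return best_theta, best_loss
```

Because the feedback agent only ever sees metrics at (approximate) parameter optimality, a persistent gap to the observed data can be attributed to the structure \(P_t\) rather than to an unlucky \(\theta\).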
Key Experimental Results¶
Main Results¶
Performance comparison across three simulation tasks (mean ± 95% CI, lower is better)
| Method | User Modeling MAE↓ | Mask Simulation RMSE↓ | Mobility N→A WD↓ |
|---|---|---|---|
| Reflexion | 0.17±0.01 | 0.26±0.02 | 0.69±0.02 |
| YuLan-OneSim | 0.21±0.02 | 0.16±0.01 | 0.64±0.01 |
| G-SIM-SBI | 0.19±0.01 | 0.11±0.02 | 0.56±0.02 |
| ACE-OL | 0.14±0.02 | 0.24±0.01 | 0.61±0.02 |
| SOCIA-EVO | 0.11±0.01 | 0.07±0.01 | 0.53±0.02 |
Ablation Study¶
| Configuration | ΔMAE↓ | ΔRMSE↓ | ΔWD↓ |
|---|---|---|---|
| SOCIA-EVO (full) | — | — | — |
| w/o inner-loop calibration | +0.25 | +0.47 | +0.29 |
| w/o Blueprint | +0.20 | +0.37 | +0.23 |
| w/o HITL | +0.18 | +0.34 | +0.21 |
| w/o memory mechanism | +0.14 | +0.30 | +0.20 |
| w/o strategy Playbook | +0.10 | +0.23 | +0.15 |
| Context window +2200 tokens | +0.12 | +0.14 | +0.13 |
Key Findings¶
- Inner-loop parameter calibration is the most critical component—removing it causes RMSE to surge by +0.47, as parameter updates degrade to heuristic LLM guessing.
- Overly large context windows (+2200 tokens loading the full Playbook) cause performance degradation approaching amnesiac levels, validating the "attention dilution effect."
- Cumulative repeated errors decrease by 76% at the fifth iteration, demonstrating that the value-based knapsack mechanism effectively suppresses "whack-a-mole" oscillations.
- Open-source backbones (Llama-3.3-70B, Qwen3-80B) also achieve competitive performance, confirming that the framework is not dependent on any specific proprietary model.
Highlights & Insights¶
- Treating repair strategies as hypotheses requiring verification or falsification rather than as trusted corrections represents a "scientific method" mindset that is highly novel in LLM agent design and is transferable to any agent system requiring long-horizon memory management.
- The core insight of bi-level optimization is elegant: having the LLM generate a program that finds parameter values rather than directly guessing them cleverly leverages LLM strengths in code generation while circumventing their weakness in continuous optimization.
- The knapsack + Bayesian reliability context engineering provides a general solution for maximizing information value within limited context windows.
Limitations & Future Work¶
- The strongest results rely on a commercial LLM backbone (GPT-5.1); while open-source models are viable, a performance gap remains.
- The current framework only addresses tasks expressible through iterative structural revision and bounded parameter calibration; more complex settings (long-horizon planning, strategic multi-agent interaction) may require stronger reasoning modules.
- The HITL validation of the Blueprint, while lightweight, introduces human dependency, and Blueprint quality directly affects all subsequent iterations.
- Strategy matching relies on thresholded text matching; finer-grained semantic matching could improve strategy consolidation quality.
Related Work & Insights¶
- vs. Reflexion: Reflexion uses episodic memory for verbal reinforcement learning but lacks metric-driven repair validation, resulting in RMSE as high as 0.26 on mask simulation vs. SOCIA-EVO's 0.07.
- vs. G-SIM: G-SIM effectively separates structural generation and parameter estimation but lacks long-term evidence tracking, leading to repeated exploration of already-falsified strategies.
- vs. Dynamic Cheatsheet: DC accumulates successful strategies but may introduce negative transfer from irrelevant contexts.
- vs. YuLan-OneSim: YuLan-OneSim generates agent behavior within predefined environments rather than inferring simulator logic from data.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ Formalizes simulator construction as a dual-anchored evolutionary process; the strategy hypothesis verification/falsification mechanism is highly creative.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Three tasks, six baselines, detailed ablations, convergence analysis, and backbone portability verification.
- Writing Quality: ⭐⭐⭐⭐⭐ Rigorous problem formalization, clear framework design, and tight integration of theory and experiments.
- Value: ⭐⭐⭐⭐ Provides a systematic solution for LLM-driven scientific modeling; the dual-anchor + bi-level optimization design paradigm has broad transfer potential.