AutoDiscovery: Open-ended Scientific Discovery via Bayesian Surprise¶
Conference: NeurIPS 2025 arXiv: 2507.00310 Code: https://github.com/allenai/autodiscovery Area: Automated Scientific Discovery Keywords: Bayesian Surprise, Open-ended Discovery, MCTS, Hypothesis Generation, LLM Agent
TL;DR¶
AutoDiscovery proposes Bayesian Surprise as an objective reward signal for open-ended scientific discovery — estimating the KL divergence between prior and posterior belief distributions via LLM sampling, combined with MCTS and progressive widening to explore the hypothesis space. On 21 real-world datasets, the method produces 5–29% more surprising discoveries than greedy/beam search baselines. Human evaluation confirms that Bayesian Surprise aligns with expert "surprise" ratings (0.67), substantially outperforming LLM self-evaluated "novelty" and "usefulness."
Background & Motivation¶
Background: Goal-driven automated scientific discovery requires humans to specify research questions. Open-ended discovery — where the system autonomously explores without predefined objectives — is more ambitious but lacks reliable intrinsic reward signals.
Limitations of Prior Work: (a) Diversity heuristics are insufficient — the hypothesis space is vast, and uniform exploration wastes the evaluation budget. (b) Human proxy metrics ("interestingness," "novelty," "usefulness") are subjective, inconsistent across experts, and unreliable to automate — experiments show that LLM-evaluated "interestingness" is nearly uncorrelated with human "surprise."
Key Challenge: There is no objective, automatically computable reward signal for open-ended discovery that aligns with human scientific intuition.
Goal: To define and implement open-ended scientific discovery driven by Bayesian Surprise.
Key Insight: Bayesian Surprise = KL divergence between posterior and prior beliefs. Experimental evidence that substantially shifts the system's belief in a hypothesis constitutes an interesting finding. Prior/posterior Beta distribution parameters are estimated via LLM sampling.
Core Idea: LLM sampling estimates prior/posterior beliefs → Beta-Bernoulli KL divergence = Bayesian Surprise → serves as a reward signal for MCTS-driven hypothesis space exploration.
Method¶
Overall Architecture¶
- Reward: For hypothesis \(H\), the LLM samples \(n\) times to estimate the prior \(P(\theta_H)\) and posterior \(P(\theta_H|\mathcal{V}_D)\); Beta-Bernoulli fitting yields the KL divergence, i.e., the Bayesian Surprise \(\text{BS}(H, \mathcal{V}_D)\).
- Search: MCTS + progressive widening, with UCT balancing exploration and exploitation. Each iteration: selection → expansion → execution (hypothesis verification) → backpropagation of surprise.
- Agent: Multi-agent architecture comprising a hypothesis generator, experiment programmer, analyst, reviewer, and reviser.
Key Designs¶
- Bayesian Surprise Estimation:
  - Function: Quantifies the degree to which experimental evidence shifts belief in a hypothesis.
  - Mechanism: The LLM samples \(n\) binary (true/false) judgments for hypothesis \(H\); with \(k_{prior}\) "true" responses, the prior is \(P_{est}(\theta_H) = \text{Beta}(1+k_{prior}, 1+n-k_{prior})\). The posterior is estimated analogously after experimental verification. \(\text{BS} = D_{KL}(P_{post} \| P_{prior})\). An additional belief-shift condition requires the expected posterior to cross a threshold \(\delta=0.5\) (i.e., flipping from "likely true" to "likely false" or vice versa).
  - Design Motivation: In information theory, the magnitude of belief change equals information gain, precisely capturing the essence of "surprise." The Beta-Bernoulli model is the simplest conjugate pair, and binary LLM sampling suffices for estimation.
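The estimation above can be sketched end to end in a few lines. The following is a minimal Python sketch, not the authors' code: the vote counts are hypothetical stand-ins for actual LLM samples, and the digamma function needed for the closed-form Beta-to-Beta KL is implemented from the standard recurrence and asymptotic series so the snippet needs only the standard library.

```python
import math

def digamma(x: float) -> float:
    """psi(x) via the recurrence psi(x) = psi(x+1) - 1/x,
    then the asymptotic series once x >= 6."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    result += math.log(x) - 1.0 / (2 * x)
    inv2 = 1.0 / (x * x)
    # ln x - 1/(2x) - 1/(12x^2) + 1/(120x^4) - 1/(252x^6)
    result -= inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))
    return result

def kl_beta(a1: float, b1: float, a2: float, b2: float) -> float:
    """Closed-form D_KL( Beta(a1,b1) || Beta(a2,b2) )."""
    ln_beta = lambda a, b: math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    return (ln_beta(a2, b2) - ln_beta(a1, b1)
            + (a1 - a2) * digamma(a1)
            + (b1 - b2) * digamma(b1)
            + (a2 - a1 + b2 - b1) * digamma(a1 + b1))

def bayesian_surprise(k_prior: int, k_post: int, n: int, delta: float = 0.5):
    """BS = KL(posterior || prior) from n binary votes each,
    plus the belief-shift condition (posterior mean crosses delta)."""
    a_pr, b_pr = 1 + k_prior, 1 + n - k_prior   # Beta(1+k, 1+n-k)
    a_po, b_po = 1 + k_post, 1 + n - k_post
    bs = kl_beta(a_po, b_po, a_pr, b_pr)
    mean_pr = a_pr / (a_pr + b_pr)
    mean_po = a_po / (a_po + b_po)
    shifted = (mean_pr - delta) * (mean_po - delta) < 0  # crossed delta
    return bs, shifted
```

With \(n=10\) votes, a belief flip from 2/10 "true" before the experiment to 9/10 after yields a large surprise with the belief-shift flag set, while a mild move from 2/10 to 3/10 yields a small surprise and no flip.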
- MCTS + Progressive Widening:
  - Function: Efficiently searches the hypothesis space for high-surprise hypotheses.
  - Mechanism: \(\text{UCT}(H) = \frac{\sum_{h \in \text{subtree}(H)} S(h)}{N(H)} + C\sqrt{\frac{2\log N(H_{parent})}{N(H)}}\). Progressive widening limits each node to at most \(kN^\alpha\) children. Four-phase iteration: selection → expansion → execution → backpropagation.
  - Design Motivation: Greedy search becomes trapped in local optima (repeatedly exploring near the first high-surprise hypothesis found); UCT in MCTS balances exploration depth and breadth.
- LLM-based Deduplication (HAC):
  - Function: Merges semantically equivalent hypotheses to avoid redundant evaluation.
  - Mechanism: Text embedding → hierarchical agglomerative clustering (HAC) → each merge decision is adjudicated by GPT-4o (merged if >70% of sampled votes indicate "equivalence").
  - Design Motivation: Semantically identical hypotheses with different phrasings waste the evaluation budget, so deduplication is critical for efficiency.
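A simplified sketch of this pipeline, with heavy substitutions: toy 2-D vectors replace real text embeddings, a token-overlap heuristic stands in for the sampled GPT-4o votes, and merging stops at the first rejected pair (a simplification of full HAC). The threshold values are illustrative.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def votes_equivalent(h1: str, h2: str) -> bool:
    """Stub for the LLM adjudication step: the real system samples GPT-4o
    and merges if >70% of votes say 'equivalent'. Here, token-set Jaccard
    similarity > 0.5 serves as an illustrative stand-in."""
    t1, t2 = set(h1.lower().split()), set(h2.lower().split())
    return len(t1 & t2) / len(t1 | t2) > 0.5

def deduplicate(hypotheses, embeddings, sim_threshold=0.9):
    """Single-linkage agglomerative merging: repeatedly merge the closest
    pair of clusters whose embedding similarity exceeds the threshold AND
    whose representatives the (stubbed) adjudicator deems equivalent."""
    clusters = [[i] for i in range(len(hypotheses))]
    merged = True
    while merged:
        merged = False
        best, best_sim = None, sim_threshold
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                sim = max(cosine(embeddings[i], embeddings[j])
                          for i in clusters[a] for j in clusters[b])
                if sim > best_sim:
                    best, best_sim = (a, b), sim
        if best:
            a, b = best
            if votes_equivalent(hypotheses[clusters[a][0]],
                                hypotheses[clusters[b][0]]):
                clusters[a].extend(clusters.pop(b))
                merged = True
    return clusters
```

Running this on two rewordings of the same claim plus one unrelated claim collapses the rewordings into a single cluster and leaves the unrelated hypothesis alone.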
Loss & Training¶
- No training — purely inference-time search.
- Budget: 500 hypothesis evaluations.
- Evaluated on 21 real-world datasets (5 from DiscoveryBench + 15 from BLADE + 1 from SEA-AD).
Key Experimental Results¶
Main Results (Cumulative Surprise, 500 Iterations)¶
| Search Method | Cumulative Surprise | vs. AutoDiscovery |
|---|---|---|
| Repeated Sampling (baseline) | ~20–25 | −5 to −29% |
| Last-K Linear | ~25–30 | ~−15% |
| Greedy Tree | ~25–30 | ~−25% |
| Beam Search | ~30 | ~−10% |
| AutoDiscovery (MCTS) | 40+ | — |
AutoDiscovery achieves the best performance on 17 out of 21 datasets.
Human Evaluation (1,620 hypotheses the LLM rated as surprising, 3 experts per hypothesis)¶
| Reward Signal | Human Surprise | Human Interestingness | Human Usefulness |
|---|---|---|---|
| Bayesian Surprise | 0.67 | 0.73 | 0.79 |
| LLM Surprise | 0.11 | 0.76 | 0.80 |
| LLM Interestingness | 0.15 | 0.74 | 0.78 |
| LLM Usefulness | 0.21 | 0.73 | 0.78 |
Ablation Study / Validation¶
| Metric | Result |
|---|---|
| Experimental Validity | 98.58% (Gwet's AC1 = 0.97) |
| Implementation Validity | 98.01% (Gwet's AC1 = 0.98) |
| Deduplication Accuracy | 90.76% |
Key Findings¶
- Bayesian Surprise aligns with human "surprise" at 0.67, far exceeding LLM self-evaluation (0.11–0.21) — demonstrating that subjective metrics are unreliable while information-theoretic metrics are robust.
- "Interestingness" and "usefulness" scores are nearly identical across all reward signals (~0.73–0.80), rendering them ineffective as discriminative metrics.
- MCTS search efficiency does not degrade over iterations (unlike greedy/beam search), as UCT automatically balances exploration and exploitation.
- The belief-shift condition is important — it filters out low-quality "surprises" that only marginally adjust beliefs.
Highlights & Insights¶
- Bayesian Surprise is the first successful reward signal for open-ended discovery: All prior attempts (diversity, interestingness, novelty) are either insufficiently objective or cannot be automated reliably.
- Quantitative evidence that LLM subjective evaluation is unreliable: LLM-judged "surprise" correlates with human ratings at only 0.11 — a strong warning about the limitations of LLM-as-Judge.
- Application of MCTS to scientific discovery demonstrates the cross-domain value of search algorithms — from Go to scientific hypothesis spaces.
Limitations & Future Work¶
- Assumes that the LLM's knowledge frontier approximates the human knowledge frontier (an assumption that becomes increasingly tenable as models improve).
- The reasoning process is unsupervised (supervised reasoning could improve sample efficiency in future work).
- Evaluation is limited to data-driven discovery (no wet-lab experiments; limited literature-based discovery).
- Deployment requires academic caution and peer-review safeguards.
Related Work & Insights¶
- vs. MOOSE-Chem/OpenScienceAgent: These are goal-driven discovery systems (requiring a research question as input); AutoDiscovery operates in an open-ended setting.
- vs. Curiosity-driven RL: Curiosity = prediction error; Bayesian Surprise = belief change — the latter is more appropriate for scientific discovery.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ Bayesian Surprise + MCTS-driven open-ended scientific discovery constitutes an entirely new paradigm.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ 21 datasets + 4 search baselines + 1,620-hypothesis human evaluation + agent validity verification.
- Writing Quality: ⭐⭐⭐⭐⭐ Theoretical motivation is clear; experimental design is rigorous.
- Value: ⭐⭐⭐⭐⭐ Potentially opens a new direction for autonomous scientific discovery with LLMs.