MOOSE-Star: Unlocking Tractable Training for Scientific Discovery by Breaking the Complexity Barrier¶
Conference: ICML 2026
arXiv: 2603.03756
Code: https://github.com/ZonglinY/MOOSE-Star (Available)
Area: LLM Reasoning / Scientific Discovery / Decomposed Training
Keywords: Hypothesis generation, inspiration retrieval, hierarchical search, tractable training, TOMATO-Star
TL;DR¶
MOOSE-Star decomposes the problem of "training an LLM to directly generate scientific hypotheses"—originally a search in an \(\mathcal{O}(N^k)\) combinatorial space—into two sequential subtasks: "inspiration retrieval + hypothesis synthesis." By layering hierarchical tree retrieval, bounded composition, and motivation planning, the optimal complexity is reduced from exponential to \(\mathcal{O}(\log N)\). The authors also release the TOMATO-Star dataset, which includes 108,717 papers with decomposed annotations.
Background & Motivation¶
Background: Most work on LLMs for scientific discovery focuses on "how to use LLMs at inference time" or "fine-tuning with external feedback" (e.g., reviewer feedback, rule-based scores, or reward alignment with data). Modeling and training the core conditional distribution \(P(\text{hypothesis}\mid\text{background})\) directly remains largely unexplored.
Limitations of Prior Work: The authors theoretically demonstrate that directly training \(P(h\mid b)\) implicitly requires identifying the correct sequence of \(k\) inspirations within a global scientific literature corpus of \(N\approx 10^7\). The search space is \(\mathcal{O}(N^k)\) (e.g., \(\approx 10^{21}\) for \(N=10^7, k=3\)), creating a "combinatorial complexity wall" that makes end-to-end training mathematically ill-posed.
Key Challenge: Either abandon the direct modeling of \(P(h\mid b)\) (the current feedback-based route) or confront combinatorial complexity (infeasible). Both paths are problematic.
Goal: To directly model \(P(h\mid b)\) while reducing training complexity to a level manageable by modern computing power, and to provide a reproducible dataset and open-source code.
Key Insight: Leveraging the probability decomposition theorem from MOOSE-Chem, \(P(h\mid b)\approx \prod_j P(i_j\mid b,h_{j-1},\mathcal{I})\cdot P(h_j\mid b,h_{j-1},i_j)\) is viewed as a sequence of "inspiration retrieval + incremental synthesis." This decomposition, previously used only for inference, is upgraded in this work to a trainable objective.
Core Idea: The untrainable \(\mathcal{O}(N^k)\) problem is reduced to trainable \(\mathcal{O}(k\cdot N)\) sequential subtasks. Subsequently, hierarchical tree search, bounded composition, and motivation planning are used to further compress the retrieval complexity from \(\mathcal{O}(N)\) to \(\mathcal{O}(\log N)\).
Method¶
Overall Architecture¶
The training framework consists of three layers: (1) Data: 108,717 open papers from 2020–2025 are deconstructed using R1 / R1-distill-Qwen into (research background \(b\), hypothesis \(h\), inspirations \(\{i_j\}\)) triplets. \(h\) is split into \(\Delta h_1,\ldots,\Delta h_k\), each structured into (Motivation, Mechanism, Methodology) layers. (2) Model: \(P(h\mid b)\) is split into "Inspiration Retrieval (IR)" and "Hypothesis Composition (HC)" tasks for Rejection Sampling Fine-Tuning (RFT). IR uses a pool of 1 positive and 14 hard negative samples; HC uses a rubric-based evaluator for rejection sampling. (3) Inference: The literature corpus is organized into a semantic retrieval tree. Motivation variables are used to dynamically prune irrelevant subtrees. HC is trained within a bounded tolerance radius \(M\) to ensure robustness against retrieval errors.
Key Designs¶
-
Decomposed Sequential Training (IR + HC):
- Function: Replaces "end-to-end learning of \(P(h\mid b)\)" with "learning retrieval followed by learning incremental synthesis," repeated \(k\) times.
- Mechanism: Following the chain rule, \(P(h\mid b)\approx \prod_{j=1}^{k} P(i_j\mid b,h_{j-1},\mathcal{I})\cdot P(h_j\mid b,h_{j-1},i_j)\) is decomposed. The IR task involves generatively selecting the most relevant paper from 15 candidates (input: title + abstract; output: CoT reasoning + selection). The HC task involves writing the incremental hypothesis \(\Delta h_j\) given the ground-truth \(i_j\). Both utilize teacher-based RFT. The overall complexity becomes \(\mathcal{O}(k\cdot(N+1))\) instead of \(\mathcal{O}(N^k)\).
- Design Motivation: Converting an exponential Cartesian product into \(k\) linear sums is the key step to bringing the untrainable problem into a trainable domain. IR and HC are clear, supervised, and measurable tasks, making them more stable than globally scoring \(h\).
-
Bounded Composition:
- Function: Ensures the HC model is robust when the retrieved inspiration is not exactly the ground-truth \(i^*\).
- Mechanism: A semantic tolerance neighborhood \(\mathcal{I}_{i^*}\subset\mathcal{I}\) of size \(M\) centered at \(i^*\) is defined. During training, "approximate inspirations" are sampled from this neighborhood to feed the HC model, teaching it to synthesize valid \(\Delta h_j\) from adjacent concepts. This relaxes the retrieval precision requirement from "1/N exact match" to "1/(N/M) fuzzy match," further compressing the effective search space for IR.
- Design Motivation: Even if hierarchical tree retrieval reaches \(\mathcal{O}(\log N)\), the final layer may not be perfectly accurate. Bounded composition explicitly models "retrieval error" into the training distribution, similar to noise-aware training, preventing pipeline collapse under real-world noise.
-
Motivation-Guided Hierarchical Search:
- Function: Replaces "linear scanning of \(N\) papers" with "top-down \(\log N\) steps through a semantic tree," using motivation variables for pruning.
- Mechanism: Literature is clustered into a semantic retrieval tree. At each step, the most relevant branch is selected among children nodes; ideally, search depth is \(\mathcal{O}(\log N)\). An explicit motivation variable \(m\) (from the Motivation layer of \(\Delta h\)) is attached to the background, acting as a "generative root" to dynamically prune subtrees irrelevant to the current target, reducing the searchable space from \(N\) to \(N_m\ll N\).
- Design Motivation: A semantic tree alone only reduces retrieval steps, but "which subtree to search" remains open. The motivation variable provides a generative direction control signal, allowing the model to truly scale during inference.
Loss & Training¶
Both IR and HC move through Rejection Sampling Fine-Tuning (RFT) with CoT supervision: for each sample, \(N\) CoT trajectories are sampled, and low-quality ones are filtered via a rubric evaluator. The HC rubric checks the Motivation, Mechanism, and Methodology layers. The TOMATO-Star dataset was archived only after passing four automated quality checks: necessity, sufficiency, mutual exclusivity, and non-redundancy.
Key Experimental Results¶
Main Results¶
| Dimension | Configuration | Highlight |
|---|---|---|
| Data Scale | 108,717 open papers, 38,400 GPU hours | Training: 2020-09 to 2025; Test: 2025-10 (No temporal leakage) |
| Complexity (Worst → Best) | \(\mathcal{O}(N^k)\) → \(\mathcal{O}(\log N)\) | Compressed via IR/HC decomposition + tree retrieval + motivation pruning |
| Test-time scaling | brute-force vs. MOOSE-Star | Brute-force hits the "complexity wall" quickly in multi-inspiration tasks; MOOSE-Star success rate rises with computation budget |
| Inspiration Hit Rate | IR vs. random/nearest neighbor | Generative selection + CoT supervision significantly outperforms baselines (see § F) |
Ablation Study¶
| Configuration | Key Metric | Description |
|---|---|---|
| W/O Bounded Composition (\(M=1\)) | HC sensitivity to noise increases | Validates the need for synthesis under imperfect retrieval |
| W/O Motivation Variable | Tree search path lengthens | Motivation is key to effective pruning efficacy |
| End-to-end \(P(h\mid b)\) (baseline) | Fails to converge | § 7.1 shows distillation of reasoning traces directly for \(b\to h\) is infeasible |
| Brute-force test-time sampling | Hits "complexity wall" | Highlights the scaling capability of the hierarchical search in MOOSE-Star |
Key Findings¶
- The failure of direct \(P(h\mid b)\) training stems from the implicit combinatorial search space being too large, not from a lack of data or model size. This serves as a fundamental critique of the "feedback-driven discovery" route.
- Decomposing an exponential problem into hierarchical, tolerant, and pruned steps is a transferable paradigm: solving one complexity bottleneck at a time allows reduction from \(N^k\) to \(\log N\).
- The structure of TOMATO-Star—(b, h, i) + (Motivation, Mechanism, Methodology)—represents an upgrade in discovery dataset design, moving beyond "abstract-style hypotheses."
- Strict temporal splitting (test sets after 2025-10) prevents pre-training contamination, a practice recommended for future LLM evaluation.
Highlights & Insights¶
- Provides a rigorous complexity argument (\(\mathcal{O}(N^k)\) vs \(\mathcal{O}(\log N)\)) for why \(P(h|b)\) is hard to train, upgrading "LLM for science" from engineering papers to a theoretically grounded research direction.
- Reinterpreting inference-time probability decomposition as a training objective is the most significant leap since MOOSE-Chem, conceptually similar to breaking the Bellman equation into TD-updates in RL.
- Bounded Composition relaxes the "retrieval-synthesis" alignment to "neighborhood tolerance," an engineering posture toward real-world retrieval noise that aligns with RAG practices.
- By providing 108k structured papers, a full training/inference suite, and models, the work sets a high bar for reproducibility in scientific discovery.
Limitations & Future Work¶
- The system still relies on "author citations = ground-truth inspirations," which may bias toward explicitly cited influences and ignore "uncited but profoundly influential" inspirations.
- The 1 positive + 14 negative IR setting is a limited approximation; the real library is much larger. Hierarchical search lacks a self-correction mechanism if the root node is misidentified.
- The tolerance radius \(M\) in Bounded Composition is a hyperparameter; too small leads to exact match issues, while too large risks uncontrolled generalization. No systematic strategy for choosing \(M\) is provided.
- Validation was primarily in biology, chemistry, and medicine. Effectiveness in fields like ML/CS, with denser citation structures and shorter inspiration chains, remains to be verified.
Related Work & Insights¶
- vs MOOSE-Chem (Yang et al., 2025b): MOOSE-Chem used probability decomposition only at inference. This work converts that decomposition into a training objective, a key transition from "inference tool" to "training paradigm."
- vs Feedback-driven training: Prior works fine-tune hypotheses based on reviewer or data feedback without addressing the core \(P(h\mid b)\) distribution. This is the first work to train that distribution directly.
- vs O'Neill et al. (2025): Attempts to model \(P(h\mid b)\) via distillation, which § 7.1 of this paper argues is infeasible due to the difficulty of replicating reasoning traces.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ Rigorous argument of combinatorial complexity as the root cause of training failure, providing an executable roadmap to \(\log N\). Rare mix of theoretical and engineering novelty.
- Experimental Thoroughness: ⭐⭐⭐⭐ Solid data scale and GPU investment (38,400 A800 hours), though comparative experiments lean qualitative and lack some hard benchmarks.
- Writing Quality: ⭐⭐⭐⭐ Clear complexity derivations and logical causality; well-explained transitions across complexity levels.
- Value: ⭐⭐⭐⭐⭐ Provides the framework, dataset (108k TOMATO-Star), code, and models, making it a de facto baseline for "LLM-for-discovery."