scPilot: Large Language Model Reasoning Toward Automated Single-Cell Analysis and Discovery

Conference: NeurIPS 2025 arXiv: 2602.11609 Code: https://github.com/maitrix-org/scPilot Area: Interpretability Keywords: single-cell RNA-seq, LLM reasoning, omics-native reasoning, cell-type annotation, trajectory inference

TL;DR

This work proposes the scPilot framework and scBench benchmark, enabling LLMs to perform "omics-native reasoning" (ONR) directly on single-cell RNA-seq data—reading marker genes, forming hypotheses, invoking tools for verification, and iteratively refining conclusions—achieving an 11% improvement in cell-type annotation accuracy and a 30% reduction in trajectory inference graph-edit distance.

Background & Motivation

Background: Single-cell RNA-seq analysis relies on fixed pipelines (Scanpy, Seurat), with large amounts of implicit expert reasoning (e.g., differential genes → cell-type judgment) remaining unautomated. Existing LLM applications treat LLMs merely as "code generators" that invoke pre-existing tools.

Limitations of Prior Work: (a) Single-cell foundation models (e.g., scGPT) embed gene expression into vector spaces, sacrificing interpretability; (b) LLM code agents only wrap tools with default parameters without performing biological reasoning; (c) the biological logic underlying the analysis process is opaque.

Key Challenge: Single-cell analysis demands extensive expert reasoning (identifying cell types from marker genes, inferring developmental relationships from lineage trajectories), yet existing automated tools perform computation without reasoning.

Goal: To enable LLMs not merely to invoke tools, but to interpret data, formulate hypotheses, gather evidence, and iteratively refine conclusions in the manner of a biologist.

Key Insight: Define the omics-native reasoning (ONR) paradigm—LLMs receive textual summaries of single-cell data, perform explicit reasoning, invoke tools to obtain numerical evidence, and iterate until biological conclusions are reached.

Core Idea: Single-cell analysis is formalized as a natural language reasoning problem, where LLMs produce (claim, action) pairs at each step, constituting a dual-track "verbal + computational" proof.

Method

Overall Architecture

Three core components: (1) Problem-to-Text Converter \(\mathcal{C}\): compresses expression matrices of \(10^5\)–\(10^6\) cells into LLM-digestible textual summaries (e.g., cluster sizes, top-\(k\) marker genes); (2) Bio-Tool Library \(\mathcal{T}\): encapsulates Scanpy, Monocle, pySCENIC, and other tools as callable structured APIs; (3) LLM Reasoner \(\mathcal{R}_\phi\): uses reasoning LLMs such as o1 or Gemini as the core, executing closed-loop reasoning \(\mathbf{X} \to \text{Prompt} \to \{(\text{Thought}_k, \text{Call}_k)\}_{k=1}^K \to \hat{y}\).
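The closed loop above can be sketched in a few lines. This is our own minimal stand-in, not the scPilot API: `summarize`, `call_tool`, and `llm_step` are hypothetical names, and the "LLM" is a stub that replays the paper's NK-vs-T-cell example.

```python
# Minimal sketch of the closed reasoning loop: summarize -> (claim, action)
# -> tool call -> updated state, repeated until the reasoner stops.
# All function names here are illustrative stand-ins, not scPilot's API.

def summarize(state):
    """Problem-to-Text converter C: compress the data state to text."""
    return f"clusters={state['clusters']}, top_markers={state['markers']}"

def call_tool(name, state):
    """Bio-tool library T: an action returns numerical evidence as new state."""
    if name == "check_nk_markers":
        state = dict(state, evidence="NKG7 low, CD3D/CD3E high")
    return state

def llm_step(prompt, k):
    """LLM reasoner R_phi stub: emits one (claim, action) pair per step."""
    if k == 0:
        return ("cluster 5 expresses CD3D/CD3E, likely T cells",
                "check_nk_markers")
    return ("NK markers absent; conclude T cells", None)  # None = stop

def onr_loop(state, max_steps=5):
    trace = []
    for k in range(max_steps):
        claim, action = llm_step(summarize(state), k)
        trace.append((claim, action))
        if action is None:                 # reasoner reached a conclusion
            break
        state = call_tool(action, state)   # S_k = o_k(S_{k-1})
    return trace, state
```

The returned `trace` is the dual-track "verbal + computational" proof: each entry pairs a natural-language claim with the tool call that tested it.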

Key Designs

  1. Omics-Native Reasoning (ONR) Formalization:

    • Function: Defines bioinformatics analysis tasks as a reasoning sequence \(\mathcal{R} = [(c_1,o_1), \ldots, (c_K,o_K)]\)
    • Mechanism: At each step, the LLM produces a natural language claim \(c_k\) (e.g., "cluster 5 highly expresses CD3D and CD3E, likely T cells") and an action \(o_k\) (e.g., "check NK cell marker genes"), with each action updating the data state \(S_k = o_k(S_{k-1})\)
    • Design Motivation: The key distinction from code agents—the reasoning trace constitutes an auditable biological argument, not merely code and output
  2. Problem-to-Text Compression:

    • Function: Compresses million-scale cell matrices into text processable within an LLM context window
    • Mechanism: Task-specific compression strategies are designed: Leiden clustering + top-10 marker genes for cell annotation; PAGA graph + pseudotime for trajectory inference; top-150 TF–gene pairs for GRN
    • Design Motivation: Preserving biologically significant information while drastically reducing dimensionality, enabling LLMs to operate in the text domain
  3. scBench Benchmark:

    • Function: Covers nine datasets across three major tasks (cell-type annotation, trajectory inference, gene regulatory network prediction)
    • Mechanism: Each task includes expert-verified ground truth and automated evaluation metrics (accuracy, graph-edit distance, AUROC)
    • Design Motivation: Existing single-cell benchmarks evaluate only embedding quality or numerical metrics, without assessing the biological validity of reasoning

Loss & Training

scPilot is a training-free framework that does not fine-tune LLMs. All reasoning capabilities derive from prompt engineering and iterative reasoning strategies. Core design principles: (a) biological context priority; (b) iterative reasoning; (c) minimal hand-crafted heuristics.
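Since everything rests on prompt design, a sketch of how a biological-context-first, iterative prompt might be assembled may help; the template wording below is our own paraphrase of the stated principles, not the paper's actual prompt.

```python
# Illustrative prompt assembly for the training-free setting: biological
# context first (principle a), earlier hypotheses fed back for refinement
# (principle b). Template text is our paraphrase, not scPilot's prompt.

def build_prompt(species, tissue, cluster_summary, round_k, prior_claims=()):
    parts = [
        f"You are annotating single-cell RNA-seq data from {species} {tissue}.",
        "Cluster summaries (top marker genes per cluster):",
    ]
    for cluster, markers in cluster_summary.items():
        parts.append(f"  cluster {cluster}: {', '.join(markers)}")
    if prior_claims:  # iterative reasoning: revisit earlier hypotheses
        parts.append("Your earlier hypotheses were:")
        parts.extend(f"  - {c}" for c in prior_claims)
        parts.append("Re-examine them against the evidence and refine.")
    parts.append(f"(round {round_k}) State a claim and, if needed, one tool call.")
    return "\n".join(parts)

prompt = build_prompt(
    "human", "PBMC",
    {0: ["CD3D", "CD3E"], 1: ["NKG7", "GNLY"]},
    round_k=2, prior_claims=["cluster 1 may be T cells"],
)
```

Note how species/tissue leads the prompt: the ablation below shows that removing this biological context causes a significant accuracy drop.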

Key Experimental Results

Main Results

Task                   Dataset       scPilot (o1)       Best Baseline             Gain
Cell annotation        PBMC3k        ~0.76              CellTypist (0.563)        +35%
Cell annotation        Liver         ~0.50              CellTypist (0.464)        +8%
Cell annotation        Retina        ~0.49              CellTypist (0.388)        +26%
Trajectory inference   Pancreas      GED reduced 30%    Traditional pipeline      Gemini-2.5-Pro best
GRN prediction         Multi-organ   AUROC +0.03        pySCENIC direct output    Iterative reasoning gain
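The graph-edit-distance metric behind the trajectory result can be illustrated with a simplified version: if the nodes (named cell states) are fixed, the edit cost reduces to edge insertions plus deletions, i.e. the symmetric difference of the edge sets. The paper's exact GED may also count node edits; the state names below follow the standard pancreas endocrinogenesis lineage and are for illustration only.

```python
# Simplified graph-edit distance for trajectory graphs: with a shared node
# set, edit cost = edges to delete from the prediction + edges to insert,
# i.e. |pred_edges symmetric-difference true_edges| (undirected).

def edge_ged(pred_edges, true_edges):
    norm = lambda edges: {frozenset(e) for e in edges}  # ignore direction
    return len(norm(pred_edges) ^ norm(true_edges))

true_traj = [("Ductal", "Ngn3-low"), ("Ngn3-low", "Ngn3-high"),
             ("Ngn3-high", "Fev+"), ("Fev+", "Beta")]
pred_traj = [("Ductal", "Ngn3-low"), ("Ngn3-low", "Ngn3-high"),
             ("Ngn3-high", "Beta")]        # skips the Fev+ intermediate
print(edge_ged(pred_traj, true_traj))      # -> 3
```

Skipping one intermediate state costs three edits (delete the shortcut edge, insert the two true edges), which is why GED penalizes structurally wrong lineages more heavily than a simple edge-accuracy score would.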

Ablation Study

Configuration                      Effect                                           Note
Direct prompting (no iteration)    Baseline                                         Single-pass reasoning
Iterative reasoning (2–3 rounds)   +11% avg. accuracy                               Iterative hypothesis refinement
Without biological context         Significant drop                                 Species/tissue information is critical
LLM comparison                     o1 best for annotation, Gemini for trajectory    Task-specific LLM capability

Key Findings

  • Iterative reasoning is critical—LLMs frequently err in the first round (e.g., confusing NK and T cells) but self-correct in the second round upon examining additional marker genes
  • LLMs can identify potential issues in expert annotations—in some cases, scPilot's reasoning is more consistent than the original labels
  • Different LLMs excel at different tasks: o1's strong reasoning suits annotation, while Gemini's large context window suits trajectory inference
  • Reasoning traces are highly interpretable—biologists can audit the logic at every step

Highlights & Insights

  • Paradigm Shift: From "LLM invoking tools" to "LLM performing biological reasoning." scPilot automates the expert thought process, not merely the pipeline
  • Scientific Value of Reasoning Traces: Generated traces expose marker gene ambiguity, tissue-specific expression patterns, and other insights of independent analytical value to biologists
  • Generalizable ONR Framework: The same "data → textual summary → LLM reasoning → tool verification" paradigm is transferable to proteomics, metabolomics, and other omics domains

Limitations & Future Work

  • Performance depends on the quality of Problem-to-Text compression—information loss may introduce reasoning bias
  • Current coverage is limited to three core tasks; spatial transcriptomics, multi-omics integration, and related areas are not addressed
  • Full reliance on the LLM's biological knowledge may be insufficient for very novel or rare cell types
  • Multi-round LLM reasoning per analysis incurs substantial computational cost (o1 API expenses)
Comparison with Related Work

  • vs. CellAgent: CellAgent instructs LLMs to write code invoking Scanpy but performs no biological reasoning; scPilot interprets differential genes and formulates hypotheses
  • vs. scGPT/Geneformer: These models operate in vector space and produce no natural language reasoning; scPilot's traces constitute auditable arguments
  • vs. GPTCellType: Handles only cell-type annotation via single-pass direct prompting; scPilot covers multiple tasks with iterative reasoning

Rating

  • Novelty: ⭐⭐⭐⭐⭐ The formal definition and systematic framework of "omics-native reasoning" is entirely novel, opening a new paradigm for LLMs in computational biology
  • Experimental Thoroughness: ⭐⭐⭐⭐ Nine datasets, three tasks, and multiple LLMs and baselines, though gains on the GRN task are modest
  • Writing Quality: ⭐⭐⭐⭐ Framework description is clear, though the mathematical formalization is somewhat over-notated
  • Value: ⭐⭐⭐⭐⭐ Potentially transformative for the computational biology community—demonstrating the feasibility of LLMs as "scientific reasoning partners" rather than "code generators"