VCWorld: A Biological World Model for Virtual Cell Simulation¶

Conference: ICLR 2026
arXiv: 2512.00306
Code: N/A
Area: Interpretability / AI for Science
Keywords: Virtual Cell, world model, LLM Reasoning, Signaling Cascade, Drug Perturbation

TL;DR¶

This paper proposes VCWorld, a cell-level white-box simulator that integrates structured biological knowledge graphs with the iterative reasoning capabilities of large language models (LLMs) to simulate drug perturbation-induced signaling cascades in a data-efficient manner. The framework generates interpretable step-by-step predictions and explicit mechanistic hypotheses, achieving state-of-the-art performance on drug perturbation benchmarks.

Background & Motivation¶

Background: Virtual cell modeling is a frontier direction in computational biology that aims to predict cellular responses under various perturbations (drug treatment, gene knockout, etc.). This is critical for drug discovery, disease mechanism understanding, and precision medicine. In recent years, deep learning models such as scGPT and GEARS have achieved notable progress by learning mappings between perturbations and gene expression from large-scale single-cell RNA-seq data.

Limitations of Prior Work: (1) Heavy data dependence — existing models rely heavily on large-scale, high-quality single-cell datasets, which are costly to acquire and limited in coverage; (2) Limited generalization — data quality, coverage, and batch effects jointly constrain model generalization to novel cell types and perturbation conditions; (3) Black-box problem — end-to-end trained models only output gene expression predictions without providing mechanistic explanations of how perturbations propagate within cells.

Key Challenge: A fundamental conflict exists between the scientific demand for interpretability and mechanistic consistency and the black-box nature of deep learning models. Predictions lacking mechanistic explanation are difficult to validate in scientific research and cannot genuinely advance biological understanding. Even when numerical predictions are accurate, researchers cannot extract verifiable biological hypotheses from them.

Goal: VCWorld departs from the paradigm of data-driven end-to-end fitting and instead combines structured biological knowledge (e.g., protein–protein interaction networks, signaling pathway maps) with LLM prior knowledge acquired through training on biomedical literature. Rather than learning a black-box mapping from \(\text{perturbation} \to \text{gene expression}\), the model explicitly simulates the signaling cascade propagation from target proteins to downstream gene expression changes, producing traceable mechanistic pathways at each reasoning step.

Method¶

Overall Architecture¶

VCWorld frames virtual cell simulation as a Biological World Model. The core pipeline is:

Input: Drug/perturbation information + structured biological knowledge graphs extracted from public databases
Reasoning: LLM-driven iterative reasoning that simulates the cascade propagation of perturbations through cellular signaling networks
Output: Step-by-step interpretable gene expression predictions + explicit signaling pathway mechanistic hypotheses

In contrast to the conventional end-to-end paradigm \(f_\theta(\text{perturbation}, \text{cell\_type}) \to \Delta \text{gene\_expression}\), VCWorld's reasoning chain is:

\[\text{Drug} \xrightarrow{\text{Target Identification}} \text{Target Protein} \xrightarrow{\text{Pathway Propagation}} \text{Signaling Cascade} \xrightarrow{\text{Transcriptional Regulation}} \text{Gene Expression Change}\]
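
The chain above can be read as a typed trace rather than a single function output. The following is a minimal sketch under that reading, not the paper's implementation: the stage labels come from the arrows in the chain, while the class names, fields, and the entities `DrugX`, `ProteinA`, `PathwayC`, and `GeneD` are hypothetical placeholders.

```python
from dataclasses import dataclass, field
from typing import List

# Stage labels are taken from the arrows in the chain above; everything else
# (class names, fields, example entities) is a hypothetical illustration.
STAGES = ("Target Identification", "Pathway Propagation", "Transcriptional Regulation")


@dataclass
class Step:
    stage: str
    source: str
    target: str


@dataclass
class Trace:
    steps: List[Step] = field(default_factory=list)

    def add(self, stage: str, source: str, target: str) -> None:
        if stage not in STAGES:
            raise ValueError(f"unknown stage: {stage}")
        # A stage may repeat (several propagation hops) but must not go backwards.
        if self.steps and STAGES.index(stage) < STAGES.index(self.steps[-1].stage):
            raise ValueError("stages must follow the chain order")
        self.steps.append(Step(stage, source, target))

    def render(self) -> str:
        # One line per step, so every intermediate claim stays inspectable.
        return "\n".join(f"[{s.stage}] {s.source} -> {s.target}" for s in self.steps)


trace = Trace()
trace.add("Target Identification", "DrugX", "ProteinA")
trace.add("Pathway Propagation", "ProteinA", "PathwayC")
trace.add("Transcriptional Regulation", "PathwayC", "GeneD")
print(trace.render())
```

Unlike an end-to-end mapping, which returns only the final expression change, a trace of this kind keeps each intermediate step available for inspection.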

Key Design 1: Structured Biological Knowledge Integration¶

The white-box nature of VCWorld is grounded in structured knowledge:

Knowledge sources: Multi-level biological knowledge — including protein–protein interaction (PPI) networks, signaling pathway topologies, and gene regulatory relationships — is extracted from public databases such as KEGG, Reactome, and STRING.
Knowledge representation: These relationships are structured into LLM-processable formats (textualized pathway descriptions or graph representations), enabling the LLM to query and leverage these prior constraints during reasoning.
Design Motivation: Rather than training an end-to-end model purely on data, decades of accumulated biological knowledge are explicitly injected into the reasoning process, ensuring that every prediction step has a clear biological basis.
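
One way to picture the "textualized pathway description" format is to select the subgraph around a drug target and serialize its edges as plain sentences for the LLM prompt. The sketch below assumes a simple triple representation; the edge list, entity names, and function names are hypothetical and are not drawn from KEGG, Reactome, or STRING.

```python
from typing import List, Tuple

Edge = Tuple[str, str, str]  # (source, relation, target)

# Hypothetical toy edges; real entries would come from curated database exports.
TOY_EDGES: List[Edge] = [
    ("ProteinA", "phosphorylates", "ProteinB"),
    ("ProteinB", "activates", "TF_C"),
    ("TF_C", "upregulates", "GeneD"),
    ("ProteinX", "binds", "ProteinY"),
]


def neighborhood(edges: List[Edge], seed: str, max_hops: int) -> List[Edge]:
    """Collect edges reachable from `seed` within `max_hops`, breadth-first."""
    frontier = {seed}
    seen = {seed}
    selected: List[Edge] = []
    for _ in range(max_hops):
        next_frontier = set()
        for edge in edges:
            src, _rel, dst = edge
            if src in frontier:
                if edge not in selected:
                    selected.append(edge)
                if dst not in seen:
                    seen.add(dst)
                    next_frontier.add(dst)
        frontier = next_frontier
    return selected


def textualize(edges: List[Edge]) -> str:
    """Render edges as one bullet per relation, ready to paste into a prompt."""
    return "\n".join(f"- {src} {rel} {dst}" for src, rel, dst in edges)


context = textualize(neighborhood(TOY_EDGES, "ProteinA", max_hops=2))
print(context)
```

Limiting the hop count keeps the prompt focused on the part of the network the perturbation can plausibly reach; the unrelated `ProteinX` edge is never included.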

Key Design 2: LLM-Driven Iterative Signaling Cascade Reasoning¶

The core reasoning engine of VCWorld employs an LLM to simulate perturbation propagation within cells:

Iterative reasoning process: The model incrementally infers the signal transduction chain — drug binding to target protein → activation/inhibition of downstream signaling pathways → modulation of transcription factor activity → upregulation/downregulation of specific genes.
LLM as reasoning engine: Having been trained on vast biomedical literature, the LLM implicitly encodes rich molecular biology knowledge. VCWorld leverages this implicit knowledge to "complete" missing interaction relationships in the knowledge graph and to evaluate the plausibility of each signal transduction path.
Traceable mechanistic pathways: Each reasoning step produces explicit causal hypotheses (e.g., "Drug X inhibits Protein A → Protein A cannot phosphorylate Protein B → Pathway C is blocked → Gene D is downregulated"), providing direct leads for downstream experimental validation.
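
The iterative process described above can be sketched as a loop that asks a proposer for the next hop and tracks whether activity is pushed up or down. This is an illustration under strong assumptions: the LLM call is replaced by a rule-based stub (`stub_proposer`) over a hypothetical toy graph, and the sign arithmetic (+1 activating, -1 inhibiting) is a simplification introduced here, not the paper's mechanism.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Toy graph mirroring the example chain in the text; +1 = activates, -1 = inhibits.
GRAPH: Dict[str, List[Tuple[str, int]]] = {
    "DrugX": [("ProteinA", -1)],
    "ProteinA": [("ProteinB", +1)],
    "ProteinB": [("PathwayC", +1)],
    "PathwayC": [("GeneD", +1)],
}

Proposer = Callable[[str, List[str]], Optional[Tuple[str, int]]]


def stub_proposer(current: str, history: List[str]) -> Optional[Tuple[str, int]]:
    """Stand-in for an LLM call: returns the first unvisited neighbour, if any."""
    for nxt, sign in GRAPH.get(current, []):
        if nxt not in history:
            return nxt, sign
    return None


def simulate(drug: str, propose: Proposer, max_steps: int = 10):
    """Walk the cascade one hop at a time, recording a readable trace."""
    history = [drug]
    current = drug
    state = +1  # net effect on the current node relative to baseline
    trace: List[str] = []
    for _ in range(max_steps):
        proposal = propose(current, history)
        if proposal is None:
            break
        nxt, sign = proposal
        state *= sign  # an inhibited activator yields reduced downstream activity
        verb = "activates" if sign > 0 else "inhibits"
        direction = "up" if state > 0 else "down"
        trace.append(f"{current} {verb} {nxt}; net effect on {nxt}: {direction}")
        history.append(nxt)
        current = nxt
    return current, state, trace


final_node, final_state, steps = simulate("DrugX", stub_proposer)
for line in steps:
    print(line)
```

Passing the proposer in as a callable keeps the loop independent of any particular model, so the stub could be swapped for a real LLM client without changing the trace logic.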

Key Design 3: Data-Efficient Operation¶

The knowledge-driven paradigm of VCWorld substantially reduces the requirement for training data:

Conventional methods require large-scale paired (perturbation condition, gene expression change) data to train end-to-end mappings.
VCWorld's core reasoning capability derives from structured knowledge graphs combined with LLM prior knowledge; training data are primarily used for calibration and validation.
This property allows VCWorld to maintain effective predictions in data-scarce scenarios, such as rare cell types and novel drug perturbations.
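
The summary does not say how calibration is performed. As one hypothetical reading, a small labeled set could be used only to choose a decision threshold on the reasoner's confidence scores, rather than to fit a model. The scores, labels, and function names below are invented for illustration.

```python
from typing import List, Tuple

Scored = List[Tuple[float, bool]]  # (confidence that a gene changes, observed change)


def accuracy_at(threshold: float, scored: Scored) -> float:
    """Fraction of examples where `score >= threshold` matches the label."""
    correct = sum((score >= threshold) == label for score, label in scored)
    return correct / len(scored)


def calibrate_threshold(scored: Scored) -> float:
    """Pick the observed score that maximizes accuracy; ties go to the lower threshold."""
    candidates = sorted({score for score, _ in scored})
    return max(candidates, key=lambda t: (accuracy_at(t, scored), -t))


# Hypothetical calibration set with only six labeled examples.
calibration_set: Scored = [
    (0.9, True), (0.8, True), (0.6, False),
    (0.4, True), (0.3, False), (0.1, False),
]

best = calibrate_threshold(calibration_set)
print(best, accuracy_at(best, calibration_set))
```

Only a single scalar is fitted here, which is why a handful of labeled examples can suffice under this reading.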

Key Experimental Results¶

Drug Perturbation Benchmark¶

Method	Type	Core Feature	Prediction Accuracy
scGPT	Data-driven	Large-scale pretraining + fine-tuning	Baseline level
GEARS	Data-driven	Graph neural network modeling of gene relationships	Moderate
Multi-source fusion methods	Data-driven	Integration of multi-omics data	Limited improvement
VCWorld (Ours)	Knowledge + LLM Reasoning	White-box, interpretable	SOTA

VCWorld achieves state-of-the-art performance on drug perturbation prediction benchmarks and is the only method among those compared that provides complete mechanistic explanations.

Ablation Study¶

Configuration	Performance	Notes
Remove structured knowledge	Significant degradation	LLM internal knowledge alone is insufficient for reliable reasoning
Remove iterative reasoning	Performance drop	Single-step prediction loses stepwise propagation information of signaling cascades
Remove LLM reasoning	Large performance drop	Knowledge graphs alone cannot handle knowledge gaps
Full VCWorld	Best	Synergistic effect of structured knowledge + LLM reasoning

Key Findings¶

Mechanistic consistency: Signaling pathways inferred by VCWorld show high agreement with evidence in published biological literature, validating the biological plausibility of the reasoning process.
Interpretability advantage: Each prediction is accompanied by a complete signaling cascade pathway, enabling researchers to inspect reasoning logic step by step and identify potential errors.
Data efficiency: Performance under limited training data surpasses data-driven baselines that rely on large-scale datasets.

Highlights & Insights¶

The white-box simulator concept moves beyond the "prediction accuracy first" paradigm in AI for Science: in scientific research, a moderately accurate prediction with a sound mechanistic explanation is often more valuable than a high-accuracy prediction that cannot be explained.
LLM as a "biological reasoning engine" is an elegant design — having been trained on vast biomedical literature such as PubMed, LLMs implicitly encode extensive intermolecular relationships and biological principles; VCWorld converts this implicit knowledge into explicit reasoning capability.
The "world model" perspective elevates cellular response prediction from statistical fitting to causal simulation — given an initial perturbation condition, the model can "rehearse" the dynamic response process of a cell.
Cross-domain methodological inspiration: The paradigm of combining LLM reasoning with domain-specific knowledge graphs can be extended to other scientific domains such as materials science and chemical reaction prediction.

Limitations & Future Work¶

LLM hallucination risk: LLMs may generate reasoning chains that appear plausible but are biologically incorrect; additional verification mechanisms are needed to filter unreliable inferences.
Incomplete knowledge graph coverage: Databases such as KEGG and Reactome do not yet capture many signaling relationships, including ones that remain undiscovered, and model performance may degrade in regions with such knowledge gaps.
Reasoning efficiency: The computational cost of iteratively invoking the LLM for step-by-step reasoning is significantly higher than end-to-end forward inference.
Perturbation type coverage: Current validation primarily focuses on drug perturbations; generalization to other perturbation types such as gene knockout and overexpression remains to be verified.
Single-cell-level heterogeneity: Significant intercellular heterogeneity exists within the same cell type, and the current framework offers limited modeling of this variability.
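
For the hallucination risk noted above, one simple form a verification mechanism could take is an audit that marks each step of a generated chain as supported or unsupported by curated edges. This is a hypothetical sketch, not something the paper is stated to implement; the edge set and names are placeholders.

```python
from typing import List, Set, Tuple

Edge = Tuple[str, str]  # (source, target)


def audit_chain(chain: List[Edge], known_edges: Set[Edge]) -> List[Tuple[Edge, bool]]:
    """Pair every proposed step with whether a curated edge backs it."""
    return [(edge, edge in known_edges) for edge in chain]


def unsupported(chain: List[Edge], known_edges: Set[Edge]) -> List[Edge]:
    """Steps that rest on model knowledge alone and deserve manual review."""
    return [edge for edge, ok in audit_chain(chain, known_edges) if not ok]


# Hypothetical curated edges and a hypothetical LLM-proposed chain.
KNOWN: Set[Edge] = {("DrugX", "ProteinA"), ("ProteinA", "ProteinB")}
proposed: List[Edge] = [
    ("DrugX", "ProteinA"),
    ("ProteinA", "ProteinB"),
    ("ProteinB", "GeneD"),  # not in the curated set
]

print(unsupported(proposed, KNOWN))
```

An unsupported step is not necessarily wrong, since the LLM is also meant to fill gaps in the knowledge graph; the audit only separates database-backed steps from those that need further evidence.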

Related Work & Insights¶

vs. scGPT / GEARS: These are end-to-end data-driven approaches whose prediction accuracy depends on data scale and which cannot provide mechanistic explanations; VCWorld instead relies on structured knowledge and LLM reasoning to gain interpretability and data efficiency.
vs. Virtual Cell Initiative (CZI): A virtual cell research initiative driven by the Chan Zuckerberg Initiative; VCWorld provides a complementary technical approach from the perspective of a "white-box world model."
vs. GeneGPT / BioGPT: Early applications of LLMs in biology focused on knowledge question answering; VCWorld further employs LLMs for structured causal reasoning and dynamic simulation.
Inspiration: The "white-box world model" paradigm combining LLM reasoning with domain knowledge graphs has the potential to be replicated in other knowledge-intensive scientific domains, such as medicinal chemistry and materials design.

Rating¶

Novelty: ⭐⭐⭐⭐⭐ The concept of a white-box biological world model is highly original; combining LLM reasoning with knowledge graphs is a first in the virtual cell domain.
Experimental Thoroughness: ⭐⭐⭐⭐ Drug perturbation benchmarks are comprehensive and mechanistic validation is convincing.
Writing Quality: ⭐⭐⭐⭐ Concepts are clearly articulated and accessible to readers across disciplines.
Value: ⭐⭐⭐⭐⭐ Provides important directional insights for both AI for Science and interpretable AI.