
Efficient Chromosome Parallelization for Precision Medicine Genomic Workflows

Conference: AAAI 2026 arXiv: 2511.15977 Code: None (related to commercial product StrataRisk™) Area: Bioinformatics / Precision Medicine Keywords: Genomic workflows, chromosome parallelization, RAM prediction, scheduling optimization, symbolic regression

TL;DR

The paper proposes three complementary schemes for chromosome-level parallel scheduling in genomic workflows — static scheduling (optimizing processing order), dynamic scheduling (knapsack-based batching with online RAM prediction), and a symbolic-regression RAM predictor — which together achieve significant reductions in out-of-memory errors and execution time in both simulated and real precision medicine pipelines.

Background & Motivation

Computational Challenges in Precision Medicine

Precision medicine relies on whole-genome sequencing (WGS) to compute polygenic risk scores (PRS) and local ancestry inference (LAI). Modern genomic pipelines chain multiple tools (C/C++ read alignment, Java variant calling, Python downstream analysis, etc.), and VCF files for a single sample can reach tens to hundreds of gigabytes.

Core Problem: Splitting work by chromosome for parallel processing is a natural strategy for managing data volume and memory demands, but it faces several key challenges:

Large variation in chromosome size: Human chromosome 1 is approximately 4× larger than chromosome 22 (Fig. 1), causing severe load imbalance under naive one-task-per-core assignment.

Unpredictable memory consumption: RAM requirements for each chromosome processing task are difficult to estimate precisely; static resource allocation easily leads to out-of-memory (OOM) errors.

Low resource utilization: Conservative estimates cause over-allocation and wasted compute; aggressive estimates cause task failures and reruns.

Limitations of Prior Work

  • Workflow engines such as GATK Scatter-Gather, Nextflow, and Snakemake only support static memory declarations.
  • ML methods (SVM, LSTM, etc.) can predict resource usage but are complex to deploy (requiring model storage/loading infrastructure).
  • Methods such as RA-GCN suffer from GAN training instability.

Method

Overall Architecture

Three complementary systems work in concert:

  1. Static scheduler: Given a fixed concurrency \(K\), optimizes chromosome processing order to minimize peak memory.
  2. Dynamic scheduler: Treats batching as a knapsack problem, with online RAM prediction and adaptive scheduling.
  3. RAM prediction module: Symbolic regression learns an interpretable RAM prediction formula.

Key Designs

1. Static Scheduler

Problem formulation: Given \(n=22\) chromosomes and \(K\) parallel worker threads, find the optimal permutation \(\pi^*\) minimizing peak memory:

\[\pi^* \in \arg\min_{\pi \in S_n} J(\pi; K) = \arg\min_{\pi \in S_n} \sup_{t \geq 0} M(t; \pi, K)\]

where \(M(t; \pi, K) = \sum_{i \in \mathcal{A}(t; \pi, K)} m_i\) is the instantaneous memory usage at time \(t\).

Search algorithm: Stochastic Hill Climbing

  • Starting from an initial permutation, randomly swap \(M_r \sim \text{Unif}\{1, \dots, M_{max}\}\) positions at each step.
  • Accept the new permutation if its peak memory is lower.
  • Run multiple random restarts (\(T\) in total) to avoid local optima.
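As a concrete illustration, the search can be sketched in a few lines of stdlib Python. This is a minimal sketch, not the authors' implementation: `peak_memory` assumes simple list scheduling (a task starts as soon as one of the \(K\) workers frees up) with known per-task memory and duration, and the mutation swaps a single pair of positions rather than a sampled number \(M_r\) of them.

```python
import heapq
import random

def peak_memory(order, mem, dur, K):
    """Simulate list scheduling on K workers; return peak concurrent RAM."""
    running = []          # min-heap of (finish_time, ram) for in-flight tasks
    cur = peak = now = 0.0
    for i in order:
        if len(running) >= K:                  # wait for the earliest finisher
            now, freed = heapq.heappop(running)
            cur -= freed
        while running and running[0][0] <= now:  # drain others already done
            _, r = heapq.heappop(running)
            cur -= r
        cur += mem[i]
        peak = max(peak, cur)
        heapq.heappush(running, (now + dur[i], mem[i]))
    return peak

def hill_climb(mem, dur, K, steps=500, restarts=3, seed=0):
    """Stochastic hill climbing over processing orders, with restarts."""
    rng = random.Random(seed)
    n = len(mem)
    best_order = list(range(n))
    best_peak = peak_memory(best_order, mem, dur, K)  # sequential baseline
    for _ in range(restarts):
        order = list(range(n))
        rng.shuffle(order)
        peak = peak_memory(order, mem, dur, K)
        for _ in range(steps):
            cand = order[:]
            i, j = rng.randrange(n), rng.randrange(n)
            cand[i], cand[j] = cand[j], cand[i]      # single-pair swap move
            p = peak_memory(cand, mem, dur, K)
            if p < peak:                              # accept improvements only
                order, peak = cand, p
        if peak < best_peak:
            best_order, best_peak = order, peak
    return best_order, best_peak
```

The returned ordering can be precomputed offline for each \(K\), matching the paper's zero-runtime-overhead deployment story.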

Core finding: Optimized orderings exhibit an alternating pattern of large and small chromosomes (Fig. 2); the sliding-window average stabilizes around 11 (the median), ensuring a uniform mix of large and small chromosomes across adjacent batches.

Design Motivation: These optimized orderings can be precomputed and used directly at runtime with zero additional overhead, making them suitable for simple deployment scenarios.

2. Dynamic Scheduler

Online RAM prediction: Uses polynomial regression to online-learn the relationship between chromosome index \(c\) and RAM demand:

\[\hat{r}_c = \sum_{k=0}^{d} w_k c^k\]

Regression coefficients are updated after each chromosome is processed.
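A minimal sketch of such an online predictor for the paper's reported degree \(d=1\), using the closed-form least-squares fit (the class and method names are invented for illustration):

```python
class OnlineRamPredictor:
    """Online degree-1 polynomial fit of observed RAM vs chromosome index."""

    def __init__(self):
        self.xs, self.ys = [], []

    def observe(self, c, ram):
        """Record the actual RAM used by chromosome c after it completes."""
        self.xs.append(c)
        self.ys.append(ram)

    def predict(self, c):
        """Least-squares line through all observations so far."""
        n = len(self.xs)
        if n < 2:                      # not enough data: fall back to max seen
            return max(self.ys, default=0.0)
        mx = sum(self.xs) / n
        my = sum(self.ys) / n
        sxx = sum((x - mx) ** 2 for x in self.xs)
        sxy = sum((x - mx) * (y - my) for x, y in zip(self.xs, self.ys))
        w1 = sxy / sxx if sxx else 0.0
        return my + w1 * (c - mx)      # w0 + w1*c in centred form
```

Refitting after every completed chromosome is cheap here because \(n \le 22\).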

Conservative Bias: A residual-percentile-based bias term \(b\) is introduced:

\[\hat{r}_{c,b,t} = \hat{r}_c + b_t\]

The bias percentile is interpolated from \(\gamma_{max}\) in early stages to \(\gamma_{min}\) in later stages — more conservative early on (reducing OOM) and more aggressive later when predictions are accurate (improving efficiency).
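The interpolated-percentile bias might look like the following sketch (the function name and the linear interpolation schedule are assumptions; `residuals` holds observed actual-minus-predicted values):

```python
def conservative_estimate(pred, residuals, frac_done, g_max=0.95, g_min=0.80):
    """Add a residual-percentile safety margin that relaxes over time.

    frac_done in [0, 1]: the percentile interpolates linearly from g_max
    (early, conservative) down to g_min (late, aggressive).
    """
    if not residuals:
        return pred
    gamma = g_max + (g_min - g_max) * frac_done
    rs = sorted(residuals)                        # actual - predicted
    k = min(len(rs) - 1, int(gamma * len(rs)))    # empirical gamma-percentile
    return pred + max(0.0, rs[k])                 # never bias downwards
```

With the paper's \(\gamma_{max}=0.95\) and \(\gamma_{min}=0.80\), early tasks get a margin covering nearly all past underestimates, while late tasks run closer to the raw prediction.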

Task packing: Two strategies are compared:

  • Greedy: Maximizes the number of concurrently running tasks (sorts tasks by predicted RAM in ascending order and adds them until the limit would be exceeded).
  • Knapsack: Maximizes RAM utilization (solves the 0-1 knapsack problem via dynamic programming, using predicted RAM as item weights).
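A sketch of the knapsack packer under that formulation — a standard 0-1 knapsack DP with predicted RAM serving as both weight and value, so maximum value means maximum RAM utilization (the discretization scale and helper name are illustrative, not from the paper):

```python
def knapsack_batch(pred_ram, capacity, scale=10):
    """Select tasks whose total predicted RAM best fills `capacity` (GB).

    RAM values are discretized to 1/scale GB units for the DP table,
    rounding up so the packing stays conservative.
    """
    w = [int(r * scale + 0.999) for r in pred_ram]   # round up
    C = int(capacity * scale)
    best = [0] * (C + 1)
    choice = [[False] * (C + 1) for _ in w]
    for i, wi in enumerate(w):
        for c in range(C, wi - 1, -1):               # reverse: use each once
            if best[c - wi] + wi > best[c]:
                best[c] = best[c - wi] + wi
                choice[i][c] = True
    sel, c = [], C                                   # backtrack the choices
    for i in range(len(w) - 1, -1, -1):
        if choice[i][c]:
            sel.append(i)
            c -= w[i]
    return sel[::-1]
```

Tasks left out of the batch simply wait for the next packing round.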

Predictor initialization: Three strategies are compared:

  • Largest-first: Processes large chromosomes first (high initial RAM utilization but inaccurate predictions).
  • Smallest-first: Processes small chromosomes first, rapidly initializing the predictor. → Optimal choice
  • Mixed: Alternates large and small (robust predictor but suboptimal speed).
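The three initialization orders can be generated with a small helper (hypothetical, for illustration only):

```python
def init_order(sizes, strategy="smallest"):
    """Return a chromosome-index order for predictor initialization."""
    idx = sorted(range(len(sizes)), key=sizes.__getitem__)  # ascending size
    if strategy == "smallest":
        return idx
    if strategy == "largest":
        return idx[::-1]
    # mixed: alternate largest and smallest remaining
    out, lo, hi = [], 0, len(idx) - 1
    while lo <= hi:
        out.append(idx[hi]); hi -= 1
        if lo <= hi:
            out.append(idx[lo]); lo += 1
    return out
```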

Design Motivation: Compared to static scheduling, dynamic scheduling adapts to actual memory variations at runtime and continuously improves prediction accuracy through online learning.

3. Symbolic Regression RAM Prediction

Objective: Distill a complex ensemble model into a concise, directly deployable formula.

Teacher model: A Voting Regressor combining Random Forest + Histogram Gradient Boosting + Gradient Boosting.

Distillation process:

  • Generate a large set of synthetic inputs covering the feature space.
  • Use the teacher model to predict RAM on those inputs.
  • Fit PySR symbolic regression to the teacher's predictions.

Taking Beagle (a genotype imputation tool) as an example, the learned formula incorporates 8 features (number of threads, burn-in iterations, main iterations, window size, number of variants, number of samples, reference panel variants, and reference panel samples). Although the resulting expression (Eq. 16) is complex, it can be deployed in a single line of code.
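The paper's Eq. 16 is not reproduced here, but deploying such a formula really does amount to a single function. The coefficients and functional form below are invented purely to illustrate the shape of the deployment; they are not the learned expression:

```python
def beagle_ram_gb(threads, burnin, iters, window, n_variants, n_samples,
                  ref_variants, ref_samples):
    """HYPOTHETICAL stand-in for the paper's learned formula (Eq. 16).

    Coefficients and terms are invented for illustration; only a few of
    the eight features carry weight in this toy version.
    """
    return (2.0
            + 3e-9 * n_samples * window
            + 1.5e-9 * (n_variants + ref_variants) * ref_samples ** 0.5
            + 0.05 * threads)
```

The operational appeal is exactly this: no model files, no inference runtime, just an arithmetic expression embedded in the workflow definition.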

Conservative calibration (Conformal Prediction): One-sided conformal prediction constructs a piecewise linear interpolation function mapping predicted values to conservative estimates, adapting to heteroscedasticity and ensuring consistent safety margins across different prediction ranges.
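One way to sketch this calibration: compute a one-sided residual quantile within bins of the predicted value, then linearly interpolate the margin between bin centres so it tracks heteroscedastic error. Bin count, `alpha`, and all names below are assumptions, not the paper's exact procedure:

```python
import bisect

def conformal_calibrator(preds, actuals, n_bins=4, alpha=0.2):
    """Return a function mapping a prediction to a conservative estimate."""
    pairs = sorted(zip(preds, actuals))          # sort by predicted value
    per_bin = max(1, len(pairs) // n_bins)
    xs, ys = [], []                              # bin centre -> margin
    for b in range(0, len(pairs), per_bin):
        chunk = pairs[b:b + per_bin]
        res = sorted(a - p for p, a in chunk)    # one-sided residuals
        q = res[min(len(res) - 1, int((1 - alpha) * len(res)))]
        xs.append(sum(p for p, _ in chunk) / len(chunk))
        ys.append(q)

    def calibrate(p):
        """Piecewise-linear interpolation of the margin, clamped at the ends."""
        i = bisect.bisect_left(xs, p)
        if i == 0:
            return p + ys[0]
        if i == len(xs):
            return p + ys[-1]
        t = (p - xs[i - 1]) / (xs[i] - xs[i - 1]) if xs[i] != xs[i - 1] else 0.0
        return p + ys[i - 1] + t * (ys[i] - ys[i - 1])

    return calibrate
```

Because each bin gets its own quantile, regions where the predictor underestimates badly receive a larger safety margin than regions where it is already tight.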

Loss & Training

  • Static scheduling: Precompute optimized ordering tables; zero overhead at runtime.
  • Dynamic scheduling parameters: \(s=1.30\), \(\gamma_{max}=0.95\), \(\gamma_{min}=0.80\), \(p=2\), polynomial degree \(d=1\).
  • Symbolic regression conservative priors replace the initial sequential execution phase of the dynamic scheduler.

Key Experimental Results

Main Results (Static Scheduling)

| Concurrency \(K\) | Sequential Peak RAM | Optimized Peak RAM | Reduction |
|---|---|---|---|
| 2 | 492.45 | 297.38 | 39.61% |
| 3 | 690.47 | 413.47 | 40.12% |
| 5 | 1062.54 | 784.03 | 26.21% |
| 7 | 1392.80 | 1037.98 | 25.48% |
| 10 | 1815.91 | 1440.64 | 20.67% |

At low concurrency, peak memory is reduced by up to 40% through processing order optimization alone.

Ablation Study (Dynamic Scheduling)

| Configuration | Makespan | Overcommits | Notes |
|---|---|---|---|
| Knapsack | 703.06 | 2.83 | Base dynamic scheduler (40% task size) |
| +LR Bias | 723.16 | 1.65 | Bias reduces OOM by 38% |
| +Smallest Init | 671.26 | 2.03 | Smallest-first initialization is fastest |
| +Prior (symbolic regression) | 627.63 | 0.95 | Prior eliminates initial sequential phase |
| Sizey (competitor) | 648.04 | 0.68 | Ours outperforms on most task sizes |
| Theoretical optimum | 452.40 | 0.00 | Lower bound |

The complete dynamic scheduler closes approximately 13% of the gap to the theoretical optimum in average makespan while reducing OOM by 77%.

Key Findings

  1. Knapsack > Greedy (Fig. 3 Packer Comparison): The knapsack method consistently yields lower makespan than greedy, as maximizing RAM utilization is more important than maximizing task count.
  2. Smallest-first initialization is optimal (Fig. 3 Initialization): Although mixed initialization yields a more accurate predictor and reduces OOM by 20%, the makespan advantage of smallest-first's rapid predictor initialization outweighs this benefit.
  3. Value of prior information (Fig. 3 Effect of Priors): Even noisy priors eliminate the initial sequential execution phase, with particularly significant effects when task size is below 50% of RAM.
  4. Symbolic regression accuracy: Pearson correlation drops from 0.92 (ensemble teacher) to 0.85, but the model is deployable in one line of code. The 80th-percentile conservative estimate achieves zero OOM.
  5. Real-world deployment results (Fig. 5): In Galatea Bio's StrataRisk™ pipeline, the dynamic scheduler with conservative priors substantially reduces Beagle's makespan, with zero OOM events.

Highlights & Insights

  • Seamless integration of theory and practice: The work forms a complete loop from scheduling theory (knapsack/permutation optimization) to ML prediction (symbolic regression distillation) to real deployment (clinical PRS pipelines).
  • Clever application of symbolic regression distillation: Black-box ensemble predictions are distilled into an interpretable mathematical formula deployable in a single line of code, perfectly suited to the operational requirements of genomic workflows.
  • Adaptive design of conservative bias: The interpolation strategy — conservative early and aggressive later — reduces early OOM risk while improving late-stage efficiency.
  • Multi-level solutions for different scenarios: Static scheduling suits simple scenarios (precomputed, zero overhead); dynamic scheduling suits complex scenarios (online adaptive); symbolic regression suits known task types.
  • Practical clinical value: Reduces computational costs and report turnaround time, directly benefiting patients.

Limitations & Future Work

  • Symbolic regression is modeled only for Beagle; generalization to other tools requires retraining.
  • Static scheduling assumes a linear relationship between processing time and chromosome size, which may be more complex in practice.
  • The \(O(N^3)\) complexity of dynamic scheduling may become a bottleneck at very large scales of parallelism.
  • Sub-chromosomal sequence-level finer-grained parallelization is not considered (noted by the authors as a future direction).
  • Simulation experiments use a linear noise model, which may not fully reflect the complexity of real workloads.
  • Conflict of interest: multiple authors hold equity in Galatea Bio.
  • Validation focuses primarily on RAM optimization, with insufficient attention to CPU/GPU utilization and I/O optimization.

Rating

  • Novelty: ⭐⭐⭐ — Individual components (hill climbing, knapsack scheduling, symbolic regression) are not original, but their combined application to genomic parallelization is novel.
  • Experimental Thoroughness: ⭐⭐⭐⭐ — Evaluation on both simulated and real pipelines is thorough with detailed ablations, though the simulation environment may be idealized.
  • Writing Quality: ⭐⭐⭐⭐ — Structure is clear and mathematical derivations are complete, though some content leans toward engineering details.
  • Value: ⭐⭐⭐⭐ — Direct practical applicability to large-scale genomic workflows, though academic contribution is relatively limited.