Efficient Chromosome Parallelization for Precision Medicine Genomic Workflows¶
Conference: AAAI 2026 arXiv: 2511.15977 Code: None (related to commercial product StrataRisk™) Area: Bioinformatics / Precision Medicine Keywords: Genomic workflows, chromosome parallelization, RAM prediction, scheduling optimization, symbolic regression
TL;DR¶
Three complementary chromosome-level genomic parallelization scheduling schemes are proposed — static scheduling (optimizing processing order), dynamic scheduling (knapsack-based batching with online RAM prediction), and a symbolic regression RAM predictor — achieving significant reductions in out-of-memory errors and execution time in both simulated and real precision medicine pipelines.
Background & Motivation¶
Computational Challenges in Precision Medicine¶
Precision medicine relies on whole-genome sequencing (WGS) to compute polygenic risk scores (PRS) and local ancestry inference (LAI). Modern genomic pipelines chain multiple tools (C/C++ read alignment, Java variant calling, Python downstream analysis, etc.), and VCF files for a single sample can reach tens to hundreds of gigabytes.
Core Problem: Splitting work by chromosome for parallel processing is a natural strategy for managing data volume and memory demands, but it faces several key challenges:
Large variation in chromosome size: Human chromosome 1 is approximately 4× larger than chromosome 22 (Fig. 1), causing severe load imbalance under naive one-task-per-core assignment.
Unpredictable memory consumption: RAM requirements for each chromosome processing task are difficult to estimate precisely; static resource allocation easily leads to out-of-memory (OOM) errors.
Low resource utilization: Conservative estimates cause over-allocation and wasted compute; aggressive estimates cause task failures and reruns.
Limitations of Prior Work¶
- Workflow engines such as GATK Scatter-Gather, Nextflow, and Snakemake only support static memory declarations.
- ML methods (SVM, LSTM, etc.) can predict resource usage but are complex to deploy (requiring model storage/loading infrastructure).
- Methods such as RA-GCN suffer from GAN training instability.
Method¶
Overall Architecture¶
Three complementary systems work in concert:
1. Static scheduler: Given a fixed concurrency \(K\), optimizes chromosome processing order to minimize peak memory.
2. Dynamic scheduler: Treats batching as a knapsack problem, with online RAM prediction and adaptive scheduling.
3. RAM prediction module: Symbolic regression learns an interpretable RAM prediction formula.
Key Designs¶
1. Static Scheduler¶
Problem formulation: Given \(n=22\) chromosomes and \(K\) parallel worker threads, find the optimal permutation \(\pi^*\) minimizing peak memory:

\[ \pi^* = \arg\min_{\pi} \; \max_{t} \; M(t; \pi, K), \]

where \(M(t; \pi, K) = \sum_{i \in \mathcal{A}(t; \pi, K)} m_i\) is the instantaneous memory usage at time \(t\), \(\mathcal{A}(t; \pi, K)\) is the set of chromosomes being processed at time \(t\), and \(m_i\) is chromosome \(i\)'s RAM demand.
Search algorithm: Stochastic hill climbing
- Starting from an initial permutation, apply \(M_r \sim \text{Unif}\{1, \dots, M_{max}\}\) random position swaps at each step.
- Accept the new permutation if its peak memory is lower.
- Use \(T\) random restarts to avoid local optima.
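The search loop above can be sketched in a few lines. The sketch below assumes chromosome size is a usable proxy for both task runtime and RAM (the paper's linearity assumption); the size table, the single-swap neighborhood (\(M_r = 1\)), and the step/restart counts are illustrative choices, not the paper's exact settings.

```python
import random

# Approximate human chromosome lengths in Mb (chr1..chr22), used here as a
# stand-in for both per-task runtime and RAM demand.
SIZES = [248, 242, 198, 190, 182, 171, 159, 145, 138, 134,
         135, 133, 114, 107, 102, 90, 83, 80, 59, 64, 47, 51]

def peak_ram(order, k):
    """Simulate k-worker list scheduling of tasks in `order`;
    return the peak sum of memory over concurrently running tasks."""
    running = []            # [finish_time, mem] per active task
    t, peak, queue = 0.0, 0.0, list(order)
    while queue or running:
        while queue and len(running) < k:      # fill free worker slots
            s = SIZES[queue.pop(0)]
            running.append([t + s, s])         # runtime and RAM both ~ size
            peak = max(peak, sum(m for _, m in running))
        t = min(f for f, _ in running)         # advance to next completion
        running = [r for r in running if r[0] > t]
    return peak

def hill_climb(k, steps=2000, restarts=5, seed=0):
    """Stochastic hill climbing with random restarts over permutations."""
    rng = random.Random(seed)
    best_order, best = None, float("inf")
    for _ in range(restarts):
        order = list(range(22))
        rng.shuffle(order)
        cur = peak_ram(order, k)
        for _ in range(steps):
            i, j = rng.sample(range(22), 2)    # propose one random swap
            order[i], order[j] = order[j], order[i]
            cand = peak_ram(order, k)
            if cand <= cur:
                cur = cand                     # accept improvement (or tie)
            else:
                order[i], order[j] = order[j], order[i]  # revert
        if cur < best:
            best, best_order = cur, order[:]
    return best_order, best
```

Running `hill_climb(k=3)` typically finds an ordering whose simulated peak sits well below the naive chr1→chr22 ordering, consistent with the alternating large/small pattern the paper reports.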
Core finding: Optimized orderings exhibit an alternating pattern of large and small chromosomes (Fig. 2); the sliding-window average stabilizes around 11 (the median), ensuring a uniform mix of large and small chromosomes across adjacent batches.
Design Motivation: These optimized orderings can be precomputed and used directly at runtime with zero additional overhead, making them suitable for simple deployment scenarios.
2. Dynamic Scheduler¶
Online RAM prediction: Polynomial regression is learned online to map chromosome index \(c\) to RAM demand:

\[ \hat{m}(c) = \sum_{j=0}^{d} \beta_j \, c^j, \]

with degree \(d=1\) in practice. The regression coefficients \(\beta_j\) are refit after each chromosome finishes processing.
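The online refit can be sketched as a closed-form simple linear regression recomputed after every observation (degree \(d=1\); with at most 22 points a full refit is cheap). The class name and interface below are illustrative, not from the paper.

```python
class OnlineRamPredictor:
    """Illustrative online degree-1 RAM predictor: ram ≈ b0 + b1 * x,
    refit from scratch after every observation. x is the chromosome
    index (or any scalar size feature)."""

    def __init__(self):
        self.xs, self.ys = [], []

    def observe(self, x, ram):
        """Record the actual RAM used by a completed chromosome."""
        self.xs.append(x)
        self.ys.append(ram)

    def predict(self, x):
        n = len(self.xs)
        if n < 2:
            # too little data: fall back to the last observed value
            return self.ys[-1] if self.ys else 0.0
        mx = sum(self.xs) / n
        my = sum(self.ys) / n
        sxx = sum((xi - mx) ** 2 for xi in self.xs)
        if sxx == 0:                 # degenerate: all x identical
            return my
        # closed-form least squares for the slope and intercept
        b1 = sum((xi - mx) * (yi - my)
                 for xi, yi in zip(self.xs, self.ys)) / sxx
        b0 = my - b1 * mx
        return b0 + b1 * x
```

Usage: call `observe(index, actual_ram)` as each chromosome completes, then `predict(index)` for the still-queued chromosomes.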
Conservative bias: A residual-percentile-based bias term \(b\) is added to each prediction:

\[ \hat{m}_{\text{safe}}(c) = \hat{m}(c) + b, \]

where \(b\) is the \(\gamma\)-th percentile of the prediction residuals observed so far. The percentile \(\gamma\) is interpolated from \(\gamma_{max}\) in early stages down to \(\gamma_{min}\) in later stages, making the scheduler more conservative early on (reducing OOM) and more aggressive later, once predictions are accurate (improving efficiency).
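A minimal sketch of the bias term, using the paper's \(\gamma_{max}=0.95\) and \(\gamma_{min}=0.80\); the linear interpolation in the fraction of completed tasks is an assumption about the exact schedule, which the summary does not spell out.

```python
def conservative_bias(residuals, done, total, gamma_max=0.95, gamma_min=0.80):
    """Percentile-of-residuals safety margin added to RAM predictions.

    residuals: actual - predicted for completed tasks (positive means
               the predictor underestimated).
    done/total: progress through the workload; gamma interpolates from
                gamma_max (early, conservative) to gamma_min (late,
                aggressive) as progress increases.
    """
    if not residuals:
        return 0.0
    frac = done / total
    gamma = gamma_max + (gamma_min - gamma_max) * frac
    r = sorted(residuals)
    idx = min(len(r) - 1, int(gamma * len(r)))   # empirical percentile index
    return max(0.0, r[idx])                      # never subtract a margin
```

Early in a run (`done=0`) this returns a high percentile of past underestimates; late in the run the margin shrinks toward the 80th percentile.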
Task packing: Two strategies
- Greedy: Maximizes the number of concurrently running tasks (sorts by predicted RAM ascending, adds tasks until the limit is exceeded).
- Knapsack: Maximizes RAM utilization (solves the 0-1 knapsack problem via dynamic programming, using predicted RAM as item weights).
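The knapsack packer is a standard 0-1 DP once free RAM is discretized; the integer-unit capacity and the `(task_id, predicted_ram)` interface below are illustrative assumptions.

```python
def knapsack_pack(preds, capacity):
    """0-1 knapsack: choose waiting tasks maximizing total predicted RAM
    packed under the free-RAM capacity (integer units assumed).

    preds: list of (task_id, predicted_ram) pairs.
    Returns the list of chosen task ids.
    """
    cap = int(capacity)
    # dp[c] = (best total packed RAM, chosen task ids) using capacity c
    dp = [(0, [])] * (cap + 1)
    for tid, w in preds:
        w = int(w)
        if w == 0 or w > cap:
            continue                          # trivially skippable items
        for c in range(cap, w - 1, -1):       # iterate downward: 0-1, not unbounded
            cand = dp[c - w][0] + w
            if cand > dp[c][0]:
                dp[c] = (cand, dp[c - w][1] + [tid])
    return dp[cap][1]
```

For example, with predicted demands `[(0, 5), (1, 4), (2, 3)]` and 7 units free, the packer prefers tasks 1 and 2 (7 units used) over task 0 alone (5 units), maximizing utilization rather than task count.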
Predictor initialization: Three strategies compared
- Largest-first: Processes large chromosomes first (high initial RAM utilization but inaccurate predictions).
- Smallest-first: Processes small chromosomes first (rapidly initializes the predictor). → Optimal choice
- Mixed: Alternates large and small (robust predictor but suboptimal speed).
Design Motivation: Compared to static scheduling, dynamic scheduling adapts to actual memory variations at runtime and continuously improves prediction accuracy through online learning.
3. Symbolic Regression RAM Prediction¶
Objective: Distill a complex ensemble model into a concise, directly deployable formula.
Teacher model: A Voting Regressor combining Random Forest + Histogram Gradient Boosting + Gradient Boosting.
Distillation process:
- Generate a large set of synthetic inputs covering the feature space.
- Use the teacher model to predict RAM for each input.
- Fit PySR symbolic regression to the teacher's predictions.
Taking Beagle (a genotype imputation tool) as an example, the learned formula incorporates 8 features (number of threads, burn-in iterations, main iterations, window size, number of variants, number of samples, reference panel variants, and reference panel samples). Although the resulting expression (Eq. 16) is complex, it can be deployed in a single line of code.
Conservative calibration (Conformal Prediction): One-sided conformal prediction constructs a piecewise linear interpolation function mapping predicted values to conservative estimates, adapting to heteroscedasticity and ensuring consistent safety margins across different prediction ranges.
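One way to realize this piecewise-linear calibration map: bin calibration points by predicted value, take a one-sided residual quantile per bin (so the margin tracks heteroscedastic error), and interpolate margins between bin centers. The binning scheme, `alpha`, and `n_bins` below are assumptions, not the paper's exact construction.

```python
from bisect import bisect_left

def fit_conformal(preds, actuals, alpha=0.2, n_bins=4):
    """One-sided split-conformal calibration sketch.

    Returns a function mapping a raw prediction to a conservative
    estimate: prediction + an interpolated (1 - alpha) quantile of
    local underestimation residuals (actual - predicted).
    """
    pairs = sorted(zip(preds, actuals))
    size = max(1, len(pairs) // n_bins)
    knots = []   # (bin-center predicted value, conservative margin)
    for i in range(0, len(pairs), size):
        chunk = pairs[i:i + size]
        resid = sorted(a - p for p, a in chunk)
        q = resid[min(len(resid) - 1, int((1 - alpha) * len(resid)))]
        center = sum(p for p, _ in chunk) / len(chunk)
        knots.append((center, max(0.0, q)))

    def calibrate(pred):
        xs = [x for x, _ in knots]
        j = bisect_left(xs, pred)
        if j == 0:                      # below the first knot: clamp margin
            return pred + knots[0][1]
        if j == len(knots):             # above the last knot: clamp margin
            return pred + knots[-1][1]
        (x0, m0), (x1, m1) = knots[j - 1], knots[j]
        if x1 == x0:
            return pred + m0
        t = (pred - x0) / (x1 - x0)     # linear interpolation of the margin
        return pred + m0 + t * (m1 - m0)

    return calibrate
```

Because each bin carries its own quantile, regions where the predictor underestimates badly get a larger safety margin than regions where it is accurate, rather than one global offset.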
Loss & Training¶
- Static scheduling: Precompute optimized ordering tables; zero overhead at runtime.
- Dynamic scheduling parameters: \(s=1.30\), \(\gamma_{max}=0.95\), \(\gamma_{min}=0.80\), \(p=2\), polynomial degree \(d=1\).
- Symbolic regression conservative priors replace the initial sequential execution phase of the dynamic scheduler.
Key Experimental Results¶
Main Results (Static Scheduling)¶
| Concurrency K | Sequential Peak RAM | Optimized Peak RAM | Reduction |
|---|---|---|---|
| 2 | 492.45 | 297.38 | 39.61% |
| 3 | 690.47 | 413.47 | 40.12% |
| 5 | 1062.54 | 784.03 | 26.21% |
| 7 | 1392.80 | 1037.98 | 25.48% |
| 10 | 1815.91 | 1440.64 | 20.67% |
At low concurrency, peak memory is reduced by up to 40% through processing order optimization alone.
Ablation Study (Dynamic Scheduling)¶
| Configuration | Makespan | Overcommits | Notes |
|---|---|---|---|
| Knapsack | 703.06 | 2.83 | Base dynamic scheduler (40% task size) |
| +LR Bias | 723.16 | 1.65 | Bias reduces OOM by 38% |
| +Smallest Init | 671.26 | 2.03 | Smallest-first initialization is fastest |
| +Prior (symbolic regression) | 627.63 | 0.95 | Prior eliminates initial sequential phase |
| Sizey (competitor) | 648.04 | 0.68 | Ours outperforms on most task sizes |
| Theoretical optimum | 452.40 | 0.00 | Lower bound |
The complete dynamic scheduler closes approximately 13% of the gap to the theoretical optimum in average makespan while reducing OOM by 77%.
Key Findings¶
- Knapsack > Greedy (Fig. 3 Packer Comparison): The knapsack method consistently yields lower makespan than greedy, as maximizing RAM utilization is more important than maximizing task count.
- Smallest-first initialization is optimal (Fig. 3 Initialization): Although mixed initialization yields a more accurate predictor and reduces OOM by 20%, the speed advantage of smallest-first's rapid predictor initialization outweighs this benefit.
- Value of prior information (Fig. 3 Effect of Priors): Even noisy priors eliminate the initial sequential execution phase, with particularly significant effects when task size is below 50% of RAM.
- Symbolic regression accuracy: Pearson correlation drops from 0.92 (ensemble teacher) to 0.85, but the model is deployable in one line of code. The 80th-percentile conservative estimate achieves zero OOM.
- Real-world deployment results (Fig. 5): In Galatea Bio's StrataRisk™ pipeline, the dynamic scheduler with conservative priors roughly halves Beagle's makespan with zero OOM events.
Highlights & Insights¶
- Seamless integration of theory and practice: The work forms a complete loop from scheduling theory (knapsack/permutation optimization) to ML prediction (symbolic regression distillation) to real deployment (clinical PRS pipelines).
- Clever application of symbolic regression distillation: Black-box ensemble predictions are distilled into an interpretable mathematical formula deployable in a single line of code, perfectly suited to the operational requirements of genomic workflows.
- Adaptive design of conservative bias: The interpolation strategy — conservative early and aggressive later — reduces early OOM risk while improving late-stage efficiency.
- Multi-level solutions for different scenarios: Static scheduling suits simple scenarios (precomputed, zero overhead); dynamic scheduling suits complex scenarios (online adaptive); symbolic regression suits known task types.
- Practical clinical value: Reduces computational costs and report turnaround time, directly benefiting patients.
Limitations & Future Work¶
- Symbolic regression is modeled only for Beagle; generalization to other tools requires retraining.
- Static scheduling assumes a linear relationship between processing time and chromosome size, which may be more complex in practice.
- The \(O(N^3)\) complexity of dynamic scheduling may become a bottleneck at very large scales of parallelism.
- Sub-chromosomal sequence-level finer-grained parallelization is not considered (noted by the authors as a future direction).
- Simulation experiments use a linear noise model, which may not fully reflect the complexity of real workloads.
- Conflict of interest: multiple authors hold equity in Galatea Bio.
- Validation focuses primarily on RAM optimization, with insufficient attention to CPU/GPU utilization and I/O optimization.
Rating¶
- Novelty: ⭐⭐⭐ — Individual components (hill climbing, knapsack scheduling, symbolic regression) are not original, but their combined application to genomic parallelization is novel.
- Experimental Thoroughness: ⭐⭐⭐⭐ — Evaluation on both simulated and real pipelines is thorough with detailed ablations, though the simulation environment may be idealized.
- Writing Quality: ⭐⭐⭐⭐ — Structure is clear and mathematical derivations are complete, though some content leans toward engineering details.
- Value: ⭐⭐⭐⭐ — Direct practical applicability to large-scale genomic workflows, though academic contribution is relatively limited.