SPRINT: Enabling Interleaved Planning and Parallelized Execution in Reasoning Models

Conference: NeurIPS 2025 · arXiv: 2506.05745 · Code: https://github.com/stanford-futuredata/Sprint · Area: LLM Reasoning · Keywords: Reasoning Acceleration, Parallel Inference, Planning and Execution, Chain-of-Thought, Reasoning Models

TL;DR

By restructuring long chain-of-thought reasoning traces into interleaved planning and parallel execution stages, Sprint reduces sequential token counts by up to 39% on in-distribution tasks (up to 65% on OOD tasks) while maintaining accuracy, enabling dynamic parallelization of the reasoning process.

Background & Motivation

Background: Large reasoning models (LRMs) such as DeepSeek-R1 and OpenAI o1 solve complex reasoning tasks by generating long chains of thought, achieving high accuracy but incurring substantial inference latency—a single math problem may require thousands of tokens before an answer is produced.

Limitations of Prior Work: Existing acceleration approaches either lack coordination (repeated sampling / self-consistency generates multiple independent trajectories without information sharing), rely on predefined search structures (Tree-of-Thought and Graph-of-Thought require manually specified search patterns), or do not support multi-step dependent reasoning (Skeleton-of-Thought assumes semantic independence among sub-tasks, making it ineffective for mathematical reasoning, where later steps depend on earlier computations).

Key Challenge: Reasoning accuracy demands long sequential chains of thought, yet many reasoning steps are mutually independent in practice—such as simultaneously attempting two solution strategies or independently computing different sub-parts of a complex problem. Existing methods do not enable models to autonomously discover and exploit these parallelization opportunities.

Goal: To enable reasoning models to autonomously identify parallelization opportunities within the reasoning process and reduce sequential token generation without sacrificing accuracy.

Key Insight: A careful analysis of R1 reasoning traces reveals that they contain steps such as reflection, task decomposition, and trial-and-error, many of which do not depend on one another. Identifying these independent steps and executing them in parallel can substantially shorten the "critical path" of reasoning.

Core Idea: A data curation pipeline restructures sequential reasoning traces into DAG-structured parallel stages, and SFT teaches the model to autonomously perform interleaved planning and parallel execution at inference time.

Method

Overall Architecture

Sprint consists of two components: (1) a training-time data curation pipeline that transforms ordinary sequential reasoning traces into a structured planning–parallel execution format; and (2) an inference-time planner–executor architecture in which the model alternately acts as a planner (decomposing independent sub-tasks) and as executors (executing sub-tasks in parallel).

Given a user query, the planner generates a plan for the current stage (potentially containing multiple independent sub-task prompts); multiple executors execute the sub-tasks in parallel; execution results are merged back into the accumulated context; the planner then generates the plan for the next stage. This loop repeats until the final answer is produced.
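
The following Python sketch reconstructs this loop from the description above. It is illustrative only: `generate(prompt, stop)` stands in for the fine-tuned model's completion call (the paper does not specify a serving interface), the XML-style closing tags around <Plan_i> / <prompt_i.j> / <execution_i.j> are an assumption, and treating a plan that contains no sub-task prompts as the final answer is an assumption of the sketch.

```python
import re
from concurrent.futures import ThreadPoolExecutor

def sprint_inference(query, generate, max_stages=10):
    """Interleaved planning / parallel execution loop (illustrative sketch).

    `generate(prompt, stop)` is a stand-in for the fine-tuned model's
    completion call, supplied by whatever serving stack is in use.
    """
    context = query
    for i in range(1, max_stages + 1):
        # Planner: generate this stage's plan over the full accumulated context.
        plan = generate(context + f"\n<Plan_{i}>", stop=f"</Plan_{i}>")
        context += f"\n<Plan_{i}>{plan}</Plan_{i}>"

        # A plan may contain several independent sub-task prompts <prompt_i.j>.
        prompts = re.findall(
            rf"<prompt_{i}\.(\d+)>(.*?)</prompt_{i}\.\1>", plan, re.DOTALL
        )
        if not prompts:
            return plan  # no sub-tasks left: the plan carries the final answer

        # Executors: run all sub-tasks of this stage in parallel,
        # each conditioned on the same shared context.
        with ThreadPoolExecutor(max_workers=len(prompts)) as pool:
            results = list(
                pool.map(lambda pj: generate(context + "\n" + pj[1], stop=None),
                         prompts)
            )

        # Synchronize execution results back into the main context, then replan.
        for (j, _), result in zip(prompts, results):
            context += f"\n<execution_{i}.{j}>{result}</execution_{i}.{j}>"
    return context
```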

Key Designs

  1. Data Curation Pipeline (Training Data Construction):

    • Function: Transforms raw sequential reasoning traces generated by DeepSeek-R1 into parallelizable, stage-structured formats.
    • Mechanism: Four steps:
      (a) Step Extraction: GPT-4o decomposes each reasoning trace into discrete reasoning steps \(S = \{S_1, S_2, ..., S_n\}\), with each step further divided into a planning phase \(P_i\) and an execution phase \(E_i\);
      (b) DAG Creation: GPT-4o-mini infers inter-step dependencies and constructs a directed acyclic graph \(G = (S, D)\);
      (c) Packing: dependency-free steps at the same DAG depth are packed into the same stage, with a special optimization for plan-only steps: if a parent node is plan-only, its children can be merged into the same stage;
      (d) Filtering: only traces with a parallelization ratio (steps/stages) \(\geq 1.5\) are retained.
    • Design Motivation: Mining parallel structure directly from real reasoning traces, rather than engineering it manually, preserves the natural distribution and diversity of training data.
  2. Planner–Executor Inference Architecture:

    • Function: At inference time, a single fine-tuned model alternately plays the roles of planner and executor.
    • Mechanism: At stage \(i\), the planner receives the full accumulated context (query plus all previous plans and execution results) and generates planning text within a <Plan_i> tag, which may contain multiple <prompt_i.j> sub-tasks. Each sub-task is executed by an independent executor in parallel, with results encapsulated in <execution_i.j> tags and synchronized back to the main context.
    • Design Motivation: Since the planner observes all previous execution results at every round, coordination problems arising from single-round planning (as in SoT) are avoided; parallel execution by executors directly reduces sequential token counts.
  3. Packing Optimization (Stage Assignment Strategy):

    • Function: Optimally assigns steps to stages to maximize parallelism.
    • Mechanism: Stage numbers are computed as \(\sigma(S_i) = \max_{S_p \in \text{Parents}(S_i)} (\sigma(S_p) + \mathbb{1}(E_p \neq \emptyset))\). The key insight is that if a parent node \(S_p\) is plan-only (\(E_p = \emptyset\)), its child need not wait for an execution result and can be assigned to the same stage; see the sketch after this list.
    • Design Motivation: Greedily minimizes the number of stages; short executions (< 3 lines) are merged as plan-only to reduce executor invocation overhead.
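
The stage-assignment rule and the filtering criterion admit a compact implementation. The sketch below is a minimal reconstruction, assuming the DAG is given as parent indices over topologically ordered steps; the `Step` dataclass is a hypothetical representation, not the paper's actual data format.

```python
from dataclasses import dataclass, field

@dataclass
class Step:
    """One reasoning step: a planning phase P_i and an optional execution phase E_i."""
    plan: str
    execution: str = ""  # empty string <=> plan-only step (E_i = emptyset)
    parents: list = field(default_factory=list)  # indices of dependency parents

def assign_stages(steps):
    """Compute sigma(S_i) = max over parents S_p of (sigma(S_p) + 1[E_p != emptyset]).

    Steps are assumed to be topologically ordered (parents before children),
    which holds by construction since the DAG is extracted from a sequential trace.
    """
    sigma = []
    for step in steps:
        if not step.parents:
            sigma.append(1)  # root steps start at stage 1
        else:
            # A plan-only parent contributes 0: its child can share the parent's stage.
            sigma.append(max(sigma[p] + (1 if steps[p].execution else 0)
                             for p in step.parents))
    return sigma

def keep_trace(steps, min_ratio=1.5):
    """Filtering: retain traces whose parallelization ratio steps/stages >= 1.5."""
    stages = assign_stages(steps)
    return len(steps) / max(stages) >= min_ratio

# Example: a plan-only root whose two children run in parallel at stage 1,
# followed by a step that depends on both executions.
trace = [
    Step(plan="decompose the problem"),                             # S_1, plan-only
    Step(plan="solve part A", execution="...", parents=[0]),        # S_2
    Step(plan="solve part B", execution="...", parents=[0]),        # S_3
    Step(plan="combine A and B", execution="...", parents=[1, 2]),  # S_4
]
print(assign_stages(trace))  # [1, 1, 1, 2] -> 4 steps / 2 stages = ratio 2.0
print(keep_trace(trace))     # True
```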

Loss & Training

  • From 6,000 traces generated by R1 on the MATH training set, approximately 1,700 training samples are retained after filtering for answer correctness and parallelization ratio.
  • Full-parameter SFT is applied to DeepSeek-R1-Distill-Qwen-7B (LoRA fails to produce format-compliant outputs) for 5 epochs with a learning rate of \(1 \times 10^{-5}\).
  • Training is conducted on 8×A100 GPUs using ZeRO Stage 3 with 4-bit quantization.
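
For concreteness, the reported hyperparameters could be expressed as Hugging Face `TrainingArguments` roughly as follows; the batch-size settings, precision, and the DeepSpeed config path are assumptions for illustration, not values from the paper.

```python
from transformers import TrainingArguments

# Reported settings: full-parameter SFT, 5 epochs, lr 1e-5, ZeRO Stage 3 on 8xA100.
training_args = TrainingArguments(
    output_dir="r1-distill-qwen-7b-sprint-sft",  # hypothetical name
    num_train_epochs=5,
    learning_rate=1e-5,
    per_device_train_batch_size=1,   # assumption
    gradient_accumulation_steps=8,   # assumption
    bf16=True,                       # assumption
    deepspeed="ds_zero3.json",       # ZeRO Stage 3 config (path is an assumption)
)
```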

Key Experimental Results

Main Results

| Dataset | Method | Accuracy | Sequential Tokens | Total Tokens |
|---|---|---|---|---|
| MATH-500 | R1-Distill-7B | 89.1% | – | – |
| MATH-500 | RFT | 91.0% | 2,880 | 2,880 |
| MATH-500 | SoT-reasoning | 90.8% | 3,836 | 11,538 |
| MATH-500 | Self-consistency | 80.5% | 590 | 11,645 |
| MATH-500 | Sprint | 92.5% | 2,440 | 3,622 |
| Countdown | RFT | 84.9% | 4,917 | – |
| Countdown | Sprint | 85.9% | 2,284 | – |
| GPQA-Diamond | RFT | 50.5% | 7,103 | – |
| GPQA-Diamond | Sprint | 51.0% | 6,336 | – |

Efficiency Analysis (Stratified by Reasoning Length)

| RFT Sequential Token Range | Sprint Sequential Token Reduction |
|---|---|
| < 2,000 tokens | +5% (slight overhead) |
| 4,000–6,000 tokens | ~20% reduction |
| > 8,000 tokens | 39% reduction |
| Countdown long traces | 65% reduction |
| GPQA long traces | 45% reduction |

Key Findings

  • Sprint achieves 92.5% accuracy on MATH-500, surpassing RFT's 91.0%, possibly because independently executed sub-tasks do not interfere with one another, reducing error propagation.
  • The harder the problem (the more tokens required), the greater the parallelization benefit—a 39% reduction is observed on problems exceeding 8,000 tokens.
  • Simple problems incur approximately 5% overhead (from plan/execution tags and additional prompt formatting), yet an average sequential token reduction of 15% is still observed overall.
  • Sprint reduces sequential token counts by 53.5% on the fully OOD Countdown task (4,917 → 2,284 sequential tokens vs. RFT, per the table above), indicating that the model has learned a general parallelization capability rather than task-specific patterns.
  • Although SoT-reasoning also attempts parallelism, its total token count is three times that of Sprint (11,538 vs. 3,622), because single-round planning leads to redundant computation across sub-tasks.

Highlights & Insights

  • Elegant data curation pipeline design: Rather than manually defining parallel structures, the pipeline uses LLMs to automatically extract dependency graphs from real reasoning traces, preserving the natural distribution of the original reasoning while reorganizing execution order.
  • Only 1,700 SFT samples required: A remarkably small number of structured samples suffices to unlock parallel reasoning capabilities, suggesting that reasoning models inherently possess the potential for parallel reasoning and merely need to be "reminded" how to apply it.
  • Accuracy improves rather than degrades: Sprint achieves 92.5% vs. RFT's 91.0%; parallel execution produces an ensemble-like effect in which independent sub-executions do not contaminate one another.
  • Interleaved planning design: Each planning round observes all results from the previous execution round, forming a closed loop of plan–execute–synchronize–replan, which avoids the sub-task coordination failures caused by one-shot planning in SoT.

Limitations & Future Work

  • Actual wall-clock speedup not validated: The paper reports only reductions in sequential token counts; realizing actual inference speedup requires dedicated KV-cache optimization and a GPU parallel execution framework, which the authors acknowledge they were unable to implement due to computational resource constraints.
  • Training data construction depends on GPT-4o: Both step extraction and DAG creation require calls to a closed-source model, imposing cost and reproducibility limitations.
  • LoRA fine-tuning fails: Only full-parameter SFT enables format compliance, limiting applicability to larger models.
  • Parallelism is bounded: The degree of parallelism is constrained by the parallel patterns observed in SFT data; RL training (using latency as a reward signal) may discover more aggressive parallelization strategies.
  • Not extended to tool-use settings: In reasoning-plus-tool-use scenarios, tool-call latency far exceeds token generation latency, where parallelization gains could be substantially larger.
Comparison with Related Work

  • vs. SoT (Skeleton-of-Thought): SoT performs a single round of planning and then executes all sub-tasks in parallel, precluding multi-step dependent reasoning. Sprint addresses this through multi-round interleaved planning, and its total token count is only one-third that of SoT.
  • vs. PASTA: PASTA also trains models to perform parallel sub-task decomposition but does not support multi-step planning. Sprint's interleaved planning–execution framework supports an arbitrary number of iterative parallel rounds.
  • vs. Repeated Sampling: Repeated sampling generates multiple fully independent reasoning paths without information sharing, causing redundant computation. Sprint's executors share the accumulated context, enabling better coordination.
  • vs. APR: APR relies on a symbolic solver to generate training data and is therefore limited to synthetic tasks such as Countdown. Sprint's pipeline extracts parallel structure from natural reasoning traces, making it applicable to general reasoning.

Rating

  • Novelty: ⭐⭐⭐⭐ The interleaved planning–parallel execution framework is a novel design, and the DAG-based data curation pipeline is creative; however, concurrent works such as PASTA and APR pursue similar directions in parallel reasoning.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Three benchmarks (including 2 OOD) with multiple baselines and difficulty-stratified efficiency analysis are provided, but validation of actual wall-clock time and experiments with larger models are absent.
  • Writing Quality: ⭐⭐⭐⭐⭐ The paper is clearly structured, figures and tables are well designed, and illustrative examples effectively demonstrate how the method operates.
  • Value: ⭐⭐⭐⭐ The work offers an elegant approach to reasoning model acceleration; the fact that 1,700 samples suffice to unlock the capability is of substantial practical value, though real-world deployment still requires hardware-level support.