
DeepPrune: Parallel Scaling without Inter-Trace Redundancy

Conference: ACL 2026 · arXiv: 2510.08483 · Code: https://deepprune.github.io/ · Area: Model Compression
Keywords: parallel inference, CoT pruning, reasoning redundancy, answer equivalence prediction, inference efficiency

TL;DR

This paper proposes DeepPrune, which trains a dedicated judge model to predict answer equivalence from partial reasoning traces and combines it with an online greedy clustering algorithm to dynamically prune redundant parallel CoT paths. DeepPrune reduces token consumption by 65.73%–88.50% while maintaining competitive accuracy within 3 percentage points.

Background & Motivation

Background: Parallel scaling (e.g., best-of-n sampling) enhances LLM reasoning by generating multiple reasoning traces simultaneously, with total token consumption potentially exceeding 100M. Existing efficient inference methods primarily focus on overthinking in sequential scaling, leaving parallel scaling efficiency largely underexplored.

Limitations of Prior Work: (1) Over 80% of parallel reasoning traces produce identical final answers, representing substantial wasted computation. (2) Confidence-based early stopping fails to reduce inter-trace redundancy and risks prematurely terminating correct reasoning. (3) Shallow semantic similarity methods (e.g., SentenceBERT) cannot predict final answer equivalence from early reasoning stages.

Key Challenge: The benefit of parallel scaling stems from answer diversity: the correct answer typically lies within a small set of distinct candidates. Yet the vast majority (80%+) of parallel traces converge on the same answer, contributing little diversity.

Goal: Proactively prune redundant parallel reasoning traces while preserving answer diversity.

Key Insight: Train a dedicated judge model to understand the deep semantics of reasoning processes, predicting from partial traces whether two traces will ultimately reach the same answer.

Core Idea: Early detection of answer equivalence → retain diverse traces + prune redundant traces → efficient parallel scaling.

Method

Overall Architecture

DeepPrune comprises two components: (1) a judge model that predicts answer equivalence from partial reasoning traces (AUROC 0.7072); and (2) an online greedy clustering algorithm that dynamically clusters traces into answer-equivalent groups during inference, prunes redundant traces within each group, and retains only one representative trace per group.
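
As a minimal sketch of how the first component might be exposed, the snippet below wraps the judge behind a scoring callable that compares only the first N tokens of two partial traces. The names `JudgeModel`, `score_fn`, and `PREFIX_TOKENS` are illustrative assumptions, not identifiers from the paper's code release.

```python
from dataclasses import dataclass
from typing import Callable

PREFIX_TOKENS = 512  # illustrative value for N, the partial-trace prefix length

@dataclass
class JudgeModel:
    # score_fn maps two trace prefixes to P(same final answer); in practice
    # this would wrap the fine-tuned Qwen3-4B judge checkpoint.
    score_fn: Callable[[str, str], float]

    def equivalence_prob(self, trace_a: str, trace_b: str) -> float:
        # Whitespace split stands in for real tokenization in this sketch.
        a = " ".join(trace_a.split()[:PREFIX_TOKENS])
        b = " ".join(trace_b.split()[:PREFIX_TOKENS])
        return self.score_fn(a, b)
```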

Key Designs

  1. Answer Equivalence Judge Model:

    • Function: Predict whether two partial reasoning traces will produce the same final answer.
    • Mechanism: Fine-tuned from Qwen3-4B on trace pairs from AIME 2022/2023 and MATH 500 (disjoint from the evaluation sets), with oversampling to balance positive and negative pairs. The model takes the first \(N\) tokens of two traces as input and outputs the probability that they will reach the same final answer.
    • Design Motivation: Shallow similarity methods (AUROC = 0.58) and general-purpose LLMs (AUROC = 0.66) are insufficiently accurate; a dedicated model trained to understand reasoning processes is required.
  2. Online Greedy Clustering and Dynamic Pruning:

    • Function: Prune redundant paths in real time during inference.
    • Mechanism: A set of answer-equivalent groups is maintained during generation. For each newly generated trace segment, the judge model checks whether the trace is predicted equivalent to any existing group's representative. If so, the trace is pruned (generation halted); otherwise, a new group is created. One representative trace per group continues generating (see the clustering sketch after this list).
    • Design Motivation: Online pruning saves more computation than post-hoc pruning, and the greedy strategy offers a practical balance between efficiency and diversity.
  3. OOD Generalization Training Strategy:

    • Function: Ensure the judge model generalizes to unseen reasoning models.
    • Mechanism: Training is conducted on AIME 2022/2023 and MATH 500 (disjoint from the evaluation sets AIME 2024/2025), enabling generalization to traces generated by diverse reasoning models.
    • Design Motivation: Retraining the judge model for every new reasoning model is impractical in real-world deployment.
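
Below is a hedged sketch of the online greedy clustering loop from Key Design 2. `judge` is any object exposing `equivalence_prob(trace_a, trace_b) -> float`, such as the `JudgeModel` sketched earlier; `THRESHOLD` is an assumed cutoff, not a value reported in the paper.

```python
THRESHOLD = 0.5  # illustrative; the paper notes the cutoff may need per-scenario tuning

def select_representatives(partial_traces, judge):
    """Return indices of traces kept as group representatives; every other
    trace is predicted answer-equivalent to some group and gets pruned."""
    representatives = []  # one kept trace index per answer-equivalent group
    for i, trace in enumerate(partial_traces):
        # Greedily compare the new trace against each group's representative.
        matched = any(
            judge.equivalence_prob(trace, partial_traces[r]) >= THRESHOLD
            for r in representatives
        )
        if not matched:
            # No group is predicted to share this trace's final answer:
            # open a new group and let the trace keep generating.
            representatives.append(i)
        # else: generation of this trace would be halted here.
    return representatives
```

In a streaming deployment this check would run each time a new segment of a live trace arrives, so redundant traces stop consuming tokens as early as the judge can identify them.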

Loss & Training

The judge model is trained with binary cross-entropy loss. Oversampling of the minority class (non-equivalent pairs) is applied to address data imbalance. Training data consists of parallel trace pairs from multiple reasoning models.
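
As a sketch of this recipe under stated assumptions (pair encoding abstracted away, a single equivalence logit per pair, oversampling implemented by duplicating minority-class pairs), in PyTorch:

```python
import random
import torch
import torch.nn.functional as F

def oversample(pairs, labels):
    """Duplicate minority-class (non-equivalent, label 0) pairs until the
    two classes are roughly balanced."""
    minority = [i for i, y in enumerate(labels) if y == 0]
    majority = [i for i, y in enumerate(labels) if y == 1]
    if not minority or not majority:
        return pairs, labels
    extra = random.choices(minority, k=max(0, len(majority) - len(minority)))
    idx = majority + minority + extra
    return [pairs[i] for i in idx], [labels[i] for i in idx]

def judge_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    # Binary cross-entropy on the judge's single equivalence logit.
    return F.binary_cross_entropy_with_logits(logits, labels.float())
```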

Key Experimental Results

Main Results

Comparison with Standard Consensus Sampling (LLaDA Reasoning Model)

| Method | Token Reduction | Accuracy Gap |
| --- | --- | --- |
| Standard Consensus Sampling | 0% | baseline |
| Confidence-based Early Stopping | ~30% | may degrade |
| DeepPrune | 65.73%–88.50% | ≤3 pp |

Ablation Study

| Component | Performance (AUROC) |
| --- | --- |
| DeepPrune Judge Model | 0.7072 (OOD generalization) |
| SentenceBERT baseline | 0.58 (near random) |
| General-purpose LLM baseline | 0.66 (suboptimal) |

Key Findings

  • DeepPrune reduces token consumption by 65%–88% across three challenging benchmarks (AIME 2024, AIME 2025, GPQA).
  • Accuracy loss is constrained to within 3 percentage points.
  • The judge model successfully generalizes to unseen reasoning models.
  • Pruning preserves answer diversity—high-diversity traces are not erroneously pruned.

Highlights & Insights

  • Quantitatively reveals the core efficiency problem in parallel reasoning: 80%+ of traces produce identical answers.
  • Training the judge model from a "reasoning comprehension" perspective rather than "text similarity" represents a significant improvement over shallow approaches.
  • The online pruning design enables acceleration to take effect immediately during inference.

Limitations & Future Work

  • The judge model's AUROC (0.7072) leaves room for improvement, potentially causing a small number of valuable traces to be incorrectly pruned.
  • The greedy strategy for online clustering may be suboptimal.
  • The method relies on a specific equivalence threshold that may require tuning across different scenarios.
  • Validation is limited to mathematical reasoning tasks; effectiveness on other reasoning types remains to be confirmed.

Comparison with Related Approaches

  • vs. Confidence-based Early Stopping: Confidence-based methods cannot reduce inter-trace redundancy; DeepPrune targets the redundancy problem directly.
  • vs. Sequential Pruning: Sequential methods reduce the length of individual traces, whereas DeepPrune reduces the number of parallel traces.

Rating

  • Novelty: ⭐⭐⭐⭐ The analysis of parallel reasoning redundancy and the answer equivalence judge model are novel contributions.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Validated on three benchmarks, multiple models, and with OOD generalization testing.
  • Writing Quality: ⭐⭐⭐⭐ Problem analysis is clear and the method is intuitive.
  • Value: ⭐⭐⭐⭐ Provides a practical tool for improving the efficiency of parallel scaling at inference time.