
Mind the Gap: Bridging Thought Leap for Improved Chain-of-Thought Tuning

Conference: NeurIPS 2025 arXiv: 2505.14684 Code: Project Page Area: LLM Reasoning Keywords: Chain-of-Thought, Thought Leap, Reasoning Completeness, Data Augmentation, Mathematical Reasoning

TL;DR

This paper provides the first systematic formalization of the "Thought Leap" phenomenon in CoT reasoning chains and proposes CoT-Bridge, a model that automatically detects and fills omitted intermediate steps. Fine-tuning on the bridged data yields up to a +5.87% average accuracy improvement on NuminaMath, and CoT-Bridge can also serve as a plug-and-play module to enhance distillation and RL pipelines.

Background & Motivation

Background: LLMs have achieved remarkable progress on mathematical tasks via Chain-of-Thought reasoning, and the quality of CoT datasets directly determines the performance ceiling of trained models.

Limitations of Prior Work: Existing CoT datasets (e.g., MetaMathQA, NuminaMath) widely exhibit the Thought Leap phenomenon — human experts tend to omit "obvious" intermediate steps when writing reasoning chains due to their background knowledge, resulting in incomplete reasoning traces.

Key Challenge: Steps omitted by human experts are trivial to them but constitute fatal cognitive gaps for LLMs — models cannot bridge these reasoning gaps via implicit knowledge, severely impairing generalization ability.

Goal: Answer three questions: (a) how can leap positions in reasoning chains be detected automatically? (b) how can high-quality intermediate bridging steps be generated? (c) does the completed data consistently improve downstream model performance?

Key Insight: Through controlled experiments on MetaMathQA, the authors find that artificially introducing varying degrees of step omissions causes accuracy drops of up to 27.83% and significantly slower convergence, demonstrating that incomplete reasoning chains are more harmful than factual errors.

Core Idea: Train a dedicated Bridge model to detect reasoning leaps in CoT chains and automatically insert missing steps, thereby improving training data quality and downstream model reasoning capability.

Method

Overall Architecture

Given a potentially incomplete reasoning chain \(C=(s_0, s_1, \ldots, s_n)\), CoT-Bridge jointly outputs: (1) a predicted set of leap positions \(\hat{\mathcal{L}}\) and (2) the corresponding sequence of missing steps \(\hat{\mathcal{M}}\). The generated steps are then inserted at the corresponding positions in the original chain to produce the completed chain \(C_{bridged}\). The overall pipeline consists of three stages: training data construction → Bridge model training → application to existing datasets for augmentation.
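The insertion step described above can be sketched in a few lines (a minimal illustration; `bridge_chain` and its signature are our own names, not the authors' API):

```python
def bridge_chain(chain, leap_positions, missing_steps):
    """Insert missing_steps[i] after chain[leap_positions[i]].

    leap_positions are indices k such that a Thought Leap exists between
    s_k and s_{k+1}; missing_steps[i] is the list of steps filling the
    i-th leap. Returns the completed chain C_bridged.
    """
    assert len(leap_positions) == len(missing_steps)
    fills = dict(zip(leap_positions, missing_steps))
    bridged = []
    for k, step in enumerate(chain):
        bridged.append(step)
        if k in fills:            # a leap was detected after step k
            bridged.extend(fills[k])
    return bridged

# Example: one leap between s_1 and s_2, filled with one bridging step.
chain = ["s0: restate problem", "s1: set up equation", "s2: final answer"]
bridged = bridge_chain(chain, [1], [["s1.5: solve the equation"]])
# bridged == ["s0: restate problem", "s1: set up equation",
#             "s1.5: solve the equation", "s2: final answer"]
```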

Key Designs

  1. Task Formalization — Definition of Thought Leap:

    • Function: Define a completeness function \(V(s_i, s_{i+1})\) to judge whether the reasoning between adjacent steps is sufficient.
    • Mechanism: If \(V(s_k, s_{k+1}) = \text{False}\) for some \(k\), a Thought Leap exists between \(s_k\) and \(s_{k+1}\), and a missing step sequence \(S'_{miss}\) must be generated such that every pair of adjacent steps in the completed chain satisfies the completeness condition.
    • Design Motivation: Unlike prior work focusing on factual accuracy, this paper addresses the overlooked dimension of reasoning structural completeness.
  2. ScaleQM+ Dataset Construction:

    • Function: Systematically remove intermediate steps from the structurally complete ScaleQuestMath dataset to construct "incomplete→complete" training pairs.
    • Mechanism: For a chain of length \(m\), short chains (\(m \leq 10\)) have 1–2 steps removed, and long chains (\(m > 10\)) have 1–3 steps removed; the final step (containing the answer) is always retained; complete chains are preserved with probability 0.2 to teach the model to recognize cases requiring no completion.
    • Design Motivation: The "forward deletion + reverse completion" construction avoids costly manual annotation while naturally guaranteeing ground-truth quality, yielding 588k training samples in total.
  3. CoT-Bridge Model:

    • Function: Fine-tuned on Qwen2.5-Math-7B, the model learns the joint mapping \(f: C \rightarrow (\hat{\mathcal{L}}, \hat{\mathcal{M}})\) for detecting leap positions and generating bridging steps.
    • Mechanism: Contrasted against CoT-Bridge-Random (which generates content given random positions), CoT-Bridge must jointly learn where to bridge and what to insert.
    • Design Motivation: Experiments demonstrate that accurate leap localization is critical — inserting steps at random positions can disrupt reasoning coherence.
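The forward-deletion rule behind ScaleQM+ can be sketched as follows. The removal counts, the length-10 threshold, the always-retained final step, and the 0.2 keep-probability come from the description above; which indices get sampled (here, any step except the last is eligible) is an illustrative assumption:

```python
import random

def make_incomplete(chain, rng):
    """Delete steps from a complete chain to build an 'incomplete -> complete'
    training pair. Returns (incomplete_chain, removed_indices)."""
    m = len(chain)
    # With probability 0.2, keep the chain complete so the Bridge model
    # also learns to recognize cases requiring no completion.
    if rng.random() < 0.2:
        return list(chain), []
    # Short chains (m <= 10): remove 1-2 steps; long chains (m > 10): 1-3.
    max_remove = 2 if m <= 10 else 3
    n_remove = rng.randint(1, max_remove)
    # The final step (containing the answer) is always retained.
    candidates = list(range(m - 1))
    removed = sorted(rng.sample(candidates, min(n_remove, len(candidates))))
    incomplete = [s for i, s in enumerate(chain) if i not in removed]
    return incomplete, removed
```

The removed indices serve as ground-truth leap positions, and the deleted steps as ground-truth completions, which is what makes this construction annotation-free.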

Data Augmentation Application

CoT-Bridge is applied to MetaMathQA and NuminaMath-CoT, adapting to each dataset's step delimiter ("\n" or "\n\n") to generate augmented versions MetaMath-Bridge and NuminaMath-Bridge.
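The delimiter adaptation reduces to splitting each CoT string on the dataset's own step separator and rejoining the bridged steps with the same separator (a trivial sketch; function names are ours):

```python
def split_steps(cot, delimiter):
    """Split a CoT string into steps, dropping empty fragments."""
    return [s for s in cot.split(delimiter) if s.strip()]

def join_steps(steps, delimiter):
    """Reassemble a (possibly bridged) step list into one CoT string."""
    return delimiter.join(steps)

# NuminaMath-style "\n\n" delimiter:
cot = "Step 1: expand the product.\n\nStep 2: the answer is 4."
steps = split_steps(cot, "\n\n")
# steps == ["Step 1: expand the product.", "Step 2: the answer is 4."]
```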

Key Experimental Results

Main Results (Meta-Llama3.1-8B + NuminaMath, average over 6 benchmarks)

| Method | GSM8K | MATH500 | GaoKao2023EN | AMC23 | Avg. |
| --- | --- | --- | --- | --- | --- |
| Direct SFT | 84.86 | 51.45 | 49.03 | 20.00 | 43.87 |
| QwenBridger-72B | 85.25 | 54.20 | 51.62 | 35.00 | 48.41 |
| CoT-Bridge-Random | 84.82 | 54.20 | 51.88 | 33.75 | 48.50 |
| CoT-Bridge | 85.97 | 56.80 | 54.42 | 35.63 | 49.74 (+5.87) |

Plug-and-Play Augmentation (Qwen2.5-Math-1.5B + Distill/Reject Sampling)

| Configuration | GSM8K | MATH500 | Avg. | Note |
| --- | --- | --- | --- | --- |
| Distill - Direct SFT | 81.86 | 68.15 | 55.23 | Distillation data, direct training |
| Distill - CoT-Bridge | 82.52 | 71.50 | 58.25 (+3.02) | Completion applied after distillation |
| Reject Sampling - Direct SFT | 83.36 | 74.90 | 60.44 | Rejection sampling data, direct training |
| Reject Sampling - CoT-Bridge | 83.74 | 75.25 | 61.81 (+1.37) | Completion applied after sampling |

Key Findings

  • Leap localization accuracy is critical: CoT-Bridge-Random causes performance degradation on multiple benchmarks (e.g., GaoKao -1.56% and MathOdyssey -3.68% on Qwen2.5-Math-1.5B + NuminaMath), while CoT-Bridge yields consistent gains.
  • Largest gains on competition-level problems: AMC23 sees a +15.63% improvement for LLaMA, indicating that harder problems benefit most from complete reasoning chains.
  • Improved OOD generalization: Across 5 out-of-domain logical reasoning datasets, LLaMA achieves an average improvement of +2.99%, with a concurrent reduction in invalid response rate.
  • Enhanced RL cold-start: Using Bridge-augmented data for SFT cold-start followed by GRPO yields a final RL accuracy of 63.98% vs. 60.88% (+3.10%).

Highlights & Insights

  • Novel problem formulation: Unlike prior work focusing on factual errors and answer accuracy, this paper is the first to systematically study the structural completeness of CoT chains, formalizing Thought Leap as a detectable and repairable task — an insightful and underexplored perspective.
  • Elegant data construction: Constructing training pairs by deleting steps from complete data and learning to restore them avoids annotation costs while naturally guaranteeing ground-truth quality, since the supervision signal derives from the original complete chains.
  • Practical plug-and-play design: CoT-Bridge can be seamlessly integrated on top of distillation, rejection sampling, and RL pipelines as a general-purpose data quality enhancement module, with low barriers to adaptation in other settings.

Limitations & Future Work

  • Reliance on the completeness assumption of ScaleQuestMath: The approach treats ScaleQuestMath as an approximately ideal complete CoT source; however, this dataset may itself contain Thought Leaps, which limits the quality ceiling of the Bridge model's completions.
  • Restricted to the mathematical domain: Although OOD gains are observed on logical reasoning tasks, generalization to code generation, scientific reasoning, and other domains remains unvalidated.
  • Fixed Bridge model scale at 7B: The scaling behavior of larger or smaller Bridge models is unexplored, as is the feasibility of distilling Bridge capabilities into smaller models.
  • Lack of fine-grained completion quality evaluation: Completion quality is assessed primarily via downstream task accuracy; direct evaluation of the mathematical correctness and logical coherence of generated steps is absent.

Rating

  • Novelty: ⭐⭐⭐⭐ First formalization of the Thought Leap problem with a unique perspective, though the core method (deletion then recovery) is relatively intuitive.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ Comprehensive validation across multiple models, datasets, and settings (distillation/RL/OOD) with thorough ablation studies.
  • Writing Quality: ⭐⭐⭐⭐ Problem definition is clear and experimental organization is well-structured.
  • Value: ⭐⭐⭐⭐ Offers a practical plug-and-play data quality enhancement tool with concrete guidance for CoT dataset construction.