
Mind the Gap: Bridging Thought Leap for Improved Chain-of-Thought Tuning

Conference: NeurIPS 2025 arXiv: 2505.14684 Code: Project Page Area: LLM Reasoning Keywords: Chain-of-Thought, Thought Leap, Reasoning Completeness, Data Augmentation, Mathematical Reasoning

TL;DR

This paper provides the first systematic formalization of the "Thought Leap" phenomenon in CoT reasoning chains and proposes CoT-Bridge, a model that automatically detects and fills omitted intermediate steps. Fine-tuning on the bridged data yields up to a +5.87% average accuracy improvement on NuminaMath, and CoT-Bridge can also serve as a plug-and-play module to enhance distillation and RL pipelines.

Background & Motivation

Background: LLMs have achieved remarkable progress on mathematical tasks via Chain-of-Thought reasoning, and the quality of CoT datasets directly determines the performance ceiling of trained models.

Limitations of Prior Work: Existing CoT datasets (e.g., MetaMathQA, NuminaMath) widely exhibit the Thought Leap phenomenon — human experts tend to omit "obvious" intermediate steps when writing reasoning chains due to their background knowledge, resulting in incomplete reasoning traces.

Key Challenge: Steps omitted by human experts are trivial to them but constitute fatal cognitive gaps for LLMs — models cannot bridge these reasoning gaps via implicit knowledge, severely impairing generalization ability.

Goal: Answer three questions: (a) how can leap positions in reasoning chains be detected automatically? (b) how can high-quality intermediate bridging steps be generated? (c) does the completed data consistently improve downstream model performance?

Key Insight: Through controlled experiments on MetaMathQA, the authors find that artificially introducing varying degrees of step omissions causes accuracy drops of up to 27.83% and significantly slower convergence, demonstrating that incomplete reasoning chains are more harmful than factual errors.

Core Idea: Train a dedicated Bridge model to detect reasoning leaps in CoT chains and automatically insert missing steps, thereby improving training data quality and downstream model reasoning capability.

Method

Overall Architecture

Given a potentially incomplete reasoning chain \(C=(s_0, s_1, \ldots, s_n)\), CoT-Bridge jointly outputs: (1) a predicted set of leap positions \(\hat{\mathcal{L}}\) and (2) the corresponding sequence of missing steps \(\hat{\mathcal{M}}\). The generated steps are then inserted at the corresponding positions in the original chain to produce the completed chain \(C_{bridged}\). The overall pipeline consists of three stages: training data construction → Bridge model training → application to existing datasets for augmentation.
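The insertion step described above can be sketched in a few lines (a minimal illustration; `bridge_chain` and its signature are our own names, not the authors' API):

```python
def bridge_chain(chain, leap_positions, missing_steps):
    """Insert missing_steps[i] after chain[leap_positions[i]].

    leap_positions are indices k such that a Thought Leap exists between
    s_k and s_{k+1}; missing_steps[i] is the list of steps filling the
    i-th leap. Returns the completed chain C_bridged.
    """
    assert len(leap_positions) == len(missing_steps)
    fills = dict(zip(leap_positions, missing_steps))
    bridged = []
    for k, step in enumerate(chain):
        bridged.append(step)
        if k in fills:            # a leap was detected after step k
            bridged.extend(fills[k])
    return bridged

# Example: one leap between s_1 and s_2, filled with one bridging step.
chain = ["s0: restate problem", "s1: set up equation", "s2: final answer"]
bridged = bridge_chain(chain, [1], [["s1.5: solve the equation"]])
# bridged == ["s0: restate problem", "s1: set up equation",
#             "s1.5: solve the equation", "s2: final answer"]
```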

Key Designs

  1. Task Formalization — Definition of Thought Leap:

    • Function: Define a completeness function \(V(s_i, s_{i+1})\) to judge whether the reasoning between adjacent steps is sufficient.
    • Mechanism: If \(V(s_k, s_{k+1}) = \text{False}\) for some \(k\), a Thought Leap exists between \(s_k\) and \(s_{k+1}\), and a missing step sequence \(S'_{miss}\) must be generated such that every pair of adjacent steps in the completed chain satisfies the completeness condition.
    • Design Motivation: Unlike prior work focusing on factual accuracy, this paper addresses the overlooked dimension of reasoning structural completeness.
  2. ScaleQM+ Dataset Construction:

    • Function: Systematically remove intermediate steps from the structurally complete ScaleQuestMath dataset to construct "incomplete→complete" training pairs.
    • Mechanism: For a chain of length \(m\), short chains (\(m \leq 10\)) have 1–2 steps removed, and long chains (\(m > 10\)) have 1–3 steps removed; the final step (containing the answer) is always retained; complete chains are preserved with probability 0.2 to teach the model to recognize cases requiring no completion.
    • Design Motivation: The "forward deletion + reverse completion" construction avoids costly manual annotation while naturally guaranteeing ground-truth quality, yielding 588k training samples in total.
  3. CoT-Bridge Model:

    • Function: Fine-tuned on Qwen2.5-Math-7B, the model learns the joint mapping \(f: C \rightarrow (\hat{\mathcal{L}}, \hat{\mathcal{M}})\) for detecting leap positions and generating bridging steps.
    • Mechanism: Contrasted against CoT-Bridge-Random (which generates content given random positions), CoT-Bridge must jointly learn where to bridge and what to insert.
    • Design Motivation: Experiments demonstrate that accurate leap localization is critical — inserting steps at random positions can disrupt reasoning coherence.
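The forward-deletion rule behind ScaleQM+ can be sketched as follows. The removal counts, the length-10 threshold, the always-retained final step, and the 0.2 keep-probability come from the description above; which indices get sampled (here, any step except the last is eligible) is an illustrative assumption:

```python
import random

def make_incomplete(chain, rng):
    """Delete steps from a complete chain to build an 'incomplete -> complete'
    training pair. Returns (incomplete_chain, removed_indices)."""
    m = len(chain)
    # With probability 0.2, keep the chain complete so the Bridge model
    # also learns to recognize cases requiring no completion.
    if rng.random() < 0.2:
        return list(chain), []
    # Short chains (m <= 10): remove 1-2 steps; long chains (m > 10): 1-3.
    max_remove = 2 if m <= 10 else 3
    n_remove = rng.randint(1, max_remove)
    # The final step (containing the answer) is always retained.
    candidates = list(range(m - 1))
    removed = sorted(rng.sample(candidates, min(n_remove, len(candidates))))
    incomplete = [s for i, s in enumerate(chain) if i not in removed]
    return incomplete, removed
```

The removed indices serve as ground-truth leap positions, and the deleted steps as ground-truth completions, which is what makes this construction annotation-free.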

Data Augmentation Application

CoT-Bridge is applied to MetaMathQA and NuminaMath-CoT, adapting to each dataset's step delimiter ("\n" or "\n\n") to generate augmented versions MetaMath-Bridge and NuminaMath-Bridge.
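The delimiter adaptation reduces to splitting each CoT string on the dataset's own step separator and rejoining the bridged steps with the same separator (a trivial sketch; function names are ours):

```python
def split_steps(cot, delimiter):
    """Split a CoT string into steps, dropping empty fragments."""
    return [s for s in cot.split(delimiter) if s.strip()]

def join_steps(steps, delimiter):
    """Reassemble a (possibly bridged) step list into one CoT string."""
    return delimiter.join(steps)

# NuminaMath-style "\n\n" delimiter:
cot = "Step 1: expand the product.\n\nStep 2: the answer is 4."
steps = split_steps(cot, "\n\n")
# steps == ["Step 1: expand the product.", "Step 2: the answer is 4."]
```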

Key Experimental Results

Main Results (Meta-Llama3.1-8B + NuminaMath, average over 6 benchmarks)

| Method | GSM8K | MATH500 | GaoKao2023EN | AMC23 | Avg. |
| --- | --- | --- | --- | --- | --- |
| Direct SFT | 84.86 | 51.45 | 49.03 | 20.00 | 43.87 |
| QwenBridger-72B | 85.25 | 54.20 | 51.62 | 35.00 | 48.41 |
| CoT-Bridge-Random | 84.82 | 54.20 | 51.88 | 33.75 | 48.50 |
| CoT-Bridge | 85.97 | 56.80 | 54.42 | 35.63 | 49.74 (+5.87) |

Plug-and-Play Augmentation (Qwen2.5-Math-1.5B + Distill/Reject Sampling)

| Configuration | GSM8K | MATH500 | Avg. | Note |
| --- | --- | --- | --- | --- |
| Distill - Direct SFT | 81.86 | 68.15 | 55.23 | Distillation data, direct training |
| Distill - CoT-Bridge | 82.52 | 71.50 | 58.25 (+3.02) | Completion applied after distillation |
| Reject Sampling - Direct SFT | 83.36 | 74.90 | 60.44 | Rejection sampling data, direct training |
| Reject Sampling - CoT-Bridge | 83.74 | 75.25 | 61.81 (+1.37) | Completion applied after sampling |

Key Findings

  • Leap localization accuracy is critical: CoT-Bridge-Random causes performance degradation on multiple benchmarks (e.g., GaoKao -1.56% and MathOdyssey -3.68% on Qwen2.5-Math-1.5B + NuminaMath), while CoT-Bridge yields consistent gains.
  • Largest gains on competition-level problems: AMC23 sees a +15.63% improvement for LLaMA, indicating that harder problems benefit most from complete reasoning chains.
  • Improved OOD generalization: Across 5 out-of-domain logical reasoning datasets, LLaMA achieves an average improvement of +2.99%, with a concurrent reduction in invalid response rate.
  • Enhanced RL cold-start: Using Bridge-augmented data for SFT cold-start followed by GRPO yields a final RL accuracy of 63.98% vs. 60.88% (+3.10%).

Highlights & Insights

  • Novel problem formulation: Unlike prior work focusing on factual errors and answer accuracy, this paper is the first to systematically study the structural completeness of CoT chains, formalizing Thought Leap as a detectable and repairable task — an insightful and underexplored perspective.
  • Elegant data construction: Constructing training pairs by deleting steps from complete data and learning to restore them avoids annotation costs while naturally guaranteeing ground-truth quality, since the supervision signal derives from the original complete chains.
  • Practical plug-and-play design: CoT-Bridge can be seamlessly integrated on top of distillation, rejection sampling, and RL pipelines as a general-purpose data quality enhancement module, with low barriers to adaptation in other settings.

Limitations & Future Work

  • Reliance on the completeness assumption of ScaleQuestMath: The approach treats ScaleQuestMath as an approximately ideal complete CoT source; however, this dataset may itself contain Thought Leaps, which limits the quality ceiling of the Bridge model's completions.
  • Restricted to the mathematical domain: Although OOD gains are observed on logical reasoning tasks, generalization to code generation, scientific reasoning, and other domains remains unvalidated.
  • Fixed Bridge model scale at 7B: The scaling behavior of larger or smaller Bridge models is unexplored, as is the feasibility of distilling Bridge capabilities into smaller models.
  • Lack of fine-grained completion quality evaluation: Completion quality is assessed primarily via downstream task accuracy; direct evaluation of the mathematical correctness and logical coherence of generated steps is absent.

Rating

  • Novelty: ⭐⭐⭐⭐ First formalization of the Thought Leap problem with a unique perspective, though the core method (deletion then recovery) is relatively intuitive.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ Comprehensive validation across multiple models, datasets, and settings (distillation/RL/OOD) with thorough ablation studies.
  • Writing Quality: ⭐⭐⭐⭐ Problem definition is clear and experimental organization is well-structured.
  • Value: ⭐⭐⭐⭐ Offers a practical plug-and-play data quality enhancement tool with concrete guidance for CoT dataset construction.