A*-Thought: Efficient Reasoning via Bidirectional Compression for Low-Resource Settings¶
Conference: NeurIPS 2025 arXiv: 2505.24550 Code: GitHub Area: LLM Reasoning / Model Compression Keywords: reasoning efficiency, CoT compression, A* search, bidirectional importance, long-to-short
TL;DR¶
This paper proposes A*-Thought, a CoT compression framework based on the A* search algorithm. It introduces Bidirectional Importance Scoring (BIS) to measure each reasoning step's relevance to both the question and the answer, and combines it with path-level A* search to efficiently identify the most compact valid reasoning path within an exponentially large search space. Under a 512-token budget, A*-Thought improves QwQ-32B accuracy by 2.39×; under a 4096-token budget, it reduces output tokens by approximately 50% with negligible accuracy loss.
Background & Motivation¶
Background: Large Reasoning Models (LRMs) such as o1, R1, and QwQ achieve strong reasoning capabilities via long chain-of-thought (CoT), but their verbose reasoning traces incur substantial inference overhead, severely limiting deployment in resource-constrained settings.
Limitations of Prior Work: (a) Existing long-to-short methods commonly assume "overthinking" and attempt to compress CoT directly, often at the cost of performance degradation; (b) prompt-based methods (e.g., CoD) yield imprecise compression; (c) training-based methods (e.g., TokenSkip) require additional training with limited effectiveness; (d) no unified framework exists for systematically identifying the optimal subset from an exponentially large search space.
Key Challenge: Not all steps in a CoT are equally important, yet identifying which steps are necessary and which are dispensable is a combinatorial explosion problem—\(N\) steps yield \(2^N\) possible subsets.
Key Insight: CoT compression is formalized as a search problem, with the A* algorithm applied to efficiently search over ranked reasoning steps.
Core Idea: Bidirectional Importance Scoring (considering each step's relevance to both the question and the answer) combined with A* search (current path quality + estimated cost to reach the correct answer) to find the shortest valid reasoning path in a vast search space.
Method¶
Overall Architecture¶
A two-level design: (1) Step level — Bidirectional Importance Scoring (BIS) evaluates the criticality of each reasoning step to guide search priority; (2) Path level — A* search iteratively expands nodes on a search tree, using a cost function to identify the shortest reasoning path that leads to a correct solution.
Key Designs¶
- Bidirectional Importance Scoring (BIS):
- Function: Quantifies the importance of each reasoning step \(\mathbf{t}^{(n)}\)
- Mechanism: A good intermediate step should be highly relevant to both the question (forward) and the answer (backward)
- Attention level: \(\text{ATTN}(\mathbf{q}|\mathbf{t}^{(n)})\) and \(\text{ATTN}(\mathbf{s}|\mathbf{t}^{(n)})\)
- Model level: \(\text{NLL}(\mathbf{q}|\mathbf{t}^{(n)})\) and \(\text{NLL}(\mathbf{s}|\mathbf{t}^{(n)})\)
- BIS: a weighted combination of the forward and backward scores (attention-based or NLL-based), with \(\alpha\) controlling the balance between question and answer relevance
- Computed using a lightweight model (GPT-2) for efficiency
- Key observation: BIS scores are highly skewed — only a small fraction of steps simultaneously contribute strongly to both the question and the answer
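The bidirectional scoring idea can be sketched in a few lines. Here the `nll` function is a toy word-overlap proxy standing in for the lightweight scorer LM (the paper uses GPT-2), and the `exp(-NLL)` relevance mapping plus the example strings are illustrative assumptions, not the paper's exact formula:

```python
import math

def nll(target: str, context: str) -> float:
    """Toy stand-in for NLL(target | context) under a small scorer LM;
    word overlap serves as a proxy so the sketch runs without weights."""
    t, c = set(target.split()), set(context.split())
    overlap = len(t & c) / max(len(t), 1)
    return -math.log(max(overlap, 1e-6))

def bis(question: str, answer: str, step: str, alpha: float = 0.5) -> float:
    """Bidirectional Importance Score of one reasoning step: combine the
    step's relevance to the question (forward) and to the answer
    (backward), balanced by alpha."""
    fwd = nll(question, step)  # NLL(q | t^(n)): question relevance
    bwd = nll(answer, step)    # NLL(s | t^(n)): answer relevance
    # exp(-NLL) maps a negative log-likelihood to a relevance in (0, 1].
    return alpha * math.exp(-fwd) + (1 - alpha) * math.exp(-bwd)

question = "how many apples x"
answer = "x equals 7"
steps = [
    "let x be the number of apples",
    "the weather is nice today",        # irrelevant filler step
    "so x equals 7 which is the answer",
]
# Rank steps by BIS (descending) to build the best-first search queue.
ranked = sorted(steps, key=lambda s: bis(question, answer, s), reverse=True)
```

Note how the filler step, relevant to neither the question nor the answer, falls to the bottom of the queue, matching the paper's observation that BIS scores are highly skewed.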
- Path-Level A* Search:
- Function: Efficiently identifies the shortest valid reasoning path within a \(2^N\) search space
- Initialization: All steps are ranked in descending BIS order to form a search queue (best-first strategy)
- Node representation: Each node is not a single step but a triplet \(\mathbf{r}^{(n)} = \langle \mathbf{t}^{(n-1)}, \mathbf{t}^{(n)}, \mathbf{t}^{(n+1)} \rangle\) consisting of a step and its neighbors — mitigating fragmentation
- Verification: When path depth \(\geq k_{min}\), a verification model checks whether the current path leads to the correct answer
- Expansion: Upon verification failure, \(W\) candidate steps are drawn from the queue as child nodes
- Cost Function Design:
- Total cost: \(f = g + h\) (classical A* framework)
- Current cost \(g\): negative log-likelihood of the current path under the verification model (path quality)
- Future cost \(h\): conditional self-information of generating the correct answer given the current path, \(\mathcal{I}(\mathbf{s}|\mathbf{q}, \mathbf{t}'_k) = -\frac{1}{|\mathbf{s}|}\log P_\mathcal{V}(\mathbf{s}|\mathbf{q}, \mathbf{t}'_k)\)
- Intuition: A larger \(h\) indicates that the current path is farther from the correct answer and requires additional steps
- A depth upper bound \(k_{max}\) prevents excessive search depth
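Putting the pieces together, the path-level search can be sketched as a best-first loop over the BIS-ranked queue with \(f = g + h\). The `path_nll`, `answer_self_info`, and `verified` placeholders stand in for the 32B verification model, and the token-coverage heuristic is an illustrative assumption rather than the paper's implementation:

```python
import heapq

ANSWER_TOKENS = ("x", "7")  # toy gold answer for this sketch

def path_nll(path):
    """Placeholder g: NLL of the current path under the verifier;
    here proportional to length, so shorter paths cost less."""
    return 0.1 * len(path)

def answer_self_info(path):
    """Placeholder h: conditional self-information of the answer given
    the path; here, the fraction of answer tokens the path misses."""
    covered = set(" ".join(path).split())
    missing = [t for t in ANSWER_TOKENS if t not in covered]
    return len(missing) / len(ANSWER_TOKENS)

def verified(path, k_min=1):
    """Placeholder verifier (the paper uses s1.1-32B): accept once the
    path is deep enough and fully covers the answer."""
    return len(path) >= k_min and answer_self_info(path) == 0.0

def a_star_compress(ranked_steps, width=2, k_max=4):
    """Best-first search: expand the cheapest partial path (f = g + h)
    until the verifier accepts, bounded by depth k_max."""
    frontier = [(0.0, [])]  # (f, path)
    while frontier:
        f, path = heapq.heappop(frontier)
        if verified(path):
            return path
        if len(path) >= k_max:
            continue
        # Expand with the top-W remaining candidates from the BIS queue.
        for step in [s for s in ranked_steps if s not in path][:width]:
            new = path + [step]
            heapq.heappush(frontier, (path_nll(new) + answer_self_info(new), new))
    return ranked_steps  # fall back to the full chain

ranked = [
    "so x equals 7 which is the answer",
    "let x be the number of apples",
    "the weather is nice today",
]
compact = a_star_compress(ranked)
```

With the toy verifier, the search terminates after keeping a single high-BIS step, illustrating how the cost function steers toward the shortest path that still reaches the answer.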
Loss & Training¶
- A* search is used offline to compress CoT data
- The resulting compact reasoning paths are used for SFT distillation, training LRMs to directly generate efficient reasoning traces
- Verification model: s1.1-32B
- Training data: long CoT from the s1K-1.1 dataset
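The distillation step amounts to packaging each compressed path into a training record for SFT. The field names and prompt format below are assumptions for illustration, not the paper's data schema:

```python
def to_sft_example(question, compact_steps, answer):
    """Assemble one SFT record: the model learns to emit the compressed
    reasoning trace directly, followed by the final answer."""
    return {
        "prompt": question,
        "completion": " ".join(compact_steps) + f" Final answer: {answer}",
    }

ex = to_sft_example(
    "how many apples x",
    ["so x equals 7 which is the answer"],  # A*-compressed path
    "7",
)
```

Training on such records teaches the LRM to produce the compact trace in one forward pass, so no search is needed at inference time.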
Key Experimental Results¶
Main Results (QwQ-32B, varying token budgets)¶
| Method | 512 Avg Acc. | 512 ACU | 1024 Avg Acc. | 2048 Avg Acc. | 4096 Avg Acc. | 4096 Avg Len. |
|---|---|---|---|---|---|---|
| QwQ-32B baseline | 12.3% | 2.41 | 21.8% | 43.8% | 63.2% | 2812 |
| + CoD | 17.3% | 3.34 | 24.3% | 51.3% | 69.3% | 2804 |
| + BtC Effective | 19.5% | 3.80 | 25.0% | 52.3% | 68.6% | 2795 |
| + TokenSkip | 16.6% | 3.20 | 22.8% | 48.8% | 64.9% | 2549 |
| + A*-Thought | 29.4% | 5.99 | 48.1% | 58.9% | 69.3% | 1877 |
Out-of-Domain Generalization (LiveCodeBench + MMLU)¶
| Budget | QwQ-32B Acc. | + A*-Thought Acc. | + A*-Thought ACU |
|---|---|---|---|
| 512 | 18.8% | 30.6% | 6.74 |
| 1024 | 28.7% | 41.9% | 5.37 |
| 4096 | 45.8% | 59.7% | 3.16 |
Key Findings¶
- Substantial gains under low budgets: At 512 tokens, accuracy improves from 12.3% to 29.4% (2.39×), and ACU from 2.41 to 5.99 (2.49×)
- Token reduction at high budgets: At a 4096-token budget, QwQ-32B's average output length drops from 2812 to 1877 tokens (−33.2%) while accuracy improves from 63.2% to 69.3%
- Cross-model generality: Effective across QwQ-32B, R1-Distill-32B, and s1.1-32B
- Out-of-domain generalization: Trained solely on math data, yet yields significant gains on LiveCodeBench and MMLU
Highlights & Insights¶
- Formalizing CoT compression as a search problem: Viewing reasoning compression through the lens of combinatorial optimization is more principled than heuristic deletion
- Key insight of bidirectional importance: Relevance to the question alone is insufficient; relevance to the answer must also be considered — intermediate steps that appear superficially unrelated should be retained if they contribute to deriving the answer
- Triplet node design: Steps are not selected in isolation; their immediate neighbors are preserved to avoid fragmented reasoning chains
- Natural fit of A* search: The cost function is elegantly designed — the current cost measures path quality while the future cost measures the remaining difficulty of reaching the answer
Limitations & Future Work¶
- Dependency on pre-generated CoT: A* search is a post-processing method that requires a complete long CoT to be generated first; it cannot reduce the cost of initial inference
- Verification model overhead: Each expansion and verification step during A* search requires invoking a 32B verification model, making the search process itself costly
- BIS requires the ground-truth answer: Computing BIS presupposes knowledge of the correct answer \(\mathbf{s}\), limiting direct applicability in unannotated settings
- Future directions: (1) Train a forward BIS predictor that does not require the answer; (2) explore online CoT compression (pruning during generation); (3) reduce search cost by using smaller verification models
Related Work & Insights¶
- vs. CoD (Chain-of-Draft): CoD uses prompting to elicit concise reasoning, but compression is imprecise; A*-Thought precisely selects the most critical steps from existing CoT
- vs. TokenSkip: TokenSkip requires training a token-level skipping policy; A*-Thought operates at the reasoning-step level, which is a more appropriate granularity
- vs. Budget Forcing (s1): Budget Forcing controls length via truncation or extension but does not optimize which content is retained; A*-Thought explicitly optimizes the combination of preserved steps
- Insight: The success of A* search in reasoning compression suggests that CoT contains substantial redundancy — truly critical reasoning steps may constitute only 30–50% of the full trace
Rating¶
- Novelty: ⭐⭐⭐⭐ The combination of A* search and bidirectional importance scoring is original and insightful
- Experimental Thoroughness: ⭐⭐⭐⭐ Three models × four budgets × six benchmarks, with comprehensive ablation studies
- Writing Quality: ⭐⭐⭐⭐ Problem formulation is clear and methodological derivation is rigorous
- Value: ⭐⭐⭐⭐ Directly applicable practical value for deploying LRMs in resource-constrained settings