A*-Thought: Efficient Reasoning via Bidirectional Compression for Low-Resource Settings¶
Conference: NeurIPS 2025 arXiv: 2505.24550 Code: GitHub Area: LLM Reasoning / Model Compression Keywords: reasoning efficiency, CoT compression, A* search, bidirectional importance, long-to-short
TL;DR¶
This paper proposes A*-Thought, a CoT compression framework based on the A* search algorithm. It introduces Bidirectional Importance Scoring (BIS) to measure each reasoning step's relevance to both the question and the answer, and combines it with path-level A* search to efficiently identify the most compact valid reasoning path within an exponentially large search space. Under a 512-token budget, A*-Thought improves QwQ-32B accuracy by 2.39×; under a 4096-token budget, it reduces output tokens by approximately 50% with negligible accuracy loss.
Background & Motivation¶
Background: Large Reasoning Models (LRMs) such as o1, R1, and QwQ achieve strong reasoning capabilities via long chain-of-thought (CoT), but their verbose reasoning traces incur substantial inference overhead, severely limiting deployment in resource-constrained settings.
Limitations of Prior Work: (a) Existing long-to-short methods commonly assume "overthinking" and attempt to compress CoT directly, often at the cost of performance degradation; (b) prompt-based methods (e.g., CoD) yield imprecise compression; (c) training-based methods (e.g., TokenSkip) require additional training with limited effectiveness; (d) no unified framework exists for systematically identifying the optimal subset from an exponentially large search space.
Key Challenge: Not all steps in a CoT are equally important, yet identifying which steps are necessary and which are dispensable is a combinatorial explosion problem—\(N\) steps yield \(2^N\) possible subsets.
Key Insight: CoT compression is formalized as a search problem, with the A* algorithm applied to efficiently search over ranked reasoning steps.
Core Idea: Bidirectional Importance Scoring (considering each step's relevance to both the question and the answer) combined with A* search (current path quality + estimated cost to reach the correct answer) to find the shortest valid reasoning path in a vast search space.
Method¶
Overall Architecture¶
A two-level design: (1) Step level — Bidirectional Importance Scoring (BIS) evaluates the criticality of each reasoning step to guide search priority; (2) Path level — A* search iteratively expands nodes on a search tree, using a cost function to identify the shortest reasoning path that leads to a correct solution.
Key Designs¶
- Bidirectional Importance Scoring (BIS):
- Function: Quantifies the importance of each reasoning step \(\mathbf{t}^{(n)}\)
- Mechanism: A good intermediate step should be highly relevant to both the question (forward) and the answer (backward)
- Attention level: \(\text{ATTN}(\mathbf{q}|\mathbf{t}^{(n)})\) and \(\text{ATTN}(\mathbf{s}|\mathbf{t}^{(n)})\)
- Model level: \(\text{NLL}(\mathbf{q}|\mathbf{t}^{(n)})\) and \(\text{NLL}(\mathbf{s}|\mathbf{t}^{(n)})\)
- BIS: a weighted combination of the forward and backward scores (attention-based or NLL-based), with \(\alpha\) controlling the balance between question and answer relevance
- Computed using a lightweight model (GPT-2) for efficiency
- Key observation: BIS scores are highly skewed — only a small fraction of steps simultaneously contribute strongly to both the question and the answer
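The bidirectional scoring idea can be sketched in a few lines. Here the `nll` function is a toy word-overlap proxy standing in for the lightweight scorer LM (the paper uses GPT-2), and the `exp(-NLL)` relevance mapping plus the example strings are illustrative assumptions, not the paper's exact formula:

```python
import math

def nll(target: str, context: str) -> float:
    """Toy stand-in for NLL(target | context) under a small scorer LM;
    word overlap serves as a proxy so the sketch runs without weights."""
    t, c = set(target.split()), set(context.split())
    overlap = len(t & c) / max(len(t), 1)
    return -math.log(max(overlap, 1e-6))

def bis(question: str, answer: str, step: str, alpha: float = 0.5) -> float:
    """Bidirectional Importance Score of one reasoning step: combine the
    step's relevance to the question (forward) and to the answer
    (backward), balanced by alpha."""
    fwd = nll(question, step)  # NLL(q | t^(n)): question relevance
    bwd = nll(answer, step)    # NLL(s | t^(n)): answer relevance
    # exp(-NLL) maps a negative log-likelihood to a relevance in (0, 1].
    return alpha * math.exp(-fwd) + (1 - alpha) * math.exp(-bwd)

question = "how many apples x"
answer = "x equals 7"
steps = [
    "let x be the number of apples",
    "the weather is nice today",        # irrelevant filler step
    "so x equals 7 which is the answer",
]
# Rank steps by BIS (descending) to build the best-first search queue.
ranked = sorted(steps, key=lambda s: bis(question, answer, s), reverse=True)
```

Note how the filler step, relevant to neither the question nor the answer, falls to the bottom of the queue, matching the paper's observation that BIS scores are highly skewed.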
- Path-Level A* Search:
- Function: Efficiently identifies the shortest valid reasoning path within a \(2^N\) search space
- Initialization: All steps are ranked in descending BIS order to form a search queue (best-first strategy)
- Node representation: Each node is not a single step but a triplet \(\mathbf{r}^{(n)} = \langle \mathbf{t}^{(n-1)}, \mathbf{t}^{(n)}, \mathbf{t}^{(n+1)} \rangle\) consisting of a step and its neighbors — mitigating fragmentation
- Verification: When path depth \(\geq k_{min}\), a verification model checks whether the current path leads to the correct answer
- Expansion: Upon verification failure, \(W\) candidate steps are drawn from the queue as child nodes
- Cost Function Design:
- Total cost: \(f = g + h\) (classical A* framework)
- Current cost \(g\): negative log-likelihood of the current path under the verification model (path quality)
- Future cost \(h\): conditional self-information of generating the correct answer given the current path, \(\mathcal{I}(\mathbf{s}|\mathbf{q}, \mathbf{t}'_k) = -\frac{1}{|\mathbf{s}|}\log P_\mathcal{V}(\mathbf{s}|\mathbf{q}, \mathbf{t}'_k)\)
- Intuition: A larger \(h\) indicates that the current path is farther from the correct answer and requires additional steps
- A depth upper bound \(k_{max}\) prevents excessive search depth
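Putting the pieces together, the path-level search can be sketched as a best-first loop over the BIS-ranked queue with \(f = g + h\). The `path_nll`, `answer_self_info`, and `verified` placeholders stand in for the 32B verification model, and the token-coverage heuristic is an illustrative assumption rather than the paper's implementation:

```python
import heapq

ANSWER_TOKENS = ("x", "7")  # toy gold answer for this sketch

def path_nll(path):
    """Placeholder g: NLL of the current path under the verifier;
    here proportional to length, so shorter paths cost less."""
    return 0.1 * len(path)

def answer_self_info(path):
    """Placeholder h: conditional self-information of the answer given
    the path; here, the fraction of answer tokens the path misses."""
    covered = set(" ".join(path).split())
    missing = [t for t in ANSWER_TOKENS if t not in covered]
    return len(missing) / len(ANSWER_TOKENS)

def verified(path, k_min=1):
    """Placeholder verifier (the paper uses s1.1-32B): accept once the
    path is deep enough and fully covers the answer."""
    return len(path) >= k_min and answer_self_info(path) == 0.0

def a_star_compress(ranked_steps, width=2, k_max=4):
    """Best-first search: expand the cheapest partial path (f = g + h)
    until the verifier accepts, bounded by depth k_max."""
    frontier = [(0.0, [])]  # (f, path)
    while frontier:
        f, path = heapq.heappop(frontier)
        if verified(path):
            return path
        if len(path) >= k_max:
            continue
        # Expand with the top-W remaining candidates from the BIS queue.
        for step in [s for s in ranked_steps if s not in path][:width]:
            new = path + [step]
            heapq.heappush(frontier, (path_nll(new) + answer_self_info(new), new))
    return ranked_steps  # fall back to the full chain

ranked = [
    "so x equals 7 which is the answer",
    "let x be the number of apples",
    "the weather is nice today",
]
compact = a_star_compress(ranked)
```

With the toy verifier, the search terminates after keeping a single high-BIS step, illustrating how the cost function steers toward the shortest path that still reaches the answer.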
Loss & Training¶
- A* search is used offline to compress CoT data
- The resulting compact reasoning paths are used for SFT distillation, training LRMs to directly generate efficient reasoning traces
- Verification model: s1.1-32B
- Training data: long CoT from the s1K-1.1 dataset
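The distillation step amounts to packaging each compressed path into a training record for SFT. The field names and prompt format below are assumptions for illustration, not the paper's data schema:

```python
def to_sft_example(question, compact_steps, answer):
    """Assemble one SFT record: the model learns to emit the compressed
    reasoning trace directly, followed by the final answer."""
    return {
        "prompt": question,
        "completion": " ".join(compact_steps) + f" Final answer: {answer}",
    }

ex = to_sft_example(
    "how many apples x",
    ["so x equals 7 which is the answer"],  # A*-compressed path
    "7",
)
```

Training on such records teaches the LRM to produce the compact trace in one forward pass, so no search is needed at inference time.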
Key Experimental Results¶
Main Results (QwQ-32B, varying token budgets)¶
| Method | 512 Avg Acc. | 512 ACU | 1024 Avg Acc. | 2048 Avg Acc. | 4096 Avg Acc. | 4096 Avg Len. |
|---|---|---|---|---|---|---|
| QwQ-32B baseline | 12.3% | 2.41 | 21.8% | 43.8% | 63.2% | 2812 |
| + CoD | 17.3% | 3.34 | 24.3% | 51.3% | 69.3% | 2804 |
| + BtC Effective | 19.5% | 3.80 | 25.0% | 52.3% | 68.6% | 2795 |
| + TokenSkip | 16.6% | 3.20 | 22.8% | 48.8% | 64.9% | 2549 |
| + A*-Thought | 29.4% | 5.99 | 48.1% | 58.9% | 69.3% | 1877 |
Out-of-Domain Generalization (LiveCodeBench + MMLU)¶
| Budget | QwQ-32B Acc. | + A*-Thought Acc. | + A*-Thought ACU |
|---|---|---|---|
| 512 | 18.8% | 30.6% | 6.74 |
| 1024 | 28.7% | 41.9% | 5.37 |
| 4096 | 45.8% | 59.7% | 3.16 |
Key Findings¶
- Substantial gains under low budgets: At 512 tokens, accuracy improves from 12.3% to 29.4% (2.39×), and ACU from 2.41 to 5.99 (2.49×)
- Token reduction at high budgets: At a 4096-token budget, QwQ-32B's average output length drops from 2812 to 1877 tokens (−33.2%) while accuracy improves from 63.2% to 69.3%
- Cross-model generality: Effective across QwQ-32B, R1-Distill-32B, and s1.1-32B
- Out-of-domain generalization: Trained solely on math data, yet yields significant gains on LiveCodeBench and MMLU
Highlights & Insights¶
- Formalizing CoT compression as a search problem: Viewing reasoning compression through the lens of combinatorial optimization is more principled than heuristic deletion
- Key insight of bidirectional importance: Relevance to the question alone is insufficient; relevance to the answer must also be considered — intermediate steps that appear superficially unrelated should be retained if they contribute to deriving the answer
- Triplet node design: Steps are not selected in isolation; their immediate neighbors are preserved to avoid fragmented reasoning chains
- Natural fit of A* search: The cost function is elegantly designed — the current cost measures path quality while the future cost measures the remaining difficulty of reaching the answer
Limitations & Future Work¶
- Dependency on pre-generated CoT: A* search is a post-processing method that requires a complete long CoT to be generated first; it cannot reduce the cost of initial inference
- Verification model overhead: Each expansion and verification step during A* search requires invoking a 32B verification model, making the search process itself costly
- BIS requires the ground-truth answer: Computing BIS presupposes knowledge of the correct answer \(\mathbf{s}\), limiting direct applicability in unannotated settings
- Future directions: (1) Train a forward BIS predictor that does not require the answer; (2) explore online CoT compression (pruning during generation); (3) reduce search cost by using smaller verification models
Related Work & Insights¶
- vs. CoD (Chain-of-Draft): CoD uses prompting to elicit concise reasoning, but compression is imprecise; A*-Thought precisely selects the most critical steps from existing CoT
- vs. TokenSkip: TokenSkip requires training a token-level skipping policy; A*-Thought operates at the reasoning-step level, which is a more appropriate granularity
- vs. Budget Forcing (s1): Budget Forcing controls length via truncation or extension but does not optimize which content is retained; A*-Thought explicitly optimizes the combination of preserved steps
- Insight: The success of A* search in reasoning compression suggests that CoT contains substantial redundancy — truly critical reasoning steps may constitute only 30–50% of the full trace
Rating¶
- Novelty: ⭐⭐⭐⭐ The combination of A* search and bidirectional importance scoring is original and insightful
- Experimental Thoroughness: ⭐⭐⭐⭐ Three models × four budgets × six benchmarks, with comprehensive ablation studies
- Writing Quality: ⭐⭐⭐⭐ Problem formulation is clear and methodological derivation is rigorous
- Value: ⭐⭐⭐⭐ Directly applicable practical value for deploying LRMs in resource-constrained settings