When Reasoning Meets Compression: Understanding the Effects of LLMs Compression on Large Reasoning Models

Conference: ICLR2026
arXiv: 2504.02010
Code: github.com/psunlpgroup/Compression-Effects
Area: LLM Reasoning
Keywords: Model Compression, Reasoning Models, Quantization, Distillation, Pruning, Interpretability, DeepSeek-R1

TL;DR

This paper systematically studies the effects of three compression methods—quantization, distillation, and pruning—on Large Reasoning Models (LRMs) through performance benchmarking and mechanistic interpretability analysis. Key findings include: parameter count affects knowledge memorization more than reasoning ability; the last-layer MLP up_proj is the most critical component; and current quantization methods over-compress the final layers.

Background & Motivation

Large reasoning models such as DeepSeek-R1 achieve strong performance on complex reasoning tasks but incur high deployment costs.
Prior compression research faces two bottlenecks:
- Evaluation bottleneck: Existing quantization/pruning evaluations primarily rely on perplexity and simple tasks, with insufficient testing on complex reasoning benchmarks.
- Analysis bottleneck: In-depth interpretability analysis of compression effects is lacking.
Core question: How is the reasoning capability of LRMs degraded during compression, and which weights are most critical for reasoning?

Method

1. Evaluation Framework

Model selection: DeepSeek-R1 (671B) and its compressed variants:
- Quantization: Unsloth dynamic quantization (2.51/1.73/1.58-bit), AWQ, GPTQ, GPTAQ, ANY4/ANY3
- Distillation: R1-Distill-Llama (70B/8B), R1-Distill-Qwen (32B/7B)
- Pruning: SparseGPT, AlphaPruning (multiple sparsity levels)
Evaluation datasets (of varying difficulty):
- AIME 2024 (mathematical reasoning)
- FOLIO (logical reasoning)
- Temporal Sequences (temporal reasoning, from BIG-Bench Hard)
- MuSiQue (multi-hop reasoning, closed-book setting to test knowledge + reasoning)

2. Mechanistic Interpretability Analysis

The analysis targets four core reasoning behaviors: backtracking, uncertainty estimation, example testing, and adding knowledge.

Direction vector extraction via mean difference: For each linear module \(m\) at layer \(\ell\), the direction vector for behavior \(c\) is extracted as:

\[\mathbf{u}_{m\ell}^c = \frac{1}{|\mathcal{D}_+|} \sum_{s_i^c \in \mathcal{D}_+} \bar{\mathbf{a}}_{m\ell}^c(s_i^c) - \frac{1}{|\mathcal{D}_-|} \sum_{s_j \in \mathcal{D}_-} \bar{\mathbf{a}}_{m\ell}(s_j)\]

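where \(\bar{\mathbf{a}}_{m\ell}^c(s_i^c)\) is the activation of module \(m\) at layer \(\ell\), averaged over the tokens of example \(s_i^c\) that express behavior \(c\). \(\mathcal{D}_+\) is the set of examples that exhibit the behavior and \(\mathcal{D}_-\) is the set of examples that do not.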
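
The mean-difference formula above can be illustrated with a minimal, standard-library Python sketch. The function names and the toy 2-dimensional activations are illustrative assumptions, not taken from the paper's code; in practice the activations would be recorded from a real module.

```python
from statistics import fmean


def mean_vector(vectors):
    """Element-wise mean of equal-length vectors."""
    return [fmean(column) for column in zip(*vectors)]


def direction_vector(pos_token_acts, neg_token_acts):
    """Mean-difference direction for one module at one layer.

    pos_token_acts: one entry per example showing the behavior; each entry
        is a list of per-token activation vectors over the behavior span.
    neg_token_acts: same layout for examples without the behavior.
    """
    # Average over tokens inside each example (the "a-bar" terms).
    pos_means = [mean_vector(tokens) for tokens in pos_token_acts]
    neg_means = [mean_vector(tokens) for tokens in neg_token_acts]
    # Average over examples, then subtract the negative mean.
    pos_center = mean_vector(pos_means)
    neg_center = mean_vector(neg_means)
    return [p - n for p, n in zip(pos_center, neg_center)]


# Toy activations (made-up numbers).
positive = [
    [[1.0, 0.0], [3.0, 2.0]],  # example 1, two tokens, mean [2.0, 1.0]
    [[4.0, 3.0]],              # example 2, one token, mean [4.0, 3.0]
]
negative = [
    [[1.0, 1.0], [1.0, 1.0]],  # mean [1.0, 1.0]
]
u = direction_vector(positive, negative)
print(u)  # [2.0, 1.0]
```

Averaging within each example first means that long behavior spans do not outweigh short ones when the per-example means are combined.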

Importance score computation via attribution patching:

\[\mathbf{I}_{m\ell}^c \approx \frac{1}{|\mathcal{D}_+|} \left| \sum_{s_i^c \in \mathcal{D}_+} (\tilde{\mathbf{u}}_{m\ell}^c)^\top \frac{\partial}{\partial \mathbf{a}_{m\ell}} \mathcal{L}(s_i^c) \right|\]

A higher \(\mathbf{I}_{m\ell}^c\) indicates a stronger causal relationship between the module and reasoning behavior \(c\).
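
As a rough illustration of the importance formula, the sketch below takes the direction \(\tilde{\mathbf{u}}\) and the per-example loss gradients as plain lists. How \(\tilde{\mathbf{u}}\) is derived from \(\mathbf{u}\) is not spelled out in this summary, so it is treated as a given input here; the function name and the numbers are hypothetical.

```python
def importance_score(u_tilde, grads):
    """Approximate importance of one module for one behavior.

    u_tilde: direction vector for the module (assumed to be given).
    grads: one gradient vector dL/da per positive example.
    """
    # Project each example's gradient onto the direction and sum the results.
    total = sum(
        sum(u * g for u, g in zip(u_tilde, grad))
        for grad in grads
    )
    # The absolute value is taken after summing, then divided by |D+|.
    return abs(total) / len(grads)


# Made-up gradients for two positive examples.
u_tilde = [1.0, 0.0]
grads = [[0.5, 2.0], [-1.5, 1.0]]
score = importance_score(u_tilde, grads)
print(score)  # 0.5
```

Because the absolute value sits outside the sum, contributions with opposite signs cancel before the score is computed, which is what the formula as written implies.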

Compression effect decoding: The impact of compression is tracked by measuring changes in relative importance \(\mathbf{RI}_{m\ell}^c\) (importance shift).
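
The summary does not define \(\mathbf{RI}_{m\ell}^c\), so the following sketch rests on an assumption: relative importance is taken to be each module's share of the total importance within one model, and the shift is the difference in that share between the compressed and the original model. The module labels and scores are made up for illustration.

```python
def relative_importance(scores):
    """Normalize raw importance scores so they sum to 1 (assumed definition)."""
    total = sum(scores.values())
    return {name: value / total for name, value in scores.items()}


def importance_shift(original, compressed):
    """Change in relative importance per module after compression."""
    ri_orig = relative_importance(original)
    ri_comp = relative_importance(compressed)
    return {name: ri_comp[name] - ri_orig[name] for name in original}


# Hypothetical raw importance scores before and after compression.
original = {"32_up": 6.0, "32_gate": 3.0, "32_v": 1.0}
compressed = {"32_up": 2.0, "32_gate": 2.0, "32_v": 1.0}

shift = importance_shift(original, compressed)
most_reduced = min(shift, key=shift.get)
print(most_reduced)  # 32_up
```

Under this reading, a strongly negative shift flags a module whose contribution the compression method has suppressed relative to the rest of the network.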

Key Experimental Results

Overall Performance Comparison

| Model | Params | Compression | AIME 2024 | FOLIO | Temporal | Avg | MuSiQue (EM, F1) |
|---|---|---|---|---|---|---|---|
| DeepSeek-R1 | 671B | None | 73.3 | 76.4 | 99.6 | 83.1 | (17.0, 27.51) |
| DeepSeek-R1 | 671B | 2.51-bit | 76.7 | 77.8 | 100.0 | 84.8 | (17.0, 24.43) |
| DeepSeek-R1 | 671B | 1.58-bit | 66.7 | 75.4 | 94.0 | 78.7 | (14.0, 22.34) |
| R1-Distill-Llama | 70B | Distillation | 65.6 | 79.8 | 99.9 | 81.8 | (13.3, 21.57) |
| R1-Distill-Qwen | 32B | Distillation | 64.4 | 82.3 | 99.9 | 82.2 | (2.7, 10.95) |
| R1-Distill-Llama | 8B | Distillation | 42.2 | 71.9 | 81.5 | 65.2 | (0.0, 4.43) |
| R1-Distill-Llama | 70B | 50% SparseGPT | 23.3 | 71.6 | 97.6 | 64.2 | (6.7, 13.49) |

Selective Quantization Validates Component Importance

| Quantized Component | Rank | AIME 2024 | FOLIO | Temporal | Avg |
|---|---|---|---|---|---|
| 32_up (last-layer up_proj) | Global #1 | 20.0 | 63.1 | 63.6 | 48.9 |
| 32_gate | #2 in column | 33.3 | 62.1 | 67.2 | 54.2 |
| 32_v | Last in column | 43.3 | 68.0 | 79.6 | 63.6 |
| Unquantized baseline | - | 42.2 | 71.9 | 81.5 | 65.2 |

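Quantizing only 32_up (0.7% of total weights) lowers average accuracy by 16.3 percentage points, from 65.2 to 48.9. The unquantized baseline row matches the R1-Distill-Llama-8B scores in the overall comparison table.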
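
The 0.7% figure can be sanity-checked with a short parameter count. The sketch below assumes the publicly documented Llama-3.1-8B shapes (hidden size 4096, MLP intermediate size 14336, 32 layers, vocabulary 128256, grouped-query attention with a 1024-dimensional key/value projection, untied output head); these values are not stated in the summary and are used only for illustration.

```python
# Assumed Llama-3.1-8B shapes (an assumption, not given in the text above).
HIDDEN = 4096
INTERMEDIATE = 14336
LAYERS = 32
VOCAB = 128256
KV_DIM = 1024  # 8 key/value heads x 128 dims each

attn = 2 * HIDDEN * HIDDEN + 2 * HIDDEN * KV_DIM  # q_proj, o_proj + k_proj, v_proj
up_proj = HIDDEN * INTERMEDIATE                   # one MLP projection
mlp = 3 * up_proj                                 # gate_proj, up_proj, down_proj
norms = 2 * HIDDEN                                # two RMSNorm weights per layer
per_layer = attn + mlp + norms

# Embedding table, untied output head, and final norm sit outside the layers.
total = LAYERS * per_layer + 2 * VOCAB * HIDDEN + HIDDEN

up_share = up_proj / total
mlp_share = mlp / total
print(f"{up_share:.2%} {mlp_share:.2%}")  # 0.73% 2.19%
```

Under these assumptions a single up_proj matrix is about 0.73% of all weights, consistent with the 0.7% quoted above, and the three MLP matrices of one layer together come to about 2.19%.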

Effect of Protecting Critical Weights

| Compression | Protected | AIME 2024 | FOLIO | Temporal | Avg |
|---|---|---|---|---|---|
| 3-bit AWQ | No | 10.0 | 59.6 | 68.4 | 46.0 |
| 3-bit AWQ | Last-layer MLP protected | 16.7 | 67.0 | 74.0 | 52.57 |

Keeping only ~2% of weights at full precision raises average accuracy by 6.57 percentage points (46.0 to 52.57) and surpasses the previous SOTA quantization method by up to 23.17%.
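
The protection idea amounts to a per-module bit-width plan. The sketch below shows only that selection step, not a quantizer; the Hugging Face-style module names, the `assign_bits` helper, and the 16-bit "full precision" value are assumptions made for illustration.

```python
import re


def assign_bits(module_names, num_layers, low_bits=3, high_bits=16):
    """Keep last-layer MLP modules at high precision; quantize the rest."""
    last = num_layers - 1  # layers are assumed to be indexed from 0
    pattern = re.compile(r"layers\.(\d+)\.mlp\.")
    plan = {}
    for name in module_names:
        match = pattern.search(name)
        protected = match is not None and int(match.group(1)) == last
        plan[name] = high_bits if protected else low_bits
    return plan


# Hypothetical module names for a 32-layer model.
names = [
    "model.layers.0.mlp.up_proj",
    "model.layers.31.self_attn.v_proj",
    "model.layers.31.mlp.gate_proj",
    "model.layers.31.mlp.up_proj",
    "model.layers.31.mlp.down_proj",
]
plan = assign_bits(names, num_layers=32)
for name, bits in plan.items():
    print(f"{name}: {bits}-bit")
```

A plan like this could then be handed to whichever quantization tool is in use, provided that tool supports skipping or overriding individual modules.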

Collapse Point Analysis (SparseGPT at Various Sparsity Levels)

| Sparsity | R1-Distill-Llama-70B AIME | R1-Distill-Llama-70B FOLIO |
|---|---|---|
| 0% | 63.3 | 78.8 |
| 30% | 63.3 | 79.3 |
| 40% | 56.7 | 73.9 |
| 50% | 26.7 | 70.9 |
| 60% | 0.0 | 65.0 |
| 70% | 0.0 | 49.8 |

The collapse point is negatively correlated with task difficulty: AIME collapses at 40–50% sparsity, while FOLIO collapses at 60–70%.
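
One simple way to locate a collapse point in the table above is to find the sparsity interval with the largest accuracy drop. This heuristic and the `steepest_drop` helper are assumptions for illustration; the summary does not say how the paper defines collapse.

```python
def steepest_drop(sparsities, accuracies):
    """Return the (start, end) sparsity interval with the largest accuracy drop."""
    points = list(zip(sparsities, accuracies))
    steps = zip(points, points[1:])  # consecutive (before, after) pairs
    before, after = max(steps, key=lambda step: step[0][1] - step[1][1])
    return before[0], after[0]


# Values copied from the SparseGPT table above.
sparsity = [0, 30, 40, 50, 60, 70]
aime = [63.3, 63.3, 56.7, 26.7, 0.0, 0.0]
folio = [78.8, 79.3, 73.9, 70.9, 65.0, 49.8]

aime_interval = steepest_drop(sparsity, aime)
folio_interval = steepest_drop(sparsity, folio)
print(aime_interval, folio_interval)  # (40, 50) (60, 70)
```

With this heuristic the steepest drop falls at 40–50% sparsity for AIME and at 60–70% for FOLIO, which agrees with the intervals stated above.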

Three Core Findings

Finding 1: Parameter Count Affects Knowledge Memorization More Than Reasoning

- Qwen-based models demonstrate stronger reasoning than Llama-based ones, yet score substantially lower on the knowledge-intensive MuSiQue benchmark.
- Pruning causes knowledge memorization to collapse earlier than reasoning (MuSiQue collapses at 30–40% sparsity).
- Conclusion: For knowledge-intensive tasks, quantization (which preserves parameter count) is preferable to pruning or distillation.

Finding 2: The Last-Layer MLP `up_proj` Is the Most Critical Component

- This pattern is consistently observed across both R1-Distill-Llama-8B and R1-Distill-Qwen-7B.
- Distillation is identified as the cause of this component's heightened importance (the original Llama model does not exhibit this property).
- This finding supplements prior work claiming o_proj is the most important component.

Finding 3: Current Quantization Methods Over-Compress the Last Layer and `gate_proj`

- Both AWQ and GPTQ excessively compress the last-layer modules and mid-layer gate_proj.
- Protecting only the last-layer MLP modules yields a significant improvement (+6.57 percentage points on average).
- This finding generalizes to pruning methods as well.

Highlights & Insights

- First systematic comparison of three compression methods on LRMs: fills a gap in the LRM compression literature.
- Fine-grained interpretability analysis: importance is analyzed at the level of individual linear modules, going beyond prior layer-level analyses.
- High practical value: protecting only 2% of weights yields substantial gains, providing clear guidance for future compression methods.
- Generalizable findings: core findings hold across both R1 and non-R1 model families.
- Integration of theory and practice: each finding is supported by validation experiments.

Limitations & Future Work

- Interpretability analysis uses only 120 instances, limiting statistical power.
- Optimal strategies for mixed-precision quantization remain unexplored (only simple last-layer protection is validated).
- Pruning analysis is less comprehensive than quantization and distillation due to the unavailability of high-sparsity models.
- Distillation analysis is limited to SFT-based approaches; RL-stage distillation is not examined.
- Inference latency and deployment efficiency metrics are not reported.

- Compared to existing compression benchmarks (e.g., EleutherAI harness): this paper employs more challenging reasoning datasets.
- Compared to Venhoff et al.'s layer-level analysis: this paper provides module-level fine-grained analysis.
- Compared to Shao & Wu's claim that o_proj is most important: this paper finds up_proj to be more critical in distilled models.
- Compared to surveys by Liu et al. and Feng et al.: this paper contributes a unique mechanistic interpretability perspective.

- The finding on last-layer MLP up_proj importance can directly guide future quantization and pruning algorithm design.
- The mixed-precision protection strategy is generalizable to broader compression scenarios.
- The knowledge-vs-reasoning decomposition provides a theoretical basis for selecting appropriate compression methods.
- The correlation between collapse points and task difficulty can be used to estimate post-compression capability bounds.

Rating

- Novelty: ⭐⭐⭐⭐ (Novel combination of systematic study and interpretability analysis, though the underlying methods are not original)
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ (Covers quantization/distillation/pruning across multiple models, benchmarks, and validation experiments)
- Writing Quality: ⭐⭐⭐⭐ (Clear structure and concise presentation of findings, though the abundance of tables slightly burdens readability)
- Value: ⭐⭐⭐⭐⭐ (Three core findings are directly actionable for improving compression methods; protecting 2% of weights for a 6.57-point average gain is highly practical)