
SFT Doesn't Always Hurt General Capabilities: Revisiting Domain-Specific Fine-Tuning in LLMs¶

Conference: ICLR 2026
arXiv: 2509.20758
Code: Not released
Area: Model Compression
Keywords: SFT, domain fine-tuning, general capability degradation, learning rate, token-adaptive reweighting, continual learning, LLM

TL;DR¶

This paper systematically revisits how domain-specific SFT affects the general capabilities of LLMs. It shows that a smaller learning rate substantially mitigates general capability degradation, and it proposes Token-Adaptive Loss Reweighting (TALR), which adaptively down-weights the loss of low-probability tokens to further improve the trade-off between domain adaptation and general capability retention.

Background & Motivation¶

Domain SFT is a standard paradigm: LLMs perform well on general tasks but still require SFT to inject domain knowledge for specialized fields such as medicine and e-commerce.
General capability degradation is widely reported: Multiple studies have noted that SFT on domain data severely impairs general capabilities such as mathematical reasoning, code generation, and instruction following, raising doubts about the practicality of SFT.
Prior work uses excessively large learning rates: Existing studies commonly adopt learning rates such as 5e-6 or 2e-5, which may have exaggerated the degradation they report.
Data-oblivious settings are more realistic: Pre-training data is typically unavailable in practice, making mitigation strategies that do not rely on such data more valuable.
Token-level analysis is lacking: Prior research has primarily analyzed degradation at the sample or benchmark level, lacking fine-grained understanding of the learning difficulty of individual tokens in training data.
Theoretical justification is absent: A formal information-theoretic analysis of why learning rate magnitude affects the degree of general capability degradation has not been established.

Method¶

Core Finding: Small Learning Rates Achieve a Favorable Trade-off¶

The authors conduct systematic experiments on two datasets—MedCalc (medical calculation) and ESCI (e-commerce classification)—and find:

Finding 1: Using a smaller learning rate (e.g., 1e-6) substantially reduces general capability degradation while achieving domain performance comparable to larger learning rates. This stands in sharp contrast to the conventional wisdom in deep learning that larger learning rates yield better downstream performance.
Finding 2: When the training objective covers only labels (without chain-of-thought reasoning), the range of learning rates that achieve Pareto optimality is wider, and 5e-6 also performs well.

Theoretical Analysis from an Information-Theoretic Perspective¶

Treating the LLM as a data compressor and leveraging a token tree and arithmetic coding framework, the authors derive:

Proposition 3.1: The expected change in encoding length from model \(\theta_1\) to \(\theta_2\) equals the difference in KL divergences, which can quantify changes in general capability.
Theorem 3.1: A smaller distribution update step \(\lambda\) (corresponding to a smaller learning rate) reduces the upper bound on general capability degradation.
Theorem 3.2: As the number of hard tokens decreases, the safe step-size range expands—explaining why label-only training tolerates larger learning rates.
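
The identity behind Proposition 3.1 can be sketched as follows. The notation is an assumption made here for illustration and may differ from the paper's: \(P\) is the true distribution over general-domain token sequences, \(Q_\theta\) is the model's distribution, and \(L_\theta(x)\) is the arithmetic-coding length of \(x\) under \(Q_\theta\). Up to a small coding overhead, the expected code length is the cross-entropy, which splits into an entropy term and a KL term:

\[
\mathbb{E}_{x \sim P}\big[L_\theta(x)\big] \approx -\mathbb{E}_{x \sim P}\big[\log_2 Q_\theta(x)\big] = H(P) + D_{\mathrm{KL}}(P \,\|\, Q_\theta)
\]

Because \(H(P)\) does not depend on the model, it cancels when two models are compared:

\[
\mathbb{E}_{x \sim P}\big[L_{\theta_2}(x)\big] - \mathbb{E}_{x \sim P}\big[L_{\theta_1}(x)\big] \approx D_{\mathrm{KL}}(P \,\|\, Q_{\theta_2}) - D_{\mathrm{KL}}(P \,\|\, Q_{\theta_1})
\]

A positive difference means \(\theta_2\) compresses general data worse than \(\theta_1\), which is the sense in which the quantity tracks general capability degradation.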

TALR: Token-Adaptive Loss Reweighting¶

The theoretical analysis identifies the gradient contribution of hard tokens (low-probability tokens) as the primary driver of general capability degradation, motivating the proposed TALR method:

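Constrained optimization: Minimizing the weighted loss with entropy regularization over the simplex yields the closed-form solution \(w_i^* \propto p_\theta(x_i)^{1/\tau}\); because the token loss is \(\ell_i = -\log p_\theta(x_i)\), this is a softmax over \(-\ell_i/\tau\)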
Adaptive weights: High-probability (easy) tokens receive larger weights, while low-probability (hard) tokens are down-weighted
Dynamic \(\tau\) parameter: \(\tau\) is set to the median of token losses within the batch and decays automatically during training
Curriculum learning effect: Training initially focuses on easy tokens and progressively incorporates formerly hard tokens as the model improves
Stop-gradient: Weight computation is excluded from backpropagation to ensure optimization stability
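
The weighting rule above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation (no code has been released): the function names `talr_weights` and `talr_loss`, the normalization of weights to sum to 1 within a batch, and the `eps` guard are all choices made here, and the paper may use additional details such as clipping that are not described in this note.

```python
import math
from statistics import median


def talr_weights(token_logprobs, tau=None, eps=1e-8):
    """Return simplex weights proportional to p_i ** (1 / tau), plus the tau used.

    token_logprobs: natural-log probabilities log p_theta(x_i), each <= 0.
    tau: temperature; if None, use the median token loss in the batch
         (assumption: this matches the "dynamic tau" described above).
    """
    losses = [-lp for lp in token_logprobs]
    if tau is None:
        tau = max(median(losses), eps)  # guard against tau == 0
    # Softmax over -loss / tau, shifted by the smallest loss for stability.
    shift = min(losses)
    unnorm = [math.exp(-(loss - shift) / tau) for loss in losses]
    total = sum(unnorm)
    return [u / total for u in unnorm], tau


def talr_loss(token_logprobs):
    """Weighted token loss; the weights are plain floats, i.e. constants."""
    weights, _ = talr_weights(token_logprobs)
    return sum(w * -lp for w, lp in zip(weights, token_logprobs))


# Hypothetical batch: three easy tokens and one hard token.
probs = [0.99, 0.95, 0.90, 0.05]
logprobs = [math.log(p) for p in probs]
weights, tau = talr_weights(logprobs)
weighted = talr_loss(logprobs)
uniform = sum(-lp for lp in logprobs) / len(logprobs)
print(weights, tau, weighted, uniform)
```

In an autograd framework the weights would be computed from detached values (for example with `.detach()` in PyTorch) so that gradients flow only through the token losses, which is the stop-gradient step listed above. In this toy batch the hard token receives a weight close to zero, so the weighted loss is far smaller than the uniform mean.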

Key Experimental Results¶

Table 1: Domain/General Performance Comparison on MedCalc Benchmark at Learning Rate 1e-6¶

Method	Qwen2.5-3B Domain	Qwen2.5-3B General	Qwen3-4B Domain	Qwen3-4B General	Avg. Domain	Avg. General
Standard (Ours)	0.495	0.620	0.548	0.784	0.534	0.692
L2-Reg	0.490	0.621	0.469	0.796	0.506	0.697
LoRA	0.126	0.583	0.195	0.764	0.181	0.490
Wise-FT	0.195	0.629	0.143	0.788	0.198	0.727
FLOW	0.364	0.597	0.477	0.787	0.469	0.692
TALR (Ours)	0.481	0.648	0.489	0.788	0.501	0.717

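At a small learning rate, differences in general capability across methods are mostly modest. TALR has the best general score on Qwen2.5-3B (0.648) and the second-best average general score (0.717, behind Wise-FT at 0.727), while retaining far more domain performance than Wise-FT (0.501 vs. 0.198 on average).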
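
To make the Pareto framing concrete, each method's (Avg. Domain, Avg. General) pair from Table 1 can be treated as a point, and the non-dominated methods picked out. The sketch below is illustrative only; the helper names `dominates` and `pareto_front` are introduced here, and the numbers are copied from the table above.

```python
# (avg. domain, avg. general) pairs copied from Table 1 (learning rate 1e-6).
TABLE1_AVG = {
    "Standard": (0.534, 0.692),
    "L2-Reg": (0.506, 0.697),
    "LoRA": (0.181, 0.490),
    "Wise-FT": (0.198, 0.727),
    "FLOW": (0.469, 0.692),
    "TALR": (0.501, 0.717),
}


def dominates(a, b):
    """True if a is at least as good as b on both axes and better on one."""
    return a[0] >= b[0] and a[1] >= b[1] and (a[0] > b[0] or a[1] > b[1])


def pareto_front(points):
    """Names of methods that no other method dominates, sorted by name."""
    return sorted(
        name
        for name, p in points.items()
        if not any(dominates(q, p) for other, q in points.items() if other != name)
    )


front = pareto_front(TABLE1_AVG)
print(front)  # ['L2-Reg', 'Standard', 'TALR', 'Wise-FT']
```

On these averages, LoRA and FLOW are dominated, while Standard, L2-Reg, Wise-FT, and TALR each occupy a different point on the front. Which of them is preferable depends on how much domain accuracy one is willing to trade for general capability.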

Table 2: Domain/General Performance Comparison on MedCalc Benchmark at Learning Rate 5e-6¶

Method	Avg. Domain	Avg. General
Standard	0.558	0.381
L2-Reg	0.555	0.395
FLOW	0.553	0.450
TALR (Ours)	0.542	0.502

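At a larger learning rate, general capability degradation intensifies and TALR's advantage is most pronounced: it outperforms Standard by about 12 percentage points on average general performance (0.502 vs. 0.381) while giving up 1.6 points of average domain performance (0.542 vs. 0.558).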

Token-Level Analysis¶

The vast majority of SFT training tokens are of low learning difficulty for LLMs (median probability close to 1.0), even when the model exhibits poor zero-shot performance on domain tasks.
A small number of hard tokens appear primarily at domain-specific concepts (e.g., clinical conversion factors) and constitute the performance bottleneck.
During TALR training, the proportion of tokens with \(p > 0.2\) increases steadily from Epoch 1 to Epoch 2, exhibiting curriculum learning dynamics.
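
The statistics used in this analysis (median token probability and the share of tokens above a probability threshold) are straightforward to compute once per-token probabilities of the reference tokens are available. The sketch below uses made-up probabilities and a hypothetical helper name, `token_difficulty_summary`. The 0.2 threshold follows the text above; everything else is an assumption for illustration.

```python
from statistics import median


def token_difficulty_summary(token_probs, threshold=0.2):
    """Summarize per-token probabilities p_theta(x_i) of the reference tokens."""
    hard = [p for p in token_probs if p <= threshold]
    return {
        "median_prob": median(token_probs),
        "frac_above_threshold": 1 - len(hard) / len(token_probs),
        "num_hard": len(hard),
    }


# Made-up probabilities for one training sequence:
# mostly easy tokens plus two hard, domain-specific ones.
example = [0.99, 0.98, 0.97, 0.99, 0.95, 0.03, 0.99, 0.10, 0.96, 0.98]
summary = token_difficulty_summary(example)
print(summary)
```

Tracking `frac_above_threshold` over training steps would show the kind of curriculum dynamics described above, with formerly hard tokens crossing the threshold as the model improves.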

Highlights & Insights¶

Challenging prevailing assumptions: The paper systematically demonstrates that SFT does not always significantly harm general capabilities, and that the exaggerated conclusions in prior literature are partly attributable to inappropriate learning rate selection.
Theory and practice unified: The information-theoretic analysis not only explains empirical observations but also directly informs the design of TALR.
Elegant design of TALR: The method features a closed-form solution, no additional hyperparameter search (with adaptive \(\tau\)), and stop-gradient for stability—resulting in a clean implementation.
Clear practical guidelines: (1) Prioritize small learning rates; (2) Apply TALR when a stronger balance is required.

Limitations & Future Work¶

Degradation not fully eliminated: No method, including TALR, can entirely prevent general capability degradation at large learning rates; on MedCalc, TALR's average general score is 0.502 at 5e-6 versus 0.717 at 1e-6.
Limited dataset coverage: Validation is conducted only on MedCalc and ESCI, without extending to a broader range of domains.
Restricted model scale: Experiments are limited to models with 3B–4B parameters; applicability to larger models or MoE architectures has not been verified.
Optimal learning rate selection: The theoretical analysis does not provide practical criteria for automatically selecting the optimal learning rate.
Computational resource constraints: The authors acknowledge that resource limitations precluded more extensive experimental validation.

Related Work¶

Method Category	Representative Work	Relationship to This Paper
L2 Regularization	EWC, L2-Reg	Constrains parameter drift, but with limited effectiveness
Model Merging	Wise-FT	Causes substantial domain performance drops; unsuitable when the domain gap is large
LoRA	Hu et al. 2022	Low-rank constraints lead to severely insufficient domain performance
Data Reweighting	FLOW	Based on sample-level difficulty distinction; this paper proposes a finer-grained token-level scheme
Continual Learning	Data-dependent methods	Require pre-training data, which is unavailable in practical settings

TALR achieves the best Pareto trade-off under a data-oblivious setting.

Rating¶

Novelty: ⭐⭐⭐⭐ — Revisits an overlooked learning rate factor, combines information-theoretic analysis with token-level adaptive reweighting
Experimental Thoroughness: ⭐⭐⭐ — Validated across multiple models and settings, but the variety of datasets is limited
Writing Quality: ⭐⭐⭐⭐ — Clear logical structure with tight integration of theory and experiments
Value: ⭐⭐⭐⭐ — Provides direct practical guidance for domain fine-tuning of LLMs