SFTMix: Elevating Language Model Instruction Tuning with Mixup Recipe¶
Conference: ACL 2026 arXiv: 2410.05248 Code: None Area: LLM Alignment Keywords: Instruction Tuning, Mixup Regularization, Training Dynamics, Confidence Partitioning, Data Utilization Efficiency
TL;DR¶
This paper proposes SFTMix, a Mixup-based instruction-tuning method. It uses training dynamics to partition SFT data into high-confidence and low-confidence subsets, linearly interpolates between the two subsets in hidden-representation space as a Mixup regularizer, and consistently improves instruction-following ability across LLM families and dataset scales without relying on high-quality curated datasets.
Background & Motivation¶
State of the Field: Instruction tuning (SFT) is a critical stage for endowing LLMs with instruction-following capabilities. The dominant paradigm trains models on instruction–response pairs via next-token prediction (NTP) loss. Efforts to improve SFT performance have primarily focused on data quality: filtering data via LLM scoring (AlpaGasus), collecting human-annotated high-quality data (LIMA), or generating responses with stronger LLMs (GPT-4 distillation).
Limitations of Prior Work: (1) Obtaining high-quality SFT data requires powerful closed-source LLMs or expensive human annotation. (2) Standard NTP training treats all samples equally, despite significant variation in how well the model has learned different samples. (3) High-confidence samples are prone to overfitting, low-confidence samples are hard to generalize from, and the two groups are clearly separated in the semantic representation space.
Root Cause: The NTP paradigm treats every training sample identically, ignoring the non-uniform confidence distribution of LLMs in the semantic representation space—samples in different regions should play distinct roles during training.
Paper Goals: To design a general method that improves instruction tuning by optimizing data utilization rather than relying on dataset curation quality.
Starting Point: Training dynamics (perplexity statistics across multiple checkpoints) are used to partition SFT data into high-confidence and low-confidence subsets; Mixup is then applied to interpolate between them, facilitating the flow of supervision signals across confidence regions.
Core Idea: Linear interpolation between high- and low-confidence samples in the hidden representation space, combined with Mixup regularization, encourages the model to establish smooth transitions between regions where it has and has not yet learned, alleviating overfitting and enhancing generalization.
Method¶
Overall Architecture¶
SFTMix consists of a three-step pipeline: (1) a reference LLM is trained for one round of NTP on the SFT data, perplexity statistics are collected across multiple checkpoints, a confidence score is computed for each sample, and the dataset is split at the median into high- and low-confidence subsets; (2) during training of the target LLM, hidden representations and labels of high- and low-confidence samples in each batch are linearly interpolated; (3) the Mixup cross-entropy loss is incorporated as a regularization term alongside the standard NTP loss.
Key Designs¶
- Training-Dynamics-Based Confidence Partitioning:
- Function: Partitions the SFT dataset into two complementary subsets according to model-specific learning difficulty.
- Mechanism: The perplexity of each sample is computed at \(C\) training checkpoints of the reference LLM; the negative mean yields the confidence score \(\text{Conf}(\mathcal{Y}_i|\mathcal{X}_i) = -\frac{1}{C}\sum_{c=1}^{C}\text{Perp}_c(\mathcal{Y}_i|\mathcal{X}_i)\). The dataset is split at the median into \(\mathcal{D}^c\) (high-confidence) and \(\mathcal{D}^u\) (low-confidence). t-SNE visualizations confirm clear separation between the two subsets in representation space.
- Design Motivation: Data quality (GPT-4-generated vs. raw) does not correlate with training-dynamics confidence—confidence reflects the model's current learning state rather than the intrinsic quality of the data, which is the prerequisite for Mixup to be effective.
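The confidence-scoring and median-split step can be sketched as follows. This is a minimal NumPy illustration under stated assumptions (the paper releases no code, so function names like `confidence_scores` and the toy perplexity values are hypothetical); it computes \(\text{Conf}(\mathcal{Y}_i|\mathcal{X}_i) = -\frac{1}{C}\sum_{c}\text{Perp}_c\) from per-checkpoint perplexities and splits at the median:

```python
import numpy as np

def confidence_scores(perplexities: np.ndarray) -> np.ndarray:
    """Conf(Y_i | X_i) = -(1/C) * sum_c Perp_c(Y_i | X_i).

    perplexities: shape (num_samples, C), one column per checkpoint
    of the reference LLM.
    """
    return -perplexities.mean(axis=1)

def median_split(scores: np.ndarray):
    """Split sample indices at the median confidence score into the
    high-confidence subset D^c and the low-confidence subset D^u."""
    median = np.median(scores)
    high = np.where(scores >= median)[0]  # D^c
    low = np.where(scores < median)[0]    # D^u
    return high, low

# Toy example: 4 samples, C = 3 checkpoints (values are illustrative).
perp = np.array([[12.0, 8.0, 6.0],
                 [40.0, 35.0, 30.0],
                 [9.0, 7.0, 5.0],
                 [50.0, 45.0, 42.0]])
scores = confidence_scores(perp)
high, low = median_split(scores)  # samples 0 and 2 land in D^c
```

Because the score is a negative mean perplexity, consistently low-perplexity samples receive the highest confidence, matching the paper's definition.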
- Hidden-Space Mixup Interpolation:
- Function: Creates "intermediate" training signals between high- and low-confidence samples.
- Mechanism: In the target LLM, last-layer hidden states and the corresponding one-hot labels of a paired high- and low-confidence sample are linearly interpolated: \(\tilde{\mathbf{Z}}_n = \lambda \mathbf{Z}_n^c + (1-\lambda)\mathbf{Z}_n^u\), \(\tilde{\mathbf{Y}}_n = \lambda \mathbf{Y}_n^c + (1-\lambda)\mathbf{Y}_n^u\), where \(\lambda \sim \text{Beta}(\alpha, \alpha)\) with \(\alpha=0.5\). To align the pair token-wise, the longer response is truncated to \(\min(N_i^c, N_i^u)\) tokens.
- Design Motivation: Due to the nonlinearity of softmax, the gradient of the interpolated sample is not equal to the weighted sum of the two original gradients—meaning Mixup introduces genuinely distinct gradient directions rather than mere sample reweighting.
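The interpolation step above can be sketched in NumPy. This is an illustrative stand-in for the real hidden states (the `mixup_pair` name and the toy one-hot labels are assumptions, not the paper's implementation); it shows the Beta-sampled \(\lambda\), the truncation to the shorter sequence, and that mixed labels remain valid probability distributions:

```python
import numpy as np

rng = np.random.default_rng(0)

def mixup_pair(z_c, y_c, z_u, y_u, alpha=0.5):
    """Interpolate a high-confidence pair (z_c, y_c) with a low-confidence
    pair (z_u, y_u) in hidden space.

    z_*: (seq_len, hidden) last-layer hidden states
    y_*: (seq_len, vocab) one-hot labels
    The longer response is truncated to min(N^c, N^u) tokens before mixing.
    """
    n = min(len(z_c), len(z_u))
    lam = rng.beta(alpha, alpha)  # lambda ~ Beta(alpha, alpha)
    z_mix = lam * z_c[:n] + (1 - lam) * z_u[:n]
    y_mix = lam * y_c[:n] + (1 - lam) * y_u[:n]
    return z_mix, y_mix, lam

# Toy pair: 5-token vs 3-token responses, hidden dim 4, vocab size 6.
z_c, y_c = np.ones((5, 4)), np.eye(6)[[0, 1, 2, 3, 4]]
z_u, y_u = np.zeros((3, 4)), np.eye(6)[[5, 5, 5]]
z_mix, y_mix, lam = mixup_pair(z_c, y_c, z_u, y_u)  # shapes (3, 4), (3, 6)
```

Each mixed label row sums to 1 (a convex combination of one-hots), which is what lets the standard cross-entropy loss consume it directly.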
- Integration of Mixup as a Regularization Term:
- Function: Introduces cross-confidence supervision signals without interfering with standard NTP learning.
- Mechanism: The total loss is \(\ell_{\text{SFTMix}} = \ell_{\text{NTP}}(\mathcal{D}) + \mu \cdot \ell_{\text{Mixup}}(\mathcal{D}^c, \mathcal{D}^u)\), where \(\mu=0.2\). Each batch is ensured to contain equal numbers of high- and low-confidence samples, which are randomly paired for interpolation.
- Design Motivation: Experiments demonstrate that Mixup performs best as a regularization term (rather than as the primary loss or with equal weighting), preserving the fundamental learning capacity of NTP while capturing the generalization benefits of Mixup.
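The combined objective can be written out as a short sketch. The soft-label cross-entropy below is a standard formulation (not the paper's released code, which is unavailable); `sftmix_loss` mirrors \(\ell_{\text{SFTMix}} = \ell_{\text{NTP}} + \mu \cdot \ell_{\text{Mixup}}\) with the paper's \(\mu=0.2\):

```python
import numpy as np

def cross_entropy(logits, targets):
    """Soft-label cross-entropy; targets may be one-hot (NTP) or
    Mixup-mixed label distributions."""
    shifted = logits - logits.max(axis=-1, keepdims=True)  # stable softmax
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return float(-(targets * log_probs).sum(axis=-1).mean())

def sftmix_loss(ntp_logits, ntp_targets, mix_logits, mix_targets, mu=0.2):
    """l_SFTMix = l_NTP(D) + mu * l_Mixup(D^c, D^u)."""
    return cross_entropy(ntp_logits, ntp_targets) + \
        mu * cross_entropy(mix_logits, mix_targets)

# Toy check: two tokens, two classes, near-confident predictions.
logits = np.array([[10.0, 0.0], [0.0, 10.0]])
targets = np.eye(2)
total = sftmix_loss(logits, targets, logits, targets)
```

Keeping \(\mu\) small preserves NTP as the primary objective, matching the ablation finding that Mixup works best as a regularizer rather than as the main loss.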
Loss & Training¶
Standard NTP cross-entropy loss plus the Mixup cross-entropy regularization term, with \(\mu=0.2\) and \(\alpha=0.5\). AdamW optimizer, learning rate \(2\times10^{-6}\), weight decay 0.1, cosine scheduler, warm-up ratio 0.1. Training runs for 3 epochs on Alpaca-52K and 1 epoch each on UltraChat-200K and Tulu3-939K, with batch size 32 on 8× H100 GPUs.
Key Experimental Results¶
Main Results¶
Instruction-Following Evaluation (Alpaca-52K dataset)
| LLM | Method | MT-Bench Overall | AlpacaEval-2 WR | AlpacaEval-2 LC WR |
|---|---|---|---|---|
| Llama-3.1-8B | NTP | 4.3625 | 4.0714 | 8.6528 |
| Llama-3.1-8B | SFTMix | 4.5825 | 4.9031 | 10.3195 |
| Mistral-7B | NTP | 4.6163 | 4.3560 | 9.1759 |
| Mistral-7B | SFTMix | 4.9100 | 4.5386 | 9.4994 |
| Qwen-2.5-14B | NTP | 6.1930 | 7.0764 | 13.9508 |
| Qwen-2.5-14B | SFTMix | 6.5247 | 7.8810 | 15.0235 |
Medical-Domain SFT (MedAlpaca-263K)
| LLM | Method | MedQA | MedQA-5 | PubMedQA | MedMCQA | Avg |
|---|---|---|---|---|---|---|
| Llama | NTP | 59.31 | 54.52 | 75.40 | 53.65 | 60.72 |
| Llama | SFTMix | 60.88 | 55.38 | 77.80 | 54.15 | 62.05 |
| Mistral | NTP | 49.10 | 44.62 | 75.40 | 48.15 | 54.32 |
| Mistral | SFTMix | 51.77 | 45.72 | 77.40 | 49.03 | 55.98 |
Ablation Study¶
Role Analysis of Mixup (Llama-3.1-8B + Alpaca-52K)
| NTP Role | Mixup Role | MT-Bench | AlpacaEval-2 LC WR |
|---|---|---|---|
| Loss | — | 4.3625 | 8.6528 |
| Loss | Reg. | 4.5825 | 10.3195 |
| Loss | Loss | 4.4062 | 8.2856 |
| — | Loss | 4.5062 | 7.2964 |
Key Findings¶
- SFTMix yields larger gains on multi-turn dialogue capability (MT-Bench multi-turn average +0.32 vs. single-turn +0.27), suggesting that Mixup regularization benefits contextual understanding.
- In human evaluation, SFTMix wins 42.5% of head-to-head comparisons, while NTP wins only 26.5%.
- Training-dynamics confidence does not correspond to data quality—the confidence distributions of GPT-4-generated "high-quality" and raw "low-quality" responses overlap substantially.
- Confidence partitioning derived from a weaker reference LLM (Gemma-2B) transfers to a stronger target LLM (Llama-8B), supporting weak-to-strong generalization.
- SFTMix is compatible with data selection methods (AlpaGasus, Long) and yields further improvements when combined; it is also compatible with LoRA for compute-constrained settings.
- SFTMix reduces the standard deviation of confidence scores by 7%, indicating a more uniform confidence distribution and mitigated overfitting.
Highlights & Insights¶
- The insight that "samples of different confidence levels should play different roles" is concise and compelling—high-confidence samples lie far from the decision boundary and are prone to overfitting, while low-confidence samples lie near the boundary and are difficult to learn from; Mixup precisely bridges the two.
- Gradient analysis demonstrates that Mixup introduces genuinely new gradient directions (softmax nonlinearity prevents gradient decomposition), rather than simple sample reweighting—this explains why Mixup is more effective than resampling.
- The method is highly practical: only one additional training round is required to obtain confidence scores, making it a plug-and-play addition to any SFT pipeline.
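The softmax-nonlinearity point can be checked numerically: the softmax of a convex combination of two logit vectors differs from the convex combination of their softmaxes, so the gradient on a mixed sample cannot decompose into a weighted sum of the originals' gradients. A toy illustration (the logit values are hypothetical):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

lam = 0.5
z_c = np.array([4.0, 0.0, 0.0])  # stand-in for a high-confidence hidden state
z_u = np.array([0.0, 0.0, 4.0])  # stand-in for a low-confidence hidden state

mix_then_softmax = softmax(lam * z_c + (1 - lam) * z_u)
softmax_then_mix = lam * softmax(z_c) + (1 - lam) * softmax(z_u)
gap = np.abs(mix_then_softmax - softmax_then_mix).max()  # strictly positive
```

A nonzero `gap` is exactly why Mixup here is more than sample reweighting: interpolated inputs induce gradient directions that no reweighting of the original pair can produce.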
Limitations & Future Work¶
- Experiments are limited to models up to 14B parameters; effectiveness on larger models remains to be validated.
- An extra training round is required to obtain training dynamics, incurring additional cost similar to data selection methods such as LESS and Rho-1.
- The binary confidence split at the median may be too coarse; multi-level partitioning or continuous confidence weighting warrants exploration.
- The method has not been validated in the pre-training stage—dynamic Mixup scheduling and extension to pre-training are promising future directions.
Related Work & Insights¶
- vs. IR-DRO (Chen et al., 2024b): The latter optimizes distributional robustness via sample reweighting, but underperforms SFTMix on both MT-Bench and AlpacaEval-2—indicating that hidden-space interpolation is more effective than loss reweighting.
- vs. Data Selection (AlpaGasus, LESS): These methods improve performance by selecting better data; SFTMix improves performance by making better use of existing data. The two approaches are orthogonal and can be combined for further gains.
Rating¶
- Novelty: ⭐⭐⭐⭐ Introducing Mixup into LLM SFT in conjunction with training-dynamics confidence is a clear and well-motivated idea, though Mixup itself is not novel.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Three LLM families, three dataset scales, medical-domain validation, and six analytical dimensions.
- Writing Quality: ⭐⭐⭐⭐ Method motivation and gradient analysis are clearly presented; ablation experiments are systematically designed.
- Value: ⭐⭐⭐⭐ Highly practical, plug-and-play, and compatible with existing methods.