Importance-Aware Data Selection for Efficient LLM Instruction Tuning¶
Conference: AAAI 2026 arXiv: 2511.07074 Code: None Area: Alignment RLHF / Data Selection Keywords: Data Selection, MIWV, ICL, Data Efficiency, Instruction Tuning
TL;DR¶
This paper proposes MIWV (Model Instruction Weakness Value), a metric that measures how important each instruction sample is for improving model capability by comparing the LLM's loss on the response with and without a one-shot ICL demonstration. Using only 1% (520 samples) of the Alpaca dataset, the method outperforms fine-tuning on the full 52,002-sample dataset across benchmarks.
Background & Motivation¶
Background: Instruction tuning is a key approach for improving LLMs' instruction-following ability. Existing methods largely focus on collecting larger and more diverse datasets (e.g., Alpaca 52K, WizardLM 63K) or filtering data by scoring with ChatGPT or external models.
Limitations of Prior Work: (a) Blindly scaling datasets introduces noise and redundancy without guaranteed improvement; (b) Quality-scoring methods such as InstructMining show limited effectiveness and fail to surpass full-data training; (c) Alpagasus and Deita rely on the ChatGPT API for scoring, incurring high cost and limited efficiency; (d) SelectIT and DiverseEvol require additional model training, reducing efficiency.
Key Challenge: Existing data selection methods either depend on external models (costly and potentially biased) or evaluate data quality in isolation without considering the characteristics of the target model, making it impossible to identify the most valuable data for a specific LLM.
Goal: To select the subset of instruction data that maximally improves a target LLM's capability using only that model itself—without external models or additional training.
Key Insight: If providing the model with a similar one-shot demonstration causes its loss on a given instruction to increase rather than decrease, the model lacks fundamental capability for that type of instruction, making such samples the most valuable for learning.
Core Idea: The difference in model loss with and without an ICL prompt (MIWV) is used to measure the importance of each sample for improving model capability. A higher MIWV indicates that the model has greater need to learn from that sample.
Method¶
Overall Architecture¶
The input is the full instruction dataset \(D = \{(x_1,y_1),\dots,(x_n,y_n)\}\), and the output is a high-quality subset ranked by MIWV. The pipeline proceeds as follows: vector embedding → one-shot example retrieval → MIWV computation → Top-K selection → instruction fine-tuning. A minimal code sketch of the selection stages is given after the Key Designs list below.
Key Designs¶
- One-Shot Example Retrieval:
- Function: Identifies the most similar sample in the dataset to serve as an ICL demonstration for each instruction.
- Mechanism: The BGE-en-large embedding model computes a representation for each instruction by mean-pooling its token embeddings, \(h_i = \frac{1}{Q}\sum_{q=1}^{Q} h_i^q\); for each \(x_i\), the most similar other sample is retrieved by cosine similarity: \(k = \arg\max_{j \neq i} \mathrm{sim}(h_i, h_j)\).
- Design Motivation: The one-shot demonstration must be relevant to the target instruction so that the ICL effect faithfully reflects the model's true capability on that instruction type.
- MIWV Computation:
- Function: Quantifies the degree to which each instruction sample contributes to improving model capability.
- Mechanism: The loss without demonstration \(L_\theta(y_i|x_i)\) and the loss with a one-shot demonstration \(L_\theta(y_i|x_i, C)\) are computed separately, where \(C = \text{Prompt}(x_k, y_k)\). MIWV is defined as: \(\text{MIWV}(x_i, y_i) = L_\theta(y_i|x_i, C) - L_\theta(y_i|x_i)\)
- Design Motivation: A high MIWV indicates that providing a demonstration increases model confusion—implying a fundamental capability gap for that instruction type, which makes it the most valuable data for training. Notably, when no highly similar sample exists in the dataset (i.e., the retrieved demonstration is weakly related), high MIWV naturally selects for diverse data.
- Top-K High-Quality Data Selection:
- Function: Selects a data subset for instruction fine-tuning by ranking samples in descending order of MIWV.
- Mechanism: The top-K% samples by MIWV score are directly used as fine-tuning data.
- Design Motivation: The approach is straightforward and requires no complex multi-objective balancing. Ablation studies confirm that both Low-MIWV and random selection perform significantly worse than full-data training.
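
Below is a minimal, self-contained sketch of the selection pipeline described above (embedding-based retrieval → MIWV scoring → Top-K selection). It is illustrative rather than the authors' released code: the `BAAI/bge-large-en` checkpoint, the LLaMA-2-7B identifier, the Instruction/Response prompt template, and the loss-masking details are assumptions made for clarity.

```python
# Minimal sketch of MIWV-based data selection (illustrative; not the authors' code).
# Assumptions: BAAI/bge-large-en as the embedding model, LLaMA-2-7B as the target
# model, and a simple "Instruction/Response" prompt template.
import torch
from sentence_transformers import SentenceTransformer
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
embedder = SentenceTransformer("BAAI/bge-large-en", device=device)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf").to(device).eval()

def nearest_neighbors(instructions):
    """For each instruction, return the index of its most similar other instruction."""
    h = embedder.encode(instructions, convert_to_tensor=True, normalize_embeddings=True)
    sim = h @ h.T                              # cosine similarity (embeddings are unit-normalized)
    sim.fill_diagonal_(float("-inf"))          # exclude the sample itself
    return sim.argmax(dim=1).tolist()          # k = argmax_{j != i} sim(h_i, h_j)

@torch.no_grad()
def response_loss(prompt, answer):
    """Mean next-token NLL over the answer tokens given the prompt."""
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    ids = tokenizer(prompt + answer, return_tensors="pt").input_ids.to(device)
    labels = ids.clone()
    labels[:, :prompt_len] = -100              # score only the answer tokens (boundary effects ignored)
    return model(ids, labels=labels).loss.item()

def miwv_scores(data):
    """data: list of (instruction, response) pairs; returns one MIWV score per pair."""
    ks = nearest_neighbors([x for x, _ in data])
    scores = []
    for i, (x, y) in enumerate(data):
        xk, yk = data[ks[i]]
        demo = f"Instruction: {xk}\nResponse: {yk}\n\n"   # one-shot ICL demonstration C
        base = f"Instruction: {x}\nResponse: "
        # MIWV(x_i, y_i) = L(y_i | x_i, C) - L(y_i | x_i)
        scores.append(response_loss(demo + base, y) - response_loss(base, y))
    return scores

def select_top_k(data, ratio=0.01):
    """Keep the top-K% of samples by MIWV for instruction fine-tuning."""
    scores = miwv_scores(data)
    order = sorted(range(len(data)), key=scores.__getitem__, reverse=True)
    return [data[i] for i in order[: max(1, int(len(data) * ratio))]]
```

On Alpaca, calling `select_top_k(data, ratio=0.01)` would keep roughly the 520 highest-MIWV samples corresponding to the paper's 1% setting.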
Loss & Training¶
- Standard next-token prediction loss is used for instruction fine-tuning.
- LLaMA-7B, LLaMA2-7B, and LLaMA2-13B are fine-tuned with LoRA following the Alpaca codebase training configuration (a minimal sketch follows this list).
- All experiments are repeated three times and averaged.
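
For the fine-tuning stage, a sketch along the lines of the alpaca-lora recipe is shown below. The LoRA hyperparameters (r=8, alpha=16, q/v projections), the prompt template, and the 512-token cutoff are common defaults rather than the paper's confirmed settings, and `selected` refers to the output of the selection sketch above.

```python
# Sketch of LoRA fine-tuning on the MIWV-selected subset (hyperparameters are
# common alpaca-lora defaults, not necessarily the paper's exact configuration).
from datasets import Dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token           # LLaMA has no pad token by default

base = AutoModelForCausalLM.from_pretrained(model_name)
lora = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05,
                  target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
model = get_peft_model(base, lora)

# `selected` is the list of (instruction, response) pairs from select_top_k() above.
def tokenize(example):
    text = (f"Instruction: {example['instruction']}\n"
            f"Response: {example['response']}{tokenizer.eos_token}")
    return tokenizer(text, truncation=True, max_length=512)

train_ds = Dataset.from_list(
    [{"instruction": x, "response": y} for x, y in selected]
).map(tokenize, remove_columns=["instruction", "response"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="llama2-7b-miwv-top1", num_train_epochs=3,
                           per_device_train_batch_size=4, learning_rate=3e-4),
    train_dataset=train_ds,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),  # standard next-token loss
)
trainer.train()
```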
Key Experimental Results¶
Main Results¶
| Dataset / Model | Data Ratio | Pairwise Win Rate (vs. full data) | Open LLM Leaderboard Avg | AlpacaEval |
|---|---|---|---|---|
| Alpaca/LLaMA2-7B | 100% | 1.000 | 55.25 | 27.75 |
| Alpaca/LLaMA2-7B | 1% (520 samples) | 1.127 | 56.17 | 39.50 |
| Alpaca/LLaMA2-7B | 5% (2600 samples) | 1.214 | 56.91 | 39.87 |
| Alpaca/LLaMA2-13B | 100% | 1.000 | 58.78 | 35.00 |
| Alpaca/LLaMA2-13B | 1% (520 samples) | 1.063 | 60.36 | 41.30 |
| WizardLM/LLaMA2-7B | 100% | 1.000 | 55.02 | 59.25 |
| WizardLM/LLaMA2-7B | 1% (636 samples) | 1.048 | 55.45 | 60.12 |
- Across all configurations, 1% of the data consistently surpasses full-data training.
- AlpacaEval improvements are particularly substantial: on Alpaca, scores increase from 27.75 to 39.50 (a relative gain of roughly 42%).
Ablation Study¶
| Data Selection Strategy | Win Rate Trend | Note |
|---|---|---|
| MIWV Top-K (Ours) | Substantially >1.0 | Consistently outperforms full data |
| Random Selection | <1.0 | Consistently below full data |
| High Prompt Loss | <1.0 | High loss ≠ high value |
| Low MIWV | Lowest | Confirms correctness of MIWV direction |
Key Findings¶
- 1% > 100% is the core finding: 520 carefully selected samples outperform training on all 52,002 samples, demonstrating that data quality far outweighs quantity.
- Low-MIWV selection yields the worst performance—confirming the directionality of MIWV: data on which the model is already proficient contributes no value to fine-tuning.
- High Prompt Loss selection (selecting samples with high loss given demonstrations) also underperforms—such samples may simply be inherently difficult or noisy rather than genuinely needed by the model.
- As the selection ratio grows further, the win rate falls back toward 1.0, indicating that redundant data introduces noise and interference.
- The method is not sensitive to the choice of embedding model: BGE-en-large and BGE-en-base yield comparable results.
Highlights & Insights¶
- Using ICL difference as a data importance measure is an elegant design: rather than assessing inherent data quality, it evaluates whether a given sample is useful for a specific model—the same sample may have entirely different MIWV values for different models, achieving genuine model-data adaptation.
- A fully automated method requiring no external models and no training: only two forward passes through the target LLM (with and without ICL) are needed, making it far more efficient than Alpagasus (requiring ChatGPT) and SelectIT (requiring model training).
- A counterintuitive insight from MIWV: data on which model performance degrades after receiving a relevant demonstration is the most valuable—this reveals the model's capability gaps. The same principle is broadly applicable to active learning and curriculum learning in any domain.
Limitations & Future Work¶
- Validation is limited to the LLaMA family; applicability to other architectures (e.g., Qwen, Mistral) remains untested.
- MIWV relies on embedding-based retrieval of similar samples and may degrade on highly heterogeneous datasets where each instruction is unique.
- Only relatively small models (7B/13B) are fine-tuned; effectiveness on larger models (70B+) has yet to be confirmed.
- No analysis is provided on the common characteristics of MIWV-selected data—what properties (task type, complexity, or other factors) tend to yield high MIWV scores remains an open question.
Related Work & Insights¶
- vs. IFD Score: IFD assesses data difficulty via loss differences before and after fine-tuning, requiring one full training round; MIWV requires no training at all and directly estimates importance via ICL differences.
- vs. SelectIT: SelectIT computes quality scores from token probability distributions over multiple inference runs, requiring complex multi-pass inference; MIWV requires only two forward passes (with and without ICL), making it simpler and more efficient.
- vs. Superfiltering: Superfiltering is more computationally efficient (using a smaller model for scoring), but the mismatch between the scoring model and the training model results in a lower win rate; MIWV consistently uses the target model itself for evaluation.
Rating¶
- Novelty: ⭐⭐⭐⭐ MIWV is a concise and effective metric; using ICL difference as a proxy for data value represents a novel perspective.
- Experimental Thoroughness: ⭐⭐⭐⭐ Evaluations span multiple datasets, models, and baselines, complemented by ablation studies.
- Writing Quality: ⭐⭐⭐⭐ The method is clearly presented and experimental results are well reported.
- Value: ⭐⭐⭐⭐⭐ The finding that 1% of data surpasses full-data training has significant practical implications for real-world training pipelines.