Data Whisperer: Efficient Data Selection for Task-Specific LLM Fine-Tuning via Few-Shot In-Context Learning¶

Conference: ACL 2025
arXiv: 2505.12212
Code: https://github.com/gszfwsb/Data-Whisperer
Area: LLM Pre-training
Keywords: data selection, in-context learning, attention weighting, fine-tuning efficiency, coreset selection

TL;DR¶

Data Whisperer is a training-free data selection method that scores training samples with few-shot ICL and attention weighting. It relies on the pre-trained model's own ICL ability and attention scores, so no separate scoring model has to be fine-tuned. With only 10% of the data it approaches full-data fine-tuning and, on GSM8K with Llama-3-8B, exceeds it, while running 7-20 times faster than existing methods.

Background & Motivation¶

Background: Fine-tuning an LLM requires task-specific training data. As datasets grow, data selection (coreset selection) becomes critical for balancing performance against computational cost.

Limitations of Prior Work:
- Existing methods (such as GraNd, EL2N, Nuggets) require first fine-tuning a scoring model on the target dataset, taking even more time than direct full-scale fine-tuning (\(\text{STR} > 1\)).
- Heuristic methods (such as CCS) fail to fully exploit the predictive capabilities of the model.
- Nuggets uses one-shot ICL for scoring, which has low computational efficiency (evaluating only one demonstration at a time).

Key Challenge: The overhead of data selection itself should be far smaller than fine-tuning overhead. However, the Selection-to-Tuning Ratio (STR) of existing methods generally exceeds 1, meaning that data selection takes even longer than fine-tuning.

Goal: To design a training-free data selection method that significantly reduces selection time while preserving selection quality.

Key Insight: ICL can be viewed as a form of implicit fine-tuning (\(\text{ICL} \approx \text{implicit fine-tuning}\)), so ICL performance can serve as a predictor of fine-tuning outcomes.

Core Idea: To evaluate the contribution value of each training sample to the task using the attention-weighted scores of demonstrations in few-shot ICL.

Method¶

Overall Architecture¶

The pipeline of Data Whisperer consists of two steps: (1) Few-shot ICL evaluation: randomly sample \(n_d\) demonstrations and \(n_q\) queries from the training set, perform ICL inference via the pre-trained model, and calculate the average performance score; (2) Context-Aware Weighting: weight the scores of demonstrations using attention scores to eliminate order sensitivity. This process is iteratively executed until all samples are scored, and the top-k samples are selected as the training subset.
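
The two steps above can be sketched as a simple scoring loop. This is a minimal illustration, not the authors' implementation: `score_all_samples`, `select_top_k`, and the `icl_score_fn` callback are hypothetical names, and partitioning a shuffled index list into demonstration batches is only one assumed way of making sure every sample gets scored.

```python
import random


def score_all_samples(dataset, icl_score_fn, n_d=5, n_q=5, seed=0):
    """Assign one score to every sample by using it as an ICL demonstration.

    icl_score_fn(demos, queries) is a hypothetical callback that runs ICL
    inference and returns one (attention-weighted) score per demonstration.
    """
    rng = random.Random(seed)
    indices = list(range(len(dataset)))
    rng.shuffle(indices)

    scores = {}
    # Assumption: walk the shuffled indices in batches of n_d so that each
    # sample serves as a demonstration exactly once.
    for start in range(0, len(indices), n_d):
        demo_ids = indices[start:start + n_d]
        demo_set = set(demo_ids)
        # Queries are drawn from the remaining training samples.
        pool = [i for i in indices if i not in demo_set]
        query_ids = rng.sample(pool, min(n_q, len(pool)))

        demo_scores = icl_score_fn(
            [dataset[i] for i in demo_ids],
            [dataset[i] for i in query_ids],
        )
        for i, s in zip(demo_ids, demo_scores):
            scores[i] = s
    return scores


def select_top_k(scores, ratio):
    """Return the indices of the highest-scoring fraction of samples."""
    k = max(1, int(len(scores) * ratio))
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

In a real run, `icl_score_fn` would wrap the pre-trained model; for a dry run any stub that returns one number per demonstration is enough to exercise the loop.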

Key Designs¶

Selection-to-Tuning Ratio (STR) Metric:
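- Function: Defined as \(\text{STR} = t_p(\tau, \rho) / t_{ft}\), the ratio of the time spent on data selection (\(t_p\)) to the time needed to fine-tune on the full dataset (\(t_{ft}\)); it quantifies the efficiency of a data selection method.
- Mechanism: An \(\text{STR} < 1\) means that selection takes less time than full fine-tuning, which is the condition for a method to be useful in practice.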
- Design Motivation: Exposes a critical flaw of existing methods—most of them have an \(\text{STR} > 1\), making them slower than direct training on the full dataset.
Few-shot ICL Scoring Mechanism:
- Function: Repeatedly samples \(n_d\) demonstrations and \(n_q\) queries from the dataset to perform ICL inference using the pre-trained model.
- Mechanism: Combines demonstrations and queries into context \(C\). After the model generates answers for the queries, the average performance score is calculated using task-specific metrics (e.g., Accuracy/ROUGE-L): \(s = \frac{1}{n_q}\sum_{j=1}^{n_q} f(\hat{y}_q^{(j)}, y_q^{(j)})\).
- Design Motivation: Based on the view that \(\text{ICL} \approx \text{implicit fine-tuning}\), the attention influence of a demonstration on a query during ICL plays the role of the parameter update \(\Delta W_{icl}\), the counterpart of the weight update in fine-tuning.
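
The average score \(s\) from the Mechanism bullet can be computed as below. This is a small sketch under stated assumptions: `exact_match` stands in for the task-specific metric \(f\) (the source mentions Accuracy and ROUGE-L), and both function names are hypothetical.

```python
def exact_match(prediction: str, reference: str) -> float:
    """Assumed stand-in for the task metric f: 1.0 on an exact match, else 0.0."""
    return float(prediction.strip() == reference.strip())


def icl_batch_score(predictions, references, metric=exact_match):
    """Average the metric over the n_q queries answered in one ICL context."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not predictions:
        raise ValueError("at least one query is required")
    total = sum(metric(p, r) for p, r in zip(predictions, references))
    return total / len(predictions)
```

For example, with three queries of which two are answered correctly, `icl_batch_score(["18", "5", "7"], ["18", "6", "7"])` yields two thirds.
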
Context-Aware Attention Weighting:
- Function: Weights the scores of demonstrations using self-attention scores to eliminate positional bias in ICL.
- Mechanism: Extracts the sub-attention matrix between each demonstration and the queries from the \(l\)-th attention layer and sums all of its entries across attention heads, \(w_{(x_d^{(i)}, y_d^{(i)})} = \sum_h \mathbf{1}^\top A_{(x_d^{(i)}, y_d^{(i)})}^{(h)} \mathbf{1}\); the result is then normalized by demonstration length.
- Design Motivation: ICL is highly sensitive to demonstration ordering; assigning the same batch score to every demonstration would systematically overestimate or underestimate samples placed later in the sequence.
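
A rough sketch of the weighting step follows. The attention layout (`attn[h][i][j]` is the attention from token `i` to token `j` in head `h`), the token spans, and both function names are assumptions for illustration. The summary does not state exactly how the weight is combined with the batch score, so the rescaling in `weighted_demo_scores` is one assumed choice, picked so that uniform attention reproduces the unweighted score.

```python
def demo_attention_weights(attn, demo_spans, query_span):
    """Sum query-to-demonstration attention over heads, per demonstration.

    attn:        nested lists, attn[h][i][j] = attention from token i to token j
    demo_spans:  list of (start, end) token ranges, one per demonstration
    query_span:  (start, end) token range covering the query tokens
    """
    q_start, q_end = query_span
    weights = []
    for d_start, d_end in demo_spans:
        total = 0.0
        for head in attn:
            for i in range(q_start, q_end):
                # Entries of the sub-attention matrix: query rows, demo columns.
                total += sum(head[i][d_start:d_end])
        # Normalize by demonstration length so long demonstrations are not favored.
        weights.append(total / (d_end - d_start))
    return weights


def weighted_demo_scores(batch_score, weights):
    """Assumed combination rule: redistribute the batch score by relative weight."""
    norm = sum(weights)
    if norm == 0:
        return [batch_score] * len(weights)
    n_d = len(weights)
    return [batch_score * n_d * w / norm for w in weights]
```

With a single head, one query token, and two demonstrations of length 1 and 2, the length normalization is what keeps the longer demonstration from winning purely because it covers more tokens.
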
Weak-to-Strong Strategy:
- Function: Employs a smaller model within the same family to perform ICL scoring (e.g., using Qwen-2.5-3B to score data for Qwen-2.5-7B).
- Mechanism: Models within the same family share similar knowledge representations, allowing scoring results from smaller models to transfer well to larger models.
- Design Motivation: To further reduce selection costs, lowering the STR to 0.03–0.17.

Theoretical Analysis¶

Under linear attention approximation, the influence of demonstrations on predictions in ICL can be decomposed as: \(\mathcal{M}_p(q) = (W_{zsl} + \Delta W_{icl})q\), where \(\Delta W_{icl} = \sum_i (W_V x_d^{(i)}) \otimes (W_K x_d^{(i)})\). In comparison, fine-tuning is expressed as \(\mathcal{M}_f(q) = (W_{zsl} + \Delta W_{ft})q\). The structural similarity between the two indicates that using ICL scoring as a proxy for fine-tuning performance is theoretically well-founded.
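
For readers who want the intermediate step, the decomposition can be written out as follows. This follows the standard dual-form argument for linear attention; the symbols \(X'\) (the query's own context tokens), \(X_d\) (the matrix whose columns are the demonstration tokens \(x_d^{(i)}\)), \(e^{(i)}\) (back-propagated error signals) and \(x'^{(i)}\) (training inputs) are notation introduced here as assumptions, not taken from the summary above.

\[
\begin{aligned}
\mathcal{M}_p(q) &= W_V [X', X_d]\,\big(W_K [X', X_d]\big)^\top q \\
&= \underbrace{W_V X' (W_K X')^\top}_{W_{zsl}}\, q \;+\; \underbrace{W_V X_d (W_K X_d)^\top}_{\Delta W_{icl}}\, q \\
&= \Big(W_{zsl} + \sum_i (W_V x_d^{(i)}) \otimes (W_K x_d^{(i)})\Big)\, q .
\end{aligned}
\]

On the fine-tuning side, a gradient-descent update of a linear layer is also a sum of outer products, \(\Delta W_{ft} = \sum_i e^{(i)} \otimes x'^{(i)}\), so that \(\mathcal{M}_f(q) = W_{zsl}\, q + \sum_i e^{(i)} \big(x'^{(i)\top} q\big)\). Both updates therefore act on \(q\) as a weighted sum over stored vectors, which is the structural similarity the argument relies on.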

Key Experimental Results¶

Main Results¶

Evaluated on three datasets (GSM8K, DialogSum, BioInstruct) using three models (Llama-3-8B, Qwen-2.5-7B, Mistral-Nemo):

Dataset	Model	Data Whisperer (10%)	Full Fine-Tuning	Gain vs Random Selection
GSM8K	Llama-3-8B	72.46	71.39	+2.80
GSM8K	Qwen-2.5-7B	85.03	85.43	+4.95
DialogSum	Llama-3-8B	42.18	43.33	+0.73
BioInstruct	Llama-3-8B	39.20	40.21	+0.50

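Key Findings: On GSM8K with Llama-3-8B, using only 10% of the data selected by Data Whisperer outperforms full fine-tuning (72.46 > 71.39). In the other settings listed above, the 10% subset stays within about 1.2 points of full fine-tuning (e.g., 85.03 vs 85.43 for Qwen-2.5-7B on GSM8K).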

Efficiency Comparison¶

Method	STR (GSM8K 10%)	STR (DialogSum 10%)	Speedup vs Nuggets
GraNd	1.08	1.11	-
Nuggets	1.26	2.53	1×
Data Whisperer	0.17	0.25	7.4-10×
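
The speedup column can be reproduced from the STR values alone, because both methods share the same full fine-tuning time on a given dataset, so the ratio of STRs equals the ratio of selection times. The sketch below uses the numbers from the table; the function names are hypothetical.

```python
def selection_to_tuning_ratio(selection_time, full_tuning_time):
    """STR = time spent on selection / time to fine-tune on the full dataset."""
    if full_tuning_time <= 0:
        raise ValueError("full_tuning_time must be positive")
    return selection_time / full_tuning_time


def speedup(str_baseline, str_method):
    """Speedup of a method over a baseline, given both STRs on the same dataset."""
    return str_baseline / str_method


# STR values copied from the efficiency table (10% selection budget).
STR_TABLE = {
    "Nuggets": {"GSM8K": 1.26, "DialogSum": 2.53},
    "Data Whisperer": {"GSM8K": 0.17, "DialogSum": 0.25},
}

for dataset in ("GSM8K", "DialogSum"):
    ratio = speedup(
        STR_TABLE["Nuggets"][dataset],
        STR_TABLE["Data Whisperer"][dataset],
    )
    print(f"{dataset}: {ratio:.1f}x faster than Nuggets")
```

This gives roughly 7.4× on GSM8K and 10.1× on DialogSum, matching the range in the last column.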

Ablation Study¶

Configuration	GSM8K 10% (Llama-3)	Description
Full (nd=5, nq=5)	72.46	Full model
w/o attention weighting	~70.0	Removing attention weighting lowers accuracy by roughly 2 points
nd=1 (similar to Nuggets)	~69.5	Reduces to one-shot scoring; accuracy drops by roughly 3 points
Weak-to-strong (3B→7B)	Close to direct 7B scoring	Scores from the smaller model are transferable

Key Findings¶

STR metric exposes critical issues: While the STR of prior works is systematically \(>1\), Data Whisperer significantly reduces it to 0.03–0.17.
Outperforming the full dataset with a small subset: On GSM8K with Llama-3-8B, using a selected 10% subset outperforms fine-tuning on 100% of the data.
Crucial role of attention weighting: Removing context-aware weighting lowers GSM8K accuracy by roughly 2 points in the ablation.
Few-shot > One-shot: \(n_d=5\) is significantly better than \(n_d=1\), since multiple demonstrations provide much richer task-specific information.

Highlights & Insights¶

The STR metric serves as an excellent evaluation dimension. It highlights a neglected issue in the data selection community—that many selection methods take longer than fine-tuning itself, which limits their practical utility. This metric holds potential for analyzing other AutoML methods.
The theoretical bridge of \(\text{ICL} \approx \text{Fine-tuning}\) elegantly links ICL scoring to fine-tuning outcomes. The clean mathematical derivation (\(\Delta W_{icl}\) vs \(\Delta W_{ft}\) under linear attention approximation) provides a solid theoretical foundation for ICL-based data selection.
The Weak-to-strong strategy is a highly practical acceleration technique. Using a 3B model to score data for a 7B model further curtails costs with negligible performance degradation.

Limitations & Future Work¶

The theoretical analysis is based on a linear attention approximation, which still differs from practical softmax attention.
Experiments were primarily verified on 7-8B models; the effectiveness on larger scales (e.g., 70B+) remains to be investigated.
The attention layer \(l\) is treated as a fixed hyperparameter; adaptive layer selection could be a better alternative.
The evaluation focused solely on LoRA; applicability to full parameter fine-tuning or other PEFT methods is yet to be validated.

vs Nuggets: Nuggets utilizes one-shot ICL combined with a fine-tuned model for scoring, which incurs high computational overhead (\(\text{STR} > 1\)). Data Whisperer instead leverages few-shot ICL with a pre-trained model and attention weighting, resulting in a 7-20× speedup.
vs STAFF: STAFF estimates gradients using small models for scoring, which still requires training. Data Whisperer is entirely training-free.
vs CCS: CCS balances coverage and importance based on heuristics, whereas Data Whisperer is strictly data-driven.

Rating¶

Novelty: ⭐⭐⭐⭐ The combination of the STR metric and attention-weighted ICL scoring is highly novel, though the theoretical base of ICL being equivalent to fine-tuning is not entirely new.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ Extensively verified across 3 datasets, 3 models, and multiple selection ratios, with detailed ablation and synthetic data studies.
Writing Quality: ⭐⭐⭐⭐ Clear theoretical formulations, highly precise definition of the STR metric, and informative visualizations.
Value: ⭐⭐⭐⭐ Highly practical. It brings a 7-20× speedup with little or no loss in performance, offering high utility for realistic LLM fine-tuning pipelines.