SCAR: Data Selection via Style Consistency-Aware Response Ranking for Efficient Instruction-Tuning
Conference: ACL 2025
arXiv: 2406.10882
Code: https://github.com/zhuang-li/SCAR
Area: LLM Pre-training
Keywords: Data selection, Instruction tuning, Style consistency, Data-efficient, LoRA
TL;DR
SCAR identifies "linguistic form" and "instructional surprisal" of responses as two key style factors influencing LLM instruction-tuning performance. It proposes a style-consistency-aware ranking method to automatically select high-quality training data, enabling the fine-tuned LLM to match or exceed the performance of training on the full dataset using as little as 0.7% of the original data.
Background & Motivation
Background: Instruction tuning (SFT) is a critical step in aligning LLMs with human preferences. LIMA proposed the "Superficial Alignment Hypothesis", suggesting that pre-trained models already possess knowledge, and SFT merely guides the model to learn specific response styles. AlpaGasus demonstrated that high-quality small datasets can outperform larger datasets.
Limitations of Prior Work: (a) LIMA relies on human experts to manually ensure style consistency, which is highly expensive; (b) the exact definition and constituent elements of "style" are unclear; (c) how style consistency interacts with data quality (correctness, helpfulness) in their effect on SFT remains unclear.
Key Challenge: High-quality and style-consistent data significantly enhances SFT efficacy, but existing methods either rely on expensive human curation or focus solely on data quality while neglecting style consistency.
Goal: (a) Define key style elements affecting SFT; (b) clarify the relationship between style consistency and data quality; (c) develop an automated method to select style-consistent training data.
Key Insight: A comparative stylometric analysis of LLM-generated and human-written responses shows that LLM responses exhibit higher style consistency, primarily across two dimensions: linguistic form (lexical/syntactic) and instructional surprisal (predictability of answers).
Core Idea: Among data of comparable quality, subsets with higher style consistency yield better fine-tuning performance; SCAR exploits this finding to select data automatically.
Method
Overall Architecture
SCAR is a ranking model that takes instruction-response pairs as input and outputs a style-consistency score. The framework consists of three steps: (1) defining and quantifying two key style dimensions; (2) training a ranking model to distinguish between high- and low-style-consistency responses; (3) using the ranking model to rank the target dataset and select the top-K subset for SFT.
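Step (3) above, ranking the target dataset and keeping the top-K pairs, can be sketched as follows. This is a minimal illustration, not the authors' implementation: `select_top_k` and `toy_score` are hypothetical names, and `toy_score` is a stand-in scorer used only so the example runs without a trained ranker.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (instruction, response)


def select_top_k(
    pairs: List[Pair],
    score_fn: Callable[[str, str], float],
    k: int,
) -> List[Pair]:
    """Return the k pairs with the highest score, best first."""
    if k < 0:
        raise ValueError("k must be non-negative")
    scored = [(score_fn(x, y), i) for i, (x, y) in enumerate(pairs)]
    # Sort by score descending; break ties by original index for determinism.
    scored.sort(key=lambda t: (-t[0], t[1]))
    return [pairs[i] for _, i in scored[:k]]


def toy_score(instruction: str, response: str) -> float:
    """Hypothetical stand-in scorer (prefers shorter responses).

    This is NOT the SCAR ranker; a trained ranking model would be
    plugged in here instead.
    """
    return -float(len(response.split()))


if __name__ == "__main__":
    data = [("q1", "a b c"), ("q2", "a"), ("q3", "a b")]
    print(select_top_k(data, toy_score, k=2))
```

In practice `score_fn` would wrap the trained ranking model, and `k` would be set from the desired data budget (for example, 70 out of 10,000 pairs for a 0.7% subset).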
Key Designs
- Linguistic Form:
  - Function: Measures the consistency of responses in non-semantic dimensions (lexical choice, syntactic structure, formatting, etc.).
  - Mechanism: Quantified using 6 authorship attribution metrics: Type-Token Ratio (TTR), Measure of Textual Lexical Diversity (MTLD), function word frequency, Flesch Reading Ease score, average sentence length, and punctuation/formatting feature frequency. The standard deviation of each metric across responses measures consistency (lower means more consistent).
  - Design Motivation: Experiments reveal that the standard deviation of TTR for GPT-3.5 responses is only 8.14 (vs 24.23 for humans), indicating that LLM responses are highly consistent in linguistic form, whereas human responses vary widely due to multi-author origins.
- Instructional Surprisal:
  - Function: Measures the "unexpectedness" of the response content relative to the instruction.
  - Mechanism: Evaluated using \(\text{PPL}(y|x)\), the perplexity of response \(y\) given instruction \(x\). "Direct responses" generated by GPT have low and consistent perplexity (predictable standard answers), while human response perplexity distributions are more scattered (potentially containing unexpected solutions like StoogeSort).
  - Design Motivation: Surprisal consistency affects training stability: high-variance surprisal leaves the model "conflicted" during the learning process.
- SCAR Ranking Model:
  - Function: Automatically ranks instruction-response pairs based on style consistency.
  - Mechanism: A Bradley-Terry ranking model is trained, using style-consistent LLM-generated responses as positive examples and style-inconsistent human-written responses as negative examples. The model is enhanced with representation learning, disentangling the representations of linguistic form and surprisal via contrastive learning to better distinguish between these two style dimensions.
  - Design Motivation: Directly measuring style consistency with rules has limited efficacy; a learned ranking model can capture fine-grained style patterns.
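
The two style measurements described above can be illustrated with a small sketch. It makes several assumptions that are not from the paper: the tokenizer is a simple regex, TTR is reported on a 0 to 1 scale (the scale behind the 8.14 and 24.23 figures is not stated here), and the per-token log-probabilities of the response given the instruction are assumed to be supplied by some language model. All function names are hypothetical.

```python
import math
import re
from statistics import pstdev
from typing import Sequence


def type_token_ratio(text: str) -> float:
    """Unique tokens divided by total tokens (simple lowercase word tokenizer)."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


def perplexity(token_logprobs: Sequence[float]) -> float:
    """PPL(y|x) = exp(-mean_t log p(y_t | x, y_<t)).

    `token_logprobs` are natural-log probabilities of the response tokens
    conditioned on the instruction; obtaining them from a model is assumed.
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def consistency(values: Sequence[float]) -> float:
    """Population standard deviation; lower means more consistent."""
    return pstdev(values)


if __name__ == "__main__":
    responses = [
        "Use a loop to sum the list.",
        "Use a loop to reverse the list.",
        "Well, it depends, but honestly I would just sort it twice.",
    ]
    ttr_std = consistency([type_token_ratio(r) for r in responses])
    print(f"TTR std: {ttr_std:.3f}")
    print(f"PPL of a uniform 0.5-probability response: {perplexity([math.log(0.5)] * 4):.1f}")
```

Comparing `consistency(...)` over a set of LLM-generated responses against the same statistic over human-written responses mirrors the kind of comparison the summary describes; the remaining five linguistic-form metrics would be handled the same way.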
Loss & Training
The ranking loss is based on the Bradley-Terry model: \(\mathcal{L} = -\log\sigma(r(x, y_w) - r(x, y_l))\), where \(y_w\) is the style-consistent LLM-generated response and \(y_l\) is the style-inconsistent human-written response. Representation learning utilizes a contrastive loss to disentangle the embedding representations of linguistic form and surprisal.
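As a minimal sketch of the ranking loss above, the per-pair term \(-\log\sigma(r(x, y_w) - r(x, y_l))\) can be computed in a numerically stable way as a softplus of the negated margin. The function names are hypothetical, averaging over a batch is an assumption, and the contrastive representation-learning term is omitted because its exact form is not given here.

```python
import math
from typing import Iterable, Tuple


def bt_pair_loss(r_w: float, r_l: float) -> float:
    """Bradley-Terry loss for one pair: -log(sigmoid(r_w - r_l)).

    Uses the identity -log(sigmoid(m)) = softplus(-m)
    = max(-m, 0) + log1p(exp(-|m|)), which avoids overflow for large |m|.
    """
    margin = r_w - r_l
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def bt_batch_loss(score_pairs: Iterable[Tuple[float, float]]) -> float:
    """Mean pairwise loss over (r_w, r_l) score pairs (averaging is an assumption)."""
    pairs = list(score_pairs)
    if not pairs:
        raise ValueError("need at least one score pair")
    return sum(bt_pair_loss(w, l) for w, l in pairs) / len(pairs)


if __name__ == "__main__":
    # r_w: score of the style-consistent response; r_l: score of the inconsistent one.
    print(bt_pair_loss(2.0, 0.0))  # small loss: preferred response already ranked higher
    print(bt_pair_loss(0.0, 2.0))  # larger loss: ranking is inverted
```

The loss shrinks toward zero as the ranker scores \(y_w\) further above \(y_l\), and grows roughly linearly in the margin when the order is inverted.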
Key Experimental Results
Main Results
Code Domain (StackExchange \(\rightarrow\) CodeLlama-7b, HumanEval Pass@1):
| Method | Data Size | Pass@1 ↑ |
|---|---|---|
| Full Dataset Training | 10,000 (100%) | 26.56 |
| SCAR Selection | 70 (0.7%) | 31.00 |
| AlpaGasus (GPT-4 Quality Ranking) | 2,500 (25%) | 28.05 |
| Random | 70 (0.7%) | 22.51 |
Open QA Domain (LIMA \(\rightarrow\) Meta-Llama-3-8B, AlpacaEval LC WinRate):
| Method | Data Size | LC WinRate ↑ |
|---|---|---|
| Full Dataset Training | 1,000 (100%) | 1.93 |
| GPT-3.5-turbo Direct | 1,000 (100%) | 5.67 |
| SCAR Selection from Mixed Data | 250 (25%) | 4.82 |
Ablation Study
| Configuration | HumanEval Pass@1 | Note |
|---|---|---|
| SCAR (Full) | 31.00 | Linguistic Form + Surprisal |
| Linguistic Form Only | 29.26 | Without surprisal |
| Surprisal Only | 27.80 | Without linguistic form |
| Without Representation Learning | 28.90 | Without contrastive learning disentanglement |
Interaction between style consistency and data quality:
| Data Source | TTR Std ↓ | PPL Std ↓ | Quality | Performance |
|---|---|---|---|---|
| GPT-3.5 Direct | 8.14 | 0.30 | 3.32/3.45 | 31.00/47.12 |
| GPT-3.5 Referenced | 8.16 | 0.33 | 3.44/3.70 | 29.82/46.89 |
| Human Responses | 24.23 | 0.33 | 3.29/3.70 | 26.56/41.63 |
| Llama2-13b Direct | 12.76 | 0.36 | 2.44/2.50 | 22.14/33.84 |
Key Findings
- 0.7% of data can outperform the full dataset: On code tasks, a model trained on only 70 SCAR-selected examples (31.00 Pass@1) outperforms a model trained on all 10,000 examples (26.56 Pass@1).
- Style consistency is more critical than data quality: When quality is comparable, style consistency determines the performance gap; however, extremely low quality (such as hallucinated responses from Llama2-13b) will offset the advantages of style consistency.
- Linguistic form and surprisal are complementary: Both contribute to performance improvement, and removing either degrades effectiveness.
- LLM-generated responses are inherently style-consistent: The standard deviation of GPT-3.5 response styles is much lower than that of humans, explaining why fine-tuning with LLM-generated data typically yields better results.
Highlights & Insights
- First to decompose "style consistency" into two actionable dimensions: linguistic form and instructional surprisal, forming a complete closed loop from concept to quantification to automated selection.
- Quantitative validation of the "Superficial Alignment Hypothesis": While LIMA qualitatively proposed that SFT primarily learns style, this work quantitatively demonstrates this through experiments and identifies which style dimensions are crucial.
- Transfer value: SCAR's data selection methodology can be directly applied to SFT data curation in any domain, making it particularly suitable for scenarios with large but mixed-quality datasets.
Limitations & Future Work
- The experiments are relatively small-scale (StackExchange 10K, LIMA 1K), and performance on larger datasets remains to be verified.
- It has only been evaluated in two domains, code and open QA; domain generalizability requires further experimentation.
- The ranking model itself requires training data (LLM-generated vs. human-written), necessitating extra preparation for new scenarios without existing contrastive data.
- The impact of style consistency on multi-turn conversations or reasoning tasks is not discussed.
Related Work & Insights
- vs LIMA: LIMA relies on human experts to ensure the style consistency of its 1,000 examples, whereas SCAR automates this process; in the code-domain experiment it needs only 0.7% of the data (70 of 10,000 examples).
- vs AlpaGasus: AlpaGasus uses GPT-4 to rank data by quality, while SCAR additionally accounts for style consistency; in the code-domain results above, SCAR with 0.7% of the data (31.00 Pass@1) outperforms AlpaGasus with 25% of the data (28.05 Pass@1).
- vs Deita/Cherry LLM: These methods focus on data diversity and quality, whereas SCAR complements them by introducing the orthogonal dimension of style consistency.
Rating
- Novelty: ⭐⭐⭐⭐ Systematically defines and quantifies the impact of response style on SFT for the first time, offering deep insights.
- Experimental Thoroughness: ⭐⭐⭐⭐ Clear multidimensional analysis, though dataset scale and task coverage could be broader.
- Writing Quality: ⭐⭐⭐⭐⭐ Progressive research pipeline, smoothly transitioning from observation to definition, methodology, and verification.
- Value: ⭐⭐⭐⭐ Provides direct guidance for SFT data curation practices.