LADM: Long-context Training Data Selection with Attention-based Dependency Measurement for LLMs

Conference: ACL 2025
arXiv: 2503.02502
Code: ZNLP/LADM
Area: LLM Efficiency
Keywords: long-context modeling, data selection, attention mechanism, contextual dependency, continual pre-training

TL;DR

LADM proposes an attention-based long-context training data selection framework. By training a small-scale Long Attention Calculator to compute attention dependency scores between spans (PFS → AFS → CDS), it efficiently screens high-quality samples with strong long-range dependency from large-scale corpora for continual pre-training. Using only 1B tokens, it significantly enhances the long-context capabilities of LLMs.


Background & Motivation

Background: Long-context modeling is a hot topic in the LLM field, with GPT-4 supporting 128K and Gemini 1.5 supporting 1M tokens. Continual pre-training on long-context data is the standard approach to endow LLMs with long-input processing capabilities. However, measuring the quality of training data has not received sufficient attention.

Limitations of Prior Work: (1) Simple concatenation of short texts results in a lack of cross-segment dependencies within samples, causing models to ignore long-range information; (2) ProLong (Chen et al., 2024a) segments samples to compute delta perplexity but ignores the intrinsic structures and relationships in the full context; (3) Similarity-based document aggregation methods (Staniszewski et al., 2023) cannot accurately measure the dependency strength within existing samples.

Core Motivation: The attention mechanism has an inherent retrieval capability: it assigns higher weights to historical tokens that are more relevant to the current token. This property can be used to measure cross-segment dependencies in long contexts.


Method

Overall Architecture

LADM consists of four steps: (1) training a Long Attention Calculator → (2) calculating the Pairwise Focus Score (PFS) → (3) aggregating into the Aggregated Focus Score (AFS) → (4) combining into the Contextual Dependency Score (CDS) for sample ranking and selection.

Key Designs

1. Long Attention Calculator: A small long-context model built on TinyLlama-1.1B and trained on 5B randomly sampled tokens to handle a 32K context window. Validation shows that its attention scores can distinguish samples with different dependency strengths.

2. Pairwise Focus Score (PFS): Measures the attention dependency of the \(j\)-th span in a sample on the \(i\)-th span:

\[ \text{PFS}(i,j) = \text{Sum}\left(\text{Softmax}\left(\frac{Q_j K_{0:j}^{T}}{\sqrt{d_k}}\right)[:, i]\right) \]

This is the cumulative attention weight assigned from span \(j\) to span \(i\).
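A minimal sketch of the PFS computation on a toy attention matrix may help make the indexing concrete. It assumes a single attention head, causal masking, equal-length spans, and logits that are already scaled by \(1/\sqrt{d_k}\); the function names and the `span_len` parameter are illustrative and not taken from the LADM code.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of logits."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def pairwise_focus_score(scores, span_len, i, j):
    """Attention mass that query tokens in span j place on key tokens in span i.

    scores: square list-of-lists of attention logits (assumed pre-scaled).
    Spans are assumed to be contiguous blocks of span_len tokens.
    """
    if not 0 <= i <= j:
        raise ValueError("span i must not come after span j")
    k_start, k_end = i * span_len, (i + 1) * span_len
    total = 0.0
    for q in range(j * span_len, (j + 1) * span_len):
        # Causal attention: token q only sees keys 0..q.
        weights = softmax(scores[q][: q + 1])
        total += sum(weights[k_start : min(k_end, q + 1)])
    return total

# Toy example: 4 tokens, 2 spans of 2 tokens, uniform logits.
uniform = [[0.0] * 4 for _ in range(4)]
pfs_01 = pairwise_focus_score(uniform, span_len=2, i=0, j=1)
pfs_11 = pairwise_focus_score(uniform, span_len=2, i=1, j=1)
```

Because each query row sums to 1 after the softmax, the PFS values for a fixed span \(j\) sum to the span length over all \(i \le j\), which gives a quick sanity check.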

3. Aggregated Focus Score (AFS): For each span, aggregates its PFS with all preceding spans, excluding the first \(m\) and local \(n\) spans (to avoid biases from initial tokens and nearby neighbors), and incorporates distance and variance weighting:

\[ \text{AFS}(j) = \sigma_j \sum_{i=m}^{j-n-1} \frac{j-i}{N} \cdot \text{PFS}(i,j) \]

Sample-level Scoring

The Contextual Dependency Score (CDS) is obtained through a weighted sum of the AFS of all spans:

\[ \text{CDS}(S) = \sum_{j=n_0}^{N-1} \frac{j}{N} \cdot \text{AFS}(j) \]

Spans at later positions are assigned higher weights, encouraging long-range dependency.
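The sample-level score and the subsequent ranking could look roughly like this. The CDS function follows the formula above; the greedy, budget-limited selection and the `(sample_id, cds, num_tokens)` tuple layout are assumptions made for illustration, since the summary only states that samples are ranked by CDS and selected.

```python
def contextual_dependency_score(afs, n0):
    """CDS(S): position-weighted sum of AFS(j) for j = n0 .. N-1."""
    num_spans = len(afs)
    return sum(j / num_spans * afs[j] for j in range(n0, num_spans))

def select_by_cds(samples, token_budget):
    """Pick the highest-CDS samples that fit in a token budget.

    samples: list of (sample_id, cds, num_tokens) tuples (hypothetical layout).
    """
    ranked = sorted(samples, key=lambda s: s[1], reverse=True)
    chosen, used = [], 0
    for sample_id, _cds, num_tokens in ranked:
        if used + num_tokens > token_budget:
            continue  # skip samples that would exceed the budget
        chosen.append(sample_id)
        used += num_tokens
    return chosen

cds = contextual_dependency_score([0.0, 0.0, 1.0, 2.0], n0=2)
picked = select_by_cds([("a", 0.2, 10), ("b", 0.9, 10), ("c", 0.5, 10)], token_budget=20)
```

In the toy call, the last span contributes \(3/4 \cdot 2.0\) against \(2/4 \cdot 1.0\) for the span before it, which shows how the \(j/N\) factor favors dependencies that appear late in the sample.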


Experiments

Main Results: Perplexity Evaluation

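| Model | Method | 2K | 4K | 8K | 16K | 32K |
| --- | --- | --- | --- | --- | --- | --- |
| L-7B | Random | 4.515 | 3.900 | 3.264 | 2.780 | 2.458 |
| L-7B | ProLong | 4.516 | 3.906 | 3.275 | 2.792 | 2.470 |
| L-7B | LADM | 4.481 | 3.878 | 3.252 | 2.773 | 2.453 |
| M-7B | Random | 4.620 | 3.936 | 3.267 | 2.775 | 2.455 |
| M-7B | ProLong | 4.293 | 3.696 | 3.095 | 2.644 | 2.346 |
| M-7B | LADM | 4.266 | 3.673 | 3.076 | 2.629 | 2.332 |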

LongBench Empirical Evaluation (Partial)

| Model | Method | Single-Doc QA Avg | Multi-Doc QA Avg |
| --- | --- | --- | --- |
| L-7B | Random | 30.13 | 25.80 |
| L-7B | ProLong | 29.50 | 29.04 |
| L-7B | LADM | 32.24 | 31.10 |
| M-7B | Random | 24.08 | 24.63 |
| M-7B | ProLong | 23.76 | 24.43 |
| M-7B | LADM | 33.85 | 29.09 |

Ablation Study & Analysis

| Analysis Dimension | Finding |
| --- | --- |
| Concatenated Data Length | Average retrieval accuracy of 4K concatenation → 32K is only 0.57, whereas native 32K data achieves 0.88 |
| Attention Score Validation | Long Attention Calculator successfully distinguishes samples of different dependency intensities |
| Training Efficiency | LADM with 0.5B tokens outperforms the Random method using 1B tokens |
| Needle-in-Haystack | L-7B/13B and M-7B achieve near 100% retrieval rate |

Key Findings

  • Dependency is Core: Training data using concatenated short texts lacks cross-segment dependencies, significantly degrading long-context modeling performance.
  • Attention as Dependency Probe: Attention distributions serve as an effective signal for measuring long-context dependency.
  • Data Efficiency Boost: LADM with half the training tokens (0.5B) outperforms random sampling trained on the full 1B tokens.
  • Cross-model Robustness: Consistently effective across four models: OpenLlama-3B, Llama2-7B, Llama2-13B, and Mistral-7B.

Highlights & Insights

  • Proposes a clear and actionable quality measurement framework for long-context data (PFS → AFS → CDS).
  • Innovatively leverages the intrinsic retrieval capabilities of the attention mechanism to measure contextual dependency.
  • Uses a low-cost TinyLlama as a data filter to achieve an efficient data selection pipeline.
  • Achieves a substantial improvement in long-context capabilities with only 1B tokens of continual pre-training.

Limitations & Future Work

  • Experiments are constrained to a context length of 32K tokens; applicability to longer contexts (64K/128K) has not been verified.
  • The Long Attention Calculator itself requires 5B tokens of training, which adds an upfront cost.
  • Data selection maintains the original domain distribution, without exploring potential benefits of adjusting the domain distribution.
  • Analysis of parameter sensitivity for multiple hyperparameters in the CDS score (\(m\), \(n\), \(d\), \(n_0\)) is currently lacking.

Related Work

  • Long-context Modeling: Training methods such as RoPE scaling (PI, YaRN, NTK), LongLoRA, S²-Attn, etc.
  • Long-context Data: ProLong (Chen et al., 2024a) based on delta perplexity for data filtering.
  • Document Aggregation: Similarity-based document composition by Staniszewski et al. (2023).
  • Attention Retrieval: Wu et al. (2024) revealing retrieval operations within the attention mechanism.

Rating

| Dimension | Rating |
| --- | --- |
| Novelty | ⭐⭐⭐⭐ |
| Technical Depth | ⭐⭐⭐⭐ |
| Experimental Thoroughness | ⭐⭐⭐⭐⭐ |
| Writing Quality | ⭐⭐⭐⭐ |
| Value | ⭐⭐⭐⭐⭐ |
| Overall Rating | 8/10 |