From Data to Knowledge: Evaluating How Efficiently Language Models Learn Facts
Conference: ACL 2025
arXiv: 2506.16912
Code: https://github.com/Jabbawukis/sample-efficiency-evaluation
Area: LLM/NLP
Keywords: sample efficiency, fact learning, knowledge probing, pre-training, power law
TL;DR
This work is the first to directly investigate the relationship between how often a fact appears in pre-training data and an LLM's ability to recall it. It proposes two sample efficiency metrics and shows that models of different architectures and scales perform similarly on high-frequency facts but differ significantly on low-frequency facts, which makes the ability to learn low-frequency facts the key differentiator of sample efficiency.
Background & Motivation
Background
Background: LLMs store vast amounts of factual knowledge through pre-training, but their sample efficiency (how many times they need to see a fact before learning it) has not been systematically investigated.
Limitations of Prior Work: Information in real-world text follows a long-tail distribution, requiring models to learn rare facts from few occurrences. Existing works compare model performance without considering the frequency information in the training data.
Key Challenge: Given two models trained on the same data, which one is more capable of learning facts from limited exposure?
Goal: Establish a quantitative framework mapping factual frequency to recall capability to measure sample efficiency.
Key Insight: Train various models on the same pre-training corpus, annotate each fact with its frequency in the training data, and evaluate recall using the BEAR probe.
Core Idea: Model sample efficiency as fact recall probability expressed as a function of the number of training exposures; the larger the fitted power-law exponent \(\alpha_m\), the more sample-efficient the model.
Method
Overall Architecture
Wikipedia corpus fact frequency statistics -> pre-train multiple models on the same corpus -> BEAR knowledge probing -> analysis by frequency bins -> compute weighted accuracy and power-law fitted sample efficiency metrics.
Key Designs
- Fact Frequency Statistics
  - For each fact triple (s, r, o) in the BEAR probe, count the sentences in the training corpus in which s and o co-occur.
  - Use aliases and lemmatization to increase matching rates.
  - Design Motivation: Estimate the number of times the model "sees" a specific fact during pre-training.
- Two Sample Efficiency Metrics
  - Weighted Accuracy: Bin facts by frequency and weight low-frequency bins more heavily, with \(w_i = \exp(-0.05 \cdot l_i)\), so the weight decays exponentially as \(l_i\) grows.
  - Power-Law Fitting \(\alpha_m\): \(F(x) = 1 - \left(L_0 + \frac{x_0}{(1+x)^{\alpha_m}}\right)\), where \(x\) is the fact frequency and a larger \(\alpha_m\) indicates higher sample efficiency.
  - Design Motivation: Weighted accuracy is intuitive but difficult to compare, whereas \(\alpha_m\) provides a single comparable metric.
- Model Training
  - Train on ~5B tokens of Wikipedia.
  - Three architectures \(\times\) two scales = 6 models.
  - Save intermediate checkpoints to track learning dynamics.
  - Design Motivation: Control the training data variable and isolate the impact of architecture and scale.
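The fact frequency statistics step described above can be illustrated with a minimal Python sketch. It assumes a naive regex sentence splitter and case-insensitive whole-word alias matching. The function names and the toy corpus are hypothetical, lemmatization is omitted, and the authors' actual pipeline (see the linked repository) may differ.

```python
import re


def split_sentences(text):
    """Naive sentence splitter (assumption: split after ., ! or ?)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def contains_alias(sentence, aliases):
    """True if any alias occurs in the sentence as a whole word or phrase."""
    lowered = sentence.lower()
    return any(
        re.search(r"\b" + re.escape(alias.lower()) + r"\b", lowered)
        for alias in aliases
    )


def count_fact_occurrences(corpus, subject_aliases, object_aliases):
    """Count sentences in which a subject alias and an object alias co-occur."""
    count = 0
    for document in corpus:
        for sentence in split_sentences(document):
            if contains_alias(sentence, subject_aliases) and contains_alias(
                sentence, object_aliases
            ):
                count += 1
    return count


# Hypothetical toy corpus for the fact (France, capital, Paris).
toy_corpus = [
    "Paris is the capital of France. It lies on the Seine.",
    "The French Republic has its capital in Paris.",
    "France borders Spain.",
]
occurrences = count_fact_occurrences(
    toy_corpus, ["France", "French Republic"], ["Paris"]
)
print(occurrences)
```

Only two sentences in the toy corpus mention both a subject alias and the object, so the third document does not contribute to the count. A sentence that mentions both entities without stating the relation would still be counted, which is one source of imprecision in this kind of estimate.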
Key Experimental Results
Main Results -- Accuracy by Frequency Bins
| Frequency Bins | Model A (Large) | Model A (Small) | Model B (Large) | Model B (Small) |
|---|---|---|---|---|
| 1-5 times | ~25% | ~15% | ~20% | ~12% |
| 6-20 times | ~40% | ~30% | ~35% | ~25% |
| 21-100 times | ~60% | ~50% | ~55% | ~45% |
| 100+ times | ~75% | ~70% | ~73% | ~68% |
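To make the weighted accuracy metric concrete, the sketch below applies the weights \(w_i = \exp(-0.05 \cdot l_i)\) to the approximate per-bin accuracies of Model A from the table above. Two points are assumptions rather than details confirmed by this summary: \(l_i\) is taken to be the lower frequency bound of bin \(i\), and the weighted sum is normalized by the sum of the weights. The function name is hypothetical.

```python
import math


def weighted_accuracy(bin_lower_bounds, bin_accuracies, decay=0.05):
    """Weighted mean of per-bin accuracies with weights exp(-decay * l_i).

    Assumptions: l_i is the lower frequency bound of bin i, and the result
    is normalized by the sum of the weights.
    """
    weights = [math.exp(-decay * lower) for lower in bin_lower_bounds]
    weighted_sum = sum(w * acc for w, acc in zip(weights, bin_accuracies))
    return weighted_sum / sum(weights)


# Lower bounds of the bins 1-5, 6-20, 21-100, 100+ (assumed reading of l_i).
lower_bounds = [1, 6, 21, 100]

# Approximate accuracies read off the table above.
model_a_large = [0.25, 0.40, 0.60, 0.75]
model_a_small = [0.15, 0.30, 0.50, 0.70]

wa_large = weighted_accuracy(lower_bounds, model_a_large)
wa_small = weighted_accuracy(lower_bounds, model_a_small)
print(f"Model A (Large): {wa_large:.3f}")
print(f"Model A (Small): {wa_small:.3f}")
```

Because the weight of the highest-frequency bin is close to zero under these assumptions, the score is dominated by the low-frequency bins and sits well below the unweighted mean of the four bin accuracies.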
Sample Efficiency Metric \(\alpha_m\)
| Model | \(\alpha_m\) | Description |
|---|---|---|
| Large Model A | 0.35 | Most efficient |
| Large Model B | 0.30 | |
| Small Model A | 0.25 | |
| Small Model B | 0.20 | Least efficient |
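One way to obtain an \(\alpha_m\) value like those in the table is to fit \(F(x) = 1 - (L_0 + x_0 / (1+x)^{\alpha_m})\) to frequency and accuracy pairs. The sketch below is not the authors' fitting procedure. It uses a simple grid search over \(\alpha_m\): for a fixed exponent the model is linear in \(L_0\) and \(x_0\), so those two parameters are solved by ordinary least squares. The function names, the grid, and the synthetic data are all hypothetical.

```python
def power_law(x, l0, x0, alpha):
    """F(x) = 1 - (L_0 + x_0 / (1 + x)^alpha)."""
    return 1.0 - (l0 + x0 / (1.0 + x) ** alpha)


def fit_power_law(freqs, accs, alphas=None):
    """Grid search over alpha; closed-form least squares for L_0 and x_0."""
    if alphas is None:
        alphas = [i / 100 for i in range(1, 301)]  # 0.01 .. 3.00
    n = len(freqs)
    mean_y = sum(accs) / n
    best = None
    for alpha in alphas:
        # With alpha fixed: acc = (1 - L_0) - x_0 * g, where g = (1 + x)^-alpha.
        g = [(1.0 + x) ** -alpha for x in freqs]
        mean_g = sum(g) / n
        var_g = sum((gi - mean_g) ** 2 for gi in g)
        if var_g == 0.0:
            continue
        cov = sum((gi - mean_g) * (yi - mean_y) for gi, yi in zip(g, accs))
        slope = cov / var_g                  # equals -x_0
        intercept = mean_y - slope * mean_g  # equals 1 - L_0
        l0, x0 = 1.0 - intercept, -slope
        sse = sum(
            (power_law(x, l0, x0, alpha) - y) ** 2 for x, y in zip(freqs, accs)
        )
        if best is None or sse < best[0]:
            best = (sse, l0, x0, alpha)
    if best is None:
        raise ValueError("need at least two distinct frequencies")
    return best[1], best[2], best[3]


# Synthetic, noise-free data generated from known parameters (not paper data).
freqs = [0, 1, 2, 5, 10, 20, 50, 100, 500, 1000]
accs = [power_law(x, 0.2, 0.7, 0.35) for x in freqs]

l0_hat, x0_hat, alpha_hat = fit_power_law(freqs, accs)
print(f"L0={l0_hat:.3f}, x0={x0_hat:.3f}, alpha={alpha_hat:.2f}")
```

On real probing results the accuracies are noisy, so a continuous optimizer or a finer grid may be preferable. The grid search is used here only because it keeps the example dependency-free.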
Key Findings
- Performance gaps are small on high-frequency facts (almost all models can learn them) but large on low-frequency facts.
- Larger models are more sample-efficient: \(\alpha_m\) increases with scale.
- Architectural variations are most pronounced on low-frequency facts.
- Power-law functions fit the frequency-accuracy relationship well.
- Sample efficiency gradually improves during the training process (validated through checkpoint tracking).
Highlights & Insights
- First to directly relate fact frequency in the training data to knowledge probing results, filling a gap in research on "training data characteristics \(\to\) model behavior."
- Power-law fitting provides an elegant, single metric \(\alpha_m\) to compare model efficiency.
- The discovery of low-frequency facts as a key differentiator has direct implications for training data strategies.
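Limitations & Future Work
- The frequency estimate (co-occurrence of subject and object in the same sentence) may be imprecise: a sentence can mention both entities without expressing the relation, and a fact stated across several sentences is missed.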
- Models are only trained on Wikipedia, which may not represent diverse pre-training corpora.
- Future Directions: More precise frequency estimation and analysis on larger-scale pre-training data.
Related Work & Insights
- vs Kandpal et al. (2023): They discovered that LLMs perform poorly on low-frequency entities, while this work quantifies "how much worse" they perform.
- vs Neural Scaling Laws (Kaplan et al.): They study the power-law relationship of loss with respect to data volume, whereas this work studies the power-law relationship of fact recall probability with respect to frequency.
Rating
- Novelty: ⭐⭐⭐⭐⭐ First to establish a quantitative framework mapping factual frequency to recall capability
- Experimental Thoroughness: ⭐⭐⭐⭐ Sufficient variable control, covering multiple architectures and scales
- Writing Quality: ⭐⭐⭐⭐ Elegantly designed metrics
- Value: ⭐⭐⭐⭐ Provides important insights for pre-training data strategies and model comparison