Mapping 1,000+ Language Models via the Log-Likelihood Vector

Conference: ACL 2025
arXiv: 2502.16173
Code: https://github.com/shimo-lab/modelmap
Area: LLM / NLP
Keywords: model mapping, log-likelihood vector, KL divergence, model clustering, benchmark prediction

TL;DR

This paper maps 1,000+ language models into a unified space using the log-likelihood vector, and shows that the squared Euclidean distance between vectors approximates the KL divergence between the models. The resulting map supports model clustering and visualization, benchmark performance prediction (Pearson's \(r=0.953\) and Spearman's \(\rho=0.960\) on the 6-task mean), and data leakage detection.

Background & Motivation

Background: The LLM ecosystem is growing explosively, with a vast number of model variants (base, fine-tuned, merged, etc.) on HuggingFace, yet systematic methods for comparing these models and understanding their relationships are lacking.

Limitations of Prior Work: Each existing approach falls short in its own way: classification by model name or attributes depends on metadata, comparison of output text lacks a theoretical foundation, and leaderboard rankings are discrete and reflect only specific task dimensions.

Key Challenge: Language models are essentially probability distributions and should be analyzed using the geometric structure of probability distributions rather than relying on arbitrary metrics.

Goal: Systematically compare 1,000+ models at scale with a theory-driven approach that reveals their relationships and structure.

Key Insight: Drawing on information geometry, the paper uses the log-likelihood values of each model on a fixed text set as coordinates and shows that distances in this representation approximate the KL divergence.

Core Idea: Each language model is represented by its log-likelihood vector over 10,000 text segments. The squared Euclidean distance between two such vectors approximates the KL divergence between the models, which yields a "model map."

Method

Overall Architecture

Given \(K\) models and \(N\) text segments, a \(K \times N\) log-likelihood matrix \(\mathbf{L}\) is computed. After double centering, the coordinate vector \(\mathbf{q}_i \in \mathbb{R}^N\) is obtained for each model. In this space, t-SNE visualization, clustering analysis, and regression for benchmark score prediction are performed.
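As an illustration of how a single entry of \(\mathbf{L}\) might be obtained, the sketch below sums next-token log-probabilities over one text segment. The input format (`step_logits`, `token_ids`) and the function names are assumptions made for this example, not the interface of the paper's code; in practice the logits would come from a forward pass of the model being mapped.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one list of vocabulary logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]


def text_log_likelihood(step_logits, token_ids):
    """Sum of log p(y_t | preceding tokens) over one text segment.

    step_logits[t] is the (hypothetical) list of next-token logits at
    position t; token_ids[t] is the token actually observed there.
    """
    return sum(
        log_softmax(logits)[tok]
        for logits, tok in zip(step_logits, token_ids)
    )


# Toy example: a vocabulary of 3 tokens and a text of 2 tokens.
step_logits = [[0.0, 0.0, 0.0], [2.0, 0.0, 0.0]]
token_ids = [1, 0]
ll = text_log_likelihood(step_logits, token_ids)
print(ll)
```

Repeating this for every model and every text segment would fill the \(K \times N\) matrix, one row per model.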

Key Designs

Log-Likelihood Vector Representation:
- Function: For each model \(p_i\) and each text segment \(x_s = (y_1, \dots, y_n)\), compute \(\ell_i(x_s) = \sum_{t=1}^n \log p_i(y_t | y^{t-1})\), where \(y^{t-1}\) denotes the tokens preceding \(y_t\), and stack the values into the vector \(\boldsymbol{\ell}_i \in \mathbb{R}^N\).
- Mechanism: This is the negative of the summed cross-entropy loss, so it requires no extra computation beyond an ordinary training or evaluation forward pass.
- Design Motivation: Log-likelihood is the most fundamental quantity of a probabilistic model and directly reflects how well the model fits each text segment.
Double Centering:
- Function: First perform row centering (subtract the mean log-likelihood \(\bar{\ell}_i\) of each model to eliminate differences in overall model capabilities), and then perform column centering (subtract the mean across models for each text segment to eliminate differences in text difficulty).
- Mechanism: \(\xi_{is} = \ell_i(x_s) - \bar{\ell}_i\), and then \(\mathbf{q}_i = \boldsymbol{\xi}_i - \bar{\boldsymbol{\xi}}\), where \(\bar{\boldsymbol{\xi}}\) is the mean of the row-centered vectors \(\boldsymbol{\xi}_i\) over the \(K\) models.
- Design Motivation: Row centering removes the overall perplexity differences caused by model scale (otherwise the map would mainly group models by size), while column centering removes the inherent difficulty differences between texts.
KL Divergence Approximation (Core Theory):
- Function: Show that, when the models are close to the true distribution \(p_0\), \(2 \text{KL}(p_i, p_j) \approx \text{Var}_{x \sim p_0}(\ell_i(x) - \ell_j(x))\).
- Mechanism: Estimating this variance from the \(N\) sampled texts gives \(2 \text{KL}(p_i, p_j) \approx \|\mathbf{q}_i - \mathbf{q}_j\|^2 / N\).
- Design Motivation: Convert distribution differences into Euclidean distances in vector space, making large-scale model comparison efficient and theoretically grounded.
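The double centering and distance steps above can be sketched in a few lines. This is a minimal standard-library illustration on a made-up \(3 \times 4\) matrix; the function names `double_center` and `two_kl_estimate` and the toy values are assumptions for this example, not taken from the paper's code.

```python
def double_center(L):
    """Row-center, then column-center a K x N matrix given as a list of lists."""
    K, N = len(L), len(L[0])
    # Row centering: subtract each model's mean log-likelihood.
    xi = []
    for row in L:
        row_mean = sum(row) / N
        xi.append([v - row_mean for v in row])
    # Column centering: subtract the per-text mean across models.
    col_mean = [sum(xi[i][s] for i in range(K)) / K for s in range(N)]
    return [[xi[i][s] - col_mean[s] for s in range(N)] for i in range(K)]


def two_kl_estimate(q_i, q_j):
    """Estimate of 2 * KL(p_i, p_j) as ||q_i - q_j||^2 / N."""
    n = len(q_i)
    return sum((a - b) ** 2 for a, b in zip(q_i, q_j)) / n


# Toy log-likelihood matrix: 3 models x 4 texts (made-up numbers).
# Model 1 equals model 0 shifted by a constant, so row centering
# should make the two indistinguishable.
L = [
    [-10.0, -12.0, -8.0, -14.0],
    [-11.0, -13.0, -9.0, -15.0],
    [-9.0, -15.0, -7.0, -13.0],
]
Q = double_center(L)
d01 = two_kl_estimate(Q[0], Q[1])
d02 = two_kl_estimate(Q[0], Q[2])
print(d01, d02)
```

In the toy data, models 0 and 1 differ only by a constant offset in log-likelihood, so their estimated distance is zero. This is the effect row centering is meant to have on differences in overall capability.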

Application Scenarios

Visualization: t-SNE reduces the coordinates to two dimensions to draw the "model map", on which models from the same family naturally cluster.
Performance Prediction: Ridge regression on \(\mathbf{q}_i\) predicts benchmark scores.
Data Leakage Detection: Each model's normalized average log-likelihood is compared with its benchmark scores; a log-likelihood that is anomalously high relative to the scores may indicate data leakage.
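To make the performance-prediction step concrete, here is a small ridge regression written with the standard library only. It is a sketch under stated assumptions: the helper names (`solve`, `ridge_fit`, `ridge_predict`), the two-dimensional toy features, and the toy scores are all invented for illustration, and the paper's actual regression setup (regularization strength, cross-validation, any dimensionality reduction) is not reproduced here.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x


def ridge_fit(X, y, lam):
    """Return (weights, intercept) minimizing ||y - Xw - b||^2 + lam * ||w||^2."""
    n, d = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(d)]
    y_mean = sum(y) / n
    Xc = [[row[j] - x_mean[j] for j in range(d)] for row in X]
    yc = [v - y_mean for v in y]
    # Normal equations on centered data: (Xc^T Xc + lam * I) w = Xc^T yc
    A = [
        [sum(r[a] * r[c] for r in Xc) + (lam if a == c else 0.0) for c in range(d)]
        for a in range(d)
    ]
    rhs = [sum(r[a] * t for r, t in zip(Xc, yc)) for a in range(d)]
    w = solve(A, rhs)
    intercept = y_mean - sum(wj * mj for wj, mj in zip(w, x_mean))
    return w, intercept


def ridge_predict(X, w, intercept):
    return [sum(wj * xj for wj, xj in zip(w, row)) + intercept for row in X]


# Toy data: 5 "models" with 2-dimensional coordinates (made up),
# and scores generated exactly as 2 * x1 - x2 + 5.
X = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
y = [5.0, 7.0, 4.0, 6.0, 8.0]
w, intercept = ridge_fit(X, y, 0.0)
w_reg, _ = ridge_fit(X, y, 10.0)
preds = ridge_predict(X, w, intercept)
print(w, intercept, w_reg)
```

With \(N = 10{,}000\) features and roughly 1,000 models, solving the \(N \times N\) system directly as above would be impractical; a library implementation or a reduced-dimension representation would be the more realistic choice.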

Key Experimental Results

Benchmark Performance Prediction (Ridge Regression)

Benchmark	Pearson's r	Spearman's ρ
ARC	0.946	0.948
HellaSwag	0.909	0.956
MMLU	0.932	0.934
TruthfulQA	0.901	0.884
Winogrande	0.941	0.948
GSM8K	0.884	0.857
6-Task Mean	0.953	0.960

Comparison: Prediction Directly Using Average Log-Likelihood (Perplexity)

Benchmark	Pearson's r	Spearman's ρ
ARC	0.453	0.432
MMLU	0.346	0.422
TruthfulQA	0.072	0.048
6-Task Mean	0.395	0.400

Key Findings

Log-likelihood vectors over 10,000 texts predict the mean score across the 6 benchmarks with a correlation of about 0.96 (Pearson's \(r=0.953\), Spearman's \(\rho=0.960\)), far above what the average log-likelihood alone achieves (\(r=0.395\), \(\rho=0.400\)).
Reaching a cumulative explained variance of 90% requires only 42 dimensions, and 95% requires only 82 dimensions, demonstrating that the effective dimensionality of differences between models is very low.
Models of the same family (Llama-2, Mistral, Gemma, etc.) cluster tightly on the map.
Code-specialized models possess unique features along the GitHub/StackExchange dimensions.
Weight interpolation experiments confirm that linear interpolation in weight space maintains a linear structure in log-likelihood space as well.
Theoretical validation: token-level KL approximation correlation is \(r=0.893\), and text-level correlation is \(r=0.904\).
The computation across 1,018 models requires only ~10 minutes on a single GPU.
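The approximation behind the theoretical-validation finding, \(2 \text{KL}(p_i, p_j) \approx \text{Var}_{x \sim p_0}(\ell_i(x) - \ell_j(x))\), can be checked numerically on a toy example. The sketch below uses small categorical distributions in place of language models; the distributions and function names are made up for illustration, and the check only shows that the two quantities agree closely when both models are near \(p_0\).

```python
import math


def kl(p, q):
    """KL divergence between two categorical distributions."""
    return sum(a * math.log(a / b) for a, b in zip(p, q))


def var_log_ratio(p0, p, q):
    """Variance under p0 of log p(x) - log q(x)."""
    r = [math.log(a / b) for a, b in zip(p, q)]
    mean = sum(w * v for w, v in zip(p0, r))
    return sum(w * (v - mean) ** 2 for w, v in zip(p0, r))


# "True" distribution and two models that are small perturbations of it.
p0 = [0.4, 0.3, 0.2, 0.1]
p_i = [0.41, 0.29, 0.21, 0.09]
p_j = [0.39, 0.31, 0.19, 0.11]

two_kl = 2 * kl(p_i, p_j)
variance = var_log_ratio(p0, p_i, p_j)
print(two_kl, variance)
```

The two printed values should be close but not identical. The gap grows as the models move further from \(p_0\), which is consistent with the approximation holding only under the closeness assumption.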

Highlights & Insights

Probability Theory-Driven Model Analysis: Unlike empirical leaderboard comparisons, this work approaches from the perspective of information geometry to provide theoretical guarantees for "model distance" (\(\approx\) KL divergence), which is its most elegant aspect.
The Design of Double Centering is Crucial: It eliminates two confounding factors—model scale and text difficulty—without which all analyses would be dominated by these two factors.
Data Leakage Detection is an Unexpected and Valuable Application: Models pre-trained on the Pile show log-likelihoods that are anomalously high relative to their benchmark scores, a mismatch that can be used for screening.
High Computational Efficiency: The method avoids generating text or running benchmarks; a single forward pass over the text set yields a model's coordinates in the entire space.

Limitations & Future Work

Reliance on the choice of the reference text set: different text sets might yield different model maps.
The premise of the KL divergence approximation (that models approximate the true distribution) does not entirely hold in practice.
Only models with 1-13B parameters were tested; the performance of larger models or novel architectures (such as MoEs) remains unknown.
Models that use different tokenizers cannot be compared directly (doing so requires token-level approximations).

vs Open LLM Leaderboard: Leaderboards only offer discrete rankings, whereas this work provides positions and distances in a continuous space, delivering substantially richer information.
vs Activation Space Comparison: Prior model comparisons relied on internal activations and required white-box access; this method only requires log-likelihoods, enabling black-box applicability.
This methodology can be extended to: selecting the most suitable model for a specific task (by locating the nearest neighbor on the map) and detecting whether a model is a fine-tuned version of a certain base model.

Rating

Novelty: ⭐⭐⭐⭐⭐ First large-scale theory-driven mapping of model space
Experimental Thoroughness: ⭐⭐⭐⭐⭐ 1,018 models + theoretical validation + demonstration of multiple applications
Writing Quality: ⭐⭐⭐⭐ Clear theoretical derivation, but highly dense content
Value: ⭐⭐⭐⭐⭐ Provides a foundational tool for understanding the LLM ecosystem