Geometric Data Valuation via Leverage Scores¶
Conference: NeurIPS 2025 arXiv: 2511.02100 Code: GitHub Area: Model Compression / LLM Efficiency Keywords: Data Valuation, Shapley Values, Leverage Scores, Active Learning, Experimental Design
TL;DR¶
This paper proposes a geometric data valuation method that uses statistical leverage scores as an efficient proxy for Data Shapley values. The method satisfies the Shapley axioms of symmetry, efficiency, and dummy player, extends to ridge leverage scores to overcome the dimensionality saturation problem, and comes with an \(O(\varepsilon)\)-approximation guarantee for models trained on sampled subsets.
Background & Motivation¶
Data valuation aims to quantify the contribution of each training data point to model performance, with important applications in data cleaning, subset selection, federated learning incentive allocation, and data market pricing. Data Shapley is the most theoretically rigorous approach, but its combinatorial computational complexity—requiring enumeration of marginal contributions over all subsets—renders it infeasible at scale.
Limitations of prior work:

- Computationally expensive, requiring repeated model retraining
- Requires access to model weights and gradients
- Primarily focused on data quality control in pretraining, neglecting uncertainty in data acquisition decisions
This paper approaches the problem from the perspective of geometry and numerical linear algebra, proposing a model-agnostic data valuation method.
Method¶
Overall Architecture¶
Core Idea: Leverage scores measure the structural importance of each data point in feature space—high-leverage points span unique directions and are thus valuable, while low-leverage points tend to be redundant.
Given a data matrix \(X \in \mathbb{R}^{n \times d}\), the leverage score of the \(i\)-th data point is defined as the \(i\)-th diagonal entry of the projection (hat) matrix \(H = X(X^TX)^{-1}X^T\):

\[\ell_i = H_{ii} = x_i^T (X^TX)^{-1} x_i\]
Normalizing yields the valuation function:

\[\pi_i = \frac{\ell_i}{\sum_{j=1}^{n} \ell_j} = \frac{\ell_i}{d}\]

(the second equality uses \(\text{Tr}(H) = \text{rank}(X) = d\) for full-rank \(X\)).
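A minimal NumPy sketch of this computation (illustrative, not from the paper): for a full-column-rank \(X\), the leverage scores are the squared row norms of the left factor of the thin SVD, which equals the diagonal of \(H\).

```python
import numpy as np

def leverage_scores(X: np.ndarray) -> np.ndarray:
    """Leverage scores: the diagonal of H = X (X^T X)^{-1} X^T.

    Via the thin SVD X = U S V^T we have H = U U^T when rank(X) = d,
    so ell_i = ||U[i, :]||^2. Numerically stabler than inverting X^T X.
    """
    U, _, _ = np.linalg.svd(X, full_matrices=False)
    return np.sum(U**2, axis=1)

X = np.random.randn(500, 20)
ell = leverage_scores(X)
pi = ell / ell.sum()                       # normalized valuations pi_i
assert np.isclose(ell.sum(), X.shape[1])   # Tr(H) = d, so sum(pi) = 1
```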
Key Designs¶
Theorem 1 (Shapley Axiom Satisfaction): When \(\text{rank}(X) = d\), the normalized leverage scores \(\pi_i\) satisfy the following axioms under the utility function \(U(S) = \dim(\text{span}\{x_i: i \in S\})\):

- Symmetry: data points with equivalent contributions receive equal valuations
- Efficiency: all valuations sum to 1 (since \(\text{Tr}(H) = d\), normalization yields unit sum)
- Dummy Player: data points that do not increase the subspace dimension receive zero valuation
Dimensionality Saturation: Standard leverage scores assign zero marginal value to all new data points once the subspace reaches full rank \(d\), which is a significant practical limitation.
Ridge Leverage Extension: regularization is introduced to resolve saturation:

\[\ell_i^{(\lambda)} = x_i^T (X^TX + \lambda I)^{-1} x_i\]
Key properties of ridge leverage scores:

- For any \(\lambda > 0\), \(\ell_i^{(\lambda)} \in (0,1)\), and the statistical dimension \(k_\lambda = \sum_i \ell_i^{(\lambda)}\) lies strictly in \((0, d)\)
- Even when \(\text{rank}(X_S) = d\), newly added data points retain positive value, as regularization shrinks the inverse matrix without eliminating marginal gains
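A hedged sketch of the ridge variant (the function name and solve-based formulation are mine, not the paper's); it also checks the two properties above:

```python
import numpy as np

def ridge_leverage_scores(X: np.ndarray, lam: float) -> np.ndarray:
    """Ridge leverage scores ell_i = x_i^T (X^T X + lam I)^{-1} x_i."""
    d = X.shape[1]
    A = X.T @ X + lam * np.eye(d)
    Z = np.linalg.solve(A, X.T).T            # row i holds A^{-1} x_i
    return np.einsum("ij,ij->i", X, Z)       # x_i^T A^{-1} x_i, in O(n d^2)

X = np.random.randn(500, 20)
ell = ridge_leverage_scores(X, lam=1.0)
k_lam = ell.sum()                            # statistical dimension k_lambda
assert np.all((ell > 0) & (ell < 1)) and 0 < k_lam < X.shape[1]
```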
Connection to Classical Experimental Design (writing \(A = X^TX + \lambda I\) and \(\ell^{(\lambda)}(x) = x^T A^{-1} x\)):

- D-optimality marginal gain: \(\log\det(A + xx^T) - \log\det(A) = \log(1 + \ell^{(\lambda)}(x)) > 0\)
- A-optimality marginal gain: \(\text{Tr}((A+xx^T)^{-1}) - \text{Tr}(A^{-1}) = -\|A^{-1}x\|_2^2 / (1+\ell^{(\lambda)}(x)) < 0\)
The marginal gains of both classical criteria are monotone functions of the ridge leverage score.
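Both identities follow from the matrix determinant lemma and the Sherman–Morrison formula, and are easy to check numerically. The snippet below is an illustrative verification, not code from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 10))
x = rng.standard_normal(10)
A = X.T @ X + 0.5 * np.eye(10)          # A = X^T X + lambda I, lambda = 0.5
Ainv = np.linalg.inv(A)
ell = x @ Ainv @ x                      # ridge leverage of candidate point x

# D-optimality: log det(A + x x^T) - log det(A) = log(1 + ell) > 0
d_gain = np.linalg.slogdet(A + np.outer(x, x))[1] - np.linalg.slogdet(A)[1]
assert np.isclose(d_gain, np.log1p(ell))

# A-optimality: Tr((A + x x^T)^{-1}) - Tr(A^{-1}) = -||A^{-1} x||^2 / (1 + ell) < 0
a_gain = np.trace(np.linalg.inv(A + np.outer(x, x))) - np.trace(Ainv)
assert np.isclose(a_gain, -np.sum((Ainv @ x) ** 2) / (1 + ell))
```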
Theorem 3 (\(\varepsilon\)-Approximation Guarantee): Sampling \(m \geq C \frac{k_\lambda + \log(2d/\delta)}{\varepsilon^2}\) points according to ridge leverage probabilities yields, with probability \(\geq 1-\delta\),

\[(1-\varepsilon)\,A \;\preceq\; A_S \;\preceq\; (1+\varepsilon)\,A,\]

where \(A = X^TX + \lambda I\) and \(A_S\) is the corresponding (reweighted) Gram matrix of the sampled subset.
That is, the model parameters and prediction risk obtained from training on the sampled subset are \(O(\varepsilon)\)-close to those of the full-data model.
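A simplified sketch of the sampling scheme behind Theorem 3, assuming the standard matrix-Chernoff construction (sampling with replacement and reweighting rows by \(1/\sqrt{m p_i}\) so that \(\mathbb{E}[X_S^T X_S] = X^TX\)); the paper's exact reweighting may differ:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((2000, 20))
lam, m = 1.0, 400
d = X.shape[1]

# Ridge leverage scores and sampling probabilities p_i = ell_i / k_lambda
A = X.T @ X + lam * np.eye(d)
ell = np.einsum("ij,ij->i", X, np.linalg.solve(A, X.T).T)
p = ell / ell.sum()

# Sample m rows with replacement, reweight so E[X_S^T X_S] = X^T X
idx = rng.choice(len(p), size=m, replace=True, p=p)
X_S = X[idx] / np.sqrt(m * p[idx])[:, None]
A_S = X_S.T @ X_S + lam * np.eye(d)

# Spectral check: generalized eigenvalues of (A_S, A) should lie in [1-eps, 1+eps]
L = np.linalg.cholesky(A)
Linv = np.linalg.inv(L)
eigs = np.linalg.eigvalsh(Linv @ A_S @ Linv.T)
print(eigs.min(), eigs.max())
```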
Loss & Training¶
The proposed method does not involve conventional training procedures—valuations are obtained via closed-form computation. In active learning experiments, ridge leverage scores are used to select the most informative samples for annotation, requiring neither gradients nor backpropagation.
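A hypothetical sketch of one selection round under these assumptions: `Z_pool` holds penultimate-layer embeddings of the unlabeled pool, the adaptive \(\lambda\) follows the setting reported in the ablation below, and top-\(k\) selection by ridge leverage stands in for the paper's selection rule (which may instead sample proportionally):

```python
import numpy as np

def select_batch(Z_pool: np.ndarray, batch_size: int = 5) -> np.ndarray:
    """One active-learning round: rank unlabeled points by ridge leverage.

    Z_pool: penultimate-layer embeddings of the unlabeled pool (n x 64 for
    the MNIST MLP). Hypothetical helper, not the paper's reference code.
    """
    d = Z_pool.shape[1]
    lam = 0.01 * np.trace(Z_pool.T @ Z_pool) / 64   # adaptive lambda setting
    A = Z_pool.T @ Z_pool + lam * np.eye(d)
    V = np.linalg.solve(A, Z_pool.T).T
    ell = np.einsum("ij,ij->i", Z_pool, V)          # ridge leverage scores
    return np.argsort(ell)[-batch_size:]            # indices to annotate

# Usage: each of the 40 rounds labels the 5 highest-leverage pool points,
# with Z_pool recomputed from the current model's penultimate layer.
```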
Key Experimental Results¶
Main Results (Active Learning)¶
Active Learning on MNIST (3-layer MLP, 784→256→64→10): starting from an initial pool of 100 labeled samples, 40 active learning rounds are run with 5 samples selected per round.
| Method | Final Test Accuracy |
|---|---|
| Ridge Leverage (adaptive \(\lambda\)) | 0.846 ± 0.006 |
| K-center | ~0.82 |
| Margin | ~0.83 |
| Entropy | ~0.83 |
| EGL | ~0.81 |
| Random | ~0.80 |
Ridge leverage demonstrates a clear advantage after the initial warm-start phase and exhibits the lowest variance across trials.
Theoretical vs. Empirical Guarantees:
| Guarantee Type | Specific Result |
|---|---|
| Matrix approximation | \((1-\varepsilon)A \preceq A_S \preceq (1+\varepsilon)A\) |
| Parameter approximation | \(\|\hat{\theta} - \theta^\star\|_A \leq 4\varepsilon\|\theta_{\text{lin}}\|_A\) |
| Risk approximation | \(R(\hat{\theta}) - R(\theta^\star) \leq 8\varepsilon^2\|\theta_{\text{lin}}\|_A^2\) |
Ablation Study¶
As a workshop paper, no systematic ablation is included. However, the theoretical analysis examines the effect of different choices of \(\lambda\):

- \(\lambda = 0\): degenerates to standard leverage scores, subject to dimensionality saturation
- \(\lambda > 0\): resolves saturation; the adaptive setting \(\lambda = 0.01 \times \text{Tr}(X^TX)/64\) performs best
Key Findings¶
- Gradient-free data selection: Ridge leverage scores require neither model gradients nor backpropagation, relying solely on the geometric structure of the data
- Axiomatic guarantees: satisfies Shapley symmetry and efficiency; the ridge extension deliberately relaxes the dummy player axiom, since redundant data still contributes to variance reduction
- Deep connection to experimental design: The marginal gains of A/D-optimality criteria are both governed by leverage scores
Highlights & Insights¶
- Novel perspective: leverage scores, a classical concept from numerical linear algebra, are reinterpreted as data valuation tools, bridging numerical linear algebra and cooperative game theory
- Computational efficiency: All leverage scores can be computed in \(O(nd^2)\) time, far below the exponential complexity of Shapley values
- Theoretical elegance: The \(\varepsilon\)-approximation theorem is based on matrix Chernoff concentration inequalities, with a compact proof
- Model-agnostic: Valuation depends only on the data matrix, requiring no model specification or training
Limitations & Future Work¶
- Workshop paper: Experiments are limited in scale (MNIST only); validation on large-scale datasets or LLM settings is absent
- Linearity assumption: Theoretical guarantees are grounded in ridge regression (linear models); extension to nonlinear settings requires further investigation
- Linearity axiom not satisfied: Normalized leverage scores do not satisfy the linearity axiom of Shapley values, limiting multi-task aggregation
- Embedding space dependency: Active learning experiments require penultimate-layer embeddings from a pretrained model; the method's effectiveness depends on embedding quality
- Comparisons with more advanced active learning methods (BALD, BatchBALD, LESS, TRAK) are not included
Related Work & Insights¶
- Data Shapley: The theoretical counterpart of this work; leverage scores provide an efficient though slightly weaker alternative
- A/D-Optimal Experimental Design: Ridge leverage naturally connects to classical criteria in optimal experimental design
- Randomized Numerical Linear Algebra: Leverage score sampling is widely used in randomized least-squares solvers
- Insights: Ridge leverage can be extended to feature spaces (e.g., kernel leverage scores) to handle nonlinear problems, or applied to training data selection for LLMs
Rating¶
- Novelty: ⭐⭐⭐⭐ — Distinctive geometric perspective with elegant connections
- Technical Depth: ⭐⭐⭐⭐ — Rigorous theoretical proofs with complete axiomatic verification
- Experimental Thoroughness: ⭐⭐ — Limited experiments as a workshop paper; MNIST only
- Practical Value: ⭐⭐⭐ — Computationally efficient, but applicable scenarios require further exploration
- Overall: ⭐⭐⭐⭐