Predicting Kernel Regression Learning Curves from Only Raw Data Statistics
Conference: ICLR 2026
arXiv: 2510.14878
Code: https://github.com/JoeyTurn/hermite-eigenstructure-ansatz
Area: Theory / Learning Theory / Kernel Methods
Keywords: Kernel Regression Learning Curves, Hermite Eigenstructure, Anisotropic Data, Kernel Ridge Regression, Feature Learning
TL;DR
The authors propose the Hermite Eigenstructure Ansatz (HEA), which enables analytical prediction of learning curves (test error vs. sample size) for rotation-invariant kernels on real image datasets (CIFAR-5m, SVHN, ImageNet) using only two statistics: the data covariance matrix and the Hermite decomposition of the target function. The paper proves this ansatz holds for Gaussian data and demonstrates that MLPs in the feature-learning regime learn Hermite polynomials in the order predicted by HEA.
Background & Motivation
Background: Kernel Ridge Regression (KRR) serves as an important proxy for understanding neural networks (via NTK equivalence). An established eigenframework can predict test error from the eigensystem (eigenvalues and eigenfunctions) of a kernel relative to the data distribution.
Limitations of Prior Work: While the eigenframework is theoretically complete, practical application requires constructing and diagonalizing the kernel matrix to obtain the eigensystem, which is computationally expensive for high-dimensional real data and lacks analytical interpretability. More importantly, most existing theories rely on simplified data assumptions (e.g., isotropic spherical distributions) that do not apply to real anisotropic datasets.
Key Challenge: Real data distributions are too complex for full analytical descriptions, yet learning behavior is profoundly shaped by data structure. Predicting behavior on actual data requires a balance between a "parsimonious description of data" and "prediction accuracy."
Goal: (a) Can a "parsimonious description" of data be found that is simple enough to predict kernel regression learning behavior? (b) Can learning curves be predicted directly from data statistics without constructing the kernel matrix?
Key Insight: The authors observe that for Gaussian data, the eigenfunctions of rotation-invariant kernels are naturally multidimensional Hermite polynomials. Since real image data is "Gaussian enough" (coordinatewise marginals are approximately Gaussian), they hypothesize that this structure approximately holds for real data.
Core Idea: The eigensystem of a rotation-invariant kernel on anisotropic data is approximately the Hermite eigensystem: eigenfunctions are Hermite polynomials along the PCA directions of the data, and each eigenvalue is a monomial in the covariance eigenvalues multiplied by one of the kernel's hierarchy coefficients.
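As an illustration of "Hermite polynomials along the PCA directions", the sketch below evaluates one such feature with NumPy. It assumes mean-zero data, whitened PCA coordinates \(z = \Gamma^{-1/2} U^\top x\), and probabilists' Hermite polynomials scaled to unit variance under a standard Gaussian. The function name `hermite_feature` is hypothetical, and the paper's exact normalization may differ.

```python
import math
import numpy as np
from numpy.polynomial.hermite_e import hermeval

def hermite_feature(X, cov, alpha):
    """Evaluate a product of normalized Hermite polynomials on the rows of X.

    alpha[i] is the polynomial degree along the i-th largest principal
    direction of `cov`. Assumes X is mean-zero (an assumption of this sketch).
    """
    gammas, U = np.linalg.eigh(cov)            # eigh returns ascending order
    order = np.argsort(gammas)[::-1]           # sort directions by variance
    gammas, U = gammas[order], U[:, order]
    Z = (X @ U) / np.sqrt(gammas)              # whitened PCA coordinates
    out = np.ones(X.shape[0])
    for i, deg in enumerate(alpha):
        if deg:
            coef = np.zeros(deg + 1)
            coef[deg] = 1.0                    # selects He_deg
            out *= hermeval(Z[:, i], coef) / math.sqrt(math.factorial(deg))
    return out

# Toy check on synthetic anisotropic Gaussian data (not a real dataset).
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((3, 3)))
A = Q * np.sqrt([4.0, 1.0, 0.25])              # A = Q diag(sqrt(gamma))
cov = A @ A.T
X = rng.standard_normal((200_000, 3)) @ A.T
h_a = hermite_feature(X, cov, (2, 0, 0))
h_b = hermite_feature(X, cov, (1, 1, 0))
print(np.mean(h_a * h_a), np.mean(h_b * h_b), np.mean(h_a * h_b))
```

On Gaussian samples the printed second moments should be close to 1, 1, and 0, which is the orthonormality that makes these functions candidates for eigenfunctions.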
Method
Overall Architecture
To understand how KRR learns on a given dataset, the existing KRR eigenframework computes the test-error-versus-sample-size curve from three inputs: the kernel eigenvalues, the eigenfunctions, and the target's coefficients in that eigenbasis. Obtaining these inputs has traditionally required diagonalizing the kernel matrix. This paper supplies the missing step: given the functional form of the kernel, it derives the entire eigensystem analytically from two raw statistics, the data covariance matrix \(\Sigma = U \Gamma U^\top\) and the Hermite decomposition of the target function \(f_*\), without ever forming the kernel matrix. The pipeline has four stages. First, the kernel is expanded as a dot-product kernel on the sphere of typical data norm to obtain hierarchy coefficients \(c_\ell\). Second, HEA combines \((\Gamma, c_\ell)\) into the Hermite eigensystem \((\lambda_\alpha, \phi_\alpha)\). Third, and independently of the kernel, the target coefficients \(v_i\) are estimated from labels via Gram-Schmidt. Finally, \((\lambda_\alpha, v_i)\) are fed into the KRR eigenframework, which outputs the learning curve.
%%{init: {'flowchart': {'rankSpacing': 24, 'nodeSpacing': 28, 'padding': 6, 'wrappingWidth': 400}}}%%
flowchart TD
SIG["Raw Data<br/>Covariance Σ=UΓUᵀ"] --> HEA["Hermite Eigenstructure Ansatz (HEA)<br/>Eigenvalue=Hierarchy Coeff × Monomial of Cov Eigenvalues<br/>Eigenfunction=Hermite Polynomials on PCA directions"]
K["Rotation-Invariant Kernel K"] --> COEF["Spherical Hierarchy Coeff c_ℓ<br/>Expanded as dot-product kernel<br/>on typical norm sphere"]
COEF --> HEA
LAB["Labels for Target f*"] --> DEC["Target Hermite Decomposition<br/>Gram-Schmidt Orthogonalization → Coeff v_i"]
SIG --> DEC
HEA --> KRR["KRR Eigenframework<br/>Input (λ_α, v_i)"]
DEC --> KRR
KRR --> OUT["Learning Curve Prediction<br/>Test Error vs. Sample Size"]
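The final stage of the pipeline can be sketched as follows. The code uses the form of the KRR eigenframework commonly stated in the eigenlearning literature: an implicit equation for an effective regularization \(\kappa\), per-mode learnabilities \(\lambda_i/(\lambda_i+\kappa)\), and an overfitting factor. The summary above does not spell these equations out, so the formulas, the function names, and the power-law toy spectrum are assumptions of this sketch rather than the authors' code.

```python
import numpy as np

def solve_kappa(eigvals, n, ridge=0.0, iters=200):
    """Solve sum_i lam_i / (lam_i + kappa) + ridge / kappa = n by bisection.

    The left-hand side decreases in kappa. With ridge = 0 a solution exists
    only if n is smaller than the number of modes.
    """
    hi = (eigvals.sum() + ridge) / n           # left-hand side <= n here
    lo = hi * 1e-20                            # assumed to lie below the root
    for _ in range(iters):
        mid = np.sqrt(lo * hi)                 # bisect in log space
        lhs = np.sum(eigvals / (eigvals + mid)) + ridge / mid
        if lhs > n:
            lo = mid
        else:
            hi = mid
    return np.sqrt(lo * hi)

def predicted_test_mse(eigvals, coeffs, n, ridge=0.0, noise_var=0.0):
    """Predicted test MSE at sample size n from eigenvalues and target coefficients."""
    kappa = solve_kappa(eigvals, n, ridge)
    learn = eigvals / (eigvals + kappa)        # per-mode learnability in [0, 1)
    overfit = n / (n - np.sum(learn ** 2))     # overfitting factor
    return overfit * (np.sum((1.0 - learn) ** 2 * coeffs ** 2) + noise_var)

# Toy power-law spectrum and target, for illustration only.
k = np.arange(1, 2001)
lam, v = 1.0 / k ** 2, 1.0 / k
for n in (10, 100, 1000):
    print(n, predicted_test_mse(lam, v, n))
```

In the HEA pipeline, `eigvals` would be the \(\lambda_\alpha\) from the ansatz and `coeffs` the estimated \(v_i\), with both arrays ordered consistently.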
Key Designs
1. Hermite Eigenstructure Ansatz (HEA): Compressing the Eigensystem into an Analytical Formula
The most critical step is bypassing high-dimensional kernel matrix diagonalization. HEA asserts that for anisotropic data, the eigensystem has a closed-form: for any multi-index \(\alpha \in \mathbb{N}_0^d\), eigenvalues are \(\lambda_\alpha = c_{| \alpha |} \cdot \prod_{i=1}^d \gamma_i^{\alpha_i}\) and eigenfunctions are multidimensional Hermite polynomials \(h_\alpha^{(\Sigma)}\) on PCA directions, where \(c_\ell\) are hierarchy coefficients on the typical norm sphere and \(\gamma_i\) are covariance eigenvalues.
The authors substantiate this with two theorems for Gaussian data. Theorem 1 (Wide Gaussian Kernel) shows that the true eigensystem converges to the Hermite eigensystem as the kernel width \(\sigma \to \infty\). Theorem 2 (Fast-decaying Dot-product Kernel) uses perturbation theory to show linear convergence to HEA when \(c_{\ell+1} \leq \epsilon \cdot c_\ell\). HEA performs best when the hierarchy coefficients decay rapidly, the effective data dimension is high (\(d_\text{eff} \gg 1\)), and the data is "Gaussian enough". Interestingly, complex images often satisfy the last condition better than simple datasets like MNIST, due to central limit effects.
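The eigenvalue formula \(\lambda_\alpha = c_{|\alpha|} \prod_i \gamma_i^{\alpha_i}\) is easy to tabulate directly. The sketch below enumerates all multi-indices up to a maximum degree and sorts the resulting modes. The function name `hea_eigenvalues` is hypothetical. Restricting to a handful of top PCA directions is also an assumption made here for tractability, since the number of multi-indices grows combinatorially with dimension and degree.

```python
import itertools
import numpy as np

def hea_eigenvalues(gammas, level_coeffs):
    """Return [(alpha, lambda_alpha), ...] sorted by decreasing eigenvalue.

    gammas:        covariance eigenvalues gamma_i (one per PCA direction kept)
    level_coeffs:  level_coeffs[l] = c_l for l = 0..L
    """
    gammas = np.asarray(gammas, dtype=float)
    d = len(gammas)
    modes = []
    for ell, c in enumerate(level_coeffs):
        # Each multiset of ell direction indices is one multi-index with |alpha| = ell.
        for combo in itertools.combinations_with_replacement(range(d), ell):
            alpha = [0] * d
            for i in combo:
                alpha[i] += 1
            lam = float(c * np.prod(gammas ** np.array(alpha)))
            modes.append((tuple(alpha), lam))
    return sorted(modes, key=lambda m: -m[1])

# Toy inputs: two PCA directions, coefficients up to degree 2.
for alpha, lam in hea_eigenvalues([1.0, 0.5], [1.0, 0.5, 0.25]):
    print(alpha, lam)
```

With these toy inputs the constant mode comes first, and the quadratic mode along the leading direction ties with the linear mode along the second one. This shows how anisotropy interleaves polynomial degrees in the learning order.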
2. Spherical Hierarchy Coefficients: Treating Any Rotation-Invariant Kernel as a Dot-Product Kernel
To use the formula, one needs \(c_\ell\). On the sphere of typical norm \(r = \text{Tr}[\Sigma]^{1/2}\), the kernel is expanded as a dot-product kernel \(K(x,x') = \sum_\ell \frac{c_\ell}{\ell!}(x^\top x')^\ell\). For a Gaussian kernel \(K(x,x') = e^{-\|x-x'\|^2/(2\sigma^2)}\), substituting \(\|x-x'\|^2 = 2r^2 - 2x^\top x'\) gives \(c_\ell = e^{-r^2/\sigma^2} \cdot \sigma^{-2\ell}\). Rotation-invariant kernels such as the Laplace kernel are not dot-product kernels globally, but because the norms of high-dimensional data concentrate near \(r\), restricting to this sphere is a reasonable approximation.
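The Gaussian-kernel coefficients can be checked numerically. The sketch below assumes the convention \(K(x,x') = \exp(-\|x-x'\|^2/(2\sigma^2))\), which is the one consistent with the stated \(c_\ell\). It compares the truncated dot-product series against the exact kernel for two points of norm \(r\). The helper names and the values of \(r\), \(\sigma\), and the truncation level are illustrative choices.

```python
import math
import numpy as np

def gaussian_level_coeffs(r, sigma, max_level):
    """c_l = exp(-r^2 / sigma^2) * sigma^(-2 l) for l = 0..max_level."""
    return np.array([math.exp(-r ** 2 / sigma ** 2) * sigma ** (-2 * l)
                     for l in range(max_level + 1)])

def dot_product_series(x, y, coeffs):
    """Evaluate sum_l c_l / l! * (x . y)^l for the given coefficients."""
    t = float(x @ y)
    return sum(c / math.factorial(l) * t ** l for l, c in enumerate(coeffs))

rng = np.random.default_rng(0)
r, sigma = 3.0, 6.0
x = rng.standard_normal(50)
x *= r / np.linalg.norm(x)                     # place x on the sphere of radius r
y = rng.standard_normal(50)
y *= r / np.linalg.norm(y)                     # place y on the same sphere

exact = math.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))
approx = dot_product_series(x, y, gaussian_level_coeffs(r, sigma, 20))
print(exact, approx)
```

For points exactly on the sphere the two values agree to numerical precision. For real data the agreement is only approximate, to the extent that sample norms concentrate around \(r\).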
3. Target Function Hermite Decomposition: Estimating Coefficients from Finite Labels
To find the target coefficients \(v_i\), the authors must account for the slight non-Gaussianity of real data, which makes the sampled Hermite polynomials not perfectly orthogonal. They therefore apply Gram-Schmidt orthogonalization to the sampled Hermite polynomials, \(h_i^{(\text{GS})} = \text{unitnorm}(h_i - \sum_{j<i} \langle h_j^{(\text{GS})}, h_i \rangle h_j^{(\text{GS})})\), and then project the labels: \(\hat{v}_i = \langle h_i^{(\text{GS})}, y \rangle\). This step is kernel-independent, so its output can be reused for any kernel on the same dataset.
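A minimal sketch of this Gram-Schmidt step is given below. It assumes the inner product is the empirical average over samples, \(\langle f, g \rangle = \frac{1}{n}\sum_k f(x_k) g(x_k)\), and that the columns of `features` are already arranged in the order in which the modes are indexed. Both are assumptions of this example, as are the function name and the one-dimensional toy target.

```python
import numpy as np

def gram_schmidt_coeffs(features, y):
    """Orthonormalize sampled Hermite features column by column, then project y.

    features: (n_samples, n_modes) array, column i holds h_i on the samples
    y:        (n_samples,) array of labels
    Returns (basis, coeffs), with basis orthonormal under the empirical inner product.
    """
    n, m = features.shape
    basis = np.zeros((n, m))
    for i in range(m):
        h = features[:, i].astype(float)
        for j in range(i):
            # Remove the component along each earlier orthonormal direction.
            h = h - (basis[:, j] @ h) / n * basis[:, j]
        basis[:, i] = h / np.sqrt(h @ h / n)   # unit empirical norm
    coeffs = basis.T @ y / n                   # estimated target coefficients
    return basis, coeffs

# Toy 1-D example: features 1, z, (z^2 - 1)/sqrt(2) on Gaussian samples.
rng = np.random.default_rng(0)
z = rng.standard_normal(5000)
features = np.stack([np.ones_like(z), z, (z ** 2 - 1) / np.sqrt(2)], axis=1)
y = 2.0 * z + 0.5 * (z ** 2 - 1) / np.sqrt(2)
basis, coeffs = gram_schmidt_coeffs(features, y)
print(coeffs)
```

Because the toy target lies in the span of the three features, `basis @ coeffs` reconstructs `y`, and the printed coefficients land near the generating values up to sampling error.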
Key Experimental Results
Main Results: Learning Curve Prediction
| Dataset | Kernel | Target Type | HEA Prediction | Note |
|---|---|---|---|---|
| CIFAR-5m | Gaussian (σ=6) | Synthetic \(h_1(z_1)\) | Exact match | Accuracy across linear/quadratic/cubic complexities |
| CIFAR-5m | Gaussian (σ=6) | vehicles vs. animals | Good match | Accurate for binarized real labels |
| CIFAR-5m | Laplace (σ=8√2) | domesticated vs. wild | Good match | Requires ZCA to increase \(d_\text{eff}\) |
| SVHN | Gaussian (σ=6) | even vs. odd | Good match | Validation across different datasets |
| ImageNet-32 | ReLU NTK | Synthetic poly | Exact match | NTK on downsampled (32×32) real images |
| ImageNet-32 | ReLU NTK | Power-law target | Exact match | Accurate across source exponents \(\beta\) |
Eigenstructure Verification
| Kernel/Data | \(d_\text{eff}\) | Eigenvalue Match | Subspace Overlap | Note |
|---|---|---|---|---|
| Gaussian + Gaussian | ~7 | Exact | Diagonal concentration | Theoretically guaranteed setting |
| Gaussian + CIFAR-5m | ~9 | Good | Diagonal concentration | Natural images also satisfy |
| Laplace + SVHN (ZCA) | ~21 | Good | Diagonal concentration | Requires \(d_\text{eff} \geq 20\) |
| ReLU NTK + ImageNet | ~40 | Good | Diagonal concentration | Replaces wide-kernel condition |
Key Findings
- HEA performs better on complex image datasets than on simple ones (MNIST) due to the "blessing of dimensionality" making coordinates more Gaussian.
- For non-smooth kernels (Laplace), \(c_\ell\) grows superexponentially; truncation at \(\ell \in [5,10]\) provides a valid asymptotic expansion.
- Gram-Schmidt orthogonalization for target decomposition is vital for accuracy due to data non-Gaussianity.
- The "kernel-independence" of the target decomposition significantly reduces the computational cost for multi-kernel evaluation.
Highlights & Insights
- Proof of Concept for End-to-End Analytical Theory: This work is potentially the first to achieve a complete analytical pipeline from "data structure → model performance" on real datasets without constructing kernel matrices.
- Parsimonious Description of Data: Using the covariance matrix and Hermite coefficients as a "reduced description" effectively captures the information relevant to kernel learners.
- Applicability to MLPs: Despite being a kernel theory, experiments show MLPs in the feature-learning regime learn Hermite polynomials in the sequence predicted by HEA, suggesting HEA might reflect a more general learning law.
- Blessing of Dimensionality: High dimensionality is usually a curse, but here it helps satisfy HEA as coordinates of complex high-dimensional images tend toward Gaussianity.
Limitations & Future Work
- Rotation-Invariant Kernels Only: The theory does not currently apply to non-rotation-invariant kernels like learned NTKs or attention kernels.
- Quantification of "Gaussian Enough": There is no formal quantitative threshold for the degree of Gaussianity required; failures on MNIST suggest this condition is non-trivial.
- Divergence of Hierarchy Coefficients: For non-smooth kernels, \(c_\ell\) grows too fast, requiring manual truncation. More elegant asymptotic treatments are needed.
- Empirical MLP Connection: The link between feature-learning MLPs and HEA is currently purely empirical without a formal theoretical explanation.
Related Work & Insights
- Vs. KRR Eigenframework (Simon et al. 2021): Previous work mapped eigensystems to risk but required numerical diagonalization; this work provides the analytical link from raw statistics to the eigensystem.
- Vs. Wortsman & Loureiro (2025): Contemporaneous work studied dot-product kernels on anisotropic Gaussian data but focused on eigenvalue bounds. HEA provides a more direct theoretical justification for replacing kernels with Hermite bases.
- Vs. Ghorbani et al. (2020): While they provided results on products of spheres, HEA generalizes this to continuous anisotropic settings.
Rating
- Novelty: ⭐⭐⭐⭐⭐
- Experimental Thoroughness: ⭐⭐⭐⭐
- Writing Quality: ⭐⭐⭐⭐⭐
- Value: ⭐⭐⭐⭐