Function Spaces Without Kernels: Learning Compact Hilbert Space Representations¶
Conference: ICLR 2026 | arXiv: 2509.20605 | Code: Available | Area: LLM Evaluation | Keywords: function encoders, kernel methods, Hilbert space, generalization bounds, PCA-guided basis selection
TL;DR¶
This paper proves that Function Encoders, which learn neural network basis functions, implicitly define a valid kernel, thereby bridging neural feature learning and RKHS theory. It further proposes PCA-guided compact basis selection algorithms and establishes finite-sample generalization bounds.
Background & Motivation¶
Machine learning methods have long faced a trade-off between computational efficiency and theoretical guarantees:
- Neural networks: flexible and scalable, but with limited theoretical guarantees (existing bounds are often vacuous or rely on restrictive assumptions).
- Kernel methods: theoretically well-founded (RKHS theory, precise statistical guarantees), but poorly scalable: inference cost scales with training-set size \(m\), and fitting requires inverting an \(m \times m\) Gram matrix in the dual representation.
Function Encoders are a recently proposed transfer/representation learning technique that learns a set of neural network basis functions \(\{\psi_j\}_{j=1}^n\) as an explicit feature map \(\phi(x) = [\psi_1(x), \ldots, \psi_n(x)]^\top\), with inference cost depending only on the number of bases \(n\) rather than dataset size \(m\). However, their theoretical foundations remain incomplete:
- No formal connection to Hilbert space theory.
- No principled method for selecting the number of bases \(n\).
- No finite-sample generalization guarantees.
This paper aims to close all three theoretical gaps.
Method¶
Overall Architecture¶
The central insight is that Function Encoders can be understood from two complementary perspectives — primal and dual:
- Primal space: provides an explicit feature map \(\phi(x)\), enabling efficient \(\mathcal{O}(n)\) linear prediction.
- Dual space: the inner product \(\langle\phi(x), \phi(x')\rangle\) corresponds to kernel evaluation, amenable to the full suite of kernel-theoretic analysis tools.
Function Encoders operate in two stages:

1. Offline phase: jointly train the basis functions on a collection of datasets \(\{D_1, \ldots, D_N\}\), one per training function, by minimizing a multi-task reconstruction loss.
2. Online inference: fix the basis functions and solve for the coefficients \(c\) via regularized least squares.
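The online phase is just ridge regression in coefficient space. A minimal NumPy sketch, with hand-built polynomial bases standing in for trained neural network bases (the \(\lambda m\) scaling of the ridge term is an illustrative choice, not necessarily the paper's exact parameterization):

```python
import numpy as np

def solve_coefficients(Psi, y, lam=1e-3):
    """Regularized least squares for the basis coefficients c.

    Psi : (m, n) matrix with Psi[i, j] = psi_j(x_i), the fixed trained bases
    y   : (m,) observed function values f(x_i)
    Solves (Psi^T Psi + lam * m * I) c = Psi^T y.
    """
    m, n = Psi.shape
    A = Psi.T @ Psi + lam * m * np.eye(n)
    return np.linalg.solve(A, Psi.T @ y)

def predict(Psi_new, c):
    """O(n) linear prediction in the primal space: f_hat(x) = phi(x)^T c."""
    return Psi_new @ c

# Toy usage: the bases happen to span the target function exactly.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=50)
Psi = np.stack([np.ones_like(x), x, x**2], axis=1)  # 3 "trained" bases
y = 2.0 - x + 0.5 * x**2                            # target lies in the span
c = solve_coefficients(Psi, y, lam=1e-8)
```

Note the contrast with the dual form: the linear solve here is over an \(n \times n\) system, so inference cost depends on the number of bases rather than on \(m\).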
Key Designs¶
1. Function Encoders as Learnable Kernels
Core theorem (Proposition 1): the inner product defined by the learned basis functions automatically constitutes a valid kernel, \(k(x, x') = \langle\phi(x), \phi(x')\rangle = \sum_{j=1}^n \psi_j(x)\,\psi_j(x')\).
This kernel is symmetric and positive semi-definite (since \(\|\sum_i \alpha_i \phi(x_i)\|^2 \geq 0\)), and hence a legitimate reproducing kernel. This connection implies that Function Encoders not only enable efficient prediction in primal space but also instantiate a valid kernel in dual space — making the entire theoretical toolkit of kernel methods applicable.
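The claim is easy to verify numerically: any explicit feature map yields a symmetric, positive semi-definite Gram matrix. The random tanh features below are placeholder bases for illustration, not the paper's trained networks:

```python
import numpy as np

rng = np.random.default_rng(42)

# Stand-in "basis functions": random shallow features psi_j(x) = tanh(w_j x + b_j).
W, b = rng.normal(size=8), rng.normal(size=8)
phi = lambda x: np.tanh(np.outer(x, W) + b)   # (m, n) explicit feature map

X = rng.uniform(-2, 2, size=30)
Phi = phi(X)
K = Phi @ Phi.T                                # K[i, j] = <phi(x_i), phi(x_j)>

assert np.allclose(K, K.T)                     # symmetric
assert np.linalg.eigvalsh(K).min() > -1e-10    # PSD up to floating-point noise
```

Because the bases themselves are learned, this is a data-adaptive kernel, unlike the fixed kernels of classical RKHS methods.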
2. Progressive Training
Analogous to PCA's incremental addition of principal components, basis functions are trained one at a time:
- Step \(b=1\): train \(\psi_1\), then freeze it.
- Step \(b=2\): add \(\psi_2\), train only \(\psi_2\) to fit the residual.
- At each step \(b\): compute the coefficient covariance matrix \(\Sigma_b\) with eigenvalues \(\lambda_1 \geq \ldots \geq \lambda_b\).
- Stopping criterion: cumulative explained variance \(\text{CEV}_r = \sum_{i=1}^r \lambda_i / \sum_{j=1}^b \lambda_j \geq \tau\) (e.g., 99%).
Advantage: ordered and interpretable bases with a clear stopping rule. Limitation: training is inherently sequential across bases, so the basis-addition loop itself cannot be parallelized.
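The stopping criterion reduces to a cumulative-explained-variance check on the eigenvalues of the coefficient covariance. A sketch, with a synthetic eigenvalue spectrum:

```python
import numpy as np

def effective_rank(eigvals, tau=0.99):
    """Smallest r with CEV_r = sum_{i<=r} lambda_i / sum_j lambda_j >= tau."""
    lam = np.sort(np.asarray(eigvals))[::-1]   # lambda_1 >= ... >= lambda_b
    cev = np.cumsum(lam) / lam.sum()
    return int(np.searchsorted(cev, tau) + 1)

# Synthetic spectrum: four dominant directions plus a tiny tail,
# mimicking a function space of intrinsic dimension 4.
spectrum = [5.0, 2.0, 1.0, 0.5, 1e-4, 1e-5]
r = effective_rank(spectrum, tau=0.99)         # r == 4
```

In progressive training this check runs at each step \(b\); training stops once the retained directions explain at least \(\tau\) of the coefficient variance.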
3. Train-Then-Prune
- Jointly train \(B\) bases (\(B \gg r\)).
- Compute the eigendecomposition of the coefficient covariance matrix.
- Effective rank: \(r = \min\{n : \sum_{i=1}^n \lambda_i / \sum_{j=1}^B \lambda_j \geq \tau\}\).
- Score each basis by importance: \(s_p = \sum_{i=1}^r \lambda_i U_{pi}^2\).
- Retain the top-\(r\) bases and prune the rest.
- Fine-tune briefly to recover performance.
Advantage: training is fully parallelizable and computationally efficient.
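The pruning step follows directly from the eigendecomposition: compute the effective rank, score each basis by \(s_p = \sum_{i=1}^r \lambda_i U_{pi}^2\), and keep the top \(r\). A sketch on a synthetic coefficient covariance (the fine-tuning step is omitted):

```python
import numpy as np

def prune_bases(Sigma, tau=0.99):
    """Select a compact basis subset from the coefficient covariance Sigma (B x B).

    Returns the indices of the retained bases and the importance scores.
    """
    lam, U = np.linalg.eigh(Sigma)               # eigh returns ascending order
    lam, U = lam[::-1], U[:, ::-1]               # sort eigenpairs descending
    cev = np.cumsum(lam) / lam.sum()
    r = int(np.searchsorted(cev, tau) + 1)       # effective rank
    scores = (U[:, :r] ** 2) @ lam[:r]           # s_p = sum_i lambda_i * U_pi^2
    keep = np.argsort(scores)[::-1][:r]          # retain the top-r bases
    return np.sort(keep), scores

# Synthetic B=6 covariance whose variance lives in coordinates 0 and 3.
C = 1e-6 * np.eye(6)                             # tiny isotropic tail
C[0, 0], C[3, 3] = 4.0, 2.0
keep, scores = prune_bases(C, tau=0.99)          # keeps bases 0 and 3
```

Because all \(B\) bases are trained jointly up front, this variant keeps GPU parallelism during training and pays only a cheap eigendecomposition at pruning time.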
4. Generalization Bounds (Theoretical Core)
Rademacher complexity bound: the complexity term grows with the number of bases \(n\) and shrinks with the sample size \(m\) and the regularization strength \(\lambda\). More bases improve expressiveness but increase overfitting risk.
PAC-Bayes bound: the technical contribution here is a truncated Gaussian posterior used to handle unbounded loss functions, extending PAC-Bayes analysis to multivariate outputs and learned feature maps.
Loss & Training¶
- Offline training loss: multi-task regularized least squares, \(\frac{1}{Nm}\sum_{j=1}^N \sum_{i=1}^m \|f_j(x_i) - \hat{f}_j(x_i)\|^2 + \frac{\lambda}{N}\sum_{j=1}^N \|\hat{f}_j\|^2\).
- Online inference: solve for coefficients via regularized least squares with fixed bases.
Key Experimental Results¶
Main Results¶
Polynomial space benchmark (known intrinsic dimension \(d+1\)):
| Polynomial degree | Intrinsic dim. | Progressive recovery | Train-Prune recovery |
|---|---|---|---|
| \(d=3\) | 4 | 4 ✓ | 4 ✓ |
| \(d=4\) | 5 | 5 ✓ | 5 ✓ |
| \(d=5\) | 6 | 6 ✓ | 6 ✓ |
Both algorithms exactly recover the known intrinsic dimension; pruned models match the accuracy of overparameterized originals.
Van der Pol oscillator:
| Method | # Bases | Prediction accuracy |
|---|---|---|
| Prior work (Ingebrand et al.) | 100 | Baseline |
| Progressive / Prune | 2 | Same accuracy |
Basis count reduced from 100 to 2 — a 50× reduction — with no loss in accuracy.
Two-body orbital problem:
| Method | # Bases | Explained variance >99% |
|---|---|---|
| Overparameterized | Many | — |
| Progressive / Prune | 5–6 | ✓ |
Five to six bases capture >99% of variance, consistent with the 5-dimensional space of orbital parameters.
Ablation Study¶
Comparison with deep kernel learning (degree-3 polynomial):
- Training time: Function Encoders are approximately 10× faster than deep kernels (RBF) at \(m=20\).
- Reason: deep kernels require \(\mathcal{O}(m^3)\) Gram matrix inversion.
- The geometric structure of the two Gram matrices is nearly identical, validating the interpretation of Function Encoders as adaptive kernels in primal space.
Key Findings¶
- Both PCA-guided algorithms accurately identify the intrinsic dimensionality of the function space.
- Compact basis representations can reduce the number of bases by up to 50× without accuracy loss.
- Function Encoders are significantly more training-efficient than deep kernel methods.
- The derived bounds provide non-vacuous guarantees at inference time.
- Learned basis functions exhibit interpretable physical structure in dynamical systems.
Highlights & Insights¶
- Unifying two worlds: the approach bridges the scalability of neural networks with the theoretical guarantees of kernel methods — Function Encoders sit precisely at this intersection.
- Elegant analogy between PCA and basis selection: the problem of selecting the dimensionality of a function space is recast as PCA truncation in coefficient space.
- Solid theoretical contributions: the truncated Gaussian PAC-Bayes technique has independent value and is applicable to other unbounded-loss settings.
- Layered experimental design: validation progresses from polynomial spaces with known dimensionality to real dynamical systems, demonstrating the method's effectiveness at increasing levels of complexity.
Limitations & Future Work¶
- The stopping threshold \(\tau\) remains heuristic; a principled, non-heuristic basis selection criterion is lacking.
- Progressive training is inherently sequential, limiting efficiency on large-scale problems.
- Experiments are confined to low-dimensional dynamical systems; performance on high-dimensional complex settings (PDEs, vision tasks) is unknown.
- Constants in the generalization bounds may be large; while non-vacuous, their practical tightness remains to be verified.
- End-to-end integration with downstream applications (e.g., robotic control, trajectory prediction) has not been explored.
Related Work & Insights¶
- Kernel methods (GP/KRR/SVM): theoretically rigorous but poorly scalable; Function Encoders address this limitation while preserving theoretical properties.
- Kernel approximations (Nyström / random Fourier features): alleviate computational cost but use fixed kernels that cannot adapt to data.
- Deep kernel learning: parameterized kernels enhance expressiveness but retain \(\mathcal{O}(m)\) inference cost.
- Dictionary learning / Koopman / SINDy: also learn basis functions but pursue different objectives (sparse recovery / dynamics linearization).
- Insight: the kernel-theoretic perspective on Function Encoders may provide theoretical analysis tools for other transfer learning approaches.
Rating¶
- Novelty: ★★★★☆ — The unifying perspective has theoretical originality, though Function Encoders themselves are not new.
- Technical depth: ★★★★★ — Dual theoretical analysis via Rademacher complexity and PAC-Bayes; the truncated Gaussian technique has independent value.
- Experimental persuasiveness: ★★★★☆ — Polynomial verification is thorough and dynamical system results are compelling, but large-scale experiments are absent.
- Practical value: ★★★☆☆ — Theoretical contributions outweigh immediate applicability; compact bases show potential for embedded systems.
- Clarity: ★★★★★ — Theoretical derivations are well-structured and the experimental design builds progressively.