Function Encoders: A Principled Approach to Transfer Learning in Hilbert Spaces
Conference: ICML 2025
arXiv: 2501.18373
Code: tyler-ingebrand/FEtransfer
Area: Transfer Learning
Keywords: Transfer Learning, Hilbert Space, Basis Functions, Function Encoder, Least Squares
TL;DR
Proposes a taxonomy of transfer learning from the geometric perspective of Hilbert spaces (convex hull interpolation / linear span extrapolation / full-space extrapolation) and designs the Function Encoder, a method that uses learnable neural network basis functions to achieve all three types of transfer. It outperforms methods such as MAML and Transformers on multiple benchmarks.
Background & Motivation
- Core Problem: When can transfer learning algorithms effectively transfer to new tasks? Existing methods lack characterization of the "conditions for successful transfer."
- Limitations of Prior Work: Meta-learning methods like MAML require fine-tuning on new tasks and can easily fail when source and target tasks are only weakly related; large-scale pre-training relies on massive data rather than structural insights; kernel methods suffer from a Gram matrix that grows with the amount of data.
- Design Motivation: Frame the transfer learning problem as a geometric one in Hilbert spaces—the geometric position of the target task relative to the set of source tasks determines the transfer difficulty.
- Based on the existing Function Encoder theory (Ingebrand et al., 2024b), the authors further extend it to transfer learning scenarios across the entire Hilbert space.
Method
1. Geometric Taxonomy of Transfer Learning
The inductive transfer problem is modeled in a Hilbert space \(\mathcal{H}\) and categorized into three types based on the geometric relationship between the target function \(f_T\) and the set of source functions \(\{f_{S_1}, \ldots, f_{S_n}\}\):
| Type | Name | Definition | Difficulty |
|---|---|---|---|
| Type 1 | Convex Hull Interpolation | \(f_T \in \text{Conv}(f_{S_1}, \ldots, f_{S_n})\), i.e., \(f_T = \sum \alpha_i f_{S_i}\), \(\alpha_i \ge 0\), \(\sum \alpha_i = 1\) | Easiest |
| Type 2 | Linear Span Extrapolation | \(f_T \in \text{span}\{f_{S_1}, \ldots, f_{S_n}\}\), coefficients are unconstrained | Moderate |
| Type 3 | Full-Space Extrapolation | \(f_T \in \mathcal{H}\) but \(f_T \notin \text{span}\{f_{S_1}, \ldots, f_{S_n}\}\) | Hardest |
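The three definitions in the table can be made concrete in a finite-dimensional setting. The sketch below is illustrative only and is not taken from the paper: it assumes each function is stored as a coefficient vector in some fixed basis (here, quadratic polynomials) and that the source functions are linearly independent, so the span weights are unique and hull membership reduces to checking their signs and sum. The name `transfer_type` is hypothetical.

```python
import numpy as np

def transfer_type(sources, target, tol=1e-8):
    """Classify a target as Type 1, 2, or 3 relative to the source functions.

    Assumption: every function is represented by its coefficient vector in
    a fixed finite basis. `sources` has shape (n, d) with linearly
    independent rows; `target` has shape (d,).
    """
    S = np.asarray(sources, dtype=float)
    t = np.asarray(target, dtype=float)
    # Best weights alpha such that sum_i alpha_i * f_{S_i} is closest to f_T.
    alpha, *_ = np.linalg.lstsq(S.T, t, rcond=None)
    residual = np.linalg.norm(S.T @ alpha - t)
    if residual > tol:
        return 3  # not in the span: full-space extrapolation
    in_hull = np.all(alpha >= -tol) and abs(alpha.sum() - 1.0) <= tol
    return 1 if in_hull else 2

# Quadratics a + b*x + c*x^2 stored as (a, b, c).
sources = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]       # f(x) = 1 and f(x) = x
type_1 = transfer_type(sources, [0.3, 0.7, 0.0])   # convex combination
type_2 = transfer_type(sources, [2.0, -5.0, 0.0])  # in the span, outside the hull
type_3 = transfer_type(sources, [0.0, 0.0, 1.0])   # x^2 is outside the span
print(type_1, type_2, type_3)
```

If the sources were linearly dependent, the weights would no longer be unique and the hull check would need a constrained solver instead.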
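2. Function Encoder Architecture
Learn a set of basis functions \(\{g_1, \ldots, g_k\}\) parameterized by neural networks, representing any function \(f \in \mathcal{H}\) as:
\[ f \approx \sum_{j=1}^{k} c_j \, g_j \]
Coefficient Computation via Least Squares (LS, newly proposed in this work):
\[ c = \begin{bmatrix} \langle g_1, g_1 \rangle & \cdots & \langle g_1, g_k \rangle \\ \vdots & \ddots & \vdots \\ \langle g_k, g_1 \rangle & \cdots & \langle g_k, g_k \rangle \end{bmatrix}^{-1} \begin{bmatrix} \langle f, g_1 \rangle \\ \vdots \\ \langle f, g_k \rangle \end{bmatrix} \]
where the inner product is approximated via Monte Carlo integration over data points \((x_i, y_i)\) with \(y_i = f(x_i)\): \(\langle f, g_j \rangle \approx \frac{1}{m} \sum_{i=1}^{m} y_i \cdot g_j(x_i)\).
Key advantages of the LS method compared to the original Inner Product (IP) method:
- Does not require orthogonal basis functions, only linear independence (a weaker condition).
- Provides the theoretically optimal projection (in the least-squares sense).
- Yields faster training convergence and higher accuracy.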
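To see why orthogonality matters for IP but not for LS, the following minimal sketch compares the two coefficient rules on a deliberately non-orthogonal basis. It is an illustration under stated assumptions, not the authors' implementation: the fixed monomials `1, x, x^2` stand in for trained neural network basis functions, and the target function and sample size are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def basis(x):
    """Evaluate three fixed, non-orthogonal stand-in basis functions: 1, x, x^2."""
    return np.stack([np.ones_like(x), x, x**2], axis=1)  # shape (m, k)

# Data from a target function that lies in the span of the basis.
x = rng.uniform(-1.0, 1.0, size=200)
y = 2.0 - 1.0 * x + 3.0 * x**2

G_x = basis(x)                    # entries g_j(x_i)
m = len(x)
b = G_x.T @ y / m                 # Monte Carlo estimate of <f, g_j>
gram = G_x.T @ G_x / m            # Monte Carlo estimate of <g_i, g_j>

c_ip = b                          # IP rule: c_j = <f, g_j> (assumes orthonormality)
c_ls = np.linalg.solve(gram, b)   # LS rule: solve the k x k Gram system

print("IP:", np.round(c_ip, 3))
print("LS:", np.round(c_ls, 3))
```

Because the target lies in the span of the basis, the LS rule recovers the true coefficients `(2, -1, 3)` up to floating-point error, while the IP rule does not, since `1` and `x^2` are not orthogonal on `[-1, 1]`.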
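3. Loss & Training
The training loss has two terms. The first is the reconstruction error between each source function and its basis-function approximation; the second is a regularization term that prevents the basis function magnitudes from diverging.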
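As a rough illustration of a loss with this two-term structure, the sketch below evaluates a reconstruction error plus a norm penalty for fixed basis outputs. Several things here are assumptions rather than the paper's definitions: the exact form of the regularizer (pushing each squared basis norm toward 1), the weight `lam`, the function name `function_encoder_loss`, and the simplification that all source functions are sampled at the same points. In actual training the basis would be a neural network and this value would be minimized with automatic differentiation.

```python
import numpy as np

def function_encoder_loss(G_x, Y, lam=1e-3):
    """Evaluate a two-term loss for fixed basis outputs (illustrative only).

    G_x: (m, k) basis values g_j(x_i) on shared sample points (assumption).
    Y:   (n, m) values of n source functions at the same points.
    lam: hypothetical regularization weight.
    """
    m = G_x.shape[0]
    gram = G_x.T @ G_x / m                            # <g_i, g_j>, shape (k, k)
    coeffs = np.linalg.solve(gram, G_x.T @ Y.T / m)   # (k, n) LS coefficients
    recon = (G_x @ coeffs).T                          # (n, m) approximations
    reconstruction = np.mean((Y - recon) ** 2)
    # Assumed regularizer: penalize squared basis norms that drift from 1.
    regularizer = np.sum((np.diag(gram) - 1.0) ** 2)
    return reconstruction + lam * regularizer

x = np.linspace(-1.0, 1.0, 101)
G_x = np.stack([np.ones_like(x), x], axis=1)   # stand-in basis: 1 and x
Y = np.stack([1.0 + 2.0 * x, 3.0 * x])         # two source functions in the span

loss_no_reg = function_encoder_loss(G_x, Y, lam=0.0)
loss = function_encoder_loss(G_x, Y)
print(loss_no_reg, loss)
```

With `lam=0.0` the loss is numerically zero because both source functions lie in the span of the stand-in basis; the default call adds a positive penalty because the basis function `x` does not have unit norm.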
4. Universal Function Space Approximation Theorem
Theorem 1: For any separable Hilbert space \(\mathcal{H}\), there exists a set of neural network basis functions such that any function in \(\mathcal{H}\) can be approximated with arbitrary precision.
Proof sketch: A separable Hilbert space possesses a countable orthonormal basis \(\rightarrow\) the universal approximation theorem of neural networks guarantees that each orthonormal basis function can be approximated by an NN \(\rightarrow\) the approximation errors decay geometrically, so their sum stays small \(\rightarrow\) the overall approximation reaches arbitrary precision.
5. Online Inference
Given a small amount of data \(D_{f_T}\) from the target task, the coefficients can be directly computed using the LS formula without retraining. The size of the Gram matrix is \(k \times k\) (a hyperparameter), independent of the data size, which allows extremely fast inference.
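The inference step can be sketched as follows. This is a minimal illustration, assuming a hypothetical fixed basis (`1`, `sin`, `cos`) in place of trained networks; `fit_coefficients`, `predict`, and `target` are names introduced here for the example. It shows that the linear system stays \(k \times k\) whether the target dataset has 10 points or 10,000.

```python
import numpy as np

def basis(x):
    """Stand-in for k = 3 trained basis networks (here: 1, sin(x), cos(x))."""
    return np.stack([np.ones_like(x), np.sin(x), np.cos(x)], axis=1)

def fit_coefficients(x, y):
    """Solve the k x k least-squares system for one target dataset."""
    G_x = basis(x)
    gram = G_x.T @ G_x / len(x)   # size depends on k only, not on len(x)
    b = G_x.T @ y / len(x)
    return np.linalg.solve(gram, b), gram

def predict(coeffs, x_new):
    """Evaluate the fitted linear combination of basis functions."""
    return basis(x_new) @ coeffs

def target(x):
    """Hypothetical target task that lies in the span of the basis."""
    return 0.5 + 2.0 * np.sin(x) - np.cos(x)

rng = np.random.default_rng(1)
for m in (10, 10_000):
    x = rng.uniform(0.0, 2 * np.pi, size=m)
    coeffs, gram = fit_coefficients(x, target(x))
    print(m, gram.shape, np.round(coeffs, 3))

x_test = np.linspace(0.0, 2 * np.pi, 50)
max_error = np.max(np.abs(predict(coeffs, x_test) - target(x_test)))
```

No gradient step is involved: adapting to a new task is a single linear solve followed by a matrix-vector product.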
Key Experimental Results
FE(LS), FE(IP), AutoEncoder, Transformer, TFE, MAML, BF, BFB, and other methods are compared on 4 benchmark tasks:
| Benchmark Task | Type 1 (Interpolation) | Type 2 (Span Extrapolation) | Type 3 (Full-Space Extrapolation) |
|---|---|---|---|
| Polynomial Regression | FE(LS) is optimal, other methods perform acceptably | FE(LS) leads by several orders of magnitude | FE(LS) leads by several orders of magnitude |
| CIFAR-100 Classification | FE(LS) is slightly better than Siamese Network | — | FE(LS) is optimal, close to Siamese |
| 7-Scenes Pose Estimation | FE(LS) is optimal | — | FE(LS) is optimal |
| MuJoCo Ant Dynamics | FE(LS) is optimal | FE(LS) significantly leads | FE(LS) is optimal and stable |
Key Findings:
- In polynomial regression, FE(LS) achieves \(L^2\) errors that are several orders of magnitude lower than other methods on Type 2/3.
- On CIFAR-100, although FE is a general-purpose method, its performance is comparable to or slightly better than specialized Siamese/Prototypical Networks.
- In the MuJoCo dynamics task, AutoEncoder performs well on Type 3 during early training but degrades sharply as training progresses, whereas FE(LS) remains stable.
- Increasing the number of basis functions (e.g., from 3 to 100) significantly improves Type 3 transfer, as the redundant dimensions are optimally utilized by LS.
Highlights & Insights
- Novel Geometric Taxonomy: Systematically classifies three transfer types from the geometric perspective of Hilbert spaces for the first time, providing intuitive understanding.
- LS Coefficient Computation is a Core Innovation: Does not rely on orthogonality assumptions, making basis function training more flexible and speeding up convergence.
- Universal Approximation Theorem: Provides a theoretical guarantee for the expressive capacity of the Function Encoder.
- No Fine-Tuning Required: Unlike MAML, inference only requires solving least squares, without gradient computation.
- Utilization of Redundant Dimensions: When the number of basis functions \(k\) is larger than the number of source tasks, LS can automatically utilize the redundant dimensions to adapt to Type 3 tasks, a capability the other compared methods lack.
Limitations & Future Work
- Inner Product Selection: Inner products (such as \(L^2\), probability distribution inner product) need to be manually designed for different problems, limiting generality.
- Number of Basis Functions \(k\) Needs Tuning: If \(k\) is too small, expressive capacity is limited; if it is too large, computation cost rises and regularization becomes harder.
- Monte Carlo Approximation Error: When data is scarce, inner product estimation is inaccurate, affecting coefficient computation quality.
- Scalability: Scaling efficiency on extremely large task spaces (e.g., thousands of source tasks) has not been verified.
- Limited Theoretical Guarantees for Type 3: The universal approximation theorem is an existence proof and does not provide a quantitative relationship between \(k\) and the approximation error.
- Inductive Transfer Only: Does not cover cross-domain scenarios such as domain adaptation.
Related Work & Insights
- MAML (Finn et al., 2017): Learns a good initialization for fast fine-tuning, but requires gradient steps; FE does not require fine-tuning.
- Kernel Methods: Also use the concept of basis functions, but the number of basis functions grows with data and requires pre-selected kernels.
- Dictionary Learning: Performs atomic decomposition on discrete points, whereas FE's basis functions can be evaluated on a continuous domain.
- Transformer/TFE: Perform poorly even on simple polynomial regression, as they lack structural inductive biases.
Rating
- Novelty: ⭐⭐⭐⭐ — Geometric taxonomy + LS training scheme + universal approximation theorem, solid theoretical contribution.
- Experimental Thoroughness: ⭐⭐⭐⭐ — 4 benchmarks from different domains and comprehensive ablation analyses, but lacks larger-scale experiments.
- Writing Quality: ⭐⭐⭐⭐⭐ — Clear framework, intuitive figures, and tight integration of theory and experiments.
- Value: ⭐⭐⭐⭐ — Provides a new geometric perspective for transfer learning; the LS-based FE offers strong practicality.