Random Feature Representation Boosting¶

Conference: ICML2025
arXiv: 2501.18283
Authors: Nikita Zozoulenko, Thomas Cass, Lukas Gonon
Code: To be confirmed
Area: Optimization
Keywords: random features, gradient boosting, residual neural networks, tabular data, convex optimization

TL;DR¶

RFRBoost is proposed, leveraging the gradient representation boosting theory to construct deep residual random feature neural networks. It obtains a closed-form solution under the MSE loss and reduces to a quadratically constrained least squares problem under general loss functions. It significantly outperforms single-layer RFNNs and end-to-end trained MLP ResNets on tabular data.

Background & Motivation¶

Random Feature Neural Networks (RFNNs) only train the linear output layer while keeping the hidden layer parameters randomly fixed, enjoying advantages such as convex optimization, rapid training, and provable generalization guarantees. However, existing deep RFNN theories are limited to Fourier activation functions, and simply stacking random layers into a ResNet architecture faces key difficulties:

Residual block scaling problem: If the residual block \(g_t\) is too small, the initial representation \(\Phi_0\) dominates, rendering the newly added random features ineffective. If it is too large, the information from preceding layers is lost.

Direction problem: Ideally, each residual block should approximate the negative functional gradient of the loss function with respect to the network representation, but random layers cannot guarantee this property.

Lack of training mechanism: End-to-end training can adjust the weights of all layers via SGD and backpropagation, whereas RFNNs lack such a mechanism because their hidden layers are fixed.

These challenges make the question of "how to construct effective deep residual RFNNs while preserving the computational advantages of random features" a core problem.

Core Idea¶

The core insight of RFRBoost: restrict the residual block to the form \(g_t = A_t f_t\) (where \(f_t\) represents the random features and \(A_t\) is a learnable linear mapping), and then solve for the optimal \(A_t\) using gradient representation boosting theory. This preserves the convex optimization characteristics of RFNNs while providing theoretical guarantees for deep architectures through the boosting framework.

The ResNet representation is recursively defined as:

\[\Phi_t(x) = \Phi_{t-1}(x) + \eta A_t f_t(x, \Phi_{t-1}(x))\]

where \(f_t\) can depend on both the original input \(x\) and the preceding layer's output \(\Phi_{t-1}(x)\), and \(\eta\) is the learning rate.

Method¶

1. Definition of Random Feature Layers¶

The random feature layer generates a fixed feature vector \(f_t \in \mathbb{R}^p\) which is no longer trained after initialization. In the experiments, the specific choice is:

\[f_t(x) = \sigma(\text{concat}(B_t \Phi_{t-1}(x),\; C_t x))\]

where \(B_t, C_t\) are initialized using SWIM random features (non-i.i.d.), and \(\sigma\) is an activation function. This design allows the random features to utilize both the raw data and the representations from previous layers, thereby enhancing expressiveness.

2. Exact-Greedy Strategy (MSE Loss)¶

For the MSE loss \(l(x,y) = \frac{1}{2}\|x - y\|^2\), each layer greedily solves the joint optimization:

\[W_t, g_t = \arg\min_{W, g} \hat{\mathcal{R}}(W, \Phi_{t-1} + g)\]

Substituting \(g_t = A_t f_t\) reformulates this into a "sandwiched least squares" problem:

\[A_t = \arg\min_A \frac{1}{n} \sum_{i=1}^n \|r_i - W_{t-1}^\top A f_{t,i}\|^2 + \lambda\|A\|_F^2\]

where \(r_i = y_i - W_{t-1}^\top \Phi_{t-1}(x_i)\) is the residual. Key contribution: Theorem 3.1 proves that closed-form solutions exist for this problem under three types of \(A_t\): scalar, diagonal, and dense:

\(A_t\) Type	Closed-Form Solution	Explanation
Scalar	\(A_{\text{scalar}} = \frac{\langle R, ZW \rangle_F}{\\|ZW\\|_F^2 + n\lambda}\)	Global uniform scaling
Diagonal	\((WW^\top \odot Z^\top Z + \lambda I)^{-1} \text{diag}(WR^\top Z)\)	Dimension-wise independent learning rates
Dense	Based on spectral decomposition of \(WW^\top\) and \(Z^\top Z\)	Complete linear mapping, learning the direction of the functional gradient

After solving for \(A_t\), update \(\Phi_t = \Phi_{t-1} + \eta A_t f_t\), and then update the top-layer linear predictor \(W_t\) via least squares.

3. Gradient-Greedy Strategy (General Loss)¶

For general differentiable losses (e.g., cross-entropy), a first-order functional Taylor expansion is used to approximate the risk:

\[\mathcal{R}(W, \Phi + g) \approx \mathcal{R}(W, \Phi) + \langle g, \nabla_2 \mathcal{R}(W, \Phi) \rangle_{L_2^D(\mu)}\]

The paper highlights the issue of directly minimizing the inner product: \(g\) might decrease the inner product by simply increasing its own norm rather than aligning with the direction of the functional gradient. The solution is to impose a unit norm constraint \(\|g\|_{L_2^D(\hat{\mu})} = 1\).

Theorem 3.2 proves that the constrained optimization problem is equivalent to quadratically constrained least squares:

\[\min_A \frac{1}{n}\sum_{i=1}^n \|\nabla_2\hat{\mathcal{R}}(W,\Phi)(x_i) - A f(x_i)\|^2 \quad \text{s.t.} \quad \frac{1}{n}\sum_{i=1}^n \|Af(x_i)\|^2 = 1\]

and a closed-form solution exists: \(A = \frac{\sqrt{n}}{\|G\|_F} G^\top F (F^\top F)^{-1}\) (when \(F\) is of full rank).

Practical training consists of three steps: 1. Generate random features \(f_t\), compute the functional gradient matrix \(G\), and fit \(A_t\) using regularized least squares. 2. Perform line search to find the optimal step size \(\alpha_t\) (using Theorem 3.1 for MSE, and Newton's method for cross-entropy). 3. Update \(\Phi_t = \Phi_{t-1} + \eta \alpha_t A_t f_t\), and update \(W_t\) via convex optimization.

4. Theoretical Guarantees¶

Based on the generalized boosting framework of Suggala et al. (2020), a regret bound is derived under the \((\beta, \epsilon)\)-weak learning condition, which describes the rate at which excess risk decays with respect to the number of layers \(T\) and the sample size \(\tilde{n}\). This theoretical guarantee is not shared by end-to-end trained ResNets.

Key Experimental Results¶

Experimental Settings¶

Datasets: 91 tabular data tasks from OpenML (regression + classification), covering small to medium scales.
Baselines: Single-layer RFNN, end-to-end trained MLP ResNet.
Evaluation: Standard train/val/test split, MSE (for regression) / Accuracy (for classification).

Main Results¶

Comparison Dimension	Result
RFRBoost vs Single-layer RFNN	Significantly superior, proving that deep residual structures effectively enhance the expressive capacity of random features
RFRBoost vs MLP ResNet (SGD)	Significantly outperforms end-to-end trained models on small-to-medium-scale data
Computational Efficiency	No backpropagation required; training speed is far faster than gradient descent methods
Gradient-Greedy vs Exact-Greedy	The Gradient-Greedy strategy with unit-norm constraint outperforms Exact-Greedy

Key Findings¶

Effectiveness of Depth: As the number of layers \(T\) increases, the performance of RFRBoost continuously improves, whereas naive methods that simply stack random layers degrade in performance.
Importance of Norm Constraints: Imposing the \(\|g\|=1\) constraint in the Gradient-Greedy strategy is critical to outperforming Exact-Greedy—this finding corrects the conclusions in prior literature regarding the poor performance of gradient-greedy strategies.
Optimality of Dense \(A_t\): Among the three types of \(A_t\), the dense matrix (which learns the complete functional gradient direction) performs the best.
Computational Advantage: The optimization problems for all layers are convex (least squares or quadratically constrained least squares), eliminating the need for hyperparameter search over the scale of random features.

Highlights & Insights¶

Extending boosting from "boosting predictions" to "boosting representations" and combining it with random features is both elegant and practical.
The combination of "convex optimization + deep structures" is feasible and theoretically guaranteed in specific contexts, challenging the stereotype that depth must be non-convex.
The analysis of the failure modes of the gradient-greedy strategy (due to the missing norm constraint) is insightful, correcting the pessimistic conclusions drawn in previous literature.

Limitations & Future Work¶

Scope of Applicability: The paper explicitly targets "small-to-medium-scale" tabular data. It remains unverified on high-dimensional tasks such as large-scale data, images, or text, where it is expected to struggle to compete with deep learning.
Linear Prediction Head: The top layer is restricted to a linear mapping \(W^\top \Phi\), which limits expressiveness.
Dependency on Random Feature Quality: Performance depends heavily on the quality of random features (e.g., SWIM initialization), and the impact of different initialization strategies across diverse datasets has not been fully explored.
Lack of Comparison with GBDTs: XGBoost/LightGBM/CatBoost are the strongest baselines on tabular data, but the paper does not compare directly with them, limiting its comparisons solely to neural network methods.
Empirical Validation of Weak Learning Condition: The theoretical guarantees rely on the \((\beta, \epsilon)\)-weak learning condition, but the paper does not verify whether this condition holds on practical datasets.
Limited to Supervised Learning: The algorithm framework has not been extended to unsupervised, self-supervised, or semi-supervised settings.

Classic Gradient Boosting: XGBoost, LightGBM, and CatBoost use decision trees as weak learners to boost in the label space.
Neural Network Boosting: AdaNet (Cortes et al., 2017) and GrowNet (Badirli et al., 2020) perform boosting in the label space.
Gradient Representation Boosting: Nitanda & Suzuki (2018, 2020) and Suggala et al. (2020) boost in the representation space—RFRBoost directly inherits this framework.
Random Feature Models: RFNN, Reservoir Computing, Random Fourier Features, and SWIM (Bolager et al., 2023).

Rating¶

Novelty: ⭐⭐⭐⭐ — Combining gradient representation boosting theory with random feature ResNets is novel and natural, and the closed-form solution to the sandwiched least squares offers a solid technical contribution.
Experimental Thoroughness: ⭐⭐⭐ — While experimental evaluations on 91 OpenML datasets are extensive, the lack of direct comparison with GBDTs weakens the persuasiveness.
Writing Quality: ⭐⭐⭐⭐ — The theoretical derivations are clear, the algorithmic pseudocode is complete, and the analytical comparison between the two strategies is in-depth.
Value: ⭐⭐⭐ — Makes a solid contribution within the niche domain of random features and tabular data, though its practical applicability limits wider impact.