Autoencoding Random Forests
- Conference: NeurIPS 2025
- arXiv: 2505.21441
- Code: https://github.com/bips-hb/RFAE
- Area: Representation Learning / Tree Models
- Keywords: Random Forests, Autoencoder, Diffusion Maps, Spectral Decomposition, Tabular Data
TL;DR
RFAE is the first principled encode-decode framework for random forests. It exploits the positive-definiteness and universality of the RF kernel to derive low-dimensional encodings via diffusion-map spectral decomposition, and decodes back to the original feature space through k-NN regression in the low-dimensional embedding space. Across 20 tabular datasets, RFAE achieves an average reconstruction rank of 1.80, substantially outperforming the strongest deep baseline TTVAE (2.45) as well as AE (3.27) and TVAE (3.38), and it is successfully applied to MNIST reconstruction and scRNA-seq batch-effect removal.
Background & Motivation
Background: Deep learning dominates representation learning (AE, VAE), yet random forests frequently outperform deep models on tabular data. Tabular datasets contain mixed continuous/categorical features, which tree models handle more naturally than neural networks.
Limitations of Prior Work: Random forests lack a principled encode-decode pipeline—there is no established way to project RF-learned representations into a low-dimensional space and decode them back. Existing approaches (e.g., applying MDS directly to the RF similarity matrix) offer no theoretical guarantees and provide no decoding capability.
Key Challenge: RFs implicitly learn rich data similarity structure through leaf-node co-occurrence frequencies, but this structure is locked inside the trees and cannot be exploited for compression, visualization, denoising, or other tasks that require encoding and decoding.
Goal: To construct a theoretically grounded encoder (low-dimensional embedding) and decoder (recovery of original features from the embedding) for random forests.
Key Insight: The RF kernel \(k_n^{RF}(\mathbf{x}, \mathbf{x}')\) (the fraction of trees in which two samples share a leaf node) is shown to be positive-definite, asymptotically universal, and characteristic (Theorem 3.4), making it directly amenable to spectral decomposition via diffusion maps.
Core Idea: Positive-definiteness of the RF kernel → diffusion-map spectral decomposition = encoder; k-NN regression in leaf-node space = decoder; theoretical guarantees of universal consistency.
Method
Overall Architecture
Training data → Random Forest → RF similarity kernel matrix → Encoding: diffusion-map spectral decomposition \(\mathbf{Z} = \sqrt{n}\mathbf{V}\bm{\Lambda}^t\) → Decoding: k-NN regression (neighbors found in low-dimensional space, reconstruction via distance-weighted average of original features) → Reconstructed data
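To make the first two stages concrete, here is a minimal sketch of building the RF leaf co-occurrence kernel with scikit-learn. The dataset, forest settings, and variable names are illustrative assumptions, not the authors' reference implementation (see the linked repository for that).

```python
# Minimal sketch: RF similarity kernel from leaf co-occurrence.
# Dataset and hyperparameters are illustrative, not the paper's setup.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# leaves[i, t] = index of the leaf that sample i falls into in tree t
leaves = rf.apply(X)                          # shape (n, n_trees)
n, n_trees = leaves.shape

# K[i, j] = fraction of trees in which samples i and j share a leaf,
# i.e. the RF kernel k_n^RF(x_i, x_j)
K = np.zeros((n, n))
for t in range(n_trees):
    K += leaves[:, t][:, None] == leaves[:, t][None, :]
K /= n_trees
```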
Key Designs
- Theoretical Guarantees for the RF Kernel (Theorem 3.4):
  - Function: Establishes that the RF kernel is suitable for diffusion maps.
  - Mechanism: Proves that \(k_n^{RF}\) is positive-definite (enabling spectral decomposition), asymptotically universal (approximating any continuous kernel), and characteristic (distinct distributions yield distinct embeddings), guaranteeing the quality of the diffusion-map embedding.
  - Design Motivation: Without these guarantees, applying diffusion maps to the RF kernel could produce degenerate or inconsistent embeddings.
- Diffusion-Map Encoder (see the sketch after this list):
  - Function: Decomposes the RF similarity matrix into a low-dimensional embedding.
  - Mechanism: \(\mathbf{Z} = \sqrt{n}\mathbf{V}\bm{\Lambda}^t\), where \(\mathbf{V}\) contains the eigenvectors of the kernel matrix, \(\bm{\Lambda}\) contains the eigenvalues, and \(t\) controls the diffusion time scale. New samples are projected via the Nyström extension.
  - Design Motivation: Diffusion maps naturally capture nonlinear manifold structure, making them more suitable for complex data than PCA.
- k-NN Regression Decoder (Theorem 4.4; also covered in the sketch after this list):
  - Function: Recovers original features from the low-dimensional embedding.
  - Mechanism: Identifies the \(k\) nearest neighbors in the low-dimensional space and computes a distance-weighted average of their original features as the reconstruction. Theorem 4.4 proves universal consistency of the k-NN decoder: reconstruction error converges to zero as the sample size grows.
  - Design Motivation: Achieves a better accuracy-efficiency trade-off than ILP (exact but slow) and split relabeling (fast but less accurate).
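Below is a hedged sketch of both components, implementing \(\mathbf{Z} = \sqrt{n}\mathbf{V}\bm{\Lambda}^t\) and a distance-weighted k-NN decoder. It reuses `K` and `X` from the kernel snippet above; the eigenpair selection and inverse-distance weighting are plausible readings of the description, not the authors' exact construction.

```python
import numpy as np

def encode(K, d=2, t=1):
    """Diffusion-map embedding Z = sqrt(n) * V * Lambda^t from the kernel K."""
    n = K.shape[0]
    eigvals, eigvecs = np.linalg.eigh(K)       # ascending eigenvalues
    top = np.argsort(eigvals)[::-1][:d]        # keep the d largest eigenpairs
    V, lam = eigvecs[:, top], eigvals[top]     # (diffusion maps often drop the
    return np.sqrt(n) * V * lam**t             #  trivial top pair; omitted here)

def decode(Z_train, Z_query, X_train, k=5, eps=1e-12):
    """k-NN decoder: distance-weighted average of neighbors' original features."""
    recon = np.empty((len(Z_query), X_train.shape[1]))
    for i, z in enumerate(Z_query):
        dist = np.linalg.norm(Z_train - z, axis=1)
        nn = np.argsort(dist)[:k]              # k nearest neighbors in Z-space
        w = 1.0 / (dist[nn] + eps)             # inverse-distance weights
        recon[i] = w @ X_train[nn] / w.sum()
    return recon

Z = encode(K, d=2)          # K, X from the kernel snippet above
X_hat = decode(Z, Z, X)     # on training points the self-neighbor dominates;
                            # new points would first be embedded via Nyström
```

For categorical features, the weighted average over neighbors would be replaced by a distance-weighted vote.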
Loss & Training
- Reconstruction evaluation: \(1 - R^2\) for continuous features; classification error rate for categorical features.
- Aggregated DICE metric (a Dice dissimilarity between sets \(A\) and \(B\)): \(\text{DICE} = 1 - \frac{2|A \cap B|}{|A| + |B|}\). Illustrative implementations of these metrics follow below.
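A minimal sketch of the three metrics above; how they are aggregated across features and datasets is an assumption here, and the exact evaluation protocol is in the paper.

```python
import numpy as np

def continuous_error(x, x_hat):
    """1 - R^2 for one continuous feature (0 = perfect reconstruction)."""
    ss_res = np.sum((x - x_hat) ** 2)
    ss_tot = np.sum((x - np.mean(x)) ** 2)
    return ss_res / ss_tot                     # equals 1 - R^2

def categorical_error(x, x_hat):
    """Classification error rate for one categorical feature."""
    return float(np.mean(np.asarray(x) != np.asarray(x_hat)))

def dice(A, B):
    """DICE = 1 - 2|A ∩ B| / (|A| + |B|) for two sets A and B."""
    A, B = set(A), set(B)
    return 1.0 - 2.0 * len(A & B) / (len(A) + len(B))
```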
Key Experimental Results
Main Results (Average Rank across 20 Tabular Datasets)
| Method | Avg. Rank | # Best Datasets |
|---|---|---|
| RFAE | 1.80 | 12/20 |
| TTVAE | 2.45 | — |
| AE | 3.27 | — |
| TVAE | 3.38 | — |
| VAE | 4.10 | — |
Ablation Study
| Decoding Method | Speed | Accuracy | Characteristics |
|---|---|---|---|
| ILP (integer linear programming) | Slow | Highest | Exact leaf-node assignment |
| Split relabeling | Fast | Moderate | CART synthetic data |
| k-NN regression | Fast | High | Best trade-off |
Key Findings
- RFAE ranks first on 12/20 tabular datasets, substantially outperforming deep autoencoder methods.
- At \(d_\mathcal{Z}=32\) on MNIST, recognizable digit reconstructions are obtained, demonstrating that the RF kernel retains sufficient visual information.
- scRNA-seq batch-effect removal succeeds, achieving batch harmonization across the Zeisel (2,874 samples) and Tasic (1,590 samples) datasets.
- The framework applies to both fully random forests and supervised RFs.
Highlights & Insights
- Solid theoretical foundation: The triple guarantee that the RF kernel is positive-definite, universal, and characteristic is the central theoretical contribution, legitimizing the RF kernel as a proper kernel method.
- Further evidence that tree models beat deep models on tabular data: RFAE substantially outperforms AE/VAE/TVAE on tabular benchmarks.
- Cross-domain applicability: From tabular reconstruction to genomic denoising, the results demonstrate the generality of RFAE.
Limitations & Future Work
- Decoding is computationally non-trivial: the k-NN decoder must retain and query the original RF and its training data.
- Reconstruction quality is sensitive to the choice of \(k\) in the k-NN decoder.
- The framework is not suited to end-to-end optimization for high-dimensional data such as images.
- The model cannot be distilled into a smaller one, since the original RF must be retained.
Related Work & Insights
- vs. TVAE/CTGAN: Deep generative models generally underperform tree models on tabular data; RFAE extends this advantage to the autoencoding setting.
- vs. Kernel PCA: RFAE employs an RF-specific kernel with stronger theoretical guarantees and automatic handling of mixed feature types.
- Insight: The RF kernel can serve as a general nonparametric similarity measure, applicable not only to classification but also to representation learning.
Rating
- Novelty: ⭐⭐⭐⭐⭐ First complete autoencoding framework for random forests, with substantial theoretical contributions.
- Experimental Thoroughness: ⭐⭐⭐⭐ 20 tabular datasets + MNIST + scRNA-seq across three application domains.
- Writing Quality: ⭐⭐⭐⭐⭐ Rigorous theoretical derivations with clearly stated theorems.
- Value: ⭐⭐⭐⭐ Opens a new direction for combining tree models with representation learning.
Supplementary Notes on the Method
- Complexity of the encoding process: Constructing the kernel matrix requires \(O(n^2 d)\) time, and spectral decomposition requires \(O(n^3)\); for large datasets, the Nyström approximation reduces this to \(O(nm^2)\) using \(m\) anchor points (a sketch follows at the end of this list).
- Consistency guarantee for k-NN decoding (Theorem 4.4): As \(n \to \infty\) with \(k/n \to 0\) and \(k \to \infty\), the reconstruction error converges to the Bayes optimum—the first universal consistency result for decoding with random forests.
- Fundamental difference from deep autoencoders: Both the encoder (determined by the RF kernel) and the decoder (k-NN) in RFAE are nonparametric, avoiding the overfitting risks of parametric training at the cost of flexibility being bounded by the kernel.
- Significance of the scRNA-seq application: Batch effects are a central challenge in single-cell omics; RFAE's nonlinear embedding removes technical batch variation while preserving cell-type structure.
- Multi-task potential: The low-dimensional embedding of RFAE can simultaneously support clustering, visualization, and denoising—multiple uses from a single training run.
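For the Nyström approximation mentioned in the complexity note, here is a hedged sketch under standard Nyström conventions; anchor selection by uniform subsampling and the exact eigenvector scaling are assumptions, not the paper's recipe.

```python
import numpy as np

def nystrom_embed(K_nm, K_mm, d=2, t=1):
    """Approximate diffusion-map embedding from an (n x m) kernel slice.

    K_mm: kernel among the m anchor points (m x m)
    K_nm: kernel between all n points and the anchors (n x m)
    Costs O(m^3) for the small eigendecomposition plus O(nmd) for the
    extension, instead of O(n^3) for the full spectral decomposition.
    """
    m = K_mm.shape[0]
    eigvals, eigvecs = np.linalg.eigh(K_mm)
    top = np.argsort(eigvals)[::-1][:d]        # d largest anchor eigenpairs
    U, lam = eigvecs[:, top], eigvals[top]
    V = K_nm @ U / lam                # Nyström extension of anchor eigenvectors
    return np.sqrt(m) * V * lam**t    # scaling conventions vary in the literature

# usage sketch: m random anchors out of the n training points
# idx = np.random.choice(n, size=m, replace=False)
# Z_approx = nystrom_embed(K[:, idx], K[np.ix_(idx, idx)], d=2)
```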