Distributionally Robust Feature Selection
Conference: NeurIPS 2025 arXiv: 2510.21113 Code: Available (link provided in the paper) Area: Other Keywords: Feature Selection, Distributionally Robust Optimization, Group DRO, Noise Injection, Model-Agnostic
TL;DR
This paper proposes a model-agnostic, distributionally robust feature selection method. It achieves a continuous relaxation of discrete selection by injecting controlled Gaussian noise into the covariates and optimizes the conditional variance of the Bayes-optimal predictor, so that the selected feature subset supports training high-quality downstream models across multiple subpopulations simultaneously.
Background & Motivation
Background: Feature selection is a fundamental problem in machine learning. Classical methods such as Lasso, forward/backward selection, and XGBoost feature importance all optimize for a single distribution. In the area of distributionally robust optimization (DRO), methods such as Group DRO aim to find a single model that performs well across all subgroups, but do not address feature selection.
Limitations of Prior Work: In practice, it is often necessary to select a small number of features under budget constraints (e.g., a medical questionnaire can only ask a limited number of questions), and the selected features must support training good models separately for different subpopulations. Existing feature selection methods do not consider cross-distribution robustness, and DRO methods do not perform feature selection — the intersection of these two directions has been entirely unexplored.
Key Challenge: Feature selection is a discrete combinatorial optimization problem (NP-hard), while DRO requires minimax optimization. The coupling of these two levels of optimization makes direct solutions extremely difficult — each time the feature subset changes, models for all subpopulations must be retrained.
Goal: Select \(k\) features such that models trained separately on each subpopulation \(P_i\) using only those \(k\) features achieve the best possible worst-group performance.
Key Insight: (a) Replace hard selection with noise injection to achieve a continuous relaxation; (b) shift to optimizing the performance of the Bayes-optimal predictor to eliminate dependence on any specific model; (c) leverage the Gaussian noise model to derive closed-form kernel weight expressions.
Core Idea: By combining feature-level noise injection with variance optimization of the Bayes-optimal predictor, the intractable discrete-minimax problem is transformed into a differentiable continuous optimization, enabling model-agnostic distributionally robust feature selection.
Method
Overall Architecture
The input consists of labeled data from multiple subpopulations \(\{(X_i^j, Y_i^j)\}\), and the output is the index set of \(k\) optimal features. The method proceeds in three steps: (1) fit the conditional expectation \(\hat{\mu}_i(X)\) once for each subpopulation; (2) control the information content of each feature via the noise injection parameter \(\alpha\), and optimize the minimax objective through gradient descent; (3) select the \(k\) features with the smallest \(\alpha\) values.
Key Designs
- Continuous Relaxation via Noise Injection:
  - Function: Relaxes the discrete binary mask \(\alpha \in \{0,1\}^m\) to a continuous noise parameter \(\alpha \in \mathbb{R}_{\geq 0}^m\).
  - Mechanism: Rather than directly scaling features (\(\alpha \odot X\), which a model can compensate for inversely), Gaussian noise is injected as \(S_i(\alpha)|X \sim \mathcal{N}(X_i, \alpha_i)\). Here \(\alpha_i = 0\) denotes full information retention, and \(\alpha_i \to \infty\) denotes discarding the feature.
  - Design Motivation: Deterministic scaling does not alter information content (a model can learn to compensate via \(w_i/\alpha_i\)), whereas stochastic noise genuinely reduces the signal-to-noise ratio.
  - Distinction from Lasso: Lasso's \(\ell_1\) regularization is only effective for linear models and assumes a single distribution.
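The scaling-versus-noise argument can be checked numerically. Below is a minimal NumPy sketch (the toy one-feature setup and variable names are mine, not from the paper): deterministic scaling leaves the best linear fit perfect, while injected noise adds irreducible error.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=10_000)        # one standardized feature
y = 2.0 * x                        # noiseless linear target

alpha = 4.0                        # relaxation parameter for this feature

# Deterministic scaling: a model recovers the signal by rescaling its
# weight (w -> w * alpha), so the best linear fit stays perfect.
x_scaled = x / alpha
w_scaled = (x_scaled @ y) / (x_scaled @ x_scaled)
mse_scaled = float(np.mean((y - w_scaled * x_scaled) ** 2))

# Noise injection: S | X ~ N(X, alpha). No reweighting undoes the noise,
# so the best linear fit now carries irreducible error.
s = x + np.sqrt(alpha) * rng.normal(size=x.shape)
w_noise = (s @ y) / (s @ s)
mse_noise = float(np.mean((y - w_noise * s) ** 2))
```

Here `mse_scaled` is numerically zero while `mse_noise` is bounded away from zero, which is exactly why the noise parameter, unlike a scale, genuinely controls information content.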
- Variance Optimization via the Bayes-Optimal Predictor:
  - Function: Replaces the loss of a specific model with the Bayes-optimal loss, bypassing the inner optimization.
  - Mechanism (Theorem 1): Under MSE loss, the problem is equivalent to \(\min_\alpha \max_{P_i} -\mathbb{E}_{S(\alpha)}[\mathbb{E}_X[\mu_i(X)|S(\alpha)]^2] + \lambda \text{Reg}(\alpha)\), where \(\mu_i(X) = \mathbb{E}_{P_i}[Y|X]\).
  - Design Motivation: (a) Eliminates the need to retrain a model for each \(\alpha\); (b) is agnostic to the downstream model architecture; (c) requires fitting \(\mu_i\) only once per subpopulation at the outset.
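The equivalence in Theorem 1 can be sketched via the standard risk decomposition of the Bayes predictor under MSE (my reconstruction in the paper's notation):

```latex
\begin{aligned}
\min_f \mathbb{E}\big[(Y - f(S(\alpha)))^2\big]
  &= \mathbb{E}\big[(Y - \mathbb{E}[Y \mid S(\alpha)])^2\big] \\
  &= \mathbb{E}[Y^2] - \mathbb{E}_{S(\alpha)}\big[\mathbb{E}[Y \mid S(\alpha)]^2\big] \\
  &= \mathbb{E}[Y^2] - \mathbb{E}_{S(\alpha)}\big[\mathbb{E}_X[\mu_i(X) \mid S(\alpha)]^2\big],
\end{aligned}
```

where the last step uses that the injected noise is independent of \(Y\) given \(X\), so \(\mathbb{E}[Y \mid S(\alpha)] = \mathbb{E}_X[\mu_i(X) \mid S(\alpha)]\). Since \(\mathbb{E}[Y^2]\) does not depend on \(\alpha\), minimizing the Bayes risk amounts to minimizing \(-\mathbb{E}_{S(\alpha)}[\mathbb{E}_X[\mu_i(X)\mid S(\alpha)]^2]\), which is exactly the objective above.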
- Closed-Form Kernel Representation (Theorem 2):
  - Function: Transforms the empirical estimate of the conditional expectation into a Gaussian kernel-weighted sum.
  - Closed-Form Weights (inner sum indexed by \(j'\) to avoid clashing with the budget \(k\)): \(w_i^j(S,\alpha) = \frac{\exp(-\frac{1}{2}(X_i^j - S)^T \text{diag}(\alpha)^{-1}(X_i^j - S))}{\sum_{j'} \exp(-\frac{1}{2}(X_i^{j'} - S)^T \text{diag}(\alpha)^{-1}(X_i^{j'} - S))}\)
  - Design Motivation: \(\alpha\) directly controls the kernel bandwidth — a small \(\alpha_i\) yields a narrow kernel along that dimension (information preserved), while a large \(\alpha_i\) yields a wide kernel (information discarded).
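These weights translate directly into code. A minimal NumPy sketch (the helper name, shapes, and toy data are my own assumptions, not from the paper's released code):

```python
import numpy as np

def kernel_weights(X, s, alpha):
    """Theorem-2 weights w^j(s, alpha) for one subpopulation.

    X     : (n, m) covariate matrix of the subpopulation
    s     : (m,)   a noisy query point S
    alpha : (m,)   per-feature noise levels (strictly positive)
    """
    diff = X - s                                     # (n, m)
    logits = -0.5 * np.sum(diff**2 / alpha, axis=1)  # diag(alpha)^{-1} metric
    logits -= logits.max()                           # numerical stability
    w = np.exp(logits)
    return w / w.sum()

# 9 points along the diagonal of R^3, query next to the middle one.
X = np.stack([np.linspace(-2.0, 2.0, 9)] * 3, axis=1)
s = X[4] + 0.1

w_narrow = kernel_weights(X, s, np.full(3, 0.01))  # narrow kernel
w_wide = kernel_weights(X, s, np.full(3, 100.0))   # wide kernel

# Small alpha concentrates weight on the nearest sample; large alpha
# flattens the weights toward uniform, i.e. the feature carries no signal.
```

The two regimes mirror the bandwidth interpretation above: `w_narrow` puts almost all mass on the closest sample, while `w_wide` is nearly uniform over all nine.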
Loss & Training
The final optimization objective is \(\min_\alpha \max_{P_i} -\frac{1}{b}\sum_{\ell=1}^b (\sum_j w_i^j(S^\ell, \alpha) \mu_i(X_i^j))^2\), with the reparameterization trick \(S = X + \sqrt{\alpha} \odot \epsilon\), \(\epsilon \sim \mathcal{N}(0, I)\), ensuring that gradients flow through \(\alpha\). The inner max over subpopulations is approximated by a softmax with temperature \(\beta\). The regularizer \(\lambda \cdot \text{Reg}(\alpha) = \lambda / \|\alpha\|_1\) encourages sparsity by penalizing small total noise, driving uninformative features toward large \(\alpha\). For efficiency, the summation is in practice restricted to the \(k\) nearest neighbors.
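Putting the pieces together, a single Monte Carlo evaluation of the relaxed objective could look like the following NumPy sketch (function names, batching, and the softmax stabilization are my assumptions; a real implementation would differentiate this through \(\alpha\) with an autodiff framework, and the sparsity regularizer and nearest-neighbor truncation are omitted):

```python
import numpy as np

def group_objective(X, mu, alpha, b, rng):
    """Monte Carlo estimate of -(1/b) * sum_l (sum_j w^j(S^l, alpha) mu(X^j))^2
    for one subpopulation (alpha strictly positive)."""
    n, m = X.shape
    idx = rng.integers(0, n, size=b)
    # Reparameterization: S = X + sqrt(alpha) * eps keeps the draw
    # differentiable in alpha (here we only evaluate, not differentiate).
    S = X[idx] + np.sqrt(alpha) * rng.normal(size=(b, m))
    diff = X[None, :, :] - S[:, None, :]              # (b, n, m)
    logits = -0.5 * np.sum(diff**2 / alpha, axis=2)   # (b, n)
    logits -= logits.max(axis=1, keepdims=True)
    W = np.exp(logits)
    W /= W.sum(axis=1, keepdims=True)                 # Theorem-2 weights per draw
    preds = W @ mu                                    # estimated E[mu(X) | S^l]
    return -float(np.mean(preds**2))

def robust_objective(groups, alpha, beta=5.0, b=64, seed=0):
    """Softmax (temperature beta) relaxation of the inner max over groups."""
    rng = np.random.default_rng(seed)
    vals = np.array([group_objective(X, mu, alpha, b, rng) for X, mu in groups])
    z = beta * vals
    z -= z.max()                                      # stable softmax
    soft = np.exp(z) / np.exp(z).sum()
    return float(soft @ vals)

rng = np.random.default_rng(1)
X1 = rng.normal(size=(40, 4)); mu1 = X1[:, 0]         # group 1: mu depends on feature 0
X2 = rng.normal(size=(40, 4)); mu2 = -X2[:, 0]        # group 2: opposite sign
val = robust_objective([(X1, mu1), (X2, mu2)], alpha=np.full(4, 0.5))
```

Because each per-group value is a negated mean of squares, the relaxed objective is always non-positive; gradient descent on \(\alpha\) then trades off noise levels across features and groups.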
Key Experimental Results
Main Results: Synthetic Dataset 1 (Linear Model, 3 Subgroups, 15 Features)
| Method | Group A MSE (budget = 5) | Group B MSE (budget = 5) | Group C MSE (budget = 5) | Worst-Group (budget = 10) |
|---|---|---|---|---|
| Ours | Lowest | Balanced | Balanced | Tied for best |
| DRO-Lasso | Moderate | Moderate | Good | Second best |
| DRO-XGBoost | Good | Moderate | Good | Tied with Ours |
| Vanilla Lasso | Poor | Poor | Best | Poor |
| Embedded MLP | Poor | Poor | Poor | Poor |
Main Results: Real-World Dataset ACS (Income Prediction, 3 States)
| Method | CA MSE↓ | FL MSE↓ | NY MSE↓ | CA R²↑ | FL R²↑ | NY R²↑ |
|---|---|---|---|---|---|---|
| Ours | Lowest (order-of-magnitude lead) | Lowest | Lowest | Highest | Highest | Highest |
| DRO-XGBoost | High | High | High | Low | Low | Low |
| DRO-Lasso | Moderate | Moderate | Moderate | Moderate | Moderate | Moderate |
Ablation Study: Method Property Comparison
| Property | Ours | Lasso | DRO-Lasso | XGBoost | DRO-XGB | Embedded MLP |
|---|---|---|---|---|---|---|
| Model-Agnostic | ✓ | ✗ (linear) | ✗ (linear) | ✗ (tree) | ✗ (tree) | ✗ (MLP) |
| Distributionally Robust | ✓ | ✗ | ✓ | ✗ | ✓ | ✓ |
| Handles Nonlinearity | ✓ | ✗ | ✗ | ✓ | ✓ | ✓ |
| Low Variance | ✓ | Medium | Medium | High | High | High |
Key Findings
- On the ACS dataset, the proposed method achieves MSE one order of magnitude lower than all baselines, with substantially higher R² as well.
- On synthetic data, when coefficients have opposite signs across subgroups (Groups A and B), vanilla Lasso fails completely, while the proposed method maintains balanced performance.
- The proposed method consistently exhibits the lowest variance across random seeds, reflecting stability of the selected features.
- The choice of downstream prediction model (random forest vs. MLP) does not affect the relative ranking of feature selection methods.
Highlights & Insights
- Noise Injection vs. Deterministic Scaling: Deterministic scaling does not alter mutual information (a linear model can compensate by inverting the scale), whereas only stochastic noise genuinely reduces the information content of a feature.
- From Specific Models to the Bayes Optimum: Abandoning optimization of a concrete model's loss in favor of the theoretically optimal loss not only circumvents the computational difficulty of bilevel optimization, but also naturally decouples the method from any downstream model architecture.
- Kernel Bandwidth as Feature Importance: In the resulting Gaussian kernel weights, \(\alpha\) directly controls the kernel bandwidth along each dimension, establishing a direct connection between feature selection and kernel methods.
Limitations & Future Work
- The theoretical derivation is grounded in the bias-variance decomposition under MSE loss; extension to other losses such as cross-entropy requires non-trivial generalization.
- Fitting \(\mu_i(X)\) for each subpopulation in advance means that when subgroup sample sizes are very small, estimation quality may degrade and affect the final results.
- The \(k\)-nearest-neighbor approximation introduces approximation error; neighbor quality may deteriorate in high-dimensional sparse settings.
- The method does not account for interaction effects among features — certain features may be individually uninformative yet highly informative in combination.
Related Work & Insights
- vs. Lasso: Lasso assumes linearity and a single distribution; the proposed method is model-agnostic and robust across multiple distributions.
- vs. Group DRO (Sagawa et al. 2019): Group DRO trains a single robust model, whereas the proposed method first selects features and then trains separate models per subpopulation — offering greater flexibility.
- vs. MAML (Finn et al. 2017): Both have a bilevel structure, but MAML requires differentiating through inner-loop training, while the proposed method entirely avoids this by appealing to the Bayes optimum.
Rating
- Novelty: ⭐⭐⭐⭐ First formal study of distributionally robust feature selection; the combination of noise injection and the Bayes-optimal predictor is novel.
- Experimental Thoroughness: ⭐⭐⭐⭐ Coverage of both synthetic and real-world data across regression and classification settings, though large-scale high-dimensional experiments are absent.
- Writing Quality: ⭐⭐⭐⭐⭐ Problem formulation is clear, derivations proceed in a logical sequence, and the flow from motivation to method to experiments is coherent throughout.
- Value: ⭐⭐⭐⭐ The problem has strong practical relevance (medical questionnaires, sensor deployment); the method is concise and practically applicable.