Distributionally Robust Feature Selection
Conference: NeurIPS 2025 arXiv: 2510.21113 Code: Available (link provided in the paper) Area: Other Keywords: Feature Selection, Distributionally Robust Optimization, Group DRO, Noise Injection, Model-Agnostic
TL;DR
This paper proposes a model-agnostic, distributionally robust feature selection method. It achieves a continuous relaxation of discrete selection by injecting controlled Gaussian noise into the covariates and optimizes the conditional variance of the Bayes-optimal predictor, so that the selected feature subset supports training high-quality downstream models across multiple subpopulations simultaneously.
Background & Motivation
Background: Feature selection is a fundamental problem in machine learning. Classical methods such as Lasso, forward/backward selection, and XGBoost feature importance all optimize for a single distribution. In the area of distributionally robust optimization (DRO), methods such as Group DRO aim to find a single model that performs well across all subgroups, but do not address feature selection.
Limitations of Prior Work: In practice, it is often necessary to select a small number of features under budget constraints (e.g., a medical questionnaire can only ask a limited number of questions), and the selected features must support training good models separately for different subpopulations. Existing feature selection methods do not consider cross-distribution robustness, and DRO methods do not perform feature selection — the intersection of these two directions has been entirely unexplored.
Key Challenge: Feature selection is a discrete combinatorial optimization problem (NP-hard), while DRO requires minimax optimization. The coupling of these two levels of optimization makes direct solutions extremely difficult — each time the feature subset changes, models for all subpopulations must be retrained.
Goal: Select \(k\) features such that models trained separately on each subpopulation \(P_i\) using only those \(k\) features achieve the best possible worst-group performance.
Key Insight: (a) Replace hard selection with noise injection to achieve a continuous relaxation; (b) shift to optimizing the performance of the Bayes-optimal predictor to eliminate dependence on any specific model; (c) leverage the Gaussian noise model to derive closed-form kernel weight expressions.
Core Idea: By combining feature-level noise injection with variance optimization of the Bayes-optimal predictor, the intractable discrete-minimax problem is transformed into a differentiable continuous optimization, enabling model-agnostic distributionally robust feature selection.
Method
Overall Architecture
The input consists of labeled data from multiple subpopulations \(\{(X_i^j, Y_i^j)\}\), and the output is the index set of \(k\) optimal features. The method proceeds in three steps: (1) fit the conditional expectation \(\hat{\mu}_i(X)\) once for each subpopulation; (2) control the information content of each feature via the noise injection parameter \(\alpha\), and optimize the minimax objective through gradient descent; (3) select the \(k\) features with the smallest \(\alpha\) values.
Key Designs
- Continuous Relaxation via Noise Injection:
  - Function: Relaxes the discrete binary mask \(\alpha \in \{0,1\}^m\) to a continuous noise parameter \(\alpha \in \mathbb{R}_{\geq 0}^m\).
  - Mechanism: Rather than directly scaling features (\(\alpha \odot X\), which a model can compensate for inversely), Gaussian noise is injected as \(S_i(\alpha)|X \sim \mathcal{N}(X_i, \alpha_i)\). Here \(\alpha_i = 0\) denotes full information retention, and \(\alpha_i \to \infty\) denotes discarding the feature.
  - Design Motivation: Deterministic scaling does not alter information content (a model can learn to compensate via \(w_i/\alpha_i\)), whereas stochastic noise genuinely reduces the signal-to-noise ratio.
  - Distinction from Lasso: Lasso's \(\ell_1\) regularization is only effective for linear models and assumes a single distribution.
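The scaling-versus-noise argument can be checked numerically. Below is a minimal NumPy sketch (the toy one-feature setup and variable names are mine, not from the paper): deterministic scaling leaves the best linear fit perfect, while injected noise adds irreducible error.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=10_000)        # one standardized feature
y = 2.0 * x                        # noiseless linear target

alpha = 4.0                        # relaxation parameter for this feature

# Deterministic scaling: a model recovers the signal by rescaling its
# weight (w -> w * alpha), so the best linear fit stays perfect.
x_scaled = x / alpha
w_scaled = (x_scaled @ y) / (x_scaled @ x_scaled)
mse_scaled = float(np.mean((y - w_scaled * x_scaled) ** 2))

# Noise injection: S | X ~ N(X, alpha). No reweighting undoes the noise,
# so the best linear fit now carries irreducible error.
s = x + np.sqrt(alpha) * rng.normal(size=x.shape)
w_noise = (s @ y) / (s @ s)
mse_noise = float(np.mean((y - w_noise * s) ** 2))
```

Here `mse_scaled` is numerically zero while `mse_noise` is bounded away from zero, which is exactly why the noise parameter, unlike a scale, genuinely controls information content.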
- Variance Optimization via the Bayes-Optimal Predictor:
  - Function: Replaces the loss of a specific model with the Bayes-optimal loss, bypassing the inner optimization.
  - Mechanism (Theorem 1): Under MSE loss, the problem is equivalent to \(\min_\alpha \max_{P_i} -\mathbb{E}_{S(\alpha)}[\mathbb{E}_X[\mu_i(X)|S(\alpha)]^2] + \lambda \text{Reg}(\alpha)\), where \(\mu_i(X) = \mathbb{E}_{P_i}[Y|X]\).
  - Design Motivation: (a) Eliminates the need to retrain a model for each \(\alpha\); (b) is agnostic to the downstream model architecture; (c) requires fitting \(\mu_i\) only once per subpopulation at the outset.
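The equivalence in Theorem 1 can be sketched via the standard risk decomposition of the Bayes predictor under MSE (my reconstruction in the paper's notation):

```latex
\begin{aligned}
\min_f \mathbb{E}\big[(Y - f(S(\alpha)))^2\big]
  &= \mathbb{E}\big[(Y - \mathbb{E}[Y \mid S(\alpha)])^2\big] \\
  &= \mathbb{E}[Y^2] - \mathbb{E}_{S(\alpha)}\big[\mathbb{E}[Y \mid S(\alpha)]^2\big] \\
  &= \mathbb{E}[Y^2] - \mathbb{E}_{S(\alpha)}\big[\mathbb{E}_X[\mu_i(X) \mid S(\alpha)]^2\big],
\end{aligned}
```

where the last step uses that the injected noise is independent of \(Y\) given \(X\), so \(\mathbb{E}[Y \mid S(\alpha)] = \mathbb{E}_X[\mu_i(X) \mid S(\alpha)]\). Since \(\mathbb{E}[Y^2]\) does not depend on \(\alpha\), minimizing the Bayes risk amounts to minimizing \(-\mathbb{E}_{S(\alpha)}[\mathbb{E}_X[\mu_i(X)\mid S(\alpha)]^2]\), which is exactly the objective above.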
- Closed-Form Kernel Representation (Theorem 2):
  - Function: Transforms the empirical estimate of the conditional expectation into a Gaussian kernel-weighted sum.
  - Closed-Form Weights (inner sum indexed by \(j'\) to avoid clashing with the budget \(k\)): \(w_i^j(S,\alpha) = \frac{\exp(-\frac{1}{2}(X_i^j - S)^T \text{diag}(\alpha)^{-1}(X_i^j - S))}{\sum_{j'} \exp(-\frac{1}{2}(X_i^{j'} - S)^T \text{diag}(\alpha)^{-1}(X_i^{j'} - S))}\)
  - Design Motivation: \(\alpha\) directly controls the kernel bandwidth — a small \(\alpha_i\) yields a narrow kernel along that dimension (information preserved), while a large \(\alpha_i\) yields a wide kernel (information discarded).
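These weights translate directly into code. A minimal NumPy sketch (the helper name, shapes, and toy data are my own assumptions, not from the paper's released code):

```python
import numpy as np

def kernel_weights(X, s, alpha):
    """Theorem-2 weights w^j(s, alpha) for one subpopulation.

    X     : (n, m) covariate matrix of the subpopulation
    s     : (m,)   a noisy query point S
    alpha : (m,)   per-feature noise levels (strictly positive)
    """
    diff = X - s                                     # (n, m)
    logits = -0.5 * np.sum(diff**2 / alpha, axis=1)  # diag(alpha)^{-1} metric
    logits -= logits.max()                           # numerical stability
    w = np.exp(logits)
    return w / w.sum()

# 9 points along the diagonal of R^3, query next to the middle one.
X = np.stack([np.linspace(-2.0, 2.0, 9)] * 3, axis=1)
s = X[4] + 0.1

w_narrow = kernel_weights(X, s, np.full(3, 0.01))  # narrow kernel
w_wide = kernel_weights(X, s, np.full(3, 100.0))   # wide kernel

# Small alpha concentrates weight on the nearest sample; large alpha
# flattens the weights toward uniform, i.e. the feature carries no signal.
```

The two regimes mirror the bandwidth interpretation above: `w_narrow` puts almost all mass on the closest sample, while `w_wide` is nearly uniform over all nine.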
Loss & Training
The final optimization objective is \(\min_\alpha \max_{P_i} -\frac{1}{b}\sum_{\ell=1}^b (\sum_j w_i^j(S^\ell, \alpha) \mu_i(X_i^j))^2\), with the reparameterization trick \(S = X + \sqrt{\alpha} \odot \epsilon\), \(\epsilon \sim \mathcal{N}(0, I)\), ensuring that gradients flow through \(\alpha\). The inner max over subpopulations is approximated by a softmax with temperature \(\beta\). The regularizer \(\lambda \cdot \text{Reg}(\alpha) = \lambda / \|\alpha\|_1\) encourages sparsity by penalizing small total noise, driving uninformative features toward large \(\alpha\). For efficiency, the summation is in practice restricted to the \(k\) nearest neighbors.
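Putting the pieces together, a single Monte Carlo evaluation of the relaxed objective could look like the following NumPy sketch (function names, batching, and the softmax stabilization are my assumptions; a real implementation would differentiate this through \(\alpha\) with an autodiff framework, and the sparsity regularizer and nearest-neighbor truncation are omitted):

```python
import numpy as np

def group_objective(X, mu, alpha, b, rng):
    """Monte Carlo estimate of -(1/b) * sum_l (sum_j w^j(S^l, alpha) mu(X^j))^2
    for one subpopulation (alpha strictly positive)."""
    n, m = X.shape
    idx = rng.integers(0, n, size=b)
    # Reparameterization: S = X + sqrt(alpha) * eps keeps the draw
    # differentiable in alpha (here we only evaluate, not differentiate).
    S = X[idx] + np.sqrt(alpha) * rng.normal(size=(b, m))
    diff = X[None, :, :] - S[:, None, :]              # (b, n, m)
    logits = -0.5 * np.sum(diff**2 / alpha, axis=2)   # (b, n)
    logits -= logits.max(axis=1, keepdims=True)
    W = np.exp(logits)
    W /= W.sum(axis=1, keepdims=True)                 # Theorem-2 weights per draw
    preds = W @ mu                                    # estimated E[mu(X) | S^l]
    return -float(np.mean(preds**2))

def robust_objective(groups, alpha, beta=5.0, b=64, seed=0):
    """Softmax (temperature beta) relaxation of the inner max over groups."""
    rng = np.random.default_rng(seed)
    vals = np.array([group_objective(X, mu, alpha, b, rng) for X, mu in groups])
    z = beta * vals
    z -= z.max()                                      # stable softmax
    soft = np.exp(z) / np.exp(z).sum()
    return float(soft @ vals)

rng = np.random.default_rng(1)
X1 = rng.normal(size=(40, 4)); mu1 = X1[:, 0]         # group 1: mu depends on feature 0
X2 = rng.normal(size=(40, 4)); mu2 = -X2[:, 0]        # group 2: opposite sign
val = robust_objective([(X1, mu1), (X2, mu2)], alpha=np.full(4, 0.5))
```

Because each per-group value is a negated mean of squares, the relaxed objective is always non-positive; gradient descent on \(\alpha\) then trades off noise levels across features and groups.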
Key Experimental Results
Main Results: Synthetic Dataset 1 (Linear Model, 3 Subgroups, 15 Features)
| Method | Group A MSE (budget = 5) | Group B MSE (budget = 5) | Group C MSE (budget = 5) | Worst-Group (budget = 10) |
|---|---|---|---|---|
| Ours | Lowest | Balanced | Balanced | Tied for best |
| DRO-Lasso | Moderate | Moderate | Good | Second best |
| DRO-XGBoost | Good | Moderate | Good | Tied with Ours |
| Vanilla Lasso | Poor | Poor | Best | Poor |
| Embedded MLP | Poor | Poor | Poor | Poor |
Main Results: Real-World Dataset ACS (Income Prediction, 3 States)
| Method | CA MSE↓ | FL MSE↓ | NY MSE↓ | CA R²↑ | FL R²↑ | NY R²↑ |
|---|---|---|---|---|---|---|
| Ours | Lowest (order-of-magnitude lead) | Lowest | Lowest | Highest | Highest | Highest |
| DRO-XGBoost | High | High | High | Low | Low | Low |
| DRO-Lasso | Moderate | Moderate | Moderate | Moderate | Moderate | Moderate |
Ablation Study: Method Property Comparison
| Property | Ours | Lasso | DRO-Lasso | XGBoost | DRO-XGB | Embedded MLP |
|---|---|---|---|---|---|---|
| Model-Agnostic | ✓ | ✗ (linear) | ✗ (linear) | ✗ (tree) | ✗ (tree) | ✗ (MLP) |
| Distributionally Robust | ✓ | ✗ | ✓ | ✗ | ✓ | ✓ |
| Handles Nonlinearity | ✓ | ✗ | ✗ | ✓ | ✓ | ✓ |
| Low Variance | ✓ | Medium | Medium | High | High | High |
Key Findings
- On the ACS dataset, the proposed method achieves MSE one order of magnitude lower than all baselines, with substantially higher R² as well.
- On synthetic data, when coefficients have opposite signs across subgroups (Groups A and B), vanilla Lasso fails completely, while the proposed method maintains balanced performance.
- The proposed method consistently exhibits the lowest variance across random seeds, reflecting stability of the selected features.
- The choice of downstream prediction model (random forest vs. MLP) does not affect the relative ranking of feature selection methods.
Highlights & Insights
- Noise Injection vs. Deterministic Scaling: Deterministic scaling does not alter mutual information (a linear model can compensate by inverting the scale), whereas only stochastic noise genuinely reduces the information content of a feature.
- From Specific Models to the Bayes Optimum: Abandoning optimization of a concrete model's loss in favor of the theoretically optimal loss not only circumvents the computational difficulty of bilevel optimization, but also naturally decouples the method from any downstream model architecture.
- Kernel Bandwidth as Feature Importance: In the resulting Gaussian kernel weights, \(\alpha\) directly controls the kernel bandwidth along each dimension, establishing a direct connection between feature selection and kernel methods.
Limitations & Future Work
- The theoretical derivation is grounded in the bias-variance decomposition under MSE loss; extension to other losses such as cross-entropy requires non-trivial generalization.
- Fitting \(\mu_i(X)\) for each subpopulation in advance means that when subgroup sample sizes are very small, estimation quality may degrade and affect the final results.
- The \(k\)-nearest-neighbor approximation introduces approximation error; neighbor quality may deteriorate in high-dimensional sparse settings.
- The method does not account for interaction effects among features — certain features may be individually uninformative yet highly informative in combination.
Related Work & Insights
- vs. Lasso: Lasso assumes linearity and a single distribution; the proposed method is model-agnostic and robust across multiple distributions.
- vs. Group DRO (Sagawa et al. 2019): Group DRO trains a single robust model, whereas the proposed method first selects features and then trains separate models per subpopulation — offering greater flexibility.
- vs. MAML (Finn et al. 2017): Both have a bilevel structure, but MAML requires differentiating through inner-loop training, while the proposed method entirely avoids this by appealing to the Bayes optimum.
Rating
- Novelty: ⭐⭐⭐⭐ First formal study of distributionally robust feature selection; the combination of noise injection and the Bayes-optimal predictor is novel.
- Experimental Thoroughness: ⭐⭐⭐⭐ Coverage of both synthetic and real-world data across regression and classification settings, though large-scale high-dimensional experiments are absent.
- Writing Quality: ⭐⭐⭐⭐⭐ Problem formulation is clear, derivations proceed in a logical sequence, and the flow from motivation to method to experiments is coherent throughout.
- Value: ⭐⭐⭐⭐ The problem has strong practical relevance (medical questionnaires, sensor deployment); the method is concise and practically applicable.