# Intervention Efficiency and Perturbation Validation Framework: Capacity-Aware and Robust Clinical Model Selection under the Rashomon Effect
Conference: AAAI 2026 · arXiv: 2511.14317 · Code: GitHub · Area: Medical Imaging / Clinical Machine Learning · Keywords: Rashomon Effect, Model Selection, Intervention Efficiency, Perturbation Validation, Class Imbalance, Clinical Deployment
## TL;DR
To address model selection under the Rashomon Effect (where many models achieve similar performance on small, class-imbalanced clinical datasets), this paper proposes Intervention Efficiency (IE), a capacity-aware evaluation metric, and the Perturbation Validation Framework (PVF), which screens candidates for robustness to feature perturbation; together they enable reliable model selection under resource constraints.
## Background & Motivation

### Core Challenges in Clinical Predictive Modeling
Small samples + class imbalance: High data acquisition costs and strict ethical constraints in clinical settings result in extremely rare positive events (e.g., adverse reactions), rendering traditional accuracy metrics misleading.
Rashomon Effect (model multiplicity): On small datasets, different models (e.g., logistic regression, SVM, random forest) may achieve similar performance while relying on entirely different feature subsets, making the question of "which model to choose" non-trivial.
Limitations of prior work:
- F1 score ignores true negatives and may favor suboptimal models
- AUC-ROC overestimates performance on imbalanced data
- AUC-PR, though better suited to rare events, is sensitive to prevalence and offers poor clinical interpretability
Validation instability: Small datasets amplify variance, causing model rankings to fluctuate drastically across different data splits, making single-split validation results unreliable.
### Core Motivation
- Resource constraints are neglected: Clinical settings often allow intervention for only a limited number of patients, yet existing metrics do not account for "intervention capacity."
- Robustness is overlooked: Traditional validation provides only point estimates without assessing model stability under data perturbation.
- A single deployable model is required: Clinical practice prioritizes interpretable single-model predictions over ensemble methods.
## Method

### Overall Architecture
Two complementary tools are proposed:
```
┌───────────────────────────────────────────────────────┐
│      Candidate Model Set F = {f₁, f₂, ..., f_Q}       │
├─────────────────┬─────────────────────────────────────┤
│ IE Evaluation   │ PVF Robustness Screening            │
│ (Capacity-Aware)│ (Perturbation + Aggregation)        │
│                 │ Original Val Set → M Perturbed Sets │
│ IE_γ(f,D)       │ → Evaluate all models on M sets     │
│                 │ → Aggregate into a single score     │
├─────────────────┴─────────────────────────────────────┤
│                Select f* = argmax A_f                 │
└───────────────────────────────────────────────────────┘
```
### Key Design 1: Intervention Efficiency (IE)
Core Idea: Quantify how many additional true positive cases a model-guided intervention captures over random intervention, given a finite intervention capacity \(\gamma\) (the ratio of intervened individuals to the total population).
Closed-form formula (the ratio of true positives captured by model-guided intervention to those captured by random intervention at the same capacity):

\[
\mathrm{IE}_\gamma(f, \mathcal{D}) = \frac{s\,p + (\gamma - s)\,\pi}{\gamma\,\pi}
\]

where:
- \(p\) = precision, \(r\) = recall, \(\pi\) = prevalence
- \(s = \min\left(\gamma, \frac{\pi r}{p}\right)\): the proportion of the population that receives model-guided intervention (\(\pi r / p\) is the fraction the model flags positive)
- \(\gamma = c / \beta\): the ratio of intervention capacity \(c\) to total population size \(\beta\)
Two operating regimes:
- Resource-scarce (Regime A): the number of model-predicted positives exceeds capacity \(c\); only the top-\(c\) predictions are intervened upon, and IE simplifies to \(\beta p / \alpha = p / \pi\) (where \(\alpha = \pi\beta\) is the number of positive cases)
- Resource-abundant (Regime B): intervention capacity covers all model-predicted positives; the remaining capacity is allocated randomly
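A minimal numeric sketch of this computation, assuming the ratio-to-random form implied by the two regimes (`intervention_efficiency` is a hypothetical helper name, not from the paper):

```python
def intervention_efficiency(p, r, pi, gamma):
    """Intervention Efficiency: true positives captured by model-guided
    triage relative to random triage at the same capacity gamma.

    p: precision, r: recall, pi: prevalence,
    gamma: intervention capacity as a fraction of the population.
    """
    if min(p, r, pi, gamma) <= 0:
        raise ValueError("p, r, pi and gamma must be positive")
    flagged = pi * r / p      # fraction of the population the model flags positive
    s = min(gamma, flagged)   # capacity actually spent on model-guided picks
    # Regime A (s == gamma): all capacity goes to top predictions -> IE = p / pi
    # Regime B (s < gamma): leftover capacity (gamma - s) is spent at random
    return (s * p + (gamma - s) * pi) / (gamma * pi)

# Regime A: the model flags 16% of patients but capacity covers only 5%
ie_a = intervention_efficiency(p=0.5, r=0.8, pi=0.1, gamma=0.05)  # p / pi = 5.0
# Regime B: capacity (50%) exceeds the flagged fraction (16%)
ie_b = intervention_efficiency(p=0.5, r=0.8, pi=0.1, gamma=0.5)
```

Note how in Regime A the gain over random triage reduces to the precision-to-prevalence ratio, independent of recall.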
### Key Design 2: Perturbation Validation Framework (PVF)
Procedure:
1. Fix the original validation set \(\mathcal{D}_{val}\)
2. Independently perturb each sample's features to generate \(M\) perturbed validation sets (each containing \(k \cdot n\) samples)
3. Evaluate all candidate models on each perturbed validation set
4. Aggregate the \(M\) scores per model with an aggregation function \(\mathcal{A}\) (e.g., the 25th percentile) into a single robustness score
5. Select the model with the highest aggregated score: \(f^* = \arg\max_{f \in \mathcal{F}} A_f\)
Perturbation mechanism (by feature type):
| Feature Type | Perturbation Method | Control Parameter |
|---|---|---|
| Numerical | Add Gaussian noise \(\varepsilon \sim \mathcal{N}(0, \sigma^2)\) | Noise std \(\sigma\) |
| Categorical (nominal) | Randomly flip to another category with probability \(\xi\) (uniform sampling) | Flip probability \(\xi\) |
| Ordered (ordinal) | Sample neighboring categories with distance-decay probability \(\xi\) | Flip probability \(\xi\), decay parameter \(\lambda\) |
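The three mechanisms in the table can be sketched as follows; the exponential kernel for the ordinal distance decay is an assumption on my part, since the summary only specifies a decay parameter \(\lambda\):

```python
import numpy as np

rng = np.random.default_rng(0)

def perturb_numeric(x, sigma):
    """Add i.i.d. Gaussian noise N(0, sigma^2) to a numerical column."""
    return x + rng.normal(0.0, sigma, size=len(x))

def perturb_nominal(x, categories, xi):
    """With probability xi, replace each value by a uniformly sampled
    *other* category."""
    x = x.copy()
    flip = rng.random(len(x)) < xi
    for i in np.flatnonzero(flip):
        others = [c for c in categories if c != x[i]]
        x[i] = rng.choice(others)
    return x

def perturb_ordinal(x, n_levels, xi, lam):
    """With probability xi, move to another level, favoring near
    neighbours via an (assumed) exponential distance-decay kernel."""
    x = x.copy()
    levels = np.arange(n_levels)
    flip = rng.random(len(x)) < xi
    for i in np.flatnonzero(flip):
        w = np.exp(-lam * np.abs(levels - x[i]))
        w[x[i]] = 0.0  # must move to a different level
        x[i] = rng.choice(levels, p=w / w.sum())
    return x
```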
Key hyperparameters:
- \(d\): number of features to perturb
- \(k\): number of copies per original sample (preserving the distribution and class imbalance)
- \(M\): number of perturbed validation sets
- \(\mathcal{A}\): aggregation function (Q1, i.e., the 25th percentile, in the experiments)
Computational complexity: \(\mathcal{O}(Q \cdot M \cdot k \cdot n \cdot d)\); no model retraining required.
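Steps 1–5 above can be sketched as a single selection loop; `pvf_select` and its `perturb` and `score` callables are hypothetical placeholders for the paper's perturbation mechanism and whichever metric (IE, F1, ...) the user plugs in:

```python
import numpy as np

def pvf_select(models, X_val, y_val, perturb, score, M=20, k=3, q=25, seed=0):
    """Perturbation Validation Framework: pick the model whose score is
    best after aggregating over M perturbed copies of the validation set.

    models  : dict name -> fitted model exposing .predict(X)
    perturb : callable (X, rng) -> perturbed feature matrix (labels untouched)
    score   : callable (y_true, y_pred) -> float, higher is better
    k       : copies of each validation sample, preserving class imbalance
    q       : percentile used as the aggregation function A (25 -> Q1)
    """
    rng = np.random.default_rng(seed)
    X_rep = np.repeat(X_val, k, axis=0)                 # step 2: k*n samples per set
    y_rep = np.repeat(y_val, k)
    perturbed = [perturb(X_rep, rng) for _ in range(M)]  # M perturbed val sets
    agg = {}
    for name, f in models.items():                      # step 3: every model, every set
        scores = [score(y_rep, f.predict(Xp)) for Xp in perturbed]
        agg[name] = float(np.percentile(scores, q))     # step 4: aggregate with A
    best = max(agg, key=agg.get)                        # step 5: argmax
    return best, agg

# toy usage: constant-positive vs. constant-negative "models" on an imbalanced set
class Const:
    def __init__(self, c): self.c = c
    def predict(self, X): return np.full(len(X), self.c)

X = np.random.default_rng(1).normal(size=(10, 3))
y = np.array([1] * 8 + [0] * 2)
accuracy = lambda yt, yp: float(np.mean(yt == yp))
best, agg = pvf_select({"pos": Const(1), "neg": Const(0)}, X, y,
                       perturb=lambda X, rng: X + rng.normal(0, 0.1, X.shape),
                       score=accuracy)
# best == "pos": the constant-positive model matches 8/10 labels under any perturbation
```

Because no model is retrained, the loop's cost is exactly the \(\mathcal{O}(Q \cdot M \cdot k \cdot n \cdot d)\) evaluation cost stated above.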
### Why Not Perturb Labels in the Validation Set?
Label perturbation is explicitly excluded: flipping labels disproportionately penalizes stronger models (as their previously correct predictions are flipped) while minimally affecting weaker models, compressing or even reversing performance differences. On highly imbalanced datasets, even flipping a small number of labels can drastically alter precision and recall.
### Theoretical Guarantees

The paper's appendix provides a complete theoretical analysis of PVF:
- Proposition B.1: perturbed scores exhibit an i.i.d. structure
- Propositions B.2–B.4: PVF scores converge as \(M\), \(k\), and \(n \to \infty\), respectively
- Proposition B.5: uniform convergence guarantee
- Proposition B.6: selection consistency; PVF asymptotically selects the optimal model
- Proposition B.7: PVF is essentially an optimized selection toward a user-specified property \(\Phi_{A,K}\)
## Key Experimental Results

### Main Results: Synthetic Data
Setup: 2 informative features + 3 noise features, 10 feature combinations → 10 candidate logistic regression models, 5,000 repetitions.
| Configuration | Sample size \(n\) | Class separation \(\mu\) | Perturbation noise \(\sigma\) | Total combinations |
|---|---|---|---|---|
| Sample size | 50, 100 | — | — | 2 |
| Separation | — | 0.1–2.9 (step 0.2) | — | 15 |
| Perturbation noise | — | — | \(10^{-6}\)–\(0.1\) | 6 |
| Total | — | — | — | 180 (× 5,000 repetitions) |
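A rough sketch of this Rashomon setup (the exact generative parameters are not given in this summary; \(n = 400\) and \(\mu = 1.0\) below are assumed for illustration): fitting logistic regression on each 3-of-5 feature subset of 2 informative + 3 noise features yields 10 near-tied candidates:

```python
import numpy as np
from itertools import combinations
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n, mu = 400, 1.0                 # sample size and class separation (assumed values)
y = rng.integers(0, 2, n)
# 2 informative features (class-1 mean shifted by mu) + 3 pure-noise features
X = np.c_[rng.normal(mu * y[:, None], 1.0, (n, 2)),
          rng.normal(0.0, 1.0, (n, 3))]

Xtr, Xva, ytr, yva = train_test_split(X, y, test_size=0.5, random_state=0)

# 10 candidate models: logistic regression on each 3-of-5 feature subset
accs = {}
for subset in combinations(range(5), 3):
    f = LogisticRegression().fit(Xtr[:, subset], ytr)
    accs[subset] = f.score(Xva[:, subset], yva)
# subsets that include both informative features (0 and 1) score similarly,
# illustrating the model multiplicity that makes selection non-trivial
```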
Key findings (Figure 3):
- At \(\gamma=0.1\) and \(\gamma=0.3\), PVF outperforms traditional methods in approximately 90% of configurations
- The advantage shrinks as \(\gamma\) increases but remains consistent
- PVF also consistently outperforms traditional methods under F1/accuracy metrics
### Main Results: Real Clinical Data

Datasets:
- Cervical cancer dataset (808 samples, 34 features): optimal \(\sigma=0.01\) (IE) / \(\sigma=10^{-6}\) (F1)
- Breast cancer dataset (569 samples, 30 features): optimal \(\sigma=0.2\)–\(0.3\)
| Dataset | \(\gamma\) | PVF Win Rate | Traditional Win Rate | Tie |
|---|---|---|---|---|
| Cervical | 0.1 | 60.0% | 26.7% | 13.3% |
| Cervical | 0.3–0.9 | 43.3–46.7% | 33.3–36.7% | ~20% |
| Cervical | F1 | 50.0% | 33.3% | 16.7% |
| Breast | 0.1 | 52.0% | 20.0% | 28.0% |
| Breast | 0.3 | 48.0% | 20.0% | 32.0% |
| Breast | F1 | 52.0% | 16.0% | 32.0% |
### Ablation Study: \(\sigma\) Sensitivity Analysis
| Scenario | Optimal \(\sigma\) Range | Key Pattern |
|---|---|---|
| Low separation (\(\mu \leq 0.9\)) | \(\sigma \leq 10^{-3}\) | Small perturbations yield stable positive gains |
| Medium separation (\(\mu\) 1.1–1.9) | Small \(\sigma\) or \(\sigma=0.1\) | Moderate perturbations reduce the advantage; largest perturbation recovers it |
| High separation (\(\mu \geq 2.1\)) | \(\sigma=0.1\) | Large perturbations consistently amplify PVF advantage |
| Cervical cancer (real) | \(\sigma \approx 0.01\) | Small \(\sigma\) most effective |
| Breast cancer (real) | \(\sigma \approx 0.2\)–\(0.3\) | Larger \(\sigma\) required |
Key findings:
- \(\sigma\) is PVF's most critical hyperparameter; no universally optimal value exists, so dataset-specific tuning is necessary
- \(\sigma=0.01\) is a reasonable empirical starting point
- PVF's advantage is most pronounced when \(\gamma\) is small (i.e., intervention capacity is highly constrained)
- PVF requires no model retraining, keeping computational overhead manageable
## Highlights & Insights
- Novelty of IE: For the first time, intervention capacity constraints are explicitly incorporated into an evaluation metric; the closed-form formula is elegant and interpretable, directly linking the precision–recall tradeoff to clinical resource limitations.
- Flexibility of PVF: Compatible with arbitrary evaluation metrics (IE, F1, accuracy, etc.), composable with cross-validation, and requires no model retraining.
- Theoretical completeness: A complete proof chain for PVF convergence and selection consistency is provided (Propositions B.1–B.7).
- Rigorous experimental design: Synthetic experiments involve \(180 \times 5{,}000 = 900{,}000\) repetitions, thoroughly exploring the hyperparameter space.
- Clinical orientation: Emphasis on interpretable single-model deployment rather than black-box ensembles.
## Limitations & Future Work

- \(\sigma\) tuning requires prior knowledge: The optimal perturbation noise scale varies by dataset, necessitating domain expert input or additional hyperparameter search.
- Validation limited to binary classification: The closed-form IE formula is derived for binary classification; extension to multi-class settings remains incomplete.
- Limited scale of real-data experiments: Validation is conducted on only 2 public datasets, insufficient to cover the diversity of clinical scenarios.
- Fairness constraints not considered: IE does not incorporate fairness factors across different subgroups.
- Comparison with stronger baselines absent: Methods such as nested CV and Bayesian model selection are not included.
- Area classification is debatable: This work is more aligned with "clinical ML evaluation methodology" than traditional "medical imaging" research.
## Related Work & Insights
- Rashomon Effect: Breiman (2001) introduced the "two cultures" perspective; Rudin et al. (2024) advocated leveraging model diversity.
- Perturbation robustness: Mutation Validation (Zhang et al. 2023) injects noise into training labels; PVF differs by applying perturbations exclusively to validation set features.
- Interpretable clinical models: FIGS (Tan et al. 2022) and sparse logistic regression focus on model-level interpretability, whereas PVF targets the reliability of the selection process.
- Cross-validation improvements: Nested CV (Wainer & Cawley 2021) remains unstable on small data; PVF can complement such approaches.
## Rating
- Novelty: ⭐⭐⭐⭐ — The combination of IE and PVF offers a genuinely novel perspective, unifying resource constraints with robustness evaluation.
- Experimental Thoroughness: ⭐⭐⭐⭐ — Synthetic experiments are exceptionally thorough (900K repetitions); real-data experiments are limited in scale.
- Writing Quality: ⭐⭐⭐⭐ — Mathematical derivations are clear and rigorous, with complete theoretical proofs; however, the balance between main text and appendix is uneven, with substantial content relegated to the appendix.
- Value: ⭐⭐⭐⭐ — Directly applicable to resource-constrained clinical deployment scenarios; the \(\sigma\) tuning requirement limits plug-and-play usability.