Enhancing Statistical Validity and Power in Hybrid Controlled Trials: A Randomization Inference Approach with Conformal Selective Borrowing¶

Conference: ICML 2025
arXiv: 2410.11713
Code: None
Area: Medical Imaging
Keywords: Hybrid controlled trials, randomization inference, conformal inference, selective borrowing, finite-sample inference

TL;DR¶

The paper proposes an inference framework for hybrid controlled trials that combines the Fisher Randomization Test (FRT) with Conformal Selective Borrowing (CSB). The FRT provides finite-sample exact, model-free Type I error control, while CSB uses an adaptive, MSE-minimizing threshold to decide which external controls to borrow, raising statistical power without giving up that control.

Background & Motivation¶

Hybrid Controlled Trials: Integrating external controls (EC) into RCTs to enhance statistical power, applicable to sample-constrained scenarios such as rare diseases.
Limitations of Prior Work:
- Bayesian methods (power prior, commensurate prior): may inflate Type I error rates.
- Frequentist methods (propensity score weighting, doubly robust estimation): rely on large-sample asymptotic theory and are unreliable in small samples.
- When the RCT sample size is small (precisely the scenario where borrowing EC is most needed), models are easily misspecified and asymptotic approximations break down.
Unobserved Confounding: ECs are non-randomized, so hidden bias may remain even after adjusting for observed confounders.
Key Challenge: Borrowing ECs reduces variance but introduces bias; not borrowing leads to insufficient power.

Key Insight: Hybrid controlled trials offer two major advantages to exploit: (1) randomization within the RCT guarantees Type I error control, and (2) the randomized control group can be used to assess bias in the EC.

Method¶

Overall Architecture¶

Three-step Progressive Approach:

Fisher Randomization Test (FRT): Uses the randomization of the RCT as the sole source of randomness, providing finite-sample exact p-values for any test statistic.
Conformal Selective Borrowing (CSB): Tests each EC individually for exchangeability with the randomized controls using conformal p-values, and borrows only the ECs that are not rejected.
Adaptive Thresholding: Minimizes the MSE of the CSB estimator to select the optimal selection threshold \(\gamma\).

Key Designs¶

1. Core Guarantee of FRT

Under the Fisher null hypothesis \(H_0: Y_i(0) = Y_i(1), \forall i \in \mathcal{R}\):
- All potential outcomes can be completely imputed: \(Y_i^{\text{imp}}(0) = Y_i^{\text{imp}}(1) = Y_i\).
- A reference distribution is generated by resampling the RCT treatment assignment \(\mathbf{A}^*\).
- Crucial: The EC assignment \(A_i \equiv 0\) remains fixed during resampling (following the "analyze-as-randomized" principle).

\[p^{\text{FRT}} = \mathbb{P}_{\mathbf{A}^*}\{T(\mathbf{A}^*) \ge T(\mathbf{A})\}\]

Theorem 2.3: Under \(H_0\), \(\mathbb{P}_{\mathbf{A}}(p^{\text{FRT}} \le \alpha) \le \alpha\) holds for any test statistic, requiring neither correct model specification nor a no-unobserved-confounding assumption.
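The resampling scheme described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function name `frt_pvalue`, the absolute difference-in-means statistic, the assumption of a completely randomized design (so that permuting labels reproduces the assignment mechanism), and the add-one Monte Carlo correction are all choices made for this sketch.

```python
import numpy as np

def frt_pvalue(y_rct, a_rct, y_ec, n_draws=5000, seed=0):
    """Monte Carlo Fisher randomization test with external controls held fixed.

    Hypothetical helper: assumes complete randomization within the RCT.
    """
    rng = np.random.default_rng(seed)
    y_rct = np.asarray(y_rct, dtype=float)
    a_rct = np.asarray(a_rct, dtype=int)
    y_ec = np.asarray(y_ec, dtype=float)

    def stat(a):
        # Illustrative statistic: |treated mean - pooled control mean|.
        # External controls always sit on the control side (A_i = 0).
        treated = y_rct[a == 1]
        controls = np.concatenate([y_rct[a == 0], y_ec])
        return abs(treated.mean() - controls.mean())

    t_obs = stat(a_rct)
    # Under the sharp null the outcomes are fixed; only RCT labels are re-drawn.
    t_ref = np.array([stat(rng.permutation(a_rct)) for _ in range(n_draws)])
    # Add-one correction (an assumption of this sketch) keeps the
    # Monte Carlo p-value strictly positive and valid.
    return (1 + np.sum(t_ref >= t_obs)) / (n_draws + 1)
```

Any other statistic, including a selective-borrowing estimator, could replace `stat`, provided it is recomputed from scratch for each re-drawn assignment.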

2. CSB Estimator

Selected EC set: \(\hat{\mathcal{E}}(\gamma) = \{j \in \mathcal{E}: p_j^* > \gamma\}\)

\[\hat{\tau}_\gamma = \frac{1}{n_\mathcal{R}} \sum_{i=1}^n \left[S_i \hat{\mu}_{1,\mathcal{R}}(X_i) + \frac{S_i A_i}{\hat{e}(X_i)}\{Y_i - \hat{\mu}_{1,\mathcal{R}}(X_i)\} - S_i \hat{\mu}_{0,\mathcal{R}+\hat{\mathcal{E}}(\gamma)}(X_i) - V_i\{Y_i - \hat{\mu}_{0,\mathcal{R}+\hat{\mathcal{E}}(\gamma)}(X_i)\}\right]\]

\(\gamma=1\): No Borrowing (NB), degenerating to pure RCT estimation.
\(\gamma=0\): Full Borrowing (FB), using all ECs.

3. Conformal p-values

Split Conformal: The randomized control group is split into a training set and a calibration set \(\mathcal{C}_1\); a prediction model is fit on the training set, and its residuals serve as non-conformity scores \(s_i\).
\(p_j^{\text{split}} = \frac{\sum_{i \in \mathcal{C}_1} \mathbb{I}(s_i \ge s_j) + 1}{|\mathcal{C}_1| + 1}\)
Proposition 3.1: If EC \(j\) is exchangeable with the control group, then \(\mathbb{P}(p_j^{\text{split}} \le \gamma) \le \gamma\).
CV+ p-values: K-fold cross-validation version, fully utilizing data at the cost of a slightly weaker guarantee (\(\le 2\gamma + O(1/K)\)).
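The split conformal p-value and the selection rule \(\hat{\mathcal{E}}(\gamma)\) can be illustrated as follows. This is a sketch under stated assumptions: the helper names `split_conformal_pvalues` and `select_ecs`, the 50/50 split, and the least-squares linear outcome model are illustrative choices, and any other prediction model could take the linear fit's place.

```python
import numpy as np

def _design(x):
    # Add an intercept column; works for 1-D or 2-D covariates.
    x = np.asarray(x, dtype=float)
    return np.column_stack([np.ones(len(x)), x])

def split_conformal_pvalues(x_rc, y_rc, x_ec, y_ec, seed=0):
    """Split conformal p-values for each external control (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    x_rc, y_rc = np.asarray(x_rc, float), np.asarray(y_rc, float)
    x_ec, y_ec = np.asarray(x_ec, float), np.asarray(y_ec, float)

    # Randomly split the randomized controls into training and calibration halves.
    idx = rng.permutation(len(y_rc))
    train, calib = idx[: len(idx) // 2], idx[len(idx) // 2:]

    # Fit the outcome model on the training half only.
    beta, *_ = np.linalg.lstsq(_design(x_rc[train]), y_rc[train], rcond=None)

    # Absolute residuals as non-conformity scores.
    s_cal = np.abs(y_rc[calib] - _design(x_rc[calib]) @ beta)
    s_ec = np.abs(y_ec - _design(x_ec) @ beta)

    # p_j = (#{i in C_1 : s_i >= s_j} + 1) / (|C_1| + 1)
    counts = (s_cal[None, :] >= s_ec[:, None]).sum(axis=1)
    return (counts + 1) / (len(calib) + 1)

def select_ecs(pvals, gamma):
    # Indices of external controls with p_j > gamma (the borrowed set).
    return np.flatnonzero(np.asarray(pvals) > gamma)
```

Because every p-value lies in \((0, 1]\), `select_ecs(p, 0.0)` returns all ECs (full borrowing) and `select_ecs(p, 1.0)` returns none (no borrowing), matching the two endpoints of \(\gamma\).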

4. Adaptive Thresholding

Select \(\hat{\gamma}\) by minimizing \(\widehat{\text{MSE}}(\gamma) = (\hat{\tau}_\gamma - \hat{\tau}_1)^2 - \hat{\mathbb{V}}(\hat{\tau}_\gamma - \hat{\tau}_1) + \hat{\mathbb{V}}(\hat{\tau}_\gamma)\)

The unknown true effect \(\tau\) is approximated by the consistent NB estimator \(\hat{\tau}_1\).
The variance terms are estimated via the bootstrap.
Theorems 3.4/3.5 provide non-asymptotic excess risk bounds for the selected threshold.
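The threshold search can be sketched as a small grid minimization of \(\widehat{\text{MSE}}(\gamma)\). This sketch assumes the point estimates and bootstrap replicates of \(\hat{\tau}_\gamma\) have already been computed for each candidate threshold; the function name `select_gamma` and its array layout are hypothetical.

```python
import numpy as np

def select_gamma(gammas, tau_hat, tau_boot):
    """Pick the threshold minimizing the estimated MSE (hypothetical helper).

    gammas   : (G,) candidate thresholds; must contain 1.0 (no borrowing)
    tau_hat  : (G,) point estimates, one per threshold
    tau_boot : (B, G) bootstrap replicates of the same estimators
    """
    gammas = np.asarray(gammas, dtype=float)
    tau_hat = np.asarray(tau_hat, dtype=float)
    tau_boot = np.asarray(tau_boot, dtype=float)
    nb = int(np.flatnonzero(gammas == 1.0)[0])  # index of the NB estimator

    # Squared distance to the NB estimate, debiased by its own variance.
    diff = tau_hat - tau_hat[nb]
    var_diff = (tau_boot - tau_boot[:, [nb]]).var(axis=0, ddof=1)
    # Variance of each candidate estimator.
    var_tau = tau_boot.var(axis=0, ddof=1)

    mse = diff**2 - var_diff + var_tau
    best = int(np.argmin(mse))
    return gammas[best], mse
```

At \(\gamma = 1\) the first two terms vanish, so the criterion reduces to the bootstrap variance of the NB estimator. Borrowing is therefore preferred only when the estimated variance reduction outweighs the estimated squared bias.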

Loss & Training¶

Outcome model \(\mu_a(x)\): can use any ML method (linear, random forest, etc.).
Propensity score \(e(x)\): known from RCT design or via logistic regression.
Conformal score function: absolute residual \(|Y_i - \hat{f}(X_i)|\) or conformal quantile regression.
\(B=5000\) Monte Carlo replications to approximate FRT p-values.
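
As a rough guide to the precision this buys, assume the Monte Carlo p-value is a simple proportion over \(B\) independent draws (a standard binomial approximation, not a result quoted from the paper). Its standard error near the decision boundary \(p = 0.05\) is then

\[\text{SE}(\hat{p}) = \sqrt{\frac{p(1-p)}{B}} = \sqrt{\frac{0.05 \times 0.95}{5000}} \approx 0.0031,\]

so p-values within a few thousandths of \(\alpha\) remain sensitive to Monte Carlo noise.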

Key Experimental Results¶

Main Results¶

Simulation setup: \((n_1, n_0, n_\mathcal{E}) = (50, 25, 50)\), small-sample RCT + 50% biased EC, bias \(b=0,1,...,8\).

Bias b	NB MSE	FB MSE	CSB MSE	NB Power	FB Power	CSB Power
0	Baseline	-42%	-20%	Baseline	+46%	+45%
1-2	Baseline	+454%	+1% to +18%	Baseline	-51%	-7% to -20%
5-8	Baseline	+200%	-13% to -16%	Baseline	-30%	+13% to +36%

Type I Error Rate: When paired with the FRT, all three methods (NB/FB/CSB) control the Type I error rate at the nominal level \(\alpha=0.05\) across all bias levels.

Ablation Study¶

When the EC sample size increases to \(n_\mathcal{E}=300\), the power gain of CSB becomes even larger.
Selection performance of CSB: effectively screens out most biased ECs, while also partially filtering out unbiased ECs that are insufficiently similar to the randomized controls.

Key Findings¶

Real-world Data: CALGB 9633 lung cancer trial + NCDB external database.

Method	Estimate	SE	Asymptotic p-value	Exact p-value	Borrowed EC Count
NB (AIPW)	0.142	0.074	0.055	0.051	0
FB	0.241	0.061	<0.001	0.031	335
CSB	0.138	0.058	0.018	0.046	264

CSB borrows 264 of the 335 ECs and reduces the SE from 0.074 (NB) to 0.058, so its exact p-value (0.046) falls below 0.05, whereas the NB exact p-value (0.051) does not.
FB may overestimate the treatment effect (0.241 vs. 0.142 for NB and 0.138 for CSB).

Highlights & Insights¶

Finite-Sample Exactness: FRT guarantees strict Type I error control under any sample size, any model (even misspecified ones), and in the presence of unobserved confounding.
Model-free & Distribution-free: Conformal p-values do not rely on asymptotic theory, allowing for the flexible use of black-box ML models.
Theoretical Elegance: Resolves the three core issues of hybrid controlled trials (Type I error control, bias detection, and threshold selection) progressively using three distinct inference tools (FRT, conformal inference, and MSE optimization).
Post-selection Validity: Allows \(\hat{\mathcal{E}}(\gamma)\) to vary with FRT resampling, properly incorporating selection uncertainty.
Non-asymptotic Theory: Both the excess risk bounds and MSE estimation error bounds are non-asymptotic.

Limitations & Future Work¶

No-free-lunch: When the bias is non-negligible but difficult to detect (\(b=2,3,4\)), CSB may yield lower power than NB—an inherent limitation of all selective borrowing methods.
The Fisher null hypothesis is a sharp null hypothesis (treatment effect is zero for all individuals); extending this to the weak null hypothesis (zero average treatment effect) requires further development.
The power of conformal testing is limited by the sample size of the randomized control group—the smaller the sample size, the harder it is to distinguish between biased and unbiased ECs.
Bonferroni correction is relatively conservative and does not exploit the correlation structure among ECs.
Computational cost: The CSB estimator must be re-computed within each Monte Carlo resampling step of the FRT.

Related Work & Insights¶

Angelopoulos & Bates (2023): Theoretical foundations of conformal prediction.
Li et al. (2023b): Doubly robust estimators in hybrid controlled trials; this work replaces asymptotic inference with FRT.
Gao et al. (2025): Adaptive Lasso selective borrowing (ALSB), for which CSB in this paper serves as a conformal alternative.
Fisher (1935): Foundational work on randomization inference.
Bates et al. (2023): Conformal anomaly detection.

Insight: The "finite-sample and distribution-free" property of conformal inference naturally complements the "finite-sample exactness" of randomization inference, offering a rigorous and practical statistical toolkit for small-sample clinical trials.

Rating¶

Novelty: ⭐⭐⭐⭐⭐ — First to combine randomization inference and conformal inference for hybrid controlled trials, providing a complete theoretical framework.
Experimental Thoroughness: ⭐⭐⭐⭐ — Simulation and real-world lung cancer data, though evaluation is limited to continuous outcomes.
Writing Quality: ⭐⭐⭐⭐⭐ — Logically rigorous, balancing theory and intuition with highly informative tables and figures.
Value: ⭐⭐⭐⭐⭐ — Directly addresses the inference problem of small-sample hybrid controlled trials, a major area of focus for the FDA.