Distributionally Robust Classification for Multi-Source Unsupervised Domain Adaptation¶

Conference: ICLR 2026
arXiv: 2601.21315
Code: N/A
Area: Other
Keywords: Distributionally robust optimization, unsupervised domain adaptation, multi-source domain adaptation, Wasserstein distance, pseudo-labels

TL;DR¶

This paper proposes a distributionally robust learning framework that jointly models uncertainty over both the target-domain covariate distribution and the conditional label distribution, achieving significant generalization improvements in UDA settings where target data is extremely scarce or spurious correlations exist in the source domain.

Background & Motivation¶

Unsupervised domain adaptation (UDA) assumes a distributional shift between training (source) and test (target) data, where only labeled source data and unlabeled target data are available. Existing methods fall into two main categories:

Distribution alignment methods (DANN, CDAN, MK-MMD): reduce domain discrepancy by aligning source and target distributions, but tend to align spurious features (e.g., background, color) when spurious correlations are present.

Pseudo-label methods (STAR, ATDOC): use source-trained models to generate pseudo-labels for the target domain, but label quality depends heavily on the initial model.

Both categories perform poorly in two practical scenarios:

- Scarce target data: alignment estimates are unreliable and pseudo-label noise is high.
- Spurious correlations: models rely on non-causal features (e.g., background, gender, color) that do not transfer to the target domain.

Existing DRO methods (e.g., GroupDRO) typically require group labels and do not exploit unlabeled target data. This paper aims to design a robust framework that simultaneously handles covariate shift and conditional distribution shift.

Method¶

Overall Architecture¶

The paper introduces a distributionally robust learning framework whose core is a two-level ambiguity set that jointly models:

1. Uncertainty over the target-domain input distribution (via a Wasserstein ball).
2. Uncertainty over the conditional label distribution (via a mixture of multi-source conditionals).

Although the framework is formulated for the multi-source setting, it also applies to single-source scenarios by simulating multiple pseudo-source domains through random subsampling.

Key Designs¶

1. Unifying the Single-Source Problem via a Multi-Source Framework¶

For a single-source dataset \(\mathbf{D}^{\text{sc}}\), \(K\) subsets \(\mathbf{D}^{(k)}\) are generated by random subsampling with replacement (\(K=10\), each of size \(N^{\text{sc}}/5\)). When the source distribution is a mixture of heterogeneous subpopulations, repeated subsampling increases the probability that certain subsets approximate individual subpopulations, thereby providing robustness to shifts in the mixture proportions.
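
The subsampling step can be sketched as follows. This is a minimal illustration using only the Python standard library; the function name `make_pseudo_sources` and its parameters are hypothetical and not taken from the paper. It draws \(K\) index lists with replacement, each of size \(N^{\text{sc}}/5\).

```python
import random

def make_pseudo_sources(n_source, k=10, divisor=5, seed=0):
    """Return k index lists, each sampled with replacement from range(n_source).

    Each list has n_source // divisor entries, mirroring the K = 10 subsets
    of size N^sc / 5 described in the text. Names are illustrative only.
    """
    rng = random.Random(seed)
    subset_size = max(1, n_source // divisor)
    return [
        [rng.randrange(n_source) for _ in range(subset_size)]
        for _ in range(k)
    ]

# Example: a source dataset with 1000 labeled examples.
subsets = make_pseudo_sources(n_source=1000)
print(len(subsets), len(subsets[0]))
```

Returning indices rather than copies of the data keeps the sketch independent of how the source dataset is stored.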

2. Ambiguity Set Definition¶

Given tolerance parameters \(\epsilon_1, \epsilon_2 \geq 0\) and a reference vector \(\bar{\beta} \in \Delta_{K-1}\), the ambiguity set is defined as:

\[\mathcal{Q} = \left\{ Q = (Q_X, Q_{Y|X}) \mid Q_{Y|X} = \sum_{k=1}^K \beta_k \hat{P}_{Y|X}^{(k)}, \ D_1(Q_X, \hat{P}_X^{\text{tg}}) \leq \epsilon_1, \ D_2(\beta, \bar{\beta}) \leq \epsilon_2 \right\}\]

where:

- \(\hat{P}_{Y|X}^{(k)}\): estimated conditional distribution from the \(k\)-th source subset.
- \(D_1\): infinite-order Wasserstein distance, controlling covariate distribution shift.
- \(D_2\): Euclidean distance, controlling deviation of the mixture weights.
- \(\epsilon_1\): robustness radius for the covariate distribution, particularly important when target data is scarce.
- \(\epsilon_2\): robustness radius for the conditional mixture weights.
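
To make the constraint on the mixture weights concrete, the sketch below checks whether a candidate \(\beta\) lies on the simplex \(\Delta_{K-1}\) and within Euclidean distance \(\epsilon_2\) of \(\bar{\beta}\). It covers only the \(D_2\) part of the ambiguity set; the Wasserstein constraint on \(Q_X\) is not checked here. The function name and the tolerance `tol` are assumptions made for illustration.

```python
import math

def in_weight_ambiguity_set(beta, beta_bar, eps2, tol=1e-9):
    """Check that beta is on the simplex and within distance eps2 of beta_bar."""
    on_simplex = all(b >= -tol for b in beta) and abs(sum(beta) - 1.0) <= tol
    dist = math.sqrt(sum((b - r) ** 2 for b, r in zip(beta, beta_bar)))
    return on_simplex and dist <= eps2 + tol

# Uniform reference weights over K = 4 sources (an illustrative choice).
beta_bar = [0.25, 0.25, 0.25, 0.25]
near = in_weight_ambiguity_set([0.4, 0.2, 0.2, 0.2], beta_bar, eps2=0.2)
far = in_weight_ambiguity_set([1.0, 0.0, 0.0, 0.0], beta_bar, eps2=0.2)
print(near, far)
```

Setting \(\epsilon_2 = 0\) pins the weights to \(\bar{\beta}\), while a large \(\epsilon_2\) lets the adversary place all mass on a single source conditional.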

3. Conditional Distribution Estimation¶

A classification model is first trained on the full source data to extract a feature map \(z: \mathcal{X} \to \mathcal{Z}\) (with the final classification layer removed). A linear logistic regression is then independently trained on each subset, with softmax outputs serving as probability estimates \(\hat{P}_{Y|X}^{(k)}\). The framework is compatible with existing UDA methods (CDAN, STAR), which can serve as the feature extractor before constructing the conditional estimates.
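
One way to write the per-subset estimate explicitly is as a softmax over linear scores of the features. The weight vectors \(w_{k,y}\) and biases \(b_{k,y}\) below are notation introduced here for illustration and may differ from the paper's:

\[\hat{p}_{Y|X}^{(k)}(y \mid x) = \frac{\exp\left(w_{k,y}^\top z(x) + b_{k,y}\right)}{\sum_{y'} \exp\left(w_{k,y'}^\top z(x) + b_{k,y'}\right)}\]

Under this reading, the feature map \(z\) is shared across all \(K\) subsets and only the linear heads differ.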

4. Tractable Surrogate Objective¶

Via Proposition 3.1, the minimax optimization problem is reformulated as a tractable upper-bound surrogate:

\[\sup_{\beta} \mathbb{E}_{\hat{P}_X^{\text{tg}}} \left[ \sup_{\|z' - z(X)\|_2 \leq \epsilon_1} \ell(f_Z^\theta(z'), y^\circ(\beta, X)) \right]\]

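where \(y^\circ(\beta, x) = \sum_{k=1}^K \beta_k \hat{p}_{Y|X}^{(k)}(\cdot|x)\) is the soft pseudo-label vector and \(f_Z^\theta\) is the classifier acting on features. Consistent with the ambiguity set, the outer supremum ranges over weights \(\beta \in \Delta_{K-1}\) with \(D_2(\beta, \bar{\beta}) \leq \epsilon_2\), and \(\theta\) is trained to minimize this worst-case value.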
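
The soft pseudo-label is a weighted average of the \(K\) per-source probability vectors. The sketch below computes it for one input and scores a prediction against it. Using cross-entropy for \(\ell\) is an assumption made here for illustration; the function names are hypothetical.

```python
import math

def soft_pseudo_label(beta, cond_probs):
    """Mix K per-source class-probability vectors using weights beta."""
    n_classes = len(cond_probs[0])
    return [
        sum(b * p[c] for b, p in zip(beta, cond_probs))
        for c in range(n_classes)
    ]

def soft_cross_entropy(pred_probs, target_probs, eps=1e-12):
    """Cross-entropy of predicted probabilities against a soft target."""
    return -sum(t * math.log(max(p, eps)) for p, t in zip(pred_probs, target_probs))

# Three source conditionals for one target input, two classes (toy numbers).
cond_probs = [[0.9, 0.1], [0.6, 0.4], [0.2, 0.8]]
beta = [0.5, 0.3, 0.2]
y_soft = soft_pseudo_label(beta, cond_probs)
loss = soft_cross_entropy([0.7, 0.3], y_soft)
print(y_soft, loss)
```

Because the mixture is linear in \(\beta\), a cross-entropy loss against it is also linear in \(\beta\), which keeps the inner maximization over the weights simple.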

Loss & Training¶

Three variables are updated alternately (Algorithm 1):

Update adversarial features \(z'\): with \(\theta\) and \(\beta\) fixed, projected gradient ascent is applied to the features, searching within the \(\epsilon_1\)-ball for perturbations that maximize the loss (analogous to adversarial training).
Update mixture weights \(\beta\): with \(\theta\) and \(z'\) fixed, exponentiated gradient ascent followed by projection onto the \(\epsilon_2\)-ball assigns higher weights to conditional estimates that incur larger losses.
Update model parameters \(\theta\): with \(z'\) and \(\beta\) fixed, standard gradient descent minimizes the loss.

The core intuition is that updating \(\beta\) constructs an adversarial mixture of conditional distributions, while updating \(\theta\) forces the classifier to be robust against such adversarial mixtures.
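
The two adversarial updates can be sketched with two small helpers. This is an illustrative reading of the steps above, not the paper's Algorithm 1: the step size `lr`, the function names, and the use of per-source losses as the gradient with respect to \(\beta\) (valid when the loss is linear in \(\beta\)) are assumptions. The \(\theta\) update is an ordinary gradient step and is omitted.

```python
import math

def project_to_l2_ball(z_adv, z, eps1):
    """Project z_adv onto the ball {z' : ||z' - z||_2 <= eps1}."""
    delta = [a - b for a, b in zip(z_adv, z)]
    norm = math.sqrt(sum(d * d for d in delta))
    if norm <= eps1 or norm == 0.0:
        return list(z_adv)
    scale = eps1 / norm
    return [b + scale * d for b, d in zip(z, delta)]

def update_beta(beta, per_source_loss, beta_bar, eps2, lr=1.0):
    """One exponentiated-gradient ascent step, then projection to the eps2-ball.

    Moving along the segment toward beta_bar keeps the result on the simplex,
    because it is a convex combination of two simplex points.
    """
    raw = [b * math.exp(lr * g) for b, g in zip(beta, per_source_loss)]
    total = sum(raw)
    new_beta = [r / total for r in raw]
    dist = math.sqrt(sum((n - r) ** 2 for n, r in zip(new_beta, beta_bar)))
    if dist > eps2:
        t = eps2 / dist
        new_beta = [r + t * (n - r) for n, r in zip(new_beta, beta_bar)]
    return new_beta

# Feature perturbation: an ascent step that overshoots is pulled back.
z_proj = project_to_l2_ball([1.6, 0.8], [1.0, 0.0], eps1=0.5)

# Weight update: the source with the largest loss gains weight.
beta_bar = [1 / 3, 1 / 3, 1 / 3]
beta_new = update_beta(beta_bar, [2.0, 0.5, 0.5], beta_bar, eps2=0.2)
print(z_proj, beta_new)
```

In a full loop these helpers would be called once per iteration, with the loss gradients supplied by whatever automatic differentiation framework trains the classifier.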

Key Experimental Results¶

Main Results¶

Experiment 1: Digit Recognition (MNIST/SVHN/USPS)

Method	SVHN→MNIST (100)	SVHN→MNIST (10)	MNIST→USPS (100)	USPS→MNIST (100)
ERM	59.6	-	63.4	60.4
DANN	66.0	61.2	82.0	74.8
CDAN	63.4	56.9	80.8	58.3
MCD	79.1	61.3	89.3	96.1
Ours (STAR)	94.4	91.3	95.6	97.3

With only 10 target samples per class (the "(10)" column), Ours (STAR) still reaches 91.3% on SVHN→MNIST, 30.0 points above the best listed baseline (MCD, 61.3%).

Experiment 2: Spurious Correlation Benchmarks (Waterbirds / CelebA / CMNIST)

Method	Waterbirds	CelebA	CMNIST
ERM	48.4	35.5	0.9
CORAL	50.9	31.7	1.7
MCD	59.0	30.7	1.9
GroupDRO (requires group labels)	61.4	63.0	3.4
Ours (ERM)	87.3	85.0	7.5

Relative to ERM, the gain is 38.9 percentage points on Waterbirds (48.4 → 87.3) and 49.5 percentage points on CelebA (35.5 → 85.0). The proposed method also outperforms GroupDRO on all three benchmarks without requiring group labels.

Ablation Study¶

Hyperparameter sensitivity: heatmaps of \(\epsilon_1\) and \(\epsilon_2\) show that performance is stable under moderate uncertainty levels (\(\epsilon_1 \in \{0.2, 0.4\}\), \(\epsilon_2 \geq 0.2\)), with a wide optimal plateau.
Extremely scarce target data: the effect of \(\epsilon_1\) becomes more pronounced as covariate distribution estimates grow less reliable; \(\epsilon_2\) can be set to larger values without destabilizing performance.
LODO-CV validation: the variant that does not rely on labeled target validation data yields slightly lower results but still outperforms all baselines.

Key Findings¶

Combining with CDAN yields a +29.1% improvement on SVHN→MNIST.
When the number of target samples decreases from 100 to 10, the proposed method degrades far less than baselines.
The covariate robustness radius \(\epsilon_1\) is critical under data scarcity, while the conditional mixture radius \(\epsilon_2\) exerts a more stable influence.

Highlights & Insights¶

Two-level uncertainty modeling: jointly accounting for uncertainty over both covariate and conditional distributions represents an important generalization beyond single-level DRO methods.
No group labels required: unlike GroupDRO, the proposed method requires no knowledge of group or subpopulation membership.
Plug-and-play compatibility: the framework integrates seamlessly with existing UDA methods such as CDAN and STAR, functioning as a post-hoc robustification module.
Unification of single- and multi-source settings: subsampling elegantly reformulates the single-source problem within the multi-source framework.

Limitations & Future Work¶

Experiments cover only visual benchmarks (digit recognition, spurious correlation tasks); validation on NLP or time-series data is absent.
A small amount of labeled target validation data is needed for hyperparameter selection, although LODO-CV provides an alternative.
Conditional distribution estimation depends on the quality of the pretrained feature extractor.
The number of subsets \(K=10\) is fixed; an adaptive selection strategy may yield further improvements.
Computational overhead arises from \(K\) independent logistic regressions and the alternating minimax optimization iterations.

Related Work & Insights¶

Maximin Effect (Meinshausen & Bühlmann, 2015): the direct inspiration for this work, which extends the idea from regression to classification settings.
GroupDRO (Sagawa et al., 2019): handles subpopulation shift but requires group labels.
Wasserstein DRO (Gao et al., 2024): provides the theoretical foundation for covariate perturbation.
The approach may inspire robust aggregation strategies for heterogeneous clients in federated learning.

Rating¶

Dimension	Score
Novelty	★★★★☆
Technical Depth	★★★★☆
Experimental Thoroughness	★★★★☆
Writing Quality	★★★★☆
Value	★★★★☆