Semi-supervised Deep Transfer for Regression without Domain Alignment
- Conference: ICCV 2025
- arXiv: 2509.05092
- Code: Available (see Appendix E.2)
- Area: Medical Imaging
- Keywords: Source-free domain adaptation, semi-supervised transfer learning, regression, EEG decoding, brain age prediction
TL;DR
This paper proposes CRAFT (Contradistinguisher-based Regularization Approach for Flexible Training), a semi-supervised transfer learning framework that requires neither source data nor domain alignment, specifically designed for regression tasks. CRAFT jointly optimizes a supervised loss and an unsupervised Contradistinguisher-based regularization term to substantially improve prediction performance under label-scarce conditions.
Background & Motivation
Deep learning models deployed in practice suffer from domain shift: models trained on a source domain perform poorly on shifted target data, a problem particularly pronounced in medical and neuroscientific applications. Conventional domain adaptation approaches face several practical challenges:
Source data unavailability: Medical data cannot be shared due to privacy regulations or prohibitive storage and computational costs.
Scarcity of target-domain labels: Annotation is expensive, leaving only a small number of labeled samples available.
Neglect of regression tasks: Most source-free domain adaptation (SF-DA) methods are designed for classification and rely on concepts such as class prototypes, making them ill-suited for continuous-valued outputs.
Among existing methods:
- CUDA (Contradistinguisher) is effective but requires source data and supports classification only.
- TASFAR addresses regression in the SF-UDA setting but is fully unsupervised and relies on uncertainty estimation.
- BBCN depends on class prototypes and does not generalize naturally to regression.
Method
Overall Architecture
Given a model \(\theta^s\) pre-trained on source data, the target dataset consists of a small labeled set \(\{(\mathbf{x}_i^t, y_i^t)\}_{i=1}^{N_l}\) and a large unlabeled set \(\{\mathbf{x}_i^t\}_{i=1}^{N_{ul}}\), where \(y \in \mathbb{R}\) is a continuous value. CRAFT initializes model parameters with \(\theta^s\) and adapts to the target domain via alternating two-step optimization: parameters are fixed to select pseudo-labels, and pseudo-labels are then fixed to update parameters.
Key Designs
- Semi-supervised objective: CRAFT combines a supervised loss with a CUDA-based unsupervised regularization term weighted by \(\alpha\): $$\mathcal{L}(\mathcal{D}^t, \theta) = \sum_{i=1}^{N_l} \log p(y_i^t \mid \mathbf{x}_i^t, \theta) + \alpha \sum_{i=1}^{N} \log q(\mathbf{x}_i^t, y_i^t \mid \theta).$$ The supervised term is a standard Gaussian log-likelihood (equivalent to MSE), and the unsupervised term is the CUDA joint distribution: $$\log q(\mathbf{x}^t, y^t \mid \theta) = \log \frac{p(y^t \mid \mathbf{x}^t, \theta)}{\sum_{i=1}^N p(y^t \mid \mathbf{x}_i^t, \theta)}\, p(y^t).$$ The denominator normalizes the model's total predicted probability for a given label, removing prediction bias inherited from the source domain, while \(p(y^t)\) introduces a target-domain label prior. The unsupervised term can be derived theoretically as a prior in MAP estimation (Appendix A.1).
- Pseudo-label selection for regression: In classification, pseudo-labels can be selected from a finite set by maximizing the joint distribution. Because the label space in regression is continuous, CRAFT instead discretizes the label range into small intervals and uses the interval midpoints as candidate pseudo-labels (see the sketch after this list): $$\tilde{y}_i^t = \arg\max_{y^t \in \mathcal{Y}} \frac{\mathcal{N}(y^t; f(\mathbf{x}_i^t; \theta), c)\, p(y^t)}{\sum_{l=1}^N \mathcal{N}(y^t; f(\mathbf{x}_l^t; \theta), c)}.$$ Importantly, discretization is used solely to make pseudo-label selection efficient; it does not constrain model outputs to discrete values. The label prior \(p(y)\) is estimated from data via a mixture model. This design avoids nested gradient descent, keeping optimization efficient while admitting informative priors.
- Parameter update (maximization step): With pseudo-labels fixed, three terms are jointly optimized: $$\theta^* = \arg\max_\theta\; -\sum_{i=1}^{N_l}\bigl(y_i^t - f(\mathbf{x}_i^t;\theta)\bigr)^2 - \alpha\left(\sum_{i=1}^{N}\bigl(\tilde{y}_i^t - f(\mathbf{x}_i^t;\theta)\bigr)^2 - \sum_{i=1}^{N}\log\sum_{l=1}^N \exp\bigl(-(\tilde{y}_i^t - f(\mathbf{x}_l^t;\theta))^2\bigr)\right).$$ Intuitively, the first term encourages agreement with the ground-truth labels; the second pulls predictions toward their pseudo-labels while its normalizer forces samples with different pseudo-labels to produce distinct predictions, yielding a better regression function; and because the parameters are initialized from the source model, the adapted model is implicitly regularized to remain close to it.
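The pseudo-label selection step vectorizes cleanly. Below is a minimal PyTorch sketch of it, working in log space for stability; the names `select_pseudo_labels`, `y_grid`, `log_prior`, and `c` are illustrative placeholders, not taken from the paper's released code.

```python
import torch

def select_pseudo_labels(preds, y_grid, log_prior, c):
    """Pick one candidate label per unlabeled sample by maximizing the
    joint score N(y; f(x_i), c) * p(y) / sum_l N(y; f(x_l), c).

    preds:     (N,) model predictions f(x_i) on the unlabeled batch
    y_grid:    (B,) bin midpoints discretizing the label range
    log_prior: (B,) log p(y) evaluated at the bin midpoints
    c:         scalar variance of the Gaussian conditional
    """
    # Log Gaussian kernel up to an additive constant; the constant is
    # shared by numerator and denominator and cancels, so we drop it.
    log_k = -(y_grid[None, :] - preds[:, None]) ** 2 / (2.0 * c)   # (N, B)
    # Denominator: log sum_l N(y; f(x_l), c), one value per candidate y.
    log_den = torch.logsumexp(log_k, dim=0, keepdim=True)          # (1, B)
    scores = log_k + log_prior[None, :] - log_den                  # (N, B)
    return y_grid[scores.argmax(dim=1)]                            # (N,)
```

Because the argmax runs over a fixed grid with no gradient flow, the model's own outputs remain continuous, matching the note above that discretization affects only pseudo-label selection.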
Loss & Training
- Adam optimizer (lr=1e-4), batch size 128 (EEG) / 4 (MRI).
- \(\alpha\) is selected via grid search over {0.01, 0.1, 1.0}; \(\alpha=0.1\) is consistently preferred across experiments.
- Each iteration: pseudo-labels for the current batch of unlabeled data are computed first, then the supervised and unsupervised terms are jointly optimized with the pseudo-labels held fixed (see the training-step sketch below).
- The checkpoint with the best validation performance is retained (EEG), or the final model is used (MRI, where the dataset is too small for hyperparameter search).
- The log-sum-exp trick is applied to avoid numerical instability.
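A hedged sketch of one full iteration, combining the bullets above. It reuses `select_pseudo_labels` from the earlier sketch; `craft_step` and its arguments are placeholder names, and the batch-level details (e.g., the normalizer summing over the current batch rather than all N samples) are assumptions rather than the authors' exact implementation.

```python
import torch

def craft_step(model, opt, x_lab, y_lab, x_unl, y_grid, log_prior, c, alpha=0.1):
    # Step 1: with parameters fixed, select pseudo-labels for the unlabeled batch.
    model.eval()
    with torch.no_grad():
        pseudo = select_pseudo_labels(model(x_unl).squeeze(-1), y_grid, log_prior, c)

    # Step 2: with pseudo-labels fixed, minimize the negated objective.
    model.train()
    pred_l = model(x_lab).squeeze(-1)
    pred_u = model(x_unl).squeeze(-1)
    sup = ((y_lab - pred_l) ** 2).sum()        # supervised MSE term
    fit = ((pseudo - pred_u) ** 2).sum()       # pull predictions toward pseudo-labels
    # Normalizer log sum_l exp(-(y_i - f(x_l))^2), computed via the
    # log-sum-exp trick to avoid numerical instability.
    contra = torch.logsumexp(-(pseudo[:, None] - pred_u[None, :]) ** 2, dim=1).sum()
    loss = sup + alpha * (fit - contra)

    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# opt = torch.optim.Adam(model.parameters(), lr=1e-4)  # settings reported in the paper
```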
Key Experimental Results
Main Results: Saccade Amplitude Prediction (EEG, 1% labels)
| Method | R ↑ | RMSE (pixels) ↓ |
|---|---|---|
| Naive Baseline | — | 149.12 ± 0.02 |
| TL (100% labels, upper bound) | 0.93 | 51.47 ± 0.63 |
| TL (1% labels) | 0.77 | 92.26 ± 1.66 |
| Progressive Mixup | 0.48 | 135.70 ± 1.25 |
| BBCN | 0.76 | 99.80 ± 3.35 |
| TASFAR | 0.76 | 86.41 ± 1.05 |
| DataFree | 0.80 | 87.64 ± 3.08 |
| CRAFT | 0.81 | 84.17 ± 3.95 |
CRAFT achieves a 9% RMSE improvement over supervised transfer learning (TL) and more than 4% over the best SF-SSDA baseline.
Ablation / Extension: Brain Age Prediction (MRI, 20% labels)
| Method | R ↑ | RMSE (years) ↓ |
|---|---|---|
| Naive Baseline | — | 7.91 ± 0.05 |
| TL (100% labels, upper bound) | 0.66 | 6.14 ± 0.03 |
| TL (20% labels) | 0.41 | 7.41 ± 0.21 |
| Progressive Mixup | 0.34 | 7.71 ± 0.14 |
| BBCN | 0.28 | 8.00 ± 0.15 |
| TASFAR | 0.42 | 7.47 ± 0.15 |
| DataFree | 0.50 | 7.36 ± 0.14 |
| CRAFT | 0.51 | 7.14 ± 0.11 |
CRAFT improves over TL by approximately 4% and over the state-of-the-art SF-SSDA method by more than 3%. On crowd counting and tumor size prediction, gains exceed 5% and 2%, respectively.
Key Findings
- CRAFT maintains the lowest RMSE across all proportions of unlabeled data; its advantage grows as the proportion of unlabeled data increases.
- The closest competing methods are TASFAR (pseudo-label approach) and DataFree (feature alignment approach).
- \(\alpha=0.1\) is consistently optimal across all experiments, suggesting that the unsupervised term provides moderate regularization rather than a dominant learning signal.
- Sampling bias experiment: when the label distribution of the training set is deliberately skewed (removing 80% of elderly samples), CRAFT effectively mitigates the bias by incorporating an unbiased prior \(p(y)\), yielding an RMSE improvement of approximately 5% (one way to estimate this prior is sketched after this list).
- Computational cost: training time is comparable to DataFree (EEG: 0.45 min/epoch vs. 0.55 min) and far below BBCN (2.71 min).
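The paper specifies only that \(p(y)\) is estimated from data with a mixture model. A plausible instantiation (an assumption, not the authors' code) fits scikit-learn's `GaussianMixture` to the available labels, or to an unbiased reference sample as in the sampling-bias experiment, and evaluates log-densities at the candidate bin midpoints; `n_components=3` is an arbitrary placeholder.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def estimate_log_prior(labels, y_grid, n_components=3):
    """Fit a 1-D Gaussian mixture to observed labels and return log p(y)
    evaluated at each candidate bin midpoint."""
    labels = np.asarray(labels, dtype=np.float64).reshape(-1, 1)
    gmm = GaussianMixture(n_components=n_components, random_state=0).fit(labels)
    return gmm.score_samples(np.asarray(y_grid, dtype=np.float64).reshape(-1, 1))
```

The returned log-densities plug directly into the `log_prior` argument of the pseudo-label sketch above.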
Highlights & Insights
- The problem is precisely formulated: the intersection of no source data, sparse labels, and regression represents a genuinely challenging real-world scenario in medicine and neuroscience that prior methods have largely overlooked.
- The extension of CUDA to regression is methodologically elegant: by discretizing the pseudo-label search space rather than the model outputs, the approach efficiently preserves continuous prediction capability.
- The theoretical grounding is solid: the unsupervised objective is derived as a maximum-entropy prior over model parameters (MAP estimation), providing motivation beyond heuristic regularization.
- The incorporation of the label prior \(p(y)\) enables the model to actively mitigate sampling bias in the training set, an important property for real-world data.
Limitations & Future Work
- The bin size for discretization remains a hyperparameter requiring manual tuning, albeit with guidance provided in the paper.
- Validation is limited to relatively small datasets (EEG ~12K samples; MRI ~188 samples); performance on larger datasets remains unknown.
- Applicability to high-dimensional output spaces (e.g., dense prediction or segmentation) is not explored.
- The Gaussian conditional distribution assumption may be inappropriate for multimodal or heavy-tailed regression problems.
- No comparison is made against recent parameter-efficient transfer methods such as LoRA.
- While \(\alpha=0.1\) is empirically stable, its optimal value may depend on the degree of domain shift.
Related Work & Insights
- CRAFT builds on the theoretical foundation of CUDA (Contradistinguisher), with two key extensions: (a) from UDA to SF-SSDA, and (b) from classification to regression.
- Unlike SHOT++, CRAFT requires neither feature alignment nor prototype-based clustering of pseudo-labels.
- The feature-alignment approach of DataFree/BUFR is complementary to CRAFT: rather than aligning intermediate representations, CRAFT bypasses alignment entirely and directly learns the joint distribution.
- The EEGNet-LSTM architecture introduced for saccade prediction substantially outperforms baselines (>16%), representing a contribution independent of CRAFT.
- The work provides important practical guidance for transfer learning in medical imaging and neuroscience.
Rating
- Novelty: ⭐⭐⭐⭐ The extension of CUDA to SF-SSDA regression carries genuine theoretical novelty.
- Experimental Thoroughness: ⭐⭐⭐⭐ Four datasets, computational complexity analysis, and sampling bias evaluation.
- Writing Quality: ⭐⭐⭐⭐ Theoretical derivations are clear and problem motivation is well-articulated.
- Value: ⭐⭐⭐⭐ Fills a gap in SF-SSDA for regression tasks with clearly defined practical applications.