
Semi-Supervised Regression with Heteroscedastic Pseudo-Labels

Conference: NeurIPS 2025 · arXiv: 2510.15266 · Code: GitHub
Area: Semi-Supervised Learning / Regression
Keywords: semi-supervised regression, pseudo-labels, heteroscedastic uncertainty, bilevel optimization, uncertainty estimation

TL;DR

This paper proposes an uncertainty-aware pseudo-label framework based on heteroscedastic modeling, which dynamically calibrates per-sample pseudo-label uncertainty via bilevel optimization to mitigate the negative impact of noisy pseudo-labels on regression models, achieving state-of-the-art performance on multiple SSR benchmarks.

Background & Motivation

Semi-supervised regression (SSR) faces a fundamentally different challenge from classification: regression outputs are continuous values, so unreliable pseudo-labels cannot be filtered out with confidence thresholds as in classification. Existing SSR methods primarily rely on consistency regularization (e.g., cyclic consistency in TNNR, uncertainty consistency in UCVME, rank consistency in RankUp), but consistency constraints alone are insufficient to handle noise in pseudo-labels.

The authors visualize the pseudo-label distribution generated by UCVME on the UTKFace dataset and observe that even with uncertainty consistency constraints, pseudo-label variance remains large, particularly for age groups with fewer samples.

The key insight is that pseudo-label noise is heteroscedastic, i.e., the degree of error varies across samples and is correlated with input features. This motivates a mechanism capable of dynamically assessing and adjusting the reliability of each pseudo-label.

However, naively training an auxiliary uncertainty-estimation network jointly end-to-end suffers from a fundamental flaw: the model cannot distinguish "hard but correct samples" (where the model prediction deviates substantially from a correct pseudo-label) from "easy but incorrect samples" (where the model prediction is close to the ground truth but far from an erroneous pseudo-label). Both cases yield high uncertainty estimates, yet the former should not be suppressed.

Method

Overall Architecture

The method comprises two networks:

  • Regression network \(f_\theta\): the primary model that predicts regression values.
  • Uncertainty learner \(g_\phi\): a lightweight MLP that dynamically estimates uncertainty for each pseudo-label.

Both are trained under a bilevel optimization framework: the inner level updates the regression network (using labeled data and uncertainty-weighted pseudo-labeled data), while the outer level updates the uncertainty learner (evaluating the generalization of the regression network on a separate batch of labeled data).

Key Designs

1. Heteroscedastic Pseudo-Label Modeling

Each unlabeled sample \(x_j^u\) is modeled with a heteroscedastic Gaussian pseudo-label:

\[\hat{y}_j = f_\theta(x_j^u) + \epsilon_j, \quad \epsilon_j \sim \mathcal{N}(0, \sigma_j^2)\]

The corresponding negative log-likelihood loss is:

\[\mathcal{L}_u = \sum_{x_j^u \in \mathcal{B}_u} \frac{1}{\sigma_j^2} (\hat{y}_j - f_\theta(x_j^u))^2 + \sum_{x_j^u \in \mathcal{B}_u} \log(\sigma_j^2)\]

When \(\sigma_j^2 = 1\), this reduces to standard MSE. Design Motivation: when a pseudo-label is inaccurate, the model can downweight its contribution by increasing \(\sigma_j^2\), while the \(\log(\sigma_j^2)\) term prevents all uncertainties from diverging to infinity.
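As a sanity check, the loss above fits in a few lines. The following is a minimal pure-Python sketch (not the paper's implementation), parameterizing each \(\sigma_j^2\) through its log-variance \(z_j = \log \sigma_j^2\) so positivity holds by construction:

```python
import math

def heteroscedastic_nll(pseudo_labels, preds, log_vars):
    """Heteroscedastic pseudo-label loss L_u over one batch.

    log_vars[j] = z_j = log(sigma_j^2); parameterizing the
    log-variance keeps sigma_j^2 positive with no extra constraint.
    """
    loss = 0.0
    for y_hat, f_x, z in zip(pseudo_labels, preds, log_vars):
        loss += (y_hat - f_x) ** 2 * math.exp(-z)  # residual / sigma_j^2
        loss += z                                  # log(sigma_j^2) penalty
    return loss

# With all sigma_j^2 = 1 (z_j = 0) the loss is the plain sum of
# squared errors, i.e. the standard MSE objective up to scaling.
assert heteroscedastic_nll([1.0, 2.0], [0.0, 2.0], [0.0, 0.0]) == 1.0
```

Note the trade-off the two terms encode: inflating \(z_j\) downweights a large residual, but the \(+z_j\) penalty makes that inflation pay off only when the residual is genuinely large.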

2. Bilevel Optimization Framework

The uncertainty learner \(g_\phi\) outputs the log-variance \(z_j = \log \sigma_j^2 = g_\phi(r_j, \hat{y}_j)\), where \(r_j\) is the regression model's prediction on strongly augmented inputs and \(\hat{y}_j\) is the pseudo-label from weakly augmented inputs (following FixMatch).
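As a concrete illustration of this interface, a tiny randomly initialized MLP mapping \((r_j, \hat{y}_j)\) to a log-variance might look like the sketch below; the hidden width, activation, and initialization are illustrative assumptions, not the paper's architecture:

```python
import math
import random

class UncertaintyLearner:
    """Minimal 2-16-1 MLP sketch of g_phi: maps (r_j, y_hat_j) to a
    log-variance z_j. Weights are random here; this shows only the
    interface, not a trained model."""

    def __init__(self, hidden=16, seed=0):
        rng = random.Random(seed)
        self.w1 = [[rng.gauss(0, 0.5) for _ in range(2)] for _ in range(hidden)]
        self.b1 = [0.0] * hidden
        self.w2 = [rng.gauss(0, 0.5) for _ in range(hidden)]
        self.b2 = 0.0

    def __call__(self, r_j, y_hat_j):
        # one tanh hidden layer, linear output head
        h = [math.tanh(w[0] * r_j + w[1] * y_hat_j + b)
             for w, b in zip(self.w1, self.b1)]
        return sum(wi * hi for wi, hi in zip(self.w2, h)) + self.b2  # z_j

g = UncertaintyLearner()
z = g(3.1, 2.8)        # z_j = log sigma_j^2 (unconstrained real value)
sigma2 = math.exp(z)   # sigma_j^2 recovered, always positive
```

Because \(g_\phi\) consumes only two scalars per sample, it stays far cheaper than the regression backbone, which is what makes the bilevel scheme below affordable.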

Inner-level optimization (updating \(\theta\)):

\[\theta^*(\phi) = \arg\min_\theta \mathcal{L}^{inner} := \mathcal{L}_l(\theta) + \lambda \mathcal{L}_u(\theta, \phi)\]

where \(\mathcal{L}_l = \sum_{x_i^l \in \mathcal{B}_l} (y_i - f_\theta(x_i^l))^2\) is the supervised loss on the labeled batch and \(\mathcal{L}_u\) applies the uncertainty weighting defined above.

Outer-level optimization (updating \(\phi\)):

\[\phi^* = \arg\min_\phi \mathcal{L}^{outer} := \sum_{x_k^l \in \hat{\mathcal{B}}_l} (y_k - f_{\theta^*(\phi)}(x_k^l))^2\]

The outer level evaluates the updated regression network on a separate batch of labeled data, ensuring that the uncertainty estimates promote generalization.

Design Motivation: The outer objective "audits" the effect of the inner update — if \(g_\phi\) assigns low uncertainty to erroneous pseudo-labels causing the inner level to overfit, the outer loss increases and corrects \(g_\phi\) accordingly.

3. Efficient Approximation and Training

Full bilevel optimization requires second-order gradient unrolling over the entire network, which is computationally expensive. The key approximation proposed here is to unroll gradients only through the regression head (a single fully connected layer), since \(\phi\) influences the loss primarily through this head. The practical overhead is only ~9 ms/iter and ~17 MB of memory.

Loss & Training

Each iteration proceeds in three steps:

  1. Sample a labeled batch \(\mathcal{B}_l\), an unlabeled batch \(\mathcal{B}_u\), and a second labeled batch \(\hat{\mathcal{B}}_l\).
  2. Compute \(\mathcal{L}^{inner}\) and update \(\theta\) via gradient descent.
  3. Compute \(\mathcal{L}^{outer}\) and update \(\phi\) via gradient descent.
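The three steps can be exercised on a toy 1-D problem. The sketch below is a deliberately simplified stand-in, not the paper's algorithm: \(f_\theta(x) = \theta x\) is a scalar model, \(g_\phi\) collapses to one shared log-variance \(\phi\), and central finite differences replace both the inner gradient and the second-order unrolling of the outer step:

```python
import math

def inner_loss(theta, phi, labeled, unlabeled, lam=1.0):
    # L_inner = L_l + lam * L_u for the scalar model f(x) = theta * x
    l_sup = sum((y - theta * x) ** 2 for x, y in labeled)
    l_unsup = sum((y_hat - theta * x) ** 2 * math.exp(-phi) + phi
                  for x, y_hat in unlabeled)
    return l_sup + lam * l_unsup

def outer_loss(theta, held_out):
    # L_outer: squared error on a held-out labeled batch
    return sum((y - theta * x) ** 2 for x, y in held_out)

def step(theta, phi, labeled, unlabeled, held_out,
         lr_theta=0.01, lr_phi=0.01, eps=1e-4):
    # Step 2: inner update of theta (numerical gradient descent)
    g_theta = (inner_loss(theta + eps, phi, labeled, unlabeled)
               - inner_loss(theta - eps, phi, labeled, unlabeled)) / (2 * eps)
    theta_new = theta - lr_theta * g_theta
    # Step 3: outer update of phi through the one-step unrolled theta;
    # finite differences stand in for the second-order unrolling
    def unrolled(p):
        g = (inner_loss(theta + eps, p, labeled, unlabeled)
             - inner_loss(theta - eps, p, labeled, unlabeled)) / (2 * eps)
        return outer_loss(theta - lr_theta * g, held_out)
    g_phi = (unrolled(phi + eps) - unrolled(phi - eps)) / (2 * eps)
    return theta_new, phi - lr_phi * g_phi
```

Starting from \(\theta = 0\) with labeled data on the line \(y = 2x\) and a deliberately wrong pseudo-label, a few hundred iterations drive \(\theta\) toward the labeled-data solution while the outer updates adjust \(\phi\) to control the bad pseudo-label's weight.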

Theoretical analysis (Theorem 1) shows that optimizing \(\phi\) is equivalent to maximizing the alignment between the inner and outer gradients:

\[\min_\phi -\langle \nabla_\theta \mathcal{L}^{inner}(\theta, \phi), \nabla_\theta \mathcal{L}^{outer}(\theta) \rangle\]

This ensures that the gradient direction of pseudo-label-assisted training is consistent with that of purely supervised training on labeled data.
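A small numerical check of this view: writing the inner gradient as the supervised gradient plus the pseudo-label gradient downweighted by \(1/\sigma^2 = e^{-\phi}\), raising \(\phi\) lowers the objective exactly when the pseudo-label gradient conflicts with the held-out (outer) gradient. The helper names below are illustrative, not from the paper:

```python
import math

def inner_grad(grad_sup, grad_pseudo, phi, lam=1.0):
    # d L_inner / d theta: supervised gradient plus the pseudo-label
    # gradient scaled by lam / sigma^2 = lam * exp(-phi)
    w = lam * math.exp(-phi)
    return [gs + w * gp for gs, gp in zip(grad_sup, grad_pseudo)]

def misalignment(grad_sup, grad_pseudo, grad_outer, phi):
    # Theorem 1's objective: -<grad L_inner, grad L_outer>
    gi = inner_grad(grad_sup, grad_pseudo, phi)
    return -sum(a * b for a, b in zip(gi, grad_outer))

# A pseudo-label gradient that opposes the outer gradient: increasing
# phi (i.e. sigma^2) makes the inner gradient align better with the
# outer one, so the objective drops.
g_sup, g_out = [1.0, 2.0], [1.0, 2.0]
g_pl = [-1.0, -2.0]  # conflicting pseudo-label gradient
assert misalignment(g_sup, g_pl, g_out, 2.0) < misalignment(g_sup, g_pl, g_out, 0.0)
```

In other words, the outer update steers \(\phi\) so that pseudo-labels whose gradients pull against the supervised signal get high variance, while agreeing ones keep their full weight.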

Key Experimental Results

Main Results

Experiments are conducted on three datasets (UTKFace, IMDB-WIKI, STS-B) with 5%, 10%, and 20% labeling ratios.

| Dataset | Label Ratio | Metric | Ours | UCVME | RankUp | Gain |
|---|---|---|---|---|---|---|
| UTKFace | 5% | MAE↓ | 5.639 | 5.862 | 5.719 | vs. RankUp −1.4% |
| UTKFace | 5% | R²↑ | 0.523 | 0.495 | 0.495 | vs. RankUp +5.7% |
| IMDB-WIKI | 5% | MAE↓ | 9.177 | 9.730 | 10.251 | vs. Mean Teacher −3.3% |
| IMDB-WIKI | 5% | R²↑ | 0.664 | 0.633 | 0.599 | vs. UCVME +4.9% |
| IMDB-WIKI | 20% | MAE↓ | 8.166 | 8.309 | 8.216 | approaching Fully-Sup (7.974) |
| STS-B | 5% | MSE↓ | 1.540 | 1.713 | 1.844 | vs. SSDKL −4.1% |
| STS-B | 5% | R²↑ | 0.270 | 0.188 | 0.126 | vs. SSDKL +13.0% |

The advantage is most pronounced under scarce labeling (5%); as labeling increases, the gap with UCVME and RankUp narrows but remains competitive.

Ablation Study

| Configuration | MAE↓ (γ=5%) | R²↑ (γ=5%) | MAE↓ (γ=10%) | Note |
|---|---|---|---|---|
| Baseline (no UL, no BLO) | 9.512 | 0.651 | 8.864 | Standard pseudo-labels with fixed \(\sigma^2=1\) |
| Baseline + UL (no BLO) | 9.914 | 0.630 | 9.562 | Joint training degrades performance! |
| Baseline + UL + BLO (full) | 9.177 | 0.664 | 8.539 | Bilevel optimization is the key |

(UL = uncertainty learner; BLO = bilevel optimization; γ is the labeling ratio, results on IMDB-WIKI.)

Key Findings

  1. Joint training of the uncertainty learner is harmful: without bilevel optimization, jointly training \(g_\phi\) and \(f_\theta\) leads to inaccurate uncertainty estimates that suppress learning from "hard but correct" samples.
  2. Uncertainty is positively correlated with prediction error: visualization shows that \(\sigma^2\) estimated by \(g_\phi\) increases with the sample prediction error, consistent with expectations.
  3. Computational overhead is minimal: only 9 ms and 17 MB extra per iteration, far below UCVME (257 ms, 10,057 MB) and SimRegMatch (548 ms, 7,419 MB).
  4. Subgroup analysis: for older age groups with sparse labels, the proposed method produces significantly more accurate pseudo-labels than UCVME.

Highlights & Insights

  • The paper adapts the pseudo-label paradigm from classification to regression, elegantly addressing the reliability of continuous-valued pseudo-labels through heteroscedastic modeling.
  • The bilevel optimization design avoids the degeneration caused by naive joint training, using held-out labeled data in the outer level as a "referee" to calibrate uncertainty.
  • The approximation strategy of unrolling only the regression head gradient cleverly balances theoretical rigor and computational efficiency.
  • The theoretical equivalence between bilevel optimization and inner–outer gradient alignment provides a clear and intuitive explanation.

Limitations & Future Work

  • The method assumes simultaneous access to labeled and unlabeled data, which may not be applicable in privacy-sensitive scenarios.
  • Systematic biases in labeled data (e.g., demographic imbalances) are not addressed and may be amplified through pseudo-labels.
  • The uncertainty learner's input relies only on predicted values and pseudo-labels, without fully exploiting feature-space information.
  • The advantage diminishes at higher labeling ratios, indicating that the method primarily excels in extremely low-label regimes.

Related Work

  • UCVME: employs a shared encoder with dual heads to predict mean and variance, but uncertainty consistency constraints do not suffice to guarantee accurate calibration.
  • FixMatch: the strategy of generating pseudo-labels from weakly augmented inputs and training on strongly augmented inputs is adopted in this work.
  • Meta-learning / DARTS-style bilevel optimization: the proposed framework directly borrows the alternating update strategy from MAML and DARTS.
  • Insight: in noisy-label learning, dynamically adjusting sample weights or uncertainty via bilevel optimization is a general and powerful strategy.

Rating

  • Novelty: ⭐⭐⭐⭐ The combination of heteroscedastic pseudo-labels and bilevel optimization is novel in SSR, though each component has prior precedent.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ Three datasets, three labeling ratios, multiple baselines, ablation studies, visualizations, and computational cost analysis are all provided.
  • Writing Quality: ⭐⭐⭐⭐ Motivation is clearly articulated, the bilevel optimization derivation is rigorous, and theory and experiments are well aligned.
  • Value: ⭐⭐⭐⭐ Provides a concise and efficient solution for SSR with notable advantages in low-label scenarios.