Exposing Mixture and Annotating Confusion for Active Universal Test-Time Adaptation¶

Conference: ICLR 2026
OpenReview: https://openreview.net/forum?id=89gxHJkCXk
Code: To be confirmed
Area: Test-Time Adaptation / Active Learning / Transfer Learning
Keywords: Universal Test-Time Adaptation, Active Learning, Dual Shift, Gaussian Mixture Model, Open-Set Recognition

TL;DR¶

This paper proposes a new paradigm of Active Universal Test-Time Adaptation (AUTTA) and introduces the EMAC method to incorporate sparse human annotations during testing. It decouples domain and class shifts using SVD + GMM to expose samples in the "mixed region," employs a reward-driven strategy to select representative samples for annotation, and utilizes a clustering contrastive loss to balance annotations and pseudo-labels, achieving SOTA performance under dual shifts.

Background & Motivation¶

Background: Universal Test-Time Adaptation (UTTA) aims to handle both domain shift (style/corruption changes) and class shift (emergence of unseen classes) in an open-world setting where source data is unavailable and data arrives in a stream. The goal is to accurately classify known classes while rejecting new classes. Existing UTTA methods (e.g., OPTTT, TTAC) rely primarily on pseudo-labels and heuristic rules.

Limitations of Prior Work: When domain and class shifts occur simultaneously, the pseudo-label error rate rises sharply and self-training collapses. Bringing in human annotation through active learning is the intuitive remedy, but the paper's analysis (Fig. 1) shows that traditional entropy/uncertainty-based active learning tends to select samples with strong class shift, which yield little annotation gain and bias training.

Key Challenge: High-value samples reside in the "mixed region" where domain and class shifts overlap. Pseudo-label error rates are highest there (experiments show roughly 70-80% in mixed regions versus roughly 12-15% in pure-shift regions), yet existing active learning methods fail to cover this area and so waste annotation budget.

Goal: Introduce sparse human annotations into the UTTA framework and design a mechanism to precisely locate and annotate mixed-region samples to maximize the utility of the limited budget.

Core Idea: [Expose Mixture + Annotate Confusion] Explicitly expose samples in the dual-shift mixed region to form a candidate pool, select the most informative representatives using reward signals, and integrate scarce annotations with pseudo-labels using contrastive clustering.

Method¶

Overall Architecture¶

EMAC (Exposing Mixture and Annotating Confusion) is a coarse-to-fine pipeline with two annotation stages followed by one optimization step. Step 1 (Coarse Filtering) uses SVD to decouple the classifier parameters into "known" and "unknown" subspaces and fits a GMM on the unknown energy to expose candidates in the mixed region. Step 2 (Fine Selection) applies a reward-driven Max-Min Entropy strategy to select the representative samples that contribute most to separating new from old classes, and sends them to the oracle for annotation. Step 3 (Optimization) uses a clustering contrastive loss to jointly optimize over annotations and high-confidence pseudo-labels, mitigating the decision-boundary blurring caused by annotation scarcity.

flowchart LR
    A[Target batch features z] --> B[SVD decoupled classifier<br/>Known/Unknown Subspace]
    B --> C[GMM modeling unknown energy<br/>Bimodal distribution]
    C --> D[Expose confusion candidates Xmix<br/>+ Old-class region Xold / New-class region Xnew]
    D --> E[Reward-driven selection<br/>Max-Min Entropy + EMA]
    E --> F[Oracle labels representative samples → DLT]
    F --> G[Clustering contrastive loss<br/>Balancing labels & pseudo-labels]
    G --> H[Test-time model update]

Key Designs¶

1. Exposing Mixed Candidates: SVD Decoupling + GMM Bimodal Modeling. Since source data is unavailable, source knowledge is extracted from the classifier weights \(W_{cls} \in \mathbb{R}^{C \times D}\). Singular Value Decomposition yields the orthogonal decomposition \(W_{cls} = F_{known}\Sigma F_{unknown}^\top\), separating the "known" basis vectors \(F_{known}\) from the "unknown" ones \(F_{unknown}\). Each target feature \(z\) is projected onto the two subspaces to give \(z_{known}\) and \(z_{unknown}\), which satisfy \(\|z_{known}\|_2^2 + \|z_{unknown}\|_2^2 = 1\). The key observation is that the unknown energy \(\|z_{unknown}\|_2^2\) is bimodal: it concentrates around a low mean for old classes and a high mean for new classes. A two-component GMM is fitted to this scalar: \(p(\|z_{unknown}\|_2^2) = \pi\mathcal{N}(\mu_{old}, \sigma_{old}^2) + (1-\pi)\mathcal{N}(\mu_{new}, \sigma_{new}^2)\). Samples are then partitioned into \(X_{old}\) (energy below \(\mu_{old}\)), \(X_{new}\) (energy above \(\mu_{new}\)), and \(X_{mix}\) (energy in between). Restricting annotation to \(X_{mix}\) avoids spending budget on samples the model already handles well, since the pseudo-label error rate in \(X_{mix}\) (~78-83%) is far higher than in \(X_{old}/X_{new}\) (~10-14%).
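The exposure step can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the authors' implementation: the "known" subspace is taken to be the row space of \(W_{cls}\) (right singular vectors with non-zero singular values) and the "unknown" subspace its orthogonal complement, which is one reading that makes the two energies sum to 1 for unit-norm features. The two-component GMM is a hand-rolled 1-D EM, and the function names and synthetic data are hypothetical.

```python
import numpy as np

def split_subspaces(W_cls):
    """Assumed split: row space of W_cls = 'known', orthogonal complement = 'unknown'."""
    # full_matrices=True returns Vt of shape (D, D), a complete orthonormal basis of R^D.
    _, s, Vt = np.linalg.svd(W_cls, full_matrices=True)
    rank = int(np.sum(s > 1e-10 * s.max()))
    return Vt[:rank].T, Vt[rank:].T  # shapes (D, rank) and (D, D - rank)

def unknown_energy(Z, F_unknown):
    """||z_unknown||^2 for unit-normalised features."""
    Z = Z / np.linalg.norm(Z, axis=1, keepdims=True)
    return np.sum((Z @ F_unknown) ** 2, axis=1)

def fit_gmm_1d(x, n_iter=100):
    """Two-component 1-D Gaussian mixture fitted with plain EM (illustrative)."""
    mu = np.percentile(x, [25, 75]).astype(float)
    var = np.full(2, x.var() + 1e-6)
    pi = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: per-sample responsibilities, computed in log space for stability.
        log_p = (-0.5 * (x[:, None] - mu) ** 2 / var
                 - 0.5 * np.log(2 * np.pi * var) + np.log(pi + 1e-12))
        log_p -= log_p.max(axis=1, keepdims=True)
        resp = np.exp(log_p)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: update weights, means and variances.
        nk = resp.sum(axis=0) + 1e-12
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk + 1e-6
        pi = nk / nk.sum()
    order = np.argsort(mu)  # component 0 = low-energy (old), 1 = high-energy (new)
    return mu[order], var[order], pi[order]

def partition(energy, mu_old, mu_new):
    """X_old below mu_old, X_new above mu_new, X_mix in between."""
    old = energy < mu_old
    new = energy > mu_new
    return old, new, ~(old | new)

# Synthetic demo (hypothetical data): "old" features lie close to the row space of W_cls.
rng = np.random.default_rng(0)
W_cls = rng.normal(size=(5, 16))
F_known, F_unknown = split_subspaces(W_cls)
Z_old = rng.normal(size=(200, 5)) @ W_cls + 0.3 * rng.normal(size=(200, 16))
Z_new = rng.normal(size=(200, 16))
Z = np.vstack([Z_old, Z_new])
energy = unknown_energy(Z, F_unknown)
mu, var, pi = fit_gmm_1d(energy)
old_mask, new_mask, mix_mask = partition(energy, mu[0], mu[1])
print("mu_old, mu_new:", mu, "| candidates in X_mix:", int(mix_mask.sum()))
```

Only the samples flagged by `mix_mask` would be passed on to the selection stage. In a real stream the means would be re-estimated as batches arrive.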

2. Reward-Driven Selection: Picking Representatives via Marginal Information Gain. To avoid distribution bias in the candidate pool, a Max-Min Entropy objective \(L_{MME} = \Gamma_{old} + \Gamma_{new}\) quantifies the marginal information gain between consecutive model states \(\theta_{t-1}\) and \(\theta_t\). \(\Gamma_{old} = \sum_i \left[H(f(X_{old}, \theta_t)) - H(f(X_{old}, \theta_{t-1}))\right]\) tracks the change in prediction entropy on old classes (the confidence gain), while \(\Gamma_{new} = \sum_i \left[H(f(X_{new}, \theta_{t-1})) - H(f(X_{new}, \theta_t))\right]\) tracks the change on new classes (the improvement in rejection). The raw rewards \(R'_{old}, R'_{new}\) are weighted averages (balanced by \(\omega_{old}, \omega_{new}\)) and are smoothed with an EMA, \(R_{old} = \alpha R'_{old,t} + (1-\alpha)R'_{old,t-1}\), and likewise for \(R_{new}\). If \(R_{old} > R_{new}\), the highest-entropy old-class sample is sent for annotation; otherwise, the lowest-entropy new-class sample is chosen.
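A small sketch of the selection logic, taking the formulas exactly as written above. How the raw rewards are derived from the \(\Gamma\) terms and the weights \(\omega_{old}, \omega_{new}\) is not spelled out in these notes, so the rewards are simply passed in. All function names and the toy probabilities are hypothetical.

```python
import numpy as np

def entropy(probs, eps=1e-12):
    """Shannon entropy of each row of a (n, C) probability matrix."""
    return -np.sum(probs * np.log(probs + eps), axis=1)

def gamma_terms(p_old_prev, p_old_curr, p_new_prev, p_new_curr):
    """Gamma_old and Gamma_new as written in the notes (prev = theta_{t-1}, curr = theta_t)."""
    gamma_old = np.sum(entropy(p_old_curr) - entropy(p_old_prev))
    gamma_new = np.sum(entropy(p_new_prev) - entropy(p_new_curr))
    return gamma_old, gamma_new

def ema(current, previous, alpha=0.9):
    """R = alpha * R'_t + (1 - alpha) * R'_{t-1}; alpha=0.9 is an arbitrary default."""
    return alpha * current + (1 - alpha) * previous

def select_index(r_old, r_new, H_old_cand, H_new_cand):
    """Highest-entropy old-class candidate if R_old > R_new, else lowest-entropy new-class one."""
    if r_old > r_new:
        return "old", int(np.argmax(H_old_cand))
    return "new", int(np.argmin(H_new_cand))

# Toy example: old-class predictions sharpen, new-class predictions flatten.
p_old_prev = np.array([[0.5, 0.5], [0.6, 0.4]])
p_old_curr = np.array([[0.9, 0.1], [0.8, 0.2]])
p_new_prev = np.array([[0.9, 0.1]])
p_new_curr = np.array([[0.5, 0.5]])
g_old, g_new = gamma_terms(p_old_prev, p_old_curr, p_new_prev, p_new_curr)
print("Gamma_old:", g_old, "Gamma_new:", g_new, "L_MME:", g_old + g_new)

# Smoothed rewards (values made up for illustration) drive the pick.
r_old = ema(current=0.8, previous=0.6)
r_new = ema(current=0.3, previous=0.2)
print(select_index(r_old, r_new, np.array([0.2, 0.9, 0.5]), np.array([0.4, 0.1, 0.3])))
```

With the signs as written, both \(\Gamma\) terms come out negative in this toy case, so \(L_{MME}\) decreases when old-class entropy falls and new-class entropy rises.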

3. Balancing Labels & Pseudo-labels: Clustering Contrastive Optimization. The clustering contrastive loss \(L_c\) pools high-confidence pseudo-labeled features \(F_p\) and annotated features \(F_a\) to compute class prototypes \(p_c = \frac{1}{|F_p| + |F_a|}\left(\sum_{u_i \in F_p} u_i + \sum_{v_i \in F_a} v_i\right)\). The loss is defined as \(L_c = \sum_{i \in I_{old}} \frac{-1}{|Q(i)|}\sum_{p \in Q(i)} \log\frac{\exp(s_{ip})}{S(i)}\), where \(s_{ij}\) is cosine similarity, \(Q(i)\) is the positive set of anchor \(i\), and \(S(i)\) is the normalizer taken over \(N(i) = I_{new} \cup I_{old}\). This objective reduces intra-class distance by pulling samples toward their prototypes and increases inter-class distance by pushing new-class samples away. The total objective is \(L = L_{MME} + L_c\).
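The prototype and loss computation might look like the sketch below. Several choices here are assumptions rather than details confirmed by these notes: each prototype is the plain mean of the pooled pseudo-labeled and annotated features of a class, the positive set \(Q(i)\) is the single prototype of the anchor's class, and the normalizer \(S(i)\) sums over all prototypes plus all new-class features. Names and data are hypothetical.

```python
import numpy as np

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def class_prototypes(F_p, y_p, F_a, y_a, num_classes):
    """Mean of pooled pseudo-labeled (F_p) and annotated (F_a) features per class.

    Assumes every class has at least one feature in the pool.
    """
    feats = np.vstack([F_p, F_a])
    labels = np.concatenate([y_p, y_a])
    protos = np.stack([feats[labels == c].mean(axis=0) for c in range(num_classes)])
    return l2_normalize(protos)

def prototype_contrastive_loss(F_old, y_old, F_new, protos):
    """Assumed instance of L_c: positive = own-class prototype, |Q(i)| = 1."""
    F_old, F_new = l2_normalize(F_old), l2_normalize(F_new)
    s_pos = np.sum(F_old * protos[y_old], axis=1)   # cosine similarity to own prototype
    s_proto = F_old @ protos.T                      # similarities to all prototypes
    s_new = F_old @ F_new.T                         # similarities to new-class samples
    log_S = np.log(np.exp(np.concatenate([s_proto, s_new], axis=1)).sum(axis=1))
    return float(np.sum(-(s_pos - log_S)))

# Toy data: two old classes clustered along the axes, one new-class sample far away.
F_p = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
y_p = np.array([0, 0, 1, 1])
F_a = np.array([[1.0, 0.1], [0.1, 1.0]])
y_a = np.array([0, 1])
F_new = np.array([[-1.0, -1.0]])

protos = class_prototypes(F_p, y_p, F_a, y_a, num_classes=2)
loss_correct = prototype_contrastive_loss(F_p, y_p, F_new, protos)
loss_flipped = prototype_contrastive_loss(F_p, 1 - y_p, F_new, protos)
print("loss with correct labels:", loss_correct)
print("loss with flipped labels:", loss_flipped)
```

The loss is lower when samples sit near their own prototype, which is the "pull toward prototypes, push away new classes" behavior the objective is meant to encourage.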

Key Experimental Results¶

Main Results (DomainNet, higher AH is better)¶

Category	Method	Avg. AH	GPU time (s)
TTA	TEST	43.7	392
TTA	TENT	46.1	479
TTA	SHOT	46.0	564
UTTA	TTAC	46.7	651
UTTA	OPTTT	49.8	697
ATTA	SimATTA	47.1	747
ATTA	EATTA	47.7	738
ATTA	BiTTA	47.2	687
AUTTA	Ours (EMAC)	52.2	735
AUTTA	EMAC*	53.1	779

EMAC outperforms the strongest baseline, OPTTT, by 2.4 points (52.2 vs. 49.8). On VisDA-C, EMAC achieves AH scores of 79.8/77.4/72.3 under NOISE/MNIST/SVHN shifts, ahead of OPTTT (77.8/75.2/69.2) by 2.0/2.2/3.1 points.

Active Learning Comparison (using TENT framework, B=Budget)¶

Method	DomainNet Avg.	VisDA-C
Random (B=1000)	46.8	66.7
Entropy (B=1000)	46.3	65.4
Coreset (B=1000)	49.4	68.7
BADGE (B=1000)	49.5	68.2
SimATTA (B=1000)	47.1	67.9
EMAC (B=800)	50.8	73.1
EMAC (B=1000)	52.2	76.5
EMAC (B=1500)	53.1	78.2

EMAC with a budget of 800 (50.8 / 73.1) outperforms every other method given a budget of 1000 (best: 49.5 on DomainNet, 68.7 on VisDA-C), demonstrating high annotation efficiency.

Ablation Study (EMSC = Mixture Exposure / SC = Selection / BTPO = Balanced Optimization)¶

EMSC	SC	BTPO	DomainNet AH	VisDA-C AH
✓	-	-	43.7	65.0
-	✓	-	47.1	69.4
-	-	✓	47.9	71.2
✓	✓	-	49.1	71.7
-	✓	✓	50.8	73.1
✓	✓	✓	52.2	76.5

Key Findings¶

Mixed Region Effectiveness: The pseudo-label error rate in the GMM-derived \(X_{mix}\) (~78-83%) is far higher than in \(X_{old}/X_{new}\) (~10-14%), which supports the claim that annotating mixed-region samples yields the largest gain.
Complementarity: All three modules are required for the best performance. Used alone, BTPO (Balanced Optimization) contributes the most (47.9 / 71.2), ahead of SC (47.1 / 69.4) and EMSC (43.7 / 65.0).
Robustness to Small Batches: A plain GMM fit is unstable for batch sizes \(\le 8\). Adding a sliding window and EMA smoothing raises AH at batch size 1 from 16.3 to 47.3.
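One possible reading of the small-batch fix is sketched below. The notes do not give the paper's actual windowing or smoothing details, so this is only an assumption-laden illustration: recent unknown-energy values are kept in a fixed-size window, the two mode centers are re-estimated on that window, and the estimates are smoothed with an EMA. To keep the block short, a 1-D two-means estimator stands in for the GMM. The class name, `window`, and `alpha` are hypothetical.

```python
from collections import deque
import numpy as np

def two_means_1d(x, n_iter=20):
    """Stand-in for the GMM fit: 1-D two-means returning (low center, high center)."""
    lo, hi = float(x.min()), float(x.max())
    for _ in range(n_iter):
        assign_hi = np.abs(x - hi) < np.abs(x - lo)
        if assign_hi.all() or (~assign_hi).all():
            break  # degenerate split, keep the current centers
        lo, hi = float(x[~assign_hi].mean()), float(x[assign_hi].mean())
    return np.array([lo, hi])

class SmoothedThresholds:
    """Sliding window of energies plus EMA-smoothed (mu_old, mu_new) estimates."""

    def __init__(self, window=256, alpha=0.1):
        self.buffer = deque(maxlen=window)  # oldest energies are dropped automatically
        self.alpha = alpha
        self.mu = None

    def update(self, energies):
        self.buffer.extend(np.atleast_1d(energies).tolist())
        x = np.asarray(self.buffer)
        if x.size < 2:
            return self.mu  # not enough data to estimate two modes yet
        est = two_means_1d(x)
        self.mu = est if self.mu is None else self.alpha * est + (1 - self.alpha) * self.mu
        return self.mu

# Batch-size-1 stream alternating between a low-energy and a high-energy mode (synthetic).
rng = np.random.default_rng(0)
tracker = SmoothedThresholds(window=64, alpha=0.1)
for t in range(400):
    mode = 0.05 if t % 2 == 0 else 0.7
    tracker.update(np.clip(mode + 0.02 * rng.normal(), 0.0, 1.0))
print("smoothed (mu_old, mu_new):", tracker.mu)
```

Because the estimate is computed on the window rather than on the single incoming sample, the partition thresholds stay defined even when each batch contains one example.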

Highlights & Insights¶

Shift from Uncertainty to Confusion: The core insight is that "uncertainty \(\neq\) annotation value" under dual shifts. High-value samples reside in the class-overlap mixed region, which traditional entropy sampling tends to skip.
Source-free Localization via SVD: Decoupling the classifier space into known/unknown bases without source data is a lightweight yet clever engineering design.
Theoretically Grounded Selection: Linking Max-Min Entropy rewards to information gain/generalization error bounds provides more rigor than purely heuristic scoring.

Limitations & Future Work¶

GMM Bimodal Assumption: Relies on \(\|z_{unknown}\|^2\) exhibiting clean bimodality. In cases of extreme class ratios, the distribution may collapse into a unimodal peak, degrading GMM accuracy.
Labeling Latency: As a human-in-the-loop approach, applicability in real-time streaming where oracle feedback is delayed or costly remains a challenge.
Task Scope: Evaluation is restricted to image classification on DA benchmarks; scalability to complex tasks like detection or segmentation is unexplored.
Hyperparameter Sensitivity: Parameters like \(\omega_{old}, \omega_{new}, \alpha\) require tuning, which is difficult in label-free test environments.

Related Work & Insights¶

UTTA / Open-set TTA: OPTTT and TTAC model class shifts/distribution alignment; this work introduces active labels to break the pseudo-label ceiling.
Active TTA (ATTA): SimATTA, EATTA, and BiTTA use confidence scores, which are unreliable under dual shifts. EMAC addresses this failure point.
Active Learning: Traditional AL (Coreset, BADGE) ignores class shifts. This paper reveals their failure modes in open-world dual-shift scenarios.
Inspiration: The coarse-to-fine "expose then select" paradigm is applicable to other budget-constrained open-world tasks like Continual Learning.

Rating¶

Novelty: ⭐⭐⭐⭐ Proposes the AUTTA paradigm. The "mixed region" insight combined with SVD/GMM localization is fresh, though the method is partly a combination of existing components.
Experimental Thoroughness: ⭐⭐⭐⭐ Comprehensive evaluation on DA datasets with various shifts. Includes budget, ablation, and visualization studies. Limited to classification.
Writing Quality: ⭐⭐⭐⭐ Motivation (Fig. 1) is clear and compelling. Progressive methodological explanation is well-structured.
Value: ⭐⭐⭐⭐ Significant for open-world scenarios requiring human intervention (e.g., autonomous driving). High annotation efficiency is practically valuable.