Adaptive Debiasing Tsallis Entropy for Test-Time Adaptation

Conference: ICLR 2026
arXiv: 2602.11743
Code: https://github.com/Jinx630/ADTE
Area: Social Computing
Keywords: Test-Time Adaptation, Tsallis entropy, CLIP, debiasing, uncertainty estimation

TL;DR

This paper introduces Tsallis entropy, a generalization of Shannon entropy, into test-time adaptation for vision-language models. It then develops Adaptive Debiasing Tsallis Entropy (ADTE), which assigns each class \(l\) its own debiasing parameter \(q^l\), so that high-confidence views are selected more reliably than with Shannon entropy and without distribution-specific hyperparameter tuning. ADTE is reported to surpass the state of the art on ImageNet and its 5 variants as well as on 10 cross-domain benchmarks.

Background & Motivation

Background: Test-time adaptation (TTA) methods improve the out-of-distribution performance of vision-language models (VLMs) such as CLIP by selecting high-confidence augmented views. Representative methods, including TPT and Zero, use Shannon entropy (SE) to measure uncertainty and keep only the low-entropy views.

Limitations of Prior Work: CLIP is pre-trained on imbalanced web-crawled data, leading to overconfidence on head classes and underconfidence on tail classes. Shannon entropy applies a uniform formula \(-p\log p\) to all classes, making it unable to distinguish the degree of bias across different classes. As a result, the entropy estimate itself is biased, which degrades the quality of high-confidence view selection.

Key Challenge: Shannon entropy assumes an unbiased probability distribution (the extensivity assumption), whereas CLIP's predictive distribution exhibits systematic bias (non-extensivity) that SE cannot characterize.

Goal: How can the influence of VLM prediction bias on entropy estimation be corrected during TTA?

Key Insight: Tsallis entropy (TE) generalizes Shannon entropy and captures statistical dependencies among probability distributions through the non-extensive parameter \(q\). When \(q<1\), TE tends to select more reliable high-confidence views.

Core Idea: Replace Shannon entropy with Tsallis entropy for high-confidence view selection, and adaptively compute a debiasing parameter \(q^l\) for each class.
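
A short derivation makes the "generalization" claim concrete. Assuming natural logarithms and \(\sum_l P_l = 1\), both the numerator and the denominator of the TE expression \(\frac{\sum_l P_l^q - 1}{1-q}\) vanish at \(q = 1\), so L'Hôpital's rule (differentiating with respect to \(q\)) gives the Shannon limit:

\[
\lim_{q \to 1} \frac{\sum_l P_l^q - 1}{1-q}
= \lim_{q \to 1} \frac{\sum_l P_l^q \ln P_l}{-1}
= -\sum_l P_l \ln P_l
= \mathbf{H}_{SE}.
\]

TE is therefore undefined exactly at \(q = 1\) and recovers SE only in the limit.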

Method

Overall Architecture

ADTE serves as a plug-and-play replacement for Shannon entropy in TTA methods such as Zero and TPT. The pipeline proceeds as follows: a test image is augmented into \(N\) views → ADTE computes the uncertainty of each view → low-entropy, high-confidence views are selected → predictions are aggregated. The key distinction lies in the entropy computation and the use of class-specific parameters.
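
The pipeline above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the function names, the `keep_ratio` default, and the aggregation rule (averaging the probabilities of the kept views) are assumptions made for the example, and Zero and TPT each aggregate differently. The `entropy_fn` argument marks the slot where a different uncertainty score such as TE or ADTE would be plugged in.

```python
import numpy as np

def softmax(logits, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def shannon_entropy(probs, eps=1e-12):
    # H_SE = -sum_l P_l log P_l, computed per view (last axis = classes).
    return -(probs * np.log(probs + eps)).sum(axis=-1)

def select_and_aggregate(view_logits, entropy_fn=shannon_entropy, keep_ratio=0.1):
    """view_logits: array of shape (N views, C classes).

    keep_ratio and mean-probability aggregation are illustrative
    assumptions, not values taken from the paper.
    """
    probs = softmax(view_logits)
    scores = entropy_fn(probs)                      # one uncertainty score per view
    n_keep = max(1, int(round(keep_ratio * len(scores))))
    keep = np.argsort(scores)[:n_keep]              # lowest entropy = most confident
    mean_probs = probs[keep].mean(axis=0)           # aggregate the kept views
    return int(mean_probs.argmax()), keep

# Usage with random logits standing in for CLIP outputs on 64 augmented views.
rng = np.random.default_rng(0)
pred, kept = select_and_aggregate(rng.normal(size=(64, 5)))
```

Because the entropy function is passed in as an argument, swapping the scoring rule leaves the rest of the pipeline untouched, which is the sense in which the replacement is plug-and-play.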

Key Designs

Tsallis Entropy as a Replacement for Shannon Entropy:
Function: Replace SE \(\mathbf{H}_{SE} = -\sum_l P_l \log P_l\) with TE \(\mathbf{H}_{TE} = \frac{\sum_l P_l^q - 1}{1-q}\).
Mechanism: TE reduces to SE in the limit \(q \to 1\), and SE lower-bounds TE for \(q < 1\) (the lower-bound property). When \(q < 1\), views selected by TE exhibit higher Top-K cumulative reliability (TcrK), and for \(0 < q < 1\) TE mitigates the influence of VLM prediction bias.
Design Motivation: SE is sensitive to bias in tail classes, whose probabilities approach zero. By replacing \(p\log p\) with \(p^q\), TE changes how small probabilities contribute: for \(0 < q < 1\), \(p^q\) exceeds \(p\) on \((0, 1)\), so near-zero probabilities carry more weight in the entropy.
Adaptive Debiasing Tsallis Entropy (ADTE):
Function: Customize a class-specific parameter \(q^l\) for each class \(l\) without manual tuning.
Mechanism: (1) A memory bank is maintained to estimate the per-class prior probability \(\tilde{p}_l\) (solved via Jacobi iteration, approximated with pseudo-labels). (2) The estimated bias is min-max normalized into the interval \([\alpha, \beta] = [0.01, 0.9]\) to yield \(q^l\); classes with larger bias receive smaller \(q^l\), implying stronger correction.
Design Motivation: Manual tuning of \(q\) is infeasible across varying test distributions, and different classes are affected by bias to different extents (head vs. tail classes).
Integration with Logit Adjustment:
ADTE integrates seamlessly with logit adjustment strategies: the estimated bias is first used to adjust logits, after which ADTE selects high-confidence views.
The entire process requires no additional training or distribution-specific hyperparameter tuning.
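
The designs above can be illustrated with a small NumPy sketch. Only the scalar-\(q\) TE formula and the interval \([0.01, 0.9]\) come from this note; everything else is an assumption for illustration. In particular, the exact mapping in `class_q_from_bias` (a reversed min-max map so that larger bias gives smaller \(q^l\)) and the per-class entropy form \(\sum_l (P_l^{q^l} - P_l)/(1 - q^l)\) are hypothetical choices that reduce to TE when every \(q^l\) is equal; the paper's actual definition may differ.

```python
import numpy as np

def tsallis_entropy(probs, q, eps=1e-12):
    # H_TE = (sum_l P_l^q - 1) / (1 - q) for a scalar q; Shannon entropy at q ~ 1.
    if abs(q - 1.0) < 1e-8:
        return -(probs * np.log(probs + eps)).sum(axis=-1)
    return ((probs ** q).sum(axis=-1) - 1.0) / (1.0 - q)

def class_q_from_bias(bias, alpha=0.01, beta=0.9):
    # Assumed mapping: min-max normalize the bias, then reverse it so the
    # most biased class gets alpha and the least biased class gets beta.
    bias = np.asarray(bias, dtype=float)
    span = bias.max() - bias.min()
    if span == 0:
        # Arbitrary fallback when all classes look equally biased.
        return np.full_like(bias, beta)
    norm = (bias - bias.min()) / span
    return beta - norm * (beta - alpha)

def per_class_tsallis_entropy(probs, q_l):
    # Assumed per-class form: sum_l (P_l^{q_l} - P_l) / (1 - q_l).
    # With a constant q_l and probabilities summing to 1 it equals H_TE.
    return ((probs ** q_l - probs) / (1.0 - q_l)).sum(axis=-1)

# Usage: three classes, the first one estimated to be the most biased.
q_l = class_q_from_bias([1.0, 0.5, 0.0])
p = np.array([0.7, 0.2, 0.1])
score = per_class_tsallis_entropy(p, q_l)
```

Since \(q^l\) never exceeds \(\beta = 0.9\) under this mapping, the denominator \(1 - q^l\) stays at or above 0.1 and the score is numerically well behaved.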

Loss & Training

No training is required. ADTE is a purely inference-time method that directly replaces Shannon entropy in the TTA pipeline. The memory bank stores 10 samples per class.
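
A fixed-capacity per-class memory bank can be sketched as follows. The capacity of 10 comes from this note; the class name `ClassMemoryBank`, the first-in-first-out eviction, keying by the argmax pseudo-label, and the `mean_prediction` helper are all assumptions made for illustration. The helper is only a crude stand-in for a prior estimate and does not reproduce the Jacobi-iteration solver mentioned in the method description.

```python
from collections import defaultdict, deque

import numpy as np

class ClassMemoryBank:
    """Keeps at most `capacity` probability vectors per pseudo-label."""

    def __init__(self, capacity=10):
        # deque(maxlen=...) silently drops the oldest entry when full (FIFO).
        self.bank = defaultdict(lambda: deque(maxlen=capacity))

    def add(self, probs):
        probs = np.asarray(probs, dtype=float)
        label = int(probs.argmax())          # pseudo-label of this sample
        self.bank[label].append(probs)
        return label

    def counts(self):
        return {c: len(items) for c, items in self.bank.items()}

    def mean_prediction(self):
        # Average of all stored probability vectors; None if the bank is empty.
        stored = [p for items in self.bank.values() for p in items]
        return np.mean(stored, axis=0) if stored else None

# Usage: stream predictions through the bank as test samples arrive.
bank = ClassMemoryBank(capacity=10)
for _ in range(12):
    bank.add([0.8, 0.1, 0.1])
bank.add([0.2, 0.2, 0.6])
```

Because the bank is filled from pseudo-labels, its contents are only as reliable as the early predictions, which is the dependence noted among the limitations.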

Key Experimental Results

Main Results (ImageNet + 5 Variants, CLIP ViT-B/16)

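Accuracy (%). IN = ImageNet; only three of the variants are listed. Average and OOD Avg are as reported and do not equal the mean of the listed columns, so they presumably cover variants omitted here.
Method	IN	IN-A	IN-R	IN-K	Average	OOD Avg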
CLIP	68.7	50.6	77.7	48.3	61.5	59.7
Zero	70.9	64.0	80.8	50.3	66.2	65.0
BCA	70.2	61.1	80.7	50.9	65.6	64.4
ADTE	71.8	65.5	81.4	53.5	67.5	66.5
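
The per-dataset margins implied by the table can be recomputed with a few lines of Python. The numbers are copied from the four dataset columns above; the dictionary layout and the helper name `margin_over_best_baseline` are choices made for this example only.

```python
# Accuracy (%) from the table, dataset columns only.
results = {
    "CLIP": {"IN": 68.7, "IN-A": 50.6, "IN-R": 77.7, "IN-K": 48.3},
    "Zero": {"IN": 70.9, "IN-A": 64.0, "IN-R": 80.8, "IN-K": 50.3},
    "BCA":  {"IN": 70.2, "IN-A": 61.1, "IN-R": 80.7, "IN-K": 50.9},
    "ADTE": {"IN": 71.8, "IN-A": 65.5, "IN-R": 81.4, "IN-K": 53.5},
}

def margin_over_best_baseline(results, method="ADTE"):
    # For each dataset, subtract the best score among all other methods.
    margins = {}
    for dataset, score in results[method].items():
        best_other = max(r[dataset] for name, r in results.items() if name != method)
        margins[dataset] = round(score - best_other, 1)
    return margins

margins = margin_over_best_baseline(results)
largest = max(margins, key=margins.get)  # dataset with the widest margin
```

Comparing against the best competing method per column, rather than against zero-shot CLIP alone, separates the contribution of ADTE from the gain that any TTA method already provides.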

Cross-Domain Benchmarks (Best Average across 10 Datasets)

Metric	Description
ADTE Average Accuracy	Highest average performance across 10 cross-domain benchmarks
Model-agnostic	Outperforms SOTA on both ViT-B/16 and ViT-L/14
Prompt-agnostic	Effective with both hand-crafted templates and CuPL-generated text

Key Findings

TE consistently outperforms SE when \(q < 1\) (SE is the limiting case of TE as \(q \to 1\)), though the optimal \(q\) varies across test distributions.
ADTE eliminates the need for manual hyperparameter search via adaptive \(q^l\), yielding robust performance across all test distributions.
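The largest margin over the best competing method in the table is on ImageNet-K (53.5 vs. 50.9 for BCA, +2.6 points), which is also the variant where zero-shot CLIP is weakest (48.3). Measured against zero-shot CLIP, the largest gain is instead on IN-A (50.6→65.5).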
ADTE can directly replace entropy computation in any SE-based TTA method without further modification.

Highlights & Insights

First systematic analysis of Shannon entropy bias in VLM TTA: The extensivity assumption implicitly embedded in SE does not hold in this setting—a long-overlooked issue.
Tsallis entropy as a principled drop-in replacement: Theoretically elegant (SE is a lower bound) and practically effective; any TTA method using SE can directly substitute TE/ADTE.
Adaptive parameter estimation: The per-class debiasing parameter \(q^l\) is derived from existing bias estimation techniques (from Frolic), reusing available tools without additional overhead.

Limitations & Future Work

The memory bank size is fixed at 10 samples per class, which may be insufficient when the label space is large (e.g., 1,000 classes in ImageNet).
Bias estimation relies on pseudo-label quality; early-stage pseudo-labels may be unreliable.
The normalization interval \([\alpha, \beta] = [0.01, 0.9]\) remains a manually specified hyperparameter.
Validation is limited to classification tasks; dense prediction tasks such as detection and segmentation are not covered.

Related Work & Comparisons

vs. Zero/TPT: ADTE is a direct upgrade; replacing only the entropy computation yields consistent improvements without modifying other components.
vs. Frolic: Frolic applies logit adjustment for bias correction, whereas ADTE corrects bias at the entropy estimation level; the two approaches are complementary.
vs. Prior Use of Tsallis Entropy in Domain Adaptation: Previous work optimizes TE for pseudo-label generation in source-domain adaptation; ADTE is the first to apply it to view selection in online TTA.

Rating

Novelty: ⭐⭐⭐⭐ — Applying Tsallis entropy to VLM TTA represents a novel theoretical perspective.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ — ImageNet + 5 variants, 10 cross-domain benchmarks, two model scales, two prompt types.
Writing Quality: ⭐⭐⭐⭐ — Theoretical derivations are clear, though notation is dense.
Value: ⭐⭐⭐⭐ — A plug-and-play replacement for Shannon entropy with strong practical utility.