Everything to the Synthetic: Diffusion-driven Test-time Adaptation via Synthetic-Domain Alignment

Conference: CVPR 2025
arXiv: 2406.04295
Code: https://github.com/SHI-Labs/Diffusion-Driven-Test-Time-Adaptation-via-Synthetic-Domain-Alignment
Area: Diffusion Models
Keywords: Test-Time Adaptation, Diffusion Models, Domain Alignment, Synthetic Domain, Data Adaptation

TL;DR

This paper reveals the implicit domain misalignment issue between the source domain and the synthetic domain in diffusion-driven TTA methods, and proposes the Synthetic-Domain Alignment (SDA) framework. By utilizing the Mix of Diffusion (MoD) technique to simultaneously align the source model and the target data to the same synthetic domain, the proposed method achieves consistent performance improvements across classification, segmentation, and multimodal large language models.

Background & Motivation

Background: Test-time adaptation (TTA) is an emerging research direction aimed at improving the performance of source-domain pre-trained models on unseen target domains. Traditional TTA methods adapt to the target data stream by continuously updating model weights, but this approach is highly sensitive to the quantity and order of target data. Recently, diffusion-model-driven TTA methods (e.g., DiffPure, DDA, GDA) have shifted toward adapting input data instead of model weights, mapping target-domain data to the synthetic domain through unconditional diffusion models.

Limitations of Prior Work: Although the synthetic data generated by diffusion-driven TTA methods is visually indistinguishable from source data, the authors demonstrate that for deep networks, the synthetic data actually exhibits significant domain misalignment with the source data. Experiments show that even when applying DDA for diffusion adaptation directly on ImageNet source data (involving no domain shift), the accuracy of Swin-B and ConvNeXt-B drops by over 21.8% and 18.8%, respectively.

Key Challenge: Existing diffusion-driven TTA methods assume that they map target data back to the source domain. However, the mapping actually lands in the synthetic domain of the diffusion model rather than the true source domain. There is an implicit gap between the source domain and the synthetic domain: it is invisible to the human eye, but deep networks are highly sensitive to it.

Goal: (1) How to quantify and understand the source-to-synthetic domain misalignment? (2) How to adapt the model to the synthetic domain without accessing the source data? (3) How to handle the synthetic domain discrepancy between conditional and unconditional diffusion models?

Key Insight: Since the target data after diffusion adaptation inevitably falls into the synthetic domain, it is better to align the source model to the same synthetic domain, transforming the cross-domain TTA problem into an in-domain prediction problem.

Core Idea: Instead of trying to pull target data back to the source domain, both the model and the data are aligned to the synthetic domain of the diffusion model, thereby eliminating the domain gap.

Method

Overall Architecture

The SDA framework consists of three stages: (1) Source-domain model pre-training stage—training the source model normally on source data; (2) Source-to-synthetic model adaptation stage—generating a synthetic dataset via the MoD technique to fine-tune the source model to the synthetic domain; (3) Target-to-synthetic data adaptation stage—mapping target data to the synthetic domain using an unconditional diffusion model. Finally, the synthetic-domain data is fed into the synthetic-domain model for inference, which is then ensembled with the prediction of the source model on the original target data.
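
The noise-then-denoise step shared by stages (2) and (3) can be sketched as follows. This is a minimal illustration, assuming a DDPM-style forward process with a linear beta schedule; the schedule values, the helper names (`make_alpha_bar`, `forward_noise`, `adapt_to_synthetic`), and the placeholder `denoise_fn` are assumptions made for illustration and are not taken from the paper or its code release.

```python
import math
import random

def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for an ASSUMED linear beta schedule."""
    alpha_bar, prod = [], 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / (num_steps - 1)
        prod *= 1.0 - beta
        alpha_bar.append(prod)
    return alpha_bar

def forward_noise(x0, t, alpha_bar, rng):
    """Sample x_t ~ N(sqrt(a_bar_t) * x0, (1 - a_bar_t) * I); t is 1-indexed."""
    a = alpha_bar[t - 1]
    return [math.sqrt(a) * v + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0) for v in x0]

def adapt_to_synthetic(x0, t, alpha_bar, denoise_fn, rng):
    """Noise an input up to step t, then pass it to a reverse-process sampler.

    denoise_fn is a HYPOTHETICAL stand-in for an unconditional diffusion
    model's reverse process; no real sampler is implemented here.
    """
    x_t = forward_noise(x0, t, alpha_bar, rng)
    return denoise_fn(x_t, t)

alpha_bar = make_alpha_bar()
x0 = [0.5, -0.25, 0.0, 1.0]  # toy flattened "image" with values in [-1, 1]
x_500 = forward_noise(x0, 500, alpha_bar, random.Random(0))
```

In this sketch the same `adapt_to_synthetic` call would serve both roles: applied to conditional synthetic samples it corresponds to the second MoD step, and applied to target images it corresponds to test-time data adaptation.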

Key Designs

  1. Discovery and Quantification of Source-Synthetic Domain Misalignment:

    • Function: Expose the neglected domain misalignment issue in diffusion-driven TTA.
    • Mechanism: On the ImageNet validation set, the source-domain data is first subjected to diffusion-based forward and reverse processes (i.e., involving only source \(\to\) synthetic, not target \(\to\) source), and then tested using the source model. The results show that accuracy monotonically decreases as the timestep \(t\) increases. Within a reasonable TTA range of \(t \geq 500\), the accuracy of ConvNeXt-B drops from 83.4% to 41.5%–65.1%, confirming a massive gap between the synthetic domain and the source domain.
    • Design Motivation: Although synthetic data and source data are visually indistinguishable, deep networks are extremely sensitive to this implicit domain difference, which is the root cause of the performance limitations in prior methods.
  2. Mix of Diffusion (MoD) Synthetic Data Generation:

    • Function: Generate a labeled synthetic dataset aligned with the synthetic domain of the unconditional diffusion model.
    • Mechanism: This process involves two steps. First, a conditional diffusion model (e.g., DiT) generates synthetic data \(x_{0,c}^{syn}\) conditioned on class labels, yielding a labeled synthetic dataset. Second, the conditional synthetic data is perturbed with noise up to timestep \(t^*\) and then denoised with an unconditional diffusion model, moving the data from the conditional synthetic domain \(p_{0,c}^{syn}\) to the unconditional synthetic domain \(p_{0,u}^{syn}\). This step relies on the property that the KL divergence between the two noised distributions decreases monotonically as more noise steps are applied, so the two synthetic-domain distributions converge at high noise levels.
    • Design Motivation: Conditional diffusion models can generate labeled data, but their synthetic domain differs from that of unconditional diffusion models (due to differences in architecture and training methods). The noise-denoise process in the second step closes the gap between the two synthetic domains, so that the fine-tuning data lies in the same synthetic domain that target data is mapped to during test-time data adaptation.
  3. Synthetic-Domain Model Fine-Tuning and Ensemble Inference:

    • Function: Align the source model to the synthetic domain and perform robust inference.
    • Mechanism: The source model is fine-tuned for 15 epochs on 50K synthetic samples generated by MoD to obtain the synthetic-domain model \(f'\). During inference, the source model's prediction on the original target data, \(q(y|x_0^{trg})\), and the synthetic-domain model's prediction on the adapted data, \(q'(y|x_{0,u}^{syn})\), are ensembled by summing probabilities: \(\hat{y} = \arg\max_y (q(y|x_0^{trg}) + q'(y|x_{0,u}^{syn}))\).
    • Design Motivation: Diffusion adaptation occasionally generates samples with lower recognizability than the original data, and the ensemble strategy can combine the strengths of both. Furthermore, synthetic data only needs to be generated once to adapt to different source models, incurring a very low marginal cost.
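
The ensemble rule \(\hat{y} = \arg\max_y (q(y|x_0^{trg}) + q'(y|x_{0,u}^{syn}))\) can be illustrated with a small sketch. It assumes both models output raw logits that are turned into probabilities with a softmax; the function names and the toy logits are hypothetical and only serve the example.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def ensemble_predict(source_logits, synthetic_logits):
    """Return argmax_y of q(y | x_trg) + q'(y | x_syn) (probability summation)."""
    q = softmax(source_logits)            # source model f on the original target input
    q_prime = softmax(synthetic_logits)   # synthetic-domain model f' on the adapted input
    summed = [a + b for a, b in zip(q, q_prime)]
    return max(range(len(summed)), key=summed.__getitem__)

# Toy logits (ASSUMED values): f weakly prefers class 0, f' strongly prefers class 1.
source_logits = [2.0, 1.0, 0.1]
synthetic_logits = [0.5, 3.0, 0.2]
pred = ensemble_predict(source_logits, synthetic_logits)  # class 1 wins the sum
```

Because the two probability vectors are summed with equal weight, the more confident of the two models tends to decide the prediction when they disagree, which is how the ensemble can recover from an occasional poorly adapted sample.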

Loss & Training

During the model adaptation stage, standard classification cross-entropy loss is used for fine-tuning. Once generated, synthetic data can be reused across different source models. A similar strategy is employed for segmentation and MLLMs: conditional diffusion models are used to generate synthetic data with segmentation masks or VQA annotations, which are then processed via MoD before fine-tuning.
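
For reference, the per-sample cross-entropy objective used in fine-tuning can be written out as below. This is a generic sketch of the standard loss and its gradient with respect to the logits, not the authors' training code; optimizer, learning rate, and batching are not specified in this summary and are left out.

```python
import math

def cross_entropy(logits, label):
    """Standard cross-entropy: -log softmax(logits)[label], computed stably."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum_exp - logits[label]

def cross_entropy_grad(logits, label):
    """Gradient of the loss w.r.t. the logits: softmax(logits) - one_hot(label)."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total - (1.0 if i == label else 0.0) for i, e in enumerate(exps)]

# Toy example: one MoD-generated sample whose class label is 1 (values are ASSUMED).
loss = cross_entropy([0.2, 1.5, -0.3], label=1)
grad = cross_entropy_grad([0.2, 1.5, -0.3], label=1)
```

The label here comes for free from the class condition used when the sample was generated, which is why no source data is needed for this stage.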

Key Experimental Results

Main Results

| Model | Source | MEMO | DiffPure | GDA | DDA | SDA (Ours) |
|---|---|---|---|---|---|---|
| ResNet-50 | 18.7 | 24.7 | 16.8 | 31.8 | 29.7 | 32.5 (+2.8) |
| Swin-T | 33.5 | 29.5 | 24.8 | 42.2 | 40.0 | 42.5 (+2.5) |
| ConvNeXt-T | 39.3 | 37.8 | 28.8 | 44.8 | 44.2 | 47.0 (+2.8) |
| Swin-B | 40.5 | 37.0 | 28.9 | - | 44.5 | 47.4 (+2.9) |
| ConvNeXt-B | 45.6 | 45.8 | 32.7 | - | 49.4 | 51.9 (+2.5) |

Average accuracy across 15 corruptions on ImageNet-C at severity=5. SDA consistently outperforms DDA by 2.5%–2.9% across all models.
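
The bracketed gains in the SDA column are differences against the DDA column. The short check below recomputes them from the numbers reported above; only the dictionary layout and variable names are introduced here.

```python
# Accuracy (%) on ImageNet-C, severity 5, copied from the results table above.
results = {
    "ResNet-50":  {"DDA": 29.7, "SDA": 32.5},
    "Swin-T":     {"DDA": 40.0, "SDA": 42.5},
    "ConvNeXt-T": {"DDA": 44.2, "SDA": 47.0},
    "Swin-B":     {"DDA": 44.5, "SDA": 47.4},
    "ConvNeXt-B": {"DDA": 49.4, "SDA": 51.9},
}

# Gain of SDA over DDA per model, rounded to one decimal to absorb float error.
gains = {model: round(r["SDA"] - r["DDA"], 1) for model, r in results.items()}
smallest, largest = min(gains.values()), max(gains.values())
```

The smallest and largest values match the 2.5 to 2.9 range quoted in the caption.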

Ablation Study

| Configuration | Swin-B (t=700) | ConvNeXt-B (t=700) | Description |
|---|---|---|---|
| Source-Synthetic (Misaligned) | 55.7 | 60.3 | Testing synthetic data directly on source model |
| Synthetic-Synthetic (Aligned, SDA) | 65.0 | 67.4 | Testing synthetic data on aligned model |
| \(\Delta\) | +9.3 | +7.1 | Improvement from domain alignment |
| Fine-tuning with conditional synthetic domain only \(f_c'\) | Lower than \(f_u'\) | Lower than \(f_u'\) | Proves the discrepancy between conditional/unconditional synthetic domains |

Key Findings

  • SDA outperforms DDA on 14 out of 15 corruptions in ImageNet-C, with the sole exception of contrast corruption.
  • The improvement from domain alignment increases as the timestep \(t\) grows (\(+6.0\%\) at \(t=500\), \(+9.9\%\) at \(t=1000\)), demonstrating that greater synthetic domain shift yields higher alignment gains.
  • The SDA framework is also applicable to segmentation (DeepLabv3 achieves a \(+1.2\%\) mIoU improvement on PASCAL VOC-C) and MLLMs (LLaVA shows improvements on corrupted VQA), demonstrating its generalizability.
  • Synthetic data needs to be generated only once and can be reused across different source models.

Highlights & Insights

  • The discovery of implicit domain misalignment is well targeted: although synthetic data is visually indistinguishable from source data, the features learned by deep networks are highly sensitive to the difference. This insight identifies a fundamental bottleneck of diffusion-driven TTA methods and gives the field a useful new perspective.
  • MoD is an elegant alignment mechanism: a simple "noise to a high timestep \(\to\) denoise with the target diffusion model" operation moves data from another synthetic distribution into the target synthetic domain, avoiding complex domain adaptation training.
  • The orthogonality of the framework is highly valuable: SDA focuses on "model \(\to\) synthetic domain" alignment, which is orthogonal to the "data \(\to\) synthetic domain" adaptation used in DDA/GDA. It can be directly combined with better data adaptation methods in the future to yield even larger improvements.
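
The intuition behind MoD's noise-then-denoise step, that two distributions become harder to tell apart as more noise is added, can be checked on a toy case. The sketch below uses two 1-D Gaussians as stand-ins for the conditional and unconditional synthetic domains and an assumed DDPM-style linear beta schedule; the means, variances, and schedule are illustrative assumptions, not quantities from the paper.

```python
import math

def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for an ASSUMED linear beta schedule."""
    alpha_bar, prod = [], 1.0
    for i in range(num_steps):
        prod *= 1.0 - (beta_start + (beta_end - beta_start) * i / (num_steps - 1))
        alpha_bar.append(prod)
    return alpha_bar

def gaussian_kl(m1, v1, m2, v2):
    """KL( N(m1, v1) || N(m2, v2) ) for 1-D Gaussians with variances v1, v2."""
    return 0.5 * (math.log(v2 / v1) + (v1 + (m1 - m2) ** 2) / v2 - 1.0)

def noised(mean, var, a_bar):
    """Mean and variance of N(mean, var) after forward noising with a_bar."""
    return math.sqrt(a_bar) * mean, a_bar * var + (1.0 - a_bar)

alpha_bar = make_alpha_bar()
p_cond = (0.0, 0.25)    # toy "conditional synthetic domain" (mean, variance)
p_uncond = (0.5, 0.16)  # toy "unconditional synthetic domain" (mean, variance)

timesteps = [1, 250, 500, 750, 1000]
kls = []
for t in timesteps:
    m1, v1 = noised(*p_cond, alpha_bar[t - 1])
    m2, v2 = noised(*p_uncond, alpha_bar[t - 1])
    kls.append(gaussian_kl(m1, v1, m2, v2))
```

In this toy setting `kls` shrinks toward zero as `t` grows, which mirrors the argument that a sufficiently large \(t^*\) makes the starting distribution largely irrelevant to where the denoised samples end up.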

Limitations & Future Work

  • SDA requires extra pre-computation overhead for 50K synthetic data generation and fine-tuning (though this is a one-time cost).
  • The computational overhead of diffusion adaptation itself remains high (requiring a full forward-backward diffusion process for each image), which limits real-time application scenarios.
  • For target domains with extreme shifts from the source domain (e.g., cross-modality), the synthetic domain of the diffusion model itself may fail to effectively cover the target distribution.
  • The ensemble strategy is relatively simple (probability summation). Future work could explore adaptive weighting or confidence-based fusion schemes.

Related Work & Insights

  • vs DDA: DDA only performs Target \(\to\) Synthetic data adaptation, ignoring the gap between Source and Synthetic. SDA addresses this missing link with additional model adaptation, serving as a natural extension of DDA.
  • vs GDA: GDA uses better structure guidance for data adaptation, while SDA approaches from the perspective of model adaptation. These two methods are orthogonal and complementary; GDA's data adaptation can be stacked with SDA's model alignment.
  • vs Traditional TTA (MEMO, etc.): Traditional methods continuously update model weights, rendering them highly sensitive to the order and batch size of the target data stream. SDA's model adaptation is an offline, one-time process that requires no online updates during inference.

Rating

  • Novelty: ⭐⭐⭐⭐ The core insight (synthetic domain \(\neq\) source domain) is simple yet profound. The MoD design is clever, though the technical implementation is relatively straightforward.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ Covers three categories of tasks (classification, segmentation, and MLLMs) across multiple model architectures, with comprehensive ablation studies.
  • Writing Quality: ⭐⭐⭐⭐⭐ The logic is clear, and the narrative chain of problem \(\to\) discovery \(\to\) solution flows very smoothly.
  • Value: ⭐⭐⭐⭐ Provides crucial inspiration for the diffusion-driven TTA field, and the framework exhibits strong generalizability.