Diffusion-Driven Progressive Target Manipulation for Source-Free Domain Adaptation¶
- Conference: NeurIPS 2025
- arXiv: 2510.25279
- Code: N/A
- Area: Image Generation / Diffusion Models / Domain Adaptation
- Keywords: Source-Free Domain Adaptation, Diffusion Models, Pseudo-Target Domain Generation, Progressive Optimization, Semantic Transformation
TL;DR¶
This paper proposes the DPTM framework, which leverages a latent diffusion model to semantically transform unreliable target samples into a pseudo-target domain, then iteratively narrows the gap to the real target domain via a progressive reconstruction mechanism. DPTM achieves up to an 18.6% improvement over existing state-of-the-art SFDA methods in large-domain-shift scenarios.
Background & Motivation¶
Source-Free Domain Adaptation (SFDA) requires domain adaptation using only a pretrained source model and unlabeled target data. Existing methods fall into two categories, both constrained by the source–target domain discrepancy:
- Non-generative methods: Rely on pseudo-labels produced by the source model, which are highly unreliable under large domain shifts (e.g., only ~60% accuracy on Ar→Pr in Office-Home), leading to unstable performance.
- Generative methods: Generate a pseudo-source domain and convert the problem into a standard UDA setting, but the generation process introduces irrelevant domain features, inadvertently widening the source–target gap.
Key insight: The fundamental bottleneck of both paradigms is the source–target domain shift. The authors propose a novel paradigm—directly generating a pseudo-target domain rather than a pseudo-source domain—to eliminate this bottleneck at its root.
Method¶
Overall Architecture¶
DPTM consists of three core components executed over \(R\) progressive iterations:
- Trust/Untrust Set Partitioning: Target data is divided into a reliable subset \(V\) and an unreliable subset \(U\) based on prediction uncertainty.
- Untrust Set Manipulation Strategy: Samples in \(U\) are semantically transformed to newly assigned class labels via a diffusion model, while preserving target domain distribution characteristics.
- Progressive Reconstruction Mechanism: The gap between the pseudo-target domain and the real target domain is iteratively reduced.
Key Designs¶
Trust/Untrust Set Partitioning¶
Entropy \(H\) of target model predictions serves as the uncertainty measure, with threshold \(E\) used for partitioning:
- Samples with \(H \leq E\) are assigned to the trust set \(V\) and supervised directly with their pseudo-labels.
- Samples with \(H > E\) are assigned to the untrust set \(U\) for diffusion-based manipulation.
New class labels are assigned uniformly across the untrust set to ensure class balance: samples are relabeled in contiguous blocks of \(\lfloor|U|/C\rfloor\), i.e. \(\hat{y}_l = \lfloor l \,/\, \lfloor|U|/C\rfloor \rfloor\), and the residual tail samples are discarded.
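A minimal PyTorch sketch of this partition-and-relabel step is given below; the function and variable names are illustrative, and whether the paper normalizes the entropy before thresholding is an assumption of this sketch, not something the note confirms.

```python
import torch

def partition_and_relabel(probs: torch.Tensor, E: float, num_classes: int):
    """Split target samples into trust/untrust sets by prediction entropy,
    then assign block-wise balanced pseudo-classes to the untrust set.

    probs: (N, C) softmax outputs of the current target model.
    Returns trust-set indices, kept untrust-set indices, and new labels.
    """
    # Shannon entropy of each prediction, normalized to [0, 1] (assumption).
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=1)
    entropy = entropy / torch.log(torch.tensor(float(num_classes)))

    trust_idx = (entropy <= E).nonzero(as_tuple=True)[0]   # supervised with pseudo-labels
    untrust_idx = (entropy > E).nonzero(as_tuple=True)[0]  # sent to diffusion manipulation

    # Block-wise balanced relabeling: the first block of |U|//C samples gets
    # class 0, the next block class 1, ...; the residual tail is discarded.
    # Assumes |U| >= C so that each class receives at least one sample.
    block = untrust_idx.numel() // num_classes
    kept = untrust_idx[: block * num_classes]
    new_labels = torch.arange(kept.numel()) // block
    return trust_idx, kept, new_labels
```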
Target-guided Initialization¶
Motivated by the finding that the starting point of diffusion sampling strongly influences the generated result, a target-domain-guided initialization is constructed via FFT-based frequency decomposition:
- Low-frequency components \(F_x^L = \text{FFT}(x) \odot H\) are extracted from the original untrust sample \(x_l^u\), capturing domain characteristics such as style and texture (here \(H\) denotes a low-pass mask, not the entropy above).
- High-frequency components \(F_{I_G}^H = \text{FFT}(I_G) \odot (1 - H)\) are extracted from semantically neutral Gaussian noise \(I_G\), preventing semantic leakage from the original sample.
- The two are combined via IFFT into a pseudo-image \(\tilde{x}\) that is semantically neutral yet retains target-domain characteristics.
- \(\tilde{x}\) is encoded into latent space as \(\hat{z}_0 = \mathcal{E}(\tilde{x})\), and \(T\) steps of DDPM forward noising yield the sampling starting point \(z_T\).
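The frequency mixing itself fits in a few lines of PyTorch. In this sketch, the box-shaped low-pass mask and the `cutoff` value are illustrative assumptions; the paper defines a mask \(H\), but this note does not record its exact shape.

```python
import torch

def freq_mix(x: torch.Tensor, noise: torch.Tensor, cutoff: float = 0.25) -> torch.Tensor:
    """Mix the low frequencies of target image x with the high frequencies
    of Gaussian noise via 2D FFT. Shapes: (C, H, W)."""
    Fx = torch.fft.fftshift(torch.fft.fft2(x), dim=(-2, -1))
    Fn = torch.fft.fftshift(torch.fft.fft2(noise), dim=(-2, -1))

    # Centered box low-pass mask: 1 near the DC component, 0 elsewhere.
    _, h, w = x.shape
    fy = torch.arange(h).view(-1, 1).float() / h - 0.5
    fx_ = torch.arange(w).view(1, -1).float() / w - 0.5
    mask = ((fy.abs() < cutoff / 2) & (fx_.abs() < cutoff / 2)).float()

    mixed = Fx * mask + Fn * (1.0 - mask)   # F_x^L + F_{I_G}^H
    return torch.fft.ifft2(torch.fft.ifftshift(mixed, dim=(-2, -1))).real

# Then, schematically (vae_encode and q_sample are placeholders):
#   x_tilde = freq_mix(x_u, torch.randn_like(x_u))
#   z0_hat  = vae_encode(x_tilde)      # \hat{z}_0 = E(\tilde{x})
#   z_T     = q_sample(z0_hat, T)      # T-step DDPM forward noising
```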
Semantic Feature Injection¶
At each denoising timestep \(t\), a zigzag self-reflection operation is performed:
- Denoise \(z_t \to z_{t-1}\).
- Re-inject the semantics of the assigned label \(\hat{y}_l\) via DDIM inversion: \(z_{t-1} \to \tilde{z}_t\).
- Apply classifier-free guidance with guidance scale \(\gamma_2\) to ensure semantic alignment.
- Retain only the high-frequency components of \(\tilde{z}_t\) (carrying the semantic information); discard the low-frequency components (which may carry domain noise artifacts).
Domain-specific Feature Preservation¶
At each timestep, two frequency components are combined:
- High frequency: \(F_{\tilde{z}_t}^H\) from semantic injection (target class semantics).
- Low frequency: \(F_{\hat{z}_{0,t}}^L\) extracted from \(\hat{z}_0\) after \(t\) steps of forward noising (target-domain distribution characteristics).
- The enhanced latent \(\tilde{z}'_t\) is synthesized via IFFT, jointly preserving semantic content and domain features.
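Because the FFT is linear, adding the high-pass of one latent to the low-pass of another is equivalent to the mask-and-IFFT fusion described above. One manipulation step can therefore be outlined schematically, with `denoise_step`, `invert_step`, and `low_pass` as placeholders for the scheduler-specific DDPM/DDIM updates and the FFT box filter; this is a structural sketch, not the paper's implementation.

```python
import torch

def manipulation_step(z_t, z0_hat_t, t, denoise_step, invert_step, low_pass):
    """One zigzag self-reflection step at timestep t (schematic).

    denoise_step(z, t) -> z_{t-1}   # CFG-guided denoising toward label y_hat
    invert_step(z, t)  -> z~_t      # DDIM inversion re-injecting y_hat's semantics
    low_pass(z)        -> low-frequency part of z (e.g., an FFT box filter)
    z0_hat_t           -> \\hat{z}_0 after t steps of forward noising
    """
    z_prev = denoise_step(z_t, t)            # z_t -> z_{t-1}
    z_tilde = invert_step(z_prev, t)         # z_{t-1} -> z~_t (self-reflection)
    sem_high = z_tilde - low_pass(z_tilde)   # keep F^H (semantics), drop F^L
    z_fused = sem_high + low_pass(z0_hat_t)  # add F^L of z0_hat (domain features)
    return z_fused                           # enhanced latent z~'_t
```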
Progressive Reconstruction Mechanism¶
\(R\) rounds of iterative optimization are performed:
- After round \(r\), the updated target model repartitions the data into \(V^{(r+1)}\) and \(U^{(r+1)}\).
- As the model improves, \(|V^{(r+1)}| > |V^{(r)}|\), so the untrust set gradually shrinks.
- Less manipulation → smaller domain gap → further model improvement, forming a positive feedback loop.
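A structural sketch of the outer loop, with every callable a placeholder for the corresponding component above (this is not the authors' code):

```python
def dptm_progressive(model, data, predict, partition, manipulate, finetune, R=10):
    """R rounds of repartition -> manipulate -> fine-tune (schematic)."""
    for r in range(R):
        probs = predict(model, data)                    # target-model softmax outputs
        trust_idx, untrust_idx, new_labels = partition(probs)
        pseudo = manipulate(data, untrust_idx, new_labels)  # diffusion manipulation
        # Pseudo-target domain D_p = V ∪ U^m: trusted samples with their
        # pseudo-labels plus manipulated samples with their assigned labels.
        finetune(model, data, trust_idx, probs.argmax(dim=1), pseudo, new_labels)
    return model
```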
Loss & Training¶
- The trust set and manipulated untrust set are merged to form the pseudo-target domain \(\mathcal{D}_p = V \cup U^m\).
- The target model is fine-tuned via standard cross-entropy supervision on \(\mathcal{D}_p\).
- Diffusion model: Pretrained SD v1.5, resolution 512×512, 20 denoising steps.
- Hyperparameters: \(\gamma_1 = 5.5\), \(\gamma_2 = 0\), \(E = 0.01\), \(R = 10\).
- ResNet-50/101 as the adaptation backbone; SGD optimizer; 15K–20K training steps.
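The fine-tuning step itself is plain cross-entropy; a minimal PyTorch sketch follows, where the learning rate, momentum, and batch handling are assumptions, since the note only records the optimizer (SGD) and the step budget (15K–20K).

```python
import torch
import torch.nn.functional as F

def finetune_on_pseudo_target(model, loader, steps=15_000, lr=1e-3, device="cuda"):
    """Cross-entropy fine-tuning on D_p = V ∪ U^m (sketch)."""
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    model.train().to(device)
    it = iter(loader)
    for _ in range(steps):
        try:
            x, y = next(it)
        except StopIteration:        # restart the loader when exhausted
            it = iter(loader)
            x, y = next(it)
        loss = F.cross_entropy(model(x.to(device)), y.to(device))
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model
```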
Key Experimental Results¶
Main Results¶
Office-31 (ResNet-50)
| Method | D→A | W→A | Avg |
|---|---|---|---|
| ProDe (ICLR25) | 79.8 | 79.0 | 89.9 |
| DM-SFDA | 82.7 | 83.5 | 93.7 |
| DPTM (Ours) | 92.0 | 91.7 | 95.8 |
On the challenging tasks D→A and W→A, improvements of +9.3% and +8.2% are achieved, respectively.
Office-Home (ResNet-50, 12 DA tasks)
| Method | Ar→Cl | Pr→Cl | Rw→Cl | Avg |
|---|---|---|---|---|
| ProDe (ICLR25) | 64.0 | 65.4 | 65.5 | 81.1 |
| DM-SFDA | 68.5 | 69.6 | 68.5 | 79.5 |
| DPTM | 86.7 | 86.4 | 87.1 | 91.2 |
Average improvement of +10.1% over ProDe; over 20% gain on the challenging →Cl tasks.
DomainNet-126 (ResNet-50, 12 tasks)
| Method | C→P | Avg |
|---|---|---|
| CPGA | 61.2 | 67.6 |
| ProDe | 79.3 | 81.5 |
| DPTM | 85.6 | 85.2 |
Ablation Study¶
Effect of Threshold \(E\) (Office-Home, \(R=10\))
| \(E\) | Avg |
|---|---|
| 0.001 | 80.7 |
| 0.005 | 86.7 |
| 0.01 | 91.2 |
Given the partition rule (\(H \leq E\) → trust set), a larger \(E\) admits more samples into the trust set \(V\) and leaves fewer for manipulation; within the tested range, performance improves with larger \(E\).
Component Ablation on Manipulation Strategy: Removing any single component (Target-guided Initialization / Semantic Injection / Domain Preservation) leads to failure in either semantic alignment or domain preservation. SD v1.5 and SDXL achieve comparable performance (both 75.6% Avg), with SD v1.5 being more efficient.
Key Findings¶
- The largest gains occur under large domain shifts (e.g., Rw→Cl), validating the pseudo-target domain strategy as a fundamental solution to the domain gap.
- The trust set expands progressively during reconstruction, demonstrating the self-improving positive feedback loop.
- Frequency-domain decomposition effectively decouples semantic and domain features; all three manipulation components are indispensable.
Highlights & Insights¶
- Paradigm Innovation: The first SFDA strategy to generate a pseudo-target domain, fundamentally overcoming the domain shift bottleneck.
- Elegant Frequency-Domain Design: The separation of low-frequency (domain features) and high-frequency (semantic features) via FFT permeates the entire method.
- Progressive Self-Improvement: The positive feedback loop of a shrinking untrust set enables continuous optimization without additional data.
- Strong Performance under Challenging Conditions: Maximum gains (18.6%) are achieved precisely in large domain shift scenarios where other methods typically fail.
Limitations & Future Work¶
- Relies on a pretrained diffusion model (SD v1.5), incurring substantial computational and memory overhead.
- Threshold \(E\) and iteration count \(R\) require tuning, with optimal settings varying across datasets.
- Validated only on classification tasks; extension to detection, segmentation, and other downstream tasks remains unexplored.
- The 512×512 generation resolution of the diffusion model may limit applicability in higher-resolution settings.
Related Work & Insights¶
- Core distinction from DM-SFDA: DM-SFDA generates a pseudo-source domain, whereas DPTM manipulates target data to generate a pseudo-target domain.
- Frequency-domain operations are inspired by FreeInit/FlexiEdit research on the influence of low-frequency components at diffusion starting points.
- Zigzag inversion (Bai et al., 2024) is adopted for injecting semantics along the sampling trajectory.
- Progressive training is related to curriculum learning but differs in direction—here the dataset shrinks rather than task difficulty increasing.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ (Pseudo-target domain generation paradigm is entirely novel; frequency-domain separation is highly original)
- Technical Depth: ⭐⭐⭐⭐⭐ (Deep integration of diffusion models, frequency-domain operations, and progressive optimization)
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ (4 benchmarks, 21 comparison methods, comprehensive ablations)
- Practicality: ⭐⭐⭐⭐ (Significant improvements but with relatively high computational cost)
- Writing Quality: ⭐⭐⭐⭐ (Complex methodology presented clearly in a modular fashion)