ChangeBridge: Spatiotemporal Image Generation with Multimodal Controls for Remote Sensing
Conference: CVPR 2026
arXiv: 2507.04678
Code: GitHub
Area: Image Generation
Keywords: Remote sensing change generation, diffusion bridge model, spatiotemporal image generation, multimodal conditions, change detection data engine
TL;DR
Proposes ChangeBridge, a drift-asynchronous diffusion bridge that performs conditional spatiotemporal generation in remote sensing, synthesizing post-event images from pre-event observations. It supports multimodal controls including coordinate-text prompts, semantic masks, and instance layouts, and serves as a data generation engine for change detection tasks.
Background & Motivation
Background: Remote sensing generation methods have progressed in areas such as layout-to-image synthesis and modality conversion, but conditional spatiotemporal generation (synthesizing future scenes from historical observations and multimodal conditions) remains largely unexplored.
Limitations of Prior Work: Existing change generation methods start from pure noise and can only handle event-driven changes (e.g., new buildings). They fail to model cross-temporal dynamics (e.g., seasonal lighting changes, vegetation growth) and lack a direct correlation between pre- and post-temporal phases.
Key Challenge: Spatiotemporal generation must simultaneously handle heterogeneous evolution—drastic foreground event changes + subtle background temporal dynamics—where the evolution speed and magnitude differ significantly.
Goal: Design a generative model capable of discriminatively processing foreground event changes and background temporal evolution.
Key Insight: Diffusion bridge models to replace pure noise initialization + pixel-level drift magnitude maps to achieve asynchronous evolution.
Core Idea: Drift-asynchronous diffusion bridge—starting from a composite state of the pre-event, using different drift magnitudes to control differentiated generation of foreground and background.
Method
Overall Architecture
Three core modules: (a) Composite bridge initialization—a composite image of pre-event background + conditional foreground serves as the diffusion starting point; (b) Asynchronous drift diffusion—a pixel-level drift map assigns different evolution magnitudes to foreground/background; (c) Drift-aware denoising—the denoising network is conditioned on the drift map. Supports both UNet and DiT backbones.
Key Designs
- Composite Bridge Initialization: Given multimodal conditions \(\mathbf{x}_c\), the foreground mask \(\mathbf{M}_{fg}\) is extracted to construct \(\mathbf{x}_a = \mathbf{M}_{fg} \odot \mathbf{x}_c + (1-\mathbf{M}_{fg}) \odot \mathbf{x}_0\), which serves as the starting point of the diffusion bridge rather than noise. Design motivation: starting from a composite state preserves spatial consistency and temporal continuity better than starting from noise.
- Asynchronous Drift Diffusion: Defines \(\mathbf{d}_{map} = \mathbf{M}_{fg} \cdot \gamma^{fg} + (1-\mathbf{M}_{fg}) \cdot \gamma^{bg}\) (with \(\gamma^{fg}=1.0, \gamma^{bg}=0.8\)), modifying the drift coefficient to \(\tilde{m}_t(i,j) = m_t \cdot \mathbf{z}_d(i,j)\). Design motivation: the foreground requires large-scale generation while the background needs only slight evolution; a uniform drift would leave the two imbalanced.
- Drift-Aware Denoising: The denoising network is conditioned on \(\mathbf{z}_d\) (latent representation of the drift map) and \(\mathbf{z}_c\) (pre-event context). Loss: \(\mathcal{L}_{asy} = \mathbb{E}\left[\|\tilde{m}_t(\mathbf{z}_a - \mathbf{z}_b) + \sqrt{\delta_t}\epsilon - \epsilon_\theta(\mathbf{z}_t, t, \mathbf{z}_a, \mathbf{z}_c, \mathbf{z}_d)\|^2\right]\)
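Since the objective mirrors BBDM's, the forward marginal of the bridge presumably takes the BBDM form with the per-pixel drift \(\tilde{m}_t\) substituted for the scalar schedule (a reconstruction from the loss above and the BBDM formulation, not quoted from the paper):

```latex
\mathbf{z}_t = (1 - \tilde{m}_t)\,\mathbf{z}_b + \tilde{m}_t\,\mathbf{z}_a + \sqrt{\delta_t}\,\epsilon,
\qquad \epsilon \sim \mathcal{N}(\mathbf{0}, \mathbf{I})
```

Under this reading, \(\mathbf{z}_T = \mathbf{z}_a\) (composite start, \(\tilde{m}_T = 1\)) and \(\mathbf{z}_0 = \mathbf{z}_b\) (post-event target, \(\tilde{m}_0 = 0\)), and the network's regression target \(\tilde{m}_t(\mathbf{z}_a - \mathbf{z}_b) + \sqrt{\delta_t}\epsilon\) is exactly \(\mathbf{z}_t - \mathbf{z}_b\).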
Loss & Training
- UNet variant (initialized from SD1.5) trained for 60 epochs; DiT variant (DiT-XL/2) for 100 epochs; Adam with learning rate 1e-4, batch size 64, on 2×A100 GPUs.
- VQGAN encoder + SkyCLIP text encoder.
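The key designs above can be sketched compactly. The following is a minimal NumPy stand-in (function and variable names are my own, not from the released code) for the composite bridge start, the pixel-level drift map, and the regression target of \(\mathcal{L}_{asy}\):

```python
import numpy as np

def composite_init(x0, xc, m_fg):
    # x_a = M_fg * x_c + (1 - M_fg) * x_0: conditional foreground
    # pasted onto the pre-event background x_0
    return m_fg * xc + (1 - m_fg) * x0

def drift_map(m_fg, gamma_fg=1.0, gamma_bg=0.8):
    # d_map = M_fg * gamma_fg + (1 - M_fg) * gamma_bg: full drift on
    # the foreground, damped drift on the background
    return m_fg * gamma_fg + (1 - m_fg) * gamma_bg

def asy_target(z_a, z_b, m_t, d_map, delta_t, eps):
    # Regression target of L_asy: m~_t * (z_a - z_b) + sqrt(delta_t) * eps,
    # with the scalar drift m_t modulated per pixel by the drift map
    m_tilde = m_t * d_map
    return m_tilde * (z_a - z_b) + np.sqrt(delta_t) * eps
```

With `eps = 0` and `m_t = 1`, the target reduces to `d_map * (z_a - z_b)`: the full foreground change but only 0.8× of the background change, which is the asynchronous-evolution behavior the design motivates.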
Key Experimental Results
Main Results
| Condition | Method | FID↓ | IS↑ | Consistency |
|---|---|---|---|---|
| Coord-Text | Instruct-Imagen | 48.17 | 3.70 | CosSim 0.81 |
| Coord-Text | Ours-T | 31.45 | 5.14 | 0.85 |
| Layout (WHU) | Changen2 | 48.85 | 5.64 | IoU 74.33 |
| Layout (WHU) | Ours-T | 40.12 | 6.77 | 78.13 |
| Semantic (SECOND) | Changen2 | 69.43 | 6.18 | mIoU 73.20 |
| Semantic (SECOND) | Ours-T | 59.33 | 6.41 | 74.26 |
Ablation Study
| CB (Composite Bridge) | AD (Asynchronous Drift) | DD (Drift-aware Denoising) | FID↓ | IoU↑ | Description |
|---|---|---|---|---|---|
| ✗ | ✗ | ✗ | 76.81 | 65.29 | SD1.5 Baseline |
| ✓ | ✗ | ✗ | 56.24 (-20.57) | 71.87 | Bridge initialization contributes most |
| ✓ | ✓ | ✓ | 45.47 (-11.59) | 75.30 | Full effect of three components |
Key Findings
- Composite bridge initialization contributes the most (FID drops by 20.57), verifying that starting from a composite state is superior to starting from noise.
- As a data engine: augmenting with 2× synthetic data improves binary change detection (BCD) by +2.26 IoU and change captioning (CC) by +10.97 CIDEr.
- The DiT variant overall outperforms the UNet variant (FID 31.45 vs 38.36).
Highlights & Insights
- First to propose the task of conditional spatiotemporal image generation for remote sensing, filling the gap where change generation could not model temporal dynamics.
- The drift-asynchronous diffusion bridge is the core innovation, introducing spatial adaptive drift within the diffusion bridge.
- Huge potential as a data engine: remote sensing change detection faces severe scarcity of paired data.
- Backbone-agnostic design (applicable to both UNet/DiT), showing high generalizability.
Limitations & Future Work
- Drift magnitudes \(\gamma^{fg}, \gamma^{bg}\) require manual setting; only supports 256×256 resolution.
- Drift modeling for transition areas (e.g., transition zones between old and new buildings) has not been explored.
- Diminishing returns after exceeding 2× synthetic data.
Related Work & Insights
- BBDM provides the theoretical basis for "state-to-state" generation; ChangeBridge extends this with asynchronous drift.
- Direct value for applications like urban planning and disaster assessment.
Rating
- Novelty: ⭐⭐⭐⭐⭐ New technical contribution with drift-asynchronous diffusion bridge; task definition is also pioneering.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ 4 datasets × 6 baselines × 3 conditions + downstream validation.
- Writing Quality: ⭐⭐⭐⭐⭐ Complete mathematical derivations and exquisite illustrations.
- Value: ⭐⭐⭐⭐⭐ Trinity of task definition + method innovation + data engine.