Versatile Transition Generation with Image-to-Video Diffusion¶
- Conference: ICCV 2025
- arXiv: 2508.01698
- Code: Project Page
- Area: Video Generation
- Keywords: video transition generation, image morphing, bidirectional motion prediction, LoRA interpolation, representation alignment regularization
TL;DR¶
This paper proposes VTG, a unified video transition generation framework built upon an image-to-video diffusion model. VTG achieves smooth, high-fidelity transitions across four task categories — object morphing, motion prediction, concept blending, and scene transition — via interpolation-based initialization (noise SLERP + LoRA interpolation + text SLERP), bidirectional motion fine-tuning, and DINOv2 representation alignment regularization.
Background & Motivation¶
- Background: Transition video generation encompasses object morphing (DiffMorpher), video frame interpolation (RIFE, etc.), and scene transition (SEINE), yet each method targets a specific task with no unified framework.
- Limitations of Prior Work: (1) Image morphing methods (e.g., DiffMorpher) produce discontinuous static images rather than temporally coherent frames; (2) video frame interpolation yields unnatural transitions under large content discrepancies; (3) existing frameworks address either morphing/motion prediction or scene transitions, but not both.
- Key Challenge: High-quality transitions must simultaneously satisfy four criteria: semantic similarity, input fidelity, inter-frame smoothness, and text alignment. Random latent initialization in I2V diffusion models causes inter-frame "flickering," and support for only forward motion prediction creates asymmetry between forward and backward inputs.
- Goal: Can a general-purpose transition generator be designed to handle object morphing, concept blending, motion prediction, and scene transitions in a unified manner?
- Key Insight: Three complementary designs are introduced on top of an I2V diffusion model: interpolation-based initialization (handling large content discrepancies), bidirectional motion (eliminating directional asymmetry), and representation alignment (enhancing fidelity).
- Core Idea: Unify four transition task categories via spherical interpolation of noise + LoRA fusion + text SLERP, while bidirectional motion fine-tuning eliminates directional bias.
Method¶
Overall Architecture¶
VTG is built upon the DynamiCrafter pre-trained I2V diffusion model. Given a start frame \(x^1\), an end frame \(x^N\), and corresponding text prompts, VTG comprises an inference stage and a training stage. During inference, DDIM inversion is applied to obtain latent noise at both endpoints, which are then SLERP-interpolated. During training, only the value/output matrices of temporal attention layers and an MLP projector are fine-tuned on 150 high-quality videos.
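To make the SLERP step concrete, here is a minimal, self-contained sketch (not the authors' code; tensor shapes and the `slerp` helper name are illustrative). It also shows why linear interpolation of Gaussian noise is avoided: the spherical interpolant keeps a norm close to the endpoints', while the linear one shrinks it.

```python
import torch

def slerp(z1: torch.Tensor, zN: torch.Tensor, lam: float, eps: float = 1e-7) -> torch.Tensor:
    """Spherical linear interpolation between two latents of the same shape."""
    v1, vN = z1.flatten(), zN.flatten()
    # Angle between the two noise vectors.
    cos_phi = torch.clamp(torch.dot(v1, vN) / (v1.norm() * vN.norm() + eps), -1.0, 1.0)
    phi = torch.acos(cos_phi)
    if phi.abs() < eps:  # nearly parallel: fall back to plain lerp
        return (1 - lam) * z1 + lam * zN
    sin_phi = torch.sin(phi)
    return (torch.sin((1 - lam) * phi) / sin_phi) * z1 + (torch.sin(lam * phi) / sin_phi) * zN

# Toy check: SLERP keeps the interpolant's norm close to the endpoints',
# whereas plain lerp shrinks it (the "unlikely norm" issue noted below).
z1, zN = torch.randn(4, 64, 64), torch.randn(4, 64, 64)
for lam in (0.25, 0.5, 0.75):
    z_slerp = slerp(z1, zN, lam)
    z_lerp = (1 - lam) * z1 + lam * zN
    print(f"lam={lam}: slerp norm={z_slerp.norm().item():.1f}, lerp norm={z_lerp.norm().item():.1f}")
```

In VTG this interpolation is applied to the DDIM-inverted endpoint latents and injected only during the early denoising steps.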
Key Designs¶
- Interpolation-based Initialization:
- Function: Mitigates abrupt changes caused by random noise, preserves object identity, and handles large content discrepancies.
- Mechanism: Triple interpolation — (1) Noise SLERP: DDIM inversion is applied to both endpoints to obtain \(z_{t1}\) and \(z_{tN}\); intermediate frame noise is correlated via spherical linear interpolation \(z_{tn} = \frac{\sin((1-\lambda)\phi)}{\sin\phi}z_{t1} + \frac{\sin(\lambda\phi)}{\sin\phi}z_{tN}\), injected only in early denoising steps. (2) LoRA Interpolation: Separate LoRAs \(\Delta\theta_1, \Delta\theta_N\) are trained for each endpoint (only 200 steps), then linearly interpolated as \(\Delta\theta = (1-\lambda_{LoRA})\Delta\theta_1 + \lambda_{LoRA}\Delta\theta_N\) to fuse semantics. (3) Frame-aware Text SLERP: Text embeddings \(c_1, c_N\) from both endpoints are interpolated via SLERP as \(c_{\lambda} = \text{SLERP}(c_1, c_N, \lambda_{text})\) to achieve per-frame text-conditioned transitions. (The noise SLERP is sketched above; a LoRA and text interpolation sketch follows this list.)
- Design Motivation: Linear interpolation in Gaussian latent space produces unlikely norms; SLERP preserves Euclidean norms and enables in-distribution sampling. LoRA captures high-level semantics absent in image diffusion models. Text SLERP addresses the inability of a single caption to describe the blended meaning of intermediate frames.
- Bidirectional Motion Prediction:
- Function: Eliminates the quality asymmetry that arises in I2V diffusion models when the start and end frames are swapped, i.e., the result should not depend on which endpoint is used as the conditioning image.
- Mechanism: The temporal self-attention map is rotated by 180° to invert attention relationships, while the temporal dimension of the noise latent is simultaneously reversed. A forward U-Net and a backward U-Net predict forward and backward motion noise, respectively. The backward prediction is re-reversed and fused with the forward prediction: \(\epsilon_t = (1-\lambda_{BMP})\epsilon_{t,i} + \lambda_{BMP}\epsilon'_{t,N-i}\) (with \(\lambda_{BMP}=0.5\)). Only the value and output matrices of temporal attention layers are fine-tuned. Loss: \(\mathcal{L}_{BMP} = \|\text{flip}(\epsilon_t) - \epsilon_{\theta_{w,o}}(z_{t'}, c, t, A'_{i,j})\|_2^2\). (A fusion sketch follows this list.)
- Design Motivation: I2V models are biased toward similarity with the start frame (conditional image leakage) and are pre-trained only for forward motion. Bidirectional fusion ensures a consistent motion trajectory.
- Representation Alignment Regularization:
- Function: Enhances the fidelity of generated transition frames and reduces blurriness.
- Mechanism: Intermediate diffusion latents are patchified per frame and aligned to DINOv2 features via a trainable MLP projector. Cosine similarity is computed per patch: \(\mathcal{L}_{RAR} = -\sum_{n=1}^{N}\mathbb{E}[\frac{1}{P}\sum_{p=1}^{P}\text{sim}(y_*^{[p]}, y_\phi(h_t)^{[p]})]\). The DINOv2 encoder and MLP are discarded at inference time. (A loss sketch follows this list.)
- Design Motivation: Diffusion latents inherently lack high-frequency semantics, whereas DINOv2 features contain rich self-supervised semantic information. Distilling DINOv2 features into the diffusion process during training incurs zero overhead at inference.
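For the interpolation-based initialization, the LoRA-weight and frame-aware text interpolations reduce to the few lines below. This is a sketch under assumptions (LoRA deltas stored as a flat state dict; reuse of the `slerp` helper from the earlier sketch), not the released implementation.

```python
from typing import Dict
import torch

def interpolate_lora(lora_1: Dict[str, torch.Tensor],
                     lora_N: Dict[str, torch.Tensor],
                     lam: float) -> Dict[str, torch.Tensor]:
    """Linearly blend the two per-endpoint LoRA deltas: Δθ = (1-λ)Δθ_1 + λΔθ_N."""
    return {k: (1 - lam) * lora_1[k] + lam * lora_N[k] for k in lora_1}

def framewise_text_conditions(c_1: torch.Tensor, c_N: torch.Tensor, num_frames: int) -> torch.Tensor:
    """SLERP the two endpoint text embeddings once per frame index (assumes c_* is (tokens, dim))."""
    conds = []
    for n in range(num_frames):
        lam_text = n / max(num_frames - 1, 1)      # 0 at the start frame, 1 at the end frame
        conds.append(slerp(c_1, c_N, lam_text))    # reuses the `slerp` helper sketched earlier
    return torch.stack(conds)                      # (N, tokens, dim)
```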
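For bidirectional motion prediction, the inference-time fusion is essentially flip, predict, flip back, and average. The sketch below assumes hypothetical `unet_forward`/`unet_backward` callables and (B, C, N, H, W) video latents; the 180° rotation of the temporal attention map described above is assumed to happen inside the backward branch and is not shown.

```python
import torch

def bmp_fuse(z_t, cond, t, unet_forward, unet_backward, lam_bmp: float = 0.5):
    """Fuse forward and backward noise predictions: ε_t = (1-λ)ε_fwd + λ·flip(ε_bwd)."""
    eps_fwd = unet_forward(z_t, cond, t)                           # forward-motion noise prediction
    z_rev = torch.flip(z_t, dims=[2])                              # reverse the temporal (frame) axis
    eps_bwd = torch.flip(unet_backward(z_rev, cond, t), dims=[2])  # predict, then re-reverse
    return (1 - lam_bmp) * eps_fwd + lam_bmp * eps_bwd             # λ_BMP = 0.5 removes directional bias
```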
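For the representation alignment regularizer, a per-patch cosine-similarity loss against frozen DINOv2 features could look like the following; the projector width and feature shapes are assumptions (the batch expectation in the formula is omitted).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Projector(nn.Module):
    """Trainable MLP mapping per-patch diffusion features to the DINOv2 feature dimension."""
    def __init__(self, in_dim: int, out_dim: int, hidden: int = 1024):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, hidden), nn.GELU(), nn.Linear(hidden, out_dim))

    def forward(self, h: torch.Tensor) -> torch.Tensor:   # h: (N_frames, P_patches, in_dim)
        return self.net(h)

def rar_loss(h_t: torch.Tensor, dino_feats: torch.Tensor, projector: Projector) -> torch.Tensor:
    """Negative per-patch cosine similarity, averaged over patches and summed over frames."""
    y_pred = projector(h_t)                                # (N, P, D)
    sim = F.cosine_similarity(y_pred, dino_feats, dim=-1)  # (N, P)
    return -sim.mean(dim=1).sum()
```

Because DINOv2 and the projector are only used to compute this loss, both can be dropped at inference, which is why RAR adds no sampling overhead.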
Loss & Training¶
Lightweight fine-tuning on only 150 high-quality videos. BMP fine-tunes temporal attention V/O matrices; RAR trains the MLP projector. AdamW optimizer, learning rate 1e-5, ~20K iterations on 4 A100 GPUs. LoRA training requires only 200 steps (~85 seconds) per input pair. DDIM sampling with 50 steps.
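A rough sketch of the parameter selection described above; the module-name matching is an assumption about the DynamiCrafter UNet layout, not the actual training script.

```python
from torch.optim import AdamW

def build_optimizer(unet, projector, lr: float = 1e-5) -> AdamW:
    """Freeze everything except temporal-attention value/output matrices and the MLP projector."""
    params = []
    for name, p in unet.named_parameters():
        # 'temporal', 'to_v', 'to_out' name matching is an assumption about the module naming;
        # the released code may organize these layers differently.
        train = ("temporal" in name) and ("to_v" in name or "to_out" in name)
        p.requires_grad_(train)
        if train:
            params.append(p)
    params += list(projector.parameters())
    return AdamW(params, lr=lr)  # lr = 1e-5 per the training details above
```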
Key Experimental Results¶
Main Results¶
| Method | MorphBench FID↓ | MorphBench PPL↓ | TC-Bench TCR↑ | Smoothness↑ |
|---|---|---|---|---|
| DiffMorpher | 70.49 | 18.19 | 41.82 | — |
| SEINE | 82.03 | 47.72 | — | — |
| DynamiCrafter | 87.32 | 42.09 | — | — |
| TVG | 86.92 | 35.18 | — | — |
| VTG (Ours) | 67.39 | 22.80 | Best | Best |
Ablation Study¶
| Configuration | FID↓ | PPL↓ | Note |
|---|---|---|---|
| Full VTG | Best | Best | Complete model |
| w/o Noise SLERP | ↑ | ↑ | Random abrupt changes in intermediate frames |
| w/o LoRA Interpolation | ↑ | — | Insufficient semantic fusion |
| w/o Text SLERP | ↑ | — | No per-frame text conditioning |
| w/o BMP | ↑ | ↑ | Forward/backward directional asymmetry |
| w/o RAR | ↑ | — | Loss of fine-grained detail |
Key Findings¶
- VTG significantly outperforms DiffMorpher on object morphing (FID 67.39 vs. 70.49), as DiffMorpher lacks temporal modeling.
- In concept blending, VTG generates semantically plausible intermediate states (e.g., a truck with lion coloring and proportions), whereas baselines produce abrupt transitions.
- A bidirectional motion weight of \(\lambda_{BMP}=0.5\) effectively eliminates directional bias.
- RAR yields the most pronounced gains in high-frequency texture scenarios (bicycle spokes, fabric patterns).
Highlights & Insights¶
- A unified definition and framework for four transition task categories: object morphing, concept blending, motion prediction, and scene transition.
- The triple interpolation strategy (noise + LoRA + text) forms a logically complementary hierarchy: structural consistency at the noise level, semantic fusion at the LoRA level, and conditional guidance at the text level.
- Construction of the TransitBench benchmark: 200 start–end frame pairs, providing the first standardized evaluation for concept blending and scene transition.
- Zero inference overhead for RAR: DINOv2 is used solely for training-time regularization.
Limitations & Future Work¶
- LoRA training requires ~85 seconds per input pair, which becomes costly for batch generation.
- Only 150 training videos limit motion diversity.
- The framework is based on DynamiCrafter (UNet architecture) and may be transferable to more recent DiT architectures.
- TransitBench is relatively small (200 pairs) and could be substantially expanded.
Related Work & Insights¶
- vs. DiffMorpher: Image diffusion-based morphing lacks temporal modeling, producing discontinuous frame sequences.
- vs. SEINE: Uses randomly masked frame conditioning for scene transitions but performs poorly on concept blending.
- vs. Generative Inbetweening: Fuses forward and backward noise but overlooks identity preservation and large content discrepancies.
Rating¶
- Novelty: ⭐⭐⭐⭐ Unified framework for four transition tasks + triple interpolation strategy
- Experimental Thoroughness: ⭐⭐⭐⭐ Benchmarks for all four task categories + new TransitBench
- Writing Quality: ⭐⭐⭐⭐ Clear problem formulation and logically coherent method components
- Value: ⭐⭐⭐⭐ Unified paradigm for transition generation with practical value for video editing and filmmaking