Mind the Time: Temporally-Controlled Multi-Event Video Generation¶

Conference: CVPR 2025
arXiv: 2412.05263
Code: None (project page available)
Area: Video Generation
Keywords: Multi-Event Video Generation, Temporal Control, Position Embedding, Diffusion Transformer, Video Generation

TL;DR¶

This work proposes MinT, the first multi-event video generator that supports temporal control over individual events. It uses a Rescaled RoPE (ReRoPE) position embedding to bind each event description to a specific time period, and achieves smooth, coherent multi-event video synthesis by fine-tuning a pre-trained video DiT.

Background & Motivation¶

Existing video generation models rely on a single text prompt to describe the entire video, facing fundamental challenges when generating videos containing multiple sequential events:

Event Omission: When multiple event descriptions are concatenated into one long prompt, the model often generates only a subset of the events and ignores the rest (in Fig. 2, CogVideoX, Mochi, Kling, and Gen-3 each generate only some of the events).
Order Disorganization: Even if multiple events are generated, they are often not arranged in the correct temporal sequence.
Fixed Duration: Existing multi-event methods (e.g., generating event-by-event autoregressively) cannot control the duration of individual events, forcing all events to have the same duration.
Unnatural Transitions: Stitching independently generated video clips results in abrupt scene cuts.

Key Insight: Just as bounding boxes bind objects to spatial locations, events should be bound to specific time intervals. If each event has an explicit temporal range, the model can focus on one event at a time, and the event sequence is scheduled naturally.

Method¶

Overall Architecture¶

MinT is built upon a pre-trained video DiT (latent Diffusion Transformer). The inputs consist of: (1) a global caption describing the background and subject appearance, injected via the original cross-attention layers; (2) a series of temporal captions $(c_n, t_n^{start}, t_n^{end})$ describing dynamic events and their corresponding time intervals, injected via newly introduced temporal cross-attention layers. This decoupled design of global and temporal prompts resembles the classic content-motion decoupling paradigm in video generation.
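
The two prompt streams described above can be pictured as a small data structure. The following is only a sketch: the class names `TemporalCaption` and `MultiEventPrompt`, the use of seconds as the time unit, and the rule that events must not overlap are illustrative assumptions, not details taken from the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class TemporalCaption:
    """One event: its text plus the time interval it is bound to (assumed seconds)."""
    text: str
    start: float
    end: float


@dataclass(frozen=True)
class MultiEventPrompt:
    """Hypothetical container for the global caption and the temporal captions."""
    global_caption: str            # background and subject appearance
    events: List[TemporalCaption]  # dynamic events, in temporal order

    def validate(self, video_length: float) -> None:
        """Check that every interval is non-empty, inside the video, and ordered."""
        previous_end = 0.0
        for event in self.events:
            if not (0.0 <= event.start < event.end <= video_length):
                raise ValueError(f"bad interval for event: {event.text!r}")
            # Assumption: events are sequential and do not overlap.
            if event.start < previous_end:
                raise ValueError(f"event overlaps its predecessor: {event.text!r}")
            previous_end = event.end


prompt = MultiEventPrompt(
    global_caption="A man in a gray shirt sits at a wooden desk.",
    events=[
        TemporalCaption("He lifts both arms.", 0.0, 2.5),
        TemporalCaption("He lowers his arms and picks up a cup.", 2.5, 6.0),
    ],
)
prompt.validate(video_length=6.0)  # raises ValueError if the schedule is malformed
```

In this picture, `global_caption` would feed the original cross-attention layers and `events` the temporal cross-attention layers.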

Key Designs¶

Rescaled RoPE (ReRoPE) Position Embedding: The core technical contribution. To associate the event text embeddings with the video tokens of the corresponding temporal segments, an improved rotary position embedding is employed in the temporal cross-attention.

Key Problem: Directly using raw timestamps as the RoPE rotation angle (vanilla RoPE) fails. When adjacent events have different durations, some frames belonging to event A can lie closer to the midpoint of event B than to the midpoint of A, so their attention peaks on the wrong event.

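Solution: Every event is rescaled to the same length $L$. A timestamp $t$ inside the $n$-th event, $t \in [t_n^{start}, t_n^{end}]$, is mapped to $$\tilde{t} = \frac{(t - t_n^{start})L}{t_n^{end} - t_n^{start}} + (n-1)L$$ so that event $n$ always occupies the interval $[(n-1)L, nL]$ on the rescaled axis.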

ReRoPE satisfies three desirable properties: (i) frames within an event's temporal range always attend most strongly to that event's text; (ii) the attention weight peaks at the event's midpoint and decays towards the boundaries; (iii) frames at event transition points attend equally to the two adjacent events, facilitating a smooth transition.

Using a fixed $L=8$ allows videos of different lengths to be scaled to the same position embedding space, decoupling the layer's behavior from the actual video duration.
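
The rescaling can be checked on a small example. The sketch below implements the mapping formula and compares it with raw timestamps for one long and one short event. Two simplifications are assumptions made for illustration: each event's text is anchored at the midpoint of its interval, and "distance to the nearest anchor" stands in for the actual rotary attention score. The helper names are hypothetical.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (start, end) of one event


def rescale_time(t: float, events: List[Interval], L: float = 8.0) -> float:
    """Map timestamp t onto the rescaled axis where every event has length L."""
    for n, (start, end) in enumerate(events, start=1):
        if start <= t <= end:
            return (t - start) * L / (end - start) + (n - 1) * L
    raise ValueError(f"t={t} is not covered by any event")


def event_anchor(n: int, L: float = 8.0) -> float:
    """Assumed position of event n's text on the rescaled axis: its midpoint."""
    return (n - 0.5) * L


def nearest_event(position: float, anchors: List[float]) -> int:
    """1-based index of the anchor closest to the given position."""
    return min(range(len(anchors)), key=lambda i: abs(position - anchors[i])) + 1


events = [(0.0, 8.0), (8.0, 10.0)]  # a long event A followed by a short event B
frame_t = 7.5                       # this frame belongs to event A

# Vanilla RoPE: raw timestamps, text anchored at the raw midpoints 4.0 and 9.0.
raw_anchors = [(s + e) / 2 for s, e in events]
print(nearest_event(frame_t, raw_anchors))                         # 2 (wrong event)

# ReRoPE: both events are stretched to length L, anchors sit at 4.0 and 12.0.
rescaled_anchors = [event_anchor(1), event_anchor(2)]
print(nearest_event(rescale_time(frame_t, events), rescaled_anchors))  # 1 (correct)

# A frame exactly on the boundary is equidistant from both anchors.
boundary = rescale_time(8.0, events)
print(abs(boundary - rescaled_anchors[0]), abs(boundary - rescaled_anchors[1]))  # 4.0 4.0
```

The boundary case corresponds to property (iii): a transition frame is equally far from both events on the rescaled axis.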

Scene Cut Conditioning: Scene cuts are treated as special events, each represented by a learnable vector $e^{cut} \in \mathbb{R}^{1 \times D^c}$. Their timestamps are encoded via ReRoPE, and the resulting embeddings are concatenated with the event embeddings for cross-attention. During training, videos containing scene cuts (20% of the dataset) are kept rather than filtered out, so the model learns to control scene transitions based on the presence or absence of the cut token. During inference, inputting a zero vector avoids unwanted cuts.
Prompt Enhancer: GPT-4 is utilized to expand simple user prompts into detailed global captions and multiple timestamped event captions, allowing users to generate motion-rich multi-event videos with simple inputs. The LLM is responsible for planning the temporal structure of the events.

Loss & Training¶

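Training is based on Rectified Flow: $\mathcal{L}_{DiT} = \|v_t - u_\theta(z_t, t, y)\|^2$, where $v_t = \epsilon_t - z$ is the target velocity, $z$ is the clean video latent, $\epsilon_t$ is the sampled noise, $z_t$ is the noisy latent at diffusion time $t$, $u_\theta$ is the DiT's prediction, and $y$ is the conditioning. Here $t$ denotes the diffusion timestep, not the video timestamp used in ReRoPE.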
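
The loss above can be written out for a single sample as follows. This is a minimal sketch, not the authors' training code: the linear interpolation path $z_t = (1-t)z + t\epsilon$ is the usual rectified-flow convention and is assumed here because it matches the stated target $\epsilon - z$; the `model` callable and its signature are hypothetical.

```python
import random
from typing import Callable, List

Vector = List[float]
# Hypothetical signature: (noisy latent z_t, diffusion time t, conditioning y) -> velocity
VelocityModel = Callable[[Vector, float, object], Vector]


def rectified_flow_loss(model: VelocityModel, z: Vector, y: object,
                        rng: random.Random) -> float:
    """Single-sample rectified-flow loss ||v_t - u(z_t, t, y)||^2."""
    t = rng.random()                                  # diffusion time in [0, 1)
    eps = [rng.gauss(0.0, 1.0) for _ in z]            # Gaussian noise sample
    # Assumed linear path between the clean latent (t=0) and pure noise (t=1).
    z_t = [(1.0 - t) * zi + t * ei for zi, ei in zip(z, eps)]
    v_t = [ei - zi for zi, ei in zip(z, eps)]         # target velocity eps - z
    u = model(z_t, t, y)                              # predicted velocity
    return sum((vi - ui) ** 2 for vi, ui in zip(v_t, u))


clean = [0.5, -1.0, 2.0]                              # toy "latent"
zero_model = lambda z_t, t, y: [0.0] * len(z_t)       # predicts no motion at all
print(rectified_flow_loss(zero_model, clean, y=None, rng=random.Random(0)))
```

A model that predicts the exact velocity drives this loss to zero; the zero predictor above leaves it at $\|\epsilon - z\|^2$.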

Manually annotated temporal events (including start and end times) for ~200K videos, extracted from existing datasets.
Automatically detected scene cut boundaries using TransNetV2.
Full-model fine-tuning where only the parameters of the newly added temporal cross-attention layers are learned.
AdamW optimizer, batch size 512, 12K steps.
Inference with 256 denoising steps, CFG scale = 8.
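
To make the inference settings concrete, the sketch below combines classifier-free guidance with a fixed-step sampler. Several things here are assumptions: the source names neither the sampler nor how the unconditional branch is formed, so a plain Euler integrator and the common guidance formula `u_uncond + scale * (u_cond - u_uncond)` are used for illustration, and both function names are hypothetical.

```python
from typing import Callable, List

Vector = List[float]


def cfg_combine(u_cond: Vector, u_uncond: Vector, scale: float = 8.0) -> Vector:
    """Classifier-free guidance: push the prediction away from the unconditional one."""
    return [uu + scale * (uc - uu) for uc, uu in zip(u_cond, u_uncond)]


def euler_sample(velocity: Callable[[Vector, float], Vector],
                 noise: Vector, num_steps: int = 256) -> Vector:
    """Integrate dz/dt = velocity from t=1 (noise) down to t=0 (data) with Euler steps."""
    z = list(noise)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = 1.0 - i * dt              # current diffusion time, from 1 down to dt
        u = velocity(z, t)            # in practice: the guided model prediction
        z = [zi - dt * ui for zi, ui in zip(z, u)]
    return z


# Toy check: for a single target point, the exact velocity field is (z - target) / t.
target = [1.0, -2.0]
toy_velocity = lambda z, t: [(zi - gi) / t for zi, gi in zip(z, target)]
sample = euler_sample(toy_velocity, noise=[0.3, 0.7])
print(sample)  # lands on the target up to floating-point error
```

In a real pipeline, `velocity` would call the model twice per step (with and without conditioning) and pass both outputs through `cfg_combine`.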

Key Experimental Results¶

Main Results (T2V HoldOut Dataset)¶

Method	VQ ↑	DD ↑	CLIP-T ↑	TA ↑	TC ↑	#Cuts ↓
Concat (base)	2.61	3.32	0.247	2.37	2.45	0.020
AutoReg	2.39	2.97	0.267	2.96	2.10	0.056
MEVG	2.50	3.39	0.264	2.68	2.15	0.120
MinT	2.56	3.32	0.270	2.92	2.44	0.026

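MinT improves event text alignment (TA +0.55 vs. Concat) and temporal consistency (TC +0.29 vs. MEVG) while keeping visual quality close to the base model (VQ 2.56 vs. 2.61). AutoReg scores marginally higher on TA (2.96) and Concat on TC (2.45), but each is weak on the other metric; MinT is the only method that is strong on both while keeping the cut rate low (0.026).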
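
The deltas quoted above can be recomputed directly from the table. The snippet below is only an arithmetic check on the reported numbers; the dictionary layout and the helper name `delta_vs` are illustrative.

```python
# Scores copied from the T2V HoldOut table above.
scores = {
    "Concat":  {"TA": 2.37, "TC": 2.45, "Cuts": 0.020},
    "AutoReg": {"TA": 2.96, "TC": 2.10, "Cuts": 0.056},
    "MEVG":    {"TA": 2.68, "TC": 2.15, "Cuts": 0.120},
    "MinT":    {"TA": 2.92, "TC": 2.44, "Cuts": 0.026},
}


def delta_vs(baseline: str, metric: str) -> float:
    """MinT's score minus the baseline's score, rounded to table precision."""
    return round(scores["MinT"][metric] - scores[baseline][metric], 3)


for name in ("Concat", "AutoReg", "MEVG"):
    print(name, {m: delta_vs(name, m) for m in ("TA", "TC", "Cuts")})
```

Positive TA and TC deltas favor MinT; for `Cuts`, lower is better, so a negative delta favors MinT.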

Ablation Study¶

Configuration	CLIP-T ↑	TA ↑	TC ↑	#Cuts ↓	Description
Full Model (ReRoPE L=8)	0.270	2.92	2.44	0.026	Best balance
Concat time	0.249	2.42	2.33	0.075	No absolute position $\rightarrow$ unable to associate events
Hard attn mask	0.260	2.68	2.30	0.069	Hard boundary cut $\rightarrow$ abrupt transitions
Vanilla RoPE	0.262	2.79	2.42	0.030	Inaccurate localization for varying-duration events
No cut condition	0.268	2.89	2.34	0.084	No cut control $\rightarrow$ increased unintended cuts

Key Findings¶

Prompt Enhancement Boosts Dynamics: On VBench, the Dynamic Degree is only 0.481 with short prompts, increases to 0.517 after global-caption enhancement, and reaches 0.711 (+47.8%) when temporal captions are added, while preserving visual quality.
Effectiveness in I2V: In image-to-video generation, MinT demonstrates an even more notable advantage with an FID of 22.04 vs. MEVG's 57.57, and an FVD of 218.21 vs. 495.75.
Human Preference Evaluation: MinT significantly outperforms all baselines in event-text alignment, temporal precision, and transition smoothness (with a win rate exceeding 60%).
Robustness to L Parameter: Performance is insensitive to the choice of $L$; $L=4$, $8$, and $16$ achieve comparable results.
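
As a worked check of the first finding, the relative gains follow from the reported VBench Dynamic Degree values alone: $$\frac{0.711 - 0.481}{0.481} \approx 0.478 \;(+47.8\%), \qquad \frac{0.517 - 0.481}{0.481} \approx 0.075 \;(+7.5\%)$$ Most of the gain in dynamics therefore comes from adding temporal captions rather than from enriching the global caption alone.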

Highlights & Insights¶

Remarkably Elegant Intuition behind ReRoPE: The challenging position embedding issue for events of unequal duration is resolved through a simple "rescale to equal length" operation, which can be mathematically proven to satisfy three desirable properties.
Scene Cut Conditioning is a Crucial yet Overlooked Dimension: Previous methods either discarded training data with cuts or suffered from uncontrolled transitions during inference. Modeling this explicitly as a controllable prompt is a clever engineering choice.
Content-Motion Decoupling with Global + Temporal Captions: This steering mechanism guides video dynamics more effectively than a single prompt and is the temporal counterpart of spatial binding with bounding boxes.
All-at-Once Generation vs. Autoregressive Block-by-Block Generation: The former yields awareness of the global temporal structure, whereas the latter lacks future event planning.

Limitations & Future Work¶

High annotation cost: training requires manually annotated temporal event intervals (start and end times) for ~200K videos.
Limited to 512×288 resolution and a maximum duration of 12 seconds, rendering it unable to handle truly long videos.
While transitions between events are smoother than baselines, they may still appear unnatural when the subject's appearance changes drastically.
LLM prompt enhancement introduces latency and optimization overhead, and the temporal structure planned by LLMs is not always optimal.

Related Work & Insights¶

Phenaki pioneered multi-event generation, but its autoregressive paradigm leads to error accumulation and quality degradation.
MEVG utilizes DDIM inversion to initialize subsequent event noise for consistency, which fails under large subject changes.
The proposed ReRoPE paradigm can be extended to other generation tasks requiring temporal localization (e.g., audio-video sync, localization-aware generation).
The scene cut conditioning closely mirrors the practice of handling cropping/aspect ratios in image generation.

Rating¶

Novelty: ⭐⭐⭐⭐⭐ Pioneers temporally-controlled multi-event video generation, featuring an elegant ReRoPE design with theoretical support.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ Extensively validated across T2V, I2V, and prompt enhancement settings with both automatic metrics and human evaluations, complemented by robust ablation studies.
Writing Quality: ⭐⭐⭐⭐⭐ Well-defined problems, rigorous logical progression from motivation to method to experiments, and rich visualizations.
Value: ⭐⭐⭐⭐⭐ Addresses a highly practical and significant problem in video generation, pointing out a promising new direction.