DOVE: Efficient One-Step Diffusion Model for Real-World Video Super-Resolution¶
Conference: NeurIPS 2025 · arXiv: 2505.16239 · Code: Available
Area: Image Generation / Diffusion Models / Video Super-Resolution
Keywords: One-step diffusion, video super-resolution, CogVideoX, latent-pixel space training, video data pipeline
TL;DR¶
This paper presents DOVE, a video super-resolution model built upon the CogVideoX pretrained video generation model. Through a two-stage latent-pixel space training strategy and a curated high-quality HQ-VSR dataset, DOVE achieves single-step inference for video super-resolution, delivering 28× speedup over multi-step diffusion methods while achieving comparable or superior performance.
Background & Motivation¶
Existing diffusion models demonstrate strong performance in real-world video super-resolution (VSR), yet face two major bottlenecks:
Inefficient multi-step sampling: Typical methods require tens of sampling steps; processing a 33-frame 720p video on an A100 GPU takes 173–425 seconds.
Overhead from auxiliary modules: Components such as ControlNet and temporal layers further slow inference.
One-step inference has been successfully demonstrated in image SR, but has not yet been realized for VSR. The key challenges are:
- Prohibitive training cost for video: methods such as DMD/VSD require joint optimization of multiple networks, which is infeasible in the video domain.
- High fidelity requirements: the instability of adversarial training introduces undesirable artifacts in VSR.
Method¶
Overall Architecture¶
DOVE fine-tunes CogVideoX1.5 (a T2V pretrained model) with the following core design decisions:
- No additional modules are introduced (no ControlNet, no optical flow modules, no temporal layers); only the Transformer is fine-tuned.
- The LR video is bilinearly upsampled and encoded into the latent space \(z_{lr}\) via the VAE.
- \(z_{lr}\) is treated as a noisy latent at timestep \(t=399\) (rather than \(t=999\)), since the LR input already contains sufficient structural information.
- Single-step v-prediction denoising: \(z_{sr} = \sqrt{\bar{\alpha}_t} \cdot z_{lr} - \sqrt{1-\bar{\alpha}_t} \cdot v_\theta(z_{lr}, c, t)\)
- An empty text prompt is used and pre-encoded to reduce inference overhead.
- The output video \(x_{sr}\) is obtained by VAE decoding.
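The single-step recovery is the standard v-parameterization identity. A minimal numerical sketch, with NumPy stand-ins for the latents and the denoiser output, and a hypothetical schedule value `abar` at \(t=399\):

```python
import numpy as np

def one_step_v_prediction(z_t, v, alpha_bar_t):
    """Recover the clean latent from a single v-prediction:
    z_sr = sqrt(alpha_bar_t) * z_t - sqrt(1 - alpha_bar_t) * v,
    where z_t plays the role of the noisy latent (here, the encoded LR video)."""
    return np.sqrt(alpha_bar_t) * z_t - np.sqrt(1.0 - alpha_bar_t) * v

# Consistency check against the forward diffusion process:
rng = np.random.default_rng(0)
z0 = rng.standard_normal((2, 4, 8, 8))              # toy "clean" latent
eps = rng.standard_normal(z0.shape)
abar = 0.37                                         # hypothetical alpha-bar at t=399
z_t = np.sqrt(abar) * z0 + np.sqrt(1 - abar) * eps  # forward process
v = np.sqrt(abar) * eps - np.sqrt(1 - abar) * z0    # ground-truth v target
assert np.allclose(one_step_v_prediction(z_t, v, abar), z0)
```

The check confirms that when the network predicts the exact v target, one application of the formula recovers the clean latent, which is what makes a single denoising step sufficient in principle.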
Key Designs¶
Latent-Pixel Two-Stage Training Strategy¶
The core innovation is the use of regression losses (rather than distillation or adversarial losses) for efficient training:
**Stage-1: Adaptation (Latent Space)**
- Minimizes the MSE between the predicted latent \(z_{sr}\) and the HR latent \(z_{hr}\) in latent space.
- Exploits the VAE's high compression ratio for computational efficiency, enabling training on longer frame sequences.
- Trained for 10,000 steps; learning rate \(2\times10^{-5}\); video resolution \(320\times640\); 25 frames.
**Stage-2: Refinement (Pixel Space)**
- After latent-space training, \(z_{sr}\) is close to \(z_{hr}\), but decoding through the VAE amplifies residual discrepancies.
- Pixel-space training directly optimizes the gap between \(x_{sr}\) and \(x_{hr}\).
- Image-video mixed training addresses the GPU memory bottleneck of pixel-space video training:
  - Images (treated as single-frame videos) make up a fraction \(\varphi=0.8\) of the training data, for which pixel-space training is tractable.
  - Videos are VAE-encoded/decoded frame by frame to avoid multi-frame memory peaks, while the Transformer still operates on the full latent sequence.
- Trained for only 500 steps; learning rate \(5\times10^{-6}\).
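The frame-by-frame VAE trick can be sketched as follows. `encode_frame`/`decode_frame` are hypothetical toy stand-ins for the real (frozen) CogVideoX VAE; only the per-frame looping pattern, which bounds peak memory by a single frame, reflects the paper:

```python
import numpy as np

# Toy stand-ins for the frozen VAE: a 2x spatial "compression" and its inverse.
def encode_frame(frame):
    return frame[:, ::2, ::2]

def decode_frame(latent):
    return np.repeat(np.repeat(latent, 2, axis=1), 2, axis=2)

def encode_video_framewise(video):
    """Encode a (T, C, H, W) video one frame at a time, so peak activation
    memory is bounded by a single frame; the stacked latents are then handed
    to the Transformer, which still sees the full temporal sequence."""
    return np.stack([encode_frame(f) for f in video])

def decode_video_framewise(latents):
    return np.stack([decode_frame(z) for z in latents])

video = np.zeros((25, 3, 32, 32))          # 25-frame toy clip
latents = encode_video_framewise(video)
assert latents.shape == (25, 3, 16, 16)
assert decode_video_framewise(latents).shape == video.shape
```

Note this only works because the Transformer operates in latent space: the memory-hungry pixel-space encode/decode is the part that can be serialized per frame without changing what the denoiser sees.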
**Stage-2 Loss Functions**
- Image: \(\mathcal{L}_{\text{s2-image}} = \text{MSE}(\hat{x}_{sr}, \hat{x}_{hr}) + \lambda_1 \cdot \text{DISTS}(\hat{x}_{sr}, \hat{x}_{hr})\)
- Video: \(\mathcal{L}_{\text{s2-video}} = \text{MSE} + \lambda_1 \cdot \text{DISTS} + \lambda_2 \cdot \mathcal{L}_{\text{frame}}\)
- The frame difference loss \(\mathcal{L}_{\text{frame}}\) enforces temporal consistency by aligning adjacent-frame differences \(\Delta x_{sr}\) with \(\Delta x_{hr}\).
- \(\lambda_1 = \lambda_2 = 1\)
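The frame difference loss can be sketched directly from its description: L1 alignment of adjacent-frame differences (the L1 form is stated later in the notes; the (T, C, H, W) layout and mean reduction are assumptions):

```python
import numpy as np

def frame_difference_loss(x_sr, x_hr):
    """Align adjacent-frame differences of the SR video with those of the HR
    video via an L1 penalty. x_sr, x_hr: arrays of shape (T, C, H, W)."""
    d_sr = x_sr[1:] - x_sr[:-1]   # adjacent-frame differences, (T-1, C, H, W)
    d_hr = x_hr[1:] - x_hr[:-1]
    return float(np.mean(np.abs(d_sr - d_hr)))

# Identical temporal dynamics give zero loss; a flicker in one frame does not:
rng = np.random.default_rng(1)
x_hr = rng.standard_normal((5, 3, 8, 8))
assert frame_difference_loss(x_hr, x_hr) == 0.0
x_bad = x_hr.copy()
x_bad[2] += 1.0                   # perturbing one frame changes two differences
assert frame_difference_loss(x_bad, x_hr) > 0.0
```

Because it penalizes differences of differences rather than frames themselves, the loss targets temporal flicker specifically and leaves per-frame fidelity to the MSE and DISTS terms.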
Video Processing Pipeline (HQ-VSR Dataset Construction)¶
A four-step pipeline curates high-quality VSR training data from OpenVid-1M:
1. Metadata filtering: short side > 720 px; frame count > 50.
2. Scene filtering: scene detection and splitting; clips with fewer than 50 frames are discarded.
3. Quality filtering: strict multi-metric screening (CLIP-IQA + FasterVQA + DOVER).
4. Motion processing: optical-flow-based motion scoring plus a motion-region detection algorithm:
   - A motion intensity map \(M\) is generated and thresholded to obtain a motion mask.
   - A bounding box \(B\) localizes high-motion regions; cropped clips below 720p are discarded.
   - This resolves the issue of clips with high global motion scores but locally static content.
The pipeline yields the HQ-VSR dataset comprising 2,055 high-quality videos.
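The thresholding and bounding-box step of the motion-processing stage might look like the following sketch. The threshold value, the 720 px minimum side, and the exact box rule are illustrative assumptions; the paper's full algorithm is not reproduced here:

```python
import numpy as np

def motion_crop_box(motion_map, thresh, min_side):
    """Threshold a motion-intensity map M and return the bounding box
    (y0, x0, y1, x1) of the high-motion region, or None if the box's
    short side falls below min_side (the clip would then be discarded)."""
    mask = motion_map > thresh
    if not mask.any():
        return None
    ys, xs = np.nonzero(mask)
    y0, y1 = ys.min(), ys.max() + 1
    x0, x1 = xs.min(), xs.max() + 1
    if min(y1 - y0, x1 - x0) < min_side:
        return None
    return (y0, x0, y1, x1)

# Motion confined to a large central block yields a usable crop:
m = np.zeros((1080, 1920))
m[200:1000, 400:1400] = 1.0
assert motion_crop_box(m, 0.5, 720) == (200, 400, 1000, 1400)
# A tiny moving region is rejected because the crop would be below 720 px:
m2 = np.zeros((1080, 1920))
m2[10:40, 10:40] = 1.0
assert motion_crop_box(m2, 0.5, 720) is None
```

Cropping to the motion box, rather than keeping or dropping whole frames, is what lets the pipeline salvage clips where only a sub-region moves while rejecting those whose genuine motion area is too small for 720p training.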
Loss & Training¶
- Stage-1: MSE loss for latent-space alignment.
- Stage-2: MSE + DISTS (perceptual quality) + frame difference loss (temporal consistency).
- Only the Transformer is fine-tuned; VAE weights are frozen.
- 4× A800-80G GPUs; total batch size 8.
- AdamW optimizer; Stage-1: 10K steps; Stage-2: 500 steps.
- Image data: DIV2K (900 images) with Real-ESRGAN degradation.
- Video data: HQ-VSR (2,055 videos) with RealBasicVSR degradation.
Key Experimental Results¶
Main Results¶
UDM10 Synthetic Benchmark (×4 SR)
| Method | Steps | PSNR↑ | LPIPS↓ | CLIP-IQA↑ | DOVER↑ | E*warp↓ |
|---|---|---|---|---|---|---|
| RealBasicVSR | — | 24.13 | 0.3908 | 0.3494 | 0.7564 | 3.10 |
| MGLD-VSR | Multi | 24.23 | 0.3272 | 0.4557 | 0.7264 | 3.59 |
| STAR | Multi | 23.47 | 0.4242 | 0.2417 | 0.4830 | 2.08 |
| DOVE | 1 | 26.48 | 0.2696 | 0.5107 | 0.7809 | 1.77 |
DOVE achieves comprehensive superiority in both fidelity (PSNR +2.25) and perceptual quality (CLIP-IQA/DOVER).
SPMCS Synthetic Benchmark (×4 SR)
| Method | PSNR↑ | LPIPS↓ | CLIP-IQA↑ |
|---|---|---|---|
| MGLD-VSR | 22.39 | 0.3263 | 0.4348 |
| DOVE | 23.11 | 0.2888 | 0.5690 |
Efficiency Comparison (33-frame 720p video, single A100)
| Method | Inference Time |
|---|---|
| MGLD-VSR | 425.23s |
| STAR | 173.07s |
| DOVE | ~15s (28× speedup) |
Ablation Study¶
Training Strategy Ablation (UDM10)
| Strategy | PSNR | LPIPS | CLIP-IQA | DOVER |
|---|---|---|---|---|
| S1 (latent only) | 27.20 | 0.3037 | 0.3236 | 0.6154 |
| S1+S2-I (+pixel image) | 26.39 | 0.2784 | 0.5085 | 0.7694 |
| S1+S2-I/V (+pixel mixed) | 26.48 | 0.2696 | 0.5107 | 0.7809 |
Stage-2 substantially improves perceptual quality (CLIP-IQA: 0.32→0.51); mixed training outperforms image-only training.
Image Ratio \(\varphi\) Ablation
| \(\varphi\) | LPIPS | CLIP-IQA | DOVER |
|---|---|---|---|
| 0% (video only) | 0.2624 | 0.4800 | 0.7647 |
| 80% (optimal) | 0.2696 | 0.5107 | 0.7809 |
| 100% (image only) | 0.2784 | 0.5085 | 0.7694 |
HQ-VSR Dataset Comparison (Stage-1)
| Dataset | # Videos | PSNR | DOVER |
|---|---|---|---|
| YouHQ | 38,576 | 26.88 | 0.3965 |
| OpenVid-1M | ~400K | 27.04 | 0.4363 |
| HQ-VSR | 2,055 | 27.20 | 0.6154 |
A dataset of merely 2K videos surpasses one of 400K, demonstrating that data quality vastly outweighs data quantity.
Key Findings¶
- One-step inference is fully viable for VSR and can surpass multi-step methods in performance.
- The latent-to-pixel two-stage training strategy is the key to balancing efficiency and quality.
- Frame-by-frame VAE processing effectively resolves the GPU memory bottleneck of pixel-space video training.
- Motion region detection and cropping is better suited to VSR scenarios than global motion scoring.
Highlights & Insights¶
- First one-step diffusion VSR model: 28× speedup without sacrificing performance, offering significant practical value.
- Minimalist architecture philosophy: No auxiliary modules are added; the method relies entirely on the priors of the pretrained T2V model, yielding a simple and efficient design.
- Extremely short training schedule: Fine-tuning is completed in only 10K+500 steps, far fewer than comparable methods.
- Data quality >> data quantity: 2K high-quality videos outperform 400K videos; motion region cropping is the critical factor.
- Choice of \(t=399\): Rather than starting from pure noise, the method exploits the structural information already present in LR inputs, reducing unnecessary reconstruction burden.
Limitations & Future Work¶
- The CogVideoX-based model is large in scale and still requires a GPU for single-frame inference.
- Evaluation is conducted only for ×4 super-resolution; extension to other scale factors and degradation types remains to be explored.
- The frame difference loss uses simple L1 alignment of adjacent-frame differences; more sophisticated temporal consistency constraints (e.g., optical flow) may yield further improvements.
- HQ-VSR contains only 2K videos; larger-scale high-quality data could further boost performance.
Related Work & Insights¶
- CogVideoX serves as the backbone model: 3D causal VAE + Transformer denoiser.
- OSEDiff first explored one-step diffusion for image SR; DOVE extends this paradigm to the video domain.
- DISTS perceptual loss is employed both as a training objective and as an evaluation metric.
- The frame difference loss is a simple yet effective temporal consistency solution that avoids the additional computational overhead of optical flow estimation.
Rating¶
- Novelty: ⭐⭐⭐⭐ (first to extend one-step diffusion to VSR; training strategy is innovative)
- Technical Depth: ⭐⭐⭐⭐ (two-stage training + data pipeline + architecture choices form a coherent methodology)
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ (6 test sets, 8 baseline methods, multi-dimensional ablation)
- Practicality: ⭐⭐⭐⭐⭐ (28× speedup has significant application value)
- Writing Quality: ⭐⭐⭐⭐⭐ (well-structured, with rich illustrations)