# BVINet: Unlocking Blind Video Inpainting with Zero Annotations
Conference: ICCV 2025 | arXiv: 2502.01181 | Code: N/A | Area: Image Generation / Video Restoration | Keywords: blind video inpainting, mask prediction, wavelet sparse transformer, video completion, consistency loss
## TL;DR
This paper is the first to formally define and address the task of blind video inpainting—simultaneously predicting where to restore and how to restore, end-to-end, without any annotation of corrupted regions. A mask prediction network and a video completion network mutually reinforce each other via a consistency constraint, achieving strong results on both synthetic data and real-world applications (danmaku removal and scratch repair).
## Background & Motivation
Existing video inpainting methods are fundamentally non-blind—they assume the corruption mask is known in advance, requiring users to manually annotate damaged regions in each frame. This leads to two practical problems:
High annotation cost: Boundaries between corrupted and clean regions are often ambiguous, making precise annotation difficult and time-consuming, especially for high-frame-rate or high-resolution videos.
Limited applicability: In many scenarios, corrupted regions cannot be anticipated or manually labeled, such as video scratches, watermarks, and danmaku overlays.
The authors categorize video corruption into two types: (1) externally introduced artifacts that disrupt the original video structure (scratches, watermarks, danmaku, etc.); and (2) unwanted content originally present in the video (object removal). This paper focuses on the first category.
A naive approach of applying blind image inpainting frame-by-frame ignores inter-frame motion continuity, leading to flickering artifacts. An end-to-end video-level solution is therefore necessary.
## Method
### Overall Architecture
BVINet consists of two mutually constrained sub-networks: a Mask Prediction Network (MPNet) that predicts corrupted regions, and a Video Completion Network (VCNet) that uses the predicted masks to restore corrupted content by aggregating information from valid regions. Both networks are jointly optimized via a consistency loss.
### Key Designs
- Mask Prediction Network (MPNet):
  - Two-stage structure: a Short-Term Prediction module followed by a Long-Term Refinement module
  - Short-Term Prediction (STP):
    - Encoder-decoder architecture that processes each frame independently
    - Detects intra-frame semantic discontinuities to predict a per-frame corruption mask: \(m_i^s = STP(x_i)\)
    - Replaces conventional downsampling (max-pooling/strided convolution) with the Discrete Wavelet Transform (DWT) to improve robustness against noise (a code sketch is given below)
  - Long-Term Refinement (LTR):
    - Exploits temporal-consistency priors to refine the predicted mask sequence
    - Core component: a sequence-to-sequence transformer
    - Maps deep features to Q/K/V, partitions them into \(N\) groups along the channel dimension, and computes spatial-temporal affinity matrices within a \(T\)-frame response window
    - Soft attention fuses the multi-group affinities with the aggregated features: \(\hat{E} = E + Conv(D) \odot G\)
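The DWT-based downsampling in STP is easy to picture in code. Since no official code is available, the following PyTorch sketch is only an illustration under assumed design choices: a single-level 2-D Haar DWT whose four sub-bands are concatenated and fused back by a 1×1 convolution; the names `haar_dwt2d` and `DWTDownsample` are hypothetical.

```python
import torch
import torch.nn as nn


def haar_dwt2d(x: torch.Tensor):
    """Single-level 2-D Haar DWT of a (B, C, H, W) feature map (H, W even).

    Returns the low-frequency band LL and the high-frequency bands
    (LH, HL, HH), each with spatial size (H/2, W/2).
    """
    a = x[:, :, 0::2, 0::2]  # top-left sample of each 2x2 block
    b = x[:, :, 0::2, 1::2]  # top-right
    c = x[:, :, 1::2, 0::2]  # bottom-left
    d = x[:, :, 1::2, 1::2]  # bottom-right
    ll = (a + b + c + d) / 2  # low-pass in both directions
    lh = (a + b - c - d) / 2  # vertical detail
    hl = (a - b + c - d) / 2  # horizontal detail
    hh = (a - b - c + d) / 2  # diagonal detail
    return ll, (lh, hl, hh)


class DWTDownsample(nn.Module):
    """Hypothetical drop-in replacement for max-pooling / strided convolution."""

    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.fuse = nn.Conv2d(4 * in_ch, out_ch, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        ll, (lh, hl, hh) = haar_dwt2d(x)  # each (B, C, H/2, W/2)
        return self.fuse(torch.cat([ll, lh, hl, hh], dim=1))
```

Keeping all four sub-bands rather than only the low-pass band is one plausible reason wavelet downsampling is more noise-robust than pooling: no information is discarded at the downsampling step.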
- Video Completion Network (VCNet), built on a Wavelet Sparse Transformer:
  - Innovation: isolates noise in the frequency domain and uses sparse attention to aggregate the most relevant features
  - Frequency decomposition: applies DWT to Q/K/V, isolating noise into the high-frequency components \(Q_i^H, K_i^H, V_i^H\), while the low-frequency components \(Q_i^L, K_i^L, V_i^L\) retain only clean, structural features
  - Dual-branch attention mechanism (see the code sketch below):
    - Dense Self-Attention (DSA): \(DSA = Softmax(\frac{Q^L \cdot (K^L)^T}{\sqrt{d}} + B)\)
    - Sparse Self-Attention (SSA): \(SSA = Softmax(ReLU(\frac{Q^L \cdot (K^L)^T}{\sqrt{d}}) + B)\), using ReLU to suppress negative similarities
    - Adaptive weighted fusion: \(\hat{V}^L = (\omega_1 \odot DSA + \omega_2 \odot SSA) V^L\)
    - Attention values at corrupted positions are set to zero in both DSA and SSA, ensuring information is borrowed exclusively from valid regions
  - An Inverse Discrete Wavelet Transform (IDWT) reconstructs the final feature representation
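To make the dual-branch attention concrete, here is a minimal sketch of DSA, SSA, and the adaptive fusion described above, applied to the low-frequency tokens. Tensor shapes, the post-softmax zeroing of corrupted key positions, and the names (`dual_branch_attention`, `valid`, `w1`, `w2`) are my own assumptions rather than the authors' implementation.

```python
import torch
import torch.nn.functional as F


def dual_branch_attention(q_l, k_l, v_l, valid, w1, w2, bias=None):
    """Dense + sparse self-attention over low-frequency tokens.

    q_l, k_l, v_l : (B, N, d) low-frequency queries / keys / values
    valid         : (B, N) 1.0 for tokens in valid regions, 0.0 for corrupted ones
    w1, w2        : fusion weights, broadcastable to the (B, N, N) attention map
    bias          : optional positional bias B of shape (N, N)
    """
    d = q_l.size(-1)
    logits = q_l @ k_l.transpose(-2, -1) / d ** 0.5  # (B, N, N) similarities
    dsa_logits = logits
    ssa_logits = F.relu(logits)                      # suppress negative similarities
    if bias is not None:
        dsa_logits = dsa_logits + bias
        ssa_logits = ssa_logits + bias

    dsa = torch.softmax(dsa_logits, dim=-1)          # dense branch
    ssa = torch.softmax(ssa_logits, dim=-1)          # sparse branch

    # Zero the attention paid to corrupted key positions so every query borrows
    # information exclusively from valid regions (no re-normalization, matching
    # the description above).
    key_mask = valid.unsqueeze(1)                    # (B, 1, N)
    dsa = dsa * key_mask
    ssa = ssa * key_mask

    fused = w1 * dsa + w2 * ssa                      # adaptive weighted fusion
    return fused @ v_l                               # \hat{V}^L, shape (B, N, d)
```

In the full block, the fused \(\hat{V}^L\) would then be recombined with the untouched high-frequency bands through the IDWT.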
- Consistency Loss:
  - Core idea: if both the mask prediction and the video completion are accurate, the difference between the corrupted frame \(x_i\) and the completed result \(\hat{y}_i\) should reside exclusively within the corrupted region
  - Consistency relation: \(m_i^l = \mathcal{B}(\hat{y}_i - x_i)\)
  - Consistency loss: \(\mathcal{L}_c = \|m_i^l - \mathcal{B}(\hat{y}_i - x_i)\|_1 + \|m_i - \mathcal{B}(\hat{y}_i - x_i)\|_1\)
  - Enforces precise correspondence between the two networks, enabling mutual regularization (a code sketch follows)
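The consistency loss itself is just two L1 terms against a residual-derived mask. The operator \(\mathcal{B}\) is not spelled out in this note, so the thresholding below (and the names `binarize_residual`, `consistency_loss`, `tau`) is a hypothetical stand-in; the hard threshold is also a simplification, since a non-differentiable \(\mathcal{B}\) would block gradients through \(\hat{y}_i\).

```python
import torch


def binarize_residual(residual: torch.Tensor, tau: float = 0.02) -> torch.Tensor:
    """B(.): turn a frame residual into a corruption mask (assumed thresholding)."""
    return (residual.abs().mean(dim=1, keepdim=True) > tau).float()


def consistency_loss(x, y_hat, m_long, m_other):
    """L_c = ||m_i^l - B(y_hat - x)||_1 + ||m_i - B(y_hat - x)||_1.

    x, y_hat        : corrupted and completed frames, (B, C, H, W)
    m_long, m_other : the masks m_i^l and m_i from the formula above
                      (assumed here to be MPNet's refined and coarse predictions)
    """
    residual_mask = binarize_residual(y_hat - x)
    return (m_long - residual_mask).abs().mean() + (m_other - residual_mask).abs().mean()
```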
### Loss & Training
Total loss: \(\mathcal{L} = \lambda_m \mathcal{L}_m + \lambda_v \mathcal{L}_v + \lambda_c \mathcal{L}_c\)
- \(\mathcal{L}_m\): mask prediction loss
- \(\mathcal{L}_v\): video completion loss
- \(\mathcal{L}_c\): consistency loss
- Hyperparameters: \(\lambda_m = 3\), \(\lambda_v = 5\), \(\lambda_c = 0.02\) (determined via grid search)
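A direct transcription of the combined objective, using the λ values reported above:

```python
lambda_m, lambda_v, lambda_c = 3.0, 5.0, 0.02  # grid-searched weights from the paper


def total_loss(loss_mask, loss_video, loss_consistency):
    """Weighted sum of mask-prediction, video-completion, and consistency losses."""
    return lambda_m * loss_mask + lambda_v * loss_video + lambda_c * loss_consistency
```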
Dedicated dataset construction:
- Free-form strokes are used as corruption masks and are filled with real image content (rather than constant values or noise)
- Iterative Gaussian blur expands the corruption boundaries to eliminate obvious edge priors, forcing the model to infer corruption from semantic context
- Scale: 2,400 synthetic videos + 1,250 real danmaku-removal videos
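As a rough illustration of this synthesis recipe, the sketch below draws free-form strokes, fills them with content from an unrelated frame, and iteratively blurs the mask boundary. All parameter values (stroke count, thickness range, kernel size, blur iterations) and the function names are illustrative guesses, not the paper's settings.

```python
import numpy as np
import cv2


def free_form_mask(h, w, n_strokes=5, max_len=60, thickness=(10, 25), rng=None):
    """Draw random free-form strokes as a binary corruption mask (1 = corrupted)."""
    rng = np.random.default_rng() if rng is None else rng
    mask = np.zeros((h, w), np.float32)
    for _ in range(n_strokes):
        x, y = int(rng.integers(0, w)), int(rng.integers(0, h))
        for _ in range(int(rng.integers(3, 8))):  # a few connected segments per stroke
            angle = rng.uniform(0.0, 2.0 * np.pi)
            length = int(rng.integers(10, max_len))
            nx = int(np.clip(x + length * np.cos(angle), 0, w - 1))
            ny = int(np.clip(y + length * np.sin(angle), 0, h - 1))
            cv2.line(mask, (x, y), (nx, ny), 1.0, int(rng.integers(*thickness)))
            x, y = nx, ny
    return mask


def corrupt_frame(frame, filler, mask, blur_iters=3, ksize=21):
    """Blend real content from `filler` into `frame` inside a softened mask.

    `frame` and `filler` are float arrays in [0, 1] of shape (H, W, 3).
    Iteratively blurring the mask expands and softens the corruption boundary,
    removing the sharp-edge prior the model could otherwise latch onto.
    """
    soft = mask.copy()
    for _ in range(blur_iters):
        soft = cv2.GaussianBlur(soft, (ksize, ksize), 0)
    soft = np.clip(soft, 0.0, 1.0)[..., None]  # (H, W, 1) for broadcasting
    return frame * (1.0 - soft) + filler * soft
```

Because the fill comes from real image content and the boundary is softened, the corrupted region cannot be localized from color statistics or sharp edges alone, which is exactly the shortcut the dataset design aims to remove.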
## Key Experimental Results
### Main Results (Blind vs. Non-Blind Setting)
YouTube-VOS dataset:
| Method | Blind | PSNR↑ | SSIM↑ | \(E_{warp}\)↓ | LPIPS↓ |
|---|---|---|---|---|---|
| FGT (non-blind) | ✗ | 30.811 | 0.9258 | 0.1308 | 0.4565 |
| WaveFormer (non-blind) | ✗ | 33.264 | 0.9435 | 0.1184 | 0.2933 |
| VCNet (non-blind) | ✗ | 34.107 | 0.9521 | 0.1102 | 0.2145 |
| MPNet+FGT (blind) | ✓ | 27.032 | 0.8755 | 0.1609 | 0.8667 |
| MPNet+WaveFormer (blind) | ✓ | 29.185 | 0.8902 | 0.1508 | 0.7153 |
| BVINet (blind) | ✓ | 30.528 | 0.9088 | 0.1362 | 0.6556 |
### Ablation Study
MPNet ablation:
| Configuration | BCE↓ | IOU↑ |
|---|---|---|
| STP (strided-conv) | 1.1251 | 0.8437 |
| DWT_STP | 1.0785 | 0.8682 |
| DWT_STP + LTR | 0.9176 | 0.8829 |
| Full MPNet | 0.8052 | 0.9017 |
Sparse attention + consistency loss ablation:
| DSA | SSA | \(\mathcal{L}_c\) | PSNR↑ | SSIM↑ | \(E_{warp}\)↓ | LPIPS↓ |
|---|---|---|---|---|---|---|
| ✓ | ✗ | ✗ | 29.172 | 0.8897 | 0.1529 | 0.7264 |
| ✓ | ✓ | ✗ | 29.885 | 0.8962 | 0.1454 | 0.6891 |
| ✓ | ✓ | ✓ | 30.528 | 0.9088 | 0.1362 | 0.6556 |
### Efficiency Analysis
| Method | FLOPs | Inference Time |
|---|---|---|
| STTN | 477.91G | 0.22s |
| FuseFormer | 579.82G | 0.30s |
| E2FGVI | 442.18G | 0.26s |
| FGT | 455.91G | 0.39s |
| VCNet | 396.35G | 0.21s |
### Key Findings
- BVINet in the blind setting achieves performance comparable to a competitive non-blind method, E2FGVI (PSNR 30.528 vs. 30.064), validating the feasibility of blind inpainting
- VCNet in the non-blind setting substantially outperforms all existing methods (PSNR 34.107 vs. runner-up 33.264)
- DWT downsampling significantly improves mask prediction quality (IOU 0.8437→0.8682), with long-term refinement further boosting it to 0.9017
- The dual-branch (DSA+SSA) design outperforms either branch alone; the consistency loss contributes an additional ~0.6 dB PSNR gain
- The model is robust to multiple corruption patterns, including Gaussian noise and solid-color fills unseen during training
- BVINet outperforms OGNet and RAVUNet on real-world danmaku removal
## Highlights & Insights
- Pioneer task definition: The first work to formally propose blind video inpainting, unifying "where to restore" and "how to restore" in a single framework
- Elegant mutual constraint design: Mask prediction and video completion form a closed loop through the consistency loss—the mask guides localization, while the completion result in turn validates mask quality
- Frequency-domain sparse attention: The combination of DWT to isolate noise into high-frequency components and ReLU to suppress negative similarities is better suited to inpainting than conventional attention
- Dataset design rationale: Filling corrupted regions with real image content (rather than constant values) and blurring boundaries prevents the model from learning distributional priors, instead encouraging semantic understanding
## Limitations & Future Work
- The paper focuses primarily on externally introduced corruption (scratches, danmaku, etc.); performance on the second corruption type (removal of originally present content) is not validated
- The dataset is relatively small (2,400 + 1,250 videos); larger-scale data could further improve generalization
- In the blind setting, a gap of 3–4 dB PSNR remains compared to the strongest non-blind methods, with mask prediction accuracy as the bottleneck
- The potential of diffusion models for blind video inpainting remains unexplored
- The corruption types addressed are primarily additive overlays; applicability to degradation types such as compression artifacts warrants further investigation
## Related Work & Insights
- The transition from blind image inpainting to blind video inpainting is not a trivial frame-by-frame extension—temporal consistency is the central challenge
- The constraint of "borrowing information exclusively from valid regions" in video inpainting stands in contrast to the global aggregation of standard attention; sparse attention's selectivity is better suited to such tasks
- The consistency constraint paradigm can be generalized to other video restoration tasks requiring joint optimization of detection and restoration
## Rating
- Novelty: ⭐⭐⭐⭐⭐ — Entirely new task definition; the end-to-end blind inpainting framework is groundbreaking
- Experimental Thoroughness: ⭐⭐⭐⭐ — Validated on both synthetic and real data with 16 baselines, detailed ablations, and efficiency analysis
- Writing Quality: ⭐⭐⭐⭐ — Clear problem formalization and well-structured methodological decomposition
- Value: ⭐⭐⭐⭐ — Opens a zero-annotation paradigm for video inpainting with high practical value for danmaku/watermark removal