CLEAR: Context-Aware Learning with End-to-End Mask-Free Inference for Adaptive Video Subtitle Removal
Conference: ICML 2026
arXiv: 2603.21901
Code: https://github.com/silent-commit/CLEAR (available)
Area: Video Generation / Video Inpainting / Subtitle Removal
Keywords: Video subtitle removal, diffusion model, LoRA, self-supervised prior, mask-free inference
TL;DR
This paper proposes CLEAR, a video subtitle removal method trained with a two-stage pipeline: Stage I uses a dual encoder + orthogonal decoupling to self-supervise a subtitle prior mask, and Stage II adds LoRA + an occlusion head to the Wan2.1 video diffusion model for adaptive weighting. Inference requires no mask and no text detector. With only 0.77% trainable parameters, CLEAR reaches 26.80 dB PSNR on a Chinese test set (+6.77 dB over the strongest baseline) and generalizes zero-shot to six languages.
Background & Motivation
Background: Current video subtitle removal mainly relies on mask-guided video diffusion inpainting (DiffuEraser, EraserDiT, MiniMax-Remover), requiring external text detection/segmentation to provide precise binary masks for every frame.
Limitations of Prior Work: (L1) Low training efficiency—full-parameter training and per-frame mask annotation, which itself depends on manual labeling or dedicated segmentation models throughout long videos; (L2) Fragile inference—text detection/tracking must run continuously after deployment, and any detection failure leads to flicker, ghosting, or drift; (L3) Static prior usage—auxiliary priors (heatmap, optical flow) are used with uniform weighting, ignoring reliability differences of subtitles across frames and regions.
Key Challenge: Video subtitles exhibit temporal continuity, diverse positions/fonts, and complex coupling with camera/object motion. Handling them requires (K1) parameter efficiency with no mask annotation, (K2) fully mask-free end-to-end inference, and (K3) adaptive prior quality weighting; existing methods achieve none of the three.
Goal: Construct a framework that can self-supervise subtitle priors from paired subtitle/clean videos during training, is fully mask-free during inference, and dynamically adapts weighting for subtitle regions.
Key Insight: Use the per-pixel difference between the subtitle frame and the clean frame as a weakly supervised pseudo-label (noisy but cheap), and employ a dual encoder + orthogonal constraint to isolate subtitle information; then let the diffusion model use an occlusion head to correct the prior during generation.
Core Idea: Explicitly distill the "subtitle mask recognition" capability into the LoRA-tuned DiT intermediate layers during training, so that inference only requires the subtitle video as input—the model internally generates \(\mathcal{M}^{pred}\), with no external mask needed.
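The pixel-difference pseudo-label from the Key Insight can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function name `pseudo_label`, the `(T, H, W, C)` layout, and the multiplier `k` on the standard deviation are hypothetical, and the threshold is read here as "mean + std" per frame, which the summary does not spell out.

```python
import numpy as np

def pseudo_label(x_sub, x_clean, k=1.0):
    """Per-frame pseudo-mask from the subtitle/clean pixel difference.

    x_sub, x_clean: float arrays of shape (T, H, W, C).
    k: hypothetical multiplier on the per-frame std (assumption, not from the paper).
    """
    # L2 norm over channels gives one difference value per pixel: (T, H, W)
    delta = np.linalg.norm(x_sub - x_clean, axis=-1)
    # Per-frame statistics, kept broadcastable against delta
    mu = delta.mean(axis=(1, 2), keepdims=True)
    sigma = delta.std(axis=(1, 2), keepdims=True)
    # Pixels that differ much more than the frame average are marked as subtitle
    return (delta > mu + k * sigma).astype(np.float32)

# Toy pair: frame 0 has a bright "subtitle" bar, frame 1 is identical to the clean frame.
rng = np.random.default_rng(0)
x_clean = rng.random((2, 8, 8, 3))
x_sub = x_clean.copy()
x_sub[0, 6:8, 2:6] += 0.8
mask = pseudo_label(x_sub, x_clean)
print(mask[0].sum(), mask[1].sum())
```

On real footage the difference also picks up compression noise and semi-transparent edges, which is why the summary describes these labels as noisy and adds further constraints in Stage I.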
Method
Overall Architecture
A two-stage pipeline. Stage I (self-supervised prior): Use pixel-difference pseudo-labels to train dual ResNet-50 encoders (\(E_{\text{sub}},E_{\text{content}}\)) + a 4-layer UNet decoder to obtain the prior mask \(\mathcal{M}^{prior}\), with ImageNet pretraining and orthogonal loss + adversarial discriminator to decouple subtitle and content features. Stage II (adaptive weighting): Freeze Wan2.1-Fun-V1.1-1.3B DiT, inject rank=64 LoRA into all attention + FFN, and add a 2.1M parameter occlusion head \(\mathcal{H}\) to compute \(\mathcal{M}^{pred}\) from DiT intermediate layers. The spatial emphasis × focal difficulty weight \(w_{i,j,t}\) modulates the diffusion loss, with three losses (distillation + context-aware adaptation + sparsity) jointly optimized. Inference: single input video → DiT + LoRA + internal \(\mathcal{M}^{pred}\) → DDIM 5 steps → VAE decodes clean video, with no external modules.
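For readers unfamiliar with LoRA, the following is a minimal NumPy sketch of a frozen linear layer with a low-rank update, not the authors' implementation. The class name `LoRALinear`, the `alpha / rank` scaling, the toy layer width, and the `lora_scale` argument (an inference-time multiplier on the update) are assumptions chosen for illustration; only rank=64 comes from the summary.

```python
import numpy as np

class LoRALinear:
    """Frozen dense weight W plus a trainable low-rank update B @ A (illustrative sketch)."""

    def __init__(self, d_in, d_out, rank=64, alpha=64.0, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.standard_normal((d_out, d_in)) / np.sqrt(d_in)  # frozen base weight
        self.A = rng.standard_normal((rank, d_in)) * 0.01            # trainable down-projection
        self.B = np.zeros((d_out, rank))                             # trainable up-projection, zero init
        self.scaling = alpha / rank                                  # common LoRA convention (assumption)

    def __call__(self, x, lora_scale=1.0):
        base = x @ self.W.T                      # output of the frozen layer
        update = (x @ self.A.T) @ self.B.T       # low-rank correction
        return base + lora_scale * self.scaling * update

layer = LoRALinear(d_in=256, d_out=256, rank=64)  # toy width, not the DiT width
x = np.ones((2, 256))
base_out = x @ layer.W.T
out_at_init = layer(x)                            # B is zero, so this equals base_out

layer.B[:] = 0.01                                 # pretend training moved B away from zero
delta_full = layer(x, lora_scale=1.0) - base_out
delta_half = layer(x, lora_scale=0.5) - base_out
print(np.allclose(delta_half, 0.5 * delta_full))
```

Because `B` starts at zero, the adapted layer reproduces the frozen backbone at initialization, and only `A` and `B` receive gradients during fine-tuning.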
Key Designs
- Stage I Self-Supervised Subtitle Prior (dual encoder + orthogonal decoupling + adversarial discriminator):
  - Function: Learns a binary mask \(\mathcal{M}^{prior}\) that predicts subtitle regions from 500 video pairs, without manual masks.
  - Mechanism: (a) Use pixel difference \(\Delta_t=\|\mathbf{X}^{sub}_t-\mathbf{X}^{clean}_t\|_2\) and per-frame mean+std thresholding to generate pseudo-labels; (b) the dual encoder extracts \(F^{sub}, F^{content}\) at 1/8 resolution; (c) the orthogonal loss \(\mathcal{L}_{\text{ortho}}=\frac{1}{T H' W'}\sum_{t,i,j}\langle F^{sub}_{t,i,j}, F^{content}_{t,i,j}\rangle^2\) enforces independence; (d) the adversarial loss \(\mathcal{L}_{\text{adv}}\) prevents leakage; (e) the decoder outputs \(\mathcal{M}^{prior}\) from \(F^{sub}\) only, and \(F^{content}\) alone must reconstruct the clean frame.
  - Design Motivation: Pixel-difference pseudo-labels are noisy (lighting, semi-transparent subtitles, motion blur), and BCE alone cannot learn good masks. The orthogonal + adversarial + reconstruction constraints force the subtitle features to carry all of the difference information exclusively, enabling the mask head to generalize to unseen fonts/languages rather than memorizing specific character shapes.
- Stage II Context-Aware Occlusion Head + Adaptive Weighting \(w_{i,j,t}\):
  - Function: Dynamically computes a subtitle probability for each patch from DiT intermediate layers and uses it to adjust that patch's weight in the diffusion loss, so the model "explicitly attends to subtitles during training, implicitly erases during generation."
  - Mechanism: Occlusion head \(\mathcal{H}(\mathbf{h}_{enc})=\mathrm{Conv}^1_{1\times 1}(\mathrm{SiLU}(\mathrm{Conv}^{64}_{3\times 3}(\mathbf{h}_{enc})))\) computes \(\mathcal{M}^{pred}=\sigma(\mathcal{H}(\mathbf{h}_{enc}))\) from DiT encoder activations. The final weight is \(w_{i,j,t}=(1+\alpha(k)\cdot\mathcal{M}^{pred}_{i,j,t})\cdot(\epsilon^{gen}_{i,j,t}+\delta)^\gamma\), where the first term is spatial emphasis (upweighting predicted subtitle regions) and the second is focal-style difficulty weighting (upweighting regions with high reconstruction error). \(\alpha(k)\) oscillates between \(\alpha_{\min}=5\) and \(\alpha_{\max}=15\) on a triangular schedule to avoid local minima.
  - Design Motivation: Naively using the prior as a mask condition is noisy; letting the head see latent noise, DiT high-level semantics, and diffusion timestep \(t\) turns "prior calibration" into "difficulty-aware generation." Focal weighting (from RetinaNet) ensures simple background regions contribute less gradient, while hard subtitle regions contribute more. Crucially, \(\mathcal{M}^{pred}\) is not detached, so gradients from \(\mathcal{L}_{\text{gen}}\) flow back, forming a self-correcting loop.
- Joint Three-Loss Optimization + Internalized Mask-Free Inference:
  - Function: Uses distillation (from the Stage I prior) + generation feedback (quality) + sparsity/KL (anti-degeneration) to jointly optimize LoRA and the occlusion head, so that \(\mathcal{M}^{pred}\) retains prior structure while correcting local errors and is absorbed into LoRA-augmented attention; no external mask is needed at inference.
  - Mechanism: \(\mathcal{L}_{stage2}=\mathcal{L}_{distill}+\mathcal{L}_{gen}+0.1\cdot\mathcal{L}_{sparse}\). \(\mathcal{L}_{distill}\) uses SmoothL1 to enforce \(\mathcal{M}^{pred}\approx\mathcal{M}^{prior}\) within 1 unit deviation; \(\mathcal{L}_{gen}\) is the standard diffusion \(\epsilon\) loss weighted by \(w\); \(\mathcal{L}_{sparse}\) combines L1 sparsity + \(D_{KL}(\mathcal{M}^{pred}\|\mathcal{M}^{prior})\), where the former prevents uniform degeneration and the latter prevents drifting from the prior distribution.
  - Design Motivation: Pure distillation amplifies Stage I noise; pure generation feedback leads to trivial head outputs (all 0 or uniform). The three losses each serve a role: distill for structure, gen for quality, sparse for controllability. After training, LoRA + head absorb "which regions to erase" into attention patterns; at inference (Alg. 1), \(\mathcal{M}^{pred}\) is internal, never output, and a single forward pass yields the clean video.
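The Stage II weighting and the three-loss objective described above can be sketched numerically as follows. This is a NumPy illustration under several assumptions that the summary does not confirm: the schedule `period`, the use of squared noise-prediction error as \(\epsilon^{gen}\), a per-pixel Bernoulli form for the KL term, and an unweighted sum of L1 and KL inside \(\mathcal{L}_{sparse}\). All function names are hypothetical.

```python
import numpy as np

def alpha_triangular(step, period=1000, alpha_min=5.0, alpha_max=15.0):
    """Triangular wave: alpha_min at step 0, alpha_max at period/2 (period is hypothetical)."""
    phase = (step % period) / period
    return alpha_min + (alpha_max - alpha_min) * (1.0 - abs(2.0 * phase - 1.0))

def adaptive_weight(m_pred, err, step, gamma=0.8, delta=1e-6):
    """w = (1 + alpha(k) * M_pred) * (err + delta)^gamma: spatial emphasis times focal difficulty."""
    return (1.0 + alpha_triangular(step) * m_pred) * (err + delta) ** gamma

def smooth_l1(pred, target, beta=1.0):
    """SmoothL1: quadratic below beta, linear above."""
    diff = np.abs(pred - target)
    return float(np.mean(np.where(diff < beta, 0.5 * diff**2 / beta, diff - 0.5 * beta)))

def bernoulli_kl(p, q, eps=1e-6):
    """Mean per-pixel KL(p || q), treating each mask value as a Bernoulli probability (assumption)."""
    p, q = np.clip(p, eps, 1 - eps), np.clip(q, eps, 1 - eps)
    return float(np.mean(p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))))

def stage2_loss(m_pred, m_prior, eps_pred, eps_true, step, lam_sparse=0.1):
    err = (eps_pred - eps_true) ** 2                 # per-location generation error (assumption)
    w = adaptive_weight(m_pred, err, step)
    l_distill = smooth_l1(m_pred, m_prior)           # stay close to the Stage I prior
    l_gen = float(np.mean(w * err))                  # weighted diffusion loss
    l_sparse = float(np.mean(np.abs(m_pred))) + bernoulli_kl(m_pred, m_prior)
    return l_distill + l_gen + lam_sparse * l_sparse

# Toy example: a 4x4 latent with a two-pixel "subtitle" in the bottom row.
m_prior = np.zeros((1, 4, 4))
m_prior[:, 3, 1:3] = 1.0
m_pred = np.full_like(m_prior, 0.1)
m_pred[:, 3, 1:3] = 0.9
rng = np.random.default_rng(0)
eps_true = rng.standard_normal(m_prior.shape)
eps_pred = eps_true + 0.5 * m_prior                  # error concentrated on the subtitle
loss = stage2_loss(m_pred, m_prior, eps_pred, eps_true, step=250)
print(alpha_triangular(0), alpha_triangular(250), alpha_triangular(500))  # 5.0 10.0 15.0
```

NumPy carries no gradients, so the sketch only shows the forward computation; the self-correcting behavior in the text depends on an autograd framework in which \(\mathcal{M}^{pred}\) is left attached to the graph.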
Loss & Training
Stage I: \(\mathcal{L}_{stage1}=\mathcal{L}_{ortho}+0.5\mathcal{L}_{adv}+\mathcal{L}_{region}+0.1\mathcal{L}_{recon}\), AdamW lr=\(2\times 10^{-5}\), 1 epoch (~70 min). Stage II: above \(\mathcal{L}_{stage2}\), AdamW lr=\(1\times 10^{-4}\), gradient clipping=1.0, 1 epoch ≈ 1 day (8×A800). LoRA rank=64, applied to q,k,v,o and ffn.0/2; \(\gamma=0.8,\delta=10^{-6}\); Stage II data: 500 videos × 81 consecutive frames.
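As a rough illustration of how the Stage I objective combines its terms, the sketch below implements the orthogonality penalty in NumPy and applies the loss weights quoted above. The names `Stage1Weights`, `ortho_loss`, and `stage1_total`, the `(T, H', W', D)` feature layout, and the placeholder values for the adversarial, region, and reconstruction terms are assumptions for the example.

```python
import numpy as np
from dataclasses import dataclass

@dataclass(frozen=True)
class Stage1Weights:
    """Loss weights as listed in the summary: L_ortho + 0.5 L_adv + L_region + 0.1 L_recon."""
    ortho: float = 1.0
    adv: float = 0.5
    region: float = 1.0
    recon: float = 0.1

def ortho_loss(f_sub, f_content):
    """Mean squared inner product between the two feature maps, shape (T, H', W', D)."""
    inner = np.sum(f_sub * f_content, axis=-1)      # (T, H', W') inner product per location
    return float(np.mean(inner ** 2))

def stage1_total(l_ortho, l_adv, l_region, l_recon, w=Stage1Weights()):
    return w.ortho * l_ortho + w.adv * l_adv + w.region * l_region + w.recon * l_recon

# Features living in disjoint channels are orthogonal, so the penalty vanishes.
f_sub = np.zeros((1, 2, 2, 4))
f_sub[..., :2] = 1.0
f_content = np.zeros((1, 2, 2, 4))
f_content[..., 2:] = 1.0
decoupled = ortho_loss(f_sub, f_content)
leaky = ortho_loss(f_sub, f_sub)                    # identical features: inner product 2, squared 4

# Placeholder scalars stand in for the adversarial, region, and reconstruction losses.
total = stage1_total(decoupled, l_adv=0.7, l_region=0.3, l_recon=0.2)
print(decoupled, leaky, total)
```

The penalty is zero only when the subtitle and content features are orthogonal at every spatial location, which is the sense in which the loss pushes the two encoders to carry different information.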
Key Experimental Results
Main Results (Chinese Subtitle Test Set, 400 Samples)
| Method | PSNR↑ | SSIM↑ | LPIPS↓ | VFID↓ | TWE↓ | Flow Var↓ | s/frame↓ |
|---|---|---|---|---|---|---|---|
| ProPainter | 17.24 | 0.658 | 0.329 | 98.46 | 1.286 | 0.885 | 2.36 |
| MiniMax-Remover | 20.03 | 0.773 | 0.166 | 95.39 | 4.222 | 0.415 | 4.90 |
| DiffuEraser | 17.85 | 0.672 | 0.458 | 72.51 | 1.523 | 0.630 | 3.47 |
| CLEAR (mask-free) | 26.80 | 0.894 | 0.101 | 20.37 | 1.227 | 0.029 | 4.86 |
Relative to the best baseline in each column: PSNR +6.77 dB (vs 20.03), VFID -71.9% (vs 72.51), Flow Variance -93.0% (vs 0.415). All baselines require external masks, while CLEAR takes only the subtitle video as input.
Ablation Study
| Configuration | PSNR↑ | VFID↓ | TWE↓ |
|---|---|---|---|
| Baseline (LoRA-only) | 21.62 | 34.74 | 1.320 |
| + M1: Stage I prior + focal weighting | 23.11 | 38.21 | 1.303 |
| + M2: Context Distillation | 24.72 | 31.73 | 1.279 |
| + M3: Context-Aware Adaptation | 25.09 | 31.56 | 1.257 |
| + M4: Context Consistency (CLEAR) | 26.80 | 20.37 | 1.227 |

| Inference setting | PSNR↑ | VFID↓ | s/frame |
|---|---|---|---|
| steps=5 (default) | 26.80 | 20.37 | 4.86 |
| steps=10 | 29.43 | 35.70 | 9.92 |
| cfg=1.2 | 29.65 | 40.71 | 4.86 |
| lora_scale=0.5 | 25.17 | 63.02 | 4.86 |
| lora_scale=1.5 | 27.94 | 42.16 | 4.86 |
Key Findings
- The cumulative 5.18 dB PSNR gain comes from the four modules in combination; consistency regularization (M4) alone provides the largest VFID drop (-35.5%), indicating that "preventing \(\mathcal{M}^{pred}\) degeneration" is critical for perceptual quality.
- steps=10 yields higher PSNR but worse VFID (35.70 vs 20.37) at roughly twice the cost (9.92 vs 4.86 s/frame), suggesting that extra denoising steps introduce perceptual artifacts; the 5-step default is the better trade-off.
- LoRA scale 0.5 leads to severe under-removal (LPIPS +82%), 1.5 to over-smoothing—1.0 is the sweet spot; CFG=1.0 balances fidelity and perceptual quality.
- Zero-shot cross-lingual: trained only on Chinese subtitles, the model cleanly removes English/Korean/French/Japanese/Russian/German subtitles—demonstrating that it learns abstract occlusion patterns rather than character features.
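The ablation figures quoted in these findings can be re-derived directly from the ablation table. The short script below does that arithmetic in plain Python; the dictionary keys and the helper `rel_change` are just labels chosen for this example.

```python
# Values copied from the ablation table (rows: LoRA-only baseline, then +M1 ... +M4).
psnr = {"baseline": 21.62, "m1": 23.11, "m2": 24.72, "m3": 25.09, "m4": 26.80}
vfid = {"baseline": 34.74, "m1": 38.21, "m2": 31.73, "m3": 31.56, "m4": 20.37}

def rel_change(old, new):
    """Signed relative change in percent."""
    return 100.0 * (new - old) / old

stages = list(psnr)
pairs = list(zip(stages, stages[1:]))

# PSNR gain (dB) and VFID change (%) contributed by each added module
psnr_steps = {b: round(psnr[b] - psnr[a], 2) for a, b in pairs}
vfid_steps = {b: round(rel_change(vfid[a], vfid[b]), 1) for a, b in pairs}
total_gain = round(psnr["m4"] - psnr["baseline"], 2)

print(psnr_steps)   # {'m1': 1.49, 'm2': 1.61, 'm3': 0.37, 'm4': 1.71}
print(vfid_steps)   # {'m1': 10.0, 'm2': -17.0, 'm3': -0.5, 'm4': -35.5}
print(total_gain)   # 5.18
```

The per-module view also shows that M1 is the only step that raises VFID (+10.0%) even though it adds 1.49 dB of PSNR.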
Highlights & Insights
- Pixel-difference pseudo-labels + orthogonal decoupling: Replaces expensive mask annotation with pixel differences from subtitle/clean video pairs, and uses orthogonal + adversarial constraints to force subtitle information into a single encoder. This self-supervised process enables learning a subtitle prior that generalizes to six languages from just 500 video pairs, exemplifying data efficiency.
- Gradient flow through \(\mathcal{M}^{pred}\) enables self-correction: Many attention/mask weighting methods detach the mask to avoid interfering with the main task; this work does the opposite—deliberately allowing diffusion loss gradients to flow back to the head, so "high reconstruction error regions" receive positive gradients to raise \(\mathcal{M}^{pred}\), and "low error regions" receive negative gradients to lower the weight, forming a feedback loop without GT masks.
- Mask-free inference has high engineering value: Eliminating text detection/segmentation removes an entire fragile sub-pipeline (no more OCR misdetection or tracking drift), and 0.77% trainable parameters + single-epoch training is deployment-friendly; end-to-end sub→clean mapping in one inference avoids cascading errors.
Limitations & Future Work
- Main experiments are only on Chinese subtitle training data (160K training pairs, 400 test), with other languages shown only qualitatively; quantitative generalization is unmeasured.
- 5-step DDIM takes 4.86 s/frame at 1280×720 resolution, still short of real-time; authors mention "real-time inference optimization" but provide no solution.
- Relies on Wan2.1-Fun-V1.1-1.3B as backbone; transferability to other video diffusion models (HunyuanVideo, Sora series) is unverified.
- Whether the self-supervised prior remains effective for "animated subtitles, artistic fonts, extreme semi-transparency" needs more stress testing; M1 only contributes +1.49 dB but VFID increases (38.21), suggesting the prior itself is still noisy and the three-loss system is necessary for robustness.
Related Work & Insights
- vs DiffuEraser / EraserDiT / MiniMax-Remover: All require external masks + full-parameter training; this work achieves mask-free inference with 0.77% trainable parameters and +6.77 dB PSNR, addressing training annotation cost, inference-time mask dependency, and parameter efficiency together.
- vs ProPainter: Traditional optical flow propagation method, PSNR 17.24 lags far behind; shows that modern video diffusion priors are essential for "high-frequency local occlusion" like subtitles.
- vs Image-based STR (EraseNet/ViTEraser): Image methods lack temporal consistency constraints; CLEAR's Flow Variance of 0.029 (roughly 30× lower than ProPainter's 0.885) demonstrates the temporal stability advantage of end-to-end video diffusion + LoRA.
- Transferable idea: The dual encoder + orthogonal decoupling + pixel-difference pseudo-label self-supervised prior can be directly applied to other "local occlusion removal" tasks (watermark removal, logo erasure, video censor repair); focal-weighted diffusion loss is also worth reusing in general inpainting scenarios.
Rating
- Novelty: ⭐⭐⭐⭐ The combination of self-supervised orthogonal decoupling + gradient-flow occlusion head + fully mask-free inference is new for video subtitle removal, though individual techniques (LoRA, self-supervised prior) are relatively mature.
- Experimental Thoroughness: ⭐⭐⭐⭐ Comprehensive multi-metric (PSNR/VFID/temporal/flow) comparison, four-module ablation, and inference hyperparameter analysis; cross-lingual part lacks quantitative results.
- Writing Quality: ⭐⭐⭐⭐ Three limitations (L1-L3) correspond to three capabilities (K1-K3); method diagrams, algorithm boxes, and tables are clearly organized.
- Value: ⭐⭐⭐⭐ Truly solves the "must have mask" pain point in video subtitle removal deployment; 0.77% parameters + mask-free inference is highly valuable for product-level applications.