BlazeBVD: Make Scale-Time Equalization Great Again for Blind Video Deflickering
Conference: ECCV2024
arXiv: 2403.06243
Code: To be confirmed
Area: Video Generation
Keywords: blind video deflickering, histogram, scale-time equalization, temporal consistency, exposure correction
TL;DR
BlazeBVD applies classical Scale-Time Equalization (STE) in the illumination histogram space to extract three deflickering priors: filtered illumination maps, exposure maps, and flickering frame indices. With these priors, video space-time learning reduces to frame-by-frame processing with a 2D spatial network, followed by a lightweight 3D temporal consistency network. The method reports state-of-the-art quality on blind video deflickering and runs about 10.5× faster than Deflicker (58.37 s vs. 614.19 s for 80 frames).
Background & Motivation
- Video flicker is a common form of video quality degradation. Its sources include camera hardware limitations, inconsistencies introduced when image processing algorithms are applied frame by frame, and color distortions produced by generative models such as GANs and diffusion models.
- Blind Video Deflickering (BVD) assumes no prior knowledge of the flicker type or severity and uses no reference video or flicker-type annotations, which makes it the most general and also the most challenging setting.
- Existing deep learning methods (such as Deflicker based on Neural Atlas) suffer from heavy resource consumption: establishing an atlas for an 8-second video takes about 10 minutes, and the time and memory overhead scale severely with video length.
- Furthermore, existing methods produce color artifacts and color distortion when processing severe illumination flicker, fail to recover texture details in over-/under-exposed areas, and may even introduce new local flicker.
Core Problem
- Excessive Spatiotemporal Complexity: The sheer volume of pixel-level spatiotemporal data makes it difficult for neural networks to learn and maintain global visual consistency.
- Lack of Compact Representation: Flicker is essentially a local or transient shift in illumination, which requires a more compact representation than pixel values to accurately capture illumination fluctuations.
- Unresolved Local Exposure Issues: Overexposure turns textures into uniform flat colors, and underexposure leads to pitch-black regions. Existing BVD methods cannot compensate for these lost high-frequency details.
Method
BlazeBVD is divided into three stages. Its core idea is to utilize STE to extract prior information in the illumination histogram space, decomposing the complex task of processing the entire video's spatiotemporal data into simpler tasks that can be processed independently frame-by-frame.
Stage 1: Deflickering Prior Extraction
- Illumination Map Extraction: Each RGB frame is converted to the HSV space, and the V channel (i.e., the maximum value among R/G/B) is taken as the illumination map \(V_t\).
- Illumination Histogram Calculation: The histogram \(H_t\) is computed for the illumination map of each frame. These histograms compactly capture frame-to-frame illumination variations: frames without obvious illumination changes have similar histograms, while flickering frames have anomalous histogram distributions.
- STE Filtering: Gaussian temporal smoothing is applied to the sequence of illumination histograms (in the histogram space rather than the pixel space) to obtain a smoothed histogram sequence. Filtered illumination maps \(\tilde{V}_t\) are then generated via histogram matching.
- Flicker Frame Detection: For each frame, the KL divergence between its histograms before and after STE is compared against a local moving average of that divergence. Frames that stand out are collected into the flickering frame set \(S_{flicker}\).
- Exposure Map Generation: A binary exposure map \(M_t\) is generated based on pixel value thresholds of the filtered illumination map, marking overexposed (\(>\epsilon_2\)) and underexposed (\(<\epsilon_1\)) regions.
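The five steps above can be sketched in a few NumPy functions. This is a minimal illustration under stated assumptions, not the authors' implementation: it smooths the histograms directly along time as described here, and the Gaussian width `sigma`, the moving-average `window`, the outlier `ratio`, and the thresholds `eps1`/`eps2` are hypothetical values chosen only to make the example run. All function names are illustrative.

```python
import numpy as np

def illumination_map(frame_rgb):
    """V channel of HSV: per-pixel max over R, G, B. Input (H, W, 3) uint8."""
    return frame_rgb.max(axis=2)

def normalized_histogram(v, bins=256):
    """Illumination histogram H_t, normalized to sum to 1."""
    h = np.bincount(v.ravel(), minlength=bins).astype(np.float64)
    return h / h.sum()

def smooth_over_time(hists, sigma=2.0):
    """Gaussian-filter a (T, bins) histogram stack along the time axis."""
    radius = int(3 * sigma)
    kernel = np.exp(-0.5 * (np.arange(-radius, radius + 1) / sigma) ** 2)
    kernel /= kernel.sum()
    padded = np.pad(hists, ((radius, radius), (0, 0)), mode="edge")
    return np.stack([
        (padded[t:t + 2 * radius + 1] * kernel[:, None]).sum(axis=0)
        for t in range(len(hists))
    ])

def match_histogram(v, src_hist, dst_hist):
    """Remap v so that its histogram follows dst_hist (CDF matching)."""
    src_cdf = np.cumsum(src_hist)
    src_cdf /= src_cdf[-1]
    dst_cdf = np.cumsum(dst_hist)
    dst_cdf /= dst_cdf[-1]
    # For each source level, find the first target level whose CDF reaches it.
    lut = np.searchsorted(dst_cdf, src_cdf, side="left")
    lut = np.clip(lut, 0, len(dst_hist) - 1).astype(np.uint8)
    return lut[v]

def kl_divergence(p, q, eps=1e-8):
    """KL(p || q) with a small epsilon to avoid log(0)."""
    p = (p + eps) / (p + eps).sum()
    q = (q + eps) / (q + eps).sum()
    return float(np.sum(p * np.log(p / q)))

def detect_flicker(hists, smoothed, window=5, ratio=1.5):
    """Flag frames whose KL divergence exceeds ratio * local moving average.
    `window` (odd) and `ratio` are assumptions made for this sketch."""
    kl = np.array([kl_divergence(h, s) for h, s in zip(hists, smoothed)])
    padded = np.pad(kl, window // 2, mode="edge")
    local_mean = np.convolve(padded, np.ones(window) / window, mode="valid")
    return [t for t in range(len(kl)) if kl[t] > ratio * local_mean[t]]

def exposure_map(v_filtered, eps1=30, eps2=225):
    """Binary map M_t: 1 where under-exposed (< eps1) or over-exposed (> eps2)."""
    return ((v_filtered < eps1) | (v_filtered > eps2)).astype(np.uint8)

def extract_priors(frames):
    """Return filtered illumination maps, exposure maps, flicker indices."""
    v = [illumination_map(f) for f in frames]
    hists = np.stack([normalized_histogram(x) for x in v])
    smoothed = smooth_over_time(hists)
    filtered = [match_histogram(x, h, s) for x, h, s in zip(v, hists, smoothed)]
    return filtered, [exposure_map(x) for x in filtered], detect_flicker(hists, smoothed)
```

Calling `extract_priors` on a `(T, H, W, 3)` uint8 array yields the three priors used by the later stages. The point of the design is that everything temporal happens on small `(T, 256)` histogram arrays, so no pixel-level spatiotemporal processing is needed at this stage.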
Stage 2: Global and Local Flicker Removal
- Global Flicker Removal Module (GFRM): A 2D-UNet is utilized, taking the flickering frame \(X_t\) and the filtered illumination map \(\tilde{V}_t\) as input, to perform frame-by-frame independent color correction. Since \(\tilde{V}_t\) is already temporally stable, the network only needs to learn a 2D spatial mapping, significantly reducing the learning difficulty. Training employs L2 loss.
- Local Flicker Removal Module (LFRM): This module operates only on the frames in the flickering frame set. It uses RAFT to compute optical flow between adjacent frames and warps texture details from adjacent frames into the over-/under-exposed areas of the current frame (marked by the exposure map \(M_t\)). Well-exposed regions keep the GFRM correction result, and a fusion network then synthesizes the final frame.
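The mask-guided texture transfer in LFRM can be illustrated with a simplified stand-in. The sketch below assumes the optical flow is already available as an array (RAFT itself is not called), uses nearest-neighbour sampling, and replaces the learned fusion network with a hard mask blend. The function names and the flow layout (`flow[..., 0]` as the x offset, `flow[..., 1]` as the y offset) are assumptions for illustration.

```python
import numpy as np

def backward_warp(image, flow):
    """Sample `image` at (x + u, y + v) with nearest-neighbour lookup.

    image: (H, W, C) array, flow: (H, W, 2) with (u, v) = (x, y) offsets.
    Out-of-range coordinates are clamped to the image border.
    """
    h, w = flow.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    src_x = np.clip(np.rint(xs + flow[..., 0]).astype(int), 0, w - 1)
    src_y = np.clip(np.rint(ys + flow[..., 1]).astype(int), 0, h - 1)
    return image[src_y, src_x]

def fill_exposed_regions(corrected, neighbor, flow, mask):
    """Copy warped neighbour texture into masked pixels, keep the rest.

    corrected: GFRM output for the current frame, (H, W, C) float
    neighbor:  adjacent frame supplying texture, (H, W, C) float
    mask:      binary exposure map M_t, (H, W)
    """
    warped = backward_warp(neighbor, flow)
    m = mask[..., None].astype(corrected.dtype)
    return m * warped + (1.0 - m) * corrected
```

In the method described above, a fusion network follows this step, presumably to smooth the seams that a hard blend like this one would leave at mask boundaries.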
Stage 3: Adaptive Temporal Consistency
- A Temporal Consistency Model (TCM) is introduced, which is a lightweight spatiotemporal network based on the RTN architecture.
- An adaptive mask-weighted warping loss is designed. It weights each pixel by \(M_t + 1\), so over-/under-exposed regions (where \(M_t = 1\)) receive a higher weight than the rest of the frame, which avoids the global blur and color distortion caused by traditional warping losses.
- The total loss includes: reconstruction loss \(\mathcal{L}_{rec}\), perceptual loss \(\mathcal{L}_{per}\), spatiotemporal adversarial loss \(\mathcal{L}_{adv}\), and adaptive warping loss \(\mathcal{L}_{warp}\).
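The \(M_t + 1\) weighting can be made concrete with a small sketch. It assumes an L1 penalty between the current output and the previous output already warped to the current frame. The exact norm, any occlusion handling, and the weights that combine the four loss terms are not given in this summary, so the function below is illustrative only.

```python
import numpy as np

def adaptive_warp_loss(current, warped_prev, exposure_mask):
    """Mask-weighted warping loss: mean over pixels of (M_t + 1) * |O_t - warp(O_{t-1})|.

    current, warped_prev: (H, W, C) float arrays
    exposure_mask:        (H, W) binary map M_t
    The L1 norm is an assumption made for this sketch.
    """
    weight = exposure_mask[..., None].astype(np.float64) + 1.0
    return float(np.mean(weight * np.abs(current - warped_prev)))
```

With a binary mask, a masked pixel contributes twice as much as an unmasked one, so temporal consistency is enforced most strongly where texture was restored from neighbouring frames.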
Key Experimental Results
Quantitative Results (DAVIS-2017-Test Synthetic Videos)
| Method | PSNR↑ | SSIM↑ | \(E_{warp}\)↓ |
|---|---|---|---|
| ConvLSTM | 24.110 | 0.9256 | 0.1369 |
| DVP | 24.136 | 0.9263 | 0.1415 |
| STE | 26.138 | 0.9321 | 0.1117 |
| Deflicker | 23.932 | 0.9243 | 0.0840 |
| BlazeBVD | 28.609 | 0.9638 | 0.0825 |
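For reference, the PSNR column follows the standard definition \(10 \log_{10}(\mathrm{MAX}^2 / \mathrm{MSE})\). The sketch below assumes 8-bit frames (peak value 255) and per-frame evaluation. It is not taken from the paper's evaluation code, and the function name is illustrative.

```python
import numpy as np

def psnr(reference, restored, peak=255.0):
    """Peak signal-to-noise ratio in dB between two same-shaped frames."""
    ref = np.asarray(reference, dtype=np.float64)
    out = np.asarray(restored, dtype=np.float64)
    mse = np.mean((ref - out) ** 2)
    if mse == 0:
        return float("inf")  # identical frames
    return float(10.0 * np.log10(peak ** 2 / mse))
```

Because PSNR is logarithmic in the mean squared error, the roughly 2.5 dB gap between BlazeBVD and STE in the table corresponds to a substantially lower MSE rather than a small linear difference.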
Blind Deflickering Dataset Synthetic Videos (Average)
| Method | PSNR↑ | SSIM↑ | \(E_{warp}\)↓ |
|---|---|---|---|
| Deflicker | 25.30 | 0.9349 | 0.0856 |
| BlazeBVD | 28.15 | 0.9538 | 0.0897 |
Efficiency Comparison
| Metric | Deflicker | BlazeBVD |
|---|---|---|
| Inference Time (80 frames) | 614.19s | 58.37s (approx. 10.5× speedup) |
| GPU Memory | 5204M | 1389M |
| MACs | 260.7G | 251.4G |
| Params | 12.48M | 17.77M |
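The ratios quoted elsewhere in these notes can be recomputed directly from the table. The snippet below only restates the table's numbers. The dictionary keys are illustrative labels, and the memory unit is kept as the table's "M".

```python
# Values copied from the efficiency table above.
deflicker = {"time_s": 614.19, "mem_m": 5204, "macs_g": 260.7, "params_m": 12.48}
blazebvd = {"time_s": 58.37, "mem_m": 1389, "macs_g": 251.4, "params_m": 17.77}

speedup = deflicker["time_s"] / blazebvd["time_s"]
mem_saving = 1 - blazebvd["mem_m"] / deflicker["mem_m"]
macs_saving = 1 - blazebvd["macs_g"] / deflicker["macs_g"]
param_growth = blazebvd["params_m"] / deflicker["params_m"] - 1
sec_per_frame = blazebvd["time_s"] / 80

print(f"speedup:       {speedup:.1f}x")        # 10.5x
print(f"memory saving: {mem_saving:.1%}")      # 73.3%
print(f"MACs saving:   {macs_saving:.1%}")     # 3.6%
print(f"param growth:  {param_growth:.1%}")    # 42.4%
print(f"BlazeBVD time per frame: {sec_per_frame:.2f} s")
```

The speedup therefore does not come from a smaller model: MACs are nearly equal and the parameter count is higher. Most of the gap is consistent with avoiding the per-video atlas construction noted in the motivation.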
User Study (Real-world + Generation Videos)
In preference experiments involving 50 users across 7 datasets, BlazeBVD was preferred over Deflicker on every dataset, with preference rates ranging from 61.33% (Expert) to 80.75% (OldAnime).
Highlights & Insights
- Ingenious Representation Transformation: It transforms the complex spatiotemporal problem in pixel space into a temporal smoothing problem in the illumination histogram space, utilizing classical STE to provide high-quality priors for deep learning, thereby substantially reducing learning difficulty.
- Addressing Local Exposure for the First Time: By detecting over-/under-exposure with exposure maps and transferring texture from neighboring frames via optical flow warping, it recovers the texture lost in over-/under-exposed regions, a problem largely ignored by existing methods.
- Fast Inference: The combination of frame-by-frame 2D network processing and a lightweight 3D network for temporal refinement achieves a roughly 10.5× speedup over Deflicker while reducing GPU memory consumption by about 73% (5204M to 1389M).
- Adaptive Mask-weighted Warping Loss: Weighting the exposed regions during temporal consistency optimization prevents the blur and color distortion commonly found in Deflicker.
Limitations & Future Work
- The local texture recovery of LFRM depends on the accuracy of the optical flow and the quality of neighboring frames. If several consecutive frames are all flickering or badly over-/under-exposed, no usable texture can be retrieved from neighbors.
- The parameter size is slightly larger than Deflicker (17.77M vs. 12.48M), though the actual inference speed and GPU memory consumption are significantly better.
- \(E_{warp}\) is slightly higher than Deflicker's in some settings (0.0897 vs. 0.0856 on the Blind Deflickering Dataset). The authors attribute this to Deflicker's blurry outputs masking optical flow estimation errors, but this explanation is not rigorously verified.
- The thresholds \(\epsilon_1, \epsilon_2\) for the exposure maps are manually set, which may require adjustment across different scenes.
Related Work & Insights
| Method | Type | Flicker Type | Local Exposure | Speed |
|---|---|---|---|---|
| STE | Traditional Filtering | Mild | Not Supported | Fast |
| ConvLSTM / DVP | Temporal Consistency | Reference Required | Not Supported | Medium |
| Deflicker | Neural Atlas | General | Not Supported | Slow (614s/80 frames) |
| BlazeBVD | Histogram Prior + 2D/3D Network | General | Supported | Fast (58s/80 frames) |
Inspirations & Connections
- The core takeaway is the paradigm of compact representation + classical prior: classical methods (such as STE), though limited in quality, provide correct prior signals, while deep networks perform fine-grained restoration under the guidance of these priors. This paradigm can be extended to other video enhancement tasks.
- As a compact representation of video flicker, the illumination histogram captures frame-to-frame illumination variations better than raw pixel values. The same representation could be applied to tasks such as video color consistency editing.
- The optical flow warping approach used for local exposure repair can be integrated with video inpainting techniques.
Rating
- Novelty: 4/5 (The concept of using histogram priors and STE to assist deep deflickering is novel, with a clear stage-wise decomposition)
- Experimental Thoroughness: 4/5 (Full coverage of synthetic, real-world, and generated videos, with a comprehensive user study and sufficient ablation experiments)
- Writing Quality: 4/5 (Clear structure with convincing motivation)
- Value: 4/5 (High practical value, achieving a 10× speedup and solving the local exposure issue for the first time)