Asymmetric Mask Scheme for Self-supervised Real Image Denoising

Conference: ECCV2024
arXiv: 2407.06514
Code: lll143653/amsnet
Area: Image Restoration
Keywords: self-supervised denoising, blind spot network, mask strategy, real image denoising, asymmetric scheme

TL;DR

The paper proposes AMSNet, an asymmetric mask scheme that uses a single mask during training and multiple complementary masks during inference. This removes the architectural and receptive-field constraints of blind spot networks and achieves state-of-the-art performance in self-supervised real image denoising.

Background & Motivation

Self-supervised denoising methods have attracted significant attention because they do not require paired training data, and the Blind Spot Network (BSN) is the classic paradigm. BSN assumes that noise is zero-mean and pixel-wise independent, and it prevents identity mapping (the network simply reproducing its noisy input) by excluding the center pixel via blind-spot convolution. However, BSN imposes strict constraints on network architectures:

Restricted Receptive Field: After blind-spot convolution, dilated convolutions or other specialized structures must be employed to further restrict the receptive field; otherwise, the center pixel information leaks back to the output via neighboring pixels, leading to identity mapping.
Loss of Structural Information: Excluding the center pixel inevitably damages structural details.
Limited Choice of Denoisers: State-of-the-art denoisers (such as Restormer and NAFNet) cannot be plugged directly into the BSN framework.

These limitations severely restrict the performance upper bound of BSN-based methods. Inspired by Masked Autoencoders (MAE), the authors explore whether masking operations can replace blind-spot convolutions to resolve identity mapping, thereby freeing the framework from structural limitations.

Core Problem

How can identity mapping be prevented in self-supervised denoising without restricting the receptive field or architecture of the denoising network, so that high-performance modern denoisers can be adopted freely?

Method

Training Phase: Single Mask Scheme

The core idea is to mask raw pixels directly at the input stage, fundamentally preventing them from participating in their own reconstruction process:

Generate a random binary mask matrix \(M\) (where approximately 50% of the pixels are masked to zero) for the noisy image \(I_N\).
Feed the masked image \(M \odot I_N\) into the denoiser \(D_E\).
Compute the reconstruction loss only on the masked locations, which are indicated by the complementary mask \(\tilde{M} = \mathbb{I} - M\).

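Mask self-supervised loss, written for a PD sub-sample \(I_s\) and its mask \(M_s\) (sub-samples are introduced in the next subsection):

\[\mathcal{L}_m(M_s, I_s) = \|\tilde{M}_s \odot (D_E(M_s \odot I_s, \theta) - I_s)\|_1\]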

Key point: Since the masked pixels are entirely reconstructed from surrounding unmasked pixels, identity mapping is inherently avoided, and restricting the receptive field of the network is no longer necessary.
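The masked loss can be illustrated with a minimal NumPy sketch. It assumes a single-channel image, a mask convention of 1 = visible and 0 = masked, and a denoiser passed in as a plain callable; the function names `random_mask`, `masked_l1_loss`, and `identity_denoiser` are hypothetical and not taken from the paper or its code.

```python
import numpy as np

def random_mask(shape, mask_ratio=0.5, rng=None):
    """Return a binary mask M: 1 = pixel stays visible, 0 = pixel is masked."""
    rng = np.random.default_rng() if rng is None else rng
    return (rng.random(shape) >= mask_ratio).astype(np.float64)

def masked_l1_loss(denoiser, noisy, mask):
    """L1 loss restricted to the masked positions (where 1 - mask equals 1)."""
    masked_positions = 1.0 - mask          # complementary mask
    prediction = denoiser(mask * noisy)    # the denoiser only sees visible pixels
    return np.abs(masked_positions * (prediction - noisy)).sum()

def identity_denoiser(x):
    """Degenerate 'network' that copies its input (the identity mapping)."""
    return x
```

With `identity_denoiser`, the prediction at every masked position is zero, so the loss equals the summed magnitude of the hidden noisy pixels rather than zero. Copying the input therefore no longer minimizes the objective, which is the point made above.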

Handling Spatial Correlation of Real Noise

Real-world noise typically violates the pixel-wise independence assumption. Following AP-BSN, a Pixel Downsampling (PD) strategy is introduced: the input image is sampled with stride \(s\) at each of the \(s^2\) possible pixel offsets, yielding \(s^2\) sub-samples \(I_s\). Because neighboring pixels in a sub-sample were \(s\) pixels apart in the original image, PD largely breaks the spatial correlation of the noise, so the noise within each sub-sample is approximately pixel-wise independent. Masks are then generated and applied independently to each sub-sample.
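The PD step and its inverse can be sketched as follows. This is a simplified illustration that assumes a single-channel image whose height and width are divisible by \(s\); the helper names `pixel_downsample` and `pixel_upsample` are hypothetical, and actual implementations may arrange the sub-samples differently.

```python
import numpy as np

def pixel_downsample(img, s):
    """Split an (H, W) image into s*s sub-samples of shape (H/s, W/s)."""
    h, w = img.shape
    assert h % s == 0 and w % s == 0, "H and W must be divisible by s"
    # Sub-sample (i, j) keeps rows i, i+s, ... and columns j, j+s, ...
    return np.stack([img[i::s, j::s] for i in range(s) for j in range(s)])

def pixel_upsample(subs, s):
    """Inverse of pixel_downsample: interleave the sub-samples back."""
    n, hs, ws = subs.shape
    assert n == s * s, "expected s*s sub-samples"
    img = np.empty((hs * s, ws * s), dtype=subs.dtype)
    for idx in range(n):
        i, j = divmod(idx, s)
        img[i::s, j::s] = subs[idx]
    return img
```

Applying `pixel_upsample` to the output of `pixel_downsample` with the same stride returns the original image, which corresponds to the inverse operation \(P_s^{-1}\) used at inference.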

Inference Phase: Multi Mask Scheme

During training, a single masked pass restores only the masked pixels, which form a subset of the image. To achieve full-image denoising during inference, a Multi-branch Mask complementary Denoising Block (MMDB) is proposed:

Employ \(k\) denoising branches (default \(k=2\)), sharing the same denoiser \(D_E\).
The masks applied to the branches are mutually complementary and non-overlapping: \(\sum_{i=1}^{k} \tilde{M}_s^i = \mathbb{I}\).
The final denoised sub-sample is obtained by summing the outputs: \(D_M(I_s) = \sum_{i=1}^{k} \tilde{M}_s^i \odot D_E(M_s^i \odot I_s, \theta)\).
Finally, the original resolution is reconstructed via inverse pixel downsampling \(P_s^{-1}\).
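The branch combination above can be sketched in NumPy as follows. The sketch assumes the same mask convention as the loss (1 = visible, 0 = masked) and draws the complementary masks by randomly assigning each pixel to one branch, which is an illustrative choice rather than the paper's mask pattern; `complementary_masks` and `multi_mask_denoise` are hypothetical names.

```python
import numpy as np

def complementary_masks(shape, k=2, rng=None):
    """Return k visibility masks whose masked sets partition the image."""
    rng = np.random.default_rng() if rng is None else rng
    owner = rng.integers(0, k, size=shape)  # branch in which each pixel is masked
    return [(owner != i).astype(np.float64) for i in range(k)]

def multi_mask_denoise(denoiser, sub_sample, masks):
    """Sum over branches of (1 - M_i) * D(M_i * I_s)."""
    out = np.zeros(sub_sample.shape, dtype=np.float64)
    for m in masks:
        # Keep each branch's prediction only where that branch was masked.
        out += (1.0 - m) * denoiser(m * sub_sample)
    return out
```

Because every pixel is masked in exactly one branch, each output pixel comes from the single forward pass in which the shared denoiser did not see that pixel's noisy value.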

Checkerboard Effect Elimination

The PD strategy disrupts structural consistency, often introducing checkerboard artifacts into the denoised outputs. A two-stage remedy is introduced:

Prior Smoothing Loss \(\mathcal{L}_p\): Fine-tune the base model AMSNet-B with the total loss \(\mathcal{L}_t = \lambda \mathcal{L}_p(I_{DN}) + \|I_{DN} - I_N\|_1\), where \(\lambda=0.01\).
Random Replacing Refinement Strategy \(\mathcal{R}^3\): Further suppress checkerboard patterns during inference.

This yields four model variants: AMSNet-B (Base), AMSNet-P (+ smoothing loss fine-tuning), AMSNet-B-E (+ refinement), and AMSNet-P-E (+ both, final version).
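The structure of the fine-tuning objective \(\mathcal{L}_t\) can be illustrated with a small sketch. The form of \(\mathcal{L}_p\) is not specified in this summary, so the code substitutes anisotropic total variation purely as a stand-in smoothness prior; `total_variation` and `fine_tune_loss` are hypothetical names, and only the weighting \(\lambda = 0.01\) and the L1 fidelity term come from the text above.

```python
import numpy as np

def total_variation(img):
    """Stand-in smoothness prior (an assumption, not the paper's L_p)."""
    dh = np.abs(np.diff(img, axis=0)).sum()  # vertical neighbor differences
    dw = np.abs(np.diff(img, axis=1)).sum()  # horizontal neighbor differences
    return dh + dw

def fine_tune_loss(denoised, noisy, lam=0.01, prior=total_variation):
    """L_t = lam * L_p(I_DN) + ||I_DN - I_N||_1."""
    return lam * prior(denoised) + np.abs(denoised - noisy).sum()
```

The small weight keeps the fidelity term dominant, so under this stand-in the prior only nudges the output away from high-frequency patterns such as checkerboards.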

Key Experimental Results

Main Results (SIDD / DND / PolyU)

Method	SIDD Validation (PSNR dB / SSIM)	SIDD Benchmark (PSNR dB / SSIM)	DND Benchmark (PSNR dB / SSIM)
AP-BSN+\(\mathcal{R}^3\)	36.74/0.850	36.91/0.931	38.09/0.937
LG-BPN+\(\mathcal{R}^3\)	37.31/0.886	37.28/0.936	38.43/0.942
BNN-LAN	37.39/0.883	37.41/0.934	38.18/0.939
AMSNet-P-E	37.93/0.895	37.87/0.941	38.70/0.947

AMSNet-P-E also achieves state-of-the-art performance on the PolyU dataset with 37.92 dB / 0.9645 SSIM.

Ablation Study & Key Findings

Identity Mapping Validation: When AP-BSN is paired with a denoiser whose receptive field is unrestricted, its PSNR drops to 20.91 dB because the network collapses to identity mapping. AMSNet, in contrast, maintains 37.11 dB, indicating that input masking effectively prevents identity mapping.
Denoiser Versatility (PSNR in dB): Restormer (37.93) > DeamNet (37.80) > NAFNet (37.10) > UNet (36.94) \(\approx\) DnCNN (36.93). All five denoisers work within the framework, showing that modern, powerful architectures can be integrated freely.
Optimal Masking Ratio: A masking ratio of approximately 50% (corresponding to \(k=2\) branches) yields the best performance.
Smoothing Loss Fine-Tuning: Fine-tuning with \(\mathcal{L}_t\) improves PSNR by approximately 0.1 dB.
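One way to read the link between the masking ratio and the branch count, assuming the \(k\) inference masks each hide the same fraction \(r\) of pixels (an assumption; the method description only states that the masks are complementary and non-overlapping): since every pixel is masked in exactly one branch, the masked fractions sum to one, so

\[k \cdot r = 1 \;\Rightarrow\; r = \frac{1}{k}, \qquad k = 2 \;\Rightarrow\; r = 50\%.\]

Under this reading, the roughly 50% ratio used in training matches the fraction hidden in each of the two inference branches.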

Highlights & Insights

Elegant Formulation: Drawing inspiration from MAE, this work brings the masking concept to self-supervised denoising. Utilizing input masks instead of blind-spot convolutions completely liberates the network from receptive field and structural constraints.
Asymmetric Train-Test Strategy: A single mask is used during training to keep optimization simple and training cost low, while multiple complementary masks are applied during inference to reconstruct the whole image.
Denoiser Agnostic: The framework allows modern architectures like Restormer and NAFNet to be incorporated in a plug-and-play manner, establishing high extensibility.
Rigorous Ablation: The identity mapping experiments explicitly illustrate the limitations of classic BSNs and highlight the core advantages of the proposed masking scheme.

Limitations & Future Work

Doubled Inference Overhead: The \(k=2\) setting requires two separate forward passes during inference, doubling the computational cost.
Post-Processing for Artifacts: The checkerboard artifacts originating from PD still necessitate extra smooth-loss fine-tuning and refinement strategies, adding complexity to the pipeline.
Asymmetric PD Strides: Training with PD stride 5 (\(P_5\)) but running inference with stride 2 (\(P_2\)) is an empirical choice that requires heuristic tuning.
Restricted to sRGB Denoising: The generalization capability has not been verified on RAW image denoising or other low-level restoration tasks.
Fixed Masking Ratio: The masking ratio is fixed at 50%, leaving adaptive or learnable masking strategies unexplored.

Aspect	BSN-based Methods (AP-BSN, LG-BPN)	AMSNet
Identity Mapping Avoidance	Blind-spot + dilated conv. to restrict receptive field	Input masks to directly obstruct identity mappings
Denoiser Constraints	Strictly restricted, standard convolutions cannot be used	Unconstrained, any off-the-shelf denoiser can be integrated
Real Noise Handling	PD + BSN	PD + Mask
Inference Cost	Single forward pass	\(k\) forward passes (default 2)
SIDD Val PSNR (dB)	36.74 (AP-BSN+\(\mathcal{R}^3\)) / 37.31 (LG-BPN+\(\mathcal{R}^3\))	37.93 (AMSNet-P-E)

Compared to older self-supervised baseline frameworks (e.g., Noise2Void, Self2Self), AMSNet provides a substantial performance leap in real-world scenarios, primarily driven by enabling the usage of advanced modern denoisers.

Insights & Connections

The success of this masking strategy demonstrates that key design principles of MAE are highly generalizable to image restoration tasks, pointing to potential future adoption in super-resolution and deblurring.
The asymmetric train-test paradigm is effective: it simplifies the reconstruction task during training and assembles multiple complementary predictions during inference to obtain the final output.
The checkerboard artifacts expose persistent drawbacks within the PD strategy; future works may explore alternative approaches to decouple spatial noise correlation without spatial downsampling.

Rating

Novelty: 8/10 — Seamlessly transfers MAE's masking concept to self-supervised denoising to lift the structural limitations of BSNs, with a clear and effective execution.
Experimental Thoroughness: 8/10 — Multiple evaluation benchmarks, five different denoisers, and extensive ablation studies, particularly the identity mapping experiment.
Writing Quality: 7/10 — Lucid explanations, though the formulation of some mathematical derivations could be further streamlined.
Value: 7/10 — Offers a highly flexible framework and opens new paths for self-supervised restoration, though doubled inference cost remains a minor pain point for real-life applications.