Blind2Sound: Self-Supervised Image Denoising without Residual Noise

  • Conference: ICCV 2025
  • arXiv: 2303.05183
  • Authors: Zejin Wang, Jiazheng Liu, Hao Zhai, Hua Han (Institute of Automation, Chinese Academy of Sciences)
  • Code: Provided in the paper's supplementary material
  • Area: Image Restoration
  • Keywords: Self-supervised denoising, Poisson-Gaussian noise, blind denoising, residual noise elimination, noise awareness

TL;DR

This paper proposes the Blind2Sound framework, which perceives noise levels and achieves personalized denoising via an adaptive re-visible loss, complemented by a Cramer Gaussian loss that improves noise parameter estimation accuracy. The framework eliminates residual noise in self-supervised blind denoising and outperforms all contemporary self-supervised methods and even some supervised baselines.

Background & Motivation

Problem Definition

Noise in practical imaging sensors typically follows a Poisson-Gaussian mixture model: \(\mathbf{y} = \alpha P + N\), where \(P \sim \text{Poisson}(\mathbf{x}/\alpha)\) is signal-dependent Poisson noise and \(N \sim \mathcal{N}(0, \sigma^2)\) is signal-independent Gaussian noise. Self-supervised blind denoising requires learning to denoise from a single noisy image without relying on paired clean data.
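As a concrete illustration of this noise model, it can be sampled as follows (a minimal NumPy sketch; the function name and interface are ours, not from the paper):

```python
import numpy as np

def add_poisson_gaussian_noise(x, alpha, sigma, rng=None):
    """Corrupt a clean image x with the Poisson-Gaussian model
    y = alpha * P + N, where P ~ Poisson(x / alpha) and N ~ N(0, sigma^2).
    The conditional variance of y given x is alpha * x + sigma^2."""
    rng = np.random.default_rng() if rng is None else rng
    poisson_part = alpha * rng.poisson(x / alpha)         # signal-dependent
    gaussian_part = rng.normal(0.0, sigma, size=x.shape)  # signal-independent
    return poisson_part + gaussian_part
```

Note that the scaled Poisson term preserves the mean (\(\mathbb{E}[\alpha P] = \mathbf{x}\)) while its variance scales with the signal (\(\text{Var}[\alpha P] = \alpha \mathbf{x}\)), which is what makes the noise signal-dependent.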

Limitations of Prior Work

Pseudo-supervised pair methods (e.g., Noise2Noise variants): Constructing training pairs from a single noisy image doubly corrupts the signal, degrading performance.

Blind-spot methods (e.g., N2V, BSN series): Input masking causes information loss, producing severe artifacts.

Blind2Unblind: Achieves lossless denoising via a re-visible transition, but the MSE objective cannot perceive noise levels, leading to pixel-level greedy fitting that results in pronounced residual noise.

FBI-Denoiser: The Gaussian loss lacks fine-grained constraints, yielding insufficient noise estimation accuracy; moreover, post-processing steps amplify prior denoising errors.

Core Motivation

MSE as an objective function cannot adaptively adjust denoising strength based on dynamic noise levels. A loss function that is compatible with the lossless framework while performing personalized denoising according to perceived noise levels is needed to fundamentally eliminate residual noise.

Method

Overall Architecture

Blind2Sound consists of two modules:

  1. Denoising network \(f_\omega(\cdot)\): An improved U-Net that outputs the means \((\mu_m, \mu_v)\) and covariances \((\Sigma_m, \Sigma_v)\) of the denoised estimates for the masked branch and the visible branch, respectively.
  2. Noise estimator \(g_\theta(\cdot)\): Predicts Poisson-Gaussian noise parameters \((\alpha, \sigma_1, \sigma_2)\).

Both modules are jointly optimized during training; the noise estimator and masked branch are removed at inference, and the denoiser directly generates results from the original noisy image.
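The two-module interface can be sketched as below; the layer choices, names, and sizes are illustrative stand-ins, not the paper's actual U-Net or estimator architecture:

```python
import torch
import torch.nn as nn

class TinyDenoiser(nn.Module):
    """Stand-in for f_w: predicts a per-pixel mean and (diagonal) log-variance
    from an input image. A toy single-layer body, for interface only."""
    def __init__(self, ch=1):
        super().__init__()
        self.body = nn.Conv2d(ch, 2 * ch, 3, padding=1)

    def forward(self, y):
        mu, logvar = self.body(y).chunk(2, dim=1)  # split channels into mean / log-variance
        return mu, logvar

class TinyNoiseEstimator(nn.Module):
    """Stand-in for g_theta: predicts global noise parameters (alpha, sigma1, sigma2).
    Softplus keeps the estimates positive."""
    def __init__(self, ch=1):
        super().__init__()
        self.head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                  nn.Linear(ch, 3), nn.Softplus())

    def forward(self, y):
        alpha, sigma1, sigma2 = self.head(y).unbind(dim=1)
        return alpha, sigma1, sigma2
```

At inference only the denoiser's forward pass on \(\mathbf{y}\) remains, which is why the method adds no test-time cost over Blind2Unblind.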

Adaptive Re-Visible Loss

The core idea is to model the masked branch and the visible branch as two independent Gaussian generative processes:

  • Masked branch: \(\mathbf{z}_1 \sim \mathcal{N}(\mathbf{z}_1 | \mu_m, \Sigma_m)\), generating latent clean images from the masked noisy volume \(\Omega\mathbf{y}\).
  • Visible branch: \(\mathbf{z}_2 \sim \mathcal{N}(\mathbf{z}_2 | \mu_v, \Sigma_v)\), generated from the original noisy image \(\mathbf{y}\) with gradients disabled (excluded from backpropagation).

The noise model is explicitly incorporated via marginal likelihood:

\[p(\mathbf{y}) = \int p(\mathbf{y}|\mathbf{x}) p(\mathbf{x}|\mathbf{y}, \Omega\mathbf{y}) d\mathbf{x}\]

Minimizing the negative log-likelihood yields the adaptive re-visible loss:

\[\mathcal{L}_{arv} = \frac{1}{2}[(\mathbf{y} - \mu_y)^T \Sigma_y^{-1} (\mathbf{y} - \mu_y)] + \frac{1}{2}\log|\Sigma_y| + \text{const}\]

where \(\mu_y = \frac{\mu_m + \lambda \mu_v}{1+\lambda}\) and \(\lambda\) is a visibility factor that grows progressively from 3 to 11.

Key designs:

  • The two branches are modeled as i.i.d., decoupling their correlation and preventing the masked results from suppressing visible denoising.
  • Gradient analysis on the intermediate medium \(\mu_m\) indicates that gradients through \(\text{diag}(\alpha \mu_y)\) must be disabled to stabilize training.
  • At convergence, the optimal estimate is \(\tilde{\mathbf{x}} = \frac{\mu_m + \lambda \mu_v}{1+\lambda}\), satisfying \(\mu_m \leq \tilde{\mathbf{x}} \leq \mu_v\).
  • No post-processing MAP as in Laine19 is required, since the loss itself already incorporates information from \(\mathbf{y}\).
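Under the simplifying assumption of a diagonal covariance given by the Poisson-Gaussian model, \(\Sigma_y \approx \text{diag}(\alpha \mu_y + \sigma^2)\), the loss can be sketched as follows (scalar noise parameters assumed; the detach calls implement the gradient-disabling designs above):

```python
import torch

def adaptive_revisible_loss(y, mu_m, mu_v, alpha, sigma, lam):
    """Sketch of L_arv with a diagonal Sigma_y = diag(alpha * mu_y + sigma^2).
    mu_v is detached (visible branch excluded from backprop), and the variance
    term is detached per the paper's gradient analysis, so gradients reach
    mu_m only through the quadratic term's mean."""
    mu_v = mu_v.detach()
    mu_y = (mu_m + lam * mu_v) / (1.0 + lam)
    var_y = (alpha * mu_y + sigma ** 2).detach().clamp_min(1e-6)
    quad = (y - mu_y) ** 2 / var_y
    return 0.5 * (quad + torch.log(var_y)).mean()  # constant term dropped
```

Because `var_y` appears in the denominator, pixels the estimator deems noisier are penalized less for deviating from \(\mathbf{y}\): this is the mechanism that adapts denoising strength to the perceived noise level, which plain MSE cannot do.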

Cramer Gaussian Loss

To address the limitation of the original Gaussian loss—which estimates noise only at the global image level and ignores local noise knowledge—fine-grained constraints are introduced:

Single-channel images (e.g., grayscale): Sub-patch constraints are employed by cropping four overlapping sub-patches (3/4 the size of the original image) from the four corners. After GAT transformation, the noise variance of both sub-patches and the full image should approximate unit variance:

\[\mathcal{L}_{est} = \sum_{s=1}^{4} \|\eta(G_{g_\theta}(\mathbf{y}_s)) - 1\|_2^2 + \|\eta(G_{g_\theta}(\mathbf{y})) - 1\|_2^2\]
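A minimal sketch of the single-channel constraint, assuming the standard generalized Anscombe transform for \(G_{g_\theta}\) and a crude pixel-difference noise-variance estimator standing in for \(\eta(\cdot)\) (the paper's exact estimator may differ); "3/4 the size" is interpreted here as 3/4 of each side:

```python
import numpy as np

def gat(y, alpha, sigma):
    """Generalized Anscombe transform: at the correct (alpha, sigma), the
    transformed noise has approximately unit variance."""
    return (2.0 / alpha) * np.sqrt(np.maximum(alpha * y + 0.375 * alpha ** 2 + sigma ** 2, 0.0))

def noise_var(z):
    """Placeholder for eta(.): noise variance from horizontal pixel
    differences (Var[z_i - z_j] = 2 * noise variance for i.i.d. noise)."""
    return 0.5 * np.diff(z, axis=-1).var()

def cramer_gaussian_loss_gray(y, alpha, sigma):
    """Unit-variance constraints on the full image plus four overlapping
    corner sub-patches."""
    h, w = y.shape
    ph, pw = (3 * h) // 4, (3 * w) // 4
    patches = [y[:ph, :pw], y[:ph, -pw:], y[-ph:, :pw], y[-ph:, -pw:]]
    loss = (noise_var(gat(y, alpha, sigma)) - 1.0) ** 2
    for p in patches:
        loss += (noise_var(gat(p, alpha, sigma)) - 1.0) ** 2
    return loss
```

The loss is minimized when the predicted \((\alpha, \sigma)\) variance-stabilize the image both globally and on every sub-patch, which is what rules out parameter pairs that only fit the global statistics.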

Multi-channel images (e.g., sRGB): A cross-channel noise level consistency constraint is introduced to prevent estimation errors across channels from canceling each other out:

\[\mathcal{L}_{est} = \sum_{j \neq k}^{c} \|\eta(G_{g_\theta}(\mathbf{y}_j)) - 1\|_2^2 + \|\eta(G_{g_\theta}(\mathbf{y}_j)) - \eta(G_{g_\theta}(\mathbf{y}_k))\|_2^2\]

The Cramer Gaussian loss serves only as a regularization term (weight 0.01), as the actual noise level of the denoised image may differ from that of the original input.
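The multi-channel constraint can be sketched similarly, again assuming a standard generalized Anscombe transform and a pixel-difference estimator in place of \(\eta(\cdot)\); sharing a single \((\alpha, \sigma)\) across channels is our simplification, not necessarily the paper's setup:

```python
import numpy as np

def gat(y, alpha, sigma):
    # Generalized Anscombe transform (approximately unit variance at correct parameters).
    return (2.0 / alpha) * np.sqrt(np.maximum(alpha * y + 0.375 * alpha ** 2 + sigma ** 2, 0.0))

def noise_var(z):
    # Placeholder for eta(.): noise variance from horizontal pixel differences.
    return 0.5 * np.diff(z, axis=-1).var()

def cramer_gaussian_loss_rgb(y, alpha, sigma):
    """Per-channel unit-variance terms plus pairwise cross-channel consistency,
    so that per-channel estimation errors cannot cancel out. y: (C, H, W)."""
    etas = [noise_var(gat(y[c], alpha, sigma)) for c in range(y.shape[0])]
    loss = sum((e - 1.0) ** 2 for e in etas)
    loss += sum((etas[j] - etas[k]) ** 2
                for j in range(len(etas)) for k in range(len(etas)) if j != k)
    return loss
```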

Total Loss

\[\mathcal{L} = \mathcal{L}_{arv} + 0.01 \cdot \mathcal{L}_{est}\]

Key Experimental Results

Noise Estimation Accuracy

On BSD68 (grayscale) and CBSD68 (sRGB), compared with FBI-D, the Cramer Gaussian loss:

  • On grayscale images, eliminates the severe Gaussian parameter estimation errors of FBI-D.
  • On sRGB images, the cross-channel constraint produces predictions closer to the ground truth.

Synthetic Grayscale Denoising (PSNR/SSIM)

| Noise | Method | BSD68 | Set12 | Urban100 |
|---|---|---|---|---|
| PG1 | Blind2Unblind | 30.61/0.869 | 31.45/0.880 | 30.70/0.900 |
| PG1 | Blind2Sound | 30.83/0.875 | 31.68/0.886 | 31.14/0.908 |
| PG3 | Blind2Unblind | 27.02/0.757 | 27.65/0.796 | 26.54/0.805 |
| PG3 | Blind2Sound | 27.17/0.766 | 27.96/0.805 | 26.96/0.819 |
  • Maximum gain of 0.44 dB and minimum gain of 0.15 dB over Blind2Unblind.
  • Surpasses supervised baselines N2C and N2N on Set12 and Urban100, with a maximum gain of 0.4 dB.

Synthetic sRGB Denoising (PSNR/SSIM)

| Noise | Method | KODAK | Set14 | BSD300 |
|---|---|---|---|---|
| PG1 | Blind2Unblind | 33.88/0.915 | 32.47/0.886 | 32.53/0.913 |
| PG1 | Blind2Sound | 34.23/0.920 | 32.75/0.896 | 33.00/0.921 |

Real-World Denoising (PSNR/SSIM)

| Method | SIDD Benchmark (RAW) | SIDD Validation (RAW) | FMD Confocal |
|---|---|---|---|
| N2C (supervised) | 50.61/0.991 | 51.19/0.991 | 38.40/0.966 |
| Blind2Unblind | 50.79/0.991 | 51.36/0.992 | 38.44/0.964 |
| Blind2Sound | 50.92/0.991 | 51.50/0.992 | 38.46/0.965 |
  • Outperforms all self-supervised methods and the supervised baseline on SIDD RAW.
  • Nearly 0.3 dB gain over FBI-D in the RAW space.

Ablation Study

| Experiment | Finding |
|---|---|
| Granularity size | Fine-grained sub-patch constraints improve coarse-grained estimation accuracy, but overly small patches provide insufficient noise context. |
| Cramer loss weight | 0.01 is optimal; both 0 (no regularization) and 100 (over-regularization) are inferior. |
| Training scheme | Joint training > fixed pre-trained estimator ≈ fixed ground-truth noise, with a larger gap at low noise. |
| Noise model | The enhanced model \(\mathcal{M}_E\) shows a clear advantage at high noise levels. |
| Branch independence | i.i.d. (independent) modeling greatly outperforms non-i.i.d., validating the decoupled design. |
| Visibility factor | \(\lambda_f = 11\) is optimal; performance does not increase monotonically with \(\lambda_f\). |

Highlights & Insights

  1. Unifying noise awareness with lossless denoising: Embedding noise level estimation into the re-visible loss enables adaptive adjustment of denoising strength—this is the key to eliminating residual noise.
  2. Zero inference overhead: The noise estimator and masked branch are used only during training; at inference, the pipeline is identical to Blind2Unblind with no additional computation.
  3. Bayesian re-formulation: The MSE loss of Blind2Unblind is elevated to the negative log-likelihood of a mixed Gaussian marginal likelihood, yielding a theoretically sounder objective.
  4. Cross-channel/sub-patch fine-grained constraints: The Cramer Gaussian loss narrows the solution space via multi-scale noise consistency, resolving the inaccurate estimation issue of FBI-D.
  5. Surpassing supervised methods on grayscale: Outperforming N2C/N2N on Set12 and Urban100 demonstrates the potential of self-supervised methods under noise-aware training.

Limitations & Future Work

  1. Limited gains in sRGB space: Denoising across channels is more challenging at high noise levels, yielding smaller gains than in grayscale; cross-channel modeling can be further strengthened.
  2. Noise model assumption: The method relies on the Poisson-Gaussian model and may require extensions for more complex real-world noise (e.g., spatially correlated noise).
  3. Visibility factor tuning: The initial and final values of \(\lambda\) must be set manually; an adaptive schedule may be more effective.
  4. Single network architecture: Only an improved U-Net is employed; incorporating a stronger backbone (e.g., Transformer) may yield further improvements.
  5. No video denoising: The framework targets static images and does not exploit temporal information.

Related Work

  • Blind2Unblind (CVPR 2022): The direct predecessor; Blind2Sound extends its re-visible framework with noise awareness.
  • FBI-Denoiser (CVPR 2021): The first to introduce a Gaussian loss for noise estimation within a self-supervised framework, but with insufficient accuracy.
  • Noise2Void / Noise2Self: Foundational works in blind-spot denoising, but suffer from severe information loss.
  • NBR2NBR (CVPR 2021): Constructs training pairs via sub-sampling, but the neighboring-pixel approximation causes over-smoothing.
  • GAT + BM3D: Traditional methods, now comprehensively outperformed by data-driven approaches.
  • Insight: Explicitly incorporating noise model knowledge into self-supervised loss design is an effective path to performance improvement; future work could explore applying this idea to other degradation types (e.g., blur, compression artifacts).

Rating

  • Novelty: Pending
  • Experimental Thoroughness: Pending
  • Writing Quality: Pending
  • Value: Pending