
gQIR: Generative Quanta Image Reconstruction

Conference: CVPR 2026 | arXiv: 2602.20417 | Code: https://github.com/Aryan-Garg/gQIR | Area: Image Generation / Computational Imaging
Keywords: Single-photon imaging, SPAD, diffusion models, denoising & demosaicing, low-light reconstruction, VAE alignment, burst fusion

TL;DR

This paper proposes gQIR, a modular three-stage framework that adapts large-scale text-to-image (T2I) diffusion models to the extreme photon-limited domain of SPAD sensors. It employs a quanta-aligned VAE (with a frozen encoder copy to prevent collapse), an adversarially fine-tuned LoRA U-Net for single-step generation, and a latent-space FusionViT for spatiotemporal fusion, enabling high-quality color image and video reconstruction from extremely sparse binary photon events.

Background & Motivation

Background: SPAD sensors can capture images under extremely low illumination and at ultra-high frame rates (10k–100k fps), recording only whether a photon was detected per pixel (Bernoulli distribution), resulting in highly sparse and noisy raw quanta frames. Existing reconstruction methods (QBP, QUIVER, QuDI) rely on alignment-and-merging strategies or task-specific networks, without leveraging large-scale generative priors.

Limitations of Prior Work: (a) Conventional denoising networks (NAFNet, Restormer) are designed for Poisson-Gaussian noise and suffer from severe over-smoothing under Bernoulli quantization noise; (b) existing generative restoration models (InstantIR) perform well on standard degradations but completely fail in the photon-limited domain (PSNR of only 7.9 dB); (c) naively fine-tuning the VAE encoder of a diffusion model leads to encoder collapse: the trainable encoder simultaneously governs both prediction and supervision, causing rapid convergence to a constant output.

Key Challenge: A large domain gap exists between Bernoulli noise statistics and the continuous natural images used to train diffusion models, causing shortcut learning under direct fine-tuning.

Key Insight: The paper identifies the symmetric structure of the degradation-removal loss as the root cause of encoder collapse, and introduces a frozen encoder copy to break this symmetry.

Core Idea: A frozen encoder copy serves as a latent space alignment (LSA) anchor to prevent collapse, adversarial LoRA fine-tuning enables single-step generation, and a latent-space FusionViT performs spatiotemporal fusion.

Method

Overall Architecture

The input consists of SPAD binary frames (nano-burst: 7 frames averaged to yield 3-bit representations), and the output is a high-quality RGB image. Three stages are trained independently: Stage 1 fine-tunes the VAE encoder to achieve latent-space denoising and demosaicing alignment; Stage 2 adversarially fine-tunes a LoRA U-Net for single-step, high-perceptual-quality generation; Stage 3 trains a FusionViT for multi-frame spatiotemporal fusion in the latent space.
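
At inference, the three stages compose into a single forward pass. Below is a minimal sketch of that composition, assuming placeholder module names (`quanta_encoder`, `fusion_vit`, `lora_unet`, `vae_decoder`) rather than the authors' released API; the residual fusion scale follows the Stage 3 description later in this section.

```python
import torch

def gqir_infer(burst, quanta_encoder, fusion_vit, lora_unet, vae_decoder, delta=0.05):
    """Illustrative composition of the three gQIR stages at inference.

    burst: (T, C, H, W) stack of averaged 3-bit quanta frames, center frame at T // 2.
    Optical-flow alignment of the burst (part of Stage 3) is omitted for brevity.
    """
    with torch.no_grad():
        # Stage 1: quanta-aligned deterministic encoding of each frame.
        latents = torch.stack([quanta_encoder(frame[None]) for frame in burst])

        # Stage 3: latent-space spatiotemporal fusion, added residually
        # to the center-frame latent via a small learned scale (delta).
        center = latents[burst.shape[0] // 2]
        fused = center + delta * fusion_vit(latents)

        # Stage 2: single-step LoRA U-Net generation, decoded back to sRGB.
        return vae_decoder(lora_unet(fused))
```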

Key Designs

  1. Imaging Model:

    • Clean sRGB images are gamma-corrected as \(x_{lin} = x_{gt}^{2.2}\) to map into linear irradiance space.
    • SPAD output follows a Bernoulli distribution: \(x_{spad} = \text{Bern}(1 - e^{-\alpha \cdot x_{lin}})\), where \(\alpha=1.0\) corresponds to an expected PPP (photons per pixel) of 3.5.
    • A random Bayer pattern is applied, and \(N\) frames are averaged: \(x_{lq} = \frac{1}{N}\sum_{i=1}^{N} M_\pi[\text{Bern}(1-e^{-\alpha \cdot x_{lin}})]\); a simulation sketch follows this list.
  2. Quanta-Aligned VAE (Stage 1):

    • Deterministic mean encoding: \(\mu_\phi(x_{lq})\) is used in place of stochastic sampling to avoid variance amplification under Bernoulli noise.
    • Latent Space Alignment (LSA) loss: \(\mathcal{L}_{lsa} = \|\mu_{\phi^*}(x_{lq}) - \mu_\phi(x_{gt})\|_2^2\), where \(\mu_{\phi^*}\) is the trainable encoder applied to the quanta input and \(\mu_\phi\) denotes a frozen copy of the original pretrained encoder that provides a stable anchor.
    • Supplemented by pixel-space MSE (\(\lambda=10^3\)) and LPIPS (\(\lambda=2\)) losses.
    • Total loss: \(\mathcal{L} = 0.1\mathcal{L}_{lsa} + 10^3\mathcal{L}_{MSE} + 2\mathcal{L}_{perc}\)
    • Design Motivation: Existing degradation-removal formulations place the trainable encoder on both the prediction and supervision sides, yielding a degenerate optimum. The frozen copy fundamentally breaks this symmetry (see the Stage-1 loss sketch after this list).
  3. Adversarial LoRA Fine-Tuning (Stage 2):

    • A multi-scale ConvNeXt-Large discriminator \(\mathcal{V}_\theta\) is used to perform standard min-max GAN training against a LoRA-initialized U-Net.
    • Generator total loss: \(\mathcal{L}_{G} = \mathcal{L}_{adv} + \mathcal{L}_{perc} + \|\mathcal{D}(\mathcal{G}_{lora}(\mu_{\phi^*}(x_{lq}))) - x_{gt}\|_2^2\), where \(\mathcal{D}\) denotes the VAE decoder
    • Design Motivation: The extremely high acquisition rate of SPAD sensors demands single-step inference; LoRA initialization inherits diffusion weights to provide a stable starting point for GAN training (a generator-loss sketch follows this list).
  4. Latent-Space Burst Fusion (Stage 3 — FusionViT):

    • All burst frames \(X\) are first reconstructed using S1+S2, and optical flow is then estimated in the reconstructed domain using RAFT (pretrained on FlyingThings3D), avoiding the domain gap introduced by low-quality inputs.
    • A pseudo-3D miniViT with sub-quadratic window attention adaptively fuses frames along temporal and spatial axes, preventing motion blur caused by naive averaging.
    • The output is added residually to the center-frame latent encoding via a learnable scalar \(\delta=0.05\) before being passed to \(\mathcal{G}_{lora}\).
    • Training loss: \(\mathcal{L}_{fusion} = \|\mathcal{F}(\mu_{\phi^*}(X)) - \mu_\phi(x_{gt})\|_2^2 + \|\mathcal{D}(\mathcal{G}(\mathcal{F}(\mu_{\phi^*}(X)))) - x_{gt}\|_2^2 + \mathcal{L}_{perc}\), where \(\mathcal{F}\) denotes the FusionViT applied to the stacked burst latents
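
The imaging model in item 1 is straightforward to simulate. Here is a minimal PyTorch sketch, using an illustrative fixed RGGB-style Bayer mask in place of the paper's random pattern \(M_\pi\); all function names are placeholders.

```python
import torch

def simulate_spad_burst(x_gt, n_frames=7, alpha=1.0):
    """Simulate the averaged nano-burst of Bayer-masked binary SPAD frames (x_lq).

    x_gt: clean sRGB image in [0, 1], shape (3, H, W).
    alpha = 1.0 corresponds to the paper's expected PPP of 3.5.
    """
    # Gamma-correct into (approximately) linear irradiance.
    x_lin = x_gt.clamp(0, 1) ** 2.2

    # Per-pixel photon-detection probability: p = 1 - exp(-alpha * x_lin).
    p = 1.0 - torch.exp(-alpha * x_lin)

    # Illustrative RGGB-style Bayer mask (one color channel kept per pixel);
    # the paper applies a randomized pattern M_pi instead.
    _, H, W = x_gt.shape
    yy, xx = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    mask = torch.zeros_like(x_gt)
    mask[0] = ((yy % 2 == 0) & (xx % 2 == 0)).float()  # R
    mask[1] = ((yy % 2) != (xx % 2)).float()           # G
    mask[2] = ((yy % 2 == 1) & (xx % 2 == 1)).float()  # B

    # Average N independent Bernoulli frames (7 frames -> 3-bit representation).
    frames = torch.stack([mask * torch.bernoulli(p) for _ in range(n_frames)])
    return frames.mean(dim=0)
```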
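
The Stage-1 objective from item 2 can be written down compactly; the key detail is that the supervision-side encoder is a frozen copy of the pretrained one, so the target never moves. A sketch under that assumption (module and function names are placeholders; `lpips_fn` stands in for the perceptual loss):

```python
import copy
import torch
import torch.nn.functional as F

def build_stage1_encoders(pretrained_encoder):
    """Return (trainable, anchor): only the prediction-side encoder is updated."""
    trainable = copy.deepcopy(pretrained_encoder)  # mu_{phi*}: fine-tuned on quanta inputs
    anchor = pretrained_encoder.eval()             # mu_phi: frozen pretrained copy
    for p in anchor.parameters():
        p.requires_grad_(False)
    return trainable, anchor

def stage1_loss(trainable, anchor, decoder, lpips_fn, x_lq, x_gt):
    """L = 0.1 * L_lsa + 1e3 * L_mse + 2 * L_perc, mirroring the paper's weights."""
    z_lq = trainable(x_lq)                 # deterministic mean encoding of the quanta input
    with torch.no_grad():
        z_gt = anchor(x_gt)                # stable latent target from the frozen copy
    x_rec = decoder(z_lq)

    l_lsa = F.mse_loss(z_lq, z_gt)         # latent space alignment
    l_mse = F.mse_loss(x_rec, x_gt)        # pixel-space fidelity
    l_perc = lpips_fn(x_rec, x_gt).mean()  # perceptual (LPIPS) term
    return 0.1 * l_lsa + 1e3 * l_mse + 2 * l_perc
```

Because the anchor never receives gradients, collapsing the encoder to a constant output no longer minimizes the latent term, which is exactly the symmetry-breaking argument in the design motivation above.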
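
Item 3's generator objective can be sketched in the same style. The discriminator is kept abstract here (the paper uses a multi-scale ConvNeXt-Large network), and the non-saturating adversarial term is an illustrative choice, not necessarily the paper's exact GAN formulation.

```python
import torch.nn.functional as F

def stage2_generator_loss(g_lora, decoder, discriminator, lpips_fn, z_lq, x_gt):
    """L_G = L_adv + L_perc + ||D(G_lora(mu_{phi*}(x_lq))) - x_gt||^2.

    z_lq is the quanta-aligned latent mu_{phi*}(x_lq); the adversarial term below
    is an illustrative non-saturating formulation.
    """
    x_hat = decoder(g_lora(z_lq))                     # single-step generation, decoded to pixels
    l_adv = F.softplus(-discriminator(x_hat)).mean()  # generator side of the min-max game
    l_perc = lpips_fn(x_hat, x_gt).mean()             # perceptual term
    l_pix = F.mse_loss(x_hat, x_gt)                   # pixel-space fidelity
    return l_adv + l_perc + l_pix
```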

Training Configuration

  • Stage 1: 8×A100, 600k steps, batch=8, Adam (lr=1e-5), training data: 2.81M images + 44k videos
  • Stage 2: 1×RTX4090, 100k steps, 256×256
  • Stage 3: 20k steps, S1+S2 frozen, FusionViT only

Key Experimental Results

Main Results: Single-Frame 3-bit Color Reconstruction

| Method | PSNR↑ | SSIM↑ | LPIPS↓ | ManIQA↑ | ClipIQA↑ | MUSIQ↑ |
|---|---|---|---|---|---|---|
| InstantIR | 7.93 | 0.101 | 0.736 | 0.197 | 0.358 | 32.21 |
| ft-Restormer | 26.43 | 0.739 | 0.388 | 0.235 | 0.395 | 36.03 |
| ft-NAFNet | 26.88 | 0.757 | 0.338 | 0.251 | 0.431 | 36.73 |
| qVAE (S1) | 26.28 | 0.791 | 0.435 | 0.272 | 0.432 | 38.61 |
| gQIR (S1+S2) | 25.48 | 0.766 | 0.361 | 0.313 | 0.490 | 42.04 |

Burst Reconstruction Comparison

| Test Set (fps) | Method | PSNR↑ | SSIM↑ | LPIPS↓ |
|---|---|---|---|---|
| I2-2000fps | QBP | 16.04 | 0.549 | 0.468 |
| I2-2000fps | QUIVER | 25.06 | 0.874 | 0.366 |
| I2-2000fps | Burst-gQIR | 31.21 | 0.878 | 0.296 |
| XD (2k–100k) | QBP | 12.78 | 0.409 | 0.458 |
| XD (2k–100k) | Burst-gQIR | 30.33 | 0.895 | 0.316 |

Ablation Study: Stage 1 Design Choices

| Variant | PSNR↑ | SSIM↑ | ManIQA↑ |
|---|---|---|---|
| w/o det. encoding | 20.56 | 0.435 | 0.167 |
| w/o LSA loss | 10.39 | 0.222 | 0.139 |
| w/o both | 10.30 | 0.218 | 0.136 |
| Full method | 24.78 | 0.665 | 0.194 |

Ablation Study: Incremental Three-Stage Contribution (Video Set)

| Stage | PSNR↑ | SSIM↑ | \(E_{warp}\)↓ |
|---|---|---|---|
| S1 Alignment | 20.04 | 0.759 | 9.088 |
| S2 Perception | 24.11 | 0.846 | 8.508 |
| S3 Fusion | 27.63 | 0.869 | 8.005 |

Key Findings

  • The LSA loss is indispensable—removing it causes PSNR to drop sharply from 24.78 to 10.39, with 100% encoder collapse.
  • gQIR surpasses SOTA QuDI by +2.17 dB on the I2-2000fps benchmark (30.81 vs. 28.64).
  • Adversarial fine-tuning substantially improves perceptual metrics: ClipIQA 0.432→0.490 (+13.4%), MUSIQ 38.6→42.0.
  • FusionViT contributes most significantly on the extreme-motion XD dataset, with burst reconstruction outperforming single-frame by approximately +5 dB.
  • Reconstruction from real color SPAD data requires only gray-world white balancing, with no hot-pixel or dark-count correction needed.

Highlights & Insights

  • The frozen-encoder-copy mechanism for preventing collapse is the core contribution and can be transferred directly to any degradation-aware VAE fine-tuning scenario.
  • Deterministic encoding under Bernoulli noise avoids variance amplification and offers general guidance for adapting diffusion models to non-Gaussian noise domains.
  • The paper introduces the first color SPAD burst dataset and the XD video benchmark, filling a critical evaluation gap.
  • The three-stage decoupled training allows independent optimization of each stage, making the framework highly practical for engineering applications.

Limitations & Future Work

  • Training assumes a fixed PPP of 3.5; generalization to extremely low light (PPP≤1) is limited, and PPP should be incorporated as a conditioning signal.
  • The pretrained VAE decoder's 8-bit constraint limits the native HDR capability of SPAD sensors.
  • Adversarial training is conducted only at 256×256; higher resolutions rely on VAE tiling.
  • Stage 3 depends on a two-step "reconstruct-then-estimate-flow" strategy; an end-to-end approach may be superior.

Rating

  • Novelty: ⭐⭐⭐⭐ First successful adaptation of T2I diffusion priors to single-photon imaging
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ Single-frame + burst + real data + new datasets + multiple metrics + complete ablations
  • Writing Quality: ⭐⭐⭐⭐ Clear structure and rigorous physical modeling
  • Value: ⭐⭐⭐⭐ Opens a new direction for adapting generative priors to extreme imaging conditions