# gQIR: Generative Quanta Image Reconstruction
- Conference: CVPR 2026
- arXiv: 2602.20417
- Code: https://github.com/Aryan-Garg/gQIR
- Area: Image Generation / Computational Imaging
- Keywords: Single-photon imaging, SPAD, diffusion models, denoising & demosaicing, low-light reconstruction, VAE alignment, burst fusion
## TL;DR
This paper proposes gQIR, a modular three-stage framework that adapts large-scale text-to-image (T2I) diffusion models to the extreme photon-limited domain of single-photon avalanche diode (SPAD) sensors. It employs a quanta-aligned VAE (with a frozen encoder copy to prevent collapse), an adversarially fine-tuned LoRA U-Net for single-step generation, and a latent-space FusionViT for spatiotemporal fusion, enabling high-quality color image and video reconstruction from extremely sparse binary photon events.
## Background & Motivation
Background: SPAD sensors can capture images under extremely low illumination and at ultra-high frame rates (10k–100k fps), recording only whether a photon was detected per pixel (Bernoulli distribution), resulting in highly sparse and noisy raw quanta frames. Existing reconstruction methods (QBP, QUIVER, QuDI) rely on alignment-and-merging strategies or task-specific networks, without leveraging large-scale generative priors.
Limitations of Prior Work: (a) Conventional denoising networks (NAFNet, Restormer) are designed for Poisson-Gaussian noise and suffer from severe over-smoothing under Bernoulli quantization noise; (b) existing generative restoration models (InstantIR) perform well on standard degradations but fail completely in the photon-limited domain (PSNR of only 7.9 dB); (c) naively fine-tuning the VAE encoder of a diffusion model leads to encoder collapse: the trainable encoder simultaneously governs both prediction and supervision, causing rapid convergence to a constant output.
Key Challenge: A large domain gap exists between Bernoulli noise statistics and the continuous natural images used to train diffusion models, causing shortcut learning under direct fine-tuning.
Key Insight: The paper identifies the symmetric structure of the degradation-removal loss as the root cause of encoder collapse, and introduces a frozen encoder copy to break this symmetry.
Core Idea: A frozen encoder copy serves as a latent space alignment (LSA) anchor to prevent collapse, adversarial LoRA fine-tuning enables single-step generation, and a latent-space FusionViT performs spatiotemporal fusion.
## Method
### Overall Architecture
The input consists of SPAD binary frames (nano-burst: 7 frames averaged to yield 3-bit representations), and the output is a high-quality RGB image. Three stages are trained independently: Stage 1 fine-tunes the VAE encoder to achieve latent-space denoising and demosaicing alignment; Stage 2 adversarially fine-tunes a LoRA U-Net for single-step, high-perceptual-quality generation; Stage 3 trains a FusionViT for multi-frame spatiotemporal fusion in the latent space.
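The per-frame inference path implied by this architecture is short. Below is a minimal sketch of that path; the three callables (`encoder`, `unet_lora`, `decoder`) are hypothetical stand-ins for the trained Stage 1 quanta-aligned encoder, the Stage 2 LoRA U-Net, and the frozen VAE decoder, and are not the authors' actual module names.

```python
import torch

def gqir_single_frame(x_lq: torch.Tensor, encoder, unet_lora, decoder) -> torch.Tensor:
    """Single-frame reconstruction sketch: quanta-aligned encoding -> single-step generation -> decoding.

    x_lq: averaged 3-bit SPAD observation, shape (1, 3, H, W). The three callables
    are stand-ins for the trained Stage 1 / Stage 2 modules described above.
    """
    with torch.no_grad():
        z = encoder(x_lq)        # deterministic mean latent of the degraded input
        z_hat = unet_lora(z)     # single-step refinement by the adversarially tuned U-Net
        return decoder(z_hat)    # high-quality RGB reconstruction
```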
### Key Designs
- Imaging Model:
    - Clean sRGB images are linearized via \(x_{lin} = x_{gt}^{2.2}\) to map into linear irradiance space.
    - SPAD output follows a Bernoulli distribution: \(x_{spad} = \text{Bern}(1 - e^{-\alpha \cdot x_{lin}})\), where \(\alpha = 1.0\) corresponds to an expected photons-per-pixel (PPP) count of 3.5.
    - A random-phase Bayer mosaic \(M_\pi\) is applied, and \(N\) frames are averaged: \(x_{lq} = \frac{1}{N}\sum_{i=1}^{N} M_\pi[\text{Bern}(1-e^{-\alpha \cdot x_{lin}})]\) (a simulation sketch follows this list).
- Quanta-Aligned VAE (Stage 1):
    - Deterministic mean encoding: the encoder mean \(\mu_{\phi^*}(x_{lq})\) is used in place of stochastic sampling to avoid variance amplification under Bernoulli noise.
    - Latent Space Alignment (LSA) loss: \(\mathcal{L}_{lsa} = \|\mu_{\phi^*}(x_{lq}) - \mu_\phi(x_{gt})\|_2^2\), where \(\mu_{\phi^*}\) denotes the trainable quanta-aligned encoder applied to the degraded input and \(\mu_\phi\) a frozen copy of the original pretrained encoder, which provides a stable anchor on the clean target.
    - Supplemented by pixel-space MSE (\(\lambda = 10^3\)) and LPIPS (\(\lambda = 2\)) losses.
    - Total loss: \(\mathcal{L} = 0.1\,\mathcal{L}_{lsa} + 10^3\,\mathcal{L}_{MSE} + 2\,\mathcal{L}_{perc}\) (a training-step sketch follows this list).
    - Design Motivation: existing degradation-removal formulations place the trainable encoder on both the prediction and supervision sides, yielding a degenerate optimum; the frozen copy fundamentally breaks this symmetry.
- Adversarial LoRA Fine-Tuning (Stage 2):
    - A multi-scale ConvNeXt-Large discriminator \(\mathcal{V}_\theta\) is trained in a standard min-max GAN game against the LoRA-initialized U-Net generator.
    - Generator total loss: \(\mathcal{L}_{G} = \mathcal{L}_{adv} + \mathcal{L}_{perc} + \|\mathcal{D}(G_{lora}(\mu_{\phi^*}(x_{lq}))) - x_{gt}\|_2^2\), where \(\mathcal{D}\) is the VAE decoder (a loss sketch follows this list).
    - Design Motivation: the extremely high acquisition rate of SPAD sensors demands single-step inference; LoRA initialization inherits the diffusion weights and provides a stable starting point for GAN training.
- Latent-Space Burst Fusion (Stage 3, FusionViT):
    - All frames \(Y\) are first reconstructed using S1+S2, and optical flow is then estimated in the reconstructed domain with RAFT (pretrained on FlyingThings3D), avoiding the domain gap that low-quality inputs would introduce.
    - A pseudo-3D miniViT with sub-quadratic window attention adaptively fuses frames along the temporal and spatial axes, preventing the motion blur caused by naive averaging.
    - The fused output is added residually to the center-frame latent via a learnable scalar \(\delta\) (initialized to 0.05) before being passed to \(\mathcal{G}_{lora}\) (a residual-injection sketch follows this list).
    - Training loss: \(\mathcal{L}_{fusion} = \|\mathcal{F}(\mu_{\phi^*}(X)) - \mu_\phi(x_{gt})\|_2^2 + \|\mathcal{D}(\mathcal{G}(\mathcal{F}(\mu_{\phi^*}(X)))) - x_{gt}\|_2^2 + \mathcal{L}_{perc}\)
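The imaging model maps directly onto a few lines of simulation code. Below is a minimal sketch in PyTorch, assuming a fixed-phase RGGB mosaic; the function name and the exact mosaic handling are illustrative and not taken from the authors' implementation.

```python
import torch

def simulate_quanta_burst(x_gt: torch.Tensor, alpha: float = 1.0, n_frames: int = 7) -> torch.Tensor:
    """x_gt: clean sRGB image in [0, 1], shape (3, H, W); returns the averaged low-quality input."""
    x_lin = x_gt.clamp(min=0).pow(2.2)            # inverse gamma -> linear irradiance
    p_detect = 1.0 - torch.exp(-alpha * x_lin)    # Bernoulli photon-detection probability

    # 2x2 RGGB Bayer mosaic mask (one color kept per pixel); a random phase could be
    # simulated by rolling this mask, one plausible reading of M_pi in the model.
    mask = torch.zeros_like(x_gt)
    mask[0, 0::2, 0::2] = 1.0   # R
    mask[1, 0::2, 1::2] = 1.0   # G
    mask[1, 1::2, 0::2] = 1.0   # G
    mask[2, 1::2, 1::2] = 1.0   # B

    frames = [mask * torch.bernoulli(p_detect) for _ in range(n_frames)]
    return torch.stack(frames).mean(dim=0)        # 7 binary frames -> 8 levels (~3-bit)
```

For example, `simulate_quanta_burst(torch.rand(3, 256, 256))` produces the kind of sparse, mosaicked observation that Stage 1 consumes.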
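The Stage 1 objective with the frozen encoder anchor can be sketched as follows. This assumes the `diffusers` `AutoencoderKL` interface and the `lpips` package; the checkpoint name, the choice of which VAE submodules are trainable, and the optimizer setup are illustrative assumptions, while the loss weights come from the summary above.

```python
import copy
import torch
import lpips
from diffusers import AutoencoderKL

device = "cuda" if torch.cuda.is_available() else "cpu"

# Trainable VAE (encoder side only) and a frozen copy that anchors the latent target.
vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse").to(device)   # assumed checkpoint
vae.decoder.requires_grad_(False)
vae.post_quant_conv.requires_grad_(False)
frozen_vae = copy.deepcopy(vae).eval().requires_grad_(False)

lpips_fn = lpips.LPIPS(net="vgg").to(device).requires_grad_(False)

trainable = list(vae.encoder.parameters()) + list(vae.quant_conv.parameters())
opt = torch.optim.Adam(trainable, lr=1e-5)

def enc_mean(model, x):
    """Deterministic mean encoding (no stochastic sampling)."""
    return model.encode(x).latent_dist.mean

def stage1_step(x_lq, x_gt):
    """One quanta-aligned VAE update; images are assumed to be in [-1, 1]."""
    z_lq = enc_mean(vae, x_lq)                  # trainable encoder on the degraded input
    with torch.no_grad():
        z_anchor = enc_mean(frozen_vae, x_gt)   # frozen copy on the clean target (LSA anchor)
    x_hat = vae.decode(z_lq).sample             # frozen decoder back to pixel space

    l_lsa = torch.mean((z_lq - z_anchor) ** 2)  # latent space alignment
    l_mse = torch.mean((x_hat - x_gt) ** 2)     # pixel-space reconstruction
    l_perc = lpips_fn(x_hat, x_gt).mean()       # LPIPS perceptual term
    loss = 0.1 * l_lsa + 1e3 * l_mse + 2.0 * l_perc

    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```

Keeping the anchor encoder as a deep copy with gradients disabled is what breaks the symmetry: the supervision target no longer moves with the trainable encoder, so the constant-output shortcut is no longer an optimum.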
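For Stage 2, the generator objective combines an adversarial term with perceptual and pixel reconstruction losses through the decoder. The sketch below treats the LoRA U-Net, VAE decoder, discriminator, and LPIPS module as stand-in callables; the summary only states standard min-max training, so the non-saturating GAN form used here is an assumption.

```python
import torch
import torch.nn.functional as F

def generator_loss(g_lora, decode, discriminator, lpips_fn, z_lq, x_gt):
    """Stage 2 generator objective: adversarial + perceptual + pixel MSE."""
    x_hat = decode(g_lora(z_lq))                      # single-step generation, then VAE decode
    l_adv = F.softplus(-discriminator(x_hat)).mean()  # adversarial term (assumed non-saturating form)
    l_perc = lpips_fn(x_hat, x_gt).mean()             # LPIPS perceptual loss
    l_pix = F.mse_loss(x_hat, x_gt)                   # pixel-space reconstruction
    return l_adv + l_perc + l_pix

def discriminator_loss(discriminator, x_hat, x_gt):
    """Standard real/fake objective for the multi-scale ConvNeXt-Large discriminator."""
    l_real = F.softplus(-discriminator(x_gt)).mean()
    l_fake = F.softplus(discriminator(x_hat.detach())).mean()
    return l_real + l_fake
```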
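Finally, the Stage 3 residual injection into the center-frame latent can be written as a small module. `fusion_vit` and `generator` are hypothetical stand-ins for the paper's FusionViT and LoRA U-Net; only the learnable scalar initialized at 0.05 and the center-frame residual structure come from the summary, and the RAFT-based alignment of the burst is assumed to have happened upstream.

```python
import torch
import torch.nn as nn

class LatentBurstFusion(nn.Module):
    """Sketch of the Stage 3 residual injection before single-step generation."""

    def __init__(self, fusion_vit: nn.Module, generator: nn.Module):
        super().__init__()
        self.fusion_vit = fusion_vit
        self.generator = generator
        self.delta = nn.Parameter(torch.tensor(0.05))   # learnable residual scale, init 0.05

    def forward(self, z_burst: torch.Tensor) -> torch.Tensor:
        """z_burst: latents of the (reconstructed, flow-aligned) burst, shape (T, C, h, w)."""
        z_center = z_burst[z_burst.shape[0] // 2]        # center-frame latent
        z_fused = self.fusion_vit(z_burst)               # spatiotemporal fusion
        z_in = z_center + self.delta * z_fused           # residual latent injection
        return self.generator(z_in.unsqueeze(0))         # passed to the LoRA generator; the VAE decoder follows
```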
### Training Configuration
- Stage 1: 8×A100, 600k steps, batch=8, Adam (lr=1e-5), training data: 2.81M images + 44k videos
- Stage 2: 1×RTX 4090, 100k steps, at 256×256 resolution
- Stage 3: 20k steps, S1+S2 frozen, FusionViT only
## Key Experimental Results
### Main Results: Single-Frame 3-bit Color Reconstruction
| Method | PSNR↑ | SSIM↑ | LPIPS↓ | ManIQA↑ | ClipIQA↑ | MUSIQ↑ |
|---|---|---|---|---|---|---|
| InstantIR | 7.93 | 0.101 | 0.736 | 0.197 | 0.358 | 32.21 |
| ft-Restormer | 26.43 | 0.739 | 0.388 | 0.235 | 0.395 | 36.03 |
| ft-NAFNet | 26.88 | 0.757 | 0.338 | 0.251 | 0.431 | 36.73 |
| qVAE (S1) | 26.28 | 0.791 | 0.435 | 0.272 | 0.432 | 38.61 |
| gQIR (S1+S2) | 25.48 | 0.766 | 0.361 | 0.313 | 0.490 | 42.04 |
### Burst Reconstruction Comparison
| Test Set (fps) | Method | PSNR↑ | SSIM↑ | LPIPS↓ |
|---|---|---|---|---|
| I2-2000fps | QBP | 16.04 | 0.549 | 0.468 |
| I2-2000fps | QUIVER | 25.06 | 0.874 | 0.366 |
| I2-2000fps | Burst-gQIR | 31.21 | 0.878 | 0.296 |
| XD (2k-100k) | QBP | 12.78 | 0.409 | 0.458 |
| XD (2k-100k) | Burst-gQIR | 30.33 | 0.895 | 0.316 |
### Ablation Study: Stage 1 Design Choices
| Variant | PSNR↑ | SSIM↑ | ManIQA↑ |
|---|---|---|---|
| w/o det. encoding | 20.56 | 0.435 | 0.167 |
| w/o LSA loss | 10.39 | 0.222 | 0.139 |
| w/o both | 10.30 | 0.218 | 0.136 |
| Full method | 24.78 | 0.665 | 0.194 |
### Ablation Study: Incremental Three-Stage Contribution (Video Set)
| Stage | PSNR↑ | SSIM↑ | \(E_{warp}\)↓ |
|---|---|---|---|
| S1 Alignment | 20.04 | 0.759 | 9.088 |
| S2 Perception | 24.11 | 0.846 | 8.508 |
| S3 Fusion | 27.63 | 0.869 | 8.005 |
## Key Findings
- The LSA loss is indispensable—removing it causes PSNR to drop sharply from 24.78 to 10.39, with 100% encoder collapse.
- gQIR surpasses SOTA QuDI by +2.17 dB on the I2-2000fps benchmark (30.81 vs. 28.64).
- Adversarial fine-tuning substantially improves perceptual metrics: ClipIQA 0.432→0.490 (+13.4%), MUSIQ 38.6→42.0.
- FusionViT contributes most significantly on the extreme-motion XD dataset, with burst reconstruction outperforming single-frame by approximately +5 dB.
- Reconstruction from real color SPAD data requires only gray-world white balancing, with no hot-pixel or dark-count correction needed.
## Highlights & Insights
- The frozen-encoder-copy anchor that prevents collapse is the core contribution and can be transferred directly to other degradation-aware VAE fine-tuning scenarios.
- Deterministic encoding under Bernoulli noise avoids variance amplification and offers general guidance for adapting diffusion models to non-Gaussian noise domains.
- The paper introduces the first color SPAD burst dataset and the XD video benchmark, filling a critical evaluation gap.
- The three-stage decoupled training allows independent optimization of each stage, making the framework highly practical for engineering applications.
## Limitations & Future Work
- Training assumes a fixed PPP of 3.5; generalization to extremely low light (PPP≤1) is limited, and PPP should be incorporated as a conditioning signal.
- The pretrained VAE decoder's 8-bit constraint limits the native HDR capability of SPAD sensors.
- Adversarial training is conducted only at 256×256; higher resolutions rely on VAE tiling.
- Stage 3 depends on a two-step "reconstruct-then-estimate-flow" strategy; an end-to-end approach may be superior.
## Rating
- Novelty: ⭐⭐⭐⭐ First successful adaptation of T2I diffusion priors to single-photon imaging
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Single-frame + burst + real data + new datasets + multiple metrics + complete ablations
- Writing Quality: ⭐⭐⭐⭐ Clear structure and rigorous physical modeling
- Value: ⭐⭐⭐⭐ Opens a new direction for adapting generative priors to extreme imaging conditions