gQIR: Generative Quanta Image Reconstruction
Conference: CVPR 2026
arXiv: 2602.20417
Code: GitHub
Area: Image Generation / Image Reconstruction / Computational Imaging
Keywords: Single-photon sensors, Diffusion models, Image reconstruction, burst imaging, VAE alignment
TL;DR
gQIR adapts large-scale text-to-image latent diffusion models to the extremely photon-starved imaging regime of Single-Photon Avalanche Diode (SPAD) sensors. A three-stage framework (Quanta-aligned VAE → adversarially fine-tuned LoRA U-Net → FusionViT spatio-temporal fusion) reconstructs high-quality RGB images from sparse binary photon detections and outperforms the compared prior methods under extreme 10K-100K fps conditions.
Background & Motivation
Background: SPAD single-photon sensors enable imaging in ultra-low light and high-speed scenarios where conventional cameras fail. However, each pixel only records binary photon detections (photon arrival or not), resulting in extremely sparse and noise-dominated single-frame data. Existing methods mainly fall into two categories: (1) Classical vision methods like QBP, which use block matching, frame alignment, and Wiener filtering; (2) Learning-based methods like QUIVER and QuDI, using optical flow estimation with recurrent fusion or temporal conditional U-Nets.
Limitations of Prior Work:
- Classical methods suffer from unreliable motion estimation when photons are extremely scarce, and they lack semantic priors, which leads to loss of high-frequency detail.
- Learning-based methods introduce learned modules, but they are task-specific and trained from scratch, so they fail to exploit the structural and semantic knowledge of large-scale pre-trained generative models.
- All existing methods degrade severely under extreme deformation or ultra-high-speed motion (>10K fps).
- The Bayer mosaic of color SPADs further exacerbates sparsity, leaving even fewer photon events per color channel.
Key Challenge: Large-scale T2I diffusion models possess powerful natural image priors, but they assume continuous Gaussian noise. SPAD data follows a discrete Bernoulli distribution with noise far exceeding regular photography—naive fine-tuning leads to encoder collapse (catastrophic forgetting) and shortcut learning.
Key Insight: Instead of training from scratch, bridge the domain gap with a carefully phased adaptation strategy that transfers the structural priors of Stable Diffusion to the quanta domain. A key observation is that directly fine-tuning the VAE encoder drives it to a constant output under extreme SPAD noise, so a latent space alignment constraint must be introduced to prevent collapse.
Core Idea: A three-stage modular framework—first align the latent space to handle Bernoulli statistics, then distill the diffusion prior into a single-step generator for enhanced perceptual quality, and finally perform burst-level spatio-temporal fusion in the latent space.
Method
Overall Architecture
The input consists of binary photon frames from a SPAD sensor (or aggregated 3-bit nano-bursts), and the output is a high-quality RGB image. Training data is synthesized from clean images using a physically consistent Bernoulli SPAD image formation model. The pipeline consists of three sequentially trained stages: Stage 1 trains a Quanta-aligned VAE encoder for denoising and demosaicing; Stage 2 adversarially fine-tunes a LoRA U-Net to enhance perceptual fidelity; Stage 3 trains FusionViT for burst-level spatio-temporal fusion in the latent space. Parameters of previous modules are frozen at each subsequent stage.
graph TD
A["Bernoulli SPAD Image Formation Model<br/>clean sRGB → linear radiance → binary photon frames (+Bayer)"] --> B["Stage 1: Quanta-Aligned VAE<br/>Deterministic mean encoding + LSA loss, denoising/demosaicing for aligned latent"]
B --> C["Stage 2: Adversarially Fine-Tuned LoRA U-Net<br/>Distilled into single-step generator + ConvNext discriminator, restores high-freq"]
C -->|Single-frame reconstruction| D["Stage 3: FusionViT<br/>RAFT flow on reconstructions → warp to center → window attention fusion"]
D --> E["VAE Decoder"]
E --> F["High-Quality RGB Image"]
Key Designs
1. Bernoulli SPAD Image Formation Model: Physically consistent "fragmentation" of clean images into single-photon observations
To train on synthetic data, a physically consistent forward simulation chain is required because SPAD statistical properties differ significantly from conventional photography. Clean sRGB images \(x_{gt}\) are first gamma-corrected to linear radiance space \(x_{lin}=x_{gt}^{2.2}\). Photon detection at each pixel then follows a Bernoulli distribution:
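A standard form of this detection model, assumed here because it is the usual quanta-sensor formulation rather than something confirmed by the source, writes each binary pixel \(B\) as:

```latex
\[
B \sim \mathrm{Bernoulli}\!\left(1 - e^{-\alpha\, x_{lin}}\right)
\]
```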
where \(\alpha\) controls the expected photons per pixel (PPP). Training uses \(\alpha=1.0\) (PPP≈3.5). For color SPAD, a random Bayer pattern is applied to generate mosaiced binary frames; averaging \(N\) frames yields \(\log_2(N+1)\) bit observations. This Bernoulli (non-Gaussian) chain is the source of subsequent technical choices—because noise is discrete, intensity-dependent, and extreme, standard diffusion restoration methods fail.
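The forward chain described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' simulator: the detection probability `1 - exp(-alpha * x_lin)` is an assumed standard quanta model, the function names are hypothetical, the image is a grayscale 2D list, and the Bayer mosaic step is omitted.

```python
import math
import random

def simulate_binary_frame(x_gt, alpha=1.0, rng=random):
    """Sample one binary photon frame from a clean image with values in [0, 1].

    ASSUMPTION: detection probability is 1 - exp(-alpha * x_lin), the usual
    quanta-sensor model; the source does not spell out the exact formula.
    """
    frame = []
    for row in x_gt:
        out_row = []
        for v in row:
            x_lin = v ** 2.2                    # sRGB -> linear radiance
            p = 1.0 - math.exp(-alpha * x_lin)  # assumed detection probability
            out_row.append(1 if rng.random() < p else 0)
        frame.append(out_row)
    return frame

def nano_burst(x_gt, n_frames=7, alpha=1.0, rng=random):
    """Sum n_frames binary frames; each pixel count lies in {0, ..., n_frames}.

    Dividing the counts by n_frames gives the averaged observation.
    """
    h, w = len(x_gt), len(x_gt[0])
    counts = [[0] * w for _ in range(h)]
    for _ in range(n_frames):
        frame = simulate_binary_frame(x_gt, alpha, rng)
        for i in range(h):
            for j in range(w):
                counts[i][j] += frame[i][j]
    return counts

def bit_depth(n_frames):
    """N binary frames give N + 1 distinct levels, i.e. log2(N + 1) bits."""
    return math.log2(n_frames + 1)
```

With `n_frames=7` the counts take eight levels, which matches the 3-bit nano-bursts used in training.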
2. Stage 1 — Quanta-Aligned VAE: Ensuring the encoder outputs "clean-aligned" latents under noisy input
Directly fine-tuning an SD VAE encoder on SPAD input results in collapse to a constant mapping. This occurs because existing restoration methods (such as SUPIR and DiffBIR) use the same trainable encoder for both supervision and prediction, which lets the network learn a shortcut under extreme noise. gQIR freezes the decoder \(\mathcal{D}\) and fine-tunes only the encoder \(\mathcal{E}_{\phi^*}\), with two key modifications. First, deterministic mean encoding: skip posterior sampling and use the mean \(\mu_{\phi^*}(x_{lq})\) to avoid amplifying Bernoulli noise. Second, the Latent Space Alignment (LSA) loss, which uses a frozen pre-trained encoder to encode the GT as an alignment target:
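One form consistent with this description, with the trainable encoder mean \(\mu_{\phi^*}\) applied to the degraded input and the frozen copy \(\mu_{\phi}\) applied to the ground truth, is sketched here; the squared L2 norm is an assumption:

```latex
\[
\mathcal{L}_{lsa} = \left\| \mu_{\phi^*}(x_{lq}) - \mu_{\phi}(x_{gt}) \right\|_2^2
\]
```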
Since the \(\mu_\phi\) in the second term is a frozen copy, the supervision target does not drift during training, preventing the encoder from "cheating" through mutual degradation. The total loss is \(\mathcal{L}=\lambda_{lsa}\mathcal{L}_{lsa}+\lambda_{MSE}\mathcal{L}_{MSE}+\lambda_{perc}\mathcal{L}_{perc}\). This stage handles both denoising and demosaicing.
3. Stage 2 — Adversarially Fine-tuned LoRA U-Net: Distilling multi-step diffusion into a single-step generator
Stage 1 reconstructions are often flat and lack detail. Given SPAD's 10K–100K fps throughput, multi-step diffusion sampling is impractical, so the diffusion U-Net is distilled into a single-step generator. The generator \(\mathcal{G}_{lora}\) is the pre-trained U-Net with a LoRA adapter (it inherits the diffusion weights, which keeps initial gradients small and preserves the priors), and it is paired with a multi-layer ConvNext-Large discriminator \(\mathcal{V}_\theta\) for standard min-max GAN adversarial training:
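The textbook min-max objective is sketched here as an assumption about the exact form; in particular, applying the discriminator to decoded images \(\hat{x}\) rather than to latents is a guess:

```latex
\[
\min_{\mathcal{G}_{lora}} \max_{\mathcal{V}_\theta}\;
\mathbb{E}_{x_{gt}}\!\left[\log \mathcal{V}_\theta(x_{gt})\right]
+ \mathbb{E}_{x_{lq}}\!\left[\log\!\left(1 - \mathcal{V}_\theta(\hat{x})\right)\right],
\qquad
\hat{x} = \mathcal{D}\!\left(\mathcal{G}_{lora}(\mu_{\phi^*}(x_{lq}))\right)
\]
```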
The total loss adds perceptual and pixel terms to the adversarial term: \(\mathcal{L}=\lambda_{adv}\mathcal{L}_{adv}+\lambda_{perc}\mathcal{L}_{perc}+\lambda_{MSE}\|\mathcal{D}(\mathcal{G}_{lora}(\mu_{\phi^*}(x_{lq})))-x_{gt}\|_2^2\), using the Stage 2 weights listed under Loss & Training. This preserves pre-trained generative knowledge while enabling single-step real-time inference.
4. Stage 3 — FusionViT: Dynamic burst fusion in latent space for fidelity and temporal stability
Single-frame reconstruction ignores temporal redundancy in burst sequences, and naive flow-warping followed by averaging leads to motion blur. FusionViT aligns and fuses burst frames in the latent space, and it avoids the pitfall of estimating flow on noisy photon frames. Specifically, Stages 1 and 2 first reconstruct the individual frames \(Y=\mathcal{D}(\mathcal{G}_{lora}(\mathcal{E}_{\phi^*}(X_{lq})))\), and a pre-trained RAFT estimates optical flow on these reconstructions. The flow is then used to warp all burst latents to the center frame. A pseudo-3D miniViT \(\mathcal{F}\) applies window attention across space and time for adaptive weighted fusion, with contributions determined by motion magnitude and temporal distance. Its output is added back to the center latent \(z_{T/2}\), scaled by a learnable scalar \(\delta\) (initialized to 0.05).
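The residual form of the fusion step, \(z_{out} = z_{T/2} + \delta\,\mathcal{F}(\cdot)\), can be illustrated with a toy stand-in. In the sketch below, one softmax weight per frame replaces the learned window attention, latents are flattened 1D lists that are assumed to be already warped, and `fuse_burst` and its arguments are hypothetical names rather than the paper's API.

```python
import math

def fuse_burst(latents, frame_scores, delta=0.05):
    """Toy residual fusion of pre-warped burst latents into the center latent.

    latents: list of T equally sized 1D lists (flattened, already warped).
    frame_scores: T scores standing in for the learned attention logits.
    """
    num_frames = len(latents)
    center = latents[num_frames // 2]

    # Softmax over time, shifted by the max score for numerical stability.
    top = max(frame_scores)
    exps = [math.exp(s - top) for s in frame_scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Weighted temporal average, computed independently per latent element.
    fused = [
        sum(weights[t] * latents[t][k] for t in range(num_frames))
        for k in range(len(center))
    ]

    # Residual connection: a small delta keeps the output close to the center.
    return [c + delta * f for c, f in zip(center, fused)]
```

A small initial \(\delta\) means the fused output starts close to the single-frame latent, so the fusion branch can only perturb the Stage 1+2 result gently at the start of training.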
Loss & Training
Phased progressive training, freezing previous modules at each stage:
- Stage 1: 8×A100, 600K steps, \(\lambda_{lsa}=0.1, \lambda_{MSE}=10^3, \lambda_{perc}=2\)
- Stage 2: Single RTX 4090, 100K iterations, 256×256, \(\lambda_{adv}=0.5, \lambda_{MSE}=500, \lambda_{perc}=5\)
- Stage 3: FusionViT for 20K steps, using RAFT flow (pre-trained on FlyingThings3D).
Training data: 2.81M images + 44,575 videos from various sources (SR, faces, deblurring). 3-bit nano-bursts are averaged from 7 binary frames; burst mode uses 11 nano-bursts (77 binary frames total).
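The burst bookkeeping above reduces to a few lines of arithmetic. The snippet is only a worked check; the constant names are illustrative, and reading "fps" as the binary-frame rate in `burst_duration_ms` is an assumption.

```python
import math

BINARY_FRAMES_PER_NANO_BURST = 7
NANO_BURSTS_PER_BURST = 11

# Summing 7 binary frames gives 8 possible levels, which is 3 bits.
levels = BINARY_FRAMES_PER_NANO_BURST + 1
bits = math.log2(levels)

# One burst covers 11 nano-bursts of 7 binary frames each.
total_binary_frames = BINARY_FRAMES_PER_NANO_BURST * NANO_BURSTS_PER_BURST

def burst_duration_ms(n_binary_frames, fps):
    """Time span of a burst, ASSUMING fps counts binary frames per second."""
    return 1000.0 * n_binary_frames / fps

print(bits, total_binary_frames)
print(burst_duration_ms(total_binary_frames, 10_000))
print(burst_duration_ms(total_binary_frames, 100_000))
```

Under that assumption, a 77-frame burst spans 7.7 ms at 10K fps and 0.77 ms at 100K fps.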
Key Experimental Results
Main Results
Single-frame 3-bit Reconstruction (334 test images, 384×384):
| Method | PSNR↑ | SSIM↑ | LPIPS↓ | ManIQA↑ | ClipIQA↑ | MUSIQ↑ |
|---|---|---|---|---|---|---|
| InstantIR | 10.79 | 0.178 | 0.651 | 0.187 | 0.346 | 36.65 |
| ft-Restormer | 28.73 | 0.816 | 0.294 | 0.262 | 0.435 | 40.44 |
| ft-NAFNet | 28.28 | 0.830 | 0.261 | 0.300 | 0.473 | 39.13 |
| qVAE (Stage 1) | 28.18 | 0.863 | 0.327 | 0.299 | 0.487 | 44.90 |
| gQIR (S1+S2) | 27.28 | 0.839 | 0.318 | 0.331 | 0.547 | 45.61 |
Burst Reconstruction (Extreme Motion):
| Dataset (fps) | QBP PSNR/SSIM | QUIVER PSNR/SSIM | Burst-gQIR PSNR/SSIM |
|---|---|---|---|
| XVFI (1000) | 12.01 / 0.370 | 23.00 / 0.751 | 25.82 / 0.712 |
| I2-2000fps (2000) | 16.04 / 0.549 | 25.06 / 0.874 | 31.21 / 0.878 |
| XD (2K-100K) | 12.78 / 0.409 | 20.10 / 0.790 | 30.33 / 0.895 |
| Aggregate | 13.38 / 0.448 | 22.43 / 0.814 | 29.83 / 0.856 |
I2-2000fps full test set: Burst-gQIR achieves 30.81 dB PSNR / 0.868 SSIM, exceeding the previous state of the art, QuDI (28.64 dB), by 2.17 dB.
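The reported margins can be re-derived from the PSNR values above. The snippet below is only an arithmetic check on the numbers in this note; the dictionary layout and helper name are illustrative.

```python
# PSNR (dB) copied from the burst table and the full-test-set sentence above.
burst_psnr = {
    "XVFI": {"QBP": 12.01, "QUIVER": 23.00, "Burst-gQIR": 25.82},
    "I2-2000fps": {"QBP": 16.04, "QUIVER": 25.06, "Burst-gQIR": 31.21},
    "XD": {"QBP": 12.78, "QUIVER": 20.10, "Burst-gQIR": 30.33},
    "Aggregate": {"QBP": 13.38, "QUIVER": 22.43, "Burst-gQIR": 29.83},
}

def psnr_gain(ours, baseline):
    """PSNR difference in dB, rounded to the table's two decimals."""
    return round(ours - baseline, 2)

gains = {
    name: {b: psnr_gain(row["Burst-gQIR"], row[b]) for b in ("QBP", "QUIVER")}
    for name, row in burst_psnr.items()
}
full_test_gain_over_qudi = psnr_gain(30.81, 28.64)

for name, g in gains.items():
    print(f"{name}: +{g['QBP']:.2f} dB vs QBP, +{g['QUIVER']:.2f} dB vs QUIVER")
print(f"I2-2000fps full test set: +{full_test_gain_over_qudi:.2f} dB vs QuDI")
```

On XD the margin over QBP comes out to 17.55 dB, and on the full I2-2000fps test set the margin over QuDI is 2.17 dB.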
Ablation Study
Stage 1 Design Choices:
| Configuration | PSNR↑ | SSIM↑ | LPIPS↓ |
|---|---|---|---|
| w/o Deterministic Encoding (A) | 20.56 | 0.435 | 0.167 |
| w/o LSA loss (B) | 10.39 | 0.222 | 0.139 |
| w/o (A) + (B) | 10.30 | 0.218 | 0.136 |
| Full Stage 1 | 24.78 | 0.665 | 0.194 |
Three-stage Progress - Fidelity vs Temporal Stability:
| Stage | PSNR↑ | SSIM↑ | \(E^*_{warp}\)↓ |
|---|---|---|---|
| Alignment (S1) | 20.04 | 0.759 | 9.088 |
| Perceptual (S2) | 24.11 | 0.846 | 8.508 |
| Fidelity (S3) | 27.63 | 0.869 | 8.005 |
Key Findings
- LSA loss is critical for Stage 1 success: Removing LSA drops PSNR from 24.78 to 10.39, as the encoder collapses to a constant mapping.
- Deterministic encoding is significant, providing a ~4.2 dB PSNR gain (24.78 vs. 20.56), though it is less critical than LSA.
- Stage 2 adversarial training improves perceptual quality but introduces slight content drift (in the single-frame table, ManIQA rises from 0.299 to 0.331 while PSNR falls from 28.18 to 27.28 dB), which Stage 3 spatio-temporal fusion effectively mitigates.
- On the extreme-motion dataset XD, gQIR gains 17.55 dB PSNR over QBP (30.33 vs. 12.78), demonstrating the advantage of generative priors in out-of-distribution scenarios.
- Realistic color SPAD data can be reconstructed with photo-like quality without dark count or hot pixel correction, requiring only Gray World white balance.
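Gray World white balance, mentioned in the last finding, rescales each color channel so that all channel means coincide. A minimal sketch of the textbook algorithm follows; the function name and the list-of-tuples pixel layout are illustrative, and the paper's actual implementation may differ.

```python
def gray_world(pixels):
    """Gray World white balance on a list of (r, g, b) float tuples.

    Each channel is scaled so that its mean equals the mean of the three
    channel means. No clipping is applied to the result.
    """
    n = len(pixels)
    means = [sum(p[c] for p in pixels) / n for c in range(3)]
    gray = sum(means) / 3.0
    # Leave an all-zero channel untouched to avoid division by zero.
    channel_gains = [gray / m if m > 0 else 1.0 for m in means]
    return [tuple(p[c] * channel_gains[c] for c in range(3)) for p in pixels]
```

For example, an image whose channel means are (0.3, 0.6, 0.9) is rescaled so that every channel mean becomes 0.6.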
Highlights & Insights
- First successful transfer of large-scale T2I diffusion priors to quanta imaging: Addressed the challenge of drastically different noise statistics (Bernoulli vs. Gaussian) by solving encoder collapse.
- Novel LSA Design: Using a frozen pre-trained encoder as the alignment target cut the shortcut path that causes degradation in existing methods like SUPIR/DiffBIR.
- Decoupled Three-Stage Design: S1 handles domain adaptation, S2 enhances perception, and S3 leverages temporal information, with clear responsibilities for each stage.
- Pre-denoising strategy for flow estimation: Estimating flow in the reconstruction domain rather than the noise domain circumvents the failure of optical flow on raw SPAD data.
Limitations & Future Work
- Training at fixed PPP=3.5 limits robustness in ultra-low light (PPP≤1); using PPP as an explicit conditioning signal may improve generalization.
- Pre-trained VAE decoder's 8-bit output limits native SPAD HDR capabilities; HDR-capable decoders are a future direction.
- Video-level diffusion priors could further enhance temporal consistency beyond what Stage 3 currently achieves.
- Stage 3 depends on RAFT flow; implicit alignment in latent space could be explored for extreme motion.
Related Work & Insights
- vs QBP: QBP uses a classic align-and-merge pipeline. gQIR generalizes this to latent space and replaces simple averaging with FusionViT, gaining about 17.5 dB PSNR on the extreme-motion XD dataset.
- vs QUIVER/QuDI: These task-specific models do not utilize pre-trained generative priors. gQIR outperforms QuDI by +2.17 dB on I2-2000fps.
- vs SUPIR/DiffBIR: These fail under Bernoulli noise. gQIR’s LSA loss + frozen encoder target is a critical modification for extreme quanta noise.
- The decoupled training approach is transferable to other non-standard sensor restoration tasks (e.g., event cameras, quantum sensors).
Rating
- Novelty: ⭐⭐⭐⭐ First to adapt T2I diffusion to quanta imaging, solving the key encoder collapse problem.
- Experimental Thoroughness: ⭐⭐⭐⭐ Comprehensive evaluation across synthetic/real data, single/burst modes, and various fps levels.
- Writing Quality: ⭐⭐⭐⭐ Clear motivation and physical modeling; intuitive visualization of encoder collapse.
- Value: ⭐⭐⭐⭐ Opens a new direction for generative priors in computational imaging; code and datasets are open-sourced.