Exploring Spatiotemporal Feature Propagation for Video-Level Compressive Spectral Reconstruction¶

Conference: CVPR 2026
arXiv: 2603.00611
Code: DynaSpec
Area: Computational Spectral Imaging
Keywords: Spectral Compressed Imaging, Hyperspectral Video Reconstruction, Spatiotemporal Feature Propagation, Transformer, DynaSpec Dataset

TL;DR¶

This is the first work to advance spectral compressed imaging (SCI) from image-level to video-level reconstruction. It introduces DynaSpec, the first high-quality dynamic hyperspectral dataset (30 sequences / 300 frames), and proposes PG-SVRT, which combines spatial-then-temporal attention with bridge tokens to reach 41.52 dB PSNR and the best temporal consistency among the compared methods, at lower cost (28.18 GFLOPs) than several image-level SOTAs.

Background & Motivation¶

State of the field: Hyperspectral images (HSI) capture material spectral properties and are widely used in classification, detection, tracking, and autonomous driving. Spectral compressed imaging (SCI) compresses 3D data \(X \in \mathbb{R}^{H \times W \times C}\) into 2D measurements \(Y \in \mathbb{R}^{H \times W'}\) for snapshot acquisition via spatial-spectral encoding. Existing reconstruction methods (MST-L, DPU, RDLUF, etc.) have achieved excellent image-level performance.
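
The compression from a 3D cube to a 2D measurement can be illustrated with a minimal NumPy sketch. It assumes an SD-CASSI-style forward model (mask each band, shift band c by a fixed dispersion step, sum over bands); the function name `sd_cassi_forward` and the `step` parameter are illustrative choices, not taken from the paper.

```python
import numpy as np

def sd_cassi_forward(cube, mask, step=2):
    """Compress an H x W x C cube into an H x (W + step*(C-1)) measurement.

    Assumed SD-CASSI-style model: mask every band with the same 2D pattern,
    shift band c by step*c pixels along the width axis, then sum over bands.
    """
    H, W, C = cube.shape
    meas = np.zeros((H, W + step * (C - 1)), dtype=cube.dtype)
    for c in range(C):
        meas[:, step * c : step * c + W] += mask * cube[:, :, c]
    return meas

rng = np.random.default_rng(0)
cube = rng.random((8, 8, 4))                           # toy H x W x C cube
mask = (rng.random((8, 8)) > 0.5).astype(cube.dtype)   # random binary mask
meas = sd_cassi_forward(cube, mask)
print(meas.shape)  # (8, 14): W' = W + step * (C - 1)
```

Wherever the mask is zero, the corresponding spectra never reach the sensor, which is the information loss that reconstruction has to compensate for.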

Existing limitations: (1) Reconstruction uncertainty: mask encoding inevitably discards spatial-spectral information, so recovering occluded content from a single frame is inherently ambiguous. (2) Temporal inconsistency: reconstructing each frame independently cannot guarantee temporal continuity; the result is flickering intensity curves and inter-frame jitter that fall short of video perception requirements.

Core tension: Video-level reconstruction faces two obstacles. Data scarcity: existing datasets are all image-level, and pseudo-video obtained by cropping lacks real motion freedom. Algorithmic bottlenecks: existing methods struggle to model high-dimensional spatiotemporal dependencies efficiently, since joint attention has explosive complexity while fully separated processing limits interaction.

Objective: To advance spectral reconstruction from image-level to video-level across data, model, and benchmark dimensions.

Approach: With a fixed encoding pattern and a moving scene, adjacent frames capture complementary features: content occluded in one frame can be recovered by propagating features from neighboring frames, which also improves temporal consistency. This physical property gives video-level reconstruction a solid signal foundation.

Core idea: Exploit the complementary features and temporal continuity of sequential measurements, and reconstruct hyperspectral video efficiently through spatial-then-temporal progressive attention with bridge tokens.

Method¶

Overall Architecture¶

PG-SVRT adopts a U-Net architecture that takes \(T=3\) measurement frames as input and is built from three core modules: Mask-Guided Degradation Prior (MGDP), Cross-Domain Propagation Attention (CDPA), and Multi-Domain Feed-Forward Network (MDFFN). A shuffle operation aligns degraded features with measurements along the spectral dimension. The module counts are \((N_1, N_2, N_3) = (4, 8, 8)\), and the base channel number is \(C = N_\lambda = 30\).

Key Designs¶

DynaSpec Dataset:
- Function: Constructing the first high-quality dynamic hyperspectral image dataset
- Core idea: A GaiaField push-broom hyperspectral camera captures controlled objects frame by frame, with manually designed translation, rotation, and articulated motion simulating real-world scene dynamics. Specifications: 30 scenes, 300 HSI frames, spatial resolution 1280×1280, spectral resolution 2 nm, wavelength range 400–700 nm (151 channels)
- Five quality principles: (i) inter-frame motion is continuous and physically plausible; (ii) long exposure for noise reduction; (iii) spectral response calibration; (iv) illumination spectrum exclusion to approximate reflectance data; (v) invariant object intensity calibration to eliminate temperature drift
- Design motivation: Existing datasets are either image-level (CAVE/KAIST) or downstream-task datasets whose low spectral resolution makes them unreliable as ground truth. Controlled frame-by-frame scanning ensures ground-truth authenticity
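
The dataset specifications above can be checked with a few lines of arithmetic. The variable names are illustrative, and the frames-per-scene figure is only an average, since the summary does not say that every scene has the same length.

```python
# Band count implied by 400-700 nm sampled every 2 nm (both ends included).
start_nm, stop_nm, step_nm = 400, 700, 2
wavelengths = list(range(start_nm, stop_nm + 1, step_nm))
num_channels = len(wavelengths)

# Average sequence length implied by 300 frames over 30 scenes.
frames_per_scene = 300 // 30

# Number of values in one full-resolution frame (1280 x 1280 x channels).
values_per_frame = 1280 * 1280 * num_channels

print(num_channels)      # 151
print(frames_per_scene)  # 10
print(values_per_frame)  # 247398400
```
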
Cross-Domain Propagation Attention (CDPA):
- Function: Spatial-then-temporal progressive attention for efficient spatiotemporal feature propagation
- Core idea: Spatial attention—features are partitioned into non-overlapping windows (\(H_{win}=8, W_{win}=32\)), with bridge tokens \(B_s \in \mathbb{R}^{Thw \times N_B \times C}\) (generated by pooling \(Q_s\), \(N_B=64\)) serving as Q-K-V intermediaries without extra parameters: \(Y_s^{out} = \text{GConv}(A(Q_s, B_s, A(B_s, K_s, V_s, \tau_1), \tau_2)) + Y_{N1}\). Temporal attention—after dimension rearrangement, the spatial output is shared as value: \(Y_t^{out} = A(Q_t, K_t, Y_t, \tau_3)\), with no temporal windowing (\(T\) is small and frames are strongly correlated)
- Complexity: \(O = 4THWC^2 + 4THWN_BC + 2T^2HWC\). Bridge tokens reduce cost when \(2N_B < H_{win}W_{win}\) (\(128 < 256\), satisfied)
- Design motivation: Joint spatiotemporal attention is too expensive, while fully separated processing limits interaction. Sharing the spatial output as the temporal value, together with bridge tokens, keeps the cost linear in the number of spatial positions \(HW\) (see the complexity expression above) while preserving cross-domain interaction
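
The two attention steps described above can be sketched in NumPy. This is a simplified illustration, not the authors' implementation: the pooling that produces the bridge tokens is assumed to be average pooling over consecutive query groups, the temperatures \(\tau_1, \tau_2, \tau_3\) are treated as plain scalar scales, the GConv and residual terms are omitted, and \(Q_t\), \(K_t\) are random stand-ins for learned projections.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(q, k, v, tau=1.0):
    """A(Q, K, V, tau): softmax(Q K^T * tau) V over the last two axes."""
    scores = q @ np.swapaxes(k, -1, -2) * tau
    return softmax(scores) @ v

def bridge_attention(q, k, v, n_bridge, tau1=1.0, tau2=1.0):
    """Two-step attention through bridge tokens pooled from the queries."""
    n_win, n_tok, c = q.shape
    # Assumed pooling: average consecutive query groups (no extra parameters).
    bridge = q.reshape(n_win, n_bridge, n_tok // n_bridge, c).mean(axis=2)
    gathered = attend(bridge, k, v, tau1)     # bridge tokens summarize K, V
    return attend(q, bridge, gathered, tau2)  # queries read from the bridge

def temporal_attention(q_t, k_t, y_spatial, tau3=1.0):
    """Attention across the T frames; the spatial output is reused as value."""
    return attend(q_t, k_t, y_spatial, tau3)

rng = np.random.default_rng(0)
T, n_win, n_tok, c = 3, 2, 8 * 32, 30          # 8 x 32 windows, C = 30
q, k, v = (rng.standard_normal((T * n_win, n_tok, c)) for _ in range(3))
y_s = bridge_attention(q, k, v, n_bridge=64)   # (T*n_win, 256, 30)

# Rearrange so attention runs over the T axis at every spatial position
# (assumes the first axis is ordered frame-major).
y_val = y_s.reshape(T, n_win * n_tok, c).transpose(1, 0, 2)
q_t, k_t = (rng.standard_normal((n_win * n_tok, T, c)) for _ in range(2))
y_t = temporal_attention(q_t, k_t, y_val)      # (512, 3, 30)
print(y_s.shape, y_t.shape)
```

Each query attends to \(N_B = 64\) bridge tokens instead of all 256 tokens in its window, and the temporal step introduces no separate value projection because it consumes the spatial output directly.
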
Mask-Guided Degradation Prior (MGDP):
- Function: Explicitly modeling the compression degradation process before the main architecture
- Core idea: The mask \(\Phi\) is compressed according to the SCI architecture (SD/DD) to obtain \(\Phi_s\), then cropped/replicated to \(\Phi_p\). The intensity distribution difference between \(\Phi\) and \(\Phi_p\) is learned (Conv\(_{1\times1}\) + sigmoid) to produce weights \(W_m\), which are applied element-wise to the measurement features \(F_m(Y)\); the result is concatenated with the measurement: \(Y_{in} = \text{Concat}(\text{Conv}(W_m \odot F_m(Y)), Y)\)
- Design motivation: Degradation priors help the network understand the degree of encoding loss at each spatial-spectral location, guiding targeted reconstruction
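
A rough NumPy sketch of the gating step is given below. Several details are assumptions made only for illustration: the "difference" is taken as a plain subtraction, both convolutions are treated as 1×1 (a per-pixel linear map), and the measurement, its features, and the masks are all taken to share the shape H × W × C after the shuffle step. The names `mgdp`, `w_gate`, and `w_proj` are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def conv1x1(x, weight):
    """1x1 convolution on an H x W x C_in map: a per-pixel linear map."""
    return x @ weight  # weight has shape C_in x C_out

def mgdp(meas, meas_feat, mask, mask_p, w_gate, w_proj):
    """Gate measurement features with weights derived from the masks."""
    gate = sigmoid(conv1x1(mask - mask_p, w_gate))   # weights in (0, 1)
    gated = conv1x1(gate * meas_feat, w_proj)        # Conv(W ⊙ F(Y))
    return np.concatenate([gated, meas], axis=-1)    # Concat(..., Y)

rng = np.random.default_rng(0)
H, W, C = 4, 5, 3
meas, meas_feat, mask, mask_p = (rng.random((H, W, C)) for _ in range(4))
w_gate = rng.standard_normal((C, C))
w_proj = rng.standard_normal((C, C))
y_in = mgdp(meas, meas_feat, mask, mask_p, w_gate, w_proj)
print(y_in.shape)  # (4, 5, 6)
```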

Loss Function / Training Strategy¶

Multi-stage RMSE loss; Adam optimizer (\(\beta_1=0.9, \beta_2=0.999\)); learning rate \(3 \times 10^{-4}\) with cosine annealing to \(1 \times 10^{-6}\); 80 epochs, batch size 2; RTX 3090 GPU. Unified comparison across 4 SCI systems (SD-CASSI/DD-CASSI/PMVIS/NDSSI).
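
The schedule and loss can be sketched as follows. The cosine-annealing formula is the standard one; stepping it once per epoch is an assumption, and so is the form of the multi-stage loss (an unweighted sum of RMSE over per-stage outputs), since the summary does not spell out either detail.

```python
import math
import numpy as np

def cosine_lr(epoch, total_epochs=80, lr_max=3e-4, lr_min=1e-6):
    """Cosine annealing from lr_max (epoch 0) to lr_min (epoch total_epochs)."""
    cos = math.cos(math.pi * epoch / total_epochs)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + cos)

def rmse(pred, target):
    return float(np.sqrt(np.mean((pred - target) ** 2)))

def multi_stage_rmse(stage_outputs, target):
    """Assumed form: unweighted sum of RMSE over the per-stage outputs."""
    return sum(rmse(out, target) for out in stage_outputs)

for epoch in (0, 40, 80):
    print(epoch, cosine_lr(epoch))
```

The learning rate starts at \(3 \times 10^{-4}\), passes the midpoint of the two bounds at epoch 40, and ends at \(1 \times 10^{-6}\).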

Key Experimental Results¶

Main Results — Comparison with SOTAs (DD-CASSI System)¶

Method	Venue	PSNR-K↑	PSNR-D↑	SAM-K↓	ST-RRED-K↓	GFLOPs
MST-L	CVPR'22	39.99	39.58	3.82	30.99	28.23
PADUT	ICCV'23	38.61	40.41	4.72	47.19	32.78
DPU	CVPR'24	40.02	41.01	5.22	25.90	31.04
DPU* (with temporal)	CVPR'24	40.50	41.36	5.17	26.71	77.36
PG-SVRT	Ours	41.23	41.82	3.81	19.35	28.18
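
For reference, PSNR and SAM can be computed as in the sketch below, using their standard definitions. It assumes data scaled to [0, 1] and reports SAM in degrees; the summary does not state the unit or data range behind the table, so treat both as assumptions. ST-RRED is a more involved video-quality metric and is not sketched here.

```python
import numpy as np

def psnr(pred, target, peak=1.0):
    """Peak signal-to-noise ratio in dB for data with maximum value `peak`."""
    mse = np.mean((pred - target) ** 2)
    return float(10.0 * np.log10(peak ** 2 / mse))

def sam_degrees(pred, target, eps=1e-12):
    """Mean spectral angle (degrees) between per-pixel spectra (H x W x C)."""
    dot = np.sum(pred * target, axis=-1)
    norm = np.linalg.norm(pred, axis=-1) * np.linalg.norm(target, axis=-1)
    cos = np.clip(dot / (norm + eps), -1.0, 1.0)
    return float(np.degrees(np.arccos(cos)).mean())

target = np.full((2, 2, 4), 0.5)
pred = target + 0.1                 # constant offset: MSE = 0.01
print(psnr(pred, target))           # about 20 dB
print(sam_degrees(pred, target))    # near 0: the spectra are parallel
```

PSNR measures per-value fidelity, while SAM measures the angle between reconstructed and reference spectra and is therefore insensitive to a per-pixel intensity scale.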

Ablation Studies¶

Configuration	PSNR	SSIM	SAM↓	ST-RRED↓	GFLOPs
Baseline (F-MSA+FFN)	39.97	0.9827	5.53	43.90	30.11
+ CDPA	41.30 (+1.33)	0.9884	4.32	25.44	21.11
+ CDPA + MGDP	41.41 (+0.11)	0.9886	4.25	24.63	21.31
+ CDPA + MGDP + MDFFN	41.52 (+0.11)	0.9893	3.91	23.25	28.18
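
The step-by-step deltas in the ablation table can be recomputed with a short script; the values are copied from the table above, and `stepwise_deltas` is an illustrative helper.

```python
rows = [
    ("Baseline (F-MSA+FFN)", 39.97, 30.11),
    ("+ CDPA", 41.30, 21.11),
    ("+ CDPA + MGDP", 41.41, 21.31),
    ("+ CDPA + MGDP + MDFFN", 41.52, 28.18),
]

def stepwise_deltas(rows):
    """Return (name, PSNR delta, GFLOPs delta) relative to the previous row."""
    return [
        (name, round(psnr - prev_psnr, 2), round(gf - prev_gf, 2))
        for (_, prev_psnr, prev_gf), (name, psnr, gf) in zip(rows, rows[1:])
    ]

deltas = stepwise_deltas(rows)
for name, d_psnr, d_gf in deltas:
    print(f"{name}: {d_psnr:+.2f} dB, {d_gf:+.2f} GFLOPs")
total_gain = round(rows[-1][1] - rows[0][1], 2)
print(f"total: {total_gain:+.2f} dB")
```

Read this way, CDPA accounts for most of the 1.55 dB total gain while removing about 9 GFLOPs, and MDFFN adds back most of the compute (about 6.87 GFLOPs) for a further 0.11 dB.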

Key Findings¶

DD-CASSI dominates among the four SCI architectures (PSNR 41.52 vs. runner-up NDSSI 37.84), owing to its high spectral sampling efficiency and clear structural representation
CDPA contributes the most (+1.33 dB PSNR) while actually reducing FLOPs (30.11→21.11G), as bridge tokens replace full-window attention
Spatial-then-temporal + shared value strategy achieves the best result (41.52), outperforming parallel processing (41.35) and temporal-then-spatial (41.04)
Despite being a video model, PG-SVRT's per-frame FLOPs (28.18G) are lower than image-level methods such as DAUHST (35.93G)

Highlights & Insights¶

Data + model + benchmark trinity: The DynaSpec dataset, PG-SVRT model, and four-SCI-system comparison benchmark together provide substantial momentum for the dynamic computational spectral imaging field
Elegant bridge token design: Pooling queries to generate intermediary tokens enables indirect attention with zero extra parameters and reduced complexity. Computation is strictly reduced when \(2N_B < H_{win}W_{win}\)
Shared value cross-domain propagation: The spatial attention output directly serves as the temporal attention value, elegantly enabling multi-domain feature interaction without additional projection overhead
Compelling DPU* comparison: Naively concatenating temporal frames costs significantly more (77.36G) than PG-SVRT (28.18G) with inferior performance
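
The bridge-token saving can be made concrete by evaluating the stated complexity expression. The baseline used for comparison is an assumption: the same expression with the bridge term replaced by direct window attention, \(2THW \cdot H_{win}W_{win} \cdot C\), which is what the condition \(2N_B < H_{win}W_{win}\) implies. The input size 256×256 is chosen only for illustration.

```python
def cdpa_cost(T, H, W, C, n_bridge):
    """Stated cost: 4*T*H*W*C^2 + 4*T*H*W*N_B*C + 2*T^2*H*W*C."""
    return 4 * T * H * W * C**2 + 4 * T * H * W * n_bridge * C + 2 * T**2 * H * W * C

def window_cost(T, H, W, C, h_win, w_win):
    """Assumed baseline: direct window attention replaces the bridge term."""
    return 4 * T * H * W * C**2 + 2 * T * H * W * (h_win * w_win) * C + 2 * T**2 * H * W * C

T, H, W, C = 3, 256, 256, 30
with_bridge = cdpa_cost(T, H, W, C, n_bridge=64)
without_bridge = window_cost(T, H, W, C, h_win=8, w_win=32)
print(with_bridge / without_bridge)  # below 1 because 2 * 64 < 8 * 32
```

The two costs coincide exactly at \(N_B = 128\), i.e. when \(2N_B = H_{win}W_{win}\), so any smaller bridge set reduces the attention term under this assumed baseline.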

Limitations & Future Work¶

DynaSpec contains only 30 scenes / 300 frames; limited diversity and scale may lead to overfitting to specific motion patterns
Frame count is fixed at \(T=3\); extension to longer sequences is not validated, and real dynamic scenarios may require larger temporal windows
Training uses 256×256 crops; full-resolution (1280×1280) inference efficiency and effectiveness are not discussed
Explicit motion modeling methods (optical flow alignment, deformable convolutions) combined with CDPA remain unexplored

Related Work & Insights¶

vs. DPU (CVPR'24): Image-level SOTA. Naively concatenating temporal frames (DPU*) raises the cost about 2.5× (31.04G → 77.36G) for a limited gain (+0.48 dB PSNR-K). PG-SVRT achieves video-level reconstruction via shared-value propagation at lower FLOPs (28.18G)
vs. MST-L/CST-L: Earlier image-level methods show far worse temporal consistency (ST-RRED 30–35) compared to PG-SVRT (19.35)
The bridge token concept is generalizable to other attention mechanism designs requiring efficient processing of high-dimensional data
Fair comparison across SCI systems provides important reference for spectral imaging hardware selection (DD-CASSI is clearly superior)

Rating¶

Novelty: ⭐⭐⭐⭐ Video-level spectral reconstruction is a novel problem formulation; CDPA bridge tokens and shared value propagation are creative designs
Experimental rigor: ⭐⭐⭐⭐⭐ Comparison across four SCI systems, 12 SOTAs, multi-dimensional ablations, and real prototype validation
Writing quality: ⭐⭐⭐⭐ Clear problem motivation with a complete unified SCI mathematical framework
Impact: ⭐⭐⭐⭐⭐ The combination of dataset + method + benchmark has far-reaching impact on the dynamic computational spectral imaging field