# SpatioTemporal Difference Network for Video Depth Super-Resolution
Conference: AAAI 2026 arXiv: 2508.01259 Code: yanzq95/STDNet Area: Image Restoration Keywords: Video Depth Super-Resolution, Long-tail Distribution, Spatial Difference, Temporal Difference, Deformable Convolution
## TL;DR
Motivated by the statistical observation that spatially non-smooth regions and temporally varying regions in video depth super-resolution (VDSR) follow long-tail distributions, this paper proposes STDNet. The method incorporates a spatial difference branch (learning spatial difference representations for intra-frame RGB-D adaptive aggregation) and a temporal difference branch (exploiting temporal difference representations for motion compensation in changing regions). On the TarTanAir dataset, STDNet reduces ×16 RMSE from 112.04 cm to 96.80 cm and outperforms state-of-the-art methods by an average of 27.6%–32.6% across the ×4/×8/×16 scales.
## Background & Motivation
### Development of Depth Super-Resolution
Depth data plays a critical role in 3D reconstruction, virtual reality, and augmented reality. Numerous depth super-resolution (DSR) methods have been proposed to reconstruct high-resolution (HR) depth maps from low-resolution (LR) inputs. Single-frame DSR has achieved notable progress through filtering-based, multimodal fusion, multi-task collaborative, and structure-guided approaches. Video depth super-resolution (VDSR) further improves reconstruction quality by aggregating multi-frame RGB-D features.
### Long-tail Distribution Problem
The authors conduct a statistical analysis of VDSR and identify long-tail distribution phenomena along two dimensions:
- Spatial dimension: Discrepancies between GT depth and upsampled LR depth are concentrated in non-smooth regions (edges, structural transitions), which constitute a small fraction of the overall data yet are substantially harder to reconstruct than the dominant smooth regions.
- Temporal dimension: Depth differences between consecutive and non-adjacent frames are concentrated in temporally varying regions (dynamic objects, edge contours, occluded areas), exhibiting similarly long-tail characteristics.
### Limitations of Prior Work
Existing VDSR methods such as DVSR incorporate multi-frame feature aggregation but do not explicitly model these long-tail distributional properties. Unimodal video RGB super-resolution methods have limited effectiveness in establishing multi-frame, multimodal RGB-D correspondences. A framework specifically designed to address spatially and temporally long-tail regions is therefore needed.
### Core Problem
How to selectively enhance reconstruction quality in spatially non-smooth regions and temporally varying regions—both exhibiting long-tail distributions—within video depth super-resolution, while maintaining temporal consistency?
## Method
### Overall Architecture
STDNet consists of two core branches:
- Spatial Difference Branch: Predicts spatial difference representations \(\boldsymbol{\sigma}\) from LR depth video to guide intra-frame RGB feature alignment and aggregation toward depth non-smooth regions.
- Temporal Difference Branch: Estimates consecutive-frame differences \(\boldsymbol{\varphi}\) and cross-frame differences \(\hat{\boldsymbol{\varphi}}\) to prioritize multi-frame RGB-D aggregation in temporally varying regions.
A difference regularization loss is additionally introduced to supervise the learning of spatiotemporal difference representations.
### Spatial Difference Branch

Spatial difference representation: Non-smooth region information is captured via a downsampling–upsampling difference on the depth features:

\(\boldsymbol{\sigma}_t = \boldsymbol{F}_d^t - \mathrm{up}\big(\mathrm{down}(\boldsymbol{F}_d^t)\big)\)
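A minimal PyTorch sketch of this operator, assuming bilinear resampling and a ×2 round trip (the function name and scale are illustrative, not from the paper):

```python
import torch
import torch.nn.functional as F

def spatial_difference(feat: torch.Tensor, scale: int = 2) -> torch.Tensor:
    """Downsample-upsample residual on a (B, C, H, W) feature map.

    Smooth regions survive the round trip almost unchanged, so the
    residual concentrates on non-smooth regions (edges, structural
    transitions) -- the spatially long-tail regions. `scale` is an
    assumed choice.
    """
    h, w = feat.shape[-2:]
    down = F.interpolate(feat, scale_factor=1.0 / scale,
                         mode="bilinear", align_corners=False)
    up = F.interpolate(down, size=(h, w),
                       mode="bilinear", align_corners=False)
    return feat - up  # sigma_t: large magnitude in long-tail regions
```

In effect this behaves like one level of a Laplacian pyramid, matching the analogy drawn in the Highlights section below.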
Spatial difference mechanism:
- A filter kernel \(\boldsymbol{k}_t = \mathcal{G}(\boldsymbol{\sigma}_t)\) is generated from \(\boldsymbol{\sigma}_t\) to align RGB features toward non-smooth depth regions.
- Adaptive weights \(\boldsymbol{w}_t = \mathcal{E}_w(\boldsymbol{\sigma}_t)\) are encoded from \(\boldsymbol{\sigma}_t\) via convolution, max, mean, and sigmoid operations (one plausible realization is sketched after this list).
- Weighted aggregation: \(\boldsymbol{F}_{sd}^t = f_c(\boldsymbol{F}_d^t, \boldsymbol{w}_t \otimes \boldsymbol{F}_r^t, \mathcal{F}(\boldsymbol{F}_r^t, \boldsymbol{k}_t))\)
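The paper names only the operations inside \(\mathcal{E}_w\) (convolution, max, mean, sigmoid). A plausible realization under that recipe, with channel-wise max/mean statistics feeding a sigmoid gate (layer widths and kernel sizes are assumptions):

```python
import torch
import torch.nn as nn

class WeightEncoder(nn.Module):
    """Hedged sketch of E_w(sigma_t): produces adaptive weights w_t
    that gate RGB features toward depth non-smooth regions."""

    def __init__(self, channels: int):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)
        self.fuse = nn.Conv2d(2, 1, 7, padding=3)

    def forward(self, sigma: torch.Tensor) -> torch.Tensor:
        x = self.conv(sigma)
        ch_max = x.amax(dim=1, keepdim=True)   # channel-wise max
        ch_mean = x.mean(dim=1, keepdim=True)  # channel-wise mean
        w = torch.sigmoid(self.fuse(torch.cat([ch_max, ch_mean], dim=1)))
        return w  # in (0, 1); applied as w_t (*) F_r^t by broadcasting
```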
This mechanism drives RGB information to selectively propagate into depth non-smooth regions via spatial difference representations, effectively alleviating the long-tail effect.
### Temporal Difference Branch

Temporal difference representations: These are computed as feature differences between frame pairs:

\(\boldsymbol{\varphi}_{t-1} = \boldsymbol{F}_d^t - \boldsymbol{F}_d^{t-1}, \qquad \hat{\boldsymbol{\varphi}}_{t-2} = \boldsymbol{F}_d^t - \boldsymbol{F}_d^{t-2}\)

capturing temporal variation information from the consecutive \((t-1, t)\) and cross-frame \((t-2, t)\) pairs, respectively.
Temporal difference strategy: A bidirectional iterative scheme is employed, comprising neighboring-frame fusion and cross-frame fusion stages:
- Neighboring-frame fusion: The temporal difference \(\boldsymbol{\varphi}_{t-1}\) is passed through encoder \(\mathcal{E}_\varphi\) to produce offsets \(\delta_{t-1}\) and modulation scalars \(m_{t-1}\); deformable convolution \(\mathcal{D}\) is then applied to dynamically sample temporally varying information, and spatial difference weights \(\boldsymbol{w}_t\) are additionally used to mitigate cross-modal discrepancies (a code sketch follows this list).
- Cross-frame fusion: \(\hat{\boldsymbol{\varphi}}_{t-2}\) is used analogously to process frame \(t-2\).
- Final fusion: \(\hat{\boldsymbol{F}}_f^t = \boldsymbol{F}_f^{t-1,t} + \boldsymbol{F}_f^{t-2,t}\)
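A sketch of the difference-driven deformable alignment using torchvision's modulated DeformConv2d; the head layouts and kernel size are assumptions, but the wiring follows the description above (\(\boldsymbol{\varphi} \to\) offsets \(\delta\) and modulation \(m \to\) deformable sampling):

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class TemporalDifferenceAlign(nn.Module):
    """Sketch: encode the temporal difference phi into deformable-conv
    offsets (delta) and modulation scalars (m), then sample the other
    frame's features so aggregation focuses on temporally varying
    regions."""

    def __init__(self, channels: int, k: int = 3):
        super().__init__()
        self.offset_head = nn.Conv2d(channels, 2 * k * k, 3, padding=1)  # delta
        self.mask_head = nn.Conv2d(channels, k * k, 3, padding=1)        # m
        self.dcn = DeformConv2d(channels, channels, k, padding=k // 2)

    def forward(self, feat_other: torch.Tensor,
                phi: torch.Tensor) -> torch.Tensor:
        delta = self.offset_head(phi)           # where to sample
        m = torch.sigmoid(self.mask_head(phi))  # how much to trust each tap
        return self.dcn(feat_other, delta, mask=m)
```

The neighboring-frame and cross-frame stages would reuse this pattern with \(\boldsymbol{\varphi}_{t-1}\) and \(\hat{\boldsymbol{\varphi}}_{t-2}\), respectively, before the final additive fusion.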
### Difference Regularization Loss

The total loss combines a reconstruction loss with difference regularization:

\(\mathcal{L} = \mathcal{L}_{rec} + \beta \mathcal{L}_{diff}\)

where \(\mathcal{L}_{rec}\) is a Charbonnier loss and \(\mathcal{L}_{diff} = \alpha_1 \mathcal{L}_{sd} + \alpha_2 \mathcal{L}_{td}\).
- Spatial difference loss: Introduces uncertainty constraints to impose larger reconstruction error penalties in non-smooth regions: \(\mathcal{L}_{sd} = \sum_q (\boldsymbol{\sigma}^q - \min(\boldsymbol{\sigma}^q)) \|\boldsymbol{D}_{GT}^q - \boldsymbol{D}_{HR}^q\|_1\)
- Temporal difference loss: Constrains the temporal difference representations to be consistent with the temporal variations in GT depth, covering both consecutive-frame and cross-frame terms.
Hyperparameter settings: \(\alpha_1 = \alpha_2 = 0.5\), \(\beta = 0.01\).
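A compact sketch of the loss under the formulas above; the Charbonnier \(\epsilon\), the reductions, and the tensor layout are assumptions, and \(\mathcal{L}_{td}\) is passed in precomputed since its exact form is not reproduced here:

```python
import torch

def charbonnier(pred: torch.Tensor, target: torch.Tensor,
                eps: float = 1e-3) -> torch.Tensor:
    """Charbonnier (smooth, L1-like) reconstruction loss."""
    return torch.sqrt((pred - target) ** 2 + eps ** 2).mean()

def spatial_diff_loss(sigma: torch.Tensor, d_gt: torch.Tensor,
                      d_hr: torch.Tensor) -> torch.Tensor:
    """L_sd: the min-shifted spatial difference acts as a per-pixel
    weight, so non-smooth (long-tail) pixels incur larger L1 penalties."""
    w = sigma - sigma.amin(dim=(-2, -1), keepdim=True)  # sigma^q - min(sigma^q)
    return (w * (d_gt - d_hr).abs()).sum()

def total_loss(d_hr, d_gt, sigma, l_td,
               alpha1=0.5, alpha2=0.5, beta=0.01):
    """L = L_rec + beta * (alpha1 * L_sd + alpha2 * L_td)."""
    l_rec = charbonnier(d_hr, d_gt)
    l_diff = alpha1 * spatial_diff_loss(sigma, d_gt, d_hr) + alpha2 * l_td
    return l_rec + beta * l_diff
```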
## Key Experimental Results

### Table 1: Quantitative Comparison on TarTanAir (all metrics in cm)

| Method | Venue | ×4 RMSE↓ | ×4 MAE↓ | ×8 RMSE↓ | ×8 MAE↓ | ×16 RMSE↓ | ×16 MAE↓ | ×16 TEPE↓ |
|---|---|---|---|---|---|---|---|---|
| DJFR | PAMI'19 | 75.56 | 10.59 | 105.45 | 18.43 | 141.14 | 31.22 | 20.27 |
| DKN | IJCV'21 | 82.69 | 11.73 | 110.10 | 18.78 | 153.56 | 33.21 | 21.93 |
| SGNet | AAAI'24 | 79.40 | 11.36 | 116.33 | 23.15 | 144.17 | 34.34 | 20.14 |
| DORNet | CVPR'25 | 63.38 | 8.60 | 93.75 | 13.96 | 123.24 | 23.59 | 16.40 |
| DVSR | CVPR'23 | 57.72 | 4.40 | 76.96 | 7.74 | 112.04 | 14.39 | 11.06 |
| STDNet | — | 50.28 | 3.73 | 72.03 | 6.75 | 96.80 | 12.01 | 8.90 |
At ×16 super-resolution, STDNet reduces RMSE by 15.24 cm, MAE by 2.38 cm, and TEPE by 2.16 cm compared to DVSR.
### Table 2: Generalization on DyDToF Dataset (RMSE/MAE in cm)
| Method | ×4 RMSE↓ | ×8 RMSE↓ | ×16 RMSE↓ | ×16 MAE↓ |
|---|---|---|---|---|
| DVSR | 19.53 | 27.63 | 43.55 | 9.80 |
| STDNet | 18.23 | 26.87 | 39.24 | 8.72 |
Without fine-tuning, STDNet outperforms DVSR on DyDToF, reducing ×16 RMSE by 4.31 cm.
### Table 3: Generalization on DynamicReplica Dataset
| Method | ×4 RMSE↓ | ×8 RMSE↓ | ×16 RMSE↓ |
|---|---|---|---|
| DVSR | 0.37 | 0.58 | 1.25 |
| STDNet | 0.32 | 0.53 | 1.10 |
### Ablation Study

| Variant | TarTanAir ×4 RMSE↓ | DyDToF ×4 RMSE↓ |
|---|---|---|
| Baseline (concatenation in place of SD+TD) | ~60+ | ~36+ |
| +SD (spatial difference only) | −3.56 cm vs. baseline | — |
| +TD (temporal difference only) | −14.02 cm vs. baseline | — |
| +SD+TD (full STDNet) | −17.94 cm vs. baseline | — |
Removing the difference regularization loss increases TarTanAir ×16 RMSE by 7.08 cm, validating the effectiveness of the loss design.
### Model Complexity

Compared to single-frame methods, STDNet uses on average 9.23 M fewer parameters while reducing RMSE by 35.82 cm. Compared to the multi-frame method DVSR, inference time drops by 47.35 ms and RMSE by 4.93 cm, at the cost of only 4.4 M additional parameters.
## Highlights & Insights
- Problem formulation grounded in statistical analysis: Rather than designing a network without prior analysis, the authors first perform histogram-based statistical analysis on VDSR, identify long-tail distributional characteristics along both spatial and temporal dimensions, and then design targeted solutions accordingly—a data-driven problem discovery approach worth emulating.
- Simplicity and effectiveness of the spatial difference mechanism: The spatial difference representation is obtained via a straightforward downsampling–upsampling difference (analogous to the Laplacian pyramid concept), which is then used to generate filter kernels and weights for guiding RGB-D aggregation. The design is concise yet yields significant gains.
- Elegant integration of temporal difference with deformable convolution: Temporal difference representations are transformed into offsets and modulation scalars for deformable convolution, enabling motion compensation to naturally focus on long-tail regions with pronounced temporal variation.
- Consistent cross-dataset generalization: Trained solely on TarTanAir, STDNet consistently improves performance on DyDToF and DynamicReplica without fine-tuning, demonstrating robustness.
- Substantial improvement in temporal consistency: x-t slice visualizations clearly demonstrate that STDNet produces more stable depth predictions in temporally varying regions.
## Limitations & Future Work
- Synthetic data training: Experiments are conducted exclusively on synthetic datasets (TarTanAir, DyDToF, DynamicReplica); performance on noisy depth videos captured by real sensors remains unvalidated.
- Fixed number of reference frames: Experiments show that using 2 frames (1 neighboring + 1 cross-frame) yields optimal performance, but this empirical setting may not generalize well to scenes with large motion or prolonged occlusion; an adaptive frame selection mechanism is lacking.
- Absence of optical flow: Spatial difference representations are computed via simple downsampling–upsampling differences, and temporal differences via inter-frame feature differences, both without explicit motion estimation using optical flow, which may limit performance under large-displacement scenarios.
- Computational overhead: Although the parameter count is lower than single-frame methods such as DKN, the bidirectional iteration combined with deformable convolution increases computation, posing challenges for real-time applications.
- Scope limited to depth super-resolution: The method is not extended to related tasks such as depth completion or depth denoising, leaving its generality insufficiently validated.
## Related Work & Insights
- DVSR (Sun et al., CVPR 2023): The first dToF-based VDSR method, which mitigates spatial blurring through multi-frame fusion but does not explicitly address long-tail distributions. STDNet outperforms DVSR by an average of 32.6%/28.8%/27.6% on TarTanAir at ×4/×8/×16.
- DORNet (Wang et al., CVPR 2025): A recent single-frame DSR method whose performance is substantially weaker than multi-frame methods (×16 RMSE 123.24 vs. STDNet 96.80), confirming the importance of multi-frame information.
- BasicVSR++ (Chan et al.): A bidirectional recurrent framework for video RGB super-resolution. STDNet adopts its bidirectional iterative idea but designs a temporal-difference-driven aggregation strategy tailored to VDSR.
- SVDC (Zhu et al., 2025): A video depth completion framework that fuses multi-frame features via adaptive frequency selection. STDNet focuses specifically on the long-tail distribution problem in depth super-resolution and follows a distinct design philosophy.
The long-tail distribution perspective is generalizable to other video restoration tasks (e.g., video deblurring, video denoising), where edge and motion regions similarly follow non-uniform distributions. The spatial difference representation (downsampling–upsampling difference) serves as a lightweight, supervision-free detector for non-smooth regions and can be applied as an attention weight generator in other multimodal fusion tasks. The temporal difference-driven deformable convolution design can inspire motion modeling in video depth estimation and video optical flow estimation.
## Rating
- Novelty: ⭐⭐⭐⭐ — The long-tail distribution perspective and the dual-branch design driven by spatiotemporal differences are novel; however, individual modules (deformable convolution, bidirectional iteration) are combinations of existing techniques.
- Experimental Thoroughness: ⭐⭐⭐⭐ — Three datasets, three scaling factors, 11 comparison methods, detailed ablation studies (SD/TD/loss/frame count), complexity analysis, and PCA visualizations constitute a comprehensive evaluation.
- Writing Quality: ⭐⭐⭐⭐ — The narrative logic of statistical analysis → problem identification → method design is clear; figures and tables are of high quality; histogram comparisons intuitively illustrate the mitigation of long-tail effects.
- Value: ⭐⭐⭐⭐ — Achieves substantial improvements on the VDSR task (×16 average 27.6%), with a long-tail distribution perspective transferable to other restoration tasks.