# Instant Video Models: Universal Adapters for Stabilizing Image-Based Networks
**Conference**: NeurIPS 2025 · **arXiv**: 2512.03014 · **Code**: None (but the method is reproducible) · **Area**: 3D Vision / Video Processing / Temporal Consistency · **Keywords**: Temporal Stability, Stabilization Adapters, Video Consistency, Corruption Robustness, EMA
## TL;DR
This paper proposes a class of universal Stabilization Adapters that can be inserted into nearly any image-model architecture. By freezing the base network and training only the adapter parameters against a unified accuracy–stability–robustness loss, the method equips frame-level models with temporal consistency and corruption robustness on video.
## Background & Motivation

**Background**: Video is commonly processed frame by frame — image datasets are more abundant, image models are cheaper to train, and per-frame performance gains often transfer to video tasks.

**Limitations of Prior Work**:
- Frame-by-frame processing introduces temporal inconsistencies (flickering, abrupt changes), degrading perceptual quality and the reliability of downstream systems.
- Real-world deployment faces transient corruptions (sensor noise, compression artifacts, adverse weather), which simultaneously exacerbate instability and reduce accuracy.
- Existing video models are typically designed for specific tasks and lack generality.

**Key Challenge**: Enhancing temporal stability can cause over-smoothing and reduce accuracy; a balance between precision and stability is required.

**Goal**: Design a universal, lightweight, and modular approach that lets pretrained image models achieve temporal stability and corruption robustness during video inference without sacrificing accuracy.

**Key Insight**: Model stability and robustness jointly within a single loss function; analyze that loss theoretically to avoid "over-smoothing reality"; and design learnable adapters for adaptive stabilization.

**Core Idea**: A controller network predicts element-wise EMA decay rates, letting the degree of stabilization adapt to the rate of scene change. Theoretically, whenever \(\lambda < 1/2\), the ground truth remains the global minimum of the loss.
## Method

### Overall Architecture
- Insert Stabilization Adapters at selected layers and outputs of the pretrained image model.
- Freeze the original model's parameters; train only the adapter parameters (a minimal setup sketch follows this list).
- Train with a unified accuracy–stability–robustness loss.
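A minimal PyTorch sketch of this setup, assuming adapters are `nn.Module`s tagged with an `is_stabilization_adapter` attribute — an illustrative convention of this sketch, not the paper's API (no official code is released):

```python
import torch
import torch.nn as nn

def adapter_parameters(model: nn.Module):
    """Freeze every parameter of the pretrained image model, then re-enable
    gradients only for modules tagged as stabilization adapters."""
    for p in model.parameters():
        p.requires_grad = False
    params = []
    for m in model.modules():
        if getattr(m, "is_stabilization_adapter", False):  # hypothetical tag
            for p in m.parameters():
                p.requires_grad = True
                params.append(p)
    return params

# Usage: the base network stays frozen; only adapter weights receive updates.
# optimizer = torch.optim.Adam(adapter_parameters(model), lr=1e-4)
```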
### Key Designs

- **Unified Loss Function**: Combines corruption robustness \(\mathcal{R}_c\) and corruption stability \(\mathcal{S}_c\): \(\mathcal{U}_c = -(\mathcal{R}_c + \lambda \mathcal{S}_c) = \mathbb{E}\left[\sum_t \delta(f(\varepsilon_t(\bm{x}_t)), \bm{y}_t) + \lambda \sum_{t} \delta(f(\varepsilon_t(\bm{x}_t)), f(\varepsilon_{t+1}(\bm{x}_{t+1})))\right]\), where \(\lambda\) controls the trade-off between stability and accuracy (a code sketch follows this list).
- **Oracle Bound (\(\lambda < 1/2\))**: When the distance \(\delta\) can be expressed as a norm, any \(\lambda < 1/2\) guarantees that the ground truth is the global minimum of the loss in prediction space — a perfect model is never incentivized to deviate from correct predictions in exchange for higher stability (a short derivation sketch follows this list).
- **Collapse Bound (\(\lambda > \tau - 1\))**: When \(\lambda\) exceeds the sequence length \(\tau\) minus one, the global minimum of the loss degenerates to repeating the initial prediction (prediction collapse). Since \(\tau - 1 \geq 1 > 1/2\) for any clip of at least two frames, the oracle regime and the collapse regime can never overlap.
- **EMA Stabilizer**: The basic unit is an exponential moving average, \(\tilde{z}_t = \beta z_t + (1-\beta) \tilde{z}_{t-1}\), where \(\beta \in [0,1]\) controls the degree of stabilization (\(\beta = 1\) passes features through unchanged; \(\beta = 0\) freezes the first prediction).
- **Stabilization Controller**: A shared backbone \(g\) and per-layer heads \(h_i\) predict pixel-wise decay rates \(\bm{\beta}\) from current/previous frame inputs and features: \(\tilde{\bm{z}}_{i,t} = \bm{\beta}_{i,t} \odot \bm{z}_{i,t} + (1-\bm{\beta}_{i,t}) \odot \tilde{\bm{z}}_{i,t-1}\), with \(\bm{\beta}_{i,t} = \sigma(h_i(g(\bm{x}_t, \bm{x}_{t-1}), \bm{z}_{i,t}, \tilde{\bm{z}}_{i,t-1}, \bm{z}_{i,t-1}))\) (see the adapter sketch after this list).
- **Spatial Fusion Extension**: The controller predicts spatial decay kernels \(\bm{\eta}\) (instead of a scalar \(\beta\) per location), performing weighted aggregation over spatial neighborhoods to allow motion compensation; the maximum trackable motion is determined by the kernel's spatial extent.
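To make the loss concrete, here is a minimal sketch with \(\delta\) as the Euclidean norm (the paper's stated choice); the function name, tensor layout, and default \(\lambda\) are illustrative:

```python
import torch

def unified_loss(preds: torch.Tensor, targets: torch.Tensor, lam: float = 0.4) -> torch.Tensor:
    """Accuracy + lambda * stability over one clip of length T.

    preds:   (T, ...) stabilized predictions f(eps_t(x_t)) on corrupted frames
    targets: (T, ...) clean ground-truth outputs y_t
    lam:     stability weight; the oracle bound holds for lam < 1/2, while
             lam > T - 1 provably collapses predictions onto the first frame.
             (0.4 here is just an example value inside the oracle regime.)
    """
    # Accuracy term: per-frame Euclidean distance to the ground truth.
    accuracy = (preds - targets).flatten(1).norm(dim=1).sum()
    # Stability term: Euclidean distance between consecutive predictions.
    stability = (preds[1:] - preds[:-1]).flatten(1).norm(dim=1).sum()
    return accuracy + lam * stability
```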
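The oracle bound admits a short triangle-inequality argument; the following is a reconstruction from the statement above (the paper's own proof may differ in detail). Perturb the oracle prediction at each frame, \(f_t = y_t + e_t\) with \(d_t := \|e_t\|\):

```latex
% delta is a norm; perturb the oracle: f_t = y_t + e_t, d_t = ||e_t||.
\begin{aligned}
\Delta(\text{accuracy}) &= \sum_t \|y_t + e_t - y_t\| = \sum_t d_t,\\
\delta(f_t, f_{t+1}) &\ge \delta(y_t, y_{t+1}) - d_t - d_{t+1}
  \quad \text{(triangle inequality)},\\
\Delta(\text{loss}) &\ge \sum_t d_t - \lambda \sum_t (d_t + d_{t+1})
  \;\ge\; (1 - 2\lambda)\sum_t d_t.
\end{aligned}
```

The change in loss is strictly positive whenever any \(d_t > 0\) and \(\lambda < 1/2\): every deviation from the ground truth increases the loss.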
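And a sketch of a single controlled-EMA adapter. For brevity the controller here is one convolutional head conditioned on features only, whereas the paper's controller also feeds the frame pair \((\bm{x}_t, \bm{x}_{t-1})\) through a shared backbone \(g\); all names are illustrative:

```python
import torch
import torch.nn as nn

class ControlledEMAAdapter(nn.Module):
    """Stabilizes one layer's features with an element-wise, learned EMA."""

    is_stabilization_adapter = True  # tag picked up by the earlier training sketch

    def __init__(self, channels: int):
        super().__init__()
        # The head sees current features z_t, previous features z_{t-1}, and
        # the previous stabilized state, concatenated along channels.
        self.head = nn.Conv2d(3 * channels, channels, kernel_size=3, padding=1)
        self.state = None  # \tilde{z}_{t-1}
        self.prev = None   # z_{t-1}

    def reset(self):
        """Clear the causal state between clips / at stream start."""
        self.state = self.prev = None

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        if self.state is None:
            self.state = self.prev = z  # first frame: beta = 1, pass through
            return z
        beta = torch.sigmoid(self.head(torch.cat([z, self.prev, self.state], dim=1)))
        z_tilde = beta * z + (1.0 - beta) * self.state  # element-wise EMA
        # Keep the graph while training on clips; .detach() these two lines
        # for long streaming inference.
        self.state, self.prev = z_tilde, z
        return z_tilde
```

In a streaming loop the adapter simply wraps the layer output frame by frame; because the update is causal, no future frames are needed, matching the design principles below.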
### Design Principles

- **Causality**: Stabilized outputs depend only on current and past inputs (no future frames), making the method suitable for streaming video.
- **Feature-Domain + Output-Domain**: Both intermediate features and final outputs are stabilized, supporting high-level semantic consistency.
- **Non-invasive**: Adapters are modular layers with independent parameters; the original model is not modified.
## Key Experimental Results

### Image Enhancement (HDRNet)
The controller with spatial fusion improves PSNR and stability simultaneously (roughly +2 dB PSNR and a ~35% reduction in instability), whereas a naive fixed EMA improves stability only at the cost of quality.
### Denoising (NAFNet)
At noise level \(\sigma = 0.1\), the Controlled+Spatial variant achieves both higher PSNR and lower instability than the frozen base model.
**Key finding**: A fixed EMA degrades PSNR and stability at the same time, because the denoising model predicts noise residuals that are essentially uncorrelated across frames, so smoothing those residuals suppresses the denoising itself. The learned controller automatically leaves such features unsmoothed.
### Corruption Robustness

| Corruption Type | Model | Enh. PSNR↑ | Enh. Instab.↓ | Den. PSNR↑ | Den. Instab.↓ | Depth AbsRel↓ | Depth Instab.↓ |
|---|---|---|---|---|---|---|---|
| Patch drop | Base | 17.43 | 164.6 | 18.93 | 151.4 | 0.070 | 9.89 |
| Patch drop | Ours | 31.39 | 30.36 | 35.46 | 20.42 | 0.070 | 4.73 |
| JPEG artifacts | Base | 24.85 | 42.06 | 29.01 | 39.71 | 0.057 | 7.32 |
| JPEG artifacts | Ours | 26.46 | 23.58 | 32.19 | 20.49 | 0.065 | 4.92 |
Across nearly all corruption types and tasks, the stabilizer substantially reduces instability (typically by 50–80%) while maintaining or improving per-frame accuracy.
### Adverse Weather Robustness

| Stabilizer | Unfreeze Base | Rain PSNR↑ | Rain Instab.↓ | Snow PSNR↑ | Snow Instab.↓ |
|---|---|---|---|---|---|
| × | × | 21.43 | 151.76 | 18.62 | 262.48 |
| ✓ | × | 28.63 | 57.88 | 31.34 | 59.31 |
| × | ✓ | 32.19 | 70.84 | 34.33 | 66.57 |
| ✓ | ✓ | 32.61 | 58.30 | 35.20 | 58.98 |
### Ablation Study
| Stabilizer Variant | Characteristics | Effect |
|---|---|---|
| Output Fixed | Fixed EMA at output layer only | Stability ↑, Accuracy ↓ |
| Simple Fixed | Fixed EMA at all layers | Both accuracy and stability degrade in denoising |
| Simple Learned | Learned per-channel β | Slightly better than fixed |
| Controlled | Controller predicts pixel-wise β | Stability ↑, Accuracy ↑ |
| Spatial | Controller + spatial fusion | Best performance in most cases |
Prediction collapse is confirmed at \(\lambda = 8 > \tau - 1\): instability falls below \(10^{-3}\), empirically validating the theoretical collapse bound.
### Key Findings
- Feature-domain stabilization is critical for high-level tasks (depth estimation, semantic segmentation).
- The core value of the controller lies in adaptively regulating the degree of stabilization based on scene dynamics.
- Stabilization adapters naturally enhance corruption robustness without explicitly modeling corruption types.
## Highlights & Insights

- **Closed Loop Between Theory and Practice**: The oracle and collapse bounds give theoretical guidance for choosing \(\lambda\), and the experiments validate both.
- **Strong Generality**: The same framework applies to HDRNet (enhancement), NAFNet (denoising), Depth Anything v2 (depth estimation), and DeepLabv3+ (segmentation).
- **Modular Design**: Original model parameters are not modified; adapters are plug-and-play.
- **Causality**: Only current and past frames are used, making the method suitable for real-time streaming video.
- The approach is conceptually related to Mamba's selective state-space models: the controlled EMA update (Eq. 7 in the paper) can be viewed as an input-conditioned linear dynamical system.
## Limitations & Future Work
- The theoretical bounds require \(\delta\) to be expressible as a norm, which excludes many complex loss functions.
- Depth estimation encounters difficulties in sim-to-real transfer (real videos contain subtle corruptions absent in simulation).
- Spatial fusion exhibits performance degradation on long sequences under extreme noise, partially mitigable by increasing training \(\tau\).
- The current formulation uses the simple Euclidean norm as \(\delta\); exploring richer metrics such as the Wasserstein distance may yield further improvements.
- Computational overhead is not analyzed in detail, particularly the additional cost introduced by the controller backbone.
## Related Work & Insights
- Blind video temporal consistency methods (Bonneel et al. 2015; Lai et al. 2018) operate only in the output space; this work extends stabilization to the feature space.
- Clockwork ConvNets observed that semantic content changes more slowly than pixel values — this paper embodies the same intuition through feature-domain stabilization.
- The approach has direct applicability to any "frame-level model + video deployment" scenario: autonomous driving perception, video editing, AR/VR.
- The adapter training paradigm can be extended to other temporal tasks (e.g., stabilizing frame-level models in audio processing).
## Rating

- **Novelty**: ⭐⭐⭐⭐ The theoretical analysis of the unified loss is novel and the stabilization-controller design is practical, though the EMA concept itself is conventional.
- **Experimental Thoroughness**: ⭐⭐⭐⭐⭐ Covers four tasks (denoising, enhancement, depth, segmentation), multiple corruption types, adverse weather, and thorough ablations.
- **Writing Quality**: ⭐⭐⭐⭐⭐ The theory is clearly presented, the experiments are well organized, and the figures are informative.
- **Value**: ⭐⭐⭐⭐⭐ Addresses a core bottleneck in deploying image models to video; excellent generality and practical utility.