
Smooth Regularization for Efficient Video Recognition

Conference: NeurIPS 2025 | arXiv: 2511.20928 | Code: GitHub | Area: Model Compression | Keywords: Video Recognition, Smooth Regularization, Gaussian Random Walk, Lightweight Models, Temporal Inductive Bias

TL;DR

This paper proposes a Gaussian Random Walk (GRW)-based smooth regularization technique that imposes a temporal smoothness constraint (penalizing high-acceleration changes) on the intermediate-layer embeddings of video recognition models. It improves the accuracy of lightweight models by 3.8%–6.4% and establishes a new state of the art on Kinetics-600 under matched FLOP constraints.

Background & Motivation

Although video recognition models have made significant progress in learning spatiotemporal representations, many architectures still suffer from overfitting or inefficient exploitation of temporal information. The core insight of this paper is that real-world video content typically exhibits continuous motion and gradual appearance changes, and therefore a model's internal representations should also vary smoothly over time.

However, current video models do not explicitly leverage this temporal smoothness prior. Large networks have enough capacity to model both meaningful variation and noise, but resource-constrained lightweight networks struggle to separate meaningful motion signals from noisy fluctuations in the embedding space. Injecting a temporal smoothness inductive bias is therefore especially important for lightweight models.

The authors illustrate the value of smoothness through an elegant warm-up experiment: on a simple airplane-rotation classification dataset, a model trained without a smoothness term learns a chaotic and unstructured embedding space, whereas adding smooth regularization leads the model to discover an intrinsic two-dimensional linear representation in which each rotation maps to a specific direction and the embedding trajectory is smooth with low acceleration.

Method

Overall Architecture

GRW-smoothing is added as a plug-and-play regularization term to the standard cross-entropy loss. It operates on per-frame embeddings extracted from intermediate or final layers of the model, and enforces both frame-order preservation and smoothness constraints via a contrastive loss. The overall training objective is: \(\mathcal{L} = \mathcal{L}_{CE} + \lambda \mathcal{L}_{smooth}\)

Key Designs

  1. Frame-Order Contrastive Loss: Given a normalized embedding sequence \(Z = (\mathbf{z}_t)_{t=0}^{N-1}\), it is segmented into sub-clips \(Z^c\) of length \(T\). A contrastive loss is constructed to contrast the correct frame order against all permutations: \(\mathcal{L}_f(\varphi) = -\mathbb{E}_{X,c}\left[\log \frac{f(Z^c_{\text{correct}})}{\sum_{\pi \in S(1:T)} f(Z^c_\pi)}\right]\) This frame-order constraint prevents degenerate solutions (mapping all frames to a single point is maximally "smooth" yet meaningless). Design Motivation: naively minimizing embedding differences leads to representational collapse, which the contrastive formulation mitigates by preserving informativeness.

  2. Gaussian Random Walk Smoothness Prior: The velocity changes in the embedding sequence are modeled as a Gaussian random walk. Defining velocities \(\mathbf{v}_t^c = \mathbf{z}_{t+1}^c - \mathbf{z}_t^c\) and accelerations \(\mathbf{a}_t^c = \mathbf{v}_{t+1}^c - \mathbf{v}_t^c\), and assuming the accelerations are i.i.d. standard normal, \(\mathbf{a}_t^c \sim \mathcal{N}(\mathbf{0}, I)\), the probability density is: \(f(Z^c) = p(A^c) = \prod_{t=0}^{T-3} \mathcal{N}(\mathbf{a}_t^c)\) Substituting this into the contrastive loss assigns higher probability to the correct frame order (low acceleration) than to random permutations. Mechanism: low acceleration implies smoother motion.

  3. Velocity Scale Control: An additional term \(\Omega(V^c) = \log \prod_{t} \mathcal{N}(\mathbf{v}_t^c)\) is introduced to regulate the overall scale of velocities, preventing degenerate solutions caused by embedding rescaling. The final smooth loss is: \(\mathcal{L}_{smooth} = -\mathbb{E}_{X,c}\left[\log \frac{p(A^c)}{\sum_\pi p(A^c_\pi)} + \alpha \Omega(V^c)\right]\)

  4. Application Location: GRW-smoothing supports both intermediate-layer smoothing (applied after global pooling and BN normalization) and final-layer smoothing (applied after affine-transform normalization, followed by 1–2 Transformer layers). Experiments show that final-layer smoothing yields superior performance.
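The three components above (the acceleration prior, the contrastive term over permutations, and the velocity-scale term) can be sketched in a few lines of PyTorch. This is a minimal illustration of the formulation, not the authors' code: the names `grw_smooth_loss` and `gaussian_log_prob` are mine, and the sketch enumerates all permutations, which is only practical for small \(T\).

```python
import itertools
import torch

def gaussian_log_prob(x):
    # Log-density of an i.i.d. standard normal over all elements of the
    # sequence, summed per sub-clip (constant term omitted).
    return -0.5 * (x ** 2).sum(dim=(-1, -2))

def grw_smooth_loss(z, alpha=0.5):
    """z: (B, T, D) normalized per-frame embeddings of one sub-clip."""
    B, T, D = z.shape
    # Velocities and accelerations of the correctly ordered clip.
    v = z[:, 1:] - z[:, :-1]                  # (B, T-1, D)
    a = v[:, 1:] - v[:, :-1]                  # (B, T-2, D)
    log_p_correct = gaussian_log_prob(a)
    # log p(A^c_pi) for every permutation of the T frames.
    log_ps = []
    for perm in itertools.permutations(range(T)):
        zp = z[:, list(perm)]
        vp = zp[:, 1:] - zp[:, :-1]
        ap = vp[:, 1:] - vp[:, :-1]
        log_ps.append(gaussian_log_prob(ap))
    log_ps = torch.stack(log_ps, dim=1)       # (B, T!)
    # Contrastive term: -log softmax-probability of the correct order.
    contrastive = -(log_p_correct - torch.logsumexp(log_ps, dim=1))
    # Velocity-scale term Omega(V^c) guards against embedding rescaling.
    omega = gaussian_log_prob(v)
    return (contrastive - alpha * omega).mean()
```

Because the acceleration log-density is simply a negative sum of squares, the contrastive term directly rewards the true ordering for having the lowest total acceleration among all permutations.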

Loss & Training

  • Balancing coefficient \(\lambda = 0.1\); scaling factor \(\alpha = 0.5\)
  • GRW window covers 0.5–1.0 seconds of video; \(T=5\) or \(T=6\)
  • All \((T-1)!\) permutations are enumerated when \(T \leq 7\); \(k=1000\) permutations are sampled when \(T > 7\)
  • Fine-tuned from existing weights for 14 epochs on K600; backbone learning rate \([10^{-4}, 10^{-6}]\), Transformer head \([10^{-3}, 10^{-5}]\)
  • Negligible computational overhead: wall-clock training time increases by approximately 2%
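With the regularizer in hand, the training step is a one-line change to the objective (\(\lambda = 0.1\) as reported above). A hedged sketch, assuming the model returns per-frame embeddings alongside its logits; that interface is an illustration, not the authors' API:

```python
import torch
import torch.nn.functional as F

def training_step(model, clip, labels, smooth_loss_fn, lam=0.1):
    # Assumed interface: the model exposes (logits, per-frame embeddings).
    logits, frame_embeddings = model(clip)
    loss_ce = F.cross_entropy(logits, labels)
    loss_smooth = smooth_loss_fn(frame_embeddings)
    # L = L_CE + lambda * L_smooth, per the overall objective.
    return loss_ce + lam * loss_smooth
```

Since the smooth loss touches only the embedding tensors already produced by the forward pass, this is consistent with the reported ~2% increase in training time.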

Key Experimental Results

Main Results — Kinetics-600 FLOP Constraints

Model              Top-1 (%)   GFLOPs   Gain
MoViNet-A0         72.3        2.7
MoViNet-A0-S-GRW   78.4        2.7      +6.1
MoViNet-A1         76.7        6.0
MoViNet-A1-S-GRW   81.9        6.0      +5.2
MoViNet-A2         78.6        10.3
MoViNet-A2-S-GRW   83.3        11.3     +4.7
MoViNet-A3         81.8        56.9
MoViNet-A3-GRW     85.6        56.4     +3.8
MViTv2-B-32×3      85.5        1030

MoViNet-A3-GRW achieves 85.6% accuracy at 56.4 GFLOPs, matching MViTv2-B-32×3 while requiring 18.3× fewer FLOPs.

Ablation Study — K600 Memory Constraints

Model               Top-1 (%)   Memory (MB)   Gain
MobileNetV3-S       61.3        29
MobileNetV3-S-GRW   67.3        30            +6.0
MoViNet-A0-S        72.0        53
MoViNet-A0-S-GRW    78.4        53            +6.4
MoViNet-A2-S        78.4        78
MoViNet-A2-S-GRW    83.3        78            +4.9

Key Findings

  • Consistent improvements of 3.8%–6.4% across all models, achieved without increasing FLOP or memory budgets
  • The benefit of smoothing is more pronounced for lighter models: MoViNet-A0-S gains 6.4%, while MoViNet-A3 gains 3.8%
  • Performance is robust to the choice of \(\lambda\) over a wide range, suggesting that gradients in the smoothness direction naturally align with those of the classification likelihood
  • The computational overhead of GRW is negligible, adding only approximately 2% to training time

Highlights & Insights

  • Mathematical Elegance: Modeling temporal smoothness as a Gaussian random walk over embedding velocities is both physically intuitive and probabilistically grounded; the use of a contrastive loss to prevent degenerate solutions is particularly elegant
  • High Practical Value: As a plug-and-play regularization term, it requires no architectural modifications and virtually no additional computation, yet yields substantial improvements for lightweight models
  • The warm-up experiment (airplane rotation) provides a compelling visualization that clearly demonstrates how smoothness helps models discover intrinsic low-dimensional structure

Limitations & Future Work

  • Validation is limited to lightweight models; the effect on large models remains unknown (the authors suggest that large models may not require this prior given their sufficient capacity)
  • Evaluation is restricted to action recognition; applicability to other video tasks (detection, segmentation) has yet to be verified
  • The temporal window size \(T\) still requires manual tuning
  • Full permutation enumeration grows computationally expensive for large \(T\); while sampling is feasible, it may degrade the quality of loss estimation

This work is closely related to slow feature analysis, frame-ordering self-supervised methods (Shuffle & Learn, Odd-One-Out), and efficient video models (MoViNets, X3D, TSM). The GRW-smoothing framework may inspire other tasks requiring temporal consistency (e.g., video generation, temporal forecasting) and could be extended to continuous signal processing in other modalities.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ Introducing random walk theory into video representation regularization is both novel and mathematically elegant
  • Experimental Thoroughness: ⭐⭐⭐⭐ Consistent validation across multiple models and datasets with thorough ablations, though limited to lightweight models
  • Writing Quality: ⭐⭐⭐⭐⭐ The warm-up experiment provides an excellent introduction; mathematical derivations are clear and figures are of high quality
  • Value: ⭐⭐⭐⭐⭐ Plug-and-play, negligible overhead, and consistent gains make this highly valuable for practical deployment