# AI-Generated Video Detection via Perceptual Straightening
Conference: NeurIPS 2025 · arXiv: 2507.00583 · Code: GitHub · Area: Model Compression · Keywords: AI-generated video detection, perceptual straightening, DINOv2, temporal curvature, representation geometry
## TL;DR
This paper proposes ReStraV, a method grounded in the perceptual straightening hypothesis—which posits that real videos form straighter trajectories in neural representation space—to detect AI-generated videos. Using temporal curvature and step-size statistics extracted from DINOv2 feature space, a lightweight classifier is trained to distinguish real from generated content, achieving 97.17% accuracy and 98.63% AUROC on VidProM with only ~48ms inference time.
## Background & Motivation
Background: AI video generation systems (Sora, Pika, VideoCrafter, etc.) are advancing rapidly, producing increasingly photorealistic content, making robust detection urgently necessary. Existing detection approaches include image-based methods (CNNSpot, UnivFD) and video-based methods (I3D, SlowFast, VideoSwin); the former disregard temporal information, while the latter require extensive training and generalize poorly.
Limitations of Prior Work: (a) Image-level detectors fail to capture temporal inconsistencies; (b) video-level detectors require large-scale generator-specific training and generalize poorly to unseen generators; (c) watermarking schemes depend on generator cooperation and can be circumvented.
Key Challenge: A generalizable detection method is needed that does not rely on generator-specific artifacts and can capture anomalies along the temporal dimension.
Goal:
- Does a fundamental geometric difference exist between real and AI-generated videos in neural representation space?
- Can such a difference support efficient and generalizable detection?
Key Insight: Motivated by the neuroscientific perceptual straightening hypothesis—which holds that the visual system straightens the temporal trajectories of natural videos to facilitate predictive coding—this work hypothesizes that pre-trained vision models (DINOv2) selectively straighten real videos but not AI-generated ones, yielding discriminative curvature differences.
Core Idea: Real videos trace straighter trajectories (lower curvature) in DINOv2 representation space, whereas AI-generated videos follow more curved trajectories—this geometric discrepancy constitutes a reliable detection signal.
## Method
### Overall Architecture
ReStraV operates in three stages: (1) uniformly sample 24 frames from the input video and extract per-frame features using a frozen DINOv2 ViT-S/14 by concatenating CLS and patch tokens into \(z_i \in \mathbb{R}^{75648}\); (2) compute inter-frame step sizes \(d_i = \|z_{i+1} - z_i\|\) and curvature angles \(\theta_i = \arccos(\frac{\Delta z_i \cdot \Delta z_{i+1}}{\|\Delta z_i\| \|\Delta z_{i+1}\|})\); (3) extract statistical descriptors (mean/min/max/var) and train a lightweight classifier (MLP/GB/RF) for binary discrimination.
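Stages (2) and (3) of the pipeline reduce to a few lines of linear algebra. A minimal NumPy sketch is below, using random vectors in place of real DINOv2 features (the toy dimension 64 stands in for the 75,648-dim CLS+patch concatenation; `extract_geometry` is an illustrative helper, not the authors' code):

```python
import numpy as np

def extract_geometry(z):
    """Given per-frame features z of shape (T, D), return step sizes
    d_i = ||z_{i+1} - z_i|| and curvature angles theta_i (in degrees)
    between successive displacement vectors."""
    dz = np.diff(z, axis=0)                     # (T-1, D) displacements
    d = np.linalg.norm(dz, axis=1)              # (T-1,) step sizes
    cos = np.sum(dz[:-1] * dz[1:], axis=1) / (d[:-1] * d[1:])
    theta = np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))  # (T-2,) angles
    return d, theta

# Toy stand-in: 24 sampled frames, small feature dim instead of 75,648.
rng = np.random.default_rng(0)
z = rng.standard_normal((24, 64))
d, theta = extract_geometry(z)
print(d.shape, theta.shape)  # 24 frames give 23 steps and 22 angles
```

The statistical descriptors of stage (3) are then simple reductions over `d` and `theta` (mean, min, max, variance).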
### Key Designs
- **Differential Effect of Perceptual Straightening**
    - Function: Identifies that DINOv2 straightens real and AI-generated videos to different degrees.
    - Mechanism: A systematic comparison across 14 visual encoders reveals that HVS-inspired models (Gabor, LGN-V1) straighten all videos equally (\(\Delta\theta < 0\), no discriminative power), whereas the self-supervised model DINOv2 selectively straightens real videos that conform to its training distribution while leaving AI-generated videos curved (\(\Delta\theta = 45.46°\), a strong discriminative signal). Crucially, absolute straightening ability is uncorrelated with detection performance (\(\rho = -0.13\), \(p = 0.64\)); differential straightening is the key factor.
    - Design Motivation: DINOv2 is self-supervised on large-scale real-world data and internalizes the statistical regularities of natural content. Trajectories of real videos that conform to this prior are straightened; those of AI-generated videos, which violate the prior, remain curved.
- **Temporal Curvature and Step Size as Detection Features**
    - Function: Quantifies the geometric properties of trajectories in representation space.
    - Mechanism: Curvature \(\theta_i = \arccos(\frac{\Delta z_i \cdot \Delta z_{i+1}}{d_i \cdot d_{i+1}})\) measures the directional change between successive displacements (path bending); step size \(d_i = \|\Delta z_i\|\) measures the magnitude of inter-frame change. Real videos exhibit low mean curvature (small \(\mu_\theta\)) and high curvature variance (large \(\sigma_\theta^2\)), whereas AI-generated videos exhibit high mean curvature and low variance, i.e., they bend uniformly. An 8-dimensional statistical feature vector is already sufficient for discrimination.
    - Design Motivation: These simple geometric quantities (angles and distances) are physically intuitive, interpretable, and computationally inexpensive.
- **Lightweight Classifier**
    - Function: Performs binary classification from a 21-dimensional feature vector using off-the-shelf classifiers.
    - Mechanism: The feature vector comprises 7 step-size values + 6 curvature values + 8 statistical descriptors = 21 dimensions. Six classifiers (LR/GNB/RF/GB/SVM/MLP) are evaluated; the MLP (64→32 hidden units) achieves the best performance (97.17% accuracy). No pixel-level processing or DINOv2 fine-tuning is required.
    - Design Motivation: The lightweight classifier ensures transparency and interpretability while enabling extremely fast inference: DINOv2 forward pass 43.6 ms + classification <5 ms ≈ 48 ms end-to-end.
Loss & Training¶
- DINOv2 is fully frozen (pre-trained weights used as-is).
- The classifier is trained with standard cross-entropy loss.
- No data augmentation or feature engineering is applied; only 3-fold grid search is performed.
Key Experimental Results¶
Main Results: VidProM Benchmark¶
| Method | Type | Accuracy↑ | AUROC↑ |
|---|---|---|---|
| CNNSpot (image) | Supervised | 52.66 | 55.47 |
| UnivFD (image) | Supervised | 68.71 | 66.11 |
| I3D (video) | Supervised | 91.76 | 95.18 |
| VideoSwin (video) | Supervised | 94.47 | 97.95 |
| ReStraV-MLP (Ours) | Lightweight | 97.17 | 98.63 |
Cross-Benchmark Generalization¶
| Benchmark | ReStraV Accuracy | Best Baseline |
|---|---|---|
| VidProM | 97.17% | 94.47% (VideoSwin) |
| GenVidBench | SOTA | — |
| Physics-IQ | SOTA | — |
Ablation Study: Encoder Selection¶
| Encoder | \(\Delta\theta\) (Curvature Gap) | Detection Ability |
|---|---|---|
| DINOv2 ViT-S/14 | +45.46° | Best |
| CLIP | +25° | Second |
| SimCLR | +15° | Moderate |
| Gabor (HVS) | −5° | Ineffective |
Key Findings¶
- Differential straightening by DINOv2 is the key: Absolute straightening ability does not equal detection ability; differential straightening—straighter for real videos, not straightened for AI videos—is the detection signal.
- "Uniform bending" signature of AI-generated videos: AI-generated videos exhibit high mean curvature but low variance, indicating that temporal inconsistencies introduced by generative models are systematic rather than random.
- Pixel space provides no discriminative power: In raw pixel space, curvature and step-size distributions of real and AI-generated videos overlap heavily; the difference is only visible in learned representation space.
- Extreme efficiency: ~48ms end-to-end inference, orders of magnitude faster than VideoSwin and comparable methods.
- Cross-generator generalization: The method is effective across diverse generators including Sora, Pika, and VideoCrafter.
Highlights & Insights¶
- Cross-domain inspiration from neuroscience to AI safety is particularly elegant: the perceptual straightening hypothesis was originally developed to explain biological visual systems, and this work ingeniously repurposes it for AI detection—real videos are straightened even within a "digital neural system."
- A 21-dimensional feature vector outperforms deep video models, demonstrating that correct feature design matters far more than model complexity.
- The discovery of differential vs. absolute straightening constitutes the core contribution: HVS models straighten everything equally (no discriminative power), whereas SSL models straighten selectively (only within-distribution content is straightened, yielding discriminative power).
- Strong interpretability: Curvature and step size carry clear physical meaning—bending corresponds to temporal inconsistency, and large step sizes correspond to abrupt transitions.
Limitations & Future Work¶
- Dependence on DINOv2's specific training distribution: If a generative model's training data substantially overlaps with DINOv2's, the differential straightening effect may diminish.
- Only short videos (2–5 seconds) are evaluated: Effectiveness on longer videos remains unverified.
- Adversarial robustness: An adversary aware of ReStraV's operating principle could potentially post-process AI-generated videos to artificially straighten their trajectories.
- Fixed frame sampling strategy: Uniform sampling of 24 frames may not be optimal across all scenarios.
- Potential failure on static scenes: When video content changes minimally, the curvature signal may be insufficiently strong.
Related Work & Insights¶
- vs. CNNSpot / UnivFD (image detectors): Frame-by-frame detection ignores the temporal dimension, yielding performance far below ReStraV.
- vs. I3D / VideoSwin (video detectors): End-to-end video models require extensive training and heavy computation; ReStraV surpasses them using frozen features and a lightweight classifier.
- vs. watermarking: Watermarking requires generator cooperation; ReStraV does not—it operates as a purely post-hoc detector.
- Transfer implications: Geometric analysis of representation space may generalize to detection of other forms of generated content (AI images, AI audio, AI text).
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ The cross-domain leap from perceptual straightening to AI detection is highly original; the discovery of differential straightening is a genuinely novel insight.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Comparison across 14 encoders, 50K training/test samples, multiple benchmarks, and comprehensive ablations.
- Writing Quality: ⭐⭐⭐⭐⭐ The narrative arc from hypothesis to validation to method to experiments is exceptionally coherent, with high-quality visualizations.
- Value: ⭐⭐⭐⭐⭐ 48ms inference, 97% accuracy, and cross-generator generalization make this directly deployable for content authentication.