# AI-Generated Video Detection via Perceptual Straightening
Conference: NeurIPS 2025 · arXiv: 2507.00583 · Code: GitHub · Area: Model Compression · Keywords: AI-generated video detection, perceptual straightening, DINOv2, temporal curvature, representation geometry
## TL;DR
This paper proposes ReStraV, a method grounded in the perceptual straightening hypothesis—which posits that real videos form straighter trajectories in neural representation space—to detect AI-generated videos. Using temporal curvature and step-size statistics extracted from DINOv2 feature space, a lightweight classifier is trained to distinguish real from generated content, achieving 97.17% accuracy and 98.63% AUROC on VidProM with only ~48ms inference time.
## Background & Motivation
Background: AI video generation systems (Sora, Pika, VideoCrafter, etc.) are advancing rapidly, producing increasingly photorealistic content, making robust detection urgently necessary. Existing detection approaches include image-based methods (CNNSpot, UnivFD) and video-based methods (I3D, SlowFast, VideoSwin); the former disregard temporal information, while the latter require extensive training and generalize poorly.
Limitations of Prior Work: (a) Image-level detectors fail to capture temporal inconsistencies; (b) video-level detectors require large-scale generator-specific training and generalize poorly to unseen generators; (c) watermarking schemes depend on generator cooperation and can be circumvented.
Key Challenge: A generalizable detection method is needed that does not rely on generator-specific artifacts and can capture anomalies along the temporal dimension.
Goal:
- Does a fundamental geometric difference exist between real and AI-generated videos in neural representation space?
- Can such a difference support efficient and generalizable detection?
Key Insight: Motivated by the neuroscientific perceptual straightening hypothesis—which holds that the visual system straightens the temporal trajectories of natural videos to facilitate predictive coding—this work hypothesizes that pre-trained vision models (DINOv2) selectively straighten real videos but not AI-generated ones, yielding discriminative curvature differences.
Core Idea: Real videos trace straighter trajectories (lower curvature) in DINOv2 representation space, whereas AI-generated videos follow more curved trajectories—this geometric discrepancy constitutes a reliable detection signal.
## Method
### Overall Architecture
ReStraV operates in three stages: (1) uniformly sample 24 frames from the input video and extract per-frame features using a frozen DINOv2 ViT-S/14 by concatenating CLS and patch tokens into \(z_i \in \mathbb{R}^{75648}\); (2) compute inter-frame step sizes \(d_i = \|z_{i+1} - z_i\|\) and curvature angles \(\theta_i = \arccos(\frac{\Delta z_i \cdot \Delta z_{i+1}}{\|\Delta z_i\| \|\Delta z_{i+1}\|})\); (3) extract statistical descriptors (mean/min/max/var) and train a lightweight classifier (MLP/GB/RF) for binary discrimination.
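Stages (2) and (3) of the pipeline reduce to a few lines of linear algebra. A minimal NumPy sketch is below, using random vectors in place of real DINOv2 features (the toy dimension 64 stands in for the 75,648-dim CLS+patch concatenation; `extract_geometry` is an illustrative helper, not the authors' code):

```python
import numpy as np

def extract_geometry(z):
    """Given per-frame features z of shape (T, D), return step sizes
    d_i = ||z_{i+1} - z_i|| and curvature angles theta_i (in degrees)
    between successive displacement vectors."""
    dz = np.diff(z, axis=0)                     # (T-1, D) displacements
    d = np.linalg.norm(dz, axis=1)              # (T-1,) step sizes
    cos = np.sum(dz[:-1] * dz[1:], axis=1) / (d[:-1] * d[1:])
    theta = np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))  # (T-2,) angles
    return d, theta

# Toy stand-in: 24 sampled frames, small feature dim instead of 75,648.
rng = np.random.default_rng(0)
z = rng.standard_normal((24, 64))
d, theta = extract_geometry(z)
print(d.shape, theta.shape)  # 24 frames give 23 steps and 22 angles
```

The statistical descriptors of stage (3) are then simple reductions over `d` and `theta` (mean, min, max, variance).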
### Key Designs
- **Differential Effect of Perceptual Straightening**
    - Function: Identifies that DINOv2 straightens real and AI-generated videos to different degrees.
    - Mechanism: A systematic comparison across 14 visual encoders reveals that HVS-inspired models (Gabor, LGN-V1) straighten all videos equally (\(\Delta\theta < 0\), no discriminative power), whereas the self-supervised model DINOv2 selectively straightens real videos that conform to its training distribution while leaving AI-generated videos curved (\(\Delta\theta = 45.46°\), a strong discriminative signal). Crucially, absolute straightening ability is uncorrelated with detection performance (\(\rho = -0.13\), \(p = 0.64\)); differential straightening is the key factor.
    - Design Motivation: DINOv2 is self-supervised on large-scale real-world data and internalizes the statistical regularities of natural content. Trajectories of real videos that conform to this prior are straightened; those of AI-generated videos, which violate the prior, remain curved.
- **Temporal Curvature and Step Size as Detection Features**
    - Function: Quantifies the geometric properties of trajectories in representation space.
    - Mechanism: Curvature \(\theta_i = \arccos(\frac{\Delta z_i \cdot \Delta z_{i+1}}{d_i \cdot d_{i+1}})\) measures the directional change between successive displacements (path bending); step size \(d_i = \|\Delta z_i\|\) measures the magnitude of inter-frame change. Real videos exhibit low mean curvature (small \(\mu_\theta\)) and high curvature variance (large \(\sigma_\theta^2\)), whereas AI-generated videos exhibit high mean curvature and low variance, i.e., they bend uniformly. An 8-dimensional statistical feature vector is already sufficient for discrimination.
    - Design Motivation: These simple geometric quantities (angles and distances) are physically intuitive, interpretable, and computationally inexpensive.
- **Lightweight Classifier**
    - Function: Performs binary classification from a 21-dimensional feature vector using off-the-shelf classifiers.
    - Mechanism: The feature vector comprises 7 step-size values + 6 curvature values + 8 statistical descriptors = 21 dimensions. Six classifiers (LR/GNB/RF/GB/SVM/MLP) are evaluated; the MLP (64→32 hidden units) achieves the best performance (97.17% accuracy). No pixel-level processing or DINOv2 fine-tuning is required.
    - Design Motivation: The lightweight classifier ensures transparency and interpretability while enabling extremely fast inference: DINOv2 forward pass 43.6 ms + classification <5 ms ≈ 48 ms end-to-end.
Loss & Training¶
- DINOv2 is fully frozen (pre-trained weights used as-is).
- The classifier is trained with standard cross-entropy loss.
- No data augmentation or feature engineering is applied; only 3-fold grid search is performed.
Key Experimental Results¶
Main Results: VidProM Benchmark¶
| Method | Type | Accuracy↑ | AUROC↑ |
|---|---|---|---|
| CNNSpot (image) | Supervised | 52.66 | 55.47 |
| UnivFD (image) | Supervised | 68.71 | 66.11 |
| I3D (video) | Supervised | 91.76 | 95.18 |
| VideoSwin (video) | Supervised | 94.47 | 97.95 |
| ReStraV-MLP (Ours) | Lightweight | 97.17 | 98.63 |
Cross-Benchmark Generalization¶
| Benchmark | ReStraV Accuracy | Best Baseline |
|---|---|---|
| VidProM | 97.17% | 94.47% (VideoSwin) |
| GenVidBench | SOTA | — |
| Physics-IQ | SOTA | — |
Ablation Study: Encoder Selection¶
| Encoder | \(\Delta\theta\) (Curvature Gap) | Detection Ability |
|---|---|---|
| DINOv2 ViT-S/14 | +45.46° | Best |
| CLIP | +25° | Second |
| SimCLR | +15° | Moderate |
| Gabor (HVS) | −5° | Ineffective |
Key Findings¶
- Differential straightening by DINOv2 is the key: Absolute straightening ability does not equal detection ability; differential straightening—straighter for real videos, not straightened for AI videos—is the detection signal.
- "Uniform bending" signature of AI-generated videos: AI-generated videos exhibit high mean curvature but low variance, indicating that temporal inconsistencies introduced by generative models are systematic rather than random.
- Pixel space provides no discriminative power: In raw pixel space, curvature and step-size distributions of real and AI-generated videos overlap heavily; the difference is only visible in learned representation space.
- Extreme efficiency: ~48ms end-to-end inference, orders of magnitude faster than VideoSwin and comparable methods.
- Cross-generator generalization: The method is effective across diverse generators including Sora, Pika, and VideoCrafter.
Highlights & Insights¶
- Cross-domain inspiration from neuroscience to AI safety is particularly elegant: the perceptual straightening hypothesis was originally developed to explain biological visual systems, and this work ingeniously repurposes it for AI detection—real videos are straightened even within a "digital neural system."
- A 21-dimensional feature vector outperforms deep video models, demonstrating that correct feature design matters far more than model complexity.
- The discovery of differential vs. absolute straightening constitutes the core contribution: HVS models straighten everything equally (no discriminative power), whereas SSL models straighten selectively (only within-distribution content is straightened, yielding discriminative power).
- Strong interpretability: Curvature and step size carry clear physical meaning—bending corresponds to temporal inconsistency, and large step sizes correspond to abrupt transitions.
Limitations & Future Work¶
- Dependence on DINOv2's specific training distribution: If a generative model's training data substantially overlaps with DINOv2's, the differential straightening effect may diminish.
- Only short videos (2–5 seconds) are evaluated: Effectiveness on longer videos remains unverified.
- Adversarial robustness: An adversary aware of ReStraV's operating principle could potentially post-process AI-generated videos to artificially straighten their trajectories.
- Fixed frame sampling strategy: Uniform sampling of 24 frames may not be optimal across all scenarios.
- Potential failure on static scenes: When video content changes minimally, the curvature signal may be insufficiently strong.
Related Work & Insights¶
- vs. CNNSpot / UnivFD (image detectors): Frame-by-frame detection ignores the temporal dimension, yielding performance far below ReStraV.
- vs. I3D / VideoSwin (video detectors): End-to-end video models require extensive training and heavy computation; ReStraV surpasses them using frozen features and a lightweight classifier.
- vs. watermarking: Watermarking requires generator cooperation; ReStraV does not—it operates as a purely post-hoc detector.
- Transfer implications: Geometric analysis of representation space may generalize to detection of other forms of generated content (AI images, AI audio, AI text).
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ The cross-domain leap from perceptual straightening to AI detection is highly original; the discovery of differential straightening is a genuinely novel insight.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Comparison across 14 encoders, 50K training/test samples, multiple benchmarks, and comprehensive ablations.
- Writing Quality: ⭐⭐⭐⭐⭐ The narrative arc from hypothesis to validation to method to experiments is exceptionally coherent, with high-quality visualizations.
- Value: ⭐⭐⭐⭐⭐ 48ms inference, 97% accuracy, and cross-generator generalization make this directly deployable for content authentication.