# GenVidBench: A 6-Million-Video Benchmark for AI-Generated Video Detection
Conference: AAAI 2026
arXiv: 2501.11340
Code: Project Page
Area: Video Generation
Keywords: AI-generated video detection, benchmark dataset, cross-source cross-generator, video forensics, deepfake detection
## TL;DR
This paper introduces GenVidBench, a large-scale dataset for AI-generated video detection comprising 6.78 million videos. It is the first such benchmark with both cross-source and cross-generator train/test splits, covers 11 state-of-the-art video generators, and provides rich semantic annotations.
## Background & Motivation
### State of the Field
Video generation models (e.g., Sora, Kling) are advancing rapidly, with generated video quality approaching photorealism, making the boundary between real and fake videos increasingly blurred. This raises serious concerns regarding misinformation propagation, reputational harm, and cybersecurity threats, underscoring the urgent need for effective AI-generated video detectors.
### Limitations of Prior Work
Development of high-performance detectors is constrained by the lack of large-scale, high-quality dedicated datasets. Existing datasets suffer from the following issues:
| Dataset | Scale | Prompt/Image | Video Pairs | Semantic Labels | Cross-Source |
|---|---|---|---|---|---|
| GVD | 11k | ✗ | ✗ | ✗ | ✗ |
| GVF | 2.8k | ✓ | ✓ | ✓ | ✗ |
| GenVideo | 2.27M | ✗ | ✗ | ✗ | ✗ |
| GenVidDet | 2.66M | ✗ | ✗ | ✗ | ✗ |
| GenVidBench | 6.78M | ✓ | ✓ | ✓ | ✓ |
- GVD, GenVideo, and GenVidDet lack original prompts, video pairs, and semantic labels, making it impossible to rule out content overlap between training and test sets.
- GVF provides prompts and semantic labels but contains only 2.8k samples and lacks a cross-source setting.
- Existing datasets use the same generators for both training and testing, resulting in low detection difficulty that does not reflect real-world scenarios.
## Core Idea
Construct a large-scale dataset with cross-source and cross-generator properties: training and test sets use different video generators and different content sources, compelling detectors to learn intrinsic features of generated videos rather than relying on generator-specific or content-specific biases.
## Method
### Overall Architecture
GenVidBench is designed around three core principles: large scale, cross-source and cross-generator design, and coverage of state-of-the-art generators.
### Key Designs
#### 1. Dataset Composition and Pairing Strategy
The dataset is organized into two groups of paired videos: Video Pair 1 for training and Video Pair 2 for testing. Videos within each pair share the same text prompt or source image (a minimal metadata sketch follows the two tables):
Training Set (Video Pair 1):
| Source | Type | Task | Resolution | Count |
|---|---|---|---|---|
| Vript | Real | - | - | 417,566 |
| Pika | Fake | T2V & I2V | 1088×560 | 1,670,465 |
| VideoCrafter2 | Fake | T2V & I2V | 512×320 | 1,672,242 |
| ModelScope | Fake | T2V | 256×256 | 1,672,242 |
| T2V-Zero | Fake | T2V | 512×512 | 1,268,595 |
Test Set (Video Pair 2):
| Source | Type | Task | Resolution | Count |
|---|---|---|---|---|
| HD-VG-130M | Real | - | 1280×720 | 13,853 |
| MuseV | Fake | I2V | 1210×576 | 13,853 |
| SVD | Fake | I2V | 1024×576 | 13,853 |
| Mora | Fake | T2V | 1024×576 | 13,853 |
| CogVideo | Fake | T2V | 480×480 | 13,853 |
| Sora | Fake | T2V | 1920×1080 | 51 |
| Kling | Fake | T2V & I2V | - | 264 |
- Design Motivation: Training and test sets use entirely different generators and content sources to prevent models from distinguishing real from fake based on content or video quality alone.
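To make the pairing concrete, here is a minimal sketch of how such paired metadata could be represented. The field names and grouping logic are illustrative assumptions, not the dataset's actual annotation schema.

```python
from dataclasses import dataclass

# Hypothetical record schema for GenVidBench-style metadata; the field names
# are illustrative assumptions, not the dataset's actual annotation format.
@dataclass
class VideoRecord:
    path: str        # path to the video file
    source: str      # content source, e.g. "Vript" or "HD-VG-130M"
    generator: str   # "real" for source videos, else e.g. "Pika" or "SVD"
    task: str        # "T2V", "I2V", or "-" for real videos
    prompt_id: str   # shared by all videos derived from the same prompt/image
    label: int       # 0 = real, 1 = AI-generated

# Videos sharing a prompt_id form one pair group: the real clip plus the fake
# clips each generator produced from the same prompt or source image.
def group_by_prompt(records: list[VideoRecord]) -> dict[str, list[VideoRecord]]:
    groups: dict[str, list[VideoRecord]] = {}
    for r in records:
        groups.setdefault(r.prompt_id, []).append(r)
    return groups
```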
#### 2. Cross-Source and Cross-Generator Task Design
- Same-generator detection: Training and testing on the same generator subset yields accuracies above 97.4%, indicating that this setting is close to trivial.
- Cross-generator detection: Performance drops dramatically when the generator changes (e.g., training on Pika and testing on SVD yields only 54.66%), revealing the challenges of real-world scenarios.
- Same-source vs. cross-source: Videos generated from the same content source are easier to classify (Pair 1 average 61.81% vs. cross-source 56.71%), confirming that detectors lean heavily on content-source cues (see the protocol sketch below).
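The evaluation protocol implied by these numbers is a full train/test grid over disjoint generator sets. A minimal sketch of that grid, with generator names taken from the tables above:

```python
TRAIN_GENERATORS = ["Pika", "VideoCrafter2", "ModelScope", "T2V-Zero"]   # Pair 1
TEST_GENERATORS = ["MuseV", "SVD", "Mora", "CogVideo", "Sora", "Kling"]  # Pair 2

# Each (train, test) cell pairs a different generator with a different content
# source, so a detector cannot score well by memorizing one generator's
# fingerprint or one source's content distribution.
for train_gen in TRAIN_GENERATORS:
    for test_gen in TEST_GENERATORS:
        # e.g. training on Pika and testing on SVD yields only 54.66% Top-1
        print(f"train on {train_gen:13} -> evaluate on {test_gen}")
```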
#### 3. Semantic Content Annotation
Videos are semantically categorized along three dimensions:
- Object category: person, animal, architecture, nature, plant, cartoon, food, game, vehicle, other
- Action: reflects temporal properties (static poses, display/exhibition, etc.)
- Scene: indicates scene complexity (e.g., natural landscape vs. traffic scene)
Topics are extracted using an LLM and aggregated into an abstract classification tree (≤10 classes per dimension), providing a foundation for scene-specific analysis.
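As an illustration of the aggregation step, the sketch below collapses free-form LLM topics onto a small fixed class set for the object dimension; the keyword tree is a made-up stand-in for the paper's actual classification tree.

```python
# Illustrative keyword tree for the object dimension; the paper's actual
# LLM-derived classification tree may differ.
OBJECT_TREE = {
    "person": ["man", "woman", "child", "crowd"],
    "animal": ["dog", "cat", "bird", "horse"],
    "vehicle": ["car", "train", "airplane", "boat"],
    "nature": ["mountain", "ocean", "forest", "sky"],
}

def to_object_class(topic: str) -> str:
    """Collapse a free-form topic string into one of <=10 object classes."""
    topic = topic.lower()
    for cls, keywords in OBJECT_TREE.items():
        if any(k in topic for k in keywords):
            return cls
    return "other"  # the fallback class keeps the dimension at <=10 labels

print(to_object_class("a woman walking her dog on the beach"))  # -> "person"
```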
#### 4. Lightweight Version: GenVidBench-143k
A carefully sampled subset of 143,400 videos is drawn from the 6.78 million total, preserving representativeness and diversity while substantially reducing computational requirements.
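One plausible way to draw such a subset is per-stratum sampling over (source, generator) groups. This is a sketch under that assumption, reusing the hypothetical `VideoRecord` fields from above; it is not the authors' actual sampling procedure.

```python
import random

def stratified_subset(records, key, total: int = 143_400, seed: int = 0):
    """Sample roughly `total` records, split evenly across strata defined by `key`."""
    rng = random.Random(seed)
    strata: dict = {}
    for r in records:
        strata.setdefault(key(r), []).append(r)
    quota = total // len(strata)
    subset = []
    for items in strata.values():
        # small strata (e.g. Sora's 51 videos) are kept in full
        subset.extend(rng.sample(items, min(quota, len(items))))
    return subset

# e.g. subset = stratified_subset(all_records, key=lambda r: (r.source, r.generator))
```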
### Loss & Training
This paper is primarily a dataset contribution. Standard video classification models are used for benchmarking, following MMAction2 default configurations with 8-frame sampling, 224×224 resolution, and batch size 8.
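For reference, here is a hand-rolled approximation of that input pipeline (uniform 8-frame sampling, bilinear resize to 224×224); the benchmark itself relies on MMAction2's default configs rather than code like this.

```python
import torch
import torch.nn.functional as F

def sample_clip(video: torch.Tensor, num_frames: int = 8, size: int = 224) -> torch.Tensor:
    """video: (T, C, H, W) uint8 -> (num_frames, C, size, size) float in [0, 1]."""
    t = video.shape[0]
    idx = torch.linspace(0, t - 1, num_frames).long()  # uniform temporal indices
    clip = video[idx].float() / 255.0                  # frames act as the batch dim
    return F.interpolate(clip, size=(size, size), mode="bilinear", align_corners=False)

clip = sample_clip(torch.randint(0, 256, (64, 3, 320, 512), dtype=torch.uint8))
print(clip.shape)  # torch.Size([8, 3, 224, 224])
```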
## Key Experimental Results
### Main Results (Cross-Source, Cross-Generator Detection)
| Method | Type | MuseV | SVD | CogV | Mora | Sora | Kling | HD | Top-1 |
|---|---|---|---|---|---|---|---|---|---|
| I3D | CNN | 32.72 | 12.04 | 76.44 | 72.30 | 41.18 | 46.22 | 95.04 | 60.21 |
| SlowFast | CNN | 87.14 | 29.80 | 93.07 | 55.23 | 23.53 | 58.33 | 96.61 | 70.06 |
| TSM | CNN | 95.94 | 73.16 | 36.44 | 91.72 | 33.34 | 71.60 | 96.30 | 73.88 |
| VideoSwin | Trans. | 90.24 | 27.72 | 91.64 | 88.14 | 19.60 | 50.76 | 99.10 | 80.39 |
| MViTv2-S | Trans. | 77.08 | 44.89 | 99.91 | 76.77 | 61.36 | 31.37 | 86.98 | 80.45 |
| DeMamba | Mamba | 85.04 | 48.81 | 98.66 | 90.23 | 1.96 | 33.71 | 99.86 | 85.47 |
DeMamba achieves the highest Top-1 accuracy of 85.47%, yet the detection rate for Sora-generated videos is only 1.96%, rendering them nearly undetectable.
### Cross-Dataset Comparison
| Dataset | SlowFast | I3D | F3Net |
|---|---|---|---|
| NeuralTextures | 82.55 | - | - |
| GVF | - | 61.88 | - |
| GenVideo | - | - | 51.83 |
| GenVidBench | 70.06 | 60.21 | 42.52 |
All detectors achieve significantly lower performance on GenVidBench compared to other datasets, confirming its greater difficulty.
### Key Findings
- Cross-generator detection is extremely challenging: Same-generator train/test accuracy exceeds 97%, whereas cross-generator accuracy can drop to ~50%.
- Substantial quality variation across generators: Sora-generated videos are nearly undetectable (1.96%), while CogVideo is the easiest to detect due to poor temporal continuity.
- SVD-generated videos are most difficult to distinguish: Hard-case analysis reveals that SVD produces the most severe blurring artifacts.
- Transformer and Mamba architectures outperform CNNs: VideoSwin, MViTv2-S, and DeMamba all achieve higher Top-1 accuracy than the CNN baselines.
- Semantic category affects detection difficulty: cartoons are the easiest to detect (mean score 0.209) and vehicles the hardest (0.308); lower scores indicate easier detection.
### Scene-Specific Analysis (Plants Category)
| Method | MuseV | SVD | CogV | Mora | HD | Mean |
|---|---|---|---|---|---|---|
| TimeSformer | 77.96 | 29.80 | 96.30 | 93.44 | 87.14 | 75.09 |
| VideoSwin | 57.96 | 7.35 | 92.59 | 47.88 | 98.76 | 52.86 |
Different models exhibit substantial performance variation on specific semantic categories.
## Highlights & Insights
- The cross-source design is the most significant contribution: training and test sets draw on disjoint content sources and disjoint generators, which eliminates content bias and forces detectors to learn generation artifacts rather than memorizing sources.
- Unprecedented scale: With 6.78 million videos, GenVidBench is 2.5× larger than the previous largest dataset, GenVidDet (2.66M).
- Practical value of semantic annotations: researchers can extract targeted subsets by semantic label (e.g., a particular object category or action type) for focused investigation.
- The 143k lightweight version lowers the barrier to entry: It substantially reduces computational demands and accelerates model iteration.
## Limitations & Future Work
- Small test set scale: Each test generator contains only ~14k videos, which is highly asymmetric compared to the millions of training samples.
- Lower generation quality in the training set: Pika, ModelScope, and T2V-Zero exhibit noticeably lower resolution and quality than MuseV, SVD, and Mora in the test set.
- Limited coverage of the latest generators: Sora contributes only 51 samples and Kling only 264.
- Lack of in-depth analysis of temporal artifacts: Frame-level temporal consistency and other temporal forensic cues are not explored.
- Semantic annotation relies on LLMs: Classification quality is bounded by LLM capability.
## Related Work & Insights
- The dataset design philosophy is consistent with GenImage for image generation detection, extended to the video domain with an added temporal dimension.
- The cross-source design principle is generalizable to other generated content detection tasks (audio, 3D models, etc.).
- DeMamba's superior performance suggests the potential of state-space model (SSM) architectures for video forensics.
## Rating
- Novelty: ⭐⭐⭐ — Primarily a dataset contribution; the cross-source design is a highlight, but technical innovation is limited.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Multi-model benchmarking, cross-validation, hard-case analysis, and scene-specific analysis are comprehensive.
- Writing Quality: ⭐⭐⭐⭐ — Clear structure with abundant tables.
- Value: ⭐⭐⭐⭐⭐ — Fills a critical gap in large-scale, high-quality generated video detection datasets, offering long-term value to the community.