SWIFT: Sliding Window Reconstruction for Few-Shot Training-Free Generated Video Attribution

Conference: CVPR 2026
arXiv: 2603.08536
Code: GitHub
Area: Video Generation
Keywords: Generated video attribution, 3D VAE, sliding window reconstruction, training-free, temporal consistency

TL;DR

SWIFT introduces the task of few-shot, training-free generated video attribution. It exploits the temporal mapping property of 3D VAEs, in which \(K\) pixel frames correspond to a single latent frame, by reconstructing two fixed-length sliding windows: one aligned with the VAE's chunk boundaries (normal) and one shifted off them (corrupted). The ratio of the two reconstruction losses over the overlapping frames serves as the attribution signal. With only 20 samples for threshold estimation, SWIFT exceeds 90% average attribution accuracy, rising to a 5-model average of 94.0% with 200 samples.

Background & Motivation

Background: Video generation technologies (HunyuanVideo, Wan2.1/2.2, EasyAnimate, etc.) are advancing rapidly, all adopting 3D VAE + DiT architectures. Generated videos risk being misused for disinformation or intellectual property infringement.
Limitations of Prior Work: Existing attribution methods fall into two categories: (1) active watermarking requires embedding operations that may degrade video quality; (2) training-based passive attribution requires large amounts of training data and must be retrained for each new model. Image attribution methods (RONAN/LatentTracer/AEDR) suffer significant accuracy drops when transferred to video.
Key Challenge: Image attribution methods focus solely on spatial consistency, ignoring the temporal consistency constraints inherent in video data, making them ineffective against sequence-correlated perturbations.
Goal: How to achieve reliable generated video attribution without training and with only a small number of samples, by leveraging the temporal characteristics of video?
Key Insight: State-of-the-art video generation models employ 3D VAEs that perform temporal up/downsampling (typically with compression ratios of 4 or 8), naturally forming a "\(K\) pixel frames \(\leftrightarrow\) 1 latent frame" temporal mapping. Videos generated by a target model satisfy that model's VAE distribution when chunk-aligned, whereas non-attributed videos do not.
Core Idea: By using a sliding window to break temporal alignment — thus "corrupting" the reconstruction — videos belonging to the target model exhibit a significant loss discrepancy between normal and corrupted reconstructions, while non-attributed videos do not.

Method

Overall Architecture

Given a test video and the 3D VAE of a target model, SWIFT proceeds in three steps: (1) define a fixed-length sliding window; (2) perform normal and corrupted reconstructions separately and compute the loss ratio over overlapping frames as the attribution signal; (3) use KDE to determine a threshold and make the attribution decision. The entire process requires only white-box access to the target model's VAE encoder-decoder, with no model training required.

Key Designs

Fixed-Length Sliding Window:
- Function: Defines two windows for contrastive reconstruction — one maintaining temporal alignment (normal) and one breaking it (corrupted).
- Mechanism: Given a video with \(KN\) frames (\(K\) = temporal compression ratio, \(N\) = number of chunks), every window \(W_j\) has a fixed length of \(K(N-1)\) frames and starts \(j\) frames after the first frame. The normal window \(W_0\) starts at frame 1 and covers frames \(1\) through \(K(N-1)\); each chunk within it satisfies the VAE's temporal mapping in both frame composition and position. The corrupted window \(W_{K-1}\) is shifted forward by \(K-1\) frames and covers frames \(K\) through \(KN-1\), placing every frame into the wrong chunk position and maximally disrupting temporal consistency. In general, a window \(W_j\) preserves alignment (normal) when \(j \bmod K = 0\) and breaks it (corrupted) when \(j \bmod K \neq 0\).
- Design Motivation: \(W_0\) and \(W_{K-1}\) are selected because a \(K-1\) offset simultaneously alters both the frame composition within each chunk and the frame-to-latent position mapping, achieving maximal disruption. For VAEs with denoising steps in the decoder (e.g., LTX), the maximally divergent window pair is identified quantitatively.
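The window bookkeeping described above can be sketched in a few lines of Python. This is a minimal illustration, assuming 1-based frame indices and the \(KN\)-frame layout from the mechanism; the function names `window_frames`, `overlap_frames`, and `chunk_position` are hypothetical and are not taken from the SWIFT code.

```python
def window_frames(K, N, offset):
    """Return the 1-based frame indices covered by window W_offset.

    Assumes a video of K*N frames and a fixed window length of K*(N-1).
    """
    length = K * (N - 1)
    start = 1 + offset
    end = start + length - 1
    if offset < 0 or end > K * N:
        raise ValueError("window does not fit inside the video")
    return list(range(start, end + 1))


def overlap_frames(K, N):
    """Frames shared by the normal window W_0 and the corrupted window W_{K-1}."""
    normal = window_frames(K, N, 0)
    corrupted = window_frames(K, N, K - 1)
    return sorted(set(normal) & set(corrupted))


def chunk_position(frame, offset, K):
    """0-based position of a frame inside its K-frame chunk within W_offset."""
    return (frame - 1 - offset) % K


if __name__ == "__main__":
    K, N = 4, 3  # toy values: 12-frame video, compression ratio 4
    print(window_frames(K, N, 0))      # frames 1..8
    print(window_frames(K, N, K - 1))  # frames 4..11
    print(overlap_frames(K, N))        # frames 4..8
```

For \(K=4\), \(N=3\) the overlap is frames 4 through 8, i.e. \(K(N-1)-K+1 = 5\) frames, and each overlapping frame occupies a different chunk position in the two windows.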
Normal vs. Corrupted Differential Reconstruction:
- Function: Generates the attribution signal via the loss ratio between two reconstructions.
- Mechanism: \(W_0\) is reconstructed to yield \(W_0^* = \mathcal{R}(W_0)\), and \(W_{K-1}\) is reconstructed to yield \(W_{K-1}^{**} = \mathcal{R}(W_{K-1})\), where \(\mathcal{R}\) denotes encoding followed by decoding with the target VAE. Let \(F_i\) be the \(i\)-th frame of the video, with \(F_i^*\) and \(F_i^{**}\) its reconstructions in \(W_0^*\) and \(W_{K-1}^{**}\). The attribution signal \(t\) is the mean loss ratio over the \(K(N-1)-K+1\) frames shared by both windows (frames \(K\) through \(K(N-1)\)): \(t = \frac{1}{K(N-1)-K+1} \sum_{i=K}^{K(N-1)} \frac{\mathcal{L}(F_i^*, F_i)}{\mathcal{L}(F_i^{**}, F_i)}\), where the loss \(\mathcal{L}\) is MSE. For videos generated by the target model, normal reconstruction loss is small and corrupted reconstruction loss is large, yielding \(t \ll 1\); for all other videos, the two losses are comparable, yielding \(t \approx 1\).
- Design Motivation: The differential design eliminates the influence of varying per-video reconstruction difficulty due to content, making the attribution signal more robust.
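The loss-ratio signal can be illustrated with a small, self-contained sketch. Everything here is an assumption made for illustration: frames are flat lists of pixel values, and `toy_reconstruct` is a hypothetical stand-in for the VAE round trip \(\mathcal{R}\) that simply averages each \(K\)-frame chunk. It is not the reconstruction used by SWIFT and only shows the index bookkeeping and the direction of the signal.

```python
def mse(a, b):
    """Mean squared error between two frames given as flat lists of pixel values."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def toy_reconstruct(window, K):
    """Hypothetical stand-in for the VAE round trip R(.): each K-frame chunk
    is replaced by its per-pixel mean (a crude temporal compression)."""
    out = []
    for s in range(0, len(window), K):
        chunk = window[s:s + K]
        mean = [sum(px) / len(chunk) for px in zip(*chunk)]
        out.extend(list(mean) for _ in chunk)
    return out


def attribution_signal(video, K, reconstruct, eps=1e-12):
    """Mean ratio of normal to corrupted reconstruction loss over the overlap."""
    if len(video) % K != 0 or len(video) < 2 * K:
        raise ValueError("expected K*N frames with N >= 2")
    N = len(video) // K
    length = K * (N - 1)
    normal = video[:length]                  # W_0: frames 1..K(N-1)
    corrupted = video[K - 1:K - 1 + length]  # W_{K-1}: frames K..KN-1
    rec_normal = reconstruct(normal, K)
    rec_corrupted = reconstruct(corrupted, K)
    ratios = []
    for i in range(K, length + 1):           # 1-based overlapping frames
        frame = video[i - 1]
        loss_normal = mse(rec_normal[i - 1], frame)
        loss_corrupted = mse(rec_corrupted[i - K], frame)
        # eps guards against division by zero when the corrupted loss vanishes
        ratios.append(loss_normal / (loss_corrupted + eps))
    return sum(ratios) / len(ratios)


if __name__ == "__main__":
    K = 4
    # Toy "in-distribution" video: constant inside each aligned chunk.
    matched = [[float(c), 2.0 * c] for c in (0, 1, 5) for _ in range(K)]
    # Toy "out-of-distribution" video: a ramp with no chunk structure.
    ramp = [[float(i), float(i)] for i in range(1, 3 * K + 1)]
    print(attribution_signal(matched, K, toy_reconstruct))
    print(attribution_signal(ramp, K, toy_reconstruct))
```

In this toy setting the chunk-constant video is reconstructed exactly by the aligned window, so its signal is essentially zero, while the ramp video yields a value above 1. Real VAEs will not separate the two cases this cleanly.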
KDE-Based Adaptive Threshold Determination:
- Function: Independently determines an attribution threshold for each model.
- Mechanism: Kernel density estimation (KDE) is applied to the signal values of a small set of videos known to come from the target model. The threshold \(\tau\) is the point where the estimated cumulative distribution function reaches \(1-\alpha\) (\(\alpha = 0.05\)), so a test video whose signal falls below \(\tau\) is attributed to the model. A Gaussian kernel with Scott's bandwidth is used, requiring no assumption about the data distribution.
- Design Motivation: The attribution signal does not follow a consistent probability distribution across models and may contain outliers. KDE is non-parametric, so it imposes no distributional assumption and is less sensitive to outliers than fitting a fixed parametric form.
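The threshold step can be sketched with the Python standard library alone. This is a minimal sketch under stated assumptions: the bandwidth uses the one-dimensional form \(h = \sigma n^{-1/5}\) of Scott's rule (the convention in SciPy's `gaussian_kde`; other references add a constant factor), the quantile is found by bisection, and the decision rule `t <= tau` is inferred from the description above. The function names and the sample values are hypothetical.

```python
import math
import statistics


def scott_bandwidth(samples):
    """Bandwidth h = sigma * n^(-1/5), the 1-D form of Scott's rule assumed here."""
    n = len(samples)
    h = statistics.stdev(samples) * n ** (-1 / 5)
    if h == 0:
        raise ValueError("samples must not all be identical")
    return h


def kde_cdf(x, samples, h):
    """CDF of a Gaussian KDE: the average of normal CDFs centred on each sample."""
    return sum(0.5 * (1.0 + math.erf((x - s) / (h * math.sqrt(2.0))))
               for s in samples) / len(samples)


def kde_threshold(samples, alpha=0.05, iters=100):
    """Find tau with KDE-CDF(tau) = 1 - alpha by bisection."""
    h = scott_bandwidth(samples)
    lo, hi = min(samples) - 10 * h, max(samples) + 10 * h
    for _ in range(iters):
        mid = (lo + hi) / 2
        if kde_cdf(mid, samples, h) < 1 - alpha:
            lo = mid
        else:
            hi = mid
    return hi


def is_attributed(t, tau):
    """Decision rule assumed in this sketch: small signals indicate a match."""
    return t <= tau


if __name__ == "__main__":
    # Synthetic signal values for 20 "attributed" videos (made up for illustration).
    signals = [0.20 + 0.01 * i for i in range(20)]
    tau = kde_threshold(signals)
    print(tau, is_attributed(0.25, tau), is_attributed(0.95, tau))
```

Because the threshold is read off the smoothed CDF rather than the raw sample quantile, it lands slightly beyond the largest observed values for small sample sets, which leaves a small margin for unseen attributed videos.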

Loss & Training

SWIFT is entirely training-free. The attribution signal uses MSE as the reconstruction loss. In the ablation, MSE outperforms MAE (98.4% vs. 97.8%) and substantially outperforms PSNR (47.8%) and SSIM (47.1%). The explanation offered is that the latter two emphasize structural similarity rather than pixel-level differences and thus fail to capture VAE distributional characteristics; this fits SSIM more directly than PSNR, which is a logarithmic function of MSE.
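For reference, the per-frame definitions of MSE, MAE, and PSNR are given below in their standard textbook form. The symbols \(P\) (number of pixel values per frame), \(\hat{F}\) (a reconstructed frame), and \(\mathrm{MAX}\) (peak pixel value) are introduced here for illustration and are not taken from the paper.

\[
\mathrm{MSE}(\hat{F}, F) = \frac{1}{P} \sum_{p=1}^{P} (\hat{F}_p - F_p)^2, \qquad
\mathrm{MAE}(\hat{F}, F) = \frac{1}{P} \sum_{p=1}^{P} \lvert \hat{F}_p - F_p \rvert, \qquad
\mathrm{PSNR}(\hat{F}, F) = 10 \log_{10} \frac{\mathrm{MAX}^2}{\mathrm{MSE}(\hat{F}, F)}
\]

The squared term weights large pixel errors more heavily than the absolute term, which may be why the MSE ratio separates aligned from misaligned reconstructions slightly better than the MAE ratio.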

Key Experimental Results

Main Results

Evaluated on the self-constructed S-Video dataset (4,000 videos: 500 real + 3,500 generated from 5 state-of-the-art models):

Target Model	SWIFT Avg. Acc.	AEDR Avg. Acc.	Gain (pp)
HunyuanVideo	90.7%	60.5%	+30.2
Wan2.1	98.4%	89.3%	+9.1
EasyAnimate	97.8%	63.1%	+34.7
LTX-Video	85.3%	79.3%	+6.0
Wan2.2	97.9%	78.5%	+19.4
Overall Average	94.0%	73.6%	+20.4

Ablation Study

Few-shot capability (number of samples required for threshold estimation):

Sample Count \(S\)	Avg. Accuracy	Notes
0 (zero-shot)	85.1%	Threshold fixed at \(\tau = 1\), no estimation
20	90.2%	Exceeds 90% with few samples
50	92.5%	Performance near saturation
200	94.0%	Best result

Window selection ablation (HunyuanVideo, \(K=4\)):

Normal Window	Corrupted Window	Accuracy
\(W_0\)	\(W_1\)	82.3%
\(W_0\)	\(W_2\)	82.3%
\(W_0\)	\(W_3\)	90.7%

Key Findings

Exceptional performance on Wan2.1/EasyAnimate/Wan2.2 (97–98%), as these models employ pure encoder-decoder VAEs that fully preserve VAE distributional characteristics.
Lowest on LTX-Video (85.3%), due to its VAE incorporating an additional denoising step during decoding, which attenuates the reconstruction discrepancy signal. Performance still substantially exceeds all baselines.
Zero-shot feasibility: Setting \(\tau = 1\) directly achieves approximately 90% accuracy for HunyuanVideo, EasyAnimate, and Wan2.2.
Efficiency advantage: 4–32% faster than AEDR, as SWIFT reconstructs only the window rather than the full video.
MSE is the optimal loss metric: MSE amplifies differences more effectively than MAE (98.4% vs. 97.8%).

Highlights & Insights

Elegant exploitation of 3D VAE temporal compression: Converting the inherent temporal mapping of 3D VAEs into an attribution signal source is a highly elegant insight. This paradigm of "leveraging model architectural properties for forensics" is generalizable to detection tasks targeting other architecture-specific components.
Differential reconstruction eliminates content bias: Rather than examining absolute reconstruction error — which is confounded by video content — the method examines the ratio between normal and corrupted reconstructions, making the signal depend solely on VAE distribution matching and substantially improving robustness.
Practical few-shot, training-free design: Achieving 90% accuracy with only 20 attributed video samples and no model training is highly practical given the rapid proliferation of new generative models.

Limitations & Future Work

Attribution accuracy drops to 85.3% for LTX-Video due to its decoder denoising step; the method may require adaptation for future models with more complex VAE designs.
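Currently limited to white-box VAE access scenarios, which restricts use to parties that hold the target model's VAE weights (model owners, or anyone in the case of open-weight releases) rather than arbitrary third-party auditors.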
Robustness to video compression (e.g., H.264/H.265) is not discussed.
Attribution may fail when multiple models share the same VAE (e.g., fine-tuned variants of a common base model).
Future directions include black-box attribution settings and integration of frequency-domain analysis to enhance detection for complex VAEs.

vs. AEDR: An image attribution method performing attribution via VAE reconstruction consistency. SWIFT extends this to video; the key innovation is differential reconstruction along the temporal dimension rather than purely spatial reconstruction, improving accuracy from 73.6% to 94.0%.
vs. RONAN/LatentTracer: Gradient-optimization-based image attribution methods with high computational overhead. SWIFT requires no gradient optimization, relying only on forward encoding and decoding.
vs. Watermarking methods: Watermarking requires modifying the generation pipeline. SWIFT is entirely passive and transparent to the generation process.

Rating

Novelty: ⭐⭐⭐⭐⭐ — First formulation of this task; elegant exploitation of 3D VAE temporal properties; unique differential reconstruction design.
Experimental Thoroughness: ⭐⭐⭐⭐ — Comprehensive evaluation across 5 models with detailed ablations; lacks robustness testing under video compression.
Writing Quality: ⭐⭐⭐⭐ — Clear formal definitions; some notation is redundant.
Value: ⭐⭐⭐⭐⭐ — Highly practical; the few-shot training-free paradigm has significant implications for AI safety.