SWIFT: Sliding Window Reconstruction for Few-Shot Training-Free Generated Video Attribution¶
Conference: CVPR 2026 arXiv: 2603.08536 Code: GitHub Area: Video Generation Keywords: Generated video attribution, 3D VAE, sliding window reconstruction, training-free, temporal consistency
TL;DR¶
SWIFT introduces the novel task of "few-shot training-free generated video attribution." It exploits the temporal mapping property of 3D VAEs, in which \(K\) pixel frames correspond to a single latent frame, by performing two reconstructions through a fixed-length sliding window: one temporally aligned (normal) and one shifted (corrupted). The ratio of reconstruction losses over the overlapping frames serves as the attribution signal. With only 20 attributed samples for threshold estimation, SWIFT exceeds 90% average attribution accuracy, and its 5-model average reaches 94% with more samples.
Background & Motivation¶
- Background: Video generation technologies (HunyuanVideo, Wan2.1/2.2, EasyAnimate, etc.) are advancing rapidly, all adopting 3D VAE + DiT architectures. Generated videos risk being misused for disinformation or intellectual property infringement.
- Limitations of Prior Work: Existing attribution methods fall into two categories: (1) active watermarking requires embedding operations that may degrade video quality; (2) training-based passive attribution requires large amounts of training data and must be retrained for each new model. Image attribution methods (RONAN/LatentTracer/AEDR) suffer significant accuracy drops when transferred to video.
- Key Challenge: Image attribution methods focus solely on spatial consistency, ignoring the temporal consistency constraints inherent in video data, making them ineffective against sequence-correlated perturbations.
- Goal: How to achieve reliable generated video attribution without training and with only a small number of samples, by leveraging the temporal characteristics of video?
- Key Insight: State-of-the-art video generation models employ 3D VAEs that perform temporal up/downsampling (typically with compression ratios of 4 or 8), naturally forming a "\(K\) pixel frames \(\leftrightarrow\) 1 latent frame" temporal mapping. Videos generated by a target model satisfy that model's VAE distribution when chunk-aligned, whereas non-attributed videos do not.
- Core Idea: By using a sliding window to break temporal alignment — thus "corrupting" the reconstruction — videos belonging to the target model exhibit a significant loss discrepancy between normal and corrupted reconstructions, while non-attributed videos do not.
Method¶
Overall Architecture¶
Given a test video and the 3D VAE of a target model, SWIFT proceeds in three steps: (1) define a fixed-length sliding window; (2) perform normal and corrupted reconstructions separately and compute the loss ratio over overlapping frames as the attribution signal; (3) use KDE to determine a threshold and make the attribution decision. The entire process requires only white-box access to the target model's VAE encoder-decoder, with no model training required.
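As a toy illustration of step (1), the snippet below builds the normal and shifted windows for an example compression ratio \(K = 4\) with \(N = 5\) chunks (the window length \(K(N-1)\) and the \(K-1\) shift are detailed under Key Designs below); the frame numbering and variable names are illustrative only.

```python
# Toy illustration of the fixed-length sliding window (values are illustrative):
# a video of K*N frames, with temporal compression ratio K = 4 and N = 5 chunks.
K, N = 4, 5
frames = list(range(1, K * N + 1))       # frames 1..20 (1-indexed, as in the paper)
win = K * (N - 1)                        # fixed window length: 16 frames

W0 = frames[:win]                        # normal window W_0: frames 1..16, chunk-aligned
Wk = frames[K - 1 : K - 1 + win]         # corrupted window W_{K-1}: frames 4..19, shifted by K-1

overlap = sorted(set(W0) & set(Wk))      # frames 4..16: where the loss ratio is computed
print(W0[:K], Wk[:K], len(overlap))      # [1, 2, 3, 4] [4, 5, 6, 7] 13  (= K(N-1) - K + 1)
```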
Key Designs¶
- Fixed-Length Sliding Window:
- Function: Defines two windows for contrastive reconstruction — one maintaining temporal alignment (normal) and one breaking it (corrupted).
- Mechanism: Given a video with \(KN\) frames (\(K\) = temporal compression ratio, \(N\) = number of chunks), the window size is \(K(N-1)\) frames. The normal window \(W_0\) starts at frame 1, so every chunk within it satisfies the VAE's temporal mapping in both frame composition and position. The corrupted window \(W_{K-1}\) is shifted forward by \(K-1\) frames, placing every frame into the wrong chunk position and maximally disrupting temporal consistency. In general, a window \(W_j\) shifted by \(j\) frames preserves chunk alignment only when \(j \bmod K = 0\) and breaks it whenever \(j \bmod K \neq 0\) (see the code sketch after this list).
- Design Motivation: \(W_0\) and \(W_{K-1}\) are selected because a \(K-1\) offset simultaneously alters both the frame composition within each chunk and the frame-to-latent position mapping, achieving maximal disruption. For VAEs with denoising steps in the decoder (e.g., LTX), the maximally divergent window pair is identified quantitatively.
- Normal vs. Corrupted Differential Reconstruction:
- Function: Generates the attribution signal via the loss ratio between two reconstructions.
- Mechanism: \(W_0\) is reconstructed to yield \(W_0^* = \mathcal{R}(W_0)\), and \(W_{K-1}\) is reconstructed to yield \(W_{K-1}^{**} = \mathcal{R}(W_{K-1})\). The attribution signal \(t\) is defined as the mean loss ratio over overlapping frames: \(t = \frac{1}{K(N-1)-K+1} \sum_{i=K}^{K(N-1)} \frac{\mathcal{L}(F_i^*, F_i)}{\mathcal{L}(F_i^{**}, F_i)}\), where the loss \(\mathcal{L}\) is MSE. For videos attributed to the target model, normal reconstruction loss is small and corrupted reconstruction loss is large, yielding \(t \ll 1\); for non-attributed videos, the two losses are comparable, yielding \(t \approx 1\).
- Design Motivation: The differential design eliminates the influence of varying per-video reconstruction difficulty due to content, making the attribution signal more robust.
- KDE-Based Adaptive Threshold Determination:
- Function: Independently determines an attribution threshold for each model.
- Mechanism: Kernel density estimation (KDE) is applied to the signal distribution of a small set of attributed videos to estimate the threshold \(\tau\), selecting the point where the cumulative distribution function reaches \(1-\alpha\) (\(\alpha = 0.05\)). A Gaussian kernel with Scott's bandwidth is used, requiring no assumption about the data distribution.
- Design Motivation: The attribution signal does not follow a consistent probability distribution across models and may contain outliers. KDE is non-parametric, so it requires no distributional assumptions and remains relatively robust to outliers.
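Putting the three designs together, the following is a minimal sketch of the signal computation and KDE thresholding, assuming a hypothetical `vae` object whose `encode`/`decode` round-trips a `(1, C, T, H, W)` clip; real 3D VAE interfaces (latent distribution objects, scaling factors, causal padding) differ across models, so this is a schematic of the procedure rather than the authors' implementation.

```python
import numpy as np
import torch
import torch.nn.functional as F
from scipy.stats import gaussian_kde


def reconstruct(vae, frames):
    """Round-trip a clip through the target model's 3D VAE (forward passes only).

    `frames` has shape (T, C, H, W); the hypothetical `vae` is assumed to take and
    return clips of shape (1, C, T, H, W).
    """
    clip = frames.permute(1, 0, 2, 3).unsqueeze(0)
    with torch.no_grad():
        recon = vae.decode(vae.encode(clip))
    return recon.squeeze(0).permute(1, 0, 2, 3)


def compute_attribution_signal(vae, video, K):
    """Attribution signal t: mean per-frame MSE ratio of normal vs. corrupted reconstruction.

    `video` has shape (K*N, C, H, W); K is the VAE's temporal compression ratio.
    t << 1 suggests the video matches the target model's VAE; t ~ 1 suggests it does not.
    """
    N = video.shape[0] // K
    win = K * (N - 1)                                  # fixed window length

    w_normal = video[:win]                             # W_0: chunk-aligned window
    w_corrupt = video[K - 1 : K - 1 + win]             # W_{K-1}: shifted by K-1 frames

    r_normal = reconstruct(vae, w_normal)
    r_corrupt = reconstruct(vae, w_corrupt)

    # Overlapping frames of the two windows: 0-based indices K-1 .. K(N-1)-1,
    # i.e. the K(N-1)-K+1 frames that appear in both reconstructions.
    ratios = []
    for i in range(K - 1, win):
        loss_normal = F.mse_loss(r_normal[i], video[i])
        loss_corrupt = F.mse_loss(r_corrupt[i - (K - 1)], video[i])
        ratios.append((loss_normal / loss_corrupt).item())
    return float(np.mean(ratios))


def estimate_threshold_kde(attributed_signals, alpha=0.05):
    """Threshold tau: point where the KDE-estimated CDF of signals from a few
    attributed videos reaches 1 - alpha (Gaussian kernel, Scott's bandwidth by default)."""
    kde = gaussian_kde(attributed_signals)
    grid = np.linspace(min(attributed_signals) - 0.5, max(attributed_signals) + 0.5, 2000)
    cdf = np.array([kde.integrate_box_1d(-np.inf, x) for x in grid])
    idx = min(int(np.searchsorted(cdf, 1.0 - alpha)), len(grid) - 1)
    return float(grid[idx])


def attribute(vae, video, K, attributed_signals, alpha=0.05):
    """Attribute `video` to the target model iff its signal falls below the threshold."""
    t = compute_attribution_signal(vae, video, K)
    # Zero-shot fallback: with no attributed samples, set tau = 1 directly.
    tau = estimate_threshold_kde(attributed_signals, alpha) if len(attributed_signals) >= 2 else 1.0
    return t <= tau
```

The `attribute` helper's fallback of \(\tau = 1\) when no attributed samples are available corresponds to the zero-shot setting reported in the ablations below.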
Loss & Training¶
SWIFT is entirely training-free. The core metric uses MSE as the reconstruction loss. Ablation experiments show that MSE outperforms MAE (98.4% vs. 97.8%) and substantially outperforms PSNR (47.8%) and SSIM (47.1%). The latter two focus on structural similarity rather than pixel-level differences and thus fail to effectively capture VAE distributional characteristics.
Key Experimental Results¶
Main Results¶
Evaluated on the self-constructed S-Video dataset (4,000 videos: 500 real + 3,500 generated from 5 state-of-the-art models):
| Target Model | SWIFT Avg. Acc. | AEDR Avg. Acc. | Gain |
|---|---|---|---|
| HunyuanVideo | 90.7% | 60.5% | +30.2% |
| Wan2.1 | 98.4% | 89.3% | +9.1% |
| EasyAnimate | 97.8% | 63.1% | +34.7% |
| LTX-Video | 85.3% | 79.3% | +6.0% |
| Wan2.2 | 97.9% | 78.5% | +19.4% |
| Overall Average | 94.0% | 73.6% | +20.4% |
Ablation Study¶
Few-shot capability (number of samples required for threshold estimation):
| Sample Count S | Avg. Accuracy | Notes |
|---|---|---|
| 0 (zero-shot) | 85.1% | Sets \(\tau = 1\) directly |
| 20 | 90.2% | 90% achieved with few samples |
| 50 | 92.5% | Performance near saturation |
| 200 | 94.0% | Optimal |
Window selection ablation (HunyuanVideo, \(K=4\)):
| Normal Window | Corrupted Window | Accuracy |
|---|---|---|
| \(W_0\) | \(W_1\) | 82.3% |
| \(W_0\) | \(W_2\) | 82.3% |
| \(W_0\) | \(W_3\) | 90.7% |
Key Findings¶
- Exceptional performance on Wan2.1/EasyAnimate/Wan2.2 (97–98%), as these models employ pure encoder-decoder VAEs that fully preserve VAE distributional characteristics.
- Lowest on LTX-Video (85.3%), due to its VAE incorporating an additional denoising step during decoding, which attenuates the reconstruction discrepancy signal. Performance still substantially exceeds all baselines.
- Zero-shot feasibility: Setting \(\tau = 1\) directly achieves approximately 90% accuracy for HunyuanVideo, EasyAnimate, and Wan2.2.
- Efficiency advantage: 4–32% faster than AEDR, as SWIFT reconstructs only the window rather than the full video.
- MSE is the optimal loss metric: MSE amplifies differences more effectively than MAE (98.4% vs. 97.8%).
Highlights & Insights¶
- Elegant exploitation of 3D VAE temporal compression: Converting the inherent temporal mapping of 3D VAEs into an attribution signal source is a highly elegant insight. This paradigm of "leveraging model architectural properties for forensics" is generalizable to detection tasks targeting other architecture-specific components.
- Differential reconstruction eliminates content bias: Rather than examining absolute reconstruction error — which is confounded by video content — the method examines the ratio between normal and corrupted reconstructions, making the signal depend solely on VAE distribution matching and substantially improving robustness.
- Practical few-shot, training-free design: Achieving 90% accuracy with only 20 attributed video samples and no model training is highly practical given the rapid proliferation of new generative models.
Limitations & Future Work¶
- Attribution accuracy drops to 85.3% for LTX-Video due to its decoder denoising step; the method may require adaptation for future models with more complex VAE designs.
- Currently limited to white-box VAE access scenarios, restricting usability to model owners rather than third-party auditors.
- Robustness to video compression (e.g., H.264/H.265) is not discussed.
- Attribution may fail when multiple models share the same VAE (e.g., fine-tuned variants of a common base model).
- Future directions include black-box attribution settings and integration of frequency-domain analysis to enhance detection for complex VAEs.
Related Work & Insights¶
- vs. AEDR: An image attribution method performing attribution via VAE reconstruction consistency. SWIFT extends this to video; the key innovation is differential reconstruction along the temporal dimension rather than purely spatial reconstruction, improving accuracy from 73.6% to 94.0%.
- vs. RONAN/LatentTracer: Gradient-optimization-based image attribution methods with high computational overhead. SWIFT requires no gradient optimization, relying only on forward encoding and decoding.
- vs. Watermarking methods: Watermarking requires modifying the generation pipeline. SWIFT is entirely passive and transparent to the generation process.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ — First formulation of this task; elegant exploitation of 3D VAE temporal properties; unique differential reconstruction design.
- Experimental Thoroughness: ⭐⭐⭐⭐ — Comprehensive evaluation across 5 models with detailed ablations; lacks robustness testing under video compression.
- Writing Quality: ⭐⭐⭐⭐ — Clear formal definitions; some notation is redundant.
- Value: ⭐⭐⭐⭐⭐ — Highly practical; the few-shot training-free paradigm has significant implications for AI safety.