
SWIFT: Sliding Window Reconstruction for Few-Shot Training-Free Generated Video Attribution

Conference: CVPR 2026 | arXiv: 2603.08536 | Code: GitHub | Area: Video Generation | Keywords: Generated video attribution, 3D VAE, sliding window reconstruction, training-free, temporal consistency

TL;DR

SWIFT introduces the new task of few-shot, training-free generated video attribution. It exploits the temporal mapping property of 3D VAEs, in which \(K\) pixel frames correspond to a single latent frame, by performing two reconstructions (normal and corrupted) over fixed-length sliding windows; the ratio of reconstruction losses on overlapping frames serves as the attribution signal. Using only 20 samples, SWIFT exceeds 90% average attribution accuracy, and its 5-model average reaches 94% with a larger sample budget.

Background & Motivation

  1. Background: Video generation technologies (HunyuanVideo, Wan2.1/2.2, EasyAnimate, etc.) are advancing rapidly, all adopting 3D VAE + DiT architectures. Generated videos risk being misused for disinformation or intellectual property infringement.
  2. Limitations of Prior Work: Existing attribution methods fall into two categories: (1) active watermarking requires embedding operations that may degrade video quality; (2) training-based passive attribution requires large amounts of training data and must be retrained for each new model. Image attribution methods (RONAN/LatentTracer/AEDR) suffer significant accuracy drops when transferred to video.
  3. Key Challenge: Image attribution methods focus solely on spatial consistency, ignoring the temporal consistency constraints inherent in video data, making them ineffective against sequence-correlated perturbations.
  4. Goal: Achieve reliable generated video attribution without training and with only a small number of samples, by leveraging the temporal characteristics of video.
  5. Key Insight: State-of-the-art video generation models employ 3D VAEs that perform temporal up/downsampling (typically with compression ratios of 4 or 8), naturally forming a "\(K\) pixel frames \(\leftrightarrow\) 1 latent frame" temporal mapping. Videos generated by a target model satisfy that model's VAE distribution when chunk-aligned, whereas non-attributed videos do not.
  6. Core Idea: Use a sliding window to break temporal alignment, thereby "corrupting" the reconstruction; videos belonging to the target model then exhibit a significant loss discrepancy between normal and corrupted reconstructions, while non-attributed videos do not.

Method

Overall Architecture

Given a test video and the 3D VAE of a target model, SWIFT proceeds in three steps: (1) define a fixed-length sliding window; (2) perform normal and corrupted reconstructions separately and compute the loss ratio over overlapping frames as the attribution signal; (3) use KDE to determine a threshold and make the attribution decision. The entire process needs only white-box access to the target model's VAE encoder and decoder and involves no model training.

Key Designs

  1. Fixed-Length Sliding Window:

    • Function: Defines two windows for contrastive reconstruction — one maintaining temporal alignment (normal) and one breaking it (corrupted).
    • Mechanism: Given a video with \(KN\) frames (\(K\) = temporal compression ratio, \(N\) = number of chunks), the window size is \(K(N-1)\) frames. The normal window \(W_0\) starts at frame 1, so every chunk within it satisfies the VAE's temporal mapping in both frame composition and position. The corrupted window \(W_{K-1}\) is shifted forward by \(K-1\) frames, placing every frame into the wrong chunk position and maximally disrupting temporal consistency. In general, a window \(W_j\) shifted by \(j\) frames preserves temporal alignment when \(j \bmod K = 0\) and breaks it when \(j \bmod K \neq 0\) (see the sketch after this list).
    • Design Motivation: \(W_0\) and \(W_{K-1}\) are selected because a \(K-1\) offset simultaneously alters both the frame composition within each chunk and the frame-to-latent position mapping, achieving maximal disruption. For VAEs with denoising steps in the decoder (e.g., LTX), the maximally divergent window pair is identified quantitatively.
  2. Normal vs. Corrupted Differential Reconstruction:

    • Function: Generates the attribution signal via the loss ratio between two reconstructions.
    • Mechanism: \(W_0\) is reconstructed to yield \(W_0^* = \mathcal{R}(W_0)\), and \(W_{K-1}\) is reconstructed to yield \(W_{K-1}^{**} = \mathcal{R}(W_{K-1})\). The attribution signal \(t\) is defined as the mean loss ratio over overlapping frames: \(t = \frac{1}{K(N-1)-K+1} \sum_{i=K}^{K(N-1)} \frac{\mathcal{L}(F_i^*, F_i)}{\mathcal{L}(F_i^{**}, F_i)}\), where the loss \(\mathcal{L}\) is MSE. For videos attributed to the target model, normal reconstruction loss is small and corrupted reconstruction loss is large, yielding \(t \ll 1\); for non-attributed videos, the two losses are comparable, yielding \(t \approx 1\).
    • Design Motivation: The differential design eliminates the influence of varying per-video reconstruction difficulty due to content, making the attribution signal more robust.
  3. KDE-Based Adaptive Threshold Determination:

    • Function: Independently determines an attribution threshold for each model.
    • Mechanism: Kernel density estimation (KDE) is applied to the signal distribution of a small set of attributed videos to estimate the threshold \(\tau\), selecting the point where the cumulative distribution function reaches \(1-\alpha\) (\(\alpha = 0.05\)). A Gaussian kernel with Scott's bandwidth is used, requiring no assumption about the data distribution.
    • Design Motivation: The attribution signal does not follow a consistent probability distribution across models and may contain outliers. KDE is non-parametric, makes no distributional assumptions, and is comparatively robust to outliers.
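
The key designs above can be combined into a short end-to-end sketch. This is a minimal illustration, assuming a hypothetical `vae` object exposing `encode`/`decode` for the target model's 3D VAE and a pixel-frame tensor of shape `(T, C, H, W)` with `T = K * N`; all names and interfaces here are illustrative rather than the authors' implementation.

```python
import numpy as np
import torch
from scipy.stats import gaussian_kde


def reconstruct(vae, window: torch.Tensor) -> torch.Tensor:
    # One forward pass through the target model's 3D VAE (hypothetical wrapper API).
    with torch.no_grad():
        return vae.decode(vae.encode(window))


def swift_signal(vae, frames: torch.Tensor, K: int) -> float:
    """Attribution signal t: mean per-frame MSE ratio over the overlapping frames."""
    T = frames.shape[0]
    N = T // K                           # number of temporal chunks
    L = K * (N - 1)                      # fixed sliding-window length

    w_normal = frames[:L]                # W_0: chunk-aligned window
    w_corrupt = frames[K - 1:K - 1 + L]  # W_{K-1}: shifted forward by K-1 frames

    r_normal = reconstruct(vae, w_normal)
    r_corrupt = reconstruct(vae, w_corrupt)

    ratios = []
    # Overlapping frames are 1-based indices K .. K(N-1), i.e. 0-based K-1 .. L-1.
    for i in range(K - 1, L):
        f = frames[i]
        loss_normal = torch.mean((r_normal[i] - f) ** 2)              # normal reconstruction loss
        loss_corrupt = torch.mean((r_corrupt[i - (K - 1)] - f) ** 2)  # corrupted reconstruction loss
        ratios.append((loss_normal / loss_corrupt).item())
    return float(np.mean(ratios))


def kde_threshold(signals: np.ndarray, alpha: float = 0.05) -> float:
    """Threshold tau where the KDE-estimated CDF of attributed signals reaches 1 - alpha."""
    kde = gaussian_kde(signals)          # Gaussian kernel, Scott's bandwidth by default
    grid = np.linspace(signals.min() - 0.5, signals.max() + 0.5, 2000)
    cdf = np.array([kde.integrate_box_1d(-np.inf, x) for x in grid])
    idx = min(int(np.searchsorted(cdf, 1 - alpha)), len(grid) - 1)
    return float(grid[idx])


# Usage (following the decision rule described above): a test video is attributed
# to the target model when its signal falls below the KDE threshold.
# signals = np.array([swift_signal(vae, v, K=4) for v in attributed_samples])
# tau = kde_threshold(signals)
# is_attributed = swift_signal(vae, test_video, K=4) < tau
```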

Loss & Training

SWIFT is entirely training-free. The core metric uses MSE as the reconstruction loss. Ablation experiments show that MSE outperforms MAE (98.4% vs. 97.8%) and substantially outperforms PSNR (47.8%) and SSIM (47.1%). The latter two focus on structural similarity rather than pixel-level differences and thus fail to effectively capture VAE distributional characteristics.
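
As a rough illustration of where this ablation plugs in, the per-frame MSE used in the sketch above could be parameterized over the loss metric; the function name and structure below are illustrative, not the paper's code.

```python
import torch

def frame_loss(recon: torch.Tensor, target: torch.Tensor, metric: str = "mse") -> torch.Tensor:
    """Per-frame reconstruction loss for the signal ratio; MSE is the best-performing choice in the ablation."""
    diff = recon - target
    if metric == "mse":
        return torch.mean(diff ** 2)    # squared error, amplifies larger deviations
    if metric == "mae":
        return torch.mean(diff.abs())   # absolute error, slightly weaker in the ablation
    raise ValueError(f"unsupported metric: {metric}")
```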

Key Experimental Results

Main Results

Evaluated on the self-constructed S-Video dataset (4,000 videos: 500 real + 3,500 generated from 5 state-of-the-art models):

| Target Model | SWIFT Avg. Acc. | AEDR Avg. Acc. | Gain |
| --- | --- | --- | --- |
| HunyuanVideo | 90.7% | 60.5% | +30.2% |
| Wan2.1 | 98.4% | 89.3% | +9.1% |
| EasyAnimate | 97.8% | 63.1% | +34.7% |
| LTX-Video | 85.3% | 79.3% | +6.0% |
| Wan2.2 | 97.9% | 78.5% | +19.4% |
| Overall Average | 94.0% | 73.6% | +20.4% |

Ablation Study

Few-shot capability (number of samples required for threshold estimation):

| Sample Count \(S\) | Avg. Accuracy | Notes |
| --- | --- | --- |
| 0 (zero-shot) | 85.1% | Threshold set directly to \(\tau=1\) |
| 20 | 90.2% | Over 90% with few samples |
| 50 | 92.5% | Performance near saturation |
| 200 | 94.0% | Optimal |

Window selection ablation (HunyuanVideo, \(K=4\)):

| Normal Window | Corrupted Window | Accuracy |
| --- | --- | --- |
| \(W_0\) | \(W_1\) | 82.3% |
| \(W_0\) | \(W_2\) | 82.3% |
| \(W_0\) | \(W_3\) | 90.7% |

Key Findings

  • Exceptional performance on Wan2.1/EasyAnimate/Wan2.2 (97–98%), as these models employ pure encoder-decoder VAEs that fully preserve VAE distributional characteristics.
  • Lowest on LTX-Video (85.3%), due to its VAE incorporating an additional denoising step during decoding, which attenuates the reconstruction discrepancy signal. Performance still substantially exceeds all baselines.
  • Zero-shot feasibility: Setting \(\tau = 1\) directly achieves approximately 90% accuracy for HunyuanVideo, EasyAnimate, and Wan2.2.
  • Efficiency advantage: 4–32% faster than AEDR, as SWIFT reconstructs only the window rather than the full video.
  • MSE is the optimal loss metric: MSE amplifies differences more effectively than MAE (98.4% vs. 97.8%).

Highlights & Insights

  • Elegant exploitation of 3D VAE temporal compression: Converting the inherent temporal mapping of 3D VAEs into a source of attribution signal is a sharp insight. This paradigm of "leveraging model architectural properties for forensics" generalizes to detection tasks targeting other architecture-specific components.
  • Differential reconstruction eliminates content bias: Rather than examining absolute reconstruction error — which is confounded by video content — the method examines the ratio between normal and corrupted reconstructions, making the signal depend solely on VAE distribution matching and substantially improving robustness.
  • Practical few-shot, training-free design: Achieving 90% accuracy with only 20 attributed video samples and no model training is highly practical given the rapid proliferation of new generative models.

Limitations & Future Work

  • Attribution accuracy drops to 85.3% for LTX-Video due to its decoder denoising step; the method may require adaptation for future models with more complex VAE designs.
  • Currently limited to white-box VAE access scenarios, restricting usability to model owners rather than third-party auditors.
  • Robustness to video compression (e.g., H.264/H.265) is not discussed.
  • Attribution may fail when multiple models share the same VAE (e.g., fine-tuned variants of a common base model).
  • Future directions include black-box attribution settings and integration of frequency-domain analysis to enhance detection for complex VAEs.

Comparison with Related Methods

  • vs. AEDR: An image attribution method performing attribution via VAE reconstruction consistency. SWIFT extends this to video; the key innovation is differential reconstruction along the temporal dimension rather than purely spatial reconstruction, improving average accuracy from 73.6% to 94.0%.
  • vs. RONAN/LatentTracer: Gradient-optimization-based image attribution methods with high computational overhead. SWIFT requires no gradient optimization, relying only on forward encoding and decoding.
  • vs. Watermarking methods: Watermarking requires modifying the generation pipeline. SWIFT is entirely passive and transparent to the generation process.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ — First formulation of this task; elegant exploitation of 3D VAE temporal properties; unique differential reconstruction design.
  • Experimental Thoroughness: ⭐⭐⭐⭐ — Comprehensive evaluation across 5 models with detailed ablations; lacks robustness testing under video compression.
  • Writing Quality: ⭐⭐⭐⭐ — Clear formal definitions; some notation is redundant.
  • Value: ⭐⭐⭐⭐⭐ — Highly practical; the few-shot training-free paradigm has significant implications for AI safety.