
Free-Lunch Long Video Generation via Layer-Adaptive O.O.D Correction

Conference: CVPR 2026
arXiv: 2603.25209
Code: https://github.com/Westlake-AGI-Lab/FreeLOC
Area: Video Generation / Diffusion Models
Keywords: Long video generation, training-free, positional encoding extrapolation, sparse attention, layer-adaptive

TL;DR

FreeLOC is a training-free, layer-adaptive framework for long video generation. It quantifies each DiT layer's sensitivity to two out-of-distribution (OOD) problems, frame-level relative position OOD and context length OOD, and selectively applies multi-granularity positional re-encoding (VRPR) to position-sensitive layers and tiered sparse attention (TSA) to context-sensitive layers, achieving state-of-the-art long video quality with no additional training cost.

Background & Motivation

  1. Background: Video diffusion models (e.g., Wan, HunyuanVideo) can generate high-quality short videos, but are typically trained on short clips (~5 seconds). Direct application to longer video generation leads to severe quality degradation.
  2. Limitations of Prior Work: Training-based autoregressive methods are computationally expensive and struggle to match the native quality of short-video models. Existing training-free methods fall into two categories: sliding window approaches (e.g., FreeNoise) preserve local consistency but fail to capture long-range inter-frame dependencies; global manipulation approaches (e.g., FreeLong) improve quality by operating on latent variables but still suffer from artifacts (identity drift, lighting inconsistency). Moreover, most methods are designed for UNet architectures and are incompatible with state-of-the-art DiT models.
  3. Key Challenge: Long video generation faces two OOD problems: (a) frame-level relative position OOD—3D RoPE positional encodings fail to extrapolate beyond the training length; (b) context length OOD—excessively long token sequences cause softmax attention to become over-dispersed, increasing attention entropy and weakening focus on local information.
  4. Goal: How to resolve both OOD problems without retraining, while simultaneously preserving local detail and global consistency?
  5. Key Insight: The authors observe that different Transformer layers in video DiTs exhibit significantly different sensitivities to the two OOD problems—some layers are sensitive to positional shifts, while others are sensitive to context extension. Corrections should therefore be applied selectively to the most sensitive layers rather than uniformly across all layers.
  6. Core Idea: Through automatic layer sensitivity probing, selectively apply multi-granularity RoPE re-encoding to position-sensitive layers and tiered sparse attention to context-sensitive layers.

Method

Overall Architecture

FreeLOC comprises three components: (1) offline layer sensitivity probing, which automatically quantifies each layer's sensitivity to the two OOD problems; (2) VRPR, video-based relative position re-encoding that maps out-of-training-range frame-level relative positions back into the training domain; (3) TSA, tiered sparse attention that constrains the effective context length while preserving long-range dependencies. Guided by the probing results, layers sensitive only to position OOD receive VRPR, while context-sensitive layers receive VRPR+TSA (a sketch of this dispatch follows).
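
A minimal sketch of the per-layer dispatch, assuming the probing stage outputs one position-OOD score and one context-OOD score per layer; the function name and thresholds are hypothetical, since the paper specifies only the mapping from sensitivity to strategy:

```python
# Hypothetical per-layer dispatch driven by offline probing scores.
# Thresholds and names are illustrative assumptions, not the paper's values.
def assign_strategies(pos_sens, ctx_sens, pos_thresh=0.5, ctx_thresh=0.5):
    """Map per-layer sensitivity scores to correction strategies."""
    strategies = {}
    for layer, (p, c) in enumerate(zip(pos_sens, ctx_sens)):
        if c > ctx_thresh:
            strategies[layer] = ("VRPR", "TSA")  # context-sensitive: both corrections
        elif p > pos_thresh:
            strategies[layer] = ("VRPR",)        # position-only: re-encoding alone
        else:
            strategies[layer] = ()               # insensitive: leave unmodified
    return strategies
```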

Key Designs

  1. Video-based Relative Position Re-encoding (VRPR):

    • Function: Remaps frame-level relative positions that exceed the training range back into the pre-training domain.
    • Mechanism: Based on the observation that video attention decays with temporal distance, a three-tier re-encoding scheme is designed. For short-range frames (\(|i-j| \leq W_1\)), the original relative positions are preserved to ensure motion continuity. For mid-range frames (\(W_1 < |i-j| \leq W_2\)), floor quantization with group size \(G_1\) is applied: \(P = \lfloor P_{ori}/G_1 \rfloor + \text{sign}(P_{ori})(W_1 - \lfloor W_1/G_1 \rfloor)\). For long-range frames (\(|i-j| > W_2\)), more aggressive quantization with a larger group size \(G_2\) is applied, retaining only approximate ordering. The additive offsets keep the mapping continuous across tier boundaries (see the code sketch after this list).
    • Design Motivation: Clipping and grouping methods from the LLM literature ignore the hierarchical structure of video temporal dependencies—nearby frames require high-precision positional information for motion detail, while distant frames need only coarse ordering for global coherence. This design directly corresponds to the attention decay characteristic of video.
  2. Tiered Sparse Attention (TSA):

    • Function: Constrains the effective context length within the pre-training range while preserving long-range temporal dependencies.
    • Mechanism: A 4D attention mask \(\tilde{M} \in \{0,1\}^{f \times f \times n \times n}\) is constructed with a three-tier design: (a) short-range (\(|i-j| < D_1\)) uses standard dense attention windows to capture local detail; (b) mid-range (\(D_1 \leq |i-j| < D_2\)) uses striped attention, allowing only spatially proximate tokens (\(|k-l| < D_s\)) to interact, which reduces computation while extending the temporal receptive field; (c) long-range (\(|i-j| \geq D_2\)) removes direct attention, but all frames may attend to the first frame (attention sink) as a global anchor (the mask construction is sketched after this list).
    • Design Motivation: Fixed sliding-window attention preserves local detail but severs long-range dependencies. TSA exploits the empirical observation that tokens at the same spatial position exhibit high attention across frames, enabling a larger temporal receptive field while reducing token count.
  3. Layer-wise Probing Mechanism:

    • Function: Automatically quantifies each layer's sensitivity to both OOD problems, guiding the selective application of correction strategies.
    • Mechanism: For position OOD, RoPE key position indices are shifted per layer (±20, ±40) to generate probe videos, with Vision Reward and Attention Logits Difference (ALD) used to measure quality and attention change. For context OOD, a sliding window is applied per layer to constrain context length and generate probe videos, with attention entropy difference \(S_i = \|H_i^{probing} - H_i^{original}\| / \|H_i^{original}\|\) used to measure sensitivity.
    • Design Motivation: Applying uniform correction across all layers ignores inter-layer heterogeneity. Experiments show that different layers play significantly different roles (e.g., in Wan2.1, Layer 18 is insensitive to position while Layer 28 is highly sensitive).
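
The sketch below ties the three designs together in NumPy. The mid-range VRPR formula is the one given above; the long-range continuity offset at \(W_2\), the symmetric treatment of negative positions, the within-frame distance matrix used by the striped tier, and every default parameter value are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def vrpr_remap(p, W1=8, W2=24, G1=2, G2=8):
    """Tiered re-encoding of a frame-level relative position p = i - j.

    Short range keeps exact positions; mid range applies the paper's
    floor-quantization formula; long range reuses the same idea with the
    coarser group size G2 (the continuity offset at W2 is our assumption).
    Negative positions are handled symmetrically via |p| and sign(p).
    """
    a, s = abs(p), int(np.sign(p))
    mid = lambda q: q // G1 + (W1 - W1 // G1)      # mid-tier map on magnitudes
    if a <= W1:                                    # short range: motion detail
        return p
    if a <= W2:                                    # mid range: coarse but ordered
        return s * mid(a)
    return s * (a // G2 + (mid(W2) - W2 // G2))    # long range: ordering only

def tsa_mask(f, n, D1, D2, Ds, spatial_dist):
    """Build the 4D TSA mask M in {0,1}^{f x f x n x n}.

    spatial_dist is an (n, n) array of within-frame token distances;
    the exact distance metric is an assumption here.
    """
    M = np.zeros((f, f, n, n), dtype=bool)
    for i in range(f):
        for j in range(f):
            d = abs(i - j)
            if d < D1:                  # (a) dense local window
                M[i, j] = True
            elif d < D2:                # (b) striped: spatially close tokens only
                M[i, j] = spatial_dist < Ds
            # (c) d >= D2: no direct attention between these frames
    M[:, 0] = True                      # attention sink: every frame sees frame 0
    return M

def context_sensitivity(H_probe, H_orig):
    """Per-layer context-OOD score S_i = ||H_probe - H_orig|| / ||H_orig||."""
    return np.linalg.norm(H_probe - H_orig) / np.linalg.norm(H_orig)
```

With the assumed defaults, relative distances 0 to 8 pass through unchanged, 9 to 24 compress into 8 to 16, and larger distances advance only once per \(G_2\) frames, mirroring the precision-versus-distance trade-off described above.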

Loss & Training

FreeLOC is entirely training-free. VRPR and TSA are applied only at inference time. The probing process is completed offline once, after which a fixed per-layer strategy configuration is used for all subsequent generation.

Key Experimental Results

Main Results (Wan2.1-T2V-1.3B, 4× extension = 321 frames)

| Method | Subject Consist.↑ | BG Consist.↑ | Motion Smooth.↑ | Imaging Quality↑ | Aesthetic↑ | Dynamic↑ |
| --- | --- | --- | --- | --- | --- | --- |
| Direct Sampling | 98.50 | 97.89 | 98.83 | 59.21 | 49.43 | 4.32 |
| Sliding Window | 96.15 | 95.92 | 98.54 | 65.64 | 54.04 | 39.81 |
| RIFLEx | 98.41 | 97.87 | 98.86 | 59.92 | 49.67 | 4.45 |
| FreeLong | 97.88 | 97.51 | 98.91 | 63.17 | 54.56 | 21.21 |
| FreeNoise | 97.31 | 97.25 | 98.84 | 66.32 | 56.01 | 35.11 |
| FreeLOC | 98.44 | 97.78 | 98.97 | 67.44 | 61.21 | 36.27 |

Ablation Study

| Configuration | SC↑ | BC↑ | MS↑ | IQ↑ | AQ↑ | DD↑ |
| --- | --- | --- | --- | --- | --- | --- |
| Direct | 98.50 | 97.89 | 98.83 | 59.21 | 49.43 | 4.32 |
| Direct+TSA | 97.41 | 96.76 | 98.67 | 65.87 | 57.05 | 37.01 |
| Direct+VRPR | 98.42 | 97.81 | 98.89 | 61.88 | 54.13 | 15.32 |
| (TSA+VRPR)_uniform | 97.56 | 97.67 | 98.75 | 65.19 | 56.34 | 34.44 |
| (TSA+VRPR, VRPR)_random | 98.03 | 97.61 | 98.91 | 63.90 | 54.44 | 33.13 |
| FreeLOC (layer-wise) | 98.44 | 97.78 | 98.97 | 67.44 | 61.21 | 36.27 |

(SC/BC/MS/IQ/AQ/DD abbreviate the same metrics as in the main-results table.)

Key Findings

  • Layer-adaptive strategy is critical: applying TSA+VRPR uniformly to every layer reaches only 56.34 AQ, while the layer-wise strategy reaches 61.21, a gain of +4.87.
  • VRPR alone leaves consistency essentially intact (SC: 98.50 → 98.42) but lifts dynamics only modestly (DD: 4.32 → 15.32).
  • TSA applied alone substantially improves visual quality and dynamics (IQ 65.87, DD 37.01) at the cost of some consistency (SC drops to 97.41).
  • FreeLOC achieves the best balance between consistency and quality, with cross-model effectiveness also validated on HunyuanVideo.
  • Compared to alternative positional re-encoding methods, the three-tier granularity design of VRPR outperforms simple clipping and grouping.

Highlights & Insights

  • OOD perspective on long video generation degradation: decomposing the degradation into two independent OOD sources, positional extrapolation and context extension, each with a clear definition and metric, yields an elegant analytical framework.
  • Layer sensitivity probing: Sensitivity is quantified via automated experiments rather than empirical assumptions, providing a principled basis for per-layer strategy assignment. This probing methodology is transferable to other layer-wise intervention scenarios.
  • Three-tier granularity VRPR: designing positional re-encoding whose precision varies with temporal distance, grounded in the attention-decay property of video, is more faithful to video characteristics than simple truncation or grouping from the LLM literature. Likewise, TSA's use of striped attention to exploit cross-frame spatial correspondence is an elegant design choice.

Limitations & Future Work

  • The probing process requires a one-time offline analysis for each new model (generating \(M \times N\) probe videos, roughly one per layer per perturbation setting), incurring non-trivial cost.
  • The window parameters of VRPR (\(W_1, W_2, G_1, G_2\)) and the distance parameters of TSA (\(D_1, D_2\)) must be pre-specified and may require tuning for different extension ratios.
  • Only 2× and 4× extensions are evaluated; the effectiveness under more extreme extension ratios (e.g., 10×+) remains unknown.
  • No fair comparison with training-based methods is provided, as the two paradigms target different settings.
  • vs. RIFLEx: RIFLEx suppresses frame repetition by reducing intrinsic frequencies but supports only 2× extension and applies uniform treatment across all layers; FreeLOC supports 4× extension with layer-adaptive processing.
  • vs. FreeNoise: FreeNoise improves consistency via noise reuse at the cost of visual quality; FreeLOC fundamentally addresses the attention dispersion problem.
  • vs. LongDiff: LongDiff is designed for UNet and relies on heuristic mappings requiring 16× attention recomputation; FreeLOC is natively designed for DiT and is more efficient.

Rating

  • Novelty: ⭐⭐⭐⭐ The identification of two OOD problems and the layer-adaptive probing mechanism are original contributions, though the individual technical components of VRPR (multi-level quantization) and TSA (tiered sparse attention) are combinations of existing ideas.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ Dual-model validation (Wan + HunyuanVideo), extensive ablations (strategy / components / positional encoding / attention mechanism), and comprehensive quantitative and qualitative comparisons.
  • Writing Quality: ⭐⭐⭐⭐ Problem formulation is clear, probing experiment visualizations are intuitive, and the overall logic is self-consistent.
  • Value: ⭐⭐⭐⭐ Training-free long video generation offers direct practical value to the community; open-sourced code further enhances impact.