MiVE: Multiscale Vision-language features for reference-guided video Editing

Conference: ICML 2026
arXiv: 2605.14664
Code: https://mivepaper.github.io (Project page, code not explicitly open-sourced)
Area: Video Editing / Multimodal VLM / Diffusion Models
Keywords: Reference-guided, Video Editing, Multiscale VLM Features, Unified Self-Attention, DiT

TL;DR

MiVE extracts the first- and last-layer hidden states of Qwen3-VL as multi-scale condition tokens and concatenates them with VAE visual latents into one long sequence, which a DiT with unified self-attention processes for reference-guided video editing. On a 60-video 720P benchmark it obtains the highest human-preference scores and leads most of the reported VLM-judged metrics, ahead of the open-source Wan-Animate and the commercial Kling O1.

Background & Motivation

Background: Reference-guided video editing is set up as follows. Given a source video \(x_{src}\) and an editing instruction \(x_{text}\), an external image editor (e.g., FLUX.1 Kontext) modifies the first frame to produce a reference image \(x_{ref}\). The model must then propagate this edit faithfully through the whole video while preserving the motion and unedited regions of the source. Current mainstream approaches follow two paths: 1) decoupled encoders such as T5 + SigLIP, where text and vision are encoded separately and fused via cross-attention in the DiT; 2) VLMs such as Qwen3-VL / MiniCPM-V used as unified encoders (the path taken by Kling O1).

Limitations of Prior Work: Decoupled encoders suffer from a "modality gap": text and visual features reside in different semantic spaces, which the final cross-attention layer struggles to bridge. This leads to "instruction misunderstanding" or "reference misalignment" in fine-grained cross-modal reasoning. Unified VLM encoders close the modality gap, but existing methods extract only the last layer's hidden states and discard the rich local spatial details in early layers, which blurs details in hair, lighting, and textures.

Key Challenge: VLMs have an overlooked hierarchical structure: shallow layers tend to encode local spatial details (pixel-level alignment), while deep layers encode global semantics (instruction understanding). Existing methods either avoid VLMs entirely (losing a unified semantic space) or use only the last layer (losing spatial details); none uses both ends at once. Cross-attention is also inherently asymmetric: visual tokens query text, but text tokens never attend to vision, so fine-grained bidirectional correspondence is out of reach.

Goal: (1) Verify the hypothesis that VLM shallow layers encode spatial details while deep layers encode semantics; (2) Design a video editing framework that utilizes both shallow and deep VLM features; (3) Replace cross-attention with a truly symmetric unified attention mechanism.

Key Insight: The authors quantify the attention concentration of text tokens on visual tokens at each layer using a "cross-modal diagnostic matrix" \(A_{txt \to vis}^{(l)} = E B^{\top}\). Combined with person masks generated by SAM2, they calculate the Attention Mask Ratio. Results show that for Qwen3-VL, \(R_{mask} \approx 0.37\) at Layer 0 but drops to \(0.23\) at the final layer. Shallow layers precisely locate character contours, while deep layer attention follows a diffuse global pattern. This provides direct evidence for the multi-scale design.
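The summary does not spell out how the Attention Mask Ratio is computed. The sketch below is one plausible reading, assuming \(R_{mask}\) is the share of text-to-visual attention mass that lands on visual tokens inside the person mask. The function name and the toy numbers are hypothetical and are not taken from the paper.

```python
def attention_mask_ratio(attn, mask):
    """Share of text->visual attention mass that falls inside a binary mask.

    attn: rows are text tokens, columns are visual tokens (non-negative weights).
    mask: one 0/1 entry per visual token (1 = token lies inside the person mask).
    """
    inside = 0.0
    total = 0.0
    for row in attn:
        if len(row) != len(mask):
            raise ValueError("each attention row needs one weight per visual token")
        inside += sum(w for w, m in zip(row, mask) if m)
        total += sum(row)
    return inside / total if total > 0 else 0.0


# Toy example: 2 text tokens attending over 4 visual tokens.
mask = [1, 1, 0, 0]                      # first two visual tokens are on the person
focused = [[0.4, 0.4, 0.1, 0.1],         # attention concentrated on the mask
           [0.5, 0.3, 0.1, 0.1]]
diffuse = [[0.25, 0.25, 0.25, 0.25],     # attention spread evenly
           [0.25, 0.25, 0.25, 0.25]]

r_focused = attention_mask_ratio(focused, mask)
r_diffuse = attention_mask_ratio(diffuse, mask)
```

Under this reading, a layer whose attention hugs the character contour scores higher than one with a diffuse pattern, which is the contrast the reported 0.37 vs 0.23 values describe.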

Core Idea: Extract VLM first + last layers → Project into condition tokens → Concatenate with visual latents into a long sequence → Unified self-attention throughout, using a shared attention manifold to simultaneously achieve "local detail propagation" and "global semantic understanding."

Method

Overall Architecture

The input to MiVE consists of source video \(x_{src}\), text instruction \(x_{text}\), and reference image \(x_{ref}\) (the edited first frame); it outputs the edited video \(\hat{x}_{tgt}\). The pipeline consists of three stages:

Multi-Level Context Extraction: Feed \(\{x_{text}, x_{ref}, x_{src}\}\) together into a frozen Qwen3-VL-8B. Extract hidden states from the first layer and the \(L\)-th layer (\(\phi_1, \phi_L \in \mathbb{R}^{S \times D_{VLM}}\)). Project them via RMSNorm + Linear to \(\mathbb{R}^{S \times D/2}\), concatenate along the feature dimension, and pass through a fusion linear layer to obtain condition tokens \(c \in \mathbb{R}^{N_c \times D}\).
Reference-Aware Latent Encoding: Encode \(x_{src}, x_{tgt}, x_{ref}\) into latents using a frozen VAE. During training, the reference latent \(z_{ref}\) is prepended along the temporal dimension to both the noisy target \(\tilde z_t\) and the control \(z_{src}\) branches. The two branches are then concatenated along the channel dimension, resulting in a shape of \((T'+1) \times 2C \times H' \times W'\). This allows the model to "see" the reference image as an appearance anchor from the very first frame.
Unified Self-Attention Backbone: Concatenate condition tokens \(c\) and patchified visual tokens \(v\) into \(u^{(0)} = [c; v] \in \mathbb{R}^{(N_c + N_v) \times D}\). The sequence passes through unified self-attention within DiT blocks, without cross-attention. A key trick is per-token AdaLN: clean tokens (condition + reference frame patches) use fixed time embeddings for \(t=0\), while noisy tokens (target video patches) use embeddings for the current diffusion timestep \(t\). The model is initialized from Wan2.1-T2V-14B self-attention blocks and trained via flow matching.
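To get a feel for how long the unified sequence \([c; v]\) becomes, the following sketch counts visual tokens for an 81-frame 720P clip. The compression factors (4x temporal, 8x spatial), the 2x2 spatial patch size, the 1280x720 resolution, and the condition-token count `n_c` are all assumptions made for illustration; the summary does not state them.

```python
def visual_token_count(frames, height, width,
                       t_stride=4, s_stride=8, patch=2):
    """Count patchified visual tokens, including the prepended reference frame.

    The strides and patch size are assumed values, not figures from the summary.
    """
    latent_frames = (frames - 1) // t_stride + 1     # T'
    latent_h = height // s_stride                    # H'
    latent_w = width // s_stride                     # W'
    tokens_per_frame = (latent_h // patch) * (latent_w // patch)
    # +1 latent frame for the reference prepended along the temporal axis
    return (latent_frames + 1) * tokens_per_frame, latent_frames, tokens_per_frame


n_v, t_latent, per_frame = visual_token_count(frames=81, height=720, width=1280)
n_c = 512  # hypothetical number of condition tokens
seq_len = n_c + n_v
```

Under these assumptions the backbone attends over roughly 80K tokens per clip, almost all of them visual.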

flowchart TD
    IN["Input: Instruction + Reference Image + Source Video"]
    subgraph CTX["Multi-Level Context Extraction"]
        direction TB
        Q["Qwen3-VL-8B (frozen) forward pass"] --> L1["Layer 1 φ1: Spatial Details"]
        Q --> LL["Layer L φL: Global Semantics"]
        L1 --> FU["RMSNorm+Linear per layer<br/>Concat along channel → Fusion linear"]
        LL --> FU
    end
    subgraph LAT["Reference-Aware Latent Encoding"]
        direction TB
        VAE["VAE Encoding z_ref / z_src / z_tgt"] --> PRE["Prepend z_ref to both branches temporally<br/>Concat noisy target and control along channel"]
        PRE --> PE["Patch embedding → Visual tokens v"]
    end
    IN --> CTX
    IN --> LAT
    subgraph BK["Unified Self-Attention Backbone + per-token AdaLN"]
        direction TB
        U["u = [c ; v] long sequence"] --> DIT["Unified self-attention in P DiT blocks<br/>Clean tokens use t=0, noisy tokens use current t"]
    end
    FU -->|"condition tokens c"| U
    PE -->|"visual tokens v"| U
    DIT --> OUT["Extract visual tokens, unpatchify<br/>Discard condition and reference → VAE decode"]
    OUT --> VID["Edited Video x̂_tgt"]

Key Designs

Multi-Level Context Extraction:
- Function: Enables condition signals to carry both spatial details from shallow VLM layers and global semantics from deep layers.
- Mechanism: Hidden states from the same Qwen3-VL forward pass are extracted at Layer 1 and Layer \(L\). After projection, they are concatenated along the channel dimension: \(c_{raw} = \text{Concat}_D(\tilde\phi_1, \tilde\phi_L)\), then processed by \(\text{Linear}_{fuse}\). Choosing the two ends instead of uniform sampling is based on diagnostic experiments showing \(R_{mask}\) extrema near Layer 0 and Layer \(L\), with middle layers being redundant.
- Design Motivation: Solves the issue where unified VLMs lose spatial details by only using the last layer. Since \(c\) is independent of timestep \(t\), it acts as a fixed "semantic anchor."
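The projection and fusion step can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions: the RMSNorm has no learned gain, the linear layers have no bias, weights are random, and the sizes are toy values rather than the real \(D_{VLM}\) and \(D\). All function names are hypothetical.

```python
import math
import random


def rms_norm(x, eps=1e-6):
    """RMSNorm without a learned gain: scale a vector to unit root-mean-square."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]


def linear(x, weight):
    """Apply a bias-free linear map; weight has shape [out_dim][in_dim]."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def random_weight(out_dim, in_dim, rng):
    """Small random weights standing in for trained parameters."""
    return [[rng.gauss(0.0, 0.02) for _ in range(in_dim)] for _ in range(out_dim)]


def fuse_levels(phi_first, phi_last, w_first, w_last, w_fuse):
    """Project each level to D/2, concatenate per token, then fuse to D."""
    tokens = []
    for h_first, h_last in zip(phi_first, phi_last):
        p_first = linear(rms_norm(h_first), w_first)      # D_VLM -> D/2
        p_last = linear(rms_norm(h_last), w_last)         # D_VLM -> D/2
        tokens.append(linear(p_first + p_last, w_fuse))   # concat (D) -> D
    return tokens


rng = random.Random(0)
S, D_VLM, D = 5, 8, 6          # toy sizes, not the real model dimensions
phi_1 = [[rng.gauss(0, 1) for _ in range(D_VLM)] for _ in range(S)]
phi_L = [[rng.gauss(0, 1) for _ in range(D_VLM)] for _ in range(S)]
c = fuse_levels(phi_1, phi_L,
                random_weight(D // 2, D_VLM, rng),
                random_weight(D // 2, D_VLM, rng),
                random_weight(D, D, rng))
```

In this sketch each of the `S` positions yields one condition token of width `D`, and nothing in the computation depends on the diffusion timestep.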
Reference-Aware Latent Encoding (Temporal Prepend + Dual-Branch Channel Concat):
- Function: Provides both appearance anchors (from reference) and motion anchors (from source) without introducing masks.
- Mechanism: During training, \(z_t = \text{Concat}_C([z_{ref}; \tilde z_t], [z_{ref}; z_{src}])\). Both branches start with \(z_{ref}\) in the temporal dimension. During inference, the noisy target branch is initialized with \(\tilde z_T \sim \mathcal{N}(0, I)\), while the control branch uses the real source video.
- Design Motivation: Makes \(z_{ref}\) both an appearance and structural anchor. Temporal prepending ensures every attention layer sees it, while dual-branch channel concatenation aligns control (source) and target (to-be-generated) signals at the latent level, avoiding mask inaccuracies in fast-motion scenarios.
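The temporal prepend followed by the dual-branch channel concatenation can be checked with nested lists laid out as `[T][C][H][W]`. The sizes and fill values below are toy placeholders and the helper names are hypothetical; the point is only to confirm the resulting \((T'+1) \times 2C \times H' \times W'\) shape.

```python
def filled(t, c, h, w, fill=0.0):
    """Nested-list latent with layout [T][C][H][W]."""
    return [[[[fill] * w for _ in range(h)] for _ in range(c)] for _ in range(t)]


def shape(z):
    """Return (T, C, H, W) of a nested-list latent."""
    return (len(z), len(z[0]), len(z[0][0]), len(z[0][0][0]))


def prepend_reference(z_ref, z_video):
    """Temporal prepend: the single reference frame becomes frame 0."""
    return z_ref + z_video


def concat_channels(a, b):
    """Channel concat: per frame, append the channels of b to those of a."""
    if len(a) != len(b):
        raise ValueError("both branches need the same number of frames")
    return [frame_a + frame_b for frame_a, frame_b in zip(a, b)]


T_lat, C, H_lat, W_lat = 4, 3, 2, 5                  # toy sizes
z_ref = filled(1, C, H_lat, W_lat, fill=1.0)         # reference latent (one frame)
z_noisy = filled(T_lat, C, H_lat, W_lat, fill=0.5)   # stand-in for the noisy target
z_src = filled(T_lat, C, H_lat, W_lat, fill=2.0)     # source (control) latent

target_branch = prepend_reference(z_ref, z_noisy)
control_branch = prepend_reference(z_ref, z_src)
z_in = concat_channels(target_branch, control_branch)
```

Frame 0 of `z_in` carries the reference latent in both channel halves, while later frames pair the noisy target with the source at the same spatial location.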
Unified Self-Attention + per-token AdaLN:
- Function: Replaces asymmetric cross-attention with a symmetric long-sequence self-attention, allowing condition and visual tokens to query each other in the same space.
- Mechanism: \(u^{(0)} = [c; v]\) enters \(P\) DiT blocks. Each token determines its own AdaLN modulation: condition and reference patches use \(t=0\) embeddings ("always clean"), while target patches use the current \(t\). Only \(u^{(P)}[N_c:]\) is taken from the output for unpatchification.
- Design Motivation: Unified self-attention is naturally symmetric, allowing the model to learn modality alignment. Per-token AdaLN solves the engineering problem where clean signals are "polluted" by noisy time embeddings, keeping the reference frame stable.
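The per-token timestep assignment and the output slicing can be sketched as below. It assumes the sequence is ordered as condition tokens, then reference-frame patches, then target patches, which follows from \(u = [c; v]\) with the reference prepended temporally if patches are flattened frame by frame. The function names and token counts are hypothetical.

```python
def per_token_timesteps(n_cond, n_ref, n_target, t):
    """Timestep seen by each token's AdaLN: 0 for clean tokens, t for noisy ones."""
    return [0.0] * (n_cond + n_ref) + [float(t)] * n_target


def keep_target_tokens(sequence, n_cond, n_ref):
    """Drop condition tokens and reference-frame patches from the output."""
    return sequence[n_cond + n_ref:]


n_cond, n_ref, n_target = 3, 2, 4                 # toy token counts
timesteps = per_token_timesteps(n_cond, n_ref, n_target, t=0.7)
outputs = [f"tok{i}" for i in range(n_cond + n_ref + n_target)]
kept = keep_target_tokens(outputs, n_cond, n_ref)
```

Each entry of `timesteps` would be embedded and used to modulate only its own token, so clean tokens never receive the noisy-step embedding.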

Loss & Training

Training uses the flow matching objective (Lipman et al., 2023). The model is initialized from the Wan2.1-T2V-14B self-attention blocks and trained on 8 H100s at 720P / 81 frames for 8000 steps (~2 epochs, ~65 hours), with the AdamW optimizer, a learning rate of \(3 \times 10^{-5}\), and a 200-step warmup. The training set has 30K pairs: 24K from OpenVE-3M filtered for a Qwen3-VL score \(\ge 9.3\), plus 6K synthetic portrait pairs with foreground segmentation.
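A minimal sketch of a flow matching training target is given below. The summary only states that flow matching is used, so the linear interpolation path and the time convention (\(t=0\) clean, \(t=1\) pure noise, chosen to match the "clean tokens use \(t=0\)" description) are assumptions, as are the function names and toy values.

```python
import random


def flow_matching_pair(x_clean, noise, t):
    """Return (x_t, target velocity) for a linear path with t=0 clean, t=1 noise."""
    x_t = [(1.0 - t) * x + t * n for x, n in zip(x_clean, noise)]
    velocity = [n - x for x, n in zip(x_clean, noise)]   # d x_t / d t
    return x_t, velocity


def mse(pred, target):
    """Mean squared error between predicted and target velocity."""
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(pred)


rng = random.Random(0)
x_clean = [0.2, -1.0, 0.5]                       # toy "clean target latent"
noise = [rng.gauss(0.0, 1.0) for _ in x_clean]
x_t, v_target = flow_matching_pair(x_clean, noise, t=0.3)

# A perfect model would predict v_target exactly, giving zero loss.
loss_perfect = mse(v_target, v_target)
```

In a setup like the one described, only the noisy target tokens would be interpolated this way, while the reference and source latents stay clean.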

Key Experimental Results

Main Results

Benchmark: 60 720P videos, split into Simple (30, RoseBench/VPBench) and Complex (30, portrait videos with lighting/background changes). Evaluation via Gemini-3-Flash on 6 dimensions (IA / CC / TS / PR / VA / SC) and a 30-person user study (holistic score 1-5).

Subset	Method	IA	CC	TS	VA	SC	User
Simple	VACE	7.06	7.12	6.45	6.39	7.02	2.67
Simple	Kling O1	8.48	9.03	8.91	8.51	9.31	3.69
Simple	Ours	9.30	8.65	8.81	8.83	9.46	4.18
Complex	Wan-Animate	8.87	7.78	7.83	7.73	8.98	3.03
Complex	Kling O1	8.68	7.71	8.11	7.74	9.14	3.61
Complex	Ours	9.23	8.05	8.27	8.09	9.22	3.75

On the Simple subset, MiVE ranks first in IA, VA, SC, and the user score, while Kling O1 leads on CC and TS. On the Complex subset, MiVE leads every reported column.

Ablation Study

Configuration	IA	CC	TS	VA	SC	Description
Decoupled Enc. + Dual Cross-Attn	6.76	6.10	5.88	5.87	7.45	Legacy decoupled baseline
Unified Enc. (last layer) + Dual Cross-Attn	8.51	8.24	7.68	7.42	8.03	Last layer only + cross-attn
Unified Enc. + Fused Cross-Attn	8.53	8.22	7.87	8.08	9.00	Single-branch cross-attn

Switching from decoupled encoders to a unified VLM encoder improves IA, CC, TS, and VA by more than 1.5 points each (SC gains 0.58). Replacing cross-attention with unified self-attention (the complete MiVE) further pushes IA from 8.53 to 9.23.
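The per-metric gains from the first two ablation rows can be recomputed directly from the table; the snippet below only re-derives differences from the reported scores.

```python
metrics = ["IA", "CC", "TS", "VA", "SC"]
decoupled = [6.76, 6.10, 5.88, 5.87, 7.45]   # Decoupled Enc. + Dual Cross-Attn
unified = [8.51, 8.24, 7.68, 7.42, 8.03]     # Unified Enc. (last layer) + Dual Cross-Attn

# Gain per metric when switching to the unified encoder, rounded to 2 decimals.
gains = {m: round(u - d, 2) for m, d, u in zip(metrics, decoupled, unified)}
above_threshold = [m for m in metrics if gains[m] > 1.5]
```

Four of the five reported metrics clear the 1.5-point mark, with SC as the exception.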

Key Findings

The Attention Mask Ratio of the shallowest VLM layer is clearly higher than that of the last layer (about 0.37 vs 0.23 for Qwen3-VL), supporting the "shallow spatial / deep semantic" hypothesis.
Qwen3-VL has stronger shallow localization than GLM-4.6V (0.37 vs 0.33).
In complex scenes (fast motion/lighting changes), MiVE's identity preservation is more stable than Wan-Animate, reflecting the value of the temporal prepend design.
The authors intentionally do not report SSIM/LPIPS, arguing that edited videos should differ from source videos, making structural similarity metrics inappropriate.

Highlights & Insights

Diagnostic Motivation: Using an \(E B^{\top}\) matrix plus SAM2 masks to quantify where each layer focuses turns intuition into data and gives the multi-scale design a persuasive justification.
Unified Self-Attention over Cross-Attention: Aligned with Z-Image / FLUX, it explicitly uses per-token AdaLN to encode the "reference is clean" prior into time embeddings.
Two-End Multi-Scale Selection: Computationally efficient compared to sampling every \(k\) layers, while catching the extremes of the \(R_{mask}\) gradient.

Limitations & Future Work

Small training set (30K pairs) from synthetic/filtered sources; generalization to complex physical phenomena (fluids/reflections) is untested.
High inference cost: 6.5 mins for 81 frames using 50GB VRAM on H100; lacks discussion on acceleration (distillation/pruning).
Metric Bias: Using Qwen3-VL as the backbone and Gemini-3-Flash as the judge might introduce preference bias towards specific instruction-following styles.

vs VACE/VideoPainter (mask-guided): These rely on precise masks for control, which fail in fast motion; MiVE is mask-free, inferring edit regions from instructions.
vs LucyEdit/Wan-Animate (mask-free, single-layer VLM): These lose shallow spatial details; MiVE recovers them via multi-scale tokens.
vs Kling O1 (commercial): MiVE outperforms it in IA (instruction adherence), suggesting single-layer + cross-attn is a bottleneck.

Rating

Novelty: ⭐⭐⭐⭐ Systematic multi-scale VLM + unified self-attention in video editing is a first, though components exist elsewhere.
Experimental Thoroughness: ⭐⭐⭐⭐ Solid benchmark and ablation; however, small data scale and limited in-the-wild testing.
Writing Quality: ⭐⭐⭐⭐⭐ Excellent diagnostic motivation and logical flow regarding layer selection.
Value: ⭐⭐⭐⭐ Establishes a strong baseline for using VLMs as multi-scale unified encoders in video generation tasks.