
AutoCut: End-to-end Advertisement Video Editing Based on Multimodal Discretization and Controllable Generation

Conference: CVPR 2026 · arXiv: 2603.28366 · Code: https://github.com/AdAutoCut/Autocut · Area: Video Understanding / Video Editing · Keywords: video editing, Multimodal LLM, Residual VQ, Advertisement, Controllable Generation

TL;DR

AutoCut proposes an end-to-end advertisement video editing framework that unifies video, audio, and text in a shared discrete token space via Residual Vector Quantization (RQVAE), then performs multimodal alignment and supervised fine-tuning on Qwen3-8B. A single unified model handles four tasks—clip selection, clip ordering, script generation, and background music selection—and surpasses GPT-4o baselines across multiple metrics.

Background & Motivation

Background: Short-form video has become the dominant medium for digital advertising, yet the production pipeline—covering scripting, shooting, editing, and post-processing—remains costly and technically demanding.

Three Major Obstacles in Prior Work:

  • Loose multimodal coupling: Weak alignment among video, audio, and text representations prevents unified reasoning.
  • Lack of interpretable control: Models provide no structured or discrete representations, making it difficult to adjust narrative pacing and content emphasis.
  • Disconnected understanding and generation: Multimodal understanding and generation are treated as separate processes with inconsistent optimization objectives.

Opportunities and Limitations of MLLMs: Multimodal large language models hold promise for unifying perception, understanding, and creation, but are constrained by context window length, making large-scale video retrieval and editing difficult.

Core Idea: Video and audio features are discretized into tokens via RQVAE and unified with text tokens into a shared vocabulary, enabling the LLM to perform multimodal reasoning and generation within a single token space.

Method

Overall Architecture

Two-stage training:

  1. Multimodal Alignment: The LLM backbone is frozen; only the newly introduced multimodal embedding layers are updated (~700K samples). A minimal sketch of this setup follows the list.
  2. Supervised Fine-Tuning (SFT): Full-parameter fine-tuning for task-specific behavior learning (~100K curated samples).
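
To make stage 1 concrete, here is a minimal PyTorch sketch of freezing the backbone while training only newly appended embedding rows. The model id is the paper's backbone, but the token counts, gradient masking, and training step are illustrative assumptions, not AutoCut's released code.

```python
from transformers import AutoModelForCausalLM

NUM_MM_TOKENS = 2 * 8 * 256  # assumption: video + audio codebooks, 8 levels x 256 codes each

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-8B")
old_vocab = model.get_input_embeddings().num_embeddings
model.resize_token_embeddings(old_vocab + NUM_MM_TOKENS)  # append multimodal token rows

# Freeze the backbone; leave only the (resized) embedding tables trainable.
for p in model.parameters():
    p.requires_grad = False
model.get_input_embeddings().weight.requires_grad = True
model.get_output_embeddings().weight.requires_grad = True

# Keep the pretrained text rows fixed so only the new multimodal rows update.
def _mask_old_rows(grad):
    grad = grad.clone()
    grad[:old_vocab] = 0
    return grad

model.get_input_embeddings().weight.register_hook(_mask_old_rows)

def alignment_step(input_ids):
    """One NTP step: L = -sum_t log P(x_t | x_<t); HF shifts labels internally."""
    return model(input_ids=input_ids, labels=input_ids).loss
```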

At inference: The LLM generates a token sequence → video tokens retrieve nearest-neighbor clips; audio tokens are decoded; text is output directly → ffmpeg compositing produces the final video.

Key Designs

  1. Multimodal Encoding and Discretization:

    • Video Encoder: ResNet-50 (pretrained with contrastive learning), extracting frame-level semantic embeddings.
    • Audio Encoder: PANNs (Wavegram-Logmel-CNN), pretrained on AudioSet.
    • RQVAE Discretization: Residual vector quantization compresses continuous embeddings into discrete tokens.
      • Codebook size: \(256 \times 8\) (each frame/audio segment encoded as 8 tokens).
      • Reconstruction quality: video cosine similarity 0.89, audio 0.96.
      • Training loss: \(\mathcal{L}_{rec} = 1 - \cos(\hat{f}, f)\)
    • Design Motivation: RQVAE achieves efficient compression through successive residual approximation; the 8-token configuration strikes a favorable balance between reconstruction quality and sequence length (a minimal sketch follows this list).
  2. Unified Token Space:

    • Video tokens, audio tokens, and text tokens share an expanded vocabulary.
    • The multimodal alignment stage is trained with standard NTP loss: \(\mathcal{L}_{NTP} = -\sum_t \log P(x_t | x_{<t})\)
    • The LLM backbone is frozen during alignment; only the new embedding layers are updated, ensuring stable training.
    • Design Motivation: A unified token space reduces cross-modal reasoning to a sequence modeling problem.
  3. Unified Modeling of Four Tasks:

    • Clip Selection: Selecting relevant segments from a candidate pool (CSA metric).
    • Clip Ordering: Arranging segments into a coherent temporal sequence (CRA metric).
    • Script Generation: Generating advertisement copy aligned with visual content (SQ + WCD metrics).
    • Background Music Selection: Retrieving BGM matching the multimodal context (MSS metric).
  4. Retrieval and Rendering:

    • Video tokens → FAISS nearest-neighbor search to match clips in the asset library.
    • Audio tokens → decoded or retrieved.
    • ffmpeg splicing, transitions, and subtitle overlay → final MP4 output (a retrieval-and-render sketch follows).
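
To ground item 1, here is a minimal NumPy sketch of residual vector quantization with the stated 8-level × 256-entry codebook shape. The codebooks are random stand-ins and codebook training (e.g., EMA or commitment updates) is omitted; this only shows how one feature vector becomes 8 tokens.

```python
import numpy as np

LEVELS, CODES, DIM = 8, 256, 512                 # 8 tokens per frame/segment; DIM is assumed
codebooks = np.random.randn(LEVELS, CODES, DIM).astype(np.float32)

def rq_encode(f):
    """Quantize one feature vector into LEVELS tokens by successive residual approximation."""
    tokens, recon = [], np.zeros_like(f)
    for level in range(LEVELS):
        residual = f - recon                      # what earlier levels have not explained
        idx = int(np.argmin(np.linalg.norm(codebooks[level] - residual, axis=1)))
        tokens.append(idx)
        recon += codebooks[level, idx]            # refine the running reconstruction
    return tokens, recon

def rec_loss(f, f_hat):
    """L_rec = 1 - cos(f_hat, f), matching the training loss above."""
    return 1.0 - float(f @ f_hat / (np.linalg.norm(f) * np.linalg.norm(f_hat) + 1e-8))

f = np.random.randn(DIM).astype(np.float32)
tokens, recon = rq_encode(f)
print(tokens, rec_loss(f, recon))                 # 8 code indices + reconstruction quality
```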
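
And for item 4, a hedged sketch of the retrieval-and-render step: cosine nearest-neighbor lookup with FAISS over a clip-embedding library, then an ffmpeg concat call. The index layout, file naming, and concat recipe are assumptions, not AutoCut's actual pipeline.

```python
import subprocess
import faiss
import numpy as np

DIM = 512
clip_embs = np.random.randn(1000, DIM).astype(np.float32)  # asset library (stand-in)
faiss.normalize_L2(clip_embs)
index = faiss.IndexFlatIP(DIM)                             # cosine sim via normalized inner product
index.add(clip_embs)

def retrieve_clips(query_embs):
    """Map decoded video-token embeddings back to nearest library clip ids."""
    faiss.normalize_L2(query_embs)
    _, ids = index.search(query_embs, 1)
    return ids[:, 0].tolist()

def render(clip_paths, out_path="ad_final.mp4"):
    """Splice the retrieved clips with ffmpeg's concat demuxer."""
    with open("clips.txt", "w") as fh:
        fh.writelines(f"file '{p}'\n" for p in clip_paths)
    subprocess.run(["ffmpeg", "-y", "-f", "concat", "-safe", "0",
                    "-i", "clips.txt", "-c", "copy", out_path], check=True)
```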

Training Data

  • Alignment Data: ~700K filtered advertisement videos (high engagement, with voiceover).
  • SFT Data: ~100K high-quality curated samples (duration <120s, clips 2–60s, high visual-text relevance assessed by Qwen-VL).
  • Data processing: ASR extraction of aligned timestamps, 1 fps frame sampling, and pydub audio separation (a minimal sketch follows).
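
A minimal sketch of the sampling and separation steps above, using ffmpeg for 1 fps frame extraction and pydub for pulling the audio track; paths are illustrative, and the ASR and relevance-filtering stages are omitted.

```python
import os
import subprocess
from pydub import AudioSegment

def sample_frames(video_path, out_dir="frames"):
    """Extract frames at 1 fps for the video encoder."""
    os.makedirs(out_dir, exist_ok=True)
    subprocess.run(["ffmpeg", "-y", "-i", video_path, "-vf", "fps=1",
                    os.path.join(out_dir, "frame_%05d.jpg")], check=True)

def extract_audio(video_path, out_path="audio.wav"):
    """Pull the audio track so voiceover and BGM can be processed separately."""
    AudioSegment.from_file(video_path).export(out_path, format="wav")
```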

Key Experimental Results

Main Results (364 test videos)

| Method                 | CSA↑  | CRA↑  | VSC↑  | SQ↑  | WCD↓  | MSS↑  |
|------------------------|-------|-------|-------|------|-------|-------|
| Qwen3-8B (Caption)     | 0.137 | 0.016 | 0.931 | 80.0 | 5.26  | –     |
| Qwen3-8B (Caption+SFT) | 0.569 | 0.030 | 1.123 | 59.2 | 6.82  | –     |
| Qwen2.5-VL-32B         | 0.665 | 0.025 | 0.998 | 78.3 | 12.51 | –     |
| GPT-4o + MGSV          | 0.269 | 0.078 | 1.136 | 83.0 | 7.75  | 0.266 |
| AutoCut                | 0.659 | 0.107 | 1.036 | 84.6 | 3.02  | 0.348 |

Ablation Study

| Configuration   | CSA↑  | CRA↑  | VSC↑  | SQ↑  | WCD↓ |
|-----------------|-------|-------|-------|------|------|
| SFT only        | 0.478 | 0.082 | 1.004 | 83.2 | 4.43 |
| emb+full+sft    | 0.717 | 0.058 | 0.967 | 79.0 | 4.50 |
| emb+sft (Ours)  | 0.659 | 0.107 | 1.036 | 84.6 | 3.02 |

Key Findings

  • AutoCut achieves substantially higher CRA (clip ordering accuracy) than all baselines (0.107 vs. 0.078), demonstrating that tokenized multimodal representations better capture temporal structure.
  • WCD (script–video temporal consistency) of 3.02 far outperforms GPT-4o's 7.75, reflecting the temporal alignment advantage of joint multimodal modeling.
  • In human evaluation, AutoCut outperforms GPT-4o on all 5 dimensions (88% overall win rate).
  • Adding an extra pretraining stage (emb+full+sft) degrades CRA and SQ, indicating that limited-quality pretraining corpora introduce noise.
  • Significant cost advantage: processing 100 videos costs AutoCut ~$0.015, versus ~$2.5 for GPT-4o.

Highlights & Insights

  • "Discretization as unification": By mapping all modalities into a shared token space via RQVAE, the problem reduces elegantly to next-token prediction.
  • The two-stage training strategy (alignment + SFT) outperforms the three-stage variant (alignment + pretraining + SFT), confirming that data quality matters more than data quantity.
  • The dual-track design—low-frame-rate tokens for reasoning and high-frame-rate frames for retrieval—balances efficiency and precision.
  • The inclusion of BGM selection is a notable contribution, as audio selection has long been neglected in video editing research.

Limitations & Future Work

  • Fine-grained synchronization between video motion and audio rhythm remains imperfect, with occasional desynchronization.
  • Control granularity is limited to the clip level; frame-level or emotion-level editing is not supported.
  • Although RQVAE achieves high cosine similarity in reconstruction, the effect of information loss may be amplified in downstream tasks.
  • Evaluation relies on GPT-4o as a judge (VSC, SQ metrics), introducing potential assessment bias.
  • The discretization + retrieval paradigm is generalizable to other video creation contexts (short dramas, vlogs, etc.).

Comparison with Related Work

  • VC-LLM also employs MLLMs for advertisement video generation but relies on multi-resolution spatiotemporal reasoning.
  • MGSV is the only baseline with audio (BGM) matching capability.
  • Compared to "any-modality" LLMs such as NExT-GPT, AutoCut is more focused on the practical constraints of editing scenarios.

Rating

  • Novelty: ⭐⭐⭐⭐ The unified multimodal discretization framework is a novel approach to advertisement editing, though the core components are relatively mature.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Automatic metrics + human evaluation + ablation study, though the test set comprises only 364 videos.
  • Writing Quality: ⭐⭐⭐⭐ The framework is described clearly, but definitions of evaluation metrics are somewhat scattered.
  • Value: ⭐⭐⭐⭐ The work has direct practical value for automated advertisement video production, with a significant cost advantage.