
Composing Concepts from Images and Videos via Concept-prompt Binding

Conference: CVPR 2026
arXiv: 2512.09824
Code: Project Page
Area: LLM / NLP (Other)
Keywords: visual concept composition, diffusion Transformer, video personalization, concept binding, temporal decoupling

TL;DR

Bind & Compose (BiCo) is a one-shot method that binds visual concepts to prompt tokens via hierarchical binders and flexibly composes concepts from images and videos through token-level prompt composition, comprehensively outperforming prior work in concept consistency, prompt fidelity, and motion quality.

Background & Motivation

Background: Visual concept composition aims to integrate elements from different images and videos into a coherent output, serving as a fundamental capability for visual creation and filmmaking. With the development of DiT-based T2V diffusion models (e.g., Wan2.1), concept localization and customization capabilities have improved significantly.

Limitations of Prior Work: (i) Insufficient concept extraction precision — existing methods (LoRA/learnable embedding + masks) struggle to disentangle complex concepts with occlusion and temporal variation, and cannot extract non-object concepts (e.g., style); (ii) Limited image-video concept composition flexibility — prior work is restricted to subject from images + motion from videos, unable to flexibly compose arbitrary attributes (visual style, lighting changes, etc.).

Key Challenge: The challenges of precise concept decomposition (without mask input) and cross-modal concept composition (image + video) are mutually coupled.

Goal: Enable flexible extraction and composition of arbitrary visual concepts (including non-object concepts such as style and motion) from images and videos.

Key Insight: Leverage T2V diffusion models' concept localization capability by binding text tokens to corresponding visual concepts (one-shot training), then compose concepts through token-level composition.

Core Idea: First bind visual concepts to prompt tokens (Bind), then select bound tokens from different sources and compose them into the target prompt (Compose), achieved through hierarchical binder structures + diversified absorption mechanism + temporal decoupling strategy.

Method

Overall Architecture

BiCo is built on the Wan2.1-T2V-1.3B model, with a two-phase workflow:

  1. Concept Binding: For each visual input, lightweight binder modules learn the mapping between text tokens and the corresponding visual concepts.

  2. Concept Composing: Different parts of the target prompt are updated through their corresponding binders, composing multi-source visual information into the updated prompt.

The core operation is DiT's cross-attention conditional injection:

\[\mathbf{x}_{out} = \text{cross\_attention}(\mathbf{x}_{in}, \mathbf{p}, \mathbf{p})\]
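
A minimal sketch of this conditioning path is given below, with the prompt tokens \(\mathbf{p}\) acting as keys and values; the module and argument names are my own assumptions, not Wan2.1's implementation.

```python
# Minimal sketch of cross-attention conditional injection in a DiT block.
# Illustrative only: names and layer choices are assumptions, not Wan2.1 code.
import torch
import torch.nn as nn

class CrossAttentionInjection(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x_in: torch.Tensor, p: torch.Tensor) -> torch.Tensor:
        # x_in: (B, N_latent, dim) video latent tokens
        # p:    (B, N_text, dim) prompt tokens; editing p (e.g. via a binder)
        #       changes which visual concepts are injected into the latents.
        x_out, _ = self.attn(query=x_in, key=p, value=p)
        return x_out
```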

Key Designs

  1. Hierarchical Binder Structure: Comprises a global binder \(f_g(\cdot)\) and per-block binders \(f_l^i(\cdot)\). Each binder is a residual MLP with a zero-initialized scaling factor: \(f(\mathbf{p}) = \mathbf{p} + \gamma \cdot \text{MLP}(\mathbf{p})\). Since DiT blocks behave differently across the denoising process, the hierarchical design combines global correlation with block-specific adjustment. It is coupled with a two-stage reverse training strategy: first strengthen the global binder at high noise levels (\(\geq \alpha\), with \(\alpha = 0.875\)), then jointly train all binders. A sketch of the binder modules follows this list.

  2. Diversified-Absorption Mechanism (DAM): Addresses concept-token binding precision in the one-shot setting. A VLM (Qwen2.5-VL) extracts spatial and temporal key concepts and generates diversified prompts (keeping key concept words fixed). Learnable absorption tokens \(p_a^j\) absorb concept-irrelevant visual details during training; these tokens are discarded at inference to suppress unwanted details.

  3. Temporal Decoupling Strategy (TDS): Addresses image-video temporal heterogeneity. Video concept training is divided into two stages: Stage 1 trains on single frames (aligned with image concept training), Stage 2 trains on full videos with a dual-branch binder structure:

\[\text{MLP}(\mathbf{p}) \leftarrow (1-g(\mathbf{p})) \cdot \text{MLP}_s(\mathbf{p}) + g(\mathbf{p}) \cdot \text{MLP}_t(\mathbf{p})\]

\(\text{MLP}_s\) inherits its weights from Stage 1, and \(g(\cdot)\) is zero-initialized so that Stage 2 starts from the Stage 1 (single-frame) solution.
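
A minimal PyTorch sketch of the three designs above is given below. It assumes the residual form \(f(\mathbf{p}) = \mathbf{p} + \gamma \cdot \text{MLP}(\mathbf{p})\) and the dual-branch mixture; the hidden size, the gate parameterization, and the class names are my own assumptions, not the paper's implementation.

```python
# Sketch of the binder modules (hidden sizes, gate form, and names are assumptions).
import torch
import torch.nn as nn

def _mlp(dim: int, hidden: int) -> nn.Sequential:
    return nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))

class Binder(nn.Module):
    """Residual binder f(p) = p + gamma * MLP(p); gamma is zero-initialized so the
    binder starts as an identity mapping (used for global and per-block binders)."""
    def __init__(self, dim: int, hidden: int = 1024):
        super().__init__()
        self.mlp = _mlp(dim, hidden)
        self.gamma = nn.Parameter(torch.zeros(dim))  # zero-initialized scaling factor

    def forward(self, p: torch.Tensor) -> torch.Tensor:
        return p + self.gamma * self.mlp(p)

class DualBranchBinder(nn.Module):
    """Stage-2 video binder for TDS: the inner MLP becomes
    (1 - g(p)) * MLP_s(p) + g(p) * MLP_t(p). MLP_s inherits Stage-1 weights and the
    gate g outputs zero at initialization, so training starts from the Stage-1
    (single-frame) behaviour. The exact form of g is assumed to be a linear layer."""
    def __init__(self, dim: int, hidden: int = 1024):
        super().__init__()
        self.mlp_s = _mlp(dim, hidden)   # in practice, load Stage-1 weights here
        self.mlp_t = _mlp(dim, hidden)   # temporal branch, trained in Stage 2
        self.gate = nn.Linear(dim, 1)
        nn.init.zeros_(self.gate.weight)
        nn.init.zeros_(self.gate.bias)
        self.gamma = nn.Parameter(torch.zeros(dim))  # also inherited from Stage 1

    def forward(self, p: torch.Tensor) -> torch.Tensor:
        g = self.gate(p)                                    # zero at init -> pure MLP_s
        mixed = (1.0 - g) * self.mlp_s(p) + g * self.mlp_t(p)
        return p + self.gamma * mixed

class AbsorptionTokens(nn.Module):
    """DAM absorption tokens: learnable tokens appended to the prompt during training
    to soak up concept-irrelevant visual details; they are dropped at inference."""
    def __init__(self, num_tokens: int, dim: int):
        super().__init__()
        self.tokens = nn.Parameter(torch.randn(num_tokens, dim) * 0.02)

    def forward(self, p: torch.Tensor) -> torch.Tensor:
        if not self.training:
            return p                                        # discard at inference
        a = self.tokens.unsqueeze(0).expand(p.shape[0], -1, -1)
        return torch.cat([p, a], dim=1)                     # absorb irrelevant details
```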

Loss & Training

The binders are trained with the standard diffusion denoising loss. Each stage runs for 2400 iterations with a learning rate of \(1.0 \times 10^{-4}\). Inference generates 81-frame videos. Experiments are conducted on an NVIDIA RTX 4090.
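
Written in generic noise-prediction form (the exact parameterization and noise schedule follow the base Wan2.1 model and may differ), the objective with the binder-updated prompt \(f(\mathbf{p})\) as the condition is:

\[\mathcal{L} = \mathbb{E}_{\mathbf{x}_0, \boldsymbol{\epsilon}, t}\Big[\big\| \boldsymbol{\epsilon}_\theta\big(\mathbf{x}_t, t, f(\mathbf{p})\big) - \boldsymbol{\epsilon} \big\|_2^2\Big]\]

Only the binder parameters receive gradients; the DiT backbone weights \(\theta\) stay frozen.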

Key Experimental Results

Main Results

Method CLIP-T↑ DINO-I↑ Concept↑ Prompt↑ Motion↑ Overall↑
Textual Inversion† 25.96 20.47 2.14 2.17 2.94 2.42
DB-LoRA† 30.25 27.74 2.76 2.76 2.51 2.68
DreamVideo 27.43 24.15 1.90 1.82 1.66 1.79
DualReal 31.60 32.78 3.10 3.11 2.78 3.00
BiCo (Ours) 32.66 38.04 4.71 4.76 4.46 4.64

BiCo improves subjective Overall Quality by +54.67% over DualReal (3.00 → 4.64).
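
This figure follows directly from the Overall scores in the table above:

\[\frac{4.64 - 3.00}{3.00} \approx 0.5467 \;\Rightarrow\; +54.67\%\]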

Ablation Study

Configuration Concept↑ Prompt↑ Motion↑ Overall↑
Baseline (global binder only) 2.16 2.60 2.26 2.34
+ Hierarchical Binder 2.63 2.88 2.93 2.81
+ Prompt Diversification 3.40 3.34 3.04 3.26
+ Absorption Token 3.55 3.43 3.43 3.47
+ TDS (w/o absorption) 3.80 3.97 3.70 3.82
▲ w/o Reverse Training 2.60 2.70 2.43 2.58
Full Model 4.43 4.47 4.32 4.40

Key Findings

  • Hierarchical binders significantly improve concept retention and motion quality (Motion: 2.26 → 2.93)
  • Absorption tokens effectively suppress unwanted details (ablation visualization shows irrelevant elements appearing when removed)
  • TDS is critical for image-video compatibility (Overall: 3.47 → 3.82)
  • Two-stage reverse training is indispensable — removing it causes Overall to plummet from 4.40 to 2.58

Highlights & Insights

  • Unified framework: First to achieve flexible composition of arbitrary concepts from images and videos, supporting non-object concepts (style, motion)
  • No masks required: Achieves implicit decomposition through text-conditioned concept composition, lowering the user barrier
  • Scalable design: Binders are lightweight modules independently trained for different concept sources, composable on demand
  • Rich derivative applications: Image/video decomposition (retaining only selected tokens), text-guided editing

Limitations & Future Work

  • All tokens are treated equally, but token importance for T2V generation is non-uniform — tokens representing subjects/motion are far more important than function words
  • Based on a 1.3B model; scaling effects on larger T2V models (CogVideoX, Sora-scale) remain unverified
  • Consistency between automatic metrics (CLIP-T, DINO-I) and human evaluation needs further confirmation
  • Computational overhead: each concept source requires independent binder training (2400 iterations × 2 stages)

Related Work Comparison

  • Textual Inversion and DreamBooth-LoRA are foundational video personalization methods but offer only coarse-grained concept control
  • DreamVideo and DualReal support subject + motion composition but restrict the types and number of inputs
  • TokenVerse achieves prompt-controlled image concept composition but relies on a text-conditioned modulation architecture that does not apply to modern T2V models
  • Break-A-Scene depends on explicit mask input and cannot extract non-object concepts
  • Set-and-Sequence and Grid-LoRA learn appearance/motion in LoRA space but cannot precisely specify which concepts to extract or how to compose them
  • BiCo unifies concept decomposition and composition through the binder + token composition paradigm

Method Details Supplement

  • VLM Key Concept Extraction: Qwen2.5-VL extracts spatial concepts (objects, style, lighting, etc.) and temporal concepts (motion patterns, speed changes, etc.), combined into spatial-only and spatiotemporal prompts respectively
  • Inference Process: The target prompt \(\mathbf{p}_d\) is split according to concept correspondence, each part is updated through its corresponding binder, and the parts are reassembled into the updated prompt \(\mathbf{p}_u^i\) (see the sketch after this list)
  • Derivative Applications: Image/video decomposition (retain only dog-related tokens, discard cat-related tokens), text-guided editing (unchanged parts pass through binders, edited parts use original tokens)
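
A minimal sketch of this token-level update-and-reassemble step; the function, the span mapping, and the per-source binder dictionary are illustrative assumptions, not the authors' code.

```python
# Illustrative token-level concept composing at inference (names are assumptions).
from typing import Dict
import torch
import torch.nn as nn

def compose_prompt(
    p_d: torch.Tensor,               # (N, dim) encoded target prompt tokens
    spans: Dict[str, slice],         # token spans mapped to their concept source
    binders: Dict[str, nn.Module],   # one trained binder per concept source
) -> torch.Tensor:
    """Update each span of the target prompt with the binder of its concept
    source, keep unmapped tokens unchanged, and reassemble the updated prompt."""
    p_u = p_d.clone()
    for source, span in spans.items():
        p_u[span] = binders[source](p_d[span])
    return p_u  # fed to the frozen DiT as the composed condition

# Hypothetical usage: subject tokens come from an image binder, motion tokens
# from a video binder, e.g.
#   spans = {"image_subject": slice(2, 5), "video_motion": slice(5, 9)}
#   p_u = compose_prompt(p_d, spans, binders)
```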

Rating

  • Novelty: ⭐⭐⭐⭐⭐ — First to achieve unified flexible composition of arbitrary image-video concepts
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Quantitative automatic + human evaluation + detailed ablation + visualization cases are comprehensive
  • Writing Quality: ⭐⭐⭐⭐ — Concepts are clear; DAM/TDS design motivations are well articulated
  • Value: ⭐⭐⭐⭐⭐ — Direct and broad application prospects for visual content creation