# Composing Concepts from Images and Videos via Concept-prompt Binding
Conference: CVPR 2026
arXiv: 2512.09824
Code: Project Page
Area: LLM / NLP (Other)
Keywords: visual concept composition, diffusion Transformer, video personalization, concept binding, temporal decoupling
## TL;DR
Bind & Compose (BiCo) is a one-shot method that binds visual concepts to prompt tokens via hierarchical binders and achieves flexible image-video concept composition through token-level composition, comprehensively outperforming prior work in concept consistency, prompt fidelity, and motion quality.
## Background & Motivation
Background: Visual concept composition aims to integrate elements from different images and videos into a coherent output, serving as a fundamental capability for visual creation and filmmaking. With the development of DiT-architecture T2V diffusion models (e.g., Wan2.1), concept localization and customization capabilities have significantly improved.
Limitations of Prior Work: (i) Insufficient concept extraction precision — existing methods (LoRA/learnable embedding + masks) struggle to disentangle complex concepts with occlusion and temporal variation, and cannot extract non-object concepts (e.g., style); (ii) Limited image-video concept composition flexibility — prior work is restricted to subject from images + motion from videos, unable to flexibly compose arbitrary attributes (visual style, lighting changes, etc.).
Key Challenge: The challenges of precise concept decomposition (without mask input) and cross-modal concept composition (image + video) are mutually coupled.
Goal: Enable flexible extraction and composition of arbitrary visual concepts (including non-object concepts such as style and motion) from images and videos.
Key Insight: Leverage T2V diffusion models' concept localization capability by binding text tokens to corresponding visual concepts (one-shot training), then compose concepts through token-level composition.
Core Idea: First bind visual concepts to prompt tokens (Bind), then select bound tokens from different sources and compose them into the target prompt (Compose), achieved through hierarchical binder structures + diversified absorption mechanism + temporal decoupling strategy.
## Method

### Overall Architecture
BiCo is built on the Wan2.1-T2V-1.3B model and follows a two-phase workflow:

1. Concept Binding: for each visual input, lightweight binder modules learn the mapping between text tokens and the corresponding visual concepts.
2. Concept Composing: different parts of the target prompt are updated through their corresponding binders, composing multi-source visual information into the updated prompt.
The core operation builds on DiT's cross-attention conditioning, where queries come from the visual latents \(\mathbf{z}\) and keys/values from the prompt tokens \(\mathbf{p}\): \(\text{Attn}(\mathbf{z}, \mathbf{p}) = \text{softmax}\big(Q_{\mathbf{z}} K_{\mathbf{p}}^\top / \sqrt{d}\big) V_{\mathbf{p}}\). Updating the prompt tokens therefore directly steers which visual concepts are injected.
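This conditioning can be illustrated with a toy numpy sketch (shapes and weight scales are arbitrary stand-ins, not Wan2.1's actual dimensions): queries come from visual latents, keys/values from the prompt tokens that the binders later update.

```python
import numpy as np

def cross_attention(z, p, W_q, W_k, W_v):
    """Toy single-head cross-attention: queries from visual latents z,
    keys/values from prompt tokens p (how DiT injects text conditions)."""
    Q, K, V = z @ W_q, p @ W_k, p @ W_v
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)
    return attn @ V

rng = np.random.default_rng(0)
z = rng.normal(size=(4, 8))   # 4 visual latent tokens
p = rng.normal(size=(6, 8))   # 6 prompt tokens (binder-updated at inference)
W_q, W_k, W_v = (rng.normal(size=(8, 8)) * 0.1 for _ in range(3))
out = cross_attention(z, p, W_q, W_k, W_v)
print(out.shape)  # each visual token gathers information from the prompt
```

Because the prompt tokens sit on the key/value side, swapping in binder-updated tokens changes what every visual token attends to, without touching the frozen backbone weights.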
### Key Designs
- Hierarchical Binder Structure: comprises a global binder \(f_g(\cdot)\) and per-block binders \(f_l^i(\cdot)\). Each binder is a residual MLP with a zero-initialized scaling factor: \(f(\mathbf{p}) = \mathbf{p} + \gamma \cdot \text{MLP}(\mathbf{p})\). Because DiT blocks behave differently across denoising steps, the hierarchical design combines global correlation with targeted per-block fine-tuning. It is paired with a two-stage reverse training strategy: the global binder is first strengthened at high noise levels (\(\geq \alpha\), with \(\alpha = 0.875\)), after which all binders are trained jointly.
- Diversified-Absorption Mechanism (DAM): addresses concept-token binding precision in the one-shot setting. A VLM (Qwen2.5-VL) extracts spatial and temporal key concepts and generates diversified prompts that keep the key concept words fixed. Learnable absorption tokens \(p_a^j\) soak up concept-irrelevant visual details during training and are discarded at inference, suppressing unwanted details.
- Temporal Decoupling Strategy (TDS): addresses image-video temporal heterogeneity. Video concept training is divided into two stages: Stage 1 trains on single frames (aligned with image concept training), and Stage 2 trains on full videos with a dual-branch binder in which the spatial branch \(\text{MLP}_s\) inherits its weights from Stage 1 and the temporal branch is gated by a zero-initialized \(g(\cdot)\), so the Stage 2 binder starts exactly from the Stage 1 solution.
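The residual binder form above can be sketched minimally in numpy. The `Binder` class and its sizes are illustrative, not the paper's implementation; the key property shown is that the zero-initialized scaling \(\gamma\) (like the zero-initialized TDS gate \(g(\cdot)\)) makes every binder start as an identity map, so training begins from the frozen backbone's behavior.

```python
import numpy as np

class Binder:
    """Residual MLP binder f(p) = p + gamma * MLP(p), gamma zero-initialized,
    so the binder is an identity map before training (sketch, not the paper's code)."""
    def __init__(self, dim, hidden, rng):
        self.W1 = rng.normal(size=(dim, hidden)) * 0.02
        self.W2 = rng.normal(size=(hidden, dim)) * 0.02
        self.gamma = np.zeros(dim)  # zero-init scaling factor

    def mlp(self, p):
        return np.maximum(p @ self.W1, 0.0) @ self.W2  # ReLU MLP

    def __call__(self, p):
        return p + self.gamma * self.mlp(p)

rng = np.random.default_rng(0)
p = rng.normal(size=(6, 8))     # 6 prompt tokens, dim 8
b = Binder(8, 16, rng)
assert np.allclose(b(p), p)     # identity at initialization
b.gamma[:] = 0.5                # stand-in for a few training steps
assert not np.allclose(b(p), p) # now the binder modifies the tokens
```

Starting from identity is what lets the global binder, the per-block binders, and the Stage 2 temporal branch be stacked without disturbing concepts learned earlier.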
### Loss & Training
The standard diffusion denoising loss is used to train the binders. Each stage runs for 2400 iterations with a learning rate of \(1.0 \times 10^{-4}\). Inference generates 81-frame videos. Experiments are conducted on an NVIDIA RTX 4090.
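A toy sketch of this objective and of the reverse timestep scheduling: the noise schedule, shapes, and placeholder model below are stand-ins for illustration, not Wan2.1's actual scheduler or network.

```python
import numpy as np

rng = np.random.default_rng(0)

def denoising_loss(x0, t, predict_noise):
    """One step of the standard denoising objective: corrupt x0 at noise
    level t, then score how well the model predicts the injected noise."""
    eps = rng.normal(size=x0.shape)
    x_t = np.sqrt(1.0 - t) * x0 + np.sqrt(t) * eps  # toy schedule
    return float(np.mean((predict_noise(x_t, t) - eps) ** 2))

# Two-stage reverse training: sample only high-noise timesteps first
# (t >= alpha, strengthening the global binder), then all timesteps jointly.
alpha = 0.875
ts = rng.uniform(size=2400)
stage1_ts = ts[ts >= alpha]          # global binder only
stage2_ts = rng.uniform(size=2400)   # then all binders jointly

x0 = rng.normal(size=(4, 8))
zero_model = lambda x_t, t: np.zeros_like(x_t)  # placeholder "model"
loss = denoising_loss(x0, float(stage1_ts[0]), zero_model)
```

Only binder parameters would receive gradients from this loss; the DiT backbone stays frozen throughout.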
## Key Experimental Results

### Main Results
| Method | CLIP-T↑ | DINO-I↑ | Concept↑ | Prompt↑ | Motion↑ | Overall↑ |
|---|---|---|---|---|---|---|
| Textual Inversion† | 25.96 | 20.47 | 2.14 | 2.17 | 2.94 | 2.42 |
| DB-LoRA† | 30.25 | 27.74 | 2.76 | 2.76 | 2.51 | 2.68 |
| DreamVideo | 27.43 | 24.15 | 1.90 | 1.82 | 1.66 | 1.79 |
| DualReal | 31.60 | 32.78 | 3.10 | 3.11 | 2.78 | 3.00 |
| BiCo (Ours) | 32.66 | 38.04 | 4.71 | 4.76 | 4.46 | 4.64 |
BiCo improves the subjective Overall score over DualReal by a relative +54.67% (3.00 → 4.64).
### Ablation Study
| Configuration | Concept↑ | Prompt↑ | Motion↑ | Overall↑ |
|---|---|---|---|---|
| Baseline (global binder only) | 2.16 | 2.60 | 2.26 | 2.34 |
| + Hierarchical Binder | 2.63 | 2.88 | 2.93 | 2.81 |
| + Prompt Diversification | 3.40 | 3.34 | 3.04 | 3.26 |
| + Absorption Token | 3.55 | 3.43 | 3.43 | 3.47 |
| + TDS (w/o absorption) | 3.80 | 3.97 | 3.70 | 3.82 |
| Full Model w/o Reverse Training | 2.60 | 2.70 | 2.43 | 2.58 |
| Full Model | 4.43 | 4.47 | 4.32 | 4.40 |
### Key Findings
- Hierarchical binders significantly improve concept retention and motion quality (Motion: 2.26 → 2.93)
- Absorption tokens effectively suppress unwanted details (ablation visualization shows irrelevant elements appearing when removed)
- TDS is critical for image-video compatibility (Overall: 3.47 → 3.82)
- Two-stage reverse training is indispensable — removing it causes Overall to plummet from 4.40 to 2.58
## Highlights & Insights
- Unified framework: First to achieve flexible composition of arbitrary concepts from images and videos, supporting non-object concepts (style, motion)
- No masks required: Achieves implicit decomposition through text-conditioned concept composition, lowering the user barrier
- Scalable design: Binders are lightweight modules independently trained for different concept sources, composable on demand
- Rich derivative applications: Image/video decomposition (retaining only selected tokens), text-guided editing
## Limitations & Future Work
- All tokens are treated equally, but token importance for T2V generation is non-uniform — tokens representing subjects/motion are far more important than function words
- Based on a 1.3B model; scaling effects on larger T2V models (CogVideoX, Sora-scale) remain unverified
- Consistency between automatic metrics (CLIP-T, DINO-I) and human evaluation needs further confirmation
- Computational overhead: each concept source requires independent binder training (2400 iterations × 2 stages)
## Related Work & Insights
- Textual Inversion/DreamBooth-LoRA are foundational video personalization methods but offer coarse concept control granularity
- DreamVideo/DualReal support subject + motion composition but constrain input types and quantities
- TokenVerse achieves prompt-controlled image concept composition but relies on text-conditional modulation architecture, inapplicable to modern T2V models
- Break-A-Scene depends on explicit mask input and cannot extract non-object concepts
- BiCo unifies concept decomposition and composition through the binder + token composition paradigm
- Set-and-Sequence and Grid-LoRA achieve appearance/motion learning in LoRA space but cannot precisely specify concepts and composition methods
## Method Details Supplement
- VLM Key Concept Extraction: Qwen2.5-VL extracts spatial concepts (objects, style, lighting, etc.) and temporal concepts (motion patterns, speed changes, etc.), combined into spatial-only and spatiotemporal prompts respectively
- Inference Process: The target prompt \(\mathbf{p}_d\) is decomposed by concept correspondence, each part updated through corresponding binders, then reassembled into \(\mathbf{p}_u^i\)
- Derivative Applications: Image/video decomposition (retain only dog-related tokens, discard cat-related tokens), text-guided editing (unchanged parts pass through binders, edited parts use original tokens)
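The inference-time token-level composition described above can be sketched as follows. The span boundaries, the two binder stand-ins, and the string-suffix "binding" are all hypothetical illustrations, not the paper's actual token representations:

```python
# Token-level composition: the target prompt is split by concept, and each
# span is updated by the binder bound to that concept's source.
def compose(prompt_tokens, spans):
    """spans: list of (start, end, binder_or_None); None means the tokens
    pass through unchanged (as in text-guided editing of unbound parts)."""
    out = list(prompt_tokens)
    for start, end, binder in spans:
        if binder is not None:
            out[start:end] = [binder(tok) for tok in prompt_tokens[start:end]]
    return out

# Stand-ins for trained binders from two different sources (hypothetical).
dog_binder = lambda tok: tok + "_dogvid"      # bound to a dog video
style_binder = lambda tok: tok + "_oilpaint"  # bound to a style image

prompt = ["a", "dog", "running", "in", "oil", "painting", "style"]
composed = compose(prompt, [(1, 3, dog_binder), (4, 7, style_binder)])
print(composed)
```

Decomposition falls out of the same operation: keeping only one concept's span (and dropping the rest) retains only that concept, which is how the dog-tokens/cat-tokens example above works.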
## Rating
- Novelty: ⭐⭐⭐⭐⭐ — First to achieve unified flexible composition of arbitrary image-video concepts
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Quantitative automatic + human evaluation + detailed ablation + visualization cases are comprehensive
- Writing Quality: ⭐⭐⭐⭐ — Concepts are clear; DAM/TDS design motivations are well articulated
- Value: ⭐⭐⭐⭐⭐ — Direct and broad application prospects for visual content creation