
Sketch2Colab: Sketch-Conditioned Multi-Human Animation via Controllable Flow Distillation

Conference: CVPR 2026
arXiv: 2603.02190
Code: Unavailable
Area: Human Understanding
Keywords: Multi-human motion generation, sketch guidance, rectified flow distillation, human-object-human collaboration, CTMC discrete events

TL;DR

Sketch2Colab distills a sketch-driven diffusion prior into a rectified flow student network and combines energy guidance with continuous-time Markov chain (CTMC) planning of discrete events to generate coordinated 3D multi-human–object interaction motions from storyboard sketches. It reports state-of-the-art constraint compliance and perceptual quality on CORE4D and InterHuman.

Background & Motivation

Background: Diffusion models have demonstrated strong performance in single-person motion generation, with mature solutions for text-, trajectory-, and style-conditioned control. Collaborative multi-person–object scenarios (e.g., two people carrying a table together) remain an open challenge; representative work COLLAGE has made preliminary progress via LLM planning combined with latent diffusion.

Limitations of Prior Work: (1) Text as a control channel is too coarse—temporal ordering and spatial layout are difficult to express precisely; (2) achieving accurate constraint compliance in diffusion models under strong multi-entity constraints requires expensive posterior guidance or dedicated control modules, leading to slow and unstable sampling; (3) multi-entity interaction involves discrete events (contact, grasping, handover) that are difficult to model with purely continuous flows.

Key Challenge: Sketches provide spatiotemporal control signals that are more precise than text, but extending sketch-based constraints to multi-person–object scenarios runs into three problems: keyframe misalignment, phase drift between agents, and the difficulty of optimizing discrete contact/handover states.

Goal: Generate coordinated multi-human–object 3D motion sequences from storyboard sketches (keyframe poses, joint trajectories, and object paths).

Key Insight: Diffusion-to-rectified-flow distillation for fast and stable sampling + energy guidance for precise constraint enforcement + CTMC for discrete contact event modeling.

Core Idea: Distill a diffusion teacher into a rectified flow student; use differentiable energy functions in dual spaces (raw motion space and latent space) to guide constraint compliance; use CTMC to schedule discrete interaction events that modulate the continuous flow.

Method

Overall Architecture

Input storyboard (keyframe poses, joint trajectories, object masks, optional text) → 2D/3D alignment encoder extracts control signals → rectified flow student network generates motions in VQ-VAE latent space → CTMC schedules contact/handover events to modulate sub-fields → dual-space energy guidance enforces constraint compliance → frozen decoder produces final multi-human–object 3D motions.
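
To make the sampling stage of this pipeline more concrete, here is a minimal sketch of fixed-step Euler integration of a rectified flow velocity field, assuming the common convention that \(t=0\) is noise and \(t=1\) is data. The names `euler_sample`, `velocity_fn`, and `straight_field` are hypothetical and chosen for illustration only; the paper's student network, step count, CTMC modulation, and energy guidance hooks are not reproduced here.

```python
import numpy as np

def euler_sample(velocity_fn, z0, n_steps=8):
    """Integrate dz/dt = velocity_fn(z, t) from t=0 to t=1 with fixed Euler steps."""
    z = np.asarray(z0, dtype=float).copy()
    dt = 1.0 / n_steps
    for k in range(n_steps):
        t = k * dt                      # velocity is evaluated at the left endpoint
        z = z + dt * velocity_fn(z, t)
    return z

# Toy stand-in for a trained student (assumption): a perfectly straight flow
# that transports each noise sample to a fixed target latent.
rng = np.random.default_rng(0)
noise = rng.standard_normal((4, 16))
target = rng.standard_normal((4, 16))

def straight_field(z, t):
    # Along a straight path z_t = (1 - t) * noise + t * target,
    # this expression equals the constant velocity (target - noise).
    return (target - z) / (1.0 - t)

sample = euler_sample(straight_field, noise, n_steps=4)
```

When the learned paths are close to straight, as in this toy field, a handful of Euler steps already lands on the endpoint, which is the usual argument for rectified flow giving faster sampling than a many-step diffusion sampler.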

Key Designs

Diffusion-to-Rectified-Flow Distillation (PF-Distillation):
- Function: Distills a pretrained sketch-driven diffusion teacher into a rectified flow student.
- Mechanism: The teacher is trained with a variance-preserving (VP) noise schedule, which yields the probability flow velocity field \(v_\theta^{PF}\). The student jointly minimizes the rectified flow objective \(\mathcal{L}_{RF}\) and the PF distillation objective \(\mathcal{L}_{distill}\). Conditions are injected with a lift-then-fuse scheme: ControlNet-style trajectory paths combined with a time-gated keyframe adapter.
- Design Motivation: Directly training a multi-entity rectified flow model is computationally expensive and difficult; distilling a pretrained diffusion teacher inherits its learned motion distribution while enabling faster and more stable sampling.
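
The two objectives named above can be illustrated with a short NumPy sketch. It assumes the standard forms: a rectified flow loss that regresses the velocity onto the straight-line displacement, and a VP probability flow velocity \(-\tfrac{1}{2}\beta_t\,(x + \nabla_x \log p_t(x))\) with the score recovered from a noise prediction. The function names, the plain squared-error distillation loss, and the toy inputs are assumptions; this summary does not state the paper's exact \(\mathcal{L}_{distill}\), loss weights, or how teacher and student time axes are aligned.

```python
import numpy as np

def rectified_flow_loss(velocity_fn, x0, x1, t):
    """L_RF sketch: x0 is noise, x1 is data, t has shape (batch, 1) in [0, 1]."""
    xt = (1.0 - t) * x0 + t * x1        # point on the straight interpolation path
    target = x1 - x0                    # constant velocity of that path
    return float(np.mean((velocity_fn(xt, t) - target) ** 2))

def vp_pf_velocity(x, eps_pred, beta_t, sigma_t):
    """Probability flow velocity of a VP diffusion from a noise prediction.

    Uses score = -eps_pred / sigma_t and dx/dt = -0.5 * beta_t * (x + score).
    """
    score = -eps_pred / sigma_t
    return -0.5 * beta_t * (x + score)

def distill_loss(student_v, teacher_v):
    """Assumed form of L_distill: squared error between the two velocity fields."""
    return float(np.mean((student_v - teacher_v) ** 2))

# Toy usage with hypothetical shapes.
rng = np.random.default_rng(0)
x0 = rng.standard_normal((8, 4))
x1 = rng.standard_normal((8, 4))
t = rng.uniform(size=(8, 1))

oracle = lambda xt, t: x1 - x0          # a student that already knows the path
rf = rectified_flow_loss(oracle, x0, x1, t)

# If the marginal is a standard normal, the score is -x, the matching noise
# prediction is sigma_t * x, and the probability flow velocity vanishes.
v_teacher = vp_pf_velocity(x0, 0.6 * x0, beta_t=0.1, sigma_t=0.6)
```

In a real distillation setup the teacher velocity would have to be reparametrized onto the student's time axis and direction before the two are compared; that mapping is left out of this sketch.
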
Dual-Space Energy Guidance (Dual-Space Conditioning):
- Function: Simultaneously guides constraint compliance in raw motion space and latent space.
- Mechanism: Defines differentiable energy functions \(E_{key}\) (keyframe alignment), \(E_\tau\) (trajectory tracking), \(E_{int}\) (contact/spacing), and \(E_{phys}\) (physical constraints such as foot sliding prevention and ground plane). A learned low-rank block-Toeplitz Jacobian approximation \(\mathbf{B}_\rho\) maps gradients from raw space to latent space. A latent-space anchor loss \(\mathcal{L}_{lat}\) maintains samples on the motion manifold.
- Design Motivation: Constraints applied purely in raw space may leave the manifold, while purely latent-space constraints lack geometric precision—the dual-space approach is complementary.
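
The gradient pull-back described above can be sketched as follows, under strong simplifying assumptions: a toy linear decoder stands in for the frozen VQ-VAE decoder, a masked squared error stands in for \(E_{key}\), and a plain low-rank factorization `U @ V.T` stands in for \(\mathbf{B}_\rho\) (the block-Toeplitz structure and the learning of the approximation are omitted). All names and dimensions are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d_z, d_x, rank = 8, 32, 4               # latent dim, raw motion dim, rank (assumed)

# Low-rank factors. In this toy the decoder is linear, x = W z, so its
# transposed Jacobian is exactly U @ V.T; a learned B_rho would only approximate it.
U = rng.standard_normal((d_z, rank))
V = rng.standard_normal((d_x, rank))
W = V @ U.T

def decode(z):
    return W @ z

def keyframe_energy(x, target, mask):
    """Masked squared error: only constrained coordinates contribute."""
    r = mask * (x - target)
    return 0.5 * float(r @ r)

def keyframe_grad(x, target, mask):
    """Gradient of keyframe_energy with respect to x (mask entries are 0 or 1)."""
    return mask * (x - target)

def pull_back(grad_x):
    """Map a raw-space gradient to latent space in O((d_x + d_z) * rank)."""
    return U @ (V.T @ grad_x)

def guided_step(z, target, mask, step):
    """One latent gradient step on the raw-space energy, without autodiff."""
    return z - step * pull_back(keyframe_grad(decode(z), target, mask))

mask = (np.arange(d_x) % 4 == 0).astype(float)   # constrain every 4th coordinate
target = rng.standard_normal(d_x)
z = rng.standard_normal(d_z)
step = 1.0 / np.linalg.norm(W, 2) ** 2           # safe step size for this quadratic

e_before = keyframe_energy(decode(z), target, mask)
z_new = guided_step(z, target, mask, step)
e_after = keyframe_energy(decode(z_new), target, mask)
```

The point of the factorized `pull_back` is that the energy is evaluated where the geometry is meaningful (raw motion space) while the update is applied where the sampler lives (latent space), without backpropagating through the decoder at every step.
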
CTMC Discrete Event Planning:
- Function: Handles temporal scheduling of discrete events such as contact, grasping, and handover.
- Mechanism: A continuous-time Markov chain is defined over contact states, with transition rates learned via physics-informed objectives. The resulting contact/handover schedule \(\boldsymbol{\pi}_t\) modulates the sub-fields and contact weights of the continuous flow, coupled via importance weighting.
- Design Motivation: Contact on/off, grasping, and handover are inherently discrete events; purely continuous flow optimization tends to produce ambiguous mode transitions and contact flickering—CTMC provides clean, correctly phased discrete scheduling.
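
As a rough illustration of what a CTMC over contact states looks like, the sketch below samples one event schedule with the standard Gillespie procedure. The three state labels and the hand-set generator matrix `Q` are assumptions for demonstration; in the paper the transition rates are learned, and the coupling of the schedule \(\boldsymbol{\pi}_t\) to the flow's sub-fields via importance weighting is not reproduced here.

```python
import numpy as np

STATES = ("free", "contact", "handover")   # hypothetical state labels

# Generator matrix: off-diagonal entries are transition rates, rows sum to zero.
Q = np.array([
    [-2.0,  2.0,  0.0],
    [ 0.5, -1.5,  1.0],
    [ 0.0,  3.0, -3.0],
])

def sample_ctmc(Q, start, horizon, rng):
    """Sample one trajectory; returns [(jump_time, state_index), ...]."""
    t, s = 0.0, start
    path = [(t, s)]
    while True:
        exit_rate = -Q[s, s]
        if exit_rate <= 0.0:                   # absorbing state: no more jumps
            break
        t += rng.exponential(1.0 / exit_rate)  # holding time ~ Exp(exit_rate)
        if t >= horizon:
            break
        jump_probs = Q[s].copy()
        jump_probs[s] = 0.0
        jump_probs /= exit_rate                # embedded jump-chain probabilities
        s = int(rng.choice(len(jump_probs), p=jump_probs))
        path.append((t, s))
    return path

def state_at(path, t):
    """Piecewise-constant lookup of the state active at time t."""
    current = path[0][1]
    for jump_time, s in path:
        if jump_time > t:
            break
        current = s
    return current

path = sample_ctmc(Q, start=0, horizon=5.0, rng=np.random.default_rng(0))
frame_times = np.linspace(0.0, 5.0, 11)
schedule = [STATES[state_at(path, t)] for t in frame_times]
```

Because the sampled schedule is piecewise constant, each frame has exactly one contact state, which is the property that avoids the flickering a purely continuous relaxation can produce.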

Loss & Training

The total training loss comprises \(\mathcal{L}_{RF}\) (rectified flow) + \(\mathcal{L}_{distill}\) (distillation) + \(\mathcal{L}_{Lyap}\) (Lyapunov potential) + \(\mathcal{L}_{CTMC}\) (discrete events) + \(\mathcal{L}_{lat}\) (latent anchor) + individual energy terms. Classifier-free guidance uses 10% condition dropout during training and a guidance scale \(\omega \in [1.4, 1.8]\) at sampling time. Motions are represented with SMPL-X using 22 joints and the 6D rotation representation throughout.
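
The classifier-free guidance recipe mentioned here can be sketched in a few lines. The code assumes the common convention in which \(\omega = 1\) recovers the purely conditional prediction (some papers instead write the scale as \(1 + \omega\)); the summary does not say which one the authors use. `drop_condition`, `cfg_velocity`, and the zero null embedding are hypothetical names and choices.

```python
import numpy as np

def drop_condition(cond, null_cond, drop_prob, rng):
    """Training time: swap each sample's condition for a null embedding with prob drop_prob."""
    drop = rng.random(cond.shape[0]) < drop_prob
    mixed = np.where(drop[:, None], null_cond, cond)
    return mixed, drop

def cfg_velocity(v_cond, v_uncond, omega):
    """Sampling time: extrapolate from the unconditional toward the conditional velocity."""
    return v_uncond + omega * (v_cond - v_uncond)

# Toy usage: 10% dropout on a batch of condition embeddings.
rng = np.random.default_rng(0)
cond = np.ones((1000, 3))
null_cond = np.zeros(3)
mixed, drop = drop_condition(cond, null_cond, drop_prob=0.1, rng=rng)

# Guidance scale taken from the reported range [1.4, 1.8].
v_c = np.array([1.0, 2.0])
v_u = np.array([0.0, 1.0])
v_guided = cfg_velocity(v_c, v_u, omega=1.5)
```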

Key Experimental Results

Main Results (CORE4D and InterHuman)

Method	Constraint Compliance	FID↓	Perceptual Quality	Inference Speed
COLLAGE	Baseline	—	Baseline	Slow (diffusion)
Sketch2Anim	Single-person only	—	—	Moderate
Sketch2Colab	SOTA	Lowest	Highest	Fast (rectified flow)

Ablation Study

Configuration	Constraint Compliance	Note
Full model	Best	Complete pipeline
w/o CTMC	Degraded (contact ambiguity)	Discrete events lost
w/o dual-space	Degraded (geometric deviation)	Lack of precise guidance
w/o distillation	Significant degradation	Direct rectified flow training is unstable

Key Findings

- Diffusion-to-rectified-flow distillation substantially outperforms directly extending a diffusion baseline to multi-entity scenarios, avoiding keyframe misalignment and phase drift.
- CTMC contributes most to contact quality: without it, contacts exhibit flickering and temporal errors.
- Dual-space guidance is more effective than either raw-space-only or latent-space-only guidance.

Highlights & Insights

- Three-tier architecture of distillation + energy + CTMC: Continuous dynamics are handled by rectified flow, precise constraints by energy guidance, and discrete events by CTMC. Each component has a well-defined role while remaining tightly coupled to the others.
- Sketches are better suited for interaction control than text: Sketches naturally encode spatiotemporal information (when, where, and in what pose), offering far greater precision than textual descriptions.
- Jacobian approximation bridges the dual spaces: Learning a low-rank block-Toeplitz Jacobian to map gradients from raw to latent space avoids costly automatic differentiation.

Limitations & Future Work

- The method requires storyboard-level inputs; creating precise keyframe sketches still demands animator expertise.
- The discrete state space of CTMC is fixed (contact/non-contact), leaving richer interaction semantics (e.g., "carefully," "forcefully") unmodeled.
- Evaluation is conducted primarily on CORE4D and InterHuman; the complexity of real-world film and game production scenarios is considerably higher.

Related Work

- vs. COLLAGE: COLLAGE relies on text + LLM planning + latent diffusion; Sketch2Colab uses sketches + rectified flow + CTMC, offering more precise control and faster inference.
- vs. Sketch2Anim: Sketch2Anim supports only single-person generation; Sketch2Colab extends to collaborative multi-human–object scenarios.
- vs. MotionLab: MotionLab also employs rectified flow but targets unified single-person generation and editing; Sketch2Colab focuses specifically on sketch-conditioned multi-entity control.

Rating

- Novelty: ⭐⭐⭐⭐⭐ The combination of distillation + energy guidance + CTMC is pioneering in motion generation.
- Experimental Thoroughness: ⭐⭐⭐⭐ Evaluated on two multi-human interaction benchmarks, CORE4D and InterHuman.
- Writing Quality: ⭐⭐⭐⭐ Method description is systematic but notation-dense.
- Value: ⭐⭐⭐⭐⭐ Significant advancement for multi-person collaborative motion generation in animation and game industries.