
BiMotion: B-spline Motion for Text-guided Dynamic 3D Character Generation

Conference: CVPR 2026
arXiv: 2602.18873
Code: Project Page
Area: Image Generation / Dynamic 3D Generation
Keywords: B-spline, Motion Generation, Text-guided, 3D Character Animation, VAE-latent diffusion, Control Point Representation

TL;DR

BiMotion compresses variable-length motion sequences into a fixed number of control points using continuously differentiable B-spline curves. Combined with a specialized VAE and a flow-matching diffusion model, it achieves fast, highly expressive, and semantically complete text-guided dynamic 3D character generation, outperforming existing methods in both quality and efficiency.

Background & Motivation

  1. High Demand for Dynamic 3D Generation: There is an increasing demand for text-driven 3D character animation in games, film, and education. Decoupling motion generation from shape synthesis is the current mainstream paradigm.
  2. Fixed-length Input Bottleneck: Existing feed-forward methods (e.g., AnimateAnyMesh) utilize VAE-latent diffusion, which requires fixed-size inputs, forcing motion sequences to be cropped or uniformly downsampled.
  3. Semantic Loss from Cropping: Truncating variable-length sequences captures only isolated sub-actions (e.g., "rotating right") and fails to express the complete motion semantics described by users.
  4. Jitter from Downsampling: Uniform temporal downsampling leads to non-smooth and jittery motion results.
  5. Discrete Frame-by-frame Representation as Root Cause: Motion is inherently continuous; frame counts only reflect the sampling rate. Semantics do not change with frame counts, necessitating a continuous and compact parameterization.
  6. Lack of High-quality Annotated Data: Existing datasets lack pairings of diverse variable-length motion sequences with high-quality text descriptions.

Method

Overall Architecture

BiMotion adopts a VAE-latent diffusion architecture:

  • Training: variable-length vertex-displacement sequences are converted into a fixed number of control points via B-spline fitting → encoded into motion latents by a VAE → a flow-matching diffusion model learns conditional generation on these latents.
  • Inference: given an initial mesh + text → the VAE encodes the initial shape → the diffusion model generates a motion latent → the VAE decodes it into control points → sequences of arbitrary length are obtained via B-spline reprojection.

B-spline Motion Representation

  • Each vertex displacement trajectory is fitted independently with a uniform cubic B-spline (\(d=3\)) using \(k=16\) control points.
  • Laplacian Regularized Solver: when \(k > T\) (short sequences), the least-squares system is underdetermined, so a second-order difference operator \(L\) is introduced as a regularizer. The closed-form solution is computed efficiently via Cholesky decomposition (\(< 1\) second for 200 frames and 50K vertices); a minimal fitting sketch follows this list.
  • Three advantages of B-splines: ① continuous differentiability ensures natural trajectories; ② local controllability; ③ temporal re-parameterization supports sampling at arbitrary lengths.
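A minimal NumPy/SciPy sketch of this fitting step, under assumptions the note does not pin down: a clamped uniform knot vector, dense matrices, and an illustrative regularization weight `lam`.

```python
import numpy as np
from scipy.interpolate import BSpline
from scipy.linalg import cho_factor, cho_solve

def fit_control_points(X, k=16, degree=3, lam=1e-3):
    """Fit each vertex trajectory with a uniform cubic B-spline.

    X: (T, n, 3) per-frame vertex displacements.
    Returns C: (k, n, 3) control points. `lam` is an illustrative
    regularization weight, not a value from the paper.
    """
    T = X.shape[0]
    # Clamped uniform knot vector giving k basis functions of the given degree.
    knots = np.concatenate([np.zeros(degree),
                            np.linspace(0.0, 1.0, k - degree + 1),
                            np.ones(degree)])
    t = np.linspace(0.0, 1.0, T)
    B = BSpline.design_matrix(t, knots, degree).toarray()   # (T, k) basis matrix
    L = np.diff(np.eye(k), n=2, axis=0)                     # second-order differences
    # Normal equations of min ||B C - X||^2 + lam ||L C||^2, solved via Cholesky;
    # the Laplacian term keeps the system well-posed even when k > T.
    chol = cho_factor(B.T @ B + lam * (L.T @ L))
    C = cho_solve(chol, B.T @ X.reshape(T, -1))             # all vertices at once
    return C.reshape(k, -1, 3)
```

Reprojection to an arbitrary length then amounts to evaluating `BSpline(knots, C, degree)` at any set of timestamps in \([0,1]\), which is what decouples output length from the fixed control-point count.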

Key Designs

1. Normal Fusion
  • Surface normals are encoded by an MLP and fused with point-coordinate features through point-wise cosine-similarity weights (see the sketch below).
  • This effectively distinguishes motion components that are spatially close but structurally distinct on the mesh, and proves more stable than mesh-connectivity-based alternatives.
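A minimal PyTorch sketch of the idea; the feature dimension, MLP shapes, and the additive way the cosine weight enters are my assumptions, since the note only specifies MLP-encoded normals weighted by point-wise cosine similarity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NormalFusion(nn.Module):
    """Fuse normal features into coordinate features with cosine-similarity
    weights. Dimensions and the additive form are illustrative assumptions."""

    def __init__(self, dim=256):
        super().__init__()
        self.point_mlp = nn.Sequential(nn.Linear(3, dim), nn.GELU(), nn.Linear(dim, dim))
        self.normal_mlp = nn.Sequential(nn.Linear(3, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, xyz, normals):
        f_p = self.point_mlp(xyz)        # (B, n, dim) coordinate features
        f_n = self.normal_mlp(normals)   # (B, n, dim) normal features
        # Point-wise cosine similarity decides how much normal information each
        # point absorbs; nearby points with different normals end up with
        # different features, separating spatially close parts.
        w = F.cosine_similarity(f_p, f_n, dim=-1).unsqueeze(-1)   # (B, n, 1)
        return f_p + w * f_n
```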

2. Multi-level Control Point Embedding (Control-PE)
  • Inspired by wavelet packet decomposition, a hierarchy of control-point levels \([17,15,13,11,9,7,5,4]\) is constructed.
  • High-frequency residuals are extracted level by level and concatenated with the coarsest-level coefficients; the whole embedding reduces to a single matrix multiplication (see the sketch below).
  • Significantly outperforms traditional frequency positional encoding at capturing fine motion details (e.g., a lion's tail wagging).
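The level-by-level residual extraction can be folded into a single precomputed matrix, which is presumably what "a single matrix multiplication" refers to. The sketch below uses plain linear resampling as the coarsening operator, an assumption on my part, so read it as the residual-pyramid idea rather than the paper's exact construction.

```python
import numpy as np

def control_pe_matrix(levels=(17, 15, 13, 11, 9, 7, 5, 4)):
    """Build one matrix mapping finest-level control points to the
    concatenated multi-level embedding (assumed linear coarsening)."""
    def resample(src, dst):
        # (dst, src) linear-interpolation matrix between uniform 1D grids.
        M = np.zeros((dst, src))
        pos = np.linspace(0, src - 1, dst)
        lo = np.clip(np.floor(pos).astype(int), 0, src - 2)
        frac = pos - lo
        M[np.arange(dst), lo] = 1.0 - frac
        M[np.arange(dst), lo + 1] = frac
        return M

    blocks, P = [], np.eye(levels[0])   # P: finest coeffs -> current level
    for fine, coarse in zip(levels[:-1], levels[1:]):
        D = resample(fine, coarse)                       # downsample
        U = resample(coarse, fine)                       # upsample back
        blocks.append((np.eye(fine) - U @ D) @ P)        # high-frequency residual
        P = D @ P
    blocks.append(P)                                     # coarsest coefficients
    return np.vstack(blocks)                             # one matmul at runtime

M = control_pe_matrix()           # (sum of level sizes, 17)
# embedding = M @ control_points  # applied per trajectory, e.g. (17, 3) input
```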

3. Cross-attention Spatial Compression
  • Farthest point sampling (FPS) compresses \(n=4096\) points into \(n'=512\) tokens (see the sketch below).
  • The encoder uses 8 cross-attention layers; the decoder uses 8 self-attention layers.
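This query-compression pattern (FPS-sampled tokens cross-attending to the full point set) is common in 3D latent encoders; the sketch below shows one such layer with assumed dimensions, not the paper's exact module.

```python
import torch
import torch.nn as nn

def farthest_point_sample(xyz, m):
    """Greedy FPS: indices of m well-spread points from xyz (B, n, 3)."""
    B, n, _ = xyz.shape
    idx = torch.zeros(B, m, dtype=torch.long, device=xyz.device)
    dist = torch.full((B, n), float("inf"), device=xyz.device)
    farthest = torch.zeros(B, dtype=torch.long, device=xyz.device)
    for i in range(m):
        idx[:, i] = farthest
        centroid = xyz[torch.arange(B), farthest].unsqueeze(1)       # (B, 1, 3)
        dist = torch.minimum(dist, ((xyz - centroid) ** 2).sum(-1))  # nearest pick so far
        farthest = dist.argmax(-1)                                   # next farthest point
    return idx

class CompressLayer(nn.Module):
    """One cross-attention layer: n' sampled tokens query all n point features."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, queries, feats):
        out, _ = self.attn(queries, feats, feats)   # (B, n', dim) <- (B, n, dim)
        return self.norm(queries + out)
```

At encode time one would gather the features of the 512 sampled indices as queries and stack 8 such layers, matching the encoder configuration above.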

Loss & Training

\[\mathcal{L}_{VAE} = \mathcal{L}_{Fit} + 0.3 \cdot \mathcal{L}_{Corr} + 0.1 \cdot \mathcal{L}_{Rigid} + 2 \times 10^{-5} \cdot \mathcal{L}_{KL}\]
| Loss | Role |
| --- | --- |
| \(\mathcal{L}_{Fit}\) (Charbonnier) | Fits the input control points |
| \(\mathcal{L}_{Corr}\) (Correspondence) | Fits the original displacement trajectories after B-spline reprojection; speeds up early convergence |
| \(\mathcal{L}_{Rigid}\) (Local Rigidity) | Enforces local distance consistency between adjacent frames to maintain shape identity |
| \(\mathcal{L}_{KL}\) | Regularizes the latent distribution |
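A hedged PyTorch sketch of the weighted objective: the weights match the formula above, while the Charbonnier epsilon and the exact rigidity formulation (here, matching distances to fixed neighbors across adjacent frames) are assumptions.

```python
import torch

def charbonnier(pred, target, eps=1e-3):
    """Smooth L1-like penalty: sqrt(err^2 + eps^2); eps is an assumption."""
    return torch.sqrt((pred - target) ** 2 + eps ** 2).mean()

def local_rigidity(X, nbr):
    """Keep distances to sampled neighbors consistent between adjacent frames.
    X: (T, n, 3) decoded positions, nbr: (n, K) neighbor indices (assumed given)."""
    d = (X[:, :, None, :] - X[:, nbr, :]).norm(dim=-1)   # (T, n, K)
    return (d[1:] - d[:-1]).abs().mean()

def vae_loss(C_hat, C, X_hat, X, mu, logvar, nbr):
    L_fit = charbonnier(C_hat, C)        # control-point fit
    L_corr = charbonnier(X_hat, X)       # reprojected trajectories vs. originals
    L_rigid = local_rigidity(X_hat, nbr)
    L_kl = -0.5 * torch.mean(1 + logvar - mu ** 2 - logvar.exp())
    return L_fit + 0.3 * L_corr + 0.1 * L_rigid + 2e-5 * L_kl
```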

Generation Model

  • Based on rectified flow matching with a 12-layer DiT backbone.
  • The initial latent \(\mathbf{z}_0\) is concatenated with the motion latent, then fused with the text condition (CLIP ViT-L/14) and the shape condition through decoupled cross-attention.
  • Classifier-free guidance (\(\gamma=3.0\)) is used during inference; a sampling sketch follows this list.
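For reference, a minimal Euler sampler for a rectified flow with classifier-free guidance; the `model(z, t, cond)` velocity-prediction interface and the noise-to-data time convention are assumptions, with \(\gamma=3.0\) taken from the note.

```python
import torch

@torch.no_grad()
def sample(model, shape, cond, steps=50, gamma=3.0):
    """Euler integration of a rectified flow from noise (t=0) to data (t=1),
    mixing conditional/unconditional velocities for classifier-free guidance."""
    z = torch.randn(shape)
    ts = torch.linspace(0.0, 1.0, steps + 1)
    for i in range(steps):
        t = ts[i].expand(shape[0])
        v_c = model(z, t, cond)            # conditional velocity
        v_u = model(z, t, None)            # condition dropped
        v = v_u + gamma * (v_c - v_u)      # guidance, gamma = 3.0
        z = z + (ts[i + 1] - ts[i]) * v    # Euler step along the flow
    return z
```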

Key Experimental Results

BIMO Dataset

  • 38,944 motion sequences totaling 3,682,790 frames.
  • Sources: DeformingThings4D (1,770) + ObjaverseV1 (10,550) + ObjaverseXL (26,624).
  • Text Annotations: Manual labels for DeformingThings4D + GPT-5 auto-labels for Objaverse (including iterative verification by an inspector).

Main Results

| Method | OC↑ | SC↑ | AQ↑ | DD↑ | TA (User)↑ | MP (User)↑ | ME (User)↑ | Time↓ | VRAM↓ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GVFDiffusion | 0.167 | 0.920 | 0.505 | 0.650 | 2.34 | 2.30 | 2.44 | 2.1 min | 14.1 GB |
| AnimateAnyMesh | 0.155 | 0.951 | 0.514 | 0.100 | 2.31 | 2.69 | 2.44 | 16.8 s | 3.1 GB |
| V2M4 | 0.175 | 0.876 | 0.478 | 0.750 | 2.88 | 2.71 | 3.05 | 1.7 h | 48.4 GB |
| Ours | 0.187 | 0.948 | 0.529 | 0.800 | 4.10 | 4.06 | 4.05 | 4.4 s | 1.2 GB |
  • User study metrics lead significantly (approx. 4.0 vs. 2.9 for second place) with the lowest standard deviation.
  • Generation is 3.8× faster than AnimateAnyMesh while using only 1.2 GB of VRAM.
  • When mesh vertex counts grow from 9K to 24K, BiMotion's time and VRAM remain nearly constant, whereas AnimateAnyMesh's grow linearly.

Ablation Study

| Configuration | Recon. Error (\(\times 10^{-2}\)) |
| --- | --- |
| w/o B-spline, w/o all modules | 3.237 |
| w/o B-spline, w/ NF/Corr/Rigid | 2.674 |
| w/ B-spline, w/o NF | 1.328 |
| w/ B-spline, w/o Control-PE | 1.648 |
| w/ B-spline, w/o Corr | 1.303 |
| w/ B-spline, w/o Rigid | 1.349 |
| Full model | 1.078 |
  • B-spline representation provides the largest gain in reconstruction quality (3.237 → 1.328).
  • Normal Fusion contributes significantly to spatial discrimination (1.328 → 1.078).
  • Laplacian regularization outperforms Ridge regularization for short sequences (\(T < k\)).

Highlights & Insights

  • Elegant Representation Design: B-splines convert variable-length motion into a fixed set of control points, resolving the fundamental tension between variable-length sequences and fixed-capacity models.
  • Extreme Efficiency: 4.4s generation and 1.2 GB VRAM consumption far exceed other methods and are insensitive to mesh complexity.
  • Topological Robustness: Dense-point training + Normal Fusion makes the method independent of specific mesh topology; the same model produces consistent motion for inputs with different remeshing.
  • Multi-level Embedding Innovation: Control-PE inspired by wavelet decomposition is significantly superior to standard frequency encoding.
  • High-quality Data Pipeline: construction of a 39K-scale motion dataset with rich annotations, featuring an auto-labeling pipeline with inspector-based iterative verification.

Limitations & Future Work

  • Limited expressiveness for high-frequency, complex motions (e.g., rapid vibration), which would require more control points.
  • Assumes fixed-topology meshes; does not support motion with topological changes (e.g., splitting, fluids).
  • Dependent on large-scale high-quality dynamic 3D data and compute.
  • Motions like walking may manifest as "stepping in place" due to a lack of global displacement modeling.

Comparison with Related Methods

| Method | Motion Representation | Input Condition | Variable Length | Feed-forward |
| --- | --- | --- | --- | --- |
| AnimateAnyMesh | Per-frame vertex tokens | Text + Mesh | ✗ (fixed crop) | ✓ |
| GVFDiffusion | 3D Gaussians | Video | | |
| V2M4 | Monocular reconstruction | Video | | ✗ (optimization) |
| DNF | 4D INR | Unconditional | | |
| Puppeteer | Skeleton-driven | Video + Mesh | | |
| Ours | B-spline control points | Text + Mesh | ✓ | ✓ |
  • AnimateAnyMesh is the most direct competitor, using Text + Mesh for feed-forward generation; BiMotion breaks its fixed-frame cropping limit via B-splines.
  • Video-conditioned methods (GVFDiffusion, V2M4) suffer from video quality issues and low efficiency.
  • Skeletal methods (Puppeteer) require precise rigging and generalize poorly to diverse characters.

Rating

  • Novelty: ⭐⭐⭐⭐ — B-spline motion representation + multi-level embedding are novel and sound designs.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ — VBench + User Study + Comprehensive Ablation + Multi-baseline Comparison + Efficiency Analysis.
  • Writing Quality: ⭐⭐⭐⭐ — Clear logic, complete mathematical derivations, and rich visualizations.
  • Value: ⭐⭐⭐⭐ — Contributions at the representation level are universal and transferable to other motion generation tasks.