BiMotion: B-spline Motion for Text-guided Dynamic 3D Character Generation¶
Conference: CVPR 2026
arXiv: 2602.18873
Code: Project Page
Area: Image Generation / Dynamic 3D Generation
Keywords: B-spline, Motion Generation, Text-guided, 3D Character Animation, VAE-latent diffusion, Control Point Representation
TL;DR¶
BiMotion compresses variable-length motion sequences into a fixed number of control points using continuously differentiable B-spline curves. Combined with a specialized VAE and a flow-matching diffusion model, it achieves fast, expressive, and semantically complete text-guided dynamic 3D character generation, outperforming existing methods in both quality and efficiency.
Background & Motivation¶
- High Demand for Dynamic 3D Generation: There is an increasing demand for text-driven 3D character animation in games, film, and education. Decoupling motion generation from shape synthesis is the current mainstream paradigm.
- Fixed-length Input Bottleneck: Existing feed-forward methods (e.g., AnimateAnyMesh) utilize VAE-latent diffusion, which requires fixed-size inputs, forcing motion sequences to be cropped or uniformly downsampled.
- Semantic Loss from Cropping: Truncating variable-length sequences captures only isolated sub-actions (e.g., "rotating right") and fails to express the complete motion semantics described by users.
- Jitter from Downsampling: Uniform temporal downsampling leads to non-smooth and jittery motion results.
- Discrete Frame-by-frame Representation as Root Cause: Motion is inherently continuous; the frame count merely reflects the sampling rate and does not change the semantics, which calls for a continuous, compact parameterization.
- Lack of High-quality Annotated Data: Existing datasets lack pairings of diverse variable-length motion sequences with high-quality text descriptions.
Method¶
Overall Architecture¶
BiMotion adopts a VAE-latent diffusion architecture:

- Training Phase: Variable-length vertex displacement sequences are converted into a fixed number of control points via B-spline fitting → encoded into motion latents by a VAE → a flow-matching diffusion model learns conditional generation on top of them.
- Inference Phase: Given an initial mesh + text → the VAE encodes the initial shape → the diffusion model generates a motion latent → the VAE decodes it into control points → sequences of arbitrary length are produced via B-spline reprojection.
B-spline Motion Representation¶
- Each vertex's displacement trajectory is fitted independently with a uniform cubic B-spline (\(d=3\)) using \(k=16\) control points.
- Laplacian Regularized Solver: When \(k > T\) (short sequences), the system is underdetermined, so a second-order difference operator \(L\) is introduced for regularization. The closed-form solution is computed efficiently via Cholesky decomposition (\(< 1\) second for 200 frames and 50K vertices); a minimal sketch follows this list.
- Three advantages of B-splines: ① Continuous differentiability ensures natural trajectories ② Local controllability ③ Temporal re-parameterization supports arbitrary length sampling.
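A minimal sketch of the fit described above, assuming a clamped uniform knot vector and an illustrative regularization weight `lam` (neither is specified here); the solver form \((B^\top B + \lambda L^\top L)\,C = B^\top X\) with a Cholesky solve, and the arbitrary-length re-sampling, follow the bullets above:

```python
import numpy as np
from scipy.interpolate import BSpline
from scipy.linalg import cho_factor, cho_solve

def fit_control_points(traj, k_ctrl=16, degree=3, lam=1e-3):
    """traj: (T, 3) displacement trajectory of a single vertex.
    Returns (k_ctrl, 3) control points C minimizing
    ||B C - traj||^2 + lam * ||L C||^2, L = 2nd-order difference operator."""
    T = traj.shape[0]
    # Clamped uniform cubic knot vector on [0, 1]: k_ctrl + degree + 1 knots.
    n_interior = k_ctrl - degree - 1
    t = np.concatenate([np.zeros(degree + 1),
                        np.linspace(0.0, 1.0, n_interior + 2)[1:-1],
                        np.ones(degree + 1)])
    u = np.linspace(0.0, 1.0, T)                       # per-frame parameters
    B = BSpline.design_matrix(u, t, degree).toarray()  # (T, k_ctrl)
    L = np.diff(np.eye(k_ctrl), n=2, axis=0)           # (k_ctrl - 2, k_ctrl)
    A = B.T @ B + lam * (L.T @ L)        # regularizer keeps A SPD when k_ctrl > T
    C = cho_solve(cho_factor(A), B.T @ traj)           # closed form via Cholesky
    return C, t

def resample(C, t, num_frames, degree=3):
    """Temporal re-parameterization: evaluate the spline at any frame count."""
    u = np.linspace(0.0, 1.0, num_frames)
    return BSpline.design_matrix(u, t, degree).toarray() @ C  # (num_frames, 3)
```

Note that \(B\), \(L\), and hence the Cholesky factor depend only on \((T, k)\), not on the vertex, so one factorization can be shared across all trajectories of a mesh, which is what makes sub-second fitting for 50K vertices plausible.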
Key Designs¶
1. Normal Fusion:
    - Surface normals are encoded by an MLP and fused with point-coordinate features via point-wise cosine-similarity weights (see the first sketch after this list).
    - This effectively distinguishes motion components that are spatially close but belong to different mesh structures, and proves more stable than mesh-connectivity-based methods.
2. Multi-level Control Point Embedding (Control-PE):
    - Inspired by wavelet packet decomposition, a hierarchy of control-point levels \([17,15,13,11,9,7,5,4]\) is constructed.
    - High-frequency residuals are extracted level by level and concatenated with the coarsest-level coefficients; the whole transform reduces to a single matrix multiplication (see the second sketch after this list).
    - It significantly outperforms traditional frequency positional encoding at capturing fine motion details (e.g., a lion's tail wagging).
3. Cross-attention Spatial Compression:
    - FPS sampling compresses \(n=4096\) points into \(n'=512\) tokens.
    - The encoder uses 8 cross-attention layers; the decoder uses 8 self-attention layers.
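A hedged PyTorch sketch of the normal-fusion step in item 1; the exact fusion rule (a residual add weighted by cosine similarity) and all module names are my reading of the description above, not the authors' code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NormalFusion(nn.Module):
    """Fuse per-point coordinate features with MLP-encoded surface normals,
    weighting the normal branch by a point-wise cosine similarity."""
    def __init__(self, dim=256):
        super().__init__()
        self.normal_mlp = nn.Sequential(
            nn.Linear(3, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, coord_feat, normals):
        # coord_feat: (N, dim) point-coordinate features; normals: (N, 3)
        normal_feat = self.normal_mlp(normals)                    # (N, dim)
        w = F.cosine_similarity(coord_feat, normal_feat, dim=-1)  # (N,)
        return coord_feat + w.unsqueeze(-1) * normal_feat
```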
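And a sketch of Control-PE from item 2 as a single precomputed linear map. Only the level list and the "stack level-wise residuals plus the coarsest coefficients, computed as one matrix multiplication" structure come from the description; the linear-interpolation down/up-sampling between levels is an assumption, and how the first level (size 17) relates to the \(k=16\) fitted control points is not spelled out here, so treat the sizes as indicative:

```python
import numpy as np

def interp_matrix(n_src, n_dst):
    """Linear-interpolation resampling matrix of shape (n_dst, n_src)."""
    W = np.zeros((n_dst, n_src))
    pos = np.linspace(0.0, n_src - 1, n_dst)
    lo = np.floor(pos).astype(int)
    hi = np.minimum(lo + 1, n_src - 1)
    frac = pos - lo
    W[np.arange(n_dst), lo] += 1.0 - frac
    W[np.arange(n_dst), hi] += frac
    return W

def control_pe_matrix(levels=(17, 15, 13, 11, 9, 7, 5, 4)):
    """Stack per-level high-frequency residual extractors and the coarsest
    projector into one matrix, so the embedding is a single matmul."""
    blocks, P = [], np.eye(levels[0])            # P: input -> current level
    for a, b in zip(levels[:-1], levels[1:]):
        D, U = interp_matrix(a, b), interp_matrix(b, a)  # down / up between levels
        blocks.append((np.eye(a) - U @ D) @ P)   # high-frequency residual lost by coarsening
        P = D @ P                                # descend one level
    blocks.append(P)                             # coarsest-level coefficients
    return np.vstack(blocks)                     # applied along the control-point axis
```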
Loss & Training¶
| Loss | Purpose |
|---|---|
| \(\mathcal{L}_{Fit}\) (Charbonnier) | Reconstructs the input control points |
| \(\mathcal{L}_{Corr}\) (Correspondence) | Matches the original displacement trajectories after B-spline reprojection; speeds up early convergence |
| \(\mathcal{L}_{Rigid}\) (Local Rigidity) | Enforces local distance consistency between adjacent frames to preserve shape identity |
| \(\mathcal{L}_{KL}\) | Regularizes the latent distribution |
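A hedged torch sketch of three of the losses above; the Charbonnier \(\epsilon\), the reuse of the B-spline design matrix \(B\) from the fitting step, and the L1 form of the rigidity penalty are assumptions:

```python
import torch

def charbonnier(pred, target, eps=1e-6):
    """L_Fit: Charbonnier (smooth L1-like) penalty, here on control points."""
    return torch.sqrt((pred - target) ** 2 + eps ** 2).mean()

def correspondence_loss(pred_ctrl, gt_disp, B):
    """L_Corr: reproject predicted control points (k, 3) through the B-spline
    design matrix B (T, k) and match the original (T, 3) displacements."""
    return charbonnier(B @ pred_ctrl, gt_disp)

def local_rigidity_loss(verts, neighbor_idx):
    """L_Rigid: keep distances to local neighbors consistent across adjacent
    frames. verts: (T, N, 3); neighbor_idx: (N, K) local neighborhoods."""
    diff = verts[:, :, None, :] - verts[:, neighbor_idx, :]  # (T, N, K, 3)
    dist = diff.norm(dim=-1)                                 # (T, N, K)
    return (dist[1:] - dist[:-1]).abs().mean()
```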
Generation Model¶
- Based on rectified flow matching with a 12-layer DiT backbone.
- Initial latent \(\mathbf{z}_0\) is concatenated with the motion latent, then fused with text (CLIP ViT-L/14) and shape conditions through decoupled cross-attention.
- Classifier-free guidance (\(\gamma=3.0\)) is used during inference.
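A minimal rectified-flow sketch consistent with the bullets above; the linear noise-to-data path and velocity regression are standard rectified flow, while the `dit` call signature, the conditioning interface, and the Euler sampler with CFG are illustrative assumptions (only \(\gamma=3.0\) comes from the paper):

```python
import torch

def rf_training_step(dit, z1, cond):
    """Regress the velocity (z1 - z0) along z_t = (1 - t) z0 + t z1,
    with z0 Gaussian noise and z1 the VAE motion latent (B, tokens, dim)."""
    z0 = torch.randn_like(z1)
    t = torch.rand(z1.shape[0], device=z1.device).view(-1, 1, 1)
    zt = (1.0 - t) * z0 + t * z1
    v_pred = dit(zt, t.flatten(), cond)        # DiT predicts the velocity field
    return ((v_pred - (z1 - z0)) ** 2).mean()

@torch.no_grad()
def rf_sample(dit, shape, cond, null_cond, steps=50, gamma=3.0):
    """Euler integration from noise with classifier-free guidance."""
    z = torch.randn(shape)
    for i in range(steps):
        t = torch.full((shape[0],), i / steps)
        v_c = dit(z, t, cond)                  # conditional velocity
        v_u = dit(z, t, null_cond)             # unconditional velocity
        z = z + (1.0 / steps) * (v_u + gamma * (v_c - v_u))  # CFG combine
    return z
```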
Key Experimental Results¶
Dataset BIMO¶
- 38,944 motion sequences totaling 3,682,790 frames.
- Sources: DeformingThings4D (1,770) + ObjaverseV1 (10,550) + ObjaverseXL (26,624).
- Text Annotations: Manual labels for DeformingThings4D + GPT-5 auto-labels for Objaverse (including iterative verification by an inspector).
Main Results¶
| Method | OC↑ | SC↑ | AQ↑ | DD↑ | TA (User)↑ | MP (User)↑ | ME (User)↑ | Time↓ | VRAM↓ |
|---|---|---|---|---|---|---|---|---|---|
| GVFDiffusion | 0.167 | 0.920 | 0.505 | 0.650 | 2.34 | 2.30 | 2.44 | 2.1min | 14.1GB |
| AnimateAnyMesh | 0.155 | 0.951 | 0.514 | 0.100 | 2.31 | 2.69 | 2.44 | 16.8s | 3.1GB |
| V2M4 | 0.175 | 0.876 | 0.478 | 0.750 | 2.88 | 2.71 | 3.05 | 1.7h | 48.4GB |
| Ours | 0.187 | 0.948 | 0.529 | 0.800 | 4.10 | 4.06 | 4.05 | 4.4s | 1.2GB |
- User study metrics lead significantly (approx. 4.0 vs. 2.9 for second place) with the lowest standard deviation.
- Generation is 3.8× faster than AnimateAnyMesh (4.4 s vs. 16.8 s) while using only 1.2 GB of VRAM.
- When mesh vertices increase from 9K to 24K, BiMotion's time/VRAM remains nearly constant, whereas AnimateAnyMesh grows linearly.
Ablation Study¶
| Configuration | Recon. Error (\(\times 10^{-2}\)) |
|---|---|
| w/o B-spline, w/o all other components | 3.237 |
| w/o B-spline, w/ NF/Corr/Rigid | 2.674 |
| w/ B-spline w/o NF | 1.328 |
| w/ B-spline w/o Control-PE | 1.648 |
| w/ B-spline w/o Corr | 1.303 |
| w/ B-spline w/o Rigid | 1.349 |
| Full Model | 1.078 |
- B-spline representation provides the largest gain in reconstruction quality (3.237 → 1.328).
- Normal Fusion contributes significantly to spatial discrimination (1.328 → 1.078).
- Laplacian regularization outperforms Ridge regularization for short sequences (\(T < k\)).
Highlights & Insights¶
- Elegant Representation Design: B-splines convert variable-length motion into a fixed set of control points, neatly resolving the fundamental tension between variable-length sequences and fixed-capacity models.
- Extreme Efficiency: 4.4 s generation time and 1.2 GB VRAM, far ahead of all other methods and insensitive to mesh complexity.
- Topological Robustness: Dense-point training + Normal Fusion makes the method independent of specific mesh topology; the same model produces consistent motion for inputs with different remeshing.
- Multi-level Embedding Innovation: Control-PE inspired by wavelet decomposition is significantly superior to standard frequency encoding.
- High-quality Data Pipeline: Construction of a ~39K-sequence motion dataset with rich annotations, featuring an auto-labeling pipeline with inspector-based iterative verification.
Limitations & Future Work¶
- Limited expressiveness for high-frequency complex motions (e.g., rapid vibration), requiring an increased number of control points.
- Assumes fixed-topology meshes; does not support motion with topological changes (e.g., splitting, fluids).
- Dependent on large-scale high-quality dynamic 3D data and compute.
- Motions like walking may manifest as "stepping in place" due to a lack of global displacement modeling.
Related Work & Insights¶
| Method | Motion Representation | Input Condition | Variable Length | Feed-forward |
|---|---|---|---|---|
| AnimateAnyMesh | Per-frame vertex token | Text + Mesh | ✗ (Fixed Crop) | ✓ |
| GVFDiffusion | 3D Gaussian | Video | ✗ | ✓ |
| V2M4 | Monocular Reconstruction | Video | ✗ (Optimization) | ✗ |
| DNF | 4D INR | Unconditional | ✓ | ✗ |
| Puppeteer | Skeleton-driven | Video + Mesh | ✗ | ✗ |
| Ours | B-spline Control Points | Text + Mesh | ✓ | ✓ |
- AnimateAnyMesh is the most direct competitor, using Text + Mesh for feed-forward generation; BiMotion breaks its fixed-frame cropping limit via B-splines.
- Video-conditioned methods (GVFDiffusion, V2M4) suffer from video quality issues and low efficiency.
- Skeletal methods (Puppeteer) require precise rigging and generalize poorly to diverse characters.
Rating¶
- Novelty: ⭐⭐⭐⭐ — B-spline motion representation + multi-level embedding are novel and sound designs.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — VBench + User Study + Comprehensive Ablation + Multi-baseline Comparison + Efficiency Analysis.
- Writing Quality: ⭐⭐⭐⭐ — Clear logic, complete mathematical derivations, and rich visualizations.
- Value: ⭐⭐⭐⭐ — Contributions at the representation level are universal and transferable to other motion generation tasks.
Related Papers¶
- [CVPR 2026] InterEdit: Navigating Text-Guided Multi-Human 3D Motion Editing
- [CVPR 2026] Vinedresser3D: Agentic Text-guided 3D Editing
- [AAAI 2026] ReAlign: Text-to-Motion Generation via Step-Aware Reward-Guided Alignment
- [ICCV 2025] TeRA: Rethinking Text-guided Realistic 3D Avatar Generation
- [ECCV 2024] Local Action-Guided Motion Diffusion Model for Text-to-Motion Generation