ActionMesh: Animated 3D Mesh Generation with Temporal 3D Diffusion¶
Conference: CVPR 2026 arXiv: 2601.16148 Code: Project Page Area: 3D Vision / 4D Generation Keywords: animated 3D mesh generation, temporal 3D diffusion, topology-consistent, rig-free, feed-forward
TL;DR¶
ActionMesh minimally extends a pretrained 3D diffusion model with a temporal axis (temporal 3D diffusion), then employs a temporal 3D autoencoder to convert independent shape sequences into topology-consistent animated meshes. The method generates production-quality animated 3D meshes from diverse inputs (video, text, or 3D mesh) in just 2 minutes, achieving state-of-the-art performance in both geometric accuracy and temporal consistency.
Background & Motivation¶
Background: Automatic generation of animated 3D objects is a core demand in gaming, film, and AR/VR, yet existing methods suffer from three major limitations.
Limitations of Prior Work:

- Input constraints: Most methods are tied to specific input modalities and object categories.
- Slow speed: They rely on per-scene optimization taking 30–45 minutes (DreamMesh4D, V2M4, LIM).
- Insufficient quality: Results do not meet production standards (e.g., Gaussian Splatting lacks fixed topology and texture mapping support).
Key Challenge: How can one achieve fast, topology-consistent 4D generation without sacrificing quality?
Key Insight: Inspired by early video generation models — a pretrained 3D diffusion model can be minimally extended with a temporal axis, reusing powerful 3D priors to compensate for the scarcity of 4D animation data.
Core Idea: Decouple "3D generation" from "animation prediction" — first generate synchronized independent 3D shape sequences, then convert them into deformations of a reference mesh.
Method¶
Overall Architecture¶
- Stage I: Input video → reference frame processed by image-to-3D to obtain a reference mesh → temporal 3D diffusion model generates a synchronized 4D mesh sequence (without topological consistency).
- Stage II: Temporal 3D autoencoder converts the independent mesh sequence into per-frame vertex offsets of the reference mesh, yielding a topology-consistent animated 3D mesh.
Key Designs¶
- Temporal 3D Diffusion Model (Stage I): Built on the 3D latent diffusion framework of 3DShape2VecSet/TripoSG with two minimal modifications:
- Inflated Attention: Self-attention layers are extended to cross-frame attention, allowing tokens from all frames to attend to each other: \(\text{infattn}(\mathbf{X}) = \text{reshape}^{-1}(\text{selfattn}(\text{reshape}(\mathbf{X})))\), where the reshape operation flattens \(N \times T \times D\) into \(1 \times NT \times D\). Rotary Position Embedding (RoPE) is added to inject relative inter-frame position information and reduce jitter.
- Masked Generation: During training, a subset of latents is randomly kept noise-free (flow step set to 0); at inference, the latents of known 3D shapes can be fixed.
- Design Motivation: Inspired by MVDream's multi-view generation paradigm; inflated attention reuses pretrained weights and requires only fine-tuning; masked generation enables conditioning on known 3D mesh constraints.
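The inflated-attention wrapper can be sketched in a few lines (a minimal single-head NumPy sketch, not the authors' implementation; `self_attn` and the `wq`/`wk`/`wv` projection weights are illustrative stand-ins for the pretrained layers, and RoPE is omitted for brevity):

```python
import numpy as np

def self_attn(x, wq, wk, wv):
    """Single-head scaled dot-product self-attention over the token axis."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

def inflated_attn(x, wq, wk, wv):
    """Inflated attention: flatten (B, T, N, D) to (B, T*N, D) so tokens of
    all T frames attend to each other, then restore the frame axis."""
    b, t, n, d = x.shape
    flat = x.reshape(b, t * n, d)                  # reshape
    out = self_attn(flat, wq, wk, wv)              # pretrained weights reused as-is
    return out.reshape(b, t, n, d)                 # reshape^{-1}

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 4, 8, 16))             # B=2, T=4 frames, N=8 tokens
wq, wk, wv = (0.1 * rng.standard_normal((16, 16)) for _ in range(3))
y = inflated_attn(x, wq, wk, wv)                   # same shape, cross-frame mixing
```

Because the wrapper only reshapes around an unchanged self-attention call, the pretrained 3D weights can be reused verbatim, which is why only fine-tuning is needed.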
- Temporal 3D Autoencoder (Stage II):
- Encoder: A frozen 3D encoder \(\mathcal{E}_{\text{3D}}\) independently encodes each frame's point cloud to produce a latent sequence.
- Decoder \(\mathcal{D}_{\text{4D}}\): Takes the full latent sequence as input and outputs a displacement field from the reference mesh vertices to each target timestep.
- Query points are reference mesh vertex positions augmented with normals (normals help disambiguate points that are topologically distant but spatially close).
- Timestep pairs \((t_i, t_j)\) are injected via Fourier encoding as additional tokens.
- Inflated attention + RoPE are also applied to ensure cross-frame consistency.
- Design Motivation: Reformulates the traditionally optimization-based problem of mapping an independent mesh sequence to a deformation field as feed-forward inference.
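The decoder's query and timestep-token construction can be sketched as follows (illustrative NumPy, assuming a standard sin/cos Fourier encoding; the function names are hypothetical, not from the paper's code):

```python
import numpy as np

def fourier_encode(t, num_bands=8):
    """Map scalar timesteps to sin/cos Fourier features (injected as extra tokens)."""
    freqs = 2.0 ** np.arange(num_bands)
    angles = np.outer(np.atleast_1d(t), freqs) * np.pi
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)

def build_queries(vertices, normals):
    """Decoder queries: reference-mesh vertex positions concatenated with
    normals; normals separate points that are spatially close but
    topologically distant (e.g. the two sides of a thin surface)."""
    return np.concatenate([vertices, normals], axis=-1)   # (V, 6)

verts = np.array([[0.0, 0.0, 0.0], [0.0, 0.0, 1e-3]])     # nearly coincident points
norms = np.array([[0.0, 0.0, 1.0], [0.0, 0.0, -1.0]])     # opposite orientations
q = build_queries(verts, norms)                            # queries remain distinct
pair_tokens = fourier_encode([0.0, 0.5], num_bands=4)      # (t_i, t_j) pair tokens
```

The two example vertices are almost identical in position, yet their augmented queries differ strongly through the normals, which is exactly the disambiguation the paper motivates.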
Loss & Training¶
- Stage I: Flow matching loss, computed only on masked (to-be-generated) latents.
- Stage II: MSE supervision on the deformation field.
- The two stages are trained independently and chained at inference.
- Overall inference time: 2 minutes for a 16-frame video, roughly a 10× speedup over optimization-based baselines.
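The masked flow-matching objective of Stage I can be sketched as below (a minimal sketch assuming a rectified-flow straight-line velocity target; function and variable names are illustrative, not the authors' code):

```python
import numpy as np

def masked_flow_matching_loss(v_pred, x0, x1, mask):
    """Flow matching regresses the straight-line velocity x1 - x0.
    The squared error is averaged only over masked (to-be-generated)
    latents, so latents of known, fixed 3D shapes contribute no loss."""
    target = x1 - x0                          # rectified-flow velocity target
    sq_err = (v_pred - target) ** 2           # (B, L, D)
    m = mask[..., None]                       # broadcast mask over feature dim
    return (sq_err * m).sum() / (m.sum() * x0.shape[-1] + 1e-8)

rng = np.random.default_rng(0)
x0 = rng.standard_normal((1, 6, 4))           # noise latents
x1 = rng.standard_normal((1, 6, 4))           # clean shape latents
mask = np.array([[1, 1, 1, 0, 0, 0]], float)  # last 3 latents are known/fixed
perfect = masked_flow_matching_loss(x1 - x0, x0, x1, mask)
```

Note how errors on the unmasked latents are multiplied by zero, which is the mechanism that lets known 3D shapes be conditioned on at inference.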
Key Experimental Results¶
Main Results (ActionBench)¶
| Method | Inference Time | CD-3D↓ | CD-4D↓ | CD-M↓ |
|---|---|---|---|---|
| DreamMesh4D | 35min | 0.104 | 0.152 | 0.265 |
| LIM | 15min | 0.089 | 0.126 | 0.243 |
| V2M4 | 35min | 0.068 | 0.340 | 0.616 |
| ShapeGen4D | 15min | 0.056 | 0.170 | 0.348 |
| TripoSG (per-frame) | 2min | 0.056 | 0.184 | - |
| ActionMesh | 2min | 0.053 | 0.081 | 0.148 |
Ablation Study¶
| Configuration | CD-3D↓ | CD-4D↓ | CD-M↓ | Note |
|---|---|---|---|---|
| Full model | 0.050 | 0.069 | 0.137 | Best |
| w/o Stage II | 0.050 | 0.069 | - | Stage II preserves 3D quality |
| w/o Stage I & II | 0.050 | 0.187 | - | Stage I is critical for 4D |
| Craftsman backbone | 0.072 | 0.117 | 0.216 | Framework is backbone-agnostic |
Key Findings¶
- CD-4D improves by 35% (0.081 vs. 0.126) and CD-M by 39% (0.148 vs. 0.243), with a 10× speed advantage.
- Per-frame TripoSG achieves comparable CD-3D to ActionMesh (0.056 vs. 0.053) but falls significantly behind in CD-4D (0.184 vs. 0.081), confirming that temporal consistency is the key contribution.
- Stage II does not degrade 3D quality (CD-3D unchanged) while providing topology consistency.
- The method generalizes well to real DAVIS videos despite being trained exclusively on synthetic data.
- Motion transfer is a notable capability: the flying motion of a bird can be transferred to a dragon.
Highlights & Insights¶
- Minimal modification strategy: Only inflated attention and masked generation are added to the pretrained 3D diffusion model, maximally reusing 3D priors.
- Topology consistency and rig-free operation are critical production requirements: they make texture propagation and retargeting trivial.
- Decoupling generation from animation is an elegant simplification that reduces the complexity of the 4D problem.
- Motion transfer is an emergent capability: masked generation naturally supports {3D + video} → animation conditioning.
Limitations & Future Work¶
- Topological changes: The fixed-topology assumption cannot handle topological changes during deformation (e.g., splitting or merging).
- Severe occlusion: Heavy occlusion in the reference frame or during motion may cause reconstruction failures.
- The method's quality is bounded by that of the underlying image-to-3D model.
- ActionBench is relatively small (128 animated scenes); larger-scale benchmarks are needed.
Related Work & Insights¶
- The term "temporal 3D diffusion" precisely distinguishes this approach from "4D diffusion" (multi-view extension).
- The methodology parallels the extension of image models to video models (adding temporal attention + fine-tuning).
- The generality of the VecSet architecture (3DShape2VecSet → TripoSG → CLAY) makes such temporal extensions broadly applicable.
Rating¶
- Novelty: ⭐⭐⭐⭐ — The idea of minimally extending 3D diffusion to the temporal domain is clear and elegant.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Quantitative benchmarks, qualitative comparisons, ablations, real-video generalization, and motion transfer; highly comprehensive.
- Writing Quality: ⭐⭐⭐⭐⭐ — Terminology is clearly distinguished (4D mesh vs. animated 3D mesh); structure is concise.
- Value: ⭐⭐⭐⭐⭐ — Achieves speed, quality, and topology consistency simultaneously; production-ready.