ActionMesh: Animated 3D Mesh Generation with Temporal 3D Diffusion¶

Conference: CVPR 2026
arXiv: 2601.16148
Code: Project Page
Area: 3D Vision / 4D Generation
Keywords: animated 3D mesh generation, temporal 3D diffusion, topology-consistent, rig-free, feed-forward

TL;DR¶

ActionMesh minimally extends a pretrained 3D diffusion model with a temporal axis (temporal 3D diffusion), then employs a temporal 3D autoencoder to convert independent shape sequences into topology-consistent animated meshes. The method generates production-quality animated 3D meshes from diverse inputs (video, text, or 3D mesh) in just 2 minutes, achieving state-of-the-art performance in both geometric accuracy and temporal consistency.

Background & Motivation¶

Background: Automatic generation of animated 3D objects is a core demand in gaming, film, and AR/VR, yet existing methods suffer from three major limitations.

Limitations of Prior Work:
- Input constraints: Most methods are tied to specific input modalities and object categories.
- Slow speed: They rely on per-scene optimization taking 30–45 minutes (DreamMesh4D, V2M4, LIM).
- Insufficient quality: Results do not meet production standards (e.g., Gaussian Splatting lacks fixed topology and texture mapping support).

Key Challenge: How can one achieve fast, topology-consistent 4D generation without sacrificing quality?

Key Insight: Following the recipe of early video generation models, which extended image models along time, a pretrained 3D diffusion model can be minimally extended with a temporal axis, reusing powerful 3D priors to compensate for the scarcity of 4D animation data.

Core Idea: Decouple "3D generation" from "animation prediction": first generate a synchronized sequence of per-frame 3D shapes whose topologies are independent of each other, then convert that sequence into deformations of a single reference mesh.

Method¶

Overall Architecture¶

Stage I: Input video → reference frame processed by image-to-3D to obtain a reference mesh → temporal 3D diffusion model generates a synchronized 4D mesh sequence (without topological consistency).
Stage II: Temporal 3D autoencoder → converts the independent mesh sequence into per-frame vertex offsets of the reference mesh → topology-consistent animated 3D mesh.

Key Designs¶

Temporal 3D Diffusion Model (Stage I): Built on the 3D latent diffusion framework of 3DShape2VecSet/TripoSG with two minimal modifications:
- Inflated Attention: Self-attention layers are extended to cross-frame attention, so that tokens from all frames attend to each other: \(\text{infattn}(\mathbf{X}) = \text{reshape}^{-1}(\text{selfattn}(\text{reshape}(\mathbf{X})))\), where reshape flattens \(N \times T \times D\) into \(1 \times NT \times D\) and \(\text{reshape}^{-1}\) restores the original shape. Rotary Position Embedding (RoPE) is added to inject relative inter-frame position information and reduce jitter.
- Masked Generation: During training, a subset of latents is randomly kept noise-free (flow step set to 0); at inference, the latents of known 3D shapes can be fixed.
- Design Motivation: Inspired by MVDream's multi-view generation paradigm; inflated attention reuses pretrained weights and requires only fine-tuning; masked generation enables conditioning on known 3D mesh constraints.
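
The inflated attention formula above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: it uses a single attention head, omits RoPE, and the names `self_attention`, `inflated_attention`, `w_q`, `w_k`, and `w_v` are hypothetical. The input follows the \(N \times T \times D\) layout used in the text.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """Single-head self-attention over the token axis of x: (B, L, D)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

def inflated_attention(x, w_q, w_k, w_v):
    """x: (N, T, D). Flatten to (1, N*T, D), attend, restore (N, T, D)."""
    n, t, d = x.shape
    flat = x.reshape(1, n * t, d)                 # reshape
    out = self_attention(flat, w_q, w_k, w_v)     # selfattn over all tokens
    return out.reshape(n, t, d)                   # reshape^{-1}

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8, 16))
w_q, w_k, w_v = (rng.normal(size=(16, 16)) * 0.1 for _ in range(3))
y = inflated_attention(x, w_q, w_k, w_v)
```

Because the projection weights are untouched and only the tensor layout changes, the same pretrained attention weights can be reused, which is the point of the "requires only fine-tuning" remark above.
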
Temporal 3D Autoencoder (Stage II):
- Encoder: A frozen 3D encoder \(\mathcal{E}_{\text{3D}}\) independently encodes each frame's point cloud to produce a latent sequence.
- Decoder \(\mathcal{D}_{\text{4D}}\): Takes the full latent sequence as input and outputs a displacement field from the reference mesh vertices to each target timestep.
  - Query points are reference mesh vertex positions augmented with normals (normals help disambiguate points that are topologically distant but spatially close).
  - Timestep pairs \((t_i, t_j)\) are injected via Fourier encoding as additional tokens.
  - Inflated attention + RoPE are also applied to ensure cross-frame consistency.
- Design Motivation: Reformulates the traditionally optimization-based problem of mapping an independent mesh sequence to a deformation field as feed-forward inference.
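
The decoder inputs and output described above can be illustrated with a small sketch. Everything here is an assumption made for illustration: the helper names (`fourier_encode`, `build_queries`, `animate`), the number of frequencies, and the exact frequency schedule are not taken from the paper, and the learned decoder \(\mathcal{D}_{\text{4D}}\) itself is replaced by a placeholder displacement array.

```python
import numpy as np

def fourier_encode(t, num_freqs=4):
    """Map times to [sin(2^k * pi * t), cos(2^k * pi * t)] features (assumed schedule)."""
    t = np.asarray(t, dtype=np.float64)[..., None]
    freqs = (2.0 ** np.arange(num_freqs)) * np.pi
    return np.concatenate([np.sin(t * freqs), np.cos(t * freqs)], axis=-1)

def build_queries(vertices, normals):
    """Concatenate positions (V, 3) and normals (V, 3) into (V, 6) query points."""
    return np.concatenate([vertices, normals], axis=-1)

def animate(ref_vertices, displacements):
    """ref_vertices: (V, 3); displacements: (T, V, 3) -> per-frame vertices (T, V, 3)."""
    return ref_vertices[None] + displacements

# Toy reference mesh: 5 vertices with normals pointing along +z.
ref_vertices = np.zeros((5, 3))
normals = np.tile([0.0, 0.0, 1.0], (5, 1))
queries = build_queries(ref_vertices, normals)

# Extra token for a timestep pair (t_i, t_j), here (0.0, 0.5).
pair_token = fourier_encode([0.0, 0.5]).reshape(-1)

# Placeholder for the decoder output: one displacement field per target frame.
displacements = np.ones((3, 5, 3))
frames = animate(ref_vertices, displacements)
```

Since every frame is the same vertex array moved by an offset, the face list of the reference mesh is valid for all frames, which is what makes the output topology-consistent.
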

Loss & Training¶

Stage I: Flow matching loss, computed only on the latents to be generated; latents kept noise-free as conditioning are excluded from the loss.
Stage II: MSE supervision on the deformation field.
The two stages are trained independently and chained at inference.
Overall inference time: about 2 minutes for a 16-frame video, roughly a 10× speedup over prior methods.
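
The Stage I objective with masked generation can be sketched as follows. This is a hedged illustration rather than the paper's training code. It assumes a linear interpolation path between data and noise, a convention in which flow step 0 corresponds to clean data (matching "flow step set to 0" for noise-free latents), and one shared flow step for all frames being generated. `model` is a hypothetical callable that predicts the velocity.

```python
import numpy as np

def masked_flow_matching_loss(model, latents, known, rng):
    """latents: (N, T, D) clean latents; known: (N,) bool, True = kept noise-free."""
    noise = rng.normal(size=latents.shape)
    # Assumption: one shared flow step for generated frames; known frames use step 0.
    t_scalar = rng.uniform()
    t = np.where(known, 0.0, t_scalar)[:, None, None]
    x_t = (1.0 - t) * latents + t * noise      # known frames stay exactly clean
    target = noise - latents                   # velocity of the linear path
    pred = model(x_t, t)
    per_frame = ((pred - target) ** 2).mean(axis=(1, 2))
    generated = ~known
    if not generated.any():
        return 0.0                             # nothing to supervise
    return per_frame[generated].mean()         # loss only on generated frames

rng = np.random.default_rng(0)
latents = rng.normal(size=(4, 8, 16))
known = np.array([True, False, False, False])  # e.g. the reference shape is given

def dummy_model(x_t, t):
    # Stand-in for the diffusion network; always predicts zero velocity.
    return np.zeros_like(x_t)

loss = masked_flow_matching_loss(dummy_model, latents, known, rng)
```

At inference the same mechanism applies in reverse: frames with known 3D shapes keep their clean latents fixed while the remaining frames are integrated from noise.
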

Key Experimental Results¶

Main Results (ActionBench)¶

| Method | Inference Time | CD-3D↓ | CD-4D↓ | CD-M↓ |
|---|---|---|---|---|
| DreamMesh4D | 35 min | 0.104 | 0.152 | 0.265 |
| LIM | 15 min | 0.089 | 0.126 | 0.243 |
| V2M4 | 35 min | 0.068 | 0.340 | 0.616 |
| ShapeGen4D | 15 min | 0.056 | 0.170 | 0.348 |
| TripoSG (per-frame) | 2 min | 0.056 | 0.184 | - |
| ActionMesh | 2 min | 0.053 | 0.081 | 0.148 |
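
The CD columns are presumably Chamfer distances between predicted and ground-truth geometry; this summary does not spell out the protocol (point sampling, alignment, or how CD-4D and CD-M aggregate over time), so the sketch below only shows the basic symmetric Chamfer distance between two point sets. The function name and the unsquared, mean-based variant are assumptions.

```python
import numpy as np

def chamfer_distance(a, b):
    """Symmetric Chamfer distance between point sets a: (P, 3) and b: (Q, 3)."""
    # Pairwise Euclidean distances, shape (P, Q).
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    # Mean nearest-neighbour distance in both directions.
    return d.min(axis=1).mean() + d.min(axis=0).mean()

pred = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
gt = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.5]])
cd = chamfer_distance(pred, gt)
```

Lower is better, matching the ↓ in the table headers.
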

Ablation Study¶

| Configuration | CD-3D↓ | CD-4D↓ | CD-M↓ | Note |
|---|---|---|---|---|
| Full model | 0.050 | 0.069 | 0.137 | Best |
| w/o Stage II | 0.050 | 0.069 | - | Stage II preserves 3D quality |
| w/o Stage I & II | 0.050 | 0.187 | - | Stage I is critical for 4D |
| Craftsman backbone | 0.072 | 0.117 | 0.216 | Framework is backbone-agnostic |

Key Findings¶

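Relative to the strongest baseline on these metrics (LIM), CD-4D improves by about 35% (0.081 vs. 0.126) and CD-M by about 39% (0.148 vs. 0.243), while inference time drops from 15–35 minutes to 2 minutes (roughly a 10× speedup).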
Per-frame TripoSG achieves comparable CD-3D to ActionMesh (0.056 vs. 0.053) but falls significantly behind in CD-4D (0.184 vs. 0.081), confirming that temporal consistency is the key contribution.
Stage II does not degrade 3D quality (CD-3D unchanged) while providing topology consistency.
The method generalizes well to real DAVIS videos despite being trained exclusively on synthetic data.
Motion transfer is a notable capability: the flying motion of a bird can be transferred to a dragon.
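
The relative gains quoted in these findings can be recomputed from the main results table. The short script below is only a sanity check on the arithmetic; the helper name `relative_improvement` is hypothetical.

```python
def relative_improvement(baseline, ours):
    """Fractional reduction of a lower-is-better metric."""
    return (baseline - ours) / baseline

# ActionMesh vs. LIM, the best baseline on CD-4D and CD-M.
cd4d_gain = relative_improvement(0.126, 0.081)
cdm_gain = relative_improvement(0.243, 0.148)

# Speedup of a 2-minute run over the 15- and 35-minute baselines.
speedups = [15 / 2, 35 / 2]

print(f"CD-4D: {cd4d_gain:.1%}, CD-M: {cdm_gain:.1%}, speedup: {speedups}")
```

With the rounded table values this gives roughly 35.7% and 39.1%, and speedups of 7.5× and 17.5×, which brackets the "10×" figure.
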

Highlights & Insights¶

Minimal modification strategy: Only inflated attention and masked generation are added to the pretrained 3D diffusion model, maximally reusing 3D priors.
Topology consistency + rig-free are critical production requirements: texture propagation and retargeting become trivial.
Decoupling generation from animation is an elegant simplification that reduces the complexity of the 4D problem.
Motion transfer is an emergent capability: masked generation naturally supports {3D + video} → animation conditioning.

Limitations & Future Work¶

Topological changes: The fixed-topology assumption cannot handle topological changes during deformation (e.g., splitting or merging).
Severe occlusion: Heavy occlusion in the reference frame or during motion may cause reconstruction failures.
The method's quality is bounded by that of the underlying image-to-3D model.
ActionBench is relatively small (128 animated scenes); larger-scale benchmarks are needed.

The term "temporal 3D diffusion" precisely distinguishes this approach from "4D diffusion" (multi-view extension).
The methodology parallels the extension of image models to video models (adding temporal attention + fine-tuning).
The generality of the VecSet architecture (3DShape2VecSet → TripoSG → CLAY) makes such temporal extensions broadly applicable.

Rating¶

Novelty: ⭐⭐⭐⭐ — The idea of minimally extending 3D diffusion to the temporal domain is clear and elegant.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Quantitative benchmarks, qualitative comparisons, ablations, real-video generalization, and motion transfer; highly comprehensive.
Writing Quality: ⭐⭐⭐⭐⭐ — Terminology is clearly distinguished (4D mesh vs. animated 3D mesh); structure is concise.
Value: ⭐⭐⭐⭐⭐ — Achieves speed, quality, and topology consistency simultaneously; production-ready.