ActionMesh: Animated 3D Mesh Generation with Temporal 3D Diffusion

Conference: CVPR 2026 · arXiv: 2601.16148 · Code: Project Page · Area: 3D Vision / 4D Generation

Keywords: animated 3D mesh generation, temporal 3D diffusion, topology-consistent, rig-free, feed-forward

TL;DR

ActionMesh minimally extends a pretrained 3D diffusion model with a temporal axis (temporal 3D diffusion), then employs a temporal 3D autoencoder to convert independent shape sequences into topology-consistent animated meshes. The method generates production-quality animated 3D meshes from diverse inputs (video, text, or 3D mesh) in just 2 minutes, achieving state-of-the-art performance in both geometric accuracy and temporal consistency.

Background & Motivation

Background: Automatic generation of animated 3D objects is a core demand in gaming, film, and AR/VR, yet existing methods suffer from three major limitations.

Limitations of Prior Work:

  • Input constraints: Most methods are tied to specific input modalities and object categories.
  • Slow speed: They rely on per-scene optimization taking 30–45 minutes (DreamMesh4D, V2M4, LIM).
  • Insufficient quality: Results do not meet production standards (e.g., Gaussian Splatting lacks fixed topology and texture-mapping support).

Key Challenge: How can one achieve fast, topology-consistent 4D generation without sacrificing quality?

Key Insight: Inspired by early video generation models, a pretrained 3D diffusion model can be minimally extended with a temporal axis, reusing powerful 3D priors to compensate for the scarcity of 4D animation data.

Core Idea: Decouple "3D generation" from "animation prediction" — first generate synchronized independent 3D shape sequences, then convert them into deformations of a reference mesh.

Method

Overall Architecture

Stage I: Input video → reference frame processed by image-to-3D to obtain a reference mesh → temporal 3D diffusion model generates a synchronized 4D mesh sequence (without topological consistency).

Stage II: Temporal 3D autoencoder converts the independent mesh sequence into per-frame vertex offsets of the reference mesh, yielding a topology-consistent animated 3D mesh.

Key Designs

  1. Temporal 3D Diffusion Model (Stage I): Built on the 3D latent diffusion framework of 3DShape2VecSet/TripoSG with two minimal modifications:

    • Inflated Attention: Self-attention layers are extended to cross-frame attention so that tokens from all frames attend to one another: \(\text{infattn}(\mathbf{X}) = \text{reshape}^{-1}(\text{selfattn}(\text{reshape}(\mathbf{X})))\), where the reshape operation flattens the \(N \times T \times D\) latent sequence into \(1 \times NT \times D\). Rotary Position Encoding (RoPE) injects relative inter-frame position information and reduces jitter (a minimal attention sketch is given after this list).
    • Masked Generation: During training, a subset of latents is randomly kept noise-free (flow step set to 0); at inference, the latents of known 3D shapes can be fixed.
    • Design Motivation: Inspired by MVDream's multi-view generation paradigm; inflated attention reuses pretrained weights and requires only fine-tuning; masked generation enables conditioning on known 3D mesh constraints.
  2. Temporal 3D Autoencoder (Stage II):

    • Encoder: A frozen 3D encoder \(\mathcal{E}_{\text{3D}}\) independently encodes each frame's point cloud to produce a latent sequence.
    • Decoder \(\mathcal{D}_{\text{4D}}\): Takes the full latent sequence as input and outputs a displacement field from the reference mesh vertices to each target timestep (a decoder sketch is given after this list).
    • Query points are reference mesh vertex positions augmented with normals (normals help disambiguate points that are topologically distant but spatially close).
    • Timestep pairs \((t_i, t_j)\) are injected via Fourier encoding as additional tokens.
    • Inflated attention + RoPE are also applied to ensure cross-frame consistency.
    • Design Motivation: Reformulates the traditionally optimization-based problem of mapping an independent mesh sequence to a deformation field as feed-forward inference.
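
To make the inflated-attention idea concrete, here is a minimal PyTorch sketch assuming a (B, T, N, D) latent layout; the module, its names, and the dimensions are illustrative rather than the authors' implementation, and the RoPE term is omitted for brevity.

```python
import torch
import torch.nn as nn


class InflatedSelfAttention(nn.Module):
    """Per-frame self-attention 'inflated' into cross-frame attention.

    In the paper the attention weights come from the pretrained 3D diffusion
    model; only the reshape wrapped around the call changes.
    """

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, N, D) -- B samples, T frames, N latent tokens, D channels
        B, T, N, D = x.shape
        x = x.reshape(B, T * N, D)       # reshape: flatten all frames into one token sequence
        out, _ = self.attn(x, x, x)      # unchanged self-attention, now spanning every frame
        return out.reshape(B, T, N, D)   # reshape^{-1}: restore the per-frame layout


# Toy usage: 4 frames of 512 latent tokens each.
x = torch.randn(1, 4, 512, 64)
y = InflatedSelfAttention(dim=64, num_heads=4)(x)   # y.shape == (1, 4, 512, 64)
```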
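
The Stage II decoder query can be sketched in the same spirit: reference-mesh vertices concatenated with their normals act as queries, the Fourier-encoded timestep pair is appended to the latent context as an extra token, and a small head predicts a per-vertex displacement. The inflated attention + RoPE over the latent sequence is omitted here, and every name, layer, and dimension below is a hypothetical stand-in, not the paper's architecture.

```python
import math

import torch
import torch.nn as nn


def fourier_encode(t: torch.Tensor, num_freqs: int = 8) -> torch.Tensor:
    """Fourier features for scalar timesteps, shape (..., 2 * num_freqs)."""
    freqs = 2.0 ** torch.arange(num_freqs, device=t.device)
    angles = t[..., None] * freqs * math.pi
    return torch.cat([angles.sin(), angles.cos()], dim=-1)


class DeformationDecoder(nn.Module):
    """Cross-attention decoder: mesh vertices (+ normals) query the latent
    sequence and return a displacement toward the target timestep."""

    def __init__(self, dim: int = 256, num_heads: int = 8, num_freqs: int = 8):
        super().__init__()
        self.num_freqs = num_freqs
        self.query_proj = nn.Linear(6, dim)              # xyz + normal -> query token
        self.time_proj = nn.Linear(4 * num_freqs, dim)   # Fourier-encoded (t_i, t_j) pair
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.to_offset = nn.Linear(dim, 3)               # per-vertex displacement

    def forward(self, verts, normals, latents, t_ref, t_tgt):
        # verts, normals: (B, V, 3); latents: (B, L, dim); t_ref, t_tgt: (B,)
        queries = self.query_proj(torch.cat([verts, normals], dim=-1))
        t_pair = torch.cat([fourier_encode(t_ref, self.num_freqs),
                            fourier_encode(t_tgt, self.num_freqs)], dim=-1)
        time_token = self.time_proj(t_pair)[:, None, :]       # (B, 1, dim) extra token
        context = torch.cat([latents, time_token], dim=1)
        out, _ = self.cross_attn(queries, context, context)
        return self.to_offset(out)                            # add to verts to deform the mesh
```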

Loss & Training

  • Stage I: Flow-matching loss, computed only on masked (to-be-generated) latents (see the sketch after this list).
  • Stage II: MSE supervision on the deformation field.
  • The two stages are trained independently and chained at inference.
  • Overall inference time: 2 minutes for a 16-frame video, roughly a 10× speedup over optimization-based baselines.
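
As a hedged sketch of how masked generation and the Stage I objective fit together, assuming a per-frame boolean mask and a velocity-predicting network; the call signature model(xt, t, cond) and all names below are illustrative, not the paper's code.

```python
import torch


def masked_flow_matching_loss(model, x0, cond, gen_mask):
    """Flow-matching loss restricted to masked (to-be-generated) latents.

    x0:       (B, T, N, D) clean latent sequence.
    gen_mask: (B, T) bool -- True for frames to generate; False for frames whose
              3D shape is known and kept noise-free (flow step fixed to 0).
    """
    noise = torch.randn_like(x0)
    t = torch.rand(x0.shape[0], x0.shape[1], 1, 1, device=x0.device)  # per-frame flow step
    t = t * gen_mask[..., None, None].float()          # known frames stay at flow step 0
    xt = (1.0 - t) * x0 + t * noise                    # straight path between data and noise
    velocity_target = noise - x0                       # flow-matching velocity target
    velocity_pred = model(xt, t[..., 0, 0], cond)      # hypothetical model call
    sq_err = (velocity_pred - velocity_target) ** 2    # (B, T, N, D)
    return sq_err[gen_mask].mean()                     # supervise only the generated frames
```

At inference, the same masking keeps the latents of any known 3D shape fixed (e.g., a user-provided mesh for motion transfer) while the remaining frames are denoised.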

Key Experimental Results

Main Results (ActionBench)

| Method | Inference Time | CD-3D↓ | CD-4D↓ | CD-M↓ |
|---|---|---|---|---|
| DreamMesh4D | 35 min | 0.104 | 0.152 | 0.265 |
| LIM | 15 min | 0.089 | 0.126 | 0.243 |
| V2M4 | 35 min | 0.068 | 0.340 | 0.616 |
| ShapeGen4D | 15 min | 0.056 | 0.170 | 0.348 |
| TripoSG (per-frame) | 2 min | 0.056 | 0.184 | – |
| ActionMesh | 2 min | 0.053 | 0.081 | 0.148 |

Ablation Study

| Configuration | CD-3D↓ | CD-4D↓ | CD-M↓ | Note |
|---|---|---|---|---|
| Full model | 0.050 | 0.069 | 0.137 | Best |
| w/o Stage II | 0.050 | 0.069 | – | Stage II preserves 3D quality |
| w/o Stage I & II | 0.050 | 0.187 | – | Stage I is critical for 4D |
| Craftsman backbone | 0.072 | 0.117 | 0.216 | Framework is backbone-agnostic |

Key Findings

  • CD-4D improves by 35% (0.081 vs. 0.126) and CD-M by 39% (0.148 vs. 0.243), with a 10× speed advantage.
  • Per-frame TripoSG achieves comparable CD-3D to ActionMesh (0.056 vs. 0.053) but falls significantly behind in CD-4D (0.184 vs. 0.081), confirming that temporal consistency is the key contribution.
  • Stage II does not degrade 3D quality (CD-3D unchanged) while providing topology consistency.
  • The method generalizes well to real DAVIS videos despite being trained exclusively on synthetic data.
  • Motion transfer is a notable capability: the flying motion of a bird can be transferred to a dragon.

Highlights & Insights

  • Minimal modification strategy: Only inflated attention and masked generation are added to the pretrained 3D diffusion model, maximally reusing 3D priors.
  • Topology consistency and rig-free output are critical production requirements: texture propagation and retargeting become trivial.
  • Decoupling generation from animation is an elegant simplification that reduces the complexity of the 4D problem.
  • Motion transfer is an emergent capability: masked generation naturally supports {3D + video} → animation conditioning.

Limitations & Future Work

  • Topological changes: The fixed-topology assumption cannot handle topological changes during deformation (e.g., splitting or merging).
  • Severe occlusion: Heavy occlusion in the reference frame or during motion may cause reconstruction failures.
  • The method's quality is bounded by that of the underlying image-to-3D model.
  • ActionBench is relatively small (128 animated scenes); larger-scale benchmarks are needed.

Additional Notes

  • The term "temporal 3D diffusion" precisely distinguishes this approach from "4D diffusion" (the multi-view extension).
  • The methodology parallels the extension of image models to video models (adding temporal attention + fine-tuning).
  • The generality of the VecSet architecture (3DShape2VecSet → TripoSG → CLAY) makes such temporal extensions broadly applicable.

Rating

  • Novelty: ⭐⭐⭐⭐ — The idea of minimally extending 3D diffusion to the temporal domain is clear and elegant.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Quantitative benchmarks, qualitative comparisons, ablations, real-video generalization, and motion transfer; highly comprehensive.
  • Writing Quality: ⭐⭐⭐⭐⭐ — Terminology is clearly distinguished (4D mesh vs. animated 3D mesh); structure is concise.
  • Value: ⭐⭐⭐⭐⭐ — Achieves speed, quality, and topology consistency simultaneously; production-ready.