Semantic-Aware Motion Encoding for Topology-Agnostic Character Animation

Conference: ICML 2026
arXiv: 2605.27055
Code: https://github.com/zzysteve/SATA
Area: 3D Vision / Human Motion / Character Animation / Representation Learning
Keywords: Motion Representation, Topology-Agnostic, Cross-Species Retargeting, Semantic Modulation, Graph Autoencoder

TL;DR

SATA utilizes joint semantic labels generated by MLLMs for FiLM-style feature modulation, combined with spatio-temporal interleaved graph autoencoders, to compress BVH motions of arbitrary skeletal topologies into a shared latent space. This enables high-fidelity reconstruction and zero-shot cross-species motion retargeting without paired data.

Background & Motivation

Background: Current motion representation learning in character animation is almost entirely built upon the "canonical skeleton" assumption. Large-scale datasets such as HumanML3D and AMASS map original SMPL sequences to a fixed number and hierarchy of joints, subsequently using VAE or VQ/RVQ-VAE to compress these fixed-dimensional trajectories into latent codes. While this paradigm achieves high reconstruction fidelity within a single species, it is inherently "tailored for a specific skeleton."

Limitations of Prior Work: In practice, digital characters range from humans to quadrupeds and fantasy creatures, with vast differences in joint counts and hierarchies. Even for "humans," joint definitions and naming conventions vary across datasets. This directly hinders unified training of multi-species generative models. Existing topology-flexible solutions fall into two groups. Graph-convolution methods (e.g., SAME) focus only on local graph structure and lack cross-species functional correspondence. Transformers with zero-padding (Gat 2025, Lee 2025), used for generation, introduce quadratic computational redundancy, cap the maximum joint count, and fail to learn compact continuous latent manifolds.

Key Challenge: The tension between topological diversity and compact generative motion representations. Fixed templates offer compact latent spaces at the expense of topological flexibility; zero-padding and graph structures provide flexibility but either disrupt latent space structure or lack semantic alignment. A mechanism to decouple "topology" from "semantics" is required.

Goal: To construct a padding-free autoencoder that directly processes raw data from arbitrary BVH topologies, encodes motion into a species-shared continuous latent manifold, and decodes from this latent code to any target skeleton without paired supervision.

Key Insight: The authors observe that although skeletal topologies vary, motion "semantics" are shared across species: human arms and animal forelimbs align at a functional level (e.g., "support/locomotion/grasping"). Rather than matching geometry or graph structures, one can attach semantic descriptions to each joint (generated by MLLMs and encoded via T5) to establish functional correspondence in semantic space.

Core Idea: Utilize MLLM-generated joint semantic embeddings for FiLM-style feature modulation (where semantic and spatial information jointly generate scaling/shifting parameters to reshape motion features), alongside spatio-temporal interleaved graph blocks, to encode motions of arbitrary skeletons into a unified, topology-agnostic latent space.

Method

Overall Architecture

SATA represents a motion sequence of \(T\) frames as a graph sequence \(\mathcal{G}=\{G_1,\dots,G_T\}\). Each frame's graph \(G_t=(V,E,E_f,F_s,F_{m,t})\) contains nodes (joints), edges (bone chains), edge features (topological/inverse depth), and node features. Node features are decomposed into static skeletal parts \(F_s=(X_g, X_l, X_t)\) (global offsets, rest pose coordinates relative to parent, and joint semantic embeddings) and dynamic motion parts \(F_m=(q,x,v_q,v_x,r,c)\in\mathbb{R}^{J\times D}\) (6D rotation, relative position, angular/linear velocity, root motion, and foot contact). The pipeline is as follows:

  1. Dynamic motion \(F_m\), spatial coordinates \(X_g, X_l\), and semantic embeddings \(X_t\) are projected into three types of tokens.
  2. Semantic-Aware Feature Modulation fuses spatial and semantic info into FiLM parameters to reshape motion features.
  3. The encoder stacks three Spatio-Temporal Interleaved graph blocks, alternating between spatial graph convolution/attention and temporal attention.
  4. A spatial max-pool aggregates node features into a frame-level latent code \(z_t\), stripping source skeleton structural info to form a topology-agnostic latent manifold.
  5. Regularization is applied using VAE (KL regularization) or RVQ-VAE (commitment loss + codebook).
  6. During decoding, \(z_t\) is broadcast to all nodes of the target skeleton, processed with the target's \((X'_t, X'_g, X'_l)\) through a symmetric decoder. Finally, an MLP outputs \(\hat F_m^{out}=(q,r,c)\in\mathbb{R}^{J\times 11}\), which yields the complete motion via Forward Kinematics (FK).
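
Steps 4 and 6 can be illustrated with a minimal NumPy sketch. It is not the authors' code: the function names, the joint counts (22 and 35), and the feature width are hypothetical, and the real decoder also consumes the target skeleton's \((X'_t, X'_g, X'_l)\), which is omitted here. The point is only that max-pooling over joints yields a code whose shape and value do not depend on joint count or joint order.

```python
import numpy as np

def pool_to_latent(node_feats: np.ndarray) -> np.ndarray:
    """Max-pool over the joint axis: (T, J, D) -> (T, D)."""
    return node_feats.max(axis=1)

def broadcast_to_target(z: np.ndarray, num_target_joints: int) -> np.ndarray:
    """Copy each frame-level code to every target joint: (T, D) -> (T, J', D)."""
    return np.repeat(z[:, None, :], num_target_joints, axis=1)

rng = np.random.default_rng(0)
T, D = 4, 8                              # hypothetical frame count and feature width
source = rng.normal(size=(T, 22, D))     # hypothetical 22-joint source skeleton
z = pool_to_latent(source)

# Shuffling the source joints leaves the latent code unchanged,
# so no information about joint ordering survives the pooling step.
perm = rng.permutation(22)
assert np.array_equal(z, pool_to_latent(source[:, perm, :]))

decoder_in = broadcast_to_target(z, 35)  # hypothetical 35-joint target skeleton
print(z.shape, decoder_in.shape)         # (4, 8) (4, 35, 8)
```

After broadcasting, every target joint starts from the same per-frame code, so any joint-specific behavior has to come from the target conditions injected by the decoder.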

Key Designs

  1. Semantic-Aware Feature Modulation:

    • Function: Uses MLLM-generated joint semantic embeddings and spatial coordinates as "identity conditions" to dynamically reshape each joint's motion features, establishing cross-species functional correspondence.
    • Mechanism: Gemini 2.5 Pro generates neutral functional descriptions for each joint (emphasizing function over species naming to aid transfer), which are encoded by a frozen T5 into \(X_t\). Three inputs are projected: \(z_m=\phi_m(F_m)\), \(z_s=\phi_s([X_g;X_l])\) (a sinusoidal encoder mapping coordinates to high-frequency spectra), and \(z_t=\phi_t(X_t)\) (the semantic token, not to be confused with the frame-level latent code \(z_t\) produced by pooling). The concatenated spatial and semantic tokens pass through a non-linear \(\Phi\) to give the node condition \(c=\Phi([z_s;z_t])\) (distinct from the foot-contact feature \(c\) in \(F_m\)). The condition is projected to FiLM parameters \([\gamma,\beta]=\Psi(c)\), which modulate the normalized motion token: \(\hat x=\mathrm{LN}(z_m)\odot(1+\gamma)+\beta\). Finally, a gated residual \(\widetilde F_m=z_m+g\odot\hat x\), with \(g=\sigma(W_g c)\), limits contamination when the modalities conflict.
    • Design Motivation: Compared to simple addition, FiLM allows semantic-spatial conditions to dynamically modulate the motion features, causing "left hand" and "left front paw" to map to similar latent regions despite different topologies. Gating allows the model to actively suppress noisy semantics.
  2. Spatio-Temporal Interleaved Graph Block:

    • Function: Models intra-frame biophysical constraints (bone lengths, joint limits) and inter-frame motion coherence simultaneously, avoiding jitter and structural drift.
    • Mechanism: The spatial branch, inspired by GPSConv, runs two parallel paths: GINEConv for message passing (utilizing topological depth in edge features for skeletal priors) and a Spatial Transformer for long-range coordination between non-adjacent joints. Their sum passes through an MLP+residual. The temporal branch uses spatio-temporal interleaving: a mapping operator \(\mathcal{T}\) rearranges the batch graph sequence \(\mathcal{G}_{batch}\) into "joint-aggregated temporal streams" \(\mathcal{X}_{temp}\), followed by a Temporal Transformer with sinusoidal positional encoding to capture long-term dependencies. The encoder and its symmetric decoder stack three such blocks.
    • Design Motivation: Pure graph convolution loses long-term context, while pure attention loses topological priors. The interleaved design ensures the skeleton remains physically plausible while the temporal dimension remains coherent.
  3. Topology-Agnostic Latent Manifold + Arbitrary Skeleton Data Pipeline:

    • Function: Strips source structural information from latent codes, allowing \(z_t\) to decode to any target skeleton; raw BVH processing enables joint training across datasets.
    • Mechanism: A spatial max-pool aggregates \(J\) node features into a frame-level vector \(z_t\), erasing information about the source skeleton's node count and connectivity. At decoding, only the latent code and the target skeleton's \((X'_t,X'_g,X'_l)\) are used. The latent is regularized as a VAE (128D) for reconstruction/retargeting and as an RVQ-VAE (256D, 6 quantizers) for text-to-motion tokens. The benchmarks AT-HumanML3D (80,508 segments) and AT-AniMo4D (30,097 segments across 115 species) were created using quaternion BVH canonicalization.
    • Design Motivation: Unlike character-specific AEs or fixed templates, the max-pool + broadcast architecture allows the latent space to represent "motion semantics" rather than skeletal implementation.
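
The modulation formulas of the first design can be written out as a small NumPy sketch. This is an illustration under simplifying assumptions rather than the released implementation: \(\Psi\) and the gate are stood in for by single linear maps (`W_psi`, `W_g`, both hypothetical names), the layer norm has no learnable affine parameters, and all sizes are arbitrary.

```python
import numpy as np

def layer_norm(x: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    """Normalize over the feature axis, without learnable scale or shift."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def sigmoid(x: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-x))

def film_gated(z_m, cond, W_psi, W_g):
    """FiLM modulation followed by a gated residual.

    z_m:   (J, D)  projected motion tokens
    cond:  (J, C)  per-joint condition c = Phi([z_s; z_t])
    W_psi: (C, 2D) linear stand-in for Psi, producing [gamma, beta]
    W_g:   (C, D)  gate projection
    """
    gamma, beta = np.split(cond @ W_psi, 2, axis=-1)
    x_hat = layer_norm(z_m) * (1.0 + gamma) + beta   # LN(z_m) * (1 + gamma) + beta
    g = sigmoid(cond @ W_g)                          # g = sigma(W_g c)
    return z_m + g * x_hat                           # z_m + g * x_hat

rng = np.random.default_rng(0)
J, D, C = 5, 6, 4                                    # hypothetical sizes
z_m = rng.normal(size=(J, D))
cond = rng.normal(size=(J, C))
W_psi = 0.1 * rng.normal(size=(C, 2 * D))
W_g = 0.1 * rng.normal(size=(C, D))
out = film_gated(z_m, cond, W_psi, W_g)
print(out.shape)                                     # (5, 6)
```

Writing the scale as \(1+\gamma\) means that a zero projection leaves the normalized features unscaled, and the gate lets the model shrink the modulated branch toward zero for joints whose condition is unreliable.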

Loss & Training

End-to-end optimization of \(\mathcal{L}=\mathcal{L}_{rec}+\lambda\mathcal{L}_{reg}\). Reconstruction terms follow SAME: rotation/position/velocity MSE, foot contact, ground penetration, and physical smoothness regularizations. Latent space uses KL for VAE and commitment loss for RVQ-VAE. Adam optimizer, lr 1e-4, 30-epoch linear warmup, exponential decay \(\gamma=0.99\) for 400 epochs. Inference uses a 64-frame sliding window with 16-frame overlap. The base model has 8.41M parameters / 29.66 GFLOPs.
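
The schedule and windowing described above can be sketched in plain Python. Several details are assumptions not stated in the summary: the decay is applied once per epoch starting after warmup, the warmup ramps linearly from `base_lr / 30` to `base_lr`, and the last window is shifted back so that it ends on the final frame. How overlapping frames are blended is not specified, so the sketch only computes window start indices.

```python
def learning_rate(epoch: int, base_lr: float = 1e-4,
                  warmup_epochs: int = 30, gamma: float = 0.99) -> float:
    """Linear warmup for `warmup_epochs`, then per-epoch exponential decay."""
    if epoch < warmup_epochs:
        return base_lr * (epoch + 1) / warmup_epochs
    return base_lr * gamma ** (epoch - warmup_epochs)

def window_starts(num_frames: int, window: int = 64, overlap: int = 16) -> list:
    """Start indices of sliding windows covering a sequence of `num_frames`."""
    stride = window - overlap                 # 64 - 16 = 48 frames between starts
    if num_frames <= window:
        return [0]
    starts = list(range(0, num_frames - window + 1, stride))
    if starts[-1] + window < num_frames:
        starts.append(num_frames - window)    # assumed: align last window to the end
    return starts

print(window_starts(200))                     # [0, 48, 96, 136]
print(learning_rate(30) > learning_rate(31))  # True
```

With a 16-frame overlap, consecutive windows share a quarter of their frames, which gives the model context across window boundaries during inference.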

Key Experimental Results

Main Results

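Reconstruction and zero-shot cross-dataset evaluation. Zero-shot entries (shaded grey in the original table) are those where the evaluation dataset differs from the training source. Abbreviations: JR = Joint Rotation, RT = Root Trajectory, JP = Joint Position, FS = Foot Skating, GP = Ground Penetration; only JR and JP are shown below: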

| Training Source | Method | AT-HumanML3D JR↓ | AT-HumanML3D JP↓ | AT-AniMo4D JR↓ | AT-AniMo4D JP↓ |
|---|---|---|---|---|---|
| HumanML3D | SAME | 0.0831 | 2.81 | 0.5721 | 398.1 (Collapsed) |
| HumanML3D | Ours | 0.0568 | 1.36 | 0.4855 | 34.6 |
| AniMo4D | SAME | 0.5266 | 122.7 (Collapsed) | 0.5227 | 4.30 |
| AniMo4D | Ours | 0.6616 | 80.9 | 0.3901 | 4.51 |

Joint training (simultaneous training on both benchmarks):

| Method | AT-HumanML3D JR↓ | AT-HumanML3D JP↓ | AT-AniMo4D JR↓ | AT-AniMo4D JP↓ |
|---|---|---|---|---|
| SAME | 0.1060 | 2.34 | 0.2357 | 3.68 |
| Ours | 0.0769 | 1.63 | 0.1971 | 3.27 |

Human motion retargeting (global joint position error):

| Method | Internal↓ | Cross↓ |
|---|---|---|
| MoMask | 89.42 | 103.72 |
| SAN | 15.96 | 34.82 |
| SAME | 1.48 | 0.96 |
| Ours (RVQ) | 1.12 | 0.97 |
| Ours (VAE) | 0.21 | 0.20 |

Ablation Study

Ablation on AT-HumanML3D (selected):

| Configuration | JR↓ | JP↓ | Internal Retarget↓ | Cross Retarget↓ |
|---|---|---|---|---|
| Full | 0.0568 | 1.36 | 0.21 | 0.20 |
| w/o Spatial Transformer | 0.1243 | 10.23 | 1.78 | 1.67 |
| w/o GNN | 0.0747 | 2.38 | 1.10 | 1.06 |
| w/o Temporal Transformer | 0.0584 | 1.78 | 0.32 | 0.29 |
| w/o Fusion Block (Semantic Mod.) | 0.1196 | 7.81 | 0.48 | 0.48 |
| w/o Text Fusion (No Text) | 0.0663 | 1.52 | 0.35 | 0.25 |
| w/o Sliding Window | 0.0630 | 3.45 | 0.75 | 0.65 |

Pre-training effect (on AniMo4D with different data ratios, ✗=From scratch, ✓=After AT-HumanML3D pre-training):

| Data % | PT | JR↓ | JP↓ |
|---|---|---|---|
| 10% | ✗ / ✓ | 0.62 / 0.28 | 14.34 / 7.71 |
| 30% | ✗ / ✓ | 0.52 / 0.22 | 8.72 / 4.06 |
| 100% | ✗ / ✓ | 0.39 / 0.18 | 4.51 / 3.14 |

Key Findings

  • The Fusion Block is critical: Removing it raises JP from 1.36 to 7.81 and more than doubles retargeting error (0.21 → 0.48 internal, 0.20 → 0.48 cross). This supports the claim that "semantic-spatial modulation" is key to cross-species alignment.
  • Spatial Transformer is more important than GNN: Removing the former increases JP by 7.5x, whereas removing the latter increases it by only 1.7x, highlighting the necessity of long-range joint coordination.
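  • Joint training helps animal motion modeling: Trained on AniMo4D alone, Ours trails SAME on AniMo4D JP (4.51 vs. 4.30); with human data added it leads (3.27 vs. 3.68), indicating that human motion helps animal trajectory modeling. The benefit is not symmetric in the reported numbers: Ours' AT-HumanML3D JP rises from 1.36 (human-only training) to 1.63 under joint training.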
  • Robustness in zero-shot cross-species: SAME collapsed in Human→Animal settings (JP=398), while Ours remained stable (JP=34.6) due to the decoupling of the latent space and target conditions.
  • Pre-training yields significant gains: With 10% of the data, JP drops from 14.34 to 7.71 (a reduction of about 46%), suggesting the latent manifold can serve as a transferable backbone for motion.
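
The ratios quoted in these findings can be re-derived from the tables with a few lines of Python. The numbers are copied from the ablation, pre-training, and zero-shot tables; the variable names are only illustrative.

```python
full_jp = 1.36  # JP of the full model on AT-HumanML3D

# JP of each ablation divided by JP of the full model
jp_ratio = {
    "w/o Spatial Transformer": 10.23 / full_jp,
    "w/o GNN": 2.38 / full_jp,
    "w/o Fusion Block": 7.81 / full_jp,
}

# Relative JP reduction from pre-training at 10% of the AniMo4D data
pretrain_drop = (14.34 - 7.71) / 14.34

# Human -> Animal zero-shot JP: SAME divided by Ours
zero_shot_ratio = 398.1 / 34.6

for name, ratio in jp_ratio.items():
    print(f"{name}: {ratio:.2f}x")
print(f"pre-training JP reduction: {pretrain_drop:.1%}")
print(f"zero-shot JP, SAME / Ours: {zero_shot_ratio:.1f}x")
```

The Spatial Transformer ablation comes out at roughly 7.5x and the GNN ablation at roughly 1.75x, consistent with the figures quoted above.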

Highlights & Insights

  • The "Topology Decoupling = Semantic Bridge" philosophy is elegant: Rather than matching at the geometric level, MLLM-based functional labels establish implicit "human left hand ↔ cat left front paw" correspondences.
  • FiLM + Gated Residual trick: Directly adding semantic labels can contaminate features; FiLM allows the condition to reshape features dynamically, while gating suppresses noise when conditions conflict.
  • Max-pool for source erasure + Broadcasting for target injection: This asymmetric design is the key to a truly topology-agnostic latent space compared to previous "end-to-end tied" graph structures.
  • Data pipeline as a contribution: AT-HumanML3D and AT-AniMo4D benchmarks provide a standardized platform for testing topological generalization.

Limitations & Future Work

  • MLLM quality control: The meaningfulness of Gemini-generated descriptions for bizarre topologies (e.g., fantasy creatures or multi-arm robots) was not fully discussed.
  • Dependency on BVH format: The pipeline is limited to quaternion BVH, excluding mesh-driven or blendshape-driven characters.
  • Geometric vs. Perceptual metrics: Cross-species retargeting lacks reliable user studies or perceptual metrics to evaluate the "naturalness" of the resulting motion.
  • Topology-agnostic \(\neq\) Physics-agnostic: The model does not explicitly account for target species physics (e.g., mass distribution, joint limits), potentially leading to unrealistic results during extreme transfers.
  • Future work: Incorporate physical parameters (mass, inertia) as decoding conditions and align RVQ codebooks with text for more controllable multi-species text-to-motion.

Comparison with Related Work

  • vs. SAME (Lee 2023): Both use topology-flexible GAEs, but SAME lacks cross-species semantic alignment, leading to zero-shot collapse. SATA uses MLLM-based semantic modulation to reduce zero-shot Human→Animal JP from 398.1 to 34.6.
  • vs. Zero-padding Transformers (Gat 2025, Lee 2025): These use padding for generation, resulting in quadratic redundancy and joint limits. SATA is padding-free and provides a more compact latent space.
  • vs. MoMask / VQ-VAE series: These excel on fixed skeletons but cannot generalize to heterogeneous skeletons; SATA trades slight reconstruction fidelity for topological generalization.
  • vs. WalkTheDog / OmniMotionGPT: Previous cross-species work was either not scalable (per-character AEs) or limited to specific templates (SMAL); SATA supports 115 animal species and humans in a single model.

Rating

  • Novelty: ⭐⭐⭐⭐ Combination of MLLM semantics, FiLM modulation, and topology-agnostic latent manifolds is a clean, new approach.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Covers various training settings and ablations; lacks perceptual evaluation.
  • Writing Quality: ⭐⭐⭐⭐ Clear motivation and well-explained methodology.
  • Value: ⭐⭐⭐⭐ Successfully bridges arbitrary BVH topology with multi-species joint training and zero-shot retargeting.