MixerMDM: Learnable Composition of Human Motion Diffusion Models

Conference: CVPR 2025
arXiv: 2504.01019
Code: Project Page
Area: image_generation (motion generation)
Keywords: human motion diffusion, model composition, adversarial training, dynamic mixing, human-human interaction

TL;DR

MixerMDM is proposed as the first learnable composition technique for human motion diffusion models. It uses a Transformer-based Mixer module to predict dynamic mixing weights, learning to blend individual and interactive motion diffusion models via adversarial training to achieve fine-grained, controllable human-human interaction motion generation.

Background & Motivation

Background: Text-driven human motion generation has made significant progress, but high-quality motion data remains scarce and lacks unified formatting. Existing methods tend to train specialized models (e.g., individual motion models, interaction motion models) on dedicated datasets, with each excelling in its respective domain.

Key Challenge: How can the generative capabilities of different specialized models be combined? Individual motion models excel in the diversity and precise control of intrapersonal movements, while interaction models are proficient in handling global trajectories and interpersonal orientations. Existing model composition methods (e.g., DiffusionBlending, DualMDM) rely on manually set, fixed, or predefined weight schedulers, making them unable to dynamically adjust mixing strategies for different motions and conditions.

Key Insight: A learnable Mixer module is designed to dynamically predict mixing weights based on the currently generated motion content, text conditions, and denoising steps, preserving the core characteristics of each pre-trained model through adversarial training.

Method

Overall Architecture

The MixerMDM pipeline runs the following steps at each denoising timestep \(t\):

  1. Two pre-trained models \(\mathcal{M}^a\) (interaction model) and \(\mathcal{M}^b\) (individual model) generate motions \(x_t^a\) and \(x_t^b\), respectively.
  2. The Mixer module receives both motions and their respective conditions, and predicts the mixing weights \(w_t\).
  3. The two motions are fused using a mixing formula to obtain \(x_t^m\).
  4. \(x_t^m\) is fed back into both models as the input for the next denoising step.

For the individual + interaction mixing scenario, centering (normalizing starting positions) and alignment (restoring global trajectories) functions are also required to match the input formats of different models.
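The loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`center`, `align`, `sample`), the `(frames, joints, 3)` motion layout, and the callable signatures for the models and the Mixer are all hypothetical, and the noise re-injection performed by a real diffusion sampler between steps is omitted for brevity.

```python
import numpy as np

def center(x):
    """Hypothetical centering: shift the motion so the first frame's root joint is at the origin."""
    offset = x[:1, :1, :].copy()           # shape (1, 1, 3)
    return x - offset, offset

def align(x, offset):
    """Hypothetical alignment: restore the global position removed by center()."""
    return x + offset

def mix(x_a, x_b, w):
    """Mixing formula from the text: x_m = x_a + w * (x_b - x_a)."""
    return x_a + w * (x_b - x_a)

def sample(model_a, model_b, mixer, cond_a, cond_b, shape, num_steps, rng):
    """Simplified composition loop; scheduler details of a real sampler are left out."""
    x_m = rng.standard_normal(shape)       # start from Gaussian noise
    for t in reversed(range(num_steps)):
        x_a = model_a(x_m, t, cond_a)      # interaction model works in the global frame
        x_c, offset = center(x_m)          # individual model expects a centered input
        x_b = align(model_b(x_c, t, cond_b), offset)
        w = mixer(x_a, x_b, cond_a, cond_b, t)
        x_m = mix(x_a, x_b, w)             # mixed motion feeds the next step
    return x_m

# Stub models and a constant Mixer, used only to show the call pattern.
rng = np.random.default_rng(0)
stub_a = lambda x, t, c: np.zeros_like(x)
stub_b = lambda x, t, c: np.ones_like(x)
stub_mixer = lambda x_a, x_b, c_a, c_b, t: 0.5
motion = sample(stub_a, stub_b, stub_mixer, "text a", "text b", (8, 5, 3), 10, rng)
print(motion.shape)
```

Because the Mixer only sees model outputs, any pair of callables with matching output shapes can be dropped into `sample`, which mirrors the modularity the summary describes.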

Key Designs

1. Mixer Module (Core Architecture)

A lightweight module based on a Transformer encoder (21M parameters, significantly smaller than the 300M+ parameters of the pre-trained models):

Input: Motion outputs \(x_t^a, x_t^b\) from the two pre-trained models, their respective conditions \(c^a, c^b\), and the current timestep \(t\).

Processing: Four multi-head attention layers (latent dimension 512, 8 heads) encode the inputs into high-dimensional representations.

Output: An MLP decodes the representation into mixing weights \(w_t\), offering four levels of granularity:

  • Global [G]: A single global scalar
  • Temporal [T]: One weight per frame
  • Spatial [S]: One weight per joint
  • Spatio-Temporal [ST]: One weight per joint per frame (most fine-grained)

Mixing formula: \(x_t^m = x_t^a + w_t \cdot (x_t^b - x_t^a)\)
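With this formula, \(w_t = 0\) returns the interaction output \(x_t^a\) unchanged and \(w_t = 1\) returns the individual output \(x_t^b\). The four granularities differ only in the shape of \(w_t\), which a broadcasting sketch makes concrete. The `(frames, joints, features)` layout and the helper name `weight_shape` are assumptions made for illustration; the summary does not specify the tensor layout.

```python
import numpy as np

def weight_shape(granularity, frames, joints):
    """Shape of w_t for each granularity, assuming motion is (frames, joints, features)."""
    shapes = {
        "G": (1, 1, 1),             # one scalar for the whole motion
        "T": (frames, 1, 1),        # one weight per frame
        "S": (1, joints, 1),        # one weight per joint
        "ST": (frames, joints, 1),  # one weight per joint per frame
    }
    return shapes[granularity]

def mix(x_a, x_b, w):
    """x_m = x_a + w * (x_b - x_a); NumPy broadcasting expands w to the motion shape."""
    return x_a + w * (x_b - x_a)

frames, joints, feats = 4, 3, 2
x_a = np.zeros((frames, joints, feats))   # stand-in for the interaction output
x_b = np.ones((frames, joints, feats))    # stand-in for the individual output
for g in ("G", "T", "S", "ST"):
    w = np.full(weight_shape(g, frames, joints), 0.25)
    print(g, w.shape, mix(x_a, x_b, w).shape)
```

Every granularity yields a mixed motion of the same shape; only the number of free weights the Mixer must predict changes.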

2. Adversarial Training Strategy

Since mixed motion lacks ground truth, a GAN-based adversarial training scheme is designed:

  • One discriminator per pre-trained model: \(\mathcal{D}^a\) and \(\mathcal{D}^b\)
  • Positive samples: Outputs from each respective pre-trained model (\(x_t^a\) for \(\mathcal{D}^a\), \(x_t^b\) for \(\mathcal{D}^b\))
  • Negative samples: The mixed motion \(x_t^m\) generated by MixerMDM
  • Objective: The Mixer (generator) aims to make the mixed motion deceive both discriminators simultaneously, thereby preserving the core characteristics of both models.

3. Modular Design

The Mixer directly mixes the final outputs of the pre-trained models without depending on internal feature representations, allowing seamless replacement of the underlying pre-trained models (provided they are trained on the same dataset). Experiments show that pairing the worst-performing combination of pre-trained models with the best Mixer weights improves overall alignment by 37%.

Loss & Training

Generator Loss:

\[\mathcal{L}_{adv}^G = -\mathcal{D}^a(x_t^m) - \mathcal{D}^b(x_t^m) + L1\]

Discriminator Loss (hinge loss):

\[\mathcal{L}_{adv}^D = \min(0, -1-\mathcal{D}^a(x_t^m)) + \min(0, -1-\mathcal{D}^b(x_t^m)) + \min(0, -1+\mathcal{D}^a(x_t^a)) + \min(0, -1+\mathcal{D}^b(x_t^b)) + L1\]

The \(L1\) regularization term penalizes excessive discrepancy between the two discriminator losses. Training is conducted for 300 epochs with a batch size of 128 and learning rate of \(10^{-5}\), taking approximately 36 hours on a single RTX 4090.
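A small numerical sketch may help in reading the two losses. It uses the standard hinge-GAN form as a quantity to minimize, `max(0, 1 - D(real)) + max(0, 1 + D(fake))`, which is the negation of the `min` terms written above; whether this matches the paper's exact sign convention is an assumption. The \(L1\) term is modeled here as the absolute difference between the two per-discriminator terms with a hypothetical weight `lam`, which is only one plausible reading of "penalizes excessive discrepancy".

```python
import numpy as np

def hinge_d_loss(real_scores, fake_scores):
    """Standard hinge loss for one discriminator (lower is better for the discriminator)."""
    real_term = np.maximum(0.0, 1.0 - real_scores).mean()  # push D(real) above +1
    fake_term = np.maximum(0.0, 1.0 + fake_scores).mean()  # push D(fake) below -1
    return real_term + fake_term

def discriminator_loss(da_real, da_fake, db_real, db_fake, lam=1.0):
    """Sum of both hinge losses plus an assumed L1 balance penalty."""
    loss_a = hinge_d_loss(da_real, da_fake)  # D^a: real = x_t^a, fake = x_t^m
    loss_b = hinge_d_loss(db_real, db_fake)  # D^b: real = x_t^b, fake = x_t^m
    return loss_a + loss_b + lam * abs(loss_a - loss_b)

def generator_loss(da_fake, db_fake, lam=1.0):
    """Mixer objective: raise both discriminators' scores on the mixed motion."""
    g_a = -da_fake.mean()
    g_b = -db_fake.mean()
    return g_a + g_b + lam * abs(g_a - g_b)

# Scores are plain arrays here; in practice they would come from the two discriminators.
scores_real = np.array([1.5, 2.0])
scores_fake = np.array([-1.2, -0.3])
print(discriminator_loss(scores_real, scores_fake, scores_real, scores_fake))
print(generator_loss(scores_fake, scores_fake))
```

Under this reading, the balance penalty discourages the Mixer from satisfying one discriminator while ignoring the other, which would amount to copying a single pre-trained model.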

Key Experimental Results

Main Results

| Method | Interaction Model | Individual Model | Overall Alignment ↑ | Adaptability |
| --- | --- | --- | --- | --- |
| DiffusionBlending | in2IN | in2IN | 0.221 | |
| DualMDM | in2IN | in2IN | 0.217 | |
| MixerMDM [ST] | in2IN | in2IN | 0.335 | 0.112 |
| Finetuned baseline | | | 0.289 | |

Ablation Study

| Type | \(\mathcal{M}^a\)=in2IN, \(\mathcal{M}^b\)=in2IN | Adaptability |
| --- | --- | --- |
| Global [G] | 0.317 | 0.004 |
| Temporal [T] | 0.245 | 0.002 |
| Spatial [S] | 0.310 | 0.015 |
| Spatio-Temporal [ST] | 0.335 | 0.112 |

User Study

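| Method | Interaction: Avg Rank ↓ | Interaction: 1st Ranked ↑ | Individual: Avg Rank ↓ | Individual: 1st Ranked ↑ |
| --- | --- | --- | --- | --- |
| DiffusionBlending | 2.531 | 4.57% | 2.446 | 10.57% |
| DualMDM | 2.286 | 10.29% | 2.051 | 24.57% |
| MixerMDM | 1.182 | 85.14% | 1.309 | 74.86% |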

Key Findings

  1. ST Granularity is Optimal: The Spatio-Temporal mixing weights achieve the best trade-off between Alignment and Adaptability.
  2. Adversarial Training is Effective: The learned mixing weight curves differ significantly from manual schedules, and MixerMDM [ST] reaches a higher Overall Alignment (0.335) than DiffusionBlending (0.221) and DualMDM (0.217).
  3. Modular Transferability: Transferring the best Mixer weights to weaker model combinations yields a 37% improvement in Overall Alignment.
  4. Individual Dominates Early, Interaction Dominates Late: All learned weight curves exhibit this pattern, validating hypotheses from prior studies.

Highlights & Insights

  1. First Learnable Motion Diffusion Model Composition: Compared to fixed or predefined weights, dynamically learning the mixing strategy represents a major advancement.
  2. Clever Use of GAN Paradigm: In the absence of ground-truth mixed motion, outputs from individual models are utilized as pseudo-ground truths for adversarial training.
  3. Modular Characteristic: The Mixer does not rely on internal model representations, so the underlying models can be swapped at no additional training cost, offering strong practical value.
  4. Contribution to the Evaluation Framework: Two new metrics, Alignment and Adaptability, are proposed to fill the gap in quantitative evaluation for this task.

Limitations & Future Work

  1. Inference Overhead: Although the Mixer is more than 10x smaller than the pre-trained models (21M vs. 300M+ parameters), computing the mixing weights at every denoising step still increases inference time.
  2. Unstable Adversarial Training: Standard GAN issues persist, including sensitivity to hyperparameters and training instability.
  3. Data Representation Constraints: The models being composed must predict a unified data representation format; models with different output formats require retraining.
  4. Currently Evaluated on Dual-Model Configurations: Multi-way blending involving three or more models has not yet been explored.

Related Work & Insights

  • DiffusionBlending / DualMDM: Prior works use fixed or predefined weights, which MixerMDM generalizes to a learnable, dynamic formulation.
  • MDM / InterGen / in2IN: Used as the underlying pre-trained models to be combined.
  • MultiDiffusion (Image Domain): Leverages the concept of diffusion path ensembling, which MixerMDM introduces to motion generation.
  • Insight: Learnable model composition is a general paradigm that can be extended to other diffusion applications (e.g., multi-model fusion for image, video, and audio).

Rating ⭐

Dimension Score
Novelty ⭐⭐⭐⭐⭐
Technical Depth ⭐⭐⭐⭐
Experimental Thoroughness ⭐⭐⭐⭐
Practical Utility ⭐⭐⭐
Overall Recommendation ⭐⭐⭐⭐