SMGDiff: Soccer Motion Generation using Diffusion Probabilistic Models¶

Conference: ICCV 2025
arXiv: 2411.16216
Code: https://github.com/SMGDiff/SMGDiff
Area: Image Generation
Keywords: motion generation, diffusion model, soccer animation, human-object interaction, character control

TL;DR¶

This paper proposes SMGDiff, a two-stage diffusion model framework that generates high-quality, diverse soccer motion animations in real time from user control signals, while refining ball-foot interaction details via a contact guidance module.

Background & Motivation¶

Soccer is the world's most popular sport, with broad application demand in gaming and VR/AR. However, generating realistic soccer motion animation faces the following core challenges:

Complex human-ball interaction: Soccer involves precise physical contact between players and the ball, with particularly stringent accuracy requirements for ball-foot contact.

Real-time requirements: Games and interactive applications demand real-time inference, whereas existing diffusion models are typically computationally expensive.

Motion diversity: Soccer encompasses a wide range of skill categories—dribbling, tricks, shooting, etc.—requiring coverage of a broad motion spectrum.

Lack of datasets: No large-scale public mocap dataset for soccer motion exists.

Limitations of prior work:

Commercial games (e.g., the EA SPORTS FC series) rely on massive pre-recorded motion libraries for motion matching, which is prone to visual artifacts.
Reinforcement learning / physics simulation methods are limited to specific skills (e.g., dribbling, shooting, juggling) and cannot cover the full soccer motion spectrum.
General motion diffusion models (e.g., CAMDM) focus solely on human motion style transitions and do not handle human-object interaction.
Human-object interaction generation methods require time-consuming post-optimization, making them unsuitable for real-time interactive scenarios.

Method¶

SMGDiff adopts a two-stage framework: the first stage generates trajectories, and the second stage generates soccer motions conditioned on those trajectories.

Motion Representation¶

A soccer motion state \(x^i = \{h, b, c\}\) consists of three components:

Body state \(h \in \mathbb{R}^{3+24 \times 6}\): root position (3 values) and 6D rotation representations of the 24 SMPL joints.
Ball state \(b \in \mathbb{R}^{7}\): relative ball position, global ball velocity, and ball control weight.
Binary contact labels \(c = \{c_g, c_b\}\): foot-ground contact and foot-ball contact.
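As a quick check on the dimensions listed above, the sizes can be tallied in a few lines. The split of the 7 ball dimensions into 3 + 3 + 1 is an assumption inferred from the three named components; the note does not state it explicitly.

```python
NUM_JOINTS = 24   # SMPL joints, as stated above
ROT_DIM = 6       # rotation features per joint
ROOT_DIM = 3      # root position

body_dim = ROOT_DIM + NUM_JOINTS * ROT_DIM

# Assumed split of the 7 ball dimensions: 3D position + 3D velocity + scalar weight.
ball_layout = {"relative_position": 3, "global_velocity": 3, "control_weight": 1}
ball_dim = sum(ball_layout.values())

print(body_dim, ball_dim)  # 147 7
```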

The ball control weight \(w_b = 1 - \|b_p^{xy} - h_p^{xy}\| / r\) decays linearly with the horizontal distance between the ball and the character root, reaching 0 at radius \(r = 2\text{ m}\). It accompanies the conversion of the global ball position to a root-relative position: once the ball is farther than \(r\) from the root, the weight is effectively 0, which decouples the ball representation between controlled and uncontrolled states.
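A minimal sketch of the weight computation follows. The function name is hypothetical, and clamping the result to [0, 1] is an assumption: the formula alone would go negative beyond radius r, whereas the text describes the weight as going to 0 there.

```python
import math

def ball_control_weight(ball_xy, root_xy, radius=2.0):
    """w_b = 1 - ||b_p^xy - h_p^xy|| / r, clamped to [0, 1] (the clamp is assumed)."""
    dist = math.hypot(ball_xy[0] - root_xy[0], ball_xy[1] - root_xy[1])
    return min(1.0, max(0.0, 1.0 - dist / radius))

print(ball_control_weight((0.5, 0.0), (0.0, 0.0)))  # 0.75
print(ball_control_weight((3.0, 0.0), (0.0, 0.0)))  # 0.0
```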

Stage 1: Trajectory Generation Model (TGM)¶

Objective: Transform coarse-grained user control signals (direction, speed, skill category) into fine-grained global character trajectories.

Architecture: A lightweight single-step diffusion model based on a Transformer Encoder. Inputs include:

Soccer skill label \(\mathbf{S}\) (6 categories: dribble, trick, shoot, stand, celebrate, off-the-ball move)
Target waypoints \(\mathbf{G}\) (computed from keyboard directional input and press intensity)
Past trajectory \(\mathbf{T}^{\mathcal{P}}\)
Gaussian noise \(\epsilon \sim \mathcal{N}(0, \mathbf{I})\) (for diversity)

At inference, the process mirrors a single-step DDPM reverse pass, completing trajectory generation in a single forward pass to ensure real-time performance.

Trajectory blending: The HFTE (Heuristic Future Trajectory Extension) strategy from CAMDM is adopted; when the user issues new control signals, the newly generated trajectory is blended with the previous frame's result to prevent abrupt changes in character direction and speed.
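The idea of blending can be illustrated with a simple linear cross-fade between the previous and the newly generated trajectory. This is only an illustrative stand-in, not the HFTE procedure itself, whose details are not given here; the function name and the linear ramp are assumptions.

```python
def blend_trajectories(previous, new):
    """Linearly cross-fade two equally long lists of (x, y) waypoints.

    Early waypoints follow `previous`, late waypoints follow `new`.
    """
    if len(previous) != len(new):
        raise ValueError("trajectories must have the same length")
    n = len(new)
    blended = []
    for i, ((px, py), (nx, ny)) in enumerate(zip(previous, new)):
        alpha = i / (n - 1) if n > 1 else 1.0  # ramps from 0 to 1
        blended.append(((1 - alpha) * px + alpha * nx,
                        (1 - alpha) * py + alpha * ny))
    return blended

old_path = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
new_path = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
print(blend_trajectories(old_path, new_path))
# [(0.0, 0.0), (1.0, 0.5), (2.0, 2.0)]
```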

Stage 2: Soccer Motion Diffusion Model¶

Architecture: An autoregressive diffusion model based on Transformers that generates soccer motion sequences conditioned on trajectories.

Conditioning input \(\mathbf{C} = \{\mathbf{S}, \mathbf{X}^{\mathcal{P}}, \mathbf{T}^{\mathcal{F}}\}\):

Skill label \(\mathbf{S}\)
Past motion \(\mathbf{X}^{\mathcal{P}}\) (10 history frames)
Future trajectory \(\mathbf{T}^{\mathcal{F}}\) (45 frames)

Training loss consists of four terms:

\[\mathcal{L} = \mathcal{L}_{\text{simple}} + \lambda_{\text{pos}} \mathcal{L}_{\text{pos}} + \lambda_{\text{vel}} \mathcal{L}_{\text{vel}} + \lambda_{\text{foot}} \mathcal{L}_{\text{foot}}\]

\(\mathcal{L}_{\text{simple}}\): reconstruction loss for directly predicting \(\mathbf{X}_0^{\mathcal{F}}\)
\(\mathcal{L}_{\text{pos}}\): joint position loss obtained via forward kinematics
\(\mathcal{L}_{\text{vel}}\): velocity consistency loss
\(\mathcal{L}_{\text{foot}}\): foot-ground contact constraint loss (penalizing foot sliding during contact)
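The weighted sum above, together with one plausible form of the foot-sliding penalty, can be sketched as follows. The exact definition of \(\mathcal{L}_{\text{foot}}\) and the \(\lambda\) values are not given in this note, so the squared-speed penalty and the default weights of 1.0 are assumptions.

```python
def foot_slide_penalty(foot_positions, ground_contact, fps=30.0):
    """Mean squared foot speed over frames labelled as ground contact (assumed form)."""
    total, count = 0.0, 0
    for i in range(1, len(foot_positions)):
        if ground_contact[i]:
            # Finite-difference velocity between consecutive frames.
            vel = [(a - b) * fps for a, b in zip(foot_positions[i], foot_positions[i - 1])]
            total += sum(v * v for v in vel)
            count += 1
    return total / count if count else 0.0

def total_loss(l_simple, l_pos, l_vel, l_foot,
               lam_pos=1.0, lam_vel=1.0, lam_foot=1.0):
    """L = L_simple + lam_pos * L_pos + lam_vel * L_vel + lam_foot * L_foot."""
    return l_simple + lam_pos * l_pos + lam_vel * l_vel + lam_foot * l_foot
```

A foot that stays still while in contact contributes nothing to the penalty; any movement during contact is penalized quadratically in its speed.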

Contact Guidance Module (CGM)¶

Contact guidance is introduced during the diffusion inference process to refine ball-foot contact via a dedicated loss function.

Contact detection: A foot-ball contact event is detected when ball acceleration exceeds threshold \(\tau_a = 2\text{ m/s}^2\) (under ground friction alone, acceleration is small and constant):

\[\hat{c}_b = \mathbb{I}(\|b_a\| > \tau_a)\]
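A minimal sketch of this thresholding, assuming ball acceleration is estimated by finite differences of per-frame ball velocity at 30 Hz (the estimation method and the function name are assumptions):

```python
import math

def detect_ball_contacts(ball_velocities, fps=30.0, tau_a=2.0):
    """Return one 0/1 flag per frame: 1 where ||b_a|| > tau_a."""
    flags = [0]  # no acceleration estimate is available for the first frame
    for prev, cur in zip(ball_velocities, ball_velocities[1:]):
        accel = [(c - p) * fps for p, c in zip(prev, cur)]
        magnitude = math.sqrt(sum(a * a for a in accel))
        flags.append(1 if magnitude > tau_a else 0)
    return flags

# Slow friction-like decay, then a sudden kick.
velocities = [(1.0, 0.0, 0.0), (0.98, 0.0, 0.0), (3.0, 0.0, 0.0)]
print(detect_ball_contacts(velocities))  # [0, 0, 1]
```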

Contact joint selection: Distances from each foot joint to the ball are computed, with preference given to the airborne foot (the grounded foot's distance is penalized by weight \(w_d=2\)):

\[d = \min_{j \in \text{foot joints}} \left( \|f_p^j - b_p\| \cdot (1 + (w_d - 1) \cdot c_g^j) \right)\]
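The selection rule can be sketched as below: each foot joint's Euclidean distance to the ball is scaled by \(w_d\) when that joint is in ground contact, and the smallest weighted distance wins. The function name and the returned joint index are illustrative additions.

```python
import math

def weighted_foot_ball_distance(foot_positions, ball_position, ground_contact, w_d=2.0):
    """Return (d, joint_index) minimizing ||f_p^j - b_p|| * (1 + (w_d - 1) * c_g^j)."""
    best = None
    for j, (pos, grounded) in enumerate(zip(foot_positions, ground_contact)):
        dist = math.dist(pos, ball_position)
        weighted = dist * (1.0 + (w_d - 1.0) * grounded)
        if best is None or weighted < best[0]:
            best = (weighted, j)
    return best

# Joint 0 is closer but grounded, so the airborne joint 1 is preferred.
feet = [(0.2, 0.0, 0.0), (0.3, 0.0, 0.0)]
d, joint = weighted_foot_ball_distance(feet, (0.0, 0.0, 0.0), ground_contact=[1, 0])
print(joint)  # 1
```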

Contact loss:

\[L = \sum_{i=1}^{F} d^i \cdot \frac{\mathbb{I}(d^i > \tau_d) \cdot \hat{c}_b^i}{\mathbb{I}(d^i > \tau_d) + \delta}\]

This activates only when the ball-foot distance exceeds threshold \(\tau_d = 0.1\text{m}\) and a contact event is detected, guiding the foot toward the ball.
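The gating behavior can be checked with a direct transcription of the formula as written above. The value of \(\delta\) is not stated, so `delta=1e-6` is an assumption; in the actual method the loss would be differentiated with respect to the motion, which this plain-Python sketch does not do.

```python
def contact_loss(distances, contact_flags, tau_d=0.1, delta=1e-6):
    """Sum of d^i over frames where d^i > tau_d and a contact event is detected."""
    loss = 0.0
    for d, c in zip(distances, contact_flags):
        gate = 1.0 if d > tau_d else 0.0      # I(d^i > tau_d)
        loss += d * (gate * c) / (gate + delta)
    return loss

# Frame 0: foot already within 0.1 m; frame 1: contact but foot too far; frame 2: no contact.
print(contact_loss([0.05, 0.3, 0.5], [1, 1, 0]))  # approximately 0.3
```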

Gradient guidance: An adaptive step-size strategy from DSG is employed, blending the gradient direction with the unconditional sampling direction (guidance rate \(w_r = 0.5\)) to improve contact accuracy while preserving motion diversity.

Deployment strategy: Contact guidance is applied only during the final 2 of 8 denoising steps, since guidance applied under high noise levels in early steps tends to be ineffective and may produce unnatural results.
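The schedule can be sketched as a sampling loop in which guidance is only invoked on the last steps. Here `denoise` and `guide` are hypothetical placeholders for the diffusion model's denoising step and the gradient-based contact correction; the toy lambdas below exist only to show which steps are guided.

```python
def sample_with_late_guidance(x, denoise, guide, num_steps=8, num_guided=2):
    """Run num_steps denoising steps; apply guide() only on the last num_guided."""
    for step in range(num_steps):  # step 0 is the noisiest
        x = denoise(x, step)
        if step >= num_steps - num_guided:
            x = guide(x, step)
    return x

guided = []
result = sample_with_late_guidance(
    1.0,
    denoise=lambda x, step: x * 0.5,                    # toy stand-in
    guide=lambda x, step: (guided.append(step), x)[1],  # records the step, returns x
)
print(guided)  # [6, 7]
```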

Implementation Details¶

Frame rate: 30 Hz; past frames \(P=10\); future frames \(F=45\)
Diffusion denoising steps: 8 (consistent between training and inference)
Runtime environment: Unity (user interaction and visualization) + Python (model inference), communicating via TCP
Hardware: Intel i7-10700K + NVIDIA RTX 3080 Ti

Dataset: Soccer-X¶

The authors construct Soccer-X, the first large-scale dataset targeting data-driven soccer motion generation:

Attribute	Value
Capture system	16 OptiTrack Prime x13 cameras
Capture volume	6m × 7.5m × 2.5m
Raw frame rate	240 fps (downsampled to 30 fps)
Number of players	30
Total frames	~1.08M
Total duration	>10 hours
Body format	SMPL
Motion categories	6

Six motion categories: Dribble (varying speed/foot/direction), Stand, Off-the-ball Move, Trick (5 types of skill moves), Shoot (captured indoors; full ball trajectories simulated via physics in Unity), and Celebrate.
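The frame count and duration in the table are consistent with each other at the downsampled rate, as a short calculation shows (the frame count is the approximate figure from the table):

```python
total_frames = 1_080_000   # "~1.08M" frames from the table (approximate)
fps = 30                   # frame rate after downsampling
raw_fps = 240              # capture frame rate

duration_hours = total_frames / fps / 3600
downsample_factor = raw_fps // fps
print(duration_hours, downsample_factor)  # 10.0 8
```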

Key Experimental Results¶

Quantitative Comparison¶

Comparison against three real-time controller baselines (LMP, MANN-DP, CM), using test-set trajectories as conditioning input:

Method	FID↓	Foot Slide↓	Accel.↓	Diversity↑	Traj. Error↓	Orient. Error↓	Skill Acc.↑
LMP	0.354	1.068	1.607	0.398	4.116	6.493	73.3%
MANN-DP	0.359	1.351	1.565	0.475	4.069	5.299	69.1%
CM	0.249	1.650	1.175	0.352	3.103	5.066	52.9%
SMGDiff	0.181	0.854	1.200	0.618	2.413	4.939	93.3%

SMGDiff outperforms all baselines on every metric except acceleration, where CM is marginally lower (1.175 vs. 1.200). Skill accuracy reaches 93.3%, 20 percentage points above the strongest baseline (LMP, 73.3%), and the lowest FID (0.181 vs. 0.249 for CM) indicates that the generated motion distribution most closely matches real data.

Ablation Study¶

Variant	FID↓	Foot Slide↓	Accel.↓	Diversity↑
w/o TGM	0.365	1.003	1.197	2.433
w/o CGM	0.370	1.005	1.196	2.691
Full	0.358	1.005	1.201	2.693

Role of TGM: Replacing straight-line trajectories with diverse generated ones lowers FID (0.365 → 0.358) and improves diversity (2.433 → 2.693).
Role of CGM: Contact guidance leaves foot slide unchanged (1.005) and slightly increases acceleration (1.196 → 1.201), but it lowers FID from 0.370 to 0.358, indicating more realistic human-ball interaction.

Runtime Analysis¶

Denoising Steps	2	4	8	16	32
Inference Time	3ms	6ms	12ms	25ms	52ms
FID	0.395	0.390	0.370	0.366	0.338

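Eight denoising steps offer the best trade-off between inference speed and generation quality: at 12 ms per inference they fit well within the roughly 33 ms frame budget of 30 Hz playback, while doubling to 16 steps lowers FID only from 0.370 to 0.366 at about twice the cost.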
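The table can be checked against a per-frame budget with a few lines. Treating one inference call as needing to fit within a single 30 Hz frame is an assumption made here for illustration; the note only states that 12 ms satisfies real-time constraints.

```python
# Inference time in milliseconds per number of denoising steps (from the table).
inference_ms = {2: 3, 4: 6, 8: 12, 16: 25, 32: 52}

frame_budget_ms = 1000 / 30  # about 33.3 ms per frame at 30 Hz
within_budget = [steps for steps, ms in inference_ms.items() if ms <= frame_budget_ms]
ms_per_step = {steps: ms / steps for steps, ms in inference_ms.items()}

print(within_budget)    # [2, 4, 8, 16]
print(ms_per_step[8])   # 1.5
```

Cost grows roughly linearly at about 1.5 ms per step, so 32 steps (52 ms) is the only listed setting that exceeds the assumed budget.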

Limitations & Future Work¶

Absence of physical constraints: No real physics simulation is incorporated during training or inference, which may yield physically implausible motions.
Limited interaction types: Only ball-foot interaction is considered; other body-part contacts such as headers and chest control are ignored.
Single-player limitation: Multi-player interaction scenarios (e.g., passing, tackling) are not addressed; future work could combine physics simulation to generate multi-player soccer motions.

Personal Reflections¶

The two-stage decoupled design is the central contribution: separating "coarse control → fine trajectory" from "trajectory → full-body motion" reduces the learning difficulty of each individual model and allows the trajectory to serve as a flexible intermediate representation.
The lightweight contact guidance strategy is noteworthy: applying guidance only in the final 2 steps (rather than throughout) effectively improves contact quality while maintaining real-time performance—a "selective guidance" idea generalizable to other diffusion generation tasks with physical constraints.
The ball control weight design is concise and effective; distance-based decay unifies global/relative ball position representations and prevents distant balls from interfering with motion generation.
The construction and open-sourcing of Soccer-X is a valuable contribution to the community, though indoor capture in a 6m × 7.5m × 2.5m volume still differs from real-field soccer conditions.
The most promising future directions include post-optimization with physics engines such as Isaac Gym and extension to multi-player interaction scenarios.

Rating¶

Novelty: TBD
Experimental Thoroughness: TBD
Writing Quality: TBD
Value: TBD
