SMGDiff: Soccer Motion Generation using Diffusion Probabilistic Models¶
Conference: ICCV 2025
arXiv: 2411.16216
Code: https://github.com/SMGDiff/SMGDiff
Area: Image Generation
Keywords: motion generation, diffusion model, soccer animation, human-object interaction, character control
TL;DR¶
This paper proposes SMGDiff, a two-stage diffusion model framework that generates high-quality, diverse soccer motion animations in real time from user control signals, while refining ball-foot interaction details via a contact guidance module.
Background & Motivation¶
Soccer is the world's most popular sport, with broad application demand in gaming and VR/AR. However, generating realistic soccer motion animation faces the following core challenges:
- Complex human-ball interaction: Soccer involves precise physical contact between players and the ball, with particularly stringent accuracy requirements for ball-foot contact.
- Real-time requirements: Games and interactive applications demand real-time inference, whereas existing diffusion models are typically computationally expensive.
- Motion diversity: Soccer encompasses a wide range of skill categories (dribbling, tricks, shooting, etc.), requiring coverage of a broad motion spectrum.
- Lack of datasets: No large-scale public mocap dataset for soccer motion exists.
Limitations of prior work:
- Commercial games (e.g., the EA SPORTS FC series) rely on massive pre-recorded motion libraries for motion matching, which is prone to visual artifacts.
- Reinforcement learning / physics simulation methods are limited to specific skills (e.g., dribbling, shooting, juggling) and cannot cover the full soccer motion spectrum.
- General motion diffusion models (e.g., CAMDM) focus solely on human motion style transitions and do not handle human-object interaction.
- Human-object interaction generation methods require time-consuming post-optimization, making them unsuitable for real-time interactive scenarios.
Method¶
SMGDiff adopts a two-stage framework: the first stage generates trajectories, and the second stage generates soccer motions conditioned on those trajectories.
Motion Representation¶
A soccer motion state \(x^i = \{h, b, c\}\) consists of three components:
- Body state \(h \in \mathbb{R}^{3+24 \times 6}\): root position (3 values) and a 6D rotation representation for each of the 24 SMPL joints.
- Ball state \(b \in \mathbb{R}^{7}\): relative ball position, global ball velocity, and ball control weight.
- Binary contact labels \(c = \{c_g, c_b\}\): foot-ground contact and foot-ball contact.
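The ball control weight \(w_b = 1 - \|b_p^{xy} - h_p^{xy}\| / r\), with \(r=2\text{m}\), is computed from the horizontal distance between the ball and the character root and controls the conversion of the global ball position into a root-relative one. It falls linearly from 1 at the root to 0 at distance \(r\); beyond that radius the ball is treated as uncontrolled (the expression as written would go negative there, so it is presumably clamped at 0). This decouples the ball representation between controlled and uncontrolled states.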
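The weight formula above can be sketched in a few lines. This is only an illustration: the function name `ball_control_weight` and the clamp to \([0, 1]\) are assumptions, not taken from the paper's code.

```python
import math

def ball_control_weight(ball_xy, root_xy, r=2.0):
    """w_b = 1 - ||b_xy - h_xy|| / r, clamped to [0, 1] (the clamp is an assumption)."""
    dist = math.hypot(ball_xy[0] - root_xy[0], ball_xy[1] - root_xy[1])
    return min(1.0, max(0.0, 1.0 - dist / r))

# Ball 0.5 m from the root: still strongly "controlled".
print(ball_control_weight((0.5, 0.0), (0.0, 0.0)))  # 0.75
# Ball 3 m away, outside r = 2 m: treated as uncontrolled.
print(ball_control_weight((3.0, 0.0), (0.0, 0.0)))  # 0.0
```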
Stage 1: Trajectory Generation Model (TGM)¶
Objective: Transform coarse-grained user control signals (direction, speed, skill category) into fine-grained global character trajectories.
Architecture: A lightweight single-step diffusion model based on a Transformer Encoder. Inputs include:
- Soccer skill label \(\mathbf{S}\) (6 categories: dribble, trick, shoot, stand, celebrate, off-the-ball move)
- Target waypoints \(\mathbf{G}\) (computed from keyboard directional input and press intensity)
- Past trajectory \(\mathbf{T}^{\mathcal{P}}\)
- Gaussian noise \(\epsilon \sim \mathcal{N}(0, \mathbf{I})\) (for diversity)
At inference, the process mirrors a single-step DDPM reverse pass, completing trajectory generation in a single forward pass to ensure real-time performance.
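A minimal sketch of what single-pass sampling looks like in practice: draw Gaussian noise once, call the network once, and treat the output as the clean trajectory. The names `sample_trajectory` and `toy_denoiser`, the 2D point format, and the 45-point horizon are illustrative assumptions; `toy_denoiser` is a hand-written stand-in for the Transformer encoder, not the actual model.

```python
import random

def sample_trajectory(denoiser, skill, waypoints, past_traj, horizon=45, rng=None):
    """One-pass sampling: draw noise, run the network once, return its prediction."""
    rng = rng or random.Random()
    noise = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(horizon)]
    return denoiser(noise, skill, waypoints, past_traj)

def toy_denoiser(noise, skill, waypoints, past_traj):
    """Stand-in network: straight line to the last waypoint, perturbed by the noise."""
    x0, y0 = past_traj[-1]
    gx, gy = waypoints[-1]
    n = len(noise)
    out = []
    for i, (nx, ny) in enumerate(noise, start=1):
        a = i / n  # interpolation factor along the path
        out.append((x0 + a * (gx - x0) + 0.05 * nx,
                    y0 + a * (gy - y0) + 0.05 * ny))
    return out

traj = sample_trajectory(toy_denoiser, "dribble", [(3.0, 0.0)], [(0.0, 0.0)],
                         rng=random.Random(0))
print(len(traj))  # 45
```

Because the noise is the only stochastic input, different seeds yield different trajectories for the same control signal, which is where the diversity comes from.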
Trajectory blending: The HFTE (Heuristic Future Trajectory Extension) strategy from CAMDM is adopted; when the user issues new control signals, the newly generated trajectory is blended with the previous frame's result to prevent abrupt changes in character direction and speed.
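To illustrate the blending idea, the sketch below cross-fades linearly from the previous trajectory to the newly generated one. It is not the HFTE algorithm itself, whose details are not given here; `blend_trajectories` and `blend_frames` are hypothetical names.

```python
def blend_trajectories(prev, new, blend_frames=10):
    """Cross-fade from prev to new over the first blend_frames points."""
    out = []
    for i, (p, q) in enumerate(zip(prev, new)):
        w = min(1.0, (i + 1) / blend_frames)  # weight of the new trajectory
        out.append(tuple((1 - w) * a + w * b for a, b in zip(p, q)))
    return out

prev = [(0.0, 0.0)] * 4   # character was heading for x = 0
new = [(4.0, 0.0)] * 4    # new control signal targets x = 4
print(blend_trajectories(prev, new, blend_frames=4))
# [(1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (4.0, 0.0)]
```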
Stage 2: Soccer Motion Diffusion Model¶
Architecture: An autoregressive diffusion model based on Transformers that generates soccer motion sequences conditioned on trajectories.
Conditioning input \(\mathbf{C} = \{\mathbf{S}, \mathbf{X}^{\mathcal{P}}, \mathbf{T}^{\mathcal{F}}\}\):
- Skill label \(\mathbf{S}\)
- Past motion \(\mathbf{X}^{\mathcal{P}}\) (10 history frames)
- Future trajectory \(\mathbf{T}^{\mathcal{F}}\) (45 frames)
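The autoregressive conditioning can be sketched as a sliding-window loop: each call sees the last \(P=10\) generated frames plus a future trajectory and emits the next window. The names `rollout`, `generate_future`, `trajectory_fn`, and the `execute` parameter (how many generated frames are kept before re-planning) are assumptions; the source does not state the re-planning interval.

```python
from collections import deque

P, F = 10, 45  # history and future window lengths from the text

def rollout(generate_future, seed_frames, trajectory_fn, skill, n_windows, execute=F):
    """Repeatedly generate F frames conditioned on the last P frames."""
    history = deque(seed_frames[-P:], maxlen=P)  # keeps only the newest P frames
    motion = []
    for _ in range(n_windows):
        future_traj = trajectory_fn(list(history))                    # T^F
        future = generate_future(skill, list(history), future_traj)   # X^F
        step = future[:execute]
        motion.extend(step)
        history.extend(step)
    return motion

# Stub model: frames are integers, and the "model" simply counts upward.
def count_up(skill, past, traj):
    return [past[-1] + i for i in range(1, F + 1)]

frames = rollout(count_up, list(range(10)), lambda past: None, "dribble", n_windows=2)
print(len(frames), frames[0], frames[-1])  # 90 10 99
```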
Training loss consists of four terms:
- \(\mathcal{L}_{\text{simple}}\): reconstruction loss for directly predicting \(\mathbf{X}_0^{\mathcal{F}}\)
- \(\mathcal{L}_{\text{pos}}\): joint position loss obtained via forward kinematics
- \(\mathcal{L}_{\text{vel}}\): velocity consistency loss
- \(\mathcal{L}_{\text{foot}}\): foot-ground contact constraint loss (penalizing foot sliding during contact)
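As an illustration of the foot-ground term, the sketch below penalizes foot displacement between consecutive frames whenever the contact label is set. The exact formulation and the weights used to combine the four terms are not given in this summary, so `foot_contact_loss` and its mean-squared form are assumptions.

```python
def foot_contact_loss(foot_positions, contact_labels):
    """Mean squared frame-to-frame foot displacement, counted only during contact."""
    total, count = 0.0, 0
    for t in range(len(foot_positions) - 1):
        if contact_labels[t]:
            delta = [b - a for a, b in zip(foot_positions[t], foot_positions[t + 1])]
            total += sum(v * v for v in delta)
            count += 1
    return total / count if count else 0.0

# The foot slides 10 cm while labelled as in contact, then stays still.
positions = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (0.1, 0.0, 0.0)]
print(foot_contact_loss(positions, [1, 1, 1]))  # small positive value
print(foot_contact_loss(positions, [0, 0, 0]))  # 0.0, no contact and no penalty
```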
Contact Guidance Module (CGM)¶
Contact guidance is introduced during the diffusion inference process to refine ball-foot contact via a dedicated loss function.
Contact detection: A foot-ball contact event is detected when the magnitude of the ball's acceleration exceeds the threshold \(\tau_a = 2\text{ m/s}^2\). Under ground friction alone the acceleration is small and roughly constant, so a larger value signals that the ball has been touched.
Contact joint selection: Distances from each foot joint to the ball are computed and the closest joint is selected, with preference given to the airborne foot (the grounded foot's distance is penalized by weight \(w_d=2\)).
Contact loss: The loss activates only when a contact event is detected and the ball-foot distance exceeds threshold \(\tau_d = 0.1\text{m}\), guiding the selected foot joint toward the ball.
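The three pieces (detection, joint selection, loss) can be sketched as follows. Several details are assumptions made for illustration: acceleration is estimated by finite differences of the ball velocity at 30 Hz, the grounded-foot penalty is applied multiplicatively, and the loss is the plain Euclidean distance. All function names are hypothetical.

```python
import math

TAU_A, TAU_D, W_D = 2.0, 0.1, 2.0  # thresholds and penalty weight quoted in the text

def contact_event(ball_vel_prev, ball_vel_next, dt=1 / 30):
    """True if the finite-difference ball acceleration magnitude exceeds tau_a."""
    acc = [(b - a) / dt for a, b in zip(ball_vel_prev, ball_vel_next)]
    return math.sqrt(sum(v * v for v in acc)) > TAU_A

def select_contact_joint(foot_joints, grounded, ball_pos):
    """Index of the foot joint with the smallest (penalized) distance to the ball."""
    def score(i):
        d = math.dist(foot_joints[i], ball_pos)
        return d * W_D if grounded[i] else d  # grounded feet are less likely kickers
    return min(range(len(foot_joints)), key=score)

def contact_loss(foot_joints, grounded, ball_pos, event):
    """Distance from the selected joint to the ball, active only beyond tau_d."""
    if not event:
        return 0.0
    j = select_contact_joint(foot_joints, grounded, ball_pos)
    d = math.dist(foot_joints[j], ball_pos)
    return d if d > TAU_D else 0.0

feet = [(0.3, 0.0, 0.0), (0.5, 0.0, 0.0)]   # left foot grounded, right foot airborne
event = contact_event((0.0, 0.0, 0.0), (1.0, 0.0, 0.0))
print(event, select_contact_joint(feet, [True, False], (0.0, 0.0, 0.0)))  # True 1
```

In this example the airborne right foot is chosen even though the grounded left foot is geometrically closer, which is the preference the penalty weight is meant to encode.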
Gradient guidance: An adaptive step-size strategy from DSG is employed, blending the gradient direction with the unconditional sampling direction (guidance rate \(w_r = 0.5\)) to improve contact accuracy while preserving motion diversity.
Deployment strategy: Contact guidance is applied only during the final 2 of 8 denoising steps, since guidance applied under high noise levels in early steps tends to be ineffective and may produce unnatural results.
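The deployment schedule amounts to a simple condition inside the reverse loop. The sketch below shows only that schedule; `denoise_step` and `guide` are placeholder callables, and the DSG step-size rule and the order of guidance relative to the denoising update are left out as unspecified.

```python
def denoise_with_late_guidance(x, denoise_step, guide, total_steps=8, guided_steps=2):
    """Run total_steps reverse steps; apply guide() only on the last guided_steps."""
    for k in range(total_steps):
        x = denoise_step(x, k)
        if k >= total_steps - guided_steps:  # steps 6 and 7 when total_steps = 8
            x = guide(x, k)
    return x

guided = []
denoise_with_late_guidance(
    0.0,
    denoise_step=lambda x, k: x,                 # placeholder denoiser
    guide=lambda x, k: (guided.append(k), x)[1], # records which steps were guided
)
print(guided)  # [6, 7]
```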
Implementation Details¶
- Frame rate: 30 Hz; past frames \(P=10\); future frames \(F=45\)
- Diffusion denoising steps: 8 (consistent between training and inference)
- Runtime environment: Unity (user interaction and visualization) + Python (model inference), communicating via TCP
- Hardware: Intel i7-10700K + NVIDIA RTX 3080 Ti
Dataset: Soccer-X¶
The authors construct Soccer-X, the first large-scale dataset targeting data-driven soccer motion generation:
| Attribute | Value |
|---|---|
| Capture system | 16 OptiTrack PrimeX 13 cameras |
| Capture volume | 6m × 7.5m × 2.5m |
| Raw frame rate | 240 fps (downsampled to 30 fps) |
| Number of players | 30 |
| Total frames | ~1.08M |
| Total duration | >10 hours |
| Body format | SMPL |
| Motion categories | 6 |
Six motion categories: Dribble (varying speed/foot/direction), Stand, Off-the-ball Move, Trick (5 types of skill moves), Shoot (captured indoors; full ball trajectories simulated via physics in Unity), and Celebrate.
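The frame-rate and duration figures in the table can be cross-checked with simple arithmetic. The value 1,080,000 is used here as a stand-in for the approximate "~1.08M" frame count, so the result is only as precise as that rounding.

```python
RAW_FPS, TARGET_FPS = 240, 30

stride = RAW_FPS // TARGET_FPS            # keep every 8th raw frame
total_frames = 1_080_000                  # "~1.08M" frames at 30 fps
hours = total_frames / TARGET_FPS / 3600  # frames -> seconds -> hours

print(stride, hours)  # 8 10.0
```

The result of about 10 hours is consistent with the reported ">10 hours" given that the frame count is rounded.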
Key Experimental Results¶
Quantitative Comparison¶
Comparison against three real-time controller baselines (LMP, MANN-DP, CM), using test-set trajectories as conditioning input:
| Method | FID↓ | Foot Slide↓ | Accel.↓ | Diversity↑ | Traj. Error↓ | Orient. Error↓ | Skill Acc.↑ |
|---|---|---|---|---|---|---|---|
| LMP | 0.354 | 1.068 | 1.607 | 0.398 | 4.116 | 6.493 | 73.3% |
| MANN-DP | 0.359 | 1.351 | 1.565 | 0.475 | 4.069 | 5.299 | 69.1% |
| CM | 0.249 | 1.650 | 1.175 | 0.352 | 3.103 | 5.066 | 52.9% |
| SMGDiff | 0.181 | 0.854 | 1.200 | 0.618 | 2.413 | 4.939 | 93.3% |
SMGDiff achieves the best score on every metric except acceleration, where CM is marginally lower (1.175 vs. 1.200). Skill accuracy reaches 93.3%, 20 percentage points above the strongest baseline (LMP, 73.3%), and the lowest FID (0.181) indicates that the generated motion distribution most closely matches real data.
Ablation Study¶
| Variant | FID↓ | Foot Slide↓ | Accel.↓ | Diversity↑ |
|---|---|---|---|---|
| w/o TGM | 0.365 | 1.003 | 1.197 | 2.433 |
| w/o CGM | 0.370 | 1.005 | 1.196 | 2.691 |
| Full | 0.358 | 1.005 | 1.201 | 2.693 |
- Role of TGM: Replacing straight-line trajectories with diverse generated ones lowers FID (0.365 → 0.358) and raises diversity (2.433 → 2.693).
- Role of CGM: Contact guidance leaves foot slide unchanged (1.005) and slightly increases acceleration (1.196 → 1.201), but it lowers FID (0.370 → 0.358), indicating more realistic human-ball interaction.
Runtime Analysis¶
| Denoising Steps | 2 | 4 | 8 | 16 | 32 |
|---|---|---|---|---|---|
| Inference Time | 3ms | 6ms | 12ms | 25ms | 52ms |
| FID | 0.395 | 0.390 | 0.370 | 0.366 | 0.338 |
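Eight denoising steps are chosen as the trade-off between inference speed and generation quality: inference takes 12ms, well under the roughly 33ms frame interval at 30 Hz. More steps lower FID further (0.366 at 16 steps, 0.338 at 32), but at roughly two and four times the cost.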
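The timings in the table can be compared against a 30 Hz frame budget with a short calculation. This assumes one full inference per rendered frame, which the source does not state; `fits_budget` is a hypothetical helper.

```python
timings_ms = {2: 3, 4: 6, 8: 12, 16: 25, 32: 52}  # steps -> inference time, from the table
frame_budget_ms = 1000 / 30                        # about 33.3 ms per frame at 30 Hz

def fits_budget(steps):
    """True if inference with this many steps finishes within one 30 Hz frame."""
    return timings_ms[steps] <= frame_budget_ms

for steps, ms in timings_ms.items():
    print(f"{steps:>2} steps: {ms / steps:.2f} ms/step, fits 30 Hz frame: {fits_budget(steps)}")
```

Cost grows roughly linearly at about 1.5 ms per step. Under this assumption 16 steps would also fit within a frame, but with much less headroom for rendering and TCP communication than 8 steps, and 32 steps would not fit.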
Limitations & Future Work¶
- Absence of physical constraints: No real physics simulation is incorporated during training or inference, which may yield physically implausible motions.
- Limited interaction types: Only ball-foot interaction is considered; other body-part contacts such as headers and chest control are ignored.
- Single-player limitation: Multi-player interaction scenarios (e.g., passing, tackling) are not addressed; future work could combine physics simulation to generate multi-player soccer motions.
Personal Reflections¶
- The two-stage decoupled design is the central contribution: separating "coarse control → fine trajectory" from "trajectory → full-body motion" reduces the learning difficulty of each individual model and allows the trajectory to serve as a flexible intermediate representation.
- The lightweight contact guidance strategy is noteworthy: applying guidance only in the final 2 steps (rather than throughout) improves contact quality while maintaining real-time performance. This "selective guidance" idea should generalize to other diffusion generation tasks with physical constraints.
- The ball control weight design is concise and effective; distance-based decay unifies global/relative ball position representations and prevents distant balls from interfering with motion generation.
- The construction and open-sourcing of Soccer-X is a valuable contribution to the community, though indoor capture in a 6m × 7.5m volume (recorded at 240 fps and downsampled to 30 fps) still differs from real-field soccer conditions.
- The most promising future directions include post-optimization with physics engines such as Isaac Gym and extension to multi-player interaction scenarios.
Rating¶
- Novelty: TBD
- Experimental Thoroughness: TBD
- Writing Quality: TBD
- Value: TBD