UnrealPose: Leveraging Game Engine Kinematics for Large-Scale Synthetic Human Pose Data
Conference: CVPR 2026 arXiv: 2601.00991 Code: Available Area: Segmentation Keywords: synthetic data, Human Pose Estimation, Unreal Engine, Game Engine, Instance Segmentation
TL;DR
This paper proposes UnrealPose-Gen, a synthetic human pose data generation pipeline built on Unreal Engine 5, which leverages native game engine skeletal kinematics—rather than SMPL—to produce UnrealPose-1M, a million-scale annotated dataset providing 3D joint positions, 2D keypoints, occlusion flags, instance segmentation masks, and camera parameters.
Background & Motivation
Accurate 3D human pose data acquisition has long been a pain point in the field, with existing approaches suffering from serious shortcomings:
Limitations of real 3D datasets: Human3.6M, 3DPW, and similar datasets rely on marker-based motion capture systems, which are costly, confined to limited scenes, and lack action diversity.
Absence of 3D annotations in 2D datasets: COCO-Pose and MPII provide rich in-the-wild annotations but lack 3D labels and suffer from inter-annotator inconsistency.
Systematic biases in pseudo-3D supervision:
- Lifting methods (2D→3D): poor cross-dataset generalization, with significant accuracy degradation.
- Parametric model fitting (SMPL family): inherits demographic biases from its training data (CAESAR), and joint positions depend on fitting quality and regressor choice rather than anatomy; certain methods produce bent knees or unnaturally straight legs.
Existing synthetic data is mesh-centric: SURREAL, AGORA, BEDLAM, and others are built around SMPL parameters, with joint labels derived from mesh regression rather than kinematic pivot points; complex interactions remain an unsolved problem.
Core insight: Game developers have spent decades creating rich multi-person interactions, object manipulation, and diverse character animations—the computer vision community can directly leverage these abundant game animation assets.
Method
Overall Architecture
UnrealPose consists of two components:
- UnrealPose-Gen: a UE5/MRQ (Movie Render Queue) data generation pipeline.
- UnrealPose-1M: an approximately one-million-frame annotated dataset produced by this pipeline.
Key Designs
UnrealPose-Gen: Generation Pipeline
Camera-centric architecture
Function: Constructs the annotation pipeline entirely within the camera coordinate system, supporting both real-time online rendering (gameplay) and offline rendering (MRQ).
Mechanism: Users select character assets to track (up to 255), and the system extracts annotations for all tracked characters from the camera's perspective. Arbitrary camera parameters (focal length, sensor size, aspect ratio) and resolutions are supported, with the annotation system automatically scaling projected coordinates.
Design Motivation: A unified architecture ensures consistent annotation quality across online and offline rendering. The online rendering capability enables direct training data generation from UE5 games—an unprecedented feature.
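The camera-centric projection and resolution scaling can be sketched as follows. This is an illustrative pinhole-camera model, not the paper's actual code; the function name and parameters are assumptions.

```python
import numpy as np

def project_to_pixels(joints_cam, focal_mm, sensor_w_mm, img_w, img_h):
    """Project camera-space 3D joints (N, 3) to pixel coordinates.

    Pinhole model with square pixels and a centered principal point.
    The pixel focal length is derived from the physical focal length,
    sensor width, and output width, so the same 3D labels project
    consistently at any rendering resolution (the automatic scaling
    behavior described above).
    """
    fx = focal_mm / sensor_w_mm * img_w   # focal length in pixels
    cx, cy = img_w / 2.0, img_h / 2.0     # principal point at image center
    x, y, z = joints_cam[:, 0], joints_cam[:, 1], joints_cam[:, 2]
    u = fx * x / z + cx
    v = fx * y / z + cy                   # square pixels: fy == fx
    return np.stack([u, v], axis=1)

# A joint on the optical axis lands exactly at the image center.
joints = np.array([[0.0, 0.0, 2.0], [0.5, -0.2, 2.0]])  # metres, camera frame
px = project_to_pixels(joints, focal_mm=35.0, sensor_w_mm=36.0,
                       img_w=1920, img_h=1080)
```

Rendering the same scene at a different resolution only changes `img_w`/`img_h`; the projected labels scale proportionally, which is why one annotation path can serve arbitrary camera configurations.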
Annotation generation
Four annotation types are generated per frame:
- 3D joint positions: Skeletal mesh components are queried to obtain all specified joint positions in world coordinates, which are then transformed into the camera coordinate system. These positions represent the pivot points driving the animation, distinct from SMPL-regressed joints.
- 2D keypoint projections: Two sets are provided—2D projections of 3D joints, and the standard COCO-Pose 17 keypoints.
- Per-joint visibility flags: Occlusion status is determined via ray tracing (line trace) from the camera to each world point.
- Occlusion-aware person detection labels: Bounding boxes and instance segmentation masks are generated with occlusion awareness—occluded regions are clipped, bounding boxes tightly enclose the visible portion, and each person is assigned a unique instance ID maintained across frames.
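The occlusion-aware boxes can be illustrated with a small sketch: given a per-pixel instance-ID mask (which by construction contains only pixels visible to the camera), a tight box over each ID's visible pixels yields exactly the clipped, occlusion-aware labels described above. The mask encoding and function name are assumptions, not the paper's implementation.

```python
import numpy as np

def visible_bboxes(id_mask):
    """Derive occlusion-aware boxes from a per-pixel instance-ID mask.

    id_mask: (H, W) integer array, 0 = background, k > 0 = person k.
    Since the mask holds only camera-visible pixels, each box tightly
    encloses the visible portion of that person.
    Returns {instance_id: (x_min, y_min, x_max, y_max)}.
    """
    boxes = {}
    for k in np.unique(id_mask):
        if k == 0:
            continue
        ys, xs = np.nonzero(id_mask == k)
        boxes[int(k)] = (int(xs.min()), int(ys.min()),
                         int(xs.max()), int(ys.max()))
    return boxes

# Person 2 is rendered in front of person 1; person 1's box shrinks
# to cover only the pixels that remain visible.
mask = np.zeros((6, 8), dtype=int)
mask[1:5, 1:4] = 1   # person 1
mask[2:6, 3:7] = 2   # person 2, drawn on top (occluder)
boxes = visible_bboxes(mask)
```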
Data filtering
Two quality-control criteria are applied:
- Frame boundary filtering: frames whose keypoint projections fall outside the image are discarded, ensuring all retained subjects are fully within the frame.
- Temporal redundancy filtering: per-joint Euclidean distances between consecutive frames are computed, and frames with minimal change are discarded, reducing near-duplicates while preserving coverage.
Design Motivation: The pipeline is highly customizable—users can modify a few lines of code to export any subset of skeletal joints (eyes, ears, finger joints, facial landmarks, etc.) and support any UE-compatible character model.
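The temporal redundancy filter might look like the sketch below. The greedy keep-or-drop logic is an assumption; the 100 mm summed-distance threshold mirrors the figure the paper quotes for de-duplicating the dataset split.

```python
import numpy as np

def filter_redundant_frames(seq_joints, min_total_dist_mm=100.0):
    """Drop near-duplicate frames from a motion sequence.

    seq_joints: (T, J, 3) camera-space joint positions in mm.
    A frame is kept only if the sum of per-joint Euclidean distances
    to the most recently *kept* frame exceeds the threshold.
    Returns the indices of kept frames; the first frame is always kept.
    """
    keep = [0]
    for t in range(1, len(seq_joints)):
        delta = seq_joints[t] - seq_joints[keep[-1]]
        total = np.linalg.norm(delta, axis=1).sum()  # summed over joints
        if total > min_total_dist_mm:
            keep.append(t)
    return keep

# 4 frames of a 16-joint skeleton: frame 1 barely moves (dropped),
# frames 2 and 3 move clearly (kept).
rng = np.random.default_rng(0)
base = rng.normal(size=(16, 3)) * 500.0
seq = np.stack([base, base + 1.0, base + 50.0, base + 120.0])
kept = filter_redundant_frames(seq)
```

Comparing against the last kept frame (rather than the immediate predecessor) prevents a slow drift from slipping through as a long run of individually tiny steps.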
UnrealPose-1M: Dataset
Motion drivers
Two modes are supported to balance temporal consistency and diversity:
- Scripted mode: Waypoints are defined, and characters move between them while playing specified motion/idle animations, producing temporally coherent sequences suited to video-level pose estimation methods.
- Random mode: An exploration region and animation directory are defined, and the system randomly selects positions and animations to maximize diversity of poses, viewpoints, and actions—suited to single-frame methods.
Dataset composition
| Sequence Type | Frames | Scenes | Characters | Actions |
|---|---|---|---|---|
| Coherent sequences ×5 | ~800K | 5 (gallery + basketball court) | 5 MetaHumans | ~40 scripted actions |
| Random sequences ×3 | ~170K | 3 (urban park) | 5 MetaHumans | ~100 random animations |
| Multi-person frames | ~115K | 2 scenes | — | — |
| Total | ~1M | 8 environments | 5 characters | Diverse |
Camera configuration diversity: FOV ranges from 30° to 90°; camera height spans from ground level to overhead; distances vary widely—covering ground-level and steep overhead perspectives rarely seen in standard benchmarks.
Per-frame annotations: (i) 17 COCO-format 2D keypoints with visibility flags; (ii) 2D projections of 16 skeletal joints with visibility flags; (iii) 16 3D joints in world and camera coordinates; (iv) per-person bounding box, segmentation mask, and unique ID.
Data split: 75/20/5 train/val/test, with a minimum inter-frame Euclidean distance of 100 mm (summed over all joints in camera coordinates).
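The per-frame record described above could be modeled with a schema like the following. All field names are hypothetical; the released dataset's exact format may differ.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Hypothetical schema mirroring the per-frame annotations listed above;
# field names are illustrative, not the dataset's actual keys.
@dataclass
class PersonAnnotation:
    instance_id: int                                   # stable across frames
    kpts2d_coco: List[Tuple[float, float, int]]        # 17 COCO kpts (u, v, vis)
    kpts2d_skel: List[Tuple[float, float, int]]        # 16 skeletal projections
    joints3d_world: List[Tuple[float, float, float]]   # 16 joints, world coords
    joints3d_cam: List[Tuple[float, float, float]]     # 16 joints, camera coords
    bbox_visible: Tuple[int, int, int, int]            # occlusion-aware box
    mask_rle: str                                      # instance mask (RLE assumed)

@dataclass
class FrameAnnotation:
    frame_index: int
    camera_intrinsics: Tuple[float, float, float, float]  # fx, fy, cx, cy
    people: List[PersonAnnotation] = field(default_factory=list)

frame = FrameAnnotation(frame_index=0,
                        camera_intrinsics=(1866.7, 1866.7, 960.0, 540.0))
```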
SMPL-Free Engine-Native Labels
Function: Joint labels are extracted directly from UE skeletal pivot points, entirely independent of SMPL.
Mechanism: MetaHuman characters and their skeletal joints are used in place of SMPL mesh regression. Any UE-compatible skeleton and animation (marketplace assets, retargeted MoCap, or even SMPL motions retargeted to the UE rig) can be rendered with full annotations generated automatically.
Design Motivation: This eliminates the systematic biases of SMPL: (i) joint positions depend on fitting and regressors rather than kinematic rotation centers; (ii) fixed topology struggles to model loose clothing, hair, and complex contact; (iii) the shape space reflects the demographics of training scans (CAESAR: ages 18–65, European-American population); (iv) complex interactions remain an open problem. The approach simultaneously leverages decades of game industry investment in interaction-rich animations (combat, dialogue, tool use) that are difficult or unsafe to capture via traditional motion capture.
Loss & Training
This paper is a dataset and pipeline contribution and does not involve model training. Validation experiments use existing pretrained models applied directly to synthetic data for inference, evaluating data quality in a real-to-synthetic transfer setting.
Key Experimental Results
Main Results
Pretrained models (not fine-tuned on synthetic data) are evaluated in a real-to-synthetic transfer setting to verify data fidelity:
| Model | Task | AP | AP50 | AP75 | AR | MPJPE (mm) | PA-MPJPE (mm) |
|---|---|---|---|---|---|---|---|
| HRNet-W48 | Image→2D (Top-down) | 0.883 | 0.990 | 0.980 | 0.896 | — | — |
| DEKR-HRNet-W32 | Image→2D (Bottom-up) | 0.802 | 0.977 | 0.923 | 0.831 | — | — |
| PoseAug | 2D→3D Lifting | — | — | — | — | 61.81 | 57.28 |
| MeTRAbs | Image→3D | — | — | — | — | 104.16 | 111.41 |
| Mask2Former (Swin-L) | Instance Segmentation | avg IoU=0.89 | — | — | — | — | — |
Ablation Study
Per-joint MPJPE distribution for PoseAug (2D→3D lifting):
| Joint Region | Error Trend | Notes |
|---|---|---|
| Torso joints (neck, spine, hip) | Low error | Low articulation, geometrically stable |
| Distal joints (elbow, wrist, knee, ankle) | High error | High articulation, frequent occlusion, appearance variation |
| Pelvis | Highest raw error | Serves as alignment root; reflects residual global offset rather than local pose quality |
Per-joint analysis for MeTRAbs Image→3D:
| Joint Region | Error Trend | Notes |
|---|---|---|
| Core torso joints (hip, etc.) | Low error | Stable texture and clear shape; minimal cross-domain impact |
| Distal joints (neck, wrist, ankle) | High error | Viewpoint, occlusion, and rendering detail differences amplify cross-domain error |
Key Findings
- High 2D keypoint AP (HRNet 0.883 AP): COCO-pretrained models perform strongly, though not at saturation, on the synthetic data, validating annotation compatibility and image realism.
- Reasonable cross-domain 3D error: PoseAug's 61.8 mm MPJPE falls within the expected range for cross-domain studies, indicating strong geometric consistency between 2D and 3D annotations.
- Error patterns consistent with real data: The anatomically expected pattern of low torso error and high distal joint error is observed consistently on both synthetic and real datasets—an important indicator of data fidelity.
- High-quality instance segmentation: Mask2Former achieves 0.89 avg IoU; scene elements (sky, vases, trees) are also correctly annotated, validating the realism of MetaHuman rendering and environments.
- Annotation quality is maintained in multi-person scenes with occlusion and interaction.
Highlights & Insights
- Paradigm shift: Moving from "laboriously synthesizing human interactions" to "directly leveraging the rich animation assets already created by the game industry"—a highly pragmatic and insightful reframing.
- SMPL independence is a key selling point—eliminating systematic biases of parametric models (demographic skew, regressor dependency, fitting artifacts).
- Dual online/offline rendering modes: Beyond high-quality offline rendering, the ability to generate data in real time during gameplay opens the possibility of extracting training data directly from existing UE5 games.
- Occlusion-aware annotations: Per-joint visibility flags and occlusion-aware bounding boxes and masks are critical details for practical applications that are often neglected in existing synthetic datasets.
- The pipeline is highly customizable—supporting any UE-compatible skeleton, arbitrary joint subsets, and flexible camera configurations.
Limitations & Future Work
- No training experiments due to compute constraints: Only inference-based evaluation is conducted; the key experiment of training on UnrealPose-1M and transferring to real data remains unvalidated.
- Limited character diversity: Only 5 MetaHumans are used, despite MetaHuman Creator's capacity to generate thousands of characters.
- Static cameras: Current implementation uses fixed camera positions; moving cameras and dynamic intrinsics are not yet supported, though extension is straightforward.
- Manual integration: Integration into UE5 projects requires manual setup; the pipeline has not been packaged as a plug-and-play UE5 plugin.
- Online rendering unverified in real games: MRQ offline rendering has been validated, but the performance and quality of real-time data extraction from running games remains untested.
- Dataset scale limited by time and compute: Although the pipeline is theoretically unbounded in scale, the scaling laws of synthetic data remain an open question.
Rating
- Novelty: ⭐⭐⭐⭐ — The idea of replacing SMPL with game engine native skeletal joints is novel and pragmatic, though synthetic data pipelines themselves are not a new concept.
- Experimental Thoroughness: ⭐⭐⭐ — The experimental design for validating data quality is reasonable but insufficiently deep; the critical experiment of training on synthetic data and transferring to real data is absent.
- Writing Quality: ⭐⭐⭐⭐ — Motivation is articulated persuasively, the contrast with SMPL-based methods is clearly argued, and technical details are adequately covered.
- Value: ⭐⭐⭐⭐ — The open-source pipeline and dataset have tangible impact, providing the community with a new pathway to leverage game assets; follow-up work is needed to validate training effectiveness.