Lifelong Domain Adaptive 3D Human Pose Estimation

Conference: AAAI 2026 arXiv: 2512.23860 Code: davidpengucf/lifelongpose Area: Video Understanding Keywords: 3D human pose estimation, lifelong domain adaptation, catastrophic forgetting, GAN, diffusion model

TL;DR

This paper introduces a new task of lifelong domain adaptive 3D HPE, and proposes a GAN framework incorporating pose-aware, temporal-aware, and domain-aware encodings. A diffusion sampler is employed to generate domain-aware priors to mitigate catastrophic forgetting, achieving significant improvements over existing methods across multiple cross-scene/cross-dataset adaptation tasks.

Background & Motivation

The 2D-to-3D lifting paradigm for 3D Human Pose Estimation (3D HPE) relies on 3D annotations collected in controlled environments, and suffers from domain shift when generalizing to in-the-wild scenarios. Limitations of existing DA methods:

  • General DA: requires simultaneous access to source and target domain data
  • Source-free DA: assumes a static target domain distribution and permits joint training on all target data
  • Both paradigms ignore the non-stationary nature of target pose distributions in practice (e.g., shifts from pedestrian intent prediction in autonomous driving to in-vehicle safety monitoring)

Core motivation: This paper proposes lifelong domain adaptive 3D HPE—after source-domain pretraining, the model is sequentially adapted to multiple target domains, with access only to the current target domain data at each step, without revisiting the source domain or any previous target domains. The framework must simultaneously address current-domain adaptation and historical-domain knowledge retention.

Method

Overall Architecture

The framework comprises three core components: 3D pose generators, a 2D pose discriminator, and a 2D-to-3D lifting pose estimator, organized in a GAN structure to reduce domain shift.

3D Pose Generator

Given the 3D pose estimated in the current domain, three cascaded generators \(G = G_{BA} \circ G_{BL} \circ G_{RT}\) (bone angle / bone length / rotation-translation) produce augmented 3D poses with three types of encoding:

  1. Pose-aware encoding: In addition to joint coordinates and bone vectors, 6 body part segments (left/right arms, left/right legs, torso, and extended torso) are introduced to capture relationships between non-adjacent joints.
  2. Temporal-aware encoding: Multi-frame consecutive 3D poses are passed through a temporal weighted convolutional network to produce a weighted single-frame pose.
  3. Domain-aware encoding: A 2D pose diffusion sampler trained with DDIM samples from prior-domain 2D poses (using only \(T/10\) steps), generating domain-aware priors to replace random noise.
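The temporal-aware encoding can be sketched as a learned weighted average over a window of consecutive poses. The sketch below is a minimal stand-in: in the paper the weights come from a temporal convolutional network, whereas here they are passed in directly, and the function name and shapes are illustrative, not from the paper's code.

```python
import numpy as np

def temporal_weighted_pose(poses, weights):
    """Collapse a window of T consecutive 3D poses (T, J, 3) into a single
    weighted pose (J, 3). The paper produces the weighting with a temporal
    weighted convolutional network; here the raw weights are given and
    softmax-normalized so they sum to 1."""
    w = np.exp(weights) / np.exp(weights).sum()
    return np.tensordot(w, poses, axes=(0, 0))  # weighted sum over frames

# Example: a 5-frame window over a 16-joint skeleton
rng = np.random.default_rng(0)
window = rng.normal(size=(5, 16, 3))
fused = temporal_weighted_pose(window, np.array([0.1, 0.5, 1.0, 0.5, 0.1]))
print(fused.shape)  # (16, 3)
```

With equal weights this reduces to a plain temporal mean; the learned weights let the network emphasize the frames most informative for the current pose.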

Optimization

  • \(\mathcal{L}_{3D}\): MSE + feedback loss, constraining the similarity between augmented and predicted 3D poses
  • \(\mathcal{L}_{2D}\): MSE + normalized L1, preserving both scale and alignment direction
  • \(\mathcal{L}_{dis}\): Wasserstein GAN with gradient penalty, discriminating between original and augmented 2D poses
  • EMA: \(\mathcal{P}_{j+1} = \eta \mathcal{P}_j + (1-\eta)\hat{\mathcal{P}}_j\) (\(\eta=0.99\)), smoothly updating the pose estimator to alleviate forgetting
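The EMA rule above maps directly to code; here is a minimal sketch over flat parameter vectors (the paper applies it to the pose estimator's weights):

```python
import numpy as np

def ema_update(p_smooth, p_current, eta=0.99):
    """One EMA step: P_{j+1} = eta * P_j + (1 - eta) * P_hat_j.
    p_smooth is the slowly-moving copy that resists forgetting;
    p_current is the estimator after the latest adaptation step."""
    return eta * p_smooth + (1.0 - eta) * p_current

p = np.zeros(4)        # smoothed parameters P_j
p_hat = np.ones(4)     # freshly adapted parameters P_hat_j
for _ in range(100):   # repeated small steps drift p toward p_hat
    p = ema_update(p, p_hat)
print(p[0])  # 1 - 0.99**100 ≈ 0.634
```

With \(\eta = 0.99\), each adaptation step moves the smoothed estimator only 1% toward the new weights, so knowledge from earlier domains decays slowly rather than being overwritten.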

Key Experimental Results

Cross-Scene Adaptation H3.6M: S1→S5→S6→S7→S8 (MPJPE/PA-MPJPE, mm)

| Method | S5 | S6 | S7 | S8 | Avg |
| --- | --- | --- | --- | --- | --- |
| PoseDA-LL | 51.5/44.9 | 51.9/44.5 | 46.2/39.5 | 40.9/28.6 | 47.6/39.4 |
| Ours | 48.7/42.5 | 48.6/40.8 | 42.3/36.9 | 40.0/27.4 | 44.9/36.9 |

Cross-Dataset Adaptation H3.6M→3DHP (Average over 6 test sets)

| Method | Avg MPJPE/PA-MPJPE |
| --- | --- |
| PoseDA-LL | 80.7/54.5 |
| Ours | 75.3/50.7 |

Multi-Dataset Adaptation H3.6M→3DHP→3DPW (MPJPE/PA-MPJPE, mm)

| Method | 3DHP | 3DPW | Avg |
| --- | --- | --- | --- |
| PoseDA-LL | 88.9/62.1 | 87.6/49.4 | 88.3/55.8 |
| Ours | 75.3/51.1 | 81.7/45.6 | 78.5/48.4 |

Ablation studies confirm that domain-aware embedding (DE) is the most critical component (its removal causes an 8.2 mm MPJPE degradation on 3DHP), and EMA also plays a substantial role in mitigating forgetting (5.9 mm degradation upon removal).

Highlights & Insights

  • First to introduce lifelong DA into 3D HPE, formalizing the sequential adaptation problem under non-stationary target domains
  • Diffusion sampler as domain memory: DDIM is used to retain prior-domain pose distributions, avoiding GAN mode collapse while enabling efficient prior generation with only \(T/10\) sampling steps
  • Part-aware encoding: 6 body part segments make the pose representation more comprehensive by capturing relationships between non-adjacent joints
  • Consistently outperforms all 5 baselines across all 3 experimental settings
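The \(T/10\) sampling cost comes from the standard accelerated-DDIM trick of running the reverse process on a strided subset of the training timesteps. A sketch of that schedule selection, with \(T = 1000\) chosen for illustration (the paper does not specify \(T\)):

```python
import numpy as np

def ddim_timesteps(T, num_steps):
    """Pick num_steps evenly strided timesteps out of T diffusion steps,
    ordered from high noise to low, for accelerated DDIM sampling."""
    stride = T // num_steps
    return np.arange(T - 1, -1, -stride)[:num_steps]

T = 1000
steps = ddim_timesteps(T, T // 10)  # 100 of the 1000 training steps
print(len(steps), steps[0], steps[-1])  # 100 999 9
```

Because DDIM's reverse update is deterministic given the noise prediction, skipping timesteps this way trades a small amount of sample quality for a 10x reduction in sampler calls, which is what makes regenerating prior-domain pose priors cheap enough for lifelong adaptation.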

Limitations & Future Work

  • The diffusion sampler requires retraining/updating for each new domain, with potentially growing overhead as the number of domains increases
  • Experiments are limited to a 16-keypoint body model and an FC-based estimator (VideoPose3D); generalization to Transformer-based architectures remains unverified
  • The adaptation order across target domains is fixed; the effect of ordering on final performance is not discussed
  • Online/streaming settings are not explored; the current framework operates in an offline batch adaptation mode

Rating

  • Novelty: ⭐⭐⭐⭐ — First to define lifelong DA for 3D HPE; the use of a diffusion sampler as domain memory is a novel idea
  • Experimental Thoroughness: ⭐⭐⭐⭐ — Three adaptation settings, five baselines, and detailed ablations, though validation is limited to pose datasets
  • Writing Quality: ⭐⭐⭐⭐ — Problem definition is clear, method description is thorough, and figures are of high quality
  • Value: ⭐⭐⭐⭐ — Practically meaningful for continual adaptation in non-stationary environments; the framework demonstrates good extensibility