
RAM: Recover Any 3D Human Motion in-the-Wild

Conference: CVPR 2026
arXiv: 2603.19929
Code: N/A
Area: Human Understanding / 3D Human Motion Recovery
Keywords: Multi-person 3D motion recovery, zero-shot tracking, SAM2, temporal human mesh recovery, motion prediction

TL;DR

RAM proposes a unified multi-person 3D motion recovery framework integrating a motion-aware semantic tracker SegFollow (built on SAM2 with adaptive Kalman filtering), a memory-augmented temporal human mesh recovery module T-HMR, a lightweight motion predictor, and a gated combiner. It achieves state-of-the-art zero-shot tracking stability and 3D accuracy on benchmarks including PoseTrack and 3DPW, while running 2–3× faster than prior methods.

Background & Motivation

  1. Background: Monocular video-based multi-person 3D motion recovery is an active research area, with representative methods including 4DHuman (HMR2.0 + PHALP tracking) and CoMotion (end-to-end joint optimization).
  2. Limitations of Prior Work: (1) Existing tracking methods rely on 2D appearance features and Hungarian matching, making them sensitive to fast motion, heavy occlusion, and viewpoint changes, leading to frequent ID switches; (2) once identity continuity is broken, 3D motion sequences become inconsistent; (3) the absence of memory-based motion priors during occlusion or fast motion results in discontinuous reconstruction.
  3. Key Challenge: Unstable tracking triggers redundant detections and repeated re-initialization, degrading reconstruction accuracy and hindering real-time performance.
  4. Goal: To build a real-time, robust multi-person 3D motion recovery system.
  5. Key Insight: Combining SAM2's strong segmentation capability with motion priors, using Kalman filtering to provide motion-aware identity association.
  6. Core Idea: SegFollow provides stable tracking → T-HMR leverages temporal memory to improve reconstruction consistency → Predictor estimates poses during occlusion → Combiner fuses reconstruction and prediction outputs.

Method

Overall Architecture

RAM consists of four components operating in sequence: SegFollow performs motion-aware tracking per frame → T-HMR reconstructs 3D meshes from tracked instances → Predictor forecasts future poses from historical motion → Combiner gated-fuses reconstruction and prediction to output final SMPL parameters.
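The sequential dataflow above can be sketched as a minimal per-frame loop. The class and stage interfaces below are illustrative assumptions for exposition, not the paper's actual API:

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class RAMPipeline:
    """Per-frame dataflow: track -> recover -> predict -> combine."""
    track: Callable     # SegFollow: frame -> tracked instances
    recover: Callable   # T-HMR: instances -> reconstructed SMPL params
    predict: Callable   # Predictor: history -> forecast SMPL params
    combine: Callable   # Combiner: (reconstruction, forecast) -> final params
    history: List = field(default_factory=list)

    def step(self, frame):
        instances = self.track(frame)
        recon = self.recover(instances)
        forecast = self.predict(self.history)
        final = self.combine(recon, forecast)
        self.history.append(final)   # final output feeds the next forecast
        return final
```

Because each stage is a pluggable callable, any component can be ablated or swapped independently, which matches the modular design the paper contrasts with end-to-end alternatives.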

Key Designs

  1. SegFollow: Motion-Aware Tracking

    • Function: Enables robust zero-shot identity tracking built upon SAM2.
    • Mechanism: Two components are introduced on top of SAM2: (a) a motion-guided selector that uses Kalman filtering to predict target bounding boxes, computes an IoU-based motion consistency score \(s_{\text{kf}}\), and fuses it with the SAM2 mask affinity score \(s_{\text{mask}}\) via gating as \(s_{\text{fused}} = \alpha s_{\text{mask}} + (1-\alpha) s_{\text{kf}}\); (b) a temporal buffer that replaces SAM2's FIFO memory update with an exponential moving average whose decay factor is adaptively modulated by the Kalman consistency score. A confidence-gated update is also introduced: Kalman states are updated only after the number of consecutive reliable associations exceeds the threshold \(\tau_{\text{kf}}\).
    • Design Motivation: SAM2's FIFO memory lacks temporal reliability modeling and is prone to noise accumulation under occlusion and fast motion.
  2. T-HMR: Temporal Human Mesh Recovery

    • Function: Leverages temporal context to improve consistency and robustness of 3D reconstruction.
    • Mechanism: Comprises two components — a Memory Cache and a MemFormer. The Memory Cache selects the top-k most relevant frames from ViT features of \(L\) neighboring frames using a dual-branch attention scoring mechanism: one branch computes correlation between the current frame and memory frames, while the other assesses internal consistency among memory frames. The MemFormer injects the selected temporal priors into the reconstruction process.
    • Design Motivation: Single-frame reconstruction lacks temporal consistency and relies heavily on historical frame priors during occlusion.
  3. Predictor + Combiner

    • Function: The Predictor forecasts future poses to fill gaps during occlusion; the Combiner adaptively fuses reconstruction and prediction.
    • Mechanism: The Predictor performs lightweight prediction based on motion patterns derived from historical SMPL parameters. The Combiner employs a learnable gating mechanism to adaptively decide whether to trust reconstruction or prediction based on the reliability of current observations.
    • Design Motivation: Reconstruction is unreliable during occlusion and requires prediction to maintain continuity; once occlusion ends, the system should revert to reconstruction.
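The SegFollow score fusion and confidence-gated update (design 1 above) can be sketched as follows. The value of \(\alpha\), the reliability threshold, and the EMA weighting schedule are illustrative assumptions, not the paper's values:

```python
import numpy as np

def iou(box_a, box_b):
    """IoU of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-8)

def fused_score(s_mask, kf_box, det_box, alpha=0.7):
    """s_fused = alpha * s_mask + (1 - alpha) * s_kf, where s_kf is the
    IoU between the Kalman-predicted box and a candidate detection."""
    s_kf = iou(kf_box, det_box)
    return alpha * s_mask + (1 - alpha) * s_kf

class GatedTrack:
    """EMA memory with confidence-adaptive weighting, plus a gate that
    releases Kalman state updates only after tau_kf reliable associations."""
    def __init__(self, tau_kf=3):
        self.tau_kf = tau_kf
        self.streak = 0       # consecutive reliable associations
        self.memory = None    # EMA-aggregated feature memory

    def update(self, feat, s_fused, reliable_thresh=0.5):
        self.streak = self.streak + 1 if s_fused > reliable_thresh else 0
        w_new = 0.2 + 0.6 * s_fused   # consistent matches update memory faster
        self.memory = feat if self.memory is None else (1 - w_new) * self.memory + w_new * feat
        return self.streak >= self.tau_kf  # True: safe to update the Kalman state
```

The gate keeps one spurious high-affinity detection from immediately corrupting the motion state, which is the failure mode the design motivation attributes to SAM2's FIFO memory.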
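A toy version of T-HMR's dual-branch frame selection (design 2): one branch scores each memory frame's correlation with the current frame, the other its internal consistency with the remaining memory frames. Cosine similarity as the scoring function and equal branch weights are assumptions made for this sketch:

```python
import numpy as np

def select_memory_frames(cur_feat, mem_feats, k=2):
    """Pick the top-k memory frames by a dual-branch score:
    branch 1 = cosine relevance to the current frame,
    branch 2 = mean cosine consistency with the other memory frames."""
    mem = np.stack(mem_feats)                                   # (L, D) frame features
    mem_n = mem / np.linalg.norm(mem, axis=1, keepdims=True)
    cur_n = cur_feat / np.linalg.norm(cur_feat)
    relevance = mem_n @ cur_n                                   # branch 1
    sim = mem_n @ mem_n.T                                       # pairwise cosine
    consistency = (sim.sum(axis=1) - 1.0) / (len(mem) - 1)      # branch 2, excl. self
    score = 0.5 * relevance + 0.5 * consistency                 # equal weights (assumed)
    return np.argsort(score)[::-1][:k], score
```

A frame that merely resembles the current (possibly occluded) view but contradicts the rest of the memory is down-weighted by the consistency branch, mirroring the paper's finding that both branches are indispensable.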
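The Combiner's gating (design 3) reduces to a confidence-driven convex combination of the two SMPL parameter estimates. The sigmoid gate and its weights below stand in for the learned mechanism:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def combine(theta_rec, theta_pred, obs_conf, w=6.0, b=-3.0):
    """Convex combination of reconstructed and predicted SMPL parameters.
    obs_conf in [0, 1] summarizes observation reliability (e.g. visibility);
    w and b stand in for the learned gating weights."""
    g = sigmoid(w * obs_conf + b)            # g -> 1 when the observation is reliable
    return g * theta_rec + (1 - g) * theta_pred
```

As visibility returns after an occlusion, `obs_conf` rises and the gate smoothly hands control back to the reconstruction branch, matching the stated design motivation.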

Loss & Training

T-HMR is trained with SMPL parameter regression losses covering joint positions, pose parameters, and shape parameters. The Predictor and Combiner are trained end-to-end.
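Such a regression objective can be sketched as a weighted sum over the three SMPL terms. The loss weights, tensor shapes, and the L2/MSE choices are illustrative, not taken from the paper:

```python
import numpy as np

def thmr_loss(pred, gt, w_j3d=1.0, w_pose=1.0, w_shape=0.1):
    """Weighted sum of 3D joint, pose-parameter, and shape-parameter errors.
    pred/gt: dicts with 'joints3d' (J, 3), 'pose' (72,), 'shape' (10,)."""
    l_j3d = np.mean(np.linalg.norm(pred["joints3d"] - gt["joints3d"], axis=-1))
    l_pose = np.mean((pred["pose"] - gt["pose"]) ** 2)
    l_shape = np.mean((pred["shape"] - gt["shape"]) ** 2)
    return w_j3d * l_j3d + w_pose * l_pose + w_shape * l_shape
```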

Key Experimental Results

Main Results

| Dataset | Metric | 4DHuman | CoMotion | RAM | Gain |
|---|---|---|---|---|---|
| PoseTrack | MOTA↑ | 68.2 | 71.5 | 76.3 | +4.8 |
| PoseTrack | IDF1↑ | 72.1 | 74.8 | 80.5 | +5.7 |
| 3DPW | MPJPE↓ | 78.5 | 72.3 | 65.8 | -6.5 |
| 3DPW | PA-MPJPE↓ | 49.2 | 45.1 | 41.3 | -3.8 |

RAM substantially outperforms prior methods in both tracking stability (MOTA/IDF1) and 3D accuracy (MPJPE/PA-MPJPE).

Ablation Study

| Configuration | MOTA (PoseTrack) | MPJPE (3DPW) | FPS | Notes |
|---|---|---|---|---|
| Full RAM | 76.3 | 65.8 | 25+ | Complete model |
| w/o SegFollow (PHALP) | 71.1 | 70.2 | 15 | SegFollow is critical |
| w/o T-HMR memory | 74.8 | 69.5 | 25+ | Temporal memory improves consistency |
| w/o Predictor | 75.0 | 67.3 | 25+ | Predictor improves occlusion handling |

Key Findings

  • SegFollow contributes the largest gain (MOTA +5.2), indicating that stable tracking is the primary bottleneck in multi-person motion recovery.
  • RAM runs 2–3× faster than 4DHuman because stable tracking reduces redundant detections and repeated re-initialization.
  • ID switches are extremely rare in long in-the-wild videos, marking the first demonstration of stable zero-shot multi-person motion recovery.
  • The dual-branch scoring in T-HMR outperforms single-branch alternatives; both correlation and consistency are indispensable.

Highlights & Insights

  • SAM2 + Kalman Filtering: The combination of a visual foundation model's segmentation capability with classical motion modeling is mutually complementary.
  • Confidence-Gated Update: Preventing unreliable detections from corrupting motion states is a practically effective engineering design.
  • Zero-Shot Long-Video Capability: RAM is the first method to maintain stable multi-person 3D reconstruction in long in-the-wild videos without retraining.

Limitations & Future Work

  • Still relies on a detector to provide initial bounding boxes.
  • The predictor may drift under extreme occlusion (complete invisibility lasting tens of frames or more).
  • The SMPL body model limits recovery of hand and facial details.
  • Future work may extend to fine-grained hand and face reconstruction (SMPL-X).

Comparison with Prior Work

  • vs. 4DHuman: 4DHuman employs PHALP tracking; RAM's SegFollow substantially improves tracking stability and inference speed.
  • vs. CoMotion: CoMotion performs end-to-end joint optimization but is slow; RAM's modular design is faster and more flexible.
  • vs. SAM2: SAM2's FIFO memory is insufficiently robust for multi-object tracking scenarios; SegFollow addresses this limitation by incorporating motion priors.

Rating

  • Novelty: ⭐⭐⭐⭐ — An elegant integration of existing components into a coherent framework.
  • Experimental Thoroughness: ⭐⭐⭐⭐ — Multi-benchmark evaluation with comprehensive ablation studies.
  • Writing Quality: ⭐⭐⭐⭐ — Clear structure with detailed method descriptions.
  • Value: ⭐⭐⭐⭐⭐ — Addresses a practical bottleneck in multi-person 3D motion recovery.