SAT-HMR: Real-Time Multi-Person 3D Mesh Estimation via Scale-Adaptive Tokens
Conference: CVPR 2025
arXiv: 2411.19824
Code: Project Page
Area: 3D Vision / Human Body Understanding
Keywords: Multi-Person 3D Mesh Estimation, Scale-Adaptive, Efficient ViT, DETR, Real-Time Inference
TL;DR
This paper proposes SAT-HMR, a DETR-based framework for real-time multi-person 3D human mesh estimation. It introduces scale-adaptive tokens: high-resolution tokens for small-scale humans, low-resolution tokens for large-scale humans, and pooled (compressed) tokens for the background. This raises inference speed to 24 FPS while keeping accuracy close to that of full high-resolution input.
Background & Motivation
- Challenges in Multi-Person 3D Mesh Estimation: Estimating SMPL parameters for all individuals from a single RGB image requires both local details (joint poses) and global context (relative positions, occlusion relations).
- Multi-Stage vs. One-Stage: Multi-stage methods perform detection first followed by person-by-person cropping and estimation, which achieves high accuracy but loses global context and struggles with occlusions. One-stage methods (e.g., ROMP, BEV) process the entire image based on CNNs, but low-resolution inputs limit their representation capability.
- The Cost of High Resolution: Recent DETR-style methods (e.g., AiOS, Multi-HMR) achieve SOTA performance with high-resolution inputs (e.g., 1288 pixels), but they run at under 5 FPS. Key observation: high resolution primarily benefits small-scale people (far from the camera, children, crouching poses). The error drops by 35 mm in the 0-10% scale range, while improvements in the 30%+ scale range are negligible.
- Core Idea: Allocating high-resolution tokens to large-scale persons (close to the camera, occupying a large image area) is computationally wasteful because they are already represented by a sufficient number of tokens. High-resolution computational resources should be concentrated on small-scale persons who truly need them.
- Compressible Background Regions: Background tokens provide useful contextual information (and should not be completely discarded) but can be further compressed via spatial pooling.
Method
Overall Architecture
SAT-HMR adopts a DETR-style pipeline: (1) extracting tokens from low-resolution images and encoding them with a shallow Transformer encoder \(\rightarrow\) (2) predicting a patch-level scale map \(\mathbf{S}\) using a scale head \(\rightarrow\) (3) classifying tokens into background, small-scale, and large-scale based on the scale map \(\rightarrow\) (4) replacing small-scale tokens with their high-resolution counterparts, pooling and compressing background tokens, and keeping large-scale tokens unchanged \(\rightarrow\) (5) concatenating these tokens to obtain the scale-adaptive tokens \(\mathcal{T}_{\text{SA}}\) \(\rightarrow\) (6) feeding them into a subsequent Transformer encoder, decoder, and prediction head to regress SMPL parameters.
Key Designs
Design 1: Patch-Level Scale Map Prediction
- Function: Determines whether each patch covers a person and predicts the relative scale of that person.
- Mechanism: The scale map \(\mathbf{S}(i,j) = (c, s)\) contains two values: \(c\) is the person confidence (0 indicates background), and \(s = \min(d_{\text{bb}} / S_{\text{hr}}, 1)\) is the person scale (the ratio of the bounding box diagonal to the maximum edge of the image). When a patch overlaps with multiple people, the scale of the closest person is taken. The map is predicted from low-resolution tokens via a scale head consisting of \(N_{\text{lr}}\) Transformer layers and an MLP.
- Design Motivation: The scale definition directly reflects the proportion of the image a person occupies, which is strongly correlated with whether that region requires high-resolution tokens. A lightweight prediction head adds minimal computational overhead.
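The scale-map definition above can be illustrated with a minimal sketch that builds a target map from bounding boxes. It is not the authors' code: the function name `build_scale_map`, the square-image simplification, the per-person `depths` input used to decide who is "closest", and the rule that any overlap marks a patch as a person patch are all assumptions made for illustration.

```python
import math


def build_scale_map(boxes, depths, image_size, grid):
    """Return a grid x grid nested list of (c, s) pairs.

    boxes:      list of (x0, y0, x1, y1) in pixels of a square image (assumption)
    depths:     hypothetical per-person distance to the camera, smaller = closer
    image_size: side length of the square image, standing in for S_hr
    grid:       number of patches per side
    """
    cell = image_size / grid
    scale_map = [[(0.0, 0.0)] * grid for _ in range(grid)]  # c = 0 means background
    best_depth = [[math.inf] * grid for _ in range(grid)]

    for (x0, y0, x1, y1), depth in zip(boxes, depths):
        # s = min(d_bb / S_hr, 1), with d_bb the bounding-box diagonal
        s = min(math.hypot(x1 - x0, y1 - y0) / image_size, 1.0)
        # Range of patches touched by the box, clipped to the grid
        j0, j1 = max(int(x0 // cell), 0), min(math.ceil(x1 / cell), grid)
        i0, i1 = max(int(y0 // cell), 0), min(math.ceil(y1 / cell), grid)
        for i in range(i0, i1):
            for j in range(j0, j1):
                # On overlap, keep the scale of the person closest to the camera
                if depth < best_depth[i][j]:
                    best_depth[i][j] = depth
                    scale_map[i][j] = (1.0, s)
    return scale_map


# Two overlapping people on an 8x8 image split into a 4x4 patch grid
demo_map = build_scale_map(
    boxes=[(0, 0, 4, 4), (2, 2, 8, 8)],
    depths=[5.0, 2.0],
    image_size=8,
    grid=4,
)
```

In this toy case the patch at row 1, column 1 is covered by both boxes and takes the scale of the second, nearer person, whose diagonal exceeds the image side and is therefore clipped to 1.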
Design 2: Scale-Adaptive Token Selection and Replacement
- Function: Dynamically adjusts the token resolution of each region.
- Mechanism: The tokens are divided into three groups based on thresholds \(\alpha_c\) (confidence) and \(\alpha_s\) (scale): (1) Background \(\mathcal{T}_B\): every 4 adjacent tokens are pooled into 1 to obtain \(\mathcal{T}_B'\); (2) Small-scale \(\mathcal{T}_{\text{SMALL}}\): pruned and replaced by the corresponding tokens \(\mathcal{T}_{\text{HR}}\) from the high-resolution image, where \(k_{\text{hr}} = 4 k_{\text{small}}\); (3) Large-scale \(\mathcal{T}_{\text{LARGE}}\): kept at low resolution. Finally, they are concatenated as \(\mathcal{T}_{\text{SA}} = \{\mathcal{T}_B', \mathcal{T}_{\text{LARGE}}, \mathcal{T}_{\text{HR}}\}\).
- Design Motivation: Small-scale persons are poorly represented at low resolution, and they account for most of the gain from high-resolution input. Replacing each of their tokens with four high-resolution tokens compensates for this deficiency. Large-scale persons are already covered by enough tokens. The background is pooled rather than discarded to preserve contextual information.
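The grouping and the resulting token count can be sketched as follows. This is an illustrative sketch, not the released implementation: the function names, the strict `<` comparisons against the thresholds, the threshold values in the demo, and rounding the pooled background count up are all assumptions.

```python
import math


def partition_patches(scale_map, alpha_c, alpha_s):
    """Split patch indices into background, small-scale and large-scale groups."""
    background, small, large = [], [], []
    for i, row in enumerate(scale_map):
        for j, (c, s) in enumerate(row):
            if c < alpha_c:        # low person confidence: background
                background.append((i, j))
            elif s < alpha_s:      # person patch with small scale
                small.append((i, j))
            else:                  # person patch with large scale
                large.append((i, j))
    return background, small, large


def high_res_children(i, j):
    """The 2x2 high-resolution patches covering low-resolution patch (i, j),
    assuming the high-resolution image has twice the side length."""
    return [(2 * i + di, 2 * j + dj) for di in (0, 1) for dj in (0, 1)]


def token_budget(n_background, n_small, n_large):
    """Size of T_SA: pooled background + large-scale + 4 tokens per small patch."""
    pooled = math.ceil(n_background / 4)  # every 4 background tokens become 1
    return pooled + n_large + 4 * n_small


# Toy 2x2 scale map with hypothetical thresholds alpha_c = alpha_s = 0.5
toy_map = [[(0.0, 0.0), (0.9, 0.2)],
           [(0.9, 0.8), (0.1, 0.5)]]
bg, small, large = partition_patches(toy_map, alpha_c=0.5, alpha_s=0.5)

# Hypothetical split of a 46x46 low-resolution grid (2116 patches)
budget = token_budget(n_background=1716, n_small=100, n_large=300)
```

With these made-up counts the encoder would see 1129 tokens, compared with 92 x 92 = 8464 tokens if every patch were taken from a 2x-per-side high-resolution grid. The real ratio depends on the image content.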
Design 3: Dual-Resolution Encoder Alignment
- Function: Ensures that low-resolution and high-resolution tokens can be concatenated within the same feature space.
- Mechanism: The low-resolution and high-resolution branches are processed by their respective shallow Transformer encoders (\(N_{\text{lr}} = N_{\text{hr}} = 3\) layers), which share the same pretrained DINOv2 weights. After independent encoding in both branches, feature space alignment is performed at the concatenation point, followed by processing through a unified Transformer encoder with \(N_{\text{sa}} = 9\) layers.
- Design Motivation: Having the same number of layers ensures consistent feature abstraction levels, so the subsequent unified encoder can process the mixed-resolution tokens together.
Loss & Training
The total loss is a weighted sum of multiple loss terms: \(\mathcal{L} = \lambda_{\text{map}} \mathcal{L}_{\text{map}} + \lambda_{\text{depth}} \mathcal{L}_{\text{depth}} + \lambda_{\text{pose}} \mathcal{L}_{\text{pose}} + \lambda_{\text{shape}} \mathcal{L}_{\text{shape}} + \lambda_{\text{j3d}} \mathcal{L}_{\text{j3d}} + \lambda_{\text{j2d}} \mathcal{L}_{\text{j2d}} + \lambda_{\text{box}} \mathcal{L}_{\text{box}} + \lambda_{\text{det}} \mathcal{L}_{\text{det}}\). Specifically, \(\mathcal{L}_{\text{map}}\) consists of a focal loss and an L1 loss for the scale map, \(\mathcal{L}_{\text{depth}}\) is the normalized depth L1 loss, \(\mathcal{L}_{\text{det}}\) represents the detection focal loss, and the rest employ L1 distance.
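As a small illustration of the weighted sum above, the following sketch combines per-term loss values with per-term weights. The helper name `total_loss` and all numeric values are hypothetical; the summary does not state the actual weights.

```python
TERMS = ("map", "depth", "pose", "shape", "j3d", "j2d", "box", "det")


def total_loss(losses, weights):
    """Weighted sum over the eight loss terms; both dicts must cover every term."""
    missing = [t for t in TERMS if t not in losses or t not in weights]
    if missing:
        raise KeyError(f"missing loss terms or weights: {missing}")
    return sum(weights[t] * losses[t] for t in TERMS)


# Hypothetical values chosen only to make the arithmetic easy to follow
example_losses = {t: 0.5 for t in TERMS}
example_weights = {t: 1.0 for t in TERMS}
example_weights["map"] = 2.0

example_total = total_loss(example_losses, example_weights)  # 2*0.5 + 7*0.5
```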
Key Experimental Results
Main Results: AGORA Test Set
| Method | Resolution | Time (ms) | MACs (G) | F1 ↑ | MPJPE ↓ | MVE ↓ |
|---|---|---|---|---|---|---|
| ROMP | 512 | 38.7 | 43.6 | 0.91 | 108.1 | 103.4 |
| BEV | 512 | 50.6 | 48.9 | 0.93 | 105.3 | 100.7 |
| AiOS | 1333 | 405.2 | 314.5 | 0.94 | 63.9 | 57.5 |
| Multi-HMR | 1288 | 231.7 | 6104.6 | 0.95 | 65.3 | 61.1 |
| SAT-HMR | 644* | 42.0 | 133.1 | 0.95 | 67.9 | 63.3 |
Generalization on Other Datasets
| Method | 3DPW PA-MPJPE ↓ | MuPoTS PCK All ↑ | CMU Panoptic Avg ↓ |
|---|---|---|---|
| Multi-HMR | 41.7 | 85.0 | - |
| BEV | 46.9 | 70.2 | 109.5 |
| SAT-HMR | 41.6 | 89.0 | 84.2 |
Ablation Study: Background Token Strategy
| Strategy | 0-20% MVE | 80%+ MVE | Avg MVE |
|---|---|---|---|
| Discard All | 59.9 | 70.8 | 57.2 |
| No Pooling | 60.3 | 64.1 | 56.1 |
| Pooling ×2 | 60.7 | 66.5 | 56.3 |
| Pooling ×1 (Ours) | 60.0 | 62.7 | 56.0 |
Key Findings
- SAT-HMR runs at about 24 FPS (42.0 ms) with accuracy comparable to Multi-HMR (about 4 FPS, 231.7 ms), a speedup of approximately 5.5x.
- MACs are reduced from 6104.6G (Multi-HMR) to 133.1G, a decrease of 97.8%.
- Completely discarding background tokens increases the error for large-scale persons from 62.7 to 70.8, demonstrating the importance of background context.
- On the CMU Panoptic dataset, the error drops from 109.5 (BEV) to 84.2 (a 23.1% improvement), indicating strong generalization to this dataset.
Highlights & Insights
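The percentages and ratios quoted in these findings follow directly from the table values. A short check, using only numbers from the tables above (the helper names are illustrative):

```python
def fps(latency_ms):
    """Frames per second implied by a per-image latency in milliseconds."""
    return 1000.0 / latency_ms


def relative_drop(before, after):
    """Fractional reduction going from `before` to `after`."""
    return (before - after) / before


sat_ms, multi_ms = 42.0, 231.7          # AGORA table, Time (ms)
sat_macs, multi_macs = 133.1, 6104.6    # AGORA table, MACs (G)
bev_cmu, sat_cmu = 109.5, 84.2          # CMU Panoptic average error

summary = {
    "sat_fps": round(fps(sat_ms), 1),                                       # 23.8
    "multi_fps": round(fps(multi_ms), 1),                                   # 4.3
    "speedup": round(multi_ms / sat_ms, 1),                                 # 5.5
    "macs_reduction_pct": round(100 * relative_drop(multi_macs, sat_macs), 1),  # 97.8
    "cmu_error_drop_pct": round(100 * relative_drop(bev_cmu, sat_cmu), 1),      # 23.1
}
```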
- Precise Insights: The empirical observation that "high resolution primarily benefits small-scale persons" directly guides the methodology design.
- Intelligent Allocation of Computational Resources: Precious high-resolution tokens are concentrated in the areas where they are most needed, while background tokens are compressed instead of discarded.
- First Real-Time SOTA: Runs in real time (24 FPS) on AGORA with accuracy comparable to the much slower high-resolution SOTA methods, which gives it high practical value.
- The method design is simple; instead of introducing complex new modules, it achieves its effectiveness through token-level reallocation.
Limitations & Future Work
- The definition of scale does not consider human height information (e.g., the bounding boxes of a crouching adult and a standing child can be similar), which may lead to depth estimation biases.
- Currently, it only estimates the SMPL body mesh, without extending to the SMPL-X whole-body mesh (including hands and face).
- The scale threshold \(\alpha_s\) is a fixed hyperparameter, which may require adaptive adjustment under highly varying scenes.
- It requires processing both high-resolution and low-resolution images, leaving room for further optimization in memory consumption.
Related Work & Insights
- Compared to efficient ViT methods such as FlexiViT/TORE, the token reallocation strategy in SAT-HMR is more task-adaptive.
- The concept of scale-adaptation can be transferred to other dense prediction tasks (e.g., panoptic segmentation, depth estimation).
- The design of pooling the background rather than discarding it serves as a valuable reference for detection and estimation tasks requiring global context.
Rating
⭐⭐⭐⭐ — Accurate problem insights, simple method design, and thorough experimentation achieve outstanding real-time SOTA results. It provides a highly valuable reference for applying efficient Vision Transformers to human understanding tasks.