PHAC: Promptable Human Amodal Completion

Conference: CVPR 2026 · arXiv: 2603.14741 · Code: None · Area: Object Detection · Keywords: human amodal completion, diffusion model, ControlNet, pose-guided generation, image inpainting

TL;DR

This paper introduces Promptable Human Amodal Completion (PHAC), a new task in which point-based user prompts (pose and/or bounding box) guide the completion of occluded human images. Dedicated ControlNet modules inject the prompt signals into a diffusion model, and an inpainting-based refinement module preserves the appearance of visible regions, yielding high-quality, controllable completions.

Background & Motivation

Limitations of existing HAC methods: Existing Human Amodal Completion (HAC) methods can only hallucinate invisible regions from occluded images and accept no user-specified constraints (e.g., target pose or spatial extent), forcing users to sample repeatedly until a satisfactory result appears.

Limitations of PGPIS methods: Pose-Guided Person Image Synthesis (PGPIS) supports pose conditioning but struggles to preserve instance-specific visible appearance, tending to generate content biased toward the training distribution (e.g., the garment characteristics of the DeepFashion dataset).

Visible Appearance Degradation: Latent-space denoising and VAE reconstruction discard fine-grained visible details; existing decoder fine-tuning approaches introduce blurring and boundary artifacts, while UV-coordinate schemes lose details under coordinate noise.

Lack of Multi-Type Prompt Support: Existing methods do not support multiple types of user prompts (pose + bounding box), limiting flexible trade-offs between performance and interaction cost.

Weak Zero-Shot Generalization: PGPIS baselines (e.g., PIDM, MCLD) frequently exhibit severe appearance hallucinations on real-world images outside the training distribution, with extreme identity failures such as rendering an elderly man as a white woman.

Boundary Artifacts: Directly compositing visible and generated regions introduces conspicuous artifacts at mask boundaries due to the absence of smooth transition mechanisms.

Method

Overall Architecture

The PHAC framework consists of two stages: (A) coarse image generation, followed by (B) invisible-mask prediction and (C) inpainting-based refinement.

  • Input: Occluded human image \(I_{ic}\) + user prompt \(P\) (five supported types: pose \(p_{po}\), interest-region bbox \(p_{ib}\), entire-region bbox \(p_{eb}\), pose + interest bbox \(p_{poib}\), pose + entire bbox \(p_{poeb}\))
  • Coarse Generation: User prompts are rendered into a prompt image \(I_p\), encoded by a dedicated ControlNet \(\Phi_{CN}\) into a conditional signal \(c_{pr}\), and injected into the denoising U-Net \(\epsilon_{cig}\), which denoises for \(T\) steps from random noise to produce the coarse completion \(I_{cc}\)
  • Refinement Stage: A lightweight U-Net \(\mathcal{U}_{iv}\) predicts the invisible-region mask \(M_{iv}\) and dilates it; a baseline composite image is constructed, and low-magnitude noise is injected only into the masked region; the refinement network \(\Phi_{RF}\) performs a small number of denoising steps to produce the final output \(I_{rc}\) (see the pipeline sketch below)
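
To make the data flow concrete, here is a minimal end-to-end sketch; coarse_gen, mask_unet, and refiner are hypothetical wrappers around the paper's components, and their methods are illustrative stand-ins, not the authors' API:

```python
import numpy as np
import cv2  # used only for mask dilation


def phac_pipeline(I_ic, M_v, I_p, coarse_gen, mask_unet, refiner,
                  T=50, s=0.5, dilate_px=5):
    """Illustrative PHAC flow. I_ic: occluded image (H, W, 3), M_v: SAM-predicted
    visible mask (H, W), I_p: rendered prompt image; all interfaces hypothetical."""
    # Stage A: encode the rendered prompt with the dedicated ControlNet and
    # run T denoising steps from random noise to get the coarse completion.
    c_pr = coarse_gen.controlnet(I_p)
    I_cc = coarse_gen.denoise(I_ic, c_pr, steps=T)

    # Stage B: predict the invisible-region mask and dilate it so that no
    # boundary pixels are missed in the composite.
    M_iv = mask_unet(I_ic, I_cc, M_v)
    kernel = np.ones((dilate_px, dilate_px), np.uint8)
    M_iv = cv2.dilate(M_iv.astype(np.uint8), kernel)

    # Stage C: composite visible pixels with the coarse completion, inject
    # low-magnitude noise (s = 0.5) only inside M_iv, and run a short
    # denoising schedule (~40% of T) to remove boundary artifacts.
    composite = I_ic * (1 - M_iv[..., None]) + I_cc * M_iv[..., None]
    return refiner.inpaint(composite, mask=M_iv, noise_strength=s,
                           steps=int(0.4 * T))
```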

Key Designs

  1. Multi-Type Point Prompts: Users need only provide a small number of points: pose prompts specify extra joint coordinates, while bbox prompts specify two corner points. Prompts are rendered as images and fed into the corresponding ControlNet module.
  2. Fine-Tuning Cross-Attention Blocks Only: To preserve the generative prior of the pretrained diffusion model, only the cross-attention blocks of the denoising U-Net are fine-tuned while all remaining parameters stay frozen, balancing prompt alignment with generation quality (see the sketch after this list).
  3. SAM-Based Visible Mask Prediction: Rather than relying on ground-truth visible region masks, SAM is used to automatically predict \(M_v\) from the input image, improving practical applicability.
  4. Invisible Mask Prediction and Dilation: The lightweight U-Net takes \(I_{ic}\), \(I_{cc}\), and \(M_v\) as inputs to predict the invisible mask \(M_{iv}\), which is then dilated to expand the masked region and avoid missing boundary pixels.
  5. Inpainting-Based Refinement: Instead of regenerating RGB values within the masked region from scratch, the method injects low-magnitude noise (\(s=0.5\)) into the coarse completion and performs approximately 40% of the standard denoising steps (20 steps), preserving the visible region and eliminating boundary artifacts.
  6. Plug-and-Play Refinement Module: The refinement network can be directly applied to outputs from other diffusion models as a general post-processing component to boost performance.
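
A minimal sketch of the cross-attention-only fine-tuning in design 2, assuming a diffusers-style U-Net in which cross-attention modules are named attn2 (self-attention is attn1); both the backbone checkpoint and the naming convention are assumptions, as this summary does not state the exact backbone:

```python
import torch
from diffusers import UNet2DConditionModel

# Load a pretrained denoising U-Net (SDXL base shown as an example).
unet = UNet2DConditionModel.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", subfolder="unet"
)

# Freeze everything, then unfreeze only cross-attention parameters. In
# diffusers U-Nets, cross-attention modules are conventionally named
# "attn2"; this naming is an assumption about the backbone.
unet.requires_grad_(False)
trainable = []
for name, param in unet.named_parameters():
    if ".attn2." in name:
        param.requires_grad = True
        trainable.append(param)

# Learning rate taken from the training setup reported below.
optimizer = torch.optim.AdamW(trainable, lr=5e-6)
```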

Loss & Training

  • Coarse Generation Loss: Standard diffusion denoising objective \(\mathcal{L} = \mathbb{E}[\|\epsilon - \epsilon_{cig}(z_t, t, c_{te}, c_{pr})\|_2^2]\), where \(c_{te}\) is the text embedding and \(c_{pr}\) the ControlNet prompt signal
  • Mask Prediction Loss: Weighted combination of BCE and Dice losses, \(\mathcal{L} = \mathcal{L}_{BCE} + 0.5 \cdot \mathcal{L}_{Dice}\) (a PyTorch sketch follows this list)
  • Refinement Network: Directly employs the pretrained SDXL Inpainting model without additional training
  • Training Setup: Diffusion model learning rate \(5 \times 10^{-6}\), ControlNet learning rate \(5 \times 10^{-5}\), batch size 14, trained for 1750 epochs (4×A6000, ~16 hours); mask U-Net trained for 40 epochs (1×RTX 3090, ~30 minutes)
  • Stochastic Supervision: The diffusion model generates \(N=16\) coarse outputs per training image, and the mask U-Net randomly samples one of them as input during training
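
A PyTorch sketch of the mask prediction loss, assuming logits from the lightweight U-Net and a binary ground-truth mask; the soft Dice formulation is a standard implementation choice, not taken from the paper:

```python
import torch
import torch.nn.functional as F


def mask_loss(logits: torch.Tensor, target: torch.Tensor,
              dice_weight: float = 0.5, eps: float = 1e-6) -> torch.Tensor:
    """BCE + 0.5 * Dice over predicted invisible-region masks.

    logits: (B, 1, H, W) raw outputs of the mask U-Net.
    target: (B, 1, H, W) binary ground-truth invisible masks.
    """
    bce = F.binary_cross_entropy_with_logits(logits, target)

    # Soft Dice: 1 - 2|P∩G| / (|P| + |G|), computed per sample.
    probs = torch.sigmoid(logits)
    inter = (probs * target).sum(dim=(1, 2, 3))
    denom = probs.sum(dim=(1, 2, 3)) + target.sum(dim=(1, 2, 3))
    dice = 1.0 - (2.0 * inter + eps) / (denom + eps)

    return bce + dice_weight * dice.mean()
```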

Key Experimental Results

Main Results

Datasets: OccThuman2.0 (synthetic, 5,260 images) + AHP (real-world, 56 images)

Method        Prompt Type   LPIPS*↓   SSIM↑   KID*↓   PSNR↑   Joint Err.↓
PIDM          2D pose        126.33   0.797   56.91   16.80        113.72
MCLD          UV map         115.90   0.833   41.11   18.37         53.38
pix2gestalt   -               90.11   0.911   16.51   22.63         36.65
SDHDO         2D pose         81.39   0.924   16.41   23.80         43.49
Ours          2D pose         49.47   0.948    6.12   25.86         23.33

Results on OccThuman2.0; asterisked metrics (LPIPS*, KID*) are scaled by ×10³

Method   LPIPS*↓   SSIM↑   KID*↓   PSNR↑   Joint Err.↓
SDHDO      64.19   0.956    6.05   24.45          9.24
Ours       38.77   0.970    1.25   26.93          6.37

Results on the AHP real-world dataset

Ablation Study

Comparison of Different Prompt Types (OccThuman2.0):

Prompt Type                         LPIPS*↓   SSIM↑   PSNR↑   Joint Err.↓
Pose \(p_{po}\)                       49.47   0.948   25.86         23.33
Interest bbox \(p_{ib}\)              51.83   0.942   24.99         24.01
Entire bbox \(p_{eb}\)                52.28   0.941   25.07         28.23
Pose + interest bbox \(p_{poib}\)     49.35   0.947   25.69         22.15
Pose + entire bbox \(p_{poeb}\)       49.42   0.946   25.49         21.96

Noise Strength Ablation (\(s\) parameter): \(s=0.5\) achieves the best performance on both OccThuman2.0 and AHP; values that are too small (0.1) yield insufficient denoising, while values that are too large (0.9) degrade the original appearance.
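
A minimal sketch of this refinement step using the pretrained SDXL Inpainting model via diffusers. Note that diffusers' strength parameter couples the injected-noise level with the number of executed steps, whereas the paper reports \(s = 0.5\) and the ~40% step fraction separately, so the mapping here is an approximation; composite_image and dilated_mask are assumed PIL inputs:

```python
import torch
from diffusers import StableDiffusionXLInpaintPipeline

# Pretrained SDXL Inpainting, used without additional training.
pipe = StableDiffusionXLInpaintPipeline.from_pretrained(
    "diffusers/stable-diffusion-xl-1.0-inpainting-0.1",
    torch_dtype=torch.float16,
).to("cuda")

# composite_image: visible pixels plus the coarse completion;
# dilated_mask: the dilated invisible-region mask M_iv.
refined = pipe(
    prompt="",                 # no text prompt is mentioned for refinement
    image=composite_image,
    mask_image=dilated_mask,
    strength=0.5,              # s = 0.5, the best value in the ablation
    num_inference_steps=40,    # diffusers runs ~strength * steps = 20 steps
).images[0]
```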

Plug-and-Play Refinement Effect: When applied to MCLD, MSE decreases by ~60% and KID by ~71%; when applied to pix2gestalt, LPIPS decreases by 37%.

Key Findings

  1. Pose prompts provide the most effective single-prompt guidance; bbox prompts primarily constrain spatial extent but leave pose ambiguity.
  2. Combined prompts (pose + bbox) consistently reduce joint error while maintaining perceptual quality.
  3. The refinement module further improves all metrics even though the proposed method already surpasses all baselines prior to refinement.
  4. The interest-region bbox achieves the highest gain per prompt point (\(\Delta\)JE per point = 2.86), offering the best user interaction efficiency.

Highlights & Insights

  • Novel Task Definition: This work is the first to propose promptable human amodal completion, establishing a natural bridge between HAC and PGPIS.
  • Practical Prompt Design: Point-based prompts are extremely lightweight (only a few points required), avoiding hard-to-obtain inputs such as 3D information or dense masks.
  • Generality of the Refinement Module: The inpainting-based refinement serves as a universal plug-and-play component that yields significant gains for other methods as well.
  • Cross-Attention-Only Fine-Tuning: Strong prompt alignment is achieved with minimal parameter modification while preserving the pretrained generative prior.
  • Training Efficiency: The main model requires 4 GPUs for 16 hours, and the mask network requires a single GPU for 30 minutes, keeping overall training costs manageable.

Limitations & Future Work

  • Training is conducted only on synthetic data (OccThuman2.0, 526 3D human subjects), limiting diversity in real-world scenarios.
  • The AHP test set contains only 56 images, making real-world evaluation limited in scale.
  • The refinement network relies on pretrained SDXL Inpainting and takes approximately 4 seconds per image, making it unsuitable for real-time applications.
  • Pose prompts require users to manually annotate missing joints, increasing interaction burden under severe occlusion.
  • Comparisons with recent DiT architectures or video diffusion models are absent.
  • Integration of text prompts or other high-level semantic conditions has not been explored.

Related Work & Components

  • HAC Methods: pix2gestalt (general amodal completion with a diffusion prior) and SDHDO (human-specific, with a 2D pose prior) both lack user-prompt support and suffer from visible-appearance degradation.
  • PGPIS Methods: PIDM (2D pose-map-conditioned diffusion) and MCLD (UV-map-conditioned) both exhibit severe appearance hallucinations due to training-data bias.
  • ControlNet: Serves as the foundational architecture for multi-type prompt injection; a dedicated ControlNet is trained for each prompt type.
  • SAM: Used to automatically predict visible-region masks, replacing approaches that depend on ground-truth masks.

Rating

  • Novelty: ⭐⭐⭐⭐ — Novel task formulation with a well-designed prompting mechanism
  • Experimental Thoroughness: ⭐⭐⭐⭐ — Multi-baseline comparison and multi-dimensional ablations (prompt type / noise strength / plug-and-play), though real-world evaluation scale is limited
  • Writing Quality: ⭐⭐⭐⭐ — Clear structure, complete formulations, and rich figures and tables
  • Value: ⭐⭐⭐⭐ — The generality of the refinement module and the practicality of the prompt design offer promising application prospects