TeHOR: Text-Guided 3D Human and Object Reconstruction with Textures
Basic Information
- Conference: CVPR 2026
- arXiv: 2602.19679
- Code: Project Page
- Area: 3D Vision / Human-Object Reconstruction
- Keywords: 3D Human-Object Reconstruction, Text-Guided Optimization, Score Distillation Sampling, 3D Gaussian Splatting, Human-Object Interaction
TL;DR
TeHOR leverages text descriptions as semantic guidance and jointly optimizes the geometry and texture of 3D humans and objects via Score Distillation Sampling from pretrained diffusion models. This approach eliminates the reliance on contact information required by conventional methods, enabling accurate and semantically consistent 3D reconstruction of both contact and non-contact interactions.
Background & Motivation
Jointly reconstructing 3D humans and objects from a single image is a key task for understanding human behavior, with broad applications in robotics, AR/VR, and digital content creation. Existing methods suffer from two fundamental limitations:
Over-reliance on contact information: Existing methods (e.g., PHOSA, CONTHO, InteractVLM) primarily use human-object contact regions as the core cue for interaction reasoning, enforcing geometric proximity in contact areas through iterative fitting. However, a large proportion of real-world interactions are non-contact in nature (e.g., gazing at or pointing toward an object), rendering contact information entirely ineffective. Even when contact exists, erroneous contact predictions directly lead to reconstruction failure.
Neglect of global appearance context: The fitting process in existing methods is driven mainly by local geometric proximity, ignoring the global interaction context provided by appearance cues (color, shading, etc.) of the human and object. This results in globally implausible outputs, such as incorrect object orientation or misaligned human gaze direction.
Method
Overall Architecture
TeHOR adopts a two-stage framework: a Reconstruction Stage (initialization) and an HOI Optimization Stage (joint refinement).
| Stage | Objective | Key Techniques |
|---|---|---|
| Reconstruction Stage | Obtain initial 3D human/object/background and text prompts | GPT-4 text generation, LHM human reconstruction, InstantMesh object reconstruction |
| HOI Optimization Stage | Jointly optimize geometry and texture (200 iterations) | SDS appearance loss, contact loss, collision loss |
Stage 1: Reconstruction Stage
- Text Generation: GPT-4 extracts two types of text prompts from the input image — \(P_{\text{holistic}}\) (global interaction description, e.g., "a person riding a bicycle on grass") and \(P_{\text{contact}}\) (contacting body parts, e.g., "right hand, left hand").
- Human Reconstruction: SmartEraser removes the object → SAM segments the human → LHM generates initial 3D Gaussian attributes \(\phi_h\) (40,000 anchor points uniformly sampled on the SMPL-X surface) → Multi-HMR estimates SMPL-X pose \(\theta\) and shape \(\beta\).
- Object Reconstruction: SmartEraser + SAM isolate the object → InstantMesh reconstructs a 3D mesh (Zero123++ first generates 6-view images, then a tri-plane network reconstructs the mesh) → converted to 3D Gaussian attributes \(\phi_o\) → ZoeDepth depth alignment estimates object pose \((R, t, s)\).
- Background Reconstruction: SmartEraser removes the human and object to obtain a 2D background image used for constructing realistic front-view and novel-view renderings.
3D Representation
The human and object are represented by 3D Gaussian sets \(\Phi_h\) and \(\Phi_o\), respectively:
- Human Gaussians: Parameterized as Gaussian attributes \(\phi_h\) + SMPL-X pose \(\theta\) + shape \(\beta\). \(\phi_h\) is defined in canonical pose, with each Gaussian anchored to a surface point on the SMPL-X mesh and animated via Linear Blend Skinning (LBS). Hand and face regions follow the original SMPL-X skinning weights; other regions use averaged weights from neighboring vertices.
- Object Gaussians: Parameterized as Gaussian attributes \(\phi_o\) + rotation \(R\) + translation \(t\) + scale \(s\), defined in canonical space and mapped to final positions via affine transformation.
Advantages of 3D Gaussians over traditional meshes: (1) Gaussians better model high-fidelity visual appearance, providing richer signals for the appearance loss; (2) the flexible topology-free structure allows more effective optimization of human-object spatial relationships.
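The two parameterizations above can be sketched in a few lines of numpy. This is a minimal illustration, not the paper's code: `transform_object` applies the object's affine transform \((R, t, s)\) to canonical Gaussian centers, and `lbs` poses canonical human Gaussian centers with Linear Blend Skinning; all function names and shapes are assumptions.

```python
import numpy as np

def transform_object(centers, R, t, s):
    """Map canonical object Gaussian centers (N, 3) to world space
    via scale s, rotation R (3, 3), and translation t (3,)."""
    return s * centers @ R.T + t

def lbs(centers, weights, bone_transforms):
    """Linear Blend Skinning: blend per-bone rigid transforms.

    centers:         (N, 3) canonical Gaussian center positions
    weights:         (N, B) skinning weights, each row sums to 1
    bone_transforms: (B, 4, 4) homogeneous bone transforms from pose theta
    """
    homog = np.concatenate([centers, np.ones((len(centers), 1))], axis=1)  # (N, 4)
    # Per-point blended transform matrix: (N, 4, 4)
    blended = np.einsum('nb,bij->nij', weights, bone_transforms)
    posed = np.einsum('nij,nj->ni', blended, homog)
    return posed[:, :3]
```

Because each Gaussian is anchored to an SMPL-X surface point, updating the pose \(\theta\) moves all Gaussians through LBS without touching the canonical attributes \(\phi_h\).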
Key Designs
The total loss consists of four terms:
1) Reconstruction Loss \(\mathcal{L}_{\text{recon}}\): MSE between the front-view rendering and the input image, including RGB reconstruction error and the discrepancy between human/object silhouettes and segmentation masks, ensuring the reconstruction is consistent with the input image at the observed viewpoint.
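As a hedged sketch, this loss might be assembled as below (numpy; the function name, arguments, and term weights are illustrative assumptions, since the paper's summary does not specify the weighting):

```python
import numpy as np

def recon_loss(render_rgb, image_rgb, sil_h, mask_h, sil_o, mask_o,
               w_rgb=1.0, w_mask=1.0):
    """Front-view reconstruction loss: RGB MSE against the input image
    plus silhouette-vs-segmentation-mask discrepancies for human/object."""
    l_rgb = np.mean((render_rgb - image_rgb) ** 2)
    l_mask = np.mean((sil_h - mask_h) ** 2) + np.mean((sil_o - mask_o) ** 2)
    return w_rgb * l_rgb + w_mask * l_mask
```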
2) Appearance Loss \(\mathcal{L}_{\text{appr}}\) (Core Contribution): Based on the Score Distillation Sampling (SDS) strategy, this loss leverages the visual priors of pretrained StableDiffusion-v2.1 to align novel-view renderings with the semantics of \(P_{\text{holistic}}\). Its gradient takes the standard SDS form:
\[
\nabla_{\Phi} \mathcal{L}_{\text{appr}} = \mathbb{E}_{t, \epsilon_t}\!\left[ w_t \left( \hat{\epsilon}_t(\mathbf{x}_t; P_{\text{holistic}}) - \epsilon_t \right) \frac{\partial \mathbf{x}}{\partial \Phi} \right]
\]
where \(t\) is the noise level, \(\mathbf{x}_t\) is the noise-augmented rendered image \(\mathbf{x}\), and \(w_t\) is a weighting factor. This loss minimizes the discrepancy between the diffusion model's predicted noise \(\hat{\epsilon}_t(\cdot)\) and the true noise \(\epsilon_t\), driving the rendered results of the 3D Gaussians toward a plausible appearance distribution conditioned on the text.
Key implementation details:
- Viewpoints are uniformly sampled in spherical coordinates \((r, \upsilon, \psi)\): full-body views with \(r \in [1.0, 2.5]\), \(\upsilon \in [-30°, 30°]\), \(\psi \in [-180°, 180°]\); upper-body zoomed views centered at the SMPL-X spine with \(r \in [0.7, 1.5]\).
- Classifier-free guidance (CFG) scale of 15.0; noise timestep randomly sampled within \([0.02, 0.98]\).
- Gradient clipping with maximum norm 1.0.
This design offers two key advantages: (a) text descriptions transcend contact information and can reason about non-contact interactions (e.g., catching a frisbee, gazing at an object); (b) pixel-level dense gradients provide fine-grained spatial supervision, far surpassing the single global vector encoding of CLIP.
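Putting the loss and the implementation details together, one SDS gradient estimate could look like the following numpy sketch. The noising schedule (\(\alpha = 1 - t\)), the weighting choice \(w_t = 1 - \alpha\), and `predict_eps` (a stand-in for the frozen StableDiffusion UNet) are all illustrative assumptions; only the CFG scale of 15.0, the timestep range \([0.02, 0.98]\), and the clipping norm of 1.0 come from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

def sds_grad(render, predict_eps, prompt_emb, null_emb,
             cfg_scale=15.0, t_range=(0.02, 0.98), max_norm=1.0):
    """One SDS gradient estimate for a rendered image (H, W, 3).

    predict_eps(x_t, t, emb) stands in for the frozen diffusion UNet.
    """
    t = rng.uniform(*t_range)                  # random noise timestep
    eps = rng.standard_normal(render.shape)    # true noise
    alpha = 1.0 - t                            # illustrative schedule
    x_t = np.sqrt(alpha) * render + np.sqrt(1 - alpha) * eps
    # Classifier-free guidance with scale 15.0
    e_cond = predict_eps(x_t, t, prompt_emb)
    e_uncond = predict_eps(x_t, t, null_emb)
    e_hat = e_uncond + cfg_scale * (e_cond - e_uncond)
    w_t = 1 - alpha                            # weighting factor (assumed form)
    grad = w_t * (e_hat - eps)
    # Gradient clipping with maximum norm 1.0
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad
```

In the real pipeline this gradient flows back through the differentiable Gaussian rasterizer to \(\Phi_h\) and \(\Phi_o\); the sketch only shows the per-image term.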
3) Contact Loss \(\mathcal{L}_{\text{contact}}\): Based on \(P_{\text{contact}}\), the Gaussian center point set \(V_{h,c}\) corresponding to the contacting body parts is identified, and the distance to the nearest object point in \(V_o\) is minimized:
\[
\mathcal{L}_{\text{contact}} = \frac{1}{|V_{h,c}|} \sum_{v \in V_{h,c}} \mathbb{1}\!\left[ d(v) < \tau \right] \, d(v), \qquad d(v) = \min_{u \in V_o} \lVert v - u \rVert_2
\]
The threshold \(\tau = 10\) cm ensures local physical plausibility: gradients are computed only for points within the threshold, preventing unrelated distant points from being incorrectly attracted.
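A minimal numpy version of this thresholded contact term (brute-force and illustrative; the real implementation operates on Gaussian centers with autograd and would use a nearest-neighbor structure):

```python
import numpy as np

def contact_loss(V_hc, V_o, tau=0.10):
    """Mean distance from contact-part centers (N, 3) to the nearest
    object point (M, 3), counting only pairs closer than tau (10 cm).
    Points beyond tau are masked out and contribute no gradient."""
    d = np.linalg.norm(V_hc[:, None, :] - V_o[None, :, :], axis=-1)  # (N, M)
    nearest = d.min(axis=1)
    mask = nearest < tau
    if not mask.any():
        return 0.0
    return float(nearest[mask].mean())
```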
4) Collision Loss \(\mathcal{L}_{\text{collision}}\): Penalizes interpenetration between the human and object by computing the proportion of human vertices lying inside the object mesh, ensuring physical plausibility.
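The collision term reduces to a fraction of inside points, sketched below with a pluggable inside test (`inside_fn` is an assumed stand-in for a mesh inside/outside query such as generalized winding numbers or a signed distance field):

```python
import numpy as np

def collision_loss(human_verts, inside_fn):
    """Proportion of human vertices (N, 3) lying inside the object.
    inside_fn(points) -> boolean mask of shape (N,)."""
    return float(np.mean(inside_fn(human_verts)))
```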
Gaussians-to-Mesh Conversion
After optimization, the 3D Gaussians must be converted to meshes for evaluation (to enable fair comparison with existing mesh-based methods). Since Gaussians may deviate from the underlying base mesh, inconsistencies can arise in contact regions. The solution identifies contact regions where the human-object Gaussian distance is less than 5 cm, selects the corresponding mesh vertices, and minimizes their pairwise distances toward zero, achieving contact-consistent conversion.
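A simplified version of this contact-consistent snapping (here vertex pairs within the 5 cm threshold are moved to their midpoints in a single step, whereas the paper minimizes the distances during optimization; names are illustrative):

```python
import numpy as np

def align_contacts(human_v, object_v, thresh=0.05):
    """Snap human/object mesh vertex pairs closer than 5 cm to their
    midpoint, closing the contact gap after Gaussians-to-mesh conversion.
    If several pairs share a vertex, the last assignment wins (sketch only)."""
    d = np.linalg.norm(human_v[:, None] - object_v[None, :], axis=-1)  # (N, M)
    hi, oi = np.nonzero(d < thresh)
    mid = 0.5 * (human_v[hi] + object_v[oi])
    human_v, object_v = human_v.copy(), object_v.copy()
    human_v[hi] = mid
    object_v[oi] = mid
    return human_v, object_v
```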
Key Experimental Results
Datasets and Metrics
- Open3DHOI: An open-vocabulary in-the-wild 3D HOI dataset with 2.5K+ images and 133 object categories (evaluation only).
- BEHAVE: An indoor 3D HOI dataset with 8 subjects × 20 objects and 4.5K test images.
- Metrics: \(\text{CD}_{\text{human}}\) / \(\text{CD}_{\text{object}}\) (Chamfer Distance, cm↓), Contact (F1↑), Collision (interpenetration rate↓).
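For reference, the symmetric Chamfer Distance behind \(\text{CD}_{\text{human}}\)/\(\text{CD}_{\text{object}}\) can be sketched as follows (brute-force numpy version for small point sets; real evaluations typically use KD-trees):

```python
import numpy as np

def chamfer(A, B):
    """Symmetric Chamfer Distance between point sets A (N, 3) and B (M, 3):
    mean nearest-neighbor distance in both directions."""
    d = np.linalg.norm(A[:, None] - B[None, :], axis=-1)  # (N, M)
    return d.min(axis=1).mean() + d.min(axis=0).mean()
```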
Main Results: Comparison with State of the Art (Tab. 4)
| Method | CD↓_human (O3D) | CD↓_obj (O3D) | Contact↑ (O3D) | Coll.↓ (O3D) | CD↓_human (BH) | CD↓_obj (BH) | Contact↑ (BH) |
|---|---|---|---|---|---|---|---|
| PHOSA | 5.342 | 49.180 | 0.243 | 0.044 | 5.758 | 46.003 | 0.257 |
| LEMON+PICO | 5.948 | 25.889 | 0.335 | 0.078 | 6.159 | 22.585 | 0.082 |
| InteractVLM | 5.252 | 24.238 | 0.392 | 0.054 | 5.770 | 19.197 | 0.379 |
| HOI-Gaussian | 5.111 | 19.363 | 0.348 | 0.070 | 5.748 | 21.774 | 0.371 |
| TeHOR | 4.941 | 16.701 | 0.412 | 0.047 | 5.615 | 17.339 | 0.412 |
TeHOR outperforms prior state-of-the-art methods on nearly all metrics (the sole exception is the Open3DHOI collision rate, where PHOSA scores marginally lower). On Open3DHOI, object CD improves from 19.363 to 16.701 (↓13.7%) and Contact F1 improves from 0.392 to 0.412.
Non-Contact Scenario Evaluation (Tab. 5)
| Method | CD↓_human | CD↓_object | Collision↓ |
|---|---|---|---|
| PHOSA | 5.401 | 65.537 | 0.028 |
| InteractVLM | 5.390 | 46.819 | 0.011 |
| HOI-Gaussian | 5.244 | 25.374 | 0.037 |
| TeHOR | 4.958 | 17.546 | 0.005 |
The advantage is even more pronounced in non-contact scenarios, with object CD improving from 25.374 to 17.546 (↓30.8%), validating the critical role of text-semantic guidance.
Ablation Study
Effect of Text-Guided Optimization (Tab. 1):
| Setting | CD↓_human | CD↓_obj | Contact↑ | Collision↓ |
|---|---|---|---|---|
| Before optimization | 5.252 | 31.268 | 0.305 | 0.040 |
| Optimization (w/o text) | 5.028 | 20.348 | 0.374 | 0.052 |
| Optimization (full) | 4.941 | 16.701 | 0.412 | 0.047 |
Loss Function Configuration Ablation (Tab. 2):
| \(\mathcal{L}_{\text{appr}}\) | \(\mathcal{L}_{\text{contact}}\) | CD↓_obj | Contact↑ |
|---|---|---|---|
| ✗ | ✓ | 22.094 | 0.330 |
| ✓ | ✗ | 19.849 | 0.374 |
| CLIP substitute | ✓ | 18.504 | 0.366 |
| ✓ (SDS) | ✓ | 16.701 | 0.412 |
Key finding: The SDS appearance loss substantially outperforms CLIP loss — CLIP encodes to a single 1D vector and cannot model dense spatial relationships, whereas SDS provides pixel-level dense gradients.
Rendering Component Ablation (Tab. 3): Replacing 3D Gaussians with meshes degrades CD_obj to 25.162; removing the 2D background degrades CD_obj to 18.196, demonstrating that the complete scene context is essential for the diffusion prior.
Highlights & Insights
- Breaking the contact-dependency paradigm: TeHOR is the first to incorporate text descriptions into joint 3D human-object reconstruction, enabling reasoning about non-contact interactions (gazing, pointing, catching a frisbee, etc.).
- SDS appearance optimization: By exploiting the visual priors of pretrained diffusion models, multi-view SDS achieves fine-grained semantic alignment, which ablation experiments confirm is far superior to CLIP.
- First joint texture reconstruction: The paper claims to be the first framework to simultaneously reconstruct complete 3D textures for both humans and objects, directly enabling the generation of immersive digital assets.
- Thorough experimental design: Separate evaluations on general and non-contact scenarios, along with five groups of ablation experiments, sufficiently validate the contribution of each component.
Limitations & Future Work
- The pipeline depends on multiple external models (GPT-4, StableDiffusion, LHM, InstantMesh), resulting in a long dependency chain and high inference cost.
- Approximately 134 seconds per sample on a single RTX 8000 GPU; the 200-step optimization makes real-time application infeasible.
- The appearance loss provides primarily global guidance, with insufficient supervision over local details (small accessories, subtle surface deformations).
- Quantitative evaluation of texture quality is absent due to the lack of 3D HOI datasets with joint geometry and texture annotations.
Rating
⭐⭐⭐⭐ — The paper clearly identifies the fundamental limitations of existing methods (contact dependency and neglect of global appearance) and proposes a novel and effective text-guided SDS optimization approach. It achieves comprehensive state-of-the-art performance in both general and non-contact scenarios, with a systematic ablation study design. Points are deducted mainly for optimization efficiency and the long dependency chain on multiple external models.