# GuideFlow3D: Optimization-Guided Rectified Flow For Appearance Transfer

**Conference:** NeurIPS 2025 · **arXiv:** 2510.16136 · **Code:** Project Page · **Area:** Image Generation
**Keywords:** 3D appearance transfer, rectified flow, universal guidance, structured latent, part-aware loss
## TL;DR
This paper proposes GuideFlow3D, a training-free 3D appearance transfer framework that alternates the sampling steps of a pretrained rectified flow model with optimization steps driven by differentiable guidance losses (a part-aware appearance loss and a self-similarity loss), enabling robust transfer of texture and geometric detail between objects with significant geometric discrepancies.
## Background & Motivation
Transferring appearance (texture + fine geometric details) from one 3D object to another has broad applications in gaming, AR, and digital content creation. Existing methods struggle when the input object and the appearance object differ significantly in geometry:
- 2D style transfer → 3D lifting: Applies 2D style transfer on multi-view images before 3D reconstruction, but inter-view geometric inconsistencies introduce artifacts.
- Direct application of 3D generative models: Rectified flow models such as Trellis can generate high-quality 3D assets, but are constrained by training-time conditioning signals and data distributions, generalizing poorly to appearance transfer—especially under large geometric discrepancies.
- ControlNet-based methods (e.g., TEXTure, EasiTex): Rely on specific training setups and conditioning modalities, limiting generalizability.
- Pure optimization methods: Directly optimizing the latent space to match an appearance target deviates from the data distribution modeled by the generative network, producing unnatural results.
The core motivation is: can the inductive biases of a pretrained 3D generative model be leveraged to achieve flexible appearance transfer through inference-time guidance, without retraining?
## Core Problem
How to robustly transfer the texture and fine geometric details of an appearance object onto an input object while preserving its global geometric structure—particularly when the two objects differ substantially in geometry (e.g., chair→bed, giraffe→furniture)?
## Method
### Overall Architecture
GuideFlow3D builds upon the structured latent (SLat) representation and rectified flow generative model of Trellis, controlling the generation process at inference time by alternating between flow steps and guided optimization steps.
### 1. Structured Latent Representation
A 3D object \(\mathcal{O}\) is encoded as a structured latent:
- \(p_i\): active voxel positions (intersecting the object surface), delineating coarse structure
- \(z_i\): latent vector for each voxel, capturing fine geometry and texture features
- Key design: \(p_i\) is fixed (preserving global geometry); only the generation of \(z_i\) is guided
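The position–feature split can be sketched as a tiny data structure (a hypothetical illustration; `StructuredLatent` and its method names are invented here, not part of Trellis's actual API):

```python
import numpy as np

# Minimal illustrative sketch (not Trellis's actual API): the structured
# latent pairs frozen voxel positions p_i with optimizable features z_i.
class StructuredLatent:
    def __init__(self, positions: np.ndarray, latents: np.ndarray):
        assert positions.shape[0] == latents.shape[0]
        self.positions = positions  # (N, 3) active voxel coords -- fixed
        self.latents = latents      # (N, C) per-voxel features  -- guided

    def update_latents(self, grad: np.ndarray, lr: float) -> "StructuredLatent":
        # A guidance step only moves z_i; p_i (global geometry) is untouched.
        return StructuredLatent(self.positions, self.latents - lr * grad)

# toy example: 4 active voxels with 8-dim latents
slat = StructuredLatent(np.zeros((4, 3), dtype=int), np.ones((4, 8)))
slat2 = slat.update_latents(np.full((4, 8), 0.5), lr=0.1)
```

Freezing `positions` while letting only `latents` receive gradient updates is what guarantees the input object's global shape survives the transfer.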
### 2. Guidance Objective Functions
(a) Part-aware appearance loss \(\mathcal{L}_{\text{appearance}}\) (applicable when the appearance object has a mesh):
- Geometric features from PartField are used for co-segmentation clustering, establishing part-level correspondences between the input and appearance objects (e.g., chair back↔chair back, chair leg↔chair leg)
- \(m(i)\) denotes the index of the appearance-object voxel matched to input voxel \(i\) via part clustering
- Ensures localized texture and geometric correspondence
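A minimal sketch of what such a part-matched loss could look like (a hypothetical mean-squared-error form; the paper's exact weighting may differ, and `match` stands in for the part-clustering correspondence \(m(i)\)):

```python
import numpy as np

def appearance_loss(z_in: np.ndarray, z_app: np.ndarray, match: np.ndarray) -> float:
    # Pull each input-voxel latent z_i toward the latent of its
    # part-matched appearance voxel z_{m(i)} (mean squared error).
    return float(np.mean(np.sum((z_in - z_app[match]) ** 2, axis=1)))

# toy check: a perfect match gives zero loss
z_app = np.random.default_rng(0).normal(size=(6, 4))
match = np.array([0, 2, 4])
zero = appearance_loss(z_app[match], z_app, match)
```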
(b) Self-similarity loss \(\mathcal{L}_{\text{structure}}\) (applicable when the appearance object is only an image or text):
- A geometry-clustering-based contrastive loss: voxel features within the same part are encouraged to be similar (positives), while those from different parts are encouraged to differ (negatives)
- Promotes local consistency without enforcing global homogenization
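A hedged sketch of such a geometry-clustering contrastive term (a margin/hinge form is assumed here for illustration; the paper's formulation may differ):

```python
import numpy as np

def structure_loss(z: np.ndarray, parts: np.ndarray, margin: float = 1.0) -> float:
    # Same-part voxel features are pulled together (squared distance);
    # cross-part pairs are pushed beyond a margin (hinge term).
    n, loss, count = len(z), 0.0, 0
    for i in range(n):
        for j in range(i + 1, n):
            d = float(np.linalg.norm(z[i] - z[j]))
            if parts[i] == parts[j]:
                loss += d ** 2                     # positive pair
            else:
                loss += max(0.0, margin - d) ** 2  # negative pair
            count += 1
    return loss / count

# identical features within a part, far-apart features across parts -> zero loss
z = np.array([[0.0, 0.0], [0.0, 0.0], [3.0, 0.0], [3.0, 0.0]])
parts = np.array([0, 0, 1, 1])
```

Note the loss is zero only when parts are internally consistent yet mutually distinct, which is exactly the "local consistency without global homogenization" behavior described above.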
### 3. Guided Rectified Flow Sampling
The reverse process of standard rectified flow integrates the learned velocity field \(v_\theta\) with Euler steps:

\[
\mathbf{z}_{t-\Delta t} = \mathbf{z}_t - \Delta t \, v_\theta(\mathbf{z}_t, t, \mathbf{c})
\]

GuideFlow3D injects guidance gradients at each step, steering the sample toward low guidance loss while staying close to the learned flow:

\[
\mathbf{z}_{t-\Delta t} = \mathbf{z}_t - \Delta t \left[ v_\theta(\mathbf{z}_t, t, \mathbf{c}) + \lambda \, \nabla_{\mathbf{z}_t} \mathcal{L}_{\text{guidance}}(\mathbf{z}_t) \right]
\]
- The condition \(\mathbf{c}\) can be an image or text
- From a Bayesian perspective, the rectified flow models the prior \(P(\mathcal{O})\) and likelihood \(P(\mathbf{c}|\mathcal{O})\), while the guidance term models additional constraints
- This extends universal guidance from diffusion models to arbitrary rectified flow models
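The alternating scheme can be illustrated with a toy 1-D Euler integrator (a sketch under assumed sign conventions; `v_theta` and `guidance_grad` are stand-ins for the pretrained velocity field and the guidance-loss gradient, not Trellis internals):

```python
import numpy as np

def guided_rf_sample(z, v_theta, guidance_grad, steps=10, lam=0.1):
    # Alternate a rectified-flow Euler step with a guidance (optimization)
    # step that descends the differentiable guidance loss.
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt
        z = z + dt * v_theta(z, t)       # flow step along the learned velocity
        z = z - lam * guidance_grad(z)   # guided optimization step
    return z

# toy 1-D check: the "flow" decays toward 0, guidance pulls toward 2;
# the final sample lands between the two attractors
out = guided_rf_sample(np.array([5.0]),
                       v_theta=lambda z, t: -z,
                       guidance_grad=lambda z: z - 2.0)
```

The guidance weight `lam` trades prior plausibility (staying on the flow) against constraint satisfaction (matching the appearance target), mirroring the Bayesian reading above.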
### 4. Conditioning Flexibility
- Image + Mesh condition: Uses \(\mathcal{L}_{\text{appearance}}\) to transfer texture and geometric details
- Image-only condition: Uses \(\mathcal{L}_{\text{structure}}\) (Trellis can first generate a mesh from the image)
- Text condition: Uses \(\mathcal{L}_{\text{structure}}\) to transfer texture only
## Loss & Training
### Training-Free
GuideFlow3D is a fully training-free framework requiring no fine-tuning or retraining of the generative model. All appearance transfer control is injected at inference time through guidance.
### Implementation Details
- Base model: Pretrained Trellis models (`trellis-image-large` for image conditioning, `trellis-text-large` for text conditioning), using their default configurations
- Part features: PartField is used to compute a part feature field for each mesh; voxel coordinates \(p_i\) are used to query per-voxel part features
- Sampling steps: Rectified flow sampling and single-instance optimization are alternated for a total of 300 steps
- Optimizer: AdamW, learning rate \(5 \times 10^{-4}\)
- Hardware: Single NVIDIA RTX 4090 GPU
- Runtime: 96 seconds (baseline Trellis: 78 seconds; ~23% overhead)
- All conditioning types (image/text) use the same optimization settings
### Evaluation Rendering
- All assets are rendered in Blender with smooth area lighting
- Each object is rendered from 4 viewpoints (fixed radius 2, pitch 30°, yaw starting at 45° with 90° increments)
- All meshes are placed in canonical pose to ensure alignment
- Metrics are computed per viewpoint and per object, then averaged
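The four stated viewpoints can be reproduced as camera positions on a sphere (a small helper assuming a z-up spherical convention; the renderer's exact coordinate frame may differ):

```python
import math

def eval_cameras(radius=2.0, pitch_deg=30.0, yaw0_deg=45.0, n_views=4):
    # Camera positions for the protocol above: fixed radius 2, pitch 30 deg,
    # yaws at 45, 135, 225, 315 degrees (90-degree increments).
    pitch = math.radians(pitch_deg)
    cams = []
    for k in range(n_views):
        yaw = math.radians(yaw0_deg + k * 360.0 / n_views)
        x = radius * math.cos(pitch) * math.cos(yaw)
        y = radius * math.cos(pitch) * math.sin(yaw)
        z = radius * math.sin(pitch)
        cams.append((x, y, z))
    return cams

cams = eval_cameras()
```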
## Key Experimental Results
### Dataset
- Input meshes: Procedurally generated simple geometries (simple)
- Appearance objects: ABO dataset (complex), ~8K 3D models across 55 categories
- 4 experimental settings: simple-complex intra-/inter-category, complex-complex intra-/inter-category
- 250 input–appearance pairs per setting
### Evaluation Protocol
Traditional encoder-based metrics (PSNR, SSIM, LPIPS, FID, etc.) require ground truth and cannot handle dissimilar geometries. Therefore, a GPT-based ranking system is adopted, evaluating six dimensions: Style Fidelity, Structure Clarity, Style Integration, Detail Quality, Shape Adaptation, and Overall Quality (lower rank is better). A user study confirms high agreement between GPT rankings and human preferences.
### Main Results (simple-complex intra-category, image conditioning)
| Method | Fidelity↓ | Clarity↓ | Overall↓ |
|---|---|---|---|
| UV Nearest Neighbor | 4.12 | 3.84 | 4.33 |
| MambaST | 4.94 | 3.55 | 4.87 |
| Cross Image Attention | 3.56 | 3.48 | 3.59 |
| EasiTex | 3.18 | 4.30 | 3.81 |
| Trellis | 2.51 | 2.58 | 2.62 |
| GuideFlow3D (Ours) | 1.89 | 2.41 | 2.12 |
### Text Conditioning Results (simple-complex intra-category)
| Method | Fidelity↓ | Clarity↓ | Overall↓ |
|---|---|---|---|
| Trellis | 2.01 | 1.89 | 2.39 |
| GuideFlow3D (Ours) | 1.54 | 1.63 | 1.95 |
- GuideFlow3D achieves the best rankings across all settings (intra-/inter-category, simple/complex) and both conditioning modalities
- In-the-wild experiments demonstrate robust transfer across semantic categories (animals→furniture, furniture→vehicles, etc.)
### Runtime
- GuideFlow3D: 96 seconds (NVIDIA RTX 4090 GPU)
- Trellis baseline: 78 seconds
- ~23% additional overhead for substantial quality improvement
### Ablation Study
Ablations are conducted on the simple-complex intra-category image-conditioning setting:
| Variant | Fidelity↓ | Clarity↓ | Overall↓ |
|---|---|---|---|
| (i) w/o flow + global feat. | 4.52 | 4.51 | 4.50 |
| (ii) w/o flow + SLat spatial NN matching | 3.58 | 3.62 | 3.63 |
| (iii) w/ flow + K-means on SLat (no PartField) | 2.57 | 2.65 | 2.66 |
| (iv) w/ flow + \(\mathcal{L}_{\text{structure}}\) (image cond.) | 2.17 | 2.05 | 2.03 |
| (v) w/ flow + \(\mathcal{L}_{\text{appearance}}\) (image cond.) | 1.23 | 1.08 | 1.06 |
Key findings:
1. **Global features are insufficient**: Global latents obtained via min/max/avg pooling fail to capture semantic correspondences
2. **Unstructured NN matching is insufficient**: Nearest-neighbor matching directly in SLat space improves fidelity but lacks robust semantic alignment
3. **PartField vs. K-means on SLat**: Semantically aware PartField segmentation significantly outperforms K-means on SLat features, showing that part-aware semantic information is critical for accurate part correspondence
4. **The two losses are complementary**: \(\mathcal{L}_{\text{appearance}}\) yields stronger fidelity, while \(\mathcal{L}_{\text{structure}}\) provides better alignment and adaptability
## Scene Editing Application
- Scene-level editing capability is validated on ScanNet indoor scenes
- Per-object CAD mesh annotations are used to select appearance objects for each semantic category in the scene
- Multiple objects in the scene can be selectively re-stylized while preserving the spatial layout
- Demonstrates potential for interactive scene customization
## Limitations of Traditional Metrics
- DINOv2, CLIP Score, DreamSim, etc. require ground truth or assume geometric similarity
- These metrics cannot reflect true transfer quality when input and appearance objects differ substantially in geometry
- For instance, CLIP Score assigns higher scores to the Trellis baseline under text conditioning, because text typically describes shapes different from the input geometry
## Highlights & Insights
- Training-free: Appearance transfer is achieved entirely through inference-time guidance injection, without modifying generative model parameters
- Geometric robustness: Global geometry is preserved by fixing voxel positions \(p_i\); part-aware losses handle large geometric discrepancies
- Unified multi-modal framework: The same framework supports three appearance representations—mesh, image, and text
- Principled formulation: Universal guidance is extended to rectified flow via Bayesian formulation, yielding a theoretically grounded framework
- Generality and scalability: The method generalizes to different diffusion/flow models and guidance functions
- Evaluation innovation: A GPT-based multi-dimensional ranking evaluation system is proposed and validated against human judgments via user study
## Limitations & Future Work
- Not real-time: The optimization-based approach (96-second inference) is unsuitable for real-time applications; future work could train a self-supervised feed-forward model for acceleration
- Dependency on external models: Relies on Trellis (SLat encoding/decoding) and PartField (part features); failures in these models cascade to the final output
- Requires clean meshes: Assumes noise-free input meshes, limiting applicability to noisy inputs such as scanned data
- Limited scope of main experiments: Main experiments focus on the furniture category (ABO dataset); although in-the-wild results suggest broader generalization, systematic evaluation is lacking
- Absence of traditional metric comparisons: Complete reliance on GPT-based evaluation may overlook certain objective quality differences
## Related Work & Insights
| Method | Training Required | Geometric Robustness | Multi-modal Support | Part-aware | Output Representation |
|---|---|---|---|---|---|
| StyleGaussian | Yes | Weak | Style only | No | Rendering only |
| TEXTure | SDS distillation | Moderate | Text | No | Texture |
| EasiTex | ControlNet | Weak (large geometric deviation) | Image | No | Texture |
| Trellis | No additional training | Weak | Image/Text | No | Mesh/3DGS/NeRF |
| Cross Image Attention | No | Weak (2D→3D artifacts) | Image | No | Depends on lifting method |
| GuideFlow3D | No | Strong | Mesh+Image+Text | Yes | Mesh/3DGS/NeRF |
Additional Insights:
- **3D extension of universal guidance**: The idea of universal guidance from Bansal et al. for 2D diffusion is extended to 3D rectified flow models, opening a new direction for controllability in 3D generation: any differentiable objective can be injected at inference time
- **Part-aware correspondence via PartField**: Using PartField's geometric co-segmentation to establish inter-object part correspondences is an elegant solution to the correspondence problem under large geometric discrepancies
- **Position–feature decoupling in structured latent space**: The design of fixing \(p_i\) while optimizing \(z_i\) cleanly achieves "preserve geometry, modify appearance," a principle transferable to other 3D editing tasks
- **GPT-as-evaluator paradigm**: In generative task evaluation without ground truth, GPT ranking combined with user study validation is a methodology worth broader adoption
## Rating
- Novelty: ⭐⭐⭐⭐ (Extending universal guidance to 3D rectified flow is a novel idea; the part-aware loss design is creative)
- Experimental Thoroughness: ⭐⭐⭐⭐ (Multiple settings, multiple baselines, in-the-wild demonstrations, user study, and ablation study are all covered)
- Writing Quality: ⭐⭐⭐⭐ (Clear mathematical derivations, rich illustrations, and well-motivated method design)
- Value: ⭐⭐⭐⭐ (A training-free 3D appearance transfer framework with strong practicality and scalability)