
JointDiT: Enhancing RGB-Depth Joint Modeling with Diffusion Transformers

Paper Information

  • Conference: ICCV 2025
  • arXiv: 2505.00482
  • Code: Project Page
  • Area: 3D Vision
  • Keywords: Diffusion Transformer, RGB-Depth Joint Generation, Depth Estimation, Joint Distribution Modeling, Flow Matching

TL;DR

JointDiT builds an RGB-Depth joint distribution model upon the Flux diffusion Transformer. Through adaptive scheduling weights and an unbalanced timestep sampling strategy, a single model can flexibly perform three tasks—joint generation, depth estimation, and depth-conditioned image generation—by controlling the timestep of each modality.

Background & Motivation

Diffusion models have achieved remarkable progress in image generation and conditional generation (depth estimation, depth-guided generation, etc.). Recent work has explored joint distribution modeling of RGB and depth, finding that it not only enables joint generation but also serves as a unified substitute for conditional generation. However, two core problems remain:

Limited generation quality: Existing joint models (LDM3D, JointNet) are built on the relatively weak Stable Diffusion architecture, yielding suboptimal image fidelity and depth accuracy.

Challenges of decoupled timestep training: Achieving "one model, multiple tasks" requires training with independent noise levels for each modality, yet how to train this effectively has not been thoroughly explored.

Key insight: Advanced diffusion Transformers such as Flux possess superior image priors and global receptive fields (Transformer architecture), while Transformers have also been proven effective for depth estimation (DPT, Depth Anything).

Method

Overall Architecture

JointDiT constructs a parallel Depth branch alongside the RGB branch of Flux, exchanging features via a Joint Connection Module to achieve joint distribution modeling. The pretrained backbone is frozen; only LoRA and the Joint Connection Module are trained.

Joint Conditional Flow Matching (JCFM)

The flow matching framework is extended to learn a joint vector field \(v_{t_x,t_y}(x,y|x_1,y_1)\), with independent timesteps \(t_x, t_y\) for each modality:

\[\mathcal{L}_{\text{JCFM}}(\theta) = \mathbb{E}_{t_x,t_y}\left[\|v_{t_x,t_y,\theta}(x,y) - v_{t_x,t_y}(x,y|x_1,y_1)\|\right]\]

Tasks are switched by controlling the initial timesteps:

  • Joint generation: \(t_x=0, t_y=0\) (both modalities start from noise)
  • Depth estimation: \(t_x=1, t_y=0\) (image is clean; depth starts from noise)
  • Depth-conditioned generation: \(t_x=0, t_y=1\) (depth is clean; image starts from noise)
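As a concrete illustration, here is a minimal NumPy sketch of constructing one JCFM training pair with independent timesteps. `make_jcfm_pair` and its linear interpolation path are illustrative stand-ins under the convention above (t = 0 is pure noise, t = 1 is clean data), not the paper's exact implementation, and the Flux-based network itself is omitted:

```python
import numpy as np

def make_jcfm_pair(x1, y1, tx, ty, rng):
    """Build one JCFM training pair (sketch).

    x1, y1 are clean RGB / depth latents; t=0 is pure noise and t=1 is
    clean data, matching the task-switching convention above.
    """
    x0 = rng.standard_normal(x1.shape)   # noise endpoint for RGB
    y0 = rng.standard_normal(y1.shape)   # noise endpoint for depth
    xt = (1 - tx) * x0 + tx * x1         # linear (rectified-flow) path
    yt = (1 - ty) * y0 + ty * y1
    vx_target = x1 - x0                  # conditional vector field targets
    vy_target = y1 - y0
    # The JCFM loss regresses the network prediction v_theta(xt, yt, tx, ty)
    # onto (vx_target, vy_target); the network itself is omitted here.
    return xt, yt, vx_target, vy_target
```

At \(t_x=1\) the image input is exactly the clean latent, which is precisely the depth-estimation configuration above.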

Key Design 1: Adaptive Scheduling Weights

In joint cross-attention, information-passing weights are dynamically adjusted according to the relative noise levels of the two modalities:

\[w_x(t_x, t_y) = \text{sigmoid}\left(\alpha\left(\frac{t_y}{t_x+t_y} - \frac{1}{2}\right)\right)\]
\[w_y(t_x, t_y) = \text{sigmoid}\left(\alpha\left(\frac{t_x}{t_x+t_y} - \frac{1}{2}\right)\right)\]

where \(\alpha=3\). The intuition is that the noisier branch should draw more structural information from the cleaner branch.
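The two weight formulas can be sketched directly; the `eps` guard for \(t_x=t_y=0\) is an illustrative assumption, not taken from the paper:

```python
import math

def scheduling_weights(tx, ty, alpha=3.0, eps=1e-8):
    """Adaptive scheduling weights (alpha = 3 as in the paper).

    The noisier branch (smaller t) gets a weight closer to 1 on features
    from the cleaner branch; eps guards the tx = ty = 0 corner case.
    """
    sigmoid = lambda z: 1.0 / (1.0 + math.exp(-z))
    denom = tx + ty + eps
    wx = sigmoid(alpha * (ty / denom - 0.5))  # weight on depth -> RGB features
    wy = sigmoid(alpha * (tx / denom - 0.5))  # weight on RGB -> depth features
    return wx, wy
```

For example, with a noisy image and nearly clean depth (\(t_x=0.2, t_y=0.8\)), the RGB branch receives the larger cross-attention weight, matching the intuition stated above.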

Key Design 2: Unbalanced Timestep Sampling

To ensure sufficient coverage of the joint timestep combination space for both joint and conditional generation:

  • With 50% probability, \(t_x\) and \(t_y\) are sampled independently from two distinct distributions \(f(t)\) and \(g(t)\).
  • With 50% probability, \(t_x = t_y\), sampled from \(f(t)\).

This ensures the model receives adequate training across all \((t_x, t_y)\) combinations.
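The sampling scheme itself is simple to sketch. The paper's exact choices of \(f(t)\) and \(g(t)\) are not reproduced here, so uniform draws stand in for them:

```python
import numpy as np

def sample_timesteps(rng, f=None, g=None, p_independent=0.5):
    """Unbalanced timestep sampling (sketch).

    f and g stand in for the two distinct sampling distributions; uniform
    draws are used here as placeholders for the paper's actual choices.
    """
    f = f if f is not None else (lambda: float(rng.uniform()))
    g = g if g is not None else (lambda: float(rng.uniform()))
    if rng.uniform() < p_independent:
        tx, ty = f(), g()   # off-diagonal (tx, ty) combinations
    else:
        tx = ty = f()       # diagonal tx = ty, as in joint generation
    return tx, ty
```

Tying the timesteps half the time keeps the diagonal (equal-noise) regime well trained while the independent half covers the conditional-generation corners.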

Loss & Training

Within each joint block, the final output combines self-attention with weighted joint cross-attention:

\[\mathbf{G}_x = \text{Attn}(\mathbf{S}_x) + w_x \cdot \text{JointAttn}(\mathbf{S}_x, \mathbf{S}_y)\]
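This combination can be illustrated with a toy single-head attention; `attention` here uses identity projections for brevity and stands in for the frozen Flux self-attention and the trained joint cross-attention, not the actual Flux layers:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention(queries, kv):
    # Single-head dot-product attention with identity projections.
    scores = queries @ kv.T / np.sqrt(queries.shape[-1])
    return softmax(scores) @ kv

def joint_block_output(Sx, Sy, wx):
    """G_x = Attn(S_x) + w_x * JointAttn(S_x, S_y): toy stand-ins for the
    frozen self-attention and the trained joint cross-attention."""
    return attention(Sx, Sx) + wx * attention(Sx, Sy)
```

With \(w_x = 0\) the block reduces to plain self-attention, so the joint pathway is a purely additive modulation on top of the frozen backbone.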

Key Experimental Results

Main Results: Zero-Shot Depth Estimation Generalization

| Type | Method | NYUv2 AbsRel↓ | KITTI AbsRel↓ | ETH3D AbsRel↓ |
|---|---|---|---|---|
| Discriminative | Depth-Anything-V2 | 4.4 | 7.5 | 13.2 |
| Diffusion-based | Marigold | 5.5 | 9.6 | 6.5 |
| Diffusion-based | GeoWizard | 5.2 | 10.1 | 6.4 |
| Joint | JointDiT | 4.9 | 9.4 | 5.6 |

As a joint model, JointDiT achieves depth estimation performance comparable to dedicated depth estimation models.

Ablation Study: Contribution of Key Components

| Adaptive Scheduling Weights | Unbalanced Sampling | Joint Gen. FID↓ | Depth Est. AbsRel↓ |
|---|---|---|---|
| ✗ | ✗ | High | High |
| ✓ | ✗ | Improved | Improved |
| ✗ | ✓ | Improved | Improved |
| ✓ | ✓ | Lowest | Lowest |

Both techniques contribute significantly and are complementary: combining them yields the best joint-generation FID and depth-estimation AbsRel.

Key Findings

  • JointDiT's 3D lifting results substantially outperform LDM3D and JointNet, producing geometrically accurate 3D point clouds.
  • The RGB and Depth branches exhibit complementary behavior during generation: the Depth branch captures structural information while the RGB branch focuses on texture and appearance.
  • In challenging domains (cartoons, pixel art), JointDiT's depth estimation surpasses dedicated methods, benefiting from the complementary advantages of joint modeling.

Highlights & Insights

  1. Joint distribution as a substitute for conditional generation: A single model covers multiple tasks through timestep control alone.
  2. Leveraging advanced diffusion Transformers: Flux's image priors combined with the Transformer's global receptive field are key to successful joint modeling.
  3. Lightweight adaptation: Only LoRA and the Joint Connection Module are trained, preserving pretrained knowledge.
  4. Discovery of complementary behavior: The RGB and Depth branches naturally specialize during generation.

Limitations & Future Work

  • Training data consists of only 50k pairs, potentially limiting generalization.
  • The model relies on pseudo depth labels generated by Depth-Anything-V2, and may inherit its biases.
  • Although joint generation FID improves substantially, absolute values still leave room for improvement.
  • Only \(512 \times 512\) resolution is supported.

Related Work

  • LDM3D / JointNet: SD-based RGB-Depth joint generation
  • Marigold / GeoWizard: diffusion-based depth estimation
  • Flux: advanced flow-matching diffusion Transformer
  • ControlNet: depth-conditioned image generation

Rating

  • Novelty: ⭐⭐⭐⭐ — Adaptive scheduling weights and unbalanced sampling strategy are original contributions.
  • Practicality: ⭐⭐⭐⭐ — A single model handles multiple tasks with flexible applicability.
  • Experimental Thoroughness: ⭐⭐⭐⭐ — Comprehensive evaluation across joint generation, depth estimation, and conditional generation.
  • Writing Quality: ⭐⭐⭐⭐ — Motivation is clearly articulated with complete technical detail.