Refining Few-Step Text-to-Multiview Diffusion via Reinforcement Learning¶
Conference: CVPR2026 arXiv: 2505.20107 Code: ZiyiZhang27/MVC-ZigAL Area: Image Generation Keywords: Multi-view generation, diffusion models, reinforcement learning fine-tuning, few-step inference, cross-view consistency
TL;DR¶
This paper proposes MVC-ZigAL, a framework that improves single-view fidelity and cross-view consistency in few-step text-to-multiview diffusion models through multiview-aware MDP formulation, zigzag self-refining advantage learning, and Lagrangian dual constrained optimization.
Background & Motivation¶
- Growing demand for text-to-multiview generation: T2MV diffusion models must jointly generate images of the same scene from multiple viewpoints given a single text prompt, offering significant value in 3D content creation and related applications.
- Few-step models trade quality for speed: Few-step backbones such as LCM reduce inference steps to fewer than 8, but at the cost of substantially degraded image fidelity and cross-view consistency.
- Existing RL methods do not transfer directly: Prior RL fine-tuning approaches (DPOK, REBEL, etc.) are designed for single-image generation and overlook coordinated optimization across multiple views.
- Weak learning signals in few-step models: Samples generated by few-step models are generally of lower quality and exhibit tightly clustered reward values, resulting in insufficient gradient signals for standard RL methods.
- Limitations of single-view and joint-view rewards: Single-view rewards (e.g., PickScore) provide fine-grained feedback but neglect cross-view consistency, while joint-view rewards (e.g., HyperScore) assess overall quality but lack per-view supervision.
- Sensitivity of weighted-sum reward balancing: Naively combining the two reward types via weighted summation is highly sensitive to weight selection, making it difficult to stably balance the two optimization objectives.
Method¶
Overall Architecture¶
MVC-ZigAL comprises three core components: (1) multiview-aware MDP reformulation; (2) ZMV-Sampling with zigzag advantage learning; and (3) Lagrangian dual constrained optimization.
Multiview-Aware MDP¶
The T2MV denoising process is reformulated as a multiview MDP: the state \(s_t\) at each step encodes the noisy images and camera embeddings for all \(V\) views, and the action \(a_t\) denotes the joint denoising output across all views. A joint-view reward function \(\mathcal{R}_{\text{mv}}\) is introduced to evaluate the overall quality of the generated multiviews (based on the HyperScore overall dimension). Three baselines—MV-PG, MV-DPO, and MV-RDL—are adapted under this MDP formulation.
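The state/action structure of this multiview MDP can be sketched as follows. This is a toy illustration with made-up tensor shapes; `MVState`, `step`, and the toy denoiser are hypothetical names, not the paper's code:

```python
from dataclasses import dataclass
import numpy as np

# Hypothetical shapes: V views, C x H x W latents, D-dim camera embeddings.
V, C, H, W, D = 6, 4, 8, 8, 16

@dataclass
class MVState:
    """State s_t of the multiview MDP: the noisy latents and camera
    embeddings for all V views at denoising step t."""
    t: int
    latents: np.ndarray   # (V, C, H, W) noisy images x_t^{1:V}
    cameras: np.ndarray   # (V, D) per-view camera embeddings

def step(state: MVState, denoiser) -> MVState:
    """One MDP transition: the action a_t is the joint denoising output
    across all V views, produced by a single policy call."""
    action = denoiser(state.latents, state.cameras, state.t)  # (V, C, H, W)
    return MVState(t=state.t - 1, latents=action, cameras=state.cameras)

# Toy stand-in for the T2MV diffusion policy.
toy_denoiser = lambda x, cams, t: 0.9 * x

s0 = MVState(t=8,
             latents=np.random.randn(V, C, H, W),
             cameras=np.random.randn(V, D))
s1 = step(s0, toy_denoiser)
```

The key point is that one action updates all \(V\) views jointly, which is what lets the joint-view reward \(\mathcal{R}_{\text{mv}}\) supervise cross-view consistency at the trajectory level.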
ZMV-Sampling with Self-Refining¶
At the first denoising step, a three-pass zigzag procedure is applied: high-guidance denoising → low-guidance reverse noising → high-guidance denoising again. The core idea is to instantiate a self-refining mechanism via the guidance scale contrast (\(\omega_{\text{high}}\) vs. \(\omega_{\text{low}}\)): features aligned with the condition survive the low-guidance inversion, while misaligned features are suppressed. The zigzag pass is applied only at the first step (\(t=T\)), as the early diffusion stage determines global geometric structure; applying it at all steps leads to excessive texture smoothing.
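The three-pass procedure can be sketched as below, assuming `denoise` and `renoise` are callables taking a guidance scale and a latent (hypothetical signatures; the actual samplers operate jointly on all \(V\) views):

```python
def zmv_first_step(x_T, denoise, renoise, w_high=7.5, w_low=1.0):
    """Zigzag pass applied only at the first denoising step (t = T):
    high-guidance denoise -> low-guidance reverse noising -> high-guidance
    denoise. Guidance scales w_high/w_low are illustrative defaults."""
    x = denoise(w_high, x_T)   # pass 1: high-guidance denoising
    x = renoise(w_low, x)      # pass 2: low-guidance inversion back toward t = T
    x = denoise(w_high, x)     # pass 3: high-guidance denoising again
    return x

# Toy operators to exercise the control flow (not real diffusion steps).
toy_denoise = lambda w, x: 0.5 * x
toy_renoise = lambda w, x: 2.0 * x
out = zmv_first_step(4.0, toy_denoise, toy_renoise)
```

Subsequent steps then proceed with standard sampling, which is why the overhead is confined to the start of the trajectory where global geometry is decided.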
Zigzag Advantage Learning (MV-ZigAL)¶
Trajectory pairs are generated for the same prompt using standard sampling and ZMV-Sampling respectively, and a zigzag advantage function is defined as \(\mathcal{A}_{\text{mv}} = \mathcal{R}_{\text{mv}}(\mathbf{x}^z) - \mathcal{R}_{\text{mv}}(\mathbf{x}^s)\). The objective minimizes the squared error between the log-likelihood ratio difference and the advantage value. Compared to MV-RDL, which uses two standard trajectories, this approach exploits structured self-refining advantages to provide stronger learning signals.
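The regression objective can be sketched per trajectory pair as follows; `eta` is an assumed advantage-scaling factor, and the scalar log-probabilities stand in for summed per-step trajectory log-likelihoods:

```python
def mv_zigal_loss(logp_new_z, logp_old_z, logp_new_s, logp_old_s,
                  reward_z, reward_s, eta=1.0):
    """REBEL-style regression loss for MV-ZigAL (sketch): fit the
    difference of log-likelihood ratios between the ZMV-Sampling
    trajectory (z) and the standard trajectory (s) to the zigzag
    advantage A_mv = R_mv(x^z) - R_mv(x^s)."""
    advantage = reward_z - reward_s
    ratio_diff = (logp_new_z - logp_old_z) - (logp_new_s - logp_old_s)
    return (ratio_diff - eta * advantage) ** 2

# When the ratio difference already matches the advantage, loss is ~0.
small = mv_zigal_loss(0.2, 0.1, 0.0, 0.0, reward_z=1.0, reward_s=0.9)
# When the policy has not yet absorbed the advantage, loss is large.
large = mv_zigal_loss(0.0, 0.0, 0.0, 0.0, reward_z=1.0, reward_s=0.0)
```

Because the zigzag trajectory is systematically better than its standard counterpart, the advantage is a structured, non-degenerate signal even when absolute rewards are tightly clustered.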
Multiview Constrained Policy Optimization¶
The optimization is decomposed into a primary objective—maximizing the sum of single-view rewards \(\sum_v R(\mathbf{x}_0^v, \mathbf{c})\)—subject to the constraint that the joint-view reward satisfies \(\mathcal{R}_{\text{mv}} \geq \tau\). A Lagrangian dual approach introduces a multiplier \(\lambda\) to define a unified reward function \(\mathcal{R}_\lambda = \sum_v R(\mathbf{x}_0^v, \mathbf{c}) + \lambda\,(\mathcal{R}_{\text{mv}} - \tau)\), so that the constrained problem is solved by alternating primal updates on the policy and dual updates on \(\lambda\).
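In code, the Lagrangian relaxation reduces to a one-line scalar reward (a sketch; variable names are illustrative):

```python
def unified_reward(single_view_rewards, joint_reward, lam, tau):
    """Lagrangian relaxation of the constrained T2MV objective (sketch):
    maximize sum_v R(x_0^v, c) subject to R_mv >= tau becomes
    R_lambda = sum_v R_v + lam * (R_mv - tau)."""
    return sum(single_view_rewards) + lam * (joint_reward - tau)

# Satisfied constraint (joint_reward > tau) adds a bonus scaled by lam;
# a violation subtracts a penalty of the same scale.
r = unified_reward([0.2, 0.2, 0.2], joint_reward=8.0, lam=0.5, tau=7.5)
```

Unlike a fixed weighted sum, the effective trade-off here is governed by \(\lambda\), which is updated automatically rather than tuned by hand.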
Adaptive Primal-Dual Updates and Self-Paced Curriculum¶
- Adaptive step size: A larger step size \(\alpha^+\) is used when the constraint is violated for rapid tightening, and a smaller step size \(\alpha^-\) is used when it is satisfied for gradual relaxation, preventing oscillation in \(\lambda\).
- Self-paced threshold: \(\tau\) is adaptively adjusted via EMA tracking of the current policy's joint reward, encouraging exploration early in training and progressively tightening the constraint thereafter.
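Both update rules can be sketched as follows; the step sizes, the EMA coefficient, and the projection of \(\lambda\) onto \([0, \infty)\) are assumptions in the spirit of standard projected dual ascent, not the paper's exact hyperparameters:

```python
def update_lambda(lam, joint_reward, tau, alpha_plus=0.1, alpha_minus=0.01):
    """Dual update with asymmetric step sizes: tighten quickly when the
    constraint is violated (alpha_plus), relax slowly when it is
    satisfied (alpha_minus). lam is projected back to [0, inf)."""
    violation = tau - joint_reward
    alpha = alpha_plus if violation > 0 else alpha_minus
    return max(0.0, lam + alpha * violation)

def update_tau(tau, joint_reward, beta=0.9):
    """Self-paced threshold: EMA of the current policy's joint reward,
    so the constraint tightens as the policy improves."""
    return beta * tau + (1.0 - beta) * joint_reward

lam_violated  = update_lambda(0.5, joint_reward=7.0, tau=7.5)  # tighten fast
lam_satisfied = update_lambda(0.0, joint_reward=8.0, tau=7.5)  # clamped at 0
tau_next      = update_tau(7.0, joint_reward=8.0)              # drifts upward
```

The asymmetry is what prevents \(\lambda\) from oscillating: violations are corrected aggressively, but a satisfied constraint only loosens the penalty gradually.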
Key Experimental Results¶
Main Results (Training Prompts, 8-Step 6-View)¶
| Method | HyperScore Overall | PickScore |
|---|---|---|
| Baseline | 7.23 | 0.196 |
| MV-PG | 8.39 | 0.203 |
| MV-DPO | 8.00 | 0.200 |
| MV-RDL | 9.03 | 0.203 |
| MV-ZigAL | 9.17 | 0.205 |
Generalization (MATE-3D Unseen Prompts, Epoch 70)¶
| Method | HyperScore Overall | PickScore | HPSv2 | ImageReward |
|---|---|---|---|---|
| Baseline | 6.67 | 0.204 | 0.252 | -0.846 |
| MV-ZigAL | 6.95 | 0.205 | 0.254 | -0.770 |
| WS-ZigAL (w=0.5) | 6.83 | 0.217 | 0.270 | 0.183 |
| MVC-ZigAL (First-Step) | 7.04 | 0.217 | 0.268 | 0.180 |
Ablation Study¶
- Advantage learning vs. policy gradient: MVC-ZigAL outperforms MVC-ZigPG (which retains zigzag sampling but uses policy gradient) on all metrics, validating the contribution of advantage learning.
- First-step vs. all-step zigzag: First-step zigzag achieves superior HyperScore (7.04 vs. 6.91) without additional inference overhead.
- Constrained optimization vs. weighted sum: WS-ZigAL requires careful tuning—at \(w_{mv}=0.1\), HyperScore drops to 6.25—whereas MVC-ZigAL consistently outperforms all weighted configurations without manual weight selection.
- Adaptive vs. fixed threshold: A fixed threshold of 7.5 is too loose and renders the constraint ineffective, while 9.0 is too tight and suppresses single-view optimization; EMA-based adaptation yields the best results.
- Adaptive vs. fixed step size: A small fixed step size (0.01) responds too slowly to violations, while a large fixed step size (0.1) causes \(\lambda\) oscillation; the adaptive strategy achieves both responsiveness and stability.
Highlights & Insights¶
- This work is the first to systematically extend RL fine-tuning to few-step T2MV diffusion models, introducing a complete multiview-aware MDP framework.
- The combination of zigzag self-refining and advantage learning elegantly addresses the weak learning signal problem in few-step models; the reward gap progressively narrows during training, indicating that the base model has internalized self-refining capabilities.
- The Lagrangian dual method with self-paced curriculum eliminates the need for manual tuning of reward weights and thresholds, offering strong engineering practicality.
- Ablation experiments are systematic and comprehensive, with quantitative validation for each design decision.
Limitations & Future Work¶
- Validation is conducted solely on MV-Adapter + LCM-SDXL; applicability to other multiview architectures (e.g., Zero123++, Era3D) remains unexplored.
- The joint-view reward relies on HyperScore, and the robustness of this evaluator for T2MV generation warrants further investigation.
- The training prompt set contains only 45 animal names, limiting diversity; the MATE-3D evaluation also comprises only 160 prompts.
- ZMV-Sampling approximately triples the per-sample inference cost during training (three zigzag passes), incurring substantial training overhead.
- Integration with downstream tasks such as video generation and 3D reconstruction has not been explored.
Related Work & Insights¶
- T2I RL fine-tuning (DPOK, REBEL, PRDP): These methods are designed for single-image generation and do not model cross-view coordination; the multiview MDP in MVC-ZigAL constitutes the key distinction.
- Zigzag Diffusion: The original method targets single-image full-step sampling; this work adapts it to a first-step multiview schedule and uses it as an advantage reference rather than directly improving inference.
- MV-Adapter / SPAD: Foundational multiview generation architectures; MVC-ZigAL functions as an orthogonal RL fine-tuning layer that can be stacked on top.
- DreamAlign and related text-to-3D RL methods: These optimize 3D objects via SDS rendering loops; the proposed method optimizes directly at the multiview image level, offering greater efficiency.
Rating¶
- Novelty: ⭐⭐⭐⭐ — Multiview RL fine-tuning is a novel and meaningful setting; the combination of zigzag advantage learning and Lagrangian constraints demonstrates strong originality.
- Experimental Thoroughness: ⭐⭐⭐⭐ — Ablations are comprehensive, though the scale of training and evaluation prompts is limited.
- Writing Quality: ⭐⭐⭐⭐ — Mathematical derivations are clear, structure is well-organized, and figures complement the text effectively.
- Value: ⭐⭐⭐⭐ — Provides a practical and complete framework for RL alignment of few-step multiview generation.