Refining Few-Step Text-to-Multiview Diffusion via Reinforcement Learning¶

Conference: CVPR 2026
arXiv: 2505.20107
Code: ZiyiZhang27/MVC-ZigAL
Area: Image Generation
Keywords: multi-view generation, diffusion models, reinforcement learning fine-tuning, few-step inference, cross-view consistency

TL;DR¶

This paper proposes MVC-ZigAL, a framework that improves single-view fidelity and cross-view consistency in few-step text-to-multiview diffusion models through multiview-aware MDP formulation, zigzag self-refining advantage learning, and Lagrangian dual constrained optimization.

Background & Motivation¶

Growing demand for text-to-multiview generation: T2MV diffusion models must jointly generate images of the same scene from multiple viewpoints given a single text prompt, offering significant value in 3D content creation and related applications.
Few-step models trade quality for speed: Few-step backbones such as LCM reduce inference to 8 or fewer steps, but at the cost of substantially degraded image fidelity and cross-view consistency.
Existing RL methods do not transfer directly: Prior RL fine-tuning approaches (DPOK, REBEL, etc.) are designed for single-image generation and overlook coordinated optimization across multiple views.
Weak learning signals in few-step models: Samples generated by few-step models are generally of lower quality and exhibit tightly clustered reward values, resulting in insufficient gradient signals for standard RL methods.
Limitations of single-view and joint-view rewards: Single-view rewards (e.g., PickScore) provide fine-grained feedback but neglect cross-view consistency, while joint-view rewards (e.g., HyperScore) assess overall quality but lack per-view supervision.
Sensitivity of weighted-sum reward balancing: Naively combining the two reward types via weighted summation is highly sensitive to weight selection, making it difficult to stably balance the two optimization objectives.

Method¶

Overall Architecture¶

MVC-ZigAL comprises three core components: (1) multiview-aware MDP reformulation; (2) ZMV-Sampling with zigzag advantage learning; and (3) Lagrangian dual constrained optimization.

Multiview-Aware MDP¶

The T2MV denoising process is reformulated as a multiview MDP: the state \(s_t\) at each step encodes the noisy images and camera embeddings for all \(V\) views, and the action \(a_t\) denotes the joint denoising output across all views. A joint-view reward function \(\mathcal{R}_{\text{mv}}\) is introduced to evaluate the overall quality of the generated multiviews (based on the HyperScore overall dimension). Three baselines—MV-PG, MV-DPO, and MV-RDL—are adapted under this MDP formulation.
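The state and action described above can be pictured with a small data structure. This is only an illustrative sketch: the names `MultiviewState` and `joint_step`, and the use of plain Python lists for latents and camera embeddings, are assumptions made here and are not taken from the paper or its code.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MultiviewState:
    """State s_t: what the policy sees at denoising step t (hypothetical layout)."""
    prompt: str
    t: int                       # current denoising step index
    latents: List[List[float]]   # one noisy latent per view, V entries
    cameras: List[List[float]]   # one camera embedding per view, V entries

    @property
    def num_views(self) -> int:
        return len(self.latents)


def joint_step(state: MultiviewState, denoised: List[List[float]]) -> MultiviewState:
    """Apply action a_t (the jointly denoised latents of all V views) and move to step t-1."""
    if len(denoised) != state.num_views:
        raise ValueError("the action must cover every view jointly")
    # Prompt and cameras are fixed for the whole trajectory; only the latents change.
    return MultiviewState(state.prompt, state.t - 1, denoised, state.cameras)
```

The point of the sketch is that one transition updates all views at once, so a single joint-view reward can be attached to the whole trajectory.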

ZMV-Sampling with Self-Refining¶

At the first denoising step, a three-pass zigzag procedure is applied: high-guidance denoising → low-guidance reverse noising → high-guidance denoising again. The core idea is to instantiate a self-refining mechanism via the guidance scale contrast (\(\omega_{\text{high}}\) vs. \(\omega_{\text{low}}\)): features aligned with the condition survive the low-guidance inversion, while misaligned features are suppressed. The zigzag pass is applied only at the first step (\(t=T\)), as the early diffusion stage determines global geometric structure; applying it at all steps leads to excessive texture smoothing.
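The control flow of the first-step zigzag can be sketched as follows. The `denoise` and `invert` callables are hypothetical stand-ins for one high-guidance denoising step and one low-guidance inversion step of the multiview model; their signatures, and the assumption that inversion returns the latent to the noise level of the first step, are illustrative choices rather than the authors' implementation.

```python
from typing import Callable, List

Latent = List[float]
# (latent, timestep, guidance scale) -> latent; a hypothetical one-step interface
Step = Callable[[Latent, int, float], Latent]


def zmv_sample(x_T: Latent, timesteps: List[int], denoise: Step, invert: Step,
               w_high: float, w_low: float) -> Latent:
    """Few-step sampling with a three-pass zigzag applied at the first step only."""
    if not timesteps:
        raise ValueError("at least one timestep is required")
    first, rest = timesteps[0], timesteps[1:]
    x = denoise(x_T, first, w_high)  # pass 1: high-guidance denoising
    x = invert(x, first, w_low)      # pass 2: low-guidance inversion (re-noising)
    x = denoise(x, first, w_high)    # pass 3: high-guidance denoising again
    for t in rest:                   # remaining steps use standard sampling
        x = denoise(x, t, w_high)
    return x
```

Under this reading, an N-step schedule costs N + 1 denoising calls plus one inversion call, and standard sampling is recovered by skipping passes 2 and 3.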

Zigzag Advantage Learning (MV-ZigAL)¶

Trajectory pairs are generated for the same prompt using standard sampling and ZMV-Sampling respectively, and a zigzag advantage function is defined as \(\mathcal{A}_{\text{mv}} = \mathcal{R}_{\text{mv}}(\mathbf{x}^z) - \mathcal{R}_{\text{mv}}(\mathbf{x}^s)\). The objective minimizes the squared error between the log-likelihood ratio difference and the advantage value. Compared to MV-RDL, which uses two standard trajectories, this approach exploits structured self-refining advantages to provide stronger learning signals.
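A per-pair version of this objective might look like the sketch below. It assumes that the trajectory log-likelihoods under the current and reference policies are already available as scalars, and it introduces a scaling factor `eta` in the style of regression-based RL objectives; `eta`, the argument names, and the function name are assumptions for illustration, not the paper's notation.

```python
def zigzag_advantage_loss(logp_z: float, logp_z_ref: float,
                          logp_s: float, logp_s_ref: float,
                          reward_z: float, reward_s: float,
                          eta: float = 1.0) -> float:
    """Squared error between the scaled log-likelihood-ratio difference and A_mv."""
    # Zigzag advantage: joint-view reward of the zigzag sample minus the standard sample.
    advantage = reward_z - reward_s
    # Log-likelihood ratio (current vs. reference policy) of each trajectory.
    ratio_z = logp_z - logp_z_ref
    ratio_s = logp_s - logp_s_ref
    # Regress the scaled ratio difference onto the advantage.
    return ((ratio_z - ratio_s) / eta - advantage) ** 2
```

The loss is zero when the policy's relative preference for the zigzag trajectory matches the reward gap exactly, and it grows quadratically as the two diverge.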

Multiview Constrained Policy Optimization¶

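The optimization is decomposed into a primary objective, maximizing the sum of single-view rewards \(\sum_v R(\mathbf{x}_0^v, \mathbf{c})\), subject to the constraint that the joint-view reward satisfies \(\mathcal{R}_{\text{mv}} \geq \tau\). A Lagrangian dual approach introduces a multiplier \(\lambda \geq 0\) and folds both terms into a unified reward function, so that \(\lambda = 0\) recovers the single-view objective and larger \(\lambda\) shifts weight toward the joint-view reward: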

\[\mathcal{R}_{\text{mvc}} = \frac{\sum_{v=1}^{V} R(\mathbf{x}_0^v, \mathbf{c}) + \lambda \cdot \mathcal{R}_{\text{mv}}}{1 + \lambda}\]
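As a small numeric illustration of the normalization by \(1 + \lambda\), the sketch below computes the unified reward from two scalars. The function name is hypothetical, and the single-view term is passed in as one already-aggregated number, so the sketch takes no position on how per-view scores are combined.

```python
def unified_reward(single_view_term: float, joint_reward: float, lam: float) -> float:
    """Blend the single-view term and the joint-view reward with multiplier lam >= 0."""
    if lam < 0:
        raise ValueError("the Lagrange multiplier must be non-negative")
    # Dividing by (1 + lam) keeps the result a convex combination of the two inputs.
    return (single_view_term + lam * joint_reward) / (1.0 + lam)
```

With `lam = 0` the function returns the single-view term unchanged, with `lam = 1` it returns the plain average of the two, and as `lam` grows the result approaches the joint-view reward.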

Adaptive Primal-Dual Updates and Self-Paced Curriculum¶

Adaptive step size: A larger step size \(\alpha^+\) is used when the constraint is violated for rapid tightening, and a smaller step size \(\alpha^-\) is used when it is satisfied for gradual relaxation, preventing oscillation in \(\lambda\).
Self-paced threshold: \(\tau\) is adaptively adjusted via EMA tracking of the current policy's joint reward, encouraging exploration early in training and progressively tightening the constraint thereafter.
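The two update rules above can be sketched as a pair of scalar functions. The function names, the default step sizes, the EMA decay, and the clipping of \(\lambda\) at zero are illustrative assumptions; the summary does not state the exact values or the exact form of the update used by the authors.

```python
def update_lambda(lam: float, joint_reward: float, tau: float,
                  alpha_plus: float = 0.1, alpha_minus: float = 0.01) -> float:
    """One dual step: raise lam quickly on violation, lower it slowly otherwise."""
    violation = tau - joint_reward  # positive when the constraint R_mv >= tau is violated
    step = alpha_plus if violation > 0 else alpha_minus
    # Keep the multiplier non-negative, as required for an inequality constraint.
    return max(0.0, lam + step * violation)


def update_tau(tau: float, batch_joint_reward: float, decay: float = 0.9) -> float:
    """Self-paced threshold: exponential moving average of the policy's joint reward."""
    return decay * tau + (1.0 - decay) * batch_joint_reward
```

Because the threshold follows the policy's own joint reward, it starts low while the policy is weak and rises as training improves it, which matches the loose-then-tight behavior described above.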

Key Experimental Results¶

Main Results (Training Prompts, 8-Step 6-View)¶

Method	HyperScore Overall	PickScore
Baseline	7.23	0.196
MV-PG	8.39	0.203
MV-DPO	8.00	0.200
MV-RDL	9.03	0.203
MV-ZigAL	9.17	0.205

Generalization (MATE-3D Unseen Prompts, Epoch 70)¶

Method	HyperScore Overall	PickScore	HPSv2	ImageReward
Baseline	6.67	0.204	0.252	-0.846
MV-ZigAL	6.95	0.205	0.254	-0.770
WS-ZigAL (w=0.5)	6.83	0.217	0.270	0.183
MVC-ZigAL (First-Step)	7.04	0.217	0.268	0.180

Ablation Study¶

Advantage learning vs. policy gradient: MVC-ZigAL outperforms MVC-ZigPG (which retains zigzag sampling but uses policy gradient) on all metrics, validating the contribution of advantage learning.
First-step vs. all-step zigzag: First-step zigzag achieves superior HyperScore (7.04 vs. 6.91) without additional inference overhead.
Constrained optimization vs. weighted sum: WS-ZigAL requires careful tuning—at \(w_{mv}=0.1\), HyperScore drops to 6.25—whereas MVC-ZigAL consistently outperforms all weighted configurations without manual weight selection.
Adaptive vs. fixed threshold: A fixed threshold of 7.5 is too loose and renders the constraint ineffective, while 9.0 is too tight and suppresses single-view optimization; EMA-based adaptation yields the best results.
Adaptive vs. fixed step size: A small fixed step size (0.01) responds too slowly to violations, while a large fixed step size (0.1) causes \(\lambda\) oscillation; the adaptive strategy achieves both responsiveness and stability.

Highlights & Insights¶

This work is the first to systematically extend RL fine-tuning to few-step T2MV diffusion models, introducing a complete multiview-aware MDP framework.
Combining zigzag self-refining with advantage learning addresses the weak learning signal problem in few-step models; the reward gap between zigzag and standard samples progressively narrows during training, suggesting that the base model has internalized the self-refining behavior.
The Lagrangian dual method with self-paced curriculum eliminates the need for manual tuning of reward weights and thresholds, offering strong engineering practicality.
Ablation experiments are systematic and comprehensive, with quantitative validation for each design decision.

Limitations & Future Work¶

Validation is conducted solely on MV-Adapter + LCM-SDXL; applicability to other multiview architectures (e.g., Zero123++, Era3D) remains unexplored.
The joint-view reward relies on HyperScore, and the robustness of this evaluator for T2MV generation warrants further investigation.
The training prompt set contains only 45 animal names, limiting diversity; the MATE-3D evaluation also comprises only 160 prompts.
ZMV-Sampling approximately triples the per-sample inference cost during training (three zigzag passes), incurring substantial training overhead.
Integration with downstream tasks such as video generation and 3D reconstruction has not been explored.

Related Work¶

T2I RL fine-tuning (DPOK, REBEL, PRDP): Designed for single-image generation and do not model cross-view coordination; the multiview MDP in MVC-ZigAL constitutes the key distinction.
Zigzag Diffusion: The original method targets single-image full-step sampling; this work adapts it to a first-step multiview schedule and uses it as an advantage reference rather than directly improving inference.
MV-Adapter / SPAD: Foundational multiview generation architectures; MVC-ZigAL functions as an orthogonal RL fine-tuning layer that can be stacked on top.
DreamAlign and related T2-3D RL methods: Optimize 3D objects via SDS rendering loops; the proposed method optimizes directly at the multiview image level, offering greater efficiency.

Rating¶

Novelty: ⭐⭐⭐⭐ — Multiview RL fine-tuning is a novel and meaningful setting; the combination of zigzag advantage learning and Lagrangian constraints demonstrates strong originality.
Experimental Thoroughness: ⭐⭐⭐⭐ — Ablations are comprehensive, though the scale of training and evaluation prompts is limited.
Writing Quality: ⭐⭐⭐⭐ — Mathematical derivations are clear, structure is well-organized, and figures complement the text effectively.
Value: ⭐⭐⭐⭐ — Provides a practical and complete framework for RL alignment of few-step multiview generation.