Refining Few-Step Text-to-Multiview Diffusion via Reinforcement Learning

Conference: CVPR2026 arXiv: 2505.20107 Code: ZiyiZhang27/MVC-ZigAL Area: Image Generation Keywords: Multi-view generation, diffusion models, reinforcement learning fine-tuning, few-step inference, cross-view consistency

TL;DR

This paper proposes MVC-ZigAL, a framework that jointly improves single-view fidelity and cross-view consistency in few-step text-to-multiview (T2MV) diffusion models through a multiview-aware MDP formulation, zigzag self-refining advantage learning, and Lagrangian-dual constrained optimization.

Background & Motivation

  1. Growing demand for text-to-multiview generation: T2MV diffusion models must jointly generate images of the same scene from multiple viewpoints given a single text prompt, offering significant value in 3D content creation and related applications.
  2. Few-step models trade quality for speed: Few-step backbones such as LCM reduce inference steps to fewer than 8, but at the cost of substantially degraded image fidelity and cross-view consistency.
  3. Existing RL methods do not transfer directly: Prior RL fine-tuning approaches (DPOK, REBEL, etc.) are designed for single-image generation and overlook coordinated optimization across multiple views.
  4. Weak learning signals in few-step models: Samples generated by few-step models are generally of lower quality and exhibit tightly clustered reward values, resulting in insufficient gradient signals for standard RL methods.
  5. Limitations of single-view and joint-view rewards: Single-view rewards (e.g., PickScore) provide fine-grained feedback but neglect cross-view consistency, while joint-view rewards (e.g., HyperScore) assess overall quality but lack per-view supervision.
  6. Sensitivity of weighted-sum reward balancing: Naively combining the two reward types via weighted summation is highly sensitive to weight selection, making it difficult to stably balance the two optimization objectives.

Method

Overall Architecture

MVC-ZigAL comprises three core components: (1) multiview-aware MDP reformulation; (2) ZMV-Sampling with zigzag advantage learning; and (3) Lagrangian dual constrained optimization.

Multiview-Aware MDP

The T2MV denoising process is reformulated as a multiview MDP: the state \(s_t\) at each step encodes the noisy images and camera embeddings for all \(V\) views, and the action \(a_t\) denotes the joint denoising output across all views. A joint-view reward function \(\mathcal{R}_{\text{mv}}\) is introduced to evaluate the overall quality of the generated multiviews (based on the HyperScore overall dimension). Three baselines—MV-PG, MV-DPO, and MV-RDL—are adapted under this MDP formulation.
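The joint state/action structure of this MDP can be made concrete with a minimal sketch. This is an illustrative toy, not the paper's code: the class and field names are assumptions, and the latents are stand-in lists rather than real tensors.

```python
# Minimal sketch of the multiview MDP: the state bundles noisy latents and
# camera embeddings for all V views at step t; the action is the joint
# denoising output across views. All names here are illustrative.
from dataclasses import dataclass
from typing import List

@dataclass
class MVState:
    t: int                      # current denoising timestep
    prompt: str                 # shared text condition c
    noisy_views: List[list]     # noisy latents x_t^v, one per view
    camera_embeds: List[list]   # camera embedding per view

@dataclass
class MVAction:
    denoised_views: List[list]  # joint denoising output for all V views

def step(state: MVState, action: MVAction) -> MVState:
    """Transition: all V views advance one denoising step together."""
    return MVState(
        t=state.t - 1,
        prompt=state.prompt,
        noisy_views=action.denoised_views,
        camera_embeds=state.camera_embeds,
    )

# 8-step, 6-view example: one joint step advances every view in lockstep.
s = MVState(t=8, prompt="a cat", noisy_views=[[0.0]] * 6, camera_embeds=[[0.0]] * 6)
s_next = step(s, MVAction(denoised_views=[[0.1]] * 6))
print(s_next.t)  # 7
```

The key point the sketch captures is that the action couples all views: the joint-view reward \(\mathcal{R}_{\text{mv}}\) is assigned to the whole multiview set, not to any single view.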

ZMV-Sampling with Self-Refining

At the first denoising step, a three-pass zigzag procedure is applied: high-guidance denoising → low-guidance reverse noising → high-guidance denoising again. The core idea is to instantiate a self-refining mechanism via the guidance scale contrast (\(\omega_{\text{high}}\) vs. \(\omega_{\text{low}}\)): features aligned with the condition survive the low-guidance inversion, while misaligned features are suppressed. The zigzag pass is applied only at the first step (\(t=T\)), as the early diffusion stage determines global geometric structure; applying it at all steps leads to excessive texture smoothing.
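The three-pass schedule can be sketched as follows. The `denoise`/`renoise` functions below are toy scalar stand-ins for the actual diffusion model calls, and the guidance values are illustrative, so this only shows the control flow of the zigzag pass, not real sampling.

```python
# Toy sketch of the first-step zigzag pass:
# high-guidance denoise -> low-guidance reverse noising -> high-guidance denoise.
def denoise(x, g):
    # toy update: pull the sample toward the condition, proportional to guidance
    return x + g * (1.0 - x)

def renoise(x, g):
    # toy reverse step: push the sample back toward noise; with low guidance,
    # condition-aligned features survive while misaligned ones are suppressed
    return x - g * (1.0 - x)

def zmv_first_step(x_T, w_high=7.5, w_low=1.0, scale=0.01):
    """Three-pass zigzag, applied only at t = T (the first denoising step)."""
    x = denoise(x_T, w_high * scale)   # pass 1: high-guidance denoising
    x = renoise(x, w_low * scale)      # pass 2: low-guidance reverse noising
    return denoise(x, w_high * scale)  # pass 3: high-guidance denoising again
```

Restricting the zigzag to \(t=T\) matches the paper's observation that the earliest step fixes global geometry, while later steps mostly refine texture.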

Zigzag Advantage Learning (MV-ZigAL)

Trajectory pairs are generated for the same prompt using standard sampling and ZMV-Sampling respectively, and a zigzag advantage function is defined as \(\mathcal{A}_{\text{mv}} = \mathcal{R}_{\text{mv}}(\mathbf{x}^z) - \mathcal{R}_{\text{mv}}(\mathbf{x}^s)\). The objective minimizes the squared error between the log-likelihood ratio difference and the advantage value. Compared to MV-RDL, which uses two standard trajectories, this approach exploits structured self-refining advantages to provide stronger learning signals.
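The regression objective can be sketched as below, in the spirit of REBEL-style advantage regression. The function signature and the scale factor `eta` are illustrative assumptions; the paper's exact parameterization may differ.

```python
# Sketch of the zigzag-advantage regression loss: regress the policy's
# log-likelihood-ratio difference between the ZMV-Sampling trajectory (z)
# and the standard trajectory (s) onto their reward gap A_mv.
def zigzag_advantage_loss(logp_new_z, logp_old_z, logp_new_s, logp_old_s,
                          reward_z, reward_s, eta=1.0):
    advantage = reward_z - reward_s                               # A_mv
    ratio_diff = (logp_new_z - logp_old_z) - (logp_new_s - logp_old_s)
    return (eta * ratio_diff - advantage) ** 2                    # squared error
```

The loss is zero exactly when the policy's likelihood-ratio difference matches the observed advantage, so the zigzag trajectory acts as a structured, higher-reward reference rather than a second random sample.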

Multiview Constrained Policy Optimization

The method is formulated as a constrained optimization problem: maximize the primary objective, the sum of single-view rewards \(\sum_v R(\mathbf{x}_0^v, \mathbf{c})\), subject to the constraint that the joint-view reward satisfies \(\mathcal{R}_{\text{mv}} \geq \tau\). A Lagrangian dual approach introduces a multiplier \(\lambda\) to define a unified reward function:

\[\mathcal{R}_{\text{mvc}} = \frac{R(\mathbf{x}_0^v, \mathbf{c}) + \lambda \cdot \mathcal{R}_{\text{mv}}}{1 + \lambda}\]

Adaptive Primal-Dual Updates and Self-Paced Curriculum

  • Adaptive step size: A larger step size \(\alpha^+\) is used when the constraint is violated for rapid tightening, and a smaller step size \(\alpha^-\) is used when it is satisfied for gradual relaxation, preventing oscillation in \(\lambda\).
  • Self-paced threshold: \(\tau\) is adaptively adjusted via EMA tracking of the current policy's joint reward, encouraging exploration early in training and progressively tightening the constraint thereafter.
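The unified reward and the two adaptive updates above can be sketched together. The step sizes and EMA decay below are illustrative hyperparameters, not the paper's reported values.

```python
# Sketch of the Lagrangian primal-dual loop pieces: the unified reward,
# the asymmetric dual update for lambda, and the self-paced EMA threshold.
def unified_reward(r_single, r_joint, lam):
    """R_mvc = (R + lambda * R_mv) / (1 + lambda)."""
    return (r_single + lam * r_joint) / (1.0 + lam)

def update_dual(lam, r_joint, tau, alpha_plus=0.1, alpha_minus=0.01):
    """Raise lambda quickly when the joint-reward constraint is violated,
    relax it slowly when satisfied; lambda is clamped at zero."""
    violation = tau - r_joint                        # > 0 means violated
    alpha = alpha_plus if violation > 0 else alpha_minus
    return max(0.0, lam + alpha * violation)

def update_threshold(tau, r_joint_current, decay=0.99):
    """Self-paced tau: EMA of the current policy's joint-view reward,
    so the constraint tightens as the policy improves."""
    return decay * tau + (1.0 - decay) * r_joint_current
```

As \(\lambda \to 0\) the unified reward reduces to the single-view reward, and as \(\lambda\) grows it shifts weight toward the joint-view consistency term, which is what removes the need for a hand-tuned fixed weight.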

Key Experimental Results

Main Results (Training Prompts, 8-Step 6-View)

Method     HyperScore Overall   PickScore
Baseline   7.23                 0.196
MV-PG      8.39                 0.203
MV-DPO     8.00                 0.200
MV-RDL     9.03                 0.203
MV-ZigAL   9.17                 0.205

Generalization (MATE-3D Unseen Prompts, Epoch 70)

Method                   HyperScore Overall   PickScore   HPSv2   ImageReward
Baseline                 6.67                 0.204       0.252   -0.846
MV-ZigAL                 6.95                 0.205       0.254   -0.770
WS-ZigAL (w=0.5)         6.83                 0.217       0.270    0.183
MVC-ZigAL (First-Step)   7.04                 0.217       0.268    0.180

Ablation Study

  • Advantage learning vs. policy gradient: MVC-ZigAL outperforms MVC-ZigPG (which retains zigzag sampling but uses policy gradient) on all metrics, validating the contribution of advantage learning.
  • First-step vs. all-step zigzag: First-step zigzag achieves superior HyperScore (7.04 vs. 6.91) without additional inference overhead.
  • Constrained optimization vs. weighted sum: WS-ZigAL requires careful tuning—at \(w_{mv}=0.1\), HyperScore drops to 6.25—whereas MVC-ZigAL consistently outperforms all weighted configurations without manual weight selection.
  • Adaptive vs. fixed threshold: A fixed threshold of 7.5 is too loose and renders the constraint ineffective, while 9.0 is too tight and suppresses single-view optimization; EMA-based adaptation yields the best results.
  • Adaptive vs. fixed step size: A small fixed step size (0.01) responds too slowly to violations, while a large fixed step size (0.1) causes \(\lambda\) oscillation; the adaptive strategy achieves both responsiveness and stability.

Highlights & Insights

  • This work is the first to systematically extend RL fine-tuning to few-step T2MV diffusion models, introducing a complete multiview-aware MDP framework.
  • The combination of zigzag self-refining and advantage learning elegantly addresses the weak learning signal problem in few-step models; the reward gap progressively narrows during training, indicating that the base model has internalized self-refining capabilities.
  • The Lagrangian dual method with self-paced curriculum eliminates the need for manual tuning of reward weights and thresholds, offering strong engineering practicality.
  • Ablation experiments are systematic and comprehensive, with quantitative validation for each design decision.

Limitations & Future Work

  • Validation is conducted solely on MV-Adapter + LCM-SDXL; applicability to other multiview architectures (e.g., Zero123++, Era3D) remains unexplored.
  • The joint-view reward relies on HyperScore, and the robustness of this evaluator for T2MV generation warrants further investigation.
  • The training prompt set contains only 45 animal names, limiting diversity; the MATE-3D evaluation also comprises only 160 prompts.
  • ZMV-Sampling approximately triples the per-sample inference cost during training (three zigzag passes), incurring substantial training overhead.
  • Integration with downstream tasks such as video generation and 3D reconstruction has not been explored.
Comparison with Related Work

  • T2I RL fine-tuning (DPOK, REBEL, PRDP): These methods are designed for single-image generation and do not model cross-view coordination; the multiview MDP in MVC-ZigAL constitutes the key distinction.
  • Zigzag Diffusion: The original method targets single-image full-step sampling; this work adapts it to a first-step multiview schedule and uses it as an advantage reference rather than directly improving inference.
  • MV-Adapter / SPAD: Foundational multiview generation architectures; MVC-ZigAL functions as an orthogonal RL fine-tuning layer that can be stacked on top.
  • DreamAlign and related text-to-3D RL methods: These optimize 3D objects via SDS rendering loops; the proposed method optimizes directly at the multiview image level, offering greater efficiency.

Rating

  • Novelty: ⭐⭐⭐⭐ — Multiview RL fine-tuning is a novel and meaningful setting; the combination of zigzag advantage learning and Lagrangian constraints demonstrates strong originality.
  • Experimental Thoroughness: ⭐⭐⭐⭐ — Ablations are comprehensive, though the scale of training and evaluation prompts is limited.
  • Writing Quality: ⭐⭐⭐⭐ — Mathematical derivations are clear, structure is well-organized, and figures complement the text effectively.
  • Value: ⭐⭐⭐⭐ — Provides a practical and complete framework for RL alignment of few-step multiview generation.