
Neighbor GRPO: Contrastive ODE Policy Optimization Aligns Flow Models

Conference: CVPR 2026 | arXiv: 2511.16955 | Code: None | Area: Image Generation | Keywords: GRPO, Flow Matching, Human Preference Alignment, Contrastive Learning, ODE Sampling

TL;DR

This paper reinterprets SDE-based GRPO as distance optimization (contrastive learning) and proposes Neighbor GRPO, which bypasses SDE conversion entirely: it builds neighborhood candidate trajectories by perturbing the ODE initial noise and optimizes a softmax-distance surrogate policy with policy gradients, preserving every advantage of deterministic ODE sampling.

Background & Motivation

GRPO has demonstrated strong performance in aligning image/video generation models with human preferences, but applying it to Flow Matching models introduces a fundamental conflict:

GRPO requires stochastic exploration: Policy gradient methods rely on stochasticity to explore the policy space.

Flow Matching's strength lies in deterministic ODE sampling: Efficient, and compatible with high-order solvers.

Existing methods (Flow-GRPO, DanceGRPO) introduce stochasticity by converting the ODE into an equivalent SDE, but this sacrifices the core advantages of ODE sampling:

  • SDEs are restricted to first-order solvers: high-order solvers such as DPM-Solver++ cannot be leveraged for acceleration.
  • Inefficient credit assignment: the terminal reward must be distributed across the noise injected at every time step.
  • MixGRPO and BranchGRPO partially alleviate these issues but remain constrained by the SDE framework.

Method

Overall Architecture

The core insight is to reinterpret SDE-based GRPO as distance optimization / contrastive learning — ODE samples serve as anchors, SDE samples serve as candidates, and optimization is equivalent to pulling high-reward candidates closer and pushing low-reward candidates further away.

Building on this insight, Neighbor GRPO operates directly within the ODE neighborhood:

  1. Perturb the initial noise to construct a group of candidate trajectories.
  2. Select one trajectory as the anchor.
  3. Apply a distance loss to pull high-reward candidates closer and push low-reward candidates further away.
  4. Define a softmax distance surrogate policy that integrates rigorously into the GRPO framework.
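
At the tensor level, steps 1 and 3 boil down to re-noising one shared base noise and comparing rewards within the group. Below is a minimal PyTorch sketch under assumed toy settings (a made-up latent shape and synthetic rewards standing in for HPSv2.1/PickScore/ImageReward); it is an illustration, not the authors' code.

```python
import torch

# Illustrative settings; the paper uses G = 12 candidates and sigma = 0.3 on FLUX latents.
G, sigma = 12, 0.3
latent_shape = (16, 16, 4)           # hypothetical latent shape, for the sketch only

# Step 1: one shared base noise, perturbed into G neighboring initial conditions.
eps_base = torch.randn(latent_shape)
delta = torch.randn(G, *latent_shape)
eps = (1.0 - sigma ** 2) ** 0.5 * eps_base + sigma * delta   # eps^(i)

# Each eps^(i) would now be integrated by a deterministic ODE solver (e.g. DPM-Solver++)
# into an image and scored by the reward models; synthetic rewards stand in here.
rewards = torch.randn(G)

# Step 3 compares candidates only against their own group (group-relative advantages).
advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-6)
print(eps.shape, advantages.shape)   # torch.Size([12, 16, 16, 4]) torch.Size([12])
```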

Key Designs

  1. ODE Neighborhood Sampling: Given a base initial noise \(\epsilon^*\), construct \(G\) perturbed initial conditions: \(\epsilon^{(i)} = \sqrt{1-\sigma^2}\epsilon^* + \sigma\delta^{(i)}, \quad \delta^{(i)} \sim \mathcal{N}(0, I)\) where \(\sigma \in (0,1)\) controls the perturbation magnitude (optimal \(\sigma=0.3\)). These initial conditions evolve through deterministic ODE integration, producing a bundle of trajectories that form a local solution neighborhood.

  2. Softmax Distance Surrogate Jump Policy: A training-specific surrogate policy is defined to make the policy ratio and gradient tractable: \(\pi_\theta(x_t^{(i)} \mid \{s_t\}) = \frac{\exp(-\|x_t^{(i)} - x_t^{(\theta)}\|_2^2)}{\sum_{k=1}^{G}\exp(-\|x_t^{(k)} - x_t^{(\theta)}\|_2^2)}\)

    • The anchor \(x_t^{(\theta)}\) is randomly selected from the candidates and contributes gradients.
    • Intuition: the sampled trajectory may "jump" to a neighbor at each step, with probability determined by softmax distance.
    • At inference, standard deterministic ODE is used without any surrogate policy.
    • Optimization dynamics: when \(A_i > 0\), the gradient reduces distance (attraction); when \(A_i < 0\), it increases distance (repulsion); a code sketch follows this list.
  3. Three Practical Techniques:

    • Symmetric Anchor Sampling: By the Johnson–Lindenstrauss lemma, neighborhood samples are approximately equidistant, so any candidate can serve as an anchor. Each GRPO iteration therefore requires forward/backward passes for only \(B < G\) anchors, cutting gradient computation by up to \(12\times\) when \(G=12\).
    • Intra-Group Quasi-Norm Advantage Reweighting: The standard \(L_2\) normalization is replaced by an \(L_p\) norm (\(p < 2\)): \(A'_i = A_i / (\sum|A_k|^p)^{1/p}\). This automatically downweights flat advantage signals and prevents reward hacking (optimal \(p=0.8\)).
    • High-Order Solver: DPM++ is used for data collection, while single-step DDIM is used to compute the surrogate policy during policy updates.
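
The softmax distance surrogate policy above can be sketched in a few lines of PyTorch. The snippet below uses flattened toy latents and random advantages and treats the anchor as a plain differentiable tensor; in the actual method the anchor \(x_t^{(\theta)}\) is produced by a single DDIM step of the trainable model, and the clipped ratio of the next section wraps this log-probability. It is a sketch under these assumptions, not the authors' implementation.

```python
import torch

torch.manual_seed(0)
G, D = 12, 64                                # group size and a flattened toy latent size

# Candidate states x_t^(i): fixed rollouts collected with the old policy.
candidates = torch.randn(G, D)

# Anchor x_t^(theta): stands in for the state the trainable policy produces at step t,
# so gradients flow through it.
anchor = candidates[3].clone().requires_grad_(True)

# Softmax distance surrogate policy: probability of "jumping" to each candidate.
sq_dists = ((candidates - anchor) ** 2).sum(dim=-1)      # ||x_t^(i) - x_t^(theta)||^2
log_probs = torch.log_softmax(-sq_dists, dim=0)          # log pi_theta(x_t^(i) | {s_t})

# Plain policy-gradient form of the objective: maximize sum_i A_i * log pi_theta(x^(i)).
advantages = torch.randn(G)
loss = -(advantages * log_probs).sum()
loss.backward()

# A gradient step on the anchor now moves it toward high-advantage candidates
# (attraction) and away from low-advantage ones (repulsion).
print(anchor.grad.shape)                     # torch.Size([64])
```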

Loss & Training

The GRPO objective uses a clipped policy ratio with group-normalized advantages (a code sketch follows the training details below):

\[\mathcal{J}(\theta) = \mathbb{E}_{s,t,i}\left[\min\left(A_i\,\rho_t^{(i)},\; A_i\,\operatorname{clip}\!\left(\rho_t^{(i)},\, 1-\epsilon_{\text{clip}},\, 1+\epsilon_{\text{clip}}\right)\right)\right]\]
  • Base model: FLUX.1-dev
  • Rewards: HPSv2.1 + Pick Score + ImageReward (equal-weight multi-reward training)
  • AdamW, lr=1e-5, 300 iterations, 32× H800 GPUs
  • Approximately 4 hours per training run (about 45 s/iter, vs. roughly 238 s/iter for DanceGRPO/MixGRPO)
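
A minimal sketch of the clipped objective with the quasi-norm advantage reweighting from Key Designs, assuming the policy ratios and raw advantages are already computed; the clip threshold 0.2 is a conventional PPO-style value and not taken from the paper.

```python
import torch

def quasi_norm_reweight(advantages: torch.Tensor, p: float = 0.8) -> torch.Tensor:
    """Intra-group L_p reweighting (p < 2): A'_i = A_i / (sum_k |A_k|^p)^(1/p)."""
    scale = advantages.abs().pow(p).sum().pow(1.0 / p) + 1e-6
    return advantages / scale

def clipped_grpo_objective(ratio, advantages, eps_clip=0.2):
    """min(A * rho, A * clip(rho, 1 - eps, 1 + eps)), averaged over the group.
    eps_clip = 0.2 is an assumed value, not from the paper."""
    clipped = torch.clamp(ratio, 1.0 - eps_clip, 1.0 + eps_clip)
    return torch.minimum(advantages * ratio, advantages * clipped).mean()

# Toy example: policy ratios pi_theta / pi_old and raw group-normalized advantages.
ratio = torch.tensor([1.05, 0.90, 1.30, 0.75])
advantages = torch.tensor([0.2, -0.1, 1.5, -1.6])

reweighted = quasi_norm_reweight(advantages, p=0.8)        # downweights flat signals
loss = -clipped_grpo_objective(ratio, reweighted)          # maximize objective
print(reweighted, loss.item())
```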

Key Experimental Results

Main Results

| Method | Solver | NFE_old | NFE_θ ↓ | s/Iter ↓ | HPSv2.1 ↑ | Pick ↑ | ImgRwd ↑ | CLIP ↑ | Unified ↑ | Aes ↑ |
|---|---|---|---|---|---|---|---|---|---|---|
| FLUX.1-dev | - | - | - | - | 0.310 | 0.227 | 1.131 | 0.389 | 3.211 | 6.108 |
| DanceGRPO | DDIM | 25 | 14 | 237.9 | 0.371 | 0.231 | 1.306 | 0.364 | 3.156 | 6.552 |
| MixGRPO | DDIM | 25 | 14 | 237.7 | 0.366 | 0.235 | 1.604 | 0.382 | 3.257 | 6.623 |
| Ours | DPM++ | 8 | 1.33 | 45.1 | 0.366 | 0.234 | 1.640 | 0.391 | 3.334 | 6.621 |

Under the 8-step DPM++ configuration, training speed improves by roughly 5.3× (45.1 s/iter vs. ~238 s/iter), and the proposed method achieves the strongest out-of-domain performance overall.

Ablation Study

| Parameter | Optimal Value | Note |
|---|---|---|
| Perturbation strength \(\sigma\) | 0.3 | Too small gives insufficient exploration; too large leaves the neighborhood |
| Number of anchors \(B\) | 4 | \(B=2\) is already competitive; \(B=4\) gives the best trade-off |
| Quasi-norm \(p\) | 0.8 | \(p=2\) recovers standard GRPO; \(p=0.8\) achieves the best out-of-domain performance |

Key Findings

  • Neighbor GRPO converges faster: HPSv2.1 exceeds 0.35 within 50 iterations, a level DanceGRPO needs more iterations to reach.
  • Human evaluation: the proposed method achieves 72% and 61% preference rates over DanceGRPO and MixGRPO, respectively.
  • Reward hacking is avoided: no grid artifacts or color inconsistencies are observed.
  • Long-term training stability is superior to MixGRPO.

Highlights & Insights

  1. Deep theoretical insight: Reinterpreting SDE-based GRPO as contrastive learning reveals that its essence is distance optimization, providing a theoretical foundation for a fully ODE-based approach.
  2. Full preservation of ODE advantages: No SDE conversion is required; high-order solvers are supported; credit assignment is more direct.
  3. Symmetric anchor sampling exploits the geometric properties of the Johnson–Lindenstrauss lemma to elegantly reduce computation to \(B/G\) of the original cost.
  4. Quasi-norm reweighting concisely and effectively addresses reward flattening with a single hyperparameter.

Limitations & Future Work

  • Validation is limited to FLUX.1-dev; applicability to other Flow Matching models (e.g., SD3) remains to be confirmed.
  • Multi-reward training currently adopts equal weighting; adaptive weighting strategies are worth exploring.
  • The theoretical guarantees of the surrogate policy rely on the neighborhood being sufficiently tight (small \(\sigma\)); behavior under extreme settings is not fully analyzed.
  • Extension to video generation (currently image-only) is a natural direction.
  • The proposed method shares origins with DanceGRPO and Flow-GRPO but represents a paradigm shift: from SDE dependence to fully ODE-based training.
  • MixGRPO's hybrid sampling is a compromise; Neighbor GRPO is more principled and complete.
  • The contrastive learning perspective may generalize to other deterministic model optimization scenarios that require stochasticity.
  • Quasi-norm reweighting can be extended to other GRPO variants.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ Both theoretical insight and methodological innovation make important contributions; SDE is completely bypassed.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Multi-metric evaluation, thorough ablation, and human studies are conducted, though only one base model is evaluated.
  • Writing Quality: ⭐⭐⭐⭐⭐ Theoretical derivations are clear, figures are intuitive, and the logical flow from insight to method is coherent.
  • Value: ⭐⭐⭐⭐⭐ A 5× training efficiency gain with superior quality represents a significant advance for RLHF-based visual generation.