Neighbor GRPO: Contrastive ODE Policy Optimization Aligns Flow Models
Conference: CVPR 2026 | arXiv: 2511.16955 | Code: None | Area: Image Generation | Keywords: GRPO, Flow Matching, Human Preference Alignment, Contrastive Learning, ODE Sampling
TL;DR
This paper reinterprets SDE-based GRPO as distance optimization (contrastive learning) and proposes Neighbor GRPO, which bypasses SDE conversion entirely: it constructs neighborhood candidate trajectories by perturbing the ODE's initial noise and optimizes a softmax-distance surrogate policy with policy gradients, while preserving all the advantages of deterministic ODE sampling.
Background & Motivation
GRPO has demonstrated strong performance in aligning image/video generation models with human preferences, but applying it to Flow Matching models introduces a fundamental conflict:
GRPO requires stochastic exploration: Policy gradient methods rely on stochasticity to explore the policy space.
Flow Matching's strength lies in deterministic ODE sampling: Efficient, and compatible with high-order solvers.
Existing methods (Flow-GRPO, DanceGRPO) introduce stochasticity by converting ODEs to equivalent SDEs, but sacrifice the core advantages of ODE sampling:
- SDEs are restricted to first-order solvers: high-order solvers such as DPM-Solver++ cannot be leveraged for acceleration.
- Inefficient credit assignment: the terminal reward must be distributed across noise injections at all time steps.
- MixGRPO and BranchGRPO partially alleviate these issues but remain constrained by the SDE framework.
Method
Overall Architecture
The core insight is to reinterpret SDE-based GRPO as distance optimization / contrastive learning — ODE samples serve as anchors, SDE samples serve as candidates, and optimization is equivalent to pulling high-reward candidates closer and pushing low-reward candidates further away.
Building on this insight, Neighbor GRPO operates directly within the ODE neighborhood:
1. Perturb the initial noise to construct a group of candidate trajectories.
2. Select one trajectory as the anchor.
3. Apply a distance loss that pulls high-reward candidates closer and pushes low-reward candidates further away.
4. Define a softmax-distance surrogate policy that integrates rigorously into the GRPO framework.
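The four steps above can be sketched end to end. This is a minimal numpy toy, not the authors' code: `ode_rollout` and `reward` are stand-in stubs (the paper uses a real ODE solver such as DPM++ and learned reward models), and the surrogate-policy loss itself is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

def ode_rollout(eps):
    # Stub for deterministic ODE integration (DPM++ in the paper).
    return np.tanh(eps)

def reward(x):
    # Stub for a learned reward model (HPSv2.1 etc. in the paper).
    return -float(np.sum(x ** 2))

G, sigma, dim = 4, 0.3, 8

# Step 1: perturb a base noise into G candidate initial conditions.
eps_star = rng.standard_normal(dim)
eps = np.sqrt(1 - sigma ** 2) * eps_star + sigma * rng.standard_normal((G, dim))

# Step 2: pick one candidate as the anchor.
anchor = int(rng.integers(G))

# Roll out every candidate deterministically and score it.
xs = np.stack([ode_rollout(e) for e in eps])
rewards = np.array([reward(x) for x in xs])

# Steps 3-4: group-normalized advantages then drive attraction/repulsion
# through the softmax-distance surrogate policy (not shown in this stub).
adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
print(adv.round(3))
```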
Key Designs
- ODE Neighborhood Sampling: Given a base initial noise \(\epsilon^*\), construct \(G\) perturbed initial conditions: \(\epsilon^{(i)} = \sqrt{1-\sigma^2}\epsilon^* + \sigma\delta^{(i)}, \quad \delta^{(i)} \sim \mathcal{N}(0, I)\), where \(\sigma \in (0,1)\) controls the perturbation magnitude (optimal \(\sigma=0.3\)). These initial conditions evolve through deterministic ODE integration, producing a bundle of trajectories that form a local solution neighborhood.
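A quick numpy check of this perturbation rule (a sketch, not the authors' implementation): because \(\sqrt{1-\sigma^2}\epsilon^* + \sigma\delta^{(i)}\) mixes two unit-variance Gaussians, each candidate remains marginally close to \(\mathcal{N}(0, I)\), so the perturbed noises stay valid ODE initial conditions while \(\sigma\) controls their distance from \(\epsilon^*\).

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, G, dim = 0.3, 12, 4096

eps_star = rng.standard_normal(dim)      # base initial noise eps*
deltas = rng.standard_normal((G, dim))   # i.i.d. perturbations delta_i

# eps_i = sqrt(1 - sigma^2) * eps* + sigma * delta_i
eps = np.sqrt(1 - sigma ** 2) * eps_star + sigma * deltas

# Marginal variance stays close to 1 (candidates stay on-distribution),
# while each candidate stays near eps*, defining a tight neighborhood.
print(round(float(eps.var()), 2))
print(round(float(np.linalg.norm(eps - eps_star, axis=1).mean()), 1))
```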
- Softmax Distance Surrogate Jump Policy: A training-specific surrogate policy makes the policy ratio and gradient tractable: \(\pi_\theta(x_t^{(i)} \mid \{s_t\}) = \frac{\exp(-\|x_t^{(i)} - x_t^{(\theta)}\|_2^2)}{\sum_{k=1}^{G}\exp(-\|x_t^{(k)} - x_t^{(\theta)}\|_2^2)}\)
- The anchor \(x_t^{(\theta)}\) is randomly selected from the candidates and contributes gradients.
- Intuition: the sampled trajectory may "jump" to a neighbor at each step, with probability determined by softmax distance.
- At inference, standard deterministic ODE is used without any surrogate policy.
- Optimization dynamics: when \(A_i > 0\), the gradient reduces distance (attraction); when \(A_i < 0\), it increases distance (repulsion).
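The surrogate can be written in a few numpy lines (illustrative sketch, not the paper's code): `surrogate_policy` computes the jump probabilities over the candidate group given an anchor index, using squared \(L_2\) distances as negative logits with the usual max-subtraction for numerical stability.

```python
import numpy as np

def surrogate_policy(xs, anchor_idx):
    """Softmax-distance surrogate: probability of 'jumping' to each
    candidate, peaked at candidates closest to the anchor."""
    d2 = np.sum((xs - xs[anchor_idx]) ** 2, axis=-1)  # squared L2 distances
    logits = -d2
    logits -= logits.max()                            # numerical stability
    p = np.exp(logits)
    return p / p.sum()

rng = np.random.default_rng(0)
xs = rng.standard_normal((5, 16))   # G=5 candidates at one time step
p = surrogate_policy(xs, anchor_idx=2)

# The anchor has zero distance to itself, so it gets the largest probability;
# a positive advantage on candidate i pushes the anchor toward x_i
# (attraction), a negative advantage pushes it away (repulsion).
print(p.round(3))
```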
- Three Practical Techniques:
- Symmetric Anchor Sampling: By the Johnson–Lindenstrauss lemma, neighborhood samples are approximately equidistant, so any candidate can serve as an anchor. Each GRPO iteration therefore needs forward/backward passes for only \(B < G\) anchors, cutting gradient computation by a factor of \(G/B\) (up to \(12\times\) when \(G=12\)).
- Intra-Group Quasi-Norm Advantage Reweighting: The standard \(L_2\) normalization is replaced by an \(L_p\) norm (\(p < 2\)): \(A'_i = A_i / (\sum|A_k|^p)^{1/p}\). This automatically downweights flat advantage signals and prevents reward hacking (optimal \(p=0.8\)).
- High-Order Solver: DPM++ is used for data collection, while single-step DDIM is used to compute the surrogate policy during policy updates.
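The quasi-norm reweighting is easy to illustrate (a toy sketch assuming the \(L_p\) formula above). Since dividing by a 1-homogeneous quantity is scale-invariant, the effect comes from the shape of the advantage vector: with \(p < 2\), a flat group of uniform-magnitude advantages is damped more, relative to plain \(L_2\) normalization, than a spiky group with one clearly informative sample.

```python
import numpy as np

def quasi_norm_reweight(adv, p=0.8, eps=1e-8):
    """A'_i = A_i / (sum_k |A_k|^p)^(1/p), with p < 2 (quasi-norm)."""
    return adv / (np.sum(np.abs(adv) ** p) ** (1.0 / p) + eps)

def shrink_vs_l2(adv, p=0.8):
    # Ratio of the quasi-norm-scaled update to the L2-scaled update;
    # values < 1 mean the quasi-norm applies extra damping.
    lp = np.sum(np.abs(adv) ** p) ** (1.0 / p)
    l2 = np.sqrt(np.sum(adv ** 2))
    return float(l2 / lp)

flat = np.array([1.0, -1.0, 1.0, -1.0])    # uninformative, near-uniform group
spiky = np.array([3.0, -1.0, -1.0, -1.0])  # one clearly better sample

# Flat advantage patterns are shrunk more strongly than spiky ones,
# which damps the noisy updates associated with reward hacking.
print(round(shrink_vs_l2(flat), 3), round(shrink_vs_l2(spiky), 3))
```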
Loss & Training
The GRPO objective uses a clipped policy ratio with group-normalized advantages:
- Base model: FLUX.1-dev
- Rewards: HPSv2.1 + Pick Score + ImageReward (equal-weight multi-reward training)
- AdamW, lr=1e-5, 300 iterations, 32× H800 GPUs
- Approximately 4 hours per training run, at ~45 s/iteration (vs. ~238 s/iteration for DanceGRPO/MixGRPO)
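The clipped objective can be sketched in numpy (a hedged illustration of standard PPO-style clipping with group-normalized advantages; the log-probabilities would come from the softmax-distance surrogate policy, and the 0.2 clip range is an assumption, not a value from the paper):

```python
import numpy as np

def grpo_loss(logp_new, logp_old, adv, clip_eps=0.2):
    """Clipped surrogate objective with a policy ratio and
    group-normalized advantages (negated for minimization)."""
    ratio = np.exp(logp_new - logp_old)
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1 - clip_eps, 1 + clip_eps) * adv
    return -float(np.mean(np.minimum(unclipped, clipped)))

rng = np.random.default_rng(0)
logp_old = rng.normal(size=6) * 0.1          # surrogate log-probs, old policy
logp_new = logp_old + rng.normal(size=6) * 0.05  # after a small update
rewards = rng.normal(size=6)
adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

loss = grpo_loss(logp_new, logp_old, adv)
print(loss)
```

Note that with identical old and new policies the ratio is 1 everywhere, so the loss reduces to minus the mean advantage, which is zero under group normalization.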
Key Experimental Results
Main Results
| Method | Solver | NFE_old | NFE_θ ↓ | s/Iter ↓ | HPSv2.1 ↑ | Pick ↑ | ImgRwd ↑ | CLIP ↑ | Unified ↑ | Aes ↑ |
|---|---|---|---|---|---|---|---|---|---|---|
| FLUX.1-dev | - | - | - | - | 0.310 | 0.227 | 1.131 | 0.389 | 3.211 | 6.108 |
| DanceGRPO | DDIM | 25 | 14 | 237.9 | 0.371 | 0.231 | 1.306 | 0.364 | 3.156 | 6.552 |
| MixGRPO | DDIM | 25 | 14 | 237.7 | 0.366 | 0.235 | 1.604 | 0.382 | 3.257 | 6.623 |
| Ours | DPM++ | 8 | 1.33 | 45.1 | 0.366 | 0.234 | 1.640 | 0.391 | 3.334 | 6.621 |
Under the 8-step DPM++ configuration, training is 5.3× faster (45.1 vs. ~238 s/iter), and the proposed method achieves the best results on all out-of-domain metrics.
Ablation Study
| Parameter | Optimal Value | Note |
|---|---|---|
| Perturbation strength \(\sigma\) | 0.3 | Too small leads to insufficient exploration; too large exits the neighborhood |
| Number of anchors \(B\) | 4 | \(B=2\) is already competitive; \(B=4\) yields the best trade-off |
| Quasi-norm \(p\) | 0.8 | \(p=2\) corresponds to standard GRPO; \(p=0.8\) achieves optimal out-of-domain performance |
Key Findings
- Neighbor GRPO converges faster: HPSv2.1 > 0.35 is reached within 50 iterations, while DanceGRPO needs more iterations to reach the same level.
- Human evaluation: the proposed method achieves 72% and 61% preference rates over DanceGRPO and MixGRPO, respectively.
- Reward hacking is avoided: no grid artifacts or color inconsistencies are observed.
- Long-term training stability is superior to MixGRPO.
Highlights & Insights
- Deep theoretical insight: Reinterpreting SDE-based GRPO as contrastive learning reveals that its essence is distance optimization, providing a theoretical foundation for a fully ODE-based approach.
- Full preservation of ODE advantages: No SDE conversion is required; high-order solvers are supported; credit assignment is more direct.
- Symmetric anchor sampling exploits the geometric properties of the Johnson–Lindenstrauss lemma to elegantly reduce computation to \(B/G\) of the original cost.
- Quasi-norm reweighting concisely and effectively addresses reward flattening with a single hyperparameter.
Limitations & Future Work
- Validation is limited to FLUX.1-dev; applicability to other Flow Matching models (e.g., SD3) remains to be confirmed.
- Multi-reward training currently adopts equal weighting; adaptive weighting strategies are worth exploring.
- The theoretical guarantees of the surrogate policy rely on the neighborhood being sufficiently tight (small \(\sigma\)); behavior under extreme settings is not fully analyzed.
- Extension to video generation (currently image-only) is a natural direction.
Related Work & Insights
- The proposed method shares origins with DanceGRPO and Flow-GRPO but represents a paradigm shift: from SDE dependence to fully ODE-based training.
- MixGRPO's hybrid sampling is a compromise; Neighbor GRPO is more principled and complete.
- The contrastive learning perspective may generalize to other deterministic model optimization scenarios that require stochasticity.
- Quasi-norm reweighting can be extended to other GRPO variants.
Rating
- Novelty: ⭐⭐⭐⭐⭐ Both theoretical insight and methodological innovation make important contributions; SDE is completely bypassed.
- Experimental Thoroughness: ⭐⭐⭐⭐ Multi-metric evaluation, thorough ablation, and human studies are conducted, though only one base model is evaluated.
- Writing Quality: ⭐⭐⭐⭐⭐ Theoretical derivations are clear, figures are intuitive, and the logical flow from insight to method is coherent.
- Value: ⭐⭐⭐⭐⭐ A 5× training efficiency gain with superior quality represents a significant advance for RLHF-based visual generation.