
Neighbor GRPO: Contrastive ODE Policy Optimization Aligns Flow Models

Conference: CVPR 2026 | arXiv: 2511.16955 | Code: None | Area: Image Generation | Keywords: GRPO, Flow Matching, Human Preference Alignment, Contrastive Learning, ODE Sampling

TL;DR

This paper reinterprets SDE-based GRPO as distance optimization (contrastive learning) and proposes Neighbor GRPO, which bypasses SDE conversion entirely: it builds neighborhood candidate trajectories by perturbing the ODE initial noise and optimizes a softmax-distance surrogate policy with policy gradients, preserving every advantage of deterministic ODE sampling.

Background & Motivation

GRPO has demonstrated strong performance in aligning image/video generation models with human preferences, but applying it to Flow Matching models introduces a fundamental conflict:

GRPO requires stochastic exploration: Policy gradient methods rely on stochasticity to explore the policy space.

Flow Matching's strength lies in deterministic ODE sampling: Efficient, and compatible with high-order solvers.

Existing methods (Flow-GRPO, DanceGRPO) introduce stochasticity by converting the ODE into an equivalent SDE, but this sacrifices the core advantages of ODE sampling:

  • SDEs are restricted to first-order solvers: high-order solvers such as DPM-Solver++ cannot be leveraged for acceleration.
  • Inefficient credit assignment: the terminal reward must be distributed across the noise injected at every time step.
  • MixGRPO and BranchGRPO partially alleviate these issues but remain constrained by the SDE framework.

Method

Overall Architecture

The core insight is to reinterpret SDE-based GRPO as distance optimization / contrastive learning — ODE samples serve as anchors, SDE samples serve as candidates, and optimization is equivalent to pulling high-reward candidates closer and pushing low-reward candidates further away.

Building on this insight, Neighbor GRPO operates directly within the ODE neighborhood:

  1. Perturb the initial noise to construct a group of candidate trajectories.
  2. Select one trajectory as the anchor.
  3. Apply a distance loss to pull high-reward candidates closer and push low-reward candidates further away.
  4. Define a softmax distance surrogate policy that integrates rigorously into the GRPO framework.
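
At the tensor level, steps 1 and 3 boil down to re-noising one shared base noise and comparing rewards within the group. Below is a minimal PyTorch sketch under assumed toy settings (a made-up latent shape and synthetic rewards standing in for HPSv2.1/PickScore/ImageReward); it is an illustration, not the authors' code.

```python
import torch

# Illustrative settings; the paper uses G = 12 candidates and sigma = 0.3 on FLUX latents.
G, sigma = 12, 0.3
latent_shape = (16, 16, 4)           # hypothetical latent shape, for the sketch only

# Step 1: one shared base noise, perturbed into G neighboring initial conditions.
eps_base = torch.randn(latent_shape)
delta = torch.randn(G, *latent_shape)
eps = (1.0 - sigma ** 2) ** 0.5 * eps_base + sigma * delta   # eps^(i)

# Each eps^(i) would now be integrated by a deterministic ODE solver (e.g. DPM-Solver++)
# into an image and scored by the reward models; synthetic rewards stand in here.
rewards = torch.randn(G)

# Step 3 compares candidates only against their own group (group-relative advantages).
advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-6)
print(eps.shape, advantages.shape)   # torch.Size([12, 16, 16, 4]) torch.Size([12])
```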

Key Designs

  1. ODE Neighborhood Sampling: Given a base initial noise \(\epsilon^*\), construct \(G\) perturbed initial conditions: \(\epsilon^{(i)} = \sqrt{1-\sigma^2}\epsilon^* + \sigma\delta^{(i)}, \quad \delta^{(i)} \sim \mathcal{N}(0, I)\) where \(\sigma \in (0,1)\) controls the perturbation magnitude (optimal \(\sigma=0.3\)). These initial conditions evolve through deterministic ODE integration, producing a bundle of trajectories that form a local solution neighborhood.

  2. Softmax Distance Surrogate Jump Policy: A training-specific surrogate policy is defined to make the policy ratio and gradient tractable: \(\pi_\theta(x_t^{(i)} \mid \{s_t\}) = \frac{\exp(-\|x_t^{(i)} - x_t^{(\theta)}\|_2^2)}{\sum_{k=1}^{G}\exp(-\|x_t^{(k)} - x_t^{(\theta)}\|_2^2)}\)

    • The anchor \(x_t^{(\theta)}\) is randomly selected from the candidates and contributes gradients.
    • Intuition: the sampled trajectory may "jump" to a neighbor at each step, with probability determined by softmax distance.
    • At inference, standard deterministic ODE is used without any surrogate policy.
    • Optimization dynamics: when \(A_i > 0\), the gradient reduces distance (attraction); when \(A_i < 0\), it increases distance (repulsion); a code sketch follows this list.
  3. Three Practical Techniques:

    • Symmetric Anchor Sampling: By the Johnson–Lindenstrauss lemma, neighborhood samples are approximately equidistant, so any candidate can serve as an anchor. Each GRPO iteration therefore requires forward/backward passes for only \(B < G\) anchors, cutting gradient computation by up to \(12\times\) when \(G=12\).
    • Intra-Group Quasi-Norm Advantage Reweighting: The standard \(L_2\) normalization is replaced by an \(L_p\) norm (\(p < 2\)): \(A'_i = A_i / (\sum|A_k|^p)^{1/p}\). This automatically downweights flat advantage signals and prevents reward hacking (optimal \(p=0.8\)).
    • High-Order Solver: DPM++ is used for data collection, while single-step DDIM is used to compute the surrogate policy during policy updates.
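
The softmax distance surrogate policy above can be sketched in a few lines of PyTorch. The snippet below uses flattened toy latents and random advantages and treats the anchor as a plain differentiable tensor; in the actual method the anchor \(x_t^{(\theta)}\) is produced by a single DDIM step of the trainable model, and the clipped ratio of the next section wraps this log-probability. It is a sketch under these assumptions, not the authors' implementation.

```python
import torch

torch.manual_seed(0)
G, D = 12, 64                                # group size and a flattened toy latent size

# Candidate states x_t^(i): fixed rollouts collected with the old policy.
candidates = torch.randn(G, D)

# Anchor x_t^(theta): stands in for the state the trainable policy produces at step t,
# so gradients flow through it.
anchor = candidates[3].clone().requires_grad_(True)

# Softmax distance surrogate policy: probability of "jumping" to each candidate.
sq_dists = ((candidates - anchor) ** 2).sum(dim=-1)      # ||x_t^(i) - x_t^(theta)||^2
log_probs = torch.log_softmax(-sq_dists, dim=0)          # log pi_theta(x_t^(i) | {s_t})

# Plain policy-gradient form of the objective: maximize sum_i A_i * log pi_theta(x^(i)).
advantages = torch.randn(G)
loss = -(advantages * log_probs).sum()
loss.backward()

# A gradient step on the anchor now moves it toward high-advantage candidates
# (attraction) and away from low-advantage ones (repulsion).
print(anchor.grad.shape)                     # torch.Size([64])
```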

Loss & Training

The GRPO objective uses a clipped policy ratio with group-normalized advantages (a code sketch follows the training details below):

\[\mathcal{J}(\theta) = \mathbb{E}_{s,t,i}\left[\min\left(A_i\,\rho_t^{(i)},\; A_i\,\operatorname{clip}\!\left(\rho_t^{(i)},\, 1-\epsilon_{\text{clip}},\, 1+\epsilon_{\text{clip}}\right)\right)\right]\]
  • Base model: FLUX.1-dev
  • Rewards: HPSv2.1 + Pick Score + ImageReward (equal-weight multi-reward training)
  • AdamW, lr=1e-5, 300 iterations, 32× H800 GPUs
  • Approximately 4 hours per training run (about 45 s/iter, vs. roughly 238 s/iter for DanceGRPO/MixGRPO)
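
A minimal sketch of the clipped objective with the quasi-norm advantage reweighting from Key Designs, assuming the policy ratios and raw advantages are already computed; the clip threshold 0.2 is a conventional PPO-style value and not taken from the paper.

```python
import torch

def quasi_norm_reweight(advantages: torch.Tensor, p: float = 0.8) -> torch.Tensor:
    """Intra-group L_p reweighting (p < 2): A'_i = A_i / (sum_k |A_k|^p)^(1/p)."""
    scale = advantages.abs().pow(p).sum().pow(1.0 / p) + 1e-6
    return advantages / scale

def clipped_grpo_objective(ratio, advantages, eps_clip=0.2):
    """min(A * rho, A * clip(rho, 1 - eps, 1 + eps)), averaged over the group.
    eps_clip = 0.2 is an assumed value, not from the paper."""
    clipped = torch.clamp(ratio, 1.0 - eps_clip, 1.0 + eps_clip)
    return torch.minimum(advantages * ratio, advantages * clipped).mean()

# Toy example: policy ratios pi_theta / pi_old and raw group-normalized advantages.
ratio = torch.tensor([1.05, 0.90, 1.30, 0.75])
advantages = torch.tensor([0.2, -0.1, 1.5, -1.6])

reweighted = quasi_norm_reweight(advantages, p=0.8)        # downweights flat signals
loss = -clipped_grpo_objective(ratio, reweighted)          # maximize objective
print(reweighted, loss.item())
```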

Key Experimental Results

Main Results

| Method | Solver | NFE_old | NFE_θ ↓ | s/Iter ↓ | HPSv2.1 ↑ | Pick ↑ | ImgRwd ↑ | CLIP ↑ | Unified ↑ | Aes ↑ |
|---|---|---|---|---|---|---|---|---|---|---|
| FLUX.1-dev | - | - | - | - | 0.310 | 0.227 | 1.131 | 0.389 | 3.211 | 6.108 |
| DanceGRPO | DDIM | 25 | 14 | 237.9 | 0.371 | 0.231 | 1.306 | 0.364 | 3.156 | 6.552 |
| MixGRPO | DDIM | 25 | 14 | 237.7 | 0.366 | 0.235 | 1.604 | 0.382 | 3.257 | 6.623 |
| Ours | DPM++ | 8 | 1.33 | 45.1 | 0.366 | 0.234 | 1.640 | 0.391 | 3.334 | 6.621 |

Under the 8-step DPM++ configuration, training speed improves by roughly 5.3× (45.1 s/iter vs. ~238 s/iter), and the proposed method achieves the strongest out-of-domain performance overall.

Ablation Study

| Parameter | Optimal Value | Note |
|---|---|---|
| Perturbation strength \(\sigma\) | 0.3 | Too small gives insufficient exploration; too large leaves the neighborhood |
| Number of anchors \(B\) | 4 | \(B=2\) is already competitive; \(B=4\) gives the best trade-off |
| Quasi-norm \(p\) | 0.8 | \(p=2\) recovers standard GRPO; \(p=0.8\) achieves the best out-of-domain performance |

Key Findings

  • Neighbor GRPO converges faster: HPSv2.1 exceeds 0.35 within 50 iterations, a level DanceGRPO needs more iterations to reach.
  • Human evaluation: the proposed method achieves 72% and 61% preference rates over DanceGRPO and MixGRPO, respectively.
  • Reward hacking is avoided: no grid artifacts or color inconsistencies are observed.
  • Long-term training stability is superior to MixGRPO.

Highlights & Insights

  1. Deep theoretical insight: Reinterpreting SDE-based GRPO as contrastive learning reveals that its essence is distance optimization, providing a theoretical foundation for a fully ODE-based approach.
  2. Full preservation of ODE advantages: No SDE conversion is required; high-order solvers are supported; credit assignment is more direct.
  3. Symmetric anchor sampling exploits the geometric properties of the Johnson–Lindenstrauss lemma to elegantly reduce computation to \(B/G\) of the original cost.
  4. Quasi-norm reweighting concisely and effectively addresses reward flattening with a single hyperparameter.

Limitations & Future Work

  • Validation is limited to FLUX.1-dev; applicability to other Flow Matching models (e.g., SD3) remains to be confirmed.
  • Multi-reward training currently adopts equal weighting; adaptive weighting strategies are worth exploring.
  • The theoretical guarantees of the surrogate policy rely on the neighborhood being sufficiently tight (small \(\sigma\)); behavior under extreme settings is not fully analyzed.
  • Extension to video generation (currently image-only) is a natural direction.
  • The proposed method shares origins with DanceGRPO and Flow-GRPO but represents a paradigm shift: from SDE dependence to fully ODE-based training.
  • MixGRPO's hybrid sampling is a compromise; Neighbor GRPO is more principled and complete.
  • The contrastive learning perspective may generalize to other deterministic model optimization scenarios that require stochasticity.
  • Quasi-norm reweighting can be extended to other GRPO variants.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ Both theoretical insight and methodological innovation make important contributions; SDE is completely bypassed.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Multi-metric evaluation, thorough ablation, and human studies are conducted, though only one base model is evaluated.
  • Writing Quality: ⭐⭐⭐⭐⭐ Theoretical derivations are clear, figures are intuitive, and the logical flow from insight to method is coherent.
  • Value: ⭐⭐⭐⭐⭐ A 5× training efficiency gain with superior quality represents a significant advance for RLHF-based visual generation.