Neighbor GRPO: Contrastive ODE Policy Optimization Aligns Flow Models
Conference: CVPR 2026
arXiv: 2511.16955
Code: None
Area: Image Generation
Keywords: GRPO, Flow Matching, Human Preference Alignment, Contrastive Learning, ODE Sampling
TL;DR
This paper reinterprets SDE-based GRPO as distance optimization/contrastive learning and proposes Neighbor GRPO. It completely bypasses SDE conversion by constructing neighborhood candidate trajectories through perturbed ODE initial noise and implements policy gradient optimization via a softmax distance proxy policy, thereby preserving all advantages of deterministic ODE sampling.
Background & Motivation
GRPO excels in aligning image/video generation models with human preferences, but its application to Flow Matching models faces a fundamental conflict:
- GRPO requires stochastic exploration: policy gradient methods rely on randomness to explore the policy space.
- The advantages of Flow Matching lie in deterministic ODE sampling: it is efficient and supports high-order solvers.

Existing methods (Flow-GRPO, DanceGRPO) introduce randomness by converting ODEs into equivalent SDEs but sacrifice the core benefits of ODEs:
- SDEs are limited to first-order solvers: they cannot use high-order solvers such as DPM-Solver++ for acceleration.
- Credit assignment is inefficient: terminal rewards must be distributed across the noise injections at all time steps.
- MixGRPO and BranchGRPO partially alleviate these issues but remain constrained by the SDE framework.
Method
Overall Architecture
This paper targets the fundamental conflict between GRPO and Flow Matching: GRPO relies on stochasticity to explore the policy space, whereas the value of Flow Matching lies in deterministic ODE sampling (efficiency, compatibility with high-order solvers). Prior approaches (Flow-GRPO, DanceGRPO) force stochasticity by converting ODEs into equivalent SDEs, which locks them into first-order solvers and inefficient credit assignment. The key move in Neighbor GRPO is a reinterpretation: SDE-based GRPO can be viewed as distance optimization, i.e., contrastive learning, in which ODE samples act as anchors and SDE samples as candidates, and optimization pulls high-reward candidates toward the anchor while pushing low-reward ones away. The authors therefore bypass SDEs entirely and operate directly within the ODE neighborhood: they perturb the initial noise to generate a group of candidate trajectories, select one as the anchor, and use a softmax distance proxy policy to express the pull/push mechanism within the GRPO framework. At inference, standard deterministic ODE sampling is restored.
%%{init: {'flowchart': {'rankSpacing': 24, 'nodeSpacing': 28, 'padding': 6, 'wrappingWidth': 400}}}%%
flowchart TD
A["Base Noise ε*"] --> B["ODE Neighborhood Sampling<br/>Perturbation σ=0.3 creates G initial conditions → Deterministic ODE evolution"]
B --> C["Neighborhood Trajectory Bundle<br/>G mutually adjacent candidate trajectories"]
C --> D["Softmax Distance Proxy Jump Policy<br/>Randomly select anchor, define policy ratio ρ_t by L2 distance"]
D --> E["GRPO Optimization<br/>Clipped policy ratio × Group-normalized advantage"]
E -->|"A_i>0 Pull / A_i<0 Push"| F["Update Flow Model θ"]
G["Three Practical Techniques<br/>Symmetric Anchor Sampling · Quasi-norm Reweighting · High-order Solver Decoupling"] -. Acceleration & Stability .-> E
F --> H["Inference: Standard Deterministic ODE (Discard Proxy Policy)"]
Key Designs
1. ODE Neighborhood Sampling: Generating comparable candidates without SDEs
GRPO requires a group of diverse samples whose rewards can be compared, but a purely deterministic ODE starting from fixed noise yields only one trajectory. Neighbor GRPO instead operates on the initial noise: given base noise \(\epsilon^*\), it constructs \(G\) perturbed initial conditions \(\epsilon^{(i)} = \sqrt{1-\sigma^2}\epsilon^* + \sigma\delta^{(i)},\ \delta^{(i)} \sim \mathcal{N}(0, I)\), where \(\sigma \in (0,1)\) controls the perturbation intensity (optimal \(\sigma=0.3\); a smaller value gives insufficient exploration, a larger one pushes candidates out of the neighborhood). The \(\sqrt{1-\sigma^2}\) factor keeps each \(\epsilon^{(i)}\) at unit variance when \(\epsilon^* \sim \mathcal{N}(0, I)\). These initial conditions evolve via deterministic ODEs to produce a bundle of mutually adjacent trajectories that form a local solution neighborhood: stochasticity is moved to the starting point, while the evolution itself remains a clean ODE.
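The perturbation formula above can be illustrated with a minimal NumPy sketch. The function name `sample_neighbor_noises`, the flattened latent size of 4096, and the seed are assumptions made for illustration only; the ODE solve that would follow each initial condition is not shown.

```python
import numpy as np

def sample_neighbor_noises(base_noise, group_size, sigma, rng):
    """Return `group_size` perturbed copies of `base_noise`.

    Implements eps_i = sqrt(1 - sigma^2) * eps_star + sigma * delta_i,
    with delta_i ~ N(0, I). Hypothetical helper, not the authors' code.
    """
    if not 0.0 < sigma < 1.0:
        raise ValueError("sigma must lie in (0, 1)")
    # One independent Gaussian perturbation per group member.
    delta = rng.standard_normal((group_size, *base_noise.shape))
    # Broadcasting adds the shared, rescaled base noise to every perturbation.
    return np.sqrt(1.0 - sigma**2) * base_noise + sigma * delta

rng = np.random.default_rng(0)
base = rng.standard_normal(4096)  # assumed flattened latent size
neighbors = sample_neighbor_noises(base, group_size=12, sigma=0.3, rng=rng)

# Cosine similarity of each neighbor to the base noise stays near sqrt(1 - sigma^2).
cosines = neighbors @ base / (np.linalg.norm(neighbors, axis=1) * np.linalg.norm(base))
print(neighbors.shape, float(cosines.min()))
```

Each row of `neighbors` would then be passed as the starting point of one deterministic ODE rollout, so the group differs only through its initial condition.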
2. Softmax Distance Proxy Jump Policy: Making policy ratios and gradients computable on ODEs
Once SDEs are bypassed, the policy ratio \(\rho_t\) required by GRPO no longer has a natural definition. The paper designs a training-only proxy policy: \(\pi_\theta(x_t^{(i)} \mid \{s_t\}) = \frac{\exp(-\|x_t^{(i)} - x_t^{(\theta)}\|_2^2)}{\sum_{k=1}^{G}\exp(-\|x_t^{(k)} - x_t^{(\theta)}\|_2^2)}\), where the anchor \(x_t^{(\theta)}\) is randomly selected from the candidates and is the term through which gradients flow. Intuitively, a sampled trajectory may "jump" to a neighbor at each step with a probability given by the softmax over distances. The optimization dynamics are straightforward: when the advantage \(A_i > 0\), the gradient reduces the distance (pull), and when \(A_i < 0\), it increases the distance (push), mirroring contrastive learning. This proxy exists only during training; at inference it is discarded in favor of standard deterministic ODE sampling, so the ODE advantages are preserved.
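A toy NumPy sketch of the proxy policy may make the pull/push reading concrete. It evaluates the softmax over negative squared L2 distances for three 2-D candidates; the function name `proxy_policy` and the toy coordinates are assumptions, and no gradient computation or flow model is involved.

```python
import numpy as np

def proxy_policy(candidates, anchor):
    """Softmax over negative squared L2 distances from candidates to the anchor.

    Hypothetical helper mirroring the formula in the text; candidates has
    shape (G, ...) and anchor has the shape of a single candidate.
    """
    flat = candidates.reshape(len(candidates), -1)
    sq_dist = np.sum((flat - anchor.reshape(1, -1)) ** 2, axis=1)
    logits = -sq_dist
    logits = logits - logits.max()  # subtract the max for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()

candidates = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
probs = proxy_policy(candidates, anchor=np.array([0.1, 0.0]))

# Moving the anchor toward candidate 1 ("pull") raises that candidate's probability.
probs_pulled = proxy_policy(candidates, anchor=np.array([0.5, 0.0]))
print(probs, probs_pulled)
```

In this toy setting, a positive advantage for candidate 1 would favor the second anchor position, and a negative advantage would favor moving away from it.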
3. Three Practical Techniques: Maximizing neighborhood structure and high-order solver benefits
Neighborhood sampling provides additional structure that can be exploited:
- Symmetric Anchor Sampling: based on the Johnson-Lindenstrauss lemma, neighborhood samples are nearly equidistant, so any candidate can serve as an anchor. Each iteration therefore needs forward/backward passes for only \(B < G\) anchors, reducing gradient computation by a factor of roughly \(G/B\) (about 3x at \(G=12\) with \(B=4\); the full 12x requires \(B=1\)).
- Intra-group Quasi-norm Advantage Reweighting: the standard \(L_2\) normalization is replaced with an \(L_p\) norm with \(p<2\) (a quasi-norm when \(p<1\)), giving \(A'_i = A_i / (\sum_k|A_k|^p)^{1/p}\). This automatically down-weights flat advantage signals to prevent reward hacking (optimal \(p=0.8\)).
- High-order Solver Decoupling: DPM++ is used for data collection and single-step DDIM for computing the proxy policy during updates, an acceleration that is available in a pure ODE framework but not in SDE frameworks.
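The quasi-norm reweighting can be sketched as follows. This is a minimal illustration under stated assumptions: advantages are taken to be mean-centered rewards, `eps` is an added stabilizer, and the two reward vectors are made up to contrast an evenly spread ("flat") group with one dominated by a single sample. It shows one reading of the down-weighting effect, not the paper's exact procedure.

```python
import numpy as np

def quasi_norm_advantages(rewards, p=0.8, eps=1e-8):
    """Compute A'_i = A_i / (sum_k |A_k|^p)^(1/p) for one group.

    Assumption: A_i is the mean-centered reward; eps only guards against
    division by zero when all rewards in the group are equal.
    """
    rewards = np.asarray(rewards, dtype=float)
    centered = rewards - rewards.mean()
    lp = np.sum(np.abs(centered) ** p) ** (1.0 / p)
    return centered / (lp + eps)

flat = [1.0, -1.0, 1.0, -1.0]     # magnitudes spread evenly across the group
peaked = [3.0, -1.0, -1.0, -1.0]  # one sample dominates the signal

for name, rewards in (("flat", flat), ("peaked", peaked)):
    l2_scale = np.linalg.norm(quasi_norm_advantages(rewards, p=2.0))
    q_scale = np.linalg.norm(quasi_norm_advantages(rewards, p=0.8))
    print(f"{name}: p=2.0 scale={l2_scale:.3f}, p=0.8 scale={q_scale:.3f}")
```

With \(p=2\) both groups end up at the same overall scale, whereas with \(p=0.8\) the evenly spread group is scaled down more than the peaked one, so its gradient contribution shrinks.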
Loss & Training
The GRPO objective uses a clipped policy ratio and group-normalized advantages. The training setup is as follows:
- Base Model: FLUX.1-dev (rectified flow transformer backbone)
- Rewards: HPSv2.1 + Pick Score + ImageReward (Equally weighted multi-reward training)
- Optimization: AdamW, lr=1e-5, 300 iterations, 32× H800 GPUs
- Approx. 4 hours per run; only 45s per iteration under 8-step DPM++ configuration, about 1/5 of the 237s required by DanceGRPO/MixGRPO.
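For reference, the clipped objective mentioned at the start of this section is commonly written in the following generic GRPO form. This is the standard textbook expression, not a transcription from the paper: the clipping range \(\eta\), the number of update steps \(T\), and the per-sample rewards \(r_i\) are notation assumed here, and the paper's exact objective may differ (for example, by using the quasi-norm in place of the standard deviation).

```latex
\[
\mathcal{J}(\theta) = \mathbb{E}\left[ \frac{1}{G}\sum_{i=1}^{G} \frac{1}{T}\sum_{t=1}^{T}
\min\!\left( \rho_t^{(i)} A_i,\ \operatorname{clip}\!\left(\rho_t^{(i)},\, 1-\eta,\, 1+\eta\right) A_i \right) \right],
\]
\[
\rho_t^{(i)} = \frac{\pi_\theta\!\left(x_t^{(i)} \mid \{s_t\}\right)}{\pi_{\theta_{\text{old}}}\!\left(x_t^{(i)} \mid \{s_t\}\right)},
\qquad
A_i = \frac{r_i - \operatorname{mean}(r_1,\dots,r_G)}{\operatorname{std}(r_1,\dots,r_G)}.
\]
```

Here \(\pi_\theta\) is the softmax distance proxy policy, so \(\rho_t^{(i)}\) depends on \(\theta\) only through the anchor \(x_t^{(\theta)}\).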
Key Experimental Results
Main Results
| Method | Solver | NFE_old | NFE_θ ↓ | s/Iter ↓ | HPSv2.1 ↑ | Pick ↑ | ImgRwd ↑ | CLIP ↑ | Unified ↑ | Aes ↑ |
|---|---|---|---|---|---|---|---|---|---|---|
| FLUX.1-dev | - | - | - | - | 0.310 | 0.227 | 1.131 | 0.389 | 3.211 | 6.108 |
| DanceGRPO | DDIM | 25 | 14 | 237.9 | 0.371 | 0.231 | 1.306 | 0.364 | 3.156 | 6.552 |
| MixGRPO | DDIM | 25 | 14 | 237.7 | 0.366 | 0.235 | 1.604 | 0.382 | 3.257 | 6.623 |
| Ours | DPM++ | 8 | 1.33 | 45.1 | 0.366 | 0.234 | 1.640 | 0.391 | 3.334 | 6.621 |
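Under the 8-step DPM++ configuration, training is about 5.3x faster (45.1 s vs. 237.9 s per iteration). On the metrics not used as training rewards (CLIP, Unified, Aes), Neighbor GRPO achieves the best CLIP and Unified scores and a near-best Aes score (6.621 vs. 6.623 for MixGRPO).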
Ablation Study
| Parameter | Optimal Value | Description |
|---|---|---|
| Perturbation Strength \(\sigma\) | 0.3 | Too small lacks exploration; too large leaves the neighborhood |
| Anchor Number \(B\) | 4 | \(B=2\) is already competitive; \(B=4\) is the best balance |
| Quasi-norm \(p\) | 0.8 | \(p=2\) is standard GRPO; \(p=0.8\) is best for out-of-domain |
Key Findings
- Neighbor GRPO converges faster: achieving HPSv2.1 > 0.35 in only 50 iterations (DanceGRPO requires more).
- Human Evaluation: Achieves preference rates of 72% and 61% compared to DanceGRPO and MixGRPO, respectively.
- Avoids reward hacking: No issues such as grid artifacts or uneven coloring occur.
- Long-term training stability is superior to MixGRPO.
Highlights & Insights
- Deep Theoretical Insight: Reinterpreting SDE-based GRPO as contrastive learning reveals its essence as distance optimization, providing the theoretical foundation for a pure ODE solution.
- Full Preservation of ODE Advantages: No SDE conversion required, compatible with high-order solvers, and more direct credit assignment.
- Symmetric Anchor Sampling leverages the geometric properties implied by the J-L lemma to reduce gradient computation to a fraction \(B/G\) of the full-group cost.
- Quasi-norm Reweighting is a concise and effective solution for reward flattening, allowing control via a single hyperparameter.
Limitations & Future Work
- Validated only on FLUX.1-dev; applicability to other Flow Matching models (e.g., SD3) remains to be confirmed.
- Multi-reward training currently uses equal weights; adaptive weighting could be explored.
- Theoretical guarantees of the proxy policy depend on the neighborhood being sufficiently tight (\(\sigma\) small enough); behavior under extreme settings is not fully analyzed.
- Can be extended to video generation (currently image only).
Related Work & Insights
- Shares roots with DanceGRPO and Flow-GRPO but represents a paradigm shift: moving from SDE dependency to pure ODE training.
- MixGRPO's hybrid sampling is a compromise; Neighbor GRPO is more thorough.
- The contrastive learning perspective may apply to other deterministic model optimization scenarios requiring stochasticity.
- Quasi-norm reweighting can be generalized to other GRPO variants.
Rating
- Novelty: ⭐⭐⭐⭐⭐ Important contributions in both theoretical insight and methodology by completely bypassing SDE.
- Experimental Thoroughness: ⭐⭐⭐⭐ Sufficient multi-metric evaluation, ablation studies, and human assessment, though only one base model was used.
- Writing Quality: ⭐⭐⭐⭐⭐ Clear theoretical derivation, intuitive illustrations, and smooth logic from insights to methodology.
- Value: ⭐⭐⭐⭐⭐ 5x training efficiency boost with superior quality; a significant driver for RLHF in visual generation.