# FedGRPO: Privately Optimizing Foundation Models with Group-Relative Rewards from Domain Clients
Conference: AAAI 2026 · arXiv: 2602.12014 · Code: https://github.com/Liar-Mask/FedGRPO · Area: Self-Supervised · Keywords: Federated Learning, Foundation Models, GRPO, Privacy Preservation, Reinforcement Learning
## TL;DR
This paper proposes FedGRPO, which reformulates foundation model optimization as a reward-based evaluation process. Through competence-aware expert selection and federated group-relative policy optimization (transmitting only scalar reward signals), FedGRPO achieves privacy-preserving, communication-efficient federated foundation model optimization, approaching or surpassing centralized GRPO on mathematical reasoning and question-answering tasks.
## Background & Motivation
Federated Foundation Models (FedFMs) aim to combine the strong generalization capability of large server-side foundation models with the domain expertise residing on client devices. The central question is: how can client-side domain knowledge effectively enhance foundation models while preserving local data privacy?
Existing client-to-server knowledge transfer methods fall into two categories:
Model-level transfer (e.g., FedPETuning): clients locally fine-tune parameters such as LoRA adapters and upload them for aggregation. This incurs substantial communication overhead (scaling with model size) and risks privacy leakage through gradients or parameters.
Synthetic data-level transfer (e.g., DPSDA-FL): clients generate and upload synthetic data. This similarly suffers from high communication overhead, and synthetic data may be exploited by honest-but-curious adversaries to infer original data.
Root Cause: existing methods struggle to simultaneously achieve strong privacy protection and communication efficiency. The authors' key insight is: transmitting only scalar evaluation scores—rather than model parameters or synthetic data—can simultaneously reduce privacy leakage risk and lower communication overhead by orders of magnitude.
This insight gives rise to two technical challenges: (1) how to select the most competent clients to evaluate a given query based on domain expertise; and (2) how to aggregate evaluations from multiple expert clients to effectively optimize the foundation model.
## Method

### Overall Architecture
FedGRPO operates as a three-step cycle:

1. Expert Selection: select the most suitable client subset based on a competence graph.
2. Dual Evaluation: the selected clients evaluate server-generated candidate responses and return scalar rewards.
3. Group-Relative Reward Aggregation: the server aggregates the reward signals and optimizes the foundation model via the GRPO loss.
### Key Designs
- Competence-based Expert Selection:
    - Function: identify the \(M\) most suitable clients for each query \(x_s\).
    - Mechanism:
        - The server computes the query embedding \(\mathbf{z}_s = \phi(x_s)\) using a frozen encoder \(\phi\).
        - Based on cosine similarity, the \(L\) annotated samples most similar to \(x_s\) are retrieved from the auxiliary data, forming \(\mathcal{G}(x_s)\).
        - These samples are distributed to all clients; each client \(k\) returns its accuracy \(r_k^p\) on them.
        - The \(M\) clients with the highest accuracy are selected as experts.
    - Design Motivation: different clients specialize in different domains; adaptively matching queries to experts improves evaluation quality.
- Dual Evaluation Mechanism:
    - Function: expert clients evaluate candidate answers \(\hat{y}\) generated by the server.
    - Mechanism: one of two evaluation paths is selected dynamically via a gating indicator \(\lambda_k\):
        - Answer-based Evaluation (AE): when the query exists in the client's local data, the candidate is compared directly against the ground truth, returning a binary 0/1 score.
        - Model-based Evaluation (ME): when no ground truth is available, a locally trained reward model provides a continuous score.
    - Design Motivation: each client flexibly leverages its strongest knowledge source, avoiding dependence on a single evaluation mode.
- Federated Group-Relative Policy Optimization:
    - Function: aggregate reward signals from multiple clients into a scale-invariant reinforcement learning signal.
    - Mechanism:
        - Scores \(\{r_k^s\}_{k \in \mathcal{C}}\) from the selected experts are normalized: \(R_k = (r_k^s - \mu_r) / (\sigma_r + \epsilon)\).
        - Policy-gradient updates use the normalized group-relative rewards: \(\theta_g \leftarrow \theta_g + \eta R_k \nabla_{\theta_g} \log \pi_{\theta_g}(\hat{y}|x_s)\).
    - Design Motivation: normalization eliminates scale discrepancies between the two evaluation modes (AE vs. ME) and suppresses the influence of outliers. A consolidated sketch of all three designs follows this list.
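To make the three designs concrete, the following is a minimal end-to-end sketch of one FedGRPO round, written against the description above. Everything here is an assumption for illustration: the encoder, client reports, and rewards are random toy stand-ins, and names such as `select_experts`, `dual_evaluate`, and `group_relative` are not from the authors' released code.

```python
# Minimal sketch of one FedGRPO round (toy stand-ins throughout).
import numpy as np

rng = np.random.default_rng(0)

AUX_N, DIM, NUM_CLIENTS = 100, 32, 4      # 100 auxiliary samples, 4 toy clients
L_RETRIEVE, M_EXPERTS, GROUP = 20, 2, 8   # L = 20, M = 2, 8 candidates per query

aux_emb = rng.normal(size=(AUX_N, DIM))   # embeddings of annotated aux samples


def embed(query: str) -> np.ndarray:
    """Toy stand-in for the frozen encoder phi(x_s)."""
    return rng.normal(size=DIM)


def select_experts(query: str) -> np.ndarray:
    """Design 1: competence-based expert selection."""
    z = embed(query)
    sims = aux_emb @ z / (np.linalg.norm(aux_emb, axis=1) * np.linalg.norm(z) + 1e-12)
    probe = np.argsort(sims)[-L_RETRIEVE:]    # G(x_s): the L most similar samples
    # Each client would score the probe samples locally and report its
    # accuracy r_k^p; the reports are faked here with random numbers.
    probe_acc = rng.uniform(size=NUM_CLIENTS)
    return np.argsort(probe_acc)[-M_EXPERTS:]  # top-M clients become experts


def dual_evaluate(client: int, query: str, candidate: str) -> float:
    """Design 2: dual evaluation, gated by lambda_k. AE returns a binary
    0/1 score against local ground truth; ME returns a continuous score
    from a locally trained reward model."""
    has_gt = rng.random() < 0.5               # toy gating indicator lambda_k
    if has_gt:
        return float(rng.random() < 0.5)      # AE: exact-match style score
    return rng.uniform()                      # ME: reward-model score


def group_relative(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Design 3: normalize raw scores into scale-invariant advantages,
    R_k = (r_k^s - mu_r) / (sigma_r + eps)."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)


# --- One round for a single server query ---
query = "Solve: 2x + 3 = 11"
candidates = [f"candidate_{g}" for g in range(GROUP)]  # sampled from pi_theta
experts = select_experts(query)

# Scores for every (expert, candidate) pair. Only these scalars cross the
# network, which is what keeps per-round communication tiny.
raw = np.array([[dual_evaluate(k, query, c) for c in candidates] for k in experts])

# One plausible reading of the aggregation: normalize all expert scores for
# the group jointly, then average per candidate.
adv = group_relative(raw.ravel()).reshape(raw.shape).mean(axis=0)

# The server would then apply the GRPO-style update for each candidate y_g:
#   theta <- theta + eta * adv[g] * grad log pi_theta(y_g | x_s)
print(np.round(adv, 3))
```

Note that under this reading only the scalar entries of `raw` ever leave a client; parameters and raw data stay local, which is exactly the privacy and communication argument made above.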
### Loss & Training
- Group-relative policy-gradient updates reinforce responses that outperform the group average.
- Eight candidate responses are sampled per query for evaluation.
- Maximum generation length: 2048 tokens; sampling temperature: 0.7.
- Auxiliary dataset: 100 samples; \(L = 20\); \(M = 2\). These settings are gathered in the sketch below.
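For quick reference, here is a minimal configuration sketch collecting the hyperparameters reported above; the key names are illustrative assumptions, not identifiers from the released code.

```python
# Reported FedGRPO hyperparameters in one place (key names are illustrative;
# the values are those stated in the Loss & Training list above).
fedgrpo_config = {
    "group_size": 8,          # candidate responses sampled per query
    "max_new_tokens": 2048,   # maximum generation length
    "temperature": 0.7,       # sampling temperature
    "aux_dataset_size": 100,  # server-side annotated auxiliary samples
    "retrieval_L": 20,        # most-similar auxiliary samples retrieved per query
    "experts_M": 2,           # expert clients selected per query
}
```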
## Key Experimental Results

### Main Results
Math-benchmark dataset (Qwen2.5-Math-7B):
| Method | Math500 | Minerva | AMC | Olympiad | AIME24 | Avg |
|---|---|---|---|---|---|---|
| Zero-shot | 0.426 | 0.121 | 0.326 | 0.163 | 0.111 | 0.199 |
| FedPETuning+GRPO | 0.460 | 0.107 | 0.329 | 0.132 | 0.049 | 0.186 |
| DPSDA-FL+GRPO | 0.714 | 0.323 | 0.432 | 0.308 | 0.087 | 0.321 |
| Central-GRPO | 0.742 | 0.320 | 0.515 | 0.364 | 0.175 | 0.370 |
| FedGRPO | 0.738 | 0.321 | 0.504 | 0.371 | 0.167 | 0.369 |
OpenR1-Math dataset (Qwen2.5-Math-7B):
| Method | Math500 | Minerva | AMC | Olympiad | AIME24 | Avg |
|---|---|---|---|---|---|---|
| Central-GRPO | 0.755 | 0.325 | 0.529 | 0.370 | 0.180 | 0.379 |
| FedGRPO | 0.768 | 0.337 | 0.533 | 0.382 | 0.184 | 0.388 |
### Ablation Study
| Configuration | Key Metric | Note |
|---|---|---|
| Communication Efficiency | | |
| FedGRPO | 2.4 MB | Transmits only scalar rewards; independent of model size |
| DPSDA-FL | 102.5 MB | 40× that of FedGRPO |
| FedPETuning (7B) | 6.1 GB | 2500× that of FedGRPO |
| Number of Clients | | |
| 4 clients | Avg ≈ 0.29 | Baseline performance |
| 10 clients | Steady improvement | More expert knowledge |
| 20 clients | Avg ≈ 0.36 | Continued benefit from a larger federation |
| Without GT Answers | | |
| FedGRPO (7B, no GT) | Avg = 0.327 | Only slightly below the GT setting (0.369) |
### Key Findings
- FedGRPO approaches or surpasses centralized GRPO in multiple settings (e.g., exceeding it by 0.009 on OpenR1-Math 7B).
- Communication overhead is only 2.4 MB: roughly 2,500× lower than FedPETuning and about 40× lower than DPSDA-FL.
- FedGRPO remains effective even without ground-truth answers, relying solely on local reward models.
- Performance scales steadily with the number of clients, demonstrating strong scalability.
- Larger models (7B) benefit more fully from federated learning than smaller ones (1.5B, 3B).
## Highlights & Insights
- Paradigm Innovation: the approach shifts federated foundation model optimization from "transmitting parameters/data" to "transmitting rewards," fundamentally redefining the privacy–efficiency trade-off.
- Communication overhead is constant with respect to model size: FedGRPO's 2.4 MB is unaffected by model scale, whereas conventional methods grow linearly or super-linearly.
- The work creatively adapts DeepSeek's GRPO framework, extending the notion of "group" from multiple samples drawn from a single model to distributed evaluations across multiple clients.
- The competence-aware expert selection mechanism enables adaptive utilization of knowledge from heterogeneous clients.
- Significant improvements are observed even on highly challenging benchmarks such as AIME.
## Limitations & Future Work
- The threat model assumes honest-but-curious participants; malicious clients that deliberately submit incorrect rewards are not addressed.
- The server is assumed to hold a small auxiliary dataset (100 samples), which may be impractical in extreme privacy scenarios.
- Sampling 8 candidate responses per query incurs non-trivial server-side inference overhead.
- Evaluation is currently limited to mathematical reasoning and question answering; applicability to other domains (e.g., code generation, multimodal tasks) remains to be verified.
- Competence assessment relies on the representativeness of the auxiliary data; if the auxiliary distribution diverges significantly from actual queries, expert selection may be inaccurate.
## Related Work & Insights
- Extending GRPO's "intra-group relative evaluation" to the federated learning setting is a valuable conceptual contribution.
- The privacy-preserving paradigm of "transmitting reward signals only" may generalize to a broader range of distributed AI scenarios.
- The competence-aware routing mechanism bears conceptual similarity to expert routing in Mixture-of-Experts architectures.
- The dual evaluation design (AE + ME) is applicable to any scenario in which evaluation signals are heterogeneous across participants.
## Rating
- Novelty: ⭐⭐⭐⭐⭐ — Innovatively extends GRPO to federated learning; paradigm-shifting contribution.
- Experimental Thoroughness: ⭐⭐⭐⭐ — Covers multiple model scales and datasets, though broader domain validation is lacking.
- Writing Quality: ⭐⭐⭐⭐ — Problem motivation is clear and method description is rigorous.
- Value: ⭐⭐⭐⭐⭐ — Addresses a practical bottleneck in privacy-preserving federated learning with dramatic communication efficiency gains.