# OraPO: Oracle-educated Reinforcement Learning for Data-efficient and Factual Radiology Report Generation
**Conference:** CVPR 2026 · **arXiv:** 2509.18600 · **Code:** Not available · **Area:** Medical Imaging / LLM / Report Generation · **Keywords:** radiology report generation, reinforcement learning, GRPO, DPO, fact-based reward, data efficiency
## TL;DR

OraPO is an adaptive hybrid RL framework that combines GRPO and DPO for data-efficient radiology report generation. It monitors the Zero-Reward Rate (ZRR) to switch dynamically between GRPO and DPO, and scores rollouts with a FactScore-based reward defined over atomic clinical facts. Using only 1K training samples (vs. 223K-1.27M for baselines), OraPO achieves state-of-the-art clinical F1 scores of 0.341 on CheXpert Plus and 0.357 on MIMIC-CXR.
## Background & Motivation

- **Background:** Mainstream radiology report generation (RRG) relies on multi-stage training, large-scale paired corpora, and large backbone models, incurring substantial computational and data costs.
- **Limitations of Prior Work:** (a) vanilla GRPO fails on RRG: approximately 30% of rollout groups receive all-zero rewards, leading to vanishing gradients and wasted computation; (b) designing sentence-level rewards that capture clinical accuracy rather than surface similarity is difficult; (c) large-scale annotated data is hard to acquire.
- **Key Challenge:** GRPO generates numerous zero-reward groups when early outputs are highly uncertain, and these failed explorations are entirely wasted; a mechanism is needed to convert failed rollouts into useful learning signals.
- **Goal:** (a) achieve state-of-the-art report generation with minimal data (1K samples); (b) design rewards that capture clinical factual correctness; (c) resolve the GRPO zero-reward problem.
- **Key Insight:** Monitor the Zero-Reward Rate and automatically switch to DPO when ZRR is high, using ground-truth reports as positives and failed rollouts as negatives, thereby converting wasted exploration into preference-learning signals.
- **Core Idea:** Use EMA-smoothed ZRR to detect GRPO failure moments, dynamically incorporate DPO to convert failed rollouts into preference pairs, and apply atomic fact-level rewards to capture clinical correctness.
## Method

### Overall Architecture

OraPO trains a lightweight VLM based on Qwen2.5-VL (3B). For each prompt, \(K\) rollouts are sampled and scored with the FactScore-based reward, and the GRPO and DPO losses are adaptively mixed according to the EMA-smoothed ZRR.

### Key Designs
- **Zero-Reward Rate (ZRR) Detection and Adaptive Mixing** (a mixing-weight sketch follows under Loss & Training):
    - Function: Detects the proportion of zero-reward GRPO groups and automatically supplements the update with DPO training signals when ZRR is high.
    - Mechanism: ZRR is smoothed via EMA and mapped to a mixing weight \(w_i\); \(\mathcal{L}_\text{OraPO} = (1-w_i)\mathcal{L}_\text{GRPO} + w_i\mathcal{L}_\text{DPO}\). Ground-truth reports serve as DPO positives; all GRPO rollouts serve as negatives.
    - Design Motivation: Approximately 30% of GRPO groups receive all-zero rewards in early training, causing vanishing gradients. DPO does not rely on absolute rewards (it requires only a preference ordering), so it can learn from failures.
- **FactScore-based Reward (FactS)** (see the reward sketch after this list):
    - Function: A dense reward based on atomic clinical facts rather than report-level text similarity.
    - Mechanism: (a) atomic clinical facts are extracted from generated reports using GPT-4; (b) entailment checking is performed against the 14 CheXpert pathology labels; (c) precision and recall are computed, and \(F_\beta\) (with \(\beta > 1\) to emphasize recall) is used as the reward.
    - Design Motivation: Clinically, missed findings (false negatives) are more dangerous than false alarms (false positives), hence the emphasis on recall; atomic facts capture clinical correctness more faithfully than sentence-level BLEU/ROUGE.
- **Oracle-educated DPO Component**:
    - Function: Converts failed GRPO rollouts into DPO negatives, with ground-truth (oracle) reports as positives.
    - Mechanism: No additional data is required; GRPO sampling results are reused directly.
    - Design Motivation: Creates a self-reinforcing flywheel: poor rollouts become DPO negatives, which yield a better policy, which produces more informative rollouts and higher rewards.
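No official code is released, so the following is a minimal Python sketch of a FactS-style reward under stated assumptions: `extract_facts` is a hypothetical stand-in (a naive keyword match that ignores negation) for the paper's GPT-4 fact extraction plus entailment against the 14 CheXpert labels, and \(\beta = 2.0\) is an illustrative choice since the paper only states \(\beta > 1\).

```python
from typing import Set

# The 14 CheXpert pathology labels used for entailment checking.
CHEXPERT_LABELS = [
    "atelectasis", "cardiomegaly", "consolidation", "edema",
    "enlarged cardiomediastinum", "fracture", "lung lesion",
    "lung opacity", "no finding", "pleural effusion", "pleural other",
    "pneumonia", "pneumothorax", "support devices",
]

def extract_facts(report: str) -> Set[str]:
    """Stand-in for the paper's pipeline: GPT-4 extracts atomic clinical
    facts, then an entailment check maps them onto CheXpert labels.
    Here a naive keyword match plays both roles, purely for illustration
    (it cannot handle negations such as "no pneumothorax")."""
    text = report.lower()
    return {label for label in CHEXPERT_LABELS if label in text}

def fact_s_reward(generated: str, reference: str, beta: float = 2.0) -> float:
    """FactS-style reward: F_beta over atomic facts, beta > 1 to favor recall."""
    pred, ref = extract_facts(generated), extract_facts(reference)
    if not pred or not ref:
        return 0.0  # groups full of such rollouts drive up the ZRR
    tp = len(pred & ref)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(ref)
    b2 = beta ** 2
    # F_beta = (1 + beta^2) * P * R / (beta^2 * P + R)
    return (1 + b2) * precision * recall / (b2 * precision + recall)
```

A real implementation would swap the keyword matcher for LLM-based extraction and entailment; negated findings in particular defeat simple string matching.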
### Loss & Training

- \(\mathcal{L}_\text{OraPO} = (1-w_i)\mathcal{L}_\text{GRPO} + w_i\mathcal{L}_\text{DPO}\), with \(w_\min=0.05\), \(w_\max=0.15\), and sharpening exponent \(\gamma=2.0\)
- EMA momentum \(\alpha=0.5\); \(K\) is the number of rollouts sampled per group
- Base model: Qwen2.5-VL (3B), trained on 4× A10 GPUs
- Inference speed: 3.3 s/image (vs. 25.2 s for GPT-5)
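Below is a minimal sketch of the adaptive mixing schedule. The hyperparameters come from the paper, but the exact functional form of the ZRR-to-\(w_i\) mapping (a \(\gamma\)-sharpened interpolation between \(w_\min\) and \(w_\max\)) is our assumption, and the class and function names are hypothetical.

```python
import torch

class OraPOMixer:
    """Maps the EMA-smoothed Zero-Reward Rate (ZRR) to the DPO mixing
    weight w_i. Hyperparameters follow the paper (w_min=0.05, w_max=0.15,
    gamma=2.0, EMA momentum alpha=0.5); the power-law mapping below is
    an illustrative assumption, not the paper's stated formula."""

    def __init__(self, w_min: float = 0.05, w_max: float = 0.15,
                 gamma: float = 2.0, alpha: float = 0.5):
        self.w_min, self.w_max = w_min, w_max
        self.gamma, self.alpha = gamma, alpha
        self.zrr_ema = 0.0

    def update(self, group_rewards: torch.Tensor) -> float:
        """group_rewards: (num_groups, K) FactS rewards, one row per
        GRPO group of K rollouts; returns the mixing weight w_i."""
        # Zero-Reward Rate: fraction of groups where all K rollouts scored 0.
        zrr = (group_rewards.sum(dim=1) == 0).float().mean().item()
        # EMA smoothing stabilizes the estimate across training steps.
        self.zrr_ema = self.alpha * zrr + (1 - self.alpha) * self.zrr_ema
        # Sharpened interpolation: higher ZRR shifts weight toward DPO.
        frac = self.zrr_ema ** self.gamma
        return self.w_min + (self.w_max - self.w_min) * frac

def orapo_loss(loss_grpo: torch.Tensor, loss_dpo: torch.Tensor,
               w: float) -> torch.Tensor:
    # L_OraPO = (1 - w_i) * L_GRPO + w_i * L_DPO, where the DPO term
    # treats the ground-truth (oracle) report as the chosen response
    # and the group's rollouts as rejected responses.
    return (1 - w) * loss_grpo + w * loss_dpo
```

Under this assumed mapping, the paper's reported early-training ZRR of roughly 0.3 would give \(w_i \approx 0.05 + 0.10 \cdot 0.3^2 \approx 0.059\): a modest but nonzero DPO correction that grows as GRPO failures accumulate.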
## Key Experimental Results

### Main Results: CheXpert Plus (Clinical F1)
| Method | Train Size | Precision | Recall | F1 |
|---|---|---|---|---|
| MambaXray-L (CVPR'25) | 1.27M | 0.377 | 0.319 | 0.335 |
| R2GenGPT | 223K | 0.315 | 0.244 | 0.260 |
| OraPO (Ours) | 1K | 0.237 | 0.832 | 0.341 |
### MIMIC-CXR Results
| Method | Train Size | F1 |
|---|---|---|
| MambaXray-L | 1.27M | 0.340 |
| MCA-RG | 227K | 0.335 |
| OraPO (Ours) | 1K | 0.357 |
### Ablation Study
| Configuration | Train Size | Precision | Recall | F1 |
|---|---|---|---|---|
| Base model | 0 | 0.097 | 0.104 | 0.034 |
| + GRPO only | 1K | 0.026 | 0.162 | 0.089 |
| + GRPO + FactS | 1K | 0.204 | 0.605 | 0.291 |
| + FactS + OraPO (400) | 400 | 0.217 | 0.732 | 0.296 |
| + FactS + OraPO (1K) | 1K | 0.237 | 0.832 | 0.341 |
### Key Findings

- The FactS reward contributes the largest gain: F1 improves from 0.089 to 0.291, a more than threefold increase.
- OraPO further improves F1 by 17.2% (+37.5% recall) on top of FactS.
- OraPO with only 400 samples already outperforms GRPO+FactS with 1K samples (0.296 vs. 0.291).
- SFT severely degrades recall: from 0.732 to 0.176, as SFT causes the model to become overly conservative.
- The 3B model outperforms GPT-4.1 (F1: 0.341 vs. 0.253) with 7.6× faster inference.
## Highlights & Insights

- Converting GRPO failures into DPO signals is the core innovation: no exploration is wasted, since "failed" rollouts become negatives for preference learning.
- FactScore atomic fact-level reward: measures clinical correctness more accurately than report-level BLEU/ROUGE and is transferable to other medical text generation tasks.
- Extreme data efficiency: 1K samples outperform models trained on 1.27M samples, a three-order-of-magnitude efficiency gain.
- Recall-first clinical design: \(\beta > 1\) emphasizes recall, consistent with the clinical principle that false negatives are more dangerous than false positives (a quick numeric illustration follows).
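To see how \(\beta > 1\) tilts the score toward recall, take hypothetical values \(P = 0.5\) and \(R = 0.8\):

\[
F_\beta = \frac{(1+\beta^2)\,P R}{\beta^2 P + R}, \qquad
F_1 = \frac{2 \cdot 0.4}{1.3} \approx 0.615, \qquad
F_2 = \frac{5 \cdot 0.4}{2.0 + 0.8} \approx 0.714.
\]

The same recall-heavy operating point scores markedly higher under \(F_2\) than under \(F_1\), so the reward pushes the policy toward reporting findings rather than omitting them.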
## Limitations & Future Work

- Precision is low (0.237 vs. 0.377 for MambaXray-L); the strong emphasis on recall may introduce more false positives.
- FactScore relies on GPT-4 for atomic fact extraction, incurring additional API costs.
- Validation is limited to chest X-rays; extension to other imaging modalities (CT, MRI, etc.) has not been explored.
- The 14 CheXpert labels may be insufficient to cover all clinical findings.
## Related Work & Insights
- vs. MambaXray-L: Requires 1.27M samples, whereas OraPO uses only 1K and achieves higher F1, demonstrating the potential of RL in low-data regimes.
- vs. GPT-5: OraPO inference takes 3.3s vs. 25.2s for GPT-5, with no API costs; however, GPT-5 is marginally superior on human gold evaluation.
- OraPO can serve as a paradigm template for low-data medical report generation.
## Rating

- Novelty: ⭐⭐⭐⭐⭐ The OraPO hybrid GRPO/DPO framework combined with the FactScore reward is an elegant system design.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Two datasets, complete ablations, human gold evaluation, and GPT comparison.
- Writing Quality: ⭐⭐⭐⭐ Method motivation is clear; ZRR analysis is intuitive.
- Value: ⭐⭐⭐⭐⭐ Extreme data efficiency has substantial practical value in medical imaging.