OraPO: Oracle-educated Reinforcement Learning for Data-efficient and Factual Radiology Report Generation

Conference: CVPR 2026 | arXiv: 2509.18600 | Code: Not available | Area: Medical Imaging / LLM / Report Generation | Keywords: radiology report generation, reinforcement learning, GRPO, DPO, fact-based reward, data efficiency

TL;DR

OraPO is an adaptive hybrid RL framework that combines GRPO and DPO for data-efficient radiology report generation. It switches dynamically between the two objectives via Zero-Reward Rate (ZRR) detection and employs a FactScore-based, clinical fact-level reward. Using only 1K training samples (vs. 227K for baselines), OraPO achieves state-of-the-art clinical F1 scores of 0.341 on CheXpert Plus and 0.357 on MIMIC-CXR.

Background & Motivation

Background: Mainstream radiology report generation (RRG) relies on multi-stage training, large-scale paired corpora, and large backbone models, incurring substantial computational and data costs.

Limitations of Prior Work: (a) Vanilla GRPO fails on RRG — approximately 30% of rollout groups receive all-zero rewards, leading to gradient vanishing and wasted computation; (b) designing sentence-level rewards that capture clinical accuracy rather than surface similarity is difficult; (c) large-scale annotated data is hard to acquire.

Key Challenge: GRPO generates numerous zero-reward groups when early outputs are highly uncertain, and these failed explorations are entirely wasted; a mechanism is needed to convert failed rollouts into useful learning signals.

Goal: (a) Achieve state-of-the-art report generation with minimal data (1K samples); (b) design rewards that capture clinical factual correctness; (c) resolve the GRPO zero-reward problem.

Key Insight: Monitor the Zero-Reward Rate and automatically switch to DPO when ZRR is high — using ground-truth reports as positives and failed rollouts as negatives, thereby converting wasted exploration into preference learning signals.

Core Idea: Use EMA-smoothed ZRR to detect GRPO failure moments, dynamically incorporate DPO to convert failed rollouts into preference pairs, and apply atomic fact-level rewards to capture clinical correctness.

Method

Overall Architecture

The policy is a lightweight VLM based on Qwen2.5-VL (3B). For each prompt, \(K\) rollouts are sampled and scored with the FactScore-based reward; the GRPO and DPO losses are then adaptively mixed via the EMA-smoothed ZRR.

Key Designs

  1. Zero-Reward Rate (ZRR) Detection and Adaptive Mixing:

    • Function: Detects the proportion of zero-reward GRPO groups and automatically supplements with DPO training signals when ZRR is high.
    • Mechanism: ZRR is smoothed via EMA and mapped to a mixing weight \(w_i\); \(\mathcal{L}_\text{OraPO} = (1-w_i)\mathcal{L}_\text{GRPO} + w_i\mathcal{L}_\text{DPO}\). Ground-truth reports serve as DPO positives; failed GRPO rollouts serve as negatives (see the training-step sketch under Loss & Training).
    • Design Motivation: Approximately 30% of GRPO groups receive all-zero rewards in early training, causing gradient vanishing. DPO does not rely on absolute rewards — requiring only preference ordering — and can learn from failures.
  2. FactScore-based Reward (FactS):

    • Function: Provides a dense reward based on atomic clinical facts rather than report-level text similarity.
    • Mechanism: (a) atomic clinical facts are extracted from generated reports using GPT-4; (b) entailment is checked against the 14 CheXpert pathology labels; (c) fact-level precision and recall are computed, and \(F_\beta = (1+\beta^2)PR/(\beta^2 P + R)\) with \(\beta > 1\) (to emphasize recall) is used as the reward; a minimal sketch follows this list.
    • Design Motivation: Clinically, missed detections (false negatives) are more dangerous than false alarms (false positives), hence the emphasis on recall; atomic facts more accurately capture clinical correctness than sentence-level BLEU/ROUGE.
  3. Oracle-educated DPO Component:

    • Function: Converts GRPO failed rollouts into DPO negatives, with ground-truth reports as positives.
    • Mechanism: No additional data is required; GRPO sampling results are reused directly (see the training-step sketch under Loss & Training).
    • Design Motivation: Creates a self-reinforcing flywheel — poor rollouts → DPO negatives → better policy → more informative rollouts → higher rewards.
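
The reward computation reduces to an \(F_\beta\) over matched atomic facts. Below is a minimal Python sketch under stated assumptions: fact extraction and entailment checking happen upstream (the paper uses GPT-4 for extraction), the counts in the example are hypothetical, and \(\beta = 2\) is a placeholder since the exact value is not reproduced here.

```python
def fact_score_reward(n_matched: int, n_generated: int, n_reference: int,
                      beta: float = 2.0) -> float:
    """F_beta over atomic clinical facts; beta > 1 weights recall higher,
    reflecting that missed findings cost more than false alarms.
    beta=2.0 is a placeholder, not the paper's confirmed value."""
    if n_generated == 0 or n_reference == 0:
        return 0.0
    precision = n_matched / n_generated   # entailed facts / generated facts
    recall = n_matched / n_reference      # entailed facts / reference facts
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical example: 5 of 8 generated facts are entailed by the
# reference, which contains 6 facts -> recall-weighted F_2 ≈ 0.78.
print(fact_score_reward(5, 8, 6))
```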

Loss & Training

  • \(\mathcal{L}_\text{OraPO} = (1-w_i)\mathcal{L}_\text{GRPO} + w_i\mathcal{L}_\text{DPO}\), with \(w_\min=0.05\), \(w_\max=0.15\), \(\gamma=2.0\) (sharpening)
  • EMA momentum \(\alpha=0.5\); \(K\) = group sample size
  • Base model: Qwen2.5-VL (3B), 4× A10 GPUs
  • Inference speed: 3.3s/image (vs. 25.2s for GPT-5)
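
To make the adaptive mixing concrete, here is a minimal, self-contained sketch using the hyperparameters listed above. The power-law mapping from the smoothed ZRR to \(w_i\) is an assumption (this summary does not reproduce the exact formula), and the `Group` container and helper names are illustrative, not the authors' code.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Reported hyperparameters: weight bounds, sharpening exponent, EMA momentum.
W_MIN, W_MAX, GAMMA, ALPHA = 0.05, 0.15, 2.0, 0.5

@dataclass
class Group:
    rollouts: List[str]   # K reports sampled for one prompt (hypothetical container)
    rewards: List[float]  # FactS reward per rollout

def update_zrr_ema(groups: List[Group], zrr_ema: float) -> float:
    """EMA-smoothed Zero-Reward Rate: fraction of groups with all-zero rewards."""
    zrr = sum(all(r == 0.0 for r in g.rewards) for g in groups) / len(groups)
    return ALPHA * zrr + (1 - ALPHA) * zrr_ema

def mixing_weight(zrr_ema: float) -> float:
    """Assumed mapping: sharpen the smoothed ZRR with gamma,
    then scale into [w_min, w_max]."""
    return W_MIN + (W_MAX - W_MIN) * (zrr_ema ** GAMMA)

def oracle_pairs(groups: List[Group], gt_reports: List[str]) -> List[Tuple[str, str]]:
    """Oracle-educated DPO pairs: ground-truth report as 'chosen',
    each rollout of a failed (all-zero-reward) group as 'rejected'."""
    return [(gt, roll)
            for g, gt in zip(groups, gt_reports)
            if all(r == 0.0 for r in g.rewards)
            for roll in g.rollouts]

# Per step, the losses are mixed as
#   L_OraPO = (1 - w) * L_GRPO + w * L_DPO,  with w = mixing_weight(zrr_ema).
```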

Key Experimental Results

Main Results: CheXpert Plus (Clinical F1)

| Method | Train Size | Precision | Recall | F1 |
| --- | --- | --- | --- | --- |
| MambaXray-L (CVPR'25) | 1.27M | 0.377 | 0.319 | 0.335 |
| R2GenGPT | 223K | 0.315 | 0.244 | 0.260 |
| OraPO (Ours) | 1K | 0.237 | 0.832 | 0.341 |

MIMIC-CXR Results

| Method | Train Size | F1 |
| --- | --- | --- |
| MambaXray-L | 1.27M | 0.340 |
| MCA-RG | 227K | 0.335 |
| OraPO (Ours) | 1K | 0.357 |

Ablation Study

| Configuration | Train Size | Precision | Recall | F1 |
| --- | --- | --- | --- | --- |
| Base model | 0 | 0.097 | 0.104 | 0.034 |
| + GRPO only | 1K | 0.026 | 0.162 | 0.089 |
| + GRPO + FactS | 1K | 0.204 | 0.605 | 0.291 |
| + FactS + OraPO (400) | 400 | 0.217 | 0.732 | 0.296 |
| + FactS + OraPO (1K) | 1K | 0.237 | 0.832 | 0.341 |

Key Findings

  • The FactS reward contributes most significantly: F1 improves from 0.089 to 0.291 (+227%).
  • OraPO further improves F1 by 17.2% (+37.5% recall) on top of FactS.
  • OraPO with only 400 samples already outperforms GRPO+FactS with 1K samples (0.296 vs. 0.291).
  • SFT severely degrades recall: from 0.732 to 0.176, as SFT causes the model to become overly conservative.
  • The 3B model outperforms GPT-4.1 (F1: 0.341 vs. 0.253) with 7.6× faster inference.

Highlights & Insights

  • Converting GRPO failures into DPO signals is the core innovation: No exploration is wasted — "failed" rollouts become negatives for preference learning.
  • FactScore atomic fact-level reward: More accurately measures clinical correctness than report-level BLEU/ROUGE and is transferable to other medical text generation tasks.
  • Extreme data efficiency: 1K samples outperform models trained on 1.27M samples — a three-order-of-magnitude efficiency gain.
  • Recall-first clinical design: \(\beta > 1\) emphasizes recall, consistent with the clinical principle that false negatives are more dangerous than false positives.

Limitations & Future Work

  • Precision is low (0.237 vs. 0.377 for MambaXray-L); the aggressive pursuit of recall may introduce more false positives.
  • FactScore relies on GPT-4 for atomic fact extraction, incurring additional API costs.
  • Validation is limited to chest X-rays; extension to other imaging modalities (CT, MRI, etc.) has not been explored.
  • The 14 CheXpert labels may be insufficient to cover all clinical findings.
  • vs. MambaXray-L: MambaXray-L requires 1.27M training samples, whereas OraPO uses only 1K and achieves a higher F1, demonstrating the potential of RL in low-data regimes.
  • vs. GPT-5: OraPO inference takes 3.3s vs. 25.2s for GPT-5, with no API costs; however, GPT-5 is marginally superior on human gold evaluation.
  • OraPO can serve as a paradigm template for low-data medical report generation.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ The OraPO hybrid GRPO/DPO framework combined with FactScore reward is an elegant system design.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ Two datasets, complete ablations, human gold evaluation, and GPT comparison.
  • Writing Quality: ⭐⭐⭐⭐ Method motivation is clear; ZRR analysis is intuitive.
  • Value: ⭐⭐⭐⭐⭐ Extreme data efficiency has substantial practical value in medical imaging.