Online Iterative Self-Alignment for Radiology Report Generation¶

Conference: ACL 2025
arXiv: 2505.11983
Code: None
Area: Medical NLP
Keywords: Radiology Report Generation, Online Iterative Self-Alignment, Multi-Objective Preference Optimization, MODPO, DPO

TL;DR¶

Proposes Online Iterative Self-Alignment (OISA), a four-stage self-loop (self-generation \(\rightarrow\) self-evaluation \(\rightarrow\) self-alignment \(\rightarrow\) self-iteration) that uses Multi-Objective Direct Preference Optimization (MODPO) to keep improving the reports generated by a lightweight RRG model. It requires neither external large language models nor human annotations, and reports state-of-the-art results on multiple evaluation metrics on MIMIC-CXR and IU-Xray.

Background & Motivation¶

Background: Radiology Report Generation (RRG) aims to automatically generate free-text descriptions of radiology images. Existing approaches primarily train on image-report pairs via supervised fine-tuning (SFT). More recently, reinforcement learning (RL) has been used as a post-training step to align model outputs with radiologist preferences.

Limitations of Prior Work: (a) The scale of high-quality annotated data is limited, making SFT models prone to overfitting and poor generalization.
(b) Traditional RL alignment methods (e.g., CMN+RL, MPO) remain constrained by the data coverage of the training set.
(c) Although the approach of Hein et al. (2024) achieves strong performance, it relies on an 8B LLM (CheXagent) to generate preference data and on an LLM-based evaluation metric (GREEN), which is costly and limited to offline alignment.

Key Challenge: Preference alignment requires substantial amounts of high-quality preference data, but expert annotation in the medical domain is extremely expensive and non-scalable, while relying on external LLMs contradicts the original intent of lightweight deployment.

Goal: To enable lightweight RRG models to achieve continuous multi-objective preference alignment solely using self-generated data, dispensing with the dependency on fixed datasets and external LLMs.

Key Insight: To condition the model on one-hot weight vectors, allowing it to generate diverse reports targeted at different clinical objectives, and then leverage existing radiological evaluation metrics to automatically construct multi-objective preference datasets for MODPO optimization and iteration.

Core Idea: A four-stage self-loop (generation-evaluation-alignment-iteration) combined with Multi-Objective Direct Preference Optimization (MODPO) to achieve continuous self-improvement for lightweight RRG models.

Method¶

Overall Architecture¶

OISA consists of two primary modules that together form a four-stage iterative loop:

Preference Data Construction (PDC) module: Comprises self-generation and self-evaluation, responsible for automatically constructing multi-objective preference datasets.
Multi-Objective Alignment (MOA) module: Comprises self-alignment and self-iteration, responsible for optimizing the model via MODPO and triggering the next round of iteration.

The overall workflow is: \(\pi_{\text{ref}}^{(i)} \xrightarrow{\text{PDC}} \mathcal{D}^{(i)} \xrightarrow{\text{MODPO}} \pi_{\theta_\mathbf{w}}^{(i)} \rightarrow \pi_{\text{ref}}^{(i+1)} \rightarrow \cdots\), where each iteration uses the updated model to generate higher-quality preference data.
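The workflow above can be sketched as a short control loop. This is only an illustrative skeleton: `oisa_loop`, `build_preferences`, and `modpo_align` are hypothetical names standing in for the PDC module and the MODPO training step, and the "policy" here is a toy version counter rather than a real model.

```python
def oisa_loop(ref_policy, build_preferences, modpo_align, rounds=3):
    """Run PDC -> MODPO -> reference update for a fixed number of rounds.

    build_preferences(policy) plays the role of the PDC module
    (self-generation + self-evaluation) and returns the per-objective
    preference datasets. modpo_align(policy, datasets) plays the role of
    the self-alignment step and returns the aligned policy.
    """
    policies = [ref_policy]
    for _ in range(rounds):
        datasets = build_preferences(ref_policy)        # D^(i)
        ref_policy = modpo_align(ref_policy, datasets)  # pi_ref^(i+1) <- aligned policy
        policies.append(ref_policy)
    return policies


# Toy stand-ins (assumptions for illustration): a policy is a version number.
policies = oisa_loop(
    ref_policy=0,
    build_preferences=lambda policy: [[f"pairs-from-v{policy}"]],
    modpo_align=lambda policy, datasets: policy + 1,
)
```

The point of the structure is that the preference data for round \(i+1\) is always produced by the policy aligned in round \(i\), never by a fixed dataset.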

Key Designs¶

Key Design 1: Conditional Multi-Objective Self-Generation¶

Mechanism: A one-hot weight vector \(\hat{\mathbf{w}}_k = [w_1, \ldots, w_N]\) (with \(w_k = 1\) and \(w_j = 0\) for \(j \neq k\)) is fed to the model as a conditional input, so that the RRG model generates a report specialized for the \(k\)-th objective. Switching the weight vector yields multiple reports for the same image, each biased towards a different clinical objective.
Deduplication Strategy: Two-level deduplication is utilized to ensure data diversity: (1) Patient level: For reports of different views from the same patient, only the one with the highest BERTScore is retained (227k \(\rightarrow\) 130k). (2) Disease label level: Utilize CheXbert to extract 14 disease labels for grouping (579 groups in total); within each group, reports with BERTScore < 0.5 are discarded, and for report pairs with a similarity > 0.8, only the higher-quality one is retained (130k \(\rightarrow\) 98k).
Design Motivation: Lightweight SFT models intrinsically struggle to generate diverse responses for the same prompt; conditional generation combined with deduplication addresses this lack of diversity.
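A minimal sketch of the two ingredients above, under stated assumptions: `one_hot_weights` builds the \(N\) conditioning vectors, and `dedup_group` applies the two thresholds (0.5 and 0.8) inside one disease-label group. The greedy keep-the-best-first order and the token-overlap `jaccard` similarity are illustrative choices, not the authors' implementation; the note uses BERTScore for quality.

```python
def one_hot_weights(n_objectives):
    """Return the N one-hot conditioning vectors, one per objective."""
    return [[1.0 if j == k else 0.0 for j in range(n_objectives)]
            for k in range(n_objectives)]


def jaccard(a, b):
    """Toy similarity (assumption): token-set overlap between two reports."""
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / len(ta | tb)


def dedup_group(reports, quality, similarity, min_quality=0.5, max_similarity=0.8):
    """Deduplicate the reports of one disease-label group.

    Reports below min_quality are dropped. The rest are visited from best
    to worst, and a report is kept only if it is not too similar to any
    report already kept, so the higher-quality member of a near-duplicate
    pair survives.
    """
    kept = []
    candidates = [r for r in reports if quality[r] >= min_quality]
    for report in sorted(candidates, key=lambda r: quality[r], reverse=True):
        if all(similarity(report, other) <= max_similarity for other in kept):
            kept.append(report)
    return kept


r1 = "the lungs are clear without focal consolidation"
r2 = "the lungs are clear without focal consolidation today"
r3 = "large left pleural effusion"
r4 = "unreadable"
quality = {r1: 0.9, r2: 0.8, r3: 0.7, r4: 0.3}
kept = dedup_group([r1, r2, r3, r4], quality, jaccard)
```

Here `r2` is dropped as a near-duplicate of the better-scoring `r1`, and `r4` is dropped for low quality.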

Key Design 2: Stratified Sampling Self-Evaluation¶

Construction Process: For each objective dimension \(k\), the corresponding evaluation metric \(M_k\) (RadCliQ / RadGraphF1 / GREEN) scores the candidate reports, and preference pairs are then constructed via stratified sampling: (1) Group reports by disease labels; (2) In each group, select the report with the best evaluation score (lowest for RadCliQ, highest for RadGraphF1 and GREEN) as the chosen response \(y^w\); (3) Randomly select one of the remaining reports in the same group as the rejected response \(y^l\).
Stratified Sampling Mechanism: A per-group sample quota \(K_c\) is computed so that the disease categories are covered proportionally, yielding a preference dataset of \(K=10000\) pairs.
Results: Repeating this process for \(N=3\) objectives yields the multi-objective preference dataset \(\mathcal{D} = [\mathcal{D}_{\text{RadCliQ}}, \mathcal{D}_{\text{RadGraphF1}}, \mathcal{D}_{\text{GREEN}}]\).
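The construction steps can be sketched as follows. This is an assumption-laden illustration: the note does not spell out how \(K_c\) is computed or how several pairs are drawn per group, so the sketch takes each quota to be proportional to group size and pairs the group's best report with up to \(K_c\) randomly drawn rejected reports. All function names are hypothetical, and scores are assumed to be "higher is better" (a lower-is-better metric such as RadCliQ would be negated first).

```python
import random
from collections import defaultdict


def group_quotas(group_sizes, total_pairs):
    """Assumed quota rule: K_c proportional to group size, at least 1."""
    total = sum(group_sizes.values())
    return {g: max(1, round(total_pairs * n / total)) for g, n in group_sizes.items()}


def build_pairs(candidates, total_pairs, seed=0):
    """candidates: iterable of (group, report, score) with higher score = better.

    Returns a list of (group, chosen, rejected) tuples.
    """
    rng = random.Random(seed)
    groups = defaultdict(list)
    for group, report, score in candidates:
        groups[group].append((score, report))
    # A pair needs at least two reports in the group.
    groups = {g: items for g, items in groups.items() if len(items) >= 2}
    quotas = group_quotas({g: len(items) for g, items in groups.items()}, total_pairs)

    pairs = []
    for g, items in groups.items():
        ranked = sorted(items, reverse=True)
        chosen = ranked[0][1]                       # best-scoring report
        rest = [report for _, report in ranked[1:]]
        for rejected in rng.sample(rest, min(quotas[g], len(rest))):
            pairs.append((g, chosen, rejected))
    return pairs


candidates = [
    ("effusion", "a", 0.9), ("effusion", "b", 0.4), ("effusion", "c", 0.2),
    ("normal", "d", 0.7), ("normal", "e", 0.1),
]
pairs = build_pairs(candidates, total_pairs=3)
```

Running this once per metric would give one preference dataset per objective, matching the structure of \(\mathcal{D}\) described above.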

Key Design 3: MODPO-Based Multi-Objective Self-Alignment¶

Algorithm Choice: Multi-Objective DPO (MODPO) extends DPO to multiple objectives at minimal additional cost.
Two-Step Training: (1) For each preference dataset \(\mathcal{D}_k\), train a marginal reward model \(\mathcal{R}_k\) using the standard DPO loss; (2) Integrate marginal rewards as margin terms into the MODPO loss, and train the model using the weight vector \(\mathbf{w}\) as a prompt.
Weight Sampling: During training, each dimension of \(\mathbf{w}\) is sampled from \(\{0.2, 0.4, 0.6, 0.8, 1.0\}\) so that the learned policies cover the Pareto frontier evenly.
Weight Fusion: The weight vector \(\mathbf{w}\) is fused into image features via a multi-head attention mechanism (with \(\mathbf{w}\) as the query, and image features as the key and value).
Self-Iteration: After alignment, set \(\pi_{\text{ref}} \leftarrow \pi_{\theta_\mathbf{w}}\) to start a new cycle, iterating for 3 rounds in total.
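To make the "margin term" in the two-step training concrete, here is a per-pair sketch of the loss as published in the original MODPO work: a DPO-style logit from which the weighted margin of the other objectives' marginal rewards is subtracted, with the whole logit scaled by \(1/w_k\). Whether OISA uses exactly this form is an assumption. The function name and arguments are hypothetical, and the log-probabilities and rewards are plain floats rather than model outputs.

```python
import math


def modpo_pair_loss(logp_w, logp_l, ref_logp_w, ref_logp_l,
                    w, k, rewards_w, rewards_l, beta=0.5):
    """MODPO loss for one pair (y^w, y^l) drawn from D_k.

    logp_* / ref_logp_*: log-probabilities of the chosen (w) and rejected (l)
    reports under the trained policy and the reference policy.
    w: weight vector; k: index of the objective whose preference data is used.
    rewards_w / rewards_l: marginal rewards R_j(x, y) for every objective j
    (entry k is ignored).
    """
    implicit = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    margin = sum(w[j] * (rewards_w[j] - rewards_l[j])
                 for j in range(len(w)) if j != k)
    logit = (implicit - margin) / w[k]
    # Numerically stable -log(sigmoid(logit)).
    return max(-logit, 0.0) + math.log1p(math.exp(-abs(logit)))


# With a one-hot weight on objective k the margin vanishes and the loss
# reduces to the ordinary DPO loss.
dpo_like = modpo_pair_loss(-10.0, -12.0, -11.0, -11.5,
                           w=[1.0, 0.0, 0.0], k=0,
                           rewards_w=[0.0, 0.3, 0.1], rewards_l=[0.0, 0.2, 0.4])
```

Note that sampling each weight from \(\{0.2, \ldots, 1.0\}\) keeps \(w_k\) away from zero, so the \(1/w_k\) scaling is always well defined.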

Loss & Training¶

Baseline model: PromptMRG (219.9M parameters), OISA model: 230.1M parameters (about 10M additional parameters used for weight-conditional fusion).
3 iterative rounds, 60 epochs per round, with each epoch taking approximately 0.14 hours, i.e. about 8 minutes (NVIDIA 4090, 24GB).
Batch size = 16, learning rate = 1e-5, Adam optimizer, \(\beta = 0.5\).
Beam search (beam width = 3) is used for inference, with maximum report lengths set to 150/110 for MIMIC-CXR and IU-Xray respectively.

Key Experimental Results¶

Main Results: MIMIC-CXR Comparison (Table 3)¶

Method	Params	B1	B4	BERTScore	RadCliQ(↓)	RadGraphF1	CheXbertF1	GREEN
R2Gen	78.5M	0.353	0.103	0.866	2.89	0.195	0.276	0.306
CMN+RL	60.8M	0.381	0.109	0.871	2.83	0.214	0.292	0.315
PromptMRG	219.9M	0.398	0.112	0.857	2.77	0.227	0.476	0.289
MPO	63.3M	0.416	0.139	0.878	2.63	0.257	0.353	0.324
OISA (iter3)	230.1M	0.428	0.129	0.885	2.54	0.273	0.516	0.341
MedVersa	7B	0.280	0.090	0.711	2.45	0.289	0.471	0.381
CheXagent	8B	0.172	0.021	0.669	2.88	0.190	0.265	0.268

IU-Xray Comparison (Table 4)¶

Method	Params	B1	B4	BERTScore	RadCliQ(↓)	RadGraphF1	CheXbertF1	GREEN
PromptMRG	219.9M	0.401	0.098	0.871	2.60	0.274	0.211	0.457
OISA (iter3)	230.1M	0.431	0.131	0.889	2.51	0.308	0.232	0.527
MedVersa	7B	0.247	0.047	0.884	2.71	0.209	0.217	0.516
CheXagent	8B	0.191	0.036	0.876	2.81	0.184	0.097	0.407

Iteration Performance Analysis (MIMIC-CXR, Equal Weights \(w=1/3\))¶

Stage	RadCliQ(↓)	RadGraphF1	GREEN	BERTScore
SFT baseline	2.77	0.227	0.289	0.857
Iteration 1	2.65	0.244	0.323	0.865
Iteration 2	2.63	0.251	0.325	0.874
Iteration 3	2.61	0.254	0.327	0.879
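As a quick reading aid, the relative change of each metric against the SFT baseline can be computed directly from the table above. The snippet only restates the table values; the dictionary layout and function name are illustrative.

```python
ROWS = {
    "SFT baseline": {"RadCliQ": 2.77, "RadGraphF1": 0.227, "GREEN": 0.289, "BERTScore": 0.857},
    "Iteration 1":  {"RadCliQ": 2.65, "RadGraphF1": 0.244, "GREEN": 0.323, "BERTScore": 0.865},
    "Iteration 2":  {"RadCliQ": 2.63, "RadGraphF1": 0.251, "GREEN": 0.325, "BERTScore": 0.874},
    "Iteration 3":  {"RadCliQ": 2.61, "RadGraphF1": 0.254, "GREEN": 0.327, "BERTScore": 0.879},
}


def relative_change(baseline, value):
    """Signed relative change of value with respect to baseline."""
    return (value - baseline) / baseline


base, final = ROWS["SFT baseline"], ROWS["Iteration 3"]
changes = {metric: relative_change(base[metric], final[metric]) for metric in base}
for metric, change in changes.items():
    print(f"{metric}: {change:+.1%}")
```

After three rounds this gives roughly a 5.8% reduction in RadCliQ (lower is better) and gains of about 11.9% in RadGraphF1, 13.1% in GREEN, and 2.6% in BERTScore. For the three alignment objectives, the first iteration accounts for the largest share of the improvement.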

Key Findings¶

The quality of preference data continuously improves in each iteration: the quantiles of RadGraphF1 and GREEN steadily rise round by round, while RadCliQ consistently decreases.
When the weight of a specific objective is set to 1, the corresponding metric reaches its best value; with equal weights, all metrics are near their best values, demonstrating the effectiveness of multi-objective alignment.
OISA (230M) outperforms the much larger 7B to 8B vision-language models (MedVersa, CheXagent) on B1, B4, and BERTScore on both datasets. On radiological metrics it is competitive with MedVersa: on MIMIC-CXR, MedVersa remains ahead on RadCliQ, RadGraphF1, and GREEN while OISA leads on CheXbertF1; on IU-Xray, OISA leads on all four.
Inference speed is 0.905s per report, which is close to the baseline PromptMRG (0.874s) and significantly faster than MedVersa (5.11s) and CheXagent (2.3s).

Highlights & Insights¶

Fully Autonomous Self-Improvement Loop: OISA does not rely on external large models or human annotations. Instead, it uses established radiological evaluation metrics (RadCliQ, RadGraphF1, GREEN) as proxies for preference signals, which keeps the cost of building preference data low.
Multi-Objective Pareto Frontier: Weight conditioning yields a continuous, smooth Pareto frontier, so users can steer the trade-off between objectives (e.g., favoring clinical accuracy vs. language fluency) by adjusting the weights at inference time.
Theoretical Guarantees: Under the linear reward assumption, it is mathematically proven that the suboptimality upper bound tightens with iterations, meaning that newly generated preference data in each round better covers the target policy's distribution.
Extreme Efficiency: Preference learning requires only 10k pairs of data per round and takes just 0.14 hours per epoch (compared to 227k data samples and 2.39 hours per epoch in the SFT stage). The total training cost for 3 iterative rounds is approximately 25 GPU hours.
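The efficiency figures above can be cross-checked with a few lines of arithmetic. The snippet uses only numbers quoted in this note (3 rounds, 60 epochs per round, 0.14 hours per alignment epoch, 2.39 hours per SFT epoch, 10k vs. 227k samples); the variable names are illustrative.

```python
rounds = 3
epochs_per_round = 60
align_hours_per_epoch = 0.14
sft_hours_per_epoch = 2.39
align_pairs, sft_samples = 10_000, 227_000

# Total alignment cost over all rounds, in GPU hours.
total_align_hours = rounds * epochs_per_round * align_hours_per_epoch

# How much cheaper one alignment epoch is than one SFT epoch.
epoch_speedup = sft_hours_per_epoch / align_hours_per_epoch

# How much smaller the per-round preference set is than the SFT set.
data_ratio = sft_samples / align_pairs

print(f"total alignment cost: {total_align_hours:.1f} GPU hours")
print(f"per-epoch speedup vs SFT: {epoch_speedup:.1f}x")
print(f"data size ratio: {data_ratio:.1f}x")
```

The product \(3 \times 60 \times 0.14 = 25.2\) hours agrees with the stated total of approximately 25 GPU hours.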

Limitations & Future Work¶

Only verified on a single baseline model (PromptMRG); other architectures and scales of RRG models have not been tested.
Existing evaluation metrics are used to substitute actual radiologist preferences; the alignment of these metrics with real clinical requirements remains to be further validated.
Experiments were only conducted on chest X-ray datasets; other modalities such as CT/MRI have not been tested.
The number of iterations is limited (3 rounds); whether performance saturation or degradation occurs with more rounds has not been fully explored.

Related Work & Comparison¶

vs MPO (Xiao et al., 2025): Earlier work by the same team, which uses RL to optimize multi-dimensional rewards but is limited by fixed training data. OISA expands data coverage through online iterative generation.
vs Hein et al. (2024): Employs CheXagent (8B) to generate preference data and utilizes GREEN scoring for offline DPO. OISA dispenses with large models, enabling the lightweight model to perform self-generation and self-evaluation.
vs SPIN / Self-Play: Operates on a similar philosophy: the model improves by competing against its past self. However, OISA adds multi-objective dimensions and domain-specific medical evaluation metrics.
vs Constitutional AI: Shares a similar self-evaluation and self-improvement paradigm, but OISA utilizes domain-specific metrics rather than general principles.

Rating¶

Novelty: ⭐⭐⭐⭐ The combination of conditional multi-objective self-generation and iterative MODPO optimization is novel in the RRG domain.
Experimental Thoroughness: ⭐⭐⭐⭐ Two datasets + 7 metrics + multiple weight configurations + 3 iterative rounds + Pareto frontier visualization.
Writing Quality: ⭐⭐⭐⭐ The method is clearly described with theoretical analysis.
Value: ⭐⭐⭐⭐ Achieves continuous improvement of lightweight models at an extremely low cost, holding practical significance for medical AI deployment.
