Instruction-Guided Lesion Segmentation for Chest X-rays with Automatically Generated Large-Scale Dataset

Conference: CVPR 2026
arXiv: 2511.15186
Code: GitHub
Area: Medical Imaging
Keywords: Chest X-ray, lesion segmentation, instruction-guided, automatic dataset construction, vision-language model

TL;DR

This paper introduces instruction-guided lesion segmentation (ILS) for chest X-rays, constructs the first large-scale automatically generated instruction-answer dataset MIMIC-ILS (1.1M samples, 192K images, 91K masks), and trains the ROSALIA model to achieve gIoU of 71.2% and null-target accuracy of 91.8%, substantially outperforming existing general-purpose and medical segmentation models.

Background & Motivation

Chest X-ray (CXR) is one of the most common medical imaging examinations. Localizing lesions and delineating their boundaries are core tasks for radiologists, but both are labor-intensive and demand substantial clinical expertise.

Existing CXR lesion segmentation faces two major bottlenecks:
1. Limited annotation scale: Existing datasets (VinDr-CXR with 15K images, SIIM-ACR with 13K images) rely on expert manual annotation, which restricts their scale, and most provide only bounding boxes or masks for a single lesion type.
2. High barrier for user input: Existing text-guided segmentation methods require users to provide expert-level detailed descriptions (e.g., "bilateral pulmonary infection with two infected regions…"), rendering them inaccessible to non-expert users.

Key Challenge: How to generate high-quality lesion masks and instruction-answer pairs at scale without manual annotation, while supporting simple and accessible user instructions?

Key Insight: The paper leverages existing image-report paired data in MIMIC-CXR, extracting spatial and textual information from images and reports through a multimodal automated pipeline to produce a fully automatically annotated large-scale ILS dataset.

Method

Overall Architecture

The dataset construction pipeline consists of two major stages:
1. Lesion mask generation: Automatically generating lesion segmentation masks from CXR images and radiology reports.
2. Instruction-answer pair generation: Constructing diverse training samples based on information extracted in the previous stage.

The ROSALIA model is built upon the LISA architecture, integrating a VLM (LLaVA) and SAM for end-to-end training.

Key Designs

Multimodal Automatic Mask Generation Pipeline:
- Function: Automatically generates high-quality lesion segmentation masks from raw, unannotated CXR images.
- Mechanism: A four-step cascaded pipeline:
  - Report structuring: An LLM converts abnormality descriptions in radiology reports into sextuple representations (entity, sentence index, existence, certainty, location, lesion type), with locations mapped to standard anatomical labels.
  - Spatial information extraction: Three visual models are applied in parallel: RadEdit (a diffusion model generating an anomaly map \(\mathcal{A}\) by differencing the original image against a synthesized "lesion-free" image), CXAS (an anatomical segmentation model providing anatomical masks \(\{\mathcal{M}_i\}\)), and YOLO (a lesion detector producing bounding boxes \(\{\mathcal{B}_j\}\)).
  - Mask generation: High-quality candidate boxes are selected via four-condition filtering (anatomical overlap c1 + detection confidence c2 + anomaly signal ratio c3 + minimum size c4); connected components intersecting with selected boxes are then extracted and refined.
  - Location verification: Confirms whether the generated mask successfully localizes the region described in the report, while flagging blank locations for negative sample generation.
- Design Motivation: Multimodal cross-validation ensures mask quality — textual information specifies "where," the visual anomaly map indicates "what is abnormal," and the detection model provides "boundaries." The four-condition filtering effectively eliminates false positives.
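
The four-condition box filter described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the threshold values (`T_OVERLAP`, `T_CONF`, `T_ANOMALY`, `T_MIN_AREA`, `ANOMALY_ON`), the `Box` type, and the exact definition of each ratio are assumptions, since the summary names the four conditions but not how they are computed.

```python
from dataclasses import dataclass

@dataclass
class Box:
    x0: int      # left edge, inclusive
    y0: int      # top edge, inclusive
    x1: int      # right edge, exclusive
    y1: int      # bottom edge, exclusive
    conf: float  # detector confidence

    def area(self):
        return max(0, self.x1 - self.x0) * max(0, self.y1 - self.y0)

    def pixels(self):
        for y in range(self.y0, self.y1):
            for x in range(self.x0, self.x1):
                yield y, x

# Hypothetical thresholds; the source does not state the values it uses.
T_OVERLAP = 0.5   # c1: min fraction of the box inside the anatomical mask
T_CONF = 0.3      # c2: min detection confidence
T_ANOMALY = 0.2   # c3: min fraction of anomalous pixels inside the box
T_MIN_AREA = 4    # c4: min box area in pixels
ANOMALY_ON = 0.5  # a pixel counts as anomalous above this anomaly-map value

def keep_box(box, anatomy, anomaly):
    """Return True only if the box passes all four conditions."""
    area = box.area()
    if area < T_MIN_AREA:                       # c4: minimum size
        return False
    if box.conf < T_CONF:                       # c2: detection confidence
        return False
    inside = sum(anatomy[y][x] for y, x in box.pixels())
    if inside / area < T_OVERLAP:               # c1: anatomical overlap
        return False
    hot = sum(1 for y, x in box.pixels() if anomaly[y][x] > ANOMALY_ON)
    return hot / area >= T_ANOMALY              # c3: anomaly signal ratio

def select_boxes(boxes, anatomy, anomaly):
    return [b for b in boxes if keep_box(b, anatomy, anomaly)]

# Toy 6x6 example: the anatomical region covers columns 3..5,
# and the anomaly map is hot at rows 1..2, columns 3..4.
anatomy = [[1 if x >= 3 else 0 for x in range(6)] for _ in range(6)]
anomaly = [[0.9 if (1 <= y <= 2 and 3 <= x <= 4) else 0.0 for x in range(6)]
           for y in range(6)]

good = Box(3, 1, 5, 3, 0.8)
boxes = [
    good,
    Box(3, 1, 5, 3, 0.1),  # fails c2 (low confidence)
    Box(0, 1, 2, 3, 0.9),  # fails c1 (outside the anatomical mask)
    Box(3, 1, 4, 2, 0.9),  # fails c4 (too small)
    Box(3, 4, 5, 6, 0.9),  # fails c3 (no anomaly signal)
]
selected = select_boxes(boxes, anatomy, anomaly)
```

Requiring all four conditions at once means a detection survives only when the report-derived anatomy, the detector, and the anomaly map agree, which is the cross-validation idea stated in the design motivation.
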
Instruction-Answer Pair Generation System:
- Function: Automatically constructs diverse training samples based on localization information.
- Mechanism: Three instruction types are supported:
  - Basic instructions: Specify lesion type and location (e.g., "Segment the pneumonia in the right lung"), generated only when the mask is successfully localized.
  - Global instructions: Specify only the lesion type (e.g., "Segment the opacity"), generated only when the localized locations fully match the locations stated in the report.
  - Lesion reasoning instructions: The instruction replaces the specific type (pneumonia, atelectasis, or edema) with the generic term "opacity," and the model must predict the specific subtype in its answer.
- Negative sample generation strategy: Negative samples are built from two sources: lesion types that the report either does not mention or explicitly negates, and blank locations substituted for the location of a positive sample.
- Design Motivation: The three instruction types cover a spectrum of user needs ranging from specific to general. The dynamic generation strategy ensures that only valid instruction-answer pairs are produced, avoiding inconsistencies.
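
The generation rules above can be sketched as below. The `Finding` and `Sample` types, the instruction templates, and the choice to emit one reasoning instruction per localized location are illustrative assumptions; the summary states only the conditions under which each instruction type is produced.

```python
from dataclasses import dataclass
from typing import Optional

# Subtypes that the summary says are replaced by "opacity" in reasoning instructions.
OPACITY_SUBTYPES = {"pneumonia", "atelectasis", "edema"}

@dataclass
class Finding:
    lesion_type: str
    reported: tuple   # locations stated in the report
    localized: tuple  # locations whose mask passed location verification

@dataclass
class Sample:
    kind: str
    instruction: str
    has_target: bool
    answer_type: Optional[str] = None  # subtype the model should name, if any

def positive_samples(f):
    out = []
    for loc in f.localized:
        # Basic: lesion type plus location, only for verified locations.
        out.append(Sample("basic", f"Segment the {f.lesion_type} in the {loc}.", True))
        # Reasoning: hide the subtype behind "opacity" (hypothetical template).
        if f.lesion_type in OPACITY_SUBTYPES:
            out.append(Sample("reasoning", f"Segment the opacity in the {loc}.",
                              True, f.lesion_type))
    # Global: only when every reported location was localized.
    if f.localized and set(f.localized) == set(f.reported):
        out.append(Sample("global", f"Segment the {f.lesion_type}.", True))
    return out

def negative_samples(absent_types, blank_locations, positive_type):
    # Absent or negated lesion types yield null-target global instructions.
    out = [Sample("global", f"Segment the {t}.", False) for t in absent_types]
    # Blank locations are swapped into the positive sample's instruction.
    out += [Sample("basic", f"Segment the {positive_type} in the {loc}.", False)
            for loc in blank_locations]
    return out

full = Finding("pneumonia", ("right lung",), ("right lung",))
partial = Finding("pleural effusion", ("left lung", "right lung"), ("left lung",))
negs = negative_samples(["edema"], ["left lung"], "pneumonia")
```

In this sketch `partial` yields only a basic instruction, because one reported location was not localized and a global mask would therefore be incomplete.
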
ROSALIA Model Architecture:
- Function: Generates lesion segmentation masks and textual descriptions conditioned on user instructions.
- Mechanism: Built upon LISA-7B, the VLM (LLaVA) processes image and instruction inputs to produce a special [SEG] token and textual description. The hidden embedding of the [SEG] token is passed to SAM-H's mask decoder to generate the final segmentation mask.
- Training strategy: LoRA fine-tuning of the VLM (rank=128, alpha=256); full fine-tuning of the mask decoder. Trained for 15 epochs with AdamW, batch size 256, and a 1:1 positive-to-negative sample ratio.
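
For reference, under the standard LoRA formulation (an assumption here, since the summary gives only the rank and alpha), each adapted weight matrix \(W_0 \in \mathbb{R}^{d \times k}\) is frozen and updated through two low-rank factors:

\[W = W_0 + \frac{\alpha}{r} B A, \qquad B \in \mathbb{R}^{d \times r},\; A \in \mathbb{R}^{r \times k}\]

With \(r = 128\) and \(\alpha = 256\), the scaling factor is \(\alpha / r = 2\). For a hypothetical square projection with \(d = k = 4096\), the trainable parameters per adapted matrix are \(r(d + k) = 128 \times 8192 = 1{,}048{,}576\), about 6.25% of the \(4096^2 = 16{,}777{,}216\) parameters in the full matrix.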

Loss & Training

\[\mathcal{L} = \lambda_{txt}\mathcal{L}_{txt} + \lambda_{bce}\mathcal{L}_{bce} + \lambda_{dice}\mathcal{L}_{dice}\]

\(\mathcal{L}_{txt}\): Autoregressive cross-entropy loss (text generation), \(\lambda_{txt}=0.5\)
\(\mathcal{L}_{bce}\): Binary cross-entropy loss (segmentation), \(\lambda_{bce}=5\)
\(\mathcal{L}_{dice}\): Dice loss (segmentation, computed on positive samples only), \(\lambda_{dice}=1\)
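
A minimal per-sample sketch of this weighted sum is shown below, using the weights listed above. The `bce` and `dice_loss` helpers operate on flat lists of per-pixel probabilities and are simplified stand-ins; how the actual training code reduces over pixels and batches is not stated in the summary and is assumed here.

```python
import math

# Loss weights as listed above.
LAMBDA_TXT, LAMBDA_BCE, LAMBDA_DICE = 0.5, 5.0, 1.0

def bce(probs, target, eps=1e-7):
    """Mean binary cross-entropy over pixels (probabilities clamped for stability)."""
    total = 0.0
    for p, t in zip(probs, target):
        p = min(max(p, eps), 1 - eps)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(probs)

def dice_loss(probs, target, eps=1e-6):
    """Soft Dice loss: 1 - 2|P.G| / (|P| + |G|)."""
    inter = sum(p * t for p, t in zip(probs, target))
    denom = sum(probs) + sum(target)
    return 1.0 - (2 * inter + eps) / (denom + eps)

def total_loss(l_txt, probs, target):
    """Weighted sum of text, BCE, and Dice terms for one sample."""
    is_positive = any(t == 1 for t in target)
    # Dice is computed on positive samples only; null-target samples skip it.
    l_dice = dice_loss(probs, target) if is_positive else 0.0
    return LAMBDA_TXT * l_txt + LAMBDA_BCE * bce(probs, target) + LAMBDA_DICE * l_dice

# A null-target sample with an empty prediction contributes only the text term.
null_loss = total_loss(2.0, [0.0, 0.0, 0.0, 0.0], [0, 0, 0, 0])
# A positive sample with a perfect prediction and zero text loss is close to 0.
perfect_loss = total_loss(0.0, [1.0, 1.0, 0.0, 0.0], [1, 1, 0, 0])
```

Skipping Dice on null-target samples avoids a degenerate term: with an empty ground truth the Dice ratio is determined almost entirely by the smoothing constant.
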

Key Experimental Results

Main Results

Model	gIoU	cIoU	N-Acc.	Note
LISA-7B	8.3%	12.8%	0.7%	General domain
LISA-13B	8.9%	12.2%	0.0%	General domain
Text4Seg	6.1%	10.3%	20.6%	General domain
BiomedParse	23.8%	18.5%	0.6%	Medical domain
RecLMIS	22.4%	19.5%	0.0%	Medical domain
ROSALIA (Ours)	71.2%	75.6%	91.8%	Trained on MIMIC-ILS
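
The summary does not define the three metrics. The sketch below follows the convention commonly used for referring segmentation with null targets, which is assumed (not confirmed) to match the paper: gIoU averages per-sample IoU and scores a null-target sample as 1 when the prediction is empty and 0 otherwise, cIoU divides cumulative intersection by cumulative union, and N-Acc is the fraction of null-target samples predicted as empty.

```python
def iou_parts(pred, gt):
    """Intersection and union pixel counts for two flat binary masks."""
    inter = sum(1 for p, g in zip(pred, gt) if p and g)
    union = sum(1 for p, g in zip(pred, gt) if p or g)
    return inter, union

def evaluate(samples):
    """samples: list of (pred, gt) pairs of flat binary masks.

    Returns (gIoU, cIoU, N-Acc) under the assumed convention described above.
    """
    per_sample = []
    cum_inter = cum_union = 0
    null_total = null_correct = 0
    for pred, gt in samples:
        inter, union = iou_parts(pred, gt)
        cum_inter += inter
        cum_union += union
        if not any(gt):                      # null-target sample
            null_total += 1
            empty = not any(pred)
            null_correct += int(empty)
            per_sample.append(1.0 if empty else 0.0)
        else:
            per_sample.append(inter / union)
    giou = sum(per_sample) / len(per_sample)
    ciou = cum_inter / cum_union if cum_union else 0.0
    n_acc = null_correct / null_total if null_total else 0.0
    return giou, ciou, n_acc

samples = [
    ([1, 1, 0, 0], [1, 0, 0, 0]),  # positive sample, IoU = 1/2
    ([0, 0, 0, 0], [0, 0, 0, 0]),  # null target, correctly empty
    ([0, 1, 0, 0], [0, 0, 0, 0]),  # null target, false-positive pixel
]
giou, ciou, n_acc = evaluate(samples)
```

Under this convention a model that always predicts some region receives zero credit on every null-target sample, which pulls gIoU down and drives N-Acc to zero.
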

Per-Lesion Performance

Lesion Type	gIoU	cIoU	N-Acc.
Cardiomegaly	89.0%	89.0%	85.8%
Pneumonia	57.2%	60.4%	97.1%
Atelectasis	60.2%	58.7%	91.7%
Opacity	60.5%	64.2%	85.0%
Consolidation	61.9%	65.6%	91.2%
Edema	64.8%	66.6%	92.2%
Pleural Effusion	60.3%	59.6%	90.4%

Ablation Study: Dataset Quality Evaluation

Expert	Overall Acceptance	Positive Acceptance	Negative Acceptance
Expert A	96.1%	95.6%	96.5%
Expert B	97.2%	96.0%	98.3%
Expert C	98.7%	99.8%	97.8%
Expert D	97.6%	96.9%	98.2%
Overall	96.4%	90.1%	97.7%

Key Findings

Existing general-purpose and medical-domain segmentation models systematically fail on the ILS task: none exceeds 23.8% gIoU, and null-target accuracy is below 1% for every baseline except Text4Seg (20.6%).
The fully automatically generated dataset received an overall acceptance rate of 96.4% from four radiation oncology experts.
Text response accuracy is 94.4%, with basic instructions achieving the highest score of 96.8%; lesion reasoning instructions reach 84.8%, leaving room for improvement.
Cardiomegaly achieves the best segmentation (gIoU 89.0%) due to the use of cardiac masks as annotations; the other six lesion types score lower, from 57.2% (pneumonia) to 64.8% (edema).

Highlights & Insights

The fully automatic dataset construction pipeline is the core contribution, achieving annotation quality comparable to manual labeling through multimodal cross-validation.
The ILS task definition is clinically practical: it supports simple user instructions rather than expert-level descriptions and handles null-target detection ("no lesion found").
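The dataset is far larger than existing CXR segmentation datasets: its 192K images are roughly 13–15× the 13K–15K images of SIIM-ACR and VinDr-CXR, and it provides 1.1M instruction-answer samples on top of them.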
The use of RadEdit to generate anomaly maps is elegant — a diffusion model synthesizes a "normal" image, and differencing localizes abnormal regions.
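
The differencing idea can be illustrated with a few lines of Python. This is only a sketch: the synthesized "normal" image is taken as a given input (the diffusion model itself is not shown), and the absolute difference with peak normalization is an assumed form, since the summary does not specify how the difference is computed.

```python
def anomaly_map(image, normal):
    """Per-pixel absolute difference between an image and its synthesized
    lesion-free counterpart, scaled so the largest difference equals 1."""
    diff = [[abs(a - b) for a, b in zip(row_img, row_norm)]
            for row_img, row_norm in zip(image, normal)]
    peak = max(max(row) for row in diff)
    if peak == 0:
        return diff  # identical images: nothing is flagged as abnormal
    return [[v / peak for v in row] for row in diff]

# Toy 2x2 example: only the bottom-right pixel differs from the "normal" image.
image = [[0.2, 0.2], [0.2, 0.8]]
normal = [[0.2, 0.2], [0.2, 0.2]]
amap = anomaly_map(image, normal)
```
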

Limitations & Future Work

Annotation quality: The positive-sample acceptance rate of 90.1% falls below the 97.7% for negative samples, indicating that the precision of positive-sample annotations requires further improvement.
Only 7 major lesion types are covered; CXR encompasses a broader range of fine-grained abnormalities.
Accuracy on lesion reasoning tasks (opacity → specific subtype) is relatively low at 75.1%.
The pipeline depends on three pretrained models (RadEdit, CXAS, YOLO); failure of any single model may degrade overall quality.
Validation is conducted solely on MIMIC-CXR; CXR style variation across institutions may limit generalizability.

vs. BiomedParse: Although a medical-domain model, it only supports class-label prompts and cannot handle instruction-level inputs or null-target detection.
vs. RecLMIS: Requires users to provide expert-level descriptions (e.g., "bilateral pulmonary infection…"), imposing a high barrier to use.
vs. LISA: ROSALIA is built on the LISA architecture, but fine-tuning on MIMIC-ILS raises gIoU from 8.3% (LISA-7B) to 71.2%, demonstrating the critical importance of task-specific data.

Rating

Novelty: ⭐⭐⭐⭐ Both the fully automatic dataset construction pipeline and the ILS task formulation are innovative.
Experimental Thoroughness: ⭐⭐⭐⭐ Multiple baselines, per-lesion evaluation, and expert quality validation are provided, though cross-dataset generalization experiments are absent.
Writing Quality: ⭐⭐⭐⭐ The problem definition is clear, the pipeline is described in detail, and figures and tables are comprehensive.
Value: ⭐⭐⭐⭐⭐ The dataset scale and the public release of code and data offer high practical value to the community.