
Instruction-Guided Lesion Segmentation for Chest X-rays with Automatically Generated Large-Scale Dataset

Conference: CVPR 2026
arXiv: 2511.15186
Code: GitHub
Area: Medical Imaging
Keywords: Chest X-ray, lesion segmentation, instruction-guided, automatic dataset construction, vision-language model

TL;DR

This paper introduces instruction-guided lesion segmentation (ILS) for chest X-rays, constructs MIMIC-ILS, the first large-scale automatically generated instruction-answer dataset for the task (1.1M samples, 192K images, 91K masks), and trains the ROSALIA model, which achieves 71.2% gIoU and 91.8% null-target accuracy, substantially outperforming existing general-purpose and medical segmentation models.

Background & Motivation

Chest X-ray (CXR) is one of the most common medical imaging examinations, and lesion localization and boundary delineation represent core tasks for radiologists — yet these are labor-intensive and demand high levels of clinical expertise.

Existing CXR lesion segmentation faces two major bottlenecks:

  1. Limited annotation scale: Existing datasets (VinDr-CXR with 15K images, SIIM-ACR with 13K images) rely on expert manual annotation, which restricts their size, and most provide only bounding boxes or masks for a single lesion type.
  2. High barrier for user input: Existing text-guided segmentation methods require users to provide expert-level detailed descriptions (e.g., "bilateral pulmonary infection with two infected regions…"), rendering them inaccessible to non-expert users.

Key Challenge: How to generate high-quality lesion masks and instruction-answer pairs at scale without manual annotation, while supporting simple and accessible user instructions?

Key Insight: The paper leverages existing image-report paired data in MIMIC-CXR, extracting spatial and textual information from images and reports through a multimodal automated pipeline to produce a fully automatically annotated large-scale ILS dataset.

Method

Overall Architecture

The system consists of two major stages:

  1. Lesion mask generation: Automatically generating lesion segmentation masks from CXR images and radiology reports.
  2. Instruction-answer pair generation: Constructing diverse training samples based on the information extracted in the previous stage.

The ROSALIA model is built upon the LISA architecture, integrating a VLM (LLaVA) and SAM for end-to-end training.

Key Designs

  1. Multimodal Automatic Mask Generation Pipeline:

    • Function: Automatically generates high-quality lesion segmentation masks from raw, unannotated CXR images.
    • Mechanism: A four-step cascaded pipeline:
      1. Report structuring: An LLM converts abnormality descriptions in radiology reports into sextuple representations (entity, sentence index, existence, certainty, location, lesion type), with locations mapped to standard anatomical labels.
      2. Spatial information extraction: Three visual models are applied in parallel: RadEdit (a diffusion model generating an anomaly map \(\mathcal{A}\) by differencing the original image against a synthesized "lesion-free" image), CXAS (an anatomical segmentation model providing anatomical masks \(\{\mathcal{M}_i\}\)), and YOLO (a lesion detector producing bounding boxes \(\{\mathcal{B}_j\}\)).
      3. Mask generation: High-quality candidate boxes are selected via four-condition filtering (anatomical overlap c1, detection confidence c2, anomaly signal ratio c3, and minimum size c4); connected components intersecting with the selected boxes are then extracted and refined.
      4. Location verification: Confirms whether the generated mask successfully localizes the region described in the report, and flags blank locations for negative sample generation.
    • Design Motivation: Multimodal cross-validation ensures mask quality — textual information specifies "where," the visual anomaly map indicates "what is abnormal," and the detection model provides "boundaries." The four-condition filtering effectively eliminates false positives.
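The four-condition filtering step can be sketched as follows. The thresholds, box format, and function name here are illustrative assumptions, not the paper's actual values:

```python
import numpy as np

# Hypothetical thresholds for c1-c4; the paper's actual values are not given here.
TAU_OVERLAP, TAU_CONF, TAU_ANOMALY, TAU_SIZE = 0.5, 0.3, 0.2, 64

def select_boxes(boxes, anat_mask, anomaly_map):
    """Keep candidate boxes (x0, y0, x1, y1, conf) that pass all four conditions."""
    selected = []
    for box in boxes:
        x0, y0, x1, y1, conf = box
        region = np.zeros_like(anat_mask, dtype=bool)
        region[y0:y1, x0:x1] = True
        area = region.sum()
        if area < TAU_SIZE:                      # c4: minimum size
            continue
        if (region & anat_mask).sum() / area < TAU_OVERLAP:  # c1: anatomical overlap
            continue
        if conf < TAU_CONF:                      # c2: detection confidence
            continue
        if anomaly_map[region].mean() < TAU_ANOMALY:         # c3: anomaly signal ratio
            continue
        selected.append(box)
    return selected
```

Connected components of the anomaly map that intersect the surviving boxes would then be extracted and refined into the final mask.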
  2. Instruction-Answer Pair Generation System:

    • Function: Automatically constructs diverse training samples based on localization information.
    • Mechanism: Three instruction types are supported —
      • Basic instructions: Specify lesion type and location (e.g., "Segment the pneumonia in the right lung"), generated only when the mask is successfully localized.
      • Global instructions: Specify only the lesion type (e.g., "Segment the opacity"), generated only when the localized position fully matches the position reported in the report.
      • Lesion reasoning instructions: Replace the specific subtype (pneumonia, atelectasis, or edema) with the generic term "opacity" in the instruction, requiring the model to infer the specific subtype in its answer.
    • Negative sample generation strategy: Lesion types not mentioned or explicitly negated in the report, as well as blank locations substituted for positive-sample locations, are used to construct negative samples.
    • Design Motivation: The three instruction types cover a spectrum of user needs ranging from specific to general. The dynamic generation strategy ensures that only valid instruction-answer pairs are produced, avoiding inconsistencies.
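A minimal sketch of the dynamic pair-generation logic. The record fields and template wording are hypothetical stand-ins; the paper's actual templates are not reproduced here:

```python
# Subtypes that trigger a lesion reasoning instruction (per the paper).
REASONING_TYPES = {"pneumonia", "atelectasis", "edema"}

def make_pairs(record):
    """record fields (hypothetical names): lesion, location, localized,
    location_matches_report. Returns (instruction, answer) pairs."""
    lesion, loc = record["lesion"], record["location"]
    pairs = []
    if record["localized"]:  # basic: only when the mask is successfully localized
        pairs.append((f"Segment the {lesion} in the {loc}.",
                      f"Sure, the {lesion} in the {loc} is [SEG]."))
    if record["location_matches_report"]:  # global: only when locations fully match
        pairs.append((f"Segment the {lesion}.",
                      f"Sure, the {lesion} is [SEG]."))
    if lesion in REASONING_TYPES:  # reasoning: query uses the generic term "opacity"
        pairs.append(("Segment the opacity and identify it.",
                      f"The opacity is {lesion}: [SEG]."))
    return pairs
```

Negative samples would be built analogously, from lesion types negated in the report or from blank locations, with an empty mask as the target.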
  3. ROSALIA Model Architecture:

    • Function: Generates lesion segmentation masks and textual descriptions conditioned on user instructions.
    • Mechanism: Built upon LISA-7B, the VLM (LLaVA) processes image and instruction inputs to produce a special [SEG] token and textual description. The hidden embedding of the [SEG] token is passed to SAM-H's mask decoder to generate the final segmentation mask.
    • Training strategy: LoRA fine-tuning of the VLM (rank=128, alpha=256); full fine-tuning of the mask decoder. Trained for 15 epochs with AdamW, batch size 256, and a 1:1 positive-to-negative sample ratio.
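The LISA-style [SEG]-token mechanism can be illustrated with a toy NumPy sketch. The dimensions, token id, and single projection matrix are stand-ins; the real model uses LLaVA's hidden size (4096 for the 7B model) and a learned projection into SAM-H's prompt-embedding space:

```python
import numpy as np

rng = np.random.default_rng(0)

HIDDEN, SAM_DIM = 8, 4   # toy sizes, not the real model dimensions
SEG_TOKEN_ID = 99        # hypothetical vocabulary id for the [SEG] token

W_proj = rng.normal(size=(HIDDEN, SAM_DIM))  # learned projection (random here)

def seg_prompt(token_ids, hidden_states):
    """Take the hidden state at the [SEG] position and project it into the
    mask decoder's prompt-embedding space."""
    pos = token_ids.index(SEG_TOKEN_ID)
    return hidden_states[pos] @ W_proj

token_ids = [1, 5, SEG_TOKEN_ID, 2]                 # generated sequence with [SEG]
hidden = rng.normal(size=(len(token_ids), HIDDEN))  # last-layer hidden states
prompt = seg_prompt(token_ids, hidden)              # fed to SAM's mask decoder
```

The key design choice is that the VLM communicates with the segmentation head through a single embedding, so the language model stays free-form while segmentation remains differentiable end-to-end.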

Loss & Training

\[\mathcal{L} = \lambda_{txt}\mathcal{L}_{txt} + \lambda_{bce}\mathcal{L}_{bce} + \lambda_{dice}\mathcal{L}_{dice}\]
  • \(\mathcal{L}_{txt}\): Autoregressive cross-entropy loss (text generation), \(\lambda_{txt}=0.5\)
  • \(\mathcal{L}_{bce}\): Binary cross-entropy loss (segmentation), \(\lambda_{bce}=5\)
  • \(\mathcal{L}_{dice}\): Dice loss (segmentation, computed on positive samples only), \(\lambda_{dice}=1\)
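Under the stated weights, the combined objective can be sketched as below. This is a NumPy stand-in operating on probabilities; the actual model would compute these terms on logits in its training framework:

```python
import numpy as np

LAMBDA_TXT, LAMBDA_BCE, LAMBDA_DICE = 0.5, 5.0, 1.0  # weights from the paper
EPS = 1e-6

def bce(pred, target):
    """Per-pixel binary cross-entropy on probabilities."""
    pred = np.clip(pred, EPS, 1 - EPS)
    return -(target * np.log(pred) + (1 - target) * np.log(1 - pred)).mean()

def dice_loss(pred, target):
    """1 - Dice coefficient, smoothed by EPS."""
    inter = (pred * target).sum()
    return 1 - (2 * inter + EPS) / (pred.sum() + target.sum() + EPS)

def total_loss(txt_loss, pred_mask, gt_mask, is_positive):
    loss = LAMBDA_TXT * txt_loss + LAMBDA_BCE * bce(pred_mask, gt_mask)
    if is_positive:  # Dice is computed on positive samples only
        loss += LAMBDA_DICE * dice_loss(pred_mask, gt_mask)
    return loss
```

Skipping Dice on negative samples avoids the degenerate 0/0 case when both the prediction and the ground truth are empty.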

Key Experimental Results

Main Results

Model           gIoU    cIoU    N-Acc.  Note
LISA-7B         8.3%    12.8%   0.7%    General domain
LISA-13B        8.9%    12.2%   0.0%    General domain
Text4Seg        6.1%    10.3%   20.6%   General domain
BiomedParse     23.8%   18.5%   0.6%    Medical domain
RecLMIS         22.4%   19.5%   0.0%    Medical domain
ROSALIA (Ours)  71.2%   75.6%   91.8%   Trained on MIMIC-ILS
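The gIoU/cIoU/N-Acc metrics are not defined in this note; the sketch below follows the conventions common in referring-segmentation work (per-sample mean IoU, cumulative intersection over cumulative union, and accuracy on null targets). Treating an empty prediction on a null target as IoU 1 is an assumption:

```python
import numpy as np

def evaluate(preds, gts):
    """preds, gts: lists of boolean masks. Returns gIoU, cIoU, and N-Acc."""
    ious, inter_sum, union_sum = [], 0, 0
    null_total = null_correct = 0
    for p, g in zip(preds, gts):
        if g.sum() == 0:  # null target: correct iff the prediction is also empty
            null_total += 1
            null_correct += int(p.sum() == 0)
            ious.append(1.0 if p.sum() == 0 else 0.0)
            continue
        inter = (p & g).sum()
        union = (p | g).sum()
        ious.append(inter / union)   # per-sample IoU for gIoU
        inter_sum += inter           # accumulated for cIoU
        union_sum += union
    return {
        "gIoU": float(np.mean(ious)),
        "cIoU": inter_sum / union_sum if union_sum else 0.0,
        "N-Acc": null_correct / null_total if null_total else 1.0,
    }
```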

Per-Lesion Performance

Lesion Type       gIoU    cIoU    N-Acc.
Cardiomegaly      89.0%   89.0%   85.8%
Pneumonia         57.2%   60.4%   97.1%
Atelectasis       60.2%   58.7%   91.7%
Opacity           60.5%   64.2%   85.0%
Consolidation     61.9%   65.6%   91.2%
Edema             64.8%   66.6%   92.2%
Pleural Effusion  60.3%   59.6%   90.4%

Ablation Study — Dataset Quality Evaluation

Expert    Overall Acceptance  Positive Acceptance  Negative Acceptance
Expert A  96.1%               95.6%                96.5%
Expert B  97.2%               96.0%                98.3%
Expert C  98.7%               99.8%                97.8%
Expert D  97.6%               96.9%                98.2%
Overall   96.4%               90.1%                97.7%

Key Findings

  • Existing general-purpose and medical-domain segmentation models systematically fail on the ILS task: all score below 24% gIoU, and all but Text4Seg (20.6%) have near-zero null-target accuracy.
  • The fully automatically generated dataset received an overall acceptance rate of 96.4% from four radiation oncology experts.
  • Text response accuracy is 94.4%, with basic instructions achieving the highest score of 96.8%; lesion reasoning instructions reach 84.8%, leaving room for improvement.
  • Cardiomegaly achieves the best segmentation (gIoU 89.0%) due to the use of cardiac masks as annotations; focal lesions such as pneumonia score somewhat lower.

Highlights & Insights

  • The fully automatic dataset construction pipeline is the core contribution, achieving annotation quality comparable to manual labeling through multimodal cross-validation.
  • The ILS task definition is clinically practical: it supports simple user instructions rather than expert-level descriptions and handles null-target detection ("no lesion found").
  • The dataset scale (1.1M samples) is 10–100× larger than existing CXR segmentation datasets.
  • The use of RadEdit to generate anomaly maps is elegant — a diffusion model synthesizes a "normal" image, and differencing localizes abnormal regions.

Limitations & Future Work

  • Annotation quality: The positive-sample acceptance rate of 90.1% falls below the 97.7% for negative samples, indicating that the precision of positive-sample annotations requires further improvement.
  • Only 7 major lesion types are covered; CXR encompasses a broader range of fine-grained abnormalities.
  • Accuracy on lesion reasoning tasks (opacity → specific subtype) is relatively low at 75.1%.
  • The pipeline depends on three pretrained models (RadEdit, CXAS, YOLO); failure of any single model may degrade overall quality.
  • Validation is conducted solely on MIMIC-CXR; CXR style variation across institutions may limit generalizability.

Comparison with Baselines

  • vs. BiomedParse: Although a medical-domain model, it supports only class-label prompts and cannot handle instruction-level inputs or null-target detection.
  • vs. RecLMIS: Requires users to provide expert-level descriptions (e.g., "bilateral pulmonary infection…"), imposing a high barrier to use.
  • vs. LISA: ROSALIA is built on the LISA architecture, but fine-tuning on MIMIC-ILS boosts performance from 8.3% to 71.2%, demonstrating the critical importance of task-specific data.

Rating

  • Novelty: ⭐⭐⭐⭐ Both the fully automatic dataset construction pipeline and the ILS task formulation are innovative.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Multiple baselines, per-lesion evaluation, and expert quality validation are provided, though cross-dataset generalization experiments are absent.
  • Writing Quality: ⭐⭐⭐⭐ The problem definition is clear, the pipeline is described in detail, and figures and tables are comprehensive.
  • Value: ⭐⭐⭐⭐⭐ The dataset scale and the public release of code and data offer high practical value to the community.