MEDISEG: A Medication Image Instance Segmentation Dataset for Preventing Adverse Drug Events¶
Conference: CVPR 2026 arXiv: 2603.10825 Code: github.com/williamcwi/MEDISEG Area: Instance Segmentation / Medical Imaging / Dataset Keywords: Medication recognition, instance segmentation, few-shot detection, dataset, medication safety
TL;DR¶
This work introduces MEDISEG, a medication image instance segmentation dataset (8,262 images, 32 pill classes, with real-world occlusion/overlap scenarios). YOLOv9 achieves 99.5% mAP@50 on the 3-class subset and 80.1% on the 32-class subset (YOLOv8: 99.4% and 62.2%). FsDet few-shot experiments show that MEDISEG pretraining substantially outperforms CURE pretraining in occluded scenarios (1-shot fg_cls_acc: 0.406 vs. 0.131).
Background & Motivation¶
Background: Medication errors and adverse drug events (ADEs) pose serious threats to patient safety — ADEs accounted for 8.9% of deaths from adverse effects of medical treatment (AEMT) between 1980 and 2014, and the rate continues to rise. More than one-third of adults aged 75–85 take five or more prescription medications daily. AI-based pill recognition is a promising solution, yet progress is constrained by the quality of available datasets.
Limitations of Prior Work: Existing pill datasets suffer from three major shortcomings: (1) NIH Pillbox (the largest, with 133K images) was discontinued in 2021 and lacks instance segmentation annotations; (2) CURE provides partial instance segmentation annotations that are incomplete and include synthetic images; (3) all existing datasets predominantly feature single-pill images captured in controlled environments, failing to reflect real-world scenarios such as dosette boxes with multiple overlapping pills.
Key Challenge: In clinical and home settings, pills are routinely arranged in stacked, multi-pill configurations inside dosette boxes, requiring instance-level segmentation to distinguish individual pills — yet existing datasets consist almost exclusively of single-pill images.
Goal: To construct a pill image dataset that encompasses realistic multi-pill scenarios (overlap, occlusion, dosette boxes) with complete instance segmentation annotations, and to validate its contribution to few-shot generalization.
Key Insight: The dataset is designed from clinical needs — dosette boxes naturally produce multi-pill overlap and occlusion, while visually similar pill classes are deliberately selected to increase recognition difficulty.
Core Idea: Construct an instance segmentation dataset covering real-world multi-pill scenarios (occlusion/overlap/dosette boxes), and demonstrate that scene complexity — rather than sheer data volume — is the key factor for few-shot generalization.
Method¶
Overall Architecture¶
MEDISEG comprises two subsets: 3-Pills (3 pill classes) and 32-Pills (8,262 images, 32 classes). The pipeline proceeds as follows: capture with an iPhone 12 Pro Max → crop the dosette box into individual compartment images → resize to 640×640 → manually annotate instance segmentation masks with COCO Annotator → validate via supervised YOLOv8/v9 training → evaluate with FsDet few-shot detection.
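The COCO-style masks produced in the annotation step follow the standard COCO JSON schema. A minimal sketch of one annotated pill instance (field names follow the COCO format; the file name, category name, and coordinate values are illustrative, not taken from the paper):

```python
import json

# One image, one category, one pill instance in COCO instance-segmentation form.
# All concrete values below are illustrative placeholders.
annotation = {
    "images": [
        {"id": 1, "file_name": "compartment_r2c5.jpg", "width": 640, "height": 640}
    ],
    "categories": [
        {"id": 1, "name": "pill_A", "supercategory": "pill"}
    ],
    "annotations": [
        {
            "id": 1,
            "image_id": 1,
            "category_id": 1,
            # Polygon mask as a flat [x1, y1, x2, y2, ...] list (COCO convention);
            # real pill masks would have many more vertices than this rectangle.
            "segmentation": [[120.0, 80.0, 200.0, 80.0, 200.0, 160.0, 120.0, 160.0]],
            "bbox": [120.0, 80.0, 80.0, 80.0],  # [x, y, width, height]
            "area": 6400.0,
            "iscrowd": 0,
        }
    ],
}

print(json.dumps(annotation, indent=2))
```

Each overlapping pill in a compartment gets its own entry under `"annotations"`, which is what makes instance-level (rather than semantic) segmentation possible on occluded scenes.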
Key Designs¶
- Real-World Acquisition Strategy:
- Function: Pills are arranged in standard 4-row × 7-column dosette boxes and photographed.
- Mechanism: The compartments naturally produce pill overlap, occlusion, and background clutter. Varying lighting intensity and angle introduces realistic shadows, reflections, and highlights. Each image contains 1–13 pills, spanning single-pill to dense multi-pill scenes.
- Design Motivation: To simulate real clinical and home medication scenarios rather than controlled laboratory settings. Dosette boxes are a widely used medication management tool among elderly patients.
- Fine-Grained Category Design:
- Function: The 32 pill classes deliberately include visually highly similar categories.
- Mechanism: Pill A and Pill B share similar shapes but differ in color; Pill B and Pill C share similar colors but differ in shape — compelling the model to learn both shape and color features simultaneously. The 32-Pills subset includes multiple small white pills that are indistinguishable by coarse-grained features alone.
- Design Motivation: To prevent models from relying on simple cues (e.g., color) while overlooking fine morphological differences, better reflecting the clinical challenge of recognizing large numbers of white or featureless pills.
- Few-Shot Evaluation Protocol:
- Function: FsDet (Faster R-CNN + ResNet + FPN) is used to conduct few-shot detection evaluation with base/novel class splits.
- Mechanism: MEDISEG and CURE are used separately as base training sets; the backbone and RPN are frozen while only ROI heads are fine-tuned; evaluation is performed under 1/5/10-shot settings. A dedicated "occlusion-only" test set is constructed to assess extreme scenarios.
- Design Motivation: To validate that dataset scene complexity — rather than data volume alone — contributes to few-shot representation transfer.
Loss & Training¶
- YOLOv8/v9: Genetic algorithm hyperparameter search over 70 epochs (lr=0.01009, momentum=0.94, weight_decay=0.00048), best fitness=0.81253.
- Data split: 70% train / 20% val / 10% test.
- Few-shot: backbone and RPN are frozen; only ROI heads are fine-tuned with re-initialized classification layers.
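The 70/20/10 split above can be sketched as follows (a seeded shuffle over image ids is assumed; the paper does not describe the splitting code):

```python
import random

def split_dataset(image_ids, seed=0):
    """Shuffle image ids and split them 70% train / 20% val / 10% test."""
    ids = list(image_ids)
    random.Random(seed).shuffle(ids)  # fixed seed for a reproducible split
    n = len(ids)
    n_train = int(0.7 * n)
    n_val = int(0.2 * n)
    # The remainder goes to test, so every image is assigned exactly once.
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]

train, val, test = split_dataset(range(8262))
print(len(train), len(val), len(test))  # 5783 1652 827
```

For a dataset with heavy per-image overlap like MEDISEG, splitting at the image level (rather than the instance level) keeps all pills from one photo in the same split and avoids leakage between train and test.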
Key Experimental Results¶
Main Results¶
| Dataset | Model | mAP@50 | mAP@50-95 | Precision | Recall |
|---|---|---|---|---|---|
| 3-Pills | YOLOv8 | 99.4% | 95.0% | 99.7% | 99.7% |
| 3-Pills | YOLOv9 | 99.5% | 96.5% | 99.6% | 99.8% |
| 32-Pills | YOLOv8 | 62.2% | 50.9% | 62.8% | 57.4% |
| 32-Pills | YOLOv9 | 80.1% | 68.4% | 81.2% | 73.7% |
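mAP@50 counts a prediction as a true positive only when its IoU with a ground-truth box exceeds 0.5; mAP@50-95 averages over thresholds from 0.5 to 0.95. The underlying IoU computation is just the following (a minimal sketch, not the authors' code):

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes in [x1, y1, x2, y2] form."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# Stacked pills produce partially overlapping boxes like these:
print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # ~0.143, below the 0.5 matching threshold
```

This is why dense multi-pill compartments are hard: a detector that merges two stacked pills into one box fails the IoU-0.5 match against both ground-truth instances.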
Ablation Study¶
| Few-Shot Setting | MEDISEG fg_cls_acc | CURE fg_cls_acc | Gain |
|---|---|---|---|
| 1-shot (occlusion set) | 0.406 | 0.131 | 3.1× |
| 5-shot (occlusion set) | 0.625 | 0.372 | 1.7× |
| 10-shot (occlusion set) | 0.740 | 0.558 | 1.3× |
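The "Gain" column is simply the ratio of the two foreground classification accuracies, rounded to one decimal place (recomputed here as a sanity check on the table values):

```python
# shots -> (MEDISEG fg_cls_acc, CURE fg_cls_acc), values from the table above
pairs = {1: (0.406, 0.131), 5: (0.625, 0.372), 10: (0.740, 0.558)}

for shots, (mediseg, cure) in pairs.items():
    print(f"{shots}-shot gain: {mediseg / cure:.1f}x")
```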
Dataset Comparison¶
| Dimension | MEDISEG | NIH Pillbox | CURE |
|---|---|---|---|
| # Images | 8,262 | 133,774 | ~1,000 |
| # Classes | 32 | 4,392 | 196 |
| Instance segmentation | ✓ Complete | ✗ | Partial |
| Multi-pill scenes | ✓ (1–13 pills/image) | ✗ | ✗ |
| Publicly available | ✓ (CC BY 4.0) | Discontinued | ✓ |
Key Findings¶
- YOLOv9 substantially outperforms YOLOv8 on 32-Pills (mAP@50: 80.1% vs. 62.2%), indicating that stronger feature fusion is critical for fine-grained recognition.
- Misclassifications primarily stem from side-view images of visually similar pills — top-view images are distinguishable, while side profiles are nearly identical.
- Few-shot performance gaps are most pronounced on the occlusion-only test set, demonstrating that multi-pill scene pretraining on MEDISEG confers greater occlusion robustness.
- The advantage of MEDISEG diminishes as the number of shots increases (3.1×→1.3×), suggesting its benefit is most critical in extremely low-data regimes.
Highlights & Insights¶
- The dataset design philosophy is well-considered: deliberately constructing visually similar classes and realistic occlusion scenarios rather than simply accumulating data volume.
- The few-shot evaluation goes beyond standard test sets by constructing a dedicated occlusion-only subset for stress testing — the evaluation methodology is more instructive than the results themselves.
- An interesting finding is validated: the scene complexity of training data — rather than sheer volume — is the key factor for few-shot generalization.
- CC BY 4.0 open license + COCO format + GitHub code = high usability.
Limitations & Future Work¶
- The dataset covers only 32 pill classes, whereas clinical settings involve thousands of distinct drug types.
- Images were captured exclusively with an iPhone 12 Pro Max, leaving cross-device generalization unvalidated.
- Variation in dynamic lighting conditions and background diversity remains limited.
- Evaluation is restricted to object detection and segmentation; performance on semantic understanding tasks has not been assessed.
- No prospective validation in real clinical environments has been conducted.
Related Work & Insights¶
- vs. NIH Pillbox: The largest pill dataset (133K images), but discontinued and lacking instance segmentation annotations. MEDISEG is smaller in scale but offers complete annotations and multi-pill scenes.
- vs. CURE: The closest competing dataset, with partial segmentation annotations that are incomplete and include synthetic images. Few-shot experiments directly demonstrate the advantage of real occluded scenes.
- Design Insight: The concept of constructing an occlusion-only evaluation subset is worth adopting — designing targeted hard cases is more convincing than reporting numbers on standard benchmarks.
Rating¶
- ⭐⭐⭐ Novelty: Primarily a dataset contribution with no significant methodological innovation, though the dataset design reflects careful consideration.
- ⭐⭐⭐⭐ Experimental Thoroughness: Multi-model, multi-subset, few-shot evaluation, and hyperparameter search yield solid validation.
- ⭐⭐⭐⭐ Writing Quality: Follows standard dataset paper conventions with clear structure and detailed tables.
- ⭐⭐⭐ Value: Practically valuable for medication safety AI, though domain scope is relatively narrow.