MEDISEG: A Medication Image Instance Segmentation Dataset for Preventing Adverse Drug Events¶
Conference: CVPR 2026 arXiv: 2603.10825 Code: github.com/williamcwi/MEDISEG Area: Instance Segmentation / Medical Imaging / Dataset Keywords: Medication recognition, instance segmentation, few-shot detection, dataset, medication safety
TL;DR¶
This work introduces MEDISEG, a medication image instance segmentation dataset (8,262 images, 32 pill classes, with real-world occlusion/overlap scenarios). YOLOv9 achieves 99.5% mAP@50 on the 3-class subset and 80.1% on the 32-class subset (YOLOv8: 99.4% and 62.2%). FsDet few-shot experiments show that MEDISEG pretraining substantially outperforms CURE pretraining in occluded scenarios (1-shot fg_cls_acc: 0.406 vs. 0.131).
Background & Motivation¶
Background: Medication errors and adverse drug events (ADEs) pose serious threats to patient safety — ADEs accounted for 8.9% of deaths from adverse effects of medical treatment (AEMT) between 1980 and 2014, and the rate continues to rise. More than one-third of adults aged 75–85 take five or more prescription medications daily. AI-based pill recognition is a promising solution, yet progress is constrained by the quality of available datasets.
Limitations of Prior Work: Existing pill datasets suffer from three major shortcomings: (1) NIH Pillbox (the largest, with 133K images) was discontinued in 2021 and lacks instance segmentation annotations; (2) CURE provides partial instance segmentation annotations that are incomplete and include synthetic images; (3) all existing datasets predominantly feature single-pill images captured in controlled environments, failing to reflect real-world scenarios such as dosette boxes with multiple overlapping pills.
Key Challenge: In clinical and home settings, pills are routinely arranged in stacked, multi-pill configurations inside dosette boxes, requiring instance-level segmentation to distinguish individual pills — yet existing datasets consist almost exclusively of single-pill images.
Goal: To construct a pill image dataset that encompasses realistic multi-pill scenarios (overlap, occlusion, dosette boxes) with complete instance segmentation annotations, and to validate its contribution to few-shot generalization.
Key Insight: The dataset is designed from clinical needs — dosette boxes naturally produce multi-pill overlap and occlusion, while visually similar pill classes are deliberately selected to increase recognition difficulty.
Core Idea: Construct an instance segmentation dataset covering real-world multi-pill scenarios (occlusion/overlap/dosette boxes), and demonstrate that scene complexity — rather than sheer data volume — is the key factor for few-shot generalization.
Method¶
Overall Architecture¶
MEDISEG comprises two subsets: 3-Pills (3 pill classes) and 32-Pills (8,262 images, 32 classes). The pipeline proceeds as follows: capture with an iPhone 12 Pro Max → crop the dosette box into individual compartment images → resize to 640×640 → manually annotate instance segmentation masks with COCO Annotator → validate via supervised YOLOv8/v9 training → evaluate with FsDet few-shot detection.
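The COCO-style masks produced in the annotation step follow the standard COCO JSON schema. A minimal sketch of one annotated pill instance (field names follow the COCO format; the file name, category name, and coordinate values are illustrative, not taken from the paper):

```python
import json

# One image, one category, one pill instance in COCO instance-segmentation form.
# All concrete values below are illustrative placeholders.
annotation = {
    "images": [
        {"id": 1, "file_name": "compartment_r2c5.jpg", "width": 640, "height": 640}
    ],
    "categories": [
        {"id": 1, "name": "pill_A", "supercategory": "pill"}
    ],
    "annotations": [
        {
            "id": 1,
            "image_id": 1,
            "category_id": 1,
            # Polygon mask as a flat [x1, y1, x2, y2, ...] list (COCO convention);
            # real pill masks would have many more vertices than this rectangle.
            "segmentation": [[120.0, 80.0, 200.0, 80.0, 200.0, 160.0, 120.0, 160.0]],
            "bbox": [120.0, 80.0, 80.0, 80.0],  # [x, y, width, height]
            "area": 6400.0,
            "iscrowd": 0,
        }
    ],
}

print(json.dumps(annotation, indent=2))
```

Each overlapping pill in a compartment gets its own entry under `"annotations"`, which is what makes instance-level (rather than semantic) segmentation possible on occluded scenes.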
Key Designs¶
- Real-World Acquisition Strategy:
- Function: Pills are arranged in standard 4-row × 7-column dosette boxes and photographed.
- Mechanism: The compartments naturally produce pill overlap, occlusion, and background clutter. Varying lighting intensity and angle introduces realistic shadows, reflections, and highlights. Each image contains 1–13 pills, spanning single-pill to dense multi-pill scenes.
- Design Motivation: To simulate real clinical and home medication scenarios rather than controlled laboratory settings. Dosette boxes are a widely used medication management tool among elderly patients.
- Fine-Grained Category Design:
- Function: The 32 pill classes deliberately include visually highly similar categories.
- Mechanism: Pill A and Pill B share similar shapes but differ in color; Pill B and Pill C share similar colors but differ in shape — compelling the model to learn both shape and color features simultaneously. The 32-Pills subset includes multiple small white pills that are indistinguishable by coarse-grained features alone.
- Design Motivation: To prevent models from relying on simple cues (e.g., color) while overlooking fine morphological differences, better reflecting the clinical challenge of recognizing large numbers of white or featureless pills.
- Few-Shot Evaluation Protocol:
- Function: FsDet (Faster R-CNN + ResNet + FPN) is used to conduct few-shot detection evaluation with base/novel class splits.
- Mechanism: MEDISEG and CURE are used separately as base training sets; the backbone and RPN are frozen while only ROI heads are fine-tuned; evaluation is performed under 1/5/10-shot settings. A dedicated "occlusion-only" test set is constructed to assess extreme scenarios.
- Design Motivation: To validate that dataset scene complexity — rather than data volume alone — contributes to few-shot representation transfer.
Loss & Training¶
- YOLOv8/v9: Genetic algorithm hyperparameter search over 70 epochs (lr=0.01009, momentum=0.94, weight_decay=0.00048), best fitness=0.81253.
- Data split: 70% train / 20% val / 10% test.
- Few-shot: backbone and RPN are frozen; only ROI heads are fine-tuned with re-initialized classification layers.
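The 70/20/10 split above can be sketched as follows (a seeded shuffle over image ids is assumed; the paper does not describe the splitting code):

```python
import random

def split_dataset(image_ids, seed=0):
    """Shuffle image ids and split them 70% train / 20% val / 10% test."""
    ids = list(image_ids)
    random.Random(seed).shuffle(ids)  # fixed seed for a reproducible split
    n = len(ids)
    n_train = int(0.7 * n)
    n_val = int(0.2 * n)
    # The remainder goes to test, so every image is assigned exactly once.
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]

train, val, test = split_dataset(range(8262))
print(len(train), len(val), len(test))  # 5783 1652 827
```

For a dataset with heavy per-image overlap like MEDISEG, splitting at the image level (rather than the instance level) keeps all pills from one photo in the same split and avoids leakage between train and test.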
Key Experimental Results¶
Main Results¶
| Dataset | Model | mAP@50 | mAP@50-95 | Precision | Recall |
|---|---|---|---|---|---|
| 3-Pills | YOLOv8 | 99.4% | 95.0% | 99.7% | 99.7% |
| 3-Pills | YOLOv9 | 99.5% | 96.5% | 99.6% | 99.8% |
| 32-Pills | YOLOv8 | 62.2% | 50.9% | 62.8% | 57.4% |
| 32-Pills | YOLOv9 | 80.1% | 68.4% | 81.2% | 73.7% |
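mAP@50 counts a prediction as a true positive only when its IoU with a ground-truth box exceeds 0.5; mAP@50-95 averages over thresholds from 0.5 to 0.95. The underlying IoU computation is just the following (a minimal sketch, not the authors' code):

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes in [x1, y1, x2, y2] form."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# Stacked pills produce partially overlapping boxes like these:
print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # ~0.143, below the 0.5 matching threshold
```

This is why dense multi-pill compartments are hard: a detector that merges two stacked pills into one box fails the IoU-0.5 match against both ground-truth instances.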
Ablation Study¶
| Few-Shot Setting | MEDISEG fg_cls_acc | CURE fg_cls_acc | Gain |
|---|---|---|---|
| 1-shot (occlusion set) | 0.406 | 0.131 | 3.1× |
| 5-shot (occlusion set) | 0.625 | 0.372 | 1.7× |
| 10-shot (occlusion set) | 0.740 | 0.558 | 1.3× |
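The "Gain" column is simply the ratio of the two foreground classification accuracies, rounded to one decimal place (recomputed here as a sanity check on the table values):

```python
# shots -> (MEDISEG fg_cls_acc, CURE fg_cls_acc), values from the table above
pairs = {1: (0.406, 0.131), 5: (0.625, 0.372), 10: (0.740, 0.558)}

for shots, (mediseg, cure) in pairs.items():
    print(f"{shots}-shot gain: {mediseg / cure:.1f}x")
```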
Dataset Comparison¶
| Dimension | MEDISEG | NIH Pillbox | CURE |
|---|---|---|---|
| # Images | 8,262 | 133,774 | ~1,000 |
| # Classes | 32 | 4,392 | 196 |
| Instance segmentation | ✓ Complete | ✗ | Partial |
| Multi-pill scenes | ✓ (1–13 pills/image) | ✗ | ✗ |
| Publicly available | ✓ (CC BY 4.0) | Discontinued | ✓ |
Key Findings¶
- YOLOv9 substantially outperforms YOLOv8 on 32-Pills (mAP@50: 80.1% vs. 62.2%), indicating that stronger feature fusion is critical for fine-grained recognition.
- Misclassifications primarily stem from side-view images of visually similar pills — top-view images are distinguishable, while side profiles are nearly identical.
- Few-shot performance gaps are most pronounced on the occlusion-only test set, demonstrating that multi-pill scene pretraining on MEDISEG confers greater occlusion robustness.
- The advantage of MEDISEG diminishes as the number of shots increases (3.1×→1.3×), suggesting its benefit is most critical in extremely low-data regimes.
Highlights & Insights¶
- The dataset design philosophy is well-considered: deliberately constructing visually similar classes and realistic occlusion scenarios rather than simply accumulating data volume.
- The few-shot evaluation goes beyond standard test sets by constructing a dedicated occlusion-only subset for stress testing — the evaluation methodology is more instructive than the results themselves.
- An interesting finding is validated: the scene complexity of training data — rather than sheer volume — is the key factor for few-shot generalization.
- CC BY 4.0 open license + COCO format + GitHub code = high usability.
Limitations & Future Work¶
- The dataset covers only 32 pill classes, whereas clinical settings involve thousands of distinct drug types.
- Images were captured exclusively with an iPhone 12 Pro Max, leaving cross-device generalization unvalidated.
- Variation in dynamic lighting conditions and background diversity remains limited.
- Evaluation is restricted to object detection and segmentation; performance on semantic understanding tasks has not been assessed.
- No prospective validation in real clinical environments has been conducted.
Related Work & Insights¶
- vs. NIH Pillbox: The largest pill dataset (133K images), but discontinued and lacking instance segmentation annotations. MEDISEG is smaller in scale but offers complete annotations and multi-pill scenes.
- vs. CURE: The closest competing dataset, with partial segmentation annotations that are incomplete and include synthetic images. Few-shot experiments directly demonstrate the advantage of real occluded scenes.
- Design Insight: The concept of constructing an occlusion-only evaluation subset is worth adopting — designing targeted hard cases is more convincing than reporting numbers on standard benchmarks.
Rating¶
- ⭐⭐⭐ Novelty: Primarily a dataset contribution with no significant methodological innovation, though the dataset design reflects careful consideration.
- ⭐⭐⭐⭐ Experimental Thoroughness: Multi-model, multi-subset, few-shot evaluation, and hyperparameter search yield solid validation.
- ⭐⭐⭐⭐ Writing Quality: Follows standard dataset paper conventions with clear structure and detailed tables.
- ⭐⭐⭐ Value: Practically valuable for medication safety AI, though domain scope is relatively narrow.