Evaluating Few-Shot Pill Recognition Under Visual Domain Shift¶
Conference: CVPR 2026 · arXiv: 2603.10833 · Code: N/A · Area: Object Detection · Keywords: Few-shot learning, pill recognition, domain shift, object detection, deployment robustness
TL;DR¶
This paper systematically evaluates few-shot pill recognition under cross-dataset domain shift from a deployment-oriented perspective. It reveals a decoupling phenomenon in which semantic classification saturates at 1-shot while localization and recall degrade sharply under occlusion and overlap, and demonstrates that the visual realism of training data is the dominant factor governing few-shot generalization.
Background & Motivation¶
Adverse drug events (ADEs) are a significant source of preventable harm, motivating the development of automated pill recognition systems. Real-world deployment, however, faces substantial challenges:
- Visual complexity: Pills in practice are often stored in pill organizers, introducing cluttered scenes, overlapping occlusions, and specular reflections.
- Annotation scarcity: Constructing large-scale annotated datasets in medical settings is costly.
- Evaluation distortion: Most existing few-shot pill recognition studies are conducted under controlled conditions where training and test distributions are close, thereby overestimating real-world deployment robustness.
- Absence of cross-dataset evaluation: Systematic cross-dataset evaluation under domain shift is exceedingly rare in few-shot object detection.
The central objective of this work is not to propose a new architecture, but to systematically examine the generalization behavior and failure modes of few-shot pill recognition from a deployment diagnostic perspective.
Method¶
Overall Architecture¶
A two-stage few-shot object detection framework (FsDet, built on Faster R-CNN) is adopted:
- Base training stage: A detector is trained on base categories to learn general visual representations, region proposal mechanisms, and feature embeddings.
- Few-shot fine-tuning stage: The detector is fine-tuned using a small number of annotated samples from novel categories.
Key Designs¶
- Cross-domain evaluation protocol:
- Base training is performed on either the CURE or MEDISEG dataset.
- Few-shot fine-tuning and evaluation are conducted on entirely separate, unseen deployment datasets.
- Strict data leakage prevention is enforced across all three stages.
- A 5-way K-shot setting is used (\(K \in \{1, 5, 10\}\)).
- Contrastive design of two base training datasets:
- CURE: 8,973 images, 196 classes, single pill per image, captured under controlled conditions without occlusion — visually simple.
- MEDISEG: 8,262 images, 32 classes, multiple pill instances per image, pill organizer scenes with occlusion and clutter — visually realistic.
- These two datasets are deliberately selected for their stark contrast to investigate the effect of base-domain realism on few-shot adaptation.
- Classification-centric, error-driven evaluation metrics:
- Average precision (AP) is not used as the primary metric, as heterogeneous annotation granularity renders AP incomparable across datasets.
- Core metrics: foreground classification accuracy (FG-Acc), false negative rate (FN rate), classification loss, RPN loss, and total loss.
- These metrics isolate semantic recognition from localization artifacts and remain fairly comparable under annotation heterogeneity.
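The two core metrics are simple ratios, which is what makes them robust to annotation heterogeneity. A minimal sketch of how FG-Acc and FN rate could be computed (hypothetical helper names and a toy matching representation, not the authors' code):

```python
def fg_accuracy(matched_preds):
    """FG-Acc: fraction of foreground proposals assigned the correct class.

    `matched_preds` is a hypothetical list of (predicted_class, true_class)
    pairs for proposals already matched to ground-truth foreground boxes.
    """
    if not matched_preds:
        return 0.0
    correct = sum(1 for pred, true in matched_preds if pred == true)
    return correct / len(matched_preds)


def fn_rate(num_ground_truth, num_detected):
    """FN rate: fraction of ground-truth pills with no matching detection."""
    if num_ground_truth == 0:
        return 0.0
    return (num_ground_truth - num_detected) / num_ground_truth


# Toy example: 9 of 10 matched proposals classified correctly,
# and 1 of 10 ground-truth pills missed entirely.
preds = [(0, 0)] * 9 + [(1, 0)]
print(fg_accuracy(preds))  # 0.9
print(fn_rate(10, 9))      # 0.1
```

Because both quantities are normalized per matched proposal or per ground-truth instance, they stay comparable even when one dataset uses whole-image boxes and the other uses tight per-pill boxes.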
Loss & Training¶
- Few-shot fine-tuning: SGD with momentum 0.9, weight decay \(10^{-4}\), fixed learning rate \(10^{-3}\).
- All shot settings are trained for a fixed 2,000 iterations with no early stopping.
- The backbone (ResNet + FPN) is frozen; the RPN is partially trainable; ROI heads are fully fine-tuned.
- The classification layer is re-initialized for novel categories.
- Base training data is not revisited, and no additional data augmentation is applied.
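The freezing scheme reduces to a name-prefix rule over the detector's parameters. A sketch under assumed Detectron2-style naming (`backbone.`, `proposal_generator.`, `roi_heads.` are conventions of that codebase, not confirmed details of the paper's implementation):

```python
def is_trainable(param_name):
    """Decide whether a parameter is updated during few-shot fine-tuning.

    Assumed naming convention (Detectron2-style; hypothetical here):
      - backbone.*           -> frozen (ResNet + FPN)
      - proposal_generator.* -> RPN partially trainable: only prediction heads
      - roi_heads.*          -> fully fine-tuned (classifier re-initialized)
    """
    if param_name.startswith("backbone."):
        return False
    if param_name.startswith("proposal_generator."):
        # "Partially trainable" RPN: keep only the RPN head live.
        return ".rpn_head." in param_name
    return True  # ROI heads (and any remaining parameters)


params = [
    "backbone.res4.conv1.weight",
    "proposal_generator.rpn_head.objectness_logits.weight",
    "proposal_generator.anchor_generator.cell_anchors",
    "roi_heads.box_predictor.cls_score.weight",
]
print([p for p in params if is_trainable(p)])
```

The filtered list would then be handed to SGD (momentum 0.9, weight decay \(10^{-4}\), fixed learning rate \(10^{-3}\)) for the 2,000-iteration fine-tuning run described above.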
Key Experimental Results¶
Main Results: Cross-Domain Few-Shot Adaptation¶
| Metric | CURE 1-shot | CURE 5-shot | CURE 10-shot | MEDISEG 1-shot | MEDISEG 5-shot | MEDISEG 10-shot |
|---|---|---|---|---|---|---|
| FG Acc | 0.989±.004 | 0.980±.004 | 0.977±.004 | 0.994±.005 | 0.991±.002 | 0.983±.003 |
| FN rate | 0.011±.004 | 0.020±.004 | 0.023±.004 | 0.006±.005 | 0.009±.002 | 0.017±.003 |
| loss_cls | 0.005±.001 | 0.014±.001 | 0.019±.002 | 0.005±.001 | 0.011±.001 | 0.015±.002 |
| total_loss | 0.015±.003 | 0.039±.003 | 0.055±.005 | 0.014±.002 | 0.032±.003 | 0.044±.003 |
Key observation: foreground classification accuracy reaches ≥0.989 at 1-shot; at 1-shot, models trained on MEDISEG exhibit a false negative rate 45% lower than those trained on CURE (0.006 vs 0.011).
Ablation Study: Overlap and Occlusion Stress Test¶
| Metric | CURE 1-shot | CURE 5-shot | CURE 10-shot | MEDISEG 1-shot | MEDISEG 5-shot | MEDISEG 10-shot |
|---|---|---|---|---|---|---|
| FG Acc | 0.131 | 0.372 | 0.558 | 0.406 | 0.625 | 0.740 |
| FN rate | 0.816 | 0.465 | 0.342 | 0.513 | 0.246 | 0.210 |
| loss_cls | 0.351 | 0.421 | 0.320 | 0.383 | 0.279 | 0.191 |
| loss_rpn_cls | 0.863 | 0.224 | 0.133 | 0.312 | 0.182 | 0.059 |
| total_loss | 1.326 | 0.844 | 0.674 | 0.963 | 0.680 | 0.445 |
Key Findings¶
- Semantic recognition saturates rapidly: Foreground classification accuracy reaches 0.989+ at 1-shot; marginal returns from additional annotations diminish quickly.
- Decoupling of localization and classification: Semantic classification is strong under standard evaluation, yet localization and recall degrade sharply under the occlusion stress test (CURE 1-shot FG Acc drops from 0.989 to 0.131).
- Training data realism is the dominant factor: Models trained on MEDISEG (multi-pill, realistic scenes) outperform those trained on CURE (single-pill, controlled scenes) by 210% in FG Acc on the 1-shot overlap test.
- Diminishing returns from increased supervision: The gain from 1-shot to 5-shot is substantial (MEDISEG FG Acc: 0.406→0.625), while the gain from 5-shot to 10-shot is considerably smaller (+18%).
- Loss increase is not a degradation signal: Total loss grows with the number of shots but does not indicate recognition degradation — rather, it reflects a more complex optimization landscape.
- MEDISEG advantage is greatest at low shot counts: The relative advantage is 210% at 1-shot and narrows to 33% at 10-shot, indicating that realistic training data is especially critical under extreme data scarcity.
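The relative-gain figures quoted above follow directly from the stress-test table; a quick check of the arithmetic:

```python
def relative_gain(new, old):
    """Relative improvement of `new` over `old`, in percent."""
    return 100.0 * (new - old) / old


# MEDISEG vs CURE FG-Acc on the overlap stress test (ablation table).
print(round(relative_gain(0.406, 0.131)))  # 210 -> MEDISEG advantage at 1-shot
print(round(relative_gain(0.740, 0.558)))  # 33  -> advantage narrows at 10-shot
print(round(relative_gain(0.740, 0.625)))  # 18  -> 5-shot -> 10-shot gain
```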
Highlights & Insights¶
- Novel deployment diagnostic perspective: Few-shot learning is reframed as a diagnostic tool for deployment readiness rather than a pure adaptation strategy — varying supervision levels exposes stability–robustness trade-offs.
- Discovery of classification–localization decoupling as a systematic failure mode: High semantic recognition accuracy can mask severe localization failures, an insight entirely hidden under conventional AP evaluation.
- Data realism > data scale: CURE contains 196 classes versus MEDISEG's 32, yet the latter generalizes better in cross-domain few-shot settings due to higher visual complexity.
- Thoughtful evaluation design: The decision to abandon AP in favor of classification-error signals due to annotation heterogeneity is well-justified.
Limitations & Future Work¶
- No new method is proposed: This is a purely analytical study with no methodological contribution.
- Limited number of novel classes: Only a 5-way setting is employed, constrained by annotation availability.
- Whole-image bounding box annotations in CURE limit localization metrics: AP is not comparable across datasets, necessitating reliance on classification-centric evaluation.
- Stronger detectors are not explored: Only Faster R-CNN is used; the behavior of DETR-based or single-stage detectors remains unknown.
- Data augmentation effects are not studied: Synthetic occlusion augmentation in few-shot occlusion scenarios may offer a low-cost path to improved robustness.
- No cross-dataset baseline comparison: The evaluation protocol deviates from standard few-shot benchmarks, precluding direct comparison with other methods.
Related Work & Insights¶
- FsDet (Faster R-CNN) is a classical two-stage few-shot detection framework; this paper conducts cross-domain analysis upon it.
- CURE and MEDISEG represent the two extremes of controlled versus realistic settings — motivating reflection on how training data design affects deployment robustness.
- The cross-dataset evaluation discussions in the domain generalization literature align with the philosophy of this work.
- Implication: In safety-critical applications, one should not merely pursue benchmark performance but should systematically investigate failure modes and data–performance interactions.
Rating¶
| Dimension | Score |
|---|---|
| Novelty | ⭐⭐⭐ |
| Theoretical Depth | ⭐⭐⭐ |
| Experimental Thoroughness | ⭐⭐⭐⭐ |
| Value | ⭐⭐⭐⭐ |
| Writing Quality | ⭐⭐⭐⭐ |
| Overall | ⭐⭐⭐ |