
SMMILE: An Expert-Driven Benchmark for Multimodal Medical In-Context Learning

Conference: NeurIPS 2025
arXiv: 2506.21355
Area: Medical Imaging / Multimodal Learning
Keywords: Multimodal In-Context Learning, Medical Benchmark, Multimodal Large Language Models, In-Context Example Quality, Recency Bias

TL;DR

This paper presents SMMILE — the first expert-driven benchmark for multimodal medical in-context learning (ICL), comprising 111 questions (517 image-text QA triplets) spanning 6 medical specialties and 13 imaging modalities, constructed by 11 clinical experts. The benchmark systematically exposes critical deficiencies of current MLLMs in medical multimodal ICL and reveals the pivotal impact of in-context example quality and ordering on model performance.

Background & Motivation

In-context learning (ICL) is a core capability of large language models — adapting to new tasks at inference time using a small number of in-context demonstrations without parameter updates. This capability is particularly valuable in medicine, where clinicians routinely handle specialized tasks based on limited prior examples.

However, three critical research gaps currently exist:

Evaluation Gap: While multimodal ICL has been preliminarily studied in general domains, no systematic multimodal ICL evaluation benchmark exists for the medical domain.

Example Quality Problem: Existing medical few-shot evaluations typically select examples at random rather than through carefully designed task demonstrations, which may partially explain the limited ICL gains observed.

Insufficient Capability Understanding: Although MLLMs have made progress on medical VQA, their ability to learn new tasks from multimodal context remains largely unknown.

The key starting point of SMMILE is that clinical experts carefully design in-context examples for each question (rather than retrieving them randomly), enabling accurate assessment of the true ICL capability of MLLMs. If even expertly curated examples fail to facilitate learning, the problem lies with the models themselves.

Method

Overall Architecture

SMMILE is a benchmark paper rather than a modeling methodology paper; its core contributions lie in dataset construction and evaluation analysis.

Key Designs

1. Expert-Driven Data Collection Pipeline

  • Expert Recruitment: 11 out of 21 recruited clinical experts successfully submitted data, including 9 physicians (average 6.4 years of clinical experience) and 2 medical students.
  • Web Interface: A step-by-step workflow was designed — (1) read detailed guidelines → (2) initialize a question → (3) construct a panel (add/remove/reorder in-context examples and queries) → (4) submit for validation.
  • Quality Control: A three-step manual quality review — two authors independently inspect submissions → categorize and annotate → correct spelling and grammar (15 grammatical and 6 spelling corrections were made).

Dataset Scale: 111 questions with an average of 3.65 in-context examples per question (range: 2–19), totaling 517 image-text QA triplets, covering 6 medical specialties (radiology, pathology, etc.) and 13 imaging modalities (X-ray, CT, MRI, etc.).
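To make the dataset structure concrete, here is a minimal sketch of how a single SMMILE problem might be represented in code. The field names (problem_id, icl_examples, query, etc.) are illustrative assumptions, not the released dataset's actual schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class QATriplet:
    """One image-text QA triplet: an image, its question, and the expert-written answer."""
    image_path: str
    question: str
    answer: str

@dataclass
class SMMILEProblem:
    """One benchmark problem: expert-curated in-context examples plus a held-out query.

    Field names are hypothetical; the released dataset may use different keys.
    """
    problem_id: str
    specialty: str                     # e.g. "radiology", "pathology"
    modality: str                      # e.g. "X-ray", "CT", "MRI"
    icl_examples: List[QATriplet] = field(default_factory=list)  # 2-19 triplets, 3.65 on average
    query: Optional[QATriplet] = None  # the image-text pair the model must answer
```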

2. Evaluation Design

Two Evaluation Tasks:

  • Open-ended Generation: The MLLM receives a query image-text pair and generates a free-text answer (a prompt-assembly sketch follows this list).
  • Closed-ended Selection: The model selects an answer from the set of answers given in the in-context examples.
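As a rough illustration of the open-ended task, the sketch below assembles an interleaved image-text prompt from the expert-curated demonstrations followed by the query. It assumes the SMMILEProblem structure sketched above and a generic chat-message format; the authors' actual prompt template may differ.

```python
def build_icl_prompt(problem) -> list:
    """Interleave the expert-curated demonstrations, then append the query
    (open-ended generation: the model must produce a free-text answer)."""
    messages = []
    for ex in problem.icl_examples:
        messages.append({"role": "user",
                         "content": [{"type": "image", "path": ex.image_path},
                                     {"type": "text", "text": ex.question}]})
        messages.append({"role": "assistant",
                         "content": [{"type": "text", "text": ex.answer}]})
    # The query uses the same format, but its answer is left for the model to generate.
    messages.append({"role": "user",
                     "content": [{"type": "image", "path": problem.query.image_path},
                                 {"type": "text", "text": problem.query.question}]})
    return messages
```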

Three Evaluation Metrics:

  • Exact Match (EM): Strict string match against the ground-truth answer (a minimal EM sketch follows this list).
  • LLM-as-a-Judge: Correctness assessed by Llama3.3 70B.
  • Human Expert Evaluation: Five clinical experts independently judge responses; inter-annotator agreement ≥ 98.2%.
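For reference, a minimal sketch of the Exact Match metric is shown below. The light normalization (lowercasing, punctuation stripping) is an assumption for illustration; the paper describes EM simply as strict string matching.

```python
import re

def normalize(text: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace before comparison."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return re.sub(r"\s+", " ", text).strip()

def exact_match(prediction: str, reference: str) -> bool:
    """Strict string match against the ground-truth answer (after normalization)."""
    return normalize(prediction) == normalize(reference)
```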

SMMILE++: An augmented dataset generated by permuting in-context example orderings, comprising 1,038 questions (up to 24 permutations per question), used to study the effect of example order.
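The SMMILE++ construction can be pictured as sampling distinct orderings of each problem's in-context examples, capped at 24 per question. The sketch below shows one way to do this; the paper's exact sampling procedure is not reproduced here.

```python
import math
import random

def sample_orderings(icl_examples, max_perms=24, seed=0):
    """Sample up to `max_perms` distinct permutations of a problem's in-context examples."""
    n = len(icl_examples)
    target = min(max_perms, math.factorial(n))
    rng = random.Random(seed)
    seen, orderings = set(), []
    while len(orderings) < target:
        perm = tuple(rng.sample(range(n), n))  # one random ordering of the example indices
        if perm not in seen:
            seen.add(perm)
            orderings.append([icl_examples[i] for i in perm])
    return orderings
```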

3. In-Context Example Analysis Framework

  • Quality Analysis: Two perturbed variants — Random-Noise and Targeted-Noise — are constructed to assess the impact of introducing irrelevant examples.
  • Order Analysis: The position of the most relevant example is controlled (first vs. last) to evaluate recency bias; a sketch of both perturbations follows this list.
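The sketch below illustrates both analyses: inserting one irrelevant example drawn from an unrelated problem (the Random-Noise idea) and moving the most relevant example to the start or end of the demonstration list. The relevance_scores argument and the exact insertion logic are assumptions; the paper's perturbation construction may differ in detail.

```python
import random

def add_random_noise(problem, pool, seed=0):
    """Random-Noise variant: insert one in-context example taken from an unrelated problem."""
    rng = random.Random(seed)
    other = rng.choice([p for p in pool if p.problem_id != problem.problem_id])
    noisy = list(problem.icl_examples)
    noisy.insert(rng.randrange(len(noisy) + 1), rng.choice(other.icl_examples))
    return noisy

def place_most_relevant(icl_examples, relevance_scores, last=True):
    """Order analysis: put the single most relevant example last (or first)
    to probe recency bias."""
    best = max(range(len(icl_examples)), key=lambda i: relevance_scores[i])
    rest = [ex for i, ex in enumerate(icl_examples) if i != best]
    return rest + [icl_examples[best]] if last else [icl_examples[best]] + rest
```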

Evaluation Protocol

Fifteen MLLMs are evaluated, including:

  • Open-source general models: LLaVA series, Qwen2.5-VL series, Llama-3.2-Vision-90B
  • Open-source medical models: LLaVA-Med, MedGemma, MedVLM-R1
  • Proprietary models: GPT-4o, Claude 3.7 Sonnet
  • Baselines: Random, Majority, Text-Only (Llama3.3 70B)

All models use a maximum generation length of 512 tokens; bootstrap resampling (1,000 iterations) is used to estimate sampling variability.
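A standard way to implement the bootstrap step is to resample the per-question scores with replacement and read off a confidence interval from the resampled means, as in the sketch below (the paper states 1,000 iterations; the confidence level used here is an assumption).

```python
import random

def bootstrap_mean_ci(per_question_scores, n_iter=1000, alpha=0.05, seed=0):
    """Bootstrap the mean score over questions: resample with replacement n_iter times
    and return (point estimate, (lower, upper) percentile interval)."""
    rng = random.Random(seed)
    n = len(per_question_scores)
    means = sorted(
        sum(rng.choice(per_question_scores) for _ in range(n)) / n
        for _ in range(n_iter)
    )
    lower = means[int((alpha / 2) * n_iter)]
    upper = means[int((1 - alpha / 2) * n_iter) - 1]
    return sum(per_question_scores) / n, (lower, upper)
```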

Key Experimental Results

Main Results (15 Models on SMMILE)

| Model | 0-shot (Judge)↑ | ICL (Judge)↑ | ICL (EM)↑ | ICL (MCQA)↑ |
|---|---|---|---|---|
| GPT-4o | 32.56 | 49.88 | 8.94 | 58.85 |
| Claude 3.7 Sonnet | 37.18 | 36.17 | 2.63 | 42.01 |
| Qwen2.5-VL-72B | 29.90 | 42.59 | 15.71 | 54.71 |
| Qwen2.5-VL-32B | 25.27 | 41.79 | 31.84 | 49.97 |
| Llama-3.2-Vision-90B | 31.84 | 40.66 | 30.53 | 30.30 |
| MedGemma-4B | 27.73 | 36.86 | 12.14 | 40.67 |
| LLaVA-Med-7B | 21.65 | 10.19 | 0.00 | 0.00 |
| Random Baseline | – | 27.86 | 23.16 | 36.30 |

In-Context Example Quality Analysis

| Model | SMMILE (Normal)↑ | Random-Noise↑ | Targeted-Noise↑ | Relative Drop (Random / Targeted) |
|---|---|---|---|---|
| Qwen2.5-VL-32B | 41.79 | 39.60 | 39.10 | -5.2% / -6.4% |
| Qwen2.5-VL-3B | 33.58 | 30.40 | 30.37 | -9.5% / -9.6% |
| LLaVA-Med-7B | 10.19 | 4.88 | 1.88 | -52.1% / -81.6% |
| Average | 24.92 | 22.65 | 22.55 | -9.1% / -9.5% |

Adding a single irrelevant example results in an average relative performance drop of 9.1% (Random-Noise) to 9.5% (Targeted-Noise).

Key Findings

  1. Limited and Highly Uneven ICL Gains: 7 out of 15 models perform worse under ICL than the Random baseline (selecting answers randomly from examples); on average, ICL improves performance by only 8%.
  2. No Observed Advantage for Medical-Specific Models: LLaVA-Med's performance collapses under the ICL setting (21.65% → 10.19%), and MedGemma shows little advantage over general-purpose models of comparable scale.
  3. Severe Recency Bias: Placing the most relevant example at the end of the list yields up to 71% performance improvement, while placing it at the beginning results in a performance drop of up to 47%.
  4. Complete Failure on Numerical Reasoning: All models achieve 0% accuracy on questions requiring numerical answers.
  5. Optimistic Bias of Automatic Metrics: Under the ICL setting, LLM-as-a-Judge scores are only moderately correlated with expert scores (r = 0.72) and tend to overestimate performance.
  6. Zero Performance on MRI and Illustration Modalities: All models score 0% on questions from these two imaging modalities.

Highlights & Insights

  • Expert-Driven Rather Than Scale-Driven: Eleven clinical experts carefully design in-context examples for each question, ensuring that demonstrations are effective task illustrations and that evaluation results faithfully reflect models' true ICL capabilities.
  • Revealing an Illusion: The limited ICL gains observed in prior few-shot evaluations are likely not because ICL is inherently ineffective, but because randomly selected examples are themselves uninformative. Yet even expertly curated examples fail to substantially benefit most models.
  • Recency Bias Discovery: This finding has direct practical implications for deploying MLLMs — simply placing the most relevant example last can substantially improve performance.
  • Discrepancy Between Automatic and Human Evaluation: This finding cautions the community against over-relying on LLM-as-a-Judge, especially in ICL scenarios.

Limitations & Future Work

  • The dataset is relatively small (111 questions), which may not fully cover all medical scenarios.
  • Only image modalities are included; future work could extend to video, audio, and other medical data types.
  • The number of contributing experts is limited (11), potentially resulting in uneven coverage across specialties.
  • SMMILE++ augments only through permutation of example order without introducing new medical content.
  • The paper does not propose concrete methods to address the identified limitations.
  • The success of multimodal ICL models such as Flamingo in general domains has motivated exploration in the medical domain.
  • The recency bias finding echoes the "lost-in-the-middle" phenomenon observed in NLP.
  • The discrepancy between human expert evaluation and automatic metrics indicates that medical AI assessment requires greater involvement of domain experts.

Rating

  • Novelty: ⭐⭐⭐⭐ First multimodal medical ICL benchmark; uncovers recency bias and sensitivity to example quality.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ 15 models + dual human/automatic evaluation + multi-dimensional fine-grained analysis + perturbation experiments.
  • Writing Quality: ⭐⭐⭐⭐⭐ Clear structure, rich data visualizations, and thorough analysis.
  • Value: ⭐⭐⭐⭐ Serves as an important warning to the MLLM community, though no remedies are proposed.