LEMON: A Large Endoscopic MONocular Dataset and Foundation Model for Perception in Surgical Settings¶
Conference: CVPR 2026 · arXiv: 2503.19740 · Code: https://github.com/visurg-ai/LEMON · Area: Medical Imaging / Surgical Vision · Keywords: surgical dataset, foundation model, self-supervised learning, data curation pipeline, knowledge distillation
TL;DR¶
This paper introduces LEMON, the largest open surgical video dataset to date (4,194 videos, 938 hours, 35 procedure types), and proposes LemonFM, a foundation model based on augmented knowledge distillation, which comprehensively outperforms existing methods across four downstream tasks: surgical phase recognition, tool detection, action recognition, and semantic segmentation.
Background & Motivation¶
- Background: Conventional surgical datasets typically contain fewer than 100 videos and 30 hours of footage, leading to poor model generalization. Although self-supervised learning reduces reliance on annotated data, large-scale high-quality surgical data remains scarce.
- Limitations of Prior Work: While GenSurgery and SurgeNetXL have expanded dataset scale, they lack data curation steps and include non-surgical content (conference talks, patient testimonials, etc.), introducing noisy features.
- Key Challenge: Surgical data is constrained by privacy regulations and annotation costs, raising the challenge of constructing high-quality large-scale datasets from publicly available online videos.
- Goal: Design a systematic data curation pipeline to curate high-quality surgical datasets from YouTube videos.
- Key Insight: Multi-stage automated curation (classification → trimming → preprocessing → annotation) combined with manual quality control.
- Core Idea: An augmented knowledge distillation approach that exploits cross-patient appearance similarity and inter-frame motion invariance to learn superior surgical visual representations.
Method¶
Overall Architecture¶
On the data side: YouTube video collection → storyboard classification → frame-level classification and trimming → non-surgical content removal → annotation validation. On the model side: augmented knowledge distillation based on the DINO framework with a ConvNeXt-L backbone.
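The frame-level classification-and-trimming stage can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the `Frame` type, function name, and trimming logic are assumptions; only the frame-level classifier idea and the 90% surgical-frame threshold come from the pipeline description.

```python
# Hypothetical sketch of frame-level trimming: keep the segment bounded by the
# first and last surgical frames, then discard the video if surgical frames
# make up less than 90% of the trimmed clip.
from dataclasses import dataclass


@dataclass
class Frame:
    is_surgical: bool  # output of a frame-level classifier (assumed interface)


def trim_and_filter(frames: list[Frame], min_surgical_ratio: float = 0.9):
    """Trim leading/trailing non-surgical frames; return the clip only if
    surgical frames account for at least `min_surgical_ratio` of it."""
    idx = [i for i, f in enumerate(frames) if f.is_surgical]
    if not idx:
        return None  # no surgical content at all
    clip = frames[idx[0]: idx[-1] + 1]
    ratio = sum(f.is_surgical for f in clip) / len(clip)
    return clip if ratio >= min_surgical_ratio else None
```

In the actual pipeline this filter sits between the video-level storyboard classifier and the region-level YOLOv8 masking step.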
Key Designs¶
- Multi-Stage Data Curation Pipeline:
  - Function: Curates 4,194 high-quality surgical videos from 18K raw videos.
  - Mechanism: (1) Video-level: a storyboard classifier (ResNet18) filters out non-surgical videos; (2) Frame-level: a frame classifier localizes the start and end of surgical content, trimming non-surgical segments and discarding videos in which surgical frames account for less than 90%; (3) Region-level: YOLOv8 detects and masks non-surgical regions within frames (UI elements, logos, etc.).
  - Design Motivation: Each filtering stage employs an independent model validated by human annotators, achieving 100% video-level precision and >99.9% frame-level precision.
- Augmented Knowledge Distillation:
  - Function: Learns surgical representations invariant to subtle motion and to cross-patient appearance variation.
  - Mechanism: Additional supervisory views \(W_i\) are introduced into the DINO student-teacher framework. Each \(W_i\) consists of two images, preferentially retrieved as appearance-similar frames from other videos of the same procedure type (cosine distance less than 3× the distance between adjacent frames), falling back to temporally adjacent frames when no such match exists. The loss is \(\mathcal{L} = -\sum_i \sum_{u \in U_i} \sum_{v \in V_i \cup W_i} P_t(z|u) \log P_s(z|v)\).
  - Design Motivation: Standard DINO learns invariance only across different augmentations of the same image; augmented distillation additionally enforces cross-patient and cross-frame invariance.
- Video Classification Model LemonFM-Vid:
  - Function: Performs video-level procedure-type classification on top of LemonFM frame embeddings.
  - Mechanism: Frame embeddings are aggregated via typicality-weighted pooling (typicality defined as the inverse of the K-NN distance) into a video embedding \(v_e = \sum_j \omega_j \phi_j\), which a single-layer MLP then classifies.
  - Design Motivation: Surgical procedures are localized to specific anatomical regions, so characteristic scenes can be associated with procedure types.
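The typicality-weighted pooling in LemonFM-Vid can be sketched as below. This is a hedged illustration, not the paper's implementation: the choice of `k`, the use of mean k-NN distance, the `1e-8` stabilizer, and the weight normalization are all assumptions; only "typicality = inverse K-NN distance" and \(v_e = \sum_j \omega_j \phi_j\) come from the text.

```python
# Sketch of typicality-weighted pooling: frames far from their neighbors
# (atypical/outlier frames) receive low weight; typical frames dominate v_e.
import numpy as np


def typicality_pool(phi: np.ndarray, k: int = 3) -> np.ndarray:
    """phi: (n_frames, d) frame embeddings -> (d,) video embedding."""
    # pairwise Euclidean distances between frame embeddings
    dist = np.linalg.norm(phi[:, None, :] - phi[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)                 # exclude self-distance
    knn = np.sort(dist, axis=1)[:, :k].mean(axis=1)  # mean k-NN distance per frame
    w = 1.0 / (knn + 1e-8)                         # typicality = inverse k-NN distance
    w = w / w.sum()                                # normalize weights to sum to 1
    return (w[:, None] * phi).sum(axis=0)          # v_e = sum_j w_j * phi_j
```

A single outlier frame (e.g. a blurry transition) thus contributes almost nothing to the video embedding, unlike plain average pooling.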
Loss & Training¶
DINO cross-entropy loss with augmented data pairs. Trained for 60 epochs on 8 × V100 GPUs.
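The objective above can be sketched for a single sample \(i\) as follows. This is a simplified, assumed formulation: temperatures, centering, and the stop-gradient/EMA teacher of real DINO are omitted; the point is only that student views cover \(V_i \cup W_i\) (augmentations plus retrieved cross-patient/adjacent frames) while teacher targets come from the global views \(U_i\).

```python
# Sketch of the augmented-distillation loss:
#   L = -sum_u sum_v  P_t(.|u) . log P_s(.|v),  v in V u W
import numpy as np


def softmax(z: np.ndarray, temp: float) -> np.ndarray:
    z = z / temp
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def augmented_dino_loss(teacher_logits, student_logits,
                        t_temp: float = 0.04, s_temp: float = 0.1) -> float:
    """teacher_logits: (|U|, K) teacher outputs for global views U.
    student_logits: (|V|+|W|, K) student outputs for augmented views V
    plus retrieved views W (cross-patient / temporally adjacent frames)."""
    p_t = softmax(teacher_logits, t_temp)              # sharpened teacher targets
    log_p_s = np.log(softmax(student_logits, s_temp) + 1e-12)
    # cross-entropy summed over every (teacher view, student view) pair
    return -np.einsum('uk,vk->', p_t, log_p_s)
```

Relative to vanilla DINO, the only change is that `student_logits` also contains the \(W_i\) views, which is what injects cross-patient appearance invariance into the representation.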
Key Experimental Results¶
Main Results¶
| Task / Dataset | Metric | LemonFM | Prev. SOTA (SurgeNetXL) | Gain |
|---|---|---|---|---|
| AutoLaparo Phase Recognition | Jaccard | 64.8 | 55.1 | +9.7pp |
| Cholec80 Phase Recognition | Jaccard | 85.1 | 72.0 | +13.1pp |
| Cholec80 Tool Detection | mAP | 93.7 | 86.5 | +7.2pp |
| GraSP Tool Detection | mAP | 94.4 | 83.8 | +10.6pp |
| CholecT50 Action Recognition | mAP | 61.9 | 57.5 | +4.4pp |
| CholecSeg8k Semantic Segmentation | mDice | 81.3 | 69.0 | +12.3pp |
Ablation Study¶
| Configuration | AutoLaparo (F1) | CholecSeg8k (mDice) | Notes |
|---|---|---|---|
| ImageNet Pretraining | 53.0 | 64.4 | General-purpose pretraining |
| Cholec80 (51h) | 46.9 | 64.1 | Small-scale surgical data |
| LEMON (uncurated) | 61.4 | 67.4 | Without curation pipeline |
| LEMON (curated) | 65.9 | 68.7 | Curation gain: +4.5pp |
| LEMON + Augmented Distillation | 66.9 | 71.9 | Full model |
Key Findings¶
- The data curation pipeline contributes substantially: +4.5pp F1 and +1.3pp mDice.
- ConvNeXt outperforms ViT, particularly on segmentation (+10.7pp mDice), as convolutional inductive biases better preserve fine-grained surgical details.
- LemonFM fine-tuned with only 50% of labeled data still surpasses all baselines trained with 100% of data.
- Discriminative pretraining (DINO-family) substantially outperforms generative pretraining (MAE-family).
Highlights & Insights¶
- Quality Assurance at Scale: A rigorous curation pipeline reducing 18K raw videos to 4,194 high-quality videos achieves an unprecedented balance between scale and data quality.
- Cross-Patient Augmented Distillation: Leveraging videos from different patients undergoing the same procedure to learn appearance invariance makes ingenious use of procedure-level annotations.
- Comprehensive Benchmarking: Full evaluation across 6 datasets and 4 tasks establishes a standardized benchmark for surgical vision foundation models.
Limitations & Future Work¶
- The data source consists of publicly available YouTube videos; although the authors address the ethical considerations, the use of such footage (consent, privacy) may remain contentious.
- Procedure classification mAP reaches only 57.8%, with significant confusion among anatomically adjacent procedure types.
- Future work will focus on developing surgical-specific video foundation models.
Related Work & Insights¶
- vs. Endo-FM: Endo-FM is trained on private data, limiting reproducibility; LEMON is fully open-source.
- vs. EndoViT: EndoViT aggregates multiple small public datasets, achieving far inferior scale and performance compared to LEMON.
- vs. SurgeNetXL: SurgeNetXL lacks data curation; LEMON's curation pipeline yields substantial performance gains.
Rating¶
- Novelty: ⭐⭐⭐⭐ — The augmented distillation design is novel and the data curation pipeline is systematic.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — 6 datasets, 4 tasks, low-data experiments, cross-validation, and comprehensive ablations.
- Writing Quality: ⭐⭐⭐⭐⭐ — Clear structure with sufficient detail.
- Value: ⭐⭐⭐⭐⭐ — Both the dataset and model will serve as important foundational resources for the surgical vision community.