
PicoSAM3: Real-Time In-Sensor Region-of-Interest Segmentation

Conference: CVPR 2025
arXiv: 2603.11917
Code: GitHub
Area: Segmentation / Edge Deployment
Keywords: In-sensor computing, SAM distillation, INT8 quantization, lightweight segmentation, Sony IMX500, ROI prompts

TL;DR

PicoSAM3 is an ultra-lightweight promptable segmentation model with 1.37M parameters. It combines implicit ROI prompt encoding, a transformer-free dense CNN architecture, SAM3 knowledge distillation, and INT8 quantization to reach 65.45% mIoU on COCO and real-time inference in 11.82 ms on the Sony IMX500 vision sensor.

Background & Motivation

Background: The SAM/SAM2/SAM3 series has achieved breakthrough progress in promptable segmentation, but the computational and memory footprint of their Transformer architectures prevents deployment on extreme edge devices.

Limitations of Prior Work: (1) Compression schemes like TinySAM and EdgeSAM still retain Transformer structures, exceeding the 8MB memory limit and relying on unsupported operators (Softmax, LayerNorm); (2) In-sensor computing requires the model to run entirely within the CMOS sensor, imposing extremely strict limits on operator types, integer precision, and memory; (3) Existing lightweight SAM models ignore the spatial flexibility of the ROI.

Key Challenge: How can high-quality promptable segmentation run on a sensor chip that has less than 8MB of SRAM and supports only quantization-friendly operators?

Key Insight: Completely abandon the Transformer in favor of a pure CNN architecture + implicit ROI prompts + SAM3 distillation + INT8 quantization, specifically designed for in-sensor deployment.

Method

Overall Architecture

Training: COCO data → extract ROI crops for each annotated instance (10% padding, square, 96×96) → SAM3 teacher generates soft masks for the same bbox (cached offline) → PicoSAM3 student learns to segment by looking only at the cropped RGB image → multi-loss distillation training.

Inference: the user drags to select an ROI on the IMX500 → hardware-based sensor cropping → INT8 model inference → return segmentation mask.

Key Designs

Implicit Prompt via Centered Cropping:
- Function: Implicitly encodes bbox prompts into the RGB input via cropping.
- Mechanism: During training, square crops with 10% padding are extracted based on the bbox and resized to 96×96. The target object is always roughly centered, and the network learns to segment the main object near the center of the input. During inference, the same cropping is achieved using the hardware ROI function of the IMX500.
- Design Motivation: The IMX500 only supports RGB input and cannot accept extra prompt tensors; thus, prompts can only be implicitly encoded through spatial cropping.
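
The cropping step can be sketched in a few lines of plain Python. This is only an illustration: the function name `square_roi`, the `(x, y, w, h)` box layout, and the reading of "10% padding" as enlarging the longer box side by 10% in total are assumptions, since the exact convention is not stated here.

```python
def square_roi(bbox, img_w, img_h, pad=0.10):
    """Return a square crop (x0, y0, x1, y1) around an (x, y, w, h) box.

    Assumption: the longer box side is enlarged by `pad` (10%) in total.
    """
    x, y, w, h = bbox
    cx, cy = x + w / 2.0, y + h / 2.0

    # Square side: padded longer edge, limited to what fits in the image.
    side = min(max(w, h) * (1.0 + pad), img_w, img_h)
    s = max(1, int(round(side)))

    # Centre the square on the box, then shift it back inside the image.
    x0 = int(round(cx - s / 2.0))
    y0 = int(round(cy - s / 2.0))
    x0 = min(max(x0, 0), img_w - s)
    y0 = min(max(y0, 0), img_h - s)
    return x0, y0, x0 + s, y0 + s


def resize_factor(crop, target=96):
    """Scale factor that maps the square crop onto the 96x96 network input."""
    x0, _, x1, _ = crop
    return target / float(x1 - x0)
```

For a 640×480 frame, `square_roi((100, 50, 40, 20), 640, 480)` gives a 44-pixel square centred on the box, which would then be resized to 96×96. Near the image border the square is shifted rather than shrunk, so the object may sit slightly off centre.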

Dense CNN U-Net:
- Function: A symmetric encoder-decoder pure CNN with 1.37M parameters.
- Mechanism: Based on the U-Net structure of PicoSAM2 (depthwise separable convolutions, channels 48→96→160→256→320), introducing three improvements: (1) an enhanced bottleneck with dilated depthwise convolutions (dilation=2 to expand the receptive field); (2) an Efficient Channel Attention (ECA) block before the output head for adaptive feature recalibration; (3) a refinement head with depthwise convolutions to improve boundaries.
- Design Motivation: Completely avoid Self-Attention operators from Transformers to ensure all operations are compatible with INT8 quantization and the limited operator set of the IMX500.
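
A rough parameter count shows why depthwise separable convolutions keep the model small, and why dilation is a cheap way to widen the receptive field. The sketch below assumes 3×3 kernels and ignores biases and normalisation layers; the helper names are illustrative, and the loop only pairs up the channel widths quoted above rather than reproducing the actual layer list.

```python
def conv_params(c_in, c_out, k=3):
    """Weights in a standard k x k convolution (bias ignored)."""
    return c_in * c_out * k * k


def dw_separable_params(c_in, c_out, k=3):
    """Weights in a depthwise k x k conv followed by a 1 x 1 pointwise conv."""
    return c_in * k * k + c_in * c_out


def effective_kernel(k, dilation):
    """Spatial extent covered by a dilated k x k kernel (same weight count)."""
    return k + (k - 1) * (dilation - 1)


channels = [48, 96, 160, 256, 320]  # encoder widths quoted in the text
for c_in, c_out in zip(channels, channels[1:]):
    std = conv_params(c_in, c_out)
    sep = dw_separable_params(c_in, c_out)
    print(f"{c_in:>3} -> {c_out:<3} standard={std:>7} separable={sep:>6} "
          f"saving={std / sep:.1f}x")

# A 3x3 kernel with dilation=2 spans a 5x5 window without extra weights.
print("effective kernel at dilation 2:", effective_kernel(3, 2))
```

Each stage needs roughly 8 to 9 times fewer weights than a standard convolution of the same shape, and every operation involved is a plain convolution.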

SAM3 Knowledge Distillation:
- Function: Distills knowledge from a 1.2GB SAM3 teacher to a 5.26MB student.
- Mechanism: The teacher's soft masks are cached offline for all COCO annotations, then the student is trained jointly with three losses: (1) teacher loss \(\mathcal{L}_{teacher}\) = MSE + Dice (temperature scale \(\tau=5\)); (2) GT loss \(\mathcal{L}_{gt}\) = BCE + Dice; (3) area preservation loss \(\mathcal{L}_{area}\), which keeps the predicted mask area from collapsing below 40% of the GT area.
- Adaptive weights: The weights of the teacher and GT terms in the total loss are adjusted automatically by the teacher's confidence: the teacher's soft label dominates when confidence is high, and the loss falls back to the GT when confidence is low.
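
The loss combination can be illustrated on flattened masks with plain Python. Several details are assumptions made only for this sketch: the hinge form of the area term, the confidence measure (mean distance of the teacher's probabilities from 0.5), the linear blend `w * L_teacher + (1 - w) * L_gt`, and `area_weight`. The temperature softening with \(\tau=5\) is taken to be already applied to the cached teacher masks.

```python
import math

EPS = 1e-6


def dice_loss(pred, target):
    inter = sum(p * t for p, t in zip(pred, target))
    return 1.0 - (2.0 * inter + EPS) / (sum(pred) + sum(target) + EPS)


def mse_loss(pred, target):
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def bce_loss(pred, target):
    total = 0.0
    for p, t in zip(pred, target):
        p = min(max(p, EPS), 1.0 - EPS)  # avoid log(0)
        total -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
    return total / len(pred)


def area_loss(pred, gt, min_ratio=0.4):
    """Hinge penalty once the predicted area drops below 40% of the GT area."""
    gt_area = sum(gt)
    if gt_area == 0:
        return 0.0
    return max(0.0, min_ratio - sum(pred) / gt_area)


def teacher_confidence(teacher):
    """Assumed measure: 0 when every pixel is 0.5, 1 when all are 0 or 1."""
    return sum(abs(2.0 * t - 1.0) for t in teacher) / len(teacher)


def total_loss(pred, teacher, gt, area_weight=1.0):
    l_teacher = mse_loss(pred, teacher) + dice_loss(pred, teacher)
    l_gt = bce_loss(pred, gt) + dice_loss(pred, gt)
    w = teacher_confidence(teacher)  # high confidence -> trust the teacher
    return w * l_teacher + (1.0 - w) * l_gt + area_weight * area_loss(pred, gt)
```

With a four-pixel mask `gt = [1, 1, 0, 0]`, a prediction that collapses to almost nothing is penalised by all three terms, while a prediction close to the teacher and the GT incurs no area penalty at all.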

INT8 Post-Training Quantization:
- Function: 4× model compression (5.26MB → 1.31MB) with <0.2% accuracy loss.
- Mechanism: Uses the Sony MCT toolchain for symmetric per-channel weight quantization + per-tensor activation quantization, with 10-batch calibration.
- Design Motivation: Depthwise separable convolutions are naturally robust to quantization noise, eliminating the need for complex outlier suppression algorithms.
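
The arithmetic behind symmetric per-channel weight quantization is short enough to show directly. This is a generic sketch of the technique, not the Sony MCT API: the function names and the nested-list weight layout are made up for illustration, and calibration of activation ranges is left out.

```python
def quantize_per_channel(weights, n_bits=8):
    """Symmetric quantization with one scale per output channel.

    weights: list of channels, each a flat list of floats.
    Returns (integer weights, per-channel scales).
    """
    qmax = 2 ** (n_bits - 1) - 1  # 127 for INT8
    q_channels, scales = [], []
    for channel in weights:
        max_abs = max(abs(w) for w in channel)
        scale = max_abs / qmax if max_abs > 0 else 1.0
        # Zero point is fixed at 0 (symmetric); clamp to [-qmax, qmax].
        q = [max(-qmax, min(qmax, round(w / scale))) for w in channel]
        q_channels.append(q)
        scales.append(scale)
    return q_channels, scales


def dequantize(q_channels, scales):
    """Map integer weights back to floats using the per-channel scales."""
    return [[q * s for q in channel] for channel, s in zip(q_channels, scales)]
```

Because each channel gets its own scale, a channel with small weights (say around 0.01) still uses the full integer range instead of being rounded to zero by a scale chosen for a much larger channel. Storing one byte instead of four per weight is also where the roughly 4× size reduction comes from.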

Key Experimental Results

Main Results (COCO / LVIS)

| Model | Params | Size | COCO mIoU | LVIS mIoU | IMX500 Latency |
|---|---|---|---|---|---|
| SAM-H | 635M | 2420MB | 53.6% | 60.5% | - |
| TinySAM | 9.7M | 37MB | 50.9% | 52.1% | - |
| EdgeSAM | 9.7M | 37MB | 48.0% | 53.7% | - |
| PicoSAM2 | 1.3M | 4.87MB | 51.9% | 44.9% | - |
| Q-PicoSAM2 | 1.3M | 1.22MB | 50.5% | 45.1% | 14.3ms |
| PicoSAM3 | 1.37M | 5.26MB | 65.45% | 64.01% | - |
| Q-PicoSAM3 | 1.37M | 1.31MB | 65.34% | 63.98% | 11.82ms |

PicoSAM3 has about 1/460 of the parameters of SAM-H, yet outperforms it on COCO mIoU by 11.85 percentage points.
Compared to PicoSAM2, COCO mIoU improves by 13.55 points and LVIS mIoU by 19.11 points.
INT8 quantization is almost lossless (-0.11 points COCO mIoU) and runs at approximately 84 FPS on the IMX500.
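
The headline figures can be re-derived from the table values with a few lines of arithmetic. The variable names are illustrative, and the frame rate assumes that the 11.82 ms latency is the only per-frame cost.

```python
latency_ms = 11.82                      # Q-PicoSAM3 on the IMX500
fps = 1000.0 / latency_ms               # frames per second

param_ratio = 635e6 / 1.37e6            # SAM-H vs PicoSAM3 parameters
compression = 5.26 / 1.31               # FP32 size vs INT8 size, in MB

gain_vs_sam_h = 65.45 - 53.6            # COCO mIoU, percentage points
gain_vs_picosam2_coco = 65.45 - 51.9
gain_vs_picosam2_lvis = 64.01 - 44.9
quant_drop = 65.45 - 65.34

print(f"FPS: {fps:.1f}")
print(f"Parameter ratio: {param_ratio:.0f}x")
print(f"Compression: {compression:.2f}x")
print(f"COCO gain vs SAM-H: {gain_vs_sam_h:.2f} points")
print(f"Gain vs PicoSAM2: {gain_vs_picosam2_coco:.2f} (COCO), "
      f"{gain_vs_picosam2_lvis:.2f} (LVIS) points")
print(f"Quantization drop: {quant_drop:.2f} points")
```

The results agree with the rounded figures quoted in the text: a little under 85 FPS, a parameter ratio in the 460s, and a compression factor of about 4.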

Ablation Study

| Model | Distillation | ROI | COCO mIoU | LVIS mIoU |
|---|---|---|---|---|
| PicoSAM2 | ✗ | ✗ | 53.00% | 41.40% |
| PicoSAM2 | SAM2 | ✗ | 51.93% | 44.88% |
| PicoSAM2 | SAM2 | ✓ | 63.11% | 61.80% |
| PicoSAM2 | SAM3 | ✓ | 63.51% | 62.31% |
| PicoSAM3 | SAM3 | ✓ | 65.45% | 64.01% |

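ROI cropping is the most critical factor for improvement (reported as +10.56% mAP): with SAM2 distillation held fixed, adding it raises PicoSAM2 from 51.93% to 63.11% COCO mIoU and from 44.88% to 61.80% LVIS mIoU, confirming that spatial focus is crucial for low-resolution models.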
SAM3 distillation outperforms SAM2 distillation (+0.4 points COCO mIoU), and the architectural improvements of PicoSAM3 yield an additional +1.9 points.

Key Findings

Large models (SAM2.1 Large) experience performance collapse on 96×96 inputs (~25% mAP) due to resolution mismatch.
The activation distribution of the pure CNN architecture is compact and near-Gaussian, making it naturally friendly to INT8 quantization.
ROI cropping + centered implicit prompting map directly onto the hardware ROI function of the IMX500.

Highlights & Insights

Sensor-grade SAM: Promptable segmentation runs inside a CMOS sensor for the first time, with an on-sensor latency of 11.82ms (~84 FPS), which gives it high practical value.
Counter-intuitive finding: The 1.37M-parameter model outperforms the 635M-parameter SAM-H on 96×96 inputs, indicating that matching the model to the input resolution matters more than absolute model size.
Implicit prompt design: Prompting functionality is achieved through cropping without requiring an additional network or input channels, elegantly bypassing hardware limitations.

Limitations & Future Work

The 96×96 input resolution limits the fine boundary quality, which may be insufficient for large objects or complex shapes.
Implicit ROI prompting only supports rectangular box regions, lacking support for prompt types like points or text.
Validated only on COCO/LVIS, lacking transfer experiments to other domains like medical or remote sensing.
Trained for only 1 epoch on COCO; longer training could potentially yield further improvements.

vs TinySAM/EdgeSAM: about 7× fewer parameters and more than 14 percentage points higher COCO mIoU, a substantial gap.
vs PicoSAM2: Architectural improvements (ECA + dilated bottleneck) + SAM3 distillation + ROI prompting collectively contribute to a +13.55% improvement.
vs FastSAM: +13.85% mIoU, with the model being only 1.9% of the size.

Rating

Novelty: ⭐⭐⭐⭐ In-sensor segmentation is a novel and practical direction, with an elegantly designed implicit ROI prompt.
Experimental Thoroughness: ⭐⭐⭐⭐ Comprehensive ablation studies (distillation/ROI/architecture/quantization), validated with real hardware deployment.
Writing Quality: ⭐⭐⭐⭐ Clear structure with a thorough introduction to the hardware background.
Value: ⭐⭐⭐⭐⭐ High practical value for extreme edge deployment; the method is complete and reproducible.