
InfoSAM: Fine-Tuning the Segment Anything Model from An Information-Theoretic Perspective

Conference: ICML 2025 Spotlight
arXiv: 2505.21920
Code: InfoSAM project page
Area: Image Segmentation
Keywords: SAM Fine-Tuning, Information Bottleneck, Knowledge Distillation, Parameter-Efficient Fine-Tuning, Rényi Mutual Information, Domain-Invariant Relations

TL;DR

InfoSAM approaches the Parameter-Efficient Fine-Tuning (PEFT) of SAM from an information-theoretic perspective. It builds a relation compression and distillation framework on Rényi mutual information: compressing pseudo-invariant information while preserving domain-invariant relations improves fine-tuning performance.

Background & Motivation

  • Problem: SAM performs exceptionally well in general segmentation but poorly in specialized domains like medical imaging, remote sensing, and agriculture, necessitating PEFT adaptation.
  • Limitations of Prior Work: Existing PEFT methods (LoRA, Adapter, etc.) independently adjust parameters of each module, neglecting the implicit relationships between the encoder and decoder in pretrained models; traditional distillation methods focus on layer-wise feature alignment and lack guidance on inter-module relationships.
  • Key Insight: SAM's large-scale pretraining captures domain-invariant structural relations (e.g., geometric contours), but fine-tuning easily destroys them. Not all relations are beneficial, however: pseudo-invariant features such as color can hurt generalization.
  • Goal: Determine how to extract domain-invariant relations from the pretrained SAM and how to transfer them to the fine-tuned model.

Method

InfoSAM consists of two complementary information-theoretic objectives, forming a "compression-distillation" framework:

1. Relation Module

  • Input: Image encoder embedding \(z_i^T \in \mathbb{R}^{B \times H \times W \times D}\) and mask decoder output token \(z_m^T \in \mathbb{R}^{B \times N \times D}\).
  • Computes Q and K via LayerNorm followed by a linear projection, computes the attention scores, and adds a residual connection: \(S_\alpha = \frac{QK^\top}{\sqrt{D}} + z_m^T \cdot {z_i^T}^\top\). Here the superscript \(T\) marks the teacher, while \(\top\) denotes transposition.
  • The output is \(\ell_2\)-normalized to obtain the relation representation \(r^T = f^T(z_i^T, z_m^T; \theta)\).
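
The steps above can be sketched for a single sample in dependency-free Python. This is only an illustration under stated assumptions, not the authors' implementation: the function names, the choice of Q from the mask tokens and K from the image tokens (picked so that \(QK^\top\) has the same shape as the residual term), the LayerNorm without learned scale or shift, and the normalization over the flattened score matrix are all hypothetical.

```python
import math

def layer_norm(rows, eps=1e-5):
    """Normalize each row to zero mean and unit variance (no learned scale/shift)."""
    out = []
    for row in rows:
        mean = sum(row) / len(row)
        var = sum((v - mean) ** 2 for v in row) / len(row)
        out.append([(v - mean) / math.sqrt(var + eps) for v in row])
    return out

def matmul(a, b):
    """Plain matrix product of two matrices stored as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def relation(z_i, z_m, w_q, w_k):
    """Relation vector for one sample.

    z_i: (H*W, D) flattened image-encoder embedding
    z_m: (N, D)   mask-decoder output tokens
    w_q, w_k: (D, D) projection matrices (assumed shapes)
    """
    d = len(z_m[0])
    q = matmul(layer_norm(z_m), w_q)       # (N, D)
    k = matmul(layer_norm(z_i), w_k)       # (H*W, D)
    attn = matmul(q, transpose(k))         # (N, H*W)
    resid = matmul(z_m, transpose(z_i))    # (N, H*W) residual term
    scores = [[a / math.sqrt(d) + r for a, r in zip(ra, rr)]
              for ra, rr in zip(attn, resid)]
    flat = [v for row in scores for v in row]
    norm = max(math.sqrt(sum(v * v for v in flat)), 1e-12)
    return [v / norm for v in flat]        # l2-normalized relation
```

Calling `relation` with 3 image tokens and 2 mask tokens returns a unit-norm vector of length 6, one entry per (mask token, image token) pair.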

2. Relation Compression Loss \(\mathcal{L}_r\) (Intra-SAM)

  • Goal: Minimize \(\mathbf{I}_\alpha(z_i^T, z_m^T; r^T)\), serving as an information bottleneck to compress pseudo-invariant information.
  • Based on the Rényi \(\alpha\)-entropy with \(\alpha=2\), which lets the Frobenius norm replace eigenvalue decomposition: \(\mathcal{L}_r = -\log_2 \|G_r^T\|_F^2 + \log_2 \|G_{imr}^T\|_F^2\)
  • Here \(G_{imr}^T = G_i^T \circ G_m^T \circ G_r^T\), where \(\circ\) denotes the Hadamard product and each \(G\) is a polynomial-kernel Gram matrix.
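
The role of the Frobenius norm can be made explicit with the standard matrix-based Rényi entropy. The derivation below is a sketch: the symbol \(G_{im}^T\) for the joint Gram matrix of \(z_i^T\) and \(z_m^T\) is introduced here for illustration, and all Gram matrices are assumed to be normalized to unit trace.

```latex
% Matrix-based Renyi entropy of a unit-trace PSD Gram matrix A
% with eigenvalues \lambda_k:
S_\alpha(A) = \frac{1}{1-\alpha} \log_2 \operatorname{tr}(A^\alpha)
            = \frac{1}{1-\alpha} \log_2 \sum_k \lambda_k^\alpha

% For \alpha = 2 and symmetric A, tr(A^2) = \|A\|_F^2,
% so no eigenvalues are needed:
S_2(A) = -\log_2 \operatorname{tr}(A^2) = -\log_2 \|A\|_F^2

% Mutual information between (z_i^T, z_m^T) and r^T:
\mathbf{I}_2(z_i^T, z_m^T; r^T)
  = S_2(G_{im}^T) + S_2(G_r^T) - S_2(G_{imr}^T)
  = -\log_2 \|G_{im}^T\|_F^2 - \log_2 \|G_r^T\|_F^2 + \log_2 \|G_{imr}^T\|_F^2
```

The last two terms are exactly \(\mathcal{L}_r\). One reading of why the first term is absent is that \(G_{im}^T\) depends only on the teacher features and not on the relation module parameters \(\theta\), so it would contribute no gradient.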

3. Cross-Model Distillation Loss \(\mathcal{L}_d\) (Inter-SAM)

  • Goal: Maximize the mutual information \(\mathbf{I}_\alpha(r^T; r^S)\) between the teacher relations \(r^T\) and the student relations \(r^S\), i.e., minimize its negative: \(\mathcal{L}_d = \log_2 \|G_r^T\|_F^2 + \log_2 \|G_r^S\|_F^2 - \log_2 \|G_r^{TS}\|_F^2\)
  • The teacher and student share the same relation module parameters \(\theta\).
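
Both losses reduce to a few Frobenius norms of Gram matrices, which the following dependency-free Python sketch illustrates. Several details are assumptions rather than facts from the paper: the kernel parameters `degree` and `c`, the unit-trace normalization, treating \(G_r^{TS}\) as the normalized Hadamard product of \(G_r^T\) and \(G_r^S\), and all function names.

```python
import math

def normalized_gram(features, degree=2, c=1.0):
    """Polynomial-kernel Gram matrix over a batch of vectors, scaled to unit trace.

    Assumes every K_ii > 0 (true whenever c > 0).
    """
    n = len(features)
    k = [[(sum(a * b for a, b in zip(u, v)) + c) ** degree for v in features]
         for u in features]
    return [[k[i][j] / (n * math.sqrt(k[i][i] * k[j][j])) for j in range(n)]
            for i in range(n)]

def joint(*grams):
    """Hadamard product of Gram matrices, rescaled to unit trace."""
    n = len(grams[0])
    prod = [[math.prod(g[i][j] for g in grams) for j in range(n)]
            for i in range(n)]
    trace = sum(prod[i][i] for i in range(n))
    return [[v / trace for v in row] for row in prod]

def log2_fro_sq(gram):
    """log2 of the squared Frobenius norm (minus the order-2 Renyi entropy)."""
    return math.log2(sum(v * v for row in gram for v in row))

def relation_compression_loss(g_i, g_m, g_r):
    """L_r = -log2 ||G_r||_F^2 + log2 ||G_imr||_F^2."""
    return -log2_fro_sq(g_r) + log2_fro_sq(joint(g_i, g_m, g_r))

def distillation_loss(g_r_teacher, g_r_student):
    """L_d = log2 ||G_r^T||_F^2 + log2 ||G_r^S||_F^2 - log2 ||G_r^TS||_F^2."""
    return (log2_fro_sq(g_r_teacher) + log2_fro_sq(g_r_student)
            - log2_fro_sq(joint(g_r_teacher, g_r_student)))
```

As a sanity check, for two orthogonal samples with `c=0.0` the normalized Gram matrix is half the identity, its order-2 entropy is 1 bit, and `distillation_loss(g, g)` evaluates to -1, i.e., 1 bit of mutual information between identical teacher and student relations.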

4. Total Loss Function

\(\mathcal{L} = \mathcal{L}_{ce} + \lambda_1 \mathcal{L}_r + \lambda_2 \mathcal{L}_d\)

  • \(\mathcal{L}_{ce}\) is the standard segmentation loss (weighted IoU + BCE).
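
A minimal sketch of how the three terms might be combined is shown below. It simplifies the segmentation loss to an unweighted soft IoU plus BCE over flattened mask probabilities, which is not the weighted variant the paper uses; the function names and the default values of `lambda_1` and `lambda_2` are placeholders, not values from the paper.

```python
import math

def bce_loss(probs, targets, eps=1e-7):
    """Mean binary cross-entropy over flattened mask pixels."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
    return total / len(probs)

def soft_iou_loss(probs, targets, eps=1e-7):
    """1 - soft IoU, computed on probabilities (unweighted simplification)."""
    inter = sum(p * t for p, t in zip(probs, targets))
    union = sum(p + t - p * t for p, t in zip(probs, targets))
    return 1.0 - (inter + eps) / (union + eps)

def total_loss(probs, targets, l_r, l_d, lambda_1=1.0, lambda_2=1.0):
    """L = L_ce + lambda_1 * L_r + lambda_2 * L_d."""
    l_ce = bce_loss(probs, targets) + soft_iou_loss(probs, targets)
    return l_ce + lambda_1 * l_r + lambda_2 * l_d
```

Here `l_r` and `l_d` are the already computed scalar values of the relation compression and distillation losses; in an actual training loop all three terms would be differentiable tensors rather than Python floats.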

Key Experimental Results

Table 1: Comparison of PEFT Methods (SAM ViT-B, 5 Datasets × 4 Domains)

| Method | CAMO \(S_\alpha\)↑ | ISIC Jac↑ | Kvasir \(S_\alpha\)↑ | Leaf IoU↑ | Road IoU↑ |
|---|---|---|---|---|---|
| SAM (zero-shot) | 79.7 | 61.0 | 71.4 | 37.6 | 7.2 |
| LoRA | 87.7 | 87.8 | 93.0 | 71.4 | 59.0 |
| Adapter | 88.2 | 87.7 | 93.4 | 74.4 | 60.5 |
| SU-SAM | 88.3 | 87.8 | 93.8 | 74.7 | 60.2 |
| Adapter+Ours | 88.6 | 88.0 | 94.4 | 75.6 | 61.4 |

Table 2: Comparison of Distillation Methods (Compared with Adapter Student)

| Method | Kvasir \(S_\alpha\)↑ | Leaf IoU↑ | Road IoU↑ |
|---|---|---|---|
| Student (No Distillation) | 93.4 | 74.4 | 60.5 |
| TinySAM | 88.5 | 48.6 | 25.7 |
| MobileSAM | 92.5 | 71.9 | 59.2 |
| VID | 93.7 | 75.1 | 60.7 |
| InfoSAM (Ours) | 94.4 | 75.6 | 61.4 |

Ablation Study

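| Losses used (\(\mathcal{L}_r\), \(\mathcal{L}_d\)) | Kvasir \(S_\alpha\) | Leaf IoU | Road IoU |
|---|---|---|---|
| Neither (Adapter baseline) | 93.4 | 74.4 | 60.5 |
| One of the two | 93.6 (+0.2) | 75.2 (+0.8) | 61.0 (+0.5) |
| Both | 94.4 (+1.0) | 75.6 (+1.2) | 61.4 (+0.9) |
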
  • Both losses make positive contributions, with \(\mathcal{L}_d\) playing a major role in cross-domain distillation.
  • Equally effective on SAM2 (Hiera-B+): Kvasir 94.5, Leaf 77.3, Road 61.3.

Highlights & Insights

  • First Information-Theoretic SAM Adaptation Framework: Introduces Information Bottleneck theory to SAM PEFT, offering a novel perspective and solid theoretical derivation.
  • Aligning Relations Instead of Features: Instead of aligning features layer by layer, InfoSAM extracts and transfers domain-invariant relations between the encoder and decoder, which avoids the performance degradation that distillation can cause when the teacher performs poorly in the downstream domain.
  • Plug-and-Play: Orthogonal to PEFT methods like LoRA and Adapter, and independent of the SAM/SAM2 architectures.
  • Filtering Pseudo-Invariant Information: Compresses domain-specific information such as color via the information bottleneck, preserving only domain-invariant information such as geometric structure.
  • Simplified Calculation via Rényi \(\alpha=2\): Replaces eigenvalue decomposition with the Frobenius norm to reduce computational overhead.

Limitations & Future Work

  • Limited performance gain: The improvement on some datasets (e.g., LoRA+Ours vs. LoRA) is around 0.5-1%, suggesting a potential ceiling for the benefits of domain-invariant relations.
  • When the teacher model performs extremely poorly in the target domain (e.g., Road IoU is only 7.2%), positive transfer still occurs via distillation, but the magnitude is limited.
  • Only validated in box/point prompt scenarios, without exploring text prompts or fully automatic segmentation.
  • The Rényi entropy order \(\alpha\) is fixed at 2, and the impact of different \(\alpha\) values on performance is not explored.
  • The relation module introduces extra parameters and computation, and the paper does not analyze these overheads in detail.
  • Only validated on medium-sized datasets, lacking experiments on large-scale datasets (e.g., SA-1B subsets).
  • Insufficient sensitivity analysis of the hyperparameters \(\lambda_1, \lambda_2\).
  • The combination of Information Bottleneck + Distillation can be extended to the PEFT of other foundation models (e.g., CLIP, DINOv2).
  • Unlike mutual information-based distillation methods such as VID and IBD, InfoSAM focuses on inter-module relations rather than single-layer features.
  • The concept of domain-invariant features originates from Domain Adaptive Segmentation (DAS), but is quantified and transferred using information theory for the first time.

Rating

  • Novelty: ⭐⭐⭐⭐ — First time applying information bottleneck theory to SAM PEFT, with a complete theoretical derivation.
  • Experimental Thoroughness: ⭐⭐⭐⭐ — 8 datasets across 4 domains, SAM+SAM2, with comprehensive comparisons.
  • Writing Quality: ⭐⭐⭐⭐ — Clear information-theoretic formulations and intuitive illustrations.
  • Value: ⭐⭐⭐⭐ — Provides a new perspective on SAM fine-tuning; the plug-and-play distillation scheme holds practical value.