
FSOD-VFM: Few-Shot Object Detection with Vision Foundation Models and Graph Diffusion

Conference: ICLR 2026
arXiv: 2602.03137
Code: https://intellindust-ai-lab.github.io/projects/FSOD-VFM
Area: Object Detection / Few-Shot Learning
Keywords: Few-shot object detection, vision foundation models, graph diffusion, training-free, SAM2

TL;DR

This paper proposes a training-free few-shot object detection framework that combines three foundation models—UPN, SAM2, and DINOv2—for proposal generation and feature matching, and employs a graph diffusion algorithm to refine confidence scores and suppress fragmented proposals. The method achieves substantial improvements over prior state-of-the-art on Pascal-5i and COCO-20i.

Background & Motivation

Background: Few-shot object detection (FSOD) aims to detect novel categories from limited annotated samples. Conventional approaches require fine-tuning, while recent training-free methods leverage foundation models for direct detection.

Limitations of Prior Work: Proposals generated by foundation models such as UPN tend to be excessively fragmented—a single object is split into multiple overlapping sub-regions—and the resulting redundancy is difficult to suppress effectively with simple NMS.

Key Challenge: Post-processing methods such as SoftNMS consider only spatial relationships between bounding boxes, and cannot exploit semantic or mask-overlap information across proposals to determine which candidate is superior.
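
For reference, Gaussian Soft-NMS decays each score purely by box IoU with already-selected higher-scoring boxes, which is exactly the spatial-only limitation described here. A minimal sketch (hypothetical helper names, not code from the paper):

```python
import math

def iou(a, b):
    """IoU of two boxes given as [x1, y1, x2, y2]."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def soft_nms(boxes, scores, sigma=0.5):
    """Gaussian Soft-NMS: repeatedly select the top-scoring box and decay
    the remaining scores by exp(-IoU^2 / sigma). Note that only box
    overlap is used; mask overlap and semantics are invisible to it."""
    scores = list(map(float, scores))
    idx = list(range(len(boxes)))
    out = []
    while idx:
        m = max(idx, key=lambda i: scores[i])
        idx.remove(m)
        out.append((boxes[m], scores[m]))
        for i in idx:
            scores[i] *= math.exp(-iou(boxes[m], boxes[i]) ** 2 / sigma)
    return out
```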

Goal: How can fragmented proposals be effectively suppressed within a training-free framework to yield high-quality detection results?

Key Insight: Model inter-proposal relationships as a directed graph and propagate confidence scores via PageRank-style graph diffusion.

Core Idea: A graph diffusion algorithm propagates suppression signals over the proposal graph, causing fragmented proposals covered by larger boxes to automatically receive lower confidence scores.

Method

Overall Architecture

UPN generates class-agnostic proposals → SAM2 refines masks → DINOv2 extracts features → cosine matching against support prototypes → graph diffusion refines confidence scores → NMS produces final detections.
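
The prototype-construction and cosine-matching stages of this pipeline can be sketched as follows (a minimal sketch with hypothetical helper names; it assumes DINOv2 features have already been mask-pooled per support annotation and per query proposal):

```python
import numpy as np

def build_prototypes(support_feats, support_labels):
    """Average same-class support features, then L2-normalize, as the
    paper describes. Each feature is one mask-pooled DINOv2 vector."""
    protos = {}
    for c in set(support_labels):
        feats = np.stack([f for f, l in zip(support_feats, support_labels) if l == c])
        proto = feats.mean(axis=0)
        protos[c] = proto / np.linalg.norm(proto)
    return protos

def match_proposals(query_feats, protos):
    """Cosine similarity of each L2-normalized query proposal feature
    against every class prototype; returns the best class and score
    for each proposal."""
    classes = list(protos)
    P = np.stack([protos[c] for c in classes])                      # (C, D)
    Q = query_feats / np.linalg.norm(query_feats, axis=1, keepdims=True)
    sims = Q @ P.T                                                  # (N, C)
    best = sims.argmax(axis=1)
    return [classes[b] for b in best], sims.max(axis=1)
```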

Key Designs

  1. RoI Feature Extraction and Prototype Matching:

    • Function: Construct class prototypes from support samples and match them against query proposals.
    • Mechanism: SAM2 generates binary masks for each support annotation; DINOv2 extracts dense features, which are then pooled using the masks to obtain support features. Prototypes are built by averaging same-class support features followed by L2 normalization. Query proposals are matched via cosine similarity.
    • Design Motivation: SAM2 masks enable precise foreground feature extraction, mitigating background noise.
  2. Graph Diffusion (Core Innovation):

    • Function: Construct a directed graph over proposals and iteratively propagate suppression signals to reduce the confidence of fragmented proposals.
    • Mechanism: Each node represents a proposal; edge weights are defined by mask overlap ratio—if the UPN score of proposal \(i\) is lower than that of proposal \(j\), the edge weight from \(j\) to \(i\) is \(\text{Area}(M_i \cap M_j) / \text{Area}(M_i)\). Iterative updates follow a PageRank-style rule: \(\pi^{t+1} = \alpha P \pi^t + (1-\alpha) w\). The final confidence is \((1 - \pi)^{\lambda} \cdot \mathrm{cos\_sim}\); proposals with high \(\pi\) values (i.e., those heavily covered by others) receive reduced confidence.
    • Design Motivation: Compared with NMS/SoftNMS, which rely solely on bounding-box IoU, graph diffusion exploits precise mask overlaps and UPN objectness scores to more accurately identify redundant proposals.
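
The diffusion update above can be sketched in a few lines of NumPy. This is a minimal sketch, not the paper's implementation: it assumes boolean SAM2 masks as input and uses a uniform restart vector \(w\), which the paper may define differently.

```python
import numpy as np

def graph_diffusion(masks, upn_scores, cos_sims, alpha=0.3, lam=0.5, n_iter=30):
    """PageRank-style rescoring over a directed proposal graph.

    masks:      (N, H, W) boolean proposal masks (e.g., from SAM2)
    upn_scores: (N,) UPN objectness scores
    cos_sims:   (N,) cosine similarities to the best-matching prototype
    """
    n = len(masks)
    flat = masks.reshape(n, -1).astype(float)
    areas = flat.sum(axis=1)

    # Directed edge j -> i when proposal i has the lower UPN score;
    # weight = Area(M_i intersect M_j) / Area(M_i), the fraction of i covered by j.
    P = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i != j and upn_scores[i] < upn_scores[j]:
                P[i, j] = (flat[i] * flat[j]).sum() / max(areas[i], 1.0)

    # Iterate pi^{t+1} = alpha * P @ pi^t + (1 - alpha) * w until convergence.
    w = np.full(n, 1.0 / n)   # uniform restart vector (an assumption here)
    pi = w.copy()
    for _ in range(n_iter):
        pi_next = alpha * (P @ pi) + (1.0 - alpha) * w
        if np.abs(pi_next - pi).max() < 1e-8:
            pi = pi_next
            break
        pi = pi_next

    # High pi = heavily covered by stronger proposals -> confidence shrinks.
    return (1.0 - np.clip(pi, 0.0, 1.0)) ** lam * cos_sims
```

The defaults mirror the reported optimum (\(\alpha=0.3\), \(\lambda=0.5\), convergence within 5–30 iterations); a fragment fully inside a higher-scoring proposal accumulates a large \(\pi\) and is suppressed.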

Loss & Training

The framework is entirely training-free; all components are used directly with pretrained weights at inference time.

Key Experimental Results

Main Results

| Dataset | Shot | Ours | Prev. SOTA (NtTT) | Gain |
|---|---|---|---|---|
| Pascal-5i | 1-shot | 77.5 | 70.8 | +6.7 |
| Pascal-5i | 5-shot | 85.8 | 77.2 | +8.6 |
| COCO-20i | 10-shot | 59.4 (nAP50) | 54.1 | +5.3 |
| CD-FSOD (ArTaxOr) | 1-shot | 51.4 | 28.2 | +23.2 |

Ablation Study

| Post-processing Method | Pascal-5i | COCO-20i |
|---|---|---|
| None | 7.4 | 9.9 |
| NMS | 23.4 | 26.1 |
| Soft NMS | 28.1 | 26.6 |
| Soft Merging | 66.0 | 50.4 |
| Graph Diffusion | 77.5 | 59.4 |

Key Findings

  • Graph diffusion outperforms the closest baseline (Soft Merging) by 11.5 and 9.0 points on Pascal-5i and COCO-20i, respectively.
  • The largest gains are observed on cross-domain FSOD (CD-FSOD, +23.2), demonstrating the generality of the graph diffusion approach.
  • Optimal performance is achieved with \(\alpha=0.3\) and \(\lambda=0.5\); convergence occurs within 5–30 iterations.

Highlights & Insights

  • Graph Diffusion as a Replacement for NMS: Proposal suppression is elevated from heuristic rules to graph-structured information propagation, an elegant engineering solution with substantial empirical gains. The approach transfers to any task that must suppress redundant proposals.
  • Pure Assembly Framework: The combination of three foundation models with a graph diffusion post-processing step requires no training whatsoever, demonstrating the potential of composing off-the-shelf foundation models.

Limitations & Future Work

  • Inference is relatively slow (2.4 s/image on an A40), owing to three separate forward passes through UPN, SAM2, and DINOv2.
  • Graph diffusion requires pairwise mask overlap computation, whose cost grows with the number of proposals.
  • Detection quality is inherently dependent on the quality of initial proposals generated by UPN.

Comparisons

  • vs. No-Time-To-Train (NtTT): both are training-free FSOD methods; NtTT relies on SoftNMS for post-processing, whereas this work employs graph diffusion.
  • vs. DINOv2 / DINOv3: both serve as the feature extractor; swapping in DINOv3 yields consistent but marginal improvements over DINOv2.

Rating

  • Novelty: ⭐⭐⭐⭐ — Graph diffusion for suppressing redundant proposals is novel, though the overall framework is largely compositional.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Full coverage across Pascal, COCO, and CD-FSOD with detailed ablations.
  • Writing Quality: ⭐⭐⭐⭐ — Algorithm descriptions are clear and well-structured.
  • Value: ⭐⭐⭐⭐ — Establishes a strong baseline for training-free FSOD.