FSOD-VFM: Few-Shot Object Detection with Vision Foundation Models and Graph Diffusion

Conference: ICLR 2026
arXiv: 2602.03137
Code: https://intellindust-ai-lab.github.io/projects/FSOD-VFM
Area: Object Detection / Few-Shot Learning
Keywords: Few-shot object detection, vision foundation models, graph diffusion, training-free, SAM2

TL;DR

This paper proposes a training-free few-shot object detection framework that combines three foundation models (UPN, SAM2, and DINOv2) for proposal generation and feature matching, and applies a graph diffusion algorithm to refine confidence scores and suppress fragmented proposals. It improves on the prior state of the art (NtTT) by 6.7 points on Pascal-5i (1-shot) and by 5.3 nAP50 on COCO-20i (10-shot).

Background & Motivation

Background: Few-shot object detection (FSOD) aims to detect novel categories from limited annotated samples. Conventional approaches require fine-tuning, while recent training-free methods leverage foundation models for direct detection.

Limitations of Prior Work: Proposals generated by foundation models such as UPN tend to be excessively fragmented—a single object is split into multiple overlapping sub-regions—and the resulting redundancy is difficult to suppress effectively with simple NMS.

Key Challenge: Post-processing methods such as SoftNMS consider only spatial relationships between bounding boxes, and cannot exploit semantic or mask-overlap information across proposals to determine which candidate is superior.
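
The gap between box IoU and coverage can be made concrete with a small example. The sketch below uses axis-aligned boxes as a stand-in for masks (an assumption for brevity; the method itself works on SAM2 masks), and the helper names and coordinates are hypothetical. A fragment that lies entirely inside a larger proposal has a low IoU with it, so an IoU threshold will not flag it, even though the fragment is fully covered.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def area(box: Box) -> float:
    """Area of a box; degenerate boxes count as zero."""
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def intersection(a: Box, b: Box) -> float:
    """Area of the overlap between two boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    return area((x1, y1, x2, y2))


def iou(a: Box, b: Box) -> float:
    """Symmetric overlap used by NMS and SoftNMS."""
    inter = intersection(a, b)
    return inter / (area(a) + area(b) - inter)


def coverage(part: Box, whole: Box) -> float:
    """Asymmetric overlap: the fraction of `part` that lies inside `whole`."""
    return intersection(part, whole) / area(part)


whole_object = (0.0, 0.0, 40.0, 40.0)   # proposal covering the full object
fragment = (10.0, 10.0, 20.0, 20.0)     # proposal covering only a sub-region

fragment_iou = iou(fragment, whole_object)            # 100 / 1600 = 0.0625
fragment_coverage = coverage(fragment, whole_object)  # 100 / 100 = 1.0
print(f"IoU = {fragment_iou:.4f}, coverage = {fragment_coverage:.2f}")
```

With a typical IoU threshold of 0.5, the fragment survives NMS, while the asymmetric coverage ratio identifies it as redundant.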

Goal: How can fragmented proposals be effectively suppressed within a training-free framework to yield high-quality detection results?

Key Insight: Model inter-proposal relationships as a directed graph and propagate confidence scores via PageRank-style graph diffusion.

Core Idea: A graph diffusion algorithm propagates suppression signals over the proposal graph, so that fragmented proposals whose masks are largely covered by higher-scoring proposals automatically receive lower confidence scores.

Method

Overall Architecture

UPN generates class-agnostic proposals → SAM2 refines masks → DINOv2 extracts features → cosine matching against support prototypes → graph diffusion refines confidence scores → NMS produces final detections.

Key Designs

RoI Feature Extraction and Prototype Matching:
- Function: Construct class prototypes from support samples and match them against query proposals.
- Mechanism: SAM2 generates binary masks for each support annotation; DINOv2 extracts dense features, which are then pooled using the masks to obtain support features. Prototypes are built by averaging same-class support features followed by L2 normalization. Query proposals are matched via cosine similarity.
- Design Motivation: SAM2 masks enable precise foreground feature extraction, mitigating background noise.
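
The prototype construction and matching steps can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the feature map is assumed to be already resized to the mask resolution, and the toy arrays stand in for real DINOv2 features and SAM2 masks.

```python
import numpy as np


def masked_average_pool(features: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Average an (H, W, C) feature map over the pixels where the (H, W) mask is True."""
    weights = mask.astype(features.dtype)
    total = weights.sum()
    if total == 0:
        raise ValueError("mask is empty")
    return (features * weights[..., None]).sum(axis=(0, 1)) / total


def l2_normalize(x: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Scale each row vector to unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def build_prototypes(support_feats: np.ndarray, support_labels: np.ndarray,
                     num_classes: int) -> np.ndarray:
    """Average same-class support features, then L2-normalize each class mean."""
    means = np.stack([support_feats[support_labels == c].mean(axis=0)
                      for c in range(num_classes)])
    return l2_normalize(means)


def cosine_scores(query_feats: np.ndarray, prototypes: np.ndarray) -> np.ndarray:
    """Cosine similarity between every query feature and every class prototype."""
    return l2_normalize(query_feats) @ prototypes.T


# Toy mask pooling: only the two diagonal pixels contribute.
feature_map = np.arange(8, dtype=np.float64).reshape(2, 2, 2)
mask = np.array([[True, False], [False, True]])
pooled = masked_average_pool(feature_map, mask)

# Toy matching: three support features for two classes, two query proposals.
support_feats = np.array([[1.0, 0.0], [3.0, 0.0], [0.0, 2.0]])
support_labels = np.array([0, 0, 1])
prototypes = build_prototypes(support_feats, support_labels, num_classes=2)

query_feats = np.array([[5.0, 0.0], [1.0, 3.0]])
scores = cosine_scores(query_feats, prototypes)
predicted = scores.argmax(axis=1)
print(pooled, predicted)
```

Because the prototypes are unit vectors, the matrix product in `cosine_scores` gives cosine similarities directly once the query features are normalized as well.
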
Graph Diffusion (Core Innovation):
- Function: Construct a directed graph over proposals and iteratively propagate suppression signals to reduce the confidence of fragmented proposals.
- Mechanism: Each node represents a proposal, and edge weights are defined by the mask overlap ratio: if the UPN score of proposal \(i\) is lower than that of proposal \(j\), the edge from \(j\) to \(i\) has weight \(\text{Area}(M_i \cap M_j) / \text{Area}(M_i)\), i.e., the fraction of \(M_i\) covered by \(M_j\). Scores are updated iteratively with a PageRank-style rule, \(\pi^{t+1} = \alpha P \pi^t + (1-\alpha) w\). The final confidence is \((1 - \pi)^\lambda \cdot \text{cos\_sim}\), so proposals with high \(\pi\) values (i.e., those heavily covered by others) receive reduced confidence.
- Design Motivation: Compared with NMS/SoftNMS, which rely solely on bounding-box IoU, graph diffusion exploits precise mask overlaps and UPN objectness scores to more accurately identify redundant proposals.
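
The diffusion step can be sketched as follows. The edge weights, the update rule, and the final rescaling follow the description above; everything else is an assumption made so the example is self-contained. In particular, this summary does not define \(P\) or \(w\), so the sketch takes \(P\) to be the row-normalized edge-weight matrix and \(w_i\) to be the largest coverage of proposal \(i\) by any higher-scoring proposal, which keeps \(\pi\) in \([0, 1]\). Function names and the toy masks are hypothetical.

```python
import numpy as np


def coverage_matrix(masks: np.ndarray, objectness: np.ndarray) -> np.ndarray:
    """A[i, j] = Area(M_i ∩ M_j) / Area(M_i) if objectness[i] < objectness[j], else 0."""
    flat = masks.reshape(len(masks), -1).astype(np.float64)
    inter = flat @ flat.T                           # pairwise intersection areas
    areas = flat.sum(axis=1)
    ratio = inter / np.maximum(areas[:, None], 1.0)
    lower = objectness[:, None] < objectness[None, :]
    return np.where(lower, ratio, 0.0)


def diffuse(A: np.ndarray, alpha: float = 0.3, num_iters: int = 30,
            tol: float = 1e-6) -> np.ndarray:
    """Iterate pi <- alpha * P @ pi + (1 - alpha) * w until the update is below tol."""
    row_sums = A.sum(axis=1, keepdims=True)
    # Assumption: P is A with each non-empty row scaled to sum to 1.
    P = np.divide(A, row_sums, out=np.zeros_like(A), where=row_sums > 0)
    # Assumption: w_i is the strongest coverage of proposal i by a better proposal.
    w = A.max(axis=1)
    pi = w.copy()
    for _ in range(num_iters):
        new_pi = alpha * (P @ pi) + (1.0 - alpha) * w
        converged = np.abs(new_pi - pi).max() < tol
        pi = new_pi
        if converged:
            break
    return np.clip(pi, 0.0, 1.0)


def refine_scores(cos_sim: np.ndarray, pi: np.ndarray, lam: float = 0.5) -> np.ndarray:
    """Final confidence (1 - pi)^lambda * cos_sim."""
    return (1.0 - pi) ** lam * cos_sim


# Toy scene on a 4x6 grid: a whole object, a fragment of it, and a separate object.
masks = np.zeros((3, 4, 6), dtype=bool)
masks[0, :, 0:4] = True      # whole object
masks[1, 0:2, 0:2] = True    # fragment inside the whole object
masks[2, :, 4:6] = True      # unrelated object
objectness = np.array([0.9, 0.5, 0.8])
cos_sim = np.array([0.8, 0.8, 0.6])

A = coverage_matrix(masks, objectness)
pi = diffuse(A)
refined = refine_scores(cos_sim, pi)
print(pi, refined)
```

In this toy scene only the fragment has an incoming edge, so only its score is reduced; the whole object and the unrelated object keep their cosine similarity unchanged. Other choices of \(P\) and \(w\) would change how strongly fragments are suppressed.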

Loss & Training

The framework is entirely training-free; all components are used directly with pretrained weights at inference time.

Key Experimental Results

Main Results

Dataset	Shot	Ours	Prev. SOTA (NtTT)	Gain
Pascal-5i	1-shot	77.5	70.8	+6.7
Pascal-5i	5-shot	85.8	77.2	+8.6
COCO-20i	10-shot	59.4 (nAP50)	54.1	+5.3
CD-FSOD (ArTaxOr)	1-shot	51.4	28.2	+23.2

Ablation Study

Post-processing Method	Pascal-5i (1-shot)	COCO-20i (10-shot, nAP50)
None	7.4	9.9
NMS	23.4	26.1
Soft NMS	28.1	26.6
Soft Merging	66.0	50.4
Graph Diffusion	77.5	59.4

Key Findings

Graph diffusion outperforms the closest baseline (Soft Merging) by 11.5 and 9.0 points on Pascal-5i and COCO-20i, respectively.
The largest gain over NtTT appears in cross-domain FSOD (CD-FSOD on ArTaxOr, 1-shot: +23.2), which suggests that the framework generalizes beyond the in-domain benchmarks.
Optimal performance is achieved with \(\alpha=0.3\) and \(\lambda=0.5\); convergence occurs within 5–30 iterations.
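
A short derivation helps explain why few iterations suffice. It assumes that \(P\) is non-negative with row sums of at most 1, so that \(\lVert P \rVert_\infty \le 1\); this normalization is an assumption here and is not stated in this summary. Let \(\pi^{*}\) denote the fixed point \(\pi^{*} = \alpha P \pi^{*} + (1-\alpha) w\). Subtracting it from the update rule gives

\[
\pi^{t+1} - \pi^{*} = \alpha P\,(\pi^{t} - \pi^{*})
\quad\Longrightarrow\quad
\lVert \pi^{t} - \pi^{*} \rVert_\infty \le \alpha^{t}\, \lVert \pi^{0} - \pi^{*} \rVert_\infty .
\]

For \(\alpha = 0.3\), the error bound shrinks to \(0.3^{5} \approx 2.4 \times 10^{-3}\) of its initial value after 5 iterations and to \(0.3^{10} \approx 5.9 \times 10^{-6}\) after 10, which is consistent with convergence within a few tens of iterations at most.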

Highlights & Insights

Graph Diffusion as a Replacement for NMS: Proposal suppression moves from heuristic rules to graph-structured information propagation, with substantial empirical gains (11.5 points over Soft Merging on Pascal-5i). The approach should transfer to other tasks that require removing redundant proposals.
Pure Assembly Framework: The combination of three foundation models with a graph diffusion post-processing step requires no training whatsoever, demonstrating the potential of composing off-the-shelf foundation models.

Limitations & Future Work

Inference is relatively slow (2.4 s/image on an A40), owing to three separate forward passes through UPN, SAM2, and DINOv2.
Graph diffusion requires pairwise mask overlap computation, whose cost grows with the number of proposals.
Detection quality is inherently dependent on the quality of initial proposals generated by UPN.

Comparison with Related Work

vs. No-Time-To-Train: Both are training-free FSOD methods; NtTT uses SoftNMS as post-processing, whereas this work employs graph diffusion.
vs. DINOv2/DINOv3: Either model can serve as the feature extractor; replacing DINOv2 with DINOv3 yields small but consistent improvements.

Rating

Novelty: ⭐⭐⭐⭐ — Graph diffusion for proposal de-redundancy is novel, though the overall framework is largely compositional.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Full coverage across Pascal, COCO, and CD-FSOD with detailed ablations.
Writing Quality: ⭐⭐⭐⭐ — Algorithm descriptions are clear and well-structured.
Value: ⭐⭐⭐⭐ — Establishes a strong baseline for training-free FSOD.