# FSOD-VFM: Few-Shot Object Detection with Vision Foundation Models and Graph Diffusion
- Conference: ICLR 2026
- arXiv: 2602.03137
- Code: https://intellindust-ai-lab.github.io/projects/FSOD-VFM
- Area: Object Detection / Few-Shot Learning
- Keywords: few-shot object detection, vision foundation models, graph diffusion, training-free, SAM2
## TL;DR
This paper proposes a training-free few-shot object detection framework that combines three foundation models—UPN, SAM2, and DINOv2—for proposal generation and feature matching, and employs a graph diffusion algorithm to refine confidence scores and suppress fragmented proposals. The method achieves substantial improvements over prior state-of-the-art on Pascal-5i and COCO-20i.
## Background & Motivation
Background: Few-shot object detection (FSOD) aims to detect novel categories from limited annotated samples. Conventional approaches require fine-tuning, while recent training-free methods leverage foundation models for direct detection.
Limitations of Prior Work: Proposals generated by foundation models such as UPN tend to be excessively fragmented—a single object is split into multiple overlapping sub-regions—and the resulting redundancy is difficult to suppress effectively with simple NMS.
Key Challenge: Post-processing methods such as Soft-NMS consider only the spatial relationships between bounding boxes; they cannot exploit semantic or mask-overlap information across proposals to decide which of two overlapping candidates should survive.
Goal: How can fragmented proposals be effectively suppressed within a training-free framework to yield high-quality detection results?
Key Insight: Model inter-proposal relationships as a directed graph and propagate confidence scores via PageRank-style graph diffusion.
Core Idea: A graph diffusion algorithm propagates suppression signals over the proposal graph, causing fragmented proposals covered by larger boxes to automatically receive lower confidence scores.
## Method
### Overall Architecture
UPN generates class-agnostic proposals → SAM2 refines masks → DINOv2 extracts features → cosine matching against support prototypes → graph diffusion refines confidence scores → NMS produces final detections.
### Key Designs
- RoI Feature Extraction and Prototype Matching:
- Function: Construct class prototypes from support samples and match them against query proposals.
- Mechanism: SAM2 generates binary masks for each support annotation; DINOv2 extracts dense features, which are then pooled using the masks to obtain support features. Prototypes are built by averaging same-class support features followed by L2 normalization. Query proposals are matched via cosine similarity.
- Design Motivation: SAM2 masks enable precise foreground feature extraction, mitigating background noise.
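The prototype-matching step above can be sketched in NumPy. This is an illustrative reading of the description, not the authors' code: function names and tensor shapes are assumptions, and the dense features and masks are assumed to have been precomputed by DINOv2 and SAM2.

```python
import numpy as np

def masked_pool(dense_feats, mask):
    """Average a dense feature map over the foreground of a binary mask.

    dense_feats: (H, W, D) patch features (e.g. from DINOv2).
    mask:        (H, W) boolean foreground mask (e.g. from SAM2).
    """
    return dense_feats[mask].mean(axis=0)          # (D,)

def build_prototypes(support_feats, support_labels, num_classes):
    """Average same-class support features, then L2-normalize."""
    dim = support_feats.shape[1]
    protos = np.zeros((num_classes, dim))
    for c in range(num_classes):
        protos[c] = support_feats[support_labels == c].mean(axis=0)
    return protos / np.linalg.norm(protos, axis=1, keepdims=True)

def match_proposals(proposal_feats, protos):
    """Cosine similarity between each query proposal and each prototype."""
    q = proposal_feats / np.linalg.norm(proposal_feats, axis=1, keepdims=True)
    return q @ protos.T                            # (N, num_classes)
```

Because both sides are L2-normalized, the matrix product directly yields cosine similarities in \([-1, 1]\), which then serve as the per-proposal class scores refined by graph diffusion.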
- Graph Diffusion (Core Innovation):
- Function: Construct a directed graph over proposals and iteratively propagate suppression signals to reduce the confidence of fragmented proposals.
- Mechanism: Each node represents a proposal; edge weights are defined by mask overlap ratio: if the UPN score of proposal \(i\) is lower than that of proposal \(j\), the edge weight from \(j\) to \(i\) is \(\text{Area}(M_i \cap M_j) / \text{Area}(M_i)\). Iterative updates follow a PageRank-style rule: \(\pi^{t+1} = \alpha P \pi^t + (1-\alpha) w\). The final confidence is \((1 - \pi)^{\lambda} \cdot s\), where \(s\) is the prototype cosine similarity; proposals with high \(\pi\) values (i.e., those heavily covered by others) receive reduced confidence.
- Design Motivation: Compared with NMS and Soft-NMS, which rely solely on bounding-box IoU, graph diffusion exploits precise mask overlaps and UPN objectness scores to identify redundant proposals more accurately.
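The diffusion step can be sketched as follows. This is a minimal NumPy reading of the update rule above, not the authors' implementation: the paper (as summarized here) specifies only \(\pi^{t+1} = \alpha P \pi^t + (1-\alpha) w\), so the uniform personalization vector and the clipping of \(\pi\) are assumptions made for the sketch.

```python
import numpy as np

def graph_diffusion(masks, upn_scores, cos_sims, alpha=0.3, lam=0.5, iters=30):
    """PageRank-style suppression over a proposal graph (illustrative sketch).

    masks:      (N, H, W) boolean proposal masks (e.g. from SAM2).
    upn_scores: (N,) objectness scores from the proposal generator.
    cos_sims:   (N,) prototype cosine similarity per proposal.
    """
    n = len(masks)
    flat = masks.reshape(n, -1).astype(float)
    areas = flat.sum(axis=1)
    # Edge j -> i (stored at P[i, j]) when i has the lower objectness score;
    # weight = |M_i ∩ M_j| / |M_i|, i.e. the fraction of i covered by j.
    P = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i != j and upn_scores[i] < upn_scores[j]:
                P[i, j] = (flat[i] * flat[j]).sum() / areas[i]
    # Assumption: uniform personalization vector w.
    w = np.full(n, 1.0 / n)
    pi = w.copy()
    for _ in range(iters):
        pi = alpha * P @ pi + (1 - alpha) * w
    # Heavily covered (fragmented) proposals accumulate high pi and are
    # down-weighted; clipping keeps the base of the power non-negative.
    return (1.0 - np.clip(pi, 0.0, 1.0)) ** lam * cos_sims
```

The defaults \(\alpha=0.3\) and \(\lambda=0.5\) mirror the optimum reported in the ablations. On a toy pair of masks, a small fragment fully covered by a higher-scoring box ends up with a lower final score than the covering box, which is exactly the suppression behavior the section describes.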
### Loss & Training
The framework is entirely training-free; all components are used directly with pretrained weights at inference time.
## Key Experimental Results
### Main Results
| Dataset | Shot | Ours | Prev. SOTA (No-Time-To-Train) | Gain |
|---|---|---|---|---|
| Pascal-5i | 1-shot | 77.5 | 70.8 | +6.7 |
| Pascal-5i | 5-shot | 85.8 | 77.2 | +8.6 |
| COCO-20i | 10-shot | 59.4 (nAP50) | 54.1 | +5.3 |
| CD-FSOD (ArTaxOr) | 1-shot | 51.4 | 28.2 | +23.2 |
### Ablation Study
| Post-processing Method | Pascal-5i | COCO-20i |
|---|---|---|
| None | 7.4 | 9.9 |
| NMS | 23.4 | 26.1 |
| Soft-NMS | 28.1 | 26.6 |
| Soft Merging | 66.0 | 50.4 |
| Graph Diffusion | 77.5 | 59.4 |
### Key Findings
- Graph diffusion outperforms the strongest alternative post-processing method (Soft Merging) by 11.5 and 9.0 points on Pascal-5i and COCO-20i, respectively.
- The largest gains are observed on cross-domain FSOD (CD-FSOD, +23.2), demonstrating the generality of the graph diffusion approach.
- Optimal performance is achieved with \(\alpha=0.3\) and \(\lambda=0.5\); convergence occurs within 5–30 iterations.
## Highlights & Insights
- Graph Diffusion as a Replacement for NMS: Proposal suppression is elevated from heuristic rules to graph-structured information propagation, an elegant engineering solution with substantial empirical gains. The approach transfers to any task that must deduplicate overlapping proposals.
- Pure Assembly Framework: The combination of three foundation models with a graph diffusion post-processing step requires no training whatsoever, demonstrating the potential of composing off-the-shelf foundation models.
## Limitations & Future Work
- Inference is relatively slow (2.4 s/image on an A40), owing to three separate forward passes through UPN, SAM2, and DINOv2.
- Graph diffusion requires pairwise mask overlap computation, whose cost grows with the number of proposals.
- Detection quality is inherently dependent on the quality of initial proposals generated by UPN.
## Related Work & Insights
- vs. No-Time-To-Train: Both are training-free FSOD methods; NtTT uses Soft-NMS as post-processing, whereas this work employs graph diffusion.
- vs. DINOv2/DINOv3: Both are used as feature extractors; DINOv3 yields consistent but marginal improvements over DINOv2.
## Rating
- Novelty: ⭐⭐⭐⭐ — Graph diffusion for proposal deduplication is novel, though the overall framework is largely compositional.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Full coverage across Pascal, COCO, and CD-FSOD with detailed ablations.
- Writing Quality: ⭐⭐⭐⭐ — Algorithm descriptions are clear and well-structured.
- Value: ⭐⭐⭐⭐ — Establishes a strong baseline for training-free FSOD.