Few-Shot Pattern Detection via Template Matching and Regression
- Conference: ICCV 2025
- arXiv: 2508.17636
- Code: https://cvlab.postech.ac.kr/research/TMR
- Area: LLM Evaluation
- Keywords: few-shot detection, template matching, pattern detection, repetitive patterns, anchor-free detection
TL;DR
This paper proposes TMR, which combines classical template matching with support-conditioned bounding box regression to detect arbitrary patterns, including non-object-level ones, from a few exemplars. The authors also introduce the RPINE dataset to cover a broader range of repetitive patterns. TMR outperforms existing FSCD methods on RPINE, FSCD-147, and FSCD-LVIS, and generalizes well across datasets.
Background & Motivation
- Background: Few-shot object detection (FSOD) and few-shot counting detection (FSCD) have achieved notable progress, but existing methods rely heavily on object-level priors and are primarily designed for categories with well-defined boundaries.
- Limitations of Prior Work: Mainstream methods (e.g., Counting-DETR, GeCo, PseCo) compress support samples into prototype vectors via global average pooling, discarding spatial structural information.
- Key Challenge: When detection targets are extended from objects to arbitrary patterns (e.g., textures, geometric structures, object parts), spatial layout information becomes critical—yet prototype matching discards precisely this information.
- Core Problem: How to design a few-shot pattern detector that preserves spatial structure without relying on object-level priors.
- Key Insight: Revisiting classical template matching and leveraging 2D cross-correlation to preserve the spatial layout of exemplars.
- Core Idea: Replace prototype matching with channel-wise template matching, and adaptively refine bounding boxes via support-conditioned regression.
Method
Overall Architecture
TMR adopts a minimalist structure: a frozen SAM-ViT/H backbone extracts feature maps → RoIAlign crops template features → channel-wise template matching produces correlation maps → original and matched features are concatenated → a bounding box regressor and existence classifier generate predictions → NMS post-processing. The entire detection head consists of only a few \(3\times3\) convolutions and linear layers, with no cross-attention or other complex modules.
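The final NMS step in this pipeline is a standard post-processing technique. As a minimal sketch (not the authors' code), greedy NMS over scored boxes could look like the following. The `(x1, y1, x2, y2)` box format, the function names, and the `iou_threshold=0.5` default are assumptions made for illustration.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes, scores, iou_threshold=0.5):
    """Return indices of kept boxes, highest score first.

    iou_threshold=0.5 is an illustrative default, not a value from the paper.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with any kept box.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = greedy_nms(boxes, scores)  # the second box overlaps the first and is dropped
```

For densely repeated patterns the threshold matters: a value that is too low can suppress genuinely adjacent instances, so it would normally be tuned per dataset.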
Key Designs
- Channel-wise Template Matching:
    - Function: Performs cross-correlation between the template feature \(\mathbf{T} \in \mathbb{R}^{t_h \times t_w \times D}\) and the image feature map \(\mathbf{F}\) via a sliding window.
    - Mechanism: \(\mathbf{F}_{\text{TM}}(x,y) = \frac{1}{t_w t_h} \sum_{x',y'} \mathbf{F}(x+x'-\lfloor t_w/2 \rfloor, y+y'-\lfloor t_h/2 \rfloor) \mathbf{T}(x',y')\), where the product is taken element-wise over channels, so the result retains the channel dimension: \(\mathbf{F}_{\text{TM}} \in \mathbb{R}^{H \times W \times D}\).
    - Design Motivation: Unlike prototype matching, channel-wise template matching preserves the spatial structure and geometric characteristics of exemplars, which is critical for detecting non-object patterns. In the ablation, replacing the 2D template with a pooled prototype drops AP sharply (RPINE: 33.59 → 20.94).
- Adaptive Template Extraction:
    - Function: Uses RoIAlign to crop template features from the image feature map.
    - Mechanism: Template size is adaptively determined based on the actual size of the support exemplar (rounded up), rather than fixed-size pooling.
    - Design Motivation: Maintains translational alignment between the template and the feature map, avoiding spatial information loss.
- Support-Conditioned Box Regression:
    - Function: Predicts scaling and offset parameters relative to the support exemplar size, rather than absolute coordinates.
    - Mechanism: For each feature point \((x, y)\), the model predicts \((\Delta x, \Delta y, \alpha_w, \alpha_h)\); the final box is \((x + s_w \Delta x,\ y + s_h \Delta y,\ e^{\alpha_w} s_w,\ e^{\alpha_h} s_h)\), where \(s_w\) and \(s_h\) are the width and height of the support exemplar.
    - Design Motivation: Using the exemplar size as the reference lets the model adapt to exemplars and targets of varying scales. Ablation experiments confirm that conditioned regression outperforms direct regression (AP: 36.01 vs. 17.01, matching the FSCD-147 1-shot setting).
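To make the first two designs concrete, the sketch below computes a rounded-up template size and then applies the channel-wise matching formula. It is an illustration under stated assumptions, not the authors' implementation: nested Python lists stand in for feature tensors, the RoIAlign crop itself is not shown (the template is passed in directly), the `stride` parameter is hypothetical, and reads outside the feature map are treated as zero.

```python
import math


def template_size(box_w, box_h, stride):
    """Template size in feature cells, rounded up to cover the exemplar.

    `stride` (input pixels per feature cell) is an assumed parameter.
    """
    return max(1, math.ceil(box_w / stride)), max(1, math.ceil(box_h / stride))


def channelwise_template_match(feat, tmpl):
    """Channel-wise cross-correlation, normalized by template area.

    feat: H x W x D nested lists; tmpl: t_h x t_w x D nested lists.
    Returns an H x W x D map; out-of-range reads count as zero (assumed padding).
    """
    H, W, D = len(feat), len(feat[0]), len(feat[0][0])
    th, tw = len(tmpl), len(tmpl[0])
    out = [[[0.0] * D for _ in range(W)] for _ in range(H)]
    for y in range(H):
        for x in range(W):
            for ty in range(th):
                for tx in range(tw):
                    fy = y + ty - th // 2  # template centered at (x, y)
                    fx = x + tx - tw // 2
                    if 0 <= fy < H and 0 <= fx < W:
                        for d in range(D):
                            # Element-wise product: channels are never summed.
                            out[y][x][d] += feat[fy][fx][d] * tmpl[ty][tx][d]
            for d in range(D):
                out[y][x][d] /= th * tw
    return out


ones = [[[1.0, 1.0] for _ in range(3)] for _ in range(3)]
corr = channelwise_template_match(ones, ones)
# corr[1][1] is [1.0, 1.0]: the 3x3 template fully overlaps at the center.
```

Because each channel is correlated separately, the output keeps \(D\) channels instead of collapsing to a single similarity score, which is what distinguishes this from matching against a pooled prototype vector.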
Loss & Training
- Existence Loss \(\mathcal{L}_P\): Binary cross-entropy loss, with positive labels assigned within a margin expanded around each target's center point.
- Bounding Box Loss \(\mathcal{L}_B\): gIoU loss, computed only at locations where targets are present.
- Total Loss: \(\mathcal{L} = \mathcal{L}_P + \mathcal{L}_B\)
- The SAM-ViT/H backbone is kept frozen; only the detection head is trained (~19M trainable parameters).
- Feature maps are interpolated from \(64\times64\) to \(128\times128\) to improve dense prediction accuracy.
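The box loss operates on boxes decoded from the support-conditioned predictions. The sketch below decodes one prediction and scores it with a gIoU loss, \(1 - \text{gIoU}\). It is a minimal illustration rather than the authors' code: the function names are hypothetical, and treating the first two decoded values as the box center is an assumption.

```python
import math


def decode_box(x, y, pred, support_w, support_h):
    """Decode (dx, dy, alpha_w, alpha_h) at location (x, y) into (cx, cy, w, h).

    Offsets are scaled by the support exemplar size, and width/height are
    exponential rescalings of it. Center interpretation is an assumption.
    """
    dx, dy, alpha_w, alpha_h = pred
    return (x + support_w * dx,
            y + support_h * dy,
            math.exp(alpha_w) * support_w,
            math.exp(alpha_h) * support_h)


def to_corners(box):
    """Convert (cx, cy, w, h) to (x1, y1, x2, y2)."""
    cx, cy, w, h = box
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)


def giou_loss(pred, target):
    """1 - gIoU for two positive-area boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(pred[2], target[2]) - max(pred[0], target[0]))
    inter_h = max(0.0, min(pred[3], target[3]) - max(pred[1], target[1]))
    inter = inter_w * inter_h
    area_p = (pred[2] - pred[0]) * (pred[3] - pred[1])
    area_t = (target[2] - target[0]) * (target[3] - target[1])
    union = area_p + area_t - inter
    # Smallest axis-aligned box enclosing both inputs.
    hull = ((max(pred[2], target[2]) - min(pred[0], target[0]))
            * (max(pred[3], target[3]) - min(pred[1], target[1])))
    giou = inter / union - (hull - union) / hull
    return 1.0 - giou


# A zero prediction reproduces the support exemplar size at (x, y).
box = to_corners(decode_box(10, 10, (0.0, 0.0, 0.0, 0.0), 4, 6))
loss = giou_loss(box, (8.0, 7.0, 12.0, 13.0))  # identical boxes, so the loss is 0.0
```

With this parameterization an all-zero output already yields an exemplar-sized box centered on the feature point, so the regressor only needs to learn deviations from the exemplar.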
Key Experimental Results
Main Results
| Dataset | Metric | TMR | Prev. SOTA | Gain |
|---|---|---|---|---|
| RPINE (1-shot) | AP | 33.59 | 23.33 (GeCo) | +10.26 |
| RPINE (1-shot) | AP50 | 64.05 | 45.93 (GeCo) | +18.12 |
| FSCD-LVIS seen (3-shot) | AP | 27.49 | 22.37 (PseCo) | +5.12 |
| FSCD-LVIS unseen (3-shot) | AP | 22.71 | 11.47 (GeCo) | +11.24 |
| FSCD-147 (1-shot) | AP | 36.01 | 32.71 (GeCo) | +3.30 |
| FSCD-147 (3-shot) | AP | 38.57 | 32.49 (GeCo) | +6.08 |
Ablation Study
| Configuration | RPINE AP | FSCD-147 AP | Notes |
|---|---|---|---|
| Image features only \(\mathbf{F}\) | 11.44 | 20.95 | Lower bound without exemplar information |
| Template matching features only \(\mathbf{F}_{\text{TM}}\) | 32.55 | 31.96 | Effectiveness of template matching |
| \(\mathbf{F} \oplus \mathbf{F}_{\text{PM}}\) (prototype matching) | 20.94 | 28.91 | Pooled prototype loses spatial information |
| \(\mathbf{F} \oplus \mathbf{F}_{\text{TM}}\) (full model) | 33.59 | 36.01 | Optimal combination of spatial structure and appearance |
Key Findings
- Adding the SAM decoder as a post-processing step lowers AP on RPINE (33.59 → 29.66): SAM tends to align its outputs with edges, which is detrimental for non-object patterns.
- TMR's FLOPs (3.04T) are substantially lower than those of PseCo (5.08T) and GeCo (4.72T).
- TMR also generalizes better across datasets: trained on RPINE and tested on FSCD-147, it achieves an AP of 41.39 vs. 36.99 for GeCo.
Highlights & Insights
- A successful case of revisiting a classical method: template matching, applied to features from a modern backbone, outperforms recent prototype-based detectors on all three benchmarks.
- Minimalist architecture: only \(3\times3\) convolutions and linear layers, without any attention mechanism, yet achieves state-of-the-art results.
- The RPINE dataset fills the gap in evaluation of non-object pattern detection, supporting multi-pattern annotation (up to 3 distinct patterns per image, independently annotated by 3 annotators).
- The results expose the tendency of prototype matching methods to overfit to object-level semantics.
- The insight regarding the SAM decoder is particularly valuable: its edge-alignment property is harmful for non-object patterns, cautioning the community against the uncritical use of SAM post-processing.
Limitations & Future Work
- Relies on a frozen SAM backbone; resolution for small instances is constrained by the ViT patch size.
- The RPINE dataset is relatively small (4,362 images), potentially limiting diversity in learned representations.
- Only 2D patterns are addressed; spatiotemporal patterns in 3D or video settings are not considered.
- Computational overhead from multi-scale inference can be further optimized.
- Robustness of template matching to rotation and large-scale variation remains to be validated.
- The authors suggest future exploration of lightweight, pattern-specific architectures that reduce dependence on object-level edge priors.
Related Work & Insights
- GeCo/PseCo: Prototype matching-based FSCD methods whose performance is limited by spatial information loss.
- Classical Template Matching: TMR successfully combines the classical approach with deep features, offering a paradigm worth borrowing for other detection tasks.
- Generality of the SAM Backbone: Frozen SAM features perform strongly in few-shot detection, demonstrating the transferability of large-model representations.
- FSCD-147/FSCD-LVIS: Existing standard FSCD datasets, but covering only object-level patterns.
- Insight: For matching tasks requiring spatial structure preservation (e.g., point cloud registration, texture analysis), the combination of classical methods and modern features may be an underexplored yet effective strategy.
- SEM Image Application: The authors demonstrate cross-domain detection on scanning electron microscope images, indicating practical application potential of the proposed method.
Rating
- Novelty: ⭐⭐⭐⭐ Integrates classical template matching with a modern detection framework and proposes support-conditioned regression from a fresh perspective.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Three datasets, cross-dataset evaluation, comprehensive ablation, computational complexity analysis, and real-world application validation.
- Writing Quality: ⭐⭐⭐⭐ Clear logic, well-motivated problem formulation, and rich figures and tables.
- Value: ⭐⭐⭐⭐ Extends few-shot detection from objects to arbitrary patterns, opening a new research direction; the RPINE dataset holds long-term value for the community.