UniFS: Universal Few-Shot Instance Perception with Point Representations
Conference: ECCV 2024
arXiv: 2404.19401
Code: https://github.com/jin-s13/UniFS
Area: Segmentation
Keywords: Few-Shot Learning, Unified Model, Point Representation, Instance Perception, Multi-Task Learning
TL;DR
This paper proposes UniFS, the first universal few-shot instance perception model. It unifies object detection, instance segmentation, pose estimation, and object counting under a single dynamic point representation learning paradigm, and introduces a structure-aware point learning (SAPL) loss to capture high-order structural relations among points. With minimal task-specific assumptions, UniFS achieves performance close to that of expert models.
Background & Motivation
Instance perception tasks (object detection, instance segmentation, pose estimation, object counting) are crucial in industrial applications. Supervised learning methods are limited by high annotation costs, prompting the emergence of few-shot learning methods.
However, the core dilemma of existing few-shot methods is task fragmentation:
- Data discrepancy: Detection datasets mostly consist of scene images (multi-object), whereas pose estimation datasets mostly consist of cropped single-object images.
- Feature granularity discrepancy: Detection requires global semantic features, segmentation requires fine-grained semantic features, and pose estimation requires both semantic and localization features simultaneously.
- Output structure discrepancy: Detection outputs bounding box coordinates, segmentation outputs pixel-level masks, and pose estimation outputs Gaussian heatmaps.
Each task relies on separate methods (DeFRCN, DCFS, SAFECount, etc.), independent datasets, and distinct evaluation metrics, lacking a unified framework. The authors pursue an ultimate goal: solving multiple instance perception tasks with a single model using a minimal number of exemplars.
Method
Overall Architecture
UniFS consists of three fully shared components (no task-specific designs):
1. Feature Extractor: ResNet-101 backbone, extracting support and query image features.
2. Point Decoder: Transformer decoder, enhancing point features via self-attention and cross-attention.
3. Point Head: MLP predicting point coordinate offsets.
Core idea: Unifying the output space of all tasks into a point set representation. By providing different types of point annotations on the support image, the model automatically learns to predict corresponding points on the query image.
Key Designs
- Unified Point Representation:
  - Object Detection: Representing a bounding box by uniformly sampling 16 points along the box edges.
  - Instance Segmentation: Uniformly sampling 32 points on the instance mask contour (following Deep Snake, ensuring clockwise order starting from the leftmost point).
  - Pose Estimation: Each semantic keypoint naturally corresponds to a point, where the number and definition of keypoints can vary across different categories.
  - Object Counting: Predicting the center point (bounding box center) of each object; the count is the number of predicted center points.
  - This unified representation offers four major advantages: a task-agnostic architecture, fully shared parameters, generalizability to new tasks, and cross-task knowledge sharing.
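The point encodings above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' code: the function names (`sample_box_points`, `resample_contour`, `box_center`) are hypothetical, the box is assumed to be given as `(x1, y1, x2, y2)` in image coordinates, and the contour is assumed to be an already extracted closed polygon. The Deep Snake ordering convention (clockwise, starting at the leftmost point) is not implemented here.

```python
import math

def sample_box_points(x1, y1, x2, y2, num_points=16):
    """Sample points evenly along the box perimeter, starting at the
    top-left corner and walking top -> right -> bottom -> left edge."""
    w, h = x2 - x1, y2 - y1
    perimeter = 2 * (w + h)
    points = []
    for k in range(num_points):
        d = perimeter * k / num_points  # arc length travelled so far
        if d < w:
            points.append((x1 + d, y1))                    # top edge
        elif d < w + h:
            points.append((x2, y1 + (d - w)))              # right edge
        elif d < 2 * w + h:
            points.append((x2 - (d - w - h), y2))          # bottom edge
        else:
            points.append((x1, y2 - (d - 2 * w - h)))      # left edge
    return points

def resample_contour(contour, num_points=32):
    """Resample a closed polygon to num_points spaced evenly by arc length.
    Assumption: vertex order is kept as given (no leftmost-start rotation)."""
    n = len(contour)
    seg_lengths = [math.dist(contour[i], contour[(i + 1) % n]) for i in range(n)]
    step = sum(seg_lengths) / num_points
    points = []
    seg, seg_start = 0, 0.0  # current segment and the arc length where it begins
    for k in range(num_points):
        target = k * step
        while seg < n - 1 and seg_start + seg_lengths[seg] <= target:
            seg_start += seg_lengths[seg]
            seg += 1
        length = seg_lengths[seg]
        t = (target - seg_start) / length if length > 0 else 0.0
        (ax, ay), (bx, by) = contour[seg], contour[(seg + 1) % n]
        points.append((ax + t * (bx - ax), ay + t * (by - ay)))
    return points

def box_center(x1, y1, x2, y2):
    """Single point used for counting: the bounding-box center."""
    return ((x1 + x2) / 2, (y1 + y2) / 2)
```

For example, `sample_box_points(0, 0, 4, 4)` places one point per unit of perimeter, and `resample_contour` with `num_points=32` would give the contour encoding described for segmentation. Keypoints need no conversion, since each is already a point.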
- Point Decoder (Transformer Architecture):
  - \(L=2\)-layer Transformer decoder.
  - Self-Attention: Exchanges information among support point features to model the context of the point sequence (especially task-related information).
  - Cross-Attention: Support point features act as the attention queries, and the RoI features of the query image act as keys and values, bridging the representation gap between support and query.
  - Outputs enhanced point features \(\{\widehat{S_i}\}_{i \in [1,K]}\).
  - Finally, an MLP predicts the offsets relative to the RPN anchor center: \(P_{xi} = A_{cx} + \Delta x_i \times A_w\).
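A minimal sketch of the two decoder-side operations described above, written in plain Python for readability. It makes several assumptions that are not in the source: attention is single-head with no learned projections, residual connections, or normalization; the function names are hypothetical; and the y coordinate is assumed to be decoded analogously to x, as \(P_{yi} = A_{cy} + \Delta y_i \times A_h\).

```python
import math

def cross_attention(queries, keys, values):
    """Scaled dot-product attention (single head, no learned projections).
    queries: support point features; keys/values: query-image RoI features."""
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        m = max(scores)                              # subtract max for stability
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        weights = [e / z for e in exps]              # softmax over RoI locations
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

def decode_points(anchor, offsets):
    """Turn predicted offsets into image coordinates.
    anchor = (cx, cy, w, h); offsets = [(dx, dy), ...].
    x follows the formula in the text; y is an assumed analogue."""
    cx, cy, w, h = anchor
    return [(cx + dx * w, cy + dy * h) for dx, dy in offsets]

# Usage: one support point attends over two RoI locations, then a
# (hypothetical) predicted offset is decoded against an anchor.
enhanced = cross_attention([[1.0, 0.0]], [[0.0, 0.0], [0.0, 0.0]], [[2.0], [4.0]])
points = decode_points((50, 40, 20, 10), [(0.0, 0.0), (0.5, -0.5)])
```

Normalizing offsets by the anchor width and height keeps the regression target scale-invariant, which is the usual motivation for this parameterization in anchor-based detectors.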
- Structure-Aware Point Learning (SAPL):
  - Core problem: \(L_1\)/\(L_2\) losses only measure per-point error. Every predicted point on a diamond (\(L_1\)) or circle (\(L_2\)) centered at the ground truth receives the same loss, which is ambiguous with respect to shape.
  - Solution: Additionally supervise the angular relationships between a point and its neighboring points.
  - \(\theta_i^{(n)}\) denotes the angle at point \(i\) formed by the three points \(i-n\), \(i\), and \(i+n\).
  - SAPL loss (for point \(i\)): \(L_{SAPL} = \frac{1}{N} \sum_{n=1}^{N} L_1(\sin(\frac{\hat{\theta}_i^{(n)}}{2}), \sin(\frac{\theta_i^{(n)}}{2}))\)
  - Using the \(\sin(\theta/2)\) transformation: Amplifies gradients at sharp corners and dampens gradients in flat areas, capturing detailed shape information.
  - The best setting is \(N=2\) hops: 1-hop is sensitive to noise, while 3/4-hop is overly smooth.
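The SAPL formula can be illustrated with a small standard-library sketch. Several choices here are assumptions rather than details from the source: the point sequence is treated as closed (cyclic indexing, which suits box and contour points but not necessarily keypoints), the angle is taken in \([0, \pi]\), the per-point loss is averaged over all points, and the names `angle_at` and `sapl_loss` are hypothetical.

```python
import math

def angle_at(prev_pt, pt, next_pt):
    """Angle in [0, pi] at pt between the rays towards prev_pt and next_pt."""
    v1 = (prev_pt[0] - pt[0], prev_pt[1] - pt[1])
    v2 = (next_pt[0] - pt[0], next_pt[1] - pt[1])
    n1, n2 = math.hypot(*v1), math.hypot(*v2)
    if n1 == 0 or n2 == 0:
        return 0.0  # degenerate case; convention chosen for this sketch
    cos_t = (v1[0] * v2[0] + v1[1] * v2[1]) / (n1 * n2)
    return math.acos(max(-1.0, min(1.0, cos_t)))  # clamp for float safety

def sapl_loss(pred, gt, num_hops=2):
    """Mean over points i and hops n of |sin(theta_hat/2) - sin(theta/2)|."""
    k = len(pred)
    total = 0.0
    for i in range(k):
        for n in range(1, num_hops + 1):
            a_pred = angle_at(pred[(i - n) % k], pred[i], pred[(i + n) % k])
            a_gt = angle_at(gt[(i - n) % k], gt[i], gt[(i + n) % k])
            total += abs(math.sin(a_pred / 2) - math.sin(a_gt / 2))
    return total / (k * num_hops)

# Usage: a translated copy of the shape has zero SAPL loss (angles are
# unchanged), while collapsing one corner changes the angles and is penalized.
square = [(0, 0), (1, 0), (1, 1), (0, 1)]
shifted = [(x + 5, y + 5) for x, y in square]
dented = [(0, 0), (1, 0), (1, 1), (0.5, 0.5)]
loss_shifted = sapl_loss(shifted, square)
loss_dented = sapl_loss(dented, square)
```

Because the loss depends only on angles, it is insensitive to translation, which is why the total point loss pairs it with a coordinate term such as \(L_1\).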
Loss & Training
- Total Loss: \(L_{point} = |P_i - \hat{P}_i| + L_{SAPL}\), which is combined with the RPN and classification losses in the overall objective.
- Two-stage Transfer Learning:
  - Base class training: Joint training of detection, segmentation, and pose estimation on 60 base classes (the counting task is held out and never trained).
  - Novel class finetuning: Fine-tuning on \(K\) samples with the learning rate reduced to 0.01.
- Training Configuration: batch=32 (4/GPU \(\times\) 8 GPUs), lr=0.028, SGD, up to 55K iterations.
- Results are averaged over 10 seeds to ensure reliability.
Key Experimental Results
Main Results (COCO-UniFS Benchmark)
| Model | Type | Det. AP (1-shot) | Det. AP (5-shot) | Seg. AP (1-shot) | Seg. AP (5-shot) | Kpt. AP (1-shot) | Kpt. AP (5-shot) | Cnt. MSE (1-shot) ↓ | Cnt. MSE (5-shot) ↓ |
|---|---|---|---|---|---|---|---|---|---|
| FRCN-ft | Expert | 1.0 | 4.0 | - | - | - | - | - | - |
| TFA | Expert | 4.4 | 7.7 | - | - | - | - | - | - |
| FADI | Expert | 5.7 | 10.1 | - | - | - | - | - | - |
| DCFS | Expert | 8.1 | 16.4 | 7.2 | 13.5 | - | - | - | - |
| MPSR | Expert | 5.1 | 8.7 | - | - | - | - | 1.42† | 1.40† |
| Mask-RCNN-ft | Universal | 2.4 | 6.9 | 2.0 | 5.5 | 2.3 | 6.7 | 1.48† | 1.45† |
| UniFS | Universal | 12.7 | 18.2 | 8.6 | 11.5 | 12.2 | 22.1 | 1.38‡ | 1.32‡ |
†: Task seen during training | ‡: Zero-shot generalization to unseen tasks
Ablation Study
SAPL Effect (COCO-UniFS val):
| Loss Function | Det. AP(1/5-shot) | Seg. AP(1/5-shot) | Kpt. AP(1/5-shot) |
|---|---|---|---|
| L2 only | 10.9 / 16.1 | 6.1 / 7.7 | 9.1 / 19.5 |
| L1 only | 10.6 / 16.2 | 7.2 / 8.7 | 12.0 / 21.0 |
| L1 + 1-hop SAPL | 12.6 / 17.9 | 8.4 / 11.3 | 12.3 / 21.8 |
| L1 + 2-hop SAPL | 12.7 / 18.2 | 8.6 / 11.5 | 12.2 / 22.1 |
| L1 + 3-hop SAPL | 12.6 / 17.7 | 8.2 / 11.0 | 12.4 / 21.6 |
| L1 + 4-hop SAPL | 12.7 / 17.8 | 8.4 / 11.3 | 12.2 / 20.9 |
Multi-Task Learning Effect:
| Training Tasks | Det. AP(1/5) | Seg. AP(1/5) | Kpt. AP(1/5) |
|---|---|---|---|
| Det. only | 12.2 / 17.9 | - | - |
| Det.+Seg. | 12.6 / 17.6 | 8.5 / 11.2 | - |
| Det.+Seg.+Kpt. | 12.7 / 18.2 | 8.6 / 11.5 | 12.2 / 22.1 |
Key Findings
- Universal model comprehensively outperforms multi-task baselines: UniFS significantly outperforms Mask-RCNN-ft on all tasks (e.g., 1-shot detection 12.7 vs. 2.4).
- Competitive with expert models: UniFS outperforms all expert methods in detection and pose estimation. In segmentation it leads DCFS in the 1-shot setting (8.6 vs. 7.2) but trails it by 2.0 AP in the 5-shot setting (11.5 vs. 13.5).
- Strong generalization ability to unseen tasks: Even though never trained on the counting task, UniFS achieves a 1.38 MSE, which outperforms all baselines that observed the counting task during training (MPSR 1.42, FSDetView 1.42).
- More pronounced advantage in 1-shot scenarios: The advantage of UniFS is larger in extremely low-shot settings, indicating that the unified representation yields better priors.
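- Significant contribution of SAPL: Adding 2-hop SAPL to the L1 loss raises 1-shot segmentation AP from 7.2 to 8.6 (+19.4% relative) and 5-shot segmentation AP from 8.7 to 11.5, suggesting that structural constraints are particularly important for shape-sensitive tasks.
- L1 outperforms L2 on shape-sensitive tasks: L1 beats L2 in both segmentation (7.2 / 8.7 vs. 6.1 / 7.7) and pose estimation (12.0 / 21.0 vs. 9.1 / 19.5), likely because L1 is less sensitive to outliers. On detection the two are roughly on par (10.6 / 16.2 vs. 10.9 / 16.1).
- Synergistic multi-task benefits: Joint training on all three tasks gives the best detection and segmentation results (e.g., 5-shot detection AP of 18.2 vs. 17.9 for detection-only training). The only dip is in 5-shot detection when segmentation alone is added (17.9 to 17.6).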
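As a quick arithmetic check on the relative gains quoted from the SAPL ablation table, the following sketch recomputes them from the "L1 only" and "L1 + 2-hop SAPL" rows. The helper name `relative_gain` and the dictionary layout are illustrative only; the AP values are copied from the table.

```python
def relative_gain(before, after):
    """Relative improvement in percent."""
    return 100.0 * (after - before) / before

# (L1 only, L1 + 2-hop SAPL) AP values from the ablation table.
ablation = {
    "det_1shot": (10.6, 12.7),
    "det_5shot": (16.2, 18.2),
    "seg_1shot": (7.2, 8.6),
    "seg_5shot": (8.7, 11.5),
    "kpt_1shot": (12.0, 12.2),
    "kpt_5shot": (21.0, 22.1),
}

gains = {name: relative_gain(b, a) for name, (b, a) in ablation.items()}
for name, gain in gains.items():
    print(f"{name}: {gain:+.1f}%")
```

The `seg_1shot` entry reproduces the +19.4% figure, and the segmentation rows show the largest relative gains, in line with the claim that SAPL matters most for shape-sensitive tasks.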
Highlights & Insights
- Point representation as a natural unified language: bbox = points uniformly sampled along edges, mask = points uniformly sampled on the contour, keypoint = semantic points, counting = center point; this concise mapping is extremely elegant.
- Geometric intuition of SAPL: Traditional L1/L2 losses treat points as independent coordinate prediction tasks, ignoring the structural relationships among them. SAPL introduces the concept of "shape" via angular constraints.
- Zero task assumptions: The model is agnostic to the current task; it only knows "where the given support points are, and where the corresponding query points should be."
- Contribution of the COCO-UniFS benchmark: It unifies datasets, splits, and evaluation protocols across four tasks, providing a fair evaluation platform for future research.
Limitations & Future Work
- Potential quantization error in point sampling: Representing mask contours with a finite number of points introduces quantization errors, especially for complex shapes (32 points might be insufficient).
- Unimproved classification capabilities: Point representation focuses heavily on localization; the few-shot ability for image-level classification has not been specifically improved.
- Limited to 2D tasks: It has not been extended to 3D perception or temporal inputs (e.g., video tracking).
- Reliance on RPN: The architecture still relies on an RPN to generate region proposals, which is inherently not very few-shot friendly.
- Scalable future directions: Increasing the point count, introducing dynamic/variable point numbers, and supporting 3D or video tasks.
Related Work & Insights
- Works like Painter use dense maps to unify different tasks, but fail to handle instance-level tasks.
- Works like Pix2Seq use textual sequence representations but suffer from slow inference speeds.
- CenterNet, RepPoints, etc., have already utilized point representations for detection/segmentation, but are limited to single-task training and evaluation.
- UniFS generalizes point representations to a few-shot, multi-task unified framework for the first time, supporting a "less is more" design philosophy.
- The contour point representation of Deep Snake provides the technical foundation for converting masks into points.
Rating
- Novelty: ★★★★★ - First to unify four instance perception tasks into a few-shot point learning paradigm, with an ingeniously designed SAPL loss.
- Experimental Thoroughness: ★★★★ - COCO-UniFS benchmark is comprehensive, but evaluated only on ResNet-101 without verifying with stronger backbones.
- Writing Quality: ★★★★★ - Clear problem definition, intuitive methodology, and elegant geometric explanation of SAPL.
- Value: ★★★★★ - Pioneering work; unifying problem definitions and benchmarks holds significant value for the development of the field.