Few-Shot Precise Event Spotting via Unified Multi-Entity Graph and Distillation¶
Conference: AAAI 2026 arXiv: 2511.14186 Code: github.com/LZYAndy/UMEG-Net Area: Multimodal VLM Keywords: Precise Event Spotting, Few-Shot Learning, Unified Multi-Entity Graph, Knowledge Distillation, Sports Video Analysis
TL;DR¶
This paper proposes UMEG-Net for few-shot Precise Event Spotting (PES). The method constructs a unified multi-entity graph integrating human skeletal keypoints, sports object keypoints, and environmental landmarks, combined with efficient spatiotemporal graph convolution and a parameter-free multi-scale temporal shift module. A multimodal knowledge distillation scheme transfers graph features to an RGB student network. The approach significantly outperforms existing methods across five sports datasets under extremely limited annotation budgets.
Background & Motivation¶
Precise Event Spotting (PES)¶
Precise Event Spotting aims to identify fine-grained events and their exact timestamps from untrimmed long videos, with extremely tight tolerance windows (1–2 frames). Typical scenarios include racket-impact moments in racket sports and takeoff/landing in gymnastics.
Three major challenges in PES:
Rapid successive events: Multiple events occur at very short intervals in sports videos.
Motion blur: High-speed motion degrades visual features.
Subtle visual differences: Low visual discriminability across different event types.
Limitations of Prior Work¶
End-to-end RGB methods (E2E-Spot, T-DEED, F3ED): Rely on large-scale frame-level annotated datasets; performance degrades sharply under few-shot conditions.
Skeleton-based methods (STGCN++, BlockGCN, etc.): Use only human pose, ignoring critical information such as the ball and court.
Conventional few-shot methods: Designed for coarse-grained action recognition; cannot satisfy the frame-level precision required by PES.
Root Cause¶
Event occurrence in sports involves interactions among multiple entities — a player hitting a ball requires joint modeling of the human body, ball, and court. Constructing a unified multi-entity graph to represent these interactions is key to improving few-shot PES performance. Additionally, the unreliability of keypoint detection necessitates multimodal distillation to enhance robustness.
Method¶
Overall Architecture¶
- Keypoint extraction: HRNet (human pose) + YOLOv8 (ball/player detection) + dedicated methods (court corner points)
- Unified multi-entity graph construction: All keypoints organized into a unified graph structure
- Stacked UMEG Blocks: Spatial GCN + multi-scale temporal shifting
- Event spotting and classification: Linear layer outputs frame-level event probabilities
- Multimodal distillation: UMEG-Net teacher → RGB student network
Key Designs¶
1. Unified Multi-Entity Graph Construction¶
Node set \(\mathcal{V}_t = \{V_p^t, V_b^t, V_c^t\}\) (player joints + ball + court corner points); the edge set comprises four types: \(\mathcal{E}_t = \mathcal{E}_t^{\mathrm{intra}} \cup \mathcal{E}_t^{p\text{-}b} \cup \mathcal{E}_t^{p\text{-}c} \cup \mathcal{E}_t^{c\text{-}c}\)
- Intra-skeleton connections (standard human joint topology)
- Person–ball connections (racket sports: wrist → ball; soccer: ankle/shoulder → ball)
- Person–court connections (foot joints → court corner points)
- Court corner interconnections (forming a rectangle)
Design Motivation: Different sports employ different person–ball connection schemes, making full use of sport-specific domain knowledge. The ball and court information that conventional skeleton graphs discard is critical for determining when an event occurs.
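The four edge types above can be sketched as a single adjacency matrix. This is a minimal illustration, not the paper's implementation: it assumes a single player with COCO-17 joint indices (wrists at 9/10, ankles at 15/16), one ball node, and four court-corner nodes, ordered [joints | ball | corners].

```python
import numpy as np

def build_umeg_adjacency(num_joints=17, num_ball=1, num_court=4,
                         wrist_idx=(9, 10), foot_idx=(15, 16)):
    """Sketch: adjacency of a single-player unified multi-entity graph.

    Node order is [player joints | ball | court corners]; joint indices
    follow COCO-17 (an assumption, the paper may use another layout).
    """
    n = num_joints + num_ball + num_court
    A = np.zeros((n, n))

    # (1) intra-skeleton edges: standard COCO-17 limb topology
    skeleton = [(0, 1), (0, 2), (1, 3), (2, 4), (5, 6), (5, 7), (7, 9),
                (6, 8), (8, 10), (5, 11), (6, 12), (11, 12), (11, 13),
                (13, 15), (12, 14), (14, 16)]
    for i, j in skeleton:
        A[i, j] = A[j, i] = 1

    ball = num_joints                       # index of the single ball node
    court = list(range(num_joints + num_ball, n))

    # (2) person-ball edges (racket sports: wrists -> ball)
    for w in wrist_idx:
        A[w, ball] = A[ball, w] = 1

    # (3) person-court edges (feet -> court corners)
    for f in foot_idx:
        for c in court:
            A[f, c] = A[c, f] = 1

    # (4) court-court edges: the four corners form a rectangle
    for a, b in zip(court, court[1:] + court[:1]):
        A[a, b] = A[b, a] = 1

    # self-loops, as is standard before GCN normalization
    A += np.eye(n)
    return A
```

For multi-person sports, the same pattern would be tiled per player, with the ball and court nodes shared across players.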
2. UMEG Block: Spatial GCN + Multi-Scale Temporal Shifting¶
Spatial GCN: Graph convolution is performed over the entire multi-entity graph (rather than independently per person), jointly modeling person–person and person–entity interactions: \(\mathcal{H}^{(\ell+1)} = \text{ReLU}(A^{(\ell)} \mathcal{H}^{(\ell)} W^{(\ell)})\)
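The update rule is a standard GCN layer applied to the whole multi-entity graph. A minimal numpy sketch, assuming symmetric degree normalization of the adjacency (a common choice; the paper's exact normalization is not specified here):

```python
import numpy as np

def spatial_gcn_layer(H, A, W):
    """One spatial GCN layer: H' = ReLU(A_hat @ H @ W).

    H: (V, C_in) node features over the full multi-entity graph.
    A: (V, V) adjacency with self-loops.
    W: (C_in, C_out) learnable weights.
    A_hat is the symmetrically degree-normalized adjacency (assumption).
    """
    D = np.diag(1.0 / np.sqrt(A.sum(axis=1)))
    A_hat = D @ A @ D
    return np.maximum(A_hat @ H @ W, 0.0)
```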
Multi-scale temporal shift module (parameter-free): Replaces temporal convolution with parameter-free temporal shifts, substantially reducing trainable parameters:
- Features are split along the channel dimension into static, forward-shifted, and backward-shifted parts, with each shifted part occupying a fraction \(\alpha = 1/8\) of the channels
- Bidirectional shifts are applied for \(\Delta \in \{1, 2, 4\}\)
- Each shifted stream is updated via the spatial GCN and then fused across scales
Design Motivation: \(\Delta \in \{1, 2, 4\}\) simultaneously captures short-, medium-, and long-range temporal dependencies. Temporal convolution is prone to overfitting under few-shot conditions; temporal shifting is a zero-parameter-cost alternative.
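The shift operation itself can be sketched in a few lines. This follows the TSM convention (zero-fill at sequence boundaries); the exact per-scale fusion in UMEG-Net is an assumption, so only the shifted copies are returned here.

```python
import numpy as np

def multiscale_temporal_shift(X, shifts=(1, 2, 4), alpha=1 / 8):
    """Parameter-free multi-scale temporal shift (sketch).

    X: (T, V, C) features over T frames, V graph nodes, C channels.
    For each shift offset d, a fraction alpha of the channels is shifted
    forward in time, another fraction alpha backward, and the remaining
    channels stay static; out-of-range frames are zero-filled as in TSM.
    Returns one shifted copy per scale; each copy is then updated by the
    spatial GCN and fused across scales (fusion details assumed).
    """
    T, V, C = X.shape
    k = int(C * alpha)
    outs = []
    for d in shifts:
        Y = X.copy()
        # forward shift: frame t receives channel slice [0:k] from t - d
        Y[d:, :, :k] = X[:-d, :, :k]
        Y[:d, :, :k] = 0.0
        # backward shift: frame t receives slice [k:2k] from t + d
        Y[:-d, :, k:2 * k] = X[d:, :, k:2 * k]
        Y[-d:, :, k:2 * k] = 0.0
        outs.append(Y)
    return outs
```

Because the shifts contain no learnable weights, the temporal receptive field grows with no parameter cost, which is exactly what the few-shot setting needs.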
3. Multimodal Knowledge Distillation¶
- Teacher (frozen): Trained UMEG-Net
- Student: VideoMAEv2 feature extractor + BiGRU
- Distillation loss (computed on unlabeled data): \(\mathcal{L}_{feat} = \frac{1}{T}\sum_t \|\mathbf{F}_{tch}^{(t)} - \mathbf{F}_{stu}^{(t)}\|_2^2\)
- At inference, only the student is used; no keypoint detection is required
Design Motivation: Distillation over large amounts of unlabeled video enables the RGB student to acquire visual representations complementary to the graph model, while eliminating dependence on pose estimation.
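The feature-matching loss above is a per-frame MSE between teacher and student features. A direct transcription of the formula, assuming both streams have already been projected to a common feature dimension:

```python
import numpy as np

def feature_distillation_loss(F_tch, F_stu):
    """L_feat = (1/T) * sum_t ||F_tch(t) - F_stu(t)||_2^2.

    F_tch: (T, D) frozen UMEG-Net teacher features.
    F_stu: (T, D) RGB student features (same dimension D assumed).
    """
    T = F_tch.shape[0]
    diff = F_tch - F_stu
    return float(np.sum(diff ** 2) / T)
```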
Loss & Training¶
- Joint training of event-type classification and event localization
- Foreground class loss weight increased by 5× (event frames account for <3%; severe class imbalance)
- AdamW optimizer with cosine annealing
- UMEG-Net: 50 epochs, lr = 0.001; distillation: 50 epochs, lr = 0.0001
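The 5× foreground weighting can be sketched as a weighted frame-level cross-entropy. This is an illustration, not the paper's code: it assumes class 0 denotes background frames and all event classes share the same weight.

```python
import numpy as np

def weighted_frame_ce(logits, labels, fg_weight=5.0):
    """Frame-level cross-entropy with foreground (event) classes
    upweighted to counter class imbalance (event frames are <3%).

    logits: (T, K) per-frame class scores; labels: (T,) int class ids,
    with class 0 = background (a labeling convention assumed here).
    """
    z = logits - logits.max(axis=1, keepdims=True)      # stable softmax
    logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    w = np.where(labels == 0, 1.0, fg_weight)           # 5x on events
    nll = -logp[np.arange(len(labels)), labels]
    return float((w * nll).sum() / w.sum())
```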
Key Experimental Results¶
Main Results¶
F1/Edit score comparison under the 100-clip few-shot setting:
| Method | F3Set F1 | ShuttleSet F1 | FineGym F1 | FigureSkating F1 | SoccerNet F1 |
|---|---|---|---|---|---|
| E2E-Spot_800MF | 13.3 | 54.6 | 53.1 | 42.7 | 43.1 |
| F3ED | 15.3 | 55.1 | 52.1 | 34.4 | 34.5 |
| BlockGCN | 18.3 | 59.4 | 49.1 | 48.2 | 43.3 |
| UMEG-Net | 31.7 | 64.0 | 54.4 | 49.6 | 44.8 |
| UMEG-Net_distill | 40.7 | 69.0 | 61.2 | 56.2 | 50.8 |
UMEG-Net achieves consistent improvements over all baselines on all five datasets. UMEG-Net_distill yields average gains of +5.8% F1 and +6.7% Edit.
Ablation Study¶
Effect of graph entity composition:
| Graph Configuration | F3Set F1 | F3Set Edit | ShuttleSet F1 |
|---|---|---|---|
| pose×N | 23.9 | 47.4 | 61.5 |
| pose×N + court | 26.1 | 46.7 | 61.5 |
| pose×N + ball | 30.2 | 48.1 | 62.5 |
| pose×N + ball + court | 31.7 | 49.2 | 64.0 |
Temporal module configuration:
| \(\Delta\) Configuration | FineGym F1 | FigureSkating F1 |
|---|---|---|
| {1} | 50.3 | 36.8 |
| {1, 2} | 49.8 | 45.3 |
| {1, 2, 4} | 54.4 | 49.6 |
Full-supervision comparison: UMEG-Net remains competitive in the fully supervised setting, surpassing E2E-Spot on 3 out of 5 datasets.
Key Findings¶
- Ball information contributes most (F1 +6.3), followed by court information (+2.2).
- UMEG-Net has only 2.2M parameters — the fewest among all methods while achieving the best performance.
- Distillation substantially outperforms self-supervised contrastive pretraining (F3Set F1: 40.7 vs. 29.1).
- The k-clip setting is more practically reasonable than conventional k-shot.
Highlights & Insights¶
- Precise problem formulation: Few-shot PES is a real and important problem, as frame-level annotation is extremely costly.
- k-clip is more reasonable than k-shot: Sports events are very short, densely occurring, and multi-class; k-clip better reflects practical annotation scenarios.
- Parameter-free temporal module: Zero-parameter-cost replacement for multi-scale temporal convolution, reducing overfitting in few-shot settings.
- Distillation leverages unlabeled data: An elegant use of in-domain data.
Limitations & Future Work¶
- Dependence on keypoint detection quality (though the distilled version does not require it).
- Person–ball connection designs require manual specification, lacking generalizability.
- Relatively smaller gains in multi-person scenarios (SoccerNet).
- UMEG-Net_distill employs VideoMAEv2 (67.8M parameters), far larger than the teacher (2.2M).
Related Work & Insights¶
- E2E-Spot / T-DEED / F3ED: Representative RGB end-to-end PES methods.
- BlockGCN / STGCN++: Skeleton-based action recognition methods.
- TSM: Source of inspiration for the temporal shift module.
- Hong et al.: Pioneer work on pose-to-RGB distillation in figure skating.
Rating¶
- Novelty: ⭐⭐⭐⭐ — The unified multi-entity graph and parameter-free temporal shifting represent solid contributions.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Five datasets, comprehensive ablations, and multiple k-clip settings.
- Writing Quality: ⭐⭐⭐⭐ — Problem formulation is clear and analysis is thorough.
- Value: ⭐⭐⭐⭐⭐ — The few-shot setting directly addresses practical annotation cost challenges.