Let's Split Up: Zero-Shot Classifier Edits for Fine-Grained Video Understanding¶
Conference: ICLR 2026 arXiv: 2602.16545 Code: Available Area: Video Understanding Keywords: Category Splitting, Zero-Shot Editing, Fine-Grained Video Recognition, Classifier Modification, Compositional Structure
TL;DR¶
This paper introduces a new task called Category Splitting, which exploits latent compositional structure embedded in video classifier weights to decompose coarse-grained action categories into fine-grained subcategories under zero-shot conditions, without retraining or additional data.
Background & Motivation¶
Video recognition models are typically trained on fixed taxonomies that tend to be overly coarse. For example, a single label "open" may encompass vastly different scenarios such as "open cupboard," "open by pushing," "open quickly," and "open halfway." As application requirements evolve, finer-grained distinctions become necessary.
Existing solutions suffer from three categories of limitations:

- Re-annotation + retraining: Expensive, requiring large-scale labeled data and full training cycles.
- Vision-language models (VLMs): Rely on massive video-text corpora; domain-specific data is scarce, and fine-grained temporal cues are difficult to capture.
- Continual learning: Requires training data for each new category and focuses on entirely new categories rather than refinement of existing ones.
Core insight: Modern video backbones already encode rich latent structure in their feature spaces, which can be decomposed to distinguish fine-grained variations even without direct supervision signals.
Method¶
Overall Architecture¶
The core of the approach is to edit only the classification head while keeping the backbone frozen, splitting a coarse category \(c\) into multiple fine-grained subcategories \(\mathcal{S}^c = \{s_1^c, s_2^c, \dots, s_k^c\}\). The updated label space becomes \(\mathcal{Y}' = (\mathcal{Y} \setminus \{c\}) \cup \mathcal{S}^c\).
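Concretely, the edit is row-level surgery on the classification head. A minimal sketch assuming a standard linear head (tensor names are hypothetical):

```python
import torch

def split_category(head_weight: torch.Tensor, c: int, new_rows: torch.Tensor) -> torch.Tensor:
    """Replace the weight row of coarse category c with k fine-grained rows.

    head_weight: (num_classes, feat_dim) classifier weight matrix
    new_rows:    (k, feat_dim) weight vectors for the subcategories s_1^c..s_k^c
    Returns a (num_classes - 1 + k, feat_dim) matrix over the label space Y'.
    """
    keep = torch.cat([head_weight[:c], head_weight[c + 1:]], dim=0)  # Y \ {c}
    return torch.cat([keep, new_rows], dim=0)                        # append S^c
```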
The editing method \(E\) must satisfy two properties (see the sketch after the list):

- Generality: The edit should correctly classify previously unseen samples of the new subcategories.
- Locality: The edit should leave predictions for all other categories unaffected.
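In code, the two properties reduce to two accuracy-style metrics, roughly as follows; this is a sketch of one plausible protocol, and the paper's exact evaluation may differ:

```python
import torch

def generality(logits_new: torch.Tensor, labels_new: torch.Tensor) -> float:
    """Accuracy of the edited model on unseen samples of the new subcategories."""
    return (logits_new.argmax(dim=1) == labels_new).float().mean().item()

def locality(preds_before: torch.Tensor, preds_after: torch.Tensor) -> float:
    """Fraction of other-category samples whose predicted label is unchanged by the edit."""
    return (preds_before == preds_after).float().mean().item()
```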
Key Designs¶
1. Modifier Retrieval¶
Each fine-grained subcategory is treated as a composition of a "coarse concept + modifier." For example, "pushing left to right" = "pushing" + "left to right." The key steps are:
Modifier dictionary construction: Fine-grained categories already present in the classifier are grouped to identify "pseudo-coarse categories" \(\tilde{c}\) that share a common base concept. The pseudo-coarse weight vector is approximated as the mean of the grouped subcategory weights: \(v_{\tilde{c}} = \frac{1}{|\mathcal{S}^{\tilde{c}}|} \sum_{y \in \mathcal{S}^{\tilde{c}}} w_y\)
The modifier vector is obtained by subtracting the shared base: \(v_m = w_y - v_{\tilde{c}}\)
Modifier vector transfer: For a target subcategory, its modifier text is embedded with a text encoder \(\phi\), and cosine similarity against every dictionary entry is computed, combining both modifier-level similarity and full-label similarity; the best-matching modifier vector \(v_m^*\) is retrieved.
The weight vector for the new subcategory is then: \(w_{s_j^c} = w_c + v_m^*\)
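A minimal PyTorch sketch of both steps; `phi`, the grouping structure, and all tensor names are hypothetical, and only the modifier-similarity term is shown (the paper additionally mixes in full-label similarity):

```python
import torch
import torch.nn.functional as F

def build_modifier_dict(groups):
    """groups: {pseudo-coarse name: [(modifier text, weight vector w_y), ...]}.

    Returns {modifier text: v_m} with v_m = w_y - v_c~, where v_c~ is the
    mean of the group's subcategory weights.
    """
    mod_dict = {}
    for members in groups.values():
        v_coarse = torch.stack([w for _, w in members]).mean(dim=0)  # v_c~
        for t_m, w_y in members:
            mod_dict[t_m] = w_y - v_coarse                           # modifier vector
    return mod_dict

def retrieve_new_weight(w_c, target_mod_text, mod_dict, phi):
    """Transfer the closest modifier (by text-embedding cosine similarity) onto w_c."""
    q = phi(target_mod_text)
    best = max(mod_dict, key=lambda t: F.cosine_similarity(q, phi(t), dim=0).item())
    return w_c + mod_dict[best]  # w_{s_j^c} = w_c + v_m*
```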
2. Modifier Alignment¶
To handle novel modifiers absent from the dictionary, a lightweight alignment module \(g_\psi: \mathbb{R}^n \to \mathbb{R}^m\) is trained to directly map text embeddings into the classifier weight space.
Training data are drawn from two sources:

- Modifier-level pairs: \(\mathcal{D}_{mod} = \{(\phi(t_m), v_m)\}\)
- Category-level pairs: \(\mathcal{D}_{cat} = \{(\phi(t_y), w_y)\} \cup \{(\phi(t_{\tilde{c}}), v_{\tilde{c}})\}\)
The module is a single-hidden-layer MLP (hidden width 384) trained with an MSE loss. Only \(\psi\) is updated; the classifier and text encoder remain frozen. Since no video data is required, the entire process remains zero-shot.
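A sketch of the module and its training loop in PyTorch; the GELU activation, loop structure, and epoch count are assumptions, while the dimensions, optimizer, and schedule follow the text:

```python
import torch
import torch.nn as nn

class ModifierAligner(nn.Module):
    """g_psi: maps n-dim text embeddings into the m-dim classifier weight space."""
    def __init__(self, n: int, m: int, hidden: int = 384):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n, hidden), nn.GELU(), nn.Linear(hidden, m))

    def forward(self, text_emb):
        return self.net(text_emb)

def train_aligner(model, loader, epochs=100):
    """MSE regression on (phi(t), target vector) pairs drawn from D_mod and D_cat."""
    opt = torch.optim.AdamW(model.parameters(), lr=1e-3)  # settings from the paper
    sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=epochs)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        for text_emb, target in loader:  # batch size 10 per the paper
            opt.zero_grad()
            loss_fn(model(text_emb), target).backward()
            opt.step()
        sched.step()
```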
3. Low-Shot Category Splitting¶
When a small number of annotated samples is available (as few as one video per subcategory), an isolated fine-tuning strategy is employed: only the newly added subcategory weights \(\theta_{head}'\) are fine-tuned, while the backbone and original classification head are frozen. Initializing from zero-shot method weights outperforms initialization from coarse-category weights.
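A hedged sketch of the isolated fine-tuning setup (names hypothetical, assuming the k new rows are appended last): gradients reach only the appended subcategory rows, while the backbone and original rows stay frozen.

```python
import torch

def isolated_finetune_setup(edited_head: torch.Tensor, k: int):
    """Split the edited head: original rows frozen, k new subcategory rows trainable."""
    frozen = edited_head[:-k].detach()                                  # original categories
    new_rows = edited_head[-k:].detach().clone().requires_grad_(True)  # theta'_head
    opt = torch.optim.AdamW([new_rows], lr=1e-3, weight_decay=1e-3)    # per the paper
    return frozen, new_rows, opt

def head_logits(feats, frozen, new_rows):
    # Logits cover the full label space Y', but gradients flow only into new_rows.
    return feats @ torch.cat([frozen, new_rows], dim=0).T
```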
Loss & Training¶
- Zero-shot stage requires no training data.
- Alignment module: MSE loss, AdamW optimizer, learning rate \(1 \times 10^{-3}\), cosine annealing, batch size 10.
- Low-shot fine-tuning: cross-entropy loss, AdamW, learning rate \(1 \times 10^{-3}\), weight decay \(1 \times 10^{-3}\), batch size 16.
- EMA early stopping (\(\beta=0.95\), patience=5, \(\delta=1\times10^{-3}\)).
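The early-stopping rule admits a straightforward reading; this is a sketch, and the exact smoothing target (training vs. validation loss) is an assumption:

```python
class EMAEarlyStopper:
    """Stop when the EMA-smoothed loss fails to improve by delta for `patience` checks."""
    def __init__(self, beta: float = 0.95, patience: int = 5, delta: float = 1e-3):
        self.beta, self.patience, self.delta = beta, patience, delta
        self.ema, self.best, self.bad = None, float("inf"), 0

    def step(self, loss: float) -> bool:
        # Exponential moving average of the observed loss.
        self.ema = loss if self.ema is None else self.beta * self.ema + (1 - self.beta) * loss
        if self.ema < self.best - self.delta:
            self.best, self.bad = self.ema, 0   # meaningful improvement: reset patience
        else:
            self.bad += 1
        return self.bad >= self.patience        # True -> stop training
```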
Key Experimental Results¶
Main Results¶
Setup: ViT-Small backbone with MVD pretraining, CLIP ViT-L/14 text encoder. Benchmarks: SSv2-Split (54 coarse categories) and FineGym-Split (42 coarse categories).
| Method | SSv2-A Gen. | SSv2-A Loc. | FineGym-A Gen. | FineGym-A Loc. |
|---|---|---|---|---|
| CLIP | 27.6 | 100.0 | 12.1 | 100.0 |
| FG-CLIP | 30.9 | 100.0 | 19.4 | 100.0 |
| VideoPrism | 28.2 | 100.0 | 21.7 | 100.0 |
| Ours | 46.3 | 98.9 | 34.2 | 97.8 |
VLM baselines exhibit low generality, while the proposed method improves SSv2 generality by more than 15 percentage points over the strongest VLM (FG-CLIP, 30.9 → 46.3) at a negligible cost in locality.
Results under the low-shot setting, comparing fine-tuning strategies and initializations:

| Fine-tuning Strategy | Initialization | Generality | Locality | Mean |
|---|---|---|---|---|
| Full model one-shot | Coarse category | 33.6 | 0.0 | 16.8 |
| \(\theta_{head}'\) only one-shot | Coarse category | 48.4 | 98.4 | 73.4 |
| \(\theta_{head}'\) only one-shot | Modifier alignment | 52.8 | 98.2 | 75.5 |
| Full-data fine-tuning | Coarse category | 86.7 | 19.2 | 52.9 |
Ablation Study¶
- Modifier Retrieval (45.0%) vs. Modifier Alignment (46.3%): alignment improves generality by 1.3 percentage points.
- Zero-shot initialization vs. random initialization: +7.8 percentage points of generality.
- Pretraining impact: MVD (46.3%) > SIGMA (44.1%) > VideoMAE (42.9%) > training from scratch (37.0%).
- Text encoder: CLIP (46.3%) ≈ VideoPrism (46.5%) > RoBERTa (40.9%).
Key Findings¶
- Direction-based splits achieve the best performance; splits involving object count, intent, or success are the most challenging.
- Full-data fine-tuning performs worse than one-shot fine-tuning (mean 52.9 vs. 75.5), as it introduces strong bias toward the new categories and severely degrades locality.
- Performance is better when the original label space contains analogous categories sharing the same modifier, though the method remains effective even without them.
Highlights & Insights¶
- Novel task formulation: Category splitting addresses a natural yet overlooked real-world problem.
- Intrinsic structure outperforms VLMs: Exploiting the weight structure of video classifiers substantially surpasses VLMs for fine-grained video understanding.
- Minimal yet effective: Only the classification head is edited, no backbone update is required, and zero-shot operation is supported.
- Counter-intuitive finding: Full-data fine-tuning underperforms one-shot fine-tuning; isolated fine-tuning with zero-shot initialization is the optimal strategy.
Limitations & Future Work¶
- Relies on text labels to identify modifiers, making it difficult to handle purely visual distinctions (e.g., action speed).
- Zero-shot generality still has room for improvement (46.3% vs. the 86.7% reachable with full supervision).
- Validation is limited to classification tasks; downstream tasks such as detection and segmentation remain unexplored.
- Requires the original classifier to already contain some fine-grained categories for modifier dictionary construction.
Related Work & Insights¶
- Directly related to the concept of model editing in NLP, borrowing the generality/locality evaluation framework.
- Naturally connected to compositional action recognition.
- The transferability assumption of modifiers warrants validation in broader domains.
- Extensible to classification refinement scenarios in other visual tasks.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ — Both the task formulation and zero-shot editing approach are pioneering.
- Technical Depth: ⭐⭐⭐⭐ — The method is elegant and concise, with moderate technical complexity.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — A dedicated benchmark is constructed with comprehensive ablations.
- Value: ⭐⭐⭐⭐ — Low-cost classifier updating has practical application prospects.