Training-Only Heterogeneous Image-Patch-Text Graph Supervision for Advancing Few-Shot Learning Adapters
Conference: CVPR 2026
Paper: CVF Open Access
Code: https://github.com/MR-Sherif/TOGA.git
Area: Multimodal VLM
Keywords: Few-shot learning, CLIP adapter, Heterogeneous graph, Cross-modal distillation, Training-time supervision
TL;DR
TOGA attaches an "image-patch-text" heterogeneous graph teacher during the training phase for fine-grained cross-modal reasoning. These relational insights are distilled into the key-value cache of a Tip-Adapter student. During inference, the entire graph teacher is discarded, keeping the inference path identical to Tip-Adapter (zero extra latency or VRAM). It achieves new SOTA results on 11 benchmarks across 1–16 shot settings.
Background & Motivation
Background: Adapting large vision-language models like CLIP to few-shot downstream tasks primarily relies on Parameter-Efficient Fine-Tuning (PEFT). The Tip-Adapter family is particularly favored because it caches global features of a few support samples as a key-value cache. During testing, it performs prototype matching with minimal parameters and extremely fast inference.
Limitations of Prior Work: Existing lightweight adapters operate solely on "global unimodal feature vectors." CLIP spatially averages an image into a global descriptor, which blurs fine-grained local cues (e.g., beak shape or feather patterns) critical for distinguishing "Yellow-bellied Sapsucker" from "Grey Sapsucker." Consequently, global feature adapters are inherently disadvantaged in Fine-Grained Visual Categorization (FGVC).
Key Challenge: Current PEFT methods are stuck in a trade-off. Global adapters such as Tip-Adapter and CLIP-Adapter are fast but coarse: they add zero inference overhead but are limited to global vectors. Patch-level adapters such as GraphAdapter and VPT are strong but slow: they reason on patch tokens but permanently carry the computational burden of GNNs or extra tokens at test time, and often ignore text semantics. An ideal few-shot adapter must simultaneously achieve two conflicting goals: ① sufficient expressive power to reason over fine-grained patch evidence and its alignment with category text; ② preservation of the zero-overhead inference of lightweight baselines.
Goal: To provide an adapter with "patch-level + cross-modal" relational reasoning capabilities without changing the inference path at test time (latency, VRAM, and parameters remain identical to Tip-Adapter).
Key Insight: The authors observe that relational reasoning only needs to exist during the training phase to "teach" the adapter how to encode fine-grained knowledge. Since only the adapter and its key-value cache participate in testing, cross-modal relational reasoning can be injected during training and directionally distilled into the components that remain at deployment (the cache).
Core Idea: Utilize a high-capacity heterogeneous graph teacher that exists only during training to distill fine-grained cross-modal relational knowledge directly into Tip-Adapter's key-value cache (the student). The teacher is discarded post-training: "Train with graphs, test with Tip-Adapter."
Method
Overall Architecture
TOGA (Training-Only Graph Adapter) is an asymmetric distillation framework. During training, it runs an ensemble of three branches: ① a frozen Zero-Shot CLIP branch, producing \(L_{ZS}\); ② a lightweight student, the key-value cache adapter \(A\) of Tip-Adapter-F, producing \(L_{Cache}\); and ③ a powerful, training-only heterogeneous graph teacher, producing \(L_{Graph}\).
Within the teacher branch: Multi-scale patches are cropped from the input image and passed through a frozen CLIP encoder alongside category text prompts to obtain node features → features pass through unimodal Transformers for "intra-modal" context enhancement → patch and text nodes are combined into a heterogeneous graph, where a Modality-aware Graph Transformer (MGT) performs type-sensitive cross-modal message passing → discriminative Top-N node filtering selects the most salient patches, which are aggregated into a teacher visual feature to compute teacher logits. Finally, a "cache-aware dual-objective" transfers this relational knowledge into the student adapter \(A\). During testing, the entire teacher branch is discarded, and prediction reverts to \(L_{test}=L_{ZS}+\alpha\cdot L_{Cache}\), identical to the original Tip-Adapter-F.
%%{init: {'flowchart': {'rankSpacing': 24, 'nodeSpacing': 28, 'padding': 6, 'wrappingWidth': 400}}}%%
flowchart TD
A["Input Image + Category Prompts"] --> B["Multi-scale Patches + Text Nodes<br/>(Frozen CLIP Encoding)"]
B --> C["Unimodal Transformer Enhancement<br/>(patch↔patch / prompt↔prompt)"]
C --> D["Modality-aware Graph Transformer (MGT)<br/>Cross-modal Type-sensitive Messaging"]
D --> E["Discriminative Top-N Node Filtering<br/>Select Discriminative Patches"]
E --> F["Teacher Logits L_Graph"]
F --> G["Cache-aware Dual-objective Distillation<br/>Knowledge Injection into Student Cache A"]
G -->|Discard Teacher at Test Time| H["Inference = Tip-Adapter<br/>L_ZS + α·L_Cache Zero Extra Overhead"]
Key Designs
1. Asymmetric "Training Teacher / Permanent Student" Distillation: Keeping reasoning costs on the training side
Patch-level cross-modal reasoning is powerful, but deploying GNNs is undesirable. TOGA resolves this via structural asymmetry: the permanent student is the Tip-Adapter-F key-value cache (a lightweight adapter \(A\) that maps query features \(z\) for cosine prototype matching, \(L_{Cache}(x)=\exp(-\beta(1-s))^{\top}V\) where \(s_j=\cos(A(z),K_j)\)). During training, a high-capacity graph teacher is attached. Classification uses mixed logits \(L_{train}(x)=L_{ZS}(x)+\alpha\cdot L_{Cache}(x)+\delta\cdot L_{Graph}(x)\), and gradients update both student and teacher. Unlike standard KD, the teacher is not pre-trained offline; it is trained online and asymmetrically with the student, with supervision targets applied directly to the cache's keys/values—upgrading the "prototype memory." Since \(L_{Graph}\) vanishes at test time, the expressive power of graph reasoning is "compiled" into the student without carrying the overhead to deployment.
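The logit mixing described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the authors' implementation: the function names (`cache_logits`, `mix_logits`), the toy keys and values, and the numeric settings for \(\alpha\), \(\beta\), and \(\delta\) are assumptions, and `query` stands for the already-adapted feature \(A(z)\).

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def cache_logits(query, keys, values, beta):
    """L_Cache = exp(-beta * (1 - s))^T V, with s_j = cos(query, K_j)."""
    affinities = [math.exp(-beta * (1.0 - cosine(query, k))) for k in keys]
    num_classes = len(values[0])
    return [sum(a * v[c] for a, v in zip(affinities, values)) for c in range(num_classes)]

def mix_logits(zero_shot, cache, alpha, graph=None, delta=0.0):
    """Training: L_ZS + alpha * L_Cache + delta * L_Graph.
    Test time: call with graph=None, which drops the teacher term."""
    mixed = [z + alpha * c for z, c in zip(zero_shot, cache)]
    if graph is not None:
        mixed = [m + delta * g for m, g in zip(mixed, graph)]
    return mixed

# Toy setup (hypothetical): two cached support samples, two classes.
keys = [[1.0, 0.0], [0.0, 1.0]]      # cached keys K_j
values = [[1.0, 0.0], [0.0, 1.0]]    # one-hot labels V
query = [1.0, 0.0]                   # adapted query feature A(z)

cache = cache_logits(query, keys, values, beta=5.5)
test_logits = mix_logits([0.2, 0.1], cache, alpha=1.0)
train_logits = mix_logits([0.2, 0.1], cache, alpha=1.0, graph=[0.5, 0.0], delta=0.5)
```

Because the teacher enters only through the optional `graph` argument, the test-time call is exactly the Tip-Adapter-F expression \(L_{ZS}+\alpha\cdot L_{Cache}\).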
2. Heterogeneous Image-Patch-Text Graph + MGT: Reasoning fine-grained visual evidence and category semantics in a single graph
Global vectors miss patch-level evidence and the mapping between specific patches and class names. TOGA performs multi-scale patch extraction, slicing the image into a union of 5 views: the global image (1), a 3×3 local grid (9 patches), a 2×2 mid-scale grid (4 patches), top/bottom halves (2), and left/right halves (2), totaling \(M=1+9+4+2+2=18\) patches. These are resized to 224×224 and passed through frozen CLIP for normalized features \(V_{vis}^{(0)}\), while category prompts provide text embeddings \(V_{text}^{(0)}\). Both modalities first pass through unimodal Transformers for intra-modal enhancement (patch-patch co-occurrence and prompt-prompt semantics).
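As a rough sketch of how the 18 crops could be enumerated, the snippet below generates crop boxes for the five views. The helper names and the choice of equal, non-overlapping grid cells are assumptions, since the exact crop geometry is not given here. Boxes use the `(left, upper, right, lower)` pixel convention.

```python
def grid_boxes(width, height, rows, cols):
    """Split a width x height image into rows x cols equal cells."""
    boxes = []
    for r in range(rows):
        for c in range(cols):
            boxes.append((
                c * width // cols,         # left
                r * height // rows,        # upper
                (c + 1) * width // cols,   # right
                (r + 1) * height // rows,  # lower
            ))
    return boxes

def multiscale_boxes(width, height):
    """Union of the five views: 1 + 9 + 4 + 2 + 2 = 18 crop boxes."""
    boxes = grid_boxes(width, height, 1, 1)    # global view
    boxes += grid_boxes(width, height, 3, 3)   # 3x3 local grid
    boxes += grid_boxes(width, height, 2, 2)   # 2x2 mid-scale grid
    boxes += grid_boxes(width, height, 2, 1)   # top / bottom halves
    boxes += grid_boxes(width, height, 1, 2)   # left / right halves
    return boxes

boxes = multiscale_boxes(224, 224)
```

Each box would then be cropped, resized to 224×224, and encoded by the frozen CLIP image encoder to form one patch node.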
The nodes then form a heterogeneous graph \(G=(N,E)\) with node types \(\phi(v)\in\{\text{patch},\text{text}\}\) and edge types \(r\in\{pp,pt,tp\}\). MGT uses node-type projections for type-dependent \(Q,K,V\) and relation-specific transforms for relation-aware keys \(\tilde K^{(h)}_{s\to t}=W^{(h)}_{K,r}K^{(h)}_s\) and biases \(b^{(h)}_{r}\).
The attention scores are additionally scaled by \(\mu_r^{(h)}\in\mathbb R^+\), a learned relation-level coefficient that weights different interactions (e.g., patch-patch vs. patch-text). Type-sensitive parameters preserve modal characteristics, while relation-sensitive parameters allow the model to specifically leverage cross-modal patch-text interactions, which align visual evidence with correct text labels.
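One attention form consistent with the definitions above, written in the style of heterogeneous graph transformers, would be the following. It is a sketch under stated assumptions, not necessarily the paper's exact expression: the neighborhood \(\mathcal N(t)\), the per-head dimension \(d_h\), and the placement of \(\mu^{(h)}_r\) and \(b^{(h)}_r\) inside the softmax are assumed.

\[
\alpha^{(h)}_{s\to t}=\operatorname*{softmax}_{s\in\mathcal N(t)}\left(\mu^{(h)}_{r}\,\frac{\langle Q^{(h)}_t,\tilde K^{(h)}_{s\to t}\rangle}{\sqrt{d_h}}+b^{(h)}_{r}\right),\qquad r=r(s,t)\in\{pp,pt,tp\}
\]

Under this form, a larger \(\mu^{(h)}_{pt}\) sharpens the attention that a text node pays to patch nodes in head \(h\), which is one way the relation-level weighting could favor patch-text interactions.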
3. Discriminative Top-N Node Filtering: Retaining discriminative patches to avoid background dilution
MGT outputs refined patch nodes \(V'_{vis}=\{h_i\}\). Global pooling would include all patches (including background), diluting small-object fine-grained evidence, which is particularly harmful for datasets like EuroSAT. TOGA learns a projection vector \(p\) to score nodes \(s_i=\langle h_i,p\rangle/(\|p\|_2\,\|h_i\|_2)\) and selects the Top-N nodes for aggregation into \(f_{graph}\). Teacher logits are computed as \(L_{Graph}(x)_c=\cos(f_{graph},\hat t_c)\). Visualizations confirm the model retains high-score foreground patches (ant heads, cat eyes) while suppressing background nodes.
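The scoring and filtering step can be illustrated with a small stdlib-only sketch. Mean pooling over the kept nodes, the `keep_ratio` parameter, and the toy vectors are assumptions. The text above only states that the Top-N nodes are aggregated into \(f_{graph}\).

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

def top_n_pool(nodes, p, keep_ratio=0.5):
    """Score each node by cosine similarity to p, keep the best fraction,
    and mean-pool the kept nodes (mean pooling is an assumption)."""
    scores = [cosine(h, p) for h in nodes]
    n_keep = max(1, math.ceil(keep_ratio * len(nodes)))
    kept = sorted(range(len(nodes)), key=lambda i: scores[i], reverse=True)[:n_keep]
    dim = len(nodes[0])
    pooled = [sum(nodes[i][d] for i in kept) / n_keep for d in range(dim)]
    return pooled, kept

def teacher_logits(f_graph, text_embeddings):
    """L_Graph(x)_c = cos(f_graph, t_c) for every class text embedding t_c."""
    return [cosine(f_graph, t) for t in text_embeddings]

# Toy example (hypothetical): four patch nodes, two of them background-like.
nodes = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
p = [1.0, 0.0]
f_graph, kept = top_n_pool(nodes, p, keep_ratio=0.5)
logits = teacher_logits(f_graph, [[1.0, 0.0], [0.0, 1.0]])
```

With `keep_ratio=0.5`, only the two nodes best aligned with `p` contribute to the pooled feature, so the low-scoring nodes cannot dilute it.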
4. Cache-aware Dual-objective Collaborative Training: Using Focal Loss for effective teacher-forcing
Using only a joint Cross-Entropy (CE) loss allows the teacher to "free-ride" on the student's predictions without learning specialized knowledge. TOGA therefore splits the total loss into two terms.
The first term is standard CE on mixed logits. The second is Focal Loss applied solely to the teacher's logits: \(p_t=\mathrm{softmax}(L_{Graph})_y\), \(L_{Focal}=-(1-p_t)^{\gamma}\log(p_t)\). Focal loss down-weights easy samples where the teacher is already correct (\(p_t\to1\)), forcing the teacher to dedicate capacity to difficult fine-grained samples. This ensures the teacher becomes a robust expert, providing higher-quality relational signals to be "imprinted" into the student's cache keys/values via the joint loss.
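A minimal sketch of the two-term objective follows, assuming the terms are simply summed with a weight `lam`. That weight, the default `gamma`, and the function names are assumptions, since the relative weighting of the two terms is not stated here.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def cross_entropy(logits, label):
    return -log_softmax(logits)[label]

def focal_loss(logits, label, gamma=2.0):
    """L_Focal = -(1 - p_t)^gamma * log(p_t), with p_t = softmax(logits)[label]."""
    log_pt = log_softmax(logits)[label]
    pt = math.exp(log_pt)
    return -((1.0 - pt) ** gamma) * log_pt

def toga_loss(mixed_logits, graph_logits, label, gamma=2.0, lam=1.0):
    """CE on the mixed training logits plus focal loss on the teacher logits only.
    lam is a hypothetical weighting between the two terms."""
    return cross_entropy(mixed_logits, label) + lam * focal_loss(graph_logits, label, gamma)

# Easy vs. hard teacher predictions for true class 0.
easy = focal_loss([5.0, 0.0], label=0)   # teacher already confident and correct
hard = focal_loss([0.0, 5.0], label=0)   # teacher confidently wrong
total = toga_loss([2.0, 0.5], [0.0, 5.0], label=0)
```

The easy sample contributes almost nothing to the focal term while the hard one dominates it, which is the down-weighting behavior described above. Setting `gamma=0` reduces the focal term to plain cross-entropy.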
Key Experimental Results
Main Results
On 11 standard benchmarks (Aircraft, Flowers102, SUN397, Food101, Caltech101, UCF101, StanfordCars, DTD, ImageNet, OxfordPets, EuroSAT) using a frozen CLIP ViT-B/16 backbone, TOGA sets a new state of the art across all shots and datasets while maintaining the exact inference latency of Tip-Adapter-F.
| Shot | Metric (Avg. % of 11 Datasets) | TOGA | CCA (Strong Baseline) | Tip-Adapter-F | GraphAdapter | Gain over CCA |
|---|---|---|---|---|---|---|
| 1 | Avg Acc | 72.2 | 66.3 | 64.6 | 64.8 | +5.9 |
| 2 | Avg Acc | 75.0 | 68.9 | 66.6 | 67.7 | +6.1 |
| 4 | Avg Acc | 77.9 | 72.2 | 69.7 | 70.3 | +5.7 |
| 8 | Avg Acc | 80.0 | 75.0 | 72.4 | 73.4 | +5.0 |
| 16 | Avg Acc | 82.3 | 77.6 | 75.7 | 76.2 | +4.7 |
On FGVC-Aircraft, TOGA outperforms CCA by +9.8% in the 2-shot setting. On EuroSAT, it reaches 89.4% (16-shot), significantly higher than Tip-Adapter-F (84.5%). For OOD robustness (ImageNet variants), TOGA averages 63.1, outperforming zero-shot CLIP (59.1) and various prompt-tuning baselines, indicating that it does not overfit the support set.
Ablation Study
| Configuration (EuroSAT example) | 1-shot | 16-shot | Note |
|---|---|---|---|
| Full (T+M+F+P, Focal) | 67.4 | 89.4 | Complete model |
| Only \(L_{CE}\) | 65.1 | 88.1 | Weak teacher signal |
| \(L_{CE} + L^{Graph}_{CE}\) | 67.1 | 87.5 | Gradient conflict at high shots |
| w/o MGT (Remove M) | 61.9 | 85.7 | Largest drop; cross-modal reasoning is key |
| w/o patch-text edges (Remove P) | 64.1 | 86.7 | Unimodal interaction is insufficient |
| Global Pooling (Remove Top-N, N=All) | 63.4 | 88.7 | Background noise dilution |
| MultiScale → 3×3 fixed | 61.7 | 88.8 | Fixed scale lacks flexibility |
Key Findings
- MGT is the primary performance driver: Removing the cross-modal reasoning component (M) causes the most significant drop across all datasets.
- Top-N filtering has an optimal point: \(N=50\%\) balances retaining discriminative foreground and suppressing noise.
- Greater gains in data-scarce settings: The 1-shot average gain (+5.9) is higher than the 16-shot gain (+4.7), suggesting relational supervision effectively extracts category-defining evidence from very few samples.
- Multi-scale outperforms fixed grids: Processing all 18 multi-scale patches together allows the teacher to capture both local texture and global structure, whereas a fixed 3×3 grid drops EuroSAT 1-shot accuracy from 67.4 to 61.7.
Highlights & Insights
- Decoupling Expressive Power and Inference Cost: The asymmetric "Training Teacher / Permanent Student" design is elegant. Relational knowledge from the teacher is "compiled" into the student's cache, a paradigm applicable to any scenario requiring strong reasoning with strict deployment budgets.
- Direct KV Cache Supervision: Directly upgrading the prototype memory that dictates testing behavior (the cache) is more effective for zero-overhead adapters than traditional logit-based KD.
- Focal Loss as Teacher-Forcing: Forcing the teacher to focus on difficult samples prevents it from "free-riding" on ZS or student predictions.
- Unified Heterogeneous Topology: The heterogeneous graph structure unites the visual hierarchy (image↔patch) and cross-modality (patch↔text) in a single topology, using \(\mu_r\) to automatically weight different interactions.
Limitations & Future Work
- High training cost: Running multi-scale patches (18 crops through CLIP) + Transformers + MGT + Focal loss increases VRAM and training time significantly compared to Tip-Adapter.
- Hyperparameter sensitivity: Tuning \((\alpha, \beta, \gamma, N)\) requires effort, and the optimal \(N=50\%\) may vary depending on the foreground-to-background ratio of specific datasets.
- Evaluated only on CLIP ViT-B/16 for classification: Benefits in dense prediction tasks like detection or segmentation have not been verified.
- Static knowledge: Discarding the teacher means the model cannot use online patch evidence for OOD samples that fall outside the "knowledge" distilled into the cache.
Related Work & Insights
- vs. Tip-Adapter-F: Testing paths are identical. TOGA simply improves the quality of the cache during training, providing "free gains" with zero extra inference cost.
- vs. GraphAdapter: GraphAdapter requires the GNN during testing. TOGA performs cross-modal fusion in a training-only heterogeneous graph and outperforms GraphAdapter (82.3% vs 76.2% at 16-shot) while staying lighter at test time.
- vs. KD / Mutual Learning: Most VLM distillation uses offline teachers or symmetric learning on logits. TOGA uses online asymmetric learning and supervises the KV cache directly.
Rating
- Novelty: ⭐⭐⭐⭐⭐
- Experimental Thoroughness: ⭐⭐⭐⭐⭐
- Writing Quality: ⭐⭐⭐⭐
- Value: ⭐⭐⭐⭐⭐