Meta-Adaptive Prompt Distillation for Few-Shot Visual Question Answering¶

Conference: ICLR 2026
arXiv: 2506.06905
Code: None
Area: Multimodal Learning / Few-Shot Learning
Keywords: Meta-Learning, Prompt Distillation, Few-Shot VQA, LMM, MAML

TL;DR¶

This paper proposes MAPD (Meta-Adaptive Prompt Distillation), a MAML-based prompt distillation framework in which an attention mapper distills soft prompts from task-relevant image features, so that LMMs can adapt to novel visual question answering tasks at test time with only a few gradient steps. With this test-time fine-tuning, adaptation outperforms ICL by 21.2% on average.

Background & Motivation¶

Large multimodal models (LMMs) typically rely on in-context learning (ICL) for few-shot tasks, but several critical issues persist:

Unstable ICL performance in small models: Models with fewer than 7B parameters frequently exhibit stagnant or even degraded performance as the number of in-context examples increases, particularly on VQA tasks.

Information overload in image embeddings: Models are overwhelmed by task-irrelevant information embedded in image representations, hindering effective focus on task-relevant features.

Non-monotonic behavior of ICL: Performance does not necessarily improve as the shot count increases, which runs counter to how humans learn from a few examples.

The authors hypothesize that the root cause lies in ICL's inability to effectively extract task-specific information from image embeddings. The proposed solution is to learn a set of fixed soft prompts that distill task-relevant visual features, followed by rapid test-time adaptation via a small number of gradient updates.

Method¶

Overall Architecture¶

MAPD builds upon the LLaVA v1.5 architecture and comprises three core components:

1. CLIP ViT-L/14 visual encoder (frozen)
2. Attention mapper + soft prompts (trainable, ~24M parameters)
3. Qwen2.5-7B-Instruct LLM (frozen)

Training proceeds in two stages: pre-training (feature alignment) and fine-tuning (meta-learning-based prompt distillation).

Key Designs¶

Attention Mapper:
- Replaces the MLP projection layer in LLaVA v1.5.
- Concatenates learnable soft prompts \(P\) (\(m=256\) tokens) with visual features \(Z_v\) to form \(C = (P, Z_v)\).
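- Computes multi-head attention (8 heads) over \(C\): \(H_{p+v} = \sigma(QK^T) \cdot V\), where \(Q\), \(K\), and \(V\) are projections of \(C\) and \(\sigma\) denotes the softmax.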
- Extracts the first \(m\) output embeddings as task-specific image prompts \(H_p\).
- Design Motivation: The soft prompts leverage the attention mechanism to "distill" task-relevant information from image features.
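
The mapper's forward pass described above can be sketched in NumPy. This is a minimal illustration, not the authors' implementation: the projection matrices `W_q`, `W_k`, `W_v`, the toy sizes, and the \(1/\sqrt{d_{head}}\) scaling (standard attention practice, not stated in the formula above) are all assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def split_heads(X, num_heads):
    """Reshape (tokens, d) into (heads, tokens, d // heads)."""
    tokens, d = X.shape
    return X.reshape(tokens, num_heads, d // num_heads).transpose(1, 0, 2)

def attention_mapper(P, Z_v, W_q, W_k, W_v, num_heads=8):
    """Return H_p: the first m attention outputs for prompts P (m, d) and image features Z_v (n, d)."""
    m, d = P.shape
    C = np.concatenate([P, Z_v], axis=0)                # C = (P, Z_v), shape (m + n, d)
    Q = split_heads(C @ W_q, num_heads)
    K = split_heads(C @ W_k, num_heads)
    V = split_heads(C @ W_v, num_heads)
    # Scaling by sqrt(d_head) is an assumption (common convention).
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d // num_heads)
    H = softmax(scores, axis=-1) @ V                    # (heads, m + n, d_head)
    H = H.transpose(1, 0, 2).reshape(-1, d)             # merge heads -> H_{p+v}, shape (m + n, d)
    return H[:m]                                        # keep only the prompt positions

rng = np.random.default_rng(0)
m, n, d = 4, 6, 16   # toy sizes; the summary reports m = 256 prompt tokens
P = rng.normal(size=(m, d))
Z_v = rng.normal(size=(n, d))
W_q, W_k, W_v = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))
H_p = attention_mapper(P, Z_v, W_q, W_k, W_v)
print(H_p.shape)  # (4, 16)
```

Because every prompt position attends over both the prompts and the image tokens, each row of `H_p` is a weighted mixture of image features, which is the sense in which the prompts "distill" the image.
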

Meta-Task Construction:
- Meta-tasks \(T_j = \{D_{supp}, D_{query}\}\) are sampled from a mixture of training datasets.
- Each meta-task includes a support set and a query set, simulating few-shot test scenarios.
- Task diversity is ensured via a data mixture of 14 datasets (~802K samples).
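
A toy sketch of how such meta-tasks might be drawn. The sampling scheme is an assumption: here each task picks one dataset uniformly and splits its sample into disjoint support and query sets, whereas the summary only says tasks are sampled from the mixture. The dataset names, the example tuples, and `sample_meta_task` are hypothetical.

```python
import random

def sample_meta_task(datasets, n_support, n_query, rng):
    """Pick one dataset from the mixture, then draw disjoint support and query sets from it."""
    name = rng.choice(sorted(datasets))
    examples = rng.sample(datasets[name], n_support + n_query)  # sampling without replacement
    return {
        "dataset": name,
        "support": examples[:n_support],
        "query": examples[n_support:],
    }

# Hypothetical two-dataset mixture of (image_id, question, answer) triples.
datasets = {
    "toy_counting": [(f"img_c{i}", "How many objects?", str(i % 5)) for i in range(20)],
    "toy_ocr": [(f"img_t{i}", "What does the text say?", f"word{i}") for i in range(20)],
}

rng = random.Random(0)
task = sample_meta_task(datasets, n_support=4, n_query=2, rng=rng)
print(task["dataset"], len(task["support"]), len(task["query"]))
```

Keeping the support and query sets disjoint mirrors the test scenario, where the model adapts on a few labeled examples and is then scored on unseen ones.
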

MAPD Training (First-Order MAML):
- Inner loop: Computes the loss on the support set and takes gradient steps of the form \(\theta_p' = \theta_p - \alpha \nabla_{\theta_p} L_{supp}\) to obtain task-specific parameters \(\theta_p'\).
- Outer loop: Computes loss on the query set using task-specific parameters and updates meta-parameters \(\theta_p := \theta_p - \beta \sum_j \nabla_{\theta'_{p,j}} L_{query}\).
- A first-order approximation is adopted to avoid computing Hessian-vector products, substantially reducing GPU memory consumption.
- Inner loop: 5 steps, \(\alpha = 0.1\); outer loop: \(\beta = 10^{-3}\).
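
The bi-level update can be illustrated on a toy problem. The sketch below applies first-order MAML to a scalar parameter `theta` on hypothetical one-dimensional regression tasks, not to the actual prompt parameters \(\theta_p\). The step sizes and inner-step count are taken from the list above; the task distribution, the batch of 4 tasks, and the 3000 outer iterations are assumptions.

```python
import random

def loss_grad(theta, data):
    """Gradient of the mean squared error of y = theta * x over a list of (x, y) pairs."""
    return sum(2.0 * x * (theta * x - y) for x, y in data) / len(data)

def sample_task(rng, n_support=10, n_query=10):
    """Toy meta-task: a regression problem with its own slope, split into support and query sets."""
    slope = rng.uniform(2.0, 4.0)
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n_support + n_query)]
    points = [(x, slope * x) for x in xs]
    return points[:n_support], points[n_support:]

def fomaml_step(theta, tasks, alpha=0.1, beta=1e-3, inner_steps=5):
    """One outer update of first-order MAML over a batch of (support, query) tasks."""
    meta_grad = 0.0
    for support, query in tasks:
        theta_task = theta
        for _ in range(inner_steps):  # inner loop: adapt on the support set
            theta_task -= alpha * loss_grad(theta_task, support)
        # First-order approximation: use the query gradient at theta_task directly,
        # ignoring how theta_task depends on theta (no second derivatives).
        meta_grad += loss_grad(theta_task, query)
    return theta - beta * meta_grad  # outer loop: update the meta-parameters

rng = random.Random(0)
theta = 0.0
for _ in range(3000):
    theta = fomaml_step(theta, [sample_task(rng) for _ in range(4)])
print(round(theta, 2))  # ends up near the middle of the slope range [2, 4]
```

The meta-learned `theta` settles where a few support-set steps reach any task's slope quickly, which is the role the meta-learned prompt initialization plays in MAPD.
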

Loss & Training¶

Training objective: maximize the likelihood \(p_{\theta_p}(X_a | X_v, X_q)\).
Pre-training stage: trained for 4 epochs on LCS-558K with a learning rate of 2e-3.
Fine-tuning stage: trained for 1 epoch using MAML bi-level optimization.
Test-time adaptation: up to \(K=30\) gradient steps on the support set.
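
The objective and the test-time update can be written out explicitly. This is a sketch under two assumptions not stated above: the answer likelihood factorizes autoregressively over the \(N\) answer tokens \(x_1, \dots, x_N\) (standard for LLM-based models), and test-time adaptation uses plain gradient steps with a step size \(\eta\) (a hypothetical symbol; the summary does not say whether it equals the inner-loop \(\alpha\)).

\[
L(\theta_p) = -\log p_{\theta_p}(X_a \mid X_v, X_q) = -\sum_{i=1}^{N} \log p_{\theta_p}(x_i \mid X_v, X_q, x_{<i})
\]

\[
\theta_p^{(k+1)} = \theta_p^{(k)} - \eta \, \nabla_{\theta_p} L_{supp}\big(\theta_p^{(k)}\big), \qquad k = 0, \dots, K-1, \quad K \le 30
\]

Here \(\theta_p^{(0)}\) is the meta-learned initialization, and \(L_{supp}\) is the loss above averaged over the support examples of the new task.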

Key Experimental Results¶

Main Results¶

Performance on VL-ICL Bench (FT adaptation mode, accuracy %):

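| Dataset | Method | 1-S | 2-S | 4-S | 5/8-S | Avg |
|---|---|---|---|---|---|---|
| Open-MI (2-way) | NoMeta-task | 21.5 | 67.5 | 89.0 | 94.0 | 68.0 |
| Open-MI (2-way) | MAPD | 43.5 | 78.0 | 94.5 | 95.5 | 77.9 |
| Operator Induction | Multi-TaskPD | 31.0 | 28.3 | 61.0 | 60.0 | 45.1 |
| Operator Induction | MAPD | 32.0 | 38.3 | 58.3 | 62.0 | 47.7 |
| CLEVR Count | Multi-TaskPD | 25.0 | 25.5 | 31.0 | 38.0 | 29.9 |
| CLEVR Count | MAPD | 26.5 | 27.5 | 31.0 | 40.5 | 31.4 |
| TextOCR | Multi-TaskPD | 21.0 | 20.5 | 24.5 | 25.5 | 22.9 |
| TextOCR | MAPD | 23.5 | 26.5 | 27.0 | 28.5 | 26.4 |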
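
The Avg column can be checked against the four shot columns with a few lines of Python. The numbers are copied from the table above; the assumption that Avg is the unweighted mean of those four columns, rounded to one decimal, is ours.

```python
# Shot-wise accuracies (1-S, 2-S, 4-S, 5/8-S) and the reported Avg, copied from the table.
rows = {
    ("Open-MI (2-way)", "NoMeta-task"): ([21.5, 67.5, 89.0, 94.0], 68.0),
    ("Open-MI (2-way)", "MAPD"): ([43.5, 78.0, 94.5, 95.5], 77.9),
    ("Operator Induction", "Multi-TaskPD"): ([31.0, 28.3, 61.0, 60.0], 45.1),
    ("Operator Induction", "MAPD"): ([32.0, 38.3, 58.3, 62.0], 47.7),
    ("CLEVR Count", "Multi-TaskPD"): ([25.0, 25.5, 31.0, 38.0], 29.9),
    ("CLEVR Count", "MAPD"): ([26.5, 27.5, 31.0, 40.5], 31.4),
    ("TextOCR", "Multi-TaskPD"): ([21.0, 20.5, 24.5, 25.5], 22.9),
    ("TextOCR", "MAPD"): ([23.5, 26.5, 27.0, 28.5], 26.4),
}

def mean(values):
    return sum(values) / len(values)

# A reported average is consistent if it matches the mean up to one-decimal rounding.
TOLERANCE = 0.05 + 1e-9
mismatches = [key for key, (shots, avg) in rows.items() if abs(mean(shots) - avg) > TOLERANCE]
print(mismatches)  # []

# Gain of MAPD over the listed baseline, in points of average accuracy.
gains = {}
for (dataset, method), (_, avg) in rows.items():
    if method == "MAPD":
        baseline_avg = next(a for (d, m), (_, a) in rows.items() if d == dataset and m != "MAPD")
        gains[dataset] = round(avg - baseline_avg, 1)
print(gains)
```

All eight reported averages agree with the unweighted mean, and the TextOCR gain of 3.5 points matches the figure quoted in the comparison with Multi-TaskPD.
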

Comparison with ICL¶

| Comparison | Avg Improvement | Notes |
|---|---|---|
| FT vs. ICL | +21.2% | Fine-tuning adaptation consistently outperforms ICL |
| MAPD vs. Multi-TaskPD (FT) | +3.5 points (TextOCR, Avg column) | Meta-learning further improves cross-task generalization |
| MAPD vs. In-ContextPD (ICL) | Significant advantage | Superior across all datasets |

Ablation Study¶

| Configuration | Key Metric | Notes |
|---|---|---|
| Number of soft prompts | MAPD improves with more prompts | In-ContextPD degrades |
| Robustness to image perturbation | MAPD avg. drop: 1.3% | Other methods drop 2.3–7.0% |
| Similar-example selection | All methods benefit | FT adaptation is more robust than ICL |

Key Findings¶

MAPD is the only method exhibiting strictly monotonic improvement: performance consistently increases with shot count.
Meta-learning advantage is most pronounced at 2-shot: MAPD outperforms Multi-TaskPD by 10 percentage points on Operator Induction (38.3 vs. 28.3).
Only 24M parameters are trained, yet the 7B model surpasses 72B LLaVA-OneVision on Open-MI under ICL.
Most robust to image perturbations: retains near-original performance under strong perturbations such as CutMix and MixUp.
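
The monotonicity and 2-shot claims can be checked against the rows of the main results table. The sketch below covers only the methods listed there, so it illustrates the claim without verifying it against every baseline in the paper.

```python
def strictly_increasing(values):
    """True if each entry is larger than the one before it."""
    return all(a < b for a, b in zip(values, values[1:]))

# Accuracies at 1-S, 2-S, 4-S, 5/8-S from the main results table.
mapd = {
    "Open-MI (2-way)": [43.5, 78.0, 94.5, 95.5],
    "Operator Induction": [32.0, 38.3, 58.3, 62.0],
    "CLEVR Count": [26.5, 27.5, 31.0, 40.5],
    "TextOCR": [23.5, 26.5, 27.0, 28.5],
}
baselines = {
    "Open-MI (2-way)": [21.5, 67.5, 89.0, 94.0],     # NoMeta-task
    "Operator Induction": [31.0, 28.3, 61.0, 60.0],  # Multi-TaskPD
    "CLEVR Count": [25.0, 25.5, 31.0, 38.0],         # Multi-TaskPD
    "TextOCR": [21.0, 20.5, 24.5, 25.5],             # Multi-TaskPD
}

mapd_monotonic = all(strictly_increasing(v) for v in mapd.values())
baseline_monotonic = {name: strictly_increasing(v) for name, v in baselines.items()}
gap_2shot = round(mapd["Operator Induction"][1] - baselines["Operator Induction"][1], 1)

print(mapd_monotonic)      # True
print(baseline_monotonic)  # Operator Induction and TextOCR are False
print(gap_2shot)           # 10.0
```

In these rows MAPD improves at every step on all four datasets, while Multi-TaskPD dips on Operator Induction (at 2-shot and again at the last column) and on TextOCR (at 2-shot).
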

Highlights & Insights¶

Core insight of prompt distillation: Rather than requiring LMMs to directly extract information from lengthy image embedding sequences (as in ICL), the method learns a compact set of soft prompts to "distill" task-relevant visual information.
Combination of meta-learning and prompt tuning: The MAML-learned initialization enables adaptation to entirely novel tasks within at most 30 gradient steps, mitigating overfitting.
Parameter efficiency: Only about 24M parameters are trained, far fewer than in full-model fine-tuning, while still achieving stronger few-shot performance than ICL.
Fine-grained analysis: The three-level decomposition of Operator Induction (Task Induction + Perception + Math Reasoning) provides a fine-grained perspective for understanding model capabilities.

Limitations & Future Work¶

Limited to single-image VQA: The framework is not extended to multi-image scenarios.
Test-time computational overhead: FT adaptation requires approximately 5× the computation of ICL (30 gradient steps).
Limited task complexity: Evaluation tasks are relatively simple (2-way classification, basic arithmetic); effectiveness on more complex reasoning tasks remains unclear.
Frozen LLM: Fine-tuning the LLM jointly may yield further improvements.
Alternative attention mapper architectures (e.g., cross-attention, variable-resolution designs) are worth exploring.

Related Work & Insights¶

MAML in VLMs: This work extends the line of Qin et al. (2023) and Najdenkoska et al. (2023), providing the first large-scale validation of meta-learned prompt distillation in a 7B LMM.
Comparison with ICL-based methods (Flamingo, MMICL, etc.): Demonstrates that parameter-efficient fine-tuning adaptation can surpass purely ICL-based approaches.
Insight: For small models (<10B), fine-tuning-based adaptation may be more reliable than ICL; future LMM designs should consider incorporating efficient built-in adaptation mechanisms.

Rating¶

Novelty: ⭐⭐⭐⭐ (The combination of MAML and prompt distillation is novel, though individual components are well-established)
Experimental Thoroughness: ⭐⭐⭐⭐⭐ (Comprehensive ablations, robustness tests, and fine-grained Operator Induction analysis)
Writing Quality: ⭐⭐⭐⭐ (Clear structure with detailed appendices)
Value: ⭐⭐⭐⭐ (Provides a practical solution for few-shot adaptation in small models)