Meta-Adaptive Prompt Distillation for Few-Shot Visual Question Answering

Conference: ICLR 2026 | arXiv: 2506.06905 | Code: None | Area: Multimodal Learning / Few-Shot Learning | Keywords: Meta-Learning, Prompt Distillation, Few-Shot VQA, LMM, MAML

TL;DR

This paper proposes MAPD (Meta-Adaptive Prompt Distillation), a MAML-based prompt distillation framework that uses an attention mapper to distill soft prompts from task-relevant image features, enabling LMMs to adapt to novel visual question answering tasks at test time with only a few gradient steps. On VL-ICL Bench, MAPD's fine-tuning adaptation outperforms ICL by an average of 21.2%.

Background & Motivation

Large multimodal models (LMMs) typically rely on in-context learning (ICL) for few-shot tasks, but several critical issues persist:

Unstable ICL performance in small models: Models with fewer than 7B parameters frequently exhibit stagnant or even degraded performance as the number of in-context examples increases, particularly on VQA tasks.

Information overload in image embeddings: Models are overwhelmed by task-irrelevant information embedded in image representations, hindering effective focus on task-relevant features.

Non-monotonic behavior of ICL: Performance does not necessarily improve monotonically with increasing shot count—a phenomenon that contradicts intuitions about human few-shot learning.

The authors hypothesize that the root cause lies in ICL's inability to effectively extract task-specific information from image embeddings. The proposed solution is to learn a fixed set of soft prompts that distills task-relevant visual features, followed by rapid test-time adaptation via a small number of gradient updates.

Method

Overall Architecture

MAPD builds upon the LLaVA v1.5 architecture and comprises three core components:

  1. CLIP ViT-L/14 visual encoder (frozen)
  2. Attention mapper + soft prompts (trainable, ~24M parameters)
  3. Qwen2.5-7B-Instruct LLM (frozen)

Training proceeds in two stages: pre-training (feature alignment) and fine-tuning (meta-learning-based prompt distillation).

Key Designs

  1. Attention Mapper:

    • Replaces the MLP projection layer in LLaVA v1.5.
    • Concatenates learnable soft prompts \(P\) (\(m=256\) tokens) with visual features \(Z_v\) to form \(C = (P, Z_v)\).
    • Computes multi-head attention (8 heads): \(H_{p+v} = \sigma(QK^T) \cdot V\), where \(\sigma\) denotes the softmax.
    • Extracts the first \(m\) output embeddings as task-specific image prompts \(H_p\).
    • Design Motivation: The soft prompts leverage the attention mechanism to "distill" task-relevant information from image features (see the sketch after this list).
  2. Meta-Task Construction:

    • Meta-tasks \(T_j = \{D_{supp}, D_{query}\}\) are sampled from a mixture of training datasets.
    • Each meta-task includes a support set and a query set, simulating few-shot test scenarios.
    • Task diversity is ensured via a data mixture of 14 datasets (~802K samples).
  3. MAPD Training (First-Order MAML):

    • Inner loop: Computes loss on the support set and performs a gradient update to obtain task-specific parameters \(\theta_p' = \theta_p - \alpha \nabla_{\theta_p} L_{supp}\).
    • Outer loop: Computes loss on the query set using task-specific parameters and updates meta-parameters \(\theta_p := \theta_p - \beta \sum_j \nabla_{\theta'_{p,j}} L_{query}\).
    • A first-order approximation is adopted to avoid computing Hessian-vector products, substantially reducing GPU memory consumption.
    • Inner loop: 5 steps, \(\alpha = 0.1\); outer loop: \(\beta = 10^{-3}\).
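
To make the attention mapper concrete, below is a minimal PyTorch sketch of this design. The prompt count (\(m=256\)) and head count (8) come from the paper; the hidden dimension, initialization scale, and the use of `nn.MultiheadAttention` (which applies the standard \(1/\sqrt{d_k}\) scaling omitted in the formula above) are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class AttentionMapper(nn.Module):
    """Sketch of the attention mapper: m learnable soft prompts are
    concatenated with visual features, self-attention is run over the
    joint sequence, and the first m outputs are kept as image prompts."""

    def __init__(self, d_model=4096, m=256, num_heads=8):  # d_model is assumed
        super().__init__()
        self.m = m
        self.prompts = nn.Parameter(torch.randn(m, d_model) * 0.02)  # soft prompts P
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)

    def forward(self, z_v):
        # z_v: (B, N_v, d_model), projected visual features Z_v
        b = z_v.size(0)
        p = self.prompts.unsqueeze(0).expand(b, -1, -1)  # (B, m, d_model)
        c = torch.cat([p, z_v], dim=1)                   # C = (P, Z_v)
        h, _ = self.attn(c, c, c)                        # softmax(QK^T / sqrt(d_k)) V
        return h[:, : self.m]                            # H_p: task-specific image prompts
```

On a batch of projected CLIP features, this returns 256 distilled prompt embeddings per image, which stand in for the MLP-projected tokens that LLaVA v1.5 would otherwise feed to the frozen LLM.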

Loss & Training

  • Training objective: maximize the likelihood \(p_{\theta_p}(X_a | X_v, X_q)\).
  • Pre-training stage: trained for 4 epochs on LCS-558K with a learning rate of 2e-3.
  • Fine-tuning stage: trained for 1 epoch using MAML bi-level optimization.
  • Test-time adaptation: up to \(K=30\) gradient steps on the support set (see the sketch below).
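
To ground the bi-level procedure, here is a minimal first-order MAML sketch in PyTorch, mirroring the inner/outer updates above. `fomaml_step`, `lmm_loss` (a closure that runs the frozen LMM with the given prompt parameters and returns the language-modeling loss), and the task format are hypothetical names for illustration, not the authors' code.

```python
import torch

def fomaml_step(mapper, lmm_loss, tasks, alpha=0.1, beta=1e-3, inner_steps=5):
    """One outer update of first-order MAML over a batch of meta-tasks.

    tasks: iterable of (support, query) batch pairs, e.g. sampled from
    the 14-dataset mixture. lmm_loss(params, batch) is a hypothetical
    closure returning the loss of the frozen LMM under the given
    mapper/soft-prompt parameters theta_p.
    """
    meta_params = list(mapper.parameters())
    outer_grads = [torch.zeros_like(p) for p in meta_params]

    for support, query in tasks:
        # Inner loop: adapt a detached copy of theta_p on the support set.
        fast = [p.detach().clone().requires_grad_(True) for p in meta_params]
        for _ in range(inner_steps):
            grads = torch.autograd.grad(lmm_loss(fast, support), fast)
            fast = [(p - alpha * g).detach().requires_grad_(True)
                    for p, g in zip(fast, grads)]
        # Outer loop (first-order): gradients of the query loss at the
        # adapted point are applied to theta_p, skipping Hessian terms.
        q_grads = torch.autograd.grad(lmm_loss(fast, query), fast)
        for acc, g in zip(outer_grads, q_grads):
            acc += g

    with torch.no_grad():
        for p, g in zip(meta_params, outer_grads):
            p -= beta * g  # theta_p := theta_p - beta * sum_j grad L_query
```

Test-time adaptation then reduces to the inner loop alone: up to \(K=30\) gradient steps on the few-shot support set, after which the adapted prompts are used for inference. Dropping the second-order term sacrifices some gradient fidelity but avoids Hessian-vector products, which is what keeps meta-training memory-feasible at 7B scale.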

Key Experimental Results

Main Results

Performance on VL-ICL Bench (FT adaptation mode, accuracy %):

| Dataset | Method | 1-S | 2-S | 4-S | 5/8-S | Avg |
| --- | --- | --- | --- | --- | --- | --- |
| Open-MI (2-way) | NoMeta-task | 21.5 | 67.5 | 89.0 | 94.0 | 68.0 |
| Open-MI (2-way) | MAPD | 43.5 | 78.0 | 94.5 | 95.5 | 77.9 |
| Operator Induction | Multi-TaskPD | 31.0 | 28.3 | 61.0 | 60.0 | 45.1 |
| Operator Induction | MAPD | 32.0 | 38.3 | 58.3 | 62.0 | 47.7 |
| CLEVR Count | Multi-TaskPD | 25.0 | 25.5 | 31.0 | 38.0 | 29.9 |
| CLEVR Count | MAPD | 26.5 | 27.5 | 31.0 | 40.5 | 31.4 |
| TextOCR | Multi-TaskPD | 21.0 | 20.5 | 24.5 | 25.5 | 22.9 |
| TextOCR | MAPD | 23.5 | 26.5 | 27.0 | 28.5 | 26.4 |

Comparison with ICL

| Comparison | Avg Improvement | Notes |
| --- | --- | --- |
| FT vs. ICL | +21.2% | Fine-tuning adaptation consistently outperforms ICL |
| MAPD vs. Multi-TaskPD (FT) | +3.5% (TextOCR) | Meta-learning further improves cross-task generalization |
| MAPD vs. In-ContextPD (ICL) | Significant advantage | Superior across all datasets |

Ablation Study

| Configuration | Key Metric | Notes |
| --- | --- | --- |
| Number of soft prompts | MAPD improves with more prompts | In-ContextPD degrades |
| Robustness to image perturbation | MAPD avg. drop: 1.3% | Other methods drop 2.3–7.0% |
| Similar-example selection | All methods benefit | FT adaptation is more robust than ICL |

Key Findings

  1. MAPD is the only method whose performance improves strictly monotonically with shot count.
  2. Meta-learning advantage is most pronounced at 2-shot: outperforms Multi-TaskPD by 10% on Operator Induction.
  3. Only 24M parameters are trained, yet the 7B model surpasses 72B LLaVA-OneVision on Open-MI under ICL.
  4. Most robust to image perturbations: retains near-original performance under strong perturbations such as CutMix and MixUp.

Highlights & Insights

  • Core insight of prompt distillation: Rather than requiring LMMs to directly extract information from lengthy image embedding sequences (as in ICL), the method learns a compact set of soft prompts to "distill" task-relevant visual information.
  • Combination of meta-learning and prompt tuning: The MAML-learned initialization enables adaptation to entirely novel tasks within at most 30 gradient steps, mitigating overfitting.
  • Parameter efficiency: Only 24M trainable parameters—far fewer than full model fine-tuning—while achieving superior performance.
  • Three-level decomposition of Operator Induction (Task Induction + Perception + Math Reasoning) provides a fine-grained perspective for understanding model capabilities.

Limitations & Future Work

  1. Limited to single-image VQA: The framework is not extended to multi-image scenarios.
  2. Test-time computational overhead: FT adaptation requires approximately 5× the computation of ICL (30 gradient steps).
  3. Limited task complexity: Evaluation tasks are relatively simple (2-way classification, basic arithmetic); effectiveness on more complex reasoning tasks remains unclear.
  4. Frozen LLM: Fine-tuning the LLM jointly may yield further improvements.
  5. Alternative attention mapper architectures (e.g., cross-attention, variable-resolution designs) are worth exploring.

Positioning & Related Work

  • MAML in VLMs: This work extends the line of Qin et al. (2023) and Najdenkoska et al. (2023), providing the first large-scale validation of meta-learned prompt distillation in a 7B LMM.
  • Comparison with ICL-based methods (Flamingo, MMICL, etc.): Demonstrates that parameter-efficient fine-tuning adaptation can surpass purely ICL-based approaches.
  • Insight: For small models (<10B), fine-tuning-based adaptation may be more reliable than ICL; future LMM designs should consider incorporating efficient built-in adaptation mechanisms.

Rating

  • Novelty: ⭐⭐⭐⭐ (The combination of MAML and prompt distillation is novel, though individual components are well-established)
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ (Comprehensive ablations, robustness tests, and fine-grained Operator Induction analysis)
  • Writing Quality: ⭐⭐⭐⭐ (Clear structure with detailed appendices)
  • Value: ⭐⭐⭐⭐ (Provides a practical solution for few-shot adaptation in small models)