AVION: Aerial Vision-Language Instruction from Offline Teacher to Prompt-Tuned Network

Conference: CVPR 2026 arXiv: 2603.12659 Code: https://github.com/yuhu990424/AVION Area: Remote Sensing / Vision-Language Model Keywords: Remote Sensing, Knowledge Distillation, Prompt Tuning, Vision-Language Model, Cross-Modal Retrieval

TL;DR

AVION is a knowledge distillation framework for adapting remote sensing (RS) vision-language models. An offline teacher encodes LLM-generated, semantically rich RS text prototypes as supervision, while learnable prompts are injected into both the visual and text encoders of the student. The resulting tri-aspect alignment distillation significantly outperforms existing PEFT methods on few-shot classification and cross-modal retrieval.

Background & Motivation

Background: RS-specific VLMs such as RemoteCLIP and GeoRSCLIP perform well on downstream tasks, but full fine-tuning is expensive. PEFT methods (CoOp, MaPLe, etc.) adapt to new tasks by learning a small number of parameters.

Limitations of Prior Work: (1) Semantic poverty — RS datasets typically only have category name labels (e.g., "airport") and cannot describe the vast visual variations of the same category across different regions, seasons, and sensors; (2) Visual rigidity — most PEFT methods only update text-side prompts while freezing the visual encoder, unable to capture RS-specific overhead perspective and scale variation features.

Key Challenge: The gap between simple category names and the rich visual patterns of RS images, combined with the inability of frozen visual encoders to adapt to the RS domain.

Goal: Simultaneously address semantic poverty and visual rigidity to make PEFT methods work effectively in RS scenarios.

Key Insight: Leverage LLMs to generate rich category descriptions as textual supervision, and achieve robust adaptation through visual-textual-logit tri-aspect distillation constraints.

Core Idea: Use LLM-enriched text prototypes to solve semantic poverty, employ dual-side prompts + tri-aspect distillation to solve visual rigidity, and incur no additional overhead at inference through the teacher-student framework.

Method

Overall Architecture

  • Offline Teacher stage: a large model (GeoRSCLIP ViT-H/14) encodes LLM-generated category descriptions, validates them against visual prototypes, and aggregates them into text prototypes \(\mathbf{t}_k^{T*}\).
  • Student Training stage: a smaller model (GeoRSCLIP ViT-B/32), with visual and text prompts injected, is trained through tri-aspect distillation alignment.
  • Inference stage: only the student model performs forward passes, so the teacher adds no inference overhead.

Key Designs

  1. Selective Prototype Aggregation

    • Function: Filters and aggregates text prototypes for each class from LLM-generated candidate descriptions.
    • Mechanism: Gemini 2.5 Flash first generates up to 50 RS-perspective descriptions per class, each tagged with an RS-Flag marking RS relevance. The teacher visual prototype is computed as \(\hat{\mathbf{v}}_k^T = \frac{1}{|\mathcal{B}_k|}\sum_i \mathbf{v}_{k,i}^T\); each description is scored against it via \(s_{k,j} = (\hat{\mathbf{v}}_k^T)^\top \mathbf{t}_{k,j}^T\); outliers are removed with a median/MAD z-score; and the surviving descriptions are aggregated with weights \(w_{k,j} \propto \exp(\beta s_{k,j} + \gamma \cdot \text{RS-Flag}_{k,j})\) to form the final prototype.
    • Design Motivation: LLMs may produce non-visual or non-RS-related hallucinated descriptions; visual prototype verification and statistical filtering ensure prototype quality and RS relevance.
  2. Dual-Side Prompt Tuning

    • Function: Simultaneously injects learnable prompts into the student's visual and text encoders.
    • Mechanism: The text side learns context tokens similar to CoOp; the visual side injects prompt tokens at each ViT layer similar to VPT. Both sides keep the backbone frozen, updating only prompt parameters.
    • Design Motivation: Visual-side prompts give the encoder flexibility to adapt to RS overhead perspective features, addressing visual rigidity; text-side prompts absorb the teacher's rich semantic knowledge.
  3. Tri-Aspect Alignment

    • Function: Simultaneously aligns visual embeddings, text embeddings, and similarity logits.
    • Mechanism: \(\mathcal{L}_{\text{img}} = 1 - (\mathbf{v}_i^S)^\top \mathbf{v}_i^T\) aligns visual features; \(\mathcal{L}_{\text{text}} = 1 - (\mathbf{t}_k^S)^\top \mathbf{t}_k^{T*}\) aligns text prototypes; \(\mathcal{L}_{\text{logit}} = \tau^2 \text{KL}(\sigma(\mathbf{s}^T/\tau) \| \sigma(\mathbf{s}^S/\tau))\) aligns cross-modal similarity distributions. Total loss: \(\mathcal{L} = \mathcal{L}_{\text{task}} + 0.5\mathcal{L}_{\text{img}} + 0.5\mathcal{L}_{\text{text}} + \mathcal{L}_{\text{logit}}\), with 30% linear warmup for the logit term.
    • Design Motivation: Embedding alignment alone is insufficient; aligning inter-class relational structure (via logit distillation) enables the student to learn not only individual class representations but also relative inter-class relationships.
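The selective prototype aggregation in design 1 above can be sketched in plain Python. This is a minimal sketch, assuming unit-normalized embeddings represented as lists; the function name and the hyperparameter values (`beta`, `gamma`, `z_max`) are illustrative, not the paper's:

```python
import math

def cosine(u, v):
    # embeddings are assumed L2-normalized, so dot product = cosine similarity
    return sum(a * b for a, b in zip(u, v))

def aggregate_prototype(visual_proto, desc_embs, rs_flags,
                        beta=5.0, gamma=1.0, z_max=2.5):
    """Selective prototype aggregation (sketch).

    visual_proto : teacher visual prototype for class k (L2-normalized)
    desc_embs    : teacher text embeddings of LLM descriptions (L2-normalized)
    rs_flags     : 1.0 if the LLM marked the description RS-relevant, else 0.0
    """
    # 1. similarity of each description to the visual prototype
    sims = [cosine(visual_proto, t) for t in desc_embs]

    # 2. robust outlier removal via median/MAD z-score
    med = sorted(sims)[len(sims) // 2]
    mad = sorted(abs(s - med) for s in sims)[len(sims) // 2] or 1e-8
    keep = [j for j, s in enumerate(sims)
            if abs(s - med) / (1.4826 * mad) <= z_max]

    # 3. softmax-style weights favoring visually verified, RS-flagged descriptions
    logits = [beta * sims[j] + gamma * rs_flags[j] for j in keep]
    m = max(logits)
    w = [math.exp(l - m) for l in logits]
    z = sum(w)
    w = [x / z for x in w]

    # 4. weighted mean of surviving descriptions, re-normalized to unit length
    dim = len(desc_embs[0])
    proto = [sum(w[i] * desc_embs[j][d] for i, j in enumerate(keep))
             for d in range(dim)]
    norm = math.sqrt(sum(x * x for x in proto)) or 1.0
    return [x / norm for x in proto]
```

A description pointing far from the class's visual prototype (an LLM hallucination, in the paper's framing) gets a large MAD z-score and is dropped before weighting ever happens, which is why the statistical filter and the RS-Flag bonus play complementary roles.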

Loss Function / Training Strategy

The total loss adds the task cross-entropy \(\mathcal{L}_{\text{task}}\) to the three distillation terms. Training uses AdamW with lr=5e-4, for 100 epochs in the few-shot setting and 50 epochs for base-to-novel. The distillation temperature is \(\tau=2\), and the logit-distillation weight is linearly warmed up over the first 30% of training.
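The combined objective can be sketched over pre-computed embeddings and logits. This is a minimal sketch in plain Python; all function and variable names are illustrative, and `warmup_frac=0.3` follows the 30% linear warmup described above:

```python
import math

def softmax(xs):
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    z = sum(e)
    return [v / z for v in e]

def cos_align_loss(a, b):
    # 1 - cosine similarity; embeddings assumed L2-normalized
    return 1.0 - sum(x * y for x, y in zip(a, b))

def logit_kl(teacher_logits, student_logits, tau=2.0):
    # tau^2 * KL(softmax(teacher/tau) || softmax(student/tau))
    p = softmax([l / tau for l in teacher_logits])
    q = softmax([l / tau for l in student_logits])
    return tau * tau * sum(pi * math.log(pi / qi)
                           for pi, qi in zip(p, q) if pi > 0)

def total_loss(task_ce, v_s, v_t, t_s, t_t, logits_s, logits_t,
               step, total_steps, warmup_frac=0.3):
    # linear warmup of the logit-distillation weight over the first 30% of steps
    w_logit = min(1.0, step / (warmup_frac * total_steps))
    return (task_ce
            + 0.5 * cos_align_loss(v_s, v_t)        # visual embedding alignment
            + 0.5 * cos_align_loss(t_s, t_t)        # text prototype alignment
            + w_logit * logit_kl(logits_t, logits_s))  # similarity-distribution alignment
```

When student and teacher embeddings and logits coincide, every distillation term vanishes and only \(\mathcal{L}_{\text{task}}\) remains, which makes the distillation a pure pull toward the teacher rather than a competing objective.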

Key Experimental Results

Main Results (6-Dataset Average Few-Shot Classification Accuracy)

| Method | 1-shot | 2-shot | 4-shot | 8-shot | 16-shot |
|---|---|---|---|---|---|
| GeoRSCLIP (zero-shot) | 72.95 (shot-independent) | – | – | – | – |
| CoOp | 69.98 | 78.95 | 84.52 | 87.57 | 90.24 |
| CoCoOp | 70.27 | 80.56 | 85.74 | 88.93 | 91.41 |
| MMRL | 70.57 | 79.47 | – | – | – |
| AVION | 73.12 | 82.34 | 87.21 | 90.48 | 92.85 |

Ablation Study (AID Dataset, 16-shot)

| Configuration | Base Acc | Novel Acc | HM | Description |
|---|---|---|---|---|
| AVION (full) | 95.2 | 88.7 | 91.8 | Only method exceeding the baseline on both base and novel |
| w/o textual alignment | 94.1 | 85.3 | 89.5 | Text prototype supervision is critical for novel-class generalization |
| w/o visual prompts | 94.8 | 86.1 | 90.3 | Visual-side prompts improve domain adaptation |
| w/o logit alignment | 94.5 | 87.2 | 90.7 | Logit distillation provides inter-class relationships |
| w/o RS-Flag | 94.9 | 87.5 | 91.1 | RS-Flag filtering improves prototype quality |

Key Findings

  • AVION is the only method where both base and novel accuracy exceed the GeoRSCLIP baseline in the base-to-novel setting, demonstrating that distillation does not harm generalization
  • Both RS-Flag and visual verification in text prototype aggregation are indispensable; removing either degrades Novel Acc
  • Cross-modal retrieval mR also improves, indicating that tri-aspect distillation enhances overall modal alignment quality

Highlights & Insights

  • Precise diagnosis of semantic poverty: RS datasets having only category name labels is the fundamental cause of PEFT failure; generating rich descriptions via LLMs is an elegant solution
  • Ingenious selective prototype aggregation: Functions like parameter-free cross-attention, with visual prototypes as queries and text descriptions as keys/values, automatically filtering poor descriptions and balancing aggregation weights
  • Tri-aspect distillation preserves generalization: Logit alignment preserves inter-class relational structure, which is key to AVION not degrading on novel classes

Limitations & Future Work

  • Depends on LLM description quality; may produce low-quality descriptions for non-English or unconventional RS categories
  • Teacher model is fixed as GeoRSCLIP ViT-H/14; switching to other backbones requires rebuilding prototypes
  • Not validated on more complex RS downstream tasks such as detection and segmentation

Rating

  • Novelty: ⭐⭐⭐⭐ — Accurate problem diagnosis with a systematic and complete solution
  • Experimental rigor: ⭐⭐⭐⭐ — Six datasets + three task settings + thorough ablations
  • Writing quality: ⭐⭐⭐⭐ — Clear motivation derivation, well-designed figures and tables
  • Impact: ⭐⭐⭐⭐ — Practical solution for RS VLM adaptation