AVION: Aerial Vision-Language Instruction from Offline Teacher to Prompt-Tuned Network¶
Conference: CVPR 2026 arXiv: 2603.12659 Code: https://github.com/yuhu990424/AVION Area: Remote Sensing Keywords: Vision-Language Model, Knowledge Distillation, Parameter-Efficient Fine-Tuning, Remote Sensing Scene Classification, Prompt Learning
TL;DR¶
AVION proposes a knowledge distillation framework for adapting remote sensing vision-language models: an LLM generates semantically rich text prototypes that serve as teacher signals, while the student receives visual-textual dual-side prompt tuning trained with tri-aspect alignment distillation. This addresses both semantic poverty and visual rigidity in RS VLM adaptation, and comprehensively surpasses SOTA on few-shot classification, base-to-novel generalization, and cross-modal retrieval.
Background & Motivation¶
Remote sensing (RS) vision-language models (e.g., RemoteCLIP, GeoRSCLIP) exhibit strong zero-shot capabilities after pretraining, but still require efficient adaptation for new scenarios. Full-parameter fine-tuning is computationally expensive and prone to overfitting. Parameter-efficient fine-tuning (PEFT) serves as a lightweight alternative, but existing methods face two core bottlenecks in remote sensing:
Semantic Poverty: RS datasets provide only category names (e.g., "airport") without describing the vast visual variations of the same category across different regions, seasons, and sensors. Methods like CoOp that learn from "a photo of [CLASS]" templates leave the text encoder unable to fully express diverse appearance patterns.
Visual Rigidity: Most PEFT methods update only the text encoder while freezing the visual encoder, preventing the model from capturing RS-specific scale variations and cross-source heterogeneity.
Key Insight: Use large language models to generate rich category descriptions as teacher signals, and inject learnable prompts on both the visual and textual sides; tri-aspect alignment distillation then transfers the teacher's knowledge for efficient adaptation.
Core Idea: LLM-enhanced text prototypes serve as the teacher to guide tri-aspect distillation alignment for a student model with visual-textual dual-side prompt learning.
Method¶
Overall Architecture¶
AVION adopts a teacher-student distillation architecture: a frozen large teacher model (GeoRSCLIP ViT-H/14) constructs semantically rich text prototypes offline; the student model (GeoRSCLIP ViT-B/32) has learnable prompts injected into both visual and text encoders and is trained via tri-aspect alignment losses. Only the student model is used during inference.
Key Designs¶
- LLM Domain Prompting + Selective Prototype Aggregation (Textual Prototype Enhancement)
- Function: Generates semantically rich text prototypes for each category, replacing single category names.
- Mechanism: (1) An LLM (Gemini 2.5 Flash) generates up to 50 RS-related descriptions per class; (2) RS-Flag rules filter out non-RS descriptions; (3) the teacher's visual prototype is used as a query to compute similarity for each description; (4) a robust z-score based on median/MAD removes outlier descriptions; (5) softmax-weighted aggregation produces the final prototype, with weights incorporating RS-Flag prior boosting.
- Design Motivation: LLM-generated descriptions may contain hallucinations or non-RS content, which must be validated through visual prototype verification and RS-Flag filtering. This aggregation process resembles parameter-free cross-attention, ensuring prototypes are both semantically rich and visually aligned.
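The aggregation pipeline above can be sketched as follows. This is a minimal illustration of the steps (similarity query, median/MAD robust z-score filtering, RS-Flag-boosted softmax weighting); the hyperparameter names `z_thresh`, `tau`, and `flag_boost` are assumptions, not values from the paper.

```python
import numpy as np

def aggregate_prototype(desc_embs, visual_proto, rs_flags,
                        z_thresh=2.5, tau=0.07, flag_boost=1.0):
    """Sketch of selective prototype aggregation.

    desc_embs:    (N, D) L2-normalized description embeddings
    visual_proto: (D,)   L2-normalized teacher visual prototype (the query)
    rs_flags:     (N,)   1.0 if the description passed the RS-Flag rule, else 0.0
    """
    # 1) Similarity of each description to the teacher visual prototype
    sims = desc_embs @ visual_proto                      # (N,)

    # 2) Robust z-score via median / MAD to drop outlier descriptions
    med = np.median(sims)
    mad = np.median(np.abs(sims - med)) + 1e-8
    z = 0.6745 * (sims - med) / mad
    keep = np.abs(z) <= z_thresh

    # 3) Softmax weights over surviving descriptions,
    #    boosted by the RS-Flag prior
    logits = sims[keep] / tau + flag_boost * rs_flags[keep]
    w = np.exp(logits - logits.max())
    w /= w.sum()

    # 4) Weighted aggregation -> final text prototype (re-normalized)
    proto = (w[:, None] * desc_embs[keep]).sum(axis=0)
    return proto / np.linalg.norm(proto)
```

Because the weights come from query-key similarity and the values are the description embeddings, the whole step behaves like a single parameter-free cross-attention head, as the paper notes.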
- Dual-Side Deep Prompt Tuning
- Function: Injects learnable prompts into both the visual and text encoders of the student model simultaneously.
- Mechanism: Similar to VPT and CoOp, deep prompt tokens are injected across multiple layers of the ViT, enabling the student encoder to gain adaptation flexibility while keeping pretrained weights frozen.
- Design Motivation: Adjusting only the text side cannot handle RS images' scale variations and overhead perspective features; adjusting only the visual side lacks semantic guidance. Dual-side prompts allow both encoders to accumulate RS knowledge under teacher guidance.
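Deep prompt injection of the VPT style described above can be sketched abstractly: each frozen layer receives a fresh set of learnable prompt tokens, and the previous layer's prompt outputs are discarded. The function and argument names here are illustrative, not the paper's implementation.

```python
import numpy as np

def deep_prompted_forward(tokens, layers, prompts):
    """Minimal sketch of deep prompt injection for one sample.

    tokens:  (T, D) patch/word token embeddings
    layers:  list of frozen layer callables, each mapping (S, D) -> (S, D)
    prompts: list of (P, D) learnable prompt arrays, one per layer
    """
    x = tokens
    for layer, p in zip(layers, prompts):
        # Prepend this layer's learnable prompts, run the frozen layer,
        # then strip the prompt positions so the next layer gets its own
        # fresh prompts (this is what makes the prompting "deep").
        x = layer(np.concatenate([p, x], axis=0))[p.shape[0]:]
    return x
```

In practice only the `prompts` arrays receive gradients; the pretrained encoder weights inside `layers` stay frozen.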
- Tri-Aspect Alignment Distillation
- Function: Achieves comprehensive knowledge transfer through three complementary losses.
- Mechanism:
- Visual alignment: Brings student and teacher visual embeddings closer (cosine distance)
- Textual alignment: Brings student text embeddings closer to teacher semantic prototypes (cosine distance)
- Similarity alignment: Aligns cross-modal probability distributions using temperature-scaled KL divergence
- Design Motivation: Visual alignment addresses visual rigidity, textual alignment addresses semantic poverty, and logit alignment transfers implicit knowledge about inter-class relationships.
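The three losses can be sketched as below: cosine distance for the visual and textual terms, and a temperature-scaled KL divergence over image-to-text probability distributions for the similarity term. Shapes and the τ² scaling convention are assumptions for illustration.

```python
import numpy as np

def l2n(x):
    """Row-wise L2 normalization."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def tri_aspect_losses(v_s, v_t, t_s, t_t, tau=2.0):
    """Sketch of the three alignment losses.

    v_s, v_t: (B, D) student / teacher image embeddings
    t_s, t_t: (C, D) student text embeddings / teacher text prototypes
    """
    v_s, v_t, t_s, t_t = map(l2n, (v_s, v_t, t_s, t_t))

    # Visual alignment: cosine distance between student and teacher images
    L_vis = np.mean(1.0 - np.sum(v_s * v_t, axis=-1))

    # Textual alignment: cosine distance to the teacher semantic prototypes
    L_txt = np.mean(1.0 - np.sum(t_s * t_t, axis=-1))

    # Similarity alignment: KL between temperature-scaled
    # image-to-text distributions (standard tau^2 scaling assumed)
    p_t = softmax((v_t @ t_t.T) / tau)
    p_s = softmax((v_s @ t_s.T) / tau)
    L_logit = (tau ** 2) * np.mean(
        np.sum(p_t * (np.log(p_t) - np.log(p_s)), axis=-1))
    return L_vis, L_txt, L_logit
```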
Loss Function / Training Strategy¶
The total objective is a weighted sum of the task loss and the three alignment losses: the visual and textual alignment weights are set to 0.5, the logit-alignment weight to 1.0 with a 30% linear warm-up, and the distillation temperature to τ=2. Training uses AdamW (lr 5e-4, batch size 4), and all experiments run on a single NVIDIA L4 GPU.
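Putting the weights and warm-up together, the total objective can be sketched as below; the exact warm-up schedule (linear in training steps) is an assumption.

```python
def total_loss(L_task, L_vis, L_txt, L_logit, step, total_steps,
               w_vis=0.5, w_txt=0.5, w_logit=1.0, warmup_frac=0.3):
    """Weighted sum of task loss and the three alignment losses,
    with a linear warm-up on the logit-alignment weight over the
    first 30% of training steps (illustrative schedule)."""
    warmup_steps = warmup_frac * total_steps
    ramp = min(1.0, step / warmup_steps)  # 0 -> 1 over the warm-up phase
    return L_task + w_vis * L_vis + w_txt * L_txt + ramp * w_logit * L_logit
```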
Key Experimental Results¶
Main Results¶
| Dataset | Metric | AVION | Previous SOTA (APPLeNet) | Improvement |
|---|---|---|---|---|
| 6-dataset avg (1-shot) | Accuracy | 74.27% | 74.27% | Tied |
| 6-dataset avg (8-shot) | Accuracy | 91.85% | 89.20% | +2.65pp |
| 6-dataset avg (16-shot) | Accuracy | 93.69% | 91.61% | +2.08pp |
| 6-dataset avg (B2N) | HM | 87.05% | 83.84% | +3.21pp |
| 6-dataset avg (B2N) | Novel | 79.94% | 75.75% | +4.19pp |
| RSITMD | mR | 52.92% | - | +1.11pp vs GeoRSCLIP-FT |
| RSICD | mR | 39.80% | - | +0.93pp vs GeoRSCLIP-FT |
Ablation Study¶
| Configuration | HM (%) | 1-shot (%) | Description |
|---|---|---|---|
| B0: CoOp-style shallow text prompts | 78.88 | 69.98 | Baseline |
| B1: + deep prompts | 66.71 | 66.95 | Severe novel-class degradation |
| B2: + visual alignment | 72.74 | 70.21 | Regularization restores generalization |
| B5: + LLM prototypes + selective aggregation | 83.05 | 72.52 | Largest HM improvement |
| B7: + logit alignment + warm-up | 87.05 | 74.27 | Overall best |
Key Findings¶
- AVION is the only method that simultaneously exceeds the GeoRSCLIP baseline in the base-to-novel setting (Novel 79.94% vs 79.75%)
- As the number of shots increases, AVION's lead over the runner-up grows from 0pp at 1-shot to +2.65pp at 8-shot
- LLM domain prompting + selective aggregation is the largest contributing component (HM +10.31pp)
- t-SNE visualization shows AVION maintains good multimodal alignment on novel classes
Highlights & Insights¶
- Precisely diagnoses two core bottlenecks in RS PEFT and systematically addresses each one
- Selective prototype aggregation uses parameter-free cross-attention to align LLM knowledge with visual semantics, leveraging rich semantics while filtering hallucinations
- Step-by-step ablation of tri-aspect distillation is clear, with quantified evidence for each component's contribution
- Outperforms full-parameter fine-tuning on cross-modal retrieval with fewer trainable parameters
Limitations & Future Work¶
- Offline pre-computation of teacher prototypes still incurs overhead, potentially inefficient with extremely large numbers of categories
- Description quality depends on LLM prompt design; effects of different LLMs are unexplored
- Validated only on optical RS; applicability to other modalities such as SAR is unknown
- Experiments use only RS-specific VLMs; generalization to general-purpose CLIP is unverified
Rating¶
- Novelty: ⭐⭐⭐⭐ — The combination of dual-side prompts + LLM prototype enhancement + tri-aspect distillation is novel, though individual components are not entirely new
- Experimental rigor: ⭐⭐⭐⭐⭐ — 6 classification + 2 retrieval datasets, three evaluation protocols, thorough ablations
- Writing quality: ⭐⭐⭐⭐⭐ — Clear problem diagnosis, well-motivated method, progressively layered ablations
- Impact: ⭐⭐⭐⭐ — Practical value for RS VLM adaptation; the LLM-assisted prototype construction approach is inspirational