AVION: Aerial Vision-Language Instruction from Offline Teacher to Prompt-Tuned Network¶
Conference: CVPR 2026 arXiv: 2603.12659 Code: https://github.com/yuhu990424/AVION Area: Remote Sensing / Vision-Language Model Keywords: Remote Sensing, Knowledge Distillation, Prompt Tuning, Vision-Language Model, Cross-Modal Retrieval
TL;DR¶
AVION is a knowledge distillation framework for remote sensing VLMs: an offline teacher supervises the student with LLM-generated, semantically rich text prototypes, while learnable prompts are injected into both the student's visual and text encoders. The resulting tri-aspect alignment distillation significantly outperforms existing PEFT methods on few-shot classification and cross-modal retrieval.
Background & Motivation¶
Background: RS-specific VLMs such as RemoteCLIP and GeoRSCLIP perform well on downstream tasks, but full fine-tuning is expensive. PEFT methods (CoOp, MaPLe, etc.) adapt to new tasks by learning a small number of parameters.
Limitations of Prior Work: (1) Semantic poverty — RS datasets typically only have category name labels (e.g., "airport") and cannot describe the vast visual variations of the same category across different regions, seasons, and sensors; (2) Visual rigidity — most PEFT methods only update text-side prompts while freezing the visual encoder, unable to capture RS-specific overhead perspective and scale variation features.
Key Challenge: The gap between simple category names and the rich visual patterns of RS images, combined with the inability of frozen visual encoders to adapt to the RS domain.
Goal: Simultaneously address semantic poverty and visual rigidity to make PEFT methods work effectively in RS scenarios.
Key Insight: Leverage LLMs to generate rich category descriptions as textual supervision, and achieve robust adaptation through visual-textual-logit tri-aspect distillation constraints.
Core Idea: Use LLM-enriched text prototypes to solve semantic poverty, employ dual-side prompts + tri-aspect distillation to solve visual rigidity, and incur no additional overhead at inference through the teacher-student framework.
Method¶
Overall Architecture¶
Offline Teacher stage: a large model (GeoRSCLIP ViT-H/14) encodes LLM-generated category descriptions, validates them with visual prototypes, and aggregates them into text prototypes \(\mathbf{t}_k^{T*}\). Training Student stage: a smaller model (GeoRSCLIP ViT-B/32) with visual and text prompts injected, trained through tri-aspect distillation alignment. Inference stage: only the student model performs forward passes.
Key Designs¶
- Selective Prototype Aggregation
- Function: Filters and aggregates text prototypes for each class from LLM-generated candidate descriptions.
  - Mechanism: First uses Gemini 2.5 Flash to generate up to 50 RS-perspective descriptions per class, marking RS relevance with an RS-Flag. Computes teacher visual prototypes \(\hat{\mathbf{v}}_k^T = \frac{1}{|\mathcal{B}_k|}\sum_i \mathbf{v}_{k,i}^T\), scores each description against the visual prototype via \(s_{k,j} = (\hat{\mathbf{v}}_k^T)^\top \mathbf{t}_{k,j}^T\), removes outliers with a median/MAD z-score, and finally aggregates the surviving descriptions with weights \(w_{k,j} \propto \exp(\beta s_{k,j} + \gamma \cdot \text{RS-Flag}_{k,j})\) to form the final prototype.
- Design Motivation: LLMs may produce non-visual or non-RS-related hallucinated descriptions; visual prototype verification and statistical filtering ensure prototype quality and RS relevance.
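The aggregation pipeline above can be sketched in a few lines of numpy. This is a hypothetical illustration, not the authors' code: the function name, hyperparameter values (`beta`, `gamma`, `z_thresh`), and the 0.6745 normal-consistency constant for the MAD z-score are assumptions for the sketch.

```python
import numpy as np

def aggregate_prototype(v_proto, t_descs, rs_flags, beta=5.0, gamma=1.0, z_thresh=2.5):
    """Sketch of Selective Prototype Aggregation for one class k.

    v_proto:  (d,)   L2-normalized teacher visual prototype.
    t_descs:  (m, d) L2-normalized teacher embeddings of LLM descriptions.
    rs_flags: (m,)   0/1 remote-sensing relevance flags.
    """
    # Cosine similarity of each description to the visual prototype
    s = t_descs @ v_proto                              # (m,)

    # Robust outlier removal via median/MAD z-score
    med = np.median(s)
    mad = np.median(np.abs(s - med)) + 1e-8
    z = 0.6745 * (s - med) / mad                       # normal-consistent MAD z-score
    keep = np.abs(z) <= z_thresh

    # Weighted aggregation: w ∝ exp(beta * s + gamma * RS-Flag)
    logits = beta * s[keep] + gamma * rs_flags[keep]
    w = np.exp(logits - logits.max())
    w /= w.sum()
    proto = (w[:, None] * t_descs[keep]).sum(axis=0)
    return proto / np.linalg.norm(proto)               # re-normalize final prototype
```

Descriptions that disagree with the class's visual statistics are dropped before weighting, so a single hallucinated description cannot dominate the prototype.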
- Dual-Side Prompt Tuning
- Function: Simultaneously injects learnable prompts into the student's visual and text encoders.
- Mechanism: The text side learns context tokens similar to CoOp; the visual side injects prompt tokens at each ViT layer similar to VPT. Both sides keep the backbone frozen, updating only prompt parameters.
- Design Motivation: Visual-side prompts give the encoder flexibility to adapt to RS overhead perspective features, addressing visual rigidity; text-side prompts absorb the teacher's rich semantic knowledge.
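The two injection patterns can be sketched as token-sequence operations. This is a minimal numpy illustration of CoOp-style text context and VPT-deep-style per-layer visual prompts; the function names and the identity stand-ins for frozen transformer blocks are assumptions, not the paper's implementation.

```python
import numpy as np

def coop_text_tokens(context, class_tokens):
    """CoOp-style text side: prepend M learnable context tokens to the
    class-name token embeddings: [ctx_1 ... ctx_M, CLASS]."""
    return np.concatenate([context, class_tokens], axis=0)

def vit_with_deep_prompts(tokens, layer_prompts, blocks, n_prompts):
    """VPT-deep-style visual side: at every ViT layer, discard the previous
    layer's prompt slots and insert that layer's fresh learnable prompts
    right after the CLS token. Only the prompts would be trainable; the
    blocks (frozen transformer layers, (L, d) -> (L, d)) stay fixed.

    tokens: (1 + n_patches, d) CLS + patch embeddings.
    layer_prompts: list of (n_prompts, d) arrays, one per layer.
    """
    x = tokens
    first = True
    for prompts, block in zip(layer_prompts, blocks):
        cls_tok = x[:1]
        patches = x[1:] if first else x[1 + n_prompts:]
        first = False
        x = np.concatenate([cls_tok, prompts, patches], axis=0)
        x = block(x)
    return x
```

Because the backbone weights never change, the number of trainable parameters stays tiny: just the context tokens and the per-layer visual prompts.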
- Tri-Aspect Alignment
- Function: Simultaneously aligns visual embeddings, text embeddings, and similarity logits.
- Mechanism: \(\mathcal{L}_{\text{img}} = 1 - (\mathbf{v}_i^S)^\top \mathbf{v}_i^T\) aligns visual features; \(\mathcal{L}_{\text{text}} = 1 - (\mathbf{t}_k^S)^\top \mathbf{t}_k^{T*}\) aligns text prototypes; \(\mathcal{L}_{\text{logit}} = \tau^2 \text{KL}(\sigma(\mathbf{s}^T/\tau) \| \sigma(\mathbf{s}^S/\tau))\) aligns cross-modal similarity distributions. Total loss: \(\mathcal{L} = \mathcal{L}_{\text{task}} + 0.5\mathcal{L}_{\text{img}} + 0.5\mathcal{L}_{\text{text}} + \mathcal{L}_{\text{logit}}\), with 30% linear warmup for the logit term.
- Design Motivation: Embedding alignment alone is insufficient; aligning inter-class relational structure (via logit distillation) enables the student to learn not only individual class representations but also relative inter-class relationships.
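The three distillation terms can be written out directly from the formulas above. A minimal numpy sketch, assuming all embeddings are L2-normalized and `warmup` carries the linear schedule on the logit term; it returns only the distillation part of the loss (the task cross-entropy is added separately):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def tri_aspect_loss(v_s, v_t, t_s, t_t, logits_s, logits_t, tau=2.0, warmup=1.0):
    """Distillation part of the AVION objective (sketch).

    v_s, v_t:           (B, d) student/teacher image embeddings (unit norm).
    t_s, t_t:           (K, d) student/teacher text prototypes (unit norm).
    logits_s, logits_t: (B, K) image-text similarity logits.
    """
    # Cosine-alignment losses: 1 - <student, teacher>
    l_img  = 1.0 - (v_s * v_t).sum(-1).mean()
    l_text = 1.0 - (t_s * t_t).sum(-1).mean()

    # Temperature-scaled KL from teacher to student similarity distributions
    p_t = softmax(logits_t / tau)
    p_s = softmax(logits_s / tau)
    l_logit = tau**2 * (p_t * (np.log(p_t) - np.log(p_s))).sum(-1).mean()

    return 0.5 * l_img + 0.5 * l_text + warmup * l_logit
```

When student and teacher agree exactly, every term vanishes; the KL term in particular penalizes the student for distorting the teacher's inter-class similarity structure, not just individual embeddings.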
Loss Function / Training Strategy¶
The task term \(\mathcal{L}_{\text{task}}\) is standard classification cross-entropy. AdamW optimizer, lr=5e-4, 100 epochs for few-shot, 50 epochs for base-to-novel. Distillation temperature \(\tau=2\), with linear warmup on the logit-distillation weight.
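The 30% linear warmup on the logit term amounts to a simple schedule. A sketch (the function name and step granularity are assumptions; the paper only specifies the 30% linear ramp):

```python
def logit_warmup(step, total_steps, warmup_frac=0.3):
    """Weight on the logit-distillation loss: ramps linearly from 0 to 1
    over the first `warmup_frac` of training, then stays at 1.0."""
    warm = max(1, int(warmup_frac * total_steps))
    return min(1.0, step / warm)
```

Delaying the logit term lets the embedding-alignment losses settle first, so the student is not forced to match the teacher's similarity distribution from a random initialization.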
Key Experimental Results¶
Main Results (6-Dataset Average Few-Shot Classification Accuracy)¶
| Method | 1-shot | 2-shot | 4-shot | 8-shot | 16-shot |
|---|---|---|---|---|---|
| GeoRSCLIP (zero-shot) | 72.95 | — | — | — | — |
| CoOp | 69.98 | 78.95 | 84.52 | 87.57 | 90.24 |
| CoCoOp | 70.27 | 80.56 | 85.74 | 88.93 | 91.41 |
| MMRL | 70.57 | 79.47 | — | — | — |
| AVION | 73.12 | 82.34 | 87.21 | 90.48 | 92.85 |
Ablation Study (AID Dataset, 16-shot)¶
| Configuration | Base Acc | Novel Acc | HM | Description |
|---|---|---|---|---|
| AVION full | 95.2 | 88.7 | 91.8 | Only method exceeding baseline on both base and novel |
| w/o textual alignment | 94.1 | 85.3 | 89.5 | Text prototype supervision is critical for novel-class generalization |
| w/o visual prompts | 94.8 | 86.1 | 90.3 | Visual-side prompts improve domain adaptation |
| w/o logit alignment | 94.5 | 87.2 | 90.7 | Logit distillation provides inter-class relationships |
| w/o RS-Flag | 94.9 | 87.5 | 91.1 | RS flag filtering improves prototype quality |
Key Findings¶
- AVION is the only method where both base and novel accuracy exceed the GeoRSCLIP baseline in the base-to-novel setting, demonstrating that distillation does not harm generalization
- Both RS-Flag and visual verification in text prototype aggregation are indispensable; removing either degrades Novel Acc
- Cross-modal retrieval mR also improves, indicating that tri-aspect distillation enhances overall modal alignment quality
Highlights & Insights¶
- Precise diagnosis of semantic poverty: the fact that RS datasets carry only category-name labels is identified as the root cause of PEFT underperformance, and generating rich descriptions via LLMs is an elegant fix
- Ingenious selective prototype aggregation: Functions like parameter-free cross-attention, with visual prototypes as queries and text descriptions as keys/values, automatically filtering poor descriptions and balancing aggregation weights
- Tri-aspect distillation preserves generalization: Logit alignment preserves inter-class relational structure, which is key to AVION not degrading on novel classes
Limitations & Future Work¶
- Depends on LLM description quality; may produce low-quality descriptions for non-English or unconventional RS categories
- Teacher model is fixed as GeoRSCLIP ViT-H/14; switching to other backbones requires rebuilding prototypes
- Not validated on more complex RS downstream tasks such as detection and segmentation
Rating¶
- Novelty: ⭐⭐⭐⭐ — Accurate problem diagnosis with a systematic and complete solution
- Experimental rigor: ⭐⭐⭐⭐ — Six datasets + three task settings + thorough ablations
- Writing quality: ⭐⭐⭐⭐ — Clear motivation derivation, well-designed figures and tables
- Impact: ⭐⭐⭐⭐ — Practical solution for RS VLM adaptation