LLaVA-LE: Large Language-and-Vision Assistant for Lunar Exploration

Conference: CVPR 2026
arXiv: 2603.24696
Code: https://github.com/OSUPCVLab/LLaVA-LE
Area: Model Compression
Keywords: Lunar exploration, vision-language model, geological understanding, multimodal reasoning, domain fine-tuning

TL;DR

LLaVA-LE is the first vision-language model tailored for lunar exploration. By constructing LUCID, a large-scale real lunar image-text dataset (96K images + 81K QA pairs), and applying two-stage curriculum fine-tuning on LLaVA, the model achieves a 3.3× improvement over the baseline on lunar geological understanding and multimodal reasoning.

Background & Motivation

Vision-language models (VLMs) have made remarkable progress in natural image understanding, yet remain nearly absent in planetary science. The primary obstacle is the lack of large-scale, high-quality paired planetary image-text data. Existing lunar datasets are small, unimodal, and often synthetic, making them unsuitable for training modern VLMs.

Key Challenge: Planetary remote sensing is fundamentally different from natural image understanding — lunar geological analysis requires joint reasoning across physical modalities (optical, gravity anomaly, topographic slope), whereas a single image provides only surface reflectance information, which is insufficient for understanding geological structure.

Goal: To construct the first large-scale multimodal lunar dataset grounded in real NASA mission data, and to train a vision-language assistant capable of lunar geological description, geological question answering, and multimodal reasoning.

Method

Overall Architecture

The pipeline consists of dataset construction followed by two-stage fine-tuning. Data are sourced from three NASA missions: LROC (high-resolution optical imagery), GRAIL (gravity anomaly), and LOLA (topographic slope). Scientific descriptions and QA pairs are generated with GPT-5. The model builds on the LLaVA framework, pairing a CLIP visual encoder with an LLM, and is trained in two stages.
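The LLaVA-style wiring (CLIP features → projection → LLM) can be sketched as a toy forward pass: patch features from the visual encoder are linearly projected into the LLM's token-embedding space and prepended to the text embeddings. All dimensions and weights below are illustrative placeholders, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: CLIP patch features (1024-d) projected into the
# LLM's token-embedding space (4096-d, e.g. a 7B LLM), 576 image patches.
CLIP_DIM, LLM_DIM, N_PATCHES, N_TEXT = 1024, 4096, 576, 32

# The trainable projection layer that bridges vision and language.
W_proj = rng.standard_normal((CLIP_DIM, LLM_DIM)) * 0.02

def build_llm_input(patch_feats: np.ndarray, text_embeds: np.ndarray) -> np.ndarray:
    """Project CLIP patch features and prepend them to the text embeddings,
    forming the multimodal token sequence fed to the LLM."""
    visual_tokens = patch_feats @ W_proj          # (N_PATCHES, LLM_DIM)
    return np.concatenate([visual_tokens, text_embeds], axis=0)

patch_feats = rng.standard_normal((N_PATCHES, CLIP_DIM))
text_embeds = rng.standard_normal((N_TEXT, LLM_DIM))
seq = build_llm_input(patch_feats, text_embeds)
print(seq.shape)  # (608, 4096)
```

The projection is the only new trainable component added on top of the pretrained encoder and LLM, which is what makes the stage-1 alignment step cheap.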

Key Designs

  1. LUCID Dataset Construction:

    • Function: Provides 96K panchromatic images with detailed scientific descriptions and 81K VQA pairs.
    • Mechanism: Panchromatic lunar images are sourced from LROC WAC; structured prompts are used to invoke GPT-5 to generate detailed scientific descriptions covering geological context, topographic morphology, and inferred subsurface features. Three categories of QA pairs are then derived from these descriptions: detailed description, conversation, and reasoning.
    • Design Motivation: Combining real data with GPT-5 annotation balances dataset scale with annotation quality.
  2. Two-Stage Curriculum Learning:

    • Function: Progressively adapts a general-purpose VLM to the planetary science domain.
    • Mechanism: Stage 1 (concept alignment) — fine-tunes on image-description pairs to teach the model domain-specific lunar geological terminology and visual-semantic mappings. Stage 2 (instruction tuning) — fine-tunes on QA pairs to enhance interactive question answering and reasoning capabilities.
    • Design Motivation: Direct instruction tuning without prior concept alignment yields suboptimal results; establishing a domain conceptual foundation first is necessary.
  3. Multi-Level Evaluation Benchmark:

    • Function: Assesses model performance across varying levels of reasoning complexity.
    • Mechanism: Three evaluation dimensions are designed — Detailed (descriptive), Conversation (dialogic), and Reasoning — scored by a dual-judge system using GPT-4 and Gemini.
    • Design Motivation: A single metric is insufficient to evaluate a domain-specific VLM; multidimensional measurement is required.
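As a rough illustration of how a dual-judge relative score could be computed (assuming, as in the original LLaVA evaluation protocol, that each judge rates both the model's answer and a reference answer, and the reported number is their ratio averaged over judges; all scores below are made up):

```python
# Hypothetical judge outputs: each judge assigns a (model, reference) quality
# score pair per evaluation dimension. The relative score is model / reference,
# averaged across judges; values above 1.0 mean the judge preferred the
# model's answer to the reference answer.
judge_scores = {
    "gpt4":   {"detailed": (7.4, 8.0), "conversation": (7.8, 8.1), "reasoning": (8.6, 8.0)},
    "gemini": {"detailed": (7.1, 7.9), "conversation": (7.6, 7.8), "reasoning": (8.4, 7.9)},
}

def relative_scores(scores):
    """Average model/reference ratio per dimension, plus an overall mean."""
    dims = list(next(iter(scores.values())).keys())
    out = {}
    for dim in dims:
        ratios = [model / ref for model, ref in (j[dim] for j in scores.values())]
        out[dim] = sum(ratios) / len(ratios)
    out["overall"] = sum(out[d] for d in dims) / len(dims)
    return out

res = relative_scores(judge_scores)
print(res["reasoning"] > 1.0)  # True: the model beats the reference on reasoning
```

Under this reading, a reasoning score of 1.070 means the judges rated the model's answers about 7% higher than their own reference answers on that dimension.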

Loss & Training

Standard LLaVA training strategy: Stage 1 freezes the LLM and trains only the projection layer for alignment; Stage 2 unfreezes all parameters for full instruction tuning.
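The freeze/unfreeze schedule can be summarized in a minimal sketch. Parameter-group names are hypothetical stand-ins for the real modules, and stage 2 takes the summary's "unfreezes all parameters" literally; in the original LLaVA recipe the vision encoder typically stays frozen throughout.

```python
# Minimal sketch of the two-stage schedule; group names are illustrative.
PARAM_GROUPS = ["clip_encoder", "projection", "llm"]

def trainable_params(stage: int) -> set[str]:
    """Return the parameter groups updated in the given training stage."""
    if stage == 1:
        # Concept alignment: encoder and LLM frozen, only the
        # vision-to-language projection is trained.
        return {"projection"}
    if stage == 2:
        # Instruction tuning: all parameters are unfrozen.
        return set(PARAM_GROUPS)
    raise ValueError(f"unknown stage: {stage}")

for stage in (1, 2):
    frozen = sorted(set(PARAM_GROUPS) - trainable_params(stage))
    print(f"stage {stage}: train {sorted(trainable_params(stage))}, freeze {frozen}")
```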

Key Experimental Results

Main Results

| Model | Detailed | Conversation | Reasoning | Overall Relative Judge Score | Notes |
|---|---|---|---|---|---|
| Base LLaVA | Low | Low | Low | ~0.32 | |
| LLaVA-LE (Stage 1) | Medium | Medium | Medium | ~0.51 | |
| LLaVA-LE (Stage 2) | High | High | 1.070 | ~1.06 | Exceeds reference |

LLaVA-LE Stage 2 achieves a 3.3× overall improvement over Base LLaVA. The reasoning dimension score of 1.070 surpasses the judge's own reference answers.
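The headline multiplier follows directly from the overall relative judge scores in the table:

```python
base, stage2 = 0.32, 1.06        # approximate overall relative judge scores
print(round(stage2 / base, 1))   # 3.3
```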

Ablation Study

| Configuration | Overall | Notes |
|---|---|---|
| Base LLaVA (no fine-tuning) | ~0.32 | General-purpose model performs poorly on the lunar domain |
| Stage 1 only | ~0.51 | Concept alignment contributes a ~60% relative improvement |
| Stage 1 + Stage 2 | ~1.06 | Instruction tuning approximately doubles the score again |

Key Findings

  • General-purpose VLMs are nearly unusable in planetary science; domain fine-tuning is critical.
  • The concept alignment in Stage 1 contributes substantially, demonstrating that establishing domain terminology and conceptual mappings is foundational.
  • Reasoning scores exceeding the judge's reference answers indicate that data-intensive training enables the model to produce high-quality geological analyses.

Highlights & Insights

  • First planetary science VLM: Fills a gap in AI for planetary exploration and establishes a new application direction.
  • GPT-5-based scientific annotation pipeline: The approach of using large models to automatically generate high-quality annotations for domain-specific data is transferable to other scientific domains with limited labeled resources.
  • Fully open-source: The dataset, code, and model weights are all publicly released, offering significant value to follow-up research.

Limitations & Future Work

  • The current work uses only panchromatic images, leaving the joint reasoning potential of multimodal remote sensing data (gravity, slope) largely unexploited.
  • GPT-5-generated annotations may contain geological inaccuracies and require expert validation.
  • Evaluation still relies on LLM judges, lacking human assessment by planetary scientists.
  • Future work may extend the framework to other celestial bodies such as Mars and asteroids.

Comparison with Related Work
  • vs. LLaVA-Med: LLaVA-Med adapts LLaVA to medicine; LLaVA-LE adapts it to planetary science. The approaches are conceptually similar, but the domain challenges differ substantially.
  • vs. Space-LLaVA: Space-LLaVA relies on synthetic data, whereas LLaVA-LE uses real NASA data, yielding higher data quality.
  • vs. AlphaEarth: AlphaEarth targets Earth observation; LLaVA-LE targets the Moon, where data scarcity presents a greater challenge.

Rating

  • Novelty: ⭐⭐⭐⭐ — Innovative domain application, though the methodology (LLaVA fine-tuning) is relatively standard.
  • Experimental Thoroughness: ⭐⭐⭐ — Evaluation design is reasonable but limited in scale; comparisons with more baselines are lacking.
  • Writing Quality: ⭐⭐⭐⭐ — Motivation is clearly articulated; dataset construction is described in detail.
  • Value: ⭐⭐⭐⭐ — The open-source dataset and model are of significant importance to the planetary science community.