
🛰️ Remote Sensing

🔬 ICLR2026 · 11 paper notes

📌 Same area in other venues: 📷 CVPR2026 (57) · 🧪 ICML2026 (3) · 🤖 AAAI2026 (7) · 🧠 NeurIPS2025 (12) · 📹 ICCV2025 (11)

🔥 Top topics: Remote Sensing ×4 · Multimodal/VLM ×2

Earth-Agent: Unlocking the Full Landscape of Earth Observation with Agents: Earth-Agent is the first Earth Observation (EO) agent framework based on the Model Context Protocol (MCP) tool ecosystem. It unifies RGB and spectral remote sensing data, achieving cross-modal, multi-step, and quantitative spatio-temporal reasoning by dynamically invoking 104 expert tools. The proposed Earth-Bench benchmark includes 248 expert tasks and 13,729 images. Experiments demonstrate that Earth-Agent significantly outperforms general-purpose agents and remote sensing MLLMs.
MARS - A Foundational Map Auto-Regressor: This work treats vector maps (points, polylines, polygons) as a "language," using a unified vision encoder and auto-regressive decoder for end-to-end generation of road networks and building outlines without any segmentation post-processing. It releases MAP-3M, the largest multi-class map dataset to date (approximately 3M images).
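To make the "vector maps as a language" idea concrete, here is a minimal sketch of one way a set of polylines or polygons could be serialised into a flat token sequence and recovered again. The summary does not specify MARS's actual tokenization, so the bin count, the `SEP` and `EOS` tokens, and every function name below are assumptions chosen for illustration.

```python
from typing import List, Tuple

Point = Tuple[float, float]

NUM_BINS = 1000      # hypothetical coordinate vocabulary size
SEP = NUM_BINS       # hypothetical token that closes one shape
EOS = NUM_BINS + 1   # hypothetical token that closes the whole map


def quantize(value: float, size: float) -> int:
    """Map a pixel coordinate in [0, size] to an integer bin in [0, NUM_BINS - 1]."""
    bin_index = int(value / size * NUM_BINS)
    return min(max(bin_index, 0), NUM_BINS - 1)


def dequantize(bin_index: int, size: float) -> float:
    """Return the pixel coordinate at the centre of a bin."""
    return (bin_index + 0.5) / NUM_BINS * size


def encode(shapes: List[List[Point]], size: float) -> List[int]:
    """Flatten shapes into x, y, x, y, ..., SEP, ..., EOS."""
    tokens: List[int] = []
    for shape in shapes:
        for x, y in shape:
            tokens += [quantize(x, size), quantize(y, size)]
        tokens.append(SEP)
    tokens.append(EOS)
    return tokens


def decode(tokens: List[int], size: float) -> List[List[Point]]:
    """Invert encode(), up to quantization error of half a bin."""
    shapes: List[List[Point]] = []
    current: List[Point] = []
    i = 0
    while i < len(tokens) and tokens[i] != EOS:
        if tokens[i] == SEP:
            shapes.append(current)
            current = []
            i += 1
        else:
            current.append((dequantize(tokens[i], size), dequantize(tokens[i + 1], size)))
            i += 2
    return shapes
```

With such a sequence format, an auto-regressive decoder can emit map geometry token by token, which is why no segmentation mask or vectorisation post-processing is needed in principle.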
Measuring the Intrinsic Dimension of Earth Representations: This paper presents the first systematic measurement of the Intrinsic Dimension (ID) of Geographic Implicit Neural Representations (Geographic INR). It finds that the true ID of 256- to 512-dimensional embeddings is only between 2 and 10. A high ID in the frozen embedding space correlates positively with downstream performance, while a low ID in the supervised task-head activation space correlates with high performance, revealing a dual mechanism of "Representativeness vs. Task-Alignment."
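As an illustration of how an intrinsic dimension far below the ambient dimension can be measured, the sketch below implements the TwoNN estimator, which uses only the ratio of each point's second and first nearest-neighbour distances. The summary does not say which estimator the paper uses, so TwoNN is a stand-in here, and `sample_embedded_plane` is a hypothetical helper that produces toy data with a known ID of 2.

```python
import math
import random
from typing import List, Sequence


def twonn_intrinsic_dimension(points: Sequence[Sequence[float]]) -> float:
    """Maximum-likelihood TwoNN estimate: N / sum(log(r2 / r1))."""
    log_ratio_sum = 0.0
    used = 0
    for i, p in enumerate(points):
        r1 = r2 = math.inf  # nearest and second-nearest neighbour distances
        for j, q in enumerate(points):
            if i == j:
                continue
            d = math.dist(p, q)
            if d < r1:
                r1, r2 = d, r1
            elif d < r2:
                r2 = d
        if r1 > 0:  # skip exact duplicates, whose ratio is undefined
            log_ratio_sum += math.log(r2 / r1)
            used += 1
    return used / log_ratio_sum


def sample_embedded_plane(n: int, ambient_dim: int, seed: int = 0) -> List[List[float]]:
    """Toy data: a uniform 2-D square mapped linearly into ambient_dim dimensions."""
    rng = random.Random(seed)
    basis = [[rng.gauss(0.0, 1.0) for _ in range(ambient_dim)] for _ in range(2)]
    points = []
    for _ in range(n):
        u, v = rng.random(), rng.random()
        points.append([u * basis[0][k] + v * basis[1][k] for k in range(ambient_dim)])
    return points


if __name__ == "__main__":
    data = sample_embedded_plane(600, ambient_dim=16)
    print(f"ambient dimension: 16, estimated ID: {twonn_intrinsic_dimension(data):.2f}")
```

The brute-force neighbour search is quadratic in the number of points, which is adequate for a demonstration but would need a spatial index for embedding sets of realistic size.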
MoRA: Mobility as the Backbone for Geospatial Representation Learning at Scale: MoRA treats human mobility graphs as the "structural backbone" for multimodal fusion. Using CLIP-style asymmetric contrastive learning, it aligns POIs, satellite imagery, and demographics with a billion-edge mobility graph. It outperforms SOTA by an average of 12.9% across 9 socioeconomic downstream tasks using 128-dimensional representations and provides the first empirical evidence of scaling laws in geospatial representation learning.
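For readers unfamiliar with CLIP-style objectives, the following is a minimal sketch of a one-directional InfoNCE loss in plain Python. It is the generic textbook form, not MoRA's asymmetric variant, whose details are not given in this summary. The embedding names in the usage note and the temperature of 0.07 are illustrative assumptions.

```python
import math
from typing import List, Sequence


def l2_normalize(vector: Sequence[float]) -> List[float]:
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def info_nce(anchors: Sequence[Sequence[float]],
             positives: Sequence[Sequence[float]],
             temperature: float = 0.07) -> float:
    """Mean cross-entropy of matching anchors[i] to positives[i] within the batch."""
    a = [l2_normalize(v) for v in anchors]
    p = [l2_normalize(v) for v in positives]
    total = 0.0
    for i, anchor in enumerate(a):
        logits = [sum(x * y for x, y in zip(anchor, candidate)) / temperature
                  for candidate in p]
        # Numerically stable log-sum-exp over all candidates in the batch.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / len(a)
```

In a setting like the one described, `anchors` could hold embeddings of one modality (for example satellite imagery of a region) and `positives` the mobility-graph embeddings of the same regions, so that minimising the loss pulls each region's modalities together and pushes other regions in the batch apart.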
Object Fidelity Diffusion for Remote Sensing Image Generation: OF-Diff utilizes category labels to directly extract "shape mask priors" of remote sensing objects to constrain diffusion generation. An "online distillation" framework is employed to distill mixed features containing real image information into a shape-dependent decoder. This enables the model to generate high-fidelity, layout-consistent remote sensing images without requiring real image references during inference. Finally, DDPO reinforcement fine-tuning is used to further align with the real distribution, resulting in a 4–8% mAP improvement for categories such as airplanes, ships, and vehicles in downstream detection tasks.
SatDreamer360: Multiview-Consistent Generation of Ground-Level Scenes from Satellite Imagery: Starting from a single satellite image and predefined ground camera trajectories, SatDreamer360 generates geometrically aligned and cross-frame consistent 360° ground panorama sequences within a diffusion model. It combines triplane scene representations, ray-guided pixel attention, and panoramic epipolar-constrained temporal attention, and outperforms Sat2Density, ControlS2S, and EscherNet on the newly constructed VIGOR++ benchmark.
SelvaBox: A high-resolution dataset for tropical tree crown detection: SelvaBox constructs the largest open-access high-resolution UAV RGB tree crown detection dataset for tropical forests. Using a unified multi-resolution detection benchmark, it demonstrates that high-resolution inputs, DINO-Swin detectors, and cross-dataset training significantly improve in-distribution and zero-shot generalization for tropical tree crown detection.
TAMMs: Change Understanding and Forecasting in Satellite Image Time Series with Temporal-Aware Multimodal Models: The authors propose TAMMs, the first unified framework to jointly perform Temporal Change Description (TCD) and Future Satellite Image Forecasting (FSIF) in a single MLLM-Diffusion architecture. It activates the temporal reasoning capabilities of a frozen MLLM through a Temporal-Aware Module (TAM) and translates change understanding into generative control signals via a Semantic Fusion Control Injection (SFCI) mechanism.
Task-free Adaptive Meta Black-box Optimization: This paper proposes ABOM, a task-free adaptive meta black-box optimizer that parameterizes evolutionary operators (selection, crossover, and mutation) as differentiable attention modules. By utilizing self-generated data to update parameters online during the optimization process, it achieves competitive zero-shot performance on synthetic benchmarks and UAV path planning.
TerraFM: A Scalable Foundation Model for Unified Multisensor Earth Observation: TerraFM is designed for multisensor Earth observation data, treating Sentinel-1 SAR and Sentinel-2 optical imagery as natural augmented views of the same location. Through modality-specific patch embedding, per-position cross-attention fusion, and dual-centering DINO training for long-tail land cover, it achieves strong generalization on classification and segmentation tasks in GEO-Bench and Copernicus-Bench.
Towards Faithful Reasoning in Remote Sensing: A Perceptually-Grounded Geospatial Chain-of-Thought for Vision-Language Models: This paper proposes "Perceptually-Grounded Geospatial Chain-of-Thought" (Geo-CoT), which decomposes the analysis process of remote sensing VLMs into three steps: "Planning → Grounding Evidence → Synthesis." Each step anchors assertions to specific pixel regions using bounding boxes. By constructing the Geo-CoT380k dataset with 380,000 structured reasoning entries and employing two-stage alignment (SFT for cognitive structure and GRPO for faithfulness refinement), the resulting RSThinker significantly outperforms existing SOTA models across over ten remote sensing tasks, including visual grounding, counting, detection, captioning, and VQA.
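The three-step "Planning → Grounding Evidence → Synthesis" structure described for Geo-CoT can be pictured as a small record type in which every piece of evidence carries a bounding box. The sketch below is a hypothetical schema, not the Geo-CoT380k format: the class names, field names, and the `is_grounded` check are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Assumed box convention: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[int, int, int, int]


@dataclass
class Evidence:
    claim: str  # what the model asserts about the region
    box: Box    # the pixel region the assertion is anchored to


@dataclass
class GeoCoTTrace:
    plan: str
    evidence: List[Evidence] = field(default_factory=list)
    synthesis: str = ""


def is_grounded(trace: GeoCoTTrace, width: int, height: int) -> bool:
    """Check that all three steps are present and every box lies inside the image."""
    if not trace.plan or not trace.synthesis or not trace.evidence:
        return False
    for item in trace.evidence:
        x0, y0, x1, y1 = item.box
        if not (0 <= x0 < x1 <= width and 0 <= y0 < y1 <= height):
            return False
    return True
```

A structural check of this kind only verifies that each assertion points somewhere valid in the image. Whether the cited region actually supports the claim is a separate question, which the summary attributes to the GRPO faithfulness refinement stage.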