
🛰️ Remote Sensing

📷 CVPR2026 · 57 paper notes

📌 Same area in other venues: 🔬 ICLR2026 (11) · 🧪 ICML2026 (3) · 🤖 AAAI2026 (7) · 🧠 NeurIPS2025 (12) · 📹 ICCV2025 (11)

🔥 Top topics: Remote Sensing ×28 · Multimodal/VLM ×10 · Segmentation ×5 · Navigation ×3 · Diffusion Models ×3

ACPV-Net: All-Class Polygonal Vectorization for Seamless Vector Map Generation from Aerial Imagery: This paper proposes ACPV-Net, the first framework to generate topologically consistent all-class polygonal vector maps from aerial imagery in a single pass. It uses a Semantic Supervised Conditioning (SSC) diffusion model to generate vertex heatmaps and ensures zero gaps and zero overlaps through proposition-driven PSLG reconstruction.
APEX: A Decoupled Memory-based Explorer for Asynchronous Aerial Object Goal Navigation: APEX decomposes the "UAV target search" task into three decoupled modules: MLLMs that dynamically construct 3D spatio-temporal semantic maps as memory, PPO-based reinforcement learning that translates maps into actions, and an open-vocabulary detector for final target confirmation. These modules run at different frequencies in an asynchronous parallel framework to bypass the inference latency of large models, improving SR by 4.2% and SPL by 2.8% over the previous SOTA on the UAV-ON benchmark.
Asking like Socrates: Socrates helps VLMs understand remote sensing images: This work reveals the "pseudo-reasoning" phenomenon in remote sensing VLMs (where explicit reasoning chains lead to performance degradation), attributed to the "glance effect" (a single, insufficient coarse-grained perception). It proposes the RS-EoT (Evidence-of-Thought) iterative evidence search paradigm. The method uses SocraticAgent self-play to synthesize reasoning trajectories for an SFT cold start, followed by two-stage progressive RL (grounding → VQA) for enhancement and generalization. RS-EoT-7B achieves SOTA on multiple remote sensing VQA and grounding benchmarks.
AVION: Aerial Vision-Language Instruction from Offline Teacher to Prompt-Tuned Network: AVION proposes a knowledge distillation framework that utilizes semantic-rich remote sensing text prototypes generated by an LLM as a Teacher for supervision. Simultaneously, it injects learnable prompts into both the vision and text encoders of the Student model to achieve tri-aspect alignment distillation. It significantly outperforms existing PEFT methods in few-shot classification and cross-modal retrieval.
Beyond Endpoints: Path-Centric Reasoning for Vectorized Off-Road Network Extraction: Addressing the frequent fragmentation and incorrect connections of urban road models in wilderness/off-road scenarios, this paper proposes "path-centric" connectivity reasoning. Instead of relying solely on local features of two endpoints, the method samples multi-scale road evidence along the entire geodesic of candidate edges to determine connectivity. The authors also release WildRoad, the first intercontinental vectorized off-road road dataset, achieving SOTA on off-road benchmarks while generalizing well to urban datasets.
Beyond Matching to Tiles: Bridging Unaligned Aerial and Satellite Views for Vision-Only UAV Navigation: Bearing-UAV abandons the "matching a UAV view to a specific satellite tile" paradigm. Instead, it utilizes 4 adjacent satellite tiles and 1 UAV view to directly regress the absolute coordinates and heading angle of the UAV. In scenarios with misalignment, sparse features, and cross-view discrepancies, it reduces errors by an order of magnitude compared to retrieval/matching methods (UAV view MLE reduced from ~30 m to 8.6 m) and integrates heading prediction into end-to-end navigation.
Beyond Tie Points: Satellite Image Block Adjustment based on Dense Feature Consistency: Addressing the long-standing limitation of Planar Block Adjustment (PBA) relying on sparse tie points and accumulating errors in high-disparity regions such as tall buildings, this paper proposes the "Beyond Tie Points" paradigm. It utilizes a pre-trained feature extractor to generate dense features and confidence maps, reformulating block adjustment as a self-supervised optimization problem to "minimize the dense feature distance of homologous object points." Combined with a grid-based coarse-to-fine solver, it reduces average errors by up to 75.43% on data from Beijing, Guangzhou, and San Jose.
ChangeBridge: Spatiotemporal Image Generation with Multimodal Controls for Remote Sensing: This paper proposes ChangeBridge, the first conditional spatiotemporal image generation model for remote sensing. Based on asymmetrically drifting diffusion bridges, it generates post-event images from pre-event images and multimodal conditions (coordinate-text/semantic masks/instance layouts), simultaneously modeling foreground event-driven changes and background temporal evolution, while serving as a data engine for downstream change detection tasks.
Cross-modal Fuzzy Alignment Network for Text-Aerial Person Retrieval and A Large-scale Benchmark: The Cross-modal Fuzzy Alignment Network (CFAN) is proposed, utilizing fuzzy logic to quantify token-level reliability for fine-grained alignment. It introduces the ground view as a bridging proxy to mitigate the semantic gap between aerial images and text, alongside the construction of a large-scale text-aerial person retrieval benchmark, AERI-PEDES.
Cross-Scale Pansharpening via ScaleFormer and the PanScale Benchmark: This paper proposes PanScale, the first cross-scale pansharpening dataset and evaluation benchmark (PanScale-Bench), along with the ScaleFormer framework. The method reinterprets resolution changes as sequence length variations, achieving cross-scale generalization through Scale-Aware Patchify bucketed sampling, decoupled spatial-sequence modeling, and RoPE.
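The idea of treating resolution as sequence length can be illustrated with a minimal sketch. It assumes square non-overlapping patches and groups inputs whose token counts match into the same bucket; the helper names `sequence_length` and `bucket_by_length` are hypothetical, and the paper's actual Scale-Aware Patchify procedure may differ.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def sequence_length(height: int, width: int, patch: int) -> int:
    """Number of non-overlapping patch tokens for an image of this size."""
    return (height // patch) * (width // patch)


def bucket_by_length(sizes: List[Tuple[int, int]], patch: int) -> Dict[int, List[int]]:
    """Group sample indices by token count so each batch shares one length."""
    buckets: Dict[int, List[int]] = defaultdict(list)
    for idx, (h, w) in enumerate(sizes):
        buckets[sequence_length(h, w, patch)].append(idx)
    return dict(buckets)


# Hypothetical image sizes at three scales.
sizes = [(64, 64), (128, 128), (64, 64), (256, 256)]
buckets = bucket_by_length(sizes, patch=16)
print(buckets)
```

With a fixed patch size, doubling the side length quadruples the token count, which is why a model that generalizes across sequence lengths can also generalize across scales.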
CrossEarth-Gate: Fisher-Guided Adaptive Tuning Engine for Efficient Adaptation of Cross-Domain Remote Sensing Semantic Segmentation: To address the concurrent spatial, semantic, and frequency domain shifts in remote sensing (RS) imagery, CrossEarth-Gate integrates three types of PEFT modules (LoRA / Adapter / Earth-Adapter) as a "toolbox" into every backbone layer. By periodically measuring the contribution of each module to the task's gradient flow using Fisher information and activating only the Top-k most critical ones, it achieves 16 SOTAs across 18 RS cross-domain benchmarks with only 3-4M trainable parameters.
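A minimal sketch of Fisher-guided Top-k selection follows. It assumes the common empirical proxy of Fisher information (the mean squared gradient per module); the module names, gradient values, and helper functions are hypothetical and are not taken from the paper.

```python
from typing import Dict, List


def fisher_scores(grads: Dict[str, List[List[float]]]) -> Dict[str, float]:
    """Empirical Fisher proxy: mean squared gradient per module.

    grads maps a module name to flattened gradient vectors, one per batch.
    """
    scores = {}
    for name, batches in grads.items():
        total = sum(g * g for batch in batches for g in batch)
        count = sum(len(batch) for batch in batches)
        scores[name] = total / count if count else 0.0
    return scores


def select_top_k(scores: Dict[str, float], k: int) -> List[str]:
    """Return the k module names with the largest score, highest first."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Hypothetical gradients for three PEFT modules attached to one layer.
grads = {
    "layer0.lora": [[0.1, -0.2], [0.3, 0.0]],
    "layer0.adapter": [[0.01, 0.02], [0.0, -0.01]],
    "layer0.earth_adapter": [[0.5, -0.4], [0.2, 0.1]],
}
active = select_top_k(fisher_scores(grads), k=2)
print(active)
```

In a training loop, the selection would be repeated periodically so that only the currently selected modules receive updates.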
Data Leakage Detection and De-duplication in Large Scale Geospatial Image Datasets: This paper performs a quality audit on three commonly used remote sensing datasets for building footprint extraction using perceptual hashing. It discovers that the AICrowd Mapping Challenge dataset suffers from severe duplication (approx. 89% of training images are exact/augmented duplicates) and cross-split leakage (approx. 93% of validation images appear in the training set). The authors provide a lightweight, reusable de-duplication and leakage detection pipeline, revealing that many "SOTA" methods are actually overfitted to leaked data.
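Cross-split leakage detection with hashing can be sketched as below. This example substitutes a simple average hash for whatever perceptual hash the authors use, and it assumes images are already reduced to small grayscale grids; `average_hash`, `hamming`, and `find_leaks` are hypothetical helpers. An average hash tolerates small brightness changes but not flips or rotations.

```python
from typing import Dict, List, Tuple

Gray = List[List[int]]  # small grayscale grid, e.g. an 8x8 thumbnail


def average_hash(pixels: Gray) -> int:
    """Set one bit per pixel: 1 if the pixel is brighter than the mean."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for p in flat:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")


def find_leaks(train: Dict[str, Gray], val: Dict[str, Gray],
               max_dist: int = 0) -> List[Tuple[str, str]]:
    """Return (val_name, train_name) pairs whose hashes nearly match."""
    train_hashes = {name: average_hash(img) for name, img in train.items()}
    leaks = []
    for v_name, v_img in val.items():
        v_hash = average_hash(v_img)
        for t_name, t_hash in train_hashes.items():
            if hamming(v_hash, t_hash) <= max_dist:
                leaks.append((v_name, t_name))
    return leaks


train = {"t0": [[10, 200], [10, 200]], "t1": [[200, 10], [10, 10]]}
val = {
    "v0": [[12, 190], [9, 210]],   # brightness-jittered copy of t0
    "v1": [[10, 10], [200, 200]],  # different layout
}
leaks = find_leaks(train, val)
print(leaks)
```

The pairwise loop is quadratic; for large datasets one would index hashes in a dictionary (exact matches) or a nearest-neighbor structure (near matches).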
Exploring Spatiotemporal Feature Propagation for Video-Level Compressive Spectral Reconstruction: This work is the first to advance Spectral Compressive Imaging (SCI) from image-level to video-level reconstruction, and it constructs DynaSpec, the first high-quality dynamic hyperspectral dataset (30 sequences/300 frames). The proposed PG-SVRT achieves 41.52 dB PSNR and the best temporal consistency through spatial-then-temporal progressive attention and bridging tokens, with fewer FLOPs (28.18G) than several image-level SOTA methods.
Geo2: Geometry-Guided Cross-view Geo-Localization and Image Synthesis: Geo2 leverages 3D priors from a Geometry Foundation Model (VGGT) to embed ground panoramas and satellite images into a shared geometry-aware latent space. This framework enables Cross-view Geo-localization (CVGL) and bidirectional Cross-view Image Synthesis (CVIS) to mutually enhance each other within the same architecture. By utilizing reversible flow matching, bidirectional generation is achieved through unidirectional training, setting new SOTA benchmarks in both localization and synthesis on CVUSA/CVACT/VIGOR.
GeoBridge: A Semantic-Anchored Multi-View Foundation Model Bridging Images and Text for Geo-Localization: GeoBridge utilizes a "location-aware unified text description" as a semantic anchor to bind images from three perspectives—UAV, street-view panorama, and satellite—into a shared semantic space. This approach breaks away from the traditional "satellite-centric" localization paradigm, enabling both arbitrary peer-to-peer view matching and text-to-image retrieval. The associated GeoLoc dataset (50,000+ triple-aligned sets across 36 countries) allows it to achieve SOTA performance in both cross-view and cross-modal retrieval.
GeoCoT: Towards Reliable Remote Sensing Reasoning with Manifold Perspective: GeoCoT explicitly integrates the "low-dimensional manifold" prior of remote sensing (RS) images into the Mixture-of-Experts (MoE) architecture. By using spectral clustering and low-rank compression to project redundant visual tokens into low-rank subspaces, it guides sparse expert allocation via manifold structure. Combined with a multi-stage training pipeline (CPT → Cold-start → RSV-GRPO reinforcement learning) and the self-constructed RS-CoT-20k dataset, the 12B RS model outperforms current SOTA by an average of 5.27% across five RS tasks.
GeoDiT: A Diffusion-based Vision-Language Model for Geospatial Understanding: GeoDiT shifts text generation for remote sensing images from "autoregressive token-by-token" to "discrete diffusion parallel iterative denoising." Using SigLIP-2 visual conditioning and the LLaDA-8B bidirectional Transformer, it predicts entire sequences at once and refines them through low-confidence remasking, achieving new SOTA on tasks requiring structured output such as multi-object detection, visual grounding, and image captioning.
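The parallel decoding loop with low-confidence remasking can be sketched in a few lines. This is a toy illustration under stated assumptions: `toy_predict` stands in for the denoising model and returns fixed tokens and confidences, and the linear unmasking schedule is an assumption rather than the schedule used by GeoDiT or LLaDA.

```python
import math
from typing import Callable, List, Optional, Tuple

MASK = None  # placeholder for a masked position
Prediction = List[Tuple[str, float]]  # (token, confidence) per position


def decode(length: int,
           predict: Callable[[List[Optional[str]]], Prediction],
           steps: int) -> List[Optional[str]]:
    """Start fully masked; each step commits only the most confident tokens."""
    seq: List[Optional[str]] = [MASK] * length
    for step in range(steps):
        masked = [i for i, tok in enumerate(seq) if tok is MASK]
        if not masked:
            break
        preds = predict(seq)  # predict every position in parallel
        keep = math.ceil(len(masked) / (steps - step))
        # Low-confidence predictions are discarded, i.e. stay masked.
        best = sorted(masked, key=lambda i: preds[i][1], reverse=True)[:keep]
        for i in best:
            seq[i] = preds[i][0]
    return seq


# Hypothetical stand-in for the model: fixed tokens and confidences.
TARGET = ["two", "planes", "on", "runway"]
CONF = [0.9, 0.4, 0.7, 0.6]


def toy_predict(seq: List[Optional[str]]) -> Prediction:
    return list(zip(TARGET, CONF))


result = decode(len(TARGET), toy_predict, steps=2)
print(result)
```

In this toy run the first step commits positions 0 and 2 (the two highest confidences) and the second step fills the rest; a real model would re-predict the remaining positions conditioned on the committed tokens.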
GeoFlow: Real-Time Fine-Grained Cross-View Geolocalization via Iterative Flow Prediction: GeoFlow is proposed as a lightweight cross-view fine-grained geolocalization framework inspired by flow matching. By learning a probabilistic displacement field combined with an Iterative Refinement Sampling (IRS) algorithm, it achieves precise 2-DoF localization from ground-level images to satellite images in continuous space, reaching 29 FPS real-time speed with accuracy comparable to SOTA.
GeoMMBench and GeoMMAgent: Toward Expert-Level Multimodal Intelligence in Geoscience and Remote Sensing: This paper introduces GeoMMBench (1,053 expert-level geoscience multiple-choice questions) and GeoMMAgent (a retrieval-perception-reasoning multi-agent framework). It systematically evaluates 36 MLLMs in the remote sensing domain, revealing systemic deficiencies in domain knowledge, perceptual grounding, and reasoning.
GeoSANE: Learning Geospatial Representations from Models, Not Data: GeoSANE treats the weights themselves of 103 off-the-shelf remote sensing models as training data. It utilizes a weight-space autoencoder to learn a shared latent representation across all models. New "ready-to-fine-tune" model weights are then sampled and decoded from this latent space for a target architecture. This shifts remote sensing pre-training from "learning from satellite data" to "learning from models." Generated models consistently outperform training-from-scratch and rival or exceed SOTA Remote Sensing Foundation Models (RSFMs) across ten datasets for classification, segmentation, and detection.
GeoViS: Geospatially Rewarded Visual Search for Remote Sensing Visual Grounding: GeoViS reformulates remote sensing visual grounding from a "one-step box regression" into a two-stage process: first, a reward-guided tree-based visual search locates the sub-region most likely to contain the target, and then this sub-region serves as a visual cue for conditional grounding. A unified VisualRAG model simultaneously provides reward evaluation, action guidance, and grounding inference, achieving SOTA performance across five benchmarks.
LNEM: Lunar Neural Elevation Model: LNEM is the first lunar DEM reconstruction framework to explicitly embed a pushbroom camera's Rigorous Sensor Model (RSM) into neural volume rendering. It is accompanied by the Lunar Studio data pipeline, which generates geometrically consistent inputs from raw orbital imagery to reconstruct high-fidelity lunar elevation models under multi-sensor and multi-illumination conditions.
Local Precise Refinement: A Dual-Gated Mixture-of-Experts for Enhancing Foundation Model Generalization against Spectral Shifts: SpectralMoE feeds per-layer features of frozen foundation models (DINOv3/DOFA) into a dual-gated MoE for per-pixel fine-grained modulation, while injecting structural depth priors estimated from RGB bands via cross-attention to achieve SOTA across seven remote sensing domain generalization benchmarks.
LookasideVLN: Direction-Aware Aerial Vision-and-Language Navigation: Addressing the issues of "high ambiguity in landmark descriptions and expensive global scene graph maintenance" in UAV aerial navigation, LookasideVLN proposes a "lookaside" paradigm. It constructs a lightweight egocentric landmark graph using directional cues (left turn/right turn/ascend) naturally present in instructions. By translating candidate paths into "instruction-like" text for MLLM semantic alignment, it outperforms SOTA methods (CityNavAgent) that require global sequence lookahead, even under zero-shot and single-layer lookahead conditions.
MetaSpectra+: A Compact Broadband Metasurface Camera for Snapshot Hyperspectral+ Imaging: MetaSpectra+ proposes a metasurface-refractive lens hybrid optical paradigm. By independently controlling 4-channel dispersion, exposure, and polarization via double-layer metasurfaces, it achieves 250nm broadband snapshot hyperspectral+ HDR/polarization imaging within a minimal 17mm optical path. It reaches a PSNR of 33.31dB on the KAUST benchmark, outperforming existing snapshot hyperspectral systems.
MM-OVSeg: Multimodal Optical-SAR Fusion for Open-Vocabulary Segmentation in Remote Sensing: MM-OVSeg introduces SAR into open-vocabulary segmentation (OVS) for remote sensing. It utilizes contrastive distillation to align SAR features into the representation space of an RGB visual foundation model (CMU) and employs a dual-encoder fusion module to align CLIP global semantics and DINO dense structural features with text (DEF). This enables pixel-level segmentation according to arbitrary text categories even under cloudy or hazy weather conditions, achieving an average mIoU of 51.7% across six benchmarks, outperforming the best previous single-modal methods by 6.1 points.
MOGeo: Beyond One-to-One Cross-View Object Geo-localization: Addressing the unrealistic assumption that existing Cross-View Object Geo-localization (CVOGL) can only locate a single target per image, this paper proposes a new multi-target task CVMOGL and the accompanying CMLocation benchmark (25,520 image pairs, 63,888 instances). It designs MOGeo, an end-to-end method whose core is to ground each query target into sharp attention peaks using Dirac-like one-hot position encoding, combined with cross-view multi-feature fusion and inter-object similarity loss, significantly surpassing DetGeo/VAGeo in multi-target scenarios.
Multigrain-aware Semantic Prototype Scanning and Tri-Token Prompt Learning Embraced High-Order RWKV for Pan-Sharpening: Addressing the pan-sharpening task, this paper replaces the "semantic-agnostic fixed raster scanning" of Vision RWKV with semantic prototype scanning driven by Locality Sensitive Hashing (LSH) clustering. Combined with a "Global + Prototype + Register" tri-token prompt mechanism and an invertible Q-shift high-frequency enhancement, it achieves new SOTA results in PSNR, SSIM, SAM, and ERGAS on three WorldView and GaoFen2 datasets.
NeighborMAE: Exploiting Spatial Dependencies between Neighboring Earth Observation Images in Masked Autoencoders Pretraining: NeighborMAE transforms MAE from "reconstructing a single remote sensing image" to "jointly reconstructing a pair of geographically adjacent images." By utilizing relative position encoding, an adaptive masking ratio based on IoU, and a reconstruction loss weighted by visibility, the model explicitly learns spatial dependencies between neighboring geographic features. It consistently outperforms baselines like SatMAE and ScaleMAE across multiple downstream remote sensing classification and segmentation tasks.
No Labels, No Look-Ahead: Unsupervised Online Video Stabilization with Classical Priors: The authors propose LightStab, an unsupervised online video stabilization framework. By combining a classical three-stage pipeline (motion estimation → motion propagation → motion compensation) with multi-threaded asynchronous buffering, it achieves performance comparable to offline SOTA for the first time across five benchmarks. Additionally, the first multi-modal UAV aerial stabilization dataset, UAV-Test (including visible and infrared light), is released.
Olbedo: An Albedo and Shading Aerial Dataset for Large-Scale Outdoor Environments: Olbedo proposes the first large-scale real-world aerial albedo-shading decomposition dataset (5,664 UAV images, 4 landscapes, multi-illumination across years). It generates multi-view consistent pseudo-ground truth annotations through a physical inverse rendering pipeline. The study demonstrates that synthetic pre-training combined with Olbedo LoRA fine-tuning significantly improves outdoor albedo prediction and supports downstream applications such as relighting, material editing, and scene change analysis.
OlmoEarth: Stable Latent Image Modeling for Multimodal Earth Observation: OlmoEarth utilizes a self-supervised recipe designed specifically for Earth Observation (Latent MIM Lite with frozen random projections as target encoders + modality-aware masking + intra-modality contrastive loss). This approach stably trains spatio-temporal multimodal foundation models in latent space. It outperforms 12 other foundation models on 15 out of 24 embedding tasks and 19 out of 29 fine-tuning tasks, and has been deployed as an end-to-end platform for non-profit organizations.
ORSATR-X: A Foundation Model based on Differential-and-Excitation Networks for Optical Remote Sensing Object Recognition: ORSATR-X utilizes a frozen DINOv3 as its backbone, attaching side adapters to each Transformer block: a Weber Local Adapter (WLA) inspired by Weber's Law to amplify boundaries of low-contrast targets, and a Multi-Scale Aggregation Module (MSAM) to handle extreme scale variations in remote sensing (RS) objects. Trained via distillation from DINOv3-L, it achieves SOTA results for single-modal RS foundation models across scene classification, detection, and segmentation (75.30% mAP50 on DIOR-R, surpassing SkySense V2 which was pre-trained on 21M images).
Orthogonal Spatial-Aware Multi-View Anchor Graph Clustering for Incomplete Remote Sensing Data: Addressing the novel scenario of incomplete remote sensing multi-view clustering where "certain views contain missing pixels," OSMAGC initializes multi-scale spatial-aware anchor graphs using superpixels. It then unifies multi-scale anchor graph learning, structure-aligned consensus feature learning, and orthogonal spatial-aware regularization into a single objective function for alternating optimization. The method consistently outperforms SOTA methods across four remote sensing datasets under various missing rates while achieving the fastest execution speed.
PhenoYieldNet: Learning Crop-Aware Phenological Responses for Multi-Crop Yield Prediction: PhenoYieldNet utilizes a unified model for multi-crop county-level yield prediction: it assigns learnable query vectors via a "Crop Phenology Bank" to each crop, decomposes temporal features into long-term trends and short-term fluctuations via "Crop Phenology Attention" (injecting them into attention biases), and utilizes two-stage Temporal Contrastive Adaptation to transfer remote sensing foundation models to agricultural time series. It consistently outperforms single-crop and multi-crop SOTA on CropNet and MODIS.
PiLoT: Neural Pixel-to-3D Registration for UAV-based Ego and Target Geo-localization: PiLoT unifies "UAV ego-localization + arbitrary target geo-localization" into a single problem: "pixel-to-3D registration between real-time video frames and georeferenced 3D maps." Using a dual-thread engine, a lightweight network trained on millions of synthetic data points, and a Joint Neural-Guided Optimizer (JNGO), it achieves a median error of 1.37 m and 25+ FPS on Jetson Orin under GNSS/IMU-denied conditions.
Prompt-Free Unknown Label Generation for Open World Detection in Remote Sensing: HSGDet enables remote sensing detectors to discover unknown objects during deployment without any text prompts. By utilizing a "Hierarchical Semantic Graph + Scene Co-occurrence Context," it automatically synthesizes CLIP semantic labels for unknowns and integrates new classes into the vocabulary. It outperforms SOTA by 6.6 points in Known mAP and 9.9 points in Unknown Recall, and reduces Wilderness Impact by 36%.
QuCNet: Quantum Deep Learning Driven Multi-Circuit Network for Remote Sensing Image Classification: QuCNet integrates an ultra-lightweight convolutional encoder with 16 parallel 4-qubit Trainable Quantum Circuits (TQCs). It employs "Hybrid Cyclic Weight Sharing (HCWS)" to manage 16 circuits with only 64 independent parameters and utilizes KL divergence expressibility analysis to select gate sequences that avoid barren plateaus. Ultimately, it achieves higher accuracy than classical CNNs on 7 remote sensing benchmarks using only 87k parameters (85× smaller than similar hybrid models) and completes hardware inference on real IBM quantum processors.
RAMEN: Resolution-Adjustable Multimodal Encoder for Earth Observation: RAMEN is a "sensor-agnostic, resolution-adjustable" unified Transformer encoder: it explicitly encodes modality, spatial resolution (GSD), and temporal resolution as input features into a shared latent space. It treats spatial resolution as a controllable output parameter during inference, allowing users to balance precision and computation. Pre-trained on heterogeneous Earth Observation (EO) corpora using masked reconstruction, it outperforms larger state-of-the-art (SOTA) models like TerraMind-L on 8 downstream tasks in the PANGAEA benchmark using a frozen ViT-Base backbone.
RECS4R: Bridging Semantics and Geometry for Referring Remote Sensing Interpretation: RECS4R unifies Referring Visual Grounding (VG) and Referring Image Segmentation (RIS) in remote sensing by "decoding a sequence of language-conditioned polygon contour vertices"—where the contour's bounding box serves as the box and the filled region as the mask. By integrating residual coarse-to-fine encoding, channel-separated multi-scale fusion, and gradient-domain boundary supervision, it achieves new state-of-the-art RECS scores across six datasets, including RefDIOR, RRSIS-D, and the RefCOCO series.
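How one polygon can yield both a box and a mask is easy to show in code. The sketch below assumes the decoded contour is a simple closed polygon in pixel coordinates and rasterizes it by testing pixel centers with the even-odd rule; the function names are hypothetical and the paper's own rasterization may differ.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def polygon_to_box(vertices: List[Point]) -> Tuple[float, float, float, float]:
    """Axis-aligned bounding box (x_min, y_min, x_max, y_max) of the contour."""
    xs = [x for x, _ in vertices]
    ys = [y for _, y in vertices]
    return (min(xs), min(ys), max(xs), max(ys))


def point_in_polygon(px: float, py: float, vertices: List[Point]) -> bool:
    """Even-odd ray casting: count edge crossings to the right of the point."""
    inside = False
    n = len(vertices)
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]
        if (y1 > py) != (y2 > py):  # edge straddles the horizontal ray
            x_cross = x1 + (py - y1) * (x2 - x1) / (y2 - y1)
            if px < x_cross:
                inside = not inside
    return inside


def polygon_to_mask(vertices: List[Point], height: int, width: int) -> List[List[int]]:
    """Fill the contour by testing each pixel center."""
    return [
        [1 if point_in_polygon(c + 0.5, r + 0.5, vertices) else 0
         for c in range(width)]
        for r in range(height)
    ]


contour = [(1.0, 1.0), (3.0, 1.0), (3.0, 3.0), (1.0, 3.0)]
box = polygon_to_box(contour)
mask = polygon_to_mask(contour, height=4, width=4)
print(box)
print(mask)
```

Because both outputs derive from the same vertex sequence, the box and the mask cannot disagree with each other.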
Remote Sensing Image Super-Resolution for Imbalanced Textures: A Texture-Aware Diffusion Framework: To address the "globally random but locally clustered" nature of textures in remote sensing images—which leads to extreme texture imbalance compared to natural images—this paper proposes TexADiff. The framework estimates a Relative Texture Density Map (RTDM) to characterize texture distribution and injects it into the diffusion super-resolution process via a threefold strategy: "spatial condition + loss modulation + sampling schedule." This ensures the model generates realistic high-frequency details in texture-rich regions while suppressing hallucinations in texture-sparse regions, achieving superior perceptual metrics across multiple remote sensing benchmarks.
Revisiting the Necessity of Full Accuracy: Weakly Supervised Object-Level Offset Correction for Misaligned Building Labels: To address the misalignment between building footprint labels and roof positions caused by the lack of orthorectification in Google Earth images, this paper proposes the OMAF framework. It uses a differentiable self-alignment with edge and variance constraints to estimate instance-level offsets, filters these using Bayesian confidence with minimal manual priors, and distills the knowledge into an offset regression network. This process generates clean corrected labels, improving various segmentation models' mIoU by up to 40.6%.
RHO: Robust Holistic OSM-Based Metric Cross-View Geo-Localization: This work presents CV-RHO, the first OSM-based metric-level cross-view localization benchmark targeting adverse weather and sensor noise (2.7M+ images). A dual-branch Pin-Pan architecture model, RHO, is proposed, incorporating Split-Undistort-Merge (SUM) and Position-Orientation Fusion (POF) mechanisms, achieving up to a 20% improvement in localization performance under various degradation conditions.
RoadGIE: Towards A Global-Scale Aerial Benchmark for Generalizable Interactive Road Extraction: This paper first constructs WorldRoadSeg-360K, a global aerial road network segmentation benchmark covering 223 cities in 38 countries with 367,000 pixel-level annotations. Based on this, RoadGIE is proposed: a real-time road extraction framework with only 3.7M parameters that supports "connectivity-aware" interaction (clicks/scribbles), surpassing the previous SOTA in segmentation accuracy and topological consistency while reducing manual annotation time by approximately 79%.
Robust Remote Sensing Image–Text Retrieval with Noisy Correspondence: This paper investigates the "Noisy Correspondence" issue (misaligned image-text pairs) in Remote Sensing Image-Text Retrieval (RSITR) for the first time and proposes the RRSITR framework. By categorizing training pairs into clean, fuzzy, and noisy sets based on contrastive loss, the method utilizes multimodal self-paced learning for easy-to-hard scheduling and applies a robust triplet loss with dynamic soft margins to noisy pairs. It significantly outperforms existing SOTA, particularly under high noise rates.
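The loss-based partition and the soft-margin triplet loss can be illustrated with a small sketch. The two fixed thresholds, the `trust` weight that scales the margin, and both function names are assumptions made for illustration; the paper's actual partition rule and margin schedule are not specified in this summary.

```python
from typing import Dict, List


def partition_by_loss(losses: List[float], clean_max: float,
                      noisy_min: float) -> Dict[str, List[int]]:
    """Split sample indices into clean / fuzzy / noisy by per-pair loss."""
    groups: Dict[str, List[int]] = {"clean": [], "fuzzy": [], "noisy": []}
    for idx, loss in enumerate(losses):
        if loss <= clean_max:
            groups["clean"].append(idx)
        elif loss >= noisy_min:
            groups["noisy"].append(idx)
        else:
            groups["fuzzy"].append(idx)
    return groups


def soft_margin_triplet(pos_sim: float, neg_sim: float,
                        base_margin: float, trust: float) -> float:
    """Triplet loss whose margin shrinks as trust in the pair drops.

    trust is in [0, 1]; trust = 0 removes the margin for suspect pairs.
    """
    margin = base_margin * trust
    return max(0.0, margin - pos_sim + neg_sim)


# Hypothetical per-pair contrastive losses.
losses = [0.1, 0.9, 0.5, 2.0]
groups = partition_by_loss(losses, clean_max=0.3, noisy_min=1.0)
print(groups)
print(soft_margin_triplet(0.6, 0.5, base_margin=0.2, trust=1.0))
```

A self-paced schedule would then start training on the clean group and admit fuzzy pairs gradually as the model improves.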
SegEarth-R2: Towards Comprehensive Language-guided Segmentation for Remote Sensing Images: Addressing four complex requirements in remote sensing (small objects, multi-granularity, multi-object, and implicit instructions), this work introduces LaSeRS, the first large-scale dataset systematically covering these dimensions (40k masks, 122 classes, 30k QA triplets). It proposes SegEarth-R2, a 3B-parameter MLLM segmentation model that surpasses 7B, 8B, and even 13B models across multiple benchmarks using spatial attention supervision and flexible segmentation queries.
Semantic-Adaptive Diffusion for Dynamic Spatiotemporal Fusion: SA-STF utilizes a residual diffusion framework constrained by low-resolution observations and decoupled through Taylor expansion to separate residuals from noise. Combined with Temporal Feature Alignment (TFA) and Semantic-Adaptive Fusion (SAF) modules, it fuses multi-source satellite imagery (e.g., MODIS/Landsat) into high-spatiotemporal resolution images, particularly excelling at recovering semantic changes in dynamic land covers that traditional or data-driven methods fail to capture.
SinGeo: Unlock Single Model's Potential for Robust Cross-View Geo-Localization: SinGeo utilizes "Dual Discriminative Learning + Curriculum Learning" to enable a single model to simultaneously adapt to cross-view geo-localization with arbitrary orientations and FoVs, eliminating the need to train separate models for different FoVs. It pushes R@1 past 70%/50% for extreme narrow fields (FoV=90°/70°) on CVUSA for the first time and provides plug-and-play robustness improvements for ViT/CNN/hybrid architectures.
SkySense-VITA: Towards Universal In-context Segmentation of Multi-modal Remote Sensing Imagery: SkySense-VITA utilizes a "prompt-and-prediction decoupling" architecture to unify visual prompts, text prompts, and their fusion into a single tuning-free in-context segmentation model. It natively supports both optical and SAR imagery while employing a coarse-to-fine semantic granularity annealing pre-training strategy, leading to an average mIoU improvement of over 10% across 18 remote sensing datasets.
Sparsely Timing the Change: A Spiking Temporal Framework for Remote Sensing Interpretation: Addressing the pain point in remote sensing change detection where "only two temporal images are available, making it difficult to model sparse temporal evolution," this paper proposes SpikeAdapter. It utilizes a brain-inspired "Time-to-First-Spike" mechanism to encode bitemporal radiation differences into sparse spike sequences (GSI-P). It then uses a Spiking Neural Network (SNN) to extract temporal cues and STSpikeFuse to adaptively fuse them with semantic features from an ANN backbone. On LEVIR-CD, CLCD, and SYSU-CD, it outperforms CNN, Transformer, Mamba, and pseudo-video methods in F1/IoU metrics.
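A time-to-first-spike code maps stronger inputs to earlier spikes, with each input firing at most once. The sketch below applies that idea to a bitemporal difference; the linear latency mapping, the normalization by `max_value`, and the function names are assumptions, not the paper's GSI-P encoder.

```python
from typing import List, Optional


def ttfs_encode(diff: List[float], num_steps: int, max_value: float = 255.0,
                threshold: float = 0.0) -> List[Optional[int]]:
    """Map each pixel's change magnitude to a spike time (None = no spike).

    Larger changes fire earlier: strength 1.0 -> step 0, weak -> late steps.
    """
    times: List[Optional[int]] = []
    for d in diff:
        strength = min(abs(d) / max_value, 1.0)
        if strength <= threshold:
            times.append(None)  # unchanged pixels never spike
        else:
            times.append(round((1.0 - strength) * (num_steps - 1)))
    return times


def to_spike_train(times: List[Optional[int]], num_steps: int) -> List[List[int]]:
    """Expand spike times into a binary [step][pixel] sequence."""
    return [[1 if t == step else 0 for t in times] for step in range(num_steps)]


# Hypothetical per-pixel radiometric differences between two dates.
diff = [0.0, 255.0, 127.5]
times = ttfs_encode(diff, num_steps=5)
spikes = to_spike_train(times, num_steps=5)
print(times)
print(spikes)
```

Since unchanged pixels stay silent and changed pixels fire once, the resulting sequence is sparse even though only two acquisition dates are available.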
Spectrally Distilled Representations Aligned with Instruction-Augmented LLMs for Satellite Imagery: SATtxt employs a two-stage training process—"Spectral Representation Distillation + Instruction-Augmented LLM Alignment"—to inject multi-spectral (MS) priors into an RGB-only vision encoder and align it with frozen LLM text embeddings. By training only lightweight projectors, it outperforms multi-spectral SOTA models across zero-shot classification, retrieval, open-vocabulary segmentation, and linear probing tasks.
UniChange: Unifying Change Detection with Multimodal Large Language Model: UniChange unifies Binary Change Detection (BCD) and Semantic Change Detection (SCD) into a single MLLM-based framework. By utilizing the embeddings of three special tokens—[T1], [T2], and [CHANGE]—as "queries" to drive a segmentation decoder and replacing fixed classification heads with text prompts, it allows joint training on multi-source remote sensing datasets with conflicting category definitions. It achieves new SOTA performance on WHU-CD, S2Looking, LEVIR-CD+, and SECOND, with IoUs of 90.41, 53.04, 78.87, and 57.62 respectively.
UniGeoRS: A Unified Benchmark for Tri-view Geo-Localization: UniGeoRS constructs the first cross-view geo-localization (CVGL) benchmark (1154 targets, approximately 140,000 images) that unifies satellite, UAV, and ground views while mixing real and synthetic imagery. It further proposes CAME, a plug-and-play two-stage re-ranking module that utilizes Rank Distance and cross-attention to mine inter-platform and intra-platform relationships within candidate sets, providing stable Recall@1 and AP gains across multiple mainstream CVGL models.
UniGeoSeg: Towards Unified Open-World Segmentation for Geospatial Scenes: The authors construct the first million-scale remote sensing instruction segmentation dataset, GeoSeg-1M (590K images, 117 categories, 1.1M triplets), along with the companion GeoSeg-Bench. They propose a unified framework, UniGeoSeg, which integrates referring, interactive, and reasoning segmentation into a single model using Task-Adaptive Text Enhancement (TATE), Latent Knowledge Memory (LKM), and Progressive Task Scheduling (PTS). It achieves state-of-the-art (SOTA) performance on GeoSeg-Bench and multiple public benchmarks with strong zero-shot generalization.
WHU-MARS: A Multispectral Aerial-Ground Benchmark Towards Any-Scenario Person Re-Identification: This paper proposes the "Any-Scenario Person Re-Identification" (AS-ReID) task, which requires a single model to perform any-to-any retrieval across heterogeneous galleries mixing all modalities and viewpoints. The authors construct WHU-MARS, the largest multispectral aerial-ground dataset to date (2,337 identities, 430k RGB/NIR/TIR images, ground + UAV). They further introduce the UAD framework, which achieves state-of-the-art results with minimal parameters on AS-ReID through progressive center alignment and global prototype discrimination, without requiring multi-branch architectures or pairwise alignment.
YieldSAT: A Multimodal Benchmark Dataset for High-Resolution Crop Yield Prediction: YieldSAT formulates crop yield prediction as a "per-pixel regression" task, constructing the first multimodal remote sensing benchmark covering 4 countries and 4 crops, featuring 2,173 expert-validated fields and 12.2 million yield labels at 10m resolution. It integrates Sentinel-2 time series imagery with meteorological, soil, and topographic auxiliary data while systematically revealing model collapse caused by yield distribution shifts in real-world scenarios and providing mitigation via Deep Ensembles.
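The ensemble step can be sketched as averaging per-pixel predictions from independently trained members and using their spread as an uncertainty signal. This is a simplified variant (the original Deep Ensembles formulation also has each member predict a variance); the yield values and the `ensemble_predict` helper are hypothetical.

```python
from statistics import mean, pvariance
from typing import List, Tuple


def ensemble_predict(member_preds: List[List[float]]) -> Tuple[List[float], List[float]]:
    """Per-pixel mean and variance across ensemble members.

    member_preds[m][p] is member m's yield prediction for pixel p.
    """
    per_pixel = list(zip(*member_preds))
    means = [mean(p) for p in per_pixel]
    variances = [pvariance(p) for p in per_pixel]
    return means, variances


# Hypothetical predictions (t/ha) from three members for two pixels.
member_preds = [
    [3.0, 5.0],
    [5.0, 5.0],
    [4.0, 5.0],
]
means, variances = ensemble_predict(member_preds)
print(means)
print(variances)
```

High variance at a pixel indicates disagreement among members, which is one way to flag inputs affected by distribution shift.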
ZoomEarth: Active Perception for Ultra-High-Resolution Geospatial Vision-Language Tasks: To address the deadlock where ultra-high-resolution (UHR) remote sensing images face "information redundancy when fed as a whole and loss of detail when downsampled," ZoomEarth enables a 3B VLM to mimic human behavior by surveying the global view before "zooming in" on regions of interest: the model predicts ROI boxes, crops the local patches from the original high-definition image, and re-feeds them for inspection. Trained via a two-stage SFT + GRPO process with a new "Region-Guided Reward" to alleviate the sparse IoU reward problem in UHR, it achieves zero-shot SOTA on the self-built LRS-GRO benchmark and three public UHR remote sensing benchmarks.
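The "zoom in" step amounts to mapping a predicted box back onto the full-resolution image and cropping there, so no detail is lost to downsampling. The sketch assumes the box is expressed in normalized [0, 1] coordinates as (x0, y0, x1, y1) and the image is a nested list; `crop_roi` is a hypothetical helper, not the model's actual interface.

```python
import math
from typing import List, Tuple

Image = List[List[int]]


def crop_roi(image: Image, box_norm: Tuple[float, float, float, float]) -> Image:
    """Crop a normalized (x0, y0, x1, y1) box from the full-resolution image."""
    height, width = len(image), len(image[0])
    x0, y0, x1, y1 = box_norm
    left = max(0, int(x0 * width))
    top = max(0, int(y0 * height))
    right = min(width, math.ceil(x1 * width))
    bottom = min(height, math.ceil(y1 * height))
    return [row[left:right] for row in image[top:bottom]]


# Hypothetical 4x4 "high-resolution" image with unique pixel values.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
patch = crop_roi(image, (0.5, 0.0, 1.0, 0.5))  # top-right quadrant
print(patch)
```

The cropped patch would then be fed back to the model alongside the question for a second, closer inspection.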