🛰️ Remote Sensing

📷 CVPR2026 · 19 paper notes

ACPV-Net: All-Class Polygonal Vectorization for Seamless Vector Map Generation from Aerial Imagery

ACPV-Net is the first framework that generates topologically consistent all-class polygonal vector maps from aerial imagery in a single pass, employing a semantically supervised conditional diffusion model for vertex heatmap generation and proposition-driven PSLG reconstruction to ensure zero gaps and zero overlaps.

AVION: Aerial Vision-Language Instruction from Offline Teacher to Prompt-Tuned Network

AVION proposes a knowledge distillation framework that generates semantically rich text prototypes via LLMs and employs visual-textual dual-side prompt tuning with tri-aspect alignment distillation, addressing semantic poverty and visual rigidity in remote sensing VLM adaptation and comprehensively surpassing SOTA on few-shot classification, base-to-novel generalization, and cross-modal retrieval.

AVION: Aerial Vision-Language Instruction from Offline Teacher to Prompt-Tuned Network

AVION proposes a knowledge distillation framework that uses LLM-generated semantically rich remote sensing text prototypes as teacher supervision while injecting learnable prompts into both the visual and text encoders of the student, achieving tri-aspect alignment distillation that significantly outperforms existing PEFT methods on few-shot classification and cross-modal retrieval.

Conflated Inverse Modeling for Urban Vegetation Patterns

A framework conflating a forward prediction model with a diffusion-based inverse generative model to produce diverse yet physically plausible urban vegetation spatial configurations (NDVI patterns) under specified temperature change targets, achieving 3.4× diversity improvement while reducing temperature control error by 37%.

Cross-modal Fuzzy Alignment Network for Text-Aerial Person Retrieval and A Large-scale Benchmark

A cross-modal fuzzy alignment network (CFAN) that uses fuzzy logic to quantify token-level reliability for fine-grained alignment and introduces ground-view bridging to narrow the semantic gap between aerial images and text descriptions, released alongside AERI-PEDES, a large-scale text-aerial person retrieval benchmark.

Exploring Spatiotemporal Feature Propagation for Video-Level Compressive Spectral Reconstruction

The first work to advance spectral compressed imaging (SCI) from image-level to video-level reconstruction, introducing the first high-quality dynamic hyperspectral dataset DynaSpec (30 sequences / 300 frames), and proposing PG-SVRT with spatial-then-temporal attention plus bridge tokens that achieves 41.52 dB PSNR with optimal temporal consistency at lower FLOPs (28.18G) than several image-level SOTAs.

GeoFlow: Real-Time Fine-Grained Cross-View Geolocalization via Iterative Flow Prediction

GeoFlow is a lightweight flow-matching-inspired framework for fine-grained cross-view geolocalization (FG-CVG). It learns probabilistic displacement fields combined with an iterative refinement sampling (IRS) algorithm to achieve precise 2-DoF localization from ground to satellite images in continuous space, reaching SOTA-competitive accuracy at 29 FPS real-time speed.

GeoFlow: Real-Time Fine-Grained Cross-View Geolocalization via Iterative Flow Prediction

GeoFlow reformulates fine-grained cross-view geolocalization (FG-CVG) as probabilistic displacement regression: the model learns displacement fields (probability distributions over distance and direction) from arbitrary hypothesis positions to the true location. An iterative refinement sampling (IRS) algorithm then moves multiple random hypotheses from different starting points toward a consensus position, achieving 29 FPS real-time inference with 7.8× fewer parameters and 4× less computation while maintaining competitive localization accuracy.
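
The refinement loop described above can be illustrated with a toy sketch. It is not GeoFlow's implementation: `toy_displacement` is a hypothetical, deterministic stand-in for the learned probabilistic displacement field, and the hypothesis count, step size, and median-based consensus are assumptions chosen only for illustration.

```python
import random
from statistics import median


def toy_displacement(hyp, target, step=0.5):
    """Hypothetical stand-in for the learned displacement field.

    Returns a move covering `step` of the remaining offset to `target`.
    A real model would predict this from image features, not from the target.
    """
    return (step * (target[0] - hyp[0]), step * (target[1] - hyp[1]))


def iterative_refinement(target, n_hypotheses=8, n_iters=6, extent=100.0, seed=0):
    """Move random 2-DoF hypotheses toward a consensus position."""
    rng = random.Random(seed)
    # Start from random positions spread over the (assumed) search area.
    hyps = [(rng.uniform(0, extent), rng.uniform(0, extent))
            for _ in range(n_hypotheses)]
    for _ in range(n_iters):
        refined = []
        for x, y in hyps:
            dx, dy = toy_displacement((x, y), target)
            refined.append((x + dx, y + dy))
        hyps = refined
    # Consensus here is the per-axis median (an assumption of this sketch).
    return (median(x for x, _ in hyps), median(y for _, y in hyps))


estimate = iterative_refinement(target=(40.0, 70.0))
```

With this toy field each iteration halves the remaining offset, so six iterations bring every hypothesis to within about two units of the target regardless of where it started.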

GeoMMBench and GeoMMAgent: Toward Expert-Level Multimodal Intelligence in Geoscience and Remote Sensing

This work proposes GeoMMBench (1,053 expert-level geoscience multiple-choice questions) and GeoMMAgent (a retrieval-perception-reasoning multi-agent framework), systematically evaluating 36 MLLMs in the remote sensing domain and revealing systematic deficiencies in domain knowledge, perceptual grounding, and reasoning capabilities.

Joint and Streamwise Distributed MIMO Satellite Communications with Multi-Antenna Ground Users

This paper studies downlink transmission from multiple LEO satellites jointly serving multi-antenna ground users. Two non-coherent transmission modes are proposed, joint transmission and streamwise transmission, with precoders designed under the WMMSE framework and stream-to-satellite association solved via the Hungarian algorithm, achieving near-optimal spectral efficiency while substantially reducing fronthaul overhead.

Joint and Streamwise Distributed MIMO Satellite Communications with Multi-Antenna Ground Users

Two downlink transmission schemes (joint transmission & streamwise transmission) are proposed for distributed LEO satellite systems serving multi-antenna ground users. Through WMMSE precoding design based on statistical CSI and a stream-satellite association strategy based on the Hungarian algorithm, the proposed framework achieves a flexible trade-off between high spectral efficiency and low fronthaul overhead without requiring inter-satellite phase synchronization.
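
To make the stream-satellite association step concrete, here is a minimal sketch of the underlying assignment problem. It assumes a one-to-one mapping between streams and satellites and a hypothetical `gain` matrix, neither of which is taken from the paper. It also swaps in an exhaustive search over permutations rather than the Hungarian algorithm. The Hungarian algorithm reaches the same optimum in polynomial time, which is what makes it practical beyond tiny examples.

```python
from itertools import permutations


def best_association(gain):
    """Pick the one-to-one stream-to-satellite mapping with maximum total gain.

    gain[s][k] is a hypothetical utility of serving stream s from satellite k.
    Exhaustive search: fine for a 3x3 illustration, factorial cost in general.
    """
    n = len(gain)
    best_perm, best_total = None, float("-inf")
    for perm in permutations(range(n)):
        total = sum(gain[s][perm[s]] for s in range(n))
        if total > best_total:
            best_perm, best_total = perm, total
    return list(best_perm), best_total


# Illustrative values only; not data from the paper.
gain = [
    [4.0, 1.0, 3.0],
    [2.0, 0.0, 5.0],
    [3.0, 2.0, 2.0],
]
assignment, total = best_association(gain)
# assignment[s] is the satellite index chosen for stream s.
```

In this example the greedy choice for stream 2 (satellite 0, gain 3.0) is given up so that stream 0 can keep satellite 0, which is the kind of global trade-off an assignment solver handles and a per-stream greedy rule does not.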

Lumosaic: Hyperspectral Video via Active Illumination and Coded-Exposure Pixels

This paper presents Lumosaic, an active hyperspectral video system that synchronizes an array of 12 narrowband LEDs with a coded-exposure pixel (CEP) camera at microsecond precision. Within 158 sub-frames per video frame, the system jointly encodes spatial, temporal, and spectral information, achieving motion-robust hyperspectral video reconstruction at 30 fps, VGA resolution, and 31 spectral channels (400–700 nm), with PSNR exceeding passive snapshot systems by more than 10 dB.
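
For readers less familiar with the metric, the sketch below shows what a 10 dB PSNR gap means under the standard definition, PSNR = 10·log10(peak² / MSE). The MSE values are illustrative assumptions, not measurements from the paper.

```python
import math


def psnr(mse, peak=1.0):
    """Peak signal-to-noise ratio in dB for a given mean squared error."""
    return 10.0 * math.log10(peak ** 2 / mse)


# Illustrative MSE values only: a tenfold drop in MSE gives a 10 dB gain.
gap = psnr(1e-4) - psnr(1e-3)
```

Under this definition, every 10 dB of PSNR corresponds to a tenfold reduction in mean squared reconstruction error.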

MetaSpectra+: A Compact Broadband Metasurface Camera for Snapshot Hyperspectral+ Imaging

This paper presents MetaSpectra+, the first multifunctional metasurface imaging system operating across the full visible spectrum (250 nm bandwidth). Through a dual-layer metasurface design enabling beam splitting and precise dispersion control, the system acquires a hyperspectral data cube together with HDR/polarization images in a single snapshot, achieving 33.31 dB PSNR on benchmark datasets with a total track length (TTL) of only 17 mm.

MetaSpectra+: A Compact Broadband Metasurface Camera for Snapshot Hyperspectral+ Imaging

MetaSpectra+ proposes a metasurface–refractive hybrid optical paradigm that employs a dual-layer metasurface to independently control the dispersion, exposure, and polarization of four channels, enabling snapshot hyperspectral+HDR/polarization multi-functional imaging over a 250 nm bandwidth with a minimum total track length (TTL) of 17 mm. On the KAUST benchmark, it achieves a PSNR of 33.31 dB, comprehensively surpassing existing snapshot hyperspectral systems.

No Labels, No Look-Ahead: Unsupervised Online Video Stabilization with Classical Priors

This paper proposes LightStab, an unsupervised online video stabilization framework built upon the classical three-stage pipeline (motion estimation → motion propagation → motion compensation) augmented with multi-threaded asynchronous buffering. LightStab is the first online method to comprehensively match offline SOTA across 5 benchmark datasets, and introduces UAV-Test, the first multimodal UAV aerial stabilization benchmark covering both visible-light and infrared imagery.

Olbedo: An Albedo and Shading Aerial Dataset for Large-Scale Outdoor Environments

Olbedo introduces the first large-scale real-world aerial albedo–shading decomposition dataset (5,664 UAV images, 4 terrain types, multi-year multi-illumination conditions). A physics-based inverse rendering pipeline generates multi-view-consistent pseudo-ground-truth annotations. Results demonstrate that synthetic pre-training combined with Olbedo LoRA fine-tuning substantially improves outdoor albedo prediction and supports downstream applications including relighting, material editing, and scene change analysis.

Are Pretrained Image Matchers Good Enough for SAR-Optical Satellite Registration?

This paper evaluates 24 families of pretrained image matchers on SAR-optical satellite registration under a zero-shot setting, finding that deployment protocol choices (geometric model, tile size, etc.) can affect accuracy by up to 33×, sometimes surpassing the effect of switching the matcher itself.

RHO: Robust Holistic OSM-Based Metric Cross-View Geo-Localization

This paper introduces CV-RHO, the first OSM-based metric cross-view geo-localization benchmark targeting adverse weather and sensor noise (2.72M+ images), and proposes RHO, a dual-branch Pin-Pan architecture integrating panoramic undistortion (SUM) and position-orientation fusion (POF) mechanisms, achieving up to 20% localization improvement under diverse degradation conditions.

SDF-Net: Structure-Aware Disentangled Feature Learning for Optical-SAR Ship Re-identification

This paper proposes SDF-Net, a physics-guided structure-aware disentangled feature learning network that enforces cross-modal geometric consistency via intermediate-layer gradient energy (SCL) and decouples shared/modality-specific features at the terminal layer (DFL) with parameter-free additive fusion, achieving 60.9% mAP (+3.5% vs. SOTA TransOSS) on HOSS-ReID.