
🛰️ Remote Sensing

📷 CVPR2025 · 11 paper notes


🔥 Top topics: Remote Sensing ×4

Dense Dispersed Structured Light for Hyperspectral 3D Imaging of Dynamic Scenes: This paper proposes the Dense Dispersed Structured Light (DDSL) method, which utilizes an inexpensive diffraction grating film (<$20), a stereo RGB camera, and an RGB projector. By designing spectrally multiplexed DDSL patterns, the required number of projection frames is significantly reduced, achieving real-time hyperspectral 3D imaging at 6.6 fps with a spectral resolution of 15.5 nm FWHM and a depth error of 4 mm.
DiSciPLE: Learning Interpretable Programs for Scientific Visual Discovery: DiSciPLE uses an LLM-guided evolutionary algorithm to automatically synthesize interpretable Python programs for visual data analysis. It achieves SOTA on scientific tasks such as population density estimation, reducing error by 35% compared to recent baselines while remaining fully interpretable.
EarthDial: Turning Multi-sensory Earth Observations to Interactive Dialogues: This work proposes EarthDial, a conversational vision-language model tailored for Earth Observation (EO) data. It provides unified understanding of multispectral (SAR/NIR/infrared), multi-temporal, and multi-resolution remote sensing imagery. Trained on an 11.11-million-sample instruction-tuning dataset, it outperforms existing remote sensing VLMs across 44 downstream datasets.
Hierarchical Dual-Change Collaborative Learning for UAV Scene Change Captioning: This paper proposes a new task named UAV Scene Change Captioning (UAV-SCC) and a novel HDC-CL framework. It models the overlapping and non-overlapping regions of image pairs under moving viewpoints using a Dynamic Adaptive Layout Transformer, enhances viewpoint shift direction awareness via hierarchical cross-modal directional consistency calibration, and constructs a dedicated benchmark dataset.
Joint and Streamwise Distributed MIMO Satellite Communications with Multi-Antenna Ground Users: This paper investigates distributed MIMO downlink communications where multiple LEO satellites jointly serve multi-antenna ground users. Two modes, namely joint transmission and streamwise transmission, are proposed. The former optimizes the precoder using WMMSE iterations to maximize the sum spectral efficiency, while the latter employs a Hungarian algorithm-based stream-satellite association to reduce the fronthaul overhead, achieving a flexible trade-off between performance and the fronthaul signaling load.
Learning Occlusion-Robust Vision Transformers for Real-Time UAV Tracking: The ORTrack framework learns occlusion-robust ViT feature representations through random masking based on spatial Cox processes; the mask constraints are imposed only during training, so inference incurs zero overhead. An adaptive feature distillation method compresses large models into a lightweight student model, ORTrack-D, which balances state-of-the-art accuracy with real-time speed across several UAV tracking benchmarks.
Meta-Learning Hyperparameters for Parameter Efficient Fine-Tuning: MetaPEFT proposes a meta-learning framework that unifies discrete position selection and continuous scaling factors in PEFT into differentiable modulators. Through bi-level optimization, it automatically searches for the optimal PEFT hyperparameter configuration, achieving SOTA on remote sensing and natural image long-tailed distribution adaptation tasks.
MetaSpectra+: A Compact Broadband Metasurface Camera for Snapshot Hyperspectral+ Imaging: This paper proposes MetaSpectra+, a compact, multi-functional camera based on hybrid metasurface-refractive optics. Double-layer metasurfaces independently control dispersion, exposure, and polarization for each channel, enabling snapshot hyperspectral+HDR or hyperspectral+polarization joint imaging within a ~250 nm visible bandwidth and achieving SOTA reconstruction accuracy on benchmark datasets.
MFogHub: Bridging Multi-Regional and Multi-Satellite Data for Global Marine Fog Detection and Forecasting: MFogHub constructs the first multi-regional (15 coastal regions) and multi-satellite (6 geostationary satellites) global marine fog detection and forecasting dataset, containing over 68,000 high-resolution samples and 11,600+ pixel-level annotations. Extensive experiments on 16 baseline models reveal the influence of regional differences and satellite variations on model generalization.
SGFormer: Satellite-Ground Fusion for 3D Semantic Scene Completion: This work introduces satellite imagery to the 3D Semantic Scene Completion (SSC) task for the first time, proposing a dual-branch framework named SGFormer. By utilizing ground-view guided satellite feature correction and adaptive fusion strategies, it effectively addresses the scene incompleteness issue caused by visual occlusions.
Think and Answer ME: Benchmarking and Exploring Multi-Entity Reasoning Grounding in Remote Sensing: This work constructs ME-RSRG, a remote sensing multi-entity reasoning grounding benchmark (the first remote sensing grounding dataset explicitly annotated with subject-object roles), and proposes the Entity-Aware Reasoning (EAR) framework. EAR combines an SFT cold start with GRPO optimization driven by entity-aware rewards to produce structured reasoning chains and jointly localize subject and object. After EAR optimization, the Qwen2.5-VL series gains over 10% in [email protected].
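
For readers unfamiliar with GRPO, mentioned in the last entry, the sketch below shows its group-relative advantage step: several responses are sampled for one query, each receives a scalar reward, and each reward is normalized against the mean and standard deviation of its own group. This is a minimal illustration, not the EAR implementation. The function name, the reward values, the `eps` term, and the use of the population standard deviation are all assumptions; the summary does not specify how the entity-aware reward is computed.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sampling group.

    Hypothetical helper: returns (r - group mean) / (group std + eps)
    for every reward r in the group. eps avoids division by zero when
    all responses receive the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four responses sampled for the same query.
rewards = [0.2, 0.9, 0.5, 0.4]
advantages = group_relative_advantages(rewards)

for r, a in zip(rewards, advantages):
    print(f"reward={r:.2f}  advantage={a:+.3f}")
```

Responses that score above their group's average get a positive advantage and are reinforced; those below get a negative one. Because the baseline comes from the group itself, no separate value network is needed.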