
✂️ Segmentation

📷 CVPR2025 · 94 paper notes


🔥 Top topics: Segmentation ×39 · Multimodal/VLM ×6 · Speech & Audio ×5 · Object Detection ×5 · Alignment/RLHF ×4

2DMamba: Efficient State Space Model for Image Representation with Applications on Giga-Pixel Whole Slide Image Classification: Proposes 2DMamba, the first native 2D selective State Space Model with an efficient parallel algorithm. By maintaining 2D spatial continuity (rather than flattening into a 1D sequence) to model inter-patch relationships in WSIs, it comprehensively outperforms 1D Mamba methods across 10 public pathology datasets, while also achieving improvements on ImageNet classification and ADE20K segmentation.
A Distractor-Aware Memory for Visual Object Tracking with SAM2: A Distractor-Aware Memory (DAM) model is proposed for SAM2.1++, splitting the memory of SAM2 into Recent Appearance Memory (RAM, ensuring segmentation accuracy) and Distractor Resolution Memory (DRM, ensuring tracking robustness). Through an introspective update strategy, DAM detects distractors and automatically stores anchor frames, setting a new SOTA on 7 benchmarks.
Assessing and Learning Alignment of Unimodal Vision and Language Models (SAIL): The SAIL framework is proposed: first, the alignment potential of unimodal vision and language models is assessed through alignment probing (discovering that k-NN clustering quality is more crucial than linear separability); second, DINOv2 and pretrained language models are efficiently aligned using a lightweight GLU alignment layer + Sigmoid loss + multi-positive sample strategy, outperforming CLIP with only 6% of its training data.
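
The sigmoid loss with multiple positives mentioned in the SAIL entry can be illustrated with a small NumPy sketch. This is not SAIL's implementation: the function name, the `positives` mask, and the `temperature` and `bias` values are assumptions chosen for illustration. It only shows how a pairwise sigmoid loss can accept several matching captions per image.

```python
import numpy as np

def sigmoid_contrastive_loss(img, txt, positives, temperature=10.0, bias=-10.0):
    """Pairwise sigmoid loss over an image-text similarity matrix.

    img: (N, D) image embeddings; txt: (M, D) text embeddings.
    positives: (N, M) boolean mask, True for matching pairs. A row may hold
    several True entries (several captions for one image).
    temperature and bias are illustrative constants, not values from the paper.
    """
    img = img / np.linalg.norm(img, axis=1, keepdims=True)
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    logits = temperature * img @ txt.T + bias
    signs = np.where(positives, 1.0, -1.0)
    # -log(sigmoid(x)) equals logaddexp(0, -x), which is numerically stable.
    per_pair = np.logaddexp(0.0, -signs * logits)
    return per_pair.sum() / img.shape[0]

# Toy data: 4 images, 2 captions each, with captions identical to their image.
img = np.eye(4)
txt = np.repeat(np.eye(4), 2, axis=0)                        # rows 2i, 2i+1 match image i
positives = np.repeat(np.eye(4, dtype=bool), 2, axis=1)      # (4, 8) mask
good = sigmoid_contrastive_loss(img, txt, positives)
bad = sigmoid_contrastive_loss(img, txt, np.roll(positives, 2, axis=1))
```

Because every image-text pair is scored independently, adding extra positives only flips entries in the mask; no per-row softmax normalization has to change.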

SAIL: Assessing and Learning Alignment of Unimodal Vision and Language Models

Audio-Visual Instance Segmentation

G2HFNet: GeoGran-Aware Hierarchical Feature Fusion Network for Salient Object Detection in Optical Remote Sensing Images: This paper proposes G2HFNet, which designs differentiated optimization strategies for features at different levels through four modules: Multi-scale Detail Enhancement (MDE), Dual-branch Geometric-Granularity Complementarity (DGC), Deep Semantic Perception (DSP), and Local-Global Guided Fusion (LGF), comprehensively outperforming SOTA on three remote sensing salient object detection datasets.
Comparative Evaluation of Traditional Methods and Deep Learning for Brain Glioma Imaging: This paper systematically reviews the performance of traditional methods and deep learning methods in MRI brain glioma segmentation and classification. Through a comprehensive comparative evaluation, it concludes that CNN architectures significantly outperform traditional techniques in segmentation accuracy and robustness.

Condensing Action Segmentation Datasets via Generative Network Inversion

Continuous Locomotive Crowd Behavior Generation: Generates continuous crowd locomotive behaviors by jointly synthesizing trajectories and actions, producing natural and diverse collective motion patterns.
COSMOS: Cross-Modality Self-Distillation for Vision Language Pre-training: COSMOS proposes a cross-modality self-distillation framework that learns fine-grained cross-modality representations in a student-teacher architecture using a text-cropping strategy and a cross-attention module. Pre-trained on only 30M data, it consistently outperforms CLIP-like baselines across zero-shot retrieval, classification, and semantic segmentation tasks, even surpassing OpenCLIP trained on billions of data points.
CrossEarth-SAR: A SAR-Centric and Billion-Scale Geospatial Foundation Model for Domain Generalizable Semantic Segmentation: This paper proposes CrossEarth-SAR, the first billion-parameter-scale SAR vision foundation model. Based on a physics-guided sparse Mixture-of-Experts (MoE) architecture, a training set containing 200K images and an evaluation framework of 22 sub-benchmarks are constructed. It achieves state-of-the-art (SOTA) performance on 20 out of 22 cross-domain semantic segmentation benchmarks.
DA-VPT: Semantic-Guided Visual Prompt Tuning for Vision Transformers: DA-VPT proposes a distribution-aware visual prompt tuning framework. By utilizing metric learning in the deep layers of ViT to construct a semantic metric space between prompts and visual/CLS tokens, it guides prompts to act as "semantic bridges" that transfer class-specific information from image patches to the CLS token. It significantly outperforms standard VPT with minimal parameters across 24 recognition tasks and 2 segmentation tasks.
DeCLIP: Decoupled Learning for Open-Vocabulary Dense Perception: DeCLIP identifies the "proxy token" phenomenon in CLIP's self-attention, which prevents image tokens from aggregating spatial correlation information. It proposes a framework that decouples the self-attention module into "content" and "context" features, optimizing them respectively through CLIP self-distillation and Vision Foundation Model (VFM) distillation. It outperforms existing methods on open-vocabulary object detection and semantic segmentation.
DefMamba: Deformable Visual State Space Model: DefMamba proposes a visual state space model based on a deformable mechanism. By dynamically adjusting the scanning path (reference point offsets + scanning order offsets) through a deformable scanning strategy, it overcomes the issue of spatial structural information loss caused by fixed scanning orders in existing Visual Mamba methods, achieving SOTA performance on ImageNet classification, COCO detection, and ADE20K segmentation.
DFormerv2: Geometry Self-Attention for RGBD Semantic Segmentation: Proposes utilizing the depth map directly as a geometric prior instead of encoding it through neural networks. It designs Geometry Self-Attention (GSA) to fuse depth distance and spatial distance into decay factors that modulate attention weights, matching or surpassing dual-encoder RGBD segmentation methods with approximately half the FLOPs.
DINOv2 Meets Text: A Unified Framework for Image- and Pixel-Level Vision-Language Alignment: This paper proposes dino.txt, which aligns the frozen DINOv2 vision encoder with a text encoder trained from scratch using the LiT strategy. It innovatively uses a concatenation of [CLS] and average-pooled patch tokens as the image representation. Combined with text-image bi-modal data curation, the approach achieves state-of-the-art results on zero-shot classification and open-vocabulary segmentation with only 50K iterations (a fraction of CLIP's training cost).
DPSeg: Dual-Prompt Cost Volume Learning for Open-Vocabulary Semantic Segmentation: DPSeg proposes leveraging both text prompts and visual prompts generated by Stable Diffusion to construct a dual-prompt cost volume for open-vocabulary semantic segmentation. Utilizing a multi-scale visual cost volume guided decoder and a two-round inference semantic refinement strategy, the method consistently outperforms existing approaches across five public datasets.
Dual-Agent Optimization framework for Cross-Domain Few-Shot Segmentation: A Dual-Agent Optimization (DATO) framework is proposed, consisting of a Consistent Mutual Aggregation (CMA) module to learn cross-domain invariant features for representation enhancement, and a Correlation Rectification Strategy (CRS) to shift support-query matching into a domain-insensitive feature space, effectively improving the generalization capability of cross-domain few-shot segmentation.
Dynamic Derivation and Elimination: Audio Visual Segmentation with Enhanced Audio Semantics: Starting from the intrinsic characteristics of audio and addressing the dual challenges of feature confusion in mixed audio and intra-class variation of different sounds from the same object, DDESeg proposes a Dynamic Derivation Module to derive independent source representations from mixed signals and enhance discriminability. It then employs a Dynamic Elimination Module to filter out irrelevant audio semantics such as off-screen voices, achieving SOTA performance on all AVS benchmarks.
EdgeTAM: On-Device Track Anything Model: Through a detailed latency analysis, EdgeTAM identifies that the bottleneck of SAM 2 lies in memory attention rather than the image encoder. To address this, it proposes a 2D Spatial Perceiver to compress frame-level memory from 64×64 dimensions to ~500 tokens (while preserving spatial structure). Coupled with a two-stage knowledge distillation pipeline, EdgeTAM achieves 16 FPS real-time Track Anything on an iPhone 15 Pro Max.
EditAR: Unified Conditional Generation with Autoregressive Models: EditAR is proposed as the first method to unify image editing (texture modification, object replacement/removal, local editing) and image translation (depth/edge/segmentation map to image) within a single autoregressive framework. By introducing conditional image token prefixing and DINOv2 distillation loss on top of LlamaGen, it achieves performance competitive with specialized models across various conditional generation tasks under the standard next-token prediction paradigm.
Effective SAM Combination for Open-Vocabulary Semantic Segmentation: This paper proposes ESC-Net, a single-stage open-vocabulary semantic segmentation model. By generating pseudo prompts from CLIP image-text correlation maps and embedding them into pre-trained SAM decoder blocks, the model efficiently leverages SAM's class-agnostic segmentation capability to enhance spatial aggregation. Coupled with a Vision-Language Fusion (VLF) module to achieve precise mask prediction, ESC-Net achieves SOTA performance on ADE20K, PASCAL-VOC, and PASCAL-Context.
Efficient RGB-D Scene Understanding via Multi-task Adaptive Learning and Cross-dimensional Feature Guidance: An efficient RGB-D multi-task scene understanding network is proposed. It accelerates inference by utilizing redundant features in an improved fusion encoder, introduces a Normalized Focus Channel Layer (NFCL) and Context Feature Interaction Layer (CFIL) for cross-dimensional feature guidance, and designs a multi-task adaptive loss function to dynamically adjust task weights, achieving SOTA performance on NYUv2/SUN RGB-D/Cityscapes.
Exploiting Temporal State Space Sharing for Video Semantic Segmentation: This work proposes the TV3S (Temporal Video State Space Sharing) framework, which leverages Mamba state space models to achieve efficient temporal information sharing across video frames. By processing spatial patches independently and incorporating a shifted window mechanism, TV3S enables highly parallelized computation. It outperforms existing Transformer and RNN methods on the VSPW and Cityscapes datasets while maintaining a superior accuracy-efficiency trade-off.
Exploring CLIP's Dense Knowledge for Weakly Supervised Semantic Segmentation: ExCEL proposes utilizing the patch-text alignment paradigm (instead of traditional image-text alignment) to mine the dense knowledge of CLIP for weakly supervised semantic segmentation. By enhancing dense alignment capabilities through Text Semantic Enrichment (TSE) and Visual Calibration (VC) modules, it substantially surpasses SOTA on PASCAL VOC and MS COCO while requiring only 3.2GB of VRAM and 6% of the training time.
Exploring Simple Open-Vocabulary Semantic Segmentation: This paper proposes S-Seg, a minimalist open-vocabulary semantic segmentation model. Without relying on CLIP pre-training, annotated masks, or customized grouping encoders, S-Seg trains a MaskFormer using only pseudo-masks (generated from DINO K-Means clustering) and image-text contrastive loss. It achieves comparable performance to complex methods on Pascal VOC, Pascal Context, and COCO, with self-training further boosting the average mIoU by 5.5%.
F-LMM: Grounding Frozen Large Multimodal Models: F-LMM freezes all parameters of off-the-shelf LMMs and trains only a lightweight CNN mask decoder to translate the inherent word-pixel correspondences in LMM attention maps into segmentation masks, achieving competitive visual grounding performance while fully preserving conversational capability.
Fine-Grained Image-Text Correspondence with Cost Aggregation for Open-Vocabulary Part Segmentation: PartCATSeg improves h-IoU by over 10% on multiple open-vocabulary part segmentation benchmarks by disentangling and aggregating object-level and part-level image-text cost volumes, introducing a compositional loss to constrain part-whole relationships, and leveraging DINO features for structural guidance.
FineCaption: Compositional Image Captioning Focusing on Wherever You Want at Any Granularity: FineCaption proposes a vision-language model supporting arbitrary mask referencing and high-resolution image inputs. By integrating a mask-aware CLIP encoder with ConvNeXT and SAM high-resolution encoders, and training on the newly constructed CompositionCap dataset, it supports compositional region captioning at multiple granularities.
Foveated Instance Segmentation: FSNet proposes an instance segmentation framework simulating the human foveated vision mechanism. By guiding non-uniform downsampling via a learnable saliency map, it maintains high-resolution details in target gaze regions while lowering resolution in peripheral areas. This achieves plug-and-play efficiency gains across various pre-trained segmentation networks.
Fractal Calibration for Long-Tailed Object Detection: Proposes FRACAL (FRActal CALibration), a training-free post-processing method that introduces fractal dimension into post-calibration for long-tailed object detection for the first time. By symmetrically calibrating the frequency axis (category frequency) and the spatial axis (category spatial uniformity), it improves rare category mask AP by up to 8.6% on the LVIS dataset, with demonstrated generalization on COCO, V3Det, and OpenImages.
Frequency Dynamic Convolution for Dense Image Prediction: FDConv redesigns dynamic convolution from a frequency domain perspective. By leveraging Fourier Disjoint Weights (FDW), it constructs frequency-diverse convolutional kernels without increasing parameters. Combined with Kernel Spatial Modulation (KSM) and Frequency Band Modulation (FBM) for fine-grained frequency adaptation, FDConv outperforms existing dynamic convolution methods requiring 65-90M extra parameters while only introducing 3.6M parameters.
Generative Video Propagation: Proposes the GenProp framework, which pairs a selective content encoder (SCE) with an I2V generative model to propagate first-frame edits to the entire video. A single model supports multiple video tasks, including video editing, object removal, object insertion, and object tracking.
GLUS: Global-Local Reasoning Unified into A Single Large Language Model for Video Segmentation: This paper proposes the GLUS framework, which unifies global understanding and local temporal consistency into a single MLLM through a frame partition strategy of "context frames (global reasoning) + query frames (local tracking)". Combined with an end-to-end trained VOS memory bank module, it significantly outperforms all MLLM-based methods on MeViS (J&F 51.3%).
Golden Cudgel Network for Real-Time Semantic Segmentation: This paper proposes GCNet, featuring the Golden Cudgel Block (GCBlock) as its core. The block self-expands during training (multi-convolution, multi-path) to enhance learning capacity, and self-contracts at inference time (reparameterized into a single \(3\times3\) convolution) to increase speed. This yields a "self-distillation" paradigm without requiring an external teacher model, outperforming existing real-time segmentation models on Cityscapes with 77.3% mIoU at 193.3 FPS.
GroupMamba: Efficient Group-Based Visual State Space Model: This paper proposes the Modulated Group Mamba layer, which divides input channels into four groups to perform unidirectional SSM scans in four distinct directions. It enhances cross-group channel interaction via Channel Affinity Modulation (CAM) and employs a distillation training objective to address instability in large models, achieving 83.3% Top-1 accuracy on ImageNet-1K with only 23M parameters.
HFP-SAM: Hierarchical Frequency Prompted SAM for Efficient Marine Animal Segmentation: HFP-SAM proposes a hierarchical frequency-prompted SAM framework. By injecting marine scene information through a Frequency Guided Adapter (FGA), automatically generating high-quality point prompts via Frequency-aware Point Selection (FPS), and decoding efficiently with Full-view Mamba (FVM), it achieves SOTA performance on four marine animal segmentation datasets.
Hierarchical Compact Clustering Attention (COCA) for Unsupervised Object-Centric Learning: COCA-Net proposes a hierarchical clustering attention layer based on physical compactness, discovering object centers via a bottom-up hierarchical merging strategy. It resolves the inherent limitations of Slot Attention—such as initialization sensitivity, the requirement of preset slot quantities, and poor background segmentation—achieving state-of-the-art performance on six unsupervised object discovery datasets.
ID-Patch: Robust ID Association for Group Photo Personalization: ID-Patch addresses the identity leakage problem in multi-identity image generation by feeding the same facial features into both an ID patch (for spatial control) and an ID embedding (for identity similarity preservation) simultaneously, comprehensively outperforming baseline models in facial similarity, ID-position association accuracy, and generation efficiency.
Image Quality Assessment: From Human to Machine Preference: This paper introduces Image Quality Assessment for Machine Vision System (IQA for MVS) for the first time, establishing the Machine Preference Database (MPD) which contains 2.25 million fine-grained annotations and 30,000 reference/distorted image pairs. Experiments demonstrate that existing HVS-centric IQA metrics fail to accurately characterize machine preferences, revealing fundamental differences between human and machine vision systems.
Learning 4D Panoptic Scene Graph Generation from Rich 2D Visual Scene: This paper proposes a 4D panoptic scene graph generation framework based on 4D-LLMs and 2D-to-4D transfer learning. By utilizing chained scene graph inference, it leverages the open-vocabulary capabilities of LLMs and transfers dimension-invariant features from abundant 2D scene annotations to 4D scenes, significantly mitigating issues of data scarcity and limited vocabulary.
LiVOS: Light Video Object Segmentation with Gated Linear Matching: Proposes LiVOS, the first lightweight VOS network to replace softmax attention with gated linear attention for memory matching. It compresses the spatio-temporal attention matrix into a constant-sized 2D state matrix, achieving constant memory consumption for videos of arbitrary length and supporting 4096p inference on a 32 GB consumer GPU.
M3-VOS: Multi-Phase, Multi-Transition, and Multi-Scenery Video Object Segmentation: This paper introduces the physical concept of "phase" to the video object segmentation task, constructing the M3-VOS benchmark containing 479 videos, 205K masks, covering 6 physical phases and 23 transition types. It also proposes a plug-and-play method, ReVOS, to improve the performance of phase-transitioning object segmentation through reverse propagation refinement.
MambaOut: Do We Really Need Mamba for Vision?: Through conceptual analysis, this paper argues that the SSM mechanism in Mamba is suited to tasks that are long-sequence and autoregressive, and that ImageNet image classification has neither characteristic. The authors therefore construct the MambaOut series (a pure Gated CNN) by removing the SSM. MambaOut outperforms the state-of-the-art vision Mamba models on image classification, supporting the claim that SSM is unnecessary for visual classification.
MambaVision: A Hybrid Mamba-Transformer Vision Backbone: NVIDIA proposes MambaVision, the first systematic study of hybrid Mamba-Transformer formulations for vision backbones. By redesigning the MambaVision Mixer and adding self-attention in the final blocks, it addresses the limitation of SSMs in capturing global context. It achieves a new Pareto front for accuracy-throughput on ImageNet-1K, while also outperforming comparable competitors in downstream detection and segmentation tasks.
MammAlps: A Multi-view Video Behavior Monitoring Dataset of Wild Mammals in the Swiss Alps: This paper introduces MammAlps, a multimodal, multi-view dataset for monitoring the behavior of wild mammals in the Swiss National Park (8.5 hours of dense annotation, 5 species, 11 activities + 19 actions). It also defines two benchmark tasks: multimodal species + hierarchical behavior recognition (B1) and the first multi-view long-term event understanding task (B2). Together these address the lack of hierarchical behavior annotation, multimodality, and multi-view coverage in wildlife video behavior analysis.
Mask-Adapter: The Devil is in the Masks for Open-Vocabulary Segmentation: Identifies a performance bottleneck of mask pooling methods in open-vocabulary segmentation: even precise masks often fail to yield accurate classification. It proposes Mask-Adapter, which extracts semantic activation maps from proposal masks and CLIP features to replace direct mask pooling, significantly improving the classification accuracy of various OVS methods in a plug-and-play manner.
MaSS13K: A Matting-level Semantic Segmentation Benchmark: This work constructs MaSS13K, a matting-level semantic segmentation dataset containing 13,348 images at 4K resolution (with mask complexity 20-50 times higher than existing datasets), and proposes the MaSSFormer model, which utilizes a dual-branch pixel decoder (global semantics + local structure) to achieve high-quality segmentation of fine boundaries in high-resolution scenarios while maintaining computational efficiency.
MatAnyone: Stable Video Matting with Consistent Memory Propagation: The MatAnyone framework is proposed, which achieves consistent propagation in memory space via a region-adaptive memory fusion mechanism (maintaining semantic stability in core regions and capturing fine alpha details in boundary regions). Together with a new dataset VM800 and a training strategy that directly supervises the matting head using segmentation data, it realizes robust and high-quality object-specified video matting.
MV-SSM: Multi-View State Space Modeling for 3D Human Pose Estimation: MV-SSM introduces State Space Models (Mamba) to multi-view 3D human pose estimation for the first time. By explicitly modeling joint spatial sequences at both the feature and keypoint levels through the Projective State Space (PSS) block, combined with Grid Token-guided Bidirectional Scanning (GTBS), it achieves 93.5 AP25 on CMU Panoptic and significantly outperforms prior SOTA methods in cross-camera and cross-scene generalization tests.
OverLoCK: An Overview-first-Look-Closely-next ConvNet with Context-Mixing Dynamic Kernels: Proposes OverLoCK, the first pure convolutional backbone network that explicitly incorporates a top-down attention mechanism. Through a deep-stage decomposition strategy (DDS) and context-mixing dynamic convolution (ContMix), it surpasses ConvNeXt-B on ImageNet-1K using only 1/3 of the FLOPs and also leads on detection and segmentation tasks.
Paint by Inpaint: Learning to Add Image Objects by Removing Them First: This work proposes the "Paint by Inpaint" framework. Leveraging the key insight that "adding objects is the inverse process of removing them," they construct the PIPE dataset containing approximately 1 million high-quality image pairs through an automated inpainting pipeline. The trained diffusion model achieves state-of-the-art (SOTA) performance on object addition and general editing tasks.
PicoSAM3: Real-Time In-Sensor Region-of-Interest Segmentation: PicoSAM3 is an ultra-lightweight promptable segmentation model with 1.3M parameters. Through implicit ROI prompt encoding, a dense CNN architecture (transformer-free), SAM3 knowledge distillation, and INT8 quantization, it achieves 65.45% mIoU on COCO and real-time inference within 11.82ms on the Sony IMX500 vision sensor.
POSTA: A Go-to Framework for Customized Artistic Poster Generation: This paper proposes POSTA, a modular artistic poster generation framework driven by diffusion models and multimodal large language models (MLLMs). It achieves highly customizable, professional-grade poster creation through three modules: background generation, layout design planning, and artistic text stylization.
Prompt-Driven Lightweight Foundation Model for Instance Segmentation-Based Fault Detection in Freight Trains: SAM FTI-FDet proposes an automatic-prompt instance segmentation framework based on a lightweight SAM. By utilizing a Transformer decoder-style prompt generator to automatically generate task-specific prompts, an adaptive feature dispatcher to fuse multi-scale features, and a TinyViT backbone to reduce computational overhead, it achieves 74.6 \(AP^{box}\) / 74.2 \(AP^{mask}\) on a freight train fault detection dataset.
RDNet: Region Proportion-Aware Dynamic Adaptive Salient Object Detection Network in Optical Remote Sensing Images: RDNet proposes a region proportion-aware dynamic adaptive salient object detection network to address the dramatic object scale variations in remote sensing images. By introducing the Dynamic Adaptive Detail-aware module (DAD, selecting combinations of different kernel sizes based on target region proportions), the Frequency-matching Context Enhancement module (FCE, performing feature interaction in the wavelet domain), and the Region Proportion-aware Localization module (RPL, cross-attention + proportion guidance), the method achieves state-of-the-art (SOTA) performance on three datasets: EORSSD, ORSSD, and ORSI-4199.
ResCLIP: Residual Attention for Training-free Dense Vision-language Inference: Discovers that cross-correlation self-attention in the intermediate layers of CLIP possesses localization properties, and proposes two plug-and-play modules: Residual Cross-correlation Self-attention (RCS) and Semantic Feedback Refinement (SFR), significantly improving CLIP's dense inference capabilities in open-vocabulary semantic segmentation.
Rethinking Query-Based Transformer for Continual Image Segmentation: This paper deeply analyzes the mechanism of emergence and extinction of built-in objectness in query-based Transformers. It proposes the SimCIS method, which consists of three modules: Query Pre-Alignment (QPA), Consistent Selection Loss (CSL), and Virtual Queries (VQ). By maintaining objectness while enhancing plasticity, SimCIS significantly outperforms state-of-the-art methods in continual panoptic segmentation and continual semantic segmentation tasks on ADE20K.
Revisiting Audio-Visual Segmentation with Vision-Centric Transformer: This paper proposes a Vision-Centric Transformer (VCT) framework to address the audio-visual segmentation task. By replacing traditional audio-derived queries with queries derived from visual features and pairing them with a Prototype Prompting Query Generation (PPQG) module, VCT achieves new state-of-the-art results on three AVSBench subsets, with particularly significant improvements on the challenging AVSS subset.
RipVIS: Rip Currents Video Instance Segmentation Benchmark for Beach Monitoring: RipVIS introduces the first large-scale rip current video instance segmentation benchmark dataset (184 videos / 210k frames) and proposes a post-processing method named Temporal Confidence Aggregation (TCA). TCA enhances the stability and recall of rip current segmentation through cross-frame confidence accumulation, providing a systematic computer vision solution for beach safety monitoring.
Robust 3D Shape Reconstruction in Zero-Shot from a Single Image in the Wild: ZeroShape-W proposes an occlusion-aware single-view 3D shape reconstruction model. It estimates the complete 3D shape (including the occluded parts) by jointly regressing the visible mask, occlusion mask, depth map, and camera intrinsics. Simultaneously, it designs a scalable synthetic data pipeline to simulate diverse foregrounds, occluders, and backgrounds. With only 194M parameters, it significantly outperforms state-of-the-art (SOTA) methods utilizing >1100M parameters on the Pix3D benchmark.
Robust Audio-Visual Segmentation via Audio-Guided Visual Convergent Alignment: This work proposes Audio-guided Modality Alignment (AMA) and Uncertainty Estimation (UE) modules to resolve incorrect association of visually similar objects and over/under-segmentation caused by frequent vocal state changes in audio-visual segmentation, achieving a 4.2% boost on AVS-Semantic.
ROCKET-1: Mastering Open-World Interaction with Visual-Temporal Context Prompting: ROCKET-1 proposes a novel communication protocol named visual-temporal context prompting. By prompting object segments on past visual observations, this protocol guides policy models to interact with the environment. Through training a segment-conditioned low-level policy and combining it with GPT-4o, Molmo, and SAM-2 to construct a hierarchical agent, ROCKET-1 achieves a 76% absolute performance gain in open-world interaction within Minecraft.
ROS-SAM: High-Quality Interactive Segmentation for Remote Sensing Moving Object: ROS-SAM adapts SAM to the high-quality interactive segmentation task of moving objects in remote sensing videos by fine-tuning the encoder via LoRA, improving the HQ decoder, and redesigning the data pipeline. This achieves a 13% IoU improvement and demonstrates strong zero-shot generalization capabilities.
RSONet: Region-guided Selective Optimization Network for RGB-T Salient Object Detection: The region-guided selective optimization network (RSONet) is proposed to address the inconsistency of salient regions between RGB and thermal images through a two-stage process (region guidance and saliency generation). It dynamically selects the modality with more accurate information to dominate subsequent fusion based on similarity scores.
SAM2-LOVE: Segment Anything Model 2 in Language-Aided Audio-Visual Scenes: SAM2-LOVE designs a multimodal fusion Transformer to compress text, audio, and visual tri-modal information into a learnable token to prompt SAM2. Combined with token propagation and accumulation strategies to enhance spatiotemporal consistency, it outperforms the state-of-the-art (EEMC) by 8.5 percentage points on the Ref-AVS benchmark with a \(\mathcal{J\&F}\) score of 58.5%.
SAMWise: Infusing Wisdom in SAM2 for Text-Driven Video Segmentation: By introducing a Cross-Modal Temporal Adapter (CMT) and a Conditional Memory Encoder (CME), SAMWISE infuses natural language understanding and explicit temporal modeling into SAM2 without fine-tuning its original weights. Operating in a streaming fashion, it achieves state-of-the-art (SOTA) performance on Referring Video Object Segmentation (RVOS) with less than 5M additional parameters.
SAP: Segment Any 4K Panorama: This work reformulates 360° panoramic segmentation as a perspective video segmentation problem. By decomposing the panorama into a sequence of overlapping patches along a zigzag trajectory and fine-tuning the memory module of SAM2, combined with large-scale training on 183K synthetic 4K panoramas, it achieves a zero-shot panoramic segmentation improvement of +17.2 mIoU.
Scale Efficient Training for Large Datasets: Proposes SeTa (Scale Efficient Training), a loss-based dynamic sample pruning framework. Through a three-step strategy consisting of random sampling for de-redundancy, loss clustering for difficulty division, and sliding window progressive curriculum learning, it achieves up to 50% reduction in training costs without performance loss across 11 datasets, 10 task categories, and 14 models.
Scene-Centric Unsupervised Panoptic Segmentation: CUPS is the first unsupervised panoptic segmentation method trained directly on scene-centric images (such as autonomous driving scenarios). By fusing self-supervised visual features, stereo depth, and optical flow motion cues to generate high-quality pseudo-labels, it outperforms the previous SOTA, U2Seg, by 9.4% PQ on Cityscapes.
GleSAM: Segment Any-Quality Images with Generative Latent Space Enhancement: GleSAM introduces the denoising capability of pre-trained Latent Diffusion Models (LDMs) into the latent space of SAM. It enhances the feature representation of low-quality images through single-step denoising, achieving robust segmentation for images of any quality.
Segment Any Motion in Videos: This paper proposes a moving object segmentation method that combines long-range point trajectory motion cues, DINO semantic features, and SAM2 pixel-level mask densification. By employing spatio-temporal trajectory attention and motion-semantic decoupled embedding, it significantly outperforms traditional optical flow-based methods on multiple benchmarks, particularly in fine-grained multi-object segmentation scenarios.
Semantic Library Adaptation: LoRA Retrieval and Fusion for Open-Vocabulary Semantic Segmentation: SemLA proposes a training-free test-time domain adaptation framework. By building a LoRA adapter library indexed by CLIP, it dynamically retrieves and fuses the most relevant adapters at inference based on the embedding distance between the input image and domain centroids. This achieves on-the-fly and highly efficient domain adaptation for open-vocabulary semantic segmentation models.
SGMA: Semantic-Guided Modality-Aware Segmentation for Remote Sensing with Incomplete Multimodal Data: This paper proposes the SGMA framework to address three major challenges in incomplete multimodal remote sensing segmentation: modality imbalance, intra-class variations, and cross-modality heterogeneity. Specifically, a Semantic-Guided Fusion (SGF) module constructs global semantic prototypes to estimate modality robustness for adaptively weighted fusion, while a Modality-Aware Sampling (MAS) module dynamically prioritizes training fragile modalities.
ShiftwiseConv: Small Convolutional Kernel with Large Kernel Effect: This paper reveals that the effectiveness of large kernel convolutions can be decoupled into two factors: "feature extraction at a specific granularity" and "multi-path feature fusion." Based on this insight, the authors propose ShiftwiseConv (SW Conv), a plug-and-play CNN module that combines standard \(3 \times 3\) convolutions with spatial shift operations and multi-path connections to simulate the effect of large kernels. SW Conv outperforms large-kernel CNNs such as SLaK and UniRepLKNet, as well as various Transformer architectures, on classification, detection, and segmentation.
Show and Tell: Visually Explainable Deep Neural Nets via Spatially-Aware Concept Bottleneck Models: SALF-CBM is proposed to convert any vision network into a spatially-aware concept bottleneck model. By using CLIP visual prompting to generate spatialized concept maps, it provides dual explanations of "where" (heatmaps) and "what" (concepts), while reaching an ImageNet accuracy that even surpasses that of the original backbone.
SketchFusion: Learning Universal Sketch Features through Fusing Foundation Models: Introduces SketchFusion, which dynamically injects CLIP visual features into the denoising process of Stable Diffusion to compensate for SD's high-frequency bias and sketch feature deficiencies. Combined with adaptive multi-scale feature aggregation, it establishes the first universal sketch feature representation in the foundation model era, achieving state-of-the-art (SOTA) performance across four tasks: retrieval, recognition, segmentation, and correspondence learning.
SmartEraser: Remove Anything from Images using Masked-Region Guidance: SmartEraser proposes a new paradigm called Masked-Region Guidance, which retains the masked region as a guide instead of discarding it. Combined with the million-scale synthetic Syn4Removal dataset, it significantly outperforms existing mask-and-inpaint methods on object removal tasks.
Soft Self-Labeling and Potts Relaxations for Weakly-Supervised Segmentation: This paper proposes a soft pseudo-label-based self-labeling method. By systematically evaluating multiple formulations of Potts relaxations and cross-entropy variants, it achieves segmentation performance close to or even exceeding fully supervised levels on standard network architectures using only scribble (3% of pixels) supervision, without any network structure modifications.
Spatio-Semantic Expert Routing Architecture with Mixture-of-Experts for Referring Image Segmentation: Proposes the SERA framework, which introduces lightweight expression-aware Mixture-of-Experts (MoE) refinement into pre-trained vision-language models. It performs expert routing at both the backbone level (SERA-Adapter) and the fusion level (SERA-Fusion), achieving state-of-the-art (SOTA) performance on referring image segmentation benchmarks while updating less than 1% of the parameters.
StoryGPT-V: Large Language Models as Consistent Story Visualizers: This paper proposes StoryGPT-V, which achieves accurate, high-quality, and temporally consistent character image generation in story visualization with low memory overhead through a two-stage training scheme: first training a Character-Aware Latent Diffusion Model (Char-LDM) for high-quality character generation, and then aligning LLM output with the input space of Char-LDM to achieve anaphora resolution and contextual consistency.
Style-Editor: Text-driven Object-Centric Style Editing: This paper proposes Style-Editor, which utilizes patch-level directional loss and adaptive background preservation loss in the CLIP space to achieve precise style editing of target objects using only text descriptions, without requiring segmentation masks or reference images.
Task-driven Image Fusion with Learnable Fusion Loss: This paper proposes TDFusion, which trains a loss generation module via meta-learning to adaptively adjust the fusion loss function based on downstream tasks (semantic segmentation or object detection), thereby achieving optimal performance for infrared-visible fusion images on downstream tasks.
The Devil is in Low-Level Features for Cross-Domain Few-Shot Segmentation: This paper deeply analyzes the phenomenon in CDFSS (Cross-Domain Few-Shot Segmentation) where "performance peaks in early training and then drops sharply". It finds that the culprit is the vulnerability of low-level features to domain shift, which leads to a sharp loss landscape. Consequently, two plug-and-play modules are proposed: LEM (for sharpness-aware minimization of low-level features via random convolution + FFT during training) and LCM (for directly calibrating segmentation results using low-level query features during testing), outperforming SOTA by an average of 3.71%/5.34% mIoU on four target domains.
The Devil is in Temporal Token: High Quality Video Reasoning Segmentation: VRS-HQ proposes hierarchical temporal token encoding (frame-level <SEG> + video-level <TAK>) and a token-driven keyframe selection strategy. Incorporating SAM2, it achieves end-to-end video reasoning segmentation, outperforming VISA by 9.1% on ReVOS.
The Power of Context: How Multimodality Improves Image Super-Resolution: Proposes MMSR, a diffusion-based super-resolution method that integrates multimodal information including depth, semantic segmentation, edge, and textual description, effectively suppressing hallucinations and improving SR quality through a Multimodal Latent Connector and multimodal CFG.
Token CropR: Faster ViTs for Quite a Few Tasks: Proposes Token CropR (Cropr), a cross-attention-based ViT token pruning method that learns end-to-end to select tokens by task relevance via auxiliary prediction heads. During inference, these auxiliary heads are discarded, so throughput is close to that of a random pruner, yielding \(1.5\text{-}4\times\) speedups with minimal performance loss across classification, semantic segmentation, object detection, and instance segmentation.
Towards Generalizable Scene Change Detection: Proposes GeSCF, the first zero-shot scene change detection framework, which leverages internal features of SAM to achieve cross-domain generalization and temporally consistent change mask generation, while defining a generalized SCD benchmark.
Uni4D: Unifying Visual Foundation Models for 4D Modeling from a Single Video: Uni4D proposes a multi-stage optimization framework that unifies multiple pre-trained visual foundation models (depth estimation, point tracking, segmentation, etc.) into an energy minimization problem. Without requiring retraining or fine-tuning, it jointly recovers camera poses, static/dynamic 3D geometry, and dense 3D motion trajectories from casual monocular videos, achieving state-of-the-art performance on multiple dynamic scene datasets.
Universal Domain Adaptation for Semantic Segmentation: This work introduces the Universal Domain Adaptation for Semantic Segmentation (UniDA-SS) task and proposes the UniMAP framework. By leveraging two core components—Domain-Specific Prototype Differentiation (DSPD) and Target-guided Image Matching (TIM)—UniMAP achieves effective adaptation from synthetic to real-world data without prior knowledge of category configurations, significantly outperforming existing UDA-SS methods.
Using Diffusion Priors for Video Amodal Segmentation: This paper reformulates video amodal segmentation as a conditional generation task, leveraging the shape priors of a pretrained video diffusion model (Stable Video Diffusion). Conditioning on modal masks and pseudo-depth maps, it achieves completion in occluded areas with an improvement of up to 13% mIoU, and realizes video-level amodal content completion for the first time.
v-CLR: View-Consistent Learning for Open-World Instance Segmentation: v-CLR proposes a view-consistent learning framework. By transforming natural images into appearance-invariant views such as depth maps and stylized images, enforcing cross-view query feature consistency within a DETR architecture, and utilizing unsupervised object proposals to guide the matching direction, the framework effectively overcomes the texture bias of detection networks. It achieves state-of-the-art performance on multiple open-world segmentation benchmarks.
Visual Consensus Prompting for Co-Salient Object Detection: This paper is the first to introduce the parameter-efficient prompt learning paradigm into the Co-Salient Object Detection (CoSOD) task. It proposes Visual Consensus Prompting (VCP), which embeds the processes of consensus extraction and dispersion into learnable prompts. With the foundation model frozen, it outperforms 13 fully fine-tuned methods using extremely few trainable parameters.
Your ViT is Secretly an Image Segmentation Model: This paper proposes the Encoder-only Mask Transformer (EoMT), demonstrating that with large-scale pre-training and sufficiently large models, a plain ViT can achieve high-quality image segmentation without task-specific components such as CNN adapters, pixel decoders, and Transformer decoders, while being up to 4x faster.
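
The retrieve-and-fuse idea described in the Semantic Library Adaptation (SemLA) entry above can be illustrated with a small sketch. This is not the authors' implementation: the function name, the Euclidean distance, the softmax weighting over negative distances, and the use of plain weight-delta matrices in place of real LoRA factors are all assumptions made for illustration.

```python
import math


def retrieve_and_fuse(query, centroids, adapters, top_k=2, temperature=1.0):
    """Pick the top_k adapters whose domain centroid is closest to the
    query embedding and blend their weight deltas.

    query:     list of floats (hypothetical image embedding)
    centroids: dict name -> list of floats (hypothetical domain centroids)
    adapters:  dict name -> 2D list of floats (hypothetical weight deltas)
    """
    # Distance from the query embedding to every domain centroid.
    dists = {name: math.dist(query, c) for name, c in centroids.items()}

    # Retrieval step: keep the top_k nearest domains.
    nearest = sorted(dists, key=dists.get)[:top_k]

    # Assumed weighting rule: softmax over negative distances,
    # so closer domains contribute more.
    logits = [-dists[name] / temperature for name in nearest]
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    weights = {name: e / total for name, e in zip(nearest, exps)}

    # Fusion step: weighted sum of the retrieved weight deltas.
    rows = len(adapters[nearest[0]])
    cols = len(adapters[nearest[0]][0])
    fused = [
        [sum(weights[n] * adapters[n][i][j] for n in nearest) for j in range(cols)]
        for i in range(rows)
    ]
    return weights, fused


# Toy library with three hypothetical domains.
centroids = {"a": [0.0, 0.0], "b": [3.0, 4.0], "c": [10.0, 10.0]}
adapters = {
    "a": [[1.0, 0.0], [0.0, 1.0]],
    "b": [[0.0, 0.0], [0.0, 0.0]],
    "c": [[5.0, 5.0], [5.0, 5.0]],
}
weights, fused = retrieve_and_fuse([0.0, 0.0], centroids, adapters, top_k=2)
```

In this toy setup the query coincides with the centroid of domain `a`, so `a` receives most of the weight and the distant domain `c` is never retrieved. Because nothing is trained at inference time, the cost per image is a distance computation plus a weighted sum.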