✂️ Segmentation
📷 CVPR2026 · 117 paper notes
📌 Same area in other venues: 🔬 ICLR2026 (32) · 🧪 ICML2026 (14) · 🤖 AAAI2026 (29) · 🧠 NeurIPS2025 (45) · 📹 ICCV2025 (73)
🔥 Top topics: Segmentation ×82 · Remote Sensing ×7 · Object Detection ×6 · Diffusion Models ×4 · Few-/Zero-Shot Learning ×4
- 3M-TI: High-Quality Mobile Thermal Imaging via Calibration-free Multi-Camera Cross-Modal Diffusion
-
This paper proposes 3M-TI, a calibration-free multi-camera cross-modal diffusion framework. It automatically aligns and fuses uncalibrated RGB-thermal image pairs in the VAE latent space via Cross-modal Self-Attention (CSM). Combined with a misalignment augmentation strategy, it achieves SOTA on mobile thermal super-resolution tasks and significantly improves downstream object detection and semantic segmentation performance.
- A Mixed Diet Makes DINO An Omnivorous Vision Encoder
-
An Omnivorous Vision Encoder is proposed, which performs cross-modal alignment distillation training (RGB/Depth/Segmentation) on top of a frozen DINOv2 via a lightweight adapter. This enables a single encoder to produce consistent embeddings for diverse visual modalities while preserving original discriminative semantics.
- Annotation-Efficient Coreset Selection for Context-dependent Segmentation
-
Focusing on the extremely high annotation cost in "context-dependent" segmentation tasks like camouflaged objects and medical lesions, this paper assigns an "importance score" to each image via point-annotation-based Optimal Transport. A Max-Distance Entropy strategy is then used to select a coreset (CostSet) that balances coverage and diversity. At a 40% pruning rate, it only loses approximately 1% IoU compared to full training.
- Bayesian Decomposition and Semantic Completion for Few-shot Semantic Segmentation
-
The authors decompose Few-shot Semantic Segmentation (FSS) into three lightweight probabilistic terms—Prior, Likelihood, and Class Consistency—using the Bayesian formula. The method utilizes SAM to generate structured candidate regions, a small binary classification network (CALM) to estimate likelihood and consistency simultaneously, and a Semantic Completion Module (SCM) to merge regional fragments into a complete mask. It achieves SOTA performance on PASCAL-5\(^i\) and COCO-20\(^i\) with high efficiency.
- Beyond Appearance: Camouflaged Object Detection via Geometric Structure
-
DepthSAM adapts the monocular depth estimation (MDE) foundation model, Depth Anything v2, for camouflaged object detection. By freezing the backbone and injecting Sparse Mixture-of-Experts Adapters (SMEA), it pivots the task from "reconstructing the entire scene geometry" to "highlighting camouflaged object geometry." A Geometric-Semantic Fusion Module (GSFM) is then used to align geometric cues with semantic information, achieving new SOTA results on COD10K, CAMO, and NC4K benchmarks (surpassing the runner-up by 3.0% \(S_\alpha\) and 4.3% \(F^\omega_\beta\) on COD10K).
- Beyond Text: Visual Description Assembly by Probabilistic Model for CLIP-based Weakly Supervised Semantic Segmentation
-
To address the issues of "modality gap between text prototypes and visual features" and "static text failing to adapt to diverse instances" in CLIP-based weakly supervised segmentation, this paper uses an Invertible Neural Network to model CLIP visual features as a Hierarchical Gaussian Mixture Model (H-GMM). It explicitly decouples intra-class attributes in the visual space, dynamically assembles them into visual description prototypes based on instance responses to replace text queries, and adaptively reverts to text anchors using density weights. It achieves new SOTAs of 79.9%/51.4% mIoU on VOC/COCO for single-stage WSSS.
- BiPA: Bilevel Prompt Adaptation for Underwater Instance Segmentation
-
BiPA reformulates SAM's dense prompt learning as a bilevel optimization problem with "prompts at the upper level and model parameters at the lower level." It employs Bayesian optimization and a two-stage training strategy to make the problem solvable, combined with a Foreground Attention Injection (FAI) module to restore local details. This efficiently transfers the general SAM to severely degraded underwater scenes, achieving mAP scores that comprehensively surpass previous SOTAs on UIIS and USIS10K datasets.
- AFRO: Bootstrap Dynamic-Aware 3D Visual Representation for Scalable Robot Learning
-
This paper proposes AFRO, a self-supervised 3D visual pre-training framework. By employing an Inverse Dynamics Model (IDM) to infer latent actions, a Forward Dynamics Model (FDM) based on Diffusion Transformers to predict future features, and an inverse consistency constraint to ensure temporal symmetry, the method achieves an average success rate of 76.0% on MetaWorld 14 tasks after pre-training on the large-scale RH20T dataset (vs. 64.9% for DynaMo-3D and 63.9% for PointMAE). It also achieves state-of-the-art results on four real-world tasks.
- Bootstrap Your Own AV-Proxies: Adaptive Contrastive and Prototype Learning for Audio-Visual Segmentation
-
Addressing the "intra-modal noise + audio-visual semantic gap" in audio-visual segmentation (AVS), this paper proposes BYOAVP. It utilizes BYOL-style negative-free contrastive learning (SSAE) to allow high-level visual semantics to supervise audio, suppressing off-screen/background noise. Additionally, it employs momentum-updated dynamic prototypes (DPC) for pixel-level classification and cross-modal reinforcement of sounding regions. Without any priors like SAM or offline prototypes, it achieves SOTA performance across six sub-tasks on AVSBench and VPO datasets.
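As a rough illustration of the momentum-updated prototype idea mentioned above, the sketch below keeps one prototype vector per class, updates it as an exponential moving average of incoming features, and classifies a pixel feature by cosine similarity to the prototypes. The function names, the momentum value, and the use of cosine similarity are assumptions made for illustration; the note does not specify how DPC is implemented.

```python
def update_prototype(prototype, feature, momentum=0.99):
    """EMA update: keep most of the old prototype, blend in a little of the new feature."""
    return [momentum * p + (1.0 - momentum) * f for p, f in zip(prototype, feature)]


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    return dot / (norm_a * norm_b)


def classify(pixel_feature, prototypes):
    """Return the index of the prototype most similar to the pixel feature."""
    return max(range(len(prototypes)), key=lambda k: cosine(pixel_feature, prototypes[k]))


if __name__ == "__main__":
    prototypes = [[1.0, 0.0], [0.0, 1.0]]  # hypothetical "sounding" / "silent" prototypes
    prototypes[0] = update_prototype(prototypes[0], [0.8, 0.2], momentum=0.9)
    print(classify([0.9, 0.1], prototypes))
```

A high momentum makes the prototypes drift slowly, which is the usual reason for preferring an EMA over recomputing prototypes from each batch.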
- Brewing Stronger Features: Dual-Teacher Distillation for Multispectral Earth Observation
-
This paper proposes DEO (Distillation for Earth Observation), a dual-teacher contrastive distillation framework. It utilizes a multispectral self-distillation teacher to learn spectral representations and an optical VFM teacher (DINOv3) to inject high-level semantic priors. This enables a single student network to excel in both optical and multispectral remote sensing tasks, achieving SOTA across semantic segmentation, change detection, and classification.
- CDICS: Delving Into Fine-Grained Attribute for In-Context Segmentation via Compositional Prompts and Phased Decoupling
-
CDICS upgrades traditional in-context segmentation from "one reference image defines one target" to "a combination of semantic, part, and color reference images defines the target." By utilizing a decoupled two-stage decoder (first for coarse semantic localization, then for refinement with appearance constraints), it separates the sub-problems of "what it is" and "what it looks like." In compositional prompt segmentation tasks, it improves IoU from 42.9% to 57.6% and reduces the False Positive Rate (FPR) from 8.3% to 3.9%.
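For readers unfamiliar with the two metrics quoted above, here is a minimal sketch of how IoU and False Positive Rate are commonly computed for a binary mask. It assumes the standard pixel-level definitions (IoU = TP / (TP + FP + FN), FPR = FP / (FP + TN)); the note does not state the exact evaluation protocol CDICS uses, and the function name is hypothetical.

```python
def iou_and_fpr(pred, gt):
    """Compute IoU and FPR for two flat binary masks (sequences of 0/1)."""
    tp = sum(1 for p, g in zip(pred, gt) if p and g)
    fp = sum(1 for p, g in zip(pred, gt) if p and not g)
    fn = sum(1 for p, g in zip(pred, gt) if not p and g)
    tn = sum(1 for p, g in zip(pred, gt) if not p and not g)

    union = tp + fp + fn
    iou = tp / union if union else 1.0             # two empty masks agree perfectly
    fpr = fp / (fp + tn) if (fp + tn) else 0.0     # no negatives means no false alarms
    return iou, fpr


if __name__ == "__main__":
    prediction = [1, 1, 0, 0]
    ground_truth = [1, 0, 1, 0]
    print(iou_and_fpr(prediction, ground_truth))
```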
- DeBias-CLIP: CLIP Is Shortsighted — Paying Attention Beyond the First Sentence
-
This paper finds that CLIP models are heavily biased toward encoding summary sentences and early tokens in long-text scenarios ("shortsighted" behavior). Using three training augmentation strategies that introduce no additional parameters (summary removal, random sentence sampling, and token prefix padding), the proposed method achieves SOTA performance in long-text retrieval while also improving short-text retrieval.
- Concept-Aware LoRA for Domain-Aligned Segmentation Dataset Generation
-
To address the dilemma in synthetic segmentation data generation where "fine-tuning leads to overfitting, yet not fine-tuning leads to domain misalignment," this paper proposes Concept-Aware LoRA (CA-LoRA). It first identifies the projection layers in a T2I model most sensitive to a specific target concept (viewpoint or style) using a "concept loss," and then applies LoRA fine-tuning only to these top-\(k\)% layers. This approach learns only the desired concepts while preserving pre-trained knowledge, generating image-label pairs that are both domain-aligned and diverse. It achieves a +2.30% mIoU improvement on Cityscapes few-shot and an average +1.53% mIoU gain in domain generalization.
- Concept-Guided Fine-Tuning: Steering ViTs away from Spurious Correlations to Improve Robustness
-
This paper proposes CFT (Concept-Guided Fine-Tuning), which utilizes LLMs to generate category-level semantic concepts and obtains concept masks through GroundedSAM zero-shot segmentation. A ViT is then fine-tuned with the objective of aligning AttnLRP relevance maps with these concept regions. Using only 1500 images, it significantly improves robustness across five OOD benchmarks.
- ConceptPrism: Concept Disentanglement in Personalized Diffusion Models via Residual Token Optimization
-
ConceptPrism is proposed to automatically disentangle shared target concepts from image-specific residual information in personalized T2I diffusion models. By introducing image-level residual tokens and cross-image exclusion loss, the method achieves state-of-the-art performance across CLIP-T, DINO, and CLIP-I metrics on DreamBench.
- Conversational Image Segmentation: Grounding Abstract Concepts with Scalable Supervision
-
This paper introduces the "Conversational Image Segmentation (CIS)" task—grounding abstract concepts such as affordances, physical stability, and user intent onto pixel-level masks. It presents the CONVERSEG benchmark, a fully automated VLM data engine (synthesizing 61K prompt–mask pairs without manual annotation), and CONVERSEG-NET, a single-pass model. CONVERSEG-NET achieves 70.5% (3B) / 73.3% (7B) gIoU on CONVERSEG while remaining competitive on traditional benchmarks like RefCOCO and ReasonSeg.
- CrackSSM: Reviving SSMs for Crack Segmentation via Dynamic Scanning
-
Addressing the slender, intermittent, and irregular nature of cracks, CrackSSM replaces the "fixed-path scanning" in Mamba-based vision models with adaptive token reordering (dynamic scanning) driven by crack direction intensity. This ensures that adjacent crack pixels remain adjacent in 1D sequences, restoring the causal modeling capability of S6. Combined with a wavelet high-frequency prior-guided decoder, it achieves superior accuracy over SOTAs like SCSegamba on three crack datasets with only 2.95M parameters / 4.69G FLOPs.
- Cross-Domain Few-Shot Segmentation via Multi-view Progressive Adaptation
-
To address the dilemma in Cross-Domain Few-Shot Segmentation (CD-FSS) where "scarcity of target samples + large domain gap weakens the few-shot capability of source models," this paper proposes Multi-view Progressive Adaptation (MPA). It performs "easy-to-hard" adaptation from both data and strategy perspectives: Hybrid Progressive Augmentation (HPA) generates increasingly complex multi-views, and Dual-chain Multi-view Prediction (DMP) fully exploits supervision signals across serial and parallel paths. MPA outperforms the previous SOTA by an average of 7.0% (1-shot) across four data-scarce domains and reduces training time by 80% with negligible performance drops when omitting source domain training.
- DeDelayed: Deleting Remote Inference Delay via On-Device Correction
-
This paper proposes DeDelayed, an edge-cloud collaborative inference framework that combines a lightweight local image model with a delay-aware cloud temporal prediction video model. By training the cloud model for temporal prediction to compensate for network latency, the framework improves mIoU by 6.4 compared to purely local inference and by 9.8 compared to purely remote inference under a 100ms delay.
- Denoise and Align: Towards Source-Free UDA for Robust Panoramic Semantic Segmentation
-
DAPASS transfers a pinhole-pre-trained segmentation model to the panoramic domain without source data. It partitions target samples into reliable and unreliable sets based on confidence consistency, cleans pseudo-labels via bilevel optimization and class-balanced copy-paste, and aligns local details with global semantics using a cross-resolution attention module to mitigate ERP distortion. It achieves 55.04% and 70.38% mIoU on outdoor C-to-D and indoor Spin-to-Span benchmarks, respectively.
- Detecting AI-Generated Forgeries via Iterative Manifold Deviation Amplification
-
The authors propose IFA-Net, which detects AI forgeries from the perspective of "modeling what is real" rather than "learning what is fake". By utilizing a frozen MAE to reconstruct inputs, the method produces residuals that expose regions deviating from the natural image manifold. Through a two-stage closed loop—coarse detection → task-adaptive prior injection → residual amplification → refinement—manifold deviations are iteratively amplified. The model achieves SOTA performance on both diffusion inpainting and traditional tampering detection.
- Differentiable Laplacian Matrix Guided Superpixel Segmentation
-
Addressing the issue where deep superpixel models rely on non-differentiable "Enforced Connectivity (EC)" post-processing to eliminate fragments, this paper proposes a fully differentiable, model-agnostic Graph Laplacian loss (along with a minimal semantic distance loss and weighted reconstruction loss). It pushes superpixels toward connectivity during training, significantly reducing fragments with almost no loss in ASA, moving closer to "post-processing-free, true end-to-end" learning.
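To give a feel for why a graph Laplacian can penalize fragmented superpixels, the sketch below builds the Laplacian L = D - A of a small pixel graph and evaluates the quadratic form xᵀLx for a soft membership vector x. This form equals the weighted sum of squared differences across edges, so it is small when neighbouring pixels share a membership and grows as the assignment breaks into pieces. This is the generic Laplacian smoothness term, not the paper's loss, whose exact form the note does not give; all names are illustrative.

```python
def laplacian(num_nodes, edges):
    """Build the dense graph Laplacian L = D - A from (i, j, weight) edges."""
    L = [[0.0] * num_nodes for _ in range(num_nodes)]
    for i, j, w in edges:
        L[i][i] += w
        L[j][j] += w
        L[i][j] -= w
        L[j][i] -= w
    return L


def quadratic_form(L, x):
    """Evaluate x^T L x, i.e. the sum over edges of w_ij * (x_i - x_j)^2."""
    n = len(x)
    return sum(x[i] * L[i][j] * x[j] for i in range(n) for j in range(n))


if __name__ == "__main__":
    # Four pixels in a row, unit-weight edges between neighbours.
    path_edges = [(0, 1, 1.0), (1, 2, 1.0), (2, 3, 1.0)]
    L = laplacian(4, path_edges)
    print(quadratic_form(L, [1, 1, 0, 0]))  # one contiguous segment
    print(quadratic_form(L, [1, 0, 1, 0]))  # fragmented membership
```

Because the expression is a polynomial in x, it is differentiable with respect to soft memberships, which is the property a training-time connectivity term needs.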
- DIMOS: Disentangling Instance-level Moving Object Segmentation
-
Addressing the challenges of "entangled appearance and motion information in event cameras and sparse features for small objects," DIMOS employs dual disentangled encoders to extract dual branches of appearance and motion features from both image and event modalities. By utilizing adversarial domain adaptation and modality translation for distribution-level and semantic-level alignment before fusion, DIMOS achieves State-of-the-Art (SOTA) performance on three small-object moving instance segmentation benchmarks: MouseSIS, SEVD-Fixed, and EVIMO.
- Direct Segmentation without Logits Optimization for Training-Free Open-Vocabulary Semantic Segmentation
-
A training-free open-vocabulary semantic segmentation method is proposed that bypasses the logits optimization process. Based on the hypothesis that "distribution discrepancies from logits to a degenerate distribution are consistent for homogeneous regions," segmentation maps are directly constructed via analytical solutions of optimal transport paths or maximum transport velocities. It achieves SOTA performance on 8 benchmarks without training or model-specific modulations.
- DSS: Discover, Segment, and Select for Zero-shot Camouflaged Object Segmentation
-
The proposed DSS is a three-stage progressive pipeline (Discover→Segment→Select) that achieves zero-shot training-free camouflaged object segmentation. It discovers the target (FOD) via self-supervised visual features and Leiden clustering, generates candidate masks with SAM, and selects the optimal mask through heuristic scoring and iterative MLLM pairwise comparisons. It significantly outperforms existing methods, particularly in multi-instance scenarios.
- DSFlash: Comprehensive Panoptic Scene Graph Generation in Realtime
-
DSFlash improves panoptic scene graph generation speed to 56 FPS on an RTX 3090 while achieving SOTA performance with \(mR@50=30.9\) on the PSG dataset by merging segmentation and relation prediction backbones, employing a bidirectional relation prediction head, and utilizing dynamic patch pruning.
- Efficient Video Object Segmentation and Tracking with Recurrent Dynamic Submodel
-
To address the slow inference speeds of large video segmentation models like SAM2, this paper introduces a "Predictive-Aware Router" (taking the previous segmentation mask and current visual features) to activate only a specific subset of blocks per frame. Combined with "Importance-Aware LoRA" that fine-tunes only critical blocks, it achieves a 1.3× real-world speedup on DAVIS 2017 with a performance drop of <0.4%, training only 3% of the parameters.
- ELVIS: Enhance Low-Light for Video Instance Segmentation in the Dark
-
ELVIS proposes the first low-light Video Instance Segmentation (VIS) framework, which achieves gains of +3.7 AP and +2.8 AP on synthetic and real low-light videos, respectively. This is accomplished through a physics-driven synthetic low-light video pipeline with motion blur modeling, an uncalibrated degradation parameter estimation network (VDP-Net), and an enhancement decoder integrated into the VIS architecture for degradation-content decoupling.
- EReCu: Pseudo-label Evolution Fusion and Refinement with Multi-Cue Learning for Unsupervised Camouflage Detection
-
This paper proposes EReCu, a unified framework that utilizes Multi-cue Native Perception (MNP) to extract texture and semantic priors from the DINO teacher-student architecture. These priors guide Pseudo-label Evolution Fusion (PEF) and Local Pseudo-label Refinement (LPR) to recover boundary details. This work represents the first unification of pseudo-label guidance and feature learning paradigms in UCOD, achieving SOTA across four COD datasets.
- Exploring the Underwater World Segmentation without Extra Training
-
Addressing the scarcity of data and models in underwater scenarios, this work introduces the first fine-grained underwater open-vocabulary segmentation dataset and benchmark (AquaOV255 / UOVSBench). It also proposes Earth2Ocean, a training-free framework that corrects CLIP visual features with geometric self-similarity priors and enhances text embeddings via MLLM reasoning. This transfers terrestrial VLMs to underwater contexts without any extra training, achieving an average mIoU improvement of 6+.
- F2Net: A Frequency-Fused Network for Ultra-High Resolution Remote Sensing Segmentation
-
F2Net decomposes ultra-high resolution (UHR) remote sensing images in the frequency domain into high-frequency and low-frequency components for separate processing. A high-frequency branch preserves full resolution for boundary details, while the low-frequency branch is downsampled and split into two sub-branches (short-range and long-range) for semantic capture. A Hybrid Frequency Fusion (HFF) module integrates the three features, supported by two cross-frequency losses to stabilize multi-branch training, achieving SOTA results of 80.22 and 83.39 mIoU on DeepGlobe and Inria Aerial, respectively.
- FlowDIS: Language-Guided Dichotomous Image Segmentation with Flow Matching
-
FlowDIS reformulates high-precision Dichotomous Image Segmentation (DIS) as a flow matching problem: it directly learns a time-dependent velocity field to transport the "image distribution" to the "mask distribution," replacing the stochastic denoising process of diffusion models with a deterministic ODE. Combined with the PAIP instance-pairing training strategy to enhance language controllability, it achieves new SOTA results on all DIS5K test sets. With only 1-step inference, it achieves an approximately 5.5% higher \(F_\beta^\omega\) and 43% lower MAE on DIS-TE compared to the runner-up LawDIS.
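The deterministic-ODE view can be sketched with a plain Euler integrator: starting from a source sample, repeatedly step along the velocity field from t = 0 to t = 1. In the toy below the velocity is the analytic one for a straight-line path between two known points (constant `target - source`), which is why a single step already lands on the target. In a real flow matching model the velocity would come from a trained network; `euler_integrate` and `straight_line_velocity` are hypothetical names, not part of FlowDIS.

```python
def euler_integrate(x0, velocity, steps=1):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = list(x0)
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def straight_line_velocity(source, target):
    """Velocity of the linear path x_t = (1 - t) * source + t * target (constant in x and t)."""
    constant = [b - a for a, b in zip(source, target)]
    return lambda x, t: constant


if __name__ == "__main__":
    image_like = [0.2, 0.8]   # stand-in for an image-side sample
    mask_like = [1.0, 0.0]    # stand-in for the corresponding mask
    field = straight_line_velocity(image_like, mask_like)
    print(euler_integrate(image_like, field, steps=1))
```

The straighter the learned trajectories, the fewer Euler steps are needed, which is the usual argument for why flow matching models can run with very few inference steps.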
- Follow the Saliency: Supervised Saliency for Retrieval-augmented Dense Video Captioning
-
The STaRC framework is proposed to unify retrieval (saliency-guided segmentation + retrieval) and description generation (saliency prompt injection for the decoder) through supervised frame-level saliency learning, significantly improving temporal alignment and caption quality in Dense Video Captioning (DVC).
- FoV-Net: Rotation-Invariant CAD B-rep Learning via Field-of-View Ray Casting
-
FoV-Net is proposed as the first rotation-invariant framework for CAD B-rep learning that simultaneously captures local surface geometry and global structural context. It achieves robust classification and segmentation under arbitrary \(\mathbf{SO}(3)\) rotations through Local Reference Frame UV grids (LRF UV) and Field-of-View (FoV) ray casting descriptors.
- Frequency-Aware Affinity for Weakly Supervised Semantic Segmentation
-
Addressing the issue where ViT self-attention acts as a low-pass filter, resulting in affinity that only diffuses within object interiors and loses boundaries, this paper proposes the Dual Frequency-Aware (DFA) framework. DFA uses low-frequency affinity to align internal semantics and high-frequency (inverse) affinity to correct object boundaries. By employing Optimal Transport-based Frequency-Guided CAM generation, the "generation + refinement" process is merged into a single step, achieving new single-stage WSSS SOTA results on PASCAL VOC (val 79.3% mIoU) and MS COCO (val 51.5%).
- From 2D Alignment to 3D Plausibility: Unifying Heterogeneous 2D Priors and Penetration-Free Diffusion for Occlusion-Robust Two-Hand Reconstruction
-
The authors decouple two-hand reconstruction into 2D structural alignment (fusing keypoint/segmentation/depth priors) and 3D spatial interaction alignment (penetration-removal diffusion model). This approach achieves an MPJPE of 5.36mm on InterHand2.6M, significantly outperforming the state-of-the-art.
- From Softmax to Dirichlet: Evidential Learning for Semi-supervised Semantic Segmentation
-
To address the issue of unreliable pseudo-label filtering caused by network overconfidence in softmax scores, this paper utilizes evidential learning to model per-pixel class probabilities as a Dirichlet distribution, obtaining principled uncertainty. Furthermore, HESS is proposed to decouple "exclusive evidence" from "collective evidence." Serving as a plug-and-play module for UniMatch/UniMatch V2, it achieves stable performance gains across Pascal/Cityscapes/COCO benchmarks under low-label settings (up to +2.3% mIoU on the most challenging 1/16 split).
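The Dirichlet modelling step can be illustrated with the standard evidential (subjective-logic) formulation: non-negative per-class evidence e gives concentration parameters α = e + 1, expected probabilities α / S with S = Σα, and an uncertainty mass K / S for K classes. The sketch below applies this to one pixel and uses the uncertainty to filter pseudo-labels. It does not reproduce HESS or its exclusive/collective evidence split, and the threshold value is an arbitrary assumption.

```python
def dirichlet_from_evidence(evidence):
    """Return (expected class probabilities, uncertainty) for non-negative evidence."""
    num_classes = len(evidence)
    alpha = [e + 1.0 for e in evidence]       # Dirichlet concentration parameters
    strength = sum(alpha)                     # total Dirichlet strength S
    probs = [a / strength for a in alpha]     # mean of the Dirichlet
    uncertainty = num_classes / strength      # 1.0 when there is no evidence at all
    return probs, uncertainty


def keep_pseudo_label(evidence, max_uncertainty=0.5):
    """Keep the pixel's pseudo-label only if its uncertainty is low enough."""
    _, uncertainty = dirichlet_from_evidence(evidence)
    return uncertainty <= max_uncertainty


if __name__ == "__main__":
    print(dirichlet_from_evidence([9.0, 0.0, 0.0]))  # strong evidence for class 0
    print(dirichlet_from_evidence([0.0, 0.0, 0.0]))  # no evidence: maximal uncertainty
```

Unlike a softmax confidence, the uncertainty here depends on the total amount of evidence, so a pixel with weak evidence for every class is flagged even if one class narrowly dominates.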
- Generalizable Co-Salient Object Detection via Mixed Content-Style Modulation
-
The paper proposes CoMCS, which leverages a dual approach of "content modulation + style modulation" to enhance the generalization of Co-Salient Object Detection (CoSOD) in unseen domains. Specifically, it employs CLIP semantic embeddings to inject domain-invariant scene structure priors (MCM), synthesizes expanded training domain styles using feature statistics (MSM), and pushes prototypes apart on a hypersphere using a uniformity loss (SCM). CoMCS outperforms 17 SOTA methods across four benchmarks, including a self-constructed unseen domain dataset (UND).
- GenMask: Adapting DiT for Segmentation via Direct Mask Generation
-
This paper proposes GenMask, which directly trains a Diffusion Transformer (DiT) to generate black-and-white segmentation masks (sharing the same model used for color image generation). By discovering the unique property that VAE latent representations of binary masks are linearly separable, the authors design an extreme long-tailed timestep sampling strategy specifically for segmentation. This enables single-step inference to produce segmentation masks, achieving SOTA performance on referring and reasoning segmentation benchmarks.
- GeoGuide: Hierarchical Geometric Guidance for Open-Vocabulary 3D Semantic Segmentation
-
This paper proposes GeoGuide, a hierarchical geometric guidance framework for open-vocabulary 3D semantic segmentation. By utilizing three complementary modules—uncertainty-guided superpoint distillation, instance-level mask reconstruction, and inter-instance relationship consistency—the framework leverages geometric priors from pre-trained 3D models to correct geometric biases in 2D-to-3D knowledge distillation, achieving SOTA performance with 64.8 mIoU on ScanNet v2.
- GeoMotion: Rethinking Motion Segmentation via Latent 4D Geometry
-
GeoMotion reformulates motion segmentation from "explicit estimation of camera pose and point correspondence + iterative optimization" to "direct feed-forward decoding of motion masks from latent geometric features of a pre-trained 4D reconstruction model (π3)". Utilizing a feature aggregation module and a 5-layer self-attention decoder, it decouples object motion from camera motion in a single forward pass. It achieves SOTA on multiple zero-shot benchmarks and runs at 0.31s per frame, more than 20x faster than iterative optimization methods.
- GeoSURGE: Geo-localization using Semantic Fusion with Hierarchy of Geographic Embeddings
-
GeoSURGE proposes hierarchical geographic embeddings and a semantic fusion module, modeling the global image geo-localization problem as a matching task between visual representations and learned geographic representations. It achieves SOTA on 22 out of 25 metrics across 5 benchmarks.
- GKD: Generalizable Knowledge Distillation from Vision Foundation Models for Semantic Segmentation
-
This paper proposes the GKD framework, which decouples representation learning from task learning using a multi-stage distillation process (general feature learning → freeze encoder → task head training) combined with a Query-based Soft Distillation (QSD) mechanism. By distilling cross-domain generalization capabilities from VFMs into lightweight student models, it achieves an average mIoU gain of +10.6% in F2L settings and +1.9% in F2F settings.
- Heuristic Self-Paced Learning for Domain Adaptive Semantic Segmentation under Adverse Conditions
-
This paper reformulates class-level curriculum learning in unsupervised domain adaptation as a sequential decision-making problem in reinforcement learning. It proposes the HeuSCM framework, which achieves autonomous curriculum planning through high-dimensional semantic state perception and a class-fair policy gradient, reaching SOTA performance on ACDC, Dark Zurich, and Nighttime Driving (72.9 mIoU).
- High-Precision Dichotomous Image Segmentation via Depth Integrity-Prior and Fine-Grained Patch Strategy
-
Addressing the dilemma in high-precision Dichotomous Image Segmentation (DIS) where "non-diffusion models are fast but semantically weak, while diffusion models are accurate but heavy and slow," this paper observes that complete objects in depth maps exhibit "low variance, smooth interiors, and sharp boundaries," while the background shows "high variance and chaos." Terming this observation the depth integrity-prior, the authors use a pre-existing monocular depth estimation model (DAM-v2) to generate pseudo-depth as a new modality. Combined with the cross-modal fusion network PDFNet, a depth integrity loss, and an 8×8 fine-grained patch strategy, the method achieves SOTA results on DIS-VD with \(F^{max}_\beta=0.915\) using less than half the parameters of diffusion-based methods.
- Hilbert Curve-Based Attention Enabling Topology-Preserving Image Tensor Representation for Semantic Segmentation Network
-
Aiming at building surface defect segmentation from UAV images, this paper proposes TPSegformer. Before the attention calculation in the decoder, it utilizes the Hilbert curve instead of traditional row-major flattening to compress 2D features into 1D sequences, thereby preserving the spatial adjacency of pixels during dimensionality reduction. Combined with dual-branch feature enhancement, high-low resolution fusion, and joint auxiliary supervision using Dice and edge losses, it achieves 80.77% mIoU and 90.22% Acc on the self-built BD3 defect dataset.
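The difference between Hilbert-curve and row-major flattening can be seen on a tiny grid. The sketch below uses the classic index-to-coordinate conversion for a Hilbert curve on an n × n grid (n a power of two) and reads grid values in that order, so consecutive sequence elements are always spatial neighbours. It is a generic illustration of the curve, not TPSegformer's implementation, and the function names are made up for this example.

```python
def hilbert_d2xy(n, d):
    """Map position d along the Hilbert curve to (x, y) on an n x n grid; n must be a power of 2."""
    x = y = 0
    t = d
    s = 1
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:                 # rotate/flip the quadrant so sub-curves connect
            if rx == 1:
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y


def hilbert_flatten(grid):
    """Flatten a square 2D list (indexed grid[y][x]) into a 1D list in Hilbert order."""
    n = len(grid)
    order = [hilbert_d2xy(n, d) for d in range(n * n)]
    return [grid[y][x] for x, y in order]


if __name__ == "__main__":
    features = [[r * 4 + c for c in range(4)] for r in range(4)]
    print(hilbert_flatten(features))
```

With row-major flattening, the last element of one row and the first element of the next are adjacent in the sequence but a full row width apart in the image; the Hilbert order avoids such jumps.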
- HOPS: Hierarchical Open-vocabulary Part Segmentation with Attention-Aware Filtering and Affinity-Guided Enhancement
-
HOPS utilizes a bidirectional attention fusion mechanism of "CLIP Semantics \(\otimes\) DINO Structure" within a two-stage framework to address Open-vocabulary Part Segmentation (OVPS). The first stage employs an Attention-Aware Filtering Module (AFM) to eliminate object-level over-segmentation, while the second stage uses an Affinity-Guided Enhancement Module (AEM) to iteratively expand weak activations for small parts. It achieves new SOTA performance on Pascal-Part-116, ADE20K-Part-234, and PartImageNet.
- HySeg: Learning Generative Priors for Structure-Aware Remote Sensing Segmentation
-
HySeg reformulates remote sensing image semantic segmentation (RSISS) as "posterior inference constrained by generative structural priors." It first learns a structural prior encoding topological continuity and regional adjacency in label space using a MeanFlow-based MeanStruct module. This abstract prior is then projected into topology-aware pixel-wise affinities via P2A. Finally, a DAS head performs constrained message passing based on these affinities, achieving plug-and-play improvements in structural consistency and cross-dataset generalization across four remote sensing benchmarks.
- INSID3: Training-Free In-Context Segmentation with DINOv3
-
This paper proposes INSID3, a training-free in-context segmentation method relying solely on frozen DINOv3 features. Through a three-stage pipeline consisting of positional bias elimination, fine-grained clustering, and seed cluster aggregation, it outperforms methods relying on SAM or fine-tuning across semantic, part, and personalized segmentation tasks using a single self-supervised backbone, achieving an average mIoU gain of +7.5%.
- Joint Spectral Image Reconstruction and Semantic Segmentation with Cooperative Unfolding
-
To address error accumulation in the "reconstruction-then-segmentation" two-stage pipeline for Coded Aperture Snapshot Spectral Imaging (CASSI) and the loss of complementary clues between tasks, this paper proposes the first Cooperative Reconstruction-Segmentation Deep Unfolding Network (CRSDUN). It integrates HSI reconstruction and segmentation into a unified Half-Quadratic Splitting (HQS) optimization framework for alternating solutions. A Cross-Aggregation Super-token Attention (CASTA) module is introduced to bidirectionally transfer pixel-level and semantic-level representations between branches. It achieves SOTA performance in both reconstruction and segmentation on synthetic and real CASSI data with lower computational cost.
- Kαlos finds Consensus: A Meta-Algorithm for Evaluating Inter-Annotator Agreement in Complex Vision Tasks
-
This paper proposes the KαLOS meta-algorithm, which converts complex spatial-category annotation consistency problems into standard nominal reliability matrices through "Localization-First" principles and data-driven parameter calibration. It provides a unified framework to evaluate Inter-Annotator Agreement (IAA) across diverse vision tasks such as object detection, instance segmentation, and pose estimation.
- Learning and Aligning Click-Aware Shape Prior for Interactive Amodal Instance Segmentation
-
ClickPriorNet reformulates amodal instance segmentation (segmenting both visible and occluded regions) as an interactive task. Based on user clicks, the model retrieves complementary shape priors from a codebook using the "previous mask + current clicks" as a query and aligns these priors to the target instance via deformable attention. This approach achieves more complete amodal masks with fewer clicks across KINS, D2SA, and COCOA datasets.
- Learning Cross-View Object Correspondence via Cycle-Consistent Mask Prediction
-
This paper proposes CCMP, a cross-view object correspondence framework based on conditional binary segmentation. It utilizes cycle-consistency constraints to provide self-supervised signals and supports Test-Time Training (TTT), achieving SOTA performance of 44.57% mIoU on Ego-Exo4D.
- Leveraging Class Distributions in CLIP for Weakly Supervised Semantic Segmentation
-
Addressing the "under-activation" issue in CLIP-generated CAMs caused by inaccurate MHSA affinity, CD-CLIP identifies that "patches of the same class exhibit highly similar probability distributions across all classes." It constructs Class Distribution-Aware (CDA) affinity using JS divergence to complete the foreground. Furthermore, it introduces Super-class Boundary Exploration (SBE) using DINO-based super-class prototype CAMs to suppress over-activation through boundary supervision. This single-stage approach achieves 82.5% mIoU on PASCAL VOC and 54.1% mIoU on MS COCO.
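The observation that same-class patches have similar class probability distributions suggests measuring patch affinity with a distribution distance. The sketch below computes the Jensen-Shannon divergence between two per-patch class distributions and maps it to an affinity in [0, 1] by normalizing with its maximum value ln 2. The normalization and the name `distribution_affinity` are assumptions for illustration; the note does not say how CD-CLIP converts JS divergence into CDA affinity.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) with natural log; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric and bounded in [0, ln 2]."""
    m = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def distribution_affinity(p, q):
    """Affinity of 1 for identical distributions, 0 for distributions with disjoint support."""
    return 1.0 - js_divergence(p, q) / math.log(2.0)


if __name__ == "__main__":
    cat_patch_a = [0.7, 0.2, 0.1]
    cat_patch_b = [0.6, 0.3, 0.1]
    sky_patch = [0.05, 0.05, 0.9]
    print(distribution_affinity(cat_patch_a, cat_patch_b))
    print(distribution_affinity(cat_patch_a, sky_patch))
```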
- Live Interactive Training for Video Segmentation
-
LIT (Live Interactive Training) proposes a framework that enables interactive vision systems (e.g., SAM2) to learn online from user corrections during inference. Its lightweight implementation, LIT-LoRA, generalizes user feedback to subsequent frames by updating LoRA modules in real-time. It reduces user corrections by 18-34% on challenging VOS benchmarks with a training overhead of only approximately 0.5 seconds.
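Updating only LoRA modules is cheap because each adapted weight is the frozen base matrix plus a low-rank product. The sketch below shows the usual LoRA parameterization W' = W + (alpha / r) · B A with plain Python lists; only A and B would be trained online while W stays fixed. The helper names are hypothetical, and how LIT-LoRA chooses which layers to adapt or how it schedules updates is not covered here.

```python
def matmul(left, right):
    """Multiply two matrices given as lists of rows."""
    inner = len(right)
    cols = len(right[0])
    return [[sum(row[k] * right[k][c] for k in range(inner)) for c in range(cols)]
            for row in left]


def lora_weight(base, lora_a, lora_b, alpha, rank):
    """Return base + (alpha / rank) * lora_b @ lora_a.

    base:   d_out x d_in frozen weight
    lora_b: d_out x rank, typically initialized to zeros
    lora_a: rank x d_in
    """
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(base, delta)]


if __name__ == "__main__":
    frozen = [[1.0, 0.0], [0.0, 1.0]]
    a = [[1.0, 1.0]]          # rank 1
    b = [[0.0], [0.0]]        # zero init: the adapted weight equals the base weight
    print(lora_weight(frozen, a, b, alpha=2.0, rank=1))
```

With rank r much smaller than the layer width, the number of trainable values is r · (d_in + d_out) instead of d_in · d_out, which keeps per-correction updates fast.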
- LoD-Loc v3: Generalized Aerial Localization in Dense Cities using Instance Silhouette Alignment
-
This paper proposes LoD-Loc v3, which addresses poor cross-scene generalization and pose ambiguity in dense cities for LoD-based UAV localization. By constructing a large-scale synthetic instance segmentation dataset (InsLoD-Loc) containing 100,000 images and upgrading the localization paradigm from semantic to instance silhouette alignment, it achieves a 2000% precision improvement (2m, 2°) on the dense Tokyo-LoDv3 scene compared to the previous SOTA.
- Looking Beyond the Window: Global-Local Aligned CLIP for Training-free Open-Vocabulary Semantic Segmentation
-
To address cross-window semantic inconsistency caused by sliding window inference in training-free open-vocabulary semantic segmentation, this paper proposes the GLA-CLIP framework. By integrating global key-value extension, proxy anchor attention, and dynamic normalization, the method achieves global context integration and attains SOTA performance with an average 44.0% mIoU across 8 benchmarks.
- Making Training-Free Diffusion Segmentors Scale with the Generative Power
-
This work reveals the fundamental reason why existing training-free diffusion segmentation methods fail to scale with the increasing power of generative models: the existence of two gaps (the aggregation gap and the score imbalance gap) between cross-attention maps and semantic correlation. The authors propose the GoCA framework, consisting of auto aggregation and per-pixel rescaling, enabling stronger diffusion models (SDXL, PixArt-Sigma, Flux) to significantly outperform older models in training-free semantic segmentation for the first time.
- MARIS: Marine Open-Vocabulary Instance Segmentation
-
This paper introduces MARIS, the first fine-grained underwater open-vocabulary instance segmentation benchmark (16K images, 158 fine-grained categories), and proposes a unified framework consisting of a Geometric Prior Enhancement Module (GPEM) and a Semantic Alignment Injection Mechanism (SAIM). By leveraging geometric priors from depth maps to counter underwater visual degradation and employing underwater-aware text prompts to address semantic misalignment, the framework significantly outperforms existing OV segmentation baselines in both in-domain and cross-domain settings.
- MARSS: Radar Semantic Segmentation via Modular Attention and State Space Models
-
Addressing the three major characteristics of radar frequency maps—"anisotropy, multi-scale, and sparse noise"—MARSS replaces general CNN/Transformer operators with three modules tailored for radar: the denoising encoder RADE, adaptive multi-scale fusion RFAF, and a State Space Decoder RADM combining Mamba and axial attention. On the CARRADA dataset, it improves RA view mIoU from 44.3% to 46.97% with 9.3M parameters, demonstrating particular robustness for small, fast-moving targets.
- MatAnyone 2: Scaling Video Matting via a Learned Quality Evaluator
-
This paper proposes a learned Matting Quality Evaluator (MQE) that evaluates alpha quality pixel-wise without ground truth. MQE serves as both online training guidance and an offline data filter. This enabled the construction of VMReal, a real-world video matting dataset with 28K clips and 2.4 million frames. Combined with a reference-frame training strategy, the method significantly outperforms existing state-of-the-art approaches.
- MatchMask: Mask-Centric Generative Data Augmentation for Label-Scarce Semantic Segmentation
-
MatchMask utilizes only a tiny amount of labeled masks by first identifying a few key layers in the diffusion model responsible for spatial control via a "gradient probe." It then attaches a 0.7M parameter LoRA adapter to these layers for mask-to-image synthesis and employs "relative filtering" to eliminate misaligned noisy regions in the synthesized images. This significantly enhances semantic segmentation performance in label-scarce scenarios (e.g., +6.8% mIoU under VOC 1/8 labels).
- Mitigating Objectness Bias and Region-to-Text Misalignment for Open-Vocabulary Panoptic Segmentation
-
OVRCOAT addresses the issues in open-vocabulary panoptic segmentation where "unseen objects are discarded as background" and "CLIP regional features misalign with categories." It introduces a lightweight "CLIP-conditioned objectness adjustment via COAT" and "mask-level image-text alignment fine-tuning (OVR)." This approach pushes Panoptic Quality (PQ) to a new SOTA on ADE20K (relative +5.5%) while being more memory-efficient than previous full fine-tuning schemes.
- MixerCSeg: An Efficient Mixer Architecture for Crack Segmentation via Decoupled Mamba Attention
-
This paper proposes MixerCSeg, which decouples channels into global/local branches by analyzing the implicit attention mechanism of Mamba. The branches are enhanced with Self-Attention and CNN, respectively, and combined with direction-guided edge gated convolutions. It achieves SOTA performance in crack segmentation with 2.05 GFLOPs and 2.54M parameters.
- Mixture of Prototypes for Test-time Adaptive Segmentation
-
The conventional "one prototype per class" approach in TTA-Seg is upgraded to a "cluster of experts per class." By using K-means to cluster intra-class prototypes from the source domain into multiple experts, employing a gating network for dynamic instance-wise weighted fusion, and applying min-max entropy optimization to update only the gating module, this method achieves new SOTA results on benchmarks such as Cityscapes→ACDC and GTA5→Real.
- Masked Representation Modeling for Domain-Adaptive Segmentation
-
The paper proposes MRM, an auxiliary task that performs masked modeling in latent space instead of input space. By using a lightweight Rebuilder module to perform mask-reconstruction on encoder features supervised by segmentation loss, it achieves an average +2.3 mIoU improvement across four UDA baselines on GTA→Cityscapes with zero extra overhead during inference.
- PCA-Seg: Revisiting Cost Aggregation for Open-Vocabulary Semantic and Part Segmentation
-
PCA-Seg proposes a Parallel Cost Aggregation paradigm to replace traditional serial spatial-class aggregation architectures. It efficiently integrates semantic and spatial context flows via an Expert-driven Perception Learning (EPL) module and eliminates redundancy between knowledge streams using a Feature Orthogonal Decoupling (FOD) strategy. Each parallel block adds only 0.35M parameters while achieving SOTA performance on 8 open-vocabulary semantic and part segmentation benchmarks.
- PEARL: Geometry Aligns Semantics for Training-Free Open-Vocabulary Semantic Segmentation
-
PEARL proposes a two-step inference method based on Procrustes alignment and text-aware Laplacian propagation. Without introducing additional training or auxiliary backbones, it achieves a new SOTA in training-free open-vocabulary semantic segmentation by correcting the key-query geometric mismatch in the final self-attention layer of CLIP and utilizing text semantics to guide label propagation.
- PIX-TAB: Efficient PIXel-Precise TABle Structure Recognition Approach with Speculative Decoding and Region-Based Image Segmentation
-
PIX-TAB utilizes "Position-Aware Pixel-level (PAPP)" tokens to embed row/column pixel coordinates directly into the sequence, eliminating the need for a separate bounding box head during inference. Combined with analytical speculative decoding and a flood-fill-based Region-Based Image Segmentation (RBIS) for large tables, this lightweight encoder-decoder model achieves over 3x speedup compared to full-scale versions while remaining deployable on mobile devices.
- PixDLM: A Dual-Path Multimodal Language Model for UAV Reasoning Segmentation
-
This paper defines the UAV Reasoning Segmentation task, constructs the DRSeg benchmark containing 10K high-resolution UAV images with Chain-of-Thought (CoT) annotations, and proposes a dual-path pixel-level multimodal large language model, PixDLM, as a baseline.
- Pointer-CAD: Unifying B-Rep and Command Sequences via Pointer-based Edges & Faces Selection
-
This paper proposes a command sequence representation based on a Pointer mechanism, explicitly introducing B-Rep geometric entities (edges/faces) into autoregressive CAD generation. This is the first command sequence method to support chamfer/fillet operations while significantly reducing topological errors caused by quantization.
- PR-MaGIC: Prompt Refinement Via Mask Decoder Gradient Flow For In-Context Segmentation
-
PR-MaGIC is a training-free, test-time prompt refinement framework. It treats the gradient of the SAM mask decoder as a "discriminator gradient flow" backpropagated to query image embeddings, iteratively "shifting" low-quality automatically generated prompt points to more accurate positions. By using top-1 similarity to select the most robust mask from multiple candidate steps, it serves as a plug-and-play module that consistently improves performance for one/few-shot segmentation frameworks like PerSAM-F and Matcher.
- PromptMoE: A Segmentation Refinement Framework Leveraging Mixture of Experts for Improved Prompting
-
PromptMoE transforms the task of "generating prompts for SAM to refine coarse masks" from a fixed heuristic rule into a Mixture of Experts (MoE) problem: using 10 complementary pixel-wise visual cues as experts, a sparse router selects the two most relevant experts to fuse into a guidance map, while a spatially diverse sampling module places prompts on the guidance map. This achieves an average improvement of +6.24 IoU / +8.99 BIoU over the strongest baseline across 5 benchmarks.
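The routing step described above (score the cue experts, keep the two best, blend them into one guidance map) can be sketched as follows. The softmax renormalization over the selected experts and all names here are illustrative assumptions, not PromptMoE's actual router.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def fuse_top_k(cue_maps, router_logits, k=2):
    """Keep the k highest-scoring cue maps and blend them with renormalized weights."""
    # Python's sort is stable, so ties keep their original index order.
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    weights = softmax([router_logits[i] for i in top])
    h, w = len(cue_maps[0]), len(cue_maps[0][0])
    guidance = [[sum(wt * cue_maps[i][y][x] for wt, i in zip(weights, top))
                 for x in range(w)] for y in range(h)]
    return top, weights, guidance

# Toy input: 10 constant 2x2 cue maps, map i filled with i / 10.
cue_maps = [[[i / 10.0] * 2 for _ in range(2)] for i in range(10)]
router_logits = [0.1, 2.0, -1.0, 0.3, 2.0, 0.0, -0.5, 0.2, 0.4, -2.0]
top, weights, guidance = fuse_top_k(cue_maps, router_logits)
```

Prompt points would then be sampled from high-valued, spatially spread locations of `guidance`; that sampling step is omitted here.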
- PRUE: A Practical Recipe for Field Boundary Segmentation at Scale
-
This paper provides a systematic evaluation of 18 segmentation and Geospatial Foundation Models (GFM), proposing PRUE—a field boundary segmentation recipe combining a U-Net backbone, composite loss functions, and targeted data augmentation. It achieves 76% IoU and 47% object-F1 on the FTW benchmark, improvements of 6% and 9% over the baseline respectively, while introducing a new set of metrics for evaluating deployment robustness.
- RAVEN: Radar Adaptive Vision Encoders for Efficient Chirp-wise Object Detection and Segmentation
-
RAVEN treats the raw ADC stream of mmWave FMCW radar as a temporal sequence based on "chirp arrival time." It employs independent State Space Models (SSMs) for each receiving channel to preserve the phase structure of the MIMO array, utilizes a lightweight cross-attention mechanism as a "learnable beamformer" to reconstruct virtual antenna features, and enables detection/segmentation results before a frame is fully collected through chirp-wise early exit. It achieves SOTA performance on two automotive radar datasets while reducing computation by up to \(170\times\) and end-to-end latency by \(4\times\).
- RealVLG-R1: A Large-Scale Real-World Visual-Language Grounding Benchmark for Robotic Perception and Manipulation
-
This paper proposes the RealVLG framework, comprising the 11B-level real-world multi-granular annotated dataset RealVLG-11B and the Reinforcement Learning (RL) fine-tuned unified model RealVLG-R1. This work unifies Visual-Language Grounding (VLG) and robotic grasping into the same paradigm for the first time, achieving end-to-end prediction from natural language instructions to bounding boxes, segmentation masks, grasp poses, and contact points, while demonstrating zero-shot generalization capabilities.
- ReAttnCLIP: Training-Free Open-Vocabulary Remote Sensing Image Segmentation via Re-defined Attention in CLIP
-
ReAttnCLIP decomposes the attention map of the final CLIP layer into three components ("patch↔patch, [CLS]→patch, and patch→[CLS]") and applies specialized modifications to each. It replaces patch-patch attention with raw patch embedding similarity (enhanced by rotation and middle-layer fusion), reconstructs a more informative global [CLS] representation using middle-layer attention, and zeros out the [CLS]-to-patch column. Without any training, it achieves SOTA performance across 10 remote sensing datasets (+1.7% in open-vocabulary mean IoU and +1.1% in object extraction).
- REL-SF4PASS: Panoramic Semantic Segmentation with REL Depth Representation and Spherical Fusion
-
This paper proposes the REL depth representation (a three-channel encoding of Rectified Depth + EGVIA + LOA based on cylindrical coordinates) and Spherical Dynamic Multi-modal Fusion (SMMF) for panoramic semantic segmentation. It achieves a 63.06% mean mIoU on Stanford2D3D (a 2.35% improvement over the HHA baseline) and reduces performance variance by approximately 70% under 3D perturbations.
- ReSAM: Refine, Requery, and Reinforce: Self-Prompting Point-Supervised Segmentation for Remote Sensing Images
-
ReSAM converts sparse clicks for each instance into coarse masks through Segment Anything Model (SAM), which are then back-projected into compact boxes serving as "self-prompts" to requery SAM. By employing a lightweight rolling queue for cross-augmentation semantic alignment, ReSAM approaches full-mask supervision performance (reducing gaps to 1.3% / 4.9% / 8.5%) across three remote sensing datasets using only 1-point labels, while saving 84% VRAM compared to prototype-based alignment methods.
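The "back-project a coarse mask into a compact box" step above amounts to taking the tight bounding box of the thresholded mask. A minimal sketch follows; the threshold value and the `(x_min, y_min, x_max, y_max)` box format are assumptions chosen for illustration.

```python
def mask_to_box(mask, threshold=0.5):
    """Return the tight (x_min, y_min, x_max, y_max) box of a soft mask, or None if empty."""
    ys = [y for y, row in enumerate(mask) if any(v > threshold for v in row)]
    xs = [x for x, col in enumerate(zip(*mask)) if any(v > threshold for v in col)]
    if not ys:
        return None
    return (min(xs), min(ys), max(xs), max(ys))

# Toy coarse mask, as might be produced from a single click prompt.
coarse_mask = [
    [0.0, 0.1, 0.0, 0.0],
    [0.0, 0.9, 0.6, 0.0],
    [0.0, 0.7, 0.2, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
box = mask_to_box(coarse_mask)
```

The resulting box would then be passed back to the segmenter as a box prompt in place of the original point.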
- Rethinking Box Supervision: Bias-Free Weakly Supervised Medical Segmentation
-
Addressing the "box-shaped bias" where box supervision causes predictions to tend toward rectangles, the authors propose the WeakMed framework. It uses a differentiable Mask-to-Box (M2B) transformation to project predicted masks onto box-aligned representations to eliminate shape bias, and a Scale Consistency (SC) loss to compensate for fine-grained information lost by M2B. Both components are only enabled during training, require no architectural changes, and incur zero inference overhead. WeakMed significantly outperforms existing weakly supervised methods across 9 tasks, 9 datasets, and 6 modalities, approaching full supervision performance.
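One way to picture a mask-to-box projection is the common axis-projection trick: take the maximum of the predicted mask along each row and column, then recombine the two profiles into a box-aligned map that can be compared against the box annotation. This is a sketch under that assumption; it is not claimed to be WeakMed's M2B operator.

```python
def mask_to_box_map(mask):
    """Project a soft mask onto a box-aligned map via row/column max profiles."""
    row_max = [max(row) for row in mask]        # profile along the y axis
    col_max = [max(col) for col in zip(*mask)]  # profile along the x axis
    # A cell is "inside the box" only as strongly as both of its profiles allow.
    return [[min(r, c) for c in col_max] for r in row_max]

# Toy prediction with a non-rectangular (L-shaped) foreground.
mask = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.9, 0.0, 0.0],
    [0.0, 0.8, 0.7, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
box_map = mask_to_box_map(mask)
```

Because the loss would be computed on `box_map` rather than on the mask itself, an L-shaped prediction and a full rectangle can match the same box equally well, which is the intuition behind removing the rectangular bias. In an autograd framework, `max` and `min` pass gradients to the selected elements, so the projection stays trainable.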
- Retrieve and Segment: Are a Few Examples Enough to Bridge the Supervision Gap in Open-Vocabulary Segmentation?
-
Addressing the gap where Open-Vocabulary Segmentation (OVS) lags behind fully-supervised models, this paper proposes RNS, a retrieval-augmented test-time adapter that complements text prompts with "a few pixel-annotated support images." By training a per-image lightweight linear classifier using "learned per-image fusion" of retrieved visual and text support features, RNS narrows the zero-shot to fully-supervised gap to 11.5 mIoU in less than 1 second on an A100.
- Revisiting Geometric Obfuscation with Dual Convergent Lines for Privacy-Preserving Image Queries in Visual Localization
-
Addressing the vulnerability of geometric obfuscation—replacing keypoints with random lines—to neighborhood geometric recovery attacks, this paper proposes Dual Convergent Lines (DCL). By lifting each keypoint into a line pointing toward one of two fixed anchors, DCL transforms the attacker's recovery optimization into an ill-posed problem (either collapsing to anchors or diverging with high variance at the boundary). DCL remains compatible with l6P solvers for real-time localization while being the only geometric obfuscation scheme currently resistant to such attacks.
- RMAE-ProGRess: Advancing Semantic Segmentation in Unstructured Environments
-
For semantic segmentation in off-road/unstructured scenes, this paper employs a ViT-MAE encoder (RMAE) with half the layers removed to extract non-adjacent multi-layer features. It is paired with a lightweight decoder, ProGRess, consisting of three modules: Progressive Leapwise Fusion (PLF), Lightweight Channel Attention with Residuals (LCAR), and Bottleneck Feature Fusion (BFF). It achieves SOTA mIoU of 57.41% / 78.95% / 45.63% on RELLIS-3D / RELLIS-3DC / RUGD datasets with significantly fewer parameters.
- RobotSeg: A Model and Dataset for Segmenting Robots in Image and Video
-
This paper proposes RobotSeg, the first foundation model supporting both image and video robot segmentation. Based on SAM 2, it introduces the Structural-enhanced Memory Associator (SEMA), Robot Prompt Generator (RPG), and a label-efficient training strategy. Requiring only first-frame annotations, it achieves 85.1 J&F for Whole Robot segmentation in autonomous mode, surpassing the fine-tuned SAM 2.1 by 4.9 points with only 41.3M parameters (significantly smaller than existing 638M+ solutions).
- RS-SSM: Refining Forgotten Specifics in State Space Model for Video Semantic Segmentation
-
RS-SSM extracts channel-wise specific information distribution features through frequency domain analysis (CwAP) and adaptively inverts the forget gate matrix (FGIR) to supplement and refine spatio-temporal details lost during SSM state space compression. It achieves SOTA performance on four video semantic segmentation benchmarks while maintaining high efficiency.
- S2C2Seg: Semantic-Spatial Consistency and Category Optimization for Open-Vocabulary Segmentation
-
S2C2Seg is a training-free, plug-and-play framework compatible with any CLIP-based segmentation method. It first prunes ultra-large vocabularies into a compact Candidate Subset (CSS) through a three-way scoring mechanism involving "global semantics + local spatial + cross-view consistency." Then, it adaptively fuses CLIP's global features with CLIPSeg's local predictions using category confidence weighting (CSG). Across 8 benchmarks, it provides mIoU improvements of +9.7, +6.8, and +3.4 for SCLIP, ProxyCLIP, and CorrCLIP respectively, pushing the average mIoU to a new SOTA of 51.2%.
- SAGE: Style-Adaptive Generalization for Privacy-Constrained Semantic Segmentation Across Domains
-
Addressing privacy-constrained deployment scenarios where "segmentation models are frozen and internal parameters cannot be accessed," SAGE avoids fine-tuning the backbone. Instead, it learns a generator for each style to produce border-shaped visual prompts, then adaptively fuses these prompts using cross-attention based on the input style to re-apply them to the input image. This allows a frozen model to surpass similar privacy-preserving methods across five DGSS benchmarks and outperform full fine-tuning in all settings.
- SAMIX: Reinforcing SAM2 with Semantic Adapter and Reference Selecting Policy for Mix-Supervised Segmentation
-
SAMIX transforms the video "instance tracking" memory mechanism of SAM2 into cross-image "semantic tracking." By employing a lightweight semantic adapter and a reference selecting policy network trained via reinforcement learning, it selects a set of semantically similar reference images for each weakly-labeled or unlabeled image as dense contextual prompts. This generates high-quality pseudo-labels to unify mixed-supervised training (mask/box/scribble/point/class/unlabeled), achieving SOTA performance on VOC, Cityscapes, Camouflaged Object Detection (COD), and polyp segmentation datasets.
- SAMTok: Representing Any Mask with Two Words
-
SAMTok compresses any region mask into two discrete text tokens, enabling standard MLLMs (like QwenVL) to understand and generate masks just like text via next-token prediction. It requires no specialized segmentation heads or custom losses, and by turning masks into "text," it allows reinforcement learning with pure character-matching rewards for the first time.
- SARMAE: Masked Autoencoder for SAR Representation Learning
-
The SARMAE framework is proposed, achieving noise-robust SAR self-supervised pre-training through the million-scale SAR dataset SAR-1M, Speckle-Aware Representation Enhancement (SARE), and Semantic Anchor Representation Constraint (SARC). It achieves SOTA results across multiple downstream tasks including classification, detection, and segmentation.
- SDDF: Specificity-Driven Dynamic Focusing for Open-Vocabulary Camouflaged Object Detection
-
SDDF proposes a new task, Open-Vocabulary Camouflaged Object Detection (OVCOD), and establishes the OVCOD-D benchmark. It achieves 56.4 AP under open-set settings by using a sub-description principal component contrastive fusion strategy to remove redundant textual noise, alongside specificity-guided regional weak alignment and dynamic focusing mechanisms that sharpen the distinction between camouflaged targets and backgrounds.
- Seeing Beyond: Extrapolative Domain Adaptive Panoramic Segmentation
-
The EDA-PSeg framework is proposed, which utilizes two core modules: the Graph Matching Adapter (GMA) and Euler-Margin Attention (EMA). It achieves open-set unsupervised domain adaptive semantic segmentation from pinhole views to 360° panoramic images for the first time, simultaneously addressing geometric Field of View (FoV) distortion and unknown category discovery.
- Seeing Both Sides: Towards Bidirectional Semantic Alignment for Open-Vocabulary Camouflaged Object Segmentation
-
BaCLIP utilizes a Mutual Refinement Enhancement Module (MREM) to enable bidirectional calibration between text and visual features. By transforming refined text embeddings into semantic prompts for SAM, BaCLIP achieves SOTA performance on the OVCamo benchmark for Open-Vocabulary Camouflaged Object Segmentation (OVCOS) with a lightweight architecture, surpassing the previous SOTA by 4.5% in cIoU.
- SegGBC: Justifiable Coarse-to-Fine Granular-Ball Computing for Enhancing Clustering Image Segmentation
-
SegGBC introduces the "Granular-Ball Computing (GBC)" paradigm, a coarse-to-fine multi-granularity clustering approach, to image segmentation for the first time. It explicitly models inherent image uncertainty using Intuitionistic Fuzzy Sets (IFS) and guides granular-ball splitting and merging with a semantic-aware "Semantic Compactness Measure for Granular Balls (SCMGB)." It can perform unsupervised segmentation independently and serves as a plug-and-play front end that enhances SA / mIoU of existing clustering segmentation methods by more than 3%.
- Selective, Regularized, and Calibrated: Harnessing Vision Foundation Models for Cross-Domain Few-Shot Semantic Segmentation
-
HERA identifies that the failure of using Vision Foundation Models (VFMs) for cross-domain few-shot segmentation stems from "layer sensitivity + attention noise + pixel error." It proposes a three-stage select-regularize-calibrate framework: first, it adaptively selects the most stable layer per episode via Hierarchical Layer Selection (HLS); second, it regularizes the self-attention of that layer using an entropy-gated Gaussian prior (PGR); finally, it fuses multi-path residuals to calibrate pixel predictions (PAC). The entire backbone remains frozen, fine-tuning <2.7% of parameters at test-time without accessing source data, surpassing SOTA by over 4.1 mIoU across four CD-FSS benchmarks.
- SemLayer: Semantic-aware Generative Segmentation and Layer Construction for Abstract Icons
-
SemLayer is proposed as a generative model-based pipeline to restore semantic layered structures from flattened vector icons. The method redefines segmentation as a coloring task via a diffusion model, performs semantic completion of occluded regions, and determines layer order using Integer Linear Programming (ILP), achieving improvements of +5.0 in mIoU and +16.7 in PQ.
- SouPLe: Enhancing Audio-Visual Localization and Segmentation with Learnable Prompt Contexts
-
This paper proposes SouPLe (Sound-aware Prompt Learning), which enhances semantic correspondence between audio embedding tokens and visual features by replacing fixed text prompts in CLIP with learnable context tokens generated from image features. This achieves a 3.75 cIoU improvement on VGG-SS and a 6.32 cIoU improvement in open-set settings, outperforming previous methods.
- SPAR: Single-Pass Any-Resolution ViT for Open-Vocabulary Segmentation
-
This work proposes SPAR, a method that distills the spatial reasoning capabilities of a fine-stride sliding window teacher into a single-pass student. This transforms ViTs into resolution-agnostic dense feature extractors, achieving a 10.5 mIoU improvement over single-pass baselines in open-vocabulary segmentation while being 52x faster than the teacher.
- Structure-Aware Representation Distillation for Tiny-Dense Object Segmentation
-
SARD shifts segmentation knowledge distillation from "mask imitation" to "aligning feature space geometry." It utilizes a "structure importance map" \(W(i)\), synthesized from boundaries, curvature, and spatial crowding, to weight the feature distillation loss. This directs the lightweight student model to concentrate its capacity on boundaries and dense contact zones, consistently improving mIoU and boundary IoU (bIoU) across Cityscapes, ADE20K, and the industrial rock fragmentation dataset RockFrag (specifically +4.3 mIoU / +6.7 bIoU on RockFrag over CWD), with zero additional inference overhead.
- Synthetic Object Compositions for Scalable and Accurate Learning in Detection, Segmentation, and Grounding
-
SOC is an "object-centric" synthetic data pipeline: it first generates 20 million high-quality single-object segmented snippets using generative models, then assembles them into 2 million images using 3D geometric layout and camera configuration augmentations, accompanied by pixel-precise masks, boxes, and referring expressions. Training with only 100,000 synthetic images allows open-vocabulary detection, segmentation, and grounding to outperform real datasets like GRIT 20M and V3Det 200K (+10.9 AP on LVIS, +8.4 NAcc on gRefCOCO).
- Task-Oriented Data Synthesis and Control-Rectify Sampling for Remote Sensing Semantic Segmentation
-
This paper proposes the TODSynth framework, which achieves text-image-mask joint-controlled remote sensing image synthesis via the unified tri-modal attention of MM-DiT. It innovatively introduces the Control-Rectify Flow Matching (CRFM) method, which utilizes the semantic loss of a downstream segmentation model during the sampling stage to dynamically adjust the generation trajectory. This approach improves mIoU by 4.14% and 2.08% on FUSU-4k and LoveDA, respectively.
- Test-Time Multi-Prompt Adaptation for Open-Vocabulary Remote Sensing Image Segmentation
-
Addressing the overlooked "textual ambiguity" problem in open-vocabulary remote sensing image segmentation (OVRSIS), this paper proposes the plug-and-play TMPA: it first utilizes an LLM to expand naive category names into multiple context-aware descriptions, and then calibrates text embeddings during inference guided by high-confidence visual features, achieving an average gain of 4.6% for SegEarth-OV across 17 remote sensing datasets.
- TF-SSD: A Strong Pipeline via Synergic Mask Filter for Training-free Co-salient Object Detection
-
Without training any network, this method treats massive candidate masks generated by SAM as a "raw material pool." It employs three-level quality filtering, intra-image saliency via DINO attention, and cross-image semantic consistency via DINO prototypes to progressively converge masks into co-salient predictions. It achieves a 13.7% higher F-measure on CoCA compared to the previous training-free SOTA.
- The Golden Subspace: Where Efficiency Meets Generalization in Continual Test-Time Adaptation
-
The GOLD framework is proposed for Continual Test-Time Adaptation (CTTA). The core discovery is that the minimal feature update subspace ("Golden Subspace") aligns with the row space of classifier weights and is naturally low-rank. By estimating this subspace online via Average Gradient Outer Product (AGOP) and performing feature adaptation with lightweight scaling vectors, the method achieves SOTA performance on classification and segmentation benchmarks with extremely low computational overhead.
- The Missing Point in Vision Transformers for Universal Image Segmentation
-
This paper argues that the bottleneck of current mask segmentation models (Mask2Former/OneFormer, etc.) lies in mask classification rather than mask generation. It proposes ViT-P—a two-stage framework that decouples mask generation from classification: a frozen proposal generator produces class-agnostic masks, and a ViT-based "point classifier" classifies the maximum value point of each mask. It achieves SOTA on multiple benchmarks, including 54.0 PQ on ADE20K Panoptic and 87.4 mIoU on Cityscapes Semantic.
- The Power of Prior: Training-Free Open-Vocabulary Semantic Segmentation with LLaVA
-
This method treats a frozen LLaVA as a segmenter. Through structured question-answering, the model is prompted to "acknowledge" which classes are present in the image. Activation regions are then back-traced from the visual-category token distances in the LLM's intermediate layers. Finally, high-confidence regions purified by prototypes are fed as point/box prompts to SAM. Without any training, this method establishes a new SOTA on VOC21 (68.0% mIoU) and COCO-Object (42.0%).
- Towards High-Quality Image Segmentation: Improving Topology Accuracy by Penalizing Neighbor Pixels
-
This paper proposes Same Class Neighbor Penalization (SCNP), which significantly improves the topological accuracy of segmentation at an extremely low cost (only 3 lines of code, a few milliseconds/iteration). By replacing each pixel's logit with its worst neighbor prediction within the same class during training, the model is forced to prioritize fixing weak pixels in the neighborhood.
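The neighbor-replacement idea can be sketched for the binary case as below. The 3x3 window, the single foreground logit per pixel, and the reading of "worst" as the minimum logit for foreground pixels and the maximum for background pixels are all assumptions for illustration; the paper's own implementation may differ.

```python
def scnp(logits, labels, radius=1):
    """Replace each pixel's foreground logit with the worst same-class logit nearby."""
    h, w = len(logits), len(logits[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Logits of window pixels sharing this pixel's label (always includes itself).
            same = [logits[j][i]
                    for j in range(max(0, y - radius), min(h, y + radius + 1))
                    for i in range(max(0, x - radius), min(w, x + radius + 1))
                    if labels[j][i] == labels[y][x]]
            # Foreground: lowest logit is the worst; background: highest is the worst.
            out[y][x] = min(same) if labels[y][x] == 1 else max(same)
    return out

# A thin horizontal structure whose middle pixel is weakly predicted.
logits = [
    [-3.0, -2.0, -3.0],
    [ 2.0, -0.5,  3.0],
    [-3.0, -2.0, -3.0],
]
labels = [
    [0, 0, 0],
    [1, 1, 1],
    [0, 0, 0],
]
penalized = scnp(logits, labels)
```

After the replacement, every pixel of the thin structure carries the weak middle logit, so a loss computed on `penalized` keeps pushing on the pixel that would otherwise break the structure's connectivity.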
- Towards Robust Multi-Modal Semantic Segmentation with Teacher-Student Framework and Hybrid Prototype Distillation
-
RobustSeg is proposed—a teacher-student self-distillation framework with a feedback loop. Using a "cross-modal prototype distillation + primary modality IFV distillation" hybrid strategy (HPD), the model maintains robustness during sensor loss or degradation while incurring almost no loss in full-modality accuracy (+2.40% mIoU on DeLiVER for missing modalities, and only -0.1% for full modalities).
- Training-Free Open-Vocabulary Camouflaged Object Segmentation via Fine-Grained Object Binding and Adaptive Hybrid Prompt
-
This paper proposes a completely training-free open-vocabulary camouflaged object segmentation (OVCOS) framework. It utilizes MLLMs to generate fine-grained "object descriptions + background descriptions" to supplement sparse text semantics. A Semantic Probe is then used to decouple object/background features and model category similarity between patches via Spearman rank consistency for precise "object binding." Combined with Entropy-Guided Text Embedding Adjustment (EGTEA) and Adaptive Hybrid Prompt Generation (AHPG) to drive SAM, the method significantly outperforms the previous strongest training-free method, ResCLIP, on OVCamo (average +16.8% across six metrics).
- Uncertainty-Aware Modality Fusion for Unaligned RGB-T Salient Object Detection
-
Addressing salient object detection with spatially unaligned RGB and thermal images, UMFNet reformulates "alignment" from explicit geometric registration to uncertainty representation learning in feature space. It uses pixel-wise Gaussian distributions to implicitly find cross-modal consistent regions and gated fusion guided by uncertainty-derived confidence maps, achieving SOTA across 5 unaligned and 3 aligned benchmarks with better efficiency than registration-based methods.
- Unified Spherical Frontend: Learning Rotation-Equivariant Representations of Spherical Images from Any Camera
-
USF proposes a modular, lens-agnostic spherical vision frontend. By projecting arbitrary calibrated camera images onto a unit sphere and performing spatial-domain spherical resampling, convolution, and pooling, it naturally guarantees rotation equivariance using only distance-weighted kernels. It demonstrates zero-shot generalization robustness to random rotations and cross-lens scenarios in classification, detection, and segmentation tasks.
- Universal 3D Shape Matching via Coarse-to-Fine Language Guidance
-
This paper proposes UniMatch, a semantic-aware coarse-to-fine 3D shape matching framework. The coarse stage establishes part-level correspondence through category-agnostic 3D segmentation, MLLM naming, and FG-CLIP language embeddings. The fine stage learns dense correspondence within an extended functional map framework using a Group-wise Ranking Contrastive (RnC) Loss, achieving universal matching for cross-category and non-isometric shapes.
- V²-SAM: Marrying SAM2 with Multi-Prompt Experts for Cross-View Object Correspondence
-
V²-SAM transforms the single-view segmentation foundation model SAM2 into a cross-view object correspondence framework. It employs a geometry-aware coordinate prompt generator (V2-Anchor) and an appearance-aware visual prompt generator (V2-Visual) to address "where the target is" and "what the target looks like," respectively. A three-expert MoE architecture coupled with a Posterior Cycle-Consistency Selector (PCCS) adaptively identifies the most reliable prediction, achieving new SOTA results on Ego-Exo4D, DAVIS-17, and HANDAL-X.
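One way to picture cycle-consistency selection is sketched below: each expert's prediction in the target view is mapped back to the source view, and the expert whose round trip best overlaps the original query mask is kept. This is a hedged illustration only. The expert names, the set-of-pixels mask representation, and the assumption that back-projected masks are already available are all hypothetical, and the actual PCCS may score candidates differently.

```python
def iou(mask_a, mask_b):
    """Intersection over union of two masks given as sets of (row, col) pixels."""
    union = mask_a | mask_b
    if not union:
        return 0.0
    return len(mask_a & mask_b) / len(union)


def select_by_cycle_consistency(query_mask, back_projected):
    """Pick the expert whose back-projected mask best matches the query mask.

    back_projected maps an expert name to the mask obtained by mapping that
    expert's target-view prediction back to the source view.
    """
    scores = {name: iou(query_mask, mask) for name, mask in back_projected.items()}
    best = max(scores, key=scores.get)
    return best, scores


# Toy example with three hypothetical experts.
query = {(0, 0), (0, 1), (1, 0), (1, 1)}
back = {
    "anchor_expert": {(0, 0), (0, 1), (1, 0), (1, 1)},
    "visual_expert": {(0, 0), (0, 1)},
    "fusion_expert": {(1, 1), (2, 2)},
}

best, scores = select_by_cycle_consistency(query, back)
```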
- VGGT-Segmentor: Geometry-Enhanced Cross-View Segmentation
-
VGGT-Segmentor (VGGT-S) uses the multi-view geometry foundation model VGGT as a frozen backbone and appends a three-stage "Union Segmentation Head." It translates VGGT's reliable object-level feature alignment into pixel-level masks and eliminates the need for paired annotations through single-image self-supervised training. It achieves an average IoU of 67.7%/68.0% on Ego-Exo4D cross-view segmentation, outperforming the previous SOTA by 18.0%/12.8%.
- VidEoMT: Your ViT is Secretly Also a Video Segmentation Model
-
VidEoMT is an encoder-only video segmentation architecture that unifies segmentation and temporal association within a single ViT encoder through query propagation and query fusion. It achieves a 5×–10× speedup (reaching 160 FPS with ViT-L) while maintaining accuracy comparable to the SOTA.
- VIRST: Video-Instructed Reasoning Assistant for SpatioTemporal Segmentation
-
VIRST proposes an end-to-end framework that unifies global video reasoning and pixel-level mask prediction within a single Vision-Language Model (VLM). By incorporating Spatio-Temporal Fusion (STF) and a Temporal Dynamic Anchor Updater (TDAU), it achieves spatio-temporally consistent video segmentation. VIRST reaches 70.8 J&F on ReVOS (+7.5 over SOTA) and 62.9 on MeViS (+9.2), while maintaining an inference speed of 5.1 FPS (1.3× faster than VRS-HQ).
- XSeg: A Large-scale X-ray Contraband Segmentation Benchmark for Real-World Security Screening
-
This study constructs XSeg, the largest X-ray contraband segmentation dataset to date (98,644 images, 295,932 instance masks, 30 fine-grained categories), and proposes APSAM. By leveraging X-ray dual-energy physical properties via an Energy-Aware Encoder and expanding user clicks into additional point prompts with an Adaptive Point Generator, APSAM achieves 72.83% mIoU, outperforming SAM fine-tuning by 4.96%.