✂️ Segmentation
📷 CVPR2026 · 103 paper notes
- 3M-TI: High-Quality Mobile Thermal Imaging via Calibration-free Multi-Camera Cross-Modal Diffusion
-
This paper proposes 3M-TI, a calibration-free multi-camera cross-modal diffusion framework that performs implicit alignment and fusion of uncalibrated RGB–thermal infrared image pairs via a Cross-modal Self-attention Module (CSM) in the VAE latent space. Combined with a misalignment augmentation strategy, the method achieves state-of-the-art performance on mobile thermal imaging super-resolution and significantly improves downstream object detection and semantic segmentation.
- MEDISEG: A Medication Image Instance Segmentation Dataset for Preventing Adverse Drug Events
-
This work introduces MEDISEG, a medication image instance segmentation dataset (8,262 images, 32 pill classes, with real-world occlusion/overlap scenarios). YOLOv8/v9 achieve 99.5% mAP@0.5 on the 3-class subset and 80.1% on the 32-class subset. FsDet few-shot experiments demonstrate that MEDISEG pretraining significantly outperforms CURE in occluded scenarios (1-shot: 0.406 vs. 0.131).
- MEDISEG: A Dataset of Medication Images with Instance Segmentation Masks for Preventing Adverse Drug Events
-
This paper introduces MEDISEG — a dataset of 8,262 real-world multi-pill scene images covering 32 pill types (including overlapping, occluded, and varying-illumination scenarios within dosette boxes), with instance segmentation annotations. YOLOv8/v9 achieve mAP@50 of 99.5% on the 3-Pills subset and 80.1% on the 32-Pills subset. Few-shot experiments demonstrate that MEDISEG as a base training set significantly outperforms the CURE dataset.
- A Mixed Diet Makes DINO An Omnivorous Vision Encoder
-
This paper proposes an Omnivorous Vision Encoder that performs cross-modal alignment distillation training (RGB/Depth/Segmentation) on top of a frozen DINOv2 via lightweight adapters, enabling a single encoder to produce consistent embeddings across different visual modalities while preserving the original discriminative semantics.
- AFRO: Bootstrap Dynamic-Aware 3D Visual Representation for Scalable Robot Learning
-
This paper proposes AFRO, a self-supervised 3D visual pretraining framework that infers latent actions via an Inverse Dynamics Model (IDM), predicts future features via a Diffusion Transformer Forward Dynamics Model (FDM), and enforces temporal symmetry through an inverse consistency constraint. Pretrained on the large-scale RH20T dataset, AFRO achieves an average success rate of 76.0% across 14 MetaWorld tasks (vs. DynaMo-3D 64.9%, PointMAE 63.9%) and attains state-of-the-art performance on 4 real-world tasks.
- Combining Boundary Supervision and Segment-Level Regularization for Fine-Grained Action Segmentation
-
This paper proposes a lightweight dual-loss training framework for temporal action segmentation (TAS) that requires only one additional boundary output channel and two auxiliary losses—a boundary regression loss and a CDF segment shape regularization loss. The framework consistently improves F1 and Edit scores across three architectures (MS-TCN, C2F-TCN, and FACT), demonstrating that precise segmentation can be achieved through simple loss design rather than heavier architectural modifications.
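The two auxiliary losses can be sketched in plain Python. The exact forms below (BCE on a boundary channel, an L1 distance between empirical segment-length CDFs) are illustrative assumptions, not the paper's definitions:

```python
import math

def boundary_targets(labels):
    # 1 at frames where the action label changes, else 0 (an assumed encoding)
    return [0] + [int(labels[t] != labels[t - 1]) for t in range(1, len(labels))]

def boundary_bce(pred_probs, labels, eps=1e-7):
    # binary cross-entropy between predicted boundary probabilities and change points
    tgt = boundary_targets(labels)
    return -sum(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
                for p, y in zip(pred_probs, tgt)) / len(labels)

def segment_lengths(labels):
    # run lengths of consecutive identical labels
    lens, run = [], 1
    for t in range(1, len(labels)):
        if labels[t] == labels[t - 1]:
            run += 1
        else:
            lens.append(run)
            run = 1
    lens.append(run)
    return lens

def cdf_shape_loss(pred_labels, gt_labels):
    # L1 distance between empirical CDFs of predicted vs. ground-truth segment
    # lengths, a stand-in for the paper's CDF segment-shape regularizer
    def cdf(lens, grid):
        return [sum(l <= g for l in lens) / len(lens) for g in grid]
    grid = range(1, max(segment_lengths(gt_labels) + segment_lengths(pred_labels)) + 1)
    cp = cdf(segment_lengths(pred_labels), grid)
    cg = cdf(segment_lengths(gt_labels), grid)
    return sum(abs(a - b) for a, b in zip(cp, cg)) / len(grid)
```

Both terms are plug-in losses on top of any frame-wise backbone, which matches the paper's claim of improving MS-TCN, C2F-TCN, and FACT without architectural changes.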
- Brewing Stronger Features: Dual-Teacher Distillation for Multispectral Earth Observation
-
This paper proposes DEO (Distillation for Earth Observation), a dual-teacher contrastive distillation framework that employs a multispectral self-distillation teacher to learn spectral representations and a frozen optical VFM teacher (DINOv3) to inject high-level semantic priors. The resulting single student network excels at both optical and multispectral remote sensing tasks, achieving state-of-the-art performance across semantic segmentation, change detection, and classification.
- CA-LoRA: Concept-Aware LoRA for Domain-Aligned Segmentation Dataset Generation
-
This paper proposes Concept-Aware LoRA (CA-LoRA), which automatically identifies weight layers in a T2I model that are sensitive to specific concepts (e.g., viewpoint, style) and applies LoRA fine-tuning exclusively to those layers. This selective adaptation achieves domain alignment while preserving the diverse generation capability of the pretrained model, enabling the synthesis of high-quality urban-scene segmentation datasets.
- CLIP Is Shortsighted: Paying Attention Beyond the First Sentence
-
This paper reveals a systematic bias in CLIP-family models toward the summary sentence and early tokens in long-form text, and proposes DeBias-CLIP, which eliminates this bias via three text augmentation strategies — summary removal, sentence sub-sampling, and token padding — achieving state-of-the-art performance on both long- and short-text retrieval benchmarks without introducing any additional parameters.
- DeBias-CLIP: CLIP Is Shortsighted — Paying Attention Beyond the First Sentence
-
The paper shows that CLIP and Long-CLIP suffer from a serious early-token bias and a first-sentence summary shortcut. DeBias-CLIP uses three simple augmentations — removing the summary sentence, sentence sub-sampling, and prefix-token padding — that introduce no extra parameters and reach SOTA on multiple long-text retrieval benchmarks.
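The three text augmentations are simple enough to sketch directly; the function names and the exact sentence/token handling below are assumptions for illustration:

```python
import random

def remove_summary(sentences):
    # drop the leading summary sentence so later sentences receive supervision
    return sentences[1:] if len(sentences) > 1 else list(sentences)

def subsample_sentences(sentences, keep, rng):
    # keep a random subset of sentences, preserving their original order
    idx = sorted(rng.sample(range(len(sentences)), min(keep, len(sentences))))
    return [sentences[i] for i in idx]

def pad_prefix(tokens, pad_token, n_pad):
    # shift real tokens away from the over-attended early positions
    return [pad_token] * n_pad + list(tokens)
```

All three operate purely on the text input, which is why the method adds no parameters to the CLIP encoder.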
- Comparative Evaluation of Traditional Methods and Deep Learning for Brain Glioma Imaging (Review Paper)
-
This paper provides a systematic review of two major technical paradigms for brain glioma MRI segmentation and classification — traditional methods (thresholding, region growing, clustering, etc.) and deep learning methods (CNN-based architectures). Through a methodological taxonomy and performance comparison, the paper concludes that CNN architectures comprehensively outperform traditional techniques, while also noting that semi-automatic methods are preferred by radiologists in clinical settings due to their controllability.
- Comparative Evaluation of Traditional Methods and Deep Learning for Brain Glioma Imaging
-
A systematic review paper that comprehensively compares traditional methods (thresholding, region growing, fuzzy clustering, etc.) and deep learning methods (CNN, U-Net, SegNet, etc.) for brain glioma MRI segmentation and classification, concluding that CNN-based architectures consistently outperform traditional techniques in both accuracy and degree of automation.
- Concept-Guided Fine-Tuning: Steering ViTs away from Spurious Correlations to Improve Robustness
-
This paper proposes CFT (Concept-Guided Fine-Tuning), which leverages LLM-generated class-level semantic concepts and zero-shot segmentation via GroundedSAM to obtain concept masks. ViTs are then fine-tuned by aligning AttnLRP relevance maps with concept regions. Using only 1,500 training images, CFT achieves substantial robustness improvements across 5 OOD benchmarks.
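The alignment objective can be pictured as penalizing attribution mass that falls outside the concept regions. The loss below is a hypothetical surrogate, not the paper's exact formulation:

```python
def concept_alignment_loss(relevance, concept_mask, eps=1e-8):
    # fraction of positive attribution mass falling OUTSIDE the concept region;
    # a hedged stand-in for aligning AttnLRP relevance maps with GroundedSAM masks
    rel = [max(r, 0.0) for r in relevance]
    total = sum(rel) + eps
    inside = sum(r for r, m in zip(rel, concept_mask) if m)
    return 1.0 - inside / total
```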
- ConceptPrism: Concept Disentanglement in Personalized Diffusion Models via Residual Token Optimization
-
This paper proposes ConceptPrism, which introduces image-level residual tokens and cross-image repulsion losses to automatically disentangle shared target concepts from image-specific residual information in personalized T2I diffusion models, achieving state-of-the-art performance on DreamBench across all three metrics: CLIP-T, DINO, and CLIP-I.
- CrossEarth-SAR: A SAR-Centric and Billion-Scale Geospatial Foundation Model for Domain Generalizable Semantic Segmentation
-
This paper presents CrossEarth-SAR, the first billion-scale SAR vision foundation model, which integrates a physics-guided sparse MoE architecture with SAR physical descriptors. It achieves state-of-the-art performance on 20 out of 22 cross-domain semantic segmentation benchmarks, surpassing prior methods by over 10% mIoU in certain multi-gap scenarios.
- CrossEarth-SAR: A SAR-Centric and Billion-Scale Geospatial Foundation Model for Domain Generalizable Semantic Segmentation
-
This paper introduces CrossEarth-SAR, the first billion-scale SAR visual foundation model, which replaces the FFN in each Transformer block of a DINOv2 ViT backbone with a physics-guided sparse Mixture-of-Experts (MoE) layer. Routing is conditioned on three SAR physical descriptors—directional entropy, equivalent number of looks, and local roughness. The work also contributes a 200K-scale cross-domain pretraining dataset and a benchmark of 22 evaluation settings covering 8 types of domain shift. CrossEarth-SAR achieves state-of-the-art performance on 20 out of 22 cross-domain semantic segmentation benchmarks.
- CTFS: Collaborative Teacher Framework for Forward-Looking Sonar Image Semantic Segmentation with Extremely Limited Labels
-
This paper proposes CTFS, the first semi-supervised semantic segmentation framework specifically designed for forward-looking sonar (FLS) images. It introduces a multi-teacher collaboration mechanism (one general teacher + two sonar-specific teachers simulating acoustic shadow and energy attenuation, respectively), combined with multi-view pseudo-label reliability assessment (intra-teacher stability × inter-teacher consistency). With only 2% labeled data, CTFS achieves 62.32% mIoU, surpassing the state of the art by 5.08 percentage points.
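The multiplicative reliability score can be sketched as follows; approximating intra-teacher stability by mean max-probability and inter-teacher consistency by argmax agreement is an assumption made here for illustration:

```python
def pseudo_label_reliability(teacher_probs):
    # teacher_probs: per teacher, a per-pixel list of class probabilities.
    # reliability = intra-teacher stability x inter-teacher consistency;
    # both factors below are illustrative stand-ins for the paper's measures
    n_teachers = len(teacher_probs)
    n_pixels = len(teacher_probs[0])
    scores = []
    for p in range(n_pixels):
        per_teacher = [t[p] for t in teacher_probs]
        stability = sum(max(q) for q in per_teacher) / n_teachers
        labels = [q.index(max(q)) for q in per_teacher]
        consistency = max(labels.count(l) for l in set(labels)) / n_teachers
        scores.append(stability * consistency)
    return scores
```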
- Data Warmup: Complexity-Aware Curricula for Efficient Diffusion Training
-
This paper proposes Data Warmup, a curriculum learning strategy that requires no modifications to the model or loss function. It schedules training images from easy to hard using a semantics-aware image complexity metric (foreground dominance × foreground typicality). On ImageNet 256×256, it yields improvements of up to +6.11 IS and −3.41 FID for the SiT family. Notably, the reversed curriculum (hard-to-easy) performs worse than the uniform baseline, demonstrating that ordering itself is the key mechanism.
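The easy-to-hard scheduler reduces to scoring and sorting. The product-then-invert form of the complexity metric below is a hypothetical instantiation of "foreground dominance × foreground typicality":

```python
def complexity(fg_dominance, fg_typicality):
    # images with a dominant, typical foreground are treated as 'easy';
    # this exact functional form is an assumption for illustration
    return 1.0 - fg_dominance * fg_typicality

def warmup_stages(samples, n_stages):
    # order samples easy-to-hard and split them into curriculum stages
    ordered = sorted(samples, key=lambda s: complexity(s["dom"], s["typ"]))
    k = max(1, len(ordered) // n_stages)
    return [ordered[i:i + k] for i in range(0, len(ordered), k)]
```

Reversing the sort would give the hard-to-easy curriculum the paper reports as worse than the uniform baseline.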
- DeDelayed: Deleting Remote Inference Delay via On-Device Correction
-
DeDelayed is an edge-cloud collaborative inference framework that combines a lightweight on-device image model with a latency-aware cloud-side temporal prediction video model. By training the network with temporally predictive objectives to compensate for communication delay, DeDelayed achieves gains of 6.4 mIoU over local-only inference and 9.8 mIoU over remote-only inference under 100 ms latency.
- Detecting AI-Generated Forgeries via Iterative Manifold Deviation Amplification
-
This paper proposes IFA-Net, which detects AI-generated forgeries from the perspective of "modeling what is real" rather than "learning what is fake." A frozen MAE reconstructs the input to produce residuals that expose regions deviating from the natural image manifold. A two-stage closed-loop pipeline—coarse detection → task-adaptive prior injection → residual amplification → refinement—iteratively amplifies manifold deviation, achieving state-of-the-art performance on both diffusion inpainting and traditional image tampering detection.
- Direct Segmentation without Logits Optimization for Training-Free Open-Vocabulary Semantic Segmentation
-
This paper proposes an open-vocabulary semantic segmentation method that bypasses the logits optimization process entirely. Based on the assumption that homogeneous regions exhibit consistent distributional discrepancies from their logits to a degenerate distribution, the method directly constructs segmentation maps via either the optimal transport path or the analytical solution of maximum transport velocity. The approach achieves state-of-the-art performance on 8 benchmarks without requiring training or model-specific modulation.
- DSS: Discover, Segment, and Select for Zero-shot Camouflaged Object Segmentation
-
This paper proposes DSS, a three-stage progressive pipeline (Discover→Segment→Select) that achieves zero-shot, training-free camouflaged object segmentation by: discovering foreground regions via self-supervised visual encoders and Leiden clustering (FOD); generating candidate masks using SAM; and selecting the optimal mask through heuristic scoring combined with iterative pairwise MLLM comparison. The method demonstrates particularly strong performance in multi-instance camouflage scenarios.
- DPAD: Discriminative Perception via Anchored Description for Reasoning Segmentation
-
Geometric rewards in RL+GRPO training for reasoning segmentation (RS) cannot constrain whether the reasoning chain focuses on the target's unique attributes. To address this, DPAD has an MLLM generate a reasoning chain, a geometric localization, and an anchored description, and introduces a CLIP-based Discriminative Perception Reward that compares the description's similarity to the ROI versus the AOI, forcing the caption to be more discriminative and thereby indirectly constraining the reasoning chain to focus on the target. On ReasonSeg, cIoU improves by 3.09% while reasoning-chain length decreases by 42%.
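The discriminative reward can be sketched as a similarity margin in embedding space; the difference form and the max over distractors are assumptions made here, not the paper's exact reward:

```python
import math

def cosine(u, v):
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)

def discriminative_reward(desc_emb, roi_emb, aoi_embs):
    # reward descriptions that are closer (in a CLIP-like embedding space) to the
    # target ROI than to the hardest distractor AOI
    pos = cosine(desc_emb, roi_emb)
    neg = max(cosine(desc_emb, a) for a in aoi_embs)
    return pos - neg
```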
- DSFlash: Comprehensive Panoptic Scene Graph Generation in Realtime
-
This paper proposes DSFlash, a low-latency panoptic scene graph generation model that achieves real-time inference at 56 FPS on an RTX 3090 while maintaining state-of-the-art performance (mR@50=30.9), through a unified backbone, bidirectional relation prediction, and mask-guided dynamic pruning.
- DSFlash: Comprehensive Panoptic Scene Graph Generation in Realtime
-
DSFlash combines a unified segmentation/relation backbone, a gated bidirectional relation head, and mask-based dynamic patch pruning to deliver SOTA panoptic scene graph generation on PSG at mR@50=30.9 with only 18 ms latency (56 FPS).
- DSS: Discover, Segment, and Select - A Progressive Mechanism for Zero-shot Camouflaged Object Segmentation
-
DSS is a three-stage zero-shot camouflaged object segmentation framework: (1) Discover candidate regions via DINOv2 feature clustering and part combination (FOD); (2) Segment using SAM; (3) Select the optimal mask via pairwise MLLM comparison (SMS). Requiring no training, DSS achieves comprehensive improvements over prior zero-shot methods on four COD benchmarks, with particularly pronounced advantages in multi-instance scenarios.
- Efficient RGB-D Scene Understanding via Multi-task Adaptive Learning and Cross-dimensional Feature Guidance
-
This paper proposes an efficient RGB-D multi-task scene understanding network. An improved fusion encoder exploits channel redundancy to accelerate feature extraction. A Normalization-Focused Channel Layer (NFCL) and a Context Feature Interaction Layer (CFIL) provide cross-dimensional feature guidance. A batch-level multi-task adaptive loss function dynamically adjusts per-task learning weights. The unified framework simultaneously handles five tasks—semantic segmentation, instance segmentation, orientation estimation, panoptic segmentation, and scene classification—on NYUv2, SUN RGB-D, and Cityscapes, achieving advantages in both accuracy and speed.
- Efficient RGB-D Scene Understanding via Multi-task Adaptive Learning and Cross-dimensional Feature Guidance
-
This paper proposes an efficient RGB-D multi-task scene understanding network. A partial-channel convolution fusion encoder reduces FLOPs to 1/16 of standard convolution. A Normalized Focus Channel Layer (NFCL) and a Context Feature Interaction Layer (CFIL) enable cross-dimensional feature guidance. A batch-level multi-task adaptive loss dynamically balances five tasks. The method achieves 49.82 mIoU on NYUv2 at 20.33 FPS, which is 24% faster than EMSAFormer.
- ELVIS: Enhance Low-Light for Video Instance Segmentation in the Dark
-
ELVIS proposes the first low-light video instance segmentation (VIS) framework, comprising a physics-driven synthetic low-light video pipeline (with motion blur modeling), a calibration-free degradation parameter estimation network (VDP-Net), and an enhancement decoder integrated into the VIS architecture for degradation-content decoupling. It achieves gains of +3.7 AP and +2.8 AP on synthetic and real low-light videos, respectively.
- EReCu: Pseudo-label Evolution Fusion and Refinement with Multi-Cue Learning for Unsupervised Camouflage Detection
-
This paper proposes EReCu, a unified unsupervised camouflaged object detection framework consisting of three synergistic modules — Multi-cue Native Perception (MNP), Pseudo-label Evolution Fusion (PEF), and Local Pseudo-label Refinement (LPR) — achieving boundary-accurate and detail-rich camouflaged object segmentation without any manual annotations.
- EReCu: Pseudo-label Evolution Fusion and Refinement with Multi-Cue Learning for Unsupervised Camouflage Detection
-
EReCu is a unified framework built upon a DINO teacher-student architecture that employs Multi-cue Native Perception (MNP) to extract texture and semantic priors from raw images, guiding Pseudo-label Evolution Fusion (PEF) for global pseudo-label evolution, and Local Pseudo-label Refinement (LPR) for boundary detail recovery. It is the first framework to unify the two dominant UCOD paradigms—pseudo-label guidance and feature learning—achieving state-of-the-art performance across four COD benchmarks.
- FCL-COD: Weakly Supervised Camouflaged Object Detection with Frequency-aware and Contrastive Learning
-
This paper proposes FCL-COD, a framework that injects camouflaged scene knowledge into SAM via Frequency-aware Low-Rank Adaptation (FoRA), enhances foreground-background feature separation through Gradient-aware Contrastive Learning (GCL), and refines boundary-sensitive features with Multi-Scale Frequency Attention (MSFA). Under a weakly supervised setting using only bounding box annotations, FCL-COD surpasses fully supervised state-of-the-art methods.
- Follow the Saliency: Supervised Saliency for Retrieval-augmented Dense Video Captioning
-
This paper proposes STaRC, a framework that leverages supervised frame-level saliency learning to jointly drive retrieval (saliency-guided segmentation and retrieval) and caption generation (saliency prompt injection into the decoder), achieving substantial improvements in temporal alignment and caption quality for dense video captioning (DVC).
- FoV-Net: Rotation-Invariant CAD B-rep Learning via Field-of-View Ray Casting
-
This paper presents FoV-Net, the first rotation-invariant framework for CAD B-rep learning that simultaneously captures local surface geometry and global structural context. By introducing a Local Reference Frame UV grid (LRF UV) and a Field-of-View (FoV) ray casting descriptor, FoV-Net achieves robust classification and segmentation under arbitrary SO(3) rotations.
- From 2D Alignment to 3D Plausibility: Unifying Heterogeneous 2D Priors and Penetration-Free Diffusion for Occlusion-Robust Two-Hand Reconstruction
-
This work decouples two-hand reconstruction into 2D structural alignment (fusing keypoint, segmentation, and depth priors) and 3D spatial interaction alignment (a penetration-free diffusion model), achieving an MPJPE of 5.36 mm on InterHand2.6M and substantially outperforming the state of the art.
- Generalizable Knowledge Distillation from Vision Foundation Models for Semantic Segmentation
-
This paper proposes Generalizable Knowledge Distillation (GKD), which transfers the cross-domain generalization capability of vision foundation models (VFMs) to lightweight student models through a multi-stage distillation scheme that decouples representation learning from task learning, along with a query-based soft distillation mechanism. GKD achieves an average improvement of +10.6% mIoU under the F2L setting.
- GenMask: Adapting DiT for Segmentation via Direct Mask Generation
-
This paper proposes GenMask, which directly trains a DiT to generate binary segmentation masks (sharing the same model as color image generation). By discovering that the VAE latent representations of binary masks are linearly separable, the authors design an extreme heavy-tailed timestep sampling strategy tailored for segmentation, enabling single-step inference to produce segmentation results, achieving state-of-the-art performance on referring and reasoning segmentation benchmarks.
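A heavy-tailed timestep sampler can be illustrated with a simple power law. Both the power-law form and the direction of the concentration (toward high-noise timesteps) are assumptions here, not the paper's exact schedule:

```python
import random

def heavy_tailed_timestep(rng, T=1000, alpha=8.0):
    # power-law sampler whose mass concentrates at large timesteps for alpha >> 1;
    # an illustrative stand-in for the paper's extreme heavy-tailed strategy
    u = rng.random()
    return int(T * u ** (1.0 / alpha))
```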
- GeoGuide: Hierarchical Geometric Guidance for Open-Vocabulary 3D Semantic Segmentation
-
This paper proposes GeoGuide, a hierarchical geometric guidance framework for open-vocabulary 3D semantic segmentation. It leverages geometric priors from pretrained 3D models to correct geometric bias in 2D-to-3D knowledge distillation via three complementary modules: uncertainty-based superpoint distillation, instance-level mask reconstruction, and inter-instance relation consistency. GeoGuide achieves state-of-the-art performance of 64.8 mIoU on ScanNet v2.
- GeomPrompt: Geometric Prompt Learning for RGB-D Semantic Segmentation Under Missing and Degraded Depth
-
GeomPrompt learns lightweight geometric prompt modules for frozen RGB-D segmentation models, synthesizing task-driven depth proxy signals from RGB (without depth supervision). It achieves gains of +6.1 mIoU under missing depth and up to +3.6 mIoU under degraded depth.
- GeoSURGE: Geo-localization using Semantic Fusion with Hierarchy of Geographic Embeddings
-
GeoSURGE introduces hierarchical geographic embeddings and a semantic fusion module, framing global image geo-localization as a matching problem between visual representations and learned geographic representations. The method achieves state-of-the-art performance on 22 out of 25 metrics across 5 benchmarks.
- GKD: Generalizable Knowledge Distillation from Vision Foundation Models for Semantic Segmentation
-
This paper proposes the GKD framework, which distills compact student models with cross-domain generalization capability from VFMs via a multi-stage decoupled distillation strategy (generic feature learning → frozen encoder → task head training) combined with a Query-based Soft Distillation (QSD) mechanism. GKD achieves an average mIoU gain of +10.6% under the F2L setting and +1.9% under the F2F setting.
- Heuristic Self-Paced Learning for Domain Adaptive Semantic Segmentation under Adverse Conditions
-
This paper reformulates class-level curriculum learning in unsupervised domain adaptation as a sequential decision-making problem under the reinforcement learning framework. The proposed HeuSCM framework achieves autonomous curriculum scheduling via high-dimensional semantic state perception and category-fair policy gradients, attaining state-of-the-art performance (72.9 mIoU) on ACDC, Dark Zurich, and Nighttime Driving.
- HippoMM: Hippocampal-inspired Multimodal Memory for Long Audiovisual Event Understanding
-
HippoMM maps three core hippocampal cognitive mechanisms—pattern separation (episodic segmentation), memory consolidation (semantic compression), and pattern completion (hierarchical retrieval)—into a computational architecture for episodic memory formation and cross-modal associative recall in long audiovisual streams. On the authors' proposed benchmark HippoVlog, the system achieves 78.2% accuracy while being 5× faster than retrieval-augmented baselines.
- INSID3: Training-Free In-Context Segmentation with DINOv3
-
INSID3 is a training-free in-context segmentation method that relies exclusively on frozen DINOv3 features. Through a three-stage pipeline consisting of positional debiasing, fine-grained clustering, and seed cluster aggregation, it surpasses methods that depend on SAM or fine-tuning across semantic, part-level, and personalized segmentation tasks using a single self-supervised backbone, achieving an average mIoU improvement of +7.5%.
- Kαlos finds Consensus: A Meta-Algorithm for Evaluating Inter-Annotator Agreement in Complex Vision Tasks
-
This paper proposes the KαLOS meta-algorithm, which transforms the complex problem of spatial-categorical annotation agreement into a standard nominal reliability matrix via a "localize-then-classify" principle and data-driven parameter calibration, enabling unified evaluation of inter-annotator agreement (IAA) across diverse vision tasks including object detection, instance segmentation, and pose estimation.
- Learning Cross-View Object Correspondence via Cycle-Consistent Mask Prediction
-
This paper proposes CCMP, a cross-view object correspondence framework based on conditional binary segmentation. It leverages cycle-consistency constraints as a self-supervised signal and supports test-time training (TTT), achieving state-of-the-art performance of 44.57% mIoU on Ego-Exo4D.
- LEMMA: Laplacian Pyramids for Efficient Marine Semantic Segmentation
-
This paper proposes LEMMA, a lightweight marine semantic segmentation model based on Laplacian pyramids, which replaces deep feature computation with pyramid-decomposed edge information. LEMMA achieves SOTA-level segmentation accuracy (98.97% mIoU on MaSTr1325) with a 71× reduction in parameter count.
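The pyramid-decomposed edge signal is standard Laplacian pyramid machinery; a 1-D sketch (averaging downsample, nearest-neighbour upsample, both chosen here for brevity) shows the band-pass residuals LEMMA exploits:

```python
def downsample(x):
    # halve resolution by averaging adjacent pairs (1-D for brevity)
    return [(x[i] + x[i + 1]) / 2 for i in range(0, len(x) - 1, 2)]

def upsample(x, n):
    # nearest-neighbour upsample back to length n
    return [x[min(i // 2, len(x) - 1)] for i in range(n)]

def laplacian_pyramid(x, levels):
    # band-pass (edge-like) residuals per level plus the coarse base; LEMMA feeds
    # such cheap residuals in place of deep feature computation
    pyr, cur = [], list(x)
    for _ in range(levels):
        low = downsample(cur)
        pyr.append([a - b for a, b in zip(cur, upsample(low, len(cur)))])
        cur = low
    pyr.append(cur)
    return pyr

def reconstruct(pyr):
    # exact inverse: add each residual back onto the upsampled coarse signal
    cur = pyr[-1]
    for lap in reversed(pyr[:-1]):
        cur = [a + b for a, b in zip(lap, upsample(cur, len(lap)))]
    return cur
```

Because the residuals are computed against the upsampled coarse signal, the decomposition is lossless by construction.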
- Live Interactive Training for Video Segmentation
-
LIT (Live Interactive Training) proposes a framework enabling interactive visual systems (e.g., SAM2) to learn online from user corrections during inference. Its lightweight implementation, LIT-LoRA, generalizes user feedback to subsequent frames by updating LoRA modules in real time, reducing user corrections by 18–34% on challenging VOS benchmarks with a training overhead of only ~0.5 seconds per correction.
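Updating a LoRA module leaves the base weights frozen and only touches two small matrices, which is what makes the ~0.5 s per-correction overhead plausible. A plain-Python sketch of the effective weight (names and shapes are illustrative):

```python
def lora_effective_weight(W, A, B, scale=1.0):
    # effective weight W + scale * (B @ A), where A (r x in) and B (out x r) are
    # the small matrices updated online from user corrections
    r = len(A)
    out = []
    for i, row in enumerate(W):
        delta = [scale * sum(B[i][k] * A[k][j] for k in range(r))
                 for j in range(len(row))]
        out.append([w + d for w, d in zip(row, delta)])
    return out
```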
- LoD-Loc v3: Generalized Aerial Localization in Dense Cities using Instance Silhouette Alignment
-
This paper proposes LoD-Loc v3, which addresses two critical limitations of LoD-based UAV localization — poor cross-scene generalization and pose ambiguity in dense urban areas — by constructing a large-scale synthetic instance segmentation dataset (InsLoD-Loc, 100K images) and upgrading the localization paradigm from semantic to instance silhouette alignment. On the Tokyo-LoDv3 dense scene benchmark, the method achieves a ~2000% improvement in (2m, 2°) accuracy over the previous state of the art.
- Looking Beyond the Window: Global-Local Aligned CLIP for Training-free Open-Vocabulary Semantic Segmentation
-
This paper proposes GLA-CLIP to address cross-window semantic inconsistency introduced by sliding-window inference in training-free open-vocabulary semantic segmentation. Three mechanisms—global key-value extension, proxy anchor attention, and dynamic normalization—are introduced to integrate global context across windows, achieving state-of-the-art average mIoU of 44.0% across 8 benchmarks.
- Love Me, Love My Label: Rethinking the Role of Labels in Prompt Retrieval for Visual In-Context Learning
-
This paper identifies a critical yet overlooked problem in visual in-context learning (VICL): existing prompt retrieval methods ignore label information, leading to label inconsistency. The proposed LaPR framework addresses this through joint image-label representation and a mixture-of-experts (MoE) mechanism, achieving label-aware prompt retrieval that consistently outperforms state-of-the-art methods on foreground segmentation, object detection, and image colorization tasks.
- Low-Data Supervised Adaptation Outperforms Prompting for Cloud Segmentation Under Domain Shift
-
This paper systematically demonstrates that prompt engineering completely fails to bridge the domain gap of vision-language models in satellite remote sensing cloud segmentation, and that fine-tuning with as little as 0.1% of labeled data (~8 images) suffices to surpass all zero-shot prompting strategies.
- Making Training-Free Diffusion Segmentors Scale with the Generative Power
-
This paper identifies the fundamental reasons why existing training-free diffusion segmentation methods fail to scale with the generative power of stronger models — namely, two gaps between cross-attention maps and semantic relevance (an aggregation gap and a score imbalance gap). It proposes two techniques, auto aggregation and per-pixel rescaling, forming the GoCA framework, which for the first time enables stronger diffusion models (SDXL, PixArt-Sigma, Flux) to significantly outperform weaker ones in training-free semantic segmentation.
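The score-imbalance fix can be pictured as normalizing each pixel's class scores independently. Min-max rescaling is an illustrative choice here, not the paper's exact per-pixel rescaling formula:

```python
def per_pixel_rescale(maps):
    # maps[c][p]: raw cross-attention score of class c at pixel p.
    # rescale each pixel's class scores to [0, 1] so no class dominates globally
    n_px = len(maps[0])
    out = [[0.0] * n_px for _ in maps]
    for p in range(n_px):
        col = [m[p] for m in maps]
        lo, hi = min(col), max(col)
        for c, v in enumerate(col):
            out[c][p] = (v - lo) / (hi - lo) if hi > lo else 0.0
    return out
```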
- Masked Representation Modeling for Domain-Adaptive Segmentation
-
This paper proposes Masked Representation Modeling (MRM), which performs masking and reconstruction in latent space rather than pixel space as a plug-and-play auxiliary task for UDA segmentation, yielding an average gain of +2.3 mIoU across 4 baselines on GTA→Cityscapes.
- MatAnyone 2: Scaling Video Matting via a Learned Quality Evaluator
-
This paper proposes a learned Matting Quality Evaluator (MQE) that assesses alpha quality at the pixel level without ground-truth supervision. MQE serves dual roles as an online training guide and an offline data filter, enabling the construction of VMReal — a real-world video matting dataset comprising 28K clips / 2.4M frames. Combined with a reference-frame training strategy, the proposed method significantly outperforms all existing approaches.
- A Mixed Diet Makes DINO An Omnivorous Vision Encoder
-
This paper identifies severe cross-modal feature misalignment in pretrained vision encoders such as DINOv2 (across RGB, depth, and segmentation modalities), and proposes the Omnivorous framework, which trains lightweight adapters on the final few layers of a frozen backbone using an alignment loss, an anchoring loss, and modality mixup augmentation. The resulting encoder constructs a unified, modality-agnostic feature space that substantially outperforms baselines on cross-modal retrieval while maintaining or improving downstream task performance.
- MixerCSeg: An Efficient Mixer Architecture for Crack Segmentation via Decoupled Mamba Attention
-
This paper proposes MixerCSeg, which analyzes Mamba's implicit attention mechanism to decouple channels into global and local branches, enhanced by Self-Attention and CNN respectively, and combines them with Direction-guided Edge Gated Convolution, achieving state-of-the-art crack segmentation performance at only 2.05 GFLOPs and 2.54M parameters.
- MPM: Mutual Pair Merging for Efficient Vision Transformers
-
This paper proposes Mutual Pair Merging (MPM), a parameter-free, training-free token merging module for ViTs that reduces sequence length via mutual nearest-neighbor pairing and mean fusion. On ADE20K, MPM achieves a 60% latency reduction on Raspberry Pi 5 for ViT-Tiny and a 20% throughput improvement on H100 with FlashAttention-2, while keeping mIoU degradation within 3%.
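The mutual nearest-neighbour pairing with mean fusion described above can be sketched as follows; this is a plain-Python reading of the one-line description, with cosine similarity assumed as the pairing metric:

```python
import math

def _cos(u, v):
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)

def mutual_pair_merge(tokens):
    # merge mutual nearest-neighbour pairs with mean fusion; unpaired tokens pass
    # through untouched. Parameter-free and training-free, as in MPM.
    n = len(tokens)
    nn = []
    for i in range(n):
        sims = [(_cos(tokens[i], tokens[j]), j) for j in range(n) if j != i]
        nn.append(max(sims)[1])
    merged, used = [], set()
    for i in range(n):
        if i in used:
            continue
        j = nn[i]
        if j > i and nn[j] == i:
            merged.append([(a + b) / 2 for a, b in zip(tokens[i], tokens[j])])
            used.update((i, j))
        else:
            merged.append(list(tokens[i]))
            used.add(i)
    return merged
```

Since only mutual pairs merge, at most half the tokens disappear per application, which is how such modules trade sequence length against accuracy.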
- Masked Representation Modeling for Domain-Adaptive Segmentation
-
The paper proposes Masked Representation Modeling (MRM), which randomly masks and reconstructs features in the encoder's latent space and supervises the reconstruction with a pixel classification loss. As a plug-in auxiliary task it lifts four UDA baselines by an average of +2.3 / +2.8 mIoU on GTA→Cityscapes / Synthia→Cityscapes, with zero inference-time overhead.
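Masking in latent rather than pixel space reduces to corrupting feature vectors instead of image patches. The sketch below uses a generic masked-position MSE; the paper instead supervises the reconstruction with a pixel classification loss, so this objective is a stand-in:

```python
import random

def mask_latent(feats, mask_ratio, rng):
    # zero out a random fraction of latent feature vectors (per spatial position)
    n = len(feats)
    masked = set(rng.sample(range(n), int(n * mask_ratio)))
    corrupted = [[0.0] * len(f) if i in masked else list(f)
                 for i, f in enumerate(feats)]
    return corrupted, masked

def masked_mse(recon, feats, masked):
    # reconstruction error restricted to the masked positions only
    terms = [sum((a - b) ** 2 for a, b in zip(recon[i], feats[i])) for i in masked]
    return sum(terms) / (len(terms) * len(feats[0]))
```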
- Seeing Through the Tool: A Controlled Benchmark for Occlusion Robustness in Foundation Segmentation Models
-
This paper proposes OccSAM-Bench, a benchmark that systematically evaluates the occlusion robustness of SAM-family models in endoscopic scenes via synthetically generated surgical instrument occlusions. A three-region evaluation protocol is introduced to reveal two distinct behavioral patterns under occlusion: occlusion-aware and occlusion-agnostic.
- PCA-Seg: Revisiting Cost Aggregation for Open-Vocabulary Semantic and Part Segmentation
-
PCA-Seg proposes a Parallel Cost Aggregation (PCA) paradigm to replace the conventional serial spatial-categorical aggregation architecture. It efficiently integrates semantic and spatial context streams via an Expert-driven Perception Learning (EPL) module, and eliminates redundancy between the two knowledge streams through a Feature Orthogonal Decoupling (FOD) strategy. Each parallel block adds only 0.35M parameters while achieving state-of-the-art performance across 8 open-vocabulary semantic and part segmentation benchmarks.
- PCA-Seg: Revisiting Cost Aggregation for Open-Vocabulary Semantic and Part Segmentation
-
This paper revisits cost aggregation strategies and proposes PCA-Seg, a parallel architecture that replaces the conventional serial design. It integrates class-semantic and spatial-contextual information via an Expert-driven Perception Learning (EPL) module, and employs a Feature Orthogonalization Decoupling (FOD) strategy to reduce redundancy. PCA-Seg achieves state-of-the-art performance on 8 benchmarks with only 0.35M additional parameters per block.
- PCA-Seg: Revisiting Cost Aggregation for Open-Vocabulary Semantic and Part Segmentation
-
PCA-Seg revisits the cost aggregation mechanism in open-vocabulary semantic and part segmentation, proposing a parallel cost aggregation paradigm to replace existing serial architectures. It efficiently integrates semantic and contextual streams via an Expert-driven Perception Learning (EPL) module and reduces redundancy between the two knowledge streams through a Feature Orthogonal Decoupling (FOD) strategy. With only 0.35M additional parameters per parallel block, PCA-Seg achieves state-of-the-art performance across 8 benchmarks.
- PEARL: Geometry Aligns Semantics for Training-Free Open-Vocabulary Semantic Segmentation
-
PEARL proposes a two-step inference framework based on Procrustes alignment and text-aware Laplacian propagation. Without introducing any additional training or auxiliary backbone networks, it corrects the geometric mismatch between keys and queries in the final self-attention layer of CLIP and leverages textual semantics to guide label propagation, achieving new state-of-the-art performance on training-free open-vocabulary semantic segmentation.
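The geometric correction at PEARL's core can be sketched as a classical orthogonal Procrustes problem: find the rotation that best maps queries onto keys. The variable names and the exact features being aligned are assumptions, and PEARL additionally performs text-aware Laplacian propagation on top of this step.

```python
import numpy as np

def procrustes_align(Q, K):
    """Orthogonal Procrustes (sketch): find the orthogonal R minimizing
    ||Q @ R - K||_F, here standing in for the key/query geometric correction.
    Q, K: (N, d) token features. Returns aligned queries and R."""
    U, _, Vt = np.linalg.svd(Q.T @ K)   # SVD of the cross-covariance
    R = U @ Vt                          # closest orthogonal map
    return Q @ R, R
```

When the keys really are a rotated copy of the queries, this recovers the rotation exactly; in practice it only corrects the orthogonal component of the mismatch.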
- Phrase-Instance Alignment for Generalized Referring Segmentation
-
This paper proposes InstAlign, which reformulates Generalized Referring Expression Segmentation (GRES) as an instance-level reasoning problem. By introducing a Phrase-Object Alignment (POA) loss to establish fine-grained correspondences between linguistic phrases and visual instances, and employing a relevance-weighted aggregation mechanism to handle both multi-target and no-target scenarios in a unified manner, InstAlign achieves +3.22% cIoU and +12.25% N-acc improvements on gRefCOCO.
- PixDLM: A Dual-Path Multimodal Language Model for UAV Reasoning Segmentation
-
This paper formally defines the UAV Reasoning Segmentation task, constructs the DRSeg benchmark comprising 10K high-resolution UAV images with chain-of-thought reasoning annotations, and proposes the dual-path pixel-level multimodal large language model PixDLM as a strong baseline.
- Pointer-CAD: Unifying B-Rep and Command Sequences via Pointer-based Edges & Faces Selection
-
This paper proposes a pointer-based command sequence representation that explicitly incorporates B-Rep geometric entities (edges/faces) into autoregressive CAD generation, enabling chamfer/fillet operations in command sequence methods for the first time while substantially reducing topology errors caused by quantization errors.
- Prompt-Driven Lightweight Foundation Model for Instance Segmentation-Based Fault Detection in Freight Trains
-
This paper proposes SAM FTI-FDet, which transfers SAM's general segmentation capability to freight train fault detection via an automatic prompt generation module and an adaptive feature dispatcher. Using a TinyViT lightweight backbone, the method achieves 74.6 AP^box / 74.2 AP^mask, surpassing existing methods in both accuracy and efficiency.
- Prompt-Driven Lightweight Foundation Model for Instance Segmentation-Based Fault Detection in Freight Trains
-
This paper proposes SAM FTI-FDet, which introduces a Transformer decoder-based Prompt Generator that enables lightweight TinyViT-SAM to automatically generate task-relevant query prompts, achieving instance-level fault detection of freight train components without manual interaction. The method attains 74.6 AP^box / 74.2 AP^mask on a self-constructed dataset.
- PRUE: A Practical Recipe for Field Boundary Segmentation at Scale
-
This paper systematically evaluates 18 segmentation and geospatial foundation models (GFMs), and proposes PRUE—a field boundary segmentation recipe combining a U-Net backbone, composite loss function, and targeted data augmentation. PRUE achieves 76% IoU and 47% object-F1 on the FTW benchmark, surpassing the baseline by 6% and 9% respectively, and additionally introduces a set of metrics for evaluating deployment robustness.
- RDNet: Region Proportion-Aware Dynamic Adaptive Salient Object Detection Network in Optical Remote Sensing Images
-
To address the challenge of large-scale variation in remote sensing images, this paper proposes RDNet, a region proportion-aware dynamic adaptive salient object detection network. RDNet uses a Proportion Guidance mechanism to dynamically select convolution kernel combinations of varying sizes, combined with wavelet frequency-domain interaction and a cross-attention localization module. The method achieves state-of-the-art performance across three ORSI-SOD benchmarks.
- RDNet: Region Proportion-Aware Dynamic Adaptive Salient Object Detection Network in Optical Remote Sensing Images
-
This paper proposes RDNet, which employs a region proportion-aware Proportion Guidance block to estimate the area ratio of salient objects and dynamically selects combinations of 3/4/5 convolutional kernels of varying sizes for detail extraction. Combined with wavelet-domain frequency-matched context enhancement (reducing computation to 1/4) and a cross-attention localization module, RDNet comprehensively outperforms 21 state-of-the-art methods on three optical remote sensing SOD benchmarks: EORSSD, ORSSD, and ORSI-4199.
- RealVLG-R1: A Large-Scale Real-World Visual-Language Grounding Benchmark for Robotic Perception and Manipulation
-
This paper proposes the RealVLG framework, comprising the RealVLG-11B large-scale real-world multi-granularity annotated dataset and the RealVLG-R1 unified model fine-tuned via reinforcement learning. It is the first work to unify visual-language grounding (VLG) and robotic grasping under a single paradigm, enabling end-to-end prediction of bounding boxes, segmentation masks, grasp poses, and contact points from natural language instructions, while demonstrating zero-shot generalization capability.
- Reasoning with Pixel-level Precision: QVLM Architecture and SQuID Dataset for Quantitative Geospatial Analytics
-
This paper proposes the QVLM architecture and SQuID dataset, achieving pixel-level quantitative spatial reasoning on satellite imagery through a decoupled design of code generation and segmentation models. The approach overcomes the fundamental limitation of conventional VLMs, which lose spatial indexing due to patch embedding compression.
- RecycleLoRA: Rank-Revealing QR-Based Dual-LoRA Subspace Adaptation for Domain Generalized Semantic Segmentation
-
This paper proposes RecycleLoRA, which employs Rank-Revealing QR (RRQR) decomposition to systematically "recycle" subspace structures from pretrained Vision Foundation Model weights. By initializing a primary adapter from minor directions and a secondary adapter from major directions, the method substantially improves LoRA representational diversity and parameter utilization efficiency, achieving state-of-the-art performance on both synthetic-to-real and real-to-real domain generalized semantic segmentation benchmarks (average mIoU of 68.95 / 72.10).
- REL-SF4PASS: Panoramic Semantic Segmentation with REL Depth Representation and Spherical Fusion
-
This paper proposes REL, a three-channel depth representation based on cylindrical coordinates (Rectified Depth + EGVIA + LOA), and a Spherical Multi-Modal Fusion module (SMMF) for panoramic semantic segmentation. The approach achieves 63.06% average mIoU on Stanford2D3D (a 2.35% gain over the HHA baseline) and reduces performance variance under 3D perturbations by approximately 70%.
- RobotSeg: A Model and Dataset for Segmenting Robots in Image and Video
-
This paper presents RobotSeg, the first foundation model supporting both image and video robot segmentation. Built upon SAM 2, it introduces a Structure-Enhanced Memory Associator (SEMA), a Robot Prompt Generator (RPG), and a label-efficient training strategy requiring only first-frame annotations. In automatic mode, it achieves 85.1 J&F on Whole Robot segmentation, surpassing the fine-tuned SAM 2.1 by 4.9 points, with only 41.3M parameters — far fewer than existing 638M+ solutions.
- RS-SSM: Refining Forgotten Specifics in State Space Model for Video Semantic Segmentation
-
RS-SSM refines the spatiotemporal details lost during SSM state-space compression by extracting channel-wise specific information distribution features via frequency-domain analysis (CwAP) and adaptively inverting the forget-gate matrix (FGIR), achieving state-of-the-art performance on four video semantic segmentation benchmarks while maintaining high efficiency.
- RSONet: Region-guided Selective Optimization Network for RGB-T Salient Object Detection
-
This paper proposes RSONet, a two-stage RGB-T salient object detection network. In the region guidance stage, similarity scores between RGB/thermal guidance maps and a joint guidance map are computed to select the more reliable modality. In the saliency generation stage, a selective optimization (SO) module fuses dual-modality features based on the selection result, while Dense Detail Enhancement (DDE) and Mutual Interaction Semantic (MIS) modules extract detail and positional information, respectively, to produce high-quality saliency maps. RSONet achieves state-of-the-art performance on three RGB-T benchmarks.
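The region-guidance selection step can be sketched as scoring each modality's guidance map against the joint map and picking the more reliable one. Cosine similarity is used here as the scoring function, which is an assumption; the paper's exact similarity measure may differ.

```python
import numpy as np

def select_modality(g_rgb, g_t, g_joint):
    """Sketch of region-guided modality selection: score the RGB and thermal
    guidance maps by cosine similarity to the joint guidance map and return
    the more reliable modality with its score."""
    def cos(a, b):
        a, b = a.ravel(), b.ravel()
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    s_rgb, s_t = cos(g_rgb, g_joint), cos(g_t, g_joint)
    return ("rgb", s_rgb) if s_rgb >= s_t else ("thermal", s_t)
```

The selected modality's features would then drive the selective optimization (SO) fusion in the saliency generation stage.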
- RSONet: Region-guided Selective Optimization Network for RGB-T Salient Object Detection
-
RSONet is a two-stage RGB-T salient object detection framework that first generates region guidance maps via three parallel encoder-decoder branches and selects the dominant modality based on similarity, then fuses dual-modality features through a selective optimization module. It achieves MAE of 0.020/0.014/0.021 on VT5000/VT1000/VT821, outperforming 27 state-of-the-art methods.
- SAP: Segment Any 4K Panorama
-
This paper proposes SAP (Segment Any 4K Panorama), which converts panoramic images into perspective pseudo-video sequences sampled along fixed spherical trajectories, addressing the structural mismatch of SAM2's streaming memory mechanism on 360° images. By synthesizing a 183K instance-annotated 4K panoramic dataset for fine-tuning, SAP achieves a zero-shot mIoU improvement of +17.2 on real-world panoramic benchmarks.
- SARMAE: Masked Autoencoder for SAR Representation Learning
-
This paper proposes SARMAE, a framework for noise-robust SAR self-supervised pre-training built upon the million-scale SAR-1M dataset, speckle-aware representation enhancement (SARE), and semantic anchor representation constraint (SARC). SARMAE achieves state-of-the-art performance across multiple downstream tasks including classification, detection, and segmentation.
- SCOPE: Scene-Contextualized Incremental Few-Shot 3D Segmentation
-
SCOPE proposes a plug-and-play background-guided prototype enrichment framework that mines pseudo-instances from background regions of base-training scenes to build a prototype bank. At incremental stages, it enriches few-shot prototypes via retrieval + attention fusion — without retraining the backbone or adding parameters, it raises novel-class IoU on ScanNet / S3DIS by up to +6.98% while keeping forgetting low.
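The retrieval + attention fusion step can be sketched as follows. The top-k cosine retrieval, softmax temperature, and residual combination are illustrative assumptions rather than the paper's exact formulation.

```python
import numpy as np

def enrich_prototype(proto, bank, k=5, tau=0.1):
    """Sketch of prototype enrichment: retrieve the k bank prototypes most
    similar to the few-shot prototype (cosine similarity) and add their
    attention-weighted mean back to it. No parameters are learned."""
    p = proto / np.linalg.norm(proto)
    B = bank / np.linalg.norm(bank, axis=1, keepdims=True)
    sims = B @ p                        # cosine similarity to every bank entry
    idx = np.argsort(sims)[-k:]         # top-k retrieval
    w = np.exp(sims[idx] / tau)
    w /= w.sum()                        # attention weights over retrieved entries
    return proto + w @ bank[idx]
```

Because enrichment is pure retrieval and weighting, it can be bolted onto a frozen backbone at incremental stages, matching the plug-and-play claim.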
- SDDF: Specificity-Driven Dynamic Focusing for Open-Vocabulary Camouflaged Object Detection
-
SDDF introduces a new task of Open-Vocabulary Camouflaged Object Detection (OVCOD) and constructs the OVCOD-D benchmark. It removes redundant textual noise via a sub-description principal component contrastive fusion strategy, and enhances foreground-background discrimination through a specificity-guided regional weak alignment mechanism and a dynamic focusing module, achieving 56.4 AP under the open-set setting.
- Seeing Beyond: Extrapolative Domain Adaptive Panoramic Segmentation
-
This paper proposes EDA-PSeg, a framework that introduces two core modules — a Graph Matching Adapter (GMA) and an Euler-Margin Attention (EMA) — to achieve, for the first time, open-set unsupervised domain adaptive semantic segmentation from pinhole to 360° panoramic images, simultaneously addressing geometric FoV distortion and unknown category discovery.
- SemiTooth: a Generalizable Semi-supervised Framework for Multi-Source Tooth Segmentation
-
This paper proposes SemiTooth, a framework that addresses distribution discrepancies across multi-source CBCT data in semi-supervised tooth segmentation via a multi-teacher–multi-student architecture and a Stricter Weighted Confidence (SWC) constraint, achieving state-of-the-art performance on the newly constructed MS3Toothset dataset.
- SemLayer: Semantic-aware Generative Segmentation and Layer Construction for Abstract Icons
-
This paper proposes SemLayer, a generative-model-based pipeline that recovers semantically structured, layered representations from flattened vector icons. The approach reframes segmentation as a colorization task via a diffusion model, follows with semantic amodal completion of occluded regions, and applies integer linear programming (ILP) to determine layer ordering, achieving segmentation gains of +5.0 mIoU and +16.7 PQ.
- SGMA: Semantic-Guided Modality-Aware Segmentation for Remote Sensing with Incomplete Multimodal Data
-
This paper proposes the SGMA framework, which constructs global semantic prototypes via a Semantic-Guided Fusion (SGF) module for adaptive cross-modal fusion, and dynamically increases the training frequency of fragile modalities through a Modality-Aware Sampling (MAS) module. The framework addresses three core challenges in incomplete multimodal semantic segmentation for remote sensing: modality imbalance, large intra-class variance, and cross-modal heterogeneity.
- SGMA: Semantic-Guided Modality-Aware Segmentation for Remote Sensing with Incomplete Multimodal Data
-
This paper proposes SGMA—a Semantic-Guided Modality-Aware segmentation framework—that employs Semantic-Guided Fusion (SGF) to reduce intra-class variance and reconcile cross-modal conflicts, and Modality-Aware Sampling (MAS) to balance training frequency for vulnerable modalities. On ISPRS, SGMA achieves Average mIoU +9.20% and Last-1 mIoU +18.26% for weak modalities compared to the SOTA method IMLT.
- SouPLe: Enhancing Audio-Visual Localization and Segmentation with Learnable Prompt Contexts
-
This paper proposes SouPLe (Sound-aware Prompt Learning), which replaces fixed text prompts in CLIP with learnable context tokens generated conditioned on image features, enhancing semantic correspondence between audio embedding tokens and visual features. SouPLe achieves +3.75 cIoU on VGG-SS and +6.32 cIoU in the open-set setting, surpassing all prior methods.
- SPAR: Single-Pass Any-Resolution ViT for Open-Vocabulary Segmentation
-
This paper proposes SPAR, which distills the spatial reasoning capability of a fine-stride sliding window teacher into a single-pass student of identical architecture, transforming a ViT into a resolution-agnostic dense feature extractor. SPAR achieves +10.5 mIoU over the single-pass baseline in open-vocabulary segmentation while running 52× faster than the teacher.
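The fine-stride sliding-window teacher can be sketched as window-averaged feature extraction. The extractor, window size, and overlap-averaging scheme here are illustrative; SPAR's contribution is distilling this expensive multi-pass procedure into a single-pass student.

```python
import numpy as np

def sliding_window_extract(image, extract, win, stride):
    """Run `extract` on overlapping win x win crops and average the outputs
    over all windows covering each location (sketch of a sliding-window
    teacher producing dense per-pixel features)."""
    H, W = image.shape[:2]
    out = np.zeros((H, W), dtype=float)
    count = np.zeros((H, W), dtype=float)
    for y in range(0, H - win + 1, stride):
        for x in range(0, W - win + 1, stride):
            out[y:y+win, x:x+win] += extract(image[y:y+win, x:x+win])
            count[y:y+win, x:x+win] += 1
    return out / np.maximum(count, 1)   # average over overlapping windows
```

With a fine stride, each pixel is covered by many windows, which is what makes the teacher roughly 52× slower than the distilled single-pass student.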
- Spatio-Semantic Expert Routing Architecture with Mixture-of-Experts for Referring Image Segmentation
-
This paper proposes the SERA framework, which introduces a two-stage lightweight MoE expert refinement mechanism — SERA-Adapter at the backbone level and SERA-Fusion at the fusion level — into a frozen vision-language backbone. Through expression-guided adaptive routing, SERA improves spatial consistency and boundary precision in referring image segmentation while updating fewer than 1% of backbone parameters.
- Task-Oriented Data Synthesis and Control-Rectify Sampling for Remote Sensing Semantic Segmentation
-
This paper proposes the TODSynth framework, which achieves joint text-image-mask controlled remote sensing image synthesis via unified tri-modal attention in MM-DiT, and introduces Control-Rectify Flow Matching (CRFM), a novel sampling-stage method that dynamically adjusts the generation trajectory using semantic loss from a downstream segmentation model. The synthesized data improves mIoU by 4.14% on FUSU-4k and 2.08% on LoveDA.
- The Golden Subspace: Where Efficiency Meets Generalization in Continual Test-Time Adaptation
-
This paper proposes GOLD, a framework for Continual Test-Time Adaptation (CTTA). The central finding is that the minimal feature update subspace—termed the "golden subspace"—coincides with the row space of the classifier weight matrix and is inherently low-rank. GOLD estimates this subspace online via the Average Gradient Outer Product (AGOP) and performs feature adaptation using a lightweight scaling vector, achieving state-of-the-art performance on classification and segmentation benchmarks with minimal computational overhead.
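A toy sketch of the AGOP subspace estimate, and of the claim that it coincides with the classifier's row space: gradients flowing through a linear classifier W lie in W's row space, so the top eigenvectors of the averaged gradient outer product span it. GOLD's actual online estimator and scaling-vector adaptation are more involved.

```python
import numpy as np

def agop_subspace(grads, rank):
    """Estimate the adaptation ("golden") subspace from per-sample feature
    gradients via the Average Gradient Outer Product (sketch).
    grads: (n, d) gradients of the loss w.r.t. features."""
    M = grads.T @ grads / len(grads)      # AGOP, a (d, d) PSD matrix
    _, eigvecs = np.linalg.eigh(M)        # eigenvalues in ascending order
    return eigvecs[:, -rank:]             # top-`rank` eigenvectors as a basis
```

Features would then be adapted only within this low-rank basis, which is what keeps the computational overhead minimal.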
- Towards Context-Aware Image Anonymization with Multi-Agent Reasoning
-
This paper proposes CAIAMAR, a multi-agent framework that combines dedicated models for high-confidence direct PII (pedestrians, license plates) with context-aware reasoning via large vision-language models (LVLMs). Through a PDCA iterative refinement loop, it detects indirect privacy identifiers and applies appearance-decorrelated inpainting via diffusion models, reducing re-identification risk by 73% on CUHK03-NP while maintaining high image quality (FID 9.1) on Cityscapes.
- Towards High-Quality Image Segmentation: Improving Topology Accuracy by Penalizing Neighbor Pixels
-
This paper proposes Same Class Neighbor Penalization (SCNP), which replaces each pixel's logit with the worst prediction among its same-class neighbors during training, thereby forcing the model to prioritize correcting weakly classified pixels within local neighborhoods. This approach achieves significant improvements in topological accuracy at negligible cost (only 3 lines of code and a few milliseconds per iteration).
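The penalization step can be sketched in a few lines. A 4-connected neighborhood is assumed here; the paper's exact neighborhood definition may differ.

```python
import numpy as np

def scnp_logits(logits, labels):
    """Same Class Neighbor Penalization (sketch): replace each pixel's
    ground-truth-class logit with the minimum of that logit over 4-connected
    neighbors sharing its label (including the pixel itself).
    logits: (H, W, C), labels: (H, W) integer class map."""
    H, W, _ = logits.shape
    out = logits.copy()
    for y in range(H):
        for x in range(W):
            c = labels[y, x]
            worst = logits[y, x, c]
            for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                ny, nx = y + dy, x + dx
                if 0 <= ny < H and 0 <= nx < W and labels[ny, nx] == c:
                    worst = min(worst, logits[ny, nx, c])
            out[y, x, c] = worst
    return out
```

Training on the penalized logits forces the loss to focus on the weakest same-class prediction in each neighborhood, which is how the method improves topological accuracy; in a real pipeline the loops would be vectorized shifts, matching the "3 lines of code" claim.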
- Unified Spherical Frontend: Learning Rotation-Equivariant Representations of Spherical Images from Any Camera
-
USF proposes a modular, lens-agnostic spherical vision frontend that projects arbitrarily calibrated camera images onto the unit sphere and performs spatial-domain spherical resampling, convolution, and pooling operations. Using only distance-weighted kernels, the framework inherently guarantees rotation equivariance, and demonstrates zero-shot generalization robustness to random rotations and cross-lens transfer on classification, detection, and segmentation tasks.
- Universal 3D Shape Matching via Coarse-to-Fine Language Guidance
-
This paper proposes UniMatch, a semantics-aware coarse-to-fine 3D shape matching framework. The coarse stage establishes part-level correspondences via category-agnostic 3D segmentation, MLLM-based part naming, and FG-CLIP language embeddings. The fine stage learns dense correspondences within an extended functional map framework using a Group-wise Ranking Contrastive (RnC) Loss, enabling universal matching across categories and non-isometric shapes.
- UnrealPose: Leveraging Game Engine Kinematics for Large-Scale Synthetic Human Pose Data
-
This paper proposes UnrealPose-Gen, a synthetic human pose data generation pipeline built on Unreal Engine 5, which leverages native game engine skeletal kinematics—rather than SMPL—to produce UnrealPose-1M, a million-scale annotated dataset providing 3D joint positions, 2D keypoints, occlusion flags, instance segmentation masks, and camera parameters.
- VidEoMT: Your ViT is Secretly Also a Video Segmentation Model
-
This paper proposes VidEoMT, an encoder-only video segmentation model that unifies segmentation and temporal association within a single ViT encoder via query propagation and query fusion, eliminating all dedicated tracking modules. It achieves 160 FPS on YouTube-VIS 2019 (10×+ faster than CAVIS) with only a 0.3 AP drop.
- VidEoMT: Your ViT is Secretly Also a Video Segmentation Model
-
This paper proposes VidEoMT, an encoder-only video segmentation architecture that unifies segmentation and temporal association within a single ViT encoder via query propagation and query fusion, achieving 5×–10× speedup (160 FPS with ViT-L) while maintaining accuracy comparable to state-of-the-art methods.
- VIRST: Video-Instructed Reasoning Assistant for SpatioTemporal Segmentation
-
VIRST proposes an end-to-end framework that unifies global video reasoning and pixel-level mask prediction within a single vision-language model. Through Spatiotemporal Fusion (STF) and a Temporal Dynamic Anchor Updater (TDAU), the method achieves spatiotemporally consistent video segmentation, attaining J&F of 70.8 (+7.5 over SOTA) on ReVOS and 62.9 (+9.2) on MeViS, while achieving an inference speed of 5.1 FPS (1.3× faster than VRS-HQ).
- Weakly-Supervised Referring Video Object Segmentation through Text Supervision
-
This paper proposes WSRVOS, the first weakly supervised referring video object segmentation framework that uses only text expressions as supervision signals. It achieves significant reduction in reliance on pixel-level annotations through MLLM-driven contrastive expression augmentation, bidirectional visual-language feature selection, instance-aware expression classification, and temporal segment ranking constraints.