✂️ Segmentation
📹 ICCV2025 · 78 paper notes
- 2HandedAfforder: Learning Precise Actionable Bimanual Affordances from Human Videos
-
This paper proposes an automated pipeline to extract precise bimanual affordance annotations from human activity videos, yielding the 2HANDS dataset, and trains a VLM-based 2HandedAfforder model that predicts precise object region segmentation masks for left and right hand grasps conditioned on text prompts. The approach significantly outperforms existing methods on the newly introduced ActAffordance benchmark.
- A Plug-and-Play Physical Motion Restoration Approach for In-the-Wild High-Difficulty Motions
-
This paper proposes a plug-and-play physical motion restoration approach that repairs artifact frames in video-based motion capture via a Mask-conditioned Motion Correction Module (MCM), and transfers the corrected motion into a physically plausible simulation through a Physics-based Motion Transfer Module (PTM) built on pretraining and RL-based test-time adaptation. This work is the first to achieve physics-based simulation restoration for in-the-wild high-difficulty motions such as gymnastics and martial-arts back-flips.
- Advancing Visual Large Language Model for Multi-granular Versatile Perception
-
This paper proposes MVP-LM, a multi-granular versatile perception framework built upon a visual large language model. Through a novel multi-granular decoder and a CoT-inspired data unification strategy, MVP-LM is the first single model to simultaneously support all four perception combinations—box and mask predictions under both word-level and sentence-level instructions—achieving competitive performance on panoptic segmentation, object detection, visual grounding, and referring expression segmentation.
- AnimalClue: Recognizing Animals by their Traces
-
This paper introduces AnimalClue, the first large-scale dataset for animal trace recognition, containing 159,605 bounding boxes spanning 968 species across five categories of indirect clues (footprints, feces, eggs, bones, and feathers), and establishes four benchmarks covering classification, detection, instance segmentation, and attribute prediction.
- Auto-Vocabulary Semantic Segmentation
-
This paper introduces Auto-Vocabulary Semantic Segmentation (AVS), a new task in which the AutoSeg framework autonomously discovers target categories from images and performs segmentation without any human-specified vocabulary. AutoSeg achieves 87.1 mIoU on PASCAL VOC, far surpassing the only comparable method ZeroSeg (20.1), and even outperforming several open-vocabulary methods that require explicit category specification.
- Beyond Single Images: Retrieval Self-Augmented Unsupervised Camouflaged Object Detection
-
This paper proposes RISE — a retrieval self-augmented unsupervised camouflaged object detection paradigm that constructs foreground/background prototype libraries from the training set itself and leverages KNN retrieval to generate pseudo-labels, substantially outperforming existing unsupervised and prompt-based methods without any annotations.
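The core pseudo-labeling step, retrieving a pixel's nearest neighbors from foreground/background prototype libraries and voting, can be sketched as follows. This is a minimal illustration of the retrieval idea, not the authors' implementation; the function name and toy prototypes are invented:

```python
import numpy as np

def knn_pseudo_labels(feats, fg_protos, bg_protos, k=3):
    """Assign a foreground/background pseudo-label to each feature vector
    by majority vote among its k nearest prototypes (illustrative sketch)."""
    protos = np.vstack([fg_protos, bg_protos])            # (P, D)
    labels = np.array([1] * len(fg_protos) + [0] * len(bg_protos))
    # pairwise squared Euclidean distances, shape (N, P)
    d = ((feats[:, None, :] - protos[None, :, :]) ** 2).sum(-1)
    nn = np.argsort(d, axis=1)[:, :k]                     # k nearest prototypes
    votes = labels[nn].sum(axis=1)                        # foreground votes
    return (votes * 2 > k).astype(int)                    # majority vote

# toy example: features near (1,1) are foreground, near (0,0) background
fg = np.array([[1.0, 1.0], [0.9, 1.1]])
bg = np.array([[0.0, 0.0], [0.1, -0.1]])
feats = np.array([[0.95, 1.0], [0.05, 0.0]])
print(knn_pseudo_labels(feats, fg, bg, k=3))  # -> [1 0]
```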
- Can Generative Geospatial Diffusion Models Excel as Discriminative Geospatial Foundation Models?
-
This paper proposes SatDiFuser, a framework that repurposes a generative geospatial diffusion model (DiffusionSat) as a discriminative remote sensing foundation model. Through systematic analysis of multi-stage, multi-timestep diffusion features and three designed fusion strategies (Global Weighted, Localized Weighted, and MoE Joint Fusion), SatDiFuser outperforms existing state-of-the-art geospatial foundation models (GFMs) on semantic segmentation and classification tasks, achieving gains of up to +5.7% mIoU and +7.9% F1.
- CAVIS: Context-Aware Video Instance Segmentation
-
This paper proposes CAVIS, which introduces a Context-Aware Instance Tracker (CAIT) to incorporate contextual information around object boundaries for enhanced instance association, and designs a Prototypical Cross-frame Contrastive loss (PCC) to enforce cross-frame feature consistency, achieving state-of-the-art performance on both VIS and VPS benchmarks.
- CLOT: Closed Loop Optimal Transport for Unsupervised Action Segmentation
-
This paper proposes Closed Loop Optimal Transport (CLOT), a framework that jointly solves three OT problems through a three-level cyclic feature learning pipeline (frame embeddings → segment embeddings → cross-attention refined frame embeddings), establishing an explicit feedback loop between frame-level and segment-level representations to substantially improve boundary detection and clustering quality in unsupervised action segmentation.
- ConformalSAM: Unlocking the Potential of Foundational Segmentation Models in Semi-Supervised Semantic Segmentation with Conformal Prediction
-
This paper proposes ConformalSAM, a framework that leverages Conformal Prediction to calibrate the output uncertainty of the foundation segmentation model SEEM on target domains. Unreliable pixel labels are filtered out before serving as supervision signals for unlabeled data. Combined with a late-stage self-reliance training strategy, the framework achieves 81.21 mIoU on PASCAL VOC under the 1/16 labeled setting.
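The filtering idea rests on standard split conformal prediction: calibrate a nonconformity threshold on held-out data, then keep only pixels whose prediction set collapses to a single class. A minimal sketch, not the paper's code; function names and the synthetic calibration data are illustrative:

```python
import numpy as np

def conformal_threshold(cal_probs, cal_labels, alpha=0.1):
    """Split conformal calibration: nonconformity = 1 - prob of true class.
    Returns the quantile that forms (1 - alpha)-coverage prediction sets."""
    scores = 1.0 - cal_probs[np.arange(len(cal_labels)), cal_labels]
    n = len(scores)
    level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    return np.quantile(scores, level, method="higher")

def reliable_pixel_mask(probs, qhat):
    """Keep only pixels whose conformal prediction set contains exactly
    one class, i.e. the pseudo-label is unambiguous at this level."""
    pred_sets = probs >= 1.0 - qhat        # (N, C) boolean prediction sets
    return pred_sets.sum(axis=1) == 1

# synthetic calibration set: 20 confident, correct class-0 predictions
cal_probs = np.tile(np.array([[0.9, 0.05, 0.05]]), (20, 1))
cal_labels = np.zeros(20, dtype=int)
qhat = conformal_threshold(cal_probs, cal_labels, alpha=0.1)
test_probs = np.array([[0.92, 0.04, 0.04],   # confident -> kept
                       [0.40, 0.35, 0.25]])  # ambiguous -> filtered out
print(reliable_pixel_mask(test_probs, qhat))  # -> [ True False]
```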
- CorrCLIP: Reconstructing Patch Correlations in CLIP for Open-Vocabulary Semantic Segmentation
-
This paper identifies inter-class patch correlations in CLIP as the fundamental bottleneck for segmentation performance, and proposes CorrCLIP, which addresses this via SAM-constrained patch interaction scope (scope reconstruction), DINO-based similarity value reconstruction (value reconstruction), spatial/semantic feature refinement, and SAM mask post-processing. The method achieves an average mIoU improvement from 48.6% to 53.6% across 8 benchmarks under the training-free setting.
- Correspondence as Video: Test-Time Adaption on SAM2 for Reference Segmentation in the Wild
-
CAV-SAM represents the correspondence between a reference–target image pair as a pseudo-video sequence, bridging semantic gaps via a Diffusion-Based Semantic Transition (DBST) module and aligning geometric variations via a Test-Time Geometric Alignment (TTGA) module with lightweight test-time fine-tuning. This adapts SAM2's interactive video object segmentation (iVOS) capability to reference segmentation without any meta-training, surpassing the state of the art by approximately 5% mIoU on cross-domain few-shot segmentation benchmarks.
- DDB: Diffusion Driven Balancing to Address Spurious Correlations
-
This paper proposes Diffusion Driven Balancing (DDB), which leverages the textual inversion and inpainting capabilities of Stable Diffusion to automatically generate minority-group samples for balancing spurious correlations in datasets. Combined with a bicephalous pruning strategy based on ERM model prediction probabilities and integrated gradients, DDB achieves state-of-the-art worst-group accuracy on Waterbirds and MetaShift.
- DeRIS: Decoupling Perception and Cognition for Enhanced Referring Image Segmentation through Loopback Synergy
-
This paper proposes DeRIS, a framework that decouples referring image segmentation into two branches — perception and cognition — and introduces a Loopback Synergy mechanism to iteratively enhance cross-branch interaction. A non-referent sample conversion augmentation strategy is also introduced. DeRIS achieves state-of-the-art performance on RefCOCO/+/g and gRefCOCO benchmarks.
- Dynamic Dictionary Learning for Remote Sensing Image Segmentation
-
This paper proposes D2LS, a dynamic dictionary learning framework that iteratively updates category-aware semantic embeddings (the dictionary) via multi-stage alternating cross-attention, and incorporates contrastive constraints to enhance inter-class separability. D2LS surpasses the state of the art on both coarse-grained and fine-grained remote sensing image segmentation benchmarks.
- E-SAM: Training-Free Segment Every Entity Model
-
E-SAM is a training-free framework that systematically addresses over-segmentation and under-segmentation in SAM's Automatic Mask Generation (AMG) via three cascaded modules—Multi-level Mask Generation (MMG), Entity-level Mask Refinement (EMR), and Under-Segmentation Refinement (USR)—surpassing existing entity segmentation methods by +30.1 points on benchmark metrics.
- Enhancing Transformers Through Conditioned Embedded Tokens
-
This paper identifies an inherent ill-conditioning problem in the self-attention matrices of Transformers. Through theoretical analysis, it establishes a direct relationship between the condition number of the self-attention matrix and that of the embedded token matrix, and proposes Conditioned Embedded Tokens — an SVD-based correction term applied to the embedding matrix — achieving consistent performance improvements across image classification, object detection, instance segmentation, and NLP tasks.
- Ensemble Foreground Management for Unsupervised Object Discovery
-
This paper proposes UnionCut — a foreground union detection method based on minimum cut and ensemble learning — which provides mathematically guaranteed foreground priors for unsupervised object discovery (UOD). It enables UOD algorithms to reliably determine whether discovered regions are foreground and when to stop exploration. A distilled variant, UnionSeg, is also proposed to substantially improve both efficiency and accuracy.
- Exploiting Domain Properties in Language-Driven Domain Generalization for Semantic Segmentation
-
This paper proposes DPMFormer, a framework that transforms domain-specific properties of input images into textual context prompts via domain-aware prompt learning, combined with domain-robust consistency learning, to address the semantic misalignment between visual and textual contexts in language-driven domain generalization for semantic segmentation.
- Exploring Probabilistic Modeling Beyond Domain Generalization for Semantic Segmentation
-
This paper proposes PDAF (Probabilistic Diffusion Alignment Framework), which explicitly estimates a Latent Domain Prior (LDP) via probabilistic diffusion modeling to provide domain-shift compensation for existing segmentation networks, achieving state-of-the-art cross-domain generalization without requiring paired target-domain samples.
- FLOSS: Free Lunch in Open-vocabulary Semantic Segmentation
-
This paper challenges the default practice of averaging 80 templates in open-vocabulary semantic segmentation (OVSS), revealing that each class has specific "class-expert" templates that significantly outperform the averaged classifier. It proposes FLOSS, a method that uses prediction entropy to unsupervisedly select expert templates and fuse their predictions, consistently improving existing OVSS methods without any labels or training.
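The unsupervised selection step can be sketched as follows, assuming per-template softmax maps are available: for each class, pick the template whose predictions on that class have the lowest mean entropy. Everything here is illustrative rather than the authors' implementation:

```python
import numpy as np

def select_expert_templates(probs):
    """probs: (T, N, C) per-template softmax over C classes for N pixels.
    For each class, pick the template whose predictions for that class
    have the lowest mean entropy (an unsupervised confidence proxy)."""
    T, N, C = probs.shape
    entropy = -(probs * np.log(probs + 1e-12)).sum(-1)   # (T, N)
    preds = probs.argmax(-1)                             # (T, N)
    experts = np.zeros(C, dtype=int)
    for c in range(C):
        # mean entropy over the pixels each template assigns to class c
        scores = [entropy[t][preds[t] == c].mean() if (preds[t] == c).any()
                  else np.inf for t in range(T)]
        experts[c] = int(np.argmin(scores))
    return experts

# toy case: template 0 is confident on class 0, template 1 on class 1
probs = np.array([
    [[0.99, 0.01], [0.60, 0.40]],
    [[0.55, 0.45], [0.05, 0.95]],
])
print(select_expert_templates(probs))  # -> [0 1]
```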
- Harnessing Massive Satellite Imagery with Efficient Masked Image Modeling
-
This paper proposes a remote sensing model pre-training pipeline comprising OpticalRS-13M, a dataset of 13 million optical remote sensing images, and SelectiveMAE, an efficient MIM method that selectively encodes and reconstructs patches based on semantic richness. Using only 40% of image patches, SelectiveMAE achieves performance comparable to full-patch training while delivering more than 2× speedup.
- Hierarchical Visual Prompt Learning for Continual Video Instance Segmentation
-
This paper introduces Continual Video Instance Segmentation (CVIS) as a new problem formulation, and proposes the Hierarchical Visual Prompt Learning (HVPL) model, which mitigates catastrophic forgetting of old categories via forgetting compensation mechanisms at both frame-level and video-level.
- HiMTok: Learning Hierarchical Mask Tokens for Image Segmentation with Large Multimodal Model
-
This paper proposes HiMTok (Hierarchical Mask Tokenizer), which represents segmentation masks as up to 32 coarse-to-fine discrete tokens, enabling LMMs to directly generate segmentation results in the same manner as text generation — without any additional image-conditioned mask decoder — achieving state-of-the-art performance across multiple segmentation benchmarks.
- How Do Optical Flow and Textual Prompts Collaborate to Assist in Audio-Visual Semantic Segmentation?
-
This paper proposes the SSP (Stepping Stone Plus) framework, which employs optical flow as auxiliary mask prompts in conjunction with two types of textual prompts and a Visual-Textual Alignment (VTA) module, achieving state-of-the-art performance on the audio-visual semantic segmentation task.
- Hybrid-TTA: Continual Test-time Adaptation via Dynamic Domain Shift Detection
-
Hybrid-TTA proposes a continual test-time adaptation (CTTA) framework that employs a Dynamic Domain Shift Detection (DDSD) module to determine whether the current input originates from a new domain, adaptively switching between Full Tuning (FT) and Adapter Tuning (AT). It additionally introduces Masked Image Modeling Adaptation (MIMA) as an auxiliary task to enhance model stability, achieving 62.2% mIoU on the Cityscapes-to-ACDC benchmark while running approximately 20× faster than comparable methods.
- Implicit Counterfactual Learning for Audio-Visual Segmentation
-
This paper proposes the Implicit Counterfactual Framework (ICF), which employs multi-granularity implicit text as a modality bridge to reduce the audio-visual representation gap, and leverages semantic counterfactuals to generate orthogonal counterfactual samples that mitigate modality preference. Combined with Collaborative Distribution-Aware Contrastive Learning (CDCL), ICF achieves unbiased cross-modal understanding and state-of-the-art performance on three AVS benchmarks.
- Inter2Former: Dynamic Hybrid Attention for Efficient High-Precision Interactive Segmentation
-
This paper proposes Inter2Former, which employs Dynamic Hybrid Attention (DHA) to route boundary tokens to full attention and non-boundary tokens to linear-complexity BSQ attention. Combined with Dynamic Prompt Embedding (DPE), Hybrid Mixture of Experts (HMoE), and Dynamic Local Upsampling (DLU), the method achieves state-of-the-art performance and efficient inference for high-precision interactive segmentation on CPU devices.
- Joint Self-Supervised Video Alignment and Action Segmentation
-
This paper proposes the VAOT/VASOT framework, which integrates Gromov-Wasserstein optimal transport with structural priors to unify self-supervised video alignment and action segmentation within a single model for the first time. The framework surpasses existing methods on video alignment and achieves state-of-the-art performance on action segmentation.
- Know "No" Better: A Data-Driven Approach for Enhancing Negation Awareness in CLIP
-
By analyzing the scarcity and misalignment of negation expressions in CLIP's pre-training data, this work designs two LLM/MLLM-based negation data generation pipelines to fine-tune the CLIP text encoder, producing NegationCLIP — a model that enhances negation understanding while preserving general performance. A new benchmark, NegRefCOCOg, is proposed for comprehensive negation evaluation.
- Know Your Attention Maps: Class-specific Token Masking for Weakly Supervised Semantic Segmentation
-
This paper proposes an end-to-end weakly supervised semantic segmentation method that introduces multiple [CLS] tokens (one per class) into a ViT, applies random masking to [CLS] token output embeddings, and prunes redundant attention heads. Class-specific pseudo segmentation masks are generated directly from self-attention maps without any additional CAM module.
- Latent Expression Generation for Referring Image Segmentation and Grounding
-
This paper proposes the Latent-VG framework, which generates multiple latent expressions from a single textual description — each sharing the same subject but highlighting distinct visual attributes — to bridge the semantic gap between sparse text and rich visual information via complementary visual details. The method achieves state-of-the-art performance on both referring image segmentation and referring expression comprehension tasks.
- LawDIS: Language-Window-based Controllable Dichotomous Image Segmentation
-
This paper proposes LawDIS, a language-window dual-control dichotomous image segmentation framework built upon a latent diffusion model (Stable Diffusion). In macro mode, language prompts steer target segmentation (LS); in micro mode, variable-size windows refine local details (WR). LawDIS comprehensively outperforms 11 state-of-the-art methods on the DIS5K benchmark.
- LayerAnimate: Layer-level Control for Animation
-
This paper proposes LayerAnimate, a framework that integrates the layer-separation paradigm of traditional animation production with video diffusion models to enable fine-grained layer-level control (motion scores, trajectories, sketches). An automated data curation pipeline is designed to address the scarcity of layered animation data. The framework comprehensively outperforms existing methods across six video generation tasks.
- Learn2Synth: Learning Optimal Data Synthesis Using Hypergradients for Brain Image Segmentation
-
This paper proposes Learn2Synth, a training framework that leverages hypergradients to learn optimal synthetic data augmentation parameters, enabling segmentation networks trained exclusively on synthetic data to achieve peak performance on real data. The framework simultaneously attains high in-domain accuracy and strong out-of-domain generalization, outperforming both SynthSeg and supervised learning baselines on brain MRI segmentation tasks.
- Learning Precise Affordances from Egocentric Videos for Robotic Manipulation
-
This paper presents a complete affordance learning system comprising: (1) an automatic pipeline for extracting precise graspable and functional affordance segmentation annotations from egocentric videos; (2) a Geometry-guided Affordance Transformer (GAT) based on DINOv2 with depth-geometric guidance for cross-domain affordance segmentation (mIoU improved by 13.8%); and (3) the Aff-Grasp framework, which achieves a 77.1% grasping success rate across 179 real robot trials.
- LEGION: Learning to Ground and Explain for Synthetic Image Detection
-
This paper proposes the LEGION framework and the SynthScars dataset, leveraging a multimodal large language model (MLLM) to unify artifact detection, pixel-level segmentation, and textual explanation for synthetic image detection. It further innovatively extends the role of the detector from a "Defender" to a "Controller," guiding generative models to produce higher-quality images.
- LeGrad: An Explainability Method for Vision Transformers via Feature Formation Sensitivity
-
This paper proposes LeGrad, a layer-wise explainability method designed specifically for ViTs. It computes the gradient of the activation with respect to the attention map at each layer as the explanation signal, aggregates these signals across layers to produce high-quality spatial saliency maps, and demonstrates superior spatial fidelity in segmentation, perturbation, and open-vocabulary settings.
- MOVE: Motion-Guided Few-Shot Video Object Segmentation
-
This paper introduces a novel task of motion-guided few-shot video object segmentation along with a large-scale dataset MOVE (224 motion categories, 4,300 videos, 314K masks), and proposes a Decoupled Motion-Appearance (DMA) network. Through a dual-branch architecture combining frame-differencing-based motion prototypes and appearance prototypes, the proposed method significantly outperforms existing FSVOS methods on the new benchmark.
- O-MaMa: Learning Object Mask Matching between Egocentric and Exocentric Views
-
This work reframes cross-view (ego-exo) object segmentation as a mask matching problem. It leverages FastSAM to generate candidate masks, DINOv2 to extract semantic features, and contrastive learning to match objects across views, achieving state-of-the-art performance on the Ego-Exo4D benchmark with only 1% of the trainable parameters used by prior methods.
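The matching step reduces to nearest-neighbor search over mask descriptors across views; a minimal cosine-similarity sketch (illustrative only, not the paper's learned matcher, with invented toy features):

```python
import numpy as np

def match_masks(ego_feats, exo_feats):
    """Match each egocentric mask descriptor to the exocentric candidate
    with the highest cosine similarity (illustrative sketch)."""
    a = ego_feats / np.linalg.norm(ego_feats, axis=1, keepdims=True)
    b = exo_feats / np.linalg.norm(exo_feats, axis=1, keepdims=True)
    sim = a @ b.T                      # (n_ego, n_exo) cosine similarities
    return sim.argmax(axis=1)          # best exo match per ego mask

ego = np.array([[1.0, 0.0], [0.0, 1.0]])
exo = np.array([[0.0, 2.0], [3.0, 0.1]])
print(match_masks(ego, exo))  # -> [1 0]
```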
- Object-level Correlation for Few-Shot Segmentation
-
OCNet is proposed to construct object-level (rather than image-level) support-query correlations by emulating biological visual processes. It first mines generic objects in the query image and then identifies the target object among them, effectively suppressing irrelevant object noise in the background.
- OmniSAM: Omnidirectional Segment Anything Model for UDA in Panoramic Semantic Segmentation
-
This paper proposes OmniSAM, the first framework to apply SAM2 to unsupervised domain adaptation (UDA) for panoramic semantic segmentation. It partitions panoramic images into patch sequences via a sliding window and leverages SAM2's memory mechanism to capture cross-patch correspondences. Combined with a FoV-based prototypical adaptation module and a dynamic pseudo-label update strategy, OmniSAM significantly surpasses the state of the art on both indoor and outdoor benchmarks (+10.22% / +6.58%).
- On the Generalization of Representation Uncertainty in Earth Observation
-
This paper systematically investigates the generalization of pretrained representation uncertainty in Earth Observation (EO), demonstrating that EO-pretrained uncertainty generalizes robustly across geographic locations, EO tasks, and target granularities, while remaining highly sensitive to ground sampling distance (GSD).
- Online Generic Event Boundary Detection
-
This paper proposes Online Generic Event Boundary Detection (On-GEBD) as a new task—detecting event boundaries in real time from streaming video—and introduces the ESTimator framework inspired by the cognitive science Event Segmentation Theory (EST). Through the collaboration of a Consistent Event Anticipator (CEA) and an Online Boundary Discriminator (OBD), ESTimator achieves an Avg F1 of 0.748 on Kinetics-GEBD, surpassing all online baselines and approaching the performance of offline methods.
- Online Reasoning Video Segmentation with Just-in-Time Digital Twins
-
This paper proposes a multi-agent framework based on the concept of "Just-in-Time Digital Twins" that decouples perception from reasoning. Without any LLM fine-tuning, the framework enables online video reasoning segmentation and comprehensively outperforms existing methods across semantic, spatial, and temporal reasoning tasks.
- Open-World Skill Discovery from Unsegmented Demonstration Videos
-
Inspired by the human cognitive Event Segmentation Theory (EST), this paper proposes the Skill Boundary Detection (SBD) algorithm, which leverages prediction error spikes from a pretrained unconditional action prediction model to automatically identify skill boundaries in unsegmented demonstration videos, significantly improving the performance of conditional policies and hierarchical agents in Minecraft.
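The boundary rule, flagging frames whose prediction error spikes above recent statistics, can be sketched as follows; the rolling mean-plus-k-sigma test shown here is a hypothetical simplification, not the paper's exact criterion:

```python
import numpy as np

def detect_boundaries(errors, window=5, k=2.0):
    """Flag frame t as a skill boundary when its prediction error exceeds
    mean + k*std of the preceding `window` frames (illustrative sketch)."""
    errors = np.asarray(errors, dtype=float)
    boundaries = []
    for t in range(window, len(errors)):
        prev = errors[t - window:t]
        if errors[t] > prev.mean() + k * prev.std():
            boundaries.append(t)
    return boundaries

# toy error trace: flat, with a spike at frame 7
errs = [0.1, 0.12, 0.11, 0.1, 0.09, 0.1, 0.11, 0.9, 0.12, 0.1]
print(detect_boundaries(errs))  # -> [7]
```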
- PartField: Learning 3D Feature Fields for Part Segmentation and Beyond
-
PartField learns a continuous 3D feature field via a feed-forward model, distilling knowledge from mixed 2D/3D part proposals through contrastive learning. It outperforms existing methods by 20%+ on category-agnostic 3D part segmentation while achieving inference speeds orders of magnitude faster.
- Prompt Guidance and Human Proximal Perception for HOT Prediction with Regional Joint Loss
-
This paper proposes P3HOT, a framework that achieves state-of-the-art performance on Human-Object Contact (HOT) detection by incorporating text prompt guidance to focus on human contact regions, a depth-aware module to filter irrelevant backgrounds, and a Regional Joint Loss to enforce intra-region category consistency.
- RAGNet: Large-scale Reasoning-based Affordance Segmentation Benchmark towards General Grasping
-
This paper introduces RAGNet, the first large-scale reasoning-based affordance segmentation benchmark (273k images, 180 categories, 26k reasoning instructions), and proposes the AffordanceNet framework, which integrates VLM-pretrained affordance prediction with grasp pose generation, demonstrating strong open-world generalization and reasoning capabilities.
- Refer to Any Segmentation Mask Group With Vision-Language Prompts
-
This paper proposes the Omni-modal Referring Expression Segmentation (ORES) task and the RAS framework, which leverages a mask-level LMM with a non-autoregressive decoding mechanism to select target mask groups from a candidate pool based on vision-language hybrid prompts. The approach achieves state-of-the-art performance on the newly introduced ORES dataset as well as classical RES/GRES benchmarks.
- ReferDINO: Referring Video Object Segmentation with Visual Grounding Foundations
-
This paper proposes ReferDINO, which end-to-end adapts the GroundingDINO visual grounding foundation model to the Referring Video Object Segmentation (RVOS) task. By introducing a grounding-guided deformable mask decoder, an object-consistent temporal enhancer, and a confidence-based query pruning strategy, ReferDINO significantly surpasses state-of-the-art methods across five benchmarks (e.g., +3.9% \(\mathcal{J}\&\mathcal{F}\) on Ref-YouTube-VOS) while achieving real-time inference at 51 FPS.
- ReferEverything: Towards Segmenting Everything We Can Speak of in Videos
-
By leveraging the general visual-language mappings learned by video diffusion models, and by preserving the complete generative model architecture while shifting the prediction target from noise to mask latents, this work enables open-world referring segmentation of any concept expressible in natural language in videos — including non-object dynamic processes.
- Region-based Cluster Discrimination for Visual Representation Learning
-
This paper proposes RICE (Region-Aware Cluster Discrimination), which constructs a billion-scale region dataset, designs a Region Transformer layer, and introduces a unified region cluster discrimination loss to jointly optimize object-aware and OCR capabilities, significantly improving visual encoder performance across segmentation, detection, and MLLM multi-task benchmarks.
- Rethinking Detecting Salient and Camouflaged Objects in Unconstrained Scenes
-
This paper constructs USC12K, the first unconstrained dataset for salient and camouflaged object detection covering four scene types, proposes USCNet built upon SAM, introduces an Attribute Relationship Modeling (ARM) module to explicitly model the relationship between salient and camouflaged objects, and designs a new metric CSCS to quantify confusion between the two categories, achieving state-of-the-art performance across all scene types.
- ROADWork: A Dataset and Benchmark for Learning to Recognize, Observe, Analyze and Drive Through Work Zones
-
This paper introduces ROADWork, the first large-scale work zone dataset comprising 4,375 video clips, 9,650 richly annotated images, and 129K images with drivable path annotations. It reveals that foundation models fail severely in work zone scenarios (AP of only 2.9–4.2), while fine-tuning yields substantial improvements (+32.2 AP), and proposes a four-level cognitive framework of Recognize, Observe, Analyze, and Drive.
- SAM2Long: Enhancing SAM 2 for Long Video Segmentation with a Training-Free Memory Tree
-
To address the error accumulation caused by SAM 2's greedy selection strategy in long videos, this paper proposes a training-free constrained tree search memory strategy that maintains multiple segmentation paths and selects the optimal result at the video level, achieving an average improvement of 3.7 J&F across 9 VOS and 3 VOT benchmarks, with up to 5.3 gains on long-video scenarios.
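The memory-tree idea generalizes greedy per-frame mask selection to a small beam of hypotheses scored at the video level; a toy sketch of the mechanism with invented mask ids and scores (not SAM 2 code):

```python
def beam_search_paths(candidates_per_frame, beam=2):
    """Maintain several segmentation hypotheses instead of greedily taking
    the top-scoring mask each frame; return the best path at video level.
    candidates_per_frame: list of [(mask_id, score), ...] per frame."""
    paths = [([], 0.0)]
    for cands in candidates_per_frame:
        expanded = [(p + [m], s + sc) for p, s in paths for m, sc in cands]
        expanded.sort(key=lambda x: -x[1])
        paths = expanded[:beam]            # keep the `beam` best hypotheses
    return max(paths, key=lambda x: x[1])  # video-level selection

frames = [[("a", 0.5), ("b", 0.25)],
          [("c", 0.25), ("d", 0.125)]]
print(beam_search_paths(frames, beam=2))  # -> (['a', 'c'], 0.75)
```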
- SCORE: Scene Context Matters in Open-Vocabulary Remote Sensing Instance Segmentation
-
This paper proposes SCORE, a framework that injects multi-granularity scene context (regional and global) from a remote sensing-specific CLIP into an open-vocabulary instance segmentation pipeline. Two dedicated modules, Region-Aware Integration (RAI) and Global Context Adaptation (GCA), strengthen the visual and textual representations respectively, yielding an average mAP improvement of 5.53% over the previous state of the art in cross-dataset evaluation across multiple remote sensing benchmarks.
- Skeleton Motion Words for Unsupervised Skeleton-Based Temporal Action Segmentation
-
This paper proposes Skeleton Motion Quantization (SMQ), which achieves unsupervised temporal action segmentation on skeleton sequences via a joint-decoupled temporal autoencoder and a skeleton motion word quantization module, substantially outperforming existing unsupervised methods on HuGaDB, LARa, and BABEL.
- SPADE: Spatial-Aware Denoising Network for Open-vocabulary Panoptic Scene Graph Generation
-
This paper proposes SPADE — a spatial-aware denoising network for open-vocabulary panoptic scene graph generation (PSG). It adapts a pretrained diffusion model into a PSG-specific spatial prior extractor via DDIM inversion-guided calibration, and designs a relational graph Transformer to capture both long-range and local context. SPADE substantially outperforms prior state-of-the-art methods in both closed-set and open-set settings, with particularly strong performance on spatial relation prediction.
- Stepping Out of Similar Semantic Space for Open-Vocabulary Segmentation
-
This paper exposes an evaluation bias in existing open-vocabulary segmentation (OVS) benchmarks, where test sets exhibit high semantic similarity to training spaces. It proposes a new benchmark, OpenBench, and a method, OVSNet, that integrates heterogeneous features via Gradient-Free Aggregation (GFA) and expands the training semantic space at zero cost through Proxy Calibration (PC), achieving state-of-the-art performance on both existing benchmarks and OpenBench.
- TAViS: Text-bridged Audio-Visual Segmentation with Foundation Models
-
TAViS is a text-bridged audio-visual segmentation framework that couples the cross-modal alignment capability of ImageBind with the precise segmentation capability of SAM2. By introducing a text-bridged hybrid prompting mechanism and alignment supervision strategy, TAViS achieves state-of-the-art performance across single-source, multi-source, semantic, and zero-shot segmentation scenarios.
- Temporal Rate Reduction Clustering for Human Motion Segmentation
-
This paper proposes Temporal Rate Reduction Clustering (TR²C), which integrates the Maximal Coding Rate Reduction (MCR²) principle with temporal continuity regularization to jointly learn temporally consistent representations and affinity matrices conforming to the Union of Subspaces (UoS) distribution, achieving substantial state-of-the-art improvements on five HMS benchmarks.
- TinyViM: Frequency Decoupling for Tiny Hybrid Vision Mamba
-
This paper proposes TinyViM, a lightweight convolution-Mamba hybrid visual backbone based on frequency decoupling. A Laplace Mixer routes low-frequency components to Mamba for global context modeling and enhances high-frequency components via depthwise convolution. A frequency ramp Inception structure progressively adjusts frequency allocation across stages. TinyViM achieves 2–3× higher throughput than existing Mamba models on classification, detection, and segmentation tasks.
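The frequency-decoupling idea can be illustrated with a minimal NumPy sketch (a toy stand-in, not the paper's Laplace Mixer): a low-pass filter extracts the low-frequency component, and the residual carries the high frequencies, so the two branches can be processed separately and recombined losslessly.

```python
import numpy as np

def box_blur(x: np.ndarray, k: int = 3) -> np.ndarray:
    """Simple low-pass filter: k x k box blur with edge padding."""
    pad = k // 2
    xp = np.pad(x, pad, mode="edge")
    out = np.zeros_like(x, dtype=float)
    for i in range(k):
        for j in range(k):
            out += xp[i:i + x.shape[0], j:j + x.shape[1]]
    return out / (k * k)

def frequency_decouple(x: np.ndarray):
    """Split a feature map into low- and high-frequency parts.

    low  -> global-context branch (Mamba block in TinyViM)
    high -> local branch (depthwise convolution in TinyViM)
    """
    low = box_blur(x)
    high = x - low  # residual keeps the high frequencies
    return low, high

rng = np.random.default_rng(0)
feat = rng.standard_normal((8, 8))
low, high = frequency_decouple(feat)
assert np.allclose(low + high, feat)  # lossless recombination
```

The frequency ramp then amounts to shifting compute budget between the two branches from stage to stage.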
- TopoTTA: Topology-Enhanced Test-Time Adaptation for Tubular Structure Segmentation
-
This paper presents the first test-time adaptation (TTA) framework designed specifically for tubular structure segmentation (TSS). It adapts to cross-domain topological differences via Topological Meta-Differential Convolutions (TopoMDCs) and restores topological continuity through a Topological Hard Sample Generation (TopoHG) strategy, achieving an average clDice improvement of 31.81% across 10 datasets.
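The clDice metric referenced above is the standard centerline Dice from the tubular-segmentation literature (restated here for context). With predicted mask $V_P$, ground-truth mask $V_L$, and their skeletons $S_P$, $S_L$:

```latex
\mathrm{Tprec} = \frac{|S_P \cap V_L|}{|S_P|}, \qquad
\mathrm{Tsens} = \frac{|S_L \cap V_P|}{|S_L|}, \qquad
\mathrm{clDice} = \frac{2\,\mathrm{Tprec}\cdot\mathrm{Tsens}}{\mathrm{Tprec}+\mathrm{Tsens}},
```

which rewards preserving the centerline connectivity of vessels and similar tubular structures rather than raw voxel overlap.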
- Towards Omnimodal Expressions and Reasoning in Referring Audio-Visual Segmentation
-
This paper proposes the OmniAVS dataset and OISA model, extending referring audio-visual segmentation (RAVS) beyond simple acoustic attribute perception to omnimodal expressions (arbitrary combinations of text/speech/sound/image) and deep reasoning (understanding sound content + world knowledge), achieving SOTA on the new benchmark and multiple related tasks.
- Training-Free Class Purification for Open-Vocabulary Semantic Segmentation
-
This paper proposes FreeCP, a training-free class purification framework that addresses class redundancy and visual-language ambiguity arising from over-complete vocabularies in open-vocabulary semantic segmentation (OVSS), via a two-stage strategy of redundancy purification and ambiguity purification. As a plug-and-play module, FreeCP consistently improves existing methods across eight benchmarks.
- Personalized OVSS: Understanding Personal Concept in Open-Vocabulary Semantic Segmentation
-
This paper introduces the Personalized Open-Vocabulary Semantic Segmentation (Personalized OVSS) task for the first time, and proposes a plug-and-play method based on text prompt tuning. By incorporating negative mask proposals to suppress false positives and injecting visual embeddings to enrich personalized concept representations, the method enables recognition of user-specific object instances from only a few image-mask pairs, while preserving the original OVSS performance.
- UniGlyph: Unified Segmentation-Conditioned Diffusion for Precise Visual Text Synthesis
-
This paper proposes UniGlyph, a visual text generation framework that adopts segmentation masks as a unified conditioning signal. By replacing conventional rendered-glyph conditions with Adaptive Glyph Conditioning (AGC) and adding a Glyph Region Loss (GRL), UniGlyph achieves state-of-the-art bilingual (Chinese and English) text image generation under a single ControlNet architecture, with particularly large margins in small-font and complex-layout scenarios.
- VEGGIE: Instructional Editing and Reasoning Video Concepts with Grounded Generation
-
This paper proposes VEGGIE, an end-to-end unified framework that bridges an MLLM with a video diffusion model, enabling a single model to accomplish eight tasks—including instructional video editing, concept grounding, and reasoning segmentation—using only the diffusion loss.
- VSC: Visual Search Compositional Text-to-Image Diffusion Model
-
This paper proposes VSC, a compositional text-to-image diffusion method based on visual search. By generating a reference image for each attribute-object pair independently, fusing the resulting visual prototype embeddings, and training with segmentation-guided cross-attention localization, VSC significantly improves the accuracy and scalability of multi-attribute object binding.
- VSSD: Vision Mamba with Non-Causal State Space Duality
-
This paper proposes Non-Causal State Space Duality (NC-SSD), which transforms the SSD formulation of Mamba2 into a non-causal form by retaining the relative weights of token contributions in lieu of the cumulative decay of hidden states. Built upon NC-SSD, the VSSD visual backbone surpasses existing SSM-based models across classification, detection, and segmentation benchmarks while achieving 20%–50% faster training speed.
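The causal-to-non-causal change can be sketched in 1-D (a toy illustration of the intuition, not the paper's exact formulation): a causal SSD accumulates decay only over past tokens, so outputs depend on position, whereas dropping the cumulative decay in favor of per-token relative weights yields one global hidden state shared by every output position.

```python
import numpy as np

def causal_ssd(x: np.ndarray, a: np.ndarray) -> np.ndarray:
    """Causal linear recurrence: h_t = a_t * h_{t-1} + x_t."""
    h = np.zeros_like(x)
    prev = 0.0
    for t in range(len(x)):
        prev = a[t] * prev + x[t]
        h[t] = prev
    return h

def non_causal_mix(x: np.ndarray, a: np.ndarray) -> np.ndarray:
    """Non-causal toy variant: each token contributes via its own
    relative weight a_t, and the resulting global state is shared
    across all positions (no cumulative decay)."""
    mixed = float(np.sum(a * x))  # one order-independent hidden state
    return np.full_like(x, mixed)

x = np.array([1.0, 2.0, 3.0])
a = np.array([0.5, 0.5, 0.5])
h_causal = causal_ssd(x, a)       # position-dependent
h_global = non_causal_mix(x, a)   # identical at every position
```

The non-causal form removes the sequential dependency, which is where the reported training-speed gains come from.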
- What If: Understanding Motion Through Sparse Interactions
-
This paper proposes the Flow Poke Transformer (FPT), which directly predicts multimodal probability distributions over object motion in a scene (rather than a single deterministic outcome), conditioned on sparse "poke" interactions, enabling interpretable motion understanding and moving part segmentation.
- WildSeg3D: Segment Any 3D Objects in the Wild from 2D Images
-
This paper proposes WildSeg3D, the first feed-forward 3D segmentation model that requires no scene-specific training. It addresses multi-view pointmap alignment errors via Dynamic Global Alignment (DGA) and achieves real-time interactive 3D segmentation through Multi-view Group Mapping (MGM), outperforming the current state of the art in accuracy while being 40× faster.
- ZIM: Zero-Shot Image Matting for Anything
-
This paper proposes ZIM, a zero-shot image matting model that constructs the SA1B-Matte dataset by converting SA1B segmentation labels into fine-grained matting labels via a label converter. A hierarchical pixel decoder and a prompt-aware masked attention mechanism are further introduced to achieve micro-level fine-grained matting while preserving zero-shot generalization capability.