✂️ Segmentation

🤖 AAAI2026 · 31 paper notes

A²LC: Active and Automated Label Correction for Semantic Segmentation

This paper proposes the A²LC framework, which augments conventional active label correction (ALC) — where annotators manually fix errors one by one — with an automated correction stage via a Label Correction Module (LCM). The LCM leverages annotator feedback to automatically rectify similar erroneous masks, while an Adaptively Balanced acquisition function (ABC) is designed to mitigate class imbalance. On Cityscapes, A²LC surpasses the previous SOTA using only 20% of the budget, achieving a 27.23% mIoU improvement under equal budget conditions.
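The class-balancing idea behind the acquisition function can be illustrated with a minimal sketch (my own simplification, not the paper's exact ABC formulation): weight each pixel's predictive entropy by the inverse frequency of its predicted class, so that pixels from rare classes are queried for correction more often.

```python
import numpy as np

def balanced_acquisition(probs, n_queries):
    """Rank pixels for annotation: predictive entropy weighted by
    inverse predicted-class frequency (a class-balanced acquisition sketch).

    probs: (H, W, C) softmax probabilities for one image.
    Returns flat indices of the top-`n_queries` pixels.
    """
    eps = 1e-12
    entropy = -(probs * np.log(probs + eps)).sum(axis=-1)        # (H, W)
    pred = probs.argmax(axis=-1)                                 # (H, W)
    freq = np.bincount(pred.ravel(), minlength=probs.shape[-1])  # class counts
    weight = 1.0 / (freq[pred] + 1.0)                            # rare class -> large weight
    score = (entropy * weight).ravel()
    return np.argsort(score)[::-1][:n_queries]

# Toy image: the one uncertain, rare-class pixel should be queried first.
probs = np.zeros((2, 2, 3))
probs[..., 0] = 0.98; probs[..., 1] = 0.01; probs[..., 2] = 0.01
probs[0, 0] = [0.1, 0.5, 0.4]   # uncertain pixel, predicted as rare class 1
queried = balanced_acquisition(probs, n_queries=1)
```

The `+ 1.0` in the weight is just a smoothing term to avoid division by zero for absent classes.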

Adaptive Morph-Patch Transformer for Aortic Vessel Segmentation

This paper proposes the Morph-Patch Transformer (MPT), which generates morphology-aware patches via a velocity-field-based adaptive patch partitioning strategy to preserve vascular topological integrity, and introduces Semantic Clustering Attention (SCA) to dynamically aggregate features from semantically similar patches. The method achieves state-of-the-art performance on three aortic segmentation benchmarks: AVT, AortaSeg24, and TBAD.

Breaking the Stealth-Potency Trade-off in Clean-Image Backdoors with Generative Trigger Optimization

This paper proposes Generative Clean-Image Backdoors (GCB), which employs a Conditional InfoGAN (C-InfoGAN) to automatically discover naturally occurring, task-irrelevant features within images as backdoor triggers. GCB achieves high attack success rates (ASR ≥ 90%) at extremely low poison rates (≤ 0.5%) with negligible degradation of clean accuracy (CA drop ≤ 1%), thereby becoming the first method to break the inherent stealth-potency trade-off in clean-image backdoor attacks.

Bridging Granularity Gaps: Hierarchical Semantic Learning for Cross-Domain Few-Shot Segmentation

This paper proposes the HSL framework, which addresses the segmentation granularity gap between source and target domains in cross-domain few-shot segmentation (CD-FSS) via three modules — Dual Style Randomization (DSR), Hierarchical Semantic Mining (HSM), and Prototype Confidence-modulated Thresholding (PCMT) — achieving state-of-the-art performance across four target-domain datasets.

Causal-Tune: Mining Causal Factors from Vision Foundation Models for Domain Generalized Semantic Segmentation

This paper proposes Causal-Tune, a causality-driven VFM fine-tuning strategy that decomposes VFM features into causal (domain-invariant) and non-causal (domain-specific) components via DCT frequency-domain transformation and Gaussian band-pass filtering. Learnable tokens are applied exclusively to the causal components for refinement, effectively suppressing VFM artifacts and improving generalization in domain generalized semantic segmentation.
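The frequency-domain decomposition can be sketched as follows (a simplified illustration with hand-picked filter parameters, not the paper's exact configuration): transform a 2-D feature map with the DCT, apply a Gaussian band-pass mask to obtain the "causal" component, and take the complement for the "non-causal" component. Since the two masks sum to one, the components recombine exactly into the original features by linearity.

```python
import numpy as np
from scipy.fft import dctn, idctn

def bandpass_split(feat, center=0.3, sigma=0.15):
    """Split a 2-D feature map into a band-pass ('causal') part and its
    complement ('non-causal') via DCT and a Gaussian band-pass mask
    over the normalized frequency radius."""
    h, w = feat.shape
    fy = np.arange(h) / h
    fx = np.arange(w) / w
    r = np.sqrt(fy[:, None] ** 2 + fx[None, :] ** 2)        # normalized freq radius
    mask = np.exp(-((r - center) ** 2) / (2 * sigma ** 2))  # Gaussian band-pass
    spec = dctn(feat, norm="ortho")
    causal = idctn(spec * mask, norm="ortho")
    noncausal = idctn(spec * (1 - mask), norm="ortho")
    return causal, noncausal

feat = np.random.default_rng(0).standard_normal((16, 16))
causal, noncausal = bandpass_split(feat)
# By linearity of the DCT, causal + noncausal reconstructs feat exactly.
```

In the paper only the causal branch would then receive the learnable refinement tokens.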

CtrlFuse: Mask-Prompt Guided Controllable Infrared and Visible Image Fusion

This paper proposes CtrlFuse, which achieves interactive controllable infrared-visible image fusion by fine-tuning SAM with mask prompt guidance, simultaneously improving fusion quality and downstream segmentation/detection performance.

Do We Need Perfect Data? Leveraging Noise for Domain Generalized Segmentation

This paper proposes FLEX-Seg, a framework that reframes the inherent boundary misalignment between images and semantic masks in diffusion-model-synthesized data as an opportunity to learn robust representations. Through three modules—Granular Adaptive Prototypes (GAP), Uncertainty Boundary Emphasis (UBE), and Hardness-Aware Sampling (HAS)—FLEX-Seg achieves state-of-the-art performance on domain generalized semantic segmentation.

EAGLE: Episodic Appearance- and Geometry-Aware Memory for Unified 2D-3D Visual Query Localization

This paper proposes the EAGLE framework, inspired by avian memory consolidation mechanisms. A segmentation branch guided by an Appearance-aware Meta-learning Memory (AMM) and a tracking branch driven by a Geometry-aware Localization Memory (GLM) operate collaboratively. Combined with VGGT, the framework achieves efficient unified 2D-3D visual query localization, attaining state-of-the-art performance on the Ego4D-VQ benchmark.

Empowering DINO Representations for Underwater Instance Segmentation via Aligner and Prompter

This paper is the first to introduce DINOv2 into underwater instance segmentation. Through two adaptation modules—AquaStyle Aligner (Fourier frequency-domain style injection) and ObjectPrior Prompter (binary mask prior prompting)—the proposed DiveSeg achieves efficient domain adaptation and substantially outperforms SAM-based methods on the UIIS and USIS10K benchmarks with fewer parameters.

From Attribution to Action: Jointly ALIGNing Predictions and Explanations

This paper proposes the ALIGN framework, which jointly trains a learnable masker and a classifier through alternating optimization to iteratively align model attribution maps with task-relevant region masks, simultaneously improving prediction accuracy and interpretability. ALIGN outperforms six strong baselines on the VLCS and Terra Incognita domain generalization benchmarks.

Generalizable Slum Detection from Satellite Imagery with Mixture-of-Experts

This paper proposes GRAM (Generalized Region-Aware Mixture-of-Experts), a two-stage test-time adaptation framework. In the first stage, a MoE architecture is used to train region-specialized experts on million-scale satellite imagery from 12 cities. In the second stage, reliable pseudo-labels are selected via cross-region prediction consistency for self-training, enabling generalization of slum segmentation to unseen African cities.
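The second-stage selection rule can be sketched with a toy agreement filter (my own minimal version of cross-region consistency, assuming softmax maps from each expert): keep a pixel's pseudo-label only when every expert predicts the same class, and mark disagreements as ignore.

```python
import numpy as np

IGNORE = 255  # conventional "ignore" index in segmentation losses

def consistent_pseudo_labels(expert_probs):
    """Select reliable pseudo-labels by cross-expert agreement.

    expert_probs: (E, H, W, C) softmax maps from E region experts.
    Pixels where all experts share the same argmax keep that class;
    conflicting pixels are set to IGNORE.
    """
    preds = expert_probs.argmax(axis=-1)       # (E, H, W)
    agree = (preds == preds[0]).all(axis=0)    # (H, W)
    return np.where(agree, preds[0], IGNORE)

# Two experts agree on the first pixel and conflict on the second.
p = np.zeros((2, 1, 2, 3))
p[0, 0, 0] = [0.9, 0.1, 0.0]; p[1, 0, 0] = [0.8, 0.2, 0.0]  # both -> class 0
p[0, 0, 1] = [0.9, 0.1, 0.0]; p[1, 0, 1] = [0.1, 0.9, 0.0]  # disagreement
labels = consistent_pseudo_labels(p)
```

The ignored pixels are simply excluded from the self-training loss.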

Guideline-Consistent Segmentation via Multi-Agent Refinement

This paper proposes a training-free multi-agent framework that achieves guideline-consistent semantic segmentation through an iterative loop between a Worker agent (segmentation execution) and a Supervisor agent (guideline verification), coupled with an RL-based adaptive stopping strategy, surpassing the prior SOTA by 8.61 and 5.5 gIoU on Waymo and ReasonSeg, respectively.

InfoCLIP: Bridging Vision-Language Pretraining and Open-Vocabulary Semantic Segmentation via Information-Theoretic Alignment Transfer

This paper proposes InfoCLIP, which adopts an information-theoretic perspective to design two objectives—information bottleneck compression and mutual information distillation—to remove noise in pretrained pixel-text alignment and preserve semantic alignment knowledge during CLIP fine-tuning. InfoCLIP achieves state-of-the-art results across six open-vocabulary semantic segmentation benchmarks (A-847: 16.6, A-150: 38.5, PC-59: 63.5 mIoU) while introducing only 0.53M additional parameters and negligible computational overhead.
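As background, a generic variational information-bottleneck term (not InfoCLIP's exact objective) adds a KL penalty that compresses a Gaussian bottleneck toward a standard normal prior, trading task performance against the information retained:

```python
import numpy as np

def ib_loss(task_loss, mu, logvar, beta=1e-3):
    """Generic variational information-bottleneck objective:
    task loss plus a beta-weighted KL(q(z|x) || N(0, I)) for a
    diagonal-Gaussian bottleneck with mean `mu`, log-variance `logvar`."""
    kl = 0.5 * np.sum(np.exp(logvar) + mu ** 2 - 1.0 - logvar, axis=-1)
    return task_loss + beta * kl.mean()

# A bottleneck already matching the prior incurs zero KL penalty.
mu = np.zeros((4, 8)); logvar = np.zeros((4, 8))
loss = ib_loss(task_loss=0.7, mu=mu, logvar=logvar)
```

InfoCLIP applies this kind of compression to pixel-text alignment maps and pairs it with a distillation term that preserves mutual information with the frozen CLIP teacher.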

JoDiffusion: Jointly Diffusing Image with Pixel-Level Annotations for Semantic Segmentation Promotion

This paper proposes JoDiffusion, a framework that jointly diffuses images and pixel-level annotation masks in latent space, enabling—for the first time—simultaneous generation of semantically consistent image–annotation pairs conditioned solely on text prompts. JoDiffusion substantially outperforms existing Image2Mask and Mask2Image methods on Pascal VOC, COCO, and ADE20K.

LWGANet: Addressing Spatial and Channel Redundancy in Remote Sensing Visual Tasks with Light-Weight Grouped Attention

To address spatial redundancy (large homogeneous backgrounds) and channel redundancy (a single feature space that is inefficient under extreme scale variation) in remote sensing imagery, this paper proposes LWGANet, a lightweight backbone that builds efficient multi-scale feature representations via Top-K sparse global feature interaction (TGFI) and heterogeneous grouped attention (LWGA) modules, reaching SOTA across four remote sensing task categories on 12 datasets.
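The Top-K sparsification idea can be sketched as follows (a minimal single-head version; the module's actual design in LWGANet is more involved): each query interacts only with its k highest-scoring keys, suppressing the large homogeneous background regions.

```python
import numpy as np

def topk_attention(q, k, v, topk=4):
    """Sparse global interaction: each query attends only to its
    top-k keys; all other scores are masked out before softmax.

    q: (N, D), k: (M, D), v: (M, D). Returns (N, D).
    """
    scores = q @ k.T / np.sqrt(q.shape[-1])            # (N, M)
    kth = np.sort(scores, axis=-1)[:, -topk][:, None]  # per-row k-th largest
    masked = np.where(scores >= kth, scores, -np.inf)
    w = np.exp(masked - masked.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)                 # softmax over kept keys
    return w @ v

rng = np.random.default_rng(0)
q, k, v = rng.standard_normal((3, 16, 8))
out = topk_attention(q, k, v, topk=4)
```

Compared with dense attention, the score matrix is still computed in full here; a production kernel would also exploit the sparsity for speed.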

Multigranular Evaluation for Brain Visual Decoding

This paper proposes BASIC, a multigranular evaluation framework that unifies the assessment of brain visual decoding quality along two axes — structural (multi-level segmentation mask matching across four granularities) and semantic (precision/recall/F1 over object/attribute/relation graphs extracted by MLLMs) — covering six modality combinations of fMRI/EEG × Image/Video/3D, thereby addressing the limitations of metric saturation, lack of neuroscientific grounding, and insufficient fine-grained diagnostic capacity in existing evaluation protocols.

Otter: Mitigating Background Distractions of Wide-Angle Few-Shot Action Recognition with Enhanced RWKV

To address background distraction in few-shot action recognition (FSAR) for wide-angle videos — where subjects occupy a small portion of the frame and temporal relationships degrade — this paper proposes Otter, an enhanced RWKV-based framework. It introduces a Compound Segmentation Module (CSM) for subject highlighting and a Temporal Reconstruction Module (TRM) for recovering temporal relationships, achieving state-of-the-art results on SSv2, Kinetics, UCF101, and HMDB51 benchmarks.

RS2-SAM2: Customized SAM2 for Referring Remote Sensing Image Segmentation

This paper proposes RS2-SAM2, a framework that injects textual information into the SAM2 image encoding process via a Bidirectional Hierarchical Fusion Module (BHFM) and designs a Mask Prompt Generator (MPG) to supply SAM2 with dense prompts, achieving state-of-the-art performance on referring remote sensing image segmentation.

RSVG-ZeroOV: Exploring a Training-Free Framework for Zero-Shot Open-Vocabulary Visual Grounding in Remote Sensing Images

This paper proposes RSVG-ZeroOV, a training-free framework that integrates cross-attention maps from a VLM and self-attention maps from a diffusion model via an "Overview-Focus-Evolve" three-stage strategy, enabling zero-shot open-vocabulary visual grounding in remote sensing images.

S5: Scalable Semi-Supervised Semantic Segmentation in Remote Sensing

This paper proposes the S5 framework, which for the first time extends semi-supervised semantic segmentation into a pre-training paradigm for remote sensing foundation models (RSFMs). By constructing the million-scale RS4P-1M dataset and introducing a MoE-based multi-dataset fine-tuning strategy, S5 achieves state-of-the-art performance across multiple remote sensing segmentation and detection benchmarks.

SAM-DAQ: Segment Anything Model with Depth-guided Adaptive Queries for RGB-D Video Salient Object Detection

This paper proposes SAM-DAQ, which adapts SAM2 to RGB-D video salient object detection (VSOD) via a Depth-guided Parallel Adapter (DPA) and a Query-driven Temporal Memory (QTM) module, addressing three key challenges: manual prompt dependency, excessive GPU memory consumption, and computational overhead.

SAQ-SAM: Semantically-Aligned Quantization for Segment Anything Model

This paper proposes SAQ-SAM, which improves post-training quantization (PTQ) of SAM from a semantic alignment perspective. It introduces Perceptual Consistency Clipping (PCC) to handle extreme outliers in the mask decoder, and Prompt-Aware Reconstruction (PAR) to preserve semantic alignment between image and prompt interactions.
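The outlier-clipping trade-off that PCC targets can be demonstrated with a generic PTQ sketch (not the paper's perceptual criterion): clipping an extreme activation shrinks the quantization step, so the bulk of the tensor is represented far more finely at the cost of error on the outlier itself.

```python
import numpy as np

def quantize(x, clip, n_bits=8):
    """Symmetric uniform quantization with a clipping threshold.
    A smaller clip range means a finer step size for in-range values
    but larger error on values beyond the threshold."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = clip / qmax
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)
    return q * scale

rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000)
x[0] = 50.0                               # one extreme outlier
err_full = np.mean((x - quantize(x, clip=50.0)) ** 2)  # range covers the outlier
err_clip = np.mean((x - quantize(x, clip=4.0)) ** 2)   # outlier is clipped
```

On this toy tensor the clipped quantizer attains lower overall MSE, which is the basic motivation for learning or searching a clipping threshold rather than using the raw min/max range.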

Segment and Matte Anything in a Unified Model (SAMA)

This paper proposes SAMA — a lightweight extension of SAM that introduces a Multi-View Local Encoder (MVLE) to capture fine-grained local features, a Local-Adapter to inject local details into the decoding process, and dual task-specific prediction heads. With only a 1.8% parameter increase, SAMA achieves high-quality interactive segmentation and alpha matting simultaneously within a unified model, reaching state-of-the-art performance on DIS-5K and multiple matting benchmarks.

Segment Anything Across Shots: A Method and Benchmark

This paper proposes SAAS, a method for Multi-shot Video Object Segmentation (MVOS), along with the Cut-VOS benchmark. SAAS achieves robust cross-shot segmentation via transition-simulating data augmentation (TMA), shot transition detection and understanding modules (TDM+TCH), and a local memory bank.

SSR: Semantic and Spatial Rectification for CLIP-based Weakly Supervised Segmentation

This paper proposes SSR, a dual-level semantic and spatial rectification framework that addresses non-target foreground over-activation caused by CLIP's cross-modal semantic misalignment via Cross-Modal Prototype Alignment (CMPA), and background over-activation during affinity propagation via Superpixel-Guided Correction (SGC). SSR achieves state-of-the-art performance on PASCAL VOC and MS COCO, surpassing both single-stage and multi-stage methods.

Symmetrical Flow Matching: Unified Image Generation, Segmentation, and Classification with Score-Based Generative Models

This paper proposes Symmetrical Flow Matching (SymmFlow), which unifies semantic segmentation, classification, and image generation into a single model. By jointly modeling forward and reverse flow transformations through a symmetric learning objective, SymmFlow achieves state-of-the-art performance in semantic image synthesis with only 25 inference steps (CelebAMask-HQ FID 11.9, COCO-Stuff FID 7.0), while obtaining competitive results on segmentation and classification.

Target Refocusing via Attention Redistribution for Open-Vocabulary Semantic Segmentation: An Explainability Perspective

This paper systematically investigates CLIP's internal mechanisms from an explainability perspective, revealing a "distraction" phenomenon in which CLIP allocates substantial attention resources to target-irrelevant tokens in deeper layers. The proposed training-free method RF-CLIP performs attention redistribution to refocus dispersed resources onto target regions, achieving state-of-the-art performance across 8 benchmarks while maintaining inference efficiency.
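The refocusing operation can be sketched in its simplest form (my own minimal version, not RF-CLIP's exact redistribution rule): zero the attention mass assigned to tokens identified as irrelevant, then renormalize each row so the freed mass flows back to the remaining tokens.

```python
import numpy as np

def redistribute_attention(attn, irrelevant):
    """Zero attention on irrelevant tokens and renormalize each row,
    refocusing the freed mass on the remaining tokens.

    attn: (N, N) row-stochastic attention; irrelevant: bool mask (N,).
    """
    out = attn.copy()
    out[:, irrelevant] = 0.0
    out /= out.sum(axis=-1, keepdims=True)
    return out

attn = np.array([[0.5, 0.3, 0.2],
                 [0.1, 0.8, 0.1],
                 [0.2, 0.2, 0.6]])
irrelevant = np.array([False, False, True])  # last token is a distractor
refocused = redistribute_attention(attn, irrelevant)
```

Because the method only reshapes existing attention maps at inference time, it stays training-free and adds negligible cost.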

Text-guided Controllable Diffusion for Realistic Camouflage Images Generation

This paper proposes CT-CIG, the first text-guided controllable camouflage image generation (CIG) method. It leverages a VLM-based Camouflage-Revealing Dialogue Mechanism (CRDM) to generate high-quality text prompts, and combines a lightweight control network with a Frequency Interaction Refinement Module (FIRM) built on the Stable Diffusion framework to produce logically coherent and texturally realistic camouflage images, establishing a new text-guided CIG paradigm.

Towards Affordance-Aware Robotic Dexterous Grasping with Human-like Priors

This paper proposes AffordDex, a two-stage framework: the first stage pre-trains human hand motion priors (natural motion trajectories) via imitation learning; the second stage refines the policy through reinforcement learning using a residual module and VLM-guided Negative Affordance Annotation (NAA), achieving dexterous robotic grasping that is both human-like in naturalness and functionally correct (e.g., avoiding the blade and grasping the handle of a knife). The method significantly outperforms state-of-the-art approaches across multiple generalization levels.

Tracking and Segmenting Anything in Any Modality

This paper proposes SATA, a unified tracking and segmentation framework that models cross-modal shared and modality-specific knowledge via a Decoupled Mixture-of-Experts (DeMoE) mechanism, and introduces a Task-aware Multi-Object Tracking (TaMOT) pipeline to unify the output format across all tasks, demonstrating superior performance across 18 tracking and segmentation benchmarks.

Vista: Scene-Aware Optimization for Streaming Video Question Answering Under Post-Hoc Queries

This paper proposes Vista, a scene-aware streaming video question answering framework that dynamically segments streaming video into semantically coherent scene units, applies spatiotemporal compression to each scene and offloads it to CPU memory, and selectively recalls the most relevant scenes upon user queries, achieving high-accuracy video QA under strict GPU memory and latency constraints.