✂️ Segmentation

🔬 ICLR2026 · 32 paper notes

📌 Same area in other venues: 📷 CVPR2026 (117) · 🧪 ICML2026 (14) · 🤖 AAAI2026 (29) · 🧠 NeurIPS2025 (45) · 📹 ICCV2025 (73)

🔥 Top topics: Segmentation ×15 · Reasoning ×5 · Alignment/RLHF ×2 · Multimodal/VLM ×2 · Diffusion Models ×2

Advancing Complex Video Object Segmentation via Progressive Concept Construction

This paper introduces Segment Concept (SeC), which injects object-level "concept representations" extracted by Large Vision-Language Models (LVLMs) into a SAM 2.1-style Video Object Segmentation (VOS) pipeline on demand. This approach significantly reduces appearance-based interference and object reappearance failures in complex multi-shot scenarios. The paper also establishes the SeCVOS benchmark, designed specifically to evaluate semantic-level VOS capabilities.

AMLRIS: Alignment-aware Masked Learning for Referring Image Segmentation

The authors propose an Alignment-aware Masked Learning (AML) strategy that quantifies vision-language patch-level alignment and filters low-alignment pixels. This allows the RIS model to focus on reliable regions during training, achieving SOTA results across all 8 RefCOCO splits without any architectural modifications.

Benchmarking Open-ended Segmentation

This paper targets an evaluation loophole in "open-ended segmentation," where model-generated free-form text is forcibly mapped back to a fixed vocabulary via embedding similarity. It introduces a mapping function based on lexical relationships (exact/synonym/hyponym/meronym) and a Lexical Alignment Curve (LAC) protocol, which moves evaluation from a 37.7% deviation from human judgement to over 90% alignment. The authors also train OPAL, the first open-ended segmentation MLLM with a contrastive loss, which achieves a new SOTA on open-ended panoptic segmentation.

ByteFlow: Language Modeling through Adaptive Byte Compression without a Tokenizer

The authors propose ByteFlow Net, a hierarchical byte-level language model that operates without a tokenizer. It utilizes the information-theoretic metric of coding rate to adaptively compress raw byte streams into semantic units, outperforming BPE-based baselines and existing byte-level architectures in both pre-training loss and downstream tasks.

Decomposed Attention Fusion in MLLMs for Training-free Video Reasoning Segmentation

This work reconfigures video reasoning segmentation into a video QA task, extracting localization cues directly from the MLLM attention rollout. It purifies noisy attention maps into clean object masks through "Contrastive Background Removal" and "Video-Frame Complementarity" fusion. Finally, attention-guided SAM2 generates fine-grained masks. The entire process is training-free and achieves performance comparable to supervised methods.
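
As a rough illustration of the attention rollout that the localization cues are read from, the sketch below implements the generic rollout recipe: mix each layer's head-averaged attention with the identity to account for residual connections, renormalize the rows, and multiply across layers. This is an assumption about the general technique, not the paper's pipeline; the function names and the 0.5/0.5 residual mix are illustrative choices.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def attention_rollout(layers):
    """Generic attention rollout over per-layer attention matrices.

    `layers` is a list of square, row-stochastic matrices ordered from the
    first layer to the last (hypothetical input format: heads already averaged).
    """
    n = len(layers[0])
    rollout = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for attn in layers:
        # Add the identity to model the residual path (assumed 0.5/0.5 mix).
        mixed = [
            [0.5 * attn[i][j] + (0.5 if i == j else 0.0) for j in range(n)]
            for i in range(n)
        ]
        # Renormalize so every row is again a distribution over tokens.
        mixed = [[v / sum(row) for v in row] for row in mixed]
        # Later layers are applied on the left of the accumulated product.
        rollout = matmul(mixed, rollout)
    return rollout
```

Row `q` of the result can be read as how strongly token `q` draws on each input token after all layers; reshaping the entries that belong to image tokens would give a coarse, noisy map of the kind the paper then purifies.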

Deforming Videos to Masks: Flow Matching for Referring Video Segmentation

This work formalizes Referring Video Object Segmentation (RVOS) as an ODE flow problem that continuously deforms video latent representations into masks under language guidance. By fine-tuning the pre-trained text-to-video (T2V) model Wan2.1 and employing three strategies focused on the trajectory starting point, the method achieves SOTA performance on MeViS, Ref-YouTube-VOS, and Ref-DAVIS17.
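
The "ODE flow" view can be illustrated with a toy Euler integrator that moves a latent vector toward a mask vector along a velocity field. In this minimal sketch the learned, language-conditioned velocity network is replaced by an oracle straight-line field; `euler_integrate` and `straight_line_velocity` are hypothetical names and do not come from the paper.

```python
def euler_integrate(x0, velocity, steps=10):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with forward Euler."""
    x = list(x0)
    dt = 1.0 / steps
    for s in range(steps):
        t = s * dt
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def straight_line_velocity(x0, x1):
    """Oracle velocity for the straight path x_t = (1 - t) * x0 + t * x1.

    A trained model would predict this field from (x, t, text); here it is
    simply the constant difference x1 - x0 (assumption for illustration).
    """
    diff = [b - a for a, b in zip(x0, x1)]
    return lambda x, t: diff
```

With the oracle field, `euler_integrate(video_latent, straight_line_velocity(video_latent, mask))` lands on `mask` up to floating-point error for any step count, since the velocity is constant along the path.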

Detective SAM: Adaptive AI-Image Forgery Localization

A set of lightweight adapters is attached to SAM2 to automatically convert forensic cues based on "post-perturbation feature distribution shift" into heatmap prompts for segmenting tampered areas in diffusion edits. Combined with the AutoEditForge pipeline for automatic data generation, the localizer can continually adapt to evolving image editing models.

Efficient-SAM2: Accelerating SAM2 with Object-Aware Visual Encoding and Memory Retrieval

The study identifies that SAM2 exhibits sparse perception patterns similar to biological vision (the decoder focuses on the foreground while the encoder computes globally; only a few tokens in memory frames are effective and remain temporally consistent in saliency). Based on this, Efficient-SAM2 is proposed, eliminating redundant computation through Object-Aware Sparse Window Routing (SWR) and Sparse Memory Retrieval (SMR). This achieves a 1.68× end-to-end acceleration on SAM2.1-L with only a 1% accuracy loss.

Enabling True Global Perception in State Space Models for Visual Tasks

The authors axiomatically define "image global modeling" for the first time using gradient lower bound axioms and design the GSSM module based on 2D-DFT frequency domain modulation. They theoretically prove and experimentally verify that SSMs can achieve true global perception while maintaining linear-logarithmic complexity.

Enhancing Image-Conditional Coverage in Segmentation: Adaptive Thresholding via Differentiable Miscoverage Loss

The COAT framework is proposed to learn image-adaptive threshold predictors end-to-end using a differentiable sigmoid soft TPR approximation as a loss function, significantly reducing the per-image Coverage Gap in Conformal Risk Control for image segmentation.
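
To make the "sigmoid soft TPR" idea concrete, the sketch below compares a hard true-positive rate (a step function of the threshold, so its gradient is zero almost everywhere) with a sigmoid-smoothed version. The function names, the temperature parameter, and the exact form of the relaxation are assumptions for illustration; the paper's loss may differ in detail.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def hard_tpr(probs, labels, threshold):
    """Fraction of positive pixels whose score reaches the threshold."""
    pos = [p for p, y in zip(probs, labels) if y == 1]
    return sum(1 for p in pos if p >= threshold) / len(pos)


def soft_tpr(probs, labels, threshold, temperature=0.05):
    """Smooth surrogate: each positive pixel contributes sigmoid((p - t) / temperature).

    As temperature shrinks, this approaches hard_tpr, while staying
    differentiable with respect to the threshold.
    """
    pos = [p for p, y in zip(probs, labels) if y == 1]
    return sum(sigmoid((p - threshold) / temperature) for p in pos) / len(pos)
```

Because `soft_tpr` varies smoothly with `threshold`, a predictor that outputs a per-image threshold can be trained by gradient descent to hit a target coverage level.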

Falcon: Fast Proximal Linearization of Normalized Cuts for Unsupervised Image Segmentation

Falcon reformulates the classic Normalized Cut (NCut) in zero-shot unsupervised segmentation—moving away from the traditional "spectral relaxation + recursive bisection + rounding" routine—into a solver that directly performs proximal linearization on discrete K-way one-hot labels. This approach ensures linear convergence under the KL framework, improves inference speed by nearly an order of magnitude, and achieves new SOTA results across six segmentation benchmarks.
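
For reference, the K-way Normalized Cut objective that such a solver minimizes over discrete labels can be evaluated directly. The sketch below computes the standard NCut value (sum over clusters of cut weight divided by cluster volume) for a given hard labeling; it only scores a labeling and does not implement Falcon's proximal linearization. The function name and input format are illustrative assumptions.

```python
def ncut_value(weights, labels, k):
    """Standard K-way normalized cut of a hard labeling.

    `weights` is a symmetric affinity matrix (list of rows) and `labels[i]`
    is the cluster index of node i, in range(k).
    """
    n = len(weights)
    total = 0.0
    for c in range(k):
        members = [i for i in range(n) if labels[i] == c]
        # Volume: total affinity attached to the cluster's nodes.
        vol = sum(weights[i][j] for i in members for j in range(n))
        # Cut: affinity leaving the cluster.
        cut = sum(
            weights[i][j] for i in members for j in range(n) if labels[j] != c
        )
        if vol == 0:
            raise ValueError(f"cluster {c} is empty or has no affinity")
        total += cut / vol
    return total
```

On a graph made of two tightly connected groups joined by one weak edge, the labeling that separates the groups gets a much lower value than one that splits a group, which is the behavior any NCut solver exploits.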

gen2seg: Generative Models Enable Generalizable Instance Segmentation

By fine-tuning Stable Diffusion or MAE as "instance colorers" using synthetic mask supervision from only two narrow domains—indoor furniture and vehicles—this method achieves zero-shot generalization to unseen object categories and styles (e.g., humans, animals, artistic paintings, X-rays). Its performance approaches, and on fine structures even exceeds, SAM models supervised by 1.1 billion masks.

Hierarchical Prototype Learning for Semantic Segmentation

HiPoSeg attaches a "high-level + low-level" category prototype memory bank to the output of a segmentation model. It employs hierarchical contrastive learning and cross-layer margin alignment to organize the representation space following the human visual approach of "identifying the whole before distinguishing parts." As a pure training-time plugin with zero inference overhead, it achieves an average mIoU gain of 3.07 percentage points across four benchmarks.

How Well Does GPT-4o Understand Vision? Evaluating Multimodal Foundation Models on Standard Computer Vision Tasks

To be added after in-depth paper reading.

LiFR-Seg: Anytime High-Frame-Rate Segmentation via Event-Guided Propagation

LiFR-Seg propagates semantic features from low-frame-rate RGB images to arbitrary intermediate time points using high-frequency motion fields estimated from event streams. By employing uncertainty weighting and temporal memory to mitigate event sparsity and long-interval degradation, it allows low-frame-rate hardware to approach or even exceed the performance of high-frame-rate RGB segmentation at night.

Locality-Attending Vision Transformer

This paper proposes the LocAt modular plugin (GAug + PRR), which focuses attention on local neighborhoods and refines patch representations through learnable Gaussian kernel biases. Without modifying the training objective, it improves ViT performance on ADE20K segmentation by over 6% while simultaneously maintaining or increasing classification accuracy.

Matting Anything 2: Towards Video Matting for Anything

MAM2 is a universal video matting model built on SAM2, driven by point/box/mask prompts. It addresses the cross-frame collapse of transparent objects through a "dual-modal decoder predicting mask and trimap simultaneously" and a "memory-separated siamese mechanism," extending matting capabilities from portraits to arbitrary natural objects like flames, bubbles, and water.

Object-Centric Refinement for Enhanced Zero-Shot Segmentation

To address the limitation that CLIP patch features "lack object structure and are difficult to cluster into coherent semantic regions," OC-ZSS injects "object prompts" guided by DINO clustering into the frozen CLIP encoder. It then iteratively refines patch features into object-centric representations using two-stage Object Refinement Attention (ORA) coupled with multi-scale granularity attention, achieving SOTA across inductive, transductive, and cross-domain zero-shot segmentation settings.

Panoptic Pairwise Distortion Graph

This paper generalizes scene graphs from "intra-image" to "inter-image" by proposing the Distortion Graph (DG)—a structured representation using regions as atomic nodes. It introduces PANDASET (a region-level distortion dataset of 500k image pairs), PANDABENCH (a benchmark with three difficulty levels), and PANDA (a DETR-style lightweight architecture). Experiments demonstrate that frontier MLLMs perform near random chance in region-level distortion comparison, while PANDA leads across all difficulties. Furthermore, feeding predicted DGs to MLLMs as a Chain-of-Thought (CoT) triggers an emergent performance gain of approximately 15%.

QPrompt-R1: Real-Time Reasoning for Domain-Generalized Semantic Segmentation via Group-Relative Query Alignment

Addressing the challenge of achieving both real-time performance and cross-domain robustness in semantic segmentation, this paper identifies that the bottleneck of slow DGSS lies in the heavy segmentation head rather than the VFM backbone. By injecting learnable queries only into the final layer of the VFM (QPrompt), the authors achieve a lightweight architecture approximating query-decoding. Combined with Group-Relative Query Alignment (GRQA) active only during training, the method unlocks generalization capabilities and approaches heavy DGSS performance at 54 FPS.

RegionReasoner: Region-Grounded Multi-Round Visual Reasoning

RegionReasoner is proposed as a multi-round visual reasoning framework based on reinforcement learning. By utilizing reference grounding rewards and global-local consistency rewards, the model is compelled to explicitly reference coordinates of specified regions and maintain semantic coherence throughout reasoning trajectories. This results in significant improvements in multi-round localization and segmentation accuracy on the newly constructed RegionDial-Bench.

Revisiting [CLS] and Patch Token Interaction in Vision Transformers

This paper analyzes the interaction friction between the [CLS] global token and local patch tokens in Vision Transformers. It observes that normalization layers implicitly differentiate these two token categories. By introducing specialized processing paths in normalization layers and early QKV projections, the authors achieve a segmentation performance gain of over 2 mIoU with only an 8% increase in parameters, while maintaining classification accuracy.

S3OD: Towards Generalizable Salient Object Detection with Synthetic Data

To address the issues of expensive labeling, data scarcity, and fragmented sub-tasks (DIS / HR-SOD) in Salient Object Detection (SOD), this paper proposes a multimodal diffusion pipeline to simultaneously generate images and pixel-level masks. By incorporating iterative generation with hard-example feedback, the authors create S3OD, a high-resolution synthetic dataset of 139,000 images. Coupled with an ambiguity-aware multi-mask decoder, models trained solely on synthetic data reduce cross-dataset errors by 20–50% and achieve SOTA results on DIS and HR-SOD after fine-tuning.

Salient Object Ranking via Cyclical Perception-Viewing Interaction Modeling

Addressing the long-standing reliance of Salient Object Ranking (SOR) on bottom-up image features, this paper proposes to explicitly model the top-down cognitive process through "Cyclical Perception-Viewing Interaction." By allowing an image captioning module (SP) and a salient ranking module (GR) to iteratively exchange results for \(K\) rounds, the model achieves SA-SOR scores of 0.787 / 0.624 on the ASSR and IRSR benchmarks, outperforming the previous SOTA, QAGNet.

SAM-Veteran: An MLLM-based Human-like SAM Agent for Reasoning Segmentation

SAM-Veteran trains an MLLM to become a "seasoned SAM user" by imitating a human-like interactive segmentation workflow: "generating initial boxes \(\rightarrow\) observing SAM masks for iterative refinement via points \(\rightarrow\) adaptive termination." This behavior is learned through a multi-task reinforcement learning framework based on GRPO, achieving new SOTA on both in-distribution and out-of-distribution reasoning segmentation benchmarks.
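
The group-relative part of GRPO can be sketched in a few lines: several rollouts are sampled for the same prompt, and each rollout's reward is normalized against the group's mean and standard deviation to form its advantage. This is a minimal sketch of that normalization only; the reward design, the policy update, and the use of the population standard deviation are assumptions here, not details taken from the paper.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of rollouts for the same prompt.

    Returns (r - mean) / (std + eps) for each reward; eps guards against a
    zero standard deviation when all rollouts score the same.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]
```

Rollouts that beat their siblings get positive advantages and the rest get negative ones, so no separate value network is needed; when every rollout in a group earns the same reward, all advantages are zero and the group contributes no gradient.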

SAM 3: Segment Anything with Concepts

SAM 3 unifies "finding and segmenting all instances of a concept in images/videos" (Promptable Concept Segmentation, PCS) into a single model. By using noun phrases or visual exemplars as prompts, it outputs masks and cross-frame identities for all matching instances via a shared backbone + detector + memory tracker. Supported by a human-AI collaborative data engine producing a training set with 4M concept labels, SAM 3 doubles the accuracy of existing systems in both image and video PCS.

TRACE: Your Diffusion Model is Secretly an Instance Edge Detector

This paper finds that the self-attention of text-to-image diffusion models exhibits an "Instance Emergence Point" (IEP) during the denoising process, at which it diverges sharply at object boundaries. TRACE generates high-quality instance edges through IEP localization + ABDiv edge extraction + single-step distillation, achieving 81× inference acceleration. Without any instance annotations, it improves unsupervised instance segmentation by +5.1 AP, and its tag-supervised panoptic segmentation outperforms point-supervised methods by +1.7 PQ.

Universal Multi-Domain Translation via Diffusion Routers

This paper proposes the Diffusion Router (DR), which utilizes a single noise prediction network to implement all cross-domain mappings by conditioning on source/target domain labels. It supports indirect translation via a central domain as well as direct non-central domain translation based on a variational upper bound objective combined with Tweedie refinement, achieving SOTA performance on three large-scale UMDT benchmarks.

Urban Socio-Semantic Segmentation with Vision-Language Reasoning

This paper defines a new task, "Urban Socio-Semantic Segmentation": segmenting entities such as schools and parks, which are defined by social attributes rather than visual appearance, from satellite imagery. It constructs the SocioSeg dataset, which unifies heterogeneous geospatial data into a single rendered digital map layer, and proposes the SocioReasoner framework. SocioReasoner uses a VLM to mimic the human annotator's two-stage reasoning process of "localization, rendering feedback, and refinement," and optimizes this non-differentiable prompt generation pipeline end-to-end via GRPO reinforcement learning. It outperforms SOTA models across three-level hierarchical tasks while demonstrating strong zero-shot generalization.

VINCIE: Unlocking In-context Image Editing from Video

This paper proposes the VINCIE framework and demonstrates for the first time that in-context image editing models can be learned entirely from native video data. By annotating videos as interleaved multimodal sequences and designing three proxy tasks (NIP/CSP/NSP), it reaches SOTA on multi-turn editing benchmarks, raising the 5-turn editing success rate from the baselines' <2% to 25%.

VIRTUE: Visual-Interactive Text-Image Universal Embedder

This paper proposes VIRTUE, which combines the segmentation model SAM2 with a VLM to construct a visual-interactive universal embedder. It allows users to specify regions of interest via points/boxes/masks to generate joint entity-level and global-level embeddings. A million-scale SCaR benchmark is constructed to evaluate visual-interactive retrieval capabilities. VIRTUE achieves SOTA on 36 MMEB tasks (+3.1%-8.5%) and 5 SCaR tasks (+15.2%-20.3%).

WOW-Seg: A Word-Free Open World Segmentation Model

WOW-Seg reformulates the task of "assigning category names to segmented regions" from a classification problem with fixed heads into an autoregressive "image captioning" generation problem for VLLMs. By using Mask2Token to encode arbitrary masks into visual prompts within the VLM feature space and Cascade Attention Mask to prevent interference between multiple masks during parallel training/inference, it achieves new SOTA results on LVIS / PACO with only 1B parameters.