✂️ Segmentation

🔬 ICLR2026 · 11 paper notes

AMLRIS: Alignment-aware Masked Learning for Referring Image Segmentation: This paper proposes an Alignment-aware Masked Learning (AML) strategy that quantifies patch-level vision-language alignment and filters out low-alignment pixels, so that RIS models focus on reliable regions during training. Without any architectural modifications, AML achieves state-of-the-art performance across all 8 splits of the RefCOCO benchmarks.
ByteFlow: Language Modeling through Adaptive Byte Compression without a Tokenizer: This paper proposes ByteFlow Net, a tokenizer-free hierarchical byte-level language model that uses an information-theoretic coding rate to adaptively compress raw byte streams into semantic units, outperforming BPE baselines and existing byte-level architectures on both pretraining loss and downstream tasks.
Efficient-SAM2: Accelerating SAM2 with Object-Aware Visual Encoding and Memory Retrieval: This paper identifies sparse perception patterns in SAM2 analogous to biological vision: the decoder focuses on the foreground while the encoder computes broadly, and only a small subset of tokens in memory frames are active, with temporally consistent saliency. Based on these observations, the paper proposes Efficient-SAM2, which eliminates redundant computation via object-aware Sparse Window Routing (SWR) and Sparse Memory Retrieval (SMR), achieving a 1.68× end-to-end speedup on SAM2.1-L with only 1% accuracy loss.
Locality-Attending Vision Transformer: This paper proposes LocAt, a modular plug-in comprising GAug and PRR, which biases attention toward local neighborhoods via learnable Gaussian kernels and refines patch representations. Without modifying the training objective, it improves ViT segmentation performance on ADE20K by over 6% while simultaneously boosting classification accuracy.
RegionReasoner: Region-Grounded Multi-Round Visual Reasoning: This paper proposes RegionReasoner, a reinforcement learning-based multi-round visual reasoning framework that employs reference annotation rewards and global-local consistency rewards to enforce explicit citation of reference region coordinates in reasoning traces while maintaining semantic coherence. The approach achieves significant improvements in multi-round localization and segmentation accuracy on the newly constructed RegionDial-Bench.
Revisiting [CLS] and Patch Token Interaction in Vision Transformers: This paper systematically analyzes the interaction friction between the global [CLS] token and local patch tokens in Vision Transformers. It reveals that normalization layers implicitly differentiate between the two token types, and proposes specialized processing paths in normalization layers and early QKV projections. With only an 8% parameter increase, the method achieves over 2 mIoU improvement in segmentation while preserving classification accuracy.
Thicker and Quicker: A Jumbo Token for Fast Plain Vision Transformers: This paper proposes Jumbo, a method that expands the ViT CLS token to \(J\) times its original width, splits it into \(J\) patch-width tokens before attention, and reassembles them after attention for processing by a dedicated wide FFN. With negligible computational overhead, Jumbo substantially increases global modeling capacity, enabling plain ViTs to surpass dedicated efficient architectures (EfficientViT, SHViT, MobileNetV4) in high-throughput inference regimes while preserving all architectural advantages of the plain ViT.
TRACE: Your Diffusion Model is Secretly an Instance Edge Detector: This work identifies an "Instance Emergence Point" (IEP) in the denoising trajectory of text-to-image diffusion models, at which self-attention exhibits sharp divergence changes at object boundaries. TRACE combines IEP localization, ABDiv edge extraction, and single-step distillation to generate high-quality instance edges with an 81× inference speedup and no instance-level annotations. It improves unsupervised instance segmentation by +5.1 AP and, with only tag-level supervision, surpasses point-supervised panoptic segmentation by +1.7 PQ.
Universal Multi-Domain Translation via Diffusion Routers: This paper proposes the Diffusion Router (DR), which employs a single noise prediction network conditioned on source/target domain labels to handle all cross-domain mappings. It supports indirect translation via a center domain and direct non-center-domain translation based on a variational upper-bound objective combined with Tweedie refinement, achieving state-of-the-art performance on three large-scale UMDT benchmarks.
VINCIE: Unlocking In-context Image Editing from Video: VINCIE is the first framework to demonstrate that in-context image editing models can be learned entirely from native video data. By annotating videos as interleaved multimodal sequences and designing three proxy tasks (NIP/CSP/NSP), it achieves state-of-the-art performance on multi-turn editing benchmarks, improving the 5-turn editing success rate from less than 2% (baseline) to 25%.
VIRTUE: Visual-Interactive Text-Image Universal Embedder: This paper proposes VIRTUE, a visual-interactive universal embedder that integrates the segmentation model SAM2 with a VLM to support user-specified regions of interest via points, boxes, or masks, producing joint entity-level and global-level embeddings. The paper also introduces the million-scale SCaR benchmark to evaluate visual-interactive retrieval. VIRTUE achieves SOTA on 36 MMEB tasks (+3.1%–8.5%) and 5 SCaR tasks (+15.2%–20.3%).
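
The Jumbo note above describes a split, attend, reassemble pattern for a wide CLS token. The following is a minimal NumPy sketch of that data flow only, not the authors' implementation. The function names (`split_jumbo`, `merge_jumbo`, `jumbo_block`), the parameter-free single-head attention, the ReLU FFN, and the omission of layer norm and the patch FFN are all assumptions made for illustration.

```python
import numpy as np

def split_jumbo(cls_wide, J):
    """Split a wide CLS token (B, J*D) into J patch-width tokens (B, J, D)."""
    B, W = cls_wide.shape
    assert W % J == 0, "wide CLS width must be divisible by J"
    return cls_wide.reshape(B, J, W // J)

def merge_jumbo(tokens):
    """Reassemble J patch-width tokens (B, J, D) into one wide token (B, J*D)."""
    B, J, D = tokens.shape
    return tokens.reshape(B, J * D)

def self_attention(x):
    """Placeholder attention: single head, no learned projections (assumption)."""
    d = x.shape[-1]
    scores = x @ x.transpose(0, 2, 1) / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x

def jumbo_block(cls_wide, patches, J, W1, W2):
    """One simplified block: split CLS, joint attention, merge, wide FFN on CLS."""
    cls_tokens = split_jumbo(cls_wide, J)                 # (B, J, D)
    seq = np.concatenate([cls_tokens, patches], axis=1)   # (B, J + N, D)
    seq = seq + self_attention(seq)                       # residual attention
    cls_out = merge_jumbo(seq[:, :J])                     # (B, J*D)
    patch_out = seq[:, J:]                                # (B, N, D)
    # Dedicated wide FFN applied to the reassembled CLS token only.
    hidden = np.maximum(cls_out @ W1, 0.0)
    cls_out = cls_out + hidden @ W2
    return cls_out, patch_out

# Hypothetical sizes, chosen only to exercise the shapes.
rng = np.random.default_rng(0)
B, J, D, N = 2, 3, 4, 5
cls_wide = rng.normal(size=(B, J * D))
patches = rng.normal(size=(B, N, D))
W1 = rng.normal(size=(J * D, 2 * J * D))
W2 = rng.normal(size=(2 * J * D, J * D))
cls_out, patch_out = jumbo_block(cls_wide, patches, J, W1, W2)
print(cls_out.shape, patch_out.shape)
```

In this sketch the attention sequence grows by only \(J\) tokens, while the extra width is confined to the CLS path's FFN. This is one way to read the note's claim of negligible overhead. The real cost depends on details not given here.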