✂️ Segmentation

🧪 ICML2025 · 18 paper notes


🔥 Top topics: Segmentation ×6 · Recommendation ×2 · Few-/Zero-Shot Learning ×2 · Remote Sensing ×2

ActionPiece: Contextually Tokenizing Action Sequences for Generative Recommendation: Introduces ActionPiece, the first context-aware action sequence tokenizer that models user behavior sequences as "sequences of feature sets." By adopting a BPE-like merge strategy to discover high-frequency feature patterns both within sets and across adjacent sets, it allows the same action to be tokenized into different tokens depending on the context, significantly improving generative recommendation performance.
ActionPiece: Contextually Tokenizing Action Sequences for Generative Recommendation: This paper proposes ActionPiece, the first context-aware action sequence tokenization method. It represents each action as an unordered set of features and builds its vocabulary by learning merge rules, both within a set and across adjacent sets, from weighted co-occurrence statistics. The same action can therefore map to different tokens depending on its context, which significantly improves generative recommendation accuracy.
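The counting step behind such a BPE-like merge can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the feature strings, and the two weight parameters are hypothetical, and the paper's actual weighting scheme is not reproduced here (both weights default to 1).

```python
from collections import Counter
from itertools import combinations, product


def count_pairs(sequence, within_weight=1.0, across_weight=1.0):
    """Accumulate weighted co-occurrence counts for feature pairs.

    `sequence` is a list of actions; each action is a set of feature strings.
    Both weights are hypothetical placeholders, not the paper's values.
    """
    counts = Counter()
    for action in sequence:
        # Pairs of features inside one action (sets are unordered, so sort).
        for a, b in combinations(sorted(action), 2):
            counts[(a, b)] += within_weight
    for left, right in zip(sequence, sequence[1:]):
        # Pairs that span two adjacent actions (direction is ignored here).
        for a, b in product(left, right):
            counts[tuple(sorted((a, b)))] += across_weight
    return counts


def best_merge(corpus, **weights):
    """Return the feature pair with the highest total weighted count."""
    total = Counter()
    for sequence in corpus:
        total.update(count_pairs(sequence, **weights))
    # Iterating over sorted keys makes tie-breaking deterministic.
    return max(sorted(total), key=total.__getitem__)


corpus = [
    [{"brand:nike", "cat:shoes"}, {"brand:nike", "cat:socks"}],
    [{"brand:nike", "cat:shoes"}, {"brand:adidas", "cat:shoes"}],
]
print(best_merge(corpus))
```

In a full tokenizer the winning pair would be replaced by a new token and the counting repeated until the vocabulary reaches its target size, as in standard BPE.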
Adapter Naturally Serves as Decoupler for Cross-Domain Few-Shot Semantic Segmentation: This paper finds that adapters naturally decouple domain information, a property that stems from their architecture rather than from the loss. Building on this, the authors propose the Domain Feature Navigator (DFN) as a structural domain decoupler, paired with SAM-SVN to prevent overfitting to the source domain. The approach significantly outperforms state-of-the-art methods in cross-domain few-shot semantic segmentation (CD-FSS), reaching an average mIoU of 63.99% in the 1-shot setting and 69.77% in the 5-shot setting.
Alberta Wells Dataset: Pinpointing Oil and Gas Wells from Satellite Imagery: This paper introduces the first large-scale oil and gas well detection benchmark, the Alberta Wells Dataset (containing over 213k well locations and 188k+ satellite imagery patches). The localization of abandoned, suspended, and active oil and gas wells is formulated as binary segmentation and object detection tasks, and various CNN and Transformer baseline models are evaluated.
Balanced Learning for Domain Adaptive Semantic Segmentation: This paper proposes BLDA, which directly quantifies class bias by analyzing the logit distributions predicted by the network. It aligns the logit distributions of each class using a shared anchor distribution for post-processing calibration, while utilizing GMMs for online estimation and logit correction in self-training to generate unbiased pseudo-labels. This brings consistent improvements to various baseline methods on both GTA→CS and SYN→CS benchmarks.
ConText: Driving In-context Learning for Text Removal and Segmentation: This work applies the visual in-context learning (V-ICL) paradigm to OCR tasks for the first time. It proposes three key designs: task-chaining prompting, context-aware aggregation (CAA), and self-prompting (SP) strategies. ConText significantly outperforms existing general V-ICL models and task-specific models in text removal and segmentation tasks, achieving improvements of +4.50 PSNR and +3.34% fgIoU, respectively.
Dual form Complementary Masking for Domain-Adaptive Image Segmentation: Proposes the MaskTwins framework, which formulates masked reconstruction as a sparse signal reconstruction problem and proves that dual-form complementary masks are theoretically better at extracting domain-invariant features. Domain-adaptive segmentation is then achieved through consistency constraints between complementary masks in end-to-end training.
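To make "complementary masks" concrete, here is a minimal sketch that builds a patch-wise binary mask and its complement, so that every pixel is visible in exactly one of the two views. The helper names, the patch-wise sampling, and the mean-squared consistency term are assumptions for illustration; the note above does not specify how MaskTwins samples masks or defines its consistency loss.

```python
import random


def complementary_masks(height, width, patch, mask_ratio=0.5, seed=0):
    """Return a patch-wise binary mask M and its complement 1 - M."""
    rng = random.Random(seed)
    rows = -(-height // patch)  # ceiling division
    cols = -(-width // patch)
    # One keep/drop decision per patch.
    keep = [[1 if rng.random() >= mask_ratio else 0 for _ in range(cols)]
            for _ in range(rows)]
    mask = [[keep[y // patch][x // patch] for x in range(width)]
            for y in range(height)]
    twin = [[1 - v for v in row] for row in mask]
    return mask, twin


def apply_mask(image, mask):
    """Zero out the pixels where the mask is 0."""
    return [[p * m for p, m in zip(irow, mrow)]
            for irow, mrow in zip(image, mask)]


def consistency_loss(pred_a, pred_b):
    """Mean squared difference between two prediction maps (assumed form)."""
    diffs = [(a - b) ** 2
             for row_a, row_b in zip(pred_a, pred_b)
             for a, b in zip(row_a, row_b)]
    return sum(diffs) / len(diffs)


image = [[float(4 * y + x) for x in range(4)] for y in range(4)]
mask, twin = complementary_masks(4, 4, patch=2)
view_a, view_b = apply_mask(image, mask), apply_mask(image, twin)
```

In training, a segmentation network would be run on `view_a` and `view_b`, and a term like `consistency_loss` would penalize disagreement between the two predictions.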
Efficient and Robust Semantic Image Communication via Stable Cascade: Presents a semantic image communication framework built on the Stable Cascade architecture. It uses EfficientNet-V2 to extract highly compact image embeddings (just 0.29% of the original image size) that condition the latent diffusion model (LDM). With noise-robust fine-tuning, the system reconstructs images faithfully even over low-SNR channels, while achieving a 3-16x inference speedup.
FeatSharp: Your Vision Model Features, Sharper: This paper proposes FeatSharp, which coherently upsamples the feature maps of low-resolution vision encoders to high resolution at very low cost. It takes FeatUp's Joint Bilateral Upsampling (JBU) output and attentively fuses it with features computed from image tiles, recovering fine-grained details that are lost at the original resolution.
InfoSAM: Fine-Tuning the Segment Anything Model from An Information-Theoretic Perspective: This paper proposes InfoSAM, which approaches Parameter-Efficient Fine-Tuning (PEFT) of SAM from an information-theoretic perspective. It designs a relationship compression and distillation framework based on Rényi mutual information, improving fine-tuning performance by compressing pseudo-invariant information and preserving domain-invariant relationships.
IT³: Idempotent Test-Time Training: Proposes IT³, a general test-time training method based on idempotence. It adapts to out-of-distribution samples by minimizing the deviation between recursive network calls without requiring domain-specific auxiliary tasks, making it applicable to any task and architecture.
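The "deviation between recursive network calls" can be written as a simple loss. The sketch below assumes the model takes the input plus a previous prediction (with a neutral placeholder on the first call); that two-argument interface, the placeholder value, and the toy scalar models are illustrative assumptions, not details confirmed by the note above.

```python
def idempotence_loss(model, x, placeholder=0.0):
    """Squared gap between a prediction and the prediction obtained
    when that first prediction is fed back into the model."""
    y0 = model(x, placeholder)  # first call: no prior guess available
    y1 = model(x, y0)           # second call: conditioned on its own output
    return (y1 - y0) ** 2


def stable_model(x, y_prev):
    # Hypothetical toy model whose output does not move on the second call.
    return 2.0 * x


def drifting_model(x, y_prev):
    # Hypothetical toy model whose output shifts when fed its own guess.
    return 2.0 * x + 0.5 * y_prev


print(idempotence_loss(stable_model, 3.0))    # 0.0
print(idempotence_loss(drifting_model, 3.0))  # 9.0
```

At test time this quantity would be minimized with respect to the model parameters for each incoming sample; because it needs no labels and no auxiliary task, the same recipe carries over to different tasks and architectures.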
MorphTok: Morphologically Grounded Tokenization for Indian Languages: This paper proposes the MorphTok framework, which addresses the issue of dependent vowels in Indian languages through a morphology-aware pre-tokenization step (lookup table/language model) and a Constrained BPE (CBPE) algorithm. It improves downstream performance in machine translation and language modeling tasks, and introduces a human evaluation metric, EvalTok.
QMamba: On First Exploration of Vision Mamba for Image Quality Assessment: This work introduces Vision Mamba (State Space Model) into image quality assessment (IQA) for the first time, proposing the QMamba framework and the StylePrompt lightweight fine-tuning strategy, which outperform CNN and Transformer baselines on various synthetic/realistic/AIGC IQA tasks with lower computational costs.
Self-Disentanglement and Re-Composition for Cross-Domain Few-Shot Segmentation: This paper identifies a feature entanglement issue in distance-comparison-based methods for Cross-Domain Few-Shot Segmentation (CD-FSS), which stems from the equal-weighted cross-matching of ViT layer outputs during distance computation. Consequently, the authors propose to address this issue through Self-Disentanglement and Re-Composition by learning the comparison weights among ViT components.
Separating Knowledge and Perception with Procedural Data: This paper trains visual representation models solely on procedurally generated (non-real) images and injects real-world knowledge through a visual memory, a KNN retrieval database. The resulting models approach the performance of models trained on real data in classification and segmentation tasks, while keeping all real-world data fully controllable, which enables privacy protection and efficient forgetting.
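A toy version of such a visual memory shows why forgetting is cheap: real-world knowledge lives in a lookup table, so removing an item is a deletion rather than a retraining run. The class and method names, the cosine similarity, and the majority vote are assumptions for illustration; in the paper's setting the embeddings would come from the procedurally trained encoder.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


class VisualMemory:
    """Hypothetical KNN memory mapping item ids to (embedding, label)."""

    def __init__(self):
        self.items = {}

    def add(self, item_id, embedding, label):
        self.items[item_id] = (embedding, label)

    def forget(self, item_id):
        # Deleting the entry removes its influence; no retraining is needed.
        self.items.pop(item_id, None)

    def classify(self, query, k=3):
        ranked = sorted(self.items.values(),
                        key=lambda entry: cosine(query, entry[0]),
                        reverse=True)
        votes = Counter(label for _, label in ranked[:k])
        return votes.most_common(1)[0][0] if votes else None


memory = VisualMemory()
memory.add("img1", [1.0, 0.0], "cat")
memory.add("img2", [0.9, 0.1], "cat")
memory.add("img3", [0.0, 1.0], "dog")
print(memory.classify([1.0, 0.05], k=1))  # cat
memory.forget("img1")
memory.forget("img2")
print(memory.classify([1.0, 0.05], k=1))  # dog
```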
SpikeVideoFormer: An Efficient Spike-Driven Video Transformer with Hamming Attention and \(\mathcal{O}(T)\) Complexity: This paper proposes SpikeVideoFormer, the first spike-driven Transformer designed for video tasks. It utilizes Hamming attention instead of dot-product attention to accurately measure spike feature similarity, and combines joint space-time attention to maintain \(\mathcal{O}(T)\) linear time complexity. It achieves state-of-the-art (SOTA) performance for SNNs across three video tasks while being 5-16 times more energy-efficient than ANNs.
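A small example illustrates why a Hamming-based score can separate binary spike vectors that a dot product cannot. The normalization used here (fraction of matching positions) is an assumption for illustration and may differ from the paper's exact definition of Hamming attention.

```python
def dot_similarity(q, k):
    """Dot product: only counts positions where both vectors spike."""
    return sum(a * b for a, b in zip(q, k))


def hamming_similarity(q, k):
    """Fraction of positions where two binary spike vectors agree."""
    assert len(q) == len(k)
    return sum(1 for a, b in zip(q, k) if a == b) / len(q)


query = [1, 0, 0, 1]
same = [1, 0, 0, 1]      # identical to the query
all_on = [1, 1, 1, 1]    # spikes everywhere

# The dot product gives both keys the same score...
print(dot_similarity(query, same), dot_similarity(query, all_on))          # 2 2
# ...while the Hamming score prefers the identical key.
print(hamming_similarity(query, same), hamming_similarity(query, all_on))  # 1.0 0.5
```

The dot product ignores positions where both vectors are silent, so a key that fires everywhere scores as well as an exact match; counting agreements on both 0s and 1s removes that ambiguity.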
unMORE: Unsupervised Multi-Object Segmentation via Center-Boundary Reasoning: This paper proposes unMORE, which achieves unsupervised multi-object segmentation by learning a three-layer object-centric representation (existence, center field, and boundary distance field) and designing a network-free multi-object reasoning module, substantially outperforming all unsupervised methods on six datasets, including COCO.
Using Multiple Input Modalities Can Improve Data-Efficiency and O.O.D. Generalization for ML with Satellite Imagery: This work systematically studies the effect of fusing optical imagery with additional geographic data layers (DEM, land cover maps, temperature, wind speed, etc.) in ML tasks on satellite remote sensing data. Multimodal inputs significantly improve model performance, with the largest gains when labeled data is limited and under geographic out-of-distribution shift. Surprisingly, hard-coded fusion strategies outperform learned ones.