📹 ICCV2025 Accepted Papers

1299 ICCV2025 paper notes covering 3D Vision (267), Image Generation (213), Multimodal VLM (106), Autonomous Driving (91), Segmentation (73), Video Understanding (56), Model Compression (52), Video Generation (49), and 44 other areas. Each note has a TL;DR, motivation, method, experiments, highlights, and limitations: a 5-minute read of the core ideas.


🔒 LLM Safety (10)

Adversarial Robust Memory-Based Continual Learner

This paper identifies two compounding challenges when combining continual learning with adversarial training—accelerated forgetting and gradient confusion—and proposes two plug-and-play modules, Anti-Forgettable Logit Calibration (AFLC) and Robustness-Aware Experience Replay (RAER), achieving up to 8.13% improvement in adversarial robustness on Split-CIFAR10/100 and Split-Tiny-ImageNet.

Asynchronous Event Error-Minimizing Noise for Safeguarding Event Dataset

This paper proposes UEvs, the first unlearnable example generation method for asynchronous event data. It introduces Event Error-Minimizing Noise (E²MN) and an adaptive projection mechanism that prevent unauthorized models from learning from event datasets while preserving utility for legitimate use.

Cooperative Pseudo Labeling for Unsupervised Federated Classification

FedCoPL is the first work to extend unsupervised federated learning (UFL) to classification tasks. It addresses CLIP's inherent bias and label shift challenges via a cooperative pseudo labeling strategy (global assignment ensuring class balance) and a partial prompt aggregation protocol (aggregating only visual prompts while keeping text prompts local).

Enhancing Adversarial Transferability by Balancing Exploration and Exploitation with Gradient-Guided Sampling

This paper proposes Gradient-Guided Sampling (GGS), an inner-iteration sampling strategy that uses the gradient direction from the previous inner iteration to guide sampling. By striking a balance between Exploitation (attack strength / loss maxima) and Exploration (cross-model generalization / flat loss landscape), GGS significantly outperforms existing transfer attack methods across diverse architectures including CNNs, ViTs, and MLLMs.

Forgetting Through Transforming: Enabling Federated Unlearning via Class-Aware Representation Transformation

This paper proposes FUCRT, a federated unlearning method based on class-aware representation transformation. Rather than directly erasing the representations of forget classes, FUCRT transforms them toward the semantically nearest retain classes, and employs dual contrastive learning to align transformation consistency across clients. The method guarantees 100% unlearning on four datasets while maintaining or even improving performance on retain classes.

Geminio: Language-Guided Gradient Inversion Attacks in Federated Learning

This paper proposes Geminio, the first gradient inversion attack (GIA) leveraging vision-language models (VLMs) to enable natural language-guided targeted reconstruction. A malicious server can specify the type of data to steal via natural language queries, precisely locating and reconstructing semantically matching private samples from large-batch gradients, without disrupting normal FL model training.

LATTE: Collaborative Test-Time Adaptation of Vision-Language Models in Federated Learning

This paper proposes Latte, a framework that enables collaborative test-time adaptation of vision-language models (e.g., CLIP) in decentralized federated learning settings. Through a dual-memory mechanism combining local and external memory, Latte achieves cross-client knowledge sharing while preserving client-level personalization.

MUNBa: Machine Unlearning via Nash Bargaining

This work formulates Machine Unlearning (MU) as a two-player cooperative bargaining game and derives a closed-form solution via Nash bargaining theory. The solution simultaneously addresses gradient conflict and gradient dominance between the forgetting and retention objectives, achieving an optimal balance between unlearning and preservation across both classification and generation tasks.

SAUCE: Selective Concept Unlearning in Vision-Language Models with Sparse Autoencoders

SAUCE leverages sparse autoencoders (SAEs) to identify and selectively suppress features associated with target concepts in VLM intermediate representations, enabling fine-grained concept unlearning without weight updates. Evaluated across 60 concepts, it surpasses the previous SOTA in forgetting quality by 18%.

Temporal Unlearnable Examples: Preventing Personal Video Data from Unauthorized Exploitation

This paper presents the first study on preventing video data from being exploited by deep trackers without authorization. It proposes a DiT-based generative framework for producing Temporal Unlearnable Examples (TUE), employing a temporal contrastive loss to induce trackers to rely on perturbation noise for temporal matching rather than learning genuine data structure. The method achieves strong transferability across models, datasets, and tasks.


📊 LLM Evaluation (27)

3DSRBench: A Comprehensive 3D Spatial Reasoning Benchmark

This paper introduces 3DSRBench, the first comprehensive 3D spatial reasoning benchmark comprising 2,772 manually annotated VQA pairs across 12 question types. Through balanced data distribution and a novel FlipEval strategy, the benchmark enables robust evaluation. Results reveal that state-of-the-art LMMs—including GPT-4o and Gemini—fall far short of human performance on 3D spatial reasoning (≈52% vs. 95.7%), with substantial performance degradation under uncommon camera viewpoints.

A Conditional Probability Framework for Compositional Zero-shot Learning

This paper proposes a Conditional Probability Framework (CPF) that decomposes the compositional recognition probability into an object likelihood \(p(o|x)\) and a conditional attribute likelihood \(p(a|o,x)\). Two dedicated modules — Text-Enhanced Object learning (TEO) and Object-Guided Attribute learning (OGA) — explicitly model attribute-object dependencies, achieving state-of-the-art performance across three CZSL benchmarks.
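The decomposition described above is the chain rule of probability applied to the attribute-object pair. Written out, with the inference rule added as an assumption about how the two likelihoods would typically be combined (the summary does not state it):

```latex
\[
p(a, o \mid x) = p(o \mid x)\, p(a \mid o, x),
\qquad
(\hat{a}, \hat{o}) = \arg\max_{(a,\, o)} \; p(o \mid x)\, p(a \mid o, x)
\]
```

Here \(x\) is the image, \(o\) the object, and \(a\) the attribute, matching the symbols used in the summary.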

A Real-world Display Inverse Rendering Dataset

This paper presents the first real-world inverse rendering dataset built upon an LCD display-camera system, comprising stereo polarization images of 16 objects with diverse materials captured under OLAT illumination patterns alongside high-precision geometric ground truth. A simple yet effective display inverse rendering baseline is proposed, outperforming existing inverse rendering methods.

BATCLIP: Bimodal Online Test-Time Adaptation for CLIP

This paper proposes BATCLIP, a bimodal online test-time adaptation (TTA) method for CLIP that simultaneously adapts the LayerNorm parameters of both the visual and text encoders. By introducing a projection matching loss and an inter-class separability loss to enhance vision-text feature alignment and class discriminability, BATCLIP achieves state-of-the-art performance on CIFAR-10C, CIFAR-100C, and ImageNet-C.

Combinative Matching for Geometric Shape Assembly

This paper proposes Combinative Matching (CMNet), which jointly models two fundamental properties of interlocking parts — surface shape consistency and volumetric occupancy complementarity — via an equivariant network trained with three objectives: orientation alignment, shape matching, and occupancy matching, substantially reducing local ambiguity in geometric assembly.

Degradation-Modeled Multipath Diffusion for Tunable Metalens Photography

This paper proposes DMDiff, a framework that leverages the natural image priors of pretrained diffusion models. Through a positive/neutral/negative tripath multi-prompt diffusion strategy and a Spatially-Varying Degradation-Aware (SVDA) attention module, DMDiff achieves high-fidelity tunable image reconstruction for millimeter-scale metalens cameras, surpassing existing methods across multiple metrics.

Discontinuity-aware Normal Integration for Generic Central Camera Models

This paper proposes a novel normal integration method that supports explicit discontinuity modeling and generic central camera models. By establishing constraints between surface normals and ray directions under a local planarity assumption, the method achieves state-of-the-art performance on standard normal integration benchmarks and, for the first time, directly handles generic central cameras such as fisheye and panoramic cameras.

DisCoPatch: Taming Adversarially-driven Batch Statistics for Improved Out-of-Distribution Detection

This paper proposes DisCoPatch, a framework that exploits the inherent bias of BatchNorm toward batch statistics in adversarial VAEs to distinguish ID from OOD samples. At inference time, multiple patches from the same image are composed into a batch to ensure distributional consistency. The method achieves state-of-the-art performance on covariate-shift OOD detection (ImageNet-1K(-C) 95.5% AUROC) and near-OOD detection (95.0% AUROC), with a model size of only 25 MB and latency an order of magnitude lower than competing methods.

DISTA-Net: Dynamic Closely-Spaced Infrared Small Target Unmixing

DISTA-Net proposes a dynamic deep unfolding network that replaces the static nonlinear transform and threshold parameters in ISTA-based sparse reconstruction with input-adaptive counterparts, constituting the first deep learning method for closely-spaced infrared small target (CSIST) unmixing. The work also establishes the first open-source ecosystem encompassing a dataset, evaluation metrics, and a toolkit.

Few-Shot Pattern Detection via Template Matching and Regression

This paper proposes TMR, a method that combines classical template matching with support-conditioned bounding box regression to achieve few-shot detection of arbitrary patterns—including non-object-level patterns. The authors also introduce the RPINE dataset to cover a broader range of repetitive patterns. TMR surpasses existing FSCD methods on multiple benchmarks and demonstrates strong cross-dataset generalization.

Browse all 27 LLM Evaluation papers →


📚 Pretraining (9)

ACE-G: Improving Generalization of Scene Coordinate Regression Through Query Pre-Training

ACE-G decomposes a scene coordinate regressor into a scene-agnostic Transformer and a scene-specific map code, and achieves significant generalization gains under illumination and viewpoint variation by conducting alternating mapping/query pre-training across tens of thousands of scenes, while maintaining lightweight computational overhead.

ConstStyle: Robust Domain Generalization with Unified Style Transformation

This paper proposes ConstStyle, a framework that constructs a theoretically grounded Unified Domain to which all training samples are style-aligned during training, while test samples from unseen domains are partially projected toward this unified domain at inference time, effectively reducing the domain gap and improving generalization performance.

Dataset Ownership Verification for Pre-trained Masked Models

DOV4MM proposes the first dataset ownership verification method tailored for masked pre-trained models. By comparing the embedding reconstruction difficulty of seen versus unseen samples, and applying a paired t-test, the method determines whether a black-box model was pre-trained on a specific dataset. It achieves p-values well below 0.05 across 10 masked image models and 4 masked language models.
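As a rough illustration of the statistical step only, the sketch below computes a paired t-statistic from two lists of per-sample reconstruction-difficulty scores. The function name, the score lists, and the way pairs are formed are hypothetical; the summary does not specify how DOV4MM measures difficulty or pairs samples.

```python
import math
import statistics


def paired_t_statistic(seen_scores, unseen_scores):
    """Paired t-statistic for per-pair differences (seen - unseen).

    Both arguments are hypothetical lists of reconstruction-difficulty
    scores, one entry per pair. A strongly negative value suggests the
    "seen" samples are systematically easier to reconstruct.
    """
    if len(seen_scores) != len(unseen_scores):
        raise ValueError("score lists must have the same length")
    diffs = [s - u for s, u in zip(seen_scores, unseen_scores)]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two pairs")
    mean_diff = statistics.fmean(diffs)
    # Standard error of the mean difference (sample standard deviation).
    # Raises ZeroDivisionError if every difference is identical.
    std_err = statistics.stdev(diffs) / math.sqrt(n)
    return mean_diff / std_err


# Toy usage with made-up scores.
t_stat = paired_t_statistic([1.0, 2.0, 3.0], [2.0, 3.0, 5.0])
```

The p-value would then come from a Student's t distribution with n - 1 degrees of freedom, for example via a statistics library, and ownership would be claimed when it falls below the chosen significance level.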

ETA: Energy-based Test-time Adaptation for Depth Completion

This paper proposes ETA, a method that employs an energy-based model to quantify the likelihood of depth predictions belonging to the source domain distribution, and guides a pre-trained depth completion model to adapt to new environments at test time by minimizing the energy of target-domain predictions. ETA achieves average improvements of 6.94% and 10.23% over the previous state of the art on outdoor and indoor scenes, respectively.

FlowMo: Flow to the Mode — Mode-Seeking Diffusion Autoencoders for State-of-the-Art Image Tokenization

This paper proposes FlowMo, a Transformer-based diffusion autoencoder trained in two stages (mode-matching pretraining + mode-seeking post-training), achieving state-of-the-art performance on ImageNet-1K discrete image tokenization for the first time among diffusion autoencoders — without convolutions, adversarial losses, 2D spatially-aligned latents, or distillation from other tokenizers.

Image Intrinsic Scale Assessment: Bridging the Gap Between Quality and Resolution

This paper introduces Image Intrinsic Scale (IIS), the maximum scaling factor at which an image exhibits its highest perceptual quality, and proposes the IISA task. It constructs a dataset of 785 images with expert annotations and presents a weak-label training strategy (WIISA) that consistently improves IIS prediction across multiple NR-IQA methods.

Make Your Training Flexible: Towards Deployment-Efficient Video Models

This paper proposes Flux — a data augmentation tool that enables flexible video model training through flexible sampling grids and group-dynamic token selection, allowing a single model to operate efficiently across varying computational budgets. The paper further introduces a Token Optimization test-time paradigm that matches previous SOTA performance using only 1/4 of the tokens, saving approximately 90% of computation.

Synchronization of Multiple Videos

This paper proposes Temporal Prototype Learning (TPL), a prototype-based video synchronization framework that constructs shared compact 1D representations from high-dimensional embeddings extracted by pretrained models. By learning a unified prototype sequence to anchor key action phases, TPL aligns multiple videos jointly and, for the first time, addresses the synchronization of generative AI videos.

SynCity: Training-Free Generation of 3D Worlds

SynCity proposes a training- and optimization-free method for 3D world generation. Through carefully designed prompt engineering strategies, it combines a pretrained language model, a 2D image generator (Flux), and a 3D generator (TRELLIS) to autoregressively synthesize large-scale, high-quality, freely navigable 3D scenes in a tile-by-tile fashion.


💬 LLM (Other) (6)

Any-SSR: How Recursive Least Squares Works in Continual Learning of Large Language Models

This paper proposes the Analytic Subspace Routing (Any-SSR) framework, which eliminates inter-task interference by assigning each task an independent LoRA subspace, and trains a zero-forgetting analytic router via a recursive least squares (RLS) closed-form solution, enabling replay-free continual learning for LLMs.

Any-SSR: How Recursive Least Squares Works in Continual Learning of Large Language Models

This paper proposes Analytic Subspace Routing (Any-SSR), which assigns an independent LoRA subspace to each new task to eliminate knowledge interference, while employing an analytic router based on a recursive least squares (RLS) closed-form solution to dynamically select subspaces. The approach provides theoretical guarantees against forgetting prior task knowledge, enabling replay-free continual learning for LLMs.
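To make the recursive least squares idea concrete, here is a minimal sketch of a generic ridge-regularized RLS update for a linear router. It is the textbook block-RLS recursion, not the authors' implementation; the class name `RLSRouter`, the feature dimension, and the regularizer `gamma` are assumptions made for illustration.

```python
import numpy as np


class RLSRouter:
    """Linear router trained by block recursive least squares (sketch)."""

    def __init__(self, dim, n_tasks, gamma=1.0):
        self.W = np.zeros((dim, n_tasks))   # router weights
        self.R = np.eye(dim) / gamma        # inverse of (gamma * I + sum X^T X)

    def update(self, X, Y):
        """Fold in a new batch: X is (n, dim) features, Y is (n, n_tasks) one-hot targets."""
        n = X.shape[0]
        # Gain matrix from the Woodbury identity.
        K = self.R @ X.T @ np.linalg.inv(np.eye(n) + X @ self.R @ X.T)
        self.W = self.W + K @ (Y - X @ self.W)
        self.R = self.R - K @ X @ self.R

    def route(self, X):
        """Pick the task (and hence the LoRA subspace) with the highest score."""
        return np.argmax(X @ self.W, axis=1)
```

Because each update reproduces the ridge-regression solution over all data seen so far, the router trained task by task matches one trained jointly, which is the sense in which such an analytic update does not forget earlier tasks and needs no replay.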

FW-Merging: Scaling Model Merging with Frank-Wolfe Optimization

This paper formalizes model merging as a constrained optimization problem and introduces FW-Merging, a method inspired by Frank-Wolfe optimization that iteratively selects the most relevant models and performs local merging. The approach achieves scalable and robust merging over large black-box model pools, surpassing the data-aware method AdaMerging by 8.39% when merging 20 ViT models.

ShadowHack: Hacking Shadows via Luminance-Color Divide and Conquer

This paper proposes the ShadowHack framework, which decomposes shadow removal into two subtasks: luminance restoration and color reconstruction. LRNet, equipped with Rectified Outreach Attention (ROA), first recovers luminance and texture; CRNet then uses cross-attention to reconstruct accurate color. The method achieves state-of-the-art performance on the ISTD+ and SRD datasets.

VA-GPT: Aligning Effective Tokens with Video Anomaly in Large Language Models

This paper proposes VA-GPT, a multimodal large language model for video anomaly event understanding. Through two modules—Spatial Effective Token Selection (SETS) and Temporal Effective Token Generation (TETG)—VA-GPT enables MLLMs to precisely align anomaly-relevant information in both spatial and temporal dimensions, achieving state-of-the-art performance on both in-domain and cross-domain anomaly detection benchmarks.

VIM: Versatile Interactive Motion-Language Model

This paper proposes VIM, the first multimodal large language model capable of simultaneously understanding and generating dyadic interactive motion and text within a unified framework. Accompanied by the Inter-MT² dataset containing 82.7K multi-turn interactive motion instruction samples, VIM supports a diverse set of tasks including text-to-motion, motion-to-text, reaction generation, motion editing, and motion reasoning.


🎨 Image Generation (213)

A0: An Affordance-Aware Hierarchical Model for General Robotic Manipulation

This paper proposes A₀, an affordance-aware hierarchical diffusion model that decomposes manipulation tasks into high-level spatial affordance understanding (predicting contact points and trajectories) and low-level action execution. Pretrained on 1M contact-point annotations and fine-tuned with minimal task-specific data, A₀ deploys across the Franka, Kinova, Realman, and Dobot platforms, reaching a 45% success rate on complex trajectory tasks such as whiteboard wiping.

A0: An Affordance-Aware Hierarchical Model for General Robotic Manipulation

This paper proposes A0, a hierarchical affordance-aware diffusion model that decomposes manipulation tasks into high-level spatial understanding and low-level action execution by predicting object-centric contact points and post-contact trajectories via an Embodiment-Agnostic Affordance Representation. Pre-trained on 1 million contact-point annotations, A0 generalizes across four robot platforms: Franka, Kinova, Realman, and Dobot.

A Unified Framework for Motion Reasoning and Generation in Human Interaction

This paper proposes MoLaM, a unified interactive motion-language model that, through a three-stage training strategy and a newly constructed Inter-MT² dataset (82.7K multi-turn instructions), is the first to simultaneously achieve understanding, generation, editing, and reasoning of dyadic interaction motion within a single framework.

Accelerating Diffusion Sampling via Exploiting Local Transition Coherence

This paper proposes LTC-Accel, a training-free diffusion sampling acceleration method based on the phenomenon of Local Transition Coherence (LTC). By exploiting the strong correlation between transition operators of adjacent denoising steps, the method approximates the current step's computation using the previous step's transition operator. It achieves a 1.67× speedup on Stable Diffusion v2 and, combined with distilled models, reaches 10× acceleration in video generation.

Adaptive Routing of Text-to-Image Generation Requests Between Large Cloud Models and Small Edge Models

This paper proposes RouteT2I, the first edge-cloud model routing framework for text-to-image generation. It maximizes image generation quality under cost constraints through multi-dimensional quality metrics, Pareto Relative Superiority, and a dual-gated token selection MoE architecture.

Adaptive Routing of Text-to-Image Generation Requests Between Large Cloud Model and Light-Weight Edge Model

This paper proposes RouteT2I, a framework that dynamically routes text-to-image generation requests to either a lightweight edge model or a large cloud model via multi-dimensional quality assessment metrics and a dual-gate token-selection MoE routing model, achieving 83.97% of the quality gain attainable by exclusively using the cloud model at a 50% routing rate.

Addressing Text Embedding Leakage in Diffusion-Based Image Editing

This work identifies the root cause of attribute leakage in text-driven diffusion-based image editing — semantic entanglement in EOS embeddings of autoregressive text encoders — and proposes the ALE framework (ORE + RGB-CAM + BB) to comprehensively eliminate attribute leakage through embedding disentanglement, attention masking, and background blending.

ADIEE: Automatic Dataset Creation and Scorer for Instruction-Guided Image Editing Evaluation

This paper proposes ADIEE, an automated pipeline for constructing training datasets for instruction-guided image editing evaluation. A LLaVA-NeXT-8B model is fine-tuned on over 100K samples as a scorer, surpassing all open-source VLMs and Gemini-Pro 1.5 on multiple benchmarks. The trained scorer can further serve as a reward model to improve image editing models.

ADIEE: Automatic Dataset Creation and Scorer for Instruction-Guided Image Editing Evaluation

This paper proposes ADIEE, an automated pipeline for constructing a training dataset of over 100,000 samples for image editing evaluation. It fine-tunes LLaVA-NeXT-8B as an editing quality scorer and surpasses open-source VLMs and Gemini-Pro 1.5 on multiple benchmarks. The resulting scorer can also serve as a reward model to improve editing model performance.

Aether: Geometric-Aware Unified World Modeling

Aether proposes a geometric-aware unified world modeling framework that jointly trains reconstruction, prediction, and planning capabilities on synthetic 4D data, built upon post-training of CogVideoX to achieve zero-shot generalization to real-world scenes.

Browse all 213 Image Generation papers →


🎬 Video Generation (49)

Adversarial Distribution Matching for Diffusion Distillation Towards Efficient Image and Video Synthesis

This paper proposes the Adversarial Distribution Matching (ADM) framework, which aligns the latent predictions of real and fake score estimators adversarially via a diffusion-based discriminator, replacing the predefined KL divergence in DMD. Combined with Adversarial Distillation Pretraining (ADP), the proposed DMDX pipeline achieves one-step generation on SDXL surpassing DMD2, and sets new multi-step distillation benchmarks on SD3 and CogVideoX.

Adversarial Distribution Matching for Diffusion Distillation Towards Efficient Image and Video Synthesis

This paper proposes an Adversarial Distribution Matching (ADM) framework that replaces the predefined KL divergence in DMD with an implicit, data-driven measure of distributional discrepancy. A diffusion-model-based discriminator aligns the latent predictions of real and fake score estimators along the PF-ODE. Combined with Adversarial Distillation Pre-training (ADP), the resulting DMDX pipeline surpasses DMD2 on one-step SDXL generation and extends naturally to SD3 and CogVideoX video synthesis.

Aligning Moments in Time using Video Queries

This paper proposes MATR (Moment Alignment TRansformer), which conditions target video representations on query video features via dual-stage sequence alignment (soft-DTW), enabling video-to-video moment retrieval (Vid2VidMR). A self-supervised pretraining strategy is designed accordingly, achieving +13.1% R@1 and +8.1% mIoU on ActivityNet-VRL.

BadVideo: Stealthy Backdoor Attack against Text-to-Video Generation

BadVideo is the first backdoor attack framework targeting text-to-video (T2V) generation models. It exploits inherent static and dynamic redundancy in video (e.g., unspecified background elements, motion trajectories) through two strategies—spatio-temporal composition and dynamic element transition—to covertly embed malicious content. The framework achieves up to 93.5% human-evaluated attack success rate on LaVie and Open-Sora while effectively evading existing content moderation systems.

Causal-Entity Reflected Egocentric Traffic Accident Video Synthesis

This paper proposes Causal-VidSyn, a diffusion model that achieves causal entity localization via an Accident-Reason Answering (ArA) module and a gaze-conditioned visual token selection mechanism. The authors also construct the Drive-Gaze dataset comprising 1.54 million frames of gaze data. The method outperforms state-of-the-art approaches across three tasks: accident video editing, normal-to-accident video diffusion, and text-to-video generation.

D3: Training-Free AI-Generated Video Detection Using Second-Order Features

Drawing from second-order control systems in Newtonian mechanics, this paper identifies a fundamental distinction between real and AI-generated videos in their second-order temporal features ("acceleration"): real videos exhibit high fluctuation while generated videos remain flat. Based on this insight, the authors propose D3, a fully training-free AI-generated video detection method that classifies videos solely by computing the standard deviation of second-order differences of inter-frame features, achieving state-of-the-art performance across 40 test subsets.
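The scoring idea described above can be sketched in a few lines. This is a simplified reading of the summary, not the authors' code: the feature extractor is omitted, the L2 distance between consecutive frame features is an assumed choice of first-order measure, and `threshold` is a hypothetical parameter.

```python
import numpy as np


def second_order_std(frame_features):
    """Standard deviation of second-order differences of inter-frame features.

    frame_features: (T, D) array, one feature vector per frame (T >= 3).
    """
    feats = np.asarray(frame_features, dtype=float)
    if feats.ndim != 2 or feats.shape[0] < 3:
        raise ValueError("need a (T, D) array with at least 3 frames")
    # First order ("velocity"): distance between consecutive frame features.
    velocity = np.linalg.norm(np.diff(feats, axis=0), axis=1)
    # Second order ("acceleration"): change in that distance over time.
    acceleration = np.diff(velocity)
    return float(np.std(acceleration))


def looks_real(frame_features, threshold):
    """Real videos are expected to fluctuate more than generated ones."""
    return second_order_std(frame_features) > threshold
```

A clip whose features drift at a perfectly constant rate scores zero, while irregular motion yields a larger score, which is the contrast the method relies on.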

DACoN: DINO for Anime Paint Bucket Colorization with Any Number of Reference Images

This paper proposes DACoN, which fuses semantic features from the DINOv2 foundation model with high-resolution spatial features from a U-Net to enable automatic anime line art colorization with an arbitrary number of reference images, surpassing existing methods on both key-frame and sequential-frame colorization tasks.

Decouple and Track: Benchmarking and Improving Video Diffusion Transformers for Motion Transfer

To address the difficulty of decoupling motion from appearance in DiT models with 3D full-attention, this paper proposes Shared Temporal Kernels and a Dense Point Tracking Loss, along with MTBench, a comprehensive motion transfer benchmark, and a hybrid motion fidelity metric.

DH-FaceVid-1K: A Large-Scale High-Quality Dataset for Face Video Generation

This paper introduces DH-FaceVid-1K, a large-scale high-quality face video dataset comprising 1,200+ hours, 270,043 video clips, and 20,000+ unique identities. It specifically addresses the severe underrepresentation of Asian faces in existing datasets and empirically validates scaling laws with respect to data volume and model parameter count through systematic experiments.

Disentangled World Models: Learning to Transfer Semantic Knowledge from Distracting Videos for Reinforcement Learning

This paper proposes DisWM, a framework that pre-trains disentangled representations from "distracting videos" offline, then transfers semantic knowledge to downstream world models via offline-to-online latent space distillation, improving sample efficiency and robustness of visual reinforcement learning under environmental variations.

Browse all 49 Video Generation papers →


🧩 Multimodal VLM (106)

A Quality-Guided Mixture of Score-Fusion Experts Framework for Human Recognition

This paper proposes a Quality-guided Mixture of score-fusion Experts (QME) framework that employs a quality-guided MoE strategy to perform learnable fusion of similarity scores from heterogeneous biometric modalities (face, gait, body). Combined with a pseudo-quality loss and a score triplet loss, QME achieves state-of-the-art performance on multiple whole-body biometric recognition benchmarks.

Acknowledging Focus Ambiguity in Visual Questions

This work is the first to formally define and systematically investigate focus ambiguity in visual question answering — the phenomenon arising when a linguistic expression in a question may plausibly refer to multiple regions in an image, a type of ambiguity entirely overlooked by existing VQA systems. The authors construct the VQ-FocusAmbiguity dataset (5,500 samples with 12,880 instance segmentation annotations) and demonstrate that modern models perform poorly at both recognizing and localizing focus ambiguity.

Adaptive Prompt Learning via Gaussian Outlier Synthesis for Out-of-Distribution Detection

This paper proposes the APLGOS framework, which initializes learnable in-distribution (ID) prompts using ChatGPT-standardized Q&A pairs, synthesizes virtual OOD prompts and images by sampling from the low-likelihood regions of class-conditional Gaussian distributions, and aligns text-image embeddings via contrastive learning to achieve more compact ID/OOD decision boundaries.

Adaptive Prompt Learning via Gaussian Outlier Synthesis for Out-of-distribution Detection

This paper proposes APLGOS, a framework that leverages prompt learning in vision-language models to synthesize virtual OOD prompts and images by sampling from low-probability regions of class-conditional Gaussian distributions, thereby enforcing more compact decision boundaries between in-distribution (ID) and out-of-distribution (OOD) categories. The method achieves state-of-the-art performance on four mainstream benchmarks.

Advancing Textual Prompt Learning with Anchored Attributes

This paper proposes ATPrompt, which embeds general-purpose attribute tokens (e.g., color, shape) into textual prompts, extending the learning space of soft prompts from a one-dimensional class level to a multi-dimensional attribute level. ATPrompt serves as a plug-and-play module that integrates seamlessly into existing textual prompt learning methods, consistently improving baseline performance across 11 datasets.

AdvDreamer Unveils: Are Vision-Language Models Truly Ready for Real-World 3D Variations?

This paper proposes AdvDreamer, a framework that generates physically reproducible adversarial 3D transformation (Adv-3DT) samples from single images via zero-shot monocular pose manipulation, a naturalness reward model, and an inverse semantic probability loss. The framework reveals that current VLMs—including GPT-4o—suffer performance drops of 50–80% under 3D variations, and establishes MM3DTBench, the first VQA benchmark for evaluating VLM robustness to 3D variations.

AIGI-Holmes: Towards Explainable and Generalizable AI-Generated Image Detection via Multimodal Large Language Models

This paper proposes AIGI-Holmes, which adapts MLLMs into a "Holmes"-style detector capable of both accurately identifying AI-generated images and providing human-verifiable explanations. This is achieved by constructing the Holmes-Set dataset with explanatory annotations and a carefully designed three-stage training pipeline (visual expert pre-training → SFT → DPO). At inference time, a collaborative decoding strategy further enhances generalization.

AIGI-Holmes: Towards Explainable and Generalizable AI-Generated Image Detection via Multimodal Large Language Models

This paper proposes AIGI-Holmes, which achieves explainable and generalizable AI-generated image detection through the construction of Holmes-Set — an annotated dataset with interpretive labels — a three-stage training pipeline (visual expert pre-training → SFT → DPO), and a collaborative decoding strategy. The method attains state-of-the-art detection accuracy on three benchmarks while providing human-verifiable explanations.

Aligning Information Capacity Between Vision and Language via Dense-to-Sparse Feature Distillation for Image-Text Matching

This paper proposes D2S-VSE, a two-stage training framework that addresses the information density asymmetry in image-text matching. In the first stage, the model is pre-trained on LLaVA-generated dense captions to enhance information capacity; in the second stage, dense text embeddings are distilled into sparse text embeddings. The method achieves state-of-the-art performance on MS-COCO and Flickr30K.

Analyzing Finetuning Representation Shift for Multimodal LLMs Steering

This paper presents a training-free framework that reveals representation shifts in multimodal large language models (MLLMs) during finetuning through concept-level analysis, and leverages shift vectors for lightweight steering of model behavior (debiasing, safety control).

Browse all 106 Multimodal VLM papers →


🧠 VLM Reasoning (13)

Boosting MLLM Reasoning with Text-Debiased Hint-GRPO

This paper identifies two critical issues in applying GRPO to MLLM reasoning — low data utilization (invalid gradients when all sampled outputs for a hard question are incorrect) and text bias (the model ignores visual input and relies solely on textual reasoning) — and proposes two corresponding solutions: Hint-GRPO (adaptively providing reasoning hints) and text-debiasing calibration (enhancing image conditioning at test time). The approach achieves significant reasoning improvements across 11 datasets on 3 base MLLMs.

ChartPoint: Guiding MLLMs with Grounding Reflection for Chart Reasoning

This paper proposes PointCoT, which integrates reflective visual grounding (bounding boxes) into the chain-of-thought for chart reasoning, enabling MLLMs to interactively verify each reasoning step against the chart's visual content. It also constructs the ChartPoint-SFT-62k dataset containing 19.2K high-quality samples, achieving a +5.04% improvement on ChartBench.

DWIM: Towards Tool-aware Visual Reasoning via Discrepancy-aware Workflow Generation & Instruct-Masking Tuning

This paper proposes the DWIM framework, which employs a discrepancy-aware workflow generation strategy to curate high-quality training data and an instruct-masking fine-tuning strategy to clone only effective actions, endowing LLMs with tool-aware capability for compositional visual reasoning and achieving state-of-the-art results on multiple VR benchmarks.

FinMMR: Make Financial Numerical Reasoning More Multimodal, Comprehensive, and Challenging

This paper proposes FinMMR, a bilingual (Chinese–English) multimodal financial numerical reasoning benchmark containing 4,300 questions and 8,700 images spanning 14 financial sub-domains, requiring models to perform multi-step precise numerical computation. Evaluation of 15 state-of-the-art MLLMs shows that the best model achieves only 53% accuracy on the Hard subset, exposing fundamental bottlenecks in current MLLMs for professional-domain multimodal reasoning.

From Easy to Hard: The MIR Benchmark for Progressive Interleaved Multi-Image Reasoning

This paper proposes the MIR benchmark, comprising 22,257 multi-image interleaved reasoning QA pairs with five-stage reasoning steps, and introduces a progressive curriculum learning strategy that trains MLLMs from easy to hard samples to improve multi-image interleaved reasoning capability.

LLaVA-CoT: Let Vision Language Models Reason Step-by-Step

LLaVA-CoT proposes a method enabling vision-language models to perform autonomous multi-stage structured reasoning. By constructing the LLaVA-CoT-100k structured reasoning annotation dataset, the model is trained to sequentially execute four stages—Summary, Caption, Reasoning, and Conclusion—and a Stage-Wise Retracing Search (SWIRES) is proposed for test-time scaling, allowing an 11B model to surpass Gemini-1.5-pro and GPT-4o-mini.

MMAT-1M: A Large Reasoning Dataset for Multimodal Agent Tuning

This paper introduces MMAT-1M, the first million-scale multimodal agent tuning dataset, constructed via a four-stage data engine (Foundation → Rationale → Reflection → Integration). It endows MLLMs with CoT reasoning, tool invocation, and self-reflection capabilities, achieving an average improvement of 2.7% on InternVL2.5-8B and 8.8% on RAG tasks.

Perspective-Aware Reasoning in Vision-Language Models via Mental Imagery Simulation

This paper proposes the Abstract Perspective Change (APC) framework, which leverages visual foundation models to construct an abstract scene representation and perform perspective transformations, enabling VLMs to reason spatially from arbitrary viewpoints. APC substantially outperforms existing VLMs and fine-tuned models on both synthetic and real-image benchmarks.

Physics Context Builders: A Modular Framework for Physical Reasoning in Vision-Language Models

This paper proposes Physics Context Builders (PCBs), a modular framework that fine-tunes small specialized VLMs on simulation data to generate detailed physical scene descriptions, which serve as physical context to augment the physical reasoning capabilities of large foundation VLMs (e.g., GPT-4o), without modifying the large model itself.

R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization

This paper proposes StepGRPO, an online reinforcement learning framework that introduces two rule-based step-wise reasoning rewards — StepRAR (Step-wise Reasoning Accuracy Reward) and StepRVR (Step-wise Reasoning Validity Reward) — without requiring a process reward model. The framework addresses the sparse reward problem in RL-based MLLM training, enabling models to autonomously explore and improve their reasoning capabilities.

Browse all 13 VLM Reasoning papers →


⚡ VLM Efficiency (11)

AirCache: Activating Inter-Modal Relevancy KV Cache Compression for Efficient Large Vision-Language Model Inference

This paper proposes AirCache, a KV cache compression method for LVLMs built on two components: an Elite Observation Window, which uses text self-attention to select critical text tokens for evaluating visual token importance, and adaptive layer-wise budget allocation based on the intensity and skewness of importance score distributions. With only 10% of the visual KV cache retained, performance degradation remains within 1%, while decoding latency is reduced by 29%–66%.

Dynamic-VLM: Simple Dynamic Visual Token Compression for VideoLLM

This paper proposes Dynamic-VLM, which employs a dynamic visual token compressor to flexibly adjust the number of tokens per frame according to video length. Combined with a 2-million-scale high-quality synthetic video QA dataset, the method achieves a 2.7% improvement over LLaVA-OneVision on VideoMME and a 10.7% improvement on MuirBench.

Feather the Throttle: Revisiting Visual Token Pruning for Vision-Language Model Acceleration

This paper identifies a systematic positional bias in early visual token pruning for VLMs: because of RoPE, the pruning criterion tends to retain tokens from the bottom of the image. It proposes FEATHER, which addresses this issue via RoPE-free attention, uniform sampling, and multi-stage pruning, achieving over 5× performance improvement on visual grounding tasks.

FOLDER: Accelerating Multi-modal Large Language Models with Enhanced Performance

This paper systematically analyzes three key factors of information loss in visual token reduction (reduction impact, propagation effect, and aggregation method) and proposes FOLDER, a plug-and-play visual token compression module that performs aggressive token merging in the last few layers of the visual encoder. It achieves up to 70% token reduction while maintaining or even improving model performance.

Growing a Twig to Accelerate Large Vision-Language Models

This paper proposes TwigVLM, which attaches a lightweight twig module to the early layers of a VLM to simultaneously enable twig-guided visual token pruning (TTP, for prefilling acceleration) and self-speculative decoding (SSD, for decoding acceleration). On LLaVA-1.5-7B, TwigVLM retains 96% accuracy after pruning 88.9% of visual tokens and achieves a 154% speedup in long-answer generation, substantially outperforming existing methods in both accuracy and speed.

LLaVA-PruMerge: Adaptive Token Reduction for Efficient Large Multimodal Models

By exploiting the sparsity of attention scores between the CLS token and spatial tokens in the visual encoder, this work adaptively prunes and merges visual tokens, maintaining comparable LMM performance while retaining only 5.5% of visual tokens.
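
One way to turn sparse CLS attention into an adaptive token budget is an interquartile-range (IQR) outlier test, sketched below. This is an illustrative simplification: the function name `select_tokens_by_cls_attention`, the `k=1.5` multiplier, and the toy scores are assumptions, and the merging of pruned tokens into the kept ones is omitted.

```python
from statistics import quantiles


def select_tokens_by_cls_attention(cls_attention, k=1.5):
    """Return indices of spatial tokens whose CLS attention is an IQR outlier."""
    q1, _, q3 = quantiles(cls_attention, n=4)
    threshold = q3 + k * (q3 - q1)
    return [i for i, a in enumerate(cls_attention) if a > threshold]


# Toy, unnormalized scores: 18 low-attention tokens and 2 salient ones.
attention = [0.010 + 0.001 * i for i in range(18)] + [0.40, 0.42]
kept = select_tokens_by_cls_attention(attention)
```

Because the threshold is derived from the score distribution itself, the number of kept tokens varies per image rather than being fixed in advance.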

MaTVLM: Hybrid Mamba-Transformer for Efficient Vision-Language Modeling

This paper proposes MaTVLM, which replaces a portion of Transformer layers in a pretrained VLM with Mamba-2 layers and trains the resulting model via single-stage knowledge distillation, achieving 3.6× inference speedup and 27.5% memory reduction while maintaining competitive performance.

METEOR: Multi-Encoder Collaborative Token Pruning for Efficient Vision Language Models

METEOR proposes the first three-stage progressive token pruning framework for multi-encoder MLLMs: at the encoding stage, feature rank is used to allocate sparsity ratios across encoders; at the fusion stage, collaborative pruning eliminates cross-encoder redundancy; at the decoding stage, pruning ratios are adaptively adjusted based on text prompts. The framework reduces visual tokens by 76% with only a 0.3% performance drop.

ShortV: Efficient Multimodal Large Language Models by Freezing Visual Tokens in Ineffective Layers

This work identifies significant layer-level redundancy in MLLMs—most layers contribute minimally to the transformation of visual tokens—and proposes ShortV: freezing visual tokens (skipping their attention and FFN computations) in approximately 60% of layers. On LLaVA-NeXT-13B, this achieves a 50% reduction in FLOPs with negligible performance degradation. The method is training-free and orthogonal to token pruning approaches, allowing them to be combined.

Browse all 11 VLM Efficiency papers →


🎵 Audio & Speech (11)

2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining

This work collects 2.5 years (22,000 hours) of instructional videos from YouTube and constructs a high-quality interleaved image-text "multimodal textbook" corpus (6.5M keyframes + 0.75B text tokens) via an LLM-driven multi-level extraction and filtering pipeline. The resulting dataset significantly improves VLM pretraining on knowledge-intensive and reasoning tasks, yielding substantial gains on ScienceQA and MathVista in particular.

Align Your Rhythm: Generating Highly Aligned Dance Poses with Gating-Enhanced Rhythm-Aware Feature Representation

This paper proposes Danceba, a framework comprising three core modules — Phase-based Rhythm Extraction (PRE), Temporal Gated Causal Attention (TGCA), and Parallel Mamba Motion Modeling (PMMM) — to achieve music-driven dance generation with high rhythm alignment and diversity, attaining a 48.68% improvement in FIDk and a 12% improvement in BAS on the AIST++ dataset.

Everything is a Video: Unifying Modalities through Next-Frame Prediction

This paper reformulates multimodal learning tasks involving text, images, audio, and video as a unified next-frame prediction problem—rendering all inputs and outputs as sequences of 64×64 video frames—and demonstrates that a single Transformer model without any modality-specific encoders can handle cross-modal tasks, validating the radical yet feasible "everything is a video" unified representation paradigm.

How Would It Sound? Material-Controlled Multimodal Acoustic Profile Generation for Objects

This paper proposes a material-controlled acoustic profile generation task (M-CAPA): given audio-visual observations of an indoor scene and a user-defined target material configuration, the model generates a target room impulse response (RIR) that reflects the material changes. A companion dataset, Acoustic Wonderland, is also introduced.

Latent Swap Joint Diffusion for 2D Long-Form Latent Generation

This paper proposes SaFa (Swap Forward), a modality-agnostic and efficient method that replaces the averaging operation in conventional joint diffusion with two latent swap operators—Self-Loop Latent Swap and Reference-Guided Latent Swap—to address spectrum aliasing and preserve cross-view consistency, achieving significant improvements over existing methods in both long audio and panoramic image generation.

Learning to See Inside Opaque Liquid Containers using Speckle Vibrometry

This paper proposes a non-contact system based on laser speckle vibrometry that simultaneously senses micro-vibrations on the surfaces of multiple opaque containers via a 2D grid, then employs a Vibration Transformer to infer container type and hidden liquid fill level from vibration spectra — establishing "seeing inside opaque containers" as a novel computer vision task.

Lyra: An Efficient and Speech-Centric Framework for Omni-Cognition

This paper proposes Lyra, a speech-centric omni-modal MLLM framework consisting of three core components — a DTW-based cross-modality regularizer, multi-modality LoRA, and a latent multi-modality extractor — along with the first 12K long-speech SFT dataset. Using only 2.7M training samples and modest compute, Lyra achieves state-of-the-art performance simultaneously on vision-language, vision-speech, and speech-language benchmarks, while supporting speech inputs of up to 2 hours in length.

MUG: Pseudo Labeling Augmented Audio-Visual Mamba Network for Audio-Visual Video Parsing

This paper proposes the MUG framework, which simultaneously improves segment-level and event-level prediction in weakly supervised audio-visual video parsing (AVVP) through a pseudo label-augmented cross-modal random combination data augmentation strategy and an audio-visual Mamba network.

Understanding Co-speech Gestures in-the-wild

This paper proposes JEGAL — a joint gesture-audio-language tri-modal embedding space that learns co-speech gesture representations under weak supervision via a global phrase-level contrastive loss and a local gesture-word coupling loss. Three new gesture understanding tasks and benchmarks are introduced, and the method outperforms a range of baselines including large vision-language models.

VGGSounder: Audio-Visual Evaluations for Foundation Models

To address the limitations of the VGGSound dataset — including missing multi-labels, category overlap, and modality misalignment — this work constructs VGGSounder, a multi-label audio-visual classification benchmark with modality-level annotations, and proposes a "modality confusion" metric to expose deficiencies in foundation models' multimodal fusion capabilities.

Browse all 11 Audio & Speech papers →


🧊 3D Vision (267)

TRAN-D: 2D Gaussian Splatting-based Sparse-view Transparent Object Depth Reconstruction via Physics Simulation for Scene Update

This paper proposes TRAN-D, a 2D Gaussian Splatting-based method for sparse-view transparent object depth reconstruction. It employs segmentation-guided object-aware losses to optimize Gaussian distributions in occluded regions, and leverages physics simulation (MPM) to enable dynamic scene updates after object removal, requiring only a single image for scene refresh.

3D Gaussian Map with Open-Set Semantic Grouping for Vision-Language Navigation

This paper proposes a 3D Gaussian Map based on 3D Gaussian Splatting for scene representation, combined with an open-set semantic grouping mechanism, to construct a 3D environmental representation that captures both geometric structure and rich semantic information for Vision-Language Navigation (VLN). A Multi-Level Action Prediction strategy is further designed to integrate multi-granularity spatial-semantic cues for navigation decision-making.

3D Mesh Editing using Masked LRMs

This paper proposes MaskedLRM, which reformulates 3D shape editing as a conditional reconstruction problem. During training, randomly generated 3D occluders mask multi-view inputs, and a single clean conditioning view guides completion of the occluded regions. At inference, the user defines an edit region and provides a single edited image; the model produces an edited 3D mesh in a single forward pass in under 3 seconds — 2–10× faster than optimization-based methods — while supporting topological changes (e.g., adding holes or handles) and achieving reconstruction quality on par with state-of-the-art methods.

3D Test-time Adaptation via Graph Spectral Driven Point Shift

This paper proposes GSDTTA, the first work to shift 3D point cloud test-time adaptation (TTA) from the spatial domain to the graph spectral domain. By optimizing only the lowest 10% of frequency components (reducing parameters by ~90%), GSDTTA achieves global structural adjustment. Combined with a feature-map-guided self-training strategy for pseudo-label generation, it significantly outperforms existing 3D TTA methods on ModelNet40-C and ScanObjectNN-C.

3DGraphLLM: Combining Semantic Graphs and Large Language Models for 3D Scene Understanding

This paper proposes 3DGraphLLM, which encodes semantic inter-object relationships in 3D scenes as learnable graph representations and feeds them into an LLM. The method significantly outperforms baselines that ignore semantic relations across multiple 3D vision-language tasks — including object grounding, scene captioning, and visual question answering — while achieving 5× faster inference than LVLM-based approaches.

3DGS-LM: Faster Gaussian-Splatting Optimization with Levenberg-Marquardt

This paper proposes 3DGS-LM, which replaces the ADAM optimizer in 3D Gaussian Splatting with a customized second-order Levenberg-Marquardt (LM) optimizer. An efficient CUDA-parallelized PCG algorithm and a gradient caching structure accelerate Jacobian-vector products, reducing optimization time by approximately 20% while preserving reconstruction quality.
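
For readers unfamiliar with Levenberg-Marquardt, the toy sketch below shows the damped Gauss-Newton update on a one-parameter curve fit. It is not the paper's CUDA implementation: the model `y = exp(a * x)`, the function name `lm_fit_exponent`, and the damping schedule are assumptions chosen for illustration, and the scalar normal equation is solved directly rather than with PCG.

```python
import math


def lm_fit_exponent(xs, ys, a=0.0, lam=1e-2, iters=50):
    """Fit a in y = exp(a * x) with a scalar Levenberg-Marquardt loop."""

    def cost(a_):
        return sum((math.exp(a_ * x) - y) ** 2 for x, y in zip(xs, ys))

    for _ in range(iters):
        residuals = [math.exp(a * x) - y for x, y in zip(xs, ys)]
        jac = [x * math.exp(a * x) for x in xs]  # d(residual)/da
        jtj = sum(j * j for j in jac)
        jtr = sum(j * r for j, r in zip(jac, residuals))
        step = -jtr / (jtj + lam)  # damped normal equation
        if cost(a + step) < cost(a):
            a += step   # accept: move and trust the quadratic model more
            lam *= 0.5
        else:
            lam *= 2.0  # reject: increase damping toward gradient descent
    return a


xs = [0.0, 0.5, 1.0]
ys = [math.exp(0.7 * x) for x in xs]
a_hat = lm_fit_exponent(xs, ys)
```

The damping term `lam` interpolates between Gauss-Newton (small `lam`) and a short gradient-descent step (large `lam`), which is what makes the method robust far from the optimum.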

4D Gaussian Splatting SLAM

This paper presents the first complete 4D Gaussian Splatting SLAM system capable of simultaneously performing camera pose tracking and 4D Gaussian radiance field reconstruction in dynamic scenes. Gaussian primitives are partitioned into static and dynamic sets; dynamic object motion is modeled via sparse control points and an MLP; and a novel 2D optical flow map rendering algorithm is introduced to supervise dynamic Gaussian motion learning.

4D Visual Pre-training for Robot Learning

FVP formulates 3D visual pre-training as a next-point-cloud-prediction problem, training a conditional diffusion model to predict the current-frame point cloud from historical-frame point clouds. This approach achieves a 28% average success rate improvement over DP3 across 12 real-world manipulation tasks, establishing a new state of the art.

7DGS: Unified Spatial-Temporal-Angular Gaussian Splatting

This work extends 3DGS to seven dimensions (spatial 3D + temporal 1D + directional 3D). A conditional slicing mechanism projects 7D Gaussians into 3D Gaussians compatible with the standard 3DGS pipeline, achieving up to 7.36 dB PSNR improvement on dynamic scenes with view-dependent effects while maintaining 401 FPS real-time rendering.
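
The slicing step can be read as ordinary Gaussian conditioning. The block below sketches that standard identity and is not a transcription of the paper's formulation; the symbols $p$ (3D position), $q$ (the stacked time and view-direction variables), and the block partition of $\mu$ and $\Sigma$ are notation assumed here.

```latex
\mu = \begin{pmatrix} \mu_p \\ \mu_q \end{pmatrix}, \qquad
\Sigma = \begin{pmatrix} \Sigma_{pp} & \Sigma_{pq} \\ \Sigma_{qp} & \Sigma_{qq} \end{pmatrix}

\mu_{p \mid q} = \mu_p + \Sigma_{pq} \Sigma_{qq}^{-1} (q - \mu_q), \qquad
\Sigma_{p \mid q} = \Sigma_{pp} - \Sigma_{pq} \Sigma_{qq}^{-1} \Sigma_{qp}
```

Fixing $q$ to a given time and viewing direction leaves a 3D Gaussian with mean $\mu_{p \mid q}$ and covariance $\Sigma_{p \mid q}$, which is the form a standard 3DGS rasterizer expects.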

Browse all 267 3D Vision papers →


🎯 Object Detection (28)

3D-MOOD: Lifting 2D to 3D for Monocular Open-Set Object Detection

This paper proposes 3D-MOOD, the first end-to-end monocular open-set 3D object detector, which lifts open-set 2D detections into 3D space via geometry-aware 3D query generation and a canonical image space design, achieving state-of-the-art performance on both the Omni3D closed-set benchmark and the Argoverse 2 / ScanNet open-set benchmarks.

Adversarial Attention Perturbations for Large Object Detection Transformers

This paper proposes AFOG (Attention-Focused Offensive Gradient), an architecture-agnostic adversarial attack method that leverages a learnable attention mechanism to concentrate perturbations on vulnerable image regions. With only 10 iterations and visually imperceptible perturbations, AFOG reduces the mAP of 12 detection Transformers by up to 37.8×, while also outperforming existing methods on CNN-based detectors.

Augmenting Moment Retrieval: Zero-Dependency Two-Stage Learning

This paper proposes the AMR framework, which leverages a Splice-and-Boost data augmentation strategy and a cold-start–distillation two-stage training pipeline to substantially improve boundary awareness and semantic discriminability in video moment retrieval—without relying on any external data or pretrained models—surpassing the previous SOTA by +5% on QVHighlights.

Automated Model Evaluation for Object Detection via Prediction Consistency and Reliability

This paper proposes PCR (Prediction Consistency and Reliability), an automated evaluation method that estimates object detection model performance without human annotations. PCR analyzes the spatial consistency and confidence reliability of bounding boxes before and after NMS to estimate mAP, and constructs a corruption-based meta-dataset for more realistic and scalable evaluation.

Diffusion Curriculum: Synthetic-to-Real Data Curriculum via Image-Guided Diffusion

This paper leverages the image guidance strength of diffusion models to generate a continuous synthetic-to-real spectrum of data, and proposes a Diffusion Curriculum Learning (DisCL) strategy that adaptively selects synthetic data at optimal guidance levels across different training stages, effectively addressing long-tail classification and low-quality data learning challenges.

DISTIL: Data-Free Inversion of Suspicious Trojan Inputs via Latent Diffusion

DISTIL proposes a data-free trojan trigger inversion method that searches for trigger patterns in the latent space of a pretrained guided diffusion model—rather than in pixel space—and injects uniform noise regularization at each step to effectively distinguish genuine backdoor triggers from adversarial perturbations, achieving up to 7.1% accuracy improvement on BackdoorBench.

Dynamic-DINO: Fine-Grained Mixture of Experts Tuning for Real-time Open-Vocabulary Object Detection

This work is the first to introduce Mixture of Experts into real-time open-vocabulary object detectors. Through MoE-Tuning, it extends Grounding DINO 1.5 Edge from a dense model into a dynamic inference framework, proposing fine-grained expert decomposition and a pretrained weight allocation strategy. Using only 1.56M open-source data, the resulting model surpasses the original version trained on 20M private data.

EvRT-DETR: Latent Space Adaptation of Image Detectors for Event-based Vision

This paper proposes I2EvDet, a framework that adapts mainstream image detectors to event-based video detection by inserting lightweight RNN temporal modules into the frozen latent space of RT-DETR. With minimal architectural modifications, it achieves state-of-the-art results on the Gen1 and 1Mpx benchmarks, with gains of +2.3 and +1.4 mAP, respectively.

From Easy to Hard: Progressive Active Learning Framework for Infrared Small Target Detection with Single Point Supervision

This paper proposes a Progressive Active Learning (PAL) framework that trains infrared small target detection networks through a three-stage strategy—model pre-start, model enhancement, and model refinement—driving the network to actively identify and learn from hard samples in an easy-to-hard manner. Under single point supervision, PAL substantially narrows the performance gap with fully supervised methods (IoU improvement of 8.53%–29.1%).

Intervening in Black Box: Concept Bottleneck Model for Enhancing Human-Neural Network Mutual Understanding

This paper proposes the CBM-HNMU framework, which approximates the reasoning process of a black-box model via a Concept Bottleneck Model (CBM), automatically identifies and corrects harmful concepts, and distills the corrected knowledge back into the black-box model, enabling systematic model intervention and accuracy improvement beyond the sample level.

Browse all 28 Object Detection papers →


✂️ Segmentation (73)

2HandedAfforder: Learning Precise Actionable Bimanual Affordances from Human Videos

This paper proposes an automated pipeline to extract precise bimanual affordance annotations from human activity videos, yielding the 2HANDS dataset, and trains a VLM-based 2HandedAfforder model that predicts precise object region segmentation masks for left and right hand grasps conditioned on text prompts. The approach significantly outperforms existing methods on the newly introduced ActAffordance benchmark.

A Plug-and-Play Physical Motion Restoration Approach for In-the-Wild High-Difficulty Motions

A plug-and-play physical motion restoration framework is proposed that repairs defective frames in video-based motion capture via a Mask-conditioned Motion Correction Module (MCM), and subsequently transfers the corrected motion into a physically plausible simulation through a Physics-based Motion Transfer Module (PTM) with RL-based test-time adaptation. This work is the first to achieve physics-based simulation restoration for in-the-wild high-difficulty motions such as gymnastics and martial arts back-flips.

Advancing Visual Large Language Model for Multi-granular Versatile Perception

This paper proposes MVP-LM, a multi-granular versatile perception framework built upon a visual large language model. Through a novel multi-granular decoder and a CoT-inspired data unification strategy, MVP-LM is the first single model to simultaneously support all four perception combinations—box and mask predictions under both word-level and sentence-level instructions—achieving competitive performance on panoptic segmentation, object detection, visual grounding, and referring expression segmentation.

AnimalClue: Recognizing Animals by their Traces

This paper introduces AnimalClue, the first large-scale dataset for animal trace recognition, containing 159,605 bounding boxes spanning 968 species across five categories of indirect clues (footprints, feces, eggs, bones, and feathers), and establishes four benchmarks covering classification, detection, instance segmentation, and attribute prediction.

Auto-Vocabulary Semantic Segmentation

This paper introduces Auto-Vocabulary Semantic Segmentation (AVS), a new task in which target categories must be discovered from the images themselves, and proposes the AutoSeg framework, which autonomously discovers these categories and segments them without any human-specified vocabulary. AutoSeg achieves 87.1 mIoU on PASCAL VOC, far surpassing the only comparable method ZeroSeg (20.1), and even outperforming several open-vocabulary methods that require explicit category specification.

Beyond Single Images: Retrieval Self-Augmented Unsupervised Camouflaged Object Detection

This paper proposes RISE — a retrieval self-augmented unsupervised camouflaged object detection paradigm that constructs foreground/background prototype libraries from the training set itself and leverages KNN retrieval to generate pseudo-labels, substantially outperforming existing unsupervised and prompt-based methods without any annotations.

Can Generative Geospatial Diffusion Models Excel as Discriminative Geospatial Foundation Models?

This paper proposes SatDiFuser, a framework that repurposes a generative geospatial diffusion model (DiffusionSat) as a discriminative remote sensing foundation model. Through systematic analysis of multi-stage, multi-timestep diffusion features and three designed fusion strategies (Global Weighted, Localized Weighted, and MoE Joint Fusion), SatDiFuser outperforms existing state-of-the-art geospatial foundation models (GFMs) on semantic segmentation and classification tasks, achieving gains of up to +5.7% mIoU and +7.9% F1.

CAVIS: Context-Aware Video Instance Segmentation

This paper proposes CAVIS, which introduces a Context-Aware Instance Tracker (CAIT) to incorporate contextual information around object boundaries for enhanced instance association, and designs a Prototypical Cross-frame Contrastive loss (PCC) to enforce cross-frame feature consistency, achieving state-of-the-art performance on both VIS and VPS benchmarks.

CLOT: Closed Loop Optimal Transport for Unsupervised Action Segmentation

This paper proposes Closed Loop Optimal Transport (CLOT), a framework that jointly solves three OT problems through a three-level cyclic feature learning pipeline (frame embeddings → segment embeddings → cross-attention refined frame embeddings), establishing an explicit feedback loop between frame-level and segment-level representations to substantially improve boundary detection and clustering quality in unsupervised action segmentation.

ConformalSAM: Unlocking the Potential of Foundational Segmentation Models in Semi-Supervised Semantic Segmentation with Conformal Prediction

This paper proposes ConformalSAM, a framework that leverages Conformal Prediction to calibrate the output uncertainty of the foundation segmentation model SEEM on target domains. Unreliable pixel labels are filtered out before serving as supervision signals for unlabeled data. Combined with a late-stage self-reliance training strategy, the framework achieves 81.21 mIoU on PASCAL VOC under the 1/16 labeled setting.
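
The calibration idea can be illustrated with generic split conformal prediction. The sketch below is not ConformalSAM's actual procedure: the function names, the score `1 - p(true class)`, the toy calibration probabilities, and the "keep only singleton prediction sets" filter rule are all assumptions made for the example.

```python
import math


def conformal_threshold(cal_true_class_probs, alpha=0.1):
    """Score quantile from calibration pixels with known labels."""
    scores = sorted(1.0 - p for p in cal_true_class_probs)
    n = len(scores)
    rank = min(n, math.ceil((n + 1) * (1 - alpha)))
    return scores[rank - 1]


def prediction_set(class_probs, qhat):
    """All classes whose nonconformity score is within the threshold."""
    return [c for c, p in enumerate(class_probs) if 1.0 - p <= qhat]


# Probability the model assigned to the true class on 10 labeled pixels.
cal = [0.9, 0.8, 0.95, 0.7, 0.85, 0.6, 0.99, 0.75, 0.9, 0.8]
qhat = conformal_threshold(cal, alpha=0.2)

confident = prediction_set([0.9, 0.05, 0.05], qhat)  # one class survives
ambiguous = prediction_set([0.5, 0.45, 0.05], qhat)  # no class survives

# Hypothetical filter rule: use a pixel as a pseudo-label only when
# exactly one class remains in its prediction set.
keep_confident = len(confident) == 1
keep_ambiguous = len(ambiguous) == 1
```

The threshold is set from labeled data in the target domain, so the filter adapts to how well calibrated the foundation model is there instead of relying on a fixed confidence cutoff.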

Browse all 73 Segmentation papers →


🖼️ Image Restoration (31)

AFUNet: Cross-Iterative Alignment-Fusion Synergy for HDR Reconstruction via Deep Unfolding Paradigm

This paper formulates multi-exposure HDR reconstruction from a MAP estimation perspective, decomposes the problem into two alternating subproblems—alignment and fusion—via a spatial correspondence prior, and unfolds them into an end-to-end trainable AFUNet comprising SAM (spatial alignment), CFM (channel fusion), and DCM (data consistency) modules. The method achieves state-of-the-art performance on three HDR benchmarks, reaching PSNR-μ of 44.91 dB on the Kalantari dataset.

Benchmarking Burst Super-Resolution for Polarization Images: Noise Dataset and Analysis

This paper addresses the lack of datasets and noise models for polarization image burst super-resolution (SR) by constructing two dedicated datasets—PolarNS (noise statistics) and PolarBurstSR (SR benchmark)—proposing a polarization noise propagation analysis model, and systematically benchmarking existing burst SR methods on polarization scenes, thereby establishing a standardized evaluation framework for polarization image reconstruction.

Blind2Sound: Self-Supervised Image Denoising without Residual Noise

This paper proposes the Blind2Sound framework, which perceives noise levels and achieves personalized denoising via an adaptive re-visible loss, complemented by a Cramer Gaussian loss that improves noise parameter estimation accuracy. The framework eliminates residual noise in self-supervised blind denoising and outperforms all contemporary self-supervised methods and even some supervised baselines.

Blind Noisy Image Deblurring Using Residual Guidance Strategy

This paper proposes a Residual Guidance Strategy (RGS) for coarse-to-fine blind image deblurring within an image pyramid framework. At each scale transition, the convolution residual from the adjacent coarser scale is denoised via a guided filter and used to correct the blurred input at the current scale. This approach significantly improves kernel estimation accuracy and restoration quality under high noise levels (σ=0.1), surpassing multiple deep learning methods without requiring any training.

Closed-Loop Transfer for Weakly-supervised Affordance Grounding

This paper proposes LoopTrans, a closed-loop knowledge transfer framework that unifies exocentric and egocentric image activation via a shared CAM module, refines coarse activations into precise localizations using pixel-level pseudo-masks, and feeds egocentric localization results back to enhance exocentric knowledge extraction through denoising distillation, achieving state-of-the-art performance across all metrics on AGD20K.

Consistent Time-of-Flight Depth Denoising via Graph-Informed Geometric Attention

GIGA-ToF proposes a ToF depth denoising network that fuses motion-invariant graph structures across frames. Through cross-frame graph attention and algorithm unrolling of a MAP problem, the method simultaneously improves temporal stability and spatial sharpness, demonstrating strong generalization on both synthetic and real data.

CWNet: Causal Wavelet Network for Low-Light Image Enhancement

This paper proposes CWNet, a Causal Wavelet Network that models low-light image enhancement through a structural causal model (SCM), treating semantic information as causal factors and brightness/color degradation as non-causal factors, and employs a wavelet-based backbone for fine-grained frequency-domain feature restoration.

Decouple to Reconstruct: High Quality UHD Restoration via Active Feature Disentanglement and Reversible Fusion

This paper proposes D²R-UHDNet, a framework that employs a Controlled Differential Disentangled VAE (CD²-VAE) to actively decompose degraded images into a degradation-dominant latent space and background-dominant features, and processes the background features via a complex-domain invertible multi-scale fusion network. The method achieves state-of-the-art performance across six UHD restoration tasks with only 1M parameters.

Devil is in the Uniformity: Exploring Diverse Learners within Transformer for Image Restoration

Targeting the redundancy caused by uniform subspace allocation across heads in standard Multi-Head Attention (MHA), this paper proposes HINT, which introduces Hierarchical Multi-Head Attention (HMHA) and Query-Key Cache Updating (QKCU) to enhance inter-head diversity and interaction, achieving state-of-the-art results on 12 benchmarks across 5 image restoration tasks.

EAMamba: Efficient All-Around Vision State Space Model for Image Restoration

This paper proposes EAMamba, a framework that introduces a Multi-Head Selective Scan Module (MHSSM) and an all-around scanning strategy to achieve multi-directional scanning without increasing computational complexity or parameter count. EAMamba addresses the computational overhead and local pixel forgetting issues of Vision Mamba in image restoration, achieving 31–89% FLOPs reduction while maintaining competitive performance across super-resolution, denoising, deblurring, and dehazing tasks.

Browse all 31 Image Restoration papers →


🛰️ Remote Sensing (11)

AstroLoc: Robust Space to Ground Image Localizer

This paper proposes AstroLoc, the first space-to-ground localization model trained on 300K manually annotated astronaut photographs. Through a query-satellite pairwise loss and an unsupervised mining technique, the model learns robust representations of Earth's surface, improving Recall@1 by 35% on average and consistently exceeding 99% Recall@100. It has already localized over 500K photographs in real-world deployment.

CityNav: A Large-Scale Dataset for Real-World Aerial Navigation

This paper introduces CityNav, the first large-scale aerial vision-and-language navigation dataset for real-world urban environments, comprising 32,637 human demonstration trajectories covering 4.65 km². A Geo-Semantic Map (GSM) auxiliary representation is proposed and shown to significantly improve baseline navigation performance.

GeoDistill: Geometry-Guided Self-Distillation for Weakly Supervised Cross-View Localization

This paper proposes GeoDistill, a framework that enhances locally discriminative feature learning via a Field-of-View (FoV) occlusion-based teacher-student self-distillation paradigm. Under weakly supervised conditions (requiring only coarse GPS annotations), it achieves robust cross-view localization with performance improvements exceeding 10%, and can be applied as a plug-and-play component to different localization frameworks.

GeoExplorer: Active Geo-Localization with Curiosity-Driven Exploration

This paper proposes GeoExplorer, an active geo-localization (AGL) agent that integrates goal-directed extrinsic rewards with curiosity-driven intrinsic rewards. By jointly modeling action-state dynamics and curiosity-based exploration within a reinforcement learning framework, GeoExplorer achieves more robust UAV search strategies and demonstrates superior generalization to unseen targets and environments.

Information-Bottleneck Driven Binary Neural Network for Change Detection

This paper proposes BiCD, the first binary neural network specifically designed for change detection. By introducing an auxiliary objective module guided by the Information Bottleneck (IB) principle, BiCD enhances the feature representation capability and separability of BNNs. It achieves state-of-the-art performance among BNN-based methods on both street-view and remote sensing change detection benchmarks, with 30× memory compression and 2.5× inference acceleration.

PAN-Crafter: Learning Modality-Consistent Alignment for Pan-Sharpening

PAN-Crafter proposes a modality-consistent alignment framework that explicitly addresses cross-modal misregistration between PAN and MS images via Modality-Adaptive Reconstruction (MARs) and Cross-Modal Misalignment-aware Multi-scale Attention (CM3A), achieving state-of-the-art performance on multiple remote sensing benchmarks while running 1110× faster than diffusion-based methods.

RS-vHeat: Heat Conduction Guided Efficient Remote Sensing Foundation Model

This work is the first to introduce the physical heat conduction process into a remote sensing foundation model. RS-vHeat replaces the attention mechanism with a Heat Conduction Operator (HCO) to model local region correlations in remote sensing images, achieving strong performance across 4 tasks and 10 datasets while reducing GPU memory by 84%, FLOPs by 24%, and improving throughput by 2.7× compared to the attention-based baseline.

SkySense V2: A Unified Foundation Model for Multi-Modal Remote Sensing

This paper proposes SkySense V2, which employs a single unified Transformer backbone to process three remote sensing modalities — high-resolution optical, multispectral, and SAR imagery — and introduces Adaptive Patch Merging (APM), modality-specific prompt tokens, and Query-based Semantic Aggregation Contrastive Learning (QSACL) for pre-training. With only 665M parameters (vs. 1.26B in the predecessor SkySense), SkySense V2 achieves an average improvement of 1.8 points across 7 tasks on 16 datasets.

SMARTIES: Spectrum-Aware Multi-Sensor Auto-Encoder for Remote Sensing Images

This paper proposes SMARTIES, a unified sensor-agnostic foundation model for remote sensing that maps heterogeneous sensor data into a shared space via spectrum-aware projection. Combined with cross-sensor token mixing and masked reconstruction for self-supervised pre-training, SMARTIES surpasses sensor-specific models on both unimodal and multimodal tasks and generalizes to sensors unseen during pre-training.

Towards a Unified Copernicus Foundation Model for Earth Vision

This work presents a unified Earth observation foundation model system covering all major Copernicus Sentinel tasks, comprising the Copernicus-Pretrain dataset with 18.7 million aligned images, the Copernicus-FM model supporting arbitrary spectral and non-spectral sensors, and the Copernicus-Bench evaluation benchmark spanning 15 hierarchical downstream tasks.

Browse all 11 Remote Sensing papers →


🧑 Human Understanding (41)

AR-VRM: Imitating Human Motions for Visual Robot Manipulation with Analogical Reasoning

This paper proposes AR-VRM, the first method to enhance visual robot manipulation (VRM) through explicit imitation of human hand keypoints. It employs a keypoint vision-language model pretrained on large-scale human activity videos to acquire motion knowledge, and establishes correspondences between human hand keypoints and robot components via analogical reasoning.

Avat3r: Large Animatable Gaussian Reconstruction Model for High-fidelity 3D Head Avatars

This paper presents Avat3r — the first animatable large reconstruction model (LRM) that regresses high-quality drivable 3D Gaussian head avatars from only 4 input images in a feed-forward manner. By integrating DUSt3R positional maps and Sapiens semantic features as priors, and modeling expression-driven animation via simple cross-attention, Avat3r substantially outperforms existing methods on the Ava256 and NeRSemble datasets.

Bi-Level Optimization for Self-Supervised AI-Generated Face Detection

This paper proposes BLADES, a method that employs bi-level optimization to explicitly align self-supervised pretraining with the AI-generated face detection objective. The inner loop optimizes a visual encoder on pretext tasks including EXIF classification/ranking and face manipulation detection, while the outer loop optimizes task weights to improve performance on a proxy detection task, enabling cross-generator generalization without relying on any synthetic face data.

CarGait: Cross-Attention based Re-ranking for Gait Recognition

This paper proposes CarGait, a cross-attention based re-ranking method for gait recognition. By performing strip-wise cross-attention between probe and candidate sequences, CarGait learns fine-grained gait correspondences and maps global features from pretrained single-stage models into a new discriminative embedding space. The method consistently improves Rank-1/5 accuracy across seven gait models on three major benchmarks: Gait3D, GREW, and OU-MVLP.

CleanPose: Category-Level Object Pose Estimation via Causal Learning and Knowledge Distillation

This work is the first to introduce causal reasoning into category-level object pose estimation (COPE). It eliminates spurious correlations induced by data bias via a front-door adjustment-based causal reasoning module, and provides unbiased categorical semantic supervision through residual knowledge distillation from the 3D foundation model ULIP-2. The method achieves 61.7% on the strict 5°2cm metric on REAL275, surpassing the state of the art by 4.7%.

Contact-Aware Refinement of Human Pose Pseudo-Ground Truth via Bioimpedance Sensing

This paper proposes BioTUCH, a framework that detects self-contact events via wrist-to-wrist bioimpedance sensing and performs contact-aware 3D arm pose refinement in conjunction with a visual pose estimator, achieving an average improvement of 11.7% in reconstruction accuracy.

Controllable and Expressive One-Shot Video Head Swapping

This paper proposes a diffusion-based multi-condition controllable video head swapping framework (SwapAnyHead) that achieves high-fidelity identity preservation, seamless background blending, and accurate cross-identity expression transfer and editing via a shape-agnostic mask strategy, a hair enhancement strategy, and an expression-aware 3DMM-driven landmark retargeting module.

DADM: Dual Alignment of Domain and Modality for Face Anti-Spoofing

This paper proposes the DADM framework, which simultaneously addresses intra-domain modality misalignment and inter-domain modality misalignment in multimodal face anti-spoofing via a Mutual Information Mask (MIM) module and a dual domain-modality alignment optimization strategy, achieving state-of-the-art performance across four evaluation protocols.

DreamActor-M1: Holistic, Expressive and Robust Human Image Animation with Hybrid Guidance

DreamActor-M1 proposes a human image animation framework based on the DiT architecture, achieving fine-grained facial and body control through hybrid control signals comprising implicit facial representations, 3D head spheres, and 3D body skeletons. Combined with complementary appearance guidance and a progressive training strategy, the framework supports multi-scale generation ranging from portrait to full-body.

Dynamic Reconstruction of Hand-Object Interaction with Distributed Force-aware Contact Representation

This paper proposes ViTaM-D, a vision-tactile fusion framework that achieves dynamic reconstruction of hand-object interaction for both rigid and deformable objects. The framework introduces a novel Distributed Force-aware Contact Representation (DF-Field) and a two-stage pipeline consisting of visual dynamic tracking followed by force-aware optimization. The HOT dataset is also introduced to address the evaluation gap in deformable object hand-object interaction.

Browse all 41 Human Understanding papers →


📹 Video Understanding (56)

4D-Bench: Benchmarking Multi-modal Large Language Models for 4D Object Understanding

This paper introduces 4D-Bench, the first benchmark for evaluating multimodal large language models (MLLMs) on 4D object understanding (i.e., 3D objects with temporal evolution). It comprises two core tasks: 4D Object QA (751 QA pairs) and 4D Object Captioning (580 objects × 5 annotations). Evaluation reveals that even the state-of-the-art GPT-4o achieves only 63% accuracy on simple 4D objects, compared to 91% for humans, with particularly weak performance on object counting and temporal understanding. The results expose a substantial gap in multi-view spatiotemporal understanding among MLLMs.

Adaptive Hyper-Graph Convolution Network for Skeleton-based Human Action Recognition with Virtual Connections

This paper proposes Hyper-GCN, which overcomes the limitation of conventional GCNs that model only binary pairwise joint relationships. It replaces the binary graph with an adaptive non-uniform hypergraph to model skeletal topology and introduces virtual hyper-joints whose virtual connections enable direct modeling of multi-joint cooperative relationships. The approach achieves state-of-the-art performance on NTU-60/120 and NW-UCLA with the most lightweight GCN design (base variant: only 1.1M parameters, 1.63 GFLOPs).

AIM: Adaptive Inference of Multi-Modal LLMs via Token Merging and Pruning

This paper proposes AIM, a training-free adaptive inference method for multimodal LLMs that offers flexible accuracy–efficiency trade-offs across a 40× FLOPs range and achieves a 6.8× FLOPs reduction while maintaining performance. It combines iterative visual token merging based on embedding cosine similarity before the LLM with progressive token pruning based on PageRank-derived multimodal importance scores within LLM layers. Strong performance is demonstrated on both video and image understanding benchmarks; under equal compute budgets, AIM even surpasses SOTA on long video understanding (+4.6 MLVU).
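
The merging step described for AIM can be illustrated with a small sketch. This is not the paper's implementation: the adjacent-pair averaging rule, the `threshold` value, the `max_rounds` cap, and all function names are assumptions chosen to show how cosine-similarity-based iterative merging shrinks a token sequence.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def merge_once(tokens, threshold):
    """One pass: average each adjacent pair whose similarity reaches the threshold."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and cosine(tokens[i], tokens[i + 1]) >= threshold:
            merged.append([(x + y) / 2 for x, y in zip(tokens[i], tokens[i + 1])])
            i += 2  # both tokens were consumed by the merge
        else:
            merged.append(tokens[i])
            i += 1
    return merged

def merge_tokens(tokens, threshold=0.9, max_rounds=3):
    """Repeat merging passes until nothing merges or the round limit is hit."""
    for _ in range(max_rounds):
        reduced = merge_once(tokens, threshold)
        if len(reduced) == len(tokens):
            break
        tokens = reduced
    return tokens

# Hypothetical 2-D "visual tokens": two near-duplicate pairs and one outlier.
tokens = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0], [0.02, 1.0], [-1.0, 0.0]]
merged = merge_tokens(tokens)
print(len(tokens), "->", len(merged))
```

Lowering the threshold or raising the round limit trades accuracy for fewer tokens, which is the kind of accuracy-efficiency knob the summary refers to.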

Aligning Effective Tokens with Video Anomaly in Large Language Models

This paper proposes VA-GPT, which efficiently aligns anomaly-relevant tokens within MLLMs via two modules — Spatial Effective Token Selection (SETS) and Temporal Effective Token Generation (TETG) — enabling precise detection, description, and temporal localization of anomalous events.

AllTracker: Efficient Dense Point Tracking at High Resolution

AllTracker reformulates point tracking as a multi-frame long-range optical flow problem, iteratively refining correspondence estimates on low-resolution grids via 2D convolutions and pixel-aligned temporal attention, followed by upsampling. With only 16M parameters, it achieves state-of-the-art accuracy and enables high-resolution (768×1024) dense tracking of all pixels at speeds approaching optical flow methods.

An Empirical Study of Autoregressive Pre-training from Videos

This paper systematically investigates autoregressive pre-training from videos (termed Toto), training a causal Transformer on over one trillion visual tokens. Despite minimal inductive biases, the approach achieves competitive performance across image recognition, video classification, object tracking, and robot manipulation, while exhibiting scaling laws analogous to those of language models, albeit at a slower rate.

Attention to Trajectory: Trajectory-Aware Open-Vocabulary Tracking

This paper proposes TRACT, a method that leverages trajectory-level information to enhance open-vocabulary multi-object tracking (OV-MOT). It improves association via Trajectory Consistency Reinforcement (TCR) and improves classification via Trajectory Feature Aggregation (TFA) and Trajectory Semantic Enrichment (TSE). TRACT achieves significant performance gains on the OV-TAO benchmark, particularly in classification accuracy.

Browse all 56 Video Understanding papers →


🚗 Autonomous Driving (91)

3D Gaussian Splatting Driven Multi-View Robust Physical Adversarial Camouflage Generation

This paper proposes PGA, the first physical adversarial attack framework based on 3D Gaussian Splatting (3DGS). By addressing mutual occlusion and self-occlusion among Gaussians to ensure cross-viewpoint consistency, and by designing a min-max optimization strategy to filter non-robust adversarial features, PGA substantially outperforms state-of-the-art methods in both the digital and physical domains.

3DRealCar: An In-the-wild RGB-D Car Dataset with 360-degree Views

This paper presents 3DRealCar, the first large-scale real-world 3D vehicle dataset comprising 2,500 vehicles from 100+ brands, each with approximately 200 high-resolution 360-degree RGB-D views captured under three lighting conditions (standard, reflective, and low-light), along with 13-category vehicle parsing annotations, supporting tasks including 3D reconstruction, detection, and generation.

4DSegStreamer: Streaming 4D Panoptic Segmentation via Dual Threads

This paper proposes 4DSegStreamer, a streaming 4D panoptic segmentation framework built upon a dual-thread system (predictive thread + inference thread). It achieves real-time, high-quality 4D panoptic segmentation through geometric and motion memory maintenance, ego-pose prediction, and inverse forward flow iteration.

6DOPE-GS: Online 6D Object Pose Estimation using Gaussian Splatting

Leveraging the efficient differentiable rendering capability of 2D Gaussian Splatting, this paper proposes a CAD-model-free online 6D object pose estimation and tracking method. By jointly optimizing a Gaussian object field and keyframe poses, it achieves approximately 5× speedup over BundleSDF while maintaining comparable accuracy.

A Constrained Optimization Approach for Gaussian Splatting from Coarsely-posed Images and Noisy Lidar Point Clouds

This paper proposes an SfM-free constrained optimization framework that jointly optimizes camera parameters and 3DGS scene reconstruction from coarse poses and noisy point clouds produced by multi-camera SLAM systems, via camera pose decomposition, sensitivity-based pre-conditioning, log-barrier constraints, and geometric constraints.

AD-GS: Object-Aware B-Spline Gaussian Splatting for Self-Supervised Autonomous Driving

This paper proposes AD-GS, a self-supervised autonomous driving scene rendering framework based on 3D Gaussian Splatting. The core innovation is modeling dynamic object motion by combining locally-aware learnable B-spline curves with globally-aware trigonometric functions, coupled with a simplified binary pseudo 2D segmentation for robust scene decomposition. Without relying on manual 3D annotations, AD-GS significantly outperforms existing self-supervised methods and approaches the performance of annotation-dependent approaches.

AdaDrive: Self-Adaptive Slow-Fast System for Language-Grounded Autonomous Driving

AdaDrive presents the first LLM-augmented autonomous driving framework with an adaptive slow-fast architecture. Two adaptive connectors dynamically determine when to activate the LLM (Connector-W) and how much the LLM contributes (Connector-H), achieving SOTA performance on language-grounded driving benchmarks (driving score 80.9%) while reducing inference latency to 189ms and GPU memory to 6.79GB.

Adaptive Dual Uncertainty Optimization: Boosting Monocular 3D Object Detection under Test-Time Shifts

This paper proposes DUO (Dual Uncertainty Optimization), the first test-time adaptation framework that jointly minimizes semantic uncertainty and geometric uncertainty, achieving robust monocular 3D object detection via conjugate focal loss and normal field constraints.

AGO: Adaptive Grounding for Open World 3D Occupancy Prediction

This paper proposes the AGO framework, which handles known categories via noise-augmented grounding training and unknown categories via a modality adapter for adaptive alignment. An information entropy-based open-world recognizer dynamically selects the optimal features at inference time. AGO surpasses VEON by 4.09 mIoU on the Occ3D-nuScenes self-supervised benchmark while exhibiting open-world zero-shot/few-shot transfer capability.

Browse all 91 Autonomous Driving papers →


🤖 Robotics & Embodied AI (26)

Adaptive Articulated Object Manipulation On The Fly with Foundation Model Reasoning and Part Grounding

This paper proposes AdaRPG, a framework that leverages foundation vision-language models for part-level segmentation and affordance reasoning on articulated objects, and employs GPT-4o to generate high-level control code for adaptively scheduling atomic manipulation skills, achieving cross-category zero-shot generalization in both simulation and real-world environments.

AnyBimanual: Transferring Unimanual Policy for General Bimanual Manipulation

This paper proposes AnyBimanual, a plug-and-play framework that transfers pretrained unimanual manipulation policies to general bimanual manipulation scenarios via a Skill Manager and a Visual Aligner, achieving significant multi-task generalization with only a small number of bimanual demonstrations.

Beyond Losses Reweighting: Empowering Multi-Task Learning via the Generalization Perspective

From a generalization perspective, this paper introduces Sharpness-Aware Minimization (SAM) into multi-task learning (MTL). By decomposing each task's SAM gradient into a "low-loss direction" and a "flat direction" and aggregating them separately, the method reduces gradient conflicts and guides the model toward a jointly flat low-loss region shared across tasks.

Bridging Domain Generalization to Multimodal Domain Generalization via Unified Representations

This paper proposes URMMDG, a framework that constructs a cross-modal unified representation space via supervised contrastive learning and decouples class-generic information from modality/domain-specific information through mutual information minimization. This enables effective transfer of classical single-modal domain generalization methods (Mixup, JiGen, IBN-Net) to multimodal domain generalization (MMDG) settings, achieving state-of-the-art performance on the EPIC-Kitchens and HAC benchmarks.

Certifiably Optimal Anisotropic Rotation Averaging

This paper proposes a novel SDP relaxation that enforces solutions to lie within the convex hull of SO(3), conv(SO(3)), achieving for the first time certifiably globally optimal rotation averaging under anisotropic cost functions. It resolves the fundamental failure of the conventional O(3) relaxation in anisotropic settings.

CombatVLA: An Efficient Vision-Language-Action Model for Combat Tasks in 3D Action Role-Playing Games

This paper proposes CombatVLA, an efficient 3B-parameter VLA model designed for combat tasks in 3D action role-playing games. Through the Action-of-Thought data format and a truncated inference strategy, CombatVLA achieves inference speeds up to 50× faster than existing VLM-based game frameworks while surpassing human players in combat success rate.

COSMO: Combination of Selective Memorization for Low-cost Vision-and-Language Navigation

This paper proposes COSMO (Combination of Selective Memorization), a low-cost VLN architecture that replaces the computationally expensive attention mechanisms in Transformers with two customized selective state space modules: Round Selective Scan (RSS), which captures global context in a single scan pass, and the Cross-modal Selective State Space Module (CS3), which performs dual-stream cross-modal interaction. COSMO surpasses the baseline DUET in navigation performance with only 15.5% of its parameters and 9.3% of its FLOPs.

DexVLG: Dexterous Vision-Language-Grasp Model at Scale

This paper presents DexVLG — the first large-scale vision-language-dexterous-grasp model. It introduces DexGraspNet 3.0, a dataset comprising 174K objects and 170M grasp poses with part-level semantic annotations. By combining a VLM encoder with a Flow Matching pose prediction head, DexVLG achieves over 76% zero-shot execution success in simulation and demonstrates semantically aligned dexterous grasping in the real world.

Embodied Representation Alignment with Mirror Neurons

Inspired by mirror neurons, this paper aligns the intermediate representations of action understanding (observing others' behavior) and embodied execution (autonomously performing actions) into a shared latent space via contrastive learning. The work reveals a spontaneous alignment phenomenon between the two model families that correlates with task success rate, and demonstrates that explicit alignment yields improvements on action recognition (+3.3%) and robot manipulation (+3.5%).

EvolvingGrasp: Evolutionary Grasp Generation via Efficient Preference Alignment

This paper proposes EvolvingGrasp, which achieves efficient evolutionary generation and human preference alignment for dexterous grasp pose synthesis via Handpose-wise Preference Optimization (HPO) and a Physics-Aware Consistency Model (PCM), attaining state-of-the-art performance on four benchmark datasets with a 30× inference speedup.

Browse all 26 Robotics & Embodied AI papers →


🎮 Reinforcement Learning (7)

Embodied Navigation with Auxiliary Task of Action Description Prediction

DescRL introduces action description generation as an auxiliary task for reinforcement learning-based navigation. By distilling knowledge from pretrained vision-language models to train an ADPredictor, the navigation agent simultaneously produces interpretable action descriptions and achieves improved navigation performance, attaining state-of-the-art results on Semantic Audio-Visual Navigation (SAVNav) and several other tasks.

mDP3: A Training-free Approach for List-wise Frame Selection in Video-LLMs

This paper proposes mDP3, a training-free and model-agnostic video frame selection method that estimates frame similarity in RKHS via a conditional Gaussian kernel, leverages Determinantal Point Processes (DPP) to capture query relevance and list-wise diversity, and models temporal structure via a Markov Decision Process (MDP). Using only 8 input frames, mDP3 significantly outperforms uniform sampling and existing frame selection methods on multiple long-video benchmarks.
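
To show how a Determinantal Point Process balances query relevance against diversity, here is a minimal sketch of plain greedy DPP selection. It does not reproduce mDP3's conditional Gaussian kernel or its MDP-based temporal modeling; the relevance scores, the similarity matrix, and the function names are all hypothetical.

```python
def det(m):
    """Determinant via Gaussian elimination with partial pivoting."""
    a = [row[:] for row in m]
    n, d = len(a), 1.0
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(a[r][c]))
        if abs(a[p][c]) < 1e-12:
            return 0.0
        if p != c:
            a[c], a[p] = a[p], a[c]
            d = -d  # a row swap flips the sign
        d *= a[c][c]
        for r in range(c + 1, n):
            f = a[r][c] / a[c][c]
            for j in range(c, n):
                a[r][j] -= f * a[c][j]
    return d

def greedy_dpp(relevance, similarity, k):
    """Greedily pick up to k items that maximize det of the selected kernel submatrix."""
    n = len(relevance)
    # L-ensemble kernel: L[i][j] = relevance[i] * similarity[i][j] * relevance[j].
    L = [[relevance[i] * similarity[i][j] * relevance[j] for j in range(n)]
         for i in range(n)]
    selected = []
    for _ in range(k):
        best, best_det = None, 0.0
        for i in range(n):
            if i in selected:
                continue
            idx = selected + [i]
            d = det([[L[r][c] for c in idx] for r in idx])
            if d > best_det:
                best, best_det = i, d
        if best is None:
            break
        selected.append(best)
    return sorted(selected)

# Hypothetical frames: 0 and 1 are near-duplicates and both highly relevant.
relevance = [1.0, 0.95, 0.6, 0.5]
similarity = [
    [1.0, 0.98, 0.1, 0.1],
    [0.98, 1.0, 0.1, 0.1],
    [0.1, 0.1, 1.0, 0.2],
    [0.1, 0.1, 0.2, 1.0],
]
chosen = greedy_dpp(relevance, similarity, k=2)
print(chosen)
```

In this toy case the second most relevant frame is skipped because it nearly duplicates the first pick, so a less relevant but dissimilar frame is chosen instead.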

NavQ: Learning a Q-Model for Foresighted Vision-and-Language Navigation

This paper proposes NavQ, a foresighted VLN agent that employs a Q-model to predict, in a single forward pass, long-horizon future semantic aggregation features (Q-features) for each candidate action. Combined with an A*-style search strategy, NavQ achieves significant improvements on object-goal navigation benchmarks.

Progressor: A Perceptually Guided Reward Estimator with Self-Supervised Online Refinement

This paper proposes Progressor, a framework that learns task-agnostic reward functions from unannotated videos via self-supervision. It provides dense reward signals by predicting task progress distributions and addresses distribution shift during online RL training through an adversarial push-back strategy.

R1-Onevision: Advancing Generalized Multimodal Reasoning through Cross-Modal Formalization

This paper proposes R1-Onevision, a framework that converts images into formalized textual representations via a cross-modal reasoning pipeline, combined with a two-stage post-training strategy of SFT followed by rule-based reinforcement learning (GRPO), to significantly enhance multimodal reasoning in vision-language models, surpassing GPT-4o on multiple mathematical reasoning benchmarks.

RL-Selector: Reinforcement Learning-Guided Data Selection via Redundancy Assessment

This paper proposes RL-Selector, which introduces the ε-sample cover concept to quantify sample redundancy and formulates data selection as a reinforcement learning problem. A lightweight A2C policy network adaptively optimizes the selection strategy, achieving generalization performance comparable to or surpassing full-data training with significantly fewer samples across multiple benchmark datasets.

RoboFactory: Exploring Embodied Agent Collaboration with Compositional Constraints

This paper introduces the concept of compositional constraints to formalize safety and efficiency requirements in multi-agent embodied collaboration, constructs the first multi-agent manipulation benchmark RoboFactory based on this formalization, and systematically investigates architectures and training strategies for multi-agent imitation learning.


🔄 Self-Supervised Learning (13)

A Hidden Stumbling Block in Generalized Category Discovery: Distracted Attention

This paper identifies a previously overlooked issue in GCD—ViT attention on unlabeled data (especially novel categories) tends to disperse onto background regions (distracted attention)—and proposes an Attention Focusing (AF) module that corrects attention via multi-scale token importance measurement combined with adaptive pruning. As a plug-and-play module on top of SimGCD, AF achieves up to 15.4% performance improvement.

A Token-level Text Image Foundation Model for Document Understanding (TokenFD/TokenVL)

This paper proposes TokenFD, the first token-level text image foundation model, pre-trained on 20 million images and 1.8 billion BPE token-mask pairs via token-level vision-language alignment to achieve image-as-text semantic understanding. Built upon TokenFD, TokenVL is introduced as a document understanding MLLM, achieving a score of 860 on OCRBench (highest among 8B-class models) and an average improvement of 8.8% across ten VQA benchmarks including DocVQA.

Always Skip Attention

This paper theoretically demonstrates that the self-attention mechanism in Vision Transformers is inherently ill-conditioned, leading to training collapse in the absence of skip connections. It further proposes Token Graying (TG), a method that improves the condition number of input tokens to enhance ViT training stability and performance.

CObL: Toward Zero-Shot Ordinal Layering without User Prompting

This paper presents CObL, an architecture based on multiple frozen Stable Diffusion UNets operating in parallel, capable of inferring an occlusion-ordered object layer representation (one amodally-completed object per layer) from a single image without any user prompts or prior knowledge of object count. Trained on only a few thousand synthetic tabletop scenes, CObL generalizes zero-shot to real-world photographs.

From Linearity to Non-Linearity: How Masked Autoencoders Capture Spatial Correlations

This paper theoretically analyzes how MAE learns spatial correlations in images. It derives a closed-form solution for linear MAE, reveals how masking ratio and patch size select short- or long-range spatial features, and extends the analysis to nonlinear MAE, providing theoretical guidance for hyperparameter selection in practice.

Generate, Refine, and Encode: Leveraging Synthesized Novel Samples for On-the-Fly Fine-Grained Category Discovery

This paper proposes DiffGRE, a diffusion-model-based framework for on-the-fly category discovery. It synthesizes novel samples containing virtual category information via Attribute Composition Generation (ACG), filters low-quality samples through Diversity-Driven Refinement (DDR), and injects additional category knowledge via Semi-supervised Leader Encoding (SLE). DiffGRE achieves substantial performance gains over existing OCD methods across 6 fine-grained datasets (average ACC-ALL improvement of 6.5%).

Improving Large Vision and Language Models by Learning from a Panel of Peers

This paper proposes the Panel-of-Peers (PoP) learning framework, in which multiple LVLMs of comparable capability mutually generate candidate responses and score each other to construct preference data. Combined with iterative self-improvement via SimPO, PoP raises the average score across 15 benchmarks from 48% to 57% without any human-annotated data.
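
As a rough illustration of the preference objective named above, the following is a minimal sketch of the SimPO loss for a single preference pair. The function name, argument names, and the default values of `beta` and `gamma` are illustrative assumptions, not settings taken from the PoP paper.

```python
import math

def simpo_loss(logp_w, len_w, logp_l, len_l, beta=2.0, gamma=0.5):
    """SimPO loss for one (preferred, rejected) response pair.

    logp_w, logp_l: summed token log-probabilities of the preferred and
                    rejected responses under the policy being trained.
    len_w, len_l:   their lengths in tokens (used for length normalisation).
    beta, gamma:    reward scale and target margin (illustrative defaults).
    """
    margin = beta * (logp_w / len_w - logp_l / len_l) - gamma
    # -log(sigmoid(margin)) written as log(1 + exp(-margin))
    return math.log1p(math.exp(-margin))
```

The loss shrinks as the length-normalised log-probability of the preferred response rises above that of the rejected one, and no reference model is needed.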

LoftUp: Learning a Coordinate-Based Feature Upsampler for Vision Foundation Models

This paper proposes LoftUp, a coordinate-based cross-attention upsampler that maps low-resolution VFM features to arbitrary high resolutions. It is trained on full-resolution pseudo-GT constructed with class-agnostic mask refinement and self-distillation, and achieves average improvements of 10–20% across 6 downstream tasks and nearly 50% on video object segmentation.

Manual-PA: Learning 3D Part Assembly from Instruction Diagrams

This paper proposes Manual-PA, a Transformer-based instruction-guided 3D part assembly framework that infers assembly order by aligning 3D parts with instruction step diagrams via contrastive learning, then uses the learned order as soft guidance through positional encoding for 6DoF pose prediction, significantly outperforming existing methods on PartNet.

MoSiC: Optimal-Transport Motion Trajectory for Dense Self-Supervised Learning

MoSiC extracts long-range motion trajectories via an offline point tracker and propagates cluster assignments along the temporal dimension through an Optimal Transport (Sinkhorn-Knopp)-based clustering mechanism. This enables learning spatially and temporally consistent dense representations from video data, improving DINOv2 by 1%–6% across multiple image and video benchmarks using only video for training.
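
The Sinkhorn-Knopp step mentioned above can be sketched as follows. This is a generic balanced-assignment routine of the kind used in clustering-based self-supervised learning, not MoSiC's code; the function name, `n_iters`, and `eps` are assumptions, and trajectory propagation is not covered.

```python
import numpy as np

def sinkhorn_knopp(scores, n_iters=3, eps=0.05):
    """Turn point-to-prototype scores into balanced soft cluster assignments.

    scores: (N, K) similarity of N points to K cluster prototypes.
    Returns an (N, K) array whose rows sum to 1, with the total mass per
    cluster pushed toward N / K.
    """
    Q = np.exp((scores - scores.max()) / eps).T   # (K, N), shifted for stability
    Q /= Q.sum()
    K, N = Q.shape
    for _ in range(n_iters):
        Q /= Q.sum(axis=1, keepdims=True)         # equalise mass across clusters
        Q /= K
        Q /= Q.sum(axis=0, keepdims=True)         # equalise mass across points
        Q /= N
    return (Q * N).T
```

In a MoSiC-style pipeline, assignments computed this way for one frame would then be carried along tracked points to supervise other frames.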

Browse all 13 Self-Supervised Learning papers →


📐 Optimization & Theory (7)

Addressing Representation Collapse in Vector Quantized Models with One Linear Layer

This paper proposes SimVQ, a method that reparameterizes codebook vectors via a single learnable linear transformation layer (\(\bm{C}\bm{W}\)), converting the disjoint optimization of the codebook into a joint spatial optimization, thereby fundamentally resolving representation collapse in VQ models and achieving near-100% codebook utilization.
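
A toy NumPy sketch of the reparameterization idea follows. It assumes the codebook `C` is kept frozen and only the linear layer `W` is updated; the sizes, learning rate, and helper name are illustrative rather than taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
K, d = 8, 4                      # codebook size and code dimension (illustrative)
C = rng.normal(size=(K, d))      # codebook, assumed frozen in this sketch
W = np.eye(d)                    # learnable linear layer, initialised to identity

def quantize(z, C, W):
    """Return the index and vector of the nearest code in the codebook C @ W."""
    codes = C @ W
    dists = ((z[None, :] - codes) ** 2).sum(axis=1)
    idx = int(np.argmin(dists))
    return idx, codes[idx]

z = rng.normal(size=d)           # an encoder output to be quantized
idx, q = quantize(z, C, W)

# One gradient step on W for the loss ||z - C[idx] @ W||^2.
grad_W = -2.0 * np.outer(C[idx], z - q)
W_new = W - 0.1 * grad_W

# Only one code was selected, yet every code moves because all share W.
moved = np.linalg.norm(C @ W_new - C @ W, axis=1)
```

In a plain VQ update only the selected code would change; here the shared `W` moves the entire codebook, which is the joint optimization the summary refers to.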

Class-Wise Federated Averaging for Efficient Personalization

cwFedAvg extends FedAvg from client-level aggregation to class-level aggregation, constructing a dedicated global model per class and combining them into a personalized local model weighted by each client's class distribution. Coupled with Weight Distribution Regularization (WDR) to strengthen the alignment between class distribution and weight norms, the method achieves substantial personalization gains under non-IID settings while maintaining the same communication overhead as FedAvg.
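
The class-level aggregation can be sketched as below. The summary does not give the exact per-class weighting, so this sketch assumes each client contributes to a class model in proportion to the number of samples of that class it holds; all names are hypothetical, and every class is assumed to appear on at least one client.

```python
import numpy as np

def class_wise_fedavg(client_weights, class_dists, client_sizes):
    """Aggregate per class, then personalise per client.

    client_weights: (K, P) flattened model parameters, one row per client.
    class_dists:    (K, C) each row is a client's class distribution (sums to 1).
    client_sizes:   (K,)   number of local samples per client.
    Returns (class_models, personalized) with shapes (C, P) and (K, P).
    """
    # Assumed weighting: client k counts toward class c by its class-c samples.
    counts = class_dists * client_sizes[:, None]          # (K, C)
    agg = counts / counts.sum(axis=0, keepdims=True)      # normalise over clients
    class_models = agg.T @ client_weights                 # (C, P)
    # Each client mixes the class models by its own class distribution.
    personalized = class_dists @ class_models             # (K, P)
    return class_models, personalized
```

When all clients share the same class distribution, this reduces to ordinary FedAvg; when clients hold disjoint classes, each client gets its own weights back.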

Federated Continual Instruction Tuning

This paper introduces the first Federated Continual Instruction Tuning (FCIT) benchmark, covering 2 scenarios, 4 settings, and 12 datasets, and proposes the DISCO framework, which addresses data heterogeneity and catastrophic forgetting via Dynamic Knowledge Organization (DKO) and Subspace Selective Activation (SSA).

Federated Prompt-Tuning with Heterogeneous and Incomplete Multimodal Client Data

This paper proposes FED-PRIME, a federated prompt-tuning framework for multimodal settings with missing modalities. It maintains two sets of learnable prompts — inter-client and intra-client — to capture cross-client alignable missing patterns and client-specific missing patterns, respectively, and employs a clustering-alignment mechanism for server-side aggregation. FED-PRIME substantially outperforms existing baselines across diverse missing-data configurations.

Learning Interpretable Queries for Explainable Image Classification with Information Pursuit

This paper parameterizes the query dictionary of Information Pursuit (IP) as learnable vectors in the CLIP semantic embedding space, and learns a task-sufficient interpretable query dictionary via an alternating optimization algorithm, substantially closing the performance gap between interpretable classifiers and black-box classifiers.

Memory-Efficient 4-bit Preconditioned Stochastic Optimization

This paper proposes a 4-bit quantization scheme based on Cholesky decomposition and error feedback, compressing the preconditioner matrices of the Shampoo optimizer to 4-bit precision. The approach substantially reduces GPU memory consumption while preserving training performance close to 32-bit Shampoo, with convergence guarantees provided for both smooth and non-smooth settings.

Zeroth-Order Fine-Tuning of LLMs in Random Subspaces

This paper proposes SubZero (random Subspace Zeroth-order), which estimates gradients in random subspaces via per-layer low-rank perturbations, significantly reducing gradient variance and angular error in zeroth-order optimization, enabling memory-efficient LLM fine-tuning at a cost close to inference.
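
To make the idea concrete, here is a minimal sketch of a two-point zeroth-order gradient estimate with a low-rank perturbation for one weight matrix. The perturbation form (orthonormal factors from a QR decomposition around a small Gaussian core), the function names, and the toy linear loss are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def low_rank_perturbation(m, n, r, rng):
    """Draw Z = U S V^T with orthonormal U (m x r), V (n x r), Gaussian S (r x r)."""
    U, _ = np.linalg.qr(rng.normal(size=(m, r)))
    V, _ = np.linalg.qr(rng.normal(size=(n, r)))
    S = rng.normal(size=(r, r))
    return U @ S @ V.T

def zo_gradient(loss, W, r, eps, rng):
    """Two-point estimate that needs only two forward evaluations of `loss`."""
    Z = low_rank_perturbation(*W.shape, r, rng)
    slope = (loss(W + eps * Z) - loss(W - eps * Z)) / (2.0 * eps)
    return slope * Z

rng = np.random.default_rng(0)
G = rng.normal(size=(6, 5))                  # true gradient of the toy loss
loss = lambda W: float(np.sum(G * W))        # linear loss, so grad = G
g_hat = zo_gradient(loss, np.zeros((6, 5)), r=2, eps=1e-3, rng=rng)
```

No backward pass is needed, so memory stays near inference cost, and the estimate has rank at most `r`.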


🔬 Interpretability (10)

AIM: Amending Inherent Interpretability via Self-Supervised Masking

This paper proposes AIM, a top-down learnable binary masking mechanism for self-supervised spatial feature selection, built upon a feature pyramid architecture. Without requiring additional annotations, AIM guides CNNs to focus on genuinely discriminative features and suppress spurious correlations, simultaneously achieving inherent interpretability and improved OOD generalization.

ArgoTweak: Towards Self-Updating HD Maps through Structured Priors

This paper proposes ArgoTweak, the first HD map dataset providing complete triplets of "prior map + current sensor data + up-to-date ground-truth map." It decomposes large-scale map modifications into element-level atomic changes via a bijective change mapping framework, and introduces interpretable evaluation metrics (mAPC/mACC). Models trained on ArgoTweak reduce the sim2real gap by more than 10× compared to synthetic-prior baselines.

CAD-Recode: Reverse Engineering CAD Code from Point Clouds

This paper proposes CAD-Recode, which translates point clouds into executable Python CadQuery code to reconstruct CAD models. By leveraging a pretrained LLM (Qwen2-1.5B) as the decoder paired with a lightweight point cloud encoder, the method achieves more than 10× reduction in Chamfer Distance on three benchmarks: DeepCAD, Fusion360, and CC3D.

CE-FAM: Concept-Based Explanation via Fusion of Activation Maps

CE-FAM is a concept explanation method that trains a branch network sharing activation maps with an image classifier to simulate VLM embeddings, establishing a one-to-one correspondence among concept prediction → concept region (weighted sum of activation maps) → concept contribution (effect on classification score). The paper also introduces a novel NRA evaluation metric and surpasses existing methods on zero-shot concept reasoning.

Granular Concept Circuits: Toward a Fine-Grained Circuit Discovery for Concept Representations

This paper proposes Granular Concept Circuit (GCC), a method that automatically discovers fine-grained visual circuits encoding specific concepts in deep visual models by iteratively evaluating inter-neuron functional dependency (Neuron Sensitivity Score) and semantic consistency (Semantic Flow Score). GCC is the first method capable of discovering multiple concept-level circuits within a single query.

Learnable Fractional Reaction-Diffusion Dynamics for Under-Display ToF Imaging and Beyond

LFRD² proposes a hybrid framework that combines learnable time-fractional reaction-diffusion equations with neural networks for under-display ToF (UD-ToF) depth map restoration. The approach captures long-range memory dependencies across iterations via fractional calculus and introduces an efficient continuous convolution operator to replace discrete convolution, achieving state-of-the-art performance on UD-ToF depth restoration, ToF denoising, and depth super-resolution tasks.

Minerva: Evaluating Complex Video Reasoning

This paper introduces Minerva — a manually annotated benchmark of 1,515 complex video reasoning QA pairs, each with 5 answer choices and a detailed reasoning trace, designed to evaluate the video reasoning capabilities of multimodal large language models. It further establishes a video reasoning error taxonomy (Temporal / Perceptual / Logical / Completeness) and the MiRA automated evaluation framework.

"Principal Components" Enable A New Language of Images

This paper proposes Semanticist, a visual tokenization framework that embeds a provable PCA structure into the latent token space—where each subsequent token contributes decreasing, non-overlapping information—and employs a diffusion decoder to decouple the semantic-spectral entanglement effect, achieving state-of-the-art performance on both image reconstruction and autoregressive generation.

SVIP: Semantically Contextualized Visual Patches for Zero-Shot Learning

This paper proposes the SVIP framework, which addresses semantic misalignment in zero-shot learning at its source by identifying and replacing semantically irrelevant image patches at the input stage with learnable embeddings initialized from attribute-level word embeddings.

VITAL: More Understandable Feature Visualization through Distribution Alignment and Relevant Information Flow

This paper proposes VITAL, a feature visualization method that reframes the problem as aligning intermediate feature distributions with those of real images (rather than conventional activation maximization), and incorporates relevance scores to filter irrelevant features, producing neuron visualizations that are more interpretable to humans.


📦 Model Compression (52)

A Good Teacher Adapts Their Knowledge for Distillation

This paper identifies the root cause of the teacher–student capacity gap in knowledge distillation as intra-class distribution mismatch in the output distributions, and proposes AID (Adapted Intra-class Distribution), a method that fine-tunes the teacher model prior to distillation to align its intra-class distribution with the student's learning capacity, achieving state-of-the-art performance across diverse architecture combinations.

ACAM-KD: Adaptive and Cooperative Attention Masking for Knowledge Distillation

This paper proposes ACAM-KD, an adaptive student-teacher cooperative attention masking framework for knowledge distillation. By employing Student-Teacher Cross-Attention Feature Fusion (STCA-FF) and Adaptive Spatial-Channel Masking (ASCM) to dynamically adjust distillation focus, ACAM-KD surpasses the state of the art by up to 1.4 mAP on COCO detection and improves mIoU by 3.09 on Cityscapes segmentation.

Achieving More with Less: Additive Prompt Tuning for Rehearsal-Free Class-Incremental Learning

This paper proposes APT (Additive Prompt Tuning), which replaces the conventional prompt concatenation paradigm with an additive operation. By introducing only two learnable vectors added to the key/value projections of the CLS token, APT achieves state-of-the-art class-incremental learning performance while substantially reducing computational overhead (41.5% reduction in GFLOPs) and trainable parameters (78.2% reduction).

ARGMatch: Adaptive Refinement Gathering for Efficient Dense Matching

This paper proposes an Adaptive Refinement Gathering pipeline comprising three modules—a content-aware offset estimator, a local consistency matching corrector, and a local consistency upsampler—augmented with an adaptive gating mechanism. The approach substantially reduces reliance on heavyweight feature extractors and global matchers, achieving performance comparable to state-of-the-art methods with a lightweight model.

B-VLLM: A Vision Large Language Model with Balanced Spatio-Temporal Tokens

This paper proposes B-VLLM, a framework that dynamically balances spatio-temporal cues within the context window constraints of VLLMs via three modules: text-conditioned adaptive frame selection, temporal frame token merging, and spatial token sampling. The approach achieves a 10% performance improvement on MVBench.

Beyond Low-Rank Tuning: Model Prior-Guided Rank Allocation for Effective Transfer in Low-Data and Large-Gap Regimes

This paper proposes SR-LoRA (Stable Rank-Guided LoRA), which leverages the stable rank of pretrained weight matrices as a natural prior to assign optimal per-layer ranks for LoRA modules. Without any search procedure, SR-LoRA achieves flexible layer-wise rank allocation and significantly outperforms fixed low-rank LoRA and other adaptive-rank methods in large-domain-gap and few-shot transfer scenarios such as medical imaging.
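
The stable rank of a matrix is its squared Frobenius norm divided by its squared spectral norm. The sketch below computes it and maps it to a LoRA rank; the rounding-and-clipping rule in `allocate_rank` and the `r_max` cap are hypothetical, since the summary does not specify how SR-LoRA converts stable rank into a per-layer rank.

```python
import numpy as np

def stable_rank(W):
    """Stable rank: squared Frobenius norm over squared spectral norm."""
    s = np.linalg.svd(W, compute_uv=False)   # singular values, descending
    return float(np.sum(s ** 2) / s[0] ** 2)

def allocate_rank(W, r_max=64):
    """Hypothetical rule: round the stable rank and clip it to [1, r_max]."""
    return int(np.clip(np.rint(stable_rank(W)), 1, r_max))
```

Because the quantity comes directly from the pretrained weights, each layer's rank can be set in one pass with no search.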

Bridging Continuous and Discrete Tokens for Autoregressive Visual Generation

This paper proposes TokenBridge, which converts continuous tokens into discrete tokens by applying post-training dimension-wise quantization to pre-trained continuous VAE features. The approach preserves the high-fidelity representation capability of continuous tokens while enabling straightforward autoregressive modeling with standard cross-entropy loss, achieving generation quality on ImageNet 256×256 comparable to continuous methods.
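
Dimension-wise quantization can be illustrated with a small sketch in which every feature channel is discretized independently after training. The uniform bins, the fixed range `[lo, hi]`, and the number of levels are assumptions made for illustration; the paper's actual binning scheme may differ.

```python
import numpy as np

def quantize_dimwise(x, levels=16, lo=-3.0, hi=3.0):
    """Map each feature dimension independently to one of `levels` bin indices."""
    x = np.clip(x, lo, hi)
    step = (hi - lo) / levels
    # Values exactly at `hi` fall into the last bin.
    return np.minimum(((x - lo) / step).astype(np.int64), levels - 1)

def dequantize_dimwise(idx, levels=16, lo=-3.0, hi=3.0):
    """Reconstruct each dimension from the centre of its bin."""
    step = (hi - lo) / levels
    return lo + (idx + 0.5) * step
```

Each token becomes a vector of small integers, one per channel, that can be predicted with cross-entropy, and the reconstruction error per channel is at most half a bin width.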

CIARD: Cyclic Iterative Adversarial Robustness Distillation

This paper proposes CIARD, which resolves the conflict between the optimization objectives of the clean teacher and the robust teacher in dual-teacher ARD frameworks via a Contrastive Push Loss, and introduces an Iterative Teacher Training (ITT) strategy that continuously updates the robust teacher to prevent performance degradation. CIARD simultaneously improves adversarial robustness by 3.53% and clean accuracy by 5.87% on CIFAR-10/100 and Tiny-ImageNet.

Color Matching Using Hypernetwork-Based Kolmogorov-Arnold Networks (cmKAN)

This paper proposes cmKAN, a hypernetwork-driven Kolmogorov-Arnold Network for color matching. A generator predicts spatially varying KAN spline parameters, supporting three scenarios (supervised / unsupervised / pairwise optimization) and three tasks (raw-to-raw / raw-to-sRGB / sRGB-to-sRGB). cmKAN outperforms existing methods by an average of 37.3% across all tasks while remaining extremely lightweight (76.4K parameters).

Colors See Colors Ignore: Clothes Changing ReID with Color Disentanglement

This paper proposes CSCI, a method that introduces a Color token to learn color representations (Color See) and employs a novel S2A self-attention mechanism to disentangle color information from ReID features (Color Ignore), effectively eliminating appearance bias in clothes-changing person re-identification without requiring any external annotations.

Browse all 52 Model Compression papers →


🏥 Medical Imaging (31)

AcZeroTS: Active Learning for Zero-shot Tissue Segmentation in Pathology Images

This work proposes AcZeroTS, a framework that integrates active learning with a VLM-based prototype-guided zero-shot segmentation model (ProZS). By simultaneously accounting for uncertainty, diversity, and the ability of selected samples to improve prototype coverage over unseen classes, the framework selects the most informative samples for annotation, achieving high-quality segmentation of both seen and unseen tissue types under minimal annotation budgets.

Alleviating Textual Reliance in Medical Language-guided Segmentation via Prototype-driven Semantic Approximation

This paper proposes ProLearn, a framework that introduces a Prototype-driven Semantic Approximation (PSA) module to alleviate textual reliance in medical language-guided segmentation. The prototype space is initialized from a small number of image-text pairs; thereafter, neither training nor inference requires text input. ProLearn maintains strong performance with only 1% text availability (QaTa-COV19 Dice = 0.857), while using 1000× fewer parameters and running inference 100× faster than LLM-based solutions.

An OpenMind for 3D Medical Vision Self-supervised Learning

This work releases OpenMind, the largest publicly available 3D medical imaging pretraining dataset (114k brain MRI volumes), and conducts a systematic benchmark of existing 3D SSL methods on this dataset using state-of-the-art CNN (ResEnc-L) and Transformer (Primus-M) architectures, establishing the current SOTA for 3D medical image SSL.

Beyond Brain Decoding: Visual-Semantic Reconstructions to Mental Creation Extension Based on fMRI

This paper proposes NeuroCreat — a multimodal brain architecture that integrates the visual and textual capabilities of LLMs — extending fMRI decoding from single-task visual stimulus reconstruction to three levels: image reconstruction + text captioning + mental creation. A Prompt Variant Alignment (PVA) module is introduced to effectively bridge the gap between low-resolution fMRI signals and high-level semantic representations.

Boosting Vision Semantic Density with Anatomy Normality Modeling for Medical Vision-language Pre-training

This paper proposes ViSD-Boost, which addresses the alignment bias caused by low visual semantic density in medical vision-language pre-training (VLP). The method employs disease-level visual contrastive learning to enhance visual semantics and VQ-VAE-based anatomical normality modeling to amplify abnormality signals, achieving 84.9% AUC in zero-shot diagnosis across 54 diseases spanning 15 organs.

COIN: Confidence Score-Guided Distillation for Annotation-Free Cell Segmentation

This paper proposes COIN, a three-stage framework that addresses the critical "error-free instance absence" problem in annotation-free cell instance segmentation. The framework combines unsupervised semantic segmentation with optimal transport for pixel-level cell propagation, model–SAM consistency for instance-level confidence scoring, and confidence-guided recursive self-distillation, achieving performance on MoNuSeg and TNBC that surpasses semi-supervised and weakly supervised methods.

Controllable Latent Space Augmentation for Digital Pathology

This paper proposes HistAug — a lightweight Transformer-based latent space augmentation model that simulates realistic image transformations (hue shifts, erosion, etc.) in feature space via conditional cross-attention, providing controllable and computationally efficient data augmentation for pathology MIL training at minimal overhead.

Coordinate-based Speed of Sound Recovery for Aberration-Corrected Photoacoustic Computed Tomography

This paper proposes an efficient self-supervised joint reconstruction method that parameterizes the speed of sound (SOS) as either a pixel grid or a neural field, recovering SOS and high-quality photoacoustic images by backpropagating gradients through a differentiable imaging forward model. The method surpasses the current state of the art in accuracy while achieving a 35× speedup (40 seconds vs. 23 minutes).

CuMPerLay: Learning Cubical Multiparameter Persistence Vectorizations

This paper proposes CuMPerLay, a differentiable Cubical Multiparameter Persistence (CMP) vectorization layer that decomposes CMP into multiple learnable single-parameter persistence lines. By jointly learning bifiltration functions for end-to-end training and embedding the layer into Swin Transformer, the method achieves significant improvements on medical image classification and semantic segmentation tasks, particularly in data-scarce settings.

GDKVM: Echocardiography Video Segmentation via Spatiotemporal Key-Value Memory with Gated Delta Rule

This paper proposes GDKVM, an echocardiography video segmentation architecture based on linear key-value association and the gated delta rule, achieving state-of-the-art performance on CAMUS and EchoNet-Dynamic through efficient memory management and multi-scale feature fusion while maintaining real-time inference speed.

Browse all 31 Medical Imaging papers →


🛡️ AI Safety (24)

A Framework for Double-Blind Federated Adaptation of Foundation Models

BlindFed proposes a double-blind federated foundation model adaptation framework combining FHE-friendly architectural redesign (polynomial approximation of nonlinear operations), a two-stage split learning protocol (offline knowledge distillation + online encrypted inference), and privacy enhancements (sample permutation + random block sampling), achieving adaptation accuracy close to LoRA under the constraint that the data owner cannot observe the model and the model owner cannot observe the data.

Active Membership Inference Test (aMINT): Enhancing Model Auditability with Multi-Task Learning

This paper proposes Active MINT (aMINT), a multi-task learning framework that jointly trains a MINT model alongside the audited model during training, enabling detection of whether specific data was used for training with over 80% accuracy — significantly outperforming existing passive MINT and membership inference attack methods.

Ask and Remember: A Questions-Only Replay Strategy for Continual Visual Question Answering

This paper proposes QUAD, a continual VQA method that stores only the questions from past tasks, not the images. Question replay combined with attention consistency distillation, which preserves intra- and inter-modal attention patterns across tasks, lets QUAD preserve privacy while outperforming methods that store full image–question–answer triplets.

Backdoor Attacks on Neural Networks via One-Bit Flip

This paper proposes SOLEFLIP, the first inference-time backdoor attack on quantized models that requires flipping only a single bit. Through an efficient algorithm for identifying exploitable weights and bit positions, along with a corresponding trigger generation procedure, SOLEFLIP achieves an average attack success rate of 98.9% with zero degradation in clean accuracy across CIFAR-10, SVHN, and ImageNet.

Backdoor Mitigation by Distance-Driven Detoxification

This paper proposes Distance-Driven Detoxification (D3), which reformulates backdoor defense as a constrained optimization problem — maximizing the distance between the fine-tuned model weights and the poisoned initial weights, subject to a constraint that the clean sample loss does not exceed a threshold. This allows the model to effectively escape the "backdoor region," achieving best or second-best defense performance across 7 state-of-the-art attacks.
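
The constrained formulation described above can be written compactly as follows. The symbols are assumptions introduced here for illustration: \(\theta_0\) denotes the poisoned initial weights, \(d(\cdot,\cdot)\) a weight-space distance, \(\mathcal{L}_{\text{clean}}\) the loss on clean samples, and \(\tau\) the loss threshold.

```latex
\max_{\theta}\; d(\theta, \theta_0)
\quad \text{subject to} \quad
\mathcal{L}_{\text{clean}}(\theta) \le \tau
```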

Backdooring Self-Supervised Contrastive Learning by Noisy Alignment

This paper proposes Noisy Alignment (NA), a method that enhances backdoor attacks against self-supervised contrastive learning by explicitly suppressing noise components in poisoned images. The attack is formulated as a 2D image layout optimization problem, and theoretically optimal layout parameters are derived. NA achieves up to 45.9% improvement in ASR on ImageNet-100.

Client2Vec: Improving Federated Learning by Distribution Shifts Aware Client Indexing

This paper proposes the Client2Vec mechanism, which leverages a CLIP encoder and a Distribution Shifts Aware Index Generation Network (DSA-IGN) to generate, prior to federated training, an index vector for each client that encodes both label and feature distribution information. The resulting indices are then used to improve three key stages of FL: client sampling, model aggregation, and local training.

Controllable Feature Whitening for Hyperparameter-Free Bias Mitigation

This paper proposes the Controllable Feature Whitening (CFW) framework, which eliminates linear correlations between target features and bias features via whitening transformations to mitigate model bias. The approach requires neither adversarial training nor additional regularization hyperparameters, and supports smooth interpolation between demographic parity and equalized odds through a single weighting coefficient.

Failure Cases Are Better Learned But Boundary Says Sorry: Facilitating Smooth Perception Change for Accuracy-Robustness Trade-Off in Adversarial Training

This paper reveals a counterintuitive phenomenon in adversarial training — the model's perceptual change on failure cases is actually smaller than on success cases (i.e., failure cases are "over-learned") — and proposes Robust Perception Adversarial Training (RPAT), which encourages perceptions to change smoothly with perturbations to alleviate the accuracy-robustness trade-off.

Browse all 24 AI Safety papers →


📂 Others (33)

A Hyperdimensional One Place Signature to Represent Them All: Stackable Descriptors For Visual Place Recognition

This paper proposes HOPS (Hyperdimensional One Place Signatures), a framework leveraging hyperdimensional computing (HDC) to fuse multiple reference descriptors of the same place captured under varying environmental conditions into a unified representation, substantially improving the robustness and recall of Visual Place Recognition (VPR) without increasing computational or memory overhead.

A Linear N-Point Solver for Structure and Motion from Asynchronous Tracks

This paper proposes a unified linear N-point solver that recovers camera linear velocity and 3D point structure from 2D point correspondences with arbitrary timestamps, supporting global shutter, rolling shutter, and event camera sensor modalities.

AdaptiveAE: An Adaptive Exposure Strategy for HDR Capturing in Dynamic Scenes

This paper proposes AdaptiveAE, which formulates HDR bracketed exposure capture as a Markov Decision Process (MDP) using deep reinforcement learning, jointly optimizing ISO and shutter speed combinations to adaptively select optimal exposure parameters for dynamic scenes within a user-defined time budget. The method achieves PSNR 39.70 on the HDRV dataset, outperforming the previous best method Hasinoff et al. (37.59) by 2.1 dB.

Adversarial Data Augmentation for Single Domain Generalization via Lyapunov Exponents

This paper proposes LEAwareSGD, an optimizer that dynamically adjusts the learning rate using Lyapunov exponents (LE) to guide model training toward the edge of chaos, enabling broader exploration of the parameter space within an adversarial data augmentation framework and achieving significant improvements in single domain generalization (SDG).

Auto-Regressively Generating Multi-View Consistent Images (MV-AR)

This paper is the first to introduce autoregressive (AR) models into multi-view image generation. By generating views sequentially, the model leverages all preceding views to enhance consistency across distant viewpoints. It further proposes a unified multimodal condition injection architecture and a Shuffle Views data augmentation strategy, enabling a single model to handle text, image, and geometry conditions simultaneously.

C4D: 4D Made from 3D through Dual Correspondences

This paper proposes C4D, a framework that upgrades existing 3D reconstruction paradigms to full 4D reconstruction by jointly capturing dual temporal correspondences — short-term optical flow and dynamic-aware long-term point tracking (DynPT) — on top of DUSt3R's 3D pointmap predictions. Motion masks are generated to separate static and dynamic regions. Three optimization objectives are introduced: camera motion alignment, camera trajectory smoothing, and point trajectory smoothing. The resulting system produces per-frame point clouds, camera parameters, and 2D/3D trajectories, achieving competitive performance across depth estimation, pose estimation, and point tracking tasks.

Doodle Your Keypoints: Sketch-Based Few-Shot Keypoint Detection

This paper proposes the first sketch-based cross-modal few-shot keypoint detection framework. By leveraging a prototype network, grid-based locator, prototype domain adaptation, and a de-stylization network, the framework detects novel keypoints on unseen categories in real photographs using only a handful of annotated sketches.

EDFFDNet: Towards Accurate and Efficient Unsupervised Multi-Grid Image Registration

This paper proposes EDFFDNet, which replaces conventional B-spline FFD and TPS with an Exponentially Decaying Free-Form Deformation (EDFFD) model for image registration. Combined with an Adaptive Sparse Motion Aggregator (ASMA) and a progressive correlation strategy, the method achieves a +0.5 dB PSNR improvement on the UDIS-D dataset while reducing parameter count by 70.5% and GPU memory usage by 32.6%.

FixTalk: Taming Identity Leakage for High-Quality Talking Head Generation in Extreme Cases

FixTalk addresses identity leakage in GAN-based talking head generation with two lightweight plug-and-play modules: the Enhanced Motion Indicator (EMI) and the Enhanced Detail Indicator (EDI). EMI removes identity information from motion features to suppress identity leakage, while EDI repurposes the leaked identity information to compensate for missing details under extreme poses, thereby removing rendering artifacts.

HyTIP: Hybrid Temporal Information Propagation for Masked Conditional Residual Video Coding

This paper proposes HyTIP, a framework that unifies output-recurrence (explicit buffering of decoded frames) and hidden-to-hidden propagation (implicit buffering of latent features) within a single learned video coding framework, achieving comparable coding performance to state-of-the-art methods using only 14% of their buffer size.

Browse all 33 Others papers →


🗂 More Areas (43)


💡 LLM Reasoning (3)

Corvid: Improving Multimodal Large Language Models Towards Chain-of-Thought Reasoning

This paper proposes Corvid, which comprehensively enhances the chain-of-thought reasoning capability of MLLMs through a hybrid visual encoder, a GateMixer connector, a high-quality CoT dataset, and a test-time self-verification strategy, surpassing open-source models of comparable parameter scale on mathematical reasoning and scientific problem solving.

Unsupervised Visual Chain-of-Thought Reasoning via Preference Optimization

This paper proposes UV-CoT, a framework that enables image-level chain-of-thought (Visual CoT) reasoning without any manual bounding box annotations, by automatically constructing preference data and introducing an improved Score-DPO loss. UV-CoT surpasses the supervised Visual-CoT method on 6 benchmarks.

Video-T1: Test-Time Scaling for Video Generation

This paper transfers the test-time scaling (TTS) paradigm from LLMs to video generation by reformulating TTS as a search problem over trajectories from Gaussian noise space to the target video distribution. It proposes the Tree-of-Frames (ToF) search algorithm for efficient inference-time compute scaling, achieving consistent quality improvements across diverse video generation models on VBench.


🦾 LLM Agent (4)

Embodied Image Captioning: Self-supervised Learning Agents for Spatially Coherent Image Descriptions

This paper proposes a three-stage self-supervised framework that significantly improves the consistency and accuracy of descriptions of the same object across views in indoor environments. The three stages are agent-driven multi-view observation collection, LLM consensus-based pseudo-label generation, and contrastive fine-tuning of the captioner.

GTR: Guided Thought Reinforcement Prevents Thought Collapse in RL-based VLM Agent Training

This paper identifies the "thought collapse" phenomenon in RL-based VLM Agent training — where CoT reasoning rapidly degenerates into state-agnostic, templated thoughts that lead to ineffective actions — and proposes the GTR framework, which combines a VLM corrector for automatic thought correction (SFT) with PPO-based action optimization in a dual-objective training scheme, achieving 3–5× success rate improvements on the 24-Point Game and ALFWorld.

Less is More: Empowering GUI Agent with Context-Aware Simplification

This paper proposes SimpAgent — a context-aware simplification framework that achieves SOTA on multiple GUI navigation benchmarks while reducing FLOPs by 27%, via masking-based element pruning (randomly masking irrelevant element regions during training) and consistency-guided history compression (directly dropping historical visual tokens at intermediate LLM layers with a KL divergence consistency constraint).

UIPro: Unleashing Superior Interaction Capability for GUI Agents

UIPro achieves state-of-the-art GUI interaction performance across mobile, web, and desktop platforms by constructing 20.6M GUI understanding samples for pre-training and introducing a unified action space that integrates heterogeneous GUI agent task data.


⚖️ Alignment & RLHF (2)

Heuristic-Induced Multimodal Risk Distribution Jailbreak Attack for Multimodal Large Language Models

This paper proposes HIMRD, a black-box multimodal jailbreak attack method that bypasses unimodal safety mechanisms by distributing malicious semantics across multiple modalities. A heuristic search strategy is employed to identify optimal understanding-enhancing prompts and inducing prompts, achieving average attack success rates of approximately 90% and 68% on open-source and closed-source multimodal large language models, respectively.

MagicID: Hybrid Preference Optimization for ID-Consistent and Dynamic-Preserved Video Customization

This paper proposes MagicID, a framework that constructs hybrid video pair data capturing identity and dynamic preferences, and designs a two-stage Hybrid Preference Optimization (HPO) training strategy. MagicID is the first work to apply DPO to identity-customized video generation, simultaneously addressing identity degradation and motion weakening caused by conventional self-reconstruction training.


👻 Hallucination Detection (4)

ChartCap: Mitigating Hallucination of Dense Chart Captioning

This work constructs ChartCap, a large-scale dataset of 565K real chart–caption pairs. By adopting type-specific caption schemas that exclude irrelevant information while emphasizing structure and key insights, and by introducing a reference-free Visual Consistency Score (VCS) evaluation metric, the paper effectively mitigates hallucination in VLM-based chart captioning.

DASH: Detection and Assessment of Systematic Hallucinations of VLMs

This paper proposes DASH, a fully automated pipeline that systematically discovers false-positive object hallucination clusters in VLMs via two complementary strategies: LLM-based text query generation (DASH-LLM) and diffusion model optimization-based image query generation (DASH-OPT). Applied to ReLAION-5B, DASH uncovers 19k+ clusters and 950k+ images, and constructs the more challenging DASH-B benchmark.

Mitigating Object Hallucinations via Sentence-Level Early Intervention

This paper proposes SENTINEL, a framework grounded in the key observation that hallucinations emerge early in generation and propagate forward. By combining in-domain candidate bootstrapping with dual-detector cross-validation to construct sentence-level preference data, and employing Context-aware DPO (C-DPO) for early intervention, SENTINEL reduces hallucinations on Object HalBench by 92% while preserving general capabilities.

ONLY: One-Layer Intervention Sufficiently Mitigates Hallucinations in Large Vision-Language Models

This paper proposes ONLY, a training-free single-layer intervention decoding method. It selects text-biased attention heads via the Text-to-Visual Entropy Ratio (TVER) to generate textually-enhanced logits, which are then used in adaptive contrastive or collaborative decoding against the original logits. With only 1.07× inference overhead, ONLY outperforms VCD/M3ID by 3.14% on POPE and reduces CHAIR_S by 6.2 points on CHAIR.
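
As a rough illustration of the contrastive branch, the sketch below applies the generic contrastive-decoding rule used by methods such as VCD: amplify the original logits and subtract a second set of logits. It assumes this generic form rather than ONLY's exact formulation. TVER-based head selection, the adaptive switch, and the collaborative branch are not shown, and `alpha` and the example logits are hypothetical.

```python
def contrastive_logits(original, text_enhanced, alpha=1.0):
    """Generic contrastive decoding: (1 + alpha) * original - alpha * contrast.

    Tokens that the text-biased pass favors more than the original pass
    are pushed down. alpha is an assumed hyperparameter.
    """
    return [(1 + alpha) * o - alpha * t for o, t in zip(original, text_enhanced)]

original = [2.0, 1.0, 0.5]        # hypothetical logits from the unmodified model
text_enhanced = [2.0, 1.5, 0.0]   # hypothetical logits from the text-biased pass
print(contrastive_logits(original, text_enhanced))
```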


⚡ LLM Efficiency (1)

MixANT: Observation-dependent Memory Propagation for Stochastic Dense Action Anticipation

This paper proposes MixANT, which introduces input-dependence into the forgetting gate (A matrix) of Mamba via a Mixture-of-Experts approach. A lightweight router dynamically selects context-aware A matrices to control temporal memory propagation, achieving state-of-the-art performance across all three dense action anticipation benchmarks: 50Salads, Breakfast, and Assembly101.
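
The idea of an input-dependent forgetting gate can be sketched with a scalar toy recurrence. Everything here is an assumption made for illustration: the state is a single number, each expert is a single decay factor standing in for an A matrix, and the router mixes experts softly with a softmax, whereas the paper may select them differently. It is not Mamba's discretized state-space update.

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    total = sum(es)
    return [e / total for e in es]

def mixed_decay(x_t, router_w, expert_decays):
    """Blend per-expert decay factors using router weights computed from the input."""
    gate = softmax([w * x_t for w in router_w])
    return sum(g * a for g, a in zip(gate, expert_decays))

def run_recurrence(xs, router_w, expert_decays, b=1.0):
    """h_t = a_t * h_{t-1} + b * x_t, where a_t depends on the current input."""
    h, states = 0.0, []
    for x_t in xs:
        a_t = mixed_decay(x_t, router_w, expert_decays)
        h = a_t * h + b * x_t
        states.append(h)
    return states

# One fast-forgetting expert (0.1) and one slow-forgetting expert (0.9).
print(run_recurrence([1.0, -1.0, 2.0], router_w=[1.0, -1.0], expert_decays=[0.1, 0.9]))
```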


📖 NLP Understanding (1)

Balancing Task-Invariant Interaction and Task-Specific Adaptation for Unified Image Fusion

TITA proposes a unified image fusion framework that requires no task identifier at inference. It employs an Interaction-enhanced Pixel Attention (IPA) module to explore task-invariant complementary information extraction, an Operation-based Adaptive Fusion (OAF) module to dynamically adapt to task-specific requirements, and the FAMO strategy to mitigate multi-task gradient conflicts.


✍️ Text Generation (1)

Beyond Isolated Words: Diffusion Brush for Handwritten Text-Line Generation

This paper proposes DiffBrush, the first diffusion-based method for handwritten text-line generation. Through content-decoupled style learning (column/row masking) and a multi-scale content discriminator (line/word level), DiffBrush substantially outperforms existing methods in both style imitation and content accuracy.


🌐 Multilingual & Translation (1)

SignRep: Enhancing Self-Supervised Sign Representations

This paper proposes SignRep, a scalable self-supervised sign language representation learning framework that incorporates sign-specific skeleton priors, feature regularization, and an adversarial style-invariant loss into Masked Autoencoder pretraining. Using only a single RGB modality, SignRep surpasses complex multi-modal and multi-branch methods, achieving state-of-the-art performance on three tasks: sign language recognition, dictionary retrieval, and sign language translation.


🔍 Information Retrieval & RAG (5)

Aligning Information Capacity Between Vision and Language via Dense-to-Sparse Feature Distillation

This paper proposes D2S-VSE, a two-stage training framework (dense-text pretraining + dense-to-sparse feature distillation fine-tuning) that enhances information capacity in visual-semantic embeddings, addressing the core asymmetry in information density between image and text modalities for image-text matching.

External Knowledge Injection for CLIP-Based Class-Incremental Learning

This paper proposes Engine (ExterNal knowledGe INjEction), a framework that employs dual-branch injection tuning (visual branch via data augmentation; text branch via GPT-4-generated discriminative descriptions) and post-tuning knowledge injection at inference (pairwise discriminative feature re-ranking), achieving 3–10% improvements over all CLIP-based class-incremental learning methods across 9 benchmark datasets without storing any historical samples.

LangBridge: Interpreting Image as a Combination of Language Embeddings

LangBridge achieves interpretable vision-language alignment by explicitly decomposing visual features into linear combinations of LLM vocabulary embeddings, and supports pretraining-free adapter transfer across different LLMs.
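
A minimal sketch of the decomposition idea: a visual feature is mapped to weights over a vocabulary, and the resulting embedding is the weighted sum of the vocabulary embeddings. The softmax weighting, the tiny two-token vocabulary, and the function name are assumptions for illustration. The note does not specify how LangBridge computes the coefficients.

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    total = sum(es)
    return [e / total for e in es]

def as_vocab_combination(scores, vocab_embeddings):
    """Return (weights, embedding), where embedding = sum_i weights[i] * vocab_embeddings[i]."""
    weights = softmax(scores)
    dim = len(vocab_embeddings[0])
    embedding = [
        sum(w * emb[d] for w, emb in zip(weights, vocab_embeddings))
        for d in range(dim)
    ]
    return weights, embedding

# Hypothetical vocabulary of two tokens with 2-D embeddings.
vocab = [[1.0, 0.0], [0.0, 1.0]]
weights, embedding = as_vocab_combination([0.0, 0.0], vocab)
print(weights, embedding)
```

Because the weights index real vocabulary entries, the largest ones can be read off as the tokens a visual feature most resembles, which is where the interpretability comes from.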

MonSTeR: a Unified Model for Motion, Scene, Text Retrieval

This paper proposes MonSTeR—the first tri-modal retrieval model for motion, scene, and text—which constructs a unified latent space via higher-order relationship modeling inspired by topological deep learning. By capturing intrinsic dependencies among all three modalities, MonSTeR substantially outperforms baselines that rely solely on unimodal representations across multiple retrieval tasks, and can further serve as an evaluation tool for human-scene interaction models.

OCR Hinders RAG: Evaluating the Cascading Impact of OCR on Retrieval-Augmented Generation

This paper presents OHRBench—the first benchmark for evaluating the cascading impact of OCR on RAG systems. It comprises 8,561 document images across 7 domains and 8,498 QA pairs, and systematically reveals the distinct impact patterns of OCR-induced Semantic Noise and Formatting Noise on both the retrieval and generation stages.


💻 Code Intelligence (1)

TikZero: Zero-Shot Text-Guided Graphics Program Synthesis

This paper proposes TikZero, which decouples graphics program generation from text understanding by using image representations as an intermediate bridge, enabling zero-shot text-guided TikZ graphics program synthesis without text-aligned training data. TikZero substantially outperforms baseline methods, and its end-to-end fine-tuned variant TikZero+ matches or surpasses large commercial models such as GPT-4o.


🔗 Causal Inference (2)

A Visual Leap in CLIP Compositionality Reasoning through Generation of Counterfactual Sets

This paper proposes a block-based diffusion method leveraging LLMs and diffusion models to automatically generate high-quality counterfactual image-text pair datasets, accompanied by a set-aware loss function. Without manual annotation, the approach significantly improves CLIP's compositional reasoning ability, surpassing state-of-the-art methods on ARO/VL-Checklist and other benchmarks with substantially less data.

Social Debiasing for Fair Multi-modal LLMs

This paper constructs CMSC, a large-scale counterfactual dataset spanning 18 social concepts, and proposes the Anti-Stereotype Debiasing (ASD) strategy—comprising bias-aware data resampling and a Social Fairness Loss—that effectively reduces social bias across four MLLM architectures with negligible degradation of general multimodal capability.


🕸️ Graph Learning (1)

PASTA: Part-Aware Sketch-to-3D Shape Generation with Text-Aligned Prior

This paper proposes the PASTA framework, which integrates VLM-derived text priors to compensate for semantic deficiencies in sketches, and introduces ISG-Net (a dual graph convolutional network comprising IndivGCN and PartGCN) to model inter-part structural relationships, achieving state-of-the-art sketch-to-3D shape generation and part-level editing.


📈 Time Series (4)

I²-World: Intra-Inter Tokenization for Efficient Dynamic 4D Scene Forecasting

This paper proposes I²-World, which decouples 3D scene tokenization into two complementary processes — intra-scene multi-scale residual quantization and inter-scene temporal quantization — thereby retaining the high compression ratio of 3D tokenizers while incorporating the temporal modeling capability of 4D tokenizers, enabling efficient and high-quality 4D occupancy forecasting.
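
Residual quantization in general can be sketched as below: each stage quantizes whatever the previous stages failed to capture. This is a scalar toy with hand-picked codebooks, assumed purely for illustration. The paper's intra-scene tokenizer operates on multi-scale 3D scene features, and its temporal (inter-scene) part is not shown.

```python
def nearest(value, codebook):
    """Index of the codebook entry closest to value."""
    return min(range(len(codebook)), key=lambda i: abs(codebook[i] - value))

def residual_quantize(value, codebooks):
    """Quantize value stage by stage; each stage encodes the remaining residual."""
    indices, approx = [], 0.0
    for codebook in codebooks:
        idx = nearest(value - approx, codebook)
        indices.append(idx)
        approx += codebook[idx]
    return indices, approx

# Coarse-to-fine codebooks (hypothetical values).
codebooks = [[0.0, 0.5, 1.0], [-0.2, 0.0, 0.2], [-0.05, 0.0, 0.05]]
print(residual_quantize(0.73, codebooks))
```

Each added stage shrinks the reconstruction error while the code stays a short list of small indices, which is how this family of tokenizers keeps a high compression ratio.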

V2XPnP: Vehicle-to-Everything Spatio-Temporal Fusion for Multi-Agent Perception and Prediction

This paper proposes V2XPnP, a V2X spatio-temporal fusion framework built upon a unified Transformer architecture, which achieves multi-agent end-to-end perception and prediction under a one-step communication strategy. The work also introduces the first large-scale real-world sequential dataset supporting all V2X collaboration modes, achieving state-of-the-art performance on both perception and prediction tasks.

VA-MoE: Variables-Adaptive Mixture of Experts for Incremental Weather Forecasting

This paper proposes a novel incremental weather forecasting paradigm and the VA-MoE framework. Through a variables-adaptive MoE architecture and index embedding mechanism, VA-MoE achieves forecasting accuracy comparable to full training with only 25% trainable parameters and 50% of the initial training data.

VLRMBench: A Comprehensive and Challenging Benchmark for Vision-Language Reward Models

This paper proposes VLRMBench, a comprehensive and challenging benchmark for vision-language reward models (VLRMs) comprising 12,634 questions across 12 tasks, covering three dimensions: process understanding, outcome judgment, and criticism generation. Extensive experiments on 26 models reveal significant deficiencies in current VLRMs.


🧬 Computational Biology (4)

CryoFastAR: Fast Cryo-EM Ab initio Reconstruction Made Easy

CryoFastAR is the first work to introduce the DUSt3R-style geometric foundation model paradigm into cryo-EM. It performs feedforward pose prediction from large sets of noisy particle images using a ViT encoder with a cross-view attention decoder, without iterative optimization, enabling ab initio protein 3D reconstruction 10–33× faster than traditional methods.

G2PDiffusion: Cross-Species Genotype-to-Phenotype Prediction via Evolutionary Diffusion

This paper proposes G2PDiffusion, the first diffusion model-based cross-species genotype-to-phenotype prediction framework, which generates morphological images conditioned on evolutionary signals (multiple sequence alignments (MSA) and environmental context) to predict species appearance from DNA sequences.

Integrating Biological Knowledge for Robust Microscopy Image Profiling on De Novo Cell Lines

This paper proposes integrating external biological knowledge — protein–protein interaction graphs and transcriptomic features from single-cell foundation models — into microscopy image pretraining, explicitly decoupling perturbation-specific and cell-line-specific representations to improve generalization of perturbation screening on unseen (de novo) cell lines.

MolParser: End-to-end Visual Recognition of Molecule Structures in the Wild

This paper proposes MolParser, an end-to-end Optical Chemical Structure Recognition (OCSR) method that handles Markush structures via an extended SMILES representation (E-SMILES), constructs a large-scale training set MolParser-7M with 7 million samples, and incorporates real-world literature data through active learning. MolParser achieves 76.9% accuracy on the WildMol benchmark, significantly outperforming existing methods.


⚛️ Physics & Scientific Computing (2)

JPEG Processing Neural Operator for Backward-Compatible Coding

This paper proposes JPNeO, a next-generation codec that is fully backward-compatible with the JPEG format. By introducing neural operators at both the encoding stage (JENO) and decoding stage (JDNO), along with a trainable quantization matrix, JPNeO significantly improves JPEG reconstruction quality—particularly for chroma components—while maintaining low memory footprint and parameter count.

ResQ: A Novel Framework to Implement Residual Neural Networks on Analog Rydberg Atom Quantum Computers

This paper proposes ResQ, the first framework to natively implement residual neural networks (ResNets) on analog Rydberg atom quantum computers. It exploits continuous-time Hamiltonian evolution and encodes input features and trainable parameters via piecewise parameterized laser pulses, achieving an average 50% improvement over classical models of equivalent scale on MNIST, FashionMNIST, and medical dataset classification tasks.


📡 Signal & Communications (3)

Boosting Multimodal Learning via Disentangled Gradient Learning

This paper reveals an optimization conflict between modality encoders and fusion modules in multimodal learning — the fusion module suppresses gradients propagated back to individual modality encoders, causing even the dominant modality to underperform its unimodal counterpart. The paper proposes the Disentangled Gradient Learning (DGL) framework, which addresses this issue by cutting the gradient path from the fusion module to the encoders and replacing it with independent unimodal losses.

Generalizable Non-Line-of-Sight Imaging with Learnable Physical Priors

This paper proposes two modules — Learnable Path Compensation (LPC) and Adaptive Phasor Field (APF) — to address material-dependent radiance intensity falloff and frequency-domain denoising under varying SNR conditions in NLOS imaging, respectively. Trained solely on synthetic data, the method achieves state-of-the-art generalization across multiple real-world datasets.

Rectifying Magnitude Neglect in Linear Attention

This paper identifies that Linear Attention completely discards Query magnitude information, causing a significant deviation of attention score distributions from Softmax Attention. It proposes Magnitude-Aware Linear Attention (MALA), which restores magnitude awareness by introducing a scaling factor \(\beta\) and an offset term \(\gamma\), achieving comprehensive improvements over existing methods across classification, detection, segmentation, NLP, speech recognition, and image generation tasks.
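
The magnitude-neglect observation can be checked with a small numerical sketch. It assumes a ReLU feature map for the linear-attention kernel, and any positively homogeneous feature map behaves the same way: scaling the query scales numerator and denominator equally, so the normalized weights do not change, while softmax weights sharpen. The vectors are hypothetical, and MALA's \(\beta\) and \(\gamma\) terms are not implemented here.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def relu(v):
    return [max(x, 0.0) for x in v]

def softmax_weights(q, keys):
    scores = [dot(q, k) for k in keys]
    m = max(scores)
    es = [math.exp(s - m) for s in scores]
    total = sum(es)
    return [e / total for e in es]

def linear_weights(q, keys):
    """Normalized linear-attention weights with an assumed ReLU feature map."""
    fq = relu(q)
    scores = [dot(fq, relu(k)) for k in keys]
    total = sum(scores)
    return [s / total for s in scores]

q = [1.0, 0.5]
q_scaled = [4.0 * x for x in q]   # same direction, four times the magnitude
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

print(linear_weights(q, keys), linear_weights(q_scaled, keys))    # unchanged
print(softmax_weights(q, keys), softmax_weights(q_scaled, keys))  # sharper
```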


👥 Social Computing (4)

Gradient Extrapolation for Debiased Representation Learning

This paper proposes GERNE, a method that constructs two batches with different degrees of spurious correlation and performs linear extrapolation on their gradients to guide the model toward learning debiased representations, outperforming state-of-the-art methods under both known and unknown attribute settings.
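
Linear extrapolation on two gradients can be sketched as follows. The direction of the extrapolation (moving past the less biased gradient, away from the more biased one), the `factor` parameter, and the example vectors are assumptions for illustration. The note does not give GERNE's exact update rule or how the two batches are sampled.

```python
def extrapolate_gradient(grad_biased, grad_less_biased, factor=1.0):
    """Step past the less biased gradient along (less_biased - biased).

    factor = 0 returns the less biased gradient unchanged. Larger values
    push further away from the direction favored by the biased batch.
    """
    return [
        g_lb + factor * (g_lb - g_b)
        for g_b, g_lb in zip(grad_biased, grad_less_biased)
    ]

# Hypothetical per-parameter gradients from the two batches.
print(extrapolate_gradient([1.0, 2.0], [0.5, 2.5], factor=1.0))
```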

Learning Visual Proxy for Compositional Zero-Shot Learning

This paper proposes the concept of Visual Proxy — text-guided visual class centers introduced into CZSL for the first time — and jointly optimizes textual prototypes and visual proxies via Cross-Modal Joint Learning (CMJL), achieving closed-world SOTA on four CZSL benchmarks.

No More Sibling Rivalry: Debiasing Human-Object Interaction Detection

This paper identifies and systematically analyzes the "Toxic Siblings Bias" in HOI detection—highly similar HOI triplets that mutually interfere and compete at both the input and output levels. Two debiasing learning objectives are proposed: Contrastive-then-Calibration (C2C) and Merge-then-Split (M2S), achieving +9.18% mAP over the baseline and +3.59% over the previous state-of-the-art on HICO-DET.

PropVG: End-to-End Proposal-Driven Visual Grounding with Multi-Granularity Discrimination

This paper proposes PropVG, the first end-to-end proposal-based visual grounding framework that eliminates the need for pretrained detectors. It decomposes visual grounding into two stages — foreground proposal generation and contrastive learning-based referring scoring — and introduces a Multi-granularity Target Discrimination (MTD) module that integrates object-level and semantic-level information to determine target existence. PropVG achieves state-of-the-art performance on 10 datasets while running 4× faster than traditional proposal-based methods.