📷 CVPR2026 Accepted Papers

3703 CVPR2026 paper notes covering 3D Vision (646), Image Generation (434), Multimodal VLM (388), Video Understanding (178), Medical Imaging (163), Video Generation (152), VLM Reasoning (144), AI Safety (143), and 49 other areas. Each note has a TL;DR, motivation, method, experiments, highlights, and limitations: a 5-minute read of the core ideas.


💡 LLM Reasoning (16)

Agile Deliberation: Concept Deliberation for Subjective Visual Classification

For subjective concepts with fuzzy boundaries like "healthy food" or "clickbait," this work proposes Agile Deliberation, a human-in-the-loop framework. The system decomposes concepts into hierarchies of positive/negative sub-concepts, iteratively retrieves "semantic boundary samples" for user annotation and reflection, and automatically compiles feedback into VLM prompts. This allows the image classifier to align with users' evolving intentions. In 18 real-user experiments, it outperformed automatic decomposition baselines by 7.5% in F1 and manual deliberation by over 3%.

APPO: Attention-guided Perception Policy Optimization for Video Reasoning

APPO identifies that "the bottleneck of video reasoning lies in perception rather than reasoning." It leverages the model's own attention on video frames to convert sparse outcome rewards into token-level dense rewards. Specifically, it applies differentially weighted learning, scaled by reward disparities, to "intra-group perception tokens," i.e., tokens in different responses that focus on the same key frames. On Qwen2.5-VL-3/7B, it consistently outperforms GRPO and DAPO by 0.5%–4%.
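
As a rough illustration of turning a sparse outcome reward into a token-level signal, the sketch below first computes the standard GRPO group-relative advantage and then spreads it over tokens in proportion to their attention mass on video frames. The spreading rule and the function names are assumptions made for illustration; the note does not give APPO's actual weighting formula.

```python
import statistics

def group_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: normalize each reward within its response group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

def token_level_advantages(advantage, frame_attention):
    """Spread one response-level advantage over its tokens.

    Assumed rule: a token's share is proportional to its attention mass on
    video frames, scaled so the mean over tokens equals the input advantage.
    """
    total = sum(frame_attention)
    n = len(frame_attention)
    if total == 0:
        return [advantage] * n
    return [advantage * n * a / total for a in frame_attention]
```

Under this rule, tokens that attend strongly to frames receive a larger share of the credit or blame, while the average signal per response is unchanged.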

Dynamic Important Example Mining for Reinforcement Finetuning

In each training step of RFT (GRPO/PPO, etc.), DIEM uses the "inner product between single-sample gradients and the total batch gradient" to estimate the marginal contribution of each sample to current policy improvement in real-time. It then solves a constrained optimization problem to reweight samples while maintaining the gradient magnitude. With nearly zero extra overhead (+1.3% time), it improves multimodal reasoning benchmarks by 1–6 points on average.
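
The scoring step can be sketched in a few lines. The example below scores each sample by the inner product of its gradient with the batch gradient, then rescales the reweighted gradient back to the original norm. The softmax weighting is an assumed stand-in for the constrained optimization mentioned in the note, and `reweight_batch` and `temperature` are hypothetical names.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def reweight_batch(per_sample_grads, temperature=1.0):
    """Return (weights, reweighted_gradient) for one batch."""
    n, dim = len(per_sample_grads), len(per_sample_grads[0])
    # Ordinary batch gradient: the mean of the per-sample gradients.
    batch_grad = [sum(g[d] for g in per_sample_grads) / n for d in range(dim)]
    # Score each sample by its alignment with the batch gradient.
    scores = [dot(g, batch_grad) for g in per_sample_grads]
    # Assumed rule: a softmax stands in for the paper's constrained solver.
    top = max(scores)
    exps = [math.exp((s - top) / temperature) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    mixed = [sum(w * g[d] for w, g in zip(weights, per_sample_grads))
             for d in range(dim)]
    # Rescale so the update keeps the original gradient magnitude.
    target = math.sqrt(dot(batch_grad, batch_grad))
    current = math.sqrt(dot(mixed, mixed))
    if current > 0:
        mixed = [x * target / current for x in mixed]
    return weights, mixed
```

Samples whose gradients oppose the batch direction get small weights, and the final rescaling keeps the step size comparable to the unweighted update.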

E-comIQ-ZH: A Human-Aligned Dataset and Benchmark for Fine-Grained Evaluation of E-commerce Posters with Chain-of-Thought

This work constructs E-comIQ-ZH, the first multi-dimensional quality evaluation framework for Chinese e-commerce posters. It consists of an 18K expert-annotated dataset (including CoT reasoning chains), a dedicated evaluation model E-comIQ-M (trained via SFT+GRPO), and a standardized benchmark E-comIQ-Bench.

EagleVision: A Dual-Stage Framework with BEV-grounding-based Chain-of-Thought for Spatial Intelligence

EagleVision is a dual-stage framework. In the macro-perception stage, it utilizes Semantic-Perspective Fusion DPP (SPF-DPP) to jointly optimize semantic relevance and perspective diversity in \(SE(3)\) space for keyframe selection. In the micro-verification stage, the model actively queries new perspective frames on a BEV plane to conduct iterative spatial CoT reasoning (hypothesis \(\rightarrow\) view \(\rightarrow\) verification loop). The query strategy is trained purely via RL without human annotation, achieving open-source SOTA on VSI-Bench and SQA3D.

FireScope: Wildfire Risk Raster Prediction with a Chain-of-Thought Oracle

A VLM (Oracle) fine-tuned with GRPO and Chain-of-Thought (CoT) reasoning first infers a scalar wildfire risk score from satellite imagery and climate data. Then, FiLM is used to feed this score into a lightweight vision Encoder-Decoder to generate a high-resolution continuous risk raster. In a "US training, Europe testing" cross-continent setting, explicit linguistic reasoning significantly improves out-of-distribution (OOD) generalization, and the reasoning traces are interpretable and recoverable by wildfire experts.

Hilbert-Geo: Solving Solid Geometric Problems by Neural-Symbolic Reasoning

Hilbert-Geo is the first unified formal language framework (including a predicate library and a theorem library) for solid geometry. It utilizes a "Parse2Reason" approach: first, a Multimodal Large Language Model (MLLM) translates text and 3D diagrams into a formal Condition Description Language (CDL); then, a specialized symbolic reasoning engine performs rigorous theorem searching. This method improves MLLM accuracy in solid geometry from approximately 50% to 77.3%, approaching human performance.

Human-like Abstract Visual Reasoning via Understanding and Solving Reasoning Loop

This work decomposes the iterative human cognitive process of "understanding-solving-reunderstanding" into a cyclic interaction between an Understanding Module (UM) and a Solving Module (SM). Supplemented by representation isomorphism constraints and an adaptive halting mechanism, a small model with only 7M parameters achieves 47.2% accuracy on ARC-AGI-1, surpassing TRM and several general-purpose large language models.

Latent Chain-of-Thought World Modeling for End-to-End Autonomous Driving

LCDrive proposes the Latent Chain-of-Thought (Latent CoT) framework, which replaces natural language CoT for reasoning with action proposal tokens and world model prediction tokens. Through cold-start and RL post-training, it achieves lower latency and superior trajectory quality for end-to-end autonomous driving.

Rationale-Enhanced Decoding for Multi-modal Chain-of-Thought

This paper discovers that existing LVLMs actually ignore the content of intermediate rationales during CoT reasoning. It proposes RED (Rationale-Enhanced Decoding), which multiplies next-token distributions conditioned on images and rationales at the logit level. Theoretically equivalent to the optimal solution for KL-constrained reward maximization, RED significantly improves multimodal reasoning accuracy without requiring training.
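
Multiplying two next-token distributions is the same as adding their log-probabilities and renormalizing. The sketch below shows that operation on toy logits. The exponent `lam` on the rationale-conditioned term and the function names are assumptions; the note does not spell out RED's exact conditioning or weighting.

```python
import math

def log_softmax(logits):
    top = max(logits)
    lse = top + math.log(sum(math.exp(x - top) for x in logits))
    return [x - lse for x in logits]

def rationale_enhanced_probs(image_logits, rationale_logits, lam=1.0):
    """Combine p(y | image) * p(y | rationale)^lam and renormalize.

    `lam` is a hypothetical strength parameter; lam=0 recovers the
    image-conditioned distribution alone.
    """
    lp_image = log_softmax(image_logits)
    lp_rationale = log_softmax(rationale_logits)
    combined = [a + lam * b for a, b in zip(lp_image, lp_rationale)]
    return [math.exp(x) for x in log_softmax(combined)]
```

Because the combination happens at decoding time, no model parameters change, which is consistent with the training-free claim.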

Browse all 16 LLM Reasoning papers →


🦾 LLM Agent (39)

AdapAction: Adaptive Target Action Backdoor Attack against GUI Agents

For MLLM-driven GUI agents, this work replaces traditional "trigger \(\rightarrow\) fixed action" backdoors with "trigger \(\rightarrow\) context-adaptive malicious action." An adversarial teacher LLM generates structured malicious reasoning trajectories, which are distilled into the target agent via SFT. This enables the agent, when triggered, to autonomously select a malicious operation that appears perfectly reasonable given the current interface and instruction, pushing the attack success rate to 100% while bypassing multi-principle LLM defenses and maintaining normal task utility.

AeroAgent: A Vision-Physics-Decision Framework for Aerodynamic Vehicle Design

AeroAgent integrates "text/image-to-3D car generation → second-level drag and flow field prediction via the AeroFormer surrogate model → planner-driven propose-evaluate-refine closed-loop editing" into a unified framework. It utilizes high-fidelity CFD only for final top-K candidate verification, achieving an average drag reduction of 2–12% within 5 iterations while reducing high-fidelity CFD calls by 50–80%.

Agent4FaceForgery: Multi-Agent LLM Framework for Realistic Face Forgery Detection

A multi-agent system driven by LLMs is used to "act" as both forgers and social network observers, simulating the complete life cycle of face forgery from creation to propagation. It synthesizes training data with text-image consistency annotations, leading to significant performance gains for deepfake detectors in cross-domain and cross-algorithm real-world scenarios (e.g., Celeb-DF AUC improved from the 70% range to 87.1%).

BAMI: Training-Free Bias Mitigation in GUI Grounding

This paper diagnoses GUI grounding errors using the MPD attribution method, identifying two main types of inductive biases: precision bias and ambiguity bias. It proposes BAMI, a training-free inference framework that eliminates precision bias through "coarse-to-fine focusing" and mitigates ambiguity bias via "candidate selection." BAMI improves the accuracy of TianXi-Action-7B on ScreenSpot-Pro from 51.9% to 57.8%.

CGL: Advancing Continual GUI Learning via Reinforcement Fine-Tuning

To address the "learning new while forgetting old" problem of GUI agents under frequent app updates, this paper observes that SFT learns quickly but overwrites old knowledge, while RL (GRPO) resists forgetting but learns slowly. The proposed CGL framework therefore integrates SFT and GRPO through "error-aware routing + entropy-regulated weighting + conditional gradient surgery," achieving the highest accuracy and near-zero forgetting on the self-built AndroidControl-CL benchmark.

DRAMA: Next-Gen Dynamic Orchestration for Resilient Multi-Agent Ecosystems in Flux

DRAMA uniformly abstracts agents and tasks in embodied multi-agent systems as "resource entities," utilizing an affinity matrix and a modified Hungarian algorithm for event-triggered dynamic scheduling. Complemented by a "Trust Chain" for decentralized fault takeover, the framework ensures uninterrupted task completion during agent dropout, addition, or recovery. In VirtualHome-Social, it achieves fewer average steps, lower conflict rates, and higher throughput compared to SOTA.
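
To make the scheduling idea concrete, the sketch below picks the agent-to-task assignment with the highest total affinity and re-solves when an agent drops out. It uses brute-force enumeration as a stand-in for the modified Hungarian algorithm (same optimum, only practical for small matrices). The function names and the dropout handler are illustrative assumptions.

```python
from itertools import combinations, permutations

def best_assignment(affinity):
    """affinity[a][t] is the fit of agent a to task t (higher is better).

    Returns (total_affinity, {task_index: agent_index}).
    """
    n_agents, n_tasks = len(affinity), len(affinity[0])
    k = min(n_agents, n_tasks)
    best_total, best_map = float("-inf"), {}
    # Enumerate every one-to-one matching of size k exactly once.
    for agents in permutations(range(n_agents), k):
        for tasks in combinations(range(n_tasks), k):
            total = sum(affinity[a][t] for a, t in zip(agents, tasks))
            if total > best_total:
                best_total, best_map = total, dict(zip(tasks, agents))
    return best_total, best_map

def on_agent_dropout(affinity, dropped):
    """Event trigger: remove one agent's row and solve again."""
    remaining = [i for i in range(len(affinity)) if i != dropped]
    total, mapping = best_assignment([affinity[i] for i in remaining])
    # Map row positions back to the original agent ids.
    return total, {t: remaining[a] for t, a in mapping.items()}
```

With three agents and three tasks, losing one agent leaves one task unassigned until another agent joins or recovers, at which point the same solver can be called again.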

Ego2Web: A Web Agent Benchmark Grounded in Egocentric Videos

Ego2Web is proposed as the first benchmark that combines egocentric video perception with web agent execution. Accompanied by a semi-automatic data construction pipeline and the Ego2WebJudge automatic evaluation framework, experiments reveal a significant gap for current top agents in transferring from real-world visual perception to online actions, with a maximum success rate of only 48.2%.

EpiAgent: An Agent-Centric System for Ancient Inscription Restoration

EpiAgent is the first agent system for ancient inscription restoration. By utilizing an LLM central planner to coordinate multimodal analysis, specialized restoration tools, and iterative self-optimization, it outperforms existing methods in both textual authenticity and visual fidelity.

Experience Transfer for Multimodal LLM Agents in Minecraft Game

This paper proposes Echo—a "transfer-oriented" memory framework that explicitly decomposes reusable knowledge into five transfer dimensions: structure, attribute, process, function, and interaction. These are encapsulated into a unified Contextual State Descriptor (CSD). Using In-Context Analogical Learning (ICAL), the agent actively infers and verifies new tasks from the memory bank. In Minecraft "from-scratch" scenarios, this increases item unlocking speed by 1.3×–1.7× and leads to a "chain burst unlocking" phenomenon.

GUI-CEval: A Hierarchical and Comprehensive Chinese Benchmark for Mobile GUI Agents

This work proposes GUI-CEval, the first comprehensive benchmark for Chinese mobile GUI Agents. It covers 201 mainstream Chinese Apps across 4 device types, utilizing a "Foundation + Application" two-layer structure to conduct fine-grained diagnosis across five dimensions: perception, planning, reflection, execution, and evaluation. Experiments on 20 representative models reveal that current models still exhibit significant weaknesses in reflection and self-evaluation.

Browse all 39 LLM Agent papers →


⚖️ Alignment & RLHF (12)

Anchoring the Mind of Multimodal Reasoners: Cognitive Bias as a Vector for Jailbreak Attacks

This paper discovers an "anchoring effect" in the safety judgments of Multimodal Large Reasoning Models (MLRMs)—where the model is significantly biased by the first information it encounters. Based on this, RA-Attack is proposed: it first anchors the model's reasoning chain to a "safe tone" using a "seemingly safe" structured mind map and educational context text, then smoothly packages harmful intent as a natural extension of this reasoning chain. It achieves SOTA Attack Success Rates (ASR) of 92% (Gemini-2.5-Pro) and 82% (GPT-4o) across 7 mainstream MLRMs.

Bridging Human Evaluation to Infrared and Visible Image Fusion

To address the long-standing issue of Infrared and Visible Image Fusion (IVIF) optimizing only handcrafted metrics and disconnecting from human aesthetics, this paper constructs the first large-scale IVIF human feedback dataset. It trains a "fusion-oriented reward model" to quantify perceptual quality and utilizes SAM-assisted GRPO to align the fusion network with human preferences, achieving SOTA performance on mainstream benchmarks with more visually pleasing fusion results.

DRM: Diffusion-based Reward Model With Step-wise Guidance

This paper utilizes the pre-trained diffusion model itself as the reward model backbone (DRM). By leveraging its unique ability to score noise latents at any denoising step, the authors design Step-GRPO for training with dense step-wise rewards and Step-wise Sampling for "explore-and-select" during inference. This approach significantly improves the generation quality of SD3.5-Medium without adding parameters and achieves 2.5–3.5 times faster convergence.

EcoAlign: An Economically Rational Framework for Efficient LVLM Alignment

EcoAlign reframes the inference-time alignment of Large Vision-Language Models (LVLMs) as an "optimal path search problem under a limited compute budget." It utilizes a Net Present Value (NPV)-like look-ahead function to score candidate actions on a dynamically constructed Graph-of-Thought, balancing safety, utility, and cost while defining path safety via the "weakest link" principle to achieve superior safety and utility at lower compute costs.
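
The two scoring ideas can be illustrated with a toy example: a path is only as safe as its least safe step, and its value is a discounted sum of utility minus cost. The exact look-ahead function is not given in the note, so the discounting form, the safety floor, and all names below are assumptions.

```python
def path_safety(step_safety):
    """Weakest-link rule: the path score is the minimum step score."""
    return min(step_safety)

def net_value(utilities, costs, discount=0.9):
    """NPV-like score: discounted sum of (utility - cost) per step."""
    return sum((discount ** t) * (u - c)
               for t, (u, c) in enumerate(zip(utilities, costs)))

def choose_path(paths, safety_floor=0.5, discount=0.9):
    """Pick the highest-value path among those that pass the safety floor."""
    safe = [p for p in paths if path_safety(p["safety"]) >= safety_floor]
    if not safe:
        return None
    best = max(safe, key=lambda p: net_value(p["utility"], p["cost"], discount))
    return best["name"]
```

In this toy setup a high-utility path with a single unsafe step is rejected outright, since no amount of utility elsewhere can raise its minimum.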

From Pixel to Precision: Enhancing Handwritten Mathematical Expression Recognition with Image-Level Reward

Addressing the fundamental misalignment in handwritten mathematical expression recognition where "LaTeX text similarity \(\neq\) rendered image similarity," this paper proposes Image Matching Score (IMS)—a lightweight image-level reward based on column projection encoding and Levenshtein distance. This reward drives IMPO, a GRPO reinforcement learning framework without a value network. Across CROHME, HME100K, and M2E benchmarks, it increases ExpRate by an average of approximately 1.1% (up to 1.37%), achieving a new SOTA.
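
A minimal sketch of an image-level score in this spirit: encode each rendered image as its per-column ink counts, then compare the two sequences with Levenshtein distance. The normalization to a score in [0, 1] and the use of raw ink counts as the encoding are assumptions; the paper's actual encoding may differ.

```python
def column_projection(image):
    """image: rows of 0/1 pixels. Returns the ink count of each column."""
    return [sum(col) for col in zip(*image)]

def levenshtein(a, b):
    """Edit distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution
        prev = cur
    return prev[-1]

def image_matching_score(pred_image, ref_image):
    """Assumed normalization: 1.0 for identical projections, lower otherwise."""
    p, r = column_projection(pred_image), column_projection(ref_image)
    longest = max(len(p), len(r))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(p, r) / longest
```

Two LaTeX strings that differ textually but render identically get the full score under this measure, which is the mismatch the reward is meant to address.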

MorphSeek: Fine-grained Latent Representation-Level Policy Optimization for Deformable Image Registration

MorphSeek redefines deformable medical image registration as "policy optimization in the encoder's latent space"—attaching a Gaussian policy head to the top layer of a U-Net encoder to treat latent features as samplable actions. It first uses unsupervised warm-up to stabilize the latent space, then employs GRPO for multi-trajectory multi-step weakly supervised fine-tuning. Combined with LDVN to stabilize policy gradients in the tens-of-thousands-dimensional latent space, it improves Dice by 2–4% and reduces the folding rate (NJD) by 30–60% on three 3D registration benchmarks using minimal labels.

Principled Steering via Null-space Projection for Jailbreak Defense in Vision-Language Models

This paper proposes NullSteer, an activation steering defense framework based on null-space projection. By restricting steering operations to the null space of benign activations, it effectively defends against visual jailbreak attacks without compromising the model's general capabilities.
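
The geometric idea can be sketched with plain vectors: remove from a steering vector every component that lies in the span of benign activations, so the remainder is orthogonal to all of them. This is a toy illustration using Gram-Schmidt, under the assumption that "null space of benign activations" means the orthogonal complement of their span; the function names are hypothetical and the paper's construction may differ.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def orthonormal_basis(vectors, tol=1e-10):
    """Gram-Schmidt basis for the span of the given vectors."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            c = dot(w, b)
            w = [wi - c * bi for wi, bi in zip(w, b)]
        n = math.sqrt(dot(w, w))
        if n > tol:  # skip vectors already in the span
            basis.append([wi / n for wi in w])
    return basis

def project_to_null_space(steer, benign_activations):
    """Remove every component of `steer` that lies in the benign span."""
    out = list(steer)
    for b in orthonormal_basis(benign_activations):
        c = dot(out, b)
        out = [oi - c * bi for oi, bi in zip(out, b)]
    return out
```

The projected vector has zero inner product with each benign activation, so any linear readout along those benign directions is unaffected by the steering.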

SafeGRPO: Self-Rewarded Multimodal Safety Alignment via Rule-Governed Policy Optimization

SafeGRPO integrates "verifiable rule-governed rewards" into GRPO, allowing Multimodal Large Language Models (MLLMs) to learn self-rewarded safety through a "step-guided reasoning process" (analyzing visual, text, and combined risks) without manual preference annotations. This approach enhances jailbreak defense, safety awareness, and stability across multiple safety benchmarks while minimizing degradation of general capabilities and avoiding excessive refusal.

Thinking with Frames: Generative Video Distortion Evaluation via Frame Reward Model

REACT is a frame-level reward model targeting "structural distortion" in generated videos. It establishes a taxonomy of eight distortion categories and labels 15,000 pairs of frame preference data. Using grounding reconstruction combined with Gemini-2.5-Pro, it synthesizes 6K CoT samples at low cost. Qwen2.5-VL-7B is trained in two stages via "Masked SFT + GRPO pairwise reward." During inference, a dynamic sampling mechanism focuses on frames most likely to be distorted, significantly outperforming existing video/image evaluators in both preference alignment and distortion identification.

Uncertainty-Aware Exploratory Direct Preference Optimization for Multimodal Large Language Models

UE-DPO shifts the optimization focus for hallucination suppression in Multimodal Large Language Models (MLLMs) from "visually sensitive tokens that the model already understands" to "critical cognitive blind-spot tokens that the model fails to comprehend." By quantifying these blind spots with token-level epistemic uncertainty, UE-DPO asymmetrically adjusts DPO gradient intensities for preferred and dispreferred branches. It outperforms similar methods like TPO and V-DPO on multiple hallucination benchmarks using significantly less data.

Browse all 12 Alignment & RLHF papers →


🔒 LLM Safety (11)

AutoDebias: An Automated Framework for Detecting and Mitigating Backdoor Biases in Text-to-Image Models

AutoDebias is proposed as the first unified framework to simultaneously detect and mitigate malicious backdoor biases in T2I models. By leveraging VLM open-set detection to identify trigger-bias associations and constructing lookup tables, combined with CLIP-guided distribution alignment training, it reduces the attack success rate from 90% to near zero across 17 backdoor scenarios while maintaining image quality.

The Blind Spot of Adaptation: Quantifying and Mitigating Forgetting in Fine-tuned Driving Models

This paper systematically investigates the catastrophic forgetting issue when fine-tuning VLMs for autonomous driving. It constructs FidelityDrivingBench, a large-scale benchmark with \(180\text{K}\) scenarios, and proposes the Drive Expert Adapter (DEA), which enhances driving task performance via prompt-space routing without corrupting base parameters.

Designing to Forget: Deep Semi-parametric Models for Unlearning

This paper proposes the "Designing to Forget" philosophy, introducing a family of Deep Semi-parametric Models (SPM). By simply removing training samples at inference time without modifying model weights, SPM reduces the prediction gap compared to retraining baselines by 11% on ImageNet and accelerates unlearning by more than 10x.

Elastic Weight Consolidation Done Right for Continual Learning

This paper systematically analyzes the fundamental flaws of EWC and its variants in weight importance estimation from a gradient perspective (gradient vanishing in EWC and redundant protection in MAS). It proposes an extremely simple Logits Reversal operation to correct the Fisher Information Matrix (FIM) calculation, significantly outperforming the original EWC and its variants in exemplar-free class-incremental learning and multimodal continual instruction tuning tasks.

Learning from Oblivion: Predicting Knowledge-Overflowed Weights via Retrodiction of Forgetting

The authors propose KNOW prediction: inducing a structured forgetting process through sequential fine-tuning on progressively smaller nested data subsets to collect weight transition trajectories, and then using a meta-learned hyper-model (KNOWN) to reverse the direction of forgetting. This predicts virtual knowledge-enhanced weights as if they were trained on larger datasets. Across multiple datasets (CIFAR/ImageNet/PACS, etc.) and architectures (ResNet/PVTv2/DeepLabV3+, etc.), the method consistently outperforms naive fine-tuning and various weight prediction baselines, showing significant improvements in downstream tasks such as image classification, semantic segmentation, image captioning, and domain generalization.

Machine Unlearning via Adaptive Gradient Reweighting and Multi-stage Objective Optimization

To address the issues of "uniform treatment of all samples/categories" and "gradient conflicts between forgetting and retaining objectives" in machine unlearning, this paper proposes Adaptive Gradient Reweighting (weighting based on sample memory depth/category vulnerability) combined with Three-stage Objective Optimization (direction rectification → temporal smoothing → adaptive combination). On CIFAR-10/100 and Tiny-ImageNet, the Avg Gap for random forgetting is reduced from the SOTA 0.85 to 0.19.

Omni-Attack: Adversarial Attacks on Open-Ended VQA in Black-Box Multimodal LLMs

Addressing the gaps where "open-ended VQA/OCR tasks lack explicit attack targets and existing adversarial robustness evaluations use fragmented protocols," this paper first establishes a unified targeted attack benchmark AdvRobustBench (1,000 items, VQA+OCR). It then proposes Omni-Attack, a transferable black-box attack using LLMs to generate "question-conditioned" textual/visual targets, OCR location-aware perturbations, and four transfer regularizations. It achieves a 71.8% targeted attack success rate on GPT-4.1 with \(\epsilon=8/255\).

⊘ Source Models Leak What They Shouldn't ↛: Unlearning Zero-Shot Transfer in Domain Adaptation Through Adversarial Optimization

This paper identifies that Source-Free Domain Adaptation (SFDA) methods inadvertently leak knowledge of source-exclusive classes to the target domain (zero-shot transfer). It proposes the SCADA-UL framework, which concurrently performs class unlearning during domain adaptation by adversarially generating forgotten samples and employing a rescaled labeling strategy, achieving unlearning performance comparable to training from scratch.

Revisiting Learning with Noisy Labels: Active Forgetting and Noise Suppression

To address the overfitting bottleneck in noisy label learning (LNL) caused by long-term "clean sample selection," this paper proposes FINE, a plug-and-play framework. It uses Active Forgetting via Machine Unlearning (AFMU) to "actively forget" noise absorbed during early stages and Noise Suppression via Negative Learning (NSNL) to "suppress" overfitting in later stages. Integrated into existing SOTA methods like SED or ACT, it consistently improves robustness and generalization.

Select, Hypothesize and Verify: Towards Verified Neuron Concept Interpretation

The SIEVE (Select–Hypothesize–Verify) framework is proposed to interpret neuron functions through a closed-loop process involving high-activation sample screening, concept hypothesis generation, and text-to-image verification. The probability of generated concepts matching neuron activation is approximately 1.5 times that of existing SOTA methods.

Browse all 11 LLM Safety papers →


👻 Hallucination Detection (32)

AdaIAT: Adaptively Increasing Attention to Generated Text to Alleviate Hallucinations in LVLM

Addressing the issue where "amplifying image attention suppresses hallucinations but leads to repetitive and wordy output," this paper discovers that real object tokens possess higher attention to the previously generated text \(T_p\) than hallucinated tokens. Consequently, the authors propose increasing attention specifically to \(T_p\) (IAT). By further employing layer-wise thresholds to control "when to intervene" and a head-wise amplification matrix to control "how much to amplify" (AdaIAT), they significantly reduce hallucination rates (CS/CI) on LLaVA-1.5, Janus-Pro, and Qwen2.5-VL with almost no loss in text diversity.

Beyond the Global Scores: Fine-Grained Token Grounding as a Robust Detector of LVLM Hallucinations

This paper proposes a patch-level LVLM hallucination detection framework, built on the observation that hallucinated tokens exhibit dispersed attention patterns and low semantic alignment. From these signatures, Attention Dispersion Score (ADS) and Cross-modality Grounding Consistency (CGC) are designed as lightweight metrics, achieving a detection accuracy of 90%.
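
One simple way to quantify "dispersed attention" is the normalized entropy of a token's attention over image patches. The sketch below uses that as an assumed proxy; the note does not give the paper's definition of ADS, so `attention_dispersion` is illustrative only.

```python
import math

def attention_dispersion(patch_attention):
    """Normalized entropy in [0, 1]: 0 = one patch, 1 = uniform over patches."""
    if len(patch_attention) < 2:
        raise ValueError("need at least two patches")
    total = sum(patch_attention)
    if total <= 0:
        raise ValueError("attention weights must sum to a positive value")
    probs = [a / total for a in patch_attention]
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return entropy / math.log(len(probs))
```

A token grounded in a specific region would score low under this proxy, while a token with no clear visual support would score closer to 1.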

CausalLens: Sensitivity-Guided Multi-Head Causal Intervention for Hallucination Mitigation in Large Vision-Language Models

CausalLens decomposes each attention head of the decoder into three pathways—"visual, text, and system prompt"—and identifies heads that truly attend to the image using a visual sensitivity score. By amplifying visual contributions and applying projection alignment corrections in a single forward pass within the middle layers (L10–L20), it significantly reduces hallucinations in Large Vision-Language Models without retraining or multiple decoding iterations.

COPO: Causal-Oriented Policy Optimization for Hallucinations of MLLMs

The authors find that MLLMs, when post-trained with GRPO (using only outcome rewards based on final answer correctness), tend to over-focus on image backgrounds, forming spurious "background \(\to\) answer" correlations that lead to hallucinations. They propose COPO, which calculates a "causal completeness" reward (sufficiency + necessity) for each reasoning token and injects it into the GRPO advantage function. As a result, only tokens that truly determine the answer's correctness are rewarded, which consistently reduces hallucination rates across multiple benchmarks such as CHAIR and POPE.

Cross-Modal Attention Calibration for LVLM Hallucination Mitigation

To mitigate hallucinations in LVLMs, this paper proposes CMAC, a training-free cross-modal attention calibration framework. It uses the IMD module to perform "surgical" masking of high cross-modal weight value vectors in the attention layer to construct a more accurate hallucination distribution for contrastive decoding. Additionally, the CMPC module scales the position indices of image tokens to alleviate the position bias introduced by RoPE. CMAC consistently outperforms existing contrastive decoding methods across POPE, CHAIR, and MME.
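
For readers unfamiliar with contrastive decoding, the core step is a one-line logit adjustment: boost the original logits and subtract those from a deliberately hallucination-prone pass. The sketch below uses the common `(1 + alpha) * original - alpha * distorted` form from the general contrastive decoding literature; whether CMAC uses exactly this form, and the value of `alpha`, are assumptions.

```python
def contrastive_logits(original, hallucination, alpha=1.0):
    """Penalize tokens that the hallucination-prone pass also favors."""
    return [(1 + alpha) * o - alpha * h for o, h in zip(original, hallucination)]

def argmax(values):
    return max(range(len(values)), key=values.__getitem__)
```

In this framing, the masked pass plays the role of the `hallucination` input, so the quality of that distribution directly determines which tokens get suppressed.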

Envision, Attend, Then Respond: Counterfactual Hallucination Mitigation in Large Vision-Language Models

EnAR is a training-free framework that utilizes a diffusion model to generate a "visual impression" of what an input image "should look like." By comparing the visual attention differences between the original image and this impression, it identifies counterfactual elements (e.g., a five-legged alpaca). These tokens are then masked for contrastive decoding, forcing the LVLM to anchor its response on real pixels rather than linguistic priors. This approach achieves a 10.82% improvement on the counterfactual benchmark VLMBias and an average 6.9% gain on the general hallucination benchmark POPE.

Evaluating and Easing Hallucinations for GUI Grounding

This paper presents the first systematic study of hallucinations in GUI grounding, categorizing them into "confusion hallucinations" (misidentifying similar elements) and "fabrication hallucinations" (inventing non-existent coordinates). The authors construct GUI-HalluBench, a bilingual dataset with dual subsets, to diagnose the correlation between hallucinations and parsing capabilities. They propose a training-free Parsing-guided Prompt (PGP) and a Hallucination-aware Fine-Tuning (HFT) solution. Experiments demonstrate that stronger parsing leads to fewer hallucinations, with HFT yielding an absolute improvement of approximately 7%.

Fighting Hallucinations with Counterfactuals: Diffusion-Guided Perturbations for LVLM Hallucination Suppression

This paper proposes CIPHER, a training-free test-time hallucination suppression method. In the offline stage, a diffusion model is used to generate counterfactual images to construct the OHC-25K dataset, from which a visual hallucination subspace is extracted via SVD. During inference, hidden states are projected onto the orthogonal complement of this subspace, significantly reducing visual hallucinations in LVLMs without modifying model parameters or increasing inference overhead.

Fine-Grained Multi-Image Object Hallucination Benchmark

MIOH is the first fine-grained object hallucination diagnostic benchmark designed for multi-image scenarios. It creates a matrix of "4 object tasks × 3 multi-image reasoning modes" resulting in 26 question types, further overlaid with three controllable adversarial pressures: "number of images / perceptual difficulty / contextual bias." Evaluations of 29 models reveal that even GPT-5 and Gemini-2.5-Pro achieve overall accuracies of only 63.1% and 64.4%, respectively, with a global average of only 36.1%. The study identifies that hallucinations primarily originate from the cross-image integration stage rather than simple perceptual failure.

FINER: MLLMs Hallucinate under Fine-grained Negative Queries

The study identifies a sharp increase in MLLM hallucination rates under fine-grained negative queries (queries with a single subtle error among multiple objects/attributes/relations). It proposes the FINER benchmark and the FINER-Tuning method (based on DPO), achieving a maximum improvement of 24.2% on InternVL3.5-14B.

Browse all 32 Hallucination Detection papers →


⚡ LLM Efficiency (8)

E\(^2\)-SCI: Elastic Edge-Cloud Speculative Decoding via Credit Inertia

This paper identifies strong temporal consistency in token acceptance rates across adjacent windows in edge-cloud speculative decoding (termed "Credit Inertia"). Based on this, it dynamically adjusts verification thresholds using historical acceptance rates. Combined with an Asynchronous Pipeline (PLC) that parallelizes draft generation and cloud verification, it achieves 9.4+ tokens/s on DeepSeek-R1-Distill-Qwen (1.5B/32B), representing an 88.5% speedup over the FSD baseline without compromising accuracy.
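
If acceptance rates are consistent across adjacent windows, a running average of past rates is a reasonable predictor of the next one. The sketch below tracks such an average and maps it to a threshold. Everything here is hypothetical: the class name, the exponential moving average, the linear mapping, and the direction (higher recent acceptance gives a lower threshold) are assumptions, since the note does not specify the update rule or what the threshold gates.

```python
class CreditTracker:
    """Tracks recent draft acceptance and derives a verification threshold."""

    def __init__(self, base=0.5, momentum=0.8, gain=0.4, lo=0.1, hi=0.9):
        self.base = base          # threshold at neutral credit
        self.momentum = momentum  # how strongly history persists ("inertia")
        self.gain = gain          # sensitivity of the threshold to credit
        self.lo, self.hi = lo, hi
        self.credit = 0.5         # neutral starting credit

    def threshold(self):
        raw = self.base - self.gain * (self.credit - 0.5)
        return min(self.hi, max(self.lo, raw))

    def update(self, accepted, drafted):
        """Fold one window's acceptance rate into the running credit."""
        rate = accepted / drafted if drafted else 0.0
        self.credit = self.momentum * self.credit + (1 - self.momentum) * rate
        return self.threshold()
```

A high `momentum` makes the threshold change slowly, which matches the intuition that one bad window should not immediately undo a long run of accepted drafts.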

Few-Shot Hybrid Incremental Learning: Continually Learning under Data Scarcity and Task Uncertainty

This paper proposes "Few-Shot Hybrid Incremental Learning (FSHIL)," a realistic new paradigm where data is scarce and task types (new classes, new domains, or both) appear stochastically. By introducing "Conditional Meta-Expanding Mixture of Experts (CME-MoE)" to reconcile stability and plasticity at the feature level and "Self-Expanding Prototype Classifier (SEPC)" to model multi-distribution boundaries at the classification layer, the method outperforms existing FSIL and HIL approaches across five datasets and three incremental settings.

Gated KalmaNet: A Fading Memory Layer Through Test-Time Ridge Regression

This work reinterprets the state updates of Linear State Space Models (SSMs) as "performing test-time ridge regression on the entire history." By replacing the one-step gradient approximation in existing SSMs with the exact gain from Kalman filtering and overcoming the dual obstacles of low-precision numerical instability and parallel training via adaptive regularization and Chebyshev iterations, it outperforms linear SSMs like Mamba2 and Gated DeltaNet in short/long context tasks and ImageNet.

Generalizable Video Quality Assessment via Weak-to-Strong Learning

Without relying on any human annotation labels, off-the-shelf video quality assessment (VQA) models are used as "weak teachers" to supervise a high-capacity Multimodal Large Language Model (MLLM) "strong student." The student is then recycled as the teacher for subsequent iterative rounds. The final model matches in-distribution performance and significantly surpasses all teachers in OOD scenarios, improving the overall OOD SRCC of VQA from 0.59 to 0.745.

JUMP-Hand: Learning Joint-wise Uncertainty to Gate Mixture of View Experts for Multi-View 3D Hand Reconstruction

JUMP-Hand reformulates multi-view 3D hand reconstruction as a Mixture of Experts (MoE) problem where "each view is an expert," utilizing joint-wise, view-wise probabilistic uncertainty as an explicit gating signal. This signal drives both uncertainty-weighted triangulation in the coarse stage and uncertainty-gated cross-attention in the refinement stage, adaptively amplifying reliable views while suppressing noisy ones under severe occlusion, achieving SOTA results across three multi-view benchmarks.

ParallelVLM: Lossless Video-LLM Acceleration with Visual Alignment Aware Parallel Speculative Decoding

Addressing two major bottlenecks in Video-LLM speculative decoding—"draft and target models waiting for each other" and "trade-off between speedup ratio and model alignment"—ParallelVLM implements both prefilling and decoding as draft/target parallel pipelines. It employs UV-Prune, an unbiased pruning method based on visual-text similarity variations (rather than attention scores), to expand the draft window. This achieves \(3.36\times\) and \(2.42\times\) lossless acceleration on LLaVA-OneVision-72B and Qwen2.5-VL-32B, respectively, while being training-free and plug-and-play.

QuietPrune: Query-Guided Early Token Pruning for Vision-Language Models

QuietPrune proposes query-guided early pruning: visual tokens unrelated to the text query are pruned during the ViT forward process rather than after it. By utilizing a lightweight adapter initialized through an inverse transformation of the VLM projector, the text query is converted into a visual-domain [Q-CLS] token to provide guidance. Pruning is performed in a 2×2 semi-structured manner with redundant token aggregation. On Qwen3-VL and InternVL3, it reduces prefill latency by up to 19.0% while achieving 4.2% higher accuracy than existing late-pruning methods.

Rejection Mixing: Fast Semantic Propagation of Mask Tokens for Efficient DLLM Inference

ReMix inserts an iteratively refreshed "continuous mixed state" between the discrete "mask state \(\rightarrow\) token state" transitions in Diffusion Language Models (DLLMs). This allows multiple positions in parallel decoding to coordinate in continuous space before finalizing tokens. By applying a rejection rule to reset unstable positions to masks, the method achieves a 2–8\(\times\) inference speedup without training or performance degradation, frequently even improving accuracy.


📚 Pretraining (5)

Exploring Visual Pretraining for Learning Language Intelligence

This paper proposes MAPLE: instead of extracting text from PDFs to feed into LLMs, it directly performs masked autoregressive pretraining on document page images. By allowing the LLM to learn language intelligence through "generating latent hypotheses for occluded regions," it achieves an average improvement of up to 40.2% over pure text pretraining across four mathematical reasoning benchmarks.

Linking Modality Isolation in Heterogeneous Collaborative Perception

The CodeAlign framework is proposed to address the "modality isolation" problem in heterogeneous collaborative perception, where different modalities never co-occur in training data. By constructing discrete code spaces via codebooks and performing cross-modal Feature-Code-Feature (FCF) translation, it achieves SOTA perception performance with only 8% of HEAL's training parameters and a \(1024\times\) reduction in communication volume.

Reconstructing CLIP for Open-Vocabulary Dense Perception

DenseRC addresses the neglected problem of "how to construct high-quality dense features for CLIP." It reveals that the generalized semantics of the cls token actually derive from multi-layer value embeddings, whereas spatial aggregation tends to amplify semantic misalignment. By using multi-layer values as a foundation and employing a lightweight Head Selection Gating (HSG) for re-weighting solely across the head dimension, the authors construct dense representations aligned with global semantics. DenseRC sets new SOTAs on multiple open-vocabulary detection and segmentation benchmarks.

Unlocking Pre-trained Weights: Parameter Inheritance for Zero-Shot Initialization

PITH utilizes a Graph HyperNetwork to dynamically generate "projection matrices" that map internal weights of large pre-trained models directly onto target ViTs of arbitrary sizes for initialization. This enables the initialized networks to be used immediately without training—achieving a zero-shot accuracy of 53.35% for ViT-Base on ImageNet-1K, which is 6.54% higher than the previous SOTA (TAL).

Watch and Learn: Learning to Use Computers from Online Videos

The Watch & Learn (W&L) framework is proposed, which automatically transforms human computer-operation videos from the internet into executable UI trajectory data using an Inverse Dynamics Model (IDM). It generates 53K+ high-quality trajectories, significantly improving the performance of various Computer-Using Agents (CUAs) when used as In-Context Learning (ICL) examples or Supervised Fine-Tuning (SFT) data.


🎨 Image Generation (434)

2ndMatch: Finetuning Pruned Diffusion Models via Second-Order Jacobian Matching

A fine-tuning framework named 2ndMatch is proposed. By aligning the second-order Jacobian matrix \(J^\top J\) (inspired by Finite-Time Lyapunov Exponents) of the pruned model with the original model, it matches the temporal sensitivity to input perturbations, significantly narrowing the generation quality gap.

3D Space as a Scratchpad for Editable Text-to-Image Generation

This paper proposes treating an editable 3D scene as a "spatial scratchpad" for text-to-image generation. A suite of LLM agents parses text prompts into subject meshes, plans placements/orientations/cameras in 3D, and renders this layout into an image via identity-preserving depth-conditioned generation. It achieves a 32% training-free improvement in text alignment on GenAI-Bench and supports consistent image updates through simple 3D modifications.

A Self-Conditioned Representation Guided Diffusion Model for Realistic Text-to-LiDAR Scene Generation

T2LDM utilizes a "Guidance Network" (SCRG) that provides geometric reconstruction supervision during training but is discarded at inference, along with Directional Positional Encoding (DPE) to correct street distortion from spherical projection. It generates finely structured and controllable LiDAR scenes despite the extreme scarcity of Text-LiDAR pairs, and introduces the controllability benchmark T2nuScenes and the TBR metric.

A Style is Worth One Code: Unlocking Code-to-Style Image Generation with Discrete Style Space

CoTyle conjures novel and reproducible visual styles using a single numeric code. It accomplishes "one number = one style" for the first time in the open-source community by training a discrete style codebook to compress images into style indices, a T2I diffusion model to generate images conditioned on these indices, and an autoregressive generator to create new style index sequences from scratch.

A Temporal and Content Co-Awareness Latent Diffusion for Controllable Hand Image Generation

Addressing the limitation where pose/appearance control signals are injected with fixed intensity across all denoising steps in controllable hand generation, this paper proposes TCCA. It utilizes a set of learnable queries to align heterogeneous features—noisy latents, 3D pose, and appearance—into a unified space to dynamically adjust injection intensity step-by-step. Complementing this is a pose-invariant appearance encoder using SVD orthogonal decomposition to remove pose artifacts. The method outperforms FoundHand across FID/LPIPS/PCK metrics on datasets like InterHand2.6M.

A Training-Free Style-Personalization via SVD-Based Feature Decomposition

Based on the scale-wise autoregressive model Infinity, this work discovers that the largest singular value component of the 3rd feature \(F_3\) in the generation process specifically encodes style information. Consequently, a training-free approach is proposed to inject the style of a reference image into this feature step using SVD (Principal Feature Blending), while stabilizing the structure via attention maps from a content branch (Structural Attention Correction). This achieves style fidelity comparable to fine-tuning methods in 3.58 seconds, which is up to 195 times faster.

Accelerating Diffusion via Hybrid Data-Pipeline Parallelism Based on Conditional Guidance Scheduling

Addressing the pain points of "sub-linear speedup and quality degradation" in multi-GPU diffusion inference, this paper leverages the inherent "conditional/unconditional dual-path" of Classifier-Free Guidance as the data parallelism splitting dimension (Conditional Partitioning). It then uses a metric for noise discrepancy (rel-MAE) to adaptively determine when to enable pipeline parallelism. On two RTX 3090 GPUs, it achieves 2.31× and 2.07× speedups for SDXL and SD3, respectively, with almost no loss in image quality.
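
The "dual-path" split follows from the standard Classifier-Free Guidance combination \(\epsilon = \epsilon_u + w(\epsilon_c - \epsilon_u)\): the conditional and unconditional predictions are independent until the final mix. The sketch below runs the two branches on separate workers, with threads standing in for GPUs. `toy_denoiser` is a hypothetical placeholder for a real denoiser, and the rel-MAE switch to pipeline parallelism is not modeled.

```python
from concurrent.futures import ThreadPoolExecutor


def toy_denoiser(latent, prompt_scale):
    # Hypothetical stand-in for one denoiser forward pass
    return [prompt_scale * v - 0.1 for v in latent]


def cfg_step_partitioned(latent, guidance_scale, pool):
    """Run the conditional and unconditional branches on separate workers."""
    cond_future = pool.submit(toy_denoiser, latent, 1.0)     # conditional branch
    uncond_future = pool.submit(toy_denoiser, latent, 0.0)   # unconditional branch
    eps_c = cond_future.result()
    eps_u = uncond_future.result()
    # Standard CFG mix: eps_u + w * (eps_c - eps_u)
    return [u + guidance_scale * (c - u) for c, u in zip(eps_c, eps_u)]


with ThreadPoolExecutor(max_workers=2) as pool:
    eps = cfg_step_partitioned([1.0, 2.0], guidance_scale=7.5, pool=pool)
```

Only the two output tensors need to be exchanged per step, which is what makes this axis attractive for data parallelism.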

Adapter Shield: A Unified Framework with Built-in Authentication for Preventing Unauthorized Zero-Shot Image-to-Image Generation

Aiming at zero-shot image-to-image generation methods like IP-Adapter and InstantID that "clone faces or styles with a single image," this paper proposes Adapter Shield. It utilizes a pair of trainable "encryptor/decryptor" modules to map image encoder embeddings into garbled code based on a password. Multi-objective adversarial perturbations are then used to "anchor" the original image to these garbled embeddings. This causes unauthorized users to generate distorted results, while authorized users with the correct password can decrypt the embeddings for normal use—marking the first unified framework in this field to combine "protection" and "authentication."

Adaptive Auxiliary Prompt Blending for Target-Faithful Diffusion Generation

This paper proposes Adaptive Auxiliary Prompt Blending (AAPB), which derives a closed-form adaptive blending coefficient via Tweedie’s formula to dynamically balance the contributions of auxiliary anchor prompts and target prompts at each denoising step. This training-free approach significantly improves semantic accuracy and structural fidelity in rare concept generation and zero-shot image editing.

Adaptive Spectral Feature Forecasting for Diffusion Sampling Acceleration

The authors propose Spectrum, a global spectral-domain feature forecasting method based on Chebyshev polynomials. By treating the intermediate features of the diffusion model denoiser as functions of time and fitting coefficients via ridge regression, it achieves long-range feature forecasting where errors do not accumulate with step size. Spectrum achieves a \(4.79 \times\) speedup on FLUX.1 and \(4.67 \times\) on Wan2.1-14B with nearly no loss in quality.

Browse all 434 Image Generation papers →


🎬 Video Generation (152)

3D-Aware Implicit Motion Control for View-Adaptive Human Video Generation

3DiMo shifts human motion control from "relying on external SMPL reconstruction" to "jointly learning a set of view-invariant implicit motion tokens end-to-end with the video generator." By leveraging cross-attention semantic injection and supervision from rich multi-view data, the model recovers genuine 3D motion from 2D driving frames. This allows for faithful action reproduction while supporting free camera viewpoint control via text, with results significantly exceeding 2D pose and SMPL baselines in motion fidelity and image quality.

A Frame is Worth One Token: Efficient Generative World Modeling with Delta Tokens

The paper introduces DeltaTok, which compresses the VFM feature differences between consecutive frames into a single delta token. Combined with Best-of-Many training, DeltaWorld efficiently generates diverse future predictions in a single forward pass. With only 1/35 the parameters and 1/2000 the FLOPs of Cosmos, it outperforms existing models in dense prediction tasks.

Accelerating Autoregressive Video Diffusion via History-Guided Cache and Residual Correction

To address the critical issue in Autoregressive Video Diffusion Models (ARDMs) where "cache approximation errors accumulate and amplify over time" during segment-by-segment generation, this paper proposes the training-free ARCache. It uses History-Guided Cache to schedule caching based on changes in history tokens (suppressing intra-segment errors) and Enhanced Residual Correction to calibrate subsequent segments using the clean residual trajectory of the first segment (preventing inter-segment drift). It achieves up to \(3.13\times\) acceleration across three ARDMs with nearly lossless image quality.

Accelerating Diffusion-based Video Editing via Heterogeneous Caching: Beyond Full Computing at Sampled Denoising Timestep

For masked video editing (MV2V) tasks, this paper proposes HetCache, a training-free framework: it categorizes denoising steps into "full, partial, or reuse" based on cumulative change across timesteps, and partitions tokens into "context, margin, or generation" based on mask spatial priors within a single step. By performing attention only on the most semantically representative context tokens, it achieves a 2.67× speedup on Wan2.1-VACE with almost no drop in visual quality.

ActivityForensics: A Comprehensive Benchmark for Localizing Manipulated Activity in Videos

This paper introduces the activity-level video forgery localization task and the ActivityForensics large-scale benchmark (6K+ forged clips). It utilizes a grounding-assisted automated data construction pipeline to create highly realistic activity manipulations and proposes the Temporal Artifact Diffuser (TADiff) baseline, which amplifies forgery clues through diffusion-based feature regularization.

AdaCluster: Adaptive Query-Key Clustering for Sparse Attention in Video Generation

AdaCluster is a training-free sparse attention framework that designs specific strategies for the distinct roles of queries and keys in video DiTs. It uses "angular clustering" to compress queries and "layer-wise adaptive multi-stage K-means" to cluster keys. Combined with TensorQuest, which identifies key clusters via Tensor Cores, it achieves 1.67×–4.31× end-to-end acceleration on CogVideoX-2B / HunyuanVideo / Wan-2.1 with nearly lossless visual quality (up to 30.99 PSNR).

AdapTok: Learning Adaptive and Temporally Causal Video Tokenization in a 1D Latent Space

AdapTok encodes video into a temporally causal 1D discrete token sequence. During training, it learns "variable-length" representations by randomly dropping tail tokens in blocks. During inference, a scorer predicts "reconstruction quality for \(N\) tokens," and Integer Linear Programming (ILP) dynamically allocates tokens to different frames or samples under a fixed total budget. This achieves rFVD=28 reconstruction on UCF-101 with fewer tokens and significantly improves autoregressive video generation quality.

Anti-I2V: Safeguarding your photos from malicious image-to-video generation

Anti-I2V proposes a defense method against malicious image-to-video generation by optimizing perturbations in the L*a*b* and frequency dual-spaces and designing Internal Representation Collapse (IRC) and Anchoring (IRA) losses to disrupt semantic feature propagation in denoising networks, achieving SOTA protection across CogVideoX, DynamiCrafter, and Open-Sora.

Archon: A Unified Multimodal Model for Holistic Digital Human Generation

Archon discretizes seven modalities involved in digital humans (description, text script, speech, 3DMM animation, semantic video, image, and video) into tokens. A single autoregressive large model is pre-trained on 72 tasks to achieve any-to-any modality generation, understanding, and editing. It addresses token explosion in high-frame-rate talking videos using a \(4\times\) semantic video token compression and semantic-driven diffusion decoding. Furthermore, it stabilizes quality for high-ambiguity tasks like speech-to-video through "Thinking in Modality," which decomposes the process into modality-by-modality intermediate steps.

Are Image-to-Video Models Good Zero-Shot Image Editors?

This paper proposes IF-Edit, a training-free framework that directly utilizes pre-trained Image-to-Video (I2V) diffusion models as zero-shot image editors. By rewriting static editing instructions into "evolution over time" descriptions using Chain-of-Thought (CoT) prompting, employing Temporal Latent Dropout (TLD) to prune redundant frames for denoising acceleration, and using Self-Consistent Post-Refinement (SCPR) to select and regenerate a "static video" for clarity, IF-Edit demonstrates strong performance in non-rigid deformation and reasoning-based editing tasks.

Browse all 152 Video Generation papers →


🧩 Multimodal VLM (388)

4DP-QA: Scalable QA for 4D Perception in Vision Language Models

This paper designs a scalable spatiotemporal QA automatic generation pipeline, producing 400,000 training samples (4DP-QA) and a 2.2K benchmark (4DP-QA-Bench) from various real/synthetic 4D data sources. It introduces "true-motion point tracking" as a new perception task to decouple object motion from camera motion. By fine-tuning standard VLMs with this data, 4D perception accuracy increases from ~42% to ~84%, with generalization to the external benchmark VLM4D.

4DWorldBench: A Comprehensive Evaluation Framework for 3D/4D World Generation Models

4DWorldBench proposes a unified, multimodal, physics-aware 4D world generation evaluation framework. By mapping text/image/video conditions into a unified textual space, it evaluates models across four dimensions: Perceptual Quality, Condition-4D Alignment, Physical Realism, and 4D Consistency. It employs an adaptive hybrid scoring strategy involving LLM-as-judge, MLLM-as-judge, and traditional metrics, validated by human subjective experiments to be more aligned with human judgment than existing benchmarks.

A3: Towards Advertising Aesthetic Assessment

The authors propose the A3 framework, which includes a theory-driven three-stage advertising aesthetic assessment paradigm A3-Law (Perceptive Attention → Formal Interest → Desire Impact), a dataset of 120,000 annotated samples (A3-Dataset), a model aligned via SFT and GRPO (A3-Align), and an evaluation benchmark (A3-Bench). It outperforms existing MLLMs in automated advertising aesthetic assessment.

A Closed-Form Solution for Debiasing Vision-Language Models with Utility Guarantees Across Modalities and Tasks

A closed-form solution for VLM debiasing is proposed, achieving Pareto optimal fairness and bounded utility loss through orthogonal decomposition of the attribute subspace in the cross-modal embedding space and Chebyshev scalarization. It is training-free and label-free, uniformly covering zero-shot classification, text-to-image retrieval, and text-to-image generation tasks.
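
The simplest instance of removing an attribute subspace is projecting out a single direction. The sketch below shows only that one-dimensional case and should be read as an assumption-laden illustration: the paper's closed form, its utility bound, and the Chebyshev scalarization are not reproduced. `remove_attribute_component` and the toy vectors are hypothetical.

```python
import math


def remove_attribute_component(embedding, attribute_direction):
    """Project an embedding onto the orthogonal complement of one attribute direction."""
    norm = math.sqrt(sum(a * a for a in attribute_direction))
    unit = [a / norm for a in attribute_direction]
    coeff = sum(e * u for e, u in zip(embedding, unit))   # component along the attribute
    return [e - coeff * u for e, u in zip(embedding, unit)]


# Toy embedding whose second coordinate carries the (hypothetical) attribute
embedding = [3.0, 4.0, 1.0]
direction = [0.0, 2.0, 0.0]
clean = remove_attribute_component(embedding, direction)
```

Because the operation is a fixed linear map, it needs neither training nor labels at inference time and applies equally to text and image embeddings in a shared space.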

A More Word-like Image Tokenization for MLLMs

DiVT replaces the MLP projector in LLaVA with a clustering-based visual projector, grouping ViT patch features into "visual words" based on semantics. Each cluster generates a single token, with the token count adaptively varying based on image complexity. Trained solely on language modeling objectives, it matches or exceeds full-resolution baselines across 8 multimodal benchmarks using 1/4 or even 1/40 of the visual tokens.

Abstract 3D Perception for Spatial Intelligence in Vision-Language Models

To address the deficiencies of VLMs in 3D spatial reasoning, this paper proposes the training-free SandboxVLM: it utilizes a video diffusion prior to generate multi-view sequences from a single 2D image, lifts key objects into sparse "abstract 3D bounding boxes," and renders them back to the VLM. This enables zero-shot understanding of 3D structures, achieving a 17.4% improvement over the baseline on SAT-Real.

Activation Matters: Test-time Activated Negative Labels for OOD Detection with Vision-Language Models

This paper introduces TANL (Test-time Activated Negative Labels), which dynamically evaluates the "activation level" of negative labels on OOD samples during test-time to mine the most effective labels. Combined with an activation-aware scoring function, it significantly reduces FPR95 from 17.5% to 9.8% on ImageNet benchmarks while remaining training-free and computationally efficient at inference.

Active Perceptual Inference: A Corticothalamic-Inspired Dynamic Nested Recurrent Network for Multimodal Sentiment Analysis with Incomplete Data

To address the "random frame-level missingness" problem in Multimodal Sentiment Analysis (MSA), this paper incorporates the human brain's "active perceptual inference" mechanism into the network. It proposes a dual-layer nested recurrent network, DNRNet: a local loop simulates intra-cortical pattern completion for intra-modality self-correction, while a global loop simulates the corticothalamic circuit to perform cross-modal weighted completion based on modality confidence. Two corrective signals are iteratively fed back into the input, upgrading "one-pass feedforward passive completion" to "multi-round active inference completion," achieving an average improvement of 1.5%–2.0% across various missing rates on MOSI/MOSEI/SIMS.

Adapting In-context Generation for Enhanced Composed Image Retrieval

This paper proposes DAIG: using 32 target domain samples to perform in-context fine-tuning (CIR-LoRA) on a pre-trained T2I model (Flux). This allows the model to synthesize "unbiased, domain-aligned" Composed Image Retrieval (CIR) triplets in batches. A two-stage training framework (feature-perturbed pre-training DRSP + angular margin fine-tuning FRA) is then used to feed these synthetic data into any off-the-shelf CIR model, achieving significant performance gains on CIRR/FashionIQ in a plug-and-play manner with zero additional inference cost.

Addressing Exacerbated Attention Sink for Source-Free Cross-Domain Few-Shot Learning

The authors observe that in source-free cross-domain few-shot learning (CDFSL) scenarios, standard few-shot fine-tuning on the target domain significantly exacerbates the attention sink of CLIP. The model concentrates attention on "simple tokens" that are inherently associated with all classes, leading to a loss of inter-class discriminability. To address this, TIR (Token Importance Recalibration) is proposed. It linearly reweights tokens between deep layers of the CLIP vision encoder based on their "cross-class activation" (Sum score). This suppresses sink tokens and amplifies discriminative tokens, achieving new SOTA results across four CDFSL benchmarks.

Browse all 388 Multimodal VLM papers →


🧠 VLM Reasoning (144)

A Causal Marriage between VLM and IRM from Understanding to Reasoning

Starting from token-level causal representations, this paper proves that a "vocabulary-constrained InfoNCE" is formally equivalent to the invariance principle of IRM. Based on this, it proposes CLIP-IRM, a mid-training paradigm that enhances OOD understanding without architectural changes, and transfers the OOD guarantees of IRM to multimodal reasoning by using its invariant alignment score as a process-level reward for GRPO.

A Multi-Agent Perception-Action Alliance for Efficient Long Video Reasoning

This paper proposes A4VL, a training-free multi-agent perception-action alliance framework. Through event-driven video chunking, clue-guided keyframe selection, and a multi-round agent negotiation-pruning mechanism, it consistently outperforms 28 baseline methods across five VideoQA benchmarks with significantly lower inference latency.

Act2See: Emergent Active Visual Perception for Video Reasoning

Through supervised fine-tuning, Act2See enables video VLMs to autonomously decide when to insert a video frame during the textual CoT reasoning process, either by retrieving a real evidence frame from the original video or by conditionally "imagining" a counterfactual frame. It matches or surpasses closed-source models of similar or even larger sizes on 5 video reasoning benchmarks, including VideoEspresso and ViTIB.

Adversarial Style Optimization: Enhancing VLM Jailbreaks by GRPO-based Stylistic Triggers Optimization

The authors identify a "stylistic inconsistency" vulnerability in VLMs—they can understand content in almost any artistic style, yet their safety alignment is easily bypassed by specific visual style triggers. Based on this, they propose ASO, which fine-tunes an image editing model using GRPO to overlay optimal styles onto existing adversarial images, consistently improving the Attack Success Rate (ASR) across four SOTA VLMs.

Agentic Video Summarization via Self-Reflecting Multimodal Understanding

This work reformulates video summarization from a "one-time importance score regression" into a "predict-verify-reflect" closed-loop workflow composed of three MLLM agents: Summarizer, Verifier, and Reflector. This allows the model to self-correct and retrieve missed keyframes, outperforming the previous SOTA on SumMe and TVSum in Kendall's \(\tau\) and Spearman's \(\rho\).

All Roads Lead to Rome: Incentivizing Divergent Thinking in Vision-Language Models

The authors observe that VLMs trained with GRPO, while achieving deeper reasoning in single trials, suffer from "diversity collapse" early in training—degenerating into a single dominant strategy. They propose MUPO (Multi-group Policy Optimization), which clusters sampled responses into multiple groups based on reasoning patterns, estimates local advantages within groups, and applies inter-group diversity rewards. This allows the model to maintain multiple problem-solving strategies while preserving depth, achieving an average improvement of 2-7% in acc@1/acc@4 across nine reasoning benchmarks.
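
The "local advantages within groups" step can be sketched as GRPO-style reward normalization applied per cluster instead of over the whole sample set. This is an assumption about the form of the estimator; the summary does not specify the clustering method or the inter-group diversity reward, so both are left out. `groupwise_advantages` and the toy labels are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev


def groupwise_advantages(rewards, group_ids, eps=1e-6):
    """Normalize each reward against the mean and std of its own group."""
    groups = defaultdict(list)
    for idx, g in enumerate(group_ids):
        groups[g].append(idx)
    adv = [0.0] * len(rewards)
    for idxs in groups.values():
        vals = [rewards[i] for i in idxs]
        mu, sd = mean(vals), pstdev(vals)
        for i in idxs:
            adv[i] = (rewards[i] - mu) / (sd + eps)
    return adv


rewards = [1.0, 0.0, 1.0, 0.0, 0.0, 1.0]
group_ids = [0, 0, 0, 1, 1, 1]   # hypothetical cluster labels of reasoning patterns
adv = groupwise_advantages(rewards, group_ids)
```

Normalizing within a group means a correct answer from a weaker strategy still earns a positive advantage, so that strategy is not crowded out by a dominant one.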

ANTS: Adaptive Negative Textual Space Shaping for OOD Detection via Test-Time MLLM Understanding and Reasoning

ANTS allows Multimodal Large Language Models (MLLM) to "understand" cached suspected OOD images at test-time. It generates "descriptive negative sentences" to characterize far-OOD and "visually similar negative labels" to characterize near-OOD. These two negative textual spaces are dynamically fused via an adaptive weight. On the ImageNet benchmark, ANTS achieves a zero-shot, training-free 3.1% reduction in FPR95, setting a new SOTA.

ARM-Thinker: Reinforcing Multimodal Generative Reward Models with Agentic Tool Use and Visual Reasoning

ARM-Thinker transforms the multimodal reward model from a "one-pass scoring" system into an agent that actively invokes tools (crop-and-zoom, document retrieval, instruction verification) to seek evidence. Using a two-stage GRPO training strategy—encouraging tool usage followed by refining accuracy—the 7B model achieves average gains of +16.2%, +9.6%, and +4.2% across reward modeling, think-with-images, and general reasoning benchmarks, respectively, matching or even surpassing GPT-4o on reward and tool-use benchmarks.

AV-Reasoner: Improving and Benchmarking Clue-Grounded Audio-Visual Counting for MLLMs

Addressing the "counting deficiency" in multimodal large language models (MLLMs), this work introduces CG-AV-Counting, the first interpretable counting benchmark for long videos across audio-visual modalities with fine-grained "counting clue" annotations. It also proposes AV-Reasoner, which leverages GRPO and curriculum learning to transfer counting capabilities from related tasks such as localization and QA. While AV-Reasoner achieves SOTA on several audio-visual reasoning benchmarks, the paper candidly reports that explicit reasoning in the language space offers little help out-of-distribution.

AXG-Reasoner: Error Detection and Explanation in Long Task Videos with Vision-Language Models

To address the challenge of "detecting and explaining user operation errors in long task videos," this paper utilizes a frozen VLM combined with an automatically constructed "Action Execution Graph (AXG)" and temporal action segmentation. By decomposing each action segment into fine-grained sub-actions and querying the VLM only on keyframes of these sub-actions, the model focuses on sparse spatial-temporal error clues. It achieves SOTA performance in error explanation and detection on EgoPER and CaptainCook4D, significantly surpassing VLM baselines.

Browse all 144 VLM Reasoning papers →


⚡ VLM Efficiency (62)

Accelerating Streaming Video Large Language Models via Hierarchical Token Compression

To address the slow real-time deployment of streaming Video Large Language Models (Streaming VideoLLM), this paper proposes STC, a plug-and-play two-level token compression framework. STC-Cacher caches and reuses static features from adjacent frames during the ViT encoding stage, recomputing only dynamic tokens. STC-Pruner utilizes "spatio-temporal dual anchors" to prune redundant tokens before entering the LLM. STC maintains approximately 99% accuracy on ReKV while reducing ViT encoding latency by 24.5% and LLM pre-filling latency by 45.3%.

Adapting Lightweight Image-based Counting Models for Video Crowd Counting

This paper avoids adding any temporal modules to Video Crowd Counting (VCC). Instead, it analytically formulates the spatiotemporal prior—that "crowd count changes between adjacent frames should be bounded"—as a frequency-domain statistical regulator based on the Characteristic Function (ChF). This regulator constrains a lightweight Image Crowd Counting (ICC) model only during training, while inference remains single-frame. It achieves SOTA accuracy across six datasets while reaching an inference frame rate of 99.5 fps.

AdaptVision: Efficient Vision-Language Models via Adaptive Visual Acquisition

AdaptVision is proposed to enable VLMs to autonomously determine the minimum number of visual tokens required for each sample through a coarse-to-fine active vision mechanism and reinforcement learning. Combined with Decoupled Turn Policy Optimization (DTPO), it achieves an optimal balance between efficiency and accuracy.

ApET: Approximation-Error Guided Token Compression for Efficient VLMs

From an information-theoretic perspective, this paper proposes a visual token importance evaluation method based on linear approximation reconstruction error. It does not rely on attention weights, making it naturally compatible with FlashAttention. On LLaVA-1.5, it maintains 95.2% performance while compressing 88.9% of visual tokens.

Attention-aware Inference Optimizations for Large Vision-Language Models with Memory-efficient Decoding

AttentionPack leverages the observation that LVLM KV caches (especially those of vision tokens) are inherently low-rank. It compresses the cache along the hidden dimension using SVD via "multi-head concatenation + modality separation" and employs an "attention-aware partial decompression" strategy based on cumulative attention scores to select ranks on demand. Without significant performance loss, it reduces memory consumption to 1/5–1/8 of the original, supporting larger batches and longer contexts and achieving up to a 74% increase in decoding throughput.

Better, Stronger, Faster: Tackling the Trilemma in MLLM-based Segmentation with Simultaneous Textual Mask Prediction

STAMP reformulates MLLM-based segmentation as a parallel "cloze" classification task for all image patches. By simultaneously predicting the entire mask using a single non-autoregressive forward pass, it achieves high segmentation precision and fast inference speed without compromising conversational capabilities, effectively resolving the long-standing "dialogue/performance/speed" trilemma in MLLM segmentation.

Blink: Dynamic Visual Token Resolution for Enhanced Multimodal Understanding

The Blink framework is proposed to adaptively enhance visual perception in a single forward pass by dynamically expanding and discarding visual tokens across different Transformer layers of MLLMs (mimicking human "rapid-blink" scanning), improving LLaVA-1.5 performance across multiple multimodal benchmarks.

Co-Me: Confidence Guided Token Merging for Visual Geometric Transformers

Co-Me equips visual geometric Transformers like VGGT and π3 with a lightweight "confidence predictor." It merges patch tokens that the network deems unimportant (low confidence) into a single token before passing them into the latter half of the network. This accelerates both attention and MLP without retraining or altering the backbone structure, achieving up to 21.5× speedup on VGGT with negligible accuracy loss.

CoIn: Coverage and Informativeness-Guided Token Reduction for Efficient Large Multimodal Models

This paper reformulates visual token reduction in large multimodal models (LMMs) as an "optimal subset selection" problem. It uses informativeness (visual saliency + cross-modal alignment) to score each token and coverage (log-det volume) to ensure the selected subset spans the feature space. A compact subset is then selected end-to-end via greedy submodular optimization—requiring no training, being independent of attention mechanisms, and compatible with FlashAttention/KV cache. On LLaVA-NeXT-7B, pruning 94.4% of visual tokens retains 86.7% performance with a 6.5× prefill speedup.

CORE: Compact Object-centric REpresentations as a New Paradigm for Token Merging in LVLMs

CORE shifts the visual token compression of LVLMs from "merging individual tokens by feature similarity" to "merging by objects." By utilizing a built-in segmentation head to generate masks for each object, it performs weighted averaging of tokens within the same object into a single compact token, combined with centroid sorting to preserve spatial order. It achieves SOTA performance in fixed-rate compression across six benchmarks; under extreme compression, it maintains 97.4% of the baseline performance while retaining only 2.2% of tokens.
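
A minimal sketch of object-level merging, assuming per-token object ids are already available from some segmentation mask: tokens of the same object are averaged into one token, and merged tokens are ordered by centroid in row-major order. Uniform averaging is used here for simplicity, whereas the summary describes a weighted average; `merge_tokens_by_object` and the toy grid are hypothetical.

```python
def merge_tokens_by_object(tokens, positions, object_ids):
    """Average tokens sharing an object id, then sort merged tokens by centroid."""
    buckets = {}
    for tok, (row, col), obj in zip(tokens, positions, object_ids):
        b = buckets.setdefault(
            obj, {"sum": [0.0] * len(tok), "row": 0.0, "col": 0.0, "n": 0}
        )
        b["sum"] = [s + t for s, t in zip(b["sum"], tok)]
        b["row"] += row
        b["col"] += col
        b["n"] += 1
    merged = []
    for obj, b in buckets.items():
        n = b["n"]
        centroid = (b["row"] / n, b["col"] / n)
        merged.append((centroid, obj, [s / n for s in b["sum"]]))
    merged.sort(key=lambda item: item[0])   # row-major order keeps spatial layout
    return [(obj, vec) for _, obj, vec in merged]


# 2x2 patch grid: left column belongs to "dog", right column to "cat"
tokens = [[1.0, 0.0], [0.0, 2.0], [3.0, 0.0], [0.0, 4.0]]
positions = [(0, 0), (0, 1), (1, 0), (1, 1)]
object_ids = ["dog", "cat", "dog", "cat"]
merged = merge_tokens_by_object(tokens, positions, object_ids)
```

Four patch tokens collapse to two object tokens here; the token count after merging depends on the number of objects rather than on the image resolution.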

Browse all 62 VLM Efficiency papers →


🎵 Audio & Speech (22)

AMUSE: Audio-Visual Benchmark and Alignment Framework for Agentic Multi-Speaker Understanding

This paper introduces AMUSE—an audio-visual benchmark for "multi-speaker, dialogue-dense" scenarios (6 agentic tasks × Zero-shot/Guided/Agentic evaluation modes), revealing systematic weaknesses in mainstream MLLMs like GPT-4o and Qwen3-Omni regarding "who is speaking, when, and cross-scene causality." It also proposes the RAFT alignment framework (Reflective Reward + Selective Reasoning Adaptation), which improves the accuracy of open-source models on this benchmark by up to 39.52% (relative) using minimal annotations.

AudioStory: Generating Long-Form Narrative Audio with Large Language Models

AudioStory integrates LLM narrative reasoning with a DiT diffusion audio generator into an end-to-end framework. The LLM first decomposes complex instructions into timestamped sub-events, then generates short audio segments sequentially to form long-form narrative audio. Decoupled bridging via "semantic tokens + residual tokens" ensures intra-segment alignment and cross-segment coherence, enabling stable generation of multi-scene audio stories up to 150 seconds.

BabyVLM-V2: Toward Developmentally Grounded Pretraining and Benchmarking of Vision Foundation Models

The BabyVLM-V2 framework constructs three formats of pretraining data (768K image pairs + 181K video pairs + 63K interleaved sequences) from SAYCam, a longitudinal corpus recorded from an infant's first-person perspective. It also designs the DevCV Toolbox (10 developmental cognitive tasks) based on the NIH Baby Toolbox®. A compact model trained from scratch surpasses GPT-4o on certain mathematical tasks, marking the first systematic exploration of Artificial Developmental Intelligence (ADI).

Cleaning the Pool: Progressive Filtering of Unlabeled Pools in Deep Active Learning

The authors propose Refine, an ensemble active learning method that consistently outperforms individual AL strategies and existing ensemble methods. It employs a two-stage strategy: progressive filtering (iterative refinement of the unlabeled pool using multiple strategies) followed by coverage selection (selecting high-value diverse samples from the refined pool) without requiring prior knowledge of the optimal strategy.

Echoes Over Time: Unlocking Length Generalization in Video-to-Audio Generation Models

This paper proposes MMHNet, a multi-modal hierarchical network built on non-causal Mamba-2. It achieves length generalization: trained on short segments (8s), it generates high-quality, aligned audio for long videos (5+ minutes), significantly outperforming existing methods on the UnAV100 and LongVale benchmarks.

EchoFoley: Event-Centric Hierarchical Control for Video Grounded Creative Sound Generation

Addressing the issues of "visual dominance, inability to understand text instructions, and lack of fine-grained editing" in existing video-to-audio models, this paper proposes the EchoFoley task (using symbolic "sound event" representations + three levels of control granularity) along with a densely annotated benchmark of 6k samples. It designs EchoVidia, a training-free agentic framework (using slow-fast thinking + an action pool), which improves controllability by approximately 40.7% and perceptual quality by 12.5% over the strongest baseline.

FoleyDirector: Fine-Grained Temporal Steering for Video-to-Audio Generation via Structured Scripts

FoleyDirector attaches a pluggable adapter to a pre-trained DiT-based V2A generator (MMAudio), utilizing "director's script"-style per-second Structured Temporal Scripts (STS) to supplement visual cues and realize precise temporal control over sound occurrence. By employing dual-stream parallel rendering for on-screen/off-screen sounds, it raises the control F1 score on DirectorBench from 0.2451 to 0.4819 with almost no degradation in original audio quality.

GEM-TFL: Bridging Weak and Full Supervision for Forgery Localization

GEM-TFL is proposed to bridge the gap between weak and full supervision through a two-stage classification-regression framework. It incorporates three modules: EM-based decomposition of binary labels into multi-dimensional latent attributes, training-free Temporal Consistency Refinement, and Graph-diffusion Proposal Refinement, achieving a 4-8% average mAP improvement in weakly supervised temporal forgery localization.

Hear What You See: Video-to-Audio Generation with Diffusion Transformer and Semantic-Temporal Alignment-Ranked Direct Preference Optimization

VisioSonic utilizes a dual-stream condition of "CLIP low-frame-rate semantics + Synchformer high-frame-rate temporal" fed into a video-text-audio co-attention Diffusion Transformer for rectified flow matching to generate dubbing for silent videos. It further maximizes semantic and temporal alignment using STAR-DPO, a fully automated preference optimization requiring no human annotation. With only 151M trainable parameters (the fewest among similar works), it achieves the strongest distribution matching and audio-visual synchronization on VGGSound.

Hierarchical Codec Diffusion for Video-to-Speech Generation

HiCoDiT reframes "silent video to speech" generation as a masked diffusion task that proceeds layer-by-layer along the RVQ discrete token hierarchy. Lower-level tokens handle content and timbre under lip-motion and identity guidance, while higher-level tokens manage prosody via dual-scale AdaLN modulation of expressions. This approach achieves leading performance in naturalness, intelligibility, and lip-sync on LRS2/LRS3 through zero-shot cross-dataset evaluation.

Browse all 22 Audio & Speech papers →


🔎 AIGC Detection (7)

Enabling Supervised Learning of Generative Signatures for Generalized AI-Generated Images Detection

To address the deadlock where "generative traces in AI-generated images lack clean pairs and cannot be extracted via supervised learning," this paper uses a randomly-structured image reconstructor to artificially "create traces" on real images. The reconstruction residuals are treated as pseudo-labels to train a generative signature (GenSign) extractor, followed by a GenSign + RGB dual-stream classifier for detection, achieving SOTA cross-model generalization across four benchmarks.

Fine-grained Image Aesthetic Assessment: Learning Discriminative Scores from Relative Ranks

This work defines the new task of "Fine-grained Image Aesthetic Assessment" and constructs the FGAesthetics benchmark containing 32,217 images across 10,028 series. It proposes the FGAesQ model, which learns discriminative aesthetic scores from relative ranks through Difference-Preserving Tokenization (DiffToken), Contrastive Text-aligned Alignment (CTAlign), and Rank-Aware Regression (RankReg). The model achieves an accuracy of 0.779 in fine-grained scenarios while maintaining a coarse-grained SRCC of 0.770.

Inconsistency-aware Multimodal Schrodinger Bridge for Deepfake Localization

IaMSB reformulates "temporal interval localization" of audio-visual deepfakes as a Schrödinger Bridge (SB) generation problem. It reads cross-modal consistency scores directly from the bridge's transport cost and asymmetrically allocates computation steps to the more suspicious modality, resulting in a 3-10% gain over existing methods on strict-IoU metrics.

Learning Forgery-Aware Lip Representations Without Forgery Priors

To address the vulnerability of speaker authentication systems to personalized Talking Face Generation (TFG) forgeries, this paper proposes a detector trained solely on real videos without relying on any forgery samples. By combining mixed-fake lip generation, asymmetric contrastive learning, and Gaussian regularization, the real lip motion features are compressed into a compact hypersphere. Anything outside the sphere (forgeries and impostors) is treated as an outlier, reducing the error rate by over 10% against 8 modern forgeries compared to 10 SOTA methods.

Learning Where to Look and How to Judge: Resolution-agnostic Image Quality Assessment with Quality-aware Saliency

To address four common issues in No-Reference Image Quality Assessment (NR-IQA)—forced resizing to accommodate pre-trained resolutions, poor generalization across resolutions, difficulty in joint training due to inconsistent MOS scales, and computational explosion for UHD images—this paper proposes ReLIQS. It samples fixed-size patches from the original resolution and its scaled variants, encoding them with CLIP. A lightweight "Perceptual Importance Estimator (PIE)" learns IQA-specific saliency to select a few key patches, while a "Latent Quality Axis Module (LQAM)" aggregates multi-scale embeddings into a single score. ReLIQS outperforms CNN, CLIP, and MLLM-based baselines across various real/synthetic/AIGC distortions and resolutions with lower computational cost.

Locate-Then-Examine: Grounded Region Reasoning Improves Detection of AI-Generated Images

LTE enables Vision-Language Models to first perform a "global scan to locate suspicious regions" and then "zoom in and crop to re-examine for the final verdict." It upgrades one-time classification into a two-stage region-grounded reasoning process. Accompanied by the TRACE dataset containing box-level annotations and forensic explanations, it achieves simultaneous improvements in accuracy, robustness, and interpretability.

PPM-CLIP: Probabilistic Prompt Modeling for Generalizable AI-Generated Image Detection

PPM-CLIP replaces the "discriminative static boundary" paradigm with "generative probabilistic inference." It utilizes normalizing flows to generate a family of adaptive prompts (multiple hypotheses) for each image and determines the results by averaging cosine similarities to marginalize noise. Combined with frequency-guided patch-wise contrastive learning, it forces the CLIP encoder to capture high-frequency forgery traces, significantly outperforming SOTA in cross-generator generalization on Ojha, GenImage, and DRCT.


🧊 3D Vision (646)

240FPS Stereo Vision from Monocular Mixed Spikes

A single monocular spike camera is used to optically mix left and right views onto the same sensor, with one view subjected to periodic 60 Hz modulation. Through a two-stage process—"Least Squares Baseline Decoupling + SMS-Net Depth Refinement"—a 240 FPS binocular video is reconstructed from the mixed spike stream. This approach maintains the compact hardware and data efficiency of a monocular setup while achieving depth estimation accuracy close to the "theoretical upper bound."

2D-LFM: Lifting Foundation Model without 3D Supervision

By injecting "correspondence positional encodings" into every layer of a Transformer, this work trains the first cross-category 2D→3D lifting foundation model using only 2D keypoints (without any 3D ground truth). It outperforms large models like VGGT that rely on RGB depth in object-level geometry (Pascal3D+ 8.1mm vs. VGGT 89.4mm).

3D-Aware Multi-Task Learning with Cross-View Correlations for Dense Scene Understanding

This work appends a lightweight, task-agnostic "geometric bypass," the Cross-View Module (CvM, consisting of a spatial-aware encoder + multi-view Transformer + cost volume), to standard Multi-Task Learning (MTL) networks. By injecting geometric correspondences between adjacent views into shared features as a geometric-consistency signal, the single network develops a better "understanding of 3D" when simultaneously predicting depth, segmentation, surface normals, and boundaries. This yields plug-and-play performance gains on NYUv2 and PASCAL-Context (max \(\Delta\)MTL +3.09).

3D-Fixer: Coarse-to-Fine In-place Completion for 3D Scenes from a Single Image

A new "in-place completion" paradigm is proposed, extending pre-trained object-level generative priors to the scene level. It directly completes fragmented geometry at its original location without explicit pose alignment. A large-scale scene dataset, ARSG-110K, is also constructed, and the method significantly outperforms baselines like MIDI and Gen3DSR.

3D-IDE: 3D Implicit Depth Emergent

The "Implicit Geometric Emergence Principle" (IGEP) is proposed. By utilizing a lightweight geometric verifier and a global 3D teacher for privileged supervision during training, the visual encoder develops 3D perception capabilities using only RGB video input. This achieves zero latency overhead during inference and outperforms comparable methods on several 3D scene understanding benchmarks.

3D-Object Perception Transformer (3PT)

3PT replaces the existing zero-shot 3D object perception pipelines—often characterized by "assembled frozen foundation models + depth dependency"—with a unified, end-to-end trained Transformer framework (detection + object grouping + iterative refinement) directly conditioned on CAD models. Relying solely on multi-view RGB, it significantly outperforms SOTA in detection and 6DoF pose on BOP benchmarks (with a relative improvement of 56.5% in AP-mm for industrial datasets), securing 7 first-place rankings across 11 tracks in the BOP Challenge 2025.

3D-VCD: Hallucination Mitigation in 3D-LLM Embodied Agents through Visual Contrastive Decoding

3D-VCD is the first inference-time hallucination mitigation framework for 3D embodied agents. It applies semantic/geometric perturbations to object-centric 3D scene graphs to generate a "corrupted" negative sample context. By running the MLLM on both the original and perturbed graphs and using a contrastive decoding formula, it suppresses tokens that maintain "high probability even when the scene changes." This method requires no retraining, incurs nearly zero additional overhead, and significantly reduces over-affirmation and object hallucinations in 3D-POPE and HEAL.
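
The summary does not give the contrastive decoding formula, so the sketch below assumes the form commonly used in visual contrastive decoding, \((1+\alpha)\,\ell_{\text{orig}} - \alpha\,\ell_{\text{pert}}\), applied to next-token logits. The function name and the value of `alpha` are hypothetical.

```python
def contrastive_logits(logits_original, logits_perturbed, alpha=1.0):
    """Penalize tokens whose logits stay high when the scene graph is corrupted.

    Assumed form: (1 + alpha) * original - alpha * perturbed, per vocabulary entry.
    """
    return [
        (1.0 + alpha) * lo - alpha * lp
        for lo, lp in zip(logits_original, logits_perturbed)
    ]


# Token 0 is equally likely with or without the real scene (a prior-driven guess);
# token 1 depends on the scene and collapses once the graph is perturbed.
original = [3.0, 2.0]
perturbed = [3.0, 0.0]
adjusted = contrastive_logits(original, perturbed)
best = max(range(len(adjusted)), key=adjusted.__getitem__)
print(adjusted, best)  # [3.0, 4.0] 1
```

In this toy case the scene-dependent token overtakes the scene-independent one after adjustment, which is the intended suppression effect.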

3D Gaussian Splatting at Arbitrary Resolutions with Compact Proxy Anchors

Building upon the anchor-based framework of Scaffold-GS, this paper employs FiLM to inject "target resolution" into anchor features and introduces a "Pixel Coverage Gate" to dynamically activate Gaussians based on sampling rates, achieving aliasing-free rendering at continuous arbitrary resolutions. Simultaneously, the method stores only approximately 30% of proxy anchors and utilizes a residual predictor to reconstruct the remaining leaf anchors online, reducing storage to nearly half of Scaffold-GS without compromising quality.

Nope-SGS: 3D Gaussian Reconstruction from Unposed Spike Streams

This paper introduces Nope-SGS, the first framework for reconstructing high-speed 3D scenes directly from raw spike camera streams without camera pose priors. By remodeling spike imaging as a binomial distribution, it recovers a stable Normalized Binomial Distribution Spike (NBDS) supervision signal from unstable single-frame spikes. Combined with key-frame selection and progressive optimization, it simultaneously solves for camera trajectories and 3D Gaussians. Compared to SOTA, it improves PSNR by up to 7.4 dB, reduces ATE by 40%, and is the fastest among spike-based methods.

3D Gaussian Splatting with Self-Constrained Priors for High Fidelity Surface Reconstruction

This paper proposes the Self-Constrained Prior (SCP), which constructs a TSDF distance field by fusing depth maps rendered from the current 3D Gaussians. This field serves as a prior to impose geometry-aware constraints (outlier removal, opacity constraints, and movement toward the surface) on Gaussians, achieving SOTA high-fidelity surface reconstruction on NeRF-Synthetic and DTU datasets.

Browse all 646 3D Vision papers →


🎯 Object Detection (97)

A Closer Look at Cross-Domain Few-Shot Object Detection: Fine-Tuning Matters and Parallel Decoder Helps

This paper proposes the Hybrid Ensemble Decoder (HED) and a progressive fine-tuning strategy for cross-domain few-shot object detection (CD-FSOD). By parallelizing part of the decoding layers and introducing prediction diversity through randomly initialized denoising queries, the method achieves SOTA performance on CD-FSOD, ODinW-13, and RF100-VL benchmarks without introducing any additional parameters.

A Semantically Disentangled Unified Model for Multi-category 3D Anomaly Detection

The SeDiR framework is proposed to achieve semantically disentangled unified 3D anomaly detection through three modules: Coarse-to-Fine Global Tokenization (CFGT), Category-Conditional Contrastive Learning (C3L), and Geometric-Guided Decoder (GGD). It addresses the Inter-Category Entanglement (ICE) problem and outperforms SOTA by 2.8% and 9.1% AUROC on Real3D-AD and Anomaly-ShapeNet, respectively.

AKCMamba-YOLO: Selective State Space Models For Real-Time Object Detection

This paper integrates Selective State Space Models (Mamba/SSM) and adaptive kernel convolutions into YOLOv8. By replacing the C2f blocks in the backbone and neck with 3CAKCMamba and 4CAKCMamba modules, it compensates for the "short-range" limitation of standard convolutions while maintaining linear complexity and real-time speed. On COCO2017, the model achieves 46.3% mAP with 14.9G FLOPs (a 1.4% mAP improvement with 47.9% fewer FLOPs compared to YOLOv8-S).

Anomaly as Non-Conformity via Training-Free Graph Laplacian Energy Minimization

ANoCo redefines anomaly detection from "how similar is this patch to normal ones" to "how much cost is required to pull this patch back to the normal manifold." By minimizing an anchored bipartite graph Laplacian energy to pull query patches toward the normal manifold, the displacement magnitude itself serves as the anomaly score. This approach requires no training, no message passing, and provides a closed-form solution, achieving new SOTA results on MVTec-AD / VisA in 1/2/4-shot settings.
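
To make the "displacement as anomaly score" idea concrete, here is a minimal sketch under an assumed energy, not the paper's exact formulation: \(E(z) = \mu\,\lVert z - x\rVert^2 + \sum_j w_j \lVert z - n_j\rVert^2\), where \(x\) is the query patch, \(n_j\) are normal patches, \(w_j\) are edge weights, and \(\mu\) anchors \(z\) to the query. Setting the gradient to zero gives \(z^\ast = (\mu x + \sum_j w_j n_j) / (\mu + \sum_j w_j)\). All names below are hypothetical.

```python
import math


def pull_to_normal(query, normals, weights, anchor=1.0):
    """Closed-form minimizer of the assumed anchored energy, plus the displacement.

    Returns (z_star, score), where score = ||z_star - query|| is the anomaly score.
    """
    total = anchor + sum(weights)
    z_star = [
        (anchor * query[d] + sum(w * n[d] for w, n in zip(weights, normals))) / total
        for d in range(len(query))
    ]
    score = math.sqrt(sum((z - q) ** 2 for z, q in zip(z_star, query)))
    return z_star, score


normals = [[2.0, 0.0]]
_, far_score = pull_to_normal([0.0, 0.0], normals, weights=[1.0])
_, near_score = pull_to_normal([2.0, 0.0], normals, weights=[1.0])
print(far_score, near_score)  # 1.0 0.0
```

A patch that already lies on the normal set does not move, so its score is zero, while a distant patch is dragged toward it and scores higher. No training or iterative message passing is involved in this sketch.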

AnomalyVFM -- Transforming Vision Foundation Models into Zero-Shot Anomaly Detectors

AnomalyVFM proposes a general framework that transforms any Vision Foundation Model (VFM) into a robust zero-shot anomaly detector through a three-stage synthetic data generation scheme and a parameter-efficient LoRA adaptation mechanism. Using RADIO as the backbone, it achieves 94.1% image-level AUROC on 9 industrial datasets, outperforming the SOTA by 3.3 percentage points.

AR²-4FV: Anchored Referring and Re-identification for Long-Term Grounding in Fixed-View Videos

This work leverages the time-invariance of background structures in fixed-view videos to construct an offline Anchor Bank and an online Anchor Map as persistent language-scene memory. Combined with anchor-guided re-entry priors and a ReID-Gating identity verification mechanism, it achieves robust target re-capture after occlusion or departure, improving RCR by 10.3% and reducing RCL by 24.2%.

Audio-sync Video Instance Editing with Granularity-Aware Mask Refiner

AVI-Edit performs "audio-visual synchronized instance-level video editing" on a pre-trained video diffusion backbone. It utilizes a Granularity-Aware Mask Refiner to progressively refine rough user-provided masks (even bounding boxes) into precise instance contours, paired with a Self-Feedback Audio Agent (a separate-generate-remix-rework pipeline) to produce accompanying audio temporally aligned with the edited visuals. It significantly outperforms existing methods in visual quality, condition following, and audio-visual synchronization.

Back to Point: Exploring Point-Language Models for Zero-Shot 3D Anomaly Detection

BTP applies pre-trained Point-Language Models (PLM, e.g., ULIP) to zero-shot 3D anomaly detection for the first time. It proposes a Multi-Granularity Feature Embedding Module (MGFEM) to fuse patch-level semantics, geometric descriptors, and global CLS tokens. Combined with a joint representation learning strategy, it achieves 84.5% point-level AUROC on Real3D-AD, significantly surpassing the VLM-based rendering approach of PointAD (73.5%).

Balanced Hierarchical Contrastive Learning with Decoupled Queries for Fine-grained Object Detection in Remote Sensing Images

This work embeds the hierarchical label tree from remote sensing fine-grained detection into the representation space of DETR. A "Balanced Hierarchical Contrastive Loss" (BHCL) is proposed to achieve gradient balancing via learnable class prototypes, combined with a strategy that decouples classification and localization queries. This allows contrastive learning to act solely on the classification branch without interfering with localization, reaching new SOTA on three hierarchically labeled remote sensing datasets.

BDNet: Bio-Inspired Dual-Backbone Small Object Detection Network

BDNet mimics the LGN/V1–V2–V4 color pathway and the V1–V4 edge pathway of the human visual system to construct a dual-backbone detection network featuring "color enhancement + edge strengthening + hierarchical fusion." Designed to remedy the insufficient feature extraction caused by low color contrast and blurred edges of small objects in remote sensing, it achieves SOTA results on VisDrone2019, NWPU VHR-10, and AI-TODv2 datasets with only 2.59M parameters.

Browse all 97 Object Detection papers →


✂️ Segmentation (117)

3M-TI: High-Quality Mobile Thermal Imaging via Calibration-free Multi-Camera Cross-Modal Diffusion

This paper proposes 3M-TI, a calibration-free multi-camera cross-modal diffusion framework. It automatically aligns and fuses uncalibrated RGB-thermal image pairs in the VAE latent space via Cross-modal Self-Attention (CSM). Combined with a misalignment augmentation strategy, it achieves SOTA on mobile thermal super-resolution tasks and significantly improves downstream object detection and semantic segmentation performance.

A Mixed Diet Makes DINO An Omnivorous Vision Encoder

An Omnivorous Vision Encoder is proposed, which performs cross-modal alignment distillation training (RGB/Depth/Segmentation) on top of a frozen DINOv2 via a lightweight adapter. This enables a single encoder to produce consistent embeddings for diverse visual modalities while preserving original discriminative semantics.

Annotation-Efficient Coreset Selection for Context-dependent Segmentation

Focusing on the extremely high annotation cost in "context-dependent" segmentation tasks like camouflaged objects and medical lesions, this paper assigns an "importance score" to each image via point-annotation-based Optimal Transport. A Max-Distance Entropy strategy is then used to select a coreset (CostSet) that balances coverage and diversity. At a 40% pruning rate, it only loses approximately 1% IoU compared to full training.

Bayesian Decomposition and Semantic Completion for Few-shot Semantic Segmentation

The authors decompose Few-shot Semantic Segmentation (FSS) into three lightweight probabilistic terms—Prior, Likelihood, and Class Consistency—using the Bayesian formula. The method utilizes SAM to generate structured candidate regions, a small binary classification network (CALM) to estimate likelihood and consistency simultaneously, and a Semantic Completion Module (SCM) to merge regional fragments into a complete mask. It achieves SOTA performance on PASCAL-5\(^i\) and COCO-20\(^i\) with high efficiency.

Beyond Appearance: Camouflaged Object Detection via Geometric Structure

DepthSAM adapts the monocular depth estimation (MDE) foundation model, Depth Anything v2, for camouflaged object detection. By freezing the backbone and injecting Sparse Mixture-of-Experts Adapters (SMEA), it pivots the task from "reconstructing the entire scene geometry" to "highlighting camouflaged object geometry." A Geometric-Semantic Fusion Module (GSFM) is then used to align geometric cues with semantic information, achieving new SOTA results on COD10K, CAMO, and NC4K benchmarks (surpassing the runner-up by 3.0% \(S_\alpha\) and 4.3% \(F^\omega_\beta\) on COD10K).

Beyond Text: Visual Description Assembly by Probabilistic Model for CLIP-based Weakly Supervised Semantic Segmentation

To address the issues of "modality gap between text prototypes and visual features" and "static text failing to adapt to diverse instances" in CLIP-based weakly supervised segmentation, this paper uses an Invertible Neural Network to model CLIP visual features as a Hierarchical Gaussian Mixture Model (H-GMM). It explicitly decouples intra-class attributes in the visual space, dynamically assembles them into visual description prototypes based on instance responses to replace text queries, and adaptively reverts to text anchors using density weights. It achieves new SOTAs of 79.9%/51.4% mIoU on VOC/COCO for single-stage WSSS.

BiPA: Bilevel Prompt Adaptation for Underwater Instance Segmentation

BiPA reformulates SAM's dense prompt learning as a bilevel optimization problem with "prompts at the upper level and model parameters at the lower level." It employs Bayesian optimization and a two-stage training strategy to make the problem solvable, combined with a Foreground Attention Injection (FAI) module to restore local details. This efficiently transfers the general SAM to severely degraded underwater scenes, achieving mAP scores that comprehensively surpass previous SOTAs on UIIS and USIS10K datasets.

AFRO: Bootstrap Dynamic-Aware 3D Visual Representation for Scalable Robot Learning

This paper proposes AFRO, a self-supervised 3D visual pre-training framework. By employing an Inverse Dynamics Model (IDM) to infer latent actions, a Forward Dynamics Model (FDM) based on Diffusion Transformers to predict future features, and an inverse consistency constraint to ensure temporal symmetry, the method achieves an average success rate of 76.0% on MetaWorld 14 tasks after pre-training on the large-scale RH20T dataset (vs. 64.9% for DynaMo-3D and 63.9% for PointMAE). It also achieves state-of-the-art results on four real-world tasks.

Bootstrap Your Own AV-Proxies: Adaptive Contrastive and Prototype Learning for Audio-Visual Segmentation

Addressing the "intra-modal noise + audio-visual semantic gap" in audio-visual segmentation (AVS), this paper proposes BYOAVP. It utilizes BYOL-style negative-free contrastive learning (SSAE) to allow high-level visual semantics to supervise audio, suppressing off-screen/background noise. Additionally, it employs momentum-updated dynamic prototypes (DPC) for pixel-level classification and cross-modal reinforcement of sounding regions. Without any priors like SAM or offline prototypes, it achieves SOTA performance across six sub-tasks on AVSBench and VPO datasets.

Brewing Stronger Features: Dual-Teacher Distillation for Multispectral Earth Observation

This paper proposes DEO (Distillation for Earth Observation), a dual-teacher contrastive distillation framework. It utilizes a multispectral self-distillation teacher to learn spectral representations and an optical VFM teacher (DINOv3) to inject high-level semantic priors. This enables a single student network to excel in both optical and multispectral remote sensing tasks, achieving SOTA across semantic segmentation, change detection, and classification.

Browse all 117 Segmentation papers →


🖼️ Image Restoration (107)

2-Shots in the Dark: Low-Light Denoising with Minimal Data Acquisition

This paper proposes a "two-shots are enough" sensor noise synthesis method—requiring only one noise image and one dark frame per ISO. It synthesizes signal-independent noise as a texture using random phase sampling in the Fourier domain, complemented by iterative histogram matching to correct marginal distributions. This allows for the generation of infinitely diverse training pairs without large-scale paired datasets, enabling denoising networks to achieve SOTA performance among physics-based methods on several low-light benchmarks.

AceTone: Bridging Words and Colors for Conditional Image Grading

AceTone is proposed as the first unified framework supporting multimodal conditional color grading for both text and reference images. It compresses 3D-LUT into 64 discrete tokens via VQ-VAE, trains a VLM to predict LUT token sequences, and utilizes GRPO reinforcement learning to align color similarity and aesthetic preferences, achieving a 50% LPIPS improvement in style transfer and instruction-based grading.

Beyond Ground-Truth: Leveraging Image Quality Priors for Real-World Image Restoration

The IQPIR framework is proposed, which introduces Image Quality Priors (IQP) from pre-trained NR-IQA models as conditioning signals. Through three mechanisms—a quality-conditioned Transformer, a dual-codebook structure, and quality optimization in discrete representation space—the model guides the restoration process toward the highest perceptual quality, comprehensively outperforming SOTA on tasks such as blind face restoration.

Beyond Strict Pairing: Arbitrarily Paired Training for High-Performance Infrared and Visible Image Fusion

This paper challenges the convention that Infrared and Visible Image Fusion (IVIF) must be trained on "strictly aligned paired data." It proposes the Arbitrarily Paired Training Paradigm (APTP)—freely recombining \(N\) pairs of base data into \(N^2\) cross-modal pairs, equipped with a set of adaptively weighted pixel-level self-supervised losses. Trained on only 150 pairs of content-inconsistent data, it approaches the fusion performance of models trained on 100 times the amount of strictly paired data.
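
The pair-count arithmetic is easy to check with a short sketch. The file-name strings and the function name are placeholders; only the recombination of \(N\) base pairs into \(N^2\) cross-modal pairs comes from the summary.

```python
from itertools import product


def arbitrary_pairs(infrared, visible):
    """Pair every infrared image with every visible image, aligned or not."""
    return list(product(infrared, visible))


# 150 base pairs, matching the training-set size quoted above.
infrared = [f"ir_{i}" for i in range(150)]
visible = [f"vis_{i}" for i in range(150)]
pairs = arbitrary_pairs(infrared, visible)
print(len(pairs))  # 22500
```

Only 150 of these 22,500 combinations are the original base pairs; the rest are content-inconsistent, which is why the self-supervised losses have to work at the pixel level rather than assume alignment.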

Beyond the Ground Truth: Enhanced Supervision for Image Restoration

This paper proposes enhancing the perceptual quality of sub-optimal GT images in existing datasets through super-resolution combined with frequency-adaptive mixing. It introduces a lightweight ORNet refinement module that can be trained to improve the perceptual quality of outputs from pre-trained restoration models without architectural modifications.

BHCast: Unlocking Black Hole Plasma Dynamics from a Single Blurry Image with Long-Term Forecasting

Starting from a single blurry EHT black hole image, BHCast performs super-resolution and long-term autoregressive prediction (stable for 100 steps) via a U-Net dynamics surrogate model. Physical features (rotation speed, pitch angle, etc.) are extracted from the predicted plasma dynamics, and black hole spin and inclination are inferred using XGBoost, demonstrating effectiveness on real M87* observational images.

Bi-Bridge: Bidirectional Diffusion Bridges for Low-Light Image Enhancement

This work integrates "low-light to normal-light" enhancement and "normal-light to low-light" degradation into a single symmetric diffusion bridge. By training a shared U-Net with a bidirectional consistency constraint as implicit regularization, the model significantly outperforms existing SOTA in fidelity (PSNR/LPIPS).

BiEvLight: Bi-level Learning of Task-Aware Event Refinement for Low-Light Image Enhancement

To address the issues of event streams being contaminated by BA noise and the separation of denoising from enhancement in event-aided low-light enhancement, BiEvLight reformulates event denoising from a static preprocessing step into a task-aware bi-level optimization problem. This allows the enhancement gain from the lower level to calibrate the upper-level denoising, supplemented by a spatially adaptive denoising prior guided by image gradients. It achieves an average gain of 1.30dB PSNR / 0.047 SSIM on the real-world SDE dataset.

BiProLoRA: Bilevel Prompt LoRA for Real Scene Recovery

To address the severe degradation issue when large diffusion models "trained on synthetic data generalize to real scenes," BiProLoRA first calibrates the VAE auto-encoder path to the real degradation distribution via self-supervised distribution fidelity learning. It then formulates "LoRA for structure recovery and Prompt for degradation-aware modulation" as a bilevel (hyperparameter optimization) problem for joint training. Using real data equivalent to only 10% of the synthetic data volume, it surpasses SOTA across five non-reference metrics in low-light, dehazing, and underwater tasks.

BluRef: Unsupervised Image Deblurring with Dense-Matching References

This paper proposes BluRef, the first unsupervised framework that utilizes unpaired reference sharp images through dense matching to generate pseudo ground truth for training deblurring networks, achieving performance close to or even surpassing supervised methods.

Browse all 107 Image Restoration papers →


🛰️ Remote Sensing (57)

ACPV-Net: All-Class Polygonal Vectorization for Seamless Vector Map Generation from Aerial Imagery

This paper proposes ACPV-Net, the first framework to generate topologically consistent all-class polygonal vector maps from aerial imagery in a single pass. It utilizes a Semantic Supervised Conditioning (SSC) diffusion model to generate vertex heatmaps and ensures zero-gap/zero-overlap through proposition-driven PSLG reconstruction.

APEX: A Decoupled Memory-based Explorer for Asynchronous Aerial Object Goal Navigation

APEX decomposes the "UAV target search" task into three decoupled modules: MLLMs that dynamically construct 3D spatio-temporal semantic maps as memory, PPO-based reinforcement learning that translates maps into actions, and an open-vocabulary detector for final target confirmation. These modules run at different frequencies in an asynchronous parallel framework to bypass the inference latency of large models, improving SR by 4.2% and SPL by 2.8% over the previous SOTA on the UAV-ON benchmark.

Asking like Socrates: Socrates helps VLMs understand remote sensing images

This work reveals the "pseudo-reasoning" phenomenon in remote sensing VLMs (where explicit reasoning chains lead to performance degradation) and attributes it to the "glance effect": a single coarse-grained look at the image is not enough. It proposes the RS-EoT (Evidence-of-Thought) iterative evidence search paradigm. The method uses SocraticAgent self-play to synthesize reasoning trajectories for an SFT cold start, followed by two-stage progressive RL (grounding → VQA) for enhancement and generalization. RS-EoT-7B achieves SOTA on multiple remote sensing VQA and grounding benchmarks.

AVION: Aerial Vision-Language Instruction from Offline Teacher to Prompt-Tuned Network

AVION proposes a knowledge distillation framework in which semantically rich remote sensing text prototypes generated by an LLM serve as the Teacher. At the same time, it injects learnable prompts into both the vision and text encoders of the Student model to achieve tri-aspect alignment distillation. It significantly outperforms existing PEFT methods in few-shot classification and cross-modal retrieval.

Beyond Endpoints: Path-Centric Reasoning for Vectorized Off-Road Network Extraction

Addressing the frequent fragmentation and incorrect connections of urban road models in wilderness/off-road scenarios, this paper proposes "path-centric" connectivity reasoning. Instead of relying solely on local features of two endpoints, the method samples multi-scale road evidence along the entire geodesic of candidate edges to determine connectivity. The authors also release WildRoad, the first intercontinental vectorized off-road road dataset, achieving SOTA on off-road benchmarks while generalizing well to urban datasets.

Beyond Matching to Tiles: Bridging Unaligned Aerial and Satellite Views for Vision-Only UAV Navigation

Bearing-UAV abandons the "matching a UAV view to a specific satellite tile" paradigm. Instead, it utilizes 4 adjacent satellite tiles and 1 UAV view to directly regress the absolute coordinates and heading angle of the UAV. In scenarios with misalignment, sparse features, and cross-view discrepancies, it reduces errors by an order of magnitude compared to retrieval/matching methods (UAV view MLE reduced from ~30 m to 8.6 m) and integrates heading prediction into end-to-end navigation.

Beyond Tie Points: Satellite Image Block Adjustment based on Dense Feature Consistency

Addressing the long-standing limitation of Planar Block Adjustment (PBA) relying on sparse tie points and accumulating errors in high-disparity regions such as tall buildings, this paper proposes the "Beyond Tie Points" paradigm. It utilizes a pre-trained feature extractor to generate dense features and confidence maps, reformulating block adjustment as a self-supervised optimization problem to "minimize the dense feature distance of homologous object points." Combined with a grid-based coarse-to-fine solver, it reduces average errors by up to 75.43% on data from Beijing, Guangzhou, and San Jose.

ChangeBridge: Spatiotemporal Image Generation with Multimodal Controls for Remote Sensing

This paper proposes ChangeBridge, the first conditional spatiotemporal image generation model for remote sensing. Based on asymmetrically drifting diffusion bridges, it generates post-event images from pre-event images and multimodal conditions (coordinate-text/semantic masks/instance layouts), simultaneously modeling foreground event-driven changes and background temporal evolution, while serving as a data engine for downstream change detection tasks.

Cross-modal Fuzzy Alignment Network for Text-Aerial Person Retrieval and A Large-scale Benchmark

The Cross-modal Fuzzy Alignment Network (CFAN) is proposed, utilizing fuzzy logic to quantify token-level reliability for fine-grained alignment. It introduces the ground view as a bridging proxy to mitigate the semantic gap between aerial images and text, alongside the construction of a large-scale text-aerial person retrieval benchmark, AERI-PEDES.

Cross-Scale Pansharpening via ScaleFormer and the PanScale Benchmark

This paper proposes PanScale, the first cross-scale pansharpening dataset and evaluation benchmark (PanScale-Bench), along with the ScaleFormer framework. The method reinterprets resolution changes as sequence length variations, achieving cross-scale generalization through Scale-Aware Patchify bucketed sampling, decoupled spatial-sequence modeling, and RoPE.

Browse all 57 Remote Sensing papers →


🔍 Anomaly Detection (7)

Anomaly-Related Residual Fields for Cross-domain Anomaly Detection

Addressing the challenge that diffusion model residuals are noisy and magnitudes alone cannot distinguish anomalies, this paper proposes Residual Evolution Fields (REF). It separates "persistent non-stationary anomaly signals" from the spatio-temporal trajectories of residuals in the diffusion reverse process. Cross-domain Field Alignment (CFA) is then employed to transfer detectors trained on labeled source domains to unlabeled target domains, achieving an average AUROC of 95.22% across 9 cross-domain tasks, outperforming the strongest baseline by 13 percentage points.

Defect Cue-Preserved Structural Feature Refinement for Few-Shot Anomaly Detection

This paper identifies that the core difficulty in few-shot anomaly detection (FSAD) lies in the "dilution" of subtle defect cues layer-by-layer within deep feature extraction pipelines. It proposes DCP-SFR: first using learnable prompts to "amplify" early weak signals into high-contrast anomaly cue maps, then using these maps to guide reconstruction-based localization, and finally performing structural-aware boundary refinement. It achieves an image-level AUROC of 97.3% and a pixel-level AUROC of 98.2% on MVTec AD and VisA.

Dual-Prototype-Guided Multi-task Learning for Unsupervised Anomaly Detection and Classification

PG-SFD models "unsupervised anomaly detection (pixel-level localization) + weakly supervised anomaly classification (region-level classification)" as a dual-prototype collaborative optimization problem. By explicitly decoupling normal/anomaly semantics using normal and category prototypes, injecting normal priors into the classification branch via differential gating, and alleviating multi-task gradient conflicts with geometric regularization, it achieves an I-AUROC of 99.4% on MVTec-AD while supporting fine-grained defect classification.

Hunting Normality from Query Sample via Residual Learning for Generalist Anomaly Detection

Addressing the issue in Generalist Anomaly Detection (GAD) where "directly modeling residual distributions" leads to misjudgments due to inconsistency between residuals and instance features, this method no longer classifies residuals directly. Instead, it treats residuals as a guide: learnable proxies extract patterns from residuals (RFL), then these residual proxies aggregate query-related "normality proxies" (NLS) from the support set. Finally, these normality proxies are used to search for normal regions (HNQ) within the query features to locate anomalies. The method achieves competitive few-shot performance on cross-domain benchmarks including Industrial→Industrial and Industrial→Medical.

LayoutAD: Exploring Semantic-Geometric Misalignment Reasoning for Scene Layout Anomaly Detection

LayoutAD proposes a new task, "Scene Layout Anomaly Detection," which produces an anomaly score for each object in an image in an unsupervised manner. By decomposing the scene into semantic and geometric graphs and reasoning about the "misalignment" between them via cross-graph attention, it identifies layout-level hallucinations (such as "a five-legged dog" or "a car parked on a lake") that are invisible to pixel-level detectors.

Multi-Prototype Compactness and Boundary-Aware Synthesis for Unsupervised Anomaly Detection

Addressing the issue where the single-prototype hypothesis results in overly loose decision boundaries under high intra-class variance, this paper proposes the PGBL framework. It structures normal features into multiple compact sub-clusters using Multi-Prototype Compactness Constraints (MPCC), synthesizes pseudo-anomalies at the topological boundaries of these sub-clusters (BAAS), and refines the decision surface with a discriminator (DBR). PGBL outperforms previous methods in detection and localization on MVTec-AD, VisA, and Real-IAD.

RAID: Retrieval-Augmented Anomaly Detection

RAID reinterprets Unsupervised Anomaly Detection (UAD) as a Retrieval-Augmented Generation (RAG) pipeline: it first performs coarse-to-fine retrieval using a three-level vector library (class prototype → semantic prototype → instance token), then employs a "Guided MoE Filter" to denoise the retrieved matching cost volume. This suppresses matching noise and produces anomaly maps with sharp boundaries, achieving SOTA across full-shot, few-shot, and multi-dataset settings on MVTec/VisA/MPDD/BTAD.


🧑 Human Understanding (138)

ActAvatar: Temporally-Aware Precise Action Control for Talking Avatars

ActAvatar utilizes "structured text prompts + phase-aware cross-attention" to allow talking avatar videos to perform specific actions within designated time windows. Combined with "depth-progressive audio influence" and "two-stage training," it maintains lip-sync, action accuracy, and image quality without relying on pose skeletons, achieving quality comparable to 14B models with a 5B model.

All in One: Unifying Deepfake Detection, Tampering Localization, and Source Tracing with a Robust Landmark-Identity Watermark

This paper proposes LIDMark, the first framework to unify deepfake detection, tampering localization, and source tracing into a single proactive forensics system. By embedding a 152-dimensional Landmark-Identity watermark (136D facial landmarks + 16D source ID), it utilizes intrinsic/extrinsic consistency to achieve three-in-one forensics, outperforming existing methods in both PSNR/SSIM and detection accuracy.

AudioAvatar: Personalized Audio-driven Whole-body Talking Avatars

AudioAvatar reconstructs a canonical 3D Gaussian whole-body digital human from a single portrait and allows audio to directly modulate the motion trajectory of each Gaussian particle (skipping the lossy intermediate chain of "audio → parametric pose → rendering"). By leveraging large-scale audio-driven video diffusion models for feature distillation, it significantly outperforms pose-driven baselines in lip synchronization, facial micro-expressions, and gesture naturalness.

Avatar Forcing: Real-Time Interactive Head Avatar Generation for Natural Conversation

The authors upgrade "talking head generation" from unidirectional broadcasting to genuine bidirectional conversation. By utilizing causal diffusion forcing in the motion latent space, the model receives user audio/motion while auto-regressively generating avatar head movements. Combined with KV caching, the latency is reduced to ~500ms (6.8x faster than baselines). Furthermore, a label-free DPO (Direct Preference Optimization), which generates negative samples by "dropping user conditions," enables the avatar to learn expressive reactions like nodding and smiling, achieving over an 80% preference rate against the strongest baseline in human evaluations.
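
The preference-optimization step can be illustrated with the standard DPO objective for a single preference pair. The sketch below is a minimal pure-Python version. Treating the output generated with user conditions as the "chosen" sample and the one generated after dropping them as the "rejected" sample follows the summary above; the function name, the scalar log-probability inputs, and the default `beta` are assumptions made for illustration, not the authors' implementation.

```python
import math

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one preference pair: -log(sigmoid(beta * margin))."""
    # How much more the policy prefers the chosen sample than the frozen
    # reference model does, relative to the rejected sample.
    margin = ((policy_logp_chosen - ref_logp_chosen)
              - (policy_logp_rejected - ref_logp_rejected))
    z = beta * margin
    # Numerically stable softplus(-z), which equals -log(sigmoid(z)).
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))

# Hypothetical log-probabilities, for illustration only.
loss = dpo_loss(policy_logp_chosen=-1.0, policy_logp_rejected=-3.0,
                ref_logp_chosen=-2.0, ref_logp_rejected=-2.0)
```

The loss falls below log 2 once the policy ranks the chosen sample above the rejected one more strongly than the reference model does.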

AVATAR: Reinforcement Learning to See, Hear, and Reason Over Video

The AVATAR framework is proposed to improve GRPO through two core components: an off-policy training architecture (stratified replay buffer) and Time Advantage Shaping (TAS, using U-shaped weighting to emphasize the beginning and end of reasoning chains). This approach addresses three major issues of GRPO—data inefficiency, vanishing advantages, and uniform credit assignment—significantly outperforming the GRPO baseline on audio-visual reasoning benchmarks.

BarbieGait: An Identity-Consistent Synthetic Human Dataset with Versatile Cloth-Changing for Gait Recognition

Addressing the pain point that real-world collection of gait data for "one person wearing hundreds of outfits" is nearly impossible, this paper maps 521 real subjects into a virtual engine. By randomly generating 100 outfits per person, the authors construct an identity-consistent synthetic gait dataset, BarbieGait. A companion clothing-invariant baseline, GaitCLIF, is proposed, achieving SOTA results on BarbieGait and real-world datasets including CCPG, SUSTech1K, Gait3D, and GREW.

Beyond Scanpaths: Graph-Based Gaze Simulation in Dynamic Scenes

The authors model driver gaze as an autoregressive dynamical system: each frame of the traffic scene is encoded into a "gaze-centric" heterogeneous spatio-temporal graph. An Affinity Relational Transformer (ART) models the interaction between the gaze and traffic objects, while an Object-level Density Network (ODN) predicts the next-step gaze distribution, which is autoregressively unrolled into continuous gaze trajectories. This unified model simultaneously generates SOTA-level gaze time series, scanpaths, and saliency maps.

Beyond Single-View Sufficiency: CVBench for Cross-View Human Understanding

Addressing the loophole in existing MLLM benchmarks that default to "single-view sufficiency" and only reward single-image recognition, this work constructs CVBench—3,000 human understanding questions where each item is verifiably "unsolvable via single-view, solvable via cross-view" (12 spatio-temporal tasks, 4-way synchronized cameras). Evaluation reveals that even the strongest models lag nearly 50 points behind humans, identifying a systematic failure mechanism across all models: "single-view bias."

BIT: Matching-based Bi-directional Interaction Transformation Network for Visible-Infrared Person Re-Identification

Addressing the large modality gap and infrared sample scarcity in Visible-Infrared Person Re-Identification (VI-ReID), BIT discards the conventional approach of aligning features into a shared space. Instead, it adopts a matching-based paradigm: a bi-directional cross-interaction module allows visible-infrared image pairs to mutually absorb complementary information, followed by a query-aware scoring module that mines reliable reciprocal correspondences at the patch level to compute final similarity. BIT achieves SOTA results on SYSU-MM01, LLCM, and RegDB benchmarks.

BoostSLT: Boosting Sign Language Translation via a Plug-and-Play Diffusion-Based Semantic Enhancer

BoostSLT introduces a plug-and-play module that wraps any sign language translation model. It segments long videos into semantic segments based on motion energy, translates segments independently, and reconstructs fragmented translations into coherent long sentences using a Diffusion Language Model. Without relying on gloss annotations, it significantly improves BLEU and ROUGE for long-sentence and document-level translation.

Browse all 138 Human Understanding papers →


📹 Video Understanding (178)

A Stitch in Time: Learning Procedural Workflow via Self-Supervised Plackett-Luce Ranking

The authors propose PL-Stitch, a self-supervised framework that utilizes the Plackett-Luce probabilistic ranking model to treat the temporal ordering of video frames as a pre-training signal. By learning "procedure-aware" video representations, it significantly outperforms existing self-supervised methods in surgical phase recognition and cooking action segmentation.
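
To make the ranking signal concrete, the following minimal sketch computes the Plackett-Luce log-likelihood of an ordering given per-item scores. Using per-frame scores and the true temporal order as the observed ranking is an assumption based on the summary above; the function name and the example values are hypothetical and do not come from the paper.

```python
import math

def plackett_luce_log_likelihood(scores, ordering):
    """Log-probability of `ordering` (item indices, first = ranked first)
    under a Plackett-Luce model with one real-valued score per item."""
    remaining = list(ordering)
    log_lik = 0.0
    for idx in ordering:
        # Log-sum-exp over the items that have not been placed yet.
        m = max(scores[j] for j in remaining)
        log_z = m + math.log(sum(math.exp(scores[j] - m) for j in remaining))
        log_lik += scores[idx] - log_z
        remaining.remove(idx)
    return log_lik

# Hypothetical scores for three frames; frame 0 is the earliest.
frame_scores = [3.0, 2.0, 1.0]
correct = plackett_luce_log_likelihood(frame_scores, [0, 1, 2])
shuffled = plackett_luce_log_likelihood(frame_scores, [2, 1, 0])
```

Minimizing the negative of this quantity pushes the scores to agree with the observed order, which is how an ordering can act as a self-supervised training signal.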

Active Intelligence in Video Avatars via Closed-loop World Modeling

To address the issue of current video avatars "passively following speech/pose while lacking autonomous goal-driven behavior," this paper proposes the L-IVA task (modeling avatar control as a POMDP with I2V generation models as environment simulators) and the ORCA framework. ORCA utilizes an "Observe-Think-Act-Reflect" (OTAR) closed loop to counteract the randomness of generation and a System 2/System 1 dual-system hierarchy for open-domain planning and precise grounding. On a benchmark of 100 tasks, it achieves an average task success rate of 71.0%, significantly exceeding open-loop, reactive, and reflection-free baselines.

Adaptive Capacity Autoregressive Visual Tracking

ARTrack-AC extends autoregressive tracking from "fixed-capacity per-frame prediction" to "system-level autoregression." It uses a lightweight diffusion trajectory estimator to pre-judge the stability of future video segments. A controller then switches to a low-capacity parallel mode for simple segments and a high-capacity sequential mode for difficult frames, achieving 66.7% AUC on LaSOT while being 2.9x faster than its predecessor.

AdaSpark: Adaptive Sparsity for Efficient Long-Video Understanding

This paper proposes AdaSpark, which reduces long-video processing FLOPs by up to 57% while maintaining performance through 3D spatiotemporal cube partitioning and two synergistic adaptive sparsity mechanisms: cube-level attention selection and token-level FFN selection.

Affordance-First Decomposition for Continual Learning in Video–Language Understanding

Addressing the blurred boundary of "what to stabilize and what to plasticize" in video-language continual learning, this paper proposes Affordance-First Decomposition (AFD). It maps videos to slowly-varying affordance tokens as a shared, stable "evidence foundation" across tasks, while concentrating plasticity into a LoRA scheduler that utilizes query-based routing and conflict-triggered rank expansion. Combined with question-only replay distillation (storing no videos) for anti-forgetting, AFD achieves higher accuracy and lower forgetting on ViLCo-Bench and domain/time-incremental VideoQA.

Alert-CLIP: Abnormality-aware Latent-Enhanced Representation Tuning of CLIP for Video Anomaly Detection

To address the issue where CLIP's text space highly entangles "normal" and "abnormal" descriptions—causing near-identical similarity scores for both types of prompts—this paper reshapes CLIP's embedding geometry via three-level (Global/Regional/Hard Negative) cross-modal contrastive training using a self-built dataset (VAGTA). This transforms CLIP into a more abnormality-aware backbone, consistently outperforming original CLIP in weakly supervised, zero-shot, and open-vocabulary VAD settings.

\(\alpha\)Matte4K & \(\mu\)Matting: Dataset and Model for Ultra-Micro Precision Alpha Video Matting

Targeting 4K portrait video matting, this paper introduces \(\alpha\)Matte4K, a large-scale dataset with pixel-level precision and physical consistency generated via Physics-Based Rendering (PBR). It also proposes \(\mu\)Matting, which utilizes a portrait prior (MAE) to predict a coarse alpha and identify "difficult regions," followed by sparse 3D convolution refinement only on these regions. This approach achieves full-resolution 4K video matting without downsampling for the first time, surpassing existing SOTA in both accuracy and temporal consistency.

An Efficient Token Compression Framework for Visual Object Tracking

To address the visual token explosion and redundancy in multi-frame template tracking, ETCTrack utilizes a learnable Adaptive Token Compressor (ATC) to compress historical template frames into a refined subset. This is followed by a Hierarchical Interaction Block (HIBlock) for deep interaction with the search region. It sets new state-of-the-art accuracy across 7 benchmarks while reducing computation (template tokens reduced by 60%, MACs reduced by 21.4%, with only a 0.4% drop in accuracy).

An Empirical Study on How Video-LLMs Answer Video Questions

This paper systematically dissects the internal mechanisms of how Video-LLMs answer video questions using "attention knockout." It identifies a clear "early-layer perception, late-layer reasoning" two-stage pattern and finds that spatiotemporal modeling relies primarily on language-to-video retrieval rather than intra/inter-frame video self-attention. Furthermore, only a few intermediate layers are critical. Based on these insights, the authors design a simple strategy involving early exit for visual tokens and temporal attention pruning, which significantly reduces computational cost with almost no performance degradation.

Asynchronous Temporal Modeling with Two-Agent Framework for Streaming Dense Video Captioning

Addressing the "when to speak" challenge in streaming dense video captioning, which is difficult to control via thresholds, this paper proposes Takusen. It is an asynchronous dual-agent framework using a small model as an "Oracle" to detect event boundaries ahead of time and a large model as a "Listener" to generate descriptions only upon receiving signals. This mechanism eliminates thresholds and achieves streaming SOTA on ActivityNet Captions and YouCook2.

Browse all 178 Video Understanding papers →


🚗 Autonomous Driving (140)

ActiveAD: Planning-Oriented Active Learning for End-to-End Autonomous Driving

ActiveAD designs a "planning-oriented" active learning strategy for end-to-end autonomous driving: it uses nearly free meta-information (weather/lighting/driving commands/speed) for diversity initialization to solve the cold-start problem, and selects the most critical scenarios using three label-free criteria: displacement error, soft collision, and agent uncertainty. Training on only 30% of the data matches the performance of SOTA models trained on 100% data in both nuScenes open-loop and CARLA closed-loop evaluations.

AdaRadar: Rate Adaptive Spectral Compression for Radar-based Perception

The authors propose AdaRadar, an online adaptive radar data compression framework based on DCT spectral pruning and zeroth-order proxy gradients. It achieves over 100× compression with only about a 1 percentage point loss in detection/segmentation performance, effectively alleviating the bandwidth bottleneck between radar sensors and compute units.
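
As a rough illustration of DCT spectral pruning only, the sketch below transforms a 1D signal with an orthonormal DCT-II, keeps the largest-magnitude coefficients, and reconstructs with the inverse transform. The fixed `keep` count, the 1D toy signal, and all function names are assumptions; the rate adaptation and zeroth-order proxy gradients described above are not modeled here.

```python
import math

def _scale(k, n):
    # Orthonormal DCT-II scaling factor.
    return math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)

def dct(x):
    """Orthonormal DCT-II of a list of floats."""
    n = len(x)
    return [_scale(k, n) * sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n)
                               for i in range(n))
            for k in range(n)]

def idct(c):
    """Inverse of `dct` (orthonormal DCT-III)."""
    n = len(c)
    return [sum(_scale(k, n) * c[k] * math.cos(math.pi * (i + 0.5) * k / n)
                for k in range(n))
            for i in range(n)]

def prune_spectrum(coeffs, keep):
    """Zero out all but the `keep` largest-magnitude coefficients."""
    order = sorted(range(len(coeffs)), key=lambda k: abs(coeffs[k]), reverse=True)
    kept = set(order[:keep])
    return [c if k in kept else 0.0 for k, c in enumerate(coeffs)]

# Hypothetical signal: keep 3 of 8 coefficients, then reconstruct.
signal = [1.0, 3.0, -2.0, 0.5, 4.0, 0.0, -1.0, 2.0]
compressed = prune_spectrum(dct(signal), keep=3)
approx = idct(compressed)
```

Only the retained coefficients and their indices would need to be transmitted, which is where the bandwidth saving comes from.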

AMap: Distilling Future Priors for Ahead-Aware Online HD Map Construction

AMap identifies a safety hazard in existing temporal HD mapping methods: they "only enhance the rear area already passed and provide almost no improvement for the critical road ahead." It proposes a "distill-from-future" paradigm—using a teacher capable of seeing future frames to implicitly instill forward priors into a lightweight student observing only the current frame, significantly improving ahead-mapping accuracy (A-mAP) with zero inference overhead.

An Instance-Centric Panoptic Occupancy Prediction Benchmark for Autonomous Driving

This paper proposes ADMesh (a high-quality 3D model library with 15K+ assets) and CarlaOcc (a panoptic occupancy dataset with 100k frames and 0.05m precision). It provides the first instance-level annotations and physically consistent ground truth for 3D panoptic occupancy prediction in autonomous driving, along with occupancy quality evaluation metrics and a systematic benchmark.

BEV-CAR: Enhancing Monocular Bird's Eye View Segmentation with Context-Aware Rasterization

BEV-CAR introduces a "training-only, inference-removed" context-aware rasterization mechanism that rearranges decoder outputs into rays along the lines of sight. Using discrete sampling via the Bresenham algorithm and ray-wise supervision, combined with a dual-branch (depth + global) BEV feature fusion, it achieves SOTA results on nuScenes (31.5% mIoU) and Argoverse (29.9% mIoU) with zero additional inference overhead at 43.1 FPS.
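
The discrete ray sampling step relies on the classic Bresenham line algorithm, sketched below for integer grid cells. How BEV-CAR places its rays on the BEV grid is not specified in this summary, so the ego-cell origin and the target cell in the example are hypothetical.

```python
def bresenham(x0, y0, x1, y1):
    """Integer grid cells on the line from (x0, y0) to (x1, y1), inclusive."""
    cells = []
    dx = abs(x1 - x0)
    dy = -abs(y1 - y0)
    sx = 1 if x0 < x1 else -1
    sy = 1 if y0 < y1 else -1
    err = dx + dy
    while True:
        cells.append((x0, y0))
        if x0 == x1 and y0 == y1:
            break
        e2 = 2 * err
        if e2 >= dy:  # step along x
            err += dy
            x0 += sx
        if e2 <= dx:  # step along y
            err += dx
            y0 += sy
    return cells

# Hypothetical ray from an ego cell at (0, 0) to a BEV cell at (5, 2).
ray_cells = bresenham(0, 0, 5, 2)
```

Features gathered at these cells form one ray, which is the unit a ray-wise loss would supervise.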

BEV-SLD: Self-Supervised Scene Landmark Detection for Global Localization with LiDAR Bird's-Eye View Images

This paper proposes BEV-SLD, a LiDAR global localization method based on self-supervised Scene Landmark Detection (SLD). By decoupling detection from correspondence prediction, it achieves high-precision \((x, y, \text{azimuth})\) pose estimation across various scenarios with a compact storage footprint of only 20MB.

Beyond Rule-Based Agents: Active Markov Games for Realistic Multi-Agent Interaction in Autonomous Driving

This paper models the driving environment as an "Active Markov Game" (AMG) where both state transitions and rewards depend on the current policies of all agents. By employing multi-agent co-evolutionary training, the ego policy plays against and evolves with a pool of diverse opponent strategies. This approach learns robust interactive decision-making at unsignalized intersections and in long-tail scenarios in CARLA, reducing the collision rate to 0.02 and achieving a success rate of 98%.

BuildAnyPoint: 3D Building Structured Abstraction from Diverse Point Clouds

BuildAnyPoint is proposed to achieve unified reconstruction from diverse point cloud distributions (airborne LiDAR, SfM, sparse noisy points) to structured 3D building meshes using a Loosely-coupled Cascaded Diffusion Transformer (Loca-DiT). The framework first restores the underlying point cloud distribution through hierarchical latent diffusion and subsequently generates compact polygonal meshes via an autoregressive Transformer.

C-LaV: Conditional Latent Velocity Field Denoising for Weather-Robust LiDAR Place Recognition

C-LaV compensates for LiDAR degradation caused by rain, snow, and fog within the BEV latent space of a frozen DINOv2. By learning a velocity field via conditional Flow Matching and solving a probability flow ODE, it deterministically transports "weather-noisy latent representations" back to "clear-day latent representations." Using a SALAD clustering head for global descriptor retrieval, it achieves Recall@1 improvements of 17.5% on NCLT Snowy and 21.5% on real-world Boreas datasets.
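
The deterministic transport step can be pictured as numerically integrating an ODE driven by a velocity field. The sketch below uses forward Euler with a hand-written toy velocity function standing in for the learned, conditioned network; the step count, vector sizes, and all names are assumptions for illustration.

```python
def euler_transport(x, velocity_fn, steps=10):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with forward Euler."""
    dt = 1.0 / steps
    x = list(x)
    for i in range(steps):
        t = i * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

# Toy stand-in for a learned velocity field: for a straight-line path between
# two fixed points, the flow-matching target velocity is constant (end - start).
noisy_latent = [2.0, -1.0]
clean_latent = [0.5, 0.5]

def toy_velocity(x, t):
    return [c - n for c, n in zip(clean_latent, noisy_latent)]

restored = euler_transport(noisy_latent, toy_velocity, steps=10)
```

Because no noise is injected during integration, the same input latent always maps to the same output, which is what makes the transport deterministic.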

CARD: A Multi-Modal Automotive Dataset for Dense 3D Reconstruction in Challenging Road Topography

CARD is a multi-modal autonomous driving dataset targeting "non-flat road surfaces" (speed bumps, potholes, irregularities, and off-road sections). Through a novel multi-LiDAR fusion ground truth generation pipeline, it provides approximately 500,000 measured LiDAR depth points per frame (about 6.5 times that of KITTI Depth Completion). It also includes 2D bounding boxes for road topography, wheel-ground contact excitation trajectories, and standardized evaluation protocols, and is specifically designed to evaluate depth estimation/completion capabilities for fine-grained road geometry.

Browse all 140 Autonomous Driving papers →


🤖 Robotics & Embodied AI (130)

A Cross-view Fusion Framework for Robust 6-DoF Grasp Pose Estimation

Addressing the issue of unstable 6-DoF grasping caused by occluded geometric information in "corner views" of single-view point clouds, this paper proposes a post-fusion framework utilizing an auxiliary view captured easily by a robotic arm. By employing self-supervised contrastive learning, cross-view point features are mapped to be "spatially consistent + directionally discriminable." A "Cross-view Aligned Cylindrical Integration" module fuses geometry from two views within a grasp-related cylindrical neighborhood. On GraspNet-1Billion, the Seen split AP reaches 74.08 (RealSense, +3.55 Gain), with a 96% clearing success rate on a real robotic arm.

ACoT-VLA: Action Chain-of-Thought for Vision-Language-Action Models

ACoT-VLA replaces the "intermediate reasoning" of VLA models, previously expressed as language subtasks or target images, with coarse-grained reference action sequences in the action space (Action Chain-of-Thought). An explicit action reasoner generates reference trajectories, while an implicit action reasoner extracts action priors from the VLM's KV cache. These two pathways jointly condition the action head, achieving SOTA on LIBERO/LIBERO-Plus/VLABench simulation benchmarks and real-world hardware.

Action-Sketcher: From Reasoning to Action via Visual Sketches for Robotic Manipulation

This paper proposes Action-Sketcher, which enables VLA models to operate in a "See-Think-Sketch-Act" loop. It first draws spatial intent as a Visual Sketch (composed of points, boxes, and arrows) as a human-readable and editable intermediate representation before generating actions. It significantly outperforms strong baselines like π0.5 and OpenVLA-OFT on long-horizon, cluttered, and referentially ambiguous real-world manipulation tasks. Furthermore, sketches allow for direct human intervention to further improve success rates.

ActiveGrasp: Information-Guided Active Grasping with Calibrated Energy-based Model

Addressing the challenge of grasping targets in cluttered scenes with limited viewpoints, ActiveGrasp employs a calibrated energy-based model to directly model grasp distributions on the SE(3) manifold. It defines the information gain of the "Next-Best-View" (NBV) as the reduction in grasp success entropy, guiding the robot to regions of highest uncertainty. This approach achieves superior success rates with a smaller view budget in both simulation (79% SR) and real-world experiments.
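
The "reduction in grasp success entropy" criterion can be sketched with a Bernoulli success variable. The code below is a simplified illustration under the assumption that each candidate view comes with a predicted post-observation success probability; the view names and probabilities are hypothetical, and the calibrated energy-based model on SE(3) is not represented.

```python
import math

def bernoulli_entropy(p):
    """Entropy in bits of a Bernoulli variable with success probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))

def information_gain(p_before, p_after):
    """Reduction in grasp-success entropy after observing a new view."""
    return bernoulli_entropy(p_before) - bernoulli_entropy(p_after)

def next_best_view(p_current, predicted_p_by_view):
    """Pick the candidate view with the largest expected entropy reduction."""
    return max(predicted_p_by_view,
               key=lambda view: information_gain(p_current, predicted_p_by_view[view]))

# Hypothetical predictions: success probability expected after each view.
candidates = {"left": 0.60, "top": 0.95, "right": 0.75}
best = next_best_view(0.5, candidates)
```

A view that is expected to leave the success probability near 0.5 removes little uncertainty, so it scores low under this criterion.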

ActiveVLA: Injecting Active Perception into Vision-Language-Action Models for Precise 3D Robotic Manipulation

ActiveVLA integrates "active perception" into 3D Vision-Language-Action (VLA) models: it first utilizes multi-view orthogonal projections and heatmaps to locate 3D key regions, then actively selects optimal virtual camera views around these regions and performs virtual Zoom-in to enhance resolution. This approach significantly improves success rates in scenarios involving occlusions and fine manipulations (achieving a 91.8% average on RLBench).

AdaDexTrack: Dynamic Modulation for Adaptive and Generalizable Dexterous Manipulation Tracking

AdaDexTrack redefines the "Language Command → Dexterous Hand-Object Interaction" pipeline as modulated tracking. A distilled general tracker acts as the "skill carrier," while an RL-trained modulator is integrated into the feedback loop. This modulator performs real-time correction through three interfaces—reference trajectory, object latent variables, and position targets—enabling the stable execution of noisy text-generated references for long-horizon, drift-resistant manipulation and achieving zero-shot sim-to-real transfer.

Adaptive Action Chunking at Inference-time for Vision-Language-Action Models

The paper proposes Adaptive Action Chunking (AAC), which utilizes action entropy as a cue to dynamically determine the optimal chunk size during inference without additional training or architectural modifications. It consistently improves success rates for GR00T N1.5 and π0.5 on benchmarks such as RoboCasa and LIBERO.

Affordance Field Intervention: Enabling VLAs to Escape Memory Traps in Robotic Manipulation

Addressing the "Memory Trap" issue where VLA models rigidly follow training trajectories towards old object locations under scene perturbations, this paper proposes a training-free 3D Spatial Affordance Field (SAF) as a plug-and-play module. The system uses proprioception to detect traps, rolls back to safe historical poses, and employs SAF to sample waypoints and rerank VLA candidate trajectories based on cumulative affordance, achieving an average improvement of 23.5% in real-world OOD scenarios.

AffordGen: Generating Diverse Demonstrations for Generalizable Object Manipulation with Affordance Correspondence

AffordGen transforms "affordance semantic correspondence" from an online planning signal into an offline data generation prior. By establishing keypoint correspondences across large-scale 3D meshes using DINOv2, it batch-transfers grasping and skill segments from a single human demonstration to hundreds of new objects. This process synthesizes a trajectory dataset covering full 6D poses and multiple categories, which is then used to train a closed-loop visuomotor policy, achieving zero-shot generalization to genuinely unseen objects.

AGENTSAFE: Benchmarking the Safety of Embodied Agents on Hazardous Instructions

AGENTSAFE is the first benchmark to systematically evaluate the safety of "embodied VLM agents executing hazardous instructions." It utilizes an adversarial simulation sandbox (SAFE-THOR) that interfaces with arbitrary agents, a collection of 9,900 hazardous instructions categorized by the "Three Laws of Robotics" (SAFE-VERSE), and a fine-grained diagnostic protocol (SAFE-DIAGNOSE) spanning the "perception-planning-execution" stages. The study evaluates 9 VLMs and 2 agent workflows, revealing a systemic failure where current agents "recognize danger but fail to incorporate this cognition into planning and execution," and proposes a thought-level defense module called SAFE-AUDIT.

Browse all 130 Robotics & Embodied AI papers →


🎮 Reinforcement Learning (23)

AnyDoc: Enhancing Document Generation via Large-Scale HTML/CSS Data Synthesis and Height-Aware Reinforcement Optimization

AnyDoc proposes a general document generation framework based on a unified HTML/CSS representation. Through an automated data synthesis pipeline, it constructs the DocHTML dataset containing 265K documents. By combining SFT and Height-Aware Reinforcement Learning (HARL) to fine-tune MLLMs, it outperforms baselines such as GPT-4o on intention-to-document, document derendering, and element-to-document tasks.

CCCaption: Dual-Reward Reinforcement Learning for Complete and Correct Image Captioning

The authors propose the CCCaption dual-reward reinforcement learning framework. By jointly optimizing image captioning completeness (based on visual query sets generated by multiple MLLMs) and correctness (based on hallucination detection of decomposed sub-queries), the 2B model outperforms the 32B baseline.

CME-CAD: Heterogeneous Collaborative Multi-Expert Reinforcement Learning for CAD Code Generation

Aiming at the industrial scenario of "directly generating executable and editable CAD code from 2D engineering triple-views," CME-CAD enables multiple heterogeneous pre-trained large models to act as "experts" with distinct styles. It first employs Multi-Expert Fine-Tuning (MEFT) using their respective reasoning styles, followed by a Multi-Expert Reinforcement Learning (MERL) stage. In MERL, strong experts transfer superior strategies to weak experts via KL distillation, and a Hard Sample Buffer mechanism is used to repeatedly tackle the most difficult samples. Ultimately, on the self-built industrial-grade benchmark CADExpert, the IoU is improved from 71.84% to 80.71%, and the code execution rate reaches 98.25%.

Cross-modal Identity Mapping: Minimizing Information Loss in Modality Conversion via Reinforcement Learning

The authors propose Cross-modal Identity Mapping (CIM), which quantifies information loss in image descriptions by analyzing the representation consistency (GRC) of images retrieved via the caption and their relevance to the source image (QIR). This serves as an RL reward signal to train LVLMs to generate fine-grained and accurate descriptions without requiring additional human annotations.

DreamSAC: Learning Hamiltonian World Models via Symmetry Exploration

DreamSAC replaces the black-box dynamics of pixel-based world models (DreamerV3) with an SE(3)-invariant Hamiltonian dynamics prior and employs a "symmetry-breaking work" intrinsic curiosity to collect physically informative data. This allows the model to learn conservation laws rather than just pixel statistical correlations, achieving 22%–163% higher extrapolative generalization on unseen physical parameters such as mass, gravity, and friction compared to SOTA.

EVA: Efficient Reinforcement Learning for End-to-End Video Agent

EVA models long video understanding as a "planning-before-perception" Markov Decision Process (MDP), enabling the MLLM agent to decide "which segment to watch, how many frames to sample, and at what resolution" based solely on the text question. Through a three-stage training pipeline (SFT Cold Start \(\rightarrow\) KTO Offline Correction \(\rightarrow\) Data-Enhanced GRPO), the model evolves from a format imitator to an active video explorer. It achieves a 6–12% accuracy improvement over general MLLMs and a 1–3% gain over existing adaptive agents using approximately 1/10 of the visual tokens across six video benchmarks.

GeoWorld: Geometric World Models

GeoWorld maps the latent representations of predictive world models from Euclidean space onto hyperbolic manifolds. By maintaining geometric structures and hierarchical relationships through Hyperbolic JEPA and employing Geometric Reinforcement Learning to optimize multi-step planning, it achieves improvements of approximately 3% SR (3 steps) and 2% SR (4 steps) on CrossTask and COIN.
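
One common way to move a Euclidean latent onto a hyperbolic manifold is the exponential map at the origin of the Poincaré ball. The sketch below assumes that model and a curvature parameter `c`; the note does not say which hyperbolic model GeoWorld uses, so treat this as a generic illustration only.

```python
import math

def expmap0(v, c=1.0):
    """Exponential map at the origin of the Poincare ball with curvature -c.

    Sends a Euclidean vector v to a point strictly inside the ball of
    radius 1/sqrt(c) (up to floating-point saturation for very large norms).
    """
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        return list(v)
    scale = math.tanh(math.sqrt(c) * norm) / (math.sqrt(c) * norm)
    return [scale * x for x in v]

point = expmap0([3.0, 4.0])
point_norm = math.sqrt(sum(x * x for x in point))
```

The direction of the vector is preserved while its length is squashed by `tanh`, so points far from the origin crowd toward the boundary, which is where hyperbolic space has room for deep hierarchies.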

Incentivizing Generative Zero-Shot Learning via Outcome-Reward Reinforcement Learning with Visual Cues

RLVC treats the feature generator in generative zero-shot learning as an RL policy. It utilizes outcome rewards based on "correct classification" from a frozen classifier to drive generator self-evolution, combined with class-level visual cues for prototype distillation to stabilize training. It achieves new SOTA on CUB, SUN, and AWA2 benchmarks (e.g., 90.1% CZSL accuracy and 81.2% GZSL harmonic mean on CUB).

JoPPO: Hierarchical Photography Assessment via Contrastive Joint Conditional Probabilistic Reinforcement Learning

JoPPO upgrades "using VLMs to score image aesthetics" from regressing a single global score to modeling the joint Gaussian distribution of attribute scores and total scores across a batch. By deriving attribute-conditional pairwise win rates and utilizing them as rewards in GRPO to train the judge, the model provides interpretable multi-attribute sub-scores while significantly exceeding GPT-4o in ranking consistency.

Local Motion Matters: A Deconstruct-Recompose Paradigm for Reinforcement Learning Pre-training from Videos

This work deconstructs complex "global motion" in videos into morphology-agnostic "atomic actions" (local optical flow patches). A dual-attention encoder learns transferable local motion representations, which are recomposed into a world model via a learnable aggregation token. This paradigm significantly enhances RL sample efficiency and final performance on downstream robotic control tasks such as DMControl Remastered and Meta-World.

Browse all 23 Reinforcement Learning papers →


🔄 Self-Supervised Learning (89)

A Faster Path to Continual Learning

To address the issue of the C-Flat optimizer being too slow due to calculating three additional gradients per step, this paper identifies "direction-invariant" components within the first-order flatness gradients. These components are cached and reused in subsequent steps to skip redundant perturbation gradient calculations. Combined with a linear scheduler that gradually increases the skip interval as tasks progress and an adaptive trigger based on gradient statistics, C-Flat Turbo achieves 1.0×~1.25× speedup over C-Flat (recovering throughput from ~27% to ~60%) while maintaining or even slightly improving accuracy.

AdaPrior: Bayesian-Inspired Adaptive Prior Correction for Long-Tailed Continual Learning

AdaPrior reinterprets Long-Tailed Continual Learning (LTCIL) as a "model-induced prior drift" problem. It uses EMA to online estimate the model's self-learned prior \(P_m(y)\), followed by Bayesian alignment for debiasing in both training loss and inference post-processing. This single-stage, plug-and-play approach consistently outperforms recent LTCIL baselines on CIFAR100-LT, ImageNet-subset-LT, and iNaturalist18-subset.
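
The two ingredients named above, an EMA estimate of the model's own prior and a Bayesian correction at inference, can be sketched as follows. This is a generic logit-adjustment illustration, not the paper's exact formulation; `update_prior`, `debias_logits`, and the momentum value are assumptions.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def update_prior(prior, probs, momentum=0.9):
    """EMA estimate of the model-induced class prior P_m(y)."""
    return [momentum * p + (1.0 - momentum) * q for p, q in zip(prior, probs)]

def debias_logits(logits, prior, eps=1e-12):
    """Bayes-style correction: subtract log P_m(y) from each logit."""
    return [z - math.log(p + eps) for z, p in zip(logits, prior)]

prior = [0.5, 0.5]
# Outputs that consistently favor class 0 (a "head" class in this toy setup).
for logits in ([2.0, 0.0], [1.5, 0.2], [2.2, -0.1]):
    prior = update_prior(prior, softmax(logits))

# A borderline sample that the raw logits assign to class 0.
adjusted = debias_logits([0.3, 0.0], prior)
```

Because the estimated prior has drifted toward class 0, the correction lowers that class's score enough for the borderline sample to flip to class 1.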

An Optimal Transport-driven Approach for Cultivating Latent Space in Online Incremental Learning

This paper proposes an online mixture model learning framework (MMOT) based on Optimal Transport theory. By maintaining multiple adaptive centroids for each category, it precisely characterizes the multimodal nature of online data streams. Combined with a dynamic preservation strategy to enhance category discriminability, it effectively mitigates catastrophic forgetting in Online Class-Incremental Learning (OCIL).

Assignment-Driven Hash Learning in a Hyper-Semantic Space for On-the-Fly Category Discovery

To address the critical issues of "feature-to-hash cascade degradation" and "known-class monopoly in the representation space" in On-the-fly Category Discovery (OCD), this paper constructs a hyper-semantic space comprising "derived subspaces" and "calibrated subspaces" to simultaneously characterize intra-class diversity and reserve space for new categories. Assignment-driven hash learning, featuring "soft prototype assignment + binary hash regularization," is then performed within this space. As a plug-and-play module for SMILE/PHE, it achieves an average All accuracy improvement of approximately 12.78% on six fine-grained datasets (based on SMILE).

Beyond Binary Contrast: Modeling Continuous Skeleton Action Spaces with Transitional Anchors

To address the issues of isolated clusters and rigid boundaries caused by "binary contrast" in self-supervised skeleton action recognition, TranCLR synthesizes "transitional anchors" as manifold regularization terms between actions and reshapes the representation space from discrete point clouds into continuous smooth manifolds using three-level geometric manifold calibration. It achieves SOTA across linear evaluation, transfer learning, and retrieval on NTU/PKU-MMD, while reducing the Expected Calibration Error (ECE) from ~5.6% to 0.65%.

Beyond Myopic Alignment: Lookahead Optimization for Online Class-Incremental Learning

Addressing the issue of forgetting caused by "conflict between current task gradients and replay gradients" in online class-incremental learning, this paper theoretically reveals that hypergradient methods essentially align task gradients to a shared meta-objective but are "myopic" as they only consider the current step. Consequently, it proposes LOR: before updating, it explores multiple future model states along a set of "plasticity-stability" trade-off directions, then optimizes the worst-case direction using a Log-Sum-Exp softened min-max objective. This pushes the model toward flatter, more forgetting-resistant regions, outperforming SOTA on Seq-CIFAR10/100 and Seq-TinyImageNet.
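
The Log-Sum-Exp softening of a worst-case objective can be illustrated with a small helper. It replaces the hard maximum over candidate losses with a smooth upper bound; the function name and temperature value are illustrative assumptions, and how LOR generates the candidate losses is not shown.

```python
import math

def soft_worst_case(losses, temperature=10.0):
    """Smooth upper bound on max(losses): (1/t) * log(sum(exp(t * l))).

    Subtracting the maximum before exponentiating keeps the computation stable.
    """
    m = max(losses)
    total = sum(math.exp(temperature * (l - m)) for l in losses)
    return m + math.log(total) / temperature

# Hypothetical losses along three plasticity-stability trade-off directions.
candidate_losses = [0.8, 1.3, 1.1]
objective = soft_worst_case(candidate_losses)
```

The result always lies between `max(losses)` and `max(losses) + log(n) / temperature`, so a higher temperature tracks the worst direction more tightly, while a lower one spreads gradient across all directions.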

Beyond the Static World: Continual Category Discovery under Visual Drift

Addressing the realistic scenario where "unlabeled data streams both introduce new categories and originate from unfamiliar domains," this paper proposes the OCCD task. It introduces a three-component framework—"Optimal Transport for automatic separation of known/unknown samples → Adversarial Alignment of known class prototypes → Frequency-domain augmentation for category topological consistency"—achieving new SOTA performance in both new category discovery and old category recognition on DomainNet and SSB-C.

Can You Learn to See Without Images? Procedural Warm-Up for Vision Transformers

Before a ViT officially processes images, a lightweight masked-token pre-training (warm-up) is conducted using purely symbolic sequences without any visual content (e.g., "balanced parentheses") generated by formal grammars. This forces the model to internalize universal computational mechanisms such as stack-based hierarchy and long-range dependencies. When followed by standard image training, this approach achieves a +1.72% top-1 gain on ImageNet-1K with only a 1% training budget expansion, effectively substituting for 28% of image data.

CHEEM: Continual Learning by Reuse, New, Adapt and Skip -- A Hierarchical Exploration-Exploitation Approach

This paper proposes the CHEEM framework, which automatically learns task-aware dynamic ViT backbones via Hierarchical Exploration-Exploitation (HEE) sampled NAS, selecting one of four operations (Reuse, New, Adapt, or Skip) at each layer. It significantly outperforms prompt-based methods on the MTIL and VDD benchmarks, approaching the upper bound of full fine-tuning.

Chain-of-Models Pre-Training: Rethinking Training Acceleration of Vision Foundation Models

This paper proposes Chain-of-Models Pre-Training (CoM-PT), which arranges Vision Foundation Models (VFMs) by size into a "model chain." It achieves lossless pre-training acceleration through small-to-large inverse knowledge transfer (weight initialization + feature distillation), where training efficiency improves as the scale of the model family grows.

Browse all 89 Self-Supervised Learning papers →


📐 Optimization & Theory (22)

ACE-Merging: Data-Free Model Merging with Adaptive Covariance Estimation

This paper theoretically proves that fine-tuned parameter differences contain input covariance information. Accordingly, it proposes ACE-Merging, which achieves data-free closed-form model merging through a three-step process: adaptive covariance estimation, collective structure priors, and spectral refinement. It achieves an average improvement of 4% on GPT-2 and 5% on RoBERTa-Base compared to previous methods.

BD-Merging: Bias-Aware Dynamic Model Merging with Evidence-Guided Contrastive Learning

The authors propose the BD-Merging framework, which utilizes Dirichlet evidence modeling, Neighborhood Disparity Score (ADS), and disparity-aware contrastive learning to train a debiasing router for adaptive assignment of model merging weights. This significantly improves the robustness and generalization of merged models under test-time distribution shifts and unseen tasks.

Beyond Single Solution: Multi-Hypothesis Collaborative Deep Unfolding Network for Image Compressive Sensing

Addressing the "underdetermined and non-unique" nature of the Compressive Sensing (CS) problem, this paper proposes MHC-DUN: a paradigm shift from reconstructing a single solution in traditional Deep Unfolding Networks (DUNs) to "reconstructing \(T\) hypothesis solutions simultaneously with collaborative optimization." Specifically, AlphaNet predicts pixel-adaptive step sizes for each hypothesis in the gradient descent step, while MHCB captures inter-hypothesis correlations for fusion in the proximal mapping step. The method consistently outperforms current SOTA on Set11/Urban100/CS-MRI (e.g., achieving a 0.45 dB average PSNR gain over USB-Net on Set11).

Conditional Factuality Controlled LLMs with Generalization Certificates via Conformal Sampling

This paper proposes CFC (Conditional Factuality Control), a post-hoc conformal framework that learns feature-conditioned acceptance thresholds via augmented quantile regression. It provides conditional coverage guarantees for LLM/VLM sampled outputs, significantly improving reliability for difficult subgroups while maintaining compact prediction sets.

DABO: Difficulty-Aware Bayesian Optimization with Diffusion-Learned Priors

DABO treats "optimization difficulty" as a first-class conditional variable throughout the entire freeze-thaw hyperparameter optimization (HPO) pipeline. By utilizing a three-level difficulty characterization and a conditional diffusion model to generate 1 million synthetic learning curves with difficulty labels, it trains a difficulty-aware PFN proxy and an adaptive acquisition function. DABO achieves an average regret reduction of 11–18% compared to the current SOTA (ifBO) across 75 tasks, with greater gains observed on harder tasks.

DC-Merge: Improving Model Merging with Directional Consistency

DC-Merge discovers that the key to model merging lies in maintaining directional consistency in singular space between the merged multi-task vector and the original single-task vectors. Through a two-step process of singular value smoothing and projection onto a shared orthogonal subspace, it achieves SOTA results on both Vision and Vision-Language tasks.

Defending Unauthorized Model Merging via Dual-Stage Weight Protection

This paper proposes MergeGuard, an active dual-stage weight protection framework: Stage 1 disperses task-critical weights through L2 regularization, and Stage 2 injects structured perturbations to disrupt merging compatibility. The protected model loses less than 1.5% of its original performance, while models merged from it suffer up to 90% accuracy degradation.

Dynamic Momentum Recalibration in Online Gradient Learning

This paper reveals the inherent flaws of fixed momentum coefficients in the bias-variance tradeoff from a signal processing perspective. It proposes the SGDF optimizer, which dynamically balances noise suppression and signal preservation in gradient estimation by calculating an optimal time-varying gain online (based on the Minimum Mean Square Error principle), outperforming SGD with momentum and Adam variants across various vision tasks.
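
A time-varying gain derived from the Minimum Mean Square Error principle has the same shape as a scalar Kalman update. The sketch below fuses a running gradient estimate with a new noisy gradient under that assumption. The variance inputs and function names are hypothetical, and this is not claimed to be SGDF's actual update rule.

```python
def mmse_gain(var_estimate, var_observation):
    """Gain that minimizes the mean square error of the fused estimate."""
    return var_estimate / (var_estimate + var_observation)

def fuse(estimate, gradient, var_estimate, var_observation):
    """Blend the running estimate with a new gradient using the MMSE gain."""
    gain = mmse_gain(var_estimate, var_observation)
    new_estimate = estimate + gain * (gradient - estimate)
    new_variance = (1.0 - gain) * var_estimate
    return new_estimate, new_variance, gain

# Equal uncertainty: the new gradient and the old estimate are averaged.
balanced = fuse(0.0, 1.0, var_estimate=1.0, var_observation=1.0)
# Very noisy gradient: the gain shrinks and the old estimate dominates.
noisy = fuse(0.0, 1.0, var_estimate=1.0, var_observation=9.0)
```

A fixed momentum coefficient corresponds to holding the gain constant, whereas here the gain drops automatically when the incoming gradient is noisy and rises when it is reliable.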

End-to-End Hyper-Relational Information Extraction for Engineering Diagrams via Dynamically Tokenized Relation Transformer

This work reframes the parsing of engineering diagrams (P&ID, Electrical Diagrams) from a multi-model workflow of detecting symbols, lines, and text separately into a one-time scene graph generation task. By employing a vision backbone with dynamic token pruning and a one-stage Relation Transformer (DTRT), the system end-to-end outputs a Hyper-Relational Knowledge Graph (HKG) containing "entities + connectivity + text qualifiers." On P&ID datasets, it achieves 94.84% SGDET R@2000 with approximately 1/8 the computational cost of two-stage methods.

Enhancing Visual Representation with Textual Semantics: Textual Semantics-Powered Prototypes for Heterogeneous Federated Learning

Addressing the issue where existing Federated Prototypical Learning methods destroy inter-class semantic relations, the proposed FedTSP method utilizes pre-trained language models to construct textual prototypes that preserve semantic structures, significantly improving performance and accelerating convergence in heterogeneous federated learning.

Browse all 22 Optimization & Theory papers →


🔬 Interpretability (33)

Align Once to Explain: Feature Alignment for Scalable B-cosification of Foundational Vision Transformers

ALOE utilizes a one-time, label-free "teacher-student feature alignment" to convert frozen ViT foundation models (Supervised / DINOv3 / SigLIP2) into inherently interpretable B-cos versions. Once aligned, the backbone can be used as a drop-in replacement for tasks like classification, zero-shot, and dense prediction, improving accuracy by \(>4.9\) percentage points over original B-cosification on ViTs while providing faithful and localized explanations with \(100–1000\times\) higher data efficiency.

Back to the Feature: Explaining Video Classifiers with Video Counterfactual Explanations

This paper proposes BTTF, a pure optimization framework that uses Image-to-Video diffusion models to generate Counterfactual Explanations (CFE) for video classifiers. By optimizing the initial noise latent variable solely based on the gradients of the target classifier—first anchoring the search via "inversion" near the original video and then optimizing toward the target category—it generates a "parallel video" that is most similar to the original yet classified as another category, revealing the spatiotemporal features the model relies on for decision-making.

Beyond Top Activations: Efficient and Reliable Crowdsourced Evaluation of Automated Interpretability

For the problem of "how to evaluate automated neuron explanations," this paper utilizes Model-Guided Importance Sampling (MG-IS) to select the most informative inputs for crowdsourced labeling and Bayesian Rating Aggregation (BRAgg) to remove noise. This reduces the cost of a reliable full-distribution correlation evaluation from approximately $90k to $2.16k (~40×). Using this method, the authors systematically compare mainstream interpretability methods across multiple vision models, finding that Linear Explanations perform best overall, surprisingly outperforming recent LLM-based methods.

CIGMA: Causal Information-Gain Mechanistic Attribution of Attention Heads in Vision Transformers

CIGMA quantifies the contribution of each attention head to background shortcuts using two counterfactual edits (masking foreground/background). By ranking heads according to causal information gain and surgically zeroing out the top-K "spurious heads," ViT/VLM models are encouraged to shift attention from the background to foreground objects without requiring training. This leads to classification accuracy gains of 7.6–24.8 percentage points and an approximately 83% reduction in background dependency.

CREward: A Type-Specific Creativity Reward Model

This paper decomposes "visual creativity" along the image formation pipeline into three interpretable axes: Geometry / Material / Texture. It first establishes a human benchmark, CreBench, via expert pairwise comparisons to confirm that Large Vision-Language Models (LVLMs) align closely with human judgment regarding creativity. Subsequently, a lightweight type-specific reward model, CREward (comprising a frozen visual backbone and MLP heads), is distilled from LVLM-generated preference labels. This model is applied across three domains: creativity evaluation, creative sample filtering / LoRA slider-guided generation, and Grad-CAM based interpretability.

Cut to the Chase: Training-free Multimodal Summarization via Chain-of-Events

This paper proposes CoE, a training-free multimodal summarization framework. By constructing a Hierarchical Event Graph (HEG) to guide chain-of-event reasoning, it surpasses SOTA video CoT baselines on 8 datasets, achieving an average improvement of +3.04 ROUGE, +9.51 CIDEr, and +1.88 BERTScore.

Draft and Refine with Visual Experts

This paper proposes DnR (Draft and Refine), an agent framework based on a query-conditional Visual Utilization metric. This framework quantifies an LVLM's actual reliance on visual evidence and iteratively improves visual grounding to reduce hallucinations through rendering feedback from external visual experts (detection/segmentation/OCR, etc.).

Edit-As-Act: Goal-Regressive Planning for Open-Vocabulary 3D Indoor Scene Editing

This work redefines open-vocabulary 3D indoor scene editing as a goal-regressive planning problem. It introduces the PDDL-style symbolic language EditLang and an LLM-driven Planner-Validator loop to derive minimal editing sequences from target states. The method achieves the best balance across instruction faithfulness (69.1%), semantic consistency (86.6%), and physical plausibility (91.7%) across 63 editing tasks.

ERMoE: Eigen-Reparameterized Mixture-of-Experts for Stable Routing and Interpretable Specialization

ERMoE proposes reparameterizing MoE expert weights within an orthogonal eigenbasis and substituting traditional routing logits with eigenbasis alignment scores (cosine similarity), enabling stable routing and interpretable expert specialization without the need for auxiliary load balancing losses.
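
Routing by cosine alignment instead of learned logits can be sketched as below. For simplicity each expert is represented by a single direction vector, whereas the note describes a full eigenbasis per expert, so this is a simplified illustration with hypothetical names.

```python
import math

def cosine(a, b):
    """Cosine similarity between two nonzero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def route(token, expert_directions, top_k=1):
    """Pick the experts whose direction aligns best with the token."""
    scores = [cosine(token, d) for d in expert_directions]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:top_k], scores

experts = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]  # one toy direction per expert
chosen, scores = route([1.0, 0.1], experts)
chosen_scaled, _ = route([10.0, 1.0], experts)
```

Because cosine scores are bounded and ignore the token's magnitude, rescaling the input does not change the routing decision, which is one intuition for why such scores can be more stable than unbounded logits.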

H-Sets: Hessian-Guided Discovery of Set-Level Feature Interactions in Image Classifiers

H-Sets utilizes the input Hessian to detect second-order (non-additive) interactions between pixels, recursively merging them into semantically coherent feature sets. It then scores each set using IDG-Vis (Integrated Directional Gradients + Harsanyi Dividends) at the set level, ultimately producing saliency maps that are sparser and more faithful than existing methods.

Browse all 33 Interpretability papers →


📦 Model Compression (98)

4D-RGPT: Toward Region-level 4D Understanding via Perceptual Distillation

This paper proposes 4D-RGPT and the Perceptual 4D Distillation (P4D) framework, which enhances 4D perception by distilling knowledge such as depth and optical flow from a frozen 4D perceptual expert model into an MLLM. It also introduces R4D-Bench, the first region-level 4D video question-answering benchmark.

A Unified Framework for Knowledge Transfer in Bidirectional Model Scaling

BoT treats neural network weights as "continuous signals," where models of different sizes are simply discretized versions of the same signal at different resolutions. By applying 3D Discrete Wavelet Transform (DWT) for downsampling to achieve Large-to-Small (L2S) transfer and Inverse DWT (IDWT) with zero-padded high frequencies for upsampling to achieve Small-to-Large (S2L) transfer, it introduces the first training-free, zero-parameter framework that unifies cross-architecture knowledge transfer in both directions. It saves up to 67.1% of pre-training FLOPs on DeiT, BERT, and GPT.
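
The resolution analogy can be made concrete with a one-level, one-dimensional Haar wavelet. Keeping only the low-frequency coefficients shrinks a weight vector, and inverting with zeroed high frequencies expands it. The note describes a 3D DWT on real weight tensors; this 1D version on a toy list is an assumption made for brevity.

```python
import math

SQRT2 = math.sqrt(2.0)

def haar_down(signal):
    """One-level Haar DWT, keeping only the low-frequency (approximation) half."""
    return [(signal[i] + signal[i + 1]) / SQRT2 for i in range(0, len(signal) - 1, 2)]

def haar_up(approx):
    """Inverse Haar DWT with all high-frequency (detail) coefficients set to zero."""
    out = []
    for a in approx:
        out.extend([a / SQRT2, a / SQRT2])
    return out

large = [1.0, 3.0, 2.0, 6.0]   # toy "large model" weights
small = haar_down(large)       # Large-to-Small: half the length
restored = haar_up(small)      # Small-to-Large: pairwise averages of `large`
roundtrip = haar_down(haar_up(small))
```

Going up and then down returns the small signal unchanged, while going down and then up loses the discarded detail and leaves pairwise averages.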

AdaBet: Gradient-free Layer Selection for Efficient Training of Deep Neural Networks

This paper proposes AdaBet, a gradient-free layer selection method based on algebraic topology (the first Betti number \(b_1\)). By calculating the topological complexity of the activation space of each layer through only a forward pass, it determines which layers require fine-tuning without the need for labels, gradients, or backpropagation. On ResNet50/VGG16/MobileNetV2/ViT-B16, AdaBet achieves higher accuracy than full training with only 10% of layers fine-tuned, while reducing peak memory by approximately 40%.

Adaptive Depth Lightweight RGB-T Tracking with Holistic Token Routing

ADTrack treats network depth as a dynamically allocatable computational budget. By equipping a frozen dual-stream ViT-T backbone with multi-layer "anytime" prediction heads and a confidence-calibrated early-exit strategy, and employing a minimalist Holistic-Token-Guided Interaction (HTGI) module with only 37.3K parameters for low-cost cross-modal fusion, it achieves 70.2% PR / 56.3% SR on LasHeR. It runs at 148.3 FPS on GPU, 50.2 FPS on CPU, and 28.7 FPS on edge devices.

Adaptive Video Distillation: Mitigating Oversaturation and Temporal Collapse in Few-Step Generation

Addressing the common issues of "color oversaturation + motion collapse" in DMD (Distribution Matching Distillation) for video diffusion models, this paper proposes an adaptive regression loss (using an EMA cache to dynamically down-weight unreliable real samples with high variance) and a temporal regularization loss (directly penalizing low inter-frame variance). Combined with an inference acceleration strategy that reduces the frame rate at high-noise steps and interpolates them back at low-noise steps, the method achieves 4-step generation on Wan2.1-1.3B/14B. The VBench/VBench2 scores surpass all distillation baselines, and user preference even exceeds that of the 50-step teacher model.

AdaSVD: Singular Value Decomposition with Adaptive Mechanisms for Large Multimodal Models

AdaSVD utilizes "alternating least squares to compensate for truncated singular matrices" and "adaptive compression rate allocation based on layer importance." These mechanisms significantly reduce accuracy loss in SVD-based Large Multimodal Models (LMMs) under high compression rates (60%+), consistently outperforming SVD-LLM across LLaMA2, OPT, Mistral, and Vicuna.

Back to Source: Open-Set Continual Test-Time Adaptation via Domain Compensation

Aiming at the Open-set Continual Test-Time Adaptation (OCTTA) scenario where "continual domain drift" and "unknown novel classes" occur simultaneously, this paper proposes DOCO. The method first splits the current batch into ID-like and OOD-like subsets. It then learns a visual prompt on the ID samples to "pull" feature statistics back to the source domain. Finally, this prompt is directly reused for OOD samples in the same batch to strip away their domain shift and expose their semantic novelty. This three-step closed loop, in which the ID and OOD branches assist each other, achieves an H-score 4.7% higher than the second-best method on ImageNet-C.

Balanced Dataset Distillation via Modeling Multiple Visual Pattern Distribution

This work identifies a prevalent "pattern imbalance" in existing dataset distillation methods (either favoring intra-class majority class-general patterns or rare marginal patterns). It proposes the BPS framework: first, each class is modeled as a distribution of multiple visual patterns using a hierarchical semantic structure; then, a pattern-balanced coreset is constructed by taking half of the IPC budget from both the "center" and "margin" of each pattern; finally, a student model is trained via knowledge distillation. BPS comprehensively outperforms previous SOTA across four benchmarks and naturally possesses advantages in cross-architecture generalization and efficiency through its "mode once, reuse for all IPC" approach.

Batch Loss Score for Dynamic Data Pruning

Batch Loss Score (BLS) estimates sample importance using only the mean batch loss instead of hard-to-acquire per-sample losses. It comes with theoretical guarantees from a signal processing perspective (EMA low-pass filtering) and can be integrated into existing dynamic pruning frameworks with only 3 lines of code.
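
One plausible reading of this idea is sketched below: every sample keeps an exponential moving average of the mean loss of the batches it has appeared in, and low-scoring samples are pruned. The function names, the initialization, and the pruning rule are assumptions, not the paper's exact recipe.

```python
def update_scores(scores, batch_indices, batch_loss, momentum=0.9):
    """EMA of the mean loss of every batch a sample has appeared in."""
    for i in batch_indices:
        previous = scores.get(i, batch_loss)  # first visit: start at the batch loss
        scores[i] = momentum * previous + (1.0 - momentum) * batch_loss
    return scores

def keep_top(scores, fraction):
    """Indices of the highest-scoring samples; the rest are pruned."""
    n_keep = max(1, int(len(scores) * fraction))
    return sorted(scores, key=scores.get, reverse=True)[:n_keep]

scores = {}
update_scores(scores, [0, 1], batch_loss=2.0)  # samples 0 and 1 sat in a hard batch
update_scores(scores, [2, 3], batch_loss=0.5)  # samples 2 and 3 sat in an easy batch
update_scores(scores, [0, 2], batch_loss=1.0)  # a mixed batch
kept = keep_top(scores, fraction=0.5)
```

Only the scalar batch loss is needed at each step, so no per-sample loss has to be extracted from the training loop.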

Beyond Soft Label: Dataset Distillation via Orthogonal Gradient Matching

Addressing the issue where existing ImageNet-1K dataset distillation methods rely excessively on BN statistics matching and suffer performance collapse without soft labels, this paper argues from a gradient perspective that BN matching only aligns gradient "scales" while ignoring the "directions" that determine training. The authors propose Orthogonal Gradient Matching (OGM), which performs SVD on real/synthetic gradients, forces all singular values to 1 to align only the singular vectors, and utilizes the closed-form gradient of the Least Squares Error (LSE) loss to complete matching during the forward pass. At IPC=10, OGM achieves 47.0% with soft labels and 16.7% with hard labels, significantly surpassing baselines like RDED.

Browse all 98 Model Compression papers →


🕸️ Graph Learning (8)

Adaptive Learned Image Compression with Graph Neural Networks

GLIC transforms nonlinear transformations in learned image compression (LIC) from fixed convolutions or window attention into content-adaptive connections driven by Graph Neural Networks (GNNs). It employs dual-scale graphs to determine "where to connect" and a complexity-aware mechanism to decide "how much to connect" to better model local and long-range redundancy. It significantly outperforms traditional codecs and recent LIC baselines across three standard datasets.

Graph2Eval: Automatic Multimodal Task Generation for Agents via Knowledge Graphs

This paper introduces Graph2Eval, a knowledge-graph-driven framework for the automatic generation of agent evaluation tasks. By constructing structured knowledge graphs from documents/webpages, performing subgraph sampling, LLM conditional generation, and multi-stage filtering, it automatically produces multimodal agent tasks with significantly improved semantic consistency (+20%) and solvability (+17%), resulting in the Graph2Eval-Bench containing 1,319 tasks.

M3KG-RAG: Multi-hop Multimodal Knowledge Graph-enhanced Retrieval-Augmented Generation

M3KG-RAG constructs a Multi-hop Multimodal Knowledge Graph (M3KG) via a lightweight multi-agent pipeline and introduces the GRASP mechanism for entity grounding and selective pruning. It retains only query-relevant and answer-assisting knowledge, significantly enhancing the audio-visual reasoning capabilities of MLLMs.

Mario: Multimodal Graph Reasoning with Large Language Models

Mario targets LLM reasoning on Multi-Modal Graphs (MMGs). It achieves topology-aware cross-modal alignment via a Graph-conditioned Vision-Language Model (GVLM) and selects the optimal modality configuration for each node using a Modality-Adaptive Prompt Router (MAPR), reaching SOTA performance on node classification and link prediction.

Mixture-of-Experts based Feature Decoupling for Open Vocabulary Scene Graph Generation

Addressing the issues of "relying solely on off-the-shelf VLM features, lacking discriminative attributes, and semantic isolation between objects and relations" in Open Vocabulary Scene Graph Generation (OVSGG), this paper proposes MoE-FD. It adaptively decouples object/relation features into sub-attributes like shape, texture, and space using a Mixture-of-Experts (MoE) module, followed by iterative cross-attention for mutual refinement between nodes and edges. On the Visual Genome all-open vocabulary setting, it significantly improves R@100 for novel categories (e.g., +4.24% R@20 over ACC in the OvD+R novel relation setting).

R2G: A Multi-View Circuit Graph Benchmark Suite from RTL to GDSII

This paper proposes R2G, the first standardized multi-view circuit graph benchmark suite, providing five stage-aware, information-equivalent graph representations across 30 IP cores. A systematic study reveals that the choice of graph representation has a greater impact on performance than the choice of GNN model.

Robo-SGG: Exploiting Layout-Oriented Normalization and Restitution Can Improve Robust Scene Graph Generation

Addressing the issue where "domain shift in visual features leads to a performance collapse" in robust Scene Graph Generation (inference on corrupted images with noise/blur/weather), this paper proposes a plug-and-play framework, Robo-SGG. It utilizes Instance Normalization to eliminate domain-specific statistics caused by corruption and uses layout-aware attention to recover global structural features (NRM). Additionally, it employs gated fusion to adaptively balance visual and coordinate features (LEE). Integrating these into existing SGG models yields relative improvements in mR@50 of 6.3% / 11.1% / 8.0% for PredCls/SGCls/SGDet on VG-C.

ViterbiPlanNet: Injecting Procedural Knowledge via Differentiable Viterbi for Planning

This paper embeds a Procedural Knowledge Graph (PKG) into a planning model end-to-end via a differentiable Viterbi layer, allowing the neural network to focus on learning emission probabilities rather than memorizing complete procedural structures. This achieves SOTA success rates on CrossTask/COIN/NIV with only 5-7M parameters (1-3 orders of magnitude fewer than Diffusion/LLM methods) and establishes a unified evaluation benchmark.
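
As a rough illustration of how a procedural graph can constrain decoding, the sketch below runs classical (hard, non-differentiable) Viterbi decoding in log space. It is not the paper's differentiable layer. The three-step procedure, the transition matrix, and the emission values are all hypothetical, and forbidden graph edges are assumed to be encoded as `-inf` transition scores.

```python
import math

NEG_INF = float("-inf")

def viterbi(log_emissions, log_transitions, log_start):
    """Return the highest-scoring state sequence.

    log_emissions: T rows of S log-probabilities (what the network would predict)
    log_transitions: S x S log-scores; -inf marks edges absent from the graph
    log_start: S log-scores for the first step
    """
    n_states = len(log_start)
    score = [log_start[s] + log_emissions[0][s] for s in range(n_states)]
    backpointers = []
    for t in range(1, len(log_emissions)):
        new_score, pointers = [], []
        for s in range(n_states):
            # Best predecessor for state s at time t
            best_prev = max(range(n_states),
                            key=lambda p: score[p] + log_transitions[p][s])
            new_score.append(score[best_prev] + log_transitions[best_prev][s]
                             + log_emissions[t][s])
            pointers.append(best_prev)
        score = new_score
        backpointers.append(pointers)
    # Trace the best path backwards from the best final state
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]

# Hypothetical procedure where 0 -> 1 -> 2 is the only allowed order.
transitions = [
    [NEG_INF, 0.0, NEG_INF],
    [NEG_INF, NEG_INF, 0.0],
    [NEG_INF, NEG_INF, NEG_INF],
]
start = [0.0, NEG_INF, NEG_INF]
# Emissions at t=1 prefer step 2, which the graph does not allow yet.
emissions = [
    [math.log(0.8), math.log(0.1), math.log(0.1)],
    [math.log(0.2), math.log(0.3), math.log(0.5)],
    [math.log(0.1), math.log(0.2), math.log(0.7)],
]
plan = viterbi(emissions, transitions, start)
```

Here the graph overrides the locally preferred step at `t=1`, so the network only has to supply emission scores. A differentiable variant would typically replace the hard `max` with a smooth relaxation.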


🤝 Federated Learning (18)

Domain Sensitive Federated Learning with Fisher-Informed Pruning

FEDFIP estimates channel importance using domain-specific Fisher information to assemble a globally shared pruning mask on the server, while clients "reactivate" a small number of locally critical channels. Combined with domain-prototype structural contrastive regularization and a "shared-channel-only" aggregation strategy, it significantly compresses models while achieving higher accuracy and stability than mainstream FL baselines in multi-domain federated scenarios.

FedAdamom: Adaptive Momentum for Improved Generalization in Federated Optimization

This paper uses diffusion theory to explain the root cause of why "FedAdam converges fast but generalizes poorly"—the adaptive learning rate weakens the preference for flat minima. Based on this, FedAdamom is proposed: shifting the adaptive mechanism from the learning rate to the momentum coefficient. This preserves the ability to quickly escape saddle points while restoring the selection of flat minima, simultaneously achieving faster convergence and higher accuracy on CIFAR-10/100, Tiny-ImageNet, and LEAF.

FedAlign: Differentially Private Distribution Alignment for Non-IID Federated Learning

FedAlign requires each client to upload noisy versions of the first four statistical moments (mean, variance, skewness, kurtosis) of their local data. The server aggregates these into a global reference distribution and broadcasts it back. Clients then align the distribution of their locally sampled data accordingly—mitigating both Non-IID heterogeneity and privacy leakage under differential privacy constraints, achieving a ~4% accuracy gain over strong baselines on CIFAR-10.
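
The moment-sharing step can be sketched as follows. This is only an illustration under assumptions: the noise is plain Gaussian with a hypothetical `noise_std`, the server aggregation is assumed to be a size-weighted average, and no sensitivity clipping or privacy accounting is done, so the snippet does not by itself provide a differential privacy guarantee.

```python
import random

def noisy_moments(values, noise_std, rng):
    """Return (mean, variance, skewness, kurtosis), each perturbed by Gaussian noise."""
    n = len(values)
    mean = sum(values) / n
    centered = [v - mean for v in values]
    var = sum(c ** 2 for c in centered) / n
    std = var ** 0.5
    skew = sum(c ** 3 for c in centered) / n / std ** 3
    kurt = sum(c ** 4 for c in centered) / n / std ** 4
    return tuple(m + rng.gauss(0.0, noise_std) for m in (mean, var, skew, kurt))

def aggregate(client_moments, client_sizes):
    """Size-weighted average of client moment tuples (an assumed aggregation rule)."""
    total = sum(client_sizes)
    return tuple(
        sum(m[i] * n for m, n in zip(client_moments, client_sizes)) / total
        for i in range(4)
    )

rng = random.Random(0)
client_a = noisy_moments([1.0, 2.0, 3.0, 4.0, 5.0], noise_std=0.05, rng=rng)
client_b = noisy_moments([2.0, 2.0, 6.0, 8.0], noise_std=0.05, rng=rng)
global_reference = aggregate([client_a, client_b], [5, 4])
```

Each client would then compare its local moments against `global_reference` when resampling or reweighting its data.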

FedHarmony: Harmonizing Heterogeneous Label Correlations in Federated Multi-Label Learning

To address the issue in Federated Multi-Label Learning (FedMLL) where clients observe only local label spaces and generate conflicting label correlations (Label Correlation Drift), FedHarmony utilizes "Consensus Correlation" from the majority of clients as a global teacher to correct local training biases. Furthermore, it weights clients during server aggregation based on both data volume and correlation quality. It consistently outperforms existing SOTA on three non-IID federated benchmarks: FLAIR, COCO-80, and VOC2007 (e.g., +11.4 mAP on FLAIR).

FedRAC: Rolling Submodel Allocation for Collaborative Fairness in Federated Learning

FedRAC introduces a "dynamic reputation calculation that evolves with training" alongside "submodel construction via historical frequency rotation followed by reputation-based allocation." This dual-module approach ensures high-contribution clients receive superior submodels (fairness) while maintaining uniform training for every neuron in the global model (accuracy). It outperforms existing collaborative fairness methods in both fairness and accuracy.

FedRG: Unleashing the Representation Geometry for Federated Learning with Noisy Clients

To address the dual challenges of "noisy client annotations + Non-IID data" in Federated Learning, FedRG abandons the unreliable small-loss heuristic. Instead, it identifies clean/noisy samples based on representation geometry. Specifically, it first learns label-agnostic representations on a hypersphere through self-supervision, then uses a vMF mixture model to compare "geometric evidence" with "annotated label evidence" in a shared space for noise detection. Finally, it employs a personalized noise absorption matrix for robust optimization, achieving SOTA across multiple datasets and four noise scenarios.

Fine-Tuning Impairs the Balancedness of Foundation Models in Long-tailed Personalized Federated Learning

This paper empirically reveals that fine-tuning CLIP in long-tailed federated scenarios destroys its inherent class balance, even falling below zero-shot performance. It proposes FedPuReL: using zero-shot predictions to "purify" local gradients into directions that preserve balance for a global model, and reframing personalization as "residual correction" atop a frozen global model. FedPuReL outperforms existing SOTA in both global and personalized models across 8 long-tailed datasets.

From Selection to Scheduling: Federated Geometry-Aware Correction Makes Exemplar Replay Work Better under Continual Dynamic Heterogeneity

Addressing the pain point in Federated Continual Learning (FCL) where "selecting samples is easy, but utilizing them is difficult," FEAT does not modify the replay strategy itself. Instead, it employs a set of fixed ETF prototypes shared across all clients. It uses geometric structure distillation during training to align feature angles across clients and applies energy-based geometric correction during inference to "pull back" tail-class features from head-class subspaces. As a plug-and-play module layered on Re-Fed+ or FedCBDR, it yields stable performance gains.

Fully Decentralized Certified Unlearning

Addressing the neglected scenario of "decentralized networks without a central coordinator," this paper proposes RR-DU—a random-walk-based certified unlearning algorithm. It performs noisy projected gradient ascent on the forgetting set only at the client initiating the deletion, while other clients continue with noise-free descent. By incorporating sub-sampled Gaussian noise and trust region projections, the authors prove \((\varepsilon,\delta)\) network unlearning certificates, convergence, and deletion capacity bounds. Notably, the noise does not scale with the size of the forgetting set \(m\), successfully reducing backdoor attack success rates to random-guess levels while maintaining clean accuracy on image classification tasks.

GDFA: Geometry-Driven Federated Unlearning with Directional Task Vector Alignment

GDFA reinterprets "Federated Unlearning" as a loss surface geometry problem: it first migrates the global model to a flat-minimum region via perturbations, then has relevant clients generate task vectors on unlearning data, retaining only components with directional consensus (sign consensus) for reverse aggregation. This achieves precise erasure of target client knowledge in Non-IID scenarios with almost no loss of accuracy on retained tasks.
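
The sign-consensus filtering can be illustrated with a small sketch. It assumes that "consensus" means all contributing clients agree on a strictly positive or strictly negative sign for a coordinate, and that reverse aggregation subtracts the averaged consensus direction. Both are assumptions, and the function names are hypothetical.

```python
def sign(x):
    """Return -1, 0, or 1."""
    return (x > 0) - (x < 0)

def consensus_task_vector(task_vectors):
    """Average task vectors, zeroing coordinates where clients disagree on sign."""
    merged = []
    for coords in zip(*task_vectors):
        signs = {sign(c) for c in coords}
        if len(signs) == 1 and 0 not in signs:
            merged.append(sum(coords) / len(coords))
        else:
            merged.append(0.0)
    return merged

def unlearn(global_weights, task_vectors, scale=1.0):
    """Subtract the consensus direction from the global weights."""
    direction = consensus_task_vector(task_vectors)
    return [w - scale * d for w, d in zip(global_weights, direction)]

# Two clients: they agree on coordinates 0 and 2 and conflict on coordinate 1.
vectors = [[1.0, -2.0, 0.5], [3.0, 2.0, 0.5]]
new_weights = unlearn([1.0, 1.0, 1.0], vectors)
```

The conflicting coordinate is left untouched, which is the intuition behind limiting damage to retained knowledge.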

Browse all 18 Federated Learning papers →


📈 Time Series (7)

PFGNet: A Fully Convolutional Frequency-Guided Peripheral Gating Network for Efficient Spatiotemporal Predictive Learning

PFGNet is a pure convolutional spatiotemporal prediction framework that dynamically modulates multi-scale large-kernel peripheral responses via Pixel-level Frequency-guided Gating (PFG) and applies learnable center suppression. Mimicking the center-surround band-pass filtering mechanism of biological vision, it achieves SOTA or near-SOTA performance on Moving MNIST, TaxiBJ, KTH, and Human3.6M benchmarks with minimal parameters and computational cost.

Probabilistic Precipitation Nowcasting with Rectified Flow Transformers

This work proposes FREUD—a framework utilizing a Rectified Flow Transformer as a "compressed first stage." It employs a frame-level encoder to independently encode each frame and a joint video decoder to reconstruct all frames simultaneously, replacing deterministic decoding with probabilistic decoding to quantify uncertainty during the compression stage. Combined with a latent-space rectified flow nowcasting model, it achieves SOTA CRPS (0.0190) and SSIM on the SEVIR precipitation nowcasting benchmark.

Real-Time Long Horizon Air Quality Forecasting via Group-Relative Policy Optimization

This paper addresses long-horizon (48–120 hours) PM concentration forecasting in East Asia. It first releases CMAQ–OBS, a regional dataset aligned with observations, and then employs a two-stage training framework (FAKER-Air) consisting of "SFT with temporal accumulation loss + GRPO with categorical AQI rewards." This counters the over-forecasting and high false-alarm tendency inherent in MSE training and aligns the model with actual operational costs, reducing the False Alarm Rate (FAR) by 47.3% relative to the SFT baseline while maintaining a competitive F1 score.

SATTC: Structure-Aware Label-Free Test-Time Calibration for Cross-Subject EEG-to-Image Retrieval

SATTC is proposed as a label-free test-time calibration head. By employing a Product-of-Experts (PoE) fusion of a geometric expert (subject-adaptive whitening + adaptive CSLS) and a structural expert (mutual nearest neighbors + bidirectional top-k ranking + category popularity), it operates directly on the similarity matrix of frozen EEG and image encoders. This approach significantly enhances Top-1 accuracy and alleviates the hubness effect in cross-subject EEG-to-image retrieval.

Stable Spike: Dual Consistency Optimization via Bitwise AND Operations for Spiking Neural Networks

This paper proposes the Stable Spike dual consistency optimization framework, which uses hardware-friendly bitwise AND operations to decouple stable spike skeletons from multi-timestep spike maps and injects amplitude-aware spike noise to enhance generalization. It improves neuromorphic object recognition accuracy by up to 8.33% under ultra-low latency (\(T=2\)).
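
One plausible reading of the bitwise AND step is that a position belongs to the stable skeleton only if it spikes at every timestep. The sketch below illustrates that reading on binary lists. It is an assumption about the mechanism, not the paper's implementation, and the spike values are made up.

```python
def stable_skeleton(spike_maps):
    """Bitwise AND across timesteps: keep positions that spike at every step."""
    skeleton = list(spike_maps[0])
    for spikes in spike_maps[1:]:
        skeleton = [a & b for a, b in zip(skeleton, spikes)]
    return skeleton

def unstable_residual(spikes, skeleton):
    """Spikes at one timestep that are not part of the stable skeleton."""
    return [s & (1 - k) for s, k in zip(spikes, skeleton)]

# Two timesteps (T=2) of binary spikes for six neurons.
spikes_t1 = [1, 0, 1, 1, 0, 1]
spikes_t2 = [1, 1, 0, 1, 0, 1]
skeleton = stable_skeleton([spikes_t1, spikes_t2])
residual_t1 = unstable_residual(spikes_t1, skeleton)
```

Because the operation is a plain AND on binary values, it maps directly onto integer or bit-packed hardware operations.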

STCast: Adaptive Boundary Alignment for Global and Regional Weather Forecasting

STCast replaces static boundaries with learnable global-regional distributions through Spatial-Aligned Attention (SAA), adaptively fusing global atmospheric information into regional forecasts. It uses a Temporal Mixture-of-Experts (TMoE) with monthly dynamic routing to enhance temporal modeling, outperforming existing methods across four tasks: global forecasting, high-resolution regional forecasting, typhoon track prediction, and ensemble forecasting.

Towards Uncertainty-aware Unsupervised Domain Adaptation for Videos and Time-Series with Causal Optimal Transport

This paper proposes Causal-OT, which embeds inter-channel Granger causality graphs into the Optimal Transport (OT) cost matrix for cross-domain alignment. It simultaneously employs entropy-based uncertainty filtering for pseudo-labels to ensure that time-series and video domain adaptation preserves temporal-causal structures without being biased by overconfident pseudo-labels. It achieves an average accuracy improvement of 4.5% across 6 time-series benchmarks and 2.5% across 4 video benchmarks.


🏥 Medical Imaging (163)

A Supervised Multi-task Framework for Joint cryo-ET Restoration Enabled by Generative Physical Simulation

cryoDeRec utilizes a "generative noise modeling + physical imaging simulation" pipeline to generate paired tomograms consisting of "noisy inputs \(\leftrightarrow\) clean GT." This transforms cryo-ET denoising and missing wedge restoration, which previously relied on self-supervised methods, into fully supervised multi-task training. A single U-Net performs both tasks simultaneously, outperforming Topaz-Denoise / SC-Net / IsoNet across four real and two simulated datasets.

Act Like a Pathologist: Tissue-Aware Whole Slide Image Reasoning

The HistoSelect framework is proposed to simulate the coarse-to-fine reasoning process of pathologists. Through a three-tier filtering mechanism consisting of tissue segmentation → Group Sampler → Patch Selector, and based on Information Bottleneck (IB) theory, irrelevant visual tokens are compressed. This achieves SOTA performance across three datasets while reducing computational overhead by approximately 70%.

Active Inference for Micro-Gesture Recognition: EFE-Guided Temporal Sampling and Adaptive Learning

This paper proposes the UAAI framework, which introduces Active Inference to micro-gesture recognition for the first time. Through EFE-guided temporal frame selection, spatial attention, and UMIX uncertainty-aware augmentation, it achieves 63.47% on the RGB modality of the SMG dataset, significantly outperforming traditional RGB methods.

AD-GBC: Anisotropic Granular-Ball Skip-Connection Refiner for UNet-Based Medical Image Segmentation

The study upgrades "point prototypes / isotropic balls" used as semantic anchors in UNet to differentiable granular balls with anisotropic vector scales. A bidirectional "Pixel Set ↔ Ball" aggregation-broadcasting mechanism serves as a semantic refiner for skip-connections, supplemented by two geometric regularizations to prevent anchor collapse. This approach yields consistent performance gains (average IoU +1.3~1.7%) across four medical segmentation benchmarks for both Rolling-UNet and U-KAN backbones.

Adaptive Anisotropic Gaussian Splatting for Multi-contrast MRI Arbitrary-Scale Super-Resolution with Anatomy Guidance

GaussM2ASR reformulates multi-contrast MRI arbitrary-scale super-resolution (ASSR) from "INR direct regression of pixel intensity" to "learning parameters for a set of anisotropic 2D Gaussian kernels." By using narrow kernels to fit high-frequency anatomical boundaries and wide kernels for smooth low-frequency regions, combined with three anatomy-driven modules to align structures with high-resolution reference images, it outperforms existing SOTA methods in PSNR/SSIM across IXI, BraTS, and fastMRI datasets.

Adaptive Confidence Regularization for Multimodal Failure Detection

The ACR framework is proposed to systematically address misclassification detection in multimodal scenarios for the first time. By combining Adaptive Confidence Loss (penalizing the "confidence degradation" phenomenon where multimodal fusion confidence is lower than unimodal confidence) and Multimodal Feature Swapping (synthesizing failure samples in the feature space), ACR significantly outperforms existing methods across four datasets.

Any2Any 3D Diffusion Models with Knowledge Transfer: A Radiotherapy Planning Study

This work transfers 3D diffusion models pre-trained on natural videos (Wan 2.1) or public CT datasets (MAISI) to radiotherapy dose prediction. It introduces an "Any2Any" modality conditioning paradigm allowing any modality to serve as a generation target, followed by reinforcement learning post-training aligned with clinical Scorecards to match institutional preferences. It achieved a new SOTA on the GDP-HMM challenge, reducing voxel-level MAE from 2.07 to 1.93.

BackSplit: The Importance of Sub-dividing the Background in Biomedical Lesion Segmentation

The paper proposes BackSplit: a paradigm that subdivides the homogeneous "background" in lesion segmentation into semantic auxiliary organ/tissue classes for joint multi-class softmax training. Using Fisher information theory, it proves that this approach retains more information and produces more stable estimates than binary training, consistently improving Dice scores for small lesions across five datasets with zero additional inference overhead.

Benchmarking Endoscopic Surgical Image Restoration and Beyond

The authors constructed SurgClean, the first multi-source real-world endoscopic surgical image restoration dataset (3,113 images across desmoking, dehazing, and desplashing). They systematically evaluated 22 representative methods (12 general and 10 task-specific), revealing a significant gap between existing methods and clinical requirements, while analyzing the intrinsic differences between surgical and natural scene degradations.

Better than Average: Spatially-Aware Aggregation of Segmentation Uncertainty Improves Downstream Performance

This work presents the first systematic study of aggregation strategies from pixel-level uncertainty to image-level scores in segmentation tasks. It proposes SMR aggregators that integrate spatial structural information (Moran's I, Edge Density, Shannon Entropy) and a GMM-based meta-aggregator. Evaluation across 10 datasets demonstrates that global average (AVG) is a suboptimal choice, while GMM-All meta-aggregation performs robustly in both OoD and failure detection.
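
To see why a global average can discard useful structure, the sketch below computes Moran's I, one of the spatial statistics named above, on a 2D uncertainty map. It assumes binary 4-neighbour (rook) weights, and the toy maps are invented. How the paper combines this statistic with the others is not shown here.

```python
def morans_i(grid):
    """Moran's I for a 2D grid with binary 4-neighbour weights."""
    rows, cols = len(grid), len(grid[0])
    n = rows * cols
    mean = sum(sum(row) for row in grid) / n
    dev = [[v - mean for v in row] for row in grid]
    denom = sum(d * d for row in dev for d in row)
    if denom == 0:
        return 0.0  # constant map: statistic undefined, return 0 by convention
    num, weight_sum = 0.0, 0
    for i in range(rows):
        for j in range(cols):
            for di, dj in ((0, 1), (1, 0), (0, -1), (-1, 0)):
                ni, nj = i + di, j + dj
                if 0 <= ni < rows and 0 <= nj < cols:
                    num += dev[i][j] * dev[ni][nj]
                    weight_sum += 1
    return (n / weight_sum) * (num / denom)

# Two uncertainty maps with the same mean (0.5) but different spatial structure.
clustered = [[1, 1, 0, 0] for _ in range(4)]
checkerboard = [[(i + j) % 2 for j in range(4)] for i in range(4)]
clustered_score = morans_i(clustered)
checkerboard_score = morans_i(checkerboard)
```

Both maps have identical average uncertainty, yet the clustered map yields a positive score and the checkerboard a negative one, which an average-only aggregator cannot distinguish.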

Browse all 163 Medical Imaging papers →


🧬 Computational Biology (19)

HINGE: Adapting a Pre-trained Single-Cell Foundation Model to Spatial Gene Expression Generation from Histology Images

The HINGE framework is proposed to adapt a pre-trained expression-space single-cell foundation model (sc-FM, CellFM) into a histology image-conditioned spatial gene expression generator. This is achieved by lightweight injection of visual context via identity-initialized SoftAdaLN modulation, alignment with pre-training objectives through an expression-space masked diffusion process, and training stabilization via a warm-start curriculum. It achieves SOTA results across three ST datasets while maintaining superior gene co-expression consistency.

Advancing Cancer Prognosis with Hierarchical Fusion of Genomic, Proteomic and Pathology Imaging Data from a Systems Biology Perspective

HFGPI explicitly models the "gene → protein → tissue morphology" systems biology cascade as a hierarchical fusion pipeline. It utilizes graph-aware cross-attention to characterize gene-to-protein regulation and hypergraphs to link proteins to pathology patches. On 5 TCGA cohorts, it achieves an average C-index of 0.753 for survival prediction, outperforming all previous SOTA methods.

BiGMINT: Biologically-guided Hierarchical Multimodal Integration for Modeling Multiple Compound Activities in Drug Discovery

BiGMINT utilizes a three-stage hierarchical fusion—"chemoproteomics-guided high-content imaging (HCI) feature aggregation + outer product cross-modal fusion + protein-protein interaction (PPI) prior-based task-level information sharing"—to unify molecular mechanism signals and cellular phenotypic signals for compound activity prediction. On two large private datasets (~99K / ~40K compound-image pairs), it improves average AUCROC over state-of-the-art single-modal/multimodal baselines by up to 10.0% / 4.2%, with the coverage of high-performance tasks nearly doubling.

Bulk RNA-seq Guided Multi-modal Detection of Anomalous Regions in Human Cancer via Spatial Transcriptomics

BRGMAR utilizes a dynamic multi-relational graph to characterize spatial proximity and gene similarity between spots in spatial transcriptomics (ST). It transfers diagnostic information from patient-level bulk RNA-seq to ST through "gene module alignment" based on optimal transport. Combined with cross-attention fusion of pathological images, it significantly advances AUC/F1 scores for tumor anomalous region detection across BRCA, HCC, and ccRCC datasets.

CARE: A Molecular-Guided Foundation Model with Adaptive Region Modeling for Whole Slide Image Analysis

CARE is proposed as a pathology slide-level foundation model that partitions WSIs into morphologically relevant irregular regions via an Adaptive Region Generator (ARG)—analogous to word-level tokens in NLP. By combining cross-modal alignment with RNA/protein expression profiles in a two-stage pre-training paradigm, CARE achieves optimal average performance across 33 downstream tasks while using only approximately 1/10 of the data required by mainstream models.

Cell-Type Prototype-Informed Neural Network for Gene Expression Estimation from Pathology Images

This paper proposes CPNN, which leverages public single-cell RNA-seq data to construct cell-type prototypes. It models slide/patch-level gene expression as a weighted combination of these prototypes, achieving SOTA performance in gene expression estimation while providing interpretability.

Cross-Slice Knowledge Transfer via Masked Multi-Modal Heterogeneous Graph Contrastive Learning for Spatial Gene Expression Inference

SpaHGC is a multi-modal heterogeneous graph-based framework that integrates intra-target slice, cross-slice, and intra-reference slice subgraphs. Combined with masked graph contrastive learning and a cross-node dual attention mechanism, it predicts spatial gene expression from H&E pathology images, achieving a PCC improvement of 7.3%-27.1% across seven datasets.

CryoHype: Reconstructing a Thousand Cryo-EM Structures with Transformer-Based Hypernetworks

This paper proposes CryoHype, a Cryo-EM reconstruction method based on a Transformer hypernetwork that reduces parameter sharing by dynamically adjusting the weights of Implicit Neural Representations (INR). It achieves, for the first time, simultaneous reconstruction of 1000 different protein structures from unlabeled Cryo-EM images.

CryoKRAQEN: Kernel-Regularized Annealing for Quantized Embedding Networks in Cryo-EM Heterogeneous Reconstruction

CryoKRAQEN utilizes an encoder-free (decoder-only) tri-plane Fourier codebook for cryo-EM heterogeneous reconstruction. By measuring the similarity between particle images and codebook prototypes using an Epanechnikov kernel, gradually tightening soft assignments to near-hard clustering via temperature annealing, and stabilizing the codebook with triplet regularization, the method accurately assigns noisy 2D projections to different 3D conformations/components without relying on encoders or Gaussian priors. It performs on par with SOTA on CryoBench and demonstrates significantly better performance on data with strong compositional heterogeneity.
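
The interplay between the Epanechnikov kernel and temperature annealing can be sketched as below. The kernel formula is standard; turning kernel similarities into assignments with a temperature-scaled softmax is an assumption made for illustration, as are the distances, the bandwidth, and the temperature values.

```python
import math

def epanechnikov(distance, bandwidth):
    """Epanechnikov kernel: 0.75 * (1 - u^2) for |u| <= 1, else 0."""
    u = distance / bandwidth
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0

def soft_assign(distances, bandwidth, temperature):
    """Softmax over kernel similarities; lower temperature gives harder assignments."""
    sims = [epanechnikov(d, bandwidth) / temperature for d in distances]
    peak = max(sims)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in sims]
    total = sum(exps)
    return [e / total for e in exps]

# Distances from one particle image to three codebook prototypes (hypothetical).
distances = [0.2, 0.5, 0.9]
early = soft_assign(distances, bandwidth=1.0, temperature=1.0)
late = soft_assign(distances, bandwidth=1.0, temperature=0.01)
```

With the high temperature the assignment stays spread across prototypes, while the low temperature concentrates almost all mass on the nearest one, approximating hard clustering.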

cryoSENSE: Compressive Sensing Enables High-throughput Microscopy with Sparse and Generative Priors on the Protein Cryo-EM Image Manifold

This paper proposes cryoSENSE, the first computational framework for compressive imaging in cryo-EM. It demonstrates that protein cryo-EM images can be reconstructed with high fidelity from undersampled measurements using both sparse priors (DCT/Wavelet/TV) and generative priors (Diffusion Models), achieving up to 2.5× throughput gain while maintaining 3D reconstruction resolution.

Browse all 19 Computational Biology papers →


🛡️ AI Safety (143)

A Combination of Noise and Bilateral Filters Achieve Supralinear and Scalable Adversarial Robustness in CNNs

This paper theoretically demonstrates from the perspective of decision boundary geometry that "Gaussian noise" and "image filtering" defend against adversarial attacks through two complementary mechanisms. Consequently, their combination yields supralinear robustness gains. Based on this, a minimalist preprocessor (pixel-level Gaussian noise + iterative bilateral filtering, applied during both training and inference) is proposed, which approaches or even exceeds SOTA defenses on RobustBench using only ~35% of the training FLOPs and half the parameters.

A Sanity Check for Multi-In-Domain Face Forgery Detection in the Real World

This paper performs a "sanity check": it reveals that existing deepfake detectors achieve seemingly high AUC on multi-domain mixed data but suffer from low frame-level real/fake Accuracy (ACC) because "inter-domain discrepancies" overshadow "real/fake differences" in the feature space. Subsequently, it proposes a model-agnostic two-stage framework, DevDet (FFDev to expose forgery traces + DAFT for dose-adaptive fine-tuning), which significantly boosts frame-level ACC while preserving original generalization capabilities.

A Unified Perspective on Adversarial Membership Manipulation in Vision Models

This paper first reveals the vulnerability of Membership Inference Attacks (MIA) in vision models to adversarial membership manipulation. It demonstrates that imperceptible perturbations can forge non-members as members to deceive auditing. It identifies a "gradient norm collapse" signature in forged members and proposes a gradient-geometry-based detection strategy along with an adversarial robust inference framework.

AdvFM: Lookahead Flow-Matching Velocity-Field Attacks for Imperceptible and Transferable Adversarial Examples

Unrestricted adversarial attacks are implemented within the continuous-time velocity field of flow matching. Instead of perturbing pixels directly or using diffusion-based "denoising-then-re-noising," PGD perturbations on reconstructed images are translated into velocity field perturbations propagated deterministically along probability flow ODEs. A "lookahead two-point objective" corrects temporal mismatch, achieving simultaneously stronger black-box transferability and higher success rates against purification and adversarial training on ImageNet.

All Vehicles Can Lie: Efficient Adversarial Defense in Fully Untrusted-Vehicle Collaborative Perception via Pseudo-Random Bayesian Inference

This paper proposes the Pseudo-Random Bayesian Inference (PRBI) framework for collaborative perception scenarios where all vehicles are untrusted. By utilizing inter-frame temporal consistency as a self-reference signal through pseudo-random grouping and Bayesian inference, the system efficiently identifies and excludes malicious vehicles with an average of only 2.5 verifications per frame, restoring detection accuracy to 79.4%–86.9% of pre-attack levels.

AntiStyler: Defending Object Detection Models Against Adversarial Patch Attacks Using Style Removal

The authors invert style transfer into "style removal" to eliminate the "random texture style" of adversarial patches from images. By locating and masking patch pixels based on these style changes, they develop a zero-shot defense that is agnostic to models, patches, and attacks. This method avoids training and preserves clean image performance while improving adversarial mAP by 8–15 points, achieving real-time detection at 10–12 FPS (40–90ms per image).

AVFakeBench: A Comprehensive Audio-Video Forgery Detection Benchmark for AV-LMMs

AVFakeBench is the first comprehensive audiovisual forgery detection benchmark covering "Human + Generic Scenes, 7 types of AV forgery combinations, and 4 levels of annotation" (3K segments / 12K QA). Using a multi-stage hybrid forgery framework based on "proprietary model planning + expert model execution" to mass-produce fake data, the authors evaluated 11 Audiovisual Large Multimodal Models (AV-LMMs) and 2 expert detectors. The study reveals that while AV-LMMs outperform expert models in binary real/fake judgment, they nearly collapse in fine-grained forgery classification and explanatory reasoning.

Batman: Benign Knowledge Alignment Through Malicious Null Space in Federated Backdoor Attack

Addressing the dilemma in federated backdoor attacks where "aligning benign knowledge weakens the attack, while failing to align makes it easily detected by defenses," Batman utilizes SVD to compress malicious knowledge into the dominant directions of parameter matrices. It aligns benign knowledge within the orthogonal "malicious null space," significantly enhancing stealthiness while maintaining backdoor functionality. It achieves high ASR and ACC simultaneously across four datasets and six aggregation/defense mechanisms.

Beyond [CLS] Token: Query-Driven Token-Level Forgery Purification for Generalizable Deepfake Detection

Addressing the "Pre-trained Information Bias" (PIB) where the [CLS] token in ViT foundation models excessively focuses on global semantics and ignores local forgery traces during deepfake detection, this paper proposes the QTFP framework. By replacing [CLS] with a set of randomly initialized learnable query tokens to aggregate local evidence, combined with "Forgery Likelihood Weighted Contrastive Loss" and "Real-Graph Attention Alignment" regularizations, the average cross-dataset AUC is improved from 0.923 (Effort) to 0.947.

Bias at the End of the Score

This paper conducts a large-scale bias audit of five widely used reward models (PickScore, ImageReward, HPS, VQAScore, CLIP) in text-to-image (T2I) systems. It demonstrates that these scoring functions, acting as proxies for "image quality," encode systematic demographic biases. When used as noise optimizers, they disproportionately hypersexualize female subjects and "whiten" non-White subjects. Furthermore, the scores themselves correlate highly with real-world demographic distributions (such as gender ratios in occupations) rather than truly measuring quality.

Browse all 143 AI Safety papers →


📂 Others (98)

A2GC: Asymmetric Aggregation with Geometric Constraints for Locally Aggregated Descriptors

Addressing the failure of the "symmetric Sinkhorn" assumption in feature aggregation for Visual Place Recognition (VPR), A2GC reformulates the Optimal Transport solver into an asymmetric version (averaging row/column normalization + independent source/target marginal calibration) and overlays a geometric constraint branch (using learnable coordinate embeddings to bias spatially adjacent features towards the same cluster), achieving 95.6% Recall@1 on Pitts30k.

A Debiased Reconstruction-based Framework for Training-Free Detection of AI-Generated Images

To address the issue where "reconstruction-based training-free AI image detection" is biased by simple backgrounds or large latent norms, this paper proposes using augmentations like rotation + low-pass filtering—which "preserve bias factors but destroy forensic information"—to normalize reconstruction errors. By computing debiased scores at both the image and latent levels and fusing them into a unified RDD score, the method achieves training-free SOTA performance (average AUROC 0.981 / 0.940) across 18 sub-benchmarks including GenImage and LSUN-Bedroom.

A Difference-in-Difference Approach to Detecting AI-Generated Images

Addressing the limitation where first-order reconstruction errors fail as modern diffusion models generate images closer to reality, this paper performs reconstruction twice. It utilizes the "difference of reconstruction errors"—a second-order difference—to cancel out stochastic perturbations inherent in the reconstruction process and amplify weak signals between real and fake images. By combining separate classifiers for first-order and second-order errors, the method achieves a 20%–30% improvement over the strongest baselines in cross-dataset and cross-generator scenarios.
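
The second-order difference can be sketched as follows. Here `reconstruct` stands in for whatever generative reconstruction the detector uses, and the toy version below is purely hypothetical. The choice of an L1 error and the sign convention (first error minus second) are also assumptions.

```python
def l1_error(a, b):
    """Mean absolute difference between two flat pixel lists."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def did_score(image, reconstruct):
    """Difference of reconstruction errors across two successive reconstructions."""
    once = reconstruct(image)
    twice = reconstruct(once)
    first_error = l1_error(image, once)   # first-order error
    second_error = l1_error(once, twice)  # error of reconstructing the reconstruction
    return first_error - second_error

def toy_reconstruct(pixels):
    """Hypothetical stand-in: pull every pixel halfway toward mid-gray."""
    return [0.5 + 0.5 * (p - 0.5) for p in pixels]

score = did_score([0.0, 1.0], toy_reconstruct)
```

Any error component common to both passes cancels in the subtraction, which is the intuition behind using the difference rather than the first-order error alone.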

Adaptive Bayesian Early-Exit Networks for Efficient Non-Transferable Learning

ENL-DEE redesigns "Non-Transferable Learning (NTL)" as a Bayesian early-exit network. By freezing the backbone and training only several early-exit classification heads, it uses entropy-based routing to guide source domain samples to deep exits (preserving performance) and eject target domain samples at shallow exits (non-semantic features, accuracy near random). This significantly strengthens model IP protection while drastically reducing training and inference costs.

Adaptive Data Augmentation with Multi-armed Bandit: Sample-Efficient Embedding Calibration for Implicit Pattern Recognition

ADAMAB trains a lightweight "calibrator" on top of frozen pre-trained embedding models and utilizes a modified Upper Confidence Bound (UCB) algorithm to adaptively determine which data to synthesize for augmentation on a per-class basis. This approach improves accuracy by up to approximately 40% on few-shot long-tail recognition tasks with only 2–5 initial samples per class, providing theoretical guarantees for convergence.
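The bandit idea can be sketched with plain UCB1, which is a stand-in here and not the paper's modified variant. Each class is treated as an arm, and the reward is assumed to be a noisy validation gain from synthesizing data for that class. The names `pick_class` and `run_bandit` and the simulated gains are hypothetical.

```python
import math
import random


def pick_class(counts, mean_rewards, c=2.0):
    """Return the arm (class index) with the highest UCB1 score."""
    # Pull every arm once before trusting the confidence bound.
    for arm, n in enumerate(counts):
        if n == 0:
            return arm
    total = sum(counts)
    scores = [
        mean_rewards[arm] + math.sqrt(c * math.log(total) / counts[arm])
        for arm in range(len(counts))
    ]
    return max(range(len(counts)), key=scores.__getitem__)


def run_bandit(true_gains, rounds=500, seed=0):
    """Simulate augmentation rounds; the reward model is an assumption."""
    rng = random.Random(seed)
    counts = [0] * len(true_gains)
    means = [0.0] * len(true_gains)
    for _ in range(rounds):
        arm = pick_class(counts, means)
        reward = true_gains[arm] + rng.gauss(0.0, 0.05)
        counts[arm] += 1
        # Incremental update of the running mean reward for this arm.
        means[arm] += (reward - means[arm]) / counts[arm]
    return counts


# Class 1 benefits most from augmentation in this toy setup.
counts = run_bandit([0.1, 0.5, 0.2])
```

In this toy run the augmentation budget concentrates on the class with the largest simulated gain, while the other classes still receive occasional exploratory pulls.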

AdaSFormer: Adaptive Serialized Transformers for Monocular Semantic Scene Completion from Indoor Environments

This paper proposes AdaSFormer, a serialized Transformer framework for indoor Monocular Semantic Scene Completion (MSSC). By introducing three core designs—Adaptive Serialized Attention (with learnable offsets), Center Relative Position Encoding, and Convolution-Modulated Layer Normalization—it achieves SOTA performance on NYUv2 and Occ-ScanNet.

ALLNet: Multi-task Dense Prediction for Degraded Images

ALLNet dismantles the two-stage cascaded "restoration-then-prediction" pipeline. Using a dual-decoder U-Net, it enables mutual feature feeding between the restoration and prediction streams at every scale. By employing a degradation-adaptive Mixture-of-Experts (MoE) module for de-degradation and a Task Collaborative Refinement (TCR) module for bidirectional semantic alignment, it outperforms existing SOTA methods across four tasks on degraded versions of NYUD-v2 and PASCAL-Context.

Beyond Euclidean Gossip: KL-Barycentric Consensus on Heterogeneous and Imbalanced Images

Addressing the collapse of fully decentralized training under non-i.i.d. data and imbalanced client scales, this paper replaces the "Euclidean gossip" operation (averaging model parameters between neighbors) with linear mixing in the expectation parameter space of exponential families. This mixing turns out to be equivalent to a curvature-aware KL-Barycentric consensus (a natural gradient step), reducing the per-round complexity from \(O(d^3)\) to \(O(d)\) without constructing or inverting the Fisher matrix. The authors provide an implementation called KL-consensus Adam, which has nearly the same overhead as Adam and achieves approximately 20% higher accuracy than the Euclidean consensus baseline on CIFAR-100.
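A toy illustration of mixing in expectation-parameter space, using 1D Gaussians rather than the paper's KL-consensus Adam: for a Gaussian, the expectation parameters are \(E[x]\) and \(E[x^2]\), and averaging them linearly yields the Gaussian \(q\) that minimizes \(\sum_i w_i \, \mathrm{KL}(p_i \,\|\, q)\). The function names and the two-neighbor setup are hypothetical, and the paper's exact formulation may differ.

```python
def to_expectation(mu, var):
    """Expectation parameters of N(mu, var): (E[x], E[x^2])."""
    return mu, var + mu * mu


def from_expectation(m1, m2):
    """Map (E[x], E[x^2]) back to (mean, variance)."""
    return m1, m2 - m1 * m1


def kl_barycenter(gaussians, weights):
    """Mix expectation parameters linearly, then convert back."""
    moments = [to_expectation(mu, var) for mu, var in gaussians]
    m1 = sum(w * m[0] for m, w in zip(moments, weights))
    m2 = sum(w * m[1] for m, w in zip(moments, weights))
    return from_expectation(m1, m2)


def euclidean_average(gaussians, weights):
    """Baseline: average (mean, variance) directly."""
    mu = sum(w * g[0] for g, w in zip(gaussians, weights))
    var = sum(w * g[1] for g, w in zip(gaussians, weights))
    return mu, var


# Two neighbors that disagree strongly about the mean.
neighbors = [(-1.0, 1.0), (3.0, 1.0)]
weights = [0.5, 0.5]
bary = kl_barycenter(neighbors, weights)      # (1.0, 5.0)
eucl = euclidean_average(neighbors, weights)  # (1.0, 1.0)
```

Both rules agree on the mean, but the barycenter's variance also absorbs the disagreement between the neighbors' means, which the Euclidean average discards.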

Bi-directional Autoregressive Diffusion for Large Complex Motion Interpolation

ARVFI reformulates video frame interpolation from "generating all intermediate frames at once" to "generating frames autoregressively from two endpoints towards the center." By replacing optical flow with DINOv3 features as the motion representation, it significantly enhances interpolation accuracy for large complex motions (leading in FID across benchmarks) while reducing sampling to 15 steps—approximately 3x faster than its backbone, Wan.

Bias In, Bias Out? Finding Unbiased Subnetworks in Vanilla Models

BISE argues that a biased model trained normally (vanilla) on biased data already contains a relatively unbiased subnetwork. By freezing the original parameters and learning a set of structured pruning masks, combined with "reweighted cross-entropy + biased mutual information regularization" to prune neurons relying on shortcut features, this subnetwork can be extracted without retraining or additional unbiased datasets. Performance is on par with SOTA debiasing methods and can exceed them after fine-tuning, while the pruned model is smaller and faster.

Browse all 98 Others papers →


🗂 More Areas (33)


👥 Multi-Agent (2)

AgentDet: A Shared-Blackboard Multi-Agent Framework for Zero-/Few-Shot Object Detection

AgentDet decomposes zero-/few-shot object detection into four LLM agents: Scout, Pinner, Curator, and Judge. These agents collaborate via a "Shared Blackboard" and a patch-level "Knowledge Base" (KB). The framework fragments visual evidence into the KB, assembles the fragments into holistic textual clues for LLM-based box prediction, and trains only the Judge agent. It achieves competitive results on PASCAL VOC and COCO for both ZSOD and FSOD tasks.

Visual Document Understanding and Reasoning: A Multi-Agent Collaboration Framework with Agent-Wise Adaptive Test-Time Scaling

MACT decomposes the "monolithic single-model" visual document QA into four agents with distinct roles: planning, execution, judging, and answering. It adaptively allocates test-time compute according to the cognitive load of each agent rather than uniformly increasing parameters. On 15 benchmarks, it consistently ranks in the top three with <30B parameters, achieving an average improvement of 9.9–11.5% over the base models.


✏️ Knowledge Editing (2)

Attribution-Guided Model Rectification of Unreliable Neural Network Behaviors

This paper proposes an attribution-guided dynamic model rectification framework that repositions rank-one model editing from domain adaptation to behavior rectification. By quantifying layer editability via Integrated Gradients to automatically locate suspect layers, it repairs three types of unreliable behaviors—backdoor attacks, spurious correlations, and feature leakage—using only a single clean sample.

SAME: Sparse and Anchored Model Editing for Heterogeneous Incremental Learning under Limited Data

This work adapts the "locate-then-edit FFN key-value pairs" paradigm from Large Language Models (LLMs) to Vision-Language Models (VLMs) like CLIP. Under a newly proposed "Heterogeneous Incremental Learning (HIL)" setting—characterized by no task identities, cross-domain shifts, and few-shot data—the authors propose sparse fine-tuning, dual-anchor constraints, and closed-form solutions to directly "write" new task knowledge into the FFN output projection matrices. The method requires no additional parameters, achieves 6.8% higher average accuracy than existing continual learning methods, and retains 95.8% of oracle performance.


💬 LLM (Other) (2)

LLM-Guided Probabilistic Fusion for Label-Efficient Document Layout Analysis

This paper integrates text-pretrained LLMs as "structural prior generators" into the pseudo-label refinement stage of semi-supervised layout detection. By using OCR+LLM to infer document hierarchical regions and performing inverse variance probabilistic fusion (including learnable instance-adaptive gating) with teacher detector outputs, the method achieves 88.2 AP (lightweight backbone) and 89.7 AP (LayoutLMv3) on PubLayNet using only 5% labels, with the most significant gains observed in rare layout elements such as titles and headers.
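Inverse-variance fusion itself is a standard estimator and can be sketched per box coordinate as follows. The box values and variances are made up for illustration, and the learnable instance-adaptive gating mentioned in the summary is omitted; `fuse` and `fuse_boxes` are hypothetical names.

```python
def fuse(estimates):
    """Inverse-variance fusion of (value, variance) pairs for one coordinate."""
    precision = sum(1.0 / var for _, var in estimates)
    value = sum(val / var for val, var in estimates) / precision
    # The fused variance is never larger than the smallest input variance.
    return value, 1.0 / precision


def fuse_boxes(box_a, var_a, box_b, var_b):
    """Fuse two [x1, y1, x2, y2] boxes coordinate by coordinate."""
    return [
        fuse([(a, va), (b, vb)])[0]
        for a, va, b, vb in zip(box_a, var_a, box_b, var_b)
    ]


# Assumed inputs: the teacher detector is more certain than the LLM prior.
teacher_box = [10.0, 20.0, 110.0, 220.0]
llm_box = [14.0, 20.0, 118.0, 220.0]
fused = fuse_boxes(teacher_box, [1.0] * 4, llm_box, [3.0] * 4)
```

With a teacher variance of 1 and an LLM variance of 3, each fused coordinate sits three quarters of the way toward the teacher's value, so the less certain source only nudges the pseudo-label.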

OmniDocLayout: Towards Diverse Document Layout Generation via Coarse-to-Fine LLM Learning

Addressing the limitation that existing document layout generation data are "academic-only with single styles," the authors first create OmniDocLayout-1M, the first million-scale diverse layout dataset covering six document categories. They then train a small 0.5B LLM with a "coarse-to-fine" paradigm: it first learns general layout rules from multi-domain coarse labels, then adapts to specific domains with a few fine labels. This approach outperforms both specialized layout models and general large models such as GPT-4o/Gemini/Claude on M6Doc.


🗣️ Dialogue Systems (1)

Evolutionary Multimodal Reasoning via Hierarchical Semantic Representation for Intent Recognition

This paper proposes HIER, which combines hierarchical semantic representations (token→concept→relation) with a self-evolutionary reasoning mechanism based on MLLM feedback. It consistently outperforms SOTA methods and leading MLLMs (1–3% gain) on three multimodal intent recognition benchmarks.


🔍 Information Retrieval & RAG (9)

CC-VQA: Conflict- and Correlation-Aware Method for Mitigating Knowledge Conflict in Knowledge-Based Visual Question Answering

This paper proposes CC-VQA, a training-free method for mitigating knowledge conflict. Through a two-stage strategy involving visual-centric context conflict reasoning and correlation-guided encoding/decoding, it achieves an absolute accuracy gain of 3.3%–6.4% across E-VQA, InfoSeek, and OK-VQA benchmarks.

Explaining CLIP Zero-shot Predictions Through Concepts

This paper proposes EZPC, which maps CLIP image-text embeddings into an interpretable concept space by learning a linear projection matrix. While maintaining almost no loss in zero-shot classification accuracy (H-mean gap of only ~1% on CIFAR-100/CUB/ImageNet-100), it provides faithful explanations based on human-understandable concepts for CLIP predictions with a negligible inference overhead increase of about 0.1ms.

Language-driven Fine-grained Retrieval

LaFG replaces the semantically sparse one-hot category name supervision in Fine-Grained Image Retrieval (FGIR) with "attribute-level language prototypes." It leverages an LLM to expand category names into attribute descriptions, uses a frozen VLM to encode and cluster these into a dataset-level attribute vocabulary, and aggregates the Top-K attributes per category into prototypes to supervise the retrieval model. This establishes comparability across inter-class details, achieving SOTA results on CUB / Cars / SOP while significantly improving generalization to unseen classes.

M4-RAG: A Massive-Scale Multilingual Multi-Cultural Multimodal RAG

This paper proposes M4-RAG, the first large-scale multilingual, multicultural, and multimodal RAG evaluation framework. Covering 42 languages and 80K+ cultural VQA instances from 189 countries, it systematically reveals that RAG is effective for small models but fails to scale positively with model size, and that performance degrades severely in cross-lingual retrieval.

Mask to Align, Weight to Disambiguate: Reliable Unsupervised Cross-Modal Hashing with Masked-Weight Contrast

To address the "partial alignment + semantic ambiguity" issues in unsupervised cross-modal hashing, UWMCH performs token masking before fusion to force the model to learn complementary semantics. It then uses semantic affinity to re-weight contrastive losses to suppress false negatives, supplemented by dual-scale semantic regularization to stabilize the hashing space. It achieves the best mAP in 21 out of 24 settings across three retrieval benchmarks.

MuCo: Multi-turn Contrastive Learning for Multimodal Embedding Model

MuCo proposes a multi-turn contrastive learning framework that leverages the conversational capabilities of MLLMs to process multiple associated query-target pairs in a single forward pass. This significantly improves training efficiency and achieves SOTA performance on MMEB and M-BEIR retrieval benchmarks.

POGA: Paraphrased and Oppositional Graph Alignment for Fine-Grained Cross-Modal Retrieval

POGA parses both images and text into structured scene graphs, utilizes LLMs to automatically generate "paraphrased positive samples + counterfactual negative samples" along with their difference information, and trains with a composite loss across four granularities—global, node, relation, and focus. This allows the model to both recognize object attributes and reject "semantically similar but factually incorrect" descriptions in fine-grained long-text retrieval.

ProM3E: Probabilistic Masked MultiModal Embedding Model for Ecology

ProM3E utilizes an "align-then-fuse" two-stage framework to train a Masked Variational Autoencoder (MVAE) within the embedding space. By inferring Gaussian distribution representations of missing modalities from a small subset of visible modalities, it supports any-to-any modality generation, modality inversion retrieval, and uncertainty analysis regarding "which modalities to fuse." It comprehensively outperforms TaxaBind on ecological multimodal tasks.

RobustVisRAG: Causality-Aware Vision-Based Retrieval-Augmented Generation under Visual Degradations

RobustVisRAG is a causality-guided dual-path framework that decouples semantic-degradation entanglement in VisRAG by capturing signals through a non-causal path while learning pure semantics via a causal path. It achieves performance gains of 7.35%, 6.35%, and 12.40% in retrieval, generation, and end-to-end tasks under real-world degradation, respectively, while maintaining performance on clean data.


💻 Code Intelligence (1)

GeoTikzBridge: Advancing Multimodal Code Generation for Geometric Perception and Reasoning

GeoTikzBridge constructs the largest 2.5M image-TikZ code dataset and the first auxiliary line instruction dataset. It trains a code generation model capable of precise geometric reconstruction, which serves as a plug-and-play module to enhance the geometric reasoning capabilities of any MLLM/LLM.


🔗 Causal Inference (4)

A Polynomial Chaos Framework for Causal Discovery in Nonlinear Uncertain Systems

This paper embeds noise terms into structural equations using Polynomial Chaos Expansion (PCE) to develop PCE-LiNGAM. It proves that causal Directed Acyclic Graphs (DAGs) are uniquely identifiable under mild sparsity conditions. Using a polynomial-time algorithm involving "PCE signature contamination testing + recursive sink finding," the method improves average F1 scores from 0.50 to 0.756 on extreme non-Gaussian industrial data while providing uncertainty quantification based on Sobol indices.

CGU-Bayes: Causal Graph Uncertainty-Guided Bayesian Inference for Domain Generalization

Addressing the issue that "causal graphs are inaccurately estimated under data scarcity or noise when using Structural Causal Models (SCM) for domain generalization," this paper moves away from point-estimating a single causal graph. Instead, it performs Bayesian inference on the causal graph posterior, selects a set of Causal Markov Blanket (CMB) features from each sampled graph to train predictors, and performs a weighted ensemble using the "alignment uncertainty" between each graph and the test samples. This approach achieves SOTA performance on datasets with strong distribution shifts, such as BLT and CMNIST.

MaskDiME: Adaptive Masked Diffusion for Precise and Efficient Visual Counterfactual Explanations

This paper proposes MaskDiME, a training-free diffusion framework that transforms global classifier guidance into decision-driven local editing via an adaptive dual-masking mechanism. It achieves precise and efficient visual counterfactual explanations, with inference over 30 times faster than DiME and GPU memory consumption only one-tenth that of ACE/RCSB.

Retrieving Counterfactuals Improves Visual In-Context Learning

This paper proposes CIRCLES, a framework that retrieves counterfactual examples through attribute-guided composed image retrieval and constructs a dual-channel in-context demonstration of "causality + correlation," significantly enhancing the fine-grained visual reasoning capabilities of VLMs.


🩺 Medical LLM (1)

Towards Efficient Medical Reasoning with Minimal Fine-Tuning Data

This paper proposes the Difficulty-Influence Quadrant (DIQ) data selection strategy, which jointly considers sample difficulty and gradient influence. This approach allows a VLM's language backbone to match full SFT performance using only 1% of curated data and exceed full-dataset training with 10% of the data.


⚛️ Physics & Scientific Computing (2)

AviaSafe: A Physics-Informed Data-Driven Model for Aviation Safety-Critical Cloud Forecasts

AviaSafe embeds the "localization before quantification" hierarchical strategy and the long-validated "Icing Condition (IC) index" into a Swin Transformer backbone. It achieves the first global, 6-hourly, phase-separable (ice/liquid/rain/snow) cloud microphysics forecast, outperforming the FuXi baseline on 93.7% of variable-lead time combinations and matching or exceeding the operational NWP ECMWF HRES on key background variables up to a 7-day lead time.

Spatial-Spectral Residuals Informed Diffusion Neural Operator for Pan-sharpening

SRINO replaces the attention-based denoising backbone of diffusion models for pan-sharpening with a Galerkin-type Neural Operator (transferring the generation process to a continuous function space to significantly save FLOPs and memory). It treats pixel-level spatial/spectral consistency residuals directly as conditions fed into each step of the reverse sampling process for closed-loop guidance. On WV3/GF2/QB datasets, it outperforms current SOTA methods while being several times more computationally efficient than attention-based diffusion models.


🧮 Scientific Computing (3)

Continuous Exposure-Time Modeling for Realistic Atmospheric Turbulence Synthesis

This paper proposes the Exposure-Time dependent Modulation Transfer Function (ET-MTF), which models exposure time as a continuous variable. The authors also construct ET-Turb, a large-scale synthetic turbulence dataset (5,083 videos, 2 million frames), which significantly improves the generalization of turbulence restoration models on real-world data.

EHETM: High-Quality and Efficient Turbulence Mitigation with Events

EHETM is the first method to leverage the microsecond temporal resolution of event cameras to break the accuracy-efficiency bottleneck of traditional multi-frame Turbulence Mitigation (TM). Building on two key physical observations (the polarity alternation of turbulence-induced events correlates with sharp gradients, and dynamic objects form spatio-temporally coherent "event tubes"), the authors design the Polarity-Weighted Gradient and Event Tube Constraint modules. EHETM reduces data overhead by 77.3% and system latency by 89.5%, significantly surpassing SOTA methods, especially in dynamic scenes.

NESTOR: A Nested MOE-based Neural Operator for Large-Scale PDE Pre-Training

This paper proposes NESTOR, a nested MoE neural operator. It captures global features of different PDE types through image-level MoE and local correlations within physical fields through token-level Sub-MoE. It achieves large-scale pre-training across 12 PDE datasets and effectively transfers to downstream tasks.


🌍 Earth Science (1)

SIGMA: A Physics-Based Benchmark for Gas Chimney Understanding in Seismic Images

This work proposes SIGMA, the first physics-based synthetic seismic image dataset with ground truth labels. By combining wave equation forward modeling and Reverse Time Migration (RTM), velocity models containing gas chimneys are converted into seismic images. The dataset provides pixel-level gas chimney masks (for detection) and paired "degraded-clean" images (for enhancement). Benchmarking multiple baselines reveals that existing methods collectively struggle on this data.


📡 Signal & Communications (2)

AcTTA: Rethinking Test-Time Adaptation via Dynamic Activation

This paper proposes AcTTA, a test-time adaptation framework based on dynamic activation function modulation. By reparameterizing traditional fixed activation functions into a learnable form—incorporating activation center shifts and asymmetric gradient slopes—the model adaptively adjusts activation behavior during inference to handle distribution shifts. AcTTA consistently outperforms normalization-based TTA methods on CIFAR10-C, CIFAR100-C, and ImageNet-C.
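One plausible reading of "center shift plus asymmetric slopes" is a piecewise-linear activation with three scalars per unit. The parameterization below is an assumption made for illustration and is not taken from the paper; in a real test-time adaptation loop these scalars would be updated by gradient steps on unlabeled test batches rather than set by hand.

```python
def shifted_activation(x, center=0.0, pos_slope=1.0, neg_slope=0.0):
    """Piecewise-linear activation with a movable kink and two slopes.

    With the defaults this reduces to ReLU; the three parameters are the
    (hypothetical) quantities that would be adapted at test time.
    """
    z = x - center
    return pos_slope * z if z > 0 else neg_slope * z


# Default parameters behave like ReLU.
relu_like = [shifted_activation(v) for v in (-2.0, 0.0, 3.0)]

# Assumed "adapted" parameters: kink moved to 1.0, leaky negative side.
adapted = [
    shifted_activation(v, center=1.0, pos_slope=2.0, neg_slope=0.1)
    for v in (0.0, 2.0)
]
```

Because the defaults recover ReLU, such a layer could start from the pretrained behavior and move away from it only as far as the shifted data requires.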

CLAY: Conditional Visual Similarity Modulation in Vision-Language Embedding Space

CLAY proposes a training-free method for conditional visual similarity calculation, which modulates similarity by constructing a text-conditioned subspace within the VLM embedding space. This approach adapts to different retrieval conditions without recomputing database features and supports multi-condition retrieval.
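A minimal sketch of similarity restricted to a text-conditioned subspace, assuming the subspace is simply the span of a few condition text embeddings (orthonormalized with Gram-Schmidt). The 4-D toy vectors, the function names, and this construction are illustrative assumptions; the paper's actual subspace construction is not described in this summary.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    nu, nv = math.sqrt(dot(u, u)), math.sqrt(dot(v, v))
    return 0.0 if nu == 0.0 or nv == 0.0 else dot(u, v) / (nu * nv)


def orthonormal_basis(vectors, eps=1e-9):
    """Gram-Schmidt over the condition text embeddings."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            coeff = dot(w, b)
            w = [wi - coeff * bi for wi, bi in zip(w, b)]
        norm = math.sqrt(dot(w, w))
        if norm > eps:  # skip vectors already in the span
            basis.append([wi / norm for wi in w])
    return basis


def conditioned_similarity(x, y, basis):
    """Cosine similarity of two stored embeddings inside the subspace."""
    px = [dot(x, b) for b in basis]
    py = [dot(y, b) for b in basis]
    return cosine(px, py)


# Toy setup: dims 0-1 carry the conditioned attribute, dims 2-3 everything else.
text_embeddings = [[1.0, 0.0, 0.0, 0.0], [1.0, 1.0, 0.0, 0.0]]
basis = orthonormal_basis(text_embeddings)

x = [1.0, 0.0, 5.0, 0.0]
y = [1.0, 0.0, 0.0, 5.0]  # same attribute as x, different elsewhere
z = [0.0, 1.0, 5.0, 0.0]  # different attribute, similar elsewhere

same_attr = conditioned_similarity(x, y, basis)
diff_attr = conditioned_similarity(x, z, basis)
```

Here `x` and `y` have low plain cosine similarity but match fully under the condition, while `x` and `z` show the reverse. The stored embeddings are only projected, never re-encoded, which mirrors the summary's point about not recomputing database features.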


👥 Social Computing (3)

Bridging Pixels and Words: Mask-Aware Local Semantic Fusion for Multimodal Media Verification

This paper proposes MaLSF, a framework that uses mask-label pairs as semantic anchors and performs active local semantic conflict detection through Bidirectional Cross-modal Verification (BCV) and Hierarchical Semantic Aggregation (HSA) modules. It achieves SOTA on DGM4 and fake news detection tasks.

Instance-level Visual Active Tracking with Occlusion-Aware Planning

OA-VAT constructs discriminative "instance prototypes" offline from a single reference image to resist similar distractors. It utilizes online EMA-enhanced prototypes and confidence-adaptive Kalman filtering to maintain stable tracking, while training a target-box-conditioned diffusion trajectory planner to actively bypass obstacles and recover the target upon occlusion—achieving an average SR of 0.93 on UnrealCV, 90.8% average CAR on real images, and 81.6% TSR on real UAVs, reaching 35 FPS on an RTX 3090.

Revisiting Unknowns: Towards Effective and Efficient Open-Set Active Learning

This paper proposes E2OAL, a detector-free open-set active learning framework. It discovers latent structures of unknown classes via label-guided clustering, jointly models known and unknown categories using a Dirichlet-calibrated auxiliary head, and uses a two-stage adaptive querying strategy, simultaneously achieving high accuracy, high query purity, and high training efficiency across multiple benchmarks.