
📷 CVPR2025 Accepted Papers

1819 CVPR2025 paper notes covering 3D Vision (364), Image Generation (305), Multimodal VLM (136), Segmentation (94), Autonomous Driving (89), Video Generation (85), Medical Imaging (78), Human Understanding (73), and 47 other areas. Each note has a TL;DR, motivation, method, experiments, highlights, and limitations: a 5-minute read of the core ideas.

💡 LLM Reasoning (7)

Argus: Vision-Centric Reasoning with Grounded Chain-of-Thought: Argus proposes a grounded visual CoT mechanism that enables explicit target-oriented visual attention by first making the MLLM predict a question-related bounding box (RoI), and then resampling/re-encoding the visual tokens of that region as reasoning context, achieving dual SOTA in visual reasoning and object grounding among 7B/8B-scale MLLMs.
Enhancing Video-LLM Reasoning via Agent-of-Thoughts Distillation: AoTD uses an LLM agent to decompose complex video questions into subtasks, invokes expert vision models to execute them, and collects intermediate results as a Chain-of-Thought (CoT). After quality filtering using an LLM, the CoT is distilled into a Video-LLM, enabling the end-to-end model to achieve both accurate answers and interpretable multi-step reasoning capabilities.
Interleaved-Modal Chain-of-Thought: Proposes Interleaved-Modal Chain-of-Thought (ICoT), which interleaves image region crops as visual rationales within reasoning steps. By using a parameter-free Attention-driven Selection (ADS) to intelligently select and insert key regions from the input image into the generated sequence, it achieves up to a 14% improvement over existing multimodal CoTs on Chameleon and Qwen2-VL.
Learning-enabled Polynomial Lyapunov Function Synthesis via High-Accuracy Counterexample-Guided Framework: This paper proposes a learning-enabled polynomial Lyapunov function synthesis method which combines learning and verification. It uses data-driven machine learning to guide the selection of polynomial forms and iteratively optimizes them through a high-accuracy counterexample-guided framework, striking a balance between flexibility and mathematical rigor.
Reason-before-Retrieve: One-Stage Reflective Chain-of-Thoughts for Training-Free Zero-Shot Composed Image Retrieval: This paper proposes OSrCIR, a training-free one-stage zero-shot composed image retrieval method. It utilizes multimodal large language models to directly process the reference image and modification text, and accurately understands the user's implicit intent through reflective Chain-of-Thought reasoning, outperforming existing training-free methods by 1.80% to 6.44% across multiple benchmarks.
Style Evolving along Chain-of-Thought for Unknown-Domain Object Detection: This paper proposes a Chain-of-Thought Guided Style Evolution (CGSE) method. By generating three-level progressive style descriptions (word $\rightarrow$ phrase $\rightarrow$ sentence), combined with feature disentanglement and class-specific prototype clustering, CGSE achieves significant performance improvements in domain generalization object detection on five adverse weather scenarios and the Real-to-Art benchmark.
VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection: VideoEspresso constructs a large-scale video CoT reasoning dataset of over 200k samples (containing spatial bounding box and temporal grounding annotations). It also proposes a hybrid framework, VideoQA-SC, which employs a lightweight 1.5B model to select an average of 2.36 core frames, followed by an 8B reasoning model performing two-stage evidence extraction and answer generation. With only 1.8% of the frames and 14.7% of the computation, it outperforms GPT-4o and all open-source LVLMs.

🦾 LLM Agent (9)

ATA: Adaptive Transformation Agent for Text-Guided Subject-Position Variable Background Generation: Proposes the ATA (Adaptive Transformation Agent) framework to achieve precise control over subject position and pose in text-guided background generation, dynamically adjusting the subject's placement in the background via an adaptive transformation module while balancing visual consistency and semantic plausibility.
Feature4X: Bridging Any Monocular Video to 4D Agentic AI with Versatile Gaussian Feature Fields: This paper proposes Feature4X, a versatile framework that distills the functionalities of various 2D visual foundation models (e.g., SAM2, InternVideo2) from arbitrary monocular videos into a unified 4D Gaussian feature field via a dynamic optimization strategy. This work represents the first attempt to lift video foundation models to 4D features based on Gaussian Splatting, supporting segment anything from novel views, geometric/appearance editing, and free-form VQA.
GUI-Xplore: Empowering Generalizable GUI Agents with One Exploration: Proposes the GUI-Xplore dataset (312 applications, 32K+ QA pairs, 5-level tasks) and the Xplore-Agent framework (Action-aware GUI Modeling + GUI Transition Graph Reasoning). By simulating the human strategy of "exploring before reasoning", it improves StepSR by approximately 10% on unfamiliar applications compared to state-of-the-art GUI agents.
RL-RC-DoT: A Block-level RL Agent for Task-Aware Video Compression: Proposes RL-RC-DoT, a reinforcement learning-based macroblock-level quantization parameter (QP) control agent for task-aware video compression. By modeling QP selection as a sequential decision-making problem in RL, the agent learns to allocate more bitrate to task-relevant regions under given bitrate constraints, significantly improving performance on vehicle detection and ROI saliency coding tasks. A key advantage is that it does not require running downstream task models during inference, making it suitable for edge device deployment.
SceneAssistant: A Visual Feedback Agent for Open-Vocabulary 3D Scene Generation: Proposes SceneAssistant, a closed-loop agentic framework based on visual feedback. By designing a fully functioning suite of Action APIs (13 atomic operations spanning object search, deletion, 6DoF spatial operations, and camera control) for VLMs, this approach enables iterative, open-vocabulary 3D scene generation using the ReAct paradigm. It significantly outperforms Holodeck and SceneWeaver in both indoor (preference rate of 61.25%) and open-domain (preference rate of 65.00%) scenarios.
Sketchtopia: A Dataset and Foundational Agents for Benchmarking Asynchronous Multimodal Communication with Iconic Feedback: Proposes the Sketchtopia large-scale dataset (20K+ game sessions, 263K sketches, 916 players) and a three-component Agent framework (ActionDecider + DRAWBOT + GUESSBOT) to study asynchronous, goal-driven multimodal collaborative communication in Pictionary scenarios, introducing three new evaluation metrics: AAO, FRS, and MATS.
SpiritSight Agent: Advanced GUI Agent with One Look: This paper proposes SpiritSight, a vision-based end-to-end GUI agent, which resolves grounding ambiguity under dynamic high-resolution inputs through a multi-tier dataset of 5.73 million samples named GUI-Lasagne and the Universal Block Parsing (UBP) method. On Multimodal-Mind2Web under the non-candidate element setup, SpiritSight-8B achieves a Step Success Rate (SR) of 52.7%, outperforming all vision, language, and hybrid methods.
TANGO: Training-free Embodied AI Agents for Open-world Tasks: This paper proposes TANGO, which orchestrates two minimal navigation foundation primitives (PointGoal Navigation + Memory-based Exploration) through the program generation capability of LLMs. Without any task-specific training and using only few-shot examples, TANGO achieves state-of-the-art (SOTA) results across three distinct embodied AI tasks: Open-Set ObjectGoal Navigation, Multi-Modal Lifelong Navigation, and Open Embodied QA, demonstrating the generalizability of the "minimal primitive set + LLM composition" paradigm.
Visual Agentic AI for Spatial Reasoning with a Dynamic API: This paper proposes VADAR, an agentic program synthesis approach for 3D spatial reasoning. Multiple LLM agents collaborate to generate Pythonic APIs and dynamically extend new functions to solve common subproblems during the solving process, overcoming the limitations of prior methods like VisProg/ViperGPT that rely on static, human-defined APIs. At the same time, it introduces a new benchmark involving multi-step spatial localization and reasoning, outperforming existing zero-shot methods on 3D understanding tasks.

⚖️ Alignment & RLHF (5)

Bases of Steerable Kernels for Equivariant CNNs: From 2D Rotations to the Lorentz Group: Proposes an alternative method to solve the constraint equations of steerable equivariant CNN kernels. By solving simpler invariance conditions at a fixed point and then "steering" to arbitrary points, this approach bypasses the need for computing Clebsch-Gordan coefficients, providing explicit kernel basis formulas for SO(2), O(2), SO(3), O(3), and the Lorentz group.
CAD-Llama: Leveraging Large Language Models for Computer-Aided Design Parametric 3D Model Generation: This paper proposes the CAD-Llama framework, which converts 3D CAD models into Python-style code rich in semantic descriptions (SPCC) via a hierarchical annotation pipeline. It then utilizes adaptive pre-training and instruction tuning to transform LLaMA3-8B into a parametric CAD model generator. This approach outperforms previous methods by approximately 14% in accuracy on the text-to-CAD task, while supporting various CAD editing tasks such as completion, addition, and deletion.
Continual SFT Matches Multimodal RLHF with Negative Supervision: Through gradient analysis, it is discovered that the core advantage of multimodal RLHF over continual SFT lies in the negative supervision signal within the rejected responses. Based on this, the nSFT method is proposed, which uses an LLM to extract error information from rejected responses and construct corrective dialogue data. It matches or even outperforms RLHF methods like DPO/PPO using only SFT loss, requiring only one model and significantly improving GPU memory efficiency.
Do We Really Need Curated Malicious Data for Safety Alignment in Multi-Modal LLMs?: This work investigates whether safety alignment in multimodal large language models genuinely requires meticulously curated malicious data. It demonstrates that effective safety alignment can be achieved by leveraging existing benign data combined with simple safety fine-tuning strategies, thereby significantly reducing the data curation cost associated with safety alignment.
Jailbreaking the Non-Transferable Barrier via Test-Time Data Disguising: This paper proposes JailNTL, the first black-box attack method against Non-Transferable Learning (NTL) models. By utilizing test-time data disguising to transform unauthorized domain data into the style of the authorized domain, it improves unauthorized domain accuracy by up to 55.7% using only 1% authorized samples, without requiring any model modifications.

🔒 LLM Safety (14)

A Closed-Form Solution for Debiasing Vision-Language Models with Utility Guarantees Across Modalities and Tasks: Proposes a training-free, data-free debiasing method for VLMs. By deriving closed-form solutions in a cross-modal space, it achieves Pareto-optimal trade-offs between fairness and utility retention, consistently outperforming existing approaches across three downstream tasks: zero-shot classification, text-to-image retrieval, and text-to-image generation.
Dual Consolidation for Pre-Trained Model-Based Domain-Incremental Learning: This paper proposes the Duct method, which addresses exemplar-free domain-incremental learning on pre-trained models. Duct employs representation consolidation (accumulating task vectors to build a unified embedding space) and classifier consolidation (utilizing category semantic information via optimal transport to estimate the weights of classifiers for old domains). It outperforms state-of-the-art methods by 1% to 7% across four benchmark datasets.
LLM4SVG: Empowering LLMs to Understand and Generate Complex Vector Graphics: This paper proposes the LLM4SVG framework, which enables open-source LLMs (such as GPT-2, Phi-2, and Falcon) to understand and generate high-quality, complex vector graphics. This is achieved by defining 55 learnable SVG semantic tokens to replace raw XML tags and conducting a two-stage instruction fine-tuning process on the SVGX-SFT dataset, which contains 250K high-quality SVGs and 580K instruction-following pairs. The GPT-2 XL-based model achieves an FID of 64.11 and a CLIPScore of 0.3496, significantly outperforming GPT-4o (127.78 FID) and all existing SVG generation methods.
ForensicZip: More Tokens are Better but Not Necessary in Forensic Vision-Language Models: Having discovered that semantic-driven visual token pruning discards forensic evidence (as tampering traces reside in low-saliency regions), this work proposes ForensicZip. It utilizes Birth-Death optimal transport to quantify physical inter-frame discontinuities and incorporates a high-frequency prior to preserve forensic signals. ForensicZip achieves 2.97x acceleration and 90%+ FLOPs reduction at a 10% token retention rate with no performance degradation.
Hyperbolic Safety-Aware Vision-Language Models: HySAC proposes constructing safety-aware vision-language models in hyperbolic space. By mapping safe and unsafe content to different regions of hyperbolic space via entailment cones (safe content near the origin, unsafe content far from the origin), the model is equipped with safe content classification and dynamic redirection capabilities, significantly outperforming existing unlearning methods in retrieval safety and NSFW detection.
LoTUS: Large-Scale Machine Unlearning with a Taste of Uncertainty: This paper proposes LoTUS, which utilizes logit temperature adjustment and Gumbel-Softmax to smooth predictions of forgotten samples. By dynamically scheduling the temperature, it converges to the target where "forget set accuracy equals unseen set accuracy." This enables efficient unlearning on the large-scale ImageNet-1K benchmark (Avg Gap of 0.0150 on ViT). Furthermore, it introduces RF-JSD, a retraining-free evaluation metric (achieving a Pearson correlation of 0.92 with the true JSD).
Low-Rank Adaptation in Multilinear Operator Networks for Security-Preserving Incremental Learning: To address the catastrophic forgetting problem of multilinear operator networks in Leveled Fully Homomorphic Encryption (Leveled FHE) scenarios, an incremental learning method combining Low-Rank Adaptation (LoRA) and Gradient Projection Memory (GPM) mechanisms is proposed to achieve continual learning while preserving data security.
MP-GUI: Modality Perception with MLLMs for GUI Understanding: MP-GUI designs three specialized perceivers to extract graphical, textual, and spatial modal information from GUIs. By combining these three modalities through a spatial structure refinement strategy and an adaptive fusion gate, it outperforms general MLLMs on various GUI understanding tasks under limited training data.
Neural Gate: Mitigating Privacy Risks in LVLMs via Neuron-Level Gradient Gating: Neural Gate discovers that privacy-related neurons in LVLMs exhibit strong cross-sample inconsistency—only about 10% of neurons consistently encode privacy signals. Based on this finding, a neuron-level gradient gating editing method is proposed: applying gradient updates only to strongly consistent privacy neurons, which improves Safety EtA from 0.48 to 0.89 on MiniGPT while maintaining Utility.
Protecting Your Video Content: Disrupting Automated Video-Based LLM Annotations: This paper proposes two types of adversarial video watermarking methods—Ramblings (which induce video LLMs to generate incorrect descriptions) and Mutes (which induce video LLMs to generate extremely short or empty descriptions)—to protect personal videos from unauthorized automated annotation via imperceptible adversarial perturbations. It also demonstrates that these low-quality annotations degrade the performance of downstream text-to-video generation models.
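
The temperature-based smoothing mentioned in the LoTUS entry above can be illustrated with a minimal sketch. It shows only the generic effect of dividing logits by a temperature before the softmax. The logit values and the temperatures are hypothetical, and neither the Gumbel-Softmax step nor the paper's dynamic temperature schedule is reproduced here.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Softmax of logits / temperature, shifted by the max for stability."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; higher means a flatter distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


# Hypothetical logits for one sample the model should forget.
logits = [4.0, 1.0, 0.5]

sharp = softmax_with_temperature(logits, 1.0)   # confident prediction
smooth = softmax_with_temperature(logits, 5.0)  # smoothed prediction

print("T=1:", [round(p, 3) for p in sharp])
print("T=5:", [round(p, 3) for p in smooth])
```

Raising the temperature lowers the top-class probability and raises the entropy, which is the sense in which the prediction becomes less certain.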

Browse all 14 LLM Safety papers →

👻 Hallucination Detection (9)

3D-GRAND: A Million-Scale Dataset for 3D-LLMs with Better Grounding and Less Hallucination: This work constructs 3D-GRAND, the first million-scale densely grounded 3D scene-language dataset (40K scenes, 6.2M instructions), and proposes the 3D-POPE hallucination evaluation benchmark. It demonstrates that densely grounded instruction tuning significantly enhances the grounding capability of 3D-LLMs and reduces hallucinations, while also showcasing effective sim-to-real transfer.
Antidote: A Unified Framework for Mitigating LVLM Hallucinations in Counterfactual Presupposition and Object Perception: This paper proposes Antidote—a unified, synthetic data-driven post-training framework that enables model self-correction by injecting factual priors into prompts, decoupling hallucination mitigation as a preference optimization problem. It improves CP-Bench by over 50% on the LLaVA series, increases POPE by 1.8-3.3%, and reduces CHAIR/SHR by 30-50% without suffering from catastrophic forgetting.
HalLoc: Token-Level Localization of Hallucinations for Vision Language Models: This work proposes HalLoc, a token-level hallucination annotation dataset with 155K samples covering three categories of tasks: VQA, instruction following, and image captioning. Based on this, a lightweight hallucination detection model named HalLocalizer is trained, which can be integrated into existing VLMs in a plug-and-play manner to achieve real-time probabilistic hallucination detection without sacrificing efficiency.
Octopus: Alleviating Hallucination via Dynamic Contrastive Decoding: Through extensive experiments, this paper reveals the hybrid nature of hallucination causes in LVLMs: different samples and different generation steps face different types of hallucination challenges. Consequently, the Octopus framework is proposed, which utilizes a learnable decision token and a transformer block to adaptively select the most appropriate contrastive decoding (CD) strategy at each generation step. Optimized via DPO, Octopus outperforms existing CD methods across four benchmarks.
ODE: Open-Set Evaluation of Hallucinations in Multimodal Large Language Models: This paper proposes the ODE (Open-set Dynamic Evaluation) protocol, which models real-world object concepts and their distribution associations using a graph structure. It dynamically extracts concept combinations and generates synthetic test images to realize open-set, continuously updated multimodal hallucination evaluation, effectively avoiding the data contamination issues potentially present in current static benchmarks.
One Token, Two Fates: A Unified Framework via Vision Token Manipulation Against MLLMs Hallucination: Proposes the first unified, training-free framework for mitigating MLLM hallucinations, operating synergistically within the hidden representation layers based on the dual roles of vision tokens—enhancement (SVC) and suppression (CRC). It improves POPE accuracy by ~2% on LLaVA-1.5 with only a 1.06× increase in inference latency.
PhD: A ChatGPT-Prompted Visual Hallucination Evaluation Dataset: This paper proposes PhD, a large-scale visual hallucination evaluation dataset constructed with the assistance of ChatGPT. It contains 14K+ everyday images, 750 counter-commonsense images, and 102K VQA triplets. Through 4 evaluation modes $\times$ 5 visual tasks, it systematically evaluates the hallucination issues of multimodal large language models (MLLMs), far exceeding existing benchmarks in scale and difficulty.
Seeing Far and Clearly: Mitigating Hallucinations in MLLMs with Attention Causal Decoding: FarSight is proposed as a plug-and-play, training-free decoding strategy. It introduces attention registers into the upper triangular part of the causal mask to absorb excessive attention on anomalous tokens, and designs positional awareness encoding with diminishing masking rates to enhance information propagation for distant visual tokens, thereby effectively mitigating initial and snowball hallucinations in Multimodal Large Language Models (MLLMs).
Stop Learning It All to Mitigate Visual Hallucination, Focus on the Hallucination Target: Proposes TL-DPO (Target-Learning DPO), which limits traditional full-sentence preference learning to the target chunk where hallucination occurs and the corresponding image region. By excluding irrelevant signals through target generation loss and target condition loss, it reduces CHAIR_s on LLaVA-1.5 from 66.8 to 20.1, while improving LLaVA-Bench from 63.4 to 71.2.
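
Several entries above build on contrastive decoding (CD). As a rough sketch of the general idea, one common form contrasts the logits computed from the original input with logits computed from a degraded input, so that tokens favored mainly by language priors are down-weighted. The formula, the `alpha` parameter, and all logit values below are illustrative assumptions, not the specific strategies used by Octopus or any other paper listed here.

```python
def contrastive_logits(original, distorted, alpha=1.0):
    """One common CD form: (1 + alpha) * original - alpha * distorted."""
    if len(original) != len(distorted):
        raise ValueError("logit vectors must have the same length")
    return [(1 + alpha) * o - alpha * d for o, d in zip(original, distorted)]


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])


# Hypothetical next-token logits over a 3-token vocabulary.
# Token 0 scores the same with or without a clean image (prior-driven);
# token 1 scores well only when the clean image is available.
with_image = [2.0, 1.5, 0.1]
with_degraded_image = [2.0, 0.2, 0.1]

adjusted = contrastive_logits(with_image, with_degraded_image, alpha=1.0)

print("greedy pick before CD:", argmax(with_image))
print("greedy pick after CD: ", argmax(adjusted))
```

In this toy case the greedy choice moves from the prior-driven token to the image-dependent one.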

⚡ LLM Efficiency (5)

Associative Transformer: The Associative Transformer (AiT) is proposed, which integrates a learnable explicit memory module and a Hopfield network for token reconstruction within the Transformer architecture, achieving classification and relational reasoning performance superior to ViT with fewer parameters.
Efficient Data Driven Mixture-of-Expert Extraction from Trained Networks: A method is proposed to automatically extract MoE (Mixture-of-Experts) variants from pre-trained ViTs. By first clustering the output activation patterns of MLP layers and then extracting corresponding subnetworks as experts, this approach avoids training MoEs from scratch. It recovers 98% of the original performance on ImageNet-1k with only minimal fine-tuning, while reducing FLOPs and model size by 36% and 32%, respectively.
LOCORE: Image Re-ranking with Long-Context Sequence Modeling: This paper proposes LoCoRe (Long-Context Re-ranker), achieving list-wise image re-ranking based on local descriptors for the first time. By leveraging the Longformer long-context sequence model to process the local descriptors of both the query image and the entire candidate list simultaneously, LoCoRe significantly improves re-ranking performance by capturing transitive relations among candidate images.
Efficient Data Driven Mixture-of-Expert Extraction from Trained Networks: A post-training method is proposed to extract MoE variants from pre-trained ViTs. By automatically discovering expert structures using HDBSCAN to cluster MLP hidden layer activation patterns, it reduces MACs by 36% and parameters by 32% on ImageNet-1k while preserving 98% of the original accuracy without retraining.
Spatial-TTT: Streaming Visual-based Spatial Intelligence with Test-Time Training: This paper proposes Spatial-TTT, which leverages the Test-Time Training (TTT) mechanism to utilize a subset of model parameters (fast weights) as compact non-linear memory. Combined with a hybrid architecture and a spatial prediction mechanism, the model continuously accumulates and organizes 3D spatial evidence from unbounded video streams, achieving SOTA on video spatial understanding benchmarks.

📚 Pretraining (15)

A Unified Framework for Heterogeneous Semi-supervised Learning: This paper proposes a new problem setting termed Heterogeneous Semi-Supervised Learning (HSSL), where labeled and unlabeled data originate from domains with different distributions, and the goal is to train a model that generalizes well to both domains. By expanding the C-class problem into a 2C-class classification task (where the same semantic class in different domains is treated as distinct classes), this work provides a unified solution integrating Weighted Moving Average (WMA) pseudo-labeling, cross-domain prototype alignment, and progressive cross-domain Mixup.
AMO Sampler: Enhancing Text Rendering with Overshooting: This paper proposes the Attention-Modulated Overshooting (AMO) sampler, a training-free inference-time enhancement method. By introducing an overshooting-noise compensation Langevin dynamics correction during the sampling process of rectified flow models, and adaptively controlling the overshooting intensity using text-image cross-attention scores, it significantly improves text rendering accuracy while maintaining the overall quality of generated images.
Bridging the Vision-Brain Gap with an Uncertainty-Aware Blur Prior: This work introduces the concepts of "System GAP" and "Random GAP" for the first time to describe the information mismatch between brain signals and visual stimuli. By dynamically adjusting the image blur level through an Uncertainty-Aware Blur Prior (UBP) to alleviate overfitting during training, it achieves a 50.9% top-1 accuracy on the 200-way zero-shot brain-image retrieval task, outperforming the previous SOTA by 13.7 percentage points.
ConText-CIR: Learning from Concepts in Text for Composed Image Retrieval: The ConText-CIR framework is proposed, which utilizes a Text Concept-Consistency loss to align noun phrases in text modifications with corresponding regions in the query image. Combined with a synthetic data generation pipeline, it achieves SOTA performance on multiple CIR benchmarks.
DreamText: High Fidelity Scene Text Synthesis: DreamText reconstructs the training pipeline of diffusion models, introducing character-level balanced supervision and a heuristic alternate optimization strategy to calibrate character attention. Combined with the joint training of the text encoder and generator to learn diverse font styles, it significantly outperforms state-of-the-art methods in scene text synthesis tasks (improving SeqAcc from 0.763 of UDiffText to 0.940).
Exploration-Driven Generative Interactive Environments: This work provides an open-source implementation of the Genie world model (GenieRedux), which is enhanced to GenieRedux-G by incorporating ground-truth action conditioning, Token Distance Cross-Entropy (TDCE) loss, and token skip connections. Additionally, the AutoExplore agent is proposed to utilize the world model's token prediction uncertainty as an intrinsic reward to drive diverse data collection, improving simulation quality by up to 7.4 PSNR.
Improving Autoregressive Visual Generation with Cluster-Oriented Token Prediction: This paper proposes IAR, which rearranges the VQGAN codebook via balanced K-means to align similar embeddings with adjacent indices. Combined with a cluster-oriented cross-entropy loss that guides the model to correctly predict the semantic cluster of the target token, IAR halves the training time while improving generation quality across all LlamaGen scales from 100M to 1.4B.
Influence Malleability in Linearized Attention: Dual Implications of Non-Convergent NTK Dynamics: Using the NTK framework, this paper shows that linearized attention mechanisms do not converge to the infinite-width NTK limit: the spectral amplification effect cubes the condition number of the Gram matrix, requiring a width of $m = \Omega(\kappa^6)$. It introduces the concept of "influence malleability" to quantify the dual consequences of this non-convergence: an attention network's malleability, 6-9 times higher than that of a ReLU network, both enhances task adaptability and exacerbates adversarial vulnerability.
MR-PLIP: Multi-Resolution Pathology-Language Pre-training Model with Text-Guided Visual Representation: Proposes MR-PLIP, the first multi-resolution pathology vision-language pre-training model. Pre-trained on 34 million multi-resolution image-text pairs from the TCGA dataset, it outperforms SOTA on 26 datasets through cross-resolution vision-text alignment and text-guided visual representation.
PlanarSplatting: Accurate Planar Surface Reconstruction in 3 Minutes: This paper proposes PlanarSplatting, which directly optimizes learnable 3D rectangular plane primitives. By utilizing a newly designed rectangular splatting function, planes are differentiably rendered into depth and normal maps. This enables the reconstruction of accurate indoor planar scenes from multi-view images in just 3 minutes without requiring any plane annotations.

Browse all 15 Pretraining papers →

💬 LLM (Other) (15)

Building Vision Models upon Heat Conduction: A vision backbone named vHeat is proposed, which models image patches as heat sources and utilizes physical heat conduction equations via DCT/IDCT transforms to achieve global information propagation with $O(N^{1.5})$ complexity. It achieves 84.0% top-1 accuracy on ImageNet-1K with 3x higher throughput and 80% less GPU memory overhead.
Chat-based Person Retrieval via Dialogue-Refined Cross-Modal Alignment: This paper proposes a new paradigm of Chat-based Person Retrieval (ChatPR), builds the first dialogue-image paired dataset ChatPedes, and designs the DiaNA framework to achieve fine-grained cross-modal alignment between dialogues and images via an adaptive attribute refiner, significantly outperforming traditional single-sentence text retrieval methods.
ComRoPE: Scalable and Robust Rotary Position Embedding Parameterized by Trainable Commuting Angle Matrices: This paper proposes ComRoPE, which generalizes RoPE into a rotary position embedding parameterized by trainable commuting angle matrices. It theoretically proves that the pairwise commutativity of angle matrices is a necessary and sufficient condition for RoPE to satisfy relative position dependency, outperforming the state-of-the-art LieRE method by 1.6% (on training resolution) and 2.9% (on higher resolutions) on ImageNet-1K.
Dora: Sampling and Benchmarking for 3D Shape Variational Auto-Encoders: Proposes Dora-VAE, which focuses on sharp geometric edge regions via Sharp Edge Sampling (SES) and handles uniform and salient sampled points separately using Dual Cross-Attention. It achieves superior 3D shape reconstruction quality with only 1,280 latent codes (8× fewer than XCube-VAE's 10,000+), while establishing a new evaluation benchmark, Dora-Bench.
Exposure-slot: Exposure-centric Representations Learning with Slot-in-Slot Attention: This paper proposes the Exposure-slot framework, which extends the Slot Attention algorithm into a hierarchical slot-in-slot structure. Guided by learnable exposure prompts for feature clustering, it achieves exposure-centric region-aware representation learning, obtaining SOTA performance in under-/over-exposed image correction tasks.
Imagine and Seek: Improving Composed Image Retrieval with an Imagined Proxy: Proposes the IP-CIR method, which translates Composed Image Retrieval (CIR) into a standard image retrieval problem by using large language models to generate an "imagined target text description" as a proxy, achieving zero-shot SOTA on benchmarks such as CIRR and FashionIQ.
Learning Textual Prompts for Open-World Semi-Supervised Learning: This paper proposes a new method for open-world semi-supervised learning (OWSSL) that enhances vision-language alignment via a global-and-local textual prompt learning strategy, and designs a forward-and-backward strategy to reduce noise in vision-language matching for unlabeled samples, significantly outperforming the SOTA on multiple fine-grained datasets.
Making Old Film Great Again: Degradation-aware State Space Model for Old Film Restoration: This paper proposes the MambaOFR framework to address the complex compound degradations unique to old films. It designs degradation-aware prompts to guide the Mamba model in dynamically adjusting restoration modes, incorporates a flow-guided masked deformable alignment module to prevent the propagation of structural defects, and introduces the first benchmark dataset for old film restoration containing both synthetic and real-world data.
MG-MotionLLM: A Unified Framework for Motion Comprehension and Generation across Multiple Granularities: MG-MotionLLM proposes a unified multi-granularity motion-language model. Leveraging a Motion VQ-VAE + T5 language model architecture along with a carefully designed multi-granularity synergy pre-training scheme (comprising 28 tasks), it simultaneously supports coarse- and fine-grained motion comprehension and generation. While achieving state-of-the-art performance on classic tasks, it also enables novel applications such as fine-grained motion editing.
Rethinking Spiking Self-Attention Mechanism: Implementing a-XNOR Similarity Calculation in Spiking Transformers: This paper provides an in-depth analysis of the fundamental reasons why the dot product fails as a similarity metric in spiking query-key pairs due to a large number of "non-spiking events." It proposes the a-XNOR similarity metric specifically designed for spike sequences, redefining the correlation of non-spiking pairs as a specific value $a$. This approach significantly improves performance across various spiking Transformer architectures and datasets.
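
To make the a-XNOR entry above concrete, here is a minimal NumPy sketch of the idea as the summary describes it: spiking pairs count as 1 and non-spiking pairs count as a value $a$. Two things are assumptions rather than details from the paper: mismatched pairs contribute 0 (as in plain XNOR), and `a=0.5` is an arbitrary illustrative value. The function name `a_xnor_similarity` is hypothetical.

```python
import numpy as np

def a_xnor_similarity(q, k, a=0.5):
    """Score binary spike vectors: q is (N, d), k is (M, d), result is (N, M).

    Assumed scoring rule (illustrative, not taken from the paper):
      (1, 1) pair -> 1, (0, 0) pair -> a, mismatched pair -> 0.
    """
    q = np.asarray(q, dtype=np.float64)
    k = np.asarray(k, dtype=np.float64)
    both_spike = q @ k.T                      # positions where q = 1 and k = 1
    both_silent = (1.0 - q) @ (1.0 - k).T     # positions where q = 0 and k = 0
    return both_spike + a * both_silent

q = np.array([[1, 0, 0, 1]])
k = np.array([[1, 0, 0, 0],
              [0, 0, 0, 0]])

dot_scores = q @ k.T                          # plain dot product ignores shared silence
xnor_scores = a_xnor_similarity(q, k, a=0.5)  # shared silence now contributes a
```

With these toy inputs the dot product gives the all-silent key a score of zero, while the a-XNOR sketch still credits the two positions where both sequences are silent.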

Browse all 15 LLM (Other) papers →

🎨 Image Generation (305)¶

3DTopia-XL: Scaling High-Quality 3D Asset Generation via Primitive Diffusion: This paper proposes 3DTopia-XL, a native 3D generation model based on a novel primitive representation PrimX and a Diffusion Transformer. It generates high-quality 3D assets with high-resolution geometry, texture, and PBR materials from text or image inputs, significantly outperforming existing methods in both quality and efficiency.
A Bias-Free Training Paradigm for More General AI-generated Image Detection: This work proposes the B-Free training paradigm—generating semantically aligned fake images from real images via self-conditioned reconstruction with Stable Diffusion, combined with inpainting-based content augmentation to eliminate format, content, and resolution biases. This allows the detector to focus on generator-specific artifacts, achieving a generalization $\text{AUC} > 99\%$ and a balanced accuracy of 95.2% across 27 generator models (including recent models like FLUX and SD 3.5).
A Comprehensive Study of Decoder-Only LLMs for Text-to-Image Generation: This paper systematically investigates the effectiveness of using decoder-only LLMs as text encoders for text-to-image diffusion models. The authors find that while directly using the last-layer embeddings yields worse results than T5, aggregating embeddings across all layers via layer-normalized averaging significantly outperforms the T5 baseline.
Aesthetic Post-Training Diffusion Models from Generic Preferences with Step-by-step Preference Optimization: This paper proposes Step-by-step Preference Optimization (SPO), which samples multiple candidates from the same noisy latent at each denoising step and employs a step-aware preference model to select win/lose pairs to guide diffusion model fine-tuning. By implicitly distilling aesthetic information from generic preference data, SPO significantly improves aesthetic quality on SD-1.5 and SDXL, while achieving substantially faster convergence than DPO.

HOI-IDiff: An Image-like Diffusion Method for Human-Object Interaction Detection

AniDoc: Animation Creation Made Easier

AniMer: Animal Pose and Shape Estimation Using Family Aware Transformer: This paper proposes AniMer, which introduces a high-capacity ViT backbone to quadrupedal SMAL parameter estimation for the first time. By distinguishing shape distributions of different species through animal family-level supervised contrastive learning, and using ControlNet-based synthetic dataset CtrlAni3D (10k images), it comprehensively outperforms existing methods on Animal3D, CtrlAni3D, and the cross-domain Animal Kingdom dataset.
SPAI: Any-Resolution AI-Generated Image Detection by Spectral Learning: This work proposes SPAI, which models the frequency distribution of real images through Masked Spectral Learning. By introducing Spectral Reconstruction Similarity (SRS) and Spectral Context Attention (SCA), it detects AI-generated images as out-of-distribution (OOD) samples. SPAI achieves an average AUC of 91.0% across 13 generation models, an absolute improvement of 5.5% over the second-best method, while supporting detection of images with arbitrary resolutions.
Arbitrary-Steps Image Super-Resolution via Diffusion Inversion: This paper proposes InvSR, which achieves diffusion inversion by training a noise prediction network. Utilizing the image prior of a pre-trained diffusion model for super-resolution, it supports arbitrary-step sampling from 1 to 5 steps, achieving or exceeding the performance of existing state-of-the-art (SOTA) methods even with single-step sampling.
ArtiFade: Learning to Generate High-quality Subject from Blemished Images: This paper proposes ArtiFade, the first method to address the problem of "blemished subject-driven generation". By constructing paired blemished-unblemished datasets, partially fine-tuning the cross-attention weights of diffusion models, and optimizing an artifact-free embedding, it enables existing subject-driven methods (e.g., Textual Inversion, DreamBooth) to generate high-quality, artifact-free subject images from inputs containing blemishes such as watermarks, stickers, or adversarial noise.

Browse all 305 Image Generation papers →

🎬 Video Generation (85)¶

4Real-Video: Learning Generalizable Photo-Realistic 4D Video Diffusion: This work proposes 4Real-Video, a 4D video generation framework based on a two-stream architecture. By splitting video tokens into parallel time and view streams and introducing hard/soft synchronization layers to harmonize information between them, it generates high-quality $8 \times 8$ spatio-temporal video grids in approximately 1 minute, outperforming existing methods in visual quality and multi-view consistency.
AnimateAnything: Consistent and Controllable Animation for Video Generation: A two-stage controllable video generation framework is proposed. The first stage unifies different control signals (camera trajectories, user drag-and-drop annotations, reference videos) into a frame-by-frame optical flow representation. The second stage uses the unified optical flow to guide a DiT-based video diffusion model to generate the final video, introducing a frequency-domain stabilization module to suppress flickering under large motions.
Articulated Kinematics Distillation from Video Diffusion Models: This paper proposes the AKD framework, which reduces the degrees of freedom of 3D asset motion from full space to a small number of joint angles through skeletal joint parameterization, then distills text-aligned joint motion sequences using SDS gradients from a video diffusion model (CogVideoX), and further ensures physical plausibility via physical simulation.
BF-STVSR: B-Splines and Fourier—Best Friends for High Fidelity Spatial-Temporal Video Super-Resolution: The BF-STVSR framework is proposed to model temporal motion interpolation using a B-spline Mapper and capture spatial high-frequency details using a Fourier Mapper, achieving SOTA performance in continuous spatial-temporal video super-resolution without relying on external optical flow networks.
Can Text-to-Video Generation Help Video-Language Alignment?: Proposes the SynViTA framework to explore whether synthetic videos generated by text-to-video (T2V) models can improve video-language alignment (VLA). By addressing semantic inconsistency and appearance bias in synthetic videos through alignment quality-based sample weighting and semantic consistency regularization, it achieves an improvement of over 4 percentage points on temporally challenging tasks.
ConMo: Controllable Motion Disentanglement and Recomposition for Zero-Shot Motion Transfer: ConMo proposes a zero-shot motion transfer framework. By disentangling the composite motion in a reference video into independent subject motions and background (camera) motion, and then controllably recomposing these motions during target video generation, it enables various applications such as multi-subject motion transfer, semantic/shape transformation, subject removal, and camera motion simulation. It significantly outperforms existing methods in motion fidelity and text alignment.
Dynamic Camera Poses and Where to Find Them: Proposes DynPose-100K—a large-scale dataset containing 100K dynamic internet videos and their camera pose annotations, achieved through a video filtering pipeline combining specialist models with a VLM, and a pose estimation pipeline integrating state-of-the-art point tracking, dynamic masking, and global BA.
DynamicScaler: Seamless and Scalable Video Generation for Panoramic Scenes: DynamicScaler is a training-free unified framework that synthesizes panoramic dynamic scenes with arbitrary resolutions and aspect ratios via an Offset Shifting Denoiser (OSD) and Global Motion Guidance (GMG). It supports both conventional panorama and 360° field-of-view (FoV) video generation, including long-duration and loopable videos, while maintaining a constant VRAM footprint.
Exploring Temporally-Aware Features for Point Tracking: This work proposes Chrono, a temporally-aware feature backbone designed for point tracking. By inserting temporal adapters (2D convolutional downsampling + 1D local temporal attention + 2D convolutional upsampling) between the Transformer blocks of DINOv2, Chrono achieves state-of-the-art performance in a refiner-free setting using only simple feature matching (soft-argmax).
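
The "simple feature matching (soft-argmax)" mentioned in the Chrono entry can be illustrated with a short NumPy sketch. It is a generic soft-argmax over a cosine-similarity map, not Chrono's actual code; the function name `soft_argmax_2d`, the temperature value, and the toy one-hot feature map are all assumptions made for illustration.

```python
import numpy as np

def soft_argmax_2d(query_feat, feat_map, temperature=0.05):
    """Return the expected (x, y) location of query_feat inside feat_map.

    query_feat: (C,) feature of the tracked point.
    feat_map:   (H, W, C) features of the target frame.
    """
    h, w, _ = feat_map.shape
    eps = 1e-8
    q = query_feat / (np.linalg.norm(query_feat) + eps)
    f = feat_map / (np.linalg.norm(feat_map, axis=-1, keepdims=True) + eps)
    logits = (f @ q) / temperature            # (H, W) cosine similarities, sharpened
    logits = logits - logits.max()            # subtract max for numerical stability
    prob = np.exp(logits)
    prob = prob / prob.sum()                  # softmax over all locations
    ys, xs = np.mgrid[0:h, 0:w]
    return float((prob * xs).sum()), float((prob * ys).sum())

# Toy frame: every location has a distinct one-hot feature.
feat_map = np.eye(16).reshape(4, 4, 16)
query = feat_map[2, 1]                        # point located at row y=2, column x=1
x, y = soft_argmax_2d(query, feat_map)
```

Because the output is a probability-weighted average of coordinates, the estimate is differentiable and can land between pixels, which is the usual reason to prefer it over a hard argmax.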

Browse all 85 Video Generation papers →

🧩 Multimodal VLM (136)¶

4D LangSplat: 4D Language Gaussian Splatting via Multimodal Large Language Models: This paper proposes 4D LangSplat, which constructs a 4D language field by leveraging multimodal large language models (MLLMs) to generate object-wise video captions. Combined with a status deformable network to model the temporally continuous evolution of semantics, it achieves the first time-sensitive and time-agnostic open-vocabulary queries in dynamic scenes.
Active Data Curation Effectively Distills Large-Scale Multimodal Models: Proposes ACID (Active data Curation as Implicit Distillation) and ACED (combined with explicit distillation), demonstrating that actively filtering training data using a larger model as a reference is a more effective multimodal model compression approach than traditional knowledge distillation. Combining the two complementarily achieves SOTA performance on 27 zero-shot tasks with fewer inference FLOPs.
ASAP: Advancing Semantic Alignment Promotes Multi-Modal Manipulation Detecting and Grounding: This paper proposes the ASAP framework, which systematically advances image-text semantic alignment to improve multi-modal manipulation detection and grounding performance through three core modules: Large Model-Assisted Alignment (LMA), Manipulation-Guided Cross-Attention (MGCA), and Patch Manipulation Modeling (PMM). It achieves a 94.38% AUC and 76.52% text grounding F1 on the DGM4 benchmark, significantly outperforming existing methods.

Beyond Words: Augmenting Discriminative Richness via Diffusions in Unsupervised Prompt Learning: This paper proposes the AiR (Augmenting discriminative Richness) framework, which utilizes a LoRA-fine-tuned Stable Diffusion model to generate synthetic images and construct an auxiliary classifier. By complementarily fusing it with the text classifier, the text-to-image matching paradigm in unsupervised prompt learning is extended to image-to-image matching, significantly improving classification accuracy on challenging datasets such as fine-grained categorizations and remote sensing.
Calico: Part-Focused Semantic Co-Segmentation with Large Vision-Language Models: This paper proposes Calico, the first large vision-language model designed for part-level semantic co-segmentation. It establishes part-level semantic correspondence across multiple images using a Correspondence Extraction Module (CEM) and a Correspondence Adaptation Module (CAM) while fine-tuning only 0.3% of the parameters. On the newly constructed MixedParts benchmark, it outperforms existing methods with a 6.3% gain in mIoU and a 51.3% speedup in inference.
Can Large Vision-Language Models Correct Semantic Grounding Errors By Themselves?: This work systematically investigates the self-correction capabilities of VLMs in semantic grounding tasks. It reveals that intrinsic self-correction (without external feedback) actually degrades performance (by 7 to 17 points). However, iterative correction guided by feedback from the same VLM acting as a binary verifier can improve performance by up to 8.4 percentage points, highlighting that feedback quality is the critical bottleneck for self-correction.
CodePercept: Code-Grounded Visual STEM Perception for MLLMs: Through scaling analysis, this work discovers that the true bottleneck of STEM visual reasoning is perception rather than reasoning, and proposes using executable Python code as a precise perceptual medium. By constructing the ICC-1M dataset (Image-Caption-Code triplets) for training, CodePercept-8B improves by $+3.0\%$ to $+12.3\%$ over Qwen3-VL-8B on STEM perception benchmarks.
CoLLM: A Large Language Model for Composed Image Retrieval: This work proposes CoLLM, a unified framework for Composed Image Retrieval (CIR) leveraging Large Language Models. By generating training triplets on-the-fly from image-caption pairs, producing joint multimodal embeddings with an LLM, and constructing a large-scale MTCIR dataset with 3.4 million samples, CoLLM achieves SOTA performance across multiple CIR benchmarks, with MTCIR yielding up to a 15% performance improvement.
CoMM: A Coherent Interleaved Image-Text Dataset for Multimodal Understanding and Generation: Addressing the core issues of poor narrative coherence and inconsistent entity styles in existing interleaved image-text datasets (such as MMC4/OBELICS), this work constructs the CoMM dataset (227K documents, 2.28M images). By targeting instructional content collection combined with a multi-perspective quality filtering strategy, it ensures text coherence, image consistency, and image-text alignment, while proposing four interleaved generation evaluation tasks.

Browse all 136 Multimodal VLM papers →

🧠 VLM Reasoning (13)¶

Beyond Final Answers: CRYSTAL Benchmark for Transparent Multimodal Reasoning Evaluation: This paper proposes the CRYSTAL benchmark (6372 instances) to evaluate MLLMs at the intermediate reasoning step level using Match F1 and Ordered Match F1. It reveals widespread cherry-picking behaviors and disordered reasoning processes, and introduces a CPR-Curriculum training strategy to improve reasoning quality.
Coarse Correspondences Boost Spatial-Temporal Reasoning in Multimodal Language Models: This paper proposes Coarse Correspondences, a lightweight, training-free visual prompting method. By overlaying coarse-grained instance correspondence markers obtained from object tracking onto image frames, it significantly enhances the spatial-temporal reasoning capabilities of MLLMs, achieving improvements of +20.5% on ScanQA, +9.7% on OpenEQA, +6.0% on EgoSchema, and +11% on R2R navigation.
Critic-V: VLM Critics Help Catch VLM Errors in Multimodal Reasoning: This paper proposes the Critic-V framework, which decouples the VLM reasoning process into a Reasoner and a Critic. By utilizing a DPO-trained Critic model to provide natural language feedback for iteratively optimizing the reasoning path, this approach outperforms GPT-4V on 5 out of 8 benchmarks, showing particularly significant improvements on mathematical reasoning tasks (MathVista +11.8%).
Document Haystacks: Vision-Language Reasoning Over Piles of 1000+ Documents: Proposes two large-scale document retrieval benchmarks, DocHaystack and InfoHaystack (1000+ documents per question), and V-RAG, a vision-centric retrieval-augmented generation framework, which improves Recall@1 by 9%-11% over the best baseline.
ESPIRE: A Diagnostic Benchmark for Embodied Spatial Reasoning of Vision-Language Models: This paper proposes Espire, a simulation-based diagnostic benchmark for embodied spatial reasoning. It decomposes VLM evaluation into localization and execution phases, systematically assessing the capabilities of VLMs across multiple spatial reasoning dimensions and granularities through a fully generative paradigm.
Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models: Insight-V proposes a visual reasoning enhancement scheme consisting of a data generation pipeline and a multi-agent reasoning system: it constructs high-quality long-chain reasoning data through progressive generation and multi-granular evaluation, designs a Reasoning Agent and a Summary Agent to collaboratively solve problems, and incorporates iterative DPO to further improve reasoning quality, achieving an average improvement of 7% across seven visual reasoning benchmarks.
MM-CondChain: A Programmatically Verified Benchmark for Visually Grounded Deep Compositional Reasoning: MM-CondChain is the first MLLM benchmark for visually grounded deep compositional reasoning. By using a Verifiable Programmatic Intermediate Representation (VPIR), it automatically constructs multi-layer conditional chains and chain-style hard negatives. The strongest model achieves only a 53.33 Path F1, revealing that deep compositional reasoning remains a fundamental challenge.
MV-MATH: Evaluating Multimodal Math Reasoning in Multi-Visual Contexts: This paper proposes the MV-MATH benchmark, consisting of 2,009 high-quality multi-image math problems (sourced from real K-12 scenarios) to systematically evaluate the capability of 25 multimodal large models in multi-image math reasoning scenarios. It is found that all models perform well below human levels (the best, Claude, only achieves 33.9%), revealing that multi-image math reasoning remains a significant challenge for MLLMs.
Reasoning over Video: Evaluating How MLLMs Extract, Integrate, and Reconstruct Spatiotemporal Evidence: Proposes the VAEX-Bench benchmark to systematically evaluate the "abstract spatiotemporal reasoning" capability of MLLMs for the first time. Unlike extractive tasks that pull information from single frames, abstract reasoning requires integrating observations across rooms and time to infer global spatial layouts, perform cross-scene counting, etc. The study reveals that all SOTA models (including GPT-5.2 and Gemini-3 Pro) perform significantly worse than humans on abstract reasoning.
SeqAfford: Sequential 3D Affordance Reasoning via Multimodal Large Language Model: This work proposes the Sequential 3D Affordance Reasoning task and constructs a benchmark of 180K instruction-point cloud pairs. By introducing a <SEG> token and a multi-granular language-point integration module into a 3D MLLM, the model reasons and segments sequential affordance regions from complex human instructions.

Browse all 13 VLM Reasoning papers →

⚡ VLM Efficiency (3)¶

COAP: Memory-Efficient Training with Correlation-Aware Gradient Projection

MBQ: Modality-Balanced Quantization for Large Vision-Language Models: This work identifies that the sensitivity of vision tokens and language tokens to quantization errors in large VLMs differs by more than tenfold. It proposes MBQ, a post-training quantization method that introduces a gradient-based modality-balancing factor during calibration. Under W3A16 and W4A8 configurations, MBQ improves accuracy by up to 4.4% and 11.6%, respectively, while achieving a 1.4× end-to-end acceleration.
Quantization without Tears: This paper proposes the QwT (Quantization without Tears) method, which compensates for quantization information loss by adding a lightweight linear compensation layer after each block of the quantized network. The parameters of this compensation layer can be obtained via a closed-form solution in under 2 minutes, significantly improving PTQ accuracy across various tasks including vision, language, and multimodality.
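
The closed-form compensation idea in the QwT entry can be sketched as an ordinary least-squares fit. This is only an illustration under stated assumptions: the compensation layer is assumed to map the block input to the gap between the full-precision and quantized block outputs, the toy block and its weight rounding stand in for a real quantized network, and `fit_compensation` is a hypothetical name.

```python
import numpy as np

def fit_compensation(x, y_fp, y_q):
    """Solve for W, b minimizing ||y_q + x @ W + b - y_fp||^2 in closed form."""
    residual = y_fp - y_q                                  # what quantization lost
    x_aug = np.hstack([x, np.ones((x.shape[0], 1))])       # extra column for the bias
    solution, *_ = np.linalg.lstsq(x_aug, residual, rcond=None)
    return solution[:-1], solution[-1]                     # W is (d_in, d_out), b is (d_out,)

rng = np.random.default_rng(0)
x = rng.normal(size=(256, 8))                              # calibration inputs
w_block = rng.normal(size=(8, 8))
w_quant = np.round(w_block * 4) / 4                        # toy quantization: 0.25 steps

y_fp = np.tanh(x @ w_block)                                # full-precision block output
y_q = np.tanh(x @ w_quant)                                 # quantized block output

w_comp, b_comp = fit_compensation(x, y_fp, y_q)
err_before = np.mean((y_q - y_fp) ** 2)
err_after = np.mean((y_q + x @ w_comp + b_comp - y_fp) ** 2)
```

On the calibration data the fit cannot be worse than no compensation, since `W = 0, b = 0` is always a candidate solution; how well it transfers to unseen data is a separate question that this sketch does not address.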

🎵 Audio & Speech (19)¶

Contextual AD Narration with Interleaved Multimodal Sequence: A unified framework named Uni-AD is proposed. It takes interleaved multimodal sequences (video features + text + character bank + context) as input. By aligning features through a visual mapping network, identifying main characters via a character-refinement module, and enhancing contextual consistency with a contrastive loss, it achieves SOTA performance on MAD-eval-Named.
Crab: A Unified Audio-Visual Scene Understanding Model with Explicit Cooperation: This paper proposes Crab, a unified audio-visual scene understanding model. By constructing the AV-UIE dataset (200K samples) with explicit reasoning processes, it clarifies the collaborative relationships across tasks. Combined with interaction-aware LoRA (multi-head LoRA) designed to learn different audio-visual interaction patterns, Crab outperforms specialized models across multiple tasks.
DistinctAD: Distinctive Audio Description Generation in Contexts: Generates distinctive audio descriptions (AD) in context, avoiding generic and featureless descriptions by using contrastive learning to encourage differences from the preceding and succeeding ADs.
DualTalk: Dual-Speaker Interaction for 3D Talking Head Conversations: Proposes DualTalk, the first unified framework for multi-turn dual-speaker interactive 3D talking head generation that models both speaker and listener behaviors, accompanied by a dual-speaker dialogue dataset containing 50 hours and over 1,000 identities.
EMoVA: Empowering Language Models to See, Hear and Speak with Vivid Emotions: EMoVA is proposed as the first end-to-end omni-modal LLM that achieves visual understanding, speech recognition, and emotion-controllable speech synthesis simultaneously through a semantic-acoustic decoupled speech tokenizer, outperforming GPT-4o on vision-language benchmarks and achieving a 2.9% WER in speech recognition.
Enhancing Dance-to-Music Generation via Negative Conditioning Latent Diffusion Model: This work proposes PN-Diffusion, which extracts positive and negative beat conditions from forward-played and backward-played dance videos respectively. It designs a dual diffusion and reverse process to jointly train a U-Net, improving both the quality of the generated music and its beat consistency with the dance movements. On the AIST++ and TikTok datasets, it improves BCS by 1.80/3.85 and BHS by 4.22/5.90.
HOP: Heterogeneous Topology-based Multimodal Entanglement for Co-Speech Gesture Generation: This paper proposes HOP, a heterogeneous topology-based multimodal entanglement method. By using audio as a bridge, it aligns audio-text semantics via a reprogramming module and audio-action rhythm via a spatio-temporal graph network. This achieves more natural and coherent co-speech gesture generation, reaching SOTA on FGD, BC, and diversity metrics.
Improving Sound Source Localization with Joint Slot Attention on Image and Audio: Proposes a joint slot attention mechanism to decompose both images and audio into target/non-target representations, achieving precise sound source localization through cross-modal attention matching and contrastive learning, resulting in SOTA performance of 65.16% AUC and 86.00% cIoU on Flickr-SoundNet.
ImViD: Immersive Volumetric Videos for Enhanced VR Engagement: This work constructs the first immersive volumetric video dataset by capturing 7 indoor/outdoor scenes using a mobile multi-view system with 46 synchronized GoPros. It proposes STG++, which introduces learnable affine color transformations to resolve cross-camera color inconsistency, achieving rendering at 110.47 FPS with 387MB of storage, and integrates HRTF spatial audio.
Learning to Highlight Audio by Watching Movies: A novel task of visually-guided acoustic highlighting is proposed, leveraging well-crafted audiovisual data from movies as free supervision. Through a Transformer-based multimodal framework, VisAH, poorly mixed audio is converted into visually and semantically aligned highlighted audio, significantly outperforming baseline methods across all metrics.

Browse all 19 Audio & Speech papers →

🔎 AIGC Detection (3)¶

Enhancing Few-Shot Class-Incremental Learning via Training-Free Bi-Level Modality Calibration: This paper proposes the BiMC (Bi-level Modality Calibration) framework based on a frozen CLIP model. By leveraging intra-modal calibration (combining fine-grained class descriptions generated by LLMs with visual prototypes) and inter-modal calibration (fusing pre-trained language knowledge with task-specific visual priors), BiMC achieves state-of-the-art FSCIL performance without any parameter training, outperforming the best baseline by 4.25% on CIFAR-100.
ProAPO: Progressively Automatic Prompt Optimization for Visual Classification: Proposes ProAPO, a progressively automatic prompt optimization method based on evolutionary algorithms. With only one-shot supervision and zero human intervention, it progressively optimizes from task-level templates to category-level descriptions to address hallucination and lack of discriminativeness in LLM-generated descriptions, outperforming existing text prompting methods on 13 datasets.
SGC-Net: Stratified Granular Comparison Network for Open-Vocabulary HOI Detection: This paper proposes the Stratified Granular Comparison Network (SGC-Net), which aggregates multi-layer CLIP visual features via a Granularity Sensing Alignment (GSA) module and recursively generates discriminative descriptions using an LLM within a Hierarchical Group Comparison (HGC) module. This addresses the issues of insufficient feature granularity and semantic confusion in open-vocabulary HOI detection.

🧊 3D Vision (364)¶

3D-GSW: 3D Gaussian Splatting for Robust Watermarking: This paper proposes 3D-GSW, the first robust digital watermarking method designed specifically for 3D Gaussian Splatting. It enhances watermark robustness by removing redundant Gaussians and splitting Gaussians in high-frequency regions via Frequency-Guided Densification (FGD). Combined with a gradient mask and wavelet sub-band loss to maintain rendering quality, 3D-GSW achieves superior watermark robustness and rendering quality across the Blender, LLFF, and Mip-NeRF 360 datasets.
3D-HGS: 3D Half-Gaussian Splatting: This work proposes the 3D Half-Gaussian (3D-HGS) reconstruction kernel, which splits a 3D Gaussian into two halves using a cutting plane, each having independent opacity. Acting as a plug-and-play reconstruction kernel to replace standard Gaussian kernels, it significantly enhances rendering quality at shape and color discontinuities without sacrificing rendering speed, outperforming all SOTA methods on Mip-NeRF360, Tanks & Temples, and Deep Blending.
3D-LLaVA: Towards Generalist 3D LMMs with Omni Superpoint Transformer: This work proposes 3D-LLaVA, a general-purpose 3D Large Multimodal Model (LMM) with a minimalist architecture. The core is the Omni Superpoint Transformer (OST) acting as a versatile visual connector. It simultaneously serves as a visual feature selector, a visual prompt encoder, and a segmentation mask decoder. Using only point cloud inputs, it achieves state-of-the-art (SOTA) performance across five benchmarks, including ScanQA (92.6 CIDEr) and ScanRefer (43.3 mIoU).
3D-Mem: 3D Scene Memory for Embodied Exploration and Reasoning: This paper proposes 3D-Mem, a 3D scene memory framework based on "Memory Snapshots." It compactly represents explored areas using a small set of curated multi-view images and models unexplored regions via Frontier Snapshots, enabling efficient embodied exploration and reasoning in combination with VLMs.
3D-SLNR: A Super Lightweight Neural Representation for Large-scale 3D Mapping: This paper proposes 3D-SLNR, a super lightweight neural 3D representation. It defines the global Signed Distance Function (SDF) based on a collection of band-limited local SDFs anchored on support points of a point cloud. Each local SDF is parameterized by a single shared tiny MLP (without latent feature vectors). The output of the MLP is modulated by learnable geometric attributes (position, rotation, and scale) to adapt to complex geometries in different regions. Combined with a parallel query algorithm and a prune-and-expand strategy, it achieves SOTA reconstruction quality with less than 1/5 of the memory footprint of previous methods.
3D Convex Splatting: Radiance Field Rendering with 3D Smooth Convexes: Replaces Gaussian primitives with 3D smooth convex primitives for radiance field rendering. By defining convex hulls using point sets + LogSumExp smoothing + custom CUDA rasterizer, this method outperforms 3DGS on T&T and Deep Blending using fewer primitives.
3D Dental Model Segmentation with Geometrical Boundary Preserving: This paper proposes CrossTooth, which utilizes selective downsampling based on curvature priors (increasing vertex density in boundary areas by 10-15%) and cross-modal boundary feature fusion with multi-view rendered images. It achieves 95.86% mIoU and 82.05% boundary IoU on the public 3DTeethSeg'22 dataset, outperforming the previous SOTA (ToothGroupNet) by 2.3% and 5.7% respectively.
3D Gaussian Head Avatars with Expressive Dynamic Appearances by Compact Tensorial Representations: A 3D Gaussian head avatar method with compact tensorial representation is proposed, which stores the static neutral-expression appearance using canonical tri-planes and the dynamic texture (opacity offset) of each blendshape using lightweight 1D feature lines. It achieves 300 FPS real-time rendering and accurate capture of dynamic facial details with only 10MB of storage, comprehensively outperforming GA, GBS, and GHA in PSNR and storage efficiency on the NeRSemble dataset.
3D Gaussian Inpainting with Depth-Guided Cross-View Consistency: This paper proposes 3DGIC, a framework that achieves object removal and inpainting in 3D Gaussian Splatting scenes through depth-guided cross-view consistent inpainting. By leveraging rendered depth maps, it projects background pixels visible from other views onto the masked region to refine the inpainting mask. Then, 2D inpainting results from a reference view are projected onto 3D space to constrain cross-view consistency for other views. The proposed method outperforms existing approaches in FID and LPIPS on the SPIn-NeRF dataset.
3D Student Splatting and Scooping (SSS): This work proposes SSS (Student Splatting and Scooping), advancing the 3DGS paradigm with three unprecedented innovations: (1) replacing Gaussian distributions with Student-t distributions as mixture components (with learnable tail thickness that varies continuously from Cauchy to Gaussian); (2) introducing negative density components (scooping by subtracting color) to extend the formulation to non-monotonic mixture models; (3) employing SGHMC sampling instead of SGD to decouple parameter optimization. SSS achieves state-of-the-art results in 6 out of 9 metrics across Mip-NeRF360, T&T, and Deep Blending, demonstrating extreme parameter efficiency by matching or exceeding 3DGS using only 18% of the component count.
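
The "learnable tail thickness that varies continuously from Cauchy to Gaussian" in the SSS entry follows from the shape of the Student-t density. The sketch below evaluates the unnormalized one-dimensional falloff at two extremes of the degrees-of-freedom parameter; it is a generic property of the distribution, not the paper's rasterizer, and the exact exponent in the paper's 3D parameterization may differ. The function name `student_t_falloff` is hypothetical.

```python
import numpy as np

def student_t_falloff(d2, nu):
    """Unnormalized 1D Student-t falloff for squared distance d2 and tail parameter nu."""
    return (1.0 + d2 / nu) ** (-(nu + 1.0) / 2.0)

d2 = np.linspace(0.0, 9.0, 10)                    # squared distances from the center

heavy_tail = student_t_falloff(d2, nu=1.0)        # nu = 1 gives the Cauchy shape
light_tail = student_t_falloff(d2, nu=1e6)        # large nu approaches a Gaussian

cauchy_reference = 1.0 / (1.0 + d2)
gaussian_reference = np.exp(-d2 / 2.0)
```

Intermediate values of `nu` interpolate between these two shapes, so a single learnable scalar per component controls how far its influence extends.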

Browse all 364 3D Vision papers →

🎯 Object Detection (38)¶

AA-CLIP: Enhancing Zero-Shot Anomaly Detection via Anomaly-Aware CLIP: Proposes AA-CLIP, which enhances anomaly discriminability while preserving the generalization ability of CLIP through a two-stage training strategy (first adapting the text encoder to establish anomaly-aware anchors, then aligning patch-level visual features). It achieves SOTA zero-shot anomaly detection performance across multiple industrial and medical datasets with minimal training samples.
ABRA: Teleporting Fine-Tuned Knowledge Across Domains for Open-Vocabulary Object Detection: Proposes ABRA (Aligned Basis Relocation for Adaptation), which "teleports" class-specific detection knowledge from a source domain to an unlabeled target domain by performing SVD decomposition and orthogonal rotation alignment in the weight space, achieving zero-shot cross-domain object detection.
AnomalyNCD: Towards Novel Anomaly Class Discovery in Industrial Scenarios: Proposes AnomalyNCD, the first self-supervised multi-class anomaly classification method for industrial scenarios: MEBin extracts major anomaly regions $\rightarrow$ mask-guided ViT focuses on weak-semantic anomalies $\rightarrow$ region fusion strategy achieves flexible region/image-level classification, improving F1 by 10.8% and NMI by 8.8% on MVTec AD.
BACON: Improving Clarity of Image Captions via Bag-of-Concept Graphs: This paper proposes BACON, a prompting method that deconstructs verbose image captions generated by VLMs into decoupled structured elements (in JSON dictionary format) such as objects, relationships, styles, and themes. This allows downstream models to efficiently utilize caption information without requiring strong text-encoding capabilities, achieving a 1.51x recall improvement for GroundingDINO in open-vocabulary object detection.
Boosting Domain Incremental Learning: Selecting the Optimal Parameters Is All You Need: Discovers that selecting the optimal subset of parameters is more effective than fine-tuning all parameters in domain incremental learning, and proposes a parameter selection strategy to resolve catastrophic forgetting in domain incremental object detection.
DEIM: DETR with Improved Matching for Fast Convergence: This paper accelerates DETR training convergence through two simple improvements: Dense O2O (increasing targets per image via data augmentation to achieve dense one-to-one matching) and MAL (replacing VFL to better optimize low-quality matches). It cuts the training epochs in half while boosting performance (COCO AP 56.5 with D-FINE-X).
Distribution Prototype Diffusion Learning for Open-set Supervised Anomaly Detection: The DPDL method is proposed to learn multi-Gaussian distribution prototypes and map normal samples to the prototype space via diffusion using the Schrödinger Bridge (while concurrently pushing away anomalous samples). Combined with dispersion feature learning on hyperspherical space to enhance generalization, this method achieves state-of-the-art (SOTA) performance on 9 public anomaly detection datasets (e.g., outperforming AHL by 5.0% on AITEX and 8.7% on ELPV).
Efficient Event-Based Object Detection: A Hybrid Neural Network with Spatial and Temporal Attention: This work proposes the first hybrid SNN-ANN object detection model targeting large-scale benchmarks. An Attention-Squeeze Bridging (ASAB) block is designed to convert sparse spike representations from the SNN into dense features for the ANN via spatio-temporal attention. With only 6.6M parameters, it significantly outperforms SNN methods and approaches the accuracy of ANN/RNN methods on the Gen1/Gen4 datasets, while the SNN component can be deployed on the Intel Loihi 2 neuromorphic chip for low-power inference.
Efficient Test-Time Adaptive Object Detection via Sensitivity-Guided Pruning: Proposes an efficient continual test-time adaptive object detection (CTTA-OD) method, identifying that certain feature channels in the source model are sensitive to domain shifts and impede cross-domain performance. Selective pruning is achieved by guiding weighted sparse regularization with channel sensitivity measured at both image and instance levels, complemented by a random channel reactivation mechanism to prevent erroneous pruning. This approach surpasses SOTA adaptation accuracy while reducing computational cost by 12%.
Generalized Diffusion Detector: Mining Robust Features from Diffusion Models for Domain-Generalized Detection: This paper introduces diffusion models to domain-generalized object detection for the first time. By extracting multi-timestep intermediate features from the diffusion process to build a domain-invariant detector, and designing a dual-level (feature-level and object-level) alignment knowledge transfer framework, the generalization capability is distilled into a lightweight common detector. It achieves an average improvement of 14.0% mAP across six DG benchmarks, even outperforming most domain adaptation methods.

Browse all 38 Object Detection papers →

✂️ Segmentation (94)¶

2DMamba: Efficient State Space Model for Image Representation with Applications on Giga-Pixel Whole Slide Image Classification: Proposes 2DMamba, the first native 2D selective State Space Model with an efficient parallel algorithm. By maintaining 2D spatial continuity (rather than flattening into a 1D sequence) to model inter-patch relationships in WSIs, it comprehensively outperforms 1D Mamba methods across 10 public pathology datasets, while also achieving improvements on ImageNet classification and ADE20K segmentation.
A Distractor-Aware Memory for Visual Object Tracking with SAM2: A Distractor-Aware Memory (DAM) model is proposed for SAM2.1++, splitting the memory of SAM2 into Recent Appearance Memory (RAM, ensuring segmentation accuracy) and Distractor Resolution Memory (DRM, ensuring tracking robustness). Through an introspective update strategy, DAM detects distractors and automatically stores anchor frames, setting a new SOTA on 7 benchmarks.
Assessing and Learning Alignment of Unimodal Vision and Language Models (SAIL): The SAIL framework is proposed: first, the alignment potential of unimodal vision and language models is assessed through alignment probing (discovering that k-NN clustering quality is more crucial than linear separability); second, DINOv2 and pretrained language models are efficiently aligned using a lightweight GLU alignment layer + Sigmoid loss + multi-positive sample strategy, outperforming CLIP with only 6% of its training data.

SAIL: Assessing and Learning Alignment of Unimodal Vision and Language Models

Audio-Visual Instance Segmentation

G2HFNet: GeoGran-Aware Hierarchical Feature Fusion Network for Salient Object Detection in Optical Remote Sensing Images: This paper proposes G2HFNet, which designs differentiated optimization strategies for features at different levels through four modules: Multi-scale Detail Enhancement (MDE), Dual-branch Geometric-Granularity Complementarity (DGC), Deep Semantic Perception (DSP), and Local-Global Guided Fusion (LGF), comprehensively outperforming SOTA on three remote sensing salient object detection datasets.
Comparative Evaluation of Traditional Methods and Deep Learning for Brain Glioma Imaging: This paper systematically reviews the performance of traditional methods and deep learning methods in MRI brain glioma segmentation and classification. Through a comprehensive comparative evaluation, it concludes that CNN architectures significantly outperform traditional techniques in segmentation accuracy and robustness.

Condensing Action Segmentation Datasets via Generative Network Inversion

Continuous Locomotive Crowd Behavior Generation: Generates continuous crowd locomotive behaviors by jointly synthesizing trajectories and actions, producing natural and diverse collective motion patterns.
COSMOS: Cross-Modality Self-Distillation for Vision Language Pre-training: COSMOS proposes a cross-modality self-distillation framework that learns fine-grained cross-modality representations in a student-teacher architecture using a text-cropping strategy and a cross-attention module. Pre-trained on only 30M data, it consistently outperforms CLIP-like baselines across zero-shot retrieval, classification, and semantic segmentation tasks, even surpassing OpenCLIP trained on billions of data points.

Browse all 94 Segmentation papers →

🖼️ Image Restoration (41)¶

A Flag Decomposition for Hierarchical Datasets: This paper proposes Flag Decomposition (FD), an algorithm that decomposes hierarchically structured data into flag manifold representations (Stiefel coordinates) while preserving hierarchical relationships. It demonstrates advantages over standard methods like SVD in denoising, clustering, and few-shot learning tasks.
A Physics-Informed Blur Learning Framework for Imaging Systems: A physics-informed PSF learning framework is proposed, designing a new wavefront basis (where each basis only affects a single SFR direction) to eliminate gradient conflicts. Combined with curriculum learning (from center to periphery), it accurately estimates the spatially-varying PSF of imaging systems without requiring lens parameters.

EQ-Reg: A Regularization-Guided Equivariant Approach for Image Restoration

AdcSR: Adversarial Diffusion Compression for Real-World Image Super-Resolution: An Adversarial Diffusion Compression (ADC) framework is proposed to distill the one-step diffusion model OSEDiff into a streamlined diffusion-GAN hybrid model. This achieves a 73% reduction in inference time, a 78% reduction in computational cost, and a 74% reduction in parameters while maintaining generative quality, reaching real-time super-resolution at 34.79 FPS.
Augmenting Perceptual Super-Resolution via Image Quality Predictors: No-reference image quality assessment (NR-IQA) models are leveraged to replace human annotations. By improving perceptual super-resolution quality through weighted sampling and direct optimization, the proposed method outperforms state-of-the-art methods that rely on human feedback, without requiring any human-labeled data.
Classic Video Denoising in a Machine Learning World: Robust, Fast, and Controllable: Revisits classic video denoising methods and integrates them with modern ML tools to achieve robust, fast, and noise-level-controllable video denoising.
Complexity Experts are Task-Discriminative Learners for Any Image Restoration: MoCE-IR is proposed to replace the uniform architecture of traditional MoEs with "complexity experts" possessing varying computational complexities and receptive field sizes. Complemented by a spring-like routing mechanism biased towards low complexity, it unexpectedly achieves task-discriminative allocation—different degradation types are automatically routed to experts of appropriate complexity, allowing irrelevant experts to be bypassed during inference.
DarkIR: Robust Low-Light Image Restoration: DarkIR proposes an efficient CNN-based multi-task low-light image restoration method. The encoder uses SpAM+FreMLP (frequency magnitude enhancement) to handle illumination, while the decoder utilizes Di-SpAM (dilated spatial attention) to handle blur. With an asymmetric design, it achieves 27.30dB PSNR on LOLBlur with only 3.31M parameters.
Degradation-Aware Feature Perturbation for All-in-One Image Restoration: This paper proposes the DFPIR framework, which adapts the feature space between the encoder and decoder to fit a unified parameter space through two mechanisms: degradation type-guided channel shuffle perturbation and selective attention mask perturbation. It achieves state-of-the-art (SOTA) performance across five distinct tasks, including denoising, dehazing, deraining, deblurring, and low-light enhancement.
Detail-Preserving Latent Diffusion for Stable Shadow Removal: This paper proposes a two-stage Stable Diffusion fine-tuning scheme for shadow removal: In the first stage, the denoiser is fine-tuned in the latent space to perform primary shadow removal. In the second stage, a shadow-aware Detail Injection module extracts features from the VAE encoder to modulate the decoder, recovering the high-frequency details lost in the first stage and achieving high-quality and highly generalizable shadow removal.

Browse all 41 Image Restoration papers →

🛰️ Remote Sensing (11)¶

Dense Dispersed Structured Light for Hyperspectral 3D Imaging of Dynamic Scenes: This paper proposes the Dense Dispersed Structured Light (DDSL) method, which utilizes an inexpensive diffraction grating film (<$20), a stereo RGB camera, and an RGB projector. By designing spectrally multiplexed DDSL patterns, the required number of projection frames is significantly reduced, achieving real-time hyperspectral 3D imaging at 6.6 fps with a spectral resolution of 15.5 nm FWHM and a depth error of 4 mm.
DiSciPLE: Learning Interpretable Programs for Scientific Visual Discovery: The DiSciPLE framework is proposed to automatically synthesize interpretable Python programs for visual data analysis using an LLM-guided evolutionary algorithm. It achieves SOTA on scientific tasks such as population density estimation, reducing error by 35% compared to recent baselines while remaining fully interpretable.
EarthDial: Turning Multi-sensory Earth Observations to Interactive Dialogues: This work proposes EarthDial, a conversational vision-language model tailored for Earth Observation (EO) data. It supports the unified understanding of multispectral (SAR/NIR/infrared), multi-temporal, and multi-resolution remote sensing imagery. Trained on an 11.11 million instruction-tuning dataset, it outperforms existing remote sensing VLMs across 44 downstream datasets.
Hierarchical Dual-Change Collaborative Learning for UAV Scene Change Captioning: This paper proposes a new task named UAV Scene Change Captioning (UAV-SCC) and a novel HDC-CL framework. It models the overlapping and non-overlapping regions of image pairs under moving viewpoints using a Dynamic Adaptive Layout Transformer, enhances viewpoint shift direction awareness via hierarchical cross-modal directional consistency calibration, and constructs a dedicated benchmark dataset.
Joint and Streamwise Distributed MIMO Satellite Communications with Multi-Antenna Ground Users: This paper investigates distributed MIMO downlink communications where multiple LEO satellites jointly serve multi-antenna ground users. Two modes, namely joint transmission and streamwise transmission, are proposed. The former optimizes the precoder using WMMSE iterations to maximize the sum spectral efficiency, while the latter employs a Hungarian algorithm-based stream-satellite association to reduce the fronthaul overhead, achieving a flexible trade-off between performance and the fronthaul signaling load.
Learning Occlusion-Robust Vision Transformers for Real-Time UAV Tracking: The ORTrack framework is proposed to learn occlusion-robust ViT feature representations through random masking based on spatial Cox processes (imposing mask constraints during training and achieving zero overhead during inference). An adaptive feature distillation method is designed to compress large models into a lightweight student model ORTrack-D, achieving the best balance of state-of-the-art accuracy and real-time speed across several UAV tracking benchmarks.
Meta-Learning Hyperparameters for Parameter Efficient Fine-Tuning: MetaPEFT proposes a meta-learning framework that unifies discrete position selection and continuous scaling factors in PEFT into differentiable modulators. Through bi-level optimization, it automatically searches for the optimal PEFT hyperparameter configuration, achieving SOTA on remote sensing and natural image long-tailed distribution adaptation tasks.
MetaSpectra+: A Compact Broadband Metasurface Camera for Snapshot Hyperspectral+ Imaging: Proposes MetaSpectra+, a compact and multi-functional camera based on hybrid metasurface-refractive optics. By utilizing double-layer metasurfaces to independently control dispersion, exposure, and polarization for each channel, it achieves snapshot hyperspectral+HDR or hyperspectral+polarization joint imaging within a ~250 nm visible bandwidth, achieving SOTA reconstruction accuracy on benchmark datasets.
MFogHub: Bridging Multi-Regional and Multi-Satellite Data for Global Marine Fog Detection and Forecasting: MFogHub constructs the first multi-regional (15 coastal regions) and multi-satellite (6 geostationary satellites) global marine fog detection and forecasting dataset, containing over 68,000 high-resolution samples and 11,600+ pixel-level annotations. Extensive experiments on 16 baseline models reveal the influence of regional differences and satellite variations on model generalization.
SGFormer: Satellite-Ground Fusion for 3D Semantic Scene Completion: This work introduces satellite imagery to the 3D Semantic Scene Completion (SSC) task for the first time, proposing a dual-branch framework named SGFormer. By utilizing ground-view guided satellite feature correction and adaptive fusion strategies, it effectively addresses the scene incompleteness issue caused by visual occlusions.

Browse all 11 Remote Sensing papers →

🧑 Human Understanding (73)¶

3D Face Reconstruction From Radar Images: For the first time, 3D face reconstruction is achieved from millimeter-wave radar images: a synthetic dataset generated with a physical radar renderer is used to train a CNN encoder to estimate BFM parameters, and a model-based autoencoder is constructed by learning a differentiable radar renderer, achieving a mean vertex-to-vertex error of 2.56 mm on synthetic data while allowing unsupervised parameter optimization during inference.
3D Prior is All You Need: Cross-Task Few-shot 2D Gaze Estimation: This work proposes cross-task few-shot 2D gaze estimation, which leverages a pre-trained 3D gaze model as a prior. Through a physics-based differentiable projection module (with 6 learnable screen parameters), the 3D gaze direction is projected onto 2D screen coordinates. With only 10 annotated images, this approach adapts 2D gaze estimation to unseen devices, achieving over 25% improvement on MPIIGaze/EVE/GazeCapture compared to EFE and IVGaze.
Analyzing the Synthetic-to-Real Domain Gap in 3D Hand Pose Estimation: This paper presents the first systematic study of the synthetic-to-real domain gap in 3D hand pose estimation. By designing a controllable data synthesis pipeline, the authors decompose and analyze the impacts of four key factors: forearms, spectral statistics, pose distribution, and object occlusion. The study demonstrates that with proper integration of these factors, purely synthetic data can achieve accuracy on par with real data.
Any6D: Model-free 6D Pose Estimation of Novel Objects: This paper proposes the Any6D framework to estimate the 6D pose and size of novel objects from a single RGB-D anchor image. By combining InstantMesh 3D reconstruction, oriented bounding box coarse alignment, and joint size-pose refinement, Any6D achieves an ADD-S of 98.7% on HO3D, significantly outperforming GEDI's 71.9%.

ChatGarment: Garment Estimation, Generation and Editing via Large Language Models

Co-op: Correspondence-based Novel Object Pose Estimation: This paper proposes Co-op, a correspondence-based 6DoF pose estimation framework for novel objects. In the coarse estimation stage, a hybrid representation (patch-level classification + offset regression) is used to estimate the initial pose quickly and accurately with only 42 templates. In the refinement stage, probabilistic flow regression combined with differentiable PnP is utilized for end-to-end optimization, significantly outperforming existing methods on seven core datasets of the BOP Challenge.
ControlFace: Harnessing Facial Parametric Control for Face Rigging: Proposes ControlFace, which utilizes a dual-branch U-Net (FaceNet + denoising U-Net) combined with 3DMM rendering conditions to achieve flexible editing of facial pose, expression, and illumination without fine-tuning, while precisely preserving identity and semantic details.
CRISP: Object Pose and Shape Estimation with Test-Time Adaptation: Proposes CRISP, a category-agnostic object pose and shape estimation pipeline. The core innovations are an optimization-based corrector utilizing an active shape model and a correct-and-certify self-training strategy, which can adaptively bridge large domain gaps at test time.
CryptoFace: End-to-End Encrypted Face Recognition: This paper proposes CryptoFace, the first end-to-end Fully Homomorphic Encrypted (FHE) face recognition system. By utilizing a hybrid shallow patch CNN architecture (CryptoFaceNet), it significantly reduces the multiplicative depth, achieving encrypted inference that is 7 times faster than state-of-the-art (SOTA) FHE networks while improving verification accuracy.
D3-Human: Dynamic Disentangled Digital Human from Monocular Video: D3-Human proposes a method to reconstruct disentangled (garment + body) digital human geometry from a monocular video. By defining a homomorphic Signed Distance Field on the human manifold (hmSDF), it achieves accurate garment-body segmentation of visible regions without 3D garment priors, generating a disentangled template in approximately 20 minutes and supporting virtual try-on and animation applications.

Browse all 73 Human Understanding papers →

📹 Video Understanding (69)¶

Anomize: Better Open Vocabulary Video Anomaly Detection

BehaviorVLM: Unified Finetuning-Free Behavioral Understanding with Vision-Language Reasoning: This paper proposes BehaviorVLM, a unified finetuning-free vision-language framework that simultaneously addresses both animal pose estimation and physical behavior understanding via a multi-stage structured reasoning pipeline. It achieves reliable keypoint tracking using only 3 human-annotated seed frames, and enables interpretable multi-animal behavioral segmentation through deep embedded clustering, VLM-based segment description, and LLM semantic merging.
Beyond Single-Sample: Reliable Multi-Sample Distillation for Video Understanding: Proposes R-MSD (Reliable Multi-Sample Distillation), which addresses the issue of unreliable single-sample teacher supervision in black-box distillation of video LVLMs by sampling multiple teacher responses for each input and incorporating task-adaptive quality matching. The 4B student model consistently improves performance on benchmarks such as VideoMME (+1.5%), Video-MMMU (+3.2%), and MathVerse (+3.6%).

BiM-VFI: Bidirectional Motion Field-Guided Frame Interpolation for Video with Non-uniform Motions

BIMBA: Selective-Scan Compression for Long-Range Video Question Answering: This paper proposes BIMBA, a spatiotemporal token selector based on Mamba selective scan. It compresses long video sequences of over 100K tokens by 16 times down to 6,400 key tokens, achieving state-of-the-art (SOTA) performance across 7 long-video VQA benchmarks.
Bootstrap Your Own Views: Masked Ego-Exo Modeling for Fine-Grained View-Invariant Video Representations: Learns fine-grained view-invariant representations between egocentric and exocentric perspectives via masked modeling, enabling self-supervised learning from the association of the two views without paired annotations.
Context-Enhanced Memory-Refined Transformer for Online Action Detection: This paper reveals the training-inference inconsistency problem in existing online action detection (OAD) methods—where unbalanced context exposure of short-term memory frames and non-causal information leakage introduced by pseudo-futures bias learning toward intermediate frames—and proposes CMeRT to address this issue through a near-past context-enhanced encoder and a near-future-based memory refinement decoder, achieving state-of-the-art performance on THUMOS'14, CrossTask, and EK100.
Cross-modal Causal Relation Alignment for Video Question Grounding: Eliminates spurious cross-modal associations in Video Question Grounding (VideoQG) via causal intervention. It introduces three modules—Gaussian smoothing grounding, cross-modal alignment, and explicit causal intervention—simultaneously improving grounding (+2.2 Acc@GQA) and question answering (+0.9 Acc@VQA) performance on NextGQA.
DeCafNet: Delegate and Conquer for Efficient Temporal Grounding in Long Videos: DeCafNet is proposed, which outperforms all prior methods on long video temporal grounding tasks with a 47% reduction in TFLOPs, utilizing a delegate-and-conquer dual-encoder strategy (where a lightweight sidekick encoder extracts dense features and generates saliency maps, while an expert encoder only processes the top-c% key clips) combined with DeCaf-Grounder to unify features across different temporal resolutions.
DivPrune: Diversity-Based Visual Token Pruning for Large Multimodal Models: Reformulates the visual token pruning problem as the Max-Min Diversity Problem (MMDP). By solving it precisely to maximize the minimum pair-wise distance within the retained token set, a training-free and calibration-free plug-and-play pruning scheme is achieved. It yields SOTA performance on 16 multimodal benchmarks, significantly outperforming all baselines particularly under extreme pruning rates of $\ge 80\%$.
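
The Max-Min Diversity formulation in the DivPrune entry above can be sketched with a greedy farthest-point heuristic. This is a common approximation for max-min diversity and is not necessarily the solver used in the paper. The function name, the Euclidean distance, and the choice of the first token as seed are assumptions made for illustration.

```python
import math
from typing import List, Sequence


def greedy_max_min_select(tokens: Sequence[Sequence[float]], keep: int) -> List[int]:
    """Greedily pick `keep` token indices so that each new pick is the token
    farthest from everything already selected (farthest-point heuristic)."""
    if keep <= 0 or not tokens:
        return []
    keep = min(keep, len(tokens))

    selected = [0]  # assumption: seed with the first token
    # min_dist[i] = distance from token i to its nearest selected token
    min_dist = [math.dist(t, tokens[0]) for t in tokens]
    min_dist[0] = -1.0  # mark selected tokens so they are never re-picked

    while len(selected) < keep:
        nxt = max(range(len(tokens)), key=lambda i: min_dist[i])
        selected.append(nxt)
        for i, t in enumerate(tokens):
            min_dist[i] = min(min_dist[i], math.dist(t, tokens[nxt]))
        min_dist[nxt] = -1.0
    return selected


# Usage: three near-duplicate tokens and two outliers; keeping 3 drops the duplicates.
example_tokens = [[0.0, 0.0], [0.1, 0.0], [10.0, 0.0], [0.0, 10.0], [0.05, 0.05]]
kept = greedy_max_min_select(example_tokens, keep=3)
```

Because selection depends only on pairwise distances between token embeddings, a scheme of this kind needs no training or calibration data, which matches the plug-and-play property the entry describes.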

Browse all 69 Video Understanding papers →

🚗 Autonomous Driving (89)¶

3D-AVS: LiDAR-based 3D Auto-Vocabulary Segmentation: This paper proposes 3D-AVS, the first auto-vocabulary segmentation method specifically tailored for LiDAR point clouds. Without requiring users to specify target categories, the system automatically identifies semantic entities in the scene from both images and point clouds to generate a vocabulary, and then performs point-wise semantic segmentation with an open-vocabulary segmenter. It demonstrates the capability to generate fine-grained semantic categories on nuScenes and ScanNet200.
ProtoOcc: 3D Occupancy Prediction with Low-Resolution Queries via Prototype-aware View Transformation: This paper proposes ProtoOcc, which enhances the contextual information of low-resolution voxels by mapping 2D image clustering prototypes into the 3D voxel query space via prototype-aware view transformation. Together with a multi-perspective occupancy decoding strategy, it reconstructs high-resolution 3D occupancy scenes from the enhanced voxels. It achieves competitive performance compared to high-resolution methods (Occ3D mIoU 37.80 vs. PanoOcc 38.11) while using a 75% smaller voxel resolution.
A Dataset for Semantic Segmentation in the Presence of Unknowns: This paper proposes the ISSU anomaly segmentation dataset, which represents the first benchmark to simultaneously support the joint evaluation of known classes (closed-set) and unknown anomalies (open-set). It is twice the size of existing anomaly segmentation datasets, covers multiple domains, sensors, and lighting conditions, and its benchmarks reveal significant deficiencies in state-of-the-art (SOTA) methods regarding domain generalization and the segmentation of large/small objects.
A Neuro-Symbolic Framework Combining Inductive and Deductive Reasoning for Autonomous Driving Planning: This paper proposes the first neuro-symbolic framework that directly embeds ASP symbolic reasoning decisions as learnable embeddings into the trajectory decoding of an end-to-end planner. It dynamically extracts scene rules using LLMs, performs logical arbitration via the Clingo solver, generates physically feasible trajectories via a differentiable KBM, and refines them with neural residuals. On nuScenes, it comprehensively outperforms MomAD with an L₂ error of 0.57m, a collision rate of 0.075%, and a TPC of 0.47m.
PAP: A Prediction-as-Perception Framework for 3D Object Detection: Inspired by the brain's "predictive perception," PAP uses the trajectory prediction results of the previous frame as query inputs for the perception module of the current frame to replace some random queries. This achieves a 10% improvement in AMOTA (0.359 to 0.395), a 15% increase in inference speed (14 to 16 FPS), and a 14% reduction in training time on UniAD.
CAWM-Mamba: A Unified Model for Infrared-Visible Image Fusion and Compound Adverse Weather Restoration: CAWM-Mamba is the first end-to-end unified framework that simultaneously addresses infrared-visible image fusion and compound adverse weather restoration (e.g., fog + rain, rain + snow). Using weather-aware preprocessing, cross-modal feature interaction, and a wavelet-domain frequency SSM that decouples multi-frequency degradations, it comprehensively outperforms SOTA models on the AWMM-100K and standard fusion datasets.
Certified Human Trajectory Prediction: This work introduces randomized smoothing certification to human trajectory prediction for the first time. By leveraging mean/median aggregation functions and a diffusion denoiser, it provides certified robustness for trajectory prediction models, ensuring that the output remains within a certified bound under any input perturbation within a radius $R$.
ClimbingCap: Multi-Modal Dataset and Method for Rock Climbing in World Coordinate: This work constructs the first large-scale multi-modal rock climbing motion dataset, AscendMotion (412K frames, RGB+LiDAR+IMU), and proposes ClimbingCap, a method that accurately recovers the 3D motions of climbers in the world coordinate system through separate coordinate decoding, post-processing optimization, and semi-supervised training.

Closed-Loop Supervised Fine-Tuning of Tokenized Traffic Models

CompoSIA: Composing Driving Worlds through Disentangled Control for Adversarial Scenario Generation: CompoSIA proposes a compositional driving video generation framework based on Flow Matching DiT. By disentangling the injection of three types of control signals—structure (3D bboxes), identity (a single reference image), and ego-motion (camera trajectories)—it achieves fine-grained independent control and compositional editing for systematically synthesizing adversarial driving scenarios, resulting in a 17% improvement in FVD and a 173% increase in collision rate.

Browse all 89 Autonomous Driving papers →

🤖 Robotics & Embodied AI (40)¶

3D-MVP: 3D Multiview Pretraining for Robotic Manipulation: This paper proposes 3D-MVP, which extends Masked Autoencoder pretraining from 2D to a 3D multiview setting. By pretraining the multiview Transformer encoder of RVT on 200K 3D objects from Objaverse, downstream fine-tuning improves the average success rate on RLBench from 62.9% to 67.5% and significantly enhances robustness against environmental variations (such as texture, size, and lighting) on COLOSSEUM.
A Data-Centric Revisit of Pre-Trained Vision Models for Robot Learning: Through systematic evaluation, it is found that DINO/iBOT outperforms MAE in robot tasks but suffers performance degradation on non-object-centric (NOC) data due to the loss of object-centric representation capabilities. This paper proposes SlotMIM, which uses a semantic bottleneck (reducing prototype numbers to encourage the emergence of objectness), cross-view consistency regularization, and slot-level contrastive learning. This enables the model to learn object-centric representations from NOC data; using only 241K samples, it outperforms MVP/VC-1, which are pre-trained on >1M samples.
CityWalker: Learning Embodied Urban Navigation from Web-Scale Videos: Utilizing over 2,000 hours of city walking and driving videos from the internet, action labels are automatically extracted via Visual Odometry (VO) for large-scale imitation learning. This trains embodied agents capable of navigating complex, dynamic urban environments, achieving a 77.3% success rate in real-world deployment, significantly outperforming existing methods.
Coordinated Manipulation of Hybrid Deformable-Rigid Objects in Constrained Environments: This paper proposes a quasi-static trajectory optimization framework based on the Geometric Variable Strain (GVS) parameterized Cosserat rod model for dual-arm coordinated manipulation of hybrid deformable-rigid linear objects (hDLO) in constrained environments. By leveraging analytical gradients, the solver achieves a 33x speedup over finite differences, and a ~3 cm deformation error is validated on a real dual-arm platform.
CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models: This paper proposes CoT-VLA, which introduces visual Chain-of-Thought (CoT) reasoning into Vision-Language-Action (VLA) models. By utilizing a two-stage reasoning process—first predicting a subgoal image, then generating an action sequence—combined with hybrid attention and action chunking strategies, it achieves an 81.13% average success rate on the LIBERO benchmark, significantly outperforming existing methods.
Decision SpikeFormer: Spike-Driven Transformer for Decision Making: This work proposes DSFormer, the first spike-driven Transformer for offline reinforcement learning. It designs Temporal Spike Self-Attention (TSSA) and Position Spike Self-Attention (PSSA) to capture temporal/positional dependencies in RL, and introduces Progressive Threshold-dependent Batch Normalization (PTBN) to resolve the conflict between normalization and spiking properties. DSFormer outperforms ANN counterparts on the D4RL benchmark while saving 78.4% of energy consumption.
DexGrasp Anything: Towards Universal Robotic Dexterous Grasping with Physics Awareness: This paper proposes DexGrasp Anything, which integrates three physical constraint forces into the training and sampling phases of diffusion models to achieve SOTA dexterous grasp pose generation on almost all open datasets. Additionally, it constructs the largest-scale dexterous grasping dataset containing over 15K objects and more than 3.4 million grasping poses.
DRAWER: Digital Reconstruction and Articulation with Environment Realism: The DRAWER framework automatically constructs interactive digital twins from static scene videos. By combining a dual scene representation of SDF and Gaussian Splatting, it achieves high-fidelity rendering and precise geometry. It supports articulation identification and simulation, Unreal Engine game creation, and real-to-sim-to-real robotic policy transfer.
g3D-LF: Generalizable 3D-Language Feature Fields for Embodied Tasks: This paper proposes g3D-LF, which constructs generalizable 3D-language feature fields for unseen environments by performing multi-level contrastive learning pre-training on approximately 5,000 indoor 3D scenes and nearly 1 million language descriptions. It achieves state-of-the-art (SOTA) or near-SOTA performance across four embodied tasks: VLN (monocular/panoramic), zero-shot object navigation, and situated question answering.
GigaHands: A Massive Annotated Dataset of Bimanual Hand Activities: GigaHands is the largest bimanual activity dataset to date. By designing an "Instruct-to-Annotate" procedural acquisition strategy and a 51-camera markerless capture system, it collects 34 hours of bimanual activities from 56 subjects interacting with 417 objects. It contains 183 million RGB image frames and 84K detailed text annotations, demonstrating the value of data scale in text-driven hand motion generation and motion captioning tasks.

Browse all 40 Robotics & Embodied AI papers →

🎮 Reinforcement Learning (4)

CALF: Communication-Aware Learning Framework for Distributed Reinforcement Learning: This paper proposes the CALF framework, which injects configurable network delay, jitter, and packet loss models into RL training. Policies trained this way degrade roughly 3-4 times less when deployed on real distributed edge devices, showing that network conditions are an important but overlooked dimension of the sim-to-real gap.
Gazing at Rewards: Eye Movements as a Lens into Human and AI Decision-Making in Hybrid Visual Foraging: This work proposes the Visual Forager (VF) model, which simulates human eye-movement strategies in hybrid visual search tasks through target feature modulation, target value modulation, and a ViT-based Actor-Critic decision-making network. It achieves a normalized score of 72.6% (compared to 87.4% for humans), with a saccade amplitude difference of only 0.01° (4.06° vs. 4.05° for humans), revealing for the first time how target value and prevalence jointly influence human search decisions.
GROVE: A Generalized Reward for Learning Open-Vocabulary Physical Skill: This paper proposes the GROVE framework, which constructs a generalized reward function by leveraging LLMs to generate physical constraints and VLMs to evaluate motion semantics in a complementary manner. By using a lightweight Pose2CLIP mapper to skip rendering and project poses directly into the semantic space, GROVE achieves open-vocabulary physical skill learning, yielding 8.4x faster training speed and a 22.2% improvement in motion naturalness compared to existing methods.
SkillMimic: Learning Basketball Interaction Skills from Demonstrations: SkillMimic is a purely data-driven framework that learns diverse basketball interaction skills from motion capture data using a unified HOI imitation reward (notably a novel contact graph reward). A high-level controller then composes these skills to complete complex long-horizon tasks such as continuous scoring.

🎁 Recommender Systems (1)

FineVQ: Fine-Grained User Generated Content Video Quality Assessment: This work constructs the first large-scale, fine-grained UGC video quality assessment database, FineVD (6,104 videos, 800k+ ratings, 6 dimensions), and proposes FineVQ, an LMM-based approach. FineVQ enables a single model to simultaneously perform quality rating, scoring, and attribution, achieving state-of-the-art performance on FineVD and other UGC-VQA datasets.

🔄 Self-Supervised Learning (26)

AutoSSVH: Automated Frame Sampling for Self-Supervised Video Hashing

BoSS: A Best-of-Strategies Selector as an Oracle for Deep Active Learning: Proposes BoSS—a scalable active learning oracle strategy that generates candidate batches by ensembling multiple selection strategies, evaluates performance gain by freezing the backbone and retraining only the last layer, and selects the optimal batch. It showcases oracle performance on large-scale datasets like ImageNet for the first time, revealing that SOTA active learning strategies still have significant room for improvement.
Breaking the Tuning Barrier: Zero-Hyperparameters Yield Multi-Corner Analysis Via Learned Priors: Legacy hand-tuned priors are replaced with a pretrained Foundation Model (TabPFN) to achieve zero-hyperparameter tuning for circuit yield multi-corner analysis. By freezing the backbone to perform in-context learning, automatically transferring knowledge across corners, and integrating automatic feature selection (1152D to 48D), this method achieves SOTA accuracy (MRE down to 0.11%) on SRAM benchmarks while reducing verification costs by over 10x.

CheXWorld: Image World Modeling for Radiograph Representation Learning

Do Your Best and Get Enough Rest for Continual Learning: Inspired by Ebbinghaus's forgetting curve theory, this paper proposes the View-Batch Model (VBM). By replacing multiple distinct samples in a batch with multiple augmented views (replay) of the same sample, VBM extends the recall interval by a factor of $V$ to an optimal range. Concurrently, it employs a one-to-many KL-divergence self-supervised loss to extract more knowledge from a single sample ("do your best"). Serving as a drop-in replacement, VBM consistently improves performance across various continual learning methods.

Escaping Plato's Cave: Towards the Alignment of 3D and Text Latent Spaces

Few-Shot Implicit Function Generation via Equivariance: Generates implicit functions (NeRF/SDF) from few-shot samples using equivariance constraints, leveraging symmetry priors to reduce data requirements.
From Prototypes to General Distributions: An Efficient Curriculum for Masked Image Modeling: Proposes a prototype-driven curriculum for MAE, which uses K-means clustering to identify "prototype" samples (representative images close to cluster centroids) in the dataset. A temperature-controlled sampling strategy smoothly shifts training from prototypes to the full distribution, achieving up to $8\times$ training acceleration (a 200-epoch prototype curriculum performs comparably to an 800-epoch standard MAE).
Hyperbolic Category Discovery: This work proposes the HypCD framework, shifting representation learning in Generalized Category Discovery (GCD) from Euclidean/spherical spaces to hyperbolic space (the Poincaré ball model). Because volume in hyperbolic space grows exponentially, it is naturally suited to encoding hierarchical structures; building on this property, the work proposes hybrid distance-angle similarity learning and a hyperbolic classifier. It raises SelEx's accuracy on CUB from 69.1% to 71.8%, and accuracy on ImageNet-100 from 87.1% to 88.3%.
Learning to Normalize on the SPD Manifold under Bures-Wasserstein Geometry: This paper proposes GBWBN, the first batch normalization method for the SPD manifold based on generalized Bures-Wasserstein geometry. By introducing learnable metric parameters and matrix power non-linear transformations to effectively handle ill-conditioned covariance matrices, it achieves SOTA performance on skeleton-based action recognition and EEG classification.

Browse all 26 Self-Supervised Learning papers →
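
The temperature-controlled transition from prototypes to the full distribution, described in the masked image modeling curriculum entry above, can be illustrated with a small sketch. It assumes each sample has a precomputed distance to its nearest K-means centroid and turns those distances into sampling weights with a softmax. The function name, the softmax form, and the temperature values are illustrative assumptions, not the paper's exact schedule.

```python
import math

def curriculum_weights(centroid_distances, temperature):
    """Sampling probabilities that favor samples close to a cluster centroid.

    Hypothetical form: softmax(-distance / temperature). A low temperature
    concentrates mass on prototypes; a high temperature approaches uniform
    sampling over the whole dataset.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    logits = [-d / temperature for d in centroid_distances]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy distances: the first sample is a prototype, the last is far from any centroid.
distances = [0.1, 1.0, 3.0]
early = curriculum_weights(distances, temperature=0.1)    # start of training
late = curriculum_weights(distances, temperature=100.0)   # end of training
```

Raising the temperature over the course of training moves the sampler from `early` (almost all mass on the prototype) toward `late` (nearly uniform).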

📐 Optimization & Theory (11)

Automatic Joint Structured Pruning and Quantization for Efficient Neural Network Training and Compression: Proposes the GETA framework for automatic joint structured pruning and quantization-aware training. A Quantization-Aware Dependency Graph (QADG) constructs a generic pruning search space, partially projected SGD guarantees layer-wise bit-width constraints, and an interpretable joint learning strategy ties the two together, achieving competitive or state-of-the-art compression performance on both CNNs and Transformers.
Conformal Prediction for Zero-Shot Models: Applies conformal prediction to zero-shot models such as CLIP, providing theoretically guaranteed uncertainty quantification and calibrated prediction sets.
Convex Relaxation for Robust Vanishing Point Estimation in Manhattan World: GlobustVP introduces convex relaxation techniques to the Manhattan World vanishing point estimation problem for the first time. By formulating the joint estimation of vanishing point locations and line-to-VP associations as a QCQP and relaxing it into an SDP, it achieves a globally optimal and highly efficient solver (~50ms/image) robust to up to 70% outliers.
Federated Learning with Domain Shift Eraser: This paper proposes the FDSE method, which decomposes each network layer into a domain-free feature extractor (DFE, globally aggregated to enhance consensus) and a domain-specific shift eraser (DSE, personalized aggregated to retain local characteristics). Combined with BN consistency regularization, it achieves 76.77% on DomainNet (outperforming Ditto by 1.6%) and 91.58% on Office-Caltech10 (outperforming FedBN by 4.6%).
How to Merge Your Multimodal Models Over Time?: This paper proposes the TIME (Temporal Integration of Model Expertise) framework to systematically study the progressive merging of multimodal expert models over time. By defining a search space across three axes (initialization strategy, deployment strategy, and merging technique), the work uncovers key design principles for temporal model merging on the FoMo-in-Flux benchmark.
Mind the Gap: Confidence Discrepancy Can Guide Federated Semi-Supervised Learning: This paper proposes TABASCO, a two-stage two-dimensional sample selection framework to address federated semi-supervised learning under joint label noise and long-tailed distributions. It utilizes two complementary metrics, Weighted JSD (WJSD) and Adaptive Centroid Distance (ACD), to identify clean samples. After GMM clustering, the remaining noisy data is leveraged in a semi-supervised manner, achieving 85.53% accuracy on CIFAR-10 (0.1 imbalance + 0.4 noise).
Model Poisoning Attacks to Federated Learning via Multi-Round Consistency: This work identifies that existing model poisoning attacks in federated learning cancel each other out due to cross-round directional inconsistency. It proposes PoisonedFL, which achieves a multi-round consistent attack through a fixed random direction vector, dynamic magnitude adjustment, and a hypothesis testing mechanism, bypassing 8 SOTA defenses without requiring any real client information.
SCOPE: Semantic Coreset with Orthogonal Projection Embeddings for Federated Learning: SCOPE proposes a semantic coreset selection framework for federated learning. Each client uses a zero-shot VLM (MobileCLIP-S2) to extract three scalar metrics (representation score, diversity score, and margin proximity), and the server aggregates them into a global consensus that guides a two-stage pruning process (anomaly filtering + redundancy elimination) on clients. This achieves a 128-512× uplink bandwidth reduction and 7.72× speedup while maintaining competitive accuracy.
Stop Walking in Circles! Bailing Out Early in Projected Gradient Descent: This paper finds that the PGD attack exhibits cyclic behavior on the $L_\infty$ ball for robust samples. Detecting these cycles via hashing (PGD_CD) enables early stopping, reducing iterations by up to 96% while producing identical robustness evaluation results.
Test-Time Augmentation Improves Efficiency in Conformal Prediction: This paper finds that test-time augmentation (TTA) can systematically improve the efficiency of conformal prediction. By learning augmentation weights on a calibration set to optimize how augmented predictions are aggregated, it reduces the prediction set size by 10-17% on ImageNet with ResNet-50 while strictly preserving the coverage guarantee.

Browse all 11 Optimization & Theory papers →
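
Two of the entries above build on conformal prediction sets. As background, here is a minimal sketch of standard split conformal prediction for classification, using one minus the true-class probability as the nonconformity score. It is the textbook procedure, not the specific method of either paper; the function names and toy numbers are illustrative assumptions.

```python
import math

def conformal_threshold(cal_probs, cal_labels, alpha):
    """Score threshold q_hat computed from a held-out calibration set.

    cal_probs: per-example class-probability lists; cal_labels: true class ids.
    alpha: target miscoverage rate (e.g. 0.1 for 90% coverage).
    """
    # Nonconformity score: one minus the probability assigned to the true class.
    scores = sorted(1.0 - probs[label] for probs, label in zip(cal_probs, cal_labels))
    n = len(scores)
    # Finite-sample corrected rank: ceil((n + 1) * (1 - alpha)).
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return float("inf")  # too few calibration points: every class is kept
    return scores[rank - 1]

def prediction_set(probs, q_hat):
    """All classes whose nonconformity score does not exceed the threshold."""
    return [k for k, p in enumerate(probs) if 1.0 - p <= q_hat]

# Toy calibration data: true-class probabilities are 0.9, 0.8, 0.6, 0.3.
cal_probs = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.7, 0.3]]
cal_labels = [0, 1, 0, 1]
q_hat = conformal_threshold(cal_probs, cal_labels, alpha=0.5)
confident_set = prediction_set([0.7, 0.25, 0.05], q_hat)
```

Smaller prediction sets at the same coverage level are what "efficiency" refers to in the test-time augmentation entry.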

🔬 Interpretability (21)

Attribute-formed Class-specific Concept Space: Endowing Language Bottleneck Model with Better Interpretability and Scalability: This paper proposes ALBM (Attribute-formed Language Bottleneck Model), which replaces the class-shared concept space of existing Language Bottleneck Models (LBMs) with an Attribute-formed Class-specific Concept Space (ACCS) to avoid spurious cue reasoning and support cross-class generalization. It extracts fine-grained attribute features using Visual Attribute Prompt Learning (VAPL) and automatically generates high-quality concept sets through a Description-Summary-Supplement (DSS) strategy, outperforming existing interpretable classification methods on 9 few-shot benchmarks.
Differentiable Inverse Rendering with Interpretable Basis BRDFs: Proposes a differentiable inverse rendering method based on interpretable basis BRDFs, decomposing materials into combinations of physically meaningful basis functions to achieve interpretable material estimation.
Geometry-Guided Camera Motion Understanding in VideoLLMs: Proposes a complete framework spanning benchmark construction, diagnosis, and injection. By extracting camera motion cues from a 3D foundation model (VGGT) and injecting them into the VideoLLM via structured prompting, training-free camera motion perception enhancement is achieved.
Interpretable Image Classification via Non-parametric Part Prototype Learning: This paper proposes an interpretable image classification framework based on non-parametric prototype learning. It discovers semantically distinct object part prototypes by performing optimal transport clustering on self-supervised ViT features, addressing prototype redundancy issues in existing ProtoPNet methods, while introducing two new metrics, Distinctiveness and Comprehensiveness, to quantify explanation quality.
KVQ: Boosting Video Quality Assessment via Saliency-Guided Local Perception: Inspired by the human visual system, KVQ explicitly decouples global video quality into two factors: visual saliency and local texture. It extracts cross-region saliency via Fusion-Window Attention and enhances texture perception in independent regions using a Local Perception Constraint, significantly outperforming SOTA methods on five VQA benchmarks.
L-SWAG: Layer-Sample Wise Activation with Gradients Information for Zero-Shot NAS on Vision Transformers: This paper proposes L-SWAG (Layer-Sample Wise Activation with Gradients), a new general zero-cost proxy that evaluates network architecture quality by combining layer- and sample-wise activation and gradient information. It is the first to systematically extend zero-cost NAS to the Vision Transformer search space and establishes a new benchmark across 6 tasks in the Autoformer search space.
Language Guided Concept Bottleneck Models for Interpretable Continual Learning: This paper introduces language-guided Concept Bottleneck Models (CBMs) into continual learning. It uses ChatGPT to generate human-interpretable concepts and the CLIP text encoder to encode concept embeddings, constructing a concept bottleneck layer. This provides transparent decision explanations while mitigating catastrophic forgetting, outperforming the SOTA on ImageNet-subset by 3.06%.
Learning on Model Weights using Tree Experts: Discovers that most public models belong to a few Model Trees (fine-tuned from common ancestors), and learning weights within the same Tree is much simpler than across Trees. This paper proposes ProbeX, the first lightweight probing method targeting single hidden layer weights. Through Tucker tensor decomposition, it achieves a 30x reduction in parameter size and realizes the first zero-shot model classification (89.8% accuracy) by aligning model weights with text representations.
Learning Visual Composition through Improved Semantic Guidance: This paper proposes to significantly enhance the visual compositional understanding of standard CLIP models by improving the semantic supervision signals of training data (regenerating high-quality captions using foundation models and replacing training-from-scratch with a pre-trained text encoder). This improves performance on the ARO benchmark from CLIP's 59%/63% to 92%/94%, and on DOCCI image retrieval from 58.4% to 94.5% recall@1, without requiring any architectural modifications.

Browse all 21 Interpretability papers →

📦 Model Compression (66)

Adapter Merging with Centroid Prototype Mapping for Scalable Class-Incremental Learning: This paper proposes the ACMap framework, which incrementally averages and merges independently trained task adapters into a single adapter (maintaining $O(1)$ inference complexity). Combined with centroid prototype mapping to align the representation of old task prototypes in the new subspace, it achieves comparable accuracy to the SOTA method EASE on five benchmarks while being 39 times faster in inference.
Alternating Gradient Flow Utility: A Unified Metric for Structural Pruning and Dynamic Routing in Deep Networks: A unified utility metric based on Alternating Gradient Flow (AGF) is proposed, utilizing feature-space total variation as a structural pruning metric. Combined with confidence-based cascade routing, this decouples offline topology construction from online dynamic inference. It avoids structural collapse caused by traditional metrics under extreme compression on ImageNet-1K, and matches the accuracy of the full model at 0.92x computational cost in dynamic inference on ImageNet-100.
An FPGA Implementation of Displacement Vector Search for Intra Pattern Copy in JPEG XS: This paper proposes the first FPGA architecture implementation for the displacement vector (DV) search module in JPEG XS Intra Pattern Copy (IPC). Utilizing a four-stage pipelined design and optimized memory organization, it achieves a throughput of 38.3 Mpixels/s and a power consumption of 277 mW on Xilinx Artix-7, laying the foundation for practical hardware deployment and ASIC transition of IPC.
ARCHE: Autoregressive Residual Compression with Hyperprior and Excitation: ARCHE is an end-to-end learned image compression framework that integrates a hierarchical hyperprior, masked spatial autoregressive context, channel conditioning, and SE-excited channel recalibration into a unified probabilistic architecture. Without requiring Transformers or recurrent components, ARCHE reduces BD-Rate on Kodak by approximately 48% compared to the Ballé baseline and by about 5.6% compared to VVC Intra, with only 95M parameters and a 222ms decoding time.
AutoSSVH: Exploring Automated Frame Sampling for Efficient Self-Supervised Video Hashing: This paper proposes AutoSSVH, which selects the most challenging subset of frames as training signals through an adversarial automated frame sampling network (Grade-Net) and designs a Point-to-Set (P2Set) contrastive learning paradigm for hashing. It achieves efficient self-supervised video hashing retrieval and significantly outperforms existing methods on UCF101 and HMDB51.
BHViT: Binarized Hybrid Vision Transformer: To address the severe performance degradation in binarized ViTs, this paper proposes BHViT, a hybrid ViT architecture specifically designed for binarization. It features a multi-scale grouped dilated convolutional token mixer, quantization-decomposed attention matrix binarization, a shift-augmented MLP, and a regularization loss, achieving state-of-the-art performance for 1-bit binarized models on ImageNet-1K.
Binarized Mamba-Transformer for Lightweight Quad Bayer HybridEVS Demosaicing: Proposes BMTNet, a lightweight hybrid architecture combining binarized Mamba and Swin Transformer for demosaicing RAW images from Quad Bayer HybridEVS sensors. By preserving the full precision of the core Selective Scan and incorporating global visual information to compensate for accuracy loss, it significantly reduces computational complexity while maintaining high-quality demosaicing.

Chapter-Llama: Efficient Chaptering in Hour-Long Videos with LLMs

Charm: The Missing Piece in ViT Fine-Tuning for Image Aesthetic Assessment

CL-LoRA: Continual Low-Rank Adaptation for Rehearsal-Free Class-Incremental Learning: This paper proposes CL-LoRA, which designs a dual-adapter architecture (task-shared + task-specific LoRA). By combining knowledge distillation, gradient reassignment, and learnable block-wise weights, CL-LoRA achieves SOTA continual learning performance with only 0.3% trainable parameters.

Browse all 66 Model Compression papers →
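
The ACMap entry above describes incrementally averaging task adapters into a single adapter. A running mean is one simple way to do this while storing only one set of adapter weights. The sketch below assumes adapters are flat lists of floats and uses a hypothetical helper name; it illustrates the averaging idea only and omits the centroid prototype mapping step.

```python
def merge_adapter(merged, new_adapter, num_merged):
    """Fold a newly trained adapter into the running average.

    merged: element-wise mean of the first `num_merged` adapters.
    new_adapter: weights of the adapter trained on the latest task.
    Returns the mean of all `num_merged + 1` adapters.
    """
    count = num_merged + 1
    return [m + (w - m) / count for m, w in zip(merged, new_adapter)]

# Toy adapters for three sequential tasks (two weights each).
adapters = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
merged = adapters[0]
for seen, adapter in enumerate(adapters[1:], start=1):
    merged = merge_adapter(merged, adapter, seen)
```

Because only `merged` is kept, inference cost does not grow with the number of tasks, which matches the $O(1)$ inference complexity noted in the entry.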

🕸️ Graph Learning (7)

Coeff-Tuning: A Graph Filter Subspace View for Tuning Attention-Based Large Models: This paper reinterprets multi-head attention as a graph convolutional filter subspace, and linearly combines pre-trained attention maps by learning an extremely small set of subspace combination coefficients ($H \times H$ matrices). This breaks the convex hull constraint caused by the softmax function to expand the feature space, improving the performance of various PEFT methods in a plug-and-play manner at near-zero parameter cost.
DVHGNN: Multi-Scale Dilated Vision HGNN for Efficient Vision Recognition: This paper proposes DVHGNN, a vision backbone network that utilizes multi-scale dilated hypergraphs to capture high-order correlations among image patches. By employing clustering and Dilated Hypergraph Construction (DHGC) to extract multi-scale hyperedges, alongside dynamic hypergraph convolution for adaptive feature exchange, DVHGNN achieves an 83.1% top-1 accuracy on ImageNet-1K with 30.2M parameters, outperforming ViG-S by 1.0% and ViHGNN-S by 0.6%.
Hypergraph Vision Transformers: Images are More than Nodes, More than Edges: Proposes HgVT, which embeds a hierarchical bipartite hypergraph structure into ViTs. By processing primary image patch vertices and virtual vertices separately, constructing dynamic cosine adjacency, and utilizing a three-layer attention mechanism based on a hyperedge communication pool, HgVT captures high-order semantic relations among patches without clustering. On ImageNet-1K, HgVT-Ti achieves 76.2% accuracy with 7.7M parameters (outperforming ViHGNN-Ti by 1.9%) and reaches 73.23% mAP@10 in image retrieval.
Knowledge Bridger: Towards Training-Free Missing Modality Completion: This paper proposes Knowledge Bridger, a training-free framework for missing modality completion. By leveraging Large Multimodal Models (LMMs) to automatically mine multimodal knowledge and construct a knowledge graph, it guides the generation and ranking of missing modalities, surpassing existing methods in both general and medical OOD scenarios.
NN-Former: Rethinking Graph Structure in Neural Architecture Representation: NN-Former proposes a hybrid GNN-Transformer architecture predictor, revealing that existing methods overlook the topological information of "sibling nodes" (nodes sharing parent/child nodes). By introducing Adjacency-Sibling Multihead Attention (ASMA) and Bidirectional Graph Isomorphism FFN (BGIFFN), it achieves a Kendall's Tau of $0.877$/$0.890$ on NAS-Bench-101/201 and reduces the MAPE of latency prediction by 48-64%.
Unbiased Video Scene Graph Generation via Visual and Semantic Dual Debiasing: This paper proposes the VISA framework, which debiases video scene graph generation from two perspectives. On the visual side, Memory-Guided Sequence Modeling (MGSM) reduces feature variance; on the semantic side, an Iterative Relation Generator (IRG) introduces hierarchical context and reduces dependence on biased priors. Together they significantly improve performance on tail categories on datasets like Action Genome.
Universal Scene Graph Generation: This paper proposes the Universal Scene Graph (USG) representation and its parser USG-Par, which generates a unified scene graph from arbitrary combinations of modalities (images, text, video, 3D) using a cross-modal object associator and text-centric scene contrastive learning, capturing both modality-invariant and modality-specific scene semantics.
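
The Coeff-Tuning entry above describes linearly combining pre-trained attention maps with an $H \times H$ coefficient matrix. The sketch below shows that combination on toy data, with maps stored as nested lists; the function name and the example coefficients are illustrative assumptions rather than the paper's implementation.

```python
def combine_attention_maps(attention_maps, coefficients):
    """Return H new maps, where map i is sum_j coefficients[i][j] * attention_maps[j].

    attention_maps: H matrices of identical shape (one per attention head).
    coefficients: H x H matrix; the identity leaves every head unchanged.
    """
    num_heads = len(attention_maps)
    rows = len(attention_maps[0])
    cols = len(attention_maps[0][0])
    combined = []
    for i in range(num_heads):
        new_map = [
            [
                sum(coefficients[i][j] * attention_maps[j][r][c] for j in range(num_heads))
                for c in range(cols)
            ]
            for r in range(rows)
        ]
        combined.append(new_map)
    return combined

# Two heads over two tokens; each row is a softmax output (non-negative, sums to 1).
heads = [
    [[1.0, 0.0], [0.5, 0.5]],
    [[0.0, 1.0], [0.5, 0.5]],
]
identity = [[1.0, 0.0], [0.0, 1.0]]
mixing = [[1.5, -0.5], [0.0, 1.0]]  # hypothetical learned coefficients

unchanged = combine_attention_maps(heads, identity)
mixed = combine_attention_maps(heads, mixing)
```

The first row of `mixed[0]` contains a weight above 1 and a negative weight, which no single softmax row can produce; this is the sense in which the combination leaves the convex hull.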

📈 Time Series (5)

Competition-Aware CPC Forecasting with Near-Market Coverage: This work reformulates paid search CPC forecasting as a "forecasting under partially observable competition" problem. It approximates unobservable competitive states using three types of competition proxies: semantic neighborhoods (via Transformer embeddings), behavioral neighborhoods (via DTW alignment), and geographic intent. Evaluations on Google Ads data covering 1,811 keywords over 127 weeks demonstrate that competition-aware enhancements significantly outperform univariate and weak-context baselines in medium- to long-term forecasting (6/12 weeks).
DejaVid: Encoder-Agnostic Learned Temporal Matching for Video Classification: This paper proposes DejaVid, an encoder-agnostic, lightweight approach for enhancing video classification. Instead of representing a video with a single embedding, DejaVid represents it as a variable-length Temporal Sequence of Embeddings (TSE). It learns importance weights for each time step and feature dimension and classifies via temporal alignment with an improved differentiable DTW algorithm, achieving SOTA results of 77.2% on SSV2 and 89.1% on K400 with a parameter increase of less than 1.8%.
FLAVC: Learned Video Compression with Feature Level Attention: This work proposes FLAVC, which introduces a Feature-level Attention (FLA) module into the learned video compression (LVC) framework. By converting high-level local patch embeddings into one-dimensional batch-wise vectors and replacing traditional attention weights with a global context matrix, FLA achieves full-frame-level global perception. Combined with a Dense Overlapping Patcher and a hybrid Transformer-CNN encoder, FLAVC achieves state-of-the-art rate-distortion performance across four video compression datasets.
L2GTX: From Local to Global Time Series Explanations: L2GTX proposes a completely model-agnostic global explanation method for time series classification. By aggregating Parameterized Event Primitives (PEPs) generated by LOMATCE, it constructs class-level global explanations, maintaining stable global fidelity ($R^2$) across six benchmark datasets.
Learning Extremely High Density Crowds as Active Matters: This paper models extremely high-density crowds ($\ge 5 \text{ people/m}^2$) as active matter, proposing a neural stochastic differential equation system that combines a novel "crowd material" stress model with Toner-Tu active forces. The system learns and predicts crowd dynamics directly from in-the-wild video optical flow using a hybrid Eulerian-Lagrangian CrowdMPM framework.
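
Two entries above rely on Dynamic Time Warping (DTW) to align sequences of different lengths. For reference, here is a minimal sketch of classic DTW on scalar sequences with an absolute-difference cost. It is the standard dynamic program, not the learned or differentiable variants those papers use, and the function name is an illustrative choice.

```python
def dtw_distance(a, b):
    """Classic DTW cost between two scalar sequences using |x - y| as local cost."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] is the best alignment cost of a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of: step in a only, step in b only, step in both.
            cost[i][j] = local + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

stretched = dtw_distance([1, 2, 3], [1, 2, 2, 3])  # same shape, different speed
shifted = dtw_distance([0, 0], [1, 1])             # constant offset of 1
```

A time-stretched copy of a sequence aligns at zero cost, while a constant offset is charged at every aligned step.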

🏥 Medical Imaging (78)

A Semi-Supervised Framework for Breast Ultrasound Segmentation with Training-Free Pseudo-Label Generation and Label Refinement: This paper proposes a semi-supervised breast ultrasound segmentation framework combining training-free pseudo-label generation from VLMs (Grounding DINO + SAM driven by appearance description prompts) and dual-teacher uncertainty fusion refinement, achieving performance close to fully supervised learning with only 2.5% of labeled data.
Accelerating Stroke MRI with Diffusion Probabilistic Models through Large-Scale Pre-training and Target-Specific Fine-Tuning: Drawing on the foundation model paradigm, a Diffusion Probabilistic Model (DPM) is pre-trained on large-scale public brain MRI data and then fine-tuned on data from only 20 stroke patients. This workflow enables accelerated MRI reconstruction in data-constrained scenarios. A clinical reader study confirms that the image quality with 2× acceleration is non-inferior to the standard-of-care.
Adaptation of Weakly Supervised Localization in Histopathology by Debiasing Predictions: Proposes the SFDA-DeP method. Inspired by machine unlearning, it identifies and corrects the prediction bias (over-predicting certain classes) of the source model in the target domain. This addresses the challenge of amplified prediction bias in weakly supervised localization models during cross-organ/cross-center domain adaptation in histopathology.
Addressing Data Scarcity in 3D Trauma Detection through Self-Supervised and Semi-Supervised Learning with Vertex Relative Position Encoding: A two-stage label-efficient learning framework is proposed: first, a 3D U-Net encoder is pre-trained via self-supervised Masked Image Modeling on 1,206 unlabeled CT scans; then, combined with VDETR + Vertex RPE and Mean Teacher semi-supervised learning, it achieves a 3D abdominal trauma detection mAP of 45.30% (+115%) using only 144 labeled cases.
Are General-Purpose Vision Models All We Need for 2D Medical Image Segmentation?: By establishing unified training and evaluation protocols, this study compares 11 specialized and general-purpose vision models across three heterogeneous medical datasets. The findings reveal that General-Purpose Vision Models (GP-VMs) can systematically outperform most Specialized Medical Segmentation Architectures (SMAs) in both segmentation accuracy and interpretability, challenging the traditional assumption that "medical image segmentation necessitates domain-specific architectures."
Association of Radiologic PPFE Change with Mortality in Lung Cancer Screening Cohorts: This study validates that the progression of PPFE (pleuroparenchymal fibroelastosis) automatically quantified by deep learning is independently associated with all-cause mortality across two large-scale lung cancer screening cohorts (NLST: 7,980 cases; SUMMIT: 8,561 cases). It proposes that longitudinal changes in PPFE can serve as an imaging biomarker to identify individuals at high risk for respiratory morbidity in screening populations.
Automated Detection of Malignant Lesions in the Ovary Using Deep Learning Models and XAI: Detects ovarian cancer and its subtypes on histopathology images using 15 CNN variants (LeNet, ResNet, VGG, Inception), selects InceptionV3 (ReLU) as the optimal model (average 94.58% accuracy), and interprets model predictions using three XAI methods: LIME, SHAP, and Integrated Gradients.
BiCLIP: Bidirectional and Consistent Language-Image Processing for Robust Medical Image Segmentation: BiCLIP proposes a bidirectional consistent vision-language segmentation framework. Through bidirectional multimodal fusion (BMF, letting visual features reversely refine text embeddings) and image augmentation consistency (IAC, regularization across weak/strong perturbations), it maintains robust performance on COVID-19 CT segmentation with only 1% of labeled data and shows tolerance to clinical image degradation (noise/blur).
Boltzmann Attention Sampling for Image Analysis with Small Objects: Proposes BoltzFormer, a novel transformer decoder architecture that dynamically samples sparse attention regions using a Boltzmann distribution to focus on small objects. Combining an annealing temperature schedule (exploration in early layers, exploitation in later layers) and the PiGMA multi-query aggregation module, it achieves a 3-12% improvement in Dice score compared to SOTA on small object segmentation (where objects occupy <0.1% of the image area), while reducing attention computation by an order of magnitude.
Bridging the Skill Gap in Clinical CBCT Interpretation with CBCTRepD: Proposes CBCTRepD, the first bilingual report generation system for oral and maxillofacial CBCT. By constructing a dataset of 7,408 high-quality CBCT-report pairs and establishing a multi-level clinical evaluation framework, it consistently improves report quality across radiologists of different experience levels, especially in reducing missed lesions and standardizing report structures.
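
As a rough illustration of the Boltzmann attention sampling idea in the BoltzFormer entry above, the sketch below draws a sparse set of positions from a softmax over relevance scores divided by a temperature, and lowers the temperature layer by layer. The function names, the geometric schedule, and the score values are assumptions made for illustration; they are not taken from the paper.

```python
import math
import random


def boltzmann_probs(scores, temperature):
    """Boltzmann (softmax) distribution over positions: p_i ~ exp(score_i / T)."""
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp((s - top) / temperature) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


def sample_positions(scores, temperature, k, rng):
    """Draw k positions with replacement, then keep the distinct ones (sparse attention set)."""
    probs = boltzmann_probs(scores, temperature)
    picks = rng.choices(range(len(scores)), weights=probs, k=k)
    return sorted(set(picks))


def annealed_temperatures(t_start, t_end, num_layers):
    """Hypothetical schedule: geometric decay from t_start (first layer) to t_end (last layer)."""
    if num_layers == 1:
        return [t_start]
    ratio = (t_end / t_start) ** (1.0 / (num_layers - 1))
    return [t_start * ratio ** i for i in range(num_layers)]
```

With a high temperature the distribution is close to uniform (exploration); with a low temperature the samples concentrate on the highest-scoring positions (exploitation).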

Browse all 78 Medical Imaging papers →

🧬 Computational Biology (7)¶

DiffVsgg: Diffusion-Driven Online Video Scene Graph Generation: DiffVsgg models Video Scene Graph Generation (VSGG) as an iterative denoising problem along the temporal axis. It unifies object classification, box regression, and relation prediction using a shared feature embedding. Using latent diffusion models for spatial reasoning and prior-frame predictions as conditioning for temporal reasoning, it achieves the first online VSGG and reaches SOTA performance across all three evaluation protocols on Action Genome, surpassing DSG-DETR by 3.3 points in R@10.
Multimodal Protein Language Models for Enzyme Kinetic Parameters: From Substrate Recognition to Conformational Adaptation: This paper proposes the ERBA adapter, which models enzyme kinetic prediction as a staged conditioning process of "substrate recognition $\rightarrow$ conformational adaptation". It injects substrate semantics via MRCA, fuses active-site 3D geometry via G-MoE, and preserves PLM priors via ESDA, consistently outperforming existing methods on three kinetic endpoints: kcat, Km, and Ki.
Semantic and Expressive Variation in Image Captions Across Languages: This work systematically demonstrates significant distributional differences in semantic content (objects, relations, attributes) and expressive style (concreteness, tone, authenticity) in image captions across different languages. Multilingual caption sets provide richer visual information compared to monolingual ones (+46% objects, +66.1% relations, +66.8% attributes), providing empirical support for training vision models on multilingual data.
SHREC: A Spectral Embedding-Based Approach for Ab-Initio Reconstruction of Helical Molecules: This paper proposes the SHREC algorithm, which leverages the spectral embedding of the graph Laplacian to directly recover the projection angles of helical molecules from 2D cryo-EM projection images. Without requiring prior knowledge of helical symmetry parameters (rise/twist) and only requiring the axial point group symmetry $C_n$, SHREC achieves near-atomic resolution ab-initio helical reconstruction on multiple public datasets.
Synthetic Visual Genome: Proposes the SVG (Synthetic Visual Genome) data engine. Through a two-stage pipeline consisting of completing missing relationships on top of existing human annotations via GPT-4 (Stage 1) and Robin self-distillation + GPT-4 editing (Stage 2/SG-Edit), it generates a dense scene graph dataset with 146K images, 2.6M objects, and 5.6M relationships. Trained on fewer than 3M instances, the resulting Robin-3B model outperforms same-sized models trained on over 300M instances and achieves a state-of-the-art (SOTA) score of 88.9 on referring expression comprehension.
Towards Spatio-Temporal World Scene Graph Generation from Monocular Videos: This paper proposes the World Scene Graph Generation (WSGG) task and the ActionGenome4D dataset, upgrading video scene graphs from frame-centric 2D representations to world-centric 4D representations. It requires models to perform 3D localization and relation prediction in the world coordinate system for all objects, including invisible ones that are occluded or out of view. Three complementary methods (PWG/MWAE/4DST) are proposed to explore different inductive biases for invisible object reasoning.
Unsupervised Foundation Model-Agnostic Slide-Level Representation Learning: This work proposes Cobra, an unsupervised foundation model-agnostic (FM-agnostic) whole slide image (WSI)-level representation learning framework. It leverages embeddings from multiple pre-trained patch-level foundation models as feature-space augmentations, training a slide-level encoder using a Mamba-2 encoder and contrastive learning. Pre-trained on only 3,048 WSIs, Cobra outperforms existing slide encoders by at least +4.4% in average AUC across 15 downstream tasks.

⚛️ Physics & Scientific Computing (7)¶

Accurate Differential Operators for Hybrid Neural Fields: This paper reveals that gradients and curvatures computed via automatic differentiation in hybrid neural fields (e.g., Instant NGP) suffer from severe high-frequency noise. It proposes a post-processing differential operator based on local polynomial fitting and a self-supervised fine-tuning method, reducing gradient and curvature errors by 4x, which significantly eliminates artifacts in rendering and physical simulations.
ATP: Adaptive Threshold Pruning for Efficient Data Encoding in Quantum Neural Networks: This paper proposes Adaptive Threshold Pruning (ATP) to adaptively prune low-information data features prior to quantum data encoding. By optimizing thresholds via L-BFGS-B, ATP achieves the highest accuracy in binary classification tasks across four datasets (MNIST, FashionMNIST, CIFAR, PneumoniaMNIST) while significantly reducing entanglement entropy.
DiffFNO: Diffusion Fourier Neural Operator: This paper proposes DiffFNO, which integrates the Weighted Fourier Neural Operator (WFNO) with a diffusion framework for arbitrary-scale super-resolution. It preserves critical high-frequency components through Mode Rebalancing, fuses frequency-domain and spatial-domain features using a Gated Fusion Mechanism, and accelerates inference with an adaptive-step ODE solver, outperforming existing methods by 2-4 dB in PSNR across multiple benchmarks.
Improve Representation for Imbalanced Regression through Geometric Constraints: This work is the first to study representation space uniformity in deep imbalanced regression (DIR). It proposes two geometric constraints, namely enveloping loss and homogeneity loss, to ensure that regression representations are uniformly distributed on the hypersphere. It also designs a surrogate-driven representation learning (SRL) framework to integrate global geometric constraints into mini-batch training, achieving SOTA on several DIR tasks such as age estimation.
KAC: Kolmogorov-Arnold Classifier for Continual Learning: First to apply Kolmogorov-Arnold Networks (KAN) to continual learning. By replacing B-splines with Radial Basis Functions (RBF) to construct the classifier KAC, consistent and significant performance gains are achieved across multiple continual learning methods (up to +20.70% on CUB200 40-step) with only 0.23M additional parameters.
Learning Phase Distortion with Selective State Space Models for Video Turbulence Mitigation: MambaTM is proposed as the first Mamba-based video atmospheric turbulence mitigation network. It reparameterizes the phase distortion traditionally represented by Zernike polynomials into Latent Phase Distortion (LPD) via a VAE, using LPD to guide the state transitions of SSMs. While maintaining linear complexity and a global receptive field, it achieves state-of-the-art restoration quality and a roughly 1.7× inference speedup (55.4 FPS vs. 32.7 FPS).
Towards Faithful Multimodal Concept Bottleneck Models: Proposes f-CBM, a faithful multimodal Concept Bottleneck Model framework based on CLIP. By jointly addressing concept detection accuracy and information leakage via a differentiable leakage loss and a Kolmogorov-Arnold Network prediction head, it achieves the optimal trade-off among task accuracy, concept detection, and leakage.
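
To make the local polynomial fitting idea from the "Accurate Differential Operators for Hybrid Neural Fields" entry above concrete, here is a deliberately simplified 1D sketch: instead of differentiating a noisy function at a single point, it fits a least-squares line to samples in a small neighbourhood and reads off the slope. The `noisy_field` function, the neighbourhood radius, and the restriction to a first-order fit are assumptions for illustration only; the paper's operator and its fine-tuning procedure are not reproduced here.

```python
import math


def local_linear_gradient(f, x, radius=0.05, num_side=10):
    """Slope of the least-squares line through 2*num_side+1 samples centred on x."""
    step = radius / num_side
    offsets = [k * step for k in range(-num_side, num_side + 1)]
    # With symmetric offsets the least-squares slope reduces to sum(d*f) / sum(d*d).
    numerator = sum(d * f(x + d) for d in offsets)
    denominator = sum(d * d for d in offsets)
    return numerator / denominator


def central_difference(f, x, h=1e-4):
    """Pointwise derivative estimate, standing in for differentiating the raw field."""
    return (f(x + h) - f(x - h)) / (2.0 * h)


def noisy_field(x):
    """Toy signal: smooth x**2 plus a small high-frequency ripple (hypothetical noise model)."""
    return x * x + 0.01 * math.sin(1000.0 * x)
```

At `x = 1.0` the smooth part has derivative 2; the pointwise estimate is dominated by the ripple, while the local fit stays close to 2.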

📡 Signal & Communications (5)¶

ABC-Former: Auxiliary Bimodal Cross-domain Transformer with Interactive Channel Attention: This paper proposes ABC-Former, which introduces CIELab color space and RGB histograms as auxiliary bimodal information. It utilizes a cross-domain Transformer and an Interactive Channel Attention (ICA) module to achieve cross-modal transfer of global color knowledge, achieving SOTA performance in sRGB white balance correction tasks. It is also extended to ABC-FormerM to handle mixed illumination scenarios.
Breaking the Low-Rank Dilemma of Linear Attention: This paper theoretically reveals that the fundamental cause of linear attention's performance lagging behind Softmax attention is the low-rank bottleneck of output features. It proposes Rank-Augmented Linear Attention (RALA), which utilizes two complementary strategies—enhancing KV buffer rank and output feature rank—to match or even surpass the performance of Softmax attention while maintaining linear complexity.
Continuous Space-Time Video Resampling with Invertible Motion Steganography: An Invertible Motion Steganography Module (IMSM) is proposed to embed motion information into low-frame-rate frames during video temporal downsampling, and accurately restore motion details via inverse transformation during upsampling. It supports continuous (non-integer) space-time resampling factors, significantly improving reconstruction quality while preserving the visual quality of downsampled frames.
DiTASK: Multi-Task Fine-Tuning with Diffeomorphic Transformations: Proposes DiTASK, which utilizes continuous piecewise-affine (CPAB) diffeomorphic transformations to smoothly transform the singular values of pretrained weight matrices while keeping the singular vectors unchanged. It achieves full-rank update multi-task fine-tuning with only about 32 parameters per layer, a 26.27% relative improvement over MTLoRA with 75% fewer parameters on PASCAL MTL.
Neural Video Compression with Context Modulation: Proposes the DCMVC framework, which modulates temporal context in two steps: flow orientation and context compensation. By fully utilizing reference information in both the pixel domain and the feature domain, it achieves compression performance that saves an average of 22.7% bitrate compared to H.266/VVC and 10.1% bitrate compared to the previous SOTA, DCVC-FM.

👥 Social Computing (6)¶

As Language Models Scale, Low-order Linear Depth Dynamics Emerge: By treating the depth dimension of Transformers as a discrete-time dynamical system, this paper finds that a linear state-space surrogate model of just 32 dimensions can predict inter-layer sensitivity curves with high precision (Spearman up to 0.99) within a given context. Surprisingly, as the model scales, the low-order linear surrogate becomes even more accurate—unveiling a new scaling law.
Classifier-guided CLIP Distillation for Unsupervised Multi-label Classification: This paper proposes Classifier-guided CLIP Distillation (CCD), which achieves unsupervised multi-label classification performance on par with fully supervised methods (90.1% mAP on VOC12) without any manual annotations by leveraging two core techniques: CAM-guided local view label aggregation and CLIP prediction debiasing.
Classifier-to-Bias: Toward Unsupervised Automatic Bias Detection for Visual Classifiers: Proposes C2B (Classifier-to-Bias), the first framework to automatically detect biases in pre-trained visual classifiers using only the textual descriptions of the classification task (without any labeled data). By leveraging LLMs to generate class-specific bias candidates, creating retrieval queries to collect image datasets, and finally calculating bias scores, C2B outperforms supervised SOTA bias detection methods on CelebA and ImageNet-X.
Learning from Neighbors: Category Extrapolation for Long-Tail Learning: This paper finds that finer-grained category division naturally mitigates the impact of long-tail imbalance. It proposes using LLMs to discover fine-grained auxiliary categories semantically related to existing ones, web crawlers to collect images, and a Neighbor-Silencing Loss to prevent auxiliary classes from dominating. This achieves a 16-percentage-point improvement ($41.4\% \to 57.4\%$) on Few-shot classes in ImageNet-LT.
Let Samples Speak: Mitigating Spurious Correlation by Exploiting the Clusterness of Samples: Based on abstract: Deep learning models are known to often learn features that spuriously correlate with the class label during training but are irrelevant to the prediction task. Existing methods typically address this issue by annotating potential spurious attributes, or filtering spurious features based on some emp
Project-Probe-Aggregate: Efficient Fine-Tuning for Group Robustness: This paper proposes Project-Probe-Aggregate (PPA), a three-step framework that improves the group robustness of foundation models without group annotations, using less than 0.01% of trainable parameters. PPA projects features to remove class proxies and amplify bias, probes group labels corrected with group priors, and aggregates group weights.

🛡️ AI Safety (27)¶

A Simple Data Augmentation for Feature Distribution Skewed Federated Learning: Proposes FedRDN—an extremely simple data augmentation method for federated learning. During training, it randomly uses the channel-wise mean/standard deviation from other clients for data normalization (instead of relying fixedly on local statistics). Requiring only a few lines of code, it significantly mitigates the feature distribution skew problem and consistently improves performance across multiple FL methods.
Data-free Universal Adversarial Perturbation with Pseudo-Semantic Prior: This paper proposes PSP-UAP, a data-free generation method for universal adversarial perturbations. By extracting pseudo-semantic priors from the UAP itself, utilizing input transformation enhancement, and applying a sample reweighting strategy, it achieves an average white-box fooling rate of 89.95% and significantly outperforms existing methods in black-box scenarios without requiring any training data.
DEAL: Data-Efficient Adversarial Learning for High-Quality Infrared Imaging: This work proposes DEAL (Data-Efficient Adversarial Learning), an adversarial learning framework trained on only 50 clean infrared images. Through dynamic adversarial degradation synthesis and a dual-channel interaction network (Scale Transform + Spiking Neurons), it simultaneously addresses three types of infrared degradations (stripe noise, low resolution, and low contrast) with an ultra-lightweight parameter size of 0.96M.

DeDe: Detecting Backdoor Samples for SSL Encoders via Decoders

Detecting Backdoor Attacks in Federated Learning via Direction Alignment Inspection: Proposes the AlignIns defense method, which identifies malicious model updates in federated learning through dual-granularity direction alignment detection (global direction + fine-grained sign analysis), outperforming existing defense methods under both IID and non-IID settings.
Detecting Out-of-Distribution through the Lens of Neural Collapse: Based on Neural Collapse theory, this paper discovers that centered in-distribution (ID) features cluster near the weight vectors of their predicted classes and far from the origin (forming a simplex ETF). Guided by this, the NCI detector is designed by combining the angular proximity (pScore) between features and weight vectors with a feature norm filter. NCI achieves the best overall OOD detection performance on CIFAR-10/100 and ImageNet across multiple architectures while maintaining inference latency on par with the softmax baseline.
Dynamic Integration of Task-Specific Adapters for Class Incremental Learning: Achieves class incremental learning through the dynamic integration of task-specific adapters, where a lightweight adapter is trained for each task, and relevant adapters are dynamically selected and combined during inference.
FedAWA: Adaptive Optimization of Aggregation Weights in Federated Learning Using Client Vectors: FedAWA, inspired by task arithmetic, uses client vectors (the difference between local parameters and global parameters) to adaptively optimize aggregation weights in federated learning. Clients whose updates align with the global optimization direction are assigned higher weights, consistently improving FedAvg by 1–4 percentage points in non-IID scenarios.
Forensics Adapter: Adapting CLIP for Generalizable Face Forgery Detection: This paper proposes Forensics Adapter, a lightweight adapter network with only 5.7M parameters that learns blending boundary features of face forged images in parallel with a frozen CLIP. Highly generalizable cross-dataset face forgery detection is achieved via a triple objective: masked boundary prediction, patch-level contrastive learning, and sample-level contrastive learning, achieving an AUC of 0.914 on CDF-v1.
Geometric Knowledge-Guided Localized Global Distribution Alignment for Federated Learning: Obtains the geometry of the global embedding distribution by accurately reconstructing the global covariance matrix from local covariance matrices in federated learning. It generates augmented samples along global principal directions to localize global distribution information, improving performance by 17 percentage points on CIFAR-100 under extreme heterogeneous scenarios ($\beta=0.01$).
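
The FedRDN entry above describes normalizing training data with channel-wise statistics drawn at random from clients rather than always with local statistics. A minimal sketch of that step follows, assuming images are stored as lists of channels with each channel a flat list of pixel values, and assuming the pool of shared statistics includes the local client's own entry. The function names and the data layout are hypothetical.

```python
import random
import statistics


def channel_stats(images):
    """Per-channel (mean, std) over all pixels of one client's images."""
    num_channels = len(images[0])
    stats = []
    for c in range(num_channels):
        pixels = [p for image in images for p in image[c]]
        stats.append((statistics.fmean(pixels), statistics.pstdev(pixels)))
    return stats


def normalize(image, stats, eps=1e-8):
    """Standardize each channel with the given (mean, std) pair."""
    return [[(p - mean) / (std + eps) for p in channel]
            for channel, (mean, std) in zip(image, stats)]


def random_stats_augment(image, stats_pool, rng):
    """Training-time augmentation: normalize with statistics of a randomly chosen client."""
    return normalize(image, rng.choice(stats_pool))
```

In use, each client would compute `channel_stats` once, the results would be gathered into `stats_pool`, and `random_stats_augment` would replace the usual fixed normalization inside the training data loader.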

Browse all 27 AI Safety papers →

📂 Others (58)¶

BenDFM: A taxonomy and synthetic CAD dataset for manufacturability assessment in sheet metal bending: Proposes a manufacturability metric taxonomy for sheet metal bending processes (categorized into a four-quadrant framework based on two dimensions: configuration-dependency $\times$ feasibility/complexity), and constructs BenDFM, the first synthetic dataset containing 20,000 parts (comprising both manufacturable and non-manufacturable samples). Benchmarking indicates that graph-structured representations (UV-Net) outperform point clouds (PointNext), and predicting configuration-dependent metrics is more challenging.
Bounds on Agreement between Subjective and Objective Measurements: By assuming only that the voting mean converges to the true quality, mathematical bounds on PCC (upper bound) and MSE (lower bound) between subjective tests (MOS) and objective estimators are derived. A Binomial-based voting model, BinoVotes, is proposed to enable the calculation of these bounds even when voting variance is unavailable. Validation on 18 subjective test datasets demonstrates that BinoVotes bounds align closely with full-data-driven bounds.
CARE Transformer: Mobile-Friendly Linear Visual Transformer via Decoupled Dual Interaction: This paper proposes CARE Transformer, which decouples the learning of local inductive bias and long-range dependencies through asymmetrical feature decoupling. Fueled by a dynamic memory unit and a dual interaction module that fully exploit feature complementarity, it delivers a mobile-friendly linear-complexity vision Transformer. It achieves 78.4% top-1 accuracy on ImageNet with only 0.7 GMACs.
Deconstructing the Failure of Ideal Noise Correction: A Three-Pillar Diagnosis: By providing a perfect oracle noise transition matrix T, this work demonstrates that Forward Correction still suffers from training collapse under ideal conditions (first ascending, then descending, and eventually converging to the uncorrected baseline). It systematically diagnoses the root causes of failure from three levels: macro (convergence end-state), micro (gradient dynamics), and information-theoretic (irreversible information loss in noisy channels). This reveals that the failure is not a matter of inaccurate T estimation, but a structural deficiency of high-capacity networks under finite samples.
Do ImageNet-trained Models Learn Shortcuts? The Impact of Frequency Shortcuts on Generalization: This paper proposes a Hierarchical Frequency Shortcut Search (HFSS) method to efficiently discover frequency shortcuts learned by CNNs and Transformers at the ImageNet-1K scale for the first time (permitting correct classification with only 5% of frequencies). It reveals that frequency shortcuts are surprisingly beneficial in texture-preserving OOD tests but detrimental in stylized tests (IN-R/IN-S), pointing out that existing OOD evaluation frameworks overlook the impact of frequency shortcuts.
EBS-EKF: Accurate and High Frequency Event-based Star Tracking: This paper proposes EBS-EKF, which models the circuit behavior of event cameras under low-light conditions to obtain intensity-dependent centroid offset correction, combined with a 3D Extended Kalman Filter for star tracking, achieving an order of magnitude higher accuracy than existing methods on real night-sky data.
EDM: Equirectangular Projection-Oriented Dense Kernelized Feature Matching: EDM is proposed as the first learning-based dense feature matching method for Equirectangular Projection (ERP) panoramic images. It addresses the polar distortion of ERP through a Spherical Space Alignment Module (SSAM, utilizing spherical positional encoding with 3D Cartesian coordinates + Gaussian Process regression) and geodesic flow refinement. On Matterport3D, it outperforms DKM by 26.72% in AUC@5°, and on Stanford2D3D by 42.62%.
Effortless Active Labeling for Long-Term Test-Time Adaptation: This work proposes EATTA, an approach that labels only one most valuable sample per batch (instead of multiple) based on feature perturbation sensitivity during long-term test-time adaptation (TTA). Combined with a gradient norm debiasing strategy to balance the gradients of supervised and unsupervised losses, EATTA achieves an average error rate of 50.9% on ImageNet-C with an extremely low annotation cost, outperforming SimATTA with three times the labeling budget by 3.9%.
Event Ellipsometer: Event-based Mueller-Matrix Video Imaging: The first system to achieve 30fps video-rate Mueller matrix imaging. By capturing intensity modulations caused by a rapidly rotating QWP via an event camera, the system maps event time differences to Mueller matrix ratios and reconstructs physically valid Mueller matrix videos using SVD estimation combined with spatiotemporal propagation.
EVOS: Efficient Implicit Neural Training via EVOlutionary Selector: This paper proposes EVOS, an evolutionary selection paradigm (sparse fitness evaluation + frequency-guided crossover + augmented unbiased mutation) for intelligent sparse sampling of INR training coordinates. EVOS reduces training time by 48-66% (180s $\rightarrow$ 97s) while maintaining or even improving reconstruction quality (PSNR 37.81 vs. standard 37.10).

Browse all 58 Others papers →

🗂 More Areas (28)¶

👥 Multi-Agent (3)¶

Collaborative Tree Search for Enhancing Embodied Multi-Agent Collaboration: This paper proposes the Cooperative Tree Search (CoTS) framework, which integrates a modified Monte Carlo Tree Search with an LLM-driven reward function to guide multiple embodied agents in long-term strategic planning and highly efficient collaboration. By incorporating a plan evaluation module to prevent action confusion caused by frequent plan updates, CoTS significantly outperforms existing methods in both CWAH and TDW-MAT environments.
ComfyBench: Benchmarking LLM-based Agents in ComfyUI for Autonomously Designing Collaborative AI Systems: ComfyBench proposes the first comprehensive benchmark (200 tasks, 3205 node documentations, and 20 curriculum workflows) to evaluate the capability of LLM-based agents to autonomously design collaborative AI systems in ComfyUI. It also introduces the ComfyAgent framework, which leverages code-based workflow representation and multi-agent collaboration to achieve a resolve rate comparable to o1-preview. However, it resolves only 15% of helper creative tasks, highlighting a significant gap in autonomous system design for LLM agents.
NADER: Neural Architecture Design via Multi-Agent Collaboration: NADER models neural architecture design as a multi-LLM-agent collaborative task: a Reader extracts knowledge from papers, a Proposer generates improvement plans, a Modifier implements modifications using Directed Acyclic Graphs (DAGs), and a Reflector learns from failures. With only 10 trials, it surpasses the accuracy upper bound of the NAS-Bench-201 search space, achieving 74.51% on CIFAR-100 (compared to the best in-space search result of 73.51%).

📊 LLM Evaluation (4)¶

Erase Diffusion: Empowering Object Removal Through Calibrating Diffusion Pathways (EraDiff): This paper proposes EraDiff, which establishes a progressive diffusion pathway from "object-containing" to "pure background" through the Chain-Rectifying Optimization (CRO) paradigm, and suppresses artifacts during sampling using the Self-Rectifying Attention (SRA) mechanism. This enables the diffusion model to truly comprehend the "erasure intention," achieving a SOTA Local FID (3.799) on OpenImages V5 and significantly outperforming SD2-Inpaint and LaMa in complex real-world scenes.
PosterO: Structuring Layout Trees to Enable Language Models in Generalized Content-Aware Layout Generation: PosterO is proposed to structure poster layouts into SVG layout trees. By vectorizing design intents and modeling hierarchical node representations, it interfaces with LLMs, generating high-quality content-aware layouts via intent-aligned in-context learning. It achieves state-of-the-art performance across multiple benchmarks and introduces the first PStylish7 dataset supporting multi-purpose and multi-shape elements.
RoadSocial: A Diverse VideoQA Dataset and Benchmark for Road Event Understanding from Social Video Narratives: This paper proposes RoadSocial, a large-scale and diverse VideoQA dataset sourced from social media (consisting of 13.2K videos and 260K QA pairs) that covers multi-regional and multi-perspective road event scenarios globally. Through a semi-automatic annotation framework and 12 categories of QA tasks, the paper systematically evaluates the road event understanding capabilities of 18 Video LLMs.
UniGoal: Towards Universal Zero-shot Goal-oriented Navigation: This paper proposes UniGoal, a unified zero-shot goal-oriented navigation framework. By representing both scenes and goals uniformly as graph structures and combined with a graph matching-driven multi-stage exploration strategy, it achieves zero-shot navigation for three goal types—object categories, instance images, and text descriptions—within a single model, outperforming task-specific methods.

✏️ Knowledge Editing (1)¶

MoKus: Leveraging Cross-Modal Knowledge Transfer for Knowledge-Aware Concept Customization: Proposes the MoKus framework, discovering and utilizing the "cross-modal knowledge transfer" phenomenon—where updating knowledge in an LLM text encoder propagates automatically to the visual generation end—to achieve knowledge-aware concept customization. It features a two-stage design: first learning a visual anchor representation, and then binding textual knowledge in seconds.

✍️ Text Generation (2)¶

ArtFormer: Controllable Generation of Diverse 3D Articulated Objects: This work proposes the ArtFormer framework, which generates high-quality, diverse, and kinematically accurate 3D articulated objects from text/image descriptions via tree structure parameterization and a conditional diffusion shape prior, significantly outperforming existing methods in generation quality and diversity.
Dense Match Summarization for Faster Two-view Estimation: This paper proposes a dense match summarization scheme that compresses over 10,000 dense matches into approximately 1% representative matches through clustering and representative match selection. It encodes the geometric constraints of each cluster into a 9×9 matrix, achieving a 10× to 100× speedup for robust RANSAC estimation with negligible accuracy loss.
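
One way a 9×9 matrix can stand in for a whole cluster of matches, as the Dense Match Summarization entry above describes, is through the algebraic epipolar residual: for homogeneous points $x_1, x_2$ and a 3×3 matrix $E$, the residual $x_2^\top E x_1$ is linear in the nine entries of $E$, so the sum of squared residuals over a cluster is a quadratic form in those entries. The sketch below assumes this algebraic cost; whether the paper uses exactly this cost is not confirmed by the summary, and all function names are hypothetical.

```python
def constraint_row(x1, x2):
    """9-vector a such that dot(a, vec(E)) == x2^T E x1 (row-major vec)."""
    return [x2[i] * x1[j] for i in range(3) for j in range(3)]


def summarize_cluster(matches):
    """Accumulate the sum of a a^T over all matches of a cluster into one 9x9 matrix."""
    summary = [[0.0] * 9 for _ in range(9)]
    for x1, x2 in matches:
        a = constraint_row(x1, x2)
        for r in range(9):
            for c in range(9):
                summary[r][c] += a[r] * a[c]
    return summary


def summarized_cost(summary, e_matrix):
    """Sum of squared residuals for e_matrix, evaluated from the 9x9 summary alone."""
    e = [e_matrix[i][j] for i in range(3) for j in range(3)]
    return sum(e[r] * summary[r][c] * e[c] for r in range(9) for c in range(9))


def direct_cost(matches, e_matrix):
    """Reference: the same cost computed by visiting every match."""
    total = 0.0
    for x1, x2 in matches:
        residual = sum(x2[i] * e_matrix[i][j] * x1[j]
                       for i in range(3) for j in range(3))
        total += residual * residual
    return total
```

Scoring a candidate model against the summary costs the same regardless of how many matches the cluster contained, which is where a speedup inside a RANSAC loop would come from under this assumption.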

🌐 Multilingual & Translation (1)¶

SMTPD: A New Benchmark for Temporal Prediction of Social Media Popularity: Constructs SMTPD, the first time-aligned temporal prediction benchmark for social media popularity (282K YouTube samples with 30-day continuous observations), proposes a baseline framework based on multi-modal feature extraction and LSTM temporal regression, and reveals that early popularity (EP) is key to accurately predicting subsequent popularity.

🔍 Information Retrieval & RAG (12)¶

Advancing Myopia To Holism: Fully Contrastive Language-Image Pre-training: Upgrades CLIP from the traditional one-to-one (image, text) contrastive learning paradigm to a multi-to-multi (multi-image-embeddings, multi-texts) contrastive learning paradigm. By utilizing VLMs to generate multi-perspective, multi-level descriptions and a multi-branch visual encoder to output diverse visual embeddings, it achieves more comprehensive vision-language alignment, substantially outperforming baselines in retrieval, classification, and dense prediction tasks.
ChatHuman: Chatting about 3D Humans with Tools: ChatHuman is an LLM-based, language-driven system that automatically selects and integrates specialized 3D human analysis tools (3D pose estimation, shape recovery, contact detection, human-object interaction analysis, emotion recognition, etc.). To handle new tools, it uses academic papers as tool manuals together with RAG (Retrieval-Augmented Generation) to create in-context examples. It outperforms existing LLMs in tool selection accuracy and overall performance on 3D human-related tasks.
COBRA: COmBinatorial Retrieval Augmentation for Few-Shot Adaptation: Proposes COBRA, a combinatorial mutual information (CMI)-based retrieval-augmented few-shot adaptation method. By simultaneously accounting for both the similarity of retrieved samples to the target task and the diversity among the samples themselves, COBRA retrieves high-quality auxiliary data from LAION-2B. It consistently outperforms traditional nearest-neighbor retrieval methods across multiple image classification benchmarks with negligible computational overhead.
EZSR: Event-based Zero-Shot Recognition: This paper proposes the EZSR framework for zero-shot object recognition in event camera data. By utilizing a scalar-wise modulation strategy, it addresses the semantic misalignment between event embeddings and CLIP text embeddings. It overcomes training data scarcity through large-scale event data synthesis from static RGB images, achieving a 47.84% zero-shot accuracy on N-ImageNet with a ViT-B/16 backbone.
Few-Shot Recognition via Stage-Wise Retrieval-Augmented Finetuning: This paper extends Retrieval-Augmented Learning (RAL) to Few-Shot Recognition (FSR) for the first time, exposing two main challenges of retrieved data: distribution imbalance and domain gap. It proposes a two-stage method, SWAT (finetuning the vision encoder on mixed data first, then retraining the classifier on few-shot labeled data), outperforming all prior methods by $>6\%$ across 9 benchmarks.
GOAL: Global-Local Object Alignment Learning: Proposes the GOAL method, which enhances CLIP's understanding of long text descriptions through two modules: Local Image-Sentence Matching (LISM) and Token Similarity-based Learning (TSL). By introducing local semantic alignment on top of global alignment, it significantly improves image-text retrieval performance.
LotusFilter: Fast Diverse Nearest Neighbor Search via a Learned Cutoff Table: This paper proposes LotusFilter, which constructs a cutoff table by precomputing neighbor relationships for each vector offline and performs diversity filtering using greedy set deletion during the online stage. This reduces the complexity of traditional diverse search from $O(DS^2)$ to $O(T+S+KL)$. The filtering process requires only 0.02 ms/query and uses only 1/40 of the memory of traditional methods.
Preserving Clusters in Prompt Learning for Unsupervised Domain Adaptation: The CRPL framework is proposed to improve prompt learning of CLIP in unsupervised domain adaptation (UDA) through source-augmented pseudo-labeling and an optimal transport-based cluster preservation strategy, ensuring that the text embeddings of target prompts better cover the cluster structures of visual embeddings.
RANGE: Retrieval Augmented Neural Fields for Multi-Resolution Geo-Embeddings: Proposes RANGE, which approximates high-resolution visual information and injects it into location embeddings via a retrieval-augmented strategy. This addresses the tendency of contrastive learning methods (e.g., SatCLIP) to discard modality-specific information, achieving up to a 13.1% performance gain on classification tasks and a 0.145 increase in $R^2$ on regression tasks.
Retrieving Semantics from the Deep: an RAG Solution for Gesture Synthesis: RAG-Gesture proposes a gesture synthesis framework based on Retrieval-Augmented Generation (RAG). It leverages explicit linguistic knowledge to retrieve semantically relevant exemplar motions from a gesture database, and injects them into the diffusion model's generation process at inference time through DDIM inversion and retrieval guidance, producing semantically rich and natural co-speech gestures without training.
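
The greedy set deletion described in the LotusFilter entry above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `build_cutoff_table` and `diverse_filter` are hypothetical, the threshold `eps` is a fixed squared-distance cutoff chosen by hand (the paper's title indicates the cutoff table is learned), and the offline step here is brute force for readability.

```python
def build_cutoff_table(vectors, eps):
    """Offline step (assumed form): for each vector, list the ids of all
    other vectors whose squared Euclidean distance to it is below eps."""
    table = []
    for i, u in enumerate(vectors):
        near = []
        for j, v in enumerate(vectors):
            if i != j and sum((a - b) ** 2 for a, b in zip(u, v)) < eps:
                near.append(j)
        table.append(near)
    return table


def diverse_filter(candidates, table, k):
    """Online step: walk the candidates in ascending-distance order, keep each
    one that has not been deleted, and delete its table neighbors."""
    removed = set()
    kept = []
    for c in candidates:
        if c in removed:
            continue
        kept.append(c)
        if len(kept) == k:
            break
        removed.update(table[c])
    return kept


# Toy data: ids 0/1 and 2/3 are near-duplicates, id 4 is isolated.
vectors = [(0.0, 0.0), (0.1, 0.0), (1.0, 0.0), (1.05, 0.0), (3.0, 0.0)]
table = build_cutoff_table(vectors, eps=0.05)

# Candidate ids as a plain nearest neighbor search for a query at the origin
# would return them (already sorted by distance).
print(diverse_filter([0, 1, 2, 3, 4], table, k=3))  # [0, 2, 4]
```

Because the neighbor lists are precomputed, the online loop only performs set lookups and updates, which is why the filtering step can be cheap relative to the initial search.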

Browse all 12 Information Retrieval & RAG papers →

🔗 Causal Inference (4)

Adventurer: Optimizing Vision Mamba Architecture Designs for Efficiency: Proposes the Adventurer series of vision models, which adapts image inputs to a unidirectional causal scanning framework through two simple designs: "Heading Average token" and "Inter-layer Flipping". This allows the Mamba architecture to train 4-6x faster than existing Vision Mamba models on vision tasks, while maintaining accuracy comparable or even superior to ViT.
Image Quality Assessment: Investigating Causal Perceptual Effects with Abductive Counterfactual Inference: This paper formulates Full-Reference Image Quality Assessment (FR-IQA) as a counterfactual inference problem. By using a Structural Causal Model (SCM), it distinguishes between the causal components related to perceptual quality and the noise components in deep features. This achieves training-free, backbone-agnostic robust quality prediction, obtaining competitive performance on multiple benchmark datasets.
Joint Scheduling of Causal Prompts and Tasks for Multi-Task Learning: Proposes the JSCPT (Joint Scheduling of Causal Prompts and Tasks) framework. It first designs Multi-Task Vision-Language Prompts (MTVLP) and removes spurious correlation features from the prompts through causal intervention. An adaptive task scheduler then adjusts the learning order and task weights according to how task relationships change during training, achieving significant improvements across multiple multi-task visual recognition benchmarks.
FG-VCE: Towards Fine-Grained Interpretability — Counterfactual Explanations for Misclassification with Saliency Partition: This paper proposes the FG-VCE (Fine-Grained Visual Contrastive Explanation) framework. By calculating feature point contributions via Shapley values, isolating local features using a saliency partition module, and employing an iterative counterfactual generation strategy, it achieves fine-grained counterfactual explanations at both the object and part levels for the first time. It reveals the specific causes of model misclassification: "which fine-grained features led to the error" and "which local regions dominated the prediction change."
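
For readers unfamiliar with the Shapley values mentioned in the FG-VCE entry above, the following is a generic, exact computation over a tiny set of "players". It is only a sketch of the general definition, not the paper's procedure: the function name `shapley_values`, the player labels, and the toy value function are all assumptions made for illustration, and exact enumeration is feasible only for a handful of players.

```python
from itertools import combinations
from math import factorial


def shapley_values(players, value):
    """Exact Shapley values: the weighted average marginal contribution of each
    player over all subsets of the remaining players."""
    n = len(players)
    phi = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for r in range(n):
            # Weight of a coalition of size r that excludes p.
            weight = factorial(r) * factorial(n - r - 1) / factorial(n)
            for subset in combinations(others, r):
                s = frozenset(subset)
                total += weight * (value(s | {p}) - value(s))
        phi[p] = total
    return phi


# Hypothetical value function: two "feature points" only help when both are
# present, and a third contributes independently.
def toy_value(coalition):
    score = 0.0
    if {"edge", "texture"} <= coalition:
        score += 1.0
    if "color" in coalition:
        score += 0.5
    return score


print(shapley_values(["edge", "texture", "color"], toy_value))
```

In this toy setup the interacting pair splits its joint contribution equally, and the attributions sum to the value of the full coalition, which is the property that makes Shapley values attractive for attributing a prediction to individual features.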

🌍 Earth Science (1)

GeoChemAD: Benchmarking Unsupervised Geochemical Anomaly Detection for Mineral Exploration: This paper proposes the GeoChemAD open-source benchmark dataset (comprising 8 subsets covering multiple regions, sampling sources, and target elements) and the GeoChemFormer framework. By employing spatial context self-supervised pre-training and elemental dependency modeling, it achieves unsupervised geochemical anomaly detection and obtains state-of-the-art AUC across all subsets.