
🧪 ICML2025 Accepted Papers

1060 ICML2025 paper notes covering Image Generation (92), Model Compression (74), Reinforcement Learning (69), Optimization & Theory (61), Computational Biology (48), Multimodal VLM (42), LLM Safety (41), AI Safety (37), and 50 other areas. Each note has a TL;DR, motivation, method, experiments, highlights, and limitations, and is a 5-minute read of the core ideas.

💡 LLM Reasoning (19)

Ad-Hoc Human-AI Coordination Challenge (AH2AC2): This work proposes the AH2AC2 challenge—based on the cooperative card game Hanabi—which constructs human proxy agents via behavioral cloning and regularized reinforcement learning, and open-sources a limited human dataset to provide a standardized, reproducible evaluation framework for human-AI ad-hoc coordination research.
AdaDecode: Accelerating LLM Decoding with Adaptive Layer Parallelism: AdaDecode achieves high-confidence early token prediction by training lightweight LM heads at middle layers, and defers the KV cache computation of the subsequent layers so that it can be processed in parallel. While producing output identical to standard autoregressive decoding, it achieves up to a 1.73× decoding throughput speedup.
Adversarial Manipulation of Reasoning Models using Internal Representations: This paper finds that reasoning models (such as DeepSeek-R1-Distill-Llama-8B) exhibit a linear "caution direction" in the activation space during the CoT generation phase. Ablating this direction effectively jailbreaks the model, revealing that CoT itself is a new target for adversarial attacks.
DyCodeEval: Dynamic Benchmarking of Reasoning Capabilities in Code Large Language Models Under Data Contamination: Based on the concept of metamorphic testing, this work decomposes programming problems into complexity-related algorithmic abstractions and complexity-independent contextual descriptions. Through the collaboration of four LLM agents, it automatically generates semantically equivalent yet textually distinct variants of programming problems. This effectively mitigates data contamination and evaluates the true reasoning capabilities of Code LLMs, validating the effectiveness of the framework across 18 models.
Emergent Symbolic Mechanisms Support Abstract Reasoning in Large Language Models: Through causal, representation, and attention analyses, this paper identifies a three-stage emergent symbolic architecture supporting abstract reasoning across 13 open-source LLMs: symbolic abstraction heads transform input tokens into abstract variables, symbolic induction heads perform sequence induction at the abstract variable level, and retrieval heads retrieve the corresponding values based on predicted abstract variables for next-token prediction.
Evaluating Judges as Evaluators: The JETTS Benchmark of LLM-as-Judges as Test-Time Scaling Evaluators: This paper proposes the JETTS benchmark to systematically evaluate the performance of LLM-judges as evaluators in test-time scaling scenarios (response reranking, step-level beam search, and critique-based refinement). The findings show that while judges are competitive with outcome reward models in reranking, they are significantly weaker than process reward models in beam search, and natural language critiques currently fail to effectively guide generator improvements.
FMC: Formalization of Natural Language Mathematical Competition Problems: This paper proposes a fully automated formalization pipeline based on LLM error feedback that translates natural language mathematical competition problems into Lean formal representations, constructing FMC, an Olympiad-level dataset containing 3,922 natural language problems aligned with 9,787 Lean formalizations, and validating its value as an automated theorem proving benchmark.
Improving Rationality in the Reasoning Process of Language Models through Self-playing Game: This paper proposes the Critic-Discernment Game (CDG), a self-playing language game where an LLM interacts with a "Helpful Critic" and a "Misleading Critic." Using Reinforced Self-Training (ReST), the three roles are jointly optimized. Without relying on human or stronger model supervision, this approach significantly enhances the LLM's rational understanding of its own reasoning process, achieving consistent improvements across four tasks: mathematical reasoning, step-by-step error detection, self-correction, and long-chain reasoning.
MARGE: Improving Math Reasoning for LLMs with Guided Exploration: MARGE proposes a "hit-guided exploration" approach to enhance the mathematical reasoning capabilities of LLMs. By systematically exploring the intermediate reasoning states in self-generated solutions, it achieves thorough exploration and better credit assignment without requiring external annotations or additional value models, simultaneously improving single-attempt accuracy and exploration diversity.
No Soundness in the Real World: On the Challenges of the Verification of Deployed Neural Networks: This paper demonstrates that all current state-of-the-art neural network verifiers only provide "theoretical soundness" (bounding exact-precision output) rather than "practical soundness" (bounding floating-point outputs in deployment environments), and empirically verifies that all tested verifiers can be deceived by constructing environment-sensitive adversarial backdoor networks.
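
The "caution direction" entry above describes ablating a linear direction in activation space. As a rough illustration of what that operation means geometrically, here is a minimal sketch that projects a single direction out of an activation vector. The function name `ablate_direction` and the toy vectors are hypothetical and not taken from the paper, which operates on real model activations and may apply the intervention differently.

```python
import math


def ablate_direction(activation, direction):
    """Return `activation` with its component along `direction` removed.

    Hypothetical helper for illustration: computes h - (h . u) u,
    where u is `direction` normalized to unit length.
    """
    norm = math.sqrt(sum(d * d for d in direction))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    unit = [d / norm for d in direction]
    coeff = sum(a * u for a, u in zip(activation, unit))
    return [a - coeff * u for a, u in zip(activation, unit)]


# Toy example: remove the second coordinate axis from a 3-d "activation".
h = [3.0, 4.0, 0.0]
d = [0.0, 2.0, 0.0]
h_ablated = ablate_direction(h, d)
print(h_ablated)  # [3.0, 0.0, 0.0]
```

After ablation the vector has no remaining component along `d`, which is the property such an intervention relies on.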

Browse all 19 LLM Reasoning papers →

🦾 LLM Agent (11)

AdvAgent: Controllable Blackbox Red-teaming on Web Agents: This paper proposes AdvAgent, a reinforcement learning (DPO)-based blackbox red-teaming framework. It trains an adversarial prompter model to automatically generate invisible HTML adversarial prompts. When injected into web pages, these prompts mislead GPT-4V-driven Web Agents into executing attacker-specified target actions (e.g., changing buying Microsoft stock to buying NVIDIA stock). AdvAgent achieves a 97.5% attack success rate across 440 tasks and maintains over 88.8% effectiveness against existing defense methods.
AGACCI: Affiliated Grading Agents for Criteria-Centric Interface in Educational Coding Contexts: AGACCI proposes a multi-agent evaluation framework consisting of 9 specialized agents. It decomposes the evaluation task of educational programming assignments into roles such as rubric parsing, code execution validation, visual evaluation, and explanatory reasoning assessment. Through collaboration, it achieves more accurate, consistent, and interpretable rubric-aligned feedback than single-model baselines.
Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction: Proposes Aguvis, the first purely vision-based, cross-platform autonomous GUI agent framework. By unifying the visual observation space, standardizing the action space, and using an inner monologue mechanism, Aguvis achieves SOTA results on offline and online benchmarks without relying on closed-source models.
Evaluating Retrieval-Augmented Generation Agents for Autonomous Scientific Discovery in Astrophysics: This paper constructs CosmoPaperQA (105 expert QA pairs), a RAG evaluation benchmark in the cosmology domain, to systematically evaluate nine RAG agent configurations (covering commercial APIs, hybrid architectures, and academic tools). It finds that the OpenAI RAG solution leads with a 91.4% accuracy rate and calibrates an LLM-as-a-Judge system that can substitute for manual human review.
From Passive to Active Reasoning: Can Large Language Models Ask the Right Questions under Incomplete Information?: This paper introduces AR-Bench, a benchmark specifically designed to evaluate the active reasoning capabilities of LLMs. It features three task families: detective cases, situation puzzles, and guessing numbers. Experiments reveal that state-of-the-art models such as GPT-4o perform far worse than humans in scenarios where they must actively ask questions to retrieve missing information, exposing a massive gap between passive and active reasoning.
GuardAgent: Safeguard LLM Agents via Knowledge-Enabled Reasoning: GuardAgent is the first "Agent-safeguarding-Agent" framework that dynamically converts safety rules into executable guardrail code to verify if the actions of a target Agent violate safety policies. It achieves guardrail accuracies of over 98% and 83% on new benchmarks for medical access control and web safety control, respectively.
Improving LLM Agent Planning with In-Context Learning via Atomic Fact Augmentation and Lookahead Search: Proposes LWM-Planner, which extracts "atomic facts" from interaction trajectories to enhance LLM world model simulation and combines this with recursive lookahead search to improve agent planning purely in-context. It significantly outperforms ReAct and Reflexion on tasks like ALFWorld.
KBQA-o1: Agentic Knowledge Base Question Answering with Monte Carlo Tree Search: KBQA-o1 is proposed, which combines a ReAct Agent with Monte Carlo Tree Search (MCTS) to perform knowledge base question answering through heuristic search driven by policy and reward models. Under low-resource settings, it improves the GrailQA F1 from 48.5% (GPT-3.5-turbo SOTA) to 78.5% using Llama-3.1-8B.
Open Source Planning & Control System with Language Agents for Autonomous Scientific Discovery: This paper proposes cmbagent, a multi-agent system composed of approximately 30 LLM agents. It adopts a Planning & Control strategy to orchestrate fully autonomous scientific research workflows. Individual agents are responsible for specialized tasks such as literature retrieval, code generation, result interpretation, and output review, with the capability of executing code locally. The system successfully completes a PhD-level cosmology task (measuring cosmological parameters using supernova data) and outperforms state-of-the-art LLMs on two benchmark datasets.
Towards LLM Agents for Earth Observation: This paper proposes UnivEARTH, an Earth Observation benchmark featuring 140 yes/no questions covering 13 topics and 17 satellite sensors. Evaluation reveals that the best LLM Agent (which generates code to use Google Earth Engine) achieves an accuracy of only 33%, primarily because 58% of the generated code fails to execute.

Browse all 11 LLM Agent papers →

👥 Multi-Agent (7)

AutoML-Agent: A Multi-Agent LLM Framework for Full-Pipeline AutoML: This paper proposes AutoML-Agent, a multi-agent LLM collaborative framework for full-pipeline AutoML. It expands the search space using a Retrieval-Augmented Planning strategy, decomposes tasks into parallel subtasks handled by specialized agents, and introduces a multi-stage verification mechanism to guarantee code generation quality, achieving higher automation success rates and model performance across 14 datasets in 7 task categories.
Cross-environment Cooperation Enables Zero-shot Multi-agent Coordination: Proposes the Cross-Environment Cooperation (CEC) paradigm, which trains agents via self-play across a large number of procedurally generated, diverse environments (rather than increasing partner diversity). This enables agents to learn general cooperative norms, achieving zero-shot coordination with unseen partners in unseen environments.
From Debate to Equilibrium: Belief-Driven Multi-Agent LLM Reasoning via Bayesian Nash Equilibrium: This paper models multi-LLM coordination as an incomplete information game and proposes the ECON framework. It achieves implicit belief-driven multi-agent coordinated reasoning via Bayesian Nash Equilibrium (BNE) without explicit message passing while providing theoretical convergence guarantees, yielding an average improvement of 11.2% across six reasoning benchmarks.
Is Your LLM-Based Multi-Agent a Reliable Real-World Planner? Exploring Fraud Detection in Travel Planning: This paper proposes WandaPlan, an evaluation environment that systematically assesses the vulnerability of LLM-based multi-agent planning systems to false information by injecting three progressive types of fraud (single-source misinformation, team-coordinated manipulation, and level-escalating attacks) in travel planning scenarios, and designs an Anti-Fraud Agent to mitigate these risks.
MetaAgent: Automatically Constructing Multi-Agent Systems Based on Finite State Machines: Proposes MetaAgent, a framework based on Finite State Machines (FSMs) that automatically designs multi-agent systems given only task descriptions, without requiring external training data. Supporting tool invocation and state backtracking, it outperforms existing automated design methods and approaches the performance of hand-crafted systems across text-based, ML, and software development tasks.
ResearchTown: Simulator of Human Research Community: This paper proposes ResearchTown, a multi-agent framework based on agent-data graphs and TextGNN (text-space message passing), which models human scientific communities as heterogeneous graphs to unify the simulation of three core research activities: literature reading, paper writing, and peer review. A scalable and objective simulation quality evaluation is conducted via a node masking prediction task (ResearchBench).
Theorem-of-Thought: A Multi-Agent Framework for Abductive, Deductive, and Inductive Reasoning in Language Models: The Theorem-of-Thought (ToTh) framework is proposed, where three agents simulating abductive, deductive, and inductive reasoning independently generate reasoning trajectories. These trajectories are constructed into a Formal Reasoning Graph (FRG) and consistently scored using NLI-calibrated Bayesian belief propagation. The terminal node of the highest-scoring graph is selected as the final answer, consistently outperforming CoT, Self-Consistency, and CoT-Decoding on symbolic and numerical reasoning tasks.

⚖️ Alignment & RLHF (16)

AlphaPO: Reward Shape Matters for LLM Alignment: AlphaPO introduces an \(\alpha\) parameter into the Direct Alignment Algorithms (DAA) framework to alter the "shape" of the reward function, generalizing it from the standard log-based reward to a more general power transform. This enables fine-grained control over likelihood displacement and over-optimization, achieving a 7%-10% improvement over SimPO and a 15%-50% improvement over DPO on Mistral-7B and Llama3-8B.
AMPO: Active Multi-Preference Optimization for Self-play Preference Selection: The AMPO framework is proposed, combining online policy generation, multi-preference group contrastive loss, and active subset selection. By intelligently choosing small but highly informative subsets from a large pool of candidate responses for preference optimization, it achieves state-of-the-art results on AlpacaEval.
AssistanceZero: Scalably Solving Assistance Games: AssistanceZero is proposed, scaling assistance games to complex environments (Minecraft building assistance with \(10^{400}\) possible goals) for the first time. By extending AlphaZero with a reward prediction head and a human action prediction head to perform planning under uncertainty via MCTS, the method significantly outperforms PPO and imitation learning baselines. Human experiments demonstrate that AssistanceZero effectively reduces user actions and exhibits emergent behaviors such as digging foundations, inferring roofs, and learning from corrections.
Can RLHF be More Efficient with Imperfect Reward Models? A Policy Coverage Perspective: This work identifies a structural property induced by KL regularization in RLHF—the policy coverage over the optimal policy is bounded by its sub-optimality (\(\text{Cov}^{\pi^*|\pi} \leq 1 + \kappa \cdot (J(\pi^*) - J(\pi))/\beta\)). Based on this, two transfer learning principles are proposed: (1) selecting a transfer policy with high policy value, and (2) self-transfer distilling the policy from online data. The proposed TPO algorithm achieves a regret of \(O(W\sqrt{T})\) in the early stage and \(O(\sqrt{T})\) in the late stage. It can be modularly integrated with DPO/IPO/XPO, and its effectiveness is validated on the T5 summarization task.
Challenges and Future Directions of Data-Centric AI Alignment: This paper is a position paper advocating for shifting the research focus of AI alignment from algorithm design to data quality. Through qualitative analysis of the Anthropic-HH dataset, it reveals six major sources of unreliability in human feedback and proposes future directions for improving data collection, cleaning, and verification.
Diverging Preferences: When do Annotators Disagree and do Models Know?: This paper systematically analyzes the reasons behind annotator disagreement in RLHF preference datasets by taxonomizing them into 10 categories. It reveals that over 75% of disagreements stem from personal preference rather than annotation noise. To address this, the paper proposes a Mean-Var Reward Model to effectively differentiate between diverging and high-consensus preferences, and uncovers systematic biases in LLM-as-Judge evaluation methodologies when facing disagreement.
DPO Meets PPO: Reinforced Token Optimization for RLHF: This paper proposes Reinforced Token Optimization (RTO), which models RLHF as a token-level MDP (rather than a sentence-level bandit). It leverages DPO to implicitly extract token-wise reward signals and then performs policy optimization using PPO. RTO outperforms PPO by 7.5 points on AlpacaEval 2 and by 4.1 points on Arena-Hard, achieving PPO-level performance with only 1/8 of the data.
Improving Model Alignment through Collective Intelligence of Open-Source LLMs: This paper proposes Mixture of Agents Alignment (MoAA), which leverages the collective intelligence of multiple open-source LLMs to generate high-quality alignment data (SFT data and preference data). This significantly improves the performance of the target model on Arena-Hard and AlpacaEval2, demonstrating self-improvement capabilities without external strong supervision.
Instruction Tuning of Large Language Models for Tabular Data Generation—in One Day: This paper is the first to explore utilizing instruction tuning to enhance the tabular data generation capabilities of LLMs. By constructing a high-quality instruction dataset of only 10K instances and fine-tuning Llama3.1-8B-Instruct on a single A100 for less than 6 hours, the approach achieves tabular data generation performance comparable to GPT-4o.
Layer-wise Alignment: Examining Safety Alignment Across Image Encoder Layers in Vision Language Models: This work identifies the Image enCoder Early-exiT (ICET) vulnerability in VLMs, where skipping certain layers of the image encoder significantly increases the probability of generating harmful outputs. It proposes Layer-wise PPO (L-PPO), which modifies the Clipped-PPO algorithm to perform multimodal RLHF across different layers, leading to up to a 48% reduction in ASR and a 33.64% reduction in toxicity score.
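
Several entries above (AlphaPO, TPO, RTO) build on or compare against DPO. For orientation, the following is a minimal sketch of the standard per-example DPO loss, which is the negative log-sigmoid of a scaled log-ratio margin. It is not any of the listed papers' variants; the function name, argument names, and the default `beta=0.1` are illustrative assumptions.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one preference pair (illustrative sketch).

    Inputs are sequence log-probabilities under the policy and under the
    frozen reference model. `beta` scales the implicit reward margin.
    """
    margin = beta * (
        (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    )
    # -log(sigmoid(margin)) written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# When policy and reference agree, the margin is zero and the loss is log(2).
baseline = dpo_loss(-5.0, -6.0, -5.0, -6.0)
# Raising the chosen response's log-probability lowers the loss.
improved = dpo_loss(-4.0, -6.0, -5.0, -6.0)
print(baseline, improved)
```

The implicit reward here is the log-ratio between policy and reference; AlphaPO's summary above describes changing the shape of that reward, which this sketch does not attempt to reproduce.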

Browse all 16 Alignment & RLHF papers →

🔒 LLM Safety (41)

Activation Space Interventions Can Be Transferred Between Large Language Models: This paper demonstrates that shared activation space structures exist among LLMs. By training an autoencoder to learn activation mappings between models, safety interventions (such as backdoor removal and harmful refusal steering vectors) can be transferred from source models to target models. This enables an efficient safety intervention paradigm of "using small models to align large models."
Align-then-Unlearn: Embedding Alignment for LLM Unlearning: The Align-then-Unlearn framework is proposed to perform unlearning in the semantic embedding space (rather than at the token level). It first pre-trains an embedding prediction module to align future semantic representations, and then fine-tunes the LLM to push predicted embeddings away from the target concept embedding, achieving concept-level knowledge unlearning that is robust to prompt rephrasings.
An Attack to Break Permutation-Based Private Third-Party Inference Schemes for LLMs: An attack method based on token-by-token vocabulary matching is proposed. By leveraging the non-collision property of the hidden states in decoder-only LLMs, the original input tokens can be almost perfectly reconstructed from three types of permuted hidden states, breaking the security claims of three private inference schemes: PermLLM, STIP, and Centaur.
Cape: Context-Aware Prompt Perturbation Mechanism with Differential Privacy: Cape is proposed—a context-aware prompt perturbation mechanism that combines a hybrid utility function (integrating token embedding distance and contextual logits) with a bucketized exponential sampling mechanism to achieve a superior privacy-utility trade-off under local DP guarantees compared to existing methods.
Cascade: Token-Sharded Private LLM Inference: Proposes Cascade, a multiparty inference protocol based on token-dimension sharding. By distributing hidden states to different computation nodes along the token dimension, it avoids the high overhead of cryptographic primitives, achieving up to 100× faster inference than SMPC schemes while maintaining resilience against vocab-matching attacks.
CROW: Eliminating Backdoors from Large Language Models via Internal Consistency Regularization: Proposes CROW (Internal Consistency Regularization), which eliminates backdoors in LLMs using adversarial perturbations and inter-layer hidden state consistency regularization. With only 100 clean samples and 4 minutes of fine-tuning on a single GPU, it reduces the attack success rate to under 5% without requiring clean reference models or prior knowledge of the trigger.
Cut out and Replay: A Simple yet Versatile Strategy for Multi-Label Online Continual Learning: Proposes CUTER (CUT-out-and-Experience-Replay), which converts multi-label online continual learning into multiple single-label sub-image classification tasks by cropping label-specific regions from images and storing them in a memory buffer for replay. This simultaneously addresses the three challenges of catastrophic forgetting, missing labels, and class imbalance.
De-mark: Watermark Removal in Large Language Models: The De-mark framework is proposed, which estimates the n-gram watermark strength and reconstructs red-green lists through a random selection probing strategy. It enables watermark removal without requiring knowledge of the hash function, while providing theoretical guarantees on the distribution gap between the post-removal LM distribution and the original distribution.
DRAGON: Guard LLM Unlearning in Context via Negative Detection and Reasoning: This paper proposes DRAGON, a training-free LLM unlearning framework. It identifies prompts to be forgotten using a dual-layer detection module and subsequently performs in-context intervention using a CoT guard model to generate reasoning instructions, achieving efficient unlearning without modifying model parameters.
EgoPrivacy: What Your First-Person Camera Says About You?: Introduces EgoPrivacy, the first large-scale first-person video privacy benchmark, defining three categories of privacy (demographic, individual, and situational) across seven tasks. It designs Retrieval-Augmented Attack (RAA), combining ego-to-exo retrieval and classification, to demonstrate that foundation models can infer the wearer's sensitive attributes (e.g., gender, race) with 70–80% accuracy in a zero-shot setting.
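
The De-mark entry above refers to n-gram watermarks built on red-green lists. As background, here is a minimal sketch of the general red-green list idea: the previous token seeds a pseudo-random split of the vocabulary, and detection counts how many tokens fall in the "green" half. It illustrates the kind of scheme being attacked, not De-mark's probing strategy. The key string, hashing choice, and helper names are all assumptions made for this example.

```python
import hashlib
import math
import random


def green_list(prev_token, vocab, gamma=0.5, key="demo-key"):
    """Pseudo-randomly pick a `gamma` fraction of `vocab`, seeded by the previous token."""
    digest = hashlib.sha256(f"{key}:{prev_token}".encode()).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    shuffled = sorted(vocab)
    rng.shuffle(shuffled)
    return set(shuffled[: int(gamma * len(shuffled))])


def green_z_score(tokens, vocab, gamma=0.5, key="demo-key"):
    """z-score of the green-token count against the no-watermark expectation."""
    scored = len(tokens) - 1  # the first token has no predecessor to seed a list
    if scored <= 0:
        raise ValueError("need at least two tokens")
    hits = sum(
        1
        for prev, cur in zip(tokens, tokens[1:])
        if cur in green_list(prev, vocab, gamma, key)
    )
    return (hits - gamma * scored) / math.sqrt(scored * gamma * (1 - gamma))


# Toy vocabulary and a fully "watermarked" sequence that always picks a green token.
vocab = [f"tok{i}" for i in range(10)]
tokens = [vocab[0]]
for _ in range(16):
    tokens.append(min(green_list(tokens[-1], vocab)))
z = green_z_score(tokens, vocab)
print(z)  # 4.0
```

A removal attack in this setting has to recover or approximate the green lists without access to `key` or the hash function, which is the situation the De-mark summary describes.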

Browse all 41 LLM Safety papers →

👻 Hallucination Detection (3)

Look Twice Before You Answer: Memory-Space Visual Retracing for Hallucination Mitigation in Multimodal Large Language Models: This paper proposes the MemVR decoding paradigm, which reinjects visual tokens as supplementary evidence into intermediate trigger layers through the key-value memory mechanism of FFNs. This "look-twice" mechanism mitigates hallucinations in MLLMs without introducing additional inference overhead.
Rejecting Hallucinated State Targets during Planning: This paper systematically identifies the types of "delusional behavior" caused by generators producing unfeasible goals (hallucinatory goals) in goal-conditioned decision planning, and designs a feasibility evaluator as an auxiliary module to identify and reject these unfeasible goals. Combined with off-policy learning rules, a distributional architecture, and hindsight relabeling data augmentation, this approach significantly reduces delusional behavior and enhances OOD generalization performance without modifying the original agent.
Steer LLM Latents for Hallucination Detection: Proposes Truthfulness Separator Vector (TSV), a lightweight steering vector that reshapes the LLM representation space at inference time to enhance the separation between truthful and hallucinated outputs, achieving performance close to full supervision with only 32 labeled exemplars.

📊 LLM Evaluation (22)

AAAR-1.0: Assessing AI's Potential to Assist Research: The AAAR-1.0 benchmark is proposed to systematically evaluate the actual capabilities of LLMs in assisting scientific research across four expert-level tasks: equation inference, experimental design, paper weakness detection, and peer review critique. The benchmark reveals significant deficiencies in current models when performing deep research tasks.
Are LLM Belief Updates Consistent with Bayes' Theorem?: This paper proposes the Bayesian Coherence Coefficient (BCC) to quantify whether LLM belief updates conform to Bayes' theorem, revealing that larger and more powerful pretrained models exhibit belief updates that are more consistent with Bayes' theorem when presented with new evidence.
Bounded Rationality for LLMs: Satisficing Alignment at Inference-Time: This paper proposes SITAlign, a satisficing alignment framework based on bounded rationality, which maximizes the primary objective (e.g., helpfulness) at inference time while ensuring secondary objectives (e.g., harmlessness) satisfy threshold constraints. Solved through duality theory, it achieves a 22.3% win rate improvement over state-of-the-art multi-objective decoding on GPT-4 evaluation.
Communicating Activations Between Language Model Agents: A method is proposed to allow LLM agents to communicate via intermediate layer activations (instead of natural language) by injecting the activation vector of Model A into the intermediate layers during the forward pass of Model B. This requires zero additional parameters or training data, while improving performance by up to 27% compared to natural language communication across multiple reasoning benchmarks, using only 1/4 of the computation.
Consistency in Language Models: Current Landscape, Challenges, and Future Directions: This paper systematically surveys the landscape of LLM consistency research, proposing a taxonomy that comprises logical consistency (negation, symmetry, transitivity), semantic consistency, factual/informational consistency, and non-logical consistency (morality/norms). It analyzes the deficiencies of evaluation methods from 2019 to 2025 and calls for the establishment of standardized multilingual benchmarks and interdisciplinary approaches.
Correlated Errors in Large Language Models: Through a large-scale empirical analysis of over 350 LLMs, this paper reveals highly correlated error patterns across different LLMs. When two models both answer a question incorrectly, they choose the same incorrect answer in approximately 60% of cases, and more accurate models exhibit higher correlation. The paper also investigates the downstream impact of this correlation on LLM-as-Judge evaluation and the labor market.
DataDecide: How to Predict Best Pretraining Data with Small Experiments: This work constructs DataDecide—the largest open model suite to date (25 data recipes \(\times\) 14 model scales \(\times\) 3 random seeds)—to systematically study how small-scale experiments can predict the best pretraining data. The study reveals that a single small-scale ranking (e.g., at 150M parameters) achieves approximately 80% pairwise decision accuracy, and continuous likelihood proxy metrics require only 0.01% of the target compute to reach over 80% prediction accuracy across multiple benchmarks.
Disentangling and Integrating Relational and Sensory Information in Transformer Architectures: This paper proposes the Dual Attention Transformer (DAT). By introducing "relational attention" heads into the standard attention mechanism, it decouples sensory and relational information, processes them in parallel, and then integrates them. DAT exhibits significant improvements in data and parameter efficiency across relational reasoning benchmarks, mathematical problem solving, image recognition, and language modeling.
EnIGMA: Interactive Tools Substantially Assist LM Agents in Finding Security Vulnerabilities: EnIGMA is an LM agent designed to autonomously solve Capture The Flag (CTF) challenges. By introducing novel interactive agent tools (debuggers and server connection utilities), it enables LM agents to execute interactive terminal programs for the first time. It achieves state-of-the-art (SOTA) results across 390 CTF challenges from 4 benchmarks and uncovers "soliloquizing," a new type of hallucination behavior.
MultiCogEval: Evaluating LLMs Across Multi-Cognitive Levels: Inspired by Bloom's Taxonomy, this work proposes a multi-cognitive level evaluation framework, MultiCogEval, to assess the medical capabilities of LLMs across three levels: knowledge mastery, comprehensive application, and situational problem-solving. The findings reveal that the performance of all models decreases significantly as cognitive complexity increases, and model scale becomes more critical at higher levels.
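
The "Correlated Errors" entry above reports how often two models pick the same wrong answer when both are wrong. One plausible way to compute such a statistic is sketched below. The function name and the toy answer lists are made up for illustration, and the paper's exact metric may differ.

```python
def error_agreement_rate(preds_a, preds_b, gold):
    """Fraction of jointly-wrong items on which both models give the same wrong answer.

    Returns None when there is no item that both models get wrong.
    """
    both_wrong = [
        (a, b) for a, b, g in zip(preds_a, preds_b, gold) if a != g and b != g
    ]
    if not both_wrong:
        return None
    same = sum(1 for a, b in both_wrong if a == b)
    return same / len(both_wrong)


# Toy multiple-choice data (made up): both models are wrong on items 2 to 5
# and agree on three of those four.
gold = ["A", "B", "C", "D", "A"]
model_a = ["A", "C", "D", "A", "B"]
model_b = ["B", "C", "D", "B", "B"]
rate = error_agreement_rate(model_a, model_b, gold)
print(rate)  # 0.75
```

Conditioning on both models being wrong matters: raw agreement would also count items where both are simply correct, which says nothing about shared failure modes.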

Browse all 22 LLM Evaluation papers →

⚡ LLM Efficiency (12)

Autonomy-of-Experts Models (AoE): AoE proposes allowing experts in an MoE to autonomously decide whether to process an input based on their own internal activation norms (rather than being determined by an external router). By reducing precomputation overhead through low-rank weight factorization, AoE outperforms traditional MoE in pre-training 700M-4B parameter language models.
Cooperation of Experts: Fusing Heterogeneous Information with Large Margin: This paper proposes the Cooperation of Experts (CoE) framework, which encodes heterogeneous information into multiplex networks. Through a two-level expert design and large-margin confidence tensor optimization, CoE achieves expert cooperation (rather than competition), comprehensively outperforming existing MoE and multiplex network methods in node classification tasks.
Curse of High Dimensionality Issue in Transformer for Long-context Modeling: This paper revisits the attention redundancy issue in sequence modeling from a supervised learning perspective and proposes the Dynamic Group Attention (DGA) mechanism. By dynamically grouping and aggregating unimportant tokens to reduce redundancy in attention computation, DGA maintains competitive performance while substantially reducing inference latency (achieving a 2.42x inference speedup for LLaMA2-7B under 16K context).
DSSD: Efficient Edge-Device LLM Deployment and Collaborative Inference via Distributed Split Speculative Decoding: This paper proposes the Distributed Split Speculative Decoding (DSSD) framework, which splits the verification stage of speculative decoding between the device and the edge. By replacing multiple uplink transmissions (the SLM's \(\gamma\) vocabulary distributions) with a single downlink transmission (a single vocabulary distribution of the LLM), DSSD significantly reduces communication latency while maintaining identical inference quality.
EasyInv: Toward Fast and Better DDIM Inversion: Proposes EasyInv, which periodically aggregates the current latent state with the previous latent state via a weighted sum (analogous to Kalman filtering) during inversion. This enhances the influence of the initial latent and suppresses noise accumulation errors, achieving comparable or even superior inversion quality to iterative methods without requiring any iterative optimization, while speeding up the inference by approximately 3x.
Efficient Length-Generalizable Attention via Causal Retrieval for Long-Context Language Modeling: This paper proposes the Grouped Cross-Attention (GCA) mechanism, which integrates chunk-level causal retrieval into the attention mechanism to achieve an end-to-end learnable retriever. The resulting Differentiable Retrieval-based Transformer (DRT) achieves near-perfect accuracy on the passkey retrieval test with a 16M context, generalizing to lengths up to 1000 times the training length.
Ladder Residual: Parallelism-Aware Architecture for Accelerating Large Model Inference: This paper proposes Ladder Residual, a simple architectural modification that shifts the input of each block from the output of the previous layer to the output of the layer before the previous one (staggered residual connection). This design decouples module computation from AllReduce communication, enabling overlap between communication and computation. It achieves a 29% end-to-end acceleration in 8-GPU Tensor Parallelism (TP) inference on a 70B model with performance comparable to standard Transformers.
Long-Short Alignment for Effective Long-Context Modeling in LLMs: This paper proposes a new perspective on length generalization from the angle of model output distributions, termed Long-Short Alignment. It highlights that the consistency of output distributions across inputs of different lengths is a key factor in length generalization. The authors introduce a Long-Short Misalignment metric and utilize it as a training regularization term, which significantly improves long-context modeling capabilities on both synthetic and natural language tasks.
Mixture of Lookup Experts: MoLE (Mixture of Lookup Experts) is proposed, which changes the input of the routed experts in MoE from intermediate features to token embeddings. This allows experts to be reparameterized into lookup tables (LUTs) and offloaded to storage devices before inference, thereby achieving inference speeds and memory footprints comparable to dense models while maintaining MoE-level performance.
MoH: Multi-Head Attention as Mixture-of-Head Attention: This paper reformulates Multi-Head Attention (MHA) into a summation form and proposes Mixture-of-Head Attention (MoH) inspired by MoE. By utilizing a router to dynamically select the most relevant subset of attention heads for each token, MoH matches or even surpasses standard MHA performance while activating only \(50\% \sim 90\%\) of the heads. It also demonstrates that pre-trained models (such as LLaMA3-8B) can be successfully converted into MoH models via continue-tuning.
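
The head routing described in the MoH entry above can be illustrated with a small example. This is a minimal sketch, assuming a softmax router with plain top-k selection and renormalized weights; the function name `route_heads` and its inputs are hypothetical, and details of the paper's actual design are not reproduced here.

```python
import math


def route_heads(router_logits, head_outputs, k):
    """Combine the k highest-scoring heads for one token.

    router_logits: one score per head (hypothetical router output).
    head_outputs:  one output vector per head, all of equal length.
    Returns (selected head indices, weighted sum of their outputs).
    """
    # Numerically stable softmax over the per-head router scores.
    peak = max(router_logits)
    exps = [math.exp(x - peak) for x in router_logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep only the k heads with the highest routing probability.
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]

    # Assumption: renormalize over the selected heads so weights sum to 1.
    norm = sum(probs[i] for i in top)
    dim = len(head_outputs[0])
    combined = [0.0] * dim
    for i in top:
        weight = probs[i] / norm
        for d in range(dim):
            combined[d] += weight * head_outputs[i][d]
    return sorted(top), combined


selected, combined = route_heads(
    [2.0, 0.0, 2.0, -1.0],                          # router scores for 4 heads
    [[1.0, 0.0], [0.0, 1.0], [3.0, 0.0], [0.0, 5.0]],
    k=2,
)
```

Only the selected heads contribute to the output, so in a real model the skipped heads would not need to be computed for that token.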

Browse all 12 LLM Efficiency papers →

📚 Pretraining (31)

A Square Peg in a Square Hole: Meta-Expert for Long-Tailed Semi-Supervised Learning: This paper proposes the Meta-Expert algorithm. Through a Dynamic Expert Allocation (DEA) module, it automatically selects the most proficient expert to generate pseudo-labels based on the sample's class assignment (head/medium/tail). It also utilizes a Multi-depth Feature Fusion (MFF) module to alleviate the model's bias towards head classes, achieving "a square peg in a square hole": each expert handles the class interval it is best at.
Algebra Unveils Deep Learning -- An Invitation to Neuroalgebraic Geometry: This paper proposes neuroalgebraic geometry as a new research direction, systematically utilizing tools from algebraic geometry (dimension, degree, singularities, fibers, critical point theory, etc.) to analyze the function spaces parameterized by deep learning models (neuromanifolds). It establishes a dictionary mapping algebraic geometric invariants to core machine learning problems (sample complexity, expressivity, training dynamics, and implicit bias).
Bayesian Neural Scaling Law Extrapolation with Prior-Data Fitted Networks: The first Bayesian extrapolation method for Neural Scaling Laws. By designing customized prior distributions (covering Down, Down-Down, and Down-Up-Down functional families) and leveraging Prior-data Fitted Networks (PFNs) to meta-learn extrapolation capability, this approach outperforms existing methods in both point estimation accuracy and uncertainty quantification quality.
Benign Overfitting in Token Selection of Attention Mechanism: This paper gives the first theoretical proof of benign overfitting in the token selection of the attention mechanism. It demonstrates that a single-layer attention network trained via gradient descent can perfectly fit noisy training labels while still generalizing, provided a balance is maintained between signal learning and noise memorization.
Chameleon: A Flexible Data-mixing Framework for Language Model Pretraining and Finetuning: Introduces the Chameleon framework, which utilizes kernel ridge leverage scores (KRLS) to quantify the importance of each training domain in the embedding space of a proxy model. It achieves comparable or superior data mixing performance at only 1/10 of DoReMi's computational cost, eliminates the need to retrain the proxy model when introducing new domains, and handles both pretraining and finetuning scenarios in a unified way.
Counting in Small Transformers: The Delicate Interplay between Attention and Feed-Forward Layers: Through a histogram counting task, this paper reveals the delicate division of labor between attention layers and feed-forward networks (FFNs) in small Transformers: attention excels at relation-based counting, whereas FFNs are responsible for inventory-based counting (dictionary memorization). Which of the two strategies emerges is determined by the relative sizes of the embedding dimension \(d\), hidden layer size \(p\), and vocabulary size \(T\).
Density Ratio Estimation-based Bayesian Optimization with Semi-Supervised Learning: This paper proposes DRE-BO-SSL, which incorporates semi-supervised learning (label propagation/label spreading) into density ratio estimation-based Bayesian optimization. By utilizing unlabeled data points, it alleviates the over-exploitation issue of supervised classifiers, achieving a better balance between exploration and exploitation.
DipLLM: Fine-Tuning LLM for Strategic Decision-Making in Diplomacy: This paper proposes DipLLM, which decomposes the exponentially large combinatorial action space of the board game Diplomacy into unit-level decision sequences through an autoregressive factorization framework, and fine-tunes an LLM to learn equilibrium strategies, outperforming Cicero using only 1.5% of its training data.
Does Data Scaling Lead to Visual Compositional Generalization?: This paper systematically investigates the impact of data scale and data diversity on the compositional generalization of visual models through controlled experiments. The authors find that data diversity, rather than data volume, is the key driver of compositional generalization. They also prove that when representations exhibit a linearly factored structure, only 2 compositional samples per concept value are required for perfect generalization.
Evaluating Morphological Alignment of Tokenizers in 70 Languages: This work extends the MorphScore evaluation framework to 70 languages to systematically investigate the correlation between the morphological boundary alignment of tokenizers and downstream task performance. The results show that morphological alignment explains only a minimal amount of performance variance and exhibits a negative correlation, challenging the mainstream assumption that morphologically aligned tokenization benefits model performance.

Browse all 31 Pretraining papers →

✏️ Knowledge Editing (2)

Representation Shattering in Transformers: A Synthetic Study with Knowledge Editing: Through synthetic experiments training Transformers on cyclic knowledge graphs, this work reveals that knowledge editing (KE) "shatters" the learned geometric representation manifolds inside the model. The degree of shattering is positively correlated with edit distance (\(r^2=0.905\)). Based on this, "representation shattering" is proposed as a mechanistic hypothesis for how KE impairs model capabilities, and the universality of this phenomenon is validated on Llama 3 and Mamba.
WikiBigEdit: Understanding the Limits of Lifelong Knowledge Editing in LLMs: This paper proposes WikiBigEdit, a large-scale lifelong knowledge editing benchmark containing over 500k real Wikidata knowledge updates, revealing the severe limitations of existing knowledge editing methods under realistic scales—general methods such as retrieval augmentation and continual fine-tuning paired with model merging surprisingly perform better.

💬 LLM (Other) (28)

B-score: Detecting biases in large language models using response history: The paper proposes B-score, a metric that detects bias by comparing the difference in probability of LLM responses between single-turn and multi-turn dialogues. It discovers that LLMs can "self-debias" in multi-turn dialogues and leverages B-score to improve answer verification accuracy.
BEST-Route: Adaptive LLM Routing with Test-Time Optimal Compute: The paper proposes BEST-Route (Best-of-\(n\) Enhanced Sampling and Test-time Route Optimization). It introduces the best-of-\(n\) sampling strategy into traditional query routing, allowing the router to not only select the model but also adaptively decide the sampling number \(n\). By replacing a single call to a large model with best-of-\(n\) sampling from a small model, it reduces inference costs by up to 60% with less than 1% performance loss.
Beyond Induction Heads: In-Context Meta Learning Induces Multi-Phase Circuit Emergence: This paper designs an In-Context Meta-Learning (ICML) experimental setup to reveal that the internal circuits of Transformers undergo three distinct phases of emergence (Bigram \(\rightarrow\) Label Attention \(\rightarrow\) Chunk Example) during the training process of acquiring in-context meta-learning capabilities, rather than the single-stage sudden jump observed in prior induction head studies. This provides a new perspective on understanding the deep mechanisms of ICL.
Binary Hypothesis Testing for Softmax Models and Leverage Score Models: This work investigates the binary hypothesis testing problem for Softmax and Leverage Score models from a theoretical perspective. It establishes tight bounds on the number of queries required to distinguish between two parameterized models under an energy constraint, which is relevant to understanding the discriminative capabilities of LLMs across different domains.
Breaking Silos: Adaptive Model Fusion Unlocks Better Time Series Forecasting: Proposes TimeFuse—a sample-level adaptive model fusion framework. It characterizes input time series features using meta-features and trains a learnable fuser to predict the optimal model combination weights, achieving near-universal improvements (outperforming the best single model on 95.1% of samples) across multiple forecasting benchmarks.
Build Agent Advocates, Not Platform Agents: A position paper arguing that language model agents (LMAs), if controlled by platform companies, will become "platform agents" that exacerbate surveillance, lock-in, and attention manipulation. The authors propose developing user-controlled "agent advocates" to protect individual autonomy, recommending three key interventions: open models/compute, interoperability standards, and market regulation.
Defending LVLMs Against Vision Attacks through Partial-Perception Supervision: Proposes DPS (Defense through Partial-Perception Supervision), which utilizes responses from cropped images as "weak supervision" to guide the full-image model for self-correction during inference. This achieves training-free, black-box visual attack defense for LVLMs, reducing the average attack success rate by 76.3%.
Expert Evaluation of LLM World Models: A High-Tc Superconductivity Case Study: Using the field of high-temperature superconductivity (HTS) as a case study, an expert-level dataset (1,726 papers + 67 expert questions) is constructed to systematically evaluate the scientific literature understanding capabilities of six LLM systems. The evaluation reveals that RAG systems based on curated literature significantly outperform general closed-source models in terms of factual completeness and evidentiary support.
Generalized Interpolating Discrete Diffusion: The Generalized Interpolating Discrete Diffusion (GIDD) framework is proposed, which generalizes Masked Diffusion Models (MDM) to a family of diffusion processes supporting arbitrary time-varying mixture distributions. By combining mask and uniform noise, GIDD equips the model with self-correction capabilities and achieves compute-matched SOTA in diffusion language modeling.
Generative Social Choice: The Next Generation: Extends the generative social choice framework to scenarios with cost/budget constraints and approximate queries, proposes the DemocraticProcess algorithm with near-optimal approximate proportional representation guarantees, and implements a practical system PROSE (based on GPT-4o) validated on drug review and urban governance datasets.

Browse all 28 LLM (Other) papers →

📖 NLP Understanding (1)

Cover Learning for Large-Scale Topology Representation: Proposes Cover Learning as a unified unsupervised learning problem. From an optimization perspective, three loss functions (measure, geometry, topology) are designed to learn topologically faithful covers of datasets. The resulting simplicial complexes are more compact than standard geometric complexes in topological inference and can represent higher-dimensional information than Mapper graphs in large-scale topological visualization.

✍️ Text Generation (1)

Understanding and Mitigating Memorization in Diffusion Models for Tabular Data: This work presents the first systematic study of the memorization phenomenon in tabular diffusion models, finding that memorization intensifies as training epochs increase and is strongly correlated with dataset size. It proposes TabCutMix/TabCutMixPlus to mitigate memorization via feature-segment swapping while maintaining generation quality.
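
The feature-segment swapping mentioned in the TabCutMix entry above can be illustrated as follows. This is a minimal sketch, assuming each feature is swapped independently with a fixed probability between two rows that are taken to share a class label; the function name `tab_cutmix` and the `swap_prob` parameter are hypothetical and not taken from the paper.

```python
import random


def tab_cutmix(row_a, row_b, swap_prob=0.5, rng=None):
    """Build a new row by taking each feature from row_a or row_b.

    Assumption: both rows belong to the same class, so the mixed row can
    keep that label. Features are swapped independently per column.
    """
    if len(row_a) != len(row_b):
        raise ValueError("rows must have the same number of features")
    rng = rng or random.Random()
    # True means "take this feature from row_b instead of row_a".
    mask = [rng.random() < swap_prob for _ in row_a]
    return [b if take_b else a for a, b, take_b in zip(row_a, row_b, mask)]


row_a = [34, "engineer", 72000.0, "married"]
row_b = [51, "teacher", 48000.0, "single"]
mixed = tab_cutmix(row_a, row_b, swap_prob=0.5, rng=random.Random(0))
```

Every value in the mixed row still comes from a real row in the same column, which is why this kind of augmentation keeps per-feature values realistic while breaking exact copies of training rows.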

🗣️ Dialogue Systems (2)

Investigating Non-Transitivity in LLM-as-a-Judge: Reveals the non-transitivity problem in LLM-as-a-Judge preferences (where A > B and B > C do not guarantee A > C), demonstrating that fixed-baseline rankings are highly unreliable, and introducing a Round-Robin Bradley-Terry ranking paradigm alongside an efficient Swim tournament strategy.
Position: Uncertainty Quantification Needs Reassessment for Large-language Model Agents: Reviewing the conflicting definitions in the literature, this position paper argues that the traditional dichotomy between aleatoric and epistemic uncertainty fundamentally fails in interactive LLM scenarios. It proposes three new research directions: underspecification uncertainty (task/context under-specification), interactive learning (reducing uncertainty through follow-up questions), and output uncertainty (expressing uncertainty via natural language rather than scalar values).
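
The non-transitivity described in the LLM-as-a-Judge entry above (A > B and B > C without A > C) can be checked mechanically on a table of pairwise verdicts. This is a minimal sketch, assuming verdicts are stored as a dictionary keyed by (winner, loser) pairs; the function name `find_preference_cycles` and the data layout are hypothetical.

```python
from itertools import permutations


def find_preference_cycles(beats):
    """Return every 3-cycle a > b > c > a found in pairwise judge verdicts.

    beats: dict mapping (winner, loser) -> True for each judged pair.
    Each cycle is reported once, rotated to start at its smallest name.
    """
    models = sorted({name for pair in beats for name in pair})
    cycles = set()
    for a, b, c in permutations(models, 3):
        if beats.get((a, b)) and beats.get((b, c)) and beats.get((c, a)):
            # The same cycle appears under three rotations; keep one.
            cycles.add(min((a, b, c), (b, c, a), (c, a, b)))
    return sorted(cycles)


verdicts = {
    ("A", "B"): True,
    ("B", "C"): True,
    ("C", "A"): True,  # closes the cycle A > B > C > A
    ("A", "D"): True,
    ("B", "D"): True,
    ("C", "D"): True,
}
cycles = find_preference_cycles(verdicts)
```

When such a cycle exists, ranking models against a single fixed baseline can produce an order that depends on which baseline was picked, which is the unreliability the entry refers to.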

🌐 Multilingual & Translation (1)

KELPS: A Framework for Verified Multi-Language Autoformalization via Semantic-Syntactic Alignment: Proposes an intermediate representation based on assertional logic, called Knowledge Equation (KE), to achieve rule-based translation from natural language mathematical statements to multiple formal languages (Lean4/Coq/Isabelle), achieving an 88.9% pass@1 syntax accuracy on MiniF2F and outperforming DeepSeek-V3 and Herald.

🔍 Information Retrieval & RAG (6)

Don't Lag, RAG: Training-Free Adversarial Detection Using RAG: This paper proposes the VRAG framework, which constructs a training-free pipeline using an adversarial patch database + Vision Retrieval-Augmented Generation (VRAG) + VLM inference. It achieves highly efficient detection of various adversarial patch attacks, with Gemini-2.0 reaching 98% accuracy and the open-source model UI-TARS-72B-DPO reaching 95%.
FedRAG: A Framework for Fine-Tuning Retrieval-Augmented Generation Systems: FedRAG proposes a fine-tuning framework for RAG systems that supports both centralized and federated architectures. It addresses the lack of unified fine-tuning tools in the RAG ecosystem and enables a seamless transition from centralized to federated training through lightweight abstractions.
POQD: Performance-Oriented Query Decomposer for Multi-Vector Retrieval: POQD, a performance-oriented query decomposition framework, is proposed. It utilizes an LLM-based Prompt Optimizer to iteratively optimize query decomposition prompts, and jointly optimizes the prompts and downstream RAG model parameters through an alternating training algorithm, significantly outperforming existing methods on retrieval and end-to-end QA tasks.
RAPID: Long-Context Inference with Retrieval-Augmented Speculative Decoding: This paper proposes RAPID, a framework combining RAG with speculative decoding. It utilizes a RAG drafter (an LLM running on compressed retrieval contexts) to generate candidate tokens for a long-context target LLM, and enhances the target distribution through test-time knowledge distillation. This simultaneously delivers a >2× speedup and improved generation quality in long-context inference.
Unable to Forget: Proactive Interference Reveals Working Memory Limits in LLMs Beyond Context Length: Drawing on the Proactive Interference (PI) paradigm from cognitive science, this study finds that the information retrieval accuracy of LLMs decreases log-linearly to zero as the amount of interfering information increases. This reveals a "working memory" capacity bottleneck that is independent of context length and cannot be effectively mitigated by prompt engineering.
Understanding Synthetic Context Extension via Retrieval Heads: This paper reveals the underlying mechanism of why synthetic context extension works through systematic experiments: the "retrieval heads" trained on synthetic data highly overlap with those trained on real data. The recall rate of retrieval heads can predict downstream long-context task performance. Mechanistic necessity of retrieval heads is demonstrated using attention knockout and activation patching.

💻 Code Intelligence (9)

AdaptiveStep: Automatically Dividing Reasoning Step through Model Confidence: Proposes AdaptiveStep, a method that automatically divides reasoning steps based on model prediction confidence to train a more precise Process Reward Model (ASPRM). On mathematical reasoning and code generation tasks, it surpasses existing open-source PRMs at less than 70% of the data construction cost, and further enhances reasoning performance through token-level value-guided decoding.
EffiCoder: Enhancing Code Generation in Large Language Models through Efficiency-Aware Fine-tuning: EffiCoder constructs an "accurate and efficient" instruction tuning dataset named EffiInstruct. It enables code LLMs to significantly reduce execution time and total memory overhead while improving the pass@1 rate, demonstrating that "efficiency can be learned through data design."
EpiCoder: Encompassing Diversity and Complexity in Code Generation: This paper proposes a code data synthesis framework based on "Feature Trees." By extracting hierarchical semantic features from code and iteratively evolving them, the framework achieves precise control over the complexity and diversity of synthetic data. The resulting EpiCoder model series achieves state-of-the-art (SOTA) performance among similarly sized models on both function-level and file-level code generation benchmarks.
Function-to-Style Guidance of LLMs for Code Translation: F2STrans is proposed to progressively fine-tune LLMs in two stages: functional learning (correctness) and style learning (readability). This allows Qwen-1.5B to outperform prompt-enhanced Qwen-32B and GPT-4 on average across 20 code translation scenarios.
Mind the Gap: A Practical Attack on GGUF Quantization: This work proposes the first attack targeting the GGUF quantization format. It leverages quantization errors as "degrees of freedom" to train a malicious quantized model that behaves normally in full precision but triggers backdoors after quantization. This approach is highly effective in unsafe code generation (\(\Delta=88.7\%\)), targeted content injection (\(\Delta=85.0\%\)), and benign refusal (\(\Delta=30.1\%\)).
Reasoning Through Execution: Unifying Process and Outcome Rewards for Code Generation: ORPS (Outcome-Refining Process Supervision) is proposed, which unifies process and outcome rewards in a tree-search framework by combining code execution feedback with LLM self-criticism. It achieves a 26.9% accuracy improvement and a 42.2% efficiency boost in code generation without training a PRM.
Robust Learning of Diverse Code Edits (NextCoder): This work proposes a synthetic code editing data generation pipeline alongside a robust adaptation algorithm SeleKT (Selective Knowledge Transfer). By performing periodic top-k sparse projections of task vectors during fine-tuning, the model is equipped with strong specialized code editing capabilities while preserving its original code generation and general reasoning capacities. The resulting NextCoder model family outperforms same-sized or even larger models across five code-editing benchmarks.
SparseLoRA: Accelerating LLM Fine-Tuning with Contextual Sparsity: This paper proposes SparseLoRA, which dynamically selects subsets of weights for forward and gradient computation via contextual sparsity. It migrates the inference-time sparsity acceleration paradigm to the LLM fine-tuning stage for the first time, achieving up to 2.2× reduction in FLOPs and 1.6× measured speedup, while maintaining accuracy.
Training Software Engineering Agents and Verifiers with SWE-Gym: This paper proposes SWE-Gym, the first environment designed for training software engineering (SWE) agents, containing 2,438 real-world task instances from 11 open-source Python repositories. By leveraging rejection sampling fine-tuning on SWE-Gym to train SWE agents and verifiers, it achieves resolve rates of \(32.0\%\) on SWE-Bench Verified and \(26.0\%\) on SWE-Bench Lite, setting a new SOTA for open-weight SWE agents.

🎨 Image Generation (92)

Action-Minimization Meets Generative Modeling: Efficient Transition Path Sampling with the Onsager-Machlup Functional: This paper proposes interpreting the score functions of pretrained generative models (diffusion models and flow matching) as drift terms in stochastic dynamics. By minimizing the Onsager-Machlup (OM) action functional, the pretrained models are repurposed in a zero-shot manner for transition path sampling (TPS) in molecular systems. This achieves physically realistic transition paths at a fraction of the computational cost of traditional methods on systems like alanine dipeptide and fast-folding proteins.
All-atom Diffusion Transformers: Unified Generative Modelling of Molecules and Materials: This work proposes the All-atom Diffusion Transformer (ADiT), a two-stage framework that maps molecules and crystals into a unified latent space via a VAE, and then utilizes a Diffusion Transformer to generate new samples within this latent space. It is the first to achieve simultaneous generation of periodic materials (crystals) and non-periodic molecular systems using a single model. ADiT achieves SOTA performance on MP20, QM9, and GEOM-DRUGS, while being an order of magnitude faster than equivariant diffusion models.
Angle Domain Guidance: Latent Diffusion Requires Rotation Rather Than Extrapolation: It is discovered that the root cause of color distortion in Classifier-Free Guidance (CFG) is the amplification of sample norms in the latent space. To address this, the Angle Domain Guidance (ADG) algorithm is proposed, which enhances guidance in the angle domain rather than the amplitude domain. By constraining norm variation while optimizing angular alignment, ADG eliminates abnormal color saturation under high guidance weights while maintaining or even improving text-image alignment.
Annealing Flow Generative Models Towards Sampling High-Dimensional and Multi-Modal Distributions: Proposes Annealing Flow (AF)—a Continuous Normalizing Flow (CNF) based method for sampling high-dimensional multi-modal distributions. It trains with a dynamic Optimal Transport (OT) objective combined with Wasserstein regularization to guide mode exploration through an annealing process, significantly outperforming existing NF and MCMC methods in high-dimensional multi-modal settings.
Autoencoder-Based Hybrid Replay for Class-Incremental Learning: Proposes the Autoencoder-Based Hybrid Replay (AHR) strategy, which utilizes a Hybrid Autoencoder (HAE) to compress and store samples in the latent space rather than the original input space. By combining Charged Particle System Energy Minimization (CPSEM) and the Repulsion Force Algorithm (RFA) to incrementally embed new class centroids, it reduces the memory complexity from \(\mathcal{O}(t)\) to \(\mathcal{O}(0.1t)\) in the worst-case scenario while maintaining SOTA performance.
Beyond Bradley-Terry Models: A General Preference Model for Language Model Alignment: Proposes Preference Embedding—embedding responses into a multi-dimensional latent space to capture complex preference structures (including intransitive preferences), achieving \(O(K)\) query complexity (identical to Bradley-Terry models but with significantly higher expressiveness). Combined with General Preference Optimization (GPO), it outperforms Bradley-Terry reward models on RewardBench and AlpacaEval 2.0.
Beyond One-Hot Labels: Semantic Mixing for Model Calibration: Proposes CSM (Calibration-aware Semantic Mixing), which leverages pre-trained diffusion models to generate high-fidelity semantically mixed samples (e.g., cat-dog hybrids) and accurately re-annotates soft label confidence using CLIP. Training with an \(L_2\) loss achieves superior model confidence calibration compared to existing calibration methods.
BRIDGE: Bootstrapping Text to Control Time-Series Generation via Multi-Agent Iterative Optimization and Diffusion Modeling: This paper proposes the BRIDGE framework, which generates high-quality text-to-time-series paired data using an LLM multi-agent system and utilizes a hybrid prompt of semantic prototypes and textual descriptions to drive a diffusion model. It achieves cross-domain, instance-level text-controlled time-series generation (TC-TSG), achieving SOTA on 11 of 12 datasets.
Broadband Ground Motion Synthesis by Diffusion Model with Minimal Condition: Proposes HEGGS (High-fidelity Earthquake Groundmotion Generation System), which leverages the naturally paired characteristics of waveforms in seismic datasets, combined with a conditional latent diffusion model and an Amplitude Correction Module (ACM), to generate high-fidelity three-component seismic waveforms end-to-end with minimal conditional information (latitude, longitude, focal depth, magnitude).
Compositional Scene Understanding through Inverse Generative Modeling: This paper proposes the Inverse Generative Modeling (IGM) framework, which reformulates scene understanding tasks as an inversion problem of searching for optimal conditional parameters within compositional generative models. By composing multiple small diffusion models to represent complex scenes, the method achieves robust out-of-distribution generalization capabilities and directly leverages pre-trained text-to-image models for zero-shot multi-object perception.

Browse all 92 Image Generation papers →

🎬 Video Generation (7)

AsymRnR: Video Diffusion Transformers Acceleration with Asymmetric Reduction and Restoration: Proposes AsymRnR—a training-free video DiT acceleration method. Based on the observation that redundancy levels vary across different attention components (Q/K/V), layers, and denoising steps, it asymmetrically reduces tokens to achieve lossless acceleration.
Ca2-VDM: Efficient Autoregressive Video Diffusion Model with Causal Generation and Cache Sharing: Ca2-VDM is proposed, which eliminates redundant calculations of conditional frames in autoregressive video diffusion models through two key designs: Causal Generation and Cache Sharing. It reduces computational complexity from quadratic to linear, generating 80-frame videos 2.5 times faster than the baseline while maintaining state-of-the-art generation quality.
Data-Juicer Sandbox: A Feedback-Driven Suite for Multimodal Data-Model Co-development: This work proposes Data-Juicer Sandbox, a feedback-driven sandbox suite that systematically explores the interactions between data processing operators (OPs) and model performance in low-cost, small-scale experiments through a "Probe-Analyze-Refine" workflow, transferring the obtained data recipes to large-scale scenarios and achieving first place on the VBench leaderboard.
Diffusion Adversarial Post-Training for One-Step Video Generation: This paper proposes the Adversarial Post-Training (APT) framework, which introduces an adversarial training phase after diffusion model pre-training to achieve high-quality one-step video generation (2 seconds, 1280×720, 24fps) with a model named Seaweed-APT.
How Far is Video Generation from World Model: A Physical Law Perspective: This work systematically evaluates whether video generation models can discover physical laws from purely visual data by constructing a 2D physical simulation video dataset that strictly adheres to classical mechanics. It reveals that current models merely memorize patterns within the training distribution rather than generalizing to novel physical conditions.
MimicMotion: High-Quality Human Motion Video Generation with Confidence-aware Pose Guidance: Built upon Stable Video Diffusion, this pose-guided human video generation framework achieves an FID-VID of 9.3 (prev. best 12.4) on the TikTok dataset by encoding pose estimation confidence into guidance signals, amplifying training loss for high-confidence hand regions, and employing position-aware progressive latent fusion. It also natively supports the generation of smooth videos of arbitrary length.
RIFLEx: A Free Lunch for Length Extrapolation in Video Diffusion Transformers: By systematically analyzing the roles of different frequency components in RoPE positional encoding, this paper identifies an "intrinsic frequency" that dominates temporal repetition during extrapolation. It proposes RIFLEx, a minimal intervention scheme that scales down only this frequency to keep it within a single period after extrapolation, achieving high-quality training-free 2× video extrapolation on CogVideoX-5B and HunyuanVideo.
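
The frequency adjustment described in the RIFLEx entry above can be illustrated numerically. This is a minimal sketch, assuming the index of the intrinsic frequency is already known and that "within a single period" means the period \(2\pi/\theta\) is at least the extrapolated length; the function name `scale_intrinsic_frequency` and its parameters are hypothetical, and how the paper identifies the intrinsic frequency is not reproduced here.

```python
import math


def scale_intrinsic_frequency(freqs, k, train_len, factor):
    """Lower only the k-th RoPE frequency so one period spans the new length.

    freqs:     temporal RoPE angular frequencies (radians per frame).
    k:         index of the intrinsic frequency (assumed to be given).
    train_len: number of frames seen in training.
    factor:    extrapolation factor, e.g. 2 for 2x longer videos.
    """
    target_len = train_len * factor
    # Largest frequency whose period 2*pi/theta still covers target_len.
    limit = 2 * math.pi / target_len
    adjusted = list(freqs)          # leave the caller's list untouched
    adjusted[k] = min(freqs[k], limit)
    return adjusted


freqs = [1.0, 0.1, 0.01]
adjusted = scale_intrinsic_frequency(freqs, k=1, train_len=49, factor=2)
```

All other frequencies are left unchanged, which matches the entry's description of a minimal intervention on a single component.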

🧩 Multimodal VLM (42)

CoCoA-Mix: Confusion-and-Confidence-Aware Mixture Model for Context Optimization: The CoCoA-Mix framework is proposed to construct a prompt mixture model via a confusion-aware loss (CoA-loss) and confidence-aware weights (CoA-weights), simultaneously improving both the specialization and generalization of VLM prompt tuning without introducing extra network parameters.
CoMemo: LVLMs Need Image Context with Image Memory: This work proposes CoMemo, a dual-path architecture where the Context path concatenates image tokens with text for autoregressive processing, and the Memory path utilizes cross-attention for persistent image memory. Combined with RoPE-DHR positional encoding to maintain 2D spatial awareness and alleviate long-range decay, and a three-stage training strategy to balance the dual paths, this approach comprehensively outperforms LVLM-S and LVLM-X under equivalent settings.
Context is Key: A Benchmark for Forecasting with Essential Textual Information: This paper proposes the Context is Key (CiK) benchmark—consisting of 71 manually designed forecasting tasks across 7 domains. Each task requires combining numerical history and natural language context to make accurate predictions. The paper also introduces the RCRPS evaluation metric and the Direct Prompt method, demonstrating that a simple prompting strategy on Llama-3.1-405B (RCRPS=0.159) significantly outperforms all statistical and time-series foundation models.
Core Knowledge Deficits in Multi-Modal Language Models: This paper proposes the CoreCognition benchmark (comprising 12 core cognitive abilities across 1,503 questions). Following a large-scale evaluation of 230 MLLMs, it reveals that models systematically lag behind humans in foundational cognitive abilities. Moreover, this deficit does not improve with larger model scales; instead, larger models tend to rely more heavily on shortcut learning rather than genuine understanding.
Diffuse Everything: Multimodal Diffusion Models on Arbitrary State Spaces: This work proposes a unified framework for constructing multimodal diffusion models on arbitrary state spaces. By introducing independent decoupled noise schedules for each modality, it simultaneously achieves both unconditional and modality-conditional generation within a single model without requiring external tokenizers or VAE preprocessing.
Do Vision-Language Models Really Understand Visual Language?: This work systematically evaluates the diagram understanding capabilities of large vision-language models (LVLMs) by constructing a comprehensive test suite (including synthetic and real-world diagrams). It reveals that while models can identify entities, their understanding of relationships is extremely limited; their seemingly excellent performance in diagram reasoning actually stems from utilizing background knowledge as a shortcut.
Dynamic Mixture of Curriculum LoRA Experts for Continual Multimodal Instruction Tuning: This paper proposes the D-MoLE method, which automatically evolves the MLLM architecture under parameter budget constraints to continuously adapt to new tasks, achieving an average improvement of 15% over the best baseline through a dynamic layer-wise LoRA expert allocator and a gradient-based inter-modal continual curriculum strategy.
Efficient Quantification of Multimodal Interaction at Sample Level: Proposes the LSMI (Lightweight Sample-wise Multimodal Interaction) estimator, achieving precise and efficient sample-wise quantification of multimodal interactions (redundancy, uniqueness, and synergy) on real-world continuous distribution data for the first time, and demonstrates its practical value in data partitioning, knowledge distillation, and model ensemble.
ELEMENTAL: Interactive Learning from Demonstrations and Vision-Language Models for Reward Design in Robotics: ELEMENTAL integrates vision-language models (VLMs) with inverse reinforcement learning (IRL) to extract feature functions via VLMs, optimize weights via IRL, and iteratively improve via self-reflection, achieving a 42.3% improvement over EUREKA across 9 IsaacGym tasks.
ERL-VLM: Enhancing Rating-Based RL to Leverage Feedback from Large VLMs: ERL-VLM is proposed to leverage Large Vision-Language Models (VLMs) to provide absolute ratings for single trajectories instead of pairwise preferences. By combining stratified sampling and MAE loss to address data imbalance and noisy labels, it significantly improves VLM feedback-driven reward learning.

Browse all 42 Multimodal VLM papers →

🧠 VLM Reasoning (5)

Diffusion-VLA: Generalizable and Interpretable Robot Foundation Model via Self-Generated Reasoning: DiVLA (Diffusion-VLA) is proposed to unify the reasoning capabilities of autoregressive VLMs and the action generation capabilities of diffusion models into an end-to-end framework. By directly embedding self-generated language reasoning into policy learning via a Reasoning Injection Module, DiVLA achieves generalization to unseen objects, interpretable action decision-making, and high-speed inference (82Hz for the 2B model).
Overcoming Multi-step Complexity in Multimodal Theory-of-Mind Reasoning: A Scalable Bayesian Planner: Proposes a scalable Bayesian Theory-of-Mind (ToM) planner that decomposes multi-step reasoning into step-by-step Bayesian updates. By leveraging a weak-to-strong control mechanism, it transfers specialized ToM capabilities from smaller models to large language models (up to 405B), outperforming the previous SOTA by 4.6% on multimodal ToM benchmarks.
Re-ranking Reasoning Context with Tree Search Makes Large Vision-Language Models Stronger: This paper proposes the RCTS framework, which constructs a reasoning-context-rich knowledge base via a self-consistency evaluation mechanism and re-ranks retrieved exemplars using Monte Carlo Tree Search with Heuristic Rewards (MCTS-HR). This enables LVLMs to significantly outperform raw ICL and Vanilla-RAG methods across multiple VQA datasets (by an average of +3-4%).
Reasoning Limitations of Multimodal Large Language Models. A Case Study of Bongard Problems: This paper systematically evaluates the abstract visual reasoning capabilities of 4 closed-source and 4 open-source MLLMs on three datasets: the classic synthetic Bongard Problems (BPs), Bongard HOI, and Bongard-OpenWorld. Seven problem-solving strategies and a new dataset, Bongard-RWR (which represents synthetic BP concepts using real-world images), are proposed, revealing that the extremely poor performance of MLLMs on synthetic BPs is not due to domain shift but rather an inherent limitation in abstract reasoning.
Why Is Spatial Reasoning Hard for VLMs? An Attention Mechanism Perspective on Focus Areas: This work investigates the causes of spatial reasoning failures in VLMs from a mechanistic interpretability perspective, finding that image tokens obtain only ~10% of attention despite making up 90% of the input, and that the geometric distribution of attention is the key factor. The authors propose AdaptVis, a training-free decoding method that adaptively adjusts image attention temperature based on runtime confidence, achieving up to a 50% absolute improvement on the WhatsUp dataset.

⚡ VLM Efficiency (3)

CoreMatching: A Co-adaptive Sparse Inference Framework with Token and Neuron Pruning for Comprehensive Acceleration of Vision-Language Models: This work is the first to reveal the intrinsic correlation between token sparsity and neuron sparsity in VLMs—core neurons and core tokens mutually determine and reinforce each other. Based on this correlation, the authors propose the CoreMatching co-adaptive sparse inference framework, achieving simultaneous acceleration in both pre-filling and decoding stages, which leads to a 5× FLOPs reduction and 10× overall speedup.
MMInference: Accelerating Pre-filling for Long-Context VLMs via Modality-Aware Permutation Sparse Attention: This paper proposes MMInference, which accelerates the prefill stage of long-context VLMs by up to 8.3× in a 1M-token scenario without modifying model weights or fine-tuning, while maintaining task accuracy. This is achieved by combining modality-aware permutation sparse attention, head-level offline pattern search, online dynamic indexing, and customized GPU kernels.
SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference: SparseVLM proposes the first training-free, text-guided visual token sparsification framework. By selecting vision-related text tokens as "raters" to evaluate the importance of visual tokens, combined with an adaptive pruning ratio and a token recycling mechanism, it preserves 99.1% of the original performance on LLaVA while retaining only 192 tokens (a 66.7% reduction).

🎵 Audio & Speech (15)

Aligning Spoken Dialogue Models from User Interactions: This work introduces the first comprehensive preference alignment framework designed for a full-duplex spoken dialogue model (Moshi). By automatically constructing content and temporal preference pairs from over 150k real user voice interactions and performing DPO-LN alignment exclusively on text tokens, this approach achieves an average QA improvement of 3.1% and a safety increase of 6.9%, with human evaluations confirming enhanced multi-turn dialogue quality.
BinauralFlow: A Causal and Streamable Approach for High-Quality Binaural Speech Synthesis with Flow Matching Models: This work proposes BinauralFlow, a streamable binaural speech synthesis framework based on conditional Flow Matching. Incorporating a causal U-Net architecture and a continuous inference pipeline, it produces high-fidelity, streamable binaural audio. In perceptual evaluations, a 42% confusion rate demonstrates that the synthesized audio is virtually indistinguishable from real recordings.
Bridging the Language Gap: Synthetic Voice Diversity via Latent Mixup for Equitable Speech Recognition: This paper proposes LatentVoiceMix, which performs mixup interpolation in the latent space of the speaker style encoder of the voice conversion model Diff-HierVC to generate synthetic speech data with novel voice characteristics for augmenting ASR training. This approach achieves superior WER improvements on the low-resource language Wolof compared to waveform augmentation, spectrogram augmentation, and standard voice conversion.
Do Not Mimic My Voice: Speaker Identity Unlearning for Zero-Shot Text-to-Speech: This paper introduces the speaker identity unlearning task in zero-shot TTS for the first time, designing a Teacher-Guided Unlearning (TGU) framework that introduces randomness to make models "forget" target speaker voiceprint features while maintaining high-quality speech synthesis capabilities for other speakers, and proposes the spk-ZRF metric to quantify unlearning effectiveness.
ETTA: Elucidating the Design Space of Text-to-Audio Models: ETTA systematically elucidates the design space (data, architecture, training objectives, and sampling strategies) of text-to-audio (TTA) models through large-scale experiments and, based on these findings, builds a state-of-the-art TTA model trained only on public data.
FLAM: Frame-Wise Language-Audio Modeling: Proposes FLAM, a frame-level audio-language contrastive model that achieves precise temporal localization of open-vocabulary sound events through text-dependent logit bias correction and a million-scale synthetic SED dataset, while maintaining outstanding performance in global retrieval and zero-shot classification.
IMPACT: Iterative Mask-based Parallel Decoding for Text-to-Audio Generation with Diffusion Modeling: This paper proposes the IMPACT framework, which combines iterative mask-based parallel decoding (MGM) with latent diffusion models (LDMs) for text-to-audio generation in a continuous latent space. It replaces heavy attention layers with a lightweight MLP diffusion head and introduces an unconditional pre-training stage, achieving state-of-the-art (SOTA) FD/FAD metrics on AudioCaps while maintaining an inference speed comparable to the fastest MAGNET-S model.
Long-Form Speech Generation with Spoken Language Models: Proposes SpeechSSM, the first textless spoken language model capable of learning and generating up to 16 minutes of speech in a single decoding session. It leverages the Griffin hybrid SSM architecture to achieve constant-memory decoding and infinite context, and introduces the LibriSpeech-Long evaluation benchmark along with new embedding and LLM-as-a-judge metrics.
MuseControlLite: Multifunctional Music Generation with Lightweight Conditioners: This work proposes MuseControlLite, which introduces Rotary Position Embedding (RoPE) into decoupled cross-attention layers. This enables precise time-varying conditional control for text-to-music generation with only 85M trainable parameters (6.75x fewer than ControlNet), while pioneering unified support for both music attribute control and audio inpainting/outpainting.
NTPP: Generative Speech Language Modeling for Dual-Channel Spoken Dialogue via Next-Token-Pair Prediction: Proposes the Next-Token-Pair Prediction (NTPP) paradigm, which models the joint distribution of dual-channel spoken dialogue in a speaker-independent manner using a decoder-only architecture for the first time, achieving more natural turn-taking, lower inference latency, and stronger speaker independence.

Browse all 15 Audio & Speech papers →

🧊 3D Vision (17)

ADHMR: Aligning Diffusion-based Human Mesh Recovery via Direct Preference Optimization: This work introduces the concept of DPO into diffusion-based Human Mesh Recovery (HMR). By training an HMR-Scorer to evaluate prediction quality and constructing a preference dataset (winner/loser pairs), the base diffusion model is fine-tuned via DPO, improving HMR performance on in-the-wild images without requiring 3D annotations.
D-Fusion: Direct Preference Optimization for Aligning Diffusion Models with Visually Consistent Samples: This paper proposes D-Fusion, a method that constructs visually consistent preference data pairs and preserves denoising trajectories via mask-guided Self-Attention Fusion. It addresses the performance limitations in training diffusion models with DPO caused by visual inconsistency, significantly improving prompt-image alignment quality across various RL algorithms and prompt types.
Diverse Prototypical Ensembles Improve Robustness to Subpopulation Shift: The paper proposes Diversified Prototypical Ensemble (DPE), which replaces the standard linear classification head with multiple diverse prototype classifiers. By utilizing both explicit (inter-prototype similarity loss) and implicit (bootstrap sampling) diversification strategies, DPE adaptively discovers subpopulation decision boundaries without requiring subpopulation annotations, significantly improving worst-group accuracy.
FlowDrag: 3D-aware Drag-based Image Editing with Mesh-guided Deformation Vector Flow Fields: Proposes FlowDrag, which constructs a 3D mesh from an image and generates continuous 2D vector flow fields through progressive SR-ARAP deformation. This injects global geometric priors into the motion supervision process of diffusion models, leading to comprehensive state-of-the-art performance on DragBench (MD=22.88) and the newly proposed VFD-Bench (PSNR=18.55, 1-LPIPS=0.82, MD=28.23).
FreeMesh: Boosting Mesh Generation with Coordinates Merging: This work proposes the Per-Token-Mesh-Entropy (PTME) metric to evaluate mesh tokenizer quality without training, and introduces Rearrange & Merge Coordinates (RMC), a coordinate merging technique borrowed from NLP, achieving a compression rate of up to 21.2% across three tokenizers (MeshXL, MeshAnythingV2, and EdgeRunner), while significantly increasing the number of generatable faces and preserving geometric details.
GAPrompt: Geometry-Aware Point Cloud Prompt for 3D Vision Model: This paper proposes GAPrompt, a geometry-aware PEFT method for pre-trained 3D vision models. By synergistically leveraging point cloud geometric information through three modules—Point Prompt, Point Shift Prompter, and Prompt Propagation—it matches or even outperforms full fine-tuning while training only 2.19% of the parameters.
High Dynamic Range Novel View Synthesis with Single Exposure: Proposes, for the first time, the problem setting of HDR novel view synthesis (HDR-NVS) using only single-exposure LDR images, and designs Mono-HDR-3D, a meta-algorithm framework based on camera imaging principles. It achieves HDR scene modeling without HDR supervision through an LDR-to-HDR Color Converter (L2H-CC) and an HDR-to-LDR closed-loop Color Converter (H2L-CC).
Of Mice and Machines: A Comparison of Learning Between Real World Mice and RL Agents: This paper systematically compares the behavioral differences between real mice and RL agents in a predator-prey maze. Revealing that RL agents lack self-preservation instincts, the authors propose two mechanisms: Trauma-Inspired Safety Buffer (TISB) and Variance-Penalized TD learning (VP-TDMPC-2), which improve the state visitation overlap between agents and mice from 20.9% to 86.1%.
PhysicsNeRF: Physics-Guided 3D Reconstruction from Sparse Views: PhysicsNeRF proposes a physics-prior-based sparse-view NeRF framework. By leveraging four complementary constraints—depth ranking, cross-view consistency, sparsity regularization, and progressive training—it achieves a PSNR of 21.4 dB with only 8 views while providing an in-depth theoretical analysis of the nature of overfitting under sparse-view conditions.
Probabilistic Interactive 3D Segmentation with Hierarchical Neural Processes: NPISeg3D proposes the first probabilistic interactive 3D segmentation framework based on Hierarchical Neural Processes. Through a two-level latent variable structure (scene-level and object-level) and a probabilistic prototype modulator, it achieves segmentation accuracy superior to AGILE3D under a few clicks, while providing reliable uncertainty estimations.

Browse all 17 3D Vision papers →

🎯 Object Detection (12)

BlueGlass: A Framework for Composite AI Safety: This work proposes BlueGlass, a composite AI safety framework that integrates three safety analysis tools—distributed evaluation, approximation probes, and sparse autoencoders—via a unified infrastructure to systematically analyze the capability boundaries, layer dynamics, and internal concept representations of Vision-Language Models (VLMs) in object detection tasks.
Causality-Aware Contrastive Learning for Robust Multivariate Time-Series Anomaly Detection: This paper proposes CAROTS—a multivariate time-series anomaly detection framework that integrates causal relationships into contrastive learning. It utilizes causality-preserving augmentation as positive samples (normal variations) and causality-violating augmentation as negative samples (simulated anomalies) to train encoders to distinguish normal from abnormal patterns based on causal structures.
CostFilter-AD: Enhancing Anomaly Detection through Matching Cost Filtering: By introducing the core concept of cost volume filtering from stereo matching and optical flow estimation into Unsupervised Anomaly Detection (UAD), this work constructs a matching cost volume between the input and templates. It utilizes a 3D U-Net with dual-stream attention guidance for denoising and filtering. Designed as a general plug-and-play post-processing module, it simultaneously boosts the performance of both reconstruction-based and embedding-based UAD methods, achieving state-of-the-art (SOTA) results on MVTec-AD and VisA.
Few-Shot Learner Generalizes Across AI-Generated Image Detection: This paper is the first to redefine AI-generated image detection as a few-shot classification task. It proposes FSD (Few-Shot Detector) based on prototypical networks to learn a metric space. Using only 10 samples from unseen generative models, it achieves an average accuracy of 84.1% on the GenImage dataset, outperforming the previous SOTA (LARE2) by +11.6%.
FG-CLIP: Fine-Grained Visual and Textual Alignment: FG-CLIP systematically addresses the three major bottlenecks of fine-grained understanding in CLIP: capturing global semantic details with 1.6B long-description-image pairs, achieving precise regional alignment with 12M images and 40M region annotations, and training models to distinguish subtle semantic differences with 10M hard negatives, achieving comprehensive leading performance in fine-grained understanding, open-vocabulary detection, and image-text retrieval.
KAN-AD: Time Series Anomaly Detection with Kolmogorov-Arnold Networks: KAN-AD reformulates time series anomaly detection as approximating sequences using smooth univariate functions. By replacing B-splines in KAN with truncated Fourier expansion to avoid local perturbation sensitivity, it improves detection accuracy by an average of 15% across four benchmarks with fewer than 1000 parameters.
Open-Det: An Efficient Learning Framework for Open-Ended Detection: Open-Det proposes an efficient open-ended detection (OED) framework. By reconstructing the object detector (decoupling one-to-many/one-to-one matching), introducing a VL-prompts distillation module to bridge the vision-language semantic gap, utilizing a LoRa Head + Text Denoising to accelerate LLM training, and applying a Masked Alignment Loss to eliminate contradictory supervision, Open-Det achieves superior detection performance (APr +1.0%) using only 1.5% of the training data and 20.8% of the training epochs compared to GenerateU.
Outlier Gradient Analysis: Efficiently Identifying Detrimental Training Samples for Deep Learning Models: Outlier Gradient Analysis (OGA) is proposed to reformulate the identification of detrimental training samples in influence functions as an outlier detection problem in the gradient space. This sidesteps the high computational overhead of Hessian matrix inversion while outperforming traditional influence function methods on tasks such as noisy label correction, NLP data filtering, and LLM influence data identification.
Self-Organizing Visual Prototypes for Non-Parametric Representation Learning: This paper proposes the Self-Organizing Prototypes (SOP) strategy, which replaces the single prototype in traditional self-supervised learning (SSL) with multiple semantically similar support embeddings to represent local regions of the feature space. It also introduces a non-parametric masked image modeling (MIM) task, achieving state-of-the-art performance on downstream tasks such as retrieval, detection, and segmentation.
UI-Vision: A Desktop-centric GUI Benchmark for Visual Perception and Interaction: UI-Vision is proposed as the first comprehensive offline evaluation benchmark for desktop environments, covering 83 software applications. It provides dense annotations of bounding boxes, UI labels, and action trajectories. It defines three evaluation tasks ranging from fine-grained to coarse-grained (Element Grounding → Layout Grounding → Action Prediction) to systematically evaluate and reveal key shortcomings of SOTA models in professional software understanding, spatial reasoning, and complex actions.

Browse all 12 Object Detection papers →

✂️ Segmentation (18)

ActionPiece: Contextually Tokenizing Action Sequences for Generative Recommendation: This paper proposes ActionPiece, the first context-aware action sequence tokenizer. It models user behavior as a sequence of feature sets, representing each action as an unordered set of features, and uses a BPE-like merge strategy driven by weighted co-occurrence statistics to learn merge rules within and across adjacent sets and build a vocabulary. The same action can therefore be tokenized into different tokens depending on its context, significantly improving generative recommendation performance.
Adapter Naturally Serves as Decoupler for Cross-Domain Few-Shot Semantic Segmentation: This paper discovers that adapters naturally possess the capability of domain information decoupling (based on architecture rather than loss). Consequently, the authors propose the Domain Feature Navigator (DFN) as a structural domain decoupler, coupled with SAM-SVN to prevent overfitting on the source domain. This approach significantly outperforms state-of-the-art methods in cross-domain few-shot semantic segmentation (CD-FSS), achieving a 1-shot average of 63.99% and a 5-shot average of 69.77% MIoU.
Alberta Wells Dataset: Pinpointing Oil and Gas Wells from Satellite Imagery: This paper introduces the first large-scale oil and gas well detection benchmark, the Alberta Wells Dataset (containing over 213k well locations and 188k+ satellite imagery patches). The localization of abandoned, suspended, and active oil and gas wells is formulated as binary segmentation and object detection tasks, and various CNN and Transformer baseline models are evaluated.
Balanced Learning for Domain Adaptive Semantic Segmentation: This paper proposes BLDA, which directly quantifies class bias by analyzing the logit distributions predicted by the network. It aligns the logit distributions of each class using a shared anchor distribution for post-processing calibration, while utilizing GMMs for online estimation and logit correction in self-training to generate unbiased pseudo-labels. This brings consistent improvements to various baseline methods on both GTA→CS and SYN→CS benchmarks.
ConText: Driving In-context Learning for Text Removal and Segmentation: This work applies the visual in-context learning (V-ICL) paradigm to OCR tasks for the first time. It proposes three key designs: task-chaining prompting, context-aware aggregation (CAA), and self-prompting (SP) strategies. ConText significantly outperforms existing general V-ICL models and task-specific models in text removal and segmentation tasks, achieving improvements of +4.50 PSNR and +3.34% fgIoU, respectively.
Dual form Complementary Masking for Domain-Adaptive Image Segmentation: Proposes the MaskTwins framework, which theorizes masked reconstruction as a sparse signal reconstruction problem, proves that dual form complementary masks have theoretical advantages in extracting domain-invariant features, and achieves domain-adaptive segmentation through complementary mask consistency constraints in end-to-end training.
Efficient and Robust Semantic Image Communication via Stable Cascade: A semantic image communication framework built upon the Stable Cascade architecture. It uses EfficientNet-V2 to extract highly compact image embeddings (occupying just 0.29% of the original size) as LDM conditioning. Through noise-robust fine-tuning, the system reconstructs images faithfully even under low SNR channels, while achieving 3-16x inference acceleration.
FeatSharp: Your Vision Model Features, Sharper: This paper proposes FeatSharp, which coherently upsamples feature maps of low-resolution vision encoders to high resolution at an extremely low cost by taking FeatUp's Joint Bilateral Upsampling (JBU) and attentively fusing it with image tiling features, while capturing fine-grained details lost at the original resolution.
InfoSAM: Fine-Tuning the Segment Anything Model from An Information-Theoretic Perspective: This paper proposes InfoSAM, which designs a relationship compression and distillation framework based on Rényi mutual information for the Parameter-Efficient Fine-Tuning (PEFT) of SAM from an information-theoretic perspective, enhancing fine-tuning performance by compressing pseudo-invariant information and preserving domain-invariant relationships.

Browse all 18 Segmentation papers →

🖼️ Image Restoration (5)

Adaptive Estimation and Learning under Temporal Distribution Shift: Proposes an estimation algorithm based on wavelet soft-thresholding that achieves optimal pointwise estimation error bounds under temporal distribution shift without prior knowledge. It establishes a connection between sequence non-stationarity and sparsity in the wavelet domain, applying it to binary classification and total variation denoising under distribution shift.
ε-VAE: Denoising as Visual Decoding: This paper proposes ε-VAE, which replaces the single-step deterministic decoder in traditional autoencoders with a diffusion/denoising process to implement "denoising as decoding." Under the same compression rate, the reconstruction quality is improved by 40% and downstream generation quality is enhanced by 22%. Alternatively, it achieves a 2.3× inference acceleration by increasing the compression rate while maintaining the same generation quality.
Evaluating Deepfake Detectors in the Wild: A new dataset containing over 500k high-quality deepfake images is constructed. By introducing in-the-wild enhancements such as JPEG compression, resolution reduction, and image restoration, six open-source deepfake detectors are systematically evaluated, revealing that fewer than half achieved an AUC > 60%, with the lowest performance around 50% (random-guess level).
HarmoniCa: Harmonizing Training and Inference for Better Feature Caching in Diffusion Transformer Acceleration: This work proposes the HarmoniCa framework, which addresses the misalignment between training and inference in existing learning-based feature caching methods through two core designs: Step-Wise Denoising Training (SDT) and Image Error Proxy-Guided Objective (IEPO). It achieves over a 40% reduction in latency (2.07× theoretical speedup) without compromising generation quality across 8 different models including PixArt-α.
TimeDART: A Diffusion Autoregressive Transformer for Self-Supervised Time Series Representation: TimeDART is proposed to unify autoregressive modeling and denoising diffusion processes within a self-supervised pre-training framework. It captures long-term dynamic evolution via a causal Transformer encoder and fine-grained local patterns through patch-level diffusion denoising, outperforming existing methods on both forecasting and classification tasks.
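
The wavelet soft-thresholding mentioned in the Adaptive Estimation and Learning under Temporal Distribution Shift entry above can be illustrated with a generic sketch. This is a textbook Haar-wavelet denoiser, not the paper's estimator: the choice of the Haar basis, the fixed threshold `lam`, and all function names are assumptions made for illustration.

```python
import math

def soft_threshold(x, lam):
    """Shrink x toward zero by lam; values with |x| <= lam become zero."""
    return math.copysign(max(abs(x) - lam, 0.0), x)

def haar_forward(signal):
    """Full orthonormal Haar decomposition; len(signal) must be a power of two.
    Returns the coarsest approximation and the detail levels (finest first)."""
    approx, details = list(signal), []
    while len(approx) > 1:
        pairs = list(zip(approx[0::2], approx[1::2]))
        details.append([(a - b) / math.sqrt(2) for a, b in pairs])
        approx = [(a + b) / math.sqrt(2) for a, b in pairs]
    return approx, details

def haar_inverse(approx, details):
    """Invert haar_forward, rebuilding from the coarsest level to the finest."""
    approx = list(approx)
    for level in reversed(details):
        rebuilt = []
        for a, d in zip(approx, level):
            rebuilt.extend([(a + d) / math.sqrt(2), (a - d) / math.sqrt(2)])
        approx = rebuilt
    return approx

def denoise(signal, lam):
    """Soft-threshold every detail coefficient, keep the approximation as is."""
    approx, details = haar_forward(signal)
    shrunk = [[soft_threshold(c, lam) for c in level] for level in details]
    return haar_inverse(approx, shrunk)

# A noisy step: small fluctuations are removed, the large jump survives (shrunk by lam).
noisy = [1.0, 1.2, 0.9, 1.1, 5.0, 5.1, 4.9, 5.0]
smooth = denoise(noisy, lam=0.5)
```

A signal that changes at only a few points has few large detail coefficients, which is the sparsity that thresholding exploits.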

🛰️ Remote Sensing (7)

Causal Foundation Models: Disentangling Physics from Instrument Properties: Introduces a causally-driven foundation model that disentangles physical signals and instrumental effects from astronomical time series using a dual-encoder architecture and structured contrastive learning. By leveraging naturally occurring observational triplets (the same target observed by different instruments, or different targets observed by the same instrument), the proposed model significantly outperforms single latent space approaches in low-data regimes.
ExPLoRA: Parameter-Efficient Extended Pre-Training to Adapt Vision Transformers under Domain Shifts: ExPLoRA is proposed to continue self-supervised pre-training on a target domain in a parameter-efficient manner by unfreezing 1-2 ViT blocks and applying LoRA to the remaining layers. Under domain shift scenarios like remote sensing, it outperforms state-of-the-art methods that are fully pre-trained from scratch, while using <10% of the parameters.
High-Resolution Live Fuel Moisture Content (LFMC) Maps for Wildfire Risk from Multimodal Earth Observation Data: Fine-tuning the pretrained multimodal Earth observation model Galileo generates 10-meter resolution Live Fuel Moisture Content (LFMC) maps, reducing RMSE by 20%+ compared to randomly initialized models, with the pipeline's utility validated by a 2025 Los Angeles wildfire case study.
LIGHTHOUSE: Fast and Precise Distance to Shoreline Calculations from Anywhere on Earth: This work introduces Lighthouse, a global shoreline dataset with a 10-meter resolution and a millisecond-level query library. By fusing ESA WorldCover and OpenStreetMap data, and combining a hierarchical BallTree with spherical Voronoi indexing, it enables real-time shoreline distance queries requiring only 1 CPU and 2GB RAM, improving accuracy by over 100 times compared to existing datasets.
MapEval: A Map-Based Evaluation of Geo-Spatial Reasoning in Foundation Models: This paper proposes the MapEval benchmark, which systematically evaluates the geo-spatial reasoning capabilities of 30 foundation models in map scenarios using 700 multiple-choice questions across textual, API, and visual tasks. The results show that the strongest model achieves an accuracy of no more than 67%, and all models lag behind human performance by over 20%.
Neural Augmented Kalman Filters for Road Network Assisted GNSS Positioning: A Temporal Graph Neural Network (TGNN) is proposed to integrate open-source road network information into GNSS Kalman filtering. The TGNN predicts the most likely road segments on the graph structure and dynamically estimates their uncertainties, reducing the P95 localization error from 77.23m to 55.02m (a 29% reduction) in real-world urban data.
Resampling Augmentation for Time Series Contrastive Learning: Application to Remote Sensing: The paper proposes a resampling augmentation for time series contrastive learning, which constructs positive pairs through "upsampling + disjoint subsequence extraction + realigning back to the original timeline." This approach outperforms common augmentation strategies on multiple SITS agricultural classification tasks and yields leading results on S2-Agri100.

🧑 Human Understanding (3)¶

How to Move Your Dragon: Text-to-Motion Synthesis for Large-Vocabulary Objects: This paper presents a unified framework for text-driven motion generation targeting large-vocabulary heterogeneous skeletal objects, achieved by annotating text descriptions for the Truebones Zoo dataset (70+ species), introducing rig augmentation, and integrating TreePE and RestPE encodings into the Motion Diffusion Model. It enables high-quality 3D motion synthesis for animals, dinosaurs, and even fictional creatures.
LLaVA-ReID: Selective Multi-Image Questioner for Interactive Person Re-Identification: This paper defines a new task of interactive person re-identification (Inter-ReID), constructs the Interactive-PEDES multi-turn dialogue dataset, and proposes LLaVA-ReID—a large multimodal question generation model based on selective multi-image context and look-ahead supervision, which progressively refines target person descriptions through iterative dialogue.
Scaling Large Motion Models with Million-Level Human Motions: This paper introduces MotionLib (the first million-level motion dataset, containing 1.2 million sequences), MotionBook (comprising lossless features and a 2D lookup-free motion tokenizer), and Being-M0 (a large motion model), demonstrating the scaling laws of both data and model size in the motion generation field for the first time.

📹 Video Understanding (4)¶

Fine-Grained Captioning of Long Videos through Scene Graph Consolidation: This paper proposes the SGVC framework, which achieves state-of-the-art zero-shot long video captioning performance while substantially reducing computational overhead compared to LLM-based methods. It parses segment-level video descriptions into scene graphs, iteratively consolidates them into a unified graph representation using the Hungarian algorithm, and generates video-level descriptions using a lightweight graph-to-text decoder.
MoMa: Modulating Mamba for Adapting Image Foundation Models to Video Recognition: The MoMa framework is proposed, which injects the linear complexity SSM of Mamba into a frozen CLIP Transformer via a scale-bias sequence modulation operation (SeqMod) to achieve efficient global spatiotemporal dynamic modeling, reaching SOTA performance on multiple video recognition benchmarks with lower computational cost.
Scaling Video-Language Models to 10K Frames via Hierarchical Differential Distillation: ViLaMP proposes the Differential Distillation principle to achieve "mixed-precision" video processing through two mechanisms: hierarchical frame-level Differential Keyframe Selection (DKS) and patch-level Differential Feature Merging (DFM). In this paradigm, keyframes retain all visual tokens, while non-keyframes are compressed into a single token. This enables processing ultra-long videos of up to 10K frames (approximately 2.7 hours) on a single A100 GPU.
Unifying Specialized Visual Encoders for Video Language Models: MERV proposes a multi-encoder video representation method that integrates four visual encoders with different areas of expertise (DINOv2, ViViT, SigLIP, LanguageBind) into a single VideoLLM through spatio-temporal alignment and cross-attention fusion. It improves performance on video reasoning benchmarks by up to 4.62% compared to the baseline Video-LLaVA, validating the complementary strengths of different encoders.

🚗 Autonomous Driving (10)¶

Don't be so Negative! Score-based Generative Modeling with Oracle-assisted Guidance: Proposes the Gen-neG method, which redirects the generative distribution from constraint-violating regions to the positive support region by iteratively training a Bayes-optimal classifier on synthetic data from diffusion models and using it to guide the sampling process. The key innovation lies in correctly handling the importance sampling of class prior probabilities, reducing the collision and out-of-boundary rate from 29.3% to 5.6% in traffic scene generation.
DriveGPT: Scaling Autoregressive Behavior Models for Driving: Proposes DriveGPT, a 1.4B-parameter autoregressive Transformer driving behavior model trained on 120 million real driving clips (50x larger than the largest existing dataset). It systematically establishes the data/model/compute scaling laws for driving behavior modeling for the first time, demonstrates that data is the primary performance bottleneck, and outperforms the state-of-the-art on planning and WOMD prediction tasks.
Geometry-to-Image Synthesis-Driven Generative Point Cloud Registration: Proposes a new paradigm of Generative Point Cloud Registration, designing two registration-tailored controllable 2D generative models, DepthMatch-ControlNet and LiDARMatch-ControlNet, to generate cross-view consistent RGB image pairs from pure geometric point cloud pairs. It improves existing 3D registration methods in a plug-and-play manner through geometry-color feature fusion, validated on 3DMatch/ScanNet/Dur360BEV.
GoIRL: Graph-Oriented Inverse Reinforcement Learning for Multimodal Trajectory Prediction: This work integrates the Maximum Entropy Inverse Reinforcement Learning (MaxEnt IRL) framework with vectorized scene representations for the first time, proposing the GoIRL trajectory prediction framework. Utilizing a learnable Feature Adaptor, it aggregates graph features into a grid space to accommodate IRL. It then employs a hierarchical parameterized trajectory generator (Bézier curves + refinement module) along with an MCMC probability fusion mechanism for multimodal trajectory prediction. GoIRL achieves state-of-the-art (SOTA) performance on Argoverse and nuScenes, demonstrating significantly stronger generalization capabilities compared to supervised models.
Hierarchical and Collaborative LLM-Based Control for Multi-UAV Motion and Communication in Integrated Terrestrial and Non-Terrestrial Networks: Proposes a hierarchical collaborative LLM-based control framework that coordinates dual-level LLMs—a meta-controller LLM deployed on the HAPS and edge-controller LLMs deployed on the UAVs—to achieve joint optimization of motion planning and communication access for multi-UAVs in 3D aerial highway scenarios.
Hybrid Quantum-Classical Multi-Agent Pathfinding: Proposes the first optimal hybrid quantum-classical MAPF algorithms, QP and QCP, converting the path selection problem of MAPF into QUBO subproblems solvable on quantum hardware. By utilizing a conflict graph and column generation framework, theoretical optimality is achieved, and feasibility is validated on real quantum hardware.
InfoCons: Identifying Interpretable Critical Concepts in Point Clouds via Information Theory: This paper proposes the InfoCons framework, which applies the Information Bottleneck (IB) principle to interpretable point cloud models. By training an attention bottleneck network, the framework decomposes point clouds into 3D concepts of varying importance. It introduces a learnable, unbiased prior to replace the fixed prior, generating conceptually cohesive explanations while ensuring faithfulness to model predictions.
R3DM: Enabling Role Discovery and Diversity Through Dynamics Models in Multi-agent Reinforcement Learning: Proposes the R3DM framework, which balances role diversity and coordination by maximizing the mutual information between agent roles, historical trajectories, and future expected behaviors, leveraging intrinsic rewards driven by dynamics models. It improves the win rate by up to 20% in SMAC/SMACv2 environments.
SafeMap: Robust HD Map Construction from Incomplete Observations: SafeMap proposes a plug-and-play robust framework for HD map construction. By utilizing two modules, Gaussian-based Perspective View Reconstruction (G-PVR) and Distillation-based BEV Correction (D-BEVC), it accurately constructs vectorized HD maps even under incomplete observations where camera views are missing.
SPHINX: Structural Prediction using Hypergraph Inference Network: This paper proposes SPHINX, an unsupervised hypergraph inference model that frames hyperedge discovery as a sequential soft clustering problem. By employing differentiable k-subset sampling, SPHINX generates discrete, sparse hypergraph structures that can be seamlessly integrated into any hypergraph neural network. SPHINX achieves a 90% overlap rate in hypergraph reconstruction on synthetic data and outperforms existing methods in NBA trajectory prediction and 3D object classification.

🤖 Robotics & Embodied AI (20)¶

Action-Constrained Imitation Learning: Formulates a new problem of "Action-Constrained Imitation Learning (ACIL)" where a constrained agent learns from an unconstrained expert; proposes DTWIL, which generates alternative constrained trajectories via MPC and DTW distance to eliminate occupancy measure mismatch, outperforming baselines significantly on various robotic tasks.
Beyond CVaR: Leveraging Static Spectral Risk Measures for Enhanced Decision-Making in Distributional Reinforcement Learning: Proposes the first algorithm to optimize general static Spectral Risk Measures (SRM) within the distributional RL framework, moving beyond existing methods limited to simple CVaR. By leveraging reward distributions, it achieves closed-form outer optimization and temporal decomposition of auxiliary risk measures, outperforming existing risk-sensitive DRL models across diverse risk settings.
BiAssemble: Learning Collaborative Affordance for Bimanual Geometric Assembly: The BiAssemble framework is proposed to decompose the geometric assembly task into three steps (pick-up -> alignment -> assembly) by learning collaboration-aware point-level affordances. It outperforms existing affordance and imitation learning methods in fractured object reassembly tasks and is validated on a real-world benchmark.
Closed-loop Long-horizon Robotic Planning via Equilibrium Sequence Modeling: Models the self-refinement planning process of LLMs as a fixed-point problem (deep equilibrium model) to achieve end-to-end supervised training via implicit differentiation without additional verifiers or RL, and designs nested equilibrium solvers for closed-loop, long-horizon robot planning.
CommVQ: Commutative Vector Quantization for KV Cache Compression: This paper proposes CommVQ, which compresses the KV cache using Additive Vector Quantization (AVQ). By innovatively designing a codebook that commutes with RoPE and training it via the EM algorithm, CommVQ achieves near-lossless accuracy at 2-bit and retains usable accuracy at 1-bit, enabling LLaMA-3.1 8B to support a 128K context length on a single RTX 4090 GPU.
Efficient Robotic Policy Learning via Latent Space Backward Planning: Proposes Latent Space Backward Planning (LBP), which recursively predicts intermediate subgoals starting from the final goal to sequentially approach the current state. This significantly improves planning efficiency while maintaining task alignment, achieving a new state of the art (SOTA) in both LIBERO-LONG simulation and real-robot long-horizon tasks.
Flow of Reasoning: Training LLMs for Divergent Reasoning with Minimal Examples: Flow of Reasoning (FoR) is proposed to model multi-step LLM reasoning as a Markov flow on a DAG. By fine-tuning LLMs with the trajectory balance objective of GFlowNets, the model can sample multiple high-quality and diverse reasoning paths with probabilities proportional to rewards, using only a minimal number of training examples (e.g., 15).
FOUNDER: Grounding Foundation Models in World Models for Open-Ended Embodied Decision Making: The FOUNDER framework is proposed to align the multimodal task representations of Foundation Models (FMs) to the state space of World Models (WMs) by learning a mapping function. In combination with a temporal distance predictor, it generates reward signals to achieve open-ended multi-task embodied decision-making without environment rewards.
Geometric Contact Flows: Contactomorphisms for Dynamics and Control: Proposes Geometric Contact Flows (GCF), which leverage Riemannian and contact geometry as inductive biases. Using contactomorphisms, GCF maps latent contact Hamiltonian dynamics with desired properties (such as stability and energy conservation) to the target dynamics, while utilizing ensemble uncertainty to drive geodesics for robust generalization and obstacle avoidance.
Gradual Transition from Bellman Optimality Operator to Bellman Operator in Online Reinforcement Learning: Proposes Annealed Q-learning (AQ-L), which smoothly transitions from the Bellman optimality operator to the Bellman operator by annealing the parameter \(\tau\) of the expectile loss from close to 1 down to 0.5. In continuous action spaces, this both accelerates early learning and suppresses late-stage overestimation bias. When integrated with TD3/SAC, it significantly outperforms baselines on various locomotion and robotic manipulation tasks.
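The expectile loss mentioned in the AQ-L entry above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the starting value `tau_start=0.99`, and the linear schedule are all assumptions made here for clarity. With \(\tau\) near 1 the loss mostly penalizes underestimating the target (behaving more like a max over actions), and at \(\tau = 0.5\) it reduces to half the ordinary mean squared error.

```python
def expectile_loss(td_errors, tau):
    """Asymmetric squared loss over TD errors u = target - prediction.

    Positive errors are weighted by tau, negative errors by (1 - tau).
    """
    total = 0.0
    for u in td_errors:
        weight = tau if u >= 0 else 1.0 - tau
        total += weight * u * u
    return total / len(td_errors)


def annealed_tau(step, total_steps, tau_start=0.99, tau_end=0.5):
    """Hypothetical linear schedule from tau_start down to tau_end.

    The actual annealing schedule used in the paper may differ.
    """
    frac = min(max(step / total_steps, 0.0), 1.0)
    return tau_start + frac * (tau_end - tau_start)


if __name__ == "__main__":
    errors = [2.0, -2.0]
    for step in (0, 50, 100):
        tau = annealed_tau(step, 100)
        print(step, round(tau, 3), round(expectile_loss(errors, tau), 3))
```

Because the two example errors have equal magnitude, the loss stays constant as \(\tau\) changes; what shifts is how much each sign contributes, which is the effect the annealing exploits.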

Browse all 20 Robotics & Embodied AI papers →

🎮 Reinforcement Learning (69)¶

A Theoretical Study of (Hyper) Self-Attention through the Lens of Interactions: Representation, Training, Generalization: From the unified perspective of "interacting entities", this paper proves that a single-layer linear self-attention can efficiently represent, learn, and generalize pairwise interaction functions with \(\Theta(|\mathcal{S}|^2)\) parameters (whereas fully connected networks require \(\Omega(L^2|\mathcal{S}|^2)\)). Based on this theoretical insight, two new modules, HyperFeatureAttention (feature-level interaction coupling) and HyperAttention (higher-order multi-entity interactions), are proposed, which reduce perplexity in language modeling.
Action-Dependent Optimality-Preserving Reward Shaping (ADOPS): The ADOPS method is proposed. By querying the extrinsic/intrinsic value function estimates from the critic network, it adjusts rewards only when the intrinsic reward would change the preference for the optimal action. This achieves action-dependent optimality-preserving reward shaping, overcoming the limitation of PBRS which can only handle action-independent forms, and outperforms all previous optimality-preserving methods and the RND baseline on Montezuma's Revenge.
Actor-Critics Can Achieve Optimal Sample Efficiency: This paper is the first to prove that Actor-Critic algorithms can achieve an optimal sample complexity of \(O(1/\epsilon^2)\) under general function approximation and strategic exploration. This is achieved by integrating optimistic exploration, off-policy critic estimation, and rare-switching policy resets, and the results are further extended to the hybrid RL setting.
Adversarial Cooperative Rationalization: The Risk of Spurious Correlations in Even Clean Datasets: Reveals a hidden flaw in the cooperative rationalization framework (RNP)—even on clean datasets, the generator's sampling bias introduces spurious correlations between rationales and labels. An adversarial detection and instruction intervention method is proposed, significantly outperforming existing methods on text and graph classification.
Automatic Reward Shaping from Confounded Offline Data: Proposes the first theoretically guaranteed data-driven method to automatically learn potential-based reward shaping (PBRS) functions from offline data that contains unobserved confounders. The method uses the causal Bellman optimality equation to upper bound the optimal state value as the potential function, and proves that the resulting Q-UCB Shaping algorithm enjoys a superior gap-dependent regret bound compared to vanilla Q-UCB on pseudo-suboptimal state-action pairs.
BEAVER: Building Environments with Assessable Variation for Evaluating Multi-Objective Reinforcement Learning: Proposes the BEAVER benchmark, the first multi-objective contextual reinforcement learning evaluation framework for building energy management. By parameterizing thermal dynamics and climate zones, it constructs controllable environmental variations to systematically evaluate the cross-environment generalization capabilities of existing MORL algorithms.
Benchmarking Quantum Reinforcement Learning: Proposes a rigorous benchmarking methodology for quantum reinforcement learning (QRL)—introducing a statistical estimator based on sample complexity and the concept of "surpassing" defined by statistical significance. Conducts the largest-scale (100 seeds) comparison of QRL vs. classical RL to date on a newly designed 6G beam management environment, revealing that prior claims regarding QRL superiority need to be treated with greater caution.
Beyond The Rainbow: High Performance Deep Reinforcement Learning on a Desktop PC: BTR (Beyond The Rainbow) is proposed—integrating 6 RL improvements into Rainbow DQN to train on Atari-60 to an IQM of 7.4 (compared to 1.9 for Rainbow) within 12 hours on a single desktop PC, and successfully training agents to play 3D games like Super Mario Galaxy, Mario Kart, and Mortal Kombat for the first time.
BRITE: Bootstrapping Reinforced Thinking Process to Enhance Language Model Reasoning: BRITE is proposed to iteratively collect and reinforce the intermediate thinking processes of LLMs via bootstrapping, combining process-level reward models and PPO training to continuously enhance LLM performance on tasks such as mathematical reasoning.
Conceptual Belief-Informed Reinforcement Learning: Proposes HI-RL (Human Intelligence-RL), which integrates conceptual abstraction and probabilistic prior belief mechanisms from cognitive science into RL. It extracts high-level concepts from experience and constructs concept-associated adaptive priors to guide value function/policy updates, consistently improving the sample efficiency of DQN/PPO/SAC/TD3 as an algorithm-agnostic plug-in.
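Two entries in this list (ADOPS and Automatic Reward Shaping from Confounded Offline Data) build on potential-based reward shaping (PBRS). As background, the classical PBRS term is \(F(s, s') = \gamma\,\Phi(s') - \Phi(s)\) for a potential function \(\Phi\). The sketch below, with hypothetical helper names and a toy potential, checks the telescoping property that makes this form optimality-preserving: along any trajectory, the discounted sum of shaping terms depends only on the first and last states. It does not implement either paper's method.

```python
def shaping_term(phi, s, s_next, gamma):
    """Classical potential-based shaping term F(s, s') = gamma * phi(s') - phi(s)."""
    return gamma * phi(s_next) - phi(s)


def discounted_shaping_sum(phi, states, gamma):
    """Discounted sum of shaping terms along a trajectory s_0, ..., s_T."""
    return sum(
        (gamma ** t) * shaping_term(phi, states[t], states[t + 1], gamma)
        for t in range(len(states) - 1)
    )


if __name__ == "__main__":
    phi = lambda s: float(s) ** 2  # toy potential, chosen only for illustration
    states = [0, 3, 1, 4]
    gamma = 0.9
    total = discounted_shaping_sum(phi, states, gamma)
    horizon = len(states) - 1
    telescoped = (gamma ** horizon) * phi(states[-1]) - phi(states[0])
    print(total, telescoped)  # the two values agree up to floating-point error
```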

Browse all 69 Reinforcement Learning papers →

🎁 Recommender Systems (17)¶

Adaptive Elicitation of Latent Information Using Natural Language: An LLM-based adaptive information elicitation framework is proposed. By performing autoregressive forward simulation of future observations using a meta-learned predictive model, it quantifies and distinguishes epistemic and aleatoric uncertainties, and adaptively selects the most informative natural language questions to efficiently reduce epistemic uncertainty about a latent entity.
Aligning LLMs by Predicting Preferences from User Writing Samples: A new paradigm is proposed to achieve personalized LLM alignment by predicting user preferences based on user writing samples. It infers preference signals directly from user textual styles without requiring explicit preference annotations, opening up a new data source for personalized alignment.
Deprecating Benchmarks: Criteria and Framework: Proposes a set of 7 criteria to determine when an AI benchmark should be deprecated, alongside a three-phase deprecation framework (Assessment-Reporting-Notification), and provides an institutional implementation plan using the EU AI Office as a case study.
ELMO: Efficiency via Low-precision and Peak Memory Optimization in Large Output Spaces: The ELMO framework is proposed to reduce the training memory of XMC models with 3 million labels from 39.7 GiB to 6.6 GiB without losing classification accuracy, achieved via pure BFloat16/Float8 low-precision training combined with peak memory optimizations such as gradient fusion and chunking strategies.
How to Set AdamW's Weight Decay as You Scale Model and Dataset Size: By interpreting the weight updates of AdamW as an Exponential Moving Average (EMA), this work reveals that the EMA timescale \(\tau = 1/(\eta\lambda)\) is a core hyperparameter. Its optimal value in terms of epochs remains stable across varying model and dataset scales, thereby providing clear scaling rules for weight decay.
LCRON: Learning Cascade Ranking as One Network: This work proposes LCRON, which trains multi-stage cascade ranking systems as a unified network in an end-to-end manner. Specifically, an end-to-end surrogate loss \(L_{e2e}\) constructed via differentiable ranking techniques directly optimizes the lower bound of the survival probability of ground truth items through the entire cascade. This is assisted by auxiliary individual stage losses \(L_{single}\) derived from the tightness of the lower bound to drive collaboration among stages. LCRON achieves significant improvements in both public benchmarks and online A/B tests of industrial advertising systems (Ad Revenue +4.10%, User Conversion +1.60%).
New Interaction Paradigm for Complex EDA Software Leveraging GPT: This work proposes the SmartonAI system, which integrates Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) into the EDA tool KiCad. It achieves task decomposition, document retrieval, and intelligent plugin recommendation and execution through natural language interaction, significantly reducing the learning curve for complex engineering software.
Not All Explanations for Deep Learning Phenomena Are Equally Valuable: This is a position paper arguing that "counter-intuitive phenomena" in deep learning (such as double descent, grokking, and the lottery ticket hypothesis) rarely occur in practical settings. Instead of pursuing isolated explanations for these phenomena, researchers should treat them as empirical testbeds to evaluate and refine broader deep learning theories.
PARM: Multi-Objective Test-Time Alignment via Preference-Aware Autoregressive Reward Model: This work proposes PARM, a single unified preference-aware autoregressive reward model (ARM) that is conditioned on preference vectors via PBLoRA (Preference-Aware Bilinear Low-Rank Adaptation) for efficient multi-objective test-time alignment. It replaces \(k\) independent ARMs with a single reward model to reduce inference costs and support weak-to-strong guidance (e.g., a 7B model guiding a 65B model).
Position: Don't Use the CLT in LLM Evals with Fewer Than a Few Hundred Datapoints: As a position paper, this work argues that when the sample size for LLM evaluation is fewer than a few hundred, confidence intervals based on the Central Limit Theorem (CLT) severely underestimate uncertainty. It recommends using Bayesian credible intervals or Wilson score intervals as alternative solutions.
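To make the contrast in the CLT position paper above concrete, the sketch below compares the CLT-based (Wald) interval with the Wilson score interval for a benchmark accuracy. The function names and the example of 20 correct answers out of 20 are illustrative assumptions, not taken from the paper. When every answer is correct, the Wald interval collapses to zero width, while the Wilson interval still reports meaningful uncertainty.

```python
import math


def wald_interval(successes, n, z=1.96):
    """CLT-based (Wald) interval: p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n)."""
    p = successes / n
    half = z * math.sqrt(p * (1.0 - p) / n)
    return (max(0.0, p - half), min(1.0, p + half))


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2.0 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1.0 - p) / n + z * z / (4.0 * n * n))
    return (max(0.0, center - half), min(1.0, center + half))


if __name__ == "__main__":
    # Hypothetical eval: 20 questions, all answered correctly.
    print("Wald:  ", wald_interval(20, 20))
    print("Wilson:", wilson_interval(20, 20))
```

Both functions clamp their output to the range [0, 1]; `z=1.96` corresponds to a nominal 95% level.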

Browse all 17 Recommender Systems papers →

🔄 Self-Supervised Learning (22)¶

A Bayesian Model Selection Criterion for Selecting Pretraining Checkpoints: Introduces "downstream free energy" as a Bayesian model selection criterion for pretraining checkpoint adaptability, proves that "pretraining free energy" serves as its upper bound proxy (without requiring downstream data), and experimentally validates that a large learning rate, small batch size, and high momentum improve downstream transfer performance by reducing pretraining free energy.
AdaWorld: Learning Adaptable World Models with Latent Actions: AdaWorld is proposed, which builds highly adaptable world models by performing action-aware pre-training through self-supervised extraction of latent actions from videos, supporting zero-shot action transfer and fast adaptation to new environments with few interactions.
Alpha-SQL: Zero-Shot Text-to-SQL using Monte Carlo Tree Search: Alpha-SQL models zero-shot Text-to-SQL as a tree search problem. By combining a Monte Carlo Tree Search (MCTS) framework with an LLM-as-Action-Model and a self-supervised reward function, it achieves a 69.7% execution accuracy on the BIRD dataset using a 32B open-source model without any fine-tuning, surpassing the GPT-4o-based zero-shot SOTA by 2.5 percentage points.
Beyond Sensor Data: Foundation Models of Behavioral Data from Wearables Improve Health Predictions: Using physical activity data from 162K participants and 2.5 billion hours of wearable behavioral data from the Apple Heart and Movement Study, this work systematically explores combinations of tokenizers and architectures. By constructing WBM, a behavioral foundation model leveraging TST + Mamba-2 + contrastive learning, the model significantly outperforms hand-crafted feature baselines across 57 health detection tasks and complements PPG sensor models.
CLARIFY: Contrastive Preference Reinforcement Learning for Untangling Ambiguous Queries: Proposes CLARIFY, a method that constructs a trajectory embedding space integrating preference information via contrastive learning and utilizes rejection sampling to select clearer, more distinguishable preference queries, thereby improving annotation efficiency and policy performance of offline PbRL under non-ideal feedback.
ReSA: Clustering Properties of Self-Supervised Learning: This work systematically analyzes the clustering properties of various components in JEA-based SSL. It discovers that the encoding possesses superior and more stable clustering capabilities compared to the embedding and hidden layers of the projector. Based on this, ReSA (Representation Self-Assignment) is proposed to utilize encoding clustering information to guide embedding learning, forming a positive-feedback SSL framework that significantly outperforms SOTA on multiple standard benchmarks.
Collapse-Proof Non-Contrastive Self-Supervised Learning: This paper proposes the FALCON method, which designs the projector and loss function based on the principles of hyperdimensional computing. It theoretically proves the simultaneous prevention of four known training failure modes (representation collapse, dimensional collapse, cluster collapse, and intracluster collapse) while naturally endowing representations with decorrelation and clustering properties.
Contextures: Representations from Contexts: Establishes the contexture theory to unify and prove that various representation learning paradigms, including supervised learning, self-supervised learning, and manifold learning, can be understood as learning the top-\(d\) singular functions of the expectation operator induced by context variables, while revealing the law of diminishing marginal returns in model scaling and proposing context quality evaluation metrics.
Deep Learning is Not So Mysterious or Different: This is a position paper arguing that the generalization phenomena deemed "mysterious" in deep learning (benign overfitting, double descent, and the success of overparameterization) are neither unique to deep learning nor mysterious. They can be formalized using long-standing generalization frameworks (PAC-Bayes and countable-hypothesis bounds) and unified under the explanatory principle of soft inductive biases.
Discovering Global False Negatives On the Fly for Self-supervised Contrastive Learning: GloFND is proposed to learn a dynamic threshold for each anchor sample, discovering and filtering global false negatives in real-time during training. This improves the representation quality in contrastive learning with low computational overhead.

Browse all 22 Self-Supervised Learning papers →

📐 Optimization & Theory (61)¶

A Generalization Result for Convergence in Learning-to-Optimize: A probabilistic framework is proposed to combine PAC-Bayesian generalization theory with the Kurdyka-Łojasiewicz (KL) convergence theorem from variational analysis, proving for the first time with high probability that learned optimization algorithms converge to critical points without restricting the algorithm design.
A Near-Optimal Single-Loop Stochastic Algorithm for Convex Finite-Sum Coupled Compositional Optimization: This paper proposes the ALEXR algorithm—an efficient single-loop primal-dual block-coordinate stochastic algorithm for solving convex finite-sum coupled compositional optimization (cFCCO) problems. It achieves near-optimal convergence rates under both smooth and non-smooth conditions, and proves the optimality of the algorithm by deriving lower bounds.
A Unified View on Learning Unnormalized Distributions via Noise-Contrastive Estimation: Proposes two estimator families, alpha-CentNCE and f-CondNCE, based on f-NCE, to unify methods for learning unnormalized distributions (such as MLE, MC-MLE, GlobalGISO, pseudo-likelihood, and ISO), corrects the misleading connection between CondNCE and score matching, and establishes the first finite-sample convergence guarantees for bounded exponential families.
Adjustment for Confounding using Pre-Trained Representations: This paper investigates how to leverage latent representations from pre-trained neural networks to adjust for confounding in non-tabular data (e.g., images, text). It formalizes representation sufficiency conditions, proves that sparsity/additivity assumptions do not hold under Invertible Linear Transformations (ILTs), and establishes convergence rate theory for deep networks based on low intrinsic dimension and Hierarchical Compositional Models (HCMs), thereby guaranteeing valid inference of ATE estimation within the Double Machine Learning (DML) framework.
AdvPrompter: Fast Adaptive Adversarial Prompting for LLMs: Proposes AdvPrompter, an LLM that generates human-readable adversarial prompt suffixes for target LLMs in seconds. Trained via an alternating optimization algorithm, it achieves high attack success rates on AdvBench and HarmBench and transfers to closed-source black-box LLMs. The paper also presents a strategy for adversarial training that uses the generated suffixes to enhance target LLM robustness.
Benefits of Early Stopping in Gradient Descent for Overparameterized Logistic Regression: This paper theoretically proves that early-stopped gradient descent (GD) possesses a statistical advantage over asymptotic GD in overparameterized logistic regression: early-stopped GD is calibrated and consistent, whereas the logistic risk of asymptotic GD diverges to infinity and its calibration error does not vanish. Additionally, a quantitative connection between early stopping and \(\ell_2\) regularization is established.
Beyond Self-Repellent Kernels: History-Driven Target Towards Efficient Nonlinear MCMC on General Graphs: This paper proposes the History-Driven Target (HDT) framework, which embeds self-repellent dynamics into any MCMC sampler by modifying the target distribution (rather than the transition kernel). It achieves the \(O(1/\alpha)\) variance reduction while resolving the three major limitations of SRRW: high computational overhead, restriction to reversible chains, and high memory footprint.
BOPO: Neural Combinatorial Optimization via Best-anchored and Objective-guided Preference Optimization: This paper introduces preference optimization into Neural Combinatorial Optimization (NCO) and proposes BOPO. Through (1) best-anchored preference pair construction (hybrid rollout + uniform filtering + best-anchored pairing) and (2) an objective-guided adaptively scaled loss function (\(\beta = g(y_l)/g(y_w)\)), BOPO comprehensively outperforms state-of-the-art (SOTA) approaches on three classical combinatorial optimization problems (JSP, TSP, and FJSP) without requiring a reward model or a reference policy.
Can Transformers Learn Full Bayesian Inference In Context?: This paper demonstrates that Transformers can perform full Bayesian inference in context. By pre-training an encoder-decoder architecture (TabPFN encoder + diffusion Transformer decoder) on synthetic data, the model can generate posterior samples of comparable quality to HMC for statistical models like GLMs and Gaussian mixture models during deployment, without requiring any parameter updates.
Clipping Improves Adam-Norm and AdaGrad-Norm when the Noise Is Heavy-Tailed: This paper proves that the high-probability convergence of AdaGrad/Adam under heavy-tailed noise can be poor (with polynomial dependence on the confidence level) and demonstrates that gradient clipping resolves this issue. Specifically, Clip-AdaGrad-Norm and Clip-Adam-Norm achieve high-probability convergence bounds with polylogarithmic dependence on the confidence level under heavy-tailed noise, and these results are then extended to delayed-stepsize versions.
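The idea behind the clipped AdaGrad-Norm entry above can be sketched in a few lines: the stochastic gradient is rescaled to a maximum norm before it enters both the scalar accumulator and the update. This is a minimal sketch of the standard AdaGrad-Norm update combined with norm clipping; the function names, default values, and the exact placement of clipping are assumptions and may differ from the paper's algorithm (in particular, the delayed-stepsize variant is not shown).

```python
import math


def clip_by_norm(grad, max_norm):
    """Rescale grad so its Euclidean norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm or norm == 0.0:
        return list(grad)
    scale = max_norm / norm
    return [g * scale for g in grad]


def clipped_adagrad_norm_step(x, grad, accum_sq, lr=1.0, max_norm=1.0):
    """One clipped AdaGrad-Norm step.

    accum_sq is the running sum of squared clipped-gradient norms and
    must start from a positive value to avoid division by zero.
    """
    g = clip_by_norm(grad, max_norm)
    accum_sq = accum_sq + sum(v * v for v in g)
    step = lr / math.sqrt(accum_sq)
    x_new = [xi - step * gi for xi, gi in zip(x, g)]
    return x_new, accum_sq


if __name__ == "__main__":
    x, accum_sq = [0.0, 0.0], 1.0
    # A heavy-tailed outlier gradient is clipped before it can dominate the step.
    x, accum_sq = clipped_adagrad_norm_step(x, [30.0, 40.0], accum_sq, max_norm=5.0)
    print(x, accum_sq)
```

Clipping bounds how much a single outlier can inflate the accumulator, which would otherwise shrink every later step size.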

Browse all 61 Optimization & Theory papers →

📐 Learning Theory (16)¶

Avoiding Catastrophe in Online Learning by Asking for Help: A novel theoretical framework for online learning is proposed to address catastrophic (irreversible) errors: payoffs are defined as catastrophe avoidance probabilities, and the objective function is the product of payoffs (overall catastrophe avoidance probability). By introducing an instructor querying mechanism and a Local Generalization assumption, the paper establishes an impossibility result (catastrophe is inevitable without queries) and a possibility result (if the policy class is learnable, both regret and query rate vanish simultaneously), elevating sublinear regret in standard online learning to subconstant regret.
Heavy-Tailed Linear Bandits: Huber Regression with One-Pass Update: This paper proposes Hvt-UCB, a one-pass Huber regression algorithm based on Online Mirror Descent for linear bandits with heavy-tailed noise. It reduces the per-round computational complexity from \(\mathcal{O}(t\log T)\) to \(\mathcal{O}(1)\) while maintaining an optimal and instance-dependent regret bound.
Improved and Oracle-Efficient Online \(\ell_1\)-Multicalibration: Proposes to reduce online \(\ell_1\)-multicalibration to a newly defined Online Linear-Product Optimization (OLPO) problem, achieving improved and oracle-efficient online multicalibration error bounds of \(\widetilde{O}(T^{-1/3})\) and \(\widetilde{O}(T^{-1/4})\), respectively.
Improved Generalization Bounds for Transductive Learning by Transductive Local Complexity and Its Applications: This paper proposes the Transductive Local Complexity (TLC) framework, extending classical local Rademacher complexity to the transductive learning setting. It achieves excess risk bounds that nearly match those of inductive learning (differing only by logarithmic factors) and resolves a decade-old open problem.
Learning-Augmented Algorithms for MTS with Bandit Access to Multiple Predictors: In Metrical Task Systems (MTS), when the algorithm can only access \(\ell\) heuristics in a bandit fashion (querying only one heuristic per step and requiring \(m\) consecutive queries to observe the state), this paper presents an algorithm with \(O(\text{OPT}^{2/3})\) regret and proves that this bound is tight.
Learning-Augmented Hierarchical Clustering: This paper investigates leveraging side information from a splitting oracle to bypass the approximation hardness barriers of hierarchical clustering, achieving an \(O(1)\)-approximation for the Dasgupta objective and a \((1-o(1))\)-approximation for the Moseley-Wang objective, with extensions to streaming and parallel computing settings.
Maximum Coverage in Turnstile Streams with Applications to Fingerprinting Measures: This work presents the first single-pass streaming algorithm for the maximum coverage problem under the turnstile streaming model (supporting arbitrary insertions/deletions) with \(\tilde{O}(d/\varepsilon^3)\) space and \(\tilde{O}(1)\) update time. The algorithm is further extended to the privacy fingerprinting scenario, achieving up to a 210× speedup compared to prior methods in experiments.
Multiple-Policy Evaluation via Density Estimation: This paper proposes the CAESAR algorithm, which simultaneously evaluates \(K\) policies through a two-phase approach (coarse estimation of the visitation distribution + density ratio estimation under the optimal sampling distribution). This achieves a non-asymptotic, instance-dependent sample complexity. The core technique is "coarse estimation"—obtaining a constant-factor distribution approximation using only \(O(1/\epsilon)\) samples.
Near-Optimal Consistency-Robustness Trade-Offs for Learning-Augmented Online Knapsack Problems: A family of online knapsack algorithms based on simple predictions (point or interval predictions of the critical value) is proposed. These algorithms achieve near-Pareto optimal trade-offs between consistency and robustness, alongside a general reduction from fractional to integer solutions.
Near Optimal Best Arm Identification for Clustered Bandits: In the multi-agent clustered multi-armed bandit setting, this paper proposes two algorithms, Cl-BAI and BAI-Cl, which exploit the clustering structure to significantly reduce the sample complexity of best arm identification. It further proves that a variant, BAI-Cl++, achieves minimax optimality when \(M\) is a constant.

Browse all 16 Learning Theory papers →

🔗 Causal Inference (17)

Causal Abstraction Inference under Lossy Representations: This paper proposes the Projected Abstraction framework, breaking the reliance of existing causal abstraction theory on the "Abstract Invariance Condition (AIC)." This enables mathematically consistent causal inference under lossy/dimension-reduced representations and provides identifiability criteria at the graphical model level.
Causal Effect Identification in lvLiNGAM from Higher-Order Cumulants: In the Linear Non-Gaussian Acyclic Model with latent confounding (lvLiNGAM), this paper identifies causal effects using higher-order cumulants (instead of only the covariance matrix). It addresses two challenging settings: (1) a single proxy variable that may affect the treatment; and (2) the underdetermined instrumental variable (IV) problem where the number of IVs is less than the number of treatments. Identifiability is proved and consistent estimators are provided for both cases.
Causal Evidence for the Primordiality of Colors in Trans-Neptunian Objects: Using a model-agnostic causal discovery method (the FCI algorithm), this paper demonstrates with 98.7% confidence that the color of Trans-Neptunian Objects (TNOs) is the root cause of their orbital inclination distribution. This provides strong support for the "primordial" hypothesis of TNO colors—implying that color reflects the formation location rather than post-formation collisional evolution.
Classifier Reconstruction Through Counterfactual-Aware Wasserstein Prototypes: This paper proposes using Wasserstein barycenters to fuse original and counterfactual samples into class prototypes, enabling high-fidelity reconstruction of target binary classifiers under limited query budgets and effectively mitigating the decision boundary shift problem caused by the naive use of counterfactual samples.
Doubly Protected Estimation for Survival Outcomes Utilizing External Controls for Randomized Clinical Trials: Proposes a doubly protected estimation framework for survival outcomes that corrects covariate shift via density ratio weighting and detects outcome drift via DR-Learner, so that only comparable external controls are borrowed. It is robust to external data heterogeneity while guaranteeing consistency and efficiency gains.
E-LDA: Toward Interpretable LDA Topic Models with Strong Guarantees in Logarithmic Parallel Time: Proposes E-LDA (Exemplar-LDA), which reformulates the MAP topic-word assignment problem of LDA as a monotone submodular function maximization. For the first time, a practical algorithm with a \(1-1/e\) approximation guarantee is achieved, which converges in logarithmic parallel time while ensuring that each learned topic possesses formal keyword-based interpretability.
Estimating Causal Effects in Gaussian Linear SCMs with Finite Data: This work proposes the Centralized Gaussian Linear SCM (CGL-SCM), which significantly reduces the parameter space by standardizing exogenous variables to \(\mathcal{N}(0,1)\), and designs an EM-based estimation algorithm to accurately recover identifiable causal effects under finite observational data.
Exogenous Isomorphism for Counterfactual Identifiability: This paper proposes the concept of Exogenous Isomorphism (EI), proving that \(\sim_{\mathrm{EI}}\)-identifiability implies \(\sim_{\mathcal{L}_3}\)-identifiability (complete counterfactual layer identifiability). It provides sufficient conditions for achieving EI in two special classes of models: Bijective SCMs (BSCMs) and Triangular Monotonic SCMs (TM-SCMs), thereby unifying and generalizing existing counterfactual identifiability theories.
Internal Causal Mechanisms Robustly Predict Language Model Out-of-Distribution Behaviors: Using identified internal causal mechanisms in LLMs to predict model output correctness on out-of-distribution (OOD) inputs, this work proposes two methods—counterfactual simulation and value probing—achieving an average AUC-ROC improvement of 13.84% over existing baselines in OOD settings.
Isolated Causal Effects of Natural Language: Proposes a formal estimation framework for the "Isolated Causal Effect," which isolates the causal effect of focal language attributes from correlated non-focal language using a doubly robust estimator and omitted variable bias (OVB) sensitivity analysis.
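Several entries above rely on doubly robust estimation. As a point of reference, the sketch below implements the standard augmented inverse-propensity-weighted (AIPW) estimator of an average treatment effect for a binary treatment. It is not the text-specific estimator from the Isolated Causal Effects paper; the function name `aipw_ate` and its argument layout are assumptions made for illustration.

```python
import numpy as np

def aipw_ate(y, t, mu1, mu0, e):
    """Standard AIPW (doubly robust) estimate of the average treatment effect.

    y   : observed outcomes
    t   : binary treatment indicators (0 or 1)
    mu1 : outcome-model predictions under treatment
    mu0 : outcome-model predictions under control
    e   : estimated propensity scores P(T = 1 | X), strictly inside (0, 1)
    """
    y, t, mu1, mu0, e = (np.asarray(a, dtype=float) for a in (y, t, mu1, mu0, e))
    psi = (mu1 - mu0
           + t * (y - mu1) / e                   # residual correction, treated units
           - (1.0 - t) * (y - mu0) / (1.0 - e))  # residual correction, control units
    return float(psi.mean())
```

The estimate is consistent if either the outcome models or the propensity model is correctly specified, which is the sense in which it is "doubly robust".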

Browse all 17 Causal Inference papers →

🔬 Interpretability (31)

A Reasoning-Based Approach to Cryptic Crossword Clue Solving: A three-stage LLM reasoning pipeline (Answer Candidate Generation \(\rightarrow\) Wordplay Suggestion \(\rightarrow\) Python Formalisation & Verification) is proposed. Using open-source 9B models, it achieves a new SOTA on the Cryptonite dataset. The key innovation lies in formalizing wordplay reasoning into executable Python code and iteratively correcting it via a verifier with hints.
Ab Initio Nonparametric Variable Selection for Scalable Symbolic Regression with Large p: Proposes the PAN+SR framework, which reduces high-dimensional symbolic regression problems to low-dimensional subspaces through BART-based nonparametric variable pre-screening, achieving significant performance improvements for 19 existing SR methods in high-dimensional scenarios.
Concept-Based Unsupervised Domain Adaptation: Proposes the CUDA framework, which combines Concept Bottleneck Models (CBMs) with Unsupervised Domain Adaptation (UDA). By aligning concept representations via relaxed consistency (allowing minor domain discrepancies) and inferring unlabeled concepts in the target domain, CUDA simultaneously provides interpretability and cross-domain generalization under domain shift for the first time, backed by theoretical guarantees.
Configurable Preference Tuning with Rubric-Guided Synthetic Data: This paper proposes the Configurable Preference Tuning (CPT) framework, which trains LLMs using synthetic preference data generated from fine-grained rubrics. This enables the model to dynamically adjust its behavioral style at inference time simply by modifying the system prompt without retraining, improving accuracy from 0.52-0.68 to 0.76-0.83 across multiple base models.
DeltaSHAP: Explaining Prediction Evolutions in Online Patient Monitoring with Shapley Values: DeltaSHAP is an explainable AI algorithm designed specifically for online patient monitoring systems. By adapting Shapley values to temporal scenarios, it explains the evolution (change) between consecutive predictions rather than absolute prediction values. It provides both the direction and magnitude of feature attributions, achieving a 62% improvement in explanation quality and a 33% reduction in computation time on the MIMIC-III benchmark.
Do Sparse Autoencoders Generalize? A Case Study of Answerability: This paper systematically evaluates the out-of-domain (OOD) generalization capabilities of features extracted by Sparse Autoencoders (SAEs) on the task of "answerability." The study reveals highly inconsistent OOD transfer performance of SAE features—outperforming residual stream linear probes on some datasets while performing near-randomly on others, highlighting the fundamental limitations of current SAE interpretability methods in capturing abstract concepts.
Evaluating Neuron Explanations: A Unified Framework with Sanity Checks: Proposes the NeuronEval unified framework, formalizing 19 existing neuron explanation evaluation methods into the same mathematical paradigm. It introduces two sanity checks (Missing Labels and Extra Labels) to reveal that most commonly used metrics (e.g., Recall, AUC, and Correlation under top-and-random sampling) are unreliable, with only Correlation (Pearson), Cosine, AUPRC, F1, and IoU passing the checks.
Evolving Prompts In-Context: An Open-ended, Self-replicating Perspective: Proposes the PromptQuine framework, which performs token-level pruning on ICL prompts through evolutionary search. It discovers that pruning clear exemplars into seemingly "gibberish" subsequences can actually improve LLM performance, matching or surpassing SOTA prompt optimization methods.
Explaining, Fast and Slow: Abstraction and Refinement of Provable Explanations: This paper proposes an abstraction-refinement-based method to efficiently compute provably sufficient explanations for neural network predictions, accelerating the verification process by abstracting large networks into small ones with formal guarantees on explanation quality.
FastCAV: Efficient Computation of Concept Activation Vectors for Explaining Deep Neural Networks: FastCAV is proposed to replace SVM training with the normalized mean difference vector of concept activation samples. This approach is theoretically equivalent to a simplified form of Fisher Discriminant Analysis. It achieves up to 63.6\(\times\) (average 46.4\(\times\)) acceleration while maintaining comparable classification accuracy and downstream explanation quality to SVM-CAV.
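The FastCAV entry above describes replacing SVM training with a normalized mean difference vector. A minimal sketch of that idea follows, assuming activations are already extracted as 2-D arrays (samples × features); the function name is hypothetical, and any centering or preprocessing used in the paper is omitted.

```python
import numpy as np

def mean_difference_cav(concept_acts, random_acts):
    """Unit-norm direction from the random-sample mean to the concept-sample mean.

    concept_acts, random_acts : arrays of shape (n_samples, n_features)
    """
    concept_acts = np.asarray(concept_acts, dtype=float)
    random_acts = np.asarray(random_acts, dtype=float)
    diff = concept_acts.mean(axis=0) - random_acts.mean(axis=0)
    norm = np.linalg.norm(diff)
    if norm == 0.0:
        raise ValueError("concept and random activations have identical means")
    return diff / norm

# Usage: project new activations onto the direction to score concept presence.
cav = mean_difference_cav([[2.0, 0.0], [4.0, 0.0]], [[0.0, 0.0], [0.0, 0.0]])
scores = np.array([[1.0, 5.0], [-1.0, 5.0]]) @ cav
```

The cost is two means and one normalization, with no iterative optimization, which is where the speedup over fitting an SVM comes from.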

Browse all 31 Interpretability papers →

📦 Model Compression (74)

A Cross Modal Knowledge Distillation & Data Augmentation Recipe for Improving Transcriptomics Representations through Morphological Features: Proposes Semi-Clipped (a CLIP-based cross-modal distillation method) and PEA (Perturbation Embedding Augmentation). In weakly paired data scenarios, these methods distill rich morphological features from microscopy images into transcriptomics representations, significantly improving their predictive power while maintaining the interpretability of gene expression.
A Mathematical Framework for AI-Human Integration in Work: This paper proposes a mathematical framework for evaluating AI-human work integration, decomposing skills into decision-level and execution-level sub-skills. It theoretically proves that the probability of work success exhibits a phase transition effect, and that merging complementary skills can yield super-additive gains. It also mathematically explains the "productivity compression" phenomenon where low-to-medium skilled workers benefit more from GenAI assistance, validating the framework using O*NET and Big-bench Lite data.
ABKD: Pursuing a Proper Allocation of the Probability Mass in Knowledge Distillation via α-β-Divergence: This paper provides an in-depth analysis of the probability mass allocation deficiencies of FKLD and RKLD in knowledge distillation, finding that they represent extremes in two effects: Hardness-Concentration and Confidence-Concentration. Based on this, the ABKD framework utilizing \(\alpha\)-\(\beta\)-divergence is proposed to flexibly balance these two effects by tuning \(\alpha\) and \(\beta\), achieving SOTA performance across 17 language/vision datasets and 12 teacher-student configurations.
An Efficient Matrix Multiplication Algorithm for Accelerating Inference in Binary and Ternary Neural Networks: Proposes the RSR/RSR++ algorithm, which preprocesses fixed binary/ternary weight matrices to build bucketed permutation indices and performs vector-matrix multiplication in \(O(n^2/\log n)\) time. Compared to the standard \(O(n^2)\) method, it delivers up to 29× faster matrix multiplication and 6× memory savings, as well as a 5.24× speedup in 1.58-bit LLM inference.
any4: Learned 4-bit Numeric Representation for LLMs: This paper proposes any4, a method that learns the optimal 4-bit non-uniform quantization codebook for each row of the weight matrix via k-means clustering. Without requiring weight/activation preprocessing, any4 outperforms int4/fp4/nf4 on Llama 2/3, Mistral, and Mixtral, using only a single calibration sample.
BECAME: BayEsian Continual Learning with Adaptive Model MErging: This paper proposes BECAME, which reformulates the model merging mechanism based on Bayesian continual learning principles. It derives a closed-form solution for the optimal merging coefficient using Laplace approximation. Combining gradient projection (stability) and unconstrained training (plasticity) into a two-stage framework, it significantly outperforms SOTA on multiple continual learning benchmarks.
Best Subset Selection: Optimal Pursuit for Feature Selection and Elimination: This paper revisits the feature selection/elimination criteria in classical best subset selection from an optimization perspective. It reveals that traditional criteria (correlation selection + Wald-T elimination) only capture "one-step changes" in the objective function while neglecting feature interactions. Consequently, the authors propose "objective-aware" optimal selection and elimination criteria to enhance classic algorithms such as OMP, CoSaMP, and (A)BESS as a plug-and-play Meta-Substitution. This achieves significant performance improvements in compressed sensing and sparse regression tasks without increasing computational complexity.
Beyond Communication Overhead: A Multilevel Monte Carlo Approach for Mitigating Compression Bias in Distributed Learning: This paper proposes a gradient compression scheme based on Multilevel Monte Carlo (MLMC), which constructs statistically unbiased gradient estimators from biased compressors, turning compression bias into manageable variance. This allows the approach to enjoy the theoretical guarantees of unbiased methods while maintaining the empirical efficiency of biased compressors. Combined with adaptive probability optimization, its superiority is validated on BERT fine-tuning and CIFAR-10.
Beyond Zero Initialization: Investigating the Impact of Non-Zero Initialization on LoRA Fine-Tuning Dynamics: This work theoretically analyzes and experimentally validates, from an infinite-width perspective, that initializing both the A and B matrices of LoRA as non-zero (Init[AB]), compared to the traditional zero initialization (Init[A]), significantly enhances robustness to suboptimal learning rates. Furthermore, the introduced random noise does not impair the fine-tuning performance—meaning that fine-tuning does not strictly need to start from the pre-trained model.
BlockDialect: Block-wise Fine-grained Mixed Format Quantization for Energy-Efficient LLM Inference: Proposes BlockDialect—a block-wise fine-grained mixed-format quantization method for weights and activations. It selects the optimal numerical format for each block from a formatbook of FP4 variants (dialects), improving accuracy on LLaMA3-8B by 10.78% compared to MXFP4, and remaining only 5.45% below full precision.
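The any4 entry above learns a 16-entry (4-bit) codebook per weight-matrix row with k-means. The sketch below shows plain, unweighted 1-D Lloyd iterations per row, initialized from a uniform grid. It ignores the calibration sample mentioned in the entry, and all function names are illustrative rather than the paper's API.

```python
import numpy as np

def kmeans_row_codebook(row, n_levels=16, iters=20):
    """Fit a 1-D k-means codebook to one row; return (codes, centroids)."""
    row = np.asarray(row, dtype=float)
    # Start from a uniform grid, i.e. ordinary uniform 4-bit quantization.
    centroids = np.linspace(row.min(), row.max(), n_levels)
    for _ in range(iters):
        codes = np.abs(row[:, None] - centroids[None, :]).argmin(axis=1)
        for k in range(n_levels):
            members = row[codes == k]
            if members.size:            # empty clusters keep their centroid
                centroids[k] = members.mean()
    codes = np.abs(row[:, None] - centroids[None, :]).argmin(axis=1)
    return codes.astype(np.uint8), centroids

def quantize_matrix(w, n_levels=16, iters=20):
    """Quantize each row of w with its own learned codebook."""
    w = np.asarray(w, dtype=float)
    codes = np.empty(w.shape, dtype=np.uint8)
    codebooks = np.empty((w.shape[0], n_levels))
    for i, row in enumerate(w):
        codes[i], codebooks[i] = kmeans_row_codebook(row, n_levels, iters)
    return codes, codebooks

def dequantize(codes, codebooks):
    """Look up each code in its row's codebook."""
    return np.take_along_axis(codebooks, codes.astype(np.int64), axis=1)

# Usage on a small random weight matrix.
w = np.random.default_rng(0).normal(size=(4, 64))
codes, codebooks = quantize_matrix(w)
w_hat = dequantize(codes, codebooks)
```

Since Lloyd iterations never increase the squared error, starting from the uniform grid means the learned codebook reconstructs each row at least as well as uniform quantization over the same range.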

Browse all 74 Model Compression papers →

🕸️ Graph Learning (31)

A Cognac Shot To Forget Bad Memories: Corrective Unlearning for Graph Neural Networks: Cognac is proposed as the first effective corrective unlearning method for GNNs. By alternating between Contrastive Unlearning on Graph Neighborhoods (CoGN) and AsCent DesCent DeCoupled (AC⚡DC), it restores performance close to the oracle (trained on fully clean data) even when only 5% of the manipulated entities are identified, and is 8× more efficient than retraining from scratch.
A General Graph Spectral Wavelet Convolution via Chebyshev Order Decomposition: Proposes WaveGC, a multi-resolution graph spectral convolutional network that constructs learnable graph wavelets strictly satisfying the admissibility condition by separating odd and even Chebyshev terms, and combines them with matrix-valued filter kernels, achieving consistent improvements across both short- and long-range graph tasks (with a 15.7% gain on VOC).
A Recipe for Causal Graph Regression: Confounding Effects Revisited: This paper systematically extends causal graph learning from classification to regression tasks for the first time. By acknowledging the predictive capacity of confounding subgraphs via an Enhanced Graph Information Bottleneck (Enhanced GIB) and substituting discrete-label-dependent causal intervention methods with contrastive learning, the proposed method significantly outperforms existing approaches on graph-level OOD regression benchmarks.
Balancing Efficiency and Expressiveness: Subgraph GNNs with Walk-Based Centrality: Proposes HyMN, an efficient framework that uses walk-based centrality (Subgraph Centrality) to sample subgraph bags for subgraph GNNs. With only 1-2 subgraphs, it matches the performance of full-bag subgraph GNNs, and it also uses centrality as a structural encoding to further enhance discriminative power, scaling subgraph methods to graphs hundreds of times larger for the first time.
Banyan: Improved Representation Learning with Explicit Structure: Banyan introduces two innovations, entangled hierarchical tree structures and diagonalised message passing. With only 14 non-embedding parameters, it outperforms large-scale Transformer models on semantic textual similarity tasks, offering an efficient and viable alternative for semantic representation learning in low-resource languages.
Beyond Message Passing: Neural Graph Pattern Machine: This paper proposes the Neural Graph Pattern Machine (GPM), which samples graph patterns using random walks. It utilizes dual encoders for semantic and anonymous paths to capture node features and topological structures, respectively. A Transformer is then employed to identify task-relevant pattern tokens, bypassing the message-passing paradigm entirely and outperforming SOTA methods across node-, edge-, and graph-level tasks.
CoDy: Counterfactual Explainers for Dynamic Graphs: Proposes CoDy—the first counterfactual explanation method for Temporal Graph Neural Networks (TGNNs), which efficiently explores the space of possible explanatory subgraphs by combining Monte Carlo Tree Search (MCTS) with spatio-temporal heuristic policies, achieving a 16% improvement in AUFSC+ across multiple datasets.
Diss-l-ECT: Dissecting Graph Data with Local Euler Characteristic Transforms: This paper proposes the Local Euler Characteristic Transform (\(\ell\)-ECT), extending classical ECT topological invariants to the local neighborhoods of graphs to generate lossless topological-geometric fingerprints for each node. It outperforms standard GNNs on node classification tasks, particularly on highly heterophilous graphs, while providing theoretical invertibility guarantees and interpretability.
Does Graph Prompt Work? A Data Operation Perspective with Theoretical Analysis: Provides the first comprehensive theoretical framework for Graph Prompts from the perspective of "data operation": proves that prompts can map the original graph to a "bridge graph" by simulating graph data transformations to adapt the frozen model to downstream tasks, and derives the error upper bounds and distributions in both single-graph and multi-graph scenarios.
EvoMesh: Adaptive Physical Simulation with Hierarchical Graph Evolutions: EvoMesh proposes a fully differentiable hierarchical graph evolution framework that adaptively constructs a multi-scale graph hierarchy evolving over time based on physical inputs using Anisotropic Message Passing (AMP) and Gumbel-Softmax-based differentiable node selection (DiffSELECT), outperforming fixed-hierarchy methods by an average of approximately 20% across five physical simulation benchmarks.

Browse all 31 Graph Learning papers →

📈 Time Series (21)

A Generalizable Physics-Enhanced State Space Model for Long-Term Dynamics Forecasting in Complex Environments: This paper proposes Phy-SSM, which integrates partially known physical knowledge into deep state space models (SSMs). Through dynamics decomposition (known/unknown matrices) and physical state regularization, it achieves accurate long-term dynamics forecasting and extrapolation for noisy, irregularly sampled data.
Are LLMs Prescient? A Continuous Evaluation using Daily News as the Oracle: This paper proposes Daily Oracle, a continuous evaluation benchmark that automatically generates predictive QA pairs from daily news. It systematically reveals a smooth decay in LLM predictive performance as pre-training data becomes outdated, showing an average accuracy drop of 21.55% on True/False (TF) questions and 11.33% on Multiple Choice (MC) questions, which cannot be fully mitigated even with RAG.
Causal Discovery from Conditionally Stationary Time Series: Proposes SDCI (State-Dependent Causal Inference), a causal discovery method for conditionally stationary time series. It models non-stationary behavior using discrete latent state variables to perform state-dependent causal structure recovery, with its effectiveness validated on physical particle systems, gene regulatory networks, and NBA player motion prediction.
Channel Normalization for Time Series Channel Identification: This work proposes Channel Normalization (CN), which enhances the Channel Identifiability (CID) of time series models by assigning independent affine transformation parameters to each channel. It further extends to an adaptive version ACN (dynamically adjusting parameters) and a prototypical version PCN (supporting unknown/variable channel counts), achieving significant performance improvements across various time series models.
Customizing the Inductive Biases of Softmax Attention using Structured Matrices: This paper proposes replacing the low-rank scoring function in softmax attention with efficient structured matrices (BTT and MLR). This both addresses the low-rank bottleneck of standard attention and introduces a distance-dependent computational bias through MLR, yielding improvements in in-context regression, language modeling, and long-range time series forecasting.
Event-Aware Sentiment Factors from LLM-Augmented Financial Tweets: A Transparent Framework for Interpretable Quant Trading: This study leverages Large Language Models (LLMs) to perform multi-label event classification and annotation on financial tweets, transforming unstructured social media text into structured, interpretable, event-driven quantitative factors. It finds that specific event categories (e.g., rumor/speculation) carry significant negative alpha (with Sharpe ratios as low as -0.38).
Foundation Models for Clinical Records at Health System Scale: Proposes GPT-EHR, a generative pre-training framework based on next-visit event prediction. By training a decoder-only Transformer on longitudinal EHR data of 1.29 million patients from NYU Langone, GPT-EHR predicts the onset of dementia and knee osteoarthritis in a zero-shot manner. Its performance is comparable to fully fine-tuned BERT baselines, while successfully uncovering and addressing a critical pitfall where repeated event tokens artificially inflate evaluation metrics.
HyperIMTS: Hypergraph Neural Network for Irregular Multivariate Time Series Forecasting: HyperIMTS is proposed to represent the observations and their dependencies in irregular multivariate time series (IMTS) using a hypergraph structure. Three message passing mechanisms (node-to-hyperedge, hyperedge-to-hyperedge, and hyperedge-to-node) enable irregularity-aware learning of temporal and variable dependencies. The model reaches SOTA performance on 5 IMTS datasets with superior computational efficiency compared to padding methods.
IMTS is Worth Time × Channel Patches: Visual Masked Autoencoders for Irregular Multivariate Time Series Prediction: The VIMTS framework is proposed, which converts irregular multivariate time series (IMTS) into an image-like time × channel patch structure. By leveraging the sparse multi-channel modeling capability of a visual MAE pre-trained on large-scale RGB images, combined with GCN-based cross-channel imputation and a coarse-to-fine prediction strategy, VIMTS achieves SOTA performance and strong few-shot capabilities on IMTS prediction tasks.
Learning Soft Sparse Shapes for Efficient Time-Series Classification: The SoftShape model is proposed to replace the traditional hard filtering of shapelets with soft sparsification based on contribution scores. Combining MoE-driven intra-shape and shared-expert-driven inter-shape dual-mode temporal pattern learning, it achieves SOTA classification accuracy on 128 UCR datasets.
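The Channel Normalization entry above assigns independent affine parameters to each channel. A minimal sketch follows, assuming inputs of shape (batch, channels, dim), normalization over the last axis, and a separate scale and shift vector per channel; this layout and the function name are assumptions, and the adaptive (ACN) and prototypical (PCN) variants are not shown.

```python
import numpy as np

def channel_norm(x, gamma, beta, eps=1e-5):
    """Normalize over the feature axis, then apply per-channel affine parameters.

    x     : array of shape (batch, channels, dim)
    gamma : per-channel scales, shape (channels, dim)
    beta  : per-channel shifts, shape (channels, dim)
    """
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    x_hat = (x - mean) / np.sqrt(var + eps)
    # Unlike a shared affine transform, each channel has its own gamma and beta.
    return x_hat * gamma[None, :, :] + beta[None, :, :]

# Usage: identical inputs in every channel become distinguishable after CN.
x = np.tile(np.linspace(-1.0, 1.0, 8), (2, 3, 1))       # shape (2, 3, 8)
gamma = np.ones((3, 8))
beta = np.arange(3, dtype=float)[:, None] * np.ones((3, 8))
out = channel_norm(x, gamma, beta)
```

With a shared affine transform the three channels in this example would produce identical outputs; the per-channel shift is what lets downstream layers tell them apart.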

Browse all 21 Time Series papers →

🏥 Medical Imaging (21)

Bayesian Inference for Correlated Human Experts and Classifiers: A general Bayesian framework is proposed to model the joint labeling behavior between correlated human experts and classifiers. It captures correlations among experts using latent representations and evaluates the utility of additional queries via simulation-based inference, significantly reducing the number of expert queries in medical classification and image annotation while maintaining predictive accuracy.
Boosting Masked ECG-Text Auto-Encoders as Discriminative Learners (D-BETA): D-BETA proposes a contrastive learning framework that integrates generative masked autoencoders with enhanced discriminative capabilities. Through the ECG-Text Sigmoid (ETS) loss and Nearest Neighbor Negative Sampling (N3S) strategy, it significantly outperforms existing methods in cross-modal ECG-text representation learning, achieving a 15% average AUC improvement in linear probing with only 1% of the training data, and a 2% improvement in zero-shot performance.
Certification for Differentially Private Prediction in Gradient-Based Training: An Abstract Gradient Training (AGT) framework is proposed to compute the upper bounds of the reachable set of model parameters during training using convex relaxation and bound propagation techniques. This leverages the smooth sensitivity mechanism to significantly tighten the privacy analysis of private prediction, achieving privacy bounds several orders of magnitude tighter than global sensitivity on medical imaging and NLP tasks.
Context Matters: Query-aware Dynamic Long Sequence Modeling of Gigapixel Images: This paper proposes the Querent framework, which achieves efficient long-range context modeling in gigapixel Whole Slide Images (WSIs) through query-aware dynamic region importance evaluation. It theoretically achieves a bounded approximation of full self-attention and outperforms state-of-the-art (SOTA) methods in biomarker prediction, gene mutation prediction, cancer subtyping, and survival analysis across 10+ WSI datasets.
Do Multiple Instance Learning Models Transfer?: This work presents the first systematic evaluation of the transfer learning capabilities of MIL models in computational pathology, finding that MIL models pre-trained on a pancancer dataset can generalize across organs and tasks, outperforming self-supervised slide foundation models (CHIEF, GigaPath) using less than 10% of the pre-training data.
EEG-Language Pretraining for Highly Label-Efficient Clinical Phenotyping: This paper pioneers the EEG-Language Model (ELM). Trained on 15,000 EEG recordings and clinical reports, ELM integrates time-series cropping, text segmentation, and multi-instance learning strategies. It achieves zero-shot EEG classification and cross-modal retrieval for the first time, significantly outperforming EEG-only self-supervised methods in low-label scenarios.
Efficient Noise Calculation in Deep Learning-based MRI Reconstructions: An efficient method based on Jacobian Sketching is proposed. By probing the Jacobian diagonal elements of DL reconstruction networks via random phase vectors, it accelerates the computation of voxel-level noise variance in MRI reconstruction using an unbiased estimator. The computation and memory requirements are reduced by more than an order of magnitude while maintaining a 99.8% correlation coefficient with the Monte Carlo reference.
Enhancing Statistical Validity and Power in Hybrid Controlled Trials: A Randomization Inference Approach with Conformal Selective Borrowing: An inference framework for hybrid controlled trials is proposed based on the Fisher Randomization Test (FRT) and Conformal Selective Borrowing (CSB). It achieves finite-sample exact Type I error rate control and model-free statistical inference, minimizing MSE through adaptive thresholding to enhance statistical power while maintaining strict Type I error control.
From Token to Rhythm: A Multi-Scale Approach for ECG-Language Pretraining: MELP proposes a multi-scale ECG-language pretraining model. By utilizing cross-modal supervisory signals at three levels (Token, Beat, and Rhythm) combined with domain-specific cardiology language model pretraining, it comprehensively outperforms existing self-supervised and multimodal ECG methods in zero-shot classification, linear probing, and transfer learning.
I2MoE: Interpretable Multimodal Interaction-aware Mixture-of-Experts: I2MoE proposes an interpretable multimodal interaction-aware mixture-of-experts framework. By incorporating four interaction experts (uniqueness \(\times 2\) + synergy + redundancy) combined with a weakly supervised interaction loss, it explicitly models heterogeneous interactions between modalities. Furthermore, it provides sample-level and dataset-level interpretability through a reweighting model, improving accuracy on the ADNI dataset by 5.5%.

Browse all 21 Medical Imaging papers →

🩺 Medical LLM (4)

Agent WARPP: Workflow Adherence via Runtime Parallel Personalization: Proposes WARPP, a training-free multi-agent framework that dynamically prunes conditional branch workflows at runtime based on user attributes, executing them through a parallelized Personalizer agent in coordination with modular domain-specific agents, thereby improving tool call precision and parameter fidelity while reducing token consumption.
Autoformulation of Mathematical Optimization Models Using LLMs: This paper proposes a method that combines Large Language Models (LLMs) with Monte-Carlo Tree Search (MCTS) to automatically convert optimization problems described in natural language into mathematical programming models solvable by solvers, significantly improving search efficiency through symbolic pruning and LLM-based value estimation.
EVOLvE: Evaluating and Optimizing LLMs For In-Context Exploration: This paper proposes the BanditBench benchmark and three enhancement strategies (inference-time algorithm guidance, few-shot demonstration, and oracle fine-tuning) to systematically evaluate and improve the in-context exploration capabilities of LLMs in bandit environments, enabling smaller models to outperform larger ones through algorithm distillation.
On the Vulnerability of Applying Retrieval-Augmented Generation within Knowledge-Intensive Application Domains: This paper systematically reveals the vulnerability of RAG retrieval systems in knowledge-intensive domains (healthcare, law) to universal poisoning attacks. It proposes the "orthogonal augmentation" property to explain the cause of the attack and designs a detection-based defense method using distribution-aware distance, achieving near-perfect detection rates in almost all scenarios.

🧬 Computational Biology (48)

ADIOS: Antibody Development via Opponent Shaping: Introducing opponent shaping from multi-agent reinforcement learning into antibody design, this paper proposes the ADIOS meta-learning framework: the outer loop optimises the antibody, and the inner loop simulates adaptive viral escape, so that the designed "shaper" antibodies not only counter current viral variants but also actively steer viral evolution toward weaker, more easily targeted variants.
Aligning Protein Conformation Ensemble Generation with Physical Feedback: This work proposes Energy-based Alignment (EBA), which integrates energy feedback from physical force fields into the fine-tuning process of diffusion generative models. By aligning the generative distribution with the physical energy landscape via a Boltzmann factor-weighted classification objective, the method achieves state-of-the-art (SOTA) performance in protein conformation ensemble generation on the ATLAS MD benchmark.
CFP-Gen: Combinatorial Functional Protein Generation via Diffusion Language Models: Proposes CFP-Gen, a large-scale diffusion language model that achieves combinatorial protein generation under multimodal functional constraints (functional annotations + sequence motifs + 3D structures) via Annotation-Guided Feature Modulation (AGFM) and Residue-level Control Function Encoding (RCFE), improving the F1 score by 30% compared to ESM3.
Compositional Flows for 3D Molecule and Synthesis Pathway Co-design: Proposes CGFlow (Compositional Generative Flows), which extends flow matching to the step-by-step generation of compositional objects by interleaving discrete compositional structure sampling (synthesis pathways) with continuous state transport (3D conformations). Applied to synthesizable drug design as 3DSynthFlow, it achieves, for the first time, SOTA results in both binding affinity and synthesizability across 15 LIT-PCBA targets.
ComRecGC: Global Graph Counterfactual Explainer through Common Recourse: This paper formally defines the Common Recourse global counterfactual explanation problem for Graph Neural Networks (GNNs) for the first time, proves its NP-hardness, and proposes the ComRecGC algorithm. By searching for counterfactual graphs using Multi-head Vertex-Reinforced Random Walk (VRRW) and extracting common recourses via DBSCAN clustering, ComRecGC consistently outperforms existing baselines by 10%–30% in coverage across four real-world datasets: NCI1, Mutagenicity, AIDS, and Proteins.
DeepSeq: High-Throughput Single-Cell RNA Sequencing Data Labeling via Web Search-Augmented Agentic Generative AI Foundation Models: Proposes the DeepSeq pipeline, which uses large language models (especially Agentic GPT-4o with real-time web search capabilities) to automatically annotate cell types in single-cell RNA sequencing data. It achieves a maximum accuracy of 82.5%, addressing the throughput bottleneck of large-scale omics data annotation.
Designing Cyclic Peptides via Harmonic SDE with Atom-Bond Modeling: The authors propose the CpSDE framework, which uses alternate sampling between a harmonic SDE generative model (AtomSDE) and a residue-type predictor (ResRouter) to achieve the first all-type cyclic peptide design based on 3D receptor structures, surpassing existing linear peptide design methods in both stability and affinity.
eccDNAMamba: A Pre-Trained Model for Ultra-Long eccDNA Sequence Analysis: eccDNAMamba is the first bidirectional state space encoder tailored for circular DNA. Combining BPE tokenization, circular data augmentation, and SpanBERT-style pre-training, it models ultra-long eccDNA sequences of up to 200 kbp with linear time complexity. It significantly outperforms DNABERT-2, HyenaDNA, and Caduceus in cancer classification and genuine eccDNA identification tasks.
Efficient Molecular Conformer Generation with SO(3)-Averaged Flow Matching and Reflow: Proposes an SO(3)-Averaged Flow training objective to eliminate the need for rotation alignment between the prior and data distributions by analytically averaging over all rotations in the rotation group SO(3). Combined with Reflow and distillation, it achieves high-quality few-step or even single-step molecular conformer generation.
Elucidating the Design Space of Multimodal Protein Language Models: This work systematically explores the design space of token-based multimodal protein language models (PLMs). Through innovations across four dimensions—bit-wise discrete modeling, geometry-aware architectures, representation alignment, and multimer data expansion—it reduces the folding RMSD of a 650M parameter model from 5.52 to 2.36, surpassing a 3B baseline model and approaching the level of specialized folding models.

Browse all 48 Computational Biology papers →

⚛️ Physics & Scientific Computing (20)

Causal-PIK: Causality-based Physical Reasoning with a Physics-Informed Kernel: Proposes Causal-PIK, which encodes physical causal similarity into a Physics-Informed Kernel for Bayesian optimization. This enables agents to find optimal actions with very few attempts in physical reasoning tasks, outperforming SOTA on the Virtual Tools and PHYRE benchmarks.
Causal Discovery of Latent Variables in Galactic Archaeology: Utilizing the Rank-based Latent Causal Discovery (RLCD) algorithm, this study automatically recovers two physically meaningful latent variables—birth radius and guiding radius—from only five observable stellar properties in a purely data-driven manner. This validates the potential of causal discovery methods to identify hidden physical quantities in astrophysics.
Closed-form Symbolic Solutions: A New Perspective on Solving Partial Differential Equations: This paper proposes the SymPDE framework, which utilizes deep reinforcement learning to directly search for closed-form symbolic solutions to PDEs, bypassing the issues of insufficient numerical accuracy and poor interpretability of PINNs. It achieves a 90% recovery rate on Poisson and heat equations.
Compact Matrix Quantum Group Equivariant Neural Networks: This paper extends group equivariant neural networks to the setting of compact matrix quantum groups, characterizing the weight matrices of such networks using Woronowicz's formulation of Tannaka-Krein duality, thereby providing a theoretical foundation for learning data on noncommutative geometries.
Differentiable Stellar Atmospheres with Physics-Informed Neural Networks: This work proposes Kurucz-a1, a physics-informed neural network (PINN) designed to simulate 1D stellar atmosphere models under the Local Thermodynamic Equilibrium (LTE) assumption. It resolves the key bottleneck of non-differentiable atmospheric structure solvers in differentiable stellar spectroscopy, outperforming the classic ATLAS-12 code in hydrostatic equilibrium and solar spectrum consistency.
Erwin: A Tree-based Hierarchical Transformer for Large-scale Physical Systems: The authors propose Erwin, a Transformer architecture based on a hierarchical ball tree structure. By restricting attention computation within fixed-size local spherical regions, Erwin achieves linear time complexity. Meanwhile, it captures multi-scale features through progressive coarsening/refinement and cross-ball interaction mechanisms, achieving SOTA performance in multiple domains including cosmology, molecular dynamics, PDE solving, and particle fluid dynamics.
Finetuning Stellar Spectra Foundation Models with LoRA: This work applies LoRA to the stellar spectra foundation model SpecCLIP for the first time, efficiently adapting models pre-trained on LAMOST/Gaia XP to DESI survey data with only approximately 100-200 labeled samples, and demonstrating that LoRA is a lightweight and effective strategy for cross-survey spectral transfer.
Gravity-Bench-v1: A Benchmark on Gravitational Physics Discovery for Agents: Proposes Gravity-Bench-v1, an interactive environment benchmark based on gravitational dynamics simulation to evaluate the capability of AI agents to make scientific discoveries (including OOD physics scenarios) under restricted observation budgets. The results show that current models possess significant shortcomings in observational planning and budget utilization.
Improving Memory Efficiency for Training KANs via Meta Learning: Proposes MetaKANs, which use a small meta-learner to generate the parameters of all learnable activation functions in KANs. This compresses the trainable parameter count of KANs from \((G+k+1)\) times that of MLPs to a level close to MLPs (approximately 1/3 to 1/9 of the original KAN count), while maintaining or even improving performance.
L2D: Large Language Models to Diffusion Finetuning: This paper proposes the L2D finetuning method, which treats a seed pretrained LLM as a single-step diffusion model and introduces a parallel diffusion path to achieve multi-step inference scaling. Without modifying the original weights, it obtains monotonically increasing accuracy as the number of inference steps increases, achieving consistent improvements across mathematical, coding, and reasoning tasks on four LLMs.

Browse all 20 Physics & Scientific Computing papers →

📡 Signal & Communications (3)

Deep Electromagnetic Structure Design Under Limited Evaluation Budgets: Proposes the Progressive Quadtree-based Search (PQS) method, which compresses the high-dimensional design space of electromagnetic structures via a hierarchical quadtree representation and utilizes a consistency-based sample selection mechanism to efficiently search for high-quality designs under a limited simulation budget, saving 75–85% of evaluation costs compared to generative methods.
Fourier Position Embedding: Enhancing Attention's Periodic Extension for Length Generalization: By extending each dimension in RoPE from a single frequency to a Fourier series representation and clipping undertrained low-frequency components, this work achieves reliable periodic extension of the attention mechanism, thereby significantly enhancing the length generalization capability of LLMs.
Large Language Model (LLM)-enabled In-context Learning for Wireless Network Optimization: This paper proposes a base station power control algorithm based on LLM In-context Learning (ICL). By leveraging natural language task descriptions and experience-pool-driven exemplar selection, it achieves performance close to traditional deep reinforcement learning without updating model parameters.

👥 Social Computing (6)

DEFAME: Dynamic Evidence-based FAct-checking with Multimodal Experts: Proposes DEFAME, a modular, zero-shot multimodal LLM pipeline. Using a six-stage dynamic workflow (Plan -> Execute -> Summarize -> Develop -> Predict -> Justify) together with external multimodal tools for evidence retrieval, it achieves end-to-end joint text-image fact-checking, reaching new SOTA performance on three benchmarks: AVeriTeC, MOCHEG, and VERITE.
Dynamical Phases of Short-Term Memory Mechanisms in RNNs: This work discovers two distinct dynamical mechanisms supporting short-term memory in RNNs: slow-point (SP) manifolds and limit cycles (LC). Using toy models, it analytically derives power-law scaling of their maximum learnable learning rates (SP: \(\beta \approx 4\)–\(5\) vs. LC: \(\beta \approx 2\)–\(3\)), and provides large-scale empirical validation by training approximately 80,000 RNNs.
Learning Survival Distributions with the Asymmetric Laplace Distribution: This paper proposes a parametric survival analysis method based on the asymmetric Laplace distribution (ALD). By using a neural network to learn the three parameters of the ALD (location, scale, and asymmetry), it achieves continuous, closed-form estimation of the survival distribution, comprehensively outperforming existing parametric and non-parametric approaches in both discriminative and calibration performance.
OR-Bench: An Over-Refusal Benchmark for Large Language Models: This work proposes OR-Bench, the first large-scale over-refusal benchmark for LLMs. It contains 80K safe prompts that are prone to being falsely refused, revealing a strong trade-off between safety and over-refusal with a Spearman correlation coefficient of up to 0.89.
Raising the Bar: Investigating the Values of Large Language Models via Generative Evolving Testing: This paper proposes the GETA framework, which integrates Computerized Adaptive Testing (CAT) from psychometrics with Automatic Item Generation (AIG). Utilizing a variational IRT model and an LLM-driven item generator, GETA dynamically probes the value boundaries of LLMs to address the "evaluation chronoeffect" (data leakage and difficulty saturation) inherent in static benchmarks.
When Bad Data Leads to Good Models: This paper proposes a "pre-training/post-training co-design" perspective, demonstrating through controlled experiments that incorporating a moderate amount of toxic data (~10%) into pre-training data actually reduces the entanglement of toxic features. This makes the model easier to detoxify during post-training (e.g., via ITI activation steering), ultimately reducing toxicity on ToxiGen from 41.40 to 2.63 while maintaining language capabilities.

🛡️ AI Safety (37)

A Certified Unlearning Approach without Access to Source Data: This paper proposes the first certified unlearning framework that does not require access to the original training data. By leveraging a surrogate dataset to approximate the statistical properties of the original data, and employing a noise scaling mechanism based on the statistical distance between the source and surrogate distributions, it achieves provable data deletion guarantees.
Accelerating Spectral Clustering under Fairness Constraints: The fair spectral clustering (Fair SC) problem is formulated into a Difference-of-Convex (DC) optimization framework. By employing a variable augmentation strategy and an ADMM-type algorithm, expensive eigendecomposition computations are avoided, achieving significant acceleration on large-scale problems.
Adaptive Multi-prompt Contrastive Network for Few-shot Out-of-distribution Detection: An Adaptive Multi-prompt Contrastive Network (AMCN) is proposed to perform high-quality OOD detection under few-shot ID label conditions by generating three classes of adaptive textual prompts (learnable ID prompts, label-fixed OOD prompts, and label-adaptive OOD prompts) combined with class-adaptive thresholds, significantly outperforming existing few-shot OOD detection methods.
Adversarial Inception Backdoor Attacks against Reinforcement Learning: Proposes the "inception" backdoor attack framework, which inserts triggers into the RL agent's training trajectories and replaces high-reward actions with targeted adversarial actions. It is the first to achieve a 100% attack success rate (ASR) under strict reward constraints while maintaining agent performance on clean tasks.
An Efficient Private GPT Never Autoregressively Decodes: This paper proposes POST (Public decOding and Secure verificaTion), a method that leverages public GPT models to generate draft tokens and securely verifies them using a private model. Exploiting the fact that secure decoding latency is insensitive to input length, POST achieves a 2.1× to 6.0× speedup in private inference while maintaining the same privacy guarantees and generation quality as standard secure decoding.
Avoiding Leakage Poisoning: Concept Interventions Under Distribution Shifts: This paper reveals the "leakage poisoning" phenomenon in Concept Bottleneck Models (CBMs)—where information bypassing the concept bottleneck hurts prediction accuracy under distribution shifts, leading to failed concept interventions. It proposes MixCEM, which utilizes a confidence gate to dynamically decide when to use or discard leaked information, maintaining both high accuracy and effective interventions under both in-distribution and out-of-distribution scenarios.
Breaking the n^{1.5} Additive Error Barrier for Private and Efficient Graph Sparsification: This paper breaks the \(n^{1.5}\) additive error barrier for differentially private graph cut sparsification by proposing a polynomial-time \((\varepsilon,\delta)\)-DP algorithm that reduces the additive error to \(n^{1.25+o(1)}\). The core technique is the first privacy-preserving expander decomposition algorithm.
Can One Safety Loop Guard Them All? Agentic Guard Rails for Federated Computing: Proposes Guardian-FC—the first backend-agnostic unified security framework for federated computing. By employing a finite-state safety loop (Sense→Predict→Act→Prove) on an Agentic-AI control plane, Guardian-FC uniformly regulates heterogeneous privacy mechanisms such as FHE, DP, and MPC, achieving consistent execution of a single set of guard-rail policies across all privacy backends.
Clients Collaborate: Flexible Differentially Private Federated Learning with Guaranteed Improvement of Utility-Privacy Trade-off: This paper proposes the FedCEO framework, which applies low-rank tensor proximal optimization on stacked client model parameters at the server side. By leveraging semantic complementarity among different clients, it recovers semantic information corrupted by DP noise, improving the utility-privacy trade-off bound by an order of \(O(\sqrt{d})\).
Collaborative Mean Estimation Among Heterogeneous Strategic Agents: Individual Rationality, Fairness, and Truthful Contribution: For the collaborative mean estimation problem among multiple agents with heterogeneous costs, this paper designs monetary-free mechanisms that simultaneously satisfy individual rationality (IR), incentive compatibility (IC), and fairness, achieving an \(\mathcal{O}(\sqrt{m})\) approximation ratio in the worst case, and proves three impossibility results.

Browse all 37 AI Safety papers →

📂 Others (90)

Access Controls Will Solve the Dual-Use Dilemma: Proposes a conceptual framework based on access control to address the dual-use dilemma in AI safety. By obtaining real-world context through user verification and combining it with content classification, the framework achieves fine-grained permission management, simultaneously mitigating over-refusal and under-refusal.
Addressing Imbalanced Domain-Incremental Learning through Dual-Balance Collaborative Experts (DCE): DCE is a two-stage training framework that combines a frequency-aware expert group with a dynamic expert selector to resolve two challenges in domain-incremental learning simultaneously: intra-domain class imbalance and cross-domain class distribution shift. It achieves state-of-the-art (SOTA) performance on four benchmarks.
Adversarial Combinatorial Semi-bandits with Graph Feedback: This paper introduces graph feedback into the adversarial combinatorial semi-bandits framework and proposes the OSMD-G algorithm, establishing the optimal regret bound of \(\widetilde{\Theta}(S\sqrt{T} + \sqrt{\alpha S T})\), where \(S\) is the size of the combinatorial action and \(\alpha\) is the independence number of the feedback graph. The key technique lies in utilizing randomized swap rounding to achieve negatively correlated sampling.
AutoAL: Automated Active Learning with Differentiable Query Strategy Search: Proposes the first differentiable active learning strategy search framework, AutoAL. By collaboratively training two networks, SearchNet and FitNet, under a bilevel optimization framework, it automatically selects the optimal strategy from multiple candidate AL strategies for a given task, consistently outperforming all candidate strategies and other SOTA methods on natural and medical image datasets.
Beyond Entropy: Region Confidence Proxy for Wild Test-Time Adaptation: Reveals the fundamental limitation of entropy minimization in wild test-time adaptation (WTTA)—conflicting optimization dynamics caused by inconsistent predictions of semantically similar samples in local regions. Proposes the ReCAP framework, which models regions probabilistically and utilizes a finite-to-infinite asymptotic approximation to convert the intractable region confidence into an efficiently optimizable proxy objective, consistently outperforming the state-of-the-art on ImageNet-C.
Bipartite Ranking From Multiple Labels: On Loss Versus Label Aggregation: This paper theoretically analyzes the Bayes-optimal solutions of two aggregation strategies in multi-label bipartite ranking—loss aggregation and label aggregation—revealing that loss aggregation suffers from a "label dictatorship" phenomenon (where a single label dominates the ranking due to marginal skewness), whereas label aggregation treats all labels in a more balanced manner.
Constrained Hamiltonian Systems on Observation-Induced Fiber Bundles: Theory of Symmetry and Integrability: This work proposes a geometric framework of "observation-induced fiber bundles" that internalizes observational uncertainty in partially observable systems from external perturbations into intrinsic variations of fiber coordinates. On this structure, it unifies the treatment of state and observational constraints, establishing a complete theory of symplectic geometry, integrability, symmetry, and conservation laws.
Continuous-Time Analysis of Heavy Ball Momentum in Min-Max Games: Through continuous-time ODE modeling, this work systematically reveals that heavy ball momentum behaves completely differently in min-max games compared to minimization problems: smaller momentum (including negative momentum) expands the stable stepsize range and guides trajectories toward flatter gradient regions, while alternating updates converge faster than simultaneous updates and amplify this regularization effect.
Cross-regularization: Adaptive Model Complexity through Validation Gradients: Proposes Cross-regularization, which directly optimizes regularization parameters (weight norm, noise scale, augmentation intensity) via validation set gradients, converging to the cross-validation optimal solution in a single training run, thereby eliminating the need for manual hyperparameter tuning.
Curvature Enhanced Data Augmentation for Regression: Proposes CEMS (Curvature-Enhanced Manifold Sampling), which utilizes the second-order approximation (curvature information) of the data manifold to generate synthetic samples for data augmentation in regression tasks, achieving state-of-the-art (SOTA) or near-SOTA performance in both in-distribution and out-of-distribution scenarios.

Browse all 90 Others papers →