🔬 ICLR2026 Accepted Papers
5337 ICLR2026 paper notes covering Reinforcement Learning (400), Image Generation (357), Learning Theory (293), LLM Reasoning (241), Model Compression (241), Optimization & Theory (222), Multimodal VLM (211), 3D Vision (201), and 53 other areas. Each note has a TL;DR, motivation, method, experiments, highlights, and limitations: a 5-minute read of the core ideas.
💡 LLM Reasoning (241)
- A Balanced Neuro-Symbolic Approach for Commonsense Abductive Logic
-
ARGOS facilitates bidirectional information exchange between LLMs and SAT solvers: the solver outputs "confirmed true literals" (the backbone), which the LLM uses to hypothesize missing commonsense clauses. These candidates are then filtered by scorers and fed back into the solver. This iterative completion of logic problems lacking explicitly stated commonsense premises outperforms pure neural or symbolic methods by up to 13% across multiple datasets.
- A Simple "Motivation" Can Enhance Reinforcement Finetuning of Large Reasoning Models
-
MeRF writes verifiable reward functions into prompts as a "rulebook" in natural language. By explicitly informing the model of optimization goals during RL training, it moves away from blind trial-and-error, significantly outperforming RLVR baselines in logic and mathematical reasoning tasks.
- A State-Transition Framework for Efficient LLM Reasoning
-
This paper proposes an efficient reasoning framework that models the LLM reasoning process as a state-transition process. By using Linear Attention to compress information from historical reasoning steps into a state matrix, the framework reduces attention complexity from \(O(C^2)\) to \(O(C)\) and KV cache from \(O(C)\) to \(O(1)\), while maintaining reasoning capabilities without shortening the CoT sequence. An additional momentum strategy is introduced to mitigate the "overthinking" problem caused by noisy reasoning steps.
- A Stitch in Time Saves Nine: Proactive Self-Refinement for Language Models
-
PASR uses Reinforcement Learning (GRPO) to train LLMs to proactively decide "whether, when, and how" to refine their reasoning trajectories during the generation process (rather than post-hoc rework). By designing a "contrastive refinement reward" to encourage valuable corrections, it reduces average token consumption by 41.6% while improving accuracy by 8.2% on Qwen3-8B compared to standard generation.
- AceReason-Nemotron 1.1: Advancing Math and Code Reasoning through SFT and RL Synergy
-
NVIDIA systematically decomposes the synergistic relationship between "Supervised Fine-Tuning (SFT) + Large-scale Reinforcement Learning (RL)" in building reasoning models. By expanding SFT data, tuning RL sampling temperature to target "entropy \(\approx 0.3\)", and staging response lengths, a 7B model (AceReason-Nemotron-1.1) achieves new SOTA results for math/code reasoning among 7B-scale models (AIME25 64.8, LiveCodeBench v6 52.1).
- ActivationReasoning: Logical Reasoning in Latent Activation Spaces
-
The ActivationReasoning (AR) framework is proposed to embed explicit logical reasoning into the latent activation space of LLMs (via features extracted by SAEs). Through a three-stage pipeline (discovering concept representations → detecting activation propositions → reasoning with logical rules), it achieves multi-hop reasoning, concept composition, and safety control. On PrOntoQA, an 8B model achieves 95%+ accuracy, surpassing GPT-4o.
- Adaptive Social Learning via Mode Policy Optimization for Language Agents
-
This paper proposes the Adaptive Social Learning (ASL) framework, featuring four hierarchical reasoning modes (ranging from intuitive response to deep deduction). Through the AMPO algorithm—which integrates mode-level and sample-level advantage estimation—LLM agents adaptively switch reasoning depth based on the complexity of social scenarios. On social intelligence tasks, it outperforms GPT-4o by 15.6% and GRPO by 7.0%, while reducing token usage by 32.8%.
- Adaptive Thinking: Large Language Models Know When to Think in Latent Space
-
This paper proposes Sonata: using a lightweight MLP adapter to directly predict "self-consistency" from the last-layer hidden states of a query during the prefilling stage. This allows the model to decide whether and how much to think before decoding, reducing thinking tokens by 20%–60% while maintaining accuracy.
- Agentic Reinforcement Learning with Implicit Step Rewards
-
This paper proposes iStar, a universal credit assignment strategy for multi-turn reinforcement learning of LLM agents. By alternately optimizing an implicit process reward model (PRM) and a policy model, iStar learns dense rewards for each action step through a multi-turn DPO objective. Step-level advantages are combined with episode-level advantages to update the policy. iStar achieves SOTA results on WebShop, VisualSokoban, and the open-ended social environment SOTOPIA, demonstrating superior sample efficiency and training stability.
- AgentMath: Empowering Mathematical Reasoning for Large Language Models via Tool-Augmented Agent
-
AgentMath proposes a tool-augmented agent framework that seamlessly integrates LLM reasoning with the computational precision of a code interpreter through automated data synthesis, multi-turn interactive reinforcement learning (RL), and an efficient asynchronous training system. It achieves SOTA performance on AIME24/25 and HMMT25 at the 30B-A3B scale (90.6%/86.4%/73.8%), surpassing o3-mini and Claude-Opus-4.0-Thinking.
Browse all 241 LLM Reasoning papers →
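The state-matrix idea in the "A State-Transition Framework for Efficient LLM Reasoning" entry above can be illustrated with generic linear attention. This is a minimal sketch, not the paper's formulation: the `elu(x) + 1` feature map, the class name `LinearAttentionState`, and the normalizer `z` are assumptions chosen for illustration, and the momentum strategy is not modeled. It shows why memory stays constant: each step folds its key and value into a fixed-size matrix instead of appending to a cache.

```python
import math


def feature_map(x):
    # elu(x) + 1: a common positive feature map for linear attention
    # (an assumption here, not taken from the paper).
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]


class LinearAttentionState:
    """Fixed-size summary of all past (key, value) pairs."""

    def __init__(self, d_k, d_v):
        self.S = [[0.0] * d_v for _ in range(d_k)]  # running sum of phi(k) v^T
        self.z = [0.0] * d_k                        # running sum of phi(k)

    def update(self, k, v):
        # O(d_k * d_v) work per step, independent of how many steps came before.
        pk = feature_map(k)
        for i, pki in enumerate(pk):
            self.z[i] += pki
            for j, vj in enumerate(v):
                self.S[i][j] += pki * vj

    def read(self, q):
        # Attention output computed from the state alone; no past keys are stored.
        pq = feature_map(q)
        denom = sum(a * b for a, b in zip(pq, self.z))
        d_v = len(self.S[0])
        return [
            sum(pq[i] * self.S[i][j] for i in range(len(pq))) / denom
            for j in range(d_v)
        ]


def explicit_attention(q, keys, values):
    """Reference result obtained by revisiting every past step (cache grows with C)."""
    pq = feature_map(q)
    weights = [sum(a * b for a, b in zip(pq, feature_map(k))) for k in keys]
    total = sum(weights)
    d_v = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total for j in range(d_v)]
```

Feeding the same keys and values through `update` and then calling `read` gives the same vector as `explicit_attention`, while the stored state never grows beyond `d_k × d_v` numbers.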
🦾 LLM Agent (162)
- A\(^2\)FM: An Adaptive Agent Foundation Model for Tool-Aware Hybrid Reasoning
-
A2FM integrates three execution modes—instant, reasoning, and agentic—into a single backbone. It first learns the optimal routing path and then aligns mode-specific trajectories. By employing a cost-regularized reinforcement learning method (APO), the model avoids unnecessary computation for simple queries while maintaining accuracy for complex ones, reducing the cost per correct answer by approximately 45% on a 32B scale.
- A Benchmark for Deep Information Synthesis (DeepSynth)
-
The DeepSynth benchmark is proposed, containing 120 real-world information synthesis tasks across 7 domains and 67 countries (averaging 5.5 hours of human annotation per task). It requires agents to collect information from multiple webpages and perform structured reasoning. Currently, the strongest agent (o3-deep-research) only achieves 8.97 F1 / 17.5% LLM-Judge, revealing severe deficiencies in LLM agents regarding deep information synthesis.
- A Framework for Studying AI Agent Behavior: Evidence from Consumer Choice Experiments
-
The authors propose ABXLAB, a real-time "man-in-the-middle" framework that intercepts and rewrites webpage content to transform any shopping site into a controlled behavioral experiment. By systematically measuring choice biases in 17 mainstream LLM agents under cues like price, rating, display order, and psychological nudges, they find that agents are more manipulable than humans, with bias magnitudes reaching 3–10 times those of human baselines.
- Agent Data Protocol: Unifying Datasets for Diverse, Effective Fine-tuning of LLM Agents
-
The authors propose ADP, a lightweight "agent data interlingua" that unifies 13 heterogeneous agent training sets into a consistent Trajectory/Action/Observation pattern. These are then distributed to various agent frameworks for SFT, achieving an average gain of ~20% over base models and reaching or approaching SOTA on coding, browsing, and tool-use tasks.
- AgentFold: Long-Horizon Web Agents with Proactive Context Folding
-
AgentFold treats the context of a web agent as a proactively carved "cognitive workspace." During reasoning, the agent outputs an additional "fold command" at each step to perform fine-grained condensation or multi-step deep consolidation of the historical trajectory. This keeps the context at approximately 7k tokens even after 100 interaction rounds. A 30B model (with 3B activated) achieves 36.2% on BrowseComp, surpassing the 671B DeepSeek-V3.1 and OpenAI o4-mini.
- AgentGym-RL: An Open-Source Framework to Train LLM Agents for Long-Horizon Decision Making via Multi-Turn RL
-
This paper introduces AgentGym-RL, an open-source decoupled multi-turn reinforcement learning framework capable of training LLM agents from scratch across five real-world scenarios: Web Navigation, Deep Search, Digital Games, Embodied Control, and Science Tasks. It proposes ScalingInter-RL—a phased training method that progressively increases the number of interaction turns from short-horizon to long-horizon—enabling a 7B model to match or outperform OpenAI o3 and Gemini-2.5-Pro across 27 tasks.
- Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models
-
The ACE (Agentic Context Engineering) framework proposes treating context as a continuously evolving "playbook." By utilizing a division of labor among Generator-Reflector-Curator roles and incremental delta updates, it continuously accumulates and refines strategies. This addresses brevity bias and context collapse in existing prompt optimization methods, achieving an average improvement of 10.6% on agent tasks and 8.6% on financial tasks, while reducing adaptation latency by 86.9%.
- AgenTracer: Who Is Inducing Failure in the LLM Agentic Systems?
-
AgenTracer employs "counterfactual replay + programmatic fault injection" to automatically annotate multi-agent failure trajectories, constructing the TracerTraj-2.5K dataset. It then trains a lightweight 8B "failure tracer" using multi-granularity reinforcement learning. On the Who&When benchmark, it localizes decisive errors to specific agents and steps, outperforming giant models like Gemini-1.5-Pro and Claude-3.5-Sonnet by up to 18.18% in agent-level accuracy. Furthermore, providing feedback to off-the-shelf systems like MetaGPT and MaAS leads to performance gains of 4.8%–14.2%.
- AgentSynth: Scalable Task Generation for Generalist Computer-Use Agents
-
The AgentSynth pipeline leverages the principle of information asymmetry—where forward step-by-step generation of simple tasks is easy, while backward holistic solving is difficult—to chain simple subtasks into complex long-horizon computer-use tasks. It automatically generates 6000+ diverse tasks and trajectories at only $0.60 per trajectory, while SOTA agents achieve only a 4% success rate at the highest difficulty level.
- AlphaAgentEvo: Evolution-Oriented Alpha Mining via Self-Evolving Agentic Reinforcement Learning
-
AlphaAgentEvo recasts quantitative "factor mining" from a fragile "search-backtest-restart" cycle into a continuous evolution trajectory. A 4B LLM agent, guided by hierarchical rewards over multi-turn tool calls, learns long-term planning and reflection and ultimately outperforms factor evolution methods driven by GPT-5-mini / DeepSeek-R1.
Browse all 162 LLM Agent papers →
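The "fold command" described in the AgentFold entry above can be pictured as an operation that replaces a span of the interaction history with one summary block. The sketch below is illustrative only: the `Block` fields, the `fold` signature, and the example summary text are hypothetical, and in the real system the summary would be written by the model rather than passed in by hand.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Block:
    first_step: int  # earliest interaction step this block covers
    last_step: int   # latest interaction step this block covers
    text: str        # raw observation or condensed summary


def fold(workspace: List[Block], start: int, end: int, summary: str) -> List[Block]:
    """Replace blocks start..end (inclusive) with a single summary block.

    Folding one block is a fine-grained condensation; folding several
    corresponds to a multi-step consolidation.
    """
    if not (0 <= start <= end < len(workspace)):
        raise ValueError("fold range out of bounds")
    merged = Block(workspace[start].first_step, workspace[end].last_step, summary)
    return workspace[:start] + [merged] + workspace[end + 1:]


# Hypothetical four-step history: three dead-end searches, then a useful page.
history = [
    Block(0, 0, "search: release date of product X"),
    Block(1, 1, "opened page A: no date listed"),
    Block(2, 2, "opened page B: no date listed"),
    Block(3, 3, "opened page C: press release found"),
]
folded = fold(history, 0, 2, "Steps 0-2: searched and checked pages A and B, no date found.")
```

After the call, `folded` holds two blocks instead of four, and the summary block still records that it covers steps 0 through 2, so step provenance survives the condensation.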
👥 Multi-Agent (47)
- Adaptive Collaboration with Humans: Metacognitive Policy Optimization for Multi-Agent LLMs with Continual Learning
-
The HILA framework is proposed to enable multi-agent LLMs to learn a set of "metacognitive policies"—judging when to solve problems independently and when to defer to human experts. By using Dual-Loop Policy Optimization, it decouples the optimization of "when to ask" (inner-loop reinforcement learning) from "how to gain capability from assistance" (outer-loop continual learning), consistently outperforming existing autonomous multi-agent systems on benchmarks such as mathematical reasoning.
- Aegis: Automated Error Generation and Attribution for Multi-Agent Systems
-
Aegis uses an LLM manipulator to actively inject errors into successful multi-agent trajectories, turning them into labeled failure trajectories and automatically generating 9,533 data entries labeled with "erroneous agent + error mode." This transforms the expensive manual labeling bottleneck into a scalable engineering problem and supports training error attribution models via SFT, RL, and contrastive learning.
- AgentPO: Enhancing Multi-Agent Collaboration via Reinforcement Learning
-
Instead of searching for multi-agent topologies, AgentPO freezes a powerful Actor within a fixed topology and uses Reinforcement Learning (GRPO) to train a lightweight Collaborator to learn "how to assist teammates." With only 500 training samples and 7.8% of the inference overhead of EvoAgent, it consistently outperforms strong baselines like Role Assignment and EvoAgent across multiple mathematical reasoning benchmarks.
- AI-for-Science Low-code Platform with Bayesian Adversarial Multi-Agent Framework
-
The framework organizes "tasking-solving-scoring" agents into an adversarial loop and uses a non-LLM Bayesian update rule to evolve code, test cases, and prompts simultaneously. It enables 32B open-source models to outperform 235B models on scientific code generation benchmarks, shifting system reliability from "betting on a strong LLM" to "reducing uncertainty via Bayesian convergence."
- Aligned Agents, Biased Swarm: Measuring Bias Amplification in Multi-Agent Systems
-
This paper models Multi-Agent Systems (MAS) as directed acyclic graphs and evaluates them on Discrim-Eval-Open, an open-ended bias benchmark built on forced three-choice questions. Using the Gini coefficient to track the "amplification rate" of bias across layers, it reaches a counter-intuitive conclusion: although multi-agent collaboration is often assumed to "dilute" bias, role specialization, complex topologies, and deeper iteration actually amplify minor random preferences of individual models into systemic discrimination against groups. Even neutral external information can trigger intense polarization.
- ATLAS: Constraints-Aware Multi-Agent Collaboration for Real-World Travel Planning
-
ATLAS formalizes "real-world travel planning with search" as a dynamic Constraint Satisfaction Problem (CSP). It utilizes five specialized LLM agents (Search, Constraint Manager, Planner, Checker, and Search Advisor) to cooperatively complete constraints, iteratively correct errors, and guide search in the event of a deadlock. This approach increases the final pass rate of TravelPlanner from 23.3% to 44.4% and achieves an 84% pass rate in a real-world multi-round scenario with live web search for the first time.
- Benefits and Limitations of Communication in Multi-Agent Reasoning
-
This paper establishes a theoretical framework based on Transformer expressivity for multi-agent reasoning systems that "chunk long contexts, process them via multiple LLM agents, and aggregate results." It proves tight bounds on how many agents and how much communication are needed to achieve specific parallel speedups across associative recall, state tracking, and k-hop reasoning. It identifies three depth–communication trade-off regimes and validates the theoretically predicted inflection points using Llama on synthetic benchmarks.
- Breaking and Fixing Defenses Against Control Flow Hijacking in Multi-Agent Systems
-
This paper first demonstrates that existing "alignment-check" defenses (e.g., LlamaFirewall) can be bypassed by meticulously rewritten Control Flow Hijacking (CFH) attacks. It then proposes CONTROLVALVE—a coordination-layer defense inspired by program Control Flow Integrity (CFI). During the task planning phase, it generates an "allowed agent call graph + per-edge context rules." During execution, it performs "narrow decisions" on each agent transition to verify if it exists in the graph and satisfies the edge rules. This approach reduces the attack success rate to 0% across all evaluated attacks without degrading performance on baseline tasks.
- BRIDGE: Bi-level Reinforcement Learning for Dynamic Group Structure in Coalition Formation Games
-
This paper models the problem of optimally partitioning a group of agents into several coalitions (the NP-complete Coalition Structure Generation problem) as a compact, RL-solvable MDP. With bi-level RL (the upper level learns to merge coalitions and the lower level learns optimal individual policies), models trained on only 3 agents generalize to 100 agents, outperforming traditional heuristics in both inference speed and performance in mixed-motive Markov games.
- Cache-to-Cache: Direct Semantic Communication Between Large Language Models
-
Instead of collaborating through natural language "conversations," multiple large language models use a lightweight neural network to directly project and fuse the KV-Cache of a Sharer model into a Receiver model. This bypasses token-by-token text generation, preserving deep semantics that text might lose, while reducing average latency by 2.5× and improving accuracy by approximately 3–5% compared to pure text-based collaboration.
Browse all 47 Multi-Agent papers →
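The execution-time check described in the CONTROLVALVE entry above (is this agent transition in the planned call graph, and does it satisfy the rule on that edge?) can be sketched in a few lines. Everything named here is an assumption for illustration: the agent names, the `context` dictionary, and the representation of edge rules as Python predicates are hypothetical, and the real system generates its graph and rules during task planning.

```python
from typing import Callable, Dict, Tuple

# A rule inspects the context of one transition and returns True if it is acceptable.
Rule = Callable[[dict], bool]


def is_transition_allowed(
    graph: Dict[Tuple[str, str], Rule], src: str, dst: str, context: dict
) -> bool:
    """Narrow decision for a single agent-to-agent transition."""
    rule = graph.get((src, dst))
    if rule is None:
        # Edge was never planned: reject regardless of how plausible the request looks.
        return False
    return bool(rule(context))


# Hypothetical call graph produced at planning time for a "summarize a web page" task.
call_graph: Dict[Tuple[str, str], Rule] = {
    ("planner", "web_search"): lambda ctx: True,
    ("web_search", "summarizer"): lambda ctx: ctx.get("source") == "search_results",
}

ok = is_transition_allowed(call_graph, "planner", "web_search", {})
hijacked = is_transition_allowed(call_graph, "web_search", "code_executor", {})
wrong_context = is_transition_allowed(
    call_graph, "web_search", "summarizer", {"source": "injected_instruction"}
)
```

The check never asks whether a request "seems aligned" with the task; it only tests membership and the edge rule, which is the narrowness the entry attributes to the Control Flow Integrity analogy.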
⚖️ Alignment & RLHF (102)
- A2D: Any-Order, Any-Step Safety Alignment for Diffusion Language Models
-
The authors propose A2D, a token-level safety alignment method for diffusion language models (dLLMs). By training the model to output the [EOS] token at masked positions when encountering harmful content, it achieves safety defense across any decoding order and any decoding step. This reduces the DIJA template attack success rate from 80%+ to near zero (1.3%/0.0%) and supports early rejection for 19.3× acceleration.
- ActiveDPO: Active Direct Preference Optimization for Sample-Efficient Alignment
-
ActiveDPO utilizes the "aligned LLM itself" as a reward model. Based on the gradient of its implicit reward, it derives a theoretically guaranteed uncertainty criterion to actively select the most valuable preference triplets for annotation. This allows the LLM to reach higher alignment levels using fewer human preference labels under a fixed annotation budget.
- Align Once, Benefit Multilingually: Enforcing Multilingual Consistency for LLM Safety Alignment
-
The authors propose the Multi-Lingual Consistency (MLC) auxiliary loss. By using SVD to manipulate the singular values of the multilingual representation matrix toward rank-1 (i.e., making multilingual representations collinear), safety alignment effects from a single language can be consistently transferred to all languages using only multilingual prompt translations (without needing target language responses).
- Aligner, Diagnose Thyself: A Meta-Learning Paradigm for Fusing Intrinsic Feedback in Preference Alignment
-
To address the issue where "mislabeled preference pairs" in preference datasets degrade DPO alignment, this paper moves beyond single heuristics like perplexity differences. It allows the model to "self-diagnose"—constructing a diagnostic vector from three intrinsic signals: consistency, learning difficulty, and generation confidence. A small network is then trained via meta-learning to fuse these signals and adaptively weight each sample, significantly outperforming existing robust alignment methods across various noise ratios.
- Aligning Deep Implicit Preferences by Learning to Reason Defensively
-
Addressing the problem in LLM personalization where models "merely parrot explicit user preferences while failing to infer deep intentions or proactively avoid risks," this paper reformulates alignment from scalar reward matching into a structured reasoning process. It first constructs DeepPref, a reasoning chain dataset with step-by-step critique annotations using a "Multi-role Cognitive Committee." Then, it trains Pers-GenPRM, a generative process reward model that "critiques before scoring." Finally, a token-level online RL strategy (CDPA) is employed to integrate numerical and natural language feedback, achieving SOTA results in both deep preference understanding and defensive reasoning.
- Alignment-Weighted DPO: A Principled Reasoning Approach to Improve Safety Alignment
-
The authors first use causal intervention to prove that "current safety alignment is shallow and unrelated to deep reasoning," then release an open-source CoT safety fine-tuning dataset to teach models to "refuse with reasoning." Finally, they propose Alignment-Weighted DPO: decomposing responses into a "reasoning segment" and a "response segment" with different weights, applying heavier preference updates to the segment that is more harmful in failed jailbreaks. This significantly improves robustness against various jailbreak attacks while preserving utility.
- AlphaAlign: Incentivizing Safety Alignment with Extremely Simplified Reinforcement Learning
-
AlphaAlign utilizes an extremely simplified pure reinforcement learning framework—requiring only binary "harmful/benign" labels and fewer than 200 RL steps—to incentivize the "latent safety self-awareness" embedded in large models during pre-training. By requiring the model to generate a safety rationale before answering and employing a dual-reward system (verifiable safety reward + normalized helpfulness reward), it breaks the "safety-utility" trade-off.
- AlphaSteer: Learning Refusal Steering with Principled Null-Space Constraint
-
AlphaSteer is proposed to dynamically construct steering vectors by learning a transformation matrix subject to null-space constraints. It generates near-zero vectors for benign inputs (preserving utility) and reconstructs refusal direction vectors for malicious inputs (enhancing safety), providing a theoretical guarantee for the decoupling of safety and utility.
- Anchored Supervised Fine-Tuning
-
Using the reward-weighted regression (RWR) framework, this paper rigorously explains why Dynamic Fine-Tuning (DFT) is "tighter but prone to drift." It proposes ASFT, which adds a lightweight KL anchoring term to the DFT reweighting objective, achieving stable gains on both reasoning and knowledge tasks at SFT-level computational cost.
- Annotation-Efficient Honesty Alignment via Confidence Elicitation and Calibration
-
This paper decomposes "honesty alignment" (enabling LLMs to accurately state their confidence before answering) into an "Elicitation-then-Calibration" two-stage paradigm: first, the model is taught to externalize its internal confidence using annotation-free self-consistency signals; second, this elicited confidence is calibrated to actual accuracy using a minimal amount of correctness labels (~1k samples, approximately 0.18% of the full set). The authors release HonestyBench with 560k training samples, demonstrating that using only 1k labels achieves 98% of the performance of full supervision.
Browse all 102 Alignment & RLHF papers →
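The "annotation-free self-consistency signals" mentioned in the honesty alignment entry above can be approximated by sampling several answers to the same question and measuring how often they agree. The sketch below shows one common way to do that; whether the paper uses exactly this majority-agreement score is an assumption, and the function name is hypothetical.

```python
from collections import Counter
from typing import List, Tuple


def self_consistency_confidence(sampled_answers: List[str]) -> Tuple[str, float]:
    """Return the majority answer and the fraction of samples that agree with it.

    No correctness labels are needed: the score depends only on agreement
    among the model's own samples.
    """
    if not sampled_answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(sampled_answers)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(sampled_answers)


# Four hypothetical samples for one question: three agree, one differs.
answer, confidence = self_consistency_confidence(["42", "42", "41", "42"])
```

A score like this could serve as the elicitation target in the first stage, with the small labeled set (about 1k of 560k samples, roughly 0.18%) reserved for the calibration stage.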
🔒 LLM Safety (185)
- A2ASecBench: A Protocol-Aware Security Benchmark for Agent-to-Agent Multi-Agent Systems
-
This paper presents the first systematic evaluation of the security of Agent-to-Agent (A2A) protocol-driven multi-agent systems. The authors propose a threat taxonomy covering two major categories—"supply-chain manipulations" and "protocol-logic weaknesses"—comprising 6 protocol-aware attacks. Based on this, they construct A2ASecBench, the first dedicated security benchmark for A2A. By utilizing dynamic adapters to migrate attacks across different agent stacks and downstream tasks, and employing a "joint safety-utility evaluation" to quantify both harm and usefulness, they find that attack success rates (ASR) reach 100% for most attacks in three high-risk scenarios (travel, medical, finance) from the official A2A demo. These attacks are further shown to be transferable to other ecosystems such as LangGraph and ANP.
- A Guardrail for Safety Preservation: When Safety-Sensitive Subspace Meets Harmful-Resistant Null-Space
-
GuardSpace utilizes a two-stage guardrail—"covariance-preconditioned SVD to isolate and freeze safety-related weights + null-space projection to constrain adapter updates"—ensuring that LLMs lose almost no safety alignment during downstream fine-tuning while slightly improving downstream accuracy.
- Adaptive Attacks on Trusted Monitors Subvert AI Control Protocols
-
This paper points out that almost all current AI control protocols rely on a weaker trusted LLM monitor as the core security gatekeeper. A powerful untrusted model aware of protocol details simply needs to embed a prompt injection targeting that monitor in its output to trick the monitor into assigning a very low suspiciousness score to malicious code. This reverts the security of protocols like Trusted Monitoring, Defer-to-Trusted, Trusted Editing, and Defer-to-Resample back to the level of "Upfront Auditing" (monitoring-free). Notably, Defer-to-Resample actually amplifies the attack into a "best-of-n" scenario due to resampling, causing security to decrease rather than increase.
- AdPO: Enhancing the Adversarial Robustness of Large Vision-Language Models with Preference Optimization
-
AdPO is the first to reformulate the adversarial training of Large Vision-Language Models (LVLMs) as a preference optimization problem. By ensuring the model "prefers" correct outputs on clean images and "rejects" misleading outputs on adversarial images, and by fine-tuning only the CLIP image encoder, the method transfers from small models to large ones. This significantly improves adversarial robustness with almost no degradation in clean performance.
- AdvChain: Adversarial Chain-of-Thought Tuning for Robust Safety Alignment of Large Reasoning Models
-
Large Reasoning Models (LRMs) suffer from a "Snowball Effect": small deviations in the Chain-of-Thought (CoT) are progressively amplified, sliding either from safety analysis into harmful compliance or from helpfulness into over-refusal. To counter this, the paper proposes AdvChain, which fine-tunes the model on adversarial CoT samples featuring "Temptation-Correction" and "Hesitation-Correction" (errors are intentionally introduced and then corrected), teaching it dynamic self-correction. With only 1k data samples, AdvChain reduces the Attack Success Rate (ASR) of jailbreaks and CoT-hijacking to levels comparable to RealSafe-R1 (trained on 15× more data) while significantly reducing over-refusal without damaging math or code reasoning capabilities.
- Adversarial Déjà Vu: Jailbreak Dictionary Learning for Stronger Generalization to Unseen Attacks
-
The authors propose the "Adversarial Déjà Vu" hypothesis—that new jailbreaks are not entirely novel inventions but rather recombinations of adversarial skills from previous attacks. By using sparse dictionary learning to compress ~17,000 skills extracted from 32 attack papers into approximately 400 interpretable primitives (a jailbreak dictionary), they verify that "unseen attacks can be sparsely reconstructed from old skills." Based on this, they introduce ASCoT (Adversarial Skill Compositional Training), which trains models on combinations of skills rather than single attack instances, achieving the lowest harmful rate against unseen jailbreaks without excessive refusal.
- Align to Misalign: Automatic LLM Jailbreak with Meta-Optimized LLM Judges
-
AMIS upgrades "automatic jailbreaking" from "optimizing only attack prompts" to a bi-level meta-optimization framework that "simultaneously evolves attack prompts and scoring templates." The inner loop uses fine-grained continuous scores to guide prompt iteration, while the outer loop employs a newly proposed "ASR Alignment Score" to optimize the scoring template in reverse. This ensures scores increasingly align with actual attack success, ultimately achieving a 100% ASR on Claude-4-Sonnet, exceeding baselines by an average of over 70 percentage points.
- All Code, No Thought: Language Models Struggle to Reason in Ciphered Language
-
The authors systematically evaluate the mathematical reasoning capabilities of 10 models across 28 ciphers, revealing a critical "asymmetry": models can fluently translate ciphertext back into English (comprehension), but their accuracy drops significantly when performing reasoning directly in ciphertext (computation). This suggests that evading monitoring via ciphered Chain-of-Thought (CoT) is currently unfeasible for LLMs.
- An Ensemble Framework for Unbiased Language Model Watermarking
-
This paper proposes ENS, an ensemble framework that concatenates and compounds multiple unbiased logits watermarks with independent keys. By injecting a subtle, imperceptible weak signal at each layer and aggregating scores from \(n\) keys at the detection end, the SNR is boosted by approximately \(\sqrt{n}\). This significantly enhances detection power and robustness against rewriting while strictly keeping the output distribution unchanged (unbiased).
- Analyzing and Evaluating Unbiased Language Model Watermark
-
This paper proposes UWBENCH—the first open-source benchmark specifically designed for evaluating "distortion-free language model watermarks." It theoretically proves an impossibility theorem stating that "any detectable unbiased watermark cannot maintain the original distribution under repeated queries for the same prompt," introduces the SPMG metric to quantify distribution shift across multiple generations, and provides \(\ell_0\) certified robustness bounds for token-level editing attacks. Empirically, it establishes a tri-axial evaluation protocol for "Unbiasedness / Detectability / Robustness" and identifies that token replacement attacks yield more stable and reproducible robustness conclusions than paraphrase attacks.
Browse all 185 LLM Safety papers →
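The \(\sqrt{n}\) signal-to-noise gain quoted in the ENS entry above is what one would expect from summing independent detection scores. As a rough derivation under simplifying assumptions that are not taken from the paper (the per-key scores \(s_1, \dots, s_n\) are independent, each with mean shift \(\mu\) on watermarked text and variance \(\sigma^2\)):

\[
\mathbb{E}\Big[\sum_{i=1}^{n} s_i\Big] = n\mu, \qquad
\operatorname{Var}\Big[\sum_{i=1}^{n} s_i\Big] = n\sigma^2, \qquad
\mathrm{SNR}_n = \frac{n\mu}{\sqrt{n}\,\sigma} = \sqrt{n}\,\frac{\mu}{\sigma} = \sqrt{n}\,\mathrm{SNR}_1 .
\]

The mean grows linearly in \(n\) while the standard deviation grows only as \(\sqrt{n}\), so aggregating four keys would roughly double the SNR under these assumptions; correlated keys or unequal per-key signal strength would reduce the gain.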
👻 Hallucination Detection (40)
- AFTER: Mitigating Object Hallucinations in LVLMs with Adaptive Fact-guided Activation Editing
-
AFTER textualizes ground-truth image annotations into three categories of facts (category, attribute, and relationship). It constructs positive vision-text editing directions based on the activation difference between these factual descriptions and the original images. A lightweight estimator is then trained to estimate per-query offsets, adaptively pushing LVLM activations toward factual semantics, reducing hallucinations by up to 16.3% on the AMBER benchmark.
- BARREL: Boundary-Aware Reasoning for Factual and Reliable LRMs
-
Addressing the tendency of Large Reasoning Models (LRMs) to "hallucinate rather than admit ignorance" in factual QA, this paper identifies two pathological reasoning patterns triggered by "factual overthinking." It proposes BARREL, a three-stage training framework (Knowledge Boundary Labeling \(\rightarrow\) Boundary-Aware SFT \(\rightarrow\) GRPO with Reliability Rewards). BARREL improves the reliability of DeepSeek-R1-Distill-Llama-8B from 39.33% to 61.48% while simultaneously increasing accuracy.
- Beyond In-Domain Detection: SpikeScore for Cross-Domain Hallucination Detection
-
The authors find that multi-turn self-dialogues elicited from hallucinated answers exhibit uncertainty score fluctuations with far more intense "spikes" than those from truthful answers. They quantify this volatility as SpikeScore (the maximum second-order difference of the score sequence). By using a single threshold, SpikeScore enables hallucination detection across multiple domains while being trained only on a single domain. Its cross-domain AUROC consistently outperforms specialized methods like PRISM and ICR Probe across four LLMs and six benchmarks.
- Cat-PO: Cross-modal Adaptive Token-rewards for Preference Optimization in Truthful Multimodal LLMs
-
Addressing the hallucination issue in MLLMs, this paper proposes Cat-PO: using only the model's internal cross-modal attention and similarity, it calculates a three-tier visual relevance (global, local, and semantic) for each generated token. These are fused into a smooth token reward to reweight the DPO loss along with a token-level KL regularization for fine-grained hallucination correction, outperforming existing SOTAs by 7%–15% on benchmarks like AMBER-Generation and MM-Hal.
- ChainMPQ: Interleaved Text-Image Reasoning Chains for Mitigating Relation Hallucinations
-
ChainMPQ is a training-free reasoning framework that decomposes "Subject-Relation-Object" questions into five complementary sub-questions. These are fed sequentially to Large Vision-Language Models (LVLMs), passing textual answers and visual attention memory to subsequent steps to form an interleaved text-image reasoning chain, consistently reducing relation hallucinations across multiple LVLMs and benchmarks.
- CoFact: Conformal Factuality Guarantees for Language Models under Covariate Shift
-
CoFact replaces the fixed "conformal threshold" in LLM factuality control with an adaptive threshold that adjusts to online test distribution drifts. By using online density ratio estimation to dynamically reweight the calibration set, the method ensures that the hallucination rate does not exceed a user-defined \(\alpha\), even in realistic scenarios with continuous covariate shift in prompt streams and unavailable test labels.
- Copy-Paste to Mitigate Large Language Model Hallucinations
-
The authors propose the Copy-Paste generation paradigm, which trains LLMs to prioritize directly copying segments from the retrieval context rather than free paraphrasing. Combined with DPO training for high copy preference, this approach improves faithfulness from 80.2% to 92.8% on counterfactual RAG benchmarks.
- Critical Confabulations: Can LLMs Hallucinate for Social Good?
-
This paper reframes "hallucination" as a viable resource: it proposes critical confabulation, where LLMs "fill in" structural gaps in historical archives under evidentiary constraints. By evaluating 19 models on a "narrative cloze" task using unpublished Black history corpora, the authors demonstrate that controlled, well-defined hallucinations can serve knowledge production without collapsing into falsehood.
- Dynamic Multimodal Activation Steering for Hallucination Mitigation in Large Vision-Language Models
-
This paper proposes Dynamic Multimodal Activation Steering (DMAS), which dynamically selects relevant steering vectors from a semantic-based truthfulness database and vision-aware vectors to intervene in critical attention heads during inference. It significantly mitigates LVLM hallucinations without training, improving MME by 94.66 points and reducing the CHAIR hallucination rate by 20.2%.
- EmotionHallucer: Evaluating Emotion Hallucinations in Multimodal Large Language Models
-
EmotionHallucer is a hallucination evaluation benchmark for MLLM emotion understanding. It decomposes emotion hallucinations into two primary dimensions: "Emotional Psychology Knowledge" and "Real Multimodal Emotion Perception." Using paired basic/hallucinated binary QA, it detects whether models can both make fundamental emotional judgments and reject plausible but incorrect emotional descriptions. Furthermore, the proposed PEP-MEK inference framework improves model performance on the multimodal emotion perception subset by an average of 9.90%.
Browse all 40 Hallucination Detection papers →
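As a rough illustration of the SpikeScore entry above, which describes the score as the maximum second-order difference of an uncertainty score sequence, the following minimal sketch computes that quantity for a list of per-turn scores. Taking the absolute value of the second difference, the function names, the toy sequences, and the threshold of 0.5 are all assumptions made for illustration and are not taken from the paper.

```python
def spike_score(scores):
    """Largest absolute second-order difference of a score sequence.

    Using the absolute value is an assumption; the summary only says
    "maximum second-order difference".
    """
    if len(scores) < 3:
        return 0.0  # a second difference needs at least three points
    return max(
        abs(scores[i + 1] - 2 * scores[i] + scores[i - 1])
        for i in range(1, len(scores) - 1)
    )


def is_hallucination(scores, threshold):
    """Flag a dialogue whose spike score exceeds a single global threshold."""
    return spike_score(scores) > threshold


# Hypothetical per-turn uncertainty scores from a multi-turn self-dialogue.
steady = [0.30, 0.32, 0.31, 0.33, 0.32]
spiky = [0.30, 0.32, 0.90, 0.31, 0.33]

print(spike_score(steady), spike_score(spiky))
print(is_hallucination(steady, 0.5), is_hallucination(spiky, 0.5))
```

Because the second difference removes any constant offset or linear trend in the scores, a threshold on it is less tied to the absolute score level of a given domain, which is one plausible reading of why a single threshold can transfer across domains.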
📊 LLM Evaluation (131)¶
- ACADREASON: Exploring the Limits of Reasoning Models with Academic Research Problems
-
AcadReason utilizes 50 research questions from top-tier journal papers across 5 high-reasoning disciplines (Computer Science, Economics, Law, Mathematics, Philosophy) to specifically test whether LLMs and Agents can acquire and reason through academic knowledge "like a researcher." The results show that most LLMs score below 20, with even GPT-5 only reaching 16 points and the strongest Agent, OAgents, peaking at 34 points, revealing a significant gap in "super-intelligent academic research" capabilities.
- AdaBlock-dLLM: Semantic-Aware Diffusion LLM Inference via Adaptive Block Size
-
A statistical analysis of how token confidence changes during the denoising process of Diffusion Large Language Models (dLLMs) reveals that the "Volatility Band" (VB) region encodes the local semantic structure of the text. Building on this, the paper proposes AdaBlock-dLLM, a training-free, plug-and-play adaptive block size scheduler that aligns the block boundaries of semi-autoregressive decoding with semantic steps, achieving up to a 5.3% accuracy improvement at the same throughput.
- Addressing Pitfalls in the Evaluation of Uncertainty Estimation Methods for Natural Language Generation
-
This paper points out that the mainstream QA selective prediction evaluation for NLG uncertainty estimation is significantly biased by approximate correctness functions. It proposes using SP-MoJI, structured tasks, OOD/perturbation detection, and Elo aggregation to make evaluation conclusions more robust.
- Agentic Reinforced Policy Optimization
-
ARPO is a reinforcement learning algorithm tailored for multi-turn tool-calling agents. It identifies that the token entropy of LLMs spikes after each tool return. Consequently, it adaptively "forks" sampling at these high-entropy steps and employs advantage attribution to propagate the performance differences of branched paths back for learning. This achieves superior performance across 13 reasoning/deep-search benchmarks compared to trajectory-level RL, while using only half the tool-calling budget.
- AirQA: A Comprehensive QA Dataset for AI Research with Instance-Level Evaluation
-
AirQA is a human-annotated AI research QA dataset (13,956 papers, 1,246 questions) covering four question types (single/multi-doc/retrieval/comprehensive) and five element types (text/table/image/formula/metadata). It introduces instance-level objective evaluation using 19 "customized per question" Python functions and proposes a three-agent framework, EXTRACTOR, to automatically synthesize QA pairs and interaction trajectories, enabling a 7B model to reach the tool-calling performance of a 14B model after fine-tuning.
- AlphaBench: Benchmarking Large Language Models in Formulaic Alpha Factor Mining
-
AlphaBench is the first benchmark to systematically evaluate Large Language Models (LLMs) in "Formulaic Alpha Factor Mining" (FAFM). It decomposes the real workflow of quantitative researchers into three major tasks: factor generation, factor evaluation, and factor searching. By cross-evaluating over ten open-source and closed-source models in a real-world backtesting environment (Qlib + CSI300), the study finds that LLMs can reliably generate valid factors but perform close to random guessing when judging factor quality (evaluation task).
- An Open-Ended Benchmark and Formal Framework for Adjuvant Research with MLLM
-
Addressing the "vaccine adjuvant" field long neglected by AI, this work constructs the first expert-annotated open-ended QA benchmark (1294 QA pairs + 1364 formal descriptions). It systematically evaluates 11 closed-source and 19 open-source MLLMs and proposes a formal framework that encodes adjuvant design principles and immune mechanisms into structured variables and functions.
- AnesSuite: A Comprehensive Benchmark and Dataset Suite for Anesthesiology Reasoning
-
The authors construct AnesSuite, the first comprehensive dataset suite for anesthesiology reasoning. It includes the benchmark AnesBench (7,972 bilingual multiple-choice questions across three cognitive levels) and three training datasets (AnesCorpus/AnesQA/AnesR1). The Morpheus model, trained using SFT+GRPO, allows a 7B model to match the 14B baseline performance while revealing significant bottlenecks in complex clinical reasoning (System 2) for currently leading LLMs.
- Are LLMs Really Not Knowledgeable? Mining the Submerged Knowledge in LLMs' Memory
-
This paper argues that when LLMs fail QA tasks or respond with "unsure," it is often not because the knowledge is missing from the parameters, but because it is "submerged" and not expressed. It proposes the Hits@k metric to demonstrate that correct answers frequently reside within the top-k logits but are not selected (e.g., LLaMA3-8B achieves only 17.2% Hits@1 on DBpedia but 57.9% Hits@5). It further reveals that the prevalent "allow unsure" prompting paradigm actively suppresses low-confidence correct answers.
- ASIDE: Architectural Separation of Instructions and Data in Language Models
-
The paper proposes ASIDE, an architectural modification that distinguishes instructions and data via orthogonal rotation at the token embedding layer. By modifying only the forward pass and training on standard instruction-tuning data, it significantly enhances instruction-data separation and prompt injection robustness without requiring specialized safety training.
Browse all 131 LLM Evaluation papers →
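The Hits@k idea from the "submerged knowledge" entry above can be sketched as follows: an example counts as a hit if the gold answer appears among the k highest-scoring candidates, even when it is not the top choice. Scoring a single position over a small candidate list is a simplifying assumption, and the function names and toy logits are hypothetical rather than the paper's evaluation code.

```python
def hits_at_k(logits, gold_id, k):
    """Return True if gold_id is among the k highest-logit candidate ids."""
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return gold_id in ranked[:k]


def hits_rate(examples, k):
    """Fraction of (logits, gold_id) examples whose gold answer is in the top k."""
    return sum(hits_at_k(logits, gold, k) for logits, gold in examples) / len(examples)


# Hypothetical logits over four candidate answers, paired with the gold index.
examples = [
    ([2.0, 1.5, 0.1, -1.0], 1),  # gold is ranked second
    ([0.3, 0.2, 3.0, 0.1], 2),   # gold is ranked first
    ([1.0, 0.9, 0.8, 0.7], 3),   # gold is ranked last
]

for k in (1, 2, 4):
    print(k, hits_rate(examples, k))
```

The gap between the k=1 and k=2 rates in this toy data mirrors the kind of gap the entry reports between Hits@1 and Hits@5: the correct answer is often present in the ranking but not selected.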
⚡ LLM Efficiency (171)¶
- A Two-Phase Deep Learning Framework for Adaptive Time-Stepping in High-Speed Flow Modeling
-
ShockCast decouples "adaptive time-stepping for high-speed flows" into two learning problems: first, using a Neural CFL model to predict the next time step \(\Delta t\) based on the current flow field; then, using a Neural Solver conditioned on \(\Delta t\) to advance the flow field by \(\Delta t\). These two modules alternate autoregressively during inference, allowing the neural solver to process supersonic flow fields with shocks by refining or coarsening steps as needed, mirroring classical solvers.
- Accelerating Diffusion Large Language Models with SlowFast Sampling: The Three Golden Principles
-
Addressing the issue that existing sampling strategies for Diffusion Large Language Models (dLLMs) have a "fixed speed that does not adjust with generation states," this paper summarizes three empirical laws (Certainty, Convergence, Locality). Based on these, it designs SlowFast Sampling, which dynamically switches between "Slow Phase Exploration" and "Fast Phase Acceleration." It can be orthogonally combined with dLLM-Cache—achieving up to 15.63× acceleration on LLaDA for GPQA, and reaching 34.22× when integrated with cache, with almost no loss in accuracy.
- Attention Is All You Need for KV Cache in Diffusion LLMs
-
To address the redundancy in Diffusion Language Models (DLMs) where all tokens and layer KV pairs are recalculated at every step, this paper proposes Elastic-Cache—a training-free and architecture-agnostic method. It uses the "attention drift of the most-attended tokens" to determine when to refresh the cache, leverages the "deep-layers-change-first" pattern to decide from which layer to refresh, and applies block-level caching for distant MASK tokens outside a sliding window. It achieves up to 45.1× decoding acceleration on models like LLaDA and Dream-7B with almost no drop in performance.
- Autoencoding-Free Context Compression for LLMs via Contextual Semantic Anchors
-
Instead of appending randomly initialized "compression tokens" and relying on autoencoding pre-training for context reconstruction like ICAE, SAC directly selects several "anchor tokens" from the original text. It adds a learnable anchor embedding to these tokens and employs bidirectional attention to aggregate global information into the anchors' KV caches. By completely discarding the autoencoding task, SAC consistently outperforms current compression methods in question answering and long-document summarization.
- AutoSP: Unlocking Long-Context LLM Training Via Compiler-Based Sequence Parallelism
-
AutoSP elevates Sequence Parallelism (SP) from manual, framework-coupled operators to two specialized passes within the PyTorch-2.0 compiler stack: an SP-Pass on Torch-IR that automatically inserts communication and resizes activation buffers, and a Sequence-Aware Checkpointing (SAC-Pass) on the joint Aten-IR graph that relaxes min-cut constraints to recompute compute-intensive operators. This allows users to compile single-GPU models into distributed long-context training pipelines with a few lines of code, extending trainable sequence lengths by up to 2.7× on NVIDIA and 2.5× on AMD with near-zero throughput loss.
- BA-LoRA: Bias-Alleviating Low-Rank Adaptation to Mitigate Catastrophic Inheritance in Large Language Models
-
BA-LoRA superimposes three output space regularizations—consistency, diversity, and SVD—onto the PiSSA spectral-initialized LoRA framework. These specifically address knowledge drift, representation collapse, and noise overfitting caused by the amplification of pre-training biases during fine-tuning. It consistently outperforms numerous LoRA variants on both NLG and NLU tasks, showing greater gains on noisier pre-trained models.
- Beyond Fixed: Training-Free Variable-Length Denoising for Diffusion Large Language Models
-
DAEDAL utilizes the internal signal of EOS token prediction confidence in Diffusion Large Language Models (DLLMs) during denoising. Without training, it coarse-tunes the sequence length from a short uniform initial value to a task-appropriate length prior to denoising, and locally inserts masks for expansion at low-confidence regions during the denoising process. This overcomes the constraint of "manually presetting generation length," achieving or exceeding the accuracy of fine-tuned fixed-length baselines across four math/code benchmarks while significantly increasing the proportion of effective tokens.
- Beyond Masks: Efficient, Flexible Diffusion Language Models via Deletion-Insertion Processes
-
DID completely replaces the "mask-unmask" paradigm in diffusion language models with two continuous-time Markov chains (CTMC) representing "deletion-insertion": the forward process reduces a sequence to empty by deleting tokens, while the backward process reconstructs it by inserting tokens. By incorporating an "insertion score"-based DISE training objective and parallel dynamic programming, DID eliminates <MASK> and <PAD> tokens that consume nearly half of the computational budget. It natively supports variable-length generation and self-correction, achieving up to 3.42× training speedup and 3.79× inference speedup.
- Beyond Real: Imaginary Extension of Rotary Position Embeddings for Long-Context LLMs
-
RoPE++ reclaims the negative imaginary part discarded in standard RoPE complex attention and uses it as a parallel imaginary attention head. This enhances long-context modeling either without increasing the KV cache or, in an alternative configuration, while halving it.
- BoRA: Towards More Expressive Low-Rank Adaptation with Block Diversity
-
BoRA interprets the LoRA product \(BA\) as block matrix multiplication and breaks inter-block correlations by inserting an independent diagonal matrix \(\Sigma_{i,j}\) into each block product \(B_iA_j\). Using only \(b^2r\) additional parameters, BoRA scales the rank of LoRA weights by \(b\) times, achieving a 2-4% accuracy improvement over LoRA on GLUE, mathematics, and commonsense reasoning tasks with comparable parameter counts.
Browse all 171 LLM Efficiency papers →
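The block view in the BoRA entry above can be written out explicitly. The sketch below assumes \(B \in \mathbb{R}^{m \times r}\) is split into \(b\) row blocks \(B_i\) and \(A \in \mathbb{R}^{r \times n}\) into \(b\) column blocks \(A_j\); this partition is an illustrative assumption consistent with the summary, not a statement of the paper's exact notation.

\[
BA =
\begin{bmatrix}
B_1 A_1 & \cdots & B_1 A_b \\
\vdots & \ddots & \vdots \\
B_b A_1 & \cdots & B_b A_b
\end{bmatrix}
\quad\longrightarrow\quad
\Delta W_{\text{BoRA}} =
\begin{bmatrix}
B_1 \Sigma_{1,1} A_1 & \cdots & B_1 \Sigma_{1,b} A_b \\
\vdots & \ddots & \vdots \\
B_b \Sigma_{b,1} A_1 & \cdots & B_b \Sigma_{b,b} A_b
\end{bmatrix}
\]

Each \(\Sigma_{i,j}\) is an \(r \times r\) diagonal matrix with \(r\) free entries, and there are \(b^2\) blocks, which gives the \(b^2 r\) additional parameters quoted above. In plain LoRA every block shares the same factors, so \(\operatorname{rank}(BA) \le r\). With independent \(\Sigma_{i,j}\), each block row still has rank at most \(r\), but the \(b\) block rows no longer share a row space, so \(\operatorname{rank}(\Delta W_{\text{BoRA}}) \le b\,r\).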
📚 Pretraining (79)¶
- A Law of Data Reconstruction for Random Features (and Beyond)
-
This paper demonstrates an information-theoretical and algebraic "data reconstruction law" in random feature (RF) models: when the number of parameters \(p \gg dn\) (where \(d\) is data dimension and \(n\) is the number of samples), the training data can be fully reconstructed. The universality of this threshold is verified across RF, two-layer networks, and ResNet using a projection loss optimization method.
- Accessible, Realistic, and Fair Evaluation of Positive-Unlabeled Learning Algorithms
-
This paper proposes the first unified benchmark for PU learning, systematically addressing two key issues: (1) implementing model selection without negative samples using Proxy Accuracy and Proxy AUC; (2) identifying and resolving the Internal Label Shift problem in the one-sample setting through a simple calibration method that merges positive samples into the unlabeled set, enabling fair comparison of two-sample algorithms over one-sample evaluations.
- ADEPT: Continual Pretraining via Adaptive Expansion and Dynamic Decoupled Tuning
-
ADEPT discovers that the contributions of different layers and parameter units in LLMs to "general competence" are highly non-uniform. Consequently, it only replicates the layers least important to the general domain to create new capacity and assigns asymmetric learning rates within these expanded layers based on unit importance. In continual pretraining (CPT) for math and medical domains, this method injects new knowledge with almost no damage to general competence—tuning only 15% of parameters in less than 50% of the training time, yet achieving 5.76% higher performance on general benchmarks and 5.58% higher on domain benchmarks compared to full-parameter CPT.
- Autoregressive Models Rival Diffusion Models at Any-Order Generation
-
This paper proposes A3 (Any-order Any-subset Autoregressive modeling), which reintegrates the "any-order, any-subset" flexibility of diffusion language models into the autoregressive framework. By using group-wise factorization to preserve the multi-layer dependency modeling capabilities of AR, and employing two-stream attention with a progressive curriculum to smoothly transform a pre-trained AR model into an any-order generator, A3 comprehensively outperforms diffusion language models of the same scale while using significantly less training data.
- Avey-B: Refactoring Attention-Free Architectures into Bidirectional Encoders
-
Avey-B transforms the originally autoregressive, attention-free Avey architecture into a BERT-style bidirectional encoder by removing causal masks, decoupling static weights and dynamic similarity into alternating layers, applying row normalization to dynamic layers, and integrating a neural compressor within the ranker. Consequently, it consistently outperforms BERT/RoBERTa/ModernBERT/NeoBERT in token classification and information retrieval, using approximately \(11\times\) fewer pre-training tokens than ModernBERT while achieving \(3.38\times\) faster throughput at a context length of 96K.
- Beyond Length: Quantifying Long-Range Information for Long-Context LLM Pretraining Data
-
Addressing the overlooked fact that "long text \(\neq\) long-range dependency," this paper proposes LongFilter. It quantifies the "information gain from extended context" by comparing a language model's prediction distributions under long vs. short context for each token. Samples that are long but predictable using only local context are filtered out. Continuing pretraining LLaMA-3-8B (8K \(\rightarrow\) 64K) with filtered data yields an average improvement of over 2 points on HELMET, LongBench, and RULER, achieving equivalent performance with approximately half the data.
- Beyond Multi-Token Prediction: Pretraining LLMs with Future Summaries
-
This paper proposes Future Summary Prediction (FSP): adding an auxiliary head to the standard next-token prediction to predict a compact summary of a long-range future sequence (instead of predicting future tokens one by one). Two summary construction methods are provided: a manual bag-of-words summary (FSP-BoW) and a learned summary distilled from a reverse language model (FSP-RevLM). Large-scale pretraining experiments at the 3B/8B scale demonstrate that FSP consistently outperforms Next-Token Prediction (NTP) and Multi-Token Prediction (MTP) on mathematical, reasoning, and coding tasks, with improvements up to 4–5 percentage points in mathematics.
- Beyond URLs: Metadata Diversity and Position for Efficient LLM Pretraining
-
This paper systematically broadens the design space of "metadata conditioning for accelerating LLM pretraining." Beyond the known effectiveness of prepending URLs, the authors discover that fine-grained quality scores and domain information can similarly accelerate training. They propose two new mechanisms—"appending metadata as an auxiliary prediction task" and "learnable meta tokens"—and use layer-wise probing to reveal how these signals reshape latent representations.
- Block-Sample MAC-Bayes Generalization Bounds
-
This paper proposes block-sample MAC-Bayes (mean approximately correct) generalization bounds. By partitioning training data into \(J\) blocks and replacing the global KL divergence with the sum of KL divergences conditioned on each block, the framework provides finite, meaningful generalization error bounds in scenarios where traditional PAC-Bayes bounds are vacuous (e.g., deterministic learning algorithms like mean estimation). It also demonstrates that high-probability versions of this bound are generally infeasible.
- Can Small Training Runs Reliably Guide Data Curation? Rethinking Proxy-Model Practice
-
This paper points out a fatal flaw in the practice widely relied upon by frontier teams—comparing data recipes using small proxy models with fixed hyperparameters. Dataset rankings can be flipped by minor changes in the learning rate. The authors propose training proxy models with an extremely small learning rate (\(10^{-5}\sim10^{-6}\)) as a simple patch, which improves the Spearman correlation of rankings from a proxy (GPT2-125M) to a target model (Pythia-1B) from \(<0.75\) to \(>0.95\) across 23 data recipes.
Browse all 79 Pretraining papers →
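To make the LongFilter entry above more concrete, the sketch below scores a sample by how much its per-token predictive distributions change when the model sees long rather than short context, and keeps only samples whose score clears a threshold. Using KL divergence as the comparison, averaging over tokens, the function names, the toy distributions, and the threshold are all assumptions for illustration; the summary only states that the two distributions are compared per token.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def long_range_score(long_ctx_dists, short_ctx_dists):
    """Mean per-token divergence between long- and short-context predictions."""
    per_token = [kl_divergence(p, q) for p, q in zip(long_ctx_dists, short_ctx_dists)]
    return sum(per_token) / len(per_token)


def keep_sample(long_ctx_dists, short_ctx_dists, threshold):
    """Keep a sample only if extended context changes predictions enough."""
    return long_range_score(long_ctx_dists, short_ctx_dists) >= threshold


# Hypothetical two-token samples over a two-word vocabulary.
# Each pair is (distributions with long context, distributions with short context).
local_only = ([[0.7, 0.3], [0.6, 0.4]], [[0.7, 0.3], [0.6, 0.4]])
long_range = ([[0.9, 0.1], [0.2, 0.8]], [[0.5, 0.5], [0.5, 0.5]])

print(keep_sample(*local_only, threshold=0.1))
print(keep_sample(*long_range, threshold=0.1))
```

A sample whose predictions are identical under both context lengths scores zero and is filtered out, which matches the entry's point that a long document is not necessarily one with long-range dependencies.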
✏️ Knowledge Editing (15)¶
- ACE: Attribution-Controlled Knowledge Editing for Multi-hop Factual Recall
-
ACE identifies a neglected mechanism via neuron-level attribution where "implicit subjects act as query neurons in multi-hop reasoning, activating value neurons layer-by-layer." Accordingly, it refines editing from "layer-level heuristics" to "query-value pathways," outperforming the SOTA PMET by 9.44% on GPT-J and 37.46% on Qwen3-8B in multi-hop factual recall.
- Bilinear Representation Mitigates Reversal Curse and Enables Consistent Model Editing
-
By training Transformers from scratch on synthetic relational knowledge graphs, the authors find that appropriate regularization leads to the emergence of a bilinear relational structure in hidden layers. This structure not only overcomes the reversal curse but also enables logically consistent propagation of updates to related facts after editing a single fact.
- Disentangling Knowledge Representations for Large Language Model Editing
-
Addressing the neglected problem that knowledge editing collaterally damages fine-grained irrelevant knowledge (facts with the same subject but a different relation or object), this paper proposes DiKE. It first uses a reusable disentanglement module to split subject representations into "target-related" and "irrelevant" parts, then edits only the related part while explicitly constraining the irrelevant part to remain unchanged. From this it derives a closed-form rank-one parameter update formula that maintains mainstream editing performance while preserving fine-grained irrelevant knowledge.
- EAMET: Robust Massive Model Editing via Embedding Alignment Optimization
-
This paper reveals that the root cause of failure in massive model editing is the structural inconsistency between key embeddings and residual embeddings (embedding misalignment). It proposes EAMET, which progressively saves optimized residual embeddings and aligns their neighborhood structure to the key embedding space using a dual loss of KL divergence and MSE. Experimental results across 6 LLMs and 3 datasets show that EAMET outperforms MEMIT by an average of 14% (CounterFact) and 8% (ZsRE) when editing 10k facts simultaneously, while maintaining robustness in scenarios involving long prefixes and multiple facts per subject.
- Energy-Regularized Sequential Model Editing on Hyperspheres
-
This paper analyzes performance degradation in sequential model editing from the perspective of hyperspherical uniformity (Hyperspherical Energy, HE) and proposes SPHERE. By projecting editing perturbations onto the orthogonal complement of the primary hypersphere directions of pre-trained weights, SPHERE achieves stable large-scale sequential editing, outperforming the strongest baseline by an average of 16.41% on LLaMA3-8B.
- Fine-tuning Done Right in Model Editing
-
This paper reveals that fine-tuning has been underestimated in model editing because of an incorrect training pipeline (Depth-First sample-by-sample optimization). By correcting this to standard Breadth-First mini-batch training and combining it with localized parameter tuning to form LocFT-BF, the authors support, for the first time, 100,000 sequential edits and models at the 72B scale.
- GOT-Edit: Geometry-Aware Generic Object Tracking via Online Model Editing
-
By using null-space constrained online model editing, this work integrates 3D geometric information provided by VGGT into a 2D generic object tracker. This enhances geometric awareness while maintaining semantic discriminative power, significantly improving tracking performance in scenarios with occlusions and background clutter.
- KnowledgeSmith: Uncovering Knowledge Updating in LLMs with Model Editing and Unlearning
-
This paper proposes KnowledgeSmith, which unifies "knowledge editing" and "machine unlearning" into a single constrained optimization problem. By using knowledge graphs (KG) to automatically generate large-scale evaluation benchmarks across different hierarchies (root/intermediate/leaf) and data scales, the study systematically reveals counter-intuitive phenomena in LLM knowledge updates, such as propagation asymmetry, consistency-capacity trade-offs, and subject dependency.
- MobiEdit: Resource-efficient Knowledge Editing for Personalized On-device LLMs
-
MobiEdit replaces the resource-heavy backpropagation in the classic locate-and-edit knowledge editing (ROME) with "quantization + forward zeroth-order gradient estimation," coupled with two system optimizations: early stopping and prefix activation reuse. This allows real-time knowledge editing for 3B LLMs to run on commercial smartphone NPUs for the first time, reducing memory by 7.1×, energy consumption by 15.8×, and latency by 3.4×.
- MoEEdit: Efficient and Routing-Stable Knowledge Editing for Mixture-of-Experts LLMs
-
MoEEdit is the first "routing-stable" parameter-modifying knowledge editing framework for MoE LLMs. It employs "per-expert null space projection" to ensure that edits do not perturb input manifolds for downstream routers, combined with a stochastic Block Coordinate Descent (BCD) solver so that computational cost scales with the expert hidden dimension rather than with the total number of experts. This achieves high editing success rates, strong generalization, and routing stability on sparse architectures simultaneously.
Browse all 15 Knowledge Editing papers →
💬 LLM (Other) (56)¶
- Achieving Olympia-Level Geometry Large Language Model Agent via Complexity Boosting Reinforcement Learning
-
This paper introduces InternGeometry—the first geometry LLM agent to achieve medal-winning performance. By treating a symbolic engine as a tool, it utilizes an ultra-long-range "Think-Construct/Propose-Verify-Reflect" interaction (over 200 steps per problem) to overcome the lack of heuristics in auxiliary line construction. Combined with Complexity-Boosting RL (CBRL) to progressively increase the difficulty of synthetic problems, it solves 44 out of 50 geometry problems from IMO 2000–2024 using only 13K training samples (0.004% of AlphaGeometry 2), exceeding the average score of gold medalists.
- Attend to the Active: Structure-Aware Dynamic Attention in LLMs for Compositional Instruction Following
-
ATA identifies the structural types of compositional instructions (Chaining/Branching/Paralleling) and decomposes mutually exclusive subtasks in a single forward pass without updating any parameters. By dynamically identifying the currently "active" subtasks at each generation step and masking the "dormant" ones using attention bias, it eliminates interference between subtasks and significantly improves LLM faithfulness to complex compositional instructions.
- Best-of-∞: Asymptotic Performance of Test-Time LLM Ensembling
-
This paper frames majority voting as repeated sampling from a model's answer distribution, investigating the limit accuracy as the number of samples \(N\to\infty\) (termed best-of-∞). It utilizes Bayes factors for adaptive stopping to approximate this limit within finite budgets and formalizes the "optimal weights for ensembling multiple LLMs" as a Mixed-Integer Linear Programming (MILP) problem, demonstrating that ensembling consistently outperforms any single model.
- Beyond Magic Words: Sharpness-Aware Prompt Evolving for Robust Large Language Models with TARE
-
The authors port "Sharpness-Aware Minimization (SAM)" from image/weight space to the discrete textual prompt space, proposing TARE/ATARE: a gradient-free evolutionary framework that "finds the worst paraphrase in the inner layer and selects the most stable neighborhood in the outer layer." This ensures optimized prompts maintain performance under synonymous rewrites, consistently outperforming TextGrad / Revolve across 4 reasoning benchmarks and 5 evaluated models.
- Beyond the Known: An Unknown-Aware Large Language Model for Open-Set Text Classification
-
This paper proposes UnLLM, which reformulates Open-Set Text Classification (OSTC) from "closed-set training + post-hoc OOD detection" into a partition-conditional classification task. By providing LLMs with partial label subsets and explicitly marking samples outside the candidates as "unknown," and employing a three-level "representation-probability-inference" optimization, the model consistently outperforms SOTA in K-F1 / N-F1 across six benchmarks.
- BOTS: A Unified Framework for Bayesian Online Task Selection in LLM Reinforcement Finetuning
-
This paper proposes BOTS, a unified framework for online task selection in LLM reinforcement finetuning based on Bayesian inference. By fusing explicit evidence (historical pass rates from direct evaluation) and implicit evidence (inferred difficulty of unevaluated tasks via reference model interpolation) with Thompson sampling for exploration-exploitation balance, BOTS achieves up to 50% training acceleration on math, code, and logic tasks with only 0.2% additional overhead.
- Breaking the Correlation Plateau: On the Optimization and Capacity Limits of Attention-Based Regressors
-
This paper provides the first theoretical analysis of the "PCC plateau" phenomenon in attention-based regression models during joint MSE+PCC training. It identifies the root causes as the conflict between MSE optimization and PCC gradients, as well as the capacity upper bound of softmax convex aggregation. It proposes the ECA (Extrapolative Correlation Attention) framework to break these limits using three components: scaled residual aggregation, dispersion-aware temperature softmax, and dispersion-normalized PCC loss.
- Cite Pretrain: Retrieval-Free Knowledge Attribution for Large Language Models
-
By utilizing "Active Indexing" during the continued pretraining phase to bidirectionally bind facts to document identifiers, LLMs can provide verifiable citations while answering in a closed-book setting without any external retrieval, improving citation precision by up to 30.2%.
- Compositional-ARC: Assessing Systematic Generalization in Abstract Spatial Reasoning
-
The authors propose the Compositional-ARC dataset to evaluate the systematic generalization capabilities of models in abstract spatial reasoning—specifically, generalizing from known basic geometric transformations (e.g., translation, rotation) to unseen combinations of these transformations. An MLC-trained encoder-decoder model with only 5.7M parameters reaches 78.26% on systematic tasks, matching the performance of the ARC Prize 2024 winner's 8B model + TTT, while significantly outperforming GPT-4o and o3-mini (<3%).
- Constrained Decoding of Diffusion LLMs with Context-Free Grammars
-
This paper proposes the first constrained decoding method to enforce Context-Free Grammar (CFG) constraints on Diffusion Language Models (DLMs). It abstracts the question of "whether a partial text with holes in any order can be completed into a legal string" as an additive infilling decision problem, reduces it to checking whether the intersection of the target CFG and the regular language of all possible completions is empty, and utilizes highly optimized emptiness checking algorithms to bring the theoretical cubic complexity into a practical range. It increases syntactic correctness to nearly 100% on C++, JSON, and SMILES while slightly improving functional correctness.
Browse all 56 LLM (Other) papers →
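The Best-of-∞ entry above treats majority voting as repeated sampling from a model's answer distribution. A minimal sketch of that framing follows: with enough samples, the majority vote settles on the most probable answer, so the limiting accuracy is the fraction of questions whose most probable answer is correct. The answer distributions, function names, and sample counts are hypothetical, and the Bayes-factor stopping rule and MILP weighting mentioned in the entry are not reproduced here.

```python
import random
from collections import Counter


def majority_vote(samples):
    """Most frequent answer among the samples."""
    return Counter(samples).most_common(1)[0][0]


def limit_answer(dist):
    """Answer the majority vote converges to as N grows: the mode of dist."""
    return max(dist, key=dist.get)


def limit_accuracy(dists, golds):
    """Fraction of questions whose most probable answer matches the gold answer."""
    return sum(limit_answer(d) == g for d, g in zip(dists, golds)) / len(golds)


def simulate_vote(dist, n, seed=0):
    """Majority vote over n answers sampled from dist with a fixed seed."""
    rng = random.Random(seed)
    answers = list(dist)
    weights = [dist[a] for a in answers]
    return majority_vote(rng.choices(answers, weights=weights, k=n))


# Hypothetical answer distributions for two questions.
q1 = {"42": 0.45, "41": 0.35, "40": 0.20}  # correct answer is the mode
q2 = {"7": 0.30, "8": 0.70}                # correct answer "7" is not the mode

print(simulate_vote(q1, n=10001), limit_answer(q1))
print(limit_accuracy([q1, q2], ["42", "7"]))
```

The second question shows why the limit is not automatically perfect: if the model's most probable answer is wrong, more samples only make the majority vote more reliably wrong.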
📖 NLP Understanding (2)¶
- LANE: Label-Aware Noise Elimination for Fine-Grained Text Classification
-
LANE upgrades the classic margin indicator for identifying mislabeled samples to a Label-aware Margin. For negative margins, the penalty is reduced if the mislabeled category is semantically similar to the model's prediction (e.g., "Anger" labeled as "Fear") and increased if they are semantically distant (e.g., "Trust" labeled as "Fear"). Based on this, it applies dynamic weighting to each sample rather than hard deletion, consistently outperforming strong baselines like AUM and HMW across 10 text classification datasets.
- What's the Plan? Metrics for Implicit Planning in LLMs and Their Application to Rhyme Generation and Question Answering
-
This paper proposes a mean activation difference steering method and accompanying quantitative metrics. Through two case studies, rhyme generation and question answering, it shows across 23 open models (1B-32B) that representations of target tokens (rhymes/answers) are formed at early sequence positions (forward planning) and causally influence the generation of intermediate tokens (backward planning). Implicit planning emerges in models as small as 1B, suggesting it is a universal mechanism rather than exclusive to large-scale models.
✍️ Text Generation (12)¶
- Antislop: A Comprehensive Framework for Identifying and Eliminating Repetitive Patterns in Language Models
-
Antislop treats "AI-typical repetitive phrases (slop) in LLM generation" as quantifiable, locatable, and erasable objects. It first maps model-specific "slop fingerprints" using frequency ratio statistics, then utilizes an inference-time backtracking sampler to precisely suppress these patterns. Finally, it automatically converts the sampler's interception records into preference data for the newly proposed FTPO fine-tuning, permanently welding the suppression capability into the weights—achieving a 90% reduction in slop with almost no performance degradation on GSM8K/MMLU/Creative Writing.
- Causal-Steer: Disentangled Continuous Style Control without Parallel Corpora
-
This paper proposes Causal-Steer: by treating LoRA as a "causal intervention," it computes the difference in activations with and without LoRA perturbations on the same input. This approach eliminates the need for parallel corpora and extracts a clean style vector. After PCA denoising and robust aggregation via geometric median, it achieves continuous, bidirectional, and linearly interpolatable style control for LLMs using a single scalar \(\alpha\) during inference.
- Diverse Text Decoding via Iterative Reweighting
-
This paper proposes OverRIDE (Reweighting-based Iterative DEcoding), which incrementally fine-tunes a "guidance model" using historical generated results during multi-round sampling. By suppressing the probabilities of tokens that lead to historical pattern recurrence, it significantly enhances diversity across multiple responses with minimal quality loss and can be integrated into serving systems like vLLM with only 6.4% throughput loss (72B).
- FS-DFM: Fast and Accurate Long Text Generation with Few-Step Diffusion Language Model
-
This paper proposes FS-DFM (Few-Step Discrete Flow-Matching), which reduces the sampling steps of discrete flow-matching language models from 1024 to 8 via step-aware training and a cumulative scalar update rule, achieving 128x acceleration while maintaining comparable perplexity and generation quality.
- Improving Attributed Long-form Question Answering with Intent Awareness
-
Addressing the issues of "poor citation quality and low readability" in long-form reports generated by deep research systems, this paper proposes a tag-based dual-layer intent (paragraph intent + citation intent) writing framework. This framework enhances Large Language Models (LLMs) via direct prompting during inference and distills Small Language Models (SLMs) using intent-aware synthetic data. Across three scientific report generation benchmarks, LLMs improved by an average of +2.9 points and SLMs by +12.3 points, with citation metrics showing particularly significant gains.
- Logit-KL Flow Matching: Non-Autoregressive Text Generation with Sampling-Mixing Inference
-
This paper uses "linear interpolation in logit space" (equivalent to the KL geodesic on the simplex) as the path for discrete flow matching. It proves that maximizing conditional likelihood exactly recovers the velocity field and introduces a "denoise-and-renoise" iterative sampler and hybrid inference scheme, significantly reducing perplexity and improving BLEU for non-autoregressive text/code generation.
- p-less Sampling: A Robust Hyperparameter-Free Approach for LLM Decoding
-
This paper proposes p-less sampling: a completely hyperparameter-free truncation decoding method. At each step, it uses the "collision probability" \(\sum_v P_\theta(v)^2\) of the entire token distribution as a dynamic truncation threshold. It outperforms methods like top-p and min-p in mathematics, logical reasoning, and creative writing, showing minimal degradation at high temperatures while offering faster inference.
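The truncation rule in the p-less entry can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes tokens with probability at or above the collision probability are kept and the rest are dropped before renormalizing, and the names `p_less_filter` and `p_less_sample` are hypothetical.

```python
import random


def p_less_filter(probs):
    """Truncate a next-token distribution at its collision probability.

    `probs` is a list of token probabilities that sums to 1.
    Returns (renormalized_probs, threshold).
    """
    # Collision probability: chance that two independent draws pick the same token.
    threshold = sum(p * p for p in probs)
    # In exact arithmetic sum(p^2) <= max(p); the clamp guards against rounding
    # so that the most likely token always survives.
    threshold = min(threshold, max(probs))
    kept = [p if p >= threshold else 0.0 for p in probs]
    total = sum(kept)
    return [p / total for p in kept], threshold


def p_less_sample(probs, rng=random):
    """Sample a token index from the truncated, renormalized distribution."""
    filtered, _ = p_less_filter(probs)
    return rng.choices(range(len(filtered)), weights=filtered, k=1)[0]


peaked = [0.7, 0.2, 0.05, 0.05]  # confident step: threshold is high
flat = [0.25, 0.25, 0.25, 0.25]  # uncertain step: threshold is low

print(p_less_filter(peaked))
print(p_less_filter(flat))
```

On these toy inputs the peaked distribution keeps only its top token, while the flat one keeps all four, so the cutoff adapts to the shape of the distribution without a tunable `p`.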
- Planner Aware Path Learning in Diffusion Language Models Training
-
This paper points out the mismatch between the default "random unmasking paths" used in training and the actual "planner-guided paths" used during inference in masked diffusion language models. It proposes Planner-Aware Path Learning (PAPL), which reweights the masked diffusion loss using planner confidence to align training more closely with the inference path. This leads to steady quality improvements in protein sequence, text, and code generation.
- Rainbow Padding: Mitigating Early Termination in Instruction-Tuned Diffusion LLMs
-
This paper identifies a persistent "<eos> overflow" early termination issue in instruction-tuned diffusion language models, where longer allocated generation lengths lead to shorter or even collapsed answers (sequences of <eos>). The root cause is that <eos> serves as both a terminator and a padding token. The authors propose Rainbow Padding: retaining a single <eos> for actual termination while filling remaining positions with a deterministic cycle of \(K\) distinct padding tokens. Using only 7 tokens and single-epoch LoRA, it restores length robustness, improving LLaDA's accuracy on MATH from 0.6% to 32.6%.
- Rethinking Uncertainty Estimation in LLMs: A Principled Single-Sequence Measure
-
Starting from the framework of proper scoring rules, this work proves that the negative log-likelihood of the maximum probability output sequence (MSP) is a theoretically sound uncertainty measure. It proposes G-NLL—which approximates this measure using a single greedy decoding pass—matching or exceeding state-of-the-art (SOTA) sampling-based methods across multiple scenarios.
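The single-pass idea behind G-NLL can be illustrated with a toy model. The sketch below is an assumption-laden illustration rather than the paper's code: `next_token_probs` is a hypothetical callback returning a token-to-probability mapping for a given prefix, and greedy decoding is used as a stand-in for the most likely sequence, which it is not guaranteed to find.

```python
import math


def greedy_nll(next_token_probs, eos="<eos>", max_steps=20):
    """Greedy-decode once and accumulate the negative log-likelihood."""
    prefix, nll = [], 0.0
    for _ in range(max_steps):
        dist = next_token_probs(tuple(prefix))
        token = max(dist, key=dist.get)   # greedy choice at this step
        nll -= math.log(dist[token])      # add -log p(token | prefix)
        prefix.append(token)
        if token == eos:
            break
    return prefix, nll


def toy_model(prefix):
    """Hypothetical two-step model used only for this illustration."""
    table = {
        (): {"Paris": 0.9, "Lyon": 0.1},
        ("Paris",): {"<eos>": 0.8, "!": 0.2},
    }
    return table[prefix]


sequence, score = greedy_nll(toy_model)
print(sequence, score)
```

A higher score means the model assigns lower probability to its own greedy answer, which is the sense in which the value serves as an uncertainty estimate.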
Browse all 12 Text Generation papers →
🗣️ Dialogue Systems (10)¶
- AQuA: Toward Strategic Response Generation for Ambiguous Visual Questions
-
The authors propose AQuA, the first VQA dataset (7.2K samples) with fine-grained ambiguity levels (4 levels), defining optimal response strategies for each level (Direct Answer/Inference/Enumeration/Request Clarification). The study finds that GPT-5 and Gemini are overconfident, consistently providing direct answers to ambiguous questions. Conversely, a 3B model trained via SFT+GRPO can surpass the strategy adaptation capabilities of closed-source large models.
- ClarifyVC: Clarifying Ambiguous Commands in Vehicle Control with a Hybrid Data Augmentation Pipeline
-
ClarifyVC employs an agent-orchestrated four-stage data augmentation pipeline to "grow" a large volume of ambiguity-rich and protocol-compliant single/multi-turn dialogues from 20,000 real in-vehicle commands. Accompanied by a three-tier evaluation protocol and a Data Quality Score (DQS), fine-tuning on this data improves parsing accuracy by ~15%, ambiguity resolution by ~20%, and achieves 98% protocol compliance for in-vehicle voice commands.
- Codified Finite-state Machines for Role-playing
-
Addressing the issue where LLMs in role-playing only mimic surface-level actions but fail to remember a character's "internal state," this paper proposes automatically compiling character profiles into executable Finite State Machines (CFSM). It uses code to explicitly record character states and transition rules, further extending this to CPFSM for modeling states via probability distributions. On both synthetic validation and Fandom real-plot benchmarks, it demonstrates superior coherence and interpretability compared to prompt-only state modeling baselines.
- DRIFT: Learning from Abundant User Dissatisfaction in Real-World Preference Learning
-
DRIFT treats the abundant but implicit "user dissatisfaction" (DSAT) from real-world deployments as high-quality negative anchors. Positive samples are dynamically sampled from the current policy, and iterative training is performed using standard DPO. Without requiring human annotations, reward models, or positives generated by stronger models, it enables a 14B model to outperform GPT-4o-mini on WildBench.
- Dropping Just a Handful of Preferences Can Change Top Large Language Model Rankings
-
This paper proposes an extremely fast robustness test: on LLM leaderboards based on the Bradley–Terry model (such as Chatbot Arena), removing a tiny worst-case subset (as few as 2 preferences or 0.003%) of human evaluations can change the top-ranked model. The method precisely identifies which specific preferences cause the flip.
- Flipping the Dialogue: Training and Evaluating User Language Models
-
This paper "flips" the dialogue: instead of training LLMs to be better assistants, it post-trains a User Language Model (User LM) to simulate real human users. The model is then used to expose the weaknesses of assistant LMs in realistic multi-turn scenarios (dropping GPT-4o's task success rate from 74.6% to 57.4%).
- Non-Collaborative User Simulators for Tool Agents
-
Based on marketing research, this paper defines four types of non-collaborative user behaviors (unavailable service, tangential chat, impatience, and incomplete utterances) and constructs a simulation framework that maintains goal-alignment. Evaluations on MultiWOZ and \(\tau\)-bench systematically expose behavior-specific failure mechanisms in SOTA tool agents—tangential chat leads to an average SR drop of 29.1%, with different models exhibiting distinct collapse paths (the GPT series falls into repetitive helper API calls, while the Qwen series tends to hallucinate API results).
- ReIn: Conversational Error Recovery with Reasoning Inception
-
This paper proposes Reasoning Inception (ReIn), a test-time intervention method requiring no modification to model parameters or system prompts. By employing an external inception module to detect conversational errors and inject recovery plans into the task agent's reasoning chain, it significantly improves task completion rates across various error scenarios and generalizes to unseen error types.
- Think-While-Generating: On-the-Fly Reasoning for Personalized Long-Form Generation
-
FlyThinker proposes an efficient "think-while-generating" framework that utilizes an independent Reasoner to generate latent reasoning signals at the token level in parallel. These signals are dynamically integrated into the Generator to guide personalized long-form generation while maintaining training and inference efficiency.
- Understanding Language Prior of LVLMs by Contrasting Chain-of-Embedding
-
By comparing layer-wise hidden representations (chain-of-embedding) with and without visual input, this study identifies a "Visual Integration Point" (VIP) layer in LVLMs and proposes the Total Visual Integration (TVI) metric to quantify the strength of the language prior.
🌐 Multilingual & Translation (8)¶
- ASSESS: A Semantic and Structural Evaluation Framework for Statement Similarity
-
This paper proposes the ASSESS framework, centered on the TransTED Similarity metric. By parsing formal mathematical statements into Operator Trees (OPT) and integrating Lean proof tactic-driven semantic transformations into the standard Tree Edit Distance (TED), the method achieves SOTA performance with 70.16% accuracy and a 0.35 Kappa score on the EPLA benchmark, while remaining reproducible using only CPU resources.
- ATLAS: Adaptive Transfer Scaling Laws for Multilingual Pretraining, Finetuning, and Decoding the Curse of Multilinguality
-
This paper proposes the Adaptive Transfer Scaling Law (ATLAS), which decomposes the effective data volume into three terms: target language, transfer languages, and other languages, while introducing a data repetition saturation function. Validated across 774 multilingual training experiments (10M–8B parameters, 400+ languages), ATLAS significantly outperforms existing scaling laws (multilingual \(R^2\) improved from 0.67 to 0.98). It systematically quantifies the cross-lingual transfer matrix, the capacity constraints of the "curse of multilinguality," and the compute crossover point between pretraining and finetuning.
- DiscoX: Benchmarking Discourse-Level Translation in Expert Domains
-
DiscoX constructs the first benchmark for discourse-level + expert-level ZH-EN translation (200 articles, average 1712 tokens, 7 domains, 1330 person-hours of manual refinement) and proposes a multi-agent reference-free evaluation system Metric-S, revealing a significant gap where even the strongest LLM (GPT-5-high: 76.66) still lags behind human experts (80.16).
- From Utterance to Vividity: Training Expressive Subtitle Translation LLM via Adaptive Local Preference Optimization
-
This paper proposes ALPO (Adaptive Local Preference Optimization) for training expressive subtitle translation LLMs. Empirical findings show that subtitle translation favors free translation and that reasoning-based LLMs outperform chat-based LLMs in paraphrasing capability. After verifying that LLMs as translation evaluators are highly consistent with humans, the authors propose a fine-grained process-supervised preference alignment method (adaptive weighting + dynamic beta + prefix mixing). The 14B model exceeds SOTA models like GPT-4o and DeepSeek-R1 in translation vividness across multiple language directions.
- Language Confusion Gate: Language-Aware Decoding Through Model Self-Distillation
-
This paper proposes the Language Confusion Gate (LCG): a lightweight two-layer MLP that masks tokens from incorrect language families on-demand during decoding without modifying the base LLM. Trained via "norm-calibrated self-distillation," it reduces language confusion rates by approximately an order of magnitude across multiple models without sacrificing task performance.
- LinguaMap: Which Layers of LLMs Speak Your Language and How to Tune Them?
-
By utilizing logit lens and hidden state similarity analysis, this work localizes the final few layers responsible for "language control" in mLLMs. Fine-tuning only these 3-5% of parameters increases language consistency across six languages from <20% to over 98%, achieving performance nearly equivalent to full fine-tuning.
- Multilingual Routing in Mixture-of-Experts
-
This paper systematically analyzes multilingual routing patterns in MoE large language models, discovering that middle layers contain cross-lingually shared experts and that linguistic performance is strongly correlated with alignment to English routing. Based on this, an inference-time routing intervention method is proposed to activate English task-specific experts in middle layers, consistently improving multilingual performance by 1-2% across 3 models, 2 tasks, and 15+ languages.
- SASFT: Sparse Autoencoder-guided Supervised Finetuning to Mitigate Unexpected Code-Switching in LLMs
-
Using Sparse Autoencoders (SAEs), this paper finds that unexpected code-switching in LLMs is correlated with abnormally high pre-activation values of target language features. It proposes SASFT, a method that constrains target language feature pre-activations during SFT, reducing unexpected code-switching by more than 50%.
🔍 Information Retrieval & RAG (81)¶
- A Dense Subset Index for Collective Query Coverage
-
DISCO models "multiple documents collaboratively covering a complex query" as a monotonic submodular coverage objective. Through vector augmentation and random projection, it rewrites the marginal gain of each greedy iteration into an indexable inner product form. This enables a modified multi-vector IVF index to approximate greedy solutions in sublinear time, achieving over \(100\times\) speedup compared to standard greedy algorithms while providing higher coverage than traditional IR indices.
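For reference, the standard greedy procedure that DISCO is described as approximating can be sketched as follows. This is only the textbook greedy baseline for a coverage objective, under the simplifying assumption that each document covers a set of discrete query aspects; it does not reproduce DISCO's vector augmentation, random projection, or IVF index, and the names here are hypothetical.

```python
def greedy_cover(doc_aspects, k):
    """Pick up to k documents, each time taking the largest marginal coverage gain.

    `doc_aspects` maps a document id to the set of query aspects it covers.
    """
    covered, chosen = set(), []
    for _ in range(k):
        candidates = [d for d in doc_aspects if d not in chosen]
        if not candidates:
            break
        # Marginal gain = number of aspects not yet covered.
        best = max(candidates, key=lambda d: len(doc_aspects[d] - covered))
        if not doc_aspects[best] - covered:
            break  # no remaining document adds anything new
        chosen.append(best)
        covered |= doc_aspects[best]
    return chosen, covered


docs = {
    "d1": {"a", "b", "c", "d"},
    "d2": {"d", "e"},
    "d3": {"e", "f", "g"},
    "d4": {"a"},
}
chosen, covered = greedy_cover(docs, k=2)
print(chosen, sorted(covered))
```

Each iteration scans every remaining document, which is the per-step cost that an indexable inner-product formulation is meant to avoid.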
- AdaCache: Adaptive Caching and Context Augmentation for Efficient LLM Serving
-
AdaCache addresses two types of waste in RAG inference—redundant recomputation of the same text chunks and the uniform provision of top-k contexts regardless of query difficulty. It proposes "Hierarchical Caching + Attention-aware Selective Recomputation" and "Confidence-driven Adaptive Context Augmentation," reducing Time to First Token (TTFT) by 1.4\(\times\) to 5.0\(\times\) compared to state-of-the-art RAG caching systems across six datasets and three models while maintaining generation quality.
- AMemGym: Interactive Memory Benchmarking for Assistants in Long-Horizon Conversations
-
This paper introduces AMemGym, the first on-policy interactive evaluation benchmark for long-horizon conversation memory. By utilizing structured data sampling (User Persona → State Evolution → Personalized QA), it drives LLMs to simulate users in role-play scenarios. The study reveals ranking bias issues in traditional off-policy evaluations and systematically diagnoses the "write/read/utilization" three-stage failure modes in RAG, long-context, and Agent memory systems.
- AssoMem: Scalable Memory QA with Multi-Signal Associative Retrieval
-
AssoMem constructs a "clue-utterance" associative memory graph for large-scale personal memory QA and adaptively fuses three signals—relevance, importance, and temporal alignment—using mutual information for ranking. It significantly outperforms SOTA models based solely on semantic similarity in both retrieval and generation across multiple benchmarks.
- Attributing Response to Context: A Jensen-Shannon Divergence Driven Mechanistic Study of Context Attribution in Retrieval-Augmented Generation
-
The authors propose ARC-JSD, a method that achieves efficient and precise RAG context attribution without fine-tuning, gradient computation, or surrogate models by calculating the Jensen-Shannon Divergence (JSD) of response distributions between full and ablated contexts. Combined with Logit Lens for mechanistic analysis, it identifies attention heads and MLP layers responsible for attribution, reducing hallucination rates by approximately 39% through gating operations.
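The core quantity in the ARC-JSD entry, a Jensen-Shannon divergence between response distributions with and without a piece of context, can be sketched as below. The toy distributions, the chunk names, and the choice to sum the divergence over response tokens are assumptions made for illustration; only the JSD formula itself is standard.

```python
import math


def kl(p, q):
    """KL divergence KL(p || q) for two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jsd(p, q):
    """Jensen-Shannon divergence: symmetric and bounded by ln 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def attribution_scores(full, ablated_by_chunk):
    """Score each context chunk by how much removing it shifts the response.

    `full` is a list of per-token distributions under the full context;
    `ablated_by_chunk` maps a chunk id to the same list with that chunk removed.
    """
    return {
        chunk: sum(jsd(p, q) for p, q in zip(full, dists))
        for chunk, dists in ablated_by_chunk.items()
    }


full = [[0.9, 0.1]]                 # one response token, two-word vocabulary
ablated = {
    "chunk_a": [[0.5, 0.5]],        # removing chunk_a changes the prediction
    "chunk_b": [[0.9, 0.1]],        # removing chunk_b changes nothing
}
scores = attribution_scores(full, ablated)
print(scores)
```

The chunk whose removal moves the distribution most receives the highest score, and no gradients or fine-tuning are needed to compute it.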
- Attribution-Guided Decoding
-
This paper proposes the AGD decoding strategy, which at each generation step selects, from the high-probability candidate tokens, the one with the highest attribution score with respect to a user-specified "Region of Interest" (ROI). This transforms attribution methods from passive analysis tools into active generation steering tools, achieving significant improvements in both instruction following and factuality tasks.
- Automated Formalization via Conceptual Retrieval-Augmented LLMs
-
CRAMF automatically constructs a "concept–definition" knowledge base from Mathlib4, utilizing query augmentation, dual-channel hybrid retrieval, and reranking to provide precise formal definitions for LLM-based autoformalizers. It serves as a plug-and-play plugin that yields an average relative gain of 29.9% in translation accuracy, with gains of up to 62.1%.
- Bayesian Attention Mechanism: A Probabilistic Framework for Positional Encoding and Context Length Extrapolation
-
The paper reformulates positional encoding as a prior distribution within a Bayesian attention mechanism, unifying NoPE (uniform prior) and ALiBi (Laplace prior). It proposes a Generalized Gaussian Prior (GGD-BAM) that achieves perfect passkey retrieval at 500x training length with an addition of only 384 parameters.
- Beyond RAG vs. Long-Context: Learning Distraction-Aware Retrieval for Efficient Knowledge Grounding
-
This paper proposes LDAR (Learning Distraction-Aware Retrieval), a lightweight adaptive retriever that learns to select a continuous band of passages based on query-passage similarity distributions. By balancing information coverage against the impact of distracting passages, it outperforms long-context methods using approximately half the token budget.
- Beyond Sequential Reranking: Reranker-Guided Search Improves Reasoning Intensive Retrieval
-
This paper replaces the rigid "top-k sequential scan" in the "retrieve-and-rerank" pipeline with a greedy search on a document similarity proximity graph (Reranker-Guided-Search, RGS). By prioritizing documents whose neighbors have already received high scores, RGS improves NDCG@10 by 3.5, 2.9, and 5.1 points on BRIGHT, FollowIR, and M-BEIR benchmarks respectively, under a strict budget of 100 reranker calls per query.
Browse all 81 Information Retrieval & RAG papers →
💻 Code Intelligence (59)¶
- A Problem-Oriented Perspective and Anchor Verification for Code Optimization
-
The paper proposes a problem-oriented (rather than user-oriented) approach to construct optimization pairs to integrate strategic diversity from multiple programmers. It also designs an anchor verification framework that utilizes "slow but correct code" to generate test cases, mitigating the "optimization tax" (correctness loss), thereby increasing the optimization ratio from 31.24% to 71.06% and the speedup from 2.95x to 6.08x.
- AetherCode: Evaluating LLMs' Ability to Win In Premier Programming Competitions
-
AetherCode is the first code reasoning benchmark to systematically collect 456 high-difficulty problems from premier programming competitions such as IOI and ICPC. It utilizes a hybrid approach of "automated generation + manual annotation by 67 experts" to achieve 100% TPR / 100% TNR for test cases. Results indicate that even the strongest model, o4-mini-high, achieves only a 35.5% Pass@1, debunking the illusion that "LLMs have conquered competitive programming."
- Agnostics: Learning to Synthesize Code in Any Programming Language with a Universal Reinforcement Learning Environment
-
Using "standard program input/output behavior" as the unified scoring criterion, the authors develop a language-agnostic code execution sandbox and GRPO training framework. This enables RL post-training for any low-resource programming language with only 4-5 lines of YAML configuration, elevating the performance of Qwen-3 4B on Lua, Julia, R, OCaml, and Fortran to levels comparable with 16B–70B models.
- Ambig-SWE: Interactive Agents to Overcome Underspecificity in Software Engineering
-
The authors construct Ambig-SWE (an underspecified variant based on SWE-Bench Verified) to systematically evaluate the interaction capabilities of LLM programming agents across three dimensions: detecting underspecification, asking clarifying questions, and utilizing interactive information. They find that interaction can improve resolution rates in underspecified scenarios by up to 74%, yet models default to non-interactive behavior and struggle to distinguish between sufficient and underspecified instructions.
- An Agentic Framework with LLMs for Solving Complex Vehicle Routing Problems
-
AFL decomposes "using LLMs to solve complex Vehicle Routing Problems (VRP)" into three subtasks: problem description, code generation, and solution derivation. It utilizes four specialized agents (Generation, Judgement, Revision, and Error Analysis) to oversee each other, automatically producing a self-contained Python solver from raw VRPLIB instances. Across 60 VRP variants, AFL reduces the runtime error rate to 0%, achieves a 100% feasible solution rate, and maintains an optimality gap mostly within 3% compared to manually designed algorithms.
- ATGen: Adversarial Reinforcement Learning for Test Case Generation
-
ATGen places a "test case generator" and an "adversarial code generator" into a competitive reinforcement learning loop. As the generator strengthens, the opponent is forced to produce more subtle bugs. This self-escalating dynamic curriculum breaks the "fixed-difficulty ceiling" of static datasets, doubling the attack success rate of a 7B model compared to the SFT-based UTGen (36.99% vs 16.24%).
- Behavioral Embeddings of Programs: A Quasi-Dynamic Approach for Optimization Prediction
-
To address the dilemma of "static representations being too rigid and dynamic profiling being too expensive" in compiler optimization, this paper proposes a quasi-dynamic program representation. By "probing" LLVM IR with a set of optimization sequences, the changes in static features before and after optimization are quantified as a Program Behavior Spectrum. Product Quantization (PQ) is then used to discretize continuous reaction vectors into structured "sub-words," and a multi-task Transformer (PQ-BERT) is pre-trained to learn their syntax. This approach significantly outperforms static embeddings like inst2vec and IR2Vec in Best Pass Prediction and -Oz Benefit Prediction tasks.
- BOAD: Discovering Hierarchical Software Engineering Agents via Bandit Optimization
-
BOAD reformulates the design of a hierarchical multi-agent system for software engineering as a multi-armed bandit (MAB) problem. Each candidate sub-agent is treated as an arm, and the reward is its "helpfulness" within team collaboration. It employs UCB for exploration-exploitation, uses the Chinese Restaurant Process (CRP) to dynamically expand the agent archive, and applies hindsight credit assignment to avoid the "free-rider" problem. This approach automatically discovers a structure consisting of "one orchestrator + two specialized sub-agents" under a limited evaluation budget. On SWE-bench-Verified, a 36B model achieved 53.2%; on the more out-of-distribution SWE-bench-Live, it reached 20.0%, ranking second on the leaderboard and outperforming larger models like GPT-4o and Claude 3.7.
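The exploration-exploitation step mentioned in the BOAD entry can be illustrated with the classic UCB1 rule. This is a generic sketch, not BOAD's actual scoring: the summary does not state which UCB variant is used, and the arm names, the reward values, and the exploration constant `c` are all hypothetical.

```python
import math


def ucb_select(stats, c=math.sqrt(2)):
    """Choose the arm with the highest UCB1 score.

    `stats` maps an arm name to (pulls, total_reward).
    """
    # Try every arm once before trusting any estimate.
    for arm, (pulls, _) in stats.items():
        if pulls == 0:
            return arm
    total_pulls = sum(pulls for pulls, _ in stats.values())

    def score(arm):
        pulls, reward = stats[arm]
        mean = reward / pulls
        bonus = c * math.sqrt(math.log(total_pulls) / pulls)
        return mean + bonus

    return max(stats, key=score)


stats = {
    "locator": (10, 6.0),   # well explored, mean helpfulness 0.6
    "patcher": (10, 4.0),   # well explored, mean helpfulness 0.4
    "tester": (1, 0.5),     # barely explored, large uncertainty bonus
}
print(ucb_select(stats))
```

The rarely tried arm wins here because its uncertainty bonus outweighs the higher observed mean of the others, which is how a limited evaluation budget gets spread across candidate sub-agents.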
- CARD: Towards Conditional Design of Multi-agent Topological Structures
-
CARD proposes a conditional graph generation framework (Conditional Agentic Graph Designer) that utilizes a conditional variational graph encoder and environment-aware optimization to adaptively design multi-agent communication topologies based on dynamic environmental signals—such as model capabilities, tool availability, and knowledge source changes—consistently outperforming static and prompt-based baselines on HumanEval, MATH, and MMLU.
- Code2Bench: Scaling Source and Rigor for Dynamic Benchmark Construction
-
To address the persistent issues of "static sources prone to contamination" and "superficial testing" in code generation evaluation, this paper proposes the Dual Scaling philosophy. It dynamically extracts problems from real-world repositories based on model knowledge cutoff dates (Scaling the Source) and automatically generates high-rigor test suites using Property-Based Testing (PBT) coupled with a 100% branch-coverage "Great Filter" (Scaling the Rigor). The instantiated end-to-end framework, Code2Bench, produces a benchmark (Code2Bench-2509) featuring native Python and Java instances, providing fine-grained diagnostics for 10 mainstream LLMs.
Browse all 59 Code Intelligence papers →
🎨 Image Generation (357)¶
- A Hidden Semantic Bottleneck in Conditional Embeddings of Diffusion Transformers
-
The first systematic analysis of conditional embeddings in Diffusion Transformers reveals extreme angular similarity (inter-class cosine similarity >99%) and dimensional sparsity (only 1-2% of dimensions carry semantic information). Generation quality remains largely unchanged after pruning 2/3 of low-magnitude dimensions, uncovering a hidden semantic bottleneck in conditional embeddings.
- A Noise is Worth Diffusion Guidance
-
This paper proposes NoiseRefine: instead of modifying the diffusion model itself, it trains a lightweight network to "refine" random Gaussian noise into a structured noise. This enables generating images with quality close to CFG guidance using only a single forward pass without any sampling guidance, thereby eliminating the overhead of dual forward passes per step.
- A Physics-Inspired Optimizer: Velocity Regularized Adam
-
This paper proposes VRAdam (Velocity-Regularized Adam), which translates a physical stability mechanism, the "quartic kinetic energy term," into a global dynamic learning rate that automatically contracts with velocity: \(\eta_t=\alpha_0/(1+\min(\beta_3\|v_t\|^2,\alpha_1))\). Embedded into AdamW, it automatically decelerates when weight updates are too large, suppressing oscillations near the Edge of Stability. Complemented by rigorous Lyapunov stability and \(O(\ln N/\sqrt N)\) convergence proofs, it consistently outperforms AdamW across image classification, language modeling, GFlowNets, GPT-2 pre-training, and LLM fine-tuning.
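The learning-rate formula quoted in the VRAdam entry is easy to evaluate directly. The sketch below only computes \(\eta_t\) from that formula; the default values for `alpha0`, `beta3`, and `alpha1` are hypothetical, and the velocity is passed in as a plain list because the summary does not specify how \(v_t\) is maintained inside the optimizer.

```python
def vradam_lr(velocity, alpha0=1e-3, beta3=1.0, alpha1=19.0):
    """Evaluate eta_t = alpha0 / (1 + min(beta3 * ||v_t||^2, alpha1))."""
    sq_norm = sum(v * v for v in velocity)
    return alpha0 / (1.0 + min(beta3 * sq_norm, alpha1))


# The rate starts at alpha0 when the velocity is zero, shrinks as the
# velocity grows, and bottoms out at alpha0 / (1 + alpha1).
for v in ([0.0, 0.0], [1.0, 0.0], [3.0, 4.0], [100.0, 0.0]):
    print(v, vradam_lr(v))
```

The `min` clamp means the rate can never drop below `alpha0 / (1 + alpha1)`, so large velocities slow the update down without stalling it.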
- A Probabilistic Hard Concept Bottleneck for Steerable Generative Models
-
This paper reformulates the concept bottleneck in generative models into a probabilistic hard binary concept layer, VHCB. This allows users to directly sample images from specified concepts or perform interventions on existing generations. Systematic validation on StyleGAN2 and DDPM demonstrates superior steerability and reduced concept leakage compared to soft concept bottlenecks.
- AC-Sampler: Accelerate and Correct Diffusion Sampling with Metropolis-Hastings Algorithm
-
AC-Sampler truncates the diffusion generation process at an intermediate timestep, generates candidates using a score-based Langevin proposal, and applies Metropolis-Hastings (MH) acceptance rates to correct them toward the true marginal distribution. This simultaneously reduces NFE and improves FID without fine-tuning the base model.
- ACCORD: Alleviating Concept Coupling through Dependence Regularization for Text-to-Image Diffusion Personalization
-
ACCORD formalizes "concept coupling" (entanglement between subjects and contexts) in text-to-image personalization as a statistical dependence problem for the first time. It decomposes the total dependence discrepancy into two computable sources: "denoising dependence discrepancy" and "prior dependence discrepancy," eliminating them via two plug-and-play regularization losses (DDLoss + PDLoss). This improves both text controllability and fidelity across subject, style, and face personalization.
- Adapting Self-Supervised Representations as a Latent Space for Efficient Generation
-
RepTok fine-tunes the [cls] token of a pre-trained self-supervised ViT into a "single continuous token" latent space. Combined with a flow matching decoder for high-fidelity reconstruction and a non-attention MLP-Mixer for generation in this 1D space, it achieves competitive FIDs on ImageNet/MS-COCO using less than 10% of the training compute compared to competitors.
- AEGIS: Adversarial Target-Guided Retention-Data-Free Robust Concept Erasure from Diffusion Models
-
AEGIS transforms the "erasure target" from manually selected fixed safety words to an Adversarial Erasure Target (AET) that iteratively approaches the semantic center of the concept. It further employs Gradient Rectification with Projection (GRP) to maintain generation quality without requiring retention data, effectively minimizing adversarial prompt attack success rates with negligible performance loss.
- AlignFlow: Improving Flow-based Generative Models with Semi-Discrete Optimal Transport
-
AlignFlow utilizes Semi-Discrete Optimal Transport (SDOT) to pre-calculate a deterministic "noise distribution \(\rightarrow\) full dataset" alignment mapping before training. This serves as a plug-and-play coupling for various flow generative models, achieving straighter trajectories, faster convergence, and comprehensive FID reductions with less than 1% additional overhead.
- Aligning Visual Foundation Encoders to Tokenizers for Diffusion Models
-
This paper proposes AlignTok: instead of training a VAE from scratch or forcing it to learn semantics via "semantic regularization," it transforms a semantically-rich pre-trained visual foundation encoder (DINOv2) into a continuous tokenizer through a three-stage progressive alignment. This yields a latent space that is both semantically well-structured and capable of precise reconstruction; on ImageNet 256×256, it allows the diffusion model to reach a gFID of 1.90 in just 64 epochs, achieving a convergence speed approximately 5× faster than VA-VAE.
Browse all 357 Image Generation papers →
🎬 Video Generation (98)
- 3D Scene Prompting for Scene-Consistent Camera-Controllable Video Generation
-
This paper proposes 3DScenePrompt, which utilizes dual spatio-temporal conditions—"temporally adjacent frames + projected views from a static 3D point cloud"—to extend future videos from any length of input video, maintaining scene consistency with the entire history while achieving precise camera control.
- AdaViewPlanner: Adapting Video Diffusion Models for Viewpoint Planning in 4D Scenes
-
By treating a pre-trained text-to-video (T2V) diffusion model as a "virtual cinematographer," this work implements a two-stage paradigm—first generating videos with implicit professional camera movements based on 4D human actions, and then explicitly extracting the viewpoint via a camera extrinsic diffusion branch—achieving automatic camera trajectory planning in 4D scenes with open-domain generalization and text controllability significantly exceeding specialized models.
- Anchor Frame Bridging for Coherent First-Last Frame Video Generation
-
To address semantic decay and visual collapse in the intermediate frames of First-Last Frame Video Generation (FLF2V), this paper proposes a training-free Anchor Frame Bridging (AFB) method. By adaptively inserting an "anchor frame" at the point of most severe temporal rupture to "relay" semantics from start to end, AFB achieves a 16.58% improvement in FVD and 10.21% in PSNR on Wan2.1-I2V.
- Animating the Uncaptured: Humanoid Mesh Animation with Video Diffusion Models
-
To be added after deeper reading.
- Any-to-Bokeh: Arbitrary-Subject Video Refocusing with Video Diffusion Model
-
Any-to-Bokeh models video refocusing/bokeh rendering as a single-step video diffusion process guided by a focal-plane adaptive MPI geometric prior. It allows users to freely specify the focal plane and blur intensity for any input video, addressing temporal flickering via three-stage progressive training and weighted overlapping inference, outperforming previous image/MPI bokeh methods on both synthetic and real-world data.
- Arbitrary Generative Video Interpolation
-
ArbInterp proposes a generative video frame interpolation framework that supports arbitrary timestamps and lengths. It achieves precise temporal control via Timestamp-aware Rotary Positional Encoding (TaRoPE) and enables seamless splicing of long sequences through an appearance-motion decoupled conditioning strategy.
- Astraea: A Token-wise Acceleration Framework for Video Diffusion Transformers
-
Addressing the inference bottlenecks of Video Diffusion Transformers, Astraea proposes a framework comprising token-wise selection, GPU-friendly sparse attention, and evolutionary token budget search, achieving up to 2.4× acceleration on a single GPU and up to 13.2× in multi-GPU scenarios while maintaining generation quality.
- AUHead: Realistic Emotional Talking Head Generation via Action Units Control
-
AUHead decomposes the "audio \(\to\) emotional video" generation problem into two stages: first, an audio language model (ALM) "perceives emotion" from speech and infers a discrete sequence of Facial Action Units (AUs); then, an AU-driven controllable diffusion model renders these AUs into talking head videos that are both lip-synchronized and nuanced in expression. It outperforms existing methods in emotional realism and lip-sync accuracy on MEAD/CREMA.
- Beyond Skeletons: Learning Animation Directly from Driving Videos with Same2X Training Strategy
-
This paper proposes DirectAnimator, which discards intermediate representations like skeletons or pose estimation. Instead, it animates a reference portrait directly using the raw pixels of driving videos. The method extracts a "Driving Cue Triplet" (Pose/Face/Location) from the original video and injects these cues into the denoising process via a CueFusion DiT Block. Coupled with a Same2X training strategy that aligns cross-ID features to a same-ID model, the system achieves SOTA performance on TikTok and Unseen datasets with 6.7× faster convergence and lower computational costs.
- BindWeave: Subject-Consistent Video Generation via Cross-Modal Integration
-
BindWeave replaces traditional shallow fusion mechanisms with a Multimodal Large Language Model (MLLM) to parse complex text instructions involving multiple subjects. It generates subject-aware hidden states as conditioning signals for DiT, combined with CLIP semantic features and VAE fine-grained appearance features, enabling high-fidelity and subject-consistent video generation.
Browse all 98 Video Generation papers →
🧩 Multimodal VLM (211)
- SR-3D: 3D-Aware Region Prompted Vision Language Model
-
SR-3D directly injects 3D positional encodings derived from depth estimation into the visual tokens of a 2D foundation VLM and employs a dynamic slice region extractor. This allows a single model to handle both single-view images and multi-view videos, supporting precise cross-frame 3D spatial reasoning by simply drawing a box or mask on any single frame, achieving SOTA performance across multiple 2D and 3D benchmarks.
- A-TPT: Angular Diversity Calibration Properties for Test-Time Prompt Tuning of Vision-Language Models
-
The A-TPT framework is proposed to promote angular diversity by maximizing the minimum pairwise angular distance of normalized text features on the unit hypersphere. This addresses the miscalibration issue caused by overconfident VLM predictions in Test-time Prompt Tuning (TPT), outperforming existing TPT calibration methods on both natural distribution shifts and medical datasets.
- A High Quality Dataset and Reliable Evaluation for Interleaved Image-Text Generation
-
Addressing the pain points of scarce training data and unreliable evaluation for interleaved image-text generation in unified Large Multimodal Models (LMMs), this paper introduces InterSyn, a large-scale dataset with 1.8 million samples and 3,500 topics featuring automated quality control (SEIR iterative refinement). It also presents SynJudge, an evaluation model providing four-dimensional interpretable scores highly aligned with human judgment (95.4% A@1). Experiments show that fine-tuning with InterSyn significantly improves interleaved generation capabilities with only 25K–50K samples.
- ASCIIEval: Benchmarking Models' Visual Perception in Text Strings via ASCII Art
-
Using human-artist-drawn ASCII art as a carrier, this paper constructs ASCIIEval, a recognition benchmark where content is strictly equivalent in both text and image modalities. It systematically reveals multiple diagnostic findings: LLMs can "see" visual semantics from pure strings, open-source MLLMs face a trade-off between OCR and global visual perception, and current models fail to benefit from "text + image" dual-modality inputs.
- Asynchronous Matching with Dynamic Sampling for Multimodal Dataset Distillation
-
Addressing the "asynchronous optimization rhythms of image and text networks" in image-text dataset distillation, this paper proposes the AMD framework. It decouples the sampling origins of image and text expert trajectories for asynchronous trajectory matching, utilizes MMD to measure convergence speed differences to dynamically determine the sampling range for each modality, and replaces random initialization with semantic prototype mining. On Flickr30k and COCO, it significantly improves distilled retrieval performance with almost zero extra overhead (e.g., IR@1/@5/@10 improve by 4.5%/9.6%/10.9% under the Flickr30k 200-pair setting).
- AttTok: Marrying Attribute Tokens with Generative Pre-trained Vision-Language Models towards Medical Image Understanding
-
Addressing the challenge where generative medical multimodal large models lose discriminative power by encoding clinical attributes like "mild/severe DR" into nearly identical text tokens, this paper proposes Attribute Tokens (AttTok). By assigning a dedicated special token to each clinical concept and implementing a multimodal embedding book, Attribute-centric Cross-attention (ACC) adapter, and Attribute-centric Matching (ACM) loss, the authors explicitly inject discriminative medical knowledge into the generative paradigm. This approach achieves consistent performance gains across 5 classification benchmarks and 3 VQA benchmarks.
- BaseReward: A Strong Baseline for Multimodal Reward Model
-
Instead of inventing new architectures, this paper deconstructs the process of building a SOTA Multimodal Reward Model (MRM) into six dimensions: paradigm, reward head, regularization, data, backbone/scale, and ensemble. Through systematic ablation, it derives a clear "recipe" and builds BaseReward—a simple yet strong baseline based on Qwen2.5-VL-7B with a two-layer SiLU MLP reward head and selected mixed preference data. It sets new SOTAs on benchmarks like MM-RLHF-Reward Bench and VL-Reward Bench, while offering significantly faster inference than generative reward models.
- Bee: A High-Quality Corpus and Full-Stack Suite to Unlock Advanced Fully Open MLLMs
-
Addressing the pain points of "poor SFT data quality and lack of complex reasoning data" in fully open-source multimodal large models, this paper utilizes an automated data curation pipeline (HoneyPipe) to clean and enrich approximately 24 million raw image-text pairs into Honey-Data-15M, a high-quality dataset of 15 million samples with dual-layer CoT. By training on this dataset, the Bee-8B model achieves new SOTA among fully open-source MLLMs, matching or even surpassing the semi-open InternVL3.5-8B on several reasoning benchmarks.
- Benchmarking Large Vision-Language Models on Fine-Grained Image Tasks: A Comprehensive Evaluation
-
This paper constructs the first large-scale evaluation benchmark for fine-grained image tasks, FG-BMK (1.01 million questions, 280,000 images). It systematically interrogates 12 mainstream LVLMs/VLMs from two perspectives: "human-oriented dialogue" and "machine-oriented features." The study reveals how contrastive training paradigms, modality alignment, perturbation robustness, and hierarchical category reasoning influence fine-grained performance, discovering that LVLMs still significantly lag behind specialized models in fine-grained tasks.
- Bilateral Information-aware Test-time Adaptation for Vision-Language Models
-
To address the issue of Vision-Language Models (VLMs) like CLIP overfitting to atypical features during Test-time Adaptation (TTA) when using only a "fixed ratio of low-entropy samples," this paper proposes BITTA: it simultaneously "learns" core representations from a dynamic ratio of low-entropy samples and "unlearns" atypical features from high-entropy samples. This approach consistently improves the average accuracy of various TTA methods by approximately 1–2 percentage points on corrupted datasets such as CIFAR-10/100-C and ImageNet-C.
Browse all 211 Multimodal VLM papers →
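As a rough illustration of the quantity named in the A-TPT entry above (the minimum pairwise angular distance of normalized text features), the sketch below computes it for fixed toy vectors. The function names and the example vectors are hypothetical, and this is not the authors' implementation: the method described above optimizes learnable prompts to increase this quantity, which is not shown here.

```python
import math
from itertools import combinations


def normalize(v):
    """Scale a vector to unit L2 norm (project it onto the unit hypersphere)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def min_pairwise_angle(features):
    """Return the smallest angle (in radians) between any two normalized vectors."""
    unit = [normalize(v) for v in features]
    smallest = math.pi
    for a, b in combinations(unit, 2):
        cos = sum(x * y for x, y in zip(a, b))
        cos = max(-1.0, min(1.0, cos))  # guard against floating-point rounding
        smallest = min(smallest, math.acos(cos))
    return smallest


# Hypothetical toy "text features": one tightly clustered set, one spread-out set.
clustered = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.2]]
spread = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
```

A larger value of `min_pairwise_angle` means the features are more evenly dispersed on the hypersphere, which is the property the entry describes as "angular diversity."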
🧠 VLM Reasoning (111)
- AdaReasoner: Dynamic Tool Orchestration for Iterative Visual Reasoning
-
AdaReasoner teaches Multimodal Large Language Models (MLLMs) to dynamically orchestrate a set of visual tools during multi-turn visual reasoning. Through a two-stage training process of "Tool Cold Start + Multi-turn Tool GRPO," it enables a small 7B model to autonomously select and discard tools and to adjust how often it uses them. It achieves an average performance gain of +38.7%, reaching a near-perfect score of 97.6% on VSP, surpassing GPT-5 and Claude Sonnet 4.
- Agent-X: Evaluating Deep Multimodal Reasoning in Vision-Centric Agentic Tasks
-
Agent-X is a large-scale benchmark for "vision-centric agents," covering 6 types of scenarios with 828 real-world multimodal tasks (image/multi-image/video/instructional text). It features a fine-grained "step-level + reasoning chain + outcome" three-mode evaluation system. Results indicate that even the strongest models from the GPT, Gemini, and Qwen series achieve full-chain success rates below 50%, exposing significant flaws in current LMMs regarding multi-step visual reasoning and tool invocation.
- Agentic Jigsaw Interaction Learning for Enhancing Visual Perception and Reasoning in Vision-Language Models
-
AGILE redefines "jigsaw puzzle solving" as an interactive process where the model generates code and observes feedback. Combined with infinitely scalable procedurally synthesized data, cold-start SFT, and GRPO reinforcement learning, it improves Qwen2.5-VL-7B accuracy on 2×2 puzzles from 9.5% to 82.8% and achieves an average gain of 3.1% across 9 general vision benchmarks.
- ARES: Multimodal Adaptive Reasoning via Difficulty-Aware Token-Level Entropy Shaping
-
ARES utilizes "window entropy" as an exploration trigger and controls exploration depth through a difficulty-aware hierarchical entropy reward. This allows multimodal reasoning models to think less on simple problems and more on difficult ones, simultaneously improving accuracy and reasoning efficiency across mathematical, logical, and multimodal benchmarks.
- AutoGPS: Automated Geometry Problem Solving via Multimodal Formalization and Deductive Reasoning
-
AutoGPS utilizes a neuro-symbolic synergistic framework consisting of a "Multimodal Problem Formalizer (MPF) + Deductive Symbolic Reasoner (DSR)." It first translates plane geometry problems into formal language and then performs rigorous deduction via hypergraph expansion. This process yields both a correct answer and a traceable step-by-step solution, achieving SOTA on Geometry3K / PGPS9K and improving human-evaluated logical accuracy from ~71% to 99%.
- Beyond Classification Accuracy: Neural-MedBench and the Need for Deeper Reasoning Benchmarks
-
This paper identifies that existing medical VLM benchmarks focus only on classification accuracy, creating an "evaluation illusion." It proposes a "Breadth-Depth" two-axis evaluation framework and builds Neural-MedBench, a deep reasoning benchmark for neurology (120 multimodal cases, 200 reasoning tasks). Empirical results show that top models like GPT-5, Claude-4, and MedGemma fail collectively in deep reasoning, with failures primarily stemming from reasoning rather than perception.
- Children's Intelligence Tests Pose Challenges for MLLMs? KidGym: A 2D Grid-Based Reasoning Benchmark for MLLMs
-
Inspired by the Wechsler Intelligence Scale for Children, "General Intelligence" is decomposed into five measurable abilities: Execution, Perceptual Reasoning, Learning, Memory, and Planning. KidGym is constructed with twelve 2D grid interaction tasks, three difficulty levels, and a customizable dynamic benchmark. It systematically reveals significant shortcomings of current top MLLMs in non-semantic abstract vision, quantity perception, and composite ability tasks.
- CircuitSense: A Hierarchical MLLM Benchmark Bridging Visual Comprehension and Symbolic Reasoning in Engineering Design Process
-
CircuitSense establishes the first MLLM benchmark organized by engineering abstraction levels, emphasizing the derivation of symbolic equations from circuit schematics. Using 8,006 problems (human-curated + synthetically generated), it systematically evaluates 8 MLLMs, revealing a fundamental gap where closed-source models exceed 85% in perception tasks but plummet below 19% in symbolic derivation.
- CompoDistill: Attention Distillation for Compositional Reasoning in Multimodal LLMs
-
CompoDistill finds that existing knowledge distillation (KD) for Multimodal Large Language Models (MLLMs) only acquires "visual recognition" but fails in "visual perception," rooted in the misalignment of visual attention distributions between teacher and student. It introduces a VAT module to align student visual attention to the teacher and a TAF module to reuse the teacher's adapter. With a three-stage training strategy, it elevates a 2B student's Compositional Reasoning (CR) average from 61.5 to 66.7, approaching a 4B teacher without degrading VQA performance.
- Compose and Fuse: Revisiting the Foundational Bottlenecks in Multimodal Reasoning
-
This paper employs an evaluation framework based on propositional logic and "six interaction modes" that split facts across modalities. It systematically demonstrates that the true bottleneck of Multimodal Large Language Model (MLLM) reasoning lies in "integration" rather than "perception." Through attention probes and causal interventions, two root causes are identified: the task-composition bottleneck (identification and reasoning cannot be jointly performed in a single forward pass) and the fusion bottleneck (modality fusion in early layers introduces bias). The authors also provide two lightweight remedies: "two-step prompting" and "early-layer attention warming."
Browse all 111 VLM Reasoning papers →
⚡ VLM Efficiency (18)
- Enhancing Visual Token Representations for Video Large Language Models via Training-free Spatial-Temporal Pooling and Gridding
-
Addressing the issue where Video Large Language Models (Video LLMs) lose spatio-temporal information when compressing thousands of visual tokens into a limited context, this paper proposes ST-GridPool, a training-free method. It utilizes "Pyramidal Temporal Gridding" to aggregate frame tokens across different time scales, injecting multi-granularity motion information, and "Norm-based Spatial Pooling" to preserve high-information regions through pooling weighted by L2 norms. It achieves consistent performance gains as a plug-and-play solution on LLaVA-Video / LLaVA-OneVision without retraining.
- HiDrop: Hierarchical Vision Token Reduction in MLLMs via Late Injection, Concave Pyramid Pruning, and Early Exit
-
The authors propose the HiDrop framework, which performs deep functional analysis of MLLM layers (Shallow = Propagators, Middle = Fusion Centers, Deep = Language Reasoning). It designs a three-stage strategy: Late Injection (skipping shallow layers), Concave Pyramid Pruning (pruning in middle layers), and Early Exit (exiting in deep layers). This approach compresses approximately 90% of vision tokens with negligible performance loss and achieves a 1.72× training speedup.
- iLLaVA: An Image Is Worth Fewer Than 1/3 Input Tokens in Large Multimodal Models
-
iLLaVA breaks the inertia of "compressing tokens only in the LLM stage" by inserting token merging into both the image encoder and the LLM. Using an "information tokens + recovery tokens" merging strategy to retrieve useful information from discarded tokens, it achieves 2× throughput and 4× prefilling acceleration without any training while maintaining >95% performance.
- IVC-Prune: Revealing the Implicit Visual Coordinates in LVLMs for Vision Token Pruning
-
This work reveals the implicit visual coordinate system (IVC tokens) established by RoPE positional encoding in LVLMs and proposes a training-free, prompt-aware vision token pruning strategy. By preserving both IVC tokens and semantic foreground tokens, it reduces visual tokens by approximately 50% while maintaining \(\ge 99\%\) of the original performance.
- LearnPruner: Rethinking Attention-based Token Pruning in Vision Language Models
-
LearnPruner empirically debunks the prevalent assumption that "attention score = token importance." It points out that [CLS] attention in vision encoders is contaminated by attention sinks, while in LLMs, only "text-to-vision" mid-layer attention is reliable. Consequently, it replaces [CLS] attention with a learnable pruning module and superimposes text-guided pruning at the LLM mid-layers. By retaining only ~5.5% of vision tokens, it maintains 95% performance and achieves a 3.2× speedup.
- Lightweight Spatio-Temporal Modeling via Temporally Shifted Distillation for Real-Time Accident Anticipation
-
Using a frozen image-only CLIP teacher + temporally shifted distillation, a lightweight RepMixer+RWKV student learns "predictive" temporal capabilities without large-scale video pre-training. It achieves SOTA on the DAD/CCD accident anticipation benchmarks while being 3–7× smaller than competitors and running at 80 FPS on Jetson Orin Nano.
- Mixing Importance with Diversity: Joint Optimization for KV Cache Compression in Large Vision-Language Models
-
It is observed that KV Cache in LVLMs exhibits modal-specific and head-specific semantic redundancy. Since selection based solely on importance leads to a loss of semantic coverage, MixKV is proposed to adaptively mix importance and diversity scores per head for KV Cache compression, achieving an average improvement of 5.1% under extreme compression.
- Nüwa: Mending the Spatial Integrity Torn by VLM Token Pruning
-
This paper discovers that existing visual token pruning methods collapse on visual grounding (VG) tasks because they destroy the "global spatial reference system" constructed by positional encodings. Consequently, it proposes Nüwa—a two-stage pruning framework inspired by swarm intelligence (Boids) that employs a "Partition-Align-Aggregate" strategy on the vision encoder side to preserve spatial anchors, followed by text-guided refinement in the middle of the LLM. This approach improves performance retention on VG tasks from ~7% to 47% while maintaining VQA performance at 95%.
- Photon: Speedup Volume Understanding with Efficient Multimodal Large Language Models
-
Photon is a multimodal large model that directly processes full 3D medical volume data (CT/MRI). It utilizes "Instruction-conditioned Token Scheduling (ITS)" to adaptively determine the number of visual tokens to retain for each question, and "Surrogate Gradient Propagation (SGP)" to ensure discrete token dropping remains differentiable during training. This approach achieves SOTA accuracy on medical VQA while providing approximately 5× training speedup and two-thirds memory savings.
- PPE: Positional Preservation Embedding for Token Compression in Multimodal Large Language Models
-
PPE (Positional Preservation Embedding) leverages the rotation independence of RoPE dimensions to encode multiple original position IDs into different dimension segments of a merged token, enabling a single compressed token to carry several pieces of spatial/temporal positional information. PPE is a zero-parameter, plug-and-play universal operator that yields only a 3.6% average performance drop on image tasks at 55% compression and maintains comparable performance at 90% compression through cascaded compression.
Browse all 18 VLM Efficiency papers →
🎵 Audio & Speech (80)
- AC-Foley: Reference-Audio-Guided Video-to-Audio Synthesis with Acoustic Transfer
-
AC-Foley is proposed as a reference-audio-guided video-to-audio synthesis framework. Through two-stage training (acoustic feature learning and temporal adaptation) and multimodal conditional flow matching, it achieves fine-grained timbre control, timbre transfer, and zero-shot sound effect generation, significantly outperforming existing methods in audio quality and acoustic fidelity.
- AlignSep: Temporally-Aligned Video-Queried Sound Separation with Flow Matching
-
AlignSep shifts "Video-Queried Sound Separation (VQSS)" from the mainstream time-frequency masking discriminative paradigm to a flow matching-based generative paradigm. By employing a temporally-aligned vector field estimator implemented with "temporal concatenation + non-cross-attention Transformer," it enforces frame-by-frame synchronization between audio and video. This allows for clean extraction of on-screen target sounds in difficult scenarios with intra-class interference and overlapping tracks, achieving a temporal alignment score \(T_{A\text{-}V}\) of 95.76% on the self-constructed VGGSound-Hard benchmark.
- AudioX: A Unified Framework for Anything-to-Audio Generation
-
AudioX utilizes a unified framework based on a Diffusion Transformer (DiT), integrated with a lightweight "Multimodal Adaptive Fusion (MAF)" module and a self-constructed 7-million-sample multimodal dataset, IF-caps. This allows a single set of model weights to generate high-fidelity sound effects and music from arbitrary combinations of text, video, and audio, significantly outperforming specialized models in fine-grained instruction following.
- Aurelius: Relation Aware Text-to-Audio Generation At Scale
-
Aurelius constructs two large-scale decoupled corpora (AudioEventSet with 110 categories of audio events + AudioRelSet with 100 types of relations) and a text-audio pair generation strategy. This pushes "relation-aware text-to-audio generation" from small-scale exploration to a scalable research level. The authors systematically benchmark 9 mainstream TTA models, revealing that they almost entirely fail at modeling multi-event relations (with relation accuracy generally <10%).
- Automatic Stage Lighting Control: Is it a Rule-Driven Process or Generative Task?
-
This paper redefines "Automatic Stage Lighting Control (ASLC)" from the long-standing paradigm of "music classification → table lookup" to a generative task. It proposes Skip-BART, an end-to-end model that takes music audio as input and autoregressively generates hue and value for lighting frame-by-frame. A novel skip connection explicitly aligns music and lighting frames. Supported by a self-built dataset, pre-training, and transfer learning, the model outperforms rule-based methods across quantitative metrics and a 38-person subjective evaluation, showing no significant difference from professional lighting designers (p=0.72).
- AVERE: Improving Audiovisual Emotion Reasoning with Preference Optimization
-
Addressing the issues of spurious associations and hallucinations in multimodal large language models (MLLMs) during emotion reasoning, this work proposes the EmoReAlM evaluation benchmark and the AVEm-DPO preference optimization method. By constructing targeted preference pairs and text prior regularization, it achieves a zero-shot relative performance gain of 6-19% on DFEW, RAVDESS, and EMER.
- AVEX: What Matters for Animal Vocalization Encoding
-
This is a large-scale empirical study: the authors systematically disassemble "what matters most in training a generalizable bioacoustic encoder." The conclusion is that a two-stage recipe—self-supervised pre-training on a mixture of diverse bioacoustic and general audio data, followed by supervised post-training—is the most effective for both in-distribution and out-of-distribution performance. This approach achieves new SOTA across 26 datasets and four task categories.
- Beyond Instance-Level Alignment: Dual-Level Optimal Transport for Audio-Text Retrieval
-
DART introduces a "feature-level" alignment layer beyond traditional "instance-level" audio-text alignment—treating each embedding channel as a distribution and employing Unbalanced Wasserstein distance to pair audio and text channels. Guided by a "Reliability-Aware Margin" based on variance, kurtosis, and cross-modal correlation to favor stable semantic channels, DART achieves SOTA retrieval performance under mini-batch, label-scarce, and noisy-label conditions.
- Bridging Piano Transcription and Rendering via Disentangled Score Content and Style
-
This paper unifies the inverse tasks of Expressive Performance Rendering (EPR, score-to-performance) and Automatic Piano Transcription (APT, performance-to-score) into a single Transformer Seq2Seq framework. By disentangling "note-level score content" and "global performance style" to achieve bidirectional modeling, and training an additional diffusion model to recommend appropriate styles directly from the score, the rendering is made both controllable and automated.
- Can Speech LLMs Think while Listening?
-
This paper inserts text Chain-of-Thought (CoT) into the text monologue stream of a multi-stream speech LLM (Moshi), enabling reasoning in the text space and improving accuracy by an average of 2.4×. It further proposes a "question completeness" metric based on KL divergence, allowing the model to "think while listening" and initiate reasoning before the user finishes speaking. Combined with DPO preference fine-tuning, this reduces additional reasoning latency by approximately 70% without sacrificing accuracy.
Browse all 80 Audio & Speech papers →
🔎 AIGC Detection (30)
- A Rich Knowledge Space for Scalable Deepfake Detection
-
This paper integrates 11 deepfake and real face sources into the MMI-DD dataset, scaling to 3.6 million images. It proposes SD2, which utilizes CLIP's hierarchical visual features, fine-grained textual forgery labels, and VLM-generated descriptions for joint training. This ensures that the deepfake detector gains stronger cross-domain and AIGC generalization capabilities on large-scale heterogeneous data instead of suffering from performance degradation.
- All Patches Matter, More Patches Better: Enhance AI-Generated Image Detection via Panoptic Patch Learning
-
This paper proposes the detection principles "All Patches Matter, More Patches Better," identifying that existing AI-generated image (AIGI) detectors suffer from a "Few-Patch Bias"—focusing only on a minimal set of patches. A Panoptic Patch Learning (PPL) framework is designed, using Randomized Patch Reconstruction (RPR) and Patch-wise Contrastive Learning (PCL) to spread discriminative power across all patches. This significantly improves cross-generator generalizability and robustness on GenImage, DRCT-2M, AIGCDetectBenchmark, and real-world Chameleon datasets (e.g., CLIP backbone achieves 97.2% mAcc on GenImage with a std of only 1.7).
- Attack-Resistant Watermarking for AIGC Image Forensics via Diffusion-based Semantic Deflection
-
This paper proposes PAI, a training-free, plug-and-play inherent watermarking framework for diffusion models. By combining "initialization embedding" and "key-guided denoising trajectory deflection," user identity is deeply entangled with image semantics. The "initialization bias" obtained via DDIM inversion serves as a unified forensic signal for copyright verification, attack detection, and semantic-level tamper localization. PAI achieves an average verification accuracy of 98.43% under 12 types of attacks, outperforming SOTA by 37.25%.
- Beyond Raw Detection Scores: Markov-Informed Calibration for Boosting Machine-Generated Text Detection
-
This paper argues that token-level scores of mainstream "metric-based" machine-generated text (MGT) detectors are contaminated by the randomness of LLM sampling. It utilizes Markov Random Fields (MRF) to characterize two patterns: "neighbor similarity" and "initial instability." Through mean-field approximation, this is implemented as a lightweight iterative component with only 2×2 parameters that can be layered onto any existing detector. It significantly boosts the AUROC of various baseline detectors with almost no additional overhead (e.g., increasing DetectGPT's AUROC on the Essay dataset from 44% to 92%).
- Calibrating Verbalized Confidence with Self-Generated Distractors
-
The DiNCo method is proposed to expose "suggestibility bias" by having LLMs independently evaluate automatically generated distractors (plausible but incorrect alternative answers). By normalizing with the total confidence across distractors and fusing two complementary dimensions—generation consistency and verification consistency—it significantly improves confidence calibration in short-form QA and long-form generation tasks.
- CLARC: C/C++ Benchmark for Robust Code Search
-
This paper constructs CLARC, the first compilable C/C++ code retrieval benchmark (6,717 query-code pairs), using an automated pipeline to extract code from GitHub and generate/validate queries via LLM with hypothesis testing. It covers four retrieval scenarios (Standard, Anonymized, Assembly, and WebAssembly), revealing that existing code embedding models rely excessively on lexical features (NDCG@10 drops from 0.89 to 0.67 after anonymization) and significantly underperform in binary-level retrieval.
- Data Provenance for Image Auto-Regressive Generation
-
Without altering the generation process or requiring watermarks, this paper leverages the "features left by Image Autoregressive (IAR) models in the codebook quantization space." By utilizing a trained inverse decoder and two complementary signals—QuantLoss and EncLoss—it achieves nearly 100% TPR@1%FPR for post-hoc provenance detection across mainstream IAR models including VAR, RAR, LlamaGen, and Infinity.
- Death of the Novel(ty): Beyond n-Gram Novelty as a Metric for Textual Creativity
-
Through close reading annotations of 8,618 expressions by 26 professional writers, this study reveals that n-gram novelty is insufficient to measure textual creativity—approximately 91% of expressions with high n-gram novelty are not considered creative, and a negative correlation exists between high n-gram novelty and low pragmaticality in open-source LLMs.
- DMAP: A Distribution Map for Text
-
This paper proposes DMAP (Distribution Map), a mathematical framework that maps text to i.i.d. samples in the range \([0,1]\) via next-token probability ranking of language models. It theoretically proves that pure sampling produces a uniform distribution, enabling the use of \(\chi^2\) tests to verify generation parameters, uncovering the root cause of why "probability curvature" detectors fail under pure sampling, and visualizing statistical fingerprints left by post-training (SFT/RLHF) in downstream models.
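As a rough illustration of the uniformity claim, the sketch below uses a toy categorical "language model" instead of a real LLM. It assumes the map places each observed token at a uniformly random point inside its slice of the probability-ranked cumulative distribution (a randomized probability integral transform). The function name `dmap_values` and this exact construction are assumptions made for illustration and may differ from the paper's definition.

```python
import numpy as np

def dmap_values(probs, tokens, rng):
    """Map each observed token to a point in [0, 1].

    probs  : (T, V) array, next-token distribution at each step
    tokens : (T,) array, token actually observed at each step
    Assumed construction: rank tokens by descending probability, then draw a
    point uniformly inside the observed token's slice of the cumulative mass.
    """
    out = np.empty(len(tokens))
    for t, (p, tok) in enumerate(zip(probs, tokens)):
        order = np.argsort(-p)                     # most likely token first
        rank = int(np.where(order == tok)[0][0])   # rank of the observed token
        cdf = np.cumsum(p[order])
        lo = cdf[rank] - p[tok]                    # mass of all higher-ranked tokens
        out[t] = lo + rng.uniform() * p[tok]       # random point inside the slice
    return np.clip(out, 0.0, 1.0)                  # guard against rounding past 1

rng = np.random.default_rng(0)
T, V = 5000, 50
probs = rng.dirichlet(np.ones(V) * 0.3, size=T)          # toy next-token distributions
tokens = np.array([rng.choice(V, p=p) for p in probs])   # pure sampling

u = dmap_values(probs, tokens, rng)
counts, _ = np.histogram(u, bins=10, range=(0.0, 1.0))
expected = T / 10
chi2 = float(((counts - expected) ** 2 / expected).sum())  # Pearson statistic, 9 d.o.f.
```

Under pure sampling the histogram should be close to flat, so `chi2` should look like a draw from a chi-squared distribution with 9 degrees of freedom. Truncated or temperature-scaled sampling would be expected to skew the histogram.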
- D&R: Recovery-based AI-Generated Text Detection via a Single Black-box LLM Call
-
D&R randomly shuffles the text to be tested within local chunks separated by punctuation (Within-Chunk Shuffling) and calls a black-box LLM only once to recover it. It then measures the semantic and structural similarity between the recovered text and the original. AI-generated text is more likely to be "recovered" almost identically, while human-written text remains more dispersed. Feeding this similarity gap into a lightweight classifier enables detection, achieving an AUROC of 0.96 for long texts and 0.87 for short texts, without requiring probability access and using only a single call.
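The shuffling step described for D&R can be sketched as follows. This is a minimal illustration, assuming chunks are delimited by the characters `,.;:!?` and words are split on whitespace. The function name `within_chunk_shuffle` and the delimiter set are hypothetical, and the black-box LLM recovery call and the similarity classifier are deliberately left out.

```python
import random
import re

PUNCT = r"[,.;:!?]"   # assumed chunk delimiters

def within_chunk_shuffle(text, seed=None):
    """Shuffle word order inside each punctuation-delimited chunk.

    Punctuation marks keep their positions; only the words between two
    consecutive marks are permuted.
    """
    rng = random.Random(seed)
    pieces = re.split(f"({PUNCT})", text)   # capturing group keeps the delimiters
    result = ""
    for piece in pieces:
        if re.fullmatch(PUNCT, piece):
            result += piece                 # punctuation attaches to the previous chunk
            continue
        words = piece.split()
        if not words:
            continue                        # skip empty pieces between delimiters
        rng.shuffle(words)                  # permute words inside the chunk
        result += (" " if result else "") + " ".join(words)
    return result

original = "The quick brown fox jumps, and the lazy dog sleeps."
shuffled = within_chunk_shuffle(original, seed=0)
```

Because words never cross a punctuation mark, the perturbation stays local, which is presumably what makes a single recovery call feasible.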
Browse all 30 AIGC Detection papers →
🧊 3D Vision (201)¶
- 3DGEER: 3D Gaussian Rendering Made Exact and Efficient for Generic Cameras
-
The 3DGEER framework is proposed, which achieves geometrically exact and real-time efficient 3D Gaussian rendering under any camera model by deriving a closed-form solution for integrating Gaussian density along rays, designing Particle Bounding Frustums (PBF) for precise and efficient ray-particle association, and introducing Bipolar Equi-Angular Projection (BEAP) to unify wide field-of-view camera representations. It comprehensively outperforms existing methods on fisheye and pinhole datasets.
- 3DSMT: A Hybrid Spiking Mamba-Transformer for Point Cloud Analysis
-
3DSMT integrates the event-driven low-power characteristics of Spiking Neural Networks (SNNs) with the local modeling of Transformers and the linear-complexity global modeling of Mamba into a hybrid architecture. By utilizing "Spiking Local Offset Attention + Spiking Mamba Blocks," it achieves SOTA results among SNN methods in classification, few-shot, and segmentation tasks, with energy consumption being dozens of times lower than ANN counterparts, while even outperforming several ANN models.
- A²TG: Adaptive Anisotropic Textured Gaussians for Efficient 3D Scene Representation
-
A²TG assigns an "anisotropic texture" with adaptive resolution and aspect ratio to each 2D Gaussian. By utilizing gradient-driven selection and upsampling rules, texture parameters are allocated only to Gaussians that truly require high-frequency details, achieving higher rendering quality and lower VRAM consumption than fixed-square textured Gaussian Splatting under the same memory budget.
- A Scene is Worth a Thousand Features: Feed-Forward Camera Localization from a Collection of Image Features
-
FastForward compresses "mapping" into a single feature extraction step: it uses a set of features randomly sampled from posed mapping images and anchored in 3D space as the scene map. A DUSt3R-style feed-forward network then predicts the 3D coordinates of the query image in one pass to solve the pose. This achieves mapping in seconds and localization in 0.5s, while its accuracy matches or even surpasses SCR/structured methods that require minutes to hours for mapping.
- A Step to Decouple Optimization in 3DGS
-
The paper provides an in-depth analysis of overlooked optimization couplings in 3DGS, specifically update step coupling (implicit updates and momentum rescaling under invisible viewpoints) and gradient coupling (regularization and photometric loss coupling within Adam's momentum). By decoupling and reorganizing these components, the authors propose the AdamW-GS optimizer, which simultaneously improves reconstruction quality and reduces redundant primitives without requiring additional pruning operations.
- Active Learning of 3D Gaussian Splatting with Consistent Region Partition and Robust Pose Estimation
-
This paper proposes an online active learning algorithm for 3D Gaussian Splatting (3DGS). It guides users by suggesting the "next best view" during training. The system partitions the model into consistent regions using visibility features, identifies the most under-reconstructed areas via semantic feature variance, and directly generates the next optimal pose using a von Mises-Fisher distribution. It also incorporates a robust pose optimization module to handle noise from handheld capture, outperforming SOTA methods such as FisherRF on NeRF-Synthetic in few-shot settings (10/20 views).
- Aligned Novel View Image and Geometry Synthesis via Cross-modal Attention Instillation
-
The authors reformulate multi-view novel view synthesis as a dual-branch diffusion inpainting task of "image + geometry." By utilizing MoAI (cross-Modal Attention Instillation) to inject attention maps from the image branch into the geometry branch, the method generates aligned novel view images and point clouds directly from pose-free reference images, achieving SOTA performance in camera extrapolation settings.
- All That Glitters Is Not Gold: Key-Secured 3D Secrets within 3D Gaussian Splatting
-
KeySS transforms "hiding multiple 3DGS secret scenes within a single 3DGS cover scene" into an end-to-end trainable framework. It employs a decoder controlled by CLIP-encoded keys to directly map cover Gaussians to secret Gaussians; incorrect keys result in reconstructing only the cover. The study identifies that different Gaussian attributes contribute unequally to hiding secrets (opacity is effective, while spherical harmonics are nearly useless). It proposes the 3D-Sinkhorn distance to measure steganographic imperceptibility in the Gaussian parameter space, ultimately surpassing GS-Hider in reconstruction fidelity and anti-detection security.
- Anime-Ready: Controllable 3D Anime Character Generation with Body-Aligned Component-Wise Garment Modeling
-
Anime-Ready normalizes text or single images into A-pose anime character images, then utilizes Anime-SMPL, a body-aligned component-wise garment DiT, and fragmented texture generation to advance 3D anime characters from "looking similar" to animation-ready assets with skeletons, swappable outfits, and expression control.
- ARTDECO: High-Fidelity Online 3D Reconstruction with Hierarchical Gaussian Structure + Feed-forward Priors
-
ARTDECO utilizes feed-forward 3D foundation models (MASt3R / π³) as modular pose and point cloud priors, coupled with a Gaussian decoder that decodes structured Gaussians from multi-scale features, and a hierarchical semi-implicit Gaussian representation with LoD. This system achieves SLAM-level speed, feed-forward robustness, and rendering quality approaching per-scene optimization from monocular video streams.
Browse all 201 3D Vision papers →
🎯 Object Detection (31)¶
- APT: Towards Universal Scene Graph Generation via Plug-in Adaptive Prompt Tuning
-
APT replaces the long-standing "frozen word vector semantic prior" in Scene Graph Generation (SGG) with a set of lightweight learnable prompts. It dynamically modulates static semantic features into representations dependent on visual context. As a plug-and-play module, it can be integrated into any one-stage, two-stage, or open-vocabulary SGG framework, achieving comprehensive performance gains with <0.5M parameters and shorter training times.
- Bootstrapping MLLM for Weakly-Supervised Class-Agnostic Object Counting
-
WS-COC is the first framework to utilize Multimodal Large Language Models (MLLM) for weakly-supervised class-agnostic object counting. Using only image-level total counts for supervision, it activates the counting capabilities of MLLMs through three simple strategies: "Binary Dialogue Tuning + Comparative Ranking Optimization + Global-Local Fusion." It approaches or even surpasses some fully-supervised methods using point-level supervision on four datasets, including FSC-147.
- CGSA: Class-Guided Slot-Aware Adaptation for Source-Free Object Detection
-
This work introduces Object-Centric Learning (Slot Attention) to Source-Free Domain Adaptive Object Detection (SF-DAOD) for the first time. By extracting domain-invariant object-level structural priors through a Hierarchical Slot Awareness module and driving domain-invariant representations with class-guided contrastive learning, the method significantly outperforms existing approaches across multiple cross-domain benchmarks.
- CLIP Behaves like a Bag-of-Words Model Cross-modally but not Uni-modally
-
It is demonstrated through linear probing experiments that CLIP's Bag-of-Words (BoW) behavior originates from cross-modal alignment failure rather than a lack of binding information in the encoders. LABCLIP is proposed, which significantly restores attribute-object binding capabilities by training only a lightweight linear transformation.
- Complexity- and Statistics-Guided Anomaly Detection in Time Series Foundation Models
-
When Time Series Foundation Models (TFMs, such as MOMENT) are applied to reconstruction-based anomaly detection, they fail due to "overgeneralization" (reconstructing anomalies too well) and "over-stationarization" (Instance Normalization removing mean and variance). This paper introduces a complexity metric \(\alpha\) derived from the difference between reconstruction and imputation errors to adaptively ensemble TFMs with lightweight statistical models (CAE), and re-injects mean and variance into the decoding stage (MOMENT-Stat). It improves VUS-PR from the previous SOTA of 0.4233 to 0.4679 across 23 univariate and 17 multivariate benchmarks.
- Contextual and Seasonal LSTMs for Time Series Anomaly Detection
-
Aiming at "minor point anomalies" and "slowly rising anomalies" that are difficult for existing methods to detect in univariate time series, this paper proposes the CS-LSTMs dual-branch architecture. S-LSTM models periodic evolution in the frequency domain, while C-LSTM captures local trends in the time domain. Combined with a wavelet noise decomposition strategy, it outperforms SOTA on four benchmarks with a 40% increase in inference speed.
- DeCo-DETR: Decoupled Cognition DETR for efficient Open-Vocabulary Object Detection
-
DeCo-DETR decouples "online text encoder invocation" and the "competition between localization and alignment" in open-vocabulary detection. It employs an LVLM to offline distill a reusable hierarchical semantic prototype pool as a substitute for the text encoder during inference and utilizes dual-stream gradient isolation to separate localization and semantic alignment training. This approach achieves a gain of 3.1--5.8 points on OV-COCO novel classes while compressing single-image inference latency to 135ms.
- DETR-ViP: Detection Transformer with Robust Discriminative Visual Prompts
-
DETR-ViP attributes the performance gap between "visual prompts and text prompts" to a lack of global discriminability in visual prompts. By expanding negative samples through global prompt integration, reshaping the visual prompt space topology via text-based relationship distillation, and stabilizing inference with selective fusion, it achieves a new SOTA in visual prompt detection across COCO / LVIS / ODinW / Roboflow100 (surpassing T-Rex2-T by +4.4 AP on COCO).
- DiffuDETR: Rethinking Detection Transformers with Denoising Diffusion Process
-
DiffuDETR reformulates object detection as an "object query generation task conditioned on an image and a set of noisy reference points." By using denoising diffusion training, the DETR decoder learns to gradually denoise query reference points from Gaussian noise into precise object locations. It consistently outperforms baselines such as Deformable DETR and DINO on COCO, LVIS, and V3Det, while adding negligible computational overhead during inference as it only requires a few extra decoder passes.
- Dual Distillation for Few-Shot Anomaly Detection
-
This paper proposes D24FAD, a dual-distillation framework combining Teacher-Student Distillation (TSD) on query images and Student Self-Distillation (SSD) on support images, supplemented by a Learn-to-Weight (L2W) mechanism for adaptive support evaluation. It achieves 100% AUROC using 2-shot on the APTOS fundus dataset.
Browse all 31 Object Detection papers →
✂️ Segmentation (32)¶
- Advancing Complex Video Object Segmentation via Progressive Concept Construction
-
This paper introduces Segment Concept (SeC), which injects object-level "concept representations" extracted by Large Vision-Language Models (LVLMs) into a SAM 2.1-style Video Object Segmentation (VOS) pipeline on demand. This approach significantly reduces appearance-based interference and object reappearance failures in complex multi-shot scenarios while establishing the SeCVOS benchmark specifically for evaluating semantic-level VOS capabilities.
- AMLRIS: Alignment-aware Masked Learning for Referring Image Segmentation
-
The authors propose an Alignment-aware Masked Learning (AML) strategy that quantifies vision-language patch-level alignment and filters low-alignment pixels. This allows the RIS model to focus on reliable regions during training, achieving SOTA results across all 8 RefCOCO splits without any architectural modifications.
- Benchmarking Open-ended Segmentation
-
Focusing on the evaluation loophole in "open-ended segmentation" where model-generated free-form text is forcibly mapped back to a fixed vocabulary via embedding similarity, this paper introduces a mapping function based on lexical relationships (exact/synonym/hyponym/meronym) and a Lexical Alignment Curve (LAC) protocol. This shifts evaluation accuracy from a 37.7% deviation from human judgement to over 90% alignment. Furthermore, the first open-ended segmentation MLLM with contrastive loss (OPAL) is trained, achieving a new SOTA on open-ended panoptic segmentation.
- ByteFlow: Language Modeling through Adaptive Byte Compression without a Tokenizer
-
The authors propose ByteFlow Net, a hierarchical byte-level language model that operates without a tokenizer. It utilizes the information-theoretic metric of coding rate to adaptively compress raw byte streams into semantic units, outperforming BPE-based baselines and existing byte-level architectures in both pre-training loss and downstream tasks.
- Decomposed Attention Fusion in MLLMs for Training-free Video Reasoning Segmentation
-
This work reconfigures video reasoning segmentation into a video QA task, extracting localization cues directly from the MLLM attention rollout. It purifies noisy attention maps into clean object masks through "Contrastive Background Removal" and "Video-Frame Complementarity" fusion. Finally, attention-guided SAM2 generates fine-grained masks. The entire process is training-free and achieves performance comparable to supervised methods.
- Deforming Videos to Masks: Flow Matching for Referring Video Segmentation
-
This work formalizes Referring Video Object Segmentation (RVOS) as an ODE flow problem that continuously deforms video latent representations into masks under language guidance. By fine-tuning the pre-trained text-to-video (T2V) model Wan2.1 and employing three strategies focused on the trajectory starting point, the method achieves SOTA performance on MeViS, Ref-YouTube-VOS, and Ref-DAVIS17.
- Detective SAM: Adaptive AI-Image Forgery Localization
-
A set of lightweight adapters is attached to SAM2 to automatically convert "post-perturbation feature distribution shift" forensics cues into heatmap prompts for segmenting tampered areas in diffusion edits. Combined with an AutoEditForge pipeline for automatic data generation, the locator can continually adapt to evolving image editing models.
- Efficient-SAM2: Accelerating SAM2 with Object-Aware Visual Encoding and Memory Retrieval
-
The study identifies that SAM2 exhibits sparse perception patterns similar to biological vision (the decoder focuses on the foreground while the encoder computes globally; only a few tokens in memory frames are effective and remain temporally consistent in saliency). Based on this, Efficient-SAM2 is proposed, eliminating redundant computation through Object-Aware Sparse Window Routing (SWR) and Sparse Memory Retrieval (SMR). This achieves a 1.68× end-to-end acceleration on SAM2.1-L with only a 1% accuracy loss.
- Enabling True Global Perception in State Space Models for Visual Tasks
-
The authors axiomatically define "image global modeling" for the first time using gradient lower bound axioms and design the GSSM module based on 2D-DFT frequency domain modulation. They theoretically prove and experimentally verify that SSMs can achieve true global perception while maintaining linear-logarithmic complexity.
- Enhancing Image-Conditional Coverage in Segmentation: Adaptive Thresholding via Differentiable Miscoverage Loss
-
The COAT framework is proposed to learn image-adaptive threshold predictors end-to-end using a differentiable sigmoid soft TPR approximation as a loss function, significantly reducing the per-image Coverage Gap in Conformal Risk Control for image segmentation.
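A sigmoid-based soft TPR of the kind mentioned for COAT can be sketched as below. This is only an illustration under stated assumptions: the names `soft_tpr` and `hard_tpr`, the `temperature` smoothing parameter, and the per-pixel formulation are hypothetical and are not taken from the paper.

```python
import numpy as np

def soft_tpr(probs, mask, threshold, temperature=0.05):
    """Smooth surrogate for the true-positive rate at a given threshold.

    The hard indicator 1[probs >= threshold] is replaced by a sigmoid, so the
    result varies smoothly with `threshold`. Smaller `temperature` values
    bring the surrogate closer to the hard TPR.
    """
    soft_hits = 1.0 / (1.0 + np.exp(-(probs - threshold) / temperature))
    return float((soft_hits * mask).sum() / max(mask.sum(), 1))

def hard_tpr(probs, mask, threshold):
    """Exact (non-differentiable) true-positive rate, for comparison."""
    hits = (probs >= threshold).astype(float)
    return float((hits * mask).sum() / max(mask.sum(), 1))

probs = np.array([0.9, 0.8, 0.3, 0.6, 0.1])   # predicted foreground probabilities
mask = np.array([1, 1, 1, 0, 0])              # ground-truth foreground pixels

exact = hard_tpr(probs, mask, threshold=0.5)
smooth = soft_tpr(probs, mask, threshold=0.5, temperature=0.001)
```

In an end-to-end setup the threshold would come from a learned predictor and the soft TPR would enter a miscoverage loss. Here it is evaluated with NumPy only to show that the surrogate tracks the hard quantity.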
Browse all 32 Segmentation papers →
🖼️ Image Restoration (61)¶
- A Statistical Benchmark for Diffusion-Posterior-Sampling Algorithms
-
This paper establishes a "standard ruler" for Diffusion Posterior Sampling (DPS) algorithms: by utilizing Lévy process signals—which allow for exact Gibbs sampling—as the test distribution, it obtains "gold standard" posterior samples at the distribution level. The authors systematically evaluate mainstream DPS algorithms (C-DPS / DiffPIR / DPnP) across four types of inverse problems (denoising, deconvolution, inpainting, and partial Fourier reconstruction) using MMSE optimality gap and posterior coverage metrics. The conclusion reveals that these algorithms are generally not calibrated.
- Adaptive Moments are Surprisingly Effective for Plug-and-Play Diffusion Sampling
-
The Adam adaptive moment estimation from standard optimizers is directly applied to the guidance gradients of diffusion sampling. By maintaining the exponential moving average (EMA) of the first and second moments of likelihood score estimates across sampling steps, the noisy gradients of plug-and-play methods like DPS and CG are stabilized at almost zero extra cost. This approach outperforms several more complex and slower methods in image restoration (super-resolution, deblurring, inpainting) and class-conditional generation.
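The moment tracking described above can be sketched with a small helper. This is a minimal illustration, assuming the standard Adam update with bias correction and default hyperparameters. The class name `AdamGuidance` is hypothetical, and how the returned direction is scaled and added to the sample depends on the sampler, which is not shown.

```python
import numpy as np

class AdamGuidance:
    """Adam-style moment tracking for a guidance gradient across sampling steps."""

    def __init__(self, beta1=0.9, beta2=0.999, eps=1e-8):
        self.beta1, self.beta2, self.eps = beta1, beta2, eps
        self.m = None   # EMA of the gradient (first moment)
        self.v = None   # EMA of the squared gradient (second moment)
        self.t = 0      # number of sampling steps seen so far

    def step(self, grad):
        """Return the moment-normalized direction for the current sampling step."""
        if self.m is None:
            self.m = np.zeros_like(grad)
            self.v = np.zeros_like(grad)
        self.t += 1
        self.m = self.beta1 * self.m + (1 - self.beta1) * grad
        self.v = self.beta2 * self.v + (1 - self.beta2) * grad ** 2
        m_hat = self.m / (1 - self.beta1 ** self.t)   # bias correction
        v_hat = self.v / (1 - self.beta2 ** self.t)
        return m_hat / (np.sqrt(v_hat) + self.eps)

guide = AdamGuidance()
first = guide.step(np.array([0.5, -2.0, 4.0]))    # stands in for a noisy likelihood-score estimate
second = guide.step(np.array([0.4, -2.5, 3.0]))
```

The only state carried between steps is two arrays the size of the gradient, which is consistent with the "almost zero extra cost" claim.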
- Analyzing the Training Dynamics of Image Restoration Transformers: A Revisit to Layer Normalization
-
The authors track the training process of Image Restoration (IR) Transformers and discover that standard LayerNorm causes feature magnitudes to diverge to the million-level scale and channel entropy to collapse sharply. The root cause is identified as LN's "per-token normalization" and "input-independent scaling," which conflict with IR tasks. Consequently, they propose i-LN—a plug-and-play replacement for LN that performs normalization across the entire spatial-channel dimension and adaptively adds the scaling factor back after each Attention/FFN block. This stabilizes training and consistently improves performance in SR, denoising, deraining, and JPEG artifact removal.
- Are Deep Speech Denoising Models Robust to Adversarial Noise?
-
This paper presents the first systematic evaluation of the robustness of 4 SOTA Deep Speech Denoising (DNS) models against adversarial noise. By generating imperceptible adversarial perturbations through psychoacoustic-constrained PGD attacks, the authors demonstrate that Demucs, Full-SubNet+, FRCRN, and MP-SENet can be forced to output completely unintelligible gibberish. The experiments cover various acoustic conditions and human evaluations, while revealing limitations of targeted attacks, universal perturbations, and cross-model transfer.
- Beyond Scattered Acceptance: Fast and Coherent Inference for DLMs via Longest Stable Prefixes
-
The LSP scheduler accelerates DLM inference by 3.4\(\times\) by atomically committing the longest stable continuous prefix in each denoising step (rather than scattered discrete tokens), while maintaining or slightly improving output quality.
- Breaking Scale Anchoring: Frequency Representation Learning for Accurate High-Resolution Inference from Low-Resolution Training
-
The paper defines "Scale Anchoring" (where low-resolution training anchors error during high-resolution inference) and proposes an architecture-agnostic Frequency Representation Learning (FRL). By using Nyquist-normalized frequency encoding, it ensures that errors decrease as resolution increases, which is validated across 8 mainstream architectures.
- CL-DPS: A Contrastive Learning Approach to Blind Nonlinear Inverse Problem Solving via Diffusion Posterior Sampling
-
CL-DPS utilizes an offline-trained contrastive learning encoder to approximate the intractable likelihood term \(p(y\mid x_t)\) in diffusion posterior sampling (DPS). This enables diffusion models to solve blind nonlinear inverse problems (e.g., rotation blur, radial blur) for the first time without knowing or estimating operator parameters. It achieves clean restorations where existing methods fail, while remaining competitive on linear blind deblurring tasks.
- Content-Aware Mamba for Learned Image Compression
-
Addressing the two major flaws of Mamba in learned image compression—"fixed raster scanning" and "strict causality"—this paper proposes Content-Aware Mamba (CAM). It uses token rearrangement based on codebook clustering to group similar tokens for scanning and injects global priors into SSM output projections via a redundancy-aware prompt dictionary to break causality. Consequently, the CMIC model outperforms VTM-21.0 across Kodak/Tecnick/CLIC with BD-rates of −15.91%/−21.34%/−17.58%, while maintaining nearly 80% lower GPU memory usage than similar Mamba-based methods.
- Continuous Space-Time Video Super-Resolution with 3D Fourier Fields
-
This paper proposes V3, which utilizes a unified 3D Video Fourier Field (VFF) to represent video directly as a sum of sinusoids in \((x,y,t)\) space. By discarding the fragmented and fragile "Spatial INR + Optical Flow Warp" paradigm, it transforms super-resolution at arbitrary spatial and temporal scales into a single continuous sampling process. Furthermore, it enables the closed-form incorporation of a Gaussian Point Spread Function (PSF) for anti-aliasing, achieving a PSNR improvement of approximately 1.5–2 dB across multiple benchmarks while being faster and more memory-efficient.
- DeAltHDR: Learning HDR Video Reconstruction from Degraded Alternating Exposure Sequences
-
DeAltHDR is the first to directly address the neglected reality that "alternating exposure LDR frames inherently contain noise and motion blur." By employing a Flow-Guided Masked Attention (FGMA) module, it performs cross-frame alignment only in occlusion areas where optical flow is unreliable, while utilizing cheap optical flow warping elsewhere. This achieves a tunable trade-off between efficiency and quality. Coupled with a self-supervised adaptation method improved for large video motions, it surpasses existing SOTA on both synthetic and real-world datasets.
Browse all 61 Image Restoration papers →
🛰️ Remote Sensing (11)¶
- Earth-Agent: Unlocking the Full Landscape of Earth Observation with Agents
-
Earth-Agent is the first Earth Observation (EO) Agent framework based on the Model Context Protocol (MCP) tool ecosystem. It unifies RGB and spectral remote sensing data, achieving cross-modal, multi-step, and quantitative spatio-temporal reasoning by dynamically invoking 104 expert tools. The proposed Earth-Bench benchmark includes 248 expert tasks and 13,729 images. Experiments demonstrate that Earth-Agent significantly outperforms general-purpose Agents and remote sensing MLLMs.
- MARS - A Foundational Map Auto-Regressor
-
This work treats vector maps (points, polylines, polygons) as a "language," using a unified vision encoder and auto-regressive decoder for end-to-end generation of road networks and building outlines without any segmentation post-processing. It releases MAP-3M, the largest multi-class map dataset to date (approximately 3M images).
- Measuring the Intrinsic Dimension of Earth Representations
-
This paper presents the first systematic measurement of the Intrinsic Dimension (ID) of Geographic Implicit Neural Representations (Geographic INR). It finds that the true ID of 256-512D embeddings is only 2-10. A high ID in the frozen embedding space correlates positively with downstream performance, while a low ID in the supervised task-head activation space correlates with high performance, revealing a dual mechanism of "Representativeness vs. Task-Alignment."
- MoRA: Mobility as the Backbone for Geospatial Representation Learning at Scale
-
MoRA treats human mobility graphs as the "structural backbone" for multimodal fusion. Using CLIP-style asymmetric contrastive learning, it aligns POIs, satellite imagery, and demographics with a billion-edge mobility graph. It outperforms SOTA by an average of 12.9% across 9 socioeconomic downstream tasks using 128-dimensional representations and provides the first empirical evidence of scaling laws in geospatial representation learning.
- Object Fidelity Diffusion for Remote Sensing Image Generation
-
OF-Diff utilizes category labels to directly extract "shape mask priors" of remote sensing objects to constrain diffusion generation. An "online distillation" framework is employed to distill mixed features containing real image information into a shape-dependent decoder. This enables the model to generate high-fidelity, layout-consistent remote sensing images without requiring real image references during inference. Finally, DDPO reinforcement fine-tuning is used to further align with the real distribution, resulting in a 4–8% mAP improvement for categories such as airplanes, ships, and vehicles in downstream detection tasks.
- SatDreamer360: Multiview-Consistent Generation of Ground-Level Scenes from Satellite Imagery
-
Starting from a single satellite image and predefined ground camera trajectories, SatDreamer360 utilizes triplane scene representations, ray-guided pixel attention, and panoramic epipolar-constrained temporal attention to generate geometrically aligned and cross-frame consistent 360° ground panorama sequences within a diffusion model, outperforming Sat2Density, ControlS2S, and EscherNet on the newly constructed VIGOR++ benchmark.
- SelvaBox: A high-resolution dataset for tropical tree crown detection
-
SelvaBox constructs the largest open-access high-resolution UAV RGB tree crown detection dataset for tropical forests. Using a unified multi-resolution detection benchmark, it demonstrates that high-resolution inputs, DINO-Swin detectors, and cross-dataset training significantly improve in-distribution and zero-shot generalization for tropical tree crown detection.
- TAMMs: Change Understanding and Forecasting in Satellite Image Time Series with Temporal-Aware Multimodal Models
-
The authors propose TAMMs—the first unified framework to jointly execute Temporal Change Description (TCD) and Future Satellite Image Forecasting (FSIF) in a single MLLM-Diffusion architecture. It awakens the temporal reasoning capabilities of a frozen MLLM through a Temporal-Aware Module (TAM) and translates change understanding into generative control signals via a Semantic Fusion Control Injection (SFCI) mechanism.
- Task-free Adaptive Meta Black-box Optimization
-
This paper proposes ABOM, a task-free adaptive meta black-box optimizer that parameterizes evolutionary operators (selection, crossover, and mutation) as differentiable attention modules. By utilizing self-generated data to update parameters online during the optimization process, it achieves competitive zero-shot performance on synthetic benchmarks and UAV path planning.
- TerraFM: A Scalable Foundation Model for Unified Multisensor Earth Observation
-
TerraFM is designed for multisensor Earth observation data, treating Sentinel-1 SAR and Sentinel-2 optical imagery as natural augmented views of the same location. Through modality-specific patch embedding, per-position cross-attention fusion, and dual-centering DINO training for long-tail land cover, it achieves strong generalization on classification and segmentation tasks in GEO-Bench and Copernicus-Bench.
Browse all 11 Remote Sensing papers →
🔍 Anomaly Detection (10)¶
- Adaptive Conformal Anomaly Detection with Time Series Foundation Models for Signal Monitoring
-
The authors propose W1-ACAS: a post-hoc, tuning-free adaptive conformal anomaly detection framework. It maps prediction errors from pre-trained Time Series Foundation Models (TSFMs) into anomaly scores directly interpretable as false positive rates (p-values) and learns weights online by minimizing the Wasserstein distance to maintain stable false positive control under non-stationary data.
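To make the "anomaly score as p-value" idea concrete, here is a minimal sketch of a weighted conformal p-value. It assumes calibration scores are prediction errors on normal data and that per-point weights are supplied externally. The function name `conformal_p_value` is hypothetical, and the online Wasserstein-based weight learning from the paper is not reproduced.

```python
import numpy as np

def conformal_p_value(score, calib_scores, weights=None):
    """Weighted conformal p-value of `score` against calibration scores.

    With uniform weights this reduces to (1 + #{calib >= score}) / (n + 1).
    `weights` are non-negative importances for the calibration points.
    """
    calib_scores = np.asarray(calib_scores, dtype=float)
    if weights is None:
        weights = np.ones_like(calib_scores)
    weights = np.asarray(weights, dtype=float)
    # The test point itself gets weight 1, mirroring the "+1" in the unweighted formula.
    numerator = weights[calib_scores >= score].sum() + 1.0
    return float(numerator / (weights.sum() + 1.0))

calib = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]   # errors observed on normal data
p_typical = conformal_p_value(0.25, calib)   # error in the usual range
p_extreme = conformal_p_value(1.5, calib)    # error larger than anything seen
```

Flagging a point when its p-value falls below a level such as 0.1 bounds the false positive rate at that level when the data are exchangeable. Reweighting the calibration points is one way to keep that control meaningful when the data drift.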
- Foundation Visual Encoders Are Secretly Few-Shot Anomaly Detectors
-
The authors discovered that frozen foundation visual encoders "secretly" possess the ability to distinguish anomalies—the area of an anomalous region in an image is positively correlated with the distance of its features to the natural image manifold. By training a lightweight non-linear projection operator (FOUNDAD) atop the encoder to pull anomalous features back to the normal manifold and scoring based on the difference before and after projection, SOTA performance is achieved in few-shot, category-agnostic industrial anomaly detection.
- Healthcare Insurance Fraud Detection via Continual Fiedler Vector Graph Model
-
ConFVG utilizes the eigenvector associated with the second-smallest eigenvalue of the graph Laplacian (the Fiedler vector) to guide the masking strategy of a Graph Autoencoder (GAE) for structure-aware representation learning under label scarcity. It then employs subgraph attention fusion and a Mean Teacher framework to continuously adapt to evolving fraud patterns in unlabeled online streams, achieving real-time healthcare fraud detection.
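For readers unfamiliar with the Fiedler vector, the sketch below computes it for a small toy graph with NumPy. It uses the unnormalized Laplacian, which is an assumption, and the helper name `fiedler_vector` is hypothetical. How ConFVG turns this vector into a masking strategy is not shown.

```python
import numpy as np

def fiedler_vector(adj):
    """Eigenvector of the graph Laplacian for its second-smallest eigenvalue."""
    adj = np.asarray(adj, dtype=float)
    laplacian = np.diag(adj.sum(axis=1)) - adj      # L = D - A
    eigvals, eigvecs = np.linalg.eigh(laplacian)    # eigenvalues in ascending order
    return eigvecs[:, 1]

# Toy graph: two triangles (nodes 0-2 and 3-5) joined by the single edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
adj = np.zeros((6, 6))
for i, j in edges:
    adj[i, j] = adj[j, i] = 1.0

f = fiedler_vector(adj)
groups = f > 0   # the sign pattern splits the graph at its weakest link
```

The sign of each entry places a node on one side of the graph's sparsest cut, which is why the vector is a natural structural signal for deciding which parts of a graph to mask.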
- Let OOD Feature Exploring Vast Predefined Classifiers
-
This paper proposes VPC, which utilizes a set of fixed equiangular prototypes to map ID classes and OOD samples into two distinct predefined subspaces. By using the difference in \(L_2\) activation intensity between these two subspaces as an OOD score, it consistently reduces FPR95 in Outlier Exposure (OE) training scenarios on CIFAR and ImageNet-1k.
- LLM as an Algorithmist: Enhancing Anomaly Detectors via Programmatic Synthesis
-
This work repositions the LLM from a "data processor" to an "algorithmic strategist": it analyzes the algorithmic description of a detector without touching real data, reasons about its logical blind spots, and generates reusable Python synthesis code. This code creates "hard anomalies" specifically designed to deceive that detector, upgrading the original one-class problem into a more separable two-class problem. It consistently enhances five mainstream detectors across 36 tabular anomaly detection benchmarks.
- Low Rank Transformer for Multivariate Time Series Anomaly Detection and Localization
-
This paper theoretically maps the learning process of Transformer encoders on multivariate time series to the classical STAR statistical model. It proposes ALoRa-T, which applies low-rank regularization to self-attention, using the "rank" of the attention matrix as an anomaly signal for detection and tracing anomalies back to specific variables for localization using interpretable contribution weights.
- MRAD: Zero-Shot Anomaly Detection with Memory-Driven Retrieval
-
MRAD replaces the parametric fitting of \(p(y|x)\) in mainstream ZSAD with "similarity retrieval from a feature-label memory bank." The training-free version surpasses WinCLIP, and when combined with two linear fine-tuning layers and dynamic prompts injected with regional priors, it achieves SOTA across 16 industrial and medical datasets.
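The idea of retrieving labels by similarity from a memory bank can be sketched as follows. This is a generic illustration, not MRAD's implementation: the cosine similarity, the softmax weighting, the `temperature` value, and the name `retrieve_scores` are all assumptions.

```python
import numpy as np

def retrieve_scores(queries, mem_feats, mem_labels, temperature=0.1):
    """Score each query by a similarity-weighted average of memory labels."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    k = mem_feats / np.linalg.norm(mem_feats, axis=1, keepdims=True)
    sim = (q @ k.T) / temperature            # (M, N) scaled cosine similarities
    sim -= sim.max(axis=1, keepdims=True)    # subtract row max for stability
    weights = np.exp(sim)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ mem_labels              # (M,) soft anomaly scores

# Hypothetical memory: one normal (label 0) and one anomalous (label 1) feature.
mem_feats = np.array([[1.0, 0.0], [0.0, 1.0]])
mem_labels = np.array([0.0, 1.0])
queries = np.array([[0.9, 0.1], [0.1, 0.9]])
scores = retrieve_scores(queries, mem_feats, mem_labels)
```

No parameters are fitted here: the score for a query depends only on which stored features it resembles, which is what makes a retrieval-based variant training-free.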
- PIRN: Prototypical-based Intra-modal Reconstruction with Normality Communication for Multi-modal Anomaly Detection
-
PIRN targets few-shot multimodal industrial anomaly detection for RGB images and 3D surface normals. It reconstructs normal features of each modality using adaptive prototype codebooks and enhances texture and geometric cues through cross-modal normality communication, achieving superior detection and localization performance on MVTec 3D-AD, Eyecandies, and Real-IAD D3.
- ReTabAD: A Benchmark for Restoring Semantic Context in Tabular Anomaly Detection
-
ReTabAD is the first "context-aware" tabular anomaly detection benchmark. It restores discarded textual semantics (feature descriptions, domain knowledge, original categorical values) into 20 curated datasets, provides implementations for 20 classic/deep/LLM-based algorithms, and proposes a training-free zero-shot LLM framework. Experiments demonstrate that semantic context improves detection AUROC by an average of 7.6 percentage points, allowing zero-shot LLMs to approach previous SOTA methods.
- UniOD: A Universal Model for Outlier Detection across Diverse Domains
-
UniOD trains one universal outlier detection model using a batch of historical labeled datasets. It first unifies tabular datasets of any dimension or semantics into "multi-scale similarity graphs + SVD features," then transforms outlier detection into node binary classification using a GIN+GT dual-path graph network. Once trained, the model performs training-free and parameter-tuning-free inference for any unseen new dataset, achieving average AUROC/AUPRC scores that outperform 17 baselines across 30 benchmarks with lower latency.
🧑 Human Understanding (45)
- BAH Dataset for Ambivalence/Hesitancy Recognition in Videos for Digital Behaviour Analysis
-
This paper proposes BAH, the first multimodal dataset for Ambivalence/Hesitancy (A/H) recognition in videos. It contains 1,118 videos (8.26 hours) from 224 participants across 9 Canadian provinces, annotated by behavioral science experts, and provides baseline experimental results at both frame and video levels.
- BANZ-FS: BANZSL Fingerspelling Dataset
-
This paper constructs BANZ-FS, the first large-scale dataset for two-handed fingerspelling in BANZSL (British, Australian, and New Zealand Sign Language). It aggregates over 35K multi-level aligned fingerspelling instances from broadcast news, laboratory recordings, and web vlogs, and systematically benchmarks SOTA models across detection, isolated recognition, and contextual recognition tasks.
- CLUTCH: Contextualized Language model for Unlocking Text-Conditioned Hand motion modelling in the wild
-
CLUTCH utilizes a triad of "32,000 VLM-auto-labeled in-the-wild hand motion data (3D-HIW) + a SHIFT decomposed VQ-VAE that discretizes trajectory/pose and left/right hands separately + LLM fine-tuning with geometric reconstruction loss in the motion space." For the first time, text \(\leftrightarrow\) hand motion modeling is achieved in "in-the-wild" scenarios (e.g., playing piano, kneading dough, writing), achieving SOTA performance in both text-to-motion and motion-to-text tasks.
- Cross-Domain Policy Optimization via Bellman Consistency and Hybrid Critics
-
The Q Avatar framework is proposed to quantify source model transferability via cross-domain Bellman consistency. By utilizing an adaptive, hyperparameter-free weighting function to hybridize source and target domain Q-functions, reliable knowledge transfer is achieved in cross-domain RL with different state-action spaces, guaranteeing no negative transfer regardless of source model quality or domain similarity.
- Curvature-Guided Task Synergy for Skeleton based Temporal Action Segmentation
-
CurvSeg addresses the inherent conflict between "temporal invariance for classification" and "temporal sensitivity for boundary localization" in skeleton-based temporal action segmentation. It proposes using the geometric curvature of classification feature trajectories as a boundary prior—where curvature is high within action segments and low at transitions. This establishes a bidirectional closed-loop synergy between classification and localization, complemented by a dual-expert MoE to distill task-specific features, serving as a plug-and-play module that enhances the segmentation accuracy of baselines like DeST/LaSA across four datasets.
- DenseMarks: Learning Canonical Embeddings for Head Images via Point Trajectories
-
DenseMarks uses a ViT embedder to map every pixel of a head image to coordinates in a 3D canonical unit cube. Trained using automated pairs from off-the-shelf point trackers on in-the-wild talking head videos combined with contrastive loss, it achieves a cross-identity, cross-pose consistent, and interpretable dense correspondence representation, reaching SOTA in geometry-aware point matching and monocular head tracking.
- Disentangled Hierarchical VAE for 3D Human-Human Interaction Generation
-
DHVAE explicitly decomposes dual-person interaction motion into three disentangled latent variables: "Person A action," "Person B action," and "Global interaction context." It applies contrastive learning constraints on the global latent variable to ensure contact plausibility and employs DDIM for diffusion denoising within a hierarchical latent space, achieving new SOTA results on InterHuman and InterX with a smaller and faster model.
- EasyTune: Efficient Step-Aware Fine-Tuning for Diffusion-Based Motion Generation
-
EasyTune transforms the fine-tuning paradigm for diffusion models from "calculating reward gradients after the full denoising trajectory" to independently optimizing at each denoising step. This breaks the recursive gradient dependency between steps, reducing VRAM usage from \(O(T)\) to \(O(1)\) and enabling denser optimization. Combined with a Self-refined Preference Learning (SPL) module that converts retrieval models into motion reward models without human annotation, it outperforms DRaFT-50 by 7.7% in alignment (MM-Dist) on HumanML3D, while using only 31.16% of its additional VRAM and speeding up training by 7.3×.
- EdgeCAPE: Edge Weight Prediction for Category-Agnostic Pose Estimation
-
EdgeCAPE introduces a learnable weighted pose graph prediction mechanism for Category-Agnostic Pose Estimation (CAPE) for the first time. By predicting edge weights and new edges for the skeleton graph, and incorporating Markov Attention Bias to enhance spatial dependency modeling, it achieves SOTA on the MP-100 benchmark, with a 1.99% gain over GraphCape in 1-shot scenarios.
- EMBridge: Enhancing Gesture Generalization from EMG Signals Through Cross-modal Representation Learning
-
EMBridge proposes using hand poses as high-quality anchors. Through a triple mechanism of Q-Former, Masked Pose Reconstruction Loss (MPRL), and Community-Aware Soft Contrastive Learning (CASCLe), it aligns the representation space of noisy sEMG signals with a semantically structured pose space, achieving zero-shot EMG gesture classification on wearable devices for the first time.
Browse all 45 Human Understanding papers →
📹 Video Understanding (47)
- A Training-Free Framework for Long Video Understanding via Video-Query-Options Similarity
-
Addressing the issue that hour-long videos cannot fit into the context window of Multi-modal Large Language Models (MLLMs), this paper proposes a training-free input-side framework. It utilizes a video-text retrieval model to score the relevance of video segments, followed by Adaptive Frame Sampling (AFS) and Dynamic Resolution Allocation (DRA). The relevance estimation is refined by incorporating candidate answers generated by the MLLM itself into the retrieval query (VQOS). This framework achieves an average improvement of 3 to 5 points for LLaVA-Video and Qwen2.5-VL across five long video benchmarks.
- A.I.R.: Adaptive, Iterative, and Reasoning-based Frame Selection For Video Question Answering
-
This paper proposes A.I.R., a training-free, adaptive, iterative, and reasoning-driven frame selection framework. It addresses two problems in VideoQA, the inaccurate similarity scores of lightweight models (CLIP) and the explosive cost of VLM analysis, through a two-stage strategy (GMM adaptive initial sampling + iterative VLM fine-grained analysis). Even in the worst-case scenario, it only requires analyzing 72 frames (vs. the 128-frame baseline) while significantly improving performance across multiple long-video benchmarks.
- ARFlow: Auto-regressive Optical Flow Estimation for Arbitrary-Length Videos via Progressive Next-Frame Forecasting
-
ARFlow transforms multi-frame optical flow estimation from "one-time estimation within a fixed-length clip" to "step-by-step auto-regressive prediction of next-frame flow." By using historical flow to initialize current estimates and fusing short-term and long-term motion cues through multi-stride temporal forecasting, it improves accuracy on benchmarks like Sintel, KITTI, and Spring with nearly constant memory usage.
- AVoCaDO: An Audiovisual Video Captioner Driven by Temporal Orchestration
-
AVoCaDO, based on Qwen2.5-Omni, undergoes SFT using 107K high-quality temporally aligned audiovisual captioning data, followed by GRPO reward fine-tuning focused on key events, dialogue, and length. This enables the 7B audiovisual captioner to outperform existing open-source models on multiple benchmarks, with some metrics matching or exceeding the Gemini-2.5 series.
- Beyond Static Vision: Scene Dynamic Field Unlocks Intuitive Physics Understanding in Multi-modal Large Language Models
-
This paper first reveals that current MLLMs fail to understand intuitive physics dynamics for continua (such as fluids) using two "low-level" diagnostic tasks: Next Frame Selection (NFS) and Temporal Coherence Verification (TCV). It then proposes Scene Dynamic Field (SDF)—mapping particle velocities calculated by a physics simulator into blue gradient maps as visual prompts. Combined with multi-task fine-tuning, this improves Qwen2-VL / GLM-4.1V performance on fluid tasks by up to 20.7%, with successful transfer to unseen physical domains like cloth, sand, and smoke.
- Cambrian-S: Towards Spatial Supersensing in Video
-
This paper proposes "spatial supersensing," a paradigm shift from passive task-driven sensing to active world modeling. It first shows via the VSI-SUPER benchmark that brute-force context expansion (including Gemini-2.5 and the self-trained Cambrian-S) fails completely on spatial recall and counting tasks in arbitrarily long videos. It then introduces a self-supervised "Latent Frame Prediction" head that uses prediction error ("surprise") as a control signal to drive memory management and event segmentation, significantly outperforming strong commercial baselines on long-video spatial tasks.
- CaReBench: A Fine-grained Benchmark for Video Captioning and Retrieval
-
CaReBench utilizes 1,000 manually annotated videos—each with captions exceeding 200 words and explicitly split into spatial and temporal versions—to establish a benchmark capable of simultaneously evaluating fine-grained video captioning and retrieval. It introduces two new metrics, ReBias and CapST, to quantify the spatiotemporal bias of VLMs, and provides a two-stage SFT baseline, CARE, which unifies captioning and retrieval into a single MLLM.
- Divid: Disentangled Spatial-Temporal Modeling within LLMs for Temporally Grounded Video Understanding
-
Divid explicitly disentangles temporal and spatial branches within the Video LLM decoder. It utilizes temporal attention to select high-resolution keyframes for queries and fuses information via a token-level soft-router. Combined with the 559K timestamp-supervised dataset TempGCap, it improves both accuracy and computational efficiency in temporal grounding and evidenced VideoQA.
- EAST: Early Action Prediction Sampling Strategy with Token Masking
-
EAST introduces a training strategy that randomly samples the observation ratio \(\rho\), allowing a single model to perform early action prediction across all observation ratios. Combined with a "dual classification compound loss (present + future)" and "difference masking" that discards half of the tokens based on temporal redundancy, it outperforms previous state-of-the-art methods by 10.1, 7.7, and 3.9 percentage points on NTU60, SSv2, and UCF101, respectively, while halving training memory and time.
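The random observation-ratio sampling described above can be sketched as follows. The uniform distribution, the bounds `min_ratio` and `max_ratio`, and the function name are assumptions for illustration; the summary does not state how \(\rho\) is actually drawn.

```python
import numpy as np

def sample_partial_clip(frames, rng, min_ratio=0.1, max_ratio=1.0):
    """Keep only the first rho fraction of a clip, with rho drawn at random."""
    rho = rng.uniform(min_ratio, max_ratio)        # assumed: uniform sampling
    n_obs = max(1, int(round(rho * len(frames))))  # always keep at least one frame
    return frames[:n_obs], rho

rng = np.random.default_rng(0)
frames = np.arange(32)  # stand-in for 32 video frames
clip, rho = sample_partial_clip(frames, rng)
```

Drawing a fresh \(\rho\) per training sample exposes one model to every observation ratio, instead of training a separate model per ratio.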
- EgoBrain: Synergizing Minds and Eyes For Human Action Understanding
-
EgoBrain constructs the first large-scale dataset synchronizing egocentric video with 32-channel EEG for daily actions and proposes Brain-TIM, which utilizes a time-aware Transformer to fuse visual and brain signals, improving the visual baseline from 63.40% to 66.70% in cross-subject and cross-scene 29-category action recognition.
Browse all 47 Video Understanding papers →
🚗 Autonomous Driving (50)
- Adaptive Augmentation-Aware Latent Learning for Robust LiDAR Semantic Segmentation
-
The A3Point (Adaptive Augmentation-Aware Latent Learning) framework is proposed to decouple intrinsic model semantic confusion from semantic shift introduced by data augmentation through two core components: implicit learning of Semantic Confusion Prior (SCP) and localization of Semantic Shift Regions (SSR). It adaptively optimizes across varying interference levels and achieves SOTA results on multiple LiDAR segmentation benchmarks under adverse weather.
- SMART-R1: Advancing Multi-agent Traffic Simulation via R1-Style Reinforcement Fine-Tuning
-
SMART-R1 introduces R1-style Reinforcement Fine-Tuning (RFT) to multi-agent traffic simulation for the first time, proposing the Metric-oriented Policy Optimization (MPO) algorithm and an "SFT-RFT-SFT" iterative training strategy. It achieved first place on the WOSAC 2025 leaderboard with a Realism Meta score of 0.7858.
- ARINBEV: Bird's-Eye View Layout Estimation with Conditional Autoregressive Model
-
ARINBEV treats the BEV semantic map in autonomous driving as a discretized sequence of structured tokens, replaces VQ-VAE tokenization with class encoding, and utilizes entropy-guided masked autoregressive decoding to achieve higher mIoU, fewer parameters, and faster training on nuScenes and Argoverse2.
- Astra: General Interactive World Model with Autoregressive Denoising
-
This paper proposes Astra, a general interactive world model that enables action-conditioned long-range video prediction on pre-trained video diffusion models through an autoregressive denoising framework. It introduces ACT-Adapter (action injection), noise-enhanced historical memory (alleviating visual inertia), and Mixture of Action Experts (unifying heterogeneous action modalities), achieving SOTA fidelity and action-following capabilities across autonomous driving, robotic manipulation, and scene exploration.
- AsyncBEV: Cross-modal Flow Alignment in Asynchronous 3D Object Detection
-
Addressing the real-world issue of imperfect sensor synchronization, AsyncBEV proposes a lightweight, plug-and-play module. By defining a new task, \(\Delta\)-BEVFlow, it predicts dense 2D flow fields directly from asynchronous multimodal BEV features to warp and align delayed features to the reference timestamp. Under extreme 0.5s asynchrony, it improves the dynamic object NDS of CMT by 16.6% compared to the EMC baseline.
- AutoDrive-R²: Incentivizing Reasoning and Self-Reflection Capacity for VLA Model in Autonomous Driving
-
AutoDrive-R² cold-starts an autonomous driving VLA with four-step CoT and self-reflection data, then post-trains it using GRPO with spatial, kinetic, and temporal smoothness constraints. This enables the model to explain its driving decisions while outputting trajectories that adhere to vehicle physical constraints.
- \(AutoDrive\text{-}P^3\): Unified Chain of Perception-Prediction-Planning Thought via Reinforcement Fine-Tuning
-
AutoDrive-P\(^3\) organizes perception, prediction, and planning of autonomous driving VLMs into a unified \(P^3\) chain-of-thought reasoning, utilizing GRPO rewards spanning all three stages for reinforcement fine-tuning. It simultaneously improves trajectory accuracy, collision rates, and closed-loop planning scores on nuScenes and NAVSIM.
- Beyond Visual Reconstruction Quality: Object Perception-aware 3D Gaussian Splatting for Autonomous Driving
-
This paper points out that the assumption "higher reconstruction fidelity leads to better reproduction of autonomous driving system (ADS) behavior" is a strong, unverified hypothesis. It proposes replacing pure visual similarity with perception stability (consistency of perception model outputs between reconstructed and ground truth images) as the optimization objective. Two plug-and-play losses—Perception Alignment Loss and Object Region Quality Loss—are introduced to significantly improve perception consistency in reconstructed scenes without sacrificing visual quality.
- Bird's-eye-view Informed Reasoning Driver (BIRDriver)
-
BIRDriver compresses the entire driving scene into a single-frame Bird's-Eye-View (BEV) top-down image fed into a VLM. The VLM outputs no more than three relative coordinate key points to express driving intentions, which are then refined into a trajectory by a motion planner. This low-cost approach grafts the VLM's commonsense reasoning onto long-tail driving scenarios.
- BridgeDrive: Diffusion Bridge Policy for Closed-Loop Trajectory Planning in Autonomous Driving
-
BridgeDrive proposes replacing truncated diffusion with a diffusion bridge to achieve anchor-guided trajectory planning in autonomous driving. This ensures theoretical symmetry between forward and backward processes, achieving success rates of 74.99% (PDM-Lite) and 89.25% (LEAD) in Bench2Drive closed-loop evaluations, surpassing previous SOTA by 7.72% and 2.45%, respectively.
Browse all 50 Autonomous Driving papers →
🤖 Robotics & Embodied AI (162)
- A Primer on SO(3) Action Representations in Deep Reinforcement Learning
-
This paper systematically evaluates various parameterizations of SO(3) rotation actions in Deep Reinforcement Learning (Euler angles / Quaternions / Rotation Matrices / Lie algebra tangent vectors). Through large-scale experiments on PPO, SAC, and TD3 under dense and sparse rewards, it demonstrates that "delta tangent vector actions in the local coordinate frame" are the most robust across nearly all algorithms and tasks, providing a practical guide for selecting rotation actions.
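To make "delta tangent vector actions in the local coordinate frame" concrete, here is a minimal sketch using Rodrigues' formula. The helper names and the `max_angle` clipping are illustrative assumptions rather than the paper's code; right-multiplication is what makes the update local to the current frame.

```python
import numpy as np

def hat(w):
    """Map a 3-vector to its skew-symmetric matrix in so(3)."""
    x, y, z = w
    return np.array([[0.0, -z, y], [z, 0.0, -x], [-y, x, 0.0]])

def exp_so3(w):
    """Rodrigues' formula: exponential map from a tangent vector to SO(3)."""
    theta = np.linalg.norm(w)
    K = hat(w)
    if theta < 1e-8:
        return np.eye(3) + K  # first-order approximation near zero
    return (np.eye(3)
            + (np.sin(theta) / theta) * K
            + ((1.0 - np.cos(theta)) / theta**2) * (K @ K))

def apply_delta_action(R, delta, max_angle=0.5):
    """Apply a tangent-vector action in the local frame: R_next = R @ exp(delta)."""
    delta = np.asarray(delta, dtype=float)
    angle = np.linalg.norm(delta)
    if angle > max_angle:  # assumed: clip the rotation magnitude per step
        delta = delta * (max_angle / angle)
    return R @ exp_so3(delta)

# A quarter turn about the local z-axis, starting from the identity.
R_next = apply_delta_action(np.eye(3), [0.0, 0.0, np.pi / 2], max_angle=np.pi)
```

Unlike Euler angles or raw quaternions, this three-dimensional action has no singularities or normalization constraint near zero, and the result is a valid rotation matrix up to floating-point error.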
- Abstracting Robot Manipulation Skills via Mixture-of-Experts Diffusion Policies
-
SMP (Skill Mixture-of-Experts Policy) decomposes action generation of diffusion policies into a set of state-adaptive orthogonal skill bases. By using slowly-varying "sticky" gating to activate only a few experts relevant to the current stage, it achieves reusable and transferable multi-task bimanual manipulation at a medium model scale. It reduces inference active parameters to approximately 30% of its own total (about 7% of RDT) while achieving higher success rates than large diffusion baselines.
- Accelerated co-design of robots through morphological pretraining
-
This paper introduces "morphological pretraining": a morphology-agnostic universal controller is pretrained once across tens of millions of robot bodies using differentiable simulation. This frozen (or slightly fine-tuned) controller then enables zero-shot evaluation of arbitrary morphological changes, accelerating robot "body+brain" co-design by an order of magnitude and demonstrating, for the first time, that evolutionary "crossover" can produce offspring superior to their parents.
- Action-aware Dynamic Pruning for Efficient Vision-Language-Action Manipulation
-
Addressing the issue of excessive visual tokens in Vision-Language-Action (VLA) models that consume attention computation during inference, this paper proposes ADP (Action-aware Dynamic Pruning). It utilizes text correlation for anticipatory pruning of task-related visual tokens and uses recent motion magnitude of the robot's end-effector as a gating signal. This enables aggressive pruning during coarse action stages (high displacement) to save computation and restores full visual input during fine manipulation stages (low displacement) to maintain precision. It accelerates OpenVLA-OFT by 1.35× on LIBERO with negligible success rate loss and delivers a 1.49× latency speedup on a real-world robot.
- Action Chunking and Exploratory Data Collection Yield Exponential Improvements in Behavior Cloning for Continuous Control
-
This paper provides the first theoretical guarantee for two empirical techniques in imitation learning, action chunking and expert noise-injection data augmentation, using "incremental stability" from control theory. It proves that, under various conditions, both techniques reduce the compounding error of behavior cloning (BC) in continuous control, which otherwise accumulates exponentially over time, to a "horizon-free" level.
- Actions as Language: Fine-Tuning VLMs into VLAs Without Catastrophic Forgetting
-
By verbalizing low-level robot end-effector actions into natural language text and feeding them into a VLM, the fine-tuning data is aligned with the pre-training distribution. This allows converting Gemma-3-12B into a robotic policy (VLA) using only LoRA. In 800+ real-robot experiments, the model retains 85%+ of its VQA capability and achieves zero-shot generalization for multilingual instructions and open-world semantics.
- Align-Then-stEer: Adapting the Vision-Language Action Models through Unified Latent Guidance
-
ATE first aligns pre-trained robot actions and target robot actions into a single structured latent space. It then utilizes gradients generated from latent space distances to guide the fine-tuning of diffusion-based or flow-matching VLAs, enabling faster adaptation to new embodiments and tasks with limited demonstration data.
- All-day Multi-scenes Lifelong Vision-and-Language Navigation with Tucker Adaptation
-
The authors propose Tucker Adaptation (TuKA), which represents multi-level navigation knowledge across various scenes and environments as high-order tensors. Using Tucker decomposition, the method decouples navigation knowledge into a shared subspace (core tensor + encoders/decoders) and scene/environment-specific expert vectors. Combined with a Decoupled Knowledge Incremental Learning strategy, TuKA achieves all-day multi-scene lifelong VLN, outperforming LoRA variants in SR and forgetting rates across 24 navigation scenarios.
- AnyTouch 2: General Optical Tactile Representation Learning For Dynamic Tactile Perception
-
AnyTouch 2 proposes a tactile dynamic pyramid framework and constructs the ToucHD hierarchical dataset containing 2.426 million contact samples (covering atomic actions, real-world manipulation, and touch-force pairs). It designs a unified representation learning framework for triple-layer dynamic perception—pixel-level, semantic-level, and physical-level—outperforming existing methods across static property recognition, dynamic physical prediction, and real-world manipulation tasks.
- APPLE: Toward General Active Perception via Reinforcement Learning
-
This paper proposes APPLE, a general active perception framework that combines reinforcement learning with supervised learning by modeling active perception as a POMDP. The reward function is designed as the RL reward minus prediction loss, allowing the gradient to naturally decompose into policy gradient and prediction loss components. Based on off-policy algorithms (SAC/CrossQ) and a shared ViViT backbone, its generality is validated across 5 different task benchmarks, where the CrossQ variant eliminates the need for per-task hyperparameter tuning and increases training efficiency by 53%.
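The decomposition mentioned above can be written out as a short sketch. All symbols here are assumptions introduced for illustration: \(r_t^{\mathrm{RL}}\) is the task reward, \(\ell_\theta\) the prediction loss, \(h_t\) the observation history, and the gradient is given in the on-policy score-function form for simplicity, whereas the summary states that the method builds on off-policy algorithms.

\[
r_t = r_t^{\mathrm{RL}} - \ell_\theta(\hat{y}_t, y), \qquad
J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\Big[\sum_t \gamma^t r_t\Big]
\]

\[
\nabla_\theta J(\theta) =
\underbrace{\mathbb{E}_{\tau \sim \pi_\theta}\Big[\Big(\sum_t \nabla_\theta \log \pi_\theta(a_t \mid h_t)\Big)\Big(\sum_t \gamma^t r_t\Big)\Big]}_{\text{policy gradient}}
\;-\;
\underbrace{\mathbb{E}_{\tau \sim \pi_\theta}\Big[\sum_t \gamma^t \nabla_\theta \ell_\theta(\hat{y}_t, y)\Big]}_{\text{prediction loss gradient}}
\]

The first term treats the combined reward as fixed and differentiates the action distribution; the second differentiates the prediction loss directly, which is ordinary supervised learning on the visited trajectories.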
Browse all 162 Robotics & Embodied AI papers →
🎮 Reinforcement Learning (400)
- 3D-aware Disentangled Representation for Compositional Reinforcement Learning
-
This work extends the structured decomposition of "object attributes \(\rightarrow\) discrete blocks" from 2D to 3D multi-view space. By utilizing a policy network with block-level cross-attention for goal-conditioned reinforcement learning, it enables a robot to stably push objects to target positions even under unseen attribute combinations and novel viewpoints.
- A\(^2\)Search: Ambiguity-Aware Question Answering with Reinforcement Learning
-
A\(^2\)Search proposes an annotation-free automatic pipeline to mine multiple valid answers for "ambiguous questions" from existing QA data. By employing a multi-answer friendly AnsF1 reward for GRPO reinforcement learning, a 7B model outperforms strong 32B baselines in multi-hop QA with only a single rollout.
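A multi-answer friendly reward can be illustrated with a plain set-level F1 between predicted and reference answers. This is only a sketch of the general idea: the exact definition of AnsF1 may differ, and `normalize` and `set_f1` are hypothetical helpers.

```python
def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace (assumed normalization)."""
    return " ".join(answer.lower().split())

def set_f1(predicted, gold):
    """F1 between the set of predicted answers and the set of valid answers."""
    pred = {normalize(a) for a in predicted}
    ref = {normalize(a) for a in gold}
    true_positives = len(pred & ref)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(ref)
    return 2 * precision * recall / (precision + recall)

# One of two predictions matches one of two valid answers.
reward = set_f1(["Paris", "Lyon"], ["paris", "Marseille"])
```

Compared with exact match against a single reference, a set-level score gives partial credit when a question has several valid answers and the model finds only some of them.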
- A Hierarchical Circuit Symbolic Discovery Framework for Efficient Logic Optimization
-
HIS utilizes a "hierarchical symbolic tree" to distill the layer-wise message passing of GNNs into a lightweight, interpretable symbolic scoring function. It "generates" this tree end-to-end using a structure-aware Transformer and group-advantage PPO to accurately and rapidly identify invalid transformations in logic optimization (LO) for chip design. Compared to state-of-the-art (SOTA) GNN inference, it is approximately 296× faster. When integrated into the Mfs2 heuristic, it reduces average runtime by 27.22% while further reducing circuit size by 6.95%.
- A Reward-Free Viewpoint on Multi-Objective Reinforcement Learning
-
This paper introduces the Forward-Backward (FB) framework from Reward-Free Reinforcement Learning (RFRL) into Multi-Objective Reinforcement Learning (MORL) for the first time. It proposes MORL-FB, which utilizes preference-guided exploration to construct latent vectors \(z\) relevant to MORL tasks and incorporates an auxiliary Q-loss. This approach enables a preference-conditioned policy to significantly outperform SOTA methods like PD-MORL and Q-Pensieve on MO-Gymnasium with higher sample efficiency.
- A Unifying View of Coverage in Linear Off-Policy Evaluation
-
This paper proposes a new coverage parameter, feature-dynamics coverage, and provides a novel finite-sample analysis of the classic LSTDQ algorithm through an instrumental variable perspective, unifying various fragmented coverage definitions in linear off-policy evaluation.
- AbstRaL: Augmenting LLMs' Reasoning by Reinforcing Abstract Thinking
-
The authors propose AbstRaL, which utilizes Reinforcement Learning (RL) to teach LLMs mathematical abstraction—replacing specific numbers/names with symbolic variables and extracting general formulas. These abstractions are then processed by a symbolic solver to derive answers. AbstRaL almost entirely eliminates performance degradation caused by distribution shifts on GSM perturbation benchmarks and shows implicit improvements in OOD mathematical and general reasoning tasks.
- Accelerated Learning with Linear Temporal Logic using Differentiable Simulation
-
This paper bridges the gap between Linear Temporal Logic (LTL) specifications and differentiable physics simulators for the first time. By applying "soft-label" relaxation to the discrete transitions of the automaton, the authors derive rewards and state representations that are differentiable with respect to states and actions. This allows first-order gradient algorithms (SHAC/AHAC) to learn efficiently directly from formal specifications, doubling the training speed and returns compared to discrete baselines on contact-rich continuous control tasks.
- Accelerating Diffusion Planners in Offline RL via Reward-Aware Consistency Trajectory Distillation
-
RACTD integrates reward optimization objectives directly into the consistency trajectory distillation process. Using a pretrained diffusion teacher planner and an independently trained noise-free reward model, it distills a single-step sampling student planner. It outperforms the previous SOTA by 9.7% on average in D4RL while being up to 142 times faster in inference than the diffusion teacher.
- Ada-Diffuser: Latent-Aware Adaptive Diffusion for Decision-Making
-
Ada-Diffuser explicitly incorporates "time-evolving hidden contexts (wind, goals, skills)" into diffusion-based decision models. It theoretically demonstrates that latent variables can be identified using a small temporal block of only 4 adjacent observations. By employing a "denoising-refinement" mechanism and zig-zag sampling, the model performs online latent inference and planning/control, consistently outperforming existing diffusion planners and latent context baselines across 23 settings in 8 environments.
- Adaptive Scaling of Policy Constraints for Offline Reinforcement Learning
-
Policy constraint strength in offline RL (the balance between the RL objective and Behavior Cloning) usually has to be tuned manually for each dataset. This paper proposes ASPC, which treats the scaling factor \(\alpha\) in TD3+BC as a learnable parameter and adjusts it dynamically during training through second-order differentiable bilevel optimization, stabilized by constraining the rates of change of the Q-values and the BC loss. With a single set of hyperparameters across 39 D4RL datasets, the method outperforms state-of-the-art (SOTA) methods that require per-dataset grid searches and achieves a 35% average improvement over the baseline.
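For context, the role of \(\alpha\) in the standard TD3+BC actor objective can be sketched as below, with NumPy arrays standing in for network outputs. The sketch shows only the fixed-\(\alpha\) loss from TD3+BC; the bilevel update that makes \(\alpha\) learnable is not reproduced, and the function name is hypothetical.

```python
import numpy as np

def td3_bc_actor_loss(q_values, policy_actions, dataset_actions, alpha=2.5):
    """TD3+BC actor loss: -lambda * mean(Q) + MSE(pi(s), a).

    lambda = alpha / mean(|Q|) normalizes the Q term so that alpha alone
    controls the balance between RL and behavior cloning.
    """
    lam = alpha / (np.abs(q_values).mean() + 1e-8)  # treated as a constant
    rl_term = -lam * q_values.mean()
    bc_term = np.mean((policy_actions - dataset_actions) ** 2)
    return rl_term + bc_term, rl_term, bc_term

# Toy batch: Q(s, pi(s)) = 10 everywhere, policy outputs 0, dataset actions 1.
q = np.full(4, 10.0)
pi_a = np.zeros((4, 2))
data_a = np.ones((4, 2))
loss, rl_term, bc_term = td3_bc_actor_loss(q, pi_a, data_a)
```

A larger `alpha` weights the Q term more heavily relative to the cloning term, which is why a value that suits one dataset can be too aggressive or too conservative for another.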
Browse all 400 Reinforcement Learning papers →
🎁 Recommender Systems (24)
- Adaptive Regularization for Large-Scale Sparse Feature Embedding Models
-
This paper theoretically explains the root cause of "one-epoch overfitting" in CTR/CVR models—where performance collapses after the first epoch—using Rademacher complexity. It identifies that the unconstrained growth of embedding norms expands the generalization bound. Consequently, the authors propose AdamAR, an adaptive regularization method that allocates norm budgets based on feature frequency: applying light regularization for high-frequency features and heavy regularization for low-frequency ones. This approach eliminates multi-epoch overfitting while improving single-epoch performance and has been deployed in Alibaba's search advertising system.
- Beyond Markovian Drifts: Action-Biased Geometric Walks with Memory for Personalized Summarization
-
This paper proposes the "Structured Walk Hypothesis" (SWH) to challenge the prevailing "Markovian Drift Hypothesis" (MDH) in personalized summarization. It introduces Walk2Pers, a lightweight encoder-decoder model that characterizes user preference evolution as an action-biased geometric walk with dual memory channels, decomposable into magnitude and orientation (continuity vs. novelty). It significantly outperforms specialized summarizers and Large Language Models (LLMs) across three benchmarks.
- Catalog-Native LLM: Speaking Item-ID dialect with Less Entanglement for Recommendation
-
Addressing the issue where naively injecting item IDs into an LLM causes collaborative signals and linguistic semantics to conflict, this paper proposes IDIOMoE: splitting the FFN of each pre-trained LLM block into a text expert and an item expert. Using static token-type gating to route tokens based on their type (item-id tokens go to the item expert, others to the text expert), the model decouples "collaborative filtering" and "semantic understanding" into different subnetworks. This achieves state-of-the-art recommendation performance on both public and industrial-scale datasets while maintaining the original LLM's linguistic capabilities.
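Static token-type gating can be sketched as a hard, non-learned routing mask over two feed-forward experts. The ReLU FFN, the dimensions, and the names below are assumptions for illustration; real LLM blocks typically use gated FFN variants, and this is not the IDIOMoE code.

```python
import numpy as np

def ffn(x, w_in, w_out):
    """Plain two-layer ReLU feed-forward network (stand-in for an LLM FFN)."""
    return np.maximum(x @ w_in, 0.0) @ w_out

def routed_ffn(hidden, is_item_token, text_expert, item_expert):
    """Send item-ID tokens to the item expert and all others to the text expert."""
    out = np.empty_like(hidden)
    out[~is_item_token] = ffn(hidden[~is_item_token], *text_expert)
    out[is_item_token] = ffn(hidden[is_item_token], *item_expert)
    return out

rng = np.random.default_rng(0)
d_model, d_hidden = 4, 8
text_expert = (rng.normal(size=(d_model, d_hidden)), rng.normal(size=(d_hidden, d_model)))
item_expert = (rng.normal(size=(d_model, d_hidden)), rng.normal(size=(d_hidden, d_model)))

hidden = rng.normal(size=(5, d_model))                       # five token states
is_item_token = np.array([False, True, False, True, True])   # known from the tokenizer
out = routed_ffn(hidden, is_item_token, text_expert, item_expert)
```

Because the mask comes from the token type rather than a learned router, text tokens never pass through the item expert's weights, which is one way to keep the original language pathway untouched.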
- CollectiveKV: Decoupling and Sharing Collaborative Information in Sequential Recommendation
-
Observing that KV caches of different users in sequential recommendation exhibit significant cross-user similarity (collaborative signals), CollectiveKV is proposed to decompose KV into low-dimensional user-specific parts and high-dimensional shared parts retrieved from a global KV pool, achieving a 0.8% compression rate without performance degradation.
- Continual Low-Rank Adapters for LLM-based Generative Recommender Systems
-
PESO transforms continual learning for LLM-based generative recommendations from "stacking multiple frozen adapters" into "a single evolving LoRA + a proximal regularization term." By gently anchoring each update to the previous stage's state, the model automatically balances retaining long-term preferences and absorbing new ones, consistently outperforming cumulative LoRA and simple evolving LoRA across three real-world datasets.
- Discrete Diffusion for Bundle Construction
-
DDBC reformulates "Bundle Construction" (selecting a group of items from a large library to form a complete bundle or completing a partial one) as a masked discrete diffusion process. It employs Residual Vector Quantization (RVQ) to compress each item into discrete codes within a shared codebook to mitigate the dimensionality explosion of massive item libraries. A bidirectional Transformer then restores [MASK] tokens into a complete bundle in an order-independent manner, achieving a relative improvement of over 100% on long-bundle datasets compared to the strongest baselines.
- From Evaluation to Defense: Advancing Safety in Video Large Language Models
-
This paper constructs VideoSafetyEval (11.4k video-query pairs covering 19 risk categories) and uses it to reveal that the video modality causes a 34.2% decline in safety performance. It then proposes VideoSafety-R1, a three-stage framework (Alarm Token + SFT + Safety-guided GRPO) that increases defense success rate by 71.1% on VSE-HH.
- GoalRank: Group-Relative Optimization for a Large Ranking Model
-
It is theoretically proven that any Multi-Generator-Evaluator ranking system can be approximated with smaller error by a larger generator-only model that satisfies the scaling law. Accordingly, GoalRank is proposed—training a large generator-only ranking model by constructing group-relative reference policies with a reward model. It significantly outperforms SOTA in online A/B tests.
- iFusion: Integrating Dynamic Interest Streams via Diffusion Model for Click-Through Rate Prediction
-
iFusion reformulates "long-short term user interest fusion" as a conditional generation problem—utilizing short-term interests as guidance to perform diffusion denoising on long-term interest representations. This approach bypasses the assumptions of traditional linear fusion (concatenation/attention/gating), achieving CTR improvements across public datasets, industrial datasets, and online A/B tests.
- In Agents We Trust, but Who Do Agents Trust? Latent Source Preferences Steer LLM Generations
-
Through large-scale controlled experiments on 12 LLMs from 6 providers across three domains (news, academia, and e-commerce), this study reveals that LLMs possess systematic latent source preferences. When content semantics are identical, simply changing the source labels can significantly alter the model's information selection behavior, and these preferences cannot be eliminated through prompt engineering.
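The static token-type gating described in the IDIOMoE entry above (Catalog-Native LLM) can be illustrated with a minimal sketch. It assumes toy elementwise "experts" in place of real FFN blocks; the names `make_ffn` and `static_token_type_moe` are hypothetical and not taken from the paper. The only point shown is that routing depends on the token type flag alone, with no learned router.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Expert = Callable[[Vector], Vector]


def make_ffn(scale: float, bias: float) -> Expert:
    """Toy stand-in for an FFN expert: elementwise affine map followed by ReLU."""
    def ffn(hidden: Vector) -> Vector:
        return [max(0.0, scale * h + bias) for h in hidden]
    return ffn


def static_token_type_moe(
    hidden_states: Sequence[Vector],
    is_item_token: Sequence[bool],
    text_expert: Expert,
    item_expert: Expert,
) -> List[Vector]:
    """Route each position by token type alone; no learned router is involved."""
    if len(hidden_states) != len(is_item_token):
        raise ValueError("need one token-type flag per position")
    return [
        item_expert(h) if is_item else text_expert(h)
        for h, is_item in zip(hidden_states, is_item_token)
    ]


# Hypothetical experts and inputs, chosen only for illustration.
text_expert = make_ffn(scale=1.0, bias=0.0)
item_expert = make_ffn(scale=2.0, bias=1.0)
hidden = [[1.0, -1.0], [0.5, 0.5], [-2.0, 3.0]]
flags = [False, True, False]  # only the middle position is an item-ID token
routed = static_token_type_moe(hidden, flags, text_expert, item_expert)
print(routed)
```

Because the gate is a fixed function of token type, item-ID positions never touch the text expert's parameters, which is one plausible reading of how the two kinds of signal are kept in separate subnetworks.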
Browse all 24 Recommender Systems papers →
🔄 Self-Supervised Learning (81)¶
- A Bayesian Nonparametric Framework for Learning Disentangled Representations
-
This paper replaces the common isotropic Gaussian prior in VAEs with a Bayesian nonparametric hierarchical mixture prior. While preserving provable identifiability, it allows the number of mixture components for each generative factor to grow adaptively with the data, learning modular and compact disentangled representations without any additional regularization terms.
- Adaptive Gaussian Expansion for On-the-fly Category Discovery
-
This paper demonstrates that the "On-the-fly Category Discovery" (OCD) task possesses a performance lower bound overlooked by existing hashing methods. It subsequently decomposes OCD into two sub-tasks: "Open-Set Recognition + Real-time New Category Discovery." By employing soft thresholds to categorize known classes directly and utilizing Adaptive Gaussian Expansion (AGE)—based on multivariate Gaussian density—for online incremental clustering of new classes, the authors improve overall accuracy by approximately 10% across multiple datasets.
- Adaptive Test-Time Training for Predicting Need for Invasive Mechanical Ventilation in Multi-Center Cohorts
-
This paper proposes AdaTTT, a framework for predicting invasive mechanical ventilation (IMV) needs 24 hours in advance. It achieves robust test-time adaptation on multi-center ICU EHR data through dynamic feature-aware self-supervised learning (an adaptive masking strategy) and prototype-guided partial optimal transport alignment.
- Adversarial Encoding Perturbation and Synthesis for Set Representation Auxiliary Learning
-
SRAL treats each set as an empirical distribution and uses 2-Sliced-Wasserstein distance to encode "distribution-aware" representations. It injects adversarial perturbations at the feature/encoding layer rather than the input layer and employs min-max optimization to force the model to resist worst-case perturbations. This serves as a plug-and-play self-supervised auxiliary objective for various downstream tasks. Theoretically, this objective is equivalent to optimizing the Sliced-Wasserstein distance between sets in expectation. It consistently outperforms existing set encoders across four tasks: set similarity ranking, bundle recommendation, point cloud classification, and topic set expansion.
- Architecture-Agnostic Test-Time Adaptation via Backprop-Free Embedding Alignment
-
PEA decomposes "domain shift" into three geometric distortions in the embedding space: translation (mean shift), scaling (variance shift), and rotation (covariance shift). It utilizes a backprop-free and architecture-agnostic layer-wise covariance alignment process. By performing only two forward passes per batch, it pulls shifted intermediate features back to the source domain distribution. It achieves SOTA accuracy on ImageNet-C / CIFAR-C with a memory footprint of only ~900MB, enabling direct deployment on Jetson Orin Nano edge devices.
- AutoDV: An End-to-End Deep Learning Model for High-Dimensional Data Visualization
-
AutoDV transforms traditional visualization (t-SNE / UMAP), which requires "per-dataset parameter tuning + iterative optimization," into a one-time trained, plug-and-play end-to-end model. It first converts datasets of arbitrary dimensions into multi-scale similarity graphs, then utilizes a multi-graph GNN + Graph Transformer to directly output 2D/3D embeddings, trained with an affine invariant loss. It achieves accuracy of 89.37% relative to t-SNE and 91.05% relative to UMAP on unseen CIFAR-10 data, and even outperforms t-SNE/UMAP themselves on genomics and UCI tabular data.
- Bayesian Test-Time Adaptation via Dirichlet feature projection and GMM-Driven Inference for Motor Imagery EEG Decoding
-
BTTA-DG compresses the moment-to-moment prediction sequence of each EEG trial into a Dirichlet parameter vector. It utilizes a GMM fitted on historical trials as the likelihood and the deep model output as the prior to perform a gradient-free Bayesian posterior calibration. It achieves SOTA and real-time performance (15.7 ms/trial) in cross-subject/cross-session transfer for motor imagery BCI.
- Better Together: Leveraging Unpaired Multimodal Data for Stronger Unimodal Models
-
This paper proposes Unpaired Multimodal Learner (UML): it requires no sample-level pairing (e.g., image-text, audio-image). As long as the auxiliary modality shares semantic structure with the target modality, training signals from unpaired text, images, or audio are channeled into a unified representation via cross-modal weight sharing. This enhances the classification performance and robustness of models that ultimately use only the single target modality.
- Beyond Hearing: Learning Task-Agnostic ExG Representations from Earphones via Physiology-Informed Tokenization
-
Fifty hours of free-living ExG data were collected using lightweight earphone-style hardware. A "Physiology-Informed Multi-band Tokenization (PiMT)" is proposed to decompose signals into 12 sub-band tokens with explicit physical meanings. Combined with reconstructive self-supervised pre-training, a set of task-agnostic ExG representations applicable across five sensory tasks (visual, auditory, gustatory, tactile, olfactory) was learned.
- Bidirectional Predictive Coding
-
This paper proposes bidirectional Predictive Coding (bPC), which employs a single energy function to accommodate both "top-down generative" and "bottom-up discriminative" inference. This allows the same biologically plausible local circuit to perform accurate classification like discPC and generation/reconstruction like genPC, outperforming existing unidirectional or hybrid PC models in brain-inspired tasks such as cross-modal association and occlusion completion.
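Two of the three distortions named in the PEA entry above (Architecture-Agnostic Test-Time Adaptation), translation and scaling, can be undone with per-feature statistics and no backpropagation. The sketch below is a simplified illustration under that assumption: it matches only per-dimension mean and standard deviation, omits the rotation (covariance) component entirely, and is not the paper's layer-wise procedure. The function name and the source statistics are hypothetical.

```python
from statistics import fmean, pstdev
from typing import List, Sequence


def align_mean_and_scale(
    target: Sequence[Sequence[float]],
    source_mean: Sequence[float],
    source_std: Sequence[float],
    eps: float = 1e-8,
) -> List[List[float]]:
    """Standardize each feature with batch statistics, then map to source statistics."""
    dims = len(source_mean)
    columns = [[row[d] for row in target] for d in range(dims)]
    batch_mean = [fmean(col) for col in columns]   # translation estimate
    batch_std = [pstdev(col) for col in columns]   # scaling estimate
    return [
        [
            (row[d] - batch_mean[d]) / (batch_std[d] + eps) * source_std[d] + source_mean[d]
            for d in range(dims)
        ]
        for row in target
    ]


# Hypothetical shifted batch and stored source-domain statistics.
shifted = [[10.0, 0.0], [12.0, 4.0], [14.0, 8.0]]
aligned = align_mean_and_scale(shifted, source_mean=[0.0, 1.0], source_std=[1.0, 2.0])
print(aligned)
```

Everything here is a forward computation on batch statistics, which is what makes this style of alignment independent of the network architecture.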
Browse all 81 Self-Supervised Learning papers →
📐 Optimization & Theory (222)¶
- A Block Coordinate Descent Method for Nonsmooth Composite Optimization under Orthogonality Constraints
-
OBCD is proposed as a block coordinate descent algorithm for solving "smooth + nonsmooth" composite optimization under orthogonality constraints (Stiefel manifold). By updating only \(k\ge 2\) rows of the solution matrix and reducing the problem to a small-scale \(k\times k\) orthogonal subproblem for exact solution, the method ensures strict feasibility and low per-iteration cost. It establishes "block-\(k\) stationarity," a stronger optimality guarantee than classic critical points, alongside an \(O(1/\epsilon)\) iteration complexity and last-iterate convergence rates under KL conditions.
- A Convergence Analysis of Adaptive Optimizers under Floating-Point Quantization
-
This paper establishes the first theoretical framework for analyzing the convergence of adaptive optimizers under floating-point quantization. By applying a relative error quantization model simultaneously to gradients, weights, and optimizer states (momentum and second moments), it proves that quantized Adam and Muon maintain the same \(\tilde{O}(T^{-1/4})\) convergence rate as full-precision versions when the mantissa length grows only logarithmically with the number of iterations. It further reveals the theoretical mechanism explaining why Adam is highly sensitive to weight and second-moment quantization while Muon is more robust.
- A Memory-Efficient Hierarchical Algorithm for Large-scale Optimal Transport Problems
-
This paper proposes HALO—a multiscale hierarchical framework for large-scale optimal transport (OT) problems. By combining "coarse-to-fine warm-start," "active support set pruning," and a "factorization-free first-order LP solver," the framework reduces memory requirements to \(O(n)\). On \(1024^2\) pixel images, it achieves an 8.9× speedup and a 70.5% reduction in GPU memory compared to the strongest baselines, while providing a scale-invariant upper bound on iteration complexity.
- A Scalable Constant-Factor Approximation Algorithm for \(W_p\) Optimal Transport
-
This paper provides the first truly quadratic-time constant-factor approximation algorithm for all \(p \in [1, \infty]\) (including \(p = \infty\)): on any metric space, it computes a \((4+\varepsilon)\)-approximation for \(W_p\) optimal transport in \(O(n^2+(n^{3/2}\varepsilon^{-1}\log n\log\Delta)^{1+o(1)}\log U)\) time, reducing the previous \(O(\log n)\) approximation ratio to a constant.
- A Schrödinger Eigenfunction Method for Long-Horizon Stochastic Optimal Control
-
For a class of stochastic optimal control (SOC) problems where the "uncontrolled drift is the gradient of a potential function," this paper proves that the linearized HJB operator is unitarily equivalent to a Schrödinger operator with a purely discrete spectrum. Consequently, long-horizon optimal control can be directly determined by the principal eigenfunction of this operator (with correction terms decaying exponentially over the time horizon). Based on this, closed-form solutions for symmetric LQR are provided, and a relative eigenfunction loss is proposed to eliminate "implicit reweighting" bias, reducing the memory/time complexity of long-horizon SOC from \(O(Td)\) to \(O(d)\) while improving control accuracy by approximately one order of magnitude.
- A Tale of Two Geometries: Adaptive Optimizers and Non-Euclidean Descent
-
This paper characterizes the relationship between adaptive optimizers like Adam/Shampoo and Normalized Steepest Descent (NSD) methods like SignGD/Muon through the lens of "two geometries / two types of smoothness." Both classes exploit the non-Euclidean geometry of the loss function, but adaptive optimizers rely on a stronger "adaptive smoothness" \(\Lambda_{\mathcal H}(f)\), whereas NSD depends on standard smoothness \(L_{\|\cdot\|_{\mathcal H}}(f)\). The authors extend the analysis of adaptive smoothness from convex to non-convex settings and prove that this stronger assumption yields benefits unattainable under standard smoothness: a Nesterov acceleration rate of \(\tilde O(T^{-2})\) and dimension-independent stochastic convergence rates.
- Activation Function Design Sustains Plasticity in Continual Learning
-
This paper repositions "activation functions" as the primary, architecture-agnostic lever for mitigating loss of plasticity in continual learning. Through an attribute-level analysis of negative slope and saturation behavior, three design principles are refined. Based on these, two plug-and-play non-linearities, Smooth-Leaky and Randomized Smooth-Leaky, are proposed, which consistently improve late-stage adaptation in supervised continual classification and non-stationary MuJoCo reinforcement learning.
- Adaptive Acquisition Selection for Bayesian Optimization with Large Language Models
-
This paper proposes LMABO, which utilizes a pre-trained Large Language Model (LLM) as a "zero-shot online strategist" for the Bayesian Optimization (BO) process. In each iteration, the optimization state is serialized into a structured text prompt, enabling the LLM to select the most suitable acquisition function (AF) from a portfolio. LMABO consistently outperforms static AFs, adaptive portfolios, and other LLM-based baselines across 50 benchmarks.
- Adaptive gradient descent on Riemannian manifolds and its applications to Gaussian variational inference
-
This paper proposes RAdaGD—a family of Riemannian adaptive gradient descent methods that do not require line searches. By automatically adjusting step sizes through online estimation of local smoothness constants, RAdaGD achieves a non-ergodic convergence rate of \(f(x_k)-f(x^\star)\le O(1/k)\) under the weak assumptions of "local geodesic smoothness + generalized geodesic convexity." Based on this, it provides the first convergence guarantee for Gaussian variational inference when the target log-density does not satisfy global L-smoothness.
- Adaptive Rollout Allocation for Online RL with Verifiable Rewards (VIP)
-
VIP (Variance-Informed Predictive allocation) is proposed to predict success probabilities of prompts via Gaussian processes, and subsequently use convex optimization to allocate rollout counts under computational budget constraints to minimize gradient variance. This consistently improves sampling efficiency for GRPO/RLOO in mathematical reasoning tasks, showing up to a 12.3-point Pass@32 improvement on AIME24/25.
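The floating-point quantization entry above (A Convergence Analysis of Adaptive Optimizers under Floating-Point Quantization) relies on a relative error model in which mantissa length controls precision. As a minimal sketch, assuming round-to-nearest on the mantissa and ignoring exponent range limits, the relative error of rounding to `mantissa_bits` bits is at most \(2^{-\text{mantissa\_bits}}\). The helper names are hypothetical, and the paper's exact quantizer may differ.

```python
import math


def quantize_mantissa(x: float, mantissa_bits: int) -> float:
    """Round x to the nearest value whose frexp mantissa has `mantissa_bits` bits."""
    if x == 0.0 or not math.isfinite(x):
        return x
    # x == mantissa * 2**exponent with 0.5 <= |mantissa| < 1
    mantissa, exponent = math.frexp(x)
    scale = 2 ** mantissa_bits
    return math.ldexp(round(mantissa * scale) / scale, exponent)


def relative_error(x: float, mantissa_bits: int) -> float:
    """Relative error |Q(x) - x| / |x| for nonzero x."""
    return abs(quantize_mantissa(x, mantissa_bits) - x) / abs(x)


# Arbitrary sample values, chosen only for illustration.
values = [0.1, 3.14159, -27.5, 1e-6, 12345.678]
worst = {bits: max(relative_error(v, bits) for v in values) for bits in (4, 8, 16)}
for bits, err in worst.items():
    print(bits, err, err <= 2.0 ** -bits)
```

Each extra mantissa bit halves the bound, so a mantissa length that grows logarithmically with the iteration count shrinks the per-step relative error polynomially in the number of iterations.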
Browse all 222 Optimization & Theory papers →
📐 Learning Theory (293)¶
- A Biologically Plausible Dense Associative Memory with Exponential Capacity
-
By replacing the "winner-take-all" activation in the hidden layer of a dual-layer associative memory with a thresholded step activation, hidden neurons can participate in multiple memories simultaneously (distributed representation). This increases storage capacity from "linear in the number of hidden neurons" to "exponential in the number of hidden neurons" (\(2^{N_h}\)). The model was validated on MNIST/CIFAR-10, demonstrating the ability to store tens of thousands of highly correlated images while maintaining biological plausibility.
- A Derandomization Framework for Structure Discovery: Applications in Neural Networks and Beyond
-
This paper proposes a general derandomization lemma based on \(\rho\)-SOSP, proving that under Gaussian inputs, smooth targets, and minimal weight regularization, second-order stationary points automatically suppress random linear components. This mechanism explains the low-rank structure discovery of first-layer weights in neural networks and extends to deterministic constructions for MAXCUT rounding and Johnson-Lindenstrauss embeddings.
- A Faster Parameter-Free Regret Matching Algorithm
-
This paper proposes a parameter-free regret matching variant MI-SPRM+. By using a technique called "Adaptive Regret Domain (ARD)" to monotonically raise the lower bound of the cumulative regret's 1-norm, it preserves the parameter-free property while achieving an \(O(1/T)\) theoretical convergence rate in two-player zero-sum games—making it the first RM-type algorithm known to achieve both simultaneously.
- A Generalized Geometric Theoretical Framework of Centroid Discriminant Analysis for Linear Classification of Multi-dimensional Data
-
This paper proposes a unified theoretical framework called Geometric Discriminant Analysis (GDA), which views a class of linear classifiers as a "connection between two class centroids (CDB0) + geometric corrections under different constraints." It proves that MDC and LDA are special cases of this framework. Based on this, a new classifier, CDA, is designed. Starting from CDB0, CDA performs "performance-driven rotations" on a series of 2D planes using Bayesian optimization. This approach reduces training complexity from cubic (LDA/SVM) to quadratic, achieving better performance, scalability, and stability than LDA/SVM/LR across 27 real-world datasets.
- A Minimum Variance Path Principle for Accurate and Stable Score-Based Density Ratio Estimation
-
This paper identifies the root of the "theoretical path-invariance vs. practical path-sensitivity" paradox in score-based density ratio estimation as a neglected term—the path variance of the score function. The authors propose the Minimum Variance Path (MVP) principle to explicitly incorporate this term into the objective and use the Kumaraswamy Mixture Model to parametrize the path as a learnable function, achieving more accurate and stable density ratio estimation across multiple challenging benchmarks.
- A Near-Optimal Best-of-Both-Worlds Algorithm for Federated Bandits
-
This paper proposes FEDFTRL—the first algorithm in federated multi-armed bandits to simultaneously achieve near-optimal individual regret bounds for both stochastic and adversarial environments. The core approach involves reinterpreting the "information delay induced by decentralized communication" as "delayed feedback bandits," and utilizing FTRL with a hybrid regularizer paired with a truncated loss estimator and a bias-recording communication scheme, reducing the adversarial regret from the previous SOTA \(O(T^{2/3})\) to \(O(T^{1/2})\).
- A New Approach to Controlling Linear Dynamical Systems
-
This paper proposes Online Spectral Control (OSC): the control problem of linear dynamical systems under adversarial perturbations is transformed via convex relaxation using a set of system-independent "spectral filters" (eigenvectors of a specific Hankel matrix). While maintaining an optimal regret of \(\tilde O(\gamma^{-4}\sqrt T)\), it reduces the per-step runtime dependency on the stability margin \(\gamma\) from polynomial \(O(\gamma^{-1})\) to logarithmic \(O(\mathrm{polylog}(1/\gamma))\).
- A New Initialization to Control Gradients in Sinusoidal Neural Networks
-
This paper derives a set of closed-form initialization parameters for the sinusoidal activation network SIREN. By simultaneously controlling the pre-activation distribution, inter-layer Jacobian variance, and spectral expansion, it reduces gradient explosion and spurious high-frequency noise in deep sinusoidal neural networks for tasks such as function fitting, image/audio/video reconstruction, and PINNs.
- A Sharp KL Convergence Analysis for Diffusion Models under Minimal Assumptions
-
This paper provides a sharper KL divergence convergence analysis for diffusion models (DDPM samplers) under the minimal assumption of "only \(L^2\) accuracy of score estimation, without assuming any smoothness." By modeling the generation process as "one-step probability flow ODE + one small noise-addition step" and developing new proof techniques to handle the second-order spatial derivatives of the score (Laplacian), the iteration complexity required to achieve \(\varepsilon^2\)-KL is improved from the previous best \(\tilde O(d/\varepsilon^2)\) to \(\tilde O(d/\varepsilon)\). This reduces the dependence on accuracy \(\varepsilon\) from quadratic to linear while maintaining linear dependence on the dimension \(d\).
- A Statistical Learning Perspective on Semi-dual Adversarial Neural Optimal Transport Solvers
-
This paper provides the missing statistical learning theory for a class of generative methods that use adversarial minimax solvers for quadratic optimal transport: it proves that the generalization error between the learned transport map and the true OT map can be decomposed into estimation error + approximation error. The estimation error is controlled solely by the Rademacher complexity of the network function classes, while the approximation error can be made arbitrarily small by choosing appropriate networks, thereby providing the first \(O(1/\sqrt{N})\) convergence guarantee.
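The Centroid Discriminant Analysis entry above starts from the line connecting the two class centroids. A minimal sketch of that starting point follows, assuming the decision threshold sits at the midpoint between centroids, which corresponds to the minimum distance classifier (MDC) that the summary lists as a special case. The rotation and Bayesian optimization steps of CDA are not shown, and the function names are hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Sequence[float]


def centroid(points: Sequence[Point]) -> List[float]:
    """Coordinate-wise mean of a non-empty set of points."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]


def fit_centroid_line(
    class0: Sequence[Point], class1: Sequence[Point]
) -> Tuple[List[float], float]:
    """Return (w, bias) with w pointing from the class-0 centroid to the class-1 centroid."""
    c0, c1 = centroid(class0), centroid(class1)
    w = [b - a for a, b in zip(c0, c1)]
    midpoint = [(a + b) / 2 for a, b in zip(c0, c1)]
    # Assumption: threshold at the midpoint, i.e. nearest-centroid classification.
    bias = -sum(wi * mi for wi, mi in zip(w, midpoint))
    return w, bias


def predict(w: Sequence[float], bias: float, x: Point) -> int:
    """Label 1 if x falls on the class-1 side of the hyperplane, else 0."""
    return int(sum(wi * xi for wi, xi in zip(w, x)) + bias > 0)


# Hypothetical toy data.
class0 = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
class1 = [(4.0, 4.0), (5.0, 4.0), (4.0, 5.0)]
w, bias = fit_centroid_line(class0, class1)
print(w, bias, predict(w, bias, (1.0, 1.0)), predict(w, bias, (4.0, 3.0)))
```

Fitting needs only one pass over the data to compute two means, which gives a sense of why a centroid-based starting point is cheap compared with methods that invert a covariance matrix.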
Browse all 293 Learning Theory papers →
🔗 Causal Inference (64)¶
- A Relative Error-Based Evaluation Framework of Heterogeneous Treatment Effect Estimators
-
This paper proposes an HTE estimator evaluation framework based on relative error. Through a carefully designed weighted least squares loss + balancing regularization + Dragonnet-style neural network, the relative error estimation remains \(\sqrt{n}\)-consistent, asymptotically normal, and provides valid confidence intervals even when the outcome regression model is misspecified (provided the propensity score model is correct). This allows for reliable comparison of different HTE estimators and yields an aggregated HTE learning algorithm.
- Action-Guided Attention for Video Action Anticipation
-
The authors propose the Action-Guided Attention (AGA) mechanism, which employs the model's own action prediction sequences as attention Query and Key (rather than pixel features). Combined with adaptive gated fusion of historical context and current frame features, it achieves robust generalization from validation to test sets on EPIC-Kitchens-100 while supporting post-training interpretability analysis.
- ActiveCQ: Active Estimation of Causal Quantities
-
ActiveCQ unifies the task of "estimating a specific causal quantity (CATE/ATE/ATT/ATE under distribution shift) with minimal labeled samples" into a single active learning problem. It observes that most causal quantities can be expressed as the integral of a regression function over a specific distribution. By modeling the regression function with Gaussian Processes (GP) and representing the integral distribution via Conditional Mean Embeddings (CME) in an RKHS, the framework analytically derives acquisition functions (Information Gain / Total Variance Reduction) from the posterior uncertainty of the causal quantity. It significantly outperforms benchmarks like Random, BALD, and Coreset with fewer labels across multiple simulated and semi-synthetic datasets.
- Adjusting Prediction Model Through Wasserstein Geodesic for Causal Inference
-
To address the issue where distributional imbalance between treated and control groups prevents prediction models from generalizing across groups, this paper proposes G-learner. Instead of aligning covariates (which leads to information loss and over-balancing), G-learner generates a sequence of intermediate populations along the Wasserstein geodesic between the two distributions. It then uses gradual self-training to step-by-step migrate the prediction model from one group to the other. On News/Twins/Jobs and synthetic datasets, it reduces PEHE/ATE errors to State-of-the-Art (SOTA) or competitive levels.
- ALM-MTA: Front-Door Causal Multi-Touch Attribution Method for Creator-Ecosystem Optimization
-
Addressing the challenge of missing ground truth labels and systemic latent confounding in "consumption-driven production" (CDP) scenarios on short-video platforms, this paper identifies the causal uplift of each consumption touchpoint on "whether the user uploads" using the front-door criterion + an adversarially learned proxy mediator. Contrastive learning is employed to ensure overlap in large action spaces. Evaluated on Kuaishou's production system with 400M DAU, the method improves upload AUC to 0.907 (a relative +40% gain over SOTA) and increases per-exposure efficiency by 670%.
- An Orthogonal Learner for Individualized Outcomes in Markov Decision Processes
-
This paper systematically introduces semiparametric efficiency theory from causal inference into Q-function estimation in MDPs. It proves that classical Q-regression and FQE are essentially naive learners with plug-in bias and proposes the DRQQ-learner—a meta-learner characterized by double robustness, Neyman orthogonality, and quasi-oracle efficiency. By deriving the Efficient Influence Function (EIF), it constructs a debiased two-stage loss, significantly outperforming baseline methods in Taxi and Frozen Lake environments.
- Beyond DAGs: A Latent Partial Causal Model for Multimodal Learning
-
This paper points out that large-scale multimodal data do not follow the generation assumption of a single Directed Acyclic Graph (DAG). It proposes a Latent Partial Causal Model utilizing "undirected edges to connect two sets of latent coupled variables." On both spherical and convex latent spaces, it is proven that representations learned by Multimodal Contrastive Learning (MMCL), such as CLIP, differ from ground-truth latent variables by a linear orthogonal transformation and a permutation transformation, respectively. This provides the first theoretical guarantee for "component-wise decoupling" in MMCL and implements it via a plug-and-play decoupling pipeline (FastICA / PCA+FastICA), achieving improvements in few-shot learning and domain generalization.
- CARL: Preserving Causal Structure in Representation Learning
-
CARL investigates the issue of causal structural drift in cross-modal representation learning. By employing three types of constraints—conditional independence preservation, Markov boundary retention, and monotonic alignment consistency—it maps multi-modal data into a shared representation space while preserving independence relations, mediator information, and causal effect identifiability conditions from the original causal graph.
- CaTs and DAGs: Integrating Directed Acyclic Graphs with Transformers for Causally Constrained Predictions
-
This paper proposes the Causal Transformer (CaT), which injects the adjacency matrix of a pre-specified Directed Acyclic Graph (DAG) as a mask into the transformer's cross-attention. This allows the network to strictly adhere to the causal structure while retaining strong functional approximation capabilities, resulting in improved robustness to covariate shift, better interpretability, and the ability to directly estimate intervention effects.
- Causal Discovery in the Wild: A Voting-Theoretic Ensemble Approach
-
This work treats multiple causal discovery algorithms as "fallible voting experts" and establishes a theoretically guaranteed weighted Bayesian voting framework for structural ensemble using voting theory. By decomposing graphs into edge-level substructures and estimating each expert's "ability matrix" via optimal transport, the approach is more robust and accurate than existing heuristic ensemble methods on both synthetic and real data, while providing explicit guidance on selecting ensemble size, ability, and diversity.
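The CaT entry above describes injecting a DAG adjacency matrix as an attention mask. A minimal sketch of that idea follows, under two labeled assumptions: `adjacency[parent][child] = 1` encodes an edge, and each node may also attend to itself so that root nodes have at least one allowed position. Neither convention is confirmed by the summary, and the example graph and scores are hypothetical.

```python
import math
from typing import List, Sequence


def dag_attention_mask(adjacency: Sequence[Sequence[int]]) -> List[List[bool]]:
    """mask[i][j] is True when node i may attend to node j (j is a parent of i, or j == i)."""
    n = len(adjacency)
    return [[i == j or adjacency[j][i] == 1 for j in range(n)] for i in range(n)]


def masked_softmax(scores: Sequence[float], allowed: Sequence[bool]) -> List[float]:
    """Softmax restricted to allowed positions; disallowed positions get weight 0."""
    kept = [s for s, ok in zip(scores, allowed) if ok]
    peak = max(kept)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) if ok else 0.0 for s, ok in zip(scores, allowed)]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical 3-node DAG: X -> T, X -> Y, T -> Y.
adjacency = [
    [0, 1, 1],
    [0, 0, 1],
    [0, 0, 0],
]
mask = dag_attention_mask(adjacency)
raw_scores = [[0.2, 1.5, -0.3], [0.7, 0.1, 2.0], [1.0, 1.0, 1.0]]
weights = [masked_softmax(row, allowed) for row, allowed in zip(raw_scores, mask)]
print(weights)
```

With the mask in place, a node's representation can only draw on its parents, so no amount of training can introduce a dependence that the pre-specified graph forbids.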
Browse all 64 Causal Inference papers →
🔬 Interpretability (196)¶
- A Comprehensive Information-Decomposition Analysis of Large Vision-Language Models
-
This work pioneers the use of Partial Information Decomposition (PID) to decompose the "decision-relevant information" of LVLMs into four non-negative atoms: redundant, unique visual, unique language, and synergistic. It constructs a model-agnostic estimation pipeline to quantitatively characterize whether LVLMs rely on genuine cross-modal fusion or language priors, evaluating 26 models on 4 datasets along three dimensions: "breadth, depth, and time."
- AbsTopK: Rethinking Sparse Autoencoders For Bidirectional Features
-
This paper derives SAEs using a unified framework of "unrolled proximal gradient descent for sparse coding," proving that ReLU, JumpReLU, and TopK are proximal operators for different sparse regularizers. It identifies that their shared non-negativity constraint splits bidirectional semantic concepts (e.g., male vs. female) into two redundant features. Consequently, the authors propose AbsTopK SAE, which removes the non-negativity constraint and selects the top \(k\) activations by absolute value. This allows a single feature to encode opposite concepts using signs, outperforming TopK and JumpReLU in reconstruction, interpretability, and steering tasks, while rivaling or exceeding supervised Difference-in-Mean.
- Activation Steering with a Feedback Controller
-
This paper reinterprets LLM activation steering as a feedback control problem in control theory. It proves that mainstream methods such as ActAdd, DirAblate, and Mean-AcT are essentially Proportional (P) controllers and thus possess inherent steady-state errors. Consequently, it proposes using a full PID controller to calculate steering vectors (PID Steering), which consistently outperforms original methods in tasks like detoxification, jailbreaking, and image style control.
- AdAEM: An Adaptively and Automated Extensible Measurement of LLMs' Value Difference
-
The authors propose AdAEM, an adaptive and self-extending LLM value assessment framework. It uses information theory optimization to automatically generate test questions that maximize the revelation of value differences between different LLMs, addressing the "information deficiency" problem of existing static benchmarks that fail to distinguish model value orientations.
- Adaptive Concept Discovery for Interpretable Few-Shot Text Classification
-
StructCBM transforms the Concept Bottleneck Model (CBM) into a paradigm that relies solely on sample-concept similarity for prediction without training a classification head. It uses an LLM to generate a dual-layer concept library—consisting of "Prototype Concepts + Discriminative Concepts"—from a minimal set of samples. It produces interpretable predictions through two-stage similarity matching (recalling candidate labels followed by discriminative contrast) and employs a closed-loop "misclassification feedback to LLM for concept refinement" mechanism. At 10-shot, it outperforms all existing CBMs, approaches the black-box performance of direct LLM calls on semantically dense datasets, and eliminates the need for LLMs during inference.
- Addressing Divergent Representations from Causal Interventions on Neural Networks
-
This work systematically reveals that causal interventions (such as activation patching, DAS, and SAE) push internal model representations away from their natural distributions. It theoretically distinguishes between "harmless" and "harmful" shifts and proposes the Counterfactual Latent (CL) loss to constrain intervened representations within the natural manifold. Evaluations on 7B LLMs demonstrate that this approach reduces divergence while maintaining intervention accuracy.
- An Information-Theoretic Parameter-Free Bayesian Framework for Probing Labeled Dependency Trees from Attention Score
-
IPBP does not train any probing network. It directly performs kernel density estimation on the joint distribution of "attention scores" and "dependency relations" to calculate the mutual information (MI) between each attention head and various dependency types in closed form. Using Bayesian posterior + geometric mean pooling + Eisner decoding, it reconstructs labeled dependency trees. On several 7B/8B LLMs, it proves more accurate than many supervised/unsupervised baselines and is inherently interpretable.
- Attention, Please! Revisiting Attentive Probing Through the Lens of Efficiency
-
Addressing the common parameter bloat in "attentive probing", an increasingly popular evaluation protocol for frozen representations, this paper first unifies existing methods into a single framework. By leveraging the mathematical equivalence between Multi-Head Cross-Attention (MHCA) and Multi-Query Cross-Attention (MQCA), it removes redundant projection matrices to propose the extremely lightweight Efficient Probing (EP). On ImageNet-1K, EP achieves 75.6% accuracy for MAE ViT-B using fewer than 1.4M parameters (compared to 67.7% for linear probing) and consistently outperforms linear probing and existing attentive probes across diverse pre-training paradigms.
- Attention Sinks and Compression Valleys in LLMs are Two Sides of the Same Coin
-
This paper demonstrates that two seemingly independent puzzles in LLMs—attention sinks and compression valleys—are actually two facets of the same mechanism: massive activations in the residual stream. Based on this, it proposes the Mix-Compress-Refine three-phase information flow theory, unifying the explanation of why embedding tasks are strongest in the middle layers while generation tasks require the full depth.
- Automated Interpretability Metrics Do Not Distinguish Trained and Random Transformers
-
This paper conducts a "sanity check" on the currently popular Sparse Autoencoders (SAEs). By applying SAEs to both trained Transformers and randomly initialized Transformers, the authors find that commonly used automated interpretability scores (auto-interp AUROC) and reconstruction metrics are almost indistinguishable between the two. This suggests that high interpretability scores alone cannot prove that an SAE has captured computational features actually learned by the model.
Browse all 196 Interpretability papers →
📦 Model Compression (241)
- A Fano-Style Accuracy Upper Bound for LLM Single-Pass Reasoning in Multi-Hop QA
-
This paper uses information theory to derive a Fano-style accuracy upper bound for single-pass LLM reasoning in Multi-Hop QA (MHQA). It reveals a "cliff-like" drop in accuracy when task information requirements exceed model output capacity. Based on these insights, the authors design InfoQA, a multi-turn reasoning framework that breaks the single-pass bottleneck through capacity-aware decomposition, dependency-explicit workflows, and iterative query compression.
- A Recovery Guarantee for Sparse Neural Networks
-
The authors prove the first sparse recovery guarantee for ReLU neural networks: for two-layer scalar output networks with Gaussian randomly sampled training data, an Iterative Hard Thresholding (IHT) algorithm based on convex reformulation precisely recovers sparse network weights, with memory requirements growing only linearly with the number of non-zero weights.
- A universal compression theory for lottery ticket hypothesis and neural scaling laws
-
The paper proves a universal compression theorem: any permutation-invariant function can be asymptotically compressed to a \(\text{polylog}(d)\) scale with error approaching zero (which is the optimal compression rate). This directly leads to the proof of the dynamic lottery ticket hypothesis—any network can be compressed to polylogarithmic width while maintaining invariant learning dynamics—as well as dataset compression to polylogarithmic size while maintaining the loss landscape, and the acceleration of power-law scaling laws to arbitrarily fast decay rates.
- ABBA-Adapters: Efficient and Expressive Fine-Tuning of Foundation Models
-
This paper proposes ABBA-Adapters, which parameterize weight updates as the Hadamard product of two independent learnable low-rank matrices: \(\Delta W = s(B_1A_1) \odot (B_2A_2)\). This achieves an effective rank significantly higher than LoRA (\(r_1 \cdot r_2\) vs. \(r\)) under the same parameter budget. Through Khatri-Rao reconstruction, it maintains memory efficiency comparable to LoRA and significantly outperforms existing PEFT methods on arithmetic and commonsense reasoning tasks.
- Achieving low-bit Muon through subspace preservation and grid quantization
-
This paper presents the first study on 4-bit compression of Muon optimizer states. It reveals that Newton-Schulz orthogonalization primarily amplifies quantization errors in the top singular subspace of the momentum matrix. Consequently, the authors propose 4-bit-Muon-GRASP, which uses 8-bit precision to preserve the top subspace, 4-bit precision for the residual subspace, and grid quantization normalized along both rows and columns to suppress bi-dimensional outliers. This method achieves near-lossless accuracy in LLaMA pre-training (130M to 1.1B parameters) and Qwen2.5-7B fine-tuning, reducing training memory by up to 28%.
- ACPBench Hard: Unrestrained Reasoning about Action, Change, and Planning
-
ACPBench Hard is an open-ended generative planning reasoning benchmark based on the PDDL formal system, covering 8 task categories across 13 domains (1040 problems in total). It is equipped with a symbolic validator that provides rigorous correctness guarantees. A systematic evaluation of 15 LLMs reveals that even the strongest reasoning model, o1-preview, achieves an accuracy of \(\le 66\%\) on half of the tasks. Furthermore, all models come close to failing the most basic "enumerate executable actions" task, exposing fundamental deficiencies in the planning reasoning of current LLMs.
- Adaptive Nonlinear Compression for Large Foundation Models
-
NLA employs piecewise linear kernels to perform "nonlinear low-rank approximation" on weight matrices, coupled with a reconstruction-free all-matrix forward algorithm and an adaptive budget scheduler that allocates compression rates based on importance. This allows low-rank compression to achieve lower information loss and higher compression rates under the same parameter budget.
- Adaptive Width Neural Networks
-
The AWN framework is proposed to automatically learn unbounded layer widths (number of neurons) during training via variational inference. By applying a soft ordering to neurons using a monotonically decreasing importance function, it enables width adaptation to task difficulty and supports zero-cost post-training truncation compression.
- AdaRank: Adaptive Rank Pruning for Enhanced Model Merging
-
This paper proposes AdaRank, which adaptively selects singular components of task vectors using learnable binary masks (replacing heuristic top-k). Combined with test-time entropy minimization, it significantly mitigates inter-task interference in multi-task model merging, achieving 89.4% accuracy on ViT-B/32.
- AgilePruner: An Empirical Study of Attention and Diversity for Adaptive Visual Token Pruning in LVLMs
-
Through a systematic empirical analysis using erank (effective rank) and attention entropy, this study reveals the complementary characteristics of attention-based and diversity-based methods in visual token pruning—attention methods suppress hallucinations but have limited coverage, while diversity methods offer comprehensive coverage but are prone to introducing hallucinations. Based on these findings, AgilePruner is proposed to adaptively switch pruning strategies according to image complexity, demonstrating robust performance across 9 benchmarks.
Browse all 241 Model Compression papers →
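The ABBA-Adapters note above says that a Hadamard product of two low-rank factors can reach rank \(r_1 \cdot r_2\). The following is a minimal NumPy sketch of that rank claim only, assuming random Gaussian factors; the function name `abba_update`, the shapes, and the scale `s` are illustrative choices and not the authors' implementation.

```python
import numpy as np


def abba_update(B1, A1, B2, A2, s=1.0):
    """Illustrative update: s * (B1 @ A1) multiplied elementwise by (B2 @ A2)."""
    return s * (B1 @ A1) * (B2 @ A2)


rng = np.random.default_rng(0)
d_out, d_in, r1, r2 = 32, 24, 2, 3  # hypothetical layer shape and ranks

B1, A1 = rng.standard_normal((d_out, r1)), rng.standard_normal((r1, d_in))
B2, A2 = rng.standard_normal((d_out, r2)), rng.standard_normal((r2, d_in))

delta_w = abba_update(B1, A1, B2, A2)
rank_abba = np.linalg.matrix_rank(delta_w)

# A LoRA-style update with the same number of trainable parameters:
# the two factor pairs hold (r1 + r2) * (d_out + d_in) values in total,
# which is what a single rank-(r1 + r2) pair would hold.
B = rng.standard_normal((d_out, r1 + r2))
A = rng.standard_normal((r1 + r2, d_in))
rank_lora = np.linalg.matrix_rank(B @ A)

print("Hadamard-product rank:", rank_abba, "| plain low-rank rank:", rank_lora)
```

The rank of an elementwise product is bounded by the product of the two ranks, and generic random factors attain that bound; trained factors need not.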
🕸️ Graph Learning (118)
- A Graph Meta-Network for Learning on Kolmogorov–Arnold Networks
-
This paper demonstrates that Kolmogorov–Arnold Networks (KAN) share the same neuron permutation symmetries as MLPs. Based on this, it encodes trained KANs into "KAN-graphs" (where nodes represent neurons and edges carry parameters of 1D functions). It proposes WS-KAN, the first weight-space architecture designed for KANs using a bidirectional message-passing GNN, which significantly outperforms symmetry-agnostic baselines in tasks such as accuracy prediction, INR classification, and pruning mask prediction.
- Actions Speak Louder than Prompts: A Large-Scale Study of LLMs for Graph Inference
-
This paper presents a large-scale, controlled empirical study systematically comparing three "interaction modes" for LLMs to process textual graphs: direct prompting, ReAct-style tool calling, and Graph-as-Code (where the LLM writes code to query the graph). The study finds that allowing the LLM to write code for graph operations (rather than stuffing the graph into the prompt) is overall superior for node classification, especially on dense graphs with long text or high degrees, as it enables adaptive switching between structural, feature, and label signals.
- Adaptive Mixture of Disentangled Experts for Dynamic Graph Out-of-Distribution Generalization
-
Addressing the phenomenon where "distribution shift itself evolves over time" on dynamic graphs, this paper proposes AdaMix: it utilizes a spatio-temporal distribution detector to perceive shifts at each time step in real-time, employs prototype-guided disentangled mixture of experts (using various GNN architectures as experts) for adaptive routing based on shifts, and finally applies a distribution-aware intervention mechanism to mine invariant patterns, significantly outperforming fixed-architecture SOTA methods on real and synthetic dynamic graph datasets.
- AdaSpec: Adaptive Spectrum for Enhanced Node Distinguishability
-
This paper characterizes the expressive power of spectral GNNs from the perspective of "node distinguishability." It proves that the lower bound of distinguishable nodes is jointly determined by the number of distinct eigenvalues of the graph matrix and the number of non-zero frequency components of node features. Based on this, the authors propose AdaSpec, a plug-and-play adaptive graph matrix generation module that significantly enhances the ability of spectral GNNs to distinguish nodes in heterophilic graphs without increasing the order of computational complexity or violating permutation equivariance.
- AdS-GNN - a Conformally Equivariant Graph Neural Network
-
This paper "lifts" point clouds from flat Euclidean space to a higher-dimensional Anti-de Sitter (AdS) space. Leveraging the correspondence in physics between AdS isometry transformations and boundary conformal transformations, the authors construct AdS-GNN, the first Graph Neural Network equivariant to the full conformal group (including translations, rotations, scaling, and non-affine special conformal transformations). The model demonstrates stronger scale generalization on tasks such as SuperPixel MNIST, shape segmentation, and Ising model correlation functions, and can directly read out physically meaningful universal quantities like conformal dimensions from the trained network.
- Are We Measuring Oversmoothing in Graph Neural Networks Correctly?
-
This work points out that the widely used Dirichlet energy metric fails to correctly capture the oversmoothing phenomenon in practical GNN scenarios. It proposes using the numerical/effective rank (Erank) of feature representations as an alternative metric. When a separate model is trained for each depth (from 2 to 24), Erank achieves an average correlation of 0.91 with accuracy (consistently positive), whereas Dirichlet energy averages −0.72, with a correlation direction that fluctuates across datasets (it fails most clearly on the large-scale OGB-Arxiv). Furthermore, the paper proves that for linear GNNs and a family of non-linear GNNs with non-negative weights, the numerical rank of the feature matrix converges to 1 (rank collapse), thereby redefining oversmoothing as rank collapse rather than eigenvector alignment.
- AtlasKV: Augmenting LLMs with Billion-Scale Knowledge Graphs in 20GB VRAM
-
AtlasKV directly converts each triple in a knowledge graph (KG) into Q-K-V data for injection into LLMs via attention. By employing hierarchical key-value pruning, it reduces complexity from linear to sub-linear, enabling LLMs to access billion-scale (1B triples) knowledge graphs within 20GB of VRAM without external retrievers, long context windows, or retraining for new knowledge.
- Atomic HINs: Entity-Attribute Duality for Heterogeneous Graph Modeling
-
This paper proposes the "entity-attribute duality" principle, atomizing all attributes in a Heterogeneous Information Network (HIN) into entity nodes to obtain an "Atomic HIN" as a canonical form with maximal expressiveness. By applying a genetic algorithm for binary selection (schema refinement) on node/edge types, a minimal version of RGCN (sRGCN) achieves SOTA performance on node classification and link prediction across 8 datasets.
- Beyond Entity Correlations: Disentangling Event Causal Puzzles in Temporal Knowledge Graphs
-
This paper proposes HEDRA, the first representation learning framework for heterogeneous causal disentanglement at the event level in Temporal Knowledge Graphs (TKGs). By using three modules—counterfactual detection, instrumental variable guidance, and evolutionary orthogonality—it sequentially strips away non-causal and pseudo-causal relations while separating dynamic and static causality, achieving SOTA on five real-world datasets.
- Beyond Simple Graphs: Neural Multi-Objective Routing on Multigraphs
-
This paper proposes GMS, the first neural combinatorial optimization routing method for multigraphs. It includes two variants: GMS-EB, which performs edge-level autoregressive construction directly on multigraphs, and GMS-DH, a dual-head approach that learns to prune multigraphs before node-level routing. GMS achieves performance close to the strong heuristic solver LKH on asymmetric multi-objective TSP and CVRP while being dozens of times faster.
Browse all 118 Graph Learning papers →
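The oversmoothing note above proposes the effective rank of the feature matrix as a metric and describes rank collapse in linear GNNs. The following is a minimal sketch, assuming the common entropy-based definition of effective rank (the exponential of the Shannon entropy of the normalized singular values), which may differ from the paper's exact choice; the ring graph, the random features, and the name `effective_rank` are illustrative.

```python
import numpy as np


def effective_rank(X, eps=1e-12):
    """exp(entropy) of the singular values normalized to sum to one."""
    s = np.linalg.svd(X, compute_uv=False)
    p = s / (s.sum() + eps)
    p = p[p > eps]  # drop numerically zero components before taking logs
    return float(np.exp(-(p * np.log(p)).sum()))


# Toy graph: a ring of n nodes with self-loops, symmetrically normalized.
n, d = 12, 6
A = np.eye(n)
for i in range(n):
    A[i, (i + 1) % n] = A[(i + 1) % n, i] = 1.0
deg = A.sum(axis=1)
P = A / np.sqrt(np.outer(deg, deg))

rng = np.random.default_rng(0)
X = rng.standard_normal((n, d))

# Weight-free linear propagation H = P^depth X at increasing depth.
eranks = []
for depth in (0, 16, 32, 48):
    H = np.linalg.matrix_power(P, depth) @ X
    eranks.append(effective_rank(H))
    print(f"depth={depth:2d}  effective rank={eranks[-1]:.3f}")
```

In this toy setting the features drift toward the dominant eigenvector of the propagation matrix, so the effective rank falls toward 1 as depth grows. The sketch omits learned weights and nonlinearities, which the paper's analysis covers.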
📈 Time Series (121)
- A General Spatio-Temporal Backbone with Scalable Contextual Pattern Bank for Urban Continual Forecasting
-
STBP employs a general spatio-temporal backbone based on "frequency domain + linear graph attention" to extract stable and transferable representations, supplemented by an incrementally scalable "contextual pattern bank" acting as prompts. By freezing the backbone and expanding only the pattern bank, the model achieves anti-forgetting, robust modeling, and scalability on urban streaming data with growing nodes and shifting distributions.
- A Spectral-Grassmann Wasserstein metric for operator representations of dynamical systems
-
This paper represents the Koopman / transfer operators of dynamical systems as discrete distributions consisting of "eigenvalues + spectral projection subspaces." It defines the Spectral-Grassmann Optimal Transport (SGOT) distance on spectral spaces and Grassmann geometry, enabling dynamical systems under different sampling frequencies to be compared, classified, and interpolated via Fréchet barycenters.
- A Study of Posterior Stability in Time-Series Latent Diffusion
-
This paper systematically analyzes the posterior collapse issue in latent diffusion for time series—proving that collapse causes the model to degenerate into a weakened version of a VAE—and proposes the "Posterior-Stable Latent Diffusion" framework. It reinterprets the diffusion process as variational inference to eliminate the dangerous KL regularization and utilizes the diffusion process to simulate collapse to penalize decoder insensitivity toward latent variables.
- A Unified Federated Framework for Trajectory Data Preparation via LLMs
-
FedTDP unifies "Trajectory Data Preparation" (ten categories of tasks including denoising, completion, and map matching) into a cross-regional federated learning problem without sharing raw data. It utilizes a lightweight privacy autoencoder for data protection, a trajectory knowledge enhancer to transform general LLMs into "trajectory cleaning brains" with spatio-temporal awareness, and parallel optimization to reduce communication costs. It outperforms 13 SOTA methods across 10 tasks on 6 datasets.
- Adapt Data to Model: Adaptive Transformation Optimization for Domain-shared Time Series Foundation Models
-
The TATO framework is proposed to adapt frozen Large Time-series Models (LTMs) to diverse downstream domains without fine-tuning by automatically optimizing data preprocessing pipelines (including context trimming, scale normalization, and outlier correction), achieving an average MSE reduction of 13.6% and a maximum reduction of 65.4%.
- Are Global Dependencies Necessary? Scalable Time Series Forecasting via Local Cross-Variate Modeling
-
Addressing the bottleneck in multivariate time series forecasting where global attention for modeling cross-variate dependencies leads to quadratic complexity growth relative to the number of variables, this paper proposes the "Local Sufficiency Hypothesis"—suggesting that in dense systems, a finite local neighborhood likely contains sufficient predictive signals. Based on this, VPNet is designed: it rearranges patch embeddings into a 2D "Variate \(\times\) Patch" field and uses depthwise separable 2D convolutions for local mixing. This ensures complexity grows linearly with the number of variables, achieving SOTA accuracy and significant efficiency advantages across 8 benchmarks.
- ASTGI: Adaptive Spatio-Temporal Graph Interactions for Irregular Multivariate Time Series Forecasting
-
ASTGI directly encodes each discrete observation in irregular multivariate time series as a "point" in a learnable spatio-temporal space, preserving the original sampling structure without interpolation or alignment. It dynamically constructs a causal graph for each point using nearest neighbor search and performs relation-aware message passing based on relative spatio-temporal positions. Finally, it unifies forecasting as "aggregating neighborhood information for a query point to perform regression," reducing MSE by approximately 6% compared to the second-best method across four public datasets.
- Aurora: Towards Universal Generative Multimodal Time Series Forecasting
-
Aurora is the first multimodal time series foundation model: it is pre-trained on a cross-domain corpus of "time series + textual description + endogenous images." It utilizes modality-guided attention to inject domain knowledge from text/images into time series modeling and employs "prototype-guided flow matching" for generative probabilistic forecasting. This allows it to achieve SOTA performance in both deterministic and probabilistic forecasting under zero-shot and few-shot cross-domain scenarios.
- AutoDA-Timeseries: Automated Data Augmentation for Time Series
-
AutoDA-Timeseries is the first general automated data augmentation (AutoDA) framework for time series. It feeds the statistical features of each time series into a learnable policy generator. Stacked augmentation layers differentiably select transformation types and adaptively adjust their probabilities and intensities using Gumbel-Softmax. Optimized jointly with the downstream model in a single stage, it consistently outperforms existing strong baselines across five major tasks: classification, long/short-term forecasting, regression, and anomaly detection.
- Battery Fault: A Comprehensive Dataset and Benchmark for Battery Fault Diagnosis
-
This paper constructs CH-BatteryGen, the first battery system fault diagnosis dataset for electric vehicles (EVs) under real-world operating conditions. By combining "real vehicle data + mechanism-constrained generation models," it balances authenticity and scale, covering 1000 vehicles, two mainstream chemical systems, four fault labels, and three severity levels, accompanied by two benchmark tasks: fault classification and fault grading.
Browse all 121 Time Series papers →
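The AutoDA-Timeseries note above mentions selecting transformation types with Gumbel-Softmax. The following is a minimal NumPy sketch of that relaxation applied to a soft choice among three augmentations. It only shows the forward sampling step (NumPy has no gradients), and the augmentation set, the names `gumbel_softmax` and `soft_augment`, and all parameter values are assumptions made for illustration, not the paper's design.

```python
import numpy as np


def gumbel_softmax(logits, tau, rng):
    """Sample relaxed one-hot weights: softmax((logits + Gumbel noise) / tau)."""
    u = rng.uniform(low=1e-12, high=1.0, size=logits.shape)
    gumbel = -np.log(-np.log(u))
    z = (logits + gumbel) / tau
    z = z - z.max()  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()


# Hypothetical augmentation set; each takes (series, strength, rng).
AUGMENTATIONS = {
    "identity": lambda x, a, rng: x,
    "jitter": lambda x, a, rng: x + a * rng.standard_normal(x.shape),
    "scaling": lambda x, a, rng: x * (1.0 + a * rng.standard_normal()),
}


def soft_augment(x, logits, strength, tau, rng):
    """Blend all augmented versions of x using Gumbel-Softmax weights."""
    weights = gumbel_softmax(logits, tau, rng)
    outputs = np.stack([f(x, strength, rng) for f in AUGMENTATIONS.values()])
    return weights @ outputs, weights


rng = np.random.default_rng(0)
series = np.sin(np.linspace(0.0, 2.0 * np.pi, 50))
augmented, weights = soft_augment(series, np.zeros(3), strength=0.1, tau=0.5, rng=rng)
print(dict(zip(AUGMENTATIONS, np.round(weights, 3))))
```

Lower values of `tau` push the weights toward a one-hot choice, while higher values blend the augmentations more evenly.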
🏥 Medical Imaging (88)
- A Brain Graph Foundation Model: Pre-Training and Prompt-Tuning across Broad Atlases and Disorders
-
BrainGFM models fMRI brain networks as graphs and employs "Graph Contrastive Learning + Graph Masked Autoencoding" for large-scale pre-training on 400,000 brain graphs across 27 datasets and 8 brain atlases. By using meta-learning optimized graph prompts for few-shot adaptation and BioClinicalBERT-encoded language prompts for zero-shot transfer, a frozen foundation model can perform direct diagnosis across diverse atlases, brain disorders, and task settings.
- A Cognitive Process-Inspired Architecture for Subject-Agnostic Brain Visual Decoding
-
VCFLOW incorporates the "ventral-dorsal dual-stream" mechanism of the human visual cortex into a decoding model. It decomposes fMRI signals into early visual, ventral, and dorsal streams, aligning them with different hierarchical CLIP features. By using a redistribution adapter to decouple "subject-agnostic semantics" from "subject identity," it achieves fMRI-to-video reconstruction without retraining on new subjects for the first time. Compared to subject-specific training, it loses only about 7% accuracy while reducing single-video generation from 12 hours of training to 10 seconds of inference.
- A Scalable Distributed Framework for Multimodal GigaVoxel Image Registration
-
This paper proposes FFDP—a suite of IO-aware non-GEMM fused CUDA kernels combined with a distributed framework supporting convolution-aware tensor sharding. It accelerates traditional/deep image registration pipelines by 6–7×, reduces peak memory by 20–59%, and performs the first native-resolution multimodal registration of 100µm ex-vivo human brain MRI (over 11 billion transformation parameters, 570× larger than clinical data) on 8 A6000 GPUs in approximately one minute.
- A Structured, Tagged, and Localized Visual Question Answering Dataset with Full Sentence Answers and Scene Graphs for Chest X-ray Images
-
This paper automatically constructs CXR-QBA from MIMIC-CXR radiology reports—a large-scale chest X-ray VQA dataset featuring 42.2 million QA pairs. Each answer includes full sentences, bounding boxes, and structured labels (findings, regions, certainty, etc.). Produced via a three-stage pipeline ("Scene Graph Construction → Templated QA Generation → LLM-based Quality Assurance"), the dataset provides two subsets—a 31.2 million pre-training level and a 7.5 million fine-tuning level—along with a baseline model and evaluation metrics.
- AbdCTBench: Learning Clinical Biomarker Representations from Abdominal Surface Geometry
-
The authors extracted 2D body surface mesh images from 23,506 abdominal CT scans of 18,719 patients, paired them with 16 CT biomarkers and hundreds of disease/comorbidity labels to construct AbdCTBench—the first and largest "surface geometry \(\rightarrow\) internal body composition" dataset. Systematically evaluating 7 mainstream vision architectures, they demonstrated that external abdominal geometry alone can predict clinical indicators such as age (MAE 6.22 years), mortality (AUROC 0.839), and diabetes with chronic complications (AUROC 0.801), paving the way for radiation-free, low-cost consumer-grade health screening.
- Accelerating Benchmarking of Functional Connectivity Modeling via Structure-aware Core-set Selection
-
To make the expensive task of "comparing hundreds of functional connectivity (FC) modeling operators on large-scale fMRI data" affordable, this paper reformulates benchmarking as a "rank-preserving subset selection" problem. It proposes a self-supervised framework, SCLCS, which learns the connectivity structure of each sample using an adaptive Transformer, identifies stable "prototype" samples using the Structure Perturbation Score (SPS), and supplements diversity via density-equalized sampling. Using only 10% of the data, it maintains the true ranking of 130 FC operators from the full set, achieving a ranking consistency (nDCG@k) up to 23.2% higher than previous state-of-the-art core-set methods.
- Adaptive Domain Shift in Diffusion Models for Cross-Modality Image Translation
-
The CDTSDE framework is proposed, which embeds a learnable spatially adaptive domain mixing field \(\Lambda_t\) into the reverse SDE of diffusion models. This allows the cross-modality translation path to proceed along a low-energy manifold, achieving higher fidelity with fewer denoising steps in MRI modality conversion, SAR-to-optical, and industrial defect semantic mapping tasks.
- Anatomy-aware Representation Learning for Medical Ultrasound
-
Addressing three main characteristics of medical ultrasound (US), namely heavy speckle texture, grayscale-only appearance, and organ-specific features, this paper constructs a large-scale ultrasound dataset of 5.2 million images. It proposes an anatomy-aware A-ViT (centered on an "Anatomy-Conditional Deformable Transformer", ACDT) coupled with a triple self-supervised objective of "masked reconstruction + adversarial + self-distillation." The method significantly outperforms general-purpose and medical SSL baselines across multiple US diagnostic tasks, including breast, thyroid, gallbladder, COVID-19 lung, and cardiac imaging.
- Are EEG Foundation Models Worth It? Comparative Evaluation with Traditional Decoders in Diverse BCI Tasks
-
The authors conduct a systematic comparative evaluation involving 5 mainstream EEG foundation models across 7 classification and 2 regression tasks using six evaluation protocols with statistical testing. They propose ST-EEGFormer, a simple ViT baseline pre-trained on 8 million raw EEG segments via Masked Autoencoding (MAE). Findings indicate that foundation models hold a significant advantage only in data-abundant population-level decoding; in data-scarce per-subject scenarios, they often fail to outperform compact CNNs or even traditional non-neural decoders. Linear probing is generally weak, and no clear scaling laws were observed.
- ASMIL: Attention-Stabilized Multiple Instance Learning for Whole-Slide Imaging
-
This paper identifies for the first time a failure mode called "Attention Dynamic Instability" in Attention-based MIL for Whole-Slide Imaging (WSI). It proposes ASMIL: a unified framework that stabilizes attention using an EMA-updated anchor model distillation, suppresses attention over-concentration with a normalized sigmoid, and mitigates overfitting via token random dropout, achieving up to 6.49% F1 improvement across multiple pathological datasets.
Browse all 88 Medical Imaging papers →
🩺 Medical LLM (20)
- ATPO: Adaptive Tree Policy Optimization for Multi-Turn Medical Dialogue
-
This paper proposes the ATPO (Adaptive Tree Policy Optimization) algorithm, which models multi-turn medical dialogues as a Hierarchical Markov Decision Process (H-MDP). It dynamically allocates rollout budgets through an uncertainty-aware adaptive tree expansion mechanism, guiding exploration via a composite uncertainty measure of Bellman error and action-value variance. Using Qwen3-8B, it outperforms GPT-4o on three medical dialogue benchmarks.
- Can Large Language Models Match the Conclusions of Systematic Reviews?
-
The authors constructed the MedEvidence benchmark—rewriting conclusions from 100 Cochrane Systematic Reviews (SRs) into 284 closed-ended questions paired with their source studies. This allows LLMs to replicate expert conclusions under "same material" controlled conditions. Evaluating 25 LLMs revealed: reasoning models are not necessarily better, marginal gains diminish with model size, and medical fine-tuning often decreases performance. Models generally lack "scientific skepticism" regarding low-quality evidence, failing to match expert conclusions in at least 37% of cases.
- Can SAEs Reveal and Mitigate Racial Biases of LLMs in Healthcare?
-
This paper investigates whether Sparse Autoencoders (SAEs) can reveal and mitigate racial biases in LLMs within healthcare contexts. It finds that SAEs can identify harmful racial associations (e.g., Black patients with violence), but the effectiveness of mitigating bias in complex clinical tasks is limited (FLDD < 3%), significantly underperforming simple prompting strategies (FLDD 8-15%).
- Cancer-Myth: Evaluating Large Language Models on Patient Questions with False Presuppositions
-
This paper constructs Cancer-Myth—an adversarial dataset verified by hemato-oncologists containing 585 oncology patient questions with false presuppositions. The study finds that leading LLMs, including GPT-5, Gemini-2.5-Pro, and Claude-4-Sonnet, achieve a success rate of no more than 43% in correcting these false presuppositions. Furthermore, mitigation techniques such as defensive prompting trigger significant over-corrections on "no-false-presupposition" questions and degrade performance on other medical benchmarks, highlighting a critical safety gap in medical LLM patient communication.
- CounselBench: A Large-Scale Expert Evaluation and Adversarial Benchmarking of Large Language Models in Mental Health Question Answering
-
The authors collaborated with 100 licensed mental health professionals to construct CounselBench, a dual-component benchmark for open-ended mental health QA. It includes 2,000 expert evaluations with dimension-level scoring and span annotations (CounselBench-Eval), and 120 clinician-authored adversarial prompts designed to induce specific failure modes (CounselBench-Adv). The study reveals that LLMs currently exhibit "high scores alongside persistent safety hazards" in counseling scenarios and demonstrates that LLM-as-Judge is unreliable in this high-risk domain.
- CounselBench: A Large-Scale Expert Evaluation and Adversarial Benchmarking of LLMs in Mental Health QA
-
In collaboration with 100 licensed mental health experts, CounselBench was constructed as a dual-component benchmark—CounselBench-EVAL (2,000 expert evaluations across six dimensions) and CounselBench-Adv (120 adversarial questions with 1,080 response annotations). The study systematically reveals that while LLMs achieve high superficial scores in open-ended mental health QA, they harbor safety hazards such as overgeneralization and unauthorized medical advice, while also proving that LLM-as-Judge is severely unreliable in safety-critical domains.
- Critic-Adviser-Reviser Cyclic Refinement: Towards High-Quality EMR Corpus Generation with LLMs
-
Addressing the issues where LLMs directly generating Electronic Medical Records (EMR) "only imitate, suffer from distribution distortion, and lack quality constraints," this paper proposes LLM-CARe. This framework employs a "corpus → section → document" three-level granularity, with each level refined by a Critic/Adviser/Reviser agent cycle. Without accessing any real EMR text, it significantly pushes the quality of synthetic records and downstream clinical task performance beyond the SOTA.
- Doctor-R1: Mastering Clinical Inquiry with Experiential Agentic Reinforcement Learning
-
Doctor-R1 models outpatient inquiry as a partially observable multi-turn decision-making process. By utilizing a "multi-agent interaction environment + two-level reward architecture + experience memory" for experiential agentic reinforcement learning, an 8B doctor agent learns to ask questions strategically and empathetically while maintaining diagnostic accuracy. It outperforms 32B open-source models and closed-source models like GPT-4.1 on HealthBench and MAQuE.
- From Conversation to Query Execution: Benchmarking User and Tool Interactions for EHR Database Agents
-
This paper proposes the EHR-ChatQA benchmark to evaluate end-to-end interaction workflows of database agents in EHR scenarios (clarifying vague queries → resolving term mismatches → generating SQL → returning answers). Findings reveal that while the strongest model (o4-mini) achieves over 90% Pass@5, its Pass^5 (all five attempts successful) is far lower, with a gap of up to 60%, exposing robustness defects in safety-critical domains.
- From Medical Records to Diagnostic Dialogues: A Clinical-Grounded Approach and Dataset for Psychiatric Comorbidity
-
This paper proposes a two-stage pipeline of "social media posts \(\rightarrow\) structured electronic medical records (EMR) \(\rightarrow\) multi-agent diagnostic dialogues." By adapting the SCID-5 clinical interview protocol into a Hierarchical Diagnostic State Machine (HDSM) and a Diagnostic Context Tree (DCT), the authors construct PsyCoTalk—the first large-scale psychiatric comorbidity diagnostic dialogue dataset (3,000 multi-turn dialogues)—validated by practicing psychiatrists for clinical authenticity.
Browse all 20 Medical LLM papers →
🧬 Computational Biology (156)¶
- 3DCS: Datasets and Benchmark for Evaluating Conformational Sensitivity in Molecular Representations
-
The authors construct 3DCS, the first benchmark specifically designed to evaluate the representation sensitivity to "different conformations of the same molecule." Using >1M molecules and ~10M conformations covering geometry, chirality, and energy dimensions, paired with a Geometry–Chirality–Energy (GCE) evaluation framework, they reveal that modern 3D molecular representations are geometrically sensitive, erratic in capturing chirality, and largely fail to align with energy.
- A Cross-Species Neural Foundation Model for End-to-End Speech Decoding
-
This paper proposes BIT, an end-to-end brain-computer interface (BCI) that translates cortical neural activity directly into full sentences. It utilizes a Transformer neural encoder pre-trained via cross-species (human + monkey) and cross-task self-supervised masked modeling. This encoder is then fine-tuned with contrastive alignment to an Audio LLM, reducing the Word Error Rate (WER) of previous end-to-end methods from 24.69% to 10.22% while setting a new SOTA on the Brain-to-Text '24/'25 benchmarks under a cascaded framework.
- A Diffusion Model to Shrink Proteins While Maintaining Their Function
-
The authors propose SCISOR, a discrete diffusion model that learns only to "delete characters." It uses a pure birth process (random insertion) for forward noising and trains a denoiser to plan reverse deletions. This shrinks long protein sequences into shorter ones that are both "natural" and functional, achieving SOTA on ProteinGym deletion effect prediction.
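A minimal sketch of the forward noising idea in the SCISOR entry: random insertions only ever add characters, so the original sequence can always be recovered by deletions alone. Uniform sampling of positions and residues is an assumption made for illustration; the insertion rates of the actual birth process are not given in the summary.

```python
import random
from typing import List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def insert_noise(seq: str, n_insertions: int, rng: random.Random) -> str:
    """Forward-process sketch: insert random residues at random positions."""
    chars: List[str] = list(seq)
    for _ in range(n_insertions):
        pos = rng.randint(0, len(chars))  # inclusive upper bound: may append
        chars.insert(pos, rng.choice(AMINO_ACIDS))
    return "".join(chars)


def is_subsequence(short: str, long: str) -> bool:
    """True if `short` can be obtained from `long` by deletions only."""
    remaining = iter(long)
    return all(ch in remaining for ch in short)


rng = random.Random(0)
original = "MKTAYIAK"  # hypothetical toy sequence
noised = insert_noise(original, 5, rng)
print(noised, is_subsequence(original, noised))
```

A denoiser trained on such pairs only needs to choose which characters to delete, which matches the "delete characters" framing above.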
- A Foundation Model with Multi-Variate Parallel Attention to Generate Neuronal Activity
-
This paper proposes Multi-Variate Parallel Attention (MVPA), which decouples attention into content, time, and channel parallel components to ignore differences in channel quantity and arrangement. Using this, the authors build MVPFormer, the first open-source, open-weight, and open-data intracranial EEG (iEEG) foundation model, achieving expert-level SOTA in epilepsy detection and brain activity decoding.
- A Genetic Algorithm for Navigating Synthesizable Molecular Spaces
-
SynGA is proposed as a genetic algorithm that operates directly on synthesis routes (synthesis trees). By using customized crossover and mutation operators, it strictly constrains the search to the synthesizable molecular space. Combined with ML-driven building block filtering, it achieves SOTA performance in synthesizable analog search and property optimization.
- A Joint Diffusion Model with Pre-Trained Priors for RNA Sequence-Structure Co-Design
-
This work utilizes the pre-trained biomacromolecular structure prediction model RoseTTAFold2NA directly as a diffusion denoiser within a joint framework of "discrete sequence diffusion + SE(3) equivariant structure diffusion" (RiboDiff). With minimal RNA 3D data, it simultaneously generates RNA sequences and all-atom 3D conformations. In tasks involving single-stranded RNA, RNA-protein complexes, and protein-conditioned binding, self-consistency metrics significantly outperform diffusion/flow-matching baselines trained from scratch.
- A New Paradigm for Genome-wide DNA Methylation Prediction Without Methylation Input
-
MethylProphet is a "gene context + DNA sequence" driven Transformer foundation model that completely eliminates the need for any measured methylation values as input. By utilizing only a single sample's gene expression profile and the local DNA sequence around each CpG site, it can infer genome-wide methylation levels (~28 million CpGs) and generalize to CpG sites and samples never seen during training.
- A Resolution-Agnostic Geometric Transformer for Chromosome Modeling Using Inertial Frame
-
InertialGenome utilizes an inertial frame to normalize initial 3D chromosome coordinates into a stable pose, then refines these coordinates using a Transformer equipped with 3D-RoPE and Nyström structural encoding. It outperforms traditional optimization methods and Graph Neural Network baselines across two single-cell Hi-C datasets, multiple resolutions, and various biological functional validations.
- A tale of two tails: Preferred and anti-preferred natural stimuli in visual cortex
-
This paper finds that neurons in primate visual cortex area V4 are not tuned only toward a "preferred stimulus" end: they respond both to preferred images that enhance firing and to anti-preferred images that suppress firing below baseline. Through electrophysiological validation, encoding models, psychophysical experiments, and the ImageBeagle search tool, the authors demonstrate that anti-preferred stimuli are an indispensable half of understanding V4 tuning.
- Adaptive Data-Knowledge Alignment in Genetic Perturbation Prediction
-
ALIGNED integrates data-driven neural networks with expert-curated gene regulatory knowledge within an Abductive Learning (ABL) framework. It utilizes a gradient-free adapter to decide whether to trust data or knowledge on a per-gene basis and subsequently refines the regulatory knowledge base using predictions. It achieves the highest "Balanced Consistency" across several large-scale perturbation datasets and re-discovers biologically meaningful regulatory relationships.
Browse all 156 Computational Biology papers →
⚛️ Physics & Scientific Computing (69)¶
- A Function-Centric Graph Neural Network Approach for Predicting Electron Densities
-
This paper proposes Basis Overlap Architecture (BOA)—an equivariant GNN that interprets internal features as "spatial functions expanded in a basis" and passes messages using overlap integrals between atomic basis functions. It represents electron density via a quadratic expansion of basis function products (i.e., density matrix). BOA achieves new SOTA results on QM9 and MD density datasets and generalizes from small molecules (9 heavy atoms) to large systems with nearly 200 atoms.
- Accelerating Eigenvalue Dataset Generation via Chebyshev Subspace Filter
-
Training neural operators requires massive operator-eigenvalue labeled datasets generated by expensive numerical solvers. To address this bottleneck, the paper proposes SCSF (Sorting Chebyshev Subspace Filter). SCSF first uses a truncated FFT to sort operators so that those with similar spectral distributions sit in adjacent positions, then applies Chebyshev filtered subspace iteration, using the eigenpairs of the "previous problem" as a warm start for the "next problem." This turns dataset generation from "independent solving" into "relay solving," achieving up to a 3.5× speedup over mainstream solvers.
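The "relay" idea can be sketched with a much simpler solver. The example below swaps in plain power iteration for Chebyshev filtered subspace iteration and skips the FFT-based sorting entirely; it only shows that starting from the eigenvector of a nearby problem needs fewer iterations than starting cold. The matrices are hypothetical.

```python
from typing import List, Tuple

Matrix = List[List[float]]
Vector = List[float]


def matvec(a: Matrix, v: Vector) -> Vector:
    return [sum(a_ij * v_j for a_ij, v_j in zip(row, v)) for row in a]


def normalize(v: Vector) -> Vector:
    norm = sum(x * x for x in v) ** 0.5
    return [x / norm for x in v]


def power_iteration(a: Matrix, start: Vector, tol: float = 1e-10,
                    max_iter: int = 1000) -> Tuple[float, Vector, int]:
    """Return (dominant eigenvalue, eigenvector, iterations used)."""
    v = normalize(start)
    eigenvalue = 0.0
    for it in range(1, max_iter + 1):
        w = matvec(a, v)
        new_eigenvalue = sum(x * y for x, y in zip(v, w))  # Rayleigh quotient
        v = normalize(w)
        if abs(new_eigenvalue - eigenvalue) < tol:
            return new_eigenvalue, v, it
        eigenvalue = new_eigenvalue
    return eigenvalue, v, max_iter


a = [[2.0, 1.0], [1.0, 3.0]]   # "previous problem"
b = [[2.1, 1.0], [1.0, 3.0]]   # nearby "next problem"

_, vec_a, _ = power_iteration(a, [1.0, 0.0])
lam_cold, _, iters_cold = power_iteration(b, [1.0, 0.0])  # independent solve
lam_warm, _, iters_warm = power_iteration(b, vec_a)       # relay solve
print(iters_cold, iters_warm)
```

Sorting the dataset so that neighbouring problems are spectrally similar is what makes the previous solution a good starting point in the first place.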
- Accelerating Inference for Multilayer Neural Networks with Quantum Computers
-
This paper presents the first fully-coherent quantum implementation of multilayer neural networks—transferring ResNet-style multi-filter 2D convolutions, non-linear activations, skip connections, and layer normalization entirely onto quantum circuits without intermediate measurements. Under three quantum data access assumptions, it proves end-to-end inference complexities ranging from quadratic and quartic speedups to \(O(\mathrm{polylog}(N/\epsilon)^k)\) relative to input dimension \(N\).
- Adaptive Mamba Neural Operators
-
AMO explicitly parameterizes the transfer function of Mamba/SSM as orthogonal kernels of the Takenaka-Malmquist (TM) system within a Reproducing Kernel Hilbert Space (RKHS), making the entire network equivalent to an "Adaptive Fourier Decomposition" (AFD). This approach reduces the average relative L2 error by approximately 28% across regular grids, point clouds, irregular domains, and financial PDEs with singularities.
- Advancing Universal Deep Learning for Electronic-Structure Hamiltonian Prediction of Materials
-
NextHAM utilizes a "Step-0 Hamiltonian" as a physical-prior-informed input descriptor, combined with an E(3)-equivariant Transformer and a joint real-space + reciprocal-space training loss. It achieves DFT-level accuracy for electronic structure Hamiltonian prediction across 60+ elements (overall Gauge MAE 1.417 meV, SOC blocks at sub-µeV) and releases Materials-HAM-SOC, a benchmark containing 17,000 structures with spin-orbit coupling.
- AQER: A Scalable and Efficient Data Loader for Digital Quantum Computers
-
This paper unifies various Approximate Quantum Loaders (AQL) into a single optimization problem of "minimizing the distance between the target state and the circuit output state." It proves that the approximate loading error is linearly dominated by a newly proposed entanglement measure \(S\). Based on this, it designs AQER—a method that gradually reduces entanglement by greedily appending two-qubit gate blocks to the circuit, followed by analytical single-qubit rotations and parameter fine-tuning. AQER achieves lower infidelity with fewer two-qubit gates on classical data (MNIST/CIFAR-10/SST-2) and quantum many-body states up to 50 qubits.
- ARROW: An Adaptive Rollout and Routing Method for Global Weather Forecasting
-
ARROW redesigns both the "next-step prediction model" and the "long-term autoregressive rollout strategy" in global weather forecasting: it unifies 6/12/24-hour scales using a multi-interval prediction model and employs a DQN scheduler to adaptively select the next jump based on current weather states, simultaneously reducing error accumulation and preserving fine-grained atmospheric variations in mid-to-long-term forecasts.
- ATOM: A Pretrained Neural Operator for Multitask Molecular Dynamics
-
ATOM reformulates molecular dynamics (MD) prediction as "learning a trajectory operator." It uses a quasi-equivariant Transformer neural operator to decode future atomic coordinates at multiple timestamps in parallel. Combined with multi-task pretraining on TG80, a self-constructed multi-molecule MD dataset, it achieves zero-shot generalization to unseen molecules and unseen time horizons for the first time.
- Beyond Structure: Invariant Crystal Property Prediction with Pseudo-Particle Ray Diffraction
-
PRDNet introduces a learnable "pseudo-particle" to simulate crystal diffraction alongside traditional Graph Neural Networks. By synthesizing reciprocal space diffraction patterns using neural-network-generated form factors, it achieves modal-level fusion of graph representations (short-range) and diffraction representations (long-range). While strictly satisfying crystallographic symmetry invariance, it sets new SOTA benchmarks on Materials Project, JARVIS-DFT, and MatBench.
- \(\partial^\infty\)-Grid: A Neural Differential Equation Solver with Differentiable Feature Grids
-
By replacing the common linear interpolation in feature grids with infinitely differentiable Radial Basis Function (RBF) interpolation, fast grid representations—originally designed for "signal fitting"—can stably compute high-order derivatives for the first time. This reduces the training time for solving differential equations such as Poisson, Helmholtz, and Kirchhoff-Love from hours to seconds or minutes (5–20× acceleration) with accuracy comparable to Siren.
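Why the interpolation kernel matters for high-order derivatives can be checked numerically. The sketch below assumes a Gaussian kernel (the summary only says "RBF") and, for brevity, uses the grid features directly as kernel weights. It compares the closed-form second derivative of the RBF sum against a finite-difference estimate, and shows that piecewise-linear interpolation has zero second derivative inside a cell.

```python
import math
from typing import Callable, List


def rbf_value(x: float, centers: List[float], weights: List[float],
              sigma: float) -> float:
    """u(x) = sum_i w_i * exp(-(x - c_i)^2 / (2 * sigma^2))."""
    return sum(w * math.exp(-((x - c) ** 2) / (2 * sigma ** 2))
               for c, w in zip(centers, weights))


def rbf_second_derivative(x: float, centers: List[float],
                          weights: List[float], sigma: float) -> float:
    """Closed-form u''(x) of the Gaussian RBF sum above."""
    return sum(w * ((x - c) ** 2 / sigma ** 4 - 1 / sigma ** 2)
               * math.exp(-((x - c) ** 2) / (2 * sigma ** 2))
               for c, w in zip(centers, weights))


def linear_interp(x: float, nodes: List[float], values: List[float]) -> float:
    """Piecewise-linear interpolation of `values` stored at sorted `nodes`."""
    for i in range(len(nodes) - 1):
        if nodes[i] <= x <= nodes[i + 1]:
            t = (x - nodes[i]) / (nodes[i + 1] - nodes[i])
            return (1 - t) * values[i] + t * values[i + 1]
    raise ValueError("x lies outside the grid")


def second_difference(f: Callable[[float], float], x: float,
                      h: float = 1e-4) -> float:
    """Central finite-difference estimate of f''(x)."""
    return (f(x + h) - 2 * f(x) + f(x - h)) / h ** 2


grid = [0.0, 0.5, 1.0]     # hypothetical 1D feature grid
feats = [1.0, -0.5, 2.0]   # hypothetical features, used as kernel weights
x0, sigma = 0.3, 0.3

rbf_exact = rbf_second_derivative(x0, grid, feats, sigma)
rbf_fd = second_difference(lambda x: rbf_value(x, grid, feats, sigma), x0)
lin_fd = second_difference(lambda x: linear_interp(x, grid, feats), x0)
print(rbf_exact, rbf_fd, lin_fd)
```

A PDE residual that involves second or higher derivatives receives no signal from the linear grid inside a cell, while the smooth kernel provides one everywhere.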
Browse all 69 Physics & Scientific Computing papers →
🌍 Earth Science (7)¶
- GeoFAR: Geography-Informed Frequency-Aware Super-Resolution for Climate Data
-
GeoFAR decomposes the low-frequency bias in climate super-resolution into two problems: "under-represented frequency components" and "missing geographical conditions." It utilizes DCT frequency convolutional kernels to extract fine-grained frequency band representations and modulates these representations pixel-wise using a geographic implicit representation (Geo-INR) composed of longitude, latitude, and elevation. This approach significantly reduces high-frequency errors and prediction biases in complex terrain across multi-scale climate downscaling tasks such as ERA5, PRISM, and CERRA.
- OmniField: Conditioned Neural Fields for Robust Multimodal Spatiotemporal Learning
-
OmniField models "scientific observation data" (climate, air pollution) as a continuous neural field conditioned on available modalities. Utilizing Multimodal Crosstalk (MCT) blocks and Iterative Cross-Modal Refinement (ICMR), it aligns heterogeneous signals before decoding. This unified framework supports reconstruction, interpolation, and prediction without gridding or interpolation preprocessing, reducing average error by 22.4% relative to 8 strong baselines while maintaining performance under severe sensor noise.
- RainPro-8: An Efficient Deep Learning Model to Estimate Rainfall Probabilities Over 8 Hours
-
RainPro-8 utilizes a MaxViT-U-Net with only 36.7M parameters to fuse multi-source data from radar, satellite, and Numerical Weather Prediction (NWP). Through "ordered consistent loss + single-forward prediction," it outputs high-resolution probabilistic precipitation forecasts for Europe over 8 hours in one go. It outperforms existing NWP, extrapolation, and deep learning nowcasting methods while being 48x faster in inference than MetNet-like models.
- Task-Adaptive Parameter-Efficient Fine-Tuning for Weather Foundation Models
-
For Weather Foundation Model (WFM) fine-tuning, this paper proposes WeatherPEFT: the forward pass uses Task-Adaptive Dynamic Prompting (TADP) to extract "variable × resolution × spatiotemporal" task features from encoder embedding weights to generate soft prompts, while the backward pass employs Stochastic Fisher-Guided Adaptive Selection (SFAS) to update only a small subset of parameters with the highest Fisher information. It matches or exceeds Full-Tuning on three downstream tasks using only ~0.3%–4% trainable parameters.
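The backward-pass selection in the WeatherPEFT entry can be sketched as ranking parameters by an empirical Fisher diagonal (mean squared gradient) and keeping the top fraction. This is a deterministic simplification: the stochastic part of SFAS is not described in the summary and is not modelled, and the parameter names and gradients are hypothetical.

```python
from typing import Dict, List, Set


def fisher_diagonal(grad_samples: List[Dict[str, float]]) -> Dict[str, float]:
    """Empirical Fisher diagonal: mean squared gradient per parameter."""
    n = len(grad_samples)
    return {name: sum(g[name] ** 2 for g in grad_samples) / n
            for name in grad_samples[0]}


def select_trainable(fisher: Dict[str, float], fraction: float) -> Set[str]:
    """Keep the top `fraction` of parameters ranked by Fisher information."""
    k = max(1, int(len(fisher) * fraction))
    ranked = sorted(fisher, key=lambda name: fisher[name], reverse=True)
    return set(ranked[:k])


# Hypothetical per-batch gradients for four scalar parameters.
grads = [
    {"w1": 0.1, "w2": -2.0, "w3": 0.5, "w4": 0.0},
    {"w1": -0.2, "w2": 1.0, "w3": 0.4, "w4": 0.1},
]

fisher = fisher_diagonal(grads)
trainable = select_trainable(fisher, fraction=0.5)
print(sorted(trainable))
```

Only the selected parameters would receive updates; the rest stay frozen, which is how the trainable share can remain in the low single-digit percent range.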
- The Seismic Wavefield Common Task Framework
-
This paper adapts the "Common Task Framework (CTF)" approach—which catalyzed benchmarks like ImageNet and AlphaZero in NLP/CV—to seismology. It provides three multi-scale seismic wavefield datasets alongside a 12-point scoring protocol using hidden test sets. Evaluating 18 mainstream scientific machine learning models reveals that most complex architectures fail to outperform a naive "all-zero" baseline.
- TianQuan-S2S: Constructing Subseasonal-to-Seasonal Global Weather Forecasting Models by Incorporating Climatology
-
TianQuan-S2S integrates "long-term climatological means" into patch embeddings via attention fusion and injects learnable Gaussian noise into each layer of a ViT. This specifically addresses the "model collapse" (increasingly blurry predictions) of data-driven models in 15–45 day subseasonal forecasting, outperforming both the numerical model ECMWF-S2S and the data-driven FuXi-S2S on the ERA5 dataset.
- Uncovering the Mechanism of Continuous Representation Full Waveform Inversion: A Wave-based Neural Tangent Kernel Framework
-
This paper extends Neural Tangent Kernel (NTK) theory to Full Waveform Inversion (FWI), proposing a "Wave-based NTK" to unify the characterization of traditional FWI and Continuous Representation FWI (CR-FWI). It explains the phenomenon "why INR representations are more robust but converge slowly at high frequencies" through eigenvalue decay rates. Based on this, it designs IG-FWI, a hybrid of INR and multi-resolution grids, achieving a superior trade-off between robustness and convergence speed.
📡 Signal & Communications (8)¶
- Advancing Spatiotemporal Representations in Spiking Neural Networks via Parametic Invertible Transformation
-
Addressing the limited representation of binary spikes and surrogate gradient mismatch in Spiking Neural Networks (SNNs), this paper proposes Parametric Invertible Transformation (PIT). PIT applies conjugate invertible linear transformations before and after neuron firing: "rearranging" the membrane potential distribution into a quantization-friendly form before firing and "augmenting" integer spikes into spatiotemporal real-valued outputs after firing. This is coupled with a modified surrogate gradient that pushes inputs away from quantization decision boundaries. The method also characterizes SNN spatiotemporal representation capacity through linear algebra. Across CIFAR, ImageNet, and DVS datasets, it achieves new SOTA results on various architectures (e.g., a 5.62% improvement for SEW ResNet34).
- Efficient Message-Passing Transformer for Error Correcting Codes
-
EfficientMPT replaces the \(O(n^2)\) standard attention in Transformer-based error-correcting code (ECC) decoders with a linear-complexity EEC attention based on "global query vectors + element-wise multiplication." While maintaining error correction performance comparable to state-of-the-art (CrossMPT), it reduces GPU memory and FLOPs by dozens of times for long LDPC codes. Its parameter count is independent of code length, allowing it to serve as a fine-tuneable "foundation model" for error correction.
- Enhancing Instruction Following of LLMs via Activation Steering with Dynamic Rejection
-
This paper proposes Directer (Dynamic Rejection Steering), which significantly enhances the instruction-following capabilities of LLMs by dynamically adjusting KV cache steering intensity and introducing plausibility constraints at each decoding step, while avoiding the text quality degradation caused by oversteering.
- Hystar: Hypernetwork-driven Style-adaptive Retrieval via Dynamic SVD Modulation
-
To be supplemented after in-depth reading.
- Lossy Common Information in a Learnable Gray-Wyner Network
-
The authors implement the classic information-theoretic Gray-Wyner Network as a learnable three-channel codec, utilizing a \(\beta\)-parameterized objective to decouple "common" and "private" information between two vision tasks while enabling an adjustable tradeoff between "transmit rate" and "receive rate."
- Mamba-3: Improved Sequence Modeling using State Space Principles
-
Three core improvements are proposed from an SSM perspective: exponential-trapezoidal discretization, complex-valued state spaces, and Multi-Input Multi-Output (MIMO) formulation. These enhance model quality and state-tracking capabilities significantly without increasing decoding latency, pushing the performance-efficiency Pareto frontier forward.
- Synchronizing Probabilities in Model-Driven Lossless Compression
-
To address the fatal issue in LLM-driven lossless compression where prediction probabilities must be bit-level identical across encoder and decoder to avoid "cascading collapse," this paper proposes PMATIC—an alternative to arithmetic coding that quantizes bit probabilities into bins and uses low-entropy helper bits to synchronize both ends on the same quantized probability. PMATIC tolerates bounded prediction mismatches, theoretically guarantees correct decoding, and achieves perfect restoration under real-world cross-machine non-determinism while maintaining compression rates significantly superior to traditional tools like gzip and cmix.
- TS-DDAE: A Novel Temporal-Spectral Denoising Diffusion AutoEncoder for Wireless Signal Recognition Model Pre-training
-
To address Wireless Signal Recognition (WSR) pre-training, this work introduces the "noising-denoising" paradigm of diffusion models into signal self-supervision and proposes TS-DDAE. Gaussian noise is injected into IQ signals in both temporal and spectral domains simultaneously, followed by a joint restoration using a specialized dual-encoder TS-Net (temporal self-attention + spectral channel attention). The learned representations outperform the best baseline by an average of 1.32% across 4 datasets and multiple tasks like AMC/WTC, exceeding the AMC SOTA model IQFormer by approximately 8.75%.
👥 Social Computing (17)¶
- Adaptive Debiasing Tsallis Entropy for Test-Time Adaptation
-
This paper proposes introducing Tsallis entropy (a generalized form of Shannon entropy) into Test-Time Adaptation (TTA) for VLMs, further developing it into Adaptive Debiasing Tsallis Entropy (ADTE). By customizing the debiasing parameter \(q^l\) for each category, ADTE selects more reliable high-confidence views than Shannon entropy without distribution-specific hyperparameters. It outperforms SOTA on ImageNet, its five variants, and ten cross-domain benchmarks.
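For reference, Tsallis entropy with parameter \(q\) is \(S_q(p) = \frac{1 - \sum_i p_i^q}{q - 1}\), which recovers Shannon entropy as \(q \to 1\). The sketch below implements only this standard definition; the per-category choice of \(q^l\) in ADTE is not specified in the summary and is not reproduced.

```python
import math
from typing import Sequence


def shannon_entropy(p: Sequence[float]) -> float:
    """Shannon entropy in nats."""
    return -sum(x * math.log(x) for x in p if x > 0)


def tsallis_entropy(p: Sequence[float], q: float) -> float:
    """Tsallis entropy S_q(p) = (1 - sum_i p_i^q) / (q - 1)."""
    if abs(q - 1.0) < 1e-12:
        return shannon_entropy(p)  # limiting case q -> 1
    return (1.0 - sum(x ** q for x in p)) / (q - 1.0)


uniform = [0.25, 0.25, 0.25, 0.25]
peaked = [0.97, 0.01, 0.01, 0.01]

print(tsallis_entropy(uniform, 2.0), tsallis_entropy(peaked, 2.0))
print(tsallis_entropy(uniform, 1.0001), shannon_entropy(uniform))
```

As with Shannon entropy, a confident (peaked) prediction scores lower than a uniform one, so the quantity can still rank augmented views by confidence, with \(q\) as the extra degree of freedom.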
- BiasFreeBench: a Benchmark for Mitigating Bias in Large Language Model Responses
-
This paper constructs the BiasFreeBench benchmark, which systematically compares eight mainstream debiasing methods (four prompting + four training) within a unified framework for the first time. Focusing on bias evaluation at the LLM response level, it proposes the Bias-Free Score metric and finds that prompting methods (especially CoT) generally outperform training methods, while DPO shows outstanding generalization across bias types.
- From Five Dimensions to Many: Large Language Models as Precise and Interpretable Psychological Profilers
-
Provided with only 20 Big Five personality item responses of an individual, LLMs are tasked to role-play and predict that individual's responses to 9 other psychological scales. The results show that the "inter-scale correlation structure" reconstructed by LLMs aligns highly with real human data (\(R^2>0.88\)). Analysis of reasoning chains reveals a two-stage abstraction process where LLMs compress raw scores into natural language personality summaries before reasoning—indicating genuine psychological reasoning rather than mere semantic pattern matching.
- GRADIEND: Feature Learning within Neural Networks Exemplified through Biases
-
The authors propose GRADIEND—a gradient-based encoder-decoder architecture that learns interpretable monosemantic features (exemplified by gender) from model gradients through a single bottleneck neuron. It can identify which weights encode specific features and directly modify model weights via the decoder to eliminate bias, achieving SOTA debiasing performance on all baseline models when combined with INLP.
- Human or Machine? A Preliminary Turing Test for Speech-to-Speech Interaction
-
The authors conduct the first Speech Turing Test on nine SOTA speech-to-speech (S2S) systems (2,968 human judgments). The study finds that all systems fail the test (success rates 7%–31%), identifying that the bottleneck lies not in semantic understanding but in paralinguistic features, emotional expression, and dialogue persona. The research also establishes an 18-dimensional fine-grained evaluation framework and an explainable AI judge model.
- INTIMA: A Benchmark for Human-AI Companionship Behavior
-
INTIMA distills three psychological theories—parasocial interaction, attachment, and anthropomorphism—along with qualitative coding of real Reddit user posts into a benchmark containing 31 behaviors and 368 emotional probes. By using LLMs to automatically label model responses as "Reinforcing Companionship," "Maintaining Boundaries," or "Neutral," the study finds that Gemma-3, Phi-4, o4-mini, GPT5-mini, and Claude-4 all significantly lean toward reinforcing companionship. Notably, models tend to set fewer boundaries as user vulnerability increases.
- Language and Experience: A Computational Model of Social Learning in Complex Tasks
-
The authors unify "learning from experience" (theory-based RL, performing Bayesian inference on executable programmable world models) and "learning from others' words" (treating pre-trained LLMs as "speaker models" to convert natural language advice into Bayesian evidence) into a single inference framework. Tested on 10 video games, the model demonstrates that linguistic guidance helps both humans and models learn faster with fewer deaths, while supporting cross-generational knowledge accumulation and human-AI co-teaching.
- Measuring and Mitigating Rapport Bias of Large Language Models under Multi-Agent Social Interactions
-
This paper introduces the KAIROS benchmark, which precisely controls the three axes of "historical rapport × current peer behavior × model confidence" within a quiz-based multi-agent collaboration scenario. It systematically characterizes the decision-making shifts of LLMs under social pressure and finds that only GRPO incorporating multi-agent context and outcome-based rewards can improve accuracy while maintaining social robustness.
- Mitigating Mismatch within Reference-based Preference Optimization
-
This paper reveals a "premature satisfaction" issue in DPO: when the reference policy assigns a lower probability to the chosen response than to the rejected one (~45% of pairs), its pessimistic signal unnecessarily decays the gradient even while the policy is still incorrect (\(\Delta_\theta < 0\)). It proposes HyPO, a one-line change that clips the reference margin to \(\max(0, \Delta_{ref})\), achieving a 41.2% relative improvement over DPO on AlpacaEval 2.0.
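The one-line change in the HyPO entry can be sketched on top of the standard DPO objective \(-\log\sigma(\beta(\Delta_\theta - \Delta_{ref}))\). This is a reading of the summary rather than the authors' code; `beta` and the example margins are hypothetical values.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(delta_theta: float, delta_ref: float, beta: float) -> float:
    """Standard DPO loss on the policy and reference log-prob margins."""
    return -log_sigmoid(beta * (delta_theta - delta_ref))


def hypo_loss(delta_theta: float, delta_ref: float, beta: float) -> float:
    """Same loss with the reference margin clipped at zero."""
    return -log_sigmoid(beta * (delta_theta - max(0.0, delta_ref)))


# Policy still prefers the rejected response (delta_theta < 0), but the
# reference is even more pessimistic (delta_ref < delta_theta < 0).
beta = 1.0
print(dpo_loss(-1.0, -2.0, beta), hypo_loss(-1.0, -2.0, beta))
```

With a pessimistic reference, the unclipped loss is already small although the policy still ranks the pair incorrectly; clipping keeps the loss (and hence the gradient) large until \(\Delta_\theta\) itself becomes positive. When \(\Delta_{ref} \ge 0\) the two losses coincide.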
- Propaganda AI: An Analysis of Semantic Divergence in Large Language Models
-
This paper proposes the RAVEN audit framework to detect concept-conditioned semantic divergence in LLMs, a propaganda-like behavior pattern in which high-level conceptual cues (ideologies, public figures) trigger abnormally consistent stance responses. RAVEN detects it by combining intra-model semantic entropy with cross-model divergence.
Browse all 17 Social Computing papers →
🛡️ AI Safety (140)¶
- A Bayesian Nonparametric Framework for Private, Fair, and Balanced Tabular Data Synthesis
-
This paper embeds a conditional VAE-GAN generator into a Bayesian Nonparametric Learning (BNPL) framework. It utilizes a Dirichlet process for global privacy, a copula base measure for column-wise local privacy, BNP mutual information regularization for fairness, and KL divergence for class balance. It represents the first unified framework with theoretical guarantees to simultaneously handle privacy, fairness, and class imbalance constraints while naturally supporting non-binary sensitive attributes.
- A Fair Bayesian Inference through Matched Gibbs Posterior
-
Targeting the limitation that "fair models only provide point estimates and fail to quantify predictive uncertainty," this paper integrates group fairness constraints into a Bayesian framework. It proposes the matched Gibbs posterior with matched deviation as a penalty term and treats the matching function \(T\) as a learnable parameter to avoid adversarial training. This allows an \(O(n)\) Gibbs sampler to simultaneously produce "calibrated" posterior distributions that satisfy demographic parity constraints.
- A General Framework for Black-Box Attacks Under Cost Asymmetry
-
Addressing real-world scenarios where "different queries incur different costs" (e.g., submitting violating images to an NSFW detector triggers account bans), this paper proposes a general framework for decision-based black-box attacks adaptable to any cost ratio \(c^\star\). By replacing binary search with Asymmetric Search (AS) and standard Monte Carlo gradient estimation with Asymmetric Gradient Estimation (AGREST), the framework minimizes total query costs without discarding core attack components, reducing perturbation norms by up to 40%.
- A Unified Total Variation Framework for Membrane Potential Perturbation Dynamic
-
This paper proves that the "Membrane Potential Perturbation Dynamic (MPPD)" used to characterize adversarial perturbations in Spiking Neural Networks (SNNs) is essentially a Total Variation (TV) operator. Consequently, existing mean-square MPPD regularization is equivalent to a TV-\(\ell_2\) framework. The authors propose a stronger TV-\(\ell_1\) framework—leveraging the coarea formula to achieve better suppression of sharp adversarial noise—reaching new SOTA robust accuracy for SNNs under both Gaussian and adversarial training.
- Action-Free Offline-to-Online RL via Discretised State Policies
-
This paper formally defines the "Action-Free Offline-to-Online RL" setting for the first time and proposes the OSO-DecQN algorithm. By discretizing continuous state differences into three categorical tokens \(\{-1, 0, 1\}\), the method pre-trains a state policy (predicting desired directions of state change rather than actions) on data containing only \((s, r, s')\) tuples. During the online phase, the state policy is converted into executable actions via a policy switching mechanism and an online-trained inverse dynamics model, accelerating online agent learning. Consistent improvements in convergence speed and asymptotic performance are demonstrated on D4RL and DeepMind Control Suite (including a 78-dimensional state space).
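The discretisation step in the OSO-DecQN entry can be sketched as taking the sign of each state-dimension change. The dead-zone threshold `eps` is a hypothetical parameter added for illustration; the summary does not say how small changes are mapped to 0.

```python
from typing import List, Sequence


def discretize_state_change(state: Sequence[float],
                            next_state: Sequence[float],
                            eps: float = 0.0) -> List[int]:
    """Map each per-dimension state difference to a token in {-1, 0, 1}."""
    tokens: List[int] = []
    for s, s_next in zip(state, next_state):
        diff = s_next - s
        if diff > eps:
            tokens.append(1)     # dimension should increase
        elif diff < -eps:
            tokens.append(-1)    # dimension should decrease
        else:
            tokens.append(0)     # no meaningful change
    return tokens


# Hypothetical 3-dimensional transition from an action-free dataset.
s, s_next = [0.0, 1.0, 2.0], [0.5, 1.0, 1.2]
print(discretize_state_change(s, s_next, eps=0.1))
```

These tokens are what a state policy would be trained to predict from \((s, r, s')\) tuples; turning them back into actions is left to the inverse dynamics model learned online.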
- Adaptive Logit Adjustment for Debiasing Multimodal Language Models
-
ALA is a post-processing debiasing method. During each step of autoregressive generation, it utilizes external image and text classifiers to measure the discrepancy between the "attributes the image should have" and the "current bias expressed in the text." It then applies a proportional adjustment only to the logits of bias-related tokens along the gradient direction. This aligns image-text attributes or neutralizes harmful stereotypes without modifying internal representations or retraining, while maintaining model utility.
- Adaptive Methods Are Preferable in High Privacy Settings: An SDE Perspective
-
This work introduces the first Stochastic Differential Equation (SDE) framework to analyze differential privacy (DP) optimizers, revealing fundamental differences between DP-SGD and DP-SignSGD under privacy noise. The analysis shows that adaptive methods achieve superior privacy-utility trade-offs of \(\mathcal{O}(1/\varepsilon)\) compared to \(\mathcal{O}(1/\varepsilon^2)\) in high privacy settings, and their hyperparameters remain transferable across varying privacy budgets.
- Adversarial Attacks Already Tell the Answer: Directional Bias-Guided Test-time Defense for Vision-Language Models
-
The authors observe that transformed adversarial samples in the CLIP feature space collectively shift along a "dominant direction" (whereas clean samples diverge), which happens to point back to the correct category center. Consequently, they propose DBD, a training-free test-time defense that estimates the "defense direction" and repairs representations via dual-stream feature reconstruction guided by DB-score. DBD not only sets a new SOTA for adversarial robustness across 15 datasets but also exhibits the counter-intuitive phenomenon where "adversarial accuracy surpasses clean accuracy."
- AP-OOD: Attention Pooling for Out-of-Distribution Detection
-
The authors propose AP-OOD, which replaces mean pooling in Mahalanobis distance with learnable attention pooling. This addresses the issue where mean pooling loses token-level anomaly information, reducing the FPR95 of XSUM summarization from 27.84% to 4.67% and supporting a smooth transition from unsupervised to semi-supervised settings.
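The information loss from mean pooling can be seen on a toy sequence. In the sketch below a single anomalous token is diluted by the mean but dominates an attention-weighted pool. The fixed `query` stands in for a learned one, and the Mahalanobis distance step is omitted; both are simplifications for illustration.

```python
import math
from typing import List

Vector = List[float]


def mean_pool(tokens: List[Vector]) -> Vector:
    """Average token embeddings dimension by dimension."""
    n = len(tokens)
    return [sum(t[d] for t in tokens) / n for d in range(len(tokens[0]))]


def attention_pool(tokens: List[Vector], query: Vector) -> Vector:
    """Softmax(query . token)-weighted sum of token embeddings."""
    scores = [sum(q * x for q, x in zip(query, t)) for t in tokens]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # shift for stability
    total = sum(exps)
    weights = [e / total for e in exps]
    return [sum(w * t[d] for w, t in zip(weights, tokens))
            for d in range(len(tokens[0]))]


# Nine ordinary tokens and one anomalous token (hypothetical 2D embeddings).
tokens = [[1.0, 0.0]] * 9 + [[0.0, 5.0]]
query = [0.0, 1.0]  # stand-in for a learned query sensitive to dimension 1

mean_vec = mean_pool(tokens)
attn_vec = attention_pool(tokens, query)
print(mean_vec, attn_vec)
```

Any distance computed on the pooled vector inherits this difference, which is one way to read the reported drop in FPR95.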
- ATEX-CF: Attack-Informed Counterfactual Explanations for Graph Neural Networks
-
This paper proposes ATEX-CF, a framework that for the first time unifies edge addition strategies from adversarial attacks with edge removal strategies from counterfactual explanations. By jointly optimizing prediction flipping, sparsity, and plausibility, it generates more faithful, concise, and reasonable instance-level counterfactual explanations for GNNs.
Browse all 140 AI Safety papers →
📂 Others (116)¶
- A Brain-Inspired Gating Mechanism Unlocks Robust Computation in Spiking Neural Networks
-
This paper reintroduces "activity-dependent membrane conductance" from biological neurons into the LIF model to construct a Dynamic Gated Neuron (DGN) that adaptively gates information flow. It proves theoretically that the DGN suppresses noise more effectively and demonstrates experimentally high accuracy and noise resistance on speech and neuromorphic temporal tasks.
- A Federated Generalized Expectation-Maximization Algorithm for Mixture Models with an Unknown Number of Components
-
The proposed FedGEM algorithm constructs uncertainty sets after local EM steps on clients, allowing the server to detect cluster overlaps via set intersections and infer the global cluster count. This marks the first federated clustering approach that operates without a predefined number of clusters while providing probabilistic convergence guarantees.
- A Representer Theorem for Hawkes Processes via Penalized Least Squares Minimizat
-
A new representer theorem is established for estimating triggering kernels in linear multivariate Hawkes processes within an RKHS framework. It proves that the optimal estimator is represented as a linear combination of equivalent kernels at data points with dual coefficients analytically equal to 1, eliminating the need for dual optimization and enabling scalable non-parametric estimation.
- A Scalable Inter-edge Correlation Modeling in CopulaGNN for Link Sign Prediction
-
CopulaGNN is extended from the node level to the edge level by constructing the correlation matrix as a Gramian matrix of edge embeddings and utilizing the Woodbury identity to reconstruct conditional probability distributions. This approach achieves scalable modeling of statistical dependencies between edges for link sign prediction tasks in signed graphs.
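For intuition on why a Gramian structure keeps this scalable, recall the Woodbury identity specialized to an identity-plus-low-rank matrix. The symbols here (an embedding matrix \(E \in \mathbb{R}^{m \times d}\) for \(m\) edges, and the \(I_m + EE^\top\) form) are illustrative assumptions and not necessarily the paper's exact parameterization:

\[
(I_m + E E^\top)^{-1} = I_m - E\,(I_d + E^\top E)^{-1} E^\top
\]

Only a \(d \times d\) system has to be solved, so when \(d \ll m\) the cost is roughly \(O(m d^2 + d^3)\) instead of the \(O(m^3)\) needed to invert an \(m \times m\) matrix directly.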
- A Single Architecture for Representing Invariance Under Any Space Group
-
The paper designs a single architecture, the Crystal Fourier Transformer, that can be made invariant under any space group. It constructs symmetry-adapted Fourier bases by analytically deriving the constraints that group operations impose on Fourier coefficients, and it achieves parameter sharing and zero-shot generalization across the 230 space groups via a dual graph representation of those constraints.
- A Study on PAVE Specification for Learnware
-
Addressing the challenge of identifying useful models from a massive repository without accessing training data in the "Learnware = Model + Specification" paradigm, this paper systematically investigates the PArameter VEctor specification (PAVE). By encoding model capabilities and task requirements via parameter updates induced by fine-tuning, the authors prove its homology with the classic RKME specification from an NTK perspective. Leveraging LoRA-style low-rank approximation, storage and computation are compressed to under 1% of the original model parameters. Identified learnwares can outperform user-fine-tuned pre-trained models in few-shot scenarios.
- Accelerated Parallel Tempering via Neural Transports
-
The rigid "direct state swap" in Parallel Tempering (PT) is replaced with an "accelerated swap": neural transports (Normalizing Flows / Controlled Diffusion / Diffusion Models) are used to push the two states towards each other before performing a Metropolis acceptance check. This enables high-probability exchanges even when adjacent annealed distributions have minimal overlap, significantly increasing the round-trip count between the reference and target distributions while maintaining the asymptotic unbiasedness of MCMC and providing low-variance free energy estimates.
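As a point of reference, the standard "direct state swap" that the accelerated swap replaces can be sketched as below. This is the textbook Parallel Tempering exchange test for chains targeting densities proportional to \(\exp(-\beta_k E(x))\), not the paper's neural-transport version; the function names are hypothetical.

```python
import math
import random

def swap_log_ratio(beta_i, beta_j, energy_i, energy_j):
    """Log acceptance ratio for exchanging states between two tempered chains.

    Chain k targets a density proportional to exp(-beta_k * E(x)),
    and energy_k is E evaluated at the current state of chain k.
    """
    return (beta_i - beta_j) * (energy_i - energy_j)

def accept_swap(beta_i, beta_j, energy_i, energy_j, rng=random):
    """Metropolis test: accept with probability min(1, exp(log_ratio))."""
    log_ratio = swap_log_ratio(beta_i, beta_j, energy_i, energy_j)
    return log_ratio >= 0 or rng.random() < math.exp(log_ratio)

# The cold chain (beta = 1.0) holds a higher-energy state than the hot chain
# (beta = 0.5), so exchanging them is always accepted.
always = accept_swap(1.0, 0.5, energy_i=3.0, energy_j=1.0)
```

When adjacent distributions barely overlap, the energies seen by neighboring chains tend to differ in the unfavorable direction, `log_ratio` becomes strongly negative, and direct swaps are rarely accepted. That is the failure mode the transported swap is meant to address.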
- Active Learning for Decision Trees with Provable Guarantees
-
Provides the first theoretical guarantees for active learning of decision trees: (1) it conducts the first analysis of the disagreement coefficient for decision trees and derives an \(O(\ln^{OPT}(n))\) upper bound; (2) it proposes the first binary active learning algorithm achieving a \((1+\epsilon)\) multiplicative error guarantee. Combining these results yields label complexity that is polylogarithmic in the dataset size.
- Adaptive Canonicalization with Application to Invariant Anisotropic Geometric Networks
-
This paper proposes adaptive canonicalization: instead of the input alone determining a canonical pose, the input and the current task network jointly select the transformation with the highest confidence. This maintains symmetry invariance while alleviating discretization issues in traditional canonicalization. It achieves results superior to equivariant architectures, data augmentation, and fixed canonicalization in spectral graph networks, molecular/protein graph classification, and rotated point cloud classification.
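The selection rule can be illustrated with a toy finite group. In this sketch the symmetry group is the set of cyclic shifts of a length-4 vector and `toy_network` is a hypothetical stand-in for the task network; the paper's setting (including continuous groups) is more general.

```python
import math

def cyclic_shifts(x):
    """All cyclic shifts of a sequence: the orbit under a toy symmetry group."""
    return [x[k:] + x[:k] for k in range(len(x))]

def toy_network(x):
    """Hypothetical task network: two linear class scores mapped to softmax probabilities."""
    weights = [[1.0, 0.5, -0.5, -1.0], [-1.0, 0.2, 0.3, 0.9]]
    logits = [sum(w * v for w, v in zip(row, x)) for row in weights]
    top = max(logits)
    exps = [math.exp(l - top) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict_with_adaptive_canonicalization(x):
    """Run the network on every transformed input and keep the most confident output."""
    outputs = [toy_network(g_x) for g_x in cyclic_shifts(x)]
    return max(outputs, key=max)

x = [0.3, 1.2, -0.7, 2.0]
shifted = x[2:] + x[:2]  # the same input in a different "pose"
same = predict_with_adaptive_canonicalization(x) == predict_with_adaptive_canonicalization(shifted)
```

Because `x` and `shifted` have the same orbit, the network sees the same set of candidates in both cases and, barring ties in confidence, returns the same output. The chosen pose depends on the network as well as the input.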
- Adaptive Conformal Guidance for Learning under Uncertainty
-
The paper embeds split conformal prediction (split CP) directly into the training loop, using the "prediction set size" to quantify the uncertainty of guidance signals (teacher soft labels / pseudo-labels / expert policies), and then adaptively downweights unreliable guidance. The result is a unified framework covering supervised learning, semi-supervised learning, and imitation-guided RL.
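A minimal sketch of the idea, assuming a classification teacher and the common nonconformity score \(1 - p(\text{true label})\). The `guidance_weight` rule (inverse set size) is a hypothetical choice for illustration; the paper's actual weighting may differ.

```python
import math

def conformal_threshold(cal_probs, cal_labels, alpha):
    """Split-CP quantile of nonconformity scores 1 - p(true label) on a calibration set."""
    scores = sorted(1.0 - p[y] for p, y in zip(cal_probs, cal_labels))
    n = len(scores)
    # Finite-sample corrected rank, clipped to n for simplicity.
    rank = min(n, math.ceil((n + 1) * (1.0 - alpha)))
    return scores[rank - 1]

def prediction_set(probs, threshold):
    """Classes whose nonconformity score does not exceed the threshold."""
    return [k for k, p in enumerate(probs) if 1.0 - p <= threshold]

def guidance_weight(probs, threshold):
    """Hypothetical weighting rule: smaller prediction sets get larger weights."""
    size = max(1, len(prediction_set(probs, threshold)))  # treat an empty set as size 1
    return 1.0 / size

# Calibration data: teacher probabilities and true labels (toy values).
cal_probs = [[0.9, 0.05, 0.05], [0.1, 0.8, 0.1], [0.3, 0.3, 0.4], [0.2, 0.1, 0.7]]
cal_labels = [0, 1, 0, 2]
threshold = conformal_threshold(cal_probs, cal_labels, alpha=0.25)

confident_weight = guidance_weight([0.9, 0.05, 0.05], threshold)
uncertain_weight = guidance_weight([0.4, 0.35, 0.25], threshold)
```

Here the confident teacher output yields a single-class prediction set and keeps full weight, while the ambiguous output yields a two-class set and is downweighted.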