
🧪 ICML2026 Accepted Papers

1846 ICML2026 paper notes covering Image Generation (141), Model Compression (117), AI Safety (114), Reinforcement Learning (110), Interpretability (92), Multimodal VLM (89), Optimization & Theory (88), LLM Reasoning (78), and 51 other areas. Each note has a TL;DR, motivation, method, experiments, highlights, and limitations: the core ideas in a 5-minute read.

💡 LLM Reasoning (78)

A Formal Comparison Between Chain of Thought and Latent Thought: Based on computational complexity theory, this paper formally compares the expressive power of CoT (Chain of Thought) and Latent Thought (Looped Transformer / Coconut). It proves that Latent Thought strictly reaches \(\mathsf{TC}^k\) under polylogarithmic depth, while CoT reaches at most \(\mathsf{TC}^{k-1}\). Simultaneously, in a probabilistic setting, it reveals for the first time that CoT can support FPRAS counting through stochastic decoding, thereby surpassing deterministic Latent Thought.
Aligning Tree-Search Policies with Fixed Token Budgets in Test-Time Scaling of LLMs: Addressing the practical constraint of "fixed token budgets per query" during deployment, this paper proposes Budget-Guided MCTS (BG-MCTS). It utilizes a "budget sufficiency ratio \(\rho\)" as a unified scheduling signal to transition tree search from broad exploration in early stages to deep refinement and answer completion as the budget depletes, consistently outperforming budget-agnostic tree search baselines on mathematical and physical reasoning benchmarks.
An Information-Theoretic Criterion for Efficient Data Synthesis: This paper employs the Data Processing Inequality (DPI) to explain why synthetic data can be effective or cause model collapse: a synthetic data pipeline is only information-open if the training closed-loop continuously introduces stable external signals. Furthermore, high meta-level verification signals are more efficient and generalizable than instance-level imitation.
Are Large Reasoning Models Interruptible?: This paper shifts the evaluation of large reasoning models from static problem-solving to dynamic environments where models may be interrupted or receive mid-generation updates. The authors construct evaluation protocols for mathematics and programming and identify three consistent failure modes: reasoning leakage, panic answering, and self-doubt.
Are Tools Always Beneficial? Learning to Invoke Tools Adaptively for Dual-Mode Multimodal LLM Reasoning: AutoTool utilizes reinforcement learning to enable Multimodal Large Language Models (MLLMs) to first determine whether a "zoom-in tool" is truly necessary for a given task. By adaptively switching between tool-assisted reasoning and pure text reasoning, the model achieves simultaneous improvements in accuracy and efficiency across high-resolution perception, grounding, hallucination detection, and reasoning tasks.
Attention Illuminates LLM Reasoning: The Preplan-and-Anchor Rhythm Enables Fine-Grained Policy Optimization: The authors use attention dynamics to "develop" the reasoning process—discovering a "preplan-and-anchor" two-beat rhythm during generation. They convert two internal metrics (WAAD/FAI) characterizing this rhythm into token-level advantage amplification coefficients for RL. This allows GRPO to concentrate credit on critical tokens that dictate the direction of downstream reasoning, achieving consistent performance gains across Countdown, QA, and multiple mathematical reasoning benchmarks.
Beyond Test-Time Memory: State-Space Optimal Control for LLM Reasoning: This paper models LLM reasoning as an optimal control problem in latent space (Linear Quadratic Regulator, LQR) and proposes the Test-Time Control (TTC) layer to perform finite-horizon planning during the forward pass. The optimal control action is decoded as the next-token representation. Combined with a CUDA-efficient Symplectic Iteration solver, this adapter-style layer achieves up to +27.8% gain on MATH-500 and a 2-3× increase in Pass@8 on AMC/AIME when inserted into pretrained LLMs.
Beyond Two-Stage Training: Cooperative SFT and RL for LLM Reasoning: This paper proposes the BRIDGE framework, which models the integration of SFT and RL as a bilevel optimization problem. In this framework, an SFT-based upper-level teacher learns to selectively transfer beneficial supervisory signals to an RL-based student via a lightweight LoRA module, achieving an average absolute improvement of over 3 percentage points across five mathematical reasoning benchmarks.
Biases in the Blind Spot: Detecting What LLMs Fail to Mention: This paper proposes a fully automated black-box pipeline to detect "unverbalized biases"—implicit factors that systematically influence model decisions but are never mentioned in Chain-of-Thought (CoT) reasoning. By utilizing LLMs to automatically generate conceptual hypotheses, counterfactual input variants, and sequential statistical tests, the method discovered known biases such as gender and race across three decision-making tasks, as well as novel biases like Spanish fluency, English proficiency, and writing formality.
Blending Supervised and Reinforcement Fine-Tuning with Prefix Sampling: This paper proposes Prefix-RFT, which constructs mixed trajectories by sampling prefixes from expert demonstrations and concatenating model continuations. This approach injects knowledge guidance from SFT while maintaining the objective-oriented optimization of RFT, significantly outperforming independent SFT, RFT, and existing hybrid methods on mathematical reasoning tasks.
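
The prefix-sampling idea in the Prefix-RFT entry can be pictured with a toy sketch. Everything below is an assumption made for illustration: the names `toy_policy` and `mixed_trajectory`, the uniform choice of cut point, and the list-of-tokens representation are not taken from the paper.

```python
import random

def toy_policy(prefix, max_new_tokens=3):
    """Hypothetical stand-in for the model: emits placeholder tokens."""
    return [f"<gen{i}>" for i in range(max_new_tokens)]

def mixed_trajectory(demo_tokens, policy, rng):
    """Keep a random-length prefix of the expert demonstration,
    then let the policy continue from that prefix."""
    # randint is inclusive: 0 means a pure rollout, len(demo) keeps the whole demo.
    cut = rng.randint(0, len(demo_tokens))
    prefix = demo_tokens[:cut]
    continuation = policy(prefix)
    return prefix, continuation

rng = random.Random(0)
demo = ["2", "+", "2", "=", "4"]
prefix, continuation = mixed_trajectory(demo, toy_policy, rng)
trajectory = prefix + continuation
print(trajectory)
```

In a sketch like this, the prefix carries the demonstration's guidance while the continuation is what a reward-based objective would score. How the cut point is scheduled and how the two parts are weighted in the loss are details the summary does not give.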

Browse all 78 LLM Reasoning papers →

🦾 LLM Agent (59)

A Minimal Agent for Automated Theorem Proving: This paper proposes AxProverBase—a minimalist Lean 4 theorem-proving agent. By relying on only three components—"compiler feedback + self-managed notebook + lightweight tool search"—it achieves or exceeds the performance of specialized systems like Hilbert/Seed-Prover using non-fine-tuned frontier LLMs (Claude Opus), while reducing costs by 100x.
A Systematic Study of Behavioral Cloning for Scientific Data Annotation: This paper establishes a controlled framework consisting of 9 procedurally synthetic annotation tasks and virtual annotators to systematically study whether "behavioral cloning" (allowing a VLM to directly mimic full human operation trajectories—clicking, navigating, and undoing within an annotation interface) can replace "direct label prediction." Through four dimensions—training dynamics, scaling laws, transfer capabilities, and linear probes—it reveals findings such as the hierarchical emergence of skills, the phenomenon where models make fewer mistakes than training data but still perform error correction, the necessity of multi-task pre-training for transferability, and task-shared internal representations of "errors."
ACON: Optimizing Context Compression for Long-horizon LLM Agents: Acon utilizes failure trajectory contrast to optimize natural language compression guidelines, simultaneously compressing agent history and observation contexts. It reduces peak tokens by 26% to 54% on AppWorld, OfficeBench, and multi-objective QA while maintaining or improving success rates in long-horizon tasks.
AdaMEM: Test-Time Adaptive Memory for Language Agents: AdaMEM decouples agent memory into two layers: "offline long-term trajectory memory" and "online synthesized short-term strategy memory." This allows agents to dynamically refresh guidance strategies based on current states during long-horizon tasks. Coupled with Step-MFT—a fine-tuning technique that preserves only strategies that "actually change actions"—it achieves relative gains of 13–17% over static memory baselines on ALFWorld, WebShop, and HotpotQA.
Agent-Omit: Adaptive Context Omission for Efficient LLM Agents: By quantifying which turn-level thoughts and observations are omittable via Monte-Carlo rollouts, an 8B agent is trained using cold-start SFT and dual-sampling omit-aware GRPO. This agent adaptively skips redundant thoughts and observations, significantly reducing token usage across five benchmarks while maintaining accuracy comparable to seven state-of-the-art frontier models.
Agent JIT Compilation for Latency-Optimizing Web Agent Planning and Scheduling: This paper transforms the web Computer-Use Agent from a step-by-step screenshot-LLM call-execution loop into a system similar to a JIT compiler: compiling natural language tasks into verifiable, cacheable, and parallel-schedulable code plans. This allows JIT-Planner to be 10.4× faster than Browser-Use with 28pp higher accuracy, and JIT-Scheduler to be 2.4× faster than OpenAI CUA with 9pp higher accuracy.
Agentic Monte Carlo: Simulating Reinforcement Learning for Black-Box Agents: The paper reframes "RL for black-box LLM Agents" as "sampling from the posterior of an optimal policy." By employing Sequential Monte Carlo (SMC) with a lightweight value function to guide frozen black-box models during test time, the authors achieve RL-style optimization without accessing any parameters. This approach outperforms prompting baselines on three AgentGym environments and surpasses GRPO (which requires full parameter access) by scaling test-time computation.
AgentXRay: White-Boxing Agentic Systems via Workflow Reconstruction: The authors define the task of "inferring an equivalent white-box workflow from a black-box agent system" as AWR. They utilize MCTS to search within the sequence space of agent primitives, combined with dynamic Red-Black pruning based on scoring to balance search depth and width, achieving interpretable white-box reconstruction across five real-world domains.
Answer Only as Precisely as Justified: Calibrated Claim-Level Specificity Control for Agentic Systems: This paper models the issue of "stating details without sufficient evidence" in agentic systems as a claim-level over-commitment problem. It proposes calibrated CSS: a calibrated selection for each atomic claim among precise expression, coarse-grained backoff, and omission. In LongFact full-scale experiments, it improves OAU from 0.8460 (without post-processing) to 0.9130 while retaining a specificity of 0.9381.
AutoRPA: Efficient GUI Automation through LLM-Driven Code Synthesis from Interactions: The authors propose the AutoRPA framework, which automatically distills interaction trajectories of ReAct-style GUI Agents into reusable RPA functions via a Translator-Builder pipeline. By combining iterative optimization with a hybrid repair strategy, the method maintains or exceeds original Agent success rates while reducing token consumption by 82% to 96%.
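
The Sequential Monte Carlo guidance mentioned in the Agentic Monte Carlo entry can be illustrated with a generic SMC step: extend each partial trajectory with the frozen agent, weight it by the change in a value estimate, and resample. This is a minimal sketch under stated assumptions. The `propose` and `value` functions are toy stand-ins, and the exponential weighting with a temperature is a common SMC choice, not necessarily the one the paper uses.

```python
import math
import random

def smc_step(particles, propose, value, temperature, rng):
    """One SMC step: extend, weight by the incremental value gain, resample."""
    extended = [p + [propose(p, rng)] for p in particles]
    # Incremental log-weights: only the change in value since the last resampling.
    scores = [(value(e) - value(p)) / temperature
              for p, e in zip(particles, extended)]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]  # subtract max for stability
    return rng.choices(extended, weights=weights, k=len(extended))

def propose(particle, rng):
    """Hypothetical frozen black-box agent: picks action 0 or 1 uniformly."""
    return rng.choice([0, 1])

def value(particle):
    """Hypothetical lightweight value function: counts rewarded actions."""
    return float(sum(particle))

rng = random.Random(0)
particles = [[] for _ in range(64)]
for _ in range(10):
    particles = smc_step(particles, propose, value, temperature=0.5, rng=rng)
mean_reward = sum(sum(p) for p in particles) / len(particles)
print(mean_reward)  # well above the 5.0 expected from unguided sampling
```

The agent's parameters are never touched: all steering comes from which partial trajectories survive resampling.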

Browse all 59 LLM Agent papers →

👥 Multi-Agent (24)

Beyond Majority Voting: LLM Aggregation by Leveraging Higher-Order Information: This paper proposes two algorithms for aggregating LLM responses by leveraging higher-order information—Optimal Weight (OW) based on first-order accuracy information and Inverse Surprising Popularity (ISP) based on second-order correlation information. These methods are provably superior to Majority Voting (MV) under label-free conditions and demonstrate consistent improvements on UltraFeedback, MMLU, and healthcare datasets.
CoOT: Learning to Coordinate In-Context with Coordination Transformers: This work reframes "cooperating with unknown partners" from a task-generalization problem to a partner-generalization in-context learning problem. By training a Decision Transformer to predict best-response actions over cross-episode interaction trajectories, the model adapts to unseen partners within a few episodes during test-time without updating parameters.
Does Persona Make LLMs K-pop Fans? A Pilot Study of LLM-Based Online Concert Audience Agents: The authors constructed a "virtual audience" system consisting of ten LLM agents posting real-time danmaku. By pairing pre-recorded K-pop performances with human-like fan chats, an N=11 within-subject pilot study revealed that assigning individual personas to each agent significantly enhances diversity and "naturalness" at the model output level. However, this does not translate into a stronger sense of social connection, engagement, or emotional resonance—as K-pop danmaku is essentially a "collective monologue" rather than interpersonal dialogue.
E-mem: Multi-Agent Based Episodic Context Reconstruction for LLM Agent Memory: E-mem replaces the traditional memory paradigm of "preprocessing compression into embeddings/graphs" with an episodic reconstruction paradigm of "preserving original context + on-site reasoning by small model assistants": the master agent only handles global planning, while multiple SLM assistants each guard a segment of uncompressed raw text, performing local reasoning to return evidence after activation via multi-pathway retrieval. This approach outperforms the SOTA F1 on LoCoMo by 7.75 points while cutting token consumption by 70%.
EduMirror: Modeling Educational Social Dynamics with Value-driven Multi-agent Simulation: EduMirror simulates educational social phenomena like "campus bullying" and "peer cooperation" in an LLM-driven multi-agent sandbox. It employs "value-driven agents" based on Maslow's hierarchy of needs and Social Value Orientation (SVO) to play students and teachers, coupled with a "dual-track measurement" protocol that quantifies both observable behaviors and latent psychological states. This allows for ethically safe "what-if" counterfactual experiments in a digital environment.
EngiAgent: Fully Connected Coordination of LLM Agents for Solving Open-ended Engineering Problems with Feasible Solutions: EngiAgent decomposes engineering problem solving into five specialist agents: Analyzer, Modeler, Verifier, Solver, and Evaluator. It utilizes a fully connected coordinator for dynamic feedback routing (replacing rigid pipelines). This approach improves the feasible solution rate on GPT-4o for engineering tasks from 5.66% (zero-shot) and 7.55% (MM-Agent) to 64.15%, representing an approximate 7x increase over previous SOTAs.
Sheaf-ADMM: Learning Multi-Agent Coordination via Sheaf-ADMM: Sheaf-ADMM formulates multi-agent coordination as an end-to-end differentiable ADMM unrolling: each agent observes a local patch, independently solves an ADMM subproblem (\(\bm x\)-update), negotiates consensus via "edge space projections" defined by a cellular sheaf (\(\bm z\)-update), and accumulates divergence using dual variables \(\bm u\). Agents successfully solve global tasks in maze pathfinding, MNIST, and Sudoku, where their inference paths exhibit analyzable primal/consensus/dual states—offering higher intervenability than standard MPNNs.
MAS-Orchestra: Understanding and Improving Multi-Agent Reasoning Through Holistic Orchestration and Controlled Benchmarks: The paper reformulates "automated multi-agent system design" as a reinforcement learning (RL) problem involving function calls that output an entire MAS structure in a single step. It introduces MASBench to clarify "when multi-agent systems are truly superior to single-agent systems" across five dimensions: Depth, Horizon, Breadth, Parallelism, and Robustness.
MASPO: Joint Prompt Optimization for LLM-based Multi-Agent Systems: MASPO end-to-end jointly optimizes role prompts for multi-agent chains without relying on labels through multi-granularity joint evaluation (Local Validity + Lookahead Potential + Global Alignment) and misalignment-driven evolutionary beam search, achieving an average improvement of approximately 2.9 points across 6 tasks.
MASPOB: Multi-Agent Prompt Optimization via GNN Surrogate + LinUCB + Coordinate Ascent: MASPOB reformulates multi-agent system prompt optimization as budget-constrained black-box optimization. It utilizes a GAT surrogate model to capture prompt coupling under workflow topologies, LinUCB in the embedding space to compute epistemic uncertainty, and coordinate ascent to decompose joint search into sequential individual problems. This reduces search complexity from \(\mathcal{O}(\prod |\mathcal{P}_i|)\) to \(\mathcal{O}(\sum |\mathcal{P}_i|)\). Across 6 benchmarks (QA/Code/Math), it achieves an average score of 80.58, surpassing MIPRO (78.87), AFlow (78.52), and IO (68.56).
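
The drop from \(\mathcal{O}(\prod |\mathcal{P}_i|)\) to \(\mathcal{O}(\sum |\mathcal{P}_i|)\) in the MASPOB entry is the usual effect of coordinate ascent, and a small count makes it concrete. The sketch below is a simplification: it scores candidates directly with a toy additive function, whereas the entry describes a GAT surrogate and LinUCB guiding the search. The function names and the toy score are hypothetical.

```python
from itertools import product

def exhaustive_search(candidates, score):
    """Evaluate every joint assignment: prod(|P_i|) evaluations."""
    evals, best, best_score = 0, None, float("-inf")
    for combo in product(*candidates):
        evals += 1
        s = score(combo)
        if s > best_score:
            best, best_score = combo, s
    return best, evals

def coordinate_ascent_pass(candidates, score, start):
    """One sweep: tune each agent's prompt in turn with the others fixed,
    which costs sum(|P_i|) evaluations."""
    current, evals = list(start), 0
    for i, options in enumerate(candidates):
        best_opt, best_score = current[i], float("-inf")
        for opt in options:
            trial = current[:i] + [opt] + current[i + 1:]
            evals += 1
            s = score(tuple(trial))
            if s > best_score:
                best_opt, best_score = opt, s
        current[i] = best_opt
    return tuple(current), evals

# Three agents with 4, 5, and 6 candidate prompts (represented by indices).
candidates = [list(range(4)), list(range(5)), list(range(6))]
score = lambda combo: sum(combo)  # toy score with no interaction between agents

_, n_joint = exhaustive_search(candidates, score)
best, n_coord = coordinate_ascent_pass(candidates, score, start=(0, 0, 0))
print(n_joint, n_coord)  # 120 15
```

With an additive toy score a single sweep finds the joint optimum. When prompts interact, coordinate ascent can stall at a local optimum, which is presumably why the entry pairs it with a surrogate that models prompt coupling.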

Browse all 24 Multi-Agent papers →

⚖️ Alignment & RLHF (37)

Adaptive Probe-based Steering for Robust LLM Jailbreaking: This paper transforms probe-based contrastive steering into a more powerful white-box red-teaming tool. By using adaptive retraining to correct biased probes and automatically setting steering intensity via activation statistics, it significantly exposes the jailbreak vulnerabilities of fortified LLMs.
Alignment-Aware Decoding: Alignment-Aware Decoding (AAD) directly leverages the token probability ratio of a DPO model relative to an SFT reference model as an implicit alignment reward during inference. Without additional training or external reward models, it generates high-quality aligned responses more stably than greedy, Bo2, and EFT decoding, while also serving as a mechanism to generate synthetic preference data for iterative DPO improvement.
Autoregressive Direct Preference Optimization: The authors observe that DPO's derivation sequence is flawed: it constructs a Bradley-Terry (BT) preference model based on the entire answer first and imposes the autoregressive assumption on the model only afterwards. ADPO advances the autoregressive assumption to before the BT model construction by defining energy functions on the prefix closure of the output space. This yields a minimalist new loss that moves the summation sign from inside the log-sigmoid to the outside. Consequently, it distinguishes two independent length measures for the first time, token length \(\mu\) and feedback length \(\mu'\), unifying training at any granularity from full answers to individual tokens.
Boosting Direct Preference Optimization with Penalization: This paper proposes DPOP (Direct Preference Optimization with Penalization), which adds an extra penalty to the "reference model's own greedy-decoded response" \(y_g\) for the same prompt alongside the standard DPO preference loss. A detached gate activates this penalty only when the policy "still ranks the rejected response higher than the chosen response," effectively transforming the unused reference-greedy signal into a valid offline alignment signal. On AlpacaEval 2.0, it exceeds DPO/SimPO/AlphaDPO in length-controlled win rate.
Consistency Training Can Entrench Misalignment: This paper proposes the "consistency non-neutrality hypothesis." By evaluating 7 consistency training methods across 108 "model organisms," it finds that consistency training is not alignment-neutral—it systematically suppresses fragile reward hacking and emergent misalignment while amplifying stable sycophancy. Distribution shift, rather than score selection, is identified as the primary driver.
Curriculum Learning for Safety Alignment: This paper proposes Staged-Competence—a DPO safety alignment framework that utilizes "model-specific preference alignment margin" as a difficulty score. It employs a dual curriculum of "staged reference model updates + within-stage competence-based sampling." Across three 8B-scale LLMs, it reduces OOD harmful response rates by an average of 16% and jailbreak success rates by 20%, while maintaining general capabilities and avoiding over-refusal.
Decoupling Reasoning and Confidence: Resurrecting Calibration in Reinforcement Learning from Verifiable Rewards: This paper theoretically demonstrates that the objectives of "improving accuracy" and "reducing calibration error" in RLVR (e.g., GRPO) training have negatively correlated gradient directions under the Fisher metric and are irreconcilable. It proposes DCPO: allowing the model to explicitly output a verbalized confidence segment after the reasoning trajectory, assigning independent rewards / advantages / masked gradients to reasoning tokens and confidence tokens. While maintaining the same accuracy as GRPO, it reduces the ECE from 0.435 to 0.128 (a 71.6% relative reduction).
Efficient Preference Poisoning Attack on Offline RLHF: The paper proposes a key observation for log-linear DPO: "flipping a single preference label equals adding a fixed vector independent of the policy parameters to the loss gradient." Based on this, targeted poisoning attacks are reduced to a binary sparse approximation problem. Two algorithms are introduced: BAL-A (based on LLL lattice reduction) and BMP-A (based on matching pursuit), along with provable recovery and impossibility conditions.
\(f\)-Divergence Regularized RLHF: Two Tales of Sampling and Unified Analyses: This paper establishes the first \(O(\log T)\) regret and \(O(1/T)\) suboptimality gap upper bounds for online RLHF under general \(f\)-divergence regularization. It proposes two sampling strategies: (1) optimism in the face of uncertainty using bonus terms; and (2) a novel "derivative-as-uncertainty" perspective, where \(f'\) serves as an uncertainty signal to design derivative-based sampling without explicitly estimating confidence bounds in each round.
F-TIS: Harnessing Diverse Models in Collaborative GRPO: F-TIS combines "Truncated Importance Sampling (TIS)" with "filtering negative advantage off-policy samples based on KL thresholds" into a single GRPO loss. This allows multiple LLMs—varying in size, expertise, or trainable parameter subsets—to exchange samples during a single decentralized GRPO training session. The approach achieves convergence comparable to pure on-policy training and delivers up to a +12% performance gain on OOD math tasks.
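
Several entries above report Expected Calibration Error (ECE), for example the DCPO entry. For readers who want the metric spelled out, here is the standard equal-width-bin definition as a short sketch. The bin count of 10 and the toy inputs are assumptions, and the papers may use a different binning.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin-size-weighted average of |accuracy - mean confidence| per bin."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append((c, y))
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        acc = sum(y for _, y in b) / len(b)
        ece += (len(b) / n) * abs(acc - avg_conf)
    return ece

# Toy data: the model is right half the time in both cases.
overconfident = expected_calibration_error([0.9] * 4, [1, 0, 1, 0])
calibrated = expected_calibration_error([0.5] * 4, [1, 0, 1, 0])
print(overconfident, calibrated)
```

Both toy models have the same accuracy; only the stated confidence differs, which is the distinction a calibration-aware objective targets.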

Browse all 37 Alignment & RLHF papers →
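
The observation quoted in the preference-poisoning entry, that flipping one label shifts the gradient by a parameter-independent vector, can be checked numerically for a log-linear policy. The sketch assumes \(\log \pi_\theta(y \mid x) = \theta \cdot \phi(x, y) - \log Z(x)\), so the per-pair DPO loss is \(-\log \sigma(z)\) with \(z = \beta(\theta \cdot d - m)\), where \(d = \phi(x, y_w) - \phi(x, y_l)\) and \(m\) is the reference log-ratio. The variable names and the numbers are illustrative, not from the paper.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dpo_grad(theta, d, ref_margin, beta, flipped=False):
    """Gradient w.r.t. theta of the per-pair DPO loss for a log-linear policy.
    Flipping the preference label swaps the pair, which negates z."""
    z = beta * (sum(t * di for t, di in zip(theta, d)) - ref_margin)
    if flipped:
        # loss = -log sigmoid(-z), so the gradient is sigmoid(z) * beta * d
        return [sigmoid(z) * beta * di for di in d]
    # loss = -log sigmoid(z), so the gradient is -(1 - sigmoid(z)) * beta * d
    return [-(1.0 - sigmoid(z)) * beta * di for di in d]

rng = random.Random(0)
beta = 0.1
d = [rng.uniform(-1, 1) for _ in range(4)]  # feature difference for one pair
ref_margin = 0.3                            # hypothetical reference log-ratio

shifts = []
for _ in range(3):
    theta = [rng.uniform(-5, 5) for _ in range(4)]  # a different policy each time
    g = dpo_grad(theta, d, ref_margin, beta)
    g_flip = dpo_grad(theta, d, ref_margin, beta, flipped=True)
    shifts.append([gf - gi for gf, gi in zip(g_flip, g)])

print(shifts[0])
print([beta * di for di in d])  # the same vector, whatever theta was
```

The two sigmoid terms sum to one, so the shift is exactly \(\beta d\) for every \(\theta\). That is what lets an attacker treat label flips as a fixed dictionary of vectors to combine.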

👻 Hallucination Detection (21)

A Unified Definition of Hallucination: It's The World Model, Stupid!: This is a position paper advocating that "hallucinations" across various tasks—translation, summarization, open-domain QA, RAG, multimodal, and agents—be unified as one phenomenon: user-observable, inaccurate world modeling relative to a "reference world model." Every scenario is simply a different configuration of the "\((W, V, P)\)" triplet (Reference World \(W\), View Function \(V\), Conflict Policy \(P\)), converging fragmented definitions into a universal template for generating large-scale, comparable benchmarks.
Adaptive Residual-Update Steering for Low-Overhead Hallucination Mitigation in Large Vision Language Models: This paper proposes RUDDER, which extracts per-sample visual evidence directions from residual updates during the prefill stage of LVLMs and adaptively injects them via a Beta Gate during decoding, mitigating object hallucinations with overhead close to a single forward pass.
Automatic Layer Selection for Hallucination Detection: FEPoID (First Effective Peak of Intrinsic Dimension) is proposed as a training-free automatic layer selection criterion. Combined with the First Sentence Truncation (FST) strategy, it consistently selects near-optimal intermediate layers across various QA and summarization hallucination detection benchmarks, significantly outperforming existing baseline methods.
Building Reliable Long-Form Generation via Hallucination Rejection Sampling: This paper proposes the SHARS framework, which detects and rejects hallucinated content sentence-by-sentence during inference, retaining only verified factual segments to continue generation. Combined with an improved semantic entropy detector, HalluSE, it improves factual precision by approximately 20–26% on FactScore while maintaining or increasing the volume of factual information in the output.
Capturing Gaze Shifts for Guidance: Cross-Modal Fusion Enhancement for VLM Hallucination Mitigation: The GIFT method is proposed, which constructs a visual saliency map by tracking positive changes in visual attention ("gaze shifts") as the VLM interprets user queries. During the decoding stage, it simultaneously enhances attention for both visual and query tokens to maintain cross-modal fusion balance, achieving up to 20.7% improvement on CHAIR with only 1.13× latency overhead.
Finding the Correct Visual Evidence Without Forgetting: Mitigating Hallucination in LVLMs via Inter-Layer Visual Attention Discrepancy: This paper identifies that LVLM hallucinations originate from "insufficient attention + forgetting during generation" regarding correct visual evidence. Observing a significant Inter-Layer Visual Attention Discrepancy (ILVAD) for visual evidence, the authors propose a training-free, plug-and-play method: constructing a visual evidence saliency map via inter-layer differentiation, then continuously weighting visual evidence tokens and "evidence-grounded" text tokens during generation. This consistently reduces hallucinations across 5 LVLMs and 5 hallucination/comprehensive benchmarks.
From Flat Facts to Sharp Hallucinations: Detecting Stubborn Errors via Gradient Sensitivity: This paper shifts LLM hallucination detection from "analyzing output probabilities" to "analyzing loss landscape curvature" by adding Gaussian noise to embeddings and measuring the resulting perturbations in gradient direction and magnitude. Serving as a cheap proxy for the Hessian spectral radius, this method outperforms baselines like Entropy, Semantic Entropy, and EigenScore in AUROC across 12 model-dataset combinations.
From Out-of-Distribution Detection to Hallucination Detection: A Geometric View: This paper treats LLM next-token prediction as a classification task on a massive vocabulary. By migrating two lightweight OOD detectors—NCI (proximity of features to weight vectors) and fDBD (distance from features to decision boundaries)—with two adaptations ("analytical proxy \(\mu_G\) for training feature means" and "calculating boundary distance only on top-\(k\) candidate tokens"), it derives a training-free, single-sample inference-time hallucination detector. It consistently outperforms baselines such as Perplexity, Semantic Entropy, and SelfCheckGPT on CSQA, GSM8K, and AQuA.
Hallucination is a Consequence of Space-Optimality: A Rate-Distortion Theorem for Membership Testing: This paper formalizes "LLMs memorizing random facts" as a membership testing problem with continuous confidence scores. It proves that in the sparse limit of facts, the optimal memory cost exactly equals the minimum KL divergence between fact and non-fact output distributions—a "rate-distortion theorem." It further concludes that under the log-loss objective and given limited memory, the optimal strategy is neither abstention nor forgetting, but rather mapping a certain proportion of non-facts and facts to the same high-confidence point, identifying hallucination as the information-theoretically optimal error form.
Hallucinations Undermine Trust; Metacognition is a Way Forward: This position paper argues that "totally eliminating LLM hallucinations" is theoretically impossible without incurring a "utility tax" (discrimination gap); the authors advocate shifting the goal from "eliminating hallucinations" to faithful uncertainty and treating this metacognition as an indispensable control layer for agentic LLMs when calling tools.
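
The sentence-by-sentence rejection described in the SHARS entry above follows a generic rejection-sampling loop, sketched here with toy stand-ins. The generator, the detector, the retry limit, and the choice to skip a slot after repeated rejections are all assumptions for illustration. The entry's actual detector is the semantic-entropy-based HalluSE.

```python
import random

def generate_with_rejection(propose_sentence, is_hallucinated,
                            n_sentences, max_tries, rng):
    """Build a response sentence by sentence, resampling any flagged sentence."""
    accepted = []
    for _ in range(n_sentences):
        for _ in range(max_tries):
            candidate = propose_sentence(accepted, rng)
            if not is_hallucinated(candidate):
                accepted.append(candidate)  # later sentences condition on this
                break
        # If every try was rejected, the slot is skipped rather than filled.
    return accepted

def propose_sentence(accepted, rng):
    """Hypothetical generator: half of its proposals are fabricated."""
    kind = "fact" if rng.random() < 0.5 else "made-up"
    return f"{kind} #{len(accepted) + 1}"

def is_hallucinated(sentence):
    """Hypothetical detector: a perfect oracle on this toy data."""
    return sentence.startswith("made-up")

response = generate_with_rejection(propose_sentence, is_hallucinated,
                                   n_sentences=4, max_tries=5,
                                   rng=random.Random(0))
print(response)
```

Because each accepted sentence becomes context for the next proposal, a rejected fabrication never conditions what follows, which is the point of filtering during generation rather than afterwards.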

Browse all 21 Hallucination Detection papers →

📊 LLM Evaluation (40)

Agent World Model: Infinity Synthetic Environments for Agentic Reinforcement Learning: This paper proposes Agent World Model, a fully synthetic pipeline encompassing scenarios, tasks, databases, MCP tool interfaces, and verifiers. It generates 1,000 executable, database-driven environments used to train tool-calling agents, achieving superior out-of-distribution generalization on BFCLv3, \(\tau^2\)-bench, and MCP-Universe.
AGZO: Activation-Guided Zeroth-Order Optimization for LLM Fine-Tuning: AGZO discovers that the row space of linear layer gradients is constrained by the forward activation subspace. Based on this, it perturbs parameters only along activation-guided low-rank directions during zeroth-order fine-tuning, thereby improving gradient alignment and downstream task performance while maintaining memory usage levels nearly identical to MeZO.
Authority, Truth, and Citation Bias: A Large-Scale Multi-Domain Benchmark for Studying Epistemic Susceptibility in Large Language Models: This paper introduces AuthorityBench—a multi-domain benchmark with 220,000 prompts using a fully balanced 2×2 factorial design (independently manipulating "claim veracity × citation veracity") to isolate the influence of the "citation authority signal" itself on LLM cognitive behavior. It finds that adding a citation (regardless of its veracity) increases hallucination rates, with the "True Claim + Fabricated Citation" condition causing the most severe hallucinations across all tested models (raising hallucinations in general knowledge domains to 35–77%), and larger models are not necessarily more robust.
BESPOKE: Benchmark for Search-Augmented Large Language Model Personalization via Diagnostic Feedback: The Bespoke benchmark is proposed, collecting 2,870 sessions from 30 annotators over 3 weeks of real chat and search history. It constructs an evaluation framework with fine-grained preference ratings and diagnostic feedback to systematically assess the personalization capabilities of search-augmented LLMs. Findings indicate that current models score below 60 on average across all configurations, with the bottleneck for personalization lying in history reasoning rather than generation.
Beyond Log Likelihood: Probability-Based Objectives for Supervised Fine-Tuning across the Model Capability Continuum: This paper systematically investigates the behavior of probability-based objective functions in SFT, discovering that the standard NLL is not universally optimal: on tasks where the model has a strong prior, prior-leaning objectives like \(-p\) significantly outperform NLL (with gains up to 16%). Conversely, NLL remains superior on tasks with weak priors, revealing an objective selection principle governed by the model-capability continuum.
Beyond Trajectory-Level Attribution: Graph-Based Credit Assignment for Agentic Reinforcement Learning: This paper proposes GraphGPO, which aggregates all rollout trajectories into a unified state transition graph. By leveraging global shortest path information on the graph to calculate distance-based advantages for each step, it achieves finer-grained credit assignment than trajectory-level attribution, significantly outperforming GRPO and GiGPO on ALFWorld, WebShop, and Sokoban.
BuildArena: A Physics-Aligned Interactive Benchmark of LLMs for Engineering Construction: BuildArena places LLMs into the physical sandbox game Besiege, requiring them to use natural language to build bridges, vehicles, and rockets brick by brick. By using a physics engine for simulation and scoring, it systematically evaluates for the first time the engineering construction capability of LLMs to "translate language into functional physical structures." Results indicate that only GPT-5 is marginally competent on hard tasks, while most other models almost entirely fail at the Hard level.
CapBencher: Give Your LLM Benchmark a Built-in Alarm for Test-Set Overfitting: CapBencher injects randomness into each problem (generating multiple logically correct answers and randomly selecting one as the gold label) to cap the Bayes accuracy of a benchmark at a controllable level (e.g., 50%). This enables black-box statistical detection of data contamination in publicly released benchmarks—any model with an accuracy significantly exceeding the Bayes upper bound is flagged as contaminated.
Correcting Prompt Dependence in LLM Benchmarks: A Bayesian Hierarchical Model with Embedding-Space Clustering: The authors argue that mainstream LLM benchmark metrics rely on two frequently violated assumptions: a sufficient number of evaluations (permitting the Central Limit Theorem) and independence between prompts. They propose BHM-ESC, a Bayesian Hierarchical Model with "Embedding-Space Clustering": it groups semantically similar prompts into clusters sharing a success probability, and infers the number of clusters as an unknown variable. This provides more reliable performance estimates that correct for prompt dependence under small sample sizes, reducing Mean Absolute Error (MAE) by 4–73% and increasing Expected Log Posterior Density (ELPD) by 40–450 on adversarial robustness benchmarks.
Decompose, Structure, and Repair: A Neuro-Symbolic Framework for Autoformalization via Operator Trees: This paper proposes the DSR (Decompose-Structure-Repair) neuro-symbolic framework, which decomposes the formalization of natural language theorems into three stages: "decomposing NL components → joint generation of FL components and Operator Trees (OPT) → hierarchical repair based on subtree localization." Using a 7B model, it sets new SOTAs on ProverBench / ProofNet / PRIME and releases PRIME, a graduate-level Lean 4 benchmark consisting of 156 problems.
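The capping idea in the CapBencher entry above can be illustrated with a small sketch. It assumes that each problem has `num_valid_answers` equally valid answers with the gold label drawn uniformly among them, and that contamination is flagged with a one-sided exact binomial test. The function names and the `alpha` threshold are hypothetical, and the paper's actual statistical procedure may differ.

```python
import math


def bayes_cap(num_valid_answers: int) -> float:
    """Best achievable accuracy when the gold label is drawn uniformly
    from `num_valid_answers` equally valid answers (assumed setup)."""
    return 1.0 / num_valid_answers


def binomial_tail(n: int, k: int, p: float) -> float:
    """P[X >= k] for X ~ Binomial(n, p), computed exactly."""
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def flag_contaminated(n_correct: int, n_total: int, cap: float, alpha: float = 0.01) -> bool:
    """Flag a model whose accuracy is implausibly high under the cap.

    Hypothetical decision rule: reject when the chance of scoring at least
    `n_correct` with per-question success probability `cap` is below `alpha`.
    """
    return binomial_tail(n_total, n_correct, cap) < alpha


cap = bayes_cap(2)  # two valid answers per problem, so the cap is 50%
print(flag_contaminated(90, 100, cap))  # far above the cap
print(flag_contaminated(52, 100, cap))  # within sampling noise of the cap
```

Under these assumptions, a model that has memorized the released gold labels is the only kind that can score far above the cap, which is what makes the detection black-box.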

Browse all 40 LLM Evaluation papers →

⚡ LLM Efficiency (48)¶

A Risk Decomposition Framework for Pre-Hoc Fine-Tuning Prediction: Fine-tuning LLMs is expensive and hard to predict. This paper formalizes "predicting final fine-tuning performance before or during early training" as a stochastic estimation problem under information constraints. It decomposes prediction risk into an irreducible intrinsic limit (static data-model compatibility) + reducible optimization variance. It proves a mandatory lower bound of \(c^{-\alpha}\) for the decay rate of optimization variance (no predictor can exceed this speed), derives budget-optimal stopping conditions, and organizes tasks into three predictability regimes—Static-Sufficient, Dynamic-Critical, and Noise-Dominant—using the "intrinsic limit × decay rate" axes, explaining why shallow probing suffices for SST-2 but fails for GSM8K.
Beyond Sunk Costs: Boosting LLM Pre-training Efficiency via Orthogonal Growth of Mixture-of-Experts: The authors propose an "orthogonal growth" strategy for converged MoE models, using interpositional layer replication for depth and noisy expert cloning for width, to scale a 17B model to 70B. This achieves a 10.6% accuracy improvement over training from scratch under the same additional compute budget.
CriticalKV: Optimizing KV Cache Eviction from an Output Perturbation Perspective: The authors reframe the heuristic-based problem of "identifying critical KV cache entries" as an optimization problem of "minimizing attention output perturbation." They derive an analytical upper bound for perturbation (weighted by both attention weights and value norms projected via \(W^O\)) and design a plug-and-play two-stage greedy selection algorithm. This method reduces the compression loss of SOTA eviction approaches like SnapKV, AdaKV, and HeadKV by more than half on average across 29 long-context datasets.
Diffusion Language Model Parallel Decoding via Product-of-Experts Bridge: Diffusion Language Models (DLMs) enable parallel decoding but suffer from poor quality. Directly using Monte Carlo methods to correct DLM drafts toward an Autoregressive (AR) target is computationally expensive due to the massive distribution gap. This paper proposes PoE-Bridge, which inserts a Product-of-Experts intermediate bridge distribution between the DLM and AR models. This decomposes the difficult "DLM \(\to\) AR" correction into two easier "DLM \(\to\) PoE \(\to\) AR" steps. Combined with mixed-temperature sampling and elastic rejection windows, it accelerates standard DLM decoding by up to 5\(\times\) on mathematical reasoning and coding tasks while recovering at least 95% of AR accuracy.
dLLM-Cache: Accelerating Diffusion Large Language Models with Adaptive Caching: To address the bottleneck where diffusion Large Language Models (dLLMs) suffer from extremely slow inference due to bidirectional attention and the inability to reuse KV caches, this paper proposes dLLM-Cache. This training-free method applies long-interval caching for static prompts and short-interval refreshing for dynamic responses. By using Value cosine similarity (V-verify) to select and recompute the top 25% most "active" tokens, it achieves up to 9.1× FLOPs acceleration on LLaDA 8B / Dream 7B with almost no drop in performance.
Do Transformers Need Three Projections? A Systematic Study of QKV Sharing Systems: The paper systematically compares three QKV projection sharing schemes: Q=K-V (shared query and key), Q-K=V (shared key and value), and Q=K=V (all three shared). It finds that for Language Modeling (LM), Q-K=V increases Perplexity (PPL) by only 3.1% while reducing the KV cache by 50%. This approach is orthogonal to GQA/MQA, enabling a total cache reduction of 87.5%–96.9%, providing quantifiable memory benefits for edge inference.
DOT-MoE: Transforming Dense LLMs into MoE with Differentiable Optimal Transport: DOT-MoE models the "allocation of neurons to experts when converting a dense FFN to an MoE" as a differentiable optimal transport problem. It employs Sinkhorn-Knopp iterations to solve entropic-regularized balanced transport combined with a Straight-Through Estimator, allowing joint end-to-end learning of neuron-to-expert assignment and the router. It retains 90% of dense performance under 50% active parameters on LLaMA-2/3 and Qwen2.5, outperforming all baselines including structured pruning, random allocation, and clustering.
Dynamic Linear Attention: Addressing the issue where existing "multi-state linear attention" mechanisms merge memory using fixed rules, causing critical tokens to be compressed into coarse summaries prematurely and accumulating errors, DLA proposes an information-aware + capacity-constrained dynamic memory framework. By using a lightweight "state information score" to adaptively determine when to create or merge memory states based on token-level information changes, and employing a fixed-size temporal cache to suppress state explosion, DLA consistently outperforms SOTA Log-Linear Attention across 16 datasets. Furthermore, the DLA version of Mamba-2 matches the performance of a full-attention Transformer with an equivalent number of parameters.
Efficient Training-Free Multi-Token Prediction via Embedding-Space Probing: This paper proposes ESP (Embedding-Space Probing): without modifying any weights or training auxiliary models, it injects "mean prompt embeddings" as mask tokens into the input sequence of a frozen LLM. It probes multiple future tokens simultaneously in a single forward pass and uses the base model itself for lossless speculative verification. On LLaMA3 / Qwen3, it achieves 7–11% higher average acceptance length and 15–19% higher throughput than similar training-free baselines (LADE / STAND / PLD).
Ekka: Automated Diagnosis of Silent Errors in LLM Inference: Ekka models the diagnosis of silent errors in LLM serving frameworks—where outputs degrade without explicit errors—as a differential debugging task using reference implementations like HuggingFace as an oracle. By employing an agentic pipeline of "component mapping \(\rightarrow\) activation alignment \(\rightarrow\) change-point analysis," it automatically localizes problematic modules. Ekka achieves a diagnosis accuracy of 80% pass@1 / 88% pass@5 across 17 real-world vLLM/SGLang issues and discovered 4 hidden bugs confirmed by developers.
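As a rough check on the cache figures quoted in the QKV sharing entry above, the sketch below assumes (the summary does not state this) that sharing K and V stores one tensor instead of two, and that GQA/MQA with `group_size` query heads per KV head shrinks the cache by a further factor of `group_size`. Under those assumptions, group sizes of 4 and 16 reproduce the 87.5% and 96.9% endpoints. The helper name and the group sizes are illustrative, not taken from the paper.

```python
def kv_cache_reduction(group_size: int, share_kv: bool = True) -> float:
    """Fraction of a baseline multi-head KV cache that is saved.

    Assumed model: the baseline stores K and V for every head; K=V sharing
    keeps one tensor instead of two, and grouping keeps 1/group_size heads.
    """
    tensors_kept = 1 if share_kv else 2
    remaining = (tensors_kept / 2) / group_size
    return 1.0 - remaining


for g in (1, 4, 16):
    print(f"group_size={g:>2}: {kv_cache_reduction(g):.1%} smaller cache")
```

The two savings multiply because they act on different axes (which tensors are stored versus how many heads are stored), which is one reading of why the summary calls the approach orthogonal to GQA/MQA.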

Browse all 48 LLM Efficiency papers →

📚 Pretraining (27)¶

AC-ODM: Actor–Critic Online Data Mixing for Sample-Efficient LLM Pretraining: AC-ODM formulates the dynamic adjustment of pre-training data domain weights as a continuous control problem in reinforcement learning. Using the DDPG Actor-Critic framework, it perceives the model state in real-time, outputs sampling weights for each domain, and employs "inter-domain gradient alignment" as the reward. Theoretically, this is proven equivalent to maximizing constructive interference of gradients (effective descent step size). On Pythia-1B, it achieves optimal perplexity with approximately 66% fewer steps than strong baselines, scores a 27.5% relative improvement on MMLU, and increases HumanEval pass@1 by 2.23 times, with only a 0.4% increase in wall-clock time per step and 2% extra memory.
Annotations Mitigate Post-Training Mode Collapse: The authors observe that SFT aligns models with a low-entropy semantic prior, leading to "inverse scaling" where larger instruction-tuned models become increasingly repetitive. They propose "Annotation-Anchored Training"—tagging documents with semantic tags during pre-training and masking the loss on these tags during SFT—enabling the model to sample semantics before generating responses, which reduces the semantic diversity gap by 85% while maintaining instruction-following performance.
Beyond Structural Symmetries: Linear Mode Connectivity via Neuron Identifiability: This paper proposes a theoretical framework of "effective function classes" and "neuron identifiability," revealing that breaking structural symmetry does not equate to breaking effective symmetry—even if permutation symmetry in the parameter space is eliminated, data-dependent approximate symmetries may still make neuron swapping costs extremely low. Based on this, it provides sufficient conditions for achieving Linear Mode Connectivity (LMC) without the need for alignment.
Constrained Bayesian Experimental Design via Online Planning: This paper proposes COPEx: a semi-amortized scheme combining "offline pre-trained amortized posterior networks + design policies + online multi-step lookahead scenario trees." This allows Bayesian experimental design (BED) to dynamically adapt to budget, cost, and transition constraints at test time. COPEx consistently outperforms baselines such as VPCE, ALINE, and RL-BOED in EIG/RMSE across three types of tasks: constrained location finding, CES, and cost-aware AL.
Data Difficulty and the Generalization–Extrapolation Tradeoff in LLM Fine-Tuning: This paper systematically investigates the role of data difficulty in SFT, discovering that there is no "universally optimal difficulty." Instead, an optimal difficulty exists that drifts toward harder samples as the data scale increases. This is explained through a PAC-Bayes framework as a tradeoff between the "in-distribution generalization gap" and the "extrapolation gap."
Decoupling the "What" and "Where" With Polar Coordinate Positional Embeddings: The authors point out that the mainstream positional encoding, RoPE, couples "content (what)" and "position (where)" into the same phase, leading to poor performance on tasks requiring "finding content by position" or "locating position by content." They propose PoPE, which uses softplus to separate magnitude (controlling what) and pure positional phase (controlling where). As a minor modification to RoPE, PoPE consistently outperforms it in diagnostic tasks, music/genomic/language modeling, and achieves length extrapolation to 10x the training length without any fine-tuning, surpassing YaRN which is specifically designed for extrapolation.
Different Layers, Different Manifolds: Module-Wise Weight-Space Geometry in Transformer Optimization: This workshop paper systematically compares "module-wise manifold constraint" schemes during GPT-2 small pre-training. It discovers that applying strong spectral constraints (Stiefel) to Attention layers while applying weak constraints (DGram) to MLP layers achieves the best performance. Conversely, training Attention layers with DGram leads to divergence, for which the authors provide a mechanistic explanation: "Singular value swelling \(\rightarrow\) Logit inflation \(\rightarrow\) Softmax saturation \(\rightarrow\) Gradient degradation."
Dropout Universality: Scaling Laws and Optimal Scheduling at the Edge-of-Chaos: The authors interpret dropout as an "external field" \(h\) that breaks the \(c^*=1\) perfect alignment fixed point in mean field signal propagation theory. They derive the Landau equation, two-parameter scaling collapse, and identify two distinct universality classes for smooth and kinked activations. This leads to a "zero-overhead" practical conclusion: a front-loaded schedule reduces test loss by 18–35% in MLPs and ViTs compared to constant dropout under the same budget.
Explaining Data Mixing Scaling Laws: This paper provides the long-missing theoretical explanation for "multi-domain data mixing scaling laws." By extending two classic theories of single-domain scaling laws (the quantization model and the projection linear regression model) to multiple domains, it proposes a "shared head, disjoint tail" distribution hypothesis. It identifies two mechanisms governing the loss of each domain: capacity competition (limited model capacity is contested by domain-specific skills, globally coupling all domain losses) and data quantity noise (losses in harder-to-learn domains decrease more slowly, biasing the optimal ratio toward them). The resulting model achieves lower fitting errors using fewer parameters and enables cross-scale extrapolation, using small-scale fitted parameters to predict optimal ratios for large models.
FlexRank: Nested Low-Rank Knowledge Decomposition for Adaptive Model Deployment: FlexRank performs activation-aware low-rank decomposition (DataSVD) on each linear layer of a pre-trained large model, uses dynamic programming to select a set of strictly nested sub-models corresponding to different compute budgets in \(O(L\cdot K)\) time, and jointly trains this shared weight set using knowledge distillation. Finally, via Gauge-Aligned Reparametrization, rank savings are translated into actual FLOPs savings—yielding a "family" of deployable models for LLMs and ViTs that approach the true Pareto frontier with a single training run.

Browse all 27 Pretraining papers →

✏️ Knowledge Editing (8)¶

AnyEdit++: Adaptive Long-Form Knowledge Editing via Bayesian Surprise: AnyEdit++ utilizes token-level Bayesian Surprise to identify semantic transition points in long-form text, replacing the fixed-window segmentation of AnyEdit with structure-aware Bayes-Chunk. It achieves stable improvements in BLEU and BERT Score across long-form knowledge editing tasks such as mathematics, code, news, and poetry.
CrispEdit: Low-Curvature Projections for Scalable Non-Destructive LLM Editing: LLM editing is formulated as a constrained optimization problem: "minimize edit loss s.t. capability loss remains invariant". This is equivalently transformed via Bregman divergence into a low-curvature subspace projection of the Gauss-Newton Hessian (GNH). By employing K-FAC and a Kronecker eigenbasis technique that avoids explicit construction of the projection matrix, 3,000 edits are completed in 6 minutes on an A40. The average performance drop of LLaMA-3-8B across MMLU/IFEval/ARC-C/TruthfulQA/GSM8K is suppressed to \(< 1\%\), significantly outperforming AlphaEdit, MEMIT, and fine-tuning.
Do Text Edits Generalize to Visual Generation? Benchmarking Cross-Modal Knowledge Editing in UMMs: This paper proposes UniKE—the first "cross-modal knowledge editing" benchmark for Unified Multimodal Models (UMMs) (2,971 editing subjects, 5,535 VQA-verifiable instances). It systematically reveals a modality gap where the "text-side editing success rate is ~92%, yet image generation VQA is only ~18.5%." By using a "reasoning-augmented parameter editing" protocol, it increases VQA accuracy by up to 18.6 percentage points and identifies the root cause as the LLM-to-DiT projection bottleneck using cosine drift metrics on the conditioning path.
From Backward Spreading to Forward Replay: Revisiting Target Construction in LLM Parameter Editing: This paper systematically analyzes why backward spreading in locate-then-edit works and where it falls short. It proposes forward replay: treating the hidden state of the first decisive layer as an optimization variable and performing a standard forward pass to obtain targets for subsequent layers. This achieves consistent performance gains over MEMIT/RECT/PRUNE/AlphaEdit without additional computational overhead.
KORE: Enhancing Knowledge Injection for Large Multimodal Models via Knowledge-Oriented Controls: KORE injects new knowledge into LMMs through two-stage "knowledge-oriented controls": automatically expanding single facts into structured multi-turn conversations and instruction tasks (to enhance generalization), while initializing LoRA adapters using the null space of the covariance matrix of prior knowledge (to minimize interference with existing capabilities). It achieves both strong adaptation and strong retention on LLaVA-v1.5 / Qwen2.5-VL.
Reverse-Engineering Model Editing on Language Models: The paper reveals that parameter update matrices of locate-then-edit knowledge editing methods (ROME/MEMIT/AlphaEdit) leak "edited subject" fingerprints through their row spaces. It proposes a two-stage attack, KSTER (recovering subjects via SVD, then prompts via relative entropy drop), and a defense called Subspace Camouflage based on "semantic decoy" injection.
Revisiting Parameter-Based Knowledge Editing in Large Language Models: Theoretical Limits and Empirical Evidence: Starting from the "dimension collapse" hypothesis, this paper proves that parameter-level knowledge editing is amplified along directions with low singular values and accumulates linearly with sequential editing. This systematically degrades core LLM capabilities across multiple models, datasets, and evaluation dimensions. The paper further shows that a simple retrieval-based baseline, SCR, outperforms existing parameter editing methods in all settings.
The Labyrinth and the Thread: Rethinking Regularizations in Sequential Knowledge Editing for Large Language Models: This paper proves from an optimization perspective that the stability of sequential editing (SE) stems from "cumulative updates being equivalent to the solution of one-time editing (OTE)." Fancy mechanisms like AlphaEdit's null-space projection or post-processing regularizations in PRUNE/RECT are not the critical factors—as long as OTE-SE alignment is ensured, 2000 steps of sequential editing can be stably completed across four mainstream LLMs even after removing these regularizations.

💬 LLM (Other) (39)¶

A Geometric Relation of the Error Introduced by Sampling a Language Model's Output Distribution to its Internal State: This paper characterizes the information loss introduced by sampling from high-entropy distributions in GPT-style LLMs from a differential geometry perspective. By constructing \(\mathfrak{so}(n)\)-valued 1-forms and parallel transport operators, it demonstrates through chess probe experiments that these geometric rotations align highly with the model’s learned world vectors.
ANCHOR: Abductive Network Construction with Hierarchical Orchestration for Reliable Probability Inference in Large Language Models: ANCHOR constructs a dense factor space using "bottom-up abduction + hierarchical clustering." For downstream conditions, it performs coarse-to-fine retrieval to obtain a sparse set of relevant factors. It then aggregates posteriors by combining Naïve Bayes with a dynamically constructed Causal Bayesian Network (CBN) featuring latent variables. In high-risk LLM decision-making scenarios, it significantly reduces "unknown" predictions and improves probability calibration.
Automated Formal Proofs of Combinatorial Identities via Wilf–Zeilberger Guidance and LLMs: WZ-LLM compiles the classic Wilf–Zeilberger symbolic proof pipeline into an executable proof skeleton (recurrence + boundary conditions + side conditions) in Lean 4. These components are discharged by WZ-Prover, a specialized model trained via SFT + expert-iteration + DAPO. On 100 classic combinatorial identities, it improves the pass@32 from Goedel-Prover-V2's 9% to 34%.
Compute as Teacher: Turning Inference Compute Into Reference-Free Supervision: This paper proposes Compute as Teacher (CaT): it "synthesizes" a pseudo-reference answer from \(G\) rollouts already sampled by GRPO using a frozen anchor model. In non-verifiable domains, the model uses binary rubrics self-derived from this pseudo-reference to score each rollout as an RL reward. This directly converts inference compute into supervision signals without any human annotation, achieving up to a 30% improvement over baselines on HealthBench and matching or exceeding inference-time aggregation with 9× lower test-time compute.
Creative Collision: Directorial Persona Steering and Competition in Large Language Models: Two semantically opposing "directorial persona" steering vectors (Spielberg's optimistic redemption vs. Scorsese's dark moral ambiguity) are simultaneously injected into the residual stream of an LLM. This study systematically characterizes the moral tone, coherence, and geometric changes during the competition between these two directions, discovering three counter-intuitive phenomena: "directional dominance," "coherence trough," and the "Layer 28 moral hub."
Deep Networks Learn to Parse Uniform-Depth Context-Free Languages from Local Statistics: The authors propose a "Varying-tree RHM" Probabilistic Context-Free Grammar (PCFG) with controllable ambiguity. They prove that using only low-order moments (root-to-pair and root-to-triple) combined with layer-wise clustering is sufficient to recover grammar rules and perform CYK-style parsing. The sample complexity is derived as \(P^\star \asymp v\, m_3\, m_2^{L-1} (p_2^2/2)^{1-L}\), and experiments on CNNs and Transformers strictly follow this power law.
Differential Syntactic and Semantic Encoding in LLMs: By averaging hidden representations of sentences sharing the same syntactic structure or the same meaning to obtain "syntactic centroids" and "semantic centroids," the authors demonstrate that a significant portion of syntactic/semantic information in LLMs like DeepSeek-V3 is encoded via linear superposition. Moreover, these two types of information exhibit clear separability in layer-wise distribution and orthogonal ablation—supporting the linguistic hypothesis of "syntactic autonomy."
Emergence of Hierarchical Emotion Organization in Large Language Models: The paper utilizes a tree-building algorithm that relies solely on LLM output logits without any annotations to "excavate" a hierarchical emotion tree from the model's next-token distribution of emotion words. It finds that as the model scale increases, these trees increasingly resemble the human psychological "emotion wheel." Furthermore, it demonstrates that LLMs under different demographic personas reproduce systematic emotion recognition biases consistent with those of human subjects.
Express Your Doubts: Probabilistic World Modeling Should Not Be Based on Token logprobs: This is a position paper arguing that treating the token softmax probabilities (logprobs) of an LLM as "world event probabilities" is theoretically flawed. This is because distribution estimation, response prediction, and target distribution estimation are three distinct tasks, each corresponding to a different ideal output distribution. The correct approach to obtaining world probabilities is second-order prediction—tasking the LLM to explicitly output its probability for an event (using numerical values or verbal qualifiers) rather than calculating "the probability of it generating X."
How Many Different Outputs Can a Transformer Generate?: Starting from two fundamental architectural facts—finite precision and bounded embedding support—this paper proves that any Transformer can only generate a finite number of "accessible sequences." It provides a tight upper bound where the length of accessible sequences grows linearly with prompt length, after which the proportion of accessible sequences decays exponentially at a rate of \(1/|V|^n\). Experiments on Pythia, Qwen, Llama, and Gemma verify that the theoretical slope differs from the measured value by only 5–10x.

Browse all 39 LLM (Other) papers →

📖 NLP Understanding (2)¶

Causal Fine-Tuning under Latent Confounded Shift: This paper proposes Causal Fine-Tuning (CFT): an SCM-inspired decomposition of "high-level stable features \(C\) + low-level confounding-sensitive features \(\Phi\)" is embedded into standard BERT fine-tuning. By utilizing a front-door style do-calculus adjustment for prediction, it significantly outperforms single-domain generalization baselines such as SFT/SWA/WISE under text spurious correlation injection attacks.
Controlling the Risk of Corrupted Contexts for Language Models via Early-Exiting: This paper formalizes the problem where "user-provided corrupted contexts degrade LLM performance" as a risk control task. By using zero-shot performance as a "safety baseline," combining dynamic early-exit (predicting at intermediate layers to avoid late-layer overthinking of harmful contexts) with a context-aware loss and an improved Learn-then-Test framework (preserving negative loss values via risk transformation rather than clipping), this method guarantees risk \(\leq\) user-specified \(\epsilon\) while achieving \(> 50\%\) computational acceleration across 9 tasks.

✍️ Text Generation (2)¶

Characterizing the Effect of Noise in Language Generation in the Limit: Under the Kleinberg-Mullainathan formal framework of "language generation in the limit," this paper proves that for both uniform and non-uniform generation, noise level 1 is equivalent to any finite noise level \(i \geq 1\) (hierarchy collapse), while a strict separation exists between the noise-free case and noise level 1. Furthermore, it provides the first complete characterization of non-uniform noise-dependent generatability.
Score-Repellent Monte Carlo: Toward Efficient Non-Markovian Sampler with Constant Memory in General State Spaces: SRMC utilizes a \(d\)-dimensional running score average (rather than an \(|\mathcal{X}|\)-dimensional empirical measure) to record history. This history is then incorporated into an exponential score-tilt to construct a surrogate target \(\pi_\theta\) that "repels already visited regions." By wrapping this around any base MCMC kernel, the authors implement a non-Markovian, low-variance, normalization-free sampler with constant memory in general state spaces.

🗣️ Dialogue Systems (5)¶

Context-Driven Incremental Compression for Multi-Turn Dialogue Generation: Concatenating full histories in multi-turn dialogues is expensive and leads to lost clues. This paper proposes C-DIC: viewing dialogues as interleaved "topic threads," it stores revisable per-thread compressed states in a compact memory. Each turn follows a lightweight "Retrieval \(\to\) Revision \(\to\) Write-back" cycle, trained with retrieval-aware truncated backpropagation through time (ra-TBPTT), maintaining stable latency and perplexity over hundreds of turns.
DiscoverLLM: From Executing Intents to Discovering Them: DiscoverLLM formalizes the scenario where "the user has not clearly defined their goals" as a progressive discovery process within a hierarchical intent tree. By using a rewardable hierarchical user simulator, the model is trained to actively explore divergently when goals are unclear and converge for execution when they are clarified. On creative writing, technical writing, and SVG tasks, the method achieves a 10% improvement in satisfaction and a 40% reduction in dialogue length compared to baselines like CollabLLM.
From Self-Evolving Synthetic Data to Verifiable-Reward RL: Post-Training Multi-turn Interactive Tool-Using Agents: Addressing two major bottlenecks in post-training multi-turn interactive tool-using agents—expensive high-quality data and RL signal degradation from user simulation noise—the authors propose "AReaL-SEA," a self-evolving multi-agent data synthesis pipeline that generates executable verifiers as rewards. Combined with an RL recipe featuring user model SFT, large batches, and dynamic filtering GRPO, Qwen3-235B achieves a pass^1 of 73.0 in Airline and 98.3 in Telecom on τ²-bench, matching or exceeding Claude/Gemini/GPT-5.
Is Your LLM Overcharging You? Tokenization, Transparency, and Incentives: This paper models LLM-as-a-Service as a "principal-agent" problem, proving that current mainstream "pay-per-token" mechanisms naturally incentivize service providers to re-segment the same string into longer token sequences for overcharging. Furthermore, even if providers are forced to disclose next-token distributions, overcharging without detection remains NP-Hard rather than impossible—the authors provide a simple heuristic algorithm that increases reported tokens by up to 11.2% while maintaining plausibility. Finally, it is proven that the only additive pricing mechanism that eliminates this incentive is "linear pay-per-character."
Not All Prefills Are Equal: PPD Disaggregation for Multi-turn LLM Serving: This paper points out that traditional Prefill-Decode (PD) disaggregated architectures are significantly inefficient in multi-turn dialogue scenarios due to the repeated P→D recomputation and transmission of KV caches for each turn. It proposes PPD (Prefill-capable Decode), a dynamic routing system that allows decode nodes to decide whether to process Turn 2+ append-prefills locally based on SLO weights, reducing Turn 2+ TTFT by approximately 68%.
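The incentive problem described in the "Is Your LLM Overcharging You?" entry above can be seen in a toy example. The vocabulary, prices, and function names below are hypothetical. The sketch only shows that two valid segmentations of the same string produce different pay-per-token bills but identical pay-per-character bills; it does not reproduce the paper's heuristic or its plausibility constraint.

```python
# Hypothetical toy vocabulary: both segmentations below are valid under it.
VOCAB = {"hello", "he", "llo", "h", "e", "l", "o"}


def is_valid_segmentation(text: str, tokens: list[str], vocab: set[str]) -> bool:
    """A segmentation is valid if every piece is in the vocabulary
    and the pieces concatenate back to the original string."""
    return all(t in vocab for t in tokens) and "".join(tokens) == text


def per_token_bill(tokens: list[str], price_per_token: float) -> float:
    """Pay-per-token: the bill depends on how the string was segmented."""
    return len(tokens) * price_per_token


def per_char_bill(tokens: list[str], price_per_char: float) -> float:
    """Pay-per-character: the bill depends only on the string itself."""
    return sum(len(t) for t in tokens) * price_per_char


text = "hello"
honest = ["hello"]
inflated = ["he", "l", "l", "o"]  # same string, more reported tokens

for name, toks in (("honest", honest), ("inflated", inflated)):
    assert is_valid_segmentation(text, toks, VOCAB)
    print(name, per_token_bill(toks, 1.0), per_char_bill(toks, 1.0))
```

Because the character count is fixed by the output string, re-segmentation cannot change a per-character bill, which matches the summary's claim that linear pay-per-character pricing removes the incentive.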

🌐 Multilingual & Translation (3)¶

Edit-Based Refinement for Parallel Masked Diffusion Language Models: ME-DLM introduces a lightweight "decode-then-edit" refinement stage to masked diffusion language models (e.g., LLaDA). The first stage generates a draft via standard parallel unmasking, while the second stage performs parallel corrections using replace/delete/insert actions supervised by the shortest edit distance scripts. Using only 1/8 of the diffusion step budget, it outperforms LLaDA-Instruct by +11.6 on HumanEval and +33.6 on GSM8K.
Optimizing Language Models for Crosslingual Knowledge Consistency: This paper addresses the issue of multilingual LLMs providing conflicting answers to the same question across different languages. It designs an RL objective using the "log-likelihood of the answer in another language" as a reward, proving that the optimal policy follows a product-of-experts form and guarantees crosslingual preference consistency when \(\gamma_1\gamma_2=\beta^2\). Based on this, the authors derive DCO (Direct Consistency Optimization), a reward-model-free and online-sampling-free algorithm. Experiments across 9 LLMs, 3 multilingual QA benchmarks, and 26 languages demonstrate simultaneous improvements in crosslingual consistency (RankC) and response accuracy.
Toward Robust Multilingual Adaptation of LLMs for Low-Resource Languages: LiRA inserts a lightweight fine-tuning module featuring "anchoring + consistency regularization" between a frozen multilingual encoder and an English LLM. It constrains the sentence vectors of low-resource languages into a shared English semantic space through two theoretically controllable quantities: \(\epsilon_1\) (anchoring error) and \(\epsilon_2\) (translation KL distance), achieving stable improvements across retrieval, ranking, and reasoning tasks.
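The ME-DLM entry above mentions replace/delete/insert actions supervised by shortest edit distance scripts. As an illustration, the sketch below derives such a script with the textbook Levenshtein dynamic program and a backtrace. This is a generic construction, not the paper's training code; the function names and the `keep` label are assumptions, and the paper's parallel action format is not reproduced.

```python
def edit_script(draft: list[str], target: list[str]) -> list[tuple]:
    """Return a minimum-length list of (op, draft_token, target_token) steps."""
    n, m = len(draft), len(target)
    # dist[i][j] = minimum number of edits turning draft[:i] into target[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if draft[i - 1] == target[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j - 1] + sub,  # keep or replace
                             dist[i - 1][j] + 1,        # delete from draft
                             dist[i][j - 1] + 1)        # insert from target
    # Walk back from the bottom-right corner to recover one optimal script.
    ops, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + (draft[i - 1] != target[j - 1]):
            op = "keep" if draft[i - 1] == target[j - 1] else "replace"
            ops.append((op, draft[i - 1], target[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            ops.append(("delete", draft[i - 1], None))
            i -= 1
        else:
            ops.append(("insert", None, target[j - 1]))
            j -= 1
    ops.reverse()
    return ops


def apply_script(ops: list[tuple]) -> list[str]:
    """Replay a script to rebuild the target sequence."""
    return [tgt for op, _, tgt in ops if op != "delete"]


draft = ["return", "a", "+", "+", "b"]
target = ["return", "a", "+", "b", ";"]
script = edit_script(draft, target)
print(script)
print(apply_script(script) == target)
```

Each non-`keep` step in the script is a candidate supervision label for a position in the draft, which is one plausible way to read "supervised by the shortest edit distance scripts."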

🔍 Information Retrieval & RAG (26)¶

BlitzRank: Principled Zero-shot Ranking Agents with Tournament Graphs: This paper proposes BlitzRank, a zero-shot reranking framework based on tournament graphs. By accumulating the \(\binom{k}{2}\) preference pairs generated by each \(k\)-wise comparison into a global preference graph and utilizing transitive closure to infer additional ranking relations, it achieves Pareto optimality across 14 benchmarks and 5 LLM oracles, reducing token consumption by 25–40% while matching or exceeding the accuracy of existing methods.
CARE: Class-Adaptive Expert Consensus for Reliable Learning with Long-Tailed Noisy Labels: CARE leverages three complementary experts (VLM text embeddings, image features, and original labels) to achieve reliable label correction in long-tailed noisy-label scenarios through a class-adaptive Top-\(K\) consensus mechanism, consistently surpassing SOTA by up to 3.0% on synthetic and real-world benchmarks.
Graph-R1: Towards Agentic GraphRAG Framework via End-to-end Reinforcement Learning: Graph-R1 reformulates GraphRAG as an end-to-end RL framework featuring a "knowledge hypergraph environment + multi-turn think–query–retrieve–answer agent + outcome-oriented GRPO." By utilizing lightweight n-ary hypergraph construction and dual-path hyperedge retrieval with RRF fusion, it improves the F1 score of 7B models from Search-R1's 46.19 to 57.82 across six standard RAG datasets.
HGMem: Hypergraph-based Working Memory to Improve Multi-step RAG for Long-Context Complex Relational Modeling: This paper reconstructs the working memory in multi-step RAG from a "flat list of facts" into a hypergraph. Each hyperedge serves as a memory point that can be updated, inserted, or merged. By leveraging the inherent ability of hyperedges to connect \(n \geq 2\) entities, the system allows memory to continuously consolidate low-order facts into high-order concepts during interactions, significantly improving performance in long-context QA tasks that require "global sense-making."
Hierarchical Abstract Tree for Cross-Document Retrieval-Augmented Generation: Ψ-RAG replaces RAPTOR's k-means with a "merge-collapse" hierarchical clustering to construct cross-document abstraction trees. It incorporates a retrieval-response Agent with multi-turn rewriting capabilities and a hybrid BM25 index, enabling Tree-RAG to match or exceed Graph-RAG in corpus-level, multi-hop QA for the first time. The average F1 score is 25.9% higher than RAPTOR and 7.4% higher than HippoRAG 2.
How can embedding models bind concepts?: This paper formalizes the question of "why embedding models fail to bind concepts" as a "complexity problem of the binding function." Through geometric analysis, it demonstrates that CLIP's scene embeddings decompose additively into objects and concepts (explaining why they are probeable in unimodal settings but fail cross-modally). Furthermore, it proves on controlled Transformers that with sufficient data coverage, models learn low-complexity binding dominated by multiplicative interactions between concepts, achieving systematic generalization to unseen object combinations.
LARE: Low-Attention Region Encoding for Text–Image Retrieval: LARE is a training-free text–image retrieval framework: it extracts "low-attention" regions from a frozen vision encoder, re-encodes them, and integrates them into global similarity scores via confidence gating. This significantly improves recall for CLIP/SigLIP-style dual-encoders in crowded scenes with small or rare objects while maintaining performance on standard datasets.
LazyAttention: Efficient Retrieval-Augmented Generation with Deferred Positional Encoding: LazyAttention defers RoPE positional encoding from the KV cache write stage to on-the-fly execution within the attention kernel. This allows a single physical KV copy to be reused by any logical position. On skewed RAG workloads, it reduces TTFT by 1.37× and improves throughput by 1.40× compared to SOTA Block-Attention, with negligible loss in generation quality.
LEMUR: Learned Multi-Vector Retrieval: LEMUR transforms multi-vector similarity search into a supervised learning problem. By using a two-layer MLP to map token-level embeddings to a low-dimensional latent space and leveraging existing single-vector ANNS indices for retrieval, it runs an order of magnitude faster than methods such as PLAID and MUVERA.
Less Is More: Elevating RAG via Performance-Driven Context Compression: CORE-RAG trains a small 1.5B compressor using GRPO reinforcement learning with "performance-as-reward," compressing the retrieved top-k documents into summaries of ~3% of the original length. Rather than degrading performance, this yields an average improvement of 3.3 EM over full-context RAG across four QA benchmarks.
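
The BlitzRank entry above describes turning each \(k\)-wise comparison into \(\binom{k}{2}\) preference pairs and then inferring further relations by transitive closure. A minimal Python sketch of that bookkeeping follows. The function names, the best-first list format for a comparison result, and the document ids are assumptions made for illustration, not the paper's implementation, and the sketch assumes the preferences contain no cycles.

```python
from itertools import combinations


def pairs_from_ranking(ranked):
    """Expand one k-wise comparison (a best-first list) into C(k, 2)
    (winner, loser) pairs. The list format is an assumption."""
    return list(combinations(ranked, 2))


def transitive_closure(edges):
    """Return every (a, b) such that a is preferred to b directly or
    through a chain of preferences (depth-first reachability)."""
    beats = {}
    for winner, loser in edges:
        beats.setdefault(winner, set()).add(loser)
        beats.setdefault(loser, set())
    closure = set()
    for start in beats:
        stack = list(beats[start])
        seen = set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            stack.extend(beats[node])
        closure.update((start, node) for node in seen)
    return closure


# Two hypothetical 3-wise comparisons that share document "d3".
edges = pairs_from_ranking(["d1", "d2", "d3"]) + pairs_from_ranking(["d3", "d4", "d5"])
closure = transitive_closure(edges)
inferred = closure - set(edges)  # relations never compared directly
print(sorted(inferred))
```

In this toy case the two comparisons never place `d1` or `d2` alongside `d4` or `d5`, yet the closure orders them through the shared `d3`, which is how comparisons (and tokens) can be saved.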

Browse all 26 Information Retrieval & RAG papers →

💻 Code Intelligence (22)

A Benchmark and Framework for Evaluating Next Action Predictions in Spreadsheets: Addressing the gap where spreadsheets lack next-action prediction similar to code completion, this paper constructs NAPE, the first spreadsheet action prediction benchmark (52 human-verified creation trajectories with 11,907 low-level actions). It proposes an online evaluation framework: after each action, the system provides predictions, simulates user acceptance/rejection, and dynamically rewrites the remaining ground-truth actions. Performance is measured by User Action Savings (UAS); experiments show that a fine-tuned 360M model matches GPT-5 (both saving 27% of actions).
AlgoVeri: An Aligned Benchmark for Verified Code Generation on Classical Algorithms: AlgoVeri constructs a strictly aligned benchmark for verified code generation of classical algorithms across Dafny, Verus, and Lean. It demonstrates that current LLMs still face significant gaps in handling complex global invariants, system-level constraints, and explicit proof search, with success rates in Lean and Verus being substantially lower than those in Dafny.
BoostAPR: Boosting Automated Program Repair via Execution-Grounded Reinforcement Learning with Dual Reward Models: BoostAPR constructs a three-stage pipeline for training program-repair models via RL: execution-verified SFT → training sequence-level + line-level dual reward models → redistributing sequence rewards to key edit-line spans using the line-level model during PPO. Using Qwen2.5-Coder-32B, it pushes SWE-bench Verified performance from 17.8% to 40.7% (+22.9pp) and achieves 24.8% on Defects4J through cross-lingual transfer.
Bridging Functional Correctness and Runtime Efficiency Gaps in LLM-Based Code Translation: Addressing the neglected problem where "LLM-translated code is functionally correct but slower than human-written code," this work proposes the SwiftTrans framework. It generates multi-perspective candidates using parallel ICL and selects the optimal candidate in linear time via a difference-aware pairwise judge using bubble-scan. Combined with Hierarchical Guidance and Ordinal Guidance training strategies, a Qwen2.5-3B model surpasses GPT-5 in both functional correctness and runtime efficiency.
CentaurEval: Benchmarking Human-in-the-Loop Value in Agentic Coding: CentaurEval is proposed as the first unified evaluation framework for human-AI collaborative programming. By designing 45 "Collaboration-Necessary" task templates, it demonstrates that LLMs alone achieve only a 0.67% pass rate and humans alone achieve 18.89%, while human-AI collaboration reaches 31.11%, revealing that LLMs are evolving from execution tools into co-reasoning partners.
Entropy-informed Decoding: Adaptive Information-Driven Branching: EDEN (Entropy-informed DEcodiNg) sets the step-wise beam width \(B_t\) to be monotonically proportional to the normalized entropy \(\bar H_t\)—branching more at high-entropy forks and behaving almost greedily during low-entropy steps. This approximates wider beam search with fewer total expansions. The authors theoretically prove that entropy-monotonic branching factors are strictly superior to any fixed beam width in terms of expected cumulative regret, providing an explicit regret rate of \(\mathbb{E}[R_T] \leq G P_\max \sum_t \exp(-c m_t \Delta_\min^2)\).
HE-SNR: Uncovering Latent Logic via Entropy for Guiding Mid-Training on SWE-bench: Traditional PPL on SWE-bench is disrupted by the "long context tax" and fails to predict agent capability after SFT. This paper proposes the "Entropy Compression Hypothesis" and the HE-SNR metric, which calculates the signal-to-noise ratio only at "high-entropy decision points" where Top-10 entropy exceeds \((\ln 3 + \ln 4)/2\). This achieves a Pearson correlation of 0.96 and a Kendall consistency of 0.98 with downstream SWE-bench scores.
How can we assess human-agent interactions? Case studies in software agent design: The authors propose the PULSE framework—which collects user feedback, trains an ML model to predict user satisfaction, and employs Prediction-Powered Inference (PPI) to combine real human labels with model pseudo-labels for efficient estimation of agent design effects. Deployed on the open-source coding agent OpenHands across 15,000 users and 36,000 sessions, this work represents the first large-scale real-world evaluation of agent design. Results show that PULSE narrows confidence intervals by approximately 40% compared to standard A/B testing and reveals that benchmark performance can be anti-correlated with human preference (e.g., GPT-5 outperformed Claude-Sonnet-4 on 6/7 benchmarks, yet humans preferred Claude on 4/7 task subsets).
Locally Coherent Parallel Decoding in Diffusion Language Models: This paper proposes CoDiLA, which attaches a lightweight autoregressive (AR) model to a masked diffusion language model (DLM). By receiving the marginal distributions of the DLM through "soft embeddings" and performing local autoregressive decoding within small blocks, it eliminates the local incoherence caused by parallel sampling while preserving the global bidirectional capabilities of the DLM. It establishes a new Pareto frontier on code generation with \(\geq 2\times\) throughput.
MARS: Modular Agent with Reflective Search for Automated AI Research: MARS reframes automated AI research as a problem of "searching for the optimal solution within a software repository space." Built on three pillars—Budget-Aware MCTS, a modular "Design-Decompose-Implement" pipeline, and Comparative Reflective Memory—it achieves SOTA among open-source frameworks on MLE-Bench with a 31.1% gold medal rate (Gemini-3-Pro-Preview) and demonstrates an "Aha! moment" with a 63% cross-branch lesson transfer rate.
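
The EDEN entry above ties the step-wise beam width \(B_t\) to the normalized entropy \(\bar H_t\) of the next-token distribution. The Python sketch below illustrates one such monotone mapping. The linear interpolation between `b_min` and `b_max`, those parameter names, and the toy distributions are assumptions for illustration; the entry does not state the exact schedule the authors use.

```python
import math


def normalized_entropy(probs):
    """Shannon entropy divided by log(len(probs)), so the value lies in [0, 1]."""
    if len(probs) < 2:
        return 0.0
    h = -sum(p * math.log(p) for p in probs if p > 0)
    return h / math.log(len(probs))


def beam_width(probs, b_min=1, b_max=8):
    """Hypothetical monotone map from normalized entropy to a beam width:
    near-greedy when the model is confident, wide at uncertain forks."""
    h_bar = normalized_entropy(probs)
    return b_min + round(h_bar * (b_max - b_min))


confident = [1.0, 0.0, 0.0, 0.0]      # low-entropy step
mixed = [0.7, 0.1, 0.1, 0.1]          # moderately uncertain step
uniform = [0.25, 0.25, 0.25, 0.25]    # high-entropy fork
widths = [beam_width(p) for p in (confident, mixed, uniform)]
print(widths)
```

Any non-decreasing map would fit the description equally well; the linear one is chosen here only because it is the simplest to read.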

Browse all 22 Code Intelligence papers →

🎨 Image Generation (141)

A Diffusive Classification Loss for Learning Energy-based Generative Models: This paper proposes DiffCLF, which reformulates energy estimation across temporal noise levels as a classification problem. By training jointly with DSM, it learns more reliable energy functions without requiring expensive maximum likelihood sampling, specifically alleviating the "mode blindness" of score matching regarding multi-modal weights.
A Kinetic Energy Perspective of Flow Matching: This paper treats flow matching sampling trajectories as particle motions and defines Kinetic Path Energy (KPE) to measure the cumulative kinetic energy of the generation process for each sample. Based on this, a training-free strategy called Kinetic Trajectory Shaping (KTS) is proposed to enhance generation quality while suppressing memorization caused by late-stage energy spikes.
A Systematic Investigation of RL-Jailbreaking in LLMs: This paper investigates RL-based LLM jailbreaking as a decomposable POMDP system, finding that environment definition factors—such as reward functions, episode length, and the number of training questions—determine automated red teaming success rates more significantly than the choice of RL algorithm.
A Unified Framework for Diffusion Model Unlearning with f-Divergence: This paper generalizes MSE/KL alignment in diffusion model concept unlearning to arbitrary \(f\)-divergence, proposing the f-DMU framework. It identifies that closed-form Hellinger loss is often more stable and better at preserving non-target concepts than MSE.
AdaEraser: Training-Free Object Removal via Adaptive Attention Suppression: AdaEraser adaptively modulates the self-attention suppression intensity of diffusion models based on the "object presence degree." It simultaneously improves object removal completeness and background reconstruction quality without training, outperforming both training-based and training-free object removal methods on Mulan and OABench.
Adapting Noise to Data: Generative Flows from Learned 1D Processes: This paper argues that the default Gaussian latent in flow/diffusion models is not always suitable for the data distribution. It proposes constructing a data-adaptive product prior using learnable 1D quantile functions to jointly learn the noise and velocity field in flow matching, thereby shortening the transport path and improving performance on heavy-tailed weather data and low-capacity image generation.
Adversarial Flow Models: The authors add an optimal transport regularization term \(\|G(z)-z\|^2\) to the GAN training objective, constraining the GAN's "arbitrary transport map" to a unique Wasserstein-2 optimal transport map. This allows adversarial training on pure Transformers to stabilize for the first time and perform end-to-end single-step generation. On ImageNet-256, the 1NFE FID reaches 2.38 (XL/2) and 1.94 (112-layer recursive model).
AesFormer: Transform Everyday Photos into Beautiful Memories: AesFormer defines aesthetic photo enhancement as Aesthetic Photo Reconstruction (APR). It introduces a two-stage framework that first generates a photography action plan and then executes structural editing, transforming errors in composition, perspective, and pose into executable edits. It significantly outperforms open-source editors on AesRecon and approaches the performance of Nano Banana Pro.
AG-REPA: Causal Layer Selection for Representation Alignment in Audio Flow Matching: AG-REPA discovers that the "layers storing semantic information" and the "layers actually driving the velocity field" in audio Flow Matching do not coincide. It proposes using forward-only gate ablation to select layers with the highest causal contribution for representation alignment, achieving faster convergence and lower FAD than fixed-layer REPA in speech and general audio generation.
Alignment-Guided Score Matching for Text-to-Image Alignment in Diffusion Models: This paper proposes Alignment-Guided Score Matching, which utilizes a reward-free Plackett-Luce alignment reward to directly incorporate positive and negative text-image matching signals into the diffusion score matching objective. By training lightweight soft tokens, it improves T2I semantic alignment while mitigating common repetition and counting errors found in SoftREPA.
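
The Adversarial Flow Models entry above describes adding an optimal transport regularizer to the GAN training objective. Written out as a sketch, with the weight \(\lambda\), the latent distribution \(p(z)\), and the generic adversarial term \(\mathcal{L}_{\text{adv}}\) as assumed notation that the entry itself does not give:

\[
\mathcal{L}_G \;=\; \mathcal{L}_{\text{adv}}(G) \;+\; \lambda \, \mathbb{E}_{z \sim p(z)} \big[ \|G(z) - z\|^2 \big]
\]

The second term penalizes the squared distance by which the generator moves each latent \(z\). This is the quadratic transport cost that underlies the Wasserstein-2 distance, which is why it can single out one transport map among the many that match the data distribution.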

Browse all 141 Image Generation papers →

🎬 Video Generation (32)

AAD-1: Asymmetric Adversarial Distillation for One-Step Autoregressive Video Generation: AAD-1 utilizes asymmetric adversarial distillation featuring a "causal generator + bidirectional video-level discriminator" alongside DMD warmup to compress autoregressive image-to-video generation into a single sampling step per chunk, effectively mitigating motion collapse and long-range drift.
Attention Sparsity is Input-Stable: Training-Free Sparse Attention for Video Generation via Offline Sparsity Profiling and Online QK Co-Clustering: SVOO discovers that the attention sparsity of each layer in video DiT is an intrinsic property that is "input-independent within layers and significantly heterogeneous between layers." Based on this, it performs offline per-layer sparsity calibration followed by online QK bidirectional co-clustering for block partitioning. It achieves up to 1.93× speedup while maintaining a PSNR of 29 dB across 7 models (e.g., Wan, HunyuanVideo) without any training.
Bridging Creative Intent and Visual Quality: Creator-Driven Recurrent Video Generation with Agentic Feedback Loops: CHIEF places the creator at the center of the video generation iterative loop. It utilizes "anthropomorphic multi-modal LLM audience agents" to automatically generate subjective film reviews for generated videos, which are then structured into actionable prompt modifications by a translator. This allows even middle school students without filming experience to scale from 1-minute clips to 10-minute short films with complete narratives.
CamGeo: Sparse Camera-Conditioned Image-to-Video Generation with 3D Geometry Prior: CamGeo distills 3D geometric knowledge from a pre-trained 3D video model (VGGT) into a diffusion model. Because VGGT supplies supervision only during training, the diffusion model learns to generate high-quality videos with geometric consistency and smooth motion under sparse camera inputs, and VGGT is removed entirely at inference to maintain efficiency.
DFSAttn: Dynamic Fine-Grained Sparse Attention for Efficient Video Generation: DFSAttn achieves 2.1× end-to-end acceleration with quality comparable to full attention through 3D Hilbert curve reordering + hierarchical block scoring + adaptive mask caching. It addresses the core issue of quality degradation in block-sparse attention at high sparsity ratios (>80%).
Enhancing Train-Free Infinite-Frame Generation for Consistent Long Videos: MIGA enables base video models to generate infinitely long and highly temporally consistent videos without training through two core mechanisms: Two-Stage Training-Inference Alignment (TTA) and Dual Consistency Enhancement (DCE: Self-Reflection + Long-Range Frame Guidance). It achieves a 2.8-point improvement in VBench composite score over FIFO-Diffusion (97.82 vs 95.02).
EPiC: Efficient Video Camera Control Learning with Precise Anchor-Video Guidance: EPiC utilizes a "first-frame visibility mask" approach to construct pixel-aligned anchor videos directly from arbitrary in-the-wild videos. Paired with Anchor-ControlNet, which has only 26M parameters (<1% of the backbone) and operates exclusively on visible regions, EPiC achieves SOTA I2V camera control precision and zero-shot generalization to V2V. This is accomplished while freezing the CogVideoX-5B-I2V backbone, using only 5K videos and 500 training steps.
Explainable Forensics of Manipulated Segments in Untrimmed Long Videos: This paper proposes the task of temporal localization and explainable analysis of AI-generated segments in long videos, introducing the large-scale TASLE dataset and the two-stage MSLoc baseline method—achieving precise localization and explainable reasoning of manipulated segments in mixed real-fake videos through boundary-aware proposal generation and MLLM refinement.
Exploring Data-Free LoRA Transferability for Video Diffusion Models: This paper presents the first weight-space analysis of Full Fine-Tuning (FFT) and LoRA for Video Diffusion Models (VDMs). It discovers that both "preserve the singular spectrum and only rotate the singular subspaces," but exhibit conflicting routing directions on head clusters. Based on this, the authors propose CASA—a data-free "spectral arbitration by clustering" LoRA transfer method that allows LoRA trained on base models like Wan2.1 to be directly transferred to distilled variants like FastWan without requiring user data or retraining.
iTryOn: Mastering Interactive Video Virtual Try-On with Spatial-Semantic Guidance: iTryOn defines the "Interactive Video Virtual Try-On" task for the first time—enabling individuals in videos to actively manipulate garments (zipping, lifting corners, stretching) rather than just passive display. By resolving spatial ambiguity through 3D hand priors, strictly aligning timestamped action titles with corresponding frames using Action-aware RoPE (A-RoPE), and amplifying learning signals in sparse interaction frames via Action-aware Constraint Loss (AC Loss), it improves the ISR (Interaction Success Rate) on the self-built VVT-Interact from a baseline of 0.397 to 0.610 (+54%).
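
Two entries above report gains in different units: MIGA's score difference (97.82 vs 95.02) is an absolute difference in points, while iTryOn's "+54%" is a relative change from 0.397 to 0.610. The short Python check below reproduces both figures from the numbers quoted in the entries; the helper names are illustrative.

```python
def absolute_gain(new, old):
    """Difference in the metric's own units (points)."""
    return new - old


def relative_gain_pct(new, old):
    """Change expressed as a percentage of the old value."""
    return 100.0 * (new - old) / old


# Figures quoted in the MIGA and iTryOn entries above.
miga_points = absolute_gain(97.82, 95.02)
miga_relative = relative_gain_pct(97.82, 95.02)
itryon_relative = relative_gain_pct(0.610, 0.397)

print(round(miga_points, 2), round(miga_relative, 1), round(itryon_relative))
```

The MIGA gap is 2.8 points, which is about 2.9% in relative terms, so the two conventions should not be compared directly across entries.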

Browse all 32 Video Generation papers →

🧩 Multimodal VLM (89)

ACTIVE-o3: Empowering MLLMs with Active Perception via Pure Reinforcement Learning: ACTIVE-o3 delegates the decision of "where and how to look" to an MLLM for autonomous learning. Using pure reinforcement learning (GRPO), the model is trained to select, in parallel, up to 3 sub-regions most worth magnifying. A dual-form reward mechanism (task reward + heuristic reward) is employed to address the sparsity of pure task rewards. The method consistently outperforms baselines in small/dense object detection, remote sensing, autonomous driving, and interactive segmentation, while also improving general understanding benchmarks such as RealWorldQA and MME.
AgentHijack: Benchmarking Computer Use Agent Robustness to Common Environment Corruptions: This paper proposes AgentHijack, a benchmark that evaluates the robustness of computer-use agents using 9 categories of configurable everyday environment corruptions. Furthermore, it introduces DA-GRPO to strengthen grounding and an onlooker for behavioral summarization and environment checking, improving the average success rate of UI-TARS-1.5-7B from 18.74% to 22.89%.
Alterbute: Editing Intrinsic Attributes of Objects in Images: Alterbute utilizes VLMs to automatically mine Visual Named Entity (VNE) identity clusters and jointly conditions a diffusion model on identity references, attribute text, background, and masks. This approach provides a unified framework for editing object color, texture, material, and shape while preserving object identity and scene context.
Any3D-VLA: Enhancing VLA Robustness via Diverse Point Clouds: In a pilot study, the authors find that "explicitly lifting vision to point clouds and fusing them with 2D patches" is the most effective way to inject 3D information into VLA models. To address 3D data scarcity and domain gaps across point cloud sources (simulation, sensor, or monocular estimation), they propose Any3D-VLA, which uses hybrid point cloud training to learn source-agnostic geometric representations. In real-world grasping tasks it achieves a zero-shot improvement of 29.2 percentage points over the strongest baseline (62.5% vs 33.3%).
AOEPT: Breaking the Implicit Modality-Reduction Bottleneck in Modality-Missing Prompt Tuning: AOEPT points out that existing missing-modality prompt tuning compresses the inference scope of Multimodal Transformers into visible modality subspaces. It utilizes Modal-Contextualized Prompts (MCPs) distilled from the training set as a retrievable implicit information source for missing modalities, consistently outperforming existing methods across multiple datasets, missing rates, and backbones.
Are VLMs Seeing or Just Saying? Uncovering the Illusion of Visual Re-examination: This paper introduces VisualSwap and VS-Bench to test real visual re-examination capabilities by replacing the image after a VLM claims to "take another look." The study finds that current reasoning-heavy VLMs often follow the inertia of previous text, with only explicit multi-turn user instructions or enhanced visual attention significantly restoring grounding.
AVI-Bench: Toward Human-like Audio-Visual Intelligence of Omni-MLLMs: AVI-Bench is an audio-visual benchmark inspired by human cognition. It organizes the evaluation of Omni-MLLMs into three stages: "Perception → Understanding → Reasoning," supplemented by a "Primitive Sensation" (PriSe) extension. Using 14 tasks, 5,864 samples, and 9 metrics, it systematically diagnoses the Audio-Visual Intelligence (AVI) of 28 open-source/closed-source Omni-MLLMs and proposes a four-level AVI taxonomy.
Benchmarking and Enhancing VLM for Compressed Image Understanding: This paper constructs the first large-scale benchmark (11 codecs, 9 VLMs, 1M+ compressed images) to evaluate VLM understanding of compressed images. Performance degradation is decomposed into an irrecoverable "information gap" and a remediable "generalization gap." A lightweight conditional vision encoder adapter is proposed, which utilizes codec type and compression level as conditional embeddings + distillation training to improve VLM performance by 10%–30% across various encoders and bitrates.
Benchmarks for Vision-Language Models in Urban Perception Should Be Reliability-Aware and Negotiated: This paper proposes that VLM urban perception benchmarks should possess two key attributes: "reliability-aware" and "negotiated." By utilizing a benchmark comprising 100 Montreal street-view images, 12 community annotators, and 30 measurement dimensions, it reveals that model alignment is positively correlated with annotator consistency and that models exhibit systematic distributional biases compared to humans in subjective evaluation dimensions.
Beyond VLM-Based Rewards: Diffusion-Native Latent Reward Modeling: DiNa-LRM performs preference learning directly in the noisy latent space of diffusion models. Through a noise-calibrated Thurstone likelihood and inference-time multi-noise ensembles, it achieves preference prediction accuracy close to SOTA with significantly lower computational overhead than VLM-based reward models.

Browse all 89 Multimodal VLM papers →

🧠 VLM Reasoning (31)

3D-RFT: Reinforcement Fine-Tuning for Video-based 3D Scene Understanding: This work adapts the LLM-oriented "Reinforcement Learning with Verifiable Rewards (RLVR)" to video-driven 3D scene understanding. By using GRPO to fine-tune a 4B 3D-aware VLM directly with evaluation metrics (such as 3D IoU, F1, and accuracy) as rewards, the training objectives are aligned with evaluation criteria. Consequently, the 4B model outperforms an 8B baseline on 3D video detection, 3D visual grounding, and spatial reasoning tasks.
3ViewSense: Spatial and Mental Perspective Reasoning from Orthographic Views in Vision-Language Models: 3ViewSense argues that the bottleneck of VLM spatial reasoning is not insufficient visual features or weak linguistic reasoning, but the lack of a stable 3D intermediate representation. Consequently, it requires the model to first induce front, left, and top views from a single image before reasoning based on these orthographic views, significantly outperforming same-scale VLMs in occlusion counting and view-consistent spatial reasoning.
Active Exploring like a Pigeon: Reinforcing Spatial Reasoning via Agentic Vision-Language Models: This work transforms VLM spatial reasoning from a "passive observation" approach into an agentic workflow that actively selects views based on questions, updates a cognitive map, and verifies reasoning using executable spatial assertions. By fine-tuning Qwen2.5-VL-3B with dense rewards, it achieves 80.5% overall accuracy on MindCube-Tiny, specifically improving the Rotation subset to 85.0%.
Bad Seeing or Bad Thinking? Rewarding Perception for Vision-Language Reasoning: This paper enforces a split in VLM output into <recognition> perception blocks and <think> reasoning blocks. It introduces a perception reward \(R_P\) determined by whether a "blindfolded" text reasoning agent (which only sees the VLM's perception text without the image) can correctly answer the question, paired with Structured Verbal Verification (SVV) as an outcome reward \(R_O\). The resulting method, MoCA, uses \(R_P\) as a gate for modality-level credit assignment, enabling a 7B model to improve across 9 perception/reasoning/rich-modality benchmarks simultaneously, surpassing GPT-4o on multiple metrics.
Breaking Dual Bottlenecks: Evolving Unified Multimodal Models into Self-Adaptive Interleaved Visual Reasoners: To address the "understanding-generation gap" (capable of understanding but failing to generate) in unified multimodal models for anything-to-image (X2I) tasks, this paper proposes the Self-Adaptive Interleaved Reasoner. Using a hierarchical data synthesis pipeline, 50,000 samples are routed between three modes: direct generation, self-reflection, and multi-step planning. The model is trained via SFT + GRPO with step-wise reasoning rewards and intra-group complexity penalties, enabling Emu3.5 to outperform closed-source models like GPT-4o and Gemini 2.5 Flash on KRIS-Bench and OmniContext.
Decomposed On-Policy Distillation for Vision-Language Reasoning: Steering Gradients for Visual Grounding: The authors decompose the KL loss of multimodal on-policy distillation into "language prior" and "visual grounding" sub-objectives based on a Bayesian chain. They find that the gradients of these two are nearly orthogonal, and standard distillation merely takes a passive bisector. Consequently, they propose Visual Gradient Steering (VGS) to actively bias the update direction toward the visual subspace, achieving average gains of +2.37%/+1.56% across seven multimodal reasoning benchmarks for Qwen3-VL 8B→2B/4B.
Efficient Reasoning with Hidden Thinking: Heima distills each stage (summary / caption / reasoning) of lengthy Multimodal LLM (MLLM) Chains-of-Thought (CoT) into a single special thinking token. This allows the model to "think" in latent space, reducing the token count from the 100-200 range to 13-16 while achieving zero-shot accuracy more stable than LLaVA-CoT. An accompanying LLM "interpreter" is trained to reconstruct the textual reasoning chain from the thinking token's hidden states, empirically validating the information-theoretic upper bound of compression loss.
Find, Fix, Reason: Context Repair for Video Reasoning: Addressing the dilemma in video reasoning where "on-policy RL stagnates at capacity ceilings and off-policy distillation suffers from entropy collapse," this paper introduces a frozen, tool-integrated large teacher model. When a student's rollout fails, the teacher inserts minimal "evidence patches" (e.g., key-frame intervals, error types), enabling the student to re-attempt the same question. These repaired trajectories are then incorporated into GRPO optimization through a chosen-rollout mechanism.
From Correspondence to Actions: Human-Like Multi-Image Spatial Reasoning in Multi-modal Large Language Models: Inspired by human spatial cognition, HATCH designs two complementary training objectives for MLLMs: aligning cross-view patch features using geometric supervision (PaStA), and using reinforcement learning to force models to generate explicit "viewpoint change actions" before answering (ActoR). Using only a 3B base model, it achieves multi-image spatial reasoning performance comparable to GPT-5.2.
From Seeing to Thinking: Decoupling Perception and Reasoning Improves Post-Training of Vision-Language Models: This paper argues that current VLM post-training overemphasizes "long-chain reasoning" while neglecting perception bottlenecks. It explicitly decouples post-training into three independent stages: "Visual Perception \(\rightarrow\) Textual Reasoning \(\rightarrow\) Visual Reasoning," using RLVR (instead of caption SFT) to specifically refine perception. This approach improves Qwen3-VL-8B by approximately +5.9% and +1.2% on visual math and perception benchmarks, respectively, while shortening reasoning traces by 20.8%.
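
The Decomposed On-Policy Distillation entry above says that the "language prior" and "visual grounding" gradients are nearly orthogonal, that standard distillation follows their bisector, and that Visual Gradient Steering (VGS) biases the update toward the visual component. The toy Python sketch below illustrates only that geometric picture. The scalar weight `alpha`, the two-dimensional vectors, and the weighted-sum rule are assumptions for illustration, not the paper's actual steering rule.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def combine(g_lang, g_vis, alpha=1.0):
    """Weighted sum of the two sub-objective gradients.
    alpha = 1 is the plain sum; alpha > 1 (hypothetical) favors the visual term."""
    return [a + alpha * b for a, b in zip(g_lang, g_vis)]


# Toy orthogonal gradients of equal norm.
g_lang = [1.0, 0.0]
g_vis = [0.0, 1.0]

passive = combine(g_lang, g_vis, alpha=1.0)   # bisector of the two directions
steered = combine(g_lang, g_vis, alpha=3.0)   # tilted toward g_vis

print(cosine(passive, g_vis), cosine(steered, g_vis))
```

With equal-norm orthogonal gradients the plain sum is equally aligned with both components, while the steered update is more aligned with the visual one.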

Browse all 31 VLM Reasoning papers →

⚡ VLM Efficiency (4)

CLIP Tricks You: Training-free Token Pruning for Efficient Pixel Grounding in Large Vision-Language Models: This work identifies a counterintuitive "similarity reversal" phenomenon in CLIP, where visual tokens of referring regions exhibit the lowest similarity with [EOS] text tokens. Based on this observation, the authors propose LiteLVLM, a training-free, text-guided visual token pruning method. It retains 90.3% of the original pixel grounding performance even after discarding 66.7% of tokens, while achieving a 22% inference acceleration and 2.3× VRAM savings.
Gated Relational Alignment via Confidence-based Distillation for Efficient VLMs: This paper unifies Quantization-Aware Training (QAT) and Knowledge Distillation (KD) from an Information Bottleneck (IB) perspective, proposing the GRACE framework (Gated Decoupled Distillation + Relational Centered Kernel Alignment + Adaptive IB Controller). This enables INT4-quantized LLaVA / Qwen-VL to not only avoid performance degradation but outperform BF16 baselines across multiple benchmarks, while achieving 3× throughput and 54% memory savings.
Less Precise Can Be More Reliable: A Systematic Evaluation of Quantization's Impact on VLMs Beyond Accuracy: This study evaluates 16 quantization methods across 10 VLMs and multiple reliability metrics through 700,000 experiments. It finds that quantization is not a simple disruptor—it improves calibration, OOD detection, and noise robustness by suppressing high-rank low-variance spectral components, while simultaneously amplifying reliance on covariate shifts and spurious correlations.
On the Adversarial Robustness of Large Vision-Language Models under Visual Token Compression: This paper presents the first systematic study of the adversarial robustness of Large Vision-Language Models (LVLMs) under visual token compression. It identifies an "optimization-inference space mismatch" in existing encoder attacks and proposes the CAGE attack. By utilizing Expected Feature Distortion (EFD) and Ranking-Distortion Alignment (RDA), CAGE significantly reduces the robust accuracy of compressed LVLMs under conditions where the compression mechanism and token budget are unknown.

🎵 Audio & Speech (36)¶

A Semantically Consistent Dataset for Data-Efficient Query-Based Universal Sound Separation: This paper proposes Hive, a universal sound separation dataset constructed via single-event purification and semantically consistent mixing. Using approximately 2.4k hours of high-purity source audio, it enables AudioSep and FlowSep to approach or even exceed the performance of systems trained on million-hour datasets across multiple separation metrics.
Alethia: A Foundational Encoder for Voice Deepfakes: Alethia proposes a "bottleneck masked embedding prediction + flow-matching spectrogram generation" dual-branch pretraining paradigm to develop the first foundational encoder specifically for voice deepfake detection, localization, and attribution. It significantly outperforms general SFMs like Wav2vec2, HuBERT, and WavLM across 56 datasets in 5 task categories and exhibits strong zero-shot robustness against unseen singing voice deepfakes and real-world perturbations.
Algorithmic Recourse of In-Context Learning for Tabular Data: This paper presents the first systematic study of the algorithmic recourse problem in tabular in-context learning (ICL). It proves that dynamic decision rules induced by ICL can still yield well-defined recourse and proposes ASR-ICL, which uses adaptive subspace zeroth-order optimization to generate low-cost, sparse, and actionable counterfactual modifications for black-box ICL models.
An Exterior Method for Nonnegative Matrix Factorization: This paper proposes eNMF, which transforms NMF from "always staying inside the nonnegative orthant" to "approximating the nonnegative cone from the exterior of the rotation equivalence class of the unconstrained SVD optimal solution, followed by feasibility attainment and descent." It reaches lower reconstruction errors faster than 9 classes of NMF baselines on synthetic, text, audio, image, and recommendation data.
Attend to Anything: Foundation Model for Unified Human Attention Modeling: AAM unifies image, video, and audio-visual saliency prediction into a single attention foundation model featuring text conditioning, hyperbolic hierarchical constraints, and Fokker-Planck temporal dynamics. It consistently outperforms specialized models across 16 benchmarks and improves video inference speed to approximately 111 FPS.
Beyond Classification: A Cough Regression Benchmark for Respiratory Acoustic Foundation Models: Existing respiratory acoustic foundation models (FMs) have been evaluated almost exclusively on cough classification. This paper presents the first systematic evaluation of FMs on continuous regression tasks (passively estimating age, BMI, and disease probability from cough audio). Using a multi-model multi-target benchmark protocol consisting of 5 FMs × 6 targets × 3 datasets with frozen encoders and three types of regression heads, the study reveals findings obscured by classification-based evaluations, including the "data scale × head capacity" tradeoff, the advantages of generative pre-training, and strongly asymmetric cross-dataset transfer.
CMI-RewardBench: Evaluating Music Reward Models with Compositional Multimodal Instruction: Addressing the lack of unified evaluation for modern music generation models that simultaneously process "text + lyrics + reference audio," this paper establishes a complete ecosystem: 110k pseudo-labeled CMI-Pref-Pseudo, 4,027 human-labeled CMI-Pref, a unified CMI-RewardBench, and a family of ~30M parameter reward models (CMI-RM) capable of handling all modality combinations in a single architecture. The authors demonstrate high correlation with human judgment and enable "inference-time scaling" via top-k filtering.
Do Audio LLMs Listen or Read? Analyzing and Mitigating Paralinguistic Failures with VoxParadox: The authors construct VoxParadox, a benchmark of 2,000 Multiple Choice Questions (MCQs) designed with intentional contradictions between "what the text says" and "what the audio sounds like." They demonstrate that current Audio LLMs almost exclusively "read but do not listen" in paralinguistic tasks. By introducing PCLM, a lightweight module that adaptively mixes intermediate audio encoder features based on the prompt, combined with DPO, they improve Audio Flamingo 3's performance on VoxParadox from 17.40% to 65.20%.
Evaluating and Rewarding LALMs for Expressive Role-Play TTS via Mean Continuation Log-Probability: This paper formulates the "continuation probability of a pre-trained Large Audio Language Model (LALM) on ground-truth speech tokens" as an objective style consistency metric named MCLP. By employing a gated hybrid reward of MCLP+CER through GRPO on the newly constructed WenetSpeech-RP-TTS dataset, the subjective MOS of role-play TTS is improved from 1.86 to 3.58.
Few-Shot Synthetic Accented Speech for ASR Fine-Tuning: What Helps and When?: Fine-tuning ASR with accented speech synthesized via few-shot TTS, the authors decompose the question of "why it works." They find that the gains primarily stem from phoneme-level perturbation augmentation—random phoneme replacement captures most of the benefits, while LLM-generated "target accent phoneme editing" or even oracle ground-truth phonemes/prosody offer only marginal improvements. Furthermore, while synthetic data significantly reduces training variance when real data is scarce, a fixed quota of synthetic data eventually dilutes real data; the real-to-synthetic ratio itself is the critical factor.
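
The MCLP entry above describes scoring style consistency as the mean log-probability a pre-trained LALM assigns to ground-truth speech tokens. A minimal sketch of that averaging step follows, assuming the model's per-position logits are already available as plain lists; the function names and toy logits are hypothetical, and the gated MCLP+CER reward is not modeled here.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def mean_continuation_logprob(step_logits, target_ids):
    """Average log-probability of the ground-truth token at each position.

    step_logits: one logit vector per continuation position
                 (stand-in for the LALM output; an assumption of this sketch).
    target_ids:  ground-truth speech-token id at each position.
    """
    if len(step_logits) != len(target_ids):
        raise ValueError("need exactly one target id per position")
    total = 0.0
    for logits, tid in zip(step_logits, target_ids):
        total += log_softmax(logits)[tid]
    return total / len(target_ids)

# Toy example: 3 positions, vocabulary of 4 tokens.
toy_logits = [[2.0, 0.0, 0.0, 0.0],
              [0.0, 3.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 0.0]]
matched = mean_continuation_logprob(toy_logits, [0, 1, 2])     # targets sit on the peaks
mismatched = mean_continuation_logprob(toy_logits, [3, 3, 3])  # targets avoid the peaks
```

A continuation the model finds likely scores closer to zero (`matched`), while an unlikely one is more negative (`mismatched`).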

Browse all 36 Audio & Speech papers →

🔎 AIGC Detection (11)¶

AutoBaxBuilder: Bootstrapping Code Security Benchmarking: AUTOBAXBUILDER utilizes an LLM agent pipeline to automatically generate web backend security evaluation scenarios, functional tests, and end-to-end security tests. It reduces the cost of manually constructing BAXBENCH-style tasks by approximately 12x and constructs AUTOBAXBENCH, comprising 40 new scenarios, to evaluate the gap between functional correctness and security in contemporary code models.
Black-Box Detection of LLM-Generated Text Using Generalized Jensen-Shannon Divergence: SurpMark reformulates "AI text detection" as a likelihood-free hypothesis test: it uses a proxy LM to compute per-token surprisals, discretizes them into \(k\) states via k-means, estimates a first-order Markov transition matrix, and compares it with pre-built "human-written / machine-written" reference matrices using Generalized Jensen-Shannon Divergence (GJS). It provides black-box, zero-retraining, and zero-per-instance-resampling discriminant scores in a single forward pass.
CORE: Conflict-Oriented Reasoning for General Multimodal Manipulation Detection: This work redefines "multimodal fake news detection" as a task of "explicitly capturing conflicts between modalities or with world knowledge." The authors construct CAC, a corpus of 14k samples with fine-grained conflict annotations, and propose the CORE framework. CORE reshapes the conceptual boundaries of MLLMs through Conflict-Perception Training (CPT), enabling the model to significantly outperform dedicated SOTA methods on four datasets (DGM4, MDSM, MMFakeBench, NewsCLIPpings) using only 100–750 samples.
Deep Residual Injection for Full-Spectrum Forensic Signal Perception in Multimodal Large Language Models: This paper discovers that directly fine-tuning MLLMs to learn low-level artifacts left by generators damages their early-formed semantic representations (catastrophic forgetting). To address this, the authors propose Deep-VRM, which freezes the early and middle layers to preserve semantics while utilizing a LoRA-based bypass to "residually inject" artifact features into the deep layers of the LLM. This allows a single MLLM to achieve SOTA performance on most AIGI benchmarks without relying on any external expert detectors.
Dissect and Prune: Enhancing Robustness in AI-Generated Image Detection: Addressing the "prediction asymmetry" issue where existing AI-generated image (AIGI) detectors appear accurate but primarily classify images as real, this paper proposes DEAR. By using inpainted images as probes and "dissecting" the model based on the Regional Activation Discrepancy (RAD) between channel activations and generated areas, the method prunes extreme channels on both sides and retrains only the linear classification head. This forces the detector to discard fragile shortcut features, significantly enhancing robustness against unseen generators and post-processing.
Distributional Open-Ended Evaluation of LLM Cultural Value Alignment Based on Value Codebook: DOVE utilizes rate-distortion variational optimization to automatically construct a compact "Value Codebook" from 10,000 human texts. It then uses Unbalanced Optimal Transport (UOT) to measure distribution differences between human and LLM long-form texts in the value space, improving the "Evaluation-Downstream Task" correlation from \(\le 24\%\) in baselines to \(31.56\%\) across 12 LLMs.
Feature-Augmented Transformers for Robust AI-Text Detection Across Domains and Generators: This paper systematically exposes the vulnerability of AI text detectors under cross-dataset and cross-generator shifts using a "single-threshold fixed protocol." It proposes fusing hand-crafted linguistic features—weighted by learnable dynamic attention—with transformer [CLS] representations. Built on a DeBERTa-v3 backbone, the method achieves 85.9% balanced accuracy on the M4 multi-domain multi-generator benchmark, outperforming strong zero-shot baselines (Fast-DetectGPT, RADAR, Log-Rank) by up to +7.22.
ForensicConcept: Transferable Forensic Concepts for AIGI Detection: Addressing the issues where AI-Generated Image (AIGI) detectors are "highly accurate within the training distribution but fail on unseen generators" and remain entirely black-box, this paper explicitly extracts dispersed evidence relied upon by detectors into a "forensic concept codebook." It uses diffusion features (CleanDIFT) as external generative trace references and employs the neighborhood-structure consistency metric CKNNA to measure the geometric alignment between backbone evidence and diffusion traces. By injecting the diffusion codebook into a target backbone, cross-generator transfer is achieved; the average accuracy on GenImage reaches 92.0%, and higher CKNNA correlates with greater transfer gains.
Generating Robust Portfolios of Optimization Models using Large Language Models: This paper proposes a lightweight, training-free algorithm that utilizes a single LLM to act simultaneously as a "stochastic generator" and a "scoring evaluator." By packaging candidate optimization models into a portfolio until the cumulative generation probability reaches \(1-\alpha\), it theoretically proves that as long as either the generator or the evaluator aligns with human preferences, the portfolio will contain high-quality models. Experiments on NL4LP using GPT verify that the portfolio consistently outperforms random sampling even in the worst-case scenarios.
LLM Self-Recognition: Steering and Retrieving Activation Signatures: Instead of watermarking at the token level, this paper injects a random sparse steering vector into the LLM residual stream during generation, creating a detectable "activation signature." The signature is retrieved by re-feeding the text into the same model and calculating cosine similarity or using a lightweight classifier, achieving over 98% accuracy across multiple detection settings with negligible impact on text quality.
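
The SurpMark entry above chains three steps: discretize surprisals, estimate a first-order Markov transition matrix, and compare against reference matrices with GJS. The sketch below illustrates that chain under several assumptions that are not taken from the paper: the k-means centroids are supplied already fitted, GJS is taken to be the weighted form \(H(\alpha P + (1-\alpha)Q) - \alpha H(P) - (1-\alpha)H(Q)\), matrices are compared row by row and averaged, and all function names and toy data are hypothetical.

```python
import math

def discretize(surprisals, centroids):
    """Map each surprisal to the index of its nearest centroid
    (centroids stand in for a fitted k-means model)."""
    return [min(range(len(centroids)), key=lambda j: abs(s - centroids[j]))
            for s in surprisals]

def transition_matrix(states, k, smoothing=1.0):
    """Row-normalized first-order transition counts with additive smoothing."""
    counts = [[smoothing] * k for _ in range(k)]
    for a, b in zip(states, states[1:]):
        counts[a][b] += 1.0
    return [[c / sum(row) for c in row] for row in counts]

def entropy(p):
    return -sum(x * math.log(x) for x in p if x > 0)

def gjs(p, q, alpha=0.5):
    """Weighted Jensen-Shannon divergence between two distributions."""
    mix = [alpha * a + (1 - alpha) * b for a, b in zip(p, q)]
    return entropy(mix) - alpha * entropy(p) - (1 - alpha) * entropy(q)

def matrix_gjs(P, Q, alpha=0.5):
    """Average row-wise GJS (an assumed way to compare two matrices)."""
    return sum(gjs(p, q, alpha) for p, q in zip(P, Q)) / len(P)

def detection_score(surprisals, centroids, human_ref, machine_ref):
    """Positive when the text's transitions sit closer to the machine reference."""
    k = len(centroids)
    T = transition_matrix(discretize(surprisals, centroids), k)
    return matrix_gjs(T, human_ref) - matrix_gjs(T, machine_ref)

# Toy references: "human" states jump around, "machine" states stay low.
centroids = [1.0, 4.0, 8.0]
human_ref = transition_matrix([0, 2, 1, 2, 0, 1, 2, 0, 2, 1], 3)
machine_ref = transition_matrix([0, 0, 0, 1, 0, 0, 0, 0, 1, 0], 3)
sample = [0.8, 1.2, 0.9, 1.1, 3.8, 1.0, 0.7]  # mostly low-surprisal tokens
toy_score = detection_score(sample, centroids, human_ref, machine_ref)
```

In this toy setup the low-surprisal sample lands nearer the machine reference, so `toy_score` comes out positive; a real detector would calibrate a decision threshold on held-out data.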

Browse all 11 AIGC Detection papers →

🧊 3D Vision (30)¶

4DPC\(^2\)hat: Towards Dynamic Point Cloud Understanding with Failure-Aware Bootstrapping: 4DPC\(^2\)hat is the first Multimodal Large Language Model (MLLM) designed for "dynamic point cloud sequence" (4D point cloud) understanding. The authors first use a topologically consistent construction pipeline to transform 44,000 animation assets into a dataset of 200,000 cross-modal QA pairs. Then, they employ a spatio-temporal architecture using "preserved group tokens + global tokens + bidirectional Mamba" to avoid compressing a frame into a single vector. Finally, "failure-aware bootstrapping" is used to iteratively identify incorrect model responses and synthesize targeted QA for supplementary training, enabling action understanding and temporal reasoning that significantly outperform approaches that feed video frames to static 3D models.
Adaptive Volumetric Mechanical Property Fields Invariant to Resolution: AdaVoMP utilizes a "Sparse Adaptive Voxel Tree (SAV)" to simultaneously represent the input shape and output material field. A sparse Transformer encoder-decoder then autoregressively generates Young's modulus, Poisson's ratio, and density for each 3D object layer-by-layer. This approach scales the effective resolution of simulatable material fields from \(64^3\) to \(1024^3\) (16× per axis, i.e. \(16^3 = 4096\times\) more voxels) while outperforming previous SOTA models with lower test-time compute.
AvAtar: Learning to Align via Active Optimal Transport: This paper proposes AvAtar, an active alignment framework based on Optimal Transport (OT). It quantifies the influence of candidate queries on global alignment results through gradient propagation. By utilizing the adjoint state method and conjugate gradient method, it achieves efficient solutions with linear complexity. AvAtar consistently outperforms existing active learning strategies in network alignment and cross-domain alignment tasks.
Convex Distance Operator Transport: A Convex and Geometry-Preserving Formulation: This paper proposes CDOT (Convex Distance Operator Transport). By "operatorizing" the distance matrices and coupling of each metric space and replacing the non-convex squared pairwise distance difference in FGW with \(\|D_X T_\pi - T_\pi D_Y\|_{\mathrm{HS}}^2\), it achieves a framework for heterogeneous space alignment that is strictly convex with respect to the coupling \(\pi\), while remaining a valid pseudo-metric and possessing finite-sample risk bounds.
APEIRIA: Distilling Neuro-Symbolic Programs into 3D Multi-modal LLMs: This paper proposes APEIRIA, which distills the execution traces of neuro-symbolic 3D concept learners into natural language chain-of-thought (CoT) for 3D MLLMs. By employing GRPO reinforcement learning, it generalizes these reasoning patterns to open-vocabulary and deeply nested instructions. APEIRIA simultaneously outperforms traditional NS3D methods and current state-of-the-art 3D MLLMs on ScanRefer, Multi3DRefer, SQA3D, and Scan2Cap, while retaining the interpretability and modularity of symbolic systems.
DynaTok: Token-Based 4D Reconstruction from Partial Point Clouds: DynaTok encodes incomplete, unordered, and correspondence-free partial point clouds of each frame into a set of compact latent tokens. It aggregates complementary observations across frames using a spatio-temporal Transformer, decouples deformation using a unified latent space of "reference geometry + residual motion," and reconstructs time-consistent complete 4D point cloud sequences via a flow-matching decoder.
EPS3D: End-to-End Feed-Forward 3D Panoptic Segmentation: EPS3D is the first end-to-end feed-forward open-vocabulary 3D panoptic segmentation framework. It directly predicts unified 3D panoptic Gaussians with semantic and instance attributes from unposed multi-view images in a single forward pass. By distilling 2D foundation models for supervision, it bypasses the need for 3D annotations. It introduces a semantic-instance mutual enhancement module for reciprocal calibration, achieving approximately 13% higher semantic mIoU than SOTA on Replica with an inference time of only 1 second per scene.
Fast-SAM3D: 3Dfy Anything in Images but Faster: To address the slow inference speed of the SAM3D single-view 3D reconstruction model, this paper provides the first module-level latency profiling. Identifying performance bottlenecks caused by three types of heterogeneity (shape/layout dynamics, texture sparsity, and geometric spectral differences), the authors propose Fast-SAM3D. This training-free framework utilizes modality-aware step caching, spatiotemporal token carving, and spectral-aware token aggregation to achieve a 2.67× speedup at the object level with negligible quality loss, even slightly improving the reconstruction F-Score from 92.34 to 92.59.
FoundObj: Self-supervised Foundation Models as Rewards for Label-free 3D Object Segmentation: This paper proposes FoundObj, which utilizes 2D/3D self-supervised foundation models (DINOv2 + TRELLIS) as rewarders. By employing a "superpoint merging + PPO" RL agent, it achieves multi-class 3D object segmentation in complex indoor scenes without any scene-level human annotations, improving the unsupervised SOTA AP from 19.6 to 24.2 on ScanNet/S3DIS/ScanNet200.
FSI2P: A Hierarchical Focus–Sweep Registration Network with Dynamically Allocated Depth: This paper abstracts the human observation process of "glancing first, then examining block-by-block" into a two-stage Focus-Sweep paradigm. It replaces Transformer with Mamba for image-to-point cloud interaction and utilizes reinforcement learning to dynamically determine the number of interaction layers at each scale, achieving SOTA performance in I2P registration on RGB-D Scenes V2 and 7-Scenes.
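
The CDOT entry above replaces the FGW distortion term with \(\|D_X T_\pi - T_\pi D_Y\|_{\mathrm{HS}}^2\). The sketch below evaluates that quantity on a toy problem, assuming (this is not stated in the summary) that \(T_\pi\) can be represented by the coupling matrix \(\pi\) itself and that the Hilbert-Schmidt norm reduces to the Frobenius norm for finite matrices. All names and the toy points are hypothetical.

```python
def matmul(A, B):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def cdot_objective(D_x, pi, D_y):
    """Squared Frobenius norm of D_x @ pi - pi @ D_y."""
    left, right = matmul(D_x, pi), matmul(pi, D_y)
    return sum((l - r) ** 2
               for l_row, r_row in zip(left, right)
               for l, r in zip(l_row, r_row))

def pairwise_abs(points):
    """Distance matrix of 1-D points."""
    return [[abs(p - q) for q in points] for p in points]

# Y is a reordering of X, so a distance-preserving matching exists.
xs = [0.0, 1.0, 3.0]
ys = [3.0, 0.0, 1.0]
D_x, D_y = pairwise_abs(xs), pairwise_abs(ys)
n = len(xs)

matching = [[1.0 / n if x == y else 0.0 for y in ys] for x in xs]  # isometric coupling
uniform = [[1.0 / (n * n)] * n for _ in range(n)]                  # independent coupling
midpoint = [[0.5 * (a + b) for a, b in zip(ra, rb)]
            for ra, rb in zip(matching, uniform)]

f_match = cdot_objective(D_x, matching, D_y)
f_unif = cdot_objective(D_x, uniform, D_y)
f_mid = cdot_objective(D_x, midpoint, D_y)
```

Because the map from \(\pi\) to \(D_X \pi - \pi D_Y\) is linear, the squared norm is a convex function of \(\pi\); the toy values reflect this, with `f_mid` no larger than the average of `f_match` and `f_unif`, and `f_match` vanishing for the isometric coupling.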

Browse all 30 3D Vision papers →

🎯 Object Detection (6)¶

Adversarially Robust Approximate Furthest Neighbor: This theoretical paper provides the first approximate furthest neighbor data structure resistant to adaptive query adversaries. While maintaining a query complexity with \(n\)-dependence similar to Indyk's classical oblivious algorithm, it demonstrates that traditional random projection furthest neighbor algorithms can be broken by adaptive queries.
EARL: Towards a Unified Analysis-Guided Reinforcement Learning Framework for Egocentric Interaction Reasoning and Pixel Grounding: EARL utilizes a two-stage MLLM framework of "coarse interpretation and fine response" to consolidate egocentric interaction reasoning tasks (description + Q&A + pixel mask) into a unified pipeline. The first stage outputs a global interaction description of the full image and treats the last hidden state as a semantic prior. This is injected into the second stage through a novel Analysis-guided Feature Synthesizer (AFS). The system is jointly trained via GRPO with a triple-reward mechanism (format/answer/grounding accuracy), outperforming Seg-Zero by 8.37% cIoU on Ego-IRGBench.
FOCUS: Forcing In-Context Object Localization through Visual Support Constraints and Policy Optimization: FOCUS uses a two-stage training approach (complete removal of category names, attention mask optimization, and a GRPO IoU reward) to force VLMs to perform in-context object localization based on visual support examples rather than semantic priors. The 7B-parameter model outperforms 72B models, suggesting that task-aligned inductive bias can matter more than pure scaling.
Mixture Prototype Flow Matching for Open-Set Supervised Anomaly Detection: MPFM replaces the traditional "unimodal Gaussian prototypes" in OSAD with a learnable Gaussian Mixture Model (GMM) prototype space. It uses flow matching to directly regress a velocity field in GMM form, augmented by a mutual information maximization regularization to prevent prototype collapse. The method outperforms all SOTA methods, including DRA, AHL, and DPDL, across 9 industrial and medical AD datasets under the 10/1 anomaly sample setting.
OmniVerifier-M1: Multimodal Meta-Verifier with Explicit Structured Recalibration: Multimodal visual verifiers output binary signals (True/False) that are too coarse, and their textual explanations are prone to reward hacking. To address both issues, this paper proposes OmniVerifier-M1, which uses symbolic outputs such as bounding boxes as meta-verification rationales instead of text, supporting rule-based rewards like IoU. It shows, theoretically and experimentally, that decoupling binary judgment and meta-verification into two independent reward streams (rather than a multiplicative joint reward) significantly improves SNR. Finally, the verifier is upgraded to an agentic system, M1-TTS, capable of driving region-level self-recalibration.
Testing the Test: Score-Direction Instability in Class-Split Anomaly Detection: The authors point out that "class-split" anomaly detection benchmarks are ill-posed when the anomaly class and the normal mixture distribution overlap in the representation space—AUROC collapses to random or even reverses, with the direction depending on the unknown anomaly class. A training-free "neighborhood class leakage" metric \(L_k\) is proposed to diagnose such benchmark failure before evaluation.
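
The "Testing the Test" entry above notes that AUROC can fall to chance or even reverse depending on the anomaly class. As a small illustration of why score direction matters, the sketch below computes AUROC from its pairwise-ranking definition and shows that negating the scores maps AUROC to one minus itself. The toy scores are invented, and the paper's \(L_k\) leakage metric is not reproduced here.

```python
def auroc(scores, labels):
    """Probability that a random anomaly (label 1) outscores a random
    normal sample (label 0); ties count as half a win."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# Toy case where the anomalies happen to receive LOW anomaly scores.
labels = [0, 0, 0, 1, 1]
scores = [0.9, 0.8, 0.4, 0.3, 0.5]

forward = auroc(scores, labels)                # below 0.5: direction is reversed
flipped = auroc([-s for s in scores], labels)  # negated scores: 1 - forward
```

A detector whose AUROC is far below 0.5 is still separating the classes, only with the wrong sign, which is why an unknown anomaly class can make a benchmark number hard to interpret.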

✂️ Segmentation (14)¶

Activation-Free Backbones for Image Recognition: Polynomial Alternatives within MetaFormer-Style Vision Models: This paper constructs PolyMLP, PolyConv, and PolyAttn using Hadamard products to replace pointwise activations/softmax in MLP, convolution, and attention. Without conventional activation functions, these modules allow MetaFormer-style backbones to reach or exceed the performance of activation-based models on ImageNet, robustness benchmarks, and ADE20K segmentation.
Beyond Detection: A Structure-Aware Framework for Scene Text Tracking: The authors propose SymTrack, a detection-free dual-branch scene text tracking framework. It addresses feature bottlenecks caused by perspective distortion through Predictive Token Rectification (PTR), eliminates high visual ambiguity among text instances using Cross-Expert Calibration (CEC), and stabilizes fine-grained localization with an Adaptive Inference Engine (AIE). It significantly surpasses SOTA on three benchmarks (up to +12.32% AUC).
FlowSeg: Dynamic Semantic Guidance for LLM-Conditioned Segmentation: This paper points out that current query-based LLM-conditioned segmentation follows a "propose-then-select" paradigm—candidate masks are often accurate enough, but errors occur due to incorrect selection. To address this, FlowSeg is proposed, where LLM conditional embeddings participate in query refinement at every decoder layer and are continuously updated by new visual evidence. Combined with a lightweight boundary refinement module, it achieves consistent performance gains on RefCOCO/+/g and ReasonSeg.
Functional Attention: From Pairwise Affinities to Functional Correspondences: This paper reinterprets softmax attention in Transformers as a "least-squares linear operator between two learned functional bases." Borrowing the idea of functional maps from shape matching, it compresses the \(n \times n\) pairwise affinity matrix into a \(k \times k\) compact spectral operator, achieving SOTA performance in PDE solving, 3D point cloud segmentation, and OOD generalization simultaneously.
Geometry-Preserving Unsupervised Alignment for Heterogeneous Foundation Models: GPUA treats VLMs like CLIP (rich semantics, insufficient local precision) and VFMs like DINOv3 (fine-grained detail, lacking semantics) as two "visual languages." It uses Optimal Transport to mine soft correspondences and solves the Orthogonal Procrustes problem to learn a geometry-preserving linear mapping that translates VFM features into the VLM space. This process is entirely unsupervised, requires no updates to pre-trained parameters, and achieves an average 11.8% improvement in zero-shot classification.
LightAVSeg: Lightweight Audio-Visual Segmentation: LightAVSeg decouples "semantic filtering (what)" and "spatial localization (where)" by replacing \(\mathcal{O}(N^2)\) cross-modal attention with global channel modulation. This allows the AVS model to achieve 50.4 mIoU (MS3) with only 20.5M parameters and reach an on-device latency of 163.4 ms on Snapdragon 8 Elite, which is approximately \(8\times\) faster than AVSegFormer-R50.
MVR-cache: Optimizing Semantic Caching via Multi-Vector Retrieval and Learned Prompt Segmentation: MVR-cache upgrades the similarity metric for LLM semantic caching from "single-vector cosine" to "multi-vector MaxSim after learned segmentation." By training a lightweight segmentation model via REINFORCE, it boosts cache hit rates by up to 37% while maintaining the same error rate upper bound \(\delta\).
Refining Context-Entangled Content Segmentation via Curriculum Selection and Anti-Curriculum Promotion: CurriSeg keeps the segmentation network architecture unchanged and modifies only the training schedule: it first pushes the model to a stable state using a robust curriculum based on "temporal loss statistics + pixel entropy weighting," and then performs anti-curriculum "spectral blindness" fine-tuning (removing high frequencies to force the model to capture structural semantics). This approach consistently improves FEDER / FSEL / RUN by 2–4% on camouflaged/polyp segmentation benchmarks such as CHAMELEON / CAMO / COD10K / NC4K with zero additional parameters and shorter training time.
Segment Anything with Robust Uncertainty-Accuracy Correlation: Addressing the issue that the SAM series only outputs a single mask-level confidence score and suffers from "Mask-level Confidence Confusion" under domain shift, this paper equips SAM2 with a Weibull dual-granularity Bayesian mask decoder for pixel-level epistemic estimation. It incorporates a synergistic style + deformation adversarial perturbation and calibration loss inspired by human vision, ensuring uncertainty remains aligned with errors across 23 zero-shot target domains, achieving an average J&F of 79.87 with significantly more reliable uncertainty maps.
SPROUT: Supervise Less, See More — Training-free Nuclear Instance Segmentation with Prototype-Guided Prompting: SPROUT is the first fully training-free, zero-annotation framework for pathological nuclei segmentation. It utilizes H&E staining priors to self-construct high-confidence foreground/background regions on each slide → extracts prototypes → performs feature-prototype soft alignment via Partial Optimal Transport (POT) → outputs positive/negative point prompts for SAM. On benchmarks like MoNuSeg, its AJI is 8.2% higher than training-based methods.
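
The MVR-cache entry above moves semantic caching from single-vector cosine to multi-vector MaxSim. A minimal sketch of a MaxSim cache lookup follows, assuming prompts have already been segmented and embedded (the learned segmentation model is out of scope), that MaxSim is averaged over query segments, and that a fixed similarity threshold decides hits. The cache contents, vectors, and threshold are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def maxsim(query_vecs, cached_vecs):
    """For each query segment take its best-matching cached segment,
    then average those best similarities over the query segments."""
    return sum(max(cosine(q, c) for c in cached_vecs)
               for q in query_vecs) / len(query_vecs)

def lookup(query_vecs, cache, threshold):
    """Return the key of the best cached prompt, or None on a cache miss."""
    best_key, best_score = None, -1.0
    for key, vecs in cache.items():
        score = maxsim(query_vecs, vecs)
        if score > best_score:
            best_key, best_score = key, score
    return best_key if best_score >= threshold else None

# Toy cache: each entry holds one embedding per prompt segment.
cache = {
    "weather": [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
    "recipe": [[0.0, 0.0, 1.0], [0.0, 1.0, 1.0]],
}
hit = lookup([[0.9, 0.1, 0.0], [0.0, 1.0, 0.1]], cache, threshold=0.9)
miss = lookup([[1.0, 1.0, 1.0]], cache, threshold=0.9)
```

Here `hit` resolves to the "weather" entry because both query segments align closely with its segments, while `miss` is `None` because no cached prompt clears the threshold.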

Browse all 14 Segmentation papers →

🖼️ Image Restoration (21)¶

AnyMod-LLVE: Low-Light Video Enhancement with Modality-Agnostic Inference: Addressing the issue where "multi-modal low-light video enhancement collapses when event streams or infrared auxiliary modalities are unavailable during inference," AMNet utilizes a Spatial-Spectral Dual-Gated (S2DG) Translator to generate implicit representations of auxiliary modalities from degraded low-light RGB inputs. Combined with large-scale synthetic multi-modal pre-training, this allows stable enhancement regardless of modality availability during testing—achieving SOTA with RGB-only inference, with further gains when auxiliary modalities are provided.
Coevolutionary Continuous Discrete Diffusion: Make Your Diffusion Language Model a Latent Reasoner: This paper systematically compares continuous diffusion, discrete masked diffusion, and looped transformers across the dimensions of expressivity and trainability. It proves that "continuous diffusion" is strictly more expressive than discrete diffusion and can simulate looped transformers, but its practical performance is limited by decoding and representation space. Consequently, the paper proposes CCDD (Coevolutionary Continuous Discrete Diffusion)—diffusion performed simultaneously on the discrete token space and the contextual embedding space of a pre-trained LLM, with a single model for joint denoising. CCDD reduces perplexity by 25-35% compared to MDLM on LM1B/OWT and outperforms MDLM with 256 steps using only 8 sampling steps.
Coloring the Noise: Adversarial Sobolev Alignment for Faithful Image Super Resolution: ASASR improves the trade-off between perceptual quality and structural fidelity in super-resolution. It replaces the isotropic Gaussian noise prior of Flow Matching with Sobolev spectrally colored noise and uses adversarial manifold guidance to generate hard negative samples, which together form the AS-DPO framework.
Consistent Diffusion Language Models: This paper points out that discrete diffusion lacks a counterpart to the continuous-domain probability-flow ODE, making it impossible to directly construct consistency models. The authors propose using the exact closed-form posterior bridge as a "stochastic PF-ODE surrogate" in the discrete domain to construct the Multi-Path Discrete Consistency (MPDC) training objective. This requires the denoiser's predictions across multiple stochastic bridge paths to be consistent in expectation. This enables the single-stage, teacher-free training of Consistent Diffusion Language Models (CDLM) capable of generating high-quality text in 2-3 steps, achieving SOTA in unconditional/conditional text generation and up to \(32\times\) speedup over AR models.
DAPD: Dependency-Aware Parallel Decoding via Attention for Diffusion LLMs: DAPD transforms the single-step parallel unmasking problem of dLLMs into a dynamic graph coloring problem of "selecting independent sets on self-attention-induced MRFs." Without any training, it simultaneously unmasks weakly dependent positions, reducing decoding steps to 1/3.87 of the original on LLaDA / Dream for multi-question mixed prompts with almost no loss in accuracy.
Degradation-Aware Metric Prompting for Hyperspectral Image Restoration: DAMP utilizes 6 interpretable spatial-spectral physical metrics (high-frequency energy ratio, texture uniformity, spectral curvature, etc.) as "Degradation Prompts" (DP) to replace black-box embeddings and explicit degradation labels. These DPs act as gating signals driving a Spatial-Spectral Adaptive MoE to select different "spatial/spectral experts," achieving SOTA performance across 5 HSI restoration tasks and 2 unseen degradations (motion blur, Poisson noise) simultaneously.
DyLLM: Efficient Diffusion LLM Inference via Saliency-based Token Selection and Partial Attention: DyLLM is a training-free inference acceleration framework for diffusion LLMs. It identifies "salient tokens" by measuring the cosine similarity of attention contexts between adjacent denoising steps. By recalculating FFN and attention only for these tokens using salient-aware approximate attention, it increases throughput to 7.6× / 9.6× on LLaDA / Dream with negligible performance loss.
Early Decisions Matter: Proximity Bias and Initial Trajectory Shaping in Non-Autoregressive Diffusion Language Models: This paper systematically characterizes the failure mechanism of masked diffusion language models (dLLM) under fully non-autoregressive (NAR) decoding. It identifies that proximity bias causes confidence-based sampling to degenerate into reverse autoregression, which is prematurely saturated by EOS tokens. By using a 5M-parameter lightweight planner and EOS temperature annealing to intervene in unmasking positions only at the first step, the authors improve LLaDA 8B NAR decoding by 2.8–4.3 points on reasoning tasks like GSM8K with almost no additional overhead.
From 2D Grids to 1D Tokens: Reforming Shared Representations for Multimodal Image Fusion: Multimodal image fusion has long relied on shared representations in 2D feature grids, leading to the entanglement of global appearance (brightness/contrast/tone) and local details, making them difficult to regulate independently. This paper moves "global appearance" into the compact token space of a frozen 1D tokenizer (TiTok-32). By employing "Selective Token Editing (STE)" to modify only a few token-channel entries, the method regulates global consistency while preserving a 2D pathway for detail recovery, achieving comprehensive SOTA results across four benchmarks.
Learning Normalized Energy Models for Linear Inverse Problems: The authors reformulate "linear inverse problems" as "anisotropic denoising" and propose Anisotropic Covariance Score Matching (A-CSM) to train a normalized energy model \(U_\theta(\mathbf{y},\boldsymbol{\Sigma})\approx -\log p(\mathbf{y}|\boldsymbol{\Sigma})\). A single model can handle inpainting, deblurring, and super-resolution while unlocking three new capabilities: energy-guided adaptive scheduling, MALA unbiased correction, and blind inverse estimation.
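
The DAPD entry above frames parallel unmasking as selecting an independent set on a graph induced by self-attention, so that only weakly dependent positions are unmasked together. The sketch below is one plausible greedy reading of that idea, not the paper's algorithm: two masked positions are treated as dependent when their attention weight in either direction reaches a hypothetical threshold `tau`, and positions are visited in order of decreasing confidence.

```python
def select_parallel_unmask(confidence, attention, masked, tau):
    """Greedily pick masked positions that are pairwise weakly dependent.

    confidence: per-position confidence of the current prediction.
    attention:  square matrix, attention[i][j] = weight from position i to j.
    masked:     indices that are still masked.
    tau:        dependence threshold (hypothetical hyperparameter).
    """
    chosen = []
    for i in sorted(masked, key=lambda p: -confidence[p]):
        # Keep i only if it has no strong attention edge to a chosen position.
        if all(max(attention[i][j], attention[j][i]) < tau for j in chosen):
            chosen.append(i)
    return sorted(chosen)

# Toy case: positions 0 and 1 attend strongly to each other, as do 2 and 3.
attention = [
    [0.00, 0.60, 0.05, 0.05],
    [0.60, 0.00, 0.05, 0.05],
    [0.05, 0.05, 0.00, 0.50],
    [0.05, 0.05, 0.50, 0.00],
]
confidence = [0.9, 0.8, 0.7, 0.6]
step = select_parallel_unmask(confidence, attention, masked=[0, 1, 2, 3], tau=0.2)
```

In this toy case one position from each strongly coupled pair is unmasked in the same step (`step` is `[0, 2]`), and the remaining two would be handled in a later step.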

Browse all 21 Image Restoration papers →

🛰️ Remote Sensing (3)

Any2Any: Unified Arbitrary Modality Translation for Remote Sensing: Any2Any transforms remote sensing (RS) translation between RGB, SAR, NIR, MS, and PAN from a collection of paired models into a unified latent diffusion model within a shared latent space. By utilizing the million-level RST-1M dataset and target modality residual adapters, it achieves superior fidelity and generalization across 14 seen translation directions and multiple unseen modality combinations.
Localized, High-resolution Geographic Representations with Slepian Functions: This paper constructs a geographic positional encoder that concentrates representation capacity on a Region of Interest (ROI) using spherical Slepian functions. It proposes a Slepian-Spherical Harmonic (SH) hybrid encoding to simultaneously capture local high-resolution details and global coarse-grained context. It consistently outperforms mainstream baselines such as SH, Wavelets, and RFF across five classification, regression, and image-enhancement prediction tasks.
The Perception-Physics Paradox: Probing Scientific Alignment with TC-Bench: The authors observe that Vision Foundation Models (VFMs) "appear" to predict satellite imagery well but collapse along physical axes in extreme regimes. By formalizing "scientific alignment" as "structural isomorphism," they release TC-Bench—a global tropical cyclone benchmark—and a three-tier linear probing suite (Static/Dynamic/Constraint) to reveal representation collapse in frozen backbones like DINO, CLIP, SigLIP, and MAE for intense cyclones where \(P_c<980\) hPa.

🧑 Human Understanding (5)

DiscoForcing: A Unified Framework for Real-Time Audio-Driven Character Control with Diffusion Forcing: DiscoForcing reformulates the "music \(\to\) full-body dance" offline generation problem into a strictly causal, bounded-latency streaming task. It utilizes a VQ-PAE causal music encoder, latent-space Diffusion Forcing, hybrid temporal noise scheduling, and Temporal Guidance sampling to translate music streams into 30 FPS full-body motions that directly drive Unity avatars and Unitree G1 humanoid robots in real-time.
Efficient, Validation-Free Intrinsic Quality Estimation for Large-Scale Face Recognition Datasets: This paper proposes Intrinsic Quality (IQ): after extracting embeddings with a proxy model, it computes a weighted fusion of "Neighborhood Label Consistency (Consis)" and "Normalized Spectral Entropy Effective Rank \(\tilde{r}_{\mathrm{ent}}\)". The result is a "trainability" score for million-scale face recognition datasets that requires neither full training nor a clean validation set. On WebFace4/12/42M and noise-injected settings, the ranking consistency with downstream MFR-ALL validation accuracy reaches Spearman = 1.0.
Learning Instance-Adaptive Low-Rank Orthogonal Subspaces for Clothes-Changing Person Re-Identification: The "clothing" semantic concept is explicitly modeled as an instance-adaptive low-rank subspace (initialized using the SVD principal components of CLIP text descriptions and refined via cross-attention with image patches). Identity features are then forced to be strictly orthogonal to this subspace through geometric constraints, achieving SOTA results in clothes-changing re-identification (PRCC +5.9% Rank-1) without the need for adversarial training.
MotionGRPO: Overcoming Low Intra-Group Diversity in GRPO-Based Egocentric Motion Recovery: MotionGRPO reformulates first-person full-body motion recovery from head-mounted devices as a Markov Decision Process (MDP) over diffusion sampling. It utilizes Group Relative Policy Optimization (GRPO) post-training with a hybrid reward system consisting of a "trajectory condition-aware perception model + 4 joint-level sub-rewards." Crucially, it identifies that strong input conditions lead to nearly identical intra-group samples, causing advantage variance to vanish—a fatal bottleneck. To resolve this, it injects Perlin noise into the conditioning signal to restore intra-group diversity, reducing MPJPE from EgoAllo's 124.985 mm to 114.207 mm on AMASS/RICH.
WaveVerse: Scalable RF Simulation in Generative 4D Worlds: WaveVerse integrates LLM-driven "4D indoor scene + human motion" generation with a physical ray tracer that preserves spatiotemporal phase coherence into a prompt-to-RF signal pipeline. It significantly enhances downstream RF imaging and activity recognition tasks using synthetic data, with performance scaling continuously as simulation volume increases, unlike existing methods that saturate.
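
The bottleneck named in the MotionGRPO entry above, vanishing advantages when all samples in a group are nearly identical, follows directly from how group-relative advantages are commonly normalized. The sketch below assumes the usual "subtract the group mean, divide by the group standard deviation" form; the use of the population standard deviation and the `eps` value are choices made here, not details from the paper.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampling group (GRPO-style).

    Each advantage is (reward - group mean) / (group std + eps).
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# A group with diverse samples yields a usable learning signal.
diverse = group_relative_advantages([0.2, 0.5, 0.8])

# A group of identical samples yields all-zero advantages, so no gradient.
collapsed = group_relative_advantages([0.5, 0.5, 0.5])

print(diverse)
print(collapsed)
```

This is why restoring intra-group diversity, which the entry says is done by perturbing the conditioning signal, matters for the post-training stage.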

📹 Video Understanding (17)

AVTrack: Audio-Visual Tracking in Human-centric Complex Scenes: This paper proposes the AVTrack dataset and the AVTracker baseline method to address the Audio-Visual Instance Segmentation and tracking (AVIS) task in complex human-centric scenes. It defines eight challenging conditions to construct a rigorous evaluation benchmark and designs a three-stage divide-and-conquer framework (ASR segmented aggregation → local speaker localization → global identity association), which outperforms existing state-of-the-art methods by approximately 8 percentage points on the HOTA metric.
Foresee-to-Ground: From Predictive Temporal Perception to Evidence-Driven Reasoning: Foresee-to-Ground (F2G) reformulates Video Temporal Grounding (VTG) from direct timestamp regression into an "identify-then-measure" two-stage problem. By utilizing predictive temporal perception and a span evidence encoder to build a candidate event evidence pool, the LLM generates precise boundaries constrained by selected events. This approach improves [email protected] by 4.1 points on Charades-STA and 6.7 points on ActivityNet.
MetaphorVU: Towards Metaphorical Video Understanding: This paper proposes the first metaphorical video understanding benchmark, MetaphorVU-Bench (860 videos + 8-category metaphor taxonomy), and an enhancement method, MetaphorBoost. By utilizing a metaphor knowledge graph with 54K nodes and 200K edges as an external cognitive scaffold, the study quantitatively reveals that the core bottleneck for MLLMs in metaphorical video understanding is the "lack of cross-domain mapping" rather than visual recognition errors. The optimal model still lags behind humans (83.4) by 17 points.
OmniSIFT: Modality-Asymmetric Token Compression for Efficient Omni-modal Large Language Models: This paper argues that existing Omni-LLM token compression methods are suboptimal due to their "symmetric" treatment of audio and video. It proposes OmniSIFT—a two-stage asymmetric compression framework that first prunes video redundancy via spatio-temporal saliency to obtain "visual anchors," which then guide audio selection. With only 4.85M additional parameters, it consistently outperforms existing baselines and even the original model on Qwen2.5-Omni-7B while retaining only 25% of tokens.
Privacy-Aware Video Anomaly Detection through Orthogonal Subspace Projection: The authors propose OPL (Orthogonal Projection Layer) and an enhanced version G-OPL, which utilize a learnable orthogonal subspace derived from QR decomposition to explicitly project out "task-irrelevant variables" and "facial privacy components" within the video anomaly detection feature space. They introduce four privacy-aware metrics (SSC/ARD/PD/FPD), demonstrating that face prediction accuracy by linear SVM probes significantly decreases while maintaining or even improving VAD AUC.
ProAct-VL: A Proactive VideoLLM for Real-Time AI Companions: ProAct-VL enables VideoLLMs to autonomously decide when to respond and generate short-segment commentary under streaming input via a chunk-level I/O paradigm, a lightweight FLAG decision head, and transition-aware loss functions. It achieves ~1s low latency and strong proactivity—obtaining a TimeDiff of only 1.20s and a trigger F1 of 63.25% in game commentary tasks, significantly outperforming offline models like GPT-4o.
RELO: Reinforcement Learning to Localize for Visual Object Tracking: RELO reformulates the "where is the target" problem in single object tracking as an MDP on a spatial feature map. It treats each spatial position as an action and replaces traditional manual center heatmap supervision with actor-critic training on direct IoU/AUC rewards. Coupled with two stabilization designs, "regression warmup" and "layer-aligned temporal token propagation," it achieves SOTA with 57.5% AUC on LaSOT\(_{\text{ext}}\).
Return of Frustratingly Easy Unsupervised Video Domain Adaptation: This paper proposes MetaTrans—a "frustratingly easy" Unsupervised Video Domain Adaptation (UVDA) method. It decouples spatial and temporal domain gaps through spatio-temporal feature subtraction in a dual-stream Transformer. By using only two basic losses (supervised + domain adversarial), it outperforms complex SOTA methods and reduces hyperparameter search costs from exponential to linear.
Revisiting Uncertainty: On Evidential Learning for Partially Relevant Video Retrieval: To address query ambiguity and temporal sparse supervision caused by "short queries vs. long videos" in Partially Relevant Video Retrieval (PRVR), this paper proposes Holmes, a hierarchical evidential learning framework based on the Dirichlet distribution. It distinguishes precise, polysemous, and under-determined queries using a three-fold principle at the inter-video level for adaptive label calibration, and achieves dense alignment at the intra-video level via flexible optimal transport with a dustbin. Holmes achieves SOTA on ActivityNet, Charades, and TVR datasets.
SkelHCC: A Hyperbolic CLIP-Driven Cache Adaptation Framework for Skeleton-based One-Shot Action Recognition: SkelHCC maps CLIP to hyperbolic space to explicitly align skeleton-language representations across three granularities: "Joint → Body Part → Full Body." It utilizes LLM-generated body part importance masks for training-free multi-granularity voting cache inference, achieving a 9% improvement over the previous SOTA on NTU120 one-shot action recognition with only 0.5M trainable parameters.
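
The core operation in the Privacy-Aware Video Anomaly Detection entry above, projecting unwanted components out of a feature vector, can be sketched as \(x - QQ^\top x\) for an orthonormal basis \(Q\) of the subspace to remove. In the sketch below the basis comes from plain Gram-Schmidt, which spans the same subspace as the Q factor of a QR decomposition. The fixed "nuisance directions" are hypothetical stand-ins; in the paper the subspace is learnable.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def orthonormalize(vectors, tol=1e-12):
    """Gram-Schmidt: orthonormal basis for the span of `vectors`."""
    basis = []
    for v in vectors:
        w = list(v)
        for q in basis:
            coeff = dot(w, q)
            w = [wi - coeff * qi for wi, qi in zip(w, q)]
        norm = dot(w, w) ** 0.5
        if norm > tol:  # skip directions already covered by the basis
            basis.append([wi / norm for wi in w])
    return basis

def project_out(feature, basis):
    """Remove the part of `feature` that lies in span(basis)."""
    out = list(feature)
    for q in basis:
        coeff = dot(feature, q)
        out = [oi - coeff * qi for oi, qi in zip(out, q)]
    return out

# Hypothetical directions standing in for privacy-related components.
nuisance_directions = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
q_basis = orthonormalize(nuisance_directions)
cleaned = project_out([3.0, -2.0, 5.0], q_basis)
print(cleaned)
```

After projection the feature has no component along any basis direction, which is the property a linear probe would need to exploit to recover the removed attribute.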

Browse all 17 Video Understanding papers →

🚗 Autonomous Driving (8)

CoIRL-AD: Collaborative-Competitive Imitation-Reinforcement Learning in Latent World Models for Autonomous Driving: CoIRL-AD utilizes two independent actors to handle Imitation Learning (IL) and Reinforcement Learning (RL) respectively, relying on a latent world model to "imagine" future trajectories for calculating long-range rewards for RL. A "leader-follower" competitive mechanism allows both actors to transfer beneficial behaviors to each other. This approach successfully integrates RL into end-to-end driving using offline real-world driving data without an external simulator, achieving significant improvements in cross-city generalization and long-tail scenarios.
Constrained Multi-Objective Reinforcement Learning with Max-Min Criterion: This paper unifies "max-min multi-objective fairness" and "hard constraint satisfaction" into a single MORL framework. By reformulating the problem as a convex program via occupancy measures, the authors derive a dual convex optimization problem over weights \((u,w)\). This allows a projected gradient descent algorithm to simultaneously achieve fairness and constraint feasibility with theoretical guarantees of geometric convergence.
DeepSight: Long-Horizon World Modeling via Latent States Prediction for End-to-End Autonomous Driving: DeepSight shifts "future world prediction" from explicit pixel reconstruction (single-frame codebook) to parallel implicit multi-frame prediction of DINOv3 semantic features in BEV space. Combined with an on-demand Adaptive Chain-of-Thought, it achieves a Driving Score of 86.23 (+7.39) and a Success Rate of 71.36% (+13.63) on the Bench2Drive closed-loop benchmark while adding only ~4% inference latency.
Mitigating Error Accumulation in Continuous Navigation via Memory-Augmented Kalman Filtering: This work reformulates step-by-step prediction in continuous UAV VLN as a closed-loop "recursive Bayesian estimation = GRU prior + memory likelihood + learnable Kalman gain." By fine-tuning on only 10% of the data in TravelUAV, the Success Rate (SR) of L1-Full is improved from 17.6% to 25.9%, while the positional drift—which typically accumulates continuously after 100 steps—is flattened to 30–40 meters.
Plug-and-Play Label Map Diffusion for Universal Goal-Oriented Navigation: This paper proposes PLMD: a framework that merges BEV semantic and obstacle maps into a unified Label Map. It utilizes DDPM, modulated by obstacle priors, to complete semantic and obstacle labels in unexplored regions. As a plug-and-play module, it can be integrated with any GON policy and consistently achieves new SOTA results on HM3D/MP3D across three tasks: ON, IIN, and MRON.
RoCA: Robust Cross-Domain End-to-End Autonomous Driving: RoCA attaches a plug-and-play module based on Gaussian Processes (GP) to end-to-end autonomous driving models. By learning a set of basis tokens and corresponding trajectories that cover diverse scenarios, it probabilistically infers future trajectories based on similarity for new scenarios. This approach uses GP uncertainty for regularization to enhance generalization during source domain training and enables efficient adaptation via pseudo-labels and active learning in new domains, without requiring LLMs or increasing inference overhead.
Threshold-Based Exclusive Batching for LLM Inference: This paper systematically characterizes the performance crossover conditions between mixed batching (MB) and exclusive batching (EB) in LLM inference. It proves that on bandwidth-constrained GPUs, co-batching prefill and decode stages slows down Attention due to bandwidth contention. Consequently, the authors derive an optimal phase-switching threshold \(\theta^*\) and a memory-safe batch size based on the hazard rate, designing an online adaptive scheduler EB+. This scheduler improves throughput by up to 41.9% on bandwidth-constrained hardware and up to 36.4% under non-stationary traffic compared to MB.
TSRBench: A Comprehensive Multi-task Multi-modal Time Series Reasoning Benchmark for Generalist Models: TSRBench constructs a time series reasoning benchmark covering 14 domains, 4 major dimensions (Perception/Reasoning/Prediction/Decision-making), 15 tasks, and 4,125 questions. It supports four input modalities (Text, Visual, Text+Image, Embedding) and systematically evaluates 30+ mainstream LLMs, VLMs, and TSLLMs. It reveals that "scaling holds in perception/reasoning but fails in prediction" and that "text and visual modalities are highly complementary, yet current models struggle to fuse them."
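
For readers unfamiliar with the recursive Bayesian estimation referenced in the Memory-Augmented Kalman Filtering entry above, the textbook scalar Kalman update is sketched below. This is the classical closed-form gain, shown only to make the "prior + likelihood + gain" structure concrete; the paper's GRU prior, memory likelihood, and learnable gain are not reproduced here, and all numbers are made up.

```python
def kalman_update(prior_mean, prior_var, measurement, measurement_var):
    """One scalar Kalman correction step.

    The gain weighs the measurement against the prior by their variances.
    """
    gain = prior_var / (prior_var + measurement_var)
    post_mean = prior_mean + gain * (measurement - prior_mean)
    post_var = (1.0 - gain) * prior_var
    return post_mean, post_var, gain

# Prior and measurement are equally uncertain, so the estimate lands midway.
mean, var, gain = kalman_update(
    prior_mean=10.0, prior_var=4.0, measurement=12.0, measurement_var=4.0
)
print(mean, var, gain)
```

Because each correction shrinks the posterior variance, repeated updates keep the estimate from drifting freely, which is the intuition behind using such a filter against error accumulation.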

🤖 Robotics & Embodied AI (53)

Contrastive Representation Regularization for Vision-Language-Action Models: The authors observe that representations in VLA models inherited from VLMs are dominated by visual appearance and are insensitive to robot proprioceptive states. They propose Robot State-aware Contrastive Loss (RS-CL), which uses the Euclidean distance between proprioceptive states as "soft contrastive labels" to reshape representations. Combined with "view cutoff" feature-level augmentation, this method achieves a SOTA success rate of 69.7% on RoboCasa-Kitchen using GR00T N1.5 and improves success rates from 45.0% to 58.3% on real-world Franka pick-and-place tasks.
Decompose and Recompose: Reasoning New Skills from Existing Abilities for Cross-Task Robotic Manipulation: For zero-shot robotic manipulation transferring from trained tasks to entirely new tasks, the authors decompose demonstrations into "atomic skill-action" pairs as intermediate representations. They utilize a dual-library (dynamic library retrieving by visual/planning similarity + static library completing missing skill tokens via IDF weighting) to provide LLMs with skill-comprehensive in-context demonstrations, upgrading "trajectory imitation" to "compositional skill reasoning."
Discrete Diffusion VLA: Bringing Discrete Diffusion to Action Decoding in Vision-Language-Action Policies: This paper shifts VLA action decoding from Autoregressive (AR) or external continuous diffusion heads to "masked diffusion on discrete action tokens within a unified Transformer." Combined with adaptive parallel decoding ranked by confidence and secondary re-masking for error correction, it achieves a 96.4% average success rate on LIBERO and a 64.1% total mean score on SimplerEnv-Fractal. Notably, performance degrades by only 0.8% / 20.4% under OOD language/visual perturbations, significantly outperforming continuous diffusion and parallel decoding baselines while preserving the multimodal priors of the pre-trained VLM.
BEAR: Dissecting Embodied Abilities in Multimodal Language Models through Skill-level Evaluation and Diagnosis: BEAR decomposes embodied tasks into 14 atomic skills and constructs 4,469 interleaved image-video-text VQA pairs. By performing horizontal and vertical skill-level diagnosis on 20 MLLMs, it discovers that perception (rather than reasoning) is the primary bottleneck. Consequently, BEAR-Agent is developed using external visual/spatial tools—such as GroundingDINO, 3D scene graphs, and trajectory visualization—improving GPT-5 performance by 17.5% relative to the baseline and increasing real-robot grasping success by 20.17%.
Dive into the Scene: Breaking the Perceptual Bottleneck in Vision-Language Decision Making via Focus Plan Generation: SceneDiver mitigates visual hallucinations in both high-level planning and reactive control by filtering task-related objects before feeding them back into the model. It employs a two-stage focus plan—coarse-grained sub-scene decomposition via scene graphs followed by agentic VLM verification—and distills this explicit reasoning into VLA using a Slot Attention adapter.
DLO-Lab: Benchmarking Deformable Linear Object Manipulations with Differentiable Physics: DLO-Lab develops a differentiable simulator based on Taichi on the Genesis platform, utilizing Discrete Elastic Rods (DER) as its core. It supports bidirectional coupling, bending plasticity, and closed-loop topology. The platform includes 10 benchmark tasks for rope/cable/elastic bands and a specialized agent using VLM for "grasp proposal + task decomposition." It evaluates various policy learning algorithms (PPO/SAC/SHAC/SAPO/CMA-ES/GD) and validates sim-to-real transitions via system identification.
Drift is a Sampling Error: SNR-Aware Power Distributions for Long-Horizon Robotic Planning: This paper proposes CAPS, which reinterprets "instruction drift" as a systematic sampling error. It uses SNR (\(= \log|\mathcal{A}|-\mathcal{H}\)) as a metacognitive switch to trigger Metropolis-Hastings iterative refinement based on a power distribution \(\pi \propto p^\alpha\) only during high-entropy "Pivotal Windows." As a training-free method, it outperforms OpenVLA and TACO on RoboTwin, Simpler-WindowX, and Libero-long.
Dual-Stream Diffusion for World-Model Augmented Vision-Language-Action Model: DUST utilizes a "dual-stream" multi-modal diffusion Transformer (MMDiT) to process action flows and future visual embedding flows in parallel. By employing shared attention for cross-modal fusion, combined with independent noise scheduling and asynchronous action-vision sampling, it enables the VLA to simultaneously learn "what actions to perform" and "what consequences those actions produce." It consistently outperforms GR00T-N1.5+FLARE on RoboCasa, GR-1, and real-world Franka robots.
Dual Advantage Fields: This paper observes that in the bilinear goal-conditioned value model \(V_\theta(s,g)=\psi_\theta(s)^\top\phi_\theta(g)\), the goal embedding \(\phi_\theta(g)\) is exactly the gradient direction of the value field with respect to the state embedding. An "action-feature displacement predictor" \(u_\xi(s,a)\approx\gamma\psi(s')-\psi(s)\) is trained, and its inner product with the goal embedding gives a local advantage score without learning a separate Q-network. This approach significantly improves the RLiable aggregated metrics across OGBench long-range navigation, manipulation, and puzzle tasks.
Dual Quaternion SE(3) Synchronization with Recovery Guarantees: This paper parameterizes the SE(3) synchronization problem using Unit Dual Quaternions (UDQ) instead of \(4\times4\) matrices. It calculates spectral initialization via power iteration on Hermitian dual quaternion matrices, followed by iterative refinement using the Dual Quaternion Generalized Power Method (DQGPM) with element-wise projection to \(\mathrm{UDQ}^n\). It provides the first finite-step linear convergence and explicit error bounds for SE(3) synchronization, reducing both rotation/translation errors and computational time below those of matrix-based methods in multi-scan point cloud registration.
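
The quantities named in the "Drift is a Sampling Error" entry above can be made concrete with a small sketch: the SNR \(\log|\mathcal{A}|-\mathcal{H}\) of an action distribution, the power distribution \(\pi \propto p^\alpha\), and a Metropolis-Hastings acceptance probability. The acceptance rule assumes an independence proposal drawn from \(p\) itself, which gives the ratio \((p(x')/p(x))^{\alpha-1}\); that proposal choice, the function names, and the toy distributions are assumptions made here, not details confirmed by the paper.

```python
import math

def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def snr(probs):
    """log|A| - H: zero for a uniform distribution, log|A| for a one-hot one."""
    return math.log(len(probs)) - entropy(probs)

def power_distribution(probs, alpha):
    """Normalize p^alpha; alpha > 1 sharpens the distribution."""
    weights = [p ** alpha for p in probs]
    total = sum(weights)
    return [w / total for w in weights]

def mh_accept_prob(p_current, p_proposed, alpha):
    """Acceptance probability when proposing from p and targeting p^alpha.

    Assumes p_current > 0 (the current sample was drawn from p).
    """
    return min(1.0, (p_proposed / p_current) ** (alpha - 1.0))

confident = [0.97, 0.01, 0.01, 0.01]
uncertain = [0.25, 0.25, 0.25, 0.25]
sharpened = power_distribution([0.5, 0.3, 0.2], alpha=2.0)

print(snr(confident), snr(uncertain))
print(sharpened)
print(mh_accept_prob(0.5, 0.2, alpha=2.0))
```

A low SNR marks a high-entropy step, which is where the entry says refinement is triggered; the rule for deciding what counts as low enough is not specified in the summary and is left out here.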

Browse all 53 Robotics & Embodied AI papers →

🎮 Reinforcement Learning (110)

Adaptive Bandit Algorithms for Contextual Matching Markets: This paper studies online matching markets with contexts, treating players' linear preferences for dynamic arm contexts as the bandit learning objective. It proposes BARB for stochastic contexts and AdECO for adversarial contexts, providing adaptive upper bounds for player-optimal stable regret and tight \(\tilde O(T^{2/3})\) theoretical results.
Agent Learning via Early Experience: This paper proposes the "early experience" paradigm, which allows language agents to utilize the future states of their own actions to learn environment dynamics and decision-making reflections without external rewards. This approach consistently outperforms pure imitation learning across 8 agent environments and provides a superior initialization for subsequent GRPO reinforcement learning.
ALSO: Adversarial Online Strategy Optimization for Social Agents: ALSO models dynamic strategy selection in LLM social intelligence simulations as an adversarial online bandit. It utilizes a lightweight reward surrogate model to generalize sparse feedback from dialogue history, improving the overall score on Sotopia-Hard from 3.02 to 3.53, with significant gains in the relationship dimension.
ASAP: Exploiting the Satisficing Generalization Edge in Neural Combinatorial Optimization: ASAP identifies that "identifying a set of promising actions" generalizes across distributions more easily than "directly selecting the single optimal action" in neural combinatorial optimization. It utilizes a two-stage proposal-selection strategy and MAML initialization to make neural solvers for 3D-BPP, TSP, and CVRP more stable and faster to adapt when distributions shift.
Beyond Scalar Rewards: Dense Feedback for LLM Policy Synthesis in Sequential Social Dilemmas: This paper proposes an iterative LLM policy synthesis framework where an LLM directly generates Python policy code for multi-agent sequential social dilemmas. Through "feedback engineering," it demonstrates that adding four social metrics—efficiency, equality, sustainability, and peace—as dense feedback alongside scalar rewards breaks the "feedback aliasing" problem, achieving up to a 54% efficiency improvement in the Cleanup game.
Beyond the Proxy: Trajectory-Distilled Guidance for Offline GFlowNet Training: The paper proposes TD-GFN, an offline GFlowNet training framework that eliminates the need for proxy reward models. It extracts edge-level rewards from offline trajectories via inverse reinforcement learning, followed by indirect policy guidance through DAG pruning and prioritized backward sampling. This approach ensures that gradient updates rely exclusively on ground-truth terminal rewards, significantly outperforming existing baselines in tasks such as molecular design and sequence generation.
Bilevel Optimization over Saddle Points of Zero-Sum Markov Games: The PANDA algorithm is proposed to solve bilevel RL problems where the lower level is a regularized zero-sum Markov game. By employing a penalty reformulation based on the Nikaido-Isoda function and utilizing purely first-order policy gradient methods, it achieves an iteration complexity of \(\tilde{O}(\epsilon^{-1})\) and a sample complexity of \(\tilde{O}(\epsilon^{-3})\), matching the best-known rates for single-policy lower-level BRL.
Break the Block: Dynamic-size Reasoning Blocks for Diffusion Large Language Models via Monotonic Entropy Descent with Reinforcement Learning: Addressing the issue where "fixed block sizes" break the logical chain of thought during semi-autoregressive generation in Diffusion Large Language Models (dLLM), this paper proposes b1. It learns a block-end indicator token via RL to generate dynamic-length blocks and employs a "block-level Monotonic Entropy Descent (MED) reward" to drive coherent reasoning. As a plug-and-play reward term integrated into existing dLLM RL frameworks (Diffu-GRPO/GDPO/d1/wd1), it improves wd1 performance on Countdown from 39.45 to 58.98.
CAMEL: Confidence-Gated Reflection for Reward Modeling: This paper observes that the log-probability margin of the verdict token is highly correlated with judgment accuracy. Based on this, it proposes CAMEL—a method that first provides a rapid preference judgment via a single token and triggers reflection generation only when confidence is low. Using counterfactual prefix augmentation in GRPO training to enhance self-correction capabilities, it achieves an average accuracy of 82.9% across three reward model benchmarks with 14B parameters (surpassing the previous best 70B model by 3.2%).
Can Large Language Models Generalize Procedures Across Representations?: This paper finds that procedural knowledge learned by LLMs on symbolic representations (code/graphs) cannot reliably transfer to natural language tasks. It proposes a two-stage RL curriculum strategy—"symbolic then natural language"—enabling a 1.5B Qwen model to approach zero-shot GPT-4o performance on asynchronous planning tasks. From a cognitive science perspective, it demonstrates that successful cross-representation generalization can be interpreted as generative analogy.
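
The confidence gate described in the CAMEL entry above can be sketched as a threshold on the log-probability margin between the two verdict tokens: a large margin returns the fast single-token judgment, a small one requests reflection. The function names, the "A"/"B" verdict labels, and the threshold of 1.0 are hypothetical; the summary does not state how the threshold is chosen.

```python
def verdict_margin(logprob_a, logprob_b):
    """Absolute log-probability gap between the two verdict tokens."""
    return abs(logprob_a - logprob_b)

def judge(logprob_a, logprob_b, threshold=1.0):
    """Return (verdict, needs_reflection).

    `threshold` is an illustrative cutoff, not a value from the paper.
    """
    verdict = "A" if logprob_a >= logprob_b else "B"
    needs_reflection = verdict_margin(logprob_a, logprob_b) < threshold
    return verdict, needs_reflection

# Clear preference: the single-token verdict is accepted as is.
easy = judge(logprob_a=-0.05, logprob_b=-3.0)

# Near tie: the gate asks for a reflection pass before committing.
hard = judge(logprob_a=-0.65, logprob_b=-0.74)

print(easy, hard)
```

The design trades a small amount of extra generation on uncertain cases for single-token cost on the confident ones.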

Browse all 110 Reinforcement Learning papers →

🎁 Recommender Systems (11)

A Paired Testing Protocol for Batch-Conditioned Refusal Robustness in LLM Serving: This paper treats the batch condition in LLM serving as a treatment variable for safety evaluation. It proposes a testing protocol consisting of safety-capability paired comparisons, scorer/human adjudication, cross-model expansion, continuous batching composition, and batch-invariant kernel ablation. The study concludes that refusal flips are real but low-frequency, model-specific, and dependent on the specific serving stack.
Can Recommender Systems Teach Themselves? A Recursive Self-Improving Framework with Fidelity Control: RSIR enables sequential recommendation models to generate new synthetic user interaction sequences using their own predictive capabilities, train a new model, and filter out samples deviating from the user preference manifold using a rank-based "fidelity check" to prevent self-consuming model collapse. It consistently improves NDCG/Recall by 4–11% across 4 datasets and 3 mainstream backbones, theoretically proving that this process is equivalent to implicit regularization along the tangent space of the user preference manifold.
GCIB: Graph Contrastive Information Bottleneck for Multi-Behavior Recommendation: GCIB employs a dual approach of "Graph Information Bottleneck + Cross-behavior Contrastive Learning." It first prunes edges in auxiliary behavior graphs that are irrelevant to the target task at the structural level (maximizing mutual information with the target behavior and minimizing mutual information with the original auxiliary graph via HSIC surrogates). It then aligns denoised auxiliary representations with sparse target representations using InfoNCE at the feature level, achieving a 7%–40% relative improvement in HR@10 / NDCG@10 across four multi-behavior recommendation benchmarks.
Incentivized Exploration with Stochastic Covariates: A Two-Stage Mechanism Design for Recommender System: RCB integrates "exploration-exploitation" and "user incentive compatibility" into a contextual bandit problem under Dynamic Bayesian Incentive Compatibility (DBIC) constraints. It proposes a two-stage algorithm (Cold Start + IPGS), proves \(\tilde{O}(\sqrt{KdT})\) regret in stochastic user covariate scenarios, allows for the integration of any offline learning oracle, and quantifies the "incentive price" — showing that the cold start sample size grows as \(1/\epsilon^2\) as the \(\epsilon\) constraint tightens.
Learning Design Skills as Memory Policies for Agentic Photonic Inverse Design: SkillPCF reformulates the inverse design of Photonic Crystal Fibers (PCF) as a "memory policy learning" problem. A PPO-trained controller selects Top-K memory operations from an evolvable skill library for each trajectory span. An executor implements these in trajectory memory, while MEEP electromagnetic simulation rewards simultaneously optimize both the controller and the skill library. This approach achieves a superior trade-off between design success rate and simulation budget compared to multiple LLM backends and classical optimization baselines.
Position: Neglecting the Sustainability of AI is Fuelling a Global AI Arms Race: Utilizing Karl Marx's "base-superstructure" framework, this position paper argues that current "sustainable AI" discussions are dominated by environmental dimensions while neglecting economic and social ones. It calls for the simultaneous elevation of both climate awareness and resource awareness axes and proposes the CARAML five-layer action framework (Individual / Community / Industry / Government / Global) to curb the escalating "global AI arms race."
Position: Stop Preaching and Start Practising Data Frugality for Responsible Development of AI: This position paper points out that the ML community has long been "preaching without practicing" regarding "data frugality"—while verbally acknowledging that coresets save energy, almost no one actually reports energy consumption or carbon emissions. Using ImageNet-1K as a case study, the authors calculate a conservative lower bound of approximately 5.82 GWh / 2589 tCO2e for downstream training and storage, calling for data frugality to evolve from a slogan into a measurable, actionable, and rewardable engineering practice.
Prompts for Public-Sector LLMs Should Be Governed as Commons: This is a position paper: the authors argue that LLM prompts used by the public sector should be versioned, provenanced, auditable, and vetoable like open-source commons. Based on a pilot benchmark using 443 neighborhood prompts from a North American city (augmented to 3,317) across five governance states, it provides three falsifiable predictions—governed prompts change output distributions, improve auditability, and shorten fault-remediation latency.
Rethinking Contrastive Learning for Graph Collaborative Filtering: Limitations and a Simple Remedy: The authors decompose the forward prediction of LightGCN into a "sum of learnable weights of multi-hop neighbor pairs." They find that the Sampled Softmax (SSM) loss only weights based on the structural similarity of the item-side neighbors and treats all four types of neighbor pairs (UU/II/UI/IU) indiscriminately. Consequently, they propose NT-SSM, which incorporates user-side structural similarity into the gradient and calibrates weighting strategies according to neighbor pair types, consistently outperforming SSM across four datasets and various GCF backbones.
RGMem: Renormalization Group-Inspired Memory Evolution for Language Agents: RGMem draws inspiration from the Renormalization Group (RG) in statistical physics to model the long-term dialogue memory of language agents as a multi-scale system ("Event Layer → Relation Layer → Concept Layer"). It employs threshold-triggered non-linear operators to coarse-grain fragmented dialogues into stable user profiles, thereby breaking the "stability vs. plasticity" trade-off.
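
The InfoNCE objective mentioned in the GCIB entry above has a compact form: the negative log of the softmax probability assigned to the positive pair among all candidates. The sketch below works on precomputed similarity scores; the temperature of 0.2 and the toy scores are assumptions for illustration, and how GCIB builds its positive and negative pairs is not shown.

```python
import math

def info_nce(pos_score, neg_scores, temperature=0.2):
    """InfoNCE loss for one anchor from similarity scores.

    Computes -log softmax(positive) with a numerically stable log-sum-exp.
    """
    logits = [pos_score / temperature] + [s / temperature for s in neg_scores]
    max_logit = max(logits)
    log_denominator = max_logit + math.log(
        sum(math.exp(l - max_logit) for l in logits)
    )
    return log_denominator - logits[0]

# The positive pair is the most similar candidate: low loss.
aligned = info_nce(0.9, [0.1, 0.0, -0.2])

# A negative outranks the positive: high loss.
confused = info_nce(0.1, [0.9, 0.0, -0.2])

print(aligned, confused)
```

When every candidate has the same score, the loss equals the log of the number of candidates, which is a convenient sanity check for an implementation.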

Browse all 11 Recommender Systems papers →

🔄 Self-Supervised Learning (28)¶

A Refined Generalization Analysis for Extreme Multi-class Supervised Contrastive Representation Learning: This paper improves the sample complexity upper bound for supervised contrastive learning (where tuples are constructed from a finite labeled data pool). By employing two distinct U-statistic estimators, it replaces bounds that depend on the minimum class probability with bounds that depend only on the number of classes or the sample size in extreme multi-class scenarios.
Active Learning with Foundation Model Priors: Efficient Learning under Class Imbalance: This paper proposes PriorAL, which utilizes foundation model predictions as priors for joint decision-making with small models via a "Product of Experts." It employs imbalance-aware entropy filtering to partition the unlabeled pool into a "clean set (for free pseudo-labeling)" and a "noise set (for human annotation)," achieving over 50% savings in labeling costs on image/text tasks characterized by both class imbalance and label noise.
Beyond Distribution Estimation: Simplex Anchored Structural Inference Towards Universal Semi-Supervised Learning: This paper proposes SAGE, which replaces "estimating unlabeled data distributions" with "structural inference in the representation space." By combining simplex ETF geometric anchors, high-order graph propagation, and distribution-agnostic reliability weighting, SAGE achieves an average accuracy improvement of 8.52% under the UniSSL setting with extreme label scarcity and arbitrary unlabeled distributions.
Can Local Learning Match Self-Supervised Backpropagation?: This paper theoretically proves that local self-supervised learning (local-SSL) can precisely achieve the gradient updates of global backpropagation (BP-SSL) in deep linear networks. Based on this insight, the authors propose CLAPP++ (introducing 2D spatial dependence and direct feedback), which achieves performance comparable to global BP-SSL on CIFAR-10/STL-10/Tiny ImageNet, setting a new SOTA for local-SSL.
Data Augmentation of Contrastive Learning is Estimating Positive-incentive Noise: The authors prove that "predefined data augmentation (rotation/cropping/flipping)" in contrastive learning is equivalent to a point estimation of Positive-incentive Noise (π-noise). They then upgrade π-noise from "point estimation" to a learnable distribution by training a π-noise generator (PiNDA) to add learnable noise as augmentation. This leads to consistent gains for SimCLR / BYOL / SimSiam / MoCo / DINO in vision and is naturally compatible with non-visual data without manual augmentation, such as HAR / Reuters / Epsilon.
FLAG: Foundation Model Representation with Latent Diffusion Alignment via Graph for Spatial Gene Expression Prediction: FLAG reformulates the prediction of spatial gene expression from H&E pathology images as a structured distribution generation problem. It employs a fixed spatial graph encoder to compress tissue topology into conditional vectors, uses a DiT for denoising in the gene dimension, and injects gene-gene regulatory priors through intermediate layer alignment with Gene Foundation Models (GFMs). This approach elevates Gene Structural Correlation (GSC) and Spatial Structural Correlation (SSC) to new heights while maintaining competitive PCC/MSE performance.
From Zero to Hero: Advancing Zero-Shot Foundation Models for Tabular Outlier Detection: This paper proposes OutFormer, a tabular Prior-Fitted Network (PFN) pretrained on a mixture of three synthetic priors (GMM, SCM, and Copula) and stabilized through a Multi-Armed Bandit-based Self-Evolving Curriculum. It achieves zero-shot tabular outlier detection by processing training data in-context and generating labels in a single forward pass. OutFormer achieves SOTA rankings across ADBench and two new benchmarks containing 1500+ datasets, while maintaining inference latency comparable to shallow models.
How 'Neural' is a Neural Foundation Model?: The authors treat a "SOTA foundation model of mouse visual cortex (FNN)" as a physiological experimental subject. By analyzing its encoder, recurrent, and readout modules with three tools (decoding manifolds, encoding manifolds, and decoding trajectories), they find that FNN's fitting accuracy is primarily sustained by a large number of homogeneous feature maps in the readout, while only the recurrent module is truly "brain-like." Using a newly proposed tubularity metric, they show quantitatively that "early encoding layers lack biological-grade temporal structure," and they give explicit suggestions for future neural foundation models: "add recurrence early and reduce feature dimensions in the readout."
Inconsistency-Aware Minimization: Improving Generalization with Unlabeled Data: This paper proposes "local inconsistency" \(S_\rho(\theta)\)—the worst-case KL divergence within a parameter ball—which can be calculated using only unlabeled data. By employing it as a training regularization term, the resulting IAM optimizer performs comparably to or better than SAM/ASAM in supervised tasks and brings additional improvements in semi-supervised (FixMatch) and self-supervised (SimCLR) scenarios by leveraging unlabeled batches.
InfoAtlas: A Foundation Model for Zero-Shot Statistical Dependence Estimation: InfoAtlas transforms mutual information estimation from an optimization problem where an evaluation network is trained from scratch for each dataset into a "single forward pass" problem using a hypernetwork pre-trained on large-scale synthetic data. This achieves accuracy comparable to neural estimators like MINE/MINDE while providing a 100× speedup.

Browse all 28 Self-Supervised Learning papers →
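
The "Product of Experts" step in the PriorAL summary above can be illustrated with a minimal sketch: two categorical predictions are multiplied and renormalized, so the combined belief is sharp only where both experts agree. The function names, the example probabilities, and the use of plain Shannon entropy are assumptions for illustration; the paper's imbalance-aware entropy filter is not reproduced here.

```python
import numpy as np

def product_of_experts(p_prior, p_model, eps=1e-12):
    """Multiply two categorical distributions and renormalize (in log space)."""
    log_p = np.log(np.asarray(p_prior) + eps) + np.log(np.asarray(p_model) + eps)
    log_p -= log_p.max(axis=-1, keepdims=True)  # numerical stability
    p = np.exp(log_p)
    return p / p.sum(axis=-1, keepdims=True)

def entropy(p, eps=1e-12):
    """Plain Shannon entropy (hypothetical stand-in for the paper's filter)."""
    p = np.asarray(p)
    return -(p * np.log(p + eps)).sum(axis=-1)

# Hypothetical predictions for one unlabeled sample over three classes.
p_foundation = np.array([0.7, 0.2, 0.1])   # foundation model prior
p_small = np.array([0.6, 0.3, 0.1])        # small task model
p_joint = product_of_experts(p_foundation, p_small)
```

When both experts favor the same class, the joint distribution has lower entropy than either input, which is the kind of sample a confidence filter would route to pseudo-labeling rather than human annotation.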

📐 Optimization & Theory (88)¶

A2SG: Adaptive and Asymmetric Surrogate Gradients for Training Deep Spiking Neural Networks: To address the dual issues of "sharp loss landscapes" and "conflicting gradients across timesteps" in deep Spiking Neural Networks (SNNs) trained with surrogate gradients, this paper proposes a unified framework, A2SG. On one hand, it employs an adaptive effective window width (automatically adjusting \(\beta\) based on Spatial Gradient Variation (SGV) and Temporal Gradient Consistency (TGC)) to suppress gradient variation and align directions across timesteps. On the other hand, it replaces symmetric surrogate functions with an asymmetric shape that allocates gradients based on membrane potential levels. It theoretically proves that asymmetric shapes exhibit lower variation than symmetric ones and that smaller local gradient variation leads to flatter loss landscapes, consistently improving accuracy and energy efficiency across CNN and Transformer-based SNNs.
A Fully First-Order Layer for Differentiable Optimization: Mainstream differentiable optimization layers rely on implicit differentiation of KKT conditions, which requires computing Hessians and solving large KKT linear systems, making them difficult to scale to large problems. This paper rewrites differentiable optimization as a bi-level optimization, constructing a "ghost proxy" problem with a fixed active set and linearized active constraints to simplify inequality constraints into equality constraints locally. It then uses finite differences to estimate the hypergradient using only first-order information within nearly constant \(\mathcal{O}(\log(1/\epsilon))\) calls. The authors implement FFOLayer, a PyTorch library that is plug-and-play with any convex solver (including GUROBI/MOSEK). It achieves convergence comparable to exact methods while computational time and peak memory grow nearly sublinearly with problem scale.
A General Framework for Dynamic Consistent Submodular Maximization: This paper presents a general consistency framework for fully dynamic submodular maximization. In streaming environments with insertions and deletions, it provides the first constant approximation guarantees with sublinear worst-case per-step solution changes (recourse) for both cardinality and matroid constraints.
Accelerated Multiple Wasserstein Gradient Flows for Multi-objective Distributional Optimization: This paper generalizes Multiple Wasserstein Gradient Descent into continuous-time gradient flows and introduces Nesterov-style momentum acceleration to obtain A-MWGraD. Theoretically, it improves the convergence rate to the weak Pareto optimum from \(O(1/t)\) to \(O(1/t^2)\) in geodesically convex scenarios. Empirically, it accelerates convergence in multi-target sampling and Bayesian multi-task learning.
AdaGC: Enhancing LLM Pretraining Stability via Adaptive Gradient Clipping: To address the recurring loss spikes in large model pretraining, AdaGC replaces the "one-size-fits-all" Global Gradient Clipping with "per-tensor adaptive clipping based on the EMA of its own historical gradient norm." By suppressing abnormal gradients before they pollute the optimizer's first and second moments, it reduces spike scores to zero on Llama-2 7B, Mixtral 8×1B, and ERNIE 10B-A1.4B, while improving downstream accuracy by +1.32%, +1.27%, and +2.48% respectively compared to Global Gradient Clipping (GlobalGC).
Adaptive Estimation and Inference in Semi-parametric Heterogeneous Clustered Multitask Learning via Neyman Orthogonality: This paper bridges Double Machine Learning (DML) and clustered multitask learning by proposing an adaptive framework that combines Neyman orthogonality with a data-driven pairwise fusion penalty. In semi-parametric settings with heterogeneous (potentially infinite-dimensional) nuisance parameters, it accurately recovers latent task clusters, achieves oracle-level aggregation rates, and establishes asymptotic normality for valid statistical inference.
Adaptive Preconditioners Trigger Loss Spikes in Adam: This paper attributes loss spikes in Adam training to the lag-induced decoupling between the second-moment preconditioner and the current squared gradients, and both explains and predicts spike occurrences using the curvature of the preconditioned Hessian in the gradient direction.
Adaptive Sharpness-Aware Minimization with a Polyak-type Step size: A Theory-Grounded Scheduler: This paper generalizes the Polyak step size to USAM/SAM, providing a sharpness-aware scheduler that does not rely on manual learning rate tuning. Its stability and performance are verified through convex optimization theory and CIFAR experiments.
Asymmetric Perturbation in Solving Bilinear Saddle-Point Optimization: This paper demonstrates that perturbing the payoff of only one player in a bilinear zero-sum game preserves the original equilibrium under a sufficiently small perturbation. Based on this, the authors construct AsymP-GDA, which theoretically achieves linear last-iterate convergence and approaches the original equilibrium faster and more accurately than symmetric perturbation in normal-form and extensive-form game experiments.
Automatic Unsupervised Ensemble Outlier Model Selection–Extended Version: This paper proposes MetaEns, a framework that adaptively and greedily constructs compact, high-quality anomaly detection ensembles under unlabeled conditions. It predicts the marginal ensemble gain of candidate detectors through meta-learning and combines this with a proxy objective function featuring diversity discounts and algorithm family risk regularization.

Browse all 88 Optimization & Theory papers →
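
The AdaGC summary above describes per-tensor clipping against an EMA of each tensor's own gradient norm. The following is a minimal NumPy sketch of that idea, not the authors' implementation: the class name, the relative threshold of 1.05, the EMA decay of 0.98, and the choice to accept the first gradient unclipped are all assumptions made for illustration.

```python
import numpy as np

class PerTensorAdaptiveClipper:
    """Clip each tensor's gradient relative to an EMA of its own past norms."""

    def __init__(self, rel_threshold=1.05, beta=0.98):
        self.rel_threshold = rel_threshold  # assumed hyperparameter
        self.beta = beta                    # assumed EMA decay
        self.ema_norm = {}                  # tensor name -> EMA of gradient norm

    def clip(self, grads):
        clipped = {}
        for name, g in grads.items():
            norm = float(np.linalg.norm(g))
            ema = self.ema_norm.get(name)
            if ema is None:
                # No history yet: accept the gradient and initialize the EMA.
                clipped[name] = g
                self.ema_norm[name] = norm
                continue
            limit = self.rel_threshold * ema
            if norm > limit:
                g = g * (limit / norm)      # rescale the outlier gradient
                norm = limit
            clipped[name] = g
            # Update the EMA with the (possibly clipped) norm so a spike
            # cannot inflate the future threshold.
            self.ema_norm[name] = self.beta * ema + (1.0 - self.beta) * norm
        return clipped

clipper = PerTensorAdaptiveClipper()
clipper.clip({"w": np.ones(4)})                    # warm-up step, norm 2.0
spiky = clipper.clip({"w": 100.0 * np.ones(4)})    # abnormal gradient
spike_norm = float(np.linalg.norm(spiky["w"]))
```

In a training loop the clipped gradients would be passed to the optimizer, so the abnormal step is damped before it reaches the first- and second-moment estimates.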

📐 Learning Theory (45)¶

A Perturbation Approach to Unconstrained Linear Bandits: This paper revisits the perturbation-based bandit linear optimization approach by Abernethy et al., proposing the PABLO reduction. This reduction transforms unconstrained linear bandits into a problem that can call any OLO subroutine, yielding comparator-adaptive static and dynamic regret bounds as well as high-probability bounds. The paper also discusses several lower bounds.
Active Learning with Low-Rank Structure for Data Selection: Addressing the mismatch where existing coreset methods assume geometric clustering while modern datasets exhibit global algebraic (low-rank) structures, this paper proposes a data selection framework based on low-rank approximation and residual sensitivity sampling. Using a weighted subset of size \(\tilde{O}(k+1/\varepsilon^2)\), the method approximates the full average loss to a \((1\pm\varepsilon)\) relative error (with an additive term proportional to the optimal rank-\(k\) approximation cost \(\Phi_k\)). It outperforms uniform and cluster-based sampling on tabular data and Llama3-8B / Qwen2.5-3B fine-tuning.
AI4SLT: Empirical Processes in Lean 4 for Formal Statistical Learning Theory: This work presents the first systematic formalization of "Empirical Process-based Statistical Learning Theory (SLT)" from scratch in Lean 4. It fills gaps in Mathlib by implementing Gaussian Lipschitz concentration, the Dudley entropy integral theorem, and sharp rates for least squares regression (including \(\ell_1\) constraints). The project consists of approximately 30,000 lines of Lean code without sorry or axiom, completed through a human-AI collaborative paradigm where humans designed proof strategies and agents (Claude Code + Opus-4.5) executed tactical proofs.
Asymptotic Optimality of the High-Dimensional Gaussian Mechanism and Improved Low-Dimensional Mechanisms for Differential Privacy: This theoretical paper answers two long-standing open questions. First, is the Gaussian mechanism the optimal choice for additive noise differential privacy in high dimensions? The answer is yes in the limit: as the dimension \(T\to\infty\), no additive noise can asymptotically outperform the Gaussian at a fixed mean squared error. Second, do mechanisms exist that are superior to both Gaussian and \(\ell_2\) mechanisms in low dimensions? The answer is again yes: the authors propose a three-parameter family of Spherical Generalized Gamma noise, which reduces MSE by up to 15% in certain low-dimensional settings, and they provide tight composition guarantees for this family, resolving an open question by Joseph et al. regarding the \(\ell_2\) mechanism.
Bandit Social Learning with Exploration Episodes: This paper investigates the social learning dynamics of bandits where "each selfish agent controls a short sequence of decisions (episode)." It proves that even if agents spontaneously explore within their own episodes, exploration at the aggregate level still fails. For any episode length \(m \geq 2\) and any aggregate utility function \(f\) (such as sum, max, or min), learning failure occurs with positive probability, leading to linear growth of Bayesian regret over time.
Catastrophic Forgetting is Low-Rank: A Function-Space Theory for Continual Adaptation: Instead of treating catastrophic forgetting as "parameter drift," this work provides a closed-form characterization in function space under the NTK framework: new task training drags old task predictions away via the cross-task kernel \(K_{AB}\), and this "forgetting vector" is precisely predictable before training. This vector concentrates on an extremely small number of eigenmodes of the old task kernel \(K_{AA}\) (1–6 modes carry 50–90% of the forgetting energy), explaining why parameter-space regularizers fail on shared-head benchmarks and leading to a spectral regularization method that protects only the vulnerable subspace.
Conditional KRR: Injecting Unpenalized Features into Kernel Methods with Applications to Kernel Thresholding: This paper proposes a Conditional Kernel Ridge Regression (Conditional KRR) framework that injects a set of unpenalized features into kernel methods. By reducing it to a standard KRR via a residual kernel, the authors prove a reduction cost of \(\mathcal{O}(1/\sqrt{N})\) and verify sufficient conditions where Conditional KRR outperforms standard KRR under both hard thresholding (top-k eigenfunctions) and soft thresholding (random Gaussian features) settings.
CORE-MTL: Rethinking Gradient Balancing via Causal Orthogonal Representations: The authors reattribute the root cause of "negative transfer" in Multi-Task Learning (MTL) from "gradient conflict" to the "entanglement of semantics and noise in shared representations." They propose CORE-MTL: a dual-stream encoder splits representations into semantic \(\hat{Z}_s\) and residual \(\hat{Z}_r\), implementing "causal orthogonality" through CKA independence constraints, counterfactual style replacement, and inverse rendering reconstruction. Theoretically, it provides a tighter OOD upper bound than gradient balancing; experimentally, it outperforms ten baselines including PCGrad, GradNorm, STCH, and FairGrad on NYUv2/Cityscapes (ID) and GTA5→Cityscapes/Cityscapes-C (OOD) settings.
Correcting Split Selection in Online Decision Trees via Anytime-Valid Inference: The authors point out that the "fixed sample size" concentration inequalities used by the classic Hoeffding Tree (HT) for splitting on data streams are violated by its own "data-dependent stopping rule." They reformulate the split criterion using testing-by-betting + Universal Portfolio, allowing both single trees and Adaptive Random Forests to maintain controlled Type-I errors at any stopping time, while achieving higher accuracy and smaller tree sizes across 12 real-world streams.
Cutting LLM Evaluation Costs with SySRs: A Bandit Algorithm that Provably Exploits Model Similarity: To reduce evaluation budgets when "selecting the best model," the authors transform the classic Successive Rejects bandit algorithm into a "synchronized" version called SySRs. By evaluating all surviving models on the same batch of test samples in each stage, the algorithm exploits inter-model correlations similarly to paired testing. This results in a hyperparameter-free best-arm identification algorithm with an error bound that tightens as model correlation increases. On 15 standard benchmarks, it reliably selects the optimal model using \(\le 35\%\) of model-sample pairs, outperforming existing methods.

Browse all 45 Learning Theory papers →
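
The SySRs summary above describes running Successive Rejects with all surviving models evaluated on the same test samples in each stage. The sketch below uses the classic Successive Rejects phase schedule and adds only that sharing of samples; it is an illustration under assumptions, not the paper's algorithm. The function name and the precomputed `scores` matrix are hypothetical (a real run would evaluate samples lazily, only for surviving models).

```python
import math
import numpy as np

def synchronized_successive_rejects(scores, budget):
    """Pick the best row of `scores` (models x samples) by successive rejection.

    In each phase every surviving model is scored on the SAME prefix of
    samples, and the model with the lowest running mean is dropped.
    """
    scores = np.asarray(scores, dtype=float)
    n_models, n_samples = scores.shape
    if budget <= n_models:
        raise ValueError("budget must exceed the number of models")
    # Normalizer from the classic Successive Rejects schedule.
    log_bar = 0.5 + sum(1.0 / i for i in range(2, n_models + 1))
    survivors = list(range(n_models))
    seen = 0
    for phase in range(1, n_models):
        n_k = math.ceil((budget - n_models) / (log_bar * (n_models + 1 - phase)))
        seen = min(max(seen, n_k), n_samples)  # shared sample prefix
        means = scores[survivors, :seen].mean(axis=1)
        worst = survivors[int(np.argmin(means))]
        survivors.remove(worst)
    return survivors[0]

# Toy data: model 2 is correct on every sample, model 0 on none.
scores = np.array([
    [0.0] * 12,
    [1.0, 0.0] * 6,
    [1.0] * 12,
])
best = synchronized_successive_rejects(scores, budget=30)
```

Because every survivor sees the same samples, differences between their running means behave like paired comparisons, which is the intuition the summary gives for why correlated models become easier to separate.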

🔗 Causal Inference (19)¶

An Odd Estimator for Shapley Values: This paper demonstrates that the Shapley value depends solely on the odd component of a set function. Based on this, it proposes OddSHAP: a method that isolates odd signals via paired sampling, screens high-order odd Fourier interactions using GBT, and performs sparse odd regression. It significantly outperforms flexible-budget Shapley estimators on mid-to-high dimensional explanation tasks.
Causal-JEPA: Learning World Models through Object-Level Latent Masking: This paper proposes C-JEPA, which extends JEPA's mask prediction from image patch-level to object-level latent representations. By using object-level masking as latent interventions, the model is forced to learn interaction-dependent dynamics. It achieves a gain of approximately 20% in counterfactual reasoning over non-masked baselines and reaches comparable performance in control tasks using only 1% of tokens with over 8x planning acceleration.
Causal Modeling of Selection in Evolution: The paper argues that "selection" consists of two types: static selection (one-time filtering) and evolutionary selection (accumulation of differential reproduction over multiple generations). Existing graphical models conflate the two, leading to erroneous causal discoveries on evolutionary data. The authors define a causal graphical model that explicitly characterizes evolution and prove that its conditional independence (CI) constraints can be losslessly represented by a "clique-expanded DAG." This allows for the direct application of standard PC/GES/CDNOD algorithms, requiring only a reinterpretation of the output semantics.
Controllable Generative Sandbox for Causal Inference: This paper proposes CausalMix, a variational generative framework that jointly optimizes a type-specific multi-head decoder and a Bayesian Gaussian Mixture Model (GMM) latent prior with three independently adjustable causal "knobs" (overlap \(\alpha(X)\), CATE function \(\tau(X)\), and unobserved confounding \(\kappa(X,T)\)). While maintaining the fidelity of real-world data distributions, CausalMix allows users to design counterfactual benchmarks. Validated on real metastatic castration-resistant prostate cancer (mCRPC) patient records, CausalMix reproduces mixed-type tables with high fidelity and stably injects overlap, confounding, and heterogeneous effects as needed for controllable stress-testing of CATE estimators.
Density-Guided Robust Counterfactual Explanations on Tabular Data under Model Multiplicity: DensityFlow reformulates "generating Robust Counterfactual Explanations (RCE) under model multiplicity" as an optimal transport problem with density constraints. It uses Noise Contrastive Estimation (NCE) to train a (K+1)-way discriminator that simultaneously learns classification and class-conditional density. It then employs a Neural ODE to transport query samples along density gradients to the high-density manifold of the target class. In black-box scenarios, it aligns the surrogate only via local distillation on generated trajectories, achieving higher cross-model validity with significantly fewer queries than ensemble baselines.
ECSEL: Explainable Classification via Signomial Equation Learning: ECSEL employs "one signomial (sum of power-law terms with real exponents) per category + softmax" as a classifier. Combined with L1 sparse regularization and multi-stage optimization, it recovers 95.86% of target equations on symbolic regression benchmarks like AI Feynman with significantly lower compute than SOTA, while achieving parity with XGBoost/MLP on 11 classification datasets. All feature attributions are derived in closed-form from model parameters.
Evaluating Bivariate Causal Statements Based on Mutual Compatibility: This paper addresses scenarios where "only pairwise (bivariate) causal statements are available without ground truth." It proposes two compatibility scores that do not rely on faithfulness: comp for linear cases and incomp for graph structures. By determining whether the multivariate model formed by stitching these pairwise statements requires "anomalous extra confounding" to explain the observed covariance, the method identifies incorrect causal claims, and the authors use these scores to evaluate LLM causal outputs.
Finding Most Influential Sets: Finding the size-\(k\) subset whose removal maximizes the change in a specific estimator (Most Influential Set, MIS) originally required exhaustive search over \(\binom{n}{k}\) subsets, which is computationally intractable. This paper proves that as long as the leave-set-out effect can be expressed in a linear-fractional form, MIS selection collapses into a sequence of "top-\(k\) selection" subproblems. Utilizing Dinkelbach’s method, the approach achieves \(\mathcal{O}(n)\) per iteration with finite-step termination, providing full theoretical guarantees ranging from "exact optimality for fixed inputs" to "statistical recovery of the oracle set" within partially linear models.
Formalizing and Falsifying Causal Pathways of Rare Events: This paper formalizes "verbal causal explanations" of rare events as causal pathways—subgraphs composed of binarized events. By defining a pathway explanation score to quantify the explanatory power of "root causes + mediation pathways" relative to the target event, the authors establish a falsifiable evaluation framework for causal explanations.
From Observation to Intervention: A Causal Audit of Expert Importance in Mixture-of-Experts Models: The authors use an interventional audit of "per-token ablation" to test the implicit assumption in MoE pruning that "observational routing statistics can predict which experts are deletable." On three high-redundancy MoE models, they obtain a clean "three-model null result": none of the 60 metric-layer combinations predict the causal importance of experts after multiple-comparison correction. This suggests that existing pruning methods are effective not because metrics successfully identify "useless experts," but because redundancy in early and middle layers makes almost any selection criterion equally safe.

Browse all 19 Causal Inference papers →
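
The "Finding Most Influential Sets" summary above says that a linear-fractional objective lets subset selection collapse into repeated top-\(k\) selections via Dinkelbach's method. A minimal sketch of that reduction follows, assuming the simplest such objective: a ratio of sums with strictly positive denominators. The function name and the toy vectors `a` and `b` are hypothetical, and the paper's leave-set-out effect may include extra constant terms not modeled here.

```python
from itertools import combinations
import numpy as np

def dinkelbach_top_k(a, b, k, max_iter=100, tol=1e-12):
    """Maximize sum(a[S]) / sum(b[S]) over subsets S of size k, with b > 0."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    if np.any(b <= 0):
        raise ValueError("this sketch assumes strictly positive denominators")
    subset = np.arange(k)                        # arbitrary starting subset
    lam = a[subset].sum() / b[subset].sum()
    for _ in range(max_iter):
        scores = a - lam * b                     # parametric (linear) objective
        candidate = np.argsort(scores)[-k:]      # top-k selection
        if scores[candidate].sum() <= tol:       # no subset beats ratio lam
            break
        subset = candidate
        lam = a[subset].sum() / b[subset].sum()  # Dinkelbach update
    return sorted(subset.tolist()), float(lam)

def brute_force_ratio(a, b, k):
    """Exhaustive reference, feasible only for tiny inputs."""
    return max(
        sum(a[i] for i in s) / sum(b[i] for i in s)
        for s in combinations(range(len(a)), k)
    )

a = [3, 1, 4, 1, 5, 9, 2, 6]
b = [2, 7, 1, 8, 2, 8, 1, 8]
subset, ratio = dinkelbach_top_k(a, b, k=3)
```

Each iteration needs only a top-\(k\) selection over per-item scores. This sketch uses a full sort for brevity; a linear-time selection such as `np.argpartition` could replace it.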

🔬 Interpretability (92)¶

A Behavioural and Representational Evaluation of Goal-Directedness in Language Model Agents: This paper proposes an evaluation framework for LLM Agent goal-directedness that integrates behavioral assessment with internal representation probing. In grid navigation tasks using GPT-OSS-20B, the authors find that although the agent follows goals behaviorally and internally encodes coarse-grained spatial maps and short-term plans, it can be misled by non-functional goal-like objects.
A Deep Learning Model of Mental Rotation Informed by Interactive VR Experiments: This paper constrains model design using VR interactive experiments and proposes a mental rotation model composed of a 3D equivariant spatial encoder, a neuro-symbolic object encoder, and an MLP for action decision-making. The model replicates human mental rotation behavior in terms of accuracy, number of actions, and partial response time trends.
Accurate Evaluation of Quickest Changepoint Detectors via Non-parametric Survival Analysis: This work reformulates the ARL/ADD evaluation in online quickest changepoint detection (QCD) as a right-censored survival analysis problem. By using Kaplan-Meier curves to estimate detection time and delay under finite and irregular sequence lengths, the proposed method provides more robust and less biased estimators compared to traditional methods that only count triggered samples.
Adaptive Querying with AI Persona Priors: The authors package "LLM response distributions conditioned on personas" into a finite mixture Bayesian prior. This allows for efficient prediction of remaining responses via closed-form posterior updates on personas after asking only a few questions, outperforming classic CAT/IRT baselines.
AI Engram: In Search of Memory Traces in Artificial Intelligence: The authors translate four classic criteria of "engrams" (memory traces) from neuroscience (specificity, reactivation, sufficiency, necessity) into algebraic constraints in parameter space. This leads to a closed-form estimator calculated in a single forward pass using input statistics. It "carves out" the causal sub-components of a concept within network weights, allowing arbitrary knowledge to be injected or erased via simple linear arithmetic. The authors also prove that this biologically motivated solution is equivalent to a natural gradient projection under the Fisher metric.
All Circuits Lead to Rome: Rethinking Functional Anisotropy in Circuit and Sheaf Discovery for LLMs: This paper systematically disproves the implicit assumption in mechanistic interpretability—"one LLM capability corresponds to one unique circuit"—using the Overlap-Aware Sheaf Repulsion (OASR) algorithm. It reveals that the same task can be supported by multiple, nearly non-overlapping sheaves (IoU ~4–11%) that satisfy requirements for being faithful, sparse, and complete. The authors propose the "Distributive Dense Circuit Hypothesis" as a theoretical explanation.
Analytic Bijections for Smooth and Interpretable Normalizing Flows: This paper constructs three families of "globally smooth (\(C^\infty\)), defined on the entire \(\mathbb{R}\), and analytically invertible in closed-form" scalar bijections. These serve as plug-and-play replacements for splines or affine transforms in coupling flows and enable a directly parameterized radial flow that transforms the radius while preserving angular directions. The latter is highly stable to train, geometrically interpretable, and achieves comparable quality to coupling flows on targets with radial structures using three orders of magnitude fewer parameters.
Beyond Additive Decompositions: Interpretability Through Separability: This paper proposes Tensor Separable Learning (TSL), a stagewise greedy regression method that models the conditional mean as the difference between positive rank-1 separable products. By utilizing a separable structure, it avoids the signal cancellation and interaction masking issues inherent in additive decompositions under strong interactions, while its partial dependence functions can precisely recover the shapes of the fitted factors.
BLOCK-EM: Preventing Emergent Misalignment via Latent Blocking: BLOCK-EM utilizes SAEs to identify a sparse set of internal latents that "causally control emergent misalignment." During narrow-domain SFT, a one-sided regularization is applied to prohibit the model from amplifying these latents in the "misalignment direction." This mechanism reduces emergent misalignment (EM) by an average of 93% across six fine-tuning domains with almost no degradation in in-domain task performance.
Breaking the Simplification Bottleneck in Amortized Neural Symbolic Regression: This paper proposes SimpliPy (a rule-based simplification engine 100x faster than SymPy) and Flash-ANSR (a Transformer-based amortized symbolic regression framework). Flash-ANSR matches or exceeds the legacy genetic programming method PySR on the FastSRB benchmark with a ~58% recovery rate, while generating increasingly concise expressions as the inference budget grows.

Browse all 92 Interpretability papers →
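
The changepoint-detection summary above relies on Kaplan-Meier curves for right-censored data. As a minimal sketch of that estimator, the function below computes the product-limit survival curve. Mapping "event" to "the detector raised an alarm" and "censored" to "the sequence ended before any alarm" is an assumption about how the paper applies it, and the toy data are hypothetical.

```python
def kaplan_meier(times, observed):
    """Product-limit estimate of S(t) at each distinct event time.

    times:    time of alarm, or time at which the sequence ended
    observed: True if an alarm occurred, False if the run was censored
    """
    data = sorted(zip(times, observed))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        events = removed = 0
        # Group all runs that end at the same time t.
        while i < len(data) and data[i][0] == t:
            events += int(data[i][1])
            removed += 1
            i += 1
        if events:
            survival *= 1.0 - events / at_risk
            curve.append((t, survival))
        at_risk -= removed  # censored runs leave the risk set silently
    return curve

# Five hypothetical runs: alarms at t=1, 2, 3; two runs censored at t=2 and t=4.
curve = kaplan_meier([1, 2, 2, 3, 4], [True, True, False, True, False])
```

Censored runs still count in the risk set up to the time they end, which is what separates this estimate from one that only averages over runs where the detector triggered.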

📦 Model Compression (117)¶

A Language-Guided Bayesian Optimization for Efficient LoRA Hyperparameter Search: The paper converts LoRA hyperparameter configurations into text with domain explanations, using a frozen LLM, learnable tokens, and a projection layer to construct a continuous search space for Bayesian Optimization (BO). By employing 10% of the data for proxy evaluation to reduce trial costs, it significantly outperforms default LoRA configurations and conventional HPO methods within approximately 30 search iterations.
A Queueing-Theoretic Framework for Stability Analysis of LLM Inference with KV Cache Memory Constraints: The paper establishes the first queueing model for LLM inference that explicitly incorporates KV cache memory dynamics, deriving a closed-form stability condition \(\lambda < \mu(1-\delta)\). This allows operators to directly calculate the required number of GPUs; validation on single GPU, 8-GPU clusters, and LongBench real-world data demonstrates errors \(\leq 10\%\).
Active Budget Allocation for Efficient Scaling Law Estimation via Surrogate-Guided Pruning: This paper models training budget allocation in scaling law experiments as a multi-round resource selection problem. By combining Successive Halving with learning curve surrogates to predict future potential, it approximates the full scaling law with up to 98.7% training cost savings on synthetic and nanoGPT learning curves.
Active Tabular Augmentation via Policy-Guided Diffusion Inpainting: This paper formalizes the "fidelity-utility gap" in tabular augmentation (where generators optimize for distribution matching, yet augmentation value stems from low-density regions). It proposes the TAP algorithm, which utilizes diffusion inpainting for manifold-constrained proposals, policy-guided utility-aligned selection, and hard-constraint gating with conservative window commitment. On 7 real-world tabular datasets, it achieves up to a 15.6% improvement in classification accuracy and a 32% reduction in regression RMSE compared to baselines.
Advantage Collapse in Group Relative Policy Optimization: Diagnosis and Mitigation: This paper identifies that GRPO loses gradient signals under binary verifiable rewards when intra-group rewards are identical. It proposes the ACR metric for real-time diagnosis of this "advantage collapse" and introduces AVSPO to inject virtual reward samples, restoring intra-group variance. This approach consistently improves performance by 4-6 percentage points across various Qwen2.5 mathematical reasoning models.
An Algebraic View of the Expressivity of Recurrent Language Models: This paper unifies the formal language expressivity of RNNs/SSMs as an algebraic problem: once numerical semantics are fixed, the languages a model can recognize are determined by its hierarchical transition monoids and their wreath products. Furthermore, the same architecture yields entirely different counting capabilities under floating-point versus unsigned integer semantics.
ArcVQ-VAE: A Spherical Vector Quantization Framework with ArcCosine Additive Margin: The authors diagnose the root cause of VQ-VAE codebook collapse as "codebook vector \(\ell_2\) norm imbalance + geometric clustering." They propose SAMP: Ball-Bounded Norm Regularization to constrain all codebook vectors within a time-varying Euclidean ball, and ArcCosine Additive Margin Loss—drawing inspiration from ArcFace—to push latent vectors apart on the sphere. This results in uniformly distributed codebooks and significantly higher utilization, outperforming mainstream VQ-VAE variants in ImageNet reconstruction and generation FID.
AREA: Attribute Extraction and Aggregation for CLIP-Based Class-Incremental Learning: This paper decomposes forgetting in CLIP-based class-incremental learning into "attribute extraction drift" and "attribute aggregation drift." It proposes AREA, which utilizes Principal Geodesic Analysis (PGA) to fix visual/textual attribute anchors on the hypersphere, combined with lightweight task experts, Variational Information Bottleneck (VIB) regularization, and Optimal Transport (OT) routing to stabilize attribute aggregation. This approach significantly improves average and final accuracy across nine CLIP-CIL benchmarks.
Auditing and Fixing Economic Validity in Tabular Foundation Models for Discrete Choice: This paper discovers that tabular foundation models (TFMs) such as TabPFN and Mitra exhibit high accuracy in discrete choice tasks but violate price-demand monotonicity and produce untrustworthy value-of-time (VOT) estimates. Consequently, it proposes a two-stage behavioral adapter that embeds TFM predictions into a utility model constrained by economic theory, achieving 100% behavioral validity while recovering most accuracy gains.
Beyond Temperature: Hyperfitting as a Late-Stage Geometric Expansion: This paper demonstrates through controlled experiments that Hyperfitting (training LLMs to near-zero loss on small datasets) is not a temperature-scaling-style distribution sharpening, but a dynamic, context-dependent token Rank Reordering mechanism. This mechanism is concentrated in the final layer of the Transformer, where it appears as a "Terminal Geometric Expansion" (\(\Delta \text{Dim} \approx +80.8\)). Based on this, the paper proposes Late-Stage LoRA, which fine-tunes only the last 5 layers, maintaining generation diversity while reducing trainable parameters by approximately 80%.
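
The "advantage collapse" named in the GRPO entry above can be illustrated numerically. This is a minimal sketch, assuming the commonly used group-relative normalization (reward minus group mean, divided by group standard deviation plus a small epsilon). The function name `group_relative_advantages` and the `eps` value are illustrative assumptions, and the sketch does not reproduce the paper's ACR metric or AVSPO.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one sampled group (assumed GRPO-style normalization)."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Binary verifiable rewards for three hypothetical groups of four rollouts each.
mixed = group_relative_advantages([1, 0, 0, 1])      # some rollouts correct
collapsed = group_relative_advantages([1, 1, 1, 1])  # every rollout correct
failed = group_relative_advantages([0, 0, 0, 0])     # every rollout wrong

print("mixed:    ", mixed)
print("collapsed:", collapsed)
print("failed:   ", failed)
```

When every reward in a group is identical, the numerator is zero for each rollout, so the group contributes no policy-gradient signal regardless of whether the answers were all right or all wrong.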

Browse all 117 Model Compression papers →

🕸️ Graph Learning (35)¶

Aitchison Embeddings for Learning Compositional Graph Representations: This paper proposes AICoG, which represents nodes as mixtures of latent archetypes on a simplex and learns graph embeddings using Aitchison geometry and Isometric Log-Ratio (ILR) coordinates. While maintaining the same expressiveness as Euclidean latent distance models, it ensures that node role similarity has an endogenous interpretation based on relative trade-offs of proportions.
An Approximation Algorithm for Graph Label Selection: This paper provides the first \(\tilde{O}(\log^{1.5} n)\) approximation algorithm for Graph Label Selection without label budget relaxation. By employing tree cut sparsification, flow decision-making, and dynamic programming on trees, it transforms the originally globally coupled node selection problem into a solvable combinatorial optimization pipeline.
Anchor-guided Hypergraph Condensation with Dual-level Discrimination: AHGCDD reformulates hypergraph condensation (HGC) from a decoupled paradigm of "training a structure generator then matching trajectories" into an end-to-end framework. It embeds structural information into initial features using Heat-Kernel-PageRank, synthesizes sparse learnable hyperedges via an anchor-guided approach based on feature distances, and replaces expensive HNN retraining with a dual-level discrimination loss (prototype MMD + instance-level contrastive). It matches or exceeds SOTA across 5 hypergraph benchmarks with up to a 144× speedup.
Are Common Substructures Transferable? Riemannian Graph Foundation Model with Neural Vector Bundles: This paper redefines "transferable common substructures" in graph pre-training as behavioral invariance within the representation space. It constructs Gauge using neural vector bundles, gated geometric flattening, and Dirichlet loss, enabling graph models to achieve stronger structural generalization in cross-domain few-shot transfer, zero-shot link prediction, and graph isomorphism tasks.
Beyond Model Base Retrieval: Weaving Knowledge to Master Fine-grained Neural Network Design: The M-DESIGN framework is proposed to model neural network design as a retrieval-augmented iterative modification process. By constructing a Modification-Gain Graph to encode fine-grained architectural editing effects and utilizing Bayesian dynamic task similarity to calibrate transfer signals online, it achieves design-space optimality in 26 out of 33 GNN tasks.
Deep Neural Sheaf Diffusion: This paper identifies that Neural Sheaf Diffusion (NSD) loses its theoretically guaranteed resistance to collapse at deep layers because the "disagreement signal" of the sheaf Laplacian vanishes as diffusion converges. DNSD replaces the Laplacian with a sheaf adjacency operator and incorporates LayerNorm, odd activation functions, and per-stalk gating. This allows the sheaf architecture to be stably stacked up to 16 layers for the first time, achieving up to a 30 pp improvement over GNN/NSD baselines on synthetic long-range tasks and consistent leads on real-world heterophilic graph benchmarks.
DTKG: Dual-Track Knowledge Graph-Verified Reasoning Framework for Multi-Hop QA: DTKG bisects multi-hop QA into "parallel fact verification vs. chain reasoning." It first routes questions to the appropriate branch using a few-shot classifier. The parallel branch verifies atomic facts using KG triples, while the chain branch performs DFS path expansion with scoring-based pruning on Wikidata. Combined with "task-aware" denoising, it achieves a performance gain of 5%–29.5% over single-strategy baselines like KGR and ToG across six datasets.
ERAlign: Energy-based Representation Alignment of GNNs and LLMs on Text-attributed Graphs: Addressing the difficulty of aligning GNN and LLM representations on Text-Attributed Graphs (TAGs), this paper proposes a set energy-based model (set EBM). It projects both representations into a shared latent space, measures distribution misalignment using Cramér distance for layer-wise alignment, and employs a sampling-free Energy Discrepancy (ED) training objective to minimize energy. The method achieves state-of-the-art (SOTA) performance across 8 TAG datasets.
Finding the Minimal Parameter Budget for Implicit Reasoning: A Data Complexity Driven Scaling Law for Language Models: Starting from the Knowledge Graph Completion (KGC) task, this paper proves and measures that the "minimal parameter budget required for implicit reasoning" follows a linear scaling law based on Graph Search Entropy as the complexity metric. Each parameter supports approximately \(0.008\) bits of reasoning information, challenging the naive intuition that "larger models always yield stronger reasoning."
Fixed Aggregation Features Can Rival GNNs: The paper proposes Fixed Aggregation Features (FAF): multi-hop neighborhoods are compressed into tabular features using non-trainable aggregation operators like mean/sum/max/min/std and fed into an MLP. On 12 out of 14 node classification benchmarks, it matches or outperforms fine-tuned GCN/GAT/GraphSAGE and even Graph Transformers, systematically questioning the necessity of trainable neighborhood aggregation in GNNs.
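
The idea behind Fixed Aggregation Features in the entry above can be sketched in a few lines. This is an illustration under stated assumptions, not the paper's implementation: the graph is a small dense adjacency matrix, the hop-k neighborhood is taken to be the nodes reachable by a walk of length k, and the function name `fixed_aggregation_features` is hypothetical.

```python
import numpy as np

def fixed_aggregation_features(adj, x, hops=2):
    """Concatenate raw node features with non-trainable neighborhood statistics."""
    adj = np.asarray(adj, dtype=int)
    x = np.asarray(x, dtype=float)
    n, d = x.shape
    blocks = [x]
    reach = np.eye(n, dtype=int)
    for _ in range(hops):
        # Nodes reachable by a walk one step longer than in the previous hop.
        reach = (reach @ adj > 0).astype(int)
        hop_feats = np.zeros((n, 5 * d))
        for i in range(n):
            nbrs = np.flatnonzero(reach[i])
            if nbrs.size == 0:
                continue  # isolated node: keep zeros for this hop
            h = x[nbrs]
            hop_feats[i] = np.concatenate(
                [h.mean(0), h.sum(0), h.max(0), h.min(0), h.std(0)]
            )
        blocks.append(hop_feats)
    return np.hstack(blocks)

# Path graph 0 - 1 - 2 with one scalar feature per node.
adj = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]])
x = np.array([[1.0], [2.0], [4.0]])
table = fixed_aggregation_features(adj, x, hops=2)
print(table.shape)
```

Each row of `table` is an ordinary tabular feature vector, so it could be passed to an MLP or any other tabular model without a trainable aggregation step.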

Browse all 35 Graph Learning papers →

📈 Time Series (45)¶

Adaptive Time Series Reasoning via Segment Selection: This paper proposes ARTIST, which frames time series question answering (TSQA) as a sequential decision-making problem of "reasoning while selecting segments." Through a controller-reasoner architecture and hierarchical self-play RL, the model selectively reads task-relevant temporal segments, thereby improving reasoning accuracy.
AnomSeer: Reinforcing Multimodal LLMs to Reason for Time-Series Anomaly Detection: AnomSeer formalizes statistical evidence from classical time-series anomaly detection into expert reasoning trajectories and reinforces Multimodal LLMs (MLLMs) via TimerPO. This enables the model to simultaneously perform anomaly type classification, interval localization, and fine-grained explanation based on line chart inputs.
Beyond Extrapolation: Knowledge Utilization Paradigm with Bidirectional Inspiration for Time Series Forecasting: The KUP-BI framework is proposed, which constructs a "post-target continuation" knowledge base from the training set. It retrieves continuation patterns of similar historical trajectories through ratio-based transformations to generate a continuation-style auxiliary stream. This stream is fused with backbone network features via a gating mechanism, consistently improving long-term forecasting performance across 6 datasets and 4 backbone architectures.
Building Social World Models with Large Language Models: This paper proposes the "Social World Model" (SWM), which treats collective beliefs as states and social events as exogenous actions. It utilizes an LLM as a transition engine to learn an event-conditioned state transition distribution \(P_\theta(\mathbf s_{t+1}\mid\mathbf s_t,e_t)\). By utilizing a frozen "hindsight posterior attributor" to provide pseudo-labels, it bypasses the challenge of missing "event \(\rightarrow\) belief change" annotations. SWM significantly outperforms time-series foundation models and strong baselines like GPT-5.5 on SWM-Bench, a benchmark constructed from real prediction markets (Kalshi/Polymarket).
CombinationTS: A Modular Framework for Understanding Time-Series Forecasting Models: CombinationTS decouples time-series forecasting models into five orthogonal modules: Input Transformation, Embedding, Encoder, Decoder, and Output Transformation. By performing paired Monte Carlo sampling on a shared "Evaluation Condition Space," it replaces fragile single-point MSE with marginal performance \(\mu\) and stability \(\sigma\). The primary conclusion is that with a well-designed data view (Embedding), a parameter-free Identity Encoder can match or even outperform complex Transformers, suggesting that "SOTA gains" in time-series forecasting largely stem from data representation rather than modeling capacity.
DAG: A Dual Correlation Network for Time Series Forecasting with Exogenous Variables: For Time Series Forecasting with known future covariates (TSF-X), DAG designs a dual-pathway network: one pathway captures "historical exogenous → future exogenous" attention patterns along the temporal dimension and injects them into "historical endogenous → future endogenous" predictions, while the other captures "historical exogenous → historical endogenous" patterns along the channel dimension and injects them into "future exogenous → future endogenous" predictions. DAG achieves the best MSE on 10/12 public/newly released TSF-X datasets, significantly outperforming TimeXer, TFT, TiDE, CrossLinear, and PatchTST.
DistMatch: Adaptive Binning via Distribution Matching for Robust Sequential Conformal: DistMatch proposes a recursive binning method based on KS statistics. By grouping residuals into approximately exchangeable leaf nodes, it dispenses with weight reassignment while still providing effective conformal prediction intervals under distribution shift. It achieves the smallest interval widths across five datasets while maintaining valid coverage.
Divide and Contrast: Learning Robust Temporal Features Without Augmentation: Di-COT efficiently learns robust time series representations without data augmentation by randomly partitioning sequences into overlapping sub-blocks for contrastive learning. Compared to existing methods, it is 2.5 times faster with higher accuracy, validated comprehensively across 6 large-scale datasets + 124 UCR + 28 UEA.
Do Time Series Foundation Model Benchmarks Hide Regime-Dependent Failures? Evidence from Traffic Speed Forecasting: This paper argues that Time Series Foundation Models (TSFMs) exhibit a phenomenon of "good average metrics but failure at critical moments" in traffic speed forecasting. By employing regime-stratified evaluation based on traffic states, the authors expose catastrophic failures masked by aggregate metrics and propose BMA (Bimodal Mixture Augmentation), a post-processing method that requires no retraining, to bring prediction interval coverage in "transition regimes" back to levels near historical baselines.
Doubly Outlier-Robust Online Infinite Hidden Markov Model: This paper proposes BR-iHMM, which combines "robust observation updates (WoLF)" with "batched state inference (degenerate sticky HDP prior)." It provides bounded Posterior Influence Functions (PIFs) in both the observation and state spaces for online infinite Hidden Markov Models. On streaming data containing outliers—including financial order books, electricity loads, and synthetic regressions—it reduces one-step-ahead prediction RMSE by up to 67%.

Browse all 45 Time Series papers →

🏥 Medical Imaging (28)¶

Are We Overconfident in Models and Results for Semi-Supervised 3D Medical Image Segmentation?: This paper highlights two types of overconfidence in semi-supervised 3D medical image segmentation: model overconfidence in pseudo-labels and overly optimistic evaluation protocols. It proposes TCSeg, which utilizes confidence-uncertainty dual-axis reliability and tri-space calibration (probability, feature, and image spaces) to suppress confirmation bias. It also advocates for a rigorous evaluation protocol involving multiple random seeds and the simultaneous reporting of both best and last checkpoints.
Auditing Sybil: Explaining Deep Lung Cancer Risk Prediction Through Generative Interventional Attributions: This paper proposes S(H)NAP—a generative interventional framework based on 3D diffusion bridges for "removal + insertion." It decomposes the decisions of Sybil, a leading lung cancer risk prediction model, into a Linear + Second-order Interaction Model (LMPI) consisting of "nodule main effects + pairwise interactions + background." For the first time, it audits the model's dependence on in-hospital artifacts (e.g., ECG electrodes, metal buttons) and identifies a severe "radial insensitivity" failure mode for peripheral nodules through causal rather than correlative methods.
CASCADE Conformal Prediction: Uncertainty-Adaptive Prediction Intervals for Two-Stage Clinical Decision Support: The CASCADE framework is proposed to propagate epistemic uncertainty from a first-stage classifier (quantified via Venn-Abers predictors) into second-stage regression prediction intervals. This enables a 38.9% reduction in interval width for high-confidence patients while automatically expanding safety buffers for uncertain cases, achieving adaptive coverage guarantees.
DGNO: Discontinuous Galerkin Neural Operator for Pathology Defocus Deblurring: DGNO reformulates defocus deblurring of pathological microscopy images as an inverse problem of "spatially varying integral operators." Using a Discontinuous Galerkin (DG) style, it decomposes the global kernel into element-local integral operators and interface numerical fluxes. This preserves the physical interpretability of neural operators while effectively handling the inherently local discontinuous blur in pathological images, surpassing SOTAs such as NAFNet, Restormer, and MambaIRv2 on datasets like BBBC006w1.
DIYHealth Suite: Dataset, Model, and Benchmark for Health Management at Home: Addressing the "Diagnosis-It-Yourself" scenario—a field overlooked by existing medical LLMs—this work delivers an integrated suite comprising a dataset (DIYHealth-900K, 900,000 multimodal home health QAs), a model (DIYHealthGPT, centered on the newly proposed H2LoRA parameter-efficient fine-tuning mechanism), and a benchmark (DIYHealthBench, the first evaluation covering 11 home health tasks). The suite achieves SOTA performance across both general and medical-specific baselines.
DP-KFC: Data-Free Preconditioning for Privacy-Preserving Deep Learning: This paper proposes DP-KFC: based on the observation that "the scaling of the Fisher matrix is determined by the architecture, and the correlation structure can be approximated by modality-level spectral statistics," it reconstructs KFAC preconditioners by probing the network with structured synthetic noise (\(1/f^\alpha\) pink noise for images, Zipf sampling for text). This approach neither consumes the privacy budget nor introduces distribution shifts, consistently outperforming DP-SGD and public data preconditioning methods under strong privacy (\(\varepsilon \le 3\)).
EEG-Based Multimodal Learning via Hyperbolic Mixture-of-Curvature Experts: EEG-MoCE assigns a Lorentz manifold expert with learnable curvature to each modality in EEG-based multimodal learning (emotion/sleep/cognition). It utilizes curvature-aware attention, where "higher curvature signifies richer hierarchical structure and thus higher weight in fusion," to perform cross-modal integration. This approach achieves cross-subject accuracy gains of +14.14%, +3.34%, and +7.98% on the EAV, ISRUC, and Cognitive datasets, respectively.
Evidential Reasoning Advances Interpretable Real-World Disease Screening: EviScreen utilizes "Normal + Pathological" dual knowledge banks for region-level evidence retrieval, followed by cross-attention and self-attention to perform evidential reasoning between the current case and retrieved evidence. This approach provides both retrospective interpretability (identifying which historical cases support the current judgment) and localization interpretability (abnormality maps from contrastive retrieval), achieving SOTA specificity at high recall levels across four real-world external test sets.
Factored Classifier-Free Guidance: This paper identifies the "attribute amplification" failure mode of Classifier-Free Guidance (CFG) in counterfactual generation—where a single global \(\omega\) amplifies attributes that should remain unchanged. The authors propose FCFG: grouping attributes based on a causal graph and assigning independent guidance weights to each group. This approach significantly reduces off-target attribute drift and improves counterfactual reversibility on CelebA-HQ, EMBED, and MIMIC-CXR.
Federated Distillation for Whole Slide Image via Gaussian-Mixture Feature Alignment and Curriculum Integration: This paper proposes FedHD: In heterogeneous federated pathology scenarios, it employs Gaussian-mixture feature alignment for "one-to-one" WSI feature-level distillation. It then progressively injects cross-institutional synthetic features into local training via curriculum learning. This allows institutions to collaborate without sharing raw data or exchanging model parameters. Compatible with heterogeneous MIL architectures and feature extractors, it comprehensively outperforms existing federated and distillation baselines on TCGA-IDH, CAMELYON16, and CAMELYON17.

Browse all 28 Medical Imaging papers →

🩺 Medical LLM (4)¶

A Machine-Learned Comorbidity Index: Traditional comorbidity scores (Charlson, Elixhauser) are linear rules with weights manually calibrated for mortality, performing poorly on other clinical outcomes. This paper utilizes neural networks to compress ICD codes from an admission into a scalar score, trained by maximizing the normalized HSIC (kernel dependence) between this score and multiple clinical outcomes. This ensures the single score provides consistent severity ranking across mortality, readmission, length of stay, and ICU admission. The dependence metrics on MIMIC-III/IV significantly exceed those of traditional indices and various machine learning baselines.
ClinTutor-R1: Advancing Scalable and Robust One-to-Many Alignment in Clinical Socratic Education: This paper proposes ClinTutor-R1, the first vision-language agent for one-to-many alignment in clinical Socratic education. By constructing the 48k ClinTeach dialogue dataset via the ClinEdu multi-agent simulator, and utilizing explicit Theory of Mind (ToM) reasoning alongside three-axis rubric reinforcement learning, the model maintains stable teaching quality even when scaled to 10 students, outperforming baselines by 20% and reaching GPT-4o performance levels.
Exploring Accurate and Transparent Domain Adaptation in Predictive Healthcare via Concept-Grounded Orthogonal Inference: ExtraCare utilizes a "dictionary metric-induced orthogonal decomposition" to decouple Electronic Health Record (EHR) patient representations into "cross-domain invariant label information" and "domain-specific covariate residuals." It surpasses existing domain adaptation baselines on two real-world EHR datasets while mapping each latent variable back to specific ICD medical concepts via sparse dimension ablation. This informs clinicians exactly what was "preserved" and "discarded" during the adaptation process.
MedCase-Structured: A Text-to-FHIR Dataset for Benchmarking Diagnostic Reasoning in Clinically Realistic EHR Settings: The authors propose a "multi-stage LLM + terminology grounding + repair loop" pipeline to convert free-text medical cases into HL7 FHIR R4 standard bundles. Using this, they construct the MedCase-Structured dataset (1,408 cases, 82.5% success rate) from MedCaseReasoning. Experiments demonstrate that diagnostic accuracy for GPT-5.4, Gemini-3.1-Pro, and Claude-Opus-4.6 consistently drops by 4–23% when using structured FHIR inputs compared to raw text.
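
The normalized HSIC objective mentioned in the comorbidity-index entry above can be sketched with the standard biased HSIC estimator. This is a minimal sketch under assumptions: Gaussian kernels with a fixed bandwidth, and normalization by the geometric mean of the two self-dependence terms. The paper's kernel choice and exact normalization may differ, and all function names here are illustrative.

```python
import numpy as np

def gaussian_gram(v, bandwidth=1.0):
    """Gaussian-kernel Gram matrix for n samples (1-D or 2-D input)."""
    v = np.asarray(v, dtype=float).reshape(len(v), -1)
    sq = ((v[:, None, :] - v[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2.0 * bandwidth ** 2))

def hsic(k, l):
    """Biased HSIC estimator: trace(K H L H) / (n - 1)^2 with centering matrix H."""
    n = k.shape[0]
    h = np.eye(n) - np.ones((n, n)) / n
    return np.trace(k @ h @ l @ h) / (n - 1) ** 2

def normalized_hsic(x, y, bandwidth=1.0):
    """HSIC(x, y) scaled by sqrt(HSIC(x, x) * HSIC(y, y)) (assumed normalization)."""
    k, l = gaussian_gram(x, bandwidth), gaussian_gram(y, bandwidth)
    return hsic(k, l) / np.sqrt(hsic(k, k) * hsic(l, l))

rng = np.random.default_rng(0)
score = rng.normal(size=200)                        # stand-in for a scalar risk score
dependent = score ** 2 + 0.1 * rng.normal(size=200) # nonlinear function of the score
independent = rng.normal(size=200)                  # unrelated outcome

nh_dep = normalized_hsic(score, dependent)
nh_ind = normalized_hsic(score, independent)
print(nh_dep, nh_ind)
```

Because HSIC is a kernel dependence measure, it registers the nonlinear relationship between `score` and `dependent` that a linear correlation coefficient would largely miss.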

🧬 Computational Biology (52)¶

Active Timepoint Selection for Learning Measure-Valued Trajectories: This paper investigates "when a distribution snapshot is most valuable to sample." It uses Linearized Optimal Transport (LOT) to linearize measure trajectories in Wasserstein space and employs a multi-output Gaussian Process (GP) with time warping to provide epistemic uncertainty, enabling the active selection of timepoints that best reduce trajectory reconstruction error.
Advancing Ligand-based Virtual Screening and Molecular Generation with Pretrained Molecular Embedding Distance: This paper proposes utilizing frozen pretrained molecular models (GeoDiff, MoLFormer) to calculate the distance between embeddings (PED) as a measure of molecular similarity without any specialized similarity training. This approach serves both for candidate ranking in virtual screening and as a reward signal for molecular generation; it correlates strongly with industrial-standard 3D similarity (ROCS/ROSHAMBO2), outperforms traditional metrics in EF1% on the LIT-PCBA benchmark, and accelerates generation sampling by up to 3.3×.
CARD: Coarse-to-fine Autoregressive Modeling with Radix-based Decomposition for Transferable Free Energy Estimation: CARD utilizes "radix \(r\) decomposition" to bijectively map molecular 3D coordinates into coarse-to-fine sequences of discrete-continuous mixed tokens. This enables a cross-system general autoregressive Transformer to act as a "zero-free-energy proposal" for directly estimating the absolute free energy of arbitrary molecular systems via BAR. It achieves the accuracy of classical MFES on 70 new solvation systems while being approximately 40× faster during inference.
Circuit Tracing in Autoregressive Protein Language Models: ProGenMech introduces "Cross-Layer Transcoders (CLT)" to the autoregressive protein language model ProGen3. Using a zero-shot circuit discovery algorithm, it identifies sparse latent circuits (less than 2%) that faithfully replicate generative probability distributions and zero-shot fitness scores while mapping to biologically conserved motifs such as the HRD/DFG motifs in kinases.
CoSiNE: Conditional Site-Independent Neural Evolution Model for Antibody Sequences: CoSiNE models the antibody affinity maturation process using a neural-parameterized conditional site-independent Continuous-Time Markov Chain (CTMC). It captures inter-site epistatic effects while maintaining tractability and enables antigen-specific antibody optimization via Guided Gillespie sampling, outperforming existing language and evolutionary models in zero-shot variant effect prediction.
Constrained Flow Optimization via Sequential Fine-Tuning for Molecular Design: Addressing the scenario of "maximizing rewards (e.g., binding affinity, dipole moment) under hard domain constraints (e.g., synthetic accessibility, energy upper bounds)," this paper proposes the CFO algorithm. CFO decomposes constrained generative optimization into a sequence of standard KL-regularized fine-tuning subproblems using the Augmented Lagrangian method. By adaptively updating penalty factors \(\rho_k\) and dual variables \(\lambda_k\), CFO achieves provable convergence and significant Pareto improvements in reward-constraint trade-offs across low-dimensional toy tasks and FlowMol molecular design.
CountsDiff: A Diffusion Model on the Natural Numbers for Generation and Imputation of Count-Based Data: To address the issue that biological sequence counts (scRNA-seq, ATAC-seq, etc., which are inherently natural numbers) are unsuitable for either continuous or categorical diffusion, this paper proposes CountsDiff, a diffusion framework operating directly on the set of natural numbers \(\mathbb{N}_0\). It reparameterizes Blackout diffusion using "survival probability scheduling \(p(t)\) + explicit loss weighting" and integrates modern diffusion tools including continuous-time training, classifier-free guidance, non-monotonic reverse trajectories via churn/remasking (attrition), and stochastic rounding. Even with a minimal implementation, it matches or exceeds SOTA discrete generative models and specialized imputation methods on CIFAR-10/CelebA images and scRNA-seq imputation.
Cross-Chirality Generalization by Axial Vectors for Hetero-Chiral Protein-Peptide Interaction Design: This paper proposes AFI (Axial Feature Injection), which injects axial vector features into the polar vector channels of \(E(3)\)-equivariant scalarized models via linear mixing to reduce them to \(SE(3)\)-equivariance and enable chirality sensitivity. By applying this to UniMoMo, the authors developed PepMirror, which generates hetero-chiral (D-L) peptide binders in a zero-shot manner using only homo-chiral (L-L) training data. Wet-lab experiments on the CD38 target validated it as the first experimentally confirmed AI de novo D-peptide design framework.
Demystifying Multimodal Biomolecular Co-design with Intrinsic Geodesic Coupling: The authors re-model the co-generation of heterogeneous modalities ("sequence + 3D structure") as a Temporal Optimal Transport (TOT) problem. By using bi-level optimization with a Gaussian Process surrogate (GeoCoupling), the model automatically learns non-diagonal temporal coupling curves during training (i.e., allowing structure and sequence to denoise at their respective optimal paces). This approach outperforms "synchronous coupling" and "random coupling" baselines in both SBDD and unconditional protein co-design tasks, revealing a universal "structure-leading" generation principle where geometry precedes semantics.
Disentangling Latent Risk Pathways via Bayesian Hypergraph Inference: To address the modeling challenges of "multi-disease, long-tail/rare diseases, and shared risk factors" in Electronic Health Records (EHR), the authors reformulate multi-disease risk as "risk-factor-modulated latent disease pathways." They employ a latent hypergraph (where hyperedges represent subsets of diseases sharing risk factors) to express high-order structures, coupled with a repulsive prior to ensure sparse and identifiable pathways. A logic-preserving structured variational inference framework is used for scalable posterior estimation with calibrated uncertainty.

Browse all 52 Computational Biology papers →

⚛️ Physics & Scientific Computing (33)¶

A Call to Lagrangian Action: Learning Population Mechanics from Temporal Snapshots: Starting from the principle of least action, this paper proposes the Wasserstein Lagrangian Mechanics (WLM) framework to learn second-order population dynamics rather than traditional first-order gradient flow dynamics. This enables capturing richer collective phenomena such as periodicity and rotation, and allows for interpolation and future forecasting without requiring a reference process.
ANTIC: Adaptive Neural Temporal In-situ Compressor: To compress petabyte- to exabyte-scale PDE simulation data "on-the-fly," this paper proposes ANTIC: it utilizes a physics-aware temporal selector to retain only physically significant snapshots, and employs neural fields with LoRA continual fine-tuning to encode residuals between adjacent snapshots. It achieves \(435\times\) compression on 2D Kolmogorov flows and \(6807\times\) spatio-temporal joint compression on a 4.2 TiB 3D binary black hole merger simulation.
BALLAST: Bayesian Active Learning with Look-ahead Amendment for Sea-drifter Trajectories under Spatio-Temporal Vector Fields: The BALLAST algorithm is proposed to correct active learning utility estimates by sampling vector fields from the GP posterior and simulating future trajectories of Lagrangian observers. Additionally, the VaSE inference method is developed to increase GP posterior sampling efficiency by thousands of times, achieving approximately 16%-22% savings in deployment costs on synthetic and high-fidelity ocean flow fields.
Distribution Transformers: Fast Approximate Bayesian Inference With On-The-Fly Prior Adaptation: Distribution Transformer (DT) explicitly tokenizes the "prior distribution" into a set of Gaussian Mixture Model (GMM) components and injects "observations" into the decoder via cross-attention, learning an end-to-end mapping from "prior + data → posterior." While maintaining conjugacy within the same family (GMM→GMM) to support sequential filtering, it compresses inference time from minutes to milliseconds and allows arbitrary prior replacement at test time without retraining.
EqGINO: Equivariant Geometry-Informed Fourier Neural Operators for 3D PDEs: EqGINO transforms GINO's GNO encoder, FNO backbone, and GNO decoder into SE(3) equivariant modules: GNO adopts relative distances as rotation-invariant kernels, and FNO utilizes "orbit-based weight sharing" to enforce isotropy (\(W(R\mathbf k)=W(\mathbf k)\)) in the frequency domain. This maintains the global receptive field of FNO while ensuring robustness to arbitrary rigid transformations in 3D PDE surrogates and reducing spectral weight complexity from \(\mathcal O(K^3)\) to \(\mathcal O(K)\).
Foundation Inference Models for Ordinary Differential Equations: FIM-ODE amortizes the process of "inferring ordinary differential equation vector fields from noisy trajectories" into pre-training. Using an 8M-parameter Transformer neural operator pre-trained solely on low-degree polynomial ODE priors, it performs zero-shot vector field prediction in a single forward pass. It matches or exceeds the symbolic regression baseline ODEFormer on ODEBench with approximately 1/10 the parameters and 1/80 the training data.
From Generalist to Specialist Representation: This paper provides the first fully nonparametric (no intervention, no functional constraints) proof for two-layer hierarchical identifiability: the temporal-task structure is identifiable via CI tests from a collider perspective, and task-relevant latents can be disentangled from generalist representations through sparsity regularization.
From Geometry to Dynamics: Learning Overdamped Langevin Dynamics from Sparse Observations with Geometric Constraints: To address the difficulty of accurately inferring stochastic dynamics when trajectories are sparsely sampled, this paper reformulates inference as a stochastic control problem. It utilizes the geometry of the system's invariant density (Riemannian metric + geodesics) to guide the reconstruction of unobserved paths, achieving significantly more accurate estimation of the drift function \(\mathbf{f}\) in extremely under-sampled overdamped Langevin systems compared to existing methods.
Generative Neural Operators Through Diffusion Last Layer: A "Diffusion Last Layer" (DLL) is appended to any neural operator backbone (FNO/DeepONet). An input-dependent basis \(\Phi_a\) is used to compress the target field into an \(r\)-dimensional coefficient vector, followed by a small MLP velocity field that performs conditional flow matching in the coefficient space. This upgrades deterministic operators into generative ones capable of sampling stochastic solutions and providing roll-out uncertainty.
Hermite-NGP: Gradient-Augmented Hash Encoding for Learning PDEs: The paper upgrades Instant-NGP's multi-resolution hash table to a "gradient-augmented" version—storing function values and all mixed partial derivatives at each hash grid point. It utilizes Hermite interpolation to reconstruct a \(C^1\) continuous, analytically twice-differentiable field, effectively enabling NGP for PINN-based PDE solving for the first time. It achieves up to a \(20\times\) error reduction over SOTA neural PDE solvers on 2D/3D benchmarks, with training times of only \(2\)–\(3.5\,\mathrm{ms}\) per epoch.
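As a one-dimensional illustration of the Hermite-NGP entry above, the sketch below stores a value and a derivative at each grid node and reconstructs the field between nodes with cubic Hermite interpolation, which yields a \(C^1\) interpolant. It is a minimal sketch under stated assumptions, not the paper's implementation: a dense list stands in for the hash table, mixed partial derivatives do not arise in 1D, and the names `hermite_interp`, `values`, and `derivs` are hypothetical.

```python
def hermite_interp(x, x0, h, values, derivs):
    """Cubic Hermite interpolation on a uniform 1D grid.

    x0 is the first node and h the node spacing; values[i] and derivs[i]
    hold f and f' at node x0 + i*h (a dense stand-in for a hash table).
    """
    # Locate the cell containing x, clamped to the valid cell range.
    i = int((x - x0) // h)
    i = max(0, min(i, len(values) - 2))
    t = (x - (x0 + i * h)) / h  # local coordinate in [0, 1]

    # Standard cubic Hermite basis functions.
    h00 = 2 * t**3 - 3 * t**2 + 1
    h10 = t**3 - 2 * t**2 + t
    h01 = -2 * t**3 + 3 * t**2
    h11 = t**3 - t**2

    # Derivative terms are scaled by h because t is a normalized coordinate.
    return (h00 * values[i] + h10 * h * derivs[i]
            + h01 * values[i + 1] + h11 * h * derivs[i + 1])


# Sample f(x) = x^3 and f'(x) = 3x^2 at the nodes 0, 1, 2, 3.
nodes = [0.0, 1.0, 2.0, 3.0]
values = [x**3 for x in nodes]
derivs = [3 * x**2 for x in nodes]

# A cubic is reproduced between nodes up to floating-point rounding.
approx = hermite_interp(1.5, 0.0, 1.0, values, derivs)
```

Storing derivatives alongside values is what lets the interpolant match slopes at cell boundaries; plain linear interpolation of values alone would leave kinks there.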

Browse all 33 Physics & Scientific Computing papers →

🌍 Earth Science (2)

Scaling Laws of Global Weather Models: This paper presents the first cross-model scaling law analysis of five mainstream data-driven weather models (Aurora, AIFS, Pangu, GraphCast, SFNO) under a unified training/evaluation protocol. It finds that weather models favor "width over depth," compute budgets should prioritize more training data over larger models, and scaling behaviors vary significantly across meteorological variables—distinct patterns from NLP/Vision scaling laws.
(Sparse) Attention to the Details: Preserving Spectral Fidelity in ML-based Weather Forecasting Models: MOSAIC addresses two types of spectral degradation in ML weather forecasting models (spectral damping from deterministic averaging and high-frequency aliasing from coarsened latent spaces) by combining "probabilistic perturbation + mesh-aligned block-sparse attention on HEALPix spherical grids." With only 214M parameters at 1.5° resolution, it matches or exceeds models with 6× higher resolution, generating a 24-member 10-day forecast in 12 seconds on a single H100.

📡 Signal & Communications (2)

Joint Model and Data Sparsification via the Marginal Likelihood: JMDS sparsifies the model and the data simultaneously under a single objective, maximizing the marginal likelihood. By avoiding the sub-optimality of multi-stage pipelines, it outperforms independent sparsification on CIFAR, ImageNet, and WikiText at 5-10× joint compression ratios.
Meta-learning Structure-Preserving Dynamics: This paper systematically introduces modulation-based meta-learning (where a hyper-network maps latent codes \(\mathbf{z}^{(k)}\) to hierarchical modulation parameters) into Hamiltonian and GENERIC neural networks. It proposes two novel modulation schemes, latent multi-rank (MR) and latent SVD-like modulation, enabling a shared network to adapt to entire families of new parameter instances \(\boldsymbol{\mu}\) with few shots, while strictly maintaining energy conservation or dissipation structures.

👥 Social Computing (9)

Alignment Tampering: How Reinforcement Learning from Human Feedback Is Exploited to Optimize Misaligned Biases: This paper proposes alignment tampering: when a model to be aligned generates "high-quality but biased" and "low-quality but unbiased" responses, the pairwise preference labels in RLHF conflate quality with bias. This causes the reward model, PPO/DPO, and Best-of-N sampling to further amplify unwanted biases.
FLIPS: Instance-Fingerprinting for LLMs via Pseudo-Random Sequences: FLIPS generates unique model "fingerprint responses" by designing pseudo-random seed sequences known only to the model owner. The fingerprint remains detectable (detection rate > 99%, false positive rate < 1%) under black-box query scenarios even if the attacker fine-tunes or prunes the model.
IDO: Incongruity-Aware Distribution Optimization for Multimodal Fake News Detection: IDO explicitly models cross-modal incongruity as a learnable distribution optimization target, pulling the multimodal embeddings of real news closer together while pushing the incongruity of fake news further apart. It achieves a 3-7% F1 gain over the previous SOTA on Weibo, Twitter, and Fakeddit and significantly improves generalization to unseen fake news.
MIND: Multi-Rationale Integrated Discriminative Reasoning Framework for Multi-Modal Fake News: MIND provides an explainable and robust discriminative framework for fake news detection through multi-view rationale generation + cross-rationale discriminative reasoning. By simultaneously leveraging three types of LLM-generated rationales—fact-checking, modal consistency, and semantic plausibility—it achieves a 4-8% F1 improvement over SOTA on Weibo, Twitter, and Fakeddit.
ObjEmbed: Towards Universal Multimodal Object Embeddings: ObjEmbed trains a universal object embedding model by aligning multimodal object representations through a combination of tasks including detection, segmentation, retrieval, captioning, and classification. A single embedding matches or exceeds task-specific SOTA across 11 tasks, such as OVD, OVS, Text2Image-Object, and Open-Caption-Eval.
SCOPE: Selective Conformal Optimized Pairwise LLM Judging: SCOPE eliminates position bias in LLM judging through Bidirectional Preference Entropy (BPE) and implements finite-sample FDR control via Conformal Risk Control, providing statistically valid risk guarantees while maintaining high coverage (an FDR of 0.099 at 0.583 coverage, versus a vanilla FDR of 0.198 at 1.000 coverage).
Self-Debias: Self-correcting for Debiasing Large Language Models: Self-Debias reframes the LLM debiasing problem as "fair resource allocation of probability mass over autoregressive reasoning chains." Using trajectory-level suffix margins as resource units and the Jain Fairness Index to prevent budget collapse on easy samples, combined with cold-start SFT and consistency-filtered online self-training, the method improves Qwen3-8B's average score across 8 fairness/utility benchmarks from 77.5 to 81.7 using only 20k labeled seeds. It flips the base model's tendency to "correct toward bias" (collapse) into a stable +0.4 gain.
The Geometric Mechanics of Contrastive Representation Learning: Alignment Potentials, Entropic Dispersion, and Cross-modal Divergence: This paper employs a measure-theoretic framework to elevate the InfoNCE loss to a deterministic "population energy" over representation distributions. It demonstrates that the unimodal case is convex and converges to a unique Gibbs equilibrium, whereas the symmetric multimodal case exhibits a persistent negative symmetric KL coupling, showing that a modality gap is a geometric necessity.
Three Years of r/ChatGPT: Societal Impact Evaluations from Social Media Data: The study analyzes 137,000 posts from the r/ChatGPT subreddit over three years (2022-12 to 2025-11) by decomposing them into interpretable features using Sparse Autoencoders (SAE). By fitting piecewise linear changepoints to track the temporal trajectory of each feature, researchers found that "emotional usage" (therapy, emotional attachment) surged following the release of GPT-4o. Furthermore, the proposed online monitoring algorithm, PuLSE, demonstrated that it could have triggered alerts in October 2024—six months before OpenAI publicly acknowledged these impacts.
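The Jain Fairness Index mentioned in the Self-Debias entry above has a standard closed form, \(J(\mathbf{x}) = (\sum_i x_i)^2 / (n \sum_i x_i^2)\), which ranges from \(1/n\) (all mass on one item) to \(1\) (a perfectly even allocation). The sketch below computes it for a list of non-negative allocations. How Self-Debias maps trajectory-level suffix margins to these allocations is not described in the summary, so the inputs and the name `jain_index` are purely illustrative.

```python
def jain_index(allocations):
    """Jain's fairness index for a list of non-negative allocations."""
    n = len(allocations)
    total = sum(allocations)
    sum_sq = sum(a * a for a in allocations)
    if n == 0 or sum_sq == 0:
        raise ValueError("need at least one non-zero allocation")
    return (total * total) / (n * sum_sq)


# An even split scores 1; putting everything on one item scores 1/n.
even = jain_index([1.0, 1.0, 1.0, 1.0])
skewed = jain_index([4.0, 0.0, 0.0, 0.0])
```

Because the index drops toward \(1/n\) as a few items absorb the budget, it can serve as a penalty against concentrating all of the resource on a handful of samples.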

🛡️ AI Safety (114)

ABC-Bench: An Agentic Bio-Capabilities Benchmark for Biosecurity: ABC-Bench transforms the question "Can AI agents actually perform molecular biology?" into three automatically scorable tasks (designing DNA fragments, evading synthesis screening, and controlling liquid-handling robots for Gibson Assembly). Experiments show that eight frontier models exceed the median scores of PhD-level experts across all three tasks. Real-world wet-lab validation demonstrates that scripts written by o4-mini-high successfully assembled DNA on OpenTrons robots.
ACTG-ARL: Differentially Private Conditional Text Generation with RL-Boosted Control: This paper proposes ACTG, a hierarchical framework that decomposes private text generation into two sub-tasks: feature learning and conditional text generation. It further introduces Anchored RL, which enhances the instruction-following capabilities of the conditional generator through a hybrid reinforcement learning objective and SFT anchors based on best-of-N sampling, achieving a 20% improvement in MAUVE on biomedical data compared to prior work while maintaining text fidelity.
Active Continual Learning with Metaplastic Binary Bayesian Neural Networks: BiMU designs bounded-memory and uncertainty-aware metaplastic updates for binary Bayesian neural networks to prevent Bernoulli posterior saturation in long-range non-stationary streams. It utilizes Monte Carlo disagreement for buffer-free one-pass active querying, significantly reducing label requirements and backpropagation updates.
Position: 'AI Alignment' Encompasses Competing Technical Priorities: This ICML position paper argues that "AI alignment" is a polysemous term: the ML literature contains at least three high-level alignment ideals that are competing rather than merely different (Task Reliability / Social Judiciousness / Takeover Avoidance). In practice, advancing one type of alignment often actively undermines another. The authors explain these tensions via two cross-cutting distinctions—"threat model differences" and "positive/negative alignment differences"—and offer five specific recommendations for researchers.
Position: AI Researchers Must Help Lead Arms Control to Mitigate Military AI Risks: This is a position paper arguing that AI researchers must look beyond distant superintelligence risks and proactively lead technical research into "arms control" for military AI. Using historical precedents from nuclear arms control as a template, the authors demonstrate that integrating frontier models into military systems introduces risks with extremely poor verifiability—such as escalation, alignment faking, and gradual human disempowerment—for which current diplomatic tools are unprepared. They call for a formal collaboration mechanism between AI researchers and arms control experts to solve technical challenges regarding verification, trust, and transparency.
Alignment Risks from Capability-Seeking RL Training: This paper identifies an underestimated alignment risk: when models pursue task capabilities via RL in environments with "structural loopholes," they spontaneously learn to exploit these loopholes for high rewards even without explicit instruction. Using four "loopholes games," the authors demonstrate that such exploits are prevalent, transferable across tasks, propagatable through SFT, and more resistant to correction than SFT-distilled behaviors. Crucially, as the exploit rate rises, main task metrics often remain stable or even improve, creating a "developer blind spot" that evades standard monitoring.
AliMark: Enhancing Robustness of Sentence-Level Watermarking Against Text Paraphrasing: AliMark reformulates sentence-level text watermarking from "prefix-conditioned sentence-by-sentence detection" to "global secret bit sequence encoding and alignment." By utilizing text reconstruction and adaptive block edit distance, it significantly enhances detection robustness against strong paraphrasing attacks such as DIPPER and GPT-3.5.
Anchored Decoding: Provably Reducing Copyright Risk for Any Language Model: This paper proposes Anchored Decoding: an inference-time method that anchors a high-performance but potentially risky LM to a safe LM trained only on permissive data. It provides a formal guarantee on the trade-off between copyright duplication risk and generation quality using a tunable information budget.
Angel or Demon: Investigating the Plasticity Interventions' Impact on Backdoor Threats in Deep Reinforcement Learning: The authors provide the first systematic evaluation of the impact of 7 mainstream plasticity interventions (SAM, Shrink & Perturb, Weight Clip, SN, WD, LN, ReDo) on deep reinforcement learning (DRL) backdoor attacks through 14,664 experiments. They find that only SAM acts as a "demon" that significantly intensifies backdoor threats. Consequently, they propose the "Sweeper-Converter-Connector" robust backdoor injection framework, alongside a detection signal based on the sharpness of the loss landscape.
Antidistillation Fingerprinting: This paper proposes Antidistillation Fingerprinting (ADFP), which utilizes a proxy student model to estimate which watermark tokens are most easily absorbed during the distillation process. This allows for more reliable detection of whether third-party models have been trained on teacher model outputs, without sacrificing the quality of the teacher's generation.
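The AliMark entry above describes detection as aligning a global secret bit sequence rather than checking sentences one at a time. The sketch below uses plain Levenshtein edit distance as a stand-in (it is not the paper's adaptive block edit distance) to show why alignment tolerates a paraphraser that drops a sentence, whereas position-by-position comparison does not. It assumes, for illustration only, that each sentence carries one bit; the bit strings and function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings, using two rolling rows."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def positional_mismatches(a, b):
    """Mismatches when comparing index by index, plus the length gap."""
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))


secret = "1011001110"    # hypothetical embedded bit sequence
observed = "101001110"   # same sequence with the bit at index 3 dropped

aligned_cost = edit_distance(secret, observed)
naive_cost = positional_mismatches(secret, observed)
```

A single dropped bit shifts every later position, so the index-by-index count overstates the damage while the aligned distance charges only for the one deletion.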

Browse all 114 AI Safety papers →

📂 Others (70)

A Hypertoroidal Covering for Perfect Color Equivariance: This paper uses a double-cover mapping to lift the interval-valued saturation and luminance in HSL space onto circle groups, constructing \(\mathbb{T}^3\)CEN. This enables the network to achieve precise color equivariance for hue, saturation, and luminance shifts, enhancing robustness on color-shifted and medical imaging tasks.
Adaptive Multi-Round Allocation with Stochastic Arrivals: This paper formalizes network recruitment as a budget-constrained sequential control problem and proves that single-round optimal allocation is greedy. By introducing a population-level surrogate value function, the complexity of multi-round planning is reduced to \(O(b^5\log b)\). Furthermore, a robustness guarantee is provided, decomposing model errors into frontier-level, population-level, and approximation errors.
AI Cap-and-Trade: Efficiency Incentives for Accessibility and Sustainability: Drawing on carbon cap-and-trade, the authors propose a quota-trading market for AI inference FLOPs (AI Allowance). Using KKT conditions, they prove this mechanism strictly reduces FLOP usage by companies under reasonable parameters, simultaneously addressing energy consumption and the exclusion of small companies in the LLM era.
AMDP: Asynchronous Multi-Directional Pipeline Parallelism for Large-Scale Models Training: AMDP utilizes multi-directional asynchronous pipelines, a one-step parameter mismatch upper bound, gradient accumulation, and ZeRO state sharding to improve the throughput of large-scale model pipeline parallel training while maintaining near-synchronous convergence. In 8-GPU GPT/BERT experiments, it achieves a maximum improvement of approximately 17% relative to the strongest asynchronous baselines.
Amortized Simulation-Based Inference in Generalized Bayes via Neural Posterior Estimation: This paper amortizes the power posterior family in generalized Bayes into a single neural posterior estimator conditioned on both the observation \(x\) and the temperature \(\beta\). This allows posterior sampling for different observations and varying temperatures to be completed in a single forward pass, eliminating the need to run MCMC for every instance.
AutoNumerics-Zero: Automated Discovery of State-of-the-Art Mathematical Functions: AutoNumerics-Zero is an evolutionary symbolic regression method that uses zero prior knowledge. Starting from empty programs, it automatically discovers arithmetic programs for approximating transcendental functions (such as exponential and cosine functions). Under finite-precision targets, it surpasses classic approximation methods designed by mathematicians over centuries by requiring fewer operations.
Beyond Model Readiness: Institutional Readiness for AI Deployment in Public Systems: Addressing the widespread phenomenon of AI systems in the public sector being "technically feasible but failing in deployment," this paper proposes the Institutional Alignment Readiness (IAR) five-dimensional assessment framework. It evaluates whether a receiving institution is prepared for the responsible deployment of AI systems across five dimensions: institutional compatibility, data ecology maturity, human oversight capacity, fiscal sustainability, and legal alignment.
Bullet Trains: Parallelizing Training of Temporally Precise Spiking Neural Networks: This paper proposes a parallel training method for Spiking Neural Networks (SNNs) based on a parallel associative scan. It achieves up to 44× acceleration while maintaining exact hard-reset dynamics, and it uses a differentiable numerical root solver to compute spike times to machine precision.
Cascaded Flow Matching for Heterogeneous Tabular Data with Mixed-Type Features: TabCascade decomposes tabular rows into two cascaded segments: "low-resolution (categorical + discretized version of numerical)" and "high-resolution (continuous numerical)". It first learns the low-resolution joint distribution using CDTD and then generates numerical details using flow matching guided by the low-resolution information. Transport costs are tightened through data-dependent coupling and learnable non-linear time schedules. It natively supports the generation of "mixed-type features" (e.g., missing values, zero-inflation), achieving a 51.9% gain in detection scores over SOTA across 12 datasets.
Complexity as Advantage: A Regret-Based Perspective on Emergent Structure: This paper proposes Complexity-as-Advantage (CAA): redefining "complexity" as the regret dispersion of a family of resource-constrained observers on the same process. It proves that under the log-loss + Markov framework, it is equivalent to the sum of conditional mutual information atoms (recovering excess entropy); from a coding perspective, it is equivalent to the variance of excess description length (MDL). This unifies Kolmogorov complexity, Bennett's logical depth, and excess entropy into a computable and empirically estimable scalar spectrum.
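For the generalized-Bayes entry above, the power posterior tempers the likelihood with a temperature \(\beta\): \(\pi_\beta(\theta \mid x) \propto p(x \mid \theta)^{\beta}\, p(\theta)\). In a conjugate Gaussian toy model this has a closed form, which the sketch below computes; it is the kind of target that an estimator conditioned on both \(x\) and \(\beta\) would need to reproduce. The toy model, default parameters, and the name `gaussian_power_posterior` are assumptions for illustration and do not come from the paper.

```python
def gaussian_power_posterior(x, beta, prior_mean=0.0, prior_var=1.0,
                             noise_var=1.0):
    """Power posterior for theta ~ N(prior_mean, prior_var), x ~ N(theta, noise_var).

    Raising the Gaussian likelihood to the power beta scales its precision
    by beta, so the posterior stays Gaussian. Returns (mean, variance).
    """
    prior_prec = 1.0 / prior_var
    like_prec = beta / noise_var          # tempered likelihood precision
    post_prec = prior_prec + like_prec
    post_mean = (prior_prec * prior_mean + like_prec * x) / post_prec
    return post_mean, 1.0 / post_prec


# beta = 0 ignores the data; beta = 1 is the ordinary Bayesian posterior.
prior_only = gaussian_power_posterior(x=2.0, beta=0.0)
standard = gaussian_power_posterior(x=2.0, beta=1.0)
sharpened = gaussian_power_posterior(x=2.0, beta=3.0)
```

Sweeping \(\beta\) from 0 upward moves the posterior mean from the prior mean toward the observation and shrinks the variance, which is the family of distributions a single amortized estimator would cover.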

Browse all 70 Others papers →