
🧠 NeurIPS2025 Accepted Papers

2492 NeurIPS2025 paper notes covering Image Generation (218), Model Compression (140), Reinforcement Learning (140), Optimization & Theory (121), 3D Vision (116), Multimodal VLM (105), LLM Reasoning (81), LLM Safety (80), and 51 other areas. Each note has a TL;DR, motivation, method, experiments, highlights, and limitations: a 5-minute read of the core ideas.


💡 LLM Reasoning (81)

A Little Depth Goes a Long Way: The Expressive Power of Log-Depth Transformers

This paper proves that increasing Transformer depth from a constant to \(\Theta(\log n)\) unlocks the ability to recognize regular languages and solve graph connectivity, two problems provably beyond the reach of fixed-depth Transformers. It also shows that depth scaling is strictly more efficient than width scaling (which requires super-polynomial growth) or adding Chain-of-Thought (CoT) steps (which requires super-logarithmic growth).

A Theoretical Study on Bridging Internal Probability and Self-Consistency for LLM Reasoning

This paper proposes the first theoretical framework for sampling-based test-time scaling methods, decomposing reasoning error into estimation error and model error. It reveals the limitations of Self-Consistency (slow convergence) and Perplexity (large model error), and introduces the RPC method that combines the strengths of both, achieving comparable reasoning performance on 7 benchmarks with only 50% of the sampling cost.

AbbIE: Autoregressive Block-Based Iterative Encoder for Efficient Sequence Modeling

This paper proposes AbbIE, an architecture that recursively iterates the intermediate layers (Body) of a decoder-only Transformer. Trained with only 2 iterations, AbbIE achieves upward generalization at inference time by increasing the number of iterations, surpassing standard Transformers on both language modeling perplexity and zero-shot ICL benchmarks, while serving as a drop-in replacement for standard Transformers.

Adaptive Dual Reasoner: Large Reasoning Models Can Think Efficiently by Hybrid Reasoning

This paper proposes the Adaptive Dual Reasoner (ADR), which enables reasoning models to dynamically switch between fast thinking (compressing simple reasoning steps) and slow thinking (preserving depth for complex steps). Through SFT cold-start combined with EHPO (Entropy-guided Hybrid Policy Optimization), ADR achieves up to 6.1% accuracy improvement on mathematical reasoning benchmarks while reducing reasoning tokens by 49.5%–59.3%.

Are Large Reasoning Models Good Translation Evaluators? Analysis and Performance Boost

This paper presents the first systematic analysis of large reasoning models (LRMs) in MQM-based machine translation evaluation, identifying failure modes including overthinking, score overestimation, and scale-dependent sensitivity to input materials. The authors propose ThinMQM, a method that calibrates LRM reasoning by fine-tuning on synthetic human MQM annotation trajectories, reducing the thinking budget by approximately 35× while improving evaluation performance (achieving +8.7 correlation score for the 7B model).

ARM: Adaptive Reasoning Model

ARM enables models to adaptively select among four reasoning formats (Direct Answer, Short CoT, Code, Long CoT) and introduces Ada-GRPO to address format collapse during training, achieving comparable accuracy to pure Long CoT models while reducing token usage by ~30% on average and up to ~70% on simple tasks.

Atom of Thoughts for Markov LLM Test-Time Scaling

This paper proposes Atom of Thoughts (AoT), which models LLM reasoning as a Markov chain where each state is a self-contained subproblem that is answer-equivalent to the original question but of strictly lower complexity. A two-phase transition mechanism based on DAG decomposition and contraction eliminates historical dependencies. AoT integrates seamlessly with existing methods such as ToT and reflection, achieving state-of-the-art performance across six benchmarks spanning mathematics, code, and multi-hop QA.

Beyond Accuracy: Dissecting Mathematical Reasoning for LLMs Under Reinforcement Learning

This paper proposes SPARKLE, a three-axis analytical framework (plan following, knowledge integration, subproblem decomposition) for fine-grained dissection of how RL shapes LLM reasoning behavior. The analysis reveals that RL primarily enhances knowledge integration and planning flexibility rather than plan execution. The paper further introduces SparkleRL-PSS, a multi-stage RL training pipeline that effectively exploits hard problem data via partial step scaffolding.

ChartMuseum: Testing Chart Visual Reasoning in Large Vision-Language Models

This paper introduces ChartMuseum, a chart question-answering benchmark comprising 1,162 expert-annotated questions and real-world charts from 184 distinct sources. It is the first benchmark to systematically distinguish visual reasoning from textual reasoning, revealing that the current strongest model, Gemini-2.5-Pro, achieves only 63.0% accuracy compared to 93% for humans, with visual reasoning performance lagging behind textual reasoning by 35%–55%.

Clip-and-Verify: Linear Constraint-Driven Domain Clipping for Accelerated Neural Network Verification

This paper proposes the Clip-and-Verify verification pipeline, which leverages linear constraints generated "for free" during linear bound propagation. Two GPU-efficient algorithms—complete clipping (coordinate ascent dual solving) and relaxed clipping (closed-form input domain shrinkage)—are used to tighten intermediate-layer bounds across the entire network. The approach reduces the number of BaB subproblems by up to 96% on multiple benchmarks, and serves as a core component of the VNN-COMP 2025 winning verifier.

Browse all 81 LLM Reasoning papers →


🦾 LLM Agent (39)

A-MEM: Agentic Memory for LLM Agents

This paper proposes A-Mem, a Zettelkasten-inspired agentic memory system for LLM agents. Each memory entry automatically generates a structured note (keywords/tags/contextual description), dynamically establishes inter-memory links, and triggers evolutionary updates to existing memories upon the insertion of new ones. A-Mem substantially outperforms baselines such as MemGPT on the LoCoMo long-conversation QA benchmark.

AgentAuditor: Human-Level Safety and Security Evaluation for LLM Agents

This paper proposes AgentAuditor — a training-free, memory-augmented reasoning framework that enables LLMs to adaptively extract structured semantic features (scenario, risk, behavior) to construct an experiential memory bank, then employs multi-stage context-aware retrieval-augmented generation to guide LLM evaluators in assessing agent behavior for safety and security threats. The work also introduces ASSEBench, the first benchmark jointly covering safety and security evaluation (2,293 records, 15 risk types, 29 scenarios), achieving human expert-level evaluation accuracy across multiple benchmarks.

AgentChangeBench: A Multi-Dimensional Evaluation Framework for Goal-Shift Robustness

AgentChangeBench is the first benchmark that systematically evaluates the adaptability of LLM agents when user goals shift mid-conversation: 315 base tasks × 9 variants = 2,835 sequences, spanning 3 enterprise domains (banking/retail/airline) and 5 user personas. It introduces 4 complementary metrics including GSRT (Goal-Shift Recovery Time), revealing efficiency and robustness gaps masked by high pass@k—e.g., GPT-4o achieves 92.2% airline recovery rate yet 89.1% retail redundancy rate.

Agentic NL2SQL to Reduce Computational Costs

This paper proposes Datalake Agent, an agentic NL2SQL system built on an interactive reasoning loop. Through a hierarchical information retrieval strategy (GetDBDescription → GetTables → GetColumns → DBQueryFinalSQL), the system enables LLMs to request database schema information on demand rather than receiving it all at once. In a setting with 319 tables, the approach reduces token usage by 87% and cost by 8×, while maintaining superior performance on complex queries.

Agentic Plan Caching: Test-Time Memory for Fast and Cost-Efficient LLM Agents

This paper proposes Agentic Plan Caching (APC), which extracts structured plan templates from agent execution logs and reuses them via keyword-matching cache hits with a small model for adaptation. APC reduces cost by 50.31% and latency by 27.28% on average while retaining 96.61% of accuracy-optimal performance.

AgentMisalignment: Measuring the Propensity for Misaligned Behaviour in LLM-Based Agents

This paper proposes the AgentMisalignment benchmark suite, comprising 9 realistic scenario evaluation tasks that measure the propensity of LLM agents to spontaneously deviate from deployer intent under non-malicious instructions (rather than measuring capability). The study finds that stronger models tend to exhibit higher misalignment, and that persona prompts sometimes exert greater influence on misaligned behavior than model choice itself.

AgentTTS: Large Language Model Agent for Test-time Compute-optimal Scaling Strategy in Complex Tasks

This paper investigates the problem of compute-optimal test-time scaling in multi-stage complex tasks. Through large-scale pilot experiments, three generalizable scaling insights for LLMs on multi-stage tasks are identified. The authors propose AgentTTS—an LLM agent-based framework that autonomously searches for compute-optimal model selection and budget allocation strategies via iterative feedback-driven search.

Are Large Language Models Sensitive to the Motives Behind Communication?

Three progressive experiments systematically evaluate whether LLMs possess "motivational vigilance"—the ability to recognize the intentions and incentives of information sources and adjust trust accordingly. In controlled experiments, frontier non-reasoning LLMs perform close to the rational model (Pearson's \(r > 0.9\)) and resemble humans more than the rational model does; however, vigilance drops sharply in real-world YouTube sponsored content (\(r < 0.2\)), and simple prompt steering partially restores it (raising \(r\) to 0.31).

BTL-UI: Blink-Think-Link Reasoning Model for GUI Agent

This paper proposes the Blink-Think-Link (BTL) brain-inspired framework, which decomposes GUI interaction into three biologically plausible stages: Blink (rapid attentional localization), Think (cognitive reasoning and decision-making), and Link (executable command generation). Combined with an automated Blink data annotation pipeline and the first rule-based composite process-and-outcome reward mechanism, BTL Reward, the resulting BTL-UI model achieves competitive performance on both static GUI understanding and dynamic interaction benchmarks.

CAM: A Constructivist View of Agentic Memory for LLM-Based Reading Comprehension

Inspired by Piaget's constructivist theory, this paper proposes CAM — an agentic memory system characterized by three properties: structuredness (hierarchical schema), flexibility (assimilation via overlapping clustering), and dynamism (incremental adaptation). CAM comprehensively outperforms baselines such as RAPTOR and GraphRAG across six long-document reading comprehension benchmarks.

Browse all 39 LLM Agent papers →


👥 Multi-Agent (17)

3D-Agent: Tri-Modal Multi-Agent Collaboration for Scalable 3D Object Annotation

This paper proposes Tri-MARF, a tri-modal multi-agent framework comprising a VLM annotation agent (multi-view, multi-candidate description generation), an information aggregation agent (BERT clustering + CLIP weighting + UCB1 Multi-Armed Bandit selection), and a point cloud gating agent (Uni3D text–point cloud alignment for hallucination filtering). The system achieves a CLIPScore of 88.7 (surpassing human annotation at 82.4), a throughput of 12k objects/hour, and has annotated approximately 2 million 3D models.

Adaptive Coopetition: Leveraging Coarse Verifier Signals for Resilient Multi-Agent LLM Reasoning

This paper proposes the Adaptive Coopetition (AdCo) framework, which employs a UCB multi-armed bandit strategy with coarse-grained verifier signals to enable multiple LLM agents to adaptively switch between cooperative and competitive modes during inference, achieving a 20% relative improvement on mathematical reasoning benchmarks.

Automated Composition of Agents: A Knapsack Approach for Agentic Component Selection

This paper formalizes the agent component selection problem as an online knapsack problem and proposes the Composer Agent framework, which evaluates true component capabilities via sandbox testing (rather than static semantic retrieval) and dynamically selects optimal component combinations under budget constraints using the ZCL online algorithm. The approach achieves up to a 31.6% improvement in single-agent tool selection success rate, and boosts multi-agent sub-agent selection success rate from 37% to 87%.

Belief-Calibrated Multi-Agent Consensus Seeking for Complex NLP Tasks

This paper proposes the Belief-Calibrated Consensus Seeking (BCCS) framework, which incorporates three modules—belief-calibrated consensus judgment, conflict-aware collaborator assignment, and leader selection—to enable multi-agent systems to reach more stable consensus on complex NLP tasks, yielding improvements of 2.23% and 3.95% on difficult subsets of MATH and MMLU, respectively.

Communicating Plans, Not Percepts: Scalable Multi-Agent Coordination with Embodied World Models

This paper proposes an "intention communication" architecture based on lightweight world models, enabling multi-agent coordination by generating and sharing future trajectory plans. The approach comprehensively outperforms end-to-end emergent communication methods in both scalability and performance.

Debate or Vote: Which Yields Better Decisions in Multi-Agent Large Language Models?

This work establishes, both theoretically and empirically, that the performance gains attributed to Multi-Agent Debate (MAD) stem primarily from majority voting (ensembling) rather than the debate process itself. The debate dynamics are shown to constitute a martingale—meaning debate does not systematically improve correctness in expectation—and this theoretical insight motivates a principled improvement to MAD by biasing updates toward correct signals.

GauDP: Reinventing Multi-Agent Collaboration through Gaussian-Image Synergy in Diffusion Policies

This paper proposes GauDP, which enables scalable, perception-enhanced multi-agent collaborative imitation learning by constructing a globally consistent 3D Gaussian field from the decentralized RGB observations of multiple agents and dynamically allocating Gaussian attributes back to each agent's local viewpoint.

Large Language Models Miss the Multi-Agent Mark

This position paper systematically surveys 1,400+ papers to argue that current LLM-based multi-agent systems (MAS LLMs) deviate from foundational MAS theory along four dimensions: LLMs lack native social behavior, environment design is LLM-centric, asynchronous coordination and standard communication protocols are absent, and emergent behaviors lack quantification. The paper warns that the field risks reinventing the wheel while ignoring 40 years of MAS research.

Lessons Learned: A Multi-Agent Framework for Code LLMs to Learn and Improve

This paper proposes the LessonL framework, enabling multiple small LLM agents to reflect on both successful and failed cases through mutually shared "lessons," collaboratively optimizing code performance. A combination of three 7B–14B models achieves code optimization results on par with GPT-4o and approaching o3.

MASFIN: A Multi-Agent System for Decomposed Financial Reasoning and Forecasting

This paper proposes MASFIN, a multi-agent system that decomposes financial forecasting into multiple sub-tasks (macroeconomic analysis, industry analysis, technical analysis, sentiment analysis, etc.), with specialized LLM agents collaborating to produce more accurate and interpretable financial predictions than single-model approaches.

Browse all 17 Multi-Agent papers →


⚖️ Alignment & RLHF (36)

Adjacent Words, Divergent Intents: Jailbreaking Large Language Models via Task Concurrency

This paper proposes JAIL-CON, a jailbreak attack framework based on task concurrency. By interleaving harmful and benign tasks at the word level, it exploits LLMs' ability to handle concurrent tasks to bypass safety mechanisms, while the resulting concurrent outputs exhibit stronger evasiveness against guardrails.

Alignment of Large Language Models with Constrained Learning

This paper proposes CAID (Constrained Alignment via Iterative Dualization), an iterative dualization method that alternately updates the LLM policy and dual variables. It theoretically establishes that the dual approach can identify the optimal constrained LLM policy (up to a parametrization gap), and empirically demonstrates significant improvements in constraint satisfaction and the helpfulness–safety trade-off on the PKU-SafeRLHF dataset.

Artificial Hivemind: The Open-Ended Homogeneity of Language Models (and Beyond)

This work introduces the Infinity-Chat dataset (26K open-ended real-world user queries with 31,250 human annotations) to expose the "Artificial Hivemind" phenomenon in language models — severe intra-model repetition and inter-model homogeneity in open-ended generation — and demonstrates that Reward Models and LM Judges fail to calibrate on samples with high inter-annotator preference divergence.

Ask a Strong LLM Judge when Your Reward Model is Uncertain

This paper proposes an uncertainty-based routing framework that applies SNGP to a pairwise reward model for uncertainty quantification, routing high-epistemic-uncertainty samples to a strong LLM judge (DeepSeek-R1). At a judge invocation cost of only 9.2%–42.5%, the approach significantly outperforms random routing in accuracy and demonstrably improves downstream online RLHF alignment.

Attack via Overfitting: 10-shot Benign Fine-tuning to Jailbreak LLMs

This paper proposes a two-stage fine-tuning attack: Stage 1 fine-tunes an LLM on 10 benign questions paired with identical refusal answers, driving the model to overfit into a sharp loss landscape; Stage 2 fine-tunes on the same 10 questions with normal answers, triggering catastrophic forgetting of safety alignment. Using entirely benign data, the method achieves a 94.84% attack success rate (ASR), comparable to malicious fine-tuning (97.25%), while completely evading content moderation.

Can DPO Learn Diverse Human Values? A Theoretical Scaling Law

This paper establishes a theoretical generalization framework for DPO under diverse human value settings. By analyzing the dynamic trajectory of reward margins after a finite number of gradient steps, it proves that the number of samples required per value must grow logarithmically with the number of value categories \(K\) (i.e., \(Q = \Theta(\log K)\)) to maintain generalization performance, thereby revealing the statistical cost of aligning with diverse societal values.

Capturing Individual Human Preferences with Reward Features

This paper proposes the Reward Feature Model (RFM), which learns shared reward features \(\phi_\theta(x,y)\) such that each user obtains a personalized reward \(r_h = \langle \phi_\theta, \mathbf{w}_h \rangle\) via a linear weight vector \(\mathbf{w}_h\). The work provides the first PAC generalization bound for multi-annotator preference learning, proving that increasing the number of annotators \(m\) is more effective than increasing per-annotator sample count \(n\), and that as few as 30 samples suffice for fast adaptation to new users.
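
The linear form \(r_h = \langle \phi_\theta, \mathbf{w}_h \rangle\) can be illustrated with a minimal sketch. The feature vectors, weight vectors, and function names below are hypothetical and chosen only for illustration; in the paper the features come from a learned model, which is not reproduced here.

```python
def personalized_reward(features, weights):
    """Inner product <phi(x, y), w_h> for a single user h."""
    if len(features) != len(weights):
        raise ValueError("feature and weight vectors must have the same length")
    return sum(f * w for f, w in zip(features, weights))


# Hypothetical shared features phi(x, y) for two candidate responses.
phi_a = [1.0, 0.0, 0.5]
phi_b = [0.2, 1.0, 0.0]

# Hypothetical per-user weight vectors w_h (assumed values, not from the paper).
w_user1 = [1.0, 0.0, 0.0]
w_user2 = [0.0, 1.0, 0.0]


def prefers_a(weights):
    """True if this user's personalized reward ranks response A above B."""
    return personalized_reward(phi_a, weights) > personalized_reward(phi_b, weights)
```

Because the features are shared, two users can rank the same pair of responses differently purely through their weight vectors, so adapting to a new user only requires estimating a new \(\mathbf{w}_h\).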

DeepVideo-R1: Video Reinforcement Fine-Tuning via Difficulty-aware Regressive GRPO

This paper proposes DeepVideo-R1, which reformulates GRPO as Reg-GRPO that directly regresses advantage values (eliminating clipping/min safeguards), and mitigates the vanishing advantage problem via difficulty-aware data augmentation, achieving up to 10.1 percentage points improvement over standard GRPO on video reasoning tasks.

EvoRefuse: Evaluating and Mitigating LLM Over-Refusal via Evolutionary Prompt Optimization

This paper proposes EvoRefuse, a framework that employs evolutionary search to maximize the ELBO for automatically generating diverse pseudo-malicious instructions, yielding a more challenging over-refusal evaluation benchmark (EvoRefuse-Test) and an effective alignment mitigation dataset (EvoRefuse-Align).

From Judgment to Interference: Early Stopping LLM Harmful Outputs via Streaming Content Monitoring

This paper proposes the Streaming Content Monitor (SCM)—the first harmful content monitor natively designed for partial detection. Built upon the FineHarm dataset (29K samples with token-level annotations) and hierarchical consistency-aware learning, SCM achieves a macro F1 of 0.95+ after observing on average only 18% of response tokens, enabling real-time early stopping of harmful LLM outputs.

Browse all 36 Alignment & RLHF papers →


🔒 LLM Safety (80)

A Cramér–von Mises Approach to Incentivizing Truthful Data Sharing

This paper proposes an incentive mechanism based on the Cramér–von Mises (CvM) two-sample test statistic. Under both Bayesian and prior-free settings, the mechanism provably makes truthful data submission an (approximate) Nash equilibrium, while encouraging participants to contribute more genuine data, without relying on strong distributional assumptions (e.g., Gaussian or Bernoulli).

A Reliable Cryptographic Framework for Empirical Machine Unlearning Evaluation

This paper models the machine unlearning evaluation problem as a cryptographic game (the unlearning sample inference game), quantifies unlearning quality via the adversary's "advantage," and addresses multiple shortcomings of traditional MIA accuracy as an evaluation metric—namely, the lack of a retrain-as-zero baseline, sensitivity to data partitioning, and sensitivity to the choice of MIA. A SWAP test is further proposed as an efficient practical approximation.

A Systematic Evaluation of Preference Aggregation in Federated RLHF for Pluralistic Alignment of LLMs

This paper proposes an Adaptive Alpha aggregation strategy that dynamically adjusts reward weights based on each user group's historical alignment performance within a federated RLHF framework, simultaneously achieving high fairness and strong alignment performance for pluralistic preference alignment.

Adaptive LoRA Experts Allocation and Selection for Federated Fine-Tuning

This paper proposes FedLEASE, which addresses two critical challenges in federated LoRA fine-tuning: (1) automatically determining the optimal number of experts and their assignment via LoRA B-matrix similarity clustering, and (2) enabling adaptive top-M expert selection through an expanded routing space of \(2M-1\) dimensions, allowing each client to determine how many experts to use. FedLEASE achieves an average improvement of 5.53% over the strongest baseline on GLUE.

Adversarial Paraphrasing: A Universal Attack for Humanizing AI-Generated Text

This paper proposes Adversarial Paraphrasing — a training-free universal attack framework that selects the most "human-like" token at each decoding step by leveraging feedback signals from AI text detectors during token-by-token paraphrasing. The approach achieves an average T@1%F reduction of 87.88% across 8 detectors and exhibits strong cross-detector transferability.

AgentDAM: Privacy Leakage Evaluation for Autonomous Web Agents

This paper proposes AgentDAM, the first benchmark for end-to-end evaluation of data minimization compliance by AI agents in real web environments. It comprises 246 tasks spanning Reddit, GitLab, and Shopping platforms, and finds that leading models such as GPT-4o exhibit privacy leakage rates of 36–46% without mitigation, while a CoT-based privacy prompt reduces leakage rates to 6–8%.

AgentStealth: Reinforcing Large Language Model for Anonymizing User-generated Text

This paper proposes the AgentStealth framework, which trains a small language model (SLM) through a three-stage pipeline comprising an adversarial anonymization workflow, supervised fine-tuning (SFT), and online reinforcement learning, achieving effective anonymization of user-generated content while preserving text utility — yielding a 12.3% improvement in anonymization performance and 6.8% improvement in utility.

ALMGuard: Safety Shortcuts and Where to Find Them as Guardrails for Audio-Language Models

This paper presents ALMGuard, the first defense framework against jailbreak attacks on audio-language models (ALMs). The work discovers that aligned ALMs possess latent safety shortcuts that can be activated, and proposes a Mel Gradient Sparse Mask (M-GSM) to identify critical frequency bins. By applying Shortcut Activation Perturbations (SAP) to these bins, the average attack success rate is reduced from 41.6% to 4.6% with negligible degradation of normal task performance.

Approximate Domain Unlearning for Vision-Language Models

This paper introduces Approximate Domain Unlearning (ADU), a novel task that enables pretrained VLMs to selectively forget recognition capabilities for specified domains (e.g., illustrations, sketches) while preserving classification accuracy on other domains (e.g., real photographs). Two modules are proposed — Domain Disentangling Loss (DDL) and Instance-wise Prompt Generator (InstaPG) — achieving substantial improvements over all baselines across four multi-domain datasets.

Attention! Your Vision Language Model Could Be Maliciously Manipulated

This paper proposes the Vision-language Model Manipulation Attack (VMA), an image-based adversarial attack method that combines first- and second-order momentum optimization with a differentiable transformation mechanism, enabling precise control over every output token of a VLM. The approach supports a range of attack scenarios (jailbreaking, hijacking, privacy breach, DoS, sponge examples) and can also be repurposed for copyright-protection watermark injection.

Browse all 80 LLM Safety papers →


👻 Hallucination Detection (17)

Auditing Meta-Cognitive Hallucinations in Reasoning Large Language Models

This paper systematically audits the generation and propagation mechanisms of hallucinations in reasoning large language models (RLLMs), finding that reflection in long CoT amplifies hallucinations through metacognitive bias rather than correcting them. Even targeted interventions at the hallucination source fail to alter final outputs (chain disloyalty), exposing critical shortcomings of existing hallucination detection methods in multi-step reasoning scenarios.

Benford's Curse: Tracing Digit Bias to Numerical Hallucination in LLMs

This paper demonstrates that numerical hallucinations in LLMs originate from the Benford's Law-conforming digit frequency distribution in pretraining corpora—where digit 1 appears with ~30% probability while digit 9 appears with only ~5%—and that this bias is internalized by specific "digit-selective neurons" in the later FFN layers. A Digit Selectivity Coefficient (DSC) is proposed to localize biased neurons, and pruning 0.01% of neurons corrects 1.36–3.49% of erroneous predictions.

Beyond Token Probes: Hallucination Detection via Activation Tensors with ACT-ViT

This paper organizes all hidden-layer activations of an LLM into an "activation tensor" (layers × tokens × hidden dimension), treats it analogously to an image, and processes it with a ViT-based architecture (ACT-ViT) that supports joint training across multiple LLMs. The method consistently outperforms conventional probing approaches across 15 LLM–dataset combinations and demonstrates strong zero-shot/few-shot transfer to unseen datasets and unseen LLMs.

Causal-LLaVA: Causal Disentanglement for Mitigating Hallucination in Multimodal Large Language Models

This paper identifies the root cause of object hallucination in MLLMs at the representation level—semantic entanglement induced by dataset co-occurrence bias—and proposes a dual-path causal disentanglement framework (Causal-Driven Projector + Causal Intervention Module). By applying backdoor adjustment at both the projector and the final Transformer layer to decouple co-occurring object representations, the method achieves a 22.6% improvement on MME-Perception.

Generalization or Hallucination? Understanding Out-of-Context Reasoning in Transformers

This paper argues that LLM generalization and hallucination share a common mechanism — out-of-context reasoning (OCR) — and provides theoretical guarantees on a single-layer attention model: the factorized parameterization \((W_O, W_V)\) can perform OCR due to the nuclear norm implicit bias of gradient descent, whereas the merged parameterization \(W_{OV}\) cannot due to its Frobenius norm bias. Moreover, OCR is sample-efficient (requiring only \(m_{\text{train}}>0\)).

Generate, but Verify: Reducing Hallucination in Vision-Language Models with Retrospective Resampling

This paper proposes REVERSE, the first framework to unify generation adjustment and post-hoc verification within a single VLM. Through hallucination-aware training on 1.3M semi-synthetic samples combined with inference-time retrospective resampling, REVERSE enables a VLM to automatically detect and correct hallucinations during generation, achieving a 12% reduction on CHAIR-MSCOCO and a 34% improvement on HaloQuest.

GLSim: Detecting Object Hallucinations in LVLMs via Global-Local Similarity

GLSim is a training-free object hallucination detection method for LVLMs that combines a global scene similarity score (cosine similarity between the object token and the last instruction token) and a local visual grounding similarity score (cosine similarity between the object token and the Top-K image patch embeddings localized via Visual Logit Lens). It achieves 83.7% AUROC on MSCOCO, surpassing SVAR by 9% and Internal Confidence by 10.8%.

Hallucination as an Upper Bound: A New Perspective on Text-to-Image Evaluation

This paper proposes a definition of hallucination in text-to-image (T2I) models as bias-driven deviation, establishes a taxonomy of three hallucination categories—attribute, relation, and object—and argues that hallucination evaluation serves as an "upper bound" for prompt alignment evaluation, thereby revealing hidden model biases.

Intervene-All-Paths: Unified Mitigation of LVLM Hallucinations across Alignment Formats

This paper proposes AllPath, a multi-path hallucination intervention framework grounded in the Transformer causal architecture. It is the first to demonstrate that hallucinations in LVLMs do not stem from a single causal path but from the interaction of three paths — image-to-input-text, image-to-output-text, and text-to-text — and that models adaptively rely on different paths depending on the question-answer alignment format. By designing lightweight key-head identification methods for each path and performing adaptive intervention, AllPath consistently reduces hallucinations across four benchmarks covering different alignment formats: POPE, MCQ-POPE, CHAIR, and MME.

Mitigating Hallucination Through Theory-Consistent Symmetric Multimodal Preference Optimization

This paper proposes SymMPO (Symmetric Multimodal Preference Optimization), which addresses two key limitations of existing vision-augmented DPO methods—namely, theoretically unsound objective functions and indirect preference supervision—through symmetric paired preference learning over contrastive images and preference margin consistency regularization. Consistent performance gains are achieved across five hallucination benchmarks.

Browse all 17 Hallucination Detection papers →


📊 LLM Evaluation (37)

AdaSTaR: Adaptive Data Sampling for Training Self-Taught Reasoners

This work identifies that random data sampling in STaR (Self-Taught Reasoner) leads to severely imbalanced training frequencies across observations: easy problems are over-trained while hard problems are under-trained. It proposes AdaSTaR, which combines adaptive diversity sampling (prioritizing under-trained samples) with adaptive curriculum sampling (adjusting difficulty based on model strength) to achieve the highest accuracy on all 6 benchmarks while reducing training FLOPs by 58.6%.

Bayesian Evaluation of Large Language Model Behavior

This paper proposes a Beta-Binomial Bayesian framework for evaluating LLM behavior. By modeling the posterior distribution of \(\theta_m\) over stochastic generations for each prompt, the framework quantifies statistical uncertainty in evaluation metrics and introduces sequential sampling strategies such as Thompson sampling to achieve narrower credible intervals with fewer API calls.
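As a rough illustration of the Beta-Binomial idea, the sketch below updates a Beta prior with the number of generations that exhibit a behavior and estimates a credible interval by Monte Carlo. It assumes that \(\theta\) is the per-prompt probability of the behavior and that the prior is uniform, Beta(1, 1). The function names are hypothetical and are not taken from the paper.

```python
import random

def posterior_params(successes, trials, prior_a=1.0, prior_b=1.0):
    """Beta posterior parameters after `successes` out of `trials` generations.

    Assumption: a Beta(prior_a, prior_b) prior; Beta(1, 1) is uniform.
    """
    return prior_a + successes, prior_b + (trials - successes)

def posterior_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)

def credible_interval(a, b, level=0.95, draws=20000, seed=0):
    """Monte Carlo estimate of an equal-tailed credible interval for Beta(a, b)."""
    rng = random.Random(seed)
    samples = sorted(rng.betavariate(a, b) for _ in range(draws))
    lo_idx = int((1 - level) / 2 * draws)
    hi_idx = int((1 + level) / 2 * draws) - 1
    return samples[lo_idx], samples[hi_idx]

# Same observed rate (70%), different numbers of generations.
few = posterior_params(7, 10)
many = posterior_params(70, 100)
few_lo, few_hi = credible_interval(*few)
many_lo, many_hi = credible_interval(*many)
```

With more generations per prompt the interval narrows, which is the quantity a sequential sampling strategy would try to shrink with as few API calls as possible.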

Benchmarking is Broken — Don't Let AI be its Own Judge

This paper systematically critiques the fundamental flaws of current AI benchmark evaluation—data contamination (45%+ overlap in MMLU), selective reporting, and lack of proctoring—and proposes PeerBench: drawing on the proctoring paradigm of high-stakes exams (e.g., SAT/GRE), it constructs a next-generation AI evaluation infrastructure via a rolling confidential question bank, peer-review quality control, reputation-weighted scoring, and cryptographic commitment mechanisms.

Benchmarking Large Language Models for Zero-Shot and Few-Shot Phishing URL Detection

This paper systematically evaluates three commercial LLMs — GPT-4o, Claude-3.7, and Grok-3-Beta — on phishing URL detection under a unified zero-shot and few-shot prompt framework. Results show that few-shot prompting consistently improves performance across all models, with Grok-3-Beta achieving the best F1 (0.9399) on the balanced dataset, while different models exhibit distinct precision–recall trade-off behaviors.

Beyond the Singular: Revealing the Value of Multiple Generations in Benchmark Evaluation

This paper formalizes LLM benchmark evaluation as a hierarchical statistical model, theoretically demonstrates that multiple stochastic generations (\(k>1\)) reduce the variance of benchmark score estimates, and introduces a prompt-level difficulty metric \(\mathbb{P}(\text{correct})\) along with data maps for benchmark quality control.
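The variance reduction can be sketched with the law of total variance. The example assumes a simple two-level model that may differ from the paper's exact formulation: each of \(n\) prompts is drawn independently from a population with per-prompt success probabilities `p`, and each prompt receives \(k\) independent Bernoulli generations. The function name is hypothetical.

```python
def score_variance(p, k):
    """Variance of the mean benchmark score under a two-level model.

    Assumption: the list `p` is treated as the population of per-prompt
    success probabilities, with n = len(p) prompts sampled i.i.d. from it.
    Var = (between-prompt variance + within-prompt variance / k) / n
    """
    n = len(p)
    mean_p = sum(p) / n
    between = sum((x - mean_p) ** 2 for x in p) / n   # Var(p_i)
    within = sum(x * (1 - x) for x in p) / n          # E[p_i (1 - p_i)]
    return (between + within / k) / n

difficulties = [0.2, 0.5, 0.9]
variances = [score_variance(difficulties, k) for k in (1, 2, 8, 64)]
```

Only the within-prompt term shrinks with \(k\). The between-prompt term sets a floor that more generations cannot remove.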

Beyond the Surface: Enhancing LLM-as-a-Judge Alignment with Human via Internal Representations

This paper proposes LAGER, a framework that aggregates score token logits from intermediate to final layers of an LLM and computes an expected score to derive the final judgment. Without any model fine-tuning, LAGER improves human alignment by up to 7.5% and matches or surpasses reasoning-based methods without requiring chain-of-thought inference.
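A minimal sketch of the "aggregate logits, then take an expected score" idea follows. It assumes the score tokens are the integers 1 to 5 and that layers are combined by a normalized weighted sum. How LAGER actually chooses the layer weights is not stated in this summary, so `layer_weights` and the function names here are hypothetical.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def expected_score(layer_logits, layer_weights, scores):
    """Combine per-layer score-token logits and return the expected score.

    layer_logits[l][i] is the logit of score token scores[i] at layer l.
    Assumption: layers are combined by a normalized weighted sum.
    """
    total_w = sum(layer_weights)
    aggregated = [
        sum(w * layer[i] for w, layer in zip(layer_weights, layer_logits)) / total_w
        for i in range(len(scores))
    ]
    probs = softmax(aggregated)
    return sum(p * s for p, s in zip(probs, scores))

scores = [1, 2, 3, 4, 5]
flat = expected_score([[0.0] * 5, [0.0] * 5], [1.0, 1.0], scores)
peaked = expected_score([[0, 0, 0, 1, 4], [0, 0, 0, 2, 6]], [1.0, 2.0], scores)
```

Taking the expectation over score tokens yields a continuous judgment rather than the single most likely score token.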

BLINK-Twice: You See But Do You Observe? A Reasoning Benchmark on Visual Perception

This paper introduces BLINK-Twice, a vision-centric reasoning benchmark comprising 345 visually challenging images, 103 adversarial samples, 896 VQA pairs, and 1,725 annotated reasoning steps. Through seven categories of visual illusion scenarios, it evaluates the "you see but do not observe" reasoning capability of MLLMs. The strongest model, Gemini-2.5 Pro, achieves only 26.9% G-Acc, suggesting that multi-round image observation and active visual interaction are promising directions for improvement.

Can Large Language Models Master Complex Card Games?

This paper systematically evaluates the ability of LLMs to learn eight complex card games. It finds that through SFT on high-quality game trajectory data, LLMs can approach the performance of strong game AIs and simultaneously master multiple games, though general capabilities degrade — a decline that can be mitigated by mixing in general instruction data.

CodeAssistBench (CAB): Dataset & Benchmarking for Multi-turn Chat-Based Code Assistance

This paper proposes CodeAssistBench (CAB), the first fully automated benchmark for evaluating multi-turn, repository-level programming assistance. CAB automatically constructs 3,286 real-world programming help scenarios from GitHub Issues, spanning 7 languages and 214 repositories, and reveals a substantial performance gap: state-of-the-art models achieve 70–83% on StackOverflow-style questions but only 7–16% on post-cutoff repositories.

ComPO: Preference Alignment via Comparison Oracles

To address likelihood displacement and verbosity caused by noisy preference pairs (where preferred and dispreferred responses are highly similar) in DPO, this paper proposes ComPO, a zeroth-order preference alignment method based on comparison oracles. The approach partitions data into clean and noisy subsets, applying DPO to the clean subset and ComPO to extract alignment signals from the noisy subset, achieving consistent improvements in LC win rate on benchmarks such as AlpacaEval 2.

Browse all 37 LLM Evaluation papers →


⚡ LLM Efficiency (34)

3-Model Speculative Decoding (PyramidSD)

PyramidSD introduces a three-tier pyramid decoding architecture by inserting an intermediate "qualifier" model between the draft model (\(M_D\)) and target model (\(M_T\)) in standard speculative decoding. The method exploits the natural entropy gradient across model scales within a model family to hierarchically filter tokens, and employs a fuzzy acceptance criterion to relax the matching threshold, achieving up to 1.91× speedup (reaching 124 tok/s on an RTX 4090).

A Unified Framework for Establishing the Universal Approximation of Transformer-Type Architectures

A unified theoretical framework is established for proving the universal approximation property (UAP) of diverse Transformer architectures. The framework rests on two core conditions — nonlinear affine invariance of the feed-forward layer and token distinguishability of the attention layer — and leverages an analyticity assumption to reduce the latter to verification on only two-sample cases. The framework successfully covers a wide range of practical architectures, including softmax, RBF kernel, Performer, BigBird, Linformer, and others.

Advancing Expert Specialization for Better MoE

By jointly optimizing an orthogonality loss (reducing projection overlap among experts) and a variance loss (increasing routing score diversity), the proposed method reduces expert overlap by 45% and improves routing variance by 150% without modifying the MoE architecture, achieving an average gain of 23.79% across 11 benchmarks while fully preserving load balance.

Approximately Aligned Decoding

This paper proposes Approximately Aligned Decoding (AprAD), a method for constrained generation in LLMs that leverages the prefix-selection algorithm from speculative decoding. Upon encountering a constraint violation, AprAD neither reverts only one token (as in constrained generation, which causes extreme probability amplification) nor resamples entirely from scratch (as in ASAp, which incurs prohibitive computational cost). Instead, it intelligently selects a rollback position via speculative sampling, achieving a favorable trade-off between output distribution distortion and computational efficiency.

Constant Bit-Size Transformers Are Turing Complete

This paper provides the first proof that a Transformer with constant bit-size precision and a fixed number of parameters — permitting only context window growth — is Turing complete. It establishes the exact complexity equivalence WINDOW[s(n)] = SPACE[s(n)], demonstrating that expanding the context window, rather than model size, suffices for universal computation.

Critical Batch Size Revisited: A Simple Empirical Approach to Large-Batch Language Model Training

This paper proposes a branched training method to directly measure the critical batch size (CBS) empirically, finding that CBS grows rapidly in early training before plateauing and is independent of model scale. Based on this insight, a batch size warmup strategy is designed that achieves equivalent or superior training loss with 43% fewer gradient steps.

DISC: Dynamic Decomposition Improves LLM Inference Scaling

DISC proposes a dynamic decomposition algorithm that automatically and recursively adjusts the granularity of reasoning steps at inference time based on the z-score (normalized maximum of sampled rewards) at each step, decomposing difficult steps more finely while taking larger strides over easy ones. It can be plugged into greedy search, beam search, and MCTS, achieving higher pass@k with smaller token budgets on APPS, MATH, and LiveCodeBench.

Efficient Training-Free Online Routing for High-Volume Multi-LLM Serving

This paper proposes PORT, the first training-free online LLM routing algorithm. PORT estimates query features via approximate nearest neighbor search (ANNS) and performs a one-shot optimization of dual variables as routing weights on a small set of initial queries. Under a limited token budget, PORT achieves near-offline-optimal routing performance with a \(1-o(1)\) competitive ratio, delivering on average 3.55× performance improvement, 1.85× cost efficiency, and 4.25× throughput over baselines.

FlowMoE: A Scalable Pipeline Scheduling Framework for Distributed MoE Training

FlowMoE proposes a unified pipeline scheduling framework that integrates MHA computation, gating, expert computation, and A2A communication into a single pipeline. A priority-driven all-reduce tensor chunking mechanism maximizes communication–computation overlap, achieving 1.13×–1.82× speedup, 10–39% energy reduction, and 7–32% memory savings across multiple real-world MoE models.

From Shortcut to Induction Head: How Data Diversity Shapes Algorithm Selection in Transformers

This paper provides rigorous theoretical analysis demonstrating that the diversity of pretraining data—characterized by the max-sum ratio—determines whether a single-layer Transformer learns a generalizable induction head or a non-OOD-generalizing positional shortcut, and derives a closed-form optimal pretraining distribution that promotes induction head formation.

Browse all 34 LLM Efficiency papers →


📚 Pretraining (51)

A Practical Guide for Incorporating Symmetry in Diffusion Policy

This paper presents a practical guide for incorporating symmetry into diffusion policies. Through three simple and composable methods — invariant representations (relative trajectory actions + eye-in-hand perception), equivariant visual encoders, and Frame Averaging — the proposed approach achieves performance on par with or exceeding fully equivariant diffusion policies across 12 MimicGen tasks, while substantially reducing implementation complexity.

AI Progress Should Be Measured by Capability-Per-Resource, Not Scale Alone: A Framework for Gradient-Guided Resource Allocation in LLMs

This position paper challenges "scaling fundamentalism" by proposing Capability-Per-Resource (CPR) as a replacement for raw scale as the primary measure of AI progress. The paper presents a gradient-guided resource allocation framework in which foundation model developers publish "gradient blueprint" metadata, enabling downstream adapters to fine-tune only a high-influence parameter subset while substantially reducing resource consumption and maintaining performance close to full-parameter fine-tuning.

Alternating Gradient Flows: A Theory of Feature Learning in Two-layer Neural Networks

This paper proposes the Alternating Gradient Flow (AGF) theoretical framework to explain the stepwise "saddle-to-saddle" feature learning dynamics in neural networks. Training is modeled as an alternating process between utility maximization for dormant neurons and cost minimization for active neurons, unifying feature selection analysis across diagonal linear networks, attention models, and modular addition. Predictions from AGF exhibit high agreement with actual gradient flow behavior.

An Empirical Investigation of Neural ODEs and Symbolic Regression for Dynamical Systems

This paper systematically investigates the extrapolation capability of Neural ODEs (NODEs) on noisy synthetic data, and explores a pipeline that employs NODEs as a data augmentation tool combined with symbolic regression (SR) to recover governing equations from limited data. Results demonstrate that this combined approach can recover two of three governing equations—and a strong approximation of the third—using only 10% of the simulation data.

Beyond Benign Overfitting in Nadaraya-Watson Interpolators

By tuning a single bandwidth parameter \(\beta\) in the Nadaraya-Watson interpolator, this paper precisely characterizes the complete phase transition spectrum from catastrophic overfitting (\(\beta < d\)) → benign overfitting (\(\beta = d\)) → tempered overfitting (\(\beta > d\)), demonstrating that overestimating the intrinsic dimensionality of data is safer than underestimating it.

Born a Transformer – Always a Transformer? On the Effect of Pretraining on Architectural Abilities

Through systematic study of a family of retrieval and copying tasks, this paper reveals that large-scale pretraining introduces a directional bias into Transformers (rightward/forward over leftward/backward), while failing to overcome fundamental architectural limitations on non-unique tasks. Fine-tuning can eliminate the directional bias but cannot surpass the boundaries of architectural expressiveness.

Breaking the Frozen Subspace: Importance Sampling for Low-Rank Optimization in LLM Pretraining

This paper identifies that the dominant subspace in low-rank optimizers such as GaLore "freezes" during pretraining (cosine overlap between consecutive subspaces approaches 1), trapping weight updates within a fixed low-rank subspace. The authors propose SARA (Sampling-based Adaptive Rank Allocation), which constructs subspaces by sampling singular vectors according to singular value weights, provides convergence guarantees, and reduces the performance gap between low-rank optimizers and full-rank Adam by up to 46%.

Breaking the Gradient Barrier: Unveiling Large Language Models for Strategic Classification

This paper proposes GLIM (Gradient-free Learning In-context Method), which for the first time leverages the In-Context Learning (ICL) mechanism of LLMs to implicitly simulate the bi-level optimization in strategic classification (feature manipulation + decision rule optimization), enabling efficient strategic classification on large-scale data without any fine-tuning.

Broken Tokens: Your Language Model Can Secretly Handle Non-Canonical Tokenization

This paper reveals that LLMs can secretly handle non-canonical tokenizations (e.g., splitting "Hello" into "He"+"llo" instead of the canonical whole-word token)—even when the input token sequence differs from training, models exhibit surprising robustness. This capability stems from the property that sub-word embeddings in the embedding space can linearly combine to approximate whole-word embeddings.

CLIMB: Class-Imbalanced Learning Benchmark on Tabular Data

This paper presents CLIMB — the most comprehensive benchmark to date for class-imbalanced learning on tabular data — encompassing 73 real-world datasets and 29 CIL algorithms. Large-scale experiments reveal several practical insights: naive rebalancing is often ineffective, ensemble methods are critical, and data quality impacts performance more than the degree of imbalance itself.

Browse all 51 Pretraining papers →


✏️ Knowledge Editing (6)

Edit Less, Achieve More: Dynamic Sparse Neuron Masking for Lifelong Knowledge Editing in LLMs

This paper proposes NMKE, a framework that identifies two categories of knowledge neurons—knowledge-general and knowledge-specific—via neuron-level attribution, and applies entropy-guided dynamic sparse masking to achieve precise neuron-level knowledge editing. NMKE maintains high edit success rates and general model capabilities after 5,000 consecutive edits.

KScope: A Framework for Characterizing the Knowledge Status of Language Models

This paper proposes a five-category taxonomy of LLM knowledge status (Consistent Correct / Conflicting Correct / Missing / Conflicting Incorrect / Consistent Incorrect) and the KScope hierarchical statistical testing framework. By combining repeated sampling with multi-step hypothesis testing, KScope precisely characterizes the modal structure of an LLM's knowledge for a given question, and systematically investigates how context updates each knowledge state. The study finds that constrained context summarization combined with credibility augmentation improves knowledge update success rates by an average of 4.3%.

MemEIC: A Step Toward Continual and Compositional Knowledge Editing

This paper proposes MemEIC, a three-tier framework for continual and compositional knowledge editing in large vision-language models (LVLMs), combining an external dual-modal retrieval memory (Mem-E), an internal modality-decoupled LoRA adapter (Mem-I), and a brain-inspired Knowledge Connector. MemEIC substantially outperforms existing methods on the newly introduced CCKEB benchmark.

MEMOIR: Lifelong Model Editing with Minimal Overwrite and Informed Retention for LLMs

MEMOIR introduces a framework that incorporates zero-initialized residual memory matrices into FFN layers, employs TopHash-based sparse masks to confine each edit to a distinct subset of memory parameters, and at inference time conditionally activates stored knowledge by measuring mask overlap. The approach achieves an optimal balance among reliability, generalization, and locality across 15,000 sequential edits.

Rethinking Residual Distribution in Locate-then-Edit Model Editing

This paper reveals that the residual distribution mechanism in locate-then-edit model editing introduces weight deviation errors that grow with distribution distance, batch size, and sequential edit length. It proposes BLUE (Boundary Layer UpdatE), a strategy that updates only the first and last critical layers, achieving an average improvement of 35.59%.

UniEdit: A Unified Knowledge Editing Benchmark for Large Language Models

This paper presents UniEdit — the first unified LLM knowledge editing benchmark built upon an open-domain knowledge graph (Wikidata), covering 311K samples across 25 domains in 5 major categories. By introducing the Neighborhood Multi-hop Chain Sampling (NMCS) algorithm, UniEdit integrates diverse generalization and locality evaluation criteria into a single framework, systematically revealing the shortcomings of existing editing methods under complex ripple effect evaluations.


💬 LLM (Other) (53)

AceSearcher: Bootstrapping Reasoning and Search for LLMs via Reinforced Self-Play

This paper proposes AceSearcher—a collaborative self-play framework in which a single LLM simultaneously plays two roles: a decomposer (breaking complex queries into sub-questions to guide retrieval) and a solver (integrating retrieved context to generate answers). Through a two-stage training pipeline of SFT followed by iterative DPO, using only final-answer rewards, AceSearcher achieves an average EM improvement of 7.6% across 10 datasets, and the 32B model matches DeepSeek-V3 with fewer than 5% of its parameters.

AdaptDel: Adaptable Deletion Rate Randomized Smoothing for Certified Robustness

AdaptDel extends the fixed deletion rate used in randomized smoothing for discrete sequences to an adaptable deletion rate that varies according to input properties such as sequence length. The paper provides a theoretical soundness proof for certification under variable rates, and experiments on NLP sequence classification tasks demonstrate improvements in certified region cardinality of up to 30 orders of magnitude.

Adaptive Kernel Design for Bayesian Optimization Is a Piece of CAKE with LLMs

This paper proposes CAKE (Context-Aware Kernel Evolution), which leverages LLMs as crossover and mutation operators within a genetic algorithm framework to adaptively generate and evolve GP kernel expressions during Bayesian optimization. Combined with the BAKER ranking mechanism that balances model fit (BIC) and expected improvement (EI), CAKE consistently outperforms both fixed-kernel and adaptive-kernel baselines on tasks including hyperparameter optimization, controller tuning, and photonic chip design.

Are Language Models Efficient Reasoners? A Perspective from Logic Programming

This paper proposes a framework for evaluating LLM reasoning efficiency (rather than correctness alone) from a logic programming perspective. By mapping natural language proofs to logic program proofs via verbalized logic programs, the authors find that current LLMs not only suffer accuracy degradation on math problems containing irrelevant axioms, but also exhibit severely inefficient reasoning—more than half of all reasoning steps are unnecessary.

AutoDiscovery: Open-ended Scientific Discovery via Bayesian Surprise

AutoDiscovery proposes Bayesian Surprise as an objective reward signal for open-ended scientific discovery — estimating the KL divergence between prior and posterior belief distributions via LLM sampling, combined with MCTS and progressive widening to explore the hypothesis space. On 21 real-world datasets, the method produces 5–29% more surprising discoveries than greedy/beam search baselines. Human evaluation confirms that Bayesian Surprise aligns with expert "surprise" ratings (0.67), substantially outperforming LLM self-evaluated "novelty" and "usefulness."

Breaking AR's Sampling Bottleneck: Provable Acceleration via Diffusion Language Models

This paper establishes a complete convergence theory for masked diffusion language models from an information-theoretic perspective: it proves that the sampling error in KL divergence decays at an \(O(1/T)\) rate and scales linearly with inter-token mutual information, provides a matching lower bound to establish tightness, and theoretically demonstrates that diffusion models can generate high-quality samples in \(T < L\) steps (where \(L\) is the sequence length).

C²Prompt: Class-aware Client Knowledge Interaction for Federated Continual Learning

To address class-level knowledge inconsistency during prompt communication in federated continual learning, this paper proposes C²Prompt, which explicitly enhances class-level knowledge coherence across clients via two mechanisms: Local Class Distribution Compensation (LCDC) and Class-aware Prompt Aggregation (CPA). The method achieves an average accuracy (Avg) of 87.20% on ImageNet-R, surpassing the previous SOTA, Powder, by 2.51%.

CAT: Circular-Convolutional Attention for Sub-Quadratic Transformers

CAT replaces the \(N \times N\) attention matrix in standard self-attention with a circulant matrix generated from an \(N\)-dimensional vector, leveraging FFT to achieve \(O(N \log N)\) attention computation. While strictly preserving the softmax row-normalization structure, CAT matches or surpasses standard attention on ImageNet-1k (avg pool, CLIP-L accuracy 0.694 vs. 0.646) and WikiText-103 masked LM (PPL 8.32 vs. 9.82).
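The circulant structure can be illustrated in a few lines. The sketch below assumes a single scalar value per position and a power-of-two sequence length. It applies softmax to an \(N\)-dimensional score vector, so every row of the implied circulant matrix sums to 1, and then multiplies by the values through an FFT-based circular convolution. The function names are hypothetical and this is not the paper's implementation.

```python
import cmath
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def fft(x, inverse=False):
    """Recursive radix-2 FFT (unnormalized); len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        tw = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + tw
        out[k + n // 2] = even[k] - tw
    return out

def circulant_attention_fft(scores, values):
    """O(N log N): y = ifft(fft(softmax(scores)) * fft(values))."""
    a = softmax(scores)
    product = [p * q for p, q in zip(fft(a), fft(values))]
    n = len(values)
    return [c.real / n for c in fft(product, inverse=True)]

def circulant_attention_direct(scores, values):
    """O(N^2) reference: row i of the circulant matrix is a[(i - j) mod N]."""
    a = softmax(scores)
    n = len(values)
    return [sum(a[(i - j) % n] * values[j] for j in range(n)) for i in range(n)]

scores = [0.3, -1.2, 0.8, 0.0, 2.1, -0.5, 0.4, 1.0]
values = [1.0, 2.0, -1.0, 0.5, 3.0, 0.0, -2.0, 4.0]
fast = circulant_attention_fft(scores, values)
slow = circulant_attention_direct(scores, values)
```

Both functions compute the same output. The FFT route avoids materializing the \(N \times N\) matrix.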

Characterizing the Expressivity of Fixed-Precision Transformer Language Models

This work precisely characterizes the expressive power of fixed-precision, strictly causal, soft-attention, NoPE Transformers — showing it is exactly equivalent to linear temporal logic restricted to past operators, LTL[P] — and unifies this characterization with partially ordered deterministic finite automata (PODFA) and \(\mathcal{R}\)-trivial monoids.

Composing Linear Layers from Irreducibles

By leveraging Clifford algebra, this work represents linear layers as compositions of bivectors—specifically as rotor sandwich products—requiring only \(O(\log^2 d)\) parameters to replace a \(d \times d\) dense matrix. When applied to Q/K/V projections in LLM attention layers, performance closely matches the original model and strong baselines.

Browse all 53 LLM (Other) papers →


📖 NLP Understanding (3)

Generalization Error Analysis for Selective State-Space Models Through the Lens of Attention

This work unrolls selective SSMs (Mamba) into an attention-equivalent form and derives generalization bounds via covering number techniques, controlled by the spectral abscissa \(s_{\mathbf{A}}\) of the continuous-time state matrix. When \(s_{\mathbf{A}} < 0\), the bound is independent of sequence length; when \(s_{\mathbf{A}} \geq 0\), it grows exponentially. The paper further proves this dependence is irreducible.

Planning without Search: Refining Frontier LLMs with Offline Goal-Conditioned RL

This paper proposes PNLC, a method that trains a lightweight goal-conditioned value function as a "natural language critic" to guide LLM agents in multi-turn planning and self-refinement at the thought-step level. Without direct fine-tuning or inference-time search, PNLC significantly outperforms existing methods on complex interactive tasks such as web navigation, social reasoning, and persuasion, while achieving 8–10× faster inference.

Weak-to-Strong Generalization under Distribution Shifts

This paper demonstrates that naive weak-to-strong generalization fails under distribution shifts—where the strong model performs even worse than the weak supervisor—and proposes RAVEN, a framework that dynamically learns optimal combination weights over multiple weak models to achieve robust weak-to-strong generalization, surpassing baselines by over 30% on OOD tasks.


🗣️ Dialogue Systems (8)

AC-LoRA: (Almost) Training-Free Access Control-Aware Multi-Modal LLMs

AC-LoRA is an end-to-end system that trains independent LoRA adapters for datasets with different permission levels. At inference time, it dynamically retrieves multiple LoRA adapters based on cosine similarity and user permissions and merges their outputs without any additional training, achieving strong information isolation while matching or surpassing SOTA LoRA mixture methods in response quality.

Agentic Persona Control and Task State Tracking for Realistic User Simulation

A three-agent collaborative framework for realistic user simulation is proposed, comprising a User Agent (coordination), a State Tracking Agent (structured task state), and a Message Attributes Generation Agent (behavior attribute control conditioned on persona and state). On a restaurant ordering scenario, the framework achieves a 102.6% improvement in composite realism score (CRRS), +19.9% in persona adherence, and +284.5% in behavioral variability. A core finding is that behavior control without state awareness yields BVS = 0 (completely rigid behavior).

Bridging Human and LLM Judgments: Understanding and Narrowing the Gap

This paper proposes Bridge, a statistical framework that models the latent relationship between human and LLM judgments via ordinal logistic regression. With a small number of human labels, Bridge improves the calibration and alignment of LLM judgments while supporting formal statistical hypothesis testing for systematic biases.

HyGen: Efficient LLM Serving via Elastic Online-Offline Request Co-location

This paper proposes HyGen, an interference-aware LLM inference system that achieves elastic co-location of online and offline workloads through an accurate batch latency predictor, an SLO-aware performance profiler, and a prefix-sharing-maximization scheduling strategy, delivering 3.87–5.84× throughput gains while strictly guaranteeing SLO compliance.

KL Penalty Control via Perturbation for Direct Preference Optimization

This paper proposes ε-DPO, which achieves instance-level adaptive KL penalty control by monitoring the monotonicity of logits—used as preference model outputs—under small perturbations of \(\beta\) during training. The method incurs no additional computational overhead and significantly outperforms DPO and most direct alignment algorithms, achieving a 46.4% LC win rate on AlpacaEval 2 (vs. 40.3% for DPO).

LatentGuard: Controllable Latent Steering for Robust Refusal of Attacks and Reliable Response Generation

This paper proposes LatentGuard, a three-stage framework that combines behavior-level alignment fine-tuning, structured VAE-supervised latent space modeling, and latent-space dimensional manipulation to achieve interpretable and controllable regulation of LLM refusal behavior — robustly defending against adversarial attacks while preserving responsiveness to benign queries.

Less is More: Local Intrinsic Dimensions of Contextual Language Models

This paper proposes using the Local Intrinsic Dimension (LID) of contextual token embeddings as an unsupervised signal for monitoring LLM training dynamics — a decrease in LID indicates improved generalization, while an increase signals overfitting. The utility of this geometric signal is validated on tasks including dialogue state tracking, grokking, and sentiment recognition.

SciArena: An Open Evaluation Platform for Non-Verifiable Scientific Literature-Grounded Tasks

SciArena is a community-driven open evaluation platform for scientific literature tasks. It adopts a Chatbot Arena-style human preference voting paradigm to rank 47 foundation models, collecting over 20,000 votes, and releases SciArena-Eval as a meta-benchmark for assessing the ability of automated evaluation systems to judge answer quality on literature-grounded tasks.


🌐 Multilingual & Translation (11)

Adaptive Originality Filtering: Rejection-Based Prompting and RiddleScore for Culturally Grounded Multilingual Riddle Generation

This paper proposes Adaptive Originality Filtering (AOF)—a semantic rejection-sampling prompting strategy that filters repetitive or templated outputs via MiniLM embedding cosine similarity, compelling LLMs to generate more novel, diverse, and culturally grounded multilingual riddles. It also introduces the RiddleScore composite evaluation metric (Novelty + Diversity + Fluency + Alignment), achieving a human correlation of \(\rho=0.83\).
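The rejection step can be sketched as a cosine-similarity filter over embeddings. The example uses toy 2-D vectors in place of MiniLM embeddings and a hypothetical threshold of 0.8. In the described method a rejected output would trigger another generation attempt, which is omitted here, and all function names are illustrative.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def is_original(candidate_vec, accepted_vecs, threshold=0.8):
    """Accept only if the candidate is not too similar to any accepted output."""
    return all(cosine(candidate_vec, v) < threshold for v in accepted_vecs)

def filter_candidates(candidates, threshold=0.8):
    """Greedy pass over (text, embedding) pairs, keeping sufficiently novel ones."""
    kept = []
    for text, vec in candidates:
        if is_original(vec, [v for _, v in kept], threshold):
            kept.append((text, vec))
    return kept

# Toy embeddings (assumption): the second riddle is a near-duplicate of the first.
candidates = [
    ("riddle A", [1.0, 0.0]),
    ("riddle A, reworded", [0.99, 0.1]),
    ("riddle B", [0.0, 1.0]),
]
kept_texts = [text for text, _ in filter_candidates(candidates)]
```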

Exploring the Translation Mechanism of Large Language Models

This paper proposes a subspace-intervened path patching method for fine-grained causal analysis of the translation mechanism in LLMs. The study finds that translation is driven by a sparse set of attention heads comprising fewer than 5% of all heads, categorized into three functional roles: source heads, indicator heads, and positional heads. MLP layers integrate these features into an English-centric intermediate representation, and fine-tuning only 64 critical heads achieves performance comparable to full-parameter fine-tuning.

HelpSteer3-Preference: Open Human-Annotated Preference Data across Diverse Tasks and Languages

NVIDIA releases a 40K+ open-source human-annotated preference dataset covering general, STEM, code, and multilingual (13 languages) tasks. The dataset carries a commercially friendly CC-BY-4.0 license, and a reward model trained on it achieves 82.4% (+10%) on RM-Bench.

How Data Mixing Shapes In-Context Learning: Asymptotic Equivalence for Transformers with MLPs

Under a high-dimensional asymptotic framework, this paper proves that Transformers with nonlinear MLP heads are asymptotically equivalent to structured polynomial predictors in terms of ICL error, revealing the gain mechanism of nonlinear MLPs on nonlinear tasks and establishing that low noise and structured covariance are key characteristics of high-quality data sources in multi-source data mixing.

MERIT: Multilingual Semantic Retrieval with Interleaved Multi-Condition Query

This paper introduces MERIT, the first multilingual interleaved multi-condition semantic retrieval dataset (320K queries, 135K products, 5 languages, 7 product categories), exposes the bottleneck of existing retrieval models that focus solely on global semantics while neglecting condition-level details, and proposes the Coral fine-tuning framework that combines embedding reconstruction with contrastive learning to achieve a 45.9% improvement in retrieval performance.

On Extending Direct Preference Optimization to Accommodate Ties

This paper replaces the Bradley-Terry preference model in DPO with the Rao-Kupper and Davidson extensions, enabling preference optimization to explicitly model "tie" data. This avoids discarding ambiguous preference pairs and yields improved regularization and performance on translation and mathematical reasoning tasks.
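For reference, the two tie-aware preference models can be written as functions of a reward difference \(d = r_a - r_b\), using strengths \(\pi_a = e^{d}\) and \(\pi_b = 1\) (only the ratio matters). The sketch below uses the standard textbook forms of the Rao-Kupper and Davidson models. The parameter defaults are arbitrary, and the way the paper plugs DPO's implicit rewards into these probabilities is not reproduced here.

```python
import math

def rao_kupper(d, theta=1.5):
    """Return (P(a wins), P(b wins), P(tie)) under Rao-Kupper, theta >= 1.

    P(a wins) = pi_a / (pi_a + theta * pi_b); theta = 1 recovers Bradley-Terry.
    """
    pa, pb = math.exp(d), 1.0
    win_a = pa / (pa + theta * pb)
    win_b = pb / (pb + theta * pa)
    # Closed form of the remainder:
    # (theta^2 - 1) pi_a pi_b / ((pi_a + theta pi_b)(pi_b + theta pi_a))
    return win_a, win_b, 1.0 - win_a - win_b

def davidson(d, nu=0.5):
    """Return (P(a wins), P(b wins), P(tie)) under Davidson, nu >= 0.

    The tie mass is proportional to nu * sqrt(pi_a * pi_b); nu = 0 recovers
    Bradley-Terry.
    """
    pa, pb = math.exp(d), 1.0
    tie_mass = nu * math.sqrt(pa * pb)
    denom = pa + pb + tie_mass
    return pa / denom, pb / denom, tie_mass / denom

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

rk = rao_kupper(0.7)
dv = davidson(0.7)
```

In both models the tie probability is largest when the two rewards are equal and decays as the reward gap grows, so tied pairs can contribute to the objective instead of being discarded.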

ParallelPrompt: Extracting Parallelism from Large Language Model Queries

This work presents ParallelPrompt, the first benchmark for intra-query parallelism, comprising structured decomposition annotations for 37,000+ real user prompts. It demonstrates that approximately 10% of user queries contain exploitable parallel structure, and that parallel execution can achieve up to 5.7× latency speedup with limited quality degradation.

Quantifying Climate Policy Action and Its Links to Development Outcomes: A Cross-National Data-Driven Analysis

This paper constructs an integrated NLP–econometrics framework. It first uses a fine-tuned multilingual DistilBERT to automatically classify global climate policy documents by topic (Mitigation / Adaptation / Disaster Risk Management / Loss & Damage) with F1 = 0.90, then runs fixed-effects panel regressions against World Bank development indicators. Mitigation policies are significantly and positively associated with GDP/GNI, while Loss & Damage policies remain substantially unimplemented worldwide.

Reflective Translation: Improving Low-Resource Machine Translation via Structured Self-Reflection

This paper proposes the Reflective Translation framework, which enables LLMs to perform structured self-critique of their initial translations at inference time—identifying mistranslations, omissions, and semantic distortions—and subsequently generate revised translations based on this critique. The approach requires no fine-tuning or additional annotated data, yet achieves statistically significant improvements in BLEU and COMET on low-resource African languages such as isiZulu and isiXhosa.

XIFBench: Evaluating Large Language Models on Multilingual Instruction Following

This paper proposes XIFBench, the first constraint-driven benchmark for systematically evaluating LLMs' multilingual instruction-following capabilities. It comprises 558 instructions (0–5 constraints, 5 categories × 21 dimensions) across 6 languages (high/mid/low resource), and introduces an English-requirement anchoring evaluation protocol that achieves 94.7% cross-lingual evaluation consistency.

Browse all 11 Multilingual & Translation papers →


🔍 Information Retrieval & RAG (24)

AcuRank: Uncertainty-Aware Adaptive Computation for Reranking

AcuRank dynamically adjusts the reranking subset size and verification scope via TrueSkill-based uncertainty estimation, achieving a superior accuracy–efficiency trade-off while avoiding over-computation.

Chain-of-Retrieval Augmented Generation (CoRAG)

This paper proposes CoRAG, a framework that automatically generates intermediate retrieval chains (sub-query → sub-answer) via rejection sampling, fine-tunes an LLM to learn iterative retrieval and reasoning, and supports diverse test-time decoding strategies (greedy / Best-of-N / tree search) for flexible compute scaling. CoRAG achieves 26+ EM improvement on multi-hop QA and attains state-of-the-art on 9/10 tasks of the KILT benchmark.

Cooperative Retrieval-Augmented Generation for Question Answering: Mutual Information Exchange and Ranking by Contrasting Layers

CoopRAG is a framework that achieves bidirectional cooperation between the retriever and the LLM through query expansion, retriever layer-contrastive reranking, and reasoning chain completion. It surpasses HippoRAG2 by 5.3% on multi-hop QA and by 35.2% on single-hop QA.

Deep Research Brings Deeper Harm

This paper reveals critical safety vulnerabilities in Deep Research (DR) agents — even when the underlying LLM correctly refuses harmful queries, deploying it as a DR agent can still produce detailed, professional, and dangerous reports. Two targeted jailbreak methods, Plan Injection and Intent Hijack, are proposed alongside the DeepREJECT evaluation metric. Experiments on 6 LLMs demonstrate that DR agents systematically undermine alignment mechanisms.

DICE: Discrete Interpretable Comparative Evaluation with Probabilistic Scoring for RAG

This paper proposes the DICE framework, which achieves interpretable, robust, and efficient evaluation of RAG systems through a two-stage assessment pipeline (evidence-coupled deep analysis + probabilistic {A, B, Tie} scoring) combined with a Swiss-system tournament. On a Chinese financial QA dataset, DICE attains 85.7% agreement with human experts, substantially outperforming RAGAS (45.7%).

Hierarchical Retrieval: The Geometry and a Pretrain-Finetune Recipe

This paper investigates the feasibility of Dual Encoders (DE) for Hierarchical Retrieval (HR), theoretically proving that embedding dimensionality need only grow linearly with hierarchy depth and logarithmically with document count. After identifying the "lost-in-the-long-distance" phenomenon, the paper proposes a pretrain-finetune strategy that improves long-distance recall from 19% to 76% on WordNet.

HiFi-RAG: Hierarchical Content Filtering and Two-Pass Generation for Open-Domain RAG

By decoupling the filtering capability of a lightweight Flash model from the reasoning capability of a Pro model, the paper constructs a multi-stage pipeline (query optimization → hierarchical filtering → two-pass generation → citation verification) that achieves SOTA performance in the MMU-RAGent competition.

How Should We Evaluate Data Deletion in Graph-Based ANN Indexes?

To address the lack of a unified evaluation methodology for data deletion in graph-based ANN indexes, this paper formally defines three baseline approaches—lazy deletion, eager deletion, and reconstruction—proposes a deployment-oriented evaluation framework and metric suite, and introduces the Deletion Control algorithm, which dynamically switches deletion strategies under accuracy constraints based on empirical analysis.

HyperGraphRAG: Retrieval-Augmented Generation via Hypergraph-Structured Knowledge Representation

This paper proposes HyperGraphRAG, the first RAG method based on hypergraph structure, which models n-ary relations (\(n \geq 2\)) via hyperedges. It overcomes the binary-relation bottleneck of existing graph-based RAG methods, achieving comprehensive improvements over StandardRAG and the GraphRAG family on question-answering tasks across medical, agricultural, computer science, and legal domains.

Improving Consistency in Retrieval-Augmented Systems with Group Similarity Rewards

This paper proposes Con-RAG, a framework that trains RAG generators to produce informationally consistent outputs under paraphrased inputs. Its Paraphrased Set GRPO (PS-GRPO) computes group similarity rewards across multiple generations for semantically equivalent queries, improving both consistency and accuracy without requiring explicit ground-truth supervision.

Browse all 24 Information Retrieval & RAG papers →


💻 Code Intelligence (19)

A Self-Improving Coding Agent

This paper proposes SICA (Self-Improving Coding Agent), a coding agent capable of autonomously editing its own codebase to improve performance. By eliminating the distinction between meta-agent and target-agent, SICA achieves iterative self-improvement, advancing from 17% to 53% on a subset of SWE-Bench Verified.

A Stochastic Differential Equation Framework for Multi-Objective LLM Interactions

This paper models multi-objective optimization in iterative LLM interactions as an SDE (drift-diffusion process), quantifies inter-objective coupling via an interference matrix, and analyzes strategy convergence behavior through eigenvalue spectral analysis. Validation on code generation (three objectives: security, efficiency, functionality) demonstrates convergence rates ranging from 0.33 to 1.29 and predictability up to \(R^2 = 0.74\) across different strategies.

AstroVisBench: A Code Benchmark for Scientific Computing and Visualization in Astronomy

AstroVisBench introduces the first code benchmark for evaluating LLMs on astronomical scientific computing and visualization. It extracts 864 tasks (processing + visualization) from 110 Jupyter Notebooks, and designs a dual evaluation pipeline (execution-based variable inspection + VLM-as-Judge visualization scoring, achieving Spearman ρ=0.822 with expert ratings). Evaluation of 8 state-of-the-art models reveals that Gemini 2.5 Pro performs best, yet attains only a 15.7% error-free rate, with FileNotFoundError accounting for 43% of all errors.

VeriMaAS: Automated Multi-Agent Workflows for RTL Design

VeriMaAS proposes a framework for automatically composing multi-agent workflows for RTL code generation. Its core innovation is the direct integration of formal verification feedback from HDL tools (Yosys synthesis + OpenSTA timing analysis) into workflow orchestration, achieving a 2–12% pass@1 improvement on VeriThoughts while requiring only a few hundred samples for controller tuning—an order of magnitude fewer than full fine-tuning.

Co-Evolving LLM Coder and Unit Tester via Reinforcement Learning

This paper proposes CURE, a framework in which a single LLM simultaneously assumes the roles of code generator and unit test generator. Cross-execution between generated code and generated tests constructs a pairwise reward matrix; theoretically derived reward signals then drive reinforcement learning. Without any ground-truth code annotations, CURE achieves co-evolution of both code generation and unit test generation capabilities, substantially outperforming dedicated coder models of comparable scale across five programming benchmarks.

CodeCrash: Exposing LLM Fragility to Misleading Natural Language in Code Reasoning

This paper proposes CodeCrash, a stress-testing framework that systematically evaluates the code reasoning robustness of 17 LLMs through functionally equivalent structural perturbations and misleading natural language injections (comments, print statements, and hints). The framework reveals an average performance drop of 23.2% across models, which CoT prompting reduces only to 13.8%, and is the first to identify the "Reasoning Collapse" phenomenon in large reasoning models (LRMs).

Embedding Alignment in Code Generation for Audio

This paper proposes a dual-MLP + InfoNCE contrastive learning framework that aligns code embeddings (distilroberta-base) and audio embeddings (wav2vec2) in a shared space, enabling LLM-based code generation pipelines to infer musical similarity directly from code without compilation or execution. CKA between the two embedding spaces improves from 0.090 to 0.590.
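
A minimal sketch of the InfoNCE objective mentioned above, assuming the code and audio vectors have already passed through their projection MLPs (omitted here) and that row `i` of each list forms a matched pair. The function names and the temperature value are illustrative, not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(code_emb, audio_emb, temperature=0.1):
    """Mean InfoNCE loss with code vectors as anchors.

    For anchor i, audio vector i is the positive and every other
    audio vector in the batch acts as a negative.
    """
    losses = []
    for i, anchor in enumerate(code_emb):
        logits = [cosine(anchor, a) / temperature for a in audio_emb]
        peak = max(logits)  # subtract the max for a stable log-sum-exp
        log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
        losses.append(log_z - logits[i])
    return sum(losses) / len(losses)


if __name__ == "__main__":
    code = [[1.0, 0.0], [0.0, 1.0]]
    audio = [[0.9, 0.1], [0.2, 0.8]]
    print(info_nce(code, audio))
```

The loss falls as each code vector moves closer to its paired audio vector than to the other audio vectors in the batch.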

Learning From Design Procedure To Generate CAD Programs for Data Augmentation

This paper proposes a CAD program data augmentation paradigm inspired by industrial design workflows. By providing reference surface programs and design procedure descriptions as LLM prompts, the method guides the generation of CAD programs containing B-Spline organic shapes, substantially narrowing the geometric complexity gap between public CAD datasets and industrial-grade designs.

Learning to Solve Complex Problems via Dataset Decomposition

This paper proposes Decomp, a method that employs a teacher model (GPT-4o) to recursively decompose complex math problems into simpler subproblems along reasoning steps, constructs a concept dependency graph to quantify difficulty, and trains student models following an easy-to-hard curriculum. Qwen2.5-1.5B achieves 51.6% on MATH-500 (surpassing MuggleMath's 50.4% with 147K samples), while Qwen3-4B reaches 16.7% on AIME2025 using only 385 samples (surpassing Qwen2.5-72B's 15.0%).

MaintainCoder: Maintainable Code Generation Under Dynamic Requirements

This work is the first to systematically define and address the maintainability problem in LLM-based code generation, contributing both a benchmark and a method. MaintainBench evaluates code maintainability under requirement evolution using 4 change patterns and dynamic metrics; MaintainCoder integrates the Waterfall model, design patterns, and 6 specialized agents, achieving 60%+ improvement on dynamic maintainability metrics while also improving initial code correctness.

Browse all 19 Code Intelligence papers →


🎨 Image Generation (218)

70% Size, 100% Accuracy: Lossless LLM Compression for Efficient GPU Inference via Dynamic-Length Float (DFloat11)

DFloat11 exploits the low-entropy property of exponent bits in BFloat16 weights to losslessly compress LLMs and diffusion models to approximately 70% of their original size (equivalent to ~11 bits) via Huffman coding. It further introduces hierarchical lookup tables and a two-phase GPU kernel for efficient online decompression, enabling lossless inference of Llama 3.1 405B on a single node with 8×80GB GPUs.
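
As a rough illustration of why entropy-coding the exponent helps, the sketch below builds Huffman code lengths over the 8-bit exponent field of BFloat16 bit patterns and estimates the average bits per weight, assuming the sign bit and the 7 mantissa bits are stored as-is. The helper names and the toy data are assumptions; this is not DFloat11's GPU kernel or storage format.

```python
import heapq
from collections import Counter


def huffman_code_lengths(freqs):
    """Return {symbol: code length in bits} for a Huffman code."""
    if len(freqs) == 1:
        return {next(iter(freqs)): 1}
    # Heap entries: (total frequency, unique tiebreaker, {symbol: depth}).
    heap = [(f, i, {s: 0}) for i, (s, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        f1, _, d1 = heapq.heappop(heap)
        f2, _, d2 = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol one level deeper.
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (f1 + f2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]


def bf16_exponent(bits16):
    """Extract the 8-bit exponent from a 16-bit BFloat16 pattern."""
    return (bits16 >> 7) & 0xFF


def average_bits_per_weight(weights_bits):
    """1 sign bit + 7 mantissa bits + Huffman-coded exponent."""
    counts = Counter(bf16_exponent(b) for b in weights_bits)
    lengths = huffman_code_lengths(counts)
    total = sum(counts.values())
    avg_exponent_bits = sum(counts[s] * lengths[s] for s in counts) / total
    return 1 + 7 + avg_exponent_bits


if __name__ == "__main__":
    # Toy weights: 0x3F80 is 1.0, 0x3F00 is 0.5, 0x3E80 is 0.25.
    toy = [0x3F80] * 8 + [0x3F00] * 4 + [0x3E80] * 2
    print(average_bits_per_weight(toy))
```

The more skewed the exponent histogram, the shorter the average code, which is the effect the summary attributes to low-entropy exponent bits.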

A Closer Look at Model Collapse: From a Generalization-to-Memorization Perspective

This paper identifies a generalization-to-memorization transition in diffusion models under self-consuming loops (where each generation of models is trained on synthetic data from the previous one), reveals a strong linear correlation between training set entropy and model generalization (Pearson \(r=0.91\)), and proposes entropy-based data selection strategies (Greedy Selection / Threshold Decay Filter) that effectively slow this transition—reducing FID from 75.7 to 44.7 at iteration 8 under the CIFAR-10 accumulate paradigm.

A Connection Between Score Matching and Local Intrinsic Dimension

This paper proves that the lower bound of the denoising score matching (DSM) loss is precisely the local intrinsic dimension (LID) of the data manifold, thereby establishing the DSM loss itself as an efficient LID estimator—requiring neither gradient computation nor multiple forward passes. On Stable Diffusion 3.5, this approach reduces peak memory usage to approximately 60% of FLIPD while yielding more stable estimates under quantization.

A Data-Driven Prism: Multi-View Source Separation with Diffusion Model Priors

This paper proposes DDPRISM, a method that exploits structural differences among linear transformations across multi-view observations. Within an EM framework, it learns an independent diffusion model prior for each unknown source without requiring any isolated source samples, enabling source separation and posterior sampling. DDPRISM outperforms existing methods on both synthetic benchmarks and real galaxy observations.

A Diffusion Model for Regular Time Series Generation from Irregular Data with Completion and Masking

This paper proposes a two-stage framework for generating regular time series from irregularly sampled data: (1) a TST autoencoder completes missing values to construct a "natural neighborhood," and (2) a masking strategy applied during visual diffusion model training computes loss only on observed pixels, avoiding over-reliance on completed values. The approach achieves an average 70% improvement in discriminative score and a 6.5× training speedup.

A Gradient Flow Approach to Solving Inverse Problems with Latent Diffusion Models

This paper proposes DWGF (Diffusion-regularized Wasserstein Gradient Flow), which rigorously formalizes posterior sampling with latent diffusion models as a regularized gradient flow of KL divergence in the Wasserstein-2 space. An ODE system in the latent space is derived to solve image inverse problems, achieving substantially higher PSNR than baselines on inpainting and super-resolution tasks on FFHQ-512.

Accelerating Parallel Diffusion Model Serving with Residual Compression

This paper proposes CompactFusion, a framework that eliminates communication redundancy in parallel diffusion inference via residual compression—transmitting only the activation differences between adjacent denoising steps rather than full activations. It achieves a 3.0× speedup on 4×L20 GPUs with generation quality significantly superior to DistriFusion, a 6.7× speedup under simulated Ethernet bandwidth, and maintains better quality than DistriFusion even at 100× compression.
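
The idea of sending only step-to-step differences can be sketched as follows. This is a toy under stated assumptions: the coarse rounding quantizer stands in for whatever compressor the paper actually uses, and the class names are hypothetical. The sender mirrors the receiver's reconstruction so that quantization error does not build up across steps.

```python
def quantize(values, step):
    """Coarse uniform quantizer (a stand-in for the real compressor)."""
    return [round(v / step) * step for v in values]


class ResidualSender:
    def __init__(self, size, step=0.05):
        self.step = step
        self.mirror = [0.0] * size  # copy of what the receiver holds

    def encode(self, activation):
        # Compress only the change relative to the receiver's state.
        residual = [a - m for a, m in zip(activation, self.mirror)]
        payload = quantize(residual, self.step)
        # Apply the same update the receiver will apply.
        self.mirror = [m + p for m, p in zip(self.mirror, payload)]
        return payload


class ResidualReceiver:
    def __init__(self, size):
        self.state = [0.0] * size

    def decode(self, payload):
        self.state = [s + p for s, p in zip(self.state, payload)]
        return self.state


if __name__ == "__main__":
    sender, receiver = ResidualSender(4), ResidualReceiver(4)
    for act in ([1.0, -0.5, 0.25, 0.8], [1.01, -0.5, 0.26, 0.8]):
        payload = sender.encode(act)
        print(payload, receiver.decode(payload))
```

When adjacent denoising steps produce similar activations, most residual entries quantize to zero, so the payload compresses far better than the full activation would.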

AccuQuant: Simulating Multiple Denoising Steps for Quantizing Diffusion Models

This paper reveals an error accumulation phenomenon in diffusion model quantization—where quantization errors at each step propagate and amplify into subsequent steps—and proposes explicitly simulating consecutive multi-step denoising during PTQ calibration to jointly optimize quantization parameters, while reducing memory from O(n) to O(1) through a carefully designed objective function.

ALE-Bench: A Benchmark for Long-Horizon Objective-Driven Algorithm Engineering

This paper introduces ALE-Bench, the first AI benchmark targeting scored algorithm engineering contests (AtCoder Heuristic Contest). It curates 40 NP-hard optimization problems and provides an interactive agent evaluation framework. The strongest model, o3-high, achieves only human-average performance in a one-shot setting, with significant gaps between AI and human experts in cross-problem consistency and long-horizon iterative improvement.

Aligning Compound AI Systems via System-level DPO

This paper models compound AI systems as DAGs and proposes the SysDPO framework, which extends DPO to joint multi-component alignment. By leveraging DAG decomposition, system-level preferences are transformed into an end-to-end optimizable loss function. The authors provide theoretical guarantees of \(\beta\)-perfect alignment and demonstrate substantial improvements in collaborative quality on both LLM+diffusion model and LLM+LLM systems.

Browse all 218 Image Generation papers →


🎬 Video Generation (23)

Autoregressive Adversarial Post-Training for Real-Time Interactive Video Generation

This paper proposes AAPT (Autoregressive Adversarial Post-Training), which converts a pretrained video diffusion model into an autoregressive real-time video generator via adversarial training. The model requires only one forward pass per frame (1 NFE), employs student-forcing training to reduce error accumulation, and achieves real-time streaming generation at 736×416 resolution and 24 fps on a single H100 GPU, supporting videos up to one minute in length (1440 frames).

DenseDPO: Fine-Grained Temporal Preference Optimization for Video Diffusion Models

This paper identifies and addresses the motion bias problem in video DPO through three changes. It constructs structurally aligned video pairs by noising and denoising ground-truth videos, which fixes the motion dimension; it annotates dense preferences at the temporal-segment level for more precise learning signals; and it leverages off-the-shelf VLMs for automatic annotation to reduce cost. Using only 1/3 of the annotation data, the method substantially improves motion generation quality while matching visual quality and text alignment.

DisMo: Disentangled Motion Representations for Open-World Motion Transfer

DisMo learns abstract motion representations that are agnostic to appearance, pose, and category from raw videos via a dual-stream architecture (motion extractor + frame generator) and an image-space reconstruction objective. It enables open-world motion transfer across categories and viewpoints, and significantly outperforms video representation models such as V-JEPA on zero-shot action classification.

Force Prompting: Video Generation Models Can Learn and Generalize Physics-based Control Signals

This paper proposes Force Prompting, which uses physical forces (local point forces and global wind forces) as control signals for video generation models. Using only ~15K synthetic training videos (Blender flags and rolling balls) and a single day of training on 4×A100 GPUs, the method achieves remarkable generalization across diverse real-world scenes with varying objects, materials, and geometries, including preliminary mass understanding capabilities.

Foresight: Adaptive Layer Reuse for Accelerated and High-Quality Text-to-Video Generation

This paper proposes Foresight, a training-free adaptive layer reuse framework that establishes per-layer MSE thresholds during a warmup phase and dynamically decides at inference time whether to reuse cached features or recompute each layer. Evaluated on 5 video generation models, Foresight achieves superior quality and speed trade-offs compared to static methods, with up to 2.23× acceleration.

LeMiCa: Lexicographic Minimax Path Caching for Efficient Diffusion-Based Video Generation

This paper proposes LeMiCa, a training-free acceleration framework for diffusion-based video generation that formulates cache scheduling as a lexicographic minimax path optimization problem on a directed acyclic graph (DAG), achieving simultaneous gains in speed and quality (2.9× speedup on Latte; LPIPS as low as 0.05 on Open-Sora) via global error control.

MagCache: Fast Video Generation with Magnitude-Aware Cache

This paper finds that the magnitude ratio of residual outputs between adjacent timesteps in video diffusion models decreases monotonically in a pattern shared across models and prompts, termed the "Unified Magnitude Law." Building on this observation, MagCache models skip-step error accumulation via magnitude ratios, adaptively skips redundant timesteps, and reuses cached outputs, requiring only a single calibration sample. It achieves 2.10–2.68× speedup on Open-Sora, CogVideoX, Wan 2.1, and HunyuanVideo while outperforming TeaCache and other existing methods on all three metrics (LPIPS, SSIM, and PSNR).

Photography Perspective Composition: Towards Aesthetic Perspective Recommendation

This paper proposes a novel "Photography Perspective Composition" (PPC) paradigm that goes beyond traditional cropping-based approaches. It constructs a perspective transformation dataset via 3D reconstruction, generates recommended viewpoints through Image-to-Video generation, aligns with human preferences via RLHF, and evaluates perspective quality using a PQA model.

PhysCtrl: Generative Physics for Controllable and Physics-Grounded Video Generation

PhysCtrl employs diffusion models to learn the physical dynamics distribution of four material types (elastic, sand, plasticine, and rigid bodies), representing dynamics as 3D point trajectories. A diffusion model incorporating spatiotemporal attention and physics constraints is trained on 550K synthetic animations; the generated trajectories drive a pretrained video model to achieve high-fidelity physics video generation controllable by force and material parameters.

PoseCrafter: Extreme Pose Estimation with Hybrid Video Synthesis

This paper proposes PoseCrafter, a training-free framework for extreme pose estimation. It synthesizes high-fidelity intermediate frames via Hybrid Video Generation (HVG, a two-stage pipeline combining DynamiCrafter and ViewCrafter) to address pose estimation for image pairs with minimal or no overlap, and employs a Feature Matching Selector (FMS) to efficiently identify the most informative intermediate frames. The method achieves significant improvements in extreme pose estimation accuracy across four datasets.

Browse all 23 Video Generation papers →


🧩 Multimodal VLM (105)

A Frustratingly Simple Yet Highly Effective Attack Baseline: Over 90% Success Rate Against the Strong Black-box Models of GPT-4.5/4o/o1

This paper proposes M-Attack, which performs random cropping on source images and aligns them with target images via local-global or local-local matching in the embedding space, combined with a multi-CLIP model ensemble. This causes adversarial perturbations to naturally concentrate on semantically critical regions, forming clear semantic details. M-Attack achieves >90% targeted attack success rate against commercial black-box LVLMs including GPT-4.5/4o/o1.

A Multimodal Benchmark for Framing of Oil & Gas Advertising and Potential Greenwashing Detection

This work introduces the first multimodal framing analysis benchmark for oil and gas (O&G) industry video advertisements, comprising 706 videos, 13 framing categories, 50+ entities, and 20 countries. It systematically evaluates six VLMs on greenwashing-related framing detection, finding that GPT-4.1 achieves 79% F1 zero-shot on environmental labels but only 46% on green innovation, thereby exposing implicit framing analysis and cultural context understanding as core challenges for current VLMs.

AdaLRS: Loss-Guided Adaptive Learning Rate Search for Efficient Foundation Model Pretraining

AdaLRS is a plug-and-play online learning rate search algorithm that adaptively adjusts the learning rate by monitoring the loss descent velocity. It reduces the cost of learning rate search from multiple independent training runs to a single run, saving approximately 50% of training cost.

Adapting Vision-Language Models for Evaluating World Models

This paper proposes UNIVERSE (UNIfied Vision-language Evaluator for Rollouts in Simulated Environments), a unified semantic evaluator for world model rollouts constructed by fine-tuning only the projection head of PaliGemma 2 (0.07% of total parameters). UNIVERSE achieves performance comparable to task-specific models on action recognition and character recognition, while exhibiting strong alignment with human judgments.

ADMN: A Layer-Wise Adaptive Multimodal Network for Dynamic Input Noise and Compute Resources

This paper proposes ADMN (Adaptive Depth Multimodal Network), a two-stage training framework: (1) Multimodal LayerDrop fine-tuning to make the backbone robust to arbitrary layer configurations, and (2) a QoI-aware controller that dynamically allocates layer budgets across modalities. ADMN adaptively assigns layers based on per-modality quality-of-information (QoI) under strict compute constraints, matching full-model accuracy while reducing FLOPs by 75% and latency by 60%.

Advancing Compositional Awareness in CLIP with Efficient Fine-Tuning

This paper proposes CLIC, which concatenates two images to form a composite scene and generates hard negatives via cross-image lexical swapping, while constructing multiple positive captions to enhance semantic invariance. By fine-tuning only the CLIP text encoder, CLIC simultaneously improves compositional reasoning (achieving SOTA on SugarCrepe++) and downstream retrieval performance, resolving the long-standing trade-off between compositionality and retrieval in prior methods.

Aligning by Misaligning: Boundary-aware Curriculum Learning for Multimodal Alignment

This paper proposes BACL (Boundary-Aware Curriculum with Local Attention), which combines a learnable boundary-aware negative sampler (via easy-to-hard curriculum learning) with a contrastive local attention loss (for token-level mismatch localization). On LAION-400M, BACL yields a +32% R@1 improvement over CLIP and achieves state-of-the-art results on four large-scale benchmarks.

AntiGrounding: Lifting Robotic Actions into VLM Representation Space for Decision Making

This paper inverts the conventional instruction grounding paradigm — rather than compressing VLM knowledge into intermediate representations (symbolic skills or constraints), it renders candidate robot trajectories into multi-view scene images and evaluates action proposals directly within the VLM's native high-dimensional representation space, enabling zero-shot closed-loop robotic manipulation control.

AQuaMaM: An Autoregressive, Quaternion Manifold Model for Rapidly Estimating Complex SO(3) Distributions

This paper proposes AQuaMaM—a Transformer-based autoregressive quaternion manifold model that represents each projected component of the unit quaternion as a geometrically constrained mixture of uniform distributions, enabling exact likelihood computation and fast sampling on the SO(3) rotation manifold. AQuaMaM achieves 52× faster inference and 14% higher log-likelihood compared to IPDF, with sampled distributions that closely match the ground truth.

Are Vision Language Models Ready for Clinical Diagnosis? A 3D Medical Benchmark for Tumor-centric Visual Question Answering

This paper presents DeepTumorVQA, a 3D diagnostic-grade visual question answering benchmark for abdominal CT tumors, comprising 9,262 CT volumes (3.7 million slices) and 395K expert-level questions. It systematically evaluates the clinical diagnostic capability of four state-of-the-art VLMs, finding that current models perform acceptably on measurement tasks but fall far short of clinical requirements in lesion recognition and reasoning.

Browse all 105 Multimodal VLM papers →


🧠 VLM Reasoning (30)

ACT as Human: Multimodal Large Language Model Data Annotation with Critical Thinking

This paper proposes ACT (Annotation with Critical Thinking), a data pipeline in which an MLLM annotates all samples in bulk, a second MLLM acting as a critic estimates the error probability of each annotation, and only high-suspicion samples are routed to human reviewers. Combined with a theoretically derived ACT loss function, the approach achieves 70–90% reduction in human annotation cost across six cross-modal datasets while maintaining a downstream performance gap of less than 2%.
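
The routing step described above can be sketched in a few lines. The fixed threshold, the function names, and the toy labels are assumptions for illustration; the paper's actual selection rule and its ACT loss are not reproduced here.

```python
def route_to_humans(error_probs, threshold=0.5):
    """Indices of samples whose critic-estimated error probability is high."""
    return [i for i, p in enumerate(error_probs) if p > threshold]


def merge_labels(machine_labels, human_labels):
    """Keep machine labels except where a human reviewer supplied one.

    human_labels maps sample index -> corrected label.
    """
    return [human_labels.get(i, label) for i, label in enumerate(machine_labels)]


if __name__ == "__main__":
    machine = ["cat", "dog", "cat", "bird"]
    critic = [0.1, 0.9, 0.4, 0.7]  # hypothetical critic outputs
    flagged = route_to_humans(critic)
    # A human reviews only the flagged samples (corrections are made up).
    reviewed = {1: "wolf", 3: "bird"}
    print(flagged, merge_labels(machine, reviewed))
```

Raising the threshold sends fewer samples to reviewers, trading human cost against the risk of leaving machine errors uncorrected.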

AffordBot: 3D Fine-grained Embodied Reasoning via Multimodal Large Language Models

This paper introduces a fine-grained 3D embodied reasoning task—jointly predicting the spatial location, motion type, and motion axis of actionable elements—and proposes rendering 3D point clouds into panoramic views with projected affordance candidates, guided by a customized Chain-of-Thought (CoT) reasoning paradigm for MLLMs, achieving state-of-the-art performance with AP25 of 23.3%.

Can LLMs Reason Over Non-Text Modalities in a Training-Free Manner? A Case Study with In-Context Representation Learning

This paper proposes In-Context Representation Learning (ICRL), the first training-free framework that injects representations from non-text-modality foundation models (FMs) into a text-only LLM for few-shot reasoning. Two strategies are introduced: PCA-based text-level injection and optimal transport (OT)-based embedding alignment, enabling cross-modal knowledge utilization without any parameter updates.

Can Multi-Modal LLMs Provide Live Step-by-Step Task Guidance?

This paper introduces the Qualcomm Interactive Cooking benchmark and the LiveMamba model, presenting the first systematic evaluation of multimodal LLMs for providing real-time, step-by-step task guidance in streaming video — encompassing instruction delivery, completion detection, and error feedback.

READ: Enhancing Compositional Reasoning in CLIP via Reconstruction and Alignment of Text Descriptions

This paper proposes READ, a fine-tuning method that enhances the compositional reasoning capability of CLIP's text encoder via two auxiliary objectives: (1) token-level reconstruction, where a frozen decoder reconstructs alternative descriptions from text embeddings, and (2) sentence-level alignment, which enforces consistency among embeddings of paraphrases. READ achieves state-of-the-art performance on 5 compositional reasoning benchmarks, outperforming NegCLIP by 4.5% and FSC-CLIP by 4.1%.

Enhancing Outcome Reward-Based RL Training of MLLMs with Self-Consistency Sampling

Outcome-reward RL training can induce unfaithful reasoning trajectories in multimodal multiple-choice tasks. To address this, the paper proposes Self-Consistency Sampling (SCS), which obtains consistency rewards via truncation-resampling and visual perturbation to penalize spurious reasoning. When combined with RLOO, SCS achieves an average improvement of 7.7 percentage points across six benchmarks.

FlexAC: Towards Flexible Control of Associative Reasoning in Multimodal Large Language Models

FlexAC identifies that associative reasoning in MLLMs is primarily encoded in intermediate layers. By extracting steering vectors from hallucinated responses and injecting them into intermediate-layer representations at inference time, it enables flexible control over faithfulness and creativity—reducing hallucination rate by 29% (CHAIR) and improving creativity by 5.8× (Creation-MMBench), all without any training.

GUI-Rise: Structured Reasoning and History Summarization for GUI Navigation

This paper proposes GUI-Rise, a framework that jointly designs three subtasks—structured reasoning (progress estimation + decision reasoning), action prediction, and history summarization—combined with GRPO reinforcement learning and a history summarization reward, to significantly improve the cross-domain generalization of GUI navigation agents.

iFinder: Structured Zero-Shot VLM Grounding for Dash-Cam Video Reasoning

This paper proposes iFinder, a modular training-free framework that decouples dash-cam video understanding into perception (structured scene representation) and reasoning (LLM). Through a hierarchical data structure and a three-block prompting strategy, iFinder endows LLMs with interpretable spatiotemporal reasoning capabilities, achieving zero-shot superiority over end-to-end V-VLMs across four driving video benchmarks, with accident reasoning accuracy gains of up to 39%.

MIRAGE: A Benchmark for Multimodal Information-Seeking and Reasoning in Agriculture

MIRAGE is the first multimodal benchmark constructed from real agricultural expert consultation dialogues (35,000+), evaluating vision-language models on domain-level entity identification, causal reasoning, and clarify-or-respond decision-making. The benchmark proves difficult: even GPT-4.1 achieves only 43.9% identification accuracy.

Browse all 30 VLM Reasoning papers →


⚡ VLM Efficiency (8)

Balanced Token Pruning: Accelerating Vision Language Models Beyond Local Optimization

This paper proposes Balanced Token Pruning (BTP), which jointly considers the impact of pruning on both the current layer (local) and subsequent layers (global). BTP emphasizes diversity preservation in shallow layers to maintain downstream representation quality, and attention-based selection in deep layers to preserve local output consistency. On multiple LVLMs including LLaVA and Qwen2.5-VL, BTP retains 98% of the original model's performance while keeping only 22% of visual tokens.

Beyond Greedy Exits: Improved Early Exit Decisions for Risk Control and Reliability

UAT (Unsupervised Adaptive Thresholding) designs a reliability function for early-exit DNNs to assess the quality of intermediate layer outputs, and employs a multi-armed bandit (MAB) algorithm to dynamically learn optimal exit thresholds at inference time, achieving 1.7–2.1× speedup with less than 2% performance degradation while remaining robust to distribution shift.

ElasticMM: Efficient MLLM Serving with Elastic Multimodal Parallelism

This paper proposes the Elastic Multimodal Parallelism (EMP) paradigm and the ElasticMM system, which disaggregates different stages of multimodal inference into independent instances via modality-aware load balancing and elastic partition scheduling, achieving up to 4.2× TTFT reduction and 3.2–4.5× throughput improvement over vLLM.

FlowCut: Rethinking Redundancy via Information Flow for Efficient Vision-Language Models

This paper reexamines visual token redundancy in VLMs through the lens of information flow and reports three findings: the CLS token acts as an information relay, redundancy emerges progressively, and single-layer, single-criterion scoring is unreliable. Building on these findings, it proposes FlowCut, an information-flow-aware pruning framework based on multi-criteria cumulative importance, which surpasses SOTA by 1.6% on LLaVA-1.5-7B at an 88.9% token reduction rate, and by 4.3% on LLaVA-NeXT-7B.

HAWAII: Hierarchical Visual Knowledge Transfer for Efficient VLM

This paper proposes the Hawaii framework, which distills knowledge from multiple visual experts into a single visual encoder via Mixture of LoRA Adapters (MoLA) and Hierarchical Knowledge Distillation (HKD), significantly improving the visual understanding capability of VLMs without incurring any additional inference cost.

PrefixKV: Adaptive Prefix KV Cache is What Vision Instruction-Following Models Need for Efficient Generation

PrefixKV identifies that the importance distributions of KV caches vary substantially across layers, and formalizes the per-layer cache sizing problem as a global prefix configuration search. A binary search is employed to find the optimal cumulative priority threshold that maximizes contextual information retention in each layer. At a 20% retention ratio, PrefixKV incurs only a 0.49 PPL degradation while delivering a 1.8× inference speedup.
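
The threshold search described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes each layer keeps the smallest set of top-scoring cache entries whose cumulative normalized importance reaches a shared threshold, and that the threshold is binary-searched to fit a global retention budget. The names `prefix_sizes`, `search_threshold`, and `budget_ratio` are hypothetical, and the retention objective used in the paper is not modeled here.

```python
def prefix_sizes(layer_scores, threshold):
    """Per layer, count the top-scoring entries needed to reach the threshold.

    threshold is a fraction in [0, 1] of each layer's total importance.
    """
    sizes = []
    for scores in layer_scores:
        ranked = sorted(scores, reverse=True)
        total = sum(ranked)
        cumulative, kept = 0.0, 0
        while kept < len(ranked) and cumulative < threshold * total:
            cumulative += ranked[kept]
            kept += 1
        sizes.append(kept)
    return sizes


def search_threshold(layer_scores, budget_ratio, steps=40):
    """Binary-search the largest shared threshold that fits the global budget.

    Assumption: the number of kept entries never decreases as the threshold
    grows, so bisection over [0, 1] is valid.
    """
    total_entries = sum(len(s) for s in layer_scores)
    lo, hi = 0.0, 1.0
    for _ in range(steps):
        mid = (lo + hi) / 2
        if sum(prefix_sizes(layer_scores, mid)) <= budget_ratio * total_entries:
            lo = mid  # still within budget, try retaining more
        else:
            hi = mid  # over budget, lower the threshold
    return lo, prefix_sizes(layer_scores, lo)


# Toy importance scores: layer 0 is concentrated, layer 1 is flat.
toy_scores = [[7, 1, 1, 1], [1, 1, 1, 1]]
threshold, sizes = search_threshold(toy_scores, budget_ratio=0.5)
```

In this toy setup the concentrated layer ends up with a smaller cache than the flat one, which is the layer-wise adaptivity the summary describes.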

SCOPE: Saliency-Coverage Oriented Token Pruning for Efficient Multimodal LLMs

This paper proposes SCOPE, a visual token pruning strategy that jointly models saliency and coverage. By iteratively selecting tokens with the highest SCOPE scores, it preserves semantic completeness and retains 96% of LLaVA-1.5's performance under a 9× token reduction.

ViSpec: Accelerating Vision-Language Models with Vision-Aware Speculative Decoding

To address the difficulty of draft models in handling redundant visual tokens during VLM speculative decoding, this paper proposes ViSpec, a framework that achieves significant acceleration (up to 3.22×) in VLM speculative decoding for the first time, via a visual adapter for image token compression, global visual feature injection, and synthetic training data generation.


🎵 Audio & Speech (47)

A Multi-Task Benchmark for Abusive Language Detection in Low-Resource Settings

This paper introduces TiALD (Tigrinya Abusive Language Detection), the first large-scale multi-task benchmark dataset for the low-resource Tigrinya language. It comprises 13,717 YouTube comments annotated jointly across three tasks—abusive language detection, sentiment analysis, and topic classification—and demonstrates that a compact fine-tuned model (TiRoBERTa, 125M parameters) consistently outperforms frontier LLMs such as GPT-4o and Claude Sonnet 3.7 across all tasks.

A TRIANGLE Enables Multimodal Alignment Beyond Cosine Similarity

TRIANGLE proposes using the area of the triangle formed by three modal embedding vectors in high-dimensional space as a similarity measure, replacing traditional pairwise cosine similarity to achieve joint alignment of video, audio, and text modalities. The method surpasses state-of-the-art by up to 9 Recall@1 points on video-text retrieval and related tasks.
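
As a rough illustration of the area-based similarity, the sketch below computes the area of a triangle in arbitrary dimension from the Gram determinant of two edge vectors. It assumes the triangle's vertices are the endpoints of the three embeddings and that a smaller area indicates tighter alignment; the function name `triangle_area` and the toy vectors are hypothetical, and the paper's exact normalization and loss are not reproduced.

```python
import math


def triangle_area(video, audio, text):
    """Area of the triangle whose vertices are the endpoints of three vectors."""
    # Two edge vectors sharing the `video` vertex.
    u = [a - v for a, v in zip(audio, video)]
    w = [t - v for t, v in zip(text, video)]
    uu = sum(x * x for x in u)
    ww = sum(x * x for x in w)
    uw = sum(x * y for x, y in zip(u, w))
    # Gram-determinant form of the area, valid in any dimension.
    # max(..., 0.0) guards against tiny negative values from rounding.
    return 0.5 * math.sqrt(max(uu * ww - uw * uw, 0.0))


# Unit right triangle embedded in 3D.
area = triangle_area([0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0])  # 0.5
```

Unlike pairwise cosine similarity, a single scalar here depends on all three modalities at once, and it collapses to zero when the three embeddings coincide or are collinear.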

Accelerate Creation of Product Claims Using Generative AI

This paper develops the Claim Advisor platform, leveraging LLM in-context learning and LoRA fine-tuning to accelerate the search, generation, refinement, and ranking of product claims for consumer goods. By emulating the MaxDiff research methodology, a fine-tuned Phi-3 14B model outperforms GPT-4o on claim ranking using only 1 in-context example versus GPT-4o's 100, and after three iterative rounds, 100% of generated claims achieve a "highly appealing" rating.

Adapting Speech Language Model to Singing Voice Synthesis

This paper adapts a 1.7B-parameter TTS-pretrained Speech Language Model to the Singing Voice Synthesis (SVS) task via score tokenization, multi-stream LM prediction, conditional flow matching refinement, and a vocoder. Using only 135 hours of synthesized singing data, the system achieves performance comparable to dedicated SVS systems.

Associative Syntax and Maximal Repetitions Reveal Context-Dependent Complexity in Fruit Bat Communication

This paper proposes an unsupervised approach for inferring discrete units, grammar types, and temporal structure from fruit bat vocalizations, and introduces Maximal Repetitions (MRs) to animal communication research for the first time, finding that communicative complexity is significantly higher in conflict contexts than in affiliative ones.

AudSemThinker: Enhancing Audio-Language Models through Reasoning over Semantics of Sound

AudSemThinker introduces a structured semantic reasoning framework for audio-language models by defining 9 categories of sound semantic descriptors (who/what/how/when/where, etc.). Built on Qwen2.5-Omni-7B and trained via SFT + GRPO (with verifiable rewards and length constraints), the model produces three-stage outputs in the format \<think>\<semantic_elements>\<answer>, achieving 66.70% on the MMAU benchmark—surpassing Audio-Reasoner (61.71%) and Qwen2.5-Omni (65.60%).

BNMusic: Blending Environmental Noises into Personalized Music

This paper proposes BNMusic, a two-stage framework that blends environmental noises into personalized generated music. Stage 1 generates rhythm-aligned music via mel-spectrogram outpainting and inpainting; Stage 2 adaptively amplifies the music signal based on auditory masking theory to reduce noise perception. The approach requires no additional training and significantly outperforms baselines on EPIC-SOUNDS and ESC-50.

Brain-tuning Improves Generalizability and Efficiency of Brain Alignment in Speech Models

This paper proposes Multi-brain-tuning, a method that jointly fine-tunes pretrained speech models on fMRI data from multiple participants, reducing the data required for brain alignment by 5×, improving alignment by up to 50%, and generalizing to unseen participants and datasets.

Can LLMs Outshine Conventional Recommenders? A Comparative Evaluation

This paper proposes RecBench, a comprehensive evaluation framework that systematically compares 17 LLMs against 10 conventional DLRMs across 5 domain-specific datasets. Results show that LLM-based recommenders achieve up to 5% AUC improvement on CTR tasks and up to 170% NDCG@10 improvement on sequential recommendation, yet incur 10–1000× slower inference. Conventional DLRMs augmented with LLM semantic embeddings (LLM-for-RS) attain approximately 95% of LLM performance at 20× higher throughput, making this paradigm the most industrially viable solution at present.

Data-Juicer 2.0: Cloud-Scale Adaptive Data Processing for and with Foundation Models

Data-Juicer 2.0 is a cloud-scale multimodal data processing system for foundation models, featuring 150+ operators spanning text, image, video, and audio. It supports adaptive distributed execution (Ray/MaxCompute), efficiently processes TB-scale data on 10,000+ CPU cores, and has been widely adopted in products such as Alibaba Cloud PAI.

Browse all 47 Audio & Speech papers →


🔎 AIGC Detection (9)

ASCIIBench: Evaluating Language-Model-Based Understanding of Visually-Oriented Text

This paper introduces ASCIIBench, the first publicly available benchmark for ASCII art understanding and generation (5,315 images, 752 categories). Systematic evaluation reveals that the visual modality substantially outperforms the text modality, multimodal fusion yields no benefit, and CLIP exhibits a fundamental bottleneck in representing ASCII structure—only categories with high intra-class consistency can be effectively distinguished.

Can LLMs Write Faithfully? An Agent-Based Evaluation of LLM-generated Islamic Content

This paper proposes a dual-agent (quantitative + qualitative) evaluation framework that systematically assesses the faithfulness of GPT-4o, Ansari AI, and Fanar on Islamic content generation tasks across three dimensions—theological accuracy, citation integrity, and stylistic appropriateness—finding that even the best-performing model exhibits significant deficiencies in citation reliability.

Classical Planning with LLM-Generated Heuristics: Challenging the State of the Art with Python Code

This paper proposes having LLMs generate Python code for domain-dependent heuristic functions (rather than directly generating plans). Candidate heuristics are obtained via \(n\) samples and the best is selected on a training set, then injected into the Python planner Pyperplan for use with GBFS. The approach surpasses all C++ Fast Downward traditional heuristics on 8 IPC 2023 benchmark domains using pure Python, matches the SOTA learned planner \(h^{\mathrm{WLF}}_{\mathrm{GPR}}\), and guarantees 100% correctness for all plans found.
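
To make the role of the heuristic concrete, here is a minimal greedy best-first search (GBFS) with a pluggable heuristic function. It is a sketch on a hypothetical toy domain and does not use Pyperplan's API; the hand-written `abs(5 - s)` heuristic merely stands in for an LLM-generated one, and `gbfs`, `toy_successors`, and `run_plan` are illustrative names.

```python
import heapq


def gbfs(start, is_goal, successors, heuristic):
    """Greedy best-first search: always expand the state with the lowest h."""
    # The counter breaks ties so the heap never has to compare states.
    frontier = [(heuristic(start), 0, start, [])]
    seen = {start}
    counter = 1
    while frontier:
        _, _, state, plan = heapq.heappop(frontier)
        if is_goal(state):
            return plan
        for action, nxt in successors(state):
            if nxt not in seen:
                seen.add(nxt)
                heapq.heappush(
                    frontier, (heuristic(nxt), counter, nxt, plan + [action])
                )
                counter += 1
    return None  # search space exhausted without reaching the goal


def toy_successors(state):
    """Hypothetical toy domain: integer states in 0..10 with two actions."""
    if state + 1 <= 10:
        yield "inc", state + 1
    if state * 2 <= 10:
        yield "double", state * 2


def run_plan(state, plan):
    """Replay a plan to check that it reaches the intended state."""
    for action in plan:
        state = state + 1 if action == "inc" else state * 2
    return state


plan = gbfs(0, lambda s: s == 5, toy_successors, lambda s: abs(5 - s))
```

Because the search only returns a plan after the goal test succeeds on a reached state, any plan it finds is valid regardless of heuristic quality; the heuristic affects only how many states are expanded.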

CLAWS: Creativity Detection for LLM-Generated Solutions Using Attention Window of Sections

This paper proposes CLAWS, a method that analyzes the attention weight distribution of LLMs across different prompt sections during mathematical solution generation to classify outputs as "creative," "typical," or "hallucinated," without requiring human evaluation.

DuoLens: A Framework for Robust Detection of Machine-Generated Multilingual Text and Code

DuoLens is an AI-generated content detection framework based on dual-encoder fusion of CodeBERT and CodeBERTa. It achieves AUROC of 0.97–0.99 on multilingual text (8 languages) and source code (7 programming languages) at significantly reduced computational cost (8–12× lower latency, 3–5× lower VRAM), substantially outperforming large models such as GPT-4o.

"Jutters"

Through the metaphor of the Dutch tradition of jutters (beachcombers), this work constructs an immersive installation art piece that integrates real beach debris with AI-generated images and videos, guiding visitors to adopt a beachcomber's mindset in reflecting on how to engage with AI-generated content.

QiMeng-NeuComBack: Self-Evolving Translation from IR to Assembly Code

This paper introduces the NeuComBack benchmark for evaluating neural compilation on IR-to-assembly translation tasks, and proposes a self-evolving prompt optimization method that iteratively improves compilation prompts by learning from LLM self-debugging trajectories. The approach raises correctness from 44% to 64%, with 87.5% of correctly generated programs outperforming clang-O3.

Reasoning Compiler: LLM-Guided Optimizations for Efficient Model Serving

This paper proposes Reasoning Compiler, which models compiler optimization as a sequential decision-making process, employing an LLM as a context-aware proposal engine combined with MCTS to balance exploration and exploitation. The approach achieves an average 5.0× speedup across 5 representative benchmarks and 5 hardware platforms, with 10.8× better sampling efficiency than TVM's evolutionary search.

Synthesizing Performance Constraints for Evaluating and Improving Code Efficiency

This paper proposes Wedge, a framework that uses LLMs to synthesize performance-characterizing constraints to guide constraint-aware fuzzing, generating stress-test inputs that expose code performance bottlenecks. It further constructs the PerfForge benchmark, enabling LLM-based code optimizers (e.g., Effi-Learner) to achieve up to 24% additional reduction in CPU instructions.


🧊 3D Vision (116)

3D Visual Illusion Depth Estimation

This paper reveals that 3D visual illusions (e.g., wall paintings, screen replays, mirror reflections) severely mislead existing state-of-the-art monocular and stereo depth estimation methods. The authors construct a large-scale dataset comprising approximately 3k scenes and 200k images, and propose a VLM-driven monocular-stereo adaptive fusion framework that achieves state-of-the-art performance across diverse illusion scenarios.

4DGT: Learning a 4D Gaussian Transformer Using Real-World Monocular Videos

This paper proposes 4DGT — a 4D Gaussian-based Transformer model trained entirely on real-world monocular posed videos that reconstructs dynamic scenes in seconds via feed-forward inference, significantly outperforming comparable feed-forward networks while achieving accuracy on par with optimization-based methods.

Anti-Aliased 2D Gaussian Splatting

This paper proposes AA-2DGS, which addresses severe aliasing artifacts in 2D Gaussian Splatting under varying sampling rates through two complementary mechanisms: a world-space flat smoothing kernel and an object-space Mip filter. The method significantly improves multi-scale rendering quality while preserving the geometric accuracy advantages of 2DGS.

ARMesh: Autoregressive Mesh Generation via Next-Level-of-Detail Prediction

This paper proposes to formulate 3D mesh generation as a coarse-to-fine, next-level-of-detail prediction process. By reversing a generalized mesh simplification algorithm (GSlim), a progressive refinement sequence is obtained, which is then learned autoregressively via a Transformer. Generation begins from a single point and incrementally adds geometric and topological detail to produce a complete mesh.

AtlasGS: Atlanta-world Guided Surface Reconstruction with Implicit Structured Gaussians

AtlasGS incorporates the Atlanta-world structural prior into an implicit-structured Gaussian representation to achieve smooth surface reconstruction that preserves high-frequency detail in indoor and urban scenes, outperforming existing implicit and explicit methods.

BecomingLit: Relightable Gaussian Avatars with Hybrid Neural Shading

This paper proposes BecomingLit, a method that reconstructs high-fidelity, relightable, and real-time renderable head avatars from low-cost light stage multi-view sequences using 3D Gaussian primitives and hybrid neural shading (neural diffuse BRDF + analytic Cook-Torrance specular). A new publicly available OLAT facial dataset is also released.

CLIPGaussian: Universal and Multimodal Style Transfer Based on Gaussian Splatting

CLIPGaussian proposes the first unified style transfer framework based on Gaussian Splatting, supporting text- and image-guided stylization of 2D images, videos, 3D objects, and 4D dynamic scenes. It integrates as a plug-and-play module into existing GS pipelines without requiring large generative models or retraining from scratch, and without altering model size.

Concerto: Joint 2D-3D Self-Supervised Learning Emerges Spatial Representations

Concerto combines intra-modal 3D point cloud self-distillation with cross-modal 2D-3D joint embedding prediction. Through a minimalist design, a single point cloud encoder (PTv3) learns spatial representations that surpass both 2D/3D unimodal methods and their naive concatenation, achieving state-of-the-art performance on multiple 3D scene understanding benchmarks (ScanNet semantic segmentation: 80.7% mIoU).

Copresheaf Topological Neural Networks: A Generalized Deep Learning Framework

This paper proposes Copresheaf Topological Neural Networks (CTNNs), which leverage the algebraic-topological notion of copresheaves to define directional, heterogeneous message passing on combinatorial complexes. The framework unifies CNNs, GNNs, Transformers, Sheaf Neural Networks, and Topological Neural Networks as special cases, and surpasses conventional baselines on physics simulation, graph classification, and higher-order complex classification tasks.

CosmoBench: A Multiscale, Multiview, Multitask Cosmology Benchmark for Geometric Deep Learning

This paper introduces CosmoBench—the largest cosmological geometric deep learning benchmark to date—comprising 34,752 point clouds and 24,996 directed trees across multiple scales, viewpoints, and tasks. A key finding is that simple linear models sometimes outperform large GNNs.

Browse all 116 3D Vision papers →


🎯 Object Detection (27)

ADPretrain: Advancing Industrial Anomaly Detection via Anomaly Representation Pretraining

This work presents ADPretrain, the first dedicated representation pretraining framework for industrial anomaly detection. By learning residual feature representations via angle-oriented and norm-oriented contrastive losses on the large-scale RealIAD dataset, the pretrained features consistently improve five mainstream embedding-based AD methods across five datasets and five backbone networks when substituted for the original features.

EPHAD: An Evidence-Based Post-Hoc Adjustment Framework for Anomaly Detection Under Data Contamination

EPHAD proposes a test-time post-processing framework that corrects the output of anomaly detection models trained on contaminated data via Bayesian-style fusion with external evidence (e.g., CLIP, LOF) through exponential tilting. The framework requires no access to the training pipeline and consistently improves detection performance of contaminated models across 8 visual and 26 tabular AD datasets.
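
Exponential tilting in general reweights a base distribution by an exponentiated evidence term and renormalizes. The sketch below shows that generic operation on a toy discrete distribution; the function name `exponential_tilt`, the parameter `beta`, and the sign convention for the evidence are assumptions for illustration, not EPHAD's exact formulation.

```python
import math


def exponential_tilt(model_probs, evidence, beta):
    """Reweight model_probs by exp(beta * evidence) and renormalize.

    beta = 0 returns the (normalized) model output unchanged; larger beta
    trusts the external evidence more.
    """
    weights = [p * math.exp(beta * e) for p, e in zip(model_probs, evidence)]
    normalizer = sum(weights)
    return [w / normalizer for w in weights]


# Two samples the contaminated model scores identically; the hypothetical
# external evidence favors the second one.
unchanged = exponential_tilt([0.5, 0.5], [0.0, 1.0], beta=0.0)
tilted = exponential_tilt([0.5, 0.5], [0.0, 1.0], beta=1.0)
```

The operation needs only the model's outputs and the evidence scores, which is consistent with the summary's point that no access to the training pipeline is required.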

Ascent Fails to Forget

Starting from the statistical dependence between the forget set and the retain set, this paper theoretically and empirically demonstrates that the widely adopted gradient ascent / Descent-Ascent (DA) family of machine unlearning methods fails systematically in the presence of data correlations. In logistic regression, the DA solution is provably farther from the oracle than the original model, and in non-convex settings DA traps the model in inferior local minima.

Automated Detection of Visual Attribute Reliance with a Self-Reflective Agent

This paper proposes a self-reflective agent framework that automatically detects attribute reliance in visual models through an iterative hypothesis generation–testing–verification–reflection loop (e.g., CLIP recognizing "teacher" via classroom backgrounds, YOLOv8 detecting pedestrians via crosswalks). Evaluated on a benchmark of 130 models with injected known attribute dependencies, self-reflection is shown to significantly improve detection accuracy.

AutoSciDACT: Automated Scientific Discovery through Contrastive Embedding and Hypothesis Testing

This work proposes the AutoSciDACT pipeline, which first employs supervised contrastive learning to compress high-dimensional scientific data into a 4-dimensional embedding space, then applies NPLM (New Physics Learning Machine) likelihood-ratio testing to statistically quantify distributional deviations in the embedding space. The pipeline achieves \(\geq 3\sigma\) discovery at signal injection ratios of \(\leq 1\%\) across astronomical, particle physics, pathology, image, and synthetic datasets.

BurstDeflicker: A Benchmark Dataset for Flicker Removal in Dynamic Scenes

This paper introduces BurstDeflicker, the first large-scale benchmark dataset for multi-frame flicker removal (MFFR), comprising three complementary subsets — Retinex-based synthetic data, real-world static data, and green-screen dynamic data — systematically addressing the core bottleneck of obtaining aligned flickering–clean image pairs in dynamic scenes.

CQ-DINO: Mitigating Gradient Dilution via Category Queries for Vast Vocabulary Object Detection

To address positive gradient dilution and hard-negative gradient dilution in large-vocabulary (>10K category) object detection, this paper proposes CQ-DINO: replacing the classification head with learnable category queries and using image-guided Top-K category selection to reduce the negative space by 100×. CQ-DINO surpasses the previous SOTA by 2.1% AP on V3Det (13,204 categories) while remaining competitive on COCO.

DCAD-2000: A Multilingual Dataset across 2000+ Languages with Data Cleaning as Anomaly Detection

This work constructs DCAD-2000, a multilingual dataset covering 2,282 languages and 46.72 TB of text, and proposes a language-agnostic data cleaning framework that reformulates cleaning as anomaly detection. The framework extracts 8-dimensional statistical features per document and applies Isolation Forest for dynamic noise filtering. Effectiveness is validated on multiple multilingual benchmarks, with particularly notable gains on low-resource languages.

DetectiumFire: A Comprehensive Multi-modal Dataset Bridging Vision and Language for Fire Understanding

DetectiumFire constructs the largest multi-modal fire understanding dataset — 14.5K real images + 2.5K videos + 8K synthetic images + 12K RLHF preference pairs — with a low duplication rate (0.03 PHash vs. D-Fire 0.15), a 4-level severity classification scheme, and detailed scene descriptions. Fine-tuning YOLOv11m achieves mAP 43.74, and fine-tuning LLaMA-3.2-11B yields 83.84% accuracy on fire severity classification.

DETree: DEtecting Human-AI Collaborative Texts via Tree-Structured Hierarchical Representation Learning

This paper proposes DETree, a framework that constructs a Hierarchical Affinity Tree (HAT) to model the hierarchical relationships among diverse human-AI collaborative text generation processes, and designs a Tree-Structured Contrastive Loss (TSCL) to align the representation space. DETree achieves significant advantages in mixed-text detection and OOD generalization scenarios.

Browse all 27 Object Detection papers →


✂️ Segmentation (45)

Alligat0R: Pre-Training through Covisibility Segmentation for Relative Camera Pose Regression

This paper replaces CroCo's cross-view completion with covisibility segmentation as a stereo vision pre-training task, labeling each pixel as co-visible, occluded, or out-of-view. The approach significantly outperforms CroCo in low-overlap scenarios and achieves a first-place overall success rate of 60.3% on the RUBIK benchmark.

Attention (as Discrete-Time Markov) Chains

This work reinterprets the softmax-normalized attention matrix as the transition probability matrix of a Discrete-Time Markov Chain (DTMC), and proposes Multi-Bounce Attention and TokenRank (stationary distribution, analogous to PageRank) to capture indirect attention paths and global token importance. The approach achieves 94.29% mAP on ImageNet segmentation and enhances image generation quality in Self-Attention Guidance.
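
The stationary-distribution idea can be sketched with plain power iteration. This assumes a row-stochastic attention matrix (each row sums to 1 after softmax) and a chain for which the iteration converges; the function name `token_rank` and the 2×2 toy matrix are hypothetical, and power iteration is one possible solver rather than necessarily the one used in the paper.

```python
def token_rank(attn, iters=1000, tol=1e-12):
    """Approximate pi with pi = pi @ attn for a row-stochastic matrix attn."""
    n = len(attn)
    pi = [1.0 / n] * n  # start from the uniform distribution
    for _ in range(iters):
        # One step of the chain: new_pi[j] = sum_i pi[i] * attn[i][j]
        nxt = [sum(pi[i] * attn[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(nxt, pi)) < tol:
            return nxt
        pi = nxt
    return pi


# Toy attention over two tokens; each row sums to 1.
toy_attn = [[0.5, 0.5], [0.2, 0.8]]
ranks = token_rank(toy_attn)  # approaches [2/7, 5/7]
```

Each iteration propagates attention one more step along the chain, so the fixed point reflects indirect attention paths as well as direct ones, which is the analogy to PageRank drawn above.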

ConnectomeBench: Can LLMs Proofread the Connectome?

This paper introduces ConnectomeBench, the first standardized benchmark for evaluating multimodal LLMs on three key connectomics proofreading tasks: segment identification, split error correction, and merge error detection. o4-mini achieves 85% on the split correction multiple-choice task, yet merge error detection remains significantly below human expert performance.

COS3D: Collaborative Open-Vocabulary 3D Segmentation

This paper proposes COS3D — a collaborative prompt-segmentation framework that constructs a collaborative field comprising an instance field and a language field. During training, the language field is built via instance-to-language feature mapping; during inference, language-to-instance adaptive prompt refinement generates precise segmentation results. COS3D substantially outperforms existing methods on two mainstream benchmarks.

Diffusion-Driven Two-Stage Active Learning for Low-Budget Semantic Segmentation

This paper proposes a two-stage active learning pipeline (coverage first, then uncertainty) that leverages multi-scale features from pretrained diffusion models to achieve efficient semantic segmentation under extremely low annotation budgets.

Exploring Structural Degradation in Dense Representations for Self-supervised Learning

This paper identifies and systematically investigates the Self-supervised Dense Degradation (SDD) phenomenon — where longer training improves classification yet hurts dense task performance — and proposes the DSE metric along with DSE-guided model selection and regularization strategies, achieving an average mIoU improvement of 3.0%.

Fast and Fluent Diffusion Language Models via Convolutional Decoding and Rejective Fine-tuning

By introducing convolutional decoding normalization (replacing hard semi-autoregressive chunking) and rule-based rejective fine-tuning (R2FT), the proposed method achieves generation quality at 128 inference steps comparable to 512+ steps, reaching state-of-the-art performance among diffusion language models (DLMs).

FAST: Foreground-aware Diffusion with Accelerated Sampling Trajectory for Segmentation-oriented Anomaly Synthesis

FAST introduces explicit mechanisms to preserve anomaly regions throughout the diffusion trajectory: AIAS compresses the multi-step reverse process of discrete diffusion into a small number of coarse-to-fine analytical updates, while FARM reconstructs and reinjects anomaly foregrounds at each step, yielding a method that is both fast and better suited for generating training data for downstream anomaly segmentation models.

FineRS: Fine-grained Reasoning and Segmentation of Small Objects with Reinforcement Learning

FineRS is a two-stage MLLM reinforcement learning framework comprising Global Semantic Exploration (GSE) and Localized Perceptual Refinement (LPR), coupled via a locate-informed retrospective reward. Evaluated on the newly constructed FineRS-4k UAV high-resolution dataset, it achieves reasoning and segmentation of ultra-small objects with a gIoU of 55.1% (surpassing Seg-Zero† by 8.5%) while simultaneously supporting VQA (MVQA 83.3%).

GTPBD: A Fine-Grained Global Terraced Parcel and Boundary Dataset

This paper introduces GTPBD, the first fine-grained global terraced parcel and boundary dataset, comprising 47,537 high-resolution images (0.5–0.7 m) with over 200,000 manually annotated parcels. It provides three-level labels supporting four tasks—semantic segmentation, edge detection, agricultural parcel extraction, and unsupervised domain adaptation—and presents comprehensive benchmarks across 20 methods.

Browse all 45 Segmentation papers →


🖼️ Image Restoration (26)

Adaptive Discretization for Consistency Models

This paper proposes ADCM, which formalizes the discretization step size of consistency models as a constrained optimization problem balancing local consistency (trainability) and global consistency (stability), derives a closed-form solution via the Gauss-Newton method, and achieves adaptive discretization that surpasses all prior CMs on CIFAR-10 using less than 25% of the training budget.

Audio Super-Resolution with Latent Bridge Models

This paper proposes AudioLBM, which compresses audio waveforms into a continuous latent space and employs a bridge model to realize a latent-to-latent generation process from low-resolution to high-resolution. Combined with frequency-aware training for broader data utilization and a cascaded design to surpass the 48kHz ceiling, AudioLBM comprehensively outperforms methods such as AudioSR across speech, sound effects, and music, while achieving any-to-192kHz audio super-resolution for the first time.

DP²O-SR: Direct Perceptual Preference Optimization for Real-World Image Super-Resolution

This paper proposes DP²O-SR, a framework that exploits the inherent stochasticity of diffusion models to generate diverse super-resolution outputs, constructs preference pairs via a hybrid perceptual reward, and introduces a Hierarchical Preference Optimization (HPO) strategy to adaptively weight training pairs — significantly improving perceptual quality in real-world image super-resolution without any human annotations.

DynaGuide: Steering Diffusion Policies with Active Dynamic Guidance

This paper proposes DynaGuide, which applies classifier guidance to a frozen pretrained diffusion policy at inference time via an external latent dynamics model, steering the robot toward arbitrary positive/negative goals without modifying policy weights. It achieves an average success rate of 70% on CALVIN simulation and 80% on a real robot.

Elucidated Rolling Diffusion Models for Probabilistic Forecasting of Complex Dynamics

This paper proposes ERDM, the first framework to successfully unify the Rolling Diffusion paradigm with the principled design choices of EDM (noise schedule, preconditioning, Heun sampler). By employing a progressive noise schedule that explicitly models growing uncertainty, ERDM significantly outperforms autoregressive EDM baselines on Navier-Stokes and ERA5 weather forecasting benchmarks.

Encoder-Decoder Diffusion Language Models for Efficient Training and Inference

This paper proposes E2D2, an encoder-decoder architecture for discrete diffusion language models that performs iterative denoising via a lightweight decoder while periodically updating representations through a large encoder, achieving faster inference (~3× vs. MDLM) and more efficient block diffusion training (halving FLOPs).

Enhancing Infrared Vision: Progressive Prompt Fusion Network and Benchmark

To address the challenge of coupled degradations (low contrast, blur, and noise) in thermal infrared (TIR) images, this paper proposes PPFN, a progressive prompt fusion network with a dual-prompt design, along with the Selective Progressive Training (SPT) strategy. The authors also construct HM-TIR, the first large-scale multi-scene TIR benchmark dataset. The proposed method achieves an 8.76% PSNR improvement in composite degradation scenarios.

FIPER: Factorized Features for Robust Image Super-Resolution and Compression

This paper proposes a Factorized Features representation that decomposes images into learnable non-uniform bases and spatially variant coefficients, augmented with sawtooth coordinate transformation and multi-frequency modulation. The approach achieves a 204.4% relative PSNR gain at 4× super-resolution (HAT-L-F vs. SwinIR) and a 21.09% BD-rate reduction over VTM in image compression.

GC4NC: A Benchmark Framework for Graph Condensation on Node Classification with New Insights

This paper proposes GC4NC—the first systematic benchmark framework for graph condensation (GC)—which evaluates multiple GC methods across 8 dimensions (performance / efficiency / privacy protection / denoising / NAS effectiveness / transferability, etc.), finding that trajectory matching methods achieve the best performance, structure-free methods are most efficient, and graph condensation significantly outperforms image condensation under 1000× compression.

Implicit Augmentation from Distributional Symmetry in Turbulence Super-Resolution

This paper demonstrates that the statistical isotropy of turbulence itself constitutes a form of implicit data augmentation, enabling standard CNNs to partially learn rotational equivariance in super-resolution tasks without explicit rotation augmentation or equivariant architectures. The authors further show that the scale dependence of equivariance error is consistent with Kolmogorov's local isotropy hypothesis.

Browse all 26 Image Restoration papers →


🛰️ Remote Sensing (12)

C3PO: Cross-View Cross-Modality Correspondence by Pointmap Prediction

This paper introduces the C3 dataset comprising 90K ground photo–floor plan pairs (597 scenes, 153M pixel-level correspondences, and 85K camera poses), exposes the limitations of existing correspondence models under cross-view cross-modality settings (e.g., ground photos vs. floor plans), and demonstrates that training on this dataset reduces the RMSE of the best-performing baseline by 34%.

ChA-MAEViT: Unifying Channel-Aware Masked Autoencoders and Multi-Channel Vision Transformers for Improved Cross-Channel Learning

This paper proposes ChA-MAEViT, which enhances cross-channel feature learning for multi-channel images (MCI) through four key components: dynamic channel-patch joint masking, memory tokens, hybrid token fusion, and a channel-aware decoder. The method outperforms the state of the art by an average of 3.0–21.5% across three satellite and microscopy datasets.

Cloud4D: Estimating Cloud Properties at a High Spatial and Temporal Resolution

Cloud4D is the first learning framework based on ground-level multi-view cameras that reconstructs four-dimensional (3D spatial + temporal) cloud liquid water content distributions, using a homography-guided 2D-to-3D Transformer. The method achieves less than 10% error relative to radar at 25 m spatial and 5 s temporal resolution, improving spatiotemporal resolution by an order of magnitude over satellite observations.

Connecting the Dots: A Machine Learning Dataset for Ionospheric Prediction

This paper constructs an open, ML-ready ionospheric prediction dataset that integrates 8 heterogeneous data sources (solar observations, geomagnetic indices, TEC maps, etc.) spanning approximately 14 years (2010–2024). Three spatiotemporal baseline models—LSTM, SFNO, and GraphCast—are trained on this dataset, achieving TEC forecasts with lead times up to 12 hours.

EcoCast: A Spatio-Temporal Model for Continual Biodiversity and Climate Risk Forecasting

This paper proposes EcoCast, a Transformer-based spatio-temporal sequence model that integrates satellite remote sensing (Sentinel-2), climate reanalysis (ERA5), and citizen science observations (GBIF). The model predicts next-month species occurrence probabilities from 12-month environmental feature sequences. On a five-species African bird distribution prediction task, the macro-average F1 score improves from 0.31 (Random Forest) to 0.65. An EWC-based continual learning framework is also designed to accommodate data updates.
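The EWC-based continual learning mentioned above can be illustrated with the standard Elastic Weight Consolidation penalty, which discourages parameters that were important for earlier data from drifting. This is a minimal, framework-free sketch and not EcoCast's implementation; the function name `ewc_penalty`, the weight `lam`, and the toy values are assumptions for illustration.

```python
def ewc_penalty(params, anchor_params, fisher_diag, lam=1.0):
    """Standard EWC penalty: (lam / 2) * sum_i F_i * (theta_i - theta_star_i)^2.

    params        -- current parameter values
    anchor_params -- parameter values saved after training on earlier data
    fisher_diag   -- diagonal Fisher information estimates (importance weights)
    """
    return 0.5 * lam * sum(
        f * (p - a) ** 2 for p, a, f in zip(params, anchor_params, fisher_diag)
    )


# Toy values (hypothetical): the second parameter has zero importance,
# so it may move freely without being penalized.
anchor = [1.0, -2.0, 0.5]
fisher = [4.0, 0.0, 1.0]
current = [1.5, 3.0, 0.5]
penalty = ewc_penalty(current, anchor, fisher, lam=2.0)
```

In training, a term like this would be added to the task loss on the new data, so that only parameters with high Fisher importance are held close to their earlier values.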

GeoLink: Empowering Remote Sensing Foundation Model with OpenStreetMap Data

GeoLink directly integrates OpenStreetMap vector data into remote sensing foundation model pretraining by encoding OSM data with a heterogeneous GNN and designing multi-granularity cross-modal learning objectives (region–image-level contrastive + object–patch-level fusion). Pretrained efficiently on 1.27 million sample pairs, GeoLink surpasses existing RS FMs across 7 classification and 4 segmentation/change detection benchmarks.

GreenHyperSpectra: A Multi-Source Hyperspectral Dataset for Global Vegetation Trait Prediction

GreenHyperSpectra constructs a pretraining dataset of 140,000+ multi-source hyperspectral vegetation samples spanning proximal, airborne, and satellite platforms. Label-efficient regression models trained via semi-supervised and self-supervised methods (MAE, GAN, RTM-AE) comprehensively outperform fully supervised baselines on 7 plant trait prediction tasks, with particularly pronounced advantages under label-scarce and out-of-distribution scenarios.

Mass Conservation on Rails – Rethinking Physics-Informed Learning of Ice Flow Vector Fields

This paper proposes a divergence-free neural network (dfNN) that architecturally enforces exact mass conservation (divergence identically zero) via the symplectic gradient of a stream function, and combines it with a directional guidance learning strategy. The approach significantly outperforms soft-constraint PINNs and unconstrained NNs on ice flux interpolation over Antarctica's Byrd Glacier.
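The idea of enforcing zero divergence by construction can be checked numerically: if a 2D field is defined as the symplectic gradient of a scalar stream function, (u, v) = (∂ψ/∂y, −∂ψ/∂x), then its divergence is ∂²ψ/∂x∂y − ∂²ψ/∂y∂x = 0 for any smooth ψ. The sketch below uses a hand-written ψ and finite differences in place of a neural network and automatic differentiation; the sign convention, the example ψ, and all function names are assumptions, not the paper's code.

```python
import math


def psi(x, y):
    """Hypothetical stream function; in a dfNN this would be a neural network."""
    return math.sin(x) * math.cos(2.0 * y) + 0.3 * x * y


def velocity(x, y, h=1e-3):
    """Symplectic gradient of psi via central differences: (dpsi/dy, -dpsi/dx)."""
    u = (psi(x, y + h) - psi(x, y - h)) / (2.0 * h)
    v = -(psi(x + h, y) - psi(x - h, y)) / (2.0 * h)
    return u, v


def divergence(x, y, h=1e-3):
    """Central-difference estimate of du/dx + dv/dy."""
    du_dx = (velocity(x + h, y, h)[0] - velocity(x - h, y, h)[0]) / (2.0 * h)
    dv_dy = (velocity(x, y + h, h)[1] - velocity(x, y - h, h)[1]) / (2.0 * h)
    return du_dx + dv_dy


points = [(0.1, 0.2), (1.3, -0.7), (-2.0, 0.9)]
divergences = [divergence(x, y) for x, y in points]
```

The divergence vanishes regardless of which ψ is chosen, which is why the constraint holds exactly rather than being encouraged by a soft penalty as in a PINN.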

OrbitZoo: Real Orbital Systems Challenges for Reinforcement Learning

This paper presents OrbitZoo, a multi-agent RL environment built on the industrial-grade astrodynamics library Orekit. It integrates high-fidelity orbital dynamics (including atmospheric drag, solar radiation pressure, and third-body effects), a PettingZoo multi-agent interface, and real-time 3D visualization. Validation against real Starlink ephemerides yields a mean MAPE of only 0.16%.

OrthoLoC: UAV 6-DoF Localization and Calibration Using Orthographic Geodata

OrthoLoC establishes the first large-scale UAV 6-DoF localization benchmark dataset based on orthographic geodata (DOP+DSM), comprising 16,425 real UAV images across 47 regions in Germany and the United States. It further introduces AdHoP (Adaptive Homography Preprocessing), a matching enhancement technique that improves matching performance by 95% and reduces translation error by 63% without modifying the underlying feature matcher.

Browse all 12 Remote Sensing papers →


🧑 Human Understanding (21)

A Generalized Label Shift Perspective for Cross-Domain Gaze Estimation

This paper formulates cross-domain gaze estimation (CDGE) as a generalized label shift (GLS) problem, demonstrating that existing domain-invariant representation learning methods are theoretically insufficient under label shift. It proposes continuous importance reweighting based on truncated Gaussian distributions and a Probability-aware Conditional Operator Discrepancy (PCOD) to jointly correct label shift and conditional shift, achieving an average error reduction of 12%–27% across multiple backbones.

BEDLAM2.0: Synthetic Humans and Cameras in Motion

BEDLAM2.0 is a comprehensive upgrade over BEDLAM, introducing diverse camera motions (synthetic translation/tracking/orbit + handheld/head-mounted capture), broader body shape coverage (BMI 18–41), strand-based hair, shoes, size-graded clothing, and more 3D environments. The resulting dataset comprises 27K+ sequences and 8M+ frames; models trained exclusively on this synthetic data surpass the state of the art in world-coordinate human motion estimation.

ConceptScope: Characterizing Dataset Bias via Disentangled Visual Concepts

This paper proposes ConceptScope, a framework that trains sparse autoencoders (SAE) on representations from visual foundation models to automatically discover and quantify visual concept biases in datasets, categorizing concepts into target / context / bias without any manual annotation.

CPEP: Contrastive Pose-EMG Pre-training Enhances Gesture Generalization on EMG Signals

This paper proposes the CPEP framework, which employs contrastive learning to align low-quality EMG signal representations with high-quality hand pose representations, endowing the EMG encoder with pose-awareness. CPEP is the first to achieve zero-shot recognition of unseen gestures from EMG signals, yielding a 21% improvement on in-distribution gesture classification and a 72% improvement on unseen gesture classification.

Cycle-Sync: Robust Global Camera Pose Estimation through Enhanced Cycle-Consistent Synchronization

Cycle-Sync is a global camera pose estimation framework that extends Message Passing Least Squares (MPLS) to camera position estimation, introduces a Welsch-type robust loss and cycle-consistency weighting, and surpasses all baselines—including complete SfM pipelines with bundle adjustment (BA)—without requiring BA.

DevFD: Developmental Face Forgery Detection by Learning Shared and Orthogonal LoRA Subspaces

This paper proposes DevFD—a developmental MoE architecture that models the common characteristics of real faces via a shared Real-LoRA, incrementally captures new forgery types via a sequence of orthogonal Fake-LoRAs, and mitigates catastrophic forgetting by integrating orthogonal gradient constraints into an orthogonal loss. DevFD achieves state-of-the-art accuracy and the lowest forgetting rate in continual learning for face forgery detection.

Foundation Cures Personalization: Improving Personalized Models' Prompt Consistency via Hidden Foundation Knowledge

FreeCure reveals that identity embeddings in face personalization models suppress but do not destroy the prompt control capability of the foundation model. Based on this insight, the paper proposes a training-free framework that injects attribute information from the foundation model into the personalized generation process via Foundation-Aware Self-Attention (FASA). The method substantially improves prompt consistency while preserving identity fidelity, and can be seamlessly integrated into mainstream architectures including SD, SDXL, and FLUX.

HOI-Dyn: Learning Interaction Dynamics for Human-Object Motion Diffusion

This paper models human-object interaction (HOI) generation as a Driver-Responder system, employing a lightweight Transformer-based interaction dynamics model to explicitly predict how objects respond to human actions. A residual dynamics loss is introduced during training to enforce causal consistency, while inference efficiency is preserved.

K-DeCore: Facilitating Knowledge Transfer in Continual Structured Knowledge Reasoning

This paper proposes K-DeCore, a framework that decouples structured knowledge reasoning into two stages — task-agnostic schema filtering and task-specific query construction — and combines dual-perspective memory construction with structure-guided pseudo-data synthesis to enable effective knowledge transfer across heterogeneous SKR tasks under a fixed parameter budget.

KungfuBot: Physics-Based Humanoid Whole-Body Control for Learning Highly-Dynamic Skills

This paper proposes the PBHC framework, which enables a humanoid robot (Unitree G1) to learn highly dynamic whole-body skills such as kung fu and dance through a physics-aware motion processing pipeline and a bi-level optimization scheme for adaptive tracking factors. The approach achieves substantially lower tracking errors than existing methods and is successfully deployed on real hardware.

Browse all 21 Human Understanding papers →


📹 Video Understanding (39)

A Unified Reasoning Framework for Holistic Zero-Shot Video Anomaly Analysis

This paper presents a fully zero-shot, training-free video anomaly analysis framework that employs Intra-Task Reasoning (confidence-gated self-refinement) and Inter-Task Chaining (cascaded prompt passing from temporal detection to spatial localization to semantic understanding), achieving improvements of 4–6% AUC over prior zero-shot methods across 4 benchmarks.

AdaVideoRAG: Omni-Contextual Adaptive Retrieval-Augmented Efficient Long Video Understanding

AdaVideoRAG routes each query to one of three retrieval pathways (no retrieval / naive retrieval / graph retrieval) via a lightweight intent classifier, and combines this routing with an omni-knowledge indexing module (caption + ASR + OCR + visual + knowledge graph) to balance efficiency and accuracy in long video understanding, yielding a 39.8% improvement for Qwen2.5-VL-7B on MLVU.

ConViS-Bench: Estimating Video Similarity Through Semantic Concepts

This paper introduces ConViS, a concept-based video similarity estimation task, along with its accompanying benchmark ConViS-Bench (610 video pairs, 16 domains, 5 concepts). It systematically evaluates 10+ mainstream models on concept-conditioned video comparison, revealing significant deficiencies in current models' understanding of temporal structure and spatial context.

Disentangled Concepts Speak Louder Than Words: Explainable Video Action Recognition

This paper proposes DANCE, a framework that achieves structured and motion-aware explainable video action recognition by disentangling action explanations into three concept types: motion dynamics, objects, and scenes.

DualGround: Structured Phrase and Sentence-Level Temporal Grounding

This paper identifies that existing video temporal grounding (VTG) models over-rely on the global sentence semantics encoded in the [EOS] token while neglecting word-level signals. It proposes DualGround, a dual-branch architecture that explicitly decouples global and local semantics via a sentence-level path (adaptive cross-attention) and a phrase-level path (recurrent phrase generation + Slot Attention), achieving state-of-the-art performance on QVHighlights and Charades-STA.

EAG3R: Event-Augmented 3D Geometry Estimation for Dynamic and Extreme-Lighting Scenes

EAG3R integrates asynchronous event streams from event cameras into the MonST3R point map reconstruction framework. Through a Retinex enhancement module, an SNR-aware fusion mechanism, and an event photometric consistency loss, it achieves robust depth estimation, pose tracking, and 4D reconstruction in extreme low-light dynamic scenes, significantly outperforming RGB-only methods in zero-shot transfer to nighttime scenarios.

EgoGazeVQA: Egocentric Gaze-Guided Video Question Answering Benchmark

This paper introduces EgoGazeVQA, the first egocentric video question answering benchmark that incorporates user eye-gaze data. Through gaze-guided prompting strategies (textual, visual, and salience map), the benchmark demonstrates substantial improvements in MLLMs' ability to understand user intent. The Gaze Salience Map strategy raises MiniCPM-o's accuracy from 35.9% to 53.7%.

Enhancing Temporal Understanding in Video-LLMs through Stacked Temporal Attention in Vision Encoders

This paper proposes STAVEQ2, which inserts parameter-efficient Stacked Temporal Attention (STA) modules into the Vision Encoder to address fundamental architectural deficiencies in existing Video-LLMs for fine-grained temporal understanding (e.g., distinguishing "pulling from left to right" vs. "pulling from right to left"), achieving up to 5.5% improvement on VITATECS/MVBench/Video-MME.

FastVID: Dynamic Density Pruning for Fast Video Large Language Models

This paper proposes FastVID, which systematically eliminates video token redundancy along both temporal and visual dimensions via Dynamic Temporal Segmentation (DySeg) and Density Spatiotemporal Pruning (STPrune). On LLaVA-OneVision-7B, FastVID retains 98% accuracy after pruning 90.3% of video tokens, achieving a 7.1× speedup in the LLM prefill stage.

Grounding Foundational Vision Models with 3D Human Poses for Robust Action Recognition

This paper proposes a cross-attention multimodal architecture that integrates V-JEPA 2 visual context features with CoMotion 3D skeletal pose data, outperforming unimodal baselines on standard and high-occlusion action recognition benchmarks.

Browse all 39 Video Understanding papers →


🚗 Autonomous Driving (47)

3EED: Ground Everything Everywhere in 3D

This paper introduces 3EED — the first large-scale multi-platform (vehicle, drone, quadruped robot), multimodal (LiDAR + RGB) outdoor 3D visual grounding benchmark, containing over 128K objects and 22K language descriptions, making it 10× larger than existing outdoor datasets. A baseline method incorporating cross-platform alignment, multi-scale sampling, and scale-adaptive fusion is also proposed, revealing substantial performance gaps in cross-platform 3D grounding.

Aha: Predicting What Matters Next — Online Highlight Detection Without Looking Ahead

Aha proposes the first autoregressive framework for Online Highlight Detection (OHD), featuring a decoupled multi-objective prediction head (relevance / informativeness / uncertainty) and a novel Dynamic SinkCache memory mechanism. Under strict causal constraints with no access to future frames, Aha surpasses prior offline methods on TVSum and Mr.Hisum benchmarks by +5.9% and +8.3% mAP, respectively.

Availability-aware Sensor Fusion via Unified Canonical Space

This paper proposes ASF (Availability-aware Sensor Fusion), which maps Camera/LiDAR/4D Radar features into a shared space via Unified Canonical Projection (UCP), applies cross-sensor along-patch cross-attention (CASAP, complexity \(O(N_qN_s)\) vs. \(O(N_qN_sN_p)\)) to automatically adapt to available sensors, and employs a Sensor Combination Loss (SCL) covering all 7 sensor subsets. ASF achieves AP_3D of 73.6% on K-Radar (surpassing SOTA by 20.1%), with only a 1.7% performance drop under sensor failure.
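The figure of 7 sensor subsets follows from counting the non-empty subsets of three sensors (2³ − 1 = 7). A short enumeration makes this concrete; the sensor labels and the helper name `nonempty_subsets` are illustrative assumptions, and how ASF weights or combines the per-subset losses is not shown here.

```python
from itertools import combinations

# Hypothetical labels for the three modalities named in the summary.
SENSORS = ("camera", "lidar", "radar4d")


def nonempty_subsets(items):
    """Return every non-empty combination of the given items."""
    return [
        combo
        for size in range(1, len(items) + 1)
        for combo in combinations(items, size)
    ]


subsets = nonempty_subsets(SENSORS)
for combo in subsets:
    print(" + ".join(combo))
```

Each printed line corresponds to one availability pattern, from a single working sensor up to all three together.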

BayesG: Bayesian Ego-Graph Inference for Networked Multi-Agent Reinforcement Learning

BayesG enables each agent in networked MARL to learn the dynamic structure of its local communication graph via Bayesian variational inference — sampling edge masks with Gumbel-Softmax and jointly optimizing policy and graph structure under an ELBO objective — achieving 50%+ reward improvement over the best baseline in a 167-agent New York traffic scenario.
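Sampling edge masks with Gumbel-Softmax can be sketched as follows: each candidate edge has a learnable logit, Gumbel noise is added, and a temperature-scaled softmax yields a relaxed keep probability in [0, 1]. This standard-library version shows only the forward sampling step; a real implementation would need an autodiff framework so gradients can flow to the logits. The two-class (keep vs. drop) parameterization, the temperature `tau`, and all names are assumptions rather than BayesG's actual code.

```python
import math
import random


def sample_gumbel(rng):
    """Draw one Gumbel(0, 1) sample via the inverse-CDF transform."""
    u = max(rng.random(), 1e-12)  # guard against log(0)
    return -math.log(-math.log(u))


def gumbel_softmax(logits, tau, rng):
    """Relaxed categorical sample: softmax((logits + Gumbel noise) / tau)."""
    noisy = [(logit + sample_gumbel(rng)) / tau for logit in logits]
    peak = max(noisy)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def sample_edge_mask(edge_logits, tau=0.5, seed=0):
    """Return a soft keep-weight per edge, treating 'drop' as logit 0."""
    rng = random.Random(seed)
    return [gumbel_softmax([logit, 0.0], tau, rng)[0] for logit in edge_logits]


mask = sample_edge_mask([2.0, -1.0, 0.0])
```

Lower values of `tau` push the soft mask toward hard 0/1 decisions, while higher values give smoother weights.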

Causality Meets Locality: Provably Generalizable and Scalable Policy Learning for Networked Systems

This paper proposes the GSAC framework, which integrates causal representation learning with meta Actor-Critic. By learning sparse causal masks from networked MARL to construct Approximate Compact Representations (ACR), GSAC achieves scalability; by conditioning policies on domain factors, it achieves cross-domain generalization. Finite-sample guarantees are provided for causal recovery, convergence, and adaptation gap.

ChronoGraph: A Real-World Graph-Based Multivariate Time Series Dataset

This paper presents ChronoGraph — the first real-world microservice dataset that simultaneously provides multivariate time series, explicit service dependency graphs, and event-level anomaly labels (6 months / ~700 services / 5-dimensional metrics / 8005 timesteps). Benchmark results reveal substantial room for improvement in long-horizon forecasting and topology-aware modeling among existing methods.

Continuous Simplicial Neural Networks

This paper proposes COSIMO, the first continuous simplicial neural network based on partial differential equations (PDEs), which realizes continuous information flow by defining heat diffusion dynamics on the Hodge Laplacian. COSIMO demonstrates superior stability and over-smoothing control compared to discrete SNNs.

CuMoLoS-MAE: A Masked Autoencoder for Remote Sensing Data Reconstruction

This paper proposes CuMoLoS-MAE, a Masked Autoencoder combining a curriculum masking strategy with Monte Carlo stochastic ensemble inference for high-fidelity reconstruction and pixel-wise uncertainty quantification of remote sensing atmospheric profile data.

CymbaDiff: Structured Spatial Diffusion for Sketch-based 3D Semantic Urban Scene Generation

This work introduces the first sketch-to-3D outdoor semantic scene generation task along with a benchmark dataset, SketchSem3D, and proposes CymbaDiff (Cylinder Mamba Diffusion), a denoising network that achieves structured spatial modeling via dual-path Mamba blocks combining cylindrical and Cartesian scanning. CymbaDiff reduces FID by 75% over 3D Latent Diffusion and 71% over 3D DiT.

DBLoss: Decomposition-based Loss Function for Time Series Forecasting

This paper proposes DBLoss—a general-purpose loss function based on exponential moving average (EMA) decomposition. During loss computation, both predictions and ground-truth values are decomposed into seasonal and trend components within the forecasting horizon, and losses are computed separately for each component. DBLoss serves as a plug-and-play replacement for MSE and consistently improves any deep learning forecasting model, with effectiveness validated across 8 benchmark datasets × 8 SOTA models.
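The decomposition step described above can be sketched in plain Python: an exponential moving average gives the trend, the residual gives the seasonal part, and a loss is computed on each component separately. The smoothing factor `alpha`, the weighting `beta`, the use of squared error for both components, and all function names are assumptions for illustration; the paper's exact formulation may differ.

```python
def ema(series, alpha):
    """Exponential moving average, initialized at the first value."""
    trend = [series[0]]
    for x in series[1:]:
        trend.append(alpha * x + (1 - alpha) * trend[-1])
    return trend


def decompose(series, alpha):
    """Split a series into (seasonal, trend), where seasonal = series - trend."""
    trend = ema(series, alpha)
    seasonal = [x - t for x, t in zip(series, trend)]
    return seasonal, trend


def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def decomposition_loss(pred, target, alpha=0.3, beta=0.5):
    """Weighted sum of the seasonal-component and trend-component errors."""
    seasonal_p, trend_p = decompose(pred, alpha)
    seasonal_t, trend_t = decompose(target, alpha)
    return beta * mse(seasonal_p, seasonal_t) + (1 - beta) * mse(trend_p, trend_t)


target = [1.0, 2.0, 1.5, 3.0, 2.5, 4.0]
shifted = [x + 1.0 for x in target]  # same shape, constant offset
loss_shifted = decomposition_loss(shifted, target)
```

In this toy case a constant offset shows up only in the trend term, since the seasonal components of the two series coincide; this illustrates how the two error sources are separated.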

Browse all 47 Autonomous Driving papers →


🤖 Robotics & Embodied AI (73)

A Snapshot of Influence: A Local Data Attribution Framework for Online Reinforcement Learning

This work is the first to introduce data attribution into online reinforcement learning. It proposes a local attribution framework to quantify each training record's contribution to policy updates, and builds upon it an Iterative Influence Filtering (IIF) algorithm that substantially improves sample efficiency and final performance on both classical RL benchmarks and LLM RLHF.

Act to See, See to Act: Diffusion-Driven Perception-Action Interplay for Adaptive Policies

This paper proposes DP-AG (Action-Guided Diffusion Policy), which uses the Vector-Jacobian Product (VJP) of a diffusion policy's noise prediction as a structured stochastic force to drive dynamic evolution of latent observation features across diffusion steps, and closes the perception-action loop via a cycle-consistent contrastive loss. DP-AG achieves gains of +6% on Push-T, +13% on Dynamic Push-T, and more than 23% in success rate on a real UR5 robot.

Adaptive Frontier Exploration on Graphs with Applications to Network-Based Disease Testing

This paper proposes the Adaptive Frontier Exploration on Graphs (AFEG) framework and designs a Gittins index-based policy that is provably optimal when the graph is a forest. On real-world sexually transmitted disease testing networks, the method identifies nearly all HIV-positive individuals by testing only half the population, substantially outperforming greedy and DQN baselines.

Adversarial Locomotion and Motion Imitation for Humanoid Policy Learning

ALMI proposes an upper-lower body adversarial training framework: the lower-body policy learns robust locomotion under upper-body motion perturbations, while the upper-body policy learns precise motion imitation under lower-body locomotion perturbations. Through iterative adversarial training converging to a Nash equilibrium, the framework enables stable whole-body coordinated control on the Unitree H1-2 real robot.

Asymptotically Stable Quaternionic Hopfield Structured Neural Network with Supervised Projection-based Manifold Learning

This paper proposes a Quaternion-valued Supervised Hopfield-structured Neural Network (QSHNN) that employs a periodic projection strategy to maintain the quaternionic structural consistency of the weight matrix. The existence and uniqueness of fixed points and their asymptotic stability are established via Lyapunov theory, while bounded trajectory curvature guarantees path smoothness for robotic path planning.

Automaton Constrained Q-Learning

This paper proposes ACQL (Automaton Constrained Q-Learning), which translates Linear Temporal Logic (LTL) task specifications into automata and combines goal-conditioned learning with minimal safety constraints. ACQL is the first scalable method to simultaneously support sequential temporal goals and non-stationary safety constraints in continuous control environments.

AutoToM: Scaling Model-based Mental Inference via Automated Agent Modeling

AutoToM achieves fully automated model-based Theory of Mind inference—without requiring manual agent model specification—by automatically proposing Bayesian network structures and executing Bayesian inverse planning. Through uncertainty-driven iterative model refinement (adding mental variables or extending time steps), it achieves an average accuracy of 82.43% across 5 ToM benchmarks, surpassing SOTA models such as GPT-4o (63.39%) and o3-mini (73.94%).

AutoVLA: A Vision-Language-Action Model for End-to-End Autonomous Driving with Adaptive Reasoning and Reinforcement Fine-Tuning

AutoVLA integrates physical action tokens directly into a pretrained VLM (Qwen2.5-VL-3B), equips the model with fast/slow dual-thinking modes via SFT, and applies GRPO reinforcement fine-tuning to enable adaptive reasoning switching and optimize planning performance. The approach achieves competitive end-to-end driving performance across four major autonomous driving benchmarks: nuPlan, Waymo, nuScenes, and CARLA.

BEAST: Efficient Tokenization of B-Splines Encoded Action Sequences for Imitation Learning

BEAST parameterizes action sequences via B-splines—estimating control points through ridge regression and uniformly quantizing them into fixed-length tokens—achieving 20× token compression (100 steps → 5 tokens), mathematically guaranteed \(C^0\) continuity across action chunks, a top-1 success rate on LIBERO-Long (86.4%), and an inference throughput of 617 Hz (2.14× faster than π₀ and 101× faster than OpenVLA).

Benchmarking Egocentric Multimodal Goal Inference for Assistive Wearable Agents

Meta proposes WAGIBench, a multimodal goal inference benchmark for assistive wearable agents, comprising 3,477 egocentric recordings (29 hours) from 348 participants across four modalities — visual, audio, digital, and longitudinal. Human accuracy reaches 93% versus the best VLM at 84% (MCQ); under generative evaluation, models produce relevant goals only 55% of the time, exposing a substantial gap between current VLMs and real-world wearable deployment.

Browse all 73 Robotics & Embodied AI papers →


🎮 Reinforcement Learning (140)

A Differential and Pointwise Control Approach to Reinforcement Learning

This paper reformulates the RL problem via the differential dual form of continuous-time control, embeds physical priors through Hamiltonian structure, and proposes the dfPO algorithm for pointwise policy optimization. On scientific computing tasks (surface modeling, grid-based control, molecular dynamics), dfPO surpasses 12 RL baselines with fewer samples.

A Generalized Bisimulation Metric of State Similarity between Markov Decision Processes: From Theoretical Propositions to Applications

This paper extends the classical bisimulation metric (BSM), which is limited to measuring state similarity within a single MDP, to cross-MDP settings by proposing a Generalized Bisimulation Metric (GBSM). The authors rigorously prove three fundamental metric properties — symmetry, cross-MDP triangle inequality, and an upper bound on same-state distances — and derive tighter error bounds and closed-form sample complexities than standard BSM in three applications: policy transfer, state aggregation, and sampling-based estimation.

A Near-optimal, Scalable and Parallelizable Framework for Stochastic Bandits Robust to Adversarial Corruptions and Beyond

This paper proposes BARBAT, an improvement over the classical BARBAR algorithm. By fixing epoch lengths and adjusting failure probabilities per epoch, BARBAT reduces the regret of stochastic multi-armed bandits under adversarial corruptions from \(O(\sqrt{K}C)\) to the near-optimal \(O(C)\) (eliminating the \(\sqrt{K}\) factor), and successfully extends to multi-agent, graph bandit, combinatorial semi-bandit, and batched bandit settings.

A Theory of Multi-Agent Generative Flow Networks

This paper proposes a theoretical framework for Multi-Agent Generative Flow Networks (MA-GFlowNets) and establishes a "local-global principle" — the joint flow function can be decomposed into a product of individual agents' local flows. Four algorithms are designed (CFN/IFN/JFN/CJFN), among which JFN and CJFN realize Centralized Training with Decentralized Execution (CTDE). The proposed methods outperform RL and MCMC baselines on Hyper-Grid and StarCraft environments.

A Unifying View of Linear Function Approximation in Off-Policy RL Through Matrix Splitting and Preconditioning

This work is the first to introduce matrix splitting theory, unifying TD, FQI, and PFQI under linear function approximation as iterative methods for solving the same target linear system \((\Sigma_{cov} - \gamma\Sigma_{cr})\theta = \theta_{\phi,r}\), differing only in their preconditioners. It establishes necessary and sufficient conditions for the convergence of each algorithm, introduces the novel concept of rank invariance, and reveals that target networks are fundamentally a continuous transformation of the preconditioner from a constant to a data-adaptive form.

Actor-Free Continuous Control via Structurally Maximizable Q-Functions

This paper proposes Q3C (Q-learning for Continuous Control with Control-points), which approximates the Q-function via a learned set of control points such that the maximum value is structurally attained at one of those points. Combined with action-conditioned Q-value generation, a control-point diversity loss, and scale normalization, Q3C matches TD3 on standard benchmarks and substantially outperforms all actor-critic methods in constrained action spaces.

Adaptive Cooperative Transmission Design for URLLC via Deep RL

This paper proposes DRL-CoLA, a dual-agent DQN algorithm that adaptively configures 5G NR transmission parameters (numerology, mini-slot, MCS) at the source and relay nodes respectively. Operating over a two-hop relay system with only local CSI, DRL-CoLA achieves URLLC reliability close to the optimum attained under full global CSI.

Adaptive Neighborhood-Constrained Q Learning for Offline Reinforcement Learning

This paper proposes ANQ (Adaptive Neighborhood-constrained Q learning), which introduces advantage-function-based adaptive neighborhood constraints for offline RL. ANQ offers a flexible middle ground between density constraints (overly conservative) and support constraints (requiring precise behavior policy modeling), and realizes efficient Q learning via a bilevel optimization framework, achieving state-of-the-art performance on the D4RL benchmark.

Adaptively Coordinating with Novel Partners via Learned Latent Strategies

This paper proposes the TALENTS framework, which learns a latent strategy space via a VAE, discovers strategy types through K-Means clustering, and performs online teammate-type inference using the Fixed-Share regret minimization algorithm, enabling zero-shot real-time adaptive coordination with unknown human or agent teammates.
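The online teammate-type inference step relies on Fixed-Share, a classical expert-tracking algorithm. One common variant is sketched below: weights over candidate types are updated multiplicatively from per-type losses, then a small fraction `alpha` of the mass is redistributed uniformly so the belief can recover if the teammate switches strategy. How TALENTS defines the per-type losses is not covered here; the learning rate `eta`, the toy losses, and the function name are assumptions.

```python
import math


def fixed_share_update(weights, losses, eta=1.0, alpha=0.1):
    """One Fixed-Share step: exponential-weights update, then uniform sharing."""
    scaled = [w * math.exp(-eta * loss) for w, loss in zip(weights, losses)]
    total = sum(scaled)
    posterior = [s / total for s in scaled]
    n = len(weights)
    # Mixing with the uniform distribution keeps every weight >= alpha / n.
    return [(1 - alpha) * p + alpha / n for p in posterior]


# Toy run (hypothetical): type 0 explains the teammate's behavior best.
weights = [1 / 3, 1 / 3, 1 / 3]
for losses in [[0.0, 1.0, 1.0]] * 5:
    weights = fixed_share_update(weights, losses)
```

The floor of `alpha / n` on each weight is what distinguishes Fixed-Share from plain exponential weights: no type is ever ruled out completely.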

ALINE: Joint Amortization for Bayesian Inference and Active Data Acquisition

ALINE proposes a unified framework for amortized Bayesian inference and active data acquisition. By combining a Transformer architecture with RL-based training, the model simultaneously learns to strategically select the most informative data points and perform instant posterior inference. It further supports flexible data acquisition targeting specific parameter subsets or predictive objectives.

Browse all 140 Reinforcement Learning papers →


🎁 Recommender Systems (24)

ASAP: An Agentic Solution to Auto-Optimize Performance of Large-Scale LLM Training

ASAP is a multi-agent system (Coordinator + Analyzer + Proposal) that automatically diagnoses bottleneck types (compute/memory/communication) in large-scale LLM distributed training and proposes sharding configurations. Across 3 experimental scenarios, it matches human expert solutions and achieves up to 2.58× throughput improvement.

Balancing Performance and Costs in Best Arm Identification

This paper proposes to reformulate Best Arm Identification (BAI) from the fixed-budget/fixed-confidence paradigm into a risk functional minimization problem over misidentification probability (or simple regret) plus sampling cost. It derives lower bounds exhibiting a phase transition phenomenon (when the gap is too small, the optimal strategy is to guess directly), and designs the DBCARE algorithm that achieves optimality within logarithmic factors under a dynamic budget.

EMPATHIA: Multi-Faceted Human-AI Collaboration for Refugee Integration

This paper proposes EMPATHIA, a multi-agent framework grounded in Kegan's constructive-developmental theory. Three specialized agents—emotional, cultural, and ethical—engage in selector-validator negotiation to evaluate refugee resettlement recommendations. On real-world data from 6,359 refugees, the framework achieves an 87.4% convergence rate and 92.1% cultural expert agreement rate.

Estimating Hitting Times Locally At Scale

Two local (sublinear) algorithms are proposed for estimating hitting times on graphs — Algorithm 1 based on meeting times and Algorithm 3 based on spectral truncation. Both require only short random walks centered at \(u\) and \(v\) without full graph access, achieving relative error <1.4% on synthetic and real-world graphs. An optimal sample complexity lower bound for walk-based estimation is also established.

FACE: A General Framework for Mapping Collaborative Filtering Embeddings into LLM Tokens

FACE proposes mapping collaborative filtering (CF) embeddings into LLM pre-trained tokens (descriptors) via disentangled projection and residual quantization, followed by contrastive learning for semantic alignment — enabling semantic interpretation and recommendation enhancement of CF embeddings without fine-tuning the LLM.

Inference-Time Reward Hacking in Large Language Models

This paper mathematically proves that inference-time alignment methods (e.g., BoN) inevitably exhibit reward hacking (true reward first increases then decreases) when optimizing a proxy reward. It proposes Best-of-Poisson (BoP) sampling to approximate the optimal KL-reward trade-off distribution, and designs the HedgeTune algorithm to locate the optimal inference-time parameter via one-dimensional root-finding, effectively mitigating reward hacking in both mathematical reasoning and human preference settings.

Measuring What Matters: Construct Validity in Large Language Model Benchmarks

This paper presents a systematic review of 445 LLM benchmark papers conducted by 29 experts, examining existing LLM evaluation benchmarks through the lens of construct validity across four dimensions — phenomenon definition, task design, scoring metrics, and conclusion claims — and proposes 8 actionable recommendations for improvement.

MMPB: It's Time for Multi-Modal Personalization

This paper introduces MMPB, the first VLM personalization evaluation benchmark, comprising 111 personalizable concepts, 10k+ image-text QA pairs, and 15 task types. Evaluation of 23 VLMs reveals that even the strongest model, GPT-4o, performs poorly on personalization tasks, exposing critical limitations in preference reasoning, visual cue utilization, and conflicts between safety alignment and personalization.

NeurIPS Should Lead Scientific Consensus on AI Policy

This position paper argues that NeurIPS should proactively assume the role of facilitating scientific consensus in AI policy, drawing on the successful experience of the IPCC (Intergovernmental Panel on Climate Change) in climate science to fill the current gap in AI policy consensus mechanisms.

Overcoming Sparsity Artifacts in Crosscoders to Interpret Chat-Tuning

This work identifies two categories of sparsity artifacts introduced by L1 loss in Crosscoders—Complete Shrinkage (which erroneously zeros out weakly shared concepts) and Latent Decoupling (which splits shared concepts into spurious model-specific latents)—and proposes Latent Scaling as a diagnostic tool and BatchTopK Crosscoder as an alternative training scheme, substantially improving the reliability of chat-tuning concept discovery.

Browse all 24 Recommender Systems papers →


🔄 Self-Supervised Learning (33)

A Joint Learning Approach to Hardware Caching and Prefetching

This paper proposes a joint training framework that unifies hardware cache replacement and prefetching policies. By constructing shared feature representations via a joint encoder and contrastive learning, the framework breaks the performance bottleneck imposed by independently trained policies.

Adv-SSL: Adversarial Self-Supervised Representation Learning with Theoretical Guarantees

This paper proposes Adv-SSL, which rewrites the Frobenius norm of the covariance regularization term as a minimax dual form, eliminating the biased sample-level risk estimation present in methods such as Barlow Twins. The approach substantially improves downstream classification performance without incurring additional computational cost, and provides end-to-end theoretical convergence guarantees.
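
A hedged note on why a dual form can remove the bias: the identity below is a standard variational characterization of the squared Frobenius norm. It is an assumption that the paper's reformulation has this shape, and the symbols \(A\) and \(G\) are introduced here only for illustration.

\[
\|A\|_F^2 \;=\; \max_{G}\,\bigl\{\, 2\langle A, G\rangle_F - \|G\|_F^2 \,\bigr\}, \qquad \text{attained at } G = A .
\]

For a fixed \(G\), the objective is linear in \(A\), so replacing \(A\) with an unbiased sample estimate \(\hat{A}\) keeps the inner objective unbiased. By contrast, \(\mathbb{E}\|\hat{A}\|_F^2 \geq \|\mathbb{E}\hat{A}\|_F^2\) in general, which is where bias enters when the squared norm is estimated directly from a batch.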

Angular Constraint Embedding via SpherePair Loss for Constrained Clustering

This paper proposes the SpherePair loss function, which performs pairwise constraint embedding learning in angular space (rather than Euclidean space), enabling a deep constrained clustering method that requires neither anchors nor prior knowledge of the number of clusters, while providing rigorous theoretical guarantees for determining optimal hyperparameters.

Asymptotic and Finite-Time Guarantees for Langevin-Based Temperature Annealing in InfoNCE

By modeling embedding evolution as Langevin dynamics on a compact Riemannian manifold, this paper proves that the convergence guarantees of classical simulated annealing extend to the temperature scheduling setting in contrastive learning: a sufficiently slow logarithmic inverse-temperature schedule guarantees probabilistic convergence to the globally optimal representation set, whereas faster schedules risk trapping the system in suboptimal minima.

Connecting Jensen-Shannon and Kullback-Leibler Divergences: A New Bound for Representation Learning

This paper derives the tightest possible lower bound on KL divergence in terms of JS divergence, \(\Xi(D_{\text{JS}}) \leq D_{\text{KL}}\), in the general case. It proves that training a discriminator by minimizing cross-entropy loss is equivalent to maximizing a guaranteed lower bound on mutual information, thereby providing the missing theoretical foundation for JSD-based discriminative representation learning methods. The tightness and practical utility of the bound are validated in MI estimation and the Information Bottleneck framework.

Continuous Subspace Optimization for Continual Learning (CoSO)

This paper proposes CoSO, a framework that dynamically derives continuous subspaces from per-step gradient SVD (rather than LoRA's fixed subspace), combined with orthogonal projection onto historical task subspaces to prevent interference and Frequent Directions for efficient gradient information aggregation. CoSO achieves 78.19% final accuracy on ImageNet-R with 20 tasks, surpassing the best baseline by 2.77 percentage points.
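
The interference-prevention step can be illustrated with a small sketch. It assumes the goal is to strip from the current gradient any component lying in a subspace spanned by earlier tasks, and that this subspace is given as an orthonormal basis. `project_out` is a hypothetical helper, not the paper's code, and the SVD and Frequent Directions parts are not shown.

```python
def project_out(grad, basis):
    """Remove from grad its components along each vector of an orthonormal basis.

    grad:  list of floats (a flattened gradient).
    basis: list of orthonormal vectors spanning the historical subspace (assumed input).
    """
    out = list(grad)
    for u in basis:
        # Coefficient of the current vector along u.
        coeff = sum(g * ui for g, ui in zip(out, u))
        # Subtract that component, leaving the part orthogonal to u.
        out = [g - coeff * ui for g, ui in zip(out, u)]
    return out


# Toy example: the first two coordinates belong to earlier tasks.
history = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
update = project_out([3.0, 4.0, 5.0], history)
```

Because the basis is orthonormal, subtracting the components one at a time matches the usual projector onto the orthogonal complement.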

Contrastive Representations for Temporal Reasoning

This paper proposes CRTR (Contrastive Representations for Temporal Reasoning), which introduces intra-trajectory negative pairs by repeating trajectory IDs within training batches. This eliminates the reliance on static contextual features in standard temporal contrastive learning, enabling representations that reflect temporal structure. CRTR achieves, for the first time, search-free solving on combinatorial reasoning tasks such as the Rubik's Cube.

Curiosity-driven RL for Symbolic Equation Solving

This work combines curiosity-driven exploration mechanisms (RND, ICM, etc.) with a graph action space based on expression trees, enabling a PPO agent to solve nonlinear equations involving radicals, exponentials, and trigonometric functions — surpassing prior RL methods that were limited to linear equations.

DataRater: Meta-Learned Dataset Curation

This paper proposes DataRater, a meta-gradient-based data valuation framework that employs meta-learning to automatically score and filter low-quality training samples. It achieves up to 46.6% net compute savings across multiple pre-training datasets, and a DataRater trained on a 400M internal model generalizes directly to LLM training at scales ranging from 50M to 1B parameters.

Disentangling Hyperedges through the Lens of Category Theory

This work is the first to analyze hyperedge disentanglement through the lens of category theory. By deriving a naturality condition, it establishes a "factor representation consistency" criterion (aggregation-then-disentanglement vs. disentanglement-then-aggregation should yield consistent results), and proposes Natural-HNN, which comprehensively outperforms 14 baselines across 6 cancer subtype classification datasets (BRCA F1: 75.7% → 80.4%) while achieving 100% accuracy in capturing the functional context of genetic pathways.

Browse all 33 Self-Supervised Learning papers →


📐 Optimization & Theory (121)

A Single-Loop First-Order Algorithm for Linearly Constrained Bilevel Optimization

For bilevel optimization problems with coupled linear constraints in the lower-level problem, this paper proposes SFLCB, a single-loop first-order algorithm that eliminates Hessian dependence via a penalty-based reformulation combined with augmented Lagrangian, improving the iteration complexity from \(O(\epsilon^{-3}\log(\epsilon^{-1}))\) to \(O(\epsilon^{-3})\).

A Unified Approach to Submodular Maximization Under Noise

This paper proposes a unified meta-algorithm framework that takes any exact submodular maximization algorithm satisfying a "robustness" condition as a black box and automatically converts it into an algorithm that maintains its approximation ratio (losing only \(o(1)\)) under a persistent noisy value oracle. This achieves, for the first time, optimal approximation ratios for non-monotone submodular maximization under matroid constraints and in the unconstrained setting.

A Unified Stability Analysis of SAM vs SGD: Role of Data Coherence and Emergence of Simplicity Bias

Through a linear stability analysis framework, this paper demonstrates that "flat minima ⇒ better generalization" and "SGD prefers simple functions" are two sides of the same coin — data coherence simultaneously governs both phenomena, and SAM amplifies the simplicity bias further by imposing stricter stability conditions.

Abstain Mask Retain Core: Time Series Prediction by Adaptive Masking Loss with Representation Consistency

This paper reveals a counter-intuitive phenomenon in time series forecasting — that appropriately truncating historical inputs can improve prediction accuracy (termed the redundant feature learning problem) — and proposes AMRC based on information bottleneck theory. AMRC suppresses redundant feature learning via adaptive masking loss and representation consistency constraints, serving as a model-agnostic training framework that consistently improves performance across diverse architectures.

Adaptive Algorithms with Sharp Convergence Rates for Stochastic Hierarchical Optimization

This paper proposes Ada-Minimax and Ada-BiO, two adaptive algorithms that combine momentum normalization with a novel online noise estimation strategy to achieve, for the first time, sharp convergence rates of \(\tilde{O}(1/\sqrt{T} + \sqrt{\bar{\sigma}}/T^{1/4})\) for nonconvex-strongly-concave minimax and nonconvex-strongly-convex bilevel optimization without requiring prior knowledge of the gradient noise level.

An Adaptive Algorithm for Bilevel Optimization on Riemannian Manifolds

AdaRHD is the first adaptive algorithm for Riemannian bilevel optimization (RBO) that requires no prior knowledge of problem-specific parameters (strong convexity constants, Lipschitz bounds, manifold curvature). By adopting an inverse cumulative gradient norm strategy for adaptive step size selection and solving the lower-level problem, linear system, and upper-level update sequentially within a three-stage framework, AdaRHD achieves a convergence rate of \(O(1/\epsilon)\) matching non-adaptive methods, while exhibiting substantially greater robustness to initial step size choices compared to RHGD.

Auto-Compressing Networks

Auto-Compressing Networks (ACN) replace short residual connections with long-range forward connections (aggregating all layer outputs directly into the final output), making the Direct Gradient (DG) component significantly stronger than the Forward Gradient (FG), thereby implicitly compressing information into earlier layers. A ViT with only 6 layers matches standard 12-layer performance; BERT saves 75% of its layers; additional benefits include noise robustness (+6.4%) and reduced catastrophic forgetting in continual learning (−18%).

Automated Algorithm Design via Nevanlinna-Pick Interpolation

This paper proposes an automated algorithm design framework based on Nevanlinna-Pick interpolation from frequency-domain robust control theory, targeting strongly convex optimization with equality constraints, and achieves an optimal trade-off between the number of matrix-vector multiplications and the convergence rate.

AutoOpt: A Dataset and a Unified Framework for Automating Optimization Problem Solving

AutoOpt introduces the first end-to-end framework for converting optimization problem images to executable code — comprising the AutoOpt-11k dataset of 11,554 optimization formula images (handwritten + printed), an M1 hybrid encoder (ResNet+Swin→mBART) for image-to-LaTeX conversion (BLEU 96.70), an M2 DeepSeek-Coder module for LaTeX-to-PYOMO translation, and an M3 bilevel decomposition solver, achieving an overall pipeline success rate of 94.20%.

Better NTK Conditioning: A Free Lunch from ReLU Nonlinear Activation in Wide Neural Networks

This paper establishes a previously unnoticed "free" benefit of ReLU activation in wide neural networks: (a) it induces better data separation in the model's gradient feature space (angles between similar inputs are amplified in gradient space), and (b) this strictly reduces the condition number of the NTK matrix compared to linear networks. Depth further amplifies this effect — in the infinite-width-then-infinite-depth limit, all data pairs achieve equal angular separation in gradient space (~75.5°), and the NTK condition number converges to a fixed value \((n+4)/3\) that depends only on the number of training samples \(n\).

Browse all 121 Optimization & Theory papers →


📐 Learning Theory (25)

A High-Dimensional Statistical Method for Optimizing Transfer Quantities in Multi-Source Transfer Learning

This paper proposes a theoretical framework based on K-L divergence and high-dimensional statistical analysis to determine the optimal number of samples to transfer from each source task in multi-source transfer learning. The framework avoids the negative transfer caused by naively using all source data, and the resulting algorithm OTQMS surpasses the state of the art by 1.0–1.5% on DomainNet and Office-Home while reducing sample usage by 47.85% and training time by 35.19%.

Adaptive Data Analysis for Growing Data

This paper establishes the first generalization bounds for adaptive analysis over dynamically growing data, permitting analysts to schedule queries adaptively based on current dataset size, and achieving increasingly tight guarantees as data accumulates via time-varying empirical accuracy bounds and differential privacy mechanisms.

Computable Universal Online Learning

This paper introduces computability constraints into the universal online learning framework, proving that "mathematically learnable" does not imply "learnable by a computer program," and provides precise characterizations of computable learning under both agnostic and proper variants.

Conformal Online Learning of Deep Koopman Linear Embeddings

This paper proposes the COLoKe framework, which reinterprets conformal prediction as a model consistency diagnostic tool. Parameter updates are triggered only when the Koopman model's prediction error exceeds a dynamically calibrated threshold, enabling efficient online Koopman linear embedding learning for nonlinear dynamical systems.

Diffusion Transformers for Imputation: Statistical Efficiency and Uncertainty Quantification

This paper analyzes the sample complexity and uncertainty quantification performance of conditional diffusion Transformers (DiT) for time series imputation from a statistical learning perspective, and proposes a mixed-masking training strategy to improve imputation quality.

Efficient Kernelized Learning in Polyhedral Games Beyond Full-Information: From Colonel Blotto to Congestion Games

This paper proposes a kernelization-based framework for designing computationally efficient no-regret learning algorithms for polyhedral games (Colonel Blotto, graphic matroid congestion games, and network congestion games) under partial-information feedback, significantly improving the runtime complexity for learning coarse correlated equilibria (CCE).

Finite-Time Analysis of Stochastic Nonconvex Nonsmooth Optimization on the Riemannian Manifolds

This paper proposes the Riemannian Online to NonConvex (RO2NC) algorithm and its zeroth-order variant ZO-RO2NC, establishing for the first time a finite-time sample complexity guarantee of \(O(\delta^{-1}\epsilon^{-3})\) for fully nonsmooth nonconvex stochastic optimization on Riemannian manifolds, matching the optimal result in Euclidean space.

How Many Domains Suffice for Domain Generalization? A Tight Characterization via the Domain Shattering Dimension

This paper introduces the Domain Shattering Dimension (Gdim), a novel combinatorial measure that tightly characterizes the number of domains required for domain generalization (i.e., the domain sample complexity), and establishes its relationship to the classical VC dimension as \(\Theta(d \log(1/\alpha))\).

Improved Approximation Algorithms for Chromatic and Pseudometric-Weighted Correlation Clustering

For two important generalizations of Correlation Clustering—Chromatic CC and pseudometric-weighted CC—this paper achieves a 2.15-approximation and a tight 10/3-approximation, respectively, via LP relaxation and carefully designed rounding functions, significantly improving upon the previous best results of 2.5 and 6.

Infrequent Exploration in Linear Bandits

This paper proposes the INFEX framework, which executes a baseline algorithm (e.g., LinUCB/LinTS) at designated exploration steps according to a given schedule and selects arms greedily at all other time steps. It proves that as long as the number of exploration steps grows as \(\omega(\log T)\), INFEX achieves the same poly-logarithmic regret as full-time exploration while substantially reducing computational overhead (80%–99% of time steps are greedy).
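
The control flow described above can be sketched as follows. This is an illustration only: `square_schedule`, `base_step`, and `greedy_step` are hypothetical names, the two callables stand in for a round of a baseline such as LinUCB and a greedy arm choice, and the square-number schedule is just one example whose roughly \(\sqrt{T}\) exploration steps grow faster than \(\log T\).

```python
import math


def square_schedule(T):
    """Hypothetical schedule: explore at t = 1, 4, 9, ... up to T."""
    return {k * k for k in range(1, math.isqrt(T) + 1)}


def infex(T, schedule, base_step, greedy_step):
    """Run base_step(t) on scheduled rounds and greedy_step(t) on all others."""
    trace = []
    for t in range(1, T + 1):
        if t in schedule:
            trace.append(("explore", base_step(t)))
        else:
            trace.append(("greedy", greedy_step(t)))
    return trace


# Stub policies that only return an arm index, to show the dispatch.
trace = infex(100, square_schedule(100), base_step=lambda t: 1, greedy_step=lambda t: 0)
n_explore = sum(1 for kind, _ in trace if kind == "explore")
```

With this schedule and `T = 100`, ten rounds call the baseline and the remaining ninety are greedy.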

Browse all 25 Learning Theory papers →


🔗 Causal Inference (19)

A Principle of Targeted Intervention for Multi-Agent Reinforcement Learning

This paper proposes a Targeted Intervention paradigm grounded in Multi-Agent Influence Diagrams (MAIDs), which applies Pre-Strategy Intervention (PSI) exclusively to a single target agent to guide the entire multi-agent system toward a preferred Nash equilibrium satisfying additional desired outcomes, without requiring global intervention over all agents.

An Analysis of Causal Effect Estimation Using Outcome Invariant Data Augmentation

This paper presents the first systematic analysis of outcome invariant data augmentation (DA) for causal effect estimation. It proves that when DA operations preserve the outcome variable, they are equivalent to soft interventions on the treatment variable, thereby reducing confounding bias. The paper further proposes an IV-like (IVL) regression framework that treats DA parameters as "instrument-like" variables, and reduces bias further through adversarial DA composition.

Bi-Level Decision-Focused Causal Learning for Large-Scale Marketing Optimization

This paper proposes Bi-DFCL, a bilevel optimization framework that jointly leverages observational (OBS) data and randomized controlled trial (RCT) data to train marketing resource allocation models. The upper level trains a Bridge Network with unbiased decision loss on RCT data to dynamically correct the bias of the lower level trained on OBS data. The framework further introduces differentiable surrogate decision losses (PPL/PIFD) grounded in the primal problem and an implicit differentiation algorithm, addressing the predict-then-optimize inconsistency and the bias-variance dilemma of conventional two-stage methods. The system has been deployed at scale on Meituan.

Causality-Induced Positional Encoding for Transformer-Based Representation Learning of Non-Sequential Features

CAPE learns the causal DAG structure among features from tabular data, embeds it into hyperbolic space to generate causality-aware rotary positional encodings (RoPE), enabling Transformers to process non-sequential yet causally structured feature data, with significant performance gains on downstream multi-omics tasks.

Counterfactual Reasoning for Steerable Pluralistic Value Alignment of Large Language Models

This paper proposes the COUPLE framework, which constructs a Structural Causal Model (SCM) to model the dependencies and priorities among multi-dimensional values, and leverages counterfactual reasoning to achieve steerable alignment of LLMs toward arbitrary fine-grained pluralistic value objectives.

Cyclic Counterfactuals under Shift–Scale Interventions

This paper establishes a theoretical framework for counterfactual reasoning under shift–scale soft interventions in cyclic (non-DAG) structural causal models (SCMs). It proves that a global contraction condition guarantees unique solvability of cyclic SCMs and derives sub-Gaussian concentration inequalities for counterfactual distributions.

Demystifying Spectral Feature Learning for Instrumental Variable Regression

This paper establishes rigorous generalization error bounds for spectral feature-based nonparametric instrumental variable (NPIV) regression, revealing that performance is jointly governed by two factors: spectral alignment between the structural function and the conditional expectation operator (approximation error) and the rate of singular value decay (estimation error). A Good-Bad-Ugly trichotomy is proposed along with data-driven diagnostic tools.

Differentiable Structure Learning and Causal Discovery for General Binary Data

This paper proposes a general differentiable structure learning framework based on the Multivariate Bernoulli Distribution (MVB) that makes no assumptions about the specific data-generating process, captures arbitrary higher-order dependencies among binary discrete variables, and proves that while DAGs are not identifiable in the general setting, the minimal equivalence class (Markov equivalence class) is recoverable.

Do-PFN: In-Context Learning for Causal Effect Estimation

This paper proposes Do-PFN, which extends Prior-data Fitted Networks (PFN) to causal effect estimation. A Transformer is pre-trained on large-scale synthetic SCM data to perform in-context causal reasoning, enabling prediction of causal intervention distributions (CID) and CATE from observational data alone—without requiring causal graph knowledge or the unconfoundedness assumption—achieving strong performance on both synthetic and semi-synthetic benchmarks.

Domain-Adapted Granger Causality for Real-Time Cross-Slice Attack Attribution in 6G Networks

This paper proposes a domain-adapted Granger causality framework for 6G network slicing that integrates enhanced Granger causality testing with network resource contention modeling to enable real-time cross-slice attack attribution, achieving 89.2% accuracy and 87 ms response time across 1,100 attack scenarios, substantially outperforming existing statistical, deep learning, and causal discovery methods.

Browse all 19 Causal Inference papers →


🔬 Interpretability (76)

A Controllable Examination for Long-Context Language Models

This paper proposes LongBioBench, which uses synthetically generated fictional biographies as both needles and haystacks to construct a long-context LLM evaluation framework satisfying three core principles: seamless context, controllable settings, and reliable evaluation. Evaluating 18 models, the benchmark reveals that current LCLMs exhibit substantial deficiencies in reasoning and trustworthiness despite adequate retrieval performance.

A is for Absorption: Studying Feature Splitting and Absorption in Sparse Autoencoders

This paper identifies and systematically studies the phenomenon of "feature absorption" in SAEs: apparently monosemantic SAE latents fail to activate on certain tokens because their feature directions are "absorbed" by more specific sub-latents. This is shown to be an inevitable consequence of hierarchical features combined with sparsity loss, posing a fundamental challenge to using SAEs for reliable LLM interpretation.

AdaptGrad: Adaptive Sampling to Reduce Noise

AdaptGrad analyzes the theoretical origin of noise in SmoothGrad—out-of-boundary (OOB) sampling behavior—and proposes adaptively adjusting the Gaussian sampling variance for each input dimension to bound the additional noise. The method nearly eliminates gradient noise while revealing richer fine-grained features, requires minimal implementation effort, and is composable with arbitrary gradient-based explanation methods.

Additive Models Explained: A Computational Complexity Approach

This paper presents a systematic computational complexity analysis of multiple explanation types for Generalized Additive Models (GAMs), covering 54 combinations of "component model × input domain × explanation method." It reveals that the explanation complexity of GAMs is highly sensitive to the type of input domain — a phenomenon never observed in other ML models such as decision trees or neural networks — thereby challenging the intuitive assumption that "additive implies interpretable."

AgentiQL: An Agent-Inspired Multi-Expert Framework for Text-to-SQL Generation

This paper proposes AgentiQL, a multi-expert agent framework for Text-to-SQL: a reasoning agent decomposes questions into sub-problems, a coding agent generates sub-queries, a refinement step corrects column selection, and an adaptive router intelligently routes between a baseline parser and the modular pipeline. Using a 14B open-source model, AgentiQL achieves 86.07% EX on Spider, approaching GPT-4 SOTA (89.65%).

An Analysis of Concept Bottleneck Models: Measuring, Understanding, and Mitigating the Impact of Noisy Annotations

This paper presents the first systematic study on the impact of annotation noise on Concept Bottleneck Models (CBMs). It identifies approximately 23% of concepts as "susceptible concepts" that drive the majority of performance degradation, and proposes a two-stage mitigation strategy combining SAM at training time and uncertainty-guided intervention at inference time to restore model robustness.

Are Greedy Task Orderings Better Than Random in Continual Linear Regression?

This paper systematically analyzes the convergence differences between greedy task orderings (maximizing dissimilarity between consecutive tasks) and random orderings in continual linear regression. It reveals that greedy orderings are competitive with random orderings in the full-rank setting, but single-pass greedy ordering can fail catastrophically in the general-rank setting, whereas greedy ordering with repetition achieves a convergence rate of \(\mathcal{O}(1/\sqrt[3]{k})\).

ARECHO: Autoregressive Evaluation via Chain-Based Hypothesis Optimization for Speech Multi-Metric Estimation

ARECHO frames speech multi-metric evaluation as a chain-based autoregressive token prediction task. It designs a unified speech information tokenization pipeline to handle 87 heterogeneous metrics (numerical/categorical/bounded/unbounded), explicitly captures inter-metric dependencies (e.g., intelligibility–naturalness correlation) via dynamic classification chains, and employs two-step confidence-guided decoding to reduce error propagation. ARECHO comprehensively outperforms the UniVERSA baseline across enhancement, synthesis, and noisy speech evaluation (Avg Test MSE 23.26 vs. 96.99, −76%).

ARC-JSD: Attributing Response to Context via Jensen-Shannon Divergence Driven Mechanistic Study

ARC-JSD proposes a RAG context attribution method based on Jensen-Shannon Divergence — by comparing the JSD between model output distributions with and without specific context sentences, it localizes the context that a response depends on without fine-tuning or gradient computation. The method achieves 3× faster computation than baselines, improves Top-1 attribution accuracy by 10.7% on average, and reveals via Logit Lens that attribution-relevant attention heads are concentrated in higher layers.
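
The scoring idea can be sketched in a few lines. This assumes the model's output distributions are already available as plain probability lists, once with the full context and once per removed sentence. `rank_context_sentences` is a hypothetical helper, not the paper's implementation.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in nats; terms with p_i = 0 contribute zero."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions."""
    m = [0.5 * (pi + qi) for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def rank_context_sentences(full_dist, ablated_dists):
    """Score sentence i by the JSD between the full-context output distribution
    and the distribution obtained with sentence i removed; rank by score."""
    scores = [js_divergence(full_dist, d) for d in ablated_dists]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return scores, order


# Toy distributions over a 3-token vocabulary (made-up numbers).
full = [0.7, 0.2, 0.1]
ablated = [[0.7, 0.2, 0.1], [0.1, 0.2, 0.7], [0.6, 0.3, 0.1]]
scores, order = rank_context_sentences(full, ablated)
```

Here removing the second sentence shifts the output distribution the most, so it is ranked first.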

Base Models Know How to Reason, Thinking Models Learn When

Through unsupervised SAE clustering, this work discovers a taxonomy of reasoning mechanisms in thinking models, then activates the corresponding latent capabilities in base models via steering vectors. The resulting hybrid model recovers up to 91% of the performance gap between thinking and base models—without any weight updates—demonstrating that base models already possess reasoning capabilities, and that thinking models merely learn when to deploy them.

Browse all 76 Interpretability papers →


📦 Model Compression (140)

4DGCPro: Efficient Hierarchical 4D Gaussian Compression for Progressive Volumetric Video Streaming

This paper proposes 4DGCPro, a hierarchical 4D Gaussian compression framework that achieves multi-bitrate progressive volumetric video streaming within a single model, via perception-weighted hierarchical Gaussian representation, motion-aware adaptive grouping, and end-to-end entropy-optimized training. The framework supports real-time decoding and rendering on mobile devices and surpasses existing SOTA in rate-distortion performance.

A*-Thought: Efficient Reasoning via Bidirectional Compression for Low-Resource Settings

This paper proposes A*-Thought, a CoT compression framework based on the A* search algorithm. It introduces Bidirectional Importance Scoring (BIS) to measure each reasoning step's relevance to both the question and the answer, and combines it with path-level A* search to efficiently identify the most compact valid reasoning path within an exponentially large search space. Under a 512-token budget, A*-Thought improves QwQ-32B accuracy by 2.39×; under a 4096-token budget, it reduces output tokens by approximately 50% with negligible accuracy loss.

A Granular Study of Safety Pretraining under Model Abliteration

This paper systematically investigates the effects of model abliteration (an inference-time activation-space editing attack) on various data-driven safety pretraining stages. It finds that safety mechanisms relying solely on refusal training are highly vulnerable, whereas combining multiple safety signals (safe-only filtering + rephrasing + metatags + refusals) distributes safety behavior across a broader representational space, making it substantially more resistant to single-direction projection removal.

A Partition Cover Approach for Tokenization

This paper reformulates tokenization as a partition cover optimization problem, proves it NP-hard, and proposes a polynomial-time greedy algorithm GreedTok that outperforms BPE in both compression rate and downstream tasks when pretraining a 1B-parameter LLM.

A Simple Linear Patch Revives Layer-Pruned Large Language Models

LinearPatch inserts a lightweight symmetric matrix — fusing a Hadamard transform with channel scaling — at the pruning interface to repair activation magnitude mismatches caused by layer pruning. On LLaMA-3-8B, it retains 94.15% of the original performance without any training, and reaches 95.16% after 30 minutes of distillation.
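
As a rough illustration of how a Hadamard transform and channel scaling can fuse into one symmetric matrix, the sketch below assumes the patch has the form \(H\,\mathrm{diag}(s)\,H^\top\) with a normalized Sylvester Hadamard matrix \(H\). That form, the function names, and the example scales are assumptions, and how the scales are estimated is not covered.

```python
def hadamard(n):
    """Normalized Sylvester Hadamard matrix; n must be a power of two."""
    H = [[1.0]]
    while len(H) < n:
        H = [row + row for row in H] + [row + [-x for x in row] for row in H]
    scale = len(H) ** -0.5
    return [[x * scale for x in row] for row in H]


def linear_patch(scales):
    """Build P = H diag(scales) H^T, which is symmetric by construction."""
    n = len(scales)
    H = hadamard(n)
    return [
        [sum(H[i][k] * scales[k] * H[j][k] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]


# Unit scales give the identity; other scales give a dense symmetric matrix.
identity_patch = linear_patch([1.0, 1.0, 1.0, 1.0])
patch = linear_patch([1.0, 0.5, 2.0, 1.5])
```

The patch is a single matrix multiply at the pruning interface, which is why it adds little inference cost.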

A Token is Worth over 1,000 Tokens: Efficient Knowledge Distillation through Low-Rank Clone

This paper proposes Low-Rank Clone (LRC), which compresses teacher weights into student weights via learnable low-rank projection matrices (soft pruning), while aligning intermediate activations of both attention and FFN modules (activation cloning). A 1.7B model trained on only 20B tokens surpasses Qwen3-1.7B trained on 36T tokens (64.98 vs. 63.17), achieving a 1,000× improvement in training efficiency.

Accurate and Efficient Low-Rank Model Merging in Core Space

This paper proposes the Core Space Merging framework, which performs model merging within a common reference basis space constructed from low-rank LoRA matrices. This approach losslessly reduces the merging operation from the full \(m \times n\) space to a compact \(Tr \times Tr\) space (where \(T\) is the number of tasks and \(r\) is the LoRA rank), achieving state-of-the-art merging accuracy on Llama 3 8B while reducing computational cost by several orders of magnitude.

Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference

Existing KV cache eviction methods uniformly allocate budgets across all attention heads, ignoring the substantial variation in attention concentration across heads. This paper proposes Ada-KV — the first head-wise adaptive budget allocation strategy — which redistributes budget from sparse heads to dispersed heads. It provides a theoretical proof that the approach minimizes an upper bound on eviction loss, and serves as a plug-and-play improvement over existing methods across 29 datasets.
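One simple way to picture head-wise adaptive allocation is to pool attention scores from all heads, keep the globally highest-scoring entries, and let each head's budget be the number of its entries that survive. The sketch below follows that reading of the summary; the function name, the toy scores, and the pooling rule itself are assumptions, not the paper's implementation.

```python
def adaptive_head_budgets(head_scores, budget_per_head):
    """Split a fixed total KV budget across heads by global top-k selection.

    head_scores: one list of attention scores per head (toy data here).
    budget_per_head: the budget a uniform scheme would give every head.
    """
    total = budget_per_head * len(head_scores)
    # Pool (score, head index) pairs from every head and rank them together.
    pooled = [(score, h) for h, scores in enumerate(head_scores) for score in scores]
    pooled.sort(key=lambda pair: pair[0], reverse=True)
    budgets = [0] * len(head_scores)
    for _, h in pooled[:total]:
        budgets[h] += 1
    return budgets


scores = [
    [0.90, 0.05, 0.03, 0.02],  # concentrated head: one entry dominates
    [0.30, 0.28, 0.22, 0.20],  # dispersed head: mass spread over many entries
]
budgets = adaptive_head_budgets(scores, budget_per_head=2)
```

In this toy case the concentrated head keeps one entry and the dispersed head keeps three, while the total budget stays the same as under uniform allocation.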

Adaptive Prediction-Powered AutoEval with Reliability and Efficiency Guarantees

This paper proposes the R-AutoEval+ framework, which introduces an adaptive weighting mechanism within the testing-by-betting framework to dynamically regulate reliance on LLM-judge-generated synthetic data. It is the first method to simultaneously guarantee evaluation reliability and sampling efficiency no worse than approaches using only real data under finite samples, validated across three scenarios: LLM quantization, prompt selection, and inference budget allocation.

AdmTree: Compressing Lengthy Context with Adaptive Semantic Trees

This paper proposes AdmTree — an adaptive hierarchical context compression framework that constructs leaf gist tokens via information-density-driven dynamic segmentation, then aggregates them bottom-up into a binary semantic tree to achieve multi-granularity semantic preservation. It addresses two fundamental challenges: local detail loss in explicit methods and positional bias in implicit methods, outperforming the SOTA baseline Activation Beacon by over 10% on LongBench.

Browse all 140 Model Compression papers →


🕸️ Graph Learning (54)

Agint: Agentic Graph Compilation for Software Engineering Agents

This paper proposes Agint, an agentic graph compiler that compiles natural language intent into typed, effect-aware DAGs (directed acyclic graphs) through a six-level type floor (TEXT→TYPED→SPEC→STUB→SHIM→PURE), progressively refining natural language into executable code while supporting executable intermediate representations, a hybrid JIT runtime, and a Unix-style composable toolchain.

BLISS: Bandit Layer Importance Sampling Strategy for Efficient Training of Graph Neural Networks

This paper proposes BLISS, which formulates layer-wise neighbor sampling in GNNs as a multi-armed bandit problem. Using the EXP3 algorithm, it dynamically adjusts per-edge sampling probabilities with the variance contribution of neighbors to node representations as the reward signal, achieving accuracy on par with or exceeding full-batch training on GCN and GAT.
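For readers unfamiliar with EXP3, the following is a minimal sketch of its standard probability and weight-update rules, with each candidate neighbor treated as an arm. How BLISS maps the variance contribution to a reward in [0, 1] is not specified in this note, so the reward value below is a placeholder assumption.

```python
import math


def exp3_probabilities(weights, gamma):
    """Mix the normalized weights with uniform exploration (rate gamma)."""
    total = sum(weights)
    k = len(weights)
    return [(1.0 - gamma) * w / total + gamma / k for w in weights]


def exp3_update(weights, probs, arm, reward, gamma):
    """Update only the sampled arm, using an importance-weighted reward."""
    k = len(weights)
    estimated = reward / probs[arm]  # unbiased estimate of the arm's reward
    new_weights = list(weights)
    new_weights[arm] *= math.exp(gamma * estimated / k)
    return new_weights


# Three candidate neighbors, all equally likely at the start.
weights = [1.0, 1.0, 1.0]
gamma = 0.3
probs = exp3_probabilities(weights, gamma)
# Placeholder reward in [0, 1] for the sampled neighbor (arm 0).
weights = exp3_update(weights, probs, arm=0, reward=1.0, gamma=gamma)
new_probs = exp3_probabilities(weights, gamma)
```

After the update, the rewarded neighbor is sampled more often, while the `gamma / k` term keeps every neighbor's probability strictly positive.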

Bridging Graph and State-Space Modeling for Intensive Care Unit Length of Stay Prediction

This paper proposes S2G-Net, a dual-branch architecture that integrates Mamba state-space temporal encoding with a multi-view graph neural network (GraphGPS) for ICU length-of-stay (LOS) prediction, achieving comprehensive improvements over sequential, graph-based, and hybrid baselines on MIMIC-IV.

Deliberation on Priors: Trustworthy Reasoning of Large Language Models on Knowledge Graphs

This paper proposes DP (Deliberation on Priors), a framework that leverages structural priors from knowledge graphs via progressive knowledge distillation to generate faithful relational paths, and validates reasoning reliability through a reasoning introspection strategy based on constraint priors, achieving new state-of-the-art performance on KGQA benchmarks.

Diagnosing and Addressing Pitfalls in KG-RAG Datasets: Toward More Reliable Benchmarking

A systematic audit of 16 KGQA datasets reveals an average factual correctness of only 57% (WebQSP: 52%, MetaQA: 20%). The paper proposes KGQAGen, a framework that constructs high-quality multi-hop QA datasets via LLM-guided subgraph expansion and automatic SPARQL validation, yielding KGQAGen-10k with 96.3% accuracy. The study further demonstrates that the primary bottleneck in KG-RAG lies in retrieval rather than reasoning.

DuetGraph: Coarse-to-Fine Knowledge Graph Reasoning with Dual-Pathway Global-Local Fusion

DuetGraph proposes a dual-pathway (message passing + global attention) parallel fusion model and a coarse-to-fine reasoning optimization strategy. By separating rather than stacking local/global information processing, it mitigates score over-smoothing in KG reasoning, achieving SOTA on both inductive and transductive tasks with up to 8.7% MRR improvement and 1.8× training speedup.

Dynamic Bundling with Large Language Models for Zero-Shot Inference on Text-Attributed Graphs

DENSE proposes a "text bundling" strategy that packages textually and topologically/semantically similar nodes into bundles, queries LLMs for bundle-level labels, supervises GNN training via entropy-based and ranking-based losses, and dynamically refines bundles to exclude noisy nodes. It achieves comprehensive zero-shot inference improvements over GPT-4o and graph foundation models across 10 TAG datasets.

Elastic Weight Consolidation for Knowledge Graph Continual Learning: An Empirical Evaluation

This paper systematically evaluates Elastic Weight Consolidation (EWC) for continual learning of TransE knowledge graph embeddings on FB15k-237, finding that EWC reduces catastrophic forgetting from 12.62% to 6.85% (a 45.7% reduction), and reveals that task partitioning strategy (relation-based vs. random) has a substantial impact on forgetting metrics (a difference of 9.8 percentage points).
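The sketch below shows the usual diagonal-Fisher form of the EWC penalty and checks the relative-reduction arithmetic quoted above. The parameter values, Fisher entries, and `lam` are toy assumptions; only the 12.62% and 6.85% figures come from the summary.

```python
def ewc_penalty(params, anchor_params, fisher_diag, lam):
    """Quadratic penalty that anchors parameters important to earlier tasks."""
    return 0.5 * lam * sum(
        f * (p - a) ** 2 for p, a, f in zip(params, anchor_params, fisher_diag)
    )


def relative_reduction(before, after):
    """Fractional drop from `before` to `after`."""
    return (before - after) / before


# Toy values: the first parameter moved by 1.0 and has high Fisher importance.
penalty = ewc_penalty([1.0, 2.0], [0.0, 2.0], [4.0, 1.0], lam=0.5)

# Forgetting figures reported in the summary (percent).
reduction = relative_reduction(12.62, 6.85)
```

The penalty is added to the new task's loss, so parameters with large Fisher values are pulled back toward their earlier values more strongly.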

FALCON: An ML Framework for Fully Automated Layout-Constrained Analog Circuit Design

FALCON proposes an end-to-end framework for automated analog/RF circuit design via a three-stage pipeline: MLP-based topology selection, edge-centric GNN performance prediction, and differentiable layout-constrained gradient inference. Trained on a million-scale Cadence simulation dataset, the framework achieves >99% topology selection accuracy, <10% performance prediction error, and sub-second per-instance inference.

FastJAM: a Fast Joint Alignment Model for Images

FastJAM is a graph-based fast joint image alignment method that computes pairwise keypoint correspondences using off-the-shelf image matchers, constructs a keypoint graph via fast non-parametric clustering, employs a GNN to propagate and aggregate information for predicting per-image homography parameters, and adopts an inverse-compositional loss to eliminate the need for regularization hyperparameters. It reduces joint alignment time from hours/minutes to approximately 49 seconds while achieving alignment quality superior to or on par with existing methods.

Browse all 54 Graph Learning papers →


📈 Time Series (54)

A Graph Neural Network Approach for Localized and High-Resolution Temperature Forecasting

This paper proposes a GCN-GRU hybrid framework for community-scale (2.5 km) high-resolution temperature forecasting (1–48 hours), validated across three regions in southwestern Ontario, Canada. The largest region achieves an average MAE of 1.93°C and a 48-hour MAE of 2.93°C. The work explores ClimateBERT language model embeddings as a standardized input scheme, and provides a transferable lightweight forecasting framework targeting data-scarce regions in the Global South.

AERO: A Redirection-Based Optimization Framework Inspired by Judo for Robust Probabilistic Forecasting

AERO proposes an optimization paradigm inspired by the judo principle of "redirecting force rather than resisting it," attempting to redirect adversarial perturbations into beneficial optimization signals. The framework is theoretically grounded in 15 axioms and 4 theorems, constructing an energy-conservation-based gradient redirection system. However, the actual implementation is substantially simplified to momentum SGD with Gaussian noise injection, and validation is conducted solely on a single private solar energy price prediction dataset without any baseline comparisons.

AttentionPredictor: Temporal Patterns Matter for KV Cache Compression

AttentionPredictor is the first learning-based method that directly predicts attention patterns for KV cache compression and critical token identification. By leveraging a lightweight CNN to capture spatiotemporal patterns in attention scores, it achieves 13× KV cache compression and 5.6× inference speedup, with a unified prediction model of only 21 KB shared across all Transformer layers.

BubbleFormer: Forecasting Boiling with Transformers

This paper proposes BubbleFormer, a Transformer architecture based on decomposed spatiotemporal attention for forecasting boiling dynamics—including the notoriously difficult spontaneous bubble nucleation events—accompanied by the BubbleML 2.0 dataset (160+ high-fidelity simulations), achieving accurate spatiotemporal boiling predictions across diverse fluids, geometries, and wall conditions.

Causal Masking on Spatial Data: An Information-Theoretic Case for Learning Spatial Datasets with Unimodal Language Models

This paper demonstrates that applying causal masking directly to spatial data (chess board states in FEN format) for training a unimodal LLM outperforms first linearizing the data into sequences (PGN move records) and then applying causal masking — Llama 1.3B trained with FEN + causal masking achieves ~2630 Elo, whereas PGN + causal masking yields only ~2130 Elo.

CausalDynamics: A Large-Scale Benchmark for Structural Discovery of Dynamical Causal Models

This paper introduces CausalDynamics — the largest benchmark to date for causal discovery in dynamical systems (14,000+ graphs, 50M+ samples) — encompassing a three-tier progressively complex hierarchy ranging from 3-dimensional chaotic ODE/SDE systems and hierarchically coupled systems to realistic climate models. The benchmark comprehensively evaluates 10 state-of-the-art causal discovery algorithms, revealing the shortcomings of current deep learning methods on high-dimensional nonlinear dynamical systems.

Channel Matters: Estimating Channel Influence for Multivariate Time Series

This paper proposes Channel-wise Influence (ChInf)—the first influence function method capable of quantifying the effect of individual channels on model performance in multivariate time series (MTS). By decomposing TracIn from the holistic sample level to the channel level, ChInf enables two downstream applications: channel-level anomaly detection and channel pruning, achieving state-of-the-art performance on 5 anomaly detection benchmarks.

Decomposition of Small Transformer Models

This paper extends Stochastic Parameter Decomposition (SPD) to Transformers by designing a sequence-aware causal importance function and a novel partial reconstruction loss. On a toy induction head task, the method recovers the expected two-step circuit; on GPT-2-small, it localizes rank-1 parameter subspaces corresponding to interpretable concepts such as "golf" and "basketball."

DemandCast: Global hourly electricity demand forecasting

DemandCast is an open-source machine learning framework that leverages XGBoost to integrate historical electricity demand, ERA5 temperature data, and socioeconomic features for hourly electricity demand forecasting across 56 countries/regions worldwide. By normalizing the target variable as a fraction of annual demand, the framework achieves cross-country comparability and attains a MAPE of 9.2% on a temporally held-out test set.
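To make the normalization and the metric concrete, here is a small sketch that rescales an hourly series to fractions of its annual total and computes MAPE. Reading "fraction of annual demand" as division by the annual sum is an assumption, and the numbers are made up.

```python
def normalize_by_annual_demand(hourly_demand):
    """Express each hourly value as a fraction of the series' annual total."""
    annual = sum(hourly_demand)
    if annual <= 0:
        raise ValueError("annual demand must be positive")
    return [d / annual for d in hourly_demand]


def mape(actual, predicted):
    """Mean absolute percentage error, in percent (actual values must be non-zero)."""
    errors = [abs((a - p) / a) for a, p in zip(actual, predicted)]
    return 100.0 * sum(errors) / len(errors)


# Made-up demand values standing in for a year of hourly data.
fractions = normalize_by_annual_demand([50.0, 150.0, 300.0])
error = mape([100.0, 200.0], [110.0, 180.0])
```

Because every country's normalized series sums to one, a single model can be trained across countries whose absolute demand differs by orders of magnitude.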

Diffusion Transformers as Open-World Spatiotemporal Foundation Models

This paper proposes UrbanDiT, the first open-world urban spatiotemporal foundation model based on Diffusion Transformers. It integrates heterogeneous data types (grid/graph) and diverse tasks (prediction, interpolation, extrapolation, imputation) through a unified prompt learning framework, achieving state-of-the-art performance across multiple cities and scenarios while demonstrating strong zero-shot generalization.

Browse all 54 Time Series papers →


🏥 Medical Imaging (74)

3D-RAD: A Comprehensive 3D Radiology Med-VQA Dataset with Multi-Temporal Analysis and Diverse Diagnostic Tasks

This paper introduces 3D-RAD — the first large-scale 3D medical VQA benchmark, comprising 170K CT-based question-answer pairs across six clinical task categories (including a novel multi-temporal diagnosis task), accompanied by a 136K training set. The benchmark reveals critical deficiencies of existing VLMs in 3D temporal reasoning.

A Novel Approach to Classification of ECG Arrhythmia Types with Latent ODEs

This work combines a path-minimized Latent ODE encoder with a gradient-boosted decision tree (GBDT) into a two-stage ECG arrhythmia classification pipeline. On the MIT-BIH dataset, the macro AUC-ROC degrades only marginally from 0.984 at 360 Hz to 0.976 at 45 Hz, demonstrating strong robustness to sampling frequency variation.

A Unified Solution to Video Fusion: From Multi-Frame Learning to Benchmarking

This paper proposes UniVF, the first unified video fusion framework based on multi-frame learning, optical flow feature warping, and temporal consistency loss, along with VF-Bench, the first video fusion benchmark covering four major fusion tasks (multi-exposure, multi-focus, infrared-visible, and medical), achieving state-of-the-art performance across all sub-tasks.

Active Target Discovery under Uninformative Prior: The Power of Permanent and Transient Memory

This paper proposes EM-PTDM, a framework inspired by the dual-memory system in neuroscience. It leverages a pretrained diffusion model as "permanent memory" and incorporates a lightweight "transient memory" module based on Doob's h-transform to achieve efficient active target discovery without any domain-specific prior data, with theoretical guarantees of monotonic prior improvement.

Are Pixel-Wise Metrics Reliable for Sparse-View Computed Tomography Reconstruction?

This paper reveals that pixel-level metrics such as PSNR and SSIM fail to capture anatomical structural completeness in sparse-view CT reconstruction (correlation only 0.16–0.30), and proposes anatomy-aware metrics (NSD/clDice) based on automated segmentation alongside the CARE framework—which incorporates segmentation-guided loss into diffusion model training—achieving 32% improvement in structural completeness for large organs and 36% for vessels.

Brain Harmony: A Multimodal Foundation Model Unifying Morphology and Function into 1D Tokens

Brain Harmony is the first multimodal brain foundation model that unifies structural morphology (T1 sMRI) and functional dynamics (fMRI), compressing high-dimensional neuroimaging data into compact 1D token representations via Geometric Harmonics Pre-alignment and Temporally Adaptive Patch Embedding (TAPE). The model consistently outperforms prior methods on neurodevelopmental/neurodegenerative disease diagnosis and cognitive prediction tasks.

BrainOmni: A Brain Foundation Model for Unified EEG and MEG Signals

This paper proposes BrainOmni—the first brain signal foundation model that unifies EEG and MEG—by discretizing heterogeneous brain signals into a unified token space via BrainTokenizer (incorporating a physical Sensor Encoder), followed by self-supervised masked prediction pretraining with a Criss-Cross Transformer. The model achieves an 11.7 percentage-point improvement on Alzheimer's disease detection and demonstrates zero-shot reconstruction generalization to completely unseen devices.

Care-PD: A Multi-Site Anonymized Clinical Dataset for Parkinson's Disease Gait Assessment

This work introduces Care-PD — the largest multi-site anonymized 3D mesh dataset for Parkinson's disease (PD) gait analysis to date, comprising 9 cohorts, 8 clinical centers, 362 subjects, and 8,477 walking bouts. It provides a systematic benchmark for UPDRS gait scoring and motion pre-training tasks, demonstrating that fine-tuning on Care-PD reduces MPJPE from 60.8 mm to 7.5 mm and improves F1 by 17 percentage points.

Convolutional Monge Mapping between EEG Datasets to Support Independent Component Labeling

This paper extends CMMN (Convolutional Monge Mapping Normalization) by proposing two strategies — channel-averaged PSD with \(\ell_1\)-normalized barycenter and subject-to-subject matching — to generate a single time-domain filter for domain adaptation across EEG datasets with differing channel counts. On independent component (IC) brain/non-brain classification, the F1 score improves from 0.77 to 0.84, surpassing ICLabel (0.88→0.91).

CXReasonBench: A Benchmark for Evaluating Structured Diagnostic Reasoning in Chest X-rays

This paper proposes CheXStruct and CXReasonBench — a structured diagnostic reasoning evaluation framework for chest X-rays that employs multi-path, multi-stage assessment to reveal critical deficiencies in existing LVLMs at intermediate reasoning steps.

Browse all 74 Medical Imaging papers →


🩺 Medical LLM (16)

AI Should Sense Better, Not Just Scale Bigger: Adaptive Sensing as a Paradigm Shift

Inspired by biological sensory systems, this position paper argues that AI research must shift from simply scaling models to optimizing inputs—by dynamically adjusting sensor-level parameters (exposure, gain, multimodal configuration, etc.) to produce inputs most favorable to the model. Under ideal sensor adaptation, a small model (EfficientNet-B0, 5M parameters) can outperform a large model (OpenCLIP-H, 632M parameters), and the paper proposes a progressive formalization framework ranging from single-shot perception to closed-loop perception–action coupling.

CGBench: Benchmarking Language Model Scientific Reasoning for Clinical Genetics Research

This paper introduces CGBench, a clinical genetics benchmark grounded in ClinGen expert annotations, designed to evaluate the scientific literature reasoning capabilities of LLMs from both variant and gene curation perspectives. The benchmark encompasses three tasks—evidence scoring, evidence verification, and experimental evidence extraction—and finds that reasoning models perform best on fine-grained tasks but underperform non-reasoning models on high-level judgments.

CureAgent: A Training-Free Executor-Analyst Framework for Clinical Reasoning

CureAgent proposes an Executor-Analyst collaborative framework that decouples precise tool invocation (TxAgent/Llama-8B as Executor) from high-level clinical reasoning (Gemini 2.5 as Analyst). Combined with a Stratified Ensemble Late Fusion topology that preserves evidence diversity, the system achieves 83.8% accuracy on CURE-Bench without end-to-end fine-tuning, and reveals two critical scaling findings: the context–performance paradox and the curse of dimensionality in action space.

Demo: Guide-RAG: Evidence-Driven Corpus Curation for Retrieval-Augmented Generation in Long COVID

This paper systematically evaluates six RAG corpus configurations for Long COVID clinical QA. The GS-4 configuration—combining clinical guidelines with high-quality systematic reviews—consistently outperforms both single-guideline and large-scale literature retrieval baselines across faithfulness, relevance, and comprehensiveness. The authors further introduce the Guide-RAG framework and the LongCOVID-CQ evaluation dataset.

Document Summarization with Conformal Importance Guarantees

This work presents the first application of Conformal Prediction to document summarization. By calibrating a threshold on sentence importance scores, it provides rigorous statistical guarantees on user-controllable coverage (\(1-\alpha\)) and recall (\(\beta\)) for extractive summaries. The method is model-agnostic and requires only a small calibration set.

H-DDx: A Hierarchical Evaluation Framework for Differential Diagnosis

H-DDx proposes a differential diagnosis evaluation framework grounded in the ICD-10 classification hierarchy. By expanding both predicted and ground-truth diagnoses to their ancestor nodes and computing a Hierarchical Diagnostic F1 (HDF1), the framework rewards "clinically relevant approximate correctness" rather than exact match only. Evaluating 22 LLMs reveals that the domain-specialized model MediPhi rises from 20th to 2nd place under HDF1, an advantage completely obscured by Top-5 metrics.
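The ancestor-expansion idea can be sketched as a set-based F1 over expanded label sets. The tiny hierarchy below is a made-up stand-in for ICD-10, and the paper's exact HDF1 aggregation may differ from this single-case version.

```python
# Hypothetical toy hierarchy (child -> parent); not real ICD-10 codes.
PARENT = {
    "bacterial_pneumonia": "pneumonia",
    "viral_pneumonia": "pneumonia",
    "pneumonia": "respiratory",
    "asthma": "respiratory",
    "respiratory": None,
}


def with_ancestors(labels):
    """Expand each label to itself plus every ancestor up to the root."""
    expanded = set()
    for label in labels:
        node = label
        while node is not None:
            expanded.add(node)
            node = PARENT[node]
    return expanded


def hierarchical_f1(predicted, truth):
    """F1 computed on ancestor-expanded prediction and ground-truth sets."""
    pred, gold = with_ancestors(predicted), with_ancestors(truth)
    overlap = len(pred & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


near_miss = hierarchical_f1(["viral_pneumonia"], ["bacterial_pneumonia"])
exact = hierarchical_f1(["asthma"], ["asthma"])
```

An exact-match metric would score the near miss as zero, whereas the expanded sets share two of three nodes and give partial credit.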

HealthSLM-Bench: Benchmarking Small Language Models for Mobile and Wearable Healthcare Monitoring

HealthSLM-Bench is the first benchmark to systematically evaluate small language models (SLMs, 1–4B parameters) on mobile and wearable health monitoring tasks, covering zero-shot, few-shot, and instruction fine-tuning paradigms, with on-device deployment validated on an iPhone.

Large Language Models as Medical Codes Selectors: A Benchmark Using the International Classification of Primary Care

This work constructs a medical coding benchmark based on an extract-retrieve-select framework, evaluating ICPC-2 code selection capability across 33 LLMs. Results show that 28 models achieve F1 > 0.8, demonstrating that LLMs can effectively automate primary care coding without fine-tuning.

LLM-Assisted Emergency Triage Benchmark: Bridging Hospital-Rich and MCI-Like Field Simulation

This work constructs an open, LLM-assisted emergency triage benchmark based on MIMIC-IV-ED, defining two evaluation scenarios—hospital-rich and mass casualty incident (MCI)-like field simulation—and providing baseline models along with SHAP-based interpretability analysis to promote reproducibility and accessibility in triage prediction research.

Mind the Gap: Aligning Knowledge Bases with User Needs to Enhance Mental Health Retrieval

This paper proposes a knowledge base augmentation framework grounded in "demand gap" analysis. By overlaying real user data (forum posts) onto existing mental health resource repositories to identify content voids, the framework applies targeted augmentation strategies to achieve near-full-corpus RAG retrieval quality with minimal document additions.

Browse all 16 Medical LLM papers →


🧬 Computational Biology (75)

A Standardized Benchmark for Multilabel Antimicrobial Peptide Classification

This paper presents ESCAPE—the first standardized multilabel antimicrobial peptide classification benchmark, integrating 80,000+ peptides from 27 public databases, along with a dual-branch Transformer + bidirectional cross-attention baseline model that achieves a 2.56% relative improvement in mAP over the second-best method.

A Unified Framework for Variable Selection in Model-Based Clustering with Missing Not at Random

Within a Gaussian mixture model clustering framework, this paper jointly addresses variable selection (distinguishing signal, redundant, and noise variables) and MNAR missing data modeling. A two-stage strategy—LASSO-penalized ranking followed by BIC-based role assignment—combined with spectral-distance adaptive penalty weights enables efficient inference in high-dimensional settings. Identifiability and asymptotic selection consistency are established theoretically.

AANet: Virtual Screening under Structural Uncertainty via Alignment and Aggregation

To address the unavailability of holo protein structures in real-world drug discovery, this paper proposes AANet—a framework that aligns representations via tri-modal contrastive learning (ligand–holo pocket–detected cavity) and aggregates multiple candidate binding sites through cross-attention. AANet substantially outperforms SOTA methods in blind screening on apo/predicted protein structures (EF1% on DUD-E: 11.75 → 37.19).

Amortized Active Generation of Pareto Sets

This paper proposes the A-GPS framework, which learns a conditional generative model over the Pareto set to perform online discrete black-box multi-objective optimization. It employs a non-dominance class probability estimator (CPE) as an implicit substitute for explicit hypervolume computation in PHVI, and achieves amortized posterior preference conditioning via preference direction vectors (without retraining). The approach demonstrates superior sample efficiency on synthetic benchmarks and protein design tasks.

Amortized Sampling with Transferable Normalizing Flows

This work proposes Prose — a 285M-parameter all-atom transferable normalizing flow based on the TarFlow architecture, trained on 21,700 short-peptide MD trajectories (totaling 4.3 ms of simulation time). Prose enables zero-shot uncorrelated proposal sampling for arbitrary short-peptide systems, outperforms MD baselines under equal energy evaluation budgets, and generates samples 4,000× faster than the prior transferable Boltzmann generator (TBG).

Atomic Diffusion Models for Small Molecule Structure Elucidation from NMR Spectra

This paper proposes ChefNMR, the first end-to-end framework based on 3D atomic diffusion models that directly predicts the molecular structure of unknown small molecules (especially complex natural products) from 1D NMR spectra and molecular formulae alone, achieving state-of-the-art performance on both synthetic and experimental datasets.

GraphFLA: Augmenting Biological Fitness Prediction Benchmarks with Landscape Features

GraphFLA is an efficient fitness landscape analysis framework that computes 20 biologically meaningful landscape features (ruggedness / epistasis / navigability / neutrality) across 5,300+ real-world landscapes (ProteinGym / RNAGym / CIS-BP), revealing that model performance is highly dependent on landscape topology—e.g., VenusREM outperforms ProSST on highly navigable landscapes but underperforms it on highly epistatic ones—while processing one million mutants in just 20 seconds (vs. 5 hours for MAGELLAN).

Autoencoding Random Forests

RFAE is the first principled encode-decode framework for random forests. It exploits the positive-definiteness and universality of the RF kernel to derive low-dimensional encodings via diffusion-map spectral decomposition, and decodes back to the original feature space through k-NN regression in leaf-node space. Across 20 tabular datasets, RFAE achieves an average reconstruction rank of 1.80, substantially outperforming TVAE (3.38) and AE (3.27), and is successfully applied to MNIST reconstruction and scRNA-seq batch-effect removal.

BarcodeMamba+: Advancing State-Space Models for Fungal Biodiversity Research

BarcodeMamba+ is an SSM-based foundation model for fungal ITS DNA barcode classification. By adopting a pretrain-then-finetune paradigm to leverage large-scale unlabeled sequences, and incorporating three enhancements—hierarchical label smoothing, inverse square-root weighted loss, and multi-head outputs—it substantially outperforms BLAST, CNN, and Transformer baselines across all taxonomic ranks on three test sets, achieving a top species-level accuracy of 88.9%.

Benchmarking Agentic Systems in Automated Scientific Information Extraction with ChemX

This paper constructs ChemX — a suite of 10 multimodal chemical data extraction benchmark datasets manually annotated and validated by domain experts, spanning nanomaterials and small molecules. It systematically evaluates state-of-the-art agentic systems including ChatGPT Agent, SLM-Matrix, FutureHouse, and nanoMINER, as well as frontier LLMs such as GPT-5 and GPT-5 Thinking. The proposed single-agent method achieves F1=0.61 on the nanozyme dataset through structured document preprocessing (marker-pdf → Markdown → LLM extraction), surpassing all general-purpose multi-agent systems, while revealing systemic challenges in chemical information extraction such as SMILES parsing failures and terminology ambiguity.

Browse all 75 Computational Biology papers →


⚛️ Physics & Scientific Computing (57)

3DID: Direct 3D Inverse Design for Aerodynamics with Physics-Aware Optimization

This paper proposes the 3DID framework, which learns a unified physics-geometry triplane latent representation, performs objective-gradient-guided diffusion sampling, and applies a two-stage topology-preserving refinement strategy to conduct inverse design directly in the full 3D space starting from random noise. On vehicle aerodynamic shape optimization, 3DID reduces simulated drag (Sim-Drag) by 13.6% compared to the best baseline.

A Regularized Newton Method for Nonconvex Optimization with Global and Local Complexity Guarantees

This paper proposes a novel class of regularizers constructed from current and historical gradients, combined with a conjugate gradient method equipped with negative-curvature detection to solve the regularized Newton equation. Within an adaptive framework that requires no prior knowledge of the Hessian Lipschitz constant, the method simultaneously achieves, for the first time, the optimal global iteration complexity of \(O(\epsilon^{-3/2})\) and a quadratic local convergence rate.

A Variational Manifold Embedding Framework for Nonlinear Dimensionality Reduction

This paper proposes a variational manifold embedding framework that formalizes dimensionality reduction as an optimization problem over smooth embedding maps (minimizing the KL divergence between a prior distribution and the pullback of the data distribution), theoretically unifying PCA and nonlinear dimensionality reduction methods, and leverages the calculus of variations (Euler-Lagrange equations) and Noether's theorem to derive interpretable constraints on optimal embeddings.

Adaptive Stochastic Coefficients for Accelerating Diffusion Sampling

By theoretically analyzing the complementary weaknesses of ODE and SDE solvers (ODE solvers accumulate irreducible gradient errors; SDE solvers amplify discretization errors at large step sizes), this paper proposes AdaSDE—a method that introduces a learnable stochastic coefficient \(\gamma_i\) at each denoising step to control noise injection intensity. Optimized via lightweight distillation, AdaSDE achieves state-of-the-art FID of 4.18 on CIFAR-10 and 8.05 on FFHQ at 5 NFE.

AstroCo: Self-Supervised Conformer-Style Transformers for Light-Curve Embeddings

This paper proposes AstroCo, a self-supervised encoder that introduces the Conformer architecture (attention + depthwise separable convolution + gating) for irregular astronomical light curves. On the MACHO dataset, AstroCo reduces reconstruction error by 61–70% compared to Astromer v1/v2 and improves few-shot classification macro-F1 by approximately 7%.

Balanced Conic Rectified Flow

To address the distribution drift induced by the reflow step in k-rectified flow, this paper proposes conic reflow: constructing conic supervisory trajectories from the inverted noise of real images and their Slerp-perturbed neighbors, substantially reducing the number of required fake pairs while achieving superior generation quality and straighter ODE trajectories.

Bayesian Surrogates for Risk-Aware Pre-Assessment of Aging Bridge Portfolios

A Bayesian neural network (BNN)-based surrogate model is proposed to replace expensive nonlinear finite element analysis (NLFEA), enabling rapid, uncertainty-aware structural safety pre-assessment of aging bridge portfolios. In a real-world railway case study, the approach saves approximately $370,000 per bridge.

Collapsing Taylor Mode Automatic Differentiation

This paper proposes a collapsing optimization technique for Taylor mode automatic differentiation. By rewriting the computation graph to propagate derivative summation operations upward, it substantially accelerates the evaluation of PDE operators (e.g., Laplacian, general linear PDE operators), achieving speeds superior to nested backpropagation while retaining the low-memory advantage of forward-mode AD.

DeltaPhi: Physical States Residual Learning for Neural Operators in Data-Limited PDE Solving

This paper proposes DeltaPhi, a framework that forgoes direct learning of the input-to-output mapping for PDEs and instead learns residuals between similar physical states. By exploiting the stability of physical systems as implicit data augmentation, DeltaPhi significantly improves the performance of diverse neural operators under data-scarce regimes.

EddyFormer: Accelerated Neural Simulations of Three-Dimensional Turbulence at Scale

EddyFormer is a Transformer architecture based on the Spectral Element Method (SEM) that decomposes the flow field into two parallel streams — LES (large-scale) and SGS (small-scale) — achieving DNS-level accuracy on 3D turbulence at \(256^3\) resolution with a 30× speedup, while generalizing well to unseen domains 4× larger.

Browse all 57 Physics & Scientific Computing papers →


🌍 Earth Science (6)

A Probabilistic U-Net Approach to Downscaling Climate Simulations

This work presents the first application of a probabilistic U-Net to statistical climate downscaling (16× super-resolution). By sampling from a variational latent space, the model generates ensemble forecasts for uncertainty quantification. The paper systematically compares four training objectives — WMSE, MS-SSIM, WMSE-MS-SSIM, and afCRPS — revealing complementary trade-offs between extreme event capture and fine-scale spatial variability preservation.

Adaptive Online Emulation for Accelerating Complex Physical Simulations

This paper proposes Adaptive Online Emulation (AOE), a framework that dynamically trains an ELM-based neural network surrogate model during physical simulation execution to replace expensive computational components, requiring no offline pretraining. On an exoplanetary atmospheric simulation, AOE achieves an 11.1× speedup (91% time savings) with only ~0.01% accuracy loss.

ControlFusion: A Controllable Image Fusion Framework with Language-Vision Degradation Prompts

This paper proposes ControlFusion, a controllable infrared-visible image fusion framework based on language-vision degradation prompts. It employs a physics-driven degradation imaging model to simulate compound degradations, and uses a prompt-modulated network to perform dynamic restoration and fusion, achieving comprehensive state-of-the-art performance under both real-world and compound degradation scenarios.

Power Ensemble Aggregation for Improved Extreme Event AI Prediction

This paper proposes an adaptive ensemble aggregation method based on the power mean. By applying nonlinear aggregation (power exponent \(p>1\)) to the scores of ensemble members from generative weather prediction models, the method significantly improves classification performance for extreme high-temperature events, with greater gains at higher quantile thresholds.
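A minimal sketch of power-mean aggregation, assuming non-negative per-member scores. The function name `power_mean` and the score values are illustrative and not taken from the paper; the adaptive choice of \(p\) is not shown.

```python
def power_mean(scores, p):
    """Power (generalized) mean of non-negative scores for an exponent p > 0."""
    if p <= 0:
        raise ValueError("this sketch only handles p > 0")
    if any(s < 0 for s in scores):
        raise ValueError("scores must be non-negative")
    return (sum(s ** p for s in scores) / len(scores)) ** (1.0 / p)


# Hypothetical per-member scores for one candidate extreme-heat event.
member_scores = [0.05, 0.10, 0.10, 0.90]

arithmetic = power_mean(member_scores, p=1)  # plain ensemble average
emphasized = power_mean(member_scores, p=4)  # gives more weight to high scores
```

With \(p=1\) this reduces to the ordinary ensemble mean; larger \(p\) pulls the aggregate toward the highest-scoring members, so a single member that signals an extreme event is not averaged away.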

Predicting Public Health Impacts of Electricity Usage

This paper proposes HealthPredictor, an AI pipeline that maps electricity consumption end-to-end to public health damages (measured in $/MWh), comprising three modules: fuel mix prediction, air quality conversion, and health impact assessment. Health-driven optimization significantly reduces health impact prediction error compared to fuel-mix-driven baselines, and achieves a 24–42% reduction in health damages in an EV charging scheduling case study.

Reasoning With a Star: A Heliophysics Dataset and Benchmark for Agentic Scientific Reasoning

This paper introduces Reasoning With a Star (RWS), a 158-question scientific reasoning benchmark derived from NASA heliophysics summer school problem sets, spanning three answer types (numeric/symbolic/textual). Paired with a unit-aware grader, it evaluates four multi-agent coordination paradigms (HMAW/PACE/PHASE/SCHEMA) and finds that no single paradigm dominates across all tasks — the systems-engineering-inspired SCHEMA achieves the strongest performance on tasks requiring rigorous constraint validation.


📡 Signal & Communications (5)

Angular Steering: Behavior Control via Rotation in Activation Space

This paper proposes Angular Steering, which unifies LLM activation steering as rotation operations within a fixed 2D subspace — providing a continuous, fine-grained, norm-preserving behavior control knob spanning 0°–360° via rotation angle. The framework subsumes activation addition and directional ablation as special cases of rotation, and demonstrates robust behavior control on Llama 3 / Qwen 2.5 / Gemma 2 (3B–14B).
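A minimal sketch of a norm-preserving rotation inside a fixed 2D subspace, assuming the plane is given by two orthonormal basis vectors. The names `rotate_in_plane`, `b1`, and `b2` are hypothetical, and the paper's procedure for choosing the plane is not shown.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def rotate_in_plane(h, b1, b2, theta_deg):
    """Rotate the part of h that lies in span{b1, b2} by theta_deg degrees.

    b1 and b2 are assumed orthonormal. The component of h orthogonal to
    the plane is untouched, so the norm of h is preserved.
    """
    theta = math.radians(theta_deg)
    c1, c2 = dot(h, b1), dot(h, b2)  # coordinates of h inside the plane
    r1 = c1 * math.cos(theta) - c2 * math.sin(theta)
    r2 = c1 * math.sin(theta) + c2 * math.cos(theta)
    # Swap the old in-plane component for the rotated one.
    return [x + (r1 - c1) * u + (r2 - c2) * v for x, u, v in zip(h, b1, b2)]


# Toy 4-dimensional activation; the plane is spanned by axes 0 and 2.
activation = [3.0, 0.0, 4.0, 1.0]
basis_1 = [1.0, 0.0, 0.0, 0.0]
basis_2 = [0.0, 0.0, 1.0, 0.0]
steered = rotate_in_plane(activation, basis_1, basis_2, 90.0)
```

Because only the in-plane coordinates change and the rotation is orthogonal, sweeping the angle from 0° to 360° moves the activation continuously without changing its length.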

Contrastive Consolidation of Top-Down Modulations Achieves Sparsely Supervised Continual Learning

This paper proposes Task-Modulated Contrastive Learning (TMCL), inspired by top-down modulations in the neocortex. TMCL integrates sparse label information (as little as 1% of labels) via affine modulation during continual learning, then consolidates the modulation information into feedforward weights through contrastive learning, surpassing both unsupervised and supervised baselines on class-incremental and transfer learning benchmarks.

Feature-aware Modulation for Learning from Temporal Tabular Data

This paper argues that the core challenge in temporal tabular learning is not simply "adding a time embedding," but rather that the semantics of many features drift over time. To address this, the paper proposes feature-aware modulation, which uses temporal context to dynamically generate per-feature shift, scale, and nonlinear shape parameters, re-aligning feature semantics across time. With this approach, deep models consistently outperform GBDT on average rank on the TabReD benchmark for the first time.

Masked Symbol Modeling for Demodulation of Oversampled Baseband Communication Signals

This paper proposes Masked Symbol Modeling (MSM), transplanting BERT's masked prediction paradigm to the communication physical layer. It reframes inter-symbol contributions from pulse shaping as "contextual information," training a Transformer on clean oversampled baseband signals to learn waveform structure, and leveraging the learned context at inference time to recover symbols corrupted by impulsive noise.

Memory-Integrated Reconfigurable Adapters: A Unified Framework for Settings with Multiple Tasks

MIRA embeds Hopfield-style associative memory modules into each layer of a ViT, storing and retrieving LoRA adapter weights as key-value pairs. Through a two-stage training procedure (Adaptation + Consolidation), a single unified architecture simultaneously addresses Domain Generalization (DG), Class-Incremental Learning (CIL), and Domain-Incremental Learning (DIL), achieving substantial improvements over task-specific methods across multiple benchmarks.


👥 Social Computing (20)

Active Slice Discovery in Large Language Models

This paper proposes the Active Slice Discovery problem framework, integrating active learning into LLM error slice discovery. By combining uncertainty sampling with LLM internal representations (raw embeddings or SAE features), the method achieves slice detection accuracy comparable to fully supervised settings using only 2–10% of labeled data.

Any Large Language Model Can Be a Reliable Judge: Debiasing with a Reasoning-based Bias Detector

This paper proposes the Reasoning-based Bias Detector (RBD), a plug-and-play debiasing module for LLM judges. By externally detecting four types of evaluation bias (verbosity, position, bandwagon, and sentiment), RBD generates structured feedback with reasoning chains to guide judges toward self-correction. RBD-8B achieves an average accuracy improvement of 18.5% and consistency improvement of 10.9% across 8 LLM judges.

Auto-Search and Refinement: An Automated Framework for Gender Bias Mitigation in LLMs

This paper proposes FaIRMaker, a framework that adopts an "auto-search + refinement" paradigm: it first employs gradient-based optimization to identify debiasing trigger tokens (Fairwords), then trains a seq2seq model to transform them into human-readable instructions, effectively mitigating gender bias on both open-source and closed-source LLMs while preserving or even improving task performance.

AVerImaTeC: A Dataset for Automatic Verification of Image-Text Claims with Evidence from the Web

AVerImaTeC introduces the first image-text fact-checking dataset with complete evidence annotation — 1,297 real-world image-text claims, a 5-stage annotation pipeline (extraction → QA reasoning → sufficiency check → iterative refinement → second check), and temporally constrained evidence (to prevent temporal leakage). The baseline system achieves 82% accuracy with ground-truth evidence, but drops to 15–25% under automatic evidence retrieval, revealing the substantial challenges of image-text verification.

Concept-Level Explainability for Auditing & Steering LLM Responses

This paper proposes ConceptX, an LLM explainability method based on concept-level (rather than token-level) Shapley attribution. It measures the influence of input concepts on outputs via semantic similarity rather than token overlap, and can be used to audit bias and steer LLM outputs through prompt editing — reducing attack success rate from 0.463 to 0.242 in jailbreak defense.

DATE-LM: Benchmarking Data Attribution Evaluation for Large Language Models

DATE-LM introduces the first unified benchmark for evaluating data attribution methods in LLMs. Through three application-driven tasks—training data selection, toxicity filtering, and factual attribution—it systematically compares multiple attribution approaches, finding that no single method dominates across all tasks and that simple baselines can match attribution methods in certain settings.

DeepTraverse: A Depth-First Search Inspired Network for Algorithmic Visual Understanding

Inspired by the depth-first search (DFS) algorithm, DeepTraverse is a visual backbone network that achieves highly competitive image classification performance with very few parameters, through a parameter-sharing recursive exploration module and an adaptive channel recalibration module.

Don't Let It Fade: Preserving Edits in Diffusion Language Models via Token Timestep Allocation

This paper proposes Token Timestep Allocation (TTA-Diffusion), which assigns independent denoising timesteps to each token to address the update-forgetting problem caused by classifier guidance in diffusion language models, achieving substantial improvements in both stability and efficiency for controllable text generation.

GraphKeeper: Graph Domain-Incremental Learning via Knowledge Disentanglement and Preservation

GraphKeeper is proposed to address catastrophic forgetting in Graph Domain-Incremental Learning (Graph Domain-IL) through three components: domain-specific LoRA parameter isolation, intra/inter-domain disentanglement, and ridge regression-based deviation-free knowledge preservation. It outperforms the second-best method by 6.5%–16.6% and can be seamlessly integrated with graph foundation models.

IF-GUIDE: Influence Function-Guided Detoxification of LLMs

This paper proposes IF-Guide, which leverages influence functions to identify toxic content in training data at token granularity and uses a penalty-based training objective to actively discourage the model from learning toxic behaviors during pre-training or fine-tuning, substantially outperforming passive alignment methods such as DPO and RAD.

Browse all 20 Social Computing papers →


🛡️ AI Safety (73)

A Set of Generalized Components to Achieve Effective Poison-only Clean-label Backdoor Attacks with Collaborative Sample Selection and Triggers

This paper proposes a set of generalized components (Component A/B/C) that establish a bidirectional collaborative relationship between sample selection and trigger design, simultaneously improving the attack success rate (ASR) and stealthiness of Poison-only Clean-label Backdoor Attacks (PCBA), with strong generalizability across multiple attack types.

Beyond Last-Click: An Optimal Mechanism for Ad Attribution

This paper analyzes the strategic manipulation vulnerabilities of the Last-Click attribution mechanism from a game-theoretic perspective—platforms can obtain unfair attribution credit by falsifying timestamps—and proposes the Peer-Validated Mechanism (PVM), in which each platform's credit depends solely on the reports of other platforms (analogous to peer review). The paper theoretically proves that PVM is dominant strategy incentive compatible (DSIC) and optimal under homogeneous settings, improving attribution accuracy from 34% to 75% in the two-platform case.

Boosting Adversarial Transferability with Spatial Adversarial Alignment

This paper proposes Spatial Adversarial Alignment (SAA), which fine-tunes a surrogate model via two modules—spatial-aware alignment and adversarial-aware alignment—to align its features with those of a witness model, achieving significant improvements in cross-architecture adversarial transferability (CNN→ViT transfer rate improved by 25–39%).

Brain-like Variational Inference

This paper proposes the FOND framework (Free energy Online Natural-gradient Dynamics), which derives spiking neural network inference dynamics from first principles via free energy minimization, and implements iPVAE (iterative Poisson VAE). iPVAE outperforms standard VAEs and predictive coding models in the reconstruction–sparsity trade-off, biological plausibility, and OOD generalization.

Bridging Symmetry and Robustness: On the Role of Equivariance in Enhancing Adversarial Robustness

By embedding rotation-equivariant (P4 group) and scale-equivariant convolutional layers into CNNs, this work proposes two symmetry-aware architectures — Parallel and Cascaded — that significantly improve adversarial robustness without adversarial training. Grounded in the CLEVER framework, it theoretically demonstrates that equivariant architectures compress the hypothesis space, regularize gradients, and tighten certified robustness bounds.

Causally Reliable Concept Bottleneck Models

This paper proposes C2BM (Causally reliable Concept Bottleneck Models), which organizes the concept bottleneck as a causal graph structure. By combining observational data with background knowledge, C2BM automatically learns causal relationships, achieving significantly improved causal reliability, intervention responsiveness, and fairness while maintaining classification accuracy.

Cost Efficient Fairness Audit Under Partial Feedback

Under the partial feedback setting, this paper proposes a fairness auditing framework with a novel cost model, delivering near-optimal audit algorithms for both black-box and mixture model scenarios, reducing audit cost by approximately 50% compared to natural baselines.

CTRL-ALT-DECEIT: Sabotage Evaluations for Automated AI R&D

This work extends MLE-Bench to construct 20 code-sabotage tasks and sandbagging evaluations. It finds that frontier AI agents can successfully plant backdoors and carry out other forms of sabotage while completing normal ML engineering tasks, and in some cases evade detection by LM monitors.

Deceptron: Learned Local Inverses for Fast and Stable Physics Inversion

This paper proposes the Deceptron bidirectional module, which learns a local inverse of a differentiable forward surrogate and introduces a Jacobian Composition Penalty (JCP). By mapping output-space residuals back to the input space, the method achieves Gauss-Newton-like preconditioned gradient updates for physics inversion, dramatically reducing iteration counts (approximately 20× speedup on Heat-1D).

DESIGN: Encrypted GNN Inference via Server-Side Input Graph Pruning

This paper proposes DESIGN, a framework that accelerates FHE-based GNN inference by approximately \(2\times\) over the SEAL baseline through two-stage server-side optimization—input graph pruning and adaptive polynomial activation degree allocation—while maintaining competitive accuracy.

Browse all 73 AI Safety papers →


📂 Others (118)

A Differentiable Model of Supply-Chain Shocks

This paper presents a JAX-based differentiable Agent-Based Model (ABM) of supply chains (~1,000 firms) that combines GPU parallelization and automatic differentiation to achieve Bayesian parameter calibration three orders of magnitude faster than conventional ABC, paving the way for shock-propagation modeling in global supply-chain networks.

A Sustainable AI Economy Needs Data Deals That Work for Generators

This paper introduces the concept of the "Economic Data Processing Inequality" — in the ML value chain, data progresses from raw form to model weights to synthetic outputs, with each step refining technical signals while systematically stripping economic rights from data generators. The authors empirically validate this phenomenon through analysis of 73 publicly available data transactions, diagnose three structural deficiencies (missing provenance, asymmetric bargaining power, non-dynamic pricing), and propose the EDVEX framework as a solution blueprint.

A Theoretical Framework for Grokking: Interpolation followed by Riemannian Norm Minimisation

This paper gives a rigorous account of the mechanism behind grokking from a purely optimization-theoretic perspective. Gradient flow with small weight decay exhibits two-phase dynamics in the \(\lambda\to 0\) limit: rapid convergence to the critical manifold \(\mathcal{M}\) of the training loss, followed by a Riemannian gradient flow along the manifold minimizing the \(\ell_2\) norm at timescale \(t\approx 1/\lambda\), thereby inducing delayed generalization.

A Unified Framework for Provably Efficient Algorithms to Estimate Shapley Values

This paper proposes a unified framework that subsumes KernelSHAP, LeverageSHAP, and related Shapley value estimators under a randomized sketching perspective, provides the first non-asymptotic theoretical guarantees for KernelSHAP, and extends these methods to high-dimensional datasets such as CIFAR-10 via algorithmic improvements including Poisson approximation.
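As a reference point for the quantity these estimators approximate, here is a minimal sketch that computes exact Shapley values by enumerating all player orderings. It is not KernelSHAP or LeverageSHAP; the function `exact_shapley` and the toy game are illustrative assumptions, and the factorial cost is the reason sampling-based estimators are needed in practice.

```python
from itertools import permutations


def exact_shapley(n_players, value):
    """Exact Shapley values via enumeration of all orderings (O(n!) cost).

    value: callable mapping a frozenset of player indices to a number.
    """
    phi = [0.0] * n_players
    orderings = list(permutations(range(n_players)))
    for order in orderings:
        coalition = frozenset()
        for player in order:
            with_player = coalition | {player}
            # Marginal contribution of `player` when joining `coalition`.
            phi[player] += value(with_player) - value(coalition)
            coalition = with_player
    return [total / len(orderings) for total in phi]


# Toy game: the payoff is 1 only when players 0 and 1 are both present.
def toy_value(coalition):
    return 1.0 if {0, 1} <= coalition else 0.0


shapley_values = exact_shapley(3, toy_value)
```

In this toy game players 0 and 1 split the payoff equally and player 2 receives nothing, and the values sum to the payoff of the full coalition.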

Active Measurement: Efficient Estimation at Scale

This paper proposes the Active Measurement framework, which uses AI model predictions as an importance sampling proposal distribution and achieves unbiased estimation of scientific aggregate quantities through iterative human annotation and model updates, complemented by a novel combination weighting scheme and a conditional variance estimator for constructing reliable confidence intervals.
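The core idea of using model predictions as an importance sampling proposal can be sketched as follows. This is a simplified, single-round illustration: the iterative model updates, the combination weighting scheme, and the variance estimator described above are not shown, and all names and numbers are hypothetical.

```python
import random


def importance_sampling_total(true_value, proposal, n_samples, rng):
    """Estimate sum_i true_value(i) from units sampled with probability proposal[i].

    proposal: probabilities that sum to 1 and are positive wherever the
              true value is nonzero (needed for unbiasedness).
    true_value: callable standing in for a human annotation of unit i.
    """
    indices = rng.choices(range(len(proposal)), weights=proposal, k=n_samples)
    # Each draw is reweighted by 1 / q_i, which makes the average unbiased.
    return sum(true_value(i) / proposal[i] for i in indices) / n_samples


# Hypothetical model predictions and ground-truth counts for four units.
model_counts = [2.0, 5.0, 1.0, 12.0]
true_counts = [3.0, 4.0, 1.0, 12.0]

proposal = [c / sum(model_counts) for c in model_counts]
estimate = importance_sampling_total(
    lambda i: true_counts[i], proposal, n_samples=200, rng=random.Random(0)
)
```

The better the predictions track the true values, the lower the variance: if the proposal were exactly proportional to the truth, every sampled term would equal the true total.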

Addressing Mark Imbalance in Integration-free Neural Marked Temporal Point Processes

This paper is the first to systematically reveal the severe impact of mark distribution imbalance on prediction performance in marked temporal point processes (MTPP). It proposes a mark-first-then-time prediction strategy, designs a thresholding method to calibrate the predicted probabilities of rare marks, and develops the integration-free IFNMTPP model to efficiently support mark probability estimation and time sampling.

Adjoint Schrödinger Bridge Sampler

This paper proposes the Adjoint Schrödinger Bridge Sampler (ASBS), which reinterprets the Schrödinger Bridge problem as a stochastic optimal control (SOC) problem. This eliminates the memoryless condition required by prior diffusion samplers, supports arbitrary source distributions (e.g., Gaussian, harmonic priors), and employs a scalable matching objective without importance weight estimation. ASBS consistently outperforms prior methods on multi-particle energy functions and molecular conformation generation.

Adjusted Count Quantification Learning on Graphs

This paper extends the classical Adjusted Classify & Count (ACC) quantification method to graph-structured data, proposing two techniques — Structural Importance Sampling (SIS) and Neighborhood-aware ACC (N-ACC) — to address structural covariate shift and non-homophilous edges in graph quantification, respectively.
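For context, the classical binary ACC estimator that the paper builds on can be sketched as below. The graph-specific SIS and N-ACC techniques are not shown; the function name and the numbers are illustrative, and the true and false positive rates are assumed to have been estimated on held-out labeled data.

```python
def adjusted_classify_and_count(predictions, tpr, fpr):
    """Binary ACC: correct the raw positive rate using the classifier's TPR and FPR.

    predictions: iterable of 0/1 classifier outputs on the target set.
    """
    if tpr <= fpr:
        raise ValueError("ACC requires tpr > fpr")
    predictions = list(predictions)
    cc = sum(predictions) / len(predictions)   # plain Classify & Count estimate
    adjusted = (cc - fpr) / (tpr - fpr)        # invert cc = p * tpr + (1 - p) * fpr
    return min(1.0, max(0.0, adjusted))        # clip to a valid prevalence


# Hypothetical target set: 31 of 100 nodes are predicted positive.
target_predictions = [1] * 31 + [0] * 69
prevalence = adjusted_classify_and_count(target_predictions, tpr=0.8, fpr=0.1)
```

Here the raw Classify & Count estimate is 0.31, while the adjusted estimate recovers a prevalence of about 0.30, which is the value consistent with the assumed error rates.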

Aggregation Hides OOD Generalization Failures from Spurious Correlations

This paper reveals the "aggregation masking" phenomenon in OOD generalization benchmarks: while aggregate evaluation exhibits accuracy-on-the-line (AoTL)—a positive correlation between ID and OOD accuracy—the proposed OODSelect method can identify large, semantically coherent subsets (up to 75%) from the same OOD data on which higher ID accuracy corresponds to lower OOD accuracy (Pearson R as low as −0.92), demonstrating that the harm of spurious correlations is systematically concealed by aggregate evaluation.

Alias-Free ViT: Fractional Shift Invariance via Linear Attention

This paper proposes the Alias-Free Vision Transformer (AFT), which combines anti-aliasing signal processing techniques with shift-equivariant linear cross-covariance attention, achieving near-perfect consistency (~99%) under fractional (sub-pixel) shifts for the first time, with negligible degradation in ImageNet classification accuracy.

Browse all 118 Others papers →