🤖 AAAI2026 Accepted Papers

1381 AAAI2026 paper notes covering 3D Vision (79), Image Generation (79), Medical Imaging (75), Multimodal VLM (75), Model Compression (60), Reinforcement Learning (58), Autonomous Driving (56), AI Safety (45), and 52 other areas. Each note has a TL;DR, motivation, method, experiments, highlights, and limitations: a 5-minute read of the core ideas.


💡 LLM Reasoning (37)

A Reasoning Paradigm for Named Entity Recognition

This paper proposes ReasoningNER, which reframes named entity recognition from "implicit pattern matching" to an "explicit reasoning" paradigm. Through a three-stage pipeline (CoT data construction → CoT fine-tuning → GRPO reinforcement enhancement), the model first reasons and then extracts entities. Under zero-shot settings, ReasoningNER surpasses GPT-4 by 12.3 F1 points, and the 8B model achieves an average F1 of 72.4 on CrossNER.

ActiShade: Activating Overshadowed Knowledge to Guide Multi-Hop Reasoning in Large Language Models

This paper proposes ActiShade, a framework that detects "overshadowed" key phrases in LLM multi-hop reasoning via Gaussian noise perturbation, retrieves supplementary documents using a customized contrastive learning retriever, and iteratively reformulates queries to mitigate error accumulation caused by knowledge overshadowing. ActiShade significantly outperforms DRAGIN and other state-of-the-art methods on HotpotQA, 2WikiMQA, and MuSiQue.

Answering the Unanswerable Is to Err Knowingly: Analyzing and Mitigating Abstention Failures in Large Reasoning Models

This paper systematically analyzes abstention failures in Large Reasoning Models (LRMs) when confronted with unanswerable math problems. It finds that LRMs possess sufficient internal cognitive capacity to recognize unsolvability (linear probe classification accuracy >80%), yet their external behavior remains biased toward forced answering. A two-stage approach combining cognitive monitoring and inference-time intervention is proposed, improving abstention rates from 16–54% to 60–92% without degrading reasoning performance on answerable questions.

ARCHE: A Novel Task to Evaluate LLMs on Latent Reasoning Chain Extraction

This paper proposes the Latent Reasoning Chain Extraction (ARCHE) task, which requires LLMs to decompose scientific paper argumentation into Reasoning Logic Trees (RLTs) grounded in Peirce's three reasoning paradigms. Through two complementary metrics—Entity Coverage (EC) and Reasoning Edge Accuracy (REA)—the study reveals a fundamental trade-off between content completeness and logical correctness across 10 mainstream LLMs.

Beyond ReAct: A Planner-Centric Framework for Complex Tool-Augmented LLM Reasoning

This paper proposes a Planner-centric Plan-Execute framework that transforms complex queries into DAG-based execution plans. Through two-stage SFT+GRPO training of a dedicated Planner model, the approach surpasses reactive methods such as ReAct on ComplexTool-Plan and StableToolBench, achieving higher success rates with fewer inference steps.
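
A DAG-based execution plan of the kind described above can be run by visiting steps in dependency order. The sketch below only illustrates that idea with Python's standard `graphlib`; the plan contents, tool names, and the `run_plan` helper are hypothetical and are not taken from the paper.

```python
from graphlib import TopologicalSorter


def run_plan(plan, tools):
    """Execute a DAG plan: `plan` maps each step to the set of steps it depends on."""
    results = {}
    # static_order() yields each step only after all of its dependencies.
    for step in TopologicalSorter(plan).static_order():
        inputs = [results[dep] for dep in sorted(plan.get(step, ()))]
        results[step] = tools[step](*inputs)
    return results


# Hypothetical plan: two independent lookups feed a final comparison step.
plan = {"price_a": set(), "price_b": set(), "compare": {"price_a", "price_b"}}
tools = {
    "price_a": lambda: 12.0,
    "price_b": lambda: 9.5,
    "compare": lambda a, b: "a" if a < b else "b",
}
results = run_plan(plan, tools)
```

Because the two lookups have no edge between them, a real executor could run them in parallel; this sketch runs them sequentially for simplicity.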

BLM-Guard: Explainable Multimodal Ad Moderation with Chain-of-Thought and Policy-Aligned Rewards

This paper proposes BLM-Guard, an explainable multimodal moderation framework for short-video commercial advertisements. It first establishes structured reasoning capability via rule-driven ICoT data synthesis and an SFT cold start, then applies Self-Adaptive GRPO reinforcement learning (combining rule correctness rewards and a self-adaptive consistency reward, SCA-R) to optimize policy alignment, achieving 91.4% strict accuracy and a reasoning consistency score of 0.845 on a real-world ad benchmark.

Chain-of-Thought Driven Adversarial Scenario Extrapolation for Robust Language Models

This paper proposes ASE (Adversarial Scenario Extrapolation), an inference-time CoT defense framework that enables LLMs to autonomously simulate adversarial scenarios and formulate defensive strategies prior to responding. ASE achieves near-zero attack success rates across four categories of safety threats (jailbreak, toxicity, hallucination, and bias), while reducing direct refusal rates to ≤4%, effectively balancing robustness and user experience.

CMMCoT: Enhancing Complex Multi-Image Comprehension via Multi-Modal Chain-of-Thought and Memory Augmentation

This paper proposes the CMMCoT framework, which constructs interleaved multimodal multi-step reasoning chains (with visual region token supervision) and a test-time retrieval-based memory augmentation module (RIFREM) to enhance slow-thinking reasoning in multi-image scenarios without increasing model parameters. Built on Qwen2.5-VL-7B, the method achieves an average improvement of 1.4 points on multi-image benchmarks.

Deep Hidden Cognition Facilitates Reliable Chain-of-Thought Reasoning

This paper demonstrates that attention head activations in intermediate layers of LLMs implicitly encode truthfulness information about reasoning steps during CoT inference (probing accuracy up to 85%). Based on this finding, confidence predictors are trained to guide beam search in dynamically selecting high-confidence reasoning paths, surpassing Self-Consistency and PRM Guided Search on mathematical, symbolic, and commonsense reasoning tasks.

Dropouts in Confidence: Moral Uncertainty in Human-LLM Alignment

This paper systematically investigates decision-making uncertainty across 32 open-source LLMs in moral dilemma scenarios (trolley problem variants), finding that uncertainty is primarily driven by model architecture rather than moral dimension. Introducing attention dropout at inference time significantly increases mutual information and improves human-LLM moral alignment, suggesting that reducing overconfidence in moral scenarios can enhance consistency with human preferences.

Browse all 37 LLM Reasoning papers →


🦾 LLM Agent (33)

A2Flow: Automating Agentic Workflow Generation via Self-Adaptive Abstraction Operators

This paper proposes A2Flow, a framework that automatically extracts reusable abstract execution operators from expert data via a three-stage pipeline (case generation → functional clustering → deep extraction), replacing manually predefined operators. Combined with an operator memory mechanism that accumulates intermediate outputs to assist node decision-making, A2Flow outperforms AFLOW and other state-of-the-art methods across 8 benchmarks while reducing resource consumption by 37%.

Agent-SAMA: State-Aware Mobile Assistant

This paper proposes Agent-SAMA, which for the first time introduces a finite state machine (FSM) into mobile GUI agents, modeling UI screens as states and user actions as transitions. Four specialized agents collaborate to achieve state-aware task planning, execution verification, and error recovery, improving success rate by up to 12% and recovery rate by 13.8% on cross-app benchmarks.
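
The state-machine view described above (screens as states, actions as transitions) can be sketched in a few lines. This is a minimal illustration only; the `ScreenFSM` class, the screen names, and the actions are hypothetical and do not come from the paper.

```python
class ScreenFSM:
    """Tiny finite state machine: UI screens are states, user actions are transitions."""

    def __init__(self, transitions, start):
        self.transitions = transitions  # {(screen, action): next_screen}
        self.state = start

    def step(self, action):
        """Apply an action; return False and keep the state if it is invalid here."""
        key = (self.state, action)
        if key not in self.transitions:
            return False  # a verifier could trigger error recovery at this point
        self.state = self.transitions[key]
        return True


# Hypothetical screens and actions for a messaging task.
fsm = ScreenFSM(
    {
        ("home", "open_chat"): "chat",
        ("chat", "tap_send"): "sent",
        ("chat", "back"): "home",
    },
    start="home",
)
ok_first = fsm.step("tap_send")    # invalid on the home screen
ok_second = fsm.step("open_chat")  # home -> chat
ok_third = fsm.step("tap_send")    # chat -> sent
```

Rejecting an action that has no transition from the current screen is what makes execution verification and recovery possible in this kind of model.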

AgentSwift: Efficient LLM Agent Design via Value-guided Hierarchical Search

This paper proposes AgentSwift, a framework that automatically discovers high-performance LLM agent designs through a hierarchical search space (jointly optimizing agentic workflows and functional components), a lightweight value model for predicting agent performance, and an uncertainty-guided MCTS search strategy, achieving an average improvement of 8.34% across 7 benchmarks.

AMS-IO-Bench and AMS-IO-Agent: Benchmarking and Structured Reasoning for Analog and Mixed-Signal Integrated Circuit Input/Output Design

This paper proposes AMS-IO-Agent, a domain-specific LLM-based agent that transforms natural language design intent into production-ready analog and mixed-signal IC I/O ring designs via a structured Intent Graph and a domain knowledge base. It also introduces AMS-IO-Bench, the first benchmark for AMS I/O ring automation. The agent-generated I/O ring is validated in a 28nm CMOS tape-out and demonstrated to be directly applicable to real chip fabrication.

AutoGLM: Autonomous Foundation Agents for GUIs

AutoGLM builds a GUI foundation agent for web browsers and Android devices on top of ChatGLM. By introducing an intermediate interface design that decouples planning from grounding, and proposing a self-evolving online curriculum reinforcement learning framework, the system achieves a 55.2% success rate on VAB-WebArena-Lite, substantially surpassing GPT-4o's 18.2%.

Automating Complex Document Workflows via Stepwise and Rollback-Enabled Operations

This paper proposes AutoDW, a framework that automates complex document workflows through stepwise planning (generating one API call at a time) combined with adaptive rollback (parameter-level and API-level). On DWBench—a benchmark of 250 sessions and 1,708 instructions—AutoDW achieves 90% instruction-level and 62% session-level completion rates, surpassing the strongest baseline by 40% and 76%, respectively.

AutoTool: Efficient Tool Selection for Large Language Model Agents

This paper proposes AutoTool, a graph-based tool selection framework that exploits tool usage inertia to construct a Tool Inertia Graph (TIG). By leveraging statistical structure, AutoTool bypasses redundant LLM inference for tool selection and parameter filling, reducing inference overhead by up to 30% while maintaining task completion rates.

BayesAgent: Bayesian Agentic Reasoning Under Uncertainty via Verbalized Probabilistic Graphical Modeling

This paper proposes the vPGM framework, which guides LLM agents via natural language to simulate Bayesian reasoning over probabilistic graphical models (PGMs)—discovering latent variables and inferring posterior distributions—and further applies numerical Bayesian calibration with a Dirichlet prior (BayesVPGM), achieving simultaneous improvements in accuracy and confidence calibration across multiple reasoning tasks.

CausalTrace: A Neurosymbolic Causal Analysis Agent for Smart Manufacturing

This paper proposes CausalTrace — a neurosymbolic causal analysis agent integrated into an industrial CoPilot (SmartPilot) that combines data-driven causal discovery with industrial ontologies and knowledge graphs, enabling real-time root cause analysis, counterfactual reasoning, and interpretable decision support.

Co-EPG: A Framework for Co-Evolution of Planning and Grounding in Autonomous GUI Agents

This paper proposes Co-EPG, a framework that decouples a GUI Agent into separate Planning and Grounding models and establishes a positive feedback loop via GRPO co-training and a Confidence-based Dynamic Reward Ensemble Mechanism (C-DREM), enabling both models to co-evolve through self-iteration. Using only benchmark datasets (no external data), Co-EPG achieves state-of-the-art results on Multimodal-Mind2Web (58.4%) and AndroidControl (83.1%).

Browse all 33 LLM Agent papers →


👥 Multi-Agent (26)

A Graph-Theoretical Perspective on Law Design for Multiagent Systems

This paper studies the law design problem in multiagent systems from a graph-theoretical perspective, reducing the minimization of useful laws and gap-free laws to the vertex cover problem on hypergraphs, proving NP-hardness, and providing approximation algorithms.
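
For readers unfamiliar with hypergraph vertex cover, the following is a sketch of a textbook approximation, not necessarily the algorithm used in the paper: take every vertex of each hyperedge that is still uncovered. When every hyperedge has at most d vertices, the selected hyperedges are pairwise disjoint, so the result is at most d times the optimum. The function name and the example hyperedges are made up for illustration.

```python
def approx_vertex_cover(hyperedges):
    """Add all vertices of each still-uncovered hyperedge (d-approximation for edge size <= d)."""
    cover = set()
    for edge in hyperedges:
        if cover.isdisjoint(edge):  # edge not yet hit by the cover
            cover.update(edge)
    return cover


# Made-up hypergraph with hyperedges of size at most 3.
hyperedges = [{"a", "b", "c"}, {"c", "d"}, {"d", "e"}, {"f", "a"}]
cover = approx_vertex_cover(hyperedges)
```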

KDR-Agent: A Multi-Agent LLM Framework for Multi-Domain Low-Resource In-Context NER via Knowledge Retrieval

This paper proposes KDR-Agent, a multi-agent framework in which a central planner coordinates three specialized agents—knowledge retrieval, contextual disambiguation, and reflective error correction—combined with natural language type definitions and entity-level positive/negative contrastive demonstrations. Without any fine-tuning, KDR-Agent comprehensively outperforms zero-shot and few-shot baselines across 10 low-resource NER datasets spanning 5 domains (BC5CDR F1=82.47, WNUT-17 F1=80.78 on GPT-4o).

Adaptive Theory of Mind for LLM-based Multi-Agent Coordination

This paper proposes the Adaptive Theory of Mind agent (A-ToM), which formulates ToM order alignment as an online expert advice problem. By employing Follow-the-Leader (FTL) or Hedge algorithms to estimate a partner's ToM order in real time and dynamically adjust its own reasoning depth, A-ToM achieves robust zero-shot multi-agent coordination across four task categories, including repeated matrix games, grid navigation, and Overcooked.
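
The Hedge algorithm mentioned above is a standard multiplicative-weights method for learning from expert advice. The sketch below shows one way it could track a partner's ToM order, assuming each candidate order acts as an "expert" that incurs loss 1 when its prediction of the partner's action is wrong. The learning rate, the loss values, and the setup are illustrative assumptions, not the paper's configuration.

```python
import math


def hedge_update(weights, losses, eta=0.5):
    """One Hedge step: scale each weight by exp(-eta * loss), then renormalize."""
    scaled = [w * math.exp(-eta * loss) for w, loss in zip(weights, losses)]
    total = sum(scaled)
    return [s / total for s in scaled]


# Hypothetical setup: experts are candidate partner ToM orders 0, 1, 2.
# Each row holds per-expert losses for one round (1 = wrong prediction).
weights = [1 / 3, 1 / 3, 1 / 3]
observed_losses = [
    [1, 0, 1],
    [1, 0, 0],
    [0, 0, 1],
    [1, 0, 1],
]
for losses in observed_losses:
    weights = hedge_update(weights, losses)

# The order with the largest weight is the current estimate.
estimated_order = max(range(len(weights)), key=weights.__getitem__)
```

Follow-the-Leader would instead pick the order with the lowest cumulative loss outright; Hedge keeps a distribution, which makes it less sensitive to a single misleading round.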

AgentODRL: A Large Language Model-based Multi-agent System for ODRL Generation

This paper proposes AgentODRL, an LLM-based multi-agent system built on an Orchestrator-Workers architecture that converts natural language data usage rules into high-quality ODRL policies through task decomposition, a syntax validation loop, and a LoRA-driven semantic reflection mechanism.

ARCANE: A Multi-Agent Framework for Interpretable and Configurable Alignment

This paper proposes ARCANE, a framework that formulates alignment as a multi-agent collaboration problem. A manager agent learns to generate natural-language rubrics (weighted verifiable criterion sets) through dialogue with stakeholders, which serve as interpretable proxy reward functions for a worker agent. Via two-stage SFT+GSPO training, the framework enables test-time configurable alignment, improving mean return from 0.58 to 0.74 (N=8) on the GDPVal benchmark with the GSPO variant.

Assemble Your Crew: Automatic Multi-agent Communication Topology Design via Autoregressive Graph Generation

This paper proposes ARG-Designer, which reformulates multi-agent system topology design as a conditional autoregressive graph generation task. Rather than pruning from template graphs, the model incrementally generates agent nodes and communication edges from scratch. ARG-Designer achieves state-of-the-art performance across 6 benchmarks (average 92.78%), reduces token consumption by approximately 50% compared to G-Designer, and supports role expansion without retraining.

BAMAS: Structuring Budget-Aware Multi-Agent Systems

This paper proposes the BAMAS framework, which employs Integer Linear Programming (ILP) to select the optimal LLM combination under budget constraints, and uses a reinforcement learning policy to choose the best collaboration topology (Linear/Star/Feedback/Planner-Driven). BAMAS achieves accuracy comparable to state-of-the-art multi-agent systems on GSM8K, MBPP, and MATH, while reducing costs by up to 86%.

Beyond Detection: Exploring Evidence-based Multi-Agent Debate for Misinformation Intervention and Persuasion

This paper proposes ED2D, a framework that integrates an evidence retrieval module into a multi-agent debate (MAD) system to enhance misinformation detection accuracy. Through controlled human experiments, it provides the first comparative evaluation of AI-generated debate transcripts versus expert human fact-checks in terms of persuasiveness and belief correction, revealing a double-edged-sword effect: the AI debate system achieves expert-level persuasiveness when correct, but may amplify misinformation when wrong.

COACH: Collaborative Agents for Contextual Highlighting -- A Multi-Agent Framework for Sports Video Analysis

This paper proposes COACH — a reconfigurable multi-agent framework built on a shared backbone model — that achieves role specialization via intent-driven strategy orchestration and structured CoT fine-tuning, significantly outperforming generalist models such as Gemini 2.5 Pro on both QA and summarization tasks in badminton video analysis.

Conversational Learning Diagnosis via Reasoning Multi-Turn Interactive Learning

This paper proposes ParLD (Preview-Analyze-Reason framework), which leverages multi-agent collaboration to achieve fine-grained, turn-level diagnosis of students' cognitive states during conversational learning. ParLD outperforms traditional knowledge tracing methods by 10% on performance prediction and substantially improves tutoring outcomes.

Browse all 26 Multi-Agent papers →


⚖️ Alignment & RLHF (17)

Align to Structure: Aligning Large Language Models with Structural Information

This paper proposes Structural Alignment, a method that integrates linguistic discourse structure frameworks—surface-level text structure scoring and an RST-based discourse motif classifier—into PPO reinforcement learning training, and introduces a discourse motif-based dense reward mechanism. This enables LLMs to generate more coherent, human-like long-form text, outperforming standard RLHF models on academic essay writing and long document summarization tasks.

AlignTree: Efficient Defense Against LLM Jailbreak Attacks

AlignTree leverages internal LLM activation features — combining linear refusal directions with nonlinear SVM signals — to train a lightweight random forest classifier that efficiently detects jailbreak attacks with negligible computational overhead, achieving state-of-the-art reductions in attack success rate (ASR).

AMaPO: Adaptive Margin-attached Preference Optimization for Language Model Alignment

This paper proposes AMaPO, an algorithm that dynamically modulates gradient magnitudes via instance-level adaptive margins (combining Z-normalization and exponential scaling) to address the core overfitting-underfitting dilemma in offline preference optimization methods such as DPO, thereby substantially improving ranking accuracy and downstream alignment performance.

BiasJailbreak: Analyzing Ethical Biases and Jailbreak Vulnerabilities in Large Language Models

This paper reveals that ethical biases introduced by LLM safety alignment can be reverse-exploited as jailbreak attack vectors — marginalized-group keywords yield jailbreak success rates up to 20% higher than privileged-group keywords — and proposes BiasDefense, a lightweight prompt-based defense method.

DeCoRL: Decoupling Reasoning Chains via Parallel Sub-Step Generation and Cascaded Reinforcement for Interpretable and Scalable RLHF

DeCoRL transforms CoT reasoning from monolithic sequential processing into an "orchestra-style" modular parallel collaboration — nine specialized sub-models (parsing / semantic / entity / fact-checking / style / quality / computation / verification / integration) generate reasoning sub-steps in parallel, coordinated via dual reward attribution (local quality + contribution) and cascaded DRPO optimization, achieving 80.8% on RM-Bench (surpassing all baselines), a 3.8× inference speedup, and a 22.7% improvement in interpretability.

Differentiated Directional Intervention: A Framework for Evading LLM Safety Alignment

This work deconstructs the internal representations of LLM safety alignment from the conventional "single refusal direction" into two functionally independent directions — a harm detection direction and a refusal execution direction — and proposes the DBDI framework, which applies adaptive projection elimination and direct steering to intervene on each direction separately, achieving a 97.88% attack success rate (ASR) on Llama-2.

EASE: Practical and Efficient Safety Alignment for Small Language Models

This paper proposes EASE, a safety alignment framework for edge-deployed small language models (SLMs), which addresses the tension between "shallow refusal being insufficiently robust" and "deep reasoning being prohibitively expensive" via a two-stage design. Stage one distills safety reasoning capabilities from a large reasoning model into the SLM; stage two applies selective reasoning activation, enabling reasoning only for adversarial queries in vulnerable semantic regions while responding directly to benign queries. EASE reduces jailbreak attack success rate by 17% compared to shallow alignment, while cutting reasoning overhead by 90% compared to full-reasoning alignment.

Enhancing Uncertainty Estimation in LLMs with Expectation of Aggregated Internal Belief

This paper proposes EAGLE, a method that estimates uncertainty by aggregating logits from multiple intermediate hidden layers of an LLM and computing the expectation of the resulting confidence distribution. EAGLE requires no additional trainable parameters and reduces ECE from 12.6% to 3.2% while improving AUROC from 59.0% to 61.6% across multiple datasets and models.

Exploring the Effects of Alignment on Numerical Bias in Large Language Models

This paper systematically demonstrates that the LLM alignment process (instruction tuning + preference tuning) is the root cause of numerical bias in LLM evaluators, and validates that score range adjustment is the most effective mitigation strategy.

GRAM-R²: Self-Training Generative Foundation Reward Models for Reward Reasoning

This paper proposes GRAM-R², a generative foundation reward model that elicits reward reasoning capabilities on unlabeled data via self-training. The model simultaneously produces preference labels and reasoning rationales, consistently outperforming both discriminative and generative baselines across multiple downstream tasks including response ranking, task adaptation, and RLHF.

Browse all 17 Alignment & RLHF papers →


🔒 LLM Safety (41)

AgentSense: Virtual Sensor Data Generation Using LLM Agents in Simulated Home Environments

LLM-driven embodied agents are instantiated to "live" in simulated smart home environments, generating virtual ambient sensor data for pre-training HAR models, which yields significant gains in activity recognition under low-resource settings.

ALTER: Asymmetric LoRA for Token-Entropy-Guided Unlearning of LLMs

This paper proposes ALTER, a framework that combines an asymmetric LoRA architecture with token-level Tsallis entropy guidance to achieve precise unlearning of target knowledge in LLMs. A parameter isolation mechanism is employed to preserve the model's general capabilities, achieving state-of-the-art performance on three benchmarks: TOFU, WMDP, and MUSE.

An LLM-Based Simulation Framework for Embodied Conversational Agents in Psychological Counseling

This paper proposes the ECAs framework, which grounds psychological counseling simulation in established theories such as Cognitive Behavioral Therapy (CBT). By leveraging LLMs to expand real counseling cases into embodied cognitive memory spaces, the framework simulates the complete cognitive processes of clients in counseling sessions and generates high-fidelity dialogue data. ECAs significantly outperforms baselines in both expert and automated evaluations.

Anti-adversarial Learning: Desensitizing Prompts for Large Language Models

This paper proposes PromptObfus, which adopts an "anti-adversarial learning" paradigm to replace sensitive tokens in user prompts with semantically distinct yet task-preserving alternatives. The approach eliminates explicit privacy leakage entirely and reduces implicit privacy inference attack success rates by 62.70%, without degrading the task performance of remote LLMs.

Attention Retention for Continual Learning with Vision Transformers

This paper proposes ARCL-ViT, a framework that prevents attention drift in Vision Transformers during continual learning via a two-step strategy of attention mask generation and gradient masking. It achieves state-of-the-art results on ImageNet-R and CIFAR-100, demonstrating that preserving attention patterns is key to mitigating catastrophic forgetting.

AUVIC: Adversarial Unlearning of Visual Concepts for Multi-modal Large Language Models

This paper proposes AUVIC, a framework that combines an adversarial perturbation generator with a dynamic anchor preservation mechanism to precisely unlearn target visual concepts (e.g., specific faces) in MLLMs, while avoiding collateral forgetting of semantically similar concepts. The paper also introduces VCUBench, the first evaluation benchmark for visual concept unlearning in group-scene scenarios.

BadThink: Triggered Overthinking Attacks on Chain-of-Thought Reasoning in Large Language Models

This paper proposes BadThink — the first training-time backdoor attack targeting CoT reasoning efficiency. By iteratively optimizing verbose reasoning templates via an LLM, it constructs poisoned data that causes the victim model, upon trigger activation, to generate reasoning chains inflated by over 17× (on MATH-500), while preserving final answer correctness and maintaining strong stealthiness.

Beyond Superficial Forgetting: Thorough Unlearning through Knowledge Density Estimation and Block Re-insertion

This paper proposes the KUnBR framework, which employs gradient-guided knowledge density estimation to localize layers enriched with harmful knowledge, and adopts a block re-insertion strategy to bypass the gradient-masking effect of cover layers, achieving deep unlearning of harmful knowledge in LLMs rather than mere surface-level suppression.

Can Editing LLMs Inject Harm?

This paper reframes knowledge editing as a novel LLM security threat termed Editing Attack, systematically investigating the feasibility of injecting misinformation and bias into LLMs via three editing methods—ROME, FT, and ICE—and demonstrating that such attacks are both highly effective and remarkably stealthy.

Cross-Modal Unlearning via Influential Neuron Path Editing in Multimodal Large Language Models

This paper proposes MIP-Editor, which localizes influential neuron paths encoding forget-target knowledge in MLLMs via cross-layer gradient integration (text branch) and Fisher integration (visual branch), then edits these neurons using path-based Representation Misdirection Unlearning (RMisU), achieving up to 87.75% forget rate and 54.26% improvement in general knowledge retention on MLLMU-Bench.

Browse all 41 LLM Safety papers →


👻 Hallucination Detection (15)

Beyond Hallucinations: A Composite Score for Measuring Reliability in Open-Source Large Language Models

This paper proposes the Composite Reliability Score (CRS), which unifies calibration, robustness, and uncertainty quantification into a single interpretable metric. A systematic evaluation of 10 open-source LLMs across 5 QA datasets reveals that Mistral-8x22B achieves the highest overall reliability (CRS=0.81), and that model size does not directly determine reliability.

Bridging Day and Night: Target-Class Hallucination Suppression in Unpaired Image Translation

This paper is the first to systematically address the "target-class hallucination" problem in unpaired day-to-night image translation. By combining a dual-head discriminator (style head + SAM2 pseudo-label segmentation head) for hallucination detection and class-prototype contrastive learning for suppression, the method improves mAP from 15.08 to 17.40 (+15.5%) on BDD100K day-to-night domain adaptation detection, with traffic light AP improving by 31.7%.

Causally-Grounded Dual-Path Attention Intervention for Object Hallucination Mitigation in LVLMs

This paper proposes Owl, a framework that models visual and textual attention as mediating variables within a structural causal model, introduces the VTACR metric to quantify cross-modal attention imbalance, and designs VTACR-guided adaptive attention modulation combined with a dual-path contrastive decoding strategy, achieving state-of-the-art hallucination mitigation on POPE and CHAIR benchmarks.

Does Less Hallucination Mean Less Creativity? An Empirical Investigation in LLMs

This paper systematically investigates how three hallucination mitigation methods (CoVe, DoLa, RAG) affect LLM creativity, finding that they exert diametrically opposite effects on divergent creativity—CoVe enhances it, DoLa suppresses it, and RAG has no significant impact—while convergent creativity remains largely unaffected. These patterns hold consistently across model families and parameter scales.

ESG-Bench: Benchmarking Long-Context ESG Reports for Hallucination Mitigation

This paper constructs ESG-Bench — 270 manually annotated QA pairs from 94 real ESG reports (2020–2024) — and proposes a three-stage hallucination mitigation pipeline: SFT (with grounded answers + "Not Provided" abstention labels) → CoT Prompting (2/4-step prompt templates) → CoT Fine-tuning (with human-annotated reasoning chains). The 4-step CoT fine-tuned Llama-3 achieves 92.52% with-answer (WA) accuracy and 99.37% without-answer (WoA) accuracy (balanced 96%), with generalization gains on HaluEval and BioASQ.

Ground What You See: Hallucination-Resistant MLLMs via Caption Feedback, Diversity-Aware Sampling, and Conflict Regularization

This paper identifies three root causes of hallucination in RL-based MLLM training—visual misinterpretation, limited exploration diversity, and sample conflict—and addresses each with Caption Reward, reward-variance-guided sample selection, and NTK-similarity-based InfoNCE regularization, achieving significant hallucination reduction across multiple benchmarks.

Hallucinate Less by Thinking More: Aspect-Based Causal Abstention for Large Language Models

This paper proposes ABCA (Aspect-Based Causal Abstention), a pre-generation abstention framework that employs dual-agent debate to identify "aspect variables" (e.g., discipline, legal context, temporal frame) for activating distinct knowledge branches within LLMs. It applies the AIPW doubly robust estimator to compute causal effects and uses Centroid Angular Deviation (CAD) to detect knowledge conflicts (Type-1) or knowledge insufficiency (Type-2), achieving 91.4% accuracy on TruthfulQA and 96.4% unanswerable question identification rate—far surpassing the baseline of 44%.

Hallucination Stations: On Some Basic Limitations of Transformer-Based Language Models

This paper employs computational complexity theory to demonstrate that the per-step inference complexity of Transformer-based LLMs is \(O(N^2 \cdot d)\). Grounded in the Hartmanis–Stearns Time Hierarchy Theorem, it proves that any computational task exceeding this complexity bound—such as \(O(n^3)\) matrix multiplication, \(O(n^k)\) token enumeration, or TSP verification—necessarily causes hallucination. Furthermore, LLM agents are shown to be incapable of verifying the correctness of such tasks.

InEx: Hallucination Mitigation via Introspection and Cross-Modal Multi-Agent Collaboration

This paper proposes InEx, a framework that iteratively verifies and corrects MLLM outputs via internal introspective reasoning (TVER-driven uncertainty-aware visual augmentation) and external cross-modal multi-agent collaboration (textual self-reflection + image editing verification + visual self-reflection), achieving an 8.9% improvement on POPE and consistently outperforming OPERA/VCD/ICD across multiple hallucination and general benchmarks.

Listen Like a Teacher: Mitigating Whisper Hallucinations using Adaptive Layer Attention and Knowledge Distillation

A two-stage framework is proposed: Adaptive Layer Attention (ALA) fuses multi-layer representations from the Whisper encoder to enhance noise robustness, while Multi-Objective Knowledge Distillation (MOKD) aligns the semantic and attention distributions of a clean-speech teacher with a noisy-speech student — achieving significant reductions in hallucination rate and WER on multilingual noisy ASR benchmarks.

Browse all 15 Hallucination Detection papers →


📊 LLM Evaluation (16)

BCWildfire: A Long-term Multi-factor Dataset and Deep Learning Benchmark for Boreal Wildfire Risk Prediction

This paper introduces BCWildfire, a multimodal wildfire risk prediction dataset covering 240 million hectares of British Columbia, Canada over a 25-year span, encompassing 38 driving factors. It conducts a systematic benchmark evaluation of time series forecasting models across four paradigms—CNN, Linear, Transformer, and Mamba—revealing the performance ceiling of current models and the key influential factors in wildfire prediction.

Benchmarking LLMs for Political Science: A United Nations Perspective

This paper presents UNBench, the first comprehensive LLM evaluation benchmark for political science grounded in UN Security Council records from 1994 to 2024. It encompasses four interrelated tasks—resolution drafting, voting simulation, adoption prediction, and representative statement generation—to systematically assess LLMs' ability to understand and simulate complex political dynamics.

Beyond Accuracy: A Cognitive Load Framework for Mapping the Capability Boundaries of Tool-use Agents

Drawing on Cognitive Load Theory (CLT) from psychology, this work decomposes the complexity of tool-use tasks into intrinsic load (structural complexity of the solution path) and extraneous load (ambiguity of problem formulation). It constructs ToolLoad-Bench, a benchmark with parametrically adjustable cognitive load, and employs an exponential decay model \(\text{Acc} \approx e^{-(k \cdot CL + b)}\) to precisely characterize the capability boundaries of different agents.
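The exponential decay form above becomes linear after taking logs, since \(-\ln \text{Acc} \approx k \cdot CL + b\), so \(k\) and \(b\) can be estimated with ordinary least squares. The sketch below illustrates that idea only; the function names and the synthetic data points are assumptions for illustration and do not come from the paper or from ToolLoad-Bench.

```python
import math

def predicted_accuracy(cl, k, b):
    """Exponential decay model: Acc ~ exp(-(k * CL + b))."""
    return math.exp(-(k * cl + b))

def fit_decay(loads, accuracies):
    """Least-squares fit of k and b on the linearised form -ln(Acc) = k * CL + b."""
    ys = [-math.log(a) for a in accuracies]
    n = len(loads)
    mean_x = sum(loads) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in loads)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(loads, ys))
    k = sxy / sxx
    b = mean_y - k * mean_x
    return k, b

# Synthetic, illustrative data generated from k = 0.3, b = 0.1 (not paper results).
loads = [1.0, 2.0, 3.0, 4.0]
accs = [predicted_accuracy(cl, 0.3, 0.1) for cl in loads]
k, b = fit_decay(loads, accs)
```

Under this reading, a larger fitted \(k\) means accuracy falls off faster as cognitive load grows, which is one way to compare agents.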

ConInstruct: Evaluating Large Language Models on Conflict Detection and Resolution in Instructions

This paper proposes ConInstruct, a benchmark for evaluating LLMs' ability to detect and resolve conflicting constraints in instructions. Results show that most proprietary models can detect conflicts reasonably well but rarely notify users explicitly, with DeepSeek-R1 and Claude-4.5-Sonnet achieving the best conflict detection performance (F1 of 91.5% and 87.3%, respectively).

DiCaP: Distribution-Calibrated Pseudo-labeling for Semi-Supervised Multi-Label Learning

This paper proposes DiCaP (Distribution-Calibrated Pseudo-labeling), which estimates the posterior correctness rate of pseudo-labels to calibrate their weights, introduces a dual-threshold mechanism to separate confident and ambiguous regions with differentiated strategies, and surpasses the state of the art by up to 4.27% in semi-supervised multi-label learning.

Do LLMs Really Struggle at NL-FOL Translation? Revealing Their Strengths via a Novel Benchmarking Strategy

This paper critically examines existing evaluation methodologies for natural language to first-order logic (FOL) translation — specifically FOLIO and MALLS — exposing fundamental flaws in their datasets and evaluation protocols. The authors propose a novel benchmarking strategy that decomposes the translation task into ontology extraction (OE) and logical translation (LT), augmented with "most similar selection" and "ranking" subtasks. Experiments demonstrate that conversational LLMs (o3-mini, GPT-4o-mini, Qwen3 series) exhibit strong NL-FOL translation capabilities and genuine logical semantic understanding, while embedding-based models perform significantly worse.

Gaming the Answer Matcher: Examining the Impact of Text Manipulation on Automated Judgment

This paper systematically evaluates three text manipulation strategies—verbosity, strategic multi-answer embedding, and correct-answer-first with contradictory suffix—against LLM-based answer-matching judges. The results show that these manipulations do not improve scores and often reduce them. Binary scoring proves more robust than continuous scoring, demonstrating that answer matching is resistant to low-cost text manipulation as an evaluation method.

LLM-as-a-Judge for Scalable Test Coverage Evaluation

This paper applies the LLM-as-Judge paradigm to Gherkin acceptance test coverage evaluation, systematically quantifying accuracy–reliability–cost trade-offs across 20 model configurations × 500 evaluations. It finds that GPT-4o Mini achieves the optimal production balance with a MAAE of 6.07, an ECR@1 of 96.6%, and a cost of $1.01 per 1K evaluations—approximately 1/78th the cost of GPT-5 at high reasoning effort.

Lost in Benchmarks? Rethinking Large Language Model Benchmarking with Item Response Theory

This paper proposes PSN-IRT (Pseudo-Siamese Network for IRT), an enhanced Item Response Theory framework that jointly estimates LLM ability parameters and four-parameter item characteristics (difficulty / discrimination / guessing / feasibility). Applied to 41,871 items across 11 benchmarks, the framework reveals systemic issues including widespread saturation, insufficient difficulty ceilings, and data contamination. Item subsets selected by PSN-IRT achieve a ranking consistency of Kendall \(\tau = 1.00\).
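For readers unfamiliar with four-parameter IRT, the standard 4PL item response function is sketched below. Mapping the summary's "feasibility" parameter to the upper asymptote `d` is an assumption made here, and the code shows only the textbook response curve, not the PSN-IRT network or its estimation procedure.

```python
import math

def four_pl(theta, a, b, c, d):
    """Standard 4PL item response function.

    theta: ability of the test taker (here, an LLM)
    a: discrimination, b: difficulty,
    c: guessing (lower asymptote),
    d: upper asymptote (assumed here to correspond to "feasibility")
    """
    return c + (d - c) / (1.0 + math.exp(-a * (theta - b)))

# When ability equals difficulty, the probability sits halfway between c and d.
p_mid = four_pl(theta=0.0, a=1.5, b=0.0, c=0.25, d=0.95)
```

An item with a low `d` can never be answered reliably even by a very strong model, while a high `c` means weak models still score by chance; both effects reduce how well the item separates models.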

Low-Rank Curvature for Zeroth-Order Optimization in LLM Fine-Tuning

This paper proposes LOREN, a curvature-aware zeroth-order optimization method that captures the anisotropic curvature of the loss landscape via a low-rank block-diagonal preconditioner, combined with REINFORCE Leave-One-Out (RLOO) variance reduction. LOREN achieves higher accuracy and faster convergence in LLM fine-tuning while reducing peak memory by up to 27.3% compared to MeZO-Adam.

Browse all 16 LLM Evaluation papers →


⚡ LLM Efficiency (9)

Connectivity-Guided Sparsification of 2-FWL GNNs Preserving Full Expressivity

Co-Sparsify proposes a connectivity-aware sparsification framework that restricts 3-node interactions to biconnected components and 2-node interactions to connected components, eliminating provably redundant computation. It preserves full 2-FWL expressivity while substantially improving efficiency, achieving state-of-the-art results on synthetic substructure counting tasks and benchmarks including ZINC and QM9.

Harnessing the Unseen: The Hidden Influence of Intrinsic Knowledge in Long-Context Language Models

This paper presents the first systematic study of how parametric knowledge influences generation in long-context language models (LCLMs), finding that such influence grows with context length and that methods designed to improve extrinsic retrieval suppress parametric recall. Based on these findings, the paper proposes the Hybrid Needle-in-a-Haystack (Hybrid NIAH) benchmark to jointly evaluate both capabilities.

HN-MVTS: HyperNetwork-based Multivariate Time Series Forecasting

This paper proposes HN-MVTS, which employs a HyperNetwork to generate channel-specific weights for the final prediction layer, striking a balance between channel-independent (CI) and channel-dependent (CD) modeling. As a plug-and-play module, it improves forecasting accuracy of various backbone models including DLinear, PatchTST, and TSMixer without incurring additional inference overhead.

How Many Experts Are Enough? Towards Optimal Semantic Specialization for Mixture-of-Experts

This paper proposes MASS, a framework that adaptively expands the MoE expert pool via gradient-based semantic drift detection, combined with a Top-p confidence routing strategy, to automatically discover the optimal number of experts without hyperparameter search while enhancing semantic differentiation across experts.
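A generic Top-p (nucleus-style) selection rule can be sketched as follows: experts are ranked by router probability and added until their cumulative mass reaches a threshold `p`. This is a minimal illustration under that assumption; the function name and the example probabilities are hypothetical, and the routing rule actually used in MASS may differ in its details.

```python
def top_p_experts(probs, p):
    """Return indices of the smallest set of experts whose total probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen, total = [], 0.0
    for i in order:
        chosen.append(i)
        total += probs[i]
        if total >= p:
            break
    return chosen

# Illustrative router outputs (not from the paper).
confident = top_p_experts([0.9, 0.05, 0.05], p=0.8)     # one expert suffices
uncertain = top_p_experts([0.6, 0.25, 0.1, 0.05], p=0.8)  # needs two experts
```

Unlike fixed Top-k routing, the number of active experts here varies per token: confident router outputs activate few experts, while flat distributions activate more.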

InterMoE: Individual-Specific 3D Human Interaction Generation via Dynamic Temporal-Selective MoE

This paper proposes InterMoE, a Dynamic Temporal-Selective MoE architecture for text-driven two-person 3D interaction motion generation that addresses individual identity preservation and semantic fidelity. A Synergistic Router fuses semantic and kinematic features to guide routing, while Dynamic Temporal Selection enables each expert to adaptively select key temporal frames. The method achieves a 9% FID reduction on InterHuman and 22% on InterX.

Judge Q: Trainable Queries for Optimized Information Retention in KV Cache Eviction

This paper proposes Judge Q, which introduces trainable soft tokens into the model vocabulary and trains their attention patterns to align with those of actual decoding tokens, enabling them to replace local-window queries for evaluating KV cache importance during the prefill stage. This approach better preserves global information, achieving ~1-point improvement on LongBench and 3+ points on RULER.

MoETTA: Test-Time Adaptation Under Mixed Distribution Shifts with MoE-LayerNorm

This paper proposes MoETTA, a test-time adaptation framework that reparameterizes LayerNorm into multiple structurally decoupled expert branches. A routing mechanism assigns samples from different domains to different experts, enabling multi-directional parameter updates and overcoming the limitations of a single adaptation path under mixed distribution shifts. The paper also introduces two more realistic evaluation benchmarks—potpourri and potpourri+—and achieves state-of-the-art performance across all settings.

Resource Efficient Sleep Staging via Multi-Level Masking and Prompt Learning

This paper proposes MASS (Mask-Aware Sleep Staging), a framework that achieves reliable sleep staging using only 10% of the original EEG signal through a multi-level masking strategy and hierarchical prompt learning mechanism, providing a practical solution for resource-constrained wearable sleep monitoring systems.

Scaling and Transferability of Annealing Strategies in Large Language Model Training

This paper proposes a model-agnostic predictive framework that decomposes training loss into a forward-effect term (learning rate integral \(S\)), an annealing momentum term (Adam-style momentum integral \(M\)), and a model-size term \(N\). It demonstrates that annealing strategies can be transferred from small models/small batches to large models/large batches, achieving a prediction MAPE below 2%.


📚 Pretraining (9)

Beyond Cosine Similarity: Magnitude-Aware CLIP for No-Reference Image Quality Assessment

This paper proposes MA-CLIP, which discovers and exploits the magnitude information of CLIP image features as a complementary perceptual quality cue. Combined with cosine similarity, it achieves training-free adaptive dual-cue fusion for image quality assessment.

ELSPR: Evaluator LLM Training Data Self-Purification on Non-Transitive Preferences

ELSPR models pairwise preferences of LLM evaluators as tournament graphs, identifies non-transitive preferences via strongly connected components (SCCs), proposes a normalized directed graph structural entropy metric, and filters problematic training data through graph reconstruction — resulting in a 13.8% reduction in non-transitivity and a 0.088 decrease in structural entropy, while the discarded data achieves only 34.4% human agreement (vs. 52.6% for retained data).
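The link between non-transitivity and strongly connected components can be illustrated with a small sketch: if an edge `u -> v` means "u is preferred over v", any preference cycle lies inside an SCC with more than one node. The code below uses Tarjan's algorithm on a hypothetical four-response tournament; it does not reproduce the structural entropy metric or the graph reconstruction filtering described in the summary.

```python
def strongly_connected_components(graph):
    """Tarjan's algorithm; graph maps each node to the list of nodes it beats."""
    index, low, on_stack = {}, {}, set()
    stack, components = [], []
    counter = [0]

    def visit(v):
        index[v] = low[v] = counter[0]
        counter[0] += 1
        stack.append(v)
        on_stack.add(v)
        for w in graph.get(v, []):
            if w not in index:
                visit(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:
            # v is the root of a component: pop it off the stack.
            comp = []
            while True:
                w = stack.pop()
                on_stack.discard(w)
                comp.append(w)
                if w == v:
                    break
            components.append(comp)

    for v in graph:
        if v not in index:
            visit(v)
    return components

# Hypothetical preferences: A > B, B > C, C > A form a cycle; all three beat D.
prefs = {"A": ["B", "D"], "B": ["C", "D"], "C": ["A", "D"], "D": []}
cyclic = [sorted(c) for c in strongly_connected_components(prefs) if len(c) > 1]
```

In this example `cyclic` contains the single group `A, B, C`, whose pairwise judgments cannot be ordered consistently, while `D` sits in its own trivial component.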

GranAlign: Granularity-Aware Alignment Framework for Zero-Shot Video Moment Retrieval

This paper proposes GranAlign, a training-free granularity-aware alignment framework that addresses the core challenge of semantic granularity mismatch in zero-shot video moment retrieval (ZVMR). By rewriting queries into simplified and detailed variants and matching them against query-agnostic and query-aware video descriptions respectively, GranAlign achieves a 3.23% improvement in mAP@avg on QVHighlights.

Learning Procedural-aware Video Representations through State-Grounded Hierarchy Unfolding

This paper proposes a Task-Step-State (TSS) three-level semantic framework that introduces "state" as a visual grounding layer within the conventional task-step hierarchy, and designs a progressive pretraining strategy following a U-shaped path (Task→Step→State→Step→Task) to unfold the TSS hierarchy stage by stage. The approach achieves comprehensive state-of-the-art performance on task recognition, step recognition, and step forecasting tasks on the COIN and CrossTask datasets.

No-Regret Strategy Solving in Imperfect-Information Games via Pre-Trained Embedding

This paper proposes the Embedding CFR algorithm, which maps information sets in imperfect-information games to a continuous low-dimensional embedding space (rather than discrete clusters), achieving faster exploitability convergence and higher-quality strategy solving under the same space budget.

Perspective from a Broader Context: Can Room Style Knowledge Help Visual Floorplan Localization?

This paper proposes leveraging room style knowledge — obtained via unsupervised clustering pretraining in the form of a room discriminator — to resolve ambiguities caused by repetitive structures in visual floorplan localization (FLoc), achieving state-of-the-art performance on two standard benchmarks: Gibson and Structured3D.

PrefixGPT: Prefix Adder Optimization by a Generative Pre-trained Transformer

PrefixGPT frames prefix adder optimization as a sequence generation problem. A customized GPT model is pretrained to learn design rules, then fine-tuned via RL to generate optimized designs, achieving state-of-the-art area-delay product (ADP) with robustness to initialization.

Rectified Noise: A Generative Model Using Positive-incentive Noise

This paper proposes Rectified Noise (ΔRN), which leverages the positive-incentive noise (π-noise) framework to learn a set of beneficial noise signals and inject them into the velocity field of a pretrained Rectified Flow model, achieving a reduction in FID from 10.16 to 9.05 on ImageNet-1k with only 0.39% additional parameters.

TRACE: A Generalizable Drift Detector for Streaming Data-Driven Optimization

This paper proposes TRACE, a transferable concept drift detector based on attention-based sequence learning. By tokenizing statistical features and employing a dual-attention encoder, TRACE learns drift patterns that generalize across tasks, enabling deployment on unseen datasets and integration as a plug-and-play module into streaming data-driven optimization algorithms.


✏️ Knowledge Editing (4)

Hybrid-DMKG: A Hybrid Reasoning Framework over Dynamic Multimodal Knowledge Graphs for Multimodal Multihop QA with Knowledge Editing

This paper proposes the MMQAKE benchmark and the Hybrid-DMKG framework, which constructs a dual-channel hybrid reasoning mechanism — combining relation link prediction with RAG-augmented LVLM inference — over a dynamic multimodal knowledge graph, supplemented by a background reflection decision module. The approach significantly outperforms existing methods on 2–5 hop multimodal knowledge editing QA (H-Acc of 29.90% on LLaVA, surpassing IKE by 13.52 percentage points).

Is the Information Bottleneck Robust Enough? Towards Label-Noise Resistant Information Bottleneck Learning

This paper identifies the inherent vulnerability of the Information Bottleneck (IB) principle under label noise and proposes LaT-IB, which decomposes representations into a clean-label subspace and a noisy-label subspace. Combined with a Minimal-Sufficient-Clean (MSC) criterion and a three-stage training framework, LaT-IB significantly outperforms existing IB methods across diverse noise conditions.

Model Editing as a Double-Edged Sword: Steering Agent Ethical Behavior

This paper frames the steering of agent ethical behavior as a model editing task (Behavior Editing), proposes a three-tier BehaviorBench grounded in psychological moral theory, and validates on 9 open-source and 20 closed-source models that model editing can precisely steer agents toward either benevolent or malicious behavior, with a single edit potentially causing global moral alignment drift.

Multiplicative Orthogonal Sequential Editing for Language Models (MOSE)

This paper proposes MOSE (Multiplicative Orthogonal Sequential Editing), which injects new knowledge by left-multiplying the parameter matrix with an orthogonal matrix (rather than via additive updates), strictly preserving the Frobenius norm and condition number of the edited matrix. MOSE achieves a 12.08% performance improvement in sequential editing while retaining 95.73% of general capabilities.
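The norm-preservation claim follows from a short identity: for orthogonal \(Q\), \((QW)^\top (QW) = W^\top Q^\top Q W = W^\top W\), so the singular values of \(W\), and with them its Frobenius norm and 2-norm condition number, are unchanged. The sketch below checks this numerically with a 2×2 rotation; the matrices are arbitrary illustrative values, not an edit computed by MOSE.

```python
import math

def matmul(a, b):
    """Plain-Python matrix product of two nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(m):
    return [list(row) for row in zip(*m)]

def frobenius(m):
    return math.sqrt(sum(x * x for row in m for x in row))

theta = 0.7  # arbitrary rotation angle (illustrative)
Q = [[math.cos(theta), -math.sin(theta)],
     [math.sin(theta),  math.cos(theta)]]   # orthogonal: Q^T Q = I
W = [[2.0, 1.0],
     [0.5, 3.0]]                            # stand-in for a parameter matrix

W_edit = matmul(Q, W)                       # multiplicative update: W' = Q W

# The Gram matrix W^T W determines the singular values of W.
gram_before = matmul(transpose(W), W)
gram_after = matmul(transpose(W_edit), W_edit)
```

An additive update \(W + \Delta\) carries no such guarantee, which is the usual explanation for norm growth across long sequences of edits.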


💬 LLM (Other) (29)

A Content-Preserving Secure Linguistic Steganography

This paper proposes CLstega, the first content-preserving linguistic steganography paradigm, which embeds secret information into an unmodified cover text by fine-tuning a masked language model (MLM) to controllably transform its prediction distribution. The approach achieves a 100% extraction success rate and near-perfect security, with steganalysis detection accuracy approaching the random-guess baseline of 0.5.

An Invariant Latent Space Perspective on Language Model Inversion

This paper proposes the Invariant Latent Space Hypothesis (ILSH), which reframes the LLM inversion problem as reusing the LLM's own latent space. The Inv²A framework is designed to map outputs to denoised pseudo-representations via a lightweight inverse encoder, which are then decoded by a frozen LLM to recover hidden prompts. Inv²A achieves an average BLEU improvement of 4.77% across 9 datasets and attains comparable performance with only 20% of the training data.

Blue Teaming Function-Calling Agents

This paper systematically evaluates the robustness of four open-source function-calling LLMs against three attack types, and assesses the effectiveness of eight defense mechanisms, revealing that current models are insecure by default and that existing defenses remain difficult to deploy in practice.

CoEvo: Continual Evolution of Symbolic Solutions Using Large Language Models

This paper proposes CoEvo, a framework that integrates LLMs with evolutionary search methodology to achieve continual open-ended evolution of symbolic solutions through a dynamic knowledge library and multi-representation spaces (natural language / mathematical formulas / code), significantly outperforming existing symbolic regression methods on the AI Feynman benchmark.

Collaborative LLM Numerical Reasoning with Local Data Protection

This paper proposes a large-small model collaboration framework that protects sensitive local data through a two-stage anonymization pipeline — topic shifting followed by numerical substitution — applied to local queries. The remote GPT-4 returns reasoning solutions as executable Python code (plug-and-play tools), and the local model only needs to perform numerical back-substitution to obtain the final answer. The framework achieves 16–44% accuracy improvements on FinQA and MultiHiertt while reducing data leakage by 2–45%.

Control Illusion: The Failure of Instruction Hierarchies in Large Language Models

This paper systematically demonstrates that the system/user prompt separation mechanism in current LLMs fails to establish reliable instruction priority, and finds that social hierarchy priors acquired during pretraining (authority, expertise, consensus) exert stronger control over model behavior than explicit system/user role markers.

Guess or Recall? Training CNNs to Classify and Localize Memorization in LLMs

CNNs trained on LLM attention weights are used to evaluate how well memorization taxonomies align with actual attention mechanisms. A new three-class taxonomy (Guess/Recall/Non-Memorized) is proposed, improving the minimum F1 from 64.7% to 89.0%. The analysis also localizes the two memorization types: Guess relies on low-layer attention, while Recall relies on high-layer attention.

ICL-Router: In-Context Learned Model Representations for LLM Routing

This paper proposes ICL-Router, a two-stage training framework (query reconstruction + ICL model routing) that encodes LLM capability profiles as in-context vectors, enabling scalable dynamic model routing. New models can be incorporated without retraining the router, achieving state-of-the-art performance on both in-distribution and out-of-distribution tasks.

Identifying and Analyzing Performance-Critical Tokens in Large Language Models

Through representation-level and token-level ablation experiments, this paper identifies the "performance-critical tokens" that LLMs directly rely on during ICL as template and stopword tokens (e.g., "Answer:"), rather than the content tokens that humans would attend to (e.g., actual text). It further reveals that LLMs indirectly exploit content by aggregating content information into the representations of these critical tokens.

IROTE: Human-like Traits Elicitation of Large Language Model via In-Context Self-Reflective Optimization

This paper proposes IROTE, an in-context self-reflective optimization method grounded in information bottleneck theory. By iteratively generating and refining compact yet evocative textual "self-reflections," IROTE stably elicits target human traits (values, morality, personality) from LLMs across diverse downstream tasks without any fine-tuning, consistently outperforming existing baselines in trait consistency.

Browse all 29 LLM (Other) papers →


📖 NLP Understanding (1)

Language Models and Logic Programs for Trustworthy Tax Reasoning

This paper reframes tax law reasoning as a semantic parsing task, where LLMs translate statutory text and case facts into Prolog logic programs that are subsequently executed by a symbolic solver. By combining gold-standard statute translations, retrieval-augmented case examples, and self-consistency checks, the system achieves 86/100 accuracy on the SARA dataset while reducing estimated deployment cost to $15.78 per person — less than 6% of the average U.S. tax filing cost.


✍️ Text Generation (3)

AutoMalDesc: Large-Scale Script Analysis for Cyber Threat Research

This paper proposes AutoMalDesc, an automated static analysis framework that employs an iterative self-paced learning pipeline — starting from 900 expert-annotated seed samples, fine-tuning Llama-3.3-70B via LoRA to generate pseudo-labels, applying multi-stage quality filtering to obtain 101K samples, and training a V2 model — to achieve automated malware classification and behavior description across five scripting languages, improving Batch script detection accuracy from 52.7% to 82.4%.

C3TG: Conflict-aware, Composite, and Collaborative Controlled Text Generation

This paper proposes the C3TG framework, which achieves fine-grained multi-attribute controllable text generation through a two-stage approach: in the generation stage, weighted KL divergence is used to fuse attribute distributions and adjust token probabilities; in the optimization stage, an energy function (combining classifier scores and conflict penalty terms) drives iterative rewriting via a Feedback Agent. C3TG achieves 90.4% attribute accuracy across 17 attribute subcategories while substantially reducing toxicity.

Structured Language Generation Model: Loss Calibration and Formatted Decoding for Efficient Text

This paper proposes the SLGM framework, which reformulates structured prediction tasks for generative language models as classification problems via three components: structured input format, format loss, and format-aware decoding. Without introducing additional model parameters, SLGM significantly improves structural prediction performance of sub-1B models across 13 datasets spanning 5 task categories, including NER, RE, and SRL.


🗣️ Dialogue Systems (5)

Auto-PRE: An Automatic and Cost-Efficient Peer-Review Framework for Language Generation Evaluation

This paper proposes the Auto-PRE framework, which selects qualified LLM evaluators through an automatic qualification exam across three dimensions—consistency, pertinence, and self-confidence—achieving state-of-the-art evaluation performance without human annotation while significantly reducing costs.

Chatsparent: An Interactive System for Detecting and Mitigating Cognitive Fatigue in LLMs

This paper presents Chatsparent, an interactive system that monitors three token-level fatigue signals during LLM inference in real time—attention decay, embedding drift, and entropy collapse—aggregates them into a unified fatigue index, and automatically applies lightweight interventions (prompt re-injection, attention reset, entropy-regularized decoding, self-reflection checkpoints) when fatigue thresholds are triggered, transforming passive chat interaction into an active diagnostic experience.

Emergent Persuasion: Will LLMs Persuade Without Being Prompted?

This paper investigates whether LLMs spontaneously exhibit persuasive behavior without being explicitly prompted to do so. It finds that activation steering fails to reliably induce persuasive tendencies, whereas supervised fine-tuning (SFT) on benign persuasion data causes models to exhibit emergent persuasive behavior on harmful topics, revealing latent post-training safety risks.

TalkSketch: Multimodal Generative AI for Real-time Sketch Ideation with Speech

This paper proposes TalkSketch, a system that integrates hand-drawn sketches with real-time speech input into a multimodal AI chatbot, enabling designers to simultaneously draw and verbalize ideas during early-stage ideation. The system addresses the problem that text-based prompting in existing GenAI tools disrupts the creative workflow.

Canoe: Teaching LLMs to Maintain Contextual Faithfulness via Synthetic Tasks and RL

This paper proposes the Canoe framework, which synthesizes four types of verifiable short-form QA data from Wikidata triples and applies Dual-GRPO (incorporating accuracy reward, long-form proxy reward, and format reward) to jointly optimize faithfulness in both short- and long-form generation. The approach improves Llama-3-8B by an average of 22.6% across 11 downstream tasks, surpassing GPT-4o.


🌐 Multilingual & Translation (9)

Bridging the Multilingual Safety Divide: Efficient, Culturally-Aware Alignment for Global South Languages

This paper synthesizes multiple empirical studies to reveal critical failures of LLM safety mechanisms in low-resource and code-mixed settings, and proposes a resource-aware blueprint grounded in parameter-efficient safety steering, culturally driven preference data, and community-participatory alignment.

Focusing on Language: Revealing and Exploiting Language Attention Heads in Multilingual Large Language Models

This paper proposes LAHIS, a method that efficiently identifies language-specific and language-general attention heads in multilingual LLMs using only a single forward-backward pass. It demonstrates that manipulating these heads enables cross-lingual attention transfer, mitigates off-target language generation, and improves multilingual QA performance with only 14–20 trainable parameters.

GloCTM: Cross-Lingual Topic Modeling via a Global Context Space

This paper proposes GloCTM, a dual-path VAE architecture (local language path + global context path) that enforces cross-lingual alignment at four levels—Polyglot Augmentation (cross-lingual neighbor-based input expansion), KL divergence internal alignment, unified decoder structural alignment, and CKA semantic alignment—achieving state-of-the-art topic quality and cross-lingual alignment on three cross-lingual datasets.

How Does Alignment Enhance LLMs' Multilingual Capabilities? A Language Neurons Perspective

This paper proposes a ternary neuron classification scheme (language-specific / language-related / universal) and decomposes multilingual LLM inference into a four-stage framework. It finds that multilingual alignment improves performance by increasing language-related neurons (while reducing language-specific ones), and further demonstrates a "spontaneous multilingual alignment" effect on untrained languages.

MIDB: Multilingual Instruction Data Booster for Enhancing Cultural Equality in Multilingual Instruction Synthesis

This paper proposes MIDB (Multilingual Instruction Data Booster), a unified model trained on 36.8k expert-annotated revision samples, which automatically repairs content errors, machine translation defects, and localization deficiencies in multilingual synthetic instruction data, significantly improving instruction data quality across 16 languages and enhancing downstream LLM multilingual/cultural understanding capabilities.

Mitigating Content Effects on Reasoning in Language Models through Fine-Grained Activation Steering

This paper applies activation steering to mitigate content effects in LLMs — the tendency to conflate content believability with formal logical validity. The proposed K-CAST (kNN-based Conditional Activation Steering) method achieves up to 15% improvement in formal reasoning accuracy on models unresponsive to standard static steering.

NADIR: Differential Attention Flow for Non-Autoregressive Transliteration in Indic Languages

This paper proposes NADIR, a non-autoregressive (NAR) multilingual transliteration architecture combining a differential Transformer with a Mixture-of-Experts (MoE) module. NADIR achieves over 13× inference speedup on Indic language transliteration tasks while substantially reducing hallucination errors common in NAR models (repetition, substitution, omission, and insertion), narrowing the accuracy gap with autoregressive counterparts.

ViDia2Std: A Parallel Corpus and Methods for Low-Resource Vietnamese Dialect-to-Standard Translation

ViDia2Std constructs the first manually annotated Vietnamese dialect-to-standard parallel corpus covering all 63 provinces of Vietnam (13,000+ sentence pairs), evaluates multiple seq2seq models on the dialect normalization task, and demonstrates that dialect normalization as a preprocessing step significantly improves downstream task performance in machine translation and sentiment analysis.

X-MuTeST: A Multilingual Benchmark for Explainable Hate Speech Detection and A Novel LLM-consulted Explanation Framework

This paper proposes the X-MuTeST framework, which combines LLM semantic reasoning with a two-stage training strategy enhanced by n-gram attention for explainable multilingual hate speech detection. It also introduces the first token-level human-annotated rationale benchmark datasets for Hindi and Telugu.


🔍 Information Retrieval & RAG (21)

"As Eastern Powers, I Will Veto." : An Investigation of Nation-Level Bias of Large Language Models in International Relations

This paper systematically investigates nation-level bias of LLMs in international relations, designing three bias evaluation paradigms (DirectQA, Association Test, Vote Simulation) grounded in real UN Security Council data. It reveals the multi-dimensional nature of such bias—varying across models and evaluation contexts—and proposes a RAG+Reflexion debiasing framework.

Beyond Perplexity: Let the Reader Select Retrieval Summaries via Spectrum Projection Score

This paper proposes Spectrum Projection Score (SPS), a training-free metric that evaluates retrieval summary quality by measuring the alignment between summary token embeddings and the principal subspace of the reader LLM, serving as a replacement for conventional perplexity-based metrics. Combined with the xCompress inference-time controller, SPS achieves substantial improvements over perplexity-based methods across 5 QA datasets (HotpotQA EM +3.6).

Cog-RAG: Cognitive-Inspired Dual-Hypergraph with Theme Alignment Retrieval-Augmented Generation

This paper proposes Cog-RAG, which constructs a dual-hypergraph index comprising a theme hypergraph and an entity hypergraph to simulate the human "top-down" cognitive process via a two-stage retrieval strategy (theme first, then details), achieving global-to-local semantic alignment for generation.

ComLQ: Benchmarking Complex Logical Queries in Information Retrieval

This paper introduces ComLQ, the first IR benchmark targeting complex logical queries spanning 14 query types (conjunction, disjunction, negation, and their combinations). It proposes a subgraph-guided LLM data synthesis pipeline and a negation consistency metric, LSNC, revealing that existing retrievers struggle severely with logical reasoning, particularly negation modeling.

ComoRAG: A Cognitive-Inspired Memory-Organized RAG for Stateful Long Narrative Reasoning

Inspired by the metacognitive regulation mechanism of the prefrontal cortex, this paper proposes the ComoRAG framework, which achieves stateful multi-step reasoning via a dynamic memory workspace and iterative probe queries, significantly outperforming existing RAG methods on long narrative understanding tasks (200K+ tokens).

ConvMix: A Mixed-Criteria Data Augmentation Framework for Conversational Dense Retrieval

This paper proposes ConvMix, a mixed-criteria data augmentation framework that leverages LLMs to perform scalable relevance annotation augmentation from both query and document directions, combined with clustering-based diversity selection and Fisher information-based in-distribution supervision, to systematically improve conversational dense retrieval performance.

Do Retrieval Augmented Language Models Know When They Don't Know?

This paper systematically analyzes the refusal calibration problem in RAG models, finding that RALMs exhibit an over-refusal rate exceeding 55% when all retrieved documents are irrelevant (even when the model's internal knowledge suffices to answer), and proposes a mechanism combining uncertainty estimation with refusal-aware fine-tuning to balance refusal behavior and answer quality.

Exposing the Cracks: Vulnerabilities of Retrieval-Augmented LLM-Based Machine Translation

This work develops a controlled noise injection framework to systematically evaluate retrieval-augmented machine translation (REAL-MT), introduces two new metrics—Fidelity and CAR—and reveals across 10 language pairs × 4 noise types that models blindly adopt retrieved context even when it is contradictory (CAR remains 65–78%). Large reasoning models (LRMs) are found to be even more vulnerable by "rationalizing" erroneous context, and a fundamental trade-off exists between noise robustness and clean-context utilization.

Magnitude Matters: A Superior Class of Similarity Metrics for Holistic Semantic Understanding

This paper proposes two parameter-free, magnitude-aware vector similarity metrics—Overlap Similarity (OS) and Hyperbolic Tangent Similarity (HTS)—that achieve significantly lower MSE than Cosine Similarity and Dot Product on classification tasks (paraphrase detection, natural language inference) across 4 sentence embedding models and 8 NLP benchmarks, without any additional training overhead.

Mem-PAL: Towards Memory-based Personalized Dialogue Assistants for Long-term User-Agent Interaction

This paper proposes H2Memory, a four-layer hierarchical heterogeneous memory structure (Log Graphs / Background Memory / Topic Outlines / Principles), validated on the PAL-Set dataset (100 users × 8.4 months of interaction), improving BLEU-1 on demand paraphrasing and solution recommendation tasks from 13.59 to 26.67.

Browse all 21 Information Retrieval & RAG papers →


💻 Code Intelligence (10)

DiffBench Meets DiffAgent: End-to-End LLM-Driven Diffusion Acceleration Code Generation

This paper proposes DiffBench (an evaluation benchmark comprising 604 diffusion model acceleration tasks across 5 difficulty levels) and DiffAgent (a closed-loop framework integrating Planning, Coding, and Debugging agents with a genetic algorithm-based selector). On Claude Sonnet 4, the framework improves the pass rate for diffusion acceleration code generation from 54.30% to 81.59%, achieving a 68.27% success rate on complex optimization tasks.

EquaCode: A Multi-Strategy Jailbreak Approach for Large Language Models via Equation Solving and Code Completion

This paper proposes EquaCode, a multi-strategy jailbreak method that decomposes malicious queries into a cross-domain combination of equation solving (\(B+C+x=A\)) and code completion (completing the solve() method of a Solver class), achieving an average attack success rate of 92.78% on the GPT series and approaching 100% on the latest models (Gemini/DeepSeek/Grok).

Extracting Events Like Code: A Multi-Agent Programming Framework for Zero-Shot Event Extraction

This paper proposes Agent-Event-Coder (AEC), which reformulates zero-shot event extraction as a software engineering workflow. Four specialized agents (Retrieval→Planning→Coding→Verification) collaborate to perform extraction, while event schemas are encoded as executable Python classes to enable compiler-style deterministic validation and dual-loop iterative correction. AEC comprehensively outperforms zero-shot baselines across 5 domains and 6 LLMs.
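
The idea of encoding an event schema as an executable class can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `AttackEvent` schema, its field names, and the `validate_extraction` helper are hypothetical and chosen only for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class AttackEvent:
    """Hypothetical event schema; the field names are illustrative only."""
    trigger: str
    attacker: Optional[str] = None
    targets: List[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Deterministic checks: a malformed extraction raises instead of passing silently.
        if not isinstance(self.trigger, str) or not self.trigger.strip():
            raise ValueError("trigger must be a non-empty string")
        if self.attacker is not None and not isinstance(self.attacker, str):
            raise TypeError("attacker must be a string or None")
        if not isinstance(self.targets, list) or not all(
            isinstance(t, str) for t in self.targets
        ):
            raise TypeError("targets must be a list of strings")


def validate_extraction(candidate: dict) -> Tuple[Optional[AttackEvent], Optional[str]]:
    """Instantiate the schema; return (event, None) on success or (None, error message)."""
    try:
        return AttackEvent(**candidate), None
    except (TypeError, ValueError) as exc:
        # In a correction loop, this message could be handed back to a coding agent.
        return None, f"{type(exc).__name__}: {exc}"
```

Because instantiation either succeeds or raises, the same candidate always yields the same verdict, which is the property the "compiler-style" validation in the summary relies on.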

MoSE: Hierarchical Self-Distillation Enhances Early Layer Embeddings

This paper proposes ModularStarEncoder (MoSE), a 1B-parameter multi-exit encoder that significantly enhances early-layer representations via a novel self-distillation mechanism in which higher layers guide the training of lower layers. MoSE surpasses all open-source models on code understanding tasks such as CodeSearchNet while supporting flexible compute–accuracy tradeoff deployment.

ReCode: Updating Code API Knowledge with Reinforcement Learning

This paper proposes ReCode, a framework that trains LLMs via rule-based reinforcement learning (rather than SFT) to correctly leverage API update documentation provided in the prompt for code version migration, enabling a 7B model to surpass 32B models on CodeUpdateArena.

SPAN: Benchmarking and Improving Cross-Calendar Temporal Reasoning of Large Language Models

This paper proposes SPAN, a cross-calendar temporal reasoning benchmark with 37,380 instances spanning 6 calendars, 10 reasoning directions, and a 100-year range. Baseline LLMs achieve an average accuracy of only 34.5% (none exceeding 80%), revealing two systematic failure modes: Future-Date Degradation and Calendar Asymmetry Bias. A tool-augmented Time Agent achieves 95.31%, demonstrating that cross-calendar reasoning requires external tools rather than parametric knowledge.
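
To make the "external tool" point concrete, here is a small sketch of a deterministic calendar converter of the kind an agent could call. It is an assumption-laden illustration: this summary does not list SPAN's six calendars or describe the Time Agent's actual tools, so the Julian calendar is used purely as a familiar example, and the function names are hypothetical. The conversion goes through the Julian Day Number (JDN) using a standard integer algorithm.

```python
from datetime import date
from typing import Tuple


def gregorian_to_jdn(d: date) -> int:
    """Julian Day Number of a proleptic Gregorian date.

    date.toordinal() counts days from 0001-01-01 (ordinal 1),
    whose JDN is 1721426, hence the fixed offset.
    """
    return d.toordinal() + 1721425


def jdn_to_julian(jdn: int) -> Tuple[int, int, int]:
    """Convert a Julian Day Number to a (year, month, day) Julian calendar date."""
    f = jdn + 1401
    e = 4 * f + 3
    g = (e % 1461) // 4
    h = 5 * g + 2
    day = (h % 153) // 5 + 1
    month = (h // 153 + 2) % 12 + 1
    year = e // 1461 - 4716 + (14 - month) // 12
    return year, month, day


def gregorian_to_julian(d: date) -> Tuple[int, int, int]:
    """Hypothetical tool entry point: Gregorian date in, Julian calendar date out."""
    return jdn_to_julian(gregorian_to_jdn(d))


if __name__ == "__main__":
    # 7 November 1917 (Gregorian) is 25 October 1917 in the Julian calendar.
    print(gregorian_to_julian(date(1917, 11, 7)))  # (1917, 10, 25)
```

The arithmetic is exact for any date in range, including future dates, which is where the summary reports that parametric knowledge degrades.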

TAPA: Training-Free Adaptation of Programmatic Agents via LLM-Guided Program Synthesis in Dynamic Environments

TAPA positions LLMs as "intelligent modulators" of the symbolic action space rather than direct decision-makers. Through LLM-guided program synthesis, it dynamically adapts the symbolic actions of programmatic agents without retraining, achieving strong performance in cybersecurity DDoS defense (77.7% network uptime) and swarm intelligence formation control.

Towards Better Code Understanding in Decoder-Only Models with Contrastive Learning

This paper proposes CL4D, a contrastive learning framework that adapts pretrained decoder-only code generation models to code understanding tasks (code search, clone detection) via continued pretraining, achieving performance comparable to or better than encoder-only models of equivalent scale without training a new model from scratch.

Unintended Misalignment from Agentic Fine-Tuning: Risks and Mitigation

This paper demonstrates that fine-tuning LLMs on benign agentic data causes unintended safety misalignment (attack success rate increases by 32–38%). It proposes PING (Prefix Injection Guard), an iterative generate-and-evaluate approach that automatically discovers natural language prefixes to guide fine-tuned agents toward refusing harmful requests. PING improves the average refusal rate by 66% (Web) and 44% (Code) while preserving task performance (degradation of only 1.8%).

Why Do Open-Source LLMs Struggle with Data Analysis? A Systematic Empirical Study

This paper systematically investigates the capability bottlenecks of open-source LLMs in data analysis tasks. It decomposes data analysis into three dimensions—data comprehension, code generation, and strategic planning—and identifies strategic planning as the decisive factor, rather than coding or data comprehension. A strategy-guided data synthesis approach is proposed, enabling fine-tuned 7B/14B models to achieve performance competitive with GPT-4o.


🎨 Image Generation (79)

AEDR: Training-Free AI-Generated Image Attribution via Autoencoder Double-Reconstruction

This paper proposes a training-free image attribution method based on the ratio of autoencoder double-reconstruction losses. By incorporating image uniformity calibration to eliminate texture complexity bias, the method achieves an average accuracy of 95.1% across 8 mainstream diffusion models, surpassing the strongest baseline by 24.7%, while being approximately 100× faster.

Aggregating Diverse Cue Experts for AI-Generated Image Detection

This paper proposes the Multi-Cue Aggregation Network (MCAN), which unifies three complementary cues — raw image, high-frequency representation, and a newly introduced Chromaticity Inconsistency (CI) — through a Mixture-of-Encoder Adapter (MoEA), enabling robust AI-generated image detection that generalizes across diverse generative models.

Annealed Relaxation of Speculative Decoding for Faster Autoregressive Image Generation

This paper proposes Cool-SD, a theoretically grounded annealed relaxation framework for speculative decoding. By deriving a tight upper bound on the TV distance, it obtains the optimal resampling distribution and proves that a decreasing acceptance probability schedule yields smaller distributional shift than a uniform schedule. Cool-SD achieves a superior speed–quality trade-off over LANTERN++ on LlamaGen and Lumina-mGPT.

Backdoors in Conditional Diffusion: Threats to Responsible Synthetic Data Pipelines

This paper exposes a backdoor vulnerability in the ControlNet conditional branch: injecting as little as 1–5% poisoned data suffices to implant a backdoor without modifying the diffusion backbone. Upon trigger activation, the model ignores text prompts and generates attacker-specified content. Clean fine-tuning (CFT) is proposed as a practical defense.

Beautiful Images, Toxic Words: Understanding and Addressing Offensive Text in Generated Images

This paper identifies a novel threat of NSFW text embedded in diffusion-model-generated images, proposes NSFW-Intervention — a targeted LoRA fine-tuning method applied to text-rendering layers — and releases the ToxicBench benchmark.

Beyond Semantic Features: Pixel-Level Mapping for Generalized AI-Generated Image Detection

This paper proposes a pixel-level mapping preprocessing method that suppresses low-frequency semantic bias and enhances high-frequency generation artifacts by breaking the monotonic ordering of pixel values, achieving a cross-model generalization accuracy of 98.4% in AI-generated image detection.

Breaking the Modality Barrier: Generative Modeling for Accurate Molecule Retrieval from Mass Spectra

This paper proposes GLMR, a two-stage framework (contrastive pre-retrieval + generative language model re-ranking) that transforms cross-modal retrieval into unimodal retrieval by generating molecular structures aligned with input mass spectra, achieving over 40% improvement in Recall@1 on MassSpecGym.

CAD-VAE: Leveraging Correlation-Aware Latents for Comprehensive Fair Disentanglement

CAD-VAE introduces a correlation-aware latent code to capture shared information between target and sensitive attributes, achieves disentanglement by directly minimizing conditional mutual information, and employs a relevance-driven optimization strategy to precisely regulate the shared code, attaining state-of-the-art performance on fair representation learning, counterfactual generation, and fair image editing.

CausalCLIP: Causally-Informed Feature Disentanglement and Filtering for Generalizable Detection of Generated Images

This paper proposes CausalCLIP, which disentangles CLIP features into causal and non-causal subspaces via Gumbel-Softmax masking and HSIC constraints, and combines adversarial masking with counterfactual intervention to preserve stable forensic cues, achieving a 6.83% accuracy improvement in cross-generator generalization.

Conditional Diffusion Model for Multi-Agent Dynamic Task Decomposition

This paper proposes CD3T, a two-level hierarchical MARL framework that employs a conditional diffusion model to learn action semantic representations \(z_a^i\) (conditioned on observations and other agents' actions to predict next observations and rewards), obtains subtask partitions via k-means clustering, and uses a high-level subtask selector combined with a low-level policy operating over a restricted action space. CD3T significantly outperforms all baselines on Super Hard scenarios in SMAC.

Browse all 79 Image Generation papers →


🎬 Video Generation (11)

3D4D: An Interactive Editable 4D World Model via 3D Video Generation

This paper proposes 3D4D, an interactive 4D visualization framework integrating WebGL and Supersplat rendering. A four-module backend pipeline converts static images and text prompts into editable 4D scenes, while a VLM-guided foveated rendering strategy enables 60fps real-time interaction, achieving state-of-the-art performance on both CLIP Consistency and CLIP Score.

DreamRunner: Fine-Grained Compositional Story-to-Video Generation with Retrieval-Augmented Motion Adaptation

This paper proposes DreamRunner, a framework that achieves fine-grained controllable multi-character multi-event story video generation via LLM-based dual-level planning, retrieval-augmented motion prior learning, and a spatial-temporal region-based 3D attention injection module (SR3AI).

FilmWeaver: Weaving Consistent Multi-Shot Videos with Cache-Guided Autoregressive Diffusion

This paper proposes FilmWeaver, a framework that guides autoregressive diffusion models via a dual-level cache (Shot Cache + Temporal Cache), enabling multi-shot video generation of arbitrary length with cross-shot consistency.

GenVidBench: A 6-Million Benchmark for AI-Generated Video Detection

This paper introduces GenVidBench—the first large-scale AI-generated video detection dataset with 6.78 million videos, featuring cross-source and cross-generator properties, covering 11 state-of-the-art video generators, and providing rich semantic annotations.

Mask2IV: Interaction-Centric Video Generation via Mask Trajectories

This paper proposes Mask2IV, a two-stage decoupled framework that first predicts mask motion trajectories of the interactor and object, then generates video conditioned on these trajectories. The approach enables controllable, interaction-centric video generation without dense mask annotations, supporting both human-object interaction and robot manipulation scenarios.

MoFu: Scale-Aware Modulation and Fourier Fusion for Multi-Subject Video Generation

This paper proposes MoFu, which addresses two fundamental challenges in multi-subject video generation—scale inconsistency and permutation sensitivity—through two core modules: Scale-Aware Modulation (SMO, an LLM-guided scale-aware modulation mechanism) and Fourier Fusion (an FFT-based permutation-invariant feature fusion strategy). The work additionally introduces the MoFu-1M training dataset and the MoFu-Bench evaluation benchmark.

MotionCharacter: Fine-Grained Motion Controllable Human Video Generation

This paper proposes the MotionCharacter framework, which decouples motion into two independently controllable dimensions—action type and motion intensity—to achieve fine-grained motion control and identity consistency in high-fidelity human video generation.

OmniVDiff: Omni Controllable Video Diffusion for Generation and Understanding

This paper proposes OmniVDiff, a unified controllable video diffusion framework that jointly models multiple visual modalities (RGB, depth, segmentation, Canny) in color space and introduces an Adaptive Modality Control Strategy (AMCS). Within a single diffusion model, OmniVDiff simultaneously supports three task types—text-conditioned generation, X-conditioned generation, and video understanding—achieving state-of-the-art performance on VBench.

Phased One-Step Adversarial Equilibrium for Video Diffusion Models

This paper proposes V-PAE (Video Phased Adversarial Equilibrium), a two-phase distillation framework consisting of stability priming followed by unified adversarial equilibrium, which compresses large-scale video diffusion models (e.g., Wan2.1-I2V-14B) to single-step generation, achieving a 100× speedup and surpassing existing acceleration methods by 5.8% in average quality on VBench-I2V.

Seeing the Unseen: Zooming in the Dark with Event Cameras

This paper proposes RetinexEVSR, the first event-driven low-light video super-resolution (LVSR) framework. Through a Retinex-inspired bidirectional fusion strategy (RBF)—which first uses illumination maps to guide event feature denoising (IEE), then leverages enhanced event features to recover reflectance details (ERE)—the method achieves a 2.95 dB gain on the SDSD benchmark while reducing runtime by 65%.

Browse all 11 Video Generation papers →


🧩 Multimodal VLM (75)

Aligning the True Semantics: Constrained Decoupling and Distribution Sampling for Cross-Modal Alignment

This paper proposes the CDDS algorithm, which decouples embeddings into semantic and modality components via a dual-path UNet, and employs a distribution sampling method to achieve cross-modal semantic alignment indirectly, avoiding distribution distortion caused by directly adjusting embeddings. CDDS surpasses the state of the art by 6.6%–14.2% on Flickr30K and MS-COCO.

anyECG-chat: A Generalist ECG-MLLM for Flexible ECG Input and Multi-Task Understanding

This work constructs the anyECG dataset (covering three tasks: report generation, waveform localization, and multi-ECG comparison) and proposes the anyECG-chat model. Through a dynamic ECG input mechanism supporting variable-length, few-lead, and multi-ECG inputs, and a three-stage curriculum learning strategy, anyECG-chat comprehensively outperforms existing ECG-MLLMs in OOD generalization for report generation, second-level anomalous waveform localization, and multi-ECG comparative analysis.

"Are We Done Yet?": A Vision-Based Judge for Autonomous Task Completion of Computer Use Agents

This paper proposes a VLM-based autonomous task completion evaluation framework that judges whether a Computer Use Agent (CUA) has completed a task using only screenshots and task descriptions. Evaluation feedback is passed back to the agent for self-correction, achieving 73% evaluation accuracy and a 27% relative improvement in task success rate on macOS.

BiPrompt: Bilateral Prompt Optimization for Visual and Textual Debiasing in Vision-Language Models

This paper proposes BiPrompt, a bilateral prompt optimization framework that simultaneously mitigates spurious biases on both the visual side (structured attention erasure) and the textual side (balanced prompt normalization) in VLMs such as CLIP at test time, improving OOD robustness without retraining.

BOFA: Bridge-Layer Orthogonal Low-Rank Fusion for CLIP-Based Class-Incremental Learning

This paper proposes BOFA, a framework that exclusively fine-tunes the existing cross-modal projection layer (bridge-layer) in CLIP. By constraining parameter updates within a low-rank "safe subspace" orthogonal to old-task features via Orthogonal Low-Rank Fusion, and combining this with cross-modal hybrid prototypes, BOFA achieves state-of-the-art exemplar-free class-incremental learning without introducing any additional parameters or inference overhead.

Branch, or Layer? Zeroth-Order Optimization for Continual Learning of Vision-Language Models

This paper systematically investigates the application of zeroth-order (ZO) optimization in PEFT-based vision-language continual learning (VLCL). It finds that naively replacing first-order (FO) optimization with ZO causes training instability, and proposes a progressive ZO-FO hybrid strategy ranging from branch-wise to layer-wise granularity. Building on the theoretical finding that the visual modality exhibits larger gradient variance, the paper further proposes MoZO (gradient sign normalization + visual perturbation constraint), achieving state-of-the-art performance across four benchmarks.

Bridging Modalities via Progressive Re-alignment for Multimodal Test-Time Adaptation (BriMPR)

This paper proposes BriMPR, a framework that decomposes multimodal test-time adaptation (MMTTA) into multiple unimodal feature alignment subproblems via a divide-and-conquer strategy. It first calibrates the global feature distribution of each modality through prompt tuning to achieve initial cross-modal semantic alignment, then refines the alignment via cross-modal masked embedding recombination and instance-level contrastive learning.

Bridging the Copyright Gap: Do Large Vision-Language Models Recognize and Respect Copyrighted Content?

This paper presents the first systematic evaluation of LVLMs' ability to recognize and respect copyrighted content in multimodal contexts. It constructs a large-scale benchmark of 50,000 multimodal query–content pairs, finds that 11 out of 12 SOTA LVLMs fail to refuse infringing requests even when explicit copyright notices are present, and proposes CopyGuard—a tool-augmented framework that raises the infringement rejection rate from ~3% to ~62%.

ClearAIR: A Human-Visual-Perception-Inspired All-in-One Image Restoration

Inspired by human visual perception (HVP), this paper proposes ClearAIR, a coarse-to-fine unified image restoration framework that progressively recovers image quality through four stages — MLLM-based quality assessment → semantic region perception → degradation type identification → internal clue reuse — achieving state-of-the-art performance across multiple degradation tasks.

Conditional Information Bottleneck for Multimodal Fusion: Overcoming Shortcut Learning in Sarcasm Detection

This paper identifies three types of shortcut learning in multimodal sarcasm detection (character label bias, canned laughter label leakage, and sentiment inconsistency shortcuts), reconstructs a shortcut-free benchmark MUStARD++R, and proposes MCIB, a multimodal fusion framework based on the Conditional Information Bottleneck. MCIB achieves effective fusion by compressing redundancy in the primary modality while preserving complementary information from auxiliary modalities.

Browse all 75 Multimodal VLM papers →


🧠 VLM Reasoning (10)

AbductiveMLLM: Boosting Visual Abductive Reasoning Within MLLMs

Inspired by the dual-mode human cognitive process of verbal abduction and pictorial imagination, this paper proposes AbductiveMLLM, which enhances visual abductive reasoning in MLLMs via two collaborative components — a Reasoner (causal contrastive learning for hypothesis selection) and an Imaginer (diffusion-model-based pictorial reasoning) — achieving state-of-the-art performance on the VAR and YouCookII benchmarks.

AStar: Boosting Multimodal Reasoning with Automated Structured Thinking

This paper proposes AStar, a training-free multimodal reasoning paradigm that constructs a library of high-level "thought card" reasoning templates from 500 seed samples. At inference time, the most suitable templates are adaptively retrieved to guide structured reasoning in MLLMs. A 7B model achieves 53.9% accuracy on MathVerse (surpassing GPT-4o's 50.2%), requiring only 50 minutes of preprocessing and no model training.

Concept-RuleNet: Grounded Multi-Agent Neurosymbolic Reasoning in Vision Language Models

This paper proposes Concept-RuleNet, a three-agent collaborative neurosymbolic reasoning framework that conditions symbol generation and rule construction on visual concepts extracted from training images. It addresses the symbol hallucination and non-representativeness issues of existing methods (e.g., Symbol-LLM) that rely solely on class labels, achieving an average accuracy improvement of ~5% across 5 OOD benchmarks and reducing hallucinated symbols by up to 50%.

CrossVid: A Comprehensive Benchmark for Evaluating Cross-Video Reasoning in Multimodal Large Language Models

This paper introduces CrossVid, the first comprehensive benchmark for systematically evaluating the Cross-Video Reasoning (CVR) capabilities of multimodal large language models (MLLMs). CrossVid encompasses 10 tasks across 4 dimensions, 5,331 videos, and 9,015 QA pairs. Experiments reveal that the current best-performing model, Gemini-2.5-Pro, achieves only 50.4% accuracy, far below the human performance of 89.2%.

FinMMDocR: Benchmarking Financial Multimodal Reasoning with Scenario Awareness, Document Understanding, and Multi-Step Computation

This paper introduces FinMMDocR, a bilingual multimodal reasoning benchmark targeting real-world financial scenarios. It comprises 1,200 expert-annotated numerical reasoning questions spanning 12 implicit financial scenario types, 9 categories of long documents (averaging 50.8 pages), and reasoning chains averaging 11 steps. The strongest MLLM (o4-mini-high) achieves only 58% accuracy, exposing critical deficiencies of existing models in complex financial reasoning.

Graph-of-Mark: Promote Spatial Reasoning in Multimodal Language Models with Graph-Based Visual Prompting

This paper proposes Graph-of-Mark (GoM), a training-free pixel-level visual prompting method that explicitly encodes inter-object spatial relationships by overlaying a depth-aware scene graph (comprising nodes and directed edges) directly onto input images, improving zero-shot spatial reasoning accuracy for multimodal language models by up to 11 percentage points on VQA and grounding tasks.

Leveraging Textual Compositional Reasoning for Robust Change Captioning

This paper proposes CORTEX, a framework that introduces VLM-generated compositional reasoning text as explicit cues, combined with an Image-Text Dual Alignment (ITDA) module, to enhance purely visual change captioning methods in understanding structured semantics such as object relationships and spatial configurations.

SToLa: Self-Adaptive Touch-Language Framework with Tactile Commonsense Reasoning in Open-Ended Scenarios

SToLa is the first Mixture-of-Experts (MoE)-based touch-language framework, employing a dynamic routing mechanism to manage the modality gap between tactile and linguistic inputs. The work also introduces TactileBench, an open-ended tactile commonsense reasoning dataset covering 8 physical properties and 4 interaction characteristics. With only 7B parameters, SToLa achieves state-of-the-art performance on the PhysiCLeAR benchmark, surpassing the 13B Octopi model.

Tri-Bench: Stress-Testing VLM Reliability on Spatial Reasoning under Camera Tilt and Object Interference

Tri-Bench is a compact benchmark comprising 400 real-world photographs of triangles. By systematically controlling two factors — camera pose (planar vs. tilted) and object interference — it evaluates the spatial geometric reasoning capabilities of four leading VLMs. The results reveal that models default to 2D image-plane cues rather than genuine 3D geometry, even when explicit reference-frame guardrails are provided in the prompt, with accuracy on minority-class shapes dropping to near 0%.

Yes FLoReNce, I Will Do Better Next Time! Agentic Feedback Reasoning for Humorous Meme Detection

This paper proposes FLoReNce, a framework that models humorous meme understanding as a closed-loop control system. Through a feedback loop consisting of a Judge agent, a PID controller, and a non-parametric knowledge base, the system retrieves similar past experiences at inference time to modulate prompts, enabling a frozen VLM to perform adaptive reasoning without fine-tuning, substantially improving both prediction accuracy and explanation quality.


⚡ VLM Efficiency (5)

EM-KD: Distilling Efficient Multimodal Large Language Model with Unbalanced Vision Tokens

This paper proposes EM-KD, a distillation framework that leverages the Hungarian algorithm to address the vision token count imbalance between teacher and student models. By combining Vision Semantic Distillation (VSD) and Vision-Language Affinity Distillation (VLAD), EM-KD transfers knowledge from a vanilla teacher to an efficient student MLLM, achieving an average score of 50.4 across 11 benchmarks at 144 tokens/patch — surpassing LLaVA-NeXT with 576 tokens (49.4) while delivering nearly 2× inference speedup.

Filter, Correlate, Compress: Training-Free Token Reduction for MLLM Acceleration

This paper proposes FiCoCo, a three-stage framework (Filter–Correlate–Compress) that identifies redundant tokens via integrated vision-aware and semantic-aware redundancy metrics, adaptively recycles information from discarded tokens via inter-token correlation, and achieves training-free MLLM acceleration. On LLaVA-NeXT, FiCoCo achieves a 14.7× FLOPs reduction while retaining 93.6% of performance, and consistently outperforms FastV, SparseVLM, and other state-of-the-art methods across five MLLM architectures.

Global Compression Commander: Plug-and-Play Inference Acceleration for High-Resolution Large Vision-Language Models

This paper proposes GlobalCom², a plug-and-play, training-free token compression framework tailored for high-resolution VLMs with dynamic cropping architectures. It leverages the global thumbnail as a "commander" to guide differentiated compression across local crop regions, retaining more than 90% of original performance at a 90% visual token compression rate.

Rethinking Visual Token Reduction in LVLMs under Cross-Modal Misalignment

This paper identifies three forms of cross-modal misalignment (causal, semantic, and spatial) in text-guided visual token importance estimation within LVLMs, and proposes VisionDrop—a training-free progressive token pruning framework that relies exclusively on visual self-attention. The framework performs multi-stage compression across both the visual encoder and LLM decoder, retaining over 91% of original performance while keeping only 5.6% of tokens.

TinyChemVL: Advancing Chemical Vision-Language Models via Efficient Visual Token Reduction and Complex Reaction Tasks

TinyChemVL is a chemistry-domain VLM with only 4B parameters. It compresses visual tokens to 1/16 of the original count via an adaptive token merging and pruning strategy, introduces reaction-level tasks and the ChemRxn-V benchmark, and achieves state-of-the-art performance on both molecular- and reaction-level visual chemistry tasks while significantly improving inference and training speed.


🎵 Audio & Speech (31)

A Mind Cannot Be Smeared Across Time

This paper formally proves that whether a machine possesses consciousness depends not only on what is computed, but also on when it is computed. Systems executing strictly sequentially fail to satisfy the temporal co-instantiation condition required for the unity of consciousness; consequently, pure software consciousness on strictly sequential hardware is impossible.

DeepDebater: A Superpersuasive Autonomous Policy Debating System

This paper presents DeepDebater, the first autonomous multi-agent system capable of participating in and winning a complete American-style policy debate (eight speeches plus cross-examination). The system employs a hierarchical agent workflow to construct affirmative (Advantage) and negative (DA+CP+Kritik) arguments, leverages over 3 million evidence cards from OpenDebateEvidence for retrieval-augmented generation, and integrates GPT-4o TTS speech synthesis with EchoMimic digital avatar animation for end-to-end presentation. Expert evaluations show DeepDebater significantly outperforms human-authored cases across all metrics (Quality: 4.32 vs. 3.65), achieving an 85% win rate in simulated rounds.

AHAMask: Reliable Task Specification for Large Audio Language Models without Instructions

This paper shows that applying binary masks (AHAMask) over attention heads in the Transformer backbone of Large Audio Language Models (LALMs) reliably triggers specific acoustic task functionalities without any textual instructions, and that these masks reveal the existence of "acoustic functional pathways" within LALMs.

Aligning Generative Music AI with Human Preferences: Methods and Challenges

This survey/position paper systematically reviews three technical approaches to preference alignment in music generation—MusicRL (large-scale RLHF with ~300K preference pairs), DiffRhythm+ (multi-preference DPO for diffusion models), and Text2midi-InferAlign (inference-time tree search achieving +29.4% CLAP)—while providing an in-depth analysis of alignment challenges unique to the music domain (multi-scale temporal coherence, harmonic consistency, cultural subjectivity, and the evaluation paradox), and proposing a future research roadmap.

CCFQA: A Benchmark for Cross-Lingual and Cross-Modal Speech and Text Factuality Evaluation

This paper introduces CCFQA, the first cross-lingual and cross-modal factuality benchmark, covering 8 languages with 14,400 fully parallel speech-text factual QA samples. It supports four task settings (QA/XQA/SQA/XSQA) and systematically reveals factual inconsistencies in existing MLLMs under language and modality switching. The paper also proposes LLM-SQA, which uses English as a bridge language and needs only 5-shot examples to achieve cross-lingual spoken QA transfer, attaining an F1 of 51.4 on XSQA and surpassing GPT-4o-mini-Audio (45.7).

Characterizing AI Manipulation Risks in Brazilian YouTube Climate Discourse

Through a psycholinguistic framework, this work analyzes 226,775 Brazilian YouTube climate change videos and 2,756,165 comments, revealing that emotional and moral rhetoric significantly drives user engagement. It further demonstrates that fine-tuned LLMs can automatically generate high-engagement climate denial comments, warning of the potential risks of generative AI in public opinion manipulation.

Cross-Space Synergy: A Unified Framework for Multimodal Emotion Recognition in Conversation

This paper proposes the Cross-Space Synergy (CSS) framework, which simultaneously addresses two major challenges in multimodal emotion recognition in conversation (MERC)—insufficient fusion expressiveness and multi-objective gradient conflicts—via Synergistic Polynomial Fusion (SPF) in the representation space and a Pareto Gradient Modulator (PGM) in the gradient space.

DeformTrace: A Deformable State Space Model with Relay Tokens for Temporal Forgery Localization

This paper proposes DeformTrace, which introduces a deformable dynamic receptive field mechanism and relay token scheme into state space models, combining Transformer-level global modeling with SSM-level efficient inference to achieve state-of-the-art accuracy and substantial efficiency gains in temporal forgery localization.

Diff-V2M: A Hierarchical Conditional Diffusion Model with Explicit Rhythmic Modeling for Video-to-Music Generation

This paper proposes Diff-V2M, a hierarchical conditional diffusion Transformer framework for video-to-music generation that integrates affective, semantic, and rhythmic features via explicit rhythmic modeling (low-resolution ODF) and a hierarchical cross-attention mechanism, achieving state-of-the-art performance on both in-domain and out-of-domain datasets.

DiffA: Large Language Diffusion Models Can Listen and Understand

This paper proposes DIFFA — the first large audio-language model built upon a diffusion language model — which combines a frozen LLaDA-8B backbone with a lightweight dual-adapter architecture and a two-stage training pipeline. Using only 960 hours of ASR data and 127 hours of synthetic instruction data, DIFFA achieves competitive performance against autoregressive baselines on MMSU, MMAU, and VoiceBench.

Browse all 31 Audio & Speech papers →


🔎 AIGC Detection (2)

BAID: A Benchmark for Bias Assessment of AI Detectors

This paper introduces the BAID benchmark (208K sample pairs covering 7 bias dimensions and 41 subgroups) to systematically evaluate the fairness of 4 open-source AI text detectors across demographic and linguistic subgroups, revealing significant recall disparities for dialect, informal English, and minority group texts.

Optimized Algorithms for Text Clustering with LLM-Generated Constraints

This paper proposes the LSCK-HC framework, which leverages LLMs to generate set-form must-link/cannot-link constraints (as opposed to traditional pairwise constraints), coupled with a penalty-based local search clustering algorithm. The approach achieves clustering accuracy comparable to SOTA on five short-text datasets while reducing the number of LLM queries by more than 20×.
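To see why set-form constraints can be more compact than pairwise ones, the sketch below expands a single must-link set into the pairwise constraints it implies. The helper `expand_must_link` and the document IDs are hypothetical, and this is not the LSCK-HC algorithm itself; it only counts the pairs that one set-form answer would cover.

```python
from itertools import combinations


def expand_must_link(group):
    """Return every pairwise must-link constraint implied by one set."""
    return list(combinations(sorted(group), 2))


# One hypothetical set-form constraint over four short texts.
must_link_set = {"doc1", "doc2", "doc3", "doc4"}
pairs = expand_must_link(must_link_set)

# A set of size k implies k * (k - 1) / 2 pairwise constraints.
implied = len(pairs)
```

A set of four items already stands for six pairwise constraints, so under the assumption that one LLM query yields one set, fewer queries are needed than when asking about each pair separately.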


🧊 3D Vision (79)

3D-ANC: Adaptive Neural Collapse for Robust 3D Point Cloud Recognition

This paper introduces the Neural Collapse (NC) mechanism into adversarial robustness for 3D point cloud recognition. By replacing the classifier head with a fixed ETF structure and adopting an adaptive training framework (RBL + FDL) to construct a disentangled feature space, 3D-ANC improves the adversarial accuracy of DGCNN on ModelNet40 from 27.2% to 80.9%, surpassing the best baseline by 34 percentage points.

3D-Free Meets 3D Priors: Novel View Synthesis from a Single Image with Pretrained Diffusion Guidance

This paper proposes a framework that combines 3D-free methods (HawkI-style test-time optimization) with 3D-based priors (weak guidance images from Zero123++) to synthesize camera-controlled views at specified elevation/azimuth angles from a single image, requiring neither additional 3D data nor training. The approach comprehensively outperforms Zero123++, HawkI, and Stable Zero123 on LPIPS, CLIP-Score, and other metrics in complex scenes.

3DTeethSAM: Taming SAM2 for 3D Teeth Segmentation

This work adapts the SAM2 foundation model for 3D teeth segmentation by converting 3D meshes into 2D images via multi-view rendering and designing three lightweight adapters—a Prompt Embedding Generator, a Mask Refiner, and a Mask Classifier—along with a Deformable Global Attention Plugin (DGAP) to address automatic prompting, boundary refinement, and semantic classification. The proposed method achieves a new state-of-the-art T-mIoU of 91.90% on Teeth3DS.

4DSTR: Advancing Generative 4D Gaussians with Spatial-Temporal Rectification for High-Quality and Consistent 4D Generation

This paper proposes the 4DSTR framework, which significantly improves the spatial-temporal consistency of 4D Gaussian generation and its adaptability to rapid temporal changes through a Mamba-based temporal correlation rectification module (correcting Gaussian scale and rotation residuals) and a per-frame adaptive densification and pruning strategy.

Adapt-As-You-Walk Through the Clouds: Training-Free Online Test-Time Adaptation of 3D Vision-Language Foundation Models

This paper proposes Uni-Adapter, a training-free online test-time adaptation (TTA) framework for 3D vision-language foundation models (VLFMs). It addresses distribution shifts via clustering-based dynamic prototype caching and graph-regularized label smoothing, achieving state-of-the-art performance on multiple 3D corruption benchmarks.

AnchorDS: Anchoring Dynamic Sources for Semantically Consistent Text-to-3D Generation

This paper identifies a critical yet overlooked issue in SDS: the source distribution is dynamically evolving rather than static. AnchorDS is proposed to anchor the source distribution by feeding the current rendered image as an image condition into a dual-conditioned diffusion model, thereby resolving semantic over-smoothing and multi-view inconsistency in SDS. The method comprehensively outperforms SDS, VSD, and SDS-Bridge on T3Bench.

AnchorHOI: Zero-shot Generation of 4D Human-Object Interaction via Anchor-based Prior Distillation

This paper proposes AnchorHOI, which achieves zero-shot text-driven 4D human-object interaction (HOI) generation by introducing two intermediate bridges, anchor NeRF and anchor keypoints, to distill interaction priors and motion priors from image and video diffusion models, respectively. The method outperforms existing approaches on both static 3D and dynamic 4D HOI generation.

Arbitrary-Scale 3D Gaussian Super-Resolution

This paper proposes Arbi-3DGSR, an integrated framework that, for the first time, enables a single 3DGS model to support arbitrary-scale (including non-integer) high-resolution rendering through three core components: scale-aware rendering, generative prior-guided optimization, and progressive super-resolving. At ×5.7 scale, PSNR improves by 6.59 dB over vanilla 3DGS while maintaining real-time rendering at 85 FPS.

ASSIST-3D: Adapted Scene Synthesis for Class-Agnostic 3D Instance Segmentation

This paper proposes ASSIST-3D, a synthetic data pipeline that generates high-quality annotated data for class-agnostic 3D instance segmentation through three stages: heterogeneous object selection, LLM-guided scene layout generation, and realistic point cloud construction, significantly improving model generalization.

Can Protective Watermarking Safeguard the Copyright of 3D Gaussian Splatting?

This paper presents the first systematic study exposing the vulnerability of 3DGS watermarking frameworks, and proposes GSPure — a purification framework that leverages view-aware Gaussian weight accumulation and geometric feature clustering to precisely isolate and remove watermark-related Gaussian primitives, reducing watermark PSNR by up to 16.34 dB while incurring less than 1 dB loss in scene fidelity.

Browse all 79 3D Vision papers →


🎯 Object Detection (29)

AerialMind: Towards Referring Multi-Object Tracking in UAV Scenarios

This paper introduces AerialMind, the first large-scale Referring Multi-Object Tracking (RMOT) benchmark dataset for UAV scenarios, and proposes HawkEyeTrack (HETrack), a method that achieves language-guided multi-object tracking in aerial UAV scenes via a co-evolutionary fusion encoder and a scale-adaptive contextual refinement module.

An Overall Real-Time Mechanism for Classification and Quality Evaluation of Rice

This paper proposes a real-time overall mechanism for rice quality evaluation, integrating three modules: an improved YOLO-v5 (variety detection), an improved ConvNeXt-Tiny (intactness grading), and K-means (chalkiness region quantification). The system achieves 99.14% mAP and 97.89% detection accuracy on a self-constructed dataset of 20,000 images spanning six rice varieties.

AnoStyler: Text-Driven Localized Anomaly Generation via Lightweight Style Transfer

This work formulates zero-shot anomaly generation as a text-guided localized style transfer problem. A lightweight U-Net trained with CLIP-based losses stylizes masked regions of normal images into semantically aligned anomalous images. With only 263M total parameters (0.61M trainable), AnoStyler surpasses diffusion-based baselines on MVTec-AD and VisA while significantly improving downstream anomaly detection performance.

AquaSentinel: Next-Generation AI System Integrating Sensor Networks for Urban Underground Water Pipeline Anomaly Detection via Collaborative MoE-LLM Agent Architecture

This paper proposes AquaSentinel, a physics-informed AI system that achieves network-wide pipeline leak detection using only 20–30% node coverage through sparse sensor deployment, physics-augmented virtual sensors, a MoE spatiotemporal GNN ensemble, a dual-threshold RTCA detection algorithm, causal flow localization, and LLM-based report generation. The system achieves 100% detection rate across 110 leak scenarios.

Beyond Boundaries: Leveraging Vision Foundation Models for Source-Free Object Detection

This paper proposes a framework leveraging VFMs (DINOv2 + Grounding DINO) to enhance Source-Free Object Detection (SFOD) via three modules: Patch-weighted Global Feature Alignment (PGFA), Prototype-based Instance Feature Alignment (PIFA), and Dual-source Enhanced Pseudo-label Fusion (DEPF). The method achieves state-of-the-art results on 6 cross-domain detection benchmarks, e.g., 47.1% mAP on Cityscapes→Foggy Cityscapes (+3.5% over DRU) and 67.4% AP on Sim10k→Cityscapes (+8.7% over DRU).

CASL: Curvature-Augmented Self-supervised Learning for 3D Anomaly Detection

This work identifies point cloud curvature as a powerful cue for anomaly detection and proposes CASL, a curvature-augmented self-supervised learning framework. By guiding coordinate reconstruction with multi-scale curvature prompts, CASL learns generalizable 3D representations without any anomaly-detection-specific mechanisms, achieving a 5.6% O-AUROC improvement over the previous state of the art on Real3D-AD.

Commonality in Few: Few-Shot Multimodal Anomaly Detection via Hypergraph-Enhanced Memory

This paper proposes CIF, which leverages hypergraphs to extract intra-class structural commonalities from a small number of training samples, guiding memory bank construction and retrieval for few-shot multimodal industrial anomaly detection, achieving state-of-the-art performance.

Connecting the Dots: Training-Free Visual Grounding via Agentic Reasoning

This paper proposes GroundingAgent, a visual grounding framework that requires no task-specific fine-tuning. By composing pretrained open-vocabulary detectors (YOLO World), an MLLM (Llama-3.2-11B-Vision), and an LLM (DeepSeek-V3) into a structured iterative reasoning pipeline, the method achieves a zero-shot average accuracy of 65.1% on RefCOCO/+/g, substantially outperforming prior zero-shot approaches.

Correcting False Alarms from Unseen: Adapting Graph Anomaly Detectors at Test Time

This paper proposes TUNE, a plug-and-play test-time adaptation framework that addresses the "normality shift" problem in graph anomaly detection—caused by the emergence of new normal node categories—by transforming node features via a graph aligner. It leverages the degree of aggregation contamination as an unsupervised adaptation signal and significantly enhances the generalization of various pretrained GAD models across 10 real-world datasets.

CountSteer: Steering Attention for Object Counting in Diffusion Models

This paper proposes CountSteer, a training-free inference-time method that injects adaptive steering vectors into the cross-attention hidden states of diffusion models, improving object counting accuracy by approximately 4% without degrading image quality.

Browse all 29 Object Detection papers →


✂️ Segmentation (29)

A²LC: Active and Automated Label Correction for Semantic Segmentation

This paper proposes the A²LC framework, which augments conventional active label correction (ALC) — where annotators manually fix errors one by one — with an automated correction stage via a Label Correction Module (LCM). The LCM leverages annotator feedback to automatically rectify similar erroneous masks, while an Adaptively Balanced acquisition function (ABC) is designed to mitigate class imbalance. On Cityscapes, A²LC surpasses the previous SOTA using only 20% of the budget, achieving a 27.23% mIoU improvement under equal budget conditions.

Adaptive Morph-Patch Transformer for Aortic Vessel Segmentation

This paper proposes the Morph-Patch Transformer (MPT), which generates morphology-aware patches via a velocity-field-based adaptive patch partitioning strategy to preserve vascular topological integrity, and introduces Semantic Clustering Attention (SCA) to dynamically aggregate features from semantically similar patches. The method achieves state-of-the-art performance on three aortic segmentation benchmarks: AVT, AortaSeg24, and TBAD.

Breaking the Stealth-Potency Trade-off in Clean-Image Backdoors with Generative Trigger Optimization

This paper proposes Generative Clean-Image Backdoors (GCB), which employs a Conditional InfoGAN (C-InfoGAN) to automatically discover naturally occurring, task-irrelevant features within images as backdoor triggers. GCB achieves high attack success rates (ASR ≥ 90%) at extremely low poison rates (≤ 0.5%) with negligible degradation of clean accuracy (CA drop ≤ 1%), thereby becoming the first method to break the inherent stealth-potency trade-off in clean-image backdoor attacks.

Bridging Granularity Gaps: Hierarchical Semantic Learning for Cross-Domain Few-Shot Segmentation

This paper proposes the HSL framework, which addresses the segmentation granularity gap between source and target domains in cross-domain few-shot segmentation (CD-FSS) via three modules — Dual Style Randomization (DSR), Hierarchical Semantic Mining (HSM), and Prototype Confidence-modulated Thresholding (PCMT) — achieving state-of-the-art performance across four target-domain datasets.

Causal-Tune: Mining Causal Factors from Vision Foundation Models for Domain Generalized Semantic Segmentation

This paper proposes Causal-Tune, a causality-driven VFM fine-tuning strategy that decomposes VFM features into causal (domain-invariant) and non-causal (domain-specific) components via DCT frequency-domain transformation and Gaussian band-pass filtering. Learnable tokens are applied exclusively to the causal components for refinement, effectively suppressing VFM artifacts and improving generalization in domain generalized semantic segmentation.

CtrlFuse: Mask-Prompt Guided Controllable Infrared and Visible Image Fusion

This paper proposes CtrlFuse, which achieves interactive controllable infrared-visible image fusion by fine-tuning SAM with mask prompt guidance, simultaneously improving fusion quality and downstream segmentation/detection performance.

Do We Need Perfect Data? Leveraging Noise for Domain Generalized Segmentation

This paper proposes FLEX-Seg, a framework that reframes the inherent boundary misalignment between images and semantic masks in diffusion-model-synthesized data as an opportunity to learn robust representations. Through three modules—Granular Adaptive Prototypes (GAP), Uncertainty Boundary Emphasis (UBE), and Hardness-Aware Sampling (HAS)—FLEX-Seg achieves state-of-the-art performance on domain generalized semantic segmentation.

EAGLE: Episodic Appearance- and Geometry-Aware Memory for Unified 2D-3D Visual Query Localization

This paper proposes the EAGLE framework, inspired by avian memory consolidation mechanisms. A segmentation branch guided by an Appearance-aware Meta-learning Memory (AMM) and a tracking branch driven by a Geometry-aware Localization Memory (GLM) operate collaboratively. Combined with VGGT, the framework achieves efficient unified 2D-3D visual query localization, attaining state-of-the-art performance on the Ego4D-VQ benchmark.

Empowering DINO Representations for Underwater Instance Segmentation via Aligner and Prompter

This paper is the first to introduce DINOv2 into underwater instance segmentation. Through two adaptation modules—AquaStyle Aligner (Fourier frequency-domain style injection) and ObjectPrior Prompter (binary mask prior prompting)—the proposed DiveSeg achieves efficient domain adaptation and substantially outperforms SAM-based methods on the UIIS and USIS10K benchmarks with fewer parameters.

From Attribution to Action: Jointly ALIGNing Predictions and Explanations

This paper proposes the ALIGN framework, which jointly trains a learnable masker and a classifier through alternating optimization to iteratively align model attribution maps with task-relevant region masks, simultaneously improving prediction accuracy and interpretability. ALIGN outperforms six strong baselines on the VLCS and Terra Incognita domain generalization benchmarks.

Browse all 29 Segmentation papers →


🖼️ Image Restoration (10)

Blur-Robust Detection via Feature Restoration: An End-to-End Framework for Prior-Guided Infrared UAV Target Detection

This paper proposes JFD3, an end-to-end dual-branch framework that performs deblurring in the feature domain rather than the image domain, and leverages frequency structure priors to guide the detection network, achieving high-accuracy real-time infrared UAV target detection under motion blur conditions.

Clear Nights Ahead: Towards Multi-Weather Nighttime Image Restoration

This paper is the first to define and explore the multi-weather nighttime image restoration task. It constructs the AllWeatherNight dataset (8K training + 1K synthetic test + 1K real-world test) and proposes the ClearNight unified framework, which simultaneously removes compound degradations—haze, rain streaks, raindrops, snow, and flare—in a single stage via Retinex dual-prior guidance and weather-aware dynamic specificity–commonality collaboration. With only 2.84M parameters, ClearNight comprehensively surpasses state-of-the-art methods.

Depth-Synergized Mamba Meets Memory Experts for All-Day Image Reflection Separation

This paper proposes DMDNet, which employs a depth-aware scanning strategy (DAScan) to guide Mamba toward salient structures, incorporates a depth-synergized state space model (DS-SSM) to suppress ambiguous feature propagation, and introduces a memory expert compensation module (MECM) to leverage cross-image historical knowledge, achieving all-day (daytime + nighttime) image reflection separation.

ICLR: Inter-Chrominance and Luminance Interaction for Natural Color Restoration in Low-Light Image Enhancement

Targeting two overlooked statistical distribution issues in the HVI color space — large distribution discrepancy between chrominance and luminance branches leading to insufficient complementary feature extraction, and weak inter-chrominance correlation causing gradient conflicts — this paper proposes the ICLR framework. It introduces a Dual-stream Interaction Enhancement Module (DIEM) and a Covariance Correction Loss (CCL) to address these issues from the perspectives of fusion enhancement and statistical distribution optimization, respectively, achieving state-of-the-art performance on the LOL benchmark series.

MFmamba: A Multi-function Network for Panchromatic Image Resolution Restoration Based on State-Space Model

This paper proposes MFmamba, a multi-function network built upon a UNet++ backbone that integrates a Mamba Upsampling Block (MUB), Dual Pooling Attention (DPA), and a Multi-scale Hybrid Cross Block (MHCB). Using only panchromatic (PAN) images as input, the unified framework simultaneously supports three tasks: super-resolution, spectral restoration, and joint SR with colorization.

RefiDiff: Progressive Refinement Diffusion for Efficient Missing Data Imputation

RefiDiff is a four-stage framework (pre-processing → warm-up → diffusion → polishing) that, for the first time, progressively unifies the predictive and generative imputation paradigms. Combined with a Mamba-based denoising network, it achieves state-of-the-art performance across 9 datasets while running 4× faster than DIFFPUTER.

SD-PSFNet: Sequential and Dynamic Point Spread Function Network for Image Deraining

SD-PSFNet is a cascaded CNN-based deraining network driven by a dynamic PSF mechanism. It models the optical effects of raindrops via a multi-scale learnable PSF dictionary, combined with a sequential restoration architecture featuring adaptive gated fusion. The method achieves SOTA performance of 33.12 dB on Rain100H and 42.28 dB on RealRain-1k-L, yielding a cumulative gain of 5.04 dB (13.5%) over the baseline MPRNet.

SpatioTemporal Difference Network for Video Depth Super-Resolution

Motivated by the statistical observation that spatially non-smooth regions and temporally varying regions in video depth super-resolution (VDSR) follow long-tail distributions, this paper proposes STDNet. The method incorporates a spatial difference branch (learning spatial difference representations for intra-frame RGB-D adaptive aggregation) and a temporal difference branch (exploiting temporal difference representations for motion compensation in changing regions). On the TarTanAir dataset at ×16 super-resolution, RMSE is reduced from 112.04 cm to 96.80 cm, outperforming state-of-the-art methods by an average of 27.6%–32.6%.

Temporal Inconsistency Guidance for Super-resolution Video Quality Assessment

This paper proposes TIG-SVQA, a framework that, for the first time, incorporates temporal inconsistency as an explicit guidance signal for super-resolution video quality assessment. The framework introduces an Inconsistency-Highlighted Spatial Module (IHSM) and an Inconsistency-Guided Temporal Module (IGTM), achieving SRCC scores of 0.950, 0.942, and 0.939 on the SFD, MFD, and Combined-VSR datasets, respectively, surpassing all existing IQA/VQA methods.

TMDC: A Two-Stage Modality Denoising and Complementation Framework for Multimodal Sentiment Analysis

This paper proposes TMDC, a two-stage framework in which the first stage learns denoised modality-specific and modality-common representations on complete data, and the second stage leverages denoised representations from available modalities to reconstruct missing ones — marking the first joint treatment of noise and missing modalities in MSA.


🛰️ Remote Sensing (7)

Consistency-based Abductive Reasoning over Perceptual Errors of Multiple Pre-trained Models in Novel Environments

This paper models conflicting predictions from multiple pre-trained perception models in novel environments as a consistency-based abductive reasoning problem. Error detection rules and domain constraints for each model are encoded as logic programs, and an optimal hypothesis is sought that maximizes prediction coverage while keeping the inconsistency rate below a threshold. The approach achieves an average F1 improvement of 13.6% across 15 aerial test sets.

Debiasing Machine Learning Predictions for Causal Inference Without Additional Ground Truth Data

To address the attenuation of causal treatment effects caused by regression-to-the-mean in ML-based satellite poverty predictions, this paper proposes two post-processing correction methods that require no additional labeled data — Linear Calibration Correction (LCC) and Tweedie local unshrinking — enabling a single prediction map to be reused across multiple downstream causal studies (the "One Map, Many Trials" paradigm). Tweedie correction achieves near-unbiased treatment effect estimation on both simulated and real DHS data.

M3SR: Multi-Scale Multi-Perceptual Mamba for Efficient Spectral Reconstruction

This paper proposes M3SR, a Mamba-based multi-scale multi-perceptual architecture that integrates spatial, frequency, and spectral branches in parallel within a U-Net multi-scale structure. With only 2.17M parameters and 100.9G FLOPs, M3SR surpasses existing state-of-the-art methods on four spectral reconstruction benchmarks.

Machine Learning for Sustainable Rice Production: Region-Scale Monitoring of Water-Saving Practices in Punjab, India

This paper proposes a dimensional classification approach that decouples the recognition of water-saving rice practices into two independent binary classification tasks — a seeding dimension (DSR vs. PTR) and an irrigation dimension (AWD vs. CF). Using only Sentinel-1 SAR imagery, the method achieves seeding F1=0.80 and irrigation F1=0.74, and performs large-scale inference over 3 million+ parcels in Punjab, with district-level adoption rates strongly correlated with government statistics (Spearman ρ=0.69).

Perceive, Act and Correct: Confidence Is Not Enough for Hyperspectral Classification

This paper proposes the CABIN framework, which employs a closed-loop cognitive perceive–act–correct learning mechanism. By replacing naive confidence with epistemic uncertainty to guide sample selection and pseudo-label management in semi-supervised hyperspectral image classification, CABIN significantly outperforms fully supervised baselines while using only 75% of the labeled data.

TDCNet: Spatio-Temporal Context Learning with Temporal Difference Convolution for Moving IRSTD

This paper proposes TDCNet, which unifies temporal difference and 3D convolution into a single Temporal Difference Convolution (TDC). Through re-parameterization, TDC introduces zero additional inference overhead. Combined with TDC-guided spatio-temporal attention (TDCSTA), TDCNet achieves an F1 of 97.12% (AP50 93.83%) on the newly constructed IRSTD-UAV dataset, which contains 15,106 frames of real infrared UAV imagery.
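The zero-overhead claim rests on a general property of linear operators: a convolution branch and a temporal-difference branch can be folded into one kernel after training. The sketch below checks this on a 1D toy signal. It is a simplified illustration under assumed kernels (`w`, `v`) and made-up frame values, not TDCNet's actual 3D formulation.

```python
def conv_valid(signal, kernel):
    """1D convolution (kernel flipped), keeping only fully overlapping positions."""
    n, k = len(signal), len(kernel)
    return [
        sum(kernel[j] * signal[i + k - 1 - j] for j in range(k))
        for i in range(n - k + 1)
    ]


def conv_full(a, b):
    """Full 1D convolution, used here to compose two kernels."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out


frames = [1.0, 4.0, 2.0, 7.0, 3.0, 5.0]  # toy intensity of one pixel over time
w = [0.2, 0.5, 0.3]   # assumed ordinary temporal kernel
v = [1.0, -0.5]       # assumed kernel applied to frame differences
diff = [1.0, -1.0]    # frame differencing written as a convolution

# Training-time view: two branches, summed.
branch_conv = conv_valid(frames, w)
branch_diff = conv_valid(conv_valid(frames, diff), v)
two_branch = [a + b for a, b in zip(branch_conv, branch_diff)]

# Inference-time view: one merged kernel, one convolution.
merged_kernel = [a + b for a, b in zip(w, conv_full(diff, v))]
single_branch = conv_valid(frames, merged_kernel)
```

Both views give the same output, so the difference branch adds no cost at inference once its kernel has been merged.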

UniABG: Unified Adversarial View Bridging and Graph Correspondence for Unsupervised Cross-View Geo-Localization

This paper proposes UniABG, a two-stage unsupervised cross-view geo-localization framework that employs View-Aware Adversarial Bridging (VAAB) to eliminate the domain gap between UAV and satellite views, followed by Heterogeneous Graph Filtering Calibration (HGFC) to purify cross-view correspondences. UniABG achieves 93.29% Satellite→Drone AP on University-1652, surpassing most supervised methods.


🧑 Human Understanding (20)

AHAN: Asymmetric Hierarchical Attention Network for Identical Twin Face Verification

To address the extreme fine-grained recognition challenge of identical twin face verification, this paper proposes AHAN, a multi-stream architecture that performs multi-scale analysis of semantic facial regions via Hierarchical Cross-Attention (HCA), captures left-right facial asymmetry signatures through a Facial Asymmetry Attention Module (FAAM), and incorporates Twin-Aware Pair-Wise Cross-Attention (TA-PWCA) as a training regularizer. On the ND_TWIN dataset, AHAN improves twin verification accuracy from 88.9% to 92.3% (+3.4%).

CLIP-FTI: Fine-Grained Face Template Inversion via CLIP-Driven Attribute Conditioning

This paper presents the first approach to leverage CLIP-extracted fine-grained facial semantic attribute embeddings for Face Template Inversion (FTI). A cross-modal feature interaction network fuses leaked templates with attribute embeddings and projects them into the StyleGAN latent space, synthesizing identity-consistent face images with richer attribute details. The method surpasses state-of-the-art in recognition accuracy, attribute similarity, and cross-model attack transferability.

CoordAR: One-Reference 6D Pose Estimation of Novel Objects via Autoregressive Coordinate Map Generation

This paper proposes CoordAR, which formulates 3D-3D correspondence estimation in single-reference-view 6D pose estimation as an autoregressive generation problem over discrete tokens. Through coordinate map tokenization, modality-decoupled encoding, and an autoregressive Transformer decoder, CoordAR substantially outperforms existing single-view methods on multiple benchmarks and demonstrates strong robustness to challenging scenarios such as symmetry and occlusion.

Facial-R1: Aligning Reasoning and Recognition for Facial Emotion Analysis

This paper proposes Facial-R1, a three-stage alignment training framework (SFT → RL → Data Synthesis) that aligns the reasoning process of VLMs with emotion recognition outcomes by treating AU and emotion labels as verifiable reward signals. The framework achieves state-of-the-art performance on 8 benchmarks and introduces the FEA-20K dataset.

GazeInterpreter: Parsing Eye Gaze to Generate Eye-Body-Coordinated Narrations

This paper proposes GazeInterpreter, an LLM-based hierarchical framework that converts raw gaze signals into textual narrations via a symbolic gaze parser, integrates them with body motion narrations to produce eye-body-coordinated descriptions, and iteratively refines outputs through a self-correction loop, yielding significant improvements on downstream tasks including text-driven motion generation, action prediction, and behavior summarization.

Generating Attribute-Aware Human Motions from Textual Prompt

This paper proposes AttrMoGen, a framework that decouples action semantics from human attributes (age, gender, etc.) via a Structural Causal Model (SCM)-based Causal Information Bottleneck, enabling attribute-aware human motion generation from text prompts. The authors also introduce HumanAttr, the first large-scale text-motion dataset with extensive attribute annotations.

Improving Sparse IMU-based Motion Capture with Motion Label Smoothing

This paper proposes Motion Label Smoothing, adapting classical label smoothing from classification tasks to sparse IMU-based motion capture. By incorporating skeleton-structure-aware Perlin noise as smoothed labels, the method improves accuracy across three state-of-the-art methods on four datasets in a plug-and-play manner without modifying model architectures. GlobalPose achieves a 20.41% reduction in SIP error on TotalCapture.
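For reference, the classical label smoothing that the method adapts can be sketched as follows. This shows only the standard classification form with uniform smoothing; the skeleton-structure-aware Perlin noise described above is not reproduced, and `smooth_labels` and its `epsilon` value are illustrative.

```python
def smooth_labels(one_hot, epsilon=0.1):
    """Classical label smoothing: mix a one-hot target with a uniform distribution."""
    k = len(one_hot)
    return [(1.0 - epsilon) * y + epsilon / k for y in one_hot]


# A 4-class one-hot target with the true class at index 1.
smoothed = smooth_labels([0.0, 1.0, 0.0, 0.0], epsilon=0.1)
```

The smoothed target still sums to 1 but no longer puts all the mass on one class, which is the regularizing effect that the paper carries over to motion labels.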

KineST: A Kinematics-guided Spatiotemporal State Space Model for Human Motion Tracking from Sparse Signals

This paper proposes KineST, a kinematics-guided state space model that reconstructs whole-body motion from sparse HMD signals via a kinematic tree bidirectional scanning strategy and hybrid spatiotemporal representation learning, surpassing state-of-the-art methods in both accuracy and temporal consistency.

mmPred: Radar-based Human Motion Prediction in the Dark

This work is the first to introduce millimeter-wave radar into human motion prediction (HMP), proposing mmPred — a diffusion-based framework that employs dual-domain historical motion representations (time-domain pose refinement TPR + frequency-domain dominant motion FDM) and a Global Skeleton Transformer (GST) to effectively suppress radar-specific noise and temporal inconsistency, surpassing SOTA methods by 8.6% and 22% on the mmBody and mm-Fi datasets, respectively.

Modality-Aware Bias Mitigation and Invariance Learning for Unsupervised Visible-Infrared Person Re-Identification

To address the core problem of unreliable cross-modality associations in unsupervised visible-infrared person re-identification (USVI-ReID), this paper proposes modality-aware Jaccard distance correction and a "split-and-contrast" invariance learning strategy. By eliminating modality bias, the method enables reliable global cross-modality clustering and feature alignment, achieving state-of-the-art performance on SYSU-MM01 and RegDB.

Browse all 20 Human Understanding papers →


📹 Video Understanding (27)

APVR: Hour-Level Long Video Understanding with Adaptive Pivot Visual Information Retrieval

This paper proposes APVR, a training-free dual-granularity visual information retrieval framework. At the frame level, it iteratively retrieves keyframes (up to 1024) via query expansion and spatiotemporal semantic confidence scoring; at the token level, it compresses visual tokens through query-aware attention-driven selection. APVR overcomes memory limitations to process hour-long videos, achieving improvements of up to 9.5%, 4.6%, and 9.7% on LongVideoBench, VideoMME, and MLVU, respectively.

BAT: Learning Event-based Optical Flow with Bidirectional Adaptive Temporal Correlation

This paper proposes the Bidirectional Adaptive Temporal Correlation (BAT) framework, which converts temporally dense motion cues from event cameras into spatially dense cues, achieving high-accuracy event-based optical flow estimation and ranking first on the DSEC-Flow benchmark.

Causality Matters: How Temporal Information Emerges in Video Language Models

Through systematic ablation experiments, this work demonstrates that the temporal understanding capability of VideoLMs does not originate from positional encoding (PE), but rather emerges from the sequence sensitivity of causal attention masks. Temporal information is constructed layer by layer along a causal pathway of "inter-frame interaction → last-frame aggregation → query integration," based on which two lossless inference acceleration strategies are proposed.

EmoVid: A Multimodal Emotion Video Dataset for Emotion-Centric Video Understanding and Generation

This paper presents EmoVid, the first large-scale multimodal emotion video dataset targeting artistic and non-photorealistic content (22,758 video clips), spanning three content types—animation, film, and emoji stickers—and demonstrates the effectiveness of emotion-conditioned video generation by fine-tuning the Wan2.1 model, achieving significant improvements over baselines on emotion accuracy metrics.

Explicit Temporal-Semantic Modeling for Dense Video Captioning via Context-Aware Cross-Modal Interaction

This paper proposes the CACMI framework, which addresses two fundamental limitations in dense video captioning (insufficient temporal modeling and modality gap) through explicit temporal-semantic modeling. It employs Cross-modal Frame Aggregation (CFA) to extract temporally coherent event semantics, and Context-aware Feature Enhancement (CFE) to bridge the visual-textual modality gap, achieving state-of-the-art performance on ActivityNet Captions and YouCook2.

FineTec: Fine-Grained Action Recognition Under Temporal Corruption via Skeleton Decomposition and Sequence Completion

This paper proposes FineTec, a framework that achieves robust fine-grained skeleton-based action recognition under temporal corruption via three modules: context-aware sequence completion, bio-prior-guided skeleton spatial decomposition, and physics-driven acceleration modeling.

FineVAU: A Novel Human-Aligned Benchmark for Fine-Grained Video Anomaly Understanding

This paper proposes the FineVAU benchmark, which decomposes Video Anomaly Understanding (VAU) into three dimensions — Event (What), Entity (Who), and Location (Where) — introduces the FV-Score metric with high alignment to human perception, and constructs the FineW³ dataset via a fully automated LVLM-assisted pipeline. Experiments reveal critical shortcomings of current LVLMs in fine-grained anomalous event perception.

HeadHunt-VAD: Hunting Robust Anomaly-Sensitive Heads in MLLM for Tuning-Free Video Anomaly Detection

This paper proposes HeadHunt-VAD, which systematically identifies a sparse set of anomaly-sensitive and stable attention heads within a frozen MLLM, bypassing the information loss inherent in text-based outputs. Using a lightweight classifier, it achieves efficient tuning-free video anomaly detection, establishing state-of-the-art performance among tuning-free methods on UCF-Crime and XD-Violence.

Learning Time in Static Classifiers

This paper proposes the Support-Exemplar-Query (SEQ) learning framework, which injects temporal reasoning capabilities into standard feed-forward classifiers through loss function design rather than architectural modification. By aligning predicted sequences with class-level temporal prototypes via soft DTW, the method achieves consistent improvements on both fine-grained image classification and video anomaly detection.
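
As a rough illustration of the soft DTW alignment mentioned above, the following sketch computes the standard soft-DTW value between two scalar sequences. The squared-difference cost, the scalar inputs, and the function names are assumptions for illustration; the paper's prototypes and loss are not reproduced.

```python
import math


def soft_min(values, gamma):
    """Smooth minimum -gamma * log(sum(exp(-v / gamma))), computed stably."""
    m = min(values)
    return m - gamma * math.log(sum(math.exp(-(v - m) / gamma) for v in values))


def soft_dtw(x, y, gamma=1.0):
    """Soft-DTW value between two scalar sequences (squared-difference cost)."""
    n, m = len(x), len(y)
    # r[i][j] holds the soft alignment cost of x[:i] and y[:j].
    r = [[math.inf] * (m + 1) for _ in range(n + 1)]
    r[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (x[i - 1] - y[j - 1]) ** 2
            r[i][j] = cost + soft_min(
                (r[i - 1][j], r[i][j - 1], r[i - 1][j - 1]), gamma
            )
    return r[n][m]
```

Because the minimum is replaced by a smooth minimum, the value is differentiable in the inputs, which is what makes it usable as a training loss. As `gamma` shrinks, the value approaches ordinary DTW.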

Learning to Tell Apart: Weakly Supervised Video Anomaly Detection via Disentangled Semantic Alignment

This paper proposes DSANet, which enhances the discriminability between normal and anomalous features in weakly supervised video anomaly detection (WS-VAD) at two levels: coarse-grained self-guided normal pattern modeling (SG-NM) and fine-grained disentangled contrastive semantic alignment (DCSA). DSANet achieves state-of-the-art performance with 86.95% AP (+1.14%) on XD-Violence and 13.01% fine-grained mAP (+3.39%) on UCF-Crime.

Browse all 27 Video Understanding papers →


🚗 Autonomous Driving (56)

A Data-Driven Model Predictive Control Framework for Multi-Aircraft TMA Routing Under Travel Time Uncertainty

A closed-loop MPC framework is proposed for conflict-free multi-aircraft routing and scheduling within the 50 NM Terminal Maneuvering Area (TMA) of Changi Airport. The framework integrates XGBoost-based TMA boundary arrival time prediction, MILP optimization (incorporating route selection, speed adjustment, holding control, and separation constraints), and a receding-horizon simulator. Under peak congestion scenarios of 36 aircraft/hour, it achieves a 7× computational speedup while significantly outperforming the Dijkstra baseline in feasibility under Monte Carlo robustness validation.

AI-based Traffic Modeling for Network Security and Privacy: Challenges Ahead

A survey and position paper on AI-based traffic modeling for Network Security & Privacy (NetS&P) tasks. It systematically reviews AI approaches for anomaly detection, attack classification, IoT device identification, and website fingerprinting attacks, and provides an in-depth discussion of four frontier challenges: data quality, practical deployment, explainability, and foundation models.

Backdoor Attacks on Open Vocabulary Object Detectors via Multi-Modal Prompt Tuning

This paper presents the first study on backdoor attacks against open-vocabulary object detectors (OVODs), proposing TrAP (Trigger-Aware Prompt tuning), which jointly optimizes learnable prompts in both visual and textual branches alongside a learnable trigger to inject high-success-rate backdoors without modifying any model weights.

Beta Distribution Learning for Reliable Roadway Crash Risk Assessment

A geospatial deep learning framework based on Beta distribution learning is proposed, which leverages multi-scale satellite imagery to predict the full probability distribution of fatal crash risk (rather than point estimates), achieving 17–23% improvement in Recall while naturally expressing uncertainty through distribution shape.

CaTFormer: Causal Temporal Transformer with Dynamic Contextual Fusion for Driving Intention Prediction

CaTFormer is proposed to explicitly model causal interactions between driver behavior and environmental context via a causal temporal Transformer, achieving state-of-the-art performance of 98.6% F1 on the Brain4Cars dataset.

CompTrack: Information Bottleneck-Guided Low-Rank Dynamic Token Compression for Point Cloud Tracking

CompTrack is proposed as the first framework to simultaneously address dual redundancy in LiDAR point clouds. SFP filters background noise via information entropy analysis to resolve spatial redundancy, while IB-DTC estimates the effective rank via online SVD and adaptively determines the compression ratio, compressing the foreground into low-rank proxy tokens to resolve information redundancy. CompTrack achieves state-of-the-art performance on nuScenes (61.04% Success) at 90 FPS.

Debiased Dual-Invariant Defense for Adversarially Robust Person Re-Identification

This work systematically identifies two unique challenges in adversarial defense for person ReID — model bias and composite generalization requirements — and proposes a Debiased Dual-Invariant Defense framework. The data balancing stage employs a diffusion model for resampling to mitigate bias, while the dual adversarial self-meta defense stage achieves dual generalization to unseen IDs and unseen attacks via Farthest Negative Example Softening (FNES)-based metric adversarial training and adversarially-enhanced self-meta learning.

AdaptiveAD: Decoupling Scene Perception and Ego Status for End-to-End Autonomous Driving

This paper identifies the architectural root cause of ego-status over-reliance in end-to-end autonomous driving—namely, the premature fusion of ego status within the BEV encoder—and proposes AdaptiveAD, a dual-branch architecture consisting of a scene-driven branch (with ego status removed) and a self-driven branch that independently generate planning decisions. A scene-aware fusion module then adaptively integrates the two branches. Complemented by path attention, BEV unidirectional distillation, and an autoregressive online mapping auxiliary task, AdaptiveAD achieves state-of-the-art planning performance on nuScenes.

SAML: A Differentiable Semantic Meta-Learning Framework for Long-Tail Motion Prediction

SAML is proposed as the first framework to provide a differentiable semantic definition of "long-tailedness" in motion prediction — quantifying rarity via five intrinsic/interactive attributes, fusing them into a continuous Tail Index through a Bayesian Tail Perceiver, and driving MAML-based meta-learning adaptation. On the nuScenes worst-case top 1% subset, SAML achieves a minADE 17.2% lower than the second-best method.

Difficulty-Aware Label-Guided Denoising for Monocular 3D Object Detection

This paper proposes MonoDLGD, which provides explicit geometric supervision for monocular 3D detection by adaptively perturbing and reconstructing ground-truth labels according to instance-level detection difficulty, achieving state-of-the-art performance on KITTI.

Browse all 56 Autonomous Driving papers →


🤖 Robotics & Embodied AI (30)

10 Open Challenges Steering the Future of Vision-Language-Action Models

This paper systematically surveys 10 open challenges facing VLA models — multimodal perception, robust reasoning, high-quality training data, evaluation, cross-robot action generalization, resource efficiency, whole-body coordination, safety assurance, agent frameworks, and human-robot collaboration — and discusses four emerging trends: spatial understanding, world dynamics modeling, post-training, and data synthesis.

A Computable Game-Theoretic Framework for Multi-Agent Theory of Mind

This paper proposes a game-theoretic framework based on Poisson cognitive hierarchy, achieving computable multi-agent Theory of Mind via Gamma-Poisson conjugate Bayesian updates. The framework supports recursive bounded-rationality decision-making and online belief revision while avoiding the undecidability of POMDPs.
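
The Gamma-Poisson conjugate update named above has a closed form, which the sketch below illustrates. It assumes a Gamma prior in shape/rate form over a Poisson rate; treating the observations as integer counts (for example, inferred reasoning levels of another agent) is an assumption made here for illustration, not a detail confirmed by the summary.

```python
def gamma_poisson_update(alpha, beta, observations):
    """Posterior of a Gamma(alpha, beta) prior (shape, rate) on a Poisson rate.

    Conjugacy gives Gamma(alpha + sum(observations), beta + len(observations)).
    """
    return alpha + sum(observations), beta + len(observations)


def posterior_mean(alpha, beta):
    """Mean of a Gamma(alpha, beta) distribution in shape/rate form."""
    return alpha / beta
```

Because the update is a pair of additions, beliefs can be revised online after every observation at constant cost.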

Actor-Critic for Continuous Action Chunks: A Reinforcement Learning Framework for Long-Horizon Robotic Manipulation with Sparse Reward

AC3 proposes an actor-critic framework that directly learns continuous action sequences (action chunks), stabilizing long-horizon robotic manipulation under sparse rewards via an asymmetric actor update rule—updating the actor only from successful trajectories—and self-supervised anchor-based intrinsic rewards. The method achieves superior success rates over existing approaches across 25 tasks on BiGym and RLBench.

Affordance-Guided Coarse-to-Fine Exploration for Base Placement in Open-Vocabulary Mobile Manipulation

This paper addresses the base placement problem in open-vocabulary mobile manipulation (OVMM) and proposes a zero-shot framework that constructs a cross-modal representation (Affordance RGB + Obstacle Map+) to project semantic affordance cues onto an obstacle map, followed by a coarse-to-fine iterative optimization that balances semantic and geometric constraints. The method achieves an 85% success rate across five manipulation tasks, substantially outperforming both geometric planners and pure VLM-based approaches.

Continuous Vision-Language-Action Co-Learning with Semantic-Physical Alignment for Behavioral Cloning

This paper proposes the CCoL framework, which addresses both physical discontinuity in action sequences and semantic-physical misalignment in Behavioral Cloning through NeuralODE-driven Multimodal Continuous Co-learning (MCC) and bidirectional cross-attention-based Cross-modal Semantic-Physical Alignment (CSA). CCoL achieves an average relative improvement of 8.0% across three simulation platforms, with up to 19.2% on the bimanual insertion task.

Coordinated Humanoid Robot Locomotion with Symmetry Equivariant Reinforcement Learning Policy

This paper proposes SE-Policy, which directly embeds strict symmetry equivariance (actor) and symmetry invariance (critic) into the neural network architecture without additional hyperparameters, enabling humanoid robots to produce spatiotemporally coordinated natural locomotion. The velocity tracking error is reduced by 40% compared to DreamWaQ, and the policy is successfully deployed on a physical Unitree G1 robot.

Cross Modal Fine-Grained Alignment via Granularity-Aware and Region-Uncertain Modeling

This paper proposes GRM, a framework that achieves robust fine-grained image-text alignment through intra-modal saliency/granularity-aware adapters and Gaussian mixture-based region-level uncertainty modeling, attaining state-of-the-art performance on Flickr30K and MS-COCO.

Dexterous Manipulation Transfer via Progressive Kinematic-Dynamic Alignment

This paper proposes the PKDA framework, which automatically converts human hand manipulation videos into high-quality manipulation trajectories for multi-fingered dexterous hands via progressive kinematic-dynamic alignment, achieving an average transfer success rate of 73%.

Distributionally Robust Online Markov Game with Linear Function Approximation

This paper studies online distributionally robust Markov games with linear function approximation. It is the first to identify the hardness of learning in this setting, and proposes the DR-CCE-LSI algorithm, which achieves minimax-optimal sample complexity with respect to the feature dimension \(d\) under a specific feature mapping condition.

From Woofs to Words: Towards Intelligent Robotic Guide Dogs with Verbal Communication

This paper proposes a dialogue system for robotic guide dogs that leverages LLMs and a task planner to achieve Plan Verbalization and Scene Verbalization, supporting multi-turn natural language dialogue to assist visually impaired users in navigation decision-making. The system's effectiveness is validated through a real-user study and simulation experiments.

Browse all 30 Robotics & Embodied AI papers →


🎮 Reinforcement Learning (58)

A Course Correction in Steerability Evaluation: Revealing Miscalibration and Side Effects in LLMs

This paper proposes a multi-dimensional objective-space framework for evaluating LLM steerability, decomposing steering error into miscalibration and side effects (orthogonality). Experiments on text rewriting reveal that even the strongest LLMs produce severe side effects; prompt engineering proves ineffective, best-of-N sampling is prohibitively costly, and RL fine-tuning yields improvements but does not fully resolve the problem.

A Learning Framework For Cooperative Collision Avoidance of UAV Swarms Leveraging Domain Knowledge

This paper proposes reMARL, a framework that leverages domain knowledge from image processing (active contour model) to design reward functions for multi-agent reinforcement learning, enabling cooperative collision avoidance in UAV swarms. Compared to traditional metaheuristic methods, reMARL reduces reaction time by 98.75% and energy consumption by 85.37%.

A Multi-Agent Conversational Bandit Approach to Online Evaluation and Selection of User-Aligned LLM Responses

This paper proposes MACO, a multi-agent conversational bandit framework that achieves online evaluation and user preference alignment for LLM responses through a local-agent phase elimination mechanism and an adaptive preference query strategy on a cloud server, attaining a near-optimal regret bound of \(\tilde{O}(\sqrt{dMT})\).

Aligning Machiavellian Agents: Behavior Steering via Test-Time Policy Shaping

This paper proposes a test-time policy shaping method that interpolates and modifies the action probability distribution of pretrained RL agents at inference time using lightweight ethical attribute classifiers, enabling fine-grained behavioral steering across multiple ethical attributes without retraining.

Behaviour Policy Optimization: Provably Lower Variance Return Estimates for Off-Policy Reinforcement Learning

This paper proposes Behaviour Policy Optimization (BPO), which optimizes a dedicated behaviour policy for off-policy data collection such that the variance of return estimates is provably lower than on-policy collection, thereby improving the sample efficiency and stability of REINFORCE and PPO.

Beyond Monotonicity: Revisiting Factorization Principles in Multi-Agent Q-Learning

Through dynamical systems analysis, this paper proves that under approximate greedy exploration policies, all zero-loss solutions violating IGM consistency in non-monotonic value factorization Q-learning are unstable saddle points, while IGM-consistent solutions are stable attractors — enabling reliable convergence to optimal solutions without monotonicity constraints.

Beyond the Lower Bound: Bridging Regret Minimization and Best Arm Identification in Lexicographic Bandits

Two elimination-based algorithms, LexElim-Out and LexElim-In, are proposed to simultaneously address regret minimization (RM) and best arm identification (BAI) in lexicographic multi-objective bandits for the first time. LexElim-In breaks the known lower bound of single-objective problems through cross-objective information sharing.

Bi-Level Contextual Bandits for Individualized Resource Allocation under Delayed Feedback

This paper proposes MetaCUB — a bi-level contextual bandit framework for individualized resource allocation under delayed feedback, dynamic cohorts, cooldown constraints, and fairness requirements. The meta-level optimizes subgroup budget allocation to ensure fairness, while the base-level applies a UCB strategy to select the most promising individuals within each subgroup.
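
To make the base-level UCB step concrete, here is a minimal sketch of the classical UCB1 selection rule. It omits the paper's delayed feedback, cooldown constraints, contextual features, and meta-level budget allocation; the function name and the exploration constant `c` are illustrative assumptions.

```python
import math


def ucb1_select(counts, reward_sums, total_pulls, c=2.0):
    """Pick the arm with the highest upper confidence bound (classical UCB1).

    counts[a] is how often arm a was pulled, reward_sums[a] its summed reward.
    """
    # Pull every arm once before using the confidence bound.
    for arm, n in enumerate(counts):
        if n == 0:
            return arm
    scores = [
        reward_sums[a] / counts[a] + math.sqrt(c * math.log(total_pulls) / counts[a])
        for a in range(len(counts))
    ]
    return max(range(len(scores)), key=scores.__getitem__)
```

The bonus term shrinks as an arm is pulled more often, so rarely tried individuals keep getting explored until their estimates become reliable.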

ChartEditor: A Reinforcement Learning Framework for Robust Chart Editing

This paper introduces the ChartEditVista benchmark (7,964 samples, 31 chart types) and the ChartEditor model. By combining a GRPO reinforcement learning framework with a novel rendering reward, ChartEditor surpasses GPT-4o and several 72B-scale models on chart editing tasks using only 3B parameters.
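
The group-relative part of GRPO can be sketched as follows: rewards for several sampled responses to the same prompt are normalized within the group to obtain advantages. This sketch uses the population standard deviation and a small `eps` for stability, both of which are assumptions (implementations differ), and it does not model the paper's rendering reward.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption here
    return [(r - mean) / (std + eps) for r in rewards]
```

Responses that score above the group average receive positive advantages and the rest negative ones, so no separate value network is needed to form a baseline.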

CHDP: Cooperative Hybrid Diffusion Policies for RL in Parametric Environments

This paper models the hybrid action space problem as a fully cooperative two-agent game, employing discrete and continuous diffusion policies respectively to generate actions. Sequential updates and a Q-guided codebook are introduced to resolve policy conflicts and high-dimensional scalability issues, achieving up to a 19.3% improvement in success rate.

Browse all 58 Reinforcement Learning papers →


🎁 Recommender Systems (27)

Align³GR: Unified Multi-Level Alignment for LLM-based Generative Recommendation

This paper proposes Align³GR, a unified three-level alignment framework that systematically bridges the semantic-behavioral gap between LLMs and recommender systems at the token level (dual-side SCID), the behavior modeling level (multi-task SFT), and the preference level (progressive DPO).

AutoPP: Towards Automated Product Poster Generation and Optimization

This paper proposes AutoPP, the first pipeline to unify automated product poster generation with CTR-feedback-driven optimization in a single framework. It employs a unified design module to jointly design background, text, and layout; an element rendering module for efficient and controllable poster generation; and Isolated DPO (IDPO) to achieve element-level click-through rate optimization.

Behavior Tokens Speak Louder: Disentangled Explainable Recommendation with Behavior Vocabulary

This paper proposes BEAT, a framework that discretizes user/item behavior representations into interpretable behavior tokens via vector-quantized autoencoders, and aligns collaborative filtering signals to the semantic space of a frozen LLM through multi-level semantic supervision, enabling zero-shot explainable recommendation.

Bid Farewell to Seesaw: Towards Accurate Long-tail Session-based Recommendation via Dual Constraints of Hybrid Intents

This paper proposes the HID framework, which constructs hybrid intents via attribute-aware spectral clustering to distinguish session-relevant from session-irrelevant tail items, and introduces a dual-constraint loss (ICLoss) targeting both long-tail coverage and recommendation accuracy. The framework achieves a "win-win" between long-tail promotion and accuracy, breaking the traditional seesaw dilemma where improving one metric inevitably harms the other.

CroPS: Improving Dense Retrieval with Cross-Perspective Positive Samples in Short-Video Search

This paper proposes CroPS, a data engine that enriches positive sample sets from three complementary perspectives—query reformulation behavior, recommender system interactions, and LLM world knowledge—combined with Hierarchical Label Assignment (HLA) and the H-InfoNCE loss function, to break the filter bubble effect in industrial-scale dense retrieval systems. CroPS has been fully deployed in Kuaishou Search.

Evaluating LLMs for Police Decision-Making: A Framework Based on Police Action Scenarios

This paper proposes PAS (Police Action Scenarios), an LLM evaluation framework for policing contexts. The framework comprises five stages: scenario definition, reference answer construction, LLM response generation, core metric extraction, and performance interpretation. An evaluation dataset is constructed from 8,000+ official Korean police documents. The study finds that commercial LLMs (GPT-4, Gemini, Claude) perform significantly below reference answers on policing tasks, particularly in factual accuracy and logical correctness.

FreqRec: Exploiting Inter-Session Information with Frequency-enhanced Dual-Path Networks for Sequential Recommendation

This paper proposes FreqRec, a dual-path architecture that applies frequency-domain transformations along the batch axis and the time axis to capture group-level consumption rhythms across sessions and fine-grained individual user interests, respectively. A frequency-domain consistency loss is introduced to explicitly align predicted and ground-truth frequency spectra. FreqRec achieves up to 7.38% improvement in NDCG@10 over the strongest baseline on three Amazon datasets.

From IDs to Semantics: A Generative Framework for Cross-Domain Recommendation with Adaptive Semantic Tokenization

This paper proposes GenCDR, a framework that introduces the generative semantic ID paradigm into LLM-driven cross-domain recommendation for the first time, via two core modules: domain-adaptive semantic tokenization and cross-domain autoregressive recommendation. GenCDR effectively addresses the non-transferability of item IDs and insufficient domain-personalized modeling in conventional approaches.

Generalization Bounds for Semi-supervised Matrix Completion with Distributional Side Information

This paper proposes the first semi-supervised matrix completion learning paradigm. Assuming that the sampling distribution \(P\) and the ground-truth matrix \(G\) share a low-rank subspace, and given a large amount of unlabeled data (\(M\) samples) and a small amount of labeled data (\(N\) samples), it proves that the generalization error decomposes into two independent terms, \(\tilde{O}(\sqrt{nd/M}) + \tilde{O}(\sqrt{dr/N})\). The method achieves significant improvements over explicit-feedback-only baselines on the Douban and MovieLens datasets.

Hard vs. Noise: Resolving Hard-Noisy Sample Confusion in Recommender Systems via Large Language Models

This paper proposes the LLMHNI framework, which leverages two types of auxiliary signals generated by LLMs—semantic relevance and logical relevance—to resolve the confusion between hard samples and noisy samples in recommender systems, significantly improving denoising recommendation performance.

Browse all 27 Recommender Systems papers →


🔄 Self-Supervised Learning (16)

BCE3S: Binary Cross-Entropy Based Tripartite Synergistic Learning for Long-tailed Recognition

This paper proposes BCE3S, a binary cross-entropy (BCE)-based tripartite synergistic learning framework that integrates BCE-based joint learning, BCE-based contrastive learning, and BCE-based classifier uniformity learning. By decoupling per-class logits via Sigmoid, it suppresses the imbalance effects inherent to long-tailed distributions, achieving state-of-the-art performance on CIFAR10/100-LT, ImageNet-LT, and iNaturalist2018.

CATFormer: When Continual Learning Meets Spiking Transformers With Dynamic Thresholds

This paper proposes CATFormer, a data-replay-free continual learning framework built upon a spiking Vision Transformer, which achieves task-specific neuronal excitability modulation via context-adaptive dynamic firing thresholds. Over sequences of up to 100 tasks, the model not only avoids forgetting but actually improves in accuracy — a phenomenon the authors term "reverse forgetting."

Expandable and Differentiable Dual Memories with Orthogonal Regularization for Exemplar-free Continual Learning

This paper proposes EDD (Expandable and Differentiable Dual Memory), an exemplar-free continual learning method that decomposes data into reusable sub-features via differentiable shared and task-specific memories, combined with memory expansion-pruning and orthogonal regularization mechanisms. EDD surpasses 14 state-of-the-art methods on CIFAR-10/100 and Tiny-ImageNet, achieving final accuracies of 55.13%, 37.24%, and 30.11%, respectively.

Explanation-Preserving Augmentation for Semi-Supervised Graph Representation Learning

This paper proposes EPA-GRL (Explanation-Preserving Augmentation for Graph Representation Learning), which employs a GNN explainer trained with a small number of labels to identify semantic subgraphs (explanation subgraphs). During augmentation, only the non-semantic portions (marginal subgraphs) are perturbed, achieving semantics-preserving graph augmentation. EPA-GRL significantly outperforms semantics-agnostic random augmentation methods across 6 benchmarks.

FedGRPO: Privately Optimizing Foundation Models with Group-Relative Rewards from Domain Clients

This paper proposes FedGRPO, which reformulates foundation model optimization as a reward-based evaluation process. Through competence-aware expert selection and federated group-relative policy optimization (transmitting only scalar reward signals), FedGRPO achieves privacy-preserving, communication-efficient federated foundation model optimization, approaching or surpassing centralized GRPO on mathematical reasoning and question-answering tasks.

FineXtrol: Controllable Motion Generation via Fine-Grained Text

This paper proposes FineXtrol, a framework that leverages temporally annotated, fine-grained body-part text descriptions as control signals. By combining a dual-branch ControlNet architecture with hierarchical contrastive learning to enhance the discriminability of the text encoder, FineXtrol achieves efficient, user-friendly, and precise controllable human motion generation, significantly outperforming existing methods on multi-body-part control benchmarks on HumanML3D.

From Pretrain to Pain: Adversarial Vulnerability of Video Foundation Models without Finetuning

This paper proposes Transferable Video Attack (TVA), which generates adversarial perturbations solely by exploiting the embedding space of open-source Video Foundation Models (VFMs), without any knowledge of downstream tasks, and effectively attacks downstream models and multimodal LLMs across 24 video tasks.

GOAL: Geometrically Optimal Alignment for Continual Generalized Category Discovery

Grounded in Neural Collapse theory, this paper replaces dynamic classifiers with a fixed Equiangular Tight Frame (ETF) classifier and achieves continual generalized category discovery via supervised alignment and confidence-guided unsupervised alignment, reducing forgetting by 16.1% and improving novel category discovery by 3.2% across four benchmarks.
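
A fixed ETF classifier can be illustrated with the standard simplex ETF construction from the Neural Collapse literature. The sketch below builds K unit-norm class vectors in K dimensions with pairwise cosine similarity of -1/(K-1). It omits the orthogonal projection usually used to embed the frame into a feature space of a different dimension, and the function name is an illustrative assumption.

```python
import math


def simplex_etf(num_classes):
    """Return K class vectors forming a simplex Equiangular Tight Frame.

    Built as sqrt(K / (K - 1)) * (I - (1 / K) * ones), one vector per row.
    """
    k = num_classes
    if k < 2:
        raise ValueError("need at least two classes")
    scale = math.sqrt(k / (k - 1))
    return [
        [scale * ((1.0 if i == j else 0.0) - 1.0 / k) for j in range(k)]
        for i in range(k)
    ]
```

Every pair of class vectors is separated by the same maximal angle, so the targets stay fixed and equally spaced no matter which classes arrive later.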

HiLoMix: Robust High- and Low-Frequency Graph Learning Framework for Mixing Address Association

This paper proposes HiLoMix, a robust graph learning framework for the mixing address association task. It addresses three core challenges—graph sparsity, label scarcity, and label noise—through a Heterogeneous Attribute Mixing Interaction Graph (HAMIG), frequency-aware graph contrastive learning, and confidence-based label weighting supervision, respectively. HiLoMix surpasses the second-best baseline by 5.69%, 7.34%, and 15.61% on F1, AUC, and MRR.

Improving Region Representation Learning from Urban Imagery with Noisy Long-Caption Supervision

This paper proposes UrbanLN, a framework that improves urban region representation learning from LLM-generated captions via a long-caption-aware positional encoding interpolation strategy and a dual-level (data and model) noise suppression mechanism.

Browse all 16 Self-Supervised Learning papers →


📐 Optimization & Theory (21)

A Distributed Asynchronous Generalized Momentum Algorithm Without Delay Bounds

This paper proposes a totally asynchronous Generalized Momentum (GM) distributed optimization algorithm that guarantees linear convergence without assuming any upper bound on communication or computation delays. On a Fashion-MNIST classification task, the proposed method requires 71% fewer iterations than gradient descent, 41% fewer than Heavy Ball, and 19% fewer than Nesterov accelerated gradient.

A Unified Convergence Analysis for Semi-Decentralized Learning: Sampled-to-Sampled vs. Sampled-to-All Communication

This paper presents a unified convergence analysis framework that systematically compares, for the first time, two server-to-device communication primitives in semi-decentralized federated learning: S2S (returning the aggregated model only to sampled devices) and S2A (broadcasting to all devices). The analysis reveals distinct regimes: S2S is superior under high inter-component heterogeneity, whereas S2A is superior under low heterogeneity. It also provides practical guidelines for system configuration.

Beyond the Mean: Fisher-Orthogonal Projection for Natural Gradient Descent in Large Batch Training

This paper proposes Fisher-Orthogonal Projection (FOP), which supplements variance information by orthogonally projecting sub-batch gradient differences under the Fisher metric, enabling the second-order optimizer KFAC to remain effective in ultra-large batch training and achieving up to a 7.5× speedup.

Bridging Synthetic and Real Routing Problems via LLM-Guided Instance Generation and Progressive Adaptation

This paper proposes EvoReal, a framework that employs LLM-driven evolutionary search to generate synthetic VRP instances structurally aligned with real-world distributions, and then adapts pretrained neural solvers to real benchmarks via a two-stage progressive fine-tuning strategy. EvoReal substantially outperforms existing neural solvers on TSPLib (1.05% gap) and CVRPLib (2.71% gap).

Co-Layout: LLM-driven Co-optimization for Interior Layout

This paper proposes Co-Layout, a framework that leverages LLMs to extract structured constraints from natural language descriptions, then jointly optimizes room layout and furniture placement via a grid-based integer programming (IP) formulation augmented with a coarse-to-fine solving strategy, substantially outperforming existing two-stage approaches.

Convex Clustering Redefined: Robust Learning with the Median of Means Estimator

This paper integrates the Median of Means (MoM) estimator into the convex clustering framework, proposing the COMET algorithm. By combining random binning with median aggregation, COMET achieves robustness to noise and outliers without requiring prior knowledge of the number of clusters \(k\). Weak consistency is established theoretically, and experiments on multiple real-world datasets demonstrate substantial improvements over six baselines, including k-means, MoM k-means, and convex clustering.
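
The "random binning with median aggregation" step can be illustrated with the basic scalar Median of Means estimator. This is a minimal sketch of the generic estimator only; the function name, the `seed` argument, and the toy data are assumptions, and COMET's use of MoM inside the convex clustering objective is not reproduced here.

```python
import random
import statistics


def median_of_means(values, num_bins, seed=0):
    """Shuffle values into num_bins near-equal bins, average each bin,
    and return the median of the bin means."""
    if not 1 <= num_bins <= len(values):
        raise ValueError("num_bins must be between 1 and len(values)")
    rng = random.Random(seed)
    shuffled = list(values)
    rng.shuffle(shuffled)                       # random binning
    bins = [shuffled[i::num_bins] for i in range(num_bins)]
    bin_means = [statistics.fmean(b) for b in bins]
    return statistics.median(bin_means)         # median aggregation


# Twenty clean points plus one gross outlier.
data = [1.0] * 20 + [1000.0]
robust = median_of_means(data, num_bins=7)
plain = statistics.fmean(data)
```

The single outlier can contaminate only one of the seven bins, so the median of the bin means stays at the clean value while the plain mean is pulled far away from it.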

Cost-Minimized Label-Flipping Poisoning Attack to LLM Alignment

This work provides the first theoretical analysis of the minimum cost required to steer an LLM policy toward an attacker's target by flipping preference labels during RLHF/DPO alignment. The problem is formalized as a convex optimization problem, upper and lower bounds on the cost are derived, and a post-processing method called PCM (Poisoning Cost Minimization) is proposed to substantially reduce the number of label flips while preserving the poisoning effect.

Data Heterogeneity and Forgotten Labels in Split Federated Learning

This paper systematically investigates catastrophic forgetting (CF) caused by data heterogeneity in Split Federated Learning, with particular focus on intra-round forgetting induced by the server-side processing order. It proposes Hydra, a multi-head method that partitions and trains the final layers of part-2 in groups before aggregation, reducing the performance gap (PG) across labels by up to 75.4%.

ECPv2: Fast, Efficient, and Scalable Global Optimization of Lipschitz Functions

This paper proposes ECPv2, which introduces three innovations—adaptive lower bound, Worst-\(m\) memory, and fixed random projection—to reduce the per-run complexity of Lipschitz global optimization from \(\Omega(n^2 d)\) to \(\Omega(n(m+d)\log n)\), while maintaining an \(O(n^{-1/d})\) regret convergence rate that matches the minimax lower bound.

Efficient and Reliable Hitting-Set Computations for the Implicit Hitting Set Approach

To address numerical instability arising from the reliance on commercial IP solvers in the hitting-set component of the implicit hitting set (IHS) framework, this paper proposes alternative approaches based on pseudo-Boolean (PB) reasoning and stochastic local search (SLS), as well as hybrid strategies. The work realizes the first certifiable IHS computation and demonstrates effective trade-offs between efficiency and reliability across 1,786 benchmark instances.

Browse all 21 Optimization & Theory papers →


📐 Learning Theory (3)

A Switching Framework for Online Interval Scheduling with Predictions

For the irrevocable online interval scheduling problem, this paper proposes the SemiTrust-and-Switch framework and the SmoothMerge randomized algorithm. By switching between or blending a prediction-trusting strategy and a classical greedy algorithm, the approach achieves near-optimal performance when predictions are accurate (consistency) and degrades gracefully when predictions are erroneous (robustness and smoothness). Tightness of the framework on specific instances is also established.

Generalizing Analogical Inference from Boolean to Continuous Domains

This paper revisits the theoretical foundations of analogical inference: it first constructs a counterexample demonstrating the failure of classical generalization bounds in the Boolean domain, then proposes a unified analogical inference framework based on parameterized generalized means, extending discrete classification to continuous regression domains.

Streaming Generated Gaussian Process Experts for Online Learning and Control: Extended Version

This paper proposes SkyGP (Streaming Kernel-induced Progressively Generated Expert GP), which handles streaming data via kernel-distance-driven progressive expert generation and time-aware configurable aggregation, inheriting the learning guarantees of exact GP while maintaining bounded computational complexity. SkyGP comprehensively outperforms state-of-the-art methods on both benchmark regression tasks and real-time control experiments.


🔗 Causal Inference (7)

Causal Inference Under Threshold Manipulation: Bayesian Mixture Modeling and Heterogeneous Treatment Effects

This paper proposes the BMTM/HBMTM Bayesian mixture model framework. In scenarios where consumers strategically manipulate spending to reach reward thresholds, the framework decomposes the observed distribution into bunching and non-bunching sub-distributions to accurately estimate threshold causal effects and heterogeneous treatment effects across subgroups.

CaDyT: Causal Structure Learning for Dynamical Systems with Theoretical Score Analysis

This paper proposes CaDyT, which combines Gaussian process-based continuous-time dynamics modeling (via Adams-Bashforth integrators for exact inference) with the Minimum Description Length (MDL) principle for structure search. The method simultaneously addresses irregular sampling and causal structure identification, substantially outperforming all baselines on double-mass spring, diamond graph, and Rössler oscillator benchmarks (AUPRC 0.79 vs. runner-up 0.39).

From Theory of Mind to Theory of Environment: Counterfactual Simulation of Latent Environmental Dynamics

This paper proposes the concept of "Theory of Environment" (ToE), arguing that humans may infer latent environmental dynamics through computational mechanisms shared with Theory of Mind (ToM), thereby expanding the dimensionality of motor exploration and facilitating behavioral innovation.

I-CAM-UV: Integrating Causal Graphs over Non-Identical Variable Sets Using Causal Additive Models with Unobserved Variables

This paper proposes I-CAM-UV, a method that enumerates consistent DAGs satisfying structural constraints derived from multiple CAM-UV outputs over non-identical variable sets, recovering causal relations lost due to unobserved variables, and introduces an optimal-first search algorithm exploiting cost monotonicity for efficient combinatorial search.

KTCF: Actionable Recourse in Knowledge Tracing via Counterfactual Explanations for Education

This paper proposes KTCF, a counterfactual explanation generation method for Knowledge Tracing (KT) that leverages inter-concept relationships to produce sparse and actionable counterfactuals, subsequently post-processed into sequentially ordered instructional recommendations. KTCF comprehensively outperforms baseline methods across validity, sparsity, and actionability metrics.

Learning Subgroups with Maximum Treatment Effects without Causal Heuristics

Under the SCM framework, the paper proves that the subgroup with maximum treatment effect must exhibit homogeneous pointwise effects (Theorem 1); under the partition model assumption, it proves that optimal subgroup discovery reduces to standard supervised learning (Theorem 2), achievable via CART with the Gini index. On 77 ACIC-2016 semi-synthetic datasets, the proposed method achieves a mean treatment effect of 10.54 (vs. 7.84 for the runner-up), ranking first on 51.9% of datasets.

Sparse Additive Model Pruning for Order-Based Causal Structure Learning

This paper proposes SARTRE, a framework that employs randomized tree embeddings and group-sparse regression to learn sparse additive models, replacing the hypothesis-testing-based redundant edge pruning in CAM-pruning for order-based causal structure learning. SARTRE achieves significant speedups without sacrificing accuracy.


🔬 Interpretability (37)

A Coherence-Based Measure of AGI

This paper identifies that existing AGI scores rely on arithmetic averaging, which implicitly encodes a "compensatory" assumption (strengths offsetting weaknesses), and proposes \(\text{AGI}_{\text{AUC}}\)—a coherence measure based on the continuous spectrum of generalized means. By integrating over the compensability parameter \(p \in [-1, 1]\), the metric penalizes uneven capability profiles and exposes bottlenecks concealed by arithmetic averaging.
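
The effect of integrating over the compensability parameter can be sketched with ordinary power means. In the sketch below, the trapezoidal integration, the division by the interval length, and the function names are assumptions rather than the paper's exact definition of \(\text{AGI}_{\text{AUC}}\); scores are assumed to be strictly positive.

```python
import math


def power_mean(scores, p):
    """Generalized (power) mean of positive scores; p = 0 is the geometric-mean limit."""
    n = len(scores)
    if abs(p) < 1e-12:
        return math.exp(sum(math.log(s) for s in scores) / n)
    return (sum(s ** p for s in scores) / n) ** (1.0 / p)


def coherence_auc(scores, p_min=-1.0, p_max=1.0, steps=200):
    """Average of the power mean over p in [p_min, p_max] (trapezoidal rule).
    The normalization by interval length is an assumption of this sketch."""
    h = (p_max - p_min) / steps
    vals = [power_mean(scores, p_min + i * h) for i in range(steps + 1)]
    area = h * (sum(vals) - 0.5 * (vals[0] + vals[-1]))
    return area / (p_max - p_min)


balanced = [0.5, 0.5, 0.5]
uneven = [0.9, 0.9, 0.1]        # higher arithmetic mean, one weak capability
auc_balanced = coherence_auc(balanced)
auc_uneven = coherence_auc(uneven)
```

Because the power mean is non-decreasing in \(p\), the averaged value for the uneven profile falls below its arithmetic mean, while a perfectly balanced profile is left unchanged.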

Adaptive Evidential Learning for Temporal-Semantic Robustness in Moment Retrieval

This paper proposes DEMR, a framework that introduces Deep Evidential Regression (DER) into video moment retrieval. It mitigates modal imbalance via a Reflective Flipped Fusion (RFF) module and corrects the counter-intuitive uncertainty estimation bias in vanilla DER via a Geom-regularizer, achieving significant improvements on both standard and debiased benchmarks.

Attention as Binding: A Vector-Symbolic Perspective on Transformer Reasoning

This paper proposes reinterpreting the Transformer self-attention mechanism as a soft binding/unbinding operator in Vector Symbolic Architectures (VSA) — where Query/Key define a role space, Value encodes fillers, attention weights implement differentiable unbinding, and residual connections implement superposition — thereby providing an algebraic perspective that unifies explanations of LLM capability and fragility in symbolic reasoning. The paper further proposes VSA-inspired architectural improvements such as explicit binding heads and hyperdimensional memory layers.

Attention Gathers, MLPs Compose: A Causal Analysis of an Action-Outcome Circuit in VideoViT

This work applies mechanistic interpretability to reverse-engineer the internal circuits of a Video Vision Transformer (ViViT), revealing a functional division of labor in which attention heads are responsible for "gathering evidence" and MLP modules for "composing concepts." The analysis demonstrates that the model develops semantic knowledge beyond its training objective even on simple classification tasks.

Can LLMs Truly Embody Human Personality? Analyzing AI and Human Behavior Alignment in Dispute Resolution

This paper proposes the first systematic comparative framework that directly contrasts strategic behavioral differences between humans and personality-prompted LLMs in paired dispute mediation scenarios, finding significant divergence in personality-behavior mapping and challenging the assumption that personality prompting can serve as a proxy for human behavior.

Concepts from Representations: Post-hoc Concept Bottleneck Models via Sparse Decomposition of Visual Representations

This paper proposes PCBM-ReD, a post-hoc concept bottleneck model that automatically extracts concepts from pretrained visual encoders via sparse autoencoders, annotates and filters them using MLLMs, and selects a representative subset through reconstruction-guided search. Image representations are then sparsely decomposed into linear combinations of concept embeddings via CLIP's vision-language alignment. The method achieves state-of-the-art accuracy on 11 classification benchmarks while maintaining interpretability.

CrossCheck-Bench: Diagnosing Compositional Failures in Multimodal Conflict Resolution

CrossCheck-Bench is a three-level hierarchical benchmark comprising 15k adversarial QA samples. It diagnoses compositional reasoning failures of VLMs in multimodal conflict resolution via 7 atomic capabilities and 15 tasks, revealing systematic performance degradation from perception (L1) to reasoning (L3) and exposing the limitations of conventional prompting strategies.

Data Whitening Improves Sparse Autoencoder Learning

This paper introduces PCA whitening — a standard preprocessing step from classical sparse coding — into modern sparse autoencoder (SAE) training. Through theoretical analysis and simulation, it demonstrates that whitening renders the optimization landscape more convex and isotropic. Experiments on SAEBench show that whitening substantially improves interpretability metrics (Sparse Probing +7.3%, SCR +54%, TPP +372%), albeit with a slight decrease in reconstruction quality.

Distribution-Based Feature Attribution for Explaining the Predictions of Any Classifier

This paper proposes DFAX, the first distribution-based feature attribution method, which quantifies feature importance by comparing the conditional probability density of a target instance under the target class versus non-target classes. It provides the first formal definition of feature attribution, and demonstrates significant improvements over SHAP/LIME and other baselines across 10 datasets while being orders of magnitude faster.

DR.Experts: Differential Refinement of Distortion-Aware Experts for Blind Image Quality Assessment

This paper proposes the DR.Experts framework, which leverages DA-CLIP to obtain distortion-type priors, employs a Distortion Saliency Differential Module (DSDM) to disentangle distortion attention from semantic attention and thereby purify distortion features, and then applies a Dynamic Distortion Weighting Module (DDWM) to adaptively weight each distortion type's features according to its perceptual impact. The method achieves state-of-the-art performance on five BIQA benchmarks.

Browse all 37 Interpretability papers →


📦 Model Compression (60)

A Closer Look at Knowledge Distillation in Spiking Neural Network Training

To address the overlooked distribution mismatch between teacher ANN continuous features/logits and student SNN discrete sparse spike features/logits in ANN→SNN knowledge distillation, this paper proposes the CKDSNN framework based on Saliency-scaled Activation Map Distillation (SAMD) and Noise-smoothed Logits Distillation (NLD), achieving new state-of-the-art SNN training performance on CIFAR-10/100, ImageNet-1K, and CIFAR10-DVS.

AdaFuse: Accelerating Dynamic Adapter Inference via Token-Level Pre-Gating and Fused Kernel Optimization

To address the severe inference latency overhead (250%–950%) of dynamic MoE-LoRA adapters, this paper proposes a token-level pre-gating architecture that performs a single global routing decision at the first layer. Combined with a custom SGMM fused CUDA kernel that merges all activated LoRA adapters into the backbone in one shot, the approach reduces decoding latency by 2.4× while preserving model accuracy.

Asymmetric Cross-Modal Knowledge Distillation: Bridging Modalities with Weak Semantic Consistency

This paper proposes a novel paradigm termed Asymmetric Cross-modal Knowledge Distillation (ACKD), realized through the SemBridge framework — comprising two plug-and-play modules, namely self-supervised semantic matching and optimal transport alignment — to enable cross-modal knowledge distillation under weak semantic consistency. This allows multispectral (MS) images collected from different geographic regions to effectively guide RGB-based remote sensing scene classification.

BD-Net: Has Depth-Wise Convolution Ever Been Applied in Binary Neural Networks?

This paper proposes BD-Net, which for the first time successfully integrates depth-wise convolution (DWConv) into binary neural networks (BNNs) by introducing 1.58-bit convolution and pre-BN residual connections. BD-Net achieves a new state of the art in the BNN domain on ImageNet with an extremely low computational cost of 33M OPs, with accuracy improvements of up to 9.3 percentage points across multiple datasets.

Beyond Sharpness: A Flatness Decomposition Framework for Efficient Continual Learning

This paper proposes FLAD, a framework that decomposes the sharpness-aware perturbation direction into a gradient-aligned component and a stochastic-noise component, retaining only the noise component for regularization. By combining zeroth-order and first-order sharpness, FLAD improves generalization in continual learning with minimal additional computational overhead.

CAMERA: Multi-Matrix Joint Compression for MoE Models via Micro-Expert Redundancy Analysis

This paper introduces the concept of "micro-expert" to decompose MoE layer outputs as cross-matrix (up/gate/down_proj) linear combinations, enabling structured pruning (Camera-P) and mixed-precision quantization (Camera-Q) based on energy ranking. On Deepseek-MoE-16B, Qwen2-57B, and Qwen3-30B at 20%–60% sparsity, the method comprehensively outperforms NAEE and D²-MoE; analysis of Qwen2-57B requires less than 5 minutes on a single A100 GPU.

Can You Tell the Difference? Contrastive Explanations for ABox Entailments

This paper proposes a formal framework for Contrastive ABox Explanations (CE) to answer questions of the form "Why is \(a\) an instance of \(C\) but \(b\) is not?", simultaneously accounting for positive entailments and missing entailments within Description Logic knowledge bases, and analyzes the computational complexity under different description logics and optimization criteria.

Compensating Distribution Drifts in Class-incremental Learning of Pre-trained Vision Transformers

This paper proposes Sequential Learning with Drift Compensation (SLDC), which learns latent space transformation operators (linear / weakly nonlinear) to compensate for distribution drifts induced by sequential fine-tuning of pre-trained ViTs in class-incremental learning. Combined with knowledge distillation, the approach achieves performance close to the joint-training upper bound.

Condensed Data Expansion Using Model Inversion for Knowledge Distillation

This paper proposes using condensed datasets as prototypes to guide the model inversion (MI) process. A feature-alignment discriminator enforces distributional consistency between synthesized data and condensed samples, thereby expanding the condensed dataset for knowledge distillation. The method achieves up to 11.4% improvement over standard MI-based distillation on CIFAR/ImageNet.

Consensus-Aligned Neuron Efficient Fine-Tuning Large Language Models for Multi-Domain Machine Translation

This paper proposes CANEFT, which uses mutual information (MI) to identify consensus-aligned neurons in LLMs that are consistently important across domains, and fine-tunes only these neurons to achieve efficient adaptation for multi-domain machine translation (MDMT). CANEFT outperforms PEFT baselines such as LoRA across 3 LLMs and 10 translation domains without introducing any additional parameters.

Browse all 60 Model Compression papers →


🕸️ Graph Learning (37)

Adaptive Initial Residual Connections for GNNs with Theoretical Guarantees

This paper proposes Adaptive Initial Residual Connections (Adaptive IRC), which allows each node to have a personalized residual strength learned from its initial features. It provides the first theoretical proof of a positive lower bound on the Dirichlet energy of initial residual connections with activation functions (guaranteeing the absence of over-smoothing), and introduces a PageRank-based heuristic variant that achieves comparable or superior performance without learning additional parameters.

Adaptive Riemannian Graph Neural Networks

This paper proposes ARGNN, a framework that learns a continuous, anisotropic diagonal Riemannian metric tensor for each node in a graph, enabling adaptive capture of local geometric properties across different graph regions (hierarchical structures vs. dense communities). ARGNN unifies and outperforms geometric GNN methods based on fixed curvature or discrete mixed-curvature spaces.

Are Graph Transformers Necessary? Efficient Long-Range Message Passing with Fractal Nodes in MPNNs

This paper proposes Fractal Nodes (FN) to enhance long-range message passing in MPNNs. Subgraph-level aggregation nodes are generated via METIS graph partitioning, combined with low-pass and high-pass filters (LPF+HPF) and a learnable frequency parameter \(\omega\). MLP-Mixer is adopted for cross-subgraph communication. The approach achieves \(O(L(|V|+|E|))\) linear complexity while matching or surpassing Graph Transformer performance, earning an AAAI Oral.

Assessing LLMs for Serendipity Discovery in Knowledge Graphs: A Case for Drug Repurposing

This paper proposes SerenQA, the first framework to formally define the serendipity discovery task in knowledge graph question answering. It introduces an information-theoretic RNS metric, an expert-annotated drug repurposing benchmark dataset, and a three-stage LLM evaluation pipeline. The work reveals that current LLMs perform reasonably on retrieval tasks but have substantial room for improvement in serendipitous exploration.

Beyond Fact Retrieval: Episodic Memory for RAG with Generative Semantic Workspaces

This paper proposes the Generative Semantic Workspace (GSW), a neuroscience-inspired generative memory framework that constructs structured episodic memory representations for LLMs, achieving an F1 of 0.85 on EpBench while reducing query-time context tokens by 51%.

Beyond Fixed Depth: Adaptive Graph Neural Networks for Node Classification Under Varying Homophily

This paper proposes AD-GNN, which theoretically analyzes node-level homophily/heterophily characteristics and adaptively assigns different aggregation depths to individual nodes, enabling unified handling of node classification on both homophilic and heterophilic graphs within a single framework.

BugSweeper: Function-Level Detection of Smart Contract Vulnerabilities Using Graph Neural Networks

This paper proposes BugSweeper, which constructs function-level abstract syntax graphs (FLAG) and designs a two-stage GNN architecture to enable end-to-end smart contract vulnerability detection without expert-defined rules, achieving an F1 of 98.57% on reentrancy attack detection.

EchoLess: Label-Based Pre-Computation for Memory-Efficient Heterogeneous Graph Learning

Echoless-LP eliminates training label leakage (the echo effect) caused by multi-hop message passing in label pre-computation via Partition-Focused Echoless Propagation (PFEP). Combined with an Asymmetric Partition Scheme (APS) and a PostAdjust mechanism to address information loss and distribution shift introduced by partitioning, the method remains memory-efficient, is compatible with arbitrary message-passing operators, and achieves state-of-the-art performance on multiple heterogeneous graph benchmarks.

Enhancing Logical Expressiveness in GNNs via Path-Neighbor Aggregation

PN-GNN aggregates neighbor node embeddings along reasoning paths on top of conditional message passing, enhancing the logical rule expressiveness of GNNs (strictly beyond C-GNN) in a plug-and-play manner, while avoiding the generalization degradation caused by the labeling trick. The method achieves improvements on both synthetic datasets and real-world knowledge graph reasoning tasks.

Feature-Centric Unsupervised Node Representation Learning Without Homophily Assumption

This paper proposes FUEL, a method that adaptively learns the degree of graph convolution usage through a node-feature-centric clustering scheme, achieving high-quality unsupervised node representations on both homophilic and non-homophilic graphs without any homophily assumption.

Browse all 37 Graph Learning papers →


📈 Time Series (31)

A Theoretical Analysis of Detecting Large Model-Generated Time Series

This work presents the first theoretical framework for detecting time series large model (TSLM)-generated content. By establishing the Contraction Hypothesis, it reveals that TSLM-generated sequences exhibit exponentially decaying uncertainty under recursive forecasting. Based on this insight, the proposed UCE detector achieves an in-distribution AUROC of 0.855 across 32 datasets, substantially outperforming 10 text-detection baselines.

A Unified Shape-Aware Foundation Model for Time Series Classification

This paper proposes UniShape — a foundation model for time series classification that adaptively aggregates multi-scale discriminative subsequences (shapelets) via a shape-aware adapter, and learns transferable shapelet representations at both instance and shape levels through prototype-based contrastive pretraining. With only 3.1M parameters, UniShape achieves state-of-the-art performance on 128 UCR datasets (average accuracy 87.08%) while providing strong classification interpretability.

AirDDE: Multifactor Neural Delay Differential Equations for Air Quality Forecasting

This paper presents AirDDE, the first framework to introduce Neural Delay Differential Equations (NDDE) into air quality forecasting. By incorporating a memory-augmented attention module and a physics-guided delay evolution function, it models delay effects in the continuous-time propagation of pollutants, achieving an average MAE reduction of 8.79% across three datasets.

iTimER: Reconstruction Error-Guided Irregularly Sampled Time Series Representation Learning

This paper proposes iTimER, which leverages the model's own reconstruction error distribution as a learning signal. By estimating the error distribution from observed points and sampling from it to generate pseudo-observations at unobserved timestamps, the method aligns the error distributions of observed and pseudo-observed regions via Wasserstein distance combined with contrastive learning, achieving state-of-the-art performance on classification, interpolation, and forecasting tasks for irregularly sampled time series.

C3RL: Rethinking the Combination of Channel-independence and Channel-mixing from Representation Learning

This paper proposes C3RL, a SimSiam-based contrastive learning framework that treats channel-independence (CI) and channel-mixing (CM) strategies as two transposed views of the same data to construct positive pairs. By jointly optimizing representation learning and forecasting through a Siamese network, C3RL improves the best-performance rate of CI models from 43.6% to 81.4% and CM models from 23.8% to 76.3%.

Coherent Multi-Agent Trajectory Forecasting in Team Sports with CausalTraj

This paper proposes CausalTraj — a temporally causal, likelihood-based multi-agent trajectory forecasting model that autoregressively models spatio-temporal interactions among agents step by step. CausalTraj achieves state-of-the-art results on joint metrics (minJADE/minJFDE) across NBA, basketball, and football datasets while maintaining competitive per-agent accuracy.

CometNet: Contextual Motif-guided Long-term Time Series Forecasting

This paper proposes CometNet, which extracts recurrently occurring "contextual motifs" from the full historical sequence to construct a motif library, and employs a motif-guided MoE architecture to dynamically associate the current window with relevant motifs for prediction. This approach breaks the receptive field bottleneck imposed by limited look-back windows and achieves significant improvements over state-of-the-art methods such as TimeMixer++ and iTransformer on 8 datasets.

Counterfactual Explainable AI (XAI) Method for Deep Learning-Based Multivariate Time Series Classification

This paper proposes CONFETTI, a multi-objective counterfactual explanation method for multivariate time series (MTS) classification. By combining Class Activation Map (CAM)-guided subsequence extraction with NSGA-III multi-objective optimization, CONFETTI achieves an optimal balance among prediction confidence, sparsity, and proximity, outperforming existing methods across 7 UEA benchmark datasets.

DeepBooTS: Dual-Stream Residual Boosting for Drift-Resilient Time-Series Forecasting

This paper proposes DeepBooTS, which leverages bias-variance decomposition theory to demonstrate that weighted ensembling reduces variance and thereby mitigates concept drift. The method introduces a dual-stream residual-decreasing boosting architecture in which each block corrects the residual of the preceding block, achieving an average improvement of 15.8% across multiple datasets.

Detecting the Future: All-at-Once Event Sequence Forecasting with Horizon Matching

This paper proposes DEF (Detection-based Event Forecasting), which draws on the set-matching idea from DETR in object detection and employs the Hungarian algorithm to align predicted and ground-truth event sequences, achieving high-accuracy and high-diversity long-horizon event forecasting with state-of-the-art results on five datasets.
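
The set-matching idea can be sketched as a minimum-cost one-to-one assignment between predicted and ground-truth events. For brevity, the sketch below substitutes a brute-force search over permutations for the Hungarian algorithm (same optimum, practical only for a handful of events). The `(time, type)` event format, the cost function, and `type_penalty` are assumptions for illustration, not DEF's actual matching cost.

```python
from itertools import permutations


def match_events(pred, true, type_penalty=1.0):
    """Return (assignment, total_cost), where assignment[i] is the index of the
    ground-truth event matched to pred[i]. Events are (time, type) tuples.
    Brute force over permutations; equal-length sequences only."""
    if len(pred) != len(true):
        raise ValueError("this sketch only handles equal-length sequences")

    def cost(p, t):
        # Hypothetical cost: time error plus a fixed penalty for a wrong type.
        return abs(p[0] - t[0]) + (type_penalty if p[1] != t[1] else 0.0)

    best = None
    for perm in permutations(range(len(true))):
        total = sum(cost(pred[i], true[j]) for i, j in enumerate(perm))
        if best is None or total < best[1]:
            best = (perm, total)
    return best


predicted = [(2.1, "b"), (0.9, "a"), (3.2, "c")]
ground_truth = [(1.0, "a"), (2.0, "b"), (3.0, "c")]
assignment, total_cost = match_events(predicted, ground_truth)
```

Matching on the whole set makes the loss independent of the order in which events are predicted, which is what allows all events in the horizon to be emitted at once.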

Browse all 31 Time Series papers →


🏥 Medical Imaging (75)

A Disease-Aware Dual-Stage Framework for Chest X-ray Report Generation

This paper proposes a two-stage disease-aware framework that learns 14 Disease-Aware Semantic Tokens (DASTs) corresponding to pathology categories for explicit disease representation. It further employs a Disease-Visual Attention Fusion (DVAF) module and a Dual-Modal Similarity Retrieval (DMSR) mechanism to assist an LLM in generating clinically accurate chest X-ray reports, achieving state-of-the-art performance on three datasets: CheXpert Plus, IU X-Ray, and MIMIC-CXR.

Advancing Safe Mechanical Ventilation Using Offline RL With Hybrid Actions and Clinically Aligned Rewards

This paper addresses the problem of optimizing mechanical ventilation (MV) settings in the ICU via offline RL. A hybrid action space approach (HybridIQL/HybridEDAC) is proposed to avoid distributional shift caused by conventional discretization. Clinically aligned reward functions are introduced based on ventilator-free days (VFD) and physiological safety ranges, with multi-objective optimization used to select the optimal reward. The number of optimizable ventilation parameters is scaled from 2–3 to 6, and HybridIQL achieves the best balance between performance and policy coverage.

Ambiguity-aware Truncated Flow Matching for Ambiguous Medical Image Segmentation

This paper proposes the ATFM framework, which decouples prediction accuracy and diversity into distribution-level and sample-level optimization through a data-hierarchical inference paradigm. By integrating two modules — Gaussian Truncation Representation (GTR) and Segmentation Flow Matching (SFM) — ATFM simultaneously improves prediction accuracy, fidelity, and diversity in ambiguous medical image segmentation.

Bayesian Meta-Analyses Could Be More: A Case Study in Trial of Labor After a Cesarean-section Outcomes and Complications

This paper proposes a hierarchical Bayesian meta-analysis framework that models the unrecorded clinical decision variable (Bishop score) as a truncated latent variable, correcting the biased conclusions arising from omitted confounders in conventional fixed-effect meta-analyses. Applied to the TOLAC (Trial of Labor After Cesarean) setting, the method demonstrates no significant difference between mechanical dilation and Pitocin.

Bidirectional Channel-selective Semantic Interaction for Semi-Supervised Medical Segmentation

This paper proposes the BCSI framework, which employs a channel-selection router to dynamically identify critical feature channels and performs bidirectional channel-level interaction between labeled and unlabeled data streams. Combined with semantic-spatial perturbation-based weak-to-strong consistency learning, BCSI achieves substantial improvements in semi-supervised medical image segmentation.

Bridging Vision and Language for Robust Context-Aware Surgical Point Tracking: The VL-SurgPT Dataset and Benchmark

This paper presents VL-SurgPT, the first large-scale multimodal surgical point tracking dataset combining visual coordinates with textual state descriptions, and proposes TG-SurgPT, a text-guided tracking method that leverages semantic information to significantly improve tracking accuracy and robustness in complex surgical scenes.

CAT-Net: A Cross-Attention Tone Network for Cross-Subject EEG-EMG Fusion Tone Decoding

This paper proposes CAT-Net (Cross-Attention Tone Network), which achieves Mandarin four-tone classification using only 20 EEG channels and 5 EMG channels via spatial-temporal feature extraction branches, a cross-attention fusion mechanism, and domain adversarial training. The model achieves 87.83%/88.08% accuracy under voiced/silent speech conditions and 83.27%/85.10% under cross-subject evaluation, outperforming all 8 baseline methods.

CD-DPE: Dual-Prompt Expert Network Based on Convolutional Dictionary Feature Decoupling for Multi-Contrast MRI Super-Resolution

This paper proposes CD-DPE, a network that employs an iterative Convolutional Dictionary Feature Decoupling Module (CD-FDM) to disentangle multi-contrast MRI features into cross-contrast shared and modality-specific components, followed by a Dual-Prompt Feature Fusion Expert Module (DP-FFEM) for adaptive fusion and reconstruction. CD-DPE surpasses existing state-of-the-art methods on multiple public benchmarks.

Coarse-to-Fine Open-Set Graph Node Classification with Large Language Models

This paper proposes the Coarse-to-Fine Classification (CFC) framework, which leverages the zero-shot reasoning capability of LLMs to supply semantically grounded OOD samples and a potential OOD label space for open-set graph node classification, enabling the model not only to detect OOD nodes but also to classify them into specific unknown categories.

CoCoLIT: ControlNet-Conditioned Latent Image Translation for MRI to Amyloid PET Synthesis

This paper proposes CoCoLIT, a ControlNet-conditioned latent diffusion framework for synthesizing amyloid PET images from structural MRI. Through a Weighted Image Space Loss (WISL) and Latent Averaging Stabilization (LAS), CoCoLIT substantially outperforms existing methods.

Browse all 75 Medical Imaging papers →


🩺 Medical LLM (12)

A Principle-Driven Adaptive Policy for Group Cognitive Stimulation Dialogue for Elderly with Cognitive Impairment

This paper proposes the GCSD system for group Cognitive Stimulation Therapy (CST) targeting elderly individuals with cognitive impairment. The system integrates four modules — multi-speaker context control, dynamic participant state modeling (soft prompt), a cognitive stimulation attention loss, and a multi-dimensional reward policy optimization — built on a fine-tuned Qwen-2.5-3B backbone. Training is conducted on 500+ hours of real Cantonese CST dialogues and 10,000+ simulated conversations. The system achieves a BLEU-4 of 27.93, surpassing GPT-4o and other large models, with an A/B test win rate of 50% versus GPT-4o's 39%.

BiCA: Effective Biomedical Dense Retrieval with Citation-Aware Hard Negatives

This paper proposes a hard negative mining method that constructs a multi-hop semantic graph from PubMed citation chains and performs random walks on it. Using only 20k training samples and minimal fine-tuning steps, small models with 33M and 110M parameters surpass retrieval baselines with billions of parameters on BEIR and LoTTE.

CliCARE: Grounding Large Language Models in Clinical Guidelines for Decision Support over Longitudinal Cancer Electronic Health Records

This paper proposes CliCARE, a framework that transforms unstructured longitudinal cancer EHRs into temporal knowledge graphs (TKGs), aligns them with clinical practice guideline (CPG) knowledge graphs, and provides evidence-grounded clinical decision support for LLMs. An LLM-as-a-Judge evaluation protocol highly correlated with expert assessments is also introduced.

Expert-Guided Prompting and Retrieval-Augmented Generation for Emergency Medical Service Question Answering

This paper constructs EMSQA, the first multiple-choice QA dataset for the emergency medical services domain (24.3K questions, 10 clinical topics, 4 certification levels), and proposes the Expert-CoT and ExpertRAG frameworks to inject domain expertise into LLM reasoning and retrieval, achieving up to 4.59% accuracy improvement over standard RAG.

GEM: Generative Entropy-Guided Preference Modeling for Few-shot Alignment of LLMs

GEM proposes a generative entropy-guided preference modeling approach that achieves efficient LLM alignment in low-resource settings (only 3,000 preference pairs) through cognitive filtering (entropy-based CoT scoring) and the SEGA algorithm (Self-Evaluated Group Advantage policy optimization).

Learning Cell-Aware Hierarchical Multi-Modal Representations for Robust Molecular Modeling

This paper proposes the CHMR framework, which addresses missing biological modalities via structure-aware propagation, and introduces Tree-VQ to model hierarchical dependencies among molecules, cells, and genes. Evaluated on 728 tasks across 9 benchmarks, CHMR achieves a 3.6% improvement in classification and 17.2% in regression, enabling robust cell-aware molecular representation learning.

LUCID: Learning-Enabled Uncertainty-Aware Certification of Stochastic Dynamical Systems

This paper proposes LUCID, the first verification engine capable of providing quantified safety guarantees for black-box stochastic dynamical systems. By combining data-driven control barrier certificates, conditional mean embeddings, and finite Fourier kernel expansions, LUCID reformulates a semi-infinite non-convex optimization problem into a tractable linear program.

Measuring Stability Beyond Accuracy in Small Open-Source Medical Large Language Models for Pediatric Endocrinology

This paper systematically evaluates six small open-source medical LLMs (<10B parameters) in pediatric endocrinology, demonstrating that accuracy alone is insufficient to characterize model reliability: semantically neutral prompt variations lead to significant output shifts (Stuart-Maxwell \(p < 10^{-4}\)), high consistency does not imply correctness, and even differences in CUDA versions can induce statistically significant output distribution changes.

MIRAGE: Scaling Test-Time Inference with Parallel Graph-Retrieval-Augmented Reasoning Chains

This paper proposes MIRAGE, a framework that extends conventional linear reasoning chains into a parallel multi-chain reasoning paradigm. It combines adaptive retrieval from structured medical knowledge graphs (via neighborhood expansion and multi-hop traversal) with cross-chain verification to resolve contradictions, consistently outperforming GPT-4o, ToT, and Search-o1 on three medical QA benchmarks.

Real-Time Trust Verification for Safe Agentic Actions Using TrustBench

This paper proposes TrustBench, a dual-mode framework: (1) Benchmark Mode — combines traditional metrics with LLM-as-a-Judge to evaluate 8 trust dimensions and learns a calibration mapping from agent confidence to actual accuracy; (2) Verification Mode — computes trust scores in real time after an agent formulates an action but before execution, blocking 87% of harmful actions with latency below 200ms, with specialization achieved through domain plugins (medical/financial/QA).

Browse all 12 Medical LLM papers →


🧬 Computational Biology (20)

Apo2Mol: 3D Molecule Generation via Dynamic Pocket-Aware Diffusion Models

This paper proposes Apo2Mol, a diffusion-based all-atom framework that simultaneously generates 3D ligand molecules and corresponding holo (bound-state) pocket conformations from protein apo (unbound) conformations. Trained on 24K experimentally resolved apo-holo structure pairs, it achieves state-of-the-art performance in binding affinity (Vina min −7.86) and drug-likeness.

BeeRNA: Tertiary Structure-Based RNA Inverse Folding Using Artificial Bee Colony

This paper proposes BeeRNA, which applies the Artificial Bee Colony (ABC) optimization algorithm to the RNA tertiary structure inverse folding problem. Through a two-stage fitness evaluation combining base-pair distance pre-screening and RMSD scoring, BeeRNA outperforms deep learning methods gRNAde and RiboDiffusion on short-to-medium-length RNAs (<100 nt).

CellStream: Dynamical Optimal Transport Informed Embeddings for Reconstructing Cellular Trajectories from Snapshots Data

This paper proposes CellStream, a deep learning framework that jointly learns an autoencoder and unbalanced dynamical optimal transport (OT) to simultaneously obtain low-dimensional embeddings and continuous cellular dynamics from discrete-time single-cell snapshot data, achieving significant improvements over existing methods in temporal consistency and velocity consistency.

Constrained Best Arm Identification with Tests for Feasibility

This paper proposes a new framework for best arm identification (BAI) with feasibility constraints, allowing the decision-maker to test arm performance and feasibility constraints separately. An asymptotically optimal algorithm is designed that adaptively eliminates suboptimal arms via whichever criterion—performance or feasibility—is easier to satisfy.

ConSurv: Multimodal Continual Learning for Survival Analysis

This paper proposes ConSurv, the first multimodal continual learning framework for survival analysis. Through two core components — Multi-Stage Mixture-of-Experts (MS-MoE) and Feature-Constrained Replay (FCR) — ConSurv effectively mitigates catastrophic forgetting in settings that integrate whole slide pathology images and genomic data, comprehensively outperforming existing methods on the newly constructed MSAIL benchmark.

Distributional Priors Guided Diffusion for Generating 3D Molecules in Low Data Regimes

This paper proposes GODD (Geometric OOD Diffusion Model), which captures distributional structural priors via an equivariant asymmetric autoencoder to guide the generation process of a diffusion model, enabling models trained on data-rich molecular distributions to generalize to data-scarce distributions, achieving a 12.6% improvement in success rate on OOD structural shift benchmarks.

Dual-Path Knowledge-Augmented Contrastive Alignment Network for Spatially Resolved Transcriptomics

This paper proposes DKAN, a Dual-path Knowledge-Augmented contrastive Alignment Network that integrates semantic information from external gene databases as a cross-modal coordinator. Combined with a unified one-stage contrastive learning paradigm and an adaptive weighting mechanism, DKAN predicts spatially resolved gene expression from H&E-stained whole slide images (WSI), achieving state-of-the-art performance across three public ST datasets.

Efficient Chromosome Parallelization for Precision Medicine Genomic Workflows

Three complementary chromosome-level genomic parallelization scheduling schemes are proposed — static scheduling (optimizing processing order), dynamic scheduling (knapsack-based batching with online RAM prediction), and a symbolic regression RAM predictor — achieving significant reductions in out-of-memory errors and execution time in both simulated and real precision medicine pipelines.

EPO: Diverse and Realistic Protein Ensemble Generation via Energy Preference Optimization

This paper proposes EPO (Energy Preference Optimization), which combines reverse SDE sampling with listwise energy-ranked preference optimization to align a pretrained protein generator with the target Boltzmann distribution using only energy signals. EPO achieves state-of-the-art performance across 9 metrics on three benchmarks (Tetrapeptides, ATLAS, and Fast-Folding), entirely eliminating the need for expensive molecular dynamics (MD) simulations.

Gene Incremental Learning for Single-Cell Transcriptomics

This paper proposes a Gene Incremental Learning (GIL) framework that leverages the permutation-invariant nature of single-cell transcriptomics data to extend the class incremental learning (CIL) paradigm to the token (gene) dimension. Two baseline methods—gene replay and gene distillation—are designed, and a comprehensive benchmark is established with two evaluation protocols: gene-level regression and gene-level classification.

Browse all 20 Computational Biology papers →


⚛️ Physics & Scientific Computing (15)

Adaptive Fidelity Estimation for Quantum Programs with Graph-Guided Noise Awareness

This paper proposes QuFid, a framework that models quantum circuits as directed acyclic graphs (DAGs), characterizes noise propagation via control-flow-aware random walks, quantifies circuit complexity through spectral features of the propagation operator, and achieves adaptive measurement budget allocation — significantly reducing the number of measurement shots while maintaining fidelity accuracy.

Catastrophic Forgetting in Kolmogorov-Arnold Networks

This paper presents the first systematic study of catastrophic forgetting in Kolmogorov-Arnold Networks (KANs). It establishes a theoretical framework linking forgetting to activation support overlap and intrinsic data dimensionality, and proposes KAN-LoRA for continual fine-tuning of language models on knowledge editing tasks.

Data Verification is the Future of Quantum Computing Copilots

This position paper argues that data verification must be elevated from a post-hoc filtering step to a foundational architectural principle in quantum computing AI copilots. Three positions are advanced: (1) verified data is a minimum requirement; (2) prior constraints outperform posterior filtering; (3) scientific domains governed by physical laws require verification-aware architectures. Experiments demonstrate that LLMs trained without verified data achieve at most 79% accuracy on circuit optimization tasks.

Fast 3D Surrogate Modeling for Data Center Thermal Management

This paper develops a vision-based 3D surrogate modeling framework for data centers. Server workloads, fan speeds, and air-conditioning temperature setpoints are encoded as 3D voxel representations, and architectures including 3D CNN U-Net, 3D Fourier Neural Operator, and 3D Vision Transformer are employed for real-time temperature field prediction. The proposed framework achieves inference speeds up to 20,000× faster than traditional CFD solvers while enabling a 7% reduction in energy consumption.

FlashKAT: Understanding and Addressing Performance Bottlenecks in the Kolmogorov-Arnold Transformer

This paper provides an in-depth analysis of the root cause behind KAT (Kolmogorov-Arnold Transformer) training being 123× slower than ViT. The bottleneck is identified not as FLOPs but as memory stalls caused by gradient accumulation during backpropagation (global memory contention from atomic add operations). The proposed FlashKAT restructures GPU kernels to achieve an 86.5× training speedup and reduces gradient rounding errors by nearly an order of magnitude.

Just Few States are Enough: Randomized Sparse Feedback for Stability of Dynamical Systems

This paper proposes a randomized sparse feedback control framework in which the controller accesses only a random subset of the state vector at each time step. Feedback gain matrices and Bernoulli sparsification parameters are jointly designed via LMIs to guarantee asymptotic mean-square stability (AMSS) while minimizing the required number of active sensors. Experiments demonstrate that as few as 0.3% of state components suffice to achieve performance comparable to full-state feedback.

Knowledge-Guided Masked Autoencoder with Linear Spectral Mixing and Spectral-Angle-Aware Reconstruction

This paper proposes KARMA, a framework that embeds the Linear Spectral Mixing Model (LSMM) as a physics constraint within the ViT-MAE decoder, combined with a Spectral Angle Mapper (SAM) loss, to improve reconstruction fidelity and downstream transfer performance for hyperspectral remote sensing imagery.
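
The two physical ingredients named above can be illustrated with a minimal sketch. It assumes the standard noise-free form of the linear spectral mixing model (a pixel spectrum is an abundance-weighted sum of endmember spectra) and the standard spectral angle (the arccosine of the cosine similarity between two spectra). The function names `linear_mix` and `spectral_angle` are hypothetical and are not taken from KARMA; how the paper wires these into the ViT-MAE decoder and loss is not shown here.

```python
import math


def linear_mix(endmembers, abundances):
    """Noise-free LSMM: abundance-weighted sum of endmember spectra.

    endmembers: list of spectra, each a list with one value per band.
    abundances: one weight per endmember (assumed non-negative, summing to 1).
    """
    n_bands = len(endmembers[0])
    return [
        sum(a * spectrum[b] for a, spectrum in zip(abundances, endmembers))
        for b in range(n_bands)
    ]


def spectral_angle(x, y):
    """Angle in radians between two non-zero spectra (0 means same shape)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    # Clamp to guard against rounding slightly outside [-1, 1].
    cosine = max(-1.0, min(1.0, dot / (norm_x * norm_y)))
    return math.acos(cosine)
```

Because the spectral angle ignores overall brightness, a reconstruction that is a scaled copy of the target scores close to zero, which is why it is often paired with a magnitude-sensitive term.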

Learning Fair Representations with Kolmogorov-Arnold Networks

This paper proposes integrating Kolmogorov-Arnold Networks (KAN) into an adversarial debiasing framework, leveraging KAN's spline-based architecture to provide theoretical guarantees of Lipschitz continuity and smoothness. An adaptive \(\lambda\) update mechanism is introduced to dynamically balance fairness and accuracy. The approach achieves significant improvements on fairness metrics on the UCI college admissions dataset.

Phys-Liquid: A Physics-Informed Dataset for Estimating 3D Geometry and Volume of Transparent Deformable Liquids

This work introduces the Phys-Liquid dataset (97,200 physics-simulated images with 3D meshes), which models dynamic deformation of liquids inside transparent containers based on the Navier-Stokes equations, and proposes a four-stage reconstruction pipeline (segmentation → multi-view mask generation → 3D reconstruction → scaling) to achieve high-accuracy liquid geometry and volume estimation in both simulated and real-world scenes.

PhysicsCorrect: A Training-Free Approach for Stable Neural PDE Simulations

This paper proposes PhysicsCorrect, a training-free correction framework that models PDE residual correction as a linearized inverse problem and precomputes a cached pseudoinverse. At inference time, it achieves up to 100× error reduction with less than 5% computational overhead, and is applicable to arbitrary pretrained neural operators including FNO, UNet, and ViT.

Browse all 15 Physics & Scientific Computing papers →


🌍 Earth Science (2)

MdaIF: Robust One-Stop Multi-Degradation-Aware Image Fusion with Language-Driven Semantics

This paper proposes MdaIF, a framework that leverages a vision-language model (VLM) to extract degradation-aware semantic priors for guiding mixture-of-experts (MoE) routing and channel attention modulation, enabling one-stop infrared-visible image fusion across multiple degradation scenarios without requiring degradation-type annotations.

RENEW: Risk- and Energy-Aware Navigation in Dynamic Waterways

This paper proposes RENEW, a global path planner for autonomous surface vessels (ASVs) operating in environments with dynamic water currents. It introduces a unified risk- and energy-aware strategy via adaptive no-go zone identification, best-effort contingency planning, and a hierarchical architecture based on Constrained Delaunay Triangulation (CDT), achieving zero collisions in emergency maneuver tests.


📡 Signal & Communications (3)

Balancing Multimodal Domain Generalization via Gradient Modulation and Projection

This paper proposes a Gradient Modulation Projection (GMP) strategy that addresses inter-modality optimization imbalance and inter-task gradient conflicts in multimodal domain generalization (MMDG) through two components: Inter-modality Gradient Decoupled Modulation (IGDM) and Conflict-Adaptive Gradient Projection (CAGP), achieving state-of-the-art performance on multiple benchmarks.

Task Aware Modulation Using Representation Learning for Upscaling of Terrestrial Carbon Fluxes

This paper proposes TAM-RL, a framework that formulates terrestrial carbon flux upscaling as a zero-shot regression transfer learning problem. By combining a BiLSTM task encoder with FiLM modulation and a knowledge-guided loss derived from the carbon balance equation, the method achieves a 9.6% reduction in GPP RMSE and a 43.8% improvement in NEE R² over FLUXCOM-X-BASE across 150+ flux tower sites.
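
FiLM modulation, mentioned above, can be sketched in a few lines: a conditioning vector (here, a task embedding) is mapped to a per-feature scale `gamma` and shift `beta`, which are then applied to the features. This is a generic illustration of the FiLM operation only; the helper names, the single linear layers, and the weight shapes are assumptions, and the BiLSTM encoder and knowledge-guided loss of TAM-RL are not modeled.

```python
def linear(vec, weights, bias):
    """Plain dense layer: one output per row of `weights`."""
    return [
        sum(w * v for w, v in zip(row, vec)) + b
        for row, b in zip(weights, bias)
    ]


def film(features, task_embedding, w_gamma, b_gamma, w_beta, b_beta):
    """Feature-wise linear modulation: gamma * features + beta.

    gamma and beta are produced from the task embedding by two
    hypothetical linear layers (an assumption for this sketch).
    """
    gamma = linear(task_embedding, w_gamma, b_gamma)
    beta = linear(task_embedding, w_beta, b_beta)
    return [g * h + b for h, g, b in zip(features, gamma, beta)]
```

With a one-dimensional task embedding `[1.0]`, scale weights `[[2.0], [0.5]]`, and shift weights `[[0.0], [1.0]]` with biases `[0.1, 0.0]`, the features `[1.0, 2.0]` are mapped to approximately `[2.1, 2.0]`.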

Text-Guided Channel Perturbation and Pretrained Knowledge Integration for Unified Multi-Modality Image Fusion

This paper proposes UP-Fusion, a unified multi-modality image fusion framework comprising three modules — Semantic-aware Channel Pruning Module (SCPM), Geometric Affine Modulation (GAM), and CLIP Text-guided Channel Perturbation Module (TCPM) — that employs a single set of weights (trained solely on infrared-visible data) to simultaneously handle both IVIF and medical image fusion tasks, achieving state-of-the-art performance on both.


👥 Social Computing (10)

Argumentative Debates for Transparent Bias Detection

This paper proposes ABIDE (Argumentative BIas Detection by DEbate), which constructs Quantitative Bipolar Argumentation Frameworks (QBAFs) via neighborhood-based argument schemes, models the bias detection process as a structured debate, enables transparent bias reasoning from individual neighborhoods to the global level, and formally proves the correspondence between QBAF semantics and the expected behavior of bias detection.

Bias Association Discovery Framework for Open-Ended LLM Generations

This paper proposes the Bias Association Discovery Framework (BADF), which systematically extracts both known and unknown bias associations between demographic identities and descriptive concepts from LLM open-ended story generation, overcoming the limitation of prior methods that rely on predefined bias concepts.

Cross-modal Prompting for Balanced Incomplete Multi-modal Emotion Recognition

This paper proposes Cross-modal Prompting (ComP), which addresses the modality imbalance problem in incomplete multi-modal emotion recognition (IMER) via progressive prompt generation, cross-modal knowledge propagation, and a dynamic scheduler, achieving state-of-the-art performance across 4 datasets and 7 missing rates.

Fact2Fiction: Targeted Poisoning Attack to Agentic Fact-checking System

This paper proposes Fact2Fiction, the first poisoning attack framework targeting agentic fact-checking systems (e.g., DEFAME, InFact). It employs a Planner Agent to simulate claim decomposition and generate sub-questions, reverse-engineers key reasoning points from system justifications to craft targeted malicious evidence, and allocates the poisoning budget according to sub-claim importance. At a poisoning rate of only 1%, Fact2Fiction achieves 8.9%–21.2% higher attack success rate (ASR) than the state-of-the-art PoisonedRAG.

FactGuard: Event-Centric and Commonsense-Guided Fake News Detection

This paper proposes FactGuard, a framework that leverages LLMs to extract event-centric content (with style removed) and generate commonsense rationales. A Rationale Usability Evaluator dynamically assesses the reliability of LLM suggestions. Knowledge distillation yields a lightweight variant, FactGuard-D, that operates without LLM inference, achieving both robustness and efficiency in fake news detection.

From Imitation to Discrimination: Toward A Generalized Curriculum Advantage Mechanism Enhancing Cross-Domain Reasoning Tasks

This paper proposes CAPO (Curriculum Advantage Policy Optimization), an adaptive curriculum mechanism based on advantage signals. Through a two-phase strategy — first imitation (using only positive-advantage samples) then discrimination (introducing negative signals) — CAPO stably and significantly improves LLM performance on mathematical reasoning and multimodal GUI reasoning tasks.

Multi-modal Dynamic Proxy Learning for Personalized Multiple Clustering

This paper proposes the Multi-DProxy framework, which leverages learnable textual proxies for personalized multiple clustering through three key innovations: gated cross-modal fusion, dual-constraint proxy optimization, and dynamic candidate management, achieving state-of-the-art performance on all public benchmarks.

Reasoning About the Unsaid: Misinformation Detection with Omission-Aware Graph Inference

This paper proposes OmiGraph, the first omission-aware misinformation detection framework. By constructing omission-aware graphs, leveraging LLMs to reason about omission intent, and employing omission-guided message passing and aggregation mechanisms, OmiGraph extracts deception patterns from "what is unsaid," achieving average gains of +5.4% F1 and +5.3% ACC on bilingual datasets.

SceneJailEval: A Scenario-Adaptive Multi-Dimensional Framework for Jailbreak Evaluation

This paper proposes SceneJailEval, a scenario-adaptive multi-dimensional jailbreak evaluation framework that defines 14 jailbreak scenarios and 10 evaluation dimensions. Through a pipeline of scenario classification → dynamic dimension selection → multi-dimensional detection → weighted harm scoring, it achieves F1 of 0.917 on a self-constructed dataset (surpassing SOTA by 6%) and 0.995 on JBB (surpassing SOTA by 3%), while supporting harm severity quantification beyond binary classification.

T2Agent: A Tool-augmented Multimodal Misinformation Detection Agent with Monte Carlo Tree Search

This paper proposes T2Agent, a misinformation detection agent integrating an extensible toolset with Monte Carlo Tree Search (MCTS). By decomposing detection into sub-tasks targeting distinct forgery sources via a multi-source verification mechanism, T2Agent achieves a new state of the art on MMfakebench, improving the accuracy of the baseline MMDAgent by 28.7% using GPT-4o as the backbone.


🛡️ AI Safety (45)

Alternative Fairness and Accuracy Optimization in Criminal Justice

This paper provides a systematic review of three dimensions of algorithmic fairness (group fairness, individual fairness, and procedural fairness), proposes an improved group fairness optimization formulation based on tolerance constraints, and constructs a "Three Pillars of Fairness" deployment framework for public decision-making systems.

An Improved Privacy and Utility Analysis of Differentially Private SGD with Bounded Domain and Smooth Losses

Under the sole assumption of \(L\)-smoothness (without convexity), this paper derives tighter closed-form RDP privacy bounds for DPSGD and, for the first time, provides a complete convergence/utility analysis in the bounded-domain setting, revealing that a smaller parameter domain diameter simultaneously improves both privacy and utility.

An Information Theoretic Evaluation Metric for Strong Unlearning

This paper exposes a fundamental flaw in existing black-box unlearning evaluation metrics (MIA, JSD, etc.)—modifying only the final classification head is sufficient to satisfy all black-box metrics while intermediate layers fully retain information about the forget set. The paper proposes IDI, a white-box metric that quantifies unlearning effectiveness by estimating, via InfoNCE, the mutual information between each layer's representations and the forget labels. It further proposes COLA, an unlearning method that achieves IDI scores approaching Retrain on CIFAR-10/100 and ImageNet-1K.

Angular Gradient Sign Method: Uncovering Vulnerabilities in Hyperbolic Networks

This paper proposes the Angular Gradient Sign Method (AGSM), which decomposes gradients in hyperbolic space into radial (hierarchical depth) and angular (semantic) components, applying perturbations exclusively along the angular direction to generate adversarial examples. AGSM achieves 5–13% greater accuracy degradation than standard FGSM/PGD on image classification and cross-modal retrieval tasks.

Authority Backdoor: A Certifiable Backdoor Mechanism for Authoring DNNs

This paper proposes Authority Backdoor, which embeds hardware fingerprints as backdoor triggers into DNNs so that models function correctly only on authorized devices, and achieves certifiable robustness against adaptive trigger reverse-engineering attacks via randomized smoothing.

Breaking the Adversarial Robustness-Performance Trade-off in Text Classification via Manifold Purification

This paper proposes the Manifold-Correcting Causal Flow (MC²F) framework, which employs a Stratified Riemannian Continuous Normalizing Flow (SR-CNF) to learn the manifold density of clean data embeddings for adversarial example detection, and subsequently applies a Geodesic Purification Solver to project detected adversarial embeddings back onto the clean manifold along geodesic paths. MC²F comprehensively surpasses state-of-the-art methods in adversarial robustness across SST-2, AGNews, and YELP benchmarks, while incurring no loss—and even achieving marginal gains—in clean accuracy.

Breaking the Dyadic Barrier: Rethinking Fairness in Link Prediction Beyond Demographic Parity

This paper identifies three fundamental flaws in dyadic fairness and Demographic Parity (ΔDP) for link prediction—insufficient GNN expressiveness, subgroup bias masking, and ranking insensitivity—and proposes a ranking-aware fairness metric based on NDKL and a post-processing algorithm MORAL, achieving state-of-the-art fairness–utility trade-offs across six datasets.
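
To make the ranking-aware idea concrete, here is a minimal sketch of NDKL (normalized discounted cumulative KL divergence) as it is commonly defined for fair ranking: at each prefix of the ranked list, the group distribution is compared with a target distribution via KL divergence, and the divergences are averaged with a logarithmic position discount. The function names, the use of string group labels, and the choice of target distribution are assumptions for illustration; the paper's exact variant for link prediction may differ.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) with natural log; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def ndkl(ranked_groups, target_dist):
    """NDKL of a non-empty ranked list of group labels.

    ranked_groups: group label of each item, best-ranked first.
    target_dist: dict mapping each group to its desired share (all > 0).
    Returns 0 when every prefix matches the target; larger is less fair.
    """
    groups = sorted(target_dist)
    target = [target_dist[g] for g in groups]
    counts = {g: 0 for g in groups}
    weighted_sum, normalizer = 0.0, 0.0
    for position, group in enumerate(ranked_groups, start=1):
        counts[group] += 1
        prefix_dist = [counts[g] / position for g in groups]
        weight = 1.0 / math.log2(position + 1)  # top positions count more
        weighted_sum += weight * kl_divergence(prefix_dist, target)
        normalizer += weight
    return weighted_sum / normalizer
```

Unlike a parity gap computed over the whole list, this score distinguishes the rankings `a, a, b, b` and `a, b, a, b`, even though both contain the same number of items from each group.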

CoRe-Fed: Bridging Collaborative and Representation Fairness via Federated Embedding Distillation

This paper proposes CoRe-Fed, a framework that simultaneously addresses representation fairness and collaborative fairness in federated learning through two synergistic modules—embedding-level contrastive alignment and contribution-aware aggregation—achieving significant improvements in both fairness and generalization of the global model under heterogeneous data distributions.

Credal Ensemble Distillation for Uncertainty Quantification

This paper proposes the Credal Ensemble Distillation (CED) framework, which distills a deep ensemble (DE) teacher into a single-model student called CREDIT. Rather than predicting a single softmax distribution, CREDIT outputs class probability intervals that define a credal set, achieving superior or comparable uncertainty estimation on OOD detection tasks while substantially reducing inference overhead (from 5× to 1×).
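
The notion of class probability intervals can be illustrated with a small sketch. It assumes, purely for illustration, that interval targets are taken as the per-class minimum and maximum over the softmax outputs of the ensemble members; the summary does not state how CED builds its targets, and the function names below are hypothetical.

```python
def probability_intervals(member_probs):
    """Per-class [lower, upper] bounds over ensemble member predictions.

    member_probs: list of probability vectors, one per ensemble member.
    Assumption: bounds are the element-wise min and max across members.
    """
    n_classes = len(member_probs[0])
    lower = [min(p[c] for p in member_probs) for c in range(n_classes)]
    upper = [max(p[c] for p in member_probs) for c in range(n_classes)]
    return lower, upper


def defines_credal_set(lower, upper, tol=1e-9):
    """Intervals admit at least one distribution iff
    lower <= upper per class and sum(lower) <= 1 <= sum(upper)."""
    ordered = all(lo <= up + tol for lo, up in zip(lower, upper))
    return ordered and sum(lower) <= 1.0 + tol and sum(upper) >= 1.0 - tol


def mean_interval_width(lower, upper):
    """Simple uncertainty proxy: wider intervals mean more disagreement."""
    return sum(up - lo for lo, up in zip(lower, upper)) / len(lower)
```

Under this assumption, members that agree exactly produce zero-width intervals, while disagreement (for example on OOD inputs) widens them.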

DeepTracer: Tracing Stolen Model via Deep Coupled Watermarks

This paper proposes DeepTracer, a robust watermarking framework that achieves deep coupling between the watermark task and the main task through three components: adaptive source-class selection (K-Means clustering for feature space coverage), a same-class coupling loss (aligning watermark samples with target-class samples in output space), and two-stage key sample filtering. Under 6 model stealing attacks (including hard-label and data-free settings), the watermark success rate averages 77–100%, substantially outperforming existing methods.

Browse all 45 AI Safety papers →


📂 Others (117)

A Fast Heuristic Search Approach for Energy-Optimal Profile Routing for Electric Vehicles

This paper proposes Pr-A*, a label-setting method based on multi-objective A* search for efficiently solving energy-optimal profile routing for electric vehicles (EVs) when the initial state of charge (SoC) is unknown. By using profile dominance pruning, the method avoids the complex profile merge operations required by traditional approaches, achieving performance close to standard A* with known initial SoC on large-scale road networks.

A New Strategy for Verifying Reach-Avoid Specifications in Neural Feedback Systems

This paper proposes FaBRe (Forward and Backward Reachability), a unified framework that, for the first time, develops both over- and under-approximation algorithms for backward reachable sets of ReLU neural network controllers (GSS/ICH/LEB), and integrates them with forward reachability analysis to construct a unified reach-avoid verification framework, aiming to overcome the scalability bottleneck of purely forward analysis.

A Phase Transition for Opinion Dynamics with Competing Biases

This paper models the competition between two opposing forces — external subversive bias and individual stubbornness — on binary opinion spreading over directed random graphs. It proves that the system exhibits a sharp phase transition: when the bias exceeds a critical threshold \(p_c\), the population rapidly reaches a new consensus; below the threshold, the system remains in a long-lived metastable polarized state. The critical point is determined solely by two simple statistics of the degree sequence.

A Topological Rewriting of Tarski's Mereogeometry

This work extends the λ-MM library within the Coq theorem prover to recast Tarski's solid geometry—grounded in Leśniewski's mereology—into a fully formalized system with a complete topological structure. It proves that mereological classes correspond to regular open sets, satisfy Kuratowski's interior axioms, and exhibit the Hausdorff (T2) separation property, thereby providing a unified mereological–geometric–topological theoretical framework for qualitative spatial reasoning.

Align When They Want, Complement When They Need! Human-Centered Ensembles for Adaptive Human-AI Collaboration

This paper reveals a fundamental trade-off between complementarity and alignment in human-AI collaboration—no single model can simultaneously optimize both objectives. It proposes an adaptive AI ensemble framework that dynamically switches between an alignment model and a complementarity model via a Rational Routing Shortcut (RRS) mechanism, achieving up to 9% improvement in team accuracy over standard AI.

An Epistemic Perspective on Agent Awareness

This paper is the first to treat agent awareness as a form of knowledge, distinguishing two awareness modalities — de re (concerning physical objects) and de dicto (concerning concepts/descriptions) — and proposes a sound and complete logical system grounded in 2D semantics to characterize the interaction between these two modalities and the standard "factual knowledge" modality.

Approximation Algorithm for Constrained k-Center Clustering: A Local Search Approach

This paper studies the k-center clustering problem with instance-level cannot-link (CL) and must-link (ML) constraints. It proposes a local search framework based on a dominating matching set (DMS) reduction, and, under the disjoint CL sets condition, is the first to achieve the optimal approximation ratio of 2 via local search—resolving an open problem in the field.

Area-Optimal Control Strategies for Heterogeneous Multi-Agent Pursuit

This paper studies pursuit-evasion games with heterogeneous speeds involving multiple pursuers and a single evader. The evader's safe reachable set is defined as the intersection of Apollonius circles for all pursuer–evader pairs. The capture strategy is modeled as a zero-sum game in which pursuers minimize and the evader maximizes the area of this intersection. Closed-form instantaneous optimal heading control laws are derived, and simulations verify that pursuers can systematically shrink the safe region to guarantee capture.
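
The geometry behind the safe reachable set can be sketched as follows. For an evader at E and a pursuer at P with speed ratio k = v_evader / v_pursuer, the Apollonius circle is the locus |X − E| = k·|X − P|, with center (E − k²P)/(1 − k²) and radius k·|E − P|/(1 − k²). This sketch assumes the evader is slower than every pursuer (0 < k < 1), so the evader reaches interior points first; the function names are hypothetical, and the paper's heading control laws and area computation are not reproduced.

```python
import math


def apollonius_circle(evader, pursuer, speed_ratio):
    """Return (center, radius) of the circle |X - E| = k * |X - P|.

    speed_ratio is k = v_evader / v_pursuer; this sketch assumes 0 < k < 1.
    """
    if not 0.0 < speed_ratio < 1.0:
        raise ValueError("sketch assumes the evader is slower than the pursuer")
    ex, ey = evader
    px, py = pursuer
    k2 = speed_ratio ** 2
    center = ((ex - k2 * px) / (1.0 - k2), (ey - k2 * py) / (1.0 - k2))
    radius = speed_ratio * math.hypot(ex - px, ey - py) / (1.0 - k2)
    return center, radius


def in_safe_set(point, evader, pursuers, speed_ratios):
    """True if `point` lies inside every pursuer's Apollonius circle,
    i.e. the evader can reach it no later than any pursuer."""
    for pursuer, k in zip(pursuers, speed_ratios):
        (cx, cy), radius = apollonius_circle(evader, pursuer, k)
        if math.hypot(point[0] - cx, point[1] - cy) > radius:
            return False
    return True
```

For example, with the evader at the origin, a pursuer at (3, 0), and k = 0.5, the circle has center (−1, 0) and radius 2, so the boundary point (1, 0) is one unit from the evader and two units from the pursuer.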

Automated Reproducibility Has a Problem Statement Problem

This paper proposes a formalized problem definition of reproducibility grounded in the scientific method, representing empirical AI research as a hypothesis–experiment–interpretation graph structure. An LLM is used to automatically extract this structure from 20 papers, and the extracted results are validated through review by the original authors.

Autonomous Concept Drift Threshold Determination

This paper proves that no fixed threshold can be optimal across all scenarios and that dynamic thresholds strictly dominate static ones. It proposes the DTD algorithm, which starts a three-model comparison phase whenever a drift detection signal is triggered and adaptively adjusts the detection threshold based on the performance of the candidate models.

Browse all 117 Others papers →